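# Classification Analysis

## Binary Customer Churn

A marketing agency that places ads for its clients' websites has noticed that its customer churn rate is quite high.<br>
The agency immediately assigned account managers, but it wants a machine learning model that predicts which customers will churn, so that account managers can be placed first with the customers most likely to leave.<br>
Build a classification model that classifies whether a customer is a potential churner.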

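The data is stored in customer_churn.csv. The columns are defined as follows.
- Names : Name of the latest contact at the company
- Age : Age of the customer
- Total_Purchase : Total ads purchased
- Account_Manager : Binary; 0 = no account manager, 1 = account manager assigned
- Years : Number of years as a customer
- Num_Sites : Number of websites that use the service
- Onboard_date : Date on which the latest contact was onboarded
- Location : Address of the client's office
- Company : Name of the client company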

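After building the model, produce predictions for the new data stored in the new_customers.csv file provided by the client. <br>
The client wants to use these predictions to find out which customers need attention.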

In [124]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
from sklearn.preprocessing import RobustScaler # Scaler: (x - median) / IQR, which makes it less sensitive to outliers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

### Loading the data

In [125]:
# Churn = 1: the customer churned; 0: the customer did not churn
# Use new_customers.csv as the final test data

In [126]:
df = pd.read_csv("./data/customer_churn.csv")
df.head()

| | Names | Age | Total_Purchase | Account_Manager | Years | Num_Sites | Onboard_date | Location | Company | Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cameron Williams | 42.0 | 11066.8 | 0 | 7.22 | 8.0 | 2013-08-30 07:00:40 | 10265 Elizabeth Mission Barkerburgh, AK 89518 | Harvey LLC | 1 |
| 1 | Kevin Mueller | 41.0 | 11916.22 | 0 | 6.5 | 11.0 | 2013-08-13 00:38:46 | 6157 Frank Gardens Suite 019 Carloshaven, RI 1... | Wilson PLC | 1 |
| 2 | Eric Lozano | 38.0 | 12884.75 | 0 | 6.67 | 12.0 | 2016-06-29 06:20:07 | 1331 Keith Court Alyssahaven, DE 90114 | Miller, Johnson and Wallace | 1 |
| 3 | Phillip White | 42.0 | 8010.76 | 0 | 6.71 | 10.0 | 2014-04-22 12:43:12 | 13120 Daniel Mount Angelabury, WY 30645-4695 | Smith Inc | 1 |
| 4 | Cynthia Norton | 37.0 | 9191.58 | 0 | 5.56 | 9.0 | 2016-01-19 15:31:15 | 765 Tricia Row Karenshire, MH 71730 | Love-Jones | 1 |


### Inspecting the data

In [127]:
df.shape

(900, 10)

In [128]:
df.dtypes

Names               object
Age                float64
Total_Purchase     float64
Account_Manager      int64
Years              float64
Num_Sites          float64
Onboard_date        object
Location            object
Company             object
Churn                int64
dtype: object

In [129]:
df.isna().sum()

Names              0
Age                0
Total_Purchase     0
Account_Manager    0
Years              0
Num_Sites          0
Onboard_date       0
Location           0
Company            0
Churn              0
dtype: int64

### Preprocessing for machine learning

We will use only the numeric columns. Account_Manager is easy to handle, so it can be included in training, but because managers were assigned at random it may not carry much signal.
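
As a side note, the numeric columns can also be picked by dtype instead of listing the text columns to drop. The sketch below uses a tiny hypothetical frame (`demo` is not the real data) that only mimics the kinds of columns in customer_churn.csv.

```python
import pandas as pd

# Tiny hypothetical frame; it only mimics the column types of customer_churn.csv.
demo = pd.DataFrame({
    "Names": ["A", "B"],
    "Age": [42.0, 41.0],
    "Account_Manager": [0, 1],
    "Onboard_date": ["2013-08-30 07:00:40", "2013-08-13 00:38:46"],
    "Churn": [1, 0],
})

# Keep numeric columns only; text columns are excluded.
numeric = demo.select_dtypes(include="number")
print(list(numeric.columns))
```

Note that this keeps the target column `Churn` as well, so it still has to be separated from the features afterwards.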

In [130]:
df = df.drop(["Names","Onboard_date","Location","Company"], axis = 1)

In [131]:
df.head()

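| | Age | Total_Purchase | Account_Manager | Years | Num_Sites | Churn |
|---|---|---|---|---|---|---|
| 0 | 42.0 | 11066.8 | 0 | 7.22 | 8.0 | 1 |
| 1 | 41.0 | 11916.22 | 0 | 6.5 | 11.0 | 1 |
| 2 | 38.0 | 12884.75 | 0 | 6.67 | 12.0 | 1 |
| 3 | 42.0 | 8010.76 | 0 | 6.71 | 10.0 | 1 |
| 4 | 37.0 | 9191.58 | 0 | 5.56 | 9.0 | 1 |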


In [132]:
df["Churn"].unique()

array([1, 0])

# Split into independent and dependent variables

In [133]:
x = df.drop(["Churn"], axis = 1)
y = df["Churn"]

# Before undersampling

In [134]:
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size = 0.3,
    stratify = y,
    random_state = 666
)

In [135]:
len(x_train), len(x_test)

(630, 270)

In [136]:
# Fit the logistic regression model
logi = LogisticRegression()
logi.fit(x_train, y_train)

pred = logi.predict(x_test)

obunlu = confusion_matrix(y_test, pred)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
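
The warning above suggests either raising `max_iter` or scaling the features. The notebook imports `RobustScaler` but never applies it, so here is a minimal sketch of one way to combine the two. It runs on synthetic stand-in data (`X_demo`, `y_demo` and the labelling rule are assumptions, not the churn data), with one small-scale and one large-scale feature, similar in spirit to Num_Sites and Total_Purchase.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (NOT the churn data).
rng = np.random.default_rng(0)
n = 400
sites = rng.normal(8, 2, n)              # small-scale feature
purchase = rng.normal(10000, 2500, n)    # large-scale feature
X_demo = np.column_stack([sites, purchase])
# Hypothetical labelling rule: more sites -> more likely to churn.
y_demo = (sites + rng.normal(0, 1.5, n) > 10).astype(int)

# RobustScaler centers each feature on its median and divides by its IQR;
# the logistic regression is then fitted on the scaled features.
pipe = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# Number of solver iterations actually used (well below max_iter if converged).
n_iter = pipe.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
```

In the notebook the same pipeline could be fitted with `pipe.fit(x_train, y_train)`. Because the scaler sits inside the pipeline, its median and IQR are learned from the training split only.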


In [137]:
obunlu

array([[222,   3],
       [ 37,   8]])

In [138]:
print(accuracy_score(y_test, pred))

0.8518518518518519
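
Accuracy alone can hide how few churners the model actually catches. The sketch below recomputes accuracy from the confusion matrix printed above and adds precision, recall, F1, and the accuracy of always predicting "no churn". It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix reported above (rows = actual, columns = predicted).
cm = np.array([[222, 3],
               [37, 8]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
baseline = (tn + fp) / cm.sum()   # accuracy of always predicting "no churn"
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")   # 0.852 vs 0.833
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")  # 0.727, 0.178, 0.286
```

The model finds only 8 of the 45 churners in the test set, and its accuracy is barely above the majority-class baseline. That is one reason to look at the class balance before trusting this score.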


In [139]:
model = sm.Logit(y_train, x_train)
results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.402559
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | Churn | No. Observations: | 630 |
| Model: | Logit | Df Residuals: | 625 |
| Method: | MLE | Df Model: | 4 |
| Date: | Thu, 20 Nov 2025 | Pseudo R-squ.: | 0.1065 |
| Time: | 17:40:53 | Log-Likelihood: | -253.61 |
| converged: | True | LL-Null: | -283.85 |
| Covariance Type: | nonrobust | LLR p-value: | 2.296e-12 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Age | -0.0984 | 0.015 | -6.368 | 0.000 | -0.129 | -0.068 |
| Total_Purchase | -0.0002 | 4.47e-05 | -4.360 | 0.000 | -0.000 | -0.000 |
| Account_Manager | 0.3020 | 0.224 | 1.349 | 0.177 | -0.137 | 0.741 |
| Years | 0.0046 | 0.083 | 0.055 | 0.956 | -0.159 | 0.168 |
| Num_Sites | 0.4769 | 0.063 | 7.516 | 0.000 | 0.353 | 0.601 |


In [157]:
# Refit the model; by execution order (In [156] ran just before In [157]) this uses the split without Years
logi = LogisticRegression()
logi.fit(x_train, y_train)

pred = logi.predict(x_test)

obunlu = confusion_matrix(y_test, pred)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [158]:
obunlu

array([[211,  14],
       [ 21,  24]])

In [159]:
print(accuracy_score(y_test, pred))

0.8703703703703703


# Drop the predictor with a non-significant p-value: remove Years

In [156]:
x_train, x_test, y_train, y_test = train_test_split(
    x.drop(["Years"],axis=1),
    y,
    test_size = 0.3,
    stratify = y,
    random_state = 666
)

# After undersampling

In [140]:
# What is the class balance of Churn?
df["Churn"].value_counts()

# The classes are imbalanced, so undersampling is needed;
# otherwise the model will be biased toward the majority class.

Churn
0    750
1    150
Name: count, dtype: int64

In [141]:
# Undersample the majority class to 150 rows to match the minority class
stay_index = df[df["Churn"] == 0].sample(150, random_state = 666).index.tolist() # customers who stayed
out_index = df[df["Churn"] == 1].index.tolist() # customers who churned

In [142]:
# Combine both groups into a 50/50 sample
random_index = stay_index + out_index

In [143]:
stay_index[0]
x.iloc[0]

Age                   42.00
Total_Purchase     11066.80
Account_Manager        0.00
Years                  7.22
Num_Sites              8.00
Name: 0, dtype: float64

In [144]:
x.head()

| | Age | Total_Purchase | Account_Manager | Years | Num_Sites |
|---|---|---|---|---|---|
| 0 | 42.0 | 11066.8 | 0 | 7.22 | 8.0 |
| 1 | 41.0 | 11916.22 | 0 | 6.5 | 11.0 |
| 2 | 38.0 | 12884.75 | 0 | 6.67 | 12.0 |
| 3 | 42.0 | 8010.76 | 0 | 6.71 | 10.0 |
| 4 | 37.0 | 9191.58 | 0 | 5.56 | 9.0 |


In [145]:
# random_index holds index labels, so select by label with .loc
sample_x = x.loc[random_index]
sample_y = y.loc[random_index]

In [146]:
sample_y.count()

np.int64(300)

In [147]:
len(sample_x), sample_y.count()

(300, np.int64(300))

In [160]:
x_train, x_test, y_train, y_test = train_test_split(
    sample_x,
    sample_y,
    test_size = 0.3,
    stratify = sample_y,
    random_state = 666
)
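
One caveat with the split above: the undersampling happened before the split, so the test set is also 50/50 and no longer reflects the real 750 / 150 balance. A common alternative, sketched here as an assumption rather than as the notebook's method, is to split first and undersample only the training part. `demo` is a synthetic stand-in with the same class counts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same 750 / 150 class balance (NOT the churn data).
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "Num_Sites": rng.normal(8.5, 1.8, 900),
    "Churn": np.repeat([0, 1], [750, 150]),
})

# 1) Split first so the test set keeps the original class balance.
train, test = train_test_split(
    demo, test_size=0.3, stratify=demo["Churn"], random_state=666
)

# 2) Undersample the majority class inside the training set only.
n_minority = train["Churn"].value_counts().min()
train_balanced = train.groupby("Churn").sample(n=n_minority, random_state=666)

print(train_balanced["Churn"].value_counts().to_dict())
print(test["Churn"].value_counts().to_dict())
```

A model trained on `train_balanced` can then be evaluated on `test`, whose 225 / 45 split matches what the model would see in practice.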

# Model evaluation 2

In [149]:
# Fit the logistic regression model on the undersampled data
logi2 = LogisticRegression()
logi2.fit(x_train, y_train)

pred = logi2.predict(x_test)

obunlu = confusion_matrix(y_test, pred)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [150]:
obunlu

array([[38,  7],
       [ 8, 37]])

In [151]:
print(accuracy_score(y_test, pred))

0.8333333333333334


In [152]:
# Print the statsmodels summary report for these variables

In [153]:
model = sm.Logit(y_train, x_train)
results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.570476
         Iterations 5


| | | | |
|---|---|---|---|
| Dep. Variable: | Churn | No. Observations: | 210 |
| Model: | Logit | Df Residuals: | 205 |
| Method: | MLE | Df Model: | 4 |
| Date: | Thu, 20 Nov 2025 | Pseudo R-squ.: | 0.177 |
| Time: | 17:40:54 | Log-Likelihood: | -119.8 |
| converged: | True | LL-Null: | -145.56 |
| Covariance Type: | nonrobust | LLR p-value: | 1.736e-10 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Age | -0.0872 | 0.022 | -4.006 | 0.000 | -0.130 | -0.045 |
| Total_Purchase | -0.0001 | 6e-05 | -2.355 | 0.019 | -0.000 | -2.37e-05 |
| Account_Manager | 0.0621 | 0.315 | 0.197 | 0.844 | -0.555 | 0.680 |
| Years | 0.0265 | 0.118 | 0.224 | 0.823 | -0.205 | 0.258 |
| Num_Sites | 0.5329 | 0.093 | 5.732 | 0.000 | 0.351 | 0.715 |


In [163]:
# Account_Manager and Years are not significant in the summary above; drop Years and re-split
x_train, x_test, y_train, y_test = train_test_split(
    sample_x.drop(["Years"],axis=1),
    sample_y,
    test_size = 0.3,
    stratify = sample_y,
    random_state = 666
)

In [164]:
# Fit the logistic regression model without Years
logi3 = LogisticRegression()
logi3.fit(x_train, y_train)

pred = logi3.predict(x_test)

obunlu = confusion_matrix(y_test, pred)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [165]:
obunlu

array([[36,  9],
       [ 7, 38]])

In [166]:
model = sm.Logit(y_train, x_train)
results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.570595
         Iterations 5


| | | | |
|---|---|---|---|
| Dep. Variable: | Churn | No. Observations: | 210 |
| Model: | Logit | Df Residuals: | 206 |
| Method: | MLE | Df Model: | 3 |
| Date: | Thu, 20 Nov 2025 | Pseudo R-squ.: | 0.1768 |
| Time: | 17:48:26 | Log-Likelihood: | -119.82 |
| converged: | True | LL-Null: | -145.56 |
| Covariance Type: | nonrobust | LLR p-value: | 3.881e-11 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Age | -0.0860 | 0.021 | -4.084 | 0.000 | -0.127 | -0.045 |
| Total_Purchase | -0.0001 | 5.73e-05 | -2.398 | 0.016 | -0.000 | -2.51e-05 |
| Account_Manager | 0.0598 | 0.315 | 0.190 | 0.849 | -0.557 | 0.677 |
| Num_Sites | 0.5388 | 0.089 | 6.035 | 0.000 | 0.364 | 0.714 |
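
Whether dropping Years costs anything can be checked with a likelihood-ratio test on the two log-likelihoods reported for the undersampled data (-119.80 with Years, -119.82 without). This is a rough sketch: the inputs are the rounded values from the summaries, so the result is approximate. For one degree of freedom the chi-square tail probability equals `erfc(sqrt(x / 2))`, which avoids needing SciPy.

```python
import math

# Log-likelihoods copied (rounded) from the two summaries on the undersampled data.
ll_full = -119.80      # Age, Total_Purchase, Account_Manager, Years, Num_Sites
ll_reduced = -119.82   # same model without Years

# Likelihood-ratio statistic; chi-square with 1 degree of freedom under H0.
lr_stat = 2 * (ll_full - ll_reduced)

# Chi-square survival function for 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))
print(f"LR = {lr_stat:.3f}, p = {p_value:.2f}")
```

The p-value comes out at roughly 0.84, in line with the Wald p-value of 0.823 reported for Years, so removing it does not measurably worsen the fit.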


In [167]:
print(accuracy_score(y_test, pred))

0.8222222222222222
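
The task also asks for predictions on new_customers.csv, which the notebook does not reach. The sketch below shows one possible way to score new rows and rank them by churn probability. Everything here is an assumption: `make_customers` generates synthetic stand-in rows, the labelling rule is hypothetical, and the real file is assumed to contain the same four feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

features = ["Age", "Total_Purchase", "Account_Manager", "Num_Sites"]

def make_customers(n, seed):
    """Generate synthetic stand-in rows; NOT the real customer data."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "Age": rng.normal(42, 6, n),
        "Total_Purchase": rng.normal(10000, 2400, n),
        "Account_Manager": rng.integers(0, 2, n),
        "Num_Sites": rng.normal(8.5, 1.8, n),
    })

train = make_customers(300, seed=0)
# Hypothetical labelling rule, used only so there is something to fit.
train["Churn"] = (train["Num_Sites"] > 10).astype(int)

model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
model.fit(train[features], train["Churn"])

# In the notebook this would instead be loaded from new_customers.csv,
# assuming that file has the same feature columns.
new_customers = make_customers(6, seed=1)
# Column 1 of predict_proba is the probability of class 1 (churn).
new_customers["churn_proba"] = model.predict_proba(new_customers[features])[:, 1]

# Highest-risk customers first, so account managers can be assigned in order.
ranked = new_customers.sort_values("churn_proba", ascending=False)
print(ranked[["Num_Sites", "churn_proba"]])
```

Ranking by `predict_proba` rather than by the hard 0/1 prediction fits the stated goal of assigning account managers to the highest-risk customers first.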
