# Introduction

The Telco customer churn dataset describes a hypothetical telecommunications company that provided home phone and internet services to 7,043 customers in California during the third quarter. It indicates which customers have left, which have stayed, and which have newly signed up.

▼The aim of this project:
To predict whether a customer is likely to churn, based on factors related to their use of the company's services. Such predictions would let the company identify at-risk customers early and address their concerns, with the goal of improving retention and supporting better-informed business decisions.

▼The flow of the analysis
1. Data exploration and visualization
2. Data handling
3. Model training and evaluation

# About the dataset

The dataset we are working with contains information about 7,043 customers of a telecommunications company based in California. The dataset comprises 21 variables, each capturing different aspects of customer behavior and demographics.

Variables

- customerID: Customer's ID
- gender: Gender of the customer
- SeniorCitizen: Whether the customer is a senior citizen (1) or not (0)
- Partner: Whether the customer has a partner (Yes or No)
- Dependents: Whether the customer has dependents (Yes or No)
- tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has phone service (Yes or No)
- MultipleLines: Whether the customer has multiple lines (Yes, No, or No phone service)
- InternetService: Type of internet service (DSL, Fiber optic, or No)
- OnlineSecurity: Whether the customer has online security (Yes, No, or No internet service)
- OnlineBackup: Whether the customer has online backup (Yes, No, or No internet service)
- DeviceProtection: Whether the customer has device protection (Yes, No, or No internet service)
- TechSupport: Whether the customer has tech support (Yes, No, or No internet service)
- StreamingTV: Whether the customer has streaming TV (Yes, No, or No internet service)
- StreamingMovies: Whether the customer has streaming movies (Yes, No, or No internet service)
- Contract: Type of contract (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless billing (Yes or No)
- PaymentMethod: Payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: Monthly charges paid by the customer
- TotalCharges: Total charges paid by the customer
- Churn: Whether the customer has churned (Yes or No)


In [1]:
import pandas as pd

data = pd.read_csv('Telco-Customer-Churn.csv')

data.head(5)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
data.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75
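
`describe()` summarizes only `SeniorCitizen`, `tenure`, and `MonthlyCharges`, which suggests `TotalCharges` was not parsed as a numeric column. A minimal sketch of one way to check and convert it, using a small made-up frame rather than the real file (the blank-string value is an assumption for illustration, not something shown in the output above):

```python
import pandas as pd

# Toy stand-in for the real data; the blank string is a hypothetical bad value.
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# Values that cannot be parsed become NaN instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isna().sum())
print(toy.dtypes)
print("unparseable values:", n_missing)
```

If the real column behaves the same way, the resulting NaN rows would need to be dropped or imputed before `TotalCharges` could be used as a feature.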


In [14]:
data.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [4]:
# average tenure across all customers
round(data.loc[:,'tenure'].mean(),2)

32.37

In [12]:
# average tenure for churned customers
round(data.loc[data['Churn'] == 'Yes', 'tenure'].mean(),2)

17.98

In [13]:
# average tenure for non-churned customers
round(data.loc[data['Churn'] == 'No', 'tenure'].mean(),2)

37.57
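
The two filtered means above can also be computed in one step with `groupby`, which labels each result with its `Churn` value and so makes it harder to mix up the groups. A minimal sketch on a made-up frame that reuses the notebook's column names:

```python
import pandas as pd

# Toy stand-in with the same column names as the notebook's `data`.
toy = pd.DataFrame({
    "Churn": ["Yes", "No", "Yes", "No", "No"],
    "tenure": [2, 34, 4, 45, 41],
})

# One pass over the data: mean tenure per churn label.
tenure_by_churn = toy.groupby("Churn")["tenure"].mean().round(2)
print(tenure_by_churn)
```

On the real data, the equivalent call would be `data.groupby('Churn')['tenure'].mean().round(2)`.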

In [9]:
import matplotlib.style
matplotlib.style.available

['Solarize_Light2',
 '_classic_test_patch',
 '_mpl-gallery',
 '_mpl-gallery-nogrid',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark',
 'seaborn-dark-palette',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'tableau-colorblind10']

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Logistic regression 1

In [15]:
X = data[['SeniorCitizen','tenure','gender','Partner','Dependents','MultipleLines','InternetService',
          'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
          'StreamingMovies','Contract','PaperlessBilling','PaymentMethod']]

y =  data[['Churn']]

X = pd.get_dummies(X, columns = ['gender','Partner','Dependents','MultipleLines','InternetService',
          'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
          'StreamingMovies','Contract','PaperlessBilling','PaymentMethod'])

X.head(5)

Unnamed: 0,SeniorCitizen,tenure,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,MultipleLines_No,MultipleLines_No phone service,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,1,0,0,1,1,0,0,1,...,0,1,0,0,0,1,0,0,1,0
1,0,34,0,1,1,0,1,0,1,0,...,0,0,1,0,1,0,0,0,0,1
2,0,2,0,1,1,0,1,0,1,0,...,0,1,0,0,0,1,0,0,0,1
3,0,45,0,1,1,0,1,0,0,1,...,0,0,1,0,1,0,1,0,0,0
4,0,2,1,0,1,0,1,0,1,0,...,0,1,0,0,0,1,0,0,1,0


In [18]:
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate a logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Get the coefficients for each explanatory variable
coefficients = model.coef_[0]

# Display each explanatory variable and its coefficient
for feature, coef in zip(X_train.columns, coefficients):
    print(f'{feature}: {coef}')

SeniorCitizen: 0.16130830613145875
tenure: -0.032862629352142465
gender_Female: -0.05643024814183436
gender_Male: -0.10804079890113111
Partner_No: -0.11022057880892898
Partner_Yes: -0.05425046823403596
Dependents_No: 0.004502811034152063
Dependents_Yes: -0.16897385807711587
MultipleLines_No: -0.32270185636529386
MultipleLines_No phone service: 0.19376635598272318
MultipleLines_Yes: -0.035535546660394
InternetService_DSL: -0.4392828897963065
InternetService_Fiber optic: 0.43178971554126677
InternetService_No: -0.15697787278792852
OnlineSecurity_No: 0.20313191399267436
OnlineSecurity_No internet service: -0.15697787278792852
OnlineSecurity_Yes: -0.21062508824770887
OnlineBackup_No: 0.06653398346138101
OnlineBackup_No internet service: -0.15697787278792852
OnlineBackup_Yes: -0.07402715771641522
DeviceProtection_No: -0.006628373797748054
DeviceProtection_No internet service: -0.15697787278792852
DeviceProtection_Yes: -0.0008648004572862238
TechSupport_No: 0.16041394771656564
TechSupport_No

  y = column_or_1d(y, warn=True)
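
Logistic-regression coefficients are on the log-odds scale, so exponentiating one gives the multiplicative change in the odds of churn for a one-unit increase in that feature, holding the others fixed. A small worked example using three of the coefficients printed above (values copied from the output; nothing is refit here):

```python
import math

# Coefficients copied from the printed output above (log-odds scale).
coefs = {
    "SeniorCitizen": 0.16130830613145875,
    "tenure": -0.032862629352142465,
    "InternetService_Fiber optic": 0.43178971554126677,
}

# exp(coef) is the odds ratio for a one-unit increase in the feature.
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

Under this model, each additional month of tenure multiplies the odds of churn by roughly 0.97. Because every level of each categorical variable is encoded and `LogisticRegression` applies L2 regularization by default, the individual dummy coefficients should be read with some care.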


In [19]:
# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Display evaluation metrics for each class
report = classification_report(y_test, y_pred)
print(report)

Accuracy: 0.8211497515968772
              precision    recall  f1-score   support

          No       0.86      0.91      0.88      1036
         Yes       0.69      0.58      0.63       373

    accuracy                           0.82      1409
   macro avg       0.78      0.75      0.76      1409
weighted avg       0.81      0.82      0.82      1409



# Logistic regression 2 with standardization

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled

array([[-0.4377492 , -0.46568336,  1.02516569, ..., -0.52765585,
        -0.70964983,  1.84247002],
       [-0.4377492 ,  0.88553679,  1.02516569, ..., -0.52765585,
        -0.70964983, -0.54274967],
       [-0.4377492 , -1.28460467, -0.97545208, ..., -0.52765585,
         1.40914569, -0.54274967],
       ...,
       [-0.4377492 , -0.83419795, -0.97545208, ..., -0.52765585,
         1.40914569, -0.54274967],
       [ 2.28441306, -0.83419795, -0.97545208, ..., -0.52765585,
         1.40914569, -0.54274967],
       [-0.4377492 , -0.26095304, -0.97545208, ...,  1.89517467,
        -0.70964983, -0.54274967]])
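
`StandardScaler` subtracts each column's training mean and divides by its training standard deviation, and the test set is transformed with those same training statistics. A minimal sketch on made-up numbers that checks the result against a manual z-score:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test matrices (made-up values).
X_tr = np.array([[0.0, 1.0], [0.0, 34.0], [1.0, 2.0], [0.0, 45.0]])
X_te = np.array([[1.0, 10.0]])

scaler = StandardScaler().fit(X_tr)      # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)     # test rows reuse the training mean and std

# Manual z-score using the population standard deviation (ddof=0).
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_tr_scaled, manual))
```

Fitting the scaler on the training rows only, as the cell above does, keeps information from the test set out of the preprocessing step.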

In [9]:
# Instantiate a logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train_scaled, y_train)

# Get the coefficients for each explanatory variable
coefficients = model.coef_[0]

# Display each explanatory variable and its coefficient
for feature, coef in zip(X_train.columns, coefficients):
    print(f'{feature}: {coef}')

SeniorCitizen: 0.059172709771479495
tenure: -0.7972658409841031
gender_Female: 0.012948513762738551
gender_Male: -0.012948513762738765
Partner_No: -0.013775173404453459
Partner_Yes: 0.013775173404453516
Dependents_No: 0.03955922248375553
Dependents_Yes: -0.039559222483755666
MultipleLines_No: -0.09925271228482362
MultipleLines_No phone service: 0.09562431836926674
MultipleLines_Yes: 0.04276673303473223
InternetService_DSL: -0.196347603092788
InternetService_Fiber optic: 0.2282486140836151
InternetService_No: -0.04870710382105402
OnlineSecurity_No: 0.11529997729160117
OnlineSecurity_No internet service: -0.04870710382105402
OnlineSecurity_Yes: -0.08327924471742715
OnlineBackup_No: 0.05476280330855407
OnlineBackup_No internet service: -0.04870710382105402
OnlineBackup_Yes: -0.014926130275759302
DeviceProtection_No: 0.019740318971837596
DeviceProtection_No internet service: -0.04870710382105402
DeviceProtection_Yes: 0.021539199119253047
TechSupport_No: 0.09600491686482454
TechSupport_No i

  y = column_or_1d(y, warn=True)
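
The `column_or_1d` fragment above is the tail of a scikit-learn warning that appears when the target is passed as a single-column DataFrame (as with `data[['Churn']]`) instead of a 1-D Series. A minimal sketch on synthetic data that counts this warning for both forms of the target; the helper name `count_conversion_warnings` is made up for this illustration:

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the notebook's features and target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"tenure": rng.integers(0, 72, size=200)})
y_frame = pd.DataFrame({"Churn": np.where(X_toy["tenure"] < 20, "Yes", "No")})

def count_conversion_warnings(y):
    """Fit a model and count DataConversionWarning occurrences."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LogisticRegression(max_iter=1000).fit(X_toy, y)
    return sum(issubclass(w.category, DataConversionWarning) for w in caught)

n_with_frame = count_conversion_warnings(y_frame)            # 2-D, shape (200, 1)
n_with_series = count_conversion_warnings(y_frame["Churn"])  # 1-D, shape (200,)
print(n_with_frame, n_with_series)
```

In the notebook, selecting the target as `data['Churn']` (single brackets) should have the same effect as the 1-D case here.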


In [10]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict the test data
y_pred = model.predict(X_test_scaled)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'accuracy: {accuracy}')

# Calculate the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("The confusion matrix:")
print(confusion)

# Display the evaluation metrics by class
report = classification_report(y_test, y_pred)
print("The evaluation metrics by class:")
print(report)

accuracy: 0.8204400283889283
The confusion matrix:
[[938  98]
 [155 218]]
The evaluation metrics by class:
              precision    recall  f1-score   support

          No       0.86      0.91      0.88      1036
         Yes       0.69      0.58      0.63       373

    accuracy                           0.82      1409
   macro avg       0.77      0.74      0.76      1409
weighted avg       0.81      0.82      0.82      1409
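
The per-class figures in the report follow directly from the confusion matrix. A short worked example that recomputes accuracy and the precision, recall, and F1 score of the `Yes` class from the four counts printed above:

```python
import numpy as np

# Confusion matrix copied from the output above: rows are true labels
# (No, Yes), columns are predicted labels (No, Yes).
cm = np.array([[938, 98],
               [155, 218]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision_yes = tp / (tp + fp)   # of predicted churners, how many churned
recall_yes = tp / (tp + fn)      # of actual churners, how many were caught
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

print(f"accuracy={accuracy:.4f} precision={precision_yes:.2f} "
      f"recall={recall_yes:.2f} f1={f1_yes:.2f}")
```

A recall of 0.58 for `Yes` means the model misses 155 of the 373 churners in the test set, which matters more for a retention use case than the headline accuracy.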



# Logistic regression 3 with MonthlyCharges (0.8218)

In [11]:
data['MonthlyCharges'] = data['MonthlyCharges'].round().astype('int64')

In [12]:
X = data[['SeniorCitizen','tenure','gender','Partner','Dependents','MultipleLines','InternetService',
          'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
          'StreamingMovies','Contract','PaperlessBilling','PaymentMethod','MonthlyCharges']]

y =  data[['Churn']]

categorical_cols = ['gender','Partner','Dependents','MultipleLines','InternetService',
          'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
          'StreamingMovies','Contract','PaperlessBilling','PaymentMethod']

New_X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
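
With `drop_first=True`, `get_dummies` omits the first level of each categorical column, and that level becomes the baseline against which the remaining dummy coefficients are measured. A minimal sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 60, 12],
})

# All three contract levels get their own column.
full = pd.get_dummies(toy, columns=["Contract"])

# The first level in sorted order ("Month-to-month") is dropped.
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level removes the exact linear dependence among the dummy columns, which is why the coefficient listing for this model has no `gender_Female` or `Partner_No` entries.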


In [13]:
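# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(New_X, y, test_size=0.2, random_state=42)

# Instantiate a logistic regression model
model = LogisticRegression()

# Train the model
# Note: this fits on the full dataset (New_X, y), which includes the test rows,
# so the test-set metrics reported below are optimistic. Fitting on
# (X_train, y_train) would give a fair hold-out estimate.
model.fit(New_X, y)

# Get the coefficients for each explanatory variable
coefficients = model.coef_[0]

# Display each explanatory variable and its coefficient
for feature, coef in zip(X_train.columns, coefficients):
    print(f'{feature}: {coef}')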

  y = column_or_1d(y, warn=True)


SeniorCitizen: 0.21933067568311476
tenure: -0.03452805605117316
MonthlyCharges: -0.004197399365329193
gender_Male: -0.023324652435552845
Partner_Yes: -0.013586428386667411
Dependents_Yes: -0.1743649748895735
MultipleLines_No phone service: 0.4172752626119415
MultipleLines_Yes: 0.31993458238777905
InternetService_Fiber optic: 1.0132810012785185
InternetService_No: -0.13047972834846253
OnlineSecurity_No internet service: -0.13047972834846253
OnlineSecurity_Yes: -0.3356294515823746
OnlineBackup_No internet service: -0.13047972834846253
OnlineBackup_Yes: -0.08644036216113328
DeviceProtection_No internet service: -0.13047972834846253
DeviceProtection_Yes: 0.01829984092556252
TechSupport_No internet service: -0.13047972834846253
TechSupport_Yes: -0.3044045653898293
StreamingTV_No internet service: -0.13047972834846253
StreamingTV_Yes: 0.3059205853297012
StreamingMovies_No internet service: -0.13047972834846253
StreamingMovies_Yes: 0.3188495473022097
Contract_One year: -0.6419780662559242
Con

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
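
The warning above says the solver hit its iteration limit and suggests raising `max_iter` or scaling the data. A minimal sketch that does both inside a pipeline, on synthetic features (the data and the `max_iter=1000` value are illustrative assumptions, not tuned for the churn data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, loosely mimicking a 0/1 flag,
# a tenure in months, and a monthly charge.
rng = np.random.default_rng(42)
n = 500
X_toy = np.column_stack([
    rng.integers(0, 2, size=n),
    rng.integers(0, 73, size=n),
    rng.uniform(18, 120, size=n),
])
y_toy = (X_toy[:, 1] + rng.normal(0, 10, size=n) < 25).astype(int)

# Scaling inside the pipeline plus a larger iteration budget.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_toy, y_toy)

n_iter = int(clf[-1].n_iter_[0])
print("lbfgs iterations used:", n_iter)
```

Keeping the scaler in the pipeline also means it is refit on the training rows only whenever the pipeline is fit, so no separate `transform` step is needed for the test set.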


In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on the test data
y_pred = model.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'accuracy: {accuracy}')

# Calculate the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("The confusion matrix:")
print(confusion)

# Display the evaluation metrics by class
report = classification_report(y_test, y_pred)
print("The evaluation metrics by class:")
print(report)

accuracy: 0.8218594748048261
The confusion matrix:
[[938  98]
 [153 220]]
The evaluation metrics by class:
              precision    recall  f1-score   support

          No       0.86      0.91      0.88      1036
         Yes       0.69      0.59      0.64       373

    accuracy                           0.82      1409
   macro avg       0.78      0.75      0.76      1409
weighted avg       0.82      0.82      0.82      1409



# k-Nearest Neighbors: kNN (0.764)

In [15]:
X = data[['SeniorCitizen','tenure','gender','Partner','Dependents','MultipleLines','InternetService',
          'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
          'StreamingMovies','Contract','PaperlessBilling','PaymentMethod']]

y =  data[['Churn']]

X = pd.get_dummies(X, columns = ['gender','Partner','Dependents','MultipleLines','InternetService',
          'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
          'StreamingMovies','Contract','PaperlessBilling','PaymentMethod'])

In [16]:
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

knn.score(X_test_scaled, y_test)


  return self._fit(X, y)


0.7636621717530163

# Support-vector machine

In [18]:
from sklearn.svm import LinearSVC

svc = LinearSVC()
svc.fit(X_train_scaled, y_train)

svc.score(X_test_scaled, y_test)

  y = column_or_1d(y, warn=True)


0.8176011355571328

# Random Forest

In [19]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train_scaled, y_train)
rf.score(X_test_scaled, y_test)

  rf.fit(X_train_scaled, y_train)


0.7913413768630234

# Gradient Boosting

In [20]:
from sklearn.ensemble import GradientBoostingClassifier

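# Note: RandomForestClassifier is instantiated here, not the imported
# GradientBoostingClassifier, so the score below is from a second random forest.
gbrt = RandomForestClassifier()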
gbrt.fit(X_train_scaled, y_train)
gbrt.score(X_test_scaled, y_test)

  gbrt.fit(X_train_scaled, y_train)


0.7963094393186657
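
The cell above imports `GradientBoostingClassifier` but instantiates `RandomForestClassifier`, so its score comes from a second random forest. A minimal sketch of fitting both model types on one split, using synthetic data from `make_classification` rather than the churn data (so the scores it prints say nothing about this dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the churn features.
X_toy, y_toy = make_classification(n_samples=600, n_features=10,
                                   n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=42)

models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model on the same split and record its test accuracy.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Setting `random_state` on the tree ensembles makes repeated runs comparable; without it, two random forests on the same split can score slightly differently, as the two results above (0.7913 and 0.7963) show.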