## Churn Prediction: identifying customers who are likely to stop using a company's services

### Plan:
1. Download and initial preparation of the dataset: rename columns and update values to make everything consistent
2. Split the dataset into train/validation/test
3. Identify important features
4. Transform categorical variables into numerical variables
5. Train the model

Dataset: <code>https://www.kaggle.com/datasets/blastchar/telco-customer-churn</code>

### Initial Library Importing and Dataset Download

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
# read the dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [3]:
# create a copy to keep the original unmodified
churn = df.copy()

### Basic Data Exploration and Preparation

**Attributes:**
- shape
- size
- columns

In [4]:
print("The shape is:", churn.shape)
print("The size is:", churn.size)
print("The column names are:", churn.columns)

The shape is: (7043, 21)
The size is: 147903
The column names are: Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
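
As a quick sanity check on these attributes: `size` is the number of rows times the number of columns, which is why 7043 rows and 21 columns give 147903. A minimal sketch on a toy DataFrame (the `toy` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data
toy = pd.DataFrame({
    'customerID': ['a', 'b', 'c'],
    'tenure': [1, 34, 2],
})

n_rows, n_cols = toy.shape        # shape is a (rows, columns) tuple
print(n_rows, n_cols, toy.size)   # size counts every cell: rows * columns
```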


**Methods:**
- .head()
- .info()
- .describe()
- .isnull().sum()

In [5]:
churn.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [6]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [7]:
churn.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
| --- | --- | --- | --- |
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |


In [8]:
churn.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
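
A count of zero here only means that no cell is `NaN`; `isnull()` does not flag blank strings. That matters for `TotalCharges`, which `describe()` left out because pandas read it as text rather than as numbers. The sketch below uses a made-up three-row frame to show one way to expose such hidden gaps with `pd.to_numeric(..., errors='coerce')`. Whether and where the real column contains blanks is an assumption to check on the actual data.

```python
import pandas as pd

# Hypothetical sample: one TotalCharges entry is a blank string, not NaN
toy = pd.DataFrame({
    'tenure': [1, 34, 0],
    'TotalCharges': ['29.85', '1889.5', ' '],
})

# The blank string is a regular value, so it is not counted as null
print(toy['TotalCharges'].isnull().sum())

# errors='coerce' turns anything that cannot be parsed as a number into NaN
total_numeric = pd.to_numeric(toy['TotalCharges'], errors='coerce')
print(total_numeric.isnull().sum())
```

How to fill the resulting `NaN` values is a separate modelling decision and is not settled by this sketch.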

In [9]:
# removing the inconsistencies (lowercasing and replacing spaces with underscores)
churn.columns = churn.columns.str.lower().str.replace(' ', '_')
string_columns = list(churn.dtypes[churn.dtypes == 'object'].index)

for col in string_columns:
    churn[col] = churn[col].str.lower().str.replace(' ', '_')

In [10]:
churn.head()

| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | ... | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-vhveg | female | 0 | yes | no | 1 | no | no_phone_service | dsl | no | ... | no | no | no | no | month-to-month | yes | electronic_check | 29.85 | 29.85 | no |
| 1 | 5575-gnvde | male | 0 | no | no | 34 | yes | no | dsl | yes | ... | yes | no | no | no | one_year | no | mailed_check | 56.95 | 1889.5 | no |
| 2 | 3668-qpybk | male | 0 | no | no | 2 | yes | no | dsl | yes | ... | no | no | no | no | month-to-month | yes | mailed_check | 53.85 | 108.15 | yes |
| 3 | 7795-cfocw | male | 0 | no | no | 45 | no | no_phone_service | dsl | yes | ... | yes | yes | no | no | one_year | no | bank_transfer_(automatic) | 42.3 | 1840.75 | no |
| 4 | 9237-hqitu | female | 0 | no | no | 2 | yes | no | fiber_optic | no | ... | no | no | no | no | month-to-month | yes | electronic_check | 70.7 | 151.65 | yes |


In [11]:
# looking at the 'churn' column of the DataFrame
churn['churn'].head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

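#### Optional: viewing the 'churn' column as Boolean values
- comparing the column with 'yes' gives a Boolean Series (True for customers who churned)
- churn['churn'] = (churn['churn'] == 'yes')
- churn['churn'].head()
- run this instead of the next cell, not before it: comparing the resulting Booleans with 'yes' a second time would give False for every row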

In [12]:
# turning the Boolean series to integers: True is converted to 1 and False is converted to 0
churn['churn'] = (churn['churn'] == 'yes').astype(int)
churn['churn'].head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64
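
To make the two steps of this conversion explicit, the sketch below applies them to a small hand-written Series that mirrors the five rows shown above (the variable names `labels`, `is_churn`, and `as_int` are illustrative only). The comparison first produces Booleans, and `astype(int)` then maps True/False to 1/0.

```python
import pandas as pd

# Same five labels as in the head() output above
labels = pd.Series(['no', 'no', 'yes', 'no', 'yes'], name='churn')

is_churn = (labels == 'yes')     # Boolean Series: True where the customer churned
as_int = is_churn.astype(int)    # True -> 1, False -> 0

print(is_churn.tolist())   # [False, False, True, False, True]
print(as_int.tolist())     # [0, 0, 1, 0, 1]
```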

### Splitting the dataset into train/validation/test with Scikit-Learn
- from sklearn.model_selection import train_test_split

In [13]:
# importing train_test_split from sklearn
from sklearn.model_selection import train_test_split

In [14]:
# breaking the dataset into train and test sets
df_train_full, df_test = train_test_split(churn, test_size=0.2, random_state=42)

**Checking that the data has been split and shuffled (the index values are no longer in order).**

In [15]:
df_train_full.head()

| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | ... | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2142 | 4223-bkeor | female | 0 | no | yes | 21 | yes | no | dsl | yes | ... | yes | no | no | yes | one_year | no | mailed_check | 64.85 | 1336.8 | 0 |
| 1623 | 6035-riiom | female | 0 | no | no | 54 | yes | yes | fiber_optic | no | ... | no | no | yes | yes | two_year | yes | bank_transfer_(automatic) | 97.2 | 5129.45 | 0 |
| 6074 | 3797-vtidr | male | 0 | yes | no | 1 | no | no_phone_service | dsl | no | ... | no | no | no | no | month-to-month | yes | electronic_check | 23.45 | 23.45 | 1 |
| 1362 | 2568-brgyx | male | 0 | no | no | 4 | yes | no | fiber_optic | no | ... | no | no | no | no | month-to-month | yes | electronic_check | 70.2 | 237.95 | 1 |
| 6754 | 2775-sefee | male | 0 | no | yes | 0 | yes | yes | dsl | yes | ... | no | yes | no | no | two_year | yes | bank_transfer_(automatic) | 61.9 | _ | 0 |


In [16]:
df_test.head()

| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | ... | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 185 | 1024-guald | female | 0 | yes | no | 1 | no | no_phone_service | dsl | no | ... | no | no | no | no | month-to-month | yes | electronic_check | 24.8 | 24.8 | 1 |
| 2715 | 0484-jpbru | male | 0 | no | no | 41 | yes | yes | no | no_internet_service | ... | no_internet_service | no_internet_service | no_internet_service | no_internet_service | month-to-month | yes | bank_transfer_(automatic) | 25.25 | 996.45 | 0 |
| 3825 | 3620-ehimz | female | 0 | yes | yes | 52 | yes | no | no | no_internet_service | ... | no_internet_service | no_internet_service | no_internet_service | no_internet_service | two_year | no | mailed_check | 19.35 | 1031.7 | 0 |
| 1807 | 6910-hadcm | female | 0 | no | no | 1 | yes | no | fiber_optic | no | ... | yes | no | no | no | month-to-month | no | electronic_check | 76.35 | 76.35 | 1 |
| 132 | 8587-xyzsf | male | 0 | no | no | 67 | yes | no | dsl | no | ... | no | yes | no | no | two_year | no | bank_transfer_(automatic) | 50.55 | 3260.1 | 0 |


In [17]:
# split df_train_full further into a training set and a validation set
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=13)

# save the target variable (churn column) outside the dataframes
y_train = df_train.churn.values
y_val = df_val.churn.values  # validation target: it comes from df_val, not df_test

# delete the churn column from both dataframes so that it can't be used as a feature during training
del df_train['churn']
del df_val['churn']
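
To see what the two successive splits mean in terms of set sizes, the sketch below repeats them on a plain array of 7043 row numbers instead of the real DataFrame (`row_ids` is a stand-in introduced here for illustration). Because the second `test_size=0.33` applies to the remaining 80% only, the result is roughly 54% train, 26% validation, and 20% test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 7043 rows of the dataset
row_ids = np.arange(7043)

# Same parameters as the two splits in the notebook
train_full_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=42)
train_ids, val_ids = train_test_split(train_full_ids, test_size=0.33, random_state=13)

# Report the absolute size and the share of the full dataset for each part
for name, part in [('train', train_ids), ('val', val_ids), ('test', test_ids)]:
    print(name, len(part), round(len(part) / len(row_ids), 3))
```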

**More data exploration after separation.**

In [18]:
# checking the counts of the churn variable (the target)
# 1 = the customer churned, 0 = the customer did not churn
df_train_full.churn.value_counts()

churn
0    4138
1    1496
Name: count, dtype: int64
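
These counts show that the classes are imbalanced: about 27% of the customers in `df_train_full` churned. The sketch below derives that share from the counts printed above, and checks that it equals the mean of a 0/1 column (the arrays are rebuilt from the counts for illustration; on the real data, `df_train_full.churn.mean()` should give the same figure).

```python
import numpy as np
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 4138, 1: 1496}, name='count')

# Share of churned customers among all customers in df_train_full
churn_rate = counts.loc[1] / counts.sum()
print(round(churn_rate, 3))

# With a 0/1 target, the mean of the column is the same quantity
y = np.concatenate([np.zeros(4138, dtype=int), np.ones(1496, dtype=int)])
print(round(y.mean(), 3))
```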