# The Problem

Intro: The telecom operator Interconnect would like to forecast customer churn.

Business Problem Statement: The company wants to identify which customers are planning to leave.

Business Value: To retain customers, those who are likely to leave will be offered promotional codes and special plan options.

## Solution: Build a predictive machine learning model to forecast customer churn

In [1]:
import pandas as pd

In [2]:
contract = pd.read_csv('contract.csv')

internet = pd.read_csv('internet.csv')

personal = pd.read_csv('personal.csv')

phone = pd.read_csv('phone.csv')

### Contract Dataframe Notes

* We expect the contract dataframe to have the most observations, since it contains every customer
* The `TotalCharges` column is read as `object` and needs to be converted to a float data type

In [3]:
contract.head(3)

|   | customerID | BeginDate | EndDate | Type | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 2020-01-01 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 |
| 1 | 5575-GNVDE | 2017-04-01 | No | One year | No | Mailed check | 56.95 | 1889.5 |
| 2 | 3668-QPYBK | 2019-10-01 | 2019-12-01 00:00:00 | Month-to-month | Yes | Mailed check | 53.85 | 108.15 |


In [4]:
contract.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB


In [5]:
# Checking for duplicates
contract.duplicated().sum()

0
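The `TotalCharges` conversion noted above could be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual cleaning step: the third row, including its ID `0000-DEMO` and its blank value, is hypothetical and only shows how an unparseable entry would be handled.

```python
import pandas as pd

# Toy stand-in for the contract dataframe. The first two rows come from
# contract.head(3); the third row is a hypothetical non-numeric entry.
contract_demo = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "0000-DEMO"],
    "TotalCharges": ["29.85", "1889.5", " "],
})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
contract_demo["TotalCharges"] = pd.to_numeric(
    contract_demo["TotalCharges"], errors="coerce"
)

print(contract_demo["TotalCharges"].dtype)
print(contract_demo["TotalCharges"].isna().sum())
```

If any values are coerced to `NaN`, they would need to be inspected before deciding whether to drop or fill them.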

### Internet Dataframe Notes

* As expected, this dataframe has fewer observations than the contract dataframe, since it covers only customers who use internet services
* All data types are correct, and there are no missing or duplicated values

In [6]:
internet.head(3)

|   | customerID | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies |
|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | DSL | No | Yes | No | No | No | No |
| 1 | 5575-GNVDE | DSL | Yes | No | Yes | No | No | No |
| 2 | 3668-QPYBK | DSL | Yes | Yes | No | No | No | No |


In [7]:
internet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5517 entries, 0 to 5516
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customerID        5517 non-null   object
 1   InternetService   5517 non-null   object
 2   OnlineSecurity    5517 non-null   object
 3   OnlineBackup      5517 non-null   object
 4   DeviceProtection  5517 non-null   object
 5   TechSupport       5517 non-null   object
 6   StreamingTV       5517 non-null   object
 7   StreamingMovies   5517 non-null   object
dtypes: object(8)
memory usage: 344.9+ KB


In [8]:
# Checking for duplicates
internet.duplicated().sum()

0

### Personal Dataframe Notes

* As expected, the personal dataframe contains the same number of observations as the contract dataframe, since it holds information about every unique customer
* All data types are correct, and there are no missing or duplicated values

In [9]:
personal.head(3)

|   | customerID | gender | SeniorCitizen | Partner | Dependents |
|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No |
| 1 | 5575-GNVDE | Male | 0 | No | No |
| 2 | 3668-QPYBK | Male | 0 | No | No |


In [10]:
personal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   Dependents     7043 non-null   object
dtypes: int64(1), object(4)
memory usage: 275.2+ KB


### Phone Dataframe Notes

* This dataframe covers only customers who use phone services, so it has fewer observations (6,361) than the contract dataframe
* All data types are correct, and there are no missing or duplicated values

In [11]:
phone.head(3)

|   | customerID | MultipleLines |
|---|---|---|
| 0 | 5575-GNVDE | No |
| 1 | 3668-QPYBK | No |
| 2 | 9237-HQITU | No |


In [12]:
phone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6361 entries, 0 to 6360
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     6361 non-null   object
 1   MultipleLines  6361 non-null   object
dtypes: object(2)
memory usage: 99.5+ KB
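
The notes for the personal and phone dataframes state that there are no missing or duplicated values. A check along the lines of the earlier `duplicated()` cells might look like the sketch below. The helper name `quality_report` is hypothetical, and the demo frame only reuses the three rows shown in `phone.head(3)`.

```python
import pandas as pd

def quality_report(df, key="customerID"):
    """Count missing cells, fully duplicated rows, and repeated key values."""
    return {
        "missing": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_keys": int(df[key].duplicated().sum()),
    }

# Rows copied from the phone.head(3) output above.
phone_demo = pd.DataFrame({
    "customerID": ["5575-GNVDE", "3668-QPYBK", "9237-HQITU"],
    "MultipleLines": ["No", "No", "No"],
})

report = quality_report(phone_demo)
print(report)
```

Checking the key column separately matters because two rows can share a `customerID` without being identical across every column.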


### Overall Notes

After reviewing each dataframe and the business problem we now know that this will be a classification problem.

We will define 0 as customers who did not leave and 1 as customers who did leave.
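
As a minimal sketch of this target definition, the `EndDate` column can be mapped to 0/1. It assumes, based on the `contract.head(3)` output, that `EndDate` holds the string `No` for customers who have not left; the column name `churn` is hypothetical.

```python
import pandas as pd

# Rows copied from the contract.head(3) output above.
contract_demo = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "EndDate": ["No", "No", "2019-12-01 00:00:00"],
})

# Assumption: "No" means the contract is still active (0);
# any other value is an end date, so the customer left (1).
contract_demo["churn"] = (contract_demo["EndDate"] != "No").astype(int)

print(contract_demo[["customerID", "churn"]])
```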

We will test different models to identify which classification model performs best for the problem.

However, to do this we will need to combine each dataframe into a single dataframe to take advantage of all the features in the database to maximize model performance.
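
One way to combine the dataframes is a sequence of left joins on `customerID`, starting from the contract dataframe so that every customer is kept. The sketch below is an illustration under that assumption, using only a few columns from the `head(3)` outputs above; the `_demo` names are hypothetical.

```python
import pandas as pd

# Small frames built from the head(3) outputs above (subset of columns).
contract_demo = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "Type": ["Month-to-month", "One year", "Month-to-month"],
})
personal_demo = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "gender": ["Female", "Male", "Male"],
})
internet_demo = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "InternetService": ["DSL", "DSL", "DSL"],
})
phone_demo = pd.DataFrame({
    "customerID": ["5575-GNVDE", "3668-QPYBK", "9237-HQITU"],
    "MultipleLines": ["No", "No", "No"],
})

# Left joins keep every contract row; validate guards against repeated keys.
merged = contract_demo
for frame in (personal_demo, internet_demo, phone_demo):
    merged = merged.merge(frame, on="customerID", how="left", validate="one_to_one")

print(merged)
```

Customers without a given service end up with `NaN` in that service's columns (here, `7590-VHVEG` has no phone row), so those gaps reflect an absent service rather than missing data and need to be handled before modeling.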

We will read the data, clean and prepare it, and perform an exploratory analysis to see what other insights we can obtain before model preparation. Finally, we will preprocess the data and add new features (target column and encoding) so that it is ready for modeling.

### Proposed Plan

1) Reading The Data
* The `customerID` column uniquely identifies each customer, and no duplicates were found
* No missing values were found


2) Data Cleaning & Preparation
* For consistent formatting, we will change column names to lower case
* We will merge the dataframes on the 'customerID' column into a single dataframe to facilitate model training for the classification problem
* We will convert the 'BeginDate' column to datetime and 'TotalCharges' to float

3) EDA
* Summarize the data 
* Plot the distribution of contract duration across customers and check for outliers


4) Data Preprocessing & Feature Engineering
* Create a target feature from the 'EndDate' column where 0 represents those who did not leave and 1 represents those who did leave
* Check the new target feature for class imbalance and address it if present
* One-hot encoding will be used because of the large number of categorical features in the data
* We will scale the numeric values after encoding

5) Modeling

* Train Models
* Sanity Check
* Measure & Improve
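
The preprocessing steps in item 4 of the plan (class balance check, one-hot encoding, scaling) can be sketched as below. This is a minimal illustration on a toy frame, not the final pipeline: the lower-case column names follow the plan, the `churn` column is a hypothetical target name, and standardization is done by hand with pandas rather than with a dedicated scaler.

```python
import pandas as pd

# Toy frame with values from contract.head(3); churn is derived from EndDate.
df = pd.DataFrame({
    "type": ["Month-to-month", "One year", "Month-to-month"],
    "paperlessbilling": ["Yes", "No", "Yes"],
    "monthlycharges": [29.85, 56.95, 53.85],
    "totalcharges": [29.85, 1889.5, 108.15],
    "churn": [0, 0, 1],
})

# Share of each class in the target, to spot imbalance.
class_share = df["churn"].value_counts(normalize=True)

features = df.drop(columns="churn")
categorical = ["type", "paperlessbilling"]
numeric = ["monthlycharges", "totalcharges"]

# One-hot encode; drop_first avoids a redundant column per feature.
encoded = pd.get_dummies(features, columns=categorical, drop_first=True)

# Standardize numeric columns to zero mean and unit variance.
encoded[numeric] = (encoded[numeric] - encoded[numeric].mean()) / encoded[numeric].std(ddof=0)

print(class_share)
print(encoded)
```

In the actual workflow, the scaling statistics would normally be computed on the training split only and then applied to the validation and test splits, to avoid leaking information.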