## Customer Churn Analysis in the Telecom Industry
We will use the scikit-learn library with Python to predict customer churn in the telecom industry.

In [3]:
#import python libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
from IPython.display import display, HTML

Let's read our data using pandas and then view the first rows of the DataFrame `df`.

In [5]:
df = pd.read_csv("telecom_churn.csv")
display(df.head())

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


## Data exploration 
Let's perform some initial data exploration to determine the following:
- how much data we have 
- if there are missing values
- what data type each column is 
- the distribution of data in each column
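
The four checks listed above can be sketched in a few lines of pandas. This is only an illustrative sketch: the `sample` frame is a three-row stand-in built from values in the head of the data, and the `explore` helper is a hypothetical name, not something the notebook defines.

```python
import pandas as pd

# Tiny stand-in frame (assumption); in the notebook this would be the `df`
# loaded from telecom_churn.csv.
sample = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ"],
        "Account length": [128, 107, 137],
        "International plan": ["No", "No", "No"],
        "Total day minutes": [265.1, 161.6, 243.4],
        "Churn": [False, False, False],
    }
)


def explore(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, unique-value count."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "missing": frame.isna().sum(),
            "unique": frame.nunique(),
        }
    )


n_rows, n_cols = sample.shape      # how much data we have
summary = explore(sample)          # data type and missing values per column
distribution = sample.describe()   # distribution of the numeric columns only

print(f"{n_rows} rows, {n_cols} columns")
print(summary)
print(distribution)
```

Note that `describe()` skips text columns by default, so the per-column missing-value count from `isna().sum()` is the check that covers every column.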

In [10]:
print("Shape (rows, columns): ", df.shape)

Shape (rows, columns):  (3333, 20)


In [8]:
print(df.columns)

Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')


In [11]:
# The "count" row of describe() holds the number of non-null values in each numeric column
counts = df.describe().loc["count"]
display(counts.to_frame("Count of values").transpose())

Unnamed: 0,Account length,Area code,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
Count of values,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0


None of the numeric columns has missing values: each one has 3333 non-null values, which matches the number of rows. If any had been missing, we would have had to fill in the missing values, fix dates and numbers stored in incorrect formats, extract features from text, and so on.
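
For illustration, here is a minimal sketch of what two of those cleaning steps could look like in pandas. The `messy` frame is invented for this example (the real dataset needed none of this), and filling with the median is just one simple choice among many.

```python
import numpy as np
import pandas as pd

# Hypothetical messy frame (assumption): not taken from telecom_churn.csv.
messy = pd.DataFrame(
    {
        "Total day minutes": ["265.1", "161.6", "n/a", "299.4"],  # numbers stored as text
        "Customer service calls": [1.0, np.nan, 0.0, 2.0],        # one missing value
    }
)

# Convert text to numbers; entries that cannot be parsed become NaN instead of raising.
messy["Total day minutes"] = pd.to_numeric(messy["Total day minutes"], errors="coerce")

# Fill the remaining gaps with each column's median.
cleaned = messy.fillna(messy.median())

print(cleaned)
```

Dates in inconsistent formats could be handled in a similar way with `pd.to_datetime(..., errors="coerce")`.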

In [15]:
# Drop the columns that we have decided won't be used in prediction
df = df.drop(["Area code", "State"], axis=1)
features = df.drop(["Churn"], axis=1).columns
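
One thing to keep in mind at this point: scikit-learn estimators expect numeric input, while "International plan" and "Voice mail plan" hold Yes/No strings. The sketch below shows one possible way to encode them before building the feature matrix. It is an assumption about a later step rather than something this notebook shows, and the `sample` frame and the names `encoded`, `X` and `y` are illustrative.

```python
import pandas as pd

# Stand-in rows (assumption) using values from the head of the data,
# with "State" and "Area code" already dropped.
sample = pd.DataFrame(
    {
        "International plan": ["No", "No", "Yes"],
        "Voice mail plan": ["Yes", "Yes", "No"],
        "Customer service calls": [1, 1, 2],
        "Churn": [False, False, False],
    }
)

# Map the Yes/No strings to 1/0 so that every feature column is numeric.
encoded = sample.copy()
for column in ["International plan", "Voice mail plan"]:
    encoded[column] = encoded[column].map({"Yes": 1, "No": 0})

# Same pattern as the cell above: every column except the target is a feature.
features = encoded.drop(["Churn"], axis=1).columns
X = encoded[features]   # feature matrix
y = encoded["Churn"]    # target labels

print(X)
print(y)
```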

In [14]:
display(df.tail(100))

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
3233,OK,112,415,No,No,0,166.0,79,28.22,74.6,100,6.34,247.9,74,11.16,6.3,7,1.70,0,False
3234,DE,75,510,No,Yes,28,200.6,96,34.10,164.1,111,13.95,169.6,153,7.63,2.5,5,0.68,1,False
3235,AZ,97,408,No,Yes,25,141.0,101,23.97,212.0,85,18.02,175.2,138,7.88,4.9,2,1.32,3,False
3236,AK,121,408,No,Yes,34,245.0,95,41.65,216.9,66,18.44,112.4,125,5.06,7.5,8,2.03,0,False
3237,MI,142,415,Yes,No,0,140.8,140,23.94,228.6,119,19.43,152.9,88,6.88,10.9,7,2.94,1,False
3238,WA,121,510,No,No,0,255.1,93,43.37,266.9,97,22.69,197.7,118,8.90,8.8,3,2.38,3,True
3239,SD,87,415,No,Yes,33,125.0,99,21.25,235.3,81,20.00,215.3,95,9.69,10.2,7,2.75,2,False
3240,SD,34,408,No,No,0,180.6,65,30.70,280.4,99,23.83,292.4,105,13.16,5.0,3,1.35,1,False
3241,AK,177,415,Yes,No,0,248.7,118,42.28,172.3,73,14.65,191.9,87,8.64,11.3,2,3.05,1,True
3242,MA,58,415,No,Yes,30,178.1,111,30.28,236.7,109,20.12,264.0,118,11.88,8.4,2,2.27,0,False


We assume that the 