# Customer Churn Prediction
Churn is one of the biggest problems in the telecom industry, because customers are free to choose from a variety of network providers within a product category. Retaining existing customers is believed to be more cost-effective than acquiring new ones. Keeping churn rates as low as possible is therefore what every business pursues, and understanding churn metrics can help companies identify potential churners in time to prevent them from leaving the client base.

In [None]:
#Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style = 'white')
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick # For specifying the axes tick format 

In [None]:
#Loading the csv dataset
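# Forward slashes throughout: the original 'Desktop\Data S' contained the invalid escape sequence '\D'
data = pd.read_csv('C:/Users/user/OneDrive/Desktop/Data S/Telecom/WA_Fn-UseC_-Telco-Customer-Churn.csv')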
data.head(10)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
# Convert TotalCharges to a numeric dtype.
# astype(float) fails if the column holds non-numeric strings, so use to_numeric
# with errors='coerce', which turns unparseable entries into NaN instead.
data.TotalCharges = pd.to_numeric(data.TotalCharges, errors='coerce')
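
To illustrate what `errors='coerce'` does, here is a minimal sketch on a made-up Series. The values are hypothetical and only stand in for a string-typed column with one blank entry; they are not taken from the dataset.

```python
import pandas as pd

# Toy column (hypothetical values): numbers stored as strings plus one blank entry
raw = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# errors='coerce' converts entries that cannot be parsed into NaN
# instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')

# Count how many entries could not be parsed
n_missing = int(converted.isna().sum())
print(converted)
print('Entries coerced to NaN:', n_missing)
```

The coerced entries then show up in `isnull().sum()` and can be handled explicitly, for example by dropping them as done in the next cells.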

In [None]:
data.isnull().sum()

In [None]:
#Removing missing values 
data.dropna(inplace = True)

In [None]:
# eliminate 'customerID'
#data.drop('customerID', axis=1, inplace=True)
data1 = data.iloc[:,1:]

In [None]:
# Print the unique values of each column containing a categorical feature
def unique_values(df):
  cat_columns = np.unique(df.select_dtypes('object').columns)
  for i in cat_columns:
    print(i, df[i].unique())

unique_values(data1)

In [104]:
# Switch 'No internet service' to 'No'
to_binary = ['DeviceProtection', 'OnlineBackup', 'OnlineSecurity', 'StreamingMovies', 'StreamingTV', 'TechSupport']

for i in to_binary:
  data.loc[data[i].isin(['No internet service']), i] = 'No'

# Inspect `data`, the frame that was just modified (it still contains 'customerID')
unique_values(data)

Churn ['No' 'Yes']
Contract ['Month-to-month' 'One year' 'Two year']
Dependents ['No' 'Yes']
DeviceProtection ['No' 'Yes']
InternetService ['DSL' 'Fiber optic' 'No']
MultipleLines ['No phone service' 'No' 'Yes']
OnlineBackup ['Yes' 'No']
OnlineSecurity ['No' 'Yes']
PaperlessBilling ['Yes' 'No']
Partner ['Yes' 'No']
PaymentMethod ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
PhoneService ['No' 'Yes']
StreamingMovies ['No' 'Yes']
StreamingTV ['No' 'Yes']
TechSupport ['No' 'Yes']
customerID ['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']
gender ['Female' 'Male']
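
The recoding loop above can also be written as a single `DataFrame.replace` call over the selected columns. The sketch below uses a small hypothetical frame (`toy` and its rows are made up for illustration) rather than the real data.

```python
import pandas as pd

# Toy frame (hypothetical rows) with two add-on service columns
toy = pd.DataFrame({
    'OnlineSecurity': ['No', 'Yes', 'No internet service'],
    'TechSupport': ['No internet service', 'No', 'Yes'],
    'InternetService': ['DSL', 'Fiber optic', 'No'],
})

cols = ['OnlineSecurity', 'TechSupport']

# Replace the value in the selected columns only; other columns are untouched
toy[cols] = toy[cols].replace('No internet service', 'No')

print(toy)
```

Both forms give the same result; the loop makes the per-column logic explicit, while `replace` is shorter.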


In [105]:
# Converting the target variable into a binary numeric variable
data['Churn'] = data['Churn'].map({'Yes': 1, 'No': 0})

# Let's convert all the categorical variables into dummy variables.
# Note: `data` still contains 'customerID', so it is one-hot encoded as well
# (see the customerID_* columns in the output).
data_dummies = pd.get_dummies(data)
data_dummies.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,customerID_0002-ORFBO,customerID_0003-MKNFE,customerID_0004-TLHLJ,customerID_0011-IGKFF,customerID_0013-EXCHZ,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,29.85,29.85,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0
1,0,34,56.95,1889.5,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,1
2,0,2,53.85,108.15,1,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,1
3,0,45,42.3,1840.75,0,0,0,0,0,0,...,0,0,1,0,1,0,1,0,0,0
4,0,2,70.7,151.65,1,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0
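
The preview above contains `customerID_*` columns because `customerID` was still part of `data` when `pd.get_dummies` ran, so every customer got its own indicator column. One possible way to avoid this is to drop the identifier before encoding. The sketch below uses a hypothetical toy frame (`toy`, `features` and `toy_dummies` are illustrative names, and the rows are made up).

```python
import pandas as pd

# Toy frame (hypothetical rows) mirroring a few of the dataset's columns
toy = pd.DataFrame({
    'customerID': ['A-1', 'B-2', 'C-3'],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'tenure': [1, 34, 2],
    'Churn': [0, 0, 1],
})

# Drop the identifier first so it is not expanded into one column per customer
features = toy.drop(columns='customerID')

# Only the remaining object column ('Contract') is one-hot encoded
toy_dummies = pd.get_dummies(features)

print(toy_dummies.columns.tolist())
```

An identifier carries no predictive information, and encoding it adds as many columns as there are customers.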


# Exploratory Data Analysis

In [None]:
#Correlation of "Churn" with other variables:
plt.figure(figsize=(15,8))
data_dummies.corr()['Churn'].sort_values(ascending = False).plot(kind='bar')

Month-to-month contracts and the absence of online security and tech support seem to be positively correlated with churn, while tenure and two-year contracts seem to be negatively correlated with it.

Not having an internet connection, and therefore none of the add-on services such as online security, streaming TV, online backup and tech support, also seems to be negatively related to churn.
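
The ranking behind the bar chart can also be inspected as numbers. The following is a minimal sketch on a hypothetical one-hot encoded frame (the values are made up, not the real dataset); it drops the trivial self-correlation of `Churn` so that only the other variables are ranked.

```python
import pandas as pd

# Toy one-hot encoded frame (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    'Churn':                   [1, 1, 0, 0, 1, 0],
    'Contract_Month-to-month': [1, 1, 0, 0, 1, 1],
    'Contract_Two year':       [0, 0, 1, 1, 0, 0],
    'tenure':                  [1, 3, 60, 48, 5, 24],
})

# Pearson correlation of each column with Churn;
# drop Churn itself, whose correlation with itself is always 1
churn_corr = toy.corr()['Churn'].drop('Churn').sort_values(ascending=False)

print(churn_corr)
```

Positive values at the top of the sorted Series correspond to the left-hand bars of the chart, negative values to the right-hand bars.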

**A.) Demographics**

Let us first understand the gender, age range, partner and dependent status of the customers.

**1. Gender Distribution**

In [None]:
colors = ['#4D3425','#E4512B']
ax = (data['gender'].value_counts()*100.0 /len(data)).plot(kind='bar',
                                                           stacked = True,
                                                           rot = 0,
                                                           color = colors)
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.set_ylabel('% Customers')
ax.set_xlabel('Gender')
ax.set_title('Gender Distribution')

# Label each bar with its percentage.
# The bar heights are already percentages, so no further normalisation is needed.
for p in ax.patches:
    # get_x() is the bar's left edge; get_height() is the top of the bar
    ax.text(p.get_x() + .15, p.get_height() - 3.5,
            '{:.1f}%'.format(p.get_height()),
            fontsize=12,
            color='white',
            weight='bold')


About half of the customers in our data set are male, while the other half are female.
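
The same percentages can be computed without dividing by `len(data)` by passing `normalize=True` to `value_counts`. A minimal sketch on a made-up column (the values are hypothetical):

```python
import pandas as pd

# Toy column (hypothetical values)
gender = pd.Series(['Female', 'Male', 'Male', 'Female', 'Male'], name='gender')

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages
gender_pct = gender.value_counts(normalize=True) * 100

print(gender_pct.round(1))
```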

**2. % Senior Citizens**

In [None]:
# Share of senior citizens; value_counts() lists the majority class (0 = 'No') first
ax = (data['SeniorCitizen'].value_counts()*100.0 /len(data))\
.plot.pie(autopct='%.1f%%', labels = ['No', 'Yes'], figsize =(5,5), fontsize = 12)
ax.set_ylabel('Senior Citizens',fontsize = 12)
ax.set_title('% of Senior Citizens', fontsize = 12)

Only about 16% of the customers are senior citizens, so most of the customers in the data are younger people.

**3. Partner and dependent status**

In [None]:
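# Reshape to long format; use a new name (df2) so the earlier data1 is not overwritten
df2 = pd.melt(data, id_vars=['customerID'], value_vars=['Dependents','Partner'])
# Count customers per (variable, value) pair and convert to % of all customers
df3 = df2.groupby(['variable','value']).count().unstack()
df3 = df3*100/len(data)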
colors = ['#4D3425','#E4512B']
ax = df3.loc[:,'customerID'].plot.bar(stacked=True, color=colors,
                                      figsize=(8,6),rot = 0,
                                     width = 0.2)

ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.set_ylabel('% Customers',size = 14)
ax.set_xlabel('')
ax.set_title('% Customers with dependents and partners',size = 14)
ax.legend(loc = 'center',prop={'size':14})

# Annotate each bar segment with its percentage
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.annotate('{:.0f}%'.format(height), (x + .25*width, y + .4*height),
                color = 'white',
                weight = 'bold',
                size = 14)
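
To make the reshaping in the cell above easier to follow, here is a minimal sketch of the same melt, group and unstack steps on a hypothetical four-customer frame. The names `toy`, `long_df`, `counts` and `pct` and the rows are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical rows)
toy = pd.DataFrame({
    'customerID': ['A-1', 'B-2', 'C-3', 'D-4'],
    'Dependents': ['No', 'No', 'Yes', 'No'],
    'Partner':    ['Yes', 'No', 'Yes', 'Yes'],
})

# Long format: one row per (customer, variable) pair
long_df = pd.melt(toy, id_vars=['customerID'], value_vars=['Dependents', 'Partner'])

# Count customers per (variable, value), then pivot 'value' into columns
counts = long_df.groupby(['variable', 'value'])['customerID'].count().unstack()

# Convert the counts to percentages of all customers
pct = counts * 100 / len(toy)

print(pct)
```

Each row of `pct` sums to 100, which is why the stacked bars in the plot reach the same height.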