# **Project Name**    - Orange Telecom Churn Prediction and Prevention



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Upendra Pratap Singh


# **GitHub Link -**

https://github.com/UPENDRA555/Telecom-Churn-EDA/tree/main

# **Problem Statement**


Orange S.A., formerly France Télécom S.A., is a French multinational telecommunications corporation. The Orange Telecom churn dataset consists of cleaned customer activity data (features) along with a churn label specifying whether a customer cancelled the subscription. The goal is to explore and analyze the data to discover the key factors responsible for customer churn and to recommend ways to improve customer retention.

***DATA DESCRIPTION***

The response variable is the binary column Churn (True, False): True means the customer churned, False means the customer was retained. The dataset has 20 columns in total. Based on a review of the literature, the 19 columns other than Churn are used as explanatory variables. All 20 columns are described below.

*   State: Name of the state (51 unique values)
*   Account length: How long the account has been active

*   Area code: Telephone area code of the customer
*   International plan: Yes indicates the customer has an international plan; No indicates no subscription to an international plan

*   Voice Mail Plan: Yes Indicates Voice Mail Plan is Present and No Indicates no subscription for Voice Mail Plan
*   Number vmail messages: Number of Voice Mail Messages ranging from 0 to 50

*   Total day minutes: Total number of minutes of calls during the day
*   Total day calls: Total number of calls made during the day

*   Total day charge: Total charge to the customer for daytime calls
*   Total eve minutes: Total number of minutes of calls in the evening

*   Total eve calls: Total number of calls made in the evening
*   Total eve charge: Total charge to the customer for evening calls

*   Total night minutes: Total Number of Minutes calls in the Night
*   Total night calls: Total Number of Calls made in Night

*   Total night charge: Total Charge to the Customers in Night
*   Total intl minutes: Total number of minutes of international calls

*   Total intl calls: Total number of international calls made
*   Total intl charge: Total charges for international calls

*   Customer service calls: Number of customer service calls made by customer
*   Churn: Customer Churn, True means churned customer, False means retained customer






















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Telecom Churn.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('The row & column count' )
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
df.dtypes

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_value = df.duplicated().sum()
print('The number of duplicate rows in this data:', duplicate_value)



#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
def show_missing():
    missing = df.columns[df.isnull().any()].tolist()
    return missing

# Missing data counts and percentage
print('Missing Data Count')
print(df[show_missing()].isnull().sum().sort_values(ascending = False))
print('--'*50)
print('Missing Data Percentage')
print(round(df[show_missing()].isnull().sum().sort_values(ascending = False)/len(df)*100,2))


In [None]:
df.isna().sum()

No missing values are present in the dataset.
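The count and percentage of missing values can also be collected in a single table. The sketch below is only an illustration: the `toy` frame and its values are hypothetical (one value is deliberately set to NaN), and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with one missing value
toy = pd.DataFrame({
    "Total day minutes": [265.1, np.nan, 243.4, 299.4],
    "Total day calls": [110, 123, 114, 71],
})

# isna().sum() counts missing cells per column;
# isna().mean() gives the fraction of missing cells per column
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
})
print(missing_summary)
```

On the real data the same two expressions can be applied to `df` directly.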

In [None]:
# Visualizing the missing values
!pip install missingno

In [None]:
# Plot a nullity matrix of missing values
import missingno as msno
msno.matrix(df)

In [None]:
# Plot a bar graph of missing value
msno.bar(df)

## ***Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Work on a copy so the original DataFrame stays unchanged
data = df.copy()
data.head()

In [None]:
data.dtypes

In [None]:
data.shape

All 20 columns have a non-null count of 3333, which indicates there are no missing values.



*   The 'Area code' and 'Customer service calls' features are stored as integers, but they are treated as categorical here.
*   So convert these features to the categorical type.
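A minimal sketch of how this conversion changes which columns count as numerical and which as categorical. The `sample` frame and its values are hypothetical; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical three-row sample; only the column names come from the dataset
sample = pd.DataFrame({
    "Area code": [408, 415, 510],
    "Customer service calls": [1, 4, 0],
    "Total day minutes": [265.1, 161.6, 243.4],
})

# Treat the code-like integer columns as categories
for col in ["Area code", "Customer service calls"]:
    sample[col] = sample[col].astype("category")

# After the conversion, select_dtypes separates the two groups
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(include="category").columns.tolist()
print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```

Since 'Customer service calls' is a count, an ordered categorical type may be worth considering if the ordering matters later in the analysis.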



In [None]:
data['Area code']= data['Area code'].astype('category')
data['Customer service calls']= data['Customer service calls'].astype('category')

In [None]:
numerical_features= data.select_dtypes(include=['int64', 'float64'])
numerical_features.head()

In [None]:
categorical_features= data.select_dtypes(include=['object', 'category'])
categorical_features.head()

## ***EDA (Exploratory Data Analysis)***

#### Dependent Variable analysis

In [None]:
# Dependent variable analysis of the dataset
# Find a total count of a dependent variable churn
data.Churn.value_counts()

In [None]:
# Plot a pie chart and a bar chart of the Churn feature
churn_counts = data["Churn"].value_counts()
fig, (ax_pie, ax_bar) = plt.subplots(2, 1, figsize=(10, 12))

# Pie chart: share of churned and retained customers
churn_counts.plot.pie(ax=ax_pie, fontsize=10, autopct="%1.2f%%")

# Bar chart: absolute number of churned and retained customers
churn_counts.plot(kind='bar', ax=ax_bar, fontsize=20)
ax_bar.set_xlabel('Churn Type')
ax_bar.set_ylabel('Churn Count')
fig.suptitle("Churn Percentage by Customer", fontsize=25)
plt.show()



*   The graph above shows that the two classes are not in proportion, so the dataset is imbalanced.
*   There are 483 churned customers (14.49%) and 2850 retained customers (85.51%).
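The class shares can be checked with a few lines of arithmetic. This sketch uses only the row count (3333) and the churned count (483) reported in this notebook; the variable names are illustrative.

```python
total_customers = 3333   # row count reported for the dataset
churned = 483            # customers with Churn == True
retained = total_customers - churned

churn_pct = round(churned / total_customers * 100, 2)
retained_pct = round(retained / total_customers * 100, 2)

print("Retained customers:", retained)
print("Churned share (%):", churn_pct)
print("Retained share (%):", retained_pct)
```

The retained class is close to six times the size of the churned class, which is worth keeping in mind when comparing raw counts between the two groups.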




#### For Categorical data Univariate analysis

In [None]:
# Find the unique values of each categorical variable
for colm in categorical_features:
  print(data[colm].unique())

In [None]:
# Find the count of each unique value
for colm in categorical_features:
  print(data[colm].value_counts())


In [None]:
# Plot a bar graph of a categorical feature
for colm in categorical_features:
  data[colm].value_counts().plot(kind= 'bar', figsize= (10, 10), fontsize=10, width= 0.3)
  plt.xlabel(colm)
  plt.ylabel('Count')
  plt.show()



*   WV (West Virginia) has the most customers (106) and CA (California) has the fewest (34).
*   Area code 415 has the most customers (1655) and area code 408 has the fewest (838).

*   323 customers have an international plan and 3010 do not.
*   922 customers have a voice mail plan and 2411 do not.




### For Categorical data bivariate analysis with target feature churn

In [None]:
for colm in categorical_features:
  # Counts of retained (False) and churned (True) customers per category
  percentage_churn = pd.crosstab(df[colm], df['Churn'])
  # Churn rate in percent: churned customers divided by all customers in the category
  percentage_churn["Churn_pct"] = percentage_churn.iloc[:, 1]*100/percentage_churn.sum(axis=1)
  print('=' * 100)
  print(percentage_churn.sort_values(by="Churn_pct", ascending=False).head(3))
  print('-' * 100)
  print(percentage_churn.sort_values(by="Churn_pct", ascending=False).tail(3))

In [None]:
# Plot a count plot of each categorical feature, split by Churn
for colm in categorical_features:
  fig, ax = plt.subplots(figsize=(15,10))

  sns.countplot(x=colm, hue='Churn', data=data, width= 0.5, ax=ax)
  ax.set_ylabel('COUNTS', rotation=0, labelpad=100,size=10)
  ax.set_xlabel(colm)
  ax.yaxis.set_label_coords(0.03, 0.75)
  ax.tick_params(labelsize=10)
  plt.show()



*   State NJ (New Jersey) has the highest churn rate and state HI (Hawaii) has the lowest.
*   Area code 510 has the highest churn rate and area code 415 has the lowest.

*   Customers with an international plan have a higher churn rate than customers without one.
*   Customers with a voice mail plan have a higher churn rate than customers without one.

*   The churn rate increases when a customer makes more than 3 customer service calls.
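Because Churn is boolean, the churn rate of each category can also be read off as a group mean. The sketch below uses a hypothetical six-row `toy` frame; the numbers are made up and are not results from the dataset.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset
toy = pd.DataFrame({
    "International plan": ["yes", "yes", "no", "no", "no", "no"],
    "Churn": [True, False, False, False, True, False],
})

# The mean of a boolean column is the fraction of True values,
# so the group mean is the churn rate of each category
churn_rate_pct = (
    toy.groupby("International plan")["Churn"].mean() * 100
).sort_values(ascending=False)
print(churn_rate_pct)
```

Comparing rates rather than raw counts avoids being misled by categories that simply contain more customers.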








### For numerical data Univariate analysis

In [None]:
# Plot a distribution plot of each numerical feature
for colm in numerical_features:
  sns.displot(x=df[colm])

The distribution plots suggest that the numerical features are approximately normally distributed, which makes the data easier to analyze.
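A visual impression of normality can be backed up with a quick skewness check. This is a sketch on synthetic data: the `toy` frame, its column names, and the distribution parameters are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: one symmetric column and one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=180, scale=50, size=2000),
    "right_skewed": rng.exponential(scale=8, size=2000),
})

# Sample skewness per column: values near 0 indicate a symmetric shape,
# clearly positive values indicate a long right tail
skewness = toy.skew()
print(skewness.round(2))
```

Running `numerical_features.skew()` on the real data would flag any column whose shape departs noticeably from a symmetric bell curve.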

In [None]:
# Plot a boxplot of each numerical feature against the target feature 'Churn'
for colm in numerical_features:
  data.boxplot(column=colm, by='Churn')
  plt.show()

In [None]:
for colm in numerical_features:
  ax = data.groupby('Churn')[colm].mean()
  print('--------------------------------------------------------------------')
  print(pd.DataFrame(ax))

In [None]:
for colm in numerical_features:
  ax = data.groupby('Churn')[colm].max()
  print('--------------------------------------------------------------------')
  print(pd.DataFrame(ax))

In [None]:
# Plot a barplot of mean value
for colm in numerical_features:
  ax = data.groupby('Churn')[colm].mean()
  ax.plot(legend=True, kind = 'barh')
  plt.show()

In [None]:
# Plot a barplot of maximum value
for colm in numerical_features:
  ax = data.groupby('Churn')[colm].max()
  ax.plot(legend=True, kind = 'barh')
  plt.show()

In [None]:
# calculating average call minutes per customer for different categories of calls
total_avg_day_minutes = round(df['Total day minutes'].mean())
total_avg_eve_minutes = round(df['Total eve minutes'].mean())
total_avg_night_minutes = round(df['Total night minutes'].mean())
total_avg_intl_minutes = round(df['Total intl minutes'].mean())
print("Average minutes of calls in day :", total_avg_day_minutes)
print('Average minutes of calls in evening :', total_avg_eve_minutes)
print('Average minutes of calls in night :', total_avg_night_minutes)
print('Average minutes of calls in international :', total_avg_intl_minutes)

In [None]:
sns.barplot( x=['Day','Evening','Night','International'], y=[total_avg_day_minutes,total_avg_eve_minutes,total_avg_night_minutes,total_avg_intl_minutes], width= 0.6)
plt.show()

In [None]:
# calculating total call minutes for different categories of calls
total_day_minutes_sum = round(df['Total day minutes'].sum())
total_eve_minutes_sum = round(df['Total eve minutes'].sum())
total_night_minutes_sum = round(df['Total night minutes'].sum())
total_intl_minutes_sum = round(df['Total intl minutes'].sum())
print("Total minutes calls in day :", total_day_minutes_sum)
print('Total minutes calls in  evening :', total_eve_minutes_sum)
print('Total minutes calls in  night :', total_night_minutes_sum)
print('Total minutes calls in international :', total_intl_minutes_sum)

In [None]:
sns.barplot( x=['Day','Evening','Night','International'], y=[total_day_minutes_sum,total_eve_minutes_sum,total_night_minutes_sum,total_intl_minutes_sum], width= 0.6)
plt.show()

In [None]:
# calculating total calls for different categories of calls
total_day_calls_sum = round(df['Total day calls'].sum())
total_eve_calls_sum = round(df['Total eve calls'].sum())
total_night_calls_sum = round(df['Total night calls'].sum())
total_intl_calls_sum = round(df['Total intl calls'].sum())
print("Total calls in day :", total_day_calls_sum)
print('Total calls in  evening :', total_eve_calls_sum)
print('Total calls in  night :', total_night_calls_sum)
print('Total calls in international :', total_intl_calls_sum)

In [None]:
sns.barplot( x=['Day','Evening','Night','International'], y=[total_day_calls_sum,total_eve_calls_sum,total_night_calls_sum,total_intl_calls_sum], width= 0.6)
plt.show()

In [None]:
# calculating total charge for different categories of calls
total_day_charge_sum = round(df['Total day charge'].sum())
total_eve_charge_sum = round(df['Total eve charge'].sum())
total_night_charge_sum = round(df['Total night charge'].sum())
total_intl_charge_sum = round(df['Total intl charge'].sum())
print("Total charge in day :", total_day_charge_sum)
print('Total charge in  evening :', total_eve_charge_sum)
print('Total charge in  night :', total_night_charge_sum)
print('Total charge in international :', total_intl_charge_sum)

In [None]:
sns.barplot( x=['Day','Evening','Night','International'], y=[total_day_charge_sum,total_eve_charge_sum,total_night_charge_sum,total_intl_charge_sum], width= 0.6)
plt.show()

In [None]:
# calculating charge per call for different categories of calls
day_charge_per_calls = round(df['Total day charge'].sum()/total_day_calls_sum, 2)
eve_charge_per_calls = round(df['Total eve charge'].sum()/total_eve_calls_sum, 2)
night_charge_per_calls = round(df['Total night charge'].sum()/total_night_calls_sum, 2)
intl_charge_per_calls = round(df['Total intl charge'].sum()/total_intl_calls_sum, 2)
print("Per Calls charge in day :", day_charge_per_calls)
print('Per Calls charge in  evening :', eve_charge_per_calls)
print('Per Calls charge in  night :', night_charge_per_calls)
print('Per Calls charge in international :', intl_charge_per_calls)

In [None]:
sns.barplot( x=['Day','Evening','Night','International'], y=[day_charge_per_calls,eve_charge_per_calls,night_charge_per_calls,intl_charge_per_calls], width= 0.6)
plt.show()

The graph above shows that international calls have the highest charge per call of all call categories, so the next step is to check whether international calling contributes to churn.

In [None]:
# calculating charge per minute for different categories of calls
day_charge_per_min = round(df['Total day charge'].sum()/total_day_minutes_sum, 2)
eve_charge_per_min = round(df['Total eve charge'].sum()/total_eve_minutes_sum, 2)
night_charge_per_min = round(df['Total night charge'].sum()/total_night_minutes_sum, 2)
intl_charge_per_min = round(df['Total intl charge'].sum()/total_intl_minutes_sum, 2)
print("Per minute charge in day :", day_charge_per_min)
print('Per minute charge in  evening :', eve_charge_per_min)
print('Per minute charge in  night :', night_charge_per_min)
print('Per minute charge in international :', intl_charge_per_min)

In [None]:
sns.barplot( x=['Day','Evening','Night','International'], y=[day_charge_per_min,eve_charge_per_min,night_charge_per_min,intl_charge_per_min], width= 0.6)
plt.show()

The graph above shows that international calls have the highest charge per minute of all call categories, so the next step is to check whether international calling contributes to churn.
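The per-minute rate is a ratio of sums: total charge divided by total minutes. A minimal sketch follows, with a hypothetical helper `rate_per_minute` and made-up values in `toy`; only the column naming pattern comes from the dataset.

```python
import pandas as pd

# Hypothetical two-customer sample; the amounts are made up
toy = pd.DataFrame({
    "Total day minutes": [100.0, 200.0],
    "Total day charge": [17.0, 34.0],
    "Total intl minutes": [10.0, 20.0],
    "Total intl charge": [2.7, 5.4],
})

def rate_per_minute(frame, period):
    """Total charge divided by total minutes for one call category."""
    total_charge = frame[f"Total {period} charge"].sum()
    total_minutes = frame[f"Total {period} minutes"].sum()
    return round(total_charge / total_minutes, 2)

day_rate = rate_per_minute(toy, "day")
intl_rate = rate_per_minute(toy, "intl")
print("day:", day_rate, "intl:", intl_rate)
```

Unlike the cell above, this helper divides by the unrounded minute total. With totals in the hundreds of thousands of minutes the difference is negligible.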

#### Correlation Heatmap of the dataset

In [None]:
# plotting heatmap for analyzing correlation between the numeric variables
# numeric_only=True skips text columns such as State (needs pandas >= 1.5)
plt.figure(figsize=(15,8))
sns.heatmap(df.corr(numeric_only=True),annot=True,cmap='RdYlGn',linewidths=0.2)
plt.show()

The heat map shows that churn correlates most noticeably with Total day minutes/Total day charge, total eve calls and total eve charge, and customer service calls.
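Reading one row of a large annotated heat map is error-prone, so it can help to rank the features by the absolute value of their correlation with Churn. The sketch below runs on a hypothetical six-row `toy` frame, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset
toy = pd.DataFrame({
    "Total day minutes": [120.0, 300.0, 150.0, 280.0, 130.0, 310.0],
    "Customer service calls": [1, 4, 0, 5, 2, 1],
    "Total night calls": [90, 100, 110, 95, 105, 100],
    "State": ["OH", "NJ", "CA", "TX", "HI", "MD"],
    "Churn": [False, True, False, True, False, True],
})

# Cast the boolean target to 0/1 so it is treated as numeric explicitly
numeric = toy.assign(Churn=toy["Churn"].astype(int))

# Pearson correlation of every numeric column with Churn,
# sorted by absolute strength; the text column State is skipped
corr_with_churn = (
    numeric.corr(numeric_only=True)["Churn"]
    .drop("Churn")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_churn.round(2))
```

Sorting by absolute value keeps strong negative correlations near the top as well.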

#### Pair Plot of Datasets

In [None]:
# Plot a pairplot of the dataset
plt.close()
sns.set_style("whitegrid")
sns.pairplot(df, hue="Churn", height=4, kind= 'hist')
plt.show()

## **Solution to Business Objective**



*   The heat map shows that churn correlates most noticeably with Total day minutes/Total day charge, total eve calls and total eve charge, and customer service calls.
*   In states where churn is high (NJ, CA, TX, MD, SC), the company needs to check whether network penetration is low or competitors are offering cheaper prices.

*   Customers with an international calling plan churn 200% more than other customers, so this needs to be addressed with optimal international calling rates.
*   When a customer makes more than 3 service calls, churn increases, so customer concerns need to be resolved to improve satisfaction and retention and thereby reduce churn.
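Statements such as "X% more churn" and "churn rises above 3 service calls" can be expressed as a relative increase between two group rates. The following sketch shows one way to compute both on a hypothetical twelve-row `toy` frame; the resulting figures are illustrative only and are not the dataset's values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset
toy = pd.DataFrame({
    "International plan": ["yes"] * 4 + ["no"] * 8,
    "Customer service calls": [4, 1, 2, 0, 5, 4, 1, 2, 0, 3, 1, 6],
    "Churn": [True, True, False, False,
              True, True, False, False, False, False, False, False],
})

# Churn rate with and without an international plan
plan_rate = toy.groupby("International plan")["Churn"].mean()
relative_increase_pct = (plan_rate["yes"] - plan_rate["no"]) / plan_rate["no"] * 100

# Churn rate for customers above and at or below 3 service calls
call_bucket = np.where(toy["Customer service calls"] > 3, "more than 3", "3 or fewer")
bucket_rate = toy.groupby(call_bucket)["Churn"].mean()

print("Relative increase with plan (%):", relative_increase_pct)
print(bucket_rate)
```

A relative increase of 100% means the rate has doubled, so "200% more" corresponds to three times the baseline rate.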








# **Conclusion**

There is a correlation between churn and Total day minutes/Total day charge, total eve calls and total eve charge, and customer service calls. Higher total day minutes or total day charge is associated with higher customer churn.
