### INTRODUCTION

Customer churn occurs when customers or subscribers stop doing business with a company or service. Customers in the telecom sector can choose from a wide range of service providers and actively switch between them. In this fiercely competitive market, the telecommunications industry sees a yearly turnover (churn) rate of 15 to 25%. Customer churn is therefore one of the biggest threats to revenue in the telecom sector. Fostering customer loyalty is essential, since recruiting a new customer can cost up to 25 times more than keeping an existing one.

The same reasoning applies to other service businesses; in our case, a bank. Some studies show that acquiring a new customer can cost 5 times more than satisfying and retaining an existing one. Predicting and tracking the bank's customer churn can therefore help reduce marketing costs, increase capital, grow the total customer base, and more.

The dataset we are going to work with will enable us to predict whether a customer churns.
The dataset contains the following columns: RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Exited.

The dependent variable is Exited, the outcome we want to predict. The remaining columns are the independent variables we will use to predict it.
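
Before modelling, it helps to know how often the outcome actually occurs. The sketch below is only an illustration: `toy` is a hypothetical ten-row frame (not the real file), and on the real data the same two calls would be run on `df['Exited']` while it is still a 0/1 integer column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; only Exited matters here.
toy = pd.DataFrame({"Exited": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]})

# Share of each outcome (0 = remained, 1 = exited).
class_share = toy["Exited"].value_counts(normalize=True)

# For a 0/1 column, the mean is the churn rate.
churn_rate = toy["Exited"].mean()

print(class_share)
print(f"churn rate: {churn_rate:.1%}")  # churn rate: 40.0%
```

If one class is much rarer than the other, plain accuracy can look good even when the model rarely finds the rare class, so this check is worth doing early.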


In [4]:
#import the required libraries
import pandas as pd
import numpy as np 
import pickle

In [9]:
#load the data 
df = pd.read_csv("Churn_Modelling.csv")
#check the column labels
df.head()

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [8]:
#column info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [10]:
#dropping columns that won't be needed
df.drop(['RowNumber','CustomerId','Surname'], axis=1, inplace=True)

In [11]:
#confirm the columns were dropped
df.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [12]:
#check for null values
df.isnull().sum()

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [19]:
#preprocessing data: encode Gender as integers
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

#check that the encoding worked: 0 = Female, 1 = Male
df.head(20).Gender

0     0
1     0
2     0
3     0
4     0
5     1
6     1
7     0
8     1
9     1
10    1
11    1
12    0
13    0
14    0
15    1
16    1
17    0
18    1
19    0
Name: Gender, dtype: int64

In [21]:
#let's perform the same encoding on Geography
df['Geography'] = le.fit_transform(df['Geography'])

#check that the encoding worked: LabelEncoder sorts labels alphabetically, so 0 = France, 1 = Germany, 2 = Spain
df.head(20).Geography

0     0
1     2
2     0
3     0
4     2
5     2
6     0
7     1
8     0
9     0
10    0
11    2
12    0
13    0
14    2
15    1
16    1
17    2
18    2
19    0
Name: Geography, dtype: int64
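
Rather than reading the integer codes off the first rows, the mapping can be taken from the encoder itself through its `classes_` attribute. The sketch below uses a hypothetical four-value series `geo` instead of the real column. It also shows `pd.get_dummies` as one possible alternative: integer codes imply an order (France < Germany < Spain) that a linear model will treat as meaningful, while one-hot columns do not. This is a suggestion, not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Geography column.
geo = pd.Series(["France", "Spain", "France", "Germany"], name="Geography")

le_geo = LabelEncoder()
codes = le_geo.fit_transform(geo)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(le_geo.classes_)}
print(mapping)  # {'France': 0, 'Germany': 1, 'Spain': 2}

# One-hot alternative: one indicator column per country, no implied order.
dummies = pd.get_dummies(geo, prefix="Geography")
print(dummies.columns.tolist())
# ['Geography_France', 'Geography_Germany', 'Geography_Spain']
```

Using a separate encoder per column (here `le_geo`) also keeps each mapping available later, which a single reused `le` does not.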

In [30]:
#convert the outcome column Exited to string (initially it was an int)
df['Exited'] = df['Exited'].astype(str)
#confirm the new dtype: object (string)
df.dtypes

CreditScore          int64
Geography            int64
Gender               int64
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited              object
dtype: object

In [31]:
#relabel the outcome Exited: 1 -> Exited, 0 -> Remained
df['Exited'] = df['Exited'].str.replace('1', 'Exited')
df['Exited'] = df['Exited'].str.replace('0', 'Remained')

#check that the change occurred
df.head(10).Exited

0      Exited
1    Remained
2      Exited
3    Remained
4    Remained
5      Exited
6    Remained
7      Exited
8    Remained
9    Remained
Name: Exited, dtype: object
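
The string conversion followed by two `str.replace` calls works here because the column only holds 0 and 1. Since `str.replace` substitutes substrings, a value such as "10" would come out as "ExitedRemained". A minimal alternative sketch, assuming the column is still integer, is a single dictionary lookup with `Series.map`; the series `exited` below is a hypothetical stand-in for `df['Exited']`.

```python
import pandas as pd

# Hypothetical stand-in for the integer Exited column.
exited = pd.Series([1, 0, 1, 0, 0], name="Exited")

# Map each integer code to its label in one step, with no string conversion.
labels = exited.map({0: "Remained", 1: "Exited"})

print(labels.tolist())
# ['Exited', 'Remained', 'Exited', 'Remained', 'Remained']
```

Any value missing from the dictionary becomes NaN, which makes unexpected codes easy to spot.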

In [32]:
#Define X and Y
x = df.drop('Exited',axis = 1)
y = df['Exited']

In [33]:
#Divide the data into training and testing 
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.23)
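
The split above is random on every run and does not control the share of each class in the two parts. A sketch of a reproducible, stratified variant follows. It is an illustration on hypothetical synthetic data (`X_demo`, `y_demo`), and both `random_state=42` and the 20/80 class mix are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one numeric feature and an imbalanced two-class target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series(["Exited"] * 200 + ["Remained"] * 800, name="Exited")

# stratify keeps the class proportions (about) equal in train and test;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.23, stratify=y_demo, random_state=42
)

print(y_tr.value_counts(normalize=True).round(3).to_dict())
print(y_te.value_counts(normalize=True).round(3).to_dict())
```

With stratification, metrics computed on the test set are less sensitive to an unlucky split that happens to contain few churners.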

In [34]:
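#scaling: fit on the training set only, then apply the same transform to the test set
sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)
#note: the model below is still fit on the unscaled x_train, so the scaled arrays do not affect the reported results
x_test_scaled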

array([[-1.66873989, -0.89842968, -1.10641217, ..., -1.54325711,
         0.9708153 , -1.63726353],
       [-1.09939772,  0.31587759,  0.90382231, ..., -1.54325711,
        -1.03006206, -0.54148049],
       [-0.24020862, -0.89842968, -1.10641217, ...,  0.64798017,
        -1.03006206, -0.30287751],
       ...,
       [-0.10563684,  0.31587759,  0.90382231, ..., -1.54325711,
        -1.03006206, -1.13710438],
       [ 0.44300198, -0.89842968, -1.10641217, ..., -1.54325711,
        -1.03006206, -0.38685107],
       [-0.84060582, -0.89842968,  0.90382231, ...,  0.64798017,
         0.9708153 ,  0.13989205]])
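
One way to make sure the scaled features are the ones the model actually sees is to bundle the scaler and the classifier in a scikit-learn pipeline. The sketch below is an illustration under assumptions: the data come from `make_classification` rather than the churn file, and names such as `pipe` and `X_demo` are hypothetical. The estimator settings mirror the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the churn features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.23, random_state=0
)

# The pipeline fits the scaler on the training data only and reuses
# the same scaling whenever predict() or score() is called.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", C=0.05, random_state=0),
)
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.2f}")
```

Because the scaler lives inside the pipeline, saving `pipe` (for example with `pickle`) also saves the fitted scaling parameters along with the model.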

In [36]:
#creating a model and performing training
from sklearn.linear_model import LogisticRegression

#multi_class is omitted: the target is binary, so it has no effect with liblinear
model = LogisticRegression(solver='liblinear', C=0.05, random_state=0)
model.fit(x_train, y_train)

In [43]:
#model evaluation
from sklearn.metrics import classification_report

print(classification_report(y_test,  model.predict(x_test)))

              precision    recall  f1-score   support

      Exited       0.42      0.06      0.11       465
    Remained       0.80      0.98      0.88      1835

    accuracy                           0.79      2300
   macro avg       0.61      0.52      0.49      2300
weighted avg       0.73      0.79      0.73      2300
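
In the report above, recall for Exited is 0.06: the model catches roughly 6% of the 465 customers who left. The 0.79 accuracy is close to what always predicting Remained would give (1835 / 2300 ≈ 0.80). One common response to this kind of imbalance is to weight the classes during training. The sketch below compares recall with and without `class_weight='balanced'` on hypothetical synthetic data with a similar 80/20 mix. It is not a result for the churn dataset, and the effect on the real data would need to be measured.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced data: about 20% of samples belong to class 1.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.23, stratify=y_demo, random_state=0
)

# Fit the same model twice: once unweighted, once with balanced class weights,
# and record recall on the minority class (label 1) for each.
recalls = {}
for weighting in (None, "balanced"):
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(
            solver="liblinear", C=0.05, class_weight=weighting, random_state=0
        ),
    )
    clf.fit(X_tr, y_tr)
    recalls[str(weighting)] = recall_score(y_te, clf.predict(X_te), pos_label=1)

print(recalls)
```

Balanced weights usually trade some precision for recall on the rare class, so both columns of the classification report should be checked after such a change.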

