# Predictive Analytics with Machine Learning

### How ML Engineers and Data Scientists Build ML Models

1. Domain Exploration
- Understanding the business process and common challenges
- Understanding the data flow and sources of data/information
- Identifying common hypotheses and gaps


2. Data Exploration and Building a dataset
- explore and analyze all available data
- gather and integrate data sources, build a dataset
- formulate one or more target attributes
- overall exploration and analysis of the data


3. Data Enrichment / Cleaning, Transformation
- Data Enrichment: standardizing data attributes, measures
- Cleaning: handling missing values, duplicates, outliers, unwanted columns
- Transformation: converting from one type/form to another, extracting features


4. Feature Engineering
- extract important features
- select important features, combine features


5. Preprocessing features
- encoding, scaling, splitting into sets


6. Training ML Model
- select appropriate algorithm
- training ML model
- scoring models

7. Model Optimization and Tuning
- improve features, improve model params


8. Packaging
- managing environments, optimizing latency in prediction

9. Deployment to production

10. Monitoring in production


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Data Exploration

In [2]:
# load the dataset
url = "https://raw.githubusercontent.com/anshupandey/Machine_Learning_Training/refs/heads/master/datasets/Bank_churn_modelling.csv"
df = pd.read_csv(url)
df.shape

(10000, 14)

In [4]:
df.head(2)

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0


## Observations:

1. Unwanted Columns: RowNumber, CustomerId, Surname
2. Information Dimensions
  - Demographic Information: Gender, Age, Geography
  - Financial Condition of customer: CreditScore, Balance, EstimatedSalary
  - Relation of customer with the bank: Tenure, NumOfProducts, IsActiveMember, Exited

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [6]:
df.describe()

,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0
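
The `Exited` mean of 0.2037 in the summary above means roughly 20% of customers churned, so the target classes are imbalanced. A minimal sketch of checking class balance directly is shown here; the `toy` frame is made-up illustrative data, not the bank dataset, and in the notebook the same calls would be applied to `df['Exited']`.

```python
import pandas as pd

# Hypothetical toy target column; the notebook would use df['Exited'] instead.
toy = pd.DataFrame({"Exited": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Share of each class. For a 0/1 target, the share of 1s equals the column mean.
class_share = toy["Exited"].value_counts(normalize=True)
positive_rate = toy["Exited"].mean()

print(class_share)
print("positive rate:", positive_rate)
```

Knowing the positive rate up front helps when reading accuracy later, since a model that always predicts the majority class already scores close to the majority share.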


## 3. Data Cleaning

In [7]:
# check for duplicate entries
df.duplicated().sum()

0

In [8]:
# check for missing values
df.isnull().sum()

RowNumber,0
CustomerId,0
Surname,0
CreditScore,0
Geography,0
Gender,0
Age,0
Tenure,0
Balance,0
NumOfProducts,0


In [10]:
df.skew(numeric_only=True)

RowNumber,0.0
CustomerId,0.001149
CreditScore,-0.071607
Age,1.01132
Tenure,0.010991
Balance,-0.141109
NumOfProducts,0.745568
HasCrCard,-0.901812
IsActiveMember,-0.060437
EstimatedSalary,0.002085


In [11]:
import plotly.express as px
fig = px.histogram(df, x='Age')
fig.show()

In [12]:
# trim/clip Age: cap values above 75 at 75
# use a single .loc assignment instead of chained indexing (df['Age'][mask] = ...),
# which raises a warning and stops updating df under pandas Copy-on-Write
df.loc[df['Age'] > 75, 'Age'] = 75
df['Age'].skew()

0.9563388133296787
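
Capping a column at an upper bound can also be written with `Series.clip`, which returns a new Series and avoids indexing pitfalls. The sketch below uses a small made-up `ages` Series (illustrative values only) to show that capping the right tail lowers the skewness.

```python
import pandas as pd

# Hypothetical ages with a long right tail (illustrative values only).
ages = pd.Series([22, 25, 31, 35, 38, 40, 44, 52, 80, 92])

# Cap everything above 75 at 75; values at or below 75 are unchanged.
capped = ages.clip(upper=75)

skew_before = ages.skew()
skew_after = capped.skew()
print(skew_before, skew_after)
```

In the notebook the equivalent would be assigning `df['Age'].clip(upper=75)` back to `df['Age']`.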

In [13]:
import plotly.express as px
fig = px.histogram(df, x='Age')
fig.show()

In [14]:
df.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [15]:
df.drop(columns=['RowNumber','CustomerId','Surname'],inplace=True)
df.columns

Index(['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance',
       'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary',
       'Exited'],
      dtype='object')

## 4. Feature Engineering

In [16]:
df.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [17]:
# Target / label: Exited (categorical)
# Features:
  # categorical features: Geography, Gender, HasCrCard, IsActiveMember
  # numeric features: CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary

In [19]:
## ANOVA f test

xnum = df[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary',]]

y = df['Exited']


# Significance level alpha = 0.05 (95% confidence)
# H0: the feature has the same mean in every target class
# if pvalue > alpha: fail to reject H0; class means look similar, so the feature is treated as not informative for the target
# if pvalue < alpha: reject H0; class means differ, so the feature is treated as informative for the target


from sklearn.feature_selection import f_classif

fscore, pvalue = f_classif(xnum,y)

for i in range(len(xnum.columns)):
  print(xnum.columns[i],pvalue[i])

CreditScore 0.006738213892258643
Age 2.538669143997693e-190
Tenure 0.1615268494952801
Balance 1.275563319153163e-32
NumOfProducts 1.7173330048040421e-06
EstimatedSalary 0.22644042802376574


In [20]:
## chi square test

xcat = df[['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']]
y = df['Exited']
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
xcat = xcat.apply(le.fit_transform)
xcat.head()

,Geography,Gender,HasCrCard,IsActiveMember
0,0,0,1,1
1,2,0,0,1
2,0,0,1,0
3,0,0,0,0
4,2,0,1,1


In [22]:
from sklearn.feature_selection import chi2

chiscore, pvalue = chi2(xcat,y)

for i in range(len(xcat.columns)):
  print(xcat.columns[i],pvalue[i])

Geography 0.0005756078382573235
Gender 7.015574513879596e-13
HasCrCard 0.6984962089530451
IsActiveMember 1.568036240543455e-27


In [23]:
# selected features: those with p-value < 0.05 in the ANOVA and chi-square tests above
x = df[['CreditScore', 'Geography','Gender', 'Age','Balance', 'NumOfProducts', 'IsActiveMember']]
y = df['Exited']

## 5. Preprocessing of Features

In [24]:
x.head()

,CreditScore,Geography,Gender,Age,Balance,NumOfProducts,IsActiveMember
0,619,France,Female,42,0.0,1,1
1,608,Spain,Female,41,83807.86,1,1
2,502,France,Female,42,159660.8,3,0
3,699,France,Female,39,0.0,2,0
4,850,Spain,Female,43,125510.82,1,1


In [25]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

pipeline = ColumnTransformer([('encoder',OneHotEncoder(drop='first'),[1,2]),
                              ('scaler',StandardScaler(),[0,3,4,5])],remainder='passthrough')

pipeline.fit(x)
x2 = pd.DataFrame(pipeline.transform(x),columns=pipeline.get_feature_names_out())
x2.head()


,encoder__Geography_Germany,encoder__Geography_Spain,encoder__Gender_Male,scaler__CreditScore,scaler__Age,scaler__Balance,scaler__NumOfProducts,remainder__IsActiveMember
0,0.0,0.0,0.0,-0.326221,0.297413,-1.225848,-0.911583,1.0
1,0.0,1.0,0.0,-0.440036,0.20139,0.11735,-0.911583,1.0
2,0.0,0.0,0.0,-1.536794,0.297413,1.333053,2.527057,0.0
3,0.0,0.0,0.0,0.501521,0.009343,-1.225848,0.807737,0.0
4,0.0,1.0,0.0,2.063884,0.393436,0.785728,-0.911583,1.0


In [26]:
# train test split

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x2,y,test_size=0.2,random_state=1, stratify=y)

print(x2.shape, x_train.shape, x_test.shape)
print(y.shape, y_train.shape, y_test.shape)

(10000, 8) (8000, 8) (2000, 8)
(10000,) (8000,) (2000,)
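
In the cells above the `ColumnTransformer` is fitted on all 10,000 rows before the split, so the scaler's mean and standard deviation include the test rows. A common alternative is to split first and fit the transformer on the training rows only. The sketch below shows that order on made-up data; `toy_x`, `toy_y`, and `pre` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 200
toy_x = pd.DataFrame({
    "Geography": rng.choice(["France", "Germany", "Spain"], size=n),
    "Age": rng.integers(18, 80, size=n),
    "Balance": rng.uniform(0, 200_000, size=n),
})
toy_y = pd.Series(rng.integers(0, 2, size=n), name="Exited")

# 1) Split the raw features first.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)

pre = ColumnTransformer([
    ("encoder", OneHotEncoder(drop="first"), ["Geography"]),
    ("scaler", StandardScaler(), ["Age", "Balance"]),
])

# 2) Learn encoding and scaling statistics from the training rows only.
x_tr2 = pre.fit_transform(x_tr)
# 3) Reuse those statistics on the test rows.
x_te2 = pre.transform(x_te)

print(x_tr2.shape, x_te2.shape)
```

With this order the test set stays unseen until evaluation, which gives a more honest estimate of performance on new customers.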


## 6. Machine Learning Modelling

In [27]:
# using logistic regression
from sklearn.linear_model import LogisticRegression

# create a model object using class LogisticRegression
model = LogisticRegression()

# train the model using train set : x_train, and y_train
model.fit(x_train,y_train)

In [28]:
# assess the model using train set
y_pred_train = model.predict(x_train)
from sklearn import metrics

print("Accuracy is: ", metrics.accuracy_score(y_train,y_pred_train))
print("Precision is: ", metrics.precision_score(y_train,y_pred_train))
print("Recall is: ", metrics.recall_score(y_train,y_pred_train))
print("F1 Score is" , metrics.f1_score(y_train,y_pred_train))

Accuracy is:  0.8125
Precision is:  0.6132404181184669
Recall is:  0.21595092024539878
F1 Score is 0.3194192377495463
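
The four scores follow from the confusion-matrix counts. The counts below are not printed by the notebook; they are back-calculated from the reported training metrics (8000 rows, 1630 of them churners) and should be read as an inferred illustration of the formulas.

```python
# Counts inferred from the reported training metrics (an assumption, not notebook output).
tp, fp, fn, tn = 352, 222, 1278, 6148
total = tp + fp + fn + tn

accuracy = (tp + tn) / total                        # share of all predictions that are correct
precision = tp / (tp + fp)                          # share of predicted churners who churned
recall = tp / (tp + fn)                             # share of actual churners that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

# Accuracy of always predicting "does not churn".
baseline = (tn + fp) / total

print(accuracy, precision, recall, f1, baseline)
```

Under these counts the majority-class baseline is already about 0.796, so an accuracy of 0.8125 is only a small gain, while the recall shows that most churners are missed.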


In [29]:
# assess the model using test set
y_pred_test = model.predict(x_test)
from sklearn import metrics

print("Accuracy is: ", metrics.accuracy_score(y_test,y_pred_test))
print("Precision is: ", metrics.precision_score(y_test,y_pred_test))
print("Recall is: ", metrics.recall_score(y_test,y_pred_test))
print("F1 Score is" , metrics.f1_score(y_test,y_pred_test))

Accuracy is:  0.8095
Precision is:  0.5902777777777778
Recall is:  0.20884520884520885
F1 Score is 0.308529945553539
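
Recall of about 0.21 on both sets means the model finds only about one churner in five. One option worth trying in the optimization step is `class_weight="balanced"` in `LogisticRegression`, which weights errors on the minority class more heavily. The sketch below compares the two settings on synthetic imbalanced data from `make_classification`; it is a stand-in for the churn data, so the numbers will not match the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 80% negatives and 20% positives.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

recalls = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=weight, max_iter=1000)
    clf.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, clf.predict(X_te))

print(recalls)
```

Higher recall from class weighting usually comes with lower precision, so the choice depends on the relative cost of missing a churner versus contacting a customer who would have stayed.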
