### Project - Propensify
#### Submitted by - Gayathri Keerthivasagam
### Description:
This notebook details the end-to-end implementation of the propensity model used to identify potential customers.  
The comprehensive exploratory data analysis (EDA), the choice of resampling technique, and the selection of the best machine learning model for this marketing campaign are documented in a separate notebook, 'Machine_Learning_Model_Implementation.ipynb'.

### Process overview

1. Data Collection  
2. Data Cleaning and Preprocessing  
3. Feature Engineering and Selection  
4. Dealing with Imbalanced Data  
5. Model Selection  
6. Model Training and Evaluation  
7. Save the model as a pickle file  
8. Load the model and predict for the test dataset  
9. Append the predicted target  
10. Save the final test file


## Import necessary libraries

In [1]:
import pandas as pd
import numpy as np

#libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#libraries for preprocessing steps
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

#libraries for resampling the data
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

#libraries for machine learning models
from sklearn.ensemble import GradientBoostingClassifier

#libraries for cross validation,train test split,hyperparameter tuning
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

#libraries for performance metrics
from sklearn.metrics import accuracy_score,classification_report,precision_score,recall_score,f1_score,roc_auc_score
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay,roc_curve,RocCurveDisplay,log_loss

#libraries to regulate warnings
import warnings
warnings.filterwarnings("ignore")

## Read the dataset

In [2]:
train_df = pd.read_excel('train.xlsx')
test_df = pd.read_excel('test.xlsx')

### Drop unnecessary features

In [3]:
#Ignoring the additional columns available other than the columns mentioned in the project description.
#Dropping the 'profit' and 'id' features.
def train_exclude_features(train_df):
    train_df.drop(['profit','id'],axis=1,inplace=True)
    #Delete the last two rows as they contain only null values
    train_df = train_df.iloc[:-2].copy()
    return train_df
def test_exclude_features(test_df):
    test_df.drop(['id'],axis=1,inplace=True)
    return test_df

In [4]:
new_train_df = train_exclude_features(train_df)
new_test_df = test_exclude_features(test_df)

The additional 'profit' and 'id' features are dropped, as specified in the project description, and the last two all-null rows are removed from the training data.

### Null value Treatment 

In [5]:
def null_value_treatment(dataset):
    # Calculate mean age for each profession
    mean_age_by_profession = dataset.groupby('profession')['custAge'].mean()
    # Impute null values in 'custAge' based on mean age for each profession
    for profession,mean_age in mean_age_by_profession.items():
        dataset.loc[(dataset['profession'] == profession) & (dataset['custAge'].isnull()), 'custAge'] = mean_age
    # Mode imputation for categorical column 'day_of_week'
    missing_cat_column = ['day_of_week']
    mode_imputer = SimpleImputer(strategy='most_frequent')
    dataset[missing_cat_column] = mode_imputer.fit_transform(dataset[missing_cat_column])
    # Impute null values in 'schooling' with the mode value for each 'profession'.
    # mode() can return several values on a tie (or none), so keep only the first one.
    schooling_by_profession = dataset.groupby('profession')['schooling'].agg(
        lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    for profession, mode_value in schooling_by_profession.items():
        dataset.loc[(dataset['profession'] == profession) & (dataset['schooling'].isnull()), 'schooling'] = mode_value
    return dataset

In [6]:
new_train_df = null_value_treatment(new_train_df)
new_test_df = null_value_treatment(new_test_df)

The null values in the 'custAge', 'day_of_week' and 'schooling' features are now imputed.
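
The function above recomputes the per-profession statistics on whichever dataset it receives, so the test data is imputed from its own values. A minimal sketch of an alternative, assuming one prefers to learn the fill values on the training data and reuse them on the test data. The toy frames and the helper names `fit_group_fill` and `apply_group_fill` are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd

def fit_group_fill(df, group_col, value_col, how):
    """Learn one fill value per group (hypothetical helper)."""
    grouped = df.groupby(group_col)[value_col]
    if how == "mean":
        return grouped.mean()
    # Mode can be tied or empty: keep the first mode, or NaN if none exists
    return grouped.agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)

def apply_group_fill(df, group_col, value_col, fill_values):
    """Fill missing values with the per-group values learned earlier."""
    out = df.copy()
    fallback = out[group_col].map(fill_values)
    out[value_col] = out[value_col].fillna(fallback)
    return out

# Toy data standing in for the real train/test frames (assumption)
toy_train = pd.DataFrame({
    "profession": ["admin", "admin", "technician", "technician"],
    "custAge": [30.0, 40.0, 50.0, np.nan],
    "schooling": ["high.school", "high.school", None, "university.degree"],
})
toy_test = pd.DataFrame({
    "profession": ["admin", "technician"],
    "custAge": [np.nan, np.nan],
    "schooling": [None, None],
})

# Statistics come from the training data only
age_fill = fit_group_fill(toy_train, "profession", "custAge", "mean")
school_fill = fit_group_fill(toy_train, "profession", "schooling", "mode")

filled_test = apply_group_fill(toy_test, "profession", "custAge", age_fill)
filled_test = apply_group_fill(filled_test, "profession", "schooling", school_fill)
print(filled_test)
```

Keeping the "fit" and "apply" steps separate makes it explicit which dataset the statistics come from.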

### Remove Duplicates

In [7]:
def remove_duplicates(dataset):
    #checking for duplicates
    num_duplicates = dataset.duplicated().sum()
    print(num_duplicates)
    #Removing the duplicate records
    dataset.drop_duplicates(inplace=True)
    return dataset

In [8]:
new_train_df = remove_duplicates(new_train_df)

64


### Feature engineering

In [9]:
#Encoding the target variable for class 0 and class 1
new_train_df['responded'] = new_train_df['responded'].map(lambda x: 0 if x == 'no' else 1)

In [10]:
def feature_engineering(dataset):
    dataset.drop('pmonths',axis=1,inplace=True)
#Define conditions and choices for pdays
    conditions = [
        (dataset['pdays'] == 999),
        (dataset['pdays'] < 5),
        ((dataset['pdays'] >= 5) & (dataset['pdays'] <= 10)),
        ((dataset['pdays'] > 10) & (dataset['pdays'] != 999)) ]
    choices = ['not_contacted', 'less_than_5_days', '5_to_10 days', 'greater_than_10_days']
    # Create the 'pdays' column based on conditions
    dataset['pdays'] = np.select(conditions, choices, default='unknown')
    
    
# Define conditions and choices for pastEmail
    conditions_pastEmail = [
        (dataset['pastEmail'] == 0),
        (dataset['pastEmail'] < 10),
        (dataset['pastEmail'] >= 10) ]
    choices_pastEmail = ['no_email_sent', 'less_than_10', 'more_than_10']
    # Overwrite the 'pastEmail' column with its category based on conditions
    dataset['pastEmail'] = np.select(conditions_pastEmail, choices_pastEmail, default='unknown')
    

# Define conditions and choices for 'custAge'
    conditions_custAge = [
         (dataset['custAge'] <= 30),
         ((dataset['custAge'] > 30) & (dataset['custAge'] <= 45)),
         ((dataset['custAge'] > 45) & (dataset['custAge'] <= 60)),
         ((dataset['custAge'] > 60) & (dataset['custAge'] <= 75)),
         (dataset['custAge'] > 75) ]
    choices_custAge = ['below_30', '30-45', '45-60','60-75','above_75']
    # Create the 'custAge' column based on conditions
    dataset['custAge'] = np.select(conditions_custAge, choices_custAge, default='unknown')
    return dataset

In [11]:
new_train_df = feature_engineering(new_train_df)
new_test_df = feature_engineering(new_test_df)

Feature engineering is done: 'pmonths' is dropped, and 'custAge', 'pdays' and 'pastEmail' are binned into categories.
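
The age binning above can also be expressed with `pd.cut`. The following is only a sketch on a toy series (the sample ages are assumptions); it uses the same right-closed edges as the `np.select` conditions for 'custAge' and maps missing ages to 'unknown'.

```python
import numpy as np
import pandas as pd

# Toy ages, including boundary values and one missing value (assumption)
ages = pd.Series([22, 30, 31, 45, 60, 61, 80, np.nan])

# Same edges as the np.select conditions: <=30, (30,45], (45,60], (60,75], >75
age_bins = [-np.inf, 30, 45, 60, 75, np.inf]
age_labels = ["below_30", "30-45", "45-60", "60-75", "above_75"]

# pd.cut uses right-closed intervals by default, so 30 falls in 'below_30'
age_group = pd.cut(ages, bins=age_bins, labels=age_labels)

# Missing ages stay NaN after pd.cut; label them explicitly
age_group = age_group.astype(object).fillna("unknown")
print(age_group.tolist())
```

Listing the bin edges once keeps the boundaries in a single place, which can make them easier to audit than a list of boolean conditions.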

### Identifying numerical,categorical and binary features

In [12]:
#Find the numerical features in the dataset
numerical_columns = new_train_df.select_dtypes(include='number').columns
#Find the categorical features in the dataset
categorical_columns = new_train_df.drop(numerical_columns,axis=1).columns
#Identify the binary columns (numeric columns containing only 0 and 1)
binary_cols = []
for i in new_train_df.select_dtypes(include=['int', 'float']).columns:
    unique_values = new_train_df[i].unique()
    if np.isin(unique_values, [0, 1]).all():
        binary_cols.append(i)
#Exclude the binary columns from the numerical features
numerical_columns = [i for i in numerical_columns if i not in binary_cols]

In [13]:
#find the skewness of the numerical features

def skewness(dataset, numerical_columns):
    # Calculate skewness of each column
    skew_values = dataset[numerical_columns].skew()
    
    # Find columns with strong positive skewness (skew > 1)
    positive_skew_cols = skew_values[skew_values > 1].index.tolist()
    print(positive_skew_cols)
    
    # Apply log1p transformation to the positively skewed columns
    for col in positive_skew_cols:
        dataset[col] = np.log1p(dataset[col])
    
    return dataset
    

In [14]:
new_train_df = skewness(new_train_df,numerical_columns)
new_test_df = skewness(new_test_df,numerical_columns)

['campaign', 'previous']
['campaign', 'previous']
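
Here both datasets happen to flag the same columns, but the decision is made independently for each one. A hedged sketch of a variant that decides which columns to transform on the training data and applies the same list to the test data; the toy frames and the helper names `find_skewed_columns` and `log_transform` are hypothetical.

```python
import numpy as np
import pandas as pd

def find_skewed_columns(df, columns, threshold=1.0):
    """Return the columns whose skewness exceeds the threshold."""
    skew_values = df[columns].skew()
    return skew_values[skew_values > threshold].index.tolist()

def log_transform(df, columns):
    """Apply log1p to the given columns and return a new frame."""
    out = df.copy()
    for col in columns:
        out[col] = np.log1p(out[col])
    return out

# Toy data: 'campaign' has a long right tail in train, 'euribor3m' is symmetric
toy_train = pd.DataFrame({
    "campaign": [1, 1, 1, 2, 2, 3, 40],
    "euribor3m": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})
toy_test = pd.DataFrame({"campaign": [1, 2, 3], "euribor3m": [1.5, 2.5, 3.5]})

# Decide once, on the training data, which columns to transform
skewed = find_skewed_columns(toy_train, ["campaign", "euribor3m"])

# Apply the same decision to both datasets
train_t = log_transform(toy_train, skewed)
test_t = log_transform(toy_test, skewed)
print(skewed)
```

In this toy case the test sample of 'campaign' is not skewed on its own, so a per-dataset decision would leave it untransformed while the training column is transformed.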


### Feature Encoding and Scaling

In [15]:
def feature_encoding(dataset):
# Initialize LabelEncoder    
    label_encoder = LabelEncoder()
#specify the features that needs to be label encoded
    cat_cols1 = ['profession','schooling', 'month', 'day_of_week']
    for col in cat_cols1:
        dataset[col] = label_encoder.fit_transform(dataset[col])
#specify the features that needs to be one hot encoded        
    cat_cols2 = ['custAge','marital', 'default', 'housing', 'loan','contact', 'poutcome','pdays','pastEmail']
# Use pd.get_dummies() to one-hot encode the categorical columns
    encoded_features = pd.get_dummies(dataset[cat_cols2])
# Concatenate the original DataFrame with the encoded features along the columns axis
    dataset = pd.concat([dataset, encoded_features], axis=1)
# Drop the original categorical columns 
    dataset.drop(cat_cols2, axis=1, inplace=True) 

    return dataset

new_train_df = feature_encoding(new_train_df)
new_test_df = feature_encoding(new_test_df)
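
Because `pd.get_dummies` is called separately on each dataset, a category that is absent from one of them produces a different set of columns. A minimal sketch, on toy frames that are assumptions, of aligning the test columns to the training columns with `reindex`:

```python
import pandas as pd

# Toy data: 'divorced' and 'telephone' appear only in the training frame
toy_train = pd.DataFrame({
    "marital": ["married", "single", "divorced"],
    "contact": ["cellular", "telephone", "cellular"],
})
toy_test = pd.DataFrame({
    "marital": ["single", "married"],
    "contact": ["cellular", "cellular"],
})

cat_cols = ["marital", "contact"]
train_encoded = pd.get_dummies(toy_train, columns=cat_cols, dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=cat_cols, dtype=int)

# Align test to the training layout: categories missing from test become
# all-zero columns, and categories unseen in training are dropped
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_encoded)
```

The same idea applies to label encoding: an encoder fitted on the training data and reused on the test data gives each category the same integer in both.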

In [16]:
def feature_scaling(dataset, numerical_columns):
    sc_x = StandardScaler()
    dataset[numerical_columns] = sc_x.fit_transform(dataset[numerical_columns])
    return dataset

new_train_df = feature_scaling(new_train_df, numerical_columns)
new_test_df = feature_scaling(new_test_df, numerical_columns)
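
`feature_scaling` fits a new `StandardScaler` on each dataset, so train and test are standardized with different means and variances. A short sketch of the usual alternative, assuming a toy single-column frame: fit on the training data, then only transform the test data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for one numerical feature (assumption)
toy_train = pd.DataFrame({"euribor3m": [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({"euribor3m": [2.0, 5.0]})

scaler = StandardScaler()
train_scaled = toy_train.copy()
test_scaled = toy_test.copy()

# Learn mean and standard deviation from the training data only
train_scaled[["euribor3m"]] = scaler.fit_transform(toy_train[["euribor3m"]])
# Reuse the training statistics for the test data
test_scaled[["euribor3m"]] = scaler.transform(toy_test[["euribor3m"]])

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```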

### Split the dataset into features and target

In [17]:
#Split the features and target
X = new_train_df.drop('responded',axis= 1)
y = new_train_df['responded']

### Train the model

In [18]:
# Perform cross-validation with SMOTE technique
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=24)
smote = SMOTE(random_state=12)
gb_classifier = GradientBoostingClassifier(random_state=12)
pipeline = Pipeline([('smote', smote), ('gb_classifier', gb_classifier)])

# Define the parameter grid to search
parameters = {
    'gb_classifier__n_estimators': [100, 200, 300],
    'gb_classifier__learning_rate': [0.01, 0.1, 0.2],
    'gb_classifier__max_depth': [3, 5, 7]
}

grid_search = GridSearchCV(pipeline, 
                           param_grid = parameters, 
                           cv=skf, 
                           scoring='f1', 
                           n_jobs=-1)
grid_search.fit(X, y)

best_est = grid_search.best_estimator_
best_model = grid_search.best_estimator_['gb_classifier']
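
The metric functions imported at the top are not used in this notebook. As an illustration only, the sketch below scores a gradient boosting classifier on a stratified hold-out split. It uses synthetic data from `make_classification` (an assumption, not the campaign data) and leaves SMOTE out so that it depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 85% / 15%) as a stand-in for X and y
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Stratify so both splits keep the same class ratio
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

val_pred = model.predict(X_val)               # hard class labels
val_proba = model.predict_proba(X_val)[:, 1]  # probability of class 1

f1 = f1_score(y_val, val_pred)
auc = roc_auc_score(y_val, val_proba)
print(f"F1: {f1:.3f}  ROC AUC: {auc:.3f}")
```

On imbalanced data, F1 and ROC AUC are usually more informative than accuracy, which is consistent with the `scoring='f1'` setting in the grid search above.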


### Save the model

In [19]:
import pickle

# Save the model
with open('model.pkl', 'wb') as file:
    pickle.dump(best_model, file)

In [20]:
# Load the model
with open('model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
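
Only the fitted classifier is pickled here, so the preprocessing steps have to be repeated by hand before prediction. A hedged sketch of pickling a whole scikit-learn `Pipeline` instead and checking that the restored object predicts identically; the toy data, the step names, and the file name `toy_model.pkl` are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and a small pipeline that bundles scaling with the classifier
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("gb_classifier", GradientBoostingClassifier(n_estimators=20, random_state=0)),
])
toy_pipeline.fit(X_toy, y_toy)

# Round trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions
same_predictions = np.array_equal(
    toy_pipeline.predict(X_toy), restored.predict(X_toy)
)
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.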

### Predictions on the test dataset

In [21]:
# Make predictions on the test dataset
predictions = loaded_model.predict(new_test_df)
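
`predict` returns hard 0/1 labels. For a propensity model it can also be useful to keep the predicted probability and rank customers by it. The sketch below shows this on synthetic data; the data, the 0.5 threshold, and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: first 250 rows to fit, last 50 rows to score
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=1)
clf = GradientBoostingClassifier(n_estimators=30, random_state=1)
clf.fit(X_toy[:250], y_toy[:250])

# Column 1 of predict_proba is the probability of class 1 (a response)
scores = clf.predict_proba(X_toy[250:])[:, 1]

# Rank the scored rows from most to least likely to respond
ranked = (
    pd.DataFrame({"row": np.arange(250, 300), "propensity": scores})
    .sort_values("propensity", ascending=False)
    .reset_index(drop=True)
)

# Turn scores into labels with an explicit, adjustable threshold
threshold = 0.5
ranked["responded"] = np.where(ranked["propensity"] >= threshold, "yes", "no")
print(ranked.head())
```

Keeping the score alongside the label lets the campaign team pick a contact list of any size rather than relying on a fixed cut-off.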

In [22]:
test_predictions = test_df.copy()

### Predictions are added to the test dataset

In [23]:
# Add predictions to the test dataset
test_predictions['responded'] = predictions

In [24]:
test_predictions['responded'].value_counts()

responded
0    23347
1     9603
Name: count, dtype: int64

In [25]:
#Map the predicted labels back to 'yes' (1) and 'no' (0)
test_predictions['responded'] = test_predictions['responded'].replace({0: 'no', 1: 'yes'})

In [26]:
test_predictions.head()

Unnamed: 0,custAge,profession,marital,schooling,default,housing,loan,contact,month,day_of_week,...,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,pastEmail,responded
0,30-45,0,married,6,no,no,yes,cellular,9,4,...,not_contacted,0.693147,failure,-1.1,94.199,-37.5,0.886,4963.6,less_than_10,yes
1,30-45,7,married,3,no,no,no,cellular,9,3,...,less_than_5_days,0.693147,success,-3.4,92.379,-29.8,0.788,5017.5,less_than_10,yes
2,45-60,1,married,5,unknown,yes,no,cellular,6,2,...,not_contacted,0.693147,failure,-1.8,92.893,-46.2,1.327,5099.1,less_than_10,no
3,below_30,0,single,6,no,no,no,cellular,1,4,...,not_contacted,0.0,nonexistent,1.4,93.444,-36.1,4.964,5228.1,no_email_sent,no
4,30-45,7,divorced,3,no,yes,no,cellular,7,3,...,not_contacted,0.0,nonexistent,-0.1,93.2,-42.0,4.153,5195.8,no_email_sent,no


### Save the final test dataset

In [27]:
#Save the updated test dataset
test_predictions.to_excel('test_predictions.xlsx', index=False)

### Conclusion:
Using a gradient boosting classifier together with the SMOTE resampling technique, the model identifies potential customers for the insurance company with an accuracy of 0.87. This allows the company to target likely responders more effectively in its marketing campaign.
