# 0. The problem

### Context

Our client is an insurance company that provides health insurance to its customers. The company now needs a model capable of predicting whether last year's policyholders would also be interested in **Vehicle Insurance**, which the company also offers.

A prediction model will help the company target its communication strategy at the customers most likely to purchase vehicle insurance.

### Solution

Assuming the company does not have enough resources to contact every client in the database, a good strategy is to build a list of clients ordered by their propensity to be interested in Vehicle Insurance. This lets the company reach more potential buyers for the same effort than it would by picking clients at random from the list.

Let's say the company has a marketing budget to contact **25,000** people.

The goal is to use a Machine Learning model to rank the clients from most to least interested. With that ranked list, a Cumulative Gains Curve can be plotted to evaluate how effective the model is compared with random selection.
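
As a minimal sketch of the idea (made-up scores and labels, not data from this project), the cumulative gain at each list depth is the share of all interested customers captured so far when customers are contacted in descending order of score. The function name `cumulative_gain` and the toy values are illustrative assumptions.

```python
def cumulative_gain(scores, labels):
    """Return (fraction_contacted, fraction_of_positives_found) pairs.

    Customers are ranked by descending score; labels are 1 (interested) or 0.
    Assumes at least one positive label.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points, found = [], 0
    for position, (_, label) in enumerate(ranked, start=1):
        found += label
        points.append((position / len(ranked), found / total_positives))
    return points


# Toy data (hypothetical): 10 customers, 3 of them interested.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

gain_points = cumulative_gain(toy_scores, toy_labels)
for contacted, gain in gain_points:
    print(f"contacted {contacted:.0%} -> captured {gain:.0%} of interested customers")
```

A random ordering would, on average, capture interested customers in proportion to the share of the list contacted, which is the diagonal the gains curve is compared against.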

# 1. Data description

## 1.1. Imports

In [1]:
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid")
#sns.set_theme(style="darkgrid")
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import TargetEncoder, StandardScaler, RobustScaler

from sklearn.feature_selection import mutual_info_classif

from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score, ConfusionMatrixDisplay

#from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, QuantileTransformer, PowerTransformer, RobustScaler, TargetEncoder
#from sklearn.feature_selection import mutual_info_classif
#
#from sklearn.compose import ColumnTransformer
#from sklearn.pipeline import Pipeline
#
#from sklearn.neighbors import KNeighborsClassifier
#from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
#from sklearn.linear_model import LogisticRegression
#
#from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, cross_validate
#from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
#
import scikitplot as skplt

In [2]:
# Random seed
seed = 42

In [3]:
# Functions
def some_metrics(y_pred, y_true):
    accuracy = accuracy_score(y_pred=y_pred, y_true=y_true)
    precision = precision_score(y_pred=y_pred, y_true=y_true)
    recall = recall_score(y_pred=y_pred, y_true=y_true)
    f1 = f1_score(y_pred=y_pred, y_true=y_true)
    print(f'Accuracy: {100*accuracy:.4f}%')
    print(f'Precision: {100*precision:.4f}%')
    print(f'Recall: {100*recall:.4f}%')
    print(f'F1 score: {f1:.4f}')

## 1.2. Loading data

Data available at https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction

In [4]:
PATH = '../data/raw/train.csv'

df_raw = pd.read_csv(filepath_or_buffer=PATH)
df = df_raw.copy()

# 2. Exploratory Data Analysis (EDA)

## 2.1. Data description

In [None]:
df.head()

Columns description:

* **id**                      Unique ID for the customer  
* **Gender**                  Gender of the customer  
* **Age**                     Age of the customer  
* **Driving_License**         0 : Customer does not have DL, 1 : Customer already has DL  
* **Region_Code** 	        Unique code for the region of the customer  
* **Previously_Insured**	    1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance  
* **Vehicle_Age** 	        Age of the Vehicle  
* **Vehicle_Damage** 	        1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn't get his/her vehicle damaged in the past.  
* **Annual_Premium** 	        The amount customer needs to pay as premium in the year  
* **Policy_Sales_Channel** 	Anonymized Code for the channel of outreaching to the customer ie. Different Agents, Over Mail, Over Phone, In Person, etc.  
* **Vintage** 	            Number of Days which customer has been associated with the company  
* **Response** 	            1 : Customer is interested, 0 : Customer is not interested

* Currency: Indian Rupee (Rs)

In [6]:
#df_train.columns = df_train.columns.str.lower()

#### Shape

In [None]:
print(f'Number of rows: {df.shape[0]}')
print(f'Number of columns: {df.shape[1]}')

In [8]:
## id column has no importance and can be removed
#df_train.drop(columns=['id'], inplace=True)

#### Types

In [None]:
df.info()

**Summary**
- Categorical variables:
    - gender (object)
    - driving license (int64)
    - previously insured (int64)
    - region code (float64)
    - policy sales channel (float64)
    - vehicle age (object)
    - vehicle damage (object)
    - response (int64)
- Numerical variables:
    - age
    - annual premium
    - vintage

#### Transform type of some categorical features

In [10]:
#df['Driving_License'] = df['Driving_License'].astype('category')
#df['Previously_Insured'] = df['Previously_Insured'].astype('category')
#df['Region_Code'] = df['Region_Code'].astype('category')
#df['Policy_Sales_Channel'] = df['Policy_Sales_Channel'].astype('category')

#### Missing values
-> No missing values

In [None]:
df.isna().sum()

#### Duplicated
-> The number of duplicates is low, so they were removed with no further investigation

In [None]:
df.duplicated().sum()

In [13]:
#df_train.drop_duplicates(inplace=True)

#### Split data into train and validation

In [14]:
df_train, df_valid = train_test_split(df, train_size=0.7, stratify=df['Response'], random_state=seed)

#### Target variable
-> Unbalanced target

In [None]:
sns.countplot(data=df_train, x=df_train['Response'])

In [None]:
print(f'Total of interested: {df_train["Response"].value_counts(normalize=True)[1]*100:.2f}%')
print(f'Total of not interested: {df_train["Response"].value_counts(normalize=True)[0]*100:.2f}%')

#### Numerical variables

In [None]:
num_columns = df_train[["Age", "Annual_Premium", "Vintage"]].columns.tolist()
#num_columns = df_train.select_dtypes(exclude=['object', 'category']).columns.tolist()
#num_columns.pop(0)
num_columns

In [None]:
df_train[num_columns].describe()

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
sns.histplot(data=df_train, x='Age', hue='Response', bins=50, ax=ax[0][0])
sns.histplot(data=df_train, x='Annual_Premium', bins=50, hue='Response', ax=ax[0][1])
sns.histplot(data=df_train, x='Vintage', hue='Response', bins=50, ax=ax[1][0])
sns.histplot(data=df_train, x='Region_Code', hue='Response', bins=50, ax=ax[1][1]);

#### Categorical variables

In [None]:
#cat_columns = df_train.select_dtypes(include=['object', 'category']).columns.tolist()
cat_columns = df_train.drop(columns=num_columns).columns.tolist()
cat_columns.pop(0)
cat_columns

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
sns.countplot(data=df_train, x='Gender', hue='Response', ax=ax[0][0])
sns.countplot(data=df_train, x='Vehicle_Age', hue='Response', ax=ax[0][1])
sns.countplot(data=df_train, x='Vehicle_Damage', hue='Response', ax=ax[1][0])
sns.countplot(data=df_train, x='Driving_License', hue='Response', ax=ax[1][1])

## 2.2. Hypothesis

#### **H1**: Individuals between 30 and 50 years old would be more likely to purchase a vehicle insurance.
-> True

In [22]:
between_30_50 = df_train.query('Age >= 30 & Age <= 50 & Response == 1').shape[0]
below_30 = df_train.query('Age < 30 and Response == 1').shape[0]
over_50 = df_train.query('Age > 50 and Response == 1').shape[0]

In [None]:
aux1 = pd.DataFrame({'below_30': [below_30],
                     'between_30_50': [between_30_50],
                     'over_50': [over_50]})
aux1

In [None]:
sns.barplot(data=aux1)
plt.title('Purchasing propensity by age group');

#### **H2**: Women would be more interested in having vehicle insurance.
-> False. 10.4% of women would purchase, compared to 13.8% of men.

In [None]:
sns.countplot(data=df_train, x='Gender')
plt.title('Entries by gender');

In [None]:
# Result given in proportion by gender
gender_count = pd.crosstab(df_train['Response'], df_train['Gender'], normalize='columns')
gender_count

In [None]:
fig, ax = plt.subplots(figsize=(8,5))
gender_count.plot(kind='bar', stacked=True, ax=ax)
plt.xticks(rotation=0)
plt.title('Interest by gender')

#### **H3**: Individuals who already have a driver's license and had their vehicle damaged in the past would be more interested in vehicle insurance.
-> False.

In [28]:
aux3 = df_train.query('Driving_License == 1 & Vehicle_Damage == "Yes"')['Response'].value_counts()

In [None]:
sns.barplot(data=aux3)
plt.title('Purchasing propensity among customers with Driving_License = 1 and Vehicle_Damage = Yes')

#### **H4**: Individuals who already have vehicle insurance (previously insured) would not be interested in vehicle insurance.
-> True. 99.91% of those who already have insurance would not purchase another one.

In [None]:
pd.crosstab(index=df_train['Response'], columns=df_train['Previously_Insured'], normalize='columns')

#### **H5**: Individuals whose vehicle was damaged and who were not previously insured would be more interested in vehicle insurance.
-> False. Even when not insured, customers whose vehicle was damaged would not purchase.

In [None]:
aux5 = df_train.query('Vehicle_Damage == "Yes" & Previously_Insured == 0')['Response'].value_counts()
sns.barplot(data=aux5)

#### **H6**: Individuals who own a vehicle more than two years old would be more interested in vehicle insurance.
-> False. Owners of vehicles between 1 and 2 years old are the most interested.

In [None]:
# Use the training split only, as in the rest of the EDA
aux7 = pd.crosstab(index=df_train['Response'], columns=df_train['Vehicle_Age'])
aux7

In [None]:
fig, ax = plt.subplots(figsize=(8,5))
aux7.plot(kind='bar', stacked=True, ax=ax)
plt.xticks(rotation=0)
plt.title('Propensity by vehicle age');

In [None]:
fig, ax = plt.subplots(1,3, figsize=(18,5))
sns.countplot(data=df_train, x=df_train['Gender'], hue=df_train['Response'], ax=ax[0])
sns.countplot(data=df_train, x=df_train['Vehicle_Age'], hue=df_train['Response'], ax=ax[1])
sns.countplot(data=df_train, x=df_train['Vehicle_Damage'], hue=df_train['Response'], ax=ax[2])

#### Vehicle Damage = No --> almost nobody is interested

#### Policy sales channel

In [None]:
aux = df_train.groupby('Policy_Sales_Channel')['Response'].sum().reset_index()
aux

In [None]:
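plt.figure()
ax = sns.histplot(data=df_train, x='Policy_Sales_Channel')
# Overlay the number of interested customers per channel, aligned on the channel code
ax.plot(aux['Policy_Sales_Channel'], aux['Response'])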

# 3. Feature Engineering

In [None]:
df_train

## 3.1. Data preparation

In [38]:
# Transform vehicle_damage to binary
df_train['Vehicle_Damage'] = df_train['Vehicle_Damage'].apply(lambda x: 0 if x=='No' else 1)

In [39]:
# Mapping gender --> Male:0, Female:1
gender_map = {"Male": 0, "Female": 1}
df_train["Gender"] = df_train["Gender"].map(gender_map)

In [40]:
# Rewrite vehicle age
#age_dict = {'1-2 Year': 'between_1_2',
#            '< 1 Year': 'below_1',
#            '> 2 Years': 'over_2'}

age_dict = {'< 1 Year': 1,
            '1-2 Year': 2,
            '> 2 Years': 3}

df_train['Vehicle_Age'] = df_train['Vehicle_Age'].map(age_dict)

In [None]:
df_train.info()

In [42]:
# Transform type of some categorical features
df_train["Gender"] = df_train["Gender"].astype("category")
df_train['Driving_License'] = df_train['Driving_License'].astype('category')
df_train['Previously_Insured'] = df_train['Previously_Insured'].astype('category')
df_train["Vehicle_Damage"] = df_train["Vehicle_Damage"].astype("category")
df_train['Region_Code'] = df_train['Region_Code'].astype('category')
df_train['Policy_Sales_Channel'] = df_train['Policy_Sales_Channel'].astype('category')
df_train["Response"] = df_train["Response"].astype("category")

In [None]:
X_train = df_train.drop(columns="Response").copy()

y_train = df_train['Response']

print(f'Training dataframe shape: {df_train.shape}')

In [None]:
X_train.head()

In [45]:
# Make column names lowercase
X_train.columns = X_train.columns.str.lower()

y_train.name = y_train.name.lower()

In [46]:
num_columns = X_train.select_dtypes(exclude=['object', 'category']).columns.to_list()
cat_columns = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

In [None]:
num_columns

In [None]:
cat_columns

In [None]:
X_train.nunique()

In [50]:
#df_train.dropna(inplace=True)
#df_train.drop_duplicates(inplace=True)

## 3.3. Encoding

### 3.3.1. One hot encode

In [51]:
## General function for One Hot Encoder
#def one_hot_encoder(df_to_encode, feature_to_encode):
#    encoder = OneHotEncoder(drop='if_binary')
#    new_features = encoder.fit_transform(df_to_encode[feature_to_encode]).toarray()
#    df_to_encode[encoder.get_feature_names_out()] = new_features
#    df_to_encode.drop(columns=encoder.feature_names_in_[0], inplace=True)
#    return df_to_encode, encoder

In [52]:
# Gender ---> OBS: Test dummy encoding
#X_train, encoding_gender = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['gender'])

In [53]:
# Driving license
#X_train, encoding_license = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['driving_license'])

In [54]:
# Previously insured
#X_train, encoding_insured = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['previously_insured'])

In [55]:
# Vehicle damage
#X_train, encoding_damage = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['vehicle_damage'])

In [56]:
# Vehicle age
# NOTE: Try Ordinal Encoding or Target Encoding
#X_train, encoding_v_age = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['vehicle_age'])

### 3.3.2. Target encode

In [57]:
# Region code
tar_enc_reg_code = TargetEncoder(random_state=seed)  # fixed seed so the internal cross-fitting is reproducible
X_train['region_code'] = tar_enc_reg_code.fit_transform(X=X_train[['region_code']], y=y_train)

In [58]:
# Policy sales channel
tar_enc_pol_sales = TargetEncoder(random_state=seed)  # fixed seed so the internal cross-fitting is reproducible
X_train['policy_sales_channel'] = tar_enc_pol_sales.fit_transform(X=X_train[['policy_sales_channel']], y=y_train)

In [None]:
X_train
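
To make the encoding step concrete, here is a rough sketch of the idea behind target encoding: each category is replaced by a smoothed average of the target within that category. This is a simplified illustration in plain Python, not scikit-learn's implementation. `TargetEncoder` picks its smoothing differently by default and uses internal cross-fitting in `fit_transform`, which is why the training set goes through `fit_transform` and the validation set through `transform`. The function names, the `smoothing` parameter and the toy data are assumptions.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the binary target.

    encoding = (sum_of_targets_in_cat + smoothing * global_mean) / (n_cat + smoothing)
    Rare categories are pulled towards the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [encoding.get(category, global_mean) for category in categories]


# Hypothetical region codes and responses.
regions = [28, 28, 28, 28, 8, 8, 41, 41]
responses = [1, 1, 0, 0, 0, 0, 1, 0]

encoding, global_mean = fit_target_encoding(regions, responses, smoothing=2.0)
encoded_new = apply_target_encoding([28, 8, 99], encoding, global_mean)
print(encoding)
print(encoded_new)
```

Region `99` never appears in the toy training data, so it receives the global mean rather than raising an error.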

## 3.4. Rescaling

In [None]:
X_train[num_columns].hist(bins=50, figsize=(16,8));

### Vintage - MinMax scaler, Standard scaler, Quantile transform

In [61]:
#std_vintage = MinMaxScaler()
std_vintage = StandardScaler()
#std_vintage = QuantileTransformer()
#std_vintage = PowerTransformer(method='box-cox')
#std_vintage = RobustScaler()

new_vintage = std_vintage.fit_transform(X_train[['vintage']])
X_train['vintage'] = new_vintage

### Age - MinMax scaler, Standard scaler, Box-Cox or Quantile transform

In [62]:
#std_age = MinMaxScaler()
std_age = StandardScaler()
#std_age = QuantileTransformer()
#std_age = PowerTransformer(method='box-cox')
#std_age = RobustScaler()

#aux1 = X_train[['age']].transform(np.log1p)
#new_age = std_age.fit_transform(aux1)
new_age = std_age.fit_transform(X_train[['age']])
X_train['age'] = new_age

### Annual premium - Standard scaler, Robust scaler, Box-Cox or Quantile transform

In [None]:
#std_anual_pr = MinMaxScaler()
std_anual_pr = StandardScaler()
#std_anual_pr = QuantileTransformer()
#std_anual_pr = PowerTransformer(method='box-cox')
#std_anual_pr = RobustScaler()

aux1 = X_train[['annual_premium']].transform(np.log1p)
new_anual_pr = std_anual_pr.fit_transform(aux1)
#new_anual_pr = std_anual_pr.fit_transform(X_train[['annual_premium']])
X_train['annual_premium'] = new_anual_pr

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
sns.histplot(data=df_train['Annual_Premium'], bins=50, ax=ax[0])
sns.histplot(data=new_anual_pr, bins=50, ax=ax[1])
ax[0].set_title('No scaling')
ax[1].set_title('log1p + standard scaling')
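
The annual premium transformation can be sketched without scikit-learn: `log1p` compresses the long right tail, then standardization subtracts the mean and divides by the standard deviation. The statistics are learned from the training values only and reused on new data, mirroring `fit_transform` on train and `transform` on validation. The helper names and the premium values below are hypothetical.

```python
import math
import statistics


def fit_log_standardizer(values):
    """Learn mean and population std of log1p(values) from training data."""
    logged = [math.log1p(v) for v in values]
    # Population std (ddof=0), the convention StandardScaler also follows
    return statistics.fmean(logged), statistics.pstdev(logged)


def apply_log_standardizer(values, mean, std):
    """Apply log1p followed by z-scoring with previously learned statistics."""
    return [(math.log1p(v) - mean) / std for v in values]


# Hypothetical annual premiums (Rs) with one large outlier.
train_premiums = [2630.0, 25000.0, 31000.0, 40000.0, 540000.0]
valid_premiums = [2630.0, 35000.0]

mean, std = fit_log_standardizer(train_premiums)
train_scaled = apply_log_standardizer(train_premiums, mean, std)
valid_scaled = apply_log_standardizer(valid_premiums, mean, std)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in valid_scaled])
```

The validation values are scaled with the training mean and standard deviation, so they are not guaranteed to have zero mean or unit variance themselves.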

## 3.5. Validation dataframe

In [64]:
# Transform vehicle_damage to binary
df_valid['Vehicle_Damage'] = df_valid['Vehicle_Damage'].apply(lambda x: 0 if x=='No' else 1)

# Mapping gender --> Male:0, Female:1
gender_map = {"Male": 0, "Female": 1}
df_valid["Gender"] = df_valid["Gender"].map(gender_map)

# Rewrite vehicle age
age_dict = {'< 1 Year': 1,
            '1-2 Year': 2,
            '> 2 Years': 3}
df_valid['Vehicle_Age'] = df_valid['Vehicle_Age'].map(age_dict)

# Transform type of some categorical features
df_valid["Gender"] = df_valid["Gender"].astype("category")
df_valid['Driving_License'] = df_valid['Driving_License'].astype('category')
df_valid['Previously_Insured'] = df_valid['Previously_Insured'].astype('category')
df_valid["Vehicle_Damage"] = df_valid["Vehicle_Damage"].astype("category")
df_valid['Region_Code'] = df_valid['Region_Code'].astype('category')
df_valid['Policy_Sales_Channel'] = df_valid['Policy_Sales_Channel'].astype('category')
df_valid["Response"] = df_valid["Response"].astype("category")

# id column has no importance and can be removed
#df_valid.drop(columns=['id'], inplace=True)
X_valid = df_valid.drop(columns="Response").copy()
y_valid = df_valid['Response']

# Make column names lowercase
X_valid.columns = X_valid.columns.str.lower()
y_valid.name = y_valid.name.lower()

In [65]:
#X_valid[encoding_gender.get_feature_names_out()] =encoding_gender.transform(X=X_valid[['gender']]).toarray()
#X_valid.drop(columns=encoding_gender.feature_names_in_[0], inplace=True)
#
#X_valid[encoding_license.get_feature_names_out()] =encoding_license.transform(X=X_valid[['driving_license']]).toarray()
#X_valid.drop(columns=encoding_license.feature_names_in_[0], inplace=True)
#
#X_valid[encoding_insured.get_feature_names_out()] =encoding_insured.transform(X=X_valid[['previously_insured']]).toarray()
#X_valid.drop(columns=encoding_insured.feature_names_in_[0], inplace=True)
#
#X_valid[encoding_damage.get_feature_names_out()] =encoding_damage.transform(X=X_valid[['vehicle_damage']]).toarray()
#X_valid.drop(columns=encoding_damage.feature_names_in_[0], inplace=True)
#
#X_valid[encoding_v_age.get_feature_names_out()] =encoding_v_age.transform(X=X_valid[['vehicle_age']]).toarray()
#X_valid.drop(columns=encoding_v_age.feature_names_in_[0], inplace=True)

In [66]:
X_valid['region_code'] = tar_enc_reg_code.transform(X=X_valid[['region_code']])

X_valid['policy_sales_channel'] = tar_enc_pol_sales.transform(X=X_valid[['policy_sales_channel']])

In [67]:
X_valid['age'] = std_age.transform(X=X_valid[['age']])

X_valid['vintage'] = std_vintage.transform(X=X_valid[['vintage']])

X_valid['annual_premium'] = std_anual_pr.transform(X=X_valid[['annual_premium']].transform(np.log1p))

In [68]:
X_train_no_id = X_train.drop(columns=["id"])
X_valid_no_id = X_valid.drop(columns=["id"])

## 3.6. Feature selection

### 3.6.1. Correlation matrix

In [69]:
aux = X_train.drop(columns=["id"])
aux = aux.join(y_train)

In [None]:
corr_matrix = aux.corr()
plt.figure(figsize=(12,7))
sns.heatmap(data=corr_matrix, annot=True)

In [None]:
corr_matrix["response"].sort_values(ascending=False)

### 3.6.2. Tree-based

In [72]:
feat_imp_clf = RandomForestClassifier(n_estimators=400, random_state=seed, n_jobs=-1)
feat_imp_clf.fit(X=X_train_no_id, y=y_train)

tree_based_score = pd.DataFrame(data=feat_imp_clf.feature_importances_, index=X_train_no_id.columns,
                                columns=['tree_based']).sort_values(by='tree_based', ascending=False)

In [None]:
tree_based_score

In [None]:
sns.catplot(data=tree_based_score, y=tree_based_score.index, x='tree_based', kind='bar', aspect=2)

### 3.6.3. Mutual information

In [75]:
mi_score = mutual_info_classif(X=X_train_no_id, y=y_train, random_state=seed, n_jobs=-1)
mi_score = pd.DataFrame(data=mi_score, index=X_train_no_id.columns, columns=['mi_score']).sort_values(by='mi_score', ascending=False)

In [None]:
sns.catplot(data=mi_score, y=mi_score.index, x='mi_score', kind='bar', aspect=2)
plt.grid()
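
For intuition, mutual information between two discrete variables can be computed directly from their joint frequencies, as in the sketch below. This is only an illustration with toy data: for dense input, `mutual_info_classif` treats features as continuous unless told otherwise and relies on a nearest-neighbor estimator, so its values will not match a plain frequency count. The function name and the toy columns are assumptions.

```python
import math
from collections import Counter


def discrete_mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences.

    MI = sum over (a, b) of p(a, b) * log(p(a, b) / (p(a) * p(b)))
    """
    n = len(x)
    joint = Counter(zip(x, y))
    count_x, count_y = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p_ab / (p_a * p_b) simplifies to n_ab * n / (n_a * n_b)
        mi += (n_ab / n) * math.log(n_ab * n / (count_x[a] * count_y[b]))
    return mi


# Toy columns (hypothetical): one informative feature, one uninformative.
response           = [0, 0, 0, 0, 1, 1, 1, 1]
previously_insured = [1, 1, 1, 1, 0, 0, 0, 0]   # perfectly determines response
gender             = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of response

mi_insured = discrete_mutual_information(previously_insured, response)
mi_gender = discrete_mutual_information(gender, response)
print(f"previously_insured: {mi_insured:.4f} nats")
print(f"gender:             {mi_gender:.4f} nats")
```

A feature that fully determines a balanced binary target reaches the maximum of log(2) nats, while an independent feature scores zero.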

# 4. Machine Learning Modeling

**Metrics**:
- Be careful with accuracy because the target classes are unbalanced: far more customers are not interested in car insurance than are interested
- It is important to identify as many customers as possible who are interested in car insurance, even if the model makes some mistakes ==> Recall
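
A small worked example shows why accuracy is misleading here. Assuming 12% of customers are interested (an illustrative figure, not a result from this notebook), a model that predicts "not interested" for everyone scores high accuracy while finding no interested customer at all. The function name and both toy models are hypothetical.

```python
def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for binary labels (1 = interested)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    return correct / len(y_true), true_positives / positives


# Hypothetical validation set: 12 interested customers out of 100.
y_true = [1] * 12 + [0] * 88

# Model A never predicts the positive class.
always_no = [0] * 100
# Model B finds 9 of the 12 interested customers at the price of 20 false alarms.
some_model = [1] * 9 + [0] * 3 + [1] * 20 + [0] * 68

acc_a, rec_a = accuracy_and_recall(y_true, always_no)
acc_b, rec_b = accuracy_and_recall(y_true, some_model)
print(f"Always 'no': accuracy={acc_a:.2f}, recall={rec_a:.2f}")
print(f"Model B:     accuracy={acc_b:.2f}, recall={rec_b:.2f}")
```

Model B has lower accuracy than the trivial model but is far more useful for building a contact list.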

## 4.1. Using all features

In [None]:
features = X_train_no_id.columns.to_list()
features

### 4.1.1. Training models

#### Select k-best features

In [78]:
def kbest_vs_score(model):
    k_vs_score = []
    for k in range(2, len(X_train_no_id.columns)+1):
        selector = SelectKBest(score_func=f_classif, k=k)

        X_train_2 = selector.fit_transform(X=X_train_no_id, y=y_train)
        X_valid_2 = selector.transform(X=X_valid_no_id)

        model_clf = model
        model_clf.fit(X=X_train_2, y=y_train)
        y_pred_model = model_clf.predict(X=X_valid_2)
        y_pred_proba_model = model_clf.predict_proba(X=X_valid_2)
        roc_model = roc_auc_score(y_score=y_pred_proba_model[:,1], y_true=y_valid)
        accuracy = accuracy_score(y_pred=y_pred_model, y_true=y_valid)
        precision = precision_score(y_true=y_valid, y_pred=y_pred_model)
        recall = recall_score(y_true=y_valid, y_pred=y_pred_model)

        k_vs_score.append(recall)

        print(f"Number of features: {k}")
        print(f"Accuracy: {accuracy}")
        print(f"Precision: {precision}")
        print(f"Recall: {recall}")
        print(f"ROC: {roc_model}")
        print(30*"-")

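    # Plot recall against the number of selected features once every k has been evaluated
    sns.lineplot(x=list(range(2, len(X_train_no_id.columns)+1)), y=k_vs_score)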

#### Logistic Regression

In [None]:
#log_reg_clf = LogisticRegression(class_weight='balanced', random_state=seed)
#kbest_vs_score(model=log_reg_clf)

In [78]:
log_reg_clf = LogisticRegression(class_weight='balanced', random_state=seed)
log_reg_clf.fit(X=X_train[features], y=y_train)
y_pred_log_reg = log_reg_clf.predict(X=X_valid[features])
y_pred_proba_log_reg = log_reg_clf.predict_proba(X=X_valid[features])
roc_log_reg = roc_auc_score(y_score=y_pred_proba_log_reg[:,1], y_true=y_valid)

#### Random Forest

In [81]:
#rf_clf = RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=42)
#kbest_vs_score(model=rf_clf)

In [79]:
rf_clf = RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=42)
rf_clf.fit(X=X_train[features], y=y_train)
y_pred_rf = rf_clf.predict(X=X_valid[features])
y_pred_proba_rf = rf_clf.predict_proba(X=X_valid[features])
roc_rf = roc_auc_score(y_score=y_pred_proba_rf[:,1], y_true=y_valid)

#### HGBoosting

In [84]:
#hgb_clf = HistGradientBoostingClassifier(random_state=42)
#kbest_vs_score(model=hgb_clf)

In [None]:
hgb_clf = HistGradientBoostingClassifier(random_state=42)
hgb_clf.fit(X=X_train[features], y=y_train)
y_pred_hgb = hgb_clf.predict(X=X_valid[features])
y_pred_proba_hgb = hgb_clf.predict_proba(X=X_valid[features])
roc_hgb = roc_auc_score(y_score=y_pred_proba_hgb[:,1], y_true=y_valid)

## 4.2. Results

In [81]:
def metrics(models):

    results = {'Model': [],
               'Accuracy': [],
               'Precision': [],
               'Recall': [],
               'ROC_AUC': []}

    for name, pred in models.items():
        results['Model'].append(name)
        results['Accuracy'].append(accuracy_score(y_pred=pred[0], y_true=y_valid))
        results['Precision'].append(precision_score(y_pred=pred[0], y_true=y_valid))
        results['Recall'].append(recall_score(y_pred=pred[0], y_true=y_valid))
        results["ROC_AUC"].append(pred[1])

    results = pd.DataFrame(results).set_index('Model')
    results.index.names = [None]
    return pd.DataFrame(results)

In [82]:
models = {'Logistic Regression': [y_pred_log_reg, roc_log_reg],
          'Random Forest': [y_pred_rf, roc_rf],
          'HGBoost': [y_pred_hgb, roc_hgb]}

In [None]:
results = metrics(models)
results.style.highlight_max(color='green', axis=0)

#### Confusion matrix

In [None]:
# One panel per trained model (KNN is not trained in this notebook)
fig, ax = plt.subplots(1,3, figsize=(15, 5))
fig.suptitle('Confusion Matrix')
ConfusionMatrixDisplay.from_predictions(y_true=y_valid, y_pred=y_pred_log_reg, labels=log_reg_clf.classes_, ax=ax[0], cmap='Blues',colorbar=False)
ConfusionMatrixDisplay.from_predictions(y_true=y_valid, y_pred=y_pred_rf, labels=rf_clf.classes_, ax=ax[1], cmap='Blues',colorbar=False)
ConfusionMatrixDisplay.from_predictions(y_true=y_valid, y_pred=y_pred_hgb, labels=hgb_clf.classes_, ax=ax[2], cmap='Blues',colorbar=False)
ax[0].set_title('Logistic Regression')
ax[1].set_title('Random Forest')
ax[2].set_title('HGBoosting')
for axes in ax.flatten():
    axes.grid(False)

#### Metrics @ K

In [108]:
# Propensity score - probability of each customer being interested in car insurance
aux = pd.concat([X_valid, y_valid], axis=1)
#aux['score'] = y_pred_proba_log_reg[:, 1].tolist()
aux['score'] = y_pred_proba_hgb[:, 1].tolist()

# Sort clients by propensity score
aux = aux.sort_values('score', ascending=False)

aux = aux[["id", 'response', 'score']].reset_index(drop=True)

In [None]:
K = 25000                   # number of calls
n = K / X_valid.shape[0]      # fraction of the validation database covered by the calls

print(f'K = {K} represents {n * 100:.2f}% of the validation database')

# Precision Top K (the top K customers sit at 0-based positions 0..K-1)
aux['precision_at_k'] = aux['response'].astype(int).cumsum() / (aux.index.values + 1)
precision_at_k = aux.loc[K - 1, 'precision_at_k']
print(f'Precision @ K: {precision_at_k * 100:.2f}%')

# Recall Top K
aux['recall_at_k'] = aux['response'].astype(int).cumsum() / aux['response'].astype(int).sum()
recall_at_k = aux.loc[K - 1, 'recall_at_k']
print(f'Recall @ K: {recall_at_k * 100:.2f}%')
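
Both metrics can be checked by hand on a tiny ranked list. In the sketch below (toy labels, hypothetical function names), the list is already sorted by descending score, and the top K customers are the first K entries, that is, 0-based positions 0 to K-1.

```python
def precision_at_k(ranked_labels, k):
    """Share of the top-k ranked customers who are actually interested."""
    return sum(ranked_labels[:k]) / k


def recall_at_k(ranked_labels, k):
    """Share of all interested customers that appear in the top k."""
    return sum(ranked_labels[:k]) / sum(ranked_labels)


# Toy labels, already sorted by descending propensity score (1 = interested).
ranked_labels = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
k = 4

p_at_k = precision_at_k(ranked_labels, k)
r_at_k = recall_at_k(ranked_labels, k)
print(f"Precision @ {k}: {p_at_k:.2f}")
print(f"Recall @ {k}: {r_at_k:.2f}")
```

Three of the first four customers are interested (precision 0.75), and those three are 60% of the five interested customers in the list (recall 0.60).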

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(14,5))

# Gains curve --> Recall (gain) as a function of percentage of sample
# Uses the HGBoosting probabilities, the same model behind the propensity score and recall_at_k
skplt.metrics.plot_cumulative_gain(y_probas=y_pred_proba_hgb, y_true=y_valid, ax=ax[0])
ax[0].vlines(x=n, ymin=0, ymax=recall_at_k, linestyles='dashed', colors='purple')
ax[0].hlines(y=recall_at_k, xmin=0, xmax=n, linestyles='dashed', colors='purple')
ax[0].hlines(y=n, xmin=0, xmax=n, linestyles='dashed', colors='purple')

# Lift curve --> Lift is the ratio between the model's gain and the gain expected from random selection, as a function of the percentage of the sample
skplt.metrics.plot_lift_curve(y_probas=y_pred_proba_hgb, y_true=y_valid, ax=ax[1])
ax[1].vlines(x=n, ymin=0, ymax=recall_at_k/n, linestyles='dashed', color='purple')
ax[1].hlines(y=recall_at_k/n, xmin=0, xmax=n, linestyles='dashed', color='purple')
ax[1].set_xlim(0.05, 1)
ax[1].set_ylim(0.5, 3.5)

#### Baseline vs. ML Model

In [None]:
total_class_1 = y_valid.value_counts()[1]

random_selection = df_valid[["id", "Response"]].sample(n=K, random_state=seed)
total_random = random_selection["Response"].value_counts()[1]

total_ml_model = aux.iloc[:K, 0:2]["response"].value_counts()[1]

avg_insurance = 1000.00     # hypothetical annual average cost of vehicle insurance in US$
cost_per_contact = 1.00     # hypothetical cost per contact in US$

# Both strategies contact K customers, so the contact cost is the same
random_cost = K * cost_per_contact
ML_cost = K * cost_per_contact

random_revenue = avg_insurance * total_random
ML_revenue = avg_insurance * total_ml_model

summary = {'Recall': [n, recall_at_k],
           'Number of interested': [total_random, total_ml_model],
           'Contacts per sale': [f'{K / total_random:.2f}', f'{K / total_ml_model:.2f}'],
           'Cost of contacts (US$)': [random_cost, ML_cost],
           'Revenue (US$)': [random_revenue, ML_revenue],
           'Profit (US$)': [random_revenue - random_cost, ML_revenue - ML_cost],
           'Gain (%)': ['-', f'{((ML_revenue-random_revenue)/random_revenue)*100:.2f}']}

df_summary = pd.DataFrame(index=['Random', 'ML model'], data=summary)
df_summary
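
The comparison can be summarized as a small function. The sketch below assumes every contacted customer costs the same amount whether or not a sale follows, and that each interested customer contacted yields one sale. The function name and all figures are hypothetical.

```python
def campaign_summary(n_contacts, n_sales, revenue_per_sale, cost_per_contact):
    """Return cost, revenue, profit and contacts-per-sale for one campaign."""
    cost = n_contacts * cost_per_contact       # every contact is paid for
    revenue = n_sales * revenue_per_sale       # only sales bring revenue
    return {
        "cost": cost,
        "revenue": revenue,
        "profit": revenue - cost,
        "contacts_per_sale": n_contacts / n_sales,
    }


# Hypothetical figures: 1,000 contacts each; the ranked list finds more buyers.
random_list = campaign_summary(1000, 120, revenue_per_sale=1000.0, cost_per_contact=1.0)
ranked_list = campaign_summary(1000, 300, revenue_per_sale=1000.0, cost_per_contact=1.0)

gain_pct = 100 * (ranked_list["revenue"] - random_list["revenue"]) / random_list["revenue"]
print(random_list)
print(ranked_list)
print(f"Revenue gain over random selection: {gain_pct:.1f}%")
```

Because the number of contacts is fixed by the budget, the two strategies differ only in how many sales those contacts produce.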

In [None]:
# Only the models trained above (no KNN), in the same order as models.keys()
probas_list = [y_pred_proba_log_reg, y_pred_proba_rf, y_pred_proba_hgb]
skplt.metrics.plot_calibration_curve(y_valid, probas_list, list(models.keys()))

**Strategy**
- Reach the largest possible number of customers with the potential to purchase insurance
- Recall must be increased in order to contact as many interested customers as possible

In [None]:
models.keys()