# 0. The problem

### Context

Our client is an insurance company that provides Health Insurance to its customers. The company now needs a model capable of predicting whether last year's policyholders (customers) will also be interested in **Vehicle Insurance**, which the company also offers.

A prediction model will help the company target its communication strategy and reach out to the customers most likely to purchase vehicle insurance.

### Solution

Supposing that the company does not have enough resources to contact every client in the database, a good strategy is to create a list of clients ordered by their propensity to be interested in Vehicle Insurance. Compared with picking clients at random from the list, this strategy lets the company reach more potential buyers with the same effort.

Let's say the company has a marketing budget to contact **25,000** people.

The purpose is to employ a Machine Learning model to order a list of clients from the most interested to the least interested. With that list, it is possible to plot a Cumulative Gains Curve to evaluate the effectiveness of the model in comparison to a randomized choice.
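As a rough illustration of what a cumulative gains curve measures, the sketch below ranks customers by a model score and tracks the share of interested customers reached as more of the list is contacted. The helper `cumulative_gains`, the toy labels, the toy scores, and the budget `k` are all made up for this example; they are not part of the dataset or of any library.

```python
import numpy as np

def cumulative_gains(y_true, scores):
    """Return (fraction of list contacted, fraction of interested customers reached)."""
    y_true = np.asarray(y_true)
    # Sort customers from the highest score to the lowest
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.cumsum(y_true[order])            # interested customers found so far
    contacted = np.arange(1, len(y_true) + 1) / len(y_true)
    gain = hits / y_true.sum()
    return contacted, gain

# Toy data (hypothetical): 10 customers, 3 of them interested
y_toy = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
scores_toy = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.4, 0.2, 0.35, 0.05]

contacted, gain = cumulative_gains(y_toy, scores_toy)
k = 3  # hypothetical contact budget
print(f"Top {k} by score reaches {gain[k - 1]:.0%} of interested customers")
print(f"A random selection of {k} would reach about {contacted[k - 1]:.0%}")
```

The gap between the two printed shares is what the curve visualizes for every possible budget: a random ordering follows the diagonal, while a useful model climbs above it.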

# 1. Data description

## 1.1. Imports

In [None]:
import pandas as pd
import seaborn as sns
sns.set_theme(style="darkgrid")
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, QuantileTransformer, PowerTransformer, RobustScaler, TargetEncoder
from sklearn.feature_selection import mutual_info_classif

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, cross_validate
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
import scikitplot as skplt

In [None]:
# Random seed
seed = 42

In [None]:
# Functions
def some_metrics(y_pred, y_true):
    accuracy = accuracy_score(y_pred=y_pred, y_true=y_true)
    precision = precision_score(y_pred=y_pred, y_true=y_true)
    recall = recall_score(y_pred=y_pred, y_true=y_true)
    f1 = f1_score(y_pred=y_pred, y_true=y_true)
    print(f'Accuracy: {100*accuracy:.4f}%')
    print(f'Precision: {100*precision:.4f}%')
    print(f'Recall: {100*recall:.4f}%')
    print(f'F1 score: {f1:.4f}')

## 1.2. Loading data

In [None]:
PATH = '/home/ezequiel/Documentos/Comunidade_DS/car_insurance_sell/data/raw/train.csv'

df_raw = pd.read_csv(filepath_or_buffer=PATH)
df = df_raw.copy()

# 2. Exploratory Data Analysis (EDA)

## 2.1. Data description

In [None]:
df.head()

Columns description:

* **id**                      Unique ID for the customer  
* **Gender**                  Gender of the customer  
* **Age**                     Age of the customer  
* **Driving_License**         0 : Customer does not have DL, 1 : Customer already has DL  
* **Region_Code** 	        Unique code for the region of the customer  
* **Previously_Insured**	    1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance  
* **Vehicle_Age** 	        Age of the Vehicle  
* **Vehicle_Damage** 	        1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn't get his/her vehicle damaged in the past.  
* **Annual_Premium** 	        The amount customer needs to pay as premium in the year  
* **Policy_Sales_Channel** 	Anonymized Code for the channel of outreaching to the customer ie. Different Agents, Over Mail, Over Phone, In Person, etc.  
* **Vintage** 	            Number of Days which customer has been associated with the company  
* **Response** 	            1 : Customer is interested, 0 : Customer is not interested

* Currency: Indian Rupee (Rs)

In [None]:
#df_train.columns = df_train.columns.str.lower()

#### Shape

In [None]:
print(f'Number of rows: {df.shape[0]}')
print(f'Number of columns: {df.shape[1]}')

In [None]:
## id column has no importance and can be removed
#df_train.drop(columns=['id'], inplace=True)

#### Types

In [None]:
df.info()

**Summary**
- Categorical variables:
    - gender (object)
    - driving license (int64)
    - previously insured (int64)
    - region code (float64)
    - policy sales channel (float64)
    - vehicle age (object)
    - vehicle damage (object)
    - response (int64)
- Numerical variables:
    - age
    - annual premium
    - vintage

#### Transform type of some categorical features

In [None]:
df['Driving_License'] = df['Driving_License'].astype('category')
df['Previously_Insured'] = df['Previously_Insured'].astype('category')
df['Region_Code'] = df['Region_Code'].astype('category')
df['Policy_Sales_Channel'] = df['Policy_Sales_Channel'].astype('category')

In [None]:
df.info()

#### Missing values
-> No missing values

In [None]:
df.isna().sum()

#### Duplicated
-> The number of duplicates is low, so they were removed with no further investigation

In [None]:
df.duplicated().sum()

In [None]:
#df_train.drop_duplicates(inplace=True)

#### Target variable
-> Unbalanced target

In [None]:
sns.countplot(data=df, x=df['Response'])

In [None]:
print(f'Total of interested: {df["Response"].value_counts(normalize=True)[1]*100:.2f}%')
print(f'Total of not interested: {df["Response"].value_counts(normalize=True)[0]*100:.2f}%')
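With an unbalanced target, accuracy alone can look good for a model that never flags anyone. The sketch below uses made-up labels (the 12% share of interested customers is an assumption for illustration, not a figure taken from the dataset) and a trivial predictor that always answers "not interested".

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 12 interested customers out of 100
y_toy = np.array([1] * 12 + [0] * 88)
# Trivial "model" that always predicts not interested
y_always_no = np.zeros_like(y_toy)

baseline_accuracy = accuracy_score(y_true=y_toy, y_pred=y_always_no)
baseline_recall = recall_score(y_true=y_toy, y_pred=y_always_no)
print(f"Accuracy: {baseline_accuracy:.2f}, recall: {baseline_recall:.2f}")
```

The trivial predictor scores high on accuracy while finding none of the interested customers, which is why recall and ranking-based views are worth reading alongside accuracy here.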

#### Numerical variables

In [None]:
num_columns = df.select_dtypes(exclude=['object', 'category']).columns.tolist()
num_columns.pop(0)
num_columns

In [None]:
df[num_columns].describe()

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
sns.histplot(data=df, x='Age', hue='Response', bins=50, ax=ax[0][0])
sns.histplot(data=df, x='Annual_Premium', bins=50, hue='Response', ax=ax[0][1])
sns.histplot(data=df, x='Vintage', hue='Response', bins=50, ax=ax[1][0])
sns.histplot(data=df, x='Region_Code', hue='Response', bins=50, ax=ax[1][1]);

#### Categorical variables

In [None]:
cat_columns = df.select_dtypes(include=['object', 'category']).columns.tolist()
cat_columns

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
sns.countplot(data=df, x='Gender', hue='Response', ax=ax[0][0])
sns.countplot(data=df, x='Vehicle_Age', hue='Response', ax=ax[0][1])
sns.countplot(data=df, x='Vehicle_Damage', hue='Response', ax=ax[1][0])
sns.countplot(data=df, x='Driving_License', hue='Response', ax=ax[1][1])

## 2.2. Hypothesis

#### **H1**: Individuals between 30 and 50 years old would be more likely to purchase vehicle insurance.
-> True

In [None]:
between_30_50 = df.query('Age >= 30 & Age <= 50 & Response == 1').shape[0]
below_30 = df.query('Age < 30 and Response == 1').shape[0]
over_50 = df.query('Age > 50 and Response == 1').shape[0]

In [None]:
aux1 = pd.DataFrame({'below_30': [below_30],
                     'between_30_50': [between_30_50],
                     'over_50': [over_50]})
aux1

In [None]:
sns.barplot(data=aux1)
plt.title('Purchasing propensity by age group');
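The bars above count interested customers per age group, and counts also depend on how many customers each group contains. As a complementary check, a response rate per group could be computed along these lines. This is a minimal sketch on made-up rows; only the column names `Age` and `Response` come from the dataset.

```python
import pandas as pd

# Made-up rows; the column names mirror the dataset
toy = pd.DataFrame({
    "Age": [22, 25, 28, 35, 42, 47, 55, 63, 70, 33],
    "Response": [0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})

# Same boundaries as H1 for integer ages: below 30, 30 to 50, over 50
age_group = pd.cut(
    toy["Age"],
    bins=[0, 29, 50, 200],
    labels=["below_30", "between_30_50", "over_50"],
)

# Count of interested customers, group size, and share interested per group
summary = toy.groupby(age_group, observed=True)["Response"].agg(
    interested="sum", customers="count", rate="mean"
)
print(summary)
```

Comparing the `interested` and `rate` columns shows whether a group stands out because it is large or because its members respond more often.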

#### **H2**: Women would be more interested in having vehicle insurance.
-> False. 10.4% of women would purchase, compared to 13.8% of men.

In [None]:
sns.countplot(data=df, x='Gender')
plt.title('Entries by gender');

In [None]:
# Result given in proportion by gender
gender_count = pd.crosstab(df['Response'], df['Gender'], normalize='columns')
gender_count

In [None]:
fig, ax = plt.subplots(figsize=(8,5))
gender_count.plot(kind='bar', stacked=True, ax=ax)
plt.xticks(rotation=0)
plt.title('Interest by gender')

#### **H3**: Individuals who already have a driver's license and had their vehicle damaged in the past would be more interested in vehicle insurance.
-> False.

In [None]:
aux3 = df.query('Driving_License == 1 & Vehicle_Damage == "Yes"')['Response'].value_counts()

In [None]:
sns.barplot(data=aux3)
plt.title('Purchasing propensity among customers with Driving_License = 1 and Vehicle_Damage = Yes')

#### **H4**: Individuals who already have vehicle insurance (previously insured) would not be interested in vehicle insurance.
-> True. 99.91% of those who already have insurance would not purchase another policy.

In [None]:
pd.crosstab(index=df['Response'], columns=df['Previously_Insured'], normalize='columns')

#### **H5**: Individuals whose vehicle was damaged and who were not previously insured would be more interested in vehicle insurance.
-> False. Even when not insured, people whose vehicle was damaged would not purchase.

In [None]:
aux5 = df.query('Vehicle_Damage == "Yes" & Previously_Insured == 0')['Response'].value_counts()
sns.barplot(data=aux5)

#### **H6**: Individuals who own vehicles more than two years old would be more interested in vehicle insurance.
-> False. Owners of vehicles between 1 and 2 years old are the most interested.

In [None]:
aux7 = pd.crosstab(index=df['Response'], columns=df['Vehicle_Age'])
aux7

In [None]:
fig, ax = plt.subplots(figsize=(8,5))
aux7.plot(kind='bar', stacked=True, ax=ax)
plt.xticks(rotation=0)
plt.title('Propensity by vehicle age');

In [None]:
fig, ax = plt.subplots(1,3, figsize=(18,5))
sns.countplot(data=df, x=df['Gender'], hue=df['Response'], ax=ax[0])
sns.countplot(data=df, x=df['Vehicle_Age'], hue=df['Response'], ax=ax[1])
sns.countplot(data=df, x=df['Vehicle_Damage'], hue=df['Response'], ax=ax[2])

#### Vehicle Damage = No --> almost nobody is interested

#### Policy sales channel

In [None]:
aux = df.groupby('Policy_Sales_Channel')['Response'].sum().reset_index()
aux

In [None]:
plt.figure()
ax = sns.histplot(data=df, x='Policy_Sales_Channel')
ax = plt.plot(aux['Response'])

# 3. Feature Engineering

## 3.1. Splitting data into train and validation dataframes

In [None]:
df_train, df_valid = train_test_split(df, train_size=0.8, stratify=df['Response'], random_state=seed)

In [None]:
X_train = df_train.drop(columns='Response').copy()
X_valid = df_valid.drop(columns='Response').copy()

y_train = df_train['Response']
y_valid = df_valid['Response']

print(f'Training dataframe shape: {df_train.shape}')
print(f'Validation dataframe shape: {df_valid.shape}')

## 3.2. Data preparation

In [None]:
# Make column names lowercase
X_train.columns = X_train.columns.str.lower()
X_valid.columns = X_valid.columns.str.lower()
y_train.name = y_train.name.lower()
y_valid.name = y_valid.name.lower()

In [None]:
# id column has no importance and can be removed
X_train.drop(columns=['id'], inplace=True)
X_valid.drop(columns=['id'], inplace=True)

In [None]:
# Rewrite vehicle age
age_dict = {'1-2 Year': 'between_1_2',
            '< 1 Year': 'below_1',
            '> 2 Years': 'over_2'}

X_train['vehicle_age'] = X_train['vehicle_age'].map(age_dict)
X_valid['vehicle_age'] = X_valid['vehicle_age'].map(age_dict)

In [None]:
num_columns = X_train.select_dtypes(exclude=['object', 'category']).columns.to_list()
cat_columns = X_train.select_dtypes(include=['object', 'category']).columns.to_list()

In [None]:
X_train.nunique()

In [None]:
#df_train.dropna(inplace=True)
#df_train.drop_duplicates(inplace=True)

In [None]:
## Transform vehicle_damage to numeric
#df_train['vehicle_damage'] = df_train['vehicle_damage'].apply(lambda x: 0 if x=='No' else 1)

## 3.3. Encoding

### 3.3.1. One hot encode

In [None]:
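# General function for One Hot Encoder
def one_hot_encoder(df_to_encode, feature_to_encode):
    """Fit a OneHotEncoder on one column and replace that column with the encoded ones.

    `feature_to_encode` is a single-item list. `df_to_encode` is modified in place.
    Returns the dataframe and the fitted encoder, which is reused on the validation set.
    """
    encoder = OneHotEncoder(drop='if_binary')
    new_features = encoder.fit_transform(df_to_encode[feature_to_encode]).toarray()
    df_to_encode[encoder.get_feature_names_out()] = new_features
    df_to_encode.drop(columns=encoder.feature_names_in_[0], inplace=True)
    return df_to_encode, encoder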

In [None]:
# Gender ---> OBS: Test dummy encoding
X_train, encoding_gender = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['gender'])

In [None]:
# Driving license
X_train, encoding_license = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['driving_license'])

In [None]:
# Previously insured
X_train, encoding_insured = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['previously_insured'])

In [None]:
# Vehicle damage
X_train, encoding_damage = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['vehicle_damage'])

In [None]:
# Vehicle age
X_train, encoding_v_age = one_hot_encoder(df_to_encode=X_train, feature_to_encode=['vehicle_age'])

### 3.3.2. Target encode

In [None]:
# Region code
tar_enc_reg_code = TargetEncoder()
X_train['region_code'] = tar_enc_reg_code.fit_transform(X=X_train[['region_code']], y=y_train)

In [None]:
# Policy sales channel
tar_enc_pol_sales = TargetEncoder()
X_train['policy_sales_channel'] = tar_enc_pol_sales.fit_transform(X=X_train[['policy_sales_channel']], y=y_train)

In [None]:
X_train

## 3.4. Rescaling

In [None]:
X_train[num_columns].hist(bins=50, figsize=(16,8));

### Vintage - MinMax scaler, Standard scaler, Quantile transform

In [None]:
#std_vintage = MinMaxScaler()
std_vintage = StandardScaler()
#std_vintage = QuantileTransformer()
#std_vintage = PowerTransformer(method='box-cox')
#std_vintage = RobustScaler()

new_vintage = std_vintage.fit_transform(X_train[['vintage']])
X_train['vintage'] = new_vintage

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
# X_train['vintage'] was overwritten above, so the original values come from df_train
sns.histplot(data=df_train['Vintage'], bins=50, ax=ax[0])
sns.histplot(data=new_vintage, bins=50, ax=ax[1])
ax[0].set_title('No scaling')
ax[1].set_title('Some scaling')

### Age - MinMax scaler, Standard scaler, Box-Cox or Quantile transform

In [None]:
#std_age = MinMaxScaler()
std_age = StandardScaler()
#std_age = QuantileTransformer()
#std_age = PowerTransformer(method='box-cox')
#std_age = RobustScaler()

#aux1 = X_train[['age']].transform(np.log1p)
#new_age = std_age.fit_transform(aux1)
new_age = std_age.fit_transform(X_train[['age']])
X_train['age'] = new_age

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
# X_train['age'] was overwritten above, so the original values come from df_train
sns.histplot(data=df_train['Age'], bins=50, ax=ax[0])
sns.histplot(data=new_age, bins=50, ax=ax[1])
ax[0].set_title('No scaling')
ax[1].set_title('Some scaling')

### Annual premium - Standard scaler, Robust scaler, Box-Cox or Quantile transform

In [None]:
#std_anual_pr = MinMaxScaler()
std_anual_pr = StandardScaler()
#std_anual_pr = QuantileTransformer()
#std_anual_pr = PowerTransformer(method='box-cox')
#std_anual_pr = RobustScaler()

aux1 = X_train[['annual_premium']].transform(np.log1p)
new_anual_pr = std_anual_pr.fit_transform(aux1)
#new_anual_pr = std_anual_pr.fit_transform(X_train[['annual_premium']])
X_train['annual_premium'] = new_anual_pr

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
sns.histplot(data=df_train['Annual_Premium'], bins=50, ax=ax[0])
sns.histplot(data=new_anual_pr, bins=50, ax=ax[1])
ax[0].set_title('No scaling')
ax[1].set_title('Some scaling')

## 3.5. Validation dataframe

In [None]:
# Apply the encoders fitted on the training set to the validation set
X_valid[encoding_gender.get_feature_names_out()] = encoding_gender.transform(X=X_valid[['gender']]).toarray()
X_valid.drop(columns=encoding_gender.feature_names_in_[0], inplace=True)

X_valid[encoding_license.get_feature_names_out()] = encoding_license.transform(X=X_valid[['driving_license']]).toarray()
X_valid.drop(columns=encoding_license.feature_names_in_[0], inplace=True)

X_valid[encoding_insured.get_feature_names_out()] = encoding_insured.transform(X=X_valid[['previously_insured']]).toarray()
X_valid.drop(columns=encoding_insured.feature_names_in_[0], inplace=True)

X_valid[encoding_damage.get_feature_names_out()] = encoding_damage.transform(X=X_valid[['vehicle_damage']]).toarray()
X_valid.drop(columns=encoding_damage.feature_names_in_[0], inplace=True)

X_valid[encoding_v_age.get_feature_names_out()] = encoding_v_age.transform(X=X_valid[['vehicle_age']]).toarray()
X_valid.drop(columns=encoding_v_age.feature_names_in_[0], inplace=True)

In [None]:
X_valid['region_code'] = tar_enc_reg_code.transform(X=X_valid[['region_code']])

X_valid['policy_sales_channel'] = tar_enc_pol_sales.transform(X=X_valid[['policy_sales_channel']])

In [None]:
X_valid['age'] = std_age.transform(X=X_valid[['age']])

X_valid['vintage'] = std_vintage.transform(X=X_valid[['vintage']])

X_valid['annual_premium'] = std_anual_pr.transform(X=X_valid[['annual_premium']].transform(np.log1p))

In [None]:
X_valid
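The fit-on-train, transform-on-validation steps in this section could also be bundled with `ColumnTransformer` and `Pipeline`, which are already imported at the top of the notebook. The sketch below is one possible arrangement, not the approach taken above: the rows and premium values are made up, only a few columns are included, and the step names (`onehot`, `scale`, `log_scale`, `clf`) are arbitrary labels chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Made-up rows; column names follow the lowercase names used in this notebook
toy_X = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Male", "Female"],
    "vehicle_age": ["below_1", "between_1_2", "over_2", "between_1_2",
                    "below_1", "over_2", "between_1_2", "below_1"],
    "age": [23, 45, 51, 38, 27, 60, 41, 30],
    "annual_premium": [2630.0, 40454.0, 38294.0, 28619.0,
                       27496.0, 2630.0, 33536.0, 23367.0],
})
toy_y = pd.Series([0, 1, 1, 0, 0, 1, 1, 0], name="response")

# log1p followed by standardization, mirroring the annual premium treatment
log_then_scale = Pipeline([
    ("log1p", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(drop="if_binary"), ["gender", "vehicle_age"]),
    ("scale", StandardScaler(), ["age"]),
    ("log_scale", log_then_scale, ["annual_premium"]),
])

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# Fit on the first six rows, then score the remaining two as a stand-in for validation
model.fit(toy_X.iloc[:6], toy_y.iloc[:6])
proba = model.predict_proba(toy_X.iloc[6:])
print(proba.shape)
```

With this layout every transformer is fitted only on the rows passed to `fit`, and the same fitted steps are reapplied automatically at prediction time, which removes the need to keep each encoder and scaler in a separate variable.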

# 4. Machine Learning Modeling

## 4.1. Model training

### 4.1.1. Logistic Regression

In [None]:
log_reg_clf = LogisticRegression()

In [None]:
log_reg_clf.fit(X=X_train, y=y_train)
y_pred_log_reg = log_reg_clf.predict(X=X_valid)
y_pred_proba_log_reg = log_reg_clf.predict_proba(X=X_valid)

### 4.1.2. KNN

In [None]:
knn_clf = KNeighborsClassifier()

In [None]:
knn_clf.fit(X=X_train, y=y_train)
y_pred_knn = knn_clf.predict(X=X_valid)
y_pred_proba_knn = knn_clf.predict_proba(X=X_valid)

### 4.1.3. Random Forest

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=seed)

In [None]:
rf_clf.fit(X=X_train, y=y_train)
y_pred_rf = rf_clf.predict(X=X_valid)
y_pred_proba_rf = rf_clf.predict_proba(X=X_valid)

### 4.1.4. HGBoosting

In [None]:
hgb_clf = HistGradientBoostingClassifier(random_state=seed)

In [None]:
hgb_clf.fit(X=X_train, y=y_train)
y_pred_hgb = hgb_clf.predict(X=X_valid)
y_pred_proba_hgb = hgb_clf.predict_proba(X=X_valid)

### 4.1.5. Results

In [None]:
def metrics(models):

    results = {'Model': [],
               'Accuracy': [],
               'Precision': [],
               'Recall': []}

    for name, pred in models.items():
        results['Model'].append(name)
        results['Accuracy'].append(accuracy_score(y_pred=pred, y_true=y_valid))
        results['Precision'].append(precision_score(y_pred=pred, y_true=y_valid))
        results['Recall'].append(recall_score(y_pred=pred, y_true=y_valid))

    # Index the table by model name and drop the index label for display
    results = pd.DataFrame(results).set_index('Model')
    results.index.names = [None]
    return results

In [None]:
models = {'Logistic Regression': y_pred_log_reg,
          'KNN': y_pred_knn,
          'Random Forest': y_pred_rf,
          'HGBoost': y_pred_hgb}

In [None]:
results = metrics(models)
results.style.highlight_max(color='green', axis=0)

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(15, 12))
skplt.metrics.plot_roc(y_probas=y_pred_proba_log_reg, y_true=y_valid, plot_macro=False, plot_micro=False, title='Logistic Regression', classes_to_plot=1, ax=ax[0][0])
skplt.metrics.plot_roc(y_probas=y_pred_proba_knn, y_true=y_valid, plot_macro=False, plot_micro=False, title='KNN', classes_to_plot=1, ax=ax[0][1])
skplt.metrics.plot_roc(y_probas=y_pred_proba_rf, y_true=y_valid, plot_macro=False, plot_micro=False, title='Random Forest', classes_to_plot=1, ax=ax[1][0])
skplt.metrics.plot_roc(y_probas=y_pred_proba_hgb, y_true=y_valid, plot_macro=False, plot_micro=False, title='HGBoosting', classes_to_plot=1, ax=ax[1][1])
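To compare the ROC curves with a single number per model, the area under the curve can be computed with `roc_auc_score`, which is imported above. The sketch below uses made-up labels and a made-up array shaped like the output of `predict_proba` (one column per class); for a binary target the score passed in is the column for class 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up validation labels and predict_proba-style output (columns: class 0, class 1)
y_toy = np.array([0, 0, 1, 0, 1, 0, 0, 1])
proba_toy = np.array([
    [0.90, 0.10], [0.70, 0.30], [0.20, 0.80], [0.60, 0.40],
    [0.45, 0.55], [0.80, 0.20], [0.35, 0.65], [0.50, 0.50],
])

# Probability of the positive class (Response == 1)
positive_scores = proba_toy[:, 1]
auc = roc_auc_score(y_true=y_toy, y_score=positive_scores)
print(f"ROC AUC: {auc:.3f}")
```

Because the AUC depends only on how the scores rank customers, it fits the goal of ordering a contact list better than metrics computed from hard 0/1 predictions.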