<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course 
### <center> Author: Alexander Nichiporenko, @AlexNich
    
## <center> Predicting which customers will buy car insurance

### Part 1. Feature and data explanation

Many of us have probably been in a situation where a company calls to sell us something. Typical examples:

* You use a credit card, and the bank calls you with an offer of a loan;
* You bought auto insurance, and the insurance company calls to offer you other types of insurance;
* You have been with a mobile operator for a long time, and the operator calls to propose a new, "more profitable" (oddly enough, more expensive) tariff;
* You bought something from an online store, and after a while the store calls to offer you another item;
* Any other situation that involves buying a new service, an additional service, or a more expensive service.

In most cases the client declines such offers because they simply do not need them. Calling the entire customer base is therefore slow and inefficient, so companies try to contact only those who are likely to accept. How can such customers be found? One approach:

* Call a random subset of clients and record the results;
* In the remaining customer base, find the clients most similar to those who accepted the offer;
* Call these clients, thereby increasing the effectiveness of the contacts.

We will solve a similar problem. We have a dataset from a bank in the United States. Besides its usual services, this bank also provides car insurance. The bank organizes regular campaigns to attract new clients: it has data on potential customers, and its employees call them to advertise the available car insurance options. We are provided with general information about clients (age, job, etc.) as well as more specific information about the current insurance sales campaign (communication, last contact day) and previous campaigns (attributes like previous attempts, outcome). The task is to predict whether a customer will buy car insurance.

In [None]:
#import libraries

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, TimeSeriesSplit, GridSearchCV, train_test_split, KFold, learning_curve, validation_curve
from sklearn.metrics import accuracy_score,classification_report,f1_score,roc_auc_score,roc_curve,precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
plt.rcParams['figure.figsize'] = (20,20)
#sns.set(style="darkgrid");
%matplotlib inline
pd.options.display.max_columns=500

Let's look at our dataset. You can download it here: https://www.kaggle.com/kondla/carinsurance

In [None]:
data = pd.read_csv('carInsurance_train.csv',index_col='Id')

In [None]:
data.head()

In [None]:
data.shape

We have 4000 customers with 17 features.

Our target variable is **'CarInsurance'**, which is binary (1/0): "1" means that the customer accepted the offer, "0" means that they did not.

Overview of the eighteen columns (**Id**, used as the index, plus 17 features):

- **Id** - Unique ID number;
- **Age** - Age of the client;
- **Job** - Job of the client.  "admin.", "blue-collar", etc.
- **Marital** - Marital status of the client  "divorced", "married", "single";
- **Education** - Education level of the client "primary", "secondary", etc.
- **Default** - Has credit in default? "yes" - 1,"no" - 0
- **Balance** - Average yearly balance, in USD
- **HHInsurance** - Is household insured "yes" - 1,"no" - 0
- **CarLoan** - Has the client a car loan "yes" - 1,"no" - 0
- **Communication** - Contact communication type "cellular", "telephone", “NA”
- **LastContactMonth** -  Month of the last contact "jan", "feb", etc.
- **LastContactDay** - Day of the last contact
- **CallStart** - Start time of the last call (HH:MM:SS) 12:43:15
- **CallEnd** - End time of the last call (HH:MM:SS) 12:43:15
- **NoOfContacts** - Number of contacts performed during this campaign for this client; 
- **DaysPassed** - Number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted) 
- **PrevAttempts** - Number of contacts performed before this campaign and for this client 
- **Outcome** - Outcome of the previous marketing campaign "failure", "other", "success", “NA”.

### Part 2. Primary data analysis

First, let's check our data for missing values and outliers.

In [None]:
data.info()

In [None]:
#divide features into categorical and numerical

data['Default']=data['Default'].astype('object')
data['HHInsurance']=data['HHInsurance'].astype('object')
data['CarLoan']=data['CarLoan'].astype('object')
data['LastContactDay']=data['LastContactDay'].astype('object')

cat = []
num = []
for feature in data.drop(columns=['CarInsurance']).columns:
    if data[feature].dtype == object:
        cat.append(feature)
    else:
        num.append(feature)

In [None]:
print ('Number of categorical features:',len(cat))
print ('Number of numerical features:',len(num))

In [None]:
data[data['Job'].isnull()].head()

In [None]:
data[data['Education'].isnull()].head()

In [None]:
data[data['Communication'].isnull()].head()

In [None]:
data[data['Outcome'].isnull()].head()

As we can see, the dataset has some missing values:
* **Job** and **Education** may be missing because customers didn't provide this information;
* **Communication** may be missing because the bank didn't record the communication type;
* **Outcome** is missing for customers who haven't been offered anything before, so there is no outcome to record.

We will fill the **NaNs** later.
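
The `head()` calls above only show a few rows per column. As a complementary check, the number and share of missing values per column can be tabulated in one step. This is a minimal sketch on a toy frame with made-up values; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset.
toy = pd.DataFrame({
    'Job': ['admin.', None, 'blue-collar', 'student'],
    'Education': ['tertiary', 'secondary', None, None],
    'Outcome': [None, None, None, 'success'],
    'Age': [32, 45, 51, 23],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'share': toy.isnull().mean(),
}).sort_values('n_missing', ascending=False)

print(missing)
```

Applied to the real frame, the same two expressions (`data.isnull().sum()` and `data.isnull().mean()`) would give the full picture at a glance.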

In [None]:
data.describe()

In [None]:
data.describe(include = ['object'])

Some values seem suspicious and may be outliers:

* **max Age = 95 years**. A real survivor!
* **max Balance = 98 417 USD**, when the mean is **1532 USD** and the 75th percentile is **1619 USD**. Maybe this person is very rich? Such a long tail is typical for income distributions.
* **min Balance = -3058 USD**. Maybe this person spent all the credit money and did not pay it back?
* **max NoOfContacts = 43**. Did the bank really offer insurance so many times to one person within this campaign? It would be interesting to know whether they agreed.
* **max DaysPassed = 854**. The bank did not call someone for more than two years?
* **max PrevAttempts = 58**, when the mean is 0.72.

Let's look at the rows with these strange values.

In [None]:
data[data['Age']==95].head()

In [None]:
data[data['Balance']==98417].head()

In [None]:
data[data['Balance']==-3058].head()

In [None]:
data[data['DaysPassed']==854].head()

In [None]:
data[data['NoOfContacts']==43].head()

In [None]:
data[data['PrevAttempts']==58].head()

Looking at these rows, we cannot say that the data definitely contains errors. Perhaps everything is correct. Later we will visualize the data and decide what to do with the suspicious values.

Let's see the share of customers who bought car insurance.

In [None]:
data['CarInsurance'].mean()

**40%** isn't bad! But I think the bank wants **100%**, so it calls customers several times. In ML terms, we can say that our two classes are roughly balanced.

Now let's examine the influence of our features on the target variable. First, the numerical features.

In [None]:
data.columns

In [None]:
data.groupby(by=['CarInsurance'])[['Age']].agg(['mean','std','min','max'])

In [None]:
data.groupby(by=['CarInsurance'])[['NoOfContacts']].agg(['mean','std','min','max'])

In [None]:
data.groupby(by=['CarInsurance'])[['DaysPassed']].agg(['mean','std','min','max'])

In [None]:
data.groupby(by=['CarInsurance'])[['PrevAttempts']].agg(['mean','std','min','max'])

In [None]:
data.groupby(by=['CarInsurance'])[['Balance']].agg(['mean','std','min','max'])

From these tables we can see that, on average, customers who agree to insurance:

* Receive more offers for this insurance from the bank
* Were last contacted in a previous bank campaign more than two months ago, versus just over a month for those who declined
* Received other bank offers more often
* Have a slightly higher balance
* Have fewer contacts from the bank for other campaigns

To confirm these observations, we will build histograms and boxplots of the features further below.
Now take a look at categorical and binary features.


In [None]:
pd.crosstab(data['Education'],data['Job'],values=data['CarInsurance'],aggfunc='mean',margins=True)

In [None]:
pd.crosstab(data['Marital'],data['Education'],values=data['CarInsurance'],aggfunc='mean',margins=True)

In [None]:
pd.crosstab(data['Default'],data['Job'],values=data['CarInsurance'],aggfunc='mean',margins=True)

In [None]:
pd.crosstab(data['CarLoan'],data['Job'],values=data['CarInsurance'],aggfunc='mean',margins=True)

In [None]:
pd.crosstab(data['CarLoan'],data['HHInsurance'],values=data['CarInsurance'],aggfunc='mean',margins=True)

In [None]:
pd.crosstab(data['Communication'],data['Outcome'],values=data['CarInsurance'],aggfunc='mean',margins=True)

In [None]:
pd.crosstab(data['Communication'],data['LastContactMonth'],values=data['CarInsurance'],aggfunc='mean',margins=True)

In [None]:
pd.crosstab(data['LastContactDay'],data['LastContactMonth'],values=data['CarInsurance'],aggfunc='mean',margins=True)

Looking at these crosstabs we can see:

* The communication type doesn't affect the target variable
* The success of the campaign depends on the month and the day
* Persons with CarLoan rarely agree to the offer
* Persons with HHInsurance rarely agree to the offer
* People who agreed to other offers of the bank agree to insurance more often
* Persons with Default rarely agree to the offer
* Single persons and persons with tertiary education often agree to insurance
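
To make the reading of these tables explicit: with `values=` set to the binary target and `aggfunc='mean'`, each cell of the crosstab is the share of customers in that cell who bought insurance. The sketch below illustrates this on a toy frame; the six rows are invented for the example and only the column names come from the dataset.

```python
import pandas as pd

# Six hypothetical customers.
toy = pd.DataFrame({
    'CarLoan':      [0, 0, 0, 1, 1, 0],
    'HHInsurance':  [0, 0, 1, 1, 0, 1],
    'CarInsurance': [1, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 target inside each (CarLoan, HHInsurance) cell = conversion rate.
rates = pd.crosstab(toy['CarLoan'], toy['HHInsurance'],
                    values=toy['CarInsurance'], aggfunc='mean')
print(rates)
```

Cells with few customers produce noisy rates, so it can help to look at a plain count crosstab (without `values`) alongside the rate table.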



### Part 3. Primary visual data analysis

Let's make visualizations of our features and their effect on the target variable.

In [None]:
#target distribution
sns.countplot(x=data['CarInsurance'],palette="Accent");
plt.title('Target distribution');

In [None]:
#distribution of categorical features

plt.figure(figsize=(20,20))
for i in range(1,len(cat[:11])):
    plt.subplot(4,3,i)
    sns.countplot(x=data[cat[i-1]],palette='Accent')
    plt.xticks(rotation=90)

It can be seen that some values of categorical features (**"Default=1"** or months) have a small number of examples. In general, such values are usually combined into one group to prevent overfitting, and in the binary case, this column can be deleted.
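
Combining rare values into one group could look like the sketch below. It is not applied in this notebook; the helper name `merge_rare`, the `min_count` threshold and the label `'other'` are all illustrative assumptions.

```python
import pandas as pd

def merge_rare(series, min_count, other_label='other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent categories, overwrite the rare ones.
    return series.where(~series.isin(rare), other_label)

# Hypothetical month column: 'dec' and 'mar' appear only once each.
months = pd.Series(['may', 'may', 'may', 'jul', 'jul', 'dec', 'mar'])
merged = merge_rare(months, min_count=2)
print(merged.value_counts())
```

To avoid leaking information, the set of rare categories would normally be determined on the training part only and then reused for validation data.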

In [None]:
#target variable versus categorical

plt.figure(figsize=(20,20))
for i in range(1,len(cat[:11])):
    plt.subplot(4,3,i)
    sns.barplot(x=data[cat[i-1]],y=data['CarInsurance'],palette='Accent')
    plt.xticks(rotation=90)

Conclusions regarding the dependence of the target variable on categorical features obtained using primary data analysis are confirmed by these visualizations (see **Part 2**).

In [None]:
#histograms of numerical features and their scatterplots

sns.pairplot(data[num], palette="Accent");

In [None]:
corr_matrix = data[num].corr()

In [None]:
sns.heatmap(corr_matrix,cmap="Accent");

From the scatterplots and the heatmap it is clear that our numerical features have no visible correlations, and that all distributions except age are strongly skewed: most values are concentrated on the left, with a long tail to the right.
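
The visual impression of skewness can also be checked numerically. A minimal sketch, assuming a made-up balance-like series: `Series.skew()` returns the sample skewness, and a positive value corresponds to a long right tail, which also pulls the mean above the median.

```python
import pandas as pd

# Hypothetical balances: mostly small values plus one very large one.
balance = pd.Series([0, 50, 120, 300, 450, 800, 1500, 98000])

print('mean:', balance.mean(), 'median:', balance.median())
print('skewness:', balance.skew())
```

On the real data the same check would be `data[num].skew()`.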

In [None]:
#histograms of numerical features and their scatterplot

sns.pairplot(data[num + ['CarInsurance']],hue='CarInsurance',palette="Accent",diag_kind='kde');

In [None]:
#boxplots depending on the target variable

plt.figure(figsize=(20,10))
for i in range(1,len(num)+1):
    plt.subplot(2,3,i)
    sns.boxplot(data=data, x=data['CarInsurance'],y=data[num[i-1]],palette="Accent")

In [None]:
#graphs depending on the target variable with a limit of 0.975 quantile for better visibility

plt.figure(figsize=(20,10))
for i in range(1,len(num)+1):
    plt.subplot(2,3,i)
    sns.boxplot(data=data, x=data['CarInsurance'],y=data[data[num[i-1]]<data[num[i-1]].quantile(0.975)][num[i-1]],palette="Accent")

In general, all conclusions and influences also agree with the results of the analysis in **Part 2**.

### Part 4. Insights and found dependencies 

Let's summarize which patterns were discovered:

* Tertiary education raises the chances of accepting the insurance offer; these persons may be more responsible and prudent;
* Persons without a car loan or household insurance are more receptive to the car insurance offer, which looks a little strange;
* Persons who accepted the bank's other offers are more receptive to the car insurance offer;
* If the bank offers insurance many times, it is more likely that the customer will agree to it;
* Persons who were last called in March, September, October and December very often agreed to the offer. This may be due to the seasonality of car sales: people usually insure new cars, and dealers offer good discounts in these months;
* Single persons often buy car insurance; maybe they have extra money for this service, while married persons spend money on other things;
* People who buy car insurance have a slightly higher balance;
* People to whom the bank has never offered its other services are less likely to agree to car insurance. These are new customers with whom the bank has not yet built a relationship;
* Students buy car insurance more often. I think they are new to driving, so they need insurance.

### Part 5. Metrics selection

Suppose that we have data on **4 000** clients, and this is **10%** of the entire database. If we assume that the effectiveness of calls for the remaining customers will be about the same, then the bank wants to reach all customers who would agree to insurance with as few calls as possible, or within the number of calls it can afford to make. Thus, we could choose the metric **recall@topK%**, with **K** equal to about **50%** in this case. In general, however, the strategy and capabilities of the bank may vary, so we need a universal classifier. A universal, threshold-independent quality metric for classifiers is **ROC-AUC**, and we will use it. Given the predicted probabilities, we can then choose a threshold and calculate **K** (the share of customers whose predicted probability exceeds it) and **recall@topK%**.
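
To make **recall@topK%** concrete, here is a minimal sketch of how it could be computed from true labels and predicted scores. The function name `recall_at_top_k` and the toy arrays are assumptions made for illustration; this metric is not computed elsewhere in the notebook.

```python
import numpy as np

def recall_at_top_k(y_true, y_score, k_share):
    """Share of all positives found among the top k_share highest-scored customers."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_top = int(np.ceil(k_share * len(y_true)))
    # Indices of the n_top customers with the highest scores.
    top_idx = np.argsort(-y_score, kind='stable')[:n_top]
    return y_true[top_idx].sum() / y_true.sum()

# Ten hypothetical customers, four of whom would buy.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.1, 0.2, 0.6, 0.3, 0.4, 0.35, 0.05])

print(recall_at_top_k(y_true, y_score, k_share=0.5))
```

In this toy case, calling the top half of the ranked list reaches three of the four buyers. With a real model, `y_score` would be the output of `predict_proba(...)[:, 1]`.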

### Part 6. Model selection

In our dataset there are numerical features with very large values, but nothing indicates that they are errors, so we will leave them unchanged and use **XGBoost** as the prediction model, since tree-based boosting is robust to such outliers. This algorithm also tends to perform very well on tabular tasks. Finally, this task isn't tied to financial risk, so we can afford a "black box" model.

### Parts 7-9. Data preprocessing. Cross-validation and adjustment of model hyperparameters. Creation of new features and description of this process.

First of all, let's fill the **NaNs**. We assumed that some people left the **Education** and **Job** fields empty for some reason, so we replace these missing values with "unknown", and we do the same for the communication type. Missing values in the **Outcome** feature are filled with "no_outcome". In other words, we simply treat missing values as another category.

In [None]:
#assign the result back instead of using inplace=True on a column selection
data['Education']=data['Education'].fillna('unknown')
data['Job']=data['Job'].fillna('unknown')
data['Communication']=data['Communication'].fillna('unknown')
data['Outcome']=data['Outcome'].fillna('no_outcome')

In [None]:
data.head()

To encode our categorical features we will use the common **OHE** (one-hot encoding) method via **pd.get_dummies**.

In [None]:
#one-hot encode the string-valued categorical features
ohe_cols = ['Job','Marital','Education','Communication','LastContactMonth','Outcome']
data=pd.concat([data.drop(columns=ohe_cols),pd.get_dummies(data[ohe_cols])],axis=1)

In [None]:
data.head()

In [None]:
data.shape

For now, we will not use the "CallStart" and "CallEnd" features, because they first need to be processed into new features.

In [None]:
# get X and y

X = data.drop(columns=['CallStart','CallEnd','CarInsurance'])
X=X.astype('float')
y = data['CarInsurance']

Divide our dataset into train and validation parts. We will use **25%** for validation. Because our classification task is roughly balanced, we won't use stratified splitting.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=33)

In [None]:
#part of class "1" 
y_train.mean(), y_valid.mean()
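
For comparison, a stratified split keeps the class shares identical in both parts by splitting each class separately. The sketch below shows the idea with plain NumPy on a synthetic target; the helper `stratified_split_indices` is hypothetical and is not used in the rest of the notebook.

```python
import numpy as np

def stratified_split_indices(y, test_size, seed):
    """Return (train_idx, valid_idx), sending test_size of each class to validation."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, valid_parts = [], []
    for label in np.unique(y):
        # Shuffle the row positions of this class and cut off its validation share.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_valid = int(round(test_size * len(idx)))
        valid_parts.append(idx[:n_valid])
        train_parts.append(idx[n_valid:])
    return np.concatenate(train_parts), np.concatenate(valid_parts)

# Synthetic target with a 40% positive share, similar to the dataset.
y = np.array([1] * 1600 + [0] * 2400)
train_idx, valid_idx = stratified_split_indices(y, test_size=0.25, seed=33)
print(y[train_idx].mean(), y[valid_idx].mean())
```

In scikit-learn the same effect is available by passing `stratify=y` to `train_test_split`.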

Let's check the quality of XGBoost via CV with 5 shuffled folds.

In [None]:
xgb = XGBClassifier(random_state=33, n_jobs=4)
kf = KFold(random_state=33,n_splits=5,shuffle=True)
print ('Mean ROC-AUC CV score:', np.mean(cross_val_score(xgb, X_train, y_train, scoring='roc_auc',cv=kf)))

OK, now let's try to add some extra features. We could add the day of the week, but unfortunately we don't know the year in which the calls were made. Therefore, we will work with features derived from the call time:
* start hour of call
* start minute of call
* call duration in seconds

In [None]:
#parse the call times (HH:MM:SS) once, then derive duration, start hour and start minute
call_start=pd.to_datetime(data['CallStart'],format='%H:%M:%S')
call_end=pd.to_datetime(data['CallEnd'],format='%H:%M:%S')
data['CallDuration']=(call_end-call_start).dt.total_seconds()
data['CallHourStart']=call_start.dt.hour
data['CallMinStart']=call_start.dt.minute

In [None]:
data.head()

Let's look at our new features.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(data=data, x=data['CarInsurance'],y=data['CallDuration'],palette="Accent");

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x=data['CallHourStart'],y=data['CarInsurance'],palette='Accent');

In [None]:
plt.figure(figsize=(15,6))
sns.barplot(x=data['CallMinStart'],y=data['CarInsurance'],palette='Accent');
plt.xticks(rotation=90);

**"CallDuration"** is a very useful feature: longer calls are associated with insurance purchases. The other features don't seem as useful, but we will try all of them together.

In [None]:
# get X and y

X = data.drop(columns=['CallStart','CallEnd','CarInsurance'])
X=X.astype('float')
y = data['CarInsurance']

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=33)

In [None]:
#part of class "1" 
y_train.mean(), y_valid.mean()

Let's check quality again.

In [None]:
xgb = XGBClassifier(random_state=33, n_jobs=4)
kf = KFold(random_state=33,n_splits=5,shuffle=True)
print ('Mean ROC-AUC CV score:', np.mean(cross_val_score(xgb, X_train, y_train, scoring='roc_auc',cv=kf)))

Whoa! The new features gave a noticeable increase in quality! Let's tune the hyperparameters via GridSearchCV.

In [None]:
%%time
parameters = {'n_estimators':[40, 50, 60, 80, 100, 150, 200, 300], 'max_depth':[3, 4, 5, 6, 7, 8], 'min_child_weight': [1,3,5,7,9]}
xgb = XGBClassifier(random_state=33, n_jobs=4)
clf = GridSearchCV(xgb, parameters, scoring='roc_auc', cv=kf)
clf.fit(X_train, y_train)
print('Best parameters: ', clf.best_params_)

Now check new hyperparameters via our CV.

In [None]:
xgb = XGBClassifier(random_state=33, n_jobs=4,max_depth=4, min_child_weight=1, n_estimators=200)
kf = KFold(random_state=33,n_splits=5,shuffle=True)
print ('Mean ROC-AUC CV score:', np.mean(cross_val_score(xgb, X_train, y_train, scoring='roc_auc',cv=kf)))

The result is now higher.

### Part 10. Plotting training and validation curves

In [None]:
def plot_with_std(x, data, **kwargs):
    mu, std = data.mean(1), data.std(1)
    lines = plt.plot(x, mu, '-', **kwargs)
    plt.fill_between(x, mu - std, mu + std, edgecolor='none',
                     facecolor=lines[0].get_color(), alpha=0.2)
        
def plot_learning_curve(clf, X, y, scoring, cv=5):
 
    train_sizes = np.linspace(0.05, 1, 20)
   
    n_train, val_train, val_test = learning_curve(clf, X=X, y=y, train_sizes=train_sizes, cv=cv,scoring=scoring)
    plot_with_std(n_train, val_train, label='training scores', c='green')
    plot_with_std(n_train, val_test, label='validation scores', c='red')
    plt.xlabel('Training Set Size'); plt.ylabel(scoring)
    plt.legend()

def plot_validation_curve(clf, X, y, cv_param_name, 
                          cv_param_values, scoring):

    val_train, val_test = validation_curve(clf, X, y, param_name=cv_param_name, param_range=cv_param_values, cv=5, scoring=scoring)
    plot_with_std(cv_param_values, val_train, 
                  label='training scores', c='green')
    plot_with_std(cv_param_values, val_test, 
                  label='validation scores', c='red')
    plt.xlabel(cv_param_name); plt.ylabel(scoring)
    plt.legend()

In [None]:
# learning curve
plt.figure(figsize=(12,6))
plot_learning_curve(xgb,X_train, y_train, scoring='roc_auc', cv=10)

Judging by the learning curve, adding data could improve the quality of the model, because the validation score keeps increasing as more data is added.

In [None]:
# validation curve

plt.figure(figsize=(12,6))
max_depth = [3, 4, 5, 6, 7, 8]
plot_validation_curve(XGBClassifier(random_state=33, n_jobs=4, min_child_weight=1, n_estimators=200), X_train, y_train, 
                    cv_param_name='max_depth', 
                    cv_param_values=max_depth,
                    scoring='roc_auc')

In [None]:
# validation curve

plt.figure(figsize=(12,6))
n_estimators = [40, 50, 60, 80, 100, 150, 200, 300]
plot_validation_curve(XGBClassifier(random_state=33, n_jobs=4, min_child_weight=1, n_estimators=200), X_train, y_train, 
                    cv_param_name='n_estimators', 
                    cv_param_values=n_estimators,
                    scoring='roc_auc')

In [None]:
# validation curve

plt.figure(figsize=(12,6))
min_child_weight = [1,3,5,7,9]
plot_validation_curve(XGBClassifier(random_state=33, n_jobs=4, min_child_weight=1, n_estimators=200), X_train, y_train, 
                    cv_param_name='min_child_weight', 
                    cv_param_values=min_child_weight,
                    scoring='roc_auc')

The validation curves show that the CV score is much lower than the training score, which indicates that the model is overfitting. This is common for boosting on such a small dataset. To reduce overfitting, you can try reducing the complexity of the model and increasing the parameters responsible for regularization.
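
As an illustration of what "reducing complexity and increasing regularization" could mean for XGBoost, the dictionary below collects a few parameters that push the model toward a simpler fit. The values are hypothetical and untuned; they are a starting point for experiments, not a result from this notebook.

```python
# Hypothetical, untuned values; each one nudges the model toward a simpler fit.
regularized_params = {
    'max_depth': 3,           # shallower trees
    'min_child_weight': 5,    # require more evidence in each leaf
    'n_estimators': 200,
    'learning_rate': 0.05,    # smaller step per boosting round
    'subsample': 0.8,         # row subsampling for each tree
    'colsample_bytree': 0.8,  # column subsampling for each tree
    'reg_lambda': 5.0,        # stronger L2 penalty on leaf weights
    'random_state': 33,
    'n_jobs': 4,
}
print(regularized_params)
```

These can be passed as `XGBClassifier(**regularized_params)` and evaluated with the same `cross_val_score` call used earlier, to see whether the gap between training and validation scores narrows.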

### Part 11. Prediction for test or hold-out samples

Now we use our XGBoost model to predict probabilities for X_valid.

In [None]:
xgb.fit(X_train,y_train)
y_pred_valid=xgb.predict_proba(X_valid)[:,1]

print ('ROC-AUC score of X_valid:', roc_auc_score(y_valid, y_pred_valid))

The score is a bit higher than on CV, which suggests that our CV scheme is reliable.

In [None]:
import xgboost

In [None]:
#look at most important features
xgboost.plot_importance(xgb,max_num_features=15,importance_type='gain');

### Part 12. Conclusions

In this project we built a model with good quality, **~0.92 ROC-AUC**, so the bank can use it to find the customers who are most likely to buy car insurance, depending on the bank's capabilities and policies.