## What can we do with this data?

After the initial analysis and exploration of the dataset, we can already share several conclusions:
- Before 2014 there were no lead conversions and data was scarce.
- Many features carry redundant geographical information about the lead.
- We did not find a strong correlation between the selected features and the value to predict.
- The dataset is imbalanced, with many more 0 labels than 1 labels. This was somewhat expected.

## Room for improvement?

So far, given the provided features, the only patterns you are likely to find are those related to the campaign (given just its name), lead_source and little more, which means your focus is on the campaign and the configuration around it. This is a fair strategy to follow, given the things you have control of.

However, there could be additional success factors that have nothing to do with the campaign and come directly from the user's interaction with the product. If you enjoy and trust a product, you are more likely to pay for it.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from time import time
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

## Get prepared dataset

In [None]:
df = pd.read_pickle("../training_df")

In [None]:
df.head()

## Balanced classification data

In [None]:
df.label.value_counts()

In [None]:
# Current total percentage of converted leads (converted / all leads)
1115 * 100 / (722245 + 1115)

With only about 0.15% of leads converted, it is going to be difficult to predict anything. We need to apply some techniques to correct this imbalance.
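
The share of converted leads can also be derived from the two class counts instead of typing the numbers by hand. The sketch below assumes that 722245 is the number of non-converted leads and 1115 the number of converted ones; `positive_share` is a hypothetical helper, not part of the notebook.

```python
def positive_share(n_positive, n_negative):
    """Return the percentage of positive labels among all labels."""
    total = n_positive + n_negative
    return 100.0 * n_positive / total

# Assumption: these are the two counts shown by df.label.value_counts()
n_converted = 1115
n_not_converted = 722245

share = positive_share(n_converted, n_not_converted)
ratio = n_not_converted / n_converted
print(f"converted leads: {share:.3f}% (about 1 in {ratio:.0f})")
```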

## Balancing techniques

In classification problems, a disparity in the frequencies of the observed classes can have a significant negative impact on model fitting. One technique for resolving such a class imbalance is to subsample the training data in a manner that mitigates the issues. Examples of sampling methods for this purpose are:

* under-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). 
* over-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class. Techniques such as SMOTE go one step further and synthesize new data points in the minority class by interpolating between existing minority samples and their nearest neighbours, instead of duplicating them.
* hybrid methods: SMOTEENN combines over- and under-sampling, using SMOTE followed by Edited Nearest Neighbours cleaning.
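
To make the two basic ideas concrete, here is a minimal plain-Python sketch. It is not how imbalanced-learn implements them: `random_undersample` and `interpolate` are hypothetical helpers, the toy labels are made up, and real SMOTE also has to search for nearest neighbours before interpolating.

```python
import random
from collections import Counter

def random_undersample(labels, seed=0):
    """Return sorted indices that keep every class at the size of the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        # Sample without replacement; the rarest class is kept in full
        kept.extend(rng.sample(idxs, n_min))
    return sorted(kept)

def interpolate(point, neighbour, t):
    """SMOTE-style synthetic point on the segment between two minority samples."""
    return [p + t * (q - p) for p, q in zip(point, neighbour)]

# Toy data (hypothetical): 80% class 0 and 20% class 1
labels = [0] * 80 + [1] * 20
kept = random_undersample(labels)
kept_counts = Counter(labels[i] for i in kept)
kept_fraction = len(kept) / len(labels)
print(kept_counts, kept_fraction)

# Synthetic minority sample halfway between two (made-up) minority points
synthetic = interpolate([1.0, 2.0], [3.0, 6.0], 0.5)
print(synthetic)
```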

# Model

Let's try to build different models using these correction techniques. We will compare the performance of each model and check whether there is room to build something valuable here.

## Evaluation

The confusion matrix gives us an idea of how well we are predicting both types of leads: those that do not convert (0) and those that convert (1).
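
As a reminder of what the four cells of the matrix mean, the sketch below counts them by hand and derives precision and recall for the converted class. The helper names and the toy labels are hypothetical; the notebook itself relies on scikit-learn for this.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def precision_recall(y_true, y_pred):
    """Precision and recall of the positive class (0.0 when undefined)."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels (hypothetical): 3 converted leads out of 10
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
print(counts, precision, recall)
```

In this layout, false positives are non-converted leads flagged as converted, and false negatives are converted leads that the model missed.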

In [None]:
def paint_confusion_matrix_and_report(model, X0_test, y0_test):
    """Plot the confusion matrix and print precision/recall per class."""
    class_names = ['no-conversion', 'conversion']
    y_pred = model.predict(X0_test)
    # Rows are true labels, columns are predicted labels
    cm2 = confusion_matrix(y0_test, y_pred.round())
    ax = plt.subplot()
    sns.heatmap(cm2, annot=True, ax=ax, fmt="d", cmap="YlGnBu")
    # labels, title and ticks
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    ax.xaxis.set_ticklabels(class_names)
    ax.yaxis.set_ticklabels(class_names)
    # classification_report expects (y_true, y_pred), in that order
    prec_rec = classification_report(y0_test, y_pred, target_names=class_names)
    print(prec_rec)

## Training

In [None]:
X = df.drop(['label'], axis=1)
y = df['label']  # Labels

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test

In [None]:
y_test.value_counts()

In [None]:
len(y_test)

In [None]:
345 * 100 / 217008

In [None]:
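def fit_model(model, X0_train, y0_train):
    """Fit a fresh copy of `model`, print the training time and return the fitted copy."""
    from sklearn.base import clone
    # Clone so that each call returns an independent estimator; otherwise
    # clf_n, clf_d, clf_s, ... would all point to the same, last-fitted object.
    model = clone(model)
    start = time()
    model.fit(X0_train, y0_train)
    end = time()
    result = end - start
    print('Training time = %.3f seconds' % result)
    return model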

## No correction

In [None]:
# 95.146 seconds for just 1 job
clf=RandomForestClassifier(n_estimators=100, n_jobs=4)

clf_n = fit_model(clf, X_train, y_train)
paint_confusion_matrix_and_report(clf_n, X_test, y_test)

This classifier performs poorly on the class we care about: it predicts almost every lead as not converted (0).
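
This is also why accuracy alone is a misleading metric here. The arithmetic below shows what a model that always predicts 0 would score on the test split, assuming that 345 is the number of converted leads among the 217008 test rows computed above.

```python
# Assumption: counts taken from the test-split cells above
n_test = 217008
n_converted = 345

# A model that always answers "not converted" is right on every 0 label...
accuracy_all_zero = (n_test - n_converted) / n_test
# ...but finds none of the converted leads
recall_all_zero = 0 / n_converted

print(f"accuracy: {accuracy_all_zero:.4%}, recall on converted leads: {recall_all_zero:.0%}")
```

Precision and recall on the converted class are therefore the figures to watch in the reports that follow.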

### Downsampling

In [None]:
print("Before undersampling: ", Counter(y_train))
undersample = RandomUnderSampler(sampling_strategy='majority')
X_train_under, y_train_under = undersample.fit_resample(X_train, y_train)
print("After undersampling: ", Counter(y_train_under))

In [None]:
clf_d = fit_model(clf, X_train_under, y_train_under)
paint_confusion_matrix_and_report(clf_d, X_test, y_test)

We have slightly improved the detection of converted leads, but at the price of more false positives: many leads are classified as converted (1) when they were not.

### SMOTE

In [None]:
print("Before SMOTE: ", Counter(y_train))
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("After SMOTE: ", Counter(y_train_smote))

In [None]:
clf_s = fit_model(clf, X_train_smote, y_train_smote)
paint_confusion_matrix_and_report(clf_s, X_test, y_test)

### SMOTEENN

In [None]:
print("Before SMOTEENN: ", Counter(y_train))
sme = SMOTEENN(random_state=42)
X_train_sme, y_train_sme = sme.fit_resample(X_train, y_train)
print("After SMOTEENN: ", Counter(y_train_sme))

In [None]:
clf_sm = fit_model(clf, X_train_sme, y_train_sme)
paint_confusion_matrix_and_report(clf_sm, X_test, y_test)

### Manual

In [None]:
print("Before resampling: ", Counter(y_train))
# Over-sample the minority class up to 40% of the majority class size...
over = SMOTE(sampling_strategy=0.4)
# ...then under-sample the majority class until the minority/majority ratio is 0.5
under = RandomUnderSampler(sampling_strategy=0.5)
X_train_s, y_train_s = over.fit_resample(X_train, y_train)
X_train_comb, y_train_comb = under.fit_resample(X_train_s, y_train_s)
print("After resampling: ", Counter(y_train_comb))

In [None]:
clf_c = fit_model(clf, X_train_comb, y_train_comb)
paint_confusion_matrix_and_report(clf_c, X_test, y_test)
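
The class sizes produced by this two-step resampling can be estimated before running it. The sketch below assumes that a float `sampling_strategy` means the minority/majority ratio after the step, as described in the imbalanced-learn documentation. `combined_resample_counts` is a hypothetical helper, the input counts are made up, and the library's exact rounding may differ slightly.

```python
def combined_resample_counts(n_majority, n_minority, over_ratio=0.4, under_ratio=0.5):
    """Approximate (majority, minority) counts after over- then under-sampling.

    Both ratios are read as minority / majority after the respective step.
    """
    # Step 1: grow the minority class up to over_ratio * majority
    minority_after = max(n_minority, int(n_majority * over_ratio))
    # Step 2: shrink the majority class until minority / majority == under_ratio
    majority_after = min(n_majority, int(minority_after / under_ratio))
    return majority_after, minority_after

# Hypothetical training counts, only to illustrate the arithmetic
majority_after, minority_after = combined_resample_counts(100_000, 150)
print(majority_after, minority_after, minority_after / majority_after)
```

With these settings the final training set has two non-converted leads for every (mostly synthetic) converted one.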

## Hyperparameters Tuning

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 400, num = 10)]
# Number of features to consider at every split
max_features = ['sqrt']  # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

In [None]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3-fold cross-validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the random search model; let's try first with the undersampled dataset
start = time()
rf_random.fit(X_train_under, y_train_under)
end = time()
result = end - start
print('Training time = %.3f seconds' % result)

In [None]:
rf_random.best_params_

In [None]:
best_model = RandomForestClassifier(n_estimators= 361,
 min_samples_split= 5,
 min_samples_leaf= 4,
 max_features= 'sqrt',  # equivalent to the former 'auto' for classifiers
 max_depth= 10,
 bootstrap= False,
 n_jobs=4)
bm = fit_model(best_model, X_train_under, y_train_under)
paint_confusion_matrix_and_report(bm, X_test, y_test)

### Gridsearch

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [False],
    'max_depth': [5, 10, 20],
    'max_features': ['sqrt'],  # equivalent to the former 'auto' for classifiers
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 5, 10],
    'n_estimators': [200, 300, 400]
}
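
Before launching an exhaustive search it helps to count how many models it will train. The sketch below copies the shape of the grid above (one value for `bootstrap` and `max_features`, three for each of the other parameters) and assumes 3-fold cross-validation, as in the random search.

```python
from math import prod

# Same shape as the grid above; only the number of values per parameter matters here
grid_shape = {
    'bootstrap': [False],
    'max_depth': [5, 10, 20],
    'max_features': ['sqrt'],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 5, 10],
    'n_estimators': [200, 300, 400],
}
cv_folds = 3  # assumption: same number of folds as in the random search

n_candidates = prod(len(values) for values in grid_shape.values())
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Every extra value in any list multiplies the number of fits, which is why the grid is kept small around the values found by the random search.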

In [None]:
# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(X_train_under, y_train_under)
grid_search.best_params_

In [None]:
final_model = RandomForestClassifier(n_estimators= 200,
 min_samples_split= 4,
 min_samples_leaf= 5,
 max_features= 'sqrt',  # equivalent to the former 'auto' for classifiers
 max_depth= 5,
 bootstrap= False,
 n_jobs=4)
fm = fit_model(final_model, X_train_under, y_train_under)
paint_confusion_matrix_and_report(fm, X_test, y_test)

In [None]:
import joblib
# save
joblib.dump(fm, "final_model.joblib")

# Conclusion

With a plain classifier, downsampling gave the best results, at the price of more false positives. We need to check with the client whether that trade-off is acceptable.
SMOTE also helped to identify many converted leads, up to approx. 2/3 in the best run.
We tried some hyperparameter tuning but, due to hardware limitations, could not explore a wide range of values. The best model found improved precision on converted leads by almost 0.10% in the end.
As a next step, we would like to explore a more elaborate approach, for instance clustering the leads and training one classifier per cluster, and compare the results with this baseline model.

# References
https://towardsdatascience.com/how-to-deal-with-imbalanced-data-in-python-f9b71aba53eb

https://machinelearningmastery.com/multi-core-machine-learning-in-python/

http://glemaitre.github.io/imbalanced-learn/api.html

https://vitalflux.com/micro-average-macro-average-scoring-metrics-multi-class-classification-python/

https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74