# Better Collective Challenge - Predictive Modeling

Eloy Chang

----------------------------------

## Imports

In [1]:
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from scipy.stats import randint, uniform
import numpy as np

## Read data

In [2]:
with open('../data/train_data.pkl', 'rb') as file:
    X_train, y_train = pickle.load(file)

with open('../data/test_data.pkl', 'rb') as file:
    X_test, y_test = pickle.load(file)

## Baseline model

A logistic regression is used as the baseline. It is a simple model that is easy to explain, and its coefficients can provide useful insights; however, it may not be highly accurate.

In [3]:
# Model training
# Pass the target as a 1-D Series to avoid scikit-learn's column-vector warning
baseline_model = LogisticRegression(random_state=0).fit(X_train, y_train['churn'])

# Metrics
baseline_predictions = baseline_model.predict(X_test)
baseline_metrics = {
    'accuracy' : baseline_model.score(X_test, y_test),
    'precision' : precision_score(y_test, baseline_predictions),
    'recall' : recall_score(y_test, baseline_predictions),
    'f1-score' : f1_score(y_test, baseline_predictions),
    'roc-auc' : roc_auc_score(y_test, baseline_model.predict_proba(X_test)[:, 1])
}
baseline_metrics



{'accuracy': 0.7425886712546321,
 'precision': 0.9744763245907411,
 'recall': 0.7546346782988005,
 'f1-score': 0.8505800107551664,
 'roc-auc': 0.5290686651135125}

This model has high accuracy, precision and recall, but these metrics are misleading because of the class imbalance in the data. ROC AUC is a more suitable metric in this case, and at roughly 0.53 the model is barely better than random guessing.
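
To illustrate why accuracy is misleading under class imbalance, the sketch below scores a classifier that always predicts the majority class. The data is synthetic, and the 88% positive rate is an assumption chosen only for illustration; `X_demo`, `y_demo` and `majority` are hypothetical names that are not part of the analysis above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic, illustrative labels only: roughly 88% positives (churn = 1).
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.88).astype(int)
X_demo = rng.normal(size=(10_000, 3))  # features carry no signal on purpose

# A classifier that always predicts the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

demo_accuracy = accuracy_score(y_demo, majority.predict(X_demo))
demo_auc = roc_auc_score(y_demo, majority.predict_proba(X_demo)[:, 1])

print(f"accuracy: {demo_accuracy:.3f}")  # close to the positive rate
print(f"roc-auc:  {demo_auc:.3f}")       # 0.5, since constant scores cannot rank
```

The accuracy lands near the positive rate even though the classifier has learned nothing, while ROC AUC stays at 0.5.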

## Churn Model

Ensemble methods built on decision trees are powerful models; one of them is the Gradient Boosting algorithm. These models have several hyperparameters that can be tuned to improve performance. Randomized search with cross-validation is used here to do so.

In [None]:
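# Sample weights: rows with churn == 0 get weight 6, rows with churn == 1 keep weight 1
weights = y_train['churn'].copy()
weights.loc[weights == 0] = 6

# Standardize the features; the scaler is fitted on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

gradient_boosting_model = GradientBoostingClassifier(n_estimators = 500, random_state=0)
distributions = {
    'learning_rate' : uniform(loc=0, scale=1),
    'max_depth' : randint(2, 50),
    'min_samples_leaf' : uniform(loc=0, scale=.5)  # float: fraction of the training samples
}

# With no `scoring` argument, candidates are ranked by the estimator's default score (accuracy)
model = RandomizedSearchCV(gradient_boosting_model, distributions, random_state=0, n_iter = 20, cv = 5)
best_model = model.fit(X_train, np.array(y_train)[:, 0], sample_weight = weights)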

print(f'Best model parameters: {best_model.best_params_}')

predictions = best_model.predict(X_test)
model_metrics = {
    'accuracy' : model.score(X_test, y_test),
    'precision' : precision_score(y_test, predictions),
    'recall' : recall_score(y_test, predictions),
    'f1-score' : f1_score(y_test, predictions),
    'roc-auc' : roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
}
model_metrics

Best model parameters: {'learning_rate': 0.6976311959272649, 'max_depth': 43, 'min_samples_leaf': 0.4849045338733744}


{'accuracy': 0.7204870301746956,
 'precision': 0.9699532205829435,
 'recall': 0.7348691384950927,
 'f1-score': 0.8362028850628199,
 'roc-auc': 0.5071995514027957}
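
The search above ranks candidates by accuracy, which is the default score of a scikit-learn classifier, even though ROC AUC is the metric of interest. The selected `min_samples_leaf` of about 0.48 is also a fraction of the training samples, so every leaf must hold almost half of the data and each tree can make at most one split. As a minimal sketch, assuming synthetic data and narrower search ranges that are illustrative choices rather than the settings used above, the same search could select on ROC AUC like this:

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (about 85% class 1), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.15, 0.85], random_state=0
)

# "balanced" weights are inversely proportional to class frequency.
demo_weights = compute_sample_weight(class_weight="balanced", y=y_demo)

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_distributions={
        "learning_rate": uniform(loc=0.01, scale=0.3),  # assumed range
        "max_depth": randint(2, 6),                     # assumed range
        "min_samples_leaf": randint(5, 50),             # integer: absolute count
    },
    n_iter=5,
    cv=3,
    scoring="roc_auc",  # select candidates by ROC AUC instead of accuracy
    random_state=0,
)
search.fit(X_demo, y_demo, sample_weight=demo_weights)

print(search.best_params_)
print(f"mean cross-validated ROC AUC: {search.best_score_:.3f}")
```

The small `n_estimators`, `n_iter` and `cv` values only keep the sketch fast; they are not recommendations for the real data.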

In [5]:
with open('../model/model.pkl', 'wb') as file:
    pickle.dump(best_model, file)



## Model evaluation

#### Effectiveness 

Machine learning models lose predictive power over time as the behaviour of the variables changes. A scheduled process should therefore periodically measure how well the model performs on a selected metric such as ROC AUC. Once performance falls below a defined threshold, an alert can be raised. It is also possible to automatically trigger retraining with the same methodology on the newest data; if performance does not improve as desired, a new model can be trained manually, possibly with new variables.
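
The periodic check described above could be sketched as follows. The function name `check_model_performance`, the threshold value and the synthetic data are all assumptions made for illustration; in practice the recent data would come from newly labelled accounts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

AUC_ALERT_THRESHOLD = 0.70  # hypothetical value; choose it from business needs


def check_model_performance(model, X_recent, y_recent, threshold=AUC_ALERT_THRESHOLD):
    """Score a fitted classifier on recent labelled data and flag degradation."""
    scores = model.predict_proba(X_recent)[:, 1]
    auc = roc_auc_score(y_recent, scores)
    return {"roc_auc": float(auc), "needs_retraining": bool(auc < threshold)}


# Synthetic stand-in for the churn data, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_clusters_per_class=1, class_sep=2.0, random_state=0
)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo[:700], y_demo[:700])

# Recent data that still follows the training distribution.
healthy = check_model_performance(demo_model, X_demo[700:], y_demo[700:])

# Simulated drift: the relationship between features and label is reversed.
drifted = check_model_performance(demo_model, X_demo[700:], 1 - y_demo[700:])

print(healthy)
print(drifted)
```

A scheduler could call such a function on each new batch of labelled data and start the retraining process whenever `needs_retraining` is true.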

#### Summary

* 88% of the accounts churn in the same month they register; however, these accounts represent 62.9% of the total revenue. Although their short life is not ideal, they still generate a large part of the revenue. On the other hand, this creates a data issue, given the heavy class imbalance.

* Most of the variables have high variability; segmenting the accounts before modelling is highly recommended to reduce noise in the model.

* There are some brands and ad locations that seem to generate huge losses.

## Next Steps

#### Additional Data

* Demographic data.

* Lead history on the sites.

* Account history between brands.

#### Metrics

I think the most valuable metric is the revenue generated by the account. However, the model should be based on the share of that revenue that goes to Better Collective, because maximizing the partners' revenue is not necessarily the same as maximizing Better Collective's revenue.
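
A small worked example shows how the two targets can disagree. The column names, the revenue figures and the share percentages below are all made up for illustration and do not come from the challenge data.

```python
import pandas as pd

# Hypothetical accounts: column names and all numbers are invented for illustration.
accounts = pd.DataFrame({
    "account_id": ["a", "b", "c"],
    "partner_revenue": [1000.0, 600.0, 400.0],
    "revenue_share": [0.10, 0.30, 0.40],  # fraction paid to Better Collective
})
accounts["bc_revenue"] = accounts["partner_revenue"] * accounts["revenue_share"]

rank_by_partner = (
    accounts.sort_values("partner_revenue", ascending=False)["account_id"].tolist()
)
rank_by_bc = accounts.sort_values("bc_revenue", ascending=False)["account_id"].tolist()

print(rank_by_partner)  # ['a', 'b', 'c']
print(rank_by_bc)       # ['b', 'c', 'a']
```

Account `a` generates the most revenue for the partner but the least for Better Collective, so a model trained on partner revenue would prioritize the wrong accounts.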


#### Ideas

* I would start by obtaining more data and improving the data cleaning process. A large amount of data was dropped during the analysis; drilling down into the root causes of these inconsistencies and/or creating rules or models to impute them correctly is highly recommended.

* Train a clustering algorithm to reduce variance and noise in the variables; this may not only improve model performance but also generate interesting insights.

* Train a model to estimate the revenue generated by each account per month; with such a model, predicting the customer lifetime value (CLV) becomes feasible.

* Once a CLV model with acceptable performance is available, it can be used to optimize several other processes, such as:

    * Ad selection and placement.

    * Improve negotiation with partners. 

    * Improve customer experience. 
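
The clustering idea in the list above could start from a sketch like the following. It uses k-means on standardized synthetic features; the choice of algorithm, the number of clusters and the data are assumptions for illustration, not results from the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic account features, for illustration only: two well-separated groups.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 4)),
    rng.normal(loc=5.0, scale=1.0, size=(200, 4)),
])

# Scale first so that no single feature dominates the Euclidean distance.
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),  # n_clusters is an assumption
)
segments = segmenter.fit_predict(X_demo)

# Each segment can then get its own model, or the label can be added as a feature.
for segment_id in np.unique(segments):
    print(segment_id, int((segments == segment_id).sum()))
```

On real data the number of clusters would need to be chosen with a criterion such as the silhouette score, and the resulting segments checked for business meaning.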


