## Machine Learning approach to contrast the manual selection and recommendation problem

The objective of this model is to estimate the probability that each customer in the selected group will purchase an em account.

For that I will use the following data:
- To be consistent, only customers who were active in the last month are included.
- The target is the em account.
- The products each customer holds in the last partition (except the em account).
- The revenue computed in the last manual step.
- The number of months the customer has been active.

This model is extremely primitive and therefore very limited.
The final goal is to predict the probability that a client with no products purchases an em account. For such a client the model only sees a product list that is all zeros, the number of months the client has been active in the app, and a revenue that takes one of three values (0, 10, 20) and is mostly 0.
To reach this goal the model has to learn patterns from clients whose product lists contain ones and whose revenue is greater than zero, with the number of months being the most consistent feature across the dataset. Applying those patterns to clients with no products will lead to results that are not very reliable.

As for looking for similar profiles in other months, the limitation is again the features we are feeding in. If anything, the only thing that would enrich the model is demographics:
- Age range, for example
- Entry channel
- Gender, at some point
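As a minimal sketch of how an age range could be turned into the `age_*` indicator columns that appear later in this notebook: the raw `age` column, the toy values and the exact bin edges below are assumptions for illustration, since the original preprocessing is not shown here.

```python
import pandas as pd

# Hypothetical raw ages; the real source column is not shown in this notebook.
customers = pd.DataFrame(
    {"age": [17, 25, 38, 45, 59, 66, 74, 85]},
    index=pd.Index([1, 2, 3, 4, 5, 6, 7, 8], name="pk_cid"),
)

# Assumed bin edges, chosen only to match the age_* column names used later.
bins = [0, 17, 30, 40, 50, 60, 70, 80, 200]
labels = ["u18", "18-30", "31-40", "41-50", "51-60", "61-70", "71-80", "o80"]

# pd.cut uses right-inclusive intervals by default: (0, 17], (17, 30], ...
age_range = pd.cut(customers["age"], bins=bins, labels=labels)

# One indicator column per age range, stored as uint8 like the rest of the frame.
age_dummies = pd.get_dummies(age_range, prefix="age").astype("uint8")

print(age_dummies)
```

Other demographics such as entry channel or gender could be encoded the same way with `pd.get_dummies`, at the cost of more sparse columns.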

## Libraries

In [1]:
import pandas as pd
import numpy as np

## VISUALIZATION
import matplotlib.pyplot as plt
import seaborn as sns

## SKLEARN
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb



In [2]:
pd.options.display.float_format = '{:,.2f}'.format

## Load data

I need to train two different models. The first one predicts the probability of a customer purchasing the em account based on age, revenue and months active; I want to use it to evaluate the 4733 customers that have no products.

The second one, similarly, estimates the probability of someone buying a debit card. I will use it to evaluate the rest of the customers, who do have an em account but no debit card.

The other products are of no use here, so I will drop all of them except my two targets, which I will rename.

In [47]:
## load data for train/testing

model_df = pd.read_csv('model_proba_df.csv', index_col=0)
model_df.head()

pk_cid,revenue,months,short_term_deposit,loans,mortgage,funds,securities,long_term_deposit,credit_card,debit_card,...,em_account_p,em_acount,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4740,16,0,0,0,1,1,1,1,1,...,0,0,0,0,0,0,1,0,0,0
1116675,4720,16,0,0,0,1,1,0,1,1,...,0,1,0,0,0,0,0,1,0,0
1136671,4580,16,0,0,0,1,1,0,1,1,...,0,0,0,0,0,1,0,0,0,0
1070525,4360,16,0,0,0,1,1,1,0,1,...,0,0,0,0,0,1,0,0,0,0
1133500,4320,16,0,0,0,1,1,0,1,1,...,0,0,0,0,1,0,0,0,0,0


In [48]:
# rename target features
model_df.rename(columns={"em_acount": "target_em"}, inplace=True)
model_df.rename(columns={"debit_card": "target_dc"}, inplace=True)

In [49]:
model_df.columns

Index(['revenue', 'months', 'short_term_deposit', 'loans', 'mortgage', 'funds',
       'securities', 'long_term_deposit', 'credit_card', 'target_dc',
       'payroll', 'pension_plan', 'payroll_account', 'emc_account',
       'em_account_p', 'target_em', 'age_u18', 'age_18-30', 'age_31-40',
       'age_41-50', 'age_51-60', 'age_61-70', 'age_71-80', 'age_o80'],
      dtype='object')

In [50]:
_drop_prods = ['short_term_deposit', 'loans', 'mortgage', 'funds','securities', 'long_term_deposit', 'credit_card', 'payroll','pension_plan', 'payroll_account', 'emc_account', 'em_account_p']

In [51]:
# Drop all products
model_df_no_prod = model_df.drop(_drop_prods, axis=1)

In [52]:
model_df_no_prod.head()

pk_cid,revenue,months,target_dc,target_em,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4740,16,1,0,0,0,0,0,1,0,0,0
1116675,4720,16,1,1,0,0,0,0,0,1,0,0
1136671,4580,16,1,0,0,0,0,1,0,0,0,0
1070525,4360,16,1,0,0,0,0,1,0,0,0,0
1133500,4320,16,1,0,0,0,1,0,0,0,0,0


In [55]:
model_df_no_prod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 160017 entries, 1128353 to 1548202
Data columns (total 12 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   revenue    160017 non-null  int64
 1   months     160017 non-null  int64
 2   target_dc  160017 non-null  uint8
 3   target_em  160017 non-null  uint8
 4   age_u18    160017 non-null  uint8
 5   age_18-30  160017 non-null  uint8
 6   age_31-40  160017 non-null  uint8
 7   age_41-50  160017 non-null  uint8
 8   age_51-60  160017 non-null  uint8
 9   age_61-70  160017 non-null  uint8
 10  age_71-80  160017 non-null  uint8
 11  age_o80    160017 non-null  uint8
dtypes: int64(2), uint8(10)
memory usage: 5.2 MB


In [54]:
# cast the binary columns (targets and age dummies) to uint8; they only hold zeros and ones
for col in model_df_no_prod.columns[2:]:
    model_df_no_prod[col] = model_df_no_prod[col].astype('uint8')

In [56]:
# load data of the final 10k customers to check once model is validated
final_10k = pd.read_csv('final_10k.csv', index_col=0)
final_10k.head()

pk_cid,revenue,months,short_term_deposit,loans,mortgage,funds,securities,long_term_deposit,credit_card,debit_card,...,em_account_p,em_acount,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1045535,2560,15,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1116106,2420,16,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1020461,2400,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1119050,2190,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1209899,2050,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [57]:
# rename em account to target
final_10k.rename(columns={"em_acount": "target_em"}, inplace=True)
final_10k.rename(columns={"debit_card": "target_dc"}, inplace=True)
# Drop all products
final_10k_no_prod = final_10k.drop(_drop_prods, axis=1)
final_10k_no_prod.head()
final_10k_no_prod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1045535 to 1531108
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   revenue    10000 non-null  int64
 1   months     10000 non-null  int64
 2   target_dc  10000 non-null  int64
 3   target_em  10000 non-null  int64
 4   age_u18    10000 non-null  int64
 5   age_18-30  10000 non-null  int64
 6   age_31-40  10000 non-null  int64
 7   age_41-50  10000 non-null  int64
 8   age_51-60  10000 non-null  int64
 9   age_61-70  10000 non-null  int64
 10  age_71-80  10000 non-null  int64
 11  age_o80    10000 non-null  int64
dtypes: int64(12)
memory usage: 1015.6 KB


In [58]:
for col in final_10k_no_prod.columns[2:]:
    final_10k_no_prod[col] = final_10k_no_prod[col].astype('uint8')

In [59]:
final_10k_no_prod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1045535 to 1531108
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   revenue    10000 non-null  int64
 1   months     10000 non-null  int64
 2   target_dc  10000 non-null  uint8
 3   target_em  10000 non-null  uint8
 4   age_u18    10000 non-null  uint8
 5   age_18-30  10000 non-null  uint8
 6   age_31-40  10000 non-null  uint8
 7   age_41-50  10000 non-null  uint8
 8   age_51-60  10000 non-null  uint8
 9   age_61-70  10000 non-null  uint8
 10  age_71-80  10000 non-null  uint8
 11  age_o80    10000 non-null  uint8
dtypes: int64(2), uint8(10)
memory usage: 332.0 KB


### 1. Recommend em_account

This is the first problem to solve, so I will remove the debit card target for this one.

In [60]:
model_df_em = model_df_no_prod.drop('target_dc', axis=1)
model_df_em.head()

pk_cid,revenue,months,target_em,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4740,16,0,0,0,0,0,1,0,0,0
1116675,4720,16,1,0,0,0,0,0,1,0,0
1136671,4580,16,0,0,0,0,1,0,0,0,0
1070525,4360,16,0,0,0,0,1,0,0,0,0
1133500,4320,16,0,0,0,1,0,0,0,0,0


In [61]:
X_train_em, X_dev_em, y_train_em, y_dev_em = model_selection.train_test_split(
    model_df_em.drop('target_em',axis=1),
    model_df_em['target_em'],
    test_size=0.3,
    random_state=42
)

In [62]:
print(model_df.shape)
print(X_train_em.shape)
print(X_dev_em.shape)

(160017, 24)
(112011, 10)
(48006, 10)


In [63]:
X_test_em, X_val_em, y_test_em, y_val_em = model_selection.train_test_split(
    X_dev_em,
    y_dev_em,
    test_size = 0.5,
    random_state=42
)

In [64]:
print(X_test_em.shape)
print(X_val_em.shape)

(24003, 10)
(24003, 10)
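The two `train_test_split` calls above produce a 70/15/15 split. A minimal sketch of the same split wrapped in a helper, with `stratify` added so each part keeps the class ratio of the target, could look like the following. The helper name `split_train_test_val` and the toy data are assumptions for illustration; stratification is an addition, not something the cells above do.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_test_val(X, y, dev_size=0.3, random_state=42):
    """Split into train / test / val (70/15/15 by default), stratified on y."""
    X_train, X_dev, y_train, y_dev = train_test_split(
        X, y, test_size=dev_size, random_state=random_state, stratify=y
    )
    # Split the held-out part in half: one half for testing, one for validation.
    X_test, X_val, y_test, y_val = train_test_split(
        X_dev, y_dev, test_size=0.5, random_state=random_state, stratify=y_dev
    )
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)


# Toy data with an imbalanced binary target (assumed, not the real dataset).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    {"revenue": rng.integers(0, 100, 1000), "months": rng.integers(1, 17, 1000)}
)
y_toy = pd.Series((rng.random(1000) < 0.8).astype("uint8"), name="target_em")

train, test, val = split_train_test_val(X_toy, y_toy)
for name, (_, y_part) in {"train": train, "test": test, "val": val}.items():
    print(name, len(y_part), round(float(y_part.mean()), 3))
```

Keeping the positive rate equal across the three parts makes the training, testing and validation scores easier to compare when the target is imbalanced.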


#### Model

In [65]:
split_dict_em = {
    "TRAINING": [X_train_em, y_train_em],
    "TESTING": [X_test_em, y_test_em]
}

In [66]:
# instantiate model
xgb_model = xgb.XGBClassifier(
    eta = 0.1,
    max_depth = 30,
    min_child_weight = 0.5,
    gamma = 0.5,
    random_state = 42,
    verbosity=0,
    use_label_encoder=False
)

# train 
xgb_model.fit(X = X_train_em, y = y_train_em)


In [67]:
print("################## em acount ##################")
print("\n____________ SCORES & EVALUATIONS ____________\n")
print("#################### RESULTS ###################")


for split_name, (X_split, y_split) in split_dict_em.items():
    # hard class predictions and the metrics derived from them
    pred = xgb_model.predict(X_split)
    confusion_matrix = metrics.confusion_matrix(y_split, pred)
    tn, fp, fn, tp = confusion_matrix.ravel()
    Accuracy = metrics.accuracy_score(y_split, pred)
    Precision = metrics.precision_score(y_split, pred)
    Recall = metrics.recall_score(y_split, pred)
    F_1_Score = metrics.f1_score(y_split, pred)

    # predicted probability of the positive class, used for the ROC metrics
    probs = xgb_model.predict_proba(X_split)[:, 1]
    probs_mean = round(probs.mean() * 100, 2)
    auc_score = roc_auc_score(y_split, probs)
    fpr, tpr, thresholds = roc_curve(y_split, probs)

    # positive / negative predictive values, in percent
    PPV, NPV = ((tp / (tp + fp)) * 100), ((tn / (fn + tn)) * 100)

    print(f"#################### {split_name} ####################")
    print(f"Accuracy: {round(Accuracy, 5)} | Precision: {round(Precision, 5)} | Recall: {round(Recall, 5)} | F1_Score: {round(F_1_Score, 5)}")
    print(f"TN = {tn} | FN = {fn} | TP = {tp} | FP = {fp}")
    print(f"Positive prediction value: {round(PPV, 2)}% | Negative prediction value: {round(NPV, 2)}%")
    print("########## TOP FEATURES ##########")
    top_features = pd.Series(xgb_model.feature_importances_, index=X_split.columns).sort_values(ascending=False).head()
    print(top_features)
    print("\n")

################## em acount ##################

____________ SCORES & EVALUATIONS ____________

#################### RESULTS ###################
#################### TRAINING ####################
Accuracy: 0.89547 | Precision: 0.91982 | Recall: 0.95233 | F1_Score: 0.93579
TN = 14974 | FN = 4271 | TP = 85328 | FP = 7438
Positive prediction value: 91.98% | Negative prediction value: 77.81%
########## TOP FEATURES ##########
age_18-30   0.33
revenue     0.29
months      0.10
age_31-40   0.06
age_o80     0.04
dtype: float32


#################### TESTING ####################
Accuracy: 0.87814 | Precision: 0.90686 | Recall: 0.94446 | F1_Score: 0.92527
TN = 2969 | FN = 1065 | TP = 18109 | FP = 1860
Positive prediction value: 90.69% | Negative prediction value: 73.6%
########## TOP FEATURES ##########
age_18-30   0.33
revenue     0.29
months      0.10
age_31-40   0.06
age_o80     0.04
dtype: float32
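The reported scores can be re-derived from the confusion matrix counts alone, which is a quick sanity check. The sketch below uses the TESTING counts printed above; the specificity and the base rate are extra quantities that the notebook does not print, added here only to put the accuracy in context.

```python
# Counts copied from the TESTING output above.
tn, fn, tp, fp = 2969, 1065, 18109, 1860
total = tn + fn + tp + fp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # identical to the positive predictive value
recall = tp / (tp + fn)
npv = tn / (tn + fn)            # negative predictive value

# Not printed by the notebook: added here for context.
specificity = tn / (tn + fp)    # share of actual negatives correctly identified
base_rate = (tp + fn) / total   # accuracy of always predicting the positive class

print(f"accuracy={accuracy:.5f} precision={precision:.5f} recall={recall:.5f}")
print(f"npv={npv:.4f} specificity={specificity:.4f} base_rate={base_rate:.4f}")
```

Roughly 80% of the test customers are positives, so a model that always predicts the positive class would already reach an accuracy of about 0.80. The specificity of about 0.61 shows that the errors concentrate on the negative class.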




In [32]:
params_xgb = {
        'eta': [0.05, 0.1, 0.3],
        'min_child_weight': [0.5, 1, 5],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'max_depth': [5, 10, 30],
        'subsample': [0.8, 0.9, 1]
        }

grid_search_xgb = model_selection.GridSearchCV(
    estimator=xgb_model,
    param_grid=params_xgb,
    scoring='recall',
    cv=5,
    verbose=1
)

grid_search_xgb.fit(X_train_em, y_train_em)

print(grid_search_xgb.best_estimator_)

Fitting 5 folds for each of 405 candidates, totalling 2025 fits


KeyboardInterrupt: 
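The exhaustive grid above needs 405 candidates times 5 folds, which is why the run was interrupted. A minimal sketch of how the cost could be estimated up front, and how a random subset of the same grid could be drawn instead, is shown below. The choice of 20 sampled candidates is an arbitrary assumption.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params_xgb = {
    "eta": [0.05, 0.1, 0.3],
    "min_child_weight": [0.5, 1, 5],
    "gamma": [0.5, 1, 1.5, 2, 5],
    "max_depth": [5, 10, 30],
    "subsample": [0.8, 0.9, 1],
}
cv_folds = 5

# Size of the full grid and the number of model fits it implies.
n_candidates = len(ParameterGrid(params_xgb))
print(f"full grid: {n_candidates} candidates, {n_candidates * cv_folds} fits")

# Draw a random subset of the grid (sampled without replacement for list-valued grids).
sampled = list(ParameterSampler(params_xgb, n_iter=20, random_state=42))
print(f"random subset: {len(sampled)} candidates, {len(sampled) * cv_folds} fits")
print(sampled[0])
```

In practice the same idea is available through `model_selection.RandomizedSearchCV`, which takes `param_distributions` and `n_iter` in place of `param_grid`.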

In [None]:
# validation data

In [None]:
# proba for the 4733
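The cell above is still empty. As a rough sketch of what scoring the customers without an em account could look like, the example below trains a stand-in classifier on toy data, keeps only the rows with `target_em == 0` in a separate toy frame, and ranks them by predicted probability. Everything here is an assumption for illustration: the toy frames, the helper `make_toy`, the labelling rule, and the use of `GradientBoostingClassifier` in place of the XGBoost model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = ["revenue", "months", "age_18-30"]


def make_toy(n, seed):
    """Build a toy frame with a made-up rule for the target (illustration only)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "revenue": rng.integers(0, 50, n) * 10,
            "months": rng.integers(1, 17, n),
            "age_18-30": rng.integers(0, 2, n).astype("uint8"),
        }
    )
    df["target_em"] = ((df["age_18-30"] == 1) | (df["months"] > 12)).astype("uint8")
    return df


train_toy = make_toy(500, seed=1)
final_toy = make_toy(100, seed=2)

# Stand-in model; the notebook itself uses xgb.XGBClassifier.
model = GradientBoostingClassifier(random_state=42)
model.fit(train_toy[FEATURES], train_toy["target_em"])

# Only customers that do not hold the product yet are worth scoring.
candidates = final_toy.loc[final_toy["target_em"] == 0, FEATURES]

# Probability of the positive class, highest first.
ranking = pd.Series(
    model.predict_proba(candidates)[:, 1], index=candidates.index, name="proba_em"
).sort_values(ascending=False)

print(ranking.head())
```

The feature columns passed to `predict_proba` must match the training columns in name and order, which is why the same `FEATURES` list is used in both places.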

### 2. Recommend debit card

I will remove the em account target for this one.

In [68]:
model_df_dc = model_df_no_prod.drop('target_em', axis=1)
model_df_dc.head()

pk_cid,revenue,months,target_dc,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4740,16,1,0,0,0,0,1,0,0,0
1116675,4720,16,1,0,0,0,0,0,1,0,0
1136671,4580,16,1,0,0,0,1,0,0,0,0
1070525,4360,16,1,0,0,0,1,0,0,0,0
1133500,4320,16,1,0,0,1,0,0,0,0,0


In [69]:
X_train_dc, X_dev_dc, y_train_dc, y_dev_dc = model_selection.train_test_split(
    model_df_dc.drop('target_dc',axis=1),
    model_df_dc['target_dc'],
    test_size=0.3,
    random_state=42
)

In [70]:
print(model_df.shape)
print(X_train_dc.shape)
print(X_dev_dc.shape)

(160017, 24)
(112011, 10)
(48006, 10)


In [71]:
X_test_dc, X_val_dc, y_test_dc, y_val_dc = model_selection.train_test_split(
    X_dev_dc,
    y_dev_dc,
    test_size = 0.5,
    random_state=42
)

In [72]:
print(X_test_dc.shape)
print(X_val_dc.shape)

(24003, 10)
(24003, 10)


#### Model

In [76]:
split_dict_dc = {
    "TRAINING": [X_train_dc, y_train_dc],
    "TESTING": [X_test_dc, y_test_dc]
}

In [77]:
# instantiate model
xgb_model = xgb.XGBClassifier(
    eta = 0.1,
    max_depth = 30,
    min_child_weight = 0.5,
    gamma = 0.5,
    random_state = 42,
    verbosity=0,
    use_label_encoder=False
)

# train 
xgb_model.fit(X = X_train_dc, y = y_train_dc)


In [78]:
print("################## debit card ##################")
print("\n____________ SCORES & EVALUATIONS ____________\n")
print("#################### RESULTS ###################")


for split_name, (X_split, y_split) in split_dict_dc.items():
    # hard class predictions and the metrics derived from them
    pred = xgb_model.predict(X_split)
    confusion_matrix = metrics.confusion_matrix(y_split, pred)
    tn, fp, fn, tp = confusion_matrix.ravel()
    Accuracy = metrics.accuracy_score(y_split, pred)
    Precision = metrics.precision_score(y_split, pred)
    Recall = metrics.recall_score(y_split, pred)
    F_1_Score = metrics.f1_score(y_split, pred)

    # predicted probability of the positive class, used for the ROC metrics
    probs = xgb_model.predict_proba(X_split)[:, 1]
    probs_mean = round(probs.mean() * 100, 2)
    auc_score = roc_auc_score(y_split, probs)
    fpr, tpr, thresholds = roc_curve(y_split, probs)

    # positive / negative predictive values, in percent
    PPV, NPV = ((tp / (tp + fp)) * 100), ((tn / (fn + tn)) * 100)

    print(f"#################### {split_name} ####################")
    print(f"Accuracy: {round(Accuracy, 5)} | Precision: {round(Precision, 5)} | Recall: {round(Recall, 5)} | F1_Score: {round(F_1_Score, 5)}")
    print(f"TN = {tn} | FN = {fn} | TP = {tp} | FP = {fp}")
    print(f"Positive prediction value: {round(PPV, 2)}% | Negative prediction value: {round(NPV, 2)}%")
    print("########## TOP FEATURES ##########")
    top_features = pd.Series(xgb_model.feature_importances_, index=X_split.columns).sort_values(ascending=False).head()
    print(top_features)
    print("\n")

################## debit card ##################

____________ SCORES & EVALUATIONS ____________

#################### RESULTS ###################
#################### TRAINING ####################
Accuracy: 0.90708 | Precision: 0.83223 | Recall: 0.82104 | F1_Score: 0.8266
TN = 76796 | FN = 5407 | TP = 24807 | FP = 5001
Positive prediction value: 83.22% | Negative prediction value: 93.42%
########## TOP FEATURES ##########
revenue     0.47
months      0.19
age_18-30   0.12
age_31-40   0.05
age_71-80   0.04
dtype: float32


#################### TESTING ####################
Accuracy: 0.89572 | Precision: 0.81015 | Recall: 0.79351 | F1_Score: 0.80174
TN = 16439 | FN = 1317 | TP = 5061 | FP = 1186
Positive prediction value: 81.01% | Negative prediction value: 92.58%
########## TOP FEATURES ##########
revenue     0.47
months      0.19
age_18-30   0.12
age_31-40   0.05
age_71-80   0.04
dtype: float32
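The evaluation loop computes `auc_score`, `fpr`, `tpr` and `thresholds` but never reports them. A minimal sketch of how those values could be used, on made-up labels and scores rather than the real model output, is to print the AUC and pick the threshold that maximises `tpr - fpr` (Youden's J statistic). Choosing the threshold this way is an assumption; the notebook itself relies on the default 0.5 cut-off of `predict`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([0.10, 0.35, 0.40, 0.80, 0.20, 0.65, 0.90, 0.55, 0.30, 0.70])

auc = roc_auc_score(y_true, scores)
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J: the threshold with the largest gap between TPR and FPR.
best = int(np.argmax(tpr - fpr))

print(f"AUC: {auc:.3f}")
print(f"threshold: {thresholds[best]:.2f} | TPR: {tpr[best]:.2f} | FPR: {fpr[best]:.2f}")

# Hard predictions at the chosen threshold instead of the default 0.5.
pred_at_best = (scores >= thresholds[best]).astype(int)
print(pred_at_best)
```

Because the grid search above optimises recall, a lower threshold is another lever for raising recall, at the cost of more false positives.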




In [None]:
params_xgb = {
        'eta': [0.05, 0.1, 0.3],
        'min_child_weight': [0.5, 1, 5],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'max_depth': [5, 10, 30],
        'subsample': [0.8, 0.9, 1]
        }

grid_search_xgb = model_selection.GridSearchCV(
    estimator=xgb_model,
    param_grid=params_xgb,
    scoring='recall',
    cv=5,
    verbose=1
)

grid_search_xgb.fit(X_train_dc, y_train_dc)

print(grid_search_xgb.best_estimator_)

In [79]:
# validation data

In [80]:
# proba for the remaining 5267 customers (debit card)


5267