## Machine learning approach to contrast with the manual selection in the recommendation problem

The objective of this model is to estimate the probability that each customer in the selected group engages by purchasing an em account.

For that I will use the following data:
- To be consistent, only customers considered active in the last month are included.
- The TARGET is the em account.
- The distribution of products that a given customer holds in the last partition (except the em account).
- The revenue computed in the last manual step.
- The number of months the customer has been active.

This model is extremely primitive and thus very limited.
The final goal is to predict the probability that a client with no products purchases an em account. For such a client the model only sees a product list that is all zeros, the number of months the client has been active in the app, and a revenue that takes one of 3 values (0, 10, 20) but is mostly 0.
To reach this goal the model has to learn patterns from other clients' product lists (which contain ones), their number of months (the most consistent feature across the dataset) and their revenue values (greater than zero). Because the clients it learns from look so different from the clients it has to score, the results will not be very reliable.

As for looking for similar profiles in other months, the limitation is again the features being fed in. If anything, the only thing that would enrich the model is demographics:
- Age range, for example
- Entry channel
- Gender, at some point
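
As an illustration of the first bullet, here is a minimal sketch of how a raw age could be bucketed into the age-range flags that appear later in this notebook (`age_u18` to `age_o80`). The raw `ages` values and the bin edges are assumptions made for the example; the notebook only shows the already-encoded columns, not how they were built.

```python
import pandas as pd

# Hypothetical raw ages (not taken from the dataset).
ages = pd.Series([17, 25, 38, 45, 59, 66, 75, 83], name="age")

# Assumed bin edges, chosen so the labels match the column names used below.
bins = [0, 17, 30, 40, 50, 60, 70, 80, 200]
labels = ["u18", "18-30", "31-40", "41-50", "51-60", "61-70", "71-80", "o80"]

# pd.cut uses right-closed intervals by default: (0, 17], (17, 30], ...
age_range = pd.cut(ages, bins=bins, labels=labels)

# One 0/1 flag column per age range, stored compactly as uint8.
age_dummies = pd.get_dummies(age_range, prefix="age").astype("uint8")
print(age_dummies.columns.tolist())
```

Entry channel or gender could be encoded the same way with `pd.get_dummies`, provided those fields are available for every customer.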

## Libraries

In [1]:
import pandas as pd
import numpy as np

## VISUALIZATION
import matplotlib.pyplot as plt
import seaborn as sns

## SKLEARN
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb



In [2]:
pd.options.display.float_format = '{:,.2f}'.format

## Load data

I need to train two different models. The first one predicts the probability of a customer purchasing the em account based on age, revenue and months active; I will use it to evaluate the 4733 customers that have no products.

The second one, similarly, estimates the probability of someone buying a credit card. I will use it to evaluate the rest of the customers, who do have an em account but no credit card.

The product columns are of no use as features, so I will drop all of them except for my two targets, which I will rename.
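
The two customer groups described above can be sketched with boolean masks. The frame below is a toy example: the column names `em_account`, `credit_card` and `debit_card` and all values are hypothetical stand-ins, not the notebook's real data.

```python
import pandas as pd

# Toy customers indexed by a made-up pk_cid.
toy = pd.DataFrame(
    {
        "revenue": [0, 10, 0, 20],
        "months": [16, 12, 3, 16],
        "em_account": [0, 1, 0, 1],
        "credit_card": [0, 0, 0, 1],
        "debit_card": [0, 1, 0, 0],
    },
    index=pd.Index([101, 102, 103, 104], name="pk_cid"),
)
product_cols = ["em_account", "credit_card", "debit_card"]

# Group scored by the first model: customers holding no product at all.
no_products = toy[toy[product_cols].sum(axis=1) == 0]

# Group scored by the second model: em account holders without a credit card.
has_em_no_cc = toy[(toy["em_account"] == 1) & (toy["credit_card"] == 0)]

print(no_products.index.tolist(), has_em_no_cc.index.tolist())
```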

In [3]:
## load data for train/testing

model_df = pd.read_csv('model_proba_df.csv', index_col=0)
model_df.head()

pk_cid,revenue,months,short_term_deposit,loans,mortgage,funds,securities,long_term_deposit,credit_card,debit_card,...,em_account_p,em_acount,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4740,16,0,0,0,1,1,1,1,1,...,0,0,0,0,0,0,1,0,0,0
1116675,4720,16,0,0,0,1,1,0,1,1,...,0,1,0,0,0,0,0,1,0,0
1136671,4580,16,0,0,0,1,1,0,1,1,...,0,0,0,0,0,1,0,0,0,0
1070525,4360,16,0,0,0,1,1,1,0,1,...,0,0,0,0,0,1,0,0,0,0
1133500,4320,16,0,0,0,1,1,0,1,1,...,0,0,0,0,1,0,0,0,0,0


In [6]:
# rename target features
model_df.rename(columns={"emc_account": "target_emc"}, inplace=True)
# model_df.rename(columns={"credit_card": "target_cc"}, inplace=True)

In [7]:
model_df.columns

Index(['revenue', 'months', 'short_term_deposit', 'loans', 'mortgage', 'funds',
       'securities', 'long_term_deposit', 'credit_card', 'debit_card',
       'payroll', 'pension_plan', 'payroll_account', 'target_emc',
       'em_account_p', 'em_acount', 'age_u18', 'age_18-30', 'age_31-40',
       'age_41-50', 'age_51-60', 'age_61-70', 'age_71-80', 'age_o80'],
      dtype='object')

In [8]:
_drop_prods = ['short_term_deposit', 'loans', 'mortgage', 'funds','securities', 'long_term_deposit', 'credit_card', 'debit_card', 'payroll','pension_plan', 'payroll_account', 'em_acount', 'em_account_p']

In [10]:
# Drop all products
model_df_emc = model_df.drop(_drop_prods, axis=1)

In [12]:
model_df_emc.head()

pk_cid,revenue,months,target_emc,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4740,16,1,0,0,0,0,1,0,0,0
1116675,4720,16,0,0,0,0,0,0,1,0,0
1136671,4580,16,0,0,0,0,1,0,0,0,0
1070525,4360,16,1,0,0,0,1,0,0,0,0
1133500,4320,16,1,0,0,1,0,0,0,0,0


In [13]:
model_df_emc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 160017 entries, 1128353 to 1548202
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   revenue     160017 non-null  int64
 1   months      160017 non-null  int64
 2   target_emc  160017 non-null  int64
 3   age_u18     160017 non-null  int64
 4   age_18-30   160017 non-null  int64
 5   age_31-40   160017 non-null  int64
 6   age_41-50   160017 non-null  int64
 7   age_51-60   160017 non-null  int64
 8   age_61-70   160017 non-null  int64
 9   age_71-80   160017 non-null  int64
 10  age_o80     160017 non-null  int64
dtypes: int64(11)
memory usage: 14.6 MB


In [14]:
# downcast the 0/1 columns (target and age flags) to uint8 to save memory
for col in model_df_emc.columns[2:]:
    model_df_emc[col] = model_df_emc[col].astype('uint8')
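
The downcast is only lossless if the columns really contain nothing but 0 and 1. A small sketch of that check on a toy frame (values are made up; only the column layout mirrors `model_df_emc`), using a single `astype` call with a dict instead of a loop:

```python
import pandas as pd

# Toy frame with the same layout idea: two numeric columns, then 0/1 flags.
toy = pd.DataFrame({
    "revenue": [4740, 0, 10],
    "months": [16, 3, 12],
    "target_emc": [1, 0, 0],
    "age_18-30": [0, 1, 1],
})

flag_cols = list(toy.columns[2:])

# Verify the flags are strictly binary before shrinking their dtype.
is_binary = bool(toy[flag_cols].isin([0, 1]).all().all())

before = toy.memory_usage(index=False).sum()
if is_binary:
    toy = toy.astype({col: "uint8" for col in flag_cols})
after = toy.memory_usage(index=False).sum()

print(f"{before} bytes -> {after} bytes")
```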

In [15]:
# load data of the final 10k customers to check once model is validated
final_10k = pd.read_csv('final_10k.csv', index_col=0)
final_10k.head()

pk_cid,revenue,months,short_term_deposit,loans,mortgage,funds,securities,long_term_deposit,credit_card,debit_card,...,em_account_p,em_acount,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1045535,2560,15,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1116106,2420,16,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1020461,2400,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1119050,2190,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1209899,2050,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [16]:
# rename em account to target (the source column is 'emc_account', as in model_df)
final_10k.rename(columns={"emc_account": "target_emc"}, inplace=True)

# Drop all products
final_10k_no_prod = final_10k.drop(_drop_prods, axis=1)
final_10k_no_prod.head()
final_10k_no_prod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1045535 to 1531108
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   revenue     10000 non-null  int64
 1   months      10000 non-null  int64
 2   target_emc  10000 non-null  int64
 3   age_u18     10000 non-null  int64
 4   age_18-30   10000 non-null  int64
 5   age_31-40   10000 non-null  int64
 6   age_41-50   10000 non-null  int64
 7   age_51-60   10000 non-null  int64
 8   age_61-70   10000 non-null  int64
 9   age_71-80   10000 non-null  int64
 10  age_o80     10000 non-null  int64
dtypes: int64(11)
memory usage: 937.5 KB


In [17]:
for col in final_10k_no_prod.columns[2:]:
    final_10k_no_prod[col] = final_10k_no_prod[col].astype('uint8')

In [18]:
final_10k_no_prod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1045535 to 1531108
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   revenue     10000 non-null  int64
 1   months      10000 non-null  int64
 2   target_emc  10000 non-null  uint8
 3   age_u18     10000 non-null  uint8
 4   age_18-30   10000 non-null  uint8
 5   age_31-40   10000 non-null  uint8
 6   age_41-50   10000 non-null  uint8
 7   age_51-60   10000 non-null  uint8
 8   age_61-70   10000 non-null  uint8
 9   age_71-80   10000 non-null  uint8
 10  age_o80     10000 non-null  uint8
dtypes: int64(2), uint8(9)
memory usage: 322.3 KB


### 1. Recommend em_account

This is the first problem to solve, so I will leave the credit card info out of this model.

In [19]:
X_train, X_dev, y_train, y_dev = model_selection.train_test_split(
    model_df_emc.drop('target_emc',axis=1),
    model_df_emc['target_emc'],
    test_size=0.3,
    random_state=42
)

In [20]:
print(model_df.shape)
print(X_train.shape)
print(X_dev.shape)

(160017, 24)
(112011, 10)
(48006, 10)


In [21]:
X_test, X_val, y_test, y_val = model_selection.train_test_split(
    X_dev,
    y_dev,
    test_size = 0.5,
    random_state=42
)

In [22]:
print(X_test.shape)
print(X_val.shape)

(24003, 10)
(24003, 10)
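
The 70/15/15 split above is purely random. Since buyers are a minority (15,256 of the 112,011 training rows, about 14%, going by the confusion-matrix counts reported in the evaluation cell), one possible variation is to stratify on the target so each split keeps the same class ratio. The sketch below shows the idea on synthetic data; the generated `X` and `y` are assumptions, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and an imbalanced binary target.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "revenue": rng.integers(0, 3, 1000) * 10,
    "months": rng.integers(1, 17, 1000),
})
y = pd.Series((rng.random(1000) < 0.14).astype("uint8"), name="target_emc")

# 70% train, 30% held out, preserving the class ratio.
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Split the held-out 30% in half: 15% test, 15% validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_dev, y_dev, test_size=0.5, random_state=42, stratify=y_dev
)

for name, part in [("train", y_train), ("test", y_test), ("val", y_val)]:
    print(name, len(part), round(part.mean(), 3))
```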


## Model

In [23]:
split_data_dict = {
    "TRAINING": [X_train, y_train],
    "TESTING": [X_test, y_test]
}

In [24]:
# instantiate model
xgb_model = xgb.XGBClassifier(
    eta = 0.1,
    max_depth = 30,
    min_child_weight = 0.5,
    gamma = 0.5,
    random_state = 42,
    use_label_encoder=False
)

# train 
xgb_model.fit(X = X_train, y = y_train)


In [25]:
print("################## em account ##################")
print("\n____________ SCORES & EVALUATIONS ____________\n")
print("#################### RESULTS ###################")


for split_name, (X_split, y_split) in split_data_dict.items():
    # hard class predictions and the confusion-matrix counts
    pred = xgb_model.predict(X_split)
    confusion_matrix = metrics.confusion_matrix(y_split, pred)
    tn, fp, fn, tp = confusion_matrix.ravel()
    Accuracy = metrics.accuracy_score(y_split, pred)
    Precision = metrics.precision_score(y_split, pred)
    Recall = metrics.recall_score(y_split, pred)
    F_1_Score = metrics.f1_score(y_split, pred)

    # probability of the positive class, for the ROC curve and AUC (computed but not printed)
    probs = xgb_model.predict_proba(X_split)[:, 1]
    probs_mean = round(probs.mean() * 100, 2)
    auc_score = roc_auc_score(y_split, probs)
    fpr, tpr, thresholds = roc_curve(y_split, probs)

    # positive and negative predictive value, in percent
    PPV, NPV = ((tp / (tp + fp)) * 100), ((tn / (fn + tn)) * 100)

    print(f"#################### {split_name} ####################")
    print(f"Accuracy: {round(Accuracy, 5)} | Precision: {round(Precision, 5)} | Recall: {round(Recall, 5)} | F1_Score: {round(F_1_Score, 5)}")
    print(f"TN = {tn} | FN = {fn} | TP = {tp} | FP = {fp}")
    print(f"Positive prediction value: {round(PPV, 2)}% | Negative prediction value: {round(NPV, 2)}%\n")
    # feature importance (a property of the fitted model, so identical for every split)
    top_features = pd.Series(xgb_model.feature_importances_, index = X_split.columns).sort_values(ascending = False).head()
    print(top_features)

################## em account ##################

____________ SCORES & EVALUATIONS ____________

#################### RESULTS ###################
#################### TRAINING ####################
Accuracy: 0.91567 | Precision: 0.80324 | Recall: 0.50439 | F1_Score: 0.61967
TN = 94870 | FN = 7561 | TP = 7695 | FP = 1885
Positive prediction value: 80.32% | Negative prediction value: 92.62%

age_18-30   0.97
revenue     0.01
months      0.01
age_31-40   0.00
age_o80     0.00
dtype: float32
#################### TESTING ####################
Accuracy: 0.90364 | Precision: 0.72296 | Recall: 0.44462 | F1_Score: 0.55061
TN = 20273 | FN = 1770 | TP = 1417 | FP = 543
Positive prediction value: 72.3% | Negative prediction value: 91.97%

age_18-30   0.97
revenue     0.01
months      0.01
age_31-40   0.00
age_o80     0.00
dtype: float32
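
As a cross-check, the TESTING scores can be re-derived by hand from the four confusion-matrix counts printed above. The snippet only uses those reported counts and the standard metric definitions.

```python
# Confusion-matrix counts reported for the TESTING split.
tn, fn, tp, fp = 20273, 1770, 1417, 543

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # of predicted buyers, how many really bought
recall = tp / (tp + fn)                      # of real buyers, how many were found
f1 = 2 * precision * recall / (precision + recall)
npv = tn / (tn + fn)                         # of predicted non-buyers, how many really did not buy

print(round(accuracy, 5), round(precision, 5), round(recall, 5), round(f1, 5))
print(round(precision * 100, 2), round(npv * 100, 2))
```

Read this way, the model finds fewer than half of the actual buyers in the testing split (recall of about 0.44), while roughly 72% of the customers it flags are real buyers.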


In [45]:
## predictions for the held-out validation split (X_val, y_val)

y_val_pred = pd.DataFrame(xgb_model.predict(X_val), index = y_val.index, columns = ['pred'])
results_df_test = pd.DataFrame(y_val).join(y_val_pred, how = 'inner')

# metrics: score the true label (target_emc) against the prediction (pred);
# scoring a column against itself would trivially return 1.0
Accuracy_tree_test = metrics.accuracy_score(results_df_test['target_emc'], results_df_test['pred'])
Precision_tree_test = metrics.precision_score(results_df_test['target_emc'], results_df_test['pred'])
Recall_tree_test = metrics.recall_score(results_df_test['target_emc'], results_df_test['pred'])
print("Accuracy: ", Accuracy_tree_test)
print("Precision: ", Precision_tree_test)
print("Recall: ", Recall_tree_test)
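
Once the model is validated, the stated goal is to score the customers without products. A hedged sketch of that step follows: it ranks candidates by predicted purchase probability using `predict_proba`. Everything here is a stand-in: scikit-learn's `GradientBoostingClassifier` replaces the XGBoost model so the example stays self-contained, the training data is synthetic, and the `candidates` frame with its `pk_cid` values is hypothetical. With the real data, the target column would first have to be dropped from the frame being scored so that its columns match the training features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic training data in which younger customers buy more often.
rng = np.random.default_rng(0)
n = 500
train = pd.DataFrame({
    "revenue": rng.integers(0, 3, n) * 10,
    "months": rng.integers(1, 17, n),
    "age_18-30": rng.integers(0, 2, n),
})
target = ((train["age_18-30"] == 1) & (rng.random(n) < 0.6)).astype("uint8")

model = GradientBoostingClassifier(random_state=0).fit(train, target)

# Hypothetical customers to score; same feature columns, no target column.
candidates = pd.DataFrame(
    {"revenue": [0, 0, 10, 0], "months": [16, 2, 9, 16], "age_18-30": [1, 0, 1, 0]},
    index=pd.Index([9001, 9002, 9003, 9004], name="pk_cid"),
)

# Probability of the positive class, ranked from most to least likely buyer.
scores = pd.Series(
    model.predict_proba(candidates)[:, 1],
    index=candidates.index,
    name="p_em_account",
).sort_values(ascending=False)

print(scores)
```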
