## Machine Learning approach to contrast the manual selection and recommendation problem

The goal of this model is to estimate the probability that each customer in the selected group will engage by purchasing an em account.

For that I will use the following data:
- To be consistent, only customers who were active during the last month are included.
- The target is the em account.
- The products each customer holds in the last partition (excluding the em account).
- The revenue computed in the last manual step.
- The number of months the customer has been active.

This model is extremely primitive and therefore very limited.
The final goal is to predict the probability that a client with no products purchases an em account. For those clients the model only sees a product list that is all zeros, the number of months the client has been active in the app, and a revenue that takes one of three values (0, 10, 20) and is mostly 0.
To reach this goal the model has to learn patterns from clients whose product lists contain ones, from the number of months (the most consistent feature across the dataset) and from revenue values greater than zero, so the results will not be very reliable.

As for looking for similar profiles in other months, the limitation is again the features we feed in. If anything, the only thing that would enrich the model is demographics:
- Age range, for example
- Entry channel
- Gender, at some point
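
If such demographic columns become available, they could be one-hot encoded the same way the age ranges already are. This is only a minimal sketch on a toy frame: the column names `entry_channel` and `gender` and their values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Toy frame; `entry_channel` and `gender` are hypothetical column names.
customers = pd.DataFrame(
    {
        "months": [16, 3, 8],
        "entry_channel": ["web", "app", "web"],
        "gender": ["F", "M", "F"],
    },
    index=pd.Index([101, 102, 103], name="pk_cid"),
)

# One indicator column per category, stored as uint8 like the age dummies.
encoded = pd.get_dummies(customers, columns=["entry_channel", "gender"], dtype="uint8")
print(encoded.columns.tolist())
```

Tree-based models such as XGBoost can consume these indicator columns directly, with no scaling.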

## Libraries

In [218]:
import pandas as pd
import numpy as np

## VISUALIZATION
import matplotlib.pyplot as plt
import seaborn as sns

## SKLEARN
import sklearn
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb


In [219]:
pd.options.display.float_format = '{:,.2f}'.format

## Load data

I need to train two different models. The first one predicts the probability of a customer purchasing the em account based on age, revenue and months active; I will use it to evaluate the 4733 customers that have no products.

The second one, similarly, estimates the probability of someone buying a debit card. I will use it to evaluate the rest of the customers, who have an em account but no debit card.

The other products are of no use here, so I will drop all of them except for my two targets, which I will rename.

In [220]:
## load data for train/testing

model_df = pd.read_csv('model_proba_df.csv', index_col=0)
model_df.head()

pk_cid,revenue,months,short_term_deposit,loans,mortgage,funds,securities,long_term_deposit,credit_card,debit_card,...,emc_account,em_acount,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4580,16,0,0,0,1,1,1,1,1,...,1,0,0,0,0,0,1,0,0,0
1116675,4560,16,0,0,0,1,1,0,1,1,...,0,1,0,0,0,0,0,1,0,0
1136671,4420,16,0,0,0,1,1,0,1,1,...,0,0,0,0,0,1,0,0,0,0
1070525,4200,16,0,0,0,1,1,1,0,1,...,1,0,0,0,0,1,0,0,0,0
1133500,4160,16,0,0,0,1,1,0,1,1,...,1,0,0,0,1,0,0,0,0,0


In [221]:
# rename target features
model_df.rename(columns={"em_acount": "target_em"}, inplace=True)
model_df.rename(columns={"debit_card": "target_dc"}, inplace=True)

In [222]:
model_df.columns

Index(['revenue', 'months', 'short_term_deposit', 'loans', 'mortgage', 'funds',
       'securities', 'long_term_deposit', 'credit_card', 'target_dc',
       'payroll', 'pension_plan', 'payroll_account', 'emc_account',
       'target_em', 'age_u18', 'age_18-30', 'age_31-40', 'age_41-50',
       'age_51-60', 'age_61-70', 'age_71-80', 'age_o80'],
      dtype='object')

In [223]:
_drop_prods = ['short_term_deposit', 'loans', 'mortgage', 'funds','securities', 'long_term_deposit', 'credit_card', 'payroll','pension_plan', 'payroll_account', 'emc_account']

In [224]:
# Drop all products
model_df_no_prod = model_df.drop(_drop_prods, axis=1)

In [225]:
model_df_no_prod.head()

pk_cid,revenue,months,target_dc,target_em,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4580,16,1,0,0,0,0,0,1,0,0,0
1116675,4560,16,1,1,0,0,0,0,0,1,0,0
1136671,4420,16,1,0,0,0,0,1,0,0,0,0
1070525,4200,16,1,0,0,0,0,1,0,0,0,0
1133500,4160,16,1,0,0,0,1,0,0,0,0,0


In [226]:
model_df_no_prod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 160017 entries, 1128353 to 1548202
Data columns (total 12 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   revenue    160017 non-null  int64
 1   months     160017 non-null  int64
 2   target_dc  160017 non-null  int64
 3   target_em  160017 non-null  int64
 4   age_u18    160017 non-null  int64
 5   age_18-30  160017 non-null  int64
 6   age_31-40  160017 non-null  int64
 7   age_41-50  160017 non-null  int64
 8   age_51-60  160017 non-null  int64
 9   age_61-70  160017 non-null  int64
 10  age_71-80  160017 non-null  int64
 11  age_o80    160017 non-null  int64
dtypes: int64(12)
memory usage: 15.9 MB


In [227]:
# downcast the target and age columns to uint8; they only hold 0/1 values
for col in model_df_no_prod.columns[2:]:
    model_df_no_prod[col] = model_df_no_prod[col].astype('uint8')

In [228]:
# load data of the final 10k customers to check once model is validated
final_10k = pd.read_csv('final_10k.csv', index_col=0)
final_10k.head()

pk_cid,revenue,months,short_term_deposit,loans,mortgage,funds,securities,long_term_deposit,credit_card,debit_card,...,emc_account,em_acount,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1045535,2430,15,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1116106,2260,16,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1020461,2260,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1119050,2030,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1209899,1950,16,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [229]:
# rename the two target columns
final_10k.rename(columns={"em_acount": "target_em", "debit_card": "target_dc"}, inplace=True)
# Drop all products
final_10k_no_prod = final_10k.drop(_drop_prods, axis=1)
final_10k_no_prod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1045535 to 1530618
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   revenue    10000 non-null  int64
 1   months     10000 non-null  int64
 2   target_dc  10000 non-null  int64
 3   target_em  10000 non-null  int64
 4   age_u18    10000 non-null  int64
 5   age_18-30  10000 non-null  int64
 6   age_31-40  10000 non-null  int64
 7   age_41-50  10000 non-null  int64
 8   age_51-60  10000 non-null  int64
 9   age_61-70  10000 non-null  int64
 10  age_71-80  10000 non-null  int64
 11  age_o80    10000 non-null  int64
dtypes: int64(12)
memory usage: 1015.6 KB


In [230]:
for col in final_10k_no_prod.columns[2:]:
    final_10k_no_prod[col] = final_10k_no_prod[col].astype('uint8')

In [231]:
final_10k_no_prod.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1045535 to 1530618
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   revenue    10000 non-null  int64
 1   months     10000 non-null  int64
 2   target_dc  10000 non-null  uint8
 3   target_em  10000 non-null  uint8
 4   age_u18    10000 non-null  uint8
 5   age_18-30  10000 non-null  uint8
 6   age_31-40  10000 non-null  uint8
 7   age_41-50  10000 non-null  uint8
 8   age_51-60  10000 non-null  uint8
 9   age_61-70  10000 non-null  uint8
 10  age_71-80  10000 non-null  uint8
 11  age_o80    10000 non-null  uint8
dtypes: int64(2), uint8(10)
memory usage: 332.0 KB


### 1. Recommend em_account

This is the first problem to solve, so I will drop the debit card target (`target_dc`) for this one.

In [232]:
model_df_em = model_df_no_prod.drop('target_dc', axis=1)
model_df_em.head()

pk_cid,revenue,months,target_em,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4580,16,0,0,0,0,0,1,0,0,0
1116675,4560,16,1,0,0,0,0,0,1,0,0
1136671,4420,16,0,0,0,0,1,0,0,0,0
1070525,4200,16,0,0,0,0,1,0,0,0,0
1133500,4160,16,0,0,0,1,0,0,0,0,0


In [233]:
X_train_em, X_dev_em, y_train_em, y_dev_em = model_selection.train_test_split(
    model_df_em.drop('target_em',axis=1),
    model_df_em['target_em'],
    test_size=0.3,
    random_state=42
)

In [234]:
print(model_df.shape)
print(X_train_em.shape)
print(X_dev_em.shape)

(160017, 23)
(112011, 10)
(48006, 10)


In [235]:
X_test_em, X_val_em, y_test_em, y_val_em = model_selection.train_test_split(
    X_dev_em,
    y_dev_em,
    test_size = 0.5,
    random_state=42
)

In [236]:
print(X_test_em.shape)
print(X_val_em.shape)

(24003, 10)
(24003, 10)
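
If either target turns out to be imbalanced, passing `stratify` to both calls keeps the class ratio the same in the training, test and validation splits. The sketch below runs on synthetic data with an assumed 80% positive rate; it illustrates the option and is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and an imbalanced binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.8).astype(int)  # roughly 80% positives (assumed rate)

# 70% train, 30% dev, preserving the class ratio of y.
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Split dev in half into test and validation, again stratified.
X_test, X_val, y_test, y_val = train_test_split(
    X_dev, y_dev, test_size=0.5, random_state=42, stratify=y_dev
)

print(len(X_train), len(X_test), len(X_val))
print(y_train.mean(), y_test.mean(), y_val.mean())
```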


#### Model

In [237]:
split_dict_em = {
    "TRAINING": [X_train_em, y_train_em],
    "TESTING": [X_test_em, y_test_em]
}

In [238]:
# instantiate model
xgb_model_em = xgb.XGBClassifier(
    eta = 0.1,
    max_depth = 30,
    min_child_weight = 0.5,
    gamma = 5,
    random_state = 42,
    verbosity=0,
    use_label_encoder=False
)

# train 
xgb_model_em.fit(X = X_train_em, y = y_train_em)



In [239]:
print("################## em acount ##################")
print("\n____________ SCORES & EVALUATIONS ____________\n")
print("#################### RESULTS ###################")


for split_name, (X_split, y_split) in split_dict_em.items():
    pred = xgb_model_em.predict(X_split)
    confusion_matrix = metrics.confusion_matrix(y_split, pred)
    tn, fp, fn, tp = confusion_matrix.ravel()
    Accuracy = metrics.accuracy_score(y_split, pred)
    Precision = metrics.precision_score(y_split, pred)
    Recall = metrics.recall_score(y_split, pred)
    F_1_Score = metrics.f1_score(y_split, pred)

    # class-1 probabilities; the AUC and ROC points are computed but not printed
    probs = xgb_model_em.predict_proba(X_split)[:, 1]
    probs_mean = round(probs.mean() * 100, 2)
    auc_score = roc_auc_score(y_split, probs)
    fpr, tpr, thresholds = roc_curve(y_split, probs)

    # positive / negative predictive value, in percent
    PPV, NPV = ((tp / (tp + fp)) * 100), ((tn / (fn + tn)) * 100)

    print(f"#################### {split_name} ####################")
    print(f"Accuracy: {round(Accuracy, 5)} | Precision: {round(Precision, 5)} | Recall: {round(Recall, 5)} | F1_Score: {round(F_1_Score, 5)}")
    print(f"TN = {tn} | FN = {fn} | TP = {tp} | FP = {fp}")
    print(f"Positive prediction value: {round(PPV, 2)}% | Negative prediction value: {round(NPV, 2)}%")
    print("########## TOP FEATURES ##########")
    # importances belong to the fitted model, so they are identical for every split
    top_features = pd.Series(xgb_model_em.feature_importances_, index = X_split.columns).sort_values(ascending = False).head()
    print(top_features)
    print("\n")

################## em acount ##################

____________ SCORES & EVALUATIONS ____________

#################### RESULTS ###################
#################### TRAINING ####################
Accuracy: 0.87624 | Precision: 0.90323 | Recall: 0.94659 | F1_Score: 0.9244
TN = 13403 | FN = 4782 | TP = 84746 | FP = 9080
Positive prediction value: 90.32% | Negative prediction value: 73.7%
########## TOP FEATURES ##########
age_18-30   0.32
revenue     0.28
months      0.11
age_31-40   0.06
age_51-60   0.05
dtype: float32


#################### TESTING ####################
Accuracy: 0.86643 | Precision: 0.8961 | Recall: 0.94282 | F1_Score: 0.91887
TN = 2642 | FN = 1101 | TP = 18155 | FP = 2105
Positive prediction value: 89.61% | Negative prediction value: 70.59%
########## TOP FEATURES ##########
age_18-30   0.32
revenue     0.28
months      0.11
age_31-40   0.06
age_51-60   0.05
dtype: float32
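
The scores in the TESTING block above can be re-derived by hand from the four confusion-matrix counts, which also puts the accuracy in context. The counts below are copied from that block. The two extra quantities, specificity and the majority-class baseline, are not printed by the notebook and are added here only for illustration.

```python
# Counts copied from the TESTING block of the em account model.
tn, fn, tp, fp = 2642, 1101, 18155, 2105
total = tn + fn + tp + fp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)             # same quantity as PPV
recall = tp / (tp + fn)
npv = tn / (tn + fn)
specificity = tn / (tn + fp)           # recall of the negative class
majority_baseline = (tp + fn) / total  # accuracy of always predicting 1

print(f"accuracy={accuracy:.5f} precision={precision:.5f} recall={recall:.5f}")
print(f"npv={npv:.4f} specificity={specificity:.4f} baseline={majority_baseline:.4f}")
```

Always predicting class 1 would already reach an accuracy of about 0.80 on this split, and only about 56% of the actual negatives are identified, so the 0.87 accuracy is a smaller gain than it first appears.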




In [113]:
params_xgb = {
        'eta': [0.05, 0.1, 0.3],
        'min_child_weight': [0.5, 1, 5],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'max_depth': [5, 10, 30],
        'subsample': [0.8, 0.9, 1]
        }

grid_search_xgb = model_selection.GridSearchCV(
    estimator=xgb_model_em,
    param_grid=params_xgb,
    scoring='f1',
    cv=5,
    verbose=1
)

grid_search_xgb.fit(X_train_em, y_train_em)

print(grid_search_xgb.best_estimator_)

Fitting 5 folds for each of 405 candidates, totalling 2025 fits



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eta=0.05, gamma=5, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.100000001,
              max_delta_step=0, max_depth=30, min_child_weight=0.5, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.8,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=0)
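
Note that `GridSearchCV` works on clones of the estimator, so `xgb_model_em` itself is not changed by the search; the tuned model lives in `best_estimator_`. The sketch below shows how the winning parameters and their cross-validated score could be read back and the refit model reused. It uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a lightweight stand-in for `XGBClassifier`, with a deliberately tiny grid, so the names and values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train_em / y_train_em.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=42),
    param_grid={"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    scoring="f1",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)           # winning combination from the grid
print(round(search.best_score_, 3))  # mean cross-validated F1 of that combination

# With the default refit=True, best_estimator_ is already refit on all of X, y.
tuned_model = search.best_estimator_
print(tuned_model.predict(X[:5]))
```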


In [112]:
metrics.get_scorer_names()

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'matthews_corrcoef',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_absolute_percentage_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'rand_score',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_

In [240]:
# metrics on the test split (this reuses X_test_em; the held-out X_val_em is not evaluated here)

y_val_pred_em = pd.DataFrame(xgb_model_em.predict(X_test_em), index = y_test_em.index, columns = ['pred'])
results_val_em = pd.DataFrame(y_test_em).join(y_val_pred_em, how = 'inner')

# metrics
Accuracy_val_em = metrics.accuracy_score(results_val_em['target_em'], results_val_em['pred'])
Precision_val_em = metrics.precision_score(results_val_em['target_em'], results_val_em['pred'])
Recall_val_em = metrics.recall_score(results_val_em['target_em'], results_val_em['pred'])
rf_f1_val_em = metrics.f1_score(y_test_em, y_val_pred_em)
print("Accuracy: ", Accuracy_val_em)
print("Precision: ", Precision_val_em)
print("Recall: ", Recall_val_em)
print("F1 score xgb: ",rf_f1_val_em)

Accuracy:  0.8664333624963546
Precision:  0.8961006910167818
Recall:  0.942823016202742
F1 score xgb:  0.918868306508756


In [241]:
# probabilities for the 4733 customers with no products

# get the customers for recommending the em account, the first 4733
recommend_em_df = final_10k_no_prod.iloc[:4733]
recommend_em_df_targets = recommend_em_df[['target_dc','target_em']]
# drop() returns a new frame, which avoids pandas' SettingWithCopyWarning
recommend_em_df = recommend_em_df.drop(['target_dc','target_em'], axis=1)

# predict proba
proba_em = xgb_model_em.predict_proba(recommend_em_df)


In [245]:
proba_em_df = pd.DataFrame(data = proba_em, index=final_10k_no_prod.iloc[:4733].index, columns=['class1','class2'])

In [243]:
final_10k_proba = pd.concat([final_10k_no_prod,proba_em_df['class2']], axis=1)

In [244]:
final_10k_proba.loc[final_10k_proba['class2']>0.5].iloc[:4733]

pk_cid,revenue,months,target_dc,target_em,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80,class2
1261703,1120,16,0,0,0,1,0,0,0,0,0,0,0.95
1260310,1120,16,0,0,0,0,1,0,0,0,0,0,0.96
1239232,1120,16,0,0,0,0,0,0,1,0,0,0,0.96
1032738,1110,16,0,0,0,1,0,0,0,0,0,0,0.67
1008244,1110,16,0,0,0,0,1,0,0,0,0,0,0.74
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1323949,0,0,0,0,0,0,1,0,0,0,0,0,0.94
1393316,0,0,0,0,0,0,0,0,0,0,1,0,0.93
1382923,0,0,0,0,0,0,0,0,1,0,0,0,0.91
1375266,0,0,0,0,1,0,0,0,0,0,0,0,0.55


In [246]:
proba_em_df.describe()

,class1,class2
count,4733.0,4733.0
mean,0.74,0.26
std,0.33,0.33
min,0.0,0.01
25%,0.48,0.01
50%,0.96,0.04
75%,0.99,0.52
max,0.99,1.0


In [247]:
xgb_model_em.classes_

array([0, 1], dtype=uint8)
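
`classes_` confirms that the second `predict_proba` column (named `class2` above) is the probability of class 1, that is, of holding the product. One way to make this explicit is to build the column names from `classes_` itself. The sketch below uses a toy `LogisticRegression` rather than the XGBoost model, and the `p_class_` prefix is just an illustrative naming choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: larger feature values belong to class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
index = pd.Index([11, 12, 13, 14, 15, 16], name="pk_cid")

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# Column j of predict_proba corresponds to clf.classes_[j].
proba_df = pd.DataFrame(
    proba, index=index, columns=[f"p_class_{c}" for c in clf.classes_]
)
print(proba_df.round(2))
```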

### 2. Recommend debit card

I will drop the em account target (`target_em`) for this one.

In [248]:
model_df_dc = model_df_no_prod.drop('target_em', axis=1)
model_df_dc.head()

pk_cid,revenue,months,target_dc,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80
1128353,4580,16,1,0,0,0,0,1,0,0,0
1116675,4560,16,1,0,0,0,0,0,1,0,0
1136671,4420,16,1,0,0,0,1,0,0,0,0
1070525,4200,16,1,0,0,0,1,0,0,0,0
1133500,4160,16,1,0,0,1,0,0,0,0,0


In [249]:
X_train_dc, X_dev_dc, y_train_dc, y_dev_dc = model_selection.train_test_split(
    model_df_dc.drop('target_dc',axis=1),
    model_df_dc['target_dc'],
    test_size=0.3,
    random_state=42
)

In [250]:
print(model_df.shape)
print(X_train_dc.shape)
print(X_dev_dc.shape)

(160017, 23)
(112011, 10)
(48006, 10)


In [251]:
X_test_dc, X_val_dc, y_test_dc, y_val_dc = model_selection.train_test_split(
    X_dev_dc,
    y_dev_dc,
    test_size = 0.5,
    random_state=42
)

In [252]:
print(X_test_dc.shape)
print(X_val_dc.shape)

(24003, 10)
(24003, 10)


#### Model

In [253]:
split_dict_dc = {
    "TRAINING": [X_train_dc, y_train_dc],
    "TESTING": [X_test_dc, y_test_dc]
}

In [254]:
# instantiate model
xgb_model_dc = xgb.XGBClassifier(
    eta = 0.1,
    max_depth = 30,
    min_child_weight = 0.5,
    gamma = 5,
    random_state = 42,
    verbosity=0,
    use_label_encoder=False
)

# train 
xgb_model_dc.fit(X = X_train_dc, y = y_train_dc)



In [255]:
print("################## debit card ##################")
print("\n____________ SCORES & EVALUATIONS ____________\n")
print("#################### RESULTS ###################")


for split_name, (X_split, y_split) in split_dict_dc.items():
    pred = xgb_model_dc.predict(X_split)
    confusion_matrix = metrics.confusion_matrix(y_split, pred)
    tn, fp, fn, tp = confusion_matrix.ravel()
    Accuracy = metrics.accuracy_score(y_split, pred)
    Precision = metrics.precision_score(y_split, pred)
    Recall = metrics.recall_score(y_split, pred)
    F_1_Score = metrics.f1_score(y_split, pred)

    # class-1 probabilities; the AUC and ROC points are computed but not printed
    probs = xgb_model_dc.predict_proba(X_split)[:, 1]
    probs_mean = round(probs.mean() * 100, 2)
    auc_score = roc_auc_score(y_split, probs)
    fpr, tpr, thresholds = roc_curve(y_split, probs)

    # positive / negative predictive value, in percent
    PPV, NPV = ((tp / (tp + fp)) * 100), ((tn / (fn + tn)) * 100)

    print(f"#################### {split_name} ####################")
    print(f"Accuracy: {round(Accuracy, 5)} | Precision: {round(Precision, 5)} | Recall: {round(Recall, 5)} | F1_Score: {round(F_1_Score, 5)}")
    print(f"TN = {tn} | FN = {fn} | TP = {tp} | FP = {fp}")
    print(f"Positive prediction value: {round(PPV, 2)}% | Negative prediction value: {round(NPV, 2)}%")
    print("########## TOP FEATURES ##########")
    # importances belong to the fitted model, so they are identical for every split
    top_features = pd.Series(xgb_model_dc.feature_importances_, index = X_split.columns).sort_values(ascending = False).head()
    print(top_features)
    print("\n")

################## debit card ##################

____________ SCORES & EVALUATIONS ____________

#################### RESULTS ###################
#################### TRAINING ####################
Accuracy: 0.90196 | Precision: 0.82525 | Recall: 0.80684 | F1_Score: 0.81594
TN = 76690 | FN = 5827 | TP = 24340 | FP = 5154
Positive prediction value: 82.53% | Negative prediction value: 92.94%
########## TOP FEATURES ##########
revenue     0.47
months      0.20
age_18-30   0.12
age_31-40   0.07
age_41-50   0.04
dtype: float32


#################### TESTING ####################
Accuracy: 0.89914 | Precision: 0.81825 | Recall: 0.79947 | F1_Score: 0.80875
TN = 16463 | FN = 1284 | TP = 5119 | FP = 1137
Positive prediction value: 81.83% | Negative prediction value: 92.76%
########## TOP FEATURES ##########
revenue     0.47
months      0.20
age_18-30   0.12
age_31-40   0.07
age_41-50   0.04
dtype: float32




In [81]:
params_xgb = {
        'eta': [0.05, 0.1, 0.3],
        'min_child_weight': [0.5, 1, 5],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'max_depth': [5, 10, 30],
        'subsample': [0.8, 0.9, 1]
        }

grid_search_xgb = model_selection.GridSearchCV(
    estimator=xgb_model_dc,
    param_grid=params_xgb,
    scoring='recall',
    cv=5,
    verbose=1
)

grid_search_xgb.fit(X_train_dc, y_train_dc)

print(grid_search_xgb.best_estimator_)

Fitting 5 folds for each of 405 candidates, totalling 2025 fits



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eta=0.05, gamma=5, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.100000001,
              max_delta_step=0, max_depth=30, min_child_weight=0.5, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=0)


In [256]:
# metrics on the test split (this reuses X_test_dc; the held-out X_val_dc is not evaluated here)

y_val_pred_dc = pd.DataFrame(xgb_model_dc.predict(X_test_dc), index = y_test_dc.index, columns = ['pred'])
results_val_dc = pd.DataFrame(y_test_dc).join(y_val_pred_dc, how = 'inner')

# metrics
Accuracy_val_dc = metrics.accuracy_score(results_val_dc['target_dc'], results_val_dc['pred'])
Precision_val_dc = metrics.precision_score(results_val_dc['target_dc'], results_val_dc['pred'])
Recall_val_dc = metrics.recall_score(results_val_dc['target_dc'], results_val_dc['pred'])
rf_f1_val_dc = metrics.f1_score(y_test_dc, y_val_pred_dc)
print("Accuracy: ", Accuracy_val_dc)
print("Precision: ", Precision_val_dc)
print("Recall: ", Recall_val_dc)
print("F1 score xgb: ",rf_f1_val_dc)

Accuracy:  0.8991376077990251
Precision:  0.8182544757033248
Recall:  0.7994689989067625
F1 score xgb:  0.8087526660873686
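
These metrics come from `predict`, which for a binary classifier in effect cuts the class-1 probability at 0.5. Since the recommendation step works with probabilities anyway, the cut-off could be moved to trade precision against recall. The sketch below shows the mechanics on a handful of made-up labels and probabilities; the values are illustrative only and do not come from the model.

```python
import numpy as np

# Toy labels and predicted class-1 probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_pos = np.array([0.10, 0.35, 0.40, 0.45, 0.55, 0.70, 0.80, 0.90])

def precision_recall_at(threshold):
    """Precision and recall when predicting 1 for p_pos >= threshold."""
    pred = p_pos >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold catches more true buyers at the cost of more false positives, and raising it does the opposite.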


In [257]:
# probabilities for the remaining 5267 customers (debit card)

# get the customers 
recommend_dc_df = final_10k_no_prod.iloc[4733:]
recommend_dc_df_targets = recommend_dc_df[['target_dc','target_em']]
# drop() returns a new frame, which avoids pandas' SettingWithCopyWarning
recommend_dc_df = recommend_dc_df.drop(['target_dc','target_em'], axis=1)

# predict proba
proba_dc = xgb_model_dc.predict_proba(recommend_dc_df)


In [258]:
proba_dc_df = pd.DataFrame(data = proba_dc, index=final_10k_no_prod.iloc[4733:].index, columns=['class1','class2'])

In [259]:
final_10k_proba = pd.concat([final_10k_no_prod,proba_dc_df['class2']], axis=1)

In [260]:
proba_df = pd.concat([proba_em_df,proba_dc_df])

In [261]:
proba_df

pk_cid,class1,class2
1045535,0.89,0.11
1116106,0.80,0.20
1020461,0.83,0.17
1119050,0.54,0.46
1209899,0.78,0.22
...,...,...
1530058,0.83,0.17
1530046,0.83,0.17
1530044,0.88,0.12
1530043,0.94,0.06


In [262]:
final_10k_proba = pd.concat([final_10k_no_prod,proba_df['class2']], axis=1)

In [263]:
final_10k_proba.loc[final_10k_proba['class2']>0.5].iloc[:4733]

pk_cid,revenue,months,target_dc,target_em,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80,class2
1261703,1120,16,0,0,0,1,0,0,0,0,0,0,0.95
1260310,1120,16,0,0,0,0,1,0,0,0,0,0,0.96
1239232,1120,16,0,0,0,0,0,0,1,0,0,0,0.96
1032738,1110,16,0,0,0,1,0,0,0,0,0,0,0.67
1008244,1110,16,0,0,0,0,1,0,0,0,0,0,0.74
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1536847,20,2,0,1,0,1,0,0,0,0,0,0,0.54
1534472,20,2,0,1,0,0,1,0,0,0,0,0,0.72
1534469,20,2,0,1,0,0,0,0,1,0,0,0,0.71
1534468,20,2,0,1,0,0,1,0,0,0,0,0,0.72


In [264]:
em_value = ['em_account']*4733
dc_value = ['debit_card']*5267
recommendation = em_value + dc_value
print(set(recommendation[:4733]))
print(set(recommendation[4733:]))

{'em_account'}
{'debit_card'}
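
Building the labels by position only works while the first 4733 rows are exactly the customers without products. A less order-dependent alternative is to derive the label from each row's own holdings. The sketch below assumes, as described earlier, that customers without an em account get the em account recommendation and everyone else gets the debit card; the toy frame stands in for `final_10k_no_prod`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for final_10k_no_prod: customers without an em account first.
toy = pd.DataFrame(
    {"target_em": [0, 0, 1, 1, 1], "target_dc": [0, 0, 0, 0, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="pk_cid"),
)

# Label each row from its own holdings rather than from its position.
toy["recommendation"] = np.where(toy["target_em"] == 0, "em_account", "debit_card")
print(toy["recommendation"].value_counts())
```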


In [265]:
final_10k_proba['recommendation'] = recommendation

In [266]:
final_10k_proba

pk_cid,revenue,months,target_dc,target_em,age_u18,age_18-30,age_31-40,age_41-50,age_51-60,age_61-70,age_71-80,age_o80,class2,recommendation
1045535,2430,15,0,0,0,0,1,0,0,0,0,0,0.11,em_account
1116106,2260,16,0,0,0,0,0,1,0,0,0,0,0.20,em_account
1020461,2260,16,0,0,0,0,1,0,0,0,0,0,0.17,em_account
1119050,2030,16,0,0,0,0,1,0,0,0,0,0,0.46,em_account
1209899,1950,16,0,0,0,0,1,0,0,0,0,0,0.22,em_account
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1530058,30,3,0,1,0,0,1,0,0,0,0,0,0.17,debit_card
1530046,30,3,0,1,0,0,1,0,0,0,0,0,0.17,debit_card
1530044,30,3,0,1,0,0,0,1,0,0,0,0,0.12,debit_card
1530043,30,3,0,1,0,1,0,0,0,0,0,0,0.06,debit_card
