
## **Adopter Prediction Challenge**

 ~ Ankita, Ashok, Kaydee, Young
 
 ---

Website XYZ, a music-listening social networking website, follows the “freemium” business model. The website offers basic services for free, and provides a number of additional premium capabilities for a monthly subscription fee. We are interested in predicting which people would be likely to convert from free users to premium subscribers in the next 6 month period, if they are targeted by our promotional campaign.

### Dataset

We have a dataset from the previous marketing campaign which targeted a number of non-subscribers.

Features: 

```
1.   adopter (predictor class)
2.   user_id
3.   age
4.   male
5.   friend_cnt
6.   avg_friend_age
7.   avg_friend_male
8.   friend_country_cnt
9.   subscriber_friend_cnt
10.   songsListened
11.   lovedTracks
12.   posts
13.   playlists
14.   shouts
15.   good_country
16.   tenure
17.   *other delta variables*
```



### Task

The task is to build the best predictive model for the next marketing campaign, i.e., for predicting likely `adopters` (that is, which current non- subscribers are likely to respond to the marketing campaign and sign up for the premium service within 6 months after the campaign).

---

In [0]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from google.colab import drive

from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
import xgboost as xgb

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score, roc_curve, classification_report, recall_score, f1_score, accuracy_score, precision_score

In [0]:
# setting fixed seed value for consistency in results
seed = 7
np.random.seed(seed)

In [4]:
# original dataset
data = pd.read_csv('https://drive.google.com/uc?export=view&id=1wctM0dYDj839zp6sTlFnDgCmFspXhDuW')
data.head()

Unnamed: 0,user_id,age,male,friend_cnt,avg_friend_age,avg_friend_male,friend_country_cnt,subscriber_friend_cnt,songsListened,lovedTracks,...,delta_subscriber_friend_cnt,delta_songsListened,delta_lovedTracks,delta_posts,delta_playlists,delta_shouts,tenure,good_country,delta_good_country,adopter
0,10,24,0,20,26.333333,0.777778,6,0,37804,4,...,0,54,0,0,0,0,79,0,0,0
1,58,29,1,12,26.9,0.818182,6,1,15955,19,...,0,802,0,0,0,1,80,0,0,0
2,72,22,0,4,21.0,1.0,2,0,31441,7,...,0,0,0,0,0,0,53,0,0,0
3,121,27,0,1,29.0,1.0,1,0,0,0,...,0,0,0,0,0,0,59,0,0,0
4,137,22,1,4,21.25,0.75,1,0,774,0,...,0,0,0,0,0,0,60,0,0,0


 None of the features are categorical, let's skip feature engineering for now just like our previous models. Let's undersample the data in favour of minority class.



In [5]:
# fetching the indices of minority instances
adopting_indices = np.array(data[data.adopter == 1].index)

# fetching indices of normal instances
non_adopting_indices = data[data.adopter == 0].index

# randomly select 1540 normal instances to create a partitioned balanced dataset
random_non_adopting_indices = np.random.choice(non_adopting_indices,
                                            5040,
                                            replace = False)
random_non_adopting_indices = np.array(random_non_adopting_indices)

# combining both the instance groups (minority and the new random set) 
undersampled_indices = np.concatenate([adopting_indices, random_non_adopting_indices])

# creating the undersampled dataset
undersampled_data = data.iloc[undersampled_indices, :]

# storing the features(X) and predictor class(y)
X_undersample = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
y_undersample = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

print("Number of minority instances: {}\nNumber of normal instances: {} \nTotal: {}".format(len(undersampled_data[undersampled_data.adopter == 1]), 
                                                                                           len(undersampled_data[undersampled_data.adopter == 0]),
                                                                                           len(undersampled_data)))

Number of minority instances: 1540
Number of normal instances: 5040 
Total: 6580


In [6]:
# some housekeeping for metrics
recalls = {}
f1s = {}
precisions = {}
accuracies = {}

# splitting original dataset into features and predictor
X = undersampled_data.iloc[:, data.columns != 'adopter']
y = undersampled_data.iloc[:, data.columns == 'adopter']

# splitting the original dataset for cross-validation (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print ("Undersampled Data:")
print ("Number of train instances: {}".format(len(X_train)))
print ("Number of test instances: {}".format(len(X_test)))

Undersampled Data:
Number of train instances: 4606
Number of test instances: 1974


## First level (Base) learners

We'll use following base learners (more the merrier?):



1.   Random Forest
2.   Extra Trees
3.   AdaBoost
4.   Gradient Boosting
5.   SVM

Among the above, we've already trained RF and SVM and they both seem to perform well indivudually.



### Setting the parameters

We dict the parameters for each model for readability later

In [0]:
random_forest_params = {
    'n_jobs': -1,
    'n_estimators': 200,
    'min_samples_leaf': 2,
    'max_depth': 5,
    'class_weight' : "balanced",
    'max_features' : 'sqrt',
    'verbose': 0
}

extra_trees_params = {
    'n_jobs': -1,
    'n_estimators':200,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

adaboost_params = {
    'n_estimators': 200,
    'learning_rate' : 0.75
}

gradient_boosting_params = {
    'n_estimators': 200,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# polynomial kernels have performed worse, so let's go with linear
svm_params = {
    'kernel' : 'rbf',
    'C' : 0.025
    }

### Training the base learners

In [8]:
# random forest
rf = RandomForestClassifier(**random_forest_params)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

# extra trees
et = ExtraTreesClassifier(**extra_trees_params)
et.fit(X_train, y_train)
et_pred = et.predict(X_test)

# adaboost
ab = AdaBoostClassifier(**adaboost_params)
ab.fit(X_train, y_train)
ab_pred = ab.predict(X_test)

# gradient boosting
gb = GradientBoostingClassifier(**gradient_boosting_params)
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)

# svm 
svm = SVC(**svm_params)
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)


  
  import sys
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


In [25]:
rf_pred_df = pd.DataFrame({'rf_pred': rf_pred})
et_pred_df = pd.DataFrame({'et_pred': et_pred})
ab_pred_df = pd.DataFrame({'ab_pred': ab_pred})
gb_pred_df = pd.DataFrame({'gb_pred': gb_pred})
svm_pred_df = pd.DataFrame({'svm_pred': svm_pred})

all_preds = pd.concat([rf_pred_df, et_pred_df, ab_pred_df, gb_pred_df, svm_pred_df], axis=1)
all_preds['temp_index'] = range(1, len(all_preds) + 1)

X_test_cp = X_test
X_test_cp['temp_index'] = range(1, len(X_test_cp) + 1)

y_test_cp = y_test
y_test_cp['temp_index'] = range(1, len(y_test_cp) + 1)

X_test_l2 = X_test_cp.merge(all_preds, on='temp_index')
X_test_l2 = X_test_cp.merge(y_test_cp, on='temp_index')

X_test_l2.drop('temp_index', axis = 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,user_id,age,male,friend_cnt,avg_friend_age,avg_friend_male,friend_country_cnt,subscriber_friend_cnt,songsListened,lovedTracks,...,delta_subscriber_friend_cnt,delta_songsListened,delta_lovedTracks,delta_posts,delta_playlists,delta_shouts,tenure,good_country,delta_good_country,adopter
0,38941,25,0,2,28.500000,1.000000,1,1,527,12,...,0,0,0,0,0,0,40,0,0,0
1,414539,19,1,3,20.666667,0.666667,1,0,404,2,...,0,0,0,0,0,0,23,0,0,0
2,325435,19,1,5,21.000000,0.750000,1,0,3515,105,...,0,1090,0,0,0,0,7,1,0,0
3,490579,41,1,16,36.714286,0.222222,3,1,14453,554,...,1,275,6,0,0,0,42,1,0,1
4,333652,19,0,5,20.000000,0.200000,1,0,1721,4,...,0,0,0,0,0,0,26,0,0,0
5,872342,24,1,71,24.250000,0.378788,26,1,20499,177,...,0,747,2,0,0,2,64,0,0,1
6,549996,18,1,6,20.500000,0.833333,1,0,5630,0,...,0,0,0,0,0,0,48,1,0,0
7,552972,47,1,13,43.500000,0.583333,8,2,33616,56,...,0,0,0,0,0,4,25,0,0,1
8,1468644,15,0,6,20.166667,0.666667,2,0,116,16,...,0,0,0,0,0,0,24,0,0,0
9,517313,35,1,17,28.454545,0.500000,8,1,49347,134,...,0,2715,4,0,0,1,51,0,0,1


In [26]:
all_preds.head()

Unnamed: 0,rf_pred,et_pred,ab_pred,gb_pred,svm_pred,temp_index
0,0,0,0,0,0,1
1,0,0,0,0,0,2
2,0,0,0,0,0,3
3,1,0,1,1,0,4
4,0,0,0,0,0,5


## Training level 2 learner

In [0]:
# splitting new dataset into features and predictor
X2 = undersampled_data.iloc[:, data.columns != 'adopter']
y2 = undersampled_data.iloc[:, data.columns == 'adopter']

# splitting the original dataset for cross-validation (0.7 train, 0.3 test)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X2, y2, test_size = 0.2, random_state = 0)

dtrain = xgb.DMatrix(X_train_2, y_train_2)
dtest = xgb.DMatrix(X_test_2, y_test_2)

In [24]:
X_test_2

Unnamed: 0,user_id,age,male,friend_cnt,avg_friend_age,avg_friend_male,friend_country_cnt,subscriber_friend_cnt,songsListened,lovedTracks,...,delta_friend_country_cnt,delta_subscriber_friend_cnt,delta_songsListened,delta_lovedTracks,delta_posts,delta_playlists,delta_shouts,tenure,good_country,delta_good_country
1959,38941,25,0,2,28.500000,1.000000,1,1,527,12,...,0,0,0,0,0,0,0,40,0,0
20995,414539,19,1,3,20.666667,0.666667,1,0,404,2,...,0,0,0,0,0,0,0,23,0,0
16456,325435,19,1,5,21.000000,0.750000,1,0,3515,105,...,0,0,1090,0,0,0,0,7,1,0
24948,490579,41,1,16,36.714286,0.222222,3,1,14453,554,...,0,1,275,6,0,0,0,42,1,0
16876,333652,19,0,5,20.000000,0.200000,1,0,1721,4,...,0,0,0,0,0,0,0,26,0,0
44410,872342,24,1,71,24.250000,0.378788,26,1,20499,177,...,1,0,747,2,0,0,2,64,0,0
27945,549996,18,1,6,20.500000,0.833333,1,0,5630,0,...,0,0,0,0,0,0,0,48,1,0
28103,552972,47,1,13,43.500000,0.583333,8,2,33616,56,...,0,0,0,0,0,0,4,25,0,0
74398,1468644,15,0,6,20.166667,0.666667,2,0,116,16,...,0,0,0,0,0,0,0,24,0,0
26306,517313,35,1,17,28.454545,0.500000,8,1,49347,134,...,0,0,2715,4,0,0,1,51,0,0


In [18]:
num_rounds = 50

params = {
    'max_depth': 3,
    'eta': 0.1,
    'objective': 'binary:logistic',
    'seed': 7
}

test_train_split = [(dtest, 'test'), (dtrain, 'train')]

boost = xgb.train(params,
                 dtrain,
                 num_rounds, 
                 test_train_split)

[0]	test-error:0.206687	train-error:0.220175
[1]	test-error:0.206687	train-error:0.220365
[2]	test-error:0.200608	train-error:0.215236
[3]	test-error:0.199848	train-error:0.214856
[4]	test-error:0.199848	train-error:0.214286
[5]	test-error:0.199088	train-error:0.214096
[6]	test-error:0.200608	train-error:0.216185
[7]	test-error:0.201368	train-error:0.216945
[8]	test-error:0.203647	train-error:0.216565
[9]	test-error:0.199088	train-error:0.214856
[10]	test-error:0.201368	train-error:0.215426
[11]	test-error:0.202888	train-error:0.215615
[12]	test-error:0.202888	train-error:0.215995
[13]	test-error:0.199848	train-error:0.214096
[14]	test-error:0.198328	train-error:0.214286
[15]	test-error:0.195289	train-error:0.214666
[16]	test-error:0.193769	train-error:0.213336
[17]	test-error:0.192249	train-error:0.212006
[18]	test-error:0.190729	train-error:0.210486
[19]	test-error:0.191489	train-error:0.211246
[20]	test-error:0.190729	train-error:0.210296
[21]	test-error:0.190729	train-error:0.20896

In [19]:
y_pred = boost.predict(dtest)
y_pred[y_pred > 0.5] = 1
y_pred[y_pred <= 0.5] = 0

# acc_val = accuracy_score(y_pred, y_test)
# f1_val = f1_score(y_pred, y_test)
# recall_val = recall_score(y_pred, y_test)
# prec_val = precision_score(y_pred, y_test)

print (accuracy_score(y_pred, y_test))
print (f1_score(y_pred, y_test))
print (recall_score(y_pred, y_test))
print (precision_score(y_pred, y_test))

recalls.update({len(undersampled_data[undersampled_data.adopter == 0]) : recall_score(y_pred, y_test)})
f1s.update({len(undersampled_data[undersampled_data.adopter == 0]) : f1_score(y_pred, y_test)})
precisions.update({len(undersampled_data[undersampled_data.adopter == 0]) : precision_score(y_pred, y_test)})
accuracies.update({len(undersampled_data[undersampled_data.adopter == 0]) : accuracy_score(y_pred, y_test)})

print(recalls)
print(f1s)
print(precisions)
print(accuracies)

ValueError: ignored

## Gotta stack 'em all

In [0]:
# base models
base_models = [RandomForestClassifier(n_estimators=50, n_jobs=-1, criterion='gini'),
               RandomForestClassifier(n_estimators=50, n_jobs=-1, criterion='entropy'),
               ExtraTreesClassifier(n_estimators=500, n_jobs=-1, criterion='gini')]

# blending models
blending_model = LogisticRegression()

sg = importlib.import_module("path.to.my-module")

# initialize multi-stage model
sg_model = StackedGeneralizer(base_models, 
                              blending_model, 
	                            n_folds=5, 
                              verbose=True)

# fit the stacked models
sg.fit(X_train,y_train)

y_pred = sg.predict(X_test)
pred_classes = [np.argmax(p) for p in y_pred]

_ = sg.evaluate(y_test, pred_classes)

In [0]:
print (accuracy_score(y_pred, y_test))
print (f1_score(y_pred, y_test))
print (recall_score(y_pred, y_test))
print (precision_score(y_pred, y_test))

recalls.update({len(undersampled_data[undersampled_data.adopter == 0]) : recall_score(y_pred, y_test)})
f1s.update({len(undersampled_data[undersampled_data.adopter == 0]) : f1_score(y_pred, y_test)})
precisions.update({len(undersampled_data[undersampled_data.adopter == 0]) : precision_score(y_pred, y_test)})
accuracies.update({len(undersampled_data[undersampled_data.adopter == 0]) : accuracy_score(y_pred, y_test)})

print(recalls)
print(f1s)
print(precisions)
print(accuracies)

In [0]:
# # predictions on unlabelled set
# unseen_data = pd.read_csv('https://drive.google.com/uc?export=view&id=1yVPwqGQC2gkhF2bcbue9j3184ryAJRtG')
# unseen_data = xgb.DMatrix(unseen_data)

# y_pred = boost.predict(unseen_data)
# y_pred[y_pred > 0.5] = 1
# y_pred[y_pred <= 0.5] = 0

# y_pred = pd.DataFrame({'Adopters': y_pred })

In [0]:
# testing the model on provided test dataset
# np.savetxt("predictions.csv", y_pred , delimiter=",")
# from google.colab import files
# files.download('predictions.csv')

In [0]:
unseen_data = pd.read_csv('https://drive.google.com/uc?export=view&id=1yVPwqGQC2gkhF2bcbue9j3184ryAJRtG')

rf_pred = rf.predict(unseen_data)
et_pred = et.predict(unseen_data)
ab_pred = ab.predict(unseen_data)
gb_pred = gb.predict(unseen_data)
svm_pred = svm.predict(unseen_data)

rf_pred_df = pd.DataFrame({'rf_pred': rf_pred})
et_pred_df = pd.DataFrame({'et_pred': et_pred})
ab_pred_df = pd.DataFrame({'ab_pred': ab_pred})
gb_pred_df = pd.DataFrame({'gb_pred': gb_pred})
svm_pred_df = pd.DataFrame({'svm_pred': svm_pred})

all_preds = pd.concat([rf_pred_df, et_pred_df, ab_pred_df, gb_pred_df, svm_pred_df], axis=1)
all_preds['temp_index'] = range(1, len(all_preds) + 1)

np.savetxt("all_predictions.csv", all_preds , delimiter=",")
from google.colab import files
files.download('all_predictions.csv')