
## **Adopter Prediction Challenge**

 ~ Ankita, Ashok, Kaydee, Young
 
 ---

Website XYZ, a music-listening social networking website, follows the “freemium” business model. The website offers basic services for free and provides a number of additional premium capabilities for a monthly subscription fee. We are interested in predicting which free users are likely to convert to premium subscribers in the next 6-month period if they are targeted by our promotional campaign.

### Dataset

We have a dataset from the previous marketing campaign which targeted a number of non-subscribers.

Features: 

```
1.   adopter (target class)
2.   user_id
3.   age
4.   male
5.   friend_cnt
6.   avg_friend_age
7.   avg_friend_male
8.   friend_country_cnt
9.   subscriber_friend_cnt
10.   songsListened
11.   lovedTracks
12.   posts
13.   playlists
14.   shouts
15.   good_country
16.   tenure
17.   *other delta variables*
```



### Task

The task is to build the best predictive model for the next marketing campaign, i.e., for predicting likely `adopters` (that is, which current non-subscribers are likely to respond to the marketing campaign and sign up for the premium service within 6 months after the campaign).

---

In [0]:
!pip3 install scikit-learn



In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score, roc_curve, classification_report, recall_score, f1_score, accuracy_score, precision_score
from sklearn.ensemble import RandomForestClassifier


In [0]:
# setting fixed seed value for consistency in results
seed = 7
np.random.seed(seed)

In [3]:
# original dataset
data = pd.read_csv('https://drive.google.com/uc?export=view&id=1wctM0dYDj839zp6sTlFnDgCmFspXhDuW')

# data.columns

data.adopter.value_counts()

0    85142
1     1540
Name: adopter, dtype: int64
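
The counts above imply a heavily imbalanced target. As a quick check of the arithmetic (using only the two numbers printed above), the following sketch shows why accuracy alone would be a misleading metric here:

```python
# Class counts copied from the value_counts() output above
counts = {0: 85142, 1: 1540}

total = sum(counts.values())
minority_share = counts[1] / total        # fraction of adopters
imbalance_ratio = counts[0] / counts[1]   # non-adopters per adopter
majority_baseline = counts[0] / total     # accuracy of always predicting 0

print("Total instances: {}".format(total))
print("Adopter share: {:.2%}".format(minority_share))
print("Non-adopters per adopter: {:.1f}".format(imbalance_ratio))
print("Accuracy of always predicting 0: {:.2%}".format(majority_baseline))
```

A model that never predicts an adopter would already score above 98% accuracy, which is why recall, precision and F1 are tracked alongside accuracy in the cells that follow.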

Checking whether any feature (especially `adopter`) needs to be encoded as an integer.

In [4]:
# some housekeeping for metrics
recalls = {}
f1s = {}
precisions = {}
accuracies = {}

# splitting original dataset into features (X) and target (y)
X = data.iloc[:, data.columns != 'adopter']
y = data.iloc[:, data.columns == 'adopter']

# hold-out split of the original dataset (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print ("Number of train instances: {}".format(len(X_train)))
print ("Number of test instances: {}".format(len(X_test)))

Number of train instances: 60677
Number of test instances: 26005


## SMOTE splitting

We'll use SMOTE (Synthetic Minority Oversampling Technique) to synthesize more samples of the minority class. The recall score we got earlier may have been low because we imputed more than 80% of the data to balance the dataset.

If we applied SMOTE to the entire dataset, we would be synthesizing around 58000 new minority instances, which would not introduce enough variation in the data for the models to learn from.

Instead, we include only a subset of the majority class (4000 instances in this plan; the run below samples 6000) and synthesize 4000 - 1540 = 2460 new minority instances using SMOTE. That should (hopefully) keep our models from overfitting.
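
To make the idea concrete, here is a minimal NumPy sketch of the interpolation step at the heart of SMOTE: each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority neighbours. The function name `smote_sketch`, its parameters and the toy data are assumptions made for illustration; this is a simplified stand-in, not the imbalanced-learn implementation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbour

    # indices of the k nearest minority neighbours of every sample
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # pick a random base sample and one of its neighbours for each new row
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # interpolate a random fraction of the way from base to neighbour
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# toy minority class: 20 points with 2 features (hypothetical data)
X_minority = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_sketch(X_minority, n_new=30)
print(synthetic.shape)
```

Because every synthetic row is a convex combination of two real minority rows, SMOTE cannot create points outside the region the minority class already occupies. This is the reason a very large synthetic-to-real ratio adds little new variation.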

In [5]:
# fetching the indices of minority instances
adopting_indices = np.array(data[data.adopter == 1].index)

# fetching indices of normal instances
non_adopting_indices = data[data.adopter == 0].index

# randomly select 6000 normal instances to create a smaller, less imbalanced dataset
random_non_adopting_indices = np.random.choice(non_adopting_indices,
                                            6000,
                                            replace = False)
random_non_adopting_indices = np.array(random_non_adopting_indices)

# combining both the instance groups (minority and the new random set) 
undersampled_indices = np.concatenate([adopting_indices, random_non_adopting_indices])

# creating the undersampled dataset
undersampled_data = data.iloc[undersampled_indices, :]

# shuffling the set
undersampled_data = shuffle(undersampled_data)

# storing the features (X) and target class (y)
X_undersample = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
y_undersample = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

print("Number of minority instances: {}\nNumber of normal instances: {} \nTotal: {}".format(len(undersampled_data[undersampled_data.adopter == 1]), 
                                                                                           len(undersampled_data[undersampled_data.adopter == 0]),
                                                                                           len(undersampled_data)))

Number of minority instances: 1540
Number of normal instances: 6000 
Total: 7540


In [6]:
# splitting the undersampled dataset into features (X) and target (y)
X = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
y = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

# hold-out split of the undersampled dataset (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print ("Undersampled Data:")
print ("Number of train instances: {}".format(len(X_train)))
print ("Number of test instances: {}".format(len(X_test)))

Undersampled Data:
Number of train instances: 5278
Number of test instances: 2262


In [0]:
# from imblearn.over_sampling import SMOTE
# sm = SMOTE(random_state = 12)
# X_train_smoted_np, y_train_smoted_np = sm.fit_resample(X_train, y_train.values.ravel())
# print(type(X_train_smoted_np))

In [0]:
# # checking the lengths of new training set

# print ("Number of SMOTEd instances: {}".format(len(X_train_smoted_np)))

# X_train.head()
# y_train_smoted_non_adopters = y_train_smoted_np[y_train_smoted_np == 0]
# y_train_smoted_adopters = y_train_smoted_np[y_train_smoted_np == 1]

# print ("Number of SMOTEd non-adopters (adopter = 0): {}".format(len(y_train_smoted_non_adopters)))
# print ("Number of SMOTEd adopters (adopter = 1): {}".format(len(y_train_smoted_adopters)))

We now have around 2792 instances of each class, which is better than simple undersampling, where we would have only 3080 instances in all (1540 per class).



For now we'll import the SMOTEd data from our R scripts, since the cells above take too long to run.

## Random Forest with all the features

In [7]:
classifier = RandomForestClassifier(random_state=0, n_jobs=-1, class_weight="balanced")

# ravel() passes y as a 1-D array, which avoids scikit-learn's DataConversionWarning
classifier.fit(X_train, y_train.values.ravel())


RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=-1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
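
For reference, `class_weight="balanced"` weights each class by `n_samples / (n_classes * count_of_class)`, as described in the scikit-learn documentation. The sketch below applies that formula to this training split. The notebook does not print per-class training counts, so they are derived here by subtracting the test-set supports reported in the evaluation output (1819 and 443) from the undersampled totals (6000 and 1540).

```python
# Training counts per class, derived by subtraction from the notebook's outputs
train_counts = {
    0: 6000 - 1819,  # non-adopters left for training
    1: 1540 - 443,   # adopters left for training
}
n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "balanced" heuristic: rarer classes receive proportionally larger weights
weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in weights.items():
    print("class {}: weight {:.3f}".format(label, weight))
```

Under these assumptions an adopter counts for roughly four times as much as a non-adopter during training, which only partly offsets the imbalance that remains after undersampling.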

In [8]:
y_pred = classifier.predict(X_test)

# scikit-learn metrics take (y_true, y_pred); reversing the order swaps precision and recall
# acc_val = accuracy_score(y_test, y_pred)
# f1_val = f1_score(y_test, y_pred)
# recall_val = recall_score(y_test, y_pred)
# prec_val = precision_score(y_test, y_pred)

print ("Acc:", accuracy_score(y_test, y_pred))
print ("F1:", f1_score(y_test, y_pred))
print ("Recall:", recall_score(y_test, y_pred))
print ("Precision", precision_score(y_test, y_pred))

recalls.update({len(undersampled_data[undersampled_data.adopter == 0]) : recall_score(y_test, y_pred)})
f1s.update({len(undersampled_data[undersampled_data.adopter == 0]) : f1_score(y_test, y_pred)})
precisions.update({len(undersampled_data[undersampled_data.adopter == 0]) : precision_score(y_test, y_pred)})
accuracies.update({len(undersampled_data[undersampled_data.adopter == 0]) : accuracy_score(y_test, y_pred)})

print("\nF1s:",f1s)
print("Recalls:",recalls)
print("Precisions:",precisions)
print("Accuracies:",accuracies)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Acc: 0.8054818744473917
F1: 0.2280701754385965
Recall: 0.14672686230248308
Precision 0.5118110236220472

F1s: {6000: 0.2280701754385965}
Recalls: {6000: 0.14672686230248308}
Precisions: {6000: 0.5118110236220472}
Accuracies: {6000: 0.8054818744473917}
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      1819
           1       0.51      0.15      0.23       443

   micro avg       0.81      0.81      0.81      2262
   macro avg       0.67      0.56      0.56      2262
weighted avg       0.76      0.81      0.76      2262

[[1757   62]
 [ 378   65]]
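
The adopter-class metrics can be re-derived by hand from the confusion matrix above, which is a useful sanity check on which value is precision and which is recall. The sketch assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted
tn, fp = 1757, 62   # true non-adopters: predicted 0, predicted 1
fn, tp = 378, 65    # true adopters:     predicted 0, predicted 1

precision = tp / (tp + fp)   # share of predicted adopters that are real adopters
recall = tp / (tp + fn)      # share of real adopters that the model finds
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print("Precision: {:.4f}".format(precision))
print("Recall:    {:.4f}".format(recall))
print("F1:        {:.4f}".format(f1))
print("Accuracy:  {:.4f}".format(accuracy))
```

Only 65 of the 443 true adopters in the test split are caught (recall of about 0.15), so in this form the model misses most of the users the campaign is meant to find, in line with the class 1 row of the classification report.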


In [9]:
# predictions on unlabelled set
unseen_data = pd.read_csv('https://drive.google.com/uc?export=view&id=1yVPwqGQC2gkhF2bcbue9j3184ryAJRtG')
# unseen_data = xgb.DMatrix(unseen_data)

y_pred = classifier.predict(unseen_data)

print(sum(y_pred))

y_pred = pd.DataFrame({'Adopters': y_pred })

# saving the predictions for the provided unlabelled dataset
np.savetxt("rf_predictions.csv", y_pred , delimiter=",")
from google.colab import files
files.download('rf_predictions.csv')



3545


### Feature selection

Let's find out which features matter most, since the data seems to be fairly sparse and may contain noise.

In [0]:
feature_imp = pd.Series(classifier.feature_importances_).sort_values(ascending=False)
feature_imp

9     0.126045
18    0.094097
4     0.073787
8     0.073598
3     0.063483
1     0.059883
0     0.056898
23    0.053577
19    0.049249
12    0.047444
14    0.043408
5     0.041726
13    0.031127
6     0.029272
7     0.026110
15    0.025286
10    0.017854
24    0.016657
11    0.016271
22    0.014607
17    0.014412
2     0.014258
16    0.008207
21    0.001803
20    0.000859
25    0.000080
dtype: float64

Columns 11, 22, 16, 24, 2, 17, 21, 20 and 25 rank lowest, each with an importance below 0.017. Let's try to build a classifier with some of these removed.
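
The selection can be reproduced from the importances printed above with a simple cut-off. The threshold of 0.017 is an assumption chosen to match the list in the text, not a value used by the notebook. Note also that these are positions in the feature matrix `X` (which excludes `adopter`), so passing `index=X_train.columns` to `pd.Series` would be one way to see feature names instead of positions.

```python
# Importances copied from the feature_imp output above (position -> importance)
importances = {
    9: 0.126045, 18: 0.094097, 4: 0.073787, 8: 0.073598, 3: 0.063483,
    1: 0.059883, 0: 0.056898, 23: 0.053577, 19: 0.049249, 12: 0.047444,
    14: 0.043408, 5: 0.041726, 13: 0.031127, 6: 0.029272, 7: 0.026110,
    15: 0.025286, 10: 0.017854, 24: 0.016657, 11: 0.016271, 22: 0.014607,
    17: 0.014412, 2: 0.014258, 16: 0.008207, 21: 0.001803, 20: 0.000859,
    25: 0.000080,
}

threshold = 0.017  # hypothetical cut-off, picked to reproduce the list in the text

# feature positions whose importance falls below the cut-off
low_importance = sorted(i for i, value in importances.items() if value < threshold)
dropped_share = sum(importances[i] for i in low_importance)

print("Candidates to drop:", low_importance)
print("Combined importance of dropped features: {:.3f}".format(dropped_share))
```

Together the nine candidates account for under 9% of the total importance, so removing them is a low-risk experiment.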

In [0]:
# data = pd.read_csv('https://drive.google.com/uc?export=view&id=1wctM0dYDj839zp6sTlFnDgCmFspXhDuW')

# type(data)
# data.columns
# #data.head()

In [0]:
# data = data.drop(data.columns[[11, 22, 16, 24, 2, 17, 21, 20, 25]], axis=1)

# # some housekeeping for metrics
# recalls = {}
# f1s = {}
# precisions = {}
# accuracies = {}

# # splitting original dataset into features and predictor
# X = data.iloc[:, data.columns != 'adopter']
# y = data.iloc[:, data.columns == 'adopter']

# # splitting the original dataset for cross-validation (0.7 train, 0.3 test)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# # fetching the indices of minority instances
# adopting_indices = np.array(data[data.adopter == 1].index)

# # fetching indices of normal instances
# non_adopting_indices = data[data.adopter == 0].index

# # randomly select 1040 normal instances to create a smaller, less imbalanced dataset
# random_non_adopting_indices = np.random.choice(non_adopting_indices,
#                                             1040,
#                                             replace = False)
# random_non_adopting_indices = np.array(random_non_adopting_indices)

# # combining both the instance groups (minority and the new random set) 
# undersampled_indices = np.concatenate([adopting_indices, random_non_adopting_indices])

# # creating the undersampled dataset
# undersampled_data = data.iloc[undersampled_indices, :]

# # storing the features(X) and predictor class(y)
# X_undersample = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
# y_undersample = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

# # splitting original dataset into features and predictor
# X = undersampled_data.iloc[:, data.columns != 'adopter']
# y = undersampled_data.iloc[:, data.columns == 'adopter']

# # splitting the original dataset for cross-validation (0.7 train, 0.3 test)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# classifier = RandomForestClassifier(random_state=0, n_jobs=-1, class_weight="balanced")

# classifier.fit(X_train, y_train)

# y_pred = classifier.predict(X_test)

# recalls.update({len(undersampled_data[undersampled_data.adopter == 0]) : recall_score(y_test, y_pred)})
# f1s.update({len(undersampled_data[undersampled_data.adopter == 0]) : f1_score(y_test, y_pred)})
# precisions.update({len(undersampled_data[undersampled_data.adopter == 0]) : precision_score(y_test, y_pred)})
# accuracies.update({len(undersampled_data[undersampled_data.adopter == 0]) : accuracy_score(y_test, y_pred)})

# print("\nF1s:",f1s)
# print("Recalls:",recalls)
# print("Precisions:",precisions)
# print("Accuracies:",accuracies)

# print(classification_report(y_test, y_pred))
# print(confusion_matrix(y_test, y_pred))

In [0]:
# # predictions on unlabelled set
# unseen_data = pd.read_csv('https://drive.google.com/uc?export=view&id=1yVPwqGQC2gkhF2bcbue9j3184ryAJRtG')
# # unseen_data = xgb.DMatrix(unseen_data)
# # unseen_data = unseen_data.drop(data.columns[[11, 22, 16, 24, 2, 17, 21, 20, 25]], axis=1)

# y_pred = classifier.predict(unseen_data)

# y_pred = pd.DataFrame({'Adopters': y_pred })

In [0]:
# # saving the predictions for the provided unlabelled dataset
# np.savetxt("rf_predictions.csv", y_pred , delimiter=",")
# from google.colab import files
# files.download('rf_predictions.csv')