
## **Adopter Prediction Challenge**

 ~ Ankita, Ashok, Kaydee, Young
 
 ---

Website XYZ, a music-listening social networking website, follows the “freemium” business model. The website offers basic services for free and provides a number of additional premium capabilities for a monthly subscription fee. We are interested in predicting which free users are likely to convert to premium subscribers in the next 6-month period if they are targeted by our promotional campaign.

### Dataset

We have a dataset from the previous marketing campaign which targeted a number of non-subscribers.

Features: 

```
1.   adopter (target class)
2.   user_id
3.   age
4.   male
5.   friend_cnt
6.   avg_friend_age
7.   avg_friend_male
8.   friend_country_cnt
9.   subscriber_friend_cnt
10.   songsListened
11.   lovedTracks
12.   posts
13.   playlists
14.   shouts
15.   good_country
16.   tenure
17.   *other delta variables*
```



### Task

The task is to build the best predictive model for the next marketing campaign, i.e., for predicting likely `adopters` (that is, which current non-subscribers are likely to respond to the marketing campaign and sign up for the premium service within 6 months after the campaign).

---

In [0]:
!pip3 install scikit-learn lightgbm



In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score, roc_curve, classification_report, recall_score, f1_score, accuracy_score, precision_score

import lightgbm
from sklearn.utils import shuffle

In [0]:
# setting fixed seed value for consistency in results
seed = 7
np.random.seed(seed)

In [4]:
# original dataset
data = pd.read_csv('https://drive.google.com/uc?export=view&id=1wctM0dYDj839zp6sTlFnDgCmFspXhDuW')

data.adopter.value_counts()

0    85142
1     1540
Name: adopter, dtype: int64
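
The counts above show how skewed the two classes are. As a quick check, the plain-Python arithmetic below uses only those two printed counts to work out the adopter rate and the accuracy that a trivial "nobody adopts" model would reach, which is why accuracy on its own is a weak yardstick for this task. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
non_adopters = 85142
adopters = 1540
total = non_adopters + adopters

# Share of adopters among the users targeted by the previous campaign
adopter_rate = adopters / total

# Accuracy of a trivial model that predicts "non-adopter" for every user
baseline_accuracy = non_adopters / total

print("Total users: {}".format(total))
print("Adopter rate: {:.4f}".format(adopter_rate))
print("Always-0 accuracy: {:.4f}".format(baseline_accuracy))
```

Recall, precision and F1 on the adopter class are more informative here, because the trivial model scores zero recall despite its high accuracy.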

Checking whether any feature (especially `adopter`) needs to be encoded as int.

In [5]:
# some housekeeping for metrics
recalls = {}
f1s = {}
precisions = {}
accuracies = {}

# splitting the original dataset into features (X) and target (y)
X = data.iloc[:, data.columns != 'adopter']
y = data.iloc[:, data.columns == 'adopter']

# hold-out split of the original dataset (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print ("Number of train instances: {}".format(len(X_train)))
print ("Number of test instances: {}".format(len(X_test)))

Number of train instances: 60677
Number of test instances: 26005


## SMOTE splitting

We'll use SMOTE (Synthetic Minority Oversampling Technique) to synthesize more samples of the minority class. The recall score we got earlier may have been low because more than 80% of the balanced data had to be synthesized.

Applying SMOTE to the entire dataset would mean synthesizing around 58000 new minority instances, which would not introduce enough variation in the data for the models to learn from.

Instead, we decide to keep only a subset of the majority class instances (4000) and synthesize 4000-1540=2460 new minority class instances using SMOTE. That should (hopefully) keep our models from overfitting. The run recorded below draws 1040 majority instances and performs only this undersampling step.
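
To make the idea of "synthesizing" minority instances concrete, here is a minimal NumPy sketch of the interpolation step behind SMOTE. It assumes purely numeric features and a small input array. `smote_sketch` and its parameters are hypothetical names chosen for illustration, and this is not the code used in this notebook (a maintained implementation is available as `SMOTE` in the imbalanced-learn package).

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=7):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")
    k = min(k, n - 1)

    # Squared Euclidean distances between all pairs of minority rows
    sq_norms = (X_minority ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X_minority @ X_minority.T
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour

    # Indices of the k nearest minority neighbours of every row
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base row, one of its neighbours, and a random gap in [0, 1)
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))

    # Each new point lies on the segment between a base row and its neighbour
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Toy minority class: the four corners of the unit square
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(toy_minority, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row sits between two existing minority rows, heavy oversampling adds many points but little new information, which is the concern raised above.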

In [49]:
# fetching the indices of minority instances
adopting_indices = np.array(data[data.adopter == 1].index)

# fetching indices of normal instances
non_adopting_indices = data[data.adopter == 0].index

# randomly select 1040 majority (non-adopter) instances to create a smaller, more balanced dataset
random_non_adopting_indices = np.random.choice(non_adopting_indices,
                                            1040,
                                            replace = False)
random_non_adopting_indices = np.array(random_non_adopting_indices)

# combining both the instance groups (minority and the new random set) 
undersampled_indices = np.concatenate([adopting_indices, random_non_adopting_indices])

# creating the undersampled dataset
undersampled_data = data.iloc[undersampled_indices, :]

# shuffling the new dataset
undersampled_data = shuffle(undersampled_data)

# storing the features (X) and target class (y)
X_undersample = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
y_undersample = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

print("Number of minority instances: {}\nNumber of normal instances: {} \nTotal: {}".format(len(undersampled_data[undersampled_data.adopter == 1]), 
                                                                                           len(undersampled_data[undersampled_data.adopter == 0]),
                                                                                           len(undersampled_data)))

Number of minority instances: 1540
Number of normal instances: 1040 
Total: 2580


In [50]:
# splitting the undersampled dataset into features (X) and target (y)
X = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
y = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

# hold-out split of the undersampled dataset (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print ("Undersampled Data:")
print ("Number of train instances: {}".format(len(X_train)))
print ("Number of test instances: {}".format(len(X_test)))

Undersampled Data:
Number of train instances: 1806
Number of test instances: 774


Creating the LightGBM test and train containers

In [0]:
train_data = lightgbm.Dataset(X_train, 
                              label = y_train)

test_data = lightgbm.Dataset(X_test, 
                             label = y_test)

## LightGBM

In [60]:
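parameters = {
    # objective (alias: application): binary classification
    'objective': 'binary',
    # l1 = mean absolute error between the predicted probability and the 0/1 label
    'metric': 'l1',
    'is_unbalance': True,
    'boosting': 'gbdt',
    'num_leaves': 50,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 1
}

# print the validation metric every 10 rounds and stop once it has not improved for 3500 rounds
# (LightGBM >= 4.0 replaces verbose_eval / early_stopping_rounds with the
#  lightgbm.log_evaluation and lightgbm.early_stopping callbacks)
lgbm = lightgbm.train(parameters,
                       train_data,
                       valid_sets=[test_data],
                       verbose_eval=10,
                       num_boost_round=8000,
                       early_stopping_rounds=3500)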

Training until validation scores don't improve for 3500 rounds.
[10]	valid_0's l1: 0.439948
[20]	valid_0's l1: 0.40537
[30]	valid_0's l1: 0.386767
[40]	valid_0's l1: 0.373145
[50]	valid_0's l1: 0.360761
[60]	valid_0's l1: 0.3504
[70]	valid_0's l1: 0.345507
[80]	valid_0's l1: 0.338523
[90]	valid_0's l1: 0.33435
[100]	valid_0's l1: 0.329326
[110]	valid_0's l1: 0.324113
[120]	valid_0's l1: 0.320005
[130]	valid_0's l1: 0.313558
[140]	valid_0's l1: 0.308804
[150]	valid_0's l1: 0.308641
[160]	valid_0's l1: 0.307837
[170]	valid_0's l1: 0.30509
[180]	valid_0's l1: 0.302298
[190]	valid_0's l1: 0.3032
[200]	valid_0's l1: 0.302084
[210]	valid_0's l1: 0.301335
[220]	valid_0's l1: 0.300456
[230]	valid_0's l1: 0.298372
[240]	valid_0's l1: 0.296362
[250]	valid_0's l1: 0.29532
[260]	valid_0's l1: 0.293863
[270]	valid_0's l1: 0.293608
[280]	valid_0's l1: 0.293411
[290]	valid_0's l1: 0.291116
[300]	valid_0's l1: 0.289202
[310]	valid_0's l1: 0.288711
[320]	valid_0's l1: 0.288594
[330]	valid_0's l1: 0.287
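
The message "Training until validation scores don't improve for 3500 rounds" describes patience-based early stopping. The sketch below is a simplified illustration of that rule for a lower-is-better metric such as `l1`, not LightGBM's internal code. `early_stop_round` and the toy scores are hypothetical.

```python
def early_stop_round(scores, patience):
    """Return (best_index, stop_index) for a lower-is-better metric."""
    best_index = 0
    for i, score in enumerate(scores):
        if score < scores[best_index]:
            best_index = i               # new best: the patience window restarts
        elif i - best_index >= patience:
            return best_index, i         # no improvement for `patience` rounds
    return best_index, len(scores) - 1   # ran out of rounds before stopping

# Hypothetical validation scores, one per boosting round
toy_scores = [0.44, 0.40, 0.38, 0.39, 0.385, 0.381, 0.37]
print(early_stop_round(toy_scores, patience=3))  # (2, 5)
print(early_stop_round(toy_scores, patience=4))  # (6, 6)
```

With a patience of 3 the loop stops at index 5 and misses the later improvement at index 6, while a patience of 4 reaches it. A patience of 3500 against a budget of 8000 rounds, as configured above, therefore only stops training after a very long plateau.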

In [61]:
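# score the held-out features (X_test), not the labels: scoring y_test
# gives every row the same near-zero value, as in the output recorded below
y_pred = lgbm.predict(X_test)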
# y_pred[y_pred > 0.5] = 1
# y_pred[y_pred <= 0.5] = 0

print (y_pred)

# sklearn metrics take their arguments in the order (y_true, y_pred)
# acc_val = accuracy_score(y_test, y_pred)
# f1_val = f1_score(y_test, y_pred)
# recall_val = recall_score(y_test, y_pred)
# prec_val = precision_score(y_test, y_pred)

# print ("F1: ", f1_score(y_test, y_pred))
# print ("Recall: ", recall_score(y_test, y_pred))
# print ("Prec.: ", precision_score(y_test, y_pred))
# print ("Acc.: ", accuracy_score(y_test, y_pred))

# recalls.update({len(undersampled_data[undersampled_data.adopter == 0]) : recall_score(y_test, y_pred)})
# f1s.update({len(undersampled_data[undersampled_data.adopter == 0]) : f1_score(y_test, y_pred)})
# precisions.update({len(undersampled_data[undersampled_data.adopter == 0]) : precision_score(y_test, y_pred)})
# accuracies.update({len(undersampled_data[undersampled_data.adopter == 0]) : accuracy_score(y_test, y_pred)})

# print("F1 list: ", f1s)
# print("Recall list: ", recalls)
# print("Precision list: ", precisions)
# print("Accuracy list: ", accuracies)

# np.savetxt("predictions.csv", y_pred , delimiter=",")
# from google.colab import files
# files.download('predictions.csv')

[9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.58798171e-07 9.58798171e-07 9.58798171e-07
 9.58798171e-07 9.587981
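
`lgbm.predict` returns probabilities, so they have to be thresholded before precision, recall or F1 can be computed. The NumPy sketch below does this by hand on toy data. `binary_report`, the 0.5 threshold and the toy arrays are assumptions made for illustration, and in the notebook itself the sklearn metric functions imported earlier would serve the same purpose.

```python
import numpy as np

def binary_report(y_true, y_score, threshold=0.5):
    """Threshold probabilities and compute precision, recall and F1 by hand."""
    y_true = np.asarray(y_true).ravel().astype(int)
    y_hat = (np.asarray(y_score).ravel() > threshold).astype(int)

    tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # adopters found
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # false alarms
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # adopters missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predicted probabilities
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.6, 0.4, 0.8, 0.1, 0.3]
report = binary_report(toy_labels, toy_probs)
print(report)

# A constant near-zero score, like the output above, flags nobody as an adopter
constant_report = binary_report(toy_labels, [9.58798171e-07] * 6)
print(constant_report)
```

On the toy data two of the three adopters are found with one false alarm, so precision, recall and F1 all come out at 2/3, while the constant scores give a recall of 0.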

In [38]:
# predictions on unlabelled set
unseen_data = pd.read_csv('https://drive.google.com/uc?export=view&id=1yVPwqGQC2gkhF2bcbue9j3184ryAJRtG')

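y_pred = lgbm.predict(unseen_data)
# scores above 0.8 become 1 and scores at or below 0.2 become 0;
# scores in (0.2, 0.8] stay as raw probabilities, so the sum below is fractional
y_pred[y_pred > 0.8] = 1
y_pred[y_pred <= 0.2] = 0

print (sum(y_pred))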

y_pred = pd.DataFrame({'Adopters': y_pred })

21579.227274603905
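
The fractional sum above comes from the two-sided rule: scores between the two cut-offs are left as probabilities. If a count of predicted adopters is wanted, a single cut-off turns every score into 0 or 1. The sketch below contrasts the two on toy values. The probabilities and the 0.5 threshold are assumptions for illustration, not values from the notebook.

```python
import numpy as np

# Hypothetical probabilities standing in for lgbm.predict(unseen_data)
toy_probs = np.array([0.95, 0.10, 0.55, 0.30, 0.85])

# Two-sided rule as in the cell above: the middle scores stay untouched
two_sided = toy_probs.copy()
two_sided[two_sided > 0.8] = 1
two_sided[two_sided <= 0.2] = 0

# Single cut-off: every score becomes 0 or 1 (0.5 is an assumed threshold)
labels = (toy_probs > 0.5).astype(int)

print(two_sided)          # mix of 0/1 values and raw probabilities
print(int(labels.sum()))  # number of predicted adopters
```

Which form to save depends on how the predictions are consumed: a ranking-based evaluation can use the raw probabilities, while a target list for the campaign needs hard 0/1 labels.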


In [0]:
# saving the predictions for the unlabelled set to CSV and downloading the file
np.savetxt("predictions.csv", y_pred , delimiter=",")
from google.colab import files
files.download('predictions.csv')