
## **Adopter Prediction Challenge**

 ~ Ankita, Ashok, Kaydee, Young
 
 ---

Website XYZ, a music-listening social networking website, follows the “freemium” business model. The website offers basic services for free and provides a number of additional premium capabilities for a monthly subscription fee. We are interested in predicting which users are likely to convert from free users to premium subscribers within the next six-month period if they are targeted by our promotional campaign.

### Dataset

We have a dataset from the previous marketing campaign which targeted a number of non-subscribers.

Features: 

```
1.   adopter (target class)
2.   user_id
3.   age
4.   male
5.   friend_cnt
6.   avg_friend_age
7.   avg_friend_male
8.   friend_country_cnt
9.   subscriber_friend_cnt
10.   songsListened
11.   lovedTracks
12.   posts
13.   playlists
14.   shouts
15.   good_country
16.   tenure
17.   *other delta variables*
```



### Task

The task is to build the best predictive model for the next marketing campaign, i.e., for predicting likely `adopters` (that is, which current non-subscribers are likely to respond to the marketing campaign and sign up for the premium service within 6 months after the campaign).

---

In [0]:
!pip3 install scikit-learn



In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

from google.colab import drive

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score, roc_curve, classification_report, recall_score, f1_score, accuracy_score, precision_score

import xgboost as xgb

from sklearn.utils import shuffle
import pickle

In [0]:
# setting fixed seed value for consistency in results
seed = 7
np.random.seed(seed)

In [1]:
# original dataset
data = pd.read_csv('https://drive.google.com/uc?export=view&id=1wctM0dYDj839zp6sTlFnDgCmFspXhDuW')

data.adopter.value_counts()

NameError: ignored

In [0]:
# some housekeeping for metrics
recalls = {}
f1s = {}
precisions = {}
accuracies = {}

## PCA

Let's try to understand the correlations between features and reduce dimensions using PCA.



In [0]:
# Let's reduce the feature space to 2 Principal Components
pca = PCA(n_components=2, svd_solver='full')
pca.fit(data)

# fetching the principal components
pca_df = pca.transform(data)
pca_df

array([[ 854759.82859973,   19870.23826579],
       [ 854720.24411516,   -1914.32387718],
       [ 854700.28031605,   13512.38783989],
       ...,
       [-854170.64437233,   39125.80735444],
       [-854151.35634299,  -16267.59321778],
       [-854175.54207324,   28314.13514004]])

In [0]:
# understanding the percentage of variance in the original dataset explained by our principal components
sum(pca.explained_variance_ratio_)

0.9999742484087512
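
An explained-variance ratio this close to 1 with only two components often indicates that one or two columns with very large numeric ranges dominate the variance, because PCA is sensitive to feature scale. The sketch below uses synthetic data (the column roles and ranges are assumptions for illustration, not the real dataset) and NumPy only, to show how standardizing the columns changes the ratio. It computes the same quantity that scikit-learn exposes as `explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X, n_components):
    """Fraction of total variance captured by the first n_components PCs."""
    Xc = X - X.mean(axis=0)                    # PCA centers the data first
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    var = s ** 2                               # proportional to per-component variance
    return var[:n_components].sum() / var.sum()

rng = np.random.default_rng(7)
n = 1000

# Hypothetical columns: one ID-like column with a huge range plus small-scale features
toy = np.column_stack([
    rng.permutation(n) * 1000.0,   # ID-like values, range on the order of 1e6
    rng.normal(25, 6, n),          # age-like values
    rng.poisson(10, n),            # count-like values
    rng.integers(0, 2, n),         # binary flag
]).astype(float)

raw_ratio = explained_variance_ratio(toy, 2)

# Standardize each column to zero mean and unit variance before PCA
scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)
scaled_ratio = explained_variance_ratio(scaled, 2)

print(f"raw: {raw_ratio:.6f}  standardized: {scaled_ratio:.6f}")
```

On the raw toy data the ID-like column accounts for almost all of the variance; after standardizing, the two leading components explain a much smaller share.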

In [0]:
# using the principal components to fetch feature importances
# reference - http://benalexkeen.com/principle-component-analysis-in-python/ 
def fetch_feature_importance(pca_df, components, cols):
  """Rank columns by the length of their loading vector on the first two PCs.

  Returns a list of (importance, column_name) tuples, largest first.
  """
  num_columns = len(cols)
  
  # scaling each component's loadings by the largest projected value on that component
  xvector = components[0] * pca_df[:, 0].max()
  yvector = components[1] * pca_df[:, 1].max()
  
  imp_features = { cols[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
  imp_features = sorted(zip(imp_features.values(), imp_features.keys()), reverse=True)
  return imp_features

In [0]:
imp_features = fetch_feature_importance(pca_df, pca.components_, data.columns.values)

pca_features = []

for item in imp_features:
  pca_features.append(item[1])
  
top_pca_features = pca_features[0:13]

# keeping the 13 top-ranked columns; dropping user_id below leaves 12
X = data[top_pca_features]
X = X.drop(['user_id'], axis = 1)
y = data.iloc[:, data.columns == 'adopter']

# splitting the original dataset into hold-out train and test sets (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print ("Number of train instances: {}".format(len(X_train)))
print ("Number of test instances: {}".format(len(X_test)))

Number of train instances: 60677
Number of test instances: 26005


## Resampling the data to reduce class imbalance

Now that we've used PCA to pick the top 12 features, we repeat the feature selection and the train/test split on an undersampled version of the data.

In [0]:
# fetching the indices of minority instances
adopting_indices = np.array(data[data.adopter == 1].index)

# fetching indices of non-adopter (majority) instances
non_adopting_indices = data[data.adopter == 0].index

# randomly select 6000 non-adopter instances to build a less imbalanced dataset
random_non_adopting_indices = np.random.choice(non_adopting_indices,
                                            6000,
                                            replace = False)
random_non_adopting_indices = np.array(random_non_adopting_indices)

# combining both the instance groups (minority and the new random set) 
undersampled_indices = np.concatenate([adopting_indices, random_non_adopting_indices])

# creating the undersampled dataset
undersampled_data = data.iloc[undersampled_indices, :]

# shuffling the new dataset
undersampled_data = shuffle(undersampled_data)

# storing the features (X) and target class (y)
X_undersample = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
y_undersample = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

print("Number of minority instances: {}\nNumber of normal instances: {} \nTotal: {}".format(len(undersampled_data[undersampled_data.adopter == 1]), 
                                                                                           len(undersampled_data[undersampled_data.adopter == 0]),
                                                                                           len(undersampled_data)))

Number of minority instances: 1540
Number of normal instances: 6000 
Total: 7540
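
The undersampling step above can also be wrapped in a small helper with an explicit seed, so the sample is reproducible independently of the global NumPy state. This is a sketch on a toy label array; `undersample_indices` and its parameters are hypothetical names, not part of the original notebook.

```python
import numpy as np

def undersample_indices(labels, n_majority, seed=7):
    """Return shuffled row positions: every minority row plus n_majority random majority rows."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(labels == 1)
    majority = np.flatnonzero(labels == 0)
    if n_majority > len(majority):
        raise ValueError("n_majority exceeds the number of majority rows")
    # sample majority rows without replacement
    chosen = rng.choice(majority, size=n_majority, replace=False)
    combined = np.concatenate([minority, chosen])
    rng.shuffle(combined)   # shuffle in place so the classes are interleaved
    return combined

# Toy labels: 20 adopters among 1000 users (proportions are illustrative only)
toy_labels = np.zeros(1000, dtype=int)
toy_labels[:20] = 1

idx = undersample_indices(toy_labels, n_majority=80)
print(len(idx), toy_labels[idx].mean())
```

With a DataFrame that has a default integer index, the returned positions can be passed to `data.iloc[idx]`.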


In [0]:
imp_features = fetch_feature_importance(pca_df, pca.components_, data.columns.values)

pca_features = []

for item in imp_features:
  pca_features.append(item[1])
  
top_pca_features = pca_features[0:13]

# keeping the 13 top-ranked columns; dropping user_id below leaves 12
X = undersampled_data[top_pca_features]
X = X.drop(['user_id'], axis = 1)
y = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

# splitting the undersampled dataset into hold-out train and test sets (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print ("Number of train instances: {}".format(len(X_train)))
print ("Number of test instances: {}".format(len(X_test)))

Number of train instances: 5278
Number of test instances: 2262
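
The counts printed so far are enough to work out how imbalanced the target is and what accuracy a trivial "always predict non-adopter" model would reach. The arithmetic below uses only numbers shown in the outputs above; it assumes the 1540 minority rows are all the adopters in the full dataset, which is what the undersampling cell selects.

```python
# Counts taken from the printed outputs in this notebook
n_total = 60677 + 26005          # train + test rows of the full dataset
n_adopters = 1540                # minority rows kept in the undersampled set
n_sampled_non_adopters = 6000    # majority rows drawn at random

# Share of adopters before and after undersampling
base_rate_full = n_adopters / n_total
base_rate_sampled = n_adopters / (n_adopters + n_sampled_non_adopters)

# Accuracy of a model that always predicts "non-adopter" on the undersampled data
majority_baseline_acc = 1 - base_rate_sampled

print(f"adopter share (full data):    {base_rate_full:.4f}")
print(f"adopter share (undersampled): {base_rate_sampled:.4f}")
print(f"majority-class accuracy:      {majority_baseline_acc:.4f}")
```

Any accuracy reported on the undersampled data is best read against that majority-class baseline rather than against 0.5.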


## XGBoost over fetched important features

In [0]:
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

num_rounds = 50

params = {
    'max_depth': 3,
    'eta': 0.1,
    'objective': 'binary:logistic',
    'seed': 7
}

test_train_split = [(dtest, 'test'), (dtrain, 'train')]

boost = xgb.train(params,
                 dtrain,
                 num_rounds, 
                 test_train_split)

[0]	test-error:0.214854	train-error:0.192687
[1]	test-error:0.209107	train-error:0.195339
[2]	test-error:0.212644	train-error:0.190792
[3]	test-error:0.214854	train-error:0.190413
[4]	test-error:0.212644	train-error:0.190792
[5]	test-error:0.213528	train-error:0.190792
[6]	test-error:0.212202	train-error:0.190413
[7]	test-error:0.212644	train-error:0.191929
[8]	test-error:0.209991	train-error:0.190034
[9]	test-error:0.210875	train-error:0.190413
[10]	test-error:0.210433	train-error:0.190792
[11]	test-error:0.209991	train-error:0.192118
[12]	test-error:0.209991	train-error:0.190413
[13]	test-error:0.209549	train-error:0.19155
[14]	test-error:0.211317	train-error:0.191171
[15]	test-error:0.213086	train-error:0.190792
[16]	test-error:0.21176	train-error:0.188139
[17]	test-error:0.209549	train-error:0.188329
[18]	test-error:0.209991	train-error:0.188329
[19]	test-error:0.210433	train-error:0.188139
[20]	test-error:0.212202	train-error:0.188329
[21]	test-error:0.212644	train-error:0.189276


In [0]:
y_pred = boost.predict(dtest)
y_pred[y_pred > 0.5] = 1
y_pred[y_pred <= 0.5] = 0

# scikit-learn metrics take (y_true, y_pred) in that order
print ("F1: ", f1_score(y_test, y_pred))
print ("Recall: ", recall_score(y_test, y_pred))
print ("Precision: ", precision_score(y_test, y_pred))
print ("Acc: ", accuracy_score(y_test, y_pred))

# keying each metric by the number of non-adopters in the undersampled set
n_non_adopters = len(undersampled_data[undersampled_data.adopter == 0])
recalls.update({n_non_adopters : recall_score(y_test, y_pred)})
f1s.update({n_non_adopters : f1_score(y_test, y_pred)})
precisions.update({n_non_adopters : precision_score(y_test, y_pred)})
accuracies.update({n_non_adopters : accuracy_score(y_test, y_pred)})

print("F1s", f1s)
print("Recalls", recalls)
print("Precisions", precisions)
print("Accuracies", accuracies)

F1:  0.19322033898305083
Recall:  0.11776859504132231
Precision:  0.5377358490566038
Acc:  0.7895667550839964
F1s {6000: 0.19322033898305083}
Recalls {6000: 0.11776859504132231}
Precisions {6000: 0.5377358490566038}
Accuracies {6000: 0.7895667550839964}
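
scikit-learn's metric functions expect the true labels first and the predictions second. Precision and recall are not symmetric in those two arguments, so passing them in the other order swaps the two values. The sketch below computes both metrics by hand on a toy example (NumPy only; the labels are made up) to make the effect visible.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # predicted 1, actually 1
    fp = np.sum((y_true == 0) & (y_pred == 1))   # predicted 1, actually 0
    fn = np.sum((y_true == 1) & (y_pred == 0))   # predicted 0, actually 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Toy labels: 4 real adopters, the model flags 2 users and gets 1 right
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

p, r = precision_recall(y_true, y_pred)
p_swapped, r_swapped = precision_recall(y_pred, y_true)

print("correct order :", p, r)
print("swapped order :", p_swapped, r_swapped)
```

F1 and accuracy are symmetric in their two arguments, so they come out the same in either order.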


In [0]:
# predictions on unlabelled set
unseen_data = pd.read_csv('https://drive.google.com/uc?export=view&id=1yVPwqGQC2gkhF2bcbue9j3184ryAJRtG')

# dropping the unimportant features
unseen_data = unseen_data[top_pca_features]
unseen_data = unseen_data.drop(['user_id'], axis = 1)

unseen_data = xgb.DMatrix(unseen_data)

y_pred = boost.predict(unseen_data)
y_pred[y_pred > 0.5] = 1
y_pred[y_pred <= 0.5] = 0

y_pred = pd.DataFrame({'Adopters': y_pred })
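
The 0.5 cut-off used here is a default rather than a tuned value. With imbalanced classes it can be worth sweeping the threshold on a labelled hold-out set and keeping the one that maximizes the metric of interest. Below is a minimal sketch on synthetic scores; the data and the helpers `f1_at_threshold` and `best_f1_threshold` are hypothetical and not part of the original notebook.

```python
import numpy as np

def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when scores above `threshold` are labelled 1."""
    y_pred = (scores > threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

def best_f1_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1, and that F1."""
    f1_values = [f1_at_threshold(y_true, scores, t) for t in candidates]
    best = int(np.argmax(f1_values))
    return candidates[best], f1_values[best]

rng = np.random.default_rng(7)

# Synthetic hold-out set: roughly 20% positives, which score higher on average
y_true = (rng.random(2000) < 0.2).astype(int)
scores = np.clip(rng.normal(0.25 + 0.2 * y_true, 0.12), 0, 1)

candidates = np.linspace(0.05, 0.95, 19)
threshold, f1 = best_f1_threshold(y_true, scores, candidates)

print(f"best threshold: {threshold:.2f}  F1: {f1:.3f}")
```

In the notebook's setting, the labelled scores could come from `boost.predict(dtest)` and `y_test`, with the chosen threshold then applied to the unlabelled predictions.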

In [0]:
# saving the predictions for the unlabelled dataset and downloading the file
np.savetxt("predictions.csv", y_pred , delimiter=",")
from google.colab import files
files.download('predictions.csv')