
## **Adopter Prediction Challenge**

 ~ Ankita, Ashok, Kaydee, Young
 
 ---

Website XYZ, a music-listening social networking website, follows the “freemium” business model: it offers basic services for free and a number of additional premium capabilities for a monthly subscription fee. We want to predict which free users are likely to convert to premium subscribers within the next 6 months if they are targeted by our promotional campaign.

### Dataset

We have a dataset from the previous marketing campaign which targeted a number of non-subscribers.

Features: 

```
1.   adopter (target class)
2.   user_id
3.   age
4.   male
5.   friend_cnt
6.   avg_friend_age
7.   avg_friend_male
8.   friend_country_cnt
9.   subscriber_friend_cnt
10.   songsListened
11.   lovedTracks
12.   posts
13.   playlists
14.   shouts
15.   good_country
16.   tenure
17.   *other delta variables*
```



### Task

The task is to build the best predictive model for the next marketing campaign, i.e., a model that predicts likely `adopters`: current non-subscribers who are likely to respond to the marketing campaign and sign up for the premium service within 6 months after the campaign.

---

In [0]:
!pip3 install scikit-learn



In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score, roc_curve, classification_report, recall_score, f1_score, accuracy_score, precision_score

from sklearn.utils import shuffle
import pickle

In [0]:
# setting fixed seed value for consistency in results
seed = 7
np.random.seed(seed)

In [3]:
# original dataset
data = pd.read_csv('https://drive.google.com/uc?export=view&id=1wctM0dYDj839zp6sTlFnDgCmFspXhDuW')

data.adopter.value_counts()

0    85142
1     1540
Name: adopter, dtype: int64

In [0]:
# some housekeeping for metrics
recalls = {}
f1s = {}
precisions = {}
accuracies = {}

## Splitting adopter and non-adopter instances

Isolation Forest is an unsupervised anomaly detector, so the goal is to fit it on non-adopters only and have it flag adopters as anomalies.
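The idea can be illustrated on synthetic data before applying it to the real dataset. The sketch below is only an illustration: the arrays `normal` and `anomalies`, their sizes, and the `contamination` value are assumptions, and real adopters are unlikely to be separated this cleanly from non-adopters.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(7)

# Synthetic stand-ins (assumption): "normal" rows cluster near the origin,
# "anomalous" rows sit far away from that cluster.
normal = rng.normal(loc=0.0, scale=1.0, size=(2000, 4))
anomalies = rng.normal(loc=6.0, scale=1.0, size=(50, 4))

# Fit on the normal class only.
forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=7)
forest.fit(normal)

# predict() returns -1 for anomalies and +1 for inliers; map to 1/0 labels.
flagged_anomalies = (forest.predict(anomalies) == -1).astype(int)
flagged_normal = (forest.predict(normal) == -1).astype(int)

print("share of anomalies flagged:", flagged_anomalies.mean())
print("share of normal rows flagged:", flagged_normal.mean())
```

The share of normal rows flagged should sit close to the `contamination` value, since that parameter sets the decision threshold on the training data.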

In [5]:
# fetching the indices of minority instances
adopting_indices = np.array(data[data.adopter == 1].index)

# fetching indices of normal instances
non_adopting_indices = data[data.adopter == 0].index

# randomly select 1540 normal instances to create a partitioned balanced dataset
# random_non_adopting_indices = np.random.choice(non_adopting_indices,
#                                             6040,
#                                             replace = False)
# random_non_adopting_indices = np.array(random_non_adopting_indices)

# combining both the instance groups (minority and the new random set) 
# undersampled_indices = np.concatenate([adopting_indices, random_non_adopting_indices])

# separating the adopter and non-adopter instances
# (.loc is used because these are index labels, not integer positions)
adopter_data = data.loc[adopting_indices, :]
non_adopter_data = data.loc[non_adopting_indices, :]

print("Number of minority instances: {}\nNumber of normal instances: {} ".format(len(adopter_data), len(non_adopter_data)))

Number of minority instances: 1540
Number of normal instances: 85142 


In [6]:
# splitting the non-adopter dataset into features (X) and target (y)
X = non_adopter_data.iloc[:, data.columns != 'adopter']
y = non_adopter_data.iloc[:, data.columns == 'adopter']

# splitting the dataset into train and hold-out sets (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print ("Undersampled Data:")
print ("Number of train instances: {}".format(len(X_train)))
print ("Number of test instances: {}".format(len(X_test)))

Undersampled Data:
Number of train instances: 59599
Number of test instances: 25543


In [0]:
# sm = SMOTE(random_state = 12, ratio = None)
# X_train_smoted_np, y_train_smoted_np = sm.fit_sample(X_train, y_train)
# # X_train_smoted, y_train_smoted = sm.fit_sample(X_train, y_train.values.ravel())
# print(type(X_train_smoted_np))

In [0]:
# # checking the lengths of new training set

# print ("Number of SMOTEd instances: {}".format(len(X_train_smoted_np)))

# X_train.head()
# y_train_smoted_non_adopters = y_train_smoted_np[y_train_smoted_np == 1]
# y_train_smoted_adopters = y_train_smoted_np[y_train_smoted_np == 0]

# print ("Number of SMOTEd non-adopters (adopter = 0): {}".format(len(y_train_smoted_non_adopters)))
# print ("Number of SMOTEd adopters (adopter = 1): {}".format(len(y_train_smoted_adopters)))

## Isolation Forest

In [30]:
# deciding the anomaly ratio (adopters per non-adopter) used as contamination
anomaly_ratio = len(adopter_data)/len(non_adopter_data)
# anomaly_ratio

estimator = IsolationForest(n_estimators = 500, 
                            max_samples = 512,
                            contamination = anomaly_ratio, 
                            behaviour= "new",  # only valid on scikit-learn 0.20-0.23; drop this argument on 0.24+
                            random_state = np.random.RandomState(seed),
                            verbose = 1)

estimator.fit(X_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.6s finished


IsolationForest(behaviour='new', bootstrap=False,
        contamination=0.018087430410373258, max_features=1.0,
        max_samples=512, n_estimators=500, n_jobs=None,
        random_state=<mtrand.RandomState object at 0x7fb6353f7c18>,
        verbose=1)

In [0]:
# filename = 'isolation_forest_model.sav'
# pickle.dump(estimator, open(filename, 'wb'))

In [0]:
# filename = 'isolation_forest_model.sav'
# loaded_model = pickle.load(open(filename, 'rb'))

In [0]:
X_train_pred = estimator.predict(X_train)
X_train_pred = [1 if x == -1 else 0 for x in X_train_pred]

In [29]:
sum(X_train_pred)

1078
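The count of 1078 is what the `contamination` setting would lead us to expect: the forest places its decision threshold so that roughly that fraction of the training rows is flagged. A quick arithmetic check, using only the counts printed in the outputs above:

```python
# Counts taken from the notebook outputs above.
n_adopters = 1540
n_non_adopters = 85142
n_train = 59599

# Same ratio that was passed as `contamination`.
anomaly_ratio = n_adopters / n_non_adopters

# Expected number of training rows flagged as anomalies.
expected_flagged = anomaly_ratio * n_train

print(round(anomaly_ratio, 6))   # 0.018087
print(round(expected_flagged))   # 1078
```

So the number of flagged training rows says nothing yet about how well adopters are detected; it only reflects the chosen threshold.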

In [0]:
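# predicting on the hold-out split (-1 = anomaly, mapped to adopter = 1)
X_pred = estimator.predict(X_test)
X_pred = pd.DataFrame({'Adopters': [1 if x == -1 else 0 for x in X_pred]})
np.savetxt("predictions.csv", X_pred , delimiter=",")
from google.colab import files
files.download('predictions.csv')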

In [0]:

# predicting on the hold-out split (-1 = anomaly, mapped to adopter = 1)
y_true = y_test.values.ravel()
y_pred = np.where(estimator.predict(X_test) == -1, 1, 0)

# scikit-learn metrics take (y_true, y_pred), in that order.
# Note: this hold-out split contains non-adopters only, so precision, recall
# and F1 for the adopter class come out as 0 and scikit-learn warns about it.
acc_val = accuracy_score(y_true, y_pred)
f1_val = f1_score(y_true, y_pred)
recall_val = recall_score(y_true, y_pred)
prec_val = precision_score(y_true, y_pred)

print (acc_val)
print (f1_val)
print (recall_val)
print (prec_val)

# keying the metrics by the number of non-adopters used for training
recalls.update({len(X_train) : recall_val})
f1s.update({len(X_train) : f1_val})
precisions.update({len(X_train) : prec_val})
accuracies.update({len(X_train) : acc_val})

print(recalls)
print(f1s)
print(precisions)
print(accuracies)

KeyboardInterrupt: ignored
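Because the hold-out split above is drawn from non-adopters only, adopter-class metrics need an evaluation set that also contains adopters. A minimal sketch of that setup on synthetic data follows; the arrays `non_adopters_train`, `non_adopters_test` and `adopters`, their distributions, and the model settings are all assumptions for illustration, not the notebook's data or results.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.RandomState(7)

# Synthetic stand-ins (assumption), not the notebook's data.
non_adopters_train = rng.normal(0.0, 1.0, size=(3000, 5))
non_adopters_test = rng.normal(0.0, 1.0, size=(1000, 5))
adopters = rng.normal(1.5, 1.5, size=(40, 5))

forest = IsolationForest(n_estimators=200, contamination=0.02, random_state=7)
forest.fit(non_adopters_train)

# Evaluation set = held-out non-adopters (label 0) + adopters (label 1).
X_eval = np.vstack([non_adopters_test, adopters])
y_eval = np.concatenate([np.zeros(len(non_adopters_test), dtype=int),
                         np.ones(len(adopters), dtype=int)])

# Map the forest's -1/+1 output to 1/0 adopter labels.
y_hat = (forest.predict(X_eval) == -1).astype(int)

# scikit-learn metrics take (y_true, y_pred), in that order.
print("precision:", precision_score(y_eval, y_hat))
print("recall:   ", recall_score(y_eval, y_hat))
print("f1:       ", f1_score(y_eval, y_hat))
```

In the notebook, the equivalent would be to append the rows of `adopter_data` (without the `adopter` column) to `X_test` and score against the combined labels.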

Let's try standardising the data
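A minimal sketch of what standardising could look like with `StandardScaler`, shown on synthetic data. The arrays `X_train_demo` and `X_test_demo` and the model settings are assumptions for illustration. Note that Isolation Forest splits on one feature at a time, so per-feature rescaling may change its output very little; the step is more likely to matter if a distance-based or linear model is tried afterwards.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(7)

# Synthetic stand-in features on very different scales (assumption).
X_train_demo = np.column_stack([rng.normal(30, 8, 2000),
                                rng.lognormal(6, 1.5, 2000)])
X_test_demo = np.column_stack([rng.normal(30, 8, 500),
                               rng.lognormal(6, 1.5, 500)])

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to both splits.
scaler = StandardScaler().fit(X_train_demo)
X_train_std = scaler.transform(X_train_demo)
X_test_std = scaler.transform(X_test_demo)

forest = IsolationForest(n_estimators=200, contamination=0.02, random_state=7)
forest.fit(X_train_std)

# Map the forest's -1/+1 output to 1/0 anomaly labels.
pred = (forest.predict(X_test_std) == -1).astype(int)
print("flagged in test:", pred.sum(), "of", len(pred))
```

Fitting the scaler on the training rows alone keeps information from the hold-out rows out of the preprocessing step.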

In [0]:

# y_pred[y_pred > 0.5] = 1
# y_pred[y_pred <= 0.5] = 0

# type(y_pred)

In [0]:
# predictions on unlabelled set
unseen_data = pd.read_csv('https://drive.google.com/uc?export=view&id=1yVPwqGQC2gkhF2bcbue9j3184ryAJRtG')

# scoring with the fitted Isolation Forest (-1 = anomaly, mapped to adopter = 1);
# assumes the file has the same feature columns, in the same order, as X_train
y_pred = np.where(estimator.predict(unseen_data) == -1, 1, 0)

y_pred = pd.DataFrame({'Adopters': y_pred })

In [0]:
# saving the predictions for the unlabelled set and downloading them from Colab
np.savetxt("predictions.csv", y_pred , delimiter=",")
from google.colab import files
files.download('predictions.csv')