
## **Adopter Prediction Challenge**

 ~ Ankita, Ashok, Kaydee, Young
 
 ---

Website XYZ, a music-listening social networking website, follows the “freemium” business model. The website offers basic services for free and provides a number of additional premium capabilities for a monthly subscription fee. We are interested in predicting which free users are likely to convert to premium subscribers in the next 6-month period if they are targeted by our promotional campaign.

### Dataset

We have a dataset from the previous marketing campaign which targeted a number of non-subscribers.

Features: 

```
1.   adopter (target class)
2.   user_id
3.   age
4.   male
5.   friend_cnt
6.   avg_friend_age
7.   avg_friend_male
8.   friend_country_cnt
9.   subscriber_friend_cnt
10.   songsListened
11.   lovedTracks
12.   posts
13.   playlists
14.   shouts
15.   good_country
16.   tenure
17.   *other delta variables*
```



### Task

The task is to build the best predictive model for the next marketing campaign, i.e., for predicting likely `adopters` (that is, which current non-subscribers are likely to respond to the marketing campaign and sign up for the premium service within 6 months after the campaign).

---

### EDA

Performing some rudimentary EDA

In [0]:
!pip3 install scikit-learn



In [0]:
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_recall_curve, roc_auc_score, roc_curve,
                             classification_report, recall_score, f1_score, accuracy_score,
                             precision_score)
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

from imblearn.over_sampling import SMOTE

from google.colab import drive

In [0]:
# setting fixed seed value for consistency in results
seed = 7
np.random.seed(seed)

In [0]:
# drive.mount('/content/drive/')

# original dataset
data = pd.read_csv('https://drive.google.com/uc?export=view&id=1wctM0dYDj839zp6sTlFnDgCmFspXhDuW')

# rose_data from the R script
# data = pd.read_csv('https://drive.google.com/uc?export=view&id=14wilOFigXttteZAt5oUHT9fh1m5LhnJj')

data.adopter.value_counts()

0    85142
1     1540
Name: adopter, dtype: int64
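
To make the imbalance concrete, here is a small worked calculation using only the two counts printed above. The variable names are ours and purely illustrative.

```python
# Class counts copied from the value_counts() output above.
n_non_adopters = 85142
n_adopters = 1540

total = n_non_adopters + n_adopters

# Share of adopters in the whole dataset.
adopter_rate = n_adopters / total

# Accuracy of a trivial model that always predicts "non-adopter".
majority_baseline_accuracy = n_non_adopters / total

# How many non-adopters there are for every adopter.
imbalance_ratio = n_non_adopters / n_adopters

print("Total instances: {}".format(total))
print("Adopter rate: {:.4f}".format(adopter_rate))
print("Majority-class baseline accuracy: {:.4f}".format(majority_baseline_accuracy))
print("Non-adopters per adopter: {:.1f}".format(imbalance_ratio))
```

Since always predicting "non-adopter" already scores about 98% accuracy, accuracy alone says little here; recall and precision on the adopter class are more informative.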

Checking to see if any features (especially adopter) need to be encoded as int

In [0]:
data.dtypes

user_id                          int64
age                              int64
male                             int64
friend_cnt                       int64
avg_friend_age                 float64
avg_friend_male                float64
friend_country_cnt               int64
subscriber_friend_cnt            int64
songsListened                    int64
lovedTracks                      int64
posts                            int64
playlists                        int64
shouts                           int64
delta_friend_cnt                 int64
delta_avg_friend_age           float64
delta_avg_friend_male          float64
delta_friend_country_cnt         int64
delta_subscriber_friend_cnt      int64
delta_songsListened              int64
delta_lovedTracks                int64
delta_posts                      int64
delta_playlists                  int64
delta_shouts                     int64
tenure                           int64
good_country                     int64
delta_good_country       

In [0]:
# splitting original dataset into features (X) and target (y)
X = data.iloc[:, data.columns != 'adopter']
y = data.iloc[:, data.columns == 'adopter']

# hold-out split of the original dataset (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print("Original Data:")
print("Number of train instances: {}".format(len(X_train)))
print("Number of test instances: {}".format(len(X_test)))

Original Data:
Number of train instances: 60677
Number of test instances: 26005
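
The split above is purely random. With so few adopters, one option worth considering is a stratified split, which keeps the adopter share the same in the train and test sets. The sketch below uses the `stratify` argument of scikit-learn's `train_test_split` on toy data; `X_toy` and `y_toy` are made up for illustration and are not the campaign dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 1000 rows, 2% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print("Positive rate in train: {:.3f}".format(y_tr.mean()))
print("Positive rate in test:  {:.3f}".format(y_te.mean()))
```

In the notebook this would amount to passing `stratify = y` to the existing call, assuming `y` holds the `adopter` column.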


## SMOTE splitting

We'll use SMOTE (Synthetic Minority Oversampling Technique) to synthesize more samples of the minority class. The recall score we got earlier may have been low because more than 80% of the data was synthesized to balance the dataset.

Applying SMOTE to the entire dataset would mean synthesizing around 58,000 new minority instances, which would not introduce enough variation in the data for the models to learn from.

Instead, we include only a subset of the majority class instances (4000) and synthesize 4000 - 1540 = 2460 new minority instances using SMOTE. That should (hopefully) keep our models from overfitting.
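
To illustrate why heavy oversampling adds little variation, here is a simplified sketch of the interpolation idea behind SMOTE: each synthetic point lies on the line segment between a minority instance and one of its nearest minority neighbours. This is not imbalanced-learn's implementation; `smote_like_samples` and the toy array `X_min_toy` are hypothetical names used only for this example.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustrative only).

    X_min : minority-class feature rows
    n_new : number of synthetic rows to create
    k     : number of nearest minority neighbours to choose from
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority points.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every point.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position between the base and the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 points with 3 features.
X_min_toy = np.random.default_rng(1).normal(size=(20, 3))
synthetic = smote_like_samples(X_min_toy, n_new=30)
print(synthetic.shape)
```

Every synthetic row is a blend of two existing minority rows, so the new points stay inside the region already covered by the real adopters. The full pairwise distance matrix is fine for a toy example but would be memory-hungry on the real data.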

In [0]:
# fetching the indices of minority instances
adopting_indices = np.array(data[data.adopter == 1].index)

# fetching indices of normal instances
non_adopting_indices = data[data.adopter == 0].index

# randomly select 1540 normal instances to create a partitioned balanced dataset
random_non_adopting_indices = np.random.choice(non_adopting_indices,
                                            1540,
                                            replace = False)
random_non_adopting_indices = np.array(random_non_adopting_indices)

# combining both the instance groups (minority and the new random set) 
undersampled_indices = np.concatenate([adopting_indices, random_non_adopting_indices])

# creating the undersampled dataset
undersampled_data = data.iloc[undersampled_indices, :]

# storing the features (X) and target class (y)
X_undersample = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
y_undersample = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

print("Number of minority instances: {}\nNumber of normal instances: {} \nTotal: {}".format(len(undersampled_data[undersampled_data.adopter == 1]), 
                                                                                           len(undersampled_data[undersampled_data.adopter == 0]),
                                                                                           len(undersampled_data)))

Number of minority instances: 1540
Number of normal instances: 1540 
Total: 3080


In [0]:
# splitting the undersampled dataset into features (X) and target (y)
X = undersampled_data.iloc[:, undersampled_data.columns != 'adopter']
y = undersampled_data.iloc[:, undersampled_data.columns == 'adopter']

# hold-out split of the undersampled dataset (0.7 train, 0.3 test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print("Undersampled Data:")
print("Number of train instances: {}".format(len(X_train)))
print("Number of test instances: {}".format(len(X_test)))

Undersampled Data:
Number of train instances: 2156
Number of test instances: 924


In [0]:
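# SMOTE with default settings: oversample the minority class until both classes are the same size
sm = SMOTE(random_state = 12)
# pass NumPy arrays and a 1-D target (a column-vector y triggers a scikit-learn warning)
X_train_smoted_np, y_train_smoted_np = sm.fit_resample(X_train.values, y_train.values.ravel())
print(type(X_train_smoted_np))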

<class 'numpy.ndarray'>




In [0]:
# checking the lengths of new training set

print ("Number of SMOTEd instances: {}".format(len(X_train_smoted_np)))

# count each class in the resampled target (adopter = 0: non-adopters, adopter = 1: adopters)
y_train_smoted_non_adopters = y_train_smoted_np[y_train_smoted_np == 0]
y_train_smoted_adopters = y_train_smoted_np[y_train_smoted_np == 1]

print("Number of SMOTEd non-adopters (adopter = 0): {}".format(len(y_train_smoted_non_adopters)))
print("Number of SMOTEd adopters (adopter = 1): {}".format(len(y_train_smoted_adopters)))

Number of SMOTEd instances: 8410
Number of SMOTEd non-adopters (adopter = 0): 4205
Number of SMOTEd adopters (adopter = 1): 4205


We now have around 2792 instances of each class, which is better than simple undersampling, which leaves only 3080 instances in all.



For now we'll import the SMOTEd data from our R scripts, since the step above is slow to run.

## Building a shallow NN

In [0]:
# temp function to plot confusion matrix

def plot_conf_matrix(cm, 
                     classes,
                     normalize=False,
                     title='Confusion matrix',
                     cmap=plt.cm.Blues):
    """
    Plot the confusion matrix `cm` as a heatmap with one tick per entry in `classes`.
    Set `normalize=True` to show row-wise proportions instead of raw counts.
    """

    # normalize before drawing so the colours and the cell labels agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # write each cell value, switching to white text on dark cells for readability
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [0]:
# temp function to plot training accuracy and loss

def plot_acc_loss(class_hist):
  plt.subplot(211)
  plt.title('Loss')
  plt.plot(class_hist.history['loss'], label='train')
  plt.plot(class_hist.history['val_loss'], label='validation')
  plt.legend()
  
  plt.subplot(212)
  plt.title('Accuracy')
  plt.plot(class_hist.history['acc'], label='train')
  plt.plot(class_hist.history['val_acc'], label='validation')
  plt.legend()
  plt.show()

In [0]:
# Baseline NN with two small hidden layers (num_neurons and 3 units) and a sigmoid output
# num_neurons = undersampled_data.shape[1] - 1
num_neurons = 4
def baseline_nn():
  model = Sequential()
  
  # first hidden layer: num_neurons units; input_dim = number of feature columns
  model.add(Dense(num_neurons, 
                  input_dim = undersampled_data.shape[1] - 1, 
                  kernel_initializer = 'normal', 
                  activation = 'relu'))
  
  model.add(Dense(3, 
                  kernel_initializer = 'normal', 
                  activation = 'relu'))
  
  model.add(Dense(1, 
                  kernel_initializer = 'normal', 
                  activation='sigmoid'))	
  
  model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['accuracy'])
  
  return model
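
The notebook imports `StandardScaler` but the network is fed the raw features. The columns sit on very different scales (for example `songsListened` versus `avg_friend_male`), so standardizing them before training may help the network learn. The following is a minimal sketch on made-up data; the toy arrays are assumptions for illustration, not the campaign dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales: a large count and a 0-1 share.
X_train_toy = np.column_stack(
    [rng.integers(0, 50000, size=200), rng.random(200)]
).astype(float)
X_test_toy = np.column_stack(
    [rng.integers(0, 50000, size=50), rng.random(50)]
).astype(float)

scaler = StandardScaler()
# Learn the mean and standard deviation on the training data only ...
X_train_scaled = scaler.fit_transform(X_train_toy)
# ... and reuse those training statistics on the test data.
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training split alone keeps information from the test split out of the preprocessing step.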

In [0]:
# estimator = KerasClassifier(build_fn = baseline_nn, 
#                             epochs=10, 
#                             batch_size = 5, 
#                             verbose=1)

classifier = baseline_nn()

class_weights = {0: 1.,
                 1: 50.}

classifier_history = classifier.fit(X_train, 
                                    y_train, 
                                    validation_split=0.33, 
                                    batch_size = 5, 
                                    epochs = 100,
                                   class_weight = class_weights)

plot_acc_loss(classifier_history)

Train on 1444 samples, validate on 712 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100

KeyboardInterrupt: ignored
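
The `class_weights` above give adopters 50 times the weight of non-adopters. As a rough sanity check, the sketch below computes weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, for the full dataset counts and for the 1540/1540 undersampled subset. The helper `balanced_class_weights` is a hypothetical name introduced for this example.

```python
def balanced_class_weights(counts):
    """Return weight_c = n_samples / (n_classes * count_c) for each class label."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Class counts from the full dataset (value_counts() output earlier in the notebook).
full_weights = balanced_class_weights({0: 85142, 1: 1540})

# Class counts in the undersampled subset built above.
undersampled_weights = balanced_class_weights({0: 1540, 1: 1540})

# Weight of an adopter relative to a non-adopter in each case.
full_ratio = full_weights[1] / full_weights[0]
undersampled_ratio = undersampled_weights[1] / undersampled_weights[0]

print("Full data weights: {}".format(full_weights))
print("Relative adopter weight (full data): {:.1f}".format(full_ratio))
print("Undersampled weights: {}".format(undersampled_weights))
```

Under this heuristic a ratio near 50 is in line with the imbalance of the full dataset, while the undersampled subset is already balanced and would get equal weights. It may therefore be worth checking which training set the weights are meant to accompany.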

In [0]:
classifier.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_54 (Dense)             (None, 5)                 135       
_________________________________________________________________
dense_55 (Dense)             (None, 3)                 18        
_________________________________________________________________
dense_56 (Dense)             (None, 1)                 4         
Total params: 157
Trainable params: 157
Non-trainable params: 0
_________________________________________________________________
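
The parameter counts in the summary can be reproduced by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 26 input features, which is inferred from the first layer's 135 parameters rather than stated in the notebook, and the 5-unit first layer shown in the summary.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: weights plus one bias per unit."""
    return (n_inputs + 1) * n_units

# Layer widths as shown in the summary: 26 inputs (assumed) -> 5 -> 3 -> 1.
layer_sizes = [26, 5, 3, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print("Per layer: {}".format(per_layer))
print("Total: {}".format(total_params))

# Same formula with num_neurons = 4, as set in the model cell above.
per_layer_4 = [dense_params(a, b) for a, b in zip([26, 4, 3, 1], [4, 3, 1])]
print("With 4 units in the first layer: {} (total {})".format(per_layer_4, sum(per_layer_4)))
```

The summary matches a first layer of 5 units, so it appears to come from a run where `num_neurons` was 5 rather than the 4 set in the cell above.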


In [0]:
# threshold the sigmoid output at 0.5 to get class labels
y_pred = (classifier.predict(X_test) > 0.5).astype('int32')
y_pred
# cm = confusion_matrix(y_test, y_pred)
# plot_conf_matrix(cm, [0, 1])


array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]], dtype=int32)

In [0]:
# y_test.shape[0]
# y_pred

In [0]:
# fetching metrics
recall_val = recall_score(y_test, y_pred)
f1_val = f1_score(y_test, y_pred)
acc_val = accuracy_score(y_test, y_pred)
precision_val = precision_score(y_test, y_pred)

print('Accuracy: %f' % acc_val)
print('Precision: %f' % precision_val)
print('Recall: %f' % recall_val)
print('F1 score: %f' % f1_val)

Accuracy: 0.793546
Precision: 0.000000
Recall: 0.000000
F1 score: 0.000000
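
A precision, recall and F1 of zero alongside a fairly high accuracy is what the metrics look like when a model predicts class 0 for every instance. The sketch below reproduces that pattern on made-up labels; `binary_metrics` is a hypothetical helper and the toy labels are not the notebook's test set.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, with 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Precision is undefined when nothing is predicted positive; report 0.0 instead.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 8 non-adopters, 2 adopters, and a model that always predicts 0.
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 10

metrics_toy = binary_metrics(y_true_toy, y_pred_toy)
print(metrics_toy)
```

In this situation accuracy simply equals the share of class 0 in the evaluated labels, so it says nothing about how well adopters are identified.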


