### Background: 
Based on the weak signals and trends seen in the dataset visualization, it is unlikely that we can make good predictions of whether or not a user will subscribe. However, given the sparse and imbalanced dataset, we want to test some sampling techniques to see how they fare with a classification algorithm.
### Purpose:
Train a classification model to predict whether or not a user will subscribe. Predictions enable us to provide appropriate incentives to subscribe or develop new app features that will be useful to subscribers.

In [2]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [3]:
# open dataframe
df = pd.read_pickle('df_clean-2018-03-16-reducedQ')

In [4]:
df.columns

Index(['ID', 'Class', 'InstallDate', 'LastSavedUTC', 'LastSync', 'Platform',
       'FirstInstalledVersion', 'DaysSinceInstall', 'SessionCount',
       'ExperimentID', 'Experiment', 'MinLockedBottom', 'LockedBottom',
       'LockedBottomDelta', 'LockedTop', 'Variation', 'PriceOptions',
       'PriceSkus', 'Tier', 'OfferFree', 'SingleProduct', 'TwoProducts',
       'UICulture', 'Language', 'IsPaid', 'LastProduct', 'LastProductValue',
       'LastPurchaseDate', 'LastPurchaseDaysSinceInstall', 'LastDiscount',
       'LastExpired', 'FirstAvailableStoreProduct', 'AvailableStoreProducts',
       'FirstPrice', 'Feedback.FirstRating', 'Feedback.RatingCount',
       'Feedback.LastRating', 'LastOnboardingScreen', 'PriceStrategy',
       'ActionUnlockRequestCount', 'ActionUnlockRequestValue',
       'FailedDBRequests', 'TotalValue', 'InstalledDBVersion', 'Gender',
       'AgeWhenGoalsSet', 'DailyGoalCount', 'Program', 'ReminderFrequency',
       'SelectedTotal', 'ShowMealTime', 'ExerciseFrequency

In [5]:
df.drop(['ID', 'Class', 'InstallDate', 'LastSavedUTC', 'LastSync',
       'FirstInstalledVersion', 'DaysSinceInstall', 'SessionCount',
       'ExperimentID', 'Experiment', 'MinLockedBottom', 'LockedBottom',
       'LockedBottomDelta', 'LockedTop', 'Variation', 'PriceOptions',
       'PriceSkus', 'Tier', 'OfferFree', 'SingleProduct', 'TwoProducts',
       'Language', 'LastProduct', 'LastProductValue',
       'LastPurchaseDate', 'LastPurchaseDaysSinceInstall', 'LastDiscount',
       'LastExpired', 'FirstAvailableStoreProduct', 'AvailableStoreProducts',
       'FirstPrice', 'Feedback.FirstRating', 'Feedback.RatingCount',
       'Feedback.LastRating', 'LastOnboardingScreen', 'PriceStrategy',
       'ActionUnlockRequestCount', 'ActionUnlockRequestValue',
       'FailedDBRequests', 'TotalValue', 'InstalledDBVersion',
       'AgeWhenGoalsSet', 'DailyGoalCount', 'Program', 'ReminderFrequency',
       'SelectedTotal', 'ExerciseFrequency', 'Height',
       'LastWeight', 'TargetWeight', 'HeightUnit', 'WeightUnit', 'EnergyUnit',
       'OnboardingGoal', 'SessionsPerDay', 'LastBMI',
       'TargetBMI'], axis=1, inplace=True)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462919 entries, 0 to 462918
Data columns (total 11 columns):
Platform             462919 non-null object
UICulture            462919 non-null object
IsPaid               462919 non-null bool
Gender               462919 non-null object
ShowMealTime         462919 non-null bool
WeightDiff           462919 non-null float64
AgeGroup             462919 non-null int64
AgeInput             462919 non-null bool
TargetWeightInput    462919 non-null bool
QProductAmtOffer     462919 non-null int64
QProductAmtPaid      462919 non-null int64
dtypes: bool(4), float64(1), int64(3), object(3)
memory usage: 26.5+ MB


In [7]:
df['ShowMealTime'] = df['ShowMealTime'].astype(int)
df['AgeInput'] = df['AgeInput'].astype(int)
df['TargetWeightInput'] = df['TargetWeightInput'].astype(int)
df['IsPaid'] = df['IsPaid'].astype(int)
df['QPaid'] = df['QProductAmtPaid'].apply(lambda x: x>0).astype(int)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462919 entries, 0 to 462918
Data columns (total 12 columns):
Platform             462919 non-null object
UICulture            462919 non-null object
IsPaid               462919 non-null int64
Gender               462919 non-null object
ShowMealTime         462919 non-null int64
WeightDiff           462919 non-null float64
AgeGroup             462919 non-null int64
AgeInput             462919 non-null int64
TargetWeightInput    462919 non-null int64
QProductAmtOffer     462919 non-null int64
QProductAmtPaid      462919 non-null int64
QPaid                462919 non-null int64
dtypes: float64(1), int64(8), object(3)
memory usage: 42.4+ MB


In [9]:
df.drop(['IsPaid', 'QProductAmtOffer', 'QPaid'], axis=1, inplace=True)

In [10]:
# new column QPaid: 1 if the user paid for a Q-type subscription, else 0
# x / x is 1 for any nonzero amount and NaN for 0; the NaNs are filled with 0 in the next cell
df['QPaid'] = df['QProductAmtPaid']/df['QProductAmtPaid']

In [11]:
df['QPaid'].fillna(value=0, inplace=True)

In [12]:
df['QPaid'] = df['QPaid'].astype(int)

In [13]:
df.drop(['QProductAmtPaid'], axis=1, inplace=True )

### The following features showed some signal in exploratory data analysis

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462919 entries, 0 to 462918
Data columns (total 9 columns):
Platform             462919 non-null object
UICulture            462919 non-null object
Gender               462919 non-null object
ShowMealTime         462919 non-null int64
WeightDiff           462919 non-null float64
AgeGroup             462919 non-null int64
AgeInput             462919 non-null int64
TargetWeightInput    462919 non-null int64
QPaid                462919 non-null int64
dtypes: float64(1), int64(5), object(3)
memory usage: 31.8+ MB


In [15]:
# One hot encode categorical features
df_new = pd.get_dummies(df, columns=['Platform', 'UICulture', 'Gender', 'AgeGroup'], 
                        prefix=['P','C','G','A'], drop_first=True)

In [16]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462919 entries, 0 to 462918
Columns: 184 entries, ShowMealTime to A_5
dtypes: float64(1), int64(4), uint8(179)
memory usage: 96.7 MB


In [17]:
df_new.head(5)

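| | ShowMealTime | WeightDiff | AgeInput | TargetWeightInput | QPaid | P_iOS | C_ar | C_ar-AE | C_ar-BH | C_ar-DZ | ... | C_zh-Hant | C_zh-SG | C_zh-TW | G_Male | G_None | A_1 | A_2 | A_3 | A_4 | A_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |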


### Attempting random forest classifier to predict target feature, QPaid 

In [149]:
from sklearn.ensemble import RandomForestClassifier

In [150]:
X = df_new.drop(['QPaid'], axis=1)
y = df_new['QPaid']

In [151]:
from sklearn.model_selection import train_test_split

# hold out 33% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [152]:
clf = RandomForestClassifier()

In [153]:
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [154]:
predictions = clf.predict(X_test)

In [155]:
from sklearn.metrics import confusion_matrix, classification_report

In [156]:
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

[[152306     72]
 [   384      2]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00    152378
          1       0.03      0.01      0.01       386

avg / total       1.00      1.00      1.00    152764



#### Comments:
- dataset is extremely imbalanced: there are not enough positive cases to train on
- signals are not strong enough
- the avg / total f1-score is ignored here: it is dominated by the majority class and hides the precision and recall of the positive class, which matter most for our analysis
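As a quick check on the last point, the positive-class metrics can be recomputed directly from the confusion matrix printed above. This is a minimal sketch; the helper name `per_class_metrics` is hypothetical, and it assumes scikit-learn's layout of true labels in rows and predicted labels in columns.

```python
def per_class_metrics(cm):
    """Precision, recall and F1 for the positive class of a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]] (rows = true, columns = predicted).
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Confusion matrix copied from the baseline run above
cm_baseline = [[152306, 72], [384, 2]]
precision, recall, f1 = per_class_metrics(cm_baseline)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Only 2 of the 386 positive test cases are recovered, which the near-perfect avg / total row does not show.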

_____________________________________

<font color=blue>
### Techniques to address imbalanced data:

1. Split into training and test sets with stratified distribution on target (QPaid)
2. Upsample minority class (QPaid==1)
3. Downsample majority class (QPaid==0)
</font>
______________________________________

### <font color=blue> 1. Stratified distribution on target variable
</font>

In [162]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=42)

In [163]:
y_train.value_counts()

0    309391
1       764
Name: QPaid, dtype: int64

In [164]:
y_test.value_counts()

0    152388
1       376
Name: QPaid, dtype: int64
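The two `value_counts()` outputs above can be turned into positive-class rates to confirm that stratification kept the class ratio nearly identical in both splits. A small sketch using only the counts printed above; the function name `positive_rate` is hypothetical.

```python
# Class counts copied from the value_counts() outputs above
train_counts = {0: 309391, 1: 764}
test_counts = {0: 152388, 1: 376}


def positive_rate(counts):
    """Fraction of rows belonging to the positive class (QPaid == 1)."""
    return counts[1] / (counts[0] + counts[1])


train_rate = positive_rate(train_counts)
test_rate = positive_rate(test_counts)
print(f"train: {train_rate:.4%}  test: {test_rate:.4%}")
```

Both rates come out at roughly a quarter of one percent, so the test set is as imbalanced as the training set.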

In [167]:
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [168]:
prediction_strat = clf.predict(X_test)

In [169]:
print(confusion_matrix(y_test, prediction_strat))
print(classification_report(y_test, prediction_strat))

[[152303     85]
 [   375      1]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00    152388
          1       0.01      0.00      0.00       376

avg / total       1.00      1.00      1.00    152764



#### Comments:
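- The random forest classifier (RFC) performs the same or slightly worse with a stratified split (1 true positive and 85 false positives, versus 2 and 72 before); since the classifier has no fixed `random_state`, a difference this small may be run-to-run noise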

__________________________

### <font color=blue> 2. Upsampling the minority class (QPaid==1) </font>
- train test split with stratified distribution to ensure test set has QPaid==1 cases
- recombine `X_train` and `y_train` and upsample
- train model on upsampled data
- test model on original test data
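The upsampling steps listed above can be wrapped into one helper. This is a sketch on a made-up toy frame, not the notebook's data; the function name `upsample_minority` and its parameters are hypothetical, and it assumes a binary target where the minority label is known in advance.

```python
import pandas as pd
from sklearn.utils import resample


def upsample_minority(df_train, target, minority_label=1, random_state=42):
    """Resample the minority class with replacement until it matches
    the number of majority-class rows, then recombine both parts."""
    majority = df_train[df_train[target] != minority_label]
    minority = df_train[df_train[target] == minority_label]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority),
                           random_state=random_state)
    return pd.concat([majority, minority_up])


# Toy training frame (made-up values): 8 negatives, 2 positives
toy_train = pd.DataFrame({
    'feature': range(10),
    'QPaid': [0] * 8 + [1] * 2,
})
toy_balanced = upsample_minority(toy_train, 'QPaid')
print(toy_balanced['QPaid'].value_counts())
```

Only the training frame is passed to the helper, so the test set keeps its original class distribution, as the steps above require.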

In [193]:
# stratified distribution on QPaid class
X = df_new.drop(['QPaid'], axis=1)
y = df_new['QPaid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=42)

In [200]:
# recombine X_train and y_train into one dataframe for upsampling
df_train = pd.concat([X_train, y_train], axis=1)

In [201]:
# separate majority and minority classes
df_train_majority = df_train[df_train['QPaid']==0]
df_train_minority = df_train[df_train['QPaid']==1]

In [None]:
from sklearn.utils import resample

In [204]:
df_train_majority.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 309391 entries, 73626 to 460307
Columns: 184 entries, ShowMealTime to QPaid
dtypes: float64(1), int64(4), uint8(179)
memory usage: 67.0 MB


In [206]:
# upsample minority training data to match majority training data
df_train_minority_upsampled = resample(df_train_minority, replace=True,
                                       n_samples=len(df_train_majority),  # 309391 rows
                                       random_state=42)

In [207]:
# recombine all training data
df_train_upsampled = pd.concat([df_train_majority, df_train_minority_upsampled])

In [208]:
# reset X_train and y_train
X_train_upsampled = df_train_upsampled.drop(['QPaid'], axis=1)
y_train_upsampled = df_train_upsampled['QPaid']

In [209]:
clf.fit(X_train_upsampled, y_train_upsampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [211]:
prediction_upsample = clf.predict(X_test)

In [212]:
print(confusion_matrix(y_test, prediction_upsample))
print(classification_report(y_test, prediction_upsample))

[[144116   8272]
 [   336     40]]
             precision    recall  f1-score   support

          0       1.00      0.95      0.97    152388
          1       0.00      0.11      0.01       376

avg / total       1.00      0.94      0.97    152764



#### Comments:
- Upsampling helps the classifier find more positive cases (40 true positives, recall 0.11, versus 1 with the stratified split alone), but the number of false positives rises roughly 100-fold (85 to 8,272)

_______________________

### <font color=blue> 3. Downsampling of majority class (QPaid==0) </font>
- train test split with stratified distribution to ensure test set has QPaid==1 cases
- recombine `X_train` and `y_train` and downsample
- train model on downsampled data
- test model on original test data

In [214]:
df_train_minority.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 764 entries, 361115 to 12737
Columns: 184 entries, ShowMealTime to QPaid
dtypes: float64(1), int64(4), uint8(179)
memory usage: 169.4 KB


In [215]:
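# downsample majority training data (without replacement) to match minority training data
df_train_majority_downsampled = resample(df_train_majority, replace=False,
                                         n_samples=len(df_train_minority),  # 764 rows
                                         random_state=42)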

In [216]:
df_train_downsampled = pd.concat([df_train_majority_downsampled, df_train_minority])

In [217]:
X_train_downsampled = df_train_downsampled.drop(['QPaid'], axis=1)
y_train_downsampled = df_train_downsampled['QPaid']

In [218]:
clf.fit(X_train_downsampled, y_train_downsampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [220]:
prediction_downsample = clf.predict(X_test)

In [221]:
print(confusion_matrix(y_test, prediction_downsample))
print(classification_report(y_test, prediction_downsample))

[[114095  38293]
 [   140    236]]
             precision    recall  f1-score   support

          0       1.00      0.75      0.86    152388
          1       0.01      0.63      0.01       376

avg / total       1.00      0.75      0.85    152764



#### Comments:
- downsampling the majority class gives the best recall for the target class (0.63, versus 0.11 with upsampling), while precision stays at roughly 0.01 or lower in both cases
- it correctly predicts more positive cases (236 versus 40), but also produces about 4.6x as many false positives as upsampling the minority class (38,293 versus 8,272)

### <font color=blue> Summary: </font>
In this case, false positives are more acceptable than false negatives (Type I error is preferred over Type II error). By correctly predicting positive cases, specific incentives can be targeted at these users, with little downside from also targeting users who would not have subscribed in the first place.
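To make the trade-off concrete, the three confusion matrices reported above can be reduced to recall and the number of false positives incurred per true positive. This is a rough sketch using only the printed numbers; the dictionary keys and the `summary` structure are labels chosen here for illustration.

```python
# Confusion matrices copied from the runs above, layout [[TN, FP], [FN, TP]]
runs = {
    'stratified only': [[152303, 85], [375, 1]],
    'upsampled minority': [[144116, 8272], [336, 40]],
    'downsampled majority': [[114095, 38293], [140, 236]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in runs.items():
    summary[name] = {
        'recall': tp / (tp + fn),    # share of actual subscribers found
        'fp_per_tp': fp / tp,        # non-subscribers flagged per subscriber found
    }
    print(f"{name:22s} recall={summary[name]['recall']:.2f} "
          f"false positives per true positive={summary[name]['fp_per_tp']:.0f}")
```

Whether the extra false positives are affordable depends on the cost of an incentive relative to the value of a new subscriber, which this notebook does not estimate.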