## 1. Problem Definition

In this problem, we take the role of a company trying to sell its product. So far, the company has not been able to profile the customers who bought its new product, X918. The data collected covers a diverse group of customers with a wide variety of demographic attributes. Not understanding its core customer leads to an inefficient marketing mix; for example, the company could have wasted promotions and coupons on people who are unlikely to buy. In addition, the company does not know who is likely to buy the product in the next period. Without that knowledge it cannot reliably estimate demand for the new product, which can lead to oversupply in the market or a poorly optimized supply chain. To profile the buyers of the product and predict next month's buyers, we use a decision tree and a random forest.

## 2. Preparing the Data
The following libraries are used in this problem:

In [263]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from imblearn import over_sampling

We have existing records for 29,304 customers, with variables including age, working class, level of education, years of education, marital status, occupation, relationship, race, gender, average hours worked per week, and native country.

In [264]:
df=pd.read_sas("C:/Users/namhpham/Documents/Personal files/R workspace/x918_training_data.sas7bdat",encoding="utf-8")
# encoding is required; otherwise string columns are returned as bytes
df.shape

(29304, 13)

In [265]:
df.head(n=25)

| | person_id | Purchased_X918 | age | work_class | education | education_yrs | marital_status | occupation | relationship | race | gender | hrs_per_week | native_country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.0 | No | 50.0 | Self-emp-not-inc | Bachelors | 13.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 13.0 | United-States |
| 1 | 3.0 | No | 38.0 | Private | HS-grad | 9.0 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 40.0 | United-States |
| 2 | 5.0 | No | 28.0 | Private | Bachelors | 13.0 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 40.0 | Cuba |
| 3 | 6.0 | No | 37.0 | Private | Masters | 14.0 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 40.0 | United-States |
| 4 | 7.0 | No | 49.0 | Private | 9th | 5.0 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 16.0 | Jamaica |
| 5 | 8.0 | Yes | 52.0 | Self-emp-not-inc | HS-grad | 9.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 45.0 | United-States |
| 6 | 9.0 | Yes | 31.0 | Private | Masters | 14.0 | Never-married | Prof-specialty | Not-in-family | White | Female | 50.0 | United-States |
| 7 | 10.0 | Yes | 42.0 | Private | Bachelors | 13.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 40.0 | United-States |
| 8 | 11.0 | Yes | 37.0 | Private | Some-college | 10.0 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 80.0 | United-States |
| 9 | 12.0 | Yes | 30.0 | State-gov | Bachelors | 13.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 40.0 | India |


In [266]:
df.dtypes

person_id         float64
Purchased_X918     object
age               float64
work_class         object
education          object
education_yrs     float64
marital_status     object
occupation         object
relationship       object
race               object
gender             object
hrs_per_week      float64
native_country     object
dtype: object

In [267]:
# Converting object to category
for col in ['Purchased_X918','work_class','education','marital_status', 'occupation','relationship','race','gender','native_country']:
    df[col]=df[col].astype('category')

Some values are marked as `?`. We convert them to NaN so that they are counted as missing values.

In [268]:
df=df.replace('?', np.nan)

In [269]:
# Count the missing values in each column

df.isnull().sum()

person_id            0
Purchased_X918       0
age                  0
work_class        1650
education            0
education_yrs        0
marital_status       0
occupation        1657
relationship         0
race                 0
gender               0
hrs_per_week         0
native_country     523
dtype: int64

There are some missing values in the dataset: 1,650 customers (5.63%) have no working class, 1,657 (5.65%) have no occupation, and 523 (1.78%) have no native country. We keep these rows and proceed with the dataset.
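The percentages follow directly from the counts printed by `df.isnull().sum()` and the 29,304 rows of the dataset. The short sketch below redoes that arithmetic in plain Python; the variable names are illustrative only. On the DataFrame itself, `df.isnull().mean() * 100` should give the same shares.

```python
# Missing-value counts as printed by df.isnull().sum()
n_rows = 29304
missing_counts = {"work_class": 1650, "occupation": 1657, "native_country": 523}

# Share of rows with a missing value, as a percentage rounded to 2 decimals
missing_pct = {col: round(100 * n / n_rows, 2) for col, n in missing_counts.items()}

for col, pct in missing_pct.items():
    print(f"{col}: {pct}%")
```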

### Broad insights into the data
The target variable is `Purchased_X918`, which indicates whether the person purchased the product in the last year. The dataset is imbalanced: only about 24% of the customers bought the product.

In [270]:
# Over the last year, only about 24% of customers bought the product
purchase_count=df['Purchased_X918'].value_counts(normalize=True)
purchase_count

No     0.760203
Yes    0.239797
Name: Purchased_X918, dtype: float64

In [271]:
# Encode target variable for easier processing
df['Purchased_X918'] = df['Purchased_X918'].map({'Yes': 1, 'No': 0}).astype(float)

Overall, people who purchased the product are generally older, have more years of education, and work longer hours. They are predominantly white men in white-collar occupations.

In [272]:
# Break down by age, education years, and hours per week
df.groupby('Purchased_X918').describe()

| Purchased_X918 | age count | age mean | age std | age min | age 25% | age 50% | age 75% | age max | education_yrs count | education_yrs mean | ... | hrs_per_week 75% | hrs_per_week max | person_id count | person_id mean | person_id std | person_id min | person_id 25% | person_id 50% | person_id 75% | person_id max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 22277.0 | 36.718185 | 13.973288 | 17.0 | 25.0 | 34.0 | 46.0 | 90.0 | 22277.0 | 9.597208 | ... | 40.0 | 99.0 | 22277.0 | 16260.50083 | 9375.103925 | 2.0 | 8135.0 | 16273.0 | 24383.0 | 32560.0 |
| 1.0 | 7027.0 | 44.260424 | 10.558968 | 19.0 | 36.0 | 44.0 | 51.0 | 90.0 | 7027.0 | 11.613775 | ... | 50.0 | 99.0 | 7027.0 | 16370.717234 | 9456.577114 | 8.0 | 8226.5 | 16419.0 | 24511.5 | 32561.0 |


In [273]:
df.groupby(['Purchased_X918'])['race'].value_counts(normalize=True)

Purchased_X918  race              
0.0             White                 0.838174
                Black                 0.109710
                Asian-Pac-Islander    0.030884
                Amer-Indian-Eskimo    0.011402
                Other                 0.009831
1.0             White                 0.905792
                Black                 0.050519
                Asian-Pac-Islander    0.035719
                Amer-Indian-Eskimo    0.004554
                Other                 0.003415
Name: race, dtype: float64

In [274]:
df.groupby(['Purchased_X918'])['gender'].value_counts(normalize=True)

Purchased_X918  gender
0.0             Male      0.613907
                Female    0.386093
1.0             Male      0.849865
                Female    0.150135
Name: gender, dtype: float64

In [275]:
df[df['Purchased_X918']==1].groupby('Purchased_X918')['occupation'].value_counts(normalize=True)

Purchased_X918  occupation       
1.0             Exec-managerial      0.255688
                Prof-specialty       0.243874
                Sales                0.128209
                Craft-repair         0.120333
                Adm-clerical         0.066365
                Transport-moving     0.043028
                Tech-support         0.037631
                Machine-op-inspct    0.032818
                Protective-serv      0.028005
                Other-service        0.017065
                Farming-fishing      0.015023
                Handlers-cleaners    0.011669
                Armed-Forces         0.000146
                Priv-house-serv      0.000146
Name: occupation, dtype: float64

## 3. Building the Model
Our goal is to classify customers in order to predict potential buyers in future periods. We build two models, a decision tree and a random forest, with `Purchased_X918` as the target variable. The first step is to choose the variables used for classification; we keep every column except `person_id`.

In [276]:
model_variables = ['Purchased_X918','age','work_class','education','education_yrs','marital_status', 'occupation','relationship','race','gender','hrs_per_week','native_country']
df_relevant = df[model_variables]

In principle, decision trees and random forests can split on categorical variables directly. The scikit-learn implementations, however, require numeric input, so we one-hot encode the categorical columns.

In [277]:
df_relevant_encoded = pd.get_dummies(df_relevant)
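To make the effect of one-hot encoding concrete, here is a plain-Python sketch of the idea: each category of a column becomes its own 0/1 indicator column named `column_value`, which is also the naming visible in the encoded dataset. The helper `one_hot` is hypothetical and only illustrates the concept; it is not how `pd.get_dummies` is implemented.

```python
def one_hot(rows, column):
    """Replace one categorical column with 0/1 indicator columns named column_value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged
        new_row = {k: v for k, v in row.items() if k != column}
        # Add one indicator per category
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

# Toy rows using age and gender values from two of the records shown earlier
customers = [
    {"age": 50.0, "gender": "Male"},
    {"age": 28.0, "gender": "Female"},
]
encoded = one_hot(customers, "gender")
print(encoded)
```

Numeric columns such as `age` pass through untouched, which matches the encoded dataset where `age`, `education_yrs`, and `hrs_per_week` keep their original values.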

The new dataset after one-hot encoding:

In [278]:
df_relevant_encoded.head()

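| | Purchased_X918 | age | education_yrs | hrs_per_week | work_class_? | work_class_Federal-gov | work_class_Local-gov | work_class_Never-worked | work_class_Private | work_class_Self-emp-inc | ... | native_country_Portugal | native_country_Puerto-Rico | native_country_Scotland | native_country_South | native_country_Taiwan | native_country_Thailand | native_country_Trinadad&Tobago | native_country_United-States | native_country_Vietnam | native_country_Yugoslavia |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 50.0 | 13.0 | 13.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0.0 | 38.0 | 9.0 | 40.0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.0 | 28.0 | 13.0 | 40.0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.0 | 37.0 | 14.0 | 40.0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.0 | 49.0 | 5.0 | 16.0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |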


Our next step is to partition the dataset into a training set and a test set (about 10% of the records):

In [279]:
training_features,test_features,training_target,test_target=train_test_split(df_relevant_encoded.drop('Purchased_X918',axis=1),df_relevant_encoded['Purchased_X918'],test_size=0.1, random_state=12)

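Since we have an unbalanced dataset (76.02% not purchased vs. 23.98% purchased), we want to oversample the Purchased group to create a balanced sample and reduce the bias in our model. Before oversampling, we hold out a validation set from the training data, so that synthetic observations are created from the training portion only and the validation set contains only real customers.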

In [280]:
x_train,x_val,y_train,y_val=train_test_split(training_features, training_target,test_size=.1, random_state=12)
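The two successive 10% splits leave three partitions of different sizes. The sketch below works out those sizes in plain Python, assuming that `train_test_split` rounds the held-out share up to a whole number of rows; the helper `split_sizes` is hypothetical. The resulting validation size is consistent with the decision tree validation accuracy reported in section 3.1 (2104/2638 ≈ 0.79757).

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_held_out) when the held-out share is rounded up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 29304
n_train_full, n_test = split_sizes(n_total, 0.1)   # first split: test set
n_train, n_val = split_sizes(n_train_full, 0.1)    # second split: validation set

print(n_train, n_val, n_test)
# Validation share of the whole dataset (about 9%, not 10%)
print(round(n_val / n_total, 3))
```

Because the validation set is 10% of the remaining training data, it ends up slightly smaller than the test set.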

We then upsample using the SMOTE algorithm (Synthetic Minority Over-sampling Technique). SMOTE takes a minority-class observation, finds its k nearest minority-class neighbors, randomly picks one of them, and creates a new synthetic observation at a random point between the two.

In [281]:
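# Recent imbalanced-learn releases use sampling_strategy and fit_resample
# (older releases used ratio and fit_sample)
sm = SMOTE(random_state=12, sampling_strategy=1.0)
x_train_res, y_train_res = sm.fit_resample(x_train, y_train)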
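The interpolation step at the heart of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the function `smote_sketch`, its parameters, and the toy points are all made up for the example, and distances are plain Euclidean.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=12):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # New point at a random position on the segment from base to neighbor
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy minority class with two features (age, hrs_per_week); values are made up
minority = [(44.0, 50.0), (52.0, 45.0), (37.0, 80.0), (42.0, 40.0)]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

Every synthetic point lies between two real minority observations, so the new rows stay inside the region the minority class already occupies instead of duplicating existing rows.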



### 3.1. Decision Tree model

In [282]:
clf_dt=tree.DecisionTreeClassifier()

In [283]:
clf_dt=clf_dt.fit(x_train_res,y_train_res) 

In [284]:
print ('Decision Tree Validation Result')
print (clf_dt.score(x_val,y_val))
print (recall_score(y_val,clf_dt.predict(x_val)))

Decision Tree Validation Result
0.797573919636
0.535825545171


In [285]:
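print ('Decision Tree Test Result')
print (clf_dt.score(test_features, test_target))
# Recall of the decision tree itself (clf_dt, not clf_rf)
print (recall_score(test_target,clf_dt.predict(test_features)))

Decision Tree Test Result
0.773797338792
0.532670454545


Accuracy on the validation set (79.8%) and on the test set (77.4%) does not differ much, which suggests that the tree generalizes reasonably well. The recall printed for the test set should be read with caution: 0.532670454545 is exactly the random forest's test recall reported in section 3.2, because it was recorded from a run in which this cell called `clf_rf.predict`. Re-running the cell as written above gives the decision tree's own test recall.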

### 3.2. Random Forest model

In [286]:
clf_rf=RandomForestClassifier(n_estimators=25,random_state=12)

In [287]:
clf_rf.fit(x_train_res,y_train_res)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=1,
            oob_score=False, random_state=12, verbose=0, warm_start=False)

In [288]:
print ('Random Forest Validation result')
print (clf_rf.score(x_val,y_val))
print (recall_score(y_val,clf_rf.predict(x_val)))

Random Forest Validation result
0.826004548901
0.580996884735


In [289]:
print ('Random Forest Test result')
print (clf_rf.score(test_features,test_target))
print (recall_score(test_target,clf_rf.predict(test_features)))

Random Forest Test result
0.805527123849
0.532670454545


Validation and test results are again close (accuracy of 82.6% vs. 80.6%, recall of 58.1% vs. 53.3%), which suggests that the model does not overfit badly.

## 4. Model evaluation and Implications

Random forest provides better performance: its accuracy is higher than the decision tree's on both the validation set (82.6% vs. 79.8%) and the test set (80.6% vs. 77.4%), and its recall on the validation set is higher as well (58.1% vs. 53.6%). On the test set, the random forest reaches an accuracy of 80.6% and a recall of 53.3%.
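Accuracy and recall answer different questions, which matters on an imbalanced dataset like this one. The sketch below defines both metrics in plain Python on made-up toy labels, then computes the accuracy of a baseline that always predicts "No" from the class counts in the data (22,277 non-buyers and 7,027 buyers). The function names are illustrative; the notebook itself uses `score` and `recall_score` from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model finds."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Toy labels (made up): 1 = purchased, 0 = did not purchase
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

print(accuracy(y_true, y_pred))
print(recall(y_true, y_pred))

# Baseline from the class counts in the data: always predicting "No"
n_no, n_yes = 22277, 7027
baseline_accuracy = n_no / (n_no + n_yes)
print(round(baseline_accuracy, 4))
```

The always-"No" baseline already reaches about 76% accuracy while finding no buyers at all, so recall is the more informative number for judging how well a model identifies likely purchasers.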

With this higher predictive performance, the company can apply the model to potential customers to find those who are likely to buy the product in the future. If it can collect data with attributes similar to the given dataset, the company can identify the groups of customers who are more interested in the product and target its marketing channels accordingly, keeping in mind that the model currently finds only about half of the actual buyers (recall of about 53%).

#### References
[The Right Way to Oversample in Predictive Modeling](https://beckernick.github.io/oversampling-modeling/)