This notebook uses data from the Young People Survey ([link](https://www.kaggle.com/miroslavsabo/young-people-survey)). Suppose you have joined a startup that delivers healthy meals. You have been tasked with a marketing study: understanding how likely a person is to “pay more money for good, quality or healthy food” (on a scale from 1 to 5). We will build a classifier for this rating. The classifiers will be judged on their accuracy and precision. Precision matters because the company would pitch its product to people predicted to be interested in it, so high precision means fewer wasted pitches and a better return on investment.
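To make the precision criterion concrete: for a given rating, precision is the fraction of respondents predicted to have that rating who actually gave it. The sketch below uses made-up labels (not survey data) and scikit-learn's `precision_score` with `average=None` to get one value per rating; the label lists are purely illustrative.

```python
from sklearn.metrics import precision_score

# Made-up true and predicted ratings on the 1-5 scale (illustrative only).
y_true = [5, 5, 4, 4, 3, 3, 2, 1, 5, 4]
y_pred = [5, 4, 4, 4, 3, 5, 2, 1, 5, 3]

labels = [1, 2, 3, 4, 5]
# average=None returns one precision value per label, in the order of `labels`.
per_class = precision_score(y_true, y_pred, labels=labels, average=None)
for label, value in zip(labels, per_class):
    print(f"precision for rating {label}: {value:.2f}")
```

In this toy example, three respondents are predicted as rating 5 and two of them really are, so the precision for rating 5 is 2/3.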

We start off with the basic imports required.

In [1]:
# some basic library imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
%matplotlib inline
import copy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample
pd.set_option('display.max_columns',150)
plt.style.use('bmh')
from IPython.display import display
import warnings
warnings.filterwarnings("ignore")
print('done')

done


We load the data and remove all the rows where our target variable is empty

In [2]:
# load data
link = 'https://www.kaggle.com/miroslavsabo/young-people-survey/downloads/responses.csv/2'
myDataOrig = pd.read_csv('../input/responses.csv')
myData = myDataOrig.copy()
targetVar="Spending on healthy eating"
# remove rows where targeted variable is empty
myData = myData.dropna(subset=[targetVar])
print('Done')

Done


Check whether the data is balanced. If not, we oversample to make the class distribution more even.


In [3]:
myData[targetVar].value_counts().sort_index()

1.0     41
2.0    132
3.0    282
4.0    330
5.0    223
Name: Spending on healthy eating, dtype: int64

It turns out the data is highly imbalanced, so we oversample classes 1 and 2 to 220 examples each.

In [4]:
minorityData= myData[myData[targetVar]==1]
minorityData2= myData[myData[targetVar]==2]
majorityData1 =myData[myData[targetVar] ==3]
majorityData2 =myData[myData[targetVar] ==4]
majorityData3 =myData[myData[targetVar] ==5]
upsampledMinorityData = resample(minorityData,replace=True,n_samples=220,random_state=123)
upsampledMinorityData2 = resample(minorityData2,replace=True,n_samples=220,random_state=123)
myData=pd.concat([majorityData1,majorityData2,majorityData3,upsampledMinorityData,upsampledMinorityData2])
myData[targetVar].value_counts().sort_index()


1.0    220
2.0    220
3.0    282
4.0    330
5.0    223
Name: Spending on healthy eating, dtype: int64
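One thing to keep in mind about the cell above: it duplicates rows before the data is split later on, so copies of the same respondent can land in both the training and the test set, which tends to flatter the test scores of the upsampled classes. A minimal sketch of the alternative, splitting first and oversampling only the training part, is shown below on toy data. The frame `toy`, its column names and the choice to upsample every class to the largest class size are assumptions for illustration, not what the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in for the survey: one feature and an imbalanced 1-5 target.
toy = pd.DataFrame({
    "feature": range(100),
    "target": [1] * 10 + [2] * 15 + [3] * 25 + [4] * 30 + [5] * 20,
})

# Split first, so no duplicated row can appear on both sides.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["target"], random_state=0)

# Upsample every smaller class in the training part to the largest class size.
target_size = train["target"].value_counts().max()
balanced_parts = []
for label, group in train.groupby("target"):
    if len(group) < target_size:
        group = resample(group, replace=True,
                         n_samples=target_size, random_state=0)
    balanced_parts.append(group)
train_balanced = pd.concat(balanced_parts)

print(train_balanced["target"].value_counts().sort_index())
```

The test part keeps its original, imbalanced distribution, so scores measured on it reflect unseen respondents only.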

Most of the columns take values from 1 to 5, so we convert them to the categorical data type and replace their missing values with the most frequent category.
The non-categorical features are 'Age', 'Height', 'Weight' and 'Number of siblings'. We replace their missing values with the rounded mean of each column.

In [5]:
numericCols = ['Age','Height','Weight','Number of siblings']
for colName in myData.columns.difference(numericCols):
    myData[colName]=myData[colName].astype('category')
    # most frequent category of the column
    modVal=myData[colName].describe().top
    # fillna avoids chained-indexing assignment, which pandas may apply to a copy
    myData[colName]=myData[colName].fillna(modVal)

for colName in numericCols:
    # mean of the column, rounded to the nearest integer
    meanVal= round(myData[colName].mean())
    myData[colName]=myData[colName].fillna(meanVal)
print('Done')

Done
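The two imputation rules described above can be tried in isolation on a small frame. This is a sketch on made-up values; the frame `toy` and its two columns are stand-ins chosen for illustration, and `mode()` is used here as an equivalent way to obtain the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame: one 1-5 rating column and one numeric column, both with gaps.
toy = pd.DataFrame({
    "Music": [5, 4, 5, np.nan, 3, 5],
    "Age": [20, 21, np.nan, 19, 22, 20],
})

# Rating column: fill with the most frequent value, then treat as categorical.
mode_value = toy["Music"].mode()[0]
toy["Music"] = toy["Music"].fillna(mode_value).astype("category")

# Numeric column: fill with the mean rounded to the nearest integer.
mean_value = round(toy["Age"].mean())
toy["Age"] = toy["Age"].fillna(mean_value)

print(toy)
```

Here the missing rating becomes 5 (the most frequent value) and the missing age becomes 20 (the mean 20.4, rounded).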


Our data cleaning is done. We now load the sklearn modules needed later on.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.ensemble import BaggingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
print('Done')

Done


We create dummy features for the categorical data, which produces a large number of features. To avoid the problems of high-dimensional data, we keep the top 150 features ranked by chi-squared score. Then we split the data into training, development and test sets in a 70:10:20 ratio: 20% is held out for testing, and 12.5% of the remaining 80% (10% overall) becomes the development set.

In [7]:
RANDOM_STATE = 3812 # fixed seed for reproducible results; 3812 is an arbitrary choice
features= myData[myData.columns.difference([targetVar])]
features = pd.get_dummies(features)
target = myData[targetVar]
X_new = SelectKBest(chi2, k=150).fit_transform(features, target)
X_train, X_test, y_train, y_test = train_test_split(X_new, target,test_size=0.20,random_state =RANDOM_STATE)
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train,test_size=0.125,random_state =RANDOM_STATE)

print('Done')

Done
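The split arithmetic and the feature selection step can be checked on synthetic data. The sketch below is not the notebook's data: the random 0/1 features, the column names `f0`..`f19` and `k=5` are assumptions for illustration. It also shows how `get_support()` recovers which columns were kept, something `fit_transform` alone does not reveal.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Toy non-negative dummy features (chi2 requires non-negative values).
features = pd.DataFrame(rng.randint(0, 2, size=(200, 20)),
                        columns=[f"f{i}" for i in range(20)])
target = pd.Series(rng.randint(1, 6, size=200))

selector = SelectKBest(chi2, k=5).fit(features, target)
kept_columns = features.columns[selector.get_support()]  # names of kept features
X_new = selector.transform(features)

# 20% test, then 12.5% of the remaining 80% (= 10% overall) as development set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_new, target, test_size=0.20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=0)

print(list(kept_columns))
print(len(X_train), len(X_dev), len(X_test))
```

With 200 rows the three parts contain 140, 20 and 40 rows, which matches the 70:10:20 ratio.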


The function below returns the development-set accuracy for a given classifier.

In [8]:
def getDevAccuracyForGivenClassifier(classifier):
    boostedClassifier = make_pipeline(StandardScaler(),classifier)
    boostedClassifier.fit(X_train, y_train)
    pred_Dev_Data = boostedClassifier.predict(X_dev)
    devAccScore = metrics.accuracy_score(y_dev, pred_Dev_Data)
    return devAccScore

Here we tune the parameters of different classifiers based on their performance on the development set. The classifiers used are: Random Forest, Decision Tree, Support Vector Machine, Neural Network, AdaBoost classifier and Bagging classifier. For the neural network, the value being tuned is the number of units in a single hidden layer.

In [9]:
depthArr = [1,2,5,10,15,20]
bestRFdepth=0
bestAcc=0
for i in range(len(depthArr)):
    randomForest_clf = RandomForestClassifier(n_jobs=100,random_state=RANDOM_STATE, max_depth=depthArr[i])
    devAccScore=getDevAccuracyForGivenClassifier(randomForest_clf)
#     print(devAccScore)
    if devAccScore>bestAcc:
        bestAcc=devAccScore
        bestRFdepth=depthArr[i]
print("Best depth for random forest is: ",end="");print(bestRFdepth)

depthArr = [1,2,5,10,15,20]
bestDTdepth=0
bestAcc=0
for i in range(len(depthArr)):
    decisionTree_clf = DecisionTreeClassifier(max_depth=depthArr[i],random_state=RANDOM_STATE)
    devAccScore=getDevAccuracyForGivenClassifier(decisionTree_clf)
#     print(devAccScore)
    if devAccScore>bestAcc:
        bestAcc=devAccScore
        bestDTdepth=depthArr[i]
print("Best depth for DT is: ",end="");print(bestDTdepth)

gammaArr = [0.001,0.01,0.1,0.2,0.5,1,2,5,10]
bestSVMGamma=0
bestAcc=0
for i in range(len(gammaArr)):
    svc_clf=SVC(gamma=gammaArr[i],random_state=RANDOM_STATE)
    devAccScore=getDevAccuracyForGivenClassifier(svc_clf)
#     print(devAccScore)
    if devAccScore>bestAcc:
        bestAcc=devAccScore
        bestSVMGamma=gammaArr[i]
print("Best gamma for SVM is: ",end="");print(bestSVMGamma)

hiddenLaterCtArr = [1,2,5,10,20,50,100,200,500]
bestNumberOfHiddenLayers=0
bestAcc=0
for i in range(len(hiddenLaterCtArr)):
    neuralNet_clf=MLPClassifier(hidden_layer_sizes=(hiddenLaterCtArr[i],),random_state=RANDOM_STATE)
    devAccScore=getDevAccuracyForGivenClassifier(neuralNet_clf)
#     print(devAccScore)
    if devAccScore>bestAcc:
        bestAcc=devAccScore
        bestNumberOfHiddenLayers=hiddenLaterCtArr[i]
print("Best number of hidden layers for neural networks is: ",end="");print(bestNumberOfHiddenLayers)

estimatorCtArr = [10,20,50,100,200,500,1000,2000]
bestNumberOfEstimators=0
bestAcc=0
for i in range(len(estimatorCtArr)):
    adaBoostCLF=AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=bestDTdepth,random_state=RANDOM_STATE),n_estimators =estimatorCtArr[i])
    devAccScore=getDevAccuracyForGivenClassifier(adaBoostCLF)
#     print(devAccScore)
    if devAccScore>bestAcc:
        bestAcc=devAccScore
        bestNumberOfEstimators=estimatorCtArr[i]
print("Best number of estimators for ada boost is: ",end="");print(bestNumberOfEstimators)

estimatorCtArr = [10,20,50,100,200,500,1000,2000]
bestNumberOfBaggingEstimators=0
bestAcc=0
for i in range(len(estimatorCtArr)):
    bag_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=bestDTdepth,random_state=RANDOM_STATE), n_estimators=estimatorCtArr[i],bootstrap=True, n_jobs=-1, oob_score=True, random_state=RANDOM_STATE)
    devAccScore=getDevAccuracyForGivenClassifier(bag_clf)
#     print(devAccScore)
    if devAccScore>bestAcc:
        bestAcc=devAccScore
        bestNumberOfBaggingEstimators=estimatorCtArr[i]
print("Best number of estimators for bagging classifier is: ",end="");print(bestNumberOfBaggingEstimators)

Best depth for random forest is: 15
Best depth for DT is: 15
Best gamma for SVM is: 0.01
Best number of hidden layers for neural networks is: 200
Best number of estimators for ada boost is: 2000
Best number of estimators for bagging classifier is: 50
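The six loops above all follow the same pattern: build a classifier for each candidate value, score it on the development set, and keep the first value with the highest accuracy. A sketch of that pattern as a single helper is shown below. The function name `tune_on_dev`, its signature and the synthetic data are hypothetical and only meant to illustrate the idea.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def tune_on_dev(make_classifier, candidates, X_train, y_train, X_dev, y_dev):
    """Return (best_value, best_accuracy) over the candidate values."""
    best_value, best_acc = None, -1.0
    for value in candidates:
        model = make_pipeline(StandardScaler(), make_classifier(value))
        model.fit(X_train, y_train)
        acc = accuracy_score(y_dev, model.predict(X_dev))
        if acc > best_acc:  # strict '>' keeps the first of tied candidates
            best_value, best_acc = value, acc
    return best_value, best_acc


# Synthetic data standing in for the survey features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_dv, y_tr, y_dv = train_test_split(X, y, test_size=0.2, random_state=0)

depths = [1, 2, 5, 10, 15, 20]
best_depth, best_acc = tune_on_dev(
    lambda d: DecisionTreeClassifier(max_depth=d, random_state=0),
    depths, X_tr, y_tr, X_dv, y_dv)
print(best_depth, round(best_acc, 3))
```

Passing the data explicitly, rather than reading `X_train` and `X_dev` from the enclosing scope, makes the helper reusable for any split.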


Here we train the classifiers with their best hyperparameters and apply them to the test data. Then we manually compare their accuracy, F1 score and classification report.

In [10]:
randomForest_clf = RandomForestClassifier(n_jobs=200,random_state=RANDOM_STATE, max_depth=bestRFdepth)
decisionTree_clf = DecisionTreeClassifier(max_depth=bestDTdepth,random_state=RANDOM_STATE)
gaussianNB_clf = GaussianNB()
svc_clf=SVC(gamma=bestSVMGamma,random_state=RANDOM_STATE)
neuralNet_clf=MLPClassifier(hidden_layer_sizes=(bestNumberOfHiddenLayers,),random_state=RANDOM_STATE)
combinationClassifier=VotingClassifier(estimators=[('randFor',randomForest_clf),('svc',svc_clf),('neuralNet_clf',neuralNet_clf)])
adaBoostCLF=AdaBoostClassifier(base_estimator=decisionTree_clf,n_estimators =bestNumberOfEstimators)
bag_clf = BaggingClassifier(decisionTree_clf, n_estimators=bestNumberOfBaggingEstimators,bootstrap=True, n_jobs=-1, oob_score=True, random_state=RANDOM_STATE)

classifierArr=[randomForest_clf,decisionTree_clf,gaussianNB_clf,svc_clf,neuralNet_clf,combinationClassifier,adaBoostCLF,bag_clf]
classifierName=['Random Forest','Decision Tree','Gaussian NB','Support Vector','Neural Network', 'Voting Classifier','adaBoost Classifier','Bagging Classifier']
classifierNum = -1
print('---------------------------------------------------------------- \n')
for i in range(len(classifierArr)):
    if(classifierNum != -1):
        i= classifierNum
    boostedClassifier = make_pipeline(StandardScaler(),classifierArr[i])
    boostedClassifier.fit(X_train, y_train)
    pred_train_std = boostedClassifier.predict(X_train)
    pred_test_std = boostedClassifier.predict(X_test)
    print('----------------------------------'+classifierName[i]+'-------------------------')
    print('Prediction accuracy for the training dataset:',end = " ")
    print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train_std)))
    print('Prediction accuracy for the test dataset:',end = " ")
    print('{:.2%} \n'.format(metrics.accuracy_score(y_test, pred_test_std)))
    print('f1 score is: {:.2}'.format(f1_score(y_test, pred_test_std,average='macro')))
    print(classification_report(y_test, pred_test_std))
    if(classifierNum != -1):
        break;
print('\n ----------------------------------XXXXXX-------------------------')

---------------------------------------------------------------- 

----------------------------------Random Forest-------------------------
Prediction accuracy for the training dataset: 97.98%
Prediction accuracy for the test dataset: 49.02% 

f1 score is: 0.51
             precision    recall  f1-score   support

        1.0       0.73      1.00      0.85        33
        2.0       0.64      0.64      0.64        45
        3.0       0.35      0.33      0.34        58
        4.0       0.41      0.44      0.42        71
        5.0       0.37      0.27      0.31        48

avg / total       0.47      0.49      0.48       255

----------------------------------Decision Tree-------------------------
Prediction accuracy for the training dataset: 97.20%
Prediction accuracy for the test dataset: 44.31% 

f1 score is: 0.49
             precision    recall  f1-score   support

        1.0       0.89      1.00      0.94        33
        2.0       0.58      0.64      0.61        45
        3

As we can see, the support vector machine has the best test accuracy and also the highest precision, so it is the best classifier for our problem statement. Let's look at its performance on the development data again and examine the examples where the prediction did not match the label.

In [11]:
svc_clf=SVC(gamma=bestSVMGamma,random_state=RANDOM_STATE,decision_function_shape='ovr')
svc_clf.fit(X_train, y_train)
pred_Dev_Data = svc_clf.predict(X_dev)

(y_dev==pred_Dev_Data).head()


845    False
627     True
928    False
395     True
374    False
Name: Spending on healthy eating, dtype: bool

Now we look at a feature vector that was classified correctly.

In [12]:
# X_dev[0]
predictedVal = svc_clf.predict([X_dev[1]])
print(predictedVal[0])
print(y_dev.values[1])
print(svc_clf.decision_function(X_dev)[1])

1.0
1.0
[ 4.32247471  0.90521899  1.89505941  3.00988481 -0.13263791]


We can observe that the label whose decision function value is highest is the one chosen. Now let's consider a case where there was a misclassification.

In [13]:
predictedVal = svc_clf.predict([X_dev[2]])
print(predictedVal[0])
print(y_dev.values[2])
print(svc_clf.decision_function(X_dev)[2])

4.0
1.0
[ 3.10977236 -0.16589295  1.9756115   4.20985883  0.87065027]
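In both examples the predicted label is the class whose decision function score is largest, with the score columns ordered as in `classes_`. The sketch below checks that relationship on synthetic data. It assumes scikit-learn 0.22 or later, where `break_ties=True` makes `predict` take the argmax of the `'ovr'` decision function; with the default `break_ties=False`, scikit-learn documents that tied classes are resolved by returning the first one, so the two may occasionally differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 5-class data standing in for the survey features.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
y = y + 1  # relabel 0-4 as 1-5 to mirror the rating scale

clf = SVC(gamma=0.01, decision_function_shape='ovr',
          break_ties=True, random_state=0)
clf.fit(X, y)

scores = clf.decision_function(X[:5])            # shape (5, n_classes)
by_argmax = clf.classes_[np.argmax(scores, axis=1)]  # highest score per row
print(by_argmax)
print(clf.predict(X[:5]))
```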


In the misclassified example, the score for label 4 is higher than the score for label 1. We may be able to fix this by increasing or decreasing gamma. Let's try decreasing it.

In [14]:
svc_clf=SVC(gamma=bestSVMGamma/2,random_state=RANDOM_STATE,decision_function_shape='ovr')
svc_clf.fit(X_train, y_train)
predictedVal = svc_clf.predict([X_dev[2]])
print(predictedVal[0])
print(y_dev.values[2])
print(svc_clf.decision_function(X_dev)[2])


4.0
1.0
[ 3.07548494 -0.18592052  1.98344413  4.23680872  0.89018273]


The score for label 1 decreased, so decreasing gamma was a bad choice. Let's try increasing it and check the values.

In [15]:
svc_clf=SVC(gamma=bestSVMGamma*2,random_state=RANDOM_STATE,decision_function_shape='ovr')
svc_clf.fit(X_train, y_train)
predictedVal = svc_clf.predict([X_dev[2]])
print(predictedVal[0])
print(y_dev.values[2])
print(svc_clf.decision_function(X_dev)[2])

1.0
1.0
[ 4.24287492 -0.20501092  1.95567784  3.16421632  0.84224184]


By increasing gamma, the given example is now predicted correctly. We can repeat this over the whole development set to get an idea of how much to increase or decrease gamma to improve the classifier.
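A minimal sketch of that idea follows: instead of looking at one example, sweep a few multiples of the tuned gamma and compare accuracy over the whole development set. The data here is synthetic, `base_gamma` is a stand-in for `bestSVMGamma`, and the list of factors is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the training and development sets.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_dv, y_tr, y_dv = train_test_split(X, y, test_size=0.2, random_state=0)

base_gamma = 0.01  # stand-in for bestSVMGamma
dev_accuracy = {}
for factor in [0.25, 0.5, 1, 2, 4]:
    gamma = base_gamma * factor
    clf = SVC(gamma=gamma, random_state=0).fit(X_tr, y_tr)
    dev_accuracy[gamma] = accuracy_score(y_dv, clf.predict(X_dv))

# Pick the gamma with the highest development accuracy.
best_gamma = max(dev_accuracy, key=dev_accuracy.get)
for gamma, acc in dev_accuracy.items():
    print(f"gamma={gamma:g}: dev accuracy {acc:.3f}")
print("best gamma on dev set:", best_gamma)
```

Judging gamma by accuracy over all development examples avoids tuning it to fix a single misclassified point at the expense of others.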