# Predicting Preferences
---

A large dataset of young adult preferences is available online. It contains answers about music preferences, movie preferences, ethics questions and other topics. The following report cleans the data, runs a brief correlation analysis and explores machine learning models to predict spending preferences in young adults. 

These machine learning models could be adjusted and used to implement more effective advertisements for entertainment companies.

In [1]:
%matplotlib inline

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np


### Data cleaning
---
We begin by reading in the data:

In [2]:
resp = pd.read_csv('young-people-survey/responses.csv')
resp.head()

| | Music | Slow songs or fast songs | Dance | Folk | Country | Classical music | Musical | Pop | Rock | Metal or Hardrock | ... | Age | Height | Weight | Number of siblings | Gender | Left - right handed | Education | Only child | Village - town | House - block of flats |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 3.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 5.0 | 5.0 | 1.0 | ... | 20.0 | 163.0 | 48.0 | 1.0 | female | right handed | college/bachelor degree | no | village | block of flats |
| 1 | 4.0 | 4.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 | 4.0 | ... | 19.0 | 163.0 | 58.0 | 2.0 | female | right handed | college/bachelor degree | no | city | block of flats |
| 2 | 5.0 | 5.0 | 2.0 | 2.0 | 3.0 | 4.0 | 5.0 | 3.0 | 5.0 | 3.0 | ... | 20.0 | 176.0 | 67.0 | 2.0 | female | right handed | secondary school | no | city | block of flats |
| 3 | 5.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | ... | 22.0 | 172.0 | 59.0 | 1.0 | female | right handed | college/bachelor degree | yes | city | house/bungalow |
| 4 | 5.0 | 3.0 | 4.0 | 3.0 | 2.0 | 4.0 | 3.0 | 5.0 | 3.0 | 1.0 | ... | 20.0 | 170.0 | 59.0 | 1.0 | female | right handed | secondary school | no | village | house/bungalow |


Next we will fill NaNs and adjust data types.

In [3]:
#How many of the 150 columns are floats?
print("{} of 150 columns are floats".format((resp.dtypes=='float64').sum()))
#How many NaNs are there?
print("There are {} NaNs".format(1010*150 -resp.notnull().sum().sum()))

134 of 150 columns are floats
There are 608 NaNs
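
The counts above rely on the hard-coded shape `1010*150`. A minimal sketch of the same check that reads the shape from the frame itself is shown here; the `toy` frame and its values are made up for illustration and stand in for `resp`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey responses (values are made up)
toy = pd.DataFrame({
    "Music": [5.0, np.nan, 4.0],
    "Age": [20.0, 19.0, np.nan],
    "Gender": ["female", None, "male"],
})

# Count missing cells directly instead of hard-coding rows * columns
n_missing = int(toy.isnull().sum().sum())
n_float_cols = int((toy.dtypes == "float64").sum())

print("{} of {} columns are floats".format(n_float_cols, toy.shape[1]))
print("There are {} NaNs".format(n_missing))
```

Counting with `isnull()` keeps the report correct if rows or columns are later dropped.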


In [4]:
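#Fill numeric NaNs with the column median (text columns are skipped)
resp = resp.fillna(resp.median(numeric_only=True))
for c in resp.columns:
    try:
        resp[c] = resp[c].astype(float)
    except (ValueError, TypeError):
        #Leave text columns as they are
        pass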

In [5]:
#How many of the 150 columns are floats?
print("{} of 150 columns are floats".format((resp.dtypes=='float64').sum()))
#How many NaNs are there?
print("There are {} NaNs".format(1010*150 -resp.notnull().sum().sum()))

139 of 150 columns are floats
There are 37 NaNs


In [6]:
mask = resp.dtypes != 'float64'
cateCol = resp.columns[mask]
for c in cateCol:
    resp[c] = pd.Categorical(resp[c])
    
print('There are {} non float answers.'.format(mask.sum()))

There are 11 non float answers.


We will have to look more in depth at the categories.

In [7]:
categoryCols = resp.columns[mask]
cateDf = resp[categoryCols]
cateDf.iloc[:5,:]

| | Smoking | Alcohol | Punctuality | Lying | Internet usage | Gender | Left - right handed | Education | Only child | Village - town | House - block of flats |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | never smoked | drink a lot | i am always on time | never | few hours a day | female | right handed | college/bachelor degree | no | village | block of flats |
| 1 | never smoked | drink a lot | i am often early | sometimes | few hours a day | female | right handed | college/bachelor degree | no | city | block of flats |
| 2 | tried smoking | drink a lot | i am often running late | sometimes | few hours a day | female | right handed | secondary school | no | city | block of flats |
| 3 | former smoker | drink a lot | i am often early | only to avoid hurting someone | most of the day | female | right handed | college/bachelor degree | yes | city | house/bungalow |
| 4 | tried smoking | social drinker | i am always on time | everytime it suits me | few hours a day | female | right handed | secondary school | no | village | house/bungalow |


Some of the categories are binary and some are on a scale. We will adjust each of them.

In [8]:
#Find the binary categories
binaryCol = []
for c in cateDf.columns:
    #If there are only two codes associated with 
    #the category add it to the list of binaries
    if cateDf[c].cat.codes.max() == 1:
        binaryCol.append(c)

binaryCol

['Gender',
 'Left - right handed',
 'Only child',
 'Village - town',
 'House - block of flats']

In [9]:
#Turn Binary columns into codes
for c in binaryCol:
    resp[c] = resp[c].cat.codes 

In [10]:
nonBinCol = []
for c in cateDf.columns:
    #If there are more than two codes associated with 
    #the category add it to the list of non binaries
    if cateDf[c].cat.codes.max() != 1:
        nonBinCol.append(c)
nonBinCol

['Smoking', 'Alcohol', 'Punctuality', 'Lying', 'Internet usage', 'Education']

For each of these we will generate a list of answers and sort them by hand

In [11]:
print("Smoking: {}".format(list(cateDf.Smoking.cat.categories)))
print("Alcohol: {}".format(list(cateDf.Alcohol.cat.categories)))
print("Punctuality: {}".format(list(cateDf.Punctuality.cat.categories)))
print("Lying: {}".format(list(cateDf.Lying.cat.categories)))
print("Internet Usage: {}".format(list(cateDf['Internet usage'].cat.categories)))
print("Education: {}".format(list(cateDf.Education.cat.categories)))


Smoking: ['current smoker', 'former smoker', 'never smoked', 'tried smoking']
Alcohol: ['drink a lot', 'never', 'social drinker']
Punctuality: ['i am always on time', 'i am often early', 'i am often running late']
Lying: ['everytime it suits me', 'never', 'only to avoid hurting someone', 'sometimes']
Internet Usage: ['few hours a day', 'less than an hour a day', 'most of the day', 'no time at all']
Education: ['college/bachelor degree', 'currently a primary school pupil', 'doctorate degree', 'masters degree', 'primary school', 'secondary school']


Luckily, each categorization follows a natural ordering. We can use this to turn the categories into numerics.

In [12]:
smk = ['never smoked','tried smoking','former smoker','current smoker']
alco = ['never','social drinker','drink a lot']
punc = ['i am often early','i am always on time','i am often running late']
ly = ['never','only to avoid hurting someone','sometimes','everytime it suits me']
itnet = ['no time at all','less than an hour a day','few hours a day','most of the day']
edu = ['currently a primary school pupil','primary school','secondary school',
       'college/bachelor degree', 'masters degree','doctorate degree']

#Form new ordered categories

resp.Smoking = pd.Categorical(resp.Smoking, ordered=True, categories=smk)

resp.Alcohol = pd.Categorical(resp.Alcohol, ordered=True, categories=alco)

resp.Punctuality = pd.Categorical(resp.Punctuality, ordered=True, categories=punc)

resp.Lying = pd.Categorical(resp.Lying, ordered=True, categories=ly)

resp.Education = pd.Categorical(resp.Education, ordered=True, categories=edu)

resp['Internet usage'] = pd.Categorical(resp['Internet usage'], ordered=True, categories=itnet)

#Turn the ordered categories into codes
for c in nonBinCol:
    resp[c] = resp[c].cat.codes
    
resp.dtypes.iloc[-10:]

Age                       float64
Height                    float64
Weight                    float64
Number of siblings        float64
Gender                       int8
Left - right handed          int8
Education                    int8
Only child                   int8
Village - town               int8
House - block of flats       int8
dtype: object
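
Categorical codes follow the order of the `categories` list, and any answer that is missing from that list is coded as -1, so each category has to be spelled exactly as it appears in the data. The sketch below illustrates this on a few hypothetical answers; the `answers` series is made up and includes one deliberate misspelling.

```python
import pandas as pd

# Hypothetical answers; "drink alot" is a deliberate misspelling
answers = pd.Series(["never", "drink a lot", "social drinker", "drink alot"])
order = ["never", "social drinker", "drink a lot"]

# Codes are positions in `order`; values not in `order` become -1
codes = pd.Categorical(answers, categories=order, ordered=True).codes.tolist()
print(codes)
```

Comparing the hand-written lists against `cat.categories` before recoding is a cheap way to catch such mismatches.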

Cast all the category codes into floats and change -1 (the category code for NA) to the mode of the column.

In [13]:
for c in cateCol: 
    #Cast as float
    resp[c] = resp[c].astype(float) 
    #Replace -1 with the most common real answer.
    #mode() returns a Series, so take its first entry
    col_mode = resp.loc[resp[c] != -1, c].mode().iloc[0]
    resp[c] = resp[c].replace(-1, col_mode)
    
#Somewhat redundant safety net: fill any remaining
#NaNs with the first mode of each column
resp = resp.fillna(resp.mode().iloc[0])

In [14]:
#How many of the 150 columns are floats?
print("{} of 150 columns are floats".format((resp.dtypes=='float64').sum()))
#How many NaNs are there?
print("There are {} NaNs".format(1010*150 -resp.notnull().sum().sum()))

150 of 150 columns are floats
There are 0 NaNs


### Statistical Analysis
---
Now that the data is cleaned we can run some analysis on it. Let's look at correlation in the data.

In [15]:
#Extract the upper triangle of the correlation matrix
Cor = np.triu(resp.corr().values)

#Set the diagonal to zero
m,n = Cor.shape
np.fill_diagonal(Cor, 0)

#Sort the absolute correlation
c = abs(np.ravel(Cor))
csorted = np.argsort(c)[::-1]

#Create tuples of row and column indexes
CorSorted = [[s//n,s % n] for s in csorted]

#Retrieve the factor names and absolute correlations
mostCorr = [[resp.columns[c[0]],
             resp.columns[c[1]],
             abs(Cor[c[0],c[1]])] for c in CorSorted]

#Create new dataframe
mostCorr = pd.DataFrame(mostCorr,columns=["Factor1","Factor2","AbsCorrel"])

In [16]:
print('The ten most correlated factors are:')
mostCorr.iloc[:10,:]

The ten most correlated factors are:


| | Factor1 | Factor2 | AbsCorrel |
|---|---|---|---|
| 0 | Biology | Medicine | 0.702746 |
| 1 | Biology | Chemistry | 0.677562 |
| 2 | Height | Weight | 0.674775 |
| 3 | Fantasy/Fairy tales | Animated | 0.673745 |
| 4 | Height | Gender | 0.665535 |
| 5 | Shopping | Shopping centres | 0.650524 |
| 6 | Weight | Gender | 0.625511 |
| 7 | Chemistry | Medicine | 0.621256 |
| 8 | Age | Education | 0.609897 |
| 9 | Classical music | Opera | 0.593821 |
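
The same ranking of factor pairs can also be built with pandas alone. The following is a sketch on made-up data rather than the survey itself; the column names `a`, `b`, `c` and the extra signed `Correl` column are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data: b tracks a closely, c is anti-correlated with a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": -a + 1.0 * rng.normal(size=200),
})

corr = toy.corr()
# Keep only the strict upper triangle so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().reset_index()
pairs.columns = ["Factor1", "Factor2", "Correl"]

# Rank by absolute value but keep the sign for interpretation
pairs["AbsCorrel"] = pairs["Correl"].abs()
pairs = pairs.sort_values("AbsCorrel", ascending=False).reset_index(drop=True)
print(pairs)
```

Keeping the signed value next to the absolute one shows whether a strong relationship is positive or negative.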


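Interest in biology is the factor most strongly correlated with interest in medicine (about 0.70). This is consistent with the idea that people who study biology often want to become doctors, although correlation alone cannot prove it.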

## Machine learning
---

Let's build a scenario where we have some information about a person and want to predict their preferences. We will begin by separating the factors into those we might already have and the preference factors we want to predict.

In [17]:
resp.columns.values

array(['Music', 'Slow songs or fast songs', 'Dance', 'Folk', 'Country',
       'Classical music', 'Musical', 'Pop', 'Rock', 'Metal or Hardrock',
       'Punk', 'Hiphop, Rap', 'Reggae, Ska', 'Swing, Jazz', 'Rock n roll',
       'Alternative', 'Latino', 'Techno, Trance', 'Opera', 'Movies',
       'Horror', 'Thriller', 'Comedy', 'Romantic', 'Sci-fi', 'War',
       'Fantasy/Fairy tales', 'Animated', 'Documentary', 'Western',
       'Action', 'History', 'Psychology', 'Politics', 'Mathematics',
       'Physics', 'Internet', 'PC', 'Economy Management', 'Biology',
       'Chemistry', 'Reading', 'Geography', 'Foreign languages',
       'Medicine', 'Law', 'Cars', 'Art exhibitions', 'Religion',
       'Countryside, outdoors', 'Dancing', 'Musical instruments',
       'Writing', 'Passive sport', 'Active sport', 'Gardening',
       'Celebrities', 'Shopping', 'Science and technology', 'Theatre',
       'Fun with friends', 'Adrenaline sports', 'Pets', 'Flying', 'Storm',
       'Darkness', 'Heights', '

Though there are a lot of topics to choose from, it is unlikely that a business would have clear information about an individual's emotions or ethics (unless the business is Facebook).

Since it is unlikely that a business would have this data, let's assume that we are an entertainment company and have some data about a user's movie, music and subject preferences as well as their age and gender. Based on these preferences we will attempt to predict their spending. 

We begin by selecting appropriate columns by hand.

In [18]:
musicCol = [
    'Music', 'Slow songs or fast songs', 
    'Dance', 'Folk', 'Country',
    'Classical music', 'Musical', 
    'Pop', 'Rock', 'Metal or Hardrock', 
    'Punk','Hiphop, Rap', 'Reggae, Ska',
    'Swing, Jazz', 'Rock n roll',
    'Alternative', 'Latino', 
    'Techno, Trance', 'Opera' ]

movieCol = [
    'Movies','Horror', 'Thriller', 
    'Comedy', 'Romantic', 'Sci-fi',
    'War', 'Fantasy/Fairy tales',
    'Animated' ,'Documentary',
    'Western', 'Action']

subjCol = [
    'History', 'Psychology', 'Politics',
    'Mathematics','Physics', 'Internet', 
    'PC', 'Economy Management', 
    'Biology', 'Chemistry','Reading', 
    'Geography', 'Foreign languages', 
    'Medicine', 'Law']

pastimesCol = [
    'Interests or hobbies','Cars', 
    'Art exhibitions', 'Religion',
    'Countryside, outdoors', 'Dancing',
    'Musical instruments', 'Writing',
    'Passive sport', 'Active sport',
    'Gardening', 'Celebrities', 
    'Shopping', 'Science and technology',
    'Theatre','Fun with friends',
    'Adrenaline sports', 'Pets', 
    'Flying', 'Small - big dogs']

financeCol = [
    'Finances', 'Shopping centres',
    'Branded clothing',
    'Entertainment spending',
    'Spending on looks',
    'Spending on gadgets',
    'Spending on healthy eating']


In [19]:
#Create usable subset of the data
train_columns = subjCol+musicCol+movieCol+["Gender","Age"]
X = resp[train_columns]
y = resp['Spending on looks']

We will begin our exploration of machine learning models by writing a function to test each model. Let's see how well several key methods predict a person's willingness to spend on looks.

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
def testModelAcc(model,X,y,numTests=20):
    accuracy = []
    #Repeated accuracy tests
    for i in range(numTests):
        splits = train_test_split(X,y,test_size =.3)
        Xtrain,Xtest,ytrain,ytest = splits
        fitted = model.fit(Xtrain,ytrain)
        ypre = fitted.predict(Xtest)
        acc = sum(ypre==ytest)/float(len(ytest))
        accuracy.append(acc)
    
    #Return method name and average accuracy
    accuracy = np.mean(accuracy)
    name = str(model)[:str(model).find('(')]
    
    return name,accuracy
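
For classifiers, scikit-learn offers a built-in route to the same repeated random-split estimate. The sketch below is an alternative, not the function used in this report; it runs on synthetic data from `make_classification`, and `X_toy` and `y_toy` are stand-ins for the survey features and target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the survey features and a binary target
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# 20 random 70/30 splits, mirroring numTests=20 and test_size=.3
cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
scores = cross_val_score(GaussianNB(), X_toy, y_toy, cv=cv, scoring="accuracy")

print("mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))
```

Reporting the standard deviation alongside the mean helps judge whether small differences between models are meaningful.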

### Naive Bayes

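Naive Bayes applies Bayes' theorem under the assumption that the features are independent of each other given the class. The Gaussian variant used here fits a normal distribution to each feature within each class. Let's see how it performs.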
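
As a reminder, the standard Gaussian naive Bayes decision rule can be written as follows. The symbols are notation introduced here, not taken from the notebook: $d$ is the number of features, $k$ ranges over the classes, and $\mu_{ik}$, $\sigma_{ik}^2$ are the mean and variance of feature $i$ within class $k$.

$$
\hat{y} = \arg\max_{k} \; P(y = k) \prod_{i=1}^{d} \mathcal{N}\!\left(x_i \mid \mu_{ik}, \sigma_{ik}^2\right)
$$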

In [22]:
from sklearn.naive_bayes import GaussianNB

In [23]:
name,acc = testModelAcc(GaussianNB(),X,y)
print("{} accuracy: {}".format(name,acc))

GaussianNB accuracy: 0.287128712871


Not great accuracy. It is only somewhat better than a random classifier, which would give 20% accuracy across five equally likely classes. To improve the predictive power, let's turn the spending vector into two groups of people instead of five. We will label someone a spender if they ranked spending on looks as a 4 or a 5 and a non spender otherwise.

In [24]:
#Find where spending on looks is high
spenderMask = y >= 4
ybin = y.copy()
#Rate these people as 1 and all others as 0
ybin[spenderMask] = 1
ybin[~spenderMask] = 0

First we will examine the distribution of spenders and non spenders:

In [25]:
spenders = sum(ybin)/float(len(ybin))
print("Spenders: {}".format(spenders))
print("Non Spenders: {}".format(1-spenders))

Spenders: 0.387128712871
Non Spenders: 0.612871287129


We will want our models to beat the 61% accuracy obtained by always predicting the majority class (non spender).
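
Two simple baselines follow directly from the class shares printed above. This is a small worked calculation; the rounded value `0.387` is copied from that output, and the variable names are illustrative.

```python
# Share of spenders reported above (rounded)
p_spender = 0.387

# Always predicting the larger class
majority_baseline = max(p_spender, 1 - p_spender)

# Guessing at random in proportion to the class frequencies
proportional_baseline = p_spender**2 + (1 - p_spender)**2

print("majority baseline: {:.3f}".format(majority_baseline))
print("proportional-guess baseline: {:.3f}".format(proportional_baseline))
```

The majority baseline is the stricter of the two, so it is the more informative number to compare a model against.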

In [26]:
name,acc = testModelAcc(GaussianNB(),X,ybin)
print("{} accuracy: {}".format(name,acc))

GaussianNB accuracy: 0.636798679868


This is slightly better than just guessing from the class distribution. We'll try to find something better.

### Decision Trees
Decision trees try to split the data on key feature values and make predictions based on those values.

In [27]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier

In [28]:
#Five part classification
name,acc = testModelAcc(DecisionTreeRegressor(),X,y)
print("{} accuracy: {}".format(name,acc))
name,acc = testModelAcc(DecisionTreeClassifier(),X,y)
print("{} accuracy: {}".format(name,acc))

DecisionTreeRegressor accuracy: 0.245709570957
DecisionTreeClassifier accuracy: 0.238778877888


In [29]:
#Binary Classification
name, acc = testModelAcc(DecisionTreeRegressor(),X,ybin)
print("{} accuracy: {}".format(name,acc))
name,acc = testModelAcc(DecisionTreeClassifier(),X,ybin)
print("{} accuracy: {}".format(name,acc))

DecisionTreeRegressor accuracy: 0.570627062706
DecisionTreeClassifier accuracy: 0.562376237624


Decision trees performed rather poorly. This could be because there is not a high amount of correlation between the given features and the target. It may be difficult to determine a meaningful split.

### Random forest
Random forest generates many decision trees and ensembles their predictions to improve accuracy.

In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

In [31]:
#Five part classification
name,acc = testModelAcc(RandomForestRegressor(),X,y)
print("{} accuracy: {}".format(name,acc))
name,acc = testModelAcc(RandomForestClassifier(),X,y)
print("{} accuracy: {}".format(name,acc))

RandomForestRegressor accuracy: 0.0354785478548
RandomForestClassifier accuracy: 0.260396039604


In [32]:
#Binary classification
name,acc = testModelAcc(RandomForestRegressor(),X,ybin)
print("{} accuracy: {}".format(name,acc))
name,acc = testModelAcc(RandomForestClassifier(),X,ybin)
print("{} accuracy: {}".format(name,acc))

RandomForestRegressor accuracy: 0.0270627062706
RandomForestClassifier accuracy: 0.613696369637


Random forest is not much of an improvement on decision trees in the five class case. In the binary case, the classifier does noticeably better (about 61% against 56%). The regressor does exceptionally poorly, which is to be expected under our exact-match accuracy measure: a forest regressor averages the outputs of its trees, so its predictions are rarely whole numbers and almost never equal the integer labels. A regressor is suited to continuous targets, and our five point scale is more categorical than continuous.
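
The effect of averaging on an exact-match score can be seen on a handful of numbers. The tree votes below are hypothetical and chosen only to illustrate the arithmetic; rounding is shown as one possible remedy, not something the report applies.

```python
import numpy as np

# Hypothetical votes from five trees for three test rows (1-5 scale)
tree_votes = np.array([
    [3, 4, 3, 2, 4],
    [5, 5, 4, 5, 5],
    [1, 2, 2, 1, 3],
])
y_true = np.array([3, 5, 2])

# A forest regressor averages its trees, so predictions are rarely whole numbers
averaged = tree_votes.mean(axis=1)
exact_match = float(np.mean(averaged == y_true))

# Rounding back to the scale makes the exact-match comparison meaningful
rounded = np.rint(averaged).astype(int)
rounded_match = float(np.mean(rounded == y_true))

print(averaged, exact_match, rounded_match)
```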

### SVM
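Support vector machines attempt to divide the data into classes with a maximum-margin boundary in the multi-dimensional feature space. `svm.SVC` uses an RBF kernel by default, so the boundary does not have to be linear.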

In [33]:
from sklearn import svm

In [34]:
#Five class classification
name,acc = testModelAcc(svm.SVC(),X,y)
print("{} accuracy: {}".format(name,acc))

SVC accuracy: 0.289603960396


In [35]:
#Binary classification
name,acc = testModelAcc(svm.SVC(),X,ybin)
print("{} accuracy: {}".format(name,acc))

SVC accuracy: 0.642574257426


SVM performs the best so far, narrowly ahead of Naive Bayes. SVM has potential for improvement by tuning its parameters. We will approach this later.

### Logistic Regression
Logistic regression models the probability of each class as a logistic function of a linear combination of the features and predicts the most probable class.

In [36]:
from sklearn.linear_model import LogisticRegression

In [37]:
name,acc = testModelAcc(LogisticRegression(),X,y)
print("{} accuracy: {}".format(name,acc))

LogisticRegression accuracy: 0.268481848185


In [38]:
name,acc = testModelAcc(LogisticRegression(),X,ybin)
print("{} accuracy: {}".format(name,acc))

LogisticRegression accuracy: 0.633828382838


### What can we predict?
The accuracy for predicting spending on looks was less than desirable. Let's look at ALL of the spending columns and determine which ones we can best predict.

In [39]:
#Break up the data 
train_columns = subjCol + musicCol+ movieCol+ ["Gender","Age"]
X = resp[train_columns]

#Make a list of all spending columns
spendingCol = ["Branded clothing",
               "Entertainment spending",
               "Spending on looks",
               "Spending on gadgets",
               "Spending on healthy eating"]

#Make a list of methods
methods = [GaussianNB,
           DecisionTreeClassifier,
           DecisionTreeRegressor,
           RandomForestClassifier,
           RandomForestRegressor,
           svm.SVC,
           LogisticRegression]

#Make dictionaries to store the accuracy data
fiveClass = dict()
binClass = dict()
for model in methods:
    name = str(model())[:str(model()).find('(')]
    fiveClass[name] = []
    binClass[name] = []
    

Next, we will test each method's predictions about each spending column.

It is important to note that before storing the model accuracy, we will subtract .2 (the random classifier accuracy). This way, we *only* see each model's improvement on a random classifier.

### Five Spending Class Prediction Accuracy
---

In [40]:
#Cycle through all methods and all prediction columns
for col in spendingCol:
    y = resp[col]
    for m in methods:
        name,acc = testModelAcc(m(),X,y)
        fiveClass[name].append(acc - .2)
        
#Put the results in a data frame
fiveClass = pd.DataFrame(fiveClass,index=spendingCol)
fiveClass

| | DecisionTreeClassifier | DecisionTreeRegressor | GaussianNB | LogisticRegression | RandomForestClassifier | RandomForestRegressor | SVC |
|---|---|---|---|---|---|---|---|
| Branded clothing | 0.030858 | 0.029538 | 0.065512 | 0.0967 | 0.05231 | -0.169307 | 0.102805 |
| Entertainment spending | 0.036964 | 0.04703 | 0.078218 | 0.081353 | 0.075743 | -0.170627 | 0.109076 |
| Spending on looks | 0.034818 | 0.037624 | 0.084983 | 0.068977 | 0.05297 | -0.170132 | 0.094389 |
| Spending on gadgets | 0.049505 | 0.0533 | 0.09868 | 0.077228 | 0.081188 | -0.171452 | 0.099505 |
| Spending on healthy eating | 0.067822 | 0.049175 | 0.10363 | 0.090924 | 0.085974 | -0.161221 | 0.121782 |


Finally, we will test the model accuracy in binary classification. In this case, we will subtract .5 from the accuracy before storing it, because this is the accuracy of a random classifier on two equally likely classes.

### Spender/NonSpender Classification Accuracy
---

In [41]:
for col in spendingCol:
    
    y = resp[col]
    spenderMask = y >= 4
    ybin = y.copy()
    #Rate these people as 1 and all others as 0
    ybin[spenderMask] = 1
    ybin[~spenderMask] = 0
    
    for m in methods:
        name,acc = testModelAcc(m(),X,ybin)
        binClass[name].append(acc - .5)
        
        
#Put the results in a data frame
binClass = pd.DataFrame(binClass,index=spendingCol)
binClass

| | DecisionTreeClassifier | DecisionTreeRegressor | GaussianNB | LogisticRegression | RandomForestClassifier | RandomForestRegressor | SVC |
|---|---|---|---|---|---|---|---|
| Branded clothing | 0.065182 | 0.068647 | 0.109901 | 0.129373 | 0.111716 | -0.479703 | 0.14868 |
| Entertainment spending | 0.05132 | 0.04802 | 0.113531 | 0.105446 | 0.088614 | -0.482178 | 0.09505 |
| Spending on looks | 0.070957 | 0.05132 | 0.125248 | 0.133828 | 0.110396 | -0.471617 | 0.136469 |
| Spending on gadgets | 0.109736 | 0.098845 | 0.171287 | 0.190924 | 0.177888 | -0.437129 | 0.19901 |
| Spending on healthy eating | 0.023102 | 0.031518 | 0.070957 | 0.059736 | 0.030033 | -0.491749 | 0.064686 |


### Parameter Tuning
Based on the dataframes above, SVM performs the best on spending on healthy eating (five class) and spending on gadgets (binary). We will optimize our SVM parameters for these two targets with a grid search.

In [42]:
from sklearn.model_selection import GridSearchCV
#Five class target with the best SVC score in the table above
y = resp['Spending on healthy eating']

#Make parameter dictionary
Cvals= np.logspace(-2, 10, 13)
gammaVals = np.logspace(-9, 3, 13)
params = dict(gamma=gammaVals, C=Cvals)

#Run the GridSearch
HealthyGrid = GridSearchCV(svm.SVC(kernel='rbf'), param_grid=params)
HealthyGrid.fit(X, y)
print("Best Healthy Eating Spending SVM Accuracy: {}".format(HealthyGrid.best_score_))

Best Healthy Eating Spending SVM Accuracy: 0.341584158416


In [43]:
from sklearn.model_selection import GridSearchCV

#Make a binary target vector
y = resp['Spending on gadgets']
spenderMask = y >= 4
ybin = y.copy()
#Rate these people as 1 and all others as 0
ybin[spenderMask] = 1
ybin[~spenderMask] = 0

#Make parameter dictionary
Cvals= np.logspace(-2, 10, 13)
gammaVals = np.logspace(-9, 3, 13)
params = dict(gamma=gammaVals, C=Cvals)

#Run the GridSearch
GadgetGrid = GridSearchCV(svm.SVC(kernel='rbf'), param_grid=params)
GadgetGrid.fit(X, ybin)
print("Best gadgets Spending SVM Accuracy: {}".format(GadgetGrid.best_score_))

Best gadgets Spending SVM Accuracy: 0.7


In [44]:
GadgetGrid.best_params_

{'C': 1000.0, 'gamma': 1.0000000000000001e-05}

Using these parameters, we create our best model below:

In [47]:
bestSVM = svm.SVC(kernel='rbf',C=1000,gamma=1e-5)
splits = train_test_split(X,ybin,test_size =.3)
Xtrain,Xtest,ytrain,ytest = splits

ypre = bestSVM.fit(Xtrain,ytrain).predict(Xtest)
acc = sum(ypre==ytest)/float(len(ytest))
print(acc)

0.732673267327
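
Because the grid search above was fitted on all of `X`, a score on a later split of the same data may be somewhat optimistic. One common remedy, sketched below on synthetic data, is to hold out the test rows before tuning. The names `X_toy`, `y_toy` and the deliberately small grid are assumptions made for illustration, not the settings used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for X and ybin
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# Split first so the test rows play no part in choosing C and gamma
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)

# Deliberately small grid to keep the sketch fast
params = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid=params, cv=3)
grid.fit(X_tr, y_tr)

# GridSearchCV refits the best model on all of X_tr by default
test_acc = grid.score(X_te, y_te)
print(grid.best_params_, round(test_acc, 3))
```

The held-out score is then an estimate of how the tuned model would behave on users it has never seen.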


### Conclusion
---
After cleaning the data and trying several machine learning techniques, we were able to predict gadget spending preferences for young adults with about 70% accuracy. The best predictor can be seen above. Support vector machines may have been the most accurate because the integer-valued data made it easier to find boundaries between the classes.

This prediction algorithm could be used for a site like YouTube in predicting which ads are most likely to generate revenue.