Now it's time for another guided example. This time we're going to look at recipes. Specifically, we'll use the Epicurious dataset, which contains a collection of recipes, their key terms and ingredients, and their ratings.

What we want to see is whether we can use the ingredient and keyword list to predict the rating. For someone writing a cookbook, this could be really useful: it could help them choose recipes that are more likely to be enjoyed, which in turn makes the book more likely to succeed.

First let's load the dataset. It's [available on Kaggle](https://www.kaggle.com/hugodarwood/epirecipes). We'll use the csv file here and pull out column names and some summary statistics for ratings.

Firstly, the overfit is a problem, even though the model performed poorly in the first place. We could go back and clean up our feature set. There might be some gains to be made by getting rid of the noise.

We could also see how removing the nulls but including dietary information performs. Though it's a slight change to the question, we could still possibly get some improvements there.

Lastly, we could take our regression problem and turn it into a classifier. With this number of features and a discontinuous outcome, we might have better luck thinking of this as a classification problem. We could make it simpler still: instead of classifying on each possible value, group the ratings into high and low classes around some chosen cutoff.
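
As a rough illustration of that grouping idea, here is a minimal sketch on a handful of made-up ratings. The values and the cutoff of 4.0 are assumptions for the example only and do not come from the dataset.

```python
import pandas as pd

# Toy ratings for illustration only; not taken from epi_r.csv.
ratings = pd.Series([2.5, 4.375, 3.75, 3.125, 4.375, 5.0])

# Hypothetical cutoff: ratings at or above it count as "high" (1),
# everything else as "low" (0).
cutoff = 4.0
is_high = (ratings >= cutoff).astype(int)

print(is_high.tolist())
```

Where the cutoff sits changes the class balance, so it is worth checking how many recipes land on each side before fitting anything.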

__And that is your challenge.__

__Transform this regression problem into a binary classifier and clean up the feature set.__ You can choose whether or not to include nutritional information, but try to cut your feature set down to the 30 most valuable features.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

sns.set_style('white')

raw_data = pd.read_csv('epi_r.csv')

# Check missing data.
null_count = raw_data.isnull().sum()
null_count[null_count>0]

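# Drop rows with missing nutritional values, then drop the title column.
# Reassigning instead of using inplace=True avoids the SettingWithCopyWarning.
df = raw_data.dropna(subset=['calories', 'protein', 'fat', 'sodium'])
df = df.drop(columns=['title'])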

df.head(15)


Unnamed: 0,rating,calories,protein,fat,sodium,#cakeweek,#wasteless,22-minute meals,3-ingredient recipes,30 days of groceries,...,yellow squash,yogurt,yonkers,yuca,zucchini,cookbooks,leftovers,snack,snack week,turkey
0,2.5,426.0,30.0,7.0,559.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,4.375,403.0,18.0,23.0,1439.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.75,165.0,6.0,7.0,165.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.125,547.0,20.0,32.0,452.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.375,948.0,19.0,79.0,1042.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,4.375,170.0,7.0,10.0,1272.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,3.75,602.0,23.0,41.0,1696.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,3.75,256.0,4.0,5.0,30.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,4.375,766.0,12.0,48.0,439.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,4.375,174.0,11.0,12.0,176.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


It looks like a lot of the features are binary, with many 0s and very few 1s (low variance). We can get rid of the sparsest ones.
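
To see why a mostly-zero binary column has low variance, here is a small sketch on a toy tag matrix. The column names and values are invented for illustration.

```python
import pandas as pd

# Toy binary tag matrix; the column names are made up for illustration.
tags = pd.DataFrame({
    'rare_tag':   [0, 0, 0, 0, 0, 0, 0, 1],
    'common_tag': [1, 0, 1, 1, 0, 1, 0, 1],
})

# For a 0/1 column the mean is the share of 1s, p, and the
# population variance is p * (1 - p), so rare tags have low variance.
share_of_ones = tags.mean()
variance = tags.var(ddof=0)

print(share_of_ones)
print(variance)
```

Counting 1s, as the next cell does, is therefore a reasonable stand-in for a variance filter on binary features.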

In [3]:
# Drop the rating and the four nutritional columns, leaving only the binary tags.
df_feature_1s = df.drop(df.columns[0:5], axis=1)

# Keep only the features that are nonzero at least once.
df_feature_1s = df_feature_1s.loc[:, (df_feature_1s != 0).any(axis=0)]

# Count the 1s in each remaining feature.
feature_1s = pd.DataFrame()
feature_1s['1s'] = (df_feature_1s == 1).sum().values

# Keep the features whose count of 1s is above the 75th percentile.
# Compute the threshold once rather than on every loop iteration.
threshold = feature_1s['1s'].describe()['75%']
f_to_keep = [i for i in df_feature_1s.columns
             if (df_feature_1s[i] == 1).sum() > threshold]

df_processing = df_feature_1s[f_to_keep]

df_processing.apply(pd.Series.value_counts)

Unnamed: 0,alcoholic,almond,appetizer,apple,apricot,backyard bbq,bacon,bake,basil,bean,...,vegetable,vegetarian,vinegar,walnut,wheat/gluten-free,white wine,winter,yogurt,zucchini,turkey
0.0,15274,15387,14817,15321,15642,15182,15360,12139,15415,15437,...,14189,10335,15308,15504,11963,15438,13283,15507,15639,15527
1.0,590,477,1047,543,222,682,504,3725,449,427,...,1675,5529,556,360,3901,426,2581,357,225,337


These are pretty good, but there are still too many of them. We can use PCA to reduce them.

And since we can't have more than 30 features, let's keep 25 of the PCA components, which leaves room for the nutritional columns.
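
One way to judge how much information a fixed number of components retains is the cumulative explained variance ratio. The sketch below uses a synthetic random binary matrix as a stand-in for the real features, so the printed number says nothing about the recipe data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the binary feature matrix (200 rows, 40 columns).
rng = np.random.default_rng(0)
toy = (rng.random((200, 40)) < 0.2).astype(float)

# Standardize, then fit PCA with all components kept.
scaled = StandardScaler().fit_transform(toy)
pca = PCA().fit(scaled)

# Cumulative share of variance captured by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print('Variance kept by 25 components:', cumulative[24])
```

On the real data, the same cumulative sum over `pca.explained_variance_ratio_` would show what share of the variance the first 25 components keep.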

In [4]:
# Perform PCA
df_processed = preprocessing.StandardScaler().fit_transform(df_processing)
pca = PCA(n_components=df_processing.shape[1])
components = pca.fit_transform(df_processed)
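# Give the components string names so they can be joined with the
# string-named nutritional columns (recent scikit-learn versions reject
# DataFrames whose column names mix strings and integers).
components = pd.DataFrame(
    components, columns=['pc_{}'.format(i) for i in range(components.shape[1])]
)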

pca.explained_variance_ratio_

# Use the nutritional information plus the first 25 PCA components
X = df.iloc[:, 1:5]
X = X.reset_index(drop=True)
X_1 = components.iloc[:, 0:25]
X = X.join(X_1)


Check the correlations between features one last time. Then perform SVC.
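
A correlation check like this can be sketched as follows. The data is synthetic and the 0.8 threshold is an assumption chosen for the example, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic features: 'b' is built to track 'a' closely, 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=500),
    'c': rng.normal(size=500),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Hypothetical threshold of 0.8; list the pairs that exceed it.
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.8]
print(high_pairs)
```

Each pair that shows up is a candidate for dropping one of its two columns.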

Now we need to convert the ratings into binary outcomes. I decided to use the 50th percentile (the median rating) as the boundary: ratings below it are labeled low (0) and the rest high (1).

In [5]:
# Check the correlation
cor = X.corr()

# Drop the highly correlated columns
X = X.drop(columns=['fat', 'sodium'])

# Prepare SVC.
# Make the ratings into binary outcome. Use the 50th percentile (median) as
# the boundary for high and low ratings. A 1-D array avoids the warning that
# scikit-learn raises when y is passed as a single-column DataFrame.
Y = np.where(df['rating'] < df['rating'].median(), 0, 1)

svc = SVC()
svc.fit(X, Y)
print('\nSVC score: ', svc.score(X, Y))
print('\nCross validation score: ', cross_val_score(svc, X, Y, cv=5))

SVC score:  0.9779374684820978

Cross validation score:  [0.62287335 0.60857233 0.60983297 0.61569987 0.61380832]
