# CODE WALKTHROUGH

(written by Max Candocia)

The main code is found in `linear_modprog_solution.py`, but the general usage of the functions will be shown here.  See the Python file for more details on the inner workings.

In [1]:
import linear_modprog_solution as lms
import numpy
import sqlite3
import sklearn
from sklearn import linear_model
from sklearn.linear_model import LassoLarsIC
import re
import os
import math
import copy

In [2]:
data = lms.get_model_data('hummus')

29 different ingredients
100 different recipes


`data` is a list of 4 `numpy` arrays.

`data[0]` is the matrix of predictors, where each row is a recipe and each column holds the proportion of one ingredient in that recipe.

In [3]:
print(data[0])

[[  7.27161998e-01   0.00000000e+00   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  1.97348208e-01   0.00000000e+00   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  7.16069575e-02   0.00000000e+00   8.11545519e-01 ...,   5.72855660e-03
    0.00000000e+00   3.00749222e-03]
 ..., 
 [  6.86771403e-01   0.00000000e+00   0.00000000e+00 ...,   3.52371166e-04
    0.00000000e+00   0.00000000e+00]
 [  3.84252252e-01   0.00000000e+00   0.00000000e+00 ...,   5.71400054e-04
    0.00000000e+00   0.00000000e+00]
 [  8.45475805e-02   0.00000000e+00   0.00000000e+00 ...,   7.19553877e-04
    0.00000000e+00   0.00000000e+00]]


`data[1]` contains the response variable information, but a function must first be applied to it to determine the value being regressed on.  In this case, I chose `y = rating*log(1+number_of_ratings)`.

In [4]:
print(data[1])
print('APPLYING RESPONSE FUNCTION')
data = lms.apply_response_function(data)
print(data[1])

[(0, 47, 4.319149), (1, 18, 3.388889), (2, 584, 4.657534), (3, 16, 4.4375), (4, 169, 4.16568), (5, 12, 4.833333), (6, 74, 4.094594), (7, 4, 5.0), (8, 25, 4.2), (9, 149, 4.597315), (10, 7, 3.714286), (11, 94, 4.234043), (12, 17, 4.294117), (13, 7, 4.571429), (14, 10, 4.5), (15, 13, 4.461538), (16, 23, 4.173913), (17, 18, 4.333333), (18, 106, 4.566038), (19, 18, 4.555555), (20, 4, 2.75), (21, 4, 3.75), (22, 625, 4.6704), (23, 7, 4.285714), (24, 2, 4.0), (25, 70, 4.3), (26, 3, 3.666667), (27, 45, 4.622222), (28, 26, 4.846154), (29, 13, 4.153846), (30, 6, 4.0), (31, 3, 5.0), (32, 43, 4.023256), (33, 25, 3.6), (34, 3, 4.666667), (35, 98, 4.612245), (36, 2, 4.5), (37, 29, 4.448276), (38, 112, 4.276786), (39, 1, 5.0), (40, 6, 4.666667), (41, 18, 4.055555), (42, 26, 4.461538), (43, 21, 2.809524), (44, 231, 4.554112), (45, 38, 4.131579), (46, 10, 4.9), (47, 1, 5.0), (48, 2, 4.5), (49, 72, 3.986111), (50, 27, 4.037037), (51, 2, 5.0), (52, 5, 4.6), (53, 9, 4.555555), (54, 872, 4.552752), (55, 241
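
As a rough illustration of the response function described above, here is a minimal sketch. It assumes each tuple is laid out as `(recipe index, number_of_ratings, rating)` and that `log` is the natural logarithm; the helper name `response` is hypothetical and is not taken from `linear_modprog_solution.py`.

```python
import math

def response(rating, number_of_ratings):
    # Hypothetical helper: y = rating * log(1 + number_of_ratings).
    # Assumes the natural logarithm; the original module may differ.
    return rating * math.log(1 + number_of_ratings)

# A few rows copied from the printed output above,
# assumed to be (recipe index, number_of_ratings, rating).
rows = [(0, 47, 4.319149), (1, 18, 3.388889), (39, 1, 5.0)]

y = [response(rating, count) for (_, count, rating) in rows]
print(y)
```

Under this weighting, a recipe with a single 5.0 rating scores lower than a recipe rated about 4.3 by 47 people, which is the point of the `log(1+number_of_ratings)` factor.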

`data[2]` contains the labels for the recipes, with the names extracted from the URLs.

In [5]:
print(data[2])

[u'Sunshine_Hummus_Melts' u'Tofu_Hummus' u'Black_Bean_Hummus'
 u'Savory_Pumpkin_Hummus' u'Curried_Hummus' u'Artichoke_Hummus_2'
 u'Raw_Hummus' u'Tao_Hummus' u'Hummus_Stuffed_Portobello_Caps'
 u'Hollys_Hummus' u'Avocado_Lime_Hummus' u'Fusion_Hummus'
 u'Cilantro_Jalapeno_Hummus' u'Chef_Johns_Green_Hummus'
 u'Easy_Black_Bean_Hummus'
 u'Pumpkin_Hummus_Caramelized_Onion_and_Fontina_Cheese_Pizzas'
 u'Five_Pepper_Hummus' u'Hummus_Chicken' u'Sun_Dried_Tomato_Hummus'
 u'Sweet_Potato_Hummus' u'Quick_and_Easy_Hummus' u'Black_Olive_Hummus'
 u'Spiced_Sweet_Roasted_Red_Pepper_Hummus' u'Hummus_II'
 u'Erins_Jalapeno_Hummus' u'Authentic_Middle_Eastern_Hummus_Chummus'
 u'Yummy_Cilantro_Jalapeno_Hummus' u'Mayo_Free_Tuna_Sandwich_Filling'
 u'Hummus_Layer_Dip' u'Italian_Hummus' u'Creamy_Zucchini_Hummus'
 u'Sesame_Seed_Oil_Hummus' u'Beetroot_Hummus' u'Cookie_Dough_Hummus'
 u'Decadent_Hummus' u'Authentic_Kicked_Up_Syrian_Hummus'
 u'Spicy_Jalapeno_Hummus' u'Zucchini_Hummus'
 u'Spicy_Roasted_Red_Pepper_and_Fet

`data[3]` contains the labels for the ingredients. The first label is `OTHER`, which stands for any ingredient that did not appear at least 5 times among all recipes; this threshold can be adjusted.

In [6]:
print(data[3])

[u'OTHER' u'water' u'black beans, drained' u'chickpeas' u'hummus'
 u'cilantro leaves' u'fresh jalapeno peppers' u'sesame seeds' u'tahini'
 u'garlic' u'chopped onion' u'parsley' u'chopped tomato' u'lemon, juiced'
 u'lemon juice' u'olive oil' u'sesame oil' u'extra-virgin olive oil'
 u'tahini' u'basil leaves' u'crumbled feta cheese' u'garlic powder'
 u'dried oregano' u'paprika' u'ground black pepper' u'cayenne pepper'
 u'salt' u'sea salt' u'cumin']
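
The grouping of rare ingredients into `OTHER` could look roughly like the sketch below. The function `lump_rare_ingredients`, its `min_count` parameter, and the toy recipes are all hypothetical; the actual logic lives in `linear_modprog_solution.py` and may count occurrences differently.

```python
from collections import Counter

def lump_rare_ingredients(recipes, min_count=5):
    # Hypothetical helper. `recipes` is a list of dicts mapping
    # ingredient name -> proportion of the recipe.
    # Count the number of recipes in which each ingredient appears.
    counts = Counter(name for recipe in recipes for name in recipe)
    lumped = []
    for recipe in recipes:
        row = {}
        for name, proportion in recipe.items():
            # Ingredients seen in fewer than min_count recipes go to OTHER.
            key = name if counts[name] >= min_count else 'OTHER'
            row[key] = row.get(key, 0.0) + proportion
        lumped.append(row)
    return lumped

# Toy data (made up), with a low threshold so the effect is visible.
recipes = [
    {'chickpeas': 0.7, 'tahini': 0.2, 'beets': 0.1},
    {'chickpeas': 0.8, 'tahini': 0.15, 'pumpkin': 0.05},
    {'chickpeas': 0.6, 'tahini': 0.4},
]
lumped = lump_rare_ingredients(recipes, min_count=2)
print(lumped[0])
```

Because proportions are summed into `OTHER` rather than dropped, each recipe's proportions still add up to 1 after lumping.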


The model used is a LASSO model, with AIC used to select the regularization strength (cross-validation may prove better if more ingredients are used).

The advantage of LASSO is that it copes well with many predictors: the L1 penalty shrinks coefficients and sets uninformative ones exactly to zero, which reduces the risk of overfitting.

The intercept is disabled (`fit_intercept=False`), since the proportions in each recipe sum to 1, and this linear dependence makes an intercept redundant (and harder to interpret in the context of missing ingredients).
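
The redundancy of the intercept can be checked numerically. The sketch below uses synthetic proportions (random Dirichlet rows, not the hummus data) and shows that adding a column of ones to a matrix whose rows sum to 1 does not increase its rank.

```python
import numpy as np

# Synthetic stand-in for the recipe matrix: 20 "recipes", 5 "ingredients",
# each row a set of proportions that sums to 1.
rng = np.random.RandomState(0)
X = rng.dirichlet(np.ones(5), size=20)

# An intercept corresponds to an extra column of ones.
ones = np.ones((X.shape[0], 1))
X_with_intercept = np.hstack([ones, X])

# The ones column equals the sum of the ingredient columns,
# so it adds no new direction to the column space.
rank_without = np.linalg.matrix_rank(X)
rank_with = np.linalg.matrix_rank(X_with_intercept)
print(rank_without, rank_with)
```

Both ranks come out equal, so any intercept value can be absorbed into the ingredient coefficients by shifting them all by the same constant.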

In [7]:
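xvar = data[0]  # recipe-by-ingredient proportion matrix
yvar = data[1]  # response after apply_response_function
# LASSO fitted by LARS; AIC selects the penalty. No intercept: proportions sum to 1.
model = LassoLarsIC(criterion='aic', fit_intercept=False)
model.fit(xvar, yvar)
params = model.coef_
# Pair each ingredient label with its fitted coefficient
print(numpy.column_stack((data[3], params)))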

[[u'OTHER' u'13.2261116967']
 [u'water' u'8.36721792098']
 [u'black beans, drained' u'13.2092884666']
 [u'chickpeas' u'15.1561599979']
 [u'hummus' u'10.7527700215']
 [u'cilantro leaves' u'0.0']
 [u'fresh jalapeno peppers' u'0.0']
 [u'sesame seeds' u'0.0']
 [u'tahini' u'3.77318583749']
 [u'garlic' u'0.0']
 [u'chopped onion' u'0.0']
 [u'parsley' u'0.0']
 [u'chopped tomato' u'10.9173930088']
 [u'lemon, juiced' u'0.0']
 [u'lemon juice' u'35.5990860037']
 [u'olive oil' u'-7.68609534264']
 [u'sesame oil' u'0.0']
 [u'extra-virgin olive oil' u'0.0']
 [u'tahini' u'39.3377162456']
 [u'basil leaves' u'0.0']
 [u'crumbled feta cheese' u'0.0']
 [u'garlic powder' u'0.0']
 [u'dried oregano' u'0.0']
 [u'paprika' u'0.0']
 [u'ground black pepper' u'0.0']
 [u'cayenne pepper' u'0.0']
 [u'salt' u'0.0']
 [u'sea salt' u'0.0']
 [u'cumin' u'0.0']]
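
To see how AIC picks the penalty in a fit like the one above, the fitted `LassoLarsIC` object exposes the candidate penalties (`alphas_`), the criterion value at each one (`criterion_`), and the chosen penalty (`alpha_`). The sketch below uses synthetic data with made-up coefficients, not the hummus recipes.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

# Synthetic proportions and response (illustrative values only).
rng = np.random.RandomState(0)
X = rng.dirichlet(np.ones(6), size=100)
true_coef = np.array([10.0, 0.0, 30.0, 0.0, -5.0, 0.0])
y = X.dot(true_coef) + rng.normal(scale=0.5, size=100)

model = LassoLarsIC(criterion='aic', fit_intercept=False)
model.fit(X, y)

# The chosen alpha is the one that minimizes the information criterion
# along the LARS path.
best = np.argmin(model.criterion_)
print(model.alphas_[best])
print(model.alpha_)
print(model.coef_)
```

Swapping `criterion='aic'` for `criterion='bic'` applies a heavier penalty per nonzero coefficient, which is one way to get a more restrictive fit.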


The results so far suggest that lemon juice and tahini are the most important ingredients to have in generous quantities, since they have the largest positive coefficients.  Note that `tahini` appears twice in the ingredient labels, with very different coefficients (about 3.8 and 39.3).

I am not sure why "hummus" is an ingredient.  I would remove recipes that include it to avoid circularity (especially since its coefficient is high), although during the (yet-to-be-done) linear programming optimization it can simply be fixed to zero.

The coefficient for "OTHER" is somewhat high, and I think that we can explore models with higher dimensionality that use a more restrictive version of LASSO.

# TODO

## LINEAR PROGRAMMING

Linear programming can be used to answer the question "what should the recipe be?"

Basically, there are 3 constraints on the recipe:

1. The sum of the proportions has to equal 1.

2. Each ingredient has to stay within the bounds in which it appears in the training data.  These bounds can be trimmed to exclude the most extreme values, to avoid issues with "problem" recipes.

3. If a recipe uses substitute ingredients (ingredients with a highly negative correlation), then the sum of those two ingredients should fall within certain bounds.

This problem is relatively simple to solve with a linear programming solver such as SciPy's `scipy.optimize.linprog`, which offers simplex-based methods.
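
A minimal sketch of the three constraints using `scipy.optimize.linprog` follows. The four coefficients are rounded from the fit above; the per-ingredient bounds and the choice of lemon juice and tahini as a "substitute pair" are made-up values for illustration, not derived from the training data.

```python
import numpy as np
from scipy.optimize import linprog

ingredients = ['chickpeas', 'lemon juice', 'olive oil', 'tahini']
coef = np.array([15.16, 35.60, -7.69, 39.34])  # rounded LASSO coefficients

# linprog minimizes, so negate the coefficients to maximize the predicted score.
c = -coef

# Constraint 1: proportions sum to 1.
A_eq = np.ones((1, len(ingredients)))
b_eq = [1.0]

# Constraint 2: hypothetical (lower, upper) bounds per ingredient.
bounds = [(0.40, 0.80), (0.02, 0.15), (0.01, 0.10), (0.05, 0.25)]

# Constraint 3: hypothetical substitute pair, 0.10 <= lemon juice + tahini <= 0.30,
# written as two "<=" rows.
A_ub = np.array([[0.0, 1.0, 0.0, 1.0],
                 [0.0, -1.0, 0.0, -1.0]])
b_ub = [0.30, -0.10]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

for name, share in zip(ingredients, res.x):
    print('%s: %.3f' % (name, share))
```

Fixing an ingredient such as "hummus" to zero only requires giving it the bounds `(0, 0)`.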

## CLUSTERING

Clustering should be performed to remove outlier recipes and possibly group large branches of the training data separately in case there are different "classes" of recipes.  Hierarchical methods are appropriate for this.
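
One possible way to do this is with SciPy's hierarchical clustering, sketched below on a tiny made-up proportion matrix. The average linkage, Euclidean distance, the cut into 3 clusters, and the rule "clusters with fewer than 2 recipes are outliers" are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up recipes (rows) over three ingredients (columns).
X = np.array([
    [0.70, 0.20, 0.10],  # class A
    [0.68, 0.22, 0.10],
    [0.72, 0.18, 0.10],
    [0.10, 0.20, 0.70],  # class B
    [0.12, 0.18, 0.70],
    [0.08, 0.22, 0.70],
    [0.05, 0.90, 0.05],  # unlike anything else
])

# Build the hierarchy, then cut it into at most 3 flat clusters.
Z = linkage(X, method='average', metric='euclidean')
labels = fcluster(Z, t=3, criterion='maxclust')

# Flag recipes that ended up in very small clusters as outliers.
sizes = np.bincount(labels)
outlier_mask = sizes[labels] < 2

print(labels)
print(outlier_mask)
```

The remaining clusters could then be modeled separately if they turn out to represent different "classes" of recipes.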