Collaborative baking
-----------------------

Today we are going to compute with distances. First, the recommender system we talked about last time. Since we can't recommend news articles, we'll add ingredients to recipes. We start with the incidence matrix and recall we had two ways to come up with recommended items.

<img src=http://compute-cuj.org/abcabc.001.jpeg width=700>

<img src=http://compute-cuj.org/abcabc.002.jpeg width=700>

This was pretty straightforward (I hope). The techniques here are referred to as "k-nearest neighbors" (or KNN or kNN). It's actually a pretty powerful machine learning (well, statistical) technique. And we leverage this kind of scheme all the time -- your doctor, for example, distills "you" into a row in a table: a height, a weight, a blood pressure, maybe some facts from your medical history. She then gives you advice, about dropping a few pounds, say, based on what the medical profession knows about people like you. The idea is that the health outcomes of people "similar" to you can help predict what's around the corner for you.

The key ideas here are pretty fundamental and **they have to do with distance.** Rows, for example, can be compared. In the case of collaborative filtering, we can evaluate one recipe relative to another, marking some as close (similar ingredients) and others as far (different tastes). This is a basic, mathematical abstraction that machine learning applies to all tables. Rows are points in a "high-dimensional space". (The same is true for columns, of course, except that it's typical to have tables -- spreadsheets-- organized so that rows refer to units of observation and columns refer to their characteristics.)

Let's see an incidence matrix from our cakes data set. It is weenie, but I'm down the rabbit hole now.

In [None]:
from pandas import read_csv
cakes = read_csv("http://compute-cuj.org/cakes.csv")

In [None]:
cakes.head(5)

The first five rows of the `cakes` data frame, our incidence matrix. Again, the matrix holds a 0 in row i and column j if the recipe in row i is missing ingredient j, and a 1 otherwise. Remember how many cakes we had...

In [None]:
cakes.shape

So, 1,000 recipes and 381 ingredients. As we commented in the last notebook, row and column sums can be important. Summing down the columns counts how many recipes (out of 1,000) contain each ingredient. Here we `apply()` the `sum()` function down columns, indicated with `axis=0`. We sort the results and see that many ingredients appear just once.

In [None]:
cakes.apply(sum,axis=0).sort_values()
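As a side note, the same column totals can also be obtained with the data frame's own `sum()` method. Here is a minimal sketch on a made-up three-recipe table (the `toy` frame and its ingredient names are invented for illustration, not taken from the cakes data). `axis=0` sums down columns and `axis=1` sums across rows.

```python
from pandas import DataFrame

# A made-up incidence matrix: 3 recipes (rows) by 3 ingredients (columns)
toy = DataFrame({"flour": [1, 1, 0], "sugar": [1, 0, 1], "eggs": [0, 0, 1]})

# Column sums: how many recipes contain each ingredient
by_apply = toy.apply(sum, axis=0)   # the form used in this notebook
by_method = toy.sum(axis=0)         # the data frame's own method

# Row sums: how many ingredients each recipe uses
per_recipe = toy.sum(axis=1)

print(by_method.sort_values())
print(per_recipe)
```

Both forms should give the same counts; the method form is simply shorter to type.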

Here we compute just how many ingredients appear in exactly one recipe. We take the column sums and compare each to 1, giving a boolean outcome: `True` if the ingredient appears in exactly one recipe and `False` otherwise. Adding these up turns `True` into 1 and `False` into 0.

In [None]:
sum( cakes.apply(sum,axis=0) == 1)

That means 126 of the 381 ingredients, or about 33%, are used by just one recipe. How many are used by two? Three? Fewer than five?

In [None]:
# your code here



How should we judge similarity between two recipes? Again, we want them to use much the same ingredients. A simple measure for that is the **Jaccard metric.** It is good for rows (or columns) that are made up of 0's and 1's. Essentially it looks at how many 1's the two recipes have in common, divided by the total number of distinct ingredients the two require between them. Well, you subtract that ratio from 1. So, if two recipes have nothing in common, the metric will be 1, and if they have everything in common, it will be 0.
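
Written out, with $A$ and $B$ standing for the sets of ingredients in the two recipes (the symbols are our own shorthand here, not notation from the earlier notebook), the description above amounts to:

$$
d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}
$$

For example, two recipes that share 2 ingredients and use 5 distinct ingredients between them sit at distance $1 - 2/5 = 0.6$.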

In [None]:
# Jaccard distance

def dist(a,b):
    
    intersection = sum((a+b)==2)
    union = sum((a+b)>=1)
    
    # print intersection,union
    
    return 1.0-(0.0+intersection)/union

Let's try out the function. We'll make two series (Pandas' representation of a row or column) of 0's and 1's and compute their Jaccard distance. See if this makes sense. Change the 0's and 1's so the two series agree everywhere, or disagree everywhere.

In [None]:
from pandas import Series
x = Series([1,0,1,1])
y = Series([0,0,1,1])
dist(x,y)
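
As a quick sanity check of the two extremes, here is a small sketch that repeats the `dist` function so it runs on its own. The extra series are made up for illustration. It tries a pair that agrees everywhere, a pair with nothing in common, and the `x`, `y` pair from above.

```python
from pandas import Series

def dist(a, b):
    # Same Jaccard distance as defined above, repeated so this cell is self-contained
    intersection = sum((a + b) == 2)   # positions where both series hold a 1
    union = sum((a + b) >= 1)          # positions where at least one holds a 1
    return 1.0 - (0.0 + intersection) / union

x = Series([1, 0, 1, 1])
y = Series([0, 0, 1, 1])

same = dist(x, x)                                         # everything in common
apart = dist(Series([1, 1, 0, 0]), Series([0, 0, 1, 1]))  # nothing in common
partial = dist(x, y)                                      # 2 shared out of 3 used

print(same, apart, partial)
```

The first two values should land on the ends of the scale, 0 and 1, and the third on 1 - 2/3.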

Finally, to compute the distance between two recipes, we need to access rows. This is done using `.iloc[]` subsetting. We haven't seen this yet, but it gives us fine-grained control over the rows or columns we want to extract. The notation is 
>`row selection, column selection`

The part before the comma selects rows and the part after it selects columns. We can use a single integer for a single row or column, a slice like `5:10` to get a range, and the empty slice `:` to ask for all the rows or columns. When we pick out a single row or column, the result is a Pandas Series.
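
To make the row and column selection concrete, here is a small sketch on a made-up table (the `toy` frame is invented for illustration and is not part of the cakes data).

```python
from pandas import DataFrame

# A made-up incidence matrix: 3 recipes (rows) by 3 ingredients (columns)
toy = DataFrame({"flour": [1, 1, 0], "sugar": [1, 0, 1], "eggs": [0, 1, 1]})

one_row = toy.iloc[1, :]      # single integer for the row: a Series with 3 entries
two_rows = toy.iloc[0:2, :]   # a slice of rows: a DataFrame holding rows 0 and 1
one_col = toy.iloc[:, 2]      # all rows, third column: a Series
one_cell = toy.iloc[1, 2]     # a single row and a single column: one value

print(one_row)
print(two_rows)
```

Note that, as with Python lists, the slice `0:2` stops before position 2.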

Here is the recipe with ID 40.

In [None]:
cakes.iloc[40,:]

This recipe has 10 ingredients.

In [None]:
sum(cakes.iloc[40,:])

Here are the ingredients in recipe 10, which we'll compare with recipe 40.

In [None]:
[c for c in cakes.columns if cakes[c][10]]

And here are the ingredients in recipe 40.

In [None]:
[c for c in cakes.columns if cakes[c][40]]

The two recipes share 8 ingredients and between them, there are 12 total ingredients. That means the distance is 1-8/12 = 1-2/3 = 1/3. Let's see!

In [None]:
dist(cakes.iloc[40,:],cakes.iloc[10,:])

Yes! 

Now, this loop iterates through the entire set of rows and calculates the distance between recipe 40 and the rest. 

In [None]:
distances = Series([ dist(cakes.iloc[40,:],cakes.iloc[i,:]) for i in range(1000)])
distances.head()
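
The list comprehension above calls `dist` once per recipe. One alternative, sketched here on a made-up table rather than the cakes data, is to compute all the distances in one go with NumPy. This is only an illustration under the assumption that every entry is 0 or 1; the names `toy`, `looped` and `vectorized` are our own.

```python
import numpy as np
from pandas import DataFrame, Series

def dist(a, b):
    # Jaccard distance, repeated from above so this cell is self-contained
    intersection = sum((a + b) == 2)
    union = sum((a + b) >= 1)
    return 1.0 - (0.0 + intersection) / union

# A made-up incidence matrix: 4 recipes by 4 ingredients
toy = DataFrame({"flour": [1, 1, 0, 1], "sugar": [1, 0, 1, 1],
                 "eggs": [0, 1, 1, 1], "rum": [0, 0, 1, 0]})
target = 0  # compare every recipe with recipe 0

# Loop version, as in the notebook
looped = Series([dist(toy.iloc[target, :], toy.iloc[i, :]) for i in range(toy.shape[0])])

# Vectorized version: for 0/1 data the elementwise minimum marks shared
# ingredients and the elementwise maximum marks ingredients used by either recipe
M = toy.values
t = M[target]
intersection = np.minimum(M, t).sum(axis=1)
union = np.maximum(M, t).sum(axis=1)
vectorized = Series(1.0 - intersection / union.astype(float))

print(vectorized)
```

On this toy table the two versions should agree, with recipe 0 at distance 0 from itself.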

Here we sort the distances and look at the distance of the 25th nearest recipe to number 40. (Position 0 in the sorted list holds distance 0, which is recipe 40 compared with itself.)

In [None]:
close = distances.sort_values().reset_index(drop=True)[25]
close

We then subset just the nearby cakes, those with distance at most 0.417. We also require the distance to be greater than 0, which drops recipe 40 itself.

In [None]:
near_cakes = cakes[(distances<=close) & (distances>0)]
near_cakes.head(5)

Let's pull out the ingredients in recipe 40, storing them in a list.

In [None]:
ingredients = [ing for ing in cakes.columns if cakes[ing][40]]
ingredients

Finally, we sort the ingredients according to the number of nearby recipes that contain them. We then leave out all the ingredients that are already in cake 40.

In [None]:
recommendations = near_cakes.apply(sum,axis=0).sort_values(ascending=False).reset_index()
recommendations[~recommendations["index"].isin(ingredients)].head(10)

Let's wrap these steps into a function and have a look at a few cakes and what we recommend to add.

In [None]:
def dist(a,b):
    
    intersection = sum((a+b)==2)
    union = sum((a+b)>=1)
    
    # print intersection,union
    
    return 1.0-(0.0+intersection)/union

def recommend(c,k=10,recipes=cakes):
    
    n_recipes = recipes.shape[0]
    
    # Use the `recipes` argument (not the global `cakes`) so the function works on any table
    ingredients = [ing for ing in recipes.columns if recipes[ing][c]]
    print("Ingredients in recipe", c)
    print(ingredients)
    
    distances = Series([dist(recipes.iloc[c,:],recipes.iloc[i,:]) for i in range(n_recipes)])
    close = list(distances.sort_values())[k]

    near_recipes = recipes[(distances<=close) & (distances>0)]
    recommendations = near_recipes.apply(sum,axis=0).sort_values(ascending=False).reset_index()
    
    return recommendations[~recommendations["index"].isin(ingredients)][:10]

In [None]:
recommend(920,25)

In [None]:
recommend(230,25)

In [None]:
recommend(430,25)

**Principal components**

Last time we took the idea of distance quite, um, far. From distances come notions of near and far, and clusters! We also had a right-angle relationship and the notion of a projection (the nearest point on a line or a surface). From there, we saw GGobi present different projections and we learned about one kind, the so-called principal components. Today we are going to kick the tires on scikit-learn, a Python module for working with machine learning and statistical modeling.

In [None]:
from sklearn.decomposition import PCA 
from pandas import DataFrame

pca = PCA(n_components=2)
fit = DataFrame(pca.fit_transform(cakes))
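
To connect this back to projections, here is a rough sketch of the idea behind a two-component PCA, written with plain NumPy on a made-up 0/1 table (the data and all variable names are invented for illustration). It centers each column, finds the directions of greatest spread with a singular value decomposition, and projects the rows onto the first two. We would expect scikit-learn's result to match this up to a possible sign flip of each axis, but treat that as an assumption rather than a guarantee.

```python
import numpy as np

# A made-up 0/1 table: 50 rows (recipes) by 6 columns (ingredients)
rng = np.random.RandomState(0)
X = (rng.rand(50, 6) < 0.4).astype(float)

# Center each column so the projection is about spread, not location
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by how much spread they capture
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project every row onto the first two directions: one (x, y) point per recipe
scores = Xc.dot(Vt[:2].T)

# Share of the total variance captured by each direction
explained = (S ** 2) / (S ** 2).sum()

print(scores[:5])
print(explained[:2])
```

The two columns of `scores` play the role of the two columns of `fit` above: they are the coordinates we scatterplot.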

In [None]:
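# Note: in plotly 4 and later, the `plotly.plotly` module moved to the separate `chart_studio` package
from plotly.plotly import iplot, sign_in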
import plotly.graph_objs as go
sign_in("cocteautt","9psj3t57ti")

mylayout = go.Layout(autosize=False, width=1000,height=1000)

mydata = [go.Scatter(x=fit.iloc[:,0],y=fit.iloc[:,1],marker={"color":"grey"},mode="markers")]

myfigure = go.Figure(data=mydata,layout=mylayout)
iplot(myfigure)

In [None]:
mylayout = go.Layout(autosize=False, width=1000,height=1000)

points0 = go.Scatter(x=fit[cakes["cake mix"]==1].iloc[:,0],y=fit[cakes["cake mix"]==1].iloc[:,1],name="cake mix",mode="markers",marker={"color":"orange"})
points1 = go.Scatter(x=fit[cakes["cake mix"]==0].iloc[:,0],y=fit[cakes["cake mix"]==0].iloc[:,1],name="no cake mix",mode="markers",marker={"color":"grey"})

mydata = [points0,points1]

myfigure = go.Figure(data=mydata,layout=mylayout)
iplot(myfigure)

In [None]:
mylayout = go.Layout(autosize=False, width=1000,height=1000)

points0 = go.Scatter(x=fit[cakes["cinnamon"]==1].iloc[:,0],y=fit[cakes["cinnamon"]==1].iloc[:,1],name="cinnamon",mode="markers",marker={"color":"orange"})
points1 = go.Scatter(x=fit[cakes["cinnamon"]==0].iloc[:,0],y=fit[cakes["cinnamon"]==0].iloc[:,1],name="cinnamon free",mode="markers",marker={"color":"grey"})

mydata = [points0,points1]

myfigure = go.Figure(data=mydata,layout=mylayout)
iplot(myfigure)