# Data: Past, Present, Future
## Lab 11
## Databases and recommendation engines


> Campaigns are moving away from the meaningless labels of pollsters and newsweeklies — “Nascar dads” and “waitress moms” — and moving toward treating each voter as a separate person. In 2012 you didn’t just have to be an African-American from Akron or a suburban married female age 45 to 54. More and more, the information age allows people to be complicated, contradictory and unique. New technologies and an abundance of data may rattle the senses, but they are also bringing a fresh appreciation of the value of the individual to American politics.

Ethan Roeder, [“I Am Not Big Brother”](http://www.nytimes.com/2012/12/06/opinion/i-am-not-big-brother.html?_r=0)


## connecting people and the long tail

>In 1988, a British mountain climber named Joe Simpson wrote a book called *Touching the Void*, a harrowing account of near death in the Peruvian Andes. It got good reviews but, only a modest success, it was soon forgotten. Then, a decade later, a strange thing happened. Jon Krakauer wrote *Into Thin Air*, another book about a mountain-climbing tragedy, which became a publishing sensation. Suddenly *Touching the Void* started to sell again. . .. 

> What happened? In short, Amazon.com recommendations. The online bookseller's software noted patterns in buying behavior and suggested that readers who liked *Into Thin Air* would also like *Touching the Void*. People took the suggestion, agreed wholeheartedly, wrote rhapsodic reviews. More sales, more algorithm-fueled recommendations, and the positive feedback loop kicked in.

Chris Anderson, [The Long Tail](https://www.wired.com/2004/10/tail/)
                
![long tail](https://media.wired.com/photos/5a59579a5451ae3d197fcf65/master/w_650,c_limit/FF_170_tail2_f.gif)

![long tail connection](https://media.wired.com/photos/5a5957cf2bbf59566d73366b/master/w_550,c_limit/FF_170_tail6_f.gif)



## Netflix prize

In 2009, BellKor's Pragmatic Chaos won the Netflix Prize for building a superior movie recommender system.

![winners](https://graphics8.nytimes.com/images/2009/09/21/technology/netflixawards.480.jpg)



![netflix prize](https://i.imgur.com/6TUm2Yj.png)



(see the cached version at https://web.archive.org/web/20070202023620/http://www.netflixprize.com:80/rules)


Netflix data set:

> 5-star ratings on 17770 movies and 480189 anonymous users over ~7 years. total of 100480507 ratings

A good deal of commercial data to use machine learning on, collected over time by the ordinary actions of users, and potentially telling us an awful lot about those users.


We'll try a smaller data set. We won't win a million dollars.


In [None]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

We'll be using the MovieLens data set with 100K ratings from http://grouplens.org/datasets/movielens/, where you can also find much bigger sets. For now, it's available to you locally.

(Compare a great blog post using `pandas` on the same data: http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/.) The approach and tools are slightly different. Worth checking out!

You need to have a directory ml-100k in the same place as this notebook.

We are going to look at three files: u.data, u.item, u.user


![relational](http://imgur.com/ZhpRFTj.png)

## relational database

>[The relational model] organizes data into one or more tables (or "relations") of columns and rows, with a unique key identifying each row. 

>each table/relation represents one "entity type" (such as customer or product). The rows represent instances of that type of entity (such as "Lee" or "chair") and the columns representing values attributed to that instance (such as address or price). (h/t wikipedia)

Created by E. F. Codd at IBM around 1969; see https://dl.acm.org/citation.cfm?doid=362384.362685



In [None]:
films=pd.read_csv('./ml-100k/u.item', sep="|", names=["movie id", "movie_title", "release_date", "video_release_date", "IMDb_URL", "unknown", "Action","Adventure", "Animation", "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"], index_col="movie id", encoding="latin1")

users=pd.read_csv('./ml-100k/u.user', sep="|", names=["user_id", "age", "gender","occupation","zip_code"], index_col="user_id")

individual_ratings=pd.read_csv( './ml-100k/u.data', sep="\t", names=["user_id","item_id","rating","timestamp"]) #\t because TAB separated

In [None]:
individual_ratings.head()

What's with that crazy number? According to the `README`, it's "unix seconds." A quick google search explains how to convert using the `pd.to_datetime` command. I will remember this just long enough to type the next few lines.

It is said that 95% of data analysis is fussing or munging the data. This is a small taste of that.

In [None]:
individual_ratings["timestamp"]=pd.to_datetime(individual_ratings["timestamp"], unit='s')

In [None]:
individual_ratings.head()

Now, this is nice. 
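To see what the conversion is doing without the real file, here is a minimal sketch on a tiny hand-made table. The frame `toy` and its values are invented for illustration; only `pd.to_datetime(..., unit='s')` comes from the cell above.

```python
import pandas as pd

# Hypothetical rows laid out like u.data (values invented for illustration).
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 20, 10],
    "rating": [5, 3, 4],
    "timestamp": [0, 3600, 881250949],  # seconds since 1970-01-01 00:00:00 UTC
})

# unit="s" tells pandas the integers count seconds (the default would be nanoseconds).
toy["timestamp"] = pd.to_datetime(toy["timestamp"], unit="s")

# Once the column holds datetimes, the .dt accessor exposes calendar fields.
toy["hour"] = toy["timestamp"].dt.hour
toy["year"] = toy["timestamp"].dt.year
print(toy)
```

Having the hour or the year as its own column is what makes questions about *when* people rate possible later on.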

What if we wanted to get all the items that one user rated?


We could use "boolean indexing."

In [None]:
users.loc[42]

In [None]:
films.loc[102]

What did Mr. administrator # 42 rate?

In [None]:
individual_ratings["user_id"]==42

Once we're in the comfort zone with boolean indexing, we'd probably condense all that into one line:

In [None]:
individual_ratings[individual_ratings["user_id"]==42]
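If the one-liner feels like magic, here is a small sketch that spells out the two steps on a made-up table. The names `toy`, `mask`, `user_42` and `user_42_liked` are hypothetical, as are the values.

```python
import pandas as pd

# Invented ratings in the same shape as individual_ratings.
toy = pd.DataFrame({
    "user_id": [42, 7, 42, 7],
    "item_id": [1, 1, 2, 3],
    "rating": [5, 3, 4, 2],
})

# Step 1: the comparison yields one True/False per row.
mask = toy["user_id"] == 42

# Step 2: indexing with the mask keeps only the rows marked True.
user_42 = toy[mask]

# Masks combine with & (and) and | (or); each condition needs its own parentheses.
user_42_liked = toy[(toy["user_id"] == 42) & (toy["rating"] >= 5)]

print(mask.tolist())
print(user_42)
print(user_42_liked)
```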

How might we profile user 42 based on this data? Think of three ways and do one.

In [None]:
individual_ratings[individual_ratings["user_id"]==42]['rating']

`pandas` lets us do all sorts of simple statistics, like finding the mean. Just tack on the method `mean`.

In [None]:
individual_ratings[individual_ratings["user_id"]==42]['rating'].mean()

How about all the ratings for a given film, say no. 65? 

How would we get that? 

Same idea!

In [None]:
individual_ratings[individual_ratings["item_id"]==65]

How do we get the average rating for film 65?


In [None]:
individual_ratings[individual_ratings["item_id"]==65]["rating"].mean()


And we might want to know whether there's a lot of variation in views about the film.

In [None]:
individual_ratings[individual_ratings["item_id"]==65]["rating"].hist()
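A histogram shows the spread; a couple of numbers can summarize it too. A rough sketch with two invented films that share a mean but not much else (the Series `consensus` and `divisive` are made up for illustration):

```python
import pandas as pd

# Two hypothetical films with the same average rating.
consensus = pd.Series([3, 3, 3, 3])   # everyone shrugs
divisive = pd.Series([1, 5, 1, 5])    # love it or hate it

# The means cannot tell the two films apart...
print(consensus.mean(), divisive.mean())

# ...but the standard deviation can.
print(consensus.std(), divisive.std())

# value_counts gives the same information as the histogram, as a table.
print(divisive.value_counts().sort_index())
```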

# Pandas as a powerful database



## SPLIT-APPLY-COMBINE

> - Splitting the data into groups based on some criteria
> - Applying a function to each group independently
> - Combining the results into a data structure

check the docs!



![SPLIT](http://i.imgur.com/yjNkiwL.png)


In [None]:
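# numeric_only=True skips text columns such as zip_code; recent pandas versions raise an error otherwise
users.groupby(by=["occupation", "gender"]).mean(numeric_only=True)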
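To make the three steps concrete, here is a sketch that does split-apply-combine once with `groupby` and once by hand, on an invented mini version of the users table (`toy_users` and its values are hypothetical):

```python
import pandas as pd

toy_users = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "gender": ["F", "M", "F", "F"],
    "occupation": ["student", "student", "writer", "writer"],
})

# The pandas way: one line.
mean_age = toy_users.groupby("occupation")["age"].mean()

# The same thing, spelled out.
by_hand = {}
for occupation in toy_users["occupation"].unique():
    group = toy_users[toy_users["occupation"] == occupation]  # split
    by_hand[occupation] = group["age"].mean()                  # apply
by_hand = pd.Series(by_hand)                                   # combine

print(mean_age)
print(by_hand)
```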

# Pivot


What if we wanted to have a big table where each row is the user followed by all her ratings?
We could write a few lines of code to produce this.

Fortunately, Pandas will do this heavy lifting for us using the `pivot` method.


In [None]:
ratings=individual_ratings.pivot(index="user_id", columns="item_id", values="rating")

Basically: rework our data using user_id as row names; item_id as column names and all the ratings as the values
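A tiny sketch of what `pivot` does to a "long" table, using invented data (the frames `long` and `wide` are hypothetical names):

```python
import pandas as pd

# Long format: one row per (user, item) rating.
long = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})

# Wide format: one row per user, one column per item.
wide = long.pivot(index="user_id", columns="item_id", values="rating")
print(wide)

# Every (user, item) pair that never appeared in the long table shows up as NaN.
print(wide.isna().sum().sum(), "of", wide.size, "cells are empty")
```

Note that `pivot` refuses to run if the same (user, item) pair appears twice; `pivot_table` with an aggregation function is the usual way around that.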

In [None]:
ratings[100:115]

### Question: Why all the NaNs?


#### Another question to the user: Why not switch all the NaNs to zeros?


This is called a *sparse* matrix: most of the values are empty. 

Most large-scale commercial rating or purchasing data looks like this. Why?
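How sparse is sparse? A back-of-the-envelope sketch: divide the number of ratings by the number of cells in the full user-by-item table. The Netflix counts are the ones quoted earlier in this notebook; the MovieLens 100K counts (943 users, 1682 films) are taken from that data set's README. The helper `density` is a hypothetical name.

```python
def density(n_ratings, n_users, n_items):
    """Fraction of the user-by-item table that actually holds a rating."""
    return n_ratings / (n_users * n_items)

netflix = density(100480507, 480189, 17770)
movielens = density(100_000, 943, 1682)

print(f"Netflix Prize:  {netflix:.2%} of cells filled")
print(f"MovieLens 100K: {movielens:.2%} of cells filled")
```

Either way, the overwhelming majority of cells are empty.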




### Question: What did we lose from our original dataframe?

### Question: What questions could we no longer ask?

- Say we wanted to know whether people rate movies differently at different times of the day, or differently during different seasons.



We can now easily ask about the mean rating given by each user, and the mean rating received by each movie.

How would we do these operations differently?

In [None]:
ratings.mean(axis=0)

Which movies are not garbage according to the masses?

In [None]:
ratings.mean(0)>4.25

In [None]:
the_good_stuff=ratings.mean(0)>4.25

And where would we find the names of these films?

In [None]:
films[the_good_stuff]

How would we find the bad stuff?

Cool!
What are the average ratings per user?

We need to use `mean` across columns.

In [None]:
ratings.mean(1) # average rating per user: axis=1 averages across the columns, giving one value per row (user id)
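The `axis` argument is easy to get backwards, so here is a small sketch on an invented 3-user, 3-film table (the names `wide`, `per_film`, `per_user` and `zero_filled` are hypothetical):

```python
import numpy as np
import pandas as pd

# Rows are users, columns are films, NaN means "not rated".
wide = pd.DataFrame(
    {10: [5.0, 4.0, np.nan], 20: [3.0, np.nan, np.nan], 30: [np.nan, np.nan, 2.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

per_film = wide.mean(axis=0)  # collapse the rows: one mean per column (film)
per_user = wide.mean(axis=1)  # collapse the columns: one mean per row (user)

# mean skips NaN by default. Filling with zeros first drags every average down.
zero_filled = wide.fillna(0).mean(axis=1)

print(per_film)
print(per_user)
print(zero_filled)
```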

In [None]:
ratings.mean(1)>4 # one True/False per row (user id)

In [None]:
those_lacking_discernment=ratings.mean(1)>4
pretentious_movie_snobs=ratings.mean(1)<2.5

What are some of the things we might want to do with our knowledge of the users and their ratings?

In [None]:
users[those_lacking_discernment].head(15)

In [None]:
ratings.loc[4].hist()

In [None]:
users[pretentious_movie_snobs]

In [None]:
ratings.loc[206].hist()

What does the teenager from Delavan, WI not hate? Could you figure it out? 

And, if we were Netflix, what would we want to recommend to her?

## What might we want to do with our new knowledge

Let's discount the less discerning viewers! Let's just scale their ratings down by a factor of .75. A bit arbitrary, but so are they!

We could multiply every element in a dataframe by a constant like so:


In [None]:
ratings[those_lacking_discernment]*.75
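One thing to watch: an expression like the one above returns a new table and leaves the original alone. A minimal sketch of keeping the adjusted values, on invented data (`toy`, `generous`, `scaled_view` and `adjusted` are hypothetical names):

```python
import pandas as pd

# Two users, two films.
toy = pd.DataFrame({10: [4.0, 2.0], 20: [5.0, 1.0]}, index=[1, 2])
generous = toy.mean(axis=1) > 4          # True for user 1 only

scaled_view = toy[generous] * 0.75       # a new frame; toy itself is unchanged

# To keep the result, work on a copy and assign back into the selected rows.
adjusted = toy.copy()
adjusted.loc[generous] = adjusted.loc[generous] * 0.75

print(toy)
print(adjusted)
```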


## Recommending stuff

Lots of strategies.
Any ideas?


- Find most similar *users*
- Find most similar *items*

Use data from users to recommend items: called *collaborative filtering*.

Combine them!

In [None]:
from scipy.spatial.distance import cosine  #cosine distance function--not cosine similarity

In [None]:
def cosine_similarity(A,B):
    return 1 - cosine(A,B)

In [None]:
cosine_similarity((1,0),(0,1))

In [None]:
cosine_similarity((1,2),(3,4))

In [None]:
cosine_similarity([1,1,0],[0,1,2])
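Under the hood, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A sketch of the same computation written out with `numpy` (the function name `cosine_similarity_by_hand` is made up here):

```python
import numpy as np

def cosine_similarity_by_hand(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity_by_hand((1, 0), (0, 1)))        # perpendicular vectors
print(cosine_similarity_by_hand((1, 2), (3, 4)))        # pointing in nearly the same direction
print(cosine_similarity_by_hand([1, 1, 0], [0, 1, 2]))
print(cosine_similarity_by_hand((1, 2), (2, 4)))        # same direction, different length
```

Only the direction matters, not the length: doubling a vector leaves the similarity unchanged.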


In [None]:
import numpy as np

In [None]:
np.array([np.mean(ratings, 1)]).T

In [None]:

ratings

Those NaNs are trouble. 

One way to normalize is to subtract each user's mean rating from his or her row.

In [None]:
ratings.mean(axis=1).head()

In [None]:
ratings.fillna(0).sub(pd.Series(ratings.mean(axis=1)), 0)

In [None]:
ratings_normalized=ratings.fillna(0).sub(pd.Series(ratings.mean(axis=1)), 0)
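The order of `fillna` and `sub` matters. The cell above fills first, so an unrated film ends up at minus the user's mean. A common alternative (not what this notebook does) is to subtract first and fill afterwards, so that unrated films sit at zero, i.e. "no opinion." A sketch comparing the two on invented data (all names are hypothetical):

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(
    {10: [5.0, 2.0], 20: [3.0, np.nan], 30: [np.nan, 4.0]},
    index=pd.Index([1, 2], name="user_id"),
)
user_means = wide.mean(axis=1)  # user 1 -> 4.0, user 2 -> 3.0

# Order used in the cell above: missing ratings become -mean.
fill_then_center = wide.fillna(0).sub(user_means, axis=0)

# Alternative order: missing ratings become 0.
center_then_fill = wide.sub(user_means, axis=0).fillna(0)

print(fill_then_center)
print(center_then_fill)
```

Which one is preferable depends on what you want an unrated film to mean; it is worth trying both and comparing the recommendations.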

Now we can compare users to users and movies to movies!

In [None]:
cosine_similarity(ratings_normalized.loc[24], ratings_normalized.loc[25])

To compute the similarities, we'll pick one film (#1) and compute the cosine similarity with every other film. What's number one?

In [None]:
films.loc[1]

In [None]:
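def find_similarities(film):
    # Films are the *columns* of ratings_normalized (rows are users),
    # so compare the chosen film's column with every other column.
    similarities={}
    for i in ratings_normalized.columns:
         similarities[i]=cosine_similarity(ratings_normalized[film], ratings_normalized[i])
    return pd.Series(similarities)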

In [None]:
find_similarities(1)

So what are the most similar films according to this crazy way of proceeding?

In [None]:
find_similarities(1).sort_values(ascending=False).head()

In [None]:
films['movie_title'][find_similarities(1).sort_values(ascending=False).head().index]

In [None]:
def most_similar(film, number=5):
    most=films['movie_title'][find_similarities(film).sort_values(ascending=False).head(number).index]
    print(most)

In [None]:
most_similar(3)

In [None]:
most_similar(143)

Not the most promising approach!!

For a good survey of recommendation engines at scale, see the chapter from the [Stanford *Mining of Massive Datasets* course](http://infolab.stanford.edu/~ullman/mmds/ch9.pdf)


Major problem: too high a dimensional space to use lots of algorithms efficiently!

Trick: reduce dimensionality using aspects of films and users!

One tool: singular value decomposition (SVD), a matrix factorization closely related to principal component analysis.

SVD decomposes a large matrix into three components:

![SVD diagram](http://xieyan87.com/wp-content/uploads/2015/06/SVD.png)


Allows you to generate *latent factors* and then calculate similarities. 

![latent factor](https://image.slidesharecdn.com/petroniphdthesispresentation-161104150721/95/mining-at-scale-with-latent-factor-models-for-matrix-completion-8-638.jpg?cb=1478272108)

- Serious vs. escapist
- geared-male vs. geared-female
- &c.



In [None]:
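# Rows of ratings_normalized are users and columns are films, so no transpose is needed:
# the right singular vectors then describe films rather than users.
A = ratings_normalized.values / np.sqrt(len(films) - 1)
U, S, V = np.linalg.svd(A)  # NumPy returns the third factor already transposed (often written Vh)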

# modified from numpy focused https://alyssaq.github.io/2015/20150426-simple-movie-recommender-using-svd/
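The shapes are the confusing part of SVD, so here is a sketch on a small random matrix standing in for a users-by-films table. Everything here (`A_toy`, `Vt`, `item_factors`, the sizes) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A_toy = rng.normal(size=(4, 6))      # pretend: 4 users (rows) by 6 films (columns)

U, S, Vt = np.linalg.svd(A_toy)      # Vt is the third factor, already transposed
print(U.shape, S.shape, Vt.shape)    # square U, vector of singular values, square Vt

# The three pieces multiply back to the original matrix.
Sigma = np.zeros(A_toy.shape)
Sigma[:len(S), :len(S)] = np.diag(S)
print(np.allclose(U @ Sigma @ Vt, A_toy))

# Keep only the k strongest factors.
k = 2
item_factors = Vt.T[:, :k]           # one row per column of A_toy (film), k numbers each
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation of A_toy
print(item_factors.shape, np.linalg.matrix_rank(A_k))
```

The point to notice: the rows of `Vt.T` line up with the *columns* of the input matrix, so whether you get film factors or user factors depends on how the input is oriented.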

Choose how many factors to consider.

In [None]:
k=25
pd.DataFrame(V.T[:, :k])

In [None]:
def find_similarities(film, sliced):
    similarities={}
    # ids start at 1, rows of sliced start at 0; loop over every row of sliced
    for i in range(1, len(sliced) + 1):
         similarities[i]=cosine_similarity(sliced[film-1], sliced[i-1])
    return pd.Series(similarities)

def most_similar(film, sliced, number=5):
    most=films['movie_title'][find_similarities(film, sliced).sort_values(ascending=False).head(number).index]
    print(most)

In [None]:
k = 50
movie_id = 1 # Grab an id from u.item
top_n = 10

sliced = V.T[:, :k] # representative data


In [None]:
find_similarities(1,sliced)


In [None]:
most_similar(405, sliced)

# Back to Netflix challenge

![winners](https://graphics8.nytimes.com/images/2009/09/21/technology/netflixawards.480.jpg)

Very close--came down to which group submitted first!

![leaderboard](https://cdn0.tnwcdn.com/wp-content/blogs.dir/1/files/2012/04/NFlix-520x285.png)

## Social ensemble of teams competing.

![venn](https://i1.wp.com/s3-ap-northeast-1.amazonaws.com/wpstoragepublicshare/netflix/bellkor_team.png)

## Algorithmic ensemble 

![bellkordiagram](https://i.imgur.com/cHXxYIl.jpg)



# **Huge** victory of predictive machine learning values!

# But meanwhile...

![dumpster_fire](https://media1.tenor.com/images/2b68afa54bb22fbe90f9201dfaaa2af0/tenor.gif?itemid=7182596)



[FAQ](https://web.archive.org/web/20070202024240/https://www.netflixprize.com/faq) for Netflix Challenge reads:

>“Is there any customer information in the dataset that should be kept private?” 
    
>“No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy [. . . ] Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn’t a privacy problem is it?”

# Sorry, nope. Not so much.


Arvind Narayanan and Vitaly Shmatikov, then of UT Austin, showed

>an adversary who knows only a little bit about
an individual subscriber can easily identify this subscriber’s
record in the [Netflix] dataset. Using the Internet
Movie Database as the source of background knowledge,
we successfully identified the Netflix records of
known users, uncovering their apparent political preferences
and other potentially sensitive information.


[Robust De-anonymization of Large Datasets](https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf)
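To get a feel for why "anonymous" ratings leak, here is a deliberately simplified toy sketch of a linkage attack. It is *not* the algorithm from the paper, which weights rare movies more heavily and uses dates; this version just counts approximate matches. All records, titles and names below are invented.

```python
# Hypothetical "anonymized" release: record id -> {movie: stars}.
released = {
    "r1": {"A": 5, "B": 4, "C": 1, "D": 3},
    "r2": {"A": 5, "B": 2, "E": 4},
    "r3": {"C": 1, "D": 3, "F": 5},
}

# What an adversary might glean from someone's public reviews (possibly off by a star).
known = {"A": 5, "C": 2, "D": 3}

def score(record, known, tolerance=1):
    """Count the known ratings that the record matches to within `tolerance` stars."""
    return sum(
        1
        for movie, stars in known.items()
        if movie in record and abs(record[movie] - stars) <= tolerance
    )

scores = {rid: score(rec, known) for rid, rec in released.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
best, runner_up = ranked[0], ranked[1]

# Only claim a match when the best candidate clearly beats the next one.
match = best if scores[best] > scores[runner_up] else None
print(scores, match)
```

Once a record is matched, everything else in it (here, the rating for movie "B") is attached to a real person.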



# How does all this lead to more

![dumpster_fire](https://media1.tenor.com/images/2b68afa54bb22fbe90f9201dfaaa2af0/tenor.gif?itemid=7182596)

glory of recommender engines:

> long tail
- connect people who may have never known one another
- connect people with things they might never have known

disaster of recommender engines:

> put like with like: filter bubble

Political Twitter according to ["Political Polarization on Twitter"](http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/download/2847/3275)

![polarization](http://themonkeycage.org/wp-content/uploads/2011/07/Screen-shot-2011-07-27-at-11.23.29-AM.png)

Political retweet (left) and mention (right) networks, laid out using a force-directed algorithm.
