# The Jester Dataset

![](https://vignette.wikia.nocookie.net/helmet-heroes/images/9/9b/Jester_Hat.png/revision/latest/scale-to-width-down/340?cb=20131023213944)

This morning we will build a recommendation system from user ratings of jokes.

By the end of this notebook, we will know how to
- Format data for user-to-user recommendation
- Find the cosine similarity between two vectors
- Use K Nearest Neighbors to identify similar vectors
- Filter a dataframe to identify the highest rated joke among the K most similar users.

In [1]:
import pandas as pd
import numpy as np

from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances


import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

### About user data
Format:

- Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").
- One row per user
- The first column gives the number of jokes rated by that user. The remaining columns give the ratings, one column per joke (the file loaded below has 151 columns, so 150 joke columns).
- The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes.
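
Because the sentinel 99 lies outside the rating range, any arithmetic on raw rows would mix it with real ratings. As a minimal sketch, assuming a hypothetical `toy` frame in the same layout (it is not part of the Jester files), we can mask 99 as missing before summarizing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same layout: column 0 is the count of
# rated jokes, the remaining columns are ratings, 99 means "not rated".
toy = pd.DataFrame([[2, 99, 4.5, -3.0],
                    [1, 7.25, 99, 99]])

ratings_only = toy.drop(0, axis=1)
masked = ratings_only.replace(99, np.nan)   # treat the sentinel as missing

rated_counts = masked.notna().sum(axis=1)   # should agree with column 0
mean_ratings = masked.mean(axis=1)          # NaN is skipped by default

print(rated_counts.tolist())
print(mean_ratings.tolist())
```

The notebook itself keeps 99 in place and handles it later, so this is only one way to look at the data.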


In [2]:
df = pd.read_csv('./data/jesterfinal151cols.csv', header=None)
df = df.fillna(99)

In [3]:
df.head()

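|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 99 | 99 | 99 | 99 | 0.21875 | 99 | -9.28125 | -9.28125 | 99 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 1 | 34 | 99 | 99 | 99 | 99 | -9.6875 | 99 | 9.9375 | 9.53125 | 99 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 2 | 18 | 99 | 99 | 99 | 99 | -9.84375 | 99 | -9.84375 | -7.21875 | 99 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 3 | 82 | 99 | 99 | 99 | 99 | 6.90625 | 99 | 4.75 | -5.90625 | 99 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 4 | 27 | 99 | 99 | 99 | 99 | -0.03125 | 99 | -9.09375 | -0.40625 | 99 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |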


### Joke data

In [4]:
jokes = pd.read_table('./data/jester_items.tsv', header = None)
jokes.head()

|  | 0 | 1 |
|---|---|---|
| 0 | 1: | A man visits the doctor. The doctor says, "I h... |
| 1 | 2: | This couple had an excellent relationship goin... |
| 2 | 3: | Q. What's 200 feet long and has 4 teeth? A. Th... |
| 3 | 4: | Q. What's the difference between a man and a t... |
| 4 | 5: | Q. What's O. J. Simpson's web address? A. Slas... |


Column `0` is the join column: it holds the joke id that links each joke to its ratings in the user dataframe.

In the cell below, we 
- Remove the ':' character from the `0` column
- Convert the column to an integer datatype
- Set the `0` column as the index for our jokes table.

In [5]:
jokes[0] = jokes[0].apply(lambda x: x.replace(':', ''))
jokes[0] = jokes[0].astype(int)
jokes.set_index(0, inplace=True)
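
The same cleanup can also be written with pandas' vectorized string methods instead of `apply`. A minimal sketch on a hypothetical `toy_items` frame (the names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the jokes table: ids carry a trailing ':'.
toy_items = pd.DataFrame({0: ['1:', '2:', '10:'], 1: ['a', 'b', 'c']})

# Strip the trailing ':' from every id at once, then cast to int.
toy_items[0] = toy_items[0].str.rstrip(':').astype(int)
toy_items = toy_items.set_index(0)

print(toy_items.index.tolist())
```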

In [6]:
jokes.head()

| 0 | 1 |
|---|---|
| 1 | A man visits the doctor. The doctor says, "I h... |
| 2 | This couple had an excellent relationship goin... |
| 3 | Q. What's 200 feet long and has 4 teeth? A. Th... |
| 4 | Q. What's the difference between a man and a t... |
| 5 | Q. What's O. J. Simpson's web address? A. Slas... |


We will be creating a basic recommendation system using cosine similarity. 

Let's quickly review cosine similarity.

### Cosine similarity

Cosine similarity = 1 - cosine distance

#### What does cosine similarity measure?
- The cosine of the angle between two vectors
    - if cosine(v1, v2) == 0 -> perpendicular
    - if cosine(v1, v2) == 1 -> same direction
    - if cosine(v1, v2) == -1 -> opposite direction
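
The three cases above can be checked by hand. Here is a minimal sketch that computes the cosine from the dot product and the vector norms; the helper name `cosine` and the vectors are ours, chosen only for illustration:

```python
import numpy as np

def cosine(v1, v2):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 0.0])

perpendicular = cosine(a, np.array([0.0, 3.0]))
same_direction = cosine(a, np.array([2.0, 0.0]))
opposite = cosine(a, np.array([-5.0, 0.0]))

print(perpendicular, same_direction, opposite)
```

Note that the length of the vectors does not matter, only their direction.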

Let's create two vectors and find their cosine distance

In [7]:
v1 = np.array([1, 2])
v2 = np.array([1, 2.5])

distance = cosine_distances(v1.reshape(1, -1), v2.reshape(1, -1))

Now, we can subtract the distance from 1 to find the cosine similarity.

In [8]:
similarity = 1 - distance
similarity

array([[0.99654576]])

scikit-learn also provides a `cosine_similarity` function that computes this directly.

In [9]:
cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1))

array([[0.99654576]])
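
When `cosine_similarity` is given a matrix with one row per user, it returns every pairwise similarity at once. A small sketch with a hypothetical `toy_ratings` matrix (three made-up users, three jokes):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: user 1 resembles user 0, user 2 is the opposite.
toy_ratings = np.array([
    [5.0, -2.0, 3.0],
    [4.0, -1.0, 2.5],
    [-5.0, 2.0, -3.0],
])

sim = cosine_similarity(toy_ratings)   # shape (3, 3), sim[i, j] for users i, j

# Most similar user to user 0, ignoring user 0 itself.
row = sim[0].copy()
row[0] = -np.inf
most_similar_to_0 = int(np.argmax(row))

print(sim.round(3))
print(most_similar_to_0)
```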

# Build a recommender system 
How do we recommend a joke to userA?
- User-to-user approach:
    - Find users that are similar to userA.
    - Identify jokes that those similar users have rated highly.

### Let's condition the data for a recommender system


In [10]:
## User we would like to recommend a joke to
user_index = 0

## Drop column that totals the numbers of jokes each user has rated. 
## Isolate the row for the desired user
userA = df.drop(0, axis=1).loc[user_index, :]

# All other users
others = df.drop(0, axis=1).drop(index=user_index, axis=0)


# Find the nearest neighbors
knn = NearestNeighbors(n_neighbors=5, metric='cosine', n_jobs=-1)
knn.fit(others)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=-1, n_neighbors=5, p=2, radius=1.0)

Great! Now we can use the vector of ratings for userA as an input to our knn model.

The knn model returns the distances between userA and the K nearest neighbors, along with each neighbor's row position in the data the model was fit on (`others`).

In [11]:
distances, indices = knn.kneighbors(userA.values.reshape(1, -1))
distances, indices = distances[0], indices[0]


print('---------------------------------------------------------------------------------------------')
print("userA's K nearest neighbor distances:", distances) 
print('---------------------------------------------------------------------------------------------')
print("Indices of userA's K nearest neighbors:", indices)
print('---------------------------------------------------------------------------------------------')

---------------------------------------------------------------------------------------------
userA's K nearest neighbor distances: [0.12284198 0.12953529 0.13661332 0.13848128 0.141326  ]
---------------------------------------------------------------------------------------------
Indices of userA's K nearest neighbors: [228 243 288 302  76]
---------------------------------------------------------------------------------------------
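
The indices returned by `kneighbors` are row positions in the data passed to `fit`. Since `others` has userA's row removed, positions at or after that row are shifted by one relative to `df`. A minimal sketch, using a hypothetical `toy` frame, of translating positions back to row labels:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy ratings for four users (rows) and three jokes (columns).
toy = pd.DataFrame(
    [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.9, 0.1, 0.0],
     [0.0, 0.0, 1.0]],
    index=[0, 1, 2, 3],
)

target_label = 1
target = toy.loc[target_label]
rest = toy.drop(index=target_label)      # row labels are now 0, 2, 3

knn_toy = NearestNeighbors(n_neighbors=1, metric='cosine')
knn_toy.fit(rest.values)
_, positions = knn_toy.kneighbors(target.values.reshape(1, -1))

# kneighbors returns positions within `rest`; translate them to labels.
neighbor_labels = rest.index[positions[0]]

print(positions[0], neighbor_labels.tolist())
```

Here the nearest neighbor sits at position 1 of `rest`, which is the row labeled 2. Position 1 of the original frame would be the target user itself.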


#### Now that we have our most similar users, what's next?

#### Find their highest rated items that aren't rated by userA

In [12]:
# let's get jokes not rated by userA
jokes_not_rated = np.where(userA==99)[0]
jokes_not_rated = np.delete(jokes_not_rated, 0)
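
`np.where` returns positions within `userA`, while indexing a dataframe with a list selects columns by label. Because column 0 was dropped, userA's labels start at 1, so position p corresponds to label p + 1. A small sketch on a hypothetical `user_row` series, showing how to collect the labels directly:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one user; the index holds joke ids (1-based),
# mirroring a row of df after column 0 has been dropped.
user_row = pd.Series([99, 4.5, 99, -2.0], index=[1, 2, 3, 4])

unrated_positions = np.where(user_row == 99)[0]   # 0-based positions
unrated_labels = user_row.index[user_row == 99]   # joke ids

print(unrated_positions.tolist())
print(unrated_labels.tolist())
```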

Next we need to isolate the nearest neighbors in our data, and examine their ratings for jokes userA has not rated.

In [13]:
user_jokes = df.drop(0, axis=1).iloc[indices][jokes_not_rated]
user_jokes

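|  | 1 | 2 | 3 | 5 | 8 | 9 | 10 | 11 | 13 | 27 | ... | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 228 | 99 | 99 | 99 | -3.65625 | -10.0 | 99 | 99 | 99 | -9.875 | 99.0 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 243 | 99 | 99 | 99 | -9.1875 | -6.4375 | 99 | 99 | 99 | 2.78125 | -3.09375 | ... | -5.1875 | -5.375 | -4.3125 | -4.125 | 4.5625 | -2.9375 | -0.53125 | -3.875 | 4.21875 | -4.875 |
| 288 | 99 | 99 | 99 | -7.0 | 3.65625 | 99 | 99 | 99 | -6.8125 | 6.375 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 302 | 99 | 99 | 99 | -9.03125 | 9.5625 | 99 | 99 | 99 | -9.4375 | -0.375 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 76 | 99 | 99 | 99 | 3.03125 | -6.6875 | 99 | 99 | 99 | 4.09375 | -9.71875 | ... | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |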


Let's total up the ratings of each joke!

To do this, we first replace the 99 values with 0 so that "not rated" entries do not inflate the totals.

In [14]:
ratings = user_jokes.replace(99, 0).sum()
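
Summing with 99 replaced by 0 treats "not rated" like a neutral rating, so jokes rated by more neighbors can accumulate larger totals. One alternative, which is not what this notebook does, is to average only the ratings that exist. A sketch on a hypothetical `toy_neighbors` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical neighbor ratings: rows are neighbors, columns are joke ids.
toy_neighbors = pd.DataFrame(
    {10: [8.0, 99.0, 99.0], 11: [3.0, 3.0, 3.0], 12: [99.0, 99.0, 99.0]},
    index=['u1', 'u2', 'u3'],
)

totals = toy_neighbors.replace(99, 0).sum()        # the approach used above
means = toy_neighbors.replace(99, np.nan).mean()   # mean of rated entries only

print(totals.idxmax())   # joke favored by the sum
print(means.idxmax())    # joke favored by the mean
```

The two scores can disagree: the sum favors joke 11 (three moderate ratings), while the mean favors joke 10 (one high rating). Which behavior is preferable depends on how much weight a single enthusiastic neighbor should carry.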

Right now, the user_jokes dataframe has one row per user and one column per joke.

To read each joke's ratings row by row, let's transpose the user_jokes dataframe so that jokes become rows.

In [15]:
user_jokes = user_jokes.T

user_jokes.head()

|  | 228 | 243 | 288 | 302 | 76 |
|---|---|---|---|---|---|
| 1 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 2 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 3 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 |
| 5 | -3.65625 | -9.1875 | -7.0 | -9.03125 | 3.03125 |
| 8 | -10.0 | -6.4375 | 3.65625 | 9.5625 | -6.6875 |


Great! Now we add the summed ratings as a `total` column to our user_jokes dataframe.

In [16]:
user_jokes['total'] = ratings
user_jokes.head()

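|  | 228 | 243 | 288 | 302 | 76 | total |
|---|---|---|---|---|---|---|
| 1 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 0.0 |
| 2 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 0.0 |
| 3 | 99.0 | 99.0 | 99.0 | 99.0 | 99.0 | 0.0 |
| 5 | -3.65625 | -9.1875 | -7.0 | -9.03125 | 3.03125 | -25.84375 |
| 8 | -10.0 | -6.4375 | 3.65625 | 9.5625 | -6.6875 | -9.90625 |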


Using the `.idxmax()` method, we return the index label of the joke with the highest total rating!

In [17]:
recommend_index = user_jokes['total'].idxmax()
recommend_index

32

In [18]:
# checking our work
user_jokes.sort_values(by='total', ascending=False).head()

|  | 228 | 243 | 288 | 302 | 76 | total |
|---|---|---|---|---|---|---|
| 32 | 99.0 | 2.3125 | 7.21875 | -0.40625 | 1.375 | 10.5 |
| 66 | 99.0 | 6.40625 | 99.0 | -0.5625 | 3.4375 | 9.28125 |
| 54 | 99.0 | -4.6875 | 7.125 | 99.0 | 4.3125 | 6.75 |
| 72 | 99.0 | -1.5 | 6.0 | -0.40625 | 2.25 | 6.34375 |
| 111 | 99.0 | 0.96875 | 5.125 | 99.0 | 99.0 | 6.09375 |


Now all we have to do is plug in the index to our jokes dataframe, and return the recommended joke!

In [19]:
jokes.iloc[recommend_index][1]

'What do you call an American in the finals of the world cup? "Hey beer man!"'
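
A note on the lookup: `recommend_index` comes from the index of `user_jokes`, which holds column labels of `df`, while `jokes` is indexed by joke id starting at 1. `.iloc` counts rows from 0, so assuming the ids are consecutive as in the `head()` output, `.iloc[k]` returns the row labeled k + 1, whereas `.loc[k]` returns the row labeled k. A small sketch on a hypothetical `toy_jokes` frame shows the difference:

```python
import pandas as pd

# Hypothetical jokes table indexed by a 1-based id, like `jokes` above.
toy_jokes = pd.DataFrame(
    {1: ['joke one', 'joke two', 'joke three']},
    index=pd.Index([1, 2, 3], name=0),
)

by_label = toy_jokes.loc[2, 1]       # row whose id is 2
by_position = toy_jokes.iloc[2, 0]   # third row, whose id is 3

print(by_label)
print(by_position)
```

Which of the two is intended is worth checking whenever an id is used to fetch a row.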

# We did it!

### Assignment

Please create a function called `recommend_joke` that receives a user index and returns a recommended joke.

In [20]:
def recommendation_data():
    df = pd.read_csv('./data/jesterfinal151cols.csv', header=None)
    df = df.fillna(99)
    jokes = pd.read_table('./data/jester_items.tsv', header = None)
    jokes[0] = jokes[0].apply(lambda x: x.replace(':', ''))
    jokes[0] = jokes[0].astype(int)
    jokes.set_index(0, inplace=True)
    
    return df, jokes

def userA_and_others(user_index, df):
    ## Drop column that counts the numbers of jokes each user has rated. 
    ## Isolate the row for the desired user
    userA = df.drop(0, axis=1).loc[user_index, :]
    
    # Isolate all other users
    others = df.drop(0, axis=1).drop(index=user_index, axis=0)
    
    return userA, others

def nearest_neighbors(userA, others):
    # Fit Nearest Neighbors
    knn = NearestNeighbors(n_neighbors=5, metric='cosine', n_jobs=-1)
    knn.fit(others)
    
    distances, indices = knn.kneighbors(userA.values.reshape(1, -1))
    distances, indices = distances[0], indices[0] 
    
    return distances, indices

def find_joke(df, jokes, neighbor_indices, jokes_not_rated):
    # Select the neighbors' rows and the columns listed in jokes_not_rated
    user_jokes = df.drop(0, axis=1).iloc[neighbor_indices][jokes_not_rated]
    # Sum each joke's ratings, counting "not rated" (99) as 0
    ratings = user_jokes.replace(99, 0).sum()
    user_jokes = user_jokes.T
    user_jokes['total'] = ratings
    recommend_index = user_jokes['total'].idxmax()
    # jokes is passed in explicitly rather than read from a global
    return jokes.iloc[recommend_index][1]

def recommend_joke(user_index):
    df, jokes = recommendation_data()

    userA, others = userA_and_others(user_index, df)

    distances, neighbor_indices = nearest_neighbors(userA, others)

    # Positions in userA where the rating is 99 ("not rated")
    jokes_not_rated = np.where(userA==99)[0]
    jokes_not_rated = np.delete(jokes_not_rated, 0)

    return find_joke(df, jokes, neighbor_indices, jokes_not_rated)

Now we can recommend a joke to any user in the dataset!

In [21]:
recommend_joke(400)

"Q: How many programmers does it take to change a lightbulb? A: NONE! That's a hardware problem..."

Let's see what the highest rated joke is for User 400.

In [22]:
highest_rated_joke_index = df.iloc[400].replace(99,0).drop(0).idxmax()
print(jokes.iloc[highest_rated_joke_index].values[0])

A country guy goes into a city bar that has a dress code, and the maitre d' demands he wear a tie. Discouraged, the guy goes to his car to sulk when inspiration strikes: He's got jumper cables in the trunk! So he wraps them around his neck, sort of like a string tie (a bulky string tie to be sure) and returns to the bar. The maitre d' is reluctant, but says to the guy, "Okay, you're a pretty resourceful fellow, you can come in... but just don't start anything!"
