# Recommender systems on the Jester 1.7M jokes ratings dataset

Attribution: [DSCI_563_Lecture_7 - UBC Master of Data Science](https://github.com/UBC-MDS/DSCI_563_unsup-learn_students/blob/master/lectures/07_lecture-recommender-systems1.ipynb)

Dataset: [Kaggle's Jester 1.7M jokes ratings dataset](https://www.kaggle.com/vikashrajluhaniwal/jester-17m-jokes-ratings-dataset)

In [1]:
import os
import random
import sys
import time

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate, train_test_split

pd.set_option("display.max_colwidth", 0)

In [22]:
# Load the ratings and keep only users with userId <= 4000 to work with a subset
ratings_full = pd.read_csv('../../data/raw/jester_ratings.csv')
ratings = ratings_full[ratings_full["userId"] <= 4000]
print('Data shape:', ratings.shape)
ratings.head()

Data shape: (141362, 3)


|   | userId | jokeId | rating |
|---|--------|--------|--------|
| 0 | 1 | 5 | 0.219 |
| 1 | 1 | 7 | -9.281 |
| 2 | 1 | 8 | -9.281 |
| 3 | 1 | 13 | -6.781 |
| 4 | 1 | 15 | 0.875 |


In [23]:
def get_stats(ratings, item_key="jokeId", user_key="userId"):
    print("Number of ratings:", len(ratings))
    print("Average rating:  %0.3f" % (np.mean(ratings["rating"])))
    N = len(np.unique(ratings[user_key]))
    M = len(np.unique(ratings[item_key]))
    print("Number of users (N): %d" % N)
    print("Number of items (M): %d" % M)
    print("Fraction non-nan ratings: %0.3f" % (len(ratings) / (N * M)))
    return N, M

N, M = get_stats(ratings)

Number of ratings: 141362
Average rating:  1.200
Number of users (N): 3635
Number of items (M): 140
Fraction non-nan ratings: 0.278
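
As a quick check of the last figure, the density is the number of observed ratings divided by the number of cells in an N x M matrix. The snippet below only redoes that arithmetic with the counts printed above.

```python
# Counts copied from the get_stats output above
n_ratings = 141362
n_users = 3635  # N
n_items = 140   # M

# Fraction of the N x M utility matrix that holds an observed rating
density = n_ratings / (n_users * n_items)
print("Fraction non-nan ratings: %0.3f" % density)
```

So roughly 72% of the user-joke pairs in this subset have no rating, and those are the entries a recommender has to predict.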


In [24]:
# Creating utility matrix Y
user_key = "userId"
item_key = "jokeId"
interaction = "rating"

user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(N))))
item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(M))))
user_inverse_mapper = dict(zip(list(range(N)), np.unique(ratings[user_key])))
item_inverse_mapper = dict(zip(list(range(M)), np.unique(ratings[item_key])))

def create_Y_from_interaction_df(interaction_df, N, M):
    Y = np.full((N, M), np.nan)  # start with every entry missing
    for index, row_value in interaction_df.iterrows():
        n = user_mapper[row_value[user_key]]
        m = item_mapper[row_value[item_key]]
        Y[n, m] = row_value[interaction]
    return Y
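
Iterating row by row with `iterrows` can be slow on larger slices of the data. A vectorized alternative is sketched below under the same mapper idea. The toy DataFrame, its values, and the name `build_utility_matrix` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy interactions (hypothetical values, same column names as the ratings data)
toy = pd.DataFrame({
    "userId": [10, 10, 20, 30],
    "jokeId": [5, 7, 5, 8],
    "rating": [0.5, -9.0, 3.25, 7.0],
})

# Map raw ids to consecutive row/column positions
user_ids = np.unique(toy["userId"])
item_ids = np.unique(toy["jokeId"])
toy_user_mapper = {u: i for i, u in enumerate(user_ids)}
toy_item_mapper = {j: i for i, j in enumerate(item_ids)}


def build_utility_matrix(df, user_mapper, item_mapper):
    """Fill an all-NaN matrix with the observed ratings in one indexed assignment."""
    Y = np.full((len(user_mapper), len(item_mapper)), np.nan)
    rows = df["userId"].map(user_mapper).to_numpy()
    cols = df["jokeId"].map(item_mapper).to_numpy()
    Y[rows, cols] = df["rating"].to_numpy()
    return Y


Y_toy = build_utility_matrix(toy, toy_user_mapper, toy_item_mapper)
print(Y_toy)
```

The sketch assumes each (user, joke) pair appears at most once and that every id in the DataFrame is present in the mappers.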

In [25]:
Y = create_Y_from_interaction_df(ratings, N, M)
Y.shape

(3635, 140)

## Defining the evaluation metric

In [29]:
def rmse(X1, X2):
    """
    Returns the root mean squared error over the entries that are non-NaN in both inputs.
    """
    return np.sqrt(np.nanmean((X1 - X2) ** 2))


def evaluate(pred_X, train_X, valid_X, model_name="Global average"):
    print("%s train RMSE: %0.2f" % (model_name, rmse(pred_X, train_X)))
    print("%s valid RMSE: %0.2f" % (model_name, rmse(pred_X, valid_X)))
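
Because `rmse` uses `np.nanmean`, any entry that is NaN in either matrix drops out of the average, so a dense prediction matrix is scored only on the observed ratings. A small worked example with made-up numbers:

```python
import numpy as np


def rmse(X1, X2):
    # Same definition as above: NaN entries are ignored by nanmean
    return np.sqrt(np.nanmean((X1 - X2) ** 2))


pred = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
truth = np.array([[4.0, np.nan],
                  [np.nan, 1.0]])

# Only two entries are observed: (1 - 4)^2 = 9 and (4 - 1)^2 = 9
# mean = 9, square root = 3
score = rmse(pred, truth)
print(score)
```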

## Baseline approaches

In [27]:
# Train/validation split on the long-format ratings (simpler than splitting the utility matrix)

X = ratings.copy()
y = ratings[user_key]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42) 

In [28]:
train_mat = create_Y_from_interaction_df(X_train, N, M)
valid_mat = create_Y_from_interaction_df(X_valid, N, M)
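
Splitting the long-format ratings means each observed rating lands in exactly one of the two matrices, and both matrices keep the full N x M shape. The sketch below checks this on a toy table; the data and the helper `to_matrix` are hypothetical and assume ids that are already 0-based positions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy long-format ratings: 4 users x 3 jokes, every pair rated (hypothetical values)
toy = pd.DataFrame(
    [(u, j, float(u + j)) for u in range(4) for j in range(3)],
    columns=["userId", "jokeId", "rating"],
)

toy_train, toy_valid = train_test_split(toy, test_size=0.25, random_state=0)


def to_matrix(df, n_users=4, n_items=3):
    """Place each rating at (userId, jokeId); unobserved cells stay NaN."""
    Y = np.full((n_users, n_items), np.nan)
    Y[df["userId"].to_numpy(), df["jokeId"].to_numpy()] = df["rating"].to_numpy()
    return Y


toy_train_mat = to_matrix(toy_train)
toy_valid_mat = to_matrix(toy_valid)

# A cell is observed in at most one of the two matrices
overlap = ~np.isnan(toy_train_mat) & ~np.isnan(toy_valid_mat)
n_train = int((~np.isnan(toy_train_mat)).sum())
n_valid = int((~np.isnan(toy_valid_mat)).sum())
print("Overlapping cells:", int(overlap.sum()))
print("Train / valid ratings:", n_train, n_valid)
```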

### 1) Global average baseline

In [30]:
avg = np.nanmean(train_mat)
pred_g = np.zeros(train_mat.shape) + avg

In [31]:
evaluate(pred_g, train_mat, valid_mat, model_name="Global average")

Global average train RMSE: 5.75
Global average valid RMSE: 5.77


### 2) [$k$-nearest neighbours imputation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html)

Each sample’s missing values are imputed using the mean value from the `n_neighbors` nearest neighbors found in the training set. Two samples are considered close if the features that neither is missing are close.
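
To make this concrete, here is a small sketch on a toy matrix with made-up values. It also checks that the imputer leaves observed entries untouched, which is why an RMSE computed against the training matrix itself comes out as zero.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows are samples (users), columns are features (jokes); NaN marks a missing rating
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

toy_imputer = KNNImputer(n_neighbors=2)
X_filled = toy_imputer.fit_transform(X_toy)
print(X_filled)

# Observed entries are passed through unchanged; only the NaN cells are filled
observed = ~np.isnan(X_toy)
unchanged = np.array_equal(X_filled[observed], X_toy[observed])
print("Observed entries unchanged:", unchanged)
```

For the first row, the two closest rows that have a value in the third column are rows 1 and 2, so the missing entry becomes the mean of 3 and 5.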

In [32]:
from sklearn.impute import KNNImputer

In [40]:
imputer = KNNImputer(n_neighbors=10) # fill each missing rating with the mean rating the 10 most similar users gave that joke
train_mat_imp = imputer.fit_transform(train_mat)

In [42]:
evaluate(train_mat_imp, train_mat, valid_mat, model_name="KNN Imputer")

KNN Imputer train RMSE: 0.00
KNN Imputer valid RMSE: 4.79


#### Find the nearest neighbours of a joke based on user ratings

In [47]:
from sklearn.neighbors import NearestNeighbors

In [45]:
jokes_df = pd.read_csv("../../data/raw/jester_items.csv")
jokes_df.head()

Unnamed: 0,jokeId,jokeText
0,1,"A man visits the doctor. The doctor says ""I have bad news for you.You have\ncancer and Alzheimer's disease"". \nThe man replies ""Well,thank God I don't have cancer!""\n"
1,2,"This couple had an excellent relationship going until one day he came home\nfrom work to find his girlfriend packing. He asked her why she was leaving him\nand she told him that she had heard awful things about him. \n\n""What could they possibly have said to make you move out?"" \n\n""They told me that you were a pedophile."" \n\nHe replied, ""That's an awfully big word for a ten year old."" \n"
2,3,Q. What's 200 feet long and has 4 teeth? \n\nA. The front row at a Willie Nelson Concert.\n
3,4,Q. What's the difference between a man and a toilet? \n\nA. A toilet doesn't follow you around after you use it.\n
4,5,"Q.\tWhat's O. J. Simpson's Internet address? \nA.\tSlash, slash, backslash, slash, slash, escape.\n"


In [69]:
def get_topk_recommendations(X, item_content_mapper, query_ind=0, metric="cosine", k=5):
    '''Return the top recommended items for a query joke and print the query joke's text.

    X is the transposed utility matrix (rows are jokes), so neighbours are jokes with
    similar user ratings. item_content_mapper maps a jokeId to its text (e.g. built from jokes_df).'''
    query_idx = item_inverse_mapper[query_ind] # map item_id used in utility matrix to interaction_df 
    model = NearestNeighbors(n_neighbors=k, metric=metric)
    model.fit(X)
    
    # indices of the k items closest to the query item (the query itself is normally among them)
    neigh_ind = model.kneighbors([X[query_ind]], k, return_distance=False).flatten()
    neigh_ind = neigh_ind[neigh_ind != query_ind]  # drop the query item itself
    recs = [item_content_mapper[item_inverse_mapper[i]] for i in neigh_ind]
    print("Query joke: ", item_content_mapper[query_idx])

    return pd.DataFrame(data=recs, columns=["top recommendations"])
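
The neighbour lookup inside the function can be seen in isolation on a toy item-by-user matrix. This is a minimal sketch with hypothetical ratings; unlike the function above, it asks for `k + 1` neighbours so that `k` items remain after the query is dropped.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are items (jokes), columns are users (hypothetical ratings)
item_vectors = np.array([
    [5.0, 4.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])
query = 0
k = 1

# Cosine distance compares the direction of the rating vectors, not their magnitude
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine")
nn.fit(item_vectors)

# The query is its own nearest neighbour, so fetch k + 1 and filter it out
neigh = nn.kneighbors(item_vectors[[query]], return_distance=False).flatten()
neigh = neigh[neigh != query][:k]
print(neigh)
```

Item 1 is returned because its ratings point in nearly the same direction as item 0's. The function above instead requests `k` neighbours and then drops the query, so it returns at most `k - 1` recommendations.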

In [70]:
item_user_mapper = train_mat_imp.T # X is the transposed trained utility matrix with rows being jokes
id_joke_mapper = dict(zip(jokes_df["jokeId"], jokes_df["jokeText"]))

In [71]:
get_topk_recommendations(item_user_mapper, id_joke_mapper, query_ind=8, metric="cosine", k=5)

Query joke:  Q: If a person who speaks three languages is called "tri-lingual," and
a person who speaks two languages is called "bi-lingual," what do call
a person who only speaks one language?

A: American! 



Unnamed: 0,top recommendations
0,"Q: What is the difference between George Washington, Richard Nixon,\nand Bill Clinton?\n\nA: Washington couldn't tell a lie, Nixon couldn't tell the truth, and\nClinton doesn't know the difference.\n"
1,"A man in a hot air balloon realized he was lost. He reduced altitude and spotted a woman below. He descended a bit more and shouted, ""Excuse me, can you help me? I promised a friend I would meet him an hour ago, but I don't know where I am."" The woman below replied, ""You are in a hot air balloon hovering approximately 30 feet above the ground. You are between 40 and 41 degrees north latitude and between 59 and 60 degrees west longitude."" ""You must be an engineer,"" said the balloonist. ""I am,"" replied the woman. ""How did you know?"" ""Well,"" answered the balloonist, ""everything you told me is technically correct, but I have no idea what to make of your information, and the fact is, I am still lost. Frankly, you've not been much help so far."" The woman below responded, ""You must be in management."" ""I am,"" replied the balloonist, ""but how did you know?"" ""Well,"" said the woman, ""you don't know where you are or where you are going. You have risen to where you are due to a large quantity of hot air. You made a promise that you have no idea how to keep, and you expect people beneath you to solve your problems. The fact is, you are in exactly the same position you were in before we met, but now, somehow, it's my fault!"""
2,If pro- is the opposite of con- then congress must be the opposite\nof progress.\n
3,"Arnold Swartzeneger and Sylvester Stallone are making a movie about\nthe lives of the great composers. \nStallone says ""I want to be Mozart."" \nSwartzeneger says: ""In that case... I'll be Bach.""\n"
