# Jokes Recommendation System

This analysis builds a joke recommendation system that recommends jokes to each user based on the ratings of users with similar activity. The joke.csv data set contains user ratings for different jokes. The data was gathered from http://eigentaste.berkeley.edu/.

User ratings for each joke range from -10 to 10.

Each joke is saved as an HTML file that can be opened in a web browser.

In [1]:
import random

# open a random joke
print(f'Open the file init{random.randint(1,100)}.html using a web browser. Then type the joke ID and text below.')

Open the file init76.html using a web browser. Then type the joke ID and text below.


**Generated Joke ID:** 76

**Generated Joke text:** There once was a man and a woman that both got in a terrible car wreck. Both of their vehicles were completely destroyed, but fortunately, no one was hurt. In thankfulness, the woman said to the man, 'We are both okay, so we should celebrate. I have a bottle of wine in my car, let's open it.' So the woman got the bottle out of the car, and handed it to the man. The man took a really big drink, and handed the woman the bottle. The woman closed the bottle and put it down. The man asked, 'Aren't you going to take a drink?'
The woman cleverly replied, 'No, I think I'll just wait for the cops to get here.'

In [6]:
import pandas as pd

# read data
dat = pd.read_csv('joke.csv')

# verify the first 5 rows and columns of the data set
print(dat.iloc[0:5, 0:5])

# verify number of rows and columns
print('\nThe data set contains ', dat.shape[0], ' rows')
print('The data set contains ', dat.shape[1], ' columns')

    User  Joke0  Joke1  Joke2  Joke3
0  User0  -7.82   8.79  -9.66  -8.16
1  User1   4.08  -0.29   6.36   4.37
2  User2    NaN    NaN    NaN    NaN
3  User3    NaN   8.35    NaN    NaN
4  User4   8.50   4.61  -4.17  -5.39

The data set contains  24983  rows
The data set contains  101  columns


## Initial Analysis

As the output above shows, the data set has many missing values. Each NaN represents a joke that the respective user has not yet rated.

The goal of this analysis is to impute the missing values using SoftImpute in Python. The imputed scores estimate the rating a user might give to a joke they have not yet rated, based on the ratings of users with similar activity.

In [14]:
# separate User column from the jokes data set
df = dat.iloc[:, 1:]

# verify the data set
print(df.iloc[0:5, 0:5])

# rename columns
cols = ['joke ' + str(num) for num in range(df.shape[1])]
df.columns = cols

# verify new column names
print('\n', df.iloc[0:5, 0:10])

   Joke0  Joke1  Joke2  Joke3  Joke4
0  -7.82   8.79  -9.66  -8.16  -7.52
1   4.08  -0.29   6.36   4.37  -2.38
2    NaN    NaN    NaN    NaN   9.03
3    NaN   8.35    NaN    NaN   1.80
4   8.50   4.61  -4.17  -5.39   1.36

    joke 0  joke 1  joke 2  joke 3  ...  joke 6  joke 7  joke 8  joke 9
0   -7.82    8.79   -9.66   -8.16  ...   -9.85    4.17   -8.98   -4.76
1    4.08   -0.29    6.36    4.37  ...   -0.73   -5.34    8.88    9.22
2     NaN     NaN     NaN     NaN  ...    9.03    9.27     NaN     NaN
3     NaN    8.35     NaN     NaN  ...   -2.82    6.21     NaN    1.84
4    8.50    4.61   -4.17   -5.39  ...    7.04    4.61   -0.44    5.73

[5 rows x 10 columns]


In [17]:
# save number of rows and columns
m = df.shape[0]
n = df.shape[1]

# display total number of entries in the data set
print('Total number of entries is: ', m * n)

# display number of missing entries
print('Number of missing entries is: ', df.isna().sum().sum())

# display the missing rate
print('The missing rate is: ', round((df.isna().sum().sum() / (m * n)) * 100, 2), '%')

Total number of entries is:  2498300
Number of missing entries is:  687845
The missing rate is:  27.53 %
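
The overall missing rate hides how unevenly the gaps are spread across jokes and users. As a minimal sketch, assuming a small made-up frame `toy` standing in for `df`, the mean of the boolean NaN mask gives the missing rate overall, per joke, and per user:

```python
import numpy as np
import pandas as pd

# toy ratings frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({
    'joke 0': [1.5, np.nan, -3.0, np.nan],
    'joke 1': [np.nan, np.nan, 2.0, 4.0],
    'joke 2': [0.5, 7.0, -1.0, 9.0],
})

# True where a rating is missing
mask = toy.isna()

overall_rate = mask.to_numpy().mean()  # fraction of all cells that are missing
per_joke_rate = mask.mean(axis=0)      # fraction of users who have not rated each joke
per_user_rate = mask.mean(axis=1)      # fraction of jokes each user has not rated

print('Overall missing rate: ', round(overall_rate * 100, 2), '%')
print(per_joke_rate)
print(per_user_rate)
```

The same three lines applied to `df` would show which jokes and which users contribute most of the missing entries.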


## Recommendation Design

To create a recommendation system, the first step is to obtain the list of unrated jokes for each user. The imputed values for those jokes can then be looked up, sorted, and presented to the user.

In [19]:
import numpy as np

# get the row (user) and column (joke) indexes of missing entries
na_user, na_joke = np.where(df.isna())

# print indexes
print('NULL user indexes: ', na_user)
print('NULL joke indexes: ', na_joke)

NULL user indexes:  [    0     0     0 ... 24982 24982 24982]
NULL joke indexes:  [70 71 72 ... 97 98 99]


In [21]:
# initialize a list for unrated jokes per user
unrated = [None] * m

# populate list
for user in range(m):
  null_index = np.where(na_user == user)
  unrated[user] = na_joke[null_index]

# test with one user
print(unrated[9999])

[ 0  3  8 23 29 32 36 42 43 54 56 57 58 59 62 63 66 69 70 71 73 74 75 76
 77 78 79 80 81 83 84 85 86 87 88 89 90 91 93 94 95 96 97 98 99]
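
The loop above compares the full array of missing-entry indexes against every user, which becomes slow as the data grows. A possible alternative, sketched here on a made-up frame `toy` rather than on `df`, reads the unrated joke indexes row by row straight from the NaN mask:

```python
import numpy as np
import pandas as pd

# toy ratings frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({
    'joke 0': [1.5, np.nan, -3.0, np.nan],
    'joke 1': [np.nan, np.nan, 2.0, 4.0],
    'joke 2': [0.5, 7.0, -1.0, 9.0],
})

# boolean array: True where the user has not rated the joke
mask = toy.isna().to_numpy()

# one array of unrated joke indexes per user, in column order
unrated_toy = [np.flatnonzero(row) for row in mask]

print(unrated_toy[1])
```

Each row of the mask is visited once, so the work grows with the size of the ratings table instead of with the number of users times the number of missing entries.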


## SoftImpute

The SoftImpute method was selected for imputation because it performed better than the other imputation methods considered (KNN, MissForest, etc.).
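
For intuition, the general idea behind soft-thresholded SVD imputation can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the fancyimpute implementation: the function name `soft_impute_sketch`, the `shrinkage` value, and the stopping rule are assumptions chosen for the example.

```python
import numpy as np

def soft_impute_sketch(X, shrinkage, max_iters=100, tol=1e-5):
    """Fill NaNs in X by iterating a soft-thresholded SVD (illustrative only)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # start with zeros in the missing cells
    filled = np.where(observed, X, 0.0)
    for _ in range(max_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # shrink every singular value toward zero, dropping the small ones
        s_shrunk = np.maximum(s - shrinkage, 0.0)
        low_rank = (U * s_shrunk) @ Vt
        # keep observed ratings, take the low-rank estimate elsewhere
        new_filled = np.where(observed, X, low_rank)
        change = np.linalg.norm(new_filled - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = new_filled
        if change < tol:
            break
    return filled

# small rank-1 matrix with two hidden entries (made-up data)
rng = np.random.default_rng(0)
truth = np.outer(rng.normal(size=6), rng.normal(size=4))
X = truth.copy()
X[0, 1] = np.nan
X[3, 2] = np.nan

completed = soft_impute_sketch(X, shrinkage=0.1)
print(completed.round(2))
```

Shrinking the singular values pushes the completed matrix toward low rank, which is what lets ratings from similar users inform the missing cells.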

In [24]:
from fancyimpute import SoftImpute

# leverage SoftImpute to impute missing ratings
imputer = SoftImpute()
imputed = imputer.fit_transform(df)

# convert imputations to a data frame
df_imputed = pd.DataFrame(imputed)

[SoftImpute] Max Singular Value of X_init = 3567.445922
[SoftImpute] Iter 1: observed MAE=0.415889 rank=100
[SoftImpute] Iter 2: observed MAE=0.416288 rank=100
[SoftImpute] Iter 3: observed MAE=0.416598 rank=100
[SoftImpute] Iter 4: observed MAE=0.416836 rank=100
[SoftImpute] Iter 5: observed MAE=0.417016 rank=100
[SoftImpute] Iter 6: observed MAE=0.417150 rank=100
[SoftImpute] Iter 7: observed MAE=0.417248 rank=100
[SoftImpute] Iter 8: observed MAE=0.417316 rank=100
[SoftImpute] Iter 9: observed MAE=0.417361 rank=100
[SoftImpute] Iter 10: observed MAE=0.417389 rank=100
[SoftImpute] Iter 11: observed MAE=0.417403 rank=100
[SoftImpute] Iter 12: observed MAE=0.417405 rank=100
[SoftImpute] Iter 13: observed MAE=0.417399 rank=100
[SoftImpute] Iter 14: observed MAE=0.417387 rank=100
[SoftImpute] Iter 15: observed MAE=0.417369 rank=100
[SoftImpute] Iter 16: observed MAE=0.417348 rank=100
[SoftImpute] Iter 17: observed MAE=0.417324 rank=100
[SoftImpute] Iter 18: observed MAE=0.417298 rank=100
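
The MAE in the log above is measured on observed entries. One way to estimate accuracy on unseen ratings is to hide a share of the known ratings, impute them, and compare. The sketch below assumes hypothetical helpers `holdout_mae` and `column_mean_impute`; the column-mean imputer is only a stand-in baseline so the example runs without fancyimpute.

```python
import numpy as np

def holdout_mae(X, impute_fn, holdout_frac=0.1, seed=0):
    """Hide a random share of observed entries, impute, and score MAE on them."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    obs_rows, obs_cols = np.where(~np.isnan(X))
    n_hold = max(1, int(holdout_frac * obs_rows.size))
    pick = rng.choice(obs_rows.size, size=n_hold, replace=False)
    rows, cols = obs_rows[pick], obs_cols[pick]
    # hide the held-out ratings from the imputer
    X_train = X.copy()
    X_train[rows, cols] = np.nan
    X_hat = impute_fn(X_train)
    # compare imputed values against the ratings that were hidden
    return np.mean(np.abs(X_hat[rows, cols] - X[rows, cols]))

def column_mean_impute(X):
    """Baseline stand-in: replace each NaN with the mean of its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

# made-up ratings in the -10 to 10 range with a few gaps
rng = np.random.default_rng(1)
toy = rng.uniform(-10, 10, size=(20, 5))
toy[0, 0] = np.nan
toy[5, 3] = np.nan

mae = holdout_mae(toy, column_mean_impute)
print('Held-out MAE: ', mae)
```

Passing `lambda A: SoftImpute().fit_transform(A)` as `impute_fn` would score SoftImpute on the same held-out entries, which makes comparisons between imputation methods direct.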

In [25]:
# create a list of dataframes, where each df is a jokes recommendation dataframe for each user
joke_recommender = [pd.DataFrame() for i in range(m)]

In [28]:
# populate joke recommendations per user
for usr in range(m):

  # initialize dataframe
  rec_df = pd.DataFrame()
  
  # get unrated jokes for the user
  unrated_jokes = unrated[usr]
  
  # initialize lists for the recommendations
  joke_labels = []
  joke_imputed_scores = []

  # get unrated joke imputed scores and labels
  for unr_joke in unrated_jokes:
    rec_joke_score = df_imputed.iloc[usr, unr_joke]
    rec_joke_name = 'joke ' + str(unr_joke)

    # append labels to lists
    joke_imputed_scores.append(rec_joke_score)
    joke_labels.append(rec_joke_name)

  # insert lists into dataframe as columns
  rec_df['recommended_joke'] = joke_labels
  rec_df['imputed_score'] = joke_imputed_scores
  
  # sort dataframe by imputed score
  rec_df = rec_df.sort_values(by='imputed_score', ascending=False)

  # set the user index in the master list to the recommended dataframe for the user
  joke_recommender[usr] = rec_df

## Output Analysis

The output of the recommendation system is a sorted list of recommended jokes for each user. Every joke in a user's list is one the user has not yet rated. Jokes near the top of the list have higher imputed ratings, so the user is more likely to enjoy them than the jokes near the bottom.
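
When only a few recommendations are needed for one user, the sorted list can also be built on demand. The sketch below assumes a hypothetical helper `top_n_unrated` and made-up `ratings` and `imputed` data standing in for `df` and the SoftImpute output:

```python
import numpy as np
import pandas as pd

def top_n_unrated(ratings, imputed, user, n=3):
    """Return the n unrated jokes with the highest imputed score for one user."""
    # True where this user has no rating in the original data
    unrated_mask = ratings.iloc[user].isna().to_numpy()
    # label the user's imputed scores with the joke names
    scores = pd.Series(np.asarray(imputed)[user], index=ratings.columns)
    return scores[unrated_mask].sort_values(ascending=False).head(n)

# made-up original ratings (NaN = not yet rated)
ratings = pd.DataFrame({
    'joke 0': [np.nan, 2.0],
    'joke 1': [5.0, np.nan],
    'joke 2': [np.nan, np.nan],
    'joke 3': [np.nan, -1.0],
})

# made-up imputed matrix with the same shape as ratings
imputed = np.array([
    [1.2, 5.0, 3.4, -0.7],
    [2.0, 0.3, 4.1, -1.0],
])

top = top_n_unrated(ratings, imputed, user=0, n=2)
print(top)
```

This avoids storing one dataframe per user while returning the same ordering for the requested user.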

In [29]:
# verify the recommended jokes for a user
print(joke_recommender[9999])

   recommended_joke  imputed_score
42          joke 97       4.073630
19          joke 71       3.621257
22          joke 75       3.098355
26          joke 79       2.956815
44          joke 99       2.881525
24          joke 77       2.603944
1            joke 3       2.480557
25          joke 78       2.243834
27          joke 80       2.162497
30          joke 84       2.119794
5           joke 32       1.911207
23          joke 76       1.842801
34          joke 88       1.771972
37          joke 91       1.740824
32          joke 86       1.713594
18          joke 70       1.640814
21          joke 74       1.532897
20          joke 73       1.463982
14          joke 62       1.343466
17          joke 69       1.127577
41          joke 96       1.064115
35          joke 89       1.008989
43          joke 98       0.983060
36          joke 90       0.919978
0            joke 0       0.795669
29          joke 83       0.780956
28          joke 81       0.735476
39          joke 94 

In [32]:
# get the joke number of the top recommendation for a user (strip the 'joke ' prefix)
bestjoke_num = joke_recommender[9999].iloc[0,0][5:]

# get the corresponding filename for the HTML file containing the joke
bestjoke_file = '/content/init'+bestjoke_num+'.html'
print(bestjoke_file)

/content/init97.html


## Test Recommendations

The cells below display the HTML file of the top recommended joke for two users in the data set.

In [33]:
import IPython.display

# display the highest recommended joke for the user
IPython.display.HTML(filename=bestjoke_file)

A teacher is explaining to her class how different languages use negatives differently. She says, "In all languages, a positive followed by a negative or a negative followed by a positive makes a negative. In some languages, two negatives together make a positive, while in others they make a negative. But in no language do two positives make a negative." One of the students puts up his hand and says, "Yeah, right."


In [36]:
# get and display the top recommended joke for another user
best_joke_second_usr = joke_recommender[100].iloc[0, 0][5:]

best_joke_file_second_usr = '/content/init' + best_joke_second_usr + '.html'
IPython.display.HTML(filename=best_joke_file_second_usr)

A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist. The doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY. . . "Can you read this?" the doctor asked. "Read it?" the Czech answered. "Doc, I know him!"
