## __Recommendation System Via Hybrid Comparison__

The following is a two-stage filtering of raw data for the Netflix recommendation competition that retains only users sufficiently similar to a randomly chosen subscriber (called *user_id* in the text and `random_id` in the code). The first stage filters by overlap of movies rated. The second filters by correlation of ratings.

The threshold for the first filter is the proportion of the movies rated by *user_id* that another user has also rated.

The threshold for the second filter is the correlation coefficient between the ratings of *user_id* and those of another user, computed only over movies both have rated.

Threshold selections can be found at the top of the program.
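
The two thresholds can be illustrated on toy data. The following is a minimal sketch rather than the pandas implementation used below: the helper names `overlap_fraction` and `pearson_on_common` and the rating dictionaries are hypothetical, and ratings are assumed to be stored as `{movieId: rating}`.

```python
import math

def overlap_fraction(target, other):
    """Share of the target user's movies that the other user has also rated."""
    return len(set(target) & set(other)) / len(target)

def pearson_on_common(target, other):
    """Pearson correlation over movies rated by both users (None if undefined)."""
    common = sorted(set(target) & set(other))
    if len(common) < 2:
        return None
    x = [target[m] for m in common]
    y = [other[m] for m in common]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # constant ratings: correlation is undefined
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ratings, keyed by movieId
target = {1: 4.0, 2: 2.0, 3: 5.0, 4: 3.0}
other = {1: 5.0, 2: 3.0, 3: 4.5, 9: 1.0}

frac = overlap_fraction(target, other)   # 3 of the target's 4 movies are shared
rho = pearson_on_common(target, other)   # computed on movies 1, 2, 3 only
# Stage 1 uses a strict comparison, stage 2 an inclusive one, as in the cells below
keep = frac > 0.5 and rho is not None and rho >= 0.4
```

With these toy values the second user passes both stages.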

In [1]:
import pandas as pd

movie_thresh = 0.5    # threshold proportion of random_id's movies that a user must have rated to be retained
rho_thresh = 0.4   # threshold a user's correlation coefficient must meet to be retained

In [None]:
# runtime: 20 sec
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

df = pd.read_csv(r'rating.csv')

In [2]:
df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40
5,1,112,3.5,2004-09-10 03:09:00
6,1,151,4.0,2004-09-10 03:08:54
7,1,223,4.0,2005-04-02 23:46:13
8,1,253,4.0,2005-04-02 23:35:40
9,1,260,4.0,2005-04-02 23:33:46


In [3]:
len(df)
# Sanity check: total number of ratings before any thresholds applied

20000263

In [4]:
column = df["userId"]
max_value = column.max() 
max_value
# Output: total number of users in data frame (highest ID number)

138493

In [5]:
# runtime: 3 sec
# randomly choose a user for whom to give recommendations
# (a rating row is sampled, so users with more ratings are more likely to be picked)
random_id = int(column.sample(1).iloc[0])
random_id

##################################################
####### Randomly selected user's ID number #######
##################################################

2996

In [6]:
# Pulling off the records of just the target user
user_df = df[df["userId"] == random_id]
user_df.head(10)
len(user_df)

#################################################
##### Number of movies rated by random_user #####
#################################################

52

In [7]:
# Putting the movies watched by random_user into a list
movies_watched = user_df["movieId"].tolist()

# Check: format, sanity check
print(movies_watched[0:10])


[24, 31, 44, 372, 410, 588, 719, 724, 762, 849]


In [8]:
# Runtime: 2 seconds
movies_watched_df = df[df["movieId"].isin(movies_watched)]


# Output: all rows referring to a movie random_id rated
movies_watched_df.head(20)

Unnamed: 0,userId,movieId,rating,timestamp
32,1,1200,4.0,2005-04-02 23:29:20
35,1,1214,4.0,2004-09-10 03:12:57
79,1,2288,4.0,2004-09-10 03:14:37
147,1,6502,3.5,2004-09-10 03:14:08
192,2,1214,5.0,2000-11-21 15:36:54
214,2,2858,3.0,2000-11-21 15:30:59
222,2,3534,3.0,2000-11-21 15:34:49
237,3,24,3.0,1999-12-14 12:54:08
290,3,1200,4.0,1999-12-11 13:20:44
295,3,1214,5.0,1999-12-11 13:27:36


In [9]:
user_movie_count = movies_watched_df.groupby(["userId"]).movieId.count()
user_movie_count.head(10)
# Output: number of movies a user has rated that random_user has also rated
# Check: Most users will have rated at least one of random_user's rated movies

userId
1      4
2      3
3      7
5      1
6      2
7      4
8      2
11    19
13     2
14     1
Name: movieId, dtype: int64

In [10]:
############################################
####### Overlap threshold applied here #####
############################################

# We will pull off only users who have rated more than this proportion of the movies the target user has rated.
m_count = movie_thresh*len(movies_watched)
m_count

# Output: movie count threshold for retention of a user

26.0

In [11]:
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]

# Choosing 50% instead of 60% makes a huge difference in the number of records that finally get through.

# With movie_thresh = 0.5: keep user ids who rated more than 50 percent of the target user's movies (strict >)
users_same_movies = user_movie_count[user_movie_count["movie_count"] > m_count].sort_values("movie_count", ascending=False)
users_same_movies.nunique()

# Output: total number of users that rated more than 50% of movies rated by random_user

userId         528
movie_count     23
dtype: int64
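
As a compact illustration of this stage, the overlap count and the threshold can be applied in one pass. This is a sketch on hypothetical toy data (`toy`, `target_id`, `min_count`, and `kept` are illustrative names, not part of the notebook). The comparison is strict (`>`), matching the cell above, so a user who has rated exactly half of the target's movies is dropped.

```python
import pandas as pd

# Hypothetical ratings table with the same columns as df
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 1, 2, 2, 3, 3, 3, 4],
    "movieId": [10, 20, 30, 40, 10, 20, 10, 20, 30, 50],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5, 2.0, 1.0, 4.0, 5.0],
})
target_id = 1
movie_thresh = 0.5

# Movies the target user has rated
target_movies = toy.loc[toy["userId"] == target_id, "movieId"]

# For every user, count how many of the target's movies they rated
overlap = (
    toy[toy["movieId"].isin(target_movies)]
    .groupby("userId")["movieId"]
    .count()
)

min_count = movie_thresh * len(target_movies)      # 2.0 for this toy data
kept = overlap[overlap > min_count].index.tolist()  # strict comparison
```

Here user 2 shares exactly two of four movies and is excluded, while user 4 shares none and never appears in `overlap`.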

In [12]:
# Sanity check: movie_count should be no higher than len(movies_watched) above
users_same_movies.head(10)

Unnamed: 0,userId,movie_count
2448,2996,52
60773,74142,49
6872,8405,47
28296,34576,46
98884,120575,46
10714,13064,46
96926,118205,45
9933,12131,45
42822,52260,44
44645,54465,44


In [13]:
similar_users = users_same_movies["userId"].tolist()
similar_users_df = df[df["userId"].isin(similar_users)]

# Sanity check: ALL ratings of users heavily overlapping random_id's movies
similar_users_df.head(20)

Unnamed: 0,userId,movieId,rating,timestamp
13174,116,1,3.0,2005-11-23 02:06:57
13175,116,2,2.0,2005-11-23 06:41:08
13176,116,3,2.0,2005-11-23 06:40:58
13177,116,6,1.5,2005-11-23 16:03:02
13178,116,8,1.0,2005-11-24 00:22:10
13179,116,9,1.5,2005-11-23 20:29:11
13180,116,10,2.0,2005-11-23 16:00:40
13181,116,11,2.0,2005-11-23 16:03:35
13182,116,12,0.5,2005-11-23 23:44:19
13183,116,15,0.5,2005-11-24 03:58:08


In [14]:
# Number of ratings by users that have strong movie overlap with random_id
# Compare to len(df) above (about 5% in this run)
len(similar_users_df)

987282

In [15]:
# Create userId vs movieId pivot table so that
# corrwith() can find pairwise correlation between columns.

# Checking the starting format 
user_df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
440919,2996,24,3.5,2009-10-11 08:49:18
440920,2996,31,1.5,2009-10-11 08:48:39
440921,2996,44,3.0,2009-10-11 08:47:57
440922,2996,372,3.0,2009-10-11 08:49:32
440923,2996,410,2.5,2009-10-11 09:25:10
440924,2996,588,2.5,2009-10-11 09:34:32
440925,2996,719,3.0,2009-10-11 08:50:18
440926,2996,724,2.0,2009-10-11 08:49:26
440927,2996,762,2.5,2009-10-11 08:48:21
440928,2996,849,3.0,2009-10-11 08:50:49


In [16]:
# Sanity check: movies_watched_df should include all of random_id's records
movies_watched_df[movies_watched_df["userId"]==random_id]

Unnamed: 0,userId,movieId,rating,timestamp
440919,2996,24,3.5,2009-10-11 08:49:18
440920,2996,31,1.5,2009-10-11 08:48:39
440921,2996,44,3.0,2009-10-11 08:47:57
440922,2996,372,3.0,2009-10-11 08:49:32
440923,2996,410,2.5,2009-10-11 09:25:10
440924,2996,588,2.5,2009-10-11 09:34:32
440925,2996,719,3.0,2009-10-11 08:50:18
440926,2996,724,2.0,2009-10-11 08:49:26
440927,2996,762,2.5,2009-10-11 08:48:21
440928,2996,849,3.0,2009-10-11 08:50:49


In [17]:
# Create a movieId vs. userId pivot table
# Running time: immediate
movies_watched_df2 = movies_watched_df[movies_watched_df['userId'].isin(similar_users)]
movies_watched_df3 = movies_watched_df2.drop(['timestamp'], axis=1)
movies_watched_df_pivot = movies_watched_df3.pivot(index='userId', columns='movieId')

In [18]:
# Check: No more than about 40% of ratings are NaN
movies_watched_df_pivot.head(10)

userId \ movieId,24,31,44,372,410,588,719,724,762,849,1064,1200,1214,1274,1285,1320,1371,1373,1375,1592,1608,1805,1887,2152,2278,2288,2427,2505,2572,2600,2826,2858,3250,3534,4238,4643,4973,5902,5954,6383,6502,6953,7173,7444,27075,31424,34579,48872,53953,54997,58293,69757
116,1.0,,1.0,,2.0,3.0,1.0,1.0,1.0,2.5,1.5,,,,,1.5,1.0,3.0,2.5,,2.0,3.5,2.5,0.5,3.0,,1.0,3.5,2.0,,,4.5,,,2.0,2.0,,,1.5,,4.0,,,,,0.5,,,,,,
156,4.0,,,,3.0,,2.0,3.0,3.0,4.0,,5.0,5.0,,3.0,4.0,3.0,3.0,4.0,2.0,4.0,4.0,,,5.0,4.0,3.0,4.0,,5.0,5.0,5.0,3.0,5.0,4.0,4.0,,,,,,,,,,,,,,,,
768,,,2.0,,,3.0,,3.0,3.0,2.0,,5.0,5.0,3.0,,4.0,4.0,4.0,4.0,,3.0,4.0,,,4.0,4.0,3.0,3.0,,,4.0,3.5,,,3.0,2.5,3.5,,,2.5,4.0,,3.0,,,,,,3.5,4.0,2.5,
775,2.5,2.0,3.0,3.5,3.5,3.0,2.0,2.5,1.0,3.5,,5.0,5.0,4.5,4.0,,3.0,2.0,3.0,,3.5,,,,4.5,3.5,4.0,2.5,,4.0,,3.5,,2.0,1.5,2.0,5.0,5.0,,,3.5,,,,,,,,,,,
903,2.0,,2.0,3.0,3.0,4.0,2.0,2.0,1.0,2.0,2.0,5.0,4.0,4.0,4.0,2.0,1.0,1.0,2.0,1.0,3.0,2.0,1.0,1.0,3.0,5.0,2.0,,3.0,,3.0,5.0,4.0,2.0,,1.0,,,,,,,,,,,,,,,,
982,2.5,3.0,1.5,3.5,3.5,3.0,2.5,2.5,2.0,3.0,,4.0,4.0,,3.5,3.5,3.0,2.5,3.0,,3.0,3.0,,,3.5,4.5,4.5,3.5,,3.5,,4.0,3.5,,,,4.0,3.5,4.0,,3.5,3.5,,,,1.0,,,,3.0,,
1507,3.5,,1.5,3.5,0.5,5.0,0.5,3.0,1.0,3.0,,4.5,4.5,,,2.0,2.0,3.0,2.5,,3.0,,,,4.5,3.0,2.5,1.0,3.0,3.5,4.5,4.5,,,3.0,0.5,3.0,4.0,4.5,,3.0,,,3.0,,0.5,,,,,,
1849,3.0,3.0,,3.5,2.5,4.0,2.0,3.5,,,3.0,3.5,4.0,4.0,4.0,3.0,3.0,3.0,3.0,2.5,4.0,,,,4.0,3.5,,3.0,3.5,3.0,,4.0,,2.5,,,4.0,,,,3.5,,,4.0,,,,,,,,
1972,3.0,3.0,4.0,,,2.0,5.0,3.0,4.0,3.0,,5.0,5.0,2.0,4.0,4.0,3.0,2.0,3.0,3.0,4.0,5.0,,,5.0,4.0,3.0,4.0,,5.0,4.0,5.0,4.0,4.0,5.0,4.0,,,,,,,,,,,,,,,,
2261,3.0,2.5,,,2.0,,1.5,2.5,1.5,3.0,,3.0,5.0,,3.0,2.0,3.0,,2.5,,3.5,3.0,,,4.0,2.0,,2.5,,,1.5,3.5,2.5,,3.0,4.5,3.5,3.0,3.0,,1.0,2.5,2.5,1.0,,,3.5,,,4.0,,


In [19]:
# Transpose so that each user is a column, to prepare to correlate columns
corr_df = movies_watched_df_pivot.transpose()


In [20]:
# No NaN cleaning needed: corrwith uses only the movies both users have rated
corr_df2 = corr_df.corrwith(corr_df[random_id], method='pearson')
# running time: 2 secs

In [21]:
# Unsorted
corr_df2.head(10)

userId
116     0.590162
156     0.524907
768     0.450244
775     0.659202
903     0.583968
982     0.648820
1507    0.474431
1849    0.365668
1972    0.365800
2261    0.176100
dtype: float64
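
To see how `corrwith` treats missing ratings, the following sketch uses a small hypothetical frame in the transposed layout (movies as rows, users as columns). The user IDs, ratings, and the `n_common` helper series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Rows are movies, columns are users (hypothetical IDs); NaN means "not rated"
ratings = pd.DataFrame(
    {
        100: [4.0, 2.0, 5.0, 3.0],        # plays the role of random_id
        200: [5.0, 3.0, np.nan, 4.0],     # 3 movies in common with user 100
        300: [1.0, np.nan, np.nan, 2.0],  # 2 movies in common with user 100
    },
    index=pd.Index([10, 20, 30, 40], name="movieId"),
)

# Each column is correlated with user 100 over co-rated movies only
rho = ratings.corrwith(ratings[100], method="pearson")

# Number of co-rated movies behind each coefficient
n_common = ratings[ratings[100].notna()].count()
```

User 300's coefficient rests on only two co-rated movies, and with two points the Pearson coefficient is always +1 or -1 whenever it is defined. Checking `n_common` alongside `rho` is one way to spot coefficients that look strong only because the overlap is small.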

In [22]:
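# Sanity check: Count of users with sufficient
# overlap matches length of users_same_movies above.
# userId is already unique in the index; drop_duplicates() would compare
# correlation values and could discard users with tied coefficients.
corr_df3 = corr_df2.sort_values(ascending=False)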
corr_df3.shape

(528,)

In [23]:
# Sanity check: only random_id has a correlation
# coefficient of 1; other coefficients approaching
# 0.8 are expected only when random_id has rated
# a small number of movies
corr_df3.head(10)

userId
2996      1.000000
10560     0.818056
42726     0.801348
114898    0.770761
120572    0.770562
84476     0.766129
34587     0.754595
80207     0.753130
48904     0.738503
55430     0.737899
dtype: float64

In [24]:
# Uninteresting formatting
# Convert series to dataframe
corr_df4 = corr_df3.to_frame()


In [25]:
# Check: formatting
corr_df4.rename( columns={0:'corr'}, inplace=True )
corr_df4.head(10)

userId,corr
2996,1.0
10560,0.818056
42726,0.801348
114898,0.770761
120572,0.770562
84476,0.766129
34587,0.754595
80207,0.75313
48904,0.738503
55430,0.737899


In [26]:
##########################################################
##### Correlation coefficient threshold applied here #####
##########################################################
# Pull off users whose ratings are highly correlated with random_id's,
# after dropping random_id itself.
# (In the later reset_index call, drop=True would prevent the userId index from being retained as a column.)
corr_df5 = corr_df4.drop(axis=0, index = random_id)
top_users = corr_df5[(corr_df5["corr"] >= rho_thresh)]
top_users.head(10)

userId,corr
10560,0.818056
42726,0.801348
114898,0.770761
120572,0.770562
84476,0.766129
34587,0.754595
80207,0.75313
48904,0.738503
55430,0.737899
28398,0.735203


In [27]:
len(top_users)

# Number of users left after applying movie-overlap and rating-correlation filters
# Compare to max_value above

296

In [28]:
# Reformatting: move userId from the index into a regular column
top_users = top_users.reset_index(drop=False)

# Check: format to use isin(): sequential index, two labeled columns
top_users.head(10)

Unnamed: 0,userId,corr
0,10560,0.818056
1,42726,0.801348
2,114898,0.770761
3,120572,0.770562
4,84476,0.766129
5,34587,0.754595
6,80207,0.75313
7,48904,0.738503
8,55430,0.737899
9,28398,0.735203


In [29]:
# top_users['userId'].astype(int)
# similar_users_df['userId'].astype(int)
# top_users.dtypes

In [30]:
# Pull off all highly correlated, highly overlapping users from original dataframe with full detail
similar_users_df2 = similar_users_df[similar_users_df['userId'].isin(top_users['userId'])]
similar_users_df2.head(10)


Unnamed: 0,userId,movieId,rating,timestamp
13174,116,1,3.0,2005-11-23 02:06:57
13175,116,2,2.0,2005-11-23 06:41:08
13176,116,3,2.0,2005-11-23 06:40:58
13177,116,6,1.5,2005-11-23 16:03:02
13178,116,8,1.0,2005-11-24 00:22:10
13179,116,9,1.5,2005-11-23 20:29:11
13180,116,10,2.0,2005-11-23 16:00:40
13181,116,11,2.0,2005-11-23 16:03:35
13182,116,12,0.5,2005-11-23 23:44:19
13183,116,15,0.5,2005-11-24 03:58:08


In [31]:
# Merge in the correlation coefficients, for later credibility weighting of ratings
# merge is preferred to concat for adding columns by key
similar_users_df3 = pd.merge(similar_users_df2, top_users, on='userId', how='inner')
similar_users_df3.head(10)

Unnamed: 0,userId,movieId,rating,timestamp,corr
0,116,1,3.0,2005-11-23 02:06:57,0.590162
1,116,2,2.0,2005-11-23 06:41:08,0.590162
2,116,3,2.0,2005-11-23 06:40:58,0.590162
3,116,6,1.5,2005-11-23 16:03:02,0.590162
4,116,8,1.0,2005-11-24 00:22:10,0.590162
5,116,9,1.5,2005-11-23 20:29:11,0.590162
6,116,10,2.0,2005-11-23 16:00:40,0.590162
7,116,11,2.0,2005-11-23 16:03:35,0.590162
8,116,12,0.5,2005-11-23 23:44:19,0.590162
9,116,15,0.5,2005-11-24 03:58:08,0.590162


## __Parameter selection__

The goal is to whittle down the initial 20 million records to about half a million. If the output is far smaller, the thresholds for correlation (`rho_thresh`) and movie overlap (`movie_thresh`) can be lowered; if it is far larger, they can be raised. The program can then be restarted *after* the point at which *user_id* is generated, so that the same user is kept.

Note that if *user_id* has rated few movies (fewer than 100), correlation coefficients can be extreme: either very high (approaching 0.8) or very low (never exceeding 0.25).

While high $\rho$ may be misleading in cases of small movie counts, the *relative* values between users are what would be used for credibility weighting, i.e., the purpose of including correlation coefficients in the output file is to differentiate the credibility of different users' ratings in predicting *user_id*'s taste.
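
One way the `corr` column could be used downstream is as a weight when averaging ratings per movie. The notebook does not specify a weighting formula, so the following is only a sketch under that assumption, using hypothetical toy rows laid out like the merged frame `similar_users_df3`; the names `rows`, `weighted_sum`, `weight_total`, and `predicted` are illustrative.

```python
import pandas as pd

# Hypothetical rows with the same columns as the merged frame (timestamp omitted)
rows = pd.DataFrame({
    "userId":  [11, 11, 22, 22, 33],
    "movieId": [1, 2, 1, 2, 2],
    "rating":  [4.0, 2.0, 5.0, 3.0, 1.0],
    "corr":    [0.8, 0.8, 0.4, 0.4, 0.4],
})

# ASSUMED weighting: correlation-weighted mean rating per movie
weighted_sum = (rows["rating"] * rows["corr"]).groupby(rows["movieId"]).sum()
weight_total = rows.groupby("movieId")["corr"].sum()
predicted = weighted_sum / weight_total
```

Because the weights are normalized by their sum, multiplying every coefficient by the same constant leaves `predicted` unchanged, which is consistent with only the relative values of $\rho$ mattering.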

In [32]:
# Sanity check: sufficient number of ratings left (target is about half a million; 20,000 is too low)
similar_users_df3.shape

(566062, 5)

In [33]:
similar_users_df3.to_csv("rating_corr.csv",header=True, index=False)


#### References:

https://medium.com/codex/hybrid-recommender-system-netflix-prize-dataset-e9f6b4a875aa

https://www.kaggle.com/code/ayseymn/hybrid-recommender-system-netflix/notebook
