### Non-Personalized and Stereotype-Based Recommenders

This notebook is an exercise in non-personalized and stereotype-based recommendation. The level of personalization is one of the analytical dimensions of recommender systems. A non-personalized recommender is a **generic** method: everyone receives the same recommendations. Statistically speaking, it amounts to computing summary statistics.

This method is commonly used for hotel and restaurant ratings, and behind labels such as "best seller", "most popular", or "trending", which surface the best-liked items. It can give good results when personalized recommendations are impossible (under cold start, for instance). It can also provide weak personalization by conditioning on coarse features such as `age` or `zipcode`.

The computation is simple and fast. We can use metrics such as **average rating**, **popularity**, and **proportion of liking**, computed from a **frequency**, an **average**, or more elaborate variants (normalized or damped means). In short, we aggregate preferences to make predictions, then rank items to make recommendations.
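
The damped mean mentioned above is not computed anywhere in this notebook, so here is a minimal sketch of one common formulation. The toy DataFrame, the function name `damped_mean`, and the damping strength `k=5` are assumptions made for illustration only. The idea is to add `k` pseudo-ratings at the global mean to every item, so an item with very few ratings cannot top the ranking by chance.

```python
import numpy as np
import pandas as pd


def damped_mean(ratings: pd.DataFrame, k: float = 5.0) -> pd.Series:
    """Per-item mean shrunk toward the global mean by k pseudo-ratings."""
    sums = ratings.sum(skipna=True)          # per-item sum of observed ratings
    counts = ratings.notnull().sum()         # per-item number of observed ratings
    global_mean = sums.sum() / counts.sum()  # mean over all observed ratings
    return (sums + k * global_mean) / (counts + k)


# Hypothetical toy data: rows are users, columns are items, NaN = not rated.
toy = pd.DataFrame({
    "A": [5.0, np.nan, np.nan, np.nan],  # a single enthusiastic rating
    "B": [4.0, 4.0, 5.0, 4.0],           # consistently good, rated by everyone
    "C": [2.0, 3.0, np.nan, 2.0],
})

plain = toy.mean(skipna=True).sort_values(ascending=False)
damped = damped_mean(toy, k=5.0).sort_values(ascending=False)
print(plain)
print(damped)
```

With the plain mean, item `A` ranks first on the strength of one rating; with damping, item `B` moves ahead because its score rests on more evidence. The value of `k` is a tuning choice, not something the data dictates.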

In [1]:
import numpy as np
import pandas as pd

Import the movie rating data. There are 20 users and 20 movies; each value is a user's rating of a movie, and a null value means the user has not rated that movie.

In [2]:
movie_ratings = pd.read_csv('HW1-data.csv')
movie_ratings.shape

(20, 22)

In [3]:
movie_ratings.head(3)

Unnamed: 0,User,"Gender (1 =F, 0=M)",260: Star Wars: Episode IV - A New Hope (1977),1210: Star Wars: Episode VI - Return of the Jedi (1983),356: Forrest Gump (1994),"318: Shawshank Redemption, The (1994)","593: Silence of the Lambs, The (1991)",3578: Gladiator (2000),1: Toy Story (1995),2028: Saving Private Ryan (1998),...,2396: Shakespeare in Love (1998),2916: Total Recall (1990),780: Independence Day (ID4) (1996),541: Blade Runner (1982),1265: Groundhog Day (1993),"2571: Matrix, The (1999)",527: Schindler's List (1993),"2762: Sixth Sense, The (1999)",1198: Raiders of the Lost Ark (1981),34: Babe (1995)
0,755,0,1.0,5.0,2.0,,4.0,4.0,2.0,2.0,...,2.0,,5.0,2.0,,4.0,2.0,5.0,,
1,5277,0,5.0,3.0,,2.0,4.0,2.0,1.0,,...,3.0,2.0,2.0,,2.0,,5.0,1.0,3.0,
2,1577,1,,,,5.0,2.0,,4.0,,...,,1.0,4.0,4.0,1.0,1.0,2.0,3.0,1.0,3.0


In [4]:
movie_start_idx = 2

#### 1. Top movies by mean rating

Compute each movie's average rating over the users who rated it.

In [5]:
# calculate the mean rating for each movie
movie_ratings.iloc[:, movie_start_idx:].mean(skipna=True).sort_values(ascending=False)

318: Shawshank Redemption, The (1994)                      3.600000
260: Star Wars: Episode IV - A New Hope (1977)             3.266667
541: Blade Runner (1982)                                   3.222222
1265: Groundhog Day (1993)                                 3.166667
593: Silence of the Lambs, The (1991)                      3.062500
296: Pulp Fiction (1994)                                   3.000000
1210: Star Wars: Episode VI - Return of the Jedi (1983)    3.000000
2028: Saving Private Ryan (1998)                           3.000000
34: Babe (1995)                                            3.000000
527: Schindler's List (1993)                               3.000000
3578: Gladiator (2000)                                     2.916667
2396: Shakespeare in Love (1998)                           2.909091
1198: Raiders of the Lost Ark (1981)                       2.909091
2571: Matrix, The (1999)                                   2.833333
2762: Sixth Sense, The (1999)                   

#### 2. Top movies by count

Count the number of ratings each movie received.

In [6]:
# count the number of ratings for each movie
movie_ratings.iloc[:, movie_start_idx:].notnull().sum().sort_values(ascending=False)

1: Toy Story (1995)                                        17
593: Silence of the Lambs, The (1991)                      16
260: Star Wars: Episode IV - A New Hope (1977)             15
1210: Star Wars: Episode VI - Return of the Jedi (1983)    14
780: Independence Day (ID4) (1996)                         13
2762: Sixth Sense, The (1999)                              12
527: Schindler's List (1993)                               12
2571: Matrix, The (1999)                                   12
1265: Groundhog Day (1993)                                 12
2916: Total Recall (1990)                                  12
3578: Gladiator (2000)                                     12
2028: Saving Private Ryan (1998)                           11
1259: Stand by Me (1986)                                   11
296: Pulp Fiction (1994)                                   11
1198: Raiders of the Lost Ark (1981)                       11
2396: Shakespeare in Love (1998)                           11
318: Sha

#### 3. Top movies by percent liking

Use a rating of 4 as the threshold: a rating of 4 or higher counts as "liked". The result is the share of each movie's raters who liked it.

In [7]:
((movie_ratings.iloc[:, movie_start_idx:] >= 4).sum() / \
movie_ratings.iloc[:, movie_start_idx:].notnull().sum()) \
.sort_values(ascending=False)

318: Shawshank Redemption, The (1994)                      0.700000
260: Star Wars: Episode IV - A New Hope (1977)             0.533333
3578: Gladiator (2000)                                     0.500000
541: Blade Runner (1982)                                   0.444444
593: Silence of the Lambs, The (1991)                      0.437500
2571: Matrix, The (1999)                                   0.416667
1265: Groundhog Day (1993)                                 0.416667
34: Babe (1995)                                            0.400000
296: Pulp Fiction (1994)                                   0.363636
2028: Saving Private Ryan (1998)                           0.363636
1259: Stand by Me (1986)                                   0.363636
1210: Star Wars: Episode VI - Return of the Jedi (1983)    0.357143
1: Toy Story (1995)                                        0.352941
527: Schindler's List (1993)                               0.333333
2762: Sixth Sense, The (1999)                   
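
The percent-liking computation can be wrapped in a small helper, sketched below on made-up data. The function name `fraction_liking` and the toy ratings are assumptions, not part of the original notebook. The point to notice is that a comparison against `NaN` evaluates to `False`, so unrated movies never count as likes, while the denominator counts only the ratings that actually exist.

```python
import numpy as np
import pandas as pd


def fraction_liking(ratings: pd.DataFrame, threshold: float = 4.0) -> pd.Series:
    """Share of each item's raters whose rating is at or above the threshold."""
    liked = (ratings >= threshold).sum()  # NaN >= threshold is False
    rated = ratings.notnull().sum()       # only observed ratings count
    return liked / rated


# Hypothetical toy data: rows are users, columns are items, NaN = not rated.
toy = pd.DataFrame({
    "A": [5.0, 4.0, np.nan, 1.0],
    "B": [3.0, np.nan, np.nan, 4.0],
})

share = fraction_liking(toy)
print(share.sort_values(ascending=False))
```

Dividing by the number of users instead of the number of raters would penalize rarely rated items, which is a different metric (closer to popularity).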

#### 4. Association with Toy Story

Find associations with the movie "Toy Story": which movies are most often rated together with it? We use `count(x ^ y) / count(x)`, where `^` means "and", `x` is "Toy Story", and `y` is any other movie. In other words, for each movie, calculate the share of "Toy Story" raters who also rated that movie.

In [8]:
# column index of #1: Toy Story
print(np.where(pd.Series(movie_ratings.columns.tolist()).str.contains('Toy Story')))

(array([8], dtype=int64),)


In [9]:
# users who rated #1: Toy Story (column index 8, found in the previous cell)
toyStory_users = movie_ratings.iloc[:, 8].notnull()
# share of Toy Story raters who also rated each movie;
# [1:] drops the top entry, which is Toy Story itself (1.0)
(movie_ratings.loc[toyStory_users].iloc[:, movie_start_idx:].notnull().sum() \
 / np.sum(toyStory_users)).sort_values(ascending=False)[1:]

260: Star Wars: Episode IV - A New Hope (1977)             0.823529
1210: Star Wars: Episode VI - Return of the Jedi (1983)    0.764706
593: Silence of the Lambs, The (1991)                      0.764706
780: Independence Day (ID4) (1996)                         0.764706
1265: Groundhog Day (1993)                                 0.647059
2916: Total Recall (1990)                                  0.647059
296: Pulp Fiction (1994)                                   0.588235
2762: Sixth Sense, The (1999)                              0.588235
527: Schindler's List (1993)                               0.588235
3578: Gladiator (2000)                                     0.529412
1259: Stand by Me (1986)                                   0.529412
1198: Raiders of the Lost Ark (1981)                       0.529412
2571: Matrix, The (1999)                                   0.529412
2028: Saving Private Ryan (1998)                           0.470588
2396: Shakespeare in Love (1998)                
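
As a label-based variant of the association computation, the sketch below looks up the target movie by column name instead of by position. The function name `association` and the toy data are assumptions made for illustration. It computes `count(x and y) / count(x)` and removes the target by label rather than by slicing off the first sorted entry.

```python
import numpy as np
import pandas as pd


def association(ratings: pd.DataFrame, target: str) -> pd.Series:
    """count(x and y) / count(x), with x = target and y = every other item."""
    rated = ratings.notnull()
    target_raters = rated[target]          # boolean mask over users
    both = rated.loc[target_raters].sum()  # count(x and y) for each item y
    result = both / target_raters.sum()    # divide by count(x)
    return result.drop(target).sort_values(ascending=False)


# Hypothetical toy data: rows are users, columns are items, NaN = not rated.
toy = pd.DataFrame({
    "X": [5.0, 3.0, 4.0, np.nan],
    "Y": [4.0, np.nan, 2.0, 1.0],
    "Z": [np.nan, np.nan, 5.0, 5.0],
})

assoc = association(toy, "X")
print(assoc)
```

Dropping the target by label avoids relying on it sorting first: if another movie were rated by every target rater, it would tie at 1.0 and a positional slice could remove the wrong entry.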

#### 5. Correlation with Toy Story

In [10]:
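# pairwise Pearson correlation of ratings with Toy Story;
# position 6 among the movie columns = column index 8 - movie_start_idx,
# and [1:] drops the top entry, which is Toy Story itself (1.0)
movie_ratings.iloc[:, movie_start_idx:].corr().iloc[:, 6].sort_values(ascending=False)[1:]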

318: Shawshank Redemption, The (1994)                      0.888523
34: Babe (1995)                                            0.811107
296: Pulp Fiction (1994)                                   0.709842
2028: Saving Private Ryan (1998)                           0.596849
356: Forrest Gump (1994)                                   0.522913
541: Blade Runner (1982)                                   0.329634
3578: Gladiator (2000)                                     0.253396
2916: Total Recall (1990)                                  0.163796
2396: Shakespeare in Love (1998)                           0.101768
1265: Groundhog Day (1993)                                -0.062858
780: Independence Day (ID4) (1996)                        -0.069923
260: Star Wars: Episode IV - A New Hope (1977)            -0.119005
527: Schindler's List (1993)                              -0.220315
2762: Sixth Sense, The (1999)                             -0.245770
1198: Raiders of the Lost Ark (1981)            
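
With only 20 users, some movie pairs share very few raters, and a correlation computed from two or three co-ratings can look extreme without meaning much. The sketch below, on made-up data, shows how the `min_periods` argument of `DataFrame.corr` can suppress such pairs. The toy ratings and the cut-off of 3 are assumptions; the notebook's own cell uses the default settings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are items, NaN = not rated.
toy = pd.DataFrame({
    "X": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Y": [2.0, 4.0, 6.0, 8.0, 10.0],          # moves exactly with X
    "Z": [5.0, np.nan, np.nan, np.nan, 1.0],  # only two co-ratings with X
})
target = "X"

# Default: any pair with at least one shared rater gets a value.
loose = toy.corr()[target].drop(target)

# Require at least 3 shared raters; pairs below that become NaN.
strict = toy.corr(min_periods=3)[target].drop(target)

# Number of users who rated both the target and each other item.
overlap = toy.notnull().loc[toy[target].notnull()].sum().drop(target)

print(pd.DataFrame({"loose": loose, "strict": strict, "overlap": overlap}))
```

Here `Z` gets a perfect negative correlation from just two points under the default, and `NaN` once a minimum overlap is required. Reporting the overlap next to each correlation is a cheap way to judge how much weight it deserves.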

#### 6. Male-Female differences in average rating

In [11]:
# boolean masks by gender (column 1: 1 = F, 0 = M)
male_idx = movie_ratings.iloc[:, 1] == 0
female_idx = movie_ratings.iloc[:, 1] == 1

In [12]:
# male ratings
male_ratings = movie_ratings.loc[male_idx].iloc[:, movie_start_idx:].mean(skipna=True)
male_ratings

260: Star Wars: Episode IV - A New Hope (1977)             3.125000
1210: Star Wars: Episode VI - Return of the Jedi (1983)    3.000000
356: Forrest Gump (1994)                                   2.250000
318: Shawshank Redemption, The (1994)                      3.400000
593: Silence of the Lambs, The (1991)                      3.333333
3578: Gladiator (2000)                                     2.833333
1: Toy Story (1995)                                        2.300000
2028: Saving Private Ryan (1998)                           3.142857
296: Pulp Fiction (1994)                                   2.625000
1259: Stand by Me (1986)                                   3.000000
2396: Shakespeare in Love (1998)                           2.142857
2916: Total Recall (1990)                                  2.200000
780: Independence Day (ID4) (1996)                         2.857143
541: Blade Runner (1982)                                   3.000000
1265: Groundhog Day (1993)                      

In [13]:
# female ratings
female_ratings = movie_ratings.loc[female_idx].iloc[:, movie_start_idx:].mean(skipna=True)
female_ratings

260: Star Wars: Episode IV - A New Hope (1977)             3.428571
1210: Star Wars: Episode VI - Return of the Jedi (1983)    3.000000
356: Forrest Gump (1994)                                   3.000000
318: Shawshank Redemption, The (1994)                      3.800000
593: Silence of the Lambs, The (1991)                      2.714286
3578: Gladiator (2000)                                     3.000000
1: Toy Story (1995)                                        3.571429
2028: Saving Private Ryan (1998)                           2.750000
296: Pulp Fiction (1994)                                   4.000000
1259: Stand by Me (1986)                                   2.428571
2396: Shakespeare in Love (1998)                           4.250000
2916: Total Recall (1990)                                  1.714286
780: Independence Day (ID4) (1996)                         2.666667
541: Blade Runner (1982)                                   3.500000
1265: Groundhog Day (1993)                      

In [14]:
(male_ratings - female_ratings).sort_values(ascending=False)

1198: Raiders of the Lost Ark (1981)                       1.666667
527: Schindler's List (1993)                               1.000000
2571: Matrix, The (1999)                                   0.742857
1265: Groundhog Day (1993)                                 0.666667
593: Silence of the Lambs, The (1991)                      0.619048
1259: Stand by Me (1986)                                   0.571429
2916: Total Recall (1990)                                  0.485714
2028: Saving Private Ryan (1998)                           0.392857
780: Independence Day (ID4) (1996)                         0.190476
1210: Star Wars: Episode VI - Return of the Jedi (1983)    0.000000
3578: Gladiator (2000)                                    -0.166667
260: Star Wars: Episode IV - A New Hope (1977)            -0.303571
2762: Sixth Sense, The (1999)                             -0.333333
318: Shawshank Redemption, The (1994)                     -0.400000
541: Blade Runner (1982)                        

In [15]:
# male overall ratings
male_overall_ratings = \
movie_ratings.loc[male_idx].iloc[:, movie_start_idx:].sum().sum() / \
movie_ratings.loc[male_idx].iloc[:, movie_start_idx:].notnull().sum().sum()

male_overall_ratings

2.905511811023622

In [16]:
# female overall ratings
female_overall_ratings = \
movie_ratings.loc[female_idx].iloc[:, movie_start_idx:].sum().sum() / \
movie_ratings.loc[female_idx].iloc[:, movie_start_idx:].notnull().sum().sum()

female_overall_ratings

2.9473684210526314

In [17]:
# note: female minus male here, the reverse of the per-movie difference above
female_overall_ratings - male_overall_ratings

0.0418566100290092
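
The per-gender means can also be obtained with `groupby`, which avoids maintaining separate boolean masks. The sketch below uses made-up data; the column names `Gender`, `A`, and `B` are assumptions. It also shows why the overall mean is computed from total sum over total count: movies have different numbers of ratings, so averaging the per-movie means would generally give a different number.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Gender 1 = F, 0 = M; NaN = not rated.
toy = pd.DataFrame({
    "Gender": [0, 0, 1, 1],
    "A": [4.0, 2.0, 5.0, np.nan],
    "B": [np.nan, 3.0, 1.0, 2.0],
})
items = ["A", "B"]
grouped = toy.groupby("Gender")[items]

# Per-item mean rating within each gender (NaN values are skipped).
by_gender = grouped.mean()
diff = by_gender.loc[0] - by_gender.loc[1]  # male minus female

# Overall mean per gender: total of all ratings / number of ratings.
overall = grouped.sum().sum(axis=1) / grouped.count().sum(axis=1)

print(by_gender)
print(diff)
print(overall)
```

For the female group in this toy data, the overall mean is 8/3, while the mean of the two per-item means (5.0 and 1.5) would be 3.25, which illustrates the difference between the two aggregations.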

#### 7. Male-Female differences in liking

In [18]:
# male percentages
male_percentages = \
(movie_ratings.loc[male_idx].iloc[:, movie_start_idx:] >= 4).sum() / \
movie_ratings.loc[male_idx].iloc[:, movie_start_idx:].notnull().sum()

male_percentages

260: Star Wars: Episode IV - A New Hope (1977)             0.500000
1210: Star Wars: Episode VI - Return of the Jedi (1983)    0.250000
356: Forrest Gump (1994)                                   0.000000
318: Shawshank Redemption, The (1994)                      0.600000
593: Silence of the Lambs, The (1991)                      0.555556
3578: Gladiator (2000)                                     0.500000
1: Toy Story (1995)                                        0.200000
2028: Saving Private Ryan (1998)                           0.285714
296: Pulp Fiction (1994)                                   0.250000
1259: Stand by Me (1986)                                   0.250000
2396: Shakespeare in Love (1998)                           0.000000
2916: Total Recall (1990)                                  0.200000
780: Independence Day (ID4) (1996)                         0.285714
541: Blade Runner (1982)                                   0.200000
1265: Groundhog Day (1993)                      

In [19]:
# female percentages
female_percentages = \
(movie_ratings.loc[female_idx].iloc[:, movie_start_idx:] >= 4).sum() / \
movie_ratings.loc[female_idx].iloc[:, movie_start_idx:].notnull().sum()

female_percentages

260: Star Wars: Episode IV - A New Hope (1977)             0.571429
1210: Star Wars: Episode VI - Return of the Jedi (1983)    0.500000
356: Forrest Gump (1994)                                   0.500000
318: Shawshank Redemption, The (1994)                      0.800000
593: Silence of the Lambs, The (1991)                      0.285714
3578: Gladiator (2000)                                     0.500000
1: Toy Story (1995)                                        0.571429
2028: Saving Private Ryan (1998)                           0.500000
296: Pulp Fiction (1994)                                   0.666667
1259: Stand by Me (1986)                                   0.428571
2396: Shakespeare in Love (1998)                           0.750000
2916: Total Recall (1990)                                  0.000000
780: Independence Day (ID4) (1996)                         0.333333
541: Blade Runner (1982)                                   0.750000
1265: Groundhog Day (1993)                      

In [20]:
(male_percentages - female_percentages).sort_values(ascending=False)

1198: Raiders of the Lost Ark (1981)                       0.500000
2571: Matrix, The (1999)                                   0.371429
527: Schindler's List (1993)                               0.333333
593: Silence of the Lambs, The (1991)                      0.269841
2916: Total Recall (1990)                                  0.200000
1265: Groundhog Day (1993)                                 0.166667
2762: Sixth Sense, The (1999)                              0.000000
3578: Gladiator (2000)                                     0.000000
780: Independence Day (ID4) (1996)                        -0.047619
260: Star Wars: Episode IV - A New Hope (1977)            -0.071429
1259: Stand by Me (1986)                                  -0.178571
318: Shawshank Redemption, The (1994)                     -0.200000
2028: Saving Private Ryan (1998)                          -0.214286
1210: Star Wars: Episode VI - Return of the Jedi (1983)   -0.250000
1: Toy Story (1995)                             

In [21]:
# male overall percentages
male_overall_percentages = \
(movie_ratings.loc[male_idx].iloc[:, movie_start_idx:] >= 4).sum().sum() / \
movie_ratings.loc[male_idx].iloc[:, movie_start_idx:].notnull().sum().sum()

male_overall_percentages

0.33858267716535434

In [22]:
# female overall percentages
female_overall_percentages = \
(movie_ratings.loc[female_idx].iloc[:, movie_start_idx:] >= 4).sum().sum() / \
movie_ratings.loc[female_idx].iloc[:, movie_start_idx:].notnull().sum().sum()

female_overall_percentages

0.42105263157894735

In [23]:
# note: female minus male here, the reverse of the per-movie difference above
female_overall_percentages - male_overall_percentages

0.082469954413593
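
The liking shares by gender can be computed in one pass by grouping the boolean "liked" and "rated" masks on the gender column. This is a sketch on made-up data, with the column names and the threshold of 4 as assumptions that mirror the notebook's convention.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Gender 1 = F, 0 = M; NaN = not rated.
toy = pd.DataFrame({
    "Gender": [0, 0, 1, 1],
    "A": [4.0, 2.0, 5.0, np.nan],
    "B": [np.nan, 3.0, 4.0, 2.0],
})
items = ["A", "B"]

# Count likes and ratings per gender; NaN >= 4 is False, so it is never a like.
liked = (toy[items] >= 4).groupby(toy["Gender"]).sum()
rated = toy[items].notnull().groupby(toy["Gender"]).sum()

share = liked / rated                             # per item, per gender
overall = liked.sum(axis=1) / rated.sum(axis=1)   # all items pooled, per gender

print(share)
print(overall)
```

Keeping the numerator and denominator as separate count tables makes it easy to see how many ratings each share is based on, which matters when a group has only a handful of raters for a movie.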