## Most Popular Items Recommendation
Recommending the most popular items is one of the easiest ways to generate recommendations. This notebook implements a most-popular-items recommender step by step.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df_ratings = pd.read_csv('datasets/user_ratings.csv')
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [3]:
# Check the non-null count and data type of each column
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
 4   title      100836 non-null  object 
 5   genres     100836 non-null  object 
dtypes: float64(1), int64(3), object(2)
memory usage: 4.6+ MB


In [4]:
# NaN value check
df_ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
title        0
genres       0
dtype: int64

In [5]:
# Unique value counts of each column
df_ratings.nunique()

userId         610
movieId       9724
rating          10
timestamp    85043
title         9719
genres         951
dtype: int64

### 2. Basic Analysis

In [6]:
# Average Ratings
avg_ratings = df_ratings[['title','rating']].groupby(['title']).mean()
avg_ratings.sort_values(by='rating',ascending=False).head()

title,rating
Gena the Crocodile (1969),5.0
True Stories (1986),5.0
Cosmic Scrat-tastrophe (2015),5.0
Love and Pigeons (1985),5.0
Red Sorghum (Hong gao liang) (1987),5.0


In [7]:
df_ratings[df_ratings.title == 'Gena the Crocodile (1969)']

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
88684,105,175293,5.0,1526208082,Gena the Crocodile (1969),Animation|Children


In [8]:
df_ratings[df_ratings.title == 'True Stories (1986)']

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
95219,260,7815,5.0,1109410367,True Stories (1986),Comedy|Musical


Although we now have the average rating of each movie, this ranking is misleading: movies with a high rating but very few votes will always occupy the top spots. Both titles inspected above, for example, were rated only once. To avoid this, a minimum vote count threshold is applied.
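To make the problem concrete, here is a minimal sketch on a hypothetical toy frame (the titles `A` and `B` and their ratings are made up for illustration and are not taken from the dataset). It shows the mean next to the number of votes behind it, so a single 5.0 vote outranking a well-rated, frequently rated item becomes visible.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the dataset above
toy = pd.DataFrame({
    "title": ["A", "A", "A", "A", "B"],
    "rating": [4.0, 4.5, 5.0, 4.5, 5.0],
})

# Show the mean rating next to the number of votes it is based on
summary = toy.groupby("title")["rating"].agg(["mean", "count"])
summary = summary.sort_values(by="mean", ascending=False)
print(summary)
```

Sorting by `mean` alone puts `B` first even though it rests on a single vote, which is the situation the vote count threshold is meant to filter out.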

In [9]:
rating_frequencies = df_ratings['title'].value_counts()
rating_frequencies

Forrest Gump (1994)                 329
Shawshank Redemption, The (1994)    317
Pulp Fiction (1994)                 307
Silence of the Lambs, The (1991)    279
Matrix, The (1999)                  278
                                   ... 
Sex, Drugs & Taxation (2013)          1
Extraordinary Tales (2015)            1
Tomorrow (2015)                       1
Embrace of the Serpent (2016)         1
31 (2016)                             1
Name: title, Length: 9719, dtype: int64

In [10]:
# Use the 0.95 quantile (95th percentile) of the rating frequencies as the minimum vote count
min_vote_count = np.quantile(rating_frequencies.values, 0.95)

In [11]:
popular_movies = rating_frequencies[rating_frequencies > min_vote_count].index
popular_movies

Index(['Forrest Gump (1994)', 'Shawshank Redemption, The (1994)',
       'Pulp Fiction (1994)', 'Silence of the Lambs, The (1991)',
       'Matrix, The (1999)', 'Star Wars: Episode IV - A New Hope (1977)',
       'Jurassic Park (1993)', 'Braveheart (1995)',
       'Terminator 2: Judgment Day (1991)', 'Schindler's List (1993)',
       ...
       'Thomas Crown Affair, The (1999)',
       'Seven Samurai (Shichinin no samurai) (1954)',
       'Wallace & Gromit: A Close Shave (1995)', 'Nine Months (1995)',
       'Big Daddy (1999)', 'The Martian (2015)',
       'Princess Mononoke (Mononoke-hime) (1997)', 'Jackie Brown (1997)',
       'Garden State (2004)', 'Hotel Rwanda (2004)'],
      dtype='object', length=473)

In [12]:
# Keep only the ratings that belong to popular movies
df_popular_movies = df_ratings[df_ratings['title'].isin(popular_movies)]
print(f"{df_popular_movies['title'].nunique()} movies will be considered for recommendation.")

473 movies will be considered for recommendation.


In [13]:
# Most liked popular movies 
popular_movie_avg_ratings = df_popular_movies[['title', 'rating']].groupby('title').mean()
popular_movie_avg_ratings.sort_values(by='rating', ascending=False)

title,rating
"Shawshank Redemption, The (1994)",4.429022
"Godfather, The (1972)",4.289062
Fight Club (1999),4.272936
Cool Hand Luke (1967),4.271930
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964),4.268041
...,...
Johnny Mnemonic (1995),2.679245
Judge Dredd (1995),2.669355
City Slickers II: The Legend of Curly's Gold (1994),2.645455
Coneheads (1993),2.420635


### Conclusion

In this notebook, a simple recommendation system was built by filtering out unpopular movies (those whose vote count does not exceed the 95th percentile) and ranking the remaining movies by average rating.
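The steps above can be collected into one helper. The following is only a sketch under stated assumptions: the function name `most_popular_items` and its parameters (`item_col`, `rating_col`, `quantile`, `top_n`) are hypothetical, and it counts votes with a `groupby` instead of `value_counts`, which gives the same counts as long as the rating column has no missing values. The small `demo` frame is made up so the block runs on its own.

```python
import numpy as np
import pandas as pd


def most_popular_items(ratings, item_col="title", rating_col="rating",
                       quantile=0.95, top_n=10):
    """Rank frequently rated items by their mean rating (hypothetical helper)."""
    # Mean rating and vote count per item
    stats = ratings.groupby(item_col)[rating_col].agg(["mean", "count"])
    # Minimum vote count taken from the chosen quantile of the vote counts
    min_votes = np.quantile(stats["count"].to_numpy(), quantile)
    # Keep items rated strictly more often than the threshold
    popular = stats[stats["count"] > min_votes]
    return popular.sort_values(by="mean", ascending=False).head(top_n)


# Made-up demo data: A and C have four votes each, B and D one vote each
demo = pd.DataFrame({
    "title": ["A"] * 4 + ["B"] + ["C"] * 4 + ["D"],
    "rating": [4.0, 5.0, 4.0, 5.0, 5.0, 3.0, 3.0, 4.0, 2.0, 5.0],
})
result = most_popular_items(demo, quantile=0.5, top_n=5)
print(result)
```

On the notebook's data the equivalent call would be `most_popular_items(df_ratings)`. Passing `item_col="movieId"` may be worth considering, since the dataset has slightly more unique `movieId` values (9724) than unique titles (9719).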