# Movie Recommendations

This notebook uses the [MovieLens dataset](https://grouplens.org/datasets/movielens/latest/),
as well as content information linked through the respective movie pages on [TMDB](https://www.themoviedb.org/).

* I have included the CSV files in our class repo on GitHub
* License information is in the file https://raw.githubusercontent.com/benjum/UCLA-24W-DH150/main/Data/movielens-data/README.txt

In [None]:
import pandas as pd

In [None]:
ratings = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLA-24W-DH150/main/Data/movielens-data/ratings.csv')
movies = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLA-24W-DH150/main/Data/movielens-data/movies.csv')

In [None]:
ratings

In [None]:
movies

The ratings data covers 610 users and 9724 movies:

In [None]:
len(ratings['userId'].unique())

In [None]:
len(ratings['movieId'].unique())

In [None]:
ratings['rating'].unique()

# Idea 1: Recommend the most popular movie

In [None]:
movies['R'] = 0.0
movies['v'] = 0
movies['WR'] = 0.0

In [None]:
# R = mean rating for each movie (NaN if the movie has no ratings)
for i in movies.index:
    movies.loc[i,'R'] = ratings.loc[ratings['movieId'] == movies.loc[i,'movieId'], 'rating'].mean()

In [None]:
movies.sort_values(by='R',ascending=False)[:10]

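Sorting by the raw mean tends to favor movies with only a handful of ratings, since a single 5.0 rating yields a perfect average. IMDb describes a weighted rating that accounts for the number of votes. According to the [IMDb ratings FAQ](https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV#):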

<i>
The following formula is used to calculate the Top Rated 250 titles. This formula provides a true 'Bayesian estimate', which takes into account the number of votes each title has received, minimum votes required to be on the list, and the mean vote for all titles:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

Where:

R = average for the movie (mean) = (rating)

v = number of votes for the movie = (votes)

m = minimum votes required to be listed in the Top Rated 250 list (currently 25,000)

C = the mean vote across the whole report

 

Please be aware the Top 250 Movie list only includes feature films: shorts, TV movies, miniseries and documentaries are not included in the Top 250 Movies Chart. The Top 250 TV Shows Chart includes TV Series, but not TV episodes or Movies.
</i>
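
As a minimal sketch of the formula quoted above, the function below computes WR for a single title. The function name `weighted_rating` and the sample numbers are illustrative assumptions; they do not come from IMDb's documentation or from the MovieLens data.

```python
def weighted_rating(R, v, m, C):
    """Weighted rating WR = (v / (v + m)) * R + (m / (v + m)) * C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical numbers chosen only for illustration
few_votes = weighted_rating(R=5.0, v=1, m=50, C=3.5)     # perfect mean, one vote
many_votes = weighted_rating(R=4.5, v=500, m=50, C=3.5)  # lower mean, many votes
print(few_votes, many_votes)
```

With few votes, WR stays close to the global mean C; with many votes, it approaches the movie's own mean R. In this sketch the well-rated movie with many votes ends up ranked above the movie with a single perfect rating.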

In [None]:
m_id = 1
# .iloc[0] takes the first match by position, regardless of its index label
movies.loc[movies['movieId']==m_id, 'title'].iloc[0]

In [None]:
ratings.loc[ratings['movieId'] == m_id, 'rating'].count()

In [None]:
ratings.loc[ratings['movieId'] == m_id, 'rating'].mean()

In [None]:
ratings['rating'].mean()

In [None]:
m_id = 1

R = ratings.loc[ratings['movieId'] == m_id, 'rating'].mean()
v = ratings.loc[ratings['movieId'] == m_id, 'rating'].count()
m = 1
C = ratings['rating'].mean()

WR = (v*R + m*C) / (v+m)
print(WR)

In [None]:
# v = number of ratings for each movie (0 if the movie has no ratings)
for i in movies.index:
    movies.loc[i,'v'] = ratings.loc[ratings['movieId'] == movies.loc[i,'movieId'], 'rating'].count()

In [None]:
movies.sort_values(by='R',ascending=False)[:10]

In [None]:
C = ratings['rating'].mean()
print(C)

In [None]:
# WR = (v*R + m*C) / (v+m), computed row by row with m = 1
m = 1
for i in movies.index:
    movies.loc[i,'WR'] = (movies.loc[i,'v'] * movies.loc[i,'R'] + m*C) / (movies.loc[i,'v'] + m)

In [None]:
movies.sort_values(by='WR',ascending=False)[:10]

In [None]:
# WR = (v*R + m*C) / (v+m), vectorized over whole columns, with a larger m
m = 250
movies['WR'] = (movies['v'] * movies['R'] + m*C) / (movies['v'] + m)

In [None]:
movies.sort_values(by='WR',ascending=False)[:10]
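
The per-movie loops above can also be sketched as a single `groupby`, which avoids re-filtering `ratings` once per movie. The tables `toy_ratings` and `toy_movies`, their values, and the choice `m = 2` are hypothetical stand-ins for illustration only, not the MovieLens data.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (hypothetical values)
toy_ratings = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 3, 3],
    'rating':  [4.0, 5.0, 3.0, 5.0, 2.0, 4.0],
})
toy_movies = pd.DataFrame({'movieId': [1, 2, 3, 4],
                           'title': ['A', 'B', 'C', 'D']})

# Mean rating (R) and vote count (v) per movie in one pass
stats = (toy_ratings.groupby('movieId')['rating']
         .agg(R='mean', v='count')
         .reset_index())

# A left merge keeps movies with no ratings: R stays NaN, v is filled with 0
toy_movies = toy_movies.merge(stats, on='movieId', how='left')
toy_movies['v'] = toy_movies['v'].fillna(0).astype(int)

m = 2  # illustrative value, not a recommendation
C = toy_ratings['rating'].mean()
toy_movies['WR'] = (toy_movies['v'] * toy_movies['R'] + m * C) / (toy_movies['v'] + m)
print(toy_movies.sort_values(by='WR', ascending=False))
```

Movies without any ratings keep `WR = NaN` here, which matches what the loop-based cells produce, and `sort_values` places them last by default.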