# 2489-1819 Data Curation  T1 Final Exam 

The final exam will contain 1 question with subquestions for 70% of the total points (20% for 5 questions in SQL and 50% for Python). 

## Game of Thrones or The Big Bang Theory?

“What movie should I watch this evening?” This perhaps is a question you would ask yourself very often. As for me — yes, and more than once. As such, from Netflix to Hulu, the need to build robust movie recommendation systems is extremely important given the huge demand for personalized content of modern consumers.

We are going to examine a MovieLens dataset which provides non-commercial, personalized movie recommendations. 

This dataset describes 5-star rating from MovieLens. It contains ratings and tag applications across movies created by  users. Users were selected at random for inclusion. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files movies.csv, ratings.csv. More details about the contents and use of all these files follows.

**Ratings Data File Structure (ratings_fe.csv)**
All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
`userId,movieId,rating,timestamp`

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

**Movies Data File Structure (movies_fe.xlsx)**
Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:
`movieId,title,year,genres`

The database `movielens_small` contains 4 tables: *ratings, tags, movies and links*. In the multiple choice questions, you need to use these tables to answer those.


Answer the following questions using the provided dataset. You can write down intermediate results towards the final answers

In [2]:
import pandas as pd
import numpy as np

### Question 1 (10 points)

However, errors and inconsistencies may exist in these files shown as below:

The ratings in the `rating_fe.csv` should be made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars). So if the ratings that are larger than 5 or smaller than 0.5, you need to round it to the value of 5 and 1. For example, if a movie is rated 8, then it might be wronly rated and you need to change the value to 5. Similarly, if a movie is rated -1, then it should be changed to 1, if any.

In [15]:
ratings = pd.read_csv('ratings_fe.csv',encoding = 'utf-8', index_col = 0, skiprows = 10)

ratings.loc[ratings.rating < 0.5] = 1
ratings.loc[ratings.rating > 5] = 5

def round_to_nearest_half_int(num):
    return round(num * 2) / 2

ratings['rating'] = round_to_nearest_half_int(ratings['rating'])

ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,6,2.0,980730861
1,1,22,3.0,980731380
2,1,32,2.0,980731926
3,1,50,5.0,980732037
4,1,110,4.0,980730408
...,...,...,...,...
100018,706,1023,4.0,841429779
100019,706,1073,3.0,852915721
100020,706,1150,3.0,847647519
100021,706,1183,4.0,850465137


In [17]:
movies_c = pd.read_excel('movies_fe.xlsx', skiprows = 15)
movies = movies_c.dropna(subset=['year'])
movies

Unnamed: 0,movieId,title,year,genres
0,1,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,1995.0,Adventure|Children|Fantasy
2,3,Grumpier Old Men,1995.0,Comedy|Romance
3,4,Waiting to Exhale,1995.0,Comedy|Drama|Romance
4,5,Father of the Bride Part II,1995.0,Comedy
...,...,...,...,...
8564,123109,P.U.N.K.S,1999.0,Children|Comedy|Sci-Fi
8565,124857,Deception,2013.0,Action
8566,125916,Fifty Shades of Grey,2015.0,Drama
8567,126407,Face of Terror,2005.0,Action|Drama|Thriller


In [25]:
movies_ratings = pd.merge(ratings,movies, how = "right", on = "movieId")

df = movies_ratings.dropna()

df

Unnamed: 0,userId,movieId,rating,timestamp,title,year,genres
0,1.0,1,1.0,1.000000e+00,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy
1,1.0,1,1.0,1.000000e+00,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy
2,1.0,1,1.0,1.000000e+00,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy
3,7.0,1,5.0,8.355833e+08,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy
4,1.0,1,1.0,1.000000e+00,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...,...
100133,365.0,123109,3.0,1.424225e+09,P.U.N.K.S,1999.0,Children|Comedy|Sci-Fi
100134,516.0,124857,3.0,1.421437e+09,Deception,2013.0,Action
100135,568.0,125916,0.5,1.425048e+09,Fifty Shades of Grey,2015.0,Drama
100136,516.0,126407,3.0,1.422146e+09,Face of Terror,2005.0,Action|Drama|Thriller


### Question 2 (5 points)

Show the top 5 years with the most number of movies:

In [37]:
top_5_years = df.groupby('year')['movieId'].count().sort_values(ascending = False).head(5)

top_5_years

year
1995.0    13502
1994.0     6442
1996.0     5592
1999.0     4921
1993.0     4616
Name: movieId, dtype: int64

### Question 3 (5 points)
Show the average rating of movies with ID 100:

In [38]:
avg_rat_100 = df.loc[(df.movieId == 100)]['rating'].mean()
avg_rat_100

3.293103448275862

### Question 4 (5 points)

Show the median ratings given by user with ID 500:

In [39]:
avg_us_500 = df.loc[(df.userId == 500)]['rating'].median()
avg_us_500

4.0

### Question 5 (10 points)

Among all movies that uer with Id 500 has rated, show the his/her top 10 favorite movies (i.e., the movies he/she rated 5 more recently) as three columns: `movieId, title, rating`

In [71]:
movies_500 = df.loc[(df.userId == 500) & (df.rating != 0)].sort_values(by=['rating','timestamp'],ascending=[False,True])

top_10_movies = movies_500.head(10)

columns = top_10_movies[['movieId','title','rating']]

columns.reset_index(drop=True, inplace= True)

columns

Unnamed: 0,movieId,title,rating
0,527,Schindler's List,5.0
1,1210,Star Wars: Episode VI - Return of the Jedi,5.0
2,1291,Indiana Jones and the Last Crusade,5.0
3,5952,"Lord of the Rings: The Two Towers, The",5.0
4,750,Dr. Strangelove or: How I Learned to Stop Worr...,5.0
5,541,Blade Runner,5.0
6,1253,"Day the Earth Stood Still, The",5.0
7,1396,Sneakers,5.0
8,3793,X-Men,5.0
9,2173,"Navigator: A Mediaeval Odyssey, The",5.0


### Question 6 (15 points)

Now you need to implement a **recommender system using collaborative filtering method**. This works simply as to recommend movies that "people who like this movie also like these movies". For example, people who like to watch Star Wars are very likely to watch Star Treks. 

In order to do so, you need to find all users who like one movie (i.e., post a rating of 5), and identify the movies these users also like, ranked by the number of likes. 

Show the recommended movie list with top 10 movies that users who like the *Titanic* may also like.

In [75]:
total_movies_liked = df[df['rating'] == 5]

total_movies_liked = total_movies_liked[total_movies_liked['title'] != 'Titanic']

movie_counts = total_movies_liked.groupby('title').size().sort_values(ascending=False).head(10)

print(movie_counts)

title
Father of the Bride Part II           5071
Shawshank Redemption, The              101
Silence of the Lambs, The              100
Star Wars: Episode IV - A New Hope      92
Pulp Fiction                            82
Usual Suspects, The                     74
Schindler's List                        69
Forrest Gump                            68
Godfather, The                          66
Matrix, The                             64
dtype: int64


In [76]:
# Specify the movie title that you want to use as a reference for finding similar likes
reference_movie_title = 'Titanic'

# Get a list of users who liked the reference movie
users_who_liked_reference = df[(df['title'] == reference_movie_title) & (df['rating'] == 5)]['userId'].unique()

# Filter the DataFrame to include only ratings from these users
similar_likes_df = df[(df['userId'].isin(users_who_liked_reference)) & (df['rating'] == 5) & (df['title'] != reference_movie_title)]

# Group by movie title and count the number of likes
movie_counts = similar_likes_df['title'].value_counts().head(10)

# Print the top 10 movies liked by users who liked the reference movie
print(movie_counts)

Silence of the Lambs, The             6
Braveheart                            5
Matrix, The                           4
Forrest Gump                          4
Shawshank Redemption, The             4
Star Wars: Episode IV - A New Hope    4
Godfather, The                        4
Contact                               3
Back to the Future                    3
Schindler's List                      3
Name: title, dtype: int64


Congratulations! You just build the first recommender system that worth 1 million dollars :D