# 2489-2122 Data Curation  T1 Final Exam 

The final exam will contain 1 question with subquestions for 70% of the total points.

## A Million Dollar Question: Squid Game or Alice in Borderland?


“What TV series should I binge-watch this evening?” This is perhaps a question you ask yourself very often. I certainly do, and more than once. From Netflix to Hulu, building robust movie recommendation systems has become extremely important given modern consumers' huge demand for personalized content. **Netflix is forecasting it will add 3.5 million paying subscribers thanks to the surprise hit Squid Game.**

We are going to examine a MovieLens dataset which provides non-commercial, personalized movie recommendations. 

This dataset describes user ratings from MovieLens. It contains ratings and tag applications across movies created by users. Users were selected at random for inclusion. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files movies_fe.xlsx and ratings_fe.csv. More details about the contents and use of these files follow.

**Ratings Data File Structure (ratings_fe.csv)**
All ratings are contained in the file ratings_fe.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
`userId,movieId,rating,timestamp`

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

**Movies Data File Structure (movies_fe.xlsx)**
Movie information is contained in the file movies_fe.xlsx. Each line of this file after the header row represents one movie, and has the following format:
`movieId,title,year,genres`

Answer the following questions using the provided dataset. You can write down intermediate results on the way to the final answers.

In [2]:
import pandas as pd
import numpy as np

### Question 1 (10 points)

However, these files may contain errors and inconsistencies, as described below:

The ratings in `ratings_fe.csv` should be on a 5-star scale with half-star increments (0.5 stars - 5.0 stars). If a rating is larger than 5 or smaller than 0.5, change it to 5 or 1, respectively. For example, a movie rated 8 was probably rated wrongly, and you need to change the value to 5. Similarly, if a movie has a negative rating, e.g., -1, change it to 1.

The movie information in `movies_fe.xlsx` includes movies whose **year** is missing. You should remove these movies.

You should also inspect the data to make sure you read it from the correct starting row.
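
The rating rule above can be tried out on a few values before touching the real file. This is a minimal sketch on a hypothetical `sample` Series (the values are made up for illustration, not taken from `ratings_fe.csv`):

```python
import pandas as pd

# Hypothetical sample ratings, including out-of-range values
sample = pd.Series([8.0, -1.0, 0.0, 3.5, 5.0, 0.5], name="rating")

cleaned = sample.copy()
cleaned[sample > 5] = 5.0    # too high: change to 5
cleaned[sample < 0.5] = 1.0  # too low (including negatives): change to 1

print(cleaned.tolist())  # [5.0, 1.0, 1.0, 3.5, 5.0, 0.5]
```

Both masks are computed from the untouched `sample`, so the order of the two assignments does not matter.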

In [None]:
ratings = pd.read_csv('ratings_fe.csv', index_col=0, skiprows=10, encoding='utf-8')

# Replace out-of-range ratings: below 0.5 becomes 1, above 5 becomes 5
ratings.loc[ratings['rating'] < 0.5, 'rating'] = 1
ratings.loc[ratings['rating'] > 5, 'rating'] = 5

# Snap the remaining values to the nearest half star
ratings['rating'] = (ratings['rating'] * 2).round() / 2

# Note: casting to int truncates half-star ratings (e.g. 3.5 becomes 3)
ratings['rating'] = ratings['rating'].astype(int)

ratings

In [None]:
mov = pd.read_excel('movies_fe.xlsx', index_col = 0, skiprows = 15)

movies = mov.dropna(subset = ['year'])

movies

In [7]:
# Left join keeps every rating; ratings whose movie was removed get NaN for title, year and genres
df = pd.merge(ratings, movies, on='movieId', how='left')
df

Unnamed: 0,userId,movieId,rating,timestamp,title,year,genres
0,1,6,2,980730861,Heat,1995.0,Action|Crime|Thriller
1,1,22,3,980731380,Copycat,1995.0,Crime|Drama|Horror|Mystery|Thriller
2,1,32,2,980731926,Twelve Monkeys (a.k.a. 12 Monkeys),1995.0,Mystery|Sci-Fi|Thriller
3,1,50,5,980732037,"Usual Suspects, The",1995.0,Crime|Mystery|Thriller
4,1,110,4,980730408,Braveheart,1995.0,Action|Drama|War
...,...,...,...,...,...,...,...
100018,706,1023,4,841429779,Winnie the Pooh and the Blustery Day,1968.0,Animation|Children|Musical
100019,706,1073,3,852915721,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical
100020,706,1150,3,847647519,"Return of Martin Guerre, The (Retour de Martin...",1982.0,Drama
100021,706,1183,4,850465137,"English Patient, The",1996.0,Drama|Romance|War
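
A left merge keeps ratings whose movie has no match, and `dropna()` returns a new frame rather than modifying the one it is called on. The sketch below illustrates both points on small hypothetical frames (`ratings_toy` and `movies_toy` are invented for this example):

```python
import pandas as pd

# Hypothetical data: movieId 30 has no entry in movies_toy
ratings_toy = pd.DataFrame({"userId": [1, 1, 2],
                            "movieId": [10, 20, 30],
                            "rating": [4.0, 5.0, 3.0]})
movies_toy = pd.DataFrame({"movieId": [10, 20],
                           "title": ["A", "B"],
                           "year": [1995, 2001]})

left = pd.merge(ratings_toy, movies_toy, on="movieId", how="left")

left.dropna()                         # result is discarded; `left` is unchanged
kept = left.dropna(subset=["title"])  # assign the result to keep it

print(len(left), len(kept))  # 3 2
```

Using `how='inner'` in the merge would drop the unmatched ratings in one step.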


### Question 2 (10 points)

Show the top 5 Action movies with the highest median ratings:

In [8]:
not_missing = df['genres'].notna()
action_movies = df[not_missing & df['genres'].str.contains('Action')]

title_rating = action_movies.groupby('title')['rating'].median()
sort = title_rating.sort_values(ascending=False)

top_5 = sort.head(5)

top_5

title
Blood of Heroes, The (Salute of the Jugger, The)                     5.0
Batman: Under the Red Hood                                           5.0
Master of the Flying Guillotine (Du bi quan wang da po xue di zi)    5.0
Dead or Alive: Hanzaisha                                             5.0
After the Sunset                                                     5.0
Name: rating, dtype: float64

### Question 3 (15 points)

Among all movies that the user with Id 500 has rated, show his/her top 5 favorite movies (i.e., the movies he/she rated 5 most recently) in each of the following three genres: **Adventure**, **Comedy**, **Drama**. Show the result as three columns: `movieId, title, genre`. If a movie belongs to several of these genres, it is ok to include it several times.

In [9]:
# All ratings made by user 500
id_500 = df[df['userId'] == 500]

# Split the user's ratings by genre (a movie can appear in several genres)
adv = id_500[id_500['genres'].str.contains('Adventure')]
com = id_500[id_500['genres'].str.contains('Comedy')]
dra = id_500[id_500['genres'].str.contains('Drama')]

# Order by rating time; ascending=True puts the oldest ratings first
adv_sorted = adv.sort_values(by=['timestamp'], ascending=True)
com_sorted = com.sort_values(by=['timestamp'], ascending=True)
dra_sorted = dra.sort_values(by=['timestamp'], ascending=True)

# Keep only 5-star ratings and take the first five in each genre
top_5_adv = adv_sorted[adv_sorted['rating'] == 5.0].head(5)
top_5_com = com_sorted[com_sorted['rating'] == 5.0].head(5)
top_5_dra = dra_sorted[dra_sorted['rating'] == 5.0].head(5)

# Select the requested columns
top_5_adv = top_5_adv[['movieId','title','genres']]
top_5_com = top_5_com[['movieId','title','genres']]
top_5_dra = top_5_dra[['movieId','title','genres']]

print(top_5_adv)
print(top_5_com)
print('Top 5 Favourite Drama Movies')
print(top_5_dra)

       movieId                                              title  \
73243     1210         Star Wars: Episode VI - Return of the Jedi   
73251     1291                 Indiana Jones and the Last Crusade   
73239     1196     Star Wars: Episode V - The Empire Strikes Back   
73382     5952             Lord of the Rings: The Two Towers, The   
73366     4993  Lord of the Rings: The Fellowship of the Ring,...   

                        genres  
73243  Action|Adventure|Sci-Fi  
73251         Action|Adventure  
73239  Action|Adventure|Sci-Fi  
73382        Adventure|Fantasy  
73366        Adventure|Fantasy  
       movieId                                              title  \
73215      750  Dr. Strangelove or: How I Learned to Stop Worr...   
73261     1527                                 Fifth Element, The   
73259     1396                                           Sneakers   
73335     3033                                         Spaceballs   
73232     1073                Willy Wonka 
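
If "most recently" is read as newest ratings first, the selection can be written as one helper that sorts by `timestamp` in descending order and labels each row with the requested genre. This is a sketch on hypothetical data; `toy` and `recent_favorites` are illustrative names, not part of the exam files:

```python
import pandas as pd

# Hypothetical ratings for a single user
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "genres": ["Adventure|Comedy", "Drama", "Comedy", "Adventure"],
    "rating": [5, 5, 4, 5],
    "timestamp": [100, 300, 400, 200],
})

def recent_favorites(user_ratings, genre, n=5):
    """Return the n most recently rated 5-star movies in one genre."""
    liked = user_ratings[(user_ratings["rating"] == 5)
                         & user_ratings["genres"].str.contains(genre, na=False)]
    newest_first = liked.sort_values("timestamp", ascending=False).head(n)
    # Report the genre that was asked for, not the full genres string
    return newest_first.assign(genre=genre)[["movieId", "title", "genre"]]

favorites = pd.concat(
    [recent_favorites(toy, g) for g in ["Adventure", "Comedy", "Drama"]],
    ignore_index=True,
)
print(favorites)
```

Movie 1 appears under both Adventure and Comedy, which matches the instruction that overlapping genres may be listed several times.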

### Question 4 (15 points)

Show a pivot table of the mean and standard deviation of movie ratings, with release decades as rows (for example, year 1995 belongs to the 1990s) and quartiles of the timestamp values (4 groups) as columns.

In [None]:
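binned = df.assign(decade=df['year'] // 10 * 10, ts_quartile=pd.qcut(df['timestamp'], 4, labels=False) + 1)  # release decade, timestamp quartile (1-4)
pivot_table = binned.pivot_table(index='decade', columns='ts_quartile', values='rating', aggfunc=['mean', 'std'])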
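
The two binning steps the question asks for (year to decade, timestamp to quartile) can be checked on a small hand-made frame. This is a sketch under assumed data; `toy`, `decade` and `ts_quartile` are hypothetical names chosen for illustration:

```python
import pandas as pd

# Hypothetical ratings with release year and rating timestamp
toy = pd.DataFrame({
    "year": [1995, 1999, 1987, 2003, 2008, 1991, 1984, 2001],
    "timestamp": [10, 20, 30, 40, 50, 60, 70, 80],
    "rating": [4.0, 2.0, 5.0, 3.0, 1.0, 3.0, 4.0, 5.0],
})

# Floor the year to its decade, e.g. 1995 -> 1990
toy["decade"] = toy["year"] // 10 * 10

# Split timestamps into 4 equal-sized groups, numbered 1 to 4
toy["ts_quartile"] = pd.qcut(toy["timestamp"], 4, labels=False) + 1

table = toy.pivot_table(index="decade", columns="ts_quartile",
                        values="rating", aggfunc=["mean", "std"])
print(table)
```

Cells backed by a single rating have no sample standard deviation, so they show up as `NaN` or are dropped from the `std` part of the table.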

### Question 5 (20 points)

Now you need to implement a **recommender system using the collaborative filtering method**. The idea is simply to recommend movies on the basis that "people who like this movie also like these movies". For example, people who like to watch Star Wars are very likely to watch Star Trek.

To do so, you need to find all users who like a given movie (i.e., gave it a rating of 5) and identify the other movies these users also like, ranked by the number of likes.

Show the recommended movie list with the top 10 movies that users who like *Titanic* may also like.

In [10]:
# Users who liked "Titanic" (rating of 5)
titanic_lovers = df[(df['title'] == 'Titanic') & (df['rating'] == 5)]['userId']

# Every movie these users rated (any rating, not only 5-star likes)
movies_rated_by_titanic_lovers = df[df['userId'].isin(titanic_lovers)]['movieId']

# Count ratings per movie; value_counts() sorts by count in descending order
movie_counts = movies_rated_by_titanic_lovers.value_counts()
recommended_movie_ids = movie_counts.index.tolist()

# Hard-coded movieId used for "Titanic"
titanic_movie_id = 1721

# Exclude "Titanic" itself from the recommendations if it is in the list
if titanic_movie_id in recommended_movie_ids:
    recommended_movie_ids.remove(titanic_movie_id)

# Get the top 10 recommended movie ids as a DataFrame
top_10_recommendations = pd.DataFrame(recommended_movie_ids[:10], columns=['ID'])

# Look up the movie titles in `movies`
top_10_recommendations = top_10_recommendations.merge(movies, left_on='ID', right_on='movieId')

# Display the recommended movies with titles
print(top_10_recommendations[['ID', 'title']])

     ID                          title
0   480                  Jurassic Park
1   589     Terminator 2: Judgment Day
2  1580      Men in Black (a.k.a. MIB)
3   356                   Forrest Gump
4   593      Silence of the Lambs, The
5  2959                     Fight Club
6  2571                    Matrix, The
7  2762               Sixth Sense, The
8  2028            Saving Private Ryan
9   780  Independence Day (a.k.a. ID4)
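
The question ranks recommendations by the number of likes, i.e. by how many fans of the seed movie also gave another movie a rating of 5. One way to count likes only is sketched below on hypothetical data; `toy` and `also_liked` are illustrative names, and matching on `title` is an assumption made to keep the example short:

```python
import pandas as pd

# Hypothetical ratings: user 3 rated Titanic 4, so is not counted as a fan
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 3, 3],
    "title":  ["Titanic", "A", "B", "Titanic", "A", "C", "Titanic", "A"],
    "rating": [5, 5, 3, 5, 5, 5, 4, 5],
})

def also_liked(ratings, seed_title, n=10, like=5):
    """Rank other movies by how many fans of seed_title also liked them."""
    likes = ratings[ratings["rating"] == like]
    fans = likes.loc[likes["title"] == seed_title, "userId"]
    others = likes[likes["userId"].isin(fans) & (likes["title"] != seed_title)]
    return others["title"].value_counts().head(n)

recs = also_liked(toy, "Titanic")
print(recs)
```

Here movie B is not recommended because the only fan who rated it gave it a 3, and user 3's like for A is not counted because user 3 did not rate Titanic 5.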


Congratulations! You have just built your first [recommender system worth 1 million dollars](https://www.netflixprize.com/) :D

![netflix_prize](https://cdn.vox-cdn.com/thumbor/Kp9TEknNzIQV-ZijAm74cfHx_D0=/0x124:1100x700/fit-in/1200x630/cdn.vox-cdn.com/uploads/chorus_asset/file/15788062/netflix-prize1.0.1537040369.jpg)