# Analysis on MovieLens Data

In this notebook, we explore the MovieLens latest small dataset to understand patterns in movie ratings, genres, and rater behavior. 

Dataset: [MovieLens Latest Small](https://grouplens.org/datasets/movielens/latest/)

## Questions We Will Answer

1. **Raters and movies by the numbers**
   - How many raters are there?
   - How many movies have been rated?

2. **Questions about genres**
   - What are the most common movie genres?
   - What is the average rating per genre?
   - What is the variance of ratings within each genre?

3. **Movie ratings through the years**
   - Which years had the most movies released?
   - What is the average movie rating per year?

4. **About the raters**
   - Which raters give the highest ratings on average? Which give the lowest?
   - Do raters who rate movies more often give higher or lower ratings on average?


In [29]:
import pandas as pd
import numpy as np

ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")

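# Split the pipe-delimited genre string into a list of genres per movie.
movies["genres"] = movies["genres"].str.split("|")
# Pull the four-digit release year out of the title, e.g. "Toy Story (1995)".
# Going through float lets titles without a year become <NA> in the nullable Int64 column.
movies["year"] = (
    movies["title"]
    .str.extract(r"\((\d{4})\)", expand=False)
    .astype("float")
    .astype("Int64")
)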
ratings_with_genres = pd.merge(
    ratings, 
    movies[["movieId", "genres", "year"]], 
    on="movieId", 
    how="left"
)
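The preparation step can be checked on a few rows before running it on the full files. The sketch below uses hypothetical sample frames (`movies_demo`, `ratings_demo`) in the same column layout. It shows that a title without a year in parentheses yields `<NA>`, and it passes `validate="many_to_one"` to the merge so that pandas raises an error if `movieId` is ever duplicated on the movies side.

```python
import pandas as pd

# Hypothetical sample rows in the movies.csv / ratings.csv layout (not real data).
movies_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Untitled Film"],
})
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# expand=False returns a Series when the pattern has a single capture group.
movies_demo["year"] = (
    movies_demo["title"]
    .str.extract(r"\((\d{4})\)", expand=False)
    .astype("float")
    .astype("Int64")
)

# validate="many_to_one" asserts that each movieId appears once in the right frame,
# so the left merge cannot silently duplicate rating rows.
merged_demo = ratings_demo.merge(
    movies_demo[["movieId", "year"]],
    on="movieId",
    how="left",
    validate="many_to_one",
)

print(movies_demo)
print(merged_demo)
```

With a left merge that passes this check, the merged frame has exactly as many rows as the ratings frame.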

In [30]:
print(f"Number of unique users: {ratings['userId'].nunique()}")

Number of unique users: 610
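The second question in the list, how many movies have been rated, can be answered the same way. A minimal sketch on hypothetical sample data (`ratings_demo` and `movies_demo` are illustrative names, not the notebook's frames) counts the distinct `movieId` values in the ratings and, as an optional extra, the listed movies that received no rating:

```python
import pandas as pd

# Hypothetical sample data in the ratings.csv / movies.csv layout.
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})
movies_demo = pd.DataFrame({"movieId": [10, 20, 30, 40]})

# Movies that received at least one rating.
n_rated = ratings_demo["movieId"].nunique()
# Movies listed in the catalogue that nobody rated.
n_unrated = (~movies_demo["movieId"].isin(ratings_demo["movieId"])).sum()

print(f"Number of rated movies: {n_rated}")
print(f"Number of listed but unrated movies: {n_unrated}")
```

On the real data, the same two expressions would be applied to `ratings` and `movies`.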


In [31]:
print(f"Most common genres: {movies.explode('genres').groupby('genres').genres.count().sort_values(ascending=False)}")

Most common genres: genres
Drama                 4361
Comedy                3756
Thriller              1894
Action                1828
Romance               1596
Adventure             1263
Crime                 1199
Sci-Fi                 980
Horror                 978
Fantasy                779
Children               664
Animation              611
Mystery                573
Documentary            440
War                    382
Musical                334
Western                167
IMAX                   158
Film-Noir               87
(no genres listed)      34
Name: genres, dtype: int64
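Because `explode` emits one row per (movie, genre) pair, a movie with several genres is counted once in each of them, so the counts above add up to more than the number of movies. The sketch below illustrates this on hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample movies, each with a pipe-delimited genre string.
movies_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Comedy|Drama", "Drama", "Comedy|Drama|Romance"],
})
movies_demo["genres"] = movies_demo["genres"].str.split("|")

# One row per (movie, genre) pair: 2 + 1 + 3 = 6 rows for 3 movies.
exploded = movies_demo.explode("genres")
genre_counts = exploded["genres"].value_counts()

print(exploded)
print(genre_counts)
```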


In [32]:
print(f"Average rating per genre: {ratings_with_genres.explode('genres').groupby('genres')['rating'].mean().sort_values(ascending=False)}")

Average rating per genre: genres
Film-Noir             3.920115
War                   3.808294
Documentary           3.797785
Crime                 3.658294
Drama                 3.656184
Mystery               3.632460
Animation             3.629937
IMAX                  3.618335
Western               3.583938
Musical               3.563678
Adventure             3.508609
Romance               3.506511
Thriller              3.493706
Fantasy               3.491001
(no genres listed)    3.489362
Sci-Fi                3.455721
Action                3.447984
Children              3.412956
Comedy                3.384721
Horror                3.258195
Name: rating, dtype: float64
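A genre mean is easier to interpret next to the number of ratings behind it, since a mean over a small number of ratings is less stable than one over many. A minimal sketch, using a hypothetical frame `rwg_demo` shaped like `ratings_with_genres`, computes both in one pass with named aggregation:

```python
import pandas as pd

# Hypothetical sample shaped like ratings_with_genres (genres already split into lists).
rwg_demo = pd.DataFrame({
    "userId": [1, 2, 3],
    "rating": [5.0, 3.0, 4.0],
    "genres": [["Drama", "War"], ["Drama"], ["Comedy"]],
})

# Mean rating and number of ratings per genre.
genre_stats = (
    rwg_demo.explode("genres")
    .groupby("genres")["rating"]
    .agg(avg_rating="mean", n_ratings="count")
    .sort_values("avg_rating", ascending=False)
)

print(genre_stats)
```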


In [33]:
print(f"Variance in ratings per genre: {ratings_with_genres.explode('genres').groupby('genres')['rating'].var().sort_values(ascending=False)}")

Variance in ratings per genre: genres
(no genres listed)    1.483580
Horror                1.305514
Sci-Fi                1.147746
Comedy                1.137509
Children              1.115128
Action                1.104455
Fantasy               1.078872
Adventure             1.058990
Thriller              1.051442
Romance               1.047411
Western               1.024314
Mystery               1.012590
Crime                 0.989375
Musical               0.978601
IMAX                  0.976401
Drama                 0.958701
War                   0.957529
Animation             0.940249
Film-Noir             0.786764
Documentary           0.673156
Name: rating, dtype: float64


In [34]:
print(f"Top 10 years with the most movies released: {movies.groupby('year').year.count().sort_values(ascending=False).head(10)}")

Top 10 years with the most movies released: year
2002    311
2006    295
2001    294
2007    284
2000    283
2009    282
2004    279
2003    279
2014    278
1996    276
Name: year, dtype: Int64


In [35]:
print(f"Average rating per year: {ratings_with_genres.groupby('year').rating.mean().sort_values(ascending=False)}")

Average rating per year: year
1917    4.500000
1930    4.205882
1921    4.100000
1934    4.088235
1944    4.043478
          ...   
1996    3.335329
1932    3.333333
1903    2.500000
1919    2.000000
1915    2.000000
Name: rating, Length: 106, dtype: float64
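The extremes of this ranking are round values such as 4.5 and 2.0 for very early release years, which suggests they may rest on only a handful of ratings. One way to check is to compute the rating count alongside the mean and keep only years above a minimum count. The sketch below does this on hypothetical sample data; `MIN_RATINGS` is an illustrative threshold, not a value from the notebook.

```python
import pandas as pd

# Hypothetical sample: one rating for 1917, three for 1995, two for 2002.
ratings_demo = pd.DataFrame({
    "year": [1917, 1995, 1995, 1995, 2002, 2002],
    "rating": [4.5, 4.0, 3.0, 3.5, 2.0, 4.0],
})

MIN_RATINGS = 2  # illustrative threshold; tune it for the real data

# Mean rating and number of ratings per release year.
per_year = ratings_demo.groupby("year")["rating"].agg(
    avg_rating="mean", n_ratings="count"
)

# Keep only years with enough ratings for the mean to be meaningful.
reliable = per_year[per_year["n_ratings"] >= MIN_RATINGS].sort_values(
    "avg_rating", ascending=False
)

print(reliable)
```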


In [36]:
# Users with the lowest and highest average ratings
rating_per_user = ratings_with_genres.groupby("userId").rating
avg_rating_per_user = rating_per_user.mean()
print(f"Top 10 harshest raters: {avg_rating_per_user.sort_values().head(10)}")
print(f"Top 10 most generous raters: {avg_rating_per_user.sort_values(ascending=False).head(10)}")

Top 10 harshest raters: userId
442    1.275000
139    2.144330
508    2.145833
153    2.217877
567    2.245455
311    2.339286
298    2.363685
517    2.386250
308    2.426087
3      2.435897
Name: rating, dtype: float64
Top 10 most generous raters: userId
53     5.000000
251    4.869565
515    4.846154
25     4.807692
30     4.735294
523    4.693333
348    4.672727
171    4.634146
452    4.556931
43     4.552632
Name: rating, dtype: float64


In [37]:
user_info = ratings.groupby("userId").agg(
    avg_rating=("rating", "mean"),
    num_of_ratings=("rating", "count")
).reset_index()
corr_between_ratings_num_of_ratings = user_info["avg_rating"].corr(user_info["num_of_ratings"], method="kendall")
print(f"""
Kendall's tau between a user's average rating and their number of ratings is {round(corr_between_ratings_num_of_ratings, 5)}.
This is a very weak negative correlation: users who rate more movies tend to give
slightly lower average ratings, but the effect is small.
""")


Kendall's tau between a user's average rating and their number of ratings is -0.09381.
This is a very weak negative correlation: users who rate more movies tend to give
slightly lower average ratings, but the effect is small.
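Kendall's tau is rank-based, so it is reasonable to cross-check the direction of the relationship with other coefficients. The sketch below uses a hypothetical frame `user_info_demo` and computes Pearson's coefficient plus Spearman's, the latter obtained here as Pearson's correlation on the ranks. The sample values are illustrative only and say nothing about the real result.

```python
import pandas as pd

# Hypothetical per-user summary: heavier raters have lower averages in this toy sample.
user_info_demo = pd.DataFrame({
    "avg_rating": [4.5, 4.0, 3.5, 3.0],
    "num_of_ratings": [20, 50, 100, 400],
})

# Pearson measures linear association on the raw values.
pearson = user_info_demo["avg_rating"].corr(user_info_demo["num_of_ratings"])

# Spearman is Pearson's correlation applied to the ranks of each column.
spearman = (
    user_info_demo["avg_rating"].rank()
    .corr(user_info_demo["num_of_ratings"].rank())
)

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Since the number of ratings per user is typically heavily skewed, the rank-based coefficients are usually the more robust choice for this comparison.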

