## Movie Ratings Analysis

This notebook analyzes movie ratings from several users. Each person rated a selection of popular movies on a scale from 1 to 5.
We use pandas to clean the data, calculate averages, and explore how normalization and standardization affect the interpretation of ratings.

In [138]:
import pandas as pd
import numpy as np

df = pd.read_csv('Movie Ratings.csv')
print("Original Data")
print("=" * 71)
print(df)

Original Data
  Unnamed: 0  Demon Slayer  Kpop Demon Hunters  Avatar  Hamilton  Joker
0        Ace             5                 4.0     NaN       3.0      5
1   Brooklyn             4                 NaN     2.0       4.0      3
2     Caesar             3                 5.0     4.0       5.0      4
3       Doug             5                 3.0     5.0       NaN      2
4      Edgar             4                 4.0     3.0       4.0      5


In [139]:
# Clean up the column names and set proper index
df_clean = df.copy()
df_clean.rename(columns={'Unnamed: 0': 'Name'}, inplace=True)
df_clean.set_index('Name', inplace=True)

# Clean up movie names (remove extra spaces)
df_clean.columns = [col.strip() for col in df_clean.columns]

# Store the cleaned dataframe for subsequent steps
df = df_clean

print("Cleaned Data")
print("=" * 68)
print(df)

Cleaned Data
          Demon Slayer  Kpop Demon Hunters  Avatar  Hamilton  Joker
Name                                                               
Ace                  5                 4.0     NaN       3.0      5
Brooklyn             4                 NaN     2.0       4.0      3
Caesar               3                 5.0     4.0       5.0      4
Doug                 5                 3.0     5.0       NaN      2
Edgar                4                 4.0     3.0       4.0      5
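
As an aside, the renaming and indexing steps above can also be done while reading the file. The following is a minimal sketch, not the notebook's own method: `csv_text` is a hypothetical stand-in for `Movie Ratings.csv`, built from the table printed above, and the empty first header cell is an assumption that would explain the `Unnamed: 0` column.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'Movie Ratings.csv'. The empty first header cell
# is an assumption; it is what makes pandas name the column 'Unnamed: 0'.
csv_text = """,Demon Slayer,Kpop Demon Hunters,Avatar,Hamilton,Joker
Ace,5,4,,3,5
Brooklyn,4,,2,4,3
Caesar,3,5,4,5,4
Doug,5,3,5,,2
Edgar,4,4,3,4,5
"""

# index_col=0 turns the first column into the index while reading.
ratings = pd.read_csv(io.StringIO(csv_text), index_col=0)
ratings.index.name = "Name"
ratings.columns = ratings.columns.str.strip()  # drop stray spaces in movie names

print(ratings)
```

Empty cells in the CSV are read as `NaN`, which is why the columns containing a missing rating are stored as floats.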


In [140]:
# Average rating for each Name
name_means = df.mean(axis=1)
print("Average Rating per Name")
print("=" *32)
for name, avg_rating in name_means.items():
    ratings_count = df.loc[name].count()
    print(f"{name}: {avg_rating:.2f} ({ratings_count} movies rated)")

# Average rating for each movie
movie_means = df.mean(axis=0)
print("\nAverage Rating per Movie")
print("=" *41)
for movie, avg_rating in movie_means.items():
    ratings_count = df[movie].count()
    print(f"{movie}: {avg_rating:.2f} ({ratings_count} people rated)")

Average Rating per Name
Ace: 4.25 (4 movies rated)
Brooklyn: 3.25 (4 movies rated)
Caesar: 4.20 (5 movies rated)
Doug: 3.75 (4 movies rated)
Edgar: 4.00 (5 movies rated)

Average Rating per Movie
Demon Slayer: 4.20 (5 people rated)
Kpop Demon Hunters: 4.00 (4 people rated)
Avatar: 3.50 (4 people rated)
Hamilton: 4.00 (4 people rated)
Joker: 3.80 (5 people rated)


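The output above reports two summaries: each user’s average rating (how generous or strict they are overall) and each movie’s average rating (how well it was received).
Missing values (movies a person didn’t rate) are skipped by default when pandas computes a mean, so each average covers only the ratings that exist.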
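
To make the missing-value behavior concrete, the sketch below rebuilds Ace's row by hand (values copied from the cleaned table) and compares the default `mean()` with `mean(skipna=False)`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Ace's ratings, copied from the cleaned table; Avatar was not rated.
ace = pd.Series(
    [5, 4.0, np.nan, 3.0, 5],
    index=["Demon Slayer", "Kpop Demon Hunters", "Avatar", "Hamilton", "Joker"],
)

mean_skipping_nan = ace.mean()           # skipna=True is the default
mean_with_nan = ace.mean(skipna=False)   # NaN propagates when skipping is off
rated_count = ace.count()                # counts non-missing entries only

print(f"mean (NaN skipped): {mean_skipping_nan:.2f}")
print(f"mean (NaN kept):    {mean_with_nan}")
print(f"movies rated:       {rated_count}")
```

The skipped-NaN mean is 17 / 4 = 4.25, matching the value reported for Ace above.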

In [141]:
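def normalize_name_ratings(name_ratings):
    """Min-max normalize one person's ratings to the 0-1 scale."""
    # min() and max() skip NaN, so unrated movies stay NaN in the result.
    min_rating = name_ratings.min()
    max_rating = name_ratings.max()

    # If a person gave every movie the same rating, the range is 0 and the
    # result is NaN for that person (0 / 0); nobody in this dataset does.
    normalized = (name_ratings - min_rating) / (max_rating - min_rating)
    return normalized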

# Apply normalization
normalized_df = df.apply(normalize_name_ratings, axis=1)

print("Normalized Ratings (0-1 scale)")
print("=" * 68)
print(normalized_df.round(3))

# Calculate averages for normalized data
norm_name_means = normalized_df.mean(axis=1)
norm_movie_means = normalized_df.mean(axis=0)

Normalized Ratings (0-1 scale)
          Demon Slayer  Kpop Demon Hunters  Avatar  Hamilton  Joker
Name                                                               
Ace                1.0               0.500     NaN       0.0    1.0
Brooklyn           1.0                 NaN     0.0       1.0    0.5
Caesar             0.0               1.000     0.5       1.0    0.5
Doug               1.0               0.333     1.0       NaN    0.0
Edgar              0.5               0.500     0.0       0.5    1.0


Normalization rescales each user’s ratings so their lowest rating becomes 0 and their highest becomes 1.
This transformation makes it possible to compare rating patterns between users, regardless of whether they tend to rate high or low.
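
As a worked example of the rescaling just described, the sketch below applies the min-max formula `(rating - min) / (max - min)` to Doug's row by hand. The ratings are copied from the cleaned table; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Doug's ratings from the cleaned table; Hamilton was not rated.
doug = pd.Series(
    [5, 3.0, 5.0, np.nan, 2],
    index=["Demon Slayer", "Kpop Demon Hunters", "Avatar", "Hamilton", "Joker"],
)

lowest, highest = doug.min(), doug.max()   # 2 and 5; NaN is skipped
doug_normalized = (doug - lowest) / (highest - lowest)

print(doug_normalized.round(3))
```

Doug's 3-star rating sits one third of the way between his lowest (2) and highest (5) ratings, which is where the 0.333 in the normalized table comes from.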

In [142]:
# Calculate averages for normalized data
norm_name_means = normalized_df.mean(axis=1)
norm_movie_means = normalized_df.mean(axis=0)

print("Average Normalized Rating per Name")
print("=" *34)
for name, avg_rating in norm_name_means.items():
    print(f"{name}: {avg_rating:.3f}")

print("\nAverage Normalized Rating per Movie")
print("=" *35)
for movie, avg_rating in norm_movie_means.items():
    print(f"{movie}: {avg_rating:.3f}")

Average Normalized Rating per Name
Ace: 0.625
Brooklyn: 0.625
Caesar: 0.600
Doug: 0.583
Edgar: 0.500

Average Normalized Rating per Movie
Demon Slayer: 0.700
Kpop Demon Hunters: 0.583
Avatar: 0.375
Hamilton: 0.625
Joker: 0.600


After normalization, averages show how each user rated relative to their personal scale.
For instance, if someone’s highest and lowest ratings are both high (like 5 and 4), their normalized ratings still range from 0 to 1, so they show relative preferences rather than absolute enthusiasm.
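
The point about relative preferences can be illustrated with two made-up raters (not part of the dataset): one who only uses 4 and 5, and one who only uses 1 and 2. The helper `min_max` is a hypothetical stand-in for the notebook's normalization function.

```python
import pandas as pd


def min_max(ratings: pd.Series) -> pd.Series:
    """Rescale one rater's scores so the lowest is 0 and the highest is 1."""
    return (ratings - ratings.min()) / (ratings.max() - ratings.min())


films = ["Film A", "Film B", "Film C", "Film D"]   # made-up titles
generous = pd.Series([5, 4, 5, 4], index=films)    # only uses 4 and 5
strict = pd.Series([2, 1, 2, 1], index=films)      # only uses 1 and 2

generous_norm = min_max(generous)
strict_norm = min_max(strict)

print(generous_norm.tolist())
print(strict_norm.tolist())
print("identical after normalization:", generous_norm.equals(strict_norm))
```

Both raters end up with the same normalized pattern even though their raw ratings differ by three stars, so the normalized values say nothing about how much either of them liked the films in absolute terms.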

ADVANTAGES of Normalized Ratings:

1. **Eliminates Individual Rating Bias**: Raters used different parts of the scale. Ace's ratings run from 3 to 5 (mean 4.25), 
   while Brooklyn's run from 2 to 4 (mean 3.25). Normalization removes these personal tendencies and puts everyone on the same 0-1 scale.

2. **Better Cross-Name Comparability**: With normalized ratings, we can directly compare preferences 
   across people regardless of whether they are harsh or lenient raters. Per-person normalized averages now fall in a narrow band (0.500 to 0.625).

3. **Focus on Relative Preferences**: Normalization highlights which movies a person prefers 
   relative to their own rating pattern. For example, Doug's favorites were Demon Slayer and Avatar (both 1.0), with Joker at the bottom (0.0).
   
DISADVANTAGES of Normalized Ratings:

1. **Loss of Absolute Meaning**: Avatar's normalized average of 0.375 doesn't tell us if the movie 
   is actually poor or just less preferred compared to other options.

2. **Amplifies Small Differences**: Small absolute rating differences become large normalized gaps, 
   potentially overemphasizing minor preference variations.

3. **Interpretation Difficulty**: Normalized scores (0-1) are less intuitive than 
   traditional 1-5 star ratings for most users.
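
The amplification issue in point 2 can be sketched with made-up raters (not from the dataset). The helper `min_max` is a hypothetical stand-in for the notebook's normalization function; the last case shows what happens when a rater gives every film the same score.

```python
import pandas as pd


def min_max(ratings: pd.Series) -> pd.Series:
    """Rescale one rater's scores so the lowest is 0 and the highest is 1."""
    return (ratings - ratings.min()) / (ratings.max() - ratings.min())


films = ["Film A", "Film B", "Film C"]            # made-up titles
narrow = pd.Series([4, 5, 4], index=films)        # ratings differ by one star
wide = pd.Series([1, 5, 3], index=films)          # ratings span the whole scale
constant = pd.Series([3, 3, 3], index=films)      # no variation at all

narrow_norm = min_max(narrow)      # the one-star gap stretches to the full 0-1 range
wide_norm = min_max(wide)
constant_norm = min_max(constant)  # 0 / 0 for every film, so every value is NaN

print(narrow_norm.tolist())
print(wide_norm.tolist())
print(constant_norm.tolist())
```

A one-star difference for the narrow rater looks as large as a four-star difference for the wide rater, and a rater with no variation cannot be normalized at all without an extra rule for that case.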

In [143]:
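def standardize_name_ratings(name_ratings):
    """Convert one person's ratings to z-scores (mean 0, standard deviation 1)."""
    mean_rating = name_ratings.mean()
    # pandas' std() uses the sample standard deviation (ddof=1) by default.
    std_rating = name_ratings.std()

    # A person who gave every movie the same rating has std 0, which yields NaN.
    standardized = (name_ratings - mean_rating) / std_rating
    return standardized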

# Apply standardization
standardized_df = df.apply(standardize_name_ratings, axis=1)

print("Standardized Ratings")
print("=" * 68)
print(standardized_df.round(3))

# Calculate averages for standardized data
std_name_means = standardized_df.mean(axis=1)
std_movie_means = standardized_df.mean(axis=0)



Standardized Ratings
          Demon Slayer  Kpop Demon Hunters  Avatar  Hamilton  Joker
Name                                                               
Ace              0.783              -0.261     NaN    -1.306  0.783
Brooklyn         0.783                 NaN  -1.306     0.783 -0.261
Caesar          -1.434               0.956  -0.239     0.956 -0.239
Doug             0.833              -0.500   0.833       NaN -1.167
Edgar            0.000               0.000  -1.414     0.000  1.414


Standardization converts each user’s ratings into z-scores.
A z-score shows how far each rating deviates from the user’s mean rating, measured in standard deviations.
This transformation centers each user’s ratings around 0, with positive values for above-average ratings and negative for below-average ones.
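
As a worked example of the z-score calculation, the sketch below standardizes Edgar's row by hand (ratings copied from the cleaned table). It also prints the population standard deviation for comparison, since pandas' `std()` uses the sample version (`ddof=1`) by default. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Edgar's ratings from the cleaned table (he rated all five movies).
edgar = pd.Series(
    [4, 4.0, 3.0, 4.0, 5],
    index=["Demon Slayer", "Kpop Demon Hunters", "Avatar", "Hamilton", "Joker"],
)

mean_rating = edgar.mean()           # 4.0
sample_std = edgar.std()             # ddof=1 (pandas default): sqrt(2 / 4)
population_std = edgar.std(ddof=0)   # ddof=0: sqrt(2 / 5)

z_scores = (edgar - mean_rating) / sample_std

print(f"mean = {mean_rating:.3f}")
print(f"sample std = {sample_std:.3f}, population std = {population_std:.3f}")
print(z_scores.round(3))
```

Edgar's Joker rating is one star above his mean of 4, and one star divided by a sample standard deviation of about 0.707 gives the 1.414 shown in the standardized table.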

In [None]:
print("Average Standardized Rating per Name")
print("=" *36)
for name, avg_rating in std_name_means.items():
    print(f"   {name}: {avg_rating:.3f}")

print("\nAverage Standardized Rating per Movie")
print("=" *37)
for movie, avg_rating in std_movie_means.items():
    print(f"   {movie}: {avg_rating:.3f}")

Average Standardized Rating per Name
   Ace: 0.000
   Brooklyn: 0.000
   Caesar: -0.000
   Doug: 0.000
   Edgar: 0.000

Average Standardized Rating per Movie
   Demon Slayer: 0.193
   Kpop Demon Hunters: 0.049
   Avatar: -0.531
   Hamilton: 0.108
   Joker: 0.106


Averages of standardized data are close to zero for each user (as expected for z-scores).
This confirms that ratings are now centered on each user’s personal mean, eliminating differences in rating scale or bias.
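
The centering claim can be checked numerically. The sketch below rebuilds the cleaned table by hand and computes row-wise z-scores in a vectorized way, which is an alternative to the notebook's `apply` call rather than the author's method; the names `ratings` and `z` are illustrative.

```python
import numpy as np
import pandas as pd

# Ratings copied from the cleaned table; NaN marks a movie the person did not rate.
ratings = pd.DataFrame(
    {
        "Demon Slayer": [5, 4, 3, 5, 4],
        "Kpop Demon Hunters": [4.0, np.nan, 5.0, 3.0, 4.0],
        "Avatar": [np.nan, 2.0, 4.0, 5.0, 3.0],
        "Hamilton": [3.0, 4.0, 5.0, np.nan, 4.0],
        "Joker": [5, 3, 4, 2, 5],
    },
    index=pd.Index(["Ace", "Brooklyn", "Caesar", "Doug", "Edgar"], name="Name"),
)

# Row-wise z-scores: subtract each person's mean, divide by each person's std.
row_means = ratings.mean(axis=1)
row_stds = ratings.std(axis=1)   # sample standard deviation (ddof=1)
z = ratings.sub(row_means, axis=0).div(row_stds, axis=0)

means_are_zero = np.allclose(z.mean(axis=1), 0.0)
stds_are_one = np.allclose(z.std(axis=1), 1.0)

print("every row mean is ~0:", means_are_zero)
print("every row std is ~1: ", stds_are_one)
```

Comparing with `np.allclose` rather than `==` allows for floating-point rounding, which is also why Caesar's average prints as `-0.000` above.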

ADVANTAGES of Standardized Ratings:

1. **Centers All Ratings Around Zero**: Every person now has an average rating of 0.000 (up to floating-point rounding, hence Caesar's -0.000), 
   which removes each person's overall leniency or strictness and makes direct comparisons meaningful.

2. **Preserves Relative Rating Patterns**: While centering the data, standardization maintains 
   each person's unique rating distribution, showing how extreme their preferences are relative to their own average.

3. **Identifies True Outliers**: Scores far from zero mark ratings that are unusual for that person: positive scores indicate above-average preference, 
   while negative scores show below-average ratings. For example, Edgar's Joker rating (1.414) and Avatar rating (-1.414) are his most extreme.

DISADVANTAGES of Standardized Ratings:

1. **Difficult Interpretation for Non-Technical Users**: Z-scores like 0.783 or -1.306 are 
   not intuitive and are harder for most people to understand than simple 1-5 star ratings.

2. **Amplifies Rating Consistency Issues**: People with very consistent rating patterns 
   (small standard deviation) can have artificially inflated z-scores for small absolute differences.

3. **Loses Absolute Benchmarking**: We can no longer tell if a movie is generally "good" or "bad"; 
   we only know if it's better or worse than that particular person's average rating.
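
The consistency issue in point 2 can be sketched with two made-up raters (not from the dataset). The helper `z_scores` is a hypothetical stand-in for the notebook's standardization function.

```python
import pandas as pd


def z_scores(ratings: pd.Series) -> pd.Series:
    """Standardize one rater's scores using the sample standard deviation."""
    return (ratings - ratings.mean()) / ratings.std()


films = ["Film A", "Film B", "Film C", "Film D", "Film E"]   # made-up titles
consistent = pd.Series([4, 4, 4, 4, 5], index=films)   # almost always gives 4
varied = pd.Series([1, 2, 3, 4, 5], index=films)       # uses the whole scale

consistent_z = z_scores(consistent)
varied_z = z_scores(varied)

# Both raters gave Film E five stars, but the z-scores differ.
print(f"consistent rater, Film E: {consistent_z['Film E']:.3f}")
print(f"varied rater, Film E:     {varied_z['Film E']:.3f}")
```

The consistent rater's 5 is only 0.8 stars above their mean, yet it receives the larger z-score because their standard deviation is so small.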