# Week 7 Assignment: Movie Ratings Analysis
## Normalization and Standardization

## Load the Data

In [2]:
import pandas as pd
import numpy as np

# Load movie ratings from CSV
df = pd.read_csv('movie_ratings.csv')

print("Raw Data:")
print(df)
print(f"\nShape: {df.shape}")

Raw Data:
    Person  Predator Badlands  Wicked  Don't Worry Darling  \
0  Charity                  4       2                    4   
1   Payton                  5       3                    3   
2  Sabrina                  5       5                    5   
3    Maria                  3       1                    4   
4   Bernie                  3       1                    3   
5  Michael                  4       3                    4   

   K-Pop Demon Hunters  You People  Sinners  
0                  1.0           4        4  
1                  4.0           4        3  
2                  4.0           5        5  
3                  NaN           3        4  
4                  1.0           2        5  
5                  3.0           4        4  

Shape: (6, 7)


## Step 1: Original Ratings

In [3]:
# Set person as index
df = df.set_index('Person')

# Average rating per person
print("Average Rating Per Person:")
avg_person = df.mean(axis=1)
print(avg_person)

print("\n" + "="*50)
print("\nAverage Rating Per Movie:")
avg_movie = df.mean(axis=0)
print(avg_movie)

Average Rating Per Person:
Person
Charity    3.166667
Payton     3.666667
Sabrina    4.833333
Maria      3.000000
Bernie     2.500000
Michael    3.666667
dtype: float64

==================================================

Average Rating Per Movie:
Predator Badlands      4.000000
Wicked                 2.500000
Don't Worry Darling    3.833333
K-Pop Demon Hunters    2.600000
You People             3.666667
Sinners                4.166667
dtype: float64


## Step 2: Normalize Ratings

Normalize each person's ratings to a 0-1 scale using min-max scaling, where min and max are that person's lowest and highest ratings.

Formula: (rating - min) / (max - min)
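
As a quick check of the formula, here is a minimal sketch that applies it by hand to Charity's row from the raw data above. The variable names (`charity`, `lo`, `hi`, `charity_norm`) are only for illustration and are not part of the assignment code.

```python
import pandas as pd

# Charity's six ratings, copied from the raw data printed above
charity = pd.Series([4, 2, 4, 1, 4, 4], dtype=float)

lo, hi = charity.min(), charity.max()  # 1.0 and 4.0

# (rating - min) / (max - min)
charity_norm = (charity - lo) / (hi - lo)

print(charity_norm.round(3).tolist())
# [1.0, 0.333, 1.0, 0.0, 1.0, 1.0]
```

Her lowest rating (1) maps to 0.0, her highest (4) maps to 1.0, and the 2 lands a third of the way up her range.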

In [None]:
# Normalize each person's ratings
def normalize(row):
    # Only look at ratings they actually gave
    valid = row.dropna()
    if len(valid) == 0:
        return row
    
    min_rating = valid.min()
    max_rating = valid.max()
    
    # If all ratings are the same, max - min is 0 and the formula
    # would divide by zero, so return the row unchanged
    if max_rating == min_rating:
        return row
    
    # Apply the formula
    return (row - min_rating) / (max_rating - min_rating)

# Normalize for each person
df_norm = df.apply(normalize, axis=1)

print("Normalized Ratings:")
print(df_norm)

print("\n" + "="*50)
print("\nAverage Normalized Rating Per Person:")
avg_norm_person = df_norm.mean(axis=1)
print(avg_norm_person)

print("\nAverage Normalized Rating Per Movie:")
avg_norm_movie = df_norm.mean(axis=0)
print(avg_norm_movie)

## Step 3: Standardize Ratings

Standardize each person's ratings to z-scores.

Formula: (rating - mean) / standard deviation
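
To make the formula concrete, the sketch below works through Maria's row, which includes the one missing rating in the data. It assumes the sample standard deviation (`ddof=1`), which is what pandas `Series.std()` uses by default. The names `maria` and `maria_z` are illustrative only.

```python
import numpy as np
import pandas as pd

# Maria's ratings from the raw data; she did not rate K-Pop Demon Hunters
maria = pd.Series([3, 1, 4, np.nan, 3, 4])

mean = maria.mean()  # NaN is skipped: 15 / 5 = 3.0
std = maria.std()    # sample std (ddof=1): sqrt(6 / 4), about 1.2247

# (rating - mean) / standard deviation
maria_z = (maria - mean) / std

print(maria_z.round(3).tolist())
# [0.0, -1.633, 0.816, nan, 0.0, 0.816]
```

The missing rating stays missing, and the remaining z-scores have mean 0 and sample standard deviation 1.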

In [None]:
# Standardize each person's ratings
def standardize(row):
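    valid = row.dropna()
    # Need at least two ratings to compute a standard deviation
    if len(valid) <= 1:
        return row
    
    mean = valid.mean()
    std = valid.std()  # pandas default: sample standard deviation (ddof=1)
    
    # Identical ratings give std == 0, which would divide by zero
    if std == 0:
        return row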
    
    # Apply the formula
    return (row - mean) / std

# Standardize for each person
df_std = df.apply(standardize, axis=1)

print("Standardized Ratings (z-scores):")
print(df_std)

print("\n" + "="*50)
print("\nAverage Standardized Rating Per Person:")
avg_std_person = df_std.mean(axis=1)
print(avg_std_person)

print("\nAverage Standardized Rating Per Movie:")
avg_std_movie = df_std.mean(axis=0)
print(avg_std_movie)

## Comparison

In [None]:
# Side-by-side comparison
print("Average by Person - Comparison:")
comparison = pd.DataFrame({
    'Original': avg_person,
    'Normalized': avg_norm_person,
    'Standardized': avg_std_person
})
print(comparison)

print("\n" + "="*50)
print("\nAverage by Movie - Comparison:")
movie_comp = pd.DataFrame({
    'Original': avg_movie,
    'Normalized': avg_norm_movie,
    'Standardized': avg_std_movie
})
print(movie_comp)

## Analysis: Normalized vs Original Ratings

### Advantages of Normalized Ratings:

1. **Fair Comparison**: People use the rating scale differently. Some give mostly 4s and 5s, others mostly 1s to 3s. Normalization rescales each person's ratings to their own range, so their preferences can be compared directly.

2. **Removes Bias**: If one person is naturally more generous with ratings and another is more critical, normalization adjusts for that. A normalized 0.5 from either person means the same thing: a rating halfway between that person's lowest and highest.

3. **Better for Recommendations**: In recommendation systems, normalized ratings show which movies each person prefers relative to the others they rated, rather than reflecting their rating habits.

4. **Consistent Scale**: Everyone's ratings fall between 0 and 1, with 0 for their lowest-rated movie and 1 for their highest.
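
The advantages above can be illustrated with a small sketch. The two raters and the movie labels are made up for this example (they are not in `movie_ratings.csv`), and `min_max` is a hypothetical helper that applies the same formula as Step 2.

```python
import pandas as pd

def min_max(ratings: pd.Series) -> pd.Series:
    """Scale one rater's scores to 0-1 (hypothetical helper)."""
    return (ratings - ratings.min()) / (ratings.max() - ratings.min())

# Made-up raters with the same preference order but different habits
movies = ["Movie A", "Movie B", "Movie C"]
generous = pd.Series([3, 4, 5], index=movies, dtype=float)
critical = pd.Series([1, 2, 3], index=movies, dtype=float)

print(min_max(generous).tolist())  # [0.0, 0.5, 1.0]
print(min_max(critical).tolist())  # [0.0, 0.5, 1.0]
```

The raw scores differ by two points on every movie, yet after normalization both raters show the same pattern of preferences.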

### Disadvantages of Normalized Ratings:

1. **Lose the Actual Scores**: A normalized value alone does not tell us whether someone rated a movie 2 or 3. It only tells us where the rating falls within that person's range.

2. **Hard to Understand**: "Normalized 0.75" is less intuitive than "rated 4 out of 5". People don't naturally think in normalized scores.

3. **Problems with Missing Data**: If someone rated only one movie, or gave every movie the same rating, there is no range to scale and the formula would divide by zero. Missing ratings also mean each person's min and max come from a different set of movies.

4. **Narrow Ranges Get Stretched**: If someone's ratings span a narrow range (say 2 to 4), the 2 becomes 0.0 and the 4 becomes 1.0, even though that person may simply avoid very high or very low ratings.
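
A short sketch of these drawbacks, again using made-up raters rather than the assignment data. `scale_01` is a hypothetical helper that applies the Step 2 formula without the guard for identical ratings, so the divide-by-zero case is visible.

```python
import pandas as pd

def scale_01(ratings: pd.Series) -> pd.Series:
    """Min-max scale one rater's scores (hypothetical helper, no guards)."""
    return (ratings - ratings.min()) / (ratings.max() - ratings.min())

# Made-up raters, not taken from movie_ratings.csv
wide = pd.Series([1, 3, 5], dtype=float)
narrow = pd.Series([2, 3, 4], dtype=float)
single = pd.Series([4], dtype=float)

# Different raw scores, identical normalized values:
# the original 1-5 scores cannot be recovered from the result
print(scale_01(wide).tolist())    # [0.0, 0.5, 1.0]
print(scale_01(narrow).tolist())  # [0.0, 0.5, 1.0]

# One rating gives max == min, so the formula is 0 / 0 and yields NaN
print(scale_01(single).isna().all())  # True
```

The narrow rater's 2 and the wide rater's 1 both become 0.0, which is why the `normalize` function in Step 2 checks for `max_rating == min_rating` before dividing.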

### When to Use Each:

- **Original**: When you need the actual scores and people use the rating scale in similar ways
- **Normalized**: When comparing people with different rating styles on a common 0-1 scale
- **Standardized**: When you need z-scores for statistical analysis, or ratings centered at 0 for each person