# Audience Decode: Behavioral Analysis Project

**Team Members:** Yelena Shabanova 320991, Alena Seliutina 333591, Luis Fernando Henriquez Patino 314661

---

### **Introduction**

This project aims to analyze viewer engagement and behavioral patterns from the `viewer_interactions.db` dataset. As per the "Audience Decode" brief, our goal is not to predict individual ratings, but to explore and model audience behavior.

We will use **Unsupervised Learning (Clustering)** to achieve this, focusing on:
1.  **User Segmentation:** Identifying distinct groups of users based on their rating behaviors (e.g., "Critical Power Users", "Casual Viewers").
2.  **Behavioral Analysis:** Understanding how these groups interact with different types of content and how preferences may evolve.

**Note on "Genre":** The project description mentions "genres," but this information is not provided in the database. To proceed with the analysis, we define our own content types (pseudo-genres) from the features available in the movie data.

### **Table of contents** 
1. **ENVIRONMENT SETUP** 
2. **DATA LOADING**  

3. **EXPLORATORY DATA ANALYSIS**  
   3.1 EDA: USER    
   3.2 EDA: MOVIES  

4. **DATA PREPROCESSING**  
   4.1 User Feature Matrix  
   4.2 Movie Feature Matrix  
   4.3 Preprocessing Summary  

5. **MOVIE CLUSTERING**  
   5.1 Selecting the Number of Clusters  
   5.2 Fitting K-Means Model  
   5.3 PCA Reduction for Visualization  
   5.4 Cluster Interpretation  
   5.5 Naming Pseudo-Genres  

6. **USER CLUSTERING**  
   6.1 K-MEANS  
   6.2 DBSCAN  
   6.3 BIRCH  
   6.4 MODEL COMPARISON  
   6.7 Naming User Clusters   
7. **USER-MOVIE PREFERENCES**  
    7.1 Merge Interactions with Clusters  
    7.2 Building Matrices  
    7.3 Temporal Analysis  
    7.4 Project Discoveries  
8. **CONCLUSIONS**  

## SECTION 1: ENVIRONMENT SETUP

First, we import all necessary libraries for data loading, analysis, visualization, and machine learning.

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

# Clustering Models
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import Birch

# Model Evaluation
from sklearn.metrics import silhouette_score

# Images set up
os.makedirs("images", exist_ok=True)

# Set styles and options
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
sns.set_palette("Set2")
%matplotlib inline
warnings.filterwarnings('ignore')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Confirm that the setup cell ran to completion
print("Libraries are imported")

## SECTION 2: DATA LOADING

We connect to the database (`viewer_interactions.db`) and load the key tables into pandas DataFrames.

* `user_statistics`: Pre-computed behavioral data for each user (our main features).
* `movie_statistics`: Pre-computed data for each movie.
* `movies`: Basic metadata for movies (title, year).

In [None]:
DB_PATH = 'viewer_interactions.db'

# Connect to database
conn = sqlite3.connect(DB_PATH)

# Load main tables
df_users = pd.read_sql_query("SELECT * FROM user_statistics", conn) # primary table for user clustering
df_movies = pd.read_sql_query("SELECT * FROM movie_statistics", conn) # for EDA and definition of genres
df_movies_meta = pd.read_sql_query("SELECT * FROM movies", conn)

# Close connection
conn.close()
print("Database connection closed.")
print(f"Loaded {len(df_users)} users")
print(f"Loaded {len(df_movies)} movies")

In [None]:
# Display the first few rows of the user data
print("User Statistics Head:")
display(df_users.head())

## SECTION 3:  EDA
## 3.1 EDA: USER 

Before we can model behavior, we must understand it. We perform an EDA on `user_statistics` to understand viewer activity patterns and the distributions of key behavioral features. This groundwork lets us later cluster users based on shared patterns.

### **User Statistics Summary**

In [None]:
# Compute the summary once, then display it and save it
user_statistics_summary = df_users[['total_ratings', 'avg_rating', 'std_rating',
                                    'unique_movies', 'activity_days']].describe()
print("User Statistics Summary:")
display(user_statistics_summary)
user_statistics_summary.to_csv("images/user_statistics_summary.csv")

The table above provides an overview of key behavioral metrics for nearly half a million users. These features help us understand how viewers engage with the platform.

#### Interpretation:
- **`total_ratings`** is highly right-skewed: 50% rate only a few titles, while a small minority rate dozens or even hundreds.
- **`avg_rating`** varies widely across users, indicating different rating styles (e.g., harsh vs. generous raters).
- **`std_rating`** shows how consistent a user is in their scoring behavior.
- **`unique_movies`** closely mirrors `total_ratings`, confirming that users rarely re-rate the same film.
- **`activity_days`** captures long-term engagement and contributes additional behavioral context.

These statistics justify clustering users based on both activity and rating style, as the user base is highly heterogeneous.

### **Distribution of Total Ratings per User**

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df_users['total_ratings'], bins=50, log_scale=True, color="#6A8CAF")
plt.title("Total Ratings per User (Log Scale)", fontsize=14, fontweight='bold')
plt.xlabel("Total Ratings")
plt.ylabel("Number of Users")
plt.grid(axis='y', alpha=0.3)
plt.savefig("images/user_total_ratings_hist.png", dpi=300)
plt.show()

This histogram shows how many ratings each user has submitted. Because the distribution is extremely skewed, a logarithmic x-axis is used to visualize both the mass of low-activity users and the long tail of highly active users.

#### Interpretation
**Most users rate very few movies:**  
- The majority provide 1-5 ratings, indicating very light engagement. 
- 75% of users rate **fewer than 11** movies.

**Moderately active users are less common:**  
- Users rating **20–50 movies** form a noticeable but smaller segment.

**A long tail of "power users" exists but is extremely rare:**  
  - A tiny fraction rate **100+ movies**.  
  - The maximum user activity is **764 ratings**, but users above 100 ratings are so rare that they appear as nearly invisible bars on the log-scale histogram.

The contrast highlights the extremely uneven engagement pattern in the dataset. This justifies treating **user activity as a primary dimension** in later clustering.

### **Distribution of Average Rating Given**

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df_users['avg_rating'], bins=40, kde=False, color="#6A8CAF")
plt.title("User Average Rating Distribution")
plt.xlabel("Average Rating Given")
plt.ylabel("Count")
plt.savefig("images/user_avg_rating_hist.png", dpi=300)
plt.show()

This histogram shows how generous or harsh users tend to be on average.

#### Interpretation
- Most users have an average rating between **3.0 and 4.0**, showing a general positive bias.
- Distinct spikes appear near whole-number values because ratings are inherently discrete (1–5).
- A noticeable number of users consistently give low or high average scores, suggesting meaningful variation in rating tendencies.

The histogram highlights diversity in rating style, which is another important dimension for user segmentation.

In [None]:
plt.figure(figsize=(8,6))
sns.violinplot(y=df_users['avg_rating'], inner='box', linewidth=1.2, color="#A56CC1")
plt.title("Distribution of Average Ratings Given by Users", fontsize=14, fontweight='bold')
plt.ylabel("Average Rating")
plt.grid(axis='y', alpha=0.3)
plt.savefig("images/user_avg_rating_violin.png", dpi=300)
plt.show()

The violin plot reveals the **full shape** of the distribution, complementing the histogram above.

#### Interpretation
- The densest region lies around **3.5–4.0**, confirming most users tend to rate positively.
- Thin tails at both extremes indicate a smaller group of consistently very harsh or very generous users.
- The distribution is not symmetric, and its shape shows substantial variability, which becomes relevant when analyzing rating stability in the next section.

### **Does User Activity Affect Rating Consistency?**

In [None]:
# Put users into activity quartiles
df_users['activity_bin'] = pd.qcut(df_users['total_ratings'], q=4, labels=["Low Activity", "Medium Activity", "High Activity", "Very High Activity"])

plt.figure(figsize=(12,6))
sns.violinplot(data=df_users, x='activity_bin', y='avg_rating', inner='box', linewidth=1.1, palette='Set2')

plt.title("Do Active Users Rate Movies Differently?", fontsize=14, fontweight='bold')
plt.xlabel("User Activity Level (Quartiles)")
plt.ylabel("Average Rating")
plt.grid(axis='y', alpha=0.3)
plt.savefig("images/user_activity_vs_rating_violin.png", dpi=300)
plt.show()

Users were grouped into **activity quartiles** based on `total_ratings`. A violin plot compares the **distribution** of average ratings across activity quartiles, showing the relationship between activity level and rating behavior.

#### Interpretation
**Mean ratings remain similar** across all activity levels: there is no systematic shift in average rating as user activity increases. *What does change is the spread:*
- **Low-activity users** display a much wider variability in average ratings, including extreme values.
- **Medium and high-activity users** concentrate closer to the 3–4 range, suggesting more stable and moderate scoring tendencies.
- **Very high-activity users** show the narrowest spread, indicating the most consistent behavior.

These differences demonstrate that **activity level influences behavioral consistency**, a nuance that should be captured when clustering users.
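
The visual impression of a narrowing spread can be checked numerically. The sketch below is illustrative only: it builds a synthetic stand-in for `df_users` (the data, `demo_users`, and `spread` are hypothetical, not the project's), bins users into quartiles, and reports the standard deviation and interquartile range of `avg_rating` per bin. Ranking with `method="first"` before `pd.qcut` is an assumption added here so that heavily tied counts still produce four unique bin edges.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_users (hypothetical data, not the project's).
rng = np.random.default_rng(42)
n_users = 2000
total_ratings = rng.geometric(p=0.15, size=n_users)  # skewed activity, minimum 1

# Each user's average is the mean of `total_ratings` discrete 1-5 ratings.
avg_rating = np.array([rng.integers(1, 6, size=n).mean() for n in total_ratings])
demo_users = pd.DataFrame({"total_ratings": total_ratings, "avg_rating": avg_rating})

# rank(method="first") breaks ties, so qcut always finds four unique bin edges.
demo_users["activity_bin"] = pd.qcut(
    demo_users["total_ratings"].rank(method="first"),
    q=4,
    labels=["Low Activity", "Medium Activity", "High Activity", "Very High Activity"],
)

# Mean, standard deviation, and interquartile range of avg_rating per bin.
spread = demo_users.groupby("activity_bin", observed=True)["avg_rating"].agg(
    mean="mean",
    std="std",
    iqr=lambda s: s.quantile(0.75) - s.quantile(0.25),
)
print(spread)
```

On the synthetic data the means stay close together while `std` and `iqr` shrink as activity grows, which is the same pattern the violin plot suggests for the real users.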

### **Unique Movies vs Total Ratings**

In [None]:
sample_users = df_users.sample(8000, random_state=42)

plt.figure(figsize=(10,5))
sns.scatterplot(
    data=sample_users,
    x='unique_movies',
    y='total_ratings',
    alpha=0.35,
    s=50
)
plt.yscale("log")
plt.title("Unique Movies vs Total Ratings (Sample of 8k Users)", fontsize=14, fontweight='bold')
plt.xlabel("Unique Movies Rated")
plt.ylabel("Total Ratings (Log Scale)")
plt.grid(alpha=0.3)
plt.savefig("images/user_unique_vs_total_ratings_scatter.png", dpi=300)
plt.show()

In the scatterplot above, each point represents one user.
We use a sample of 8,000 users to avoid overplotting.

#### Interpretation
- Points fall close to a single curve. The curvature comes from plotting a log-scaled y-axis against a linear x-axis; the underlying relationship between `unique_movies` and `total_ratings` is near-linear.
- This confirms that users almost never re-rate the same movie.
- The log-scaled y-axis reveals variation across low-activity and high-activity users.

Because these two features are so tightly correlated, using both in clustering may introduce redundancy. The following heatmap confirms this insight quantitatively.

### **Correlation Between User Behavioral Features**

In [None]:
plt.figure(figsize=(7,5))
sns.heatmap(
    df_users[['total_ratings','avg_rating','std_rating',
              'unique_movies','activity_days']].corr(),
    annot=True,
    cmap="coolwarm",
    fmt=".2f"
)
plt.title("Correlation Between User Behavioral Features", fontsize=14, fontweight='bold')
plt.savefig("images/user_features_correlation_heatmap.png", dpi=300)
plt.show()

This heatmap summarizes linear relationships between the most relevant user behavior variables.

#### Interpretation
- **`total_ratings` and `unique_movies`** are almost perfectly correlated (ρ ≈ 0.97), indicating both measure the same behavior.
- **`avg_rating`** shows **no linear correlation** with user activity metrics (`total_ratings`, `unique_movies`, `activity_days`), meaning that on average, highly active users do not systematically rate movies higher or lower than less active users.
- **`std_rating`** is moderately negatively correlated with `avg_rating`, suggesting more generous users tend to be more consistent.
- **`activity_days`** shows a reasonable positive correlation with total ratings, consistent with expectations.

The heatmap guides feature selection for clustering:
- Avoid redundant variables (`total_ratings` and `unique_movies`).
- Combine **activity**, **consistency**, and **rating style** to capture distinct user personas.
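
The redundancy check can also be automated rather than read off the heatmap. The following is a minimal sketch, assuming a hypothetical helper `redundant_pairs` and a 0.9 cutoff chosen only for illustration; the toy frame `demo_features` is synthetic and merely mimics the near-duplicate relationship described above.

```python
import numpy as np
import pandas as pd


def redundant_pairs(features: pd.DataFrame, threshold: float = 0.9):
    """Return (feature_a, feature_b, r) for pairs with |Pearson r| >= threshold."""
    corr = features.corr()  # Pearson correlation by default
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Synthetic data: unique_movies equals total_ratings minus an occasional re-rate.
rng = np.random.default_rng(0)
n = 1000
total = rng.geometric(p=0.1, size=n)
rerates = rng.binomial(1, 0.05, size=n)
demo_features = pd.DataFrame({
    "total_ratings": total,
    "unique_movies": np.maximum(total - rerates, 1),
    "avg_rating": rng.normal(3.6, 0.6, size=n).clip(1, 5),
})

pairs = redundant_pairs(demo_features)
print(pairs)
```

Only the `total_ratings` / `unique_movies` pair is flagged on this toy data; whether to drop one of the two or keep both is a modelling choice that the helper does not make.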

## 3.2 EDA: MOVIES

Using `movie_statistics`, our goal is to understand movie popularity and rating patterns, so that we can later group movies into pseudo-genre clusters. This section answers four core questions:

1. How popular is each movie?
2. How do movies tend to be rated?
3. What relationships exist between popularity, rating, controversy, and release year?
4. Are there meaningful patterns for clustering movies into “pseudo-genres”?

### **Movie Statistics Summary**

In [None]:
# Compute the summary once, then display it and save it
movie_statistics_summary = df_movies[['total_ratings', 'avg_rating',
                                      'std_rating', 'year_of_release']].describe()
print("Movie Statistics Summary:")
display(movie_statistics_summary)
movie_statistics_summary.to_csv("images/movie_statistics_summary.csv")

This table provides a quick overview of movie-level statistics derived from all user ratings. Before exploring the plots, it is useful to understand the distribution of movie-level metrics.

#### Interpretation

**Popularity is extremely skewed**:  
  - The median movie receives **1 rating**.  
  - 75% of movies receive **2 or fewer ratings**.  
  - Only a tiny fraction reach anywhere above **100 ratings**.
  - Few films reach **10,000+ ratings** (up to 173,598).

**Average ratings are centered around 3.0**, with a typical spread of about 0.8.  
  This places the typical movie near the middle of the 1–5 scale, i.e. a neutral to mildly positive reception.

**Rating variability (std_rating)** is usually between **0.7 and 1.3**, indicating that user opinions are generally consistent.  
  A few movies have very high variability, marking them as controversial.

**Release years span 1896–2005**, but the distribution is mostly flat and uncorrelated with other features.

Overall, the table confirms that the movie catalog contains **many obscure movies**, a moderate number of mid-known movies, and a **tiny number of blockbusters**.  
This extreme imbalance will strongly shape the interpretation of all following plots.

### **Distribution of Movie Popularity**

In [None]:
# Popularity tiers
bins = [0, 1, 2, 5, 10, 50, 200, 1000, 10000, df_movies['total_ratings'].max()]
labels = [
    "1 rating",
    "2 ratings",
    "3–5 ratings",
    "6–10 ratings",
    "11–50 ratings",
    "51–200 ratings",
    "201–1,000 ratings",
    "1,001–10,000 ratings",
    "10,001+ ratings"
]

df_movies['popularity_bin'] = pd.cut(
    df_movies['total_ratings'],
    bins=bins,
    labels=labels,
    include_lowest=True
)

plt.figure(figsize=(12,6))
ax = sns.countplot(
    data=df_movies,
    x='popularity_bin',
    order=labels,
    color="#6A8CAF"
)

plt.title("Movie Popularity Tiers (Total Ratings)", fontsize=14, fontweight='bold')
plt.xlabel("Popularity Tier")
plt.ylabel("Number of Movies")
plt.grid(axis='y', alpha=0.3)
plt.xticks(rotation=25)

# Annotate counts
for patch in ax.patches:
    height = patch.get_height()
    ax.annotate(
        f"{int(height)}",  # bar heights are floats; show whole-number counts
        (patch.get_x() + patch.get_width() / 2., height),
        ha='center', va='bottom',
        fontsize=9,
        xytext=(0, 3),
        textcoords='offset points'
    )

plt.tight_layout()
plt.savefig("images/movie_popularity_tiers_bar.png", dpi=300)
plt.show()


#### Interpretation
To better understand the extreme skew in movie popularity, we group movies into meaningful popularity ranges.

- The first two categories, **“1 rating”** and **“2 ratings”**, dominate the catalog  
  (together they account for over **12,000 movies**): most titles were rated only once or twice.

- The **“3–5 ratings”** tier still contains a large number of movies (~1,800),  
  but after that the distribution drops off a cliff.

- The **“6–10”** and **“11–50”** tiers contain only a **handful** of titles, showing that very few movies sit in this intermediate space.

- The number of movies then rises slightly again for:
  - **“51–200 ratings”** and **“201–1,000 ratings”** – a few hundred mid-popular films with a visible but not massive audience.
  - **“1,001–10,000 ratings”** – around two hundred well-known mainstream titles.
  - **“10,001+ ratings”** – about **80 true blockbusters**, with the most popular movie reaching **173,598 ratings**.

This tiered view makes the long-tail structure of the catalog clear:  
the platform is dominated by **very obscure movies**, with a much smaller group of mid-popular titles and a tiny set of highly popular blockbusters.  
Popularity is therefore a crucial dimension to include when clustering movies.
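
Because the tier boundaries matter for reading the chart, it helps to see how `pd.cut` treats values that sit exactly on a bin edge. The sketch below reuses the notebook's bin edges and labels on a handful of hand-picked counts (the `counts` series is made up for illustration). Intervals are right-inclusive, and `include_lowest=True` closes the first interval on the left.

```python
import pandas as pd

# Hand-picked rating counts that sit on or next to the bin edges (illustrative).
counts = pd.Series([1, 2, 3, 5, 6, 10, 11, 200, 201, 173598])

bins = [0, 1, 2, 5, 10, 50, 200, 1000, 10000, counts.max()]
labels = [
    "1 rating",
    "2 ratings",
    "3–5 ratings",
    "6–10 ratings",
    "11–50 ratings",
    "51–200 ratings",
    "201–1,000 ratings",
    "1,001–10,000 ratings",
    "10,001+ ratings",
]

# Each bin is (left, right]; the first one becomes [0, 1] via include_lowest.
tiers = pd.cut(counts, bins=bins, labels=labels, include_lowest=True)
print(pd.DataFrame({"total_ratings": counts, "tier": tiers}))
```

A count of 200 lands in "51–200 ratings" and 201 in "201–1,000 ratings", so the labels match the edges exactly. Note that the last edge must exceed 10,000 for the bins to stay increasing, which holds here because the maximum is 173,598.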

### **Distribution of Average Movie Ratings**

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df_movies['avg_rating'], bins=40, color="#8FB996")
plt.title("Distribution of Movie Average Ratings", fontsize=14, fontweight='bold')
plt.xlabel("Average Rating")
plt.ylabel("Count")
plt.grid(axis='y', alpha=0.3)
plt.savefig("images/movie_avg_rating_hist.png", dpi=300)
plt.show()

#### Interpretation

Movie average ratings cluster strongly around integer values (1, 2, 3, 4, 5), reflecting the discrete nature of the rating system.

Key observations:

- Most movies fall between **2.5 and 4.0**.  
- Very low-rated (<1.5) and very high-rated (>4.5) movies are rare.  
- Peaks at integer values occur because movies with very few ratings often have an average equal to their single rating.

Overall, movie quality scores are moderately positive, and extremely negative or extremely positive movies are uncommon.
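
The link between low rating counts and integer-valued averages can be illustrated with a small simulation. This is a sketch on synthetic data only (uniform random 1–5 ratings, hypothetical names `demo_movies` and `share`); it shows the mechanism and makes no claim about the actual catalog.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic movies with 1, 2, or 50 ratings each (hypothetical mix).
counts = np.array([1] * 500 + [2] * 300 + [50] * 200)
avg = np.array([rng.integers(1, 6, size=c).mean() for c in counts])
demo_movies = pd.DataFrame({"total_ratings": counts, "avg_rating": avg})

# Flag averages that are whole numbers.
demo_movies["is_integer_avg"] = np.isclose(demo_movies["avg_rating"] % 1, 0)

# Share of whole-number averages for each rating count.
share = demo_movies.groupby("total_ratings")["is_integer_avg"].mean()
print(share)
```

Every single-rating movie has a whole-number average by construction, while movies with 50 ratings almost never do, so a catalog dominated by one- and two-rating titles will show spikes at integer values.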

### **Rating Variability (Controversy) Distribution**

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df_movies['std_rating'], bins=40, color="#D4A5A5")
plt.title("Distribution of Movie Rating Variability (std_rating)", fontsize=14, fontweight='bold')
plt.xlabel("Rating Standard Deviation (Controversy)")
plt.ylabel("Count")
plt.grid(axis='y', alpha=0.3)
plt.savefig("images/movie_std_rating_hist.png", dpi=300)
plt.show()

#### Interpretation
**Most movies have a standard deviation between 0.7 and 1.3**, meaning users generally agree about their quality.

**A smaller number of movies show higher variability (2.0 - 3.0)**, indicating:
- controversial themes
- niche appeal
- movies whose style divides audiences (love/hate reactions)

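**A few movies have very high standard deviation (>3.0)**, marking them as extreme outliers. For ratings bounded in 1–5, the sample standard deviation cannot exceed about 2.83 (two ratings, one of 1 and one of 5), so values this high more likely point to out-of-range entries in the raw data than to genuine controversy.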

This "controversy" dimension helps differentiate stable movies from polarizing ones when forming clusters.
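
A quick bound helps calibrate the ranges quoted above. The sketch below enumerates every possible set of *n* ratings on a 1–5 scale and reports the largest achievable standard deviation. It assumes `std_rating` is a sample standard deviation (`ddof=1`, the pandas default); how the database actually computed it is not stated, and with a population standard deviation the ceiling would be 2.0 instead.

```python
from itertools import combinations_with_replacement

import numpy as np


def max_sample_std(n_ratings: int, scale=(1, 2, 3, 4, 5)) -> float:
    """Largest sample std (ddof=1) over all multisets of n ratings on the scale."""
    return max(
        np.std(combo, ddof=1)
        for combo in combinations_with_replacement(scale, n_ratings)
    )


# Upper bound on std_rating for movies with 2 to 6 ratings.
bounds = {n: round(float(max_sample_std(n)), 3) for n in range(2, 7)}
print(bounds)
```

Under this assumption the ceiling is about 2.83, reached with exactly two ratings (1 and 5), and it falls as the number of ratings grows. Any `std_rating` above that level would be worth checking against the raw interactions.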

### **Relationship Plots**
We check whether movie features correlate, sampling at most 8,000 movies to avoid overplotting.

In [None]:
# POPULARITY VS AVERAGE RATING

sample_movies = df_movies.sample(n=min(8000, len(df_movies)), random_state=42)

plt.figure(figsize=(10,5))
sns.scatterplot(
    data=sample_movies, 
    x='avg_rating', 
    y='total_ratings', 
    alpha=0.35, 
    s=40
)
plt.yscale("log")
plt.title("Movie Popularity vs Average Rating", fontsize=14, fontweight='bold')
plt.xlabel("Average Rating")
plt.ylabel("Total Ratings (Log Scale)")
plt.grid(alpha=0.3)
plt.savefig("images/movie_popularity_vs_avg_rating_scatter.png", dpi=300)
plt.show()

#### Interpretation 

For the vast majority of movies (those with fewer than ~10 ratings), popularity provides almost no useful signal.  
These movies cluster at the bottom of the plot and produce the flat band near y ≈ 1–10.

Among the small subset of more widely rated movies:

- Popularity spans all rating levels.  
- Highly rated movies are not necessarily popular.  
- Low-rated movies can still be widely watched.

Overall, **movie popularity and perceived quality are largely independent dimensions.**

In [None]:
# CONTROVERSY VS POPULARITY

plt.figure(figsize=(10,5))
sns.scatterplot(
    data=sample_movies, 
    x='std_rating', 
    y='total_ratings', 
    alpha=0.35, 
    s=40
)
plt.yscale("log")
plt.title("Controversy vs Popularity", fontsize=14, fontweight='bold')
plt.xlabel("Rating Standard Deviation (Controversy)")
plt.ylabel("Total Ratings (Log Scale)")
plt.grid(alpha=0.3)
plt.savefig("images/movie_controversy_vs_popularity_scatter.png", dpi=300)
plt.show()

#### Interpretation

The scatter plot again shows that for movies with very few ratings, variability is not meaningful.

Among more popular movies:

- Most fall within a moderate variability range (0.7–1.3).  
- Controversy does not predict how widely a movie is watched.  
- Extremely controversial movies tend to be unpopular, likely because niche films accumulate fewer ratings.

Overall, controversy and popularity appear largely independent; the exception is the extreme-variability tail, which is driven by small rating counts.

### **Correlation Between Movie Features**

In [None]:
plt.figure(figsize=(7,5))
sns.heatmap(
    df_movies[['total_ratings','avg_rating','std_rating','year_of_release']].corr(),
    annot=True,
    cmap="coolwarm",
    fmt=".2f"
)
plt.title("Correlation Between Movie Features", fontsize=14, fontweight='bold')
plt.savefig("images/movie_features_correlation_heatmap.png", dpi=300)
plt.show()

#### Interpretation
The heatmap confirms that:

- **Popularity (total_ratings)** is almost independent from  
  - average rating  
  - rating variability  
  - release year  

- **Average rating and controversy** show no significant relationship.

- **Release year** does not meaningfully affect popularity or rating level in this dataset.

The weak correlations match the behavior observed in the scatter plots: each feature contributes independent information, which is ideal for clustering.

## SECTION 4: DATA PREPROCESSING
Before we can proceed with clustering, we need to:

1. **Select meaningful features** for users and movies.  
2. **Handle missing values**  
3. **Scale features** so that variables on large scales do not dominate smaller ones.

Since clustering is unsupervised, we do not split the data into train/test sets.  
Instead, we build two clean feature matrices:

- `X_users_kmeans` – standardized user features  
- `X_movies_kmeans` – standardized movie features

These matrices will be the input for all clustering methods in the next sections.

### 4.1 User Feature Matrix

For users, we want to capture both **activity** and **rating behaviour**.  
We use the following features from `df_users`:

- `total_ratings` – how many ratings the user has submitted  
- `unique_movies` – how many distinct movies they rated  
- `avg_rating` – how generous/strict they are on average  
- `std_rating` – how consistent their ratings are  
- `activity_days` – over how many days they have been active

In [None]:
# 4.1 USER FEATURE MATRIX

user_features = [
    'total_ratings',
    'unique_movies',
    'avg_rating',
    'std_rating',
    'activity_days'
]

# Keep only the selected columns
df_users_features = df_users[user_features].copy()

print("Missing values in raw user features:")
print(df_users_features.isna().sum())  # NaNs are expected here

# Making sure ratings are within valid range
df_users_features['avg_rating'] = df_users_features['avg_rating'].clip(1, 5)

# Fix missing std for users who rated single movie
single_rating = (df_users_features['total_ratings'] == 1) & (df_users_features['std_rating'].isna())
df_users_features.loc[single_rating, 'std_rating'] = 0.0

# Impute missing values (median is robust to skew)
user_imputer = SimpleImputer(strategy='median')
X_users_imputed = user_imputer.fit_transform(df_users_features)

# Standardize features
user_scaler = StandardScaler()
X_users_kmeans = user_scaler.fit_transform(X_users_imputed)

# Store as DataFrame for easier inspection
X_users_kmeans_df = pd.DataFrame(
    X_users_kmeans,
    columns=user_features,
    index=df_users['customer_id']
)

print("\nUser feature matrix for clustering (standardized):")
print(X_users_kmeans_df.head())
print(f"\nX_users_kmeans shape: {X_users_kmeans_df.shape}")

#### **Missing Values — Users**

User features contained natural missing values:
- `total_ratings`, `unique_movies`, `avg_rating`, `activity_days` → missing for ~22k users  
- `std_rating` → missing for ~106k users (undefined when a user rated only one movie)

To ensure consistency:
- Users with exactly **one rating** had their `std_rating` explicitly set to **0**  
  (since variability cannot be computed from a single observation).
- All remaining missing values were filled using the **median**, which is robust for skewed data.

After imputation and scaling, the user matrix had **no remaining NaNs**.
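
The three preprocessing rules can be traced on a toy table small enough to check by hand. This is a sketch with made-up values (the frame `toy` is hypothetical); it mirrors the order of operations used above: zero-fill `std_rating` for single-rating rows, median-impute what remains, then standardize.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy user table with the same kinds of gaps as the real one (made-up values).
toy = pd.DataFrame({
    "total_ratings": [1, 1, 4, 10, np.nan, 250],
    "avg_rating": [5.0, 2.0, 3.5, 3.8, np.nan, 3.6],
    "std_rating": [np.nan, np.nan, 1.1, 0.9, np.nan, 1.0],
})

# Rule 1: a single rating has no spread, so std is set to 0 by convention.
single = (toy["total_ratings"] == 1) & toy["std_rating"].isna()
toy.loc[single, "std_rating"] = 0.0

# Rule 2: remaining gaps receive the column median.
imputed = SimpleImputer(strategy="median").fit_transform(toy)

# Rule 3: standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(imputed)

toy_scaled = pd.DataFrame(scaled, columns=toy.columns)
print(toy_scaled.round(2))
print("Any NaNs left:", bool(np.isnan(scaled).any()))
```

One consequence of this ordering is that the zeros written in the first step take part in the median used in the second step, which pulls the imputed `std_rating` downward (0.9 in this toy table).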

### 4.2 Movie Feature Matrix

For movies, we focus on **popularity, quality, controversy, and time**:

- `total_ratings` – how many users rated the movie (popularity)  
- `avg_rating` – average rating (perceived quality)  
- `std_rating` – rating variability (controversy)  
- `year_of_release` – temporal dimension (older vs newer titles)


In [None]:
# 4.2 MOVIE FEATURE MATRIX

movie_features = [
    'total_ratings',
    'avg_rating',
    'std_rating',
    'year_of_release'
]

df_movies_features = df_movies[movie_features].copy()

print("Missing values in raw movie features:")
print(df_movies_features.isna().sum())

# Making sure avg_rating is within valid range 
df_movies_features['avg_rating'] = df_movies_features['avg_rating'].clip(1, 5)

# Fix missing std for movies that were rated once
one_rating_movies = ((df_movies_features['total_ratings'] == 1) & (df_movies_features['std_rating'].isna()))
df_movies_features.loc[one_rating_movies, 'std_rating'] = 0.0

# Impute missing values
movie_imputer = SimpleImputer(strategy='median')
X_movies_imputed = movie_imputer.fit_transform(df_movies_features)

# Standardize
movie_scaler = StandardScaler()
X_movies_kmeans = movie_scaler.fit_transform(X_movies_imputed)

# DataFrame version
X_movies_kmeans_df = pd.DataFrame(
    X_movies_kmeans,
    columns=movie_features,
    index=df_movies['movie_id']
)

print("\nMovie feature matrix for clustering (standardized):")
print(X_movies_kmeans_df.head())
print(f"\nX_movies_kmeans shape: {X_movies_kmeans_df.shape}")

#### **Missing Values — Movies**

Movie features showed:
- `total_ratings`, `avg_rating` → 800 missing  
- `std_rating` → 9,168 missing (single-rating movies)  
- `year_of_release` → 4,511 missing  

Movies with **only one rating**, like single-rating users, had `std_rating` set to **0**, on the convention that a single rating shows no disagreement.

We chose to retain `year_of_release` and impute it, since temporal information is useful for describing movie behaviour and is included in course clustering examples.

All movie missing values were filled with the **median** for stability.

After imputation and scaling, the movie matrix had **no missing values**.


### 4.3 Preprocessing Summary
Missing values in both datasets were mostly systematic, rather than errors:

- Users with only one rating cannot have a computed `std_rating`
- Movies with no ratings lack `avg_rating` and `std_rating`
- Many movies lack `year_of_release`
- Low-activity users miss several statistics

How it was handled:
- `std_rating` for single-rating users and movies was explicitly set to 0
- All other missing numeric features were filled with median  
- `year_of_release` was retained and median-imputed because this information is important for future analysis

At this point we have:
- `X_users_kmeans` – standardized user feature matrix (shape: *number of users × 5*) 
- `X_movies_kmeans` – standardized movie feature matrix  (shape: *number of movies × 4*)

This preprocessing ensures the clustering algorithms will operate reliably in the following sections.


## SECTION 5: MOVIE CLUSTERING
In this section, we apply clustering techniques to the standardized movie feature matrix prepared earlier.  
The goal is to identify **behavior-based groups of movies** (pseudo-genres), using:

- movie popularity (`total_ratings`)  
- perceived quality (`avg_rating`)  
- rating consistency (`std_rating`)  
- release year (`year_of_release`)  

Following the structure used in *Clustering.ipynb*, we evaluate different values of *k* using the:

- **Elbow Method** (inertia)  
- **Silhouette Score**  

We then fit the final KMeans model, visualize clusters using **PCA(2)**, and interpret each cluster.

### 5.1 Selecting the Number of Clusters

In [None]:
k_values = range(2, 11)

inertia_scores = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_movies_kmeans)
    
    inertia_scores.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_movies_kmeans, labels))

print("Inertia:", inertia_scores)
print("Silhouette:", silhouette_scores)

In [None]:
plt.figure(figsize=(9,4))
plt.plot(k_values, inertia_scores, marker='o', linestyle='-')
plt.title("Elbow Method (KMeans Inertia)", fontsize=14, fontweight='bold')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.grid(True, alpha=0.3)
plt.savefig("images/movie_kmeans_elbow_inertia.png", dpi=300)
plt.show()

In [None]:
plt.figure(figsize=(9,4))
plt.plot(k_values, silhouette_scores, marker='o', linestyle='-')
plt.title("Silhouette Scores for Different k", fontsize=14, fontweight='bold')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.grid(True, alpha=0.3)
plt.savefig("images/movie_kmeans_silhouette_scores.png", dpi=300)
plt.show()

The elbow plot shows where inertia begins to flatten, indicating diminishing improvement as *k* increases.  
The silhouette plot shows how well-defined the clusters are.

Based on these two diagnostics, we choose **k = 6** because:

- it lies near the elbow of the inertia curve  
- it achieves one of the highest silhouette scores  
- it balances interpretability with cluster separation  

We proceed with this value in the KMeans model.
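
To make the selection procedure concrete, the sketch below runs the same inertia and silhouette loop on synthetic, well-separated blobs, where the right answer is known in advance. The data, the four hand-placed centers, and the names `X_demo`, `results`, and `best_k` are all assumptions for illustration; real movie features are far less cleanly separated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Four well-separated synthetic groups (hypothetical data with a known answer).
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X_raw, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.6, random_state=42)
X_demo = StandardScaler().fit_transform(X_raw)

# Record inertia and silhouette for each candidate k.
results = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = model.fit_predict(X_demo)
    results[k] = {
        "inertia": model.inertia_,
        "silhouette": silhouette_score(X_demo, labels_k),
    }

# Pick the k with the highest silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])
for k, scores in results.items():
    print(k, round(scores["inertia"], 1), round(scores["silhouette"], 3))
print("Best k by silhouette:", best_k)
```

Taking the silhouette maximum is only one input. On real data the peak is often shallow, which is why the choice above also weighs the elbow and the interpretability of the resulting groups.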

### 5.2 Fitting KMeans Model

In [None]:
optimal_k = 6

kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
movie_labels = kmeans_final.fit_predict(X_movies_kmeans)

df_movies['cluster'] = movie_labels

print(df_movies[['movie_id', 'avg_rating', 'total_ratings', 'std_rating', 'year_of_release', 'cluster']].head())

### 5.3 PCA Reduction for Visualization

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_movies_pca = pca.fit_transform(X_movies_kmeans)

df_movies['pca1'] = X_movies_pca[:, 0]
df_movies['pca2'] = X_movies_pca[:, 1]

plt.figure(figsize=(8,6))
sns.scatterplot(
    data=df_movies.sample(n=min(8000, len(df_movies)), random_state=42),
    x='pca1', y='pca2',
    hue='cluster',
    palette='Set2',
    alpha=0.6
)
plt.title("Movie Clusters Visualized in 2D (PCA)", fontsize=14, fontweight='bold')
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.legend(title="Cluster")
plt.savefig("images/movie_clusters_pca_scatter.png", dpi=300)
plt.show()
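
A two-dimensional PCA view is only as trustworthy as the share of variance it retains, so it is worth checking `explained_variance_ratio_` before reading too much into the scatter plot. The following sketch uses random synthetic features (hypothetical `X_demo`, four columns with one correlated pair) purely to show the check.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic four-feature matrix with one correlated pair (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
X_demo[:, 1] += 0.8 * X_demo[:, 0]
X_std = StandardScaler().fit_transform(X_demo)

# Fit a 2-component PCA and inspect how much variance each component keeps.
pca_demo = PCA(n_components=2).fit(X_std)
ratios = pca_demo.explained_variance_ratio_

print("Variance explained per component:", ratios.round(3))
print("Total captured in 2D:", round(float(ratios.sum()), 3))
```

If the two components capture only a modest share of the variance, clusters that overlap in the 2D plot may still be well separated in the full feature space.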

### **5.4 Cluster Interpretation**

In [None]:
# Compute the per-cluster summary once, then display it and save it
movie_cluster_summary = df_movies.groupby('cluster')[['total_ratings', 'avg_rating',
                                                      'std_rating', 'year_of_release']].agg(['mean', 'median'])
display(movie_cluster_summary)
movie_cluster_summary.to_csv("images/movie_cluster_summary.csv")

#### Interpreting Movie Clusters

After fitting the KMeans model with *k = 6*, each movie is assigned to one of six behavioral clusters.  
The table below summarizes the characteristics of each group.

| Cluster | Avg Popularity *(total_ratings)* | Avg Rating | Rating Variability | Mean Year | Interpretation                                                                |
|--------|----------------------------------|------------|---------------------|-----------|-------------------------------------------------------------------------------|
| **0** | ~90 ratings | **3.05** | ~1.09 | 1958 | **Older films** with moderate engagement and moderate scores                  |
| **1** | ~18 ratings | **4.09** | ~0.46 | 1997 | **Well-liked niche films** with small audiences but very positive reception   |
| **2** | ~202 ratings | **3.05** | **1.77** | 1995 | **Controversial films** that divide viewers due to high variability           |
| **3** | ~99,310 ratings | **3.76** | ~0.98 | 1996 | **Blockbuster titles** with huge visibility and stable ratings                |
| **4** | ~25,233 ratings | **3.55** | ~1.02 | 1992 | **Popular wide-audience films** with strong engagement and consistent ratings |
| **5** | ~2.5 ratings | **1.48** | ~0.55 | 1996 | **Unnoticed, low-rated films** with almost no audience                        |

These clusters form clear behavior-based genres:

- **Well-liked niche titles** (Cluster 1)  
- **Massive blockbusters** (Cluster 3)  
- **Popular general-audience films** (Cluster 4)  
- **Older moderately viewed films** (Cluster 0)  
- **Controversial films** with strong disagreement (Cluster 2)  
- **Unnoticed disliked movies** (Cluster 5)

These clusters represent **pseudo-genres derived from audience behavior**, not content.  
They will be used later to analyze:

- which user types prefer which movie types  
- how preferences differ with engagement levels  
- how audience behavior varies across pseudo-genres  


### **5.5 Naming Pseudo-Genres**

To support clearer analysis in later sections, we give each cluster a descriptive name summarizing its viewing dynamics.

| Cluster | Name                     |
|--------|--------------------------|
| **0** | *Old Classics*           |
| **1** | *Well-Liked Niche Films* |
| **2** | *Controversial Films*    |
| **3** | *Blockbusters*           |
| **4** | *Popular Films*          |
| **5** | *Hated Invisible Films*  |

These names reflect key behavioral dimensions:

- **Popularity** (reach of the film)  
- **Perceived quality** (average rating)  
- **Consensus vs disagreement** (variability)  
- **Temporal patterns** (release year)

These pseudo-genres will serve as the foundation for the **User × Movie preference analysis** in Section 7.
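
A minimal sketch of how these names could be attached to the movie table for later sections. The toy `df_movies_demo` frame and the `pseudo_genre` column name are assumptions made for illustration; only the cluster-to-name mapping comes from the table above.

```python
import pandas as pd

# Cluster-to-name mapping taken from the table above
movie_cluster_names = {
    0: "Old Classics",
    1: "Well-Liked Niche Films",
    2: "Controversial Films",
    3: "Blockbusters",
    4: "Popular Films",
    5: "Hated Invisible Films",
}

# Toy stand-in for df_movies (hypothetical rows, illustration only)
df_movies_demo = pd.DataFrame({"movie_id": [101, 102, 103], "cluster": [3, 0, 5]})

# Map numeric cluster labels to readable pseudo-genre names
df_movies_demo["pseudo_genre"] = df_movies_demo["cluster"].map(movie_cluster_names)
print(df_movies_demo)
```

Keeping the numeric `cluster` column alongside the readable name lets later pivots use either one.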


## SECTION 6: USER CLUSTERING

This section applies clustering algorithms to the standardized user feature matrix created in Section 4.  
The goal is to uncover **behavior-based types of viewers**, using:

- activity level (`total_ratings`, `unique_movies`, `activity_days`)
- rating style (`avg_rating`, `std_rating`)

We evaluate three clustering models:

- **K-Means** 
- **DBSCAN**   
- **BIRCH** 


### 6.1 K-MEANS

In [None]:
# 6.1.1 DETERMINE NUMBER OF CLUSTERS
sample_size = min(50000, X_users_kmeans.shape[0]) #sampling for silhouette evaluation only
np.random.seed(42)
sample_idx = np.random.choice(X_users_kmeans.shape[0], sample_size, replace=False)
X_users_sample = X_users_kmeans[sample_idx]

k_values = range(2, 11)
sil_scores_users = []
inertia_users = []

for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_users_sample)
    
    sil = silhouette_score(X_users_sample, labels)
    sil_scores_users.append(sil)
    
    # inertia must be computed on sample too
    inertia_users.append(km.inertia_)
    
print("Silhouette:", sil_scores_users)
print("Inertia:", inertia_users)

In [None]:
# PLOTTING SILHOUETTE SCORES
plt.figure(figsize=(9,4))
plt.plot(k_values, sil_scores_users, marker='o', linestyle='-')
plt.title("Silhouette Scores for Different k", fontsize=14, fontweight='bold')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.grid(True, alpha=0.3)
plt.savefig("images/user_kmeans_silhouette_scores.png", dpi=300)
plt.show()

In [None]:
# PLOTTING ELBOW METHOD
plt.figure(figsize=(9,4))
plt.plot(k_values, inertia_users, marker='o', linestyle='-')
plt.title("Elbow Method (KMeans Inertia)", fontsize=14, fontweight='bold')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.grid(True, alpha=0.3)
plt.savefig("images/user_kmeans_elbow_inertia.png", dpi=300)
plt.show()

# PLOTTING VISUALIZER FOR ELBOW
from yellowbrick.cluster import KElbowVisualizer

dataset = X_users_kmeans  
model = KMeans(random_state=42)  # fixed seed so the curve is reproducible
visualizer = KElbowVisualizer(model, k=(2, 11), timings=True)
visualizer.fit(dataset)
# Save first: show() without outpath renders the figure and may close it
visualizer.show(outpath="images/user_kmeans_distortion_score.png")
visualizer.show()

#### **6.1.1 Determine the Number of User Clusters**

To determine the optimal number of clusters for user segmentation, we evaluated *k = 2…10* using two standard diagnostics:

- **Inertia (Elbow Method):** measures how tightly points group within clusters.
- **Silhouette Score:** measures separation between clusters (higher is better).

The silhouette score reached its **global maximum at k = 2**, but such a small number of clusters would oversimplify user behavior (e.g., only “active vs. inactive”).  
Among all solutions with **k > 2**, the highest silhouette score occurred at **k = 6**.

The elbow appears at **k = 5**, after which the inertia curve flattens, indicating diminishing returns from adding more clusters.

Given that:
- *k = 6* provides the best silhouette among meaningful multi-cluster solutions,  
- the elbow method supports a similar number of clusters,  
- higher-k solutions allow richer behavioral interpretation (important for Audience Decode),

we select **k = 6** as the optimal value for KMeans user clustering.
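
The selection rule described above (take the best silhouette among solutions with k > 2) can be sketched as a small helper. The function name `best_k_above` and the demo scores are hypothetical; the numbers are made up for illustration and are not the notebook's results.

```python
def best_k_above(k_values, scores, min_k):
    """Return (k, score) for the k > min_k with the highest silhouette score."""
    candidates = [(score, k) for k, score in zip(k_values, scores) if k > min_k]
    if not candidates:
        raise ValueError("no candidate k above min_k")
    best_score, best_k = max(candidates)
    return best_k, best_score


# Made-up scores for k = 2..10 (illustration only)
demo_k = list(range(2, 11))
demo_scores = [0.50, 0.30, 0.31, 0.32, 0.34, 0.29, 0.27, 0.26, 0.25]

print(best_k_above(demo_k, demo_scores, min_k=2))
```

In the notebook the same call would take `k_values` and `sil_scores_users`; the result should still be checked against the elbow plot rather than used on its own.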

In [None]:
k = 6
index = k - 2  # because lists start at k=2
print("We chose k = 6; these are the scores kept for later comparison:")
print(f"Silhouette score: {sil_scores_users[index]}")
print(f"Inertia: {inertia_users[index]}")

In [None]:
# 6.1.2 TRAIN K-MEANS ON FULL DATASET
kmeans_sil_score = 0.3346739450703381  # silhouette for k = 6, as printed by the previous cell
optimal_k_users = 6  # explained in the previous section

kmeans_users = KMeans(n_clusters=optimal_k_users, random_state=42, n_init=10)
user_labels = kmeans_users.fit_predict(X_users_kmeans)

df_users = df_users.copy()           # optional, to avoid side-effects
df_users['cluster'] = user_labels

# Mean, median and size of each user cluster (computed once, saved unrounded)
user_cluster_summary_kmeans = df_users.groupby('cluster')[['total_ratings', 'unique_movies',
                                                           'avg_rating', 'std_rating', 'activity_days']].agg(['mean','median','count'])
user_cluster_summary_kmeans.to_csv("images/user_cluster_summary_kmeans.csv")

# Rounded copy for display
user_cluster_summary = user_cluster_summary_kmeans.round(2)
user_cluster_summary

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_users_pca = pca.fit_transform(X_users_kmeans)

df_users['pca1'] = X_users_pca[:, 0]
df_users['pca2'] = X_users_pca[:, 1]

plt.figure(figsize=(8,6))
sns.scatterplot(
    data=df_users.sample(4000, random_state=42),
    x='pca1', y='pca2',
    hue='cluster',
    alpha=0.5,
    palette='Set2'
)
plt.title("User Clusters Visualized in 2D (PCA)", fontsize=14, fontweight='bold')
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.legend(title="Cluster")
plt.savefig("images/user_clusters_2D_PCA.png", dpi=300)
plt.show()



#### **6.1.3 Interpretation of K-Means User Clusters**

This segmentation identifies six distinct groups based on **engagement level**, **rating behavior**, and **activity duration**.


#### **Cluster 0 — One-time Fan Visitor**
- **~2–3 total ratings**, **~2 unique movies**
- **Very high average rating (~4.5)**
- **Almost no activity span (median ~0 days)**

**These are brief one-time visitors who rate a couple of movies generously and leave (≈ 101k users).**


#### **Cluster 1 — Short-Term Users**
- **~6–7 ratings**, **~5 unique movies**
- **Neutral average rating (~3.5)**
- **Activity ~180 days**

**Viewers who interact lightly and stay for only several months (≈ 170k users).**


#### **Cluster 2 — Consistent Enjoyers**
- **~30 total ratings**, **~28 unique movies**
- **Positive average rating (~3.6)**
- **Activity span around 455 days (1+ year)**

**Steady users who explore many movies and show consistent engagement (≈ 41k users).**


#### **Cluster 3 — One-Movie Critics**
- **~1–2 total ratings**, usually **1 unique movie**
- **Average rating ~2.3–2.5**
- **Very short activity (~30 days)**

**Users who appear once, rate a movie or two, and do not return (≈ 49k users).**


#### **Cluster 4 — Loyal Occasional Enjoyers**
- **~14 ratings**, **~12 unique movies**
- **Balanced average rating (~3.56)**
- **Very long activity (~1060 days ≈ 3 years)**

**Loyal long-term users who come back over years but do not rate heavily (≈ 46k users).**


#### **Cluster 5 — Heavy Enthusiasts**
- **~75–80 ratings**, **65+ unique movies**
- **Average rating (~3.38)**
- **Long activity (~900 days)**

**High-engagement users who explore many movies and remain active for years (≈ 6k users).**



### **6.2 DBSCAN (FULL)**
#### 6.2.1 Prepare data + train DBSCAN on FULL dataset

In [None]:
# 6.2.1 PREPARE FULL DATASET + TRAIN DBSCAN
X_users_dbscan = X_users_kmeans
# Hyperparameters for DBSCAN on the FULL dataset
eps_val = 0.6  # radius of neighbourhood
min_samples_val = 20  # minimum points to form a dense region

dbscan = DBSCAN(eps=eps_val, min_samples=min_samples_val, n_jobs=-1)
dbscan_labels = dbscan.fit_predict(X_users_dbscan)

# Basic cluster diagnostics
labels = dbscan_labels
unique_labels = set(labels)
n_clusters = len(unique_labels - {-1})  # exclude noise label -1
n_noise = (labels == -1).sum()

print("DBSCAN Results (Full Dataset):")
print(f"  eps = {eps_val}")
print(f"  min_samples = {min_samples_val}")
print(f"  Number of clusters (excluding noise): {n_clusters}")
print(f"  Noise points: {n_noise} ({n_noise / len(labels):.2%} of all users)")

# Cluster size distribution (including noise)
cluster_sizes = pd.Series(labels, name="cluster").value_counts().sort_index()
print("\nCluster size distribution (including noise = -1):")
print(cluster_sizes)

#### **Cluster Structure and Noise**

The DBSCAN model detected:
- 4 meaningful clusters
- 417,072 users (≈ 95%) assigned to one dominant cluster
- Several smaller, compact clusters (8k–10k users each)
- A set of very small micro-clusters (< 300 users)
- 2,343 users (≈ 0.5%) labeled as noise

This distribution reveals a highly homogeneous core audience: most users share very similar engagement and rating behavior, forming one large dense region in the feature space.
DBSCAN then isolates several behaviorally distinct subgroups, possibly representing:
- highly active viewers
- users with extreme or low rating variance
- genre-specialized or burst-pattern viewers
- rare or unusual rating profiles

Finally, the noise group consists of users whose behavior does not resemble any dense pattern—typically highly irregular or low-activity users.
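
The diagnostics quoted above (number of clusters, share of the dominant cluster, share of noise) can be computed directly from the label vector. This is a minimal sketch: the helper name `summarize_dbscan_labels` and the toy labels are hypothetical; the only convention taken from scikit-learn is that `-1` marks noise.

```python
import numpy as np


def summarize_dbscan_labels(labels):
    """Summarize DBSCAN output: cluster count, noise share, dominant-cluster share."""
    labels = np.asarray(labels)
    clustered = labels[labels != -1]  # -1 marks noise in scikit-learn's DBSCAN
    ids, counts = np.unique(clustered, return_counts=True)
    return {
        "n_clusters": int(len(ids)),
        "noise_share": float((labels == -1).mean()),
        "dominant_share": float(counts.max() / len(labels)) if len(counts) else 0.0,
    }


# Toy labels (illustration only): one big cluster, one small cluster, one noise point
demo_labels = [0] * 8 + [1] + [-1]
print(summarize_dbscan_labels(demo_labels))
```

Applied to `dbscan_labels`, a dominant share close to 1 would confirm that most users fall into a single dense region.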


#### 6.2.2 PCA scatter plot (on sample of full model)

In [None]:
# 6.2.2 – PCA scatter of DBSCAN (Sampling for Visualization)

plot_sample_size = 30000

if X_users_dbscan.shape[0] > plot_sample_size:
    np.random.seed(42)
    plot_idx = np.random.choice(X_users_dbscan.shape[0], plot_sample_size, replace=False)
    sample_for_plot = X_users_dbscan[plot_idx]
    sample_labels = dbscan_labels[plot_idx]
else:
    sample_for_plot = X_users_dbscan
    sample_labels = dbscan_labels

# 2D PCA just for visualization
pca_vis = PCA(n_components=2)
pca_points = pca_vis.fit_transform(sample_for_plot)

plt.figure(figsize=(7, 5))
sns.scatterplot(
    x=pca_points[:, 0],
    y=pca_points[:, 1],
    hue=sample_labels,
    palette="tab20",
    s=10,
    alpha=0.6,
    legend=False
)
plt.title(f"DBSCAN Clusters (PCA, FULL model, sample of {len(sample_for_plot)} users)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.tight_layout()
plt.savefig("images/user_clusters_dbscan_PCA.png", dpi=300)
plt.show()


#### 6.2.3 Final silhouette score for DBSCAN (NO tuning)

In [None]:
# 6.2.3 – DBSCAN FINAL SILHOUETTE SCORE (using the FULL model from 6.2)

sample_size = min(50000, X_users_dbscan.shape[0])
np.random.seed(42)
sample_idx = np.random.choice(X_users_dbscan.shape[0], sample_size, replace=False)

X_sample = X_users_dbscan[sample_idx]
labels_sample = dbscan_labels[sample_idx]

# Remove noise
mask_clustered = labels_sample != -1
X_clustered = X_sample[mask_clustered]
labels_clustered = labels_sample[mask_clustered]

n_clusters_clustered = len(set(labels_clustered))

if n_clusters_clustered > 1 and len(X_clustered) > 0:

    sil_score_dbscan_final = silhouette_score(X_clustered, labels_clustered)

    print("Final Silhouette Score for DBSCAN (Full Model)")
    print(f"Sample size used (clustered points only): {len(X_clustered)}")
    print(f"Number of clusters in sample (excluding noise): {n_clusters_clustered}")
    print(f"Silhouette score: {sil_score_dbscan_final:.4f}")

else:
    print("Cannot compute silhouette score: <2  non-noise clusters were found.")


#### Silhouette Score Interpretation

The silhouette score, computed on the non-noise points of a 50,000-user sample, is **0.2048**.

A low value is expected for DBSCAN on this data and is not, on its own, evidence of a failed clustering:
- clusters can be irregularly shaped
- one cluster can be much larger than the others
- clusters may overlap in feature space
- noise points are excluded from scoring

Because silhouette favors compact, convex, well-separated clusters, it is less informative for density-based algorithms.
The score should therefore be read alongside the cluster-size distribution above: DBSCAN captures the density structure and isolates small subgroups that a centroid-based method such as K-Means is not designed to find.
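
For reference, the standard definition of the silhouette coefficient shows where its preference for compact clusters comes from. Here $a(i)$ is the mean distance from point $i$ to the other members of its own cluster, and $b(i)$ is the smallest mean distance from $i$ to the members of any other cluster:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

The reported score is the mean of $s(i)$ over the scored points. Because $a(i)$ averages over the whole cluster, a point at the far end of a very large or elongated cluster gets a large $a(i)$ even when it is densely connected to its neighbours, which lowers the score.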

### **6.3 BIRCH (FULL)**
#### Balanced Iterative Reducing and Clustering using Hierarchies

In [None]:
# 6.3.1 FITTING BIRCH ON SAMPLE
X_birch = X_users_kmeans.copy()

sample_size = min(50000, X_birch.shape[0])
np.random.seed(42)
sample_idx_birch = np.random.choice(X_birch.shape[0], sample_size, replace=False)
X_birch_sample = X_birch[sample_idx_birch]

# Test a range of cluster numbers 

k_values_birch = range(2, 11)   # test between 2 and 10 clusters
sil_scores_birch = []

for k in k_values_birch:
    birch_test = Birch(n_clusters=k)
    labels_test = birch_test.fit_predict(X_birch_sample)

    # Birch can sometimes assign all points to 1 cluster → silhouette fails
    if len(np.unique(labels_test)) < 2:
        sil_scores_birch.append(-1)   # invalid clustering
        continue

    score = silhouette_score(X_birch_sample, labels_test)
    sil_scores_birch.append(score)
    print(f"k = {k}, silhouette = {score:.4f}")

# Plot silhouette curve 

plt.figure(figsize=(8,4))
plt.plot(k_values_birch, sil_scores_birch, marker='o')
plt.title("Birch Silhouette Scores for Different k", fontsize=14, fontweight='bold')
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette score")
plt.grid(alpha=0.3)
plt.savefig("images/user_birch_silhouette_scores.png", dpi=300)
plt.show()

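# Report the selected number of clusters
# (k = 2 scores highest but is not a meaningful segmentation; see the discussion below)

best_k_birch = 4
best_sil_birch = 0.4021  # silhouette at k = 4, as printed by the loop above
print(f"\nSelected number of clusters for Birch: k = {best_k_birch}, silhouette = {best_sil_birch}")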

#### **Selection of clusters for BIRCH**

Birch builds a hierarchical clustering structure (a CF tree) by incrementally compressing the dataset into compact subclusters and then performing a final clustering step on these summaries. It is effective for:

- large datasets that are costly to cluster point by point,
- skewed behavioral distributions,
- identifying both global and local engagement patterns.

When testing k between 2 and 10, the silhouette score decreased sharply after k = 4.
Although *k = 2* gave the highest silhouette value, the resulting segmentation was not meaningful:
the vast majority of users collapsed into a single broad cluster, with only heavy users forming a separate group.

This behavior is expected because:

- user engagement is extremely imbalanced,
- Birch merges sparse users aggressively at low values of *k*,
- Euclidean distances are dominated by the largest differences in activity volume, which masks subtler behaviour differences,
- overly small k values hide important user archetypes behind a single dense cluster.

Choosing *k = 4* provides a better trade-off between interpretability and clustering quality:

- the silhouette score remains relatively high,
- the clusters separate into distinct behavioral profiles,
- segmentation becomes actionable (casual viewers, explorers, consistent raters, heavy users),
- cluster boundaries are more stable and meaningful than with larger k.

Therefore, *k = 4* is selected as the most suitable number of clusters for Birch, as it provides **clear audience patterns**.


In [None]:
# BEHAVIOUR-WEIGHTED BIRCH

# Get column indices
idx_total = user_features.index('total_ratings')
idx_unique = user_features.index('unique_movies')
idx_avg = user_features.index('avg_rating')
idx_std = user_features.index('std_rating')
idx_days = user_features.index('activity_days')

# Adjust features to ensure full picture clustering:
# down-weight volume related features (total_ratings, unique_movies)
# up-weight behavioural features (avg_rating, std_rating, activity_days)
# Start from a fresh copy so re-running this cell does not compound the weights
X_birch = X_users_kmeans.copy()
X_birch[:, [idx_total, idx_unique]] *= 0.5      # less influence
X_birch[:, [idx_avg, idx_std, idx_days]] *= 2.0 # more influence


birch_full = Birch(n_clusters=best_k_birch) # best number clusters based on analysis
birch_labels = birch_full.fit_predict(X_birch)

# Store Birch labels in df_users
df_users['cluster_birch'] = birch_labels

# Basic cluster counts
print("Cluster counts:")
print(df_users['cluster_birch'].value_counts().sort_index())

# Summary of user behavior by cluster (full dataset), computed once and saved unrounded
user_cluster_summary_birch = df_users.groupby('cluster_birch')[['total_ratings', 'unique_movies',
                                                                'avg_rating', 'std_rating', 'activity_days']].agg(['mean','median','count'])
user_cluster_summary_birch.to_csv("images/user_cluster_summary_birch.csv")

# Rounded copy for display (a bare expression is only rendered as the last line of the cell)
print("\nBirch Cluster Summary:")
birch_cluster_summary = user_cluster_summary_birch.round(2)
birch_cluster_summary

#### **Explanation (Birch With Behaviour-Weighted Features)**

Before running Birch, the user feature matrix was adjusted so that the clustering focuses more on how users behave, not only how much they watch.

#### **What was done**
- `total_ratings` and `unique_movies` were downscaled (×0.5)
  → both features measure activity volume and, taken together, would otherwise dominate the clustering.
- `avg_rating`, `std_rating`, and `activity_days` were upscaled (×2.0)
  → these better describe the user’s rating behavior and engagement style.

This creates a new matrix (`X_birch`) where behavioral features have more influence.

#### **Why this was needed**
Without weighting:
- Birch mainly separates light vs. heavy users.
- Behavioral differences disappear because volume features are too strong.

With weighting:
- Clusters reflect actual behavior patterns, not just activity level.
- We get more useful groups (e.g., explorers, consistent raters, heavy users, casual viewers).

#### **Comparison**
- **Unweighted Birch** mainly formed clusters based on activity volume (light, medium, heavy users).
  Behavioral differences were blended together, resulting in less interpretable segments.

- **Behavior-weighted Birch** produced richer and more distinct behavioral profiles, enabling clusters such as:
  - consistent raters,
  - exploratory viewers,
  - high-variability users,
  - long-term stable viewers.

Therefore, weighting significantly improves the meaningfulness and interpretability of the final clusters.
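
A minimal sketch of the same re-weighting written as a non-mutating helper. The function name `apply_feature_weights`, the feature order, and the toy rows are assumptions for illustration; only the weights (0.5 and 2.0) come from the notebook. It also shows that scaling a column by `w` scales its contribution to squared Euclidean distance by `w**2`.

```python
import numpy as np


def apply_feature_weights(X, feature_names, weights):
    """Return a weighted copy of X; columns missing from `weights` keep weight 1.0."""
    w = np.array([weights.get(name, 1.0) for name in feature_names], dtype=float)
    return np.asarray(X, dtype=float) * w  # broadcasting scales each column; X is untouched


# Assumed feature order (hypothetical; the notebook uses its own `user_features` list)
features = ['total_ratings', 'unique_movies', 'avg_rating', 'std_rating', 'activity_days']
weights = {'total_ratings': 0.5, 'unique_movies': 0.5,
           'avg_rating': 2.0, 'std_rating': 2.0, 'activity_days': 2.0}

# Two toy users in standardized units (hypothetical values)
X_demo = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0, 0.0]])
X_weighted = apply_feature_weights(X_demo, features, weights)

# Squared distance from the origin before and after weighting
print((X_demo ** 2).sum(axis=1), (X_weighted ** 2).sum(axis=1))
```

Returning a new array keeps the original standardized matrix available for the other models and makes the step safe to re-run.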

In [None]:
# PCA on feature matrix used for Birch (X_birch)
pca_vis = PCA(n_components=2, random_state=42)
users_2d = pca_vis.fit_transform(X_birch)

# Attach PCA coordinates to df_users (same index order as X_birch / X_users_kmeans)
df_plot = df_users.copy()
df_plot['pca1_birch'] = users_2d[:, 0]
df_plot['pca2_birch'] = users_2d[:, 1]

# Fair random sample from the FULL dataset (preserves cluster imbalance)
max_points = 20000  # just to avoid overplotting; change if needed
if len(df_plot) > max_points:
    df_plot_sample = df_plot.sample(max_points, random_state=42)
else:
    df_plot_sample = df_plot

# Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=df_plot_sample,
    x='pca1_birch',
    y='pca2_birch',
    hue='cluster_birch',
    palette='tab10',
    s=10,
    alpha=0.5
)

plt.title("User Clusters with Birch (PCA, Random Sample)", fontsize=14, fontweight='bold')
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.legend(title="Cluster", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.savefig("images/user_clusters_birch_pca_scatter.png", dpi=300)
plt.show()

#### **Interpretation of Birch Clusters**

The PCA plot shows how the four Birch clusters are positioned in a compressed 2-dimensional space.
Although PCA reduces information, it helps us visually understand how users differ in their behaviour.

Here is what each cluster represents based on its location, density, and spread:

#### **Cluster 0 (blue)** — *Casual / Low-Activity Users*
- This is the largest and densest cluster.
- Users in this group have:
  - few total ratings,
  - few unique movies,
  - short activity spans,
  - very stable rating patterns.

**Interpretation:**
Most users watch only a small number of movies and behave very similarly. This is the “baseline audience.”


#### **Cluster 1 (orange)** — *Broad Explorers*
- Positioned lower-right and somewhat spread out.
- These users watch many *different* movies (high unique count), even if their total ratings are moderate.
- Their activity days tend to be longer than cluster 0.

**Interpretation:**
These are curious, exploratory viewers who sample many titles but do not necessarily rate a lot.


#### **Cluster 2 (green)** — *High-Variability / Irregular Users*
- A small but distinct group.
- Points appear scattered between cluster 1 and cluster 3.
- These users show:
  - less consistent rating behavior (high std_rating),
  - mixed activity levels.

**Interpretation:**
This cluster contains users with more unpredictable or irregular engagement patterns.


#### **Cluster 3 (red)** — *Highly Active / Heavy Users*
- Much higher spread compared to other clusters.
- These users have:
  - very high total ratings,
  - long activity spans,
  - broad movie consumption,
  - stable or semi-stable average ratings.

**Interpretation:**
This is the “power-user” group—people who interact with the platform heavily and over long periods.



### **6.4 MODEL COMPARISON**

In [None]:
model_scores = pd.DataFrame({
    'Model': ['KMeans', 'DBSCAN', 'BIRCH'],
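    # DBSCAN score taken from the variable computed in 6.2.3 so the table matches the reported value
    'Silhouette (sample)': [kmeans_sil_score, sil_score_dbscan_final, best_sil_birch]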
})

model_scores.to_csv("images/user_clustering_model_scores.csv", index=True)
model_scores

After evaluating three clustering algorithms on a representative user sample, we obtained the silhouette scores shown above.

**K-Means** delivers strong overall performance:
- relatively high silhouette score (**0.335**)  
- clusters are **well distributed** across the dataset  
- segments are **interpretable** and reflect meaningful behavioral patterns  
- captures engagement level, rating style, and activity duration  
- requires **no manual feature re-weighting**

This makes it a strong and stable baseline model — and consistent with course methodology.

**DBSCAN** struggles with this **highly skewed** dataset:
- forms **one massive cluster** (≈ 95% of users)  
- many **tiny fragments**  
- a small **noise** group (≈ 0.5% of users)  

This behavior reflects *poor cluster separability* and suggests that, at least with the untuned parameters used here, a density-based method is not well suited to this type of user data.


**Birch** achieves the **highest silhouette score** (**0.402** at *k = 4*), but its final segmentation relies on **manual feature re-weighting**:
- without adjustments, Birch collapses nearly all users into one dominant cluster  
- even with tuning, it produces **very broad macro-clusters** that lack behavioral detail  

Thus, despite the higher silhouette score, Birch does not meet the project’s need for nuanced audience segmentation.

---

#### **We Select K-Means as the Final Model**
Although Birch scores higher, **K-Means is chosen as the final clustering method** because:

- it produces **balanced, well-spread clusters**  
- clusters are **behaviorally interpretable**  
- the method is **stable** without feature manipulation  
- it aligns with the course’s clustering workflow  
- it best supports the project goals:  identifying audience segments, analyzing engagement, studying rating behaviors, comparing groups over time  


Following the reasoning above, **K-Means is selected as the final user clustering model**, with DBSCAN and Birch included only as comparison methods.


### **6.5 NAMING USER CLUSTERS**



To reference user groups in later sections (especially preference analysis), we assign descriptive behavioral labels:

| Cluster | Name                     |
|--------|--------------------------|
| 0      | One-Time Fan Visitor     |
| 1      | Short-Term Users         |
| 2      | Consistent Enjoyers      |
| 3      | One-Movie Critics        |
| 4      | Loyal Occasional Enjoyers |
| 5      | Heavy Enthusiasts        |


They will be used in Section 7 to analyze how different user types engage with different movie clusters.

## SECTION 7: USER-MOVIE PREFERENCE ANALYSIS

In the previous section, we discovered **audience segments** by clustering the
pre-calculated user statistics table.  
In Section 5, we clustered movies into content types based on patterns in the data (ratings, variability, popularity).  

Here we connect the two to answer the following questions:

**→ What types of content does each user segment watch?  
→ How do they rate those content types?  
→ How do these preferences evolve over time?**


For this we use the `viewer_ratings` interactions from the database. The table is large (over 4M rows).
To avoid running out of memory, **we never load the full table at once**. Instead, we select only the necessary columns and process the data in parts using pandas.

### 7.1 Merge Interactions with Clusters

We need to know, for each rating:

- which **user cluster** produced it
- which **movie cluster** it belongs to
- in which **year** it happened

We already have user and movie clusters in:

- `df_users[['customer_id', 'cluster']]`
- `df_movies[['movie_id', 'cluster']]`

We now:

1. Build **lookup dictionaries**:
   - `customer_id → user_cluster`
   - `movie_id → movie_cluster`
2. Load the `viewer_ratings` table from the database in **parts**, selecting only:
   - `customer_id`, `movie_id`, `rating`, `date`
3. For each part:
   - attach `user_cluster` and `movie_cluster`
   - compute small groupby summaries and store them for later
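
The steps above can be sketched on a tiny in-memory database. Everything in this example is a toy stand-in (table contents, cluster maps, `chunksize=2`); the point is that storing a sum and a count per part lets the exact global average be recovered after all parts are combined.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table standing in for viewer_ratings
conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "movie_id":    [10, 11, 10, 12, 11],
    "rating":      [5, 4, 2, 3, 4],
}).to_sql("viewer_ratings", conn_demo, index=False)

user_map = {1: 0, 2: 1, 3: 0}      # customer_id -> user cluster (toy values)
movie_map = {10: 0, 11: 0, 12: 1}  # movie_id -> movie cluster (toy values)

partials = []
query = "SELECT customer_id, movie_id, rating FROM viewer_ratings"
for part in pd.read_sql_query(query, conn_demo, chunksize=2):
    part["user_cluster"] = part["customer_id"].map(user_map)
    part["movie_cluster"] = part["movie_id"].map(movie_map)
    # Per-part summary: sum and count of ratings for each cluster pair
    partials.append(
        part.groupby(["user_cluster", "movie_cluster"])["rating"]
            .agg(sum_rating="sum", n_ratings="count")
            .reset_index()
    )
conn_demo.close()

# Sums and counts add up across parts, so the global mean is recovered exactly
total = (pd.concat(partials, ignore_index=True)
           .groupby(["user_cluster", "movie_cluster"])[["sum_rating", "n_ratings"]]
           .sum())
total["avg_rating"] = total["sum_rating"] / total["n_ratings"]
print(total)
```

Averaging the per-part means instead would be wrong whenever parts contain different numbers of ratings for a pair, which is why the notebook keeps sums and counts.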

In [None]:
# Names for user clusters (kept in sync with the naming table in Section 6.5)
user_cluster_names = {
    0: "One-Time Fan Visitor",
    1: "Short-Term Users",
    2: "Consistent Enjoyers",
    3: "One-Movie Critics",
    4: "Loyal Occasional Enjoyers",
    5: "Heavy Enthusiasts"
}

# Names for movie clusters
movie_cluster_names = {
    0: "Old Classics",
    1: "Well-Liked Niche Films",
    2: "Controversial Films",
    3: "Blockbusters",
    4: "Popular Films",
    5: "Hated Invisible Films"
}

In [None]:
# Look-up dictionaries from user and movie clustering
user_cluster_map = df_users.set_index('customer_id')['cluster'].to_dict()
movie_cluster_map = df_movies.set_index('movie_id')['cluster'].to_dict()

print(f"User look-up table size:  {len(user_cluster_map)}")
print(f"Movie look-up table size: {len(movie_cluster_map)}")

# Open DB connection
conn = sqlite3.connect(DB_PATH)

In [None]:
# PROCESSING DATA IN PARTS
# Containers to collect per-part summaries
pref_container = []        # user_cluster × movie_cluster: sum + count
user_time_container = []   # year × user_cluster: count
movie_time_container = []  # year × movie_cluster: count

size = 400_000  # adjust if needed
part_number = 0

sql_ratings = """
    SELECT customer_id, movie_id, rating, date
    FROM viewer_ratings
"""

for part in pd.read_sql_query(sql_ratings, conn,
                               chunksize=size,
                               parse_dates=['date']):
    part_number += 1
    print(f"Processing part №{part_number} with {len(part)} rows...")

    # Mapping users and movies to their clusters
    part['user_cluster']  = part['customer_id'].map(user_cluster_map)
    part['movie_cluster'] = part['movie_id'].map(movie_cluster_map)

    # Only keeping rows where both mappings succeeded and rating is present
    part = part.dropna(subset=['rating', 'user_cluster', 'movie_cluster'])

    # Making sure cluster labels are stored as integers
    part['user_cluster'] = part['user_cluster'].astype(int)
    part['movie_cluster'] = part['movie_cluster'].astype(int)

    # Extract rating year
    part['rating_year'] = part['date'].dt.year

    # user_cluster × movie_cluster for preference matrix
    temp_pref = (
        part
        .groupby(['user_cluster', 'movie_cluster'])['rating']
        .agg(['sum', 'count'])    # sum of ratings, number of ratings
        .reset_index()
        .rename(columns={'sum': 'sum_rating',
                         'count': 'n_ratings'})
    )
    pref_container.append(temp_pref)

    # rating_year × user_cluster for time evolution (users)
    temp_user_time = (
        part
        .groupby(['rating_year', 'user_cluster'])['rating']
        .size()
        .reset_index(name='n_ratings')
    )
    user_time_container.append(temp_user_time)

    # rating_year × movie_cluster for time evolution (movies)
    temp_movie_time = (
        part
        .groupby(['rating_year', 'movie_cluster'])['rating']
        .size()
        .reset_index(name='n_ratings')
    )
    movie_time_container.append(temp_movie_time)

print("Finished processing")

### 7.2 Preference Matrix

For each pair *(user_cluster, movie_cluster)*, compute:

- **n_ratings**  → how many ratings this audience segment gave to that content type  
- **avg_rating** → how they rate that content type on average  

From these we will build:

- a **preference matrix** of average ratings (`pref_mean`)  
- a **count matrix** of rating frequencies (`pref_count`)  
- a **share matrix** (`engagement_share`) showing *what* each user cluster watches
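
A minimal sketch of how the three matrices relate, using a toy aggregated table. The numbers and the `*_demo` names are hypothetical; the column names mirror those used in this section.

```python
import pandas as pd

# Toy aggregated table (hypothetical numbers, illustration only)
agg_demo = pd.DataFrame({
    "user_cluster":  [0, 0, 1, 1],
    "movie_cluster": [0, 1, 0, 1],
    "sum_rating":    [40.0, 30.0, 9.0, 3.0],
    "n_ratings":     [10, 10, 3, 1],
})
agg_demo["avg_rating"] = agg_demo["sum_rating"] / agg_demo["n_ratings"]

# Preference matrix (how they rate) and count matrix (how much they rate)
mean_demo = agg_demo.pivot(index="user_cluster", columns="movie_cluster", values="avg_rating")
count_demo = agg_demo.pivot(index="user_cluster", columns="movie_cluster", values="n_ratings")

# Share matrix: row-normalize counts so each user cluster's shares sum to 1
share_demo = count_demo.div(count_demo.sum(axis=1), axis=0)
print(share_demo)
```

Row normalization makes small and large user clusters comparable: the share matrix describes the mix of content within a segment, not the segment's absolute volume.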

In [None]:
# Combine all parts from container into one DataFrame
pref_all = pd.concat(pref_container, ignore_index=True)

# Count total sum of ratings and total count
pref_agg = (
    pref_all
    .groupby(['user_cluster', 'movie_cluster'])[['sum_rating', 'n_ratings']]
    .sum()
    .reset_index()
)

# Average rating per user_cluster/movie_cluster
pref_agg['avg_rating'] = pref_agg['sum_rating'] / pref_agg['n_ratings']
pref_agg.to_csv("images/average_rating_user_movie_summary.csv", index=True)
display(pref_agg.head())

In [None]:
# PREFERENCE MATRIX
# HOW they rate (average rating)
pref_mean = (
    pref_agg
    .pivot(index='user_cluster', columns='movie_cluster', values='avg_rating')
    .round(2)
)

# Create named version for interpretable display
pref_mean_named = pref_mean.rename(index=user_cluster_names, columns=movie_cluster_names)
print("Average ratings:")
pref_mean_named.to_csv("images/average_rating_summary.csv", index=True)
display(pref_mean_named)

In [None]:
# COUNT MATRIX
# HOW MUCH they watch of each genre (rating counts)
pref_count = (
    pref_agg
    .pivot(index='user_cluster', columns='movie_cluster', values='n_ratings')
    .fillna(0)
    .astype(int)
)

# Named version for interpretable display
pref_count_named = pref_count.rename(index=user_cluster_names, columns=movie_cluster_names)
print("Number of ratings:")
pref_count_named.to_csv("images/number_ratings_summary.csv", index=True)
display(pref_count_named)

In [None]:
# ENGAGEMENT SHARES - PROPORTIONS OF CONTENT CONSUMED BY EACH USER CLUSTER
# For each user cluster, convert counts into proportions that sum to 1
engagement_share = pref_count.div(pref_count.sum(axis=1), axis=0).round(3)

# Named version for display
engagement_share_named = engagement_share.rename(index=user_cluster_names, columns=movie_cluster_names)
print("Share of each genre in each user segment's viewing:")
engagement_share_named.to_csv("images/share_genre_in_user_segment.csv", index=True)
display(engagement_share_named)

In [None]:
# PREFERENCE HEATMAPS
# WHAT they watch
plt.figure(figsize=(10, 6))
sns.heatmap(engagement_share_named, annot=True, fmt=".2f", cmap="Blues")
plt.title("Share of Genres Watched by User Segments")
plt.xlabel("Movie Cluster")
plt.ylabel("User Cluster")
plt.xticks(rotation=30, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig("images/share_genres_watched_by_users.png")
plt.show()

# HOW they rate
plt.figure(figsize=(10, 6))
sns.heatmap(pref_mean_named, annot=True, fmt=".2f", cmap="viridis")
plt.title("Average Rating Given by Each User Segment to Each Genre")
plt.xlabel("Movie Cluster")
plt.ylabel("User Cluster")
plt.xticks(rotation=30, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig("images/average_rating_user_segment_genre.png")
plt.show()


### 7.3 Temporal Analysis

To understand how preferences evolve over time, we reuse the per-part summaries collected earlier:

- `user_time_container`  → ratings per year from every user_cluster
- `movie_time_container` → ratings per year for every movie_cluster

From these we compute:

- how the **activity of users** changes over the years
- how the **popularity of genres** changes over the years

In [None]:
# USER CLUSTER'S ACTIVITY OVER TIME
# Combine and aggregate: rating_year and user_cluster
user_time_all = pd.concat(user_time_container, ignore_index=True)

user_time_agg = (
    user_time_all
    .groupby(['rating_year', 'user_cluster'])['n_ratings']
    .sum()
    .reset_index()
)

yearly_user_counts = (
    user_time_agg
    .pivot(index='rating_year', columns='user_cluster', values='n_ratings')
    .fillna(0)
    .astype(int)
    .sort_index()
)

yearly_user_share = yearly_user_counts.div(yearly_user_counts.sum(axis=1), axis=0)

print("User cluster activity (given rating counts):")
display(yearly_user_counts.head())

print("User cluster activity (given rating shares):")
display(yearly_user_share.head())

# PLOT RESULTS
plt.figure(figsize=(10,6))
for c in yearly_user_share.columns:
    plt.plot(yearly_user_share.index,
             yearly_user_share[c],
             marker='o',
             label=user_cluster_names[c])

plt.title("Share of Total Ratings by User Cluster Over Time")
plt.xlabel("Year")
plt.ylabel("Share")
plt.grid(True, axis='y')
plt.legend(title="User Cluster")
plt.tight_layout()
plt.savefig("images/user_total_share_over_time.png")
plt.show()

In [None]:
# MOVIE GENRE'S POPULARITY OVER TIME
# Combine and aggregate: rating_year and movie_cluster
movie_time_all = pd.concat(movie_time_container, ignore_index=True)

movie_time_agg = (
    movie_time_all
    .groupby(['rating_year', 'movie_cluster'])['n_ratings']
    .sum()
    .reset_index()
)

yearly_movie_counts = (
    movie_time_agg
    .pivot(index='rating_year', columns='movie_cluster', values='n_ratings')
    .fillna(0)
    .astype(int)
    .sort_index()
)

yearly_movie_share = yearly_movie_counts.div(yearly_movie_counts.sum(axis=1), axis=0)

# SAVE AND DISPLAY TABLES
print("Movie cluster popularity (received rating counts):")
yearly_movie_counts.to_csv("images/Movie_cluster_popularity_counts.csv", index=True)
display(yearly_movie_counts.head())
print("Movie cluster popularity (received rating shares):")
yearly_movie_share.to_csv("images/Movie_cluster_popularity_shares.csv", index=True)
display(yearly_movie_share.head())

# PLOT RESULTS
plt.figure(figsize=(10,6))
for c in yearly_movie_share.columns:
    plt.plot(yearly_movie_share.index,
             yearly_movie_share[c],
             marker='o',
             label=movie_cluster_names[c])

plt.title("Popularity of Genres Over Time")
plt.xlabel("Year")
plt.ylabel("Share")
plt.grid(True, axis='y')
plt.legend(title="Genres")
plt.tight_layout()
plt.savefig("images/movie_cluster_popularity_over_time.png")
plt.show()

# ANALYSIS IS FINISHED -  CLOSE DB
conn.close()
print("Database connection closed")



## 7.4 PROJECT DISCOVERIES


### What Does Each User Segment Watch?

To create a complete profile of each audience segment, we combine:
- **User behavioural features** from clustering (ratings, volume, activity span, exploration)
- **Movie consumption patterns** (engagement shares by movie type)
- **Taste preferences** (average ratings by movie type)

Combining the three views yields one persona per user cluster.

---

#### **One-time Fan Visitors (Cluster 0)**  
**User behaviour:**  
- Extremely light users (**2–3 total ratings**) with **almost no activity span**  
- **Very generous raters (~4.5 avg)** despite minimal exploration  
- Rate a couple of films during a single visit and disappear  

**Movie preferences:**  
- **58% Blockbuster Hits**, **32% Well-Liked Movies**  
- Minimal attention to Classics, Niche Films, or Controversial titles  
- High ratings for *every* category, including niche and controversial titles (**4.25–4.65**)  

**Interpretation:**  
These are impulse-driven visitors who arrive, watch the big trending titles, and leave immediately.  
They do not explore deeply, but they rate optimistically and positively.  
Their behaviour suggests social hype or platform curiosity rather than sustained engagement.

---

#### **Short-Term Users (Cluster 1)**  
**User behaviour:**  
- Low but still significant engagement (**6–7 ratings**, **~5 unique movies**)  
- Remain active for a few months (**~180 days**)  
- **Neutral rating behaviour (~3.5 avg)**  

**Movie preferences:**  
- **54% Blockbuster Hits**, **35% Well-Liked Movies**  
- Very little exploration into Classics or Niche Films  
- Ratings reveal mild dissatisfaction with Niche and Controversial categories (**2.8–2.9**)  

**Interpretation:**  
Short-Term Users behave like the “trial period audience”: they come, consume the most promoted titles, and leave before developing deeper habits.  
They neither strongly love nor strongly hate mainstream content, but they clearly avoid niche genres.

---

#### **Consistent Enjoyers (Cluster 2)**  
**User behaviour:**  
- Meaningful and steady engagement (**~30 ratings**, **~28 unique movies**)  
- **Long activity span (~450 days)**  
- Positive and stable rating pattern (**~3.6 avg**)  
- Explore broadly without extremes  

**Movie preferences:**  
- **48% Well-Liked Movies**, **37% Blockbuster Hits**  
- Noticeable curiosity toward **Old Classics (7%)** and **Niche Films (7%)**  
- Consistently positive ratings across many genres (**3.6–3.8** mainstream and classics)  

**Interpretation:**  
These users represent the platform’s “ideal general audience”: reliable engagement, broad consumption, positive feedback, and long-term stay.  
They enjoy good mainstream titles but also appreciate Classics.  
A strategically valuable and stable audience segment.

---

#### **One-Movie Critics (Cluster 3)**  
**User behaviour:**  
- **Very low engagement (1–2 ratings)**  
- **Shortest activity span (~30 days)**  
- **Very critical rating style (~2.3 avg)**  
- Almost always rate only one film, and negatively  

**Movie preferences:**  
- **51% Blockbuster Hits**, **24% Well-Liked Movies**  
- Highest shares of any segment for **Hated Niche Films (17%)** and **Controversial Titles (5%)**  
- Lowest ratings of all users (as low as **1.37** for Controversial titles)  

**Interpretation:**  
This is a behaviourally unique cluster: users appear briefly, seem to arrive for one specific film, rate it harshly, and leave.  
Their above-average share of niche and controversial content suggests a mismatch between expectations and the platform’s content.  
With only one or two ratings each, they do not explore as individuals; as a group, their ratings skew toward niche titles, and they are *deeply dissatisfied*.

---

#### **Loyal Occasional Enjoyers (Cluster 4)**  
**User behaviour:**  
- Moderate engagement (**~14 ratings**) but **very long activity (~3 years)**  
- Positive but not extreme rating pattern (**~3.56 avg**)  
- Watch occasionally but reliably across years  

**Movie preferences:**  
- Strong preference for **Well-Liked Movies (41%)** and **Blockbuster Hits (43%)**  
- Some interest in **Old Classics (~7%)**  
- Ratings reflect consistent appreciation (**3.5–3.7** mainstream and classics)  

**Interpretation:**  
These are long-term platform supporters who do not rate heavily but remain active for years.  
Their movie preferences align with accessible, high-quality mainstream films, making them a steady backbone of platform engagement.  
High retention + balanced taste = one of the most valuable segments.

---

#### **Heavy Enthusiasts (Cluster 5)**  
**User behaviour:**  
- Highest engagement: **75–80 total ratings**, **65+ unique movies**  
- Long active lifetime (**~900 days**)  
- Moderate-positive rating behaviour (**~3.38 avg**)  
- Deep explorers who interact broadly and consistently  

**Movie preferences:**  
- **53% Well-Liked Movies**  
- **20% Blockbuster Hits**  
- Substantial engagement with **Hated Niche Films (15%)** and **Old Classics (12%)**  
- Ratings vary by genre (**3.2–3.6** mainstream, **2.7–3.0** niche/controversial)  

**Interpretation:**  
Heavy Enthusiasts form the platform’s “cinephile engine.”  
They explore broadly, consume widely, remain loyal, and provide realistic (not inflated) ratings.  
They represent a small but critically important group for long-term engagement, recommendations, and platform depth.
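
Statements such as "highest share of niche films" across the six profiles above compare each segment's row of the share matrix with the others. One way to make that comparison explicit is a lift index: each segment's share of a content type divided by that content type's share of all ratings. This is not part of the analysis above, only a minimal sketch on invented counts; `toy_count`, its segment labels, and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical rating counts (rows: user segments, columns: content types)
toy_count = pd.DataFrame(
    [[80, 15, 5],
     [40, 30, 30]],
    index=["seg_a", "seg_b"],
    columns=["mainstream", "classics", "niche"],
)

# Share of each content type within a segment (rows sum to 1)
share = toy_count.div(toy_count.sum(axis=1), axis=0)

# Share of each content type across all ratings
overall = toy_count.sum(axis=0) / toy_count.to_numpy().sum()

# Lift > 1: the segment over-indexes on that content type; lift < 1: it under-indexes
lift = share.div(overall, axis=1)

print(lift.round(2))
```

Applied to `pref_count`, the same three lines would show which segments over-index on niche or classic content relative to the audience as a whole, independent of how dominant mainstream titles are overall.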

---

### How Each User Segment Rates Content (pref_mean)

The average rating matrix reveals taste differences on top of viewing differences.

#### **One-time Fan Visitors**
- Extremely generous across all genres (**4.25–4.65**).
- Highest ratings for **Controversial Invisible Titles (4.65)** and **Well-Liked Movies (4.48)**.
- **Interpretation:** Hyper-positive, easily satisfied users. Their ratings inflate all content.

#### **Short-Term Users**
- Rate niche content clearly below mainstream:
  - **Hated Niche Films (2.90)**,  
  - **Controversial Titles (2.79)**.
- Give moderate ratings to mainstream films (**3.62–3.66**).
- **Interpretation:** They rate blockbusters about as well as Consistent Enjoyers do but score niche content low, and they leave after a short period.

#### **Consistent Enjoyers**
- Stable, moderate-to-high ratings (**3.61–3.76** across mainstream and classics).
- Lower ratings for niche and controversial categories (**2.88**).
- **Interpretation:** Balanced and reliable raters, appreciating high-quality films without extremes.

#### **One-Movie Critics**
- Harshest raters:
  - **1.37 for Controversial Titles**,  
  - **~2.0 for Niche Films**,  
  - **2.45–2.47** even for mainstream.
- **Interpretation:** They rate very little, and what they rate they dislike; they behave like single-issue critics rather than casual viewers.

#### **Loyal Occasional Enjoyers**
- Consistently positive ratings:
  - **3.53–3.70** for mainstream and classics.
- **Interpretation:** A steady, happy audience with predictable positive engagement.

#### **Heavy Enthusiasts**
- Moderate-high ratings for mainstream titles (**3.2–3.6**).
- Lower ratings for niche and controversial categories (**2.7–3.0**).
- **Interpretation:** They watch across the whole catalogue and rate realistically rather than generously.

---

### Temporal Behavior of User Segments (yearly_user_share)

The evolution plot highlights how audience structure shifts across the years:

#### **Growing Segments**
- **Loyal Occasional Enjoyers** show a massive surge around 1999–2001, briefly accounting for over 60% of total ratings.
- **Heavy Enthusiasts** also grow gradually toward the later years, indicating rising long-term value.

#### **Declining Segments**
- **Consistent Enjoyers** drop sharply during the transition around 1999–2000 but later recover partially.
- **Short-Term Users** decrease steadily over time, reflecting lower retention.
- **One-time Fan Visitors** fluctuate without long-term growth.

#### **Stable but Niche**
- **One-Movie Critics** remain consistently low in share (near 0–5%). They are small but behaviorally distinct.

**Interpretation:**  
The platform experiences major structural shifts around 1999–2000. Loyal Occasional Enjoyers temporarily dominate, suggesting a rapid influx of stable but low-frequency users. In later years, Heavy Enthusiasts and Consistent Enjoyers take over as the reliable long-term core.
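
The growing, declining, and stable labels above were read off the plot. A rough way to cross-check them is to fit a straight line to each segment's yearly share and look at the sign of the slope. The sketch below does this on invented data: `toy_yearly_share`, the segment names, and the `TOLERANCE` threshold are all hypothetical. A linear fit only captures the overall direction, so a temporary surge such as the one around 1999–2001 would not show up in it.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly shares (rows: years, columns: segments); each row sums to 1
toy_yearly_share = pd.DataFrame(
    {
        "segment_a": [0.50, 0.45, 0.40, 0.35],
        "segment_b": [0.30, 0.35, 0.40, 0.45],
        "segment_c": [0.20, 0.20, 0.20, 0.20],
    },
    index=pd.Index([2002, 2003, 2004, 2005], name="rating_year"),
)

# Centre the years so the least-squares fit is numerically well behaved
years = toy_yearly_share.index.to_numpy(dtype=float)
years_centred = years - years.mean()

# np.polyfit(x, y, deg=1) returns [slope, intercept]; keep only the slope
slopes = pd.Series({
    col: np.polyfit(years_centred, toy_yearly_share[col].to_numpy(), deg=1)[0]
    for col in toy_yearly_share.columns
})

TOLERANCE = 0.01  # hypothetical: share change per year below which a segment counts as stable


def label_trend(slope, tol=TOLERANCE):
    """Turn a slope (share change per year) into a coarse trend label."""
    if slope > tol:
        return "growing"
    if slope < -tol:
        return "declining"
    return "stable"


trend_labels = slopes.apply(label_trend)
print(pd.DataFrame({"slope_per_year": slopes.round(3), "trend": trend_labels}))
```

The same loop could be run over the columns of `yearly_user_share` or `yearly_movie_share`; the threshold would need to be chosen with the real share ranges in mind.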

---

### Temporal Behavior of Movie Clusters (yearly_movie_share)

Similarly, movie cluster popularity evolves meaningfully across time.

#### **Well-Liked Movies**
- Strong early presence, slight mid-period dip, then major resurgence (peaking at ~48%).
- **Interpretation:** These titles drive long-term engagement and recover strongly after 1999.

#### **Hated Niche Films**
- Surprisingly high early share (~35–44%), then a steep decline after 1999.
- **Interpretation:** The early years saw a large proportion of niche content being rated (possibly due to database structure), but user attention shifts away afterward.

#### **Old Classics**
- Stable in early years (~10–18%), then decline through the 2000s.
- **Interpretation:** Classics maintain early relevance but lose ground to modern mainstream titles.

#### **Controversial Invisible Titles**
- Minor share early on (~14–16%), then almost complete disappearance in 2000–2005.
- **Interpretation:** These films lose visibility entirely as content volume expands.

#### **Blockbuster Hits**
- Dramatic rise starting around 1999–2001, overtaking other genres.
- **Interpretation:** As the platform matures, mainstream blockbusters become the primary driver of ratings and traffic.

#### **Unnoticed Films**
- Small but surprisingly stable presence early on (~10–15%), small resurgence after 2004.
- **Interpretation:** Peripheral content remains background noise, but never gains traction.

---

### Combined Insights: Which Segments Match Which Genres?

By combining all components—viewing shares, ratings, and temporal evolution—we identify key audience–content alignments.

#### **Best Matches (High Engagement + High Ratings)**
- **One-time Fan Visitors → Blockbuster Hits / Well-Liked Movies**  
  (High participation, high satisfaction)

- **Loyal Occasional Enjoyers → Well-Liked Movies / Classics**  
  (Stable engagement + consistently positive ratings)

- **Consistent Enjoyers → Well-Liked Movies / Classics**  
  (Broad engagement + reliable ratings)

#### **Exploratory but Critical**
- **One-Movie Critics → Hated Niche Films / Controversial Titles**  
  (Over-indexes on these categories, very low ratings)

#### **High-Engagement Cinephiles**
- **Heavy Enthusiasts → All mainstream + niche categories**  
  (Strong activity + stable, moderate ratings)

#### **Low-Value Short-Term Users**
- Low retention, average-to-negative ratings, mainstream-only interest.
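
The groupings above combine two matrices: how much a segment watches a content type and how it rates it. A minimal sketch of that rule, assuming simple cut-offs, is shown below. The matrices `toy_share` and `toy_mean`, the labels, and the thresholds `MIN_SHARE`, `MIN_RATING`, and `MAX_LOW_RATING` are hypothetical and are not the values behind the groupings in this report.

```python
import pandas as pd

labels = dict(index=["seg_a", "seg_b"], columns=["mainstream", "well_liked", "niche"])

# Hypothetical engagement shares (rows sum to 1) and average ratings
toy_share = pd.DataFrame([[0.60, 0.30, 0.10],
                          [0.20, 0.50, 0.30]], **labels)
toy_mean = pd.DataFrame([[4.5, 4.4, 4.3],
                         [3.3, 3.6, 2.8]], **labels)

MIN_SHARE = 0.25       # hypothetical: minimum share to count as "high engagement"
MIN_RATING = 3.5       # hypothetical: minimum average to count as "high rating"
MAX_LOW_RATING = 3.0   # hypothetical: averages at or below this count as "critical"

# Element-wise boolean masks over the (segment, content type) grid
best_match = (toy_share >= MIN_SHARE) & (toy_mean >= MIN_RATING)
watched_but_disliked = (toy_share >= MIN_SHARE) & (toy_mean <= MAX_LOW_RATING)


def flagged_pairs(mask):
    """List the (segment, content type) pairs where the mask is True."""
    return [(seg, genre) for seg in mask.index for genre in mask.columns if mask.loc[seg, genre]]


best_pairs = flagged_pairs(best_match)
critical_pairs = flagged_pairs(watched_but_disliked)
print("Best matches:", best_pairs)
print("Watched but disliked:", critical_pairs)
```

With the named matrices `engagement_share_named` and `pref_mean_named` in place of the toy frames, the same masks would list the candidate pairs directly; the cut-offs are a judgement call and shift which pairs qualify.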

---

### Final Summary

Together, the results show:

- **Which user groups prefer which movie groups:**  
  - Mainstream dominates for most, while niche interest is concentrated in Critics and Enthusiasts.

- **How similar or different segments are:**  
  - Visitors and Short-Term Users behave similarly; Critics are extremely distinct; Enthusiasts and Enjoyers form the stable core.

- **Which segments are most valuable:**  
  - **Consistent Enjoyers**, **Loyal Occasional Enjoyers**, and **Heavy Enthusiasts**  
    are the platform’s most strategically important audiences: long-term, engaged, and positive.

- **Which content clusters matter most:**  
  - **Blockbuster Hits** and **Well-Liked Movies** dominate ratings and grow over time.  
  - Niche and Controversial films fade dramatically, indicating limited platform value.

Overall, the combination of behavioral preferences, rating tendencies, and temporal evolution enables a complete **audience decoding**:  
we can understand who the key users are, what content drives engagement, and how the platform’s ecosystem shifts over time.


## SECTION 8: CONCLUSIONS

In this final section, we summarize the most important findings from the deep-dive analysis in Section 7 and translate them into actionable insights for the “Audience Decode” brief.

Rather than repeating the detailed results, this section highlights:
- which audience segments matter most,
- which content clusters drive engagement,
- how behavior evolves over time,
- and what recommendations follow for content strategy, curation, and platform design.

The deep-dive analysis reveals a coherent structure within both the audience and the content ecosystem, showing how long-term engagement, content preferences, and rating behaviour intertwine. Several user groups emerge as central to the platform’s sustained success, while others reflect short-lived or highly specific usage patterns. Together, these insights provide a clear strategic direction for content curation, recommendation systems, and audience development.

### **Core Audience Dynamics**

Across all years and behavioural indicators, three segments consistently demonstrate long-term value: **Consistent Enjoyers**, **Loyal Occasional Enjoyers**, and **Heavy Enthusiasts**. These groups combine meaningful activity spans with stable viewing and rating patterns, forming the backbone of audience retention. They interact positively with a range of content, especially high-quality mainstream films and, in the case of Enthusiasts, classics and niche titles as well.  
Conversely, segments such as **One-time Fan Visitors** and **Short-Term Users** generate substantial rating volume but little ongoing engagement, indicating that they are driven by momentary interest rather than platform loyalty. **One-Movie Critics**, while behaviourally distinct, contribute minimal long-term value and tend to give disproportionately negative ratings.

### **Content Patterns and Their Implications**

The content landscape shows a strong gravitational pull toward **Blockbuster Hits** and **Well-Liked Movies**, which attract the majority of ratings in every user segment. Their combined share grows over time: Blockbuster Hits rise sharply from around 1999–2001 and Well-Liked Movies recover to roughly 48%, confirming their importance as engagement drivers.  
Other categories play more specialized roles: **Old Classics** hold a small audience whose share shrinks through the 2000s, while **Hated Niche Films** and **Controversial Titles** lose visibility and appeal as the platform grows. **Unnoticed Films** remain peripheral throughout. These patterns suggest that mainstream content will continue to anchor the platform, with Classics serving as valuable differentiation for loyal, high-engagement users.

### **Temporal Evolution**

The temporal analysis highlights a notable shift around 1999–2001: **Loyal Occasional Enjoyers** briefly dominate the audience mix (over 60% of ratings), reflecting a surge of long-term but low-volume users. Over time, **Heavy Enthusiasts** gain prominence, signalling the emergence of a stable cinephile segment that relies increasingly on deeper catalog exploration. Meanwhile, short-lived clusters decline, pointing to an overall maturation of the audience base.  
Content trends mirror this evolution: Blockbusters and high-quality mainstream titles rise in prominence, while niche and controversial genres diminish. The alignment between audience evolution and content dynamics indicates that the platform is moving toward more sustained and predictable engagement patterns.

### **Strategic Insights**

These findings collectively suggest a dual strategic focus. First, strengthen and retain the platform’s core audience by offering tailored recommendations: mainstream-first lists for casual users, balanced and quality-driven sets for Enjoyers, and deeper exploratory pathways for Enthusiasts. Second, ensure that the content catalog highlights the categories that drive engagement—Blockbuster Hits and Well-Liked Movies—while leveraging Classics to differentiate and serve long-term users.  
At the same time, onboarding flows should nudge One-time Visitors toward more content after their initial blockbuster interaction, increasing the likelihood of converting them into long-term segments. Finally, recommendation systems should avoid giving polarizing niche content to rating-sensitive groups, reducing dissatisfaction and rating volatility.

### **Final Takeaway**

The platform’s sustained growth depends on understanding and nurturing the segments that consistently return, explore, and evaluate movies positively. By aligning content strategy and recommendations with the behavioural strengths of each user group, the platform can enhance satisfaction, improve retention, and create a viewing experience that evolves alongside its audience.

Overall, the combination of behavioral features, content preferences, and temporal dynamics reveals a clear and interpretable audience structure. A few key user groups—Consistent Enjoyers, Loyal Occasional Enjoyers, and Heavy Enthusiasts—drive long-term platform value, while mainstream content remains the anchor of engagement across nearly all segments. Niche and controversial films attract only small, polarized groups and decline over time. These insights provide a blueprint for content allocation, audience retention strategies, and more personalized recommendation pipelines. The platform’s future growth lies in deepening the experience for core long-term users while improving onboarding for new and one-time visitors.
