# 🎮 GameRx | Evaluating Emotional Clustering Performance Across Archetypes


### What this notebook does  
We’ll check how strong our emotional clusters are.  
Simple goal → make sure the archetypes we found *actually hold up.* 

### Main Tasks  
- Load final clustered data  
- Test model quality (Silhouette, Davies–Bouldin, Calinski–Harabasz)  
- Compare scores across cluster options  
- Save all results for dashboards and reports  

### 📂 Input  
`08_game_clusters_ready.csv`

### 💾 Output  
`09_model_eval_results.csv`  

### Why this matters  
This step confirms our clusters make sense.  
So when we say *Comfort & Uplift* or *Tension Thrill,*  
we know the data agrees.

---

## Table of Contents  

1. [Setup & Load Data](#setup--load-data)  
2. [Check Data Overview](#check-data-overview)  
3. [Compute Evaluation Metrics](#compute-evaluation-metrics)  
4. [Compare Cluster Scores](#compare-cluster-scores)  
5. [Save Results](#save-results)  
6. [Insights & Next Steps](#insights--next-steps)

---

## 1. Setup & Load Data  
Import libraries and load the clustered dataset from the cleaned folder.

In [2]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)

# File path
file_path = r"D:\YVC\YVC Portfolio Implementation\Data Analytics Projects\GameRx Your Digital Dose\02 Data\cleaned\08_game_clusters_ready.csv"

# Load data
df = pd.read_csv(file_path)

# Quick check
print("✅ Data loaded successfully:", df.shape)
df.head()

✅ Data loaded successfully: (105008, 16)


| | AppID | Name_g | primary_genre_g | relief_tag | cluster_label | archetype | anger_per_100w | anticipation_per_100w | disgust_per_100w | fear_per_100w | joy_per_100w | sadness_per_100w | surprise_per_100w | trust_per_100w | positive_per_100w | negative_per_100w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20200 | Galactic Bowling | Casual | Comfort | 1 | Balanced Mixers | 1.594 | 3.161 | 0.869 | 1.85 | 3.368 | 1.656 | 1.72 | 3.096 | 6.299 | 3.139 |
| 1 | 655370 | Train Bandit | Action | Catharsis | 1 | Balanced Mixers | 1.188 | 2.923 | 0.668 | 1.437 | 3.038 | 1.006 | 1.386 | 2.418 | 5.152 | 2.075 |
| 2 | 1732930 | Jolt Project | Action | Catharsis | 1 | Balanced Mixers | 1.188 | 2.923 | 0.668 | 1.437 | 3.038 | 1.006 | 1.386 | 2.418 | 5.152 | 2.075 |
| 3 | 1355720 | Henosis™ | Adventure | Validation | 1 | Balanced Mixers | 1.295 | 3.344 | 0.873 | 1.245 | 3.292 | 1.062 | 1.317 | 2.516 | 5.938 | 2.443 |
| 4 | 1139950 | Two Weeks in Painland | Adventure | Validation | 1 | Balanced Mixers | 1.295 | 3.344 | 0.873 | 1.245 | 3.292 | 1.062 | 1.317 | 2.516 | 5.938 | 2.443 |


---

## 2. Check Data Overview

A quick look at the dataset to confirm everything loaded correctly.  
Focus: make sure key columns (including cluster labels) are present.

### 🎯 Goal
Verify the data matches the export from Notebook 08  
before running any evaluations.
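One way to make this check explicit is to compare the loaded column names against a short list of must-have columns. The sketch below uses only the standard library; `required_columns` and `find_missing` are hypothetical names chosen for illustration, and the column list is copied from the `df.info()` output of this notebook.

```python
# Column names as they appear in this notebook's dataset.
loaded_columns = [
    "AppID", "Name_g", "primary_genre_g", "relief_tag", "cluster_label",
    "archetype", "anger_per_100w", "anticipation_per_100w",
    "disgust_per_100w", "fear_per_100w", "joy_per_100w", "sadness_per_100w",
    "surprise_per_100w", "trust_per_100w", "positive_per_100w",
    "negative_per_100w",
]

# Assumption: these are the columns the evaluation cannot run without.
required_columns = {"AppID", "primary_genre_g", "relief_tag", "cluster_label"}


def find_missing(required, available):
    """Return the required column names that are absent, sorted for stable output."""
    return sorted(set(required) - set(available))


missing = find_missing(required_columns, loaded_columns)
print("Missing columns:", missing)
```

In the notebook itself, passing `df.columns` as `available` would run the same check against the loaded data.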

In [3]:
# Check basic info
df.info()

# Peek at first few rows (display() is needed because this is not the last line of the cell)
display(df.head())

# Check cluster label counts
print("\n🧩 Cluster distribution:")
print(df['cluster_label'].value_counts())

# Quick check for missing values
print("\n🚫 Missing values per column:")
print(df.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105008 entries, 0 to 105007
Data columns (total 16 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   AppID                  105008 non-null  int64  
 1   Name_g                 105008 non-null  object 
 2   primary_genre_g        105008 non-null  object 
 3   relief_tag             105008 non-null  object 
 4   cluster_label          105008 non-null  int64  
 5   archetype              105008 non-null  object 
 6   anger_per_100w         105008 non-null  float64
 7   anticipation_per_100w  105008 non-null  float64
 8   disgust_per_100w       105008 non-null  float64
 9   fear_per_100w          105008 non-null  float64
 10  joy_per_100w           105008 non-null  float64
 11  sadness_per_100w       105008 non-null  float64
 12  surprise_per_100w      105008 non-null  float64
 13  trust_per_100w         105008 non-null  float64
 14  positive_per_100w      105008 non-nu

### 🔍 Results: Data Overview

The clustered dataset loaded correctly  
with **105,008 rows** and **16 columns**.

Key columns include: `AppID`, `primary_genre_g`, `relief_tag`, `cluster_label`.

### What this means
- No missing values  
- Each game has **10 NRC scores**: 8 emotions plus positive and negative sentiment  
- Every row includes genre, relief tag, emotional profile, and cluster label  
- Data is clean and ready for evaluation


### Cluster Snapshot
- **Cluster 1:** 102,718 entries  
- **Cluster 0:** 2,284 entries  
- **Cluster 4:** 6 entries  

Cluster 1 holds about 97.8% of all games, so one emotional pattern dominates the dataset.
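To see how uneven the split is, the counts in the snapshot can be turned into shares. This is a small standard-library sketch; `cluster_shares` is a hypothetical helper, and the counts are copied from the snapshot.

```python
# Cluster sizes as reported in the snapshot.
cluster_counts = {1: 102_718, 0: 2_284, 4: 6}


def cluster_shares(counts):
    """Map each cluster label to its fraction of all rows."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


total = sum(cluster_counts.values())
shares = cluster_shares(cluster_counts)

print(f"Total rows: {total:,}")
# List clusters from largest to smallest share.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Cluster {label}: {cluster_counts[label]:>7,} rows ({share:.3%})")
```

The total matches the 105,008 rows loaded earlier, and Cluster 1 accounts for roughly 97.8% of them.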


### 🗝️ Key Checks
- Data types are correct (`float64`, `int64`, `object`)  
- No nulls found  
- In-memory size ~ **12.8 MB**  

Everything is clean and ready for metric analysis.

---

## 3. Compute Evaluation Metrics

Time to check how strong and meaningful the clusters are.

### Metrics to run
- **Silhouette Score** → measures separation and clarity  
- **Davies-Bouldin Score** → lower = less overlap  
- **Calinski-Harabasz Score** → higher = tighter, well-formed clusters  

### Goal
Get a quick read on how balanced and distinct  
the emotional archetypes are.
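To make the Silhouette Score less of a black box, here is a minimal pure-Python sketch of the standard definition: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The function name and the toy points are illustrative assumptions; the notebook itself relies on scikit-learn's `silhouette_score`.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette values using Euclidean distance (needs >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: two tight pairs that sit far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_values(toy_points, [0, 0, 1, 1])  # labels follow the pairs
bad = silhouette_values(toy_points, [0, 1, 0, 1])   # labels cut across the pairs

print("Mean silhouette, sensible labels:", round(sum(good) / len(good), 3))
print("Mean silhouette, mixed-up labels:", round(sum(bad) / len(bad), 3))
```

Labels that follow the natural grouping score close to 1, while labels that cut across it go negative, which is why a high value is read as clean separation.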

In [4]:
# Select only numeric emotion features for evaluation
emotion_cols = [
    'anger_per_100w', 'anticipation_per_100w', 'disgust_per_100w',
    'fear_per_100w', 'joy_per_100w', 'sadness_per_100w',
    'surprise_per_100w', 'trust_per_100w',
    'positive_per_100w', 'negative_per_100w'
]

X = df[emotion_cols]
labels = df['cluster_label']

# Compute metrics
silhouette = silhouette_score(X, labels)
davies = davies_bouldin_score(X, labels)
calinski = calinski_harabasz_score(X, labels)

# Show results
print("✅ Evaluation Metrics")
print(f"Silhouette Score: {silhouette:.3f}")
print(f"Davies-Bouldin Score: {davies:.3f}")
print(f"Calinski-Harabasz Score: {calinski:.3f}")

✅ Evaluation Metrics
Silhouette Score: 0.800
Davies-Bouldin Score: 0.609
Calinski-Harabasz Score: 27945.417


### 🔍 Results: Clustering Evaluation

The evaluation scores show strong, well-separated clusters.  
The emotional archetypes appear clear and consistent.

### Scores
- **Silhouette Score:** 0.800  
  High separation and clean boundaries  

- **Davies-Bouldin Score:** 0.609  
  Low overlap between clusters  

- **Calinski-Harabasz Score:** 27,945.417  
  Compact and distinct cluster structure  


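### Takeaway
On all three metrics the clustering performs well.  
The archetypes are well separated in emotion space, though the Silhouette average is driven mostly by Cluster 1, which holds about 98% of the games.  
The scores support moving on to the next step.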

---

## 4. Compare Cluster Scores

Refit KMeans on the same emotion features with different numbers of clusters.  
This helps find the setup with the best structure and balance.

### Goal
Test several **k values** and compare their scores.

### Reminder
- Higher **Silhouette** and **Calinski-Harabasz** → better separation  
- Lower **Davies-Bouldin** → less overlap  

In [5]:
from sklearn.cluster import KMeans

# Range of cluster counts to test (k = 2 ... 10)
k_values = list(range(2, 11))

results = []

for k in k_values:
    # Refit KMeans on the emotion features for this k
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    # Use a separate name so the original `labels` from In [4] is not overwritten
    k_labels = kmeans.fit_predict(X)

    results.append({
        'k': k,
        'Silhouette': round(silhouette_score(X, k_labels), 3),
        'Davies_Bouldin': round(davies_bouldin_score(X, k_labels), 3),
        'Calinski_Harabasz': round(calinski_harabasz_score(X, k_labels), 3)
    })

# Create a DataFrame with results
scores_df = pd.DataFrame(results)

# Display comparison table
print("✅ Cluster comparison complete:")
scores_df

✅ Cluster comparison complete:


| | k | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|---|
| 0 | 2 | 0.734 | 0.241 | 85440.0 |
| 1 | 3 | 0.784 | 0.17 | 191300.0 |
| 2 | 4 | 0.757 | 0.204 | 415300.0 |
| 3 | 5 | 0.9 | 0.171 | 865100.0 |
| 4 | 6 | 0.943 | 0.112 | 2276000.0 |
| 5 | 7 | 0.981 | 0.032 | 2847000.0 |
| 6 | 8 | 0.991 | 0.027 | 4283000.0 |
| 7 | 9 | 0.999 | 0.026 | 5407000.0 |
| 8 | 10 | 0.999 | 0.548 | 6412000.0 |


### 🔍 Results: Cluster Score Comparison

### What the table shows
Each row tests a different number of clusters (k).  
The scores reveal how clean and well-separated the clusters are.

### 🗝️ Key takeaways
- **Silhouette** rises overall (small dip at **k = 4**) and levels off at 0.999 for **k = 9–10**  
- **Davies-Bouldin** falls to its lowest value (0.026) at **k = 9**, then jumps to 0.548 at **k = 10**  
  → clusters are clearest around k = 9 and overlap more at k = 10  
- **Calinski-Harabasz** keeps increasing with higher k  
  → compact and well-defined groups  

### What it means
Scores improve as k increases up to k = 9.  
**k = 8–9** offers the best balance of clarity, separation, and structure.
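The same reading can be reproduced programmatically by picking the best row per metric. The sketch below copies the scores from the comparison table into plain Python; the variable names are illustrative, and in the notebook the same logic could be applied to `scores_df` directly.

```python
# Scores copied from the comparison table: (k, Silhouette, Davies_Bouldin, Calinski_Harabasz)
rows = [
    (2, 0.734, 0.241, 85_440.0),
    (3, 0.784, 0.170, 191_300.0),
    (4, 0.757, 0.204, 415_300.0),
    (5, 0.900, 0.171, 865_100.0),
    (6, 0.943, 0.112, 2_276_000.0),
    (7, 0.981, 0.032, 2_847_000.0),
    (8, 0.991, 0.027, 4_283_000.0),
    (9, 0.999, 0.026, 5_407_000.0),
    (10, 0.999, 0.548, 6_412_000.0),
]
scores = [
    {"k": k, "Silhouette": s, "Davies_Bouldin": d, "Calinski_Harabasz": c}
    for k, s, d, c in rows
]

# Higher is better for Silhouette and Calinski-Harabasz; lower is better for Davies-Bouldin.
# max()/min() return the first match, so a tie resolves to the smaller k.
best_silhouette = max(scores, key=lambda r: r["Silhouette"])
best_davies = min(scores, key=lambda r: r["Davies_Bouldin"])
best_calinski = max(scores, key=lambda r: r["Calinski_Harabasz"])

print("Best k by Silhouette:       ", best_silhouette["k"])
print("Best k by Davies-Bouldin:   ", best_davies["k"])
print("Best k by Calinski-Harabasz:", best_calinski["k"])
```

Silhouette is tied between k = 9 and k = 10 at three decimals, so the tie-breaking rule matters for that metric; Davies-Bouldin is what separates the two.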

### ➡️ Next step
Save these scores and use them to choose the final  
cluster setup for GameRx.

---

## 5. Save Results

Save all evaluation scores for future reports and dashboards.  
This keeps model performance easy to track over time.

### Goal
Create a clean CSV file with all cluster comparison metrics.

In [6]:
# File path for saving results
save_path = r"D:\YVC\YVC Portfolio Implementation\Data Analytics Projects\GameRx Your Digital Dose\02 Data\cleaned\09_model_eval_results.csv"

# Save results to CSV
scores_df.to_csv(save_path, index=False)

print("üíæ Results saved successfully:")
print(save_path)

üíæ Results saved successfully:
D:\YVC\YVC Portfolio Implementation\Data Analytics Projects\GameRx Your Digital Dose\02 Data\cleaned\09_model_eval_results.csv


---

## 6. Insights & Next Steps  

This notebook completed the **model evaluation phase** of GameRx.  
All clustering results were tested, validated, and saved for reuse.  

### What Was Done  
- Calculated Silhouette, Davies-Bouldin, and Calinski-Harabasz scores  
- Compared performance across multiple cluster counts  
- Confirmed strong separation in the emotional archetypes  
- Saved all evaluation results for future dashboards and reporting  

These checks show that the hybrid clustering model  
forms clear, data-driven emotional groups.

### ➡️ Next Step  
Move into **`10_merge_hybrid_master.ipynb`** to bring everything together.  

This phase combines:
- cluster labels  
- relief tags  
- genre data  

into one unified master dataset.

It also sets the stage for deciding whether to add  
extra context like `psych_genre` in later steps.