# Game Classification and Clustering Analysis

In this notebook, we'll use a video games dataset to explore how clustering can serve as a feature engineering step before classification.

### Learning Objectives:
1. Understand how clustering can identify natural groupings in gaming data
2. Use cluster assignments as features for classification
3. Build and evaluate a machine learning pipeline
4. Practice handling categorical and numerical data

In [11]:
%pip install pandas numpy matplotlib seaborn scikit-learn kagglehub

In [12]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import ast
import kagglehub

# Set random seed for reproducibility
np.random.seed(42)

In [13]:
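# Load the dataset (kagglehub downloads it and returns the local directory path)
import os
dataset_dir = kagglehub.dataset_download("arnabchaki/popular-video-games-1980-2023")
df = pd.read_csv(os.path.join(dataset_dir, 'games.csv'))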

# Display first few rows and basic information
print("Dataset Shape:", df.shape)
df.head()

## Task 1: Data Preprocessing

Let's prepare our data for clustering by handling both numerical and categorical features.

In [14]:
# Convert string representations of lists to actual lists
df['Genres'] = df['Genres'].apply(ast.literal_eval)

# Convert 'Times Listed' and 'Number of Reviews' from strings such as '1.1K' to numbers
def convert_to_numeric(x):
    """Return x as a float, expanding a trailing 'K' (thousands) suffix."""
    if isinstance(x, str):
        x = x.strip()
        if x.upper().endswith('K'):
            return float(x[:-1]) * 1000
    return float(x)

df['Times_Listed_Numeric'] = df['Times Listed'].apply(convert_to_numeric)
df['Reviews_Numeric'] = df['Number of Reviews'].apply(convert_to_numeric)

# Create genre features using MultiLabelBinarizer
mlb = MultiLabelBinarizer()
genre_features = pd.DataFrame(mlb.fit_transform(df['Genres']),
                             columns=mlb.classes_,
                             index=df.index)
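
To make the genre encoding step concrete, here is a minimal sketch on a made-up three-row table (the `toy` frame and its values are assumptions for illustration, not rows from the real dataset). It shows how `ast.literal_eval` turns the stored strings into lists and how `MultiLabelBinarizer` produces one 0/1 column per genre.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows: genres are stored as strings that look like Python lists
toy = pd.DataFrame({"Genres": ["['Adventure', 'RPG']", "['RPG']", "[]"]})

# Parse each string into a real list
toy["Genres"] = toy["Genres"].apply(ast.literal_eval)

# One binary column per distinct genre; a game with no genres gets all zeros
mlb_toy = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb_toy.fit_transform(toy["Genres"]),
    columns=mlb_toy.classes_,
    index=toy.index,
)
print(encoded)
```

Each row can have several 1s, which is why a multi-label encoder is used here rather than `pd.get_dummies` on a single categorical column.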


In [15]:
# Prepare features for clustering
cluster_features = ['Rating', 'Times_Listed_Numeric', 'Reviews_Numeric']
X_cluster = df[cluster_features].copy()

# Handle missing values
X_cluster = X_cluster.fillna(X_cluster.mean())

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)

# Create a binary classification target: 1 if Rating is at or above the 75th percentile, else 0
# (games with a missing Rating compare as False and therefore get 0)
df['high_rated'] = (df['Rating'] >= df['Rating'].quantile(0.75)).astype(int)

# Print basic statistics of our features
print("\nFeature Statistics:")
print(X_cluster.describe())
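
As a quick sanity check of the imputation and scaling steps, the sketch below repeats them on a tiny made-up table (the `toy` values are assumptions for illustration only). After mean imputation no missing values remain, and after `StandardScaler` each column has mean 0 and standard deviation 1, which keeps a large-valued feature such as review counts from dominating the distance calculations in K-means.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales, with gaps
toy = pd.DataFrame({
    "Rating": [4.5, 3.0, np.nan, 2.5],
    "Reviews_Numeric": [1200.0, 300.0, 50.0, np.nan],
})

# Replace each missing value with its column mean
toy_filled = toy.fillna(toy.mean())

# Standardize: subtract the column mean, divide by the column standard deviation
toy_scaled = StandardScaler().fit_transform(toy_filled)

print("Column means after scaling:", toy_scaled.mean(axis=0))
print("Column std devs after scaling:", toy_scaled.std(axis=0))
```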

## Task 2: Clustering Analysis

Let's perform K-means clustering to identify natural groupings in our games.

In [16]:
# Perform K-means clustering (n_init is set explicitly because its default differs across scikit-learn versions)
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to the dataframe
df['Cluster'] = cluster_labels

# Visualize clusters
plt.figure(figsize=(12, 6))

# Create scatter plot
plt.subplot(1, 2, 1)
sns.scatterplot(data=df, x='Rating', y='Times_Listed_Numeric', 
                hue='Cluster', palette='deep')
plt.title('Game Clusters: Rating vs Times Listed')

# Create cluster profile plot
plt.subplot(1, 2, 2)
cluster_profiles = df.groupby('Cluster')[cluster_features].mean()
cluster_profiles_scaled = pd.DataFrame(scaler.transform(cluster_profiles),
                                      columns=cluster_features,
                                      index=cluster_profiles.index)
sns.heatmap(cluster_profiles_scaled, cmap='coolwarm', center=0)
plt.title('Cluster Profiles')

plt.tight_layout()
plt.show()

# Student Exercise: What characteristics define each cluster?
# Write your analysis here: 
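
The choice of `n_clusters = 3` above is a starting point rather than a derived value. One common way to check it is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs (the `X_demo` data is an assumption made so the example runs on its own); in the notebook you would pass `X_scaled` instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_demo)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X_demo, labels)
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("k with the highest silhouette score:", best_k)
```

On real data the scores are usually much closer together than on clean blobs, so treat the result as one piece of evidence alongside the cluster profiles.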

## Task 3: Building a Classification Pipeline

Now let's use our cluster assignments and genre features for classification.

In [18]:
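# Combine numerical features with genre features and cluster labels.
# 'Rating' is dropped: the target 'high_rated' is derived from it, so keeping it would leak the label.
# Note: the clusters were fit with 'Rating' too, so the cluster dummies still carry some rating information.
X = pd.concat([X_cluster.drop(columns='Rating'), genre_features,
               pd.get_dummies(df['Cluster'], prefix='Cluster')], axis=1)
y = df['high_rated']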

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # keep the class ratio equal in both splits
)

# Create and train the classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Student Exercise: What does the classification report tell us about our model's performance?
# Write your analysis here: 
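
When reading the classification report, it helps to have a baseline in mind. Because the target marks roughly the top quarter of games, a model that always predicts "not high-rated" already reaches about 75% accuracy. The sketch below illustrates this on synthetic data with a similar class balance (`X_demo` and `y_demo` are assumptions for illustration, not the games data), using `DummyClassifier` as the baseline and a confusion matrix to see where the errors fall.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem: about 75% class 0, 25% class 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, model.predict(X_te))
cm = confusion_matrix(y_te, model.predict(X_te))

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}")
print("Confusion matrix (rows = true class, columns = predicted class):")
print(cm)
```

For an imbalanced target like this one, precision and recall on the positive class are usually more informative than overall accuracy.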

## Task 4: Feature Importance Analysis

In [19]:
# Get feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': clf.feature_importances_
})
importance = importance.sort_values('importance', ascending=False)

# Plot top 15 most important features
plt.figure(figsize=(12, 6))
sns.barplot(data=importance.head(15), x='importance', y='feature')
plt.title('Top 15 Most Important Features')
plt.tight_layout()
plt.show()
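
The `feature_importances_` values plotted above are impurity-based, which can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below compares the two on synthetic data where only one column carries signal (the column names and data are assumptions for illustration); in the notebook you could call `permutation_importance(clf, X_test, y_test, ...)` in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on the 'signal' column
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(600, 4)),
    columns=["signal", "noise_a", "noise_b", "noise_c"],
)
y_demo = (X_demo["signal"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column in turn and measure how much the test score drops
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)

comparison = pd.DataFrame(
    {
        "impurity_importance": model.feature_importances_,
        "permutation_importance": result.importances_mean,
    },
    index=X_demo.columns,
).sort_values("permutation_importance", ascending=False)
print(comparison)
```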


## Bonus Challenges:

1. Try different clustering algorithms (e.g., DBSCAN, Hierarchical Clustering)
2. Experiment with feature selection methods
3. Analyze the relationship between genres and ratings
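
As one possible starting point for the third challenge, the sketch below computes the average rating per genre with `explode` and `groupby`. The `toy` table is made up for illustration; in the notebook the same steps would apply to `df` once `Genres` has been parsed into lists.

```python
import pandas as pd

# Hypothetical games, each with a list of genres
toy = pd.DataFrame({
    "Title": ["Game A", "Game B", "Game C"],
    "Genres": [["RPG", "Adventure"], ["RPG"], ["Puzzle"]],
    "Rating": [4.0, 3.0, 2.0],
})

# One row per (game, genre) pair, so multi-genre games count toward each genre
exploded = toy.explode("Genres")

# Mean rating and number of games per genre, best-rated first
genre_ratings = (
    exploded.groupby("Genres")["Rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(genre_ratings)
```

Keeping the `count` column next to the mean makes it easier to spot genres whose average rests on only a handful of games.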

## Discussion Questions:

1. How do different genres tend to cluster together?
2. What makes certain games more likely to be highly rated?
3. How could game developers use these insights?
4. What are the limitations of our analysis?

# Advanced Game Analysis: Bonus Challenges

This notebook extends our previous analysis with advanced clustering techniques and deeper insights.

In [None]:
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage


## Challenge 1: Different Clustering Algorithms

In [None]:
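def compare_clustering_algorithms(X_scaled, df):
    """Fit DBSCAN and hierarchical clustering on X_scaled and plot them next to the K-means result."""
    # 1. DBSCAN (points labelled -1 are treated as noise, not as a cluster)
    dbscan = DBSCAN(eps=0.5, min_samples=5)
    dbscan_labels = dbscan.fit_predict(X_scaled)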
    
    # 2. Hierarchical Clustering
    hierarchical = AgglomerativeClustering(n_clusters=4)
    hierarchical_labels = hierarchical.fit_predict(X_scaled)
    
    # Visualize results
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Original K-means
    sns.scatterplot(data=df, x='Rating', y='Times_Listed_Numeric',
                    hue='Cluster', ax=axes[0,0], palette='deep')
    axes[0,0].set_title('K-means Clustering')
    
    # DBSCAN
    sns.scatterplot(data=df, x='Rating', y='Times_Listed_Numeric',
                    hue=dbscan_labels, ax=axes[0,1], palette='deep')
    axes[0,1].set_title('DBSCAN Clustering')
    
    # Hierarchical
    sns.scatterplot(data=df, x='Rating', y='Times_Listed_Numeric',
                    hue=hierarchical_labels, ax=axes[1,0], palette='deep')
    axes[1,0].set_title('Hierarchical Clustering')
    
    # Dendrogram
    linkage_matrix = linkage(X_scaled[:100], 'ward')  # Using first 100 samples for visibility
    dendrogram(linkage_matrix, ax=axes[1,1])
    axes[1,1].set_title('Hierarchical Clustering Dendrogram (First 100 samples)')
    
    plt.tight_layout()
    plt.show()
    
    return dbscan_labels, hierarchical_labels

dbscan_labels, hierarchical_labels = compare_clustering_algorithms(X_scaled, df)
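
The scatter plots give a visual comparison; the imported `silhouette_score` can add a numeric one. The sketch below is a minimal version on synthetic blobs (`X_demo` and the helper `safe_silhouette` are assumptions introduced for illustration). The helper drops DBSCAN noise points (label -1) and returns `None` when fewer than two clusters remain, since the silhouette score is undefined in that case.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def safe_silhouette(X, labels):
    """Silhouette score ignoring noise (-1); None if fewer than two clusters remain."""
    labels = np.asarray(labels)
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return None
    return silhouette_score(X[mask], labels[mask])


# Synthetic data with three compact groups (illustration only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 0], [0, 5]],
    cluster_std=0.5,
    random_state=42,
)

labelings = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo),
    "dbscan": DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo),
    "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X_demo),
}

results = {name: safe_silhouette(X_demo, labels) for name, labels in labelings.items()}
for name, score in results.items():
    n_noise = int(np.sum(labelings[name] == -1))
    shown = "undefined" if score is None else f"{score:.3f}"
    print(f"{name}: silhouette={shown}, noise points={n_noise}")
```

In the notebook, the same helper could be called with `X_scaled` and the labels returned by `compare_clustering_algorithms`. Note that excluding noise points tends to flatter DBSCAN, so the noise count is worth reporting alongside the score.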