## 1. Data Loading
We load the VGChartz dataset and inspect its structure.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

sns.set()

In [None]:
df = pd.read_csv("../data/vgsales.csv")
df.head()
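
Beyond `df.head()`, a quick structural check usually covers column types and missing values. The sketch below runs on a tiny hand-made frame: the rows are invented for illustration, only the column names mirror those used later in this notebook, and `inspect_structure` is a hypothetical helper name. On the real data one would pass `df` instead of `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: values are invented,
# column names follow the ones referenced later in the notebook.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Platform": ["Wii", "PS2", "X360"],
    "Year": [2006.0, np.nan, 2010.0],
    "Genre": ["Sports", "Action", "Shooter"],
    "Global_Sales": [82.7, 1.2, 0.4],
})

def inspect_structure(frame):
    """Return the dtype and the number of missing values for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
    })

summary = inspect_structure(toy)
print(summary)
```

A non-zero `n_missing` for `Year` is what later makes the "Unknown" decade label necessary.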

## 2. Data Cleaning & Feature Engineering
We engineer new variables to enable hypothesis testing.

In [None]:
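def decade(y):
    """Map a release year to a decade label such as "1990s"."""
    if pd.isna(y):
        return "Unknown"  # rows with a missing Year
    y = int(float(y))  # tolerate floats and numeric strings such as 2006.0
    return f"{(y // 10) * 10}s"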

df["decade"] = df["Year"].apply(decade)
df[["Name", "Year", "decade"]].head()

In [None]:
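def game_type(genre):
    # Genre-based heuristic: the genres listed below are treated as
    # multiplayer; every other genre is treated as singleplayer.
    g = genre.lower()
    if any(x in g for x in ["sports", "racing", "fighting", "shooter", "misc"]):
        return "multiplayer"
    return "singleplayer"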

df["game_type"] = df["Genre"].apply(game_type)
df[["Name", "Genre", "game_type"]].head()

In [None]:
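def platform_family(p):
    p = p.lower()
    # startswith avoids false matches such as the "x" in "pcfx"
    if p.startswith("ps"):
        return "PlayStation"
    if p.startswith("x"):
        return "Xbox"
    # WiiU, GameCube (gc) and N64 are Nintendo platforms as well
    if p in ["wii", "wiiu", "ds", "3ds", "snes", "nes", "gb", "gba", "gc", "n64"]:
        return "Nintendo"
    if p == "pc":
        return "PC"
    return "Other"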

df["platform_family"] = df["Platform"].apply(platform_family)
df[["Name", "Platform", "platform_family"]].head()
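
Substring checks on short platform codes are easy to get wrong (for example, `"x" in "pcfx"` is true), so an explicit lookup table is one alternative. The sketch below is a minimal illustration: the `FAMILY` table and the `platform_family_lookup` name are hypothetical, and the list of codes is an assumption about which values occur in the `Platform` column.

```python
import pandas as pd

# Hypothetical lookup table; extend it with whatever df["Platform"].unique() shows.
FAMILY = {
    "PlayStation": ["PS", "PS2", "PS3", "PS4", "PSP", "PSV"],
    "Xbox": ["XB", "X360", "XOne"],
    "Nintendo": ["Wii", "WiiU", "DS", "3DS", "NES", "SNES", "N64", "GC", "GB", "GBA"],
    "PC": ["PC"],
}
# Invert to a case-insensitive code -> family dictionary.
CODE_TO_FAMILY = {code.lower(): fam for fam, codes in FAMILY.items() for code in codes}

def platform_family_lookup(platform):
    """Return the platform family, or "Other" for codes not in the table."""
    return CODE_TO_FAMILY.get(str(platform).lower(), "Other")

codes = pd.Series(["PS2", "X360", "WiiU", "PCFX", "PC"])
families = codes.map(platform_family_lookup)
print(families.tolist())
```

Unlisted codes fall through to "Other" instead of being matched by accident, which makes the mapping easier to audit.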

In [None]:
def simplify_genre(g):
    g = g.lower()
    if g in ["action", "action-adventure"]:
        return "Action"
    if g in ["fighting", "shooter"]:
        return "Combat"
    if g in ["racing", "sports"]:
        return "Competitive"
    if g in ["strategy", "simulation"]:
        return "Strategy"
    return "Story"

df["genre_simple"] = df["Genre"].apply(simplify_genre)
df[["Name", "Genre", "genre_simple"]].head()

In [None]:
def sales_category(x):
    if x < 1:
        return "low"
    elif x < 5:
        return "mid"
    else:
        return "high"

df["sales_cat"] = df["Global_Sales"].apply(sales_category)
df[["Name", "Global_Sales", "sales_cat"]].head()
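
The same three-way binning can also be written with `pd.cut`, which avoids a Python-level function call per row. This is a sketch on made-up sales figures, not a replacement for the cell above; the thresholds 1 and 5 are taken from `sales_category`.

```python
import numpy as np
import pandas as pd

sales = pd.Series([0.4, 1.0, 4.99, 5.0, 82.7])  # made-up sales figures

# right=False makes each bin closed on the left: [-inf, 1), [1, 5), [5, inf),
# which matches the "x < 1" and "x < 5" comparisons in sales_category.
sales_cat = pd.cut(
    sales,
    bins=[-np.inf, 1, 5, np.inf],
    labels=["low", "mid", "high"],
    right=False,
)
print(sales_cat.tolist())
```

The result is an ordered categorical, so plots and group-bys keep the low/mid/high order without extra sorting.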

## 3. Exploratory Data Analysis

### Global Sales Trend by Decade
The figure below shows how total global game sales evolve over decades.


In [None]:
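plt.figure(figsize=(12, 6))
# Drop rows with a missing year so "Unknown" is not drawn as a point on the trend line.
known = df[df["decade"] != "Unknown"]
sns.lineplot(x="decade", y="Global_Sales", data=known, estimator=sum)
plt.title("Global Sales Trend by Decade")
plt.show()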

Total sales rise sharply from the 1990s and peak in the 2000s, which coincides
with the rise of console gaming and mass-market adoption.
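
The numbers behind such a trend line can be checked with a plain group-by. The sketch below uses an invented toy frame, so the totals are illustrative only; on the real data the same chain would be applied to `df`.

```python
import pandas as pd

# Invented rows for illustration; only the column names match the notebook.
toy = pd.DataFrame({
    "decade": ["1990s", "2000s", "2000s", "2010s", "Unknown"],
    "Global_Sales": [1.5, 2.0, 3.0, 0.5, 0.2],
})

totals = (
    toy[toy["decade"] != "Unknown"]   # rows without a year carry no trend information
    .groupby("decade")["Global_Sales"]
    .sum()
    .sort_index()                      # decade labels sort chronologically as strings
)
peak_decade = totals.idxmax()
print(totals)
print("Peak decade:", peak_decade)
```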

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x="game_type", y="Global_Sales", data=df)
plt.title("Singleplayer vs Multiplayer Sales")
plt.show()

Multiplayer games show higher median sales, but the distribution is more skewed with several high-selling outliers.

### Platform Family Analysis (Enrichment)
We compare total sales across platform families and test whether Nintendo and PlayStation differ in average game sales.

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x="platform_family", y="Global_Sales", data=df, estimator=sum)
plt.title("Total Global Sales by Platform Family")
plt.show()

In [None]:
from scipy.stats import ttest_ind

ps = df[df["platform_family"] == "PlayStation"]["Global_Sales"]
nt = df[df["platform_family"] == "Nintendo"]["Global_Sales"]

t_stat2, p_value2 = ttest_ind(ps, nt, equal_var=False)
t_stat2, p_value2

This enrichment test uses Welch's t-test (`equal_var=False`) to check whether average per-game sales differ between Nintendo and PlayStation titles.

### Genre Distribution (Context)
We visualize how game genres are distributed in the dataset to contextualize comparisons.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="genre_simple", data=df, order=df["genre_simple"].value_counts().index)
plt.title("Game Genre Distribution")
plt.xticks(rotation=45)
plt.show()

## 4. Hypothesis Testing

We test multiple hypotheses to examine sales differences across
game types, time periods, and platform families. This allows us
to assess whether observed patterns are statistically significant.

### H1: Multiplayer games have higher global sales than single-player games.

H₀: Mean sales of multiplayer games = Mean sales of single-player games  
H₁: Mean sales of multiplayer games ≠ Mean sales of single-player games

The test is two-sided, so the direction of any difference is read from the sign of the t statistic (positive when the multiplayer mean is larger).

In [None]:
from scipy.stats import ttest_ind

mp = df[df["game_type"] == "multiplayer"]["Global_Sales"]
sp = df[df["game_type"] == "singleplayer"]["Global_Sales"]

t_stat, p_value = ttest_ind(mp, sp, equal_var=False)  # Welch t-test
t_stat, p_value

Welch’s t-test is applied because the two groups are not assumed to have equal variances.  
A p-value below 0.05 indicates a statistically significant difference at the 5% level.
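
For reference, Welch's statistic divides the difference in sample means by the square root of the summed squared standard errors, and takes its degrees of freedom from the Welch-Satterthwaite approximation. The sketch below computes both with the standard library only; `welch_t` is a hypothetical helper and the two samples are made up. `ttest_ind(..., equal_var=False)` computes the same statistic and additionally returns the p-value.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, dof) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof

sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up values
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]   # made-up values
t_value, dof = welch_t(sample_a, sample_b)
print(round(t_value, 4), round(dof, 4))
```

A negative `t_value` here means the first sample has the smaller mean, which is how the sign is read for H1 as well.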

### H2: Global sales differ significantly across decades.

In [None]:
from scipy.stats import f_oneway

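# Rows with a missing year ("Unknown") are not a decade, so they are left out of the comparison.
groups = [g["Global_Sales"].values for d, g in df.groupby("decade") if d != "Unknown" and len(g) > 1]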

f_stat, p_value3 = f_oneway(*groups)
f_stat, p_value3
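
The F statistic returned by `f_oneway` is the ratio of between-group to within-group mean squares. As a minimal sketch of that definition, the code below computes it by hand on three made-up samples; `one_way_f` is a hypothetical helper, not part of SciPy.

```python
from statistics import mean

def one_way_f(samples):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_values = [x for s in samples for x in s]
    grand_mean = mean(all_values)
    k, n = len(samples), len(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # made-up values
f_value = one_way_f(toy_groups)
print(f_value)
```

A large F means the group means are spread out relative to the scatter inside each group; it does not say which decades differ from each other.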

## 5. Interpretation of Results

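The statistical tests suggest that multiplayer games generate significantly higher
sales on average, with the caveat that `game_type` is a genre-based proxy rather than
a recorded attribute. Sales also vary significantly across decades and between Nintendo
and PlayStation titles, highlighting structural shifts in the gaming industry.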
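
Because three tests are run on the same data, one common safeguard is to compare each p-value against a stricter threshold. The sketch below applies a Bonferroni correction; the `bonferroni` helper is hypothetical and the p-values are placeholders for illustration only, to be replaced by `p_value`, `p_value2` and `p_value3` from the cells above.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which p-values stay significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # divide the significance level by the number of tests
    return {name: p < threshold for name, p in p_values.items()}

# Placeholder p-values, NOT results from this notebook.
example = {"H1": 0.001, "platform": 0.03, "H2": 0.2}
decisions = bonferroni(example)
print(decisions)
```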