<a href="https://colab.research.google.com/github/aesnin12/CSMODELProject/blob/main/CSMODEL_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Variable Descriptions of the dataset

| Variable          | Description                                               |
| ----------------- | --------------------------------------------------------- |
| `Name`            | Title of the video game                                   |
| `Platform`        | The gaming platform (e.g. Wii, NES, PS4)                  |
| `Year_of_Release` | Year when the game was released                           |
| `Genre`           | Genre (category) of the game (e.g. Sports, Racing)        |
| `Publisher`       | Company that published the game                           |
| `NA_Sales`        | Sales in North America (in millions of units)             |
| `EU_Sales`        | Sales in Europe (in millions of units)                    |
| `JP_Sales`        | Sales in Japan (in millions of units)                     |
| `Other_Sales`     | Sales in the rest of the world (in millions of units)     |
| `Global_Sales`    | Total worldwide sales (in millions of units)              |
| `Critic_Score`    | Average critic review score (typically 0–100)             |
| `Critic_Count`    | Number of critic reviews used to compute the critic score |
| `User_Score`      | Average user review score (typically on a 0–10 scale)     |
| `User_Count`      | Number of user reviews used to compute the user score     |
| `Developer`       | Studio or company that developed the game                 |
| `Rating`          | Age/content rating (e.g. ESRB ratings: E, T, M, etc.)     |

## Dataset Cleaning and Setup



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
from scipy.stats import ttest_ind, f_oneway
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


In [None]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("kendallgillies/video-game-sales-and-ratings")

print("Path to dataset files:", path)

df = pd.read_csv(os.path.join(path, "Video_Game_Sales_as_of_Jan_2017.csv"))
df.head()

In [None]:
df.info()

In [None]:
df['User_Score'] = pd.to_numeric(df['User_Score'], errors='coerce')
df['Critic_Score'] = pd.to_numeric(df['Critic_Score'], errors='coerce')
df['Year_of_Release'] = pd.to_numeric(df['Year_of_Release'], errors='coerce')

df = df.dropna(subset=['Global_Sales', 'Critic_Score', 'User_Score', 'Year_of_Release'])
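
The coercion step above can be illustrated on a toy column. This is a minimal sketch: the `"tbd"` placeholder and the sample values are assumptions for illustration, not a claim about what this dataset contains.

```python
import pandas as pd

# Toy column mixing numeric strings with a non-numeric placeholder.
# The "tbd" value is an assumption for illustration only.
raw_scores = pd.Series(["8.3", "tbd", "7.1", None])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
numeric_scores = pd.to_numeric(raw_scores, errors="coerce")

# dropna then removes rows that were unparseable or missing to begin with.
clean_scores = numeric_scores.dropna()

print(numeric_scores.tolist())
print(len(clean_scores))
```

Because unparseable values silently become `NaN`, comparing row counts before and after `dropna` is a quick way to see how much data the cleaning step discards.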


In [None]:
platform_counts = df["Platform"].value_counts(dropna=False)
print(platform_counts)

##### Group each platform under its respective company (PlayStation, Xbox, Nintendo, PC, Sega, and other smaller manufacturers); anything unmapped becomes `Other`

In [None]:
platform_groups = {
    "PlayStation": ["PS", "PS2", "PS3", "PS4", "PSP", "PSV"],
    "Xbox": ["X", "X360", "XONE"],
    "Nintendo": ["NES", "SNES", "N64", "GC", "WII", "WIIU", "GBA", "DS", "3DS", "G"],  # G = Game Boy
    "PC": ["PC"],
    "Sega": ["GEN", "SAT", "DC", "SCD", "GG"],
    "Atari": ["2600"],
    "NEC": ["TG16", "PCFX"],
    "SNK": ["NG"],
    "Bandai": ["WS"],
    "Panasonic": ["3DO"],
}

# Clean and standardize platform names
df["Platform"] = df["Platform"].astype(str).str.strip().str.upper()

# Function to categorize each platform
def categorize_platform(p):
    for group, names in platform_groups.items():
        if p in names:
            return group
    return "Other"

# Apply the grouping and overwrite the column
df["Platform"] = df["Platform"].apply(categorize_platform)
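
As an alternative to looping over the groups for every row, the same grouping could be expressed as a lookup table. The sketch below uses a shortened, illustrative dictionary and a made-up unmapped code (`NGAGE`); it is not the notebook's own method.

```python
import pandas as pd

# Shortened, illustrative group -> platform-codes mapping (not the full dictionary).
platform_groups = {
    "PlayStation": ["PS", "PS2", "PS3"],
    "Nintendo": ["WII", "DS"],
    "PC": ["PC"],
}

# Invert to platform-code -> group so each lookup is a single dictionary access.
code_to_group = {
    code: group
    for group, codes in platform_groups.items()
    for code in codes
}

# "NGAGE" is a hypothetical code that is absent from the mapping.
platforms = pd.Series([" ps2", "Wii", "PC", "NGAGE"])

# Strip and upper-case first, then map; unmapped codes become NaN and are filled with "Other".
grouped = (
    platforms.str.strip().str.upper()
    .map(code_to_group)
    .fillna("Other")
)
print(grouped.tolist())
```

The inverted dictionary also makes it easy to spot a platform code that was accidentally listed under two groups, since the later entry would overwrite the earlier one.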

##### Replace `RP` (Rating Pending) and missing ratings with `No Rating`.

In [None]:
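# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['Rating'] = df['Rating'].replace('RP', 'No Rating').fillna('No Rating')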

## Exploratory Data Analysis

##### For this EDA, video game success can be defined in two ways: (1) sales success and (2) critical success (or both). We examine four groups of variables: (1) Genre, (2) Critic and User scores, (3) Platforms, and (4) Publishers. The focus here is on sales success, so each variable is compared against `Global_Sales`.

### EDA Question 1: Which game genres are associated with higher global sales?

In [None]:
genre_sales = df.groupby('Genre')['Global_Sales'].agg(['mean', 'median', 'sum', 'count']).sort_values(by='mean', ascending=False)
genre_sales

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(data=genre_sales.reset_index(), x='Genre', y='mean', palette='viridis')
plt.title('Average Global Sales by Genre')
plt.xlabel('Genre')
plt.ylabel('Average Global Sales (in millions)')
plt.xticks(rotation=45)
plt.show()

###### Excluding the `Miscellaneous` genre, the data reveals that `Shooter` games have the highest average global sales among all video game genres. This indicates that shooter titles tend to perform exceptionally well in the global market compared to other genres such as `Platform`, `Sports`, or `Racing` games. The popularity of `Shooter` games may be attributed to their broad appeal, competitive gameplay, and strong player engagement across different regions. Overall, this suggests that the shooter genre holds a dominant position in terms of commercial success within the gaming industry.

### EDA Question 2: How do critic and user scores relate to global sales?

In [None]:
df[['Critic_Score', 'User_Score', 'Global_Sales']].corr()

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(df[['Critic_Score', 'User_Score', 'Global_Sales']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Between Scores and Global Sales')
plt.show()

sns.scatterplot(data=df, x='Critic_Score', y='Global_Sales', alpha=0.6)
plt.title('Critic Score vs Global Sales')
plt.show()

sns.scatterplot(data=df, x='User_Score', y='Global_Sales', alpha=0.6)
plt.title('User Score vs Global Sales')
plt.show()

##### The correlation heatmap and scatter plots show that there is a weak positive correlation (r = 0.24) between `Critic_Score` and `Global_Sales`. This suggests that while games with higher critic ratings tend to sell slightly better worldwide, the relationship is not particularly strong. In other words, although positive critic reviews may contribute to improved sales, they are not the sole determinant of a game's commercial success.

##### In contrast, the correlation between `User_Score` and `Global_Sales` is even weaker (r = 0.09), indicating almost no linear relationship between player ratings and overall sales. This could imply that user opinions, while valuable for community perception, do not necessarily translate into higher sales.

##### Overall, the findings suggest that well-reviewed games (especially by critics) tend to achieve somewhat higher sales, but a good review alone does not guarantee commercial success. Other factors such as marketing, franchise reputation, platform exclusivity, and release timing may play more substantial roles in driving sales.
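
To put the two correlation values in perspective, the short calculation below squares them. In a simple linear fit, r squared is the share of variance in one variable accounted for by a straight-line relationship with the other; the only inputs are the r values quoted above.

```python
# Correlation coefficients quoted in the discussion above.
r_critic = 0.24
r_user = 0.09

# r squared: fraction of variance accounted for by a straight-line relationship.
r2_critic = r_critic ** 2
r2_user = r_user ** 2

print(f"Critic_Score: r^2 = {r2_critic:.4f} ({r2_critic:.1%} of variance)")
print(f"User_Score:   r^2 = {r2_user:.4f} ({r2_user:.1%} of variance)")
```

Both shares are small, which is consistent with reading these relationships as weak rather than moderate.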

### EDA Question 3: Which gaming platforms tend to have higher-selling games?

In [None]:
platform_sales = df.groupby('Platform')['Global_Sales'].agg(['mean', 'median', 'sum', 'count']).sort_values(by='mean', ascending=False)
platform_sales

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(data=platform_sales.reset_index(), x='Platform', y='mean', palette='magma')
plt.title('Average Global Sales by Platform')
plt.xlabel('Platform')
plt.ylabel('Average Global Sales (in millions)')
plt.xticks(rotation=45)
plt.show()

#### The analysis shows an extremely close competition between `Nintendo` and `PlayStation`. Nintendo leads with an average of 0.852 million units sold per game, while `PlayStation` follows closely at 0.846 million units. This tiny gap suggests that both platforms have been almost equally successful in driving game sales, with `Nintendo` holding only a slight edge in overall performance.

### EDA Question 4: Who are the top publishers in terms of global sales performance?

In [None]:
publisher_sales = df.groupby('Publisher')['Global_Sales'].agg(['mean', 'sum', 'count']).sort_values(by='sum', ascending=False).head(10)
publisher_sales

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(data=publisher_sales.reset_index(), x='Publisher', y='sum', palette='cubehelix')
plt.title('Top 10 Publishers by Total Global Sales')
plt.xlabel('Publisher')
plt.ylabel('Total Global Sales (in millions)')
plt.xticks(rotation=45)
plt.show()


##### `Electronic Arts` (EA) leads all publishers with the highest total global sales, demonstrating its strong dominance in the gaming industry through successful franchises such as FIFA, Battlefield, and The Sims. Following closely behind is `Nintendo`, whose iconic series like Mario, The Legend of Zelda, and Pokémon continue to drive impressive worldwide sales. Other major publishers, including Activision, Sony Computer Entertainment, Take-Two Interactive, and Ubisoft, also maintain significant market shares but remain behind EA and Nintendo in overall performance. This highlights EA's position as the top-performing publisher globally in terms of video game sales.



| EDA Question | Key Insights |
|-----------|-------------|
| 1. Genre + Sales | Certain genres, notably `Shooter`, tend to generate higher average sales. |
| 2. Scores + Sales | Weak positive correlation with Critic Scores (r = 0.24); weaker still with User Scores (r = 0.09). |
| 3. Platform + Sales | Platform groups like PlayStation and Nintendo have higher-selling games on average. |
| 4. Publisher + Sales | EA, Nintendo, and Activision account for a large share of total global sales. |

# **Main Research Question:**
  What determines a game's success in terms of global sales?

### **Importance and Significance:**
  Game genre, critic and user scores, platform, publisher, and region may all be important factors in determining how a game sells in the global market. Through this research question, the team aims to understand how these factors relate to sales and to each other, and which relationships need to be tested for statistical significance, so that future developers can see how they should position and market their games.

### **How this can benefit game studios:**
An upcoming or established game studio will be able to make an informed decision on questions such as:

* Which publishers are the most competitive and yield the most global sales?
* How can a studio find the most success with a region-based localization approach?

  *(Ex. Japanese game studio aiming to release games only in Japan.)*
* On which platforms does a given genre see more success?
* Do games released on more platforms actually sell more?



# **Data modeling:**

To answer the research question "What determines a game's success in terms of global sales?", we first need a measurable definition of success. We define a successful game as one that reaches ≥ 1.0 million global sales, a common industry benchmark for a financially successful "hit" game.

The video game industry commonly uses the term "million-seller" to classify commercially successful titles. Several academic studies categorize game sales using the 1M threshold as the lower boundary for high-performing games. Major publishers such as Nintendo and Sony publicly report games surpassing 1M copies as key milestones. Therefore, using 1M global sales as the benchmark provides a meaningful and industry-recognized definition of game success.

Putra, Rafi. (2025). Classification and Prediction of Video Game Sales Levels Using the Naive Bayes Algorithm Based on Platform, Genre, and Regional Market Data. IJIIS: International Journal of Informatics and Information Systems. 8. 12-21. 10.47738/ijiis.v8i1.242.

We created a binary target variable:
* 1 = Hit game (≥ 1M sales)
* 0 = Non-hit game

With this in hand, we can build a predictive model that reveals which features contribute the most to making a game successful.

In [None]:
df_model = df.dropna(subset = ['Global_Sales']).copy()

# Define the binary success label: 1 if the game sold at least 1.0 million units globally
df_model['is_hit'] = (df_model['Global_Sales'] >= 1.0).astype(int)

print(df_model['is_hit'].value_counts())
print(df_model['is_hit'].value_counts(normalize = True))


# Feature engineering:

## To understand what factors determine a game's success, we prepared the following predictors:

## Numeric Predictors

* Critic_Score: professional reviews

* User_Score: player reviews

* Year_of_Release: timing effects

* platform_count: number of platforms the game was released on (counted on the grouped `Platform` column, so it reflects distinct platform groups)

## Categorical Predictors

* Genre

* Platform

* Publisher_grouped (top publishers kept, others grouped as "Other")

### These match the key variables explored earlier during EDA.

In [None]:
top_publishers = df_model['Publisher'].value_counts().head(10).index
df_model['Publisher_grouped'] = np.where(
    df_model['Publisher'].isin(top_publishers),
    df_model['Publisher'],
    'Other'
)

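# Count distinct Platform values per title. Platform was regrouped by manufacturer
# during cleaning, so this counts platform groups rather than individual consoles.
platform_counts = df_model.groupby('Name')['Platform'].transform('nunique')
df_model['platform_count'] = platform_counts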

feature_cols = [
    'Critic_Score', 'User_Score', 'Year_of_Release',
    'platform_count', 'Genre', 'Platform', 'Publisher_grouped'
]
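
To make the per-title counting step concrete, here is a small sketch on made-up rows. The game names and platform labels are hypothetical; only the `groupby(...).transform('nunique')` pattern mirrors the cell above.

```python
import pandas as pd

# Toy rows: one title listed under two platform groups, one under a single group.
# Names and values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B"],
    "Platform": ["PlayStation", "Xbox", "PC"],
})

# transform("nunique") returns one value per row, aligned with the original index,
# so the result can be assigned straight back as a new column.
toy["platform_count"] = toy.groupby("Name")["Platform"].transform("nunique")
print(toy)
```

Unlike `agg`, which would return one row per title, `transform` keeps the original row count, which is why no merge is needed afterwards.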

## Data cleaning

In [None]:
df_model_clean = df_model.dropna(subset=['Critic_Score', 'User_Score', 'Year_of_Release'])
print(len(df_model_clean))

# Model building

To determine which factors contribute to a game's success, logistic regression is used because:
* it predicts a binary outcome (hit or not hit)
* it provides coefficients showing which features increase or decrease the probability of success
* it is interpretable and can directly address what determines success

In [None]:
X = df_model_clean[feature_cols]
y = df_model_clean['is_hit']
numeric_features = ['Critic_Score', 'User_Score', 'Year_of_Release', 'platform_count']
categorical_features = ['Genre', 'Platform', 'Publisher_grouped']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])

clf.fit(X_train, y_train)
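
The `stratify=y` argument in the split above keeps the hit/non-hit ratio the same in the training and test sets. The sketch below shows this on synthetic labels; the 20% positive rate is an assumed, illustrative figure and not the dataset's actual ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 20% positives (illustrative ratio only).
y_toy = np.array([1] * 200 + [0] * 800)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the positive rate of the full label vector.
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

Without stratification, a small test set could end up with noticeably more or fewer hits than the training set, which would make precision and recall harder to compare across runs.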

## Evaluation metrics
* Accuracy - how many predictions were correct overall
* Precision - when the model predicts a hit, how often is it correct?
* Recall - how many real hits did the model catch
* F1-score - balanced measure of precision and recall
* Confusion Matrix - Shows where the model was right or wrong
  
| hit | Predicted Non-Hit |Predicted Hit|
|-----------|-------------|-------------|
| Actual Non-Hit| True Negative| False Positive|
| Actual Hit| False Negative| True Positive|

In [None]:
y_pred = clf.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

## Model evaluation
* accuracy - about 85.6%, meaning the model predicts game success reasonably well. However, because most games in this dataset are non-hits, accuracy alone can be misleading.
* precision - 0.75, meaning that when the model predicts a hit it is correct 75% of the time. The model makes few false hit predictions.
* recall - 0.346, meaning the model identifies only about one third of actual hit games and misses many successful titles. This is expected given the class imbalance, where non-hits far outnumber hits.
* F1-score - 0.473, reflecting a moderate balance between precision and recall
* confusion matrix
  
| hit | Predicted Non-Hit |Predicted Hit|
|-----------|-------------|-------------|
| Actual Non-Hit| 1139 correct non-hit prediction (True Negative)| 31 false hit prediction (False Positive)|
| Actual Hit| 176 missed hit prediction (False Negative)| 93 correct hit predictions (True Positive)|
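
The metrics can be recomputed directly from the four counts in the confusion matrix table. The sketch below does that and also adds a baseline for comparison: the accuracy of a hypothetical model that always predicts "non-hit". The baseline is an addition for illustration and is not part of the original evaluation.

```python
# Counts taken from the confusion matrix table above.
tn, fp, fn, tp = 1139, 31, 176, 93
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predicting "non-hit" is correct on every actual non-hit.
baseline_accuracy = (tn + fp) / total

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
print(f"always-non-hit baseline accuracy = {baseline_accuracy:.3f}")
```

Comparing the model's accuracy with the baseline shows how much of the headline accuracy comes from the class imbalance alone, which is why precision and recall are the more informative numbers here.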

In [None]:
log_reg = clf.named_steps['model']
ohe = clf.named_steps['preprocess'].named_transformers_['cat']

cat_feature_names = ohe.get_feature_names_out(categorical_features)
all_feature_names = np.concatenate([numeric_features, cat_feature_names])

coef_df = pd.DataFrame({
    'feature': all_feature_names,
    'coefficient': log_reg.coef_[0],
    'abs_coef': np.abs(log_reg.coef_[0])
})


In [None]:
coef_df['clean_feature'] = (
    coef_df['feature']
    .str.replace("Publisher_grouped_", "", regex=False)
    .str.replace("Platform_", "", regex=False)
    .str.replace("Genre_", "", regex=False)
)

In [None]:
def get_category(name):
    if name.startswith("Publisher_grouped_"):
        return "Publisher"
    elif name.startswith("Platform_"):
        return "Platform"
    elif name.startswith("Genre_"):
        return "Genre"
    else:
        return "Numeric"

In [None]:
coef_df['Category'] = coef_df['feature'].apply(get_category)

In [None]:
print("\n📌 PUBLISHERS\n")
display(coef_df[coef_df['Category'] == 'Publisher']
        .sort_values('abs_coef', ascending=False)
        [['clean_feature', 'coefficient', 'abs_coef']])

In [None]:
print("\n📌 PLATFORMS\n")
display(coef_df[coef_df['Category'] == 'Platform']
        .sort_values('abs_coef', ascending=False)
        [['clean_feature', 'coefficient', 'abs_coef']])

In [None]:
print("\n📌 GENRES\n")
display(coef_df[coef_df['Category'] == 'Genre']
        .sort_values('abs_coef', ascending=False)
        [['clean_feature', 'coefficient', 'abs_coef']])

In [None]:
print("\n📌 NUMERIC FEATURES\n")
display(coef_df[coef_df['Category'] == 'Numeric']
        .sort_values('abs_coef', ascending=False)
        [['clean_feature', 'coefficient', 'abs_coef']])

The model shows that a game's success is most strongly associated with **Publisher Reputation**, **Critic Review Scores**, **Platform**, and **Genre**. **Nintendo-published** games and titles with higher critic scores have a much greater chance of reaching 1 million global sales. **PlayStation** releases tend to perform better than releases on other platforms, while **PC** and niche genres like **Strategy** and **Adventure** are less likely to produce hit games. Although the model has high accuracy and good precision, it struggles to detect all successful games due to class imbalance. Overall, both **game quality** and **market factors** determine whether a game becomes a commercial success.
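
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives a rough multiplicative effect on the odds of a hit. The sketch below applies this to the two coefficients quoted in the conclusions of this notebook (+1.53 and +1.40). The dictionary labels are descriptive placeholders, and because the numeric features were standardized, the `Critic_Score` figure is per one standard deviation rather than per review point.

```python
import math

# Coefficients quoted in the conclusions of this notebook (log-odds scale).
# The keys are descriptive labels chosen for this sketch.
coefficients = {
    "Nintendo as publisher": 1.53,
    "Critic_Score (per 1 standard deviation)": 1.40,
}

# exp(coefficient) is the multiplicative change in the odds of a hit,
# holding the other features in the model fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by about {ratio:.2f}")
```

These are changes in odds, not in probability, and they depend on the regularization and encoding used by the pipeline, so they are best read as a ranking aid rather than exact effect sizes.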

# Statistical Inference

### 1. Do games released on multi-platforms sell more in terms of global sales?

### Two Tailed T Test on Two Independent Sample Means<br>
Null Hypothesis: There is no significant difference between the mean global sales of multi-platform and single-platform games.<br>
Alternative Hypothesis: There is a significant difference between the mean global sales of multi-platform and single-platform games.

$$
H_0: \mu_{\text{multi}} = \mu_{\text{single}}
$$

$$
H_1: \mu_{\text{multi}} \neq \mu_{\text{single}}
$$


For each game, we counted the number of distinct `Platform` values it appears under (after the earlier grouping, these are platform groups).
Duplicate titles were then removed, and the games were split into two samples, `multi` and `single`, depending on whether that count was greater than 1.
A two-tailed t-test on the two sample means was then performed.

In [None]:
df['num_platforms'] = df.groupby('Name')['Platform'].transform('nunique')

df['multi_platform'] = (df['num_platforms'] > 1).astype(int)

df_unique = df.drop_duplicates(subset=['Name'])

multi = df_unique[df_unique['multi_platform'] == 1]['Global_Sales']
single = df_unique[df_unique['multi_platform'] == 0]['Global_Sales']


In [None]:
t_stat, p_value = ttest_ind(multi, single, equal_var=True, nan_policy='omit')

print("t-statistic:", t_stat)
print("p-value (two-tailed):", p_value)

Since the p-value is very small (below the 0.05 significance level), we reject the null hypothesis and conclude that there is a significant difference between the mean global sales of multi-platform and single-platform games.
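
For readers who want to see what `ttest_ind(..., equal_var=True)` computes, the sketch below rebuilds the pooled-variance t-statistic by hand on synthetic data and compares it with SciPy's result. The lognormal samples are stand-ins chosen for illustration and are not drawn from the dataset. Welch's variant (`equal_var=False`) is shown alongside as an option when the two groups may have different variances.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic right-skewed samples standing in for the two groups (illustrative only).
rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=0.0, sigma=1.0, size=300)
group_b = rng.lognormal(mean=-0.3, sigma=1.0, size=500)

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: the equal_var=True test assumes both groups share one variance.
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# t = difference in means divided by its standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_scipy, p_scipy = ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test drops the equal-variance assumption.
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print("manual t:", t_manual)
print("scipy t: ", t_scipy)
print("welch t: ", t_welch)
```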

### 2. Does genre affect global sales?

### ANOVA to compare genre means<br>
Null Hypothesis: All genres have the same mean global sales.<br>
Alternative Hypothesis: At least one genre has a different mean global sales.

$$
H_0: \mu_{\text{Sports}} = \mu_{\text{Platform}} = \mu_{\text{Racing}} = \mu_{\text{Role-Playing}} = \mu_{\text{Puzzle}} = \mu_{\text{Misc}} = \mu_{\text{Shooter}} = \mu_{\text{Simulation}} = \mu_{\text{Action}} = \mu_{\text{Fighting}} = \mu_{\text{Adventure}} = \mu_{\text{Strategy}}
$$

$$
H_1: \text{At least one } \mu_{\text{genre}} \text{ is different}
$$

In [None]:
genres = df['Genre'].dropna().unique()  # skip missing genres so no group is empty

sales_by_genre = [df[df['Genre'] == g]['Global_Sales'] for g in genres]

f_stat, p_value = f_oneway(*sales_by_genre)

print("F-statistic:", f_stat)
print("p-value:", p_value)

We can reject the null hypothesis: the very low p-value indicates that at least one game genre has a different mean global sales.
The F-statistic is also high, which means that the variation between genres is large relative to the variation within them.

This conclusion is consistent with the earlier EDA question on average global sales per genre,
in which shooters, platformers, and racing games were the top genres,
indicating that these genres may perform better in global sales than the other, less popular genres.
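
The F-statistic reported by `f_oneway` is the ratio of between-group variance to within-group variance. The sketch below computes it by hand on three tiny made-up groups and checks it against SciPy; the numbers are illustrative only and unrelated to the genre data.

```python
import numpy as np
from scipy.stats import f_oneway

# Three tiny made-up groups (illustrative values only).
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)       # number of groups
n = len(all_values)   # total number of observations

# Between-group sum of squares: how far each group mean sits from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_scipy = f_oneway(*groups)

print("manual F:", f_manual)
print("scipy F: ", f_scipy)
```

A significant F only says that some group mean differs; identifying which genres differ would need a follow-up comparison between pairs of groups.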

### 3. Critic and User Scores
Do higher critic or user scores affect global sales?

### For High and Low Critic Score
Null Hypothesis: There is no significant difference between the mean global sales of high and low critic scored games.<br>
Alternative Hypothesis: There is a significant difference between the mean global sales of high and low critic scored games.

$$
H_0: \mu_{\text{high}} = \mu_{\text{low}}
$$

$$
H_1: \mu_{\text{high}} \neq \mu_{\text{low}}
$$



In [None]:
df_unique = df.drop_duplicates(subset=['Name'])

# One row per title (the first listed), assuming a game's scores are similar across platforms
high_score = df_unique[df_unique['Critic_Score'] >= 80]['Global_Sales']
low_score = df_unique[df_unique['Critic_Score'] < 80]['Global_Sales']

t_stat, p_value = ttest_ind(high_score, low_score, equal_var=True, nan_policy='omit')

print("t-statistic:", t_stat)
print("p-value (two-tailed):", p_value)

### For High and Low User Score
Null Hypothesis: There is no significant difference between the mean global sales of high and low user scored games.<br>
Alternative Hypothesis: There is a significant difference between the mean global sales of high and low user scored games.

$$
H_0: \mu_{\text{high}} = \mu_{\text{low}}
$$

$$
H_1: \mu_{\text{high}} \neq \mu_{\text{low}}
$$

In [None]:
df_unique = df.drop_duplicates(subset=['Name'])

# One row per title (the first listed), assuming a game's scores are similar across platforms.
# User scores are on a 0-10 scale; 7 is assumed here as the cutoff for a "high" score.
high_score = df_unique[df_unique['User_Score'] >= 7]['Global_Sales']
low_score = df_unique[df_unique['User_Score'] < 7]['Global_Sales']

t_stat, p_value = ttest_ind(high_score, low_score, equal_var=True, nan_policy='omit')

print("t-statistic:", t_stat)
print("p-value (two-tailed):", p_value)

In [None]:
df.to_csv("Video_Game_Sales_Current.csv", index=False)

# **Insights and Conclusions**

Based on the Comprehensive Exploratory Data Analysis (EDA), Logistic Regression Modeling, and Statistical Inference performed above, we can answer the main research question: **"What determines a game's success in terms of global sales?"**

### **1. Key Determinants of Success**

**A. The Power of the Publisher**
* **Nintendo Dominance:** Our predictive model identified `Nintendo` as the single strongest positive coefficient (+1.53) for predicting a "Hit" game. While **Electronic Arts** has the highest *total* sales volume due to the sheer number of titles released, **Nintendo** titles have a significantly higher *average* sales per game (~2.88M vs 0.92M).
* **Publisher Hierarchy:** Top-tier publishers (Nintendo, Sony, Activision, EA) are associated with a higher probability of success than independent or smaller publishers, which had a negative coefficient in our model.

**B. The Critical Role of Professional Reviews**
* **Critics > Users:** There is a clear distinction between professional reception and player sentiment regarding sales. `Critic_Score` showed a weak positive correlation with sales (r = 0.24) and was the strongest numeric predictor in our logistic regression model (+1.40 coefficient).
* **User Disconnect:** `User_Score` had a very weak correlation with sales (r = 0.09) and actually showed a negative coefficient in the multivariate model. This suggests that while high user scores are good for reputation, they may not be tied to purchasing behavior as closely as critic reception is.
* **Statistical Support:** Our t-test indicated that games with a Critic Score $\ge$ 80 sell significantly better than those below that threshold (t-stat: 17.8).

**C. Platform Strategy: Go Wide**
* **Multi-platform Advantage:** Our statistical inference indicated that games released on multiple platforms sell significantly more than single-platform exclusives ($p < 0.05$). Unless a studio is a first-party developer (like Nintendo or Sony), restricting a game to one console may limit revenue potential.
* **Platform Leaders:** In terms of average sales per game, **Nintendo** and **PlayStation** are the leaders. PC gaming showed lower average sales per title in this specific dataset, though this may be skewed by the high volume of niche/indie titles on PC compared to curated console ecosystems.

**D. Genre Selection**
* **High Performers:** **Shooter**, **Platform**, and **Sports** games generate the highest average revenue. The Shooter genre, in particular, is a consistent high-performer.
* **Niche Markets:** Genres like **Strategy** and **Adventure** have negative coefficients in our predictive model, indicating they are statistically less likely to reach the "1 Million Sales" benchmark compared to action-oriented genres.

---

### **2. Recommendations for Game Studios**

Based on these findings, a new or existing game studio aiming to maximize global sales should consider the following strategies:

1.  **Prioritize Quality Assurance & Media Relations:** Because `Critic_Score` is the strongest numeric predictor of a hit in our model, investing in polish to earn high review scores is likely to matter more financially than catering solely to user wishlists.
2.  **Avoid Exclusivity:** Unless funded by a major console manufacturer, aim for a **multi-platform release** (PlayStation + Xbox + Nintendo) to maximize market penetration.
3.  **Genre Viability:** If financial success is the primary goal, developing for mass-market genres like **Shooters** or **Action** provides a safer baseline than niche genres like Strategy, which require a very specific target audience to succeed.
4.  **Publisher Partnerships:** If possible, securing a publishing deal with a top-10 publisher (like Ubisoft, Take-Two, or Sony) significantly increases the statistical likelihood of producing a "Hit."

### **3. Model Limitations**
* **Recall (34.6%):** Our logistic model is conservative; it has high precision (it's usually right when it predicts a hit) but low recall (it misses many actual hits). This indicates that there are other "X-factors" (marketing budget, IP recognition, trends/virality) not present in this dataset that contribute to a game becoming a sleeper hit.
* **Data Age:** As the dataset ends around 2017, recent trends such as the rise of Free-to-Play models (Fortnite, Warzone) and mobile gaming dominance are not represented here.