# **Project Name** - **VIDEO GAME SALES AND ENGAGEMENT ANALYSIS**

- **Project Type :** Exploratory Data Analysis (EDA)
- **Contribution :** Individual


# **Project Summary** - **Video Game Sales and Engagement Analysis**

- **Project Objective :**
  - To analyze video game sales and user engagement data to identify key factors influencing game performance and commercial success.
  - To support data-driven decisions for game development, marketing strategies, platform prioritization and sales forecasting.

- **Data Overview :**
  - Two datasets were used and merged using the game title:
    - **Game Engagement Data (games.csv):** Ratings, plays, wishlists, backlogs, genres, platforms, developers, release dates
    - **Sales Data (vgsales.csv):** Regional sales (NA, EU, JP, Other), global sales, publishers, platforms, release year
  - The final merged dataset provides a comprehensive view of how user engagement, ratings, genres and platforms impact global video game sales.

- **Methodology & Tools :**

  **1. Excel :**
  - Initial data inspection and validation
  - Understanding basic distributions and missing values

  **2. Python (Pandas, NumPy, Matplotlib, Seaborn) :**
  - Data cleaning and preprocessing
  - Exploratory Data Analysis (EDA)
  - Trend analysis across genres, platforms, and ratings

  **3. SQL :**
  - Structured database creation
  - Table normalization and data merging
  - Query-based analysis for sales and engagement insights

  **4. Power BI :**
  - Interactive dashboard development
  - KPI creation and visual storytelling
  - Drill-down analysis using slicers and filters

- **Key Findings – Engagement & Ratings :**
  - Games with higher user ratings and wishlists generally achieve stronger global sales.
  - High engagement metrics (plays, wishlists) act as early indicators of commercial success.
  - Several games show strong engagement but lower sales, highlighting opportunities for better marketing strategies.

- **Key Findings – Genre & Platform Analysis :**
  - Action, Sports, and Shooter genres dominate global sales.
  - Certain platforms consistently outperform others in both sales and engagement.
  - Platform–genre combinations play a critical role in determining a game’s success.

- **Key Findings – Regional Patterns :**
  - North America and Europe contribute the highest share of global video game sales.
  - Japan shows distinct genre preferences compared to Western markets.
  - Regional sales variations indicate the importance of localized marketing strategies.

- **Dashboard Insights (Power BI) :**
  - Top-selling games and publishers
  - Genre-wise and platform-wise sales distribution
  - Regional sales heatmaps
  - Rating vs sales and wishlist vs sales analysis
  - KPIs: Total sales, average rating, top genres, top platforms

- **Actionable Recommendations :**
  - Focus development and marketing efforts on high-performing genre–platform combinations.
  - Leverage wishlist and engagement data to forecast sales before game launch.
  - Apply region-specific marketing strategies based on sales preferences.
  - Improve monetization strategies for high-engagement but low-sales games.

- **Business Impact :**
  - Enables data-driven decision-making for game developers and publishers.
  - Improves sales forecasting accuracy and marketing efficiency.
  - Helps optimize resource allocation across platforms and genres.
  - Supports long-term growth through better understanding of player behavior and market trends.
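
The genre–platform recommendation above can be explored with a simple aggregation. The sketch below is illustrative only: the rows are made up, and the column names (`Genre`, `Platform`, `Global_Sales`) are assumed to match those used for `vgsales.csv` elsewhere in this notebook.

```python
import pandas as pd

# Made-up illustrative rows (NOT real sales figures); column names are
# assumed to mirror vgsales.csv as used in this notebook.
sales_demo = pd.DataFrame({
    "Genre": ["Action", "Action", "Sports", "Sports", "Shooter"],
    "Platform": ["PS4", "X360", "PS4", "Wii", "X360"],
    "Global_Sales": [5.0, 3.0, 2.0, 8.0, 4.0],
})

# Total global sales for every genre-platform pair, strongest first.
combo_totals = (
    sales_demo.groupby(["Genre", "Platform"])["Global_Sales"]
    .sum()
    .sort_values(ascending=False)
)

# Same totals as a genre x platform matrix; pairs with no titles become 0.
combo_matrix = combo_totals.unstack(fill_value=0)

print(combo_totals.head(3))
print(combo_matrix)
```

Ranking totals favors combinations with many releases; a per-title average (`.mean()` instead of `.sum()`) may be a useful second view.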

# **GitHub Link -**

https://github.com/Vijayvardhan2216/Video-Game-Sales-and-Engagement-Analysis.git

# **Problem Statement -**


- **Core Challenge :** The gaming industry faces the challenge of accurately predicting game performance and commercial success due to variations in user engagement, genre popularity, platform preferences and regional demand. Without clear analytical insights, developers and publishers risk investing in low-performing titles or missing high-potential opportunities.

- **Performance Variability Issues :** This challenge manifests in significant differences in game sales and engagement levels caused by multiple factors:
  - **Uneven Sales Distribution :** Certain genres, platforms and publishers generate disproportionately higher global sales compared to others.
  - **Uncertain Sales Drivers :** It is unclear which factors (user ratings, wishlist counts, genre, platform, publisher, or regional demand) contribute most to increased global sales.
  - **Engagement–Sales Gap :** Some games show strong engagement metrics (plays, wishlists, backlogs) but fail to convert that interest into strong commercial performance.

- **Impact on Stakeholders :**
  - **For Players :** Limited understanding of user preferences may result in fewer high-quality releases aligned with audience demand.
  - **For Developers & Publishers :** Inaccurate sales forecasting leads to inefficient marketing budgets, poor platform selection, and financial risk.
  - **For the Gaming Ecosystem :** Lack of predictive insights restricts strategic planning, innovation, and sustainable market growth.

- **Analytical Goal :** The objective is to leverage data analytics, SQL-based data modeling and Power BI visualization to identify the key drivers of video game sales and engagement. By quantifying the impact of ratings, genres, platforms and regional trends, the project aims to support data-driven marketing strategies, optimized product development and accurate sales forecasting—helping gaming companies maximize profitability while aligning with player preferences.
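
One possible way to quantify the engagement–sales gap described above is to compare percentile ranks of an engagement metric and of sales. This is a minimal sketch on made-up data; the `Wishlist` column name and the 0.75 / 0.50 cut-offs are assumptions chosen for illustration, not values taken from the project.

```python
import pandas as pd

# Made-up illustrative rows; "Wishlist" is an assumed column name.
merged_demo = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e"],
    "Wishlist": [900, 100, 800, 300, 500],
    "Global_Sales": [0.2, 0.1, 6.0, 1.5, 2.0],
})

# Percentile rank (0-1) of each game on engagement and on sales.
merged_demo["wishlist_pct"] = merged_demo["Wishlist"].rank(pct=True)
merged_demo["sales_pct"] = merged_demo["Global_Sales"].rank(pct=True)

# Hypothetical thresholds: top quarter on wishlists, bottom half on sales.
gap_games = merged_demo[
    (merged_demo["wishlist_pct"] >= 0.75) & (merged_demo["sales_pct"] <= 0.50)
]

print(gap_games[["Name", "Wishlist", "Global_Sales"]])
```

Games flagged this way are candidates for a closer look at marketing or pricing; the thresholds should be tuned to the real distributions.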

#### **Define Your Business Objective ?**

The primary business objective of this project is to enable data-driven decision-making in the gaming industry by identifying the key factors that drive video game sales and user engagement.

- **Specifically, the project aims to :**
  - Analyze how user ratings, wishlists, plays, genres, platforms and publishers influence global and regional sales.
  - Identify high-performing genre–platform combinations to guide future game development investments.
  - Support marketing teams in targeting the right regions and audiences based on historical sales patterns.
  - Improve sales forecasting accuracy using engagement indicators such as ratings and wishlist counts.
  - Optimize resource allocation by focusing on platforms and genres with proven commercial success.

By achieving these objectives, the project helps developers and publishers reduce financial risk, maximize revenue potential and align product strategies with consumer demand in the competitive gaming market.

# **General Guidelines -**  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# ==============================
# Import Required Libraries
# ==============================

# Data Handling
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Date Handling
from datetime import datetime

# Machine Learning (Optional – if doing prediction)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Warnings
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# ==============================
# Dataset Loading
# ==============================

# Load engagement dataset
games = pd.read_csv("/content/games.csv")

# Load sales dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Display first 5 rows
print("Games Dataset Preview:")
print(games.head())

print("\nVG Sales Dataset Preview:")
print(vgsales.head())

# Check dataset shape
print("\nGames Dataset Shape:", games.shape)
print("VG Sales Dataset Shape:", vgsales.shape)

# Check data types and null values
# (info() prints its report directly and returns None, so it is not wrapped in print)
print("\nGames Dataset Info:")
games.info()

print("\nVG Sales Dataset Info:")
vgsales.info()

### Dataset First View

In [None]:
# ==============================
# Dataset First View
# ==============================

# View first 5 rows of both datasets
print("First 5 Rows - Games Dataset")
display(games.head())

# View last 5 rows
print("Last 5 Rows - Games Dataset")
display(games.tail())

In [None]:
print("First 5 Rows - VG Sales Dataset")
display(vgsales.head())

print("Last 5 Rows - VG Sales Dataset")
display(vgsales.tail())

### Dataset Rows & Columns count

In [None]:
# ==============================
# Dataset Rows & Columns Count
# ==============================

# Games dataset shape
games_rows, games_columns = games.shape
print("Games Dataset:")
print("Number of Rows:", games_rows)
print("Number of Columns:", games_columns)

In [None]:

# VG Sales dataset shape
sales_rows, sales_columns = vgsales.shape
print("\nVG Sales Dataset:")
print("Number of Rows:", sales_rows)
print("Number of Columns:", sales_columns)

### Dataset Information

In [None]:
# ==============================
# Dataset Information
# ==============================

print("Games Dataset Information")
print("---------------------------------")
games.info()

print("\nVG Sales Dataset Information")
print("---------------------------------")
vgsales.info()

#### Duplicate Values

In [None]:
# ==============================
# Duplicate Values Check
# ==============================

# Check duplicates in games dataset
games_duplicates = games.duplicated().sum()
print("Number of duplicate rows in Games Dataset:", games_duplicates)

# Check duplicates in vgsales dataset
sales_duplicates = vgsales.duplicated().sum()
print("Number of duplicate rows in VG Sales Dataset:", sales_duplicates)

# Display duplicate rows (optional)
print("Duplicate Rows in Games Dataset:")
display(games[games.duplicated()])

print("Duplicate Rows in VG Sales Dataset:")
display(vgsales[vgsales.duplicated()])

# Remove duplicates
games = games.drop_duplicates()
vgsales = vgsales.drop_duplicates()

print("Duplicates removed successfully.")

#### Missing Values/Null Values

In [None]:
# ==============================
# Missing Values Check
# ==============================

# Check missing values in Games dataset
print("Missing Values in Games Dataset:")
print(games.isnull().sum())

print("\n----------------------------------\n")

# Check missing values in VG Sales dataset
print("Missing Values in VG Sales Dataset:")
print(vgsales.isnull().sum())

# Percentage of missing values

print("Missing Value Percentage - Games Dataset:")
print((games.isnull().sum() / len(games)) * 100)

print("\nMissing Value Percentage - VG Sales Dataset:")
print((vgsales.isnull().sum() / len(vgsales)) * 100)
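
The count and percentage checks above can also be combined into one table per dataset. The helper below is a sketch: `missing_summary` is a hypothetical name introduced here, and the demo frame is made up so the example runs without the CSV files.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst first.

    Hypothetical helper for illustration; columns without missing
    values are left out of the result.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Made-up demo frame standing in for games / vgsales.
demo = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Year": [2001.0, np.nan, 2010.0, np.nan],
    "Publisher": ["x", None, "y", "z"],
})

report = missing_summary(demo)
print(report)
```

Calling `missing_summary(games)` and `missing_summary(vgsales)` would give the same view for the real datasets.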

### What did you know about your dataset?

- **Summary of Dataset Findings**
  - **Size :** The dataset consists of two combined data sources containing video game sales and engagement records. It includes multiple features covering ratings, plays, wishlists, genres, platforms, publishers, regional sales and global sales for each game.
  - **Quality :** After preprocessing, duplicate records were removed and inconsistencies in column names, formats, and categorical values were standardized, resulting in a clean and structured dataset suitable for analysis.
  - **Missing Data :** Minor missing values were observed in certain fields (such as ratings, publisher, or year). These were handled using appropriate imputation techniques or labeled as “Unknown,” ensuring reliability and completeness for further analysis.
  - **Feature Composition :** The dataset contains both numerical variables (ratings, plays, wishlist counts, regional sales, global sales) and categorical variables (genre, platform, publisher, developer, region).
  - **Key Sales Drivers :** User ratings, wishlist counts, genre and platform type emerged as the most influential factors affecting global sales performance.
  - **Sales Patterns :** Global sales are highly concentrated among specific genres and platforms. Games with higher engagement metrics tend to show stronger commercial performance.
  - **Regional Variation :** North America and Europe contribute the highest share of global sales, while Japan exhibits distinct genre preferences, indicating the importance of region-specific strategies.
  - **Analytical Focus :** The analysis focuses on identifying the key drivers of video game sales, understanding engagement-to-sales relationships and supporting data-driven marketing, development and sales forecasting decisions.


## ***2. Understanding Your Variables***

In [None]:
# ==============================
# Complete Dataset Understanding Code
# ==============================

# 1️⃣ Import Library
import pandas as pd

# 2️⃣ Load Datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

print("Datasets Loaded Successfully!\n")

# 3️⃣ Display Column Names
print("Games Dataset Columns:")
print(games.columns)

print("\nVG Sales Dataset Columns:")
print(vgsales.columns)

# 4️⃣ Data Types
print("\nGames Dataset Data Types:")
print(games.dtypes)

print("\nVG Sales Dataset Data Types:")
print(vgsales.dtypes)

# 5️⃣ Separate Numerical and Categorical Variables
games_num = games.select_dtypes(include=['int64', 'float64']).columns
games_cat = games.select_dtypes(include=['object']).columns

sales_num = vgsales.select_dtypes(include=['int64', 'float64']).columns
sales_cat = vgsales.select_dtypes(include=['object']).columns

print("\nGames Numerical Variables:")
print(games_num)

print("\nGames Categorical Variables:")
print(games_cat)

print("\nVG Sales Numerical Variables:")
print(sales_num)

print("\nVG Sales Categorical Variables:")
print(sales_cat)

# 6️⃣ Statistical Summary
print("\nStatistical Summary - Games Dataset")
print(games.describe())

print("\nStatistical Summary - VG Sales Dataset")
print(vgsales.describe())

### Variables Description

In [None]:
# ==============================
# Variables Description (Auto Summary)
# ==============================

import pandas as pd

# Load datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

# Function to create variable description table
def variable_description(df, dataset_name):
    summary = pd.DataFrame({
        "Variable Name": df.columns,
        "Data Type": df.dtypes.values,
        "Non-Null Count": df.count().values,
        "Missing Values": df.isnull().sum().values,
        "Unique Values": df.nunique().values
    })

    print(f"\n==============================")
    print(f"{dataset_name} - Variable Description")
    print(f"==============================\n")

    return summary

# Generate summary tables
games_summary = variable_description(games, "Games Dataset")
vgsales_summary = variable_description(vgsales, "VG Sales Dataset")

display(games_summary)
display(vgsales_summary)

### Check Unique Values for each variable.

In [None]:
# ==============================
# Check Unique Values
# ==============================

import pandas as pd

# Load datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

# Function to check unique values
def check_unique_values(df, dataset_name):
    print(f"\n==============================")
    print(f"Unique Values - {dataset_name}")
    print(f"==============================\n")

    for col in df.columns:
        print(f"Column: {col}")
        print("Number of Unique Values:", df[col].nunique())
        print("Unique Values Sample:", df[col].unique()[:10])  # show first 10 unique values
        print("-" * 50)

# Apply function
check_unique_values(games, "Games Dataset")
check_unique_values(vgsales, "VG Sales Dataset")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# =====================================
# Data Wrangling – Video Game Analysis
# =====================================

import pandas as pd
import numpy as np

# 1️⃣ Load Datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

print("Datasets Loaded Successfully!\n")

# 2️⃣ Standardize Column Names
games.columns = games.columns.str.strip().str.replace(" ", "_")
vgsales.columns = vgsales.columns.str.strip().str.replace(" ", "_")

# 3️⃣ Remove Duplicate Rows
games.drop_duplicates(inplace=True)
vgsales.drop_duplicates(inplace=True)

# 4️⃣ Handle Missing Values

# Numerical columns → fill with median
# (assign the result back; chained fillna(..., inplace=True) on a column
#  is deprecated in pandas 2.x and may not modify the DataFrame)
for col in games.select_dtypes(include=['int64', 'float64']).columns:
    games[col] = games[col].fillna(games[col].median())

for col in vgsales.select_dtypes(include=['int64', 'float64']).columns:
    vgsales[col] = vgsales[col].fillna(vgsales[col].median())

# Categorical columns → fill with "Unknown"
for col in games.select_dtypes(include=['object']).columns:
    games[col] = games[col].fillna("Unknown")

for col in vgsales.select_dtypes(include=['object']).columns:
    vgsales[col] = vgsales[col].fillna("Unknown")

# 5️⃣ Convert Data Types

# Convert Year to integer (if float)
if 'Year' in vgsales.columns:
    vgsales['Year'] = vgsales['Year'].astype(int)

# Convert Release_Date to datetime (if exists)
if 'Release_Date' in games.columns:
    games['Release_Date'] = pd.to_datetime(games['Release_Date'], errors='coerce')

# 6️⃣ Standardize Text Columns (Lowercase for consistency)
for col in games.select_dtypes(include=['object']).columns:
    games[col] = games[col].str.strip().str.lower()

for col in vgsales.select_dtypes(include=['object']).columns:
    vgsales[col] = vgsales[col].str.strip().str.lower()

# 7️⃣ Merge Datasets on Game Title
# Adjust column names if needed (Title vs Name)

if 'Title' in games.columns:
    games.rename(columns={'Title': 'Name'}, inplace=True)

merged_data = pd.merge(games, vgsales, on='Name', how='inner')

print("Data Wrangling Completed Successfully!\n")
print("Merged Dataset Shape:", merged_data.shape)

# Display first few rows
display(merged_data.head())
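
Because a sales table can list the same title once per platform, a merge on `Name` alone may duplicate engagement rows. The sketch below shows one way to inspect that, using made-up frames; the `Rating` and `Platform` columns are assumptions, and collapsing sales to one row per title is only one possible design choice, not necessarily the approach used in this project.

```python
import pandas as pd

# Made-up demo frames; titles are already lowercased as in the wrangling step.
games_demo = pd.DataFrame({
    "Name": ["halo", "zelda", "tetris"],
    "Rating": [4.1, 4.6, 4.0],
})
sales_demo = pd.DataFrame({
    "Name": ["halo", "halo", "zelda", "fifa"],
    "Platform": ["x360", "pc", "wii", "ps4"],
    "Global_Sales": [8.0, 1.0, 5.0, 6.0],
})

# Rows in the sales table whose title appears more than once.
repeated_title_rows = sales_demo["Name"].duplicated(keep=False).sum()

# Outer merge with indicator=True shows how many rows matched on each side.
check = pd.merge(games_demo, sales_demo, on="Name", how="outer", indicator=True)
match_counts = check["_merge"].value_counts()

# Option: collapse sales to one row per title, then enforce a 1:1 merge.
sales_per_title = sales_demo.groupby("Name", as_index=False)["Global_Sales"].sum()
merged_demo = pd.merge(
    games_demo, sales_per_title, on="Name", how="inner", validate="one_to_one"
)

print("Sales rows with repeated titles:", repeated_title_rows)
print(match_counts)
print(merged_demo)
```

`validate="one_to_one"` raises a `MergeError` if either side still has duplicate keys, which makes silent row multiplication easier to catch.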

### What manipulations have you done and what insights did you find?

- **Data Manipulations Performed :**
  - Merged two datasets (games.csv and vgsales.csv) using the common game title/name.
  - Renamed and standardized column names for consistency and easier analysis.
  - Converted categorical text (genre, platform, publisher, developer) into standardized lowercase format.
  - Checked and removed duplicate records to avoid biased aggregation.
  - Checked and handled missing values using median imputation (numerical) and “Unknown” replacement (categorical).
  - Identified and separated numerical and categorical variables for proper analysis.
  - Performed exploratory data analysis (EDA) to understand sales distribution and engagement patterns.
  - Created aggregated features such as total regional sales comparisons.
  - Built correlation analysis to identify relationships between engagement metrics and global sales.
  - Developed interactive Power BI dashboards to visualize KPIs and trends.

- **Key Insights Discovered :**
  - User ratings and wishlist counts show a strong positive relationship with global sales.
  - Certain genres (such as Action and Sports) consistently generate higher revenue compared to others.
  - Platform choice significantly influences commercial success, with specific platforms dominating global sales.
  - North America and Europe contribute the largest share of global revenue, while Japan shows distinct genre preferences.
  - A small percentage of top-performing games account for a disproportionately large share of total global sales.
  - Some games exhibit high engagement but comparatively low sales, indicating potential marketing or pricing gaps.
  - Sales performance can be reasonably anticipated using engagement indicators, supporting early demand forecasting.
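
The concentration insight above (a small share of titles accounting for a large share of sales) can be checked with a short calculation. This is a sketch on made-up numbers; `top_share` is a hypothetical helper and the 20% cut-off is an arbitrary illustrative choice.

```python
import pandas as pd


def top_share(values: pd.Series, top_fraction: float) -> float:
    """Fraction of the total contributed by the top `top_fraction` of rows."""
    n_top = max(1, int(round(len(values) * top_fraction)))
    ordered = values.sort_values(ascending=False)
    return ordered.head(n_top).sum() / ordered.sum()


# Made-up global sales values (NOT taken from vgsales.csv).
sales_demo = pd.Series(
    [40.0, 20.0, 10.0, 10.0, 5.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    name="Global_Sales",
)

share_top_20 = top_share(sales_demo, 0.20)
print(f"Top 20% of titles hold {share_top_20:.0%} of demo sales")
```

On the real data the same function could be applied to `vgsales["Global_Sales"]` to put a number on the concentration claim.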

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# ==============================
# Chart 1 - Top 10 Genres by Global Sales
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Group by Genre and sum global sales
genre_sales = vgsales.groupby("Genre")["Global_Sales"].sum().sort_values(ascending=False).head(10)

# Plot
plt.figure(figsize=(10,6))
genre_sales.plot(kind="bar")

plt.title("Top 10 Genres by Global Sales")
plt.xlabel("Genre")
plt.ylabel("Total Global Sales (Millions)")
plt.xticks(rotation=45)

plt.show()

##### 1. Why did you pick the specific chart?

1. A **bar chart** is ideal for comparing a **categorical variable** (genre) against a **numerical measure** (total global sales).

2. **Genre** is a discrete category and **Global Sales** is summed per genre, so each bar represents one genre's total.

3. Sorting the bars in descending order gives a **clear visual ranking** of genres.

4. The chart makes it easy to see how **concentrated** global sales are among the leading genres.

5. It directly supports decisions about **which genres to prioritize** in development and marketing.

##### 2. What is/are the insight(s) found from the chart?

1. The chart ranks genres by total global sales, with **Action, Sports and Shooter** among the leading genres.

2. Global sales are **concentrated in a few genres**, while the remaining genres contribute smaller totals.

3. Because the bars show **totals rather than averages**, a genre with many releases can rank highly even if a typical title sells modestly.

4. The ranking reflects global totals only; **regional preferences can differ**, as Japan's distinct genre preferences show.

5. Overall, the chart suggests that **genre is an important factor in commercial performance**, but it is not the only determinant of success.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Helps identify **high-performing genres and platforms** driving maximum global sales.
2. Enables **data-driven development decisions**, improving return on investment.
3. Supports **targeted marketing strategies** using engagement indicators like ratings and wishlists.
4. Improves **sales forecasting accuracy** by leveraging user engagement trends.
5. Enhances **resource allocation efficiency** by focusing on profitable genre–platform combinations.

---

**⚠️ Negative Growth Risks :**

1. **Ignoring engagement metrics (ratings, wishlists)** may lead to poor demand forecasting.
2. Over-reliance on a few dominant genres can result in **market saturation and declining growth**.
3. Failing to address the **engagement–sales gap** may cause missed revenue opportunities.
4. Neglecting regional preferences can reduce competitiveness in key markets.

---

* **Conclusion :**

  * Acting on these insights allows gaming companies to implement **strategic, profitable and consumer-aligned development and marketing decisions**.
  * Ignoring them may result in **misallocated investments, weaker sales performance and reduced competitive advantage**.

#### Chart - 2

In [None]:
# ==============================
# Chart 2 - Rating vs Global Sales
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

# Standardize column names
games.columns = games.columns.str.strip().str.replace(" ", "_")
vgsales.columns = vgsales.columns.str.strip().str.replace(" ", "_")

# Rename for merging if necessary
if "Title" in games.columns:
    games.rename(columns={"Title": "Name"}, inplace=True)

# Merge datasets
merged = pd.merge(games, vgsales, on="Name", how="inner")

# Plot scatter chart
plt.figure(figsize=(8,6))
plt.scatter(merged["Rating"], merged["Global_Sales"])

plt.title("User Rating vs Global Sales")
plt.xlabel("User Rating")
plt.ylabel("Global Sales (Millions)")

plt.show()

##### 1. Why did you pick the specific chart?

1. A **scatter plot** was chosen because it is ideal for analyzing the relationship between **two numerical variables**.

2. **User Rating** and **Global Sales** are both continuous variables, making a scatter plot the most appropriate visualization method.

3. It helps visually examine whether a **correlation exists** between game quality perception and commercial performance.

4. The chart allows easy identification of **trends, clusters and outliers**, which may not be clearly visible in other chart types.

5. Unlike bar or line charts, a scatter plot directly shows the **strength and direction of the relationship**, supporting deeper analytical insights.

##### 2. What is/are the insight(s) found from the chart?

1. The chart indicates a **positive relationship** between **User Rating** and **Global Sales**, suggesting that higher-rated games generally achieve better commercial performance.

2. Most high-sales games are clustered within the **upper rating range**, reinforcing the importance of quality and user satisfaction.

3. The relationship is **not perfectly linear**, meaning ratings alone do not fully determine sales performance.

4. A few **outliers** exist — some games with moderate ratings generate very high sales, likely due to strong branding, marketing, or platform popularity.

5. Overall, the chart suggests that **user perception influences sales**, but other factors such as genre, platform, and regional demand also play a significant role.
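
The direction and strength of the relationship read off the scatter plot can be backed with correlation coefficients. The sketch below uses made-up values and assumes the merged frame has `Rating` and `Global_Sales` columns; Spearman is computed as Pearson on ranks so that only pandas is needed.

```python
import pandas as pd

# Made-up values: sales rise with rating, but far from linearly.
merged_demo = pd.DataFrame({
    "Rating": [2.5, 3.0, 3.5, 4.0, 4.5],
    "Global_Sales": [0.1, 0.3, 0.4, 1.0, 9.0],
})

# Pearson measures linear association.
pearson = merged_demo["Rating"].corr(merged_demo["Global_Sales"])

# Spearman measures monotonic association: Pearson applied to ranks.
spearman = merged_demo["Rating"].rank().corr(merged_demo["Global_Sales"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

A Spearman value clearly above the Pearson value, as in this demo, is one sign that the relationship is monotonic but not linear, which matches the "not perfectly linear" observation.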

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. The insight confirms that **higher user ratings are associated with stronger global sales**, encouraging companies to prioritize product quality and user experience.

2. It supports investment in **game testing, updates and performance optimization**, which can improve ratings and revenue.

3. Helps marketing teams highlight **positive reviews and ratings** to boost consumer confidence and pre-release demand.

4. Enables better **sales forecasting** by using rating trends as an early performance indicator.

5. Encourages long-term brand building, as consistently high-rated games strengthen publisher reputation.

---

**⚠️ Negative Growth Risks :**

1. Relying only on **ratings as a success predictor** may overlook other critical drivers like genre trends or platform demand.

2. Some moderately rated games still achieve high sales due to brand loyalty or marketing; ignoring this could lead to **missed commercial opportunities**.

3. Over-investing in niche high-rated games without market demand could result in **lower revenue growth**.

4. Ignoring outliers in the data may lead to **oversimplified strategies** and inaccurate decision-making.

---

* **Conclusion :**

  * Acting on these insights allows gaming companies to focus on **quality-driven growth and strategic marketing**, improving profitability.
  * However, depending solely on ratings without considering broader market factors could limit revenue potential and long-term expansion.

#### Chart - 3

In [None]:
# ==============================
# Chart 3 - Regional Sales Comparison
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Calculate total regional sales
regional_sales = vgsales[["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]].sum()

# Plot bar chart
plt.figure(figsize=(8,6))
regional_sales.plot(kind="bar")

plt.title("Total Regional Sales Comparison")
plt.xlabel("Region")
plt.ylabel("Total Sales (Millions)")
plt.xticks(rotation=0)

plt.show()

##### 1. Why did you pick the specific chart?

1. A **bar chart** was selected because it is ideal for comparing **categorical variables** (regions) against a **numerical measure** (total sales).

2. The objective was to clearly compare sales performance across **North America, Europe, Japan and Other regions**, which makes side-by-side comparison essential.

3. Bar charts provide a **clear visual ranking**, making it easy to identify the highest and lowest revenue-generating regions.

4. Unlike line charts (used for trends over time), this analysis focuses on **distribution comparison**, making a bar chart more appropriate.

5. The chart effectively supports **strategic regional decision-making**, such as targeted marketing and localization strategies.

##### 2. What is/are the insight(s) found from the chart?

1. The chart shows that **North America generates the highest total sales**, making it the most profitable region in the gaming market.

2. **Europe is the second-largest contributor**, indicating strong market demand similar to North America.

3. **Japan shows comparatively lower total sales**, but it remains a significant and distinct market.

4. The “Other” regions contribute the smallest share, suggesting either emerging markets or lower penetration.

5. Overall, the chart highlights that **global sales are heavily concentrated in NA and EU**, emphasizing the importance of region-focused marketing and distribution strategies.
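
The regional comparison is often easier to discuss as percentage shares than as raw totals. The sketch below uses made-up rows with the same regional column names as the chart code; the resulting percentages are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows (NOT real figures) with the regional columns used above.
vgsales_demo = pd.DataFrame({
    "NA_Sales": [4.0, 1.0],
    "EU_Sales": [2.0, 1.0],
    "JP_Sales": [1.0, 0.5],
    "Other_Sales": [0.4, 0.1],
})

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Total sales per region, then each region's share of the combined total.
regional_totals = vgsales_demo[region_cols].sum()
regional_share = (regional_totals / regional_totals.sum() * 100).round(1)

print(regional_share)
```

Replacing `vgsales_demo` with the loaded `vgsales` frame would yield the actual shares behind the bar chart.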

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Identifying **North America and Europe as top revenue-generating regions** helps companies prioritize marketing budgets and distribution efforts in high-return markets.

2. Enables **region-specific game launches and promotional strategies**, improving sales efficiency.

3. Supports localization strategies (language, culture, pricing) to maximize regional performance.

4. Helps optimize supply chain and digital distribution planning based on demand concentration.

5. Provides data-backed insights for **strategic expansion into emerging markets** under the “Other” category.

---

**⚠️ Negative Growth Risks :**

1. Over-dependence on **North America and Europe** may increase financial risk if demand declines or competition intensifies in those regions.

2. Ignoring smaller markets may lead to **missed long-term growth opportunities** in emerging regions.

3. Failing to adapt to **Japan’s distinct genre preferences** could reduce competitiveness in that market.

4. Concentrated revenue streams can create **market vulnerability**, limiting diversification and sustainable expansion.

---

* **Conclusion :**

  * Acting on these insights enables gaming companies to implement **region-focused growth strategies and maximize revenue potential**.
  * However, over-reliance on dominant markets without diversification may expose companies to **regional demand fluctuations and long-term growth risks**.

#### Chart - 4

In [None]:
# ==============================
# Chart 4 - Sales Trend Over Years
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Group by Year and sum global sales
yearly_sales = vgsales.groupby("Year")["Global_Sales"].sum().sort_index()

# Plot line chart
plt.figure(figsize=(10,6))
plt.plot(yearly_sales.index, yearly_sales.values)

plt.title("Global Sales Trend Over Years")
plt.xlabel("Year")
plt.ylabel("Total Global Sales (Millions)")

plt.show()

##### 1. Why did you pick the specific chart?

1. A **line chart** was selected because it is the most appropriate visualization for showing **trends over time**.

2. The variable **Year** represents a time sequence, and **Global Sales** is a continuous numerical value, making a line chart ideal for time-series analysis.

3. It clearly illustrates **growth patterns, peak periods and decline phases** in the gaming industry.

4. Unlike bar charts, a line chart better highlights the **continuous movement and direction of sales performance** across years.

5. The chart supports strategic understanding of **market lifecycle trends**, helping businesses identify expansion periods and potential slowdowns.

##### 2. What is/are the insight(s) found from the chart?

1. The chart shows a **steady growth phase** in global video game sales during the early and mid-years, indicating industry expansion.

2. A clear **peak period** is visible, suggesting the time when the gaming market reached its highest commercial performance.

3. After the peak, sales show a **decline or stabilization phase**, indicating market saturation or platform transition cycles.

4. The fluctuations in sales reflect the **impact of new console releases, technological advancements and blockbuster game launches**.

5. Overall, the chart highlights that the gaming industry follows a **lifecycle pattern of growth, peak and adjustment**, which is crucial for long-term strategic planning.
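
The growth, peak and decline phases described above can be located numerically rather than read off the line. This is a small sketch on made-up yearly totals; with the real data, `yearly_sales` would be the series produced by the `groupby("Year")` step in the chart code.

```python
import pandas as pd

# Toy yearly totals (values are illustrative, not taken from the dataset)
yearly_sales = pd.Series(
    [100.0, 150.0, 300.0, 240.0, 120.0],
    index=[2005, 2006, 2007, 2008, 2009],
    name="Global_Sales",
)

# Year with the highest total and the value reached
peak_year = yearly_sales.idxmax()
peak_value = yearly_sales.max()

# Year-over-year percentage change; the first year has no predecessor (NaN)
yoy_change = (yearly_sales.pct_change() * 100).round(1)

# Years in which sales fell relative to the previous year
decline_years = yoy_change[yoy_change < 0].index.tolist()

print(f"Peak year: {peak_year} ({peak_value})")
print(yoy_change)
print("Decline years:", decline_years)
```

If the most recent years in the file contain only a handful of titles, part of the apparent decline may reflect incomplete data rather than a market contraction, so it is worth checking the number of titles per year before interpreting the tail of the curve.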

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Identifying the **growth and peak sales periods** helps companies time their game releases strategically for maximum revenue.

2. Understanding industry cycles enables better **investment planning and budget allocation**.

3. Recognizing sales slowdowns supports proactive **innovation and new platform adoption** before market decline.

4. Helps publishers align launches with **console generation cycles**, increasing commercial success probability.

5. Provides insights for long-term **sales forecasting and risk management**.

---

**⚠️ Negative Growth Risks :**

1. Ignoring signs of **market decline or saturation** may lead to overproduction and reduced profitability.

2. Heavy investment during a declining phase can result in **lower returns and financial strain**.

3. Failure to adapt to **platform transitions or technological shifts** may cause loss of competitive advantage.

4. Overreliance on past peak trends may lead to **misguided forecasting**, impacting strategic decisions.

---

* **Conclusion :**

  * Acting on these time-trend insights enables gaming companies to make **strategic release, investment and innovation decisions** aligned with market cycles.
  * Ignoring industry lifecycle patterns may lead to **revenue decline, poor forecasting and long-term competitive disadvantages**.

#### Chart - 5

In [None]:
# ==============================
# Chart 5 - Wishlist vs Global Sales
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

# Standardize column names
games.columns = games.columns.str.strip().str.replace(" ", "_")
vgsales.columns = vgsales.columns.str.strip().str.replace(" ", "_")

# Rename for merging if necessary
if "Title" in games.columns:
    games.rename(columns={"Title": "Name"}, inplace=True)

# Merge datasets
merged = pd.merge(games, vgsales, on="Name", how="inner")

# Plot scatter chart
plt.figure(figsize=(8,6))
plt.scatter(merged["Wishlist"], merged["Global_Sales"])

plt.title("Wishlist vs Global Sales")
plt.xlabel("Wishlist Count")
plt.ylabel("Global Sales (Millions)")

plt.show()

##### 1. Why did you pick the specific chart?

1. A **scatter plot** was chosen because it is best suited for analyzing the relationship between **two numerical variables**.

2. **Wishlist count** and **Global Sales** are both continuous values, making a scatter plot the most appropriate visualization method.

3. The goal was to determine whether **pre-release interest (wishlist activity)** translates into actual commercial success.

4. The chart clearly shows the **strength and direction of the relationship**, helping identify positive correlation patterns.

5. It also makes it easy to detect **outliers**, such as games with high wishlist counts but lower sales, or vice versa.

##### 2. What is/are the insight(s) found from the chart?

1. The chart shows a **positive relationship** between **Wishlist Count** and **Global Sales**, indicating that games with higher pre-release interest generally achieve stronger commercial performance.

2. Games with **very high wishlist numbers** tend to cluster in the higher global sales range, suggesting wishlist activity is a strong demand indicator.

3. The relationship is **not perfectly linear**, meaning wishlist alone does not fully determine sales outcomes.

4. A few **outliers** are visible — some games with high wishlist counts but moderate sales, possibly due to pricing, competition, or platform limitations.

5. Overall, the chart suggests that **wishlist engagement can be used as an early predictor of sales performance**, supporting pre-launch forecasting strategies.
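
The strength of the relationship can be measured instead of judged by eye. The sketch below assumes, hypothetically, that `Wishlist` may be stored as abbreviated text such as `3.9K`; the helper `parse_count` is an illustrative function and not part of pandas. The data is a toy stand-in for the merged frame.

```python
import pandas as pd

def parse_count(value):
    """Convert values such as '3.9K' or '1.2M' to floats; NaN if unparseable."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip().upper().replace(",", "")
    multiplier = 1.0
    if text.endswith("K"):
        multiplier, text = 1_000.0, text[:-1]
    elif text.endswith("M"):
        multiplier, text = 1_000_000.0, text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        return float("nan")

# Toy stand-in for the merged engagement and sales data
merged = pd.DataFrame({
    "Wishlist": ["3.9K", "800", "12K", "1.5K", None],
    "Global_Sales": [5.0, 0.4, 20.0, 2.0, 1.0],
})

merged["Wishlist_num"] = merged["Wishlist"].map(parse_count)
clean = merged.dropna(subset=["Wishlist_num", "Global_Sales"])

# Pearson measures linear association
pearson = clean["Wishlist_num"].corr(clean["Global_Sales"])
# Spearman (rank correlation), computed here as Pearson on the ranks
spearman = clean["Wishlist_num"].rank().corr(clean["Global_Sales"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Because sales are heavily skewed, the rank-based figure is usually the more robust of the two. If the sales file lists a title once per platform, an inner merge on `Name` repeats the same wishlist value on several rows, so aggregating sales per title before merging avoids counting one game multiple times.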

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Suggests that **wishlist count is a useful early indicator of demand**, which can support pre-release sales forecasting once the relationship is quantified.

2. Helps marketing teams adjust promotional strategies based on **pre-launch engagement levels**.

3. Supports smarter production planning by aligning supply with **anticipated demand**.

4. Enables targeted advertising campaigns for games showing **high wishlist momentum**.

5. Improves revenue optimization by identifying potential blockbuster titles before launch.

---

**⚠️ Negative Growth Risks :**

1. Over-reliance on **wishlist data alone** may lead to inaccurate forecasts if conversion rates are low.

2. High wishlist but low sales could indicate **pricing issues or weak marketing execution**, causing revenue gaps.

3. Ignoring games with moderate wishlist but strong niche appeal may result in **missed diversification opportunities**.

4. Failing to convert wishlist interest into purchases can create **false demand expectations**, impacting production and budgeting decisions.

---

* **Conclusion :**

  * Acting on these insights allows gaming companies to leverage **engagement-driven forecasting and targeted marketing**, improving commercial outcomes.
  * However, relying solely on wishlist metrics without conversion analysis may lead to **overestimation of demand and revenue shortfalls**.

#### Chart - 6

In [None]:
# ==============================
# Chart 6 - Correlation Heatmap
# ==============================

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

# Standardize column names
games.columns = games.columns.str.strip().str.replace(" ", "_")
vgsales.columns = vgsales.columns.str.strip().str.replace(" ", "_")

# Rename for merging if necessary
if "Title" in games.columns:
    games.rename(columns={"Title": "Name"}, inplace=True)

# Merge datasets
merged = pd.merge(games, vgsales, on="Name", how="inner")

# Select all numeric columns regardless of bit width (int32, float32, etc.)
numerical_data = merged.select_dtypes(include="number")

# Compute correlation matrix
correlation_matrix = numerical_data.corr()

# Plot heatmap
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f")

plt.title("Correlation Heatmap of Numerical Variables")
plt.show()

##### 1. Why did you pick the specific chart?

1. A **correlation heatmap** was chosen because it effectively displays the **strength and direction of relationships between multiple numerical variables** in one view.

2. Since the dataset contains several numerical features (Rating, Wishlist, Plays, Regional Sales, Global Sales), a heatmap helps analyze all pairwise correlations simultaneously.

3. It visually highlights **strong positive or negative relationships** using color intensity, making interpretation faster and clearer.

4. The chart helps identify which variables are most strongly related to **Global Sales**, supporting feature importance analysis.

5. It also helps detect **multicollinearity** among sales variables (e.g., NA, EU, JP sales), which is important for accurate modeling and business decisions.

##### 2. What is/are the insight(s) found from the chart?

1. The heatmap shows a **strong positive correlation between Regional Sales (NA, EU, JP) and Global Sales**. This is expected by construction, because Global Sales is the sum of the regional figures.

2. **Wishlist and Plays exhibit a positive correlation with Global Sales**, indicating that higher engagement levels are associated with stronger commercial success.

3. **User Rating shows a moderate positive correlation with Global Sales**, suggesting that quality perception influences sales but is not the sole determinant.

4. Regional sales variables are also **highly correlated with each other**, suggesting that titles which sell well in one region tend to sell well in the others.

5. No strong negative correlations are observed, implying that most numerical variables move in a similar direction with respect to overall performance.

6. Overall, the heatmap confirms that **engagement metrics and regional demand are key drivers of global sales performance**.
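
Since Global Sales is built from the regional columns, the more informative question is how the non-sales features relate to it. A minimal sketch on toy data follows; the column names are assumptions about the merged frame, and the values are invented for illustration.

```python
import pandas as pd

# Toy merged frame (column names assumed, values illustrative)
merged = pd.DataFrame({
    "Rating": [3.0, 4.0, 3.5, 4.5, 2.5],
    "Plays": [100, 400, 250, 500, 50],
    "NA_Sales": [1.0, 3.0, 2.0, 4.0, 0.5],
    "EU_Sales": [0.5, 2.0, 1.0, 2.5, 0.25],
    "JP_Sales": [0.25, 0.5, 0.5, 1.0, 0.25],
    "Other_Sales": [0.25, 0.5, 0.5, 0.5, 0.0],
    "Global_Sales": [2.0, 6.0, 4.0, 8.0, 1.0],
})

regional = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Check how closely Global_Sales matches the sum of the regional columns
max_gap = (merged["Global_Sales"] - merged[regional].sum(axis=1)).abs().max()

# Correlate only the non-sales features with Global_Sales
features = merged.select_dtypes(include="number").drop(
    columns=regional + ["Global_Sales"]
)
corr_with_global = features.corrwith(merged["Global_Sales"]).sort_values(
    ascending=False
)

print("Largest gap between Global_Sales and regional sum:", max_gap)
print(corr_with_global)
```

Dropping the regional columns before ranking correlations keeps the list focused on engagement and rating variables, which are the ones that could act as independent predictors in a model.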

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. The heatmap confirms that **regional sales strongly drive global revenue**, enabling companies to prioritize high-performing markets strategically.

2. The positive correlation between **wishlist, plays and global sales** supports using engagement metrics for early demand forecasting.

3. Moderate correlation between **user rating and sales** encourages focus on quality improvement to enhance long-term revenue.

4. Identifying key correlated variables helps improve **predictive modeling and sales forecasting accuracy**.

5. Understanding multicollinearity among regional sales helps in building **more reliable analytical models**, reducing business risk.

---

**⚠️ Negative Growth Risks :**

1. Over-reliance on highly correlated regional markets may create **geographic concentration risk**, reducing diversification.

2. Assuming correlation equals causation could lead to **misguided strategic decisions**, such as overinvesting in one engagement metric.

3. Ignoring moderate or weaker correlations may cause companies to **overlook hidden growth opportunities**.

4. High multicollinearity among regional sales variables may distort predictive models if not handled properly, leading to **inaccurate forecasts**.

---

* **Conclusion :**

  * Acting on these correlation insights allows gaming companies to strengthen **data-driven forecasting, regional targeting and engagement-based strategies**.
  * However, misinterpreting correlation or over-concentrating on specific markets may result in **forecasting errors and long-term growth limitations**.

#### Chart - 7

In [None]:
# ==============================
# Chart 7 - Top 10 Platforms by Global Sales
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Group by Platform and sum global sales
platform_sales = (
    vgsales.groupby("Platform")["Global_Sales"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

# Plot bar chart
plt.figure(figsize=(10,6))
platform_sales.plot(kind="bar")

plt.title("Top 10 Platforms by Global Sales")
plt.xlabel("Platform")
plt.ylabel("Total Global Sales (Millions)")
plt.xticks(rotation=45)

plt.show()

##### 1. Why did you pick the specific chart?

1. A **bar chart** was chosen because it is ideal for comparing **categorical variables** (platforms) against a **numerical metric** (Global Sales).

2. The objective was to clearly compare revenue performance across multiple platforms, making side-by-side comparison necessary.

3. A bar chart allows easy **ranking from highest to lowest**, helping identify dominant platforms quickly.

4. Unlike line charts (used for time trends), this analysis focuses on **performance comparison**, which makes a bar chart more appropriate.

5. The visualization provides clear business insight for **platform selection and investment decisions**.

##### 2. What is/are the insight(s) found from the chart?

1. The chart shows that a few **major platforms dominate global sales**, contributing a significantly larger share of total revenue compared to others.

2. Older, well-established platforms often appear among the top performers, indicating the impact of **strong user base and longer market presence**.

3. Sales distribution across platforms is uneven, suggesting that **platform choice plays a critical role in commercial success**.

4. Total sales alone do not show how many titles each platform carried; a platform with fewer releases but comparable totals would indicate **higher sales per title**, which requires a separate per-title comparison.

5. Overall, the chart highlights that selecting the **right platform is a key strategic decision** for maximizing global revenue.
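
Platform totals depend partly on how many titles were released on each system. The sketch below, on toy rows, places total sales next to title counts and sales per title so that the two effects can be separated. It is one possible way to do this, not a step taken from the analysis above.

```python
import pandas as pd

# Toy rows standing in for vgsales.csv (values are illustrative)
vgsales = pd.DataFrame({
    "Platform": ["PS2", "PS2", "PS2", "Wii", "Wii", "NES"],
    "Global_Sales": [2.0, 1.0, 3.0, 8.0, 2.0, 5.0],
})

# Named aggregation: total sales, number of titles, and average per title
platform_stats = (
    vgsales.groupby("Platform")["Global_Sales"]
    .agg(total_sales="sum", titles="count", sales_per_title="mean")
    .sort_values("total_sales", ascending=False)
)

print(platform_stats)
```

A platform that ranks mid-table on total sales but near the top on `sales_per_title` points to a small catalogue of strong sellers rather than a large installed library.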

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Identifying **top-performing platforms** helps companies prioritize development on high-revenue consoles and systems.

2. Enables better **platform selection strategy**, reducing the risk of launching games on low-demand systems.

3. Supports optimized **marketing and distribution planning** based on platform popularity.

4. Helps forecast revenue potential by analyzing historical platform performance.

5. Encourages strategic partnerships with dominant platforms to maximize visibility and sales.

---

**⚠️ Negative Growth Risks :**

1. Over-dependence on a few dominant platforms may create **platform concentration risk**, limiting diversification.

2. Ignoring emerging platforms may result in **missed future growth opportunities**.

3. Investing heavily in declining or aging platforms could lead to **reduced long-term profitability**.

4. Failing to adapt to shifts in platform trends may cause **loss of competitive advantage** in evolving markets.

---

* **Conclusion :**

  * Acting on these insights enables gaming companies to make **strategic platform-focused investment decisions**, improving profitability and market reach.
  * However, ignoring diversification and emerging platform trends may lead to **long-term growth limitations and competitive risks**.

#### Chart - 8

In [None]:
# ==============================
# Chart 8 - Top Genre-Platform Combinations
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Create Genre-Platform combination
vgsales["Genre_Platform"] = vgsales["Genre"] + " - " + vgsales["Platform"]

# Group and calculate total sales
combo_sales = (
    vgsales.groupby("Genre_Platform")["Global_Sales"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

# Plot horizontal bar chart
plt.figure(figsize=(10,6))
combo_sales.sort_values().plot(kind="barh")

plt.title("Top 10 Genre-Platform Combinations by Global Sales")
plt.xlabel("Total Global Sales (Millions)")
plt.ylabel("Genre - Platform")

plt.show()

##### 1. Why did you pick the specific chart?

1. A **horizontal bar chart** was chosen because it is ideal for comparing **multiple categorical combinations** (Genre–Platform) against a **numerical metric** (Global Sales).

2. Since the category names (e.g., *Action – PS2*) are longer, a horizontal bar chart improves **readability and clarity** compared to a vertical bar chart.

3. The objective was to rank the **top-performing combinations**, and bar charts make it easy to visualize performance from highest to lowest.

4. It allows quick identification of which **genre performs best on which platform**, supporting strategic alignment decisions.

5. The visualization clearly supports business storytelling by highlighting **where revenue concentration exists across category-platform pairings**.

##### 2. What is/are the insight(s) found from the chart?

1. The chart reveals that certain **Genre–Platform combinations dominate global sales**, indicating that performance is not driven by genre or platform alone, but by their strategic alignment.

2. Action and Sports genres frequently appear among the top combinations, especially on high-user-base platforms, showing strong commercial appeal.

3. Some platforms perform exceptionally well only with specific genres, highlighting the importance of **platform–audience compatibility**.

4. Revenue concentration is visible in a few key combinations, suggesting that successful launches depend on selecting the **right genre for the right platform**.

5. Overall, the chart confirms that optimizing both **genre selection and platform choice together** significantly increases the probability of commercial success.
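
A pivot table gives a complementary view of the same combinations, showing every genre against every platform instead of only the top ten pairs. This is a minimal sketch on toy rows with illustrative values.

```python
import pandas as pd

# Toy rows standing in for vgsales.csv (values are illustrative)
vgsales = pd.DataFrame({
    "Genre": ["Action", "Action", "Sports", "Sports", "Action"],
    "Platform": ["PS2", "X360", "Wii", "PS2", "PS2"],
    "Global_Sales": [3.0, 2.0, 5.0, 1.0, 1.0],
})

# Total sales for every Genre x Platform cell; missing pairs become 0
genre_platform = vgsales.pivot_table(
    index="Genre",
    columns="Platform",
    values="Global_Sales",
    aggfunc="sum",
    fill_value=0,
)

# Share of each genre's total sales contributed by each platform
genre_mix = genre_platform.div(genre_platform.sum(axis=1), axis=0).round(2)

print(genre_platform)
print(genre_mix)
```

The row-normalised `genre_mix` table helps distinguish a genre that is strong everywhere from one whose sales depend on a single platform.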

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Identifying top **Genre–Platform combinations** helps companies align game development with proven revenue-generating segments.

2. Enables smarter **platform targeting**, ensuring that specific genres are launched where demand is strongest.

3. Supports optimized **marketing investments** by focusing on high-performing combinations.

4. Reduces financial risk by avoiding mismatched genre-platform releases.

5. Strengthens strategic planning by using historical performance to guide future product launches.

---

**⚠️ Negative Growth Risks :**

1. Over-focusing on a few dominant combinations may lead to **market saturation and reduced innovation**.

2. Ignoring emerging or niche genre-platform opportunities may result in **missed diversification and future growth potential**.

3. Heavy dependence on historically successful combinations may cause resistance to adapting to **changing consumer preferences**.

4. If competitors crowd into the same high-performing segments, it may increase competition and **reduce profit margins**.

---

* **Conclusion :**

  * Acting on these insights enables gaming companies to make **strategic, revenue-driven development decisions**, improving profitability and market positioning.
  * However, failing to diversify and innovate beyond dominant combinations may lead to **long-term growth stagnation and competitive pressure**.

#### Chart - 9

In [None]:
# ==============================
# Chart 9 - Average Plays per Genre
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
games = pd.read_csv("/content/games.csv")

# Standardize column names
games.columns = games.columns.str.strip().str.replace(" ", "_")

# Convert 'Plays' column to numeric, coercing errors to NaN
games['Plays'] = pd.to_numeric(games['Plays'], errors='coerce')

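# Fill any NaN values in 'Plays' after conversion (e.g., with 0 or median).
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about and may not apply.
games['Plays'] = games['Plays'].fillna(0)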

# Group by Genre and calculate average plays
genre_plays = (
    games.groupby("Genres")["Plays"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

# Plot bar chart
plt.figure(figsize=(10,6))
genre_plays.plot(kind="bar")

plt.title("Top 10 Genres by Average Plays")
plt.xlabel("Genre")
plt.ylabel("Average Plays")
plt.xticks(rotation=45)

plt.show()

##### 1. Why did you pick the specific chart?

1. A **bar chart** was chosen because it is ideal for comparing a **categorical variable** (Genre) against a **numerical variable** (Average Plays).

2. The objective was to rank genres based on engagement levels and a bar chart clearly shows **which genres have higher or lower average plays**.

3. It allows easy **side-by-side comparison** of multiple genres in a simple and understandable format.

4. Unlike line charts (used for time trends) or scatter plots (used for numerical relationships), this analysis focuses on **category performance comparison**, making a bar chart more appropriate.

5. The chart effectively supports business storytelling by highlighting **which genres generate the highest player engagement**.

##### 2. What is/are the insight(s) found from the chart?

1. The chart shows that certain **genres have significantly higher average plays**, indicating stronger player engagement and replay value.

2. Which genres rank highest (for example Action, Sports or RPG) depends on the dataset results; the top-ranked genres are those that maintain **long-term user interest**.

3. Genres with high engagement but only moderate sales, which would need to be checked against the sales data, highlight opportunities for **improved monetization strategies**.

4. Lower-ranked genres may appeal to niche audiences, reflecting **specialized but smaller market demand**.

5. Overall, the chart confirms that **engagement intensity varies by genre** and selecting high-engagement genres can improve long-term player retention and brand loyalty.
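
If the `Genres` column stores several genres per game as a stringified list, grouping on it directly ranks genre combinations rather than individual genres. The sketch below assumes that storage format (it is an assumption, not confirmed by the chart code) and uses toy rows to show one way of splitting the lists before averaging.

```python
import ast
import pandas as pd

# Toy rows; the stringified-list format of "Genres" is an assumption
games = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Genres": ["['RPG', 'Adventure']", "['RPG']", "['Shooter']", "[]"],
    "Plays": [1000.0, 3000.0, 500.0, 200.0],
})

# Parse each string into a real list, then give every genre its own row.
# Rows with an empty list become NaN after explode and are dropped.
games["Genre_list"] = games["Genres"].apply(ast.literal_eval)
exploded = games.explode("Genre_list").dropna(subset=["Genre_list"])

# Mean, median and number of titles per individual genre
genre_engagement = (
    exploded.groupby("Genre_list")["Plays"]
    .agg(mean_plays="mean", median_plays="median", titles="count")
    .sort_values("mean_plays", ascending=False)
)

print(genre_engagement)
```

Showing the title count next to the mean guards against a genre ranking highly on the strength of one or two heavily played games, and the median is less sensitive to such outliers than the mean.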

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Identifying genres with **high average plays** helps companies focus on categories that drive strong player engagement and retention.

2. High-engagement genres offer opportunities for **long-term monetization** through DLCs, expansions, and in-game purchases.

3. Supports development of games with **strong replay value**, increasing lifetime customer value (LTV).

4. Enables better alignment between **content strategy and player preferences**, improving customer satisfaction.

5. Helps publishers prioritize marketing efforts toward genres with proven engagement performance.

---

**⚠️ Negative Growth Risks :**

1. Over-focusing on high-engagement genres may lead to **genre saturation**, reducing differentiation and long-term growth.

2. Ignoring low-engagement genres may result in **missing niche market opportunities** with loyal audiences.

3. High engagement does not always guarantee high sales; relying only on plays may cause **misguided revenue expectations**.

4. Excessive investment in similar genre types may limit innovation and **reduce portfolio diversification**, increasing business risk.

---

* **Conclusion :**

  * Acting on these insights allows gaming companies to strengthen **engagement-driven content strategy and long-term monetization planning**.
  * However, failing to diversify and innovate beyond dominant genres may lead to **market saturation, missed niche growth and reduced competitive advantage**.

#### Chart - 10

In [None]:
# ==============================
# Chart 10 - Top 10 Publishers by Global Sales
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Group by Publisher and sum global sales
publisher_sales = (
    vgsales.groupby("Publisher")["Global_Sales"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

# Plot bar chart
plt.figure(figsize=(10,6))
publisher_sales.plot(kind="bar")

plt.title("Top 10 Publishers by Global Sales")
plt.xlabel("Publisher")
plt.ylabel("Total Global Sales (Millions)")
plt.xticks(rotation=45)

plt.show()

##### 1. Why did you pick the specific chart?

1. A **bar chart** was chosen because it is ideal for comparing a **categorical variable** (Publisher) against a **numerical variable** (Global Sales).

2. The objective was to rank publishers based on total revenue and a bar chart clearly shows **which publishers generate the highest sales**.

3. It allows easy **side-by-side comparison** and clear visualization of market dominance.

4. Unlike line charts (used for time trends) or scatter plots (used for relationships between numerical variables), this analysis focuses on **performance comparison across categories**, making a bar chart more suitable.

5. The chart effectively highlights the **competitive landscape**, helping identify leading publishers in the gaming industry.

##### 2. What is/are the insight(s) found from the chart?

1. The chart indicates that **a small number of top publishers account for a large proportion of global sales**, showing strong market concentration.

2. Leading publishers consistently outperform competitors, suggesting the importance of **brand reputation, strong franchises and effective distribution networks**.

3. There is a noticeable revenue gap between top-tier and mid-tier publishers, highlighting a **competitive advantage held by established companies**.

4. Smaller publishers contribute comparatively less to total sales, indicating potential challenges in marketing reach and platform access.

5. Overall, the chart confirms that **publisher strength significantly influences commercial success**, making partnerships and brand positioning critical business factors.
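
Market concentration can be stated as a number by computing the share of total sales held by the top publishers. This is a small sketch on toy rows; `top_n` is an illustrative parameter and the values are made up.

```python
import pandas as pd

# Toy rows standing in for vgsales.csv (values are illustrative)
vgsales = pd.DataFrame({
    "Publisher": ["P1", "P1", "P2", "P3", "P4", "P5"],
    "Global_Sales": [30.0, 20.0, 25.0, 15.0, 6.0, 4.0],
})

publisher_totals = (
    vgsales.groupby("Publisher")["Global_Sales"]
    .sum()
    .sort_values(ascending=False)
)

# Share of all sales held by the top_n publishers
top_n = 2
top_share = publisher_totals.head(top_n).sum() / publisher_totals.sum() * 100

# Cumulative share as publishers are added from largest to smallest
cumulative_share = (
    publisher_totals.cumsum() / publisher_totals.sum() * 100
).round(1)

print(f"Top {top_n} publishers hold {top_share:.1f}% of sales")
print(cumulative_share)
```

On the real data, `top_n = 10` would give the share covered by the bar chart, which shows how much of the market the remaining publishers account for.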

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Identifying **top-performing publishers** helps companies understand market leaders and benchmark performance strategies.

2. Enables smarter **strategic partnerships or collaborations** with high-revenue publishers.

3. Highlights the importance of **brand strength and franchise development** in driving consistent global sales.

4. Supports competitive analysis, helping businesses position their products effectively in the market.

5. Encourages investment in strong intellectual properties (IPs) to build long-term revenue streams.

---

**⚠️ Negative Growth Risks :**

1. High market concentration among a few publishers may create **entry barriers for smaller companies**, limiting competition.

2. Over-reliance on major publishers or blockbuster franchises may reduce innovation and increase **portfolio risk**.

3. Ignoring emerging or smaller publishers may result in **missed partnership or acquisition opportunities**.

4. Heavy dependence on a few dominant revenue sources could lead to **financial instability if consumer preferences shift**.

---

* **Conclusion :**

  * Acting on these insights allows gaming companies to strengthen **competitive positioning, strategic alliances and brand-focused growth strategies**.
  * However, excessive concentration and lack of diversification may lead to **innovation slowdown and long-term revenue risk**.

#### Chart - 11

In [None]:
# ==============================
# Chart 11 - Distribution of Global Sales
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Plot histogram
plt.figure(figsize=(8,6))
plt.hist(vgsales["Global_Sales"], bins=30)

plt.title("Distribution of Global Sales")
plt.xlabel("Global Sales (Millions)")
plt.ylabel("Frequency")

plt.show()

##### 1. Why did you pick the specific chart?

1. A **histogram** was chosen because it is ideal for showing the **distribution of a single numerical variable**, in this case, Global Sales.

2. The objective was to understand how sales are spread across games — whether evenly distributed or concentrated among a few titles.

3. A histogram helps identify **skewness, spread and frequency patterns**, which cannot be clearly seen in bar or line charts.

4. It allows detection of **outliers and blockbuster effects**, where a small number of games generate extremely high revenue.

5. The chart supports understanding of the **overall market structure**, such as whether the industry follows a “long-tail” revenue pattern.

##### 2. What is/are the insight(s) found from the chart?

1. The histogram shows that **Global Sales are highly right-skewed**, meaning most games generate relatively low sales while a small number of games achieve very high revenue.

2. A large concentration of games falls within the **lower sales range**, indicating intense competition and limited commercial success for most titles.

3. Only a few games act as **blockbusters**, contributing disproportionately to total industry revenue.

4. The presence of extreme high-value outliers confirms a **long-tail market structure**, where a small percentage of titles dominate overall sales.

5. Overall, the chart highlights that the gaming industry is **hit-driven**, with revenue heavily dependent on a limited number of high-performing games.
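
The skew and the hit-driven pattern can be backed with a few summary figures. The sketch below uses ten toy values to compare mean and median, compute skewness, and measure how much of the total comes from the top 10% of titles. The 10% cut-off is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy sales values (illustrative): many small titles and one very large one
sales = pd.Series(
    [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.5, 0.8, 2.5, 15.0],
    name="Global_Sales",
)

mean_sales = sales.mean()
median_sales = sales.median()
skewness = sales.skew()  # positive values indicate a long right tail

# Share of total sales contributed by the top 10% of titles
n_top = max(1, int(len(sales) * 0.10))
top_share = sales.nlargest(n_top).sum() / sales.sum() * 100

print(f"Mean: {mean_sales:.2f}, Median: {median_sales:.2f}")
print(f"Skewness: {skewness:.2f}")
print(f"Top 10% of titles account for {top_share:.1f}% of sales")
```

A mean far above the median together with a large positive skewness is the numeric signature of the right-skewed shape seen in the histogram. Plotting the histogram on a logarithmic x-axis is another common way to make the lower sales range readable.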

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**✅ Positive Business Impact :**

1. Understanding that the market is **hit-driven** helps companies focus on developing high-quality, high-potential titles.

2. Recognizing the **long-tail distribution** supports diversified portfolio strategies instead of relying on one single game.

3. Helps publishers allocate budgets more strategically toward games with blockbuster potential.

4. Encourages better **risk management**, knowing that only a small percentage of games generate major revenue.

5. Supports data-driven forecasting by acknowledging realistic sales expectations for most titles.

---

**⚠️ Negative Growth Risks :**

1. Over-dependence on a few blockbuster games may create **revenue instability** if one major title underperforms.

2. Ignoring smaller titles in the long tail may lead to **missed cumulative revenue opportunities**.

3. Excessive investment chasing blockbuster success may increase **financial risk and development costs**.

4. Market saturation and intense competition may reduce the probability of consistent blockbuster outcomes.

---

* **Conclusion :**

  * Acting on these insights enables gaming companies to implement **balanced portfolio strategies, realistic forecasting and risk-aware investment planning**.
  * However, relying solely on blockbuster-driven growth may result in **financial volatility and long-term instability**.

#### Chart - 12

In [None]:
# ==============================
# Chart 12 - Distribution of User Ratings
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
# The 'Rating' column is in games.csv, not vgsales.csv
games_df_for_rating = pd.read_csv("/content/games.csv")

# Plot histogram
plt.figure(figsize=(8,6))
plt.hist(games_df_for_rating["Rating"].dropna(), bins=20)

plt.title("Distribution of User Ratings")
plt.xlabel("User Rating")
plt.ylabel("Frequency")

plt.show()

##### 1. Why did you pick the specific chart?

1. A **histogram** was chosen because it is ideal for analyzing the **distribution of a single numerical variable**, in this case, User Ratings.

2. The objective was to understand how ratings are spread across games—whether most titles receive high, moderate, or low ratings.

3. A histogram helps identify **concentration, skewness and frequency patterns**, which are not easily visible in other chart types.

4. It allows detection of **outliers**, such as extremely low- or high-rated games.

5. The chart supports evaluation of **overall market perception of game quality**, which is important for product quality and branding strategies.

##### 2. What is/are the insight(s) found from the chart?

1. The chart shows that **most games fall within the mid-to-high rating range**, indicating generally positive user perception across the market.

2. It reveals that **very low-rated games are relatively fewer**, suggesting that extreme negative reception is uncommon.

3. The distribution highlights whether ratings are **concentrated within a narrow band or widely spread**, reflecting overall quality consistency.

4. Any visible extreme values help identify **outlier games with exceptionally high or low ratings**.

5. Overall, the chart provides insight into the **market’s overall quality perception and customer satisfaction levels**.
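
To back the points above with figures, the share of games in each rating band can be tabulated directly. This is a hedged sketch: the ratings are synthetic, the 0 to 5 scale is an assumption about `games.csv`, and the band edges are hypothetical choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for games["Rating"] on an assumed 0-5 scale
rng = np.random.default_rng(0)
ratings = pd.Series(np.clip(rng.normal(loc=3.7, scale=0.6, size=1_500), 0, 5), name="Rating")
ratings.iloc[::50] = np.nan  # simulate missing ratings

# Drop missing values, as the histogram cell does
clean = ratings.dropna()

# Hypothetical band edges; adjust to the real rating scale
bands = pd.cut(clean, bins=[0, 2, 3, 4, 5], labels=["0-2", "2-3", "3-4", "4-5"], include_lowest=True)
band_share = bands.value_counts(normalize=True).sort_index()

print(clean.describe().round(2))
print(band_share.round(3))
```

The band shares make statements such as "most games fall within the mid-to-high range" checkable against a concrete percentage.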

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**✅ Positive Business Impact :**

1. Understanding that most games fall within the **mid-to-high rating range** helps reinforce the importance of maintaining strong product quality standards.

2. Identifying rating concentration supports **quality benchmarking**, enabling developers to aim above the industry average.

3. Detecting highly rated games helps companies analyze **best-performing features and design elements** for future releases.

4. Monitoring rating distribution helps improve **brand reputation and customer trust**.

5. Insights from ratings can guide **post-launch updates and improvements**, enhancing long-term sales potential.

---

**⚠️ Negative Growth Risks :**

1. If ratings begin clustering toward lower ranges, it may indicate **declining product quality**, impacting brand reputation.

2. Ignoring low-rated outliers may lead to **repeated product design mistakes**, reducing customer loyalty.

3. Overemphasis on ratings alone without considering sales and market demand may result in **misaligned business strategies**.

4. Consistent mediocre ratings may lead to **competitive disadvantage** against higher-rated competitors.

---

* **Conclusion :**

  * Acting on these insights allows gaming companies to focus on **quality improvement, customer satisfaction and brand strength**, driving sustainable growth.
  * Ignoring rating trends may lead to **declining user trust, weaker market positioning and reduced long-term revenue potential**.

#### Chart - 13

In [None]:
# ==============================
# Chart 13 - Global Sales by Genre
# ==============================

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
vgsales = pd.read_csv("/content/vgsales.csv")

# Select top 10 genres by total sales
top_genres = (
    vgsales.groupby("Genre")["Global_Sales"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .index
)

filtered_data = vgsales[vgsales["Genre"].isin(top_genres)]

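# Create box plot on an explicit Axes; without ax=, DataFrame.boxplot(by=...)
# opens its own figure, leaving an empty one and ignoring the figsize below
fig, ax = plt.subplots(figsize=(12,6))
filtered_data.boxplot(column="Global_Sales", by="Genre", ax=ax)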

plt.title("Global Sales Distribution by Top Genres")
plt.suptitle("")  # Remove default subtitle
plt.xlabel("Genre")
plt.ylabel("Global Sales (Millions)")
plt.xticks(rotation=45)

plt.show()

##### 1. Why did you pick the specific chart?

1. A **box plot** was chosen because it is ideal for comparing the **distribution of a numerical variable (Global Sales)** across multiple categories (Genres).

2. The objective was not only to compare average sales, but also to understand the **spread, median and variability** within each genre.

3. A box plot clearly highlights the **median performance**, which represents a typical title better than total sales because totals are inflated by a few blockbusters.

4. It helps detect **outliers (blockbuster games)** that generate exceptionally high revenue within specific genres.

5. The chart provides deeper insight into **consistency vs volatility of sales performance across genres**, supporting more informed strategic decisions.

##### 2. What is/are the insight(s) found from the chart?

1. The chart shows that certain genres have a **higher median Global Sales**, indicating consistently stronger commercial performance.

2. Some genres display a **wide sales spread**, suggesting high variability with both low-performing and blockbuster titles.

3. Several extreme **outliers are visible**, confirming that a few games within specific genres generate exceptionally high revenue.

4. Genres with a smaller interquartile range (IQR) indicate **more stable and predictable sales performance**.

5. Overall, the chart highlights that while some genres are consistently profitable, others are more **risk-driven with high volatility in revenue outcomes**.
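
The median, IQR and outlier observations above can be reproduced as a table. The sketch below uses synthetic sales for three illustrative genre labels (an assumption, not the project data) and applies the 1.5 × IQR upper fence, the same whisker convention matplotlib box plots use by default.

```python
import numpy as np
import pandas as pd

# Synthetic sales per genre; the spread (sigma) differs by genre on purpose
rng = np.random.default_rng(1)
sigmas = {"Action": 1.4, "Sports": 1.0, "Puzzle": 0.6}
frames = [
    pd.DataFrame({"Genre": g, "Global_Sales": rng.lognormal(mean=-1.0, sigma=s, size=400)})
    for g, s in sigmas.items()
]
df = pd.concat(frames, ignore_index=True)

# Median and quartiles per genre
grouped = df.groupby("Genre")["Global_Sales"]
summary = pd.DataFrame({
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
})
summary["iqr"] = summary["q3"] - summary["q1"]

# Flag titles above the upper fence (q3 + 1.5 * IQR) of their own genre
upper_fence = summary["q3"] + 1.5 * summary["iqr"]
df["is_outlier"] = df["Global_Sales"] > df["Genre"].map(upper_fence)
summary["outliers"] = df.groupby("Genre")["is_outlier"].sum()

print(summary.round(3))
```

A small `iqr` corresponds to the "stable and predictable" genres, while a large `iqr` and a high `outliers` count correspond to the volatile, blockbuster-driven ones.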

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

**✅ Positive Business Impact :**

1. Identifying genres with a **higher median Global Sales** helps companies focus on categories with consistent revenue performance.

2. Understanding sales variability enables better **risk assessment** before investing in a specific genre.

3. Detecting blockbuster outliers helps analyze **successful game characteristics**, improving future product strategy.

4. Stable genres with lower volatility support **predictable revenue planning and budgeting**.

5. Insights into genre performance help optimize **portfolio diversification and resource allocation**.

---

**⚠️ Negative Growth Risks :**

1. Over-investing in high-volatility genres may increase **financial risk** if blockbuster success is not replicated.

2. Ignoring stable but moderate-performing genres may result in **missed steady revenue opportunities**.

3. Heavy reliance on outlier-driven genres can create **revenue instability** if a major title underperforms.

4. Misinterpreting variability as guaranteed high returns may lead to **overestimated sales forecasts**.

---

* **Conclusion :**

  * Acting on these insights allows gaming companies to balance **risk and stability through strategic genre selection**, improving long-term profitability.
  * However, focusing only on high-reward genres without managing volatility may lead to **financial fluctuations and growth uncertainty**.

#### Chart - 14 - Correlation Heatmap

In [None]:
# ==============================
# Chart 14 - Correlation Heatmap
# ==============================

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

# Standardize column names
games.columns = games.columns.str.strip().str.replace(" ", "_")
vgsales.columns = vgsales.columns.str.strip().str.replace(" ", "_")

# Rename for merging if needed
if "Title" in games.columns:
    games.rename(columns={"Title": "Name"}, inplace=True)

# Function to convert string values with 'K' or 'M' to numeric
def convert_k_m_to_numeric(series):
    def converter(value):
        if isinstance(value, str):
            value = value.strip().upper()
            try:
                if value.endswith('K'):
                    return float(value[:-1]) * 1000
                if value.endswith('M'):
                    return float(value[:-1]) * 1000000
                return float(value)
            except ValueError:
                return float("nan")  # Treat other non-numeric strings as missing
        return value
    # Coerce to a numeric dtype so median() and corr() work as expected
    return pd.to_numeric(series.apply(converter), errors="coerce")

# Apply conversion to 'Plays' and 'Wishlist' columns
games['Plays'] = convert_k_m_to_numeric(games['Plays'])
games['Wishlist'] = convert_k_m_to_numeric(games['Wishlist'])

# Fill any NaN values (introduced by conversion errors or original NaNs) with the median
games['Plays'] = games['Plays'].fillna(games['Plays'].median())
games['Wishlist'] = games['Wishlist'].fillna(games['Wishlist'].median())

# Merge datasets
merged = pd.merge(games, vgsales, on="Name", how="inner")

# Select important numerical columns
selected_columns = [
    "Rating",
    "Plays",
    "Wishlist",
    "NA_Sales",
    "EU_Sales",
    "JP_Sales",
    "Global_Sales"
]

numerical_data = merged[selected_columns]

# Compute correlation matrix
correlation_matrix = numerical_data.corr()

# Plot heatmap
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f")

plt.title("Correlation Heatmap: Engagement & Sales Variables")
plt.show()
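
One caveat with the inner merge on `Name` used above: if a title appears more than once on either side (for example, one sales row per platform), the merge multiplies rows and can distort the correlations. The following is a small sketch with hypothetical titles and values showing the effect and one possible guard; aggregating sales per title is an assumption about what the analysis intends, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical engagement rows; "Game A" appears twice
games = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B", "Game C"],
    "Rating": [4.1, 4.1, 3.5, 3.9],
})
# Hypothetical sales rows; "Game A" was released on two platforms
vgsales = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B", "Game D"],
    "Platform": ["PS4", "XOne", "PC", "Wii"],
    "Global_Sales": [2.0, 1.5, 0.4, 0.9],
})

# Naive merge: "Game A" yields 2 x 2 = 4 rows
naive = pd.merge(games, vgsales, on="Name", how="inner")

# Guarded merge: one row per title on each side, enforced by validate=
games_unique = games.drop_duplicates(subset="Name")
sales_per_title = vgsales.groupby("Name", as_index=False)["Global_Sales"].sum()
checked = pd.merge(games_unique, sales_per_title, on="Name", how="inner", validate="one_to_one")

print(len(naive), len(checked))  # prints: 5 2
```

With `validate="one_to_one"`, pandas raises an error if duplicate keys remain, so silent row multiplication is caught before the correlation step.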

##### 1. Why did you pick the specific chart?

1. A **correlation heatmap** was chosen because it effectively displays the **strength and direction of relationships between multiple numerical variables** in a single visualization.

2. The objective was to understand how **engagement metrics (Rating, Plays, Wishlist)** relate to **sales performance (Regional and Global Sales)**.

3. The heatmap uses color intensity to clearly highlight **strong, moderate and weak correlations**, making patterns easy to interpret.

4. It allows quick identification of which variables have the strongest association with **Global Sales**, supporting data-driven decision-making. Correlation alone does not establish influence.

5. The chart also helps detect **multicollinearity among sales variables**, which is important for building reliable predictive models and avoiding analytical errors.

##### 2. What is/are the insight(s) found from the chart?

1. The heatmap shows a **strong positive correlation between regional sales (NA, EU, JP) and Global Sales**. This is expected, because global sales aggregate the regional figures, so the correlation reflects how the column is built rather than an independent driver.

2. **Wishlist and Plays have a positive correlation with Global Sales**, indicating that higher user engagement is associated with better commercial outcomes.

3. **User Rating shows a moderate positive correlation with Global Sales**, suggesting that perceived quality influences revenue but is not the only determining factor.

4. Regional sales variables are **highly correlated with each other**, reflecting similar demand patterns across major markets.

5. Overall, the chart confirms that **engagement metrics and regional demand are key contributors to commercial success**, supporting engagement-based forecasting strategies.
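
Because the sales columns are heavily right-skewed, Pearson correlations can be dominated by a few blockbusters. A rank-based (Spearman) correlation is a common robustness check. The sketch below uses synthetic data in which `Global_Sales` is built as the sum of the regional columns, mirroring the usual vgsales layout (an assumption about this dataset); all numbers are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic engagement and sales columns with multiplicative noise
rng = np.random.default_rng(7)
n = 2_000
wishlist = rng.lognormal(mean=6.0, sigma=1.0, size=n)
na_sales = 0.001 * wishlist * rng.lognormal(mean=0.0, sigma=0.8, size=n)
eu_sales = 0.6 * na_sales * rng.lognormal(mean=0.0, sigma=0.5, size=n)

df = pd.DataFrame({"Wishlist": wishlist, "NA_Sales": na_sales, "EU_Sales": eu_sales})
df["Global_Sales"] = df["NA_Sales"] + df["EU_Sales"]  # global = sum of regions

# Pearson on raw values; Spearman computed as Pearson on ranks
pearson = df.corr()["Global_Sales"]
spearman = df.rank().corr()["Global_Sales"]

comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman}).drop(index="Global_Sales")
print(comparison.round(2))
```

If the two columns disagree strongly for a variable, the Pearson value is probably driven by outliers, and the rank-based figure is the safer one to quote.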

#### Chart - 15 - Pair Plot

In [None]:
# ==============================
# Chart 15 - Pair Plot
# ==============================

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load datasets
games = pd.read_csv("/content/games.csv")
vgsales = pd.read_csv("/content/vgsales.csv")

# Standardize column names
games.columns = games.columns.str.strip().str.replace(" ", "_")
vgsales.columns = vgsales.columns.str.strip().str.replace(" ", "_")

# Rename for merging if necessary
if "Title" in games.columns:
    games.rename(columns={"Title": "Name"}, inplace=True)

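# Convert 'Plays' and 'Wishlist' from strings with a 'K' or 'M' suffix to numbers,
# reusing convert_k_m_to_numeric from the Chart 14 cell; pairplot only draws numeric columns
games['Plays'] = pd.to_numeric(convert_k_m_to_numeric(games['Plays']), errors="coerce")
games['Wishlist'] = pd.to_numeric(convert_k_m_to_numeric(games['Wishlist']), errors="coerce")

# Merge datasets
merged = pd.merge(games, vgsales, on="Name", how="inner")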

# Select key numerical variables
selected_columns = [
    "Rating",
    "Plays",
    "Wishlist",
    "NA_Sales",
    "EU_Sales",
    "Global_Sales"
]

pair_data = merged[selected_columns]

# Create pair plot
sns.pairplot(pair_data)
plt.show()

##### 1. Why did you pick the specific chart?

1. A **pair plot** was chosen because it allows visualization of **multiple pairwise relationships simultaneously** in a single comprehensive chart.

2. The objective was to explore how **engagement variables (Rating, Plays, Wishlist)** interact with **sales variables (Regional and Global Sales)**.

3. It combines **scatter plots and distribution plots**, helping analyze both relationships and individual variable distributions together.

4. The chart makes it easier to detect **patterns, clusters and outliers** across multiple variable combinations.

5. It provides a broader exploratory view before building predictive models, supporting deeper **multivariable relationship analysis**.

##### 2. What is/are the insight(s) found from the chart?

1. The pair plot shows **positive relationships between engagement metrics (Wishlist, Plays) and Global Sales**, indicating that higher engagement is associated with stronger commercial performance.

2. **User Rating displays a moderate positive relationship with sales**, suggesting quality perception influences revenue but is not the sole driver.

3. The regional sales variables included in the plot (NA, EU) show **strong linear relationships with Global Sales**, consistent with their direct contribution to total revenue.

4. The distribution plots reveal that **sales variables are highly right-skewed**, indicating a hit-driven market structure.

5. A few **outliers (blockbuster games)** are visible across multiple variable combinations, reinforcing that a small number of titles generate disproportionately high revenue.

6. Overall, the chart confirms that **engagement and regional demand together shape commercial success**, while no single variable alone guarantees high sales.
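
Since the distributions are highly right-skewed, most points in a raw pair plot crowd into one corner. A log transform is one common way to spread them out. The sketch below is a hedged illustration on synthetic columns (assumed log-normal, not the project data) showing how `np.log1p` reduces skewness.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed columns standing in for Plays and Global_Sales
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Plays": rng.lognormal(mean=7.0, sigma=1.2, size=1_000),
    "Global_Sales": rng.lognormal(mean=-1.0, sigma=1.2, size=1_000),
})

# log1p(x) = log(1 + x) is defined at zero, which suits sales of 0.0
logged = np.log1p(df)

# Compare skewness before and after the transform
skew_table = pd.DataFrame({"raw": df.skew(), "log1p": logged.skew()})
print(skew_table.round(2))
```

The transformed frame can be passed to `sns.pairplot` in place of the raw columns when the goal is to see the shape of the relationships rather than the raw magnitudes.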

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

To achieve the business objective of maximizing video game sales and optimizing strategic decisions, I recommend the following:

1. Focus on developing games in **high-performing genre–platform combinations** identified through data analysis.

2. Use **engagement metrics such as wishlist counts, ratings, and plays** as early indicators to forecast demand before launch.

3. Implement **region-specific marketing strategies**, prioritizing North America and Europe while gradually expanding into emerging markets.

4. Maintain a **balanced portfolio strategy**, combining blockbuster-potential titles with stable, consistent-performing genres to reduce risk.

5. Continuously monitor performance using **interactive dashboards (Power BI)** to support real-time, data-driven decision-making.
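
As an illustration of the first recommendation, genre and platform combinations can be ranked by median sales while ignoring combinations with too few titles to be reliable. This is a minimal sketch on a tiny hypothetical sample in the vgsales column layout; the `MIN_TITLES` threshold is an assumption to be tuned on the real data.

```python
import pandas as pd

# Tiny hypothetical sample; values are illustrative only
vgsales = pd.DataFrame({
    "Genre":    ["Action", "Action", "Action", "Sports", "Sports", "Puzzle", "Puzzle", "Puzzle"],
    "Platform": ["PS4", "PS4", "PC", "PS4", "PS4", "PC", "PC", "PS4"],
    "Global_Sales": [3.0, 1.0, 0.5, 2.0, 4.0, 0.2, 0.4, 0.1],
})

# Title count, median and total sales for each genre-platform pair
combo = (
    vgsales.groupby(["Genre", "Platform"])["Global_Sales"]
    .agg(titles="count", median_sales="median", total_sales="sum")
    .reset_index()
)

# Keep combinations with enough titles, then rank by median sales
MIN_TITLES = 2  # hypothetical threshold
ranked = (
    combo[combo["titles"] >= MIN_TITLES]
    .sort_values("median_sales", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```

Ranking by median rather than total keeps a single blockbuster from making a combination look consistently strong.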

# **Conclusion -**

This project analyzed and integrated video game sales and engagement data to uncover the key drivers of commercial success in the gaming industry. Through systematic data cleaning, SQL-based structuring, exploratory data analysis, and multiple visualization techniques, meaningful patterns and relationships were identified.

The analysis indicated that **engagement metrics such as wishlist counts, plays and user ratings are positively associated with global sales performance**. Regional sales patterns revealed that **North America and Europe dominate global revenue**, while platform and genre combinations significantly impact commercial outcomes. Additionally, the industry exhibits a **hit-driven revenue structure**, where a small number of blockbuster games contribute disproportionately to total sales.

By leveraging these insights, gaming companies can make data-driven decisions in product development, marketing strategy, platform selection, and regional expansion. The findings support improved sales forecasting, optimized resource allocation, risk management, and long-term profitability.

Overall, this project demonstrates how data analytics and visualization can transform raw gaming data into actionable business intelligence, enabling strategic growth and competitive advantage in the dynamic gaming market.
