# **Genre Dominance Analyzer**

## Objectives

- Clean and prepare the dataset for analysis.
- Analyze genre dominance by ratings and playtime.
- Visualize trends for the dashboard (line plot, bar plot, scatter plot, heatmap).

## Inputs
- **Dataset File**: The raw CSV file from Kaggle containing video game data (e.g., `Title`, `Release Date`, `Genres`, `Rating`, `Plays`).
- **Python Libraries**: `pandas`, `numpy`, `matplotlib`, `seaborn`, `plotly` (installed via `requirements.txt`).
- **Environment**: A Python virtual environment set up in VS Code with the Jupyter extension.


## Outputs

- **Cleaned Dataset**: Processed dataset with standardized genres and filled missing values.
- **Statistics File**: CSV with mean, median, and std of ratings and playtime per genre.
- **Visualization Files**:
  - Interactive line plot of genre rating trends.
  - Interactive bar plot of top genres by rating count.
  - Interactive scatter plot of emerging genres.
  - Interactive heatmap of genre-platform correlations.
- **Notebook**: This file (`.ipynb`) with EDA code, static plots, and documentation.
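
As a rough illustration of what the statistics file could contain, the sketch below computes mean, median, and std per genre on a small made-up frame. The column names `Genre`, `Rating`, and `Plays`, the toy values, and the commented-out output filename are assumptions for illustration, not the notebook's actual data or paths.

```python
import pandas as pd

# Toy data standing in for the cleaned dataset; values are made up for illustration.
toy = pd.DataFrame({
    "Genre": ["RPG", "RPG", "Indie", "Indie", "Indie"],
    "Rating": [4.5, 4.1, 3.9, 4.3, 4.1],
    "Plays": [17000, 9000, 21000, 5000, 10000],
})

# One row per genre, with mean/median/std for each metric.
genre_stats = toy.groupby("Genre")[["Rating", "Plays"]].agg(["mean", "median", "std"])

# Flatten the two-level column index into names such as "Rating_mean".
genre_stats.columns = ["_".join(col) for col in genre_stats.columns]
genre_stats = genre_stats.reset_index()

# genre_stats.to_csv("genre_stats.csv", index=False)  # hypothetical output filename
print(genre_stats)
```

Flattening the column index keeps the resulting CSV readable with a single header row.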

## Additional Comments

- **Dataset Assumptions**: Assumed median imputation for missing ratings/playtime is reasonable due to skewed distributions; alternatives like mean could bias results.
- **Genre Standardization**: Combined similar genres (e.g., 'Action-Adventure' to 'Action Adventure') for consistency, though this may oversimplify nuanced categories.
- **AI Assistance**: Used AI to refine code snippets (e.g., Plotly styling) and brainstorm visualization ideas, credited in README.
- **Limitations**: Dataset lacks real-time player feedback; future iterations could scrape reviews for sentiment analysis if scope expands.
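
To make the imputation assumption concrete, here is a minimal sketch on made-up, right-skewed values. It only illustrates why the median is less sensitive to a single large outlier than the mean; the numbers are not taken from the dataset.

```python
import pandas as pd

# Made-up, right-skewed play counts with one missing value.
plays = pd.Series([500, 800, 1000, 1200, 30000, None], name="Plays")

# Fill the gap two ways to compare the imputed value.
mean_fill = plays.fillna(plays.mean())
median_fill = plays.fillna(plays.median())

print("mean used for imputation:  ", plays.mean())
print("median used for imputation:", plays.median())
```

With one extreme value in the sample, the mean-imputed entry lands far above the typical game, while the median-imputed entry stays near the middle of the distribution.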


---

# Working directory


* We access the current directory with `os.getcwd()`

In [36]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\babat\\Downloads'

We want to make the parent of the current directory the new current directory:
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [24]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [25]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\babat\\Downloads'
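
Once the working directory is set, file paths can be built relative to it rather than hard-coded per machine. The following is a minimal sketch using `pathlib`, assuming the working directory is the project root and the data sits in an `input` subfolder. The names `project_root`, `input_dir`, `raw_csv`, and `cleaned_csv` are hypothetical.

```python
from pathlib import Path

# Assumption: the working directory is the project root and the data
# lives in an "input" subfolder.
project_root = Path.cwd()
input_dir = project_root / "input"

# Paths are composed with "/" and stay valid on any operating system.
raw_csv = input_dir / "games.csv"
cleaned_csv = input_dir / "games_cleaned.csv"

print(raw_csv)
# .parent plays the same role as os.path.dirname()
print("parent of working directory:", project_root.parent)
```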

# Section 1

Import Libraries and Load Data

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Set plotting style
sns.set(style="whitegrid")
%matplotlib inline

In [27]:
# Load dataset
df = pd.read_csv(r"C:\Users\babat\Downloads\vs code\Genre-Dominance-Analyzer-2\input\games.csv")

# Display basic info
print("Dataset Info:")
print(df.info())
print("\nFirst 5 Rows:")
display(df.head())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1512 entries, 0 to 1511
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         1512 non-null   int64  
 1   Title              1512 non-null   object 
 2   Release Date       1512 non-null   object 
 3   Team               1511 non-null   object 
 4   Rating             1499 non-null   float64
 5   Times Listed       1512 non-null   object 
 6   Number of Reviews  1512 non-null   object 
 7   Genres             1512 non-null   object 
 8   Summary            1511 non-null   object 
 9   Reviews            1512 non-null   object 
 10  Plays              1512 non-null   object 
 11  Playing            1512 non-null   object 
 12  Backlogs           1512 non-null   object 
 13  Wishlist           1512 non-null   object 
dtypes: float64(1), int64(1), object(12)
memory usage: 165.5+ KB
None

First 5 Rows:


| | Unnamed: 0 | Title | Release Date | Team | Rating | Times Listed | Number of Reviews | Genres | Summary | Reviews | Plays | Playing | Backlogs | Wishlist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Elden Ring | Feb 25, 2022 | ['Bandai Namco Entertainment', 'FromSoftware'] | 4.5 | 3.9K | 3.9K | ['Adventure', 'RPG'] | Elden Ring is a fantasy, action and open world... | ["The first playthrough of elden ring is one o... | 17K | 3.8K | 4.6K | 4.8K |
| 1 | 1 | Hades | Dec 10, 2019 | ['Supergiant Games'] | 4.3 | 2.9K | 2.9K | ['Adventure', 'Brawler', 'Indie', 'RPG'] | A rogue-lite hack and slash dungeon crawler in... | ['convinced this is a roguelike for people who... | 21K | 3.2K | 6.3K | 3.6K |
| 2 | 2 | The Legend of Zelda: Breath of the Wild | Mar 03, 2017 | ['Nintendo', 'Nintendo EPD Production Group No... | 4.4 | 4.3K | 4.3K | ['Adventure', 'RPG'] | The Legend of Zelda: Breath of the Wild is the... | ['This game is the game (that is not CS:GO) th... | 30K | 2.5K | 5K | 2.6K |
| 3 | 3 | Undertale | Sep 15, 2015 | ['tobyfox', '8-4'] | 4.2 | 3.5K | 3.5K | ['Adventure', 'Indie', 'RPG', 'Turn Based Stra... | A small child falls into the Underground, wher... | ['soundtrack is tied for #1 with nier automata... | 28K | 679 | 4.9K | 1.8K |
| 4 | 4 | Hollow Knight | Feb 24, 2017 | ['Team Cherry'] | 4.4 | 3K | 3K | ['Adventure', 'Indie', 'Platform'] | A 2D metroidvania with an emphasis on close co... | ["this games worldbuilding is incredible, with... | 21K | 2.4K | 8.3K | 2.3K |



Missing Values:
Unnamed: 0            0
Title                 0
Release Date          0
Team                  1
Rating               13
Times Listed          0
Number of Reviews     0
Genres                0
Summary               1
Reviews               0
Plays                 0
Playing               0
Backlogs              0
Wishlist              0
dtype: int64
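
The count columns in the output above (`Plays`, `Wishlist`, and so on) load as `object` because values such as `3.9K` carry a suffix. Below is a minimal sketch of one way to convert them. `parse_count` is a hypothetical helper rather than the function used in this notebook, and the sample values are illustrative. It rounds the product before casting so that floating-point noise cannot block the conversion to a nullable integer.

```python
import pandas as pd

def parse_count(value):
    """Turn strings like '3.9K' or '679' into a number; return NaN if unparseable."""
    if isinstance(value, str):
        text = value.strip().lower()
        if text.endswith("k"):
            # round() guards against floating-point noise in the product
            return round(float(text[:-1]) * 1000)
        value = text
    return pd.to_numeric(value, errors="coerce")

# Illustrative inputs: suffixed, plain, and unparseable values.
sample = pd.Series(["17K", "3.8K", "679", "n/a"])

# Int64 (capital I) is pandas' nullable integer dtype, so missing values survive.
counts = sample.apply(parse_count).astype("Int64")
print(counts.tolist())
```

Checking for a trailing `k` rather than any `k` in the string avoids misreading values that merely contain the letter.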


---

# Section 2

Clean the Data

In [37]:
import pandas as pd

# Load dataset
df = pd.read_csv(r"C:\Users\babat\Downloads\vs code\Genre-Dominance-Analyzer-2\input\games.csv")

# Assess missingness before cleaning
print("Rows before cleaning:", len(df))

# Convert count strings with a 'K' suffix (e.g. '3.9K') to numeric values
def convert_k_to_numeric(value):
    if isinstance(value, str) and 'k' in value.lower():
        value = value.lower().replace('k', '').strip()
        return float(value) * 1000
    return pd.to_numeric(value, errors='coerce')

# Apply the conversion to every count column kept for analysis
count_columns = ['Number of Reviews', 'Plays', 'Playing', 'Backlogs', 'Wishlist']
for col in count_columns:
    df[col] = df[col].apply(convert_k_to_numeric)

# Drop redundant/unused columns
columns_to_drop = ['Times Listed', 'Summary', 'Reviews']
if 'Unnamed: 0' in df.columns:
    columns_to_drop.append('Unnamed: 0')  # Drop first column if it's an index
df = df.drop(columns=columns_to_drop, errors='ignore')

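# Handle missing values (assign the result back; chained inplace calls on a
# column trigger a FutureWarning in recent pandas versions)
df['Team'] = df['Team'].fillna('Unknown')  # Impute missing 'Team'
df['Rating'] = df['Rating'].fillna(df['Rating'].median())  # Impute with median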


# Standardize genre names (note: .title() also lowercases acronyms, so 'RPG' becomes 'Rpg')
df['Genres'] = df['Genres'].str.replace('-', ' ').str.title()

# Convert Release Date to datetime and extract year
df['Release Date'] = pd.to_datetime(df['Release Date'], errors='coerce')
df['Year'] = df['Release Date'].dt.year

# Drop rows where 'Release Date' is NaT
df = df.dropna(subset=['Release Date'])

# Convert specified columns from float to int
columns_to_convert = ['Number of Reviews', 'Plays', 'Playing', 'Backlogs', 'Wishlist', 'Year']
df[columns_to_convert] = df[columns_to_convert].astype('Int64')

# Remove duplicate titles, keeping the first occurrence
df = df.drop_duplicates(subset=['Title'])

# Display cleaned data info
print("\nCleaned Dataset Info:")
print(df.info())
print("\nFirst 5 Rows of Cleaned Data:")
display(df.head())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Save cleaned data to a new CSV file
df.to_csv(r"C:\Users\babat\Downloads\vs code\Genre-Dominance-Analyzer-2\input\games_cleaned.csv", index=False)

Rows before cleaning: 1512

Cleaned Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 1096 entries, 0 to 1511
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Title              1096 non-null   object        
 1   Release Date       1096 non-null   datetime64[ns]
 2   Team               1096 non-null   object        
 3   Rating             1096 non-null   float64       
 4   Number of Reviews  1096 non-null   Int64         
 5   Genres             1096 non-null   object        
 6   Plays              1096 non-null   Int64         
 7   Playing            1096 non-null   Int64         
 8   Backlogs           1096 non-null   Int64         
 9   Wishlist           1096 non-null   Int64         
 10  Year               1096 non-null   Int64         
dtypes: Int64(6), datetime64[ns](1), float64(1), object(3)
memory usage: 109.2+ KB
None

First 5 Rows of Cleaned Data:


| | Title | Release Date | Team | Rating | Number of Reviews | Genres | Plays | Playing | Backlogs | Wishlist | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Elden Ring | 2022-02-25 | ['Bandai Namco Entertainment', 'FromSoftware'] | 4.5 | 3900 | ['Adventure', 'Rpg'] | 17000 | 3800 | 4600 | 4800 | 2022 |
| 1 | Hades | 2019-12-10 | ['Supergiant Games'] | 4.3 | 2900 | ['Adventure', 'Brawler', 'Indie', 'Rpg'] | 21000 | 3200 | 6300 | 3600 | 2019 |
| 2 | The Legend of Zelda: Breath of the Wild | 2017-03-03 | ['Nintendo', 'Nintendo EPD Production Group No... | 4.4 | 4300 | ['Adventure', 'Rpg'] | 30000 | 2500 | 5000 | 2600 | 2017 |
| 3 | Undertale | 2015-09-15 | ['tobyfox', '8-4'] | 4.2 | 3500 | ['Adventure', 'Indie', 'Rpg', 'Turn Based Stra... | 28000 | 679 | 4900 | 1800 | 2015 |
| 4 | Hollow Knight | 2017-02-24 | ['Team Cherry'] | 4.4 | 3000 | ['Adventure', 'Indie', 'Platform'] | 21000 | 2400 | 8300 | 2300 | 2017 |



Missing Values:
Title                0
Release Date         0
Team                 0
Rating               0
Number of Reviews    0
Genres               0
Plays                0
Playing              0
Backlogs             0
Wishlist             0
Year                 0
dtype: int64
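
Each `Genres` value is still a single list-like string, so per-genre analysis needs one row per game and genre. The sketch below shows one possible way to get there with `ast.literal_eval` and `DataFrame.explode`. The toy rows and the `Genre List` and `Genre` column names are assumptions for illustration.

```python
import ast
import pandas as pd

# Toy rows mimicking the cleaned `Genres` strings.
toy = pd.DataFrame({
    "Title": ["Elden Ring", "Hollow Knight"],
    "Genres": ["['Adventure', 'Rpg']", "['Adventure', 'Indie', 'Platform']"],
})

# literal_eval safely turns the list-like string into a real Python list.
toy["Genre List"] = toy["Genres"].apply(ast.literal_eval)

# explode() yields one row per (game, genre) pair.
per_genre = toy.explode("Genre List").rename(columns={"Genre List": "Genre"})

print(per_genre[["Title", "Genre"]])
print(per_genre["Genre"].value_counts())
```

A frame in this long format can then be grouped by `Genre` for counts or rating summaries.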


---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down. Do not create a workflow in which you need to go back to an earlier cell to perform a task, such as refreshing a variable's contents.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
  print(e)
