# Picking the Dataset
- Dataset Name: Global Video Game Sales
- Dataset Link: https://www.kaggle.com/datasets/gregorut/videogamesales

# Reason for picking this dataset
The Global Video Game Sales dataset contains roughly 16,600 video game records. Each record includes the game title, platform, genre, publisher, release year, and regional sales figures (North America, Europe, Japan, and other regions), along with a global sales total. I selected this dataset because:
 - It offers a compelling real-world application: predicting commercial success in the gaming industry.
 - The structured, tabular format is ideal for performing regression analysis.
 - The mix of categorical and numerical features provides opportunities for feature engineering and deeper analysis.
 - It contains enough variety to explore temporal, categorical, and regional influences on sales.

# Potential Problem Areas
### Sales Prediction Problem
 - Objective: Predict global sales using game features.
 - Key Question: Can we accurately estimate a game's commercial success using variables like genre, platform, publisher, and release year?
### Feature Importance Problem
 - Objective: Determine which attributes have the strongest influence on global sales.
 - Key Question: Are certain genres or platforms significantly more predictive of success?
### Temporal Trends Problem
 - Objective: Explore how sales vary over time.
 - Key Question: How has the commercial viability of certain game types changed across decades?
### Region-Based Sales Patterns
 - Objective: Understand differences in regional sales performance.
 - Key Question: Do sales patterns vary significantly across North America, Europe, Japan, and other regions? Are certain genres or platforms more popular in specific markets?
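
As a rough illustration of the region-based question, the sketch below compares each genre's share of sales within each region. It runs on a small invented frame (`toy`) that only borrows the dataset's column names, so the numbers carry no meaning beyond showing the mechanics.

```python
import pandas as pd

# Toy rows that mimic the dataset's columns; the figures are invented for illustration.
toy = pd.DataFrame({
    "Genre": ["Sports", "Sports", "Role-Playing", "Role-Playing", "Shooter"],
    "NA_Sales": [4.0, 2.0, 1.0, 1.0, 2.0],
    "EU_Sales": [3.0, 1.0, 1.0, 0.0, 1.0],
    "JP_Sales": [0.5, 0.5, 3.0, 1.0, 0.0],
})

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales"]

# Total sales per genre in each region.
genre_totals = toy.groupby("Genre")[region_cols].sum()

# Each genre's share of its region's total, so regions of different size are comparable.
genre_share = genre_totals / genre_totals.sum()

print(genre_share.round(2))
```

Working with shares rather than raw totals keeps a large market from dominating the comparison. The same pattern should carry over to `Platform` by changing the grouping column.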

# Data Cleanup, Preprocessing, and Exploration
### Data Cleaning and Preprocessing Steps
 - Handling missing values
 - Feature Engineering
 - Encoding Categorical Features
 - Detecting and Handling outliers
 - Normalizing or scaling numerical features as needed
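
The steps above can be sketched on a toy frame as follows. This is one possible approach, not necessarily the one used later in the notebook: one-hot encoding via `pd.get_dummies`, the 1.5 × IQR rule for outliers, and a `log1p` transform for skewed sales are all assumptions chosen for illustration, and the data is invented.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "Platform": ["Wii", "PS2", "PS2", "DS", "Wii", "DS"],
    "Global_Sales": [0.1, 0.2, 0.3, 0.4, 0.5, 40.0],
})

# Encoding: turn the categorical column into one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Platform"])

# Outlier detection: flag values outside 1.5 * IQR of the quartiles.
q1, q3 = toy["Global_Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["Global_Sales"] < q1 - 1.5 * iqr) | (toy["Global_Sales"] > q3 + 1.5 * iqr)

# Scaling: log1p compresses a long right tail while keeping zero mapped to zero.
log_sales = np.log1p(toy["Global_Sales"])

print(encoded.columns.tolist())
print(is_outlier.tolist())
```

Flagging outliers rather than dropping them leaves the decision open: for sales data, the largest values are often the best-selling games, not errors.
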
### Exploratory Visualizations
 - Distribution of Global Sales
 - Line Chart
 - Boxplots of Sales
 - Sales trend over the years
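
A minimal sketch of the yearly trend plot, assuming total `Global_Sales` per release year is the quantity of interest. The rows are invented and the axis labels are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "Year": [2006, 2006, 2007, 2008, 2008, 2008],
    "Global_Sales": [1.0, 2.0, 0.5, 1.5, 1.0, 0.5],
})

# Aggregate sales by release year; the result is sorted by year.
yearly_sales = toy.groupby("Year")["Global_Sales"].sum()

# Draw the trend as a line chart.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(yearly_sales.index, yearly_sales.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Total Global_Sales")
ax.set_title("Total global sales per release year (toy data)")
fig.tight_layout()
```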

# Model Selection
### Global Sales Prediction (Regression Task)
 - Baseline Model: LinearRegression
 - Advanced Model: RandomForestRegressor
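
To show how the two models might be compared, here is a small sketch on synthetic data with a deliberately nonlinear target. The data, the split ratio, and the hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions for illustration, not settings taken from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a U-shaped relationship that a straight line cannot capture.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = (X[:, 0] - 5) ** 2 + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

On data like this the random forest should score well above the linear baseline, which is the kind of gap the baseline is there to expose. On the real dataset the difference may be smaller or larger.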


# Results and Discussion
 - Evaluation metrics for the regression models: Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²).
 - Based on these scores and the visualizations, we will determine the best-performing model.
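
For reference, the three metrics can be computed by hand and checked against scikit-learn. The `y_true` and `y_pred` arrays below are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true - y_pred  # [-0.5, 0.0, 0.5, -1.0]

# MSE: mean of squared errors -> (0.25 + 0 + 0.25 + 1) / 4 = 0.375
mse = np.mean(errors ** 2)

# MAE: mean of absolute errors -> (0.5 + 0 + 0.5 + 1) / 4 = 0.5
mae = np.mean(np.abs(errors))

# R^2: 1 - SS_res / SS_tot -> 1 - 1.5 / 5.0 = 0.7
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

MSE penalizes large errors more heavily than MAE, so comparing the two gives a hint of how much a few large misses contribute to the total error.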


### Importing Required Libraries

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
pd.set_option('display.max_columns', None)

### Loading the Dataset

In [10]:
df = pd.read_csv('vgsales.csv')

### Previewing the first few rows

In [13]:
df.head()

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |


### Dataset Dimensions

In [16]:
df.shape

(16598, 11)

### Listing Column names

In [19]:
df.columns

Index(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
      dtype='object')

### Dataset Info (Data Types and Nulls)

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


### Checking Missing Values

In [25]:
df.isnull().sum()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64
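
Raw counts are easier to judge as a share of all rows. A minimal sketch on an invented frame (`toy`), relying on the fact that the mean of a boolean mask is the fraction of `True` values:

```python
import numpy as np
import pandas as pd

# Invented rows with a few gaps; only the column names follow the dataset.
toy = pd.DataFrame({
    "Year": [2006.0, np.nan, 2008.0, 2009.0],
    "Publisher": ["Nintendo", "Atari", None, None],
    "Global_Sales": [5.0, 2.5, 1.2, 0.9],
})

# Fraction of missing values per column, expressed as a percentage.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

print(missing_pct)
```

Applied to `df`, the same expression would show what share of rows lack a `Year` or `Publisher`, which helps in deciding between imputing and dropping.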

### Overview of missing values
The null counts above show that **Year** (271 missing) and **Publisher** (58 missing) are the only columns with missing values. Let’s inspect these rows to decide how to handle them.

In [29]:
df[df['Year'].isnull()].head()

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 179 | 180 | Madden NFL 2004 | PS2 | NaN | Sports | Electronic Arts | 4.26 | 0.26 | 0.01 | 0.71 | 5.23 |
| 377 | 378 | FIFA Soccer 2004 | PS2 | NaN | Sports | Electronic Arts | 0.59 | 2.36 | 0.04 | 0.51 | 3.49 |
| 431 | 432 | LEGO Batman: The Videogame | Wii | NaN | Action | Warner Bros. Interactive Entertainment | 1.86 | 1.02 | 0.0 | 0.29 | 3.17 |
| 470 | 471 | wwe Smackdown vs. Raw 2006 | PS2 | NaN | Fighting | NaN | 1.57 | 1.02 | 0.0 | 0.41 | 3.0 |
| 607 | 608 | Space Invaders | 2600 | NaN | Shooter | Atari | 2.36 | 0.14 | 0.0 | 0.03 | 2.53 |


Let’s take a look at some records where the **Publisher** value is missing to assess how to handle them.

In [32]:
df[df['Publisher'].isnull()].head()

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 470 | 471 | wwe Smackdown vs. Raw 2006 | PS2 | NaN | Fighting | NaN | 1.57 | 1.02 | 0.0 | 0.41 | 3.0 |
| 1303 | 1305 | Triple Play 99 | PS | NaN | Sports | NaN | 0.81 | 0.55 | 0.0 | 0.1 | 1.46 |
| 1662 | 1664 | Shrek / Shrek 2 2-in-1 Gameboy Advance Video | GBA | 2007.0 | Misc | NaN | 0.87 | 0.32 | 0.0 | 0.02 | 1.21 |
| 2222 | 2224 | Bentley's Hackpack | GBA | 2005.0 | Misc | NaN | 0.67 | 0.25 | 0.0 | 0.02 | 0.93 |
| 3159 | 3161 | Nicktoons Collection: Game Boy Advance Video V... | GBA | 2004.0 | Misc | NaN | 0.46 | 0.17 | 0.0 | 0.01 | 0.64 |


### Imputing missing values

In [35]:
# Impute missing values in 'Year' with the median year
df['Year'] = df['Year'].fillna(df['Year'].median())

In [37]:
# Impute missing values in 'Publisher' with the most frequent publisher
df['Publisher'] = df['Publisher'].fillna(df['Publisher'].mode()[0])
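
A single global median can place a game far from its platform's era. One possible refinement, shown here only as an alternative sketch and not as the approach this notebook takes, is to impute `Year` with the median of the same `Platform` and fall back to the global median. The toy rows and the `Year_imputed` column name are illustrative.

```python
import numpy as np
import pandas as pd

# Invented rows: two platforms from different eras, each with one missing year.
toy = pd.DataFrame({
    "Platform": ["NES", "NES", "NES", "Wii", "Wii", "Wii"],
    "Year": [1985.0, 1987.0, np.nan, 2006.0, np.nan, 2008.0],
})

# Global median across all known years (1996.5 here, which fits neither platform).
global_median = toy["Year"].median()

# Median year within each platform, aligned row by row with the original frame.
platform_median = toy.groupby("Platform")["Year"].transform("median")

# Prefer the platform median; fall back to the global median when a platform
# has no known years at all.
toy["Year_imputed"] = toy["Year"].fillna(platform_median).fillna(global_median)

print(toy)
```

In this toy case the NES gap is filled with 1986 and the Wii gap with 2007, whereas the global median would have assigned 1996.5 to both.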

### Final Null Check
Double-check that no missing values remain in the dataset.

In [41]:
df.isnull().sum()

Rank            0
Name            0
Platform        0
Year            0
Genre           0
Publisher       0
NA_Sales        0
EU_Sales        0
JP_Sales        0
Other_Sales     0
Global_Sales    0
dtype: int64

As shown above, there are no missing values remaining in the dataset.

### Converting **Year** to Integer

In [45]:
df['Year'] = df['Year'].astype(int)
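
The cast to `int` works here only because the missing years were imputed first. The sketch below, on a toy series, shows what happens otherwise and how pandas' nullable `Int64` dtype could be used if the gaps were kept instead of filled.

```python
import numpy as np
import pandas as pd

# Toy series with one missing year; values are illustrative only.
years = pd.Series([2006.0, np.nan, 2008.0], name="Year")

# A plain integer cast cannot represent NaN, so pandas raises an error.
try:
    years.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable "Int64" dtype keeps the gap as <NA> instead of failing.
nullable_years = years.astype("Int64")

print(cast_failed)
print(nullable_years)
```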

### Checking for Games with Zero Global Sales

In [48]:
df[df['Global_Sales'] == 0].shape

(0, 11)

No records have **Global_Sales** equal to zero, so every game in the dataset has recorded at least some sales.

### Inspecting Year Value Range

We review the minimum and maximum values in the **Year** column to ensure they make sense historically. Any future dates or extremely old years may be worth investigating.

In [55]:
df['Year'].min(), df['Year'].max()

(1980, 2020)

The **Year** column ranges from 1980 to 2020, a plausible span for the video game industry. Because missing years were filled with the median, the imputation does not affect these bounds.
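
The minimum and maximum alone do not show how many records sit at the extremes. A quick way to look closer, sketched here on an invented series, is to count records per year and list the years that appear only once. The single-record threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Invented release years; the real column would be df['Year'].
years = pd.Series([1980, 1985, 1985, 2006, 2006, 2006, 2008, 2020], name="Year")

# Number of records per year, in chronological order.
counts = years.value_counts().sort_index()

# Years represented by a single record may deserve a manual check.
sparse_years = counts[counts == 1].index.tolist()

print(counts)
print(sparse_years)
```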