## Scenario:

You have been retained by a retail company to analyse a dataset based on video games. This analysis will help determine the sales strategy for the company in their upcoming Winter season.    
Each answer MUST have a separate and different visualization that can be easily understood and visually represents the answer. All data wrangling, analysis, and visualizations must be generated using Python (no Excel or similar technologies).  
The company's CTO also requires you to rationalize all the decisions that you have made in a poster that displays your visualizations.  
This rationalization MUST include your visualization design decisions, feature selection, and any other information that you deem relevant.  

You are required to use the dataset contained within the file “vgsales.csv” and then answer the following questions using a different visualization type (e.g. bar chart, scatter graph, etc.) for each question:

*   What are the top 5 games by global sales?
*   What is the distribution of the most popular 4 game genres?
*   Do older games (2005 and earlier) have a higher MEAN “eu_sales” than newer games (after 2005)?
*   What are the 3 most common “developer” in the dataset?

Your project must incorporate the following elements:
*   A Jupyter Notebook detailing your
    *   EDA process,
    *   Data Cleaning,
    *   Feature Selection
    *   Data Visualizations


# Assignment

## Exploratory Data Analysis

### Part One: Overview & Structure

#### Import Libraries & DataSet

In [20]:
# Core libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
import warnings
warnings.filterwarnings('ignore')

# Show plots inside notebook
%matplotlib inline

# Style
sns.set(style="whitegrid")

# Calculate means for visualisation
from numpy import mean

I import the essential Python libraries for data handling and visualization.

* `pandas` and `numpy` will help me explore the data structurally.

* `matplotlib` and `seaborn` will allow me to create visual summaries.

* `%matplotlib inline` renders plots inside the notebook.

This setup is standard for performing EDA tasks in line with Ahmed (2025a).

In [21]:
# Load the dataset
vgsales = pd.read_csv('vgsales.csv')

#### View sample records

*   Preview first 5 rows of the dataset
*   Display column types, non-null counts, and memory usage
*   Get summary statistics for numerical columns

In [22]:
vgsales.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


In [23]:
vgsales.tail()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.0,0.0,0.01,0.0,0.01,,,,,,
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.0,0.01,0.0,0.0,0.01,,,,,,
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.0,0.0,0.01,0.0,0.01,,,,,,
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,0.0,0.0,0.0,0.01,,,,,,
16718,Winning Post 8 2016,PSV,2016.0,Simulation,Tecmo Koei,0.0,0.0,0.01,0.0,0.01,,,,,,


**Why?**

Viewing the top and bottom of the dataset gives me a sense of how the data looks. It helps me spot inconsistencies, entry errors, or formatting issues early on.
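The first rows shown above have the highest `Global_Sales` and the last rows the lowest, so `head()` and `tail()` mostly show the extremes. As a complement, a random preview can be drawn with `DataFrame.sample`. This is a minimal sketch on a toy DataFrame with hypothetical rows, not the real file; in the notebook the same call would be made on `vgsales`.

```python
import pandas as pd

# Toy stand-in for vgsales (hypothetical rows); in the notebook, call .sample() on vgsales.
sample = pd.DataFrame({
    "Name": [f"Game {i}" for i in range(10)],
    "Global_Sales": [5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.4, 0.3, 0.2, 0.1],
})

# random_state fixes the seed so the same rows are drawn on every run.
preview = sample.sample(n=3, random_state=42)
print(preview)
```

Fixing `random_state` keeps the preview reproducible when the notebook is re-run.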

#### Inspect Data Types

In [24]:
vgsales.dtypes

Unnamed: 0,0
Name,object
Platform,object
Year_of_Release,float64
Genre,object
Publisher,object
NA_Sales,float64
EU_Sales,float64
JP_Sales,float64
Other_Sales,float64
Global_Sales,float64


**Why?**

I use `.dtypes` to check the data type of each column. This is important because data can be misread on import: for example, numbers are sometimes read as text, and dates might not be recognised correctly. Checking the types helps catch these issues before cleaning the data (Ahmed, 2024a).
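In this dataset, `User_Score` is read as `object` because it contains the text value `tbd`, and `Year_of_Release` is read as `float64`. One possible way to handle both during cleaning is sketched below. The toy DataFrame and its values are hypothetical stand-ins, and the nullable `Int64` dtype is only one option for the year column.

```python
import pandas as pd

# Toy stand-in for vgsales (hypothetical values); in the notebook, use the real DataFrame.
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year_of_Release": [2006.0, None, 2010.0],
    "User_Score": ["8.0", "tbd", None],
})

# Non-numeric strings such as "tbd" become NaN instead of raising an error.
sample["User_Score"] = pd.to_numeric(sample["User_Score"], errors="coerce")

# The nullable integer dtype keeps missing years as <NA> while showing whole years.
sample["Year_of_Release"] = sample["Year_of_Release"].astype("Int64")

print(sample.dtypes)
```

Coercing with `errors="coerce"` turns `tbd` into a missing value, so it should be counted together with the other missing scores afterwards.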

#### Dataset Summary Information

In [25]:
vgsales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


**Why?**

I also use `.info()` right after to see the number of non-null values, data types, and memory usage. It gives a clearer picture of the dataset's structure and helps spot any missing values (Ahmed, 2024a).

#### Dataset Shape

In [26]:
print("Rows and columns in the dataset:", vgsales.shape)

Rows and columns in the dataset: (16719, 16)


**Why?**

I check the size of the dataset using `.shape` to understand how big it is. A small dataset might not need much complex analysis, but a large one could need more detailed work.

#### Summary Statistics

In [27]:
vgsales.describe(include='all')

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
count,16717,16719,16450.0,16717,16665,16719.0,16719.0,16719.0,16719.0,16719.0,8137.0,8137.0,10015,7590.0,10096,9950
unique,11562,31,,12,581,,,,,,,,96,,1696,8
top,Need for Speed: Most Wanted,PS2,,Action,Electronic Arts,,,,,,,,tbd,,Ubisoft,E
freq,12,2161,,3370,1356,,,,,,,,2425,,204,3991
mean,,,2006.487356,,,0.26333,0.145025,0.077602,0.047332,0.533543,68.967679,26.360821,,162.229908,,
std,,,5.878995,,,0.813514,0.503283,0.308818,0.18671,1.547935,13.938165,18.980495,,561.282326,,
min,,,1980.0,,,0.0,0.0,0.0,0.0,0.01,13.0,3.0,,4.0,,
25%,,,2003.0,,,0.0,0.0,0.0,0.0,0.06,60.0,12.0,,10.0,,
50%,,,2007.0,,,0.08,0.02,0.0,0.01,0.17,71.0,21.0,,24.0,,
75%,,,2010.0,,,0.24,0.11,0.04,0.03,0.47,79.0,36.0,,81.0,,


**Why?**

I use `.describe(include='all')` to get summary information for all columns. This shows me things like counts for categories, min/max values, and averages for numbers. It helps me see how the data is spread out, spot outliers, and find any issues (Ahmed, 2024a).
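To make "spot outliers" concrete, one common rule of thumb flags values above the upper quartile plus 1.5 times the interquartile range (IQR). The sketch below applies it to a short illustrative series. The values are chosen for the example and the 1.5 multiplier is a convention, not something prescribed by the assignment.

```python
import pandas as pd

# Illustrative sales figures in millions (not the full dataset).
sales = pd.Series([0.01, 0.06, 0.17, 0.47, 0.53, 82.53], name="Global_Sales")

# Quartiles and interquartile range.
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are flagged as potential outliers.
upper_fence = q3 + 1.5 * iqr
outliers = sales[sales > upper_fence]
print(outliers)
```

Flagged values are not necessarily errors. In sales data, a best-seller can be a genuine extreme, so flagged rows are better reviewed than dropped automatically.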


#### Checking for Missing Values

In [28]:
print("Missing values in each column:")
vgsales.isnull().sum()

Missing values in each column:


Unnamed: 0,0
Name,2
Platform,0
Year_of_Release,269
Genre,2
Publisher,54
NA_Sales,0
EU_Sales,0
JP_Sales,0
Other_Sales,0
Global_Sales,0


**Why?**

Finding missing values early is important because it lets me decide whether to fill them in, remove them, or flag them during cleaning. Knowing how many are missing helps me plan what to do next (Ahmed, 2024a).
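Raw counts are easier to act on when they are shown next to the share of rows they represent. A minimal sketch on a toy DataFrame with hypothetical values is given below; in the notebook the same expressions would be applied to `vgsales`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for vgsales (hypothetical values).
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", None, "Game D"],
    "Year_of_Release": [2006.0, np.nan, 2010.0, 2012.0],
    "Critic_Score": [76.0, np.nan, np.nan, 80.0],
})

# isnull().mean() gives the fraction of missing rows per column.
missing = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing)
```

A column missing only a few rows can often have those rows dropped. A column missing a large share of its values usually needs a different decision, such as excluding it from the analysis.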

#### Checking for Duplicates

In [29]:
duplicates = vgsales[vgsales.duplicated()]
print(f"Number of duplicate rows: {duplicates.shape[0]}")

Number of duplicate rows: 0


**Why?**

Duplicate entries can distort the results of summaries and models. This step only counts exact duplicate rows; nothing is removed yet.
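`duplicated()` with no arguments only flags rows that match in every column. Because the same title can appear more than once (the summary statistics show one name occurring 12 times), it may also be worth checking for repeats on a subset of key columns. The sketch below uses a toy DataFrame with hypothetical values, and the choice of `Name`, `Platform`, and `Year_of_Release` as the key is an assumption.

```python
import pandas as pd

# Toy stand-in for vgsales (hypothetical values).
sample = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game A", "Game B"],
    "Platform": ["PS2", "PS2", "X360", "PS2"],
    "Year_of_Release": [2005.0, 2005.0, 2005.0, 2006.0],
    "Global_Sales": [1.20, 0.01, 0.90, 0.50],
})

# Exact duplicates: every column must match.
exact = sample.duplicated().sum()

# Key-based duplicates: same title, platform, and year, whatever the sales figures.
key_cols = ["Name", "Platform", "Year_of_Release"]
repeated = sample[sample.duplicated(subset=key_cols, keep=False)]

print(f"Exact duplicate rows: {exact}")
print(repeated)
```

Using `keep=False` returns every row in a repeated group, which makes it easier to compare the rows side by side before deciding whether to merge or drop them.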

### Part Two: Visual Analysis