# Tidy Tuesday 1

In this activity, the Video Game Sales dataset compiled by Gregory Smith, scraped from [VGChartz](vgchartz.com), will be analyzed. The dataset consists of 16,598 games, each with more than 100,000 copies sold.

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Markdown as md

%matplotlib inline

data = pd.read_csv('data/vgsales.csv')

data.head()

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |


## Data Preparation

Before analyzing the data, we first prepare it by cleaning it and dealing with records that have incomplete data.

In [3]:
data.shape

(16598, 11)

In [7]:
data.dtypes

Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object

In [17]:
data.isnull().sum()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

### Analysis

Since there are missing values in the Year and Publisher columns, a sentinel value will be filled in instead of deleting those records. In addition, after examining the data types, we noticed that Year is stored as a float instead of an integer, so it will be converted once the missing values are filled.

In [19]:
data.Year = data.Year.fillna(9999)

In [20]:
data.Publisher = data.Publisher.fillna("N/A")

In [21]:
data.isnull().sum()

Rank            0
Name            0
Platform        0
Year            0
Genre           0
Publisher       0
NA_Sales        0
EU_Sales        0
JP_Sales        0
Other_Sales     0
Global_Sales    0
dtype: int64

In [22]:
data.Year = data.Year.astype(int)

In [31]:
data.head(50)

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |
| 5 | 6 | Tetris | GB | 1989 | Puzzle | Nintendo | 23.2 | 2.26 | 4.22 | 0.58 | 30.26 |
| 6 | 7 | New Super Mario Bros. | DS | 2006 | Platform | Nintendo | 11.38 | 9.23 | 6.5 | 2.9 | 30.01 |
| 7 | 8 | Wii Play | Wii | 2006 | Misc | Nintendo | 14.03 | 9.2 | 2.93 | 2.85 | 29.02 |
| 8 | 9 | New Super Mario Bros. Wii | Wii | 2009 | Platform | Nintendo | 14.59 | 7.06 | 4.7 | 2.26 | 28.62 |
| 9 | 10 | Duck Hunt | NES | 1984 | Shooter | Nintendo | 26.93 | 0.63 | 0.28 | 0.47 | 28.31 |
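As a side note on the cleaning steps above, here is a minimal sketch comparing the sentinel approach used in this notebook with one possible alternative, pandas' nullable `Int64` dtype, which keeps missing years as `<NA>` instead of 9999. The `toy` frame and the names `sentinel` and `nullable` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Year and Publisher columns of vgsales.csv.
toy = pd.DataFrame({
    "Year": [2006.0, np.nan, 1985.0],
    "Publisher": ["Nintendo", None, "Nintendo"],
})

# Approach used in this notebook: fill sentinels, then cast Year to int.
sentinel = toy.copy()
sentinel["Year"] = sentinel["Year"].fillna(9999).astype(int)
sentinel["Publisher"] = sentinel["Publisher"].fillna("N/A")

# Alternative (an assumption, not the author's method): the nullable
# integer dtype stores whole-number years and keeps the gap as <NA>.
nullable = toy.copy()
nullable["Year"] = nullable["Year"].astype("Int64")

print(sentinel)
print(nullable)
```

The sentinel keeps every row usable in a plain integer column, but 9999 then behaves like a real year in any later grouping or sorting; the nullable dtype avoids that at the cost of having to handle `<NA>` explicitly.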


## Data Analysis

Now that the data has been cleaned, we can begin the analysis. In this section, the dataset will be analyzed through aggregation and grouping.

### Initial Observation

We can already see that the top 5 games by global sales all come from the same publisher, Nintendo.
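This observation can also be checked in code rather than by eye. The sketch below rebuilds the first five rows shown earlier as a small frame named `top5` (a stand-in; with the full dataset the equivalent would be `data.head(5)`) and counts the distinct publishers.

```python
import pandas as pd

# The five top-ranked rows, copied from the data.head() output above.
top5 = pd.DataFrame({
    "Name": [
        "Wii Sports",
        "Super Mario Bros.",
        "Mario Kart Wii",
        "Wii Sports Resort",
        "Pokemon Red/Pokemon Blue",
    ],
    "Publisher": ["Nintendo"] * 5,
})

# If only one distinct publisher appears, all five games share it.
publishers = top5["Publisher"].unique()
print(publishers)
print(top5["Publisher"].nunique() == 1)
```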


Now we will be looking at the global sales per year, platform, genre, and publisher.
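The cells that follow repeat the same group, sum, and sort pattern for each column. As a sketch only, that pattern could be wrapped in a small helper; `total_sales_by` is a hypothetical name that does not appear in the original notebook, and the toy frame reuses four rows from the table above.

```python
import pandas as pd


def total_sales_by(df, column):
    """Sum Global_Sales for each value of `column`, largest total first."""
    return (
        df.groupby(column)["Global_Sales"]
        .sum()
        .sort_values(ascending=False)
        .to_frame()
    )


# Toy data: platform and global sales of four of the top-ranked games.
toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "GB"],
    "Global_Sales": [82.74, 40.24, 35.82, 31.37],
})

per_platform = total_sales_by(toy, "Platform")
print(per_platform)
```

Because `sum()` on a single selected column already yields a Series named `Global_Sales`, the resulting frame has that column name without reassigning `columns`.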

In [38]:
salesPerYear = data.groupby('Year')['Global_Sales'].sum().to_frame().sort_values(by=['Global_Sales'], ascending=False)
salesPerYear.columns = ['Global_Sales']
salesPerYear.head(5)

| Year | Global_Sales |
|---|---|
| 2008 | 678.9 |
| 2009 | 667.3 |
| 2007 | 611.13 |
| 2010 | 600.45 |
| 2006 | 521.04 |


In [36]:
salesPerYear.tail(5)

| Year | Global_Sales |
|---|---|
| 1987 | 21.74 |
| 1983 | 16.79 |
| 1980 | 11.38 |
| 2020 | 0.29 |
| 2017 | 0.05 |


### Observation

The five years with the highest global sales are 2006 through 2010, with 2008 being the highest-grossing year for video games globally. In contrast, 2017 and 2020 show surprisingly low global sales. One factor that may have contributed to this is that the data was scraped before sales for those years were fully recorded, so the figures for 2017-2020 are incomplete.
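Two things are worth keeping in mind when reading the per-year totals: the sentinel year 9999 introduced during cleaning forms its own group, and a low total can simply reflect very few recorded titles. The sketch below, on made-up toy numbers, drops the sentinel and reports the number of titles next to each yearly sum; `SENTINEL_YEAR`, `known`, and `per_year` are illustrative names.

```python
import pandas as pd

SENTINEL_YEAR = 9999  # placeholder written by fillna(9999) during cleaning

# Toy rows with invented sales figures; 9999 marks an unknown year.
toy = pd.DataFrame({
    "Year": [2008, 2008, 2017, 9999, 2020],
    "Global_Sales": [35.82, 10.0, 0.05, 1.5, 0.29],
})

# Exclude the sentinel so it is not mistaken for a real year.
known = toy[toy["Year"] != SENTINEL_YEAR]

# Report total sales together with how many titles back each total.
per_year = known.groupby("Year")["Global_Sales"].agg(["sum", "count"])
print(per_year)
```

A year with a very small `count` is more likely to be sparsely covered by the scrape than to reflect an actual collapse in sales.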

Now we analyze the global sales per platform.

In [45]:
salesPerPlatform = data.groupby('Platform')['Global_Sales'].sum().to_frame().sort_values(by=['Global_Sales'], ascending=False)
salesPerPlatform.columns = ['Global_Sales']
salesPerPlatform.head(5)

| Platform | Global_Sales |
|---|---|
| PS2 | 1255.64 |
| X360 | 979.96 |
| PS3 | 957.84 |
| Wii | 926.71 |
| DS | 822.49 |


In [46]:
salesPerPlatform.tail(5)

| Platform | Global_Sales |
|---|---|
| WS | 1.42 |
| TG16 | 0.16 |
| 3DO | 0.1 |
| GG | 0.04 |
| PCFX | 0.03 |


### Observation

Surprisingly, the highest-grossing platform is the PlayStation 2, which is already three console generations old. It is followed by the Xbox 360 and the PlayStation 3 (the PS2's successor), both of which are already two generations old.

Another thing to note is that the lowest global sales come mostly from older, less popular consoles, several of them Japanese.

Next, we look at the global sales per genre.

In [52]:
salesPerGenre = data.groupby('Genre')['Global_Sales'].sum().to_frame().sort_values(by=['Global_Sales'], ascending = False)
salesPerGenre.columns = ['Global_Sales']
salesPerGenre.shape

(12, 1)

In [53]:
salesPerGenre.head(12)

| Genre | Global_Sales |
|---|---|
| Action | 1751.18 |
| Sports | 1330.93 |
| Shooter | 1037.37 |
| Role-Playing | 927.37 |
| Platform | 831.37 |
| Misc | 809.96 |
| Racing | 732.04 |
| Fighting | 448.91 |
| Simulation | 392.2 |
| Puzzle | 244.95 |


### Observation

Action, Sports, and Shooter are, in that order, the top 3 genres by global sales. On the other hand, genres like Puzzle, Adventure, and Strategy are at the bottom.
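To make a ranking like this easier to compare, each genre's total can be expressed as a share of all sales. This is a minimal sketch on invented toy totals, not the real figures; with the notebook's data the same arithmetic would be applied to `salesPerGenre['Global_Sales']`.

```python
import pandas as pd

# Invented genre totals, chosen so the shares are easy to verify by hand.
toy_totals = pd.Series(
    {"Action": 50.0, "Sports": 30.0, "Puzzle": 20.0},
    name="Global_Sales",
)

# Each genre's percentage of the overall total, rounded to one decimal.
share = (toy_totals / toy_totals.sum() * 100).round(1)
print(share)
```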

Lastly, we will look at the global sales per publisher.

In [55]:
salesPerPublisher = data.groupby('Publisher')['Global_Sales'].sum().to_frame().sort_values(by=['Global_Sales'], ascending=False)
salesPerPublisher.columns = ['Global_Sales']
salesPerPublisher.head(5)

| Publisher | Global_Sales |
|---|---|
| Nintendo | 1786.56 |
| Electronic Arts | 1110.32 |
| Activision | 727.46 |
| Sony Computer Entertainment | 607.5 |
| Ubisoft | 474.72 |


In [57]:
salesPerPublisher.tail(10)

| Publisher | Global_Sales |
|---|---|
| Genterprise | 0.01 |
| New World Computing | 0.01 |
| Otomate | 0.01 |
| Ascaron Entertainment | 0.01 |
| Stainless Games | 0.01 |
| Ongakukan | 0.01 |
| Commseed | 0.01 |
| Takuyo | 0.01 |
| Boost On | 0.01 |
| Naxat Soft | 0.01 |


### Observation

The publisher with the highest global sales is Nintendo, which matches the initial observation that the top 5 video games all come from that company. It is followed by Electronic Arts, Activision, Sony Computer Entertainment, and Ubisoft, in that order. Similar to the bottom of the per-platform list, some of the publishers with the lowest global sales are small Japanese companies.
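All ten publishers in the bottom list share the same total of 0.01, so which ones appear in a ten-row tail depends on how the ties happen to be ordered. A sketch on toy data of how the tie could be made explicit, using `nlargest` and a count of publishers at the minimum; the publisher names `Tiny A`, `Tiny B`, and `Tiny C` are invented.

```python
import pandas as pd

# Toy data: one large publisher and three tied at the minimum total.
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Nintendo", "Tiny A", "Tiny B", "Tiny C"],
    "Global_Sales": [82.74, 40.24, 0.01, 0.01, 0.01],
})

totals = toy.groupby("Publisher")["Global_Sales"].sum()

# Largest total without sorting the whole Series first.
top = totals.nlargest(1)

# Number of publishers sharing the smallest total.
tied_at_min = int((totals == totals.min()).sum())

print(top)
print(tied_at_min)
```

Reporting the size of the tie alongside the bottom rows shows that the listed publishers are a sample of the lowest group rather than a strict ranking.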

