# Videogame sales
* Dataset obtained from the website https://www.kaggle.com/
* Direct link: https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings (requires a Kaggle account)
* Sales (and ratings) of videogames
* Updated to 2016
* Used fields:
    * Name
    * Year_of_Release
    * Platform: {PC | PlayStation | PlayStation 2 | XBOX | ... }
    * Publisher: {Nintendo | Electronic Arts | Ubisoft | ...}
    * Genre: {Action | Sports | Shooter | ...}
    * NA_Sales: total sales of the game in North America (millions of units)
    * EU_Sales: total sales in Europe (millions of units)
    * JP_Sales: total sales in Japan (millions of units)
    * Other_Sales: total sales in other regions (millions of units)
    * User_Score: game score given by users
    * Critic_Score: game score given by critics

We analyze several aspects of the global videogame market:  
 * Most popular Platforms and Genres (also over time)
 * Correlation of sales in different regions of the world
 * User_Score vs. Critic_Score
 * Close look at certain events in the videogame industry: console wars

### Load dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

In [2]:
plt.rcParams['figure.figsize'] = [15, 8]
plt.rcParams['font.size'] = 20

In [3]:
data = pd.read_csv("Video_Games_Sales_as_at_22_Dec_2016.csv")
data.sample(20)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
16427,Cities: Skylines Snowfall,PC,2016.0,Simulation,Paradox Development,0.0,0.01,0.0,0.0,0.01,72.0,16.0,7.1,20.0,Colossal Order,
8341,Space Griffon VF-9,PS,1995.0,Role-Playing,Panther Software,0.02,0.02,0.12,0.01,0.17,,,,,,
3592,Dogz,DS,2006.0,Simulation,Ubisoft,0.5,0.01,0.0,0.04,0.56,,,7.8,9.0,MTO,E
12764,Hakuoki: Stories of the Shinsengumi,PS3,2010.0,Adventure,Idea Factory,0.03,0.0,0.03,0.01,0.06,,,6.5,10.0,Design Factory,M
2988,Fight Night 2004,XB,2004.0,Fighting,Electronic Arts,0.51,0.15,0.0,0.02,0.68,85.0,47.0,8.7,20.0,EA Sports,T
4870,Planet 51,DS,2009.0,Action,Sega,0.22,0.14,0.0,0.04,0.39,,,tbd,,Firebrand Games,E
2395,NHL 2000,PS,1998.0,Sports,Electronic Arts,0.48,0.33,0.0,0.06,0.87,,,,,,
12480,Bubble Bobble: Old & New,GBA,2002.0,Puzzle,Empire Interactive,0.04,0.02,0.0,0.0,0.06,,,,,,
8697,Wing Commander III: Heart of the Tiger,PS,1996.0,Action,Electronic Arts,0.09,0.06,0.0,0.01,0.16,,,,,,
1213,Warcraft III: The Frozen Throne,PC,2003.0,Strategy,Activision,0.58,0.87,0.0,0.09,1.54,88.0,23.0,9,713.0,Blizzard Entertainment,T


### Cleanup: NaN handling

Zero sales most likely indicate missing data (or a product not released in that region).

We replace these misleading zeros with ``np.nan`` to avoid ambiguity.

In [4]:
data['NA_Sales'] = data['NA_Sales'].replace(0, np.nan)
data['EU_Sales'] = data['EU_Sales'].replace(0, np.nan)
data['JP_Sales'] = data['JP_Sales'].replace(0, np.nan)
data['Global_Sales'] = data['Global_Sales'].replace(0, np.nan)
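
The same replacement can be applied to several columns in one statement. The following is a minimal sketch on a toy frame whose values are made up for illustration (they are not taken from the dataset); it also shows that aggregations such as `count()` and `mean()` skip the resulting NaN values:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "NA_Sales": [0.5, 0.0, 0.2],
    "JP_Sales": [0.0, 0.1, 0.0],
})

sales_cols = ["NA_Sales", "JP_Sales"]
# Replace zeros with NaN in all the listed columns at once
toy[sales_cols] = toy[sales_cols].replace(0, np.nan)

# NaN values are skipped by aggregations such as count() and mean()
print(toy[sales_cols].count())
print(toy[sales_cols].mean())
```

With zeros left in place, the mean of `NA_Sales` would be pulled down by a game that may simply never have been released in that region.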

In [5]:
data.sample(10)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
2026,Sonic Unleashed,PS3,2008.0,Platform,Sega,0.56,0.32,0.01,0.14,1.02,54.0,24.0,7.9,130.0,Sonic Team,E10+
14343,Nobunaga no Yabou: Kakushin with Power-Up Kit,PS2,2008.0,Strategy,Tecmo Koei,,,0.03,0.0,0.03,,,,,,
15957,Kiss Bell,PSV,2014.0,Adventure,Giga,,,0.02,0.0,0.02,,,,,,
9733,Eyeshield 21: Max Devil Power,DS,2006.0,Role-Playing,Nintendo,,,0.12,0.0,0.12,,,,,,
188,Professor Layton and the Curious Village,DS,2007.0,Puzzle,Nintendo,1.21,2.43,1.03,0.52,5.19,85.0,72.0,8.6,147.0,Level 5,E
12426,Generator Rex: Agent of Providence,X360,2011.0,Action,Activision,0.05,0.01,,0.0,0.06,,,4,6.0,Virtuos,E10+
2703,Portal 2,PC,2011.0,Shooter,Valve Software,0.33,0.32,,0.1,0.76,95.0,52.0,8.8,5999.0,Valve Software,E10+
15112,Rengoku II: The Stairway To H.E.A.V.E.N.,PSP,2006.0,Action,Konami Digital Entertainment,0.02,,,0.0,0.02,51.0,24.0,7.6,14.0,Neverland,T
3621,Dora The Explorer: Dora Saves the Snow Princess,Wii,2008.0,Platform,Take-Two Interactive,0.49,0.02,,0.04,0.55,47.0,4.0,7,5.0,High Voltage Software,E
11257,Wacky World of Sports,Wii,2009.0,Sports,Sega,0.08,,,0.01,0.09,45.0,10.0,tbd,,Tabot,E10+


### Cleanup: Type handling

Check out the dataframe data types:

In [6]:
data.dtypes

Name                object
Platform            object
Year_of_Release    float64
Genre               object
Publisher           object
NA_Sales           float64
EU_Sales           float64
JP_Sales           float64
Other_Sales        float64
Global_Sales       float64
Critic_Score       float64
Critic_Count       float64
User_Score          object
User_Count         float64
Developer           object
Rating              object
dtype: object

``User_Score`` was expected to be numeric, but it is object. Typical cause: some elements are strings!

### Cleanup: Type handling


In [7]:
data['User_Score'].unique()

array(['8', nan, '8.3', '8.5', '6.6', '8.4', '8.6', '7.7', '6.3', '7.4',
       '8.2', '9', '7.9', '8.1', '8.7', '7.1', '3.4', '5.3', '4.8', '3.2',
       '8.9', '6.4', '7.8', '7.5', '2.6', '7.2', '9.2', '7', '7.3', '4.3',
       '7.6', '5.7', '5', '9.1', '6.5', 'tbd', '8.8', '6.9', '9.4', '6.8',
       '6.1', '6.7', '5.4', '4', '4.9', '4.5', '9.3', '6.2', '4.2', '6',
       '3.7', '4.1', '5.8', '5.6', '5.5', '4.4', '4.6', '5.9', '3.9',
       '3.1', '2.9', '5.2', '3.3', '4.7', '5.1', '3.5', '2.5', '1.9', '3',
       '2.7', '2.2', '2', '9.5', '2.1', '3.6', '2.8', '1.8', '3.8', '0',
       '1.6', '9.6', '2.4', '1.7', '1.1', '0.3', '1.5', '0.7', '1.2',
       '2.3', '0.5', '1.3', '0.2', '0.6', '1.4', '0.9', '1', '9.7'],
      dtype=object)

Some videogames are rated 'tbd' (presumably "to be determined", i.e. no user score is available yet). We treat these entries as missing.

In [8]:
data['User_Score'] = data['User_Score'].replace('tbd', np.nan) # replace 'tbd' with np.nan
data['User_Score'] = data['User_Score'].astype(float) # cast to float (np.float was removed in NumPy 1.24)
#data['Year_of_Release'] = data['Year_of_Release'].astype('Int64') # cast to nullable integer, which tolerates missing years
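
An alternative to the replace-then-cast approach is `pd.to_numeric` with `errors="coerce"`, which converts every unparseable entry to NaN in one step. A minimal sketch on a made-up series that mimics the mixed content of `User_Score`:

```python
import pandas as pd

# Toy series mimicking the mixed content of User_Score (values made up)
scores = pd.Series(["8", "tbd", "7.5", None])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
scores_num = pd.to_numeric(scores, errors="coerce")

print(scores_num.dtype)
print(scores_num.tolist())
```

The trade-off is that coercion silently hides any unexpected string, so inspecting `unique()` first, as done above, is still worthwhile.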

### Preparation: units

Critic_Score is in the range [0, 100], while User_Score is in the range [0, 10]. We rescale User_Score to [0, 100] so that the two are directly comparable.

In [9]:
data['User_Score'] = data['User_Score'] * 10 # Easy fix!

## Market by platform

The total number of units sold per platform is easily obtained using *split-apply-combine*, with a sum aggregation:

In [10]:
platform_sales = data.groupby("Platform", as_index=False)[["Global_Sales"]].sum().sort_values(by="Global_Sales", ascending=False)
platform_sales.head(10)

Unnamed: 0,Platform,Global_Sales
16,PS2,1255.64
28,X360,971.63
17,PS3,939.43
26,Wii,908.13
4,DS,807.1
15,PS,730.68
6,GBA,318.5
18,PS4,314.23
19,PSP,294.3
13,PC,260.3
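
To make the three steps of *split-apply-combine* explicit, the sketch below performs them by hand on a toy frame (values made up for illustration) and compares the result with the one-line `groupby` aggregation:

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Platform": ["PS2", "PC", "PS2", "PC", "Wii"],
    "Global_Sales": [1.0, 0.5, 2.0, 0.25, 3.0],
})

# Split: one sub-frame per platform; apply: sum; combine: collect into a dict
manual = {platform: group["Global_Sales"].sum()
          for platform, group in toy.groupby("Platform")}

# The same result with the one-line groupby aggregation
auto = toy.groupby("Platform")["Global_Sales"].sum()

print(manual)
print(auto.to_dict())
```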


## Market by platform
A bar plot provides an intuitive representation:

In [None]:
platform_sales = data.groupby("Platform", as_index=False)[["Global_Sales"]].sum().sort_values(by="Global_Sales", ascending=False)
px.bar(platform_sales, x="Global_Sales", y="Platform", color="Platform", width=1600, height=900, title="Global sales, by platform")

The best-selling platform of all time is the PlayStation 2, by a large margin: about 1256 million units, versus about 972 million for the runner-up, the XBOX 360.

## Market by genre

In [None]:
genre_sales = data.groupby("Genre", as_index=False)[["Global_Sales"]].sum().sort_values(by="Global_Sales", ascending=False) # sum only the numeric column we need
genre_sales.head(3)

In [None]:
px.bar(genre_sales, x="Genre", y="Global_Sales", width=1600, height=500, color="Genre", title="Global sales, by genre")

Action games dominate the scene.

## Market by publisher

In [None]:
pub_sales = data.groupby("Publisher", as_index=False)[["Global_Sales"]].sum().sort_values(by="Global_Sales", ascending=False) # sum only the numeric column we need

In [None]:
px.bar(pub_sales.iloc[0:10], x="Global_Sales", y="Publisher", color="Publisher", width=1600, height=800, title="Global sales, by publisher")

Nintendo is the top videogame publisher by global sales.

## Market Correlation Analysis

In [None]:
idx_sales = data['NA_Sales'].notna() & data['EU_Sales'].notna() & data['JP_Sales'].notna() # NaN != NaN is always True, so use notna()
data_sales = data[idx_sales]
fig = px.scatter_matrix(data_sales,
    dimensions=['NA_Sales', 'EU_Sales', 'JP_Sales'],
    hover_data=["Name", "Publisher"],
    width=1200, height=700)
fig.show()

The EU and NA markets are strongly correlated with each other, while the JP market stands apart.

In [None]:
corr = data_sales[["NA_Sales", "EU_Sales", "JP_Sales"]].corr()
corr
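
`DataFrame.corr` computes the Pearson coefficient by default and, for each pair of columns, ignores the rows where either value is missing. Sales figures tend to be skewed by a few blockbusters, so a rank-based coefficient may be a useful cross-check; this is a suggestion, not part of the original analysis. A sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "NA_Sales": [0.1, 0.5, 1.0, 4.0, np.nan],
    "EU_Sales": [0.05, 0.3, 0.8, 2.5, 0.2],
    "JP_Sales": [0.4, np.nan, 0.1, 0.3, 0.6],
})

pearson = toy.corr()                    # linear correlation (default)
spearman = toy.corr(method="spearman")  # rank correlation, less sensitive to outliers

print(pearson.round(2))
print(spearman.round(2))
```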

## User score vs. Critics Score

In [None]:
px.scatter(data, x="User_Score", y="Critic_Score", hover_data=["Name", "Publisher"],
          width=1200, height=700)

In [None]:
data[(data['User_Score'] > 80) & (data['Critic_Score'] < 30)]

## User score vs. Critics Score

In [None]:
sns.jointplot(data=data, kind="hex", x="User_Score", y="Critic_Score",  height=10.0);

User ratings are correlated with Critic ratings, but there are exceptions!
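
One way to list the exceptions systematically is to rank games by the gap between the two scores. A minimal sketch, assuming both scores are already on the [0, 100] scale; the column `Score_Gap` is a hypothetical name introduced here, and the values are made up:

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "User_Score": [85.0, 40.0, 90.0, 72.0],   # already rescaled to [0, 100]
    "Critic_Score": [80.0, 75.0, 35.0, 70.0],
})

# Positive gap: users liked the game more than critics did
toy["Score_Gap"] = toy["User_Score"] - toy["Critic_Score"]

# Sort by absolute gap to list the strongest disagreements first
order = toy["Score_Gap"].abs().sort_values(ascending=False).index
print(toy.loc[order, ["Name", "Score_Gap"]])
```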

### 7th-Generation Console War: PS3 vs. XBOX 360 vs. Wii

Three consoles (Sony PS3, Microsoft XBOX 360, Nintendo Wii) came out at about the same time and competed on the market. Who won?

In [None]:
data_7th = data[data['Platform'].isin(['PS3', 'X360', 'Wii'])]
data_7th_agg = data_7th.groupby(['Year_of_Release','Platform'], as_index=False)[['Global_Sales']].sum()
data_7th_agg.head(10)

### 7th-Generation Console War: PS3 vs. XBOX 360 vs. Wii

In [None]:
px.bar(data_7th_agg, x='Year_of_Release', y="Global_Sales", color="Platform", barmode="group", width=1400, height=700)

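The Wii was more popular than its competitors until 2009. Then the PS3 and the XBOX 360 took over.

Overall, it was a tight competition: in total, the XBOX 360 sold about 972 million units, the PS3 about 939 million, and the Wii about 908 million.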
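
A running total per platform makes it easier to see when one console overtakes another. The sketch below uses made-up yearly figures (not the dataset) and a pivot followed by `cumsum`:

```python
import pandas as pd

# Toy yearly sales, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Year_of_Release": [2006, 2006, 2007, 2007, 2008, 2008],
    "Platform": ["Wii", "X360", "Wii", "X360", "Wii", "X360"],
    "Global_Sales": [3.0, 1.0, 2.0, 2.0, 1.0, 4.0],
})

# One row per year, one column per platform
wide = toy.pivot_table(index="Year_of_Release", columns="Platform",
                       values="Global_Sales", aggfunc="sum")

# Running total: shows when (and if) one platform overtakes another
cumulative = wide.cumsum()
print(cumulative)
```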

### 8th-Generation Console War: PS4 vs. XBOX One vs. WiiU

In [None]:
data_8th = data[data['Platform'].isin(['WiiU', 'PS4', 'XOne'])]
data_8th_agg = data_8th.groupby(['Year_of_Release','Platform'])['Global_Sales'].sum().reset_index()

In [None]:
px.bar(data_8th_agg, x='Year_of_Release', y="Global_Sales", color="Platform", barmode="group", width=1400, height=700)

The PlayStation 4 (PS4) is the clear winner of the 8th generation, at least up to 2016, the last year covered by the dataset.

## Evolution of Genre

In [None]:
data_gen = data # keep all genres; to restrict: data[data['Genre'].isin(['Action', 'Platform'])]
data_gen_agg = data_gen.groupby(['Year_of_Release','Genre'])['Global_Sales'].sum().reset_index()

In [None]:
data_gen_agg.head(5)

In [None]:
data_plot = data_gen_agg[data_gen_agg['Genre'].isin(['Action', 'Platform'])]
px.bar(data_plot, x='Year_of_Release', y="Global_Sales", color="Genre", barmode="group", width=1600, height=700)

Action games dominate nowadays, but platform games (typically 2D) were very popular in the '90s.

In [None]:
#sns.lineplot(data=data_gen_agg, x='Year_of_Release', y="Global_Sales", hue="Genre");

### Normalize by year

In [None]:
data_year = data.groupby("Year_of_Release")["Global_Sales"].sum()
data_year_genre = data.groupby(["Year_of_Release", "Genre"])[["Global_Sales"]].sum()
data_year_genre["Global_Share"] = data_year_genre["Global_Sales"]/data_year * 100
data_year_genre = data_year_genre.reset_index()
data_year_genre.head(5)
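
The normalization can also be written with `groupby(...).transform("sum")`, which returns the yearly total aligned to every row and so avoids dividing series with different indexes. A minimal sketch on made-up numbers, with a sanity check that the shares within each year add up to 100:

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Year_of_Release": [1995, 1995, 2010, 2010],
    "Genre": ["Action", "Platform", "Action", "Platform"],
    "Global_Sales": [1.0, 3.0, 6.0, 2.0],
})

# transform("sum") returns the yearly total repeated on every row of that year
year_total = toy.groupby("Year_of_Release")["Global_Sales"].transform("sum")
toy["Global_Share"] = toy["Global_Sales"] / year_total * 100

# Sanity check: shares within each year add up to 100
print(toy)
print(toy.groupby("Year_of_Release")["Global_Share"].sum())
```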

In [None]:
data_plot = data_year_genre[data_year_genre['Genre'].isin(['Action', 'Platform'])]
px.bar(data_plot, x='Year_of_Release', y="Global_Share", color="Genre", barmode="group", width=1600, height=700)

### Global sales by genre, time animation

In [None]:
data_plot = data_year_genre.query("Year_of_Release > 1990 and Year_of_Release < 2016")
data_plot=data_plot.sort_values(by=["Year_of_Release", "Global_Share"], ascending=[True,False])
px.bar(data_plot, x="Global_Share", y="Genre", color="Genre", animation_frame="Year_of_Release",
      animation_group="Genre",
      range_x=[0, 100],
      width=1000, height=500)
