# Video Game Sales

### Analyze sales data from more than 16,500 games.
### Student: Aman Arbek 
### Group: BDA-1901
#### Content
+ Introduction
+ Data description
+ Formulation of research question
+ Data preparation



## 1. Introduction: Video Game Sales


This is a list of the best-selling video games of all time. The best-selling video game to date is Minecraft, a sandbox video game originally released for Microsoft Windows, Mac OS X, and Linux in 2011. The game has been ported to a wide range of platforms and has sold 200 million copies, including cheaper paid mobile game downloads. Grand Theft Auto V and EA's Tetris mobile game are the only other known video games to have sold over 100 million copies. The best-selling game on a single platform is Wii Sports, with nearly 83 million sales for the Wii console. Sales figures are often inflated when a game is bundled with hardware, as in the case of Super Mario Bros. and the NES, or Wii Play and the additional Wii controller it was bundled with.

Of the top 50 best-selling video games on this list, half were developed or published by Nintendo. Several games were published by Nintendo and their affiliate, The Pokémon Company. Other publishers with multiple entries in the top 50 include Activision and Rockstar Games with five games each, and Electronic Arts and Sega with two games each. Aside from Nintendo's internal development teams, Game Freak is the developer with the most games in the top 50, with five from the Pokémon series. The oldest game in the top 50 is Pac-Man, which was released in June 1980.



Source: https://en.wikipedia.org/wiki/List_of_best-selling_video_games

## 2. Data description

The analysis is based on a data set provided by Kaggle. It contains 16,598 rows about games that were popular four years ago, with sales broken down by region. We will compare the most popular games around the world based on different factors.

The data come from the ready-made data set at https://www.kaggle.com/gregorut/videogamesales , which was collected from https://www.vgchartz.com/ in 2016. The fields used in our analysis are listed below:


+ Fields include

+ Rank - Ranking of overall sales

+ Name - The game's name

+ Platform - Platform of the game's release (e.g. PC, PS4)

+ Year - Year of the game's release

+ Genre - Genre of the game

+ Publisher - Publisher of the game

+ NA_Sales - Sales in North America (in millions)

+ EU_Sales - Sales in Europe (in millions)

+ JP_Sales - Sales in Japan (in millions)

+ Global_Sales - Total worldwide sales (in millions)



## 3. Formulation of research question

Analyze total game sales for the global market, compare them with total sales in each region, and illustrate the comparison as a diagram.

Analyze on which platform the most games were released.

Analyze which genres have been popular around the world and in each region.

Compare the top-ranked games with each other.

Analyze in which year the most games were released.
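
As a rough illustration of how these questions map onto pandas operations, here is a minimal sketch. It assumes a small hand-typed sample (the five top-ranked rows of the dataset) instead of the full file, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample: the five top-ranked rows of the dataset
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
             "Wii Sports Resort", "Pokemon Red/Pokemon Blue"],
    "Platform": ["Wii", "NES", "Wii", "Wii", "GB"],
    "Year": [2006, 1985, 2008, 2009, 1996],
    "Genre": ["Sports", "Platform", "Racing", "Sports", "Role-Playing"],
    "NA_Sales": [41.49, 29.08, 15.85, 15.75, 11.27],
    "EU_Sales": [29.02, 3.58, 12.88, 11.01, 8.89],
    "JP_Sales": [3.77, 6.81, 3.79, 3.28, 10.22],
    "Global_Sales": [82.74, 40.24, 35.82, 33.00, 31.37],
})

# Total sales per region next to the global total (in millions)
region_totals = sample[["NA_Sales", "EU_Sales", "JP_Sales", "Global_Sales"]].sum()

# Number of games per platform and per release year
games_per_platform = sample["Platform"].value_counts()
games_per_year = sample["Year"].value_counts()

# Global sales per genre, largest first
sales_per_genre = (
    sample.groupby("Genre")["Global_Sales"].sum().sort_values(ascending=False)
)

print(region_totals)
print(games_per_platform.idxmax())
print(sales_per_genre.head())
```

On the full dataset the same calls would run on `data` instead of `sample`.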

## 4. Data preparation

I deleted the Other_Sales column.
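
One consequence of dropping that column is that the three remaining regional columns no longer add up to Global_Sales. The sketch below assumes that Global_Sales is the sum of all regional columns, including the deleted one; under that assumption the gap approximates the sales outside North America, Europe, and Japan. The sample rows are the two top-ranked games and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample: the two top-ranked games (sales in millions)
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros."],
    "NA_Sales": [41.49, 29.08],
    "EU_Sales": [29.02, 3.58],
    "JP_Sales": [3.77, 6.81],
    "Global_Sales": [82.74, 40.24],
})

# Sum of the three regions that remain in the table
regional_sum = sample[["NA_Sales", "EU_Sales", "JP_Sales"]].sum(axis=1)

# Assumption: whatever is left belongs to the rest of the world
rest_of_world = (sample["Global_Sales"] - regional_sum).round(2)

print(rest_of_world.tolist())
```

For Wii Sports this is 82.74 - (41.49 + 29.02 + 3.77) = 8.46 million.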

In [1]:
# import all modules that will be used
import time                           
import pandas as pd   
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno # analyze the missing values for our .csv file
import requests   
from bs4 import BeautifulSoup
import unicodecsv as csv
import json
import seaborn as sns


%matplotlib inline
from IPython.display import HTML 
import warnings
warnings.simplefilter("ignore")

In [6]:
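# Load the dataset (the path is specific to the author's machine)
data = pd.read_csv('/Users/arbek/Desktop/IDA/vgsales.csv')
# Drop the Other_Sales column, which is not used in this analysis
del data['Other_Sales']
print(data.head())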



   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo   
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo   
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo   
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Global_Sales  
0     41.49     29.02      3.77         82.74  
1     29.08      3.58      6.81         40.24  
2     15.85     12.88      3.79         35.82  
3     15.75     11.01      3.28         33.00  
4     11.27      8.89     10.22         31.37  


In [24]:
df1 = data[['NA_Sales', 'EU_Sales','JP_Sales','Global_Sales','Name']]
df1.head(5)

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Global_Sales,Name
0,41.49,29.02,3.77,82.74,Wii Sports
1,29.08,3.58,6.81,40.24,Super Mario Bros.
2,15.85,12.88,3.79,35.82,Mario Kart Wii
3,15.75,11.01,3.28,33.0,Wii Sports Resort
4,11.27,8.89,10.22,31.37,Pokemon Red/Pokemon Blue


In [25]:
df2 = data[['Name', 'Platform','Global_Sales']]
df2.head(5)

Unnamed: 0,Name,Platform,Global_Sales
0,Wii Sports,Wii,82.74
1,Super Mario Bros.,NES,40.24
2,Mario Kart Wii,Wii,35.82
3,Wii Sports Resort,Wii,33.0
4,Pokemon Red/Pokemon Blue,GB,31.37


In [26]:
# Summary statistics for the numeric columns:
# std: standard deviation (pandas computes the "corrected" sample version, ddof=1)
# min: the minimum value in each column
# max: the maximum value in each column
# mean: the average of the values
data.describe()




Unnamed: 0,Rank,Year,NA_Sales,EU_Sales,JP_Sales,Global_Sales
count,16598.0,16327.0,16598.0,16598.0,16598.0,16598.0
mean,8300.605254,2006.406443,0.264667,0.146652,0.077782,0.537441
std,4791.853933,5.828981,0.816683,0.505351,0.309291,1.555028
min,1.0,1980.0,0.0,0.0,0.0,0.01
25%,4151.25,2003.0,0.0,0.0,0.0,0.06
50%,8300.5,2007.0,0.08,0.02,0.0,0.17
75%,12449.75,2010.0,0.24,0.11,0.04,0.47
max,16600.0,2020.0,41.49,29.02,10.22,82.74
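
To make the difference between the "corrected" and "uncorrected" standard deviation concrete, here is a minimal sketch. It assumes a hypothetical five-value sample (the Global_Sales of the five top-ranked games) and compares the pandas default with the population version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Global_Sales of the five top-ranked games (millions)
sales = pd.Series([82.74, 40.24, 35.82, 33.00, 31.37])

corrected = sales.std()          # pandas default: ddof=1, divides by n - 1
uncorrected = sales.std(ddof=0)  # population version, divides by n

# describe() reports the same value as the default std()
describe_std = sales.describe()["std"]

# NumPy's default np.std uses ddof=0, i.e. the uncorrected version
numpy_std = np.std(sales.to_numpy())

print(corrected, uncorrected, describe_std, numpy_std)
```

With 16,598 rows the two versions are almost identical, so the choice matters mainly for small samples.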


In [42]:
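# value_counts() orders values by frequency, so index[0] is the most common sales figure
print('Most common Global_Sales value (in millions):', df2['Global_Sales'].value_counts().index[0])



Most common Global_Sales value (in millions): 0.02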
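
Note that `value_counts().index[0]` returns the most frequent value, not the largest one. A minimal sketch of how the best-selling game could be looked up instead, assuming a hypothetical five-row sample taken from the top and bottom of the ranking:

```python
import pandas as pd

# Hypothetical sample: two top-ranked and three bottom-ranked games
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.",
             "Woody Woodpecker in Crazy Castle 5",
             "Men in Black II: Alien Escape", "Know How 2"],
    "Global_Sales": [82.74, 40.24, 0.01, 0.01, 0.01],
})

# Most frequent sales figure (what value_counts().index[0] gives)
most_common = sample["Global_Sales"].value_counts().index[0]

# Row with the highest global sales
best_seller = sample.loc[sample["Global_Sales"].idxmax()]

# The two best-selling games, largest first
top_two = sample.nlargest(2, "Global_Sales")

print(most_common)
print(best_seller["Name"])
print(top_two)
```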


In [34]:
data.isnull().sum() 

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Global_Sales      0
dtype: int64

In [33]:
proverka = data.notnull()  # Boolean mask: True where a value is present, False where it is null
proverka

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Global_Sales
0,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...
16593,True,True,True,True,True,True,True,True,True,True
16594,True,True,True,True,True,True,True,True,True,True
16595,True,True,True,True,True,True,True,True,True,True
16596,True,True,True,True,True,True,True,True,True,True


In [61]:
data.dropna()     # Returns a copy without the rows that contain null values; data itself is not modified

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,31.37
...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.01
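
Because `dropna()` returns a new DataFrame, the result has to be assigned if the cleaned data should be used later. A minimal sketch, assuming a hypothetical four-row sample in which two games have no release year:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two rows with a known Year, two without
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Freaky Flyers", "Inversion", "Know How 2"],
    "Year": [2006.0, np.nan, np.nan, 2010.0],
    "Global_Sales": [82.74, 0.01, 0.01, 0.01],
})

# Missing values per column, as a count and as a share of all rows
missing_count = sample.isnull().sum()
missing_share = sample.isnull().mean()

# dropna() does not change sample; keep the result in a new variable
cleaned = sample.dropna()

# Alternative: only drop rows where a specific column is missing
with_year = sample.dropna(subset=["Year"])

print(len(sample), len(cleaned))
print(missing_share)
```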


In [60]:
duplicate = data.duplicated()  # Boolean Series: True for each row that repeats an earlier row
duplicate

0        False
1        False
2        False
3        False
4        False
         ...  
16593    False
16594    False
16595    False
16596    False
16597    False
Length: 16598, dtype: bool
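
A long Boolean Series is hard to read, so it can help to count the duplicates instead. The sketch below assumes a hypothetical four-row sample with one repeated row; it is not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: the second row repeats the first one exactly
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Wii Sports", "Tetris", "Tetris"],
    "Platform": ["Wii", "Wii", "GB", "NES"],
})

# Number of rows that are exact copies of an earlier row
n_full_duplicates = int(sample.duplicated().sum())

# Number of rows whose Name already appeared, regardless of platform
n_name_duplicates = int(sample.duplicated(subset=["Name"]).sum())

# New DataFrame without the exact copies
deduplicated = sample.drop_duplicates()

print(n_full_duplicates, n_name_duplicates, len(deduplicated))
```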

In [45]:
data.shape #16598 observations and 10 columns

(16598, 10)

In [76]:
# Show all games with global sales above 15 million
data[data.Global_Sales > 15][['Platform', 'Global_Sales', 'Genre', 'Year', 'Name']]

Unnamed: 0,Platform,Global_Sales,Genre,Year,Name
0,Wii,82.74,Sports,2006.0,Wii Sports
1,NES,40.24,Platform,1985.0,Super Mario Bros.
2,Wii,35.82,Racing,2008.0,Mario Kart Wii
3,Wii,33.0,Sports,2009.0,Wii Sports Resort
4,GB,31.37,Role-Playing,1996.0,Pokemon Red/Pokemon Blue
5,GB,30.26,Puzzle,1989.0,Tetris
6,DS,30.01,Platform,2006.0,New Super Mario Bros.
7,Wii,29.02,Misc,2006.0,Wii Play
8,Wii,28.62,Platform,2009.0,New Super Mario Bros. Wii
9,NES,28.31,Shooter,1984.0,Duck Hunt


In [25]:
platform_sort = data.sort_values(by='Year')
platform_sort

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Global_Sales
6896,6898,Checkers,2600,1980.0,Misc,Atari,0.22,0.01,0.0,0.24
2669,2671,Boxing,2600,1980.0,Fighting,Activision,0.72,0.04,0.0,0.77
5366,5368,Freeway,2600,1980.0,Action,Activision,0.32,0.02,0.0,0.34
1969,1971,Defender,2600,1980.0,Misc,Atari,0.99,0.05,0.0,1.05
1766,1768,Kaboom!,2600,1980.0,Misc,Activision,1.07,0.07,0.0,1.15
...,...,...,...,...,...,...,...,...,...,...
16307,16310,Freaky Flyers,GC,,Racing,Unknown,0.01,0.00,0.0,0.01
16327,16330,Inversion,PC,,Shooter,Namco Bandai Games,0.01,0.00,0.0,0.01
16366,16369,Hakuouki: Shinsengumi Kitan,PS3,,Adventure,Unknown,0.01,0.00,0.0,0.01
16427,16430,Virtua Quest,GC,,Role-Playing,Unknown,0.01,0.00,0.0,0.01


There are too many unique values in my dataset to create diagrams from them directly. I counted 11493 unique values in Name, 12274 in Platform, and 14727 in Publisher.

This dataset was not a good choice, but I do not have time to choose another.
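
Unique-value counts like these can be checked with `nunique()`, and a column with many categories can still be plotted by keeping only its most frequent values. A minimal sketch, assuming a hypothetical five-row sample (the top-ranked rows of the dataset):

```python
import pandas as pd

# Hypothetical sample: the five top-ranked rows of the dataset
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
             "Wii Sports Resort", "Pokemon Red/Pokemon Blue"],
    "Platform": ["Wii", "NES", "Wii", "Wii", "GB"],
    "Publisher": ["Nintendo"] * 5,
})

# Number of distinct values in each column
unique_counts = sample[["Name", "Platform", "Publisher"]].nunique()

# Keep only the most frequent platforms so a chart stays readable
top_platforms = sample["Platform"].value_counts().head(2)

print(unique_counts)
print(top_platforms)
```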

In [3]:
index_district = data['Genre'].value_counts().index
# value_counts() sorts genres by number of games, so the index lists them from most to least common
print(index_district)

Index(['Action', 'Sports', 'Misc', 'Role-Playing', 'Shooter', 'Adventure',
       'Racing', 'Platform', 'Simulation', 'Fighting', 'Strategy', 'Puzzle'],
      dtype='object')


In [7]:
data.info(memory_usage='deep')
# memory_usage='deep' inspects the object columns to report their actual memory use

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Global_Sales  16598 non-null  float64
dtypes: float64(5), int64(1), object(4)
memory usage: 5.1 MB


In [8]:
for dtype in ['float','int','object']:
    selected_dtype = data.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))

Average memory usage for float columns: 0.11 MB
Average memory usage for int columns: 0.06 MB
Average memory usage for object columns: 0.87 MB


We examine the memory usage of different data types, starting with the value ranges that the smaller integer subtypes can store.

In [10]:

# Check the minimum and maximum values that each integer subtype can store
int_types = ["uint8", "int8", "int16"]
for it in int_types:
    print(np.iinfo(it))

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
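
These ranges suggest how columns could be stored more compactly. The sketch below is one possible approach, not something the notebook does: it assumes a hypothetical five-row sample, downcasts the integer column with `pd.to_numeric`, and converts a repetitive text column to the `category` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the five top-ranked rows of the dataset
sample = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Platform": ["Wii", "NES", "Wii", "Wii", "GB"],
})

optimized = sample.copy()

# Pick the smallest unsigned integer type that can hold every value
optimized["Rank"] = pd.to_numeric(sample["Rank"], downcast="unsigned")

# Store repeated strings once and refer to them by integer codes
optimized["Platform"] = sample["Platform"].astype("category")

print(optimized.dtypes)
print(sample.memory_usage(deep=True))
print(optimized.memory_usage(deep=True))
```

In this tiny sample Rank fits into uint8; in the full dataset the maximum Rank is 16600, which exceeds the int8 and uint8 ranges shown above. The memory saved by the category dtype depends on how often values repeat, so it is worth measuring on the real data.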



In [23]:
import plotly.graph_objects as go

fig = go.Figure([go.Scatter(x=platform_sort['Global_Sales'], y=platform_sort['Year'])])
fig.show()

This diagram shows in which years the games with the highest global sales were released (around 2005).
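
A per-year total can make such a reading easier to verify than a scatter of individual games. This is a minimal sketch assuming a hypothetical six-row sample; the peak year it finds applies to the sample only, not to the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: five top-ranked games plus one game without a Year
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
             "Wii Sports Resort", "Pokemon Red/Pokemon Blue", "Freaky Flyers"],
    "Year": [2006.0, 1985.0, 2008.0, 2009.0, 1996.0, np.nan],
    "Global_Sales": [82.74, 40.24, 35.82, 33.00, 31.37, 0.01],
})

# Total global sales per release year; rows without a Year are left out
sales_per_year = sample.groupby("Year")["Global_Sales"].sum()

# Year with the highest total in this sample
peak_year = int(sales_per_year.idxmax())

print(sales_per_year)
print(peak_year)
```

The resulting Series could then be passed to a bar chart with Year on the x-axis.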