# Benjamin Lavoie (benjaminlavoie02@gmail.com)

# Capstone project: Video game sales and ratings prediction

# Last update: February 23rd, 2024 (version 1.3)

My capstone project is about predicting video game sales and ratings.

It relies on 3 main types of data:
    1. Past game sales
    2. Past game ratings
    3. Game features, such as the number of players, the genre, and more.
    
I will start by inspecting the different datasets and making sure my main dataset is clean and
can be used properly.

## Table of Contents

**[1. Part 1 - Inspecting and choosing datasets](#heading--1)**

  * [1.1 - Dataset VG_Sales_All2](#heading--1-1)

  * [1.2 - Dataset Video_Games](#heading--1-2)
  
  * [1.3 - Dataset metacritic_games_master](#heading--1-3)
    
  * [1.4 - Dataset Tagged-Data-Final](#heading--1-4)
  
  * [1.5 - Dataset Cleaned Data 2](#heading--1-5)
  
  * [1.6 - Dataset opencritic_rankings_feb_2023](#heading--1-6)
  
  * [1.7 - Dataset vgsales](#heading--1-7)
  
  * [1.8 - Dataset all_video_games(cleaned)](#heading--1-8)
  
  * [1.9 - Dataset Raw Data](#heading--1-9)
  

**[2. Part 2 - Cleaning and joining datasets](#heading--2)**

  * [2.1 - Merging the 4 main datasets](#heading--2-1)


## [Next Steps](#heading--3)


<div id="heading--1"/>
<br>

# Part 1 - Inspecting and choosing datasets 

<br>

In [117]:
# importing libraries

import numpy as np
import pandas as pd
import glob
import os

In [118]:
# importing datasets, part 1: listing every CSV file in the DataSets folder

path = '../DataSets'
all_files = glob.glob(os.path.join(path, '*.csv'))

all_files

['../DataSets/Video_Games.csv',
 '../DataSets/metacritic_games_master.csv',
 '../DataSets/Tagged-Data-Final.csv',
 '../DataSets/Cleaned Data 2.csv',
 '../DataSets/opencritic_rankings_feb_2023.csv',
 '../DataSets/vgsales.csv',
 '../DataSets/all_video_games(cleaned).csv',
 '../DataSets/Raw Data.csv',
 '../DataSets/VG_Sales_All2.csv']

In [119]:
# importing datasets, part 2: putting all the datasets into dataframes
# the indices follow the order of all_files shown above (df1 is VG_Sales_All2, the last file)

df2 = pd.read_csv(all_files[0], index_col=None, header=0)
df3 = pd.read_csv(all_files[1], index_col=None, header=0)
df4 = pd.read_csv(all_files[2], index_col=None, header=0)
df5 = pd.read_csv(all_files[3], index_col=None, header=0)
df6 = pd.read_csv(all_files[4], index_col=None, header=0)
df7 = pd.read_csv(all_files[5], index_col=None, header=0)
df8 = pd.read_csv(all_files[6], index_col=None, header=0)
df9 = pd.read_csv(all_files[7], index_col=None, header=0)
df1 = pd.read_csv(all_files[8], index_col=None, header=0)
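
`glob.glob` does not guarantee any particular file order, so the positional indices above depend on how the operating system lists the folder. A minimal sketch of a more robust alternative, assuming the same folder layout, is to key each dataframe by its file name. The helper name `load_datasets` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_datasets(folder):
    """Read every CSV in `folder` into a dict keyed by file name (without .csv)."""
    csv_paths = sorted(Path(folder).glob("*.csv"))  # sorted for a stable order
    return {csv_path.stem: pd.read_csv(csv_path) for csv_path in csv_paths}
```

With this, `datasets = load_datasets('../DataSets')` followed by `df1 = datasets['VG_Sales_All2']` would pick the same file as `all_files[8]`, whatever order the files are listed in.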

<div id="heading--1-1"/>
<br>

# 1.1 - Dataset VG_Sales_All2
<br>

In [120]:
# I will check the first dataset.

display(df1.head())
display(df1.sample(20))

Unnamed: 0,Rank,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Year,Genre
0,30,Wii Sports,Wii,Nintendo,Nintendo EAD,41.36,29.02,3.77,8.51,82.65,2006.0,Sports
1,53,Mario Kart 8 Deluxe,NS,Nintendo,Nintendo EPD,5.05,4.98,2.11,0.91,13.05,2017.0,Racing
2,75,Animal Crossing: New Horizons,NS,Nintendo,Nintendo,,,,,,2020.0,Simulation
3,80,Super Mario Bros.,NES,Nintendo,Nintendo EAD,29.08,3.58,6.81,0.77,40.24,1985.0,Platform
4,81,Counter-Strike: Global Offensive,PC,Valve,Valve Corporation,,,,,,2012.0,Shooter


Unnamed: 0,Rank,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Year,Genre
42891,55918,Bugsnax,XS,Unknown,Young Horses,,,,,,,Adventure
23673,26662,BioShock 2: Protector Trials,PC,2K Games,2K Marin,,,,,,2011.0,Shooter
46793,60103,Oxenfree,NS,Night School Studio,Night School Studio,,,,,,2017.0,Adventure
5021,6374,Dance Dance Revolution: Mario Mix,GC,Nintendo,Konami,0.36,0.09,,0.01,0.47,2005.0,Simulation
22352,24558,All-Star Tennis '99,PS,Ubisoft,Smart Dog,,,,,,1998.0,Sports
48669,62136,System Shock (Remake),XOne,Nightdive Studios,Nightdive Studios,,,,,,2020.0,Role-Playing
10469,12047,Citadel: Forged with Fire,PC,Unknown,Blue Isle Studios,,,,,,,Role-Playing
27341,32443,Falling Skies: The Game,PC,Little Orbit,Little Orbit,,,,,,2016.0,Role-Playing
20761,22525,Cooking Idol I! My! Main! Game de Hirameki! Ki...,DS,Konami,Konami,,,0.0,,0.0,2010.0,Simulation
22113,24209,AI Mahjong Selection,PS,Hamster Corporation,i4,,,,,,2002.0,Misc


In [121]:
# quick check of the best-selling games

df1.sort_values('Global_Sales', ascending = False).head(40)

Unnamed: 0,Rank,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Year,Genre
0,30,Wii Sports,Wii,Nintendo,Nintendo EAD,41.36,29.02,3.77,8.51,82.65,2006.0,Sports
3,80,Super Mario Bros.,NES,Nintendo,Nintendo EAD,29.08,3.58,6.81,0.77,40.24,1985.0,Platform
5,86,Mario Kart Wii,Wii,Nintendo,Nintendo EAD,15.91,12.92,3.8,3.35,35.98,2008.0,Racing
9,98,Wii Sports Resort,Wii,Nintendo,Nintendo EAD,15.61,10.99,3.29,3.02,32.9,2009.0,Sports
11,105,Pokémon Red / Green / Blue Version,GB,Nintendo,Game Freak,11.27,8.89,10.22,1.0,31.37,1998.0,Role-Playing
7,93,Tetris,GB,Nintendo,Bullet Proof Software,23.2,2.26,4.22,0.58,30.26,1989.0,Puzzle
13,112,New Super Mario Bros.,DS,Nintendo,Nintendo EAD,11.28,9.19,6.5,2.89,29.85,2006.0,Platform
16,128,Wii Play,Wii,Nintendo,Nintendo EAD,13.96,9.18,2.93,2.85,28.92,2007.0,Misc
14,113,New Super Mario Bros. Wii,Wii,Nintendo,Nintendo EAD,14.53,7.01,4.7,2.27,28.51,2009.0,Platform
15,127,Duck Hunt,NES,Nintendo,Nintendo R&D1,26.93,0.63,0.28,0.47,28.31,1985.0,Shooter


In [122]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50334 entries, 0 to 50333
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          50334 non-null  int64  
 1   Name          50334 non-null  object 
 2   Platform      50334 non-null  object 
 3   Publisher     50334 non-null  object 
 4   Developer     50334 non-null  object 
 5   NA_Sales      13508 non-null  float64
 6   PAL_Sales     13857 non-null  float64
 7   JP_Sales      7632 non-null   float64
 8   Other_Sales   16189 non-null  float64
 9   Global_Sales  20100 non-null  float64
 10  Year          44256 non-null  float64
 11  Genre         50334 non-null  object 
dtypes: float64(6), int64(1), object(5)
memory usage: 4.6+ MB


In [123]:
# df1.dropna(subset=['Global_Sales'], inplace=True)
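
Before deciding whether to drop the rows without `Global_Sales`, it may help to see what share of each column is missing. The sketch below uses a small made-up frame (not the real data) to show the idea; `missing_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# toy data for illustration only, not taken from VG_Sales_All2
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Global_Sales": [1.2, np.nan, np.nan, 0.3],
    "JP_Sales": [np.nan, np.nan, np.nan, 0.1],
})
shares = missing_share(toy)
print(shares)
```

On the real data, `missing_share(df1)` should agree with the non-null counts in `df1.info()`: 20100 non-null `Global_Sales` values out of 50334 rows means roughly 60% are missing.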


In [125]:
# df1 (VG_Sales_All2), columns to keep:
# Name
# Genre (maybe)
# Platform
# Publisher
# all the sales columns (in millions)

df1.drop(['Year', 'Rank'], axis=1, inplace=True)


In [126]:
df1

Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre
0,Wii Sports,Wii,Nintendo,Nintendo EAD,41.36,29.02,3.77,8.51,82.65,Sports
1,Mario Kart 8 Deluxe,NS,Nintendo,Nintendo EPD,5.05,4.98,2.11,0.91,13.05,Racing
2,Animal Crossing: New Horizons,NS,Nintendo,Nintendo,,,,,,Simulation
3,Super Mario Bros.,NES,Nintendo,Nintendo EAD,29.08,3.58,6.81,0.77,40.24,Platform
4,Counter-Strike: Global Offensive,PC,Valve,Valve Corporation,,,,,,Shooter
...,...,...,...,...,...,...,...,...,...,...
50329,Zombieland: Double Tap - Road Trip,PC,GameMill Entertainment,High Voltage Software,,,,,,Shooter
50330,Zombillie,NS,Forever Entertainment S.A.,Forever Entertainment S.A.,,,,,,Puzzle
50331,Zone of the Enders: The 2nd Runner MARS,PC,Konami,Cygames,,,,,,Simulation
50332,Zoo Tycoon: Ultimate Animal Collection,XOne,Microsoft Studios,Frontier Developments,,,,,,Simulation


In [127]:
df1.head(20)

Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre
0,Wii Sports,Wii,Nintendo,Nintendo EAD,41.36,29.02,3.77,8.51,82.65,Sports
1,Mario Kart 8 Deluxe,NS,Nintendo,Nintendo EPD,5.05,4.98,2.11,0.91,13.05,Racing
2,Animal Crossing: New Horizons,NS,Nintendo,Nintendo,,,,,,Simulation
3,Super Mario Bros.,NES,Nintendo,Nintendo EAD,29.08,3.58,6.81,0.77,40.24,Platform
4,Counter-Strike: Global Offensive,PC,Valve,Valve Corporation,,,,,,Shooter
5,Mario Kart Wii,Wii,Nintendo,Nintendo EAD,15.91,12.92,3.8,3.35,35.98,Racing
6,PLAYERUNKNOWN'S BATTLEGROUNDS,PC,PUBG Corporation,PUBG Corporation,,,,,,Shooter
7,Tetris,GB,Nintendo,Bullet Proof Software,23.2,2.26,4.22,0.58,30.26,Puzzle
8,Minecraft,PC,Mojang,Mojang AB,,,,,,Misc
9,Wii Sports Resort,Wii,Nintendo,Nintendo EAD,15.61,10.99,3.29,3.02,32.9,Sports


<div id="heading--1-2"/>
<br>

# 1.2 - Dataset Video_Games
<br>

In [128]:

display(df2.head())
display(df2.sample(20))

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
15166,Hyakka Hyakurou: Sengoku Ninpoujou,PSV,2016.0,Adventure,D3Publisher,0.0,0.0,0.02,0.0,0.02,,,,,,
6698,Avatar: The Last Airbender - The Burning Earth,Wii,2007.0,Action,THQ,0.23,0.0,0.0,0.02,0.25,,,,,,
8628,Watchmen: The End is Nigh - The Complete Exper...,PS3,2009.0,Action,Warner Bros. Interactive Entertainment,0.08,0.06,0.0,0.02,0.16,,,,,,
12512,Alien Syndrome,Wii,2007.0,Role-Playing,Sega,0.05,0.0,0.0,0.0,0.06,48.0,24.0,7.1,23.0,Totally Games,T
6277,Mass Effect Trilogy,PC,2012.0,Action,Electronic Arts,0.09,0.16,0.0,0.02,0.27,,,8,60.0,BioWare,M
7099,Bass Pro Shops: The Hunt,Wii,2010.0,Sports,XS Games,0.21,0.0,0.0,0.01,0.23,,,tbd,,Griffin International,T
10397,Hysteria Hospital: Emergency Ward,DS,2009.0,Action,Oxygen Interactive,0.1,0.0,0.0,0.01,0.11,45.0,12.0,tbd,,Gameinvest,E
1717,Super Ghouls 'n Ghosts,SNES,1991.0,Platform,Capcom,0.5,0.14,0.52,0.02,1.18,,,,,,
9364,Divinity II: The Dragon Knight Saga,X360,2010.0,Role-Playing,Focus Home Interactive,0.11,0.02,0.0,0.01,0.13,72.0,13.0,7.8,50.0,Larian Studios,M
6517,Summer Sports 2: Island Sports Party,Wii,2008.0,Sports,Ubisoft,0.24,0.0,0.0,0.02,0.26,,,tbd,,Destineer,E


In [129]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


In [130]:
# df2 (Video_Games):
# I would have kept Rating or Developer, but too many values are missing.
# I won't keep anything from this dataset.

<div id="heading--1-3"/>
<br>

# 1.3 - Dataset metacritic_games_master
<br>

In [131]:
display(df3.head())
display(df3.sample(20))

Unnamed: 0.1,Unnamed: 0,title,release_date,genre,platforms,developer,esrb_rating,ESRBs,metascore,userscore,critic_reviews,user_reviews,num_players
0,113,Pushmo,08-Dec-11,"Miscellaneous, Puzzle, Action, Puzzle, Action",3DS,Intelligent Systems,E,,90,8.3,31,215.0,1 Player
1,163,The Legend of Zelda: Majora's Mask 3D,13-Feb-15,"Fantasy, Action Adventure, Open-World",3DS,GREZZO,E10+,,89,8.9,82,781.0,1 Player
2,279,The Legend of Zelda: Ocarina of Time 3D,19-Jun-11,"Miscellaneous, Fantasy, Fantasy, Compilation, ...",3DS,GREZZO,E10+,Animated Blood Fantasy Violence Suggestive Themes,94,9.0,85,1780.0,1 Player
3,380,The Legend of Zelda: A Link Between Worlds,22-Nov-13,"Action RPG, Role-Playing, Action Adventure, Ge...",3DS,Nintendo,E,,91,9.0,81,1603.0,1 Player
4,417,Colors! 3D,05-Apr-12,"Miscellaneous, General, General, Application",3DS,Collecting Smiles,E,,89,7.5,15,66.0,1-2 Players


Unnamed: 0.1,Unnamed: 0,title,release_date,genre,platforms,developer,esrb_rating,ESRBs,metascore,userscore,critic_reviews,user_reviews,num_players
10397,3027,SteamWorld Dig,18-Mar-14,"Action, Action Adventure, General, Platformer,...",PS4,Image & Form,E10+,,82,7.3,18,194.0,1 Player
11781,14595,Arizona Sunshine,05-Jul-17,"Action, Shooter, First-Person, Light Gun, Arcade",PS4,Jaywalkers Interactive,M,,63,6.3,31,59.0,Up to 4 Players
15064,13719,Dragon Quest Swords: The Masked Queen and the ...,19-Feb-08,"Role-Playing, First-Person, First-Person, Japa...",Wii,"Eighting, Genius Sonority Inc.",T,Fantasy Violence Mild Suggestive Themes Use of...,65,6.1,43,28.0,1 Player
9130,3478,SoulCalibur V,31-Jan-12,"Action, Fighting, Fighting, 3D, 3D",PS3,Project Soul,T,,81,6.6,53,195.0,2 Players
955,13820,Spore Creatures,07-Sep-08,"Strategy, Breeding/Constructing, General, Bree...",DS,Griptonite Games,E,Comic Mischief Mild Cartoon Violence,65,6.8,20,25.0,1 Player
4152,6813,Herald: An Interactive Period Drama,22-Feb-17,"Adventure, 3D, First-Person",PC,"Humble Bundle, Humble Games, Endlessfluff Games",,,77,5.9,13,16.0,1 Player
3180,3500,Far Cry 3: Blood Dragon,01-May-13,"Action, Shooter, Shooter, First-Person, Modern...",PC,Ubisoft Montreal,M,Blood Nudity Sexual Content Strong Language Vi...,81,8.1,31,990.0,1 Player
12450,3568,Soul Sacrifice Delta,13-May-14,"Role-Playing, Action RPG",PSV,SCE Japan Studio,M,,82,8.5,24,149.0,Up to 4 Players
9432,8484,Dead or Alive 5,25-Sep-12,"Action, Fighting, Fighting, 3D, 3D",PS3,Team Ninja,M,Partial Nudity Sexual Themes Violence,74,7.7,34,120.0,Up to 16 Players
3650,5082,Penny Arcade Adventures: Episode Two,07-Nov-08,"Adventure, General, General",PC,Hothead Games,M,,80,7.4,7,23.0,1 Player


In [132]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19317 entries, 0 to 19316
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      19317 non-null  int64  
 1   title           19317 non-null  object 
 2   release_date    19317 non-null  object 
 3   genre           19317 non-null  object 
 4   platforms       19316 non-null  object 
 5   developer       19298 non-null  object 
 6   esrb_rating     17202 non-null  object 
 7   ESRBs           7855 non-null   object 
 8   metascore       19317 non-null  int64  
 9   userscore       19317 non-null  object 
 10  critic_reviews  19317 non-null  int64  
 11  user_reviews    17953 non-null  float64
 12  num_players     19304 non-null  object 
dtypes: float64(1), int64(3), object(9)
memory usage: 1.9+ MB


In [133]:
df3.describe()

Unnamed: 0.1,Unnamed: 0,metascore,critic_reviews,user_reviews
count,19317.0,19317.0,19317.0,17953.0
mean,9658.254077,70.626961,22.939173,204.702947
std,5576.810188,12.248404,17.323601,1431.175394
min,0.0,11.0,6.0,4.0
25%,4829.0,64.0,10.0,14.0
50%,9658.0,72.0,17.0,34.0
75%,14488.0,79.0,30.0,105.0
max,19317.0,99.0,127.0,158410.0


In [134]:
# df3 (metacritic_games_master), columns to keep:
# release_date
# developer (some missing values, still useful)
# esrb_rating (some missing values, still useful)
# metascore
# userscore
# critic_reviews
# user_reviews
# num_players (some missing values, still useful)

df3.drop(['Unnamed: 0', 'genre', 'ESRBs'], axis=1, inplace=True)
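
Two of the kept columns are stored as text: `release_date` (for example `08-Dec-11`) and `num_players` (for example `Up to 4 Players`). A possible way to make them usable as features is sketched below on a few values copied from the output above. The format string `%d-%b-%y` and the helper `max_players` are assumptions for illustration, not steps from the original notebook.

```python
import re

import numpy as np
import pandas as pd


def max_players(text):
    """Return the largest number mentioned in a num_players string, or NaN."""
    if not isinstance(text, str):
        return np.nan
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    return max(numbers) if numbers else np.nan


sample = pd.DataFrame({
    "release_date": ["08-Dec-11", "13-Feb-15"],
    "num_players": ["1-2 Players", "No Online Multiplayer Online Multiplayer"],
})

# parse day-month-two-digit-year strings into datetimes
sample["release_date"] = pd.to_datetime(sample["release_date"], format="%d-%b-%y")
# keep only the highest player count mentioned in the text
sample["max_players"] = sample["num_players"].apply(max_players)
```

Strings without any digit, such as the last `num_players` value shown in the output above, end up as `NaN` and would still need a manual decision.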

In [135]:
df3

Unnamed: 0,title,release_date,platforms,developer,esrb_rating,metascore,userscore,critic_reviews,user_reviews,num_players
0,Pushmo,08-Dec-11,3DS,Intelligent Systems,E,90,8.3,31,215.0,1 Player
1,The Legend of Zelda: Majora's Mask 3D,13-Feb-15,3DS,GREZZO,E10+,89,8.9,82,781.0,1 Player
2,The Legend of Zelda: Ocarina of Time 3D,19-Jun-11,3DS,GREZZO,E10+,94,9,85,1780.0,1 Player
3,The Legend of Zelda: A Link Between Worlds,22-Nov-13,3DS,Nintendo,E,91,9,81,1603.0,1 Player
4,Colors! 3D,05-Apr-12,3DS,Collecting Smiles,E,89,7.5,15,66.0,1-2 Players
...,...,...,...,...,...,...,...,...,...,...
19312,Necromunda: Hired Gun,01-Jun-21,XS,Focus Home Interactive,M,56,5.3,11,10.0,1 Player
19313,Grand Theft Auto: The Trilogy - The Definitive...,11-Nov-21,XS,"Rockstar Games, Grove Street Games",M,56,0.7,11,1124.0,1 Player
19314,Bright Memory,10-Nov-20,XS,FYQD Personal Studio,,55,4.2,31,62.0,1 Player
19315,Balan Wonderworld,26-Mar-21,XS,"Square Enix, Arzest, Balan Company",E10+,47,7.2,11,162.0,No Online Multiplayer Online Multiplayer


<div id="heading--1-4"/>
<br>

# 1.4 - Dataset Tagged-Data-Final
<br>

In [136]:
display(df4.head())
display(df4.sample(20))

Unnamed: 0,Name,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,Story Focus,Gameplay Focus,Series
0,.hack//Infection Part 1,2002.0,Role-Playing,Atari,0.49,0.38,0.26,0.13,1.27,75.0,35.0,8.5,60.0,CyberConnect2,T,x,,x
1,.hack//Mutation Part 2,2002.0,Role-Playing,Atari,0.23,0.18,0.2,0.06,0.68,76.0,24.0,8.9,81.0,CyberConnect2,T,x,,x
2,.hack//Outbreak Part 3,2002.0,Role-Playing,Atari,0.14,0.11,0.17,0.04,0.46,70.0,23.0,8.7,19.0,CyberConnect2,T,x,,x
3,[Prototype],2009.0,Action,Activision,0.84,0.35,0.0,0.12,1.31,78.0,83.0,7.8,356.0,Radical Entertainment,M,,x,x
4,[Prototype],2009.0,Action,Activision,0.65,0.4,0.0,0.19,1.24,79.0,53.0,7.7,308.0,Radical Entertainment,M,,x,x


Unnamed: 0,Name,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,Story Focus,Gameplay Focus,Series
2110,F-Zero: GP Legend,2003.0,Racing,Nintendo,0.11,0.04,0.0,0.0,0.16,77.0,31.0,8.5,18.0,Suzak,E,,x,
700,Burnout Revenge,2005.0,Racing,Electronic Arts,0.75,0.03,0.0,0.12,0.9,90.0,52.0,8.8,114.0,Criterion Games,E10+,,x,
4704,Rise of Nations: Rise of Legends,2006.0,Strategy,Microsoft Game Studios,0.0,0.03,0.0,0.01,0.03,84.0,46.0,8.5,108.0,Big Huge Games,T,,x,
150,Alter Ego,1985.0,Simulation,Activision,0.0,0.03,0.0,0.01,0.03,59.0,9.0,5.8,19.0,"Viva Media, Viva Media, LLC",T,,x,
4160,Operation Darkness,2007.0,Role-Playing,Success,0.07,0.0,0.0,0.01,0.07,46.0,22.0,7.0,21.0,Success,M,,x,
1610,EA Sports MMA,2010.0,Fighting,Electronic Arts,0.23,0.09,0.0,0.02,0.35,79.0,63.0,7.0,55.0,EA Tiburon,T,,x,
5335,Spyro: Enter the Dragonfly,2002.0,Platform,Universal Interactive,0.55,0.14,0.0,0.02,0.71,48.0,13.0,5.1,24.0,Equinoxe,E,,x,
1746,F1 2010,2010.0,Racing,Codemasters,0.0,0.05,0.0,0.01,0.05,84.0,15.0,6.6,99.0,Codemasters Birmingham,E,,x,
3639,Mytran Wars,2009.0,Strategy,Deep Silver,0.08,0.02,0.0,0.02,0.12,68.0,16.0,8.8,4.0,Stormregion,T,,x,
6117,Time Crisis: Razing Storm,2010.0,Shooter,Namco Bandai Games,0.18,0.2,0.07,0.08,0.54,58.0,49.0,5.9,15.0,Nex Entertainment,T,,x,


In [137]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6894 entries, 0 to 6893
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             6894 non-null   object 
 1   Year_of_Release  6894 non-null   float64
 2   Genre            6894 non-null   object 
 3   Publisher        6893 non-null   object 
 4   NA_Sales         6894 non-null   float64
 5   EU_Sales         6894 non-null   float64
 6   JP_Sales         6894 non-null   float64
 7   Other_Sales      6894 non-null   float64
 8   Global_Sales     6894 non-null   float64
 9   Critic_Score     6894 non-null   float64
 10  Critic_Count     6894 non-null   float64
 11  User_Score       6894 non-null   float64
 12  User_Count       6894 non-null   float64
 13  Developer        6890 non-null   object 
 14  Rating           6826 non-null   object 
 15  Story Focus      767 non-null    object 
 16  Gameplay Focus   6586 non-null   object 
 17  Series        

In [138]:
df4.describe()

Unnamed: 0,Year_of_Release,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count
count,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0
mean,2007.482303,0.39092,0.234517,0.063867,0.082,0.771487,70.258486,28.842472,7.184378,174.39237
std,4.236401,0.963231,0.684214,0.286461,0.26862,1.95478,13.861082,19.194572,1.439806,584.872155
min,1985.0,0.0,0.0,0.0,0.0,0.01,13.0,3.0,0.5,4.0
25%,2004.0,0.06,0.02,0.0,0.01,0.11,62.0,14.0,6.5,11.0
50%,2007.0,0.15,0.06,0.0,0.02,0.29,72.0,24.0,7.5,27.0
75%,2011.0,0.39,0.21,0.01,0.07,0.75,80.0,39.0,8.2,89.0
max,2016.0,41.36,28.96,6.5,10.57,82.53,98.0,113.0,9.6,10665.0


In [139]:
# df4 (Tagged-Data-Final), columns to keep:
# Story Focus / Gameplay Focus (very interesting)
# Series (whether the game is part of a series)

df4.drop(['Year_of_Release', 'Publisher', 'Genre', 'NA_Sales', 'EU_Sales',
          'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Score',
         'Critic_Count', 'User_Score', 'User_Count', 'Developer', 'Rating'], axis=1, inplace=True)

In [140]:
df4.fillna(0, inplace=True)
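
Filling the gaps with `0` leaves the three tag columns holding a mix of the string `x` and the integer `0`. If true/false flags are easier to work with later, one option (an alternative to the `fillna` call, shown here on made-up rows) is to compare each column against `'x'`:

```python
import pandas as pd

# made-up rows that mimic the tag columns: 'x' marks a tag, missing means no tag
flags = pd.DataFrame({
    "Name": ["Game A", "Game B"],
    "Story Focus": ["x", None],
    "Gameplay Focus": [None, "x"],
    "Series": ["x", None],
})

flag_cols = ["Story Focus", "Gameplay Focus", "Series"]
# True where the cell is exactly 'x', False everywhere else (including missing)
flags[flag_cols] = flags[flag_cols].eq("x")
```

Boolean columns can be summed or averaged directly, which makes it easy to count how many games carry each tag.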

In [141]:
df4.loc[df4['Name'] == 'Grand Theft Auto V']

Unnamed: 0,Name,Story Focus,Gameplay Focus,Series
2255,Grand Theft Auto V,x,x,0
2256,Grand Theft Auto V,x,x,0
2257,Grand Theft Auto V,x,x,0
2258,Grand Theft Auto V,x,x,0
2259,Grand Theft Auto V,x,x,0


In [142]:
df4.drop_duplicates(inplace=True)

In [143]:
df4.duplicated().sum()

0
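
`drop_duplicates` only removes rows that are identical in every column, so a game could still appear more than once if its tags differ between rows. A quick check, sketched here on made-up rows, is to look for names that remain duplicated:

```python
import pandas as pd

# made-up rows: Game A is an exact duplicate, Game B has conflicting tags
tags = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B", "Game B"],
    "Story Focus": ["x", "x", 0, "x"],
    "Series": ["x", "x", 0, 0],
}).drop_duplicates()

# rows whose Name still occurs more than once after dropping exact duplicates
conflicts = tags[tags.duplicated(subset="Name", keep=False)]
print(conflicts)
```

An empty `conflicts` frame would mean every name is unique, which matters if `Name` is later used as a join key.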

In [144]:
df4

Unnamed: 0,Name,Story Focus,Gameplay Focus,Series
0,.hack//Infection Part 1,x,0,x
1,.hack//Mutation Part 2,x,0,x
2,.hack//Outbreak Part 3,x,0,x
3,[Prototype],0,x,x
5,[Prototype 2],0,x,x
...,...,...,...,...
6889,Zubo,0,x,0
6890,Zumba Fitness,0,x,0
6891,Zumba Fitness: World Party,0,x,0
6892,Zumba Fitness Core,0,x,0


<div id="heading--1-5"/>
<br>

# 1.5 - Dataset Cleaned Data 2
<br>

In [145]:
display(df5.head())
display(df5.sample(20))

Unnamed: 0,Name,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,.hack//Infection Part 1,2002,Role-Playing,Atari,0.49,0.38,0.26,0.13,1.27,75,35,8.5,60,CyberConnect2,T
1,.hack//Mutation Part 2,2002,Role-Playing,Atari,0.23,0.18,0.2,0.06,0.68,76,24,8.9,81,CyberConnect2,T
2,.hack//Outbreak Part 3,2002,Role-Playing,Atari,0.14,0.11,0.17,0.04,0.46,70,23,8.7,19,CyberConnect2,T
3,[Prototype],2009,Action,Activision,0.84,0.35,0.0,0.12,1.31,78,83,7.8,356,Radical Entertainment,M
4,[Prototype],2009,Action,Activision,0.65,0.4,0.0,0.19,1.24,79,53,7.7,308,Radical Entertainment,M


Unnamed: 0,Name,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
2681,Jimmie Johnson's Anything With an Engine,2011,Racing,Konami Digital Entertainment,0.07,0.0,0.0,0.01,0.08,67,5,6.8,4,Isopod Labs,E10+
4893,Sébastien Loeb Rally Evo,2016,Racing,Milestone S.r.l,0.0,0.04,0.0,0.01,0.04,71,28,8.1,73,Milestone S.r.l,E
3027,Lost Horizon,2010,Adventure,Deep Silver,0.0,0.1,0.0,0.02,0.11,77,24,7.9,27,Animation Arts,T
2092,Fugitive Hunter: War on Terror,2003,Shooter,Play It,0.07,0.06,0.0,0.02,0.15,35,15,5.0,23,Black Ops Entertainment,M
2343,Guitar Hero 5,2009,Misc,Activision,0.65,0.37,0.0,0.11,1.13,85,69,6.8,108,Neversoft Entertainment,T
1214,Deception IV: Blood Ties,2014,Action,Tecmo Koei,0.02,0.02,0.07,0.01,0.13,67,19,7.5,61,Koei Tecmo Games,M
4878,Scarface: Money. Power. Respect.,2006,Strategy,Vivendi Games,0.15,0.0,0.0,0.01,0.16,58,12,3.1,15,Radical Entertainment,M
3909,Need for Speed: Most Wanted,2012,Racing,Electronic Arts,0.0,0.06,0.0,0.02,0.08,82,19,8.5,525,Black Box,T
5348,SSX Blur,2007,Sports,Electronic Arts,0.29,0.01,0.0,0.03,0.33,74,52,8.1,60,EA Montreal,E
4763,RollerCoaster Tycoon,2003,Strategy,Atari,0.1,0.03,0.0,0.0,0.13,62,8,8.3,12,Atari,E


In [146]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6894 entries, 0 to 6893
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             6894 non-null   object 
 1   Year_of_Release  6894 non-null   int64  
 2   Genre            6894 non-null   object 
 3   Publisher        6893 non-null   object 
 4   NA_Sales         6894 non-null   float64
 5   EU_Sales         6894 non-null   float64
 6   JP_Sales         6894 non-null   float64
 7   Other_Sales      6894 non-null   float64
 8   Global_Sales     6894 non-null   float64
 9   Critic_Score     6894 non-null   int64  
 10  Critic_Count     6894 non-null   int64  
 11  User_Score       6894 non-null   float64
 12  User_Count       6894 non-null   int64  
 13  Developer        6890 non-null   object 
 14  Rating           6826 non-null   object 
dtypes: float64(6), int64(4), object(5)
memory usage: 808.0+ KB


In [147]:
df5.describe()

Unnamed: 0,Year_of_Release,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count
count,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0,6894.0
mean,2007.482303,0.39092,0.234517,0.063867,0.082,0.771487,70.258486,28.842472,7.184378,174.39237
std,4.236401,0.963231,0.684214,0.286461,0.26862,1.95478,13.861082,19.194572,1.439806,584.872155
min,1985.0,0.0,0.0,0.0,0.0,0.01,13.0,3.0,0.5,4.0
25%,2004.0,0.06,0.02,0.0,0.01,0.11,62.0,14.0,6.5,11.0
50%,2007.0,0.15,0.06,0.0,0.02,0.29,72.0,24.0,7.5,27.0
75%,2011.0,0.39,0.21,0.01,0.07,0.75,80.0,39.0,8.2,89.0
max,2016.0,41.36,28.96,6.5,10.57,82.53,98.0,113.0,9.6,10665.0


In [148]:
# df5, Cleaned Data 2, columns to keep:
# No columns are useful as of now

<div id="heading--1-6"/>
<br>

# 1.6 - Dataset opencritic_rankings_feb_2023
<br>

In [149]:
display(df6.head())
display(df6.sample(20))

Unnamed: 0,title,score,opencritic_classification,platforms,release_date,url
0,Super Mario Odyssey,97,Mighty,Switch,"Oct 27, 2017",https://opencritic.com/game/4504/super-mario-o...
1,The Legend of Zelda: Breath of the Wild,96,Mighty,"Wii-U, Switch","Mar 3, 2017",https://opencritic.com/game/1548/the-legend-of...
2,Red Dead Redemption 2,96,Mighty,"PS4, XB1, Stadia, PC, XBXS, PS5","Oct 26, 2018",https://opencritic.com/game/3717/red-dead-rede...
3,Elden Ring,95,Mighty,"PC, XBXS, PS5, XB1, PS4","Feb 25, 2022",https://opencritic.com/game/12090/elden-ring
4,Metroid Prime Remastered,95,Mighty,Switch,"Feb 8, 2023",https://opencritic.com/game/14280/metroid-prim...


Unnamed: 0,title,score,opencritic_classification,platforms,release_date,url
9593,Tardy,,,"PC, Switch","Mar 8, 2018",https://opencritic.com/game/7510/tardy
10153,Areia: Pathway to Dawn,,,PC,"Jan 17, 2020",https://opencritic.com/game/8889/areia-pathway...
8166,SQUAKE,,,PC,"Feb 1, 2017",https://opencritic.com/game/4020/squake
8365,Walkerman Act 1,,,PC,"May 22, 2017",https://opencritic.com/game/4542/walkerman-act-1
11440,Mayhem in Single Valley,,,PC,"May 20, 2021",https://opencritic.com/game/11443/mayhem-in-si...
3043,Chronicles of Teddy: Harmony of Exidus,74.0,Fair,"PS4, Wii-U, PS5","Mar 29, 2016",https://opencritic.com/game/1924/chronicles-of...
2593,Dying Light 2 Stay Human,76.0,Strong,"Switch, PC, XBXS, PS5, PS4, XB1","Feb 4, 2022",https://opencritic.com/game/12340/dying-light-...
10475,Indiecalypse,,,"Switch, PC, XB1, XBXS","May 29, 2020",https://opencritic.com/game/9591/indiecalypse
2256,Rescue Party: Live!,77.0,Strong,PC,"Jan 13, 2022",https://opencritic.com/game/12594/rescue-party...
9037,Battle Fleet: Ground Assault,,,PC,"May 1, 2018",https://opencritic.com/game/6069/battle-fleet-...


In [150]:
df6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13111 entries, 0 to 13110
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   title                      13110 non-null  object
 1   score                      13111 non-null  object
 2   opencritic_classification  7318 non-null   object
 3   platforms                  13111 non-null  object
 4   release_date               13111 non-null  object
 5   url                        13111 non-null  object
dtypes: object(6)
memory usage: 614.7+ KB


In [151]:
df6.describe()

Unnamed: 0,title,score,opencritic_classification,platforms,release_date,url
count,13110,13111.0,7318,13111,13111,13111
unique,13109,81.0,4,682,2640,13111
top,The,,Strong,PC,"Oct 13, 2016",https://opencritic.com/game/4504/super-mario-o...
freq,2,5793.0,2340,4670,37,1


In [152]:
# df6 (opencritic_rankings_feb_2023), columns to keep:
# score
# opencritic_classification

df6.drop(['release_date', 'url', 'platforms'], axis=1, inplace=True)
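
`df6.info()` reports `score` as an `object` column with no nulls, yet `describe()` shows an empty value as the most frequent entry (5793 rows), so the missing scores appear to be stored as blank text. A hedged sketch of converting such a column to numbers, using made-up values rather than the real file:

```python
import pandas as pd

# made-up values: two real scores, one empty string, one whitespace-only string
scores = pd.Series(["97", "96", "", " "], name="score")

# anything that cannot be parsed as a number becomes NaN
numeric_scores = pd.to_numeric(scores, errors="coerce")
```

After a conversion like this, `isna()` counts the games without a score directly, and the column can be used in numeric summaries.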

In [153]:
df6

Unnamed: 0,title,score,opencritic_classification
0,Super Mario Odyssey,97,Mighty
1,The Legend of Zelda: Breath of the Wild,96,Mighty
2,Red Dead Redemption 2,96,Mighty
3,Elden Ring,95,Mighty
4,Metroid Prime Remastered,95,Mighty
...,...,...,...
13106,The Settlers: New Allies,,
13107,Chef Life: A Restaurant Simulator,,
13108,Aces & Adventures,,
13109,Planet Cube: Edge,,


<div id="heading--1-7"/>
<br>

# 1.7 - Dataset vgsales
<br>

In [154]:
display(df7.head())
display(df7.sample(20))

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1562,1564,Jillian Michaels' Fitness Ultimatum 2009,Wii,2008.0,Sports,Deep Silver,0.96,0.2,0.0,0.11,1.27
4057,4059,Taiko no Tatsujin: Appare Sandaime,PS2,2003.0,Misc,Namco Bandai Games,0.0,0.0,0.49,0.0,0.49
4836,4838,MLB 2006,PS2,2005.0,Sports,Sony Computer Entertainment,0.33,0.01,0.0,0.05,0.4
3235,3237,Namco Museum,GC,2002.0,Misc,Namco Bandai Games,0.48,0.13,0.0,0.02,0.63
11543,11545,Rock Revolution,Wii,,Misc,Unknown,0.07,0.0,0.0,0.01,0.08
14595,14598,Drome Racers,GC,2003.0,Racing,Electronic Arts,0.02,0.01,0.0,0.0,0.03
13331,13333,Magic Carpet,PS,1995.0,Shooter,Electronic Arts,0.03,0.02,0.0,0.0,0.05
15544,15547,Jinsei Game DS,DS,2006.0,Misc,Atlus,0.0,0.0,0.02,0.0,0.02
11616,11618,Auto Destruct,PS,1998.0,Action,Electronic Arts,0.04,0.03,0.0,0.01,0.08
13608,13610,Scaler,XB,2004.0,Platform,Global Star,0.03,0.01,0.0,0.0,0.04


In [155]:
df7.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


In [156]:
# df7 is the same dataset as df1, so I will not use it

<div id="heading--1-8"/>
<br>

# 1.8 - Dataset all_video_games(cleaned)
<br>

In [157]:
display(df8.head())
display(df8.sample(20))

Unnamed: 0,Title,Release Date,Developer,Publisher,Genres,Genres Splitted,Product Rating,User Score,User Ratings Count,Platforms Info
0,Ziggurat (2012),2/17/2012,Action Button Entertainment,Freshuu Inc.,Action,['Action'],,6.9,14.0,"[{'Platform': 'iOS (iPhone/iPad)', 'Platform M..."
1,4X4 EVO 2,11/15/2001,Terminal Reality,Gathering,Auto Racing Sim,"['Auto', 'Racing', 'Sim']",Rated E For Everyone,,,"[{'Platform': 'Xbox', 'Platform Metascore': '5..."
2,MotoGP 2 (2001),1/22/2002,Namco,Namco,Auto Racing Sim,"['Auto', 'Racing', 'Sim']",Rated E For Everyone,5.8,,"[{'Platform': 'PlayStation 2', 'Platform Metas..."
3,Gothic 3,11/14/2006,Piranha Bytes,Aspyr,Western RPG,"['Western', 'RPG']",Rated T For Teen,7.5,832.0,"[{'Platform': 'PC', 'Platform Metascore': '63'..."
4,Siege Survival: Gloria Victis,5/18/2021,FishTankStudio,Black Eye Games,RPG,['RPG'],,6.5,10.0,"[{'Platform': 'PC', 'Platform Metascore': '69'..."


Unnamed: 0,Title,Release Date,Developer,Publisher,Genres,Genres Splitted,Product Rating,User Score,User Ratings Count,Platforms Info
3617,Where the Water Tastes Like Wine,2/28/2018,Dim Bulb Games,Dim Bulb Games,Adventure,['Adventure'],,5.3,37.0,"[{'Platform': 'PC', 'Platform Metascore': '74'..."
1921,Puzzle Quest: Galactrix,2/24/2009,Infinite Interactive,D3Publisher,Matching Puzzle,"['Matching', 'Puzzle']",Rated E +10 For Everyone +10,7.3,,"[{'Platform': 'Xbox 360', 'Platform Metascore'..."
2273,Maskmaker,4/20/2021,Innerspace VR,MWM Interactive,First-Person Adventure,"['First-Person', 'Adventure']",,,,"[{'Platform': 'PC', 'Platform Metascore': '75'..."
13412,PAIN: Amusement Park,9/11/2008,Idol Minds,SCEA,Action,['Action'],Rated T For Teen,,,"[{'Platform': 'PlayStation 3', 'Platform Metas..."
2885,Patrick's Parabox,3/29/2022,Patrick Traynor,Patrick Traynor,Action Puzzle,"['Action', 'Puzzle']",,8.6,12.0,"[{'Platform': 'PC', 'Platform Metascore': '84'..."
239,MechWarrior 5: Mercenaries,12/10/2019,Piranha Games,Piranha Games,Vehicle Combat Sim,"['Vehicle', 'Combat', 'Sim']",,5.4,296.0,"[{'Platform': 'PC', 'Platform Metascore': '73'..."
10135,Mario Party 9,3/11/2012,Nd Cube,Nintendo,Party,['Party'],Rated E For Everyone,6.8,262.0,"[{'Platform': 'Wii', 'Platform Metascore': '73..."
1724,Phoenix Point: Behemoth Edition,10/1/2021,Snapshot Games Inc.,Prime Matter,Turn-Based Tactics,"['Turn-Based', 'Tactics']",Rated T For Teen,6.1,8.0,"[{'Platform': 'Xbox One', 'Platform Metascore'..."
13074,Dungeons & Dragons: Dragonshard,10/2/2005,Liquid Entertainment,Atari SA,Real-Time Strategy,"['Real-Time', 'Strategy']",Rated T For Teen,7.8,38.0,"[{'Platform': 'PC', 'Platform Metascore': '80'..."
294,Pool Paradise,6/28/2004,Awesome Developments,Ignition Entertainment,Billiards,['Billiards'],Rated E For Everyone,,,"[{'Platform': 'PC', 'Platform Metascore': '76'..."


In [158]:
df8.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14055 entries, 0 to 14054
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Title               14034 non-null  object 
 1   Release Date        13991 non-null  object 
 2   Developer           13917 non-null  object 
 3   Publisher           13917 non-null  object 
 4   Genres              14034 non-null  object 
 5   Genres Splitted     14034 non-null  object 
 6   Product Rating      11005 non-null  object 
 7   User Score          11714 non-null  float64
 8   User Ratings Count  11299 non-null  float64
 9   Platforms Info      14055 non-null  object 
dtypes: float64(2), object(8)
memory usage: 1.1+ MB


In [159]:
# df8, all_video_games(cleaned), columns to keep:
# Developer (not too many missing values)
# Genres / Genres Splitted
# df8: not sure yet whether I will use it
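
If the genre columns of df8 are kept, `Genres Splitted` will probably need parsing first. A minimal sketch, assuming the column is stored as text that only looks like a Python list (the `object` dtype above suggests this, but it is not verified here). The toy rows are copied from the preview, and `genre_list` is a hypothetical column name.

```python
import ast

import pandas as pd

# Toy rows copied from the df8 preview; the real column is assumed to hold the same kind of strings.
sample = pd.DataFrame({
    "Title": ["Ziggurat (2012)", "4X4 EVO 2", "Gothic 3"],
    "Genres Splitted": ["['Action']", "['Auto', 'Racing', 'Sim']", "['Western', 'RPG']"],
})

# ast.literal_eval safely turns the text "['Auto', 'Racing', 'Sim']" into a real Python list.
sample["genre_list"] = sample["Genres Splitted"].apply(ast.literal_eval)

# One row per (title, genre token), convenient for counting or one-hot encoding later.
exploded = sample.explode("genre_list")[["Title", "genre_list"]]
print(exploded)
```

Rows with a missing `Genres Splitted` value would need to be filtered out or filled before calling `ast.literal_eval`, since it only accepts strings.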

<div id="heading--1-9"/>
<br>

# 1.9 - Dataset Raw Data
<br>

In [160]:
display(df9.head())
display(df9.sample(20))

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
552,FIFA 16,PS3,2015.0,Sports,Electronic Arts,0.41,1.84,0.05,0.4,2.71,,,3.2,58.0,EA Sports,E
2585,All Star Cheer Squad,Wii,2008.0,Sports,THQ,0.43,0.29,0.0,0.08,0.79,,,5.2,10.0,Gorilla Games,E
10663,Leisure Suit Larry: Box Office Bust,PS3,2009.0,Adventure,Codemasters,0.06,0.03,0.0,0.01,0.1,17.0,11.0,1.7,37.0,Team 17,M
134,Halo 3: ODST,X360,2009.0,Shooter,Microsoft Game Studios,4.34,1.34,0.06,0.61,6.34,83.0,94.0,7.1,1163.0,"Bungie Software, Bungie",M
12258,Cities XL 2012,PC,2011.0,Strategy,Focus Home Interactive,0.01,0.05,0.0,0.01,0.07,61.0,18.0,5.6,95.0,Monte Cristo Multimedia,E
1493,TNN Motor Sports Hardcore 4x4,PS,1996.0,Racing,ASC Games,0.73,0.5,0.0,0.09,1.31,,,,,,
5279,Mega Man X3,SNES,1995.0,Action,Laguna,0.04,0.01,0.3,0.0,0.35,,,,,,
6410,Are You Smarter than a 5th Grader? Game Time,Wii,2009.0,Puzzle,THQ,0.25,0.0,0.0,0.02,0.27,,,tbd,,THQ,E
13164,Arx Fatalis,XB,2003.0,Role-Playing,Mindscape,0.04,0.01,0.0,0.0,0.05,71.0,31.0,8.2,6.0,Arkane Studios,M
14503,Tantei Jinguuji Saburo DS: Kienai Kokoro,DS,2008.0,Adventure,Arc System Works,0.0,0.0,0.03,0.0,0.03,,,,,,


In [161]:
df9.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


In [162]:
# df9 is the same dataset as df2, but not cleaned. I will use df2.
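
One sign that df9 is not cleaned is that `User_Score` has dtype `object`: the sample above contains the placeholder `tbd` next to numeric scores. A minimal sketch of how such a column could be converted, using toy values rather than the real data:

```python
import pandas as pd

# Toy values modelled on the df9 preview: numeric strings, the 'tbd' placeholder, and a missing value.
user_score = pd.Series(["8.0", "tbd", None, "5.2"], name="User_Score")

# errors="coerce" turns anything that is not a number (here 'tbd') into NaN.
user_score_num = pd.to_numeric(user_score, errors="coerce")

print(user_score_num.dtype)
print(user_score_num.isna().sum())
```

The same call may be useful for any other score column that pandas reads as `object`, although whether those columns also contain `tbd` would have to be checked first.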

<br>

# Some last verifications before merging
<br>


In [163]:
# checking that the same game on different platforms appears as separate rows

df1.loc[df1['Name'] == 'Grand Theft Auto V']

Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre
30,Grand Theft Auto V,PS3,Rockstar Games,Rockstar North,6.37,9.85,0.99,3.12,20.32,Action
33,Grand Theft Auto V,PS4,Rockstar Games,Rockstar North,6.06,9.71,0.6,3.02,19.39,Action
50,Grand Theft Auto V,X360,Rockstar Games,Rockstar North,9.06,5.33,0.06,1.42,15.86,Action
86,Grand Theft Auto V,PC,Rockstar Games,Rockstar North,0.48,0.76,,0.1,1.33,Action
140,Grand Theft Auto V,XOne,Rockstar Games,Rockstar North,4.7,3.25,0.01,0.76,8.72,Action
44750,Grand Theft Auto V,PS5,Rockstar Games,Rockstar Games,,,,,,Action-Adventure
44751,Grand Theft Auto V,XS,Rockstar Games,Rockstar Games,,,,,,Action-Adventure
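
The manual check on one title can be generalized. A minimal sketch, assuming the intent is that each (Name, Platform) pair appears only once; the toy frame stands in for df1 and includes one deliberate repeat.

```python
import pandas as pd

# Toy frame standing in for df1: one game on several platforms, plus one accidental repeat.
games = pd.DataFrame({
    "Name": ["Grand Theft Auto V", "Grand Theft Auto V", "Grand Theft Auto V", "Wii Sports"],
    "Platform": ["PS3", "PS4", "PS3", "Wii"],
})

# keep=False flags every row that shares its (Name, Platform) pair with another row.
repeated = games[games.duplicated(subset=["Name", "Platform"], keep=False)]
print(repeated)
```

An empty result on the real dataframe would mean the pair can serve as a merge key without multiplying rows.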


In [164]:
# renaming a column for easier merging

df3 = df3.rename({'title': 'name'}, axis=1)

In [165]:
# verification that the name change worked

df3

Unnamed: 0,name,release_date,platforms,developer,esrb_rating,metascore,userscore,critic_reviews,user_reviews,num_players
0,Pushmo,08-Dec-11,3DS,Intelligent Systems,E,90,8.3,31,215.0,1 Player
1,The Legend of Zelda: Majora's Mask 3D,13-Feb-15,3DS,GREZZO,E10+,89,8.9,82,781.0,1 Player
2,The Legend of Zelda: Ocarina of Time 3D,19-Jun-11,3DS,GREZZO,E10+,94,9,85,1780.0,1 Player
3,The Legend of Zelda: A Link Between Worlds,22-Nov-13,3DS,Nintendo,E,91,9,81,1603.0,1 Player
4,Colors! 3D,05-Apr-12,3DS,Collecting Smiles,E,89,7.5,15,66.0,1-2 Players
...,...,...,...,...,...,...,...,...,...,...
19312,Necromunda: Hired Gun,01-Jun-21,XS,Focus Home Interactive,M,56,5.3,11,10.0,1 Player
19313,Grand Theft Auto: The Trilogy - The Definitive...,11-Nov-21,XS,"Rockstar Games, Grove Street Games",M,56,0.7,11,1124.0,1 Player
19314,Bright Memory,10-Nov-20,XS,FYQD Personal Studio,,55,4.2,31,62.0,1 Player
19315,Balan Wonderworld,26-Mar-21,XS,"Square Enix, Arzest, Balan Company",E10+,47,7.2,11,162.0,No Online Multiplayer Online Multiplayer


In [166]:
# checking that different games in the same series appear as separate rows

df1[df1['Name'].str.contains('God of War')]

Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre
28,God of War (2018),PS4,Sony Interactive Entertainment,SIE Santa Monica Studio,2.83,2.17,0.13,1.02,6.15,Action
163,God of War III,PS3,Sony Computer Entertainment,SCEA Santa Monica Studio,2.74,1.36,0.12,0.6,4.81,Action
238,God of War III Remastered,PS4,Sony Computer Entertainment,SCEA Santa Monica Studio,0.4,0.33,0.02,0.15,0.89,Action
337,God of War,PS2,Sony Computer Entertainment,SCEA Santa Monica Studio,2.71,1.29,0.02,0.43,4.45,Action
382,God of War II,PS2,Sony Computer Entertainment,SCEA Santa Monica Studio,2.32,0.04,0.04,1.67,4.07,Action
568,God of War: Chains of Olympus,PSP,Sony Computer Entertainment,Ready at Dawn,1.48,1.0,0.04,0.66,3.19,Action
630,God of War: Ascension,PS3,Sony Computer Entertainment,SCEA Santa Monica Studio,1.23,0.72,0.04,0.41,2.4,Action
816,God of War,PC,PlayStation PC,SIE Santa Monica Studio,,,,,,Action-Adventure
851,God of War Collection,PS3,Sony Computer Entertainment,Bluepoint Games,1.7,0.45,0.06,0.4,2.6,Action
2074,God of War: Ghost of Sparta,PSP,Sony Computer Entertainment,Ready at Dawn,0.41,0.36,0.03,0.21,1.01,Action


<div id="heading--2"/>
<br>

# Part 2 - Cleaning and merging datasets

<br>
<br>
<div id="heading--2-1"/>

# 2.1 - Merging the 4 main datasets

In [167]:
# Merging the first 2 useful datasets

# merged_df = df1.merge(df3, on='name', how='inner') (not using anymore, keeping for tests)
merged_df = df1.merge(df3, left_on=['Name', 'Platform'], right_on=['name', 'platforms'], how='left')

print("The new dataframe is:")
merged_df

The new dataframe is:


Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre,name,release_date,platforms,developer,esrb_rating,metascore,userscore,critic_reviews,user_reviews,num_players
0,Wii Sports,Wii,Nintendo,Nintendo EAD,41.36,29.02,3.77,8.51,82.65,Sports,Wii Sports,19-Nov-06,Wii,Nintendo,E,76.0,8.1,51.0,483.0,1-4 Players
1,Mario Kart 8 Deluxe,NS,Nintendo,Nintendo EPD,5.05,4.98,2.11,0.91,13.05,Racing,Mario Kart 8 Deluxe,28-Apr-17,NS,Nintendo,E,92.0,8.6,95.0,2379.0,Up to 12 Players
2,Animal Crossing: New Horizons,NS,Nintendo,Nintendo,,,,,,Simulation,Animal Crossing: New Horizons,20-Mar-20,NS,Nintendo,E,90.0,5.6,111.0,6386.0,Up to 8 Players
3,Super Mario Bros.,NES,Nintendo,Nintendo EAD,29.08,3.58,6.81,0.77,40.24,Platform,,,,,,,,,,
4,Counter-Strike: Global Offensive,PC,Valve,Valve Corporation,,,,,,Shooter,Counter-Strike: Global Offensive,21-Aug-12,PC,"Valve Software, Hidden Path Entertainment",M,83.0,7.3,38.0,4790.0,", Up to 10 Players"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50352,Zombieland: Double Tap - Road Trip,PC,GameMill Entertainment,High Voltage Software,,,,,,Shooter,,,,,,,,,,
50353,Zombillie,NS,Forever Entertainment S.A.,Forever Entertainment S.A.,,,,,,Puzzle,,,,,,,,,,
50354,Zone of the Enders: The 2nd Runner MARS,PC,Konami,Cygames,,,,,,Simulation,,,,,,,,,,
50355,Zoo Tycoon: Ultimate Animal Collection,XOne,Microsoft Studios,Frontier Developments,,,,,,Simulation,,,,,,,,,,
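
Many rows in the output above have no rating columns filled in, so it can help to count how many rows actually found a match. A minimal sketch using the `indicator` parameter of `DataFrame.merge`; the two small frames are illustrative stand-ins for df1 and df3, not the real data.

```python
import pandas as pd

# Toy stand-ins for df1 (sales) and df3 (ratings); values are illustrative only.
sales = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart 8 Deluxe"],
    "Platform": ["Wii", "NES", "NS"],
})
ratings = pd.DataFrame({
    "name": ["Wii Sports", "Mario Kart 8 Deluxe"],
    "platforms": ["Wii", "NS"],
    "metascore": [76, 92],
})

# indicator=True adds a '_merge' column: 'both' for matched rows, 'left_only' otherwise.
checked = sales.merge(
    ratings,
    left_on=["Name", "Platform"],
    right_on=["name", "platforms"],
    how="left",
    indicator=True,
)
match_counts = checked["_merge"].value_counts()
print(match_counts)
```

The `_merge` column can be dropped once the counts have been inspected.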


In [168]:
# checking for duplicates

merged_df.duplicated().sum()

18

In [169]:
# removing duplicates

merged_df.drop_duplicates(inplace=True)

In [170]:
# checking merged_df

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50339 entries, 0 to 50356
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            50339 non-null  object 
 1   Platform        50339 non-null  object 
 2   Publisher       50339 non-null  object 
 3   Developer       50339 non-null  object 
 4   NA_Sales        13520 non-null  float64
 5   PAL_Sales       13874 non-null  float64
 6   JP_Sales        7631 non-null   float64
 7   Other_Sales     16206 non-null  float64
 8   Global_Sales    20114 non-null  float64
 9   Genre           50339 non-null  object 
 10  name            12604 non-null  object 
 11  release_date    12604 non-null  object 
 12  platforms       12604 non-null  object 
 13  developer       12596 non-null  object 
 14  esrb_rating     11773 non-null  object 
 15  metascore       12604 non-null  float64
 16  userscore       12604 non-null  object 
 17  critic_reviews  12604 non-null  floa

In [171]:
# making sure the same game on different platforms is still different rows

df3.loc[df3['name'] == 'Grand Theft Auto V']

Unnamed: 0,name,release_date,platforms,developer,esrb_rating,metascore,userscore,critic_reviews,user_reviews,num_players
2267,Grand Theft Auto V,13-Apr-15,PC,Rockstar North,M,96,7.8,57,8197.0,Up to 32 Players
8865,Grand Theft Auto V,17-Sep-13,PS3,Rockstar North,M,97,8.3,50,4855.0,Up to 16 Players
10130,Grand Theft Auto V,18-Nov-14,PS4,Rockstar North,M,97,8.3,66,7162.0,Up to 30 Players
12251,Grand Theft Auto V,15-Mar-22,PS5,Rockstar North,M,81,2.4,22,583.0,Up to 30 Players
16363,Grand Theft Auto V,17-Sep-13,X360,Rockstar North,M,97,8.3,58,4062.0,Up to 16 Players
18012,Grand Theft Auto V,18-Nov-14,XOne,Rockstar North,M,97,7.8,14,1621.0,Up to 30 Players
19240,Grand Theft Auto V,15-Mar-22,XS,Rockstar North,M,79,3.5,11,213.0,Up to 30 Players


In [172]:
# verifying null values

merged_df.isna().sum(axis=0)

Name                  0
Platform              0
Publisher             0
Developer             0
NA_Sales          36819
PAL_Sales         36465
JP_Sales          42708
Other_Sales       34133
Global_Sales      30225
Genre                 0
name              37735
release_date      37735
platforms         37735
developer         37743
esrb_rating       38566
metascore         37735
userscore         37735
critic_reviews    37735
user_reviews      38438
num_players       37746
dtype: int64
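
Raw counts are easier to compare as shares of the table. For example, 37735 missing `metascore` values out of 50339 rows is about 75%. A minimal sketch of the same calculation for every column, on a toy frame with made-up gaps:

```python
import pandas as pd

# Toy frame with deliberate gaps; the real merged_df has many more columns.
toy = pd.DataFrame({
    "Global_Sales": [82.65, None, None, 40.24],
    "metascore": [76.0, 92.0, None, None],
    "Genre": ["Sports", "Racing", "Simulation", "Platform"],
})

# isna().mean() is the fraction of missing values per column; multiply by 100 for percent.
null_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)
print(null_pct)
```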

In [173]:
# merged_df = merged_df.merge(df4, left_on='Name', right_on='Name', how='left') (not using anymore, keeping for tests)

# merging the 3rd useful dataset

merged_df = merged_df.merge(df4, how='left', on='Name')
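
Because this merge uses `Name` alone as the key, any name that appears more than once in the right-hand table would multiply the matching rows on the left. A minimal sketch of how to detect that with the `validate` parameter; the lookup table below is hypothetical and only stands in for df4.

```python
import pandas as pd

left = pd.DataFrame({
    "Name": ["Wii Sports", "Wii Sports", "Zombillie"],
    "Platform": ["Wii", "NS", "NS"],
})
# Hypothetical lookup table with a repeated key, standing in for df4.
right = pd.DataFrame({
    "Name": ["Wii Sports", "Wii Sports"],
    "Series": [0, 1],
})

rows_before = len(left)
fanned_out = left.merge(right, how="left", on="Name")
print(rows_before, len(fanned_out))

# validate="many_to_one" makes pandas raise instead of silently multiplying rows.
try:
    left.merge(right, how="left", on="Name", validate="many_to_one")
    merge_is_safe = True
except pd.errors.MergeError:
    merge_is_safe = False
print(merge_is_safe)
```

Comparing `len()` before and after a left merge is a cheap complement to `duplicated()`, which does not flag fanned-out rows when their right-hand values differ.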

In [174]:
merged_df

Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre,...,developer,esrb_rating,metascore,userscore,critic_reviews,user_reviews,num_players,Story Focus,Gameplay Focus,Series
0,Wii Sports,Wii,Nintendo,Nintendo EAD,41.36,29.02,3.77,8.51,82.65,Sports,...,Nintendo,E,76.0,8.1,51.0,483.0,1-4 Players,0,x,0
1,Mario Kart 8 Deluxe,NS,Nintendo,Nintendo EPD,5.05,4.98,2.11,0.91,13.05,Racing,...,Nintendo,E,92.0,8.6,95.0,2379.0,Up to 12 Players,,,
2,Animal Crossing: New Horizons,NS,Nintendo,Nintendo,,,,,,Simulation,...,Nintendo,E,90.0,5.6,111.0,6386.0,Up to 8 Players,,,
3,Super Mario Bros.,NES,Nintendo,Nintendo EAD,29.08,3.58,6.81,0.77,40.24,Platform,...,,,,,,,,,,
4,Counter-Strike: Global Offensive,PC,Valve,Valve Corporation,,,,,,Shooter,...,"Valve Software, Hidden Path Entertainment",M,83.0,7.3,38.0,4790.0,", Up to 10 Players",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50345,Zombieland: Double Tap - Road Trip,PC,GameMill Entertainment,High Voltage Software,,,,,,Shooter,...,,,,,,,,,,
50346,Zombillie,NS,Forever Entertainment S.A.,Forever Entertainment S.A.,,,,,,Puzzle,...,,,,,,,,,,
50347,Zone of the Enders: The 2nd Runner MARS,PC,Konami,Cygames,,,,,,Simulation,...,,,,,,,,,,
50348,Zoo Tycoon: Ultimate Animal Collection,XOne,Microsoft Studios,Frontier Developments,,,,,,Simulation,...,,,,,,,,,,


In [175]:
# checking for duplicates

merged_df.duplicated().sum()

0

In [176]:
# checking merged_df

merged_df

Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre,...,developer,esrb_rating,metascore,userscore,critic_reviews,user_reviews,num_players,Story Focus,Gameplay Focus,Series
0,Wii Sports,Wii,Nintendo,Nintendo EAD,41.36,29.02,3.77,8.51,82.65,Sports,...,Nintendo,E,76.0,8.1,51.0,483.0,1-4 Players,0,x,0
1,Mario Kart 8 Deluxe,NS,Nintendo,Nintendo EPD,5.05,4.98,2.11,0.91,13.05,Racing,...,Nintendo,E,92.0,8.6,95.0,2379.0,Up to 12 Players,,,
2,Animal Crossing: New Horizons,NS,Nintendo,Nintendo,,,,,,Simulation,...,Nintendo,E,90.0,5.6,111.0,6386.0,Up to 8 Players,,,
3,Super Mario Bros.,NES,Nintendo,Nintendo EAD,29.08,3.58,6.81,0.77,40.24,Platform,...,,,,,,,,,,
4,Counter-Strike: Global Offensive,PC,Valve,Valve Corporation,,,,,,Shooter,...,"Valve Software, Hidden Path Entertainment",M,83.0,7.3,38.0,4790.0,", Up to 10 Players",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50345,Zombieland: Double Tap - Road Trip,PC,GameMill Entertainment,High Voltage Software,,,,,,Shooter,...,,,,,,,,,,
50346,Zombillie,NS,Forever Entertainment S.A.,Forever Entertainment S.A.,,,,,,Puzzle,...,,,,,,,,,,
50347,Zone of the Enders: The 2nd Runner MARS,PC,Konami,Cygames,,,,,,Simulation,...,,,,,,,,,,
50348,Zoo Tycoon: Ultimate Animal Collection,XOne,Microsoft Studios,Frontier Developments,,,,,,Simulation,...,,,,,,,,,,


In [177]:
# making sure the merge did not create duplicated rows for the same game and platform

merged_df.loc[merged_df['name'] == 'Grand Theft Auto V']

Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre,...,developer,esrb_rating,metascore,userscore,critic_reviews,user_reviews,num_players,Story Focus,Gameplay Focus,Series
31,Grand Theft Auto V,PS3,Rockstar Games,Rockstar North,6.37,9.85,0.99,3.12,20.32,Action,...,Rockstar North,M,97.0,8.3,50.0,4855.0,Up to 16 Players,x,x,0
34,Grand Theft Auto V,PS4,Rockstar Games,Rockstar North,6.06,9.71,0.6,3.02,19.39,Action,...,Rockstar North,M,97.0,8.3,66.0,7162.0,Up to 30 Players,x,x,0
51,Grand Theft Auto V,X360,Rockstar Games,Rockstar North,9.06,5.33,0.06,1.42,15.86,Action,...,Rockstar North,M,97.0,8.3,58.0,4062.0,Up to 16 Players,x,x,0
87,Grand Theft Auto V,PC,Rockstar Games,Rockstar North,0.48,0.76,,0.1,1.33,Action,...,Rockstar North,M,96.0,7.8,57.0,8197.0,Up to 32 Players,x,x,0
141,Grand Theft Auto V,XOne,Rockstar Games,Rockstar North,4.7,3.25,0.01,0.76,8.72,Action,...,Rockstar North,M,97.0,7.8,14.0,1621.0,Up to 30 Players,x,x,0
44764,Grand Theft Auto V,PS5,Rockstar Games,Rockstar Games,,,,,,Action-Adventure,...,Rockstar North,M,81.0,2.4,22.0,583.0,Up to 30 Players,x,x,0
44765,Grand Theft Auto V,XS,Rockstar Games,Rockstar Games,,,,,,Action-Adventure,...,Rockstar North,M,79.0,3.5,11.0,213.0,Up to 30 Players,x,x,0


In [178]:
# merging the last useful dataset into merged_df

merged_df = merged_df.merge(df6, left_on='Name', right_on='title', how='left')

In [179]:
# checking the last 40 rows

merged_df.tail(40)

Unnamed: 0,Name,Platform,Publisher,Developer,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Global_Sales,Genre,...,userscore,critic_reviews,user_reviews,num_players,Story Focus,Gameplay Focus,Series,title,score,opencritic_classification
50310,Ys VIII: Lacrimosa of Dana,PC,NIS America,Falcom,,,,,,Role-Playing,...,,,,,,,,,,
50311,Ys: Memories of Celceta - Kai,PS4,Xseed Games,Nihon Falcom Corporation,,,,,,Role-Playing,...,,,,,,,,,,
50312,Yu Yu Hakusho Tournament Tactics,GBA,Atari,Sensory Sweep Studios,,,,,,Strategy,...,,,,,,,,,,
50313,Yu-Gi-Oh! Duel Links,PC,Unknown,Konami,,,,,,Strategy,...,,,,,,,,,,
50314,Yu-Gi-Oh! Legacy of the Duelist,PS4,Konami,Other Ocean Interactive,,,,,,Strategy,...,,,,,,,,,,
50315,Yu-Gi-Oh! Legacy of the Duelist,XOne,Konami,Other Ocean Interactive,,,,,,Strategy,...,,,,,,,,,,
50316,Yu-Gi-Oh! Legacy of the Duelist: Link Evolution,NS,Konami,Other Ocean Interactive,,,,,,Strategy,...,7.8,14.0,24.0,Up to 4 Players,,,,Yu-Gi-Oh! Legacy of the Duelist: Link Evolution,78.0,Strong
50317,Yu-Gi-Oh! Master Duel,PC,Unknown,Konami,,,,,,Strategy,...,6.4,12.0,48.0,1 Player,,,,Yu-Gi-Oh! Master Duel,78.0,Strong
50318,Yu-Gi-Oh! Master Duel,PS4,Unknown,Konami,,,,,,Strategy,...,,,,,,,,Yu-Gi-Oh! Master Duel,78.0,Strong
50319,Yu-Gi-Oh! Master Duel,PS5,Unknown,Konami,,,,,,Strategy,...,,,,,,,,Yu-Gi-Oh! Master Duel,78.0,Strong


In [180]:
# checking for duplicates

merged_df.duplicated().sum()

0

In [181]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50350 entries, 0 to 50349
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Name                       50350 non-null  object 
 1   Platform                   50350 non-null  object 
 2   Publisher                  50350 non-null  object 
 3   Developer                  50350 non-null  object 
 4   NA_Sales                   13530 non-null  float64
 5   PAL_Sales                  13882 non-null  float64
 6   JP_Sales                   7636 non-null   float64
 7   Other_Sales                16216 non-null  float64
 8   Global_Sales               20124 non-null  float64
 9   Genre                      50350 non-null  object 
 10  name                       12612 non-null  object 
 11  release_date               12612 non-null  object 
 12  platforms                  12612 non-null  object 
 13  developer                  12604 non-null  obj

In [182]:
merged_df.drop(['name', 'title', 'platforms', 'developer'], axis=1, inplace=True)

In [183]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50350 entries, 0 to 50349
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Name                       50350 non-null  object 
 1   Platform                   50350 non-null  object 
 2   Publisher                  50350 non-null  object 
 3   Developer                  50350 non-null  object 
 4   NA_Sales                   13530 non-null  float64
 5   PAL_Sales                  13882 non-null  float64
 6   JP_Sales                   7636 non-null   float64
 7   Other_Sales                16216 non-null  float64
 8   Global_Sales               20124 non-null  float64
 9   Genre                      50350 non-null  object 
 10  release_date               12612 non-null  object 
 11  esrb_rating                11781 non-null  object 
 12  metascore                  12612 non-null  float64
 13  userscore                  12612 non-null  obj

At this point, I realize there are a lot of null values, and it will be hard to fill all 50k rows.
Since about 12.6k rows have a metascore, which is a reasonable amount of data, I will drop every row where "metascore" is missing.

In [184]:
merged_df.dropna(subset=['metascore'], inplace=True)
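
A minimal sketch of the same step on toy data, showing how to preview the number of rows that will survive and how to get a contiguous index afterwards. The `reset_index` call is an optional addition, not part of the original cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Super Mario Bros.", "Wii Sports", "Zombillie"],
    "metascore": [None, 76.0, None],
})

# Number of rows that have a metascore, i.e. the rows dropna will keep.
rows_kept = int(toy["metascore"].notna().sum())

# Drop rows without a metascore, then renumber the index from 0.
cleaned = toy.dropna(subset=["metascore"]).reset_index(drop=True)
print(rows_kept, len(cleaned))
```

Without `reset_index(drop=True)`, the surviving rows keep their original labels, so the index has gaps.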

In [185]:
# rechecking the dataset

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12612 entries, 0 to 50339
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Name                       12612 non-null  object 
 1   Platform                   12612 non-null  object 
 2   Publisher                  12612 non-null  object 
 3   Developer                  12612 non-null  object 
 4   NA_Sales                   7214 non-null   float64
 5   PAL_Sales                  7460 non-null   float64
 6   JP_Sales                   2491 non-null   float64
 7   Other_Sales                8025 non-null   float64
 8   Global_Sales               8256 non-null   float64
 9   Genre                      12612 non-null  object 
 10  release_date               12612 non-null  object 
 11  esrb_rating                11781 non-null  object 
 12  metascore                  12612 non-null  float64
 13  userscore                  12612 non-null  object 


In [186]:
# exporting to csv when needed

# merged_df.to_csv('clean_data_1.0.csv')
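
By default, `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of the export with `index=False`, using an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

# Toy frame with a non-contiguous index, similar to merged_df after dropping rows.
toy = pd.DataFrame({"Name": ["Wii Sports"], "metascore": [76.0]}, index=[42])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False leaves the index out of the file

buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
```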

<div id="heading--3"/>
    
# Next Steps

### As soon as possible:
    
    1. More cleaning of the dataset
    2. Trying to find a dataset to complement the sales numbers
    3. Filling in missing data on my own, to make the predictions more accurate

### In the next few weeks:

    4. Trying multiple types of regression models
    5. Finding the best model for the current situation
    6. Training the model for maximum efficiency
    7. Exporting a clean dataframe/CSV to Kaggle for other users
    8. Hosting the project on a website for easy use