#### Author: Allan Jeeboo
#### Preferred name: Vyncent van der Wolvenhuizen
#### Affiliation: Data Science student at Triple Ten
#### email: vanderwolvenhuizen.vyncent@proton.me
#### Date Started: 2025-02-13
#### Last Updated: 2025-02-13 12:26


# Table of Contents
## 1.0 Introduction
### 1.1 Import Data
### 1.2 Data Description
## 2.0 Data Analysis
### 2.1 Cleaning Data
### 2.2 Exploratory Data Analysis (EDA)

## 1.0 Introduction

This project aims to identify patterns that determine whether a game succeeds. We'll use a dataset from 2016 to build forecasts, which will then serve to plan a campaign.

### 1.1 Import Data
Let's import the libraries we need and then load the data.

In [168]:
import pandas as pd


df = pd.read_csv("games.csv")

df

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.00,,,
...,...,...,...,...,...,...,...,...,...,...,...
16710,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,0.00,0.00,0.01,0.00,,,
16711,LMA Manager 2007,X360,2006.0,Sports,0.00,0.01,0.00,0.00,,,
16712,Haitaka no Psychedelica,PSV,2016.0,Adventure,0.00,0.00,0.01,0.00,,,
16713,Spirits & Spells,GBA,2003.0,Platform,0.01,0.00,0.00,0.00,,,


### 1.2 Data Description
- Name
- Platform
- Year_of_Release
- Genre
- NA_sales (North American sales in USD million)
- EU_sales (sales in Europe in USD million)
- JP_sales (sales in Japan in USD million)
- Other_sales (sales in other countries in USD million)
- Critic_Score (maximum of 100)
- User_Score (maximum of 10)
- Rating (ESRB)

Data for 2016 may be incomplete.

The column descriptions above are taken from the Integrated Project 1 overview page on Triple Ten:
https://tripleten.com/trainer/data-scientist/lesson/2fede7ea-9ca6-42a3-ba35-bf4142d2fcc0/

## 2.0 Data Analysis
### 2.1 Cleaning Data

In [169]:
# Change column names to lowercase
df.columns = df.columns.str.lower()

df

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.00,,,
...,...,...,...,...,...,...,...,...,...,...,...
16710,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,0.00,0.00,0.01,0.00,,,
16711,LMA Manager 2007,X360,2006.0,Sports,0.00,0.01,0.00,0.00,,,
16712,Haitaka no Psychedelica,PSV,2016.0,Adventure,0.00,0.00,0.01,0.00,,,
16713,Spirits & Spells,GBA,2003.0,Platform,0.01,0.00,0.00,0.00,,,


In [170]:
df.dtypes

name                object
platform            object
year_of_release    float64
genre               object
na_sales           float64
eu_sales           float64
jp_sales           float64
other_sales        float64
critic_score       float64
user_score          object
rating              object
dtype: object
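
"user_score" is listed as object even though the scores are supposed to be numbers with a maximum of 10, which suggests that some entries are not plain numbers. A minimal sketch of one way to find such entries is shown here on a small made-up Series; the "n/a" placeholder string is purely hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: numeric strings, a missing value, and a made-up placeholder
sample_scores = pd.Series(["8", "8.3", None, "n/a"], name="user_score")

# Values that cannot be parsed as numbers become NaN instead of raising an error
numeric_scores = pd.to_numeric(sample_scores, errors="coerce")

# Keep only entries that were present originally but failed to parse
unparseable = sample_scores[numeric_scores.isna() & sample_scores.notna()]
print(unparseable.unique())
```

Running the same check on `df.user_score` would list whatever non-numeric strings the real column contains, if any.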

In [171]:
df.critic_score.unique()

array([76., nan, 82., 80., 89., 58., 87., 91., 61., 97., 95., 77., 88.,
       83., 94., 93., 85., 86., 98., 96., 90., 84., 73., 74., 78., 92.,
       71., 72., 68., 62., 49., 67., 81., 66., 56., 79., 70., 59., 64.,
       75., 60., 63., 69., 50., 25., 42., 44., 55., 48., 57., 29., 47.,
       65., 54., 20., 53., 37., 38., 33., 52., 30., 32., 43., 45., 51.,
       40., 46., 39., 34., 35., 41., 36., 28., 31., 27., 26., 19., 23.,
       24., 21., 17., 22., 13.])

It would make more sense for "year_of_release" to be an int rather than a float. "critic_score" is also a float; since all of its values are whole numbers, we'll convert this column to int as well.
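
A plain int column can't hold missing values, and "critic_score" contains nan, so a direct `astype(int)` would fail on it. One possible way around this, sketched here on a small made-up frame rather than the real data, is pandas' nullable "Int64" dtype, which stores missing entries as `<NA>`. This is only an illustration of the idea, not necessarily the conversion this notebook ends up using.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same two float columns, including missing values
sample = pd.DataFrame({
    "year_of_release": [2006.0, np.nan, 2008.0],
    "critic_score": [76.0, np.nan, 82.0],
})

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>
for column in ["year_of_release", "critic_score"]:
    sample[column] = sample[column].astype("Int64")

print(sample.dtypes)
```

The conversion only works when every non-missing value is a whole number, which matches what the `unique()` output for "critic_score" shows.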

In [172]:
df.isna().sum()

name                  2
platform              0
year_of_release     269
genre                 2
na_sales              0
eu_sales              0
jp_sales              0
other_sales           0
critic_score       8578
user_score         6701
rating             6766
dtype: int64

In [173]:
df.isna().mean()

name               0.000120
platform           0.000000
year_of_release    0.016093
genre              0.000120
na_sales           0.000000
eu_sales           0.000000
jp_sales           0.000000
other_sales        0.000000
critic_score       0.513192
user_score         0.400897
rating             0.404786
dtype: float64

In [174]:
# There are two missing names; which rows are they?
for index in df.index[df["name"].isna()]:
    print(index)

659
14244


In [175]:
df.iloc[659]

name                  NaN
platform              GEN
year_of_release    1993.0
genre                 NaN
na_sales             1.78
eu_sales             0.53
jp_sales              0.0
other_sales          0.08
critic_score          NaN
user_score            NaN
rating                NaN
Name: 659, dtype: object

In [176]:
df.iloc[14244]

name                  NaN
platform              GEN
year_of_release    1993.0
genre                 NaN
na_sales              0.0
eu_sales              0.0
jp_sales             0.03
other_sales           0.0
critic_score          NaN
user_score            NaN
rating                NaN
Name: 14244, dtype: object

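My initial thoughts are to drop the nan rows in "name", use fillna and mean for "year_of_release", then use fillna and mode for "genre". The final three columns are missing 51.32%, 40.09%, and 40.48% of their respective data. I'm not certain how to handle them just yet. In the cells below, I ended up filling "name" with "Unknown" instead of dropping those rows, and using the mode rather than the mean for "year_of_release".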
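
For comparison, the initial plan (drop rows without a name, fill the year with the mean, fill the genre with the mode) could look roughly like the sketch below. It runs on a small made-up frame, the name `cleaned` is just for illustration, and rounding the mean to a whole year is my own assumption.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same kinds of gaps as the real data
sample = pd.DataFrame({
    "name": ["Game A", None, "Game C", "Game D"],
    "year_of_release": [2006.0, 1993.0, np.nan, 2010.0],
    "genre": ["Sports", None, "Sports", None],
})

# Drop rows where the name is missing
cleaned = sample.dropna(subset=["name"]).copy()

# Fill missing years with the mean year, rounded to a whole year (assumption)
mean_year = round(cleaned["year_of_release"].mean())
cleaned["year_of_release"] = cleaned["year_of_release"].fillna(mean_year)

# Fill missing genres with the most frequent genre
cleaned["genre"] = cleaned["genre"].fillna(cleaned["genre"].mode()[0])

print(cleaned)
```

Both approaches invent a release year for 269 rows, so whichever fill value is used, those rows should be treated with some caution in any year-based analysis.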

In [177]:
df.name = df.name.fillna("Unknown")

In [178]:
df.genre = df.genre.fillna(df.genre.mode()[0])
df.year_of_release = df.year_of_release.fillna(df.year_of_release.mode()[0])

df

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.00,,,
...,...,...,...,...,...,...,...,...,...,...,...
16710,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,0.00,0.00,0.01,0.00,,,
16711,LMA Manager 2007,X360,2006.0,Sports,0.00,0.01,0.00,0.00,,,
16712,Haitaka no Psychedelica,PSV,2016.0,Adventure,0.00,0.00,0.01,0.00,,,
16713,Spirits & Spells,GBA,2003.0,Platform,0.01,0.00,0.00,0.00,,,


In [179]:
df.isna().sum()

name                  0
platform              0
year_of_release       0
genre                 0
na_sales              0
eu_sales              0
jp_sales              0
other_sales           0
critic_score       8578
user_score         6701
rating             6766
dtype: int64

In [180]:
df.duplicated().sum()

np.int64(0)
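
`df.duplicated()` with no arguments only flags rows that match in every column, so two entries for the same game that differ in, say, a sales figure would not be counted. A rough sketch of a stricter check is shown below on made-up data; treating name, platform, and release year as the identifying columns is my own assumption.

```python
import pandas as pd

# Made-up sample: the first two rows describe the same game with different sales
sample = pd.DataFrame({
    "name": ["Game A", "Game A", "Game B"],
    "platform": ["PS3", "PS3", "PS3"],
    "year_of_release": [2012.0, 2012.0, 2012.0],
    "na_sales": [2.11, 0.00, 0.50],
})

# Full-row duplicates: none, because the sales figures differ
full_duplicates = sample.duplicated().sum()

# Duplicates on the identifying columns only (assumed key)
key_columns = ["name", "platform", "year_of_release"]
key_duplicates = sample.duplicated(subset=key_columns).sum()

# keep=False marks every member of a duplicated group, for inspection
repeated_rows = sample[sample.duplicated(subset=key_columns, keep=False)]
print(full_duplicates, key_duplicates)
print(repeated_rows)
```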

In [181]:
print(df.platform.unique(), 
      "\n\n", 
      df.year_of_release.unique()
      )


['Wii' 'NES' 'GB' 'DS' 'X360' 'PS3' 'PS2' 'SNES' 'GBA' 'PS4' '3DS' 'N64'
 'PS' 'XB' 'PC' '2600' 'PSP' 'XOne' 'WiiU' 'GC' 'GEN' 'DC' 'PSV' 'SAT'
 'SCD' 'WS' 'NG' 'TG16' '3DO' 'GG' 'PCFX'] 

 [2006. 1985. 2008. 2009. 1996. 1989. 1984. 2005. 1999. 2007. 2010. 2013.
 2004. 1990. 1988. 2002. 2001. 2011. 1998. 2015. 2012. 2014. 1992. 1997.
 1993. 1994. 1982. 2016. 2003. 1986. 2000. 1995. 1991. 1981. 1987. 1980.
 1983.]


### 2.2 Exploratory Data Analysis (EDA)