Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Choose which observations you will use to train, validate, and test your model, and which observations, if any, to exclude.
- [ ] Determine whether your problem is regression or classification.
- [ ] Choose your evaluation metric.
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" information from the future?


In [1]:
DATA_PATH = './data/vgsales/vgsales-12-4-2019.csv'
import pandas as pd
df = pd.read_csv(DATA_PATH)

In [2]:
df.shape

(55792, 23)

In [3]:
pd.set_option('display.max_columns', 23)
pd.set_option("display.max_rows", 100)
df.head()

Unnamed: 0,Rank,Name,basename,Genre,ESRB_Rating,Platform,Publisher,Developer,VGChartz_Score,Critic_Score,User_Score,Total_Shipped,Global_Sales,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Last_Update,url,status,Vgchartzscore,img_url
0,1,Wii Sports,wii-sports,Sports,E,Wii,Nintendo,Nintendo EAD,,7.7,,82.86,,,,,,2006.0,,http://www.vgchartz.com/game/2667/wii-sports/?...,1,,/games/boxart/full_2258645AmericaFrontccc.jpg
1,2,Super Mario Bros.,super-mario-bros,Platform,,NES,Nintendo,Nintendo EAD,,10.0,,40.24,,,,,,1985.0,,http://www.vgchartz.com/game/6455/super-mario-...,1,,/games/boxart/8972270ccc.jpg
2,3,Mario Kart Wii,mario-kart-wii,Racing,E,Wii,Nintendo,Nintendo EAD,,8.2,9.1,37.14,,,,,,2008.0,11th Apr 18,http://www.vgchartz.com/game/6968/mario-kart-w...,1,8.7,/games/boxart/full_8932480AmericaFrontccc.jpg
3,4,PlayerUnknown's Battlegrounds,playerunknowns-battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,,36.6,,,,,,2017.0,13th Nov 18,http://www.vgchartz.com/game/215988/playerunkn...,1,,/games/boxart/full_8052843AmericaFrontccc.jpg
4,5,Wii Sports Resort,wii-sports-resort,Sports,E,Wii,Nintendo,Nintendo EAD,,8.0,8.8,33.09,,,,,,2009.0,,http://www.vgchartz.com/game/24656/wii-sports-...,1,8.8,/games/boxart/full_7295041AmericaFrontccc.jpg


In [4]:
df.dtypes

Rank                int64
Name               object
basename           object
Genre              object
ESRB_Rating        object
Platform           object
Publisher          object
Developer          object
VGChartz_Score    float64
Critic_Score      float64
User_Score        float64
Total_Shipped     float64
Global_Sales      float64
NA_Sales          float64
PAL_Sales         float64
JP_Sales          float64
Other_Sales       float64
Year              float64
Last_Update        object
url                object
status              int64
Vgchartzscore     float64
img_url            object
dtype: object

In [5]:
'''
Drop the following columns:
- basename: a slugified copy of the game's name; not needed for data wrangling
- VGChartz_Score: all values are null, and the separate Vgchartzscore column holds the actual values
- url: link to the VGChartz page the data was collected from; not needed for data wrangling
- img_url: box art can affect game sales, but it's not quantifiable and is out of scope for this assignment
- status: always 1, and there's no documentation on what it means anyway
'''

df = df.drop(columns=['basename', 'VGChartz_Score', 'url', 'img_url', 'status'])
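The reasons given above for dropping `VGChartz_Score` (all null) and `status` (always 1) can be checked before the drop. The following is a minimal sketch on a made-up toy frame, not the real data; the variable names `toy`, `all_null`, and `constant` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "VGChartz_Score": [None, None, None],
    "status": [1, 1, 1],
    "Vgchartzscore": [8.7, None, 8.8],
})

# Columns where every value is missing.
all_null = [c for c in toy.columns if toy[c].isna().all()]

# Columns with at most one distinct non-null value (constant or empty).
constant = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]

print("all null:", all_null)
print("constant:", constant)
```

Running the same two list comprehensions against `df` before the `drop` call would confirm or refute the claims in the comment.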

In [6]:
df.shape

(55792, 18)

In [7]:
df.head()

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,Total_Shipped,Global_Sales,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Last_Update,Vgchartzscore
0,1,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,82.86,,,,,,2006.0,,
1,2,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,40.24,,,,,,1985.0,,
2,3,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,37.14,,,,,,2008.0,11th Apr 18,8.7
3,4,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,36.6,,,,,,2017.0,13th Nov 18,
4,5,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,33.09,,,,,,2009.0,,8.8


In [8]:
df['Total_Shipped'].describe()

count    1827.000000
mean        1.887258
std         4.195693
min         0.030000
25%         0.200000
50%         0.590000
75%         1.800000
max        82.860000
Name: Total_Shipped, dtype: float64

In [9]:
df['Global_Sales'].describe()

count    19415.000000
mean         0.365503
std          0.833022
min          0.000000
25%          0.030000
50%          0.120000
75%          0.360000
max         20.320000
Name: Global_Sales, dtype: float64

In [11]:
df.head(21)

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,Total_Shipped,Global_Sales,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Last_Update,Vgchartzscore
0,1,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,82.86,,,,,,2006.0,,
1,2,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,40.24,,,,,,1985.0,,
2,3,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,37.14,,,,,,2008.0,11th Apr 18,8.7
3,4,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,36.6,,,,,,2017.0,13th Nov 18,
4,5,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,33.09,,,,,,2009.0,,8.8
5,6,Pokemon Red / Green / Blue Version,Role-Playing,E,GB,Nintendo,Game Freak,9.4,,31.38,,,,,,1998.0,,
6,7,New Super Mario Bros.,Platform,E,DS,Nintendo,Nintendo EAD,9.1,8.1,30.8,,,,,,2006.0,,
7,8,Tetris,Puzzle,E,GB,Nintendo,Bullet Proof Software,,,30.26,,,,,,1989.0,,
8,9,New Super Mario Bros. Wii,Platform,E,Wii,Nintendo,Nintendo EAD,8.6,9.2,30.22,,,,,,2009.0,,9.1
9,10,Minecraft,Misc,,PC,Mojang,Mojang AB,10.0,,30.01,,,,,,2010.0,05th Aug 18,


In [22]:
'''
Total_Shipped and Global_Sales appear to measure the same thing; the difference is that
games with a Total_Shipped value have no regional sales listed.
Let's make a new column, All_Sales, that combines the total shipped and global sales data.
'''
# Treat a missing value as 0 so the sum picks up whichever column is populated.
df['All_Sales'] = df['Global_Sales'].fillna(0) + df['Total_Shipped'].fillna(0)
df.head(21)


Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,Total_Shipped,Global_Sales,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Last_Update,Vgchartzscore,All_Sales
0,1,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,82.86,,,,,,2006.0,,,82.86
1,2,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,40.24,,,,,,1985.0,,,40.24
2,3,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,37.14,,,,,,2008.0,11th Apr 18,8.7,37.14
3,4,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,36.6,,,,,,2017.0,13th Nov 18,,36.6
4,5,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,33.09,,,,,,2009.0,,8.8,33.09
5,6,Pokemon Red / Green / Blue Version,Role-Playing,E,GB,Nintendo,Game Freak,9.4,,31.38,,,,,,1998.0,,,31.38
6,7,New Super Mario Bros.,Platform,E,DS,Nintendo,Nintendo EAD,9.1,8.1,30.8,,,,,,2006.0,,,30.8
7,8,Tetris,Puzzle,E,GB,Nintendo,Bullet Proof Software,,,30.26,,,,,,1989.0,,,30.26
8,9,New Super Mario Bros. Wii,Platform,E,Wii,Nintendo,Nintendo EAD,8.6,9.2,30.22,,,,,,2009.0,,9.1,30.22
9,10,Minecraft,Misc,,PC,Mojang,Mojang AB,10.0,,30.01,,,,,,2010.0,05th Aug 18,,30.01
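Adding the two columns is only safe if no game has a value in both, since such a row would be counted twice. Here is a small sketch of how that could be checked, using a toy frame with made-up rows rather than the real data. It also shows `fillna` with a second Series as an alternative way to combine the columns. Note that this alternative is an assumption of mine, not what the cell above does: it leaves rows with no sales figure as NaN instead of 0.

```python
import numpy as np
import pandas as pd

# Toy rows: one with only Total_Shipped, one with only Global_Sales, one with neither.
toy = pd.DataFrame({
    "Total_Shipped": [82.86, np.nan, np.nan],
    "Global_Sales": [np.nan, 20.32, np.nan],
})

# Count rows where both columns are populated; these would be double counted by a sum.
both = int((toy["Total_Shipped"].notna() & toy["Global_Sales"].notna()).sum())
print("rows with both values:", both)

# Alternative combination: take Total_Shipped, fall back to Global_Sales.
# Rows missing both stay NaN, which keeps "unknown" distinct from "zero sales".
toy["All_Sales"] = toy["Total_Shipped"].fillna(toy["Global_Sales"])
print(toy)
```

If `both` comes out as 0 on the real data, the sum and the fallback give the same values wherever a sales figure exists.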


In [24]:
df['All_Sales'].describe()

count    55792.000000
mean         0.188992
std          0.972131
min          0.000000
25%          0.000000
50%          0.000000
75%          0.070000
max         82.860000
Name: All_Sales, dtype: float64

In [26]:
# Drop rows without a positive sales figure; All_Sales is the target.
df = df[df['All_Sales'] > 0]

In [28]:
df.shape

(19862, 19)

In [None]:
# I'll probably treat this as a regression problem, though I'm not 100% sure yet.
# I'll probably need a logarithmic transform of total sales because of the extreme values.
# I'll need to split train/validate/test randomly. If I split by year instead, the model
# would be trained mostly on platforms that are obsolete by the test years, so its
# predictions would not carry over.
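The two plans in the notes above, a log transform of the target and a random train/validate/test split, might look roughly like this. It is a sketch under my own assumptions: the sales values are toy numbers, the 60/20/20 proportions and `random_state=42` are arbitrary choices, and `log1p` (log of 1 + x) is used so that very small sales values stay well behaved.

```python
import numpy as np
import pandas as pd

# Toy target values with a long right tail, loosely imitating the sales column.
toy = pd.DataFrame({
    "All_Sales": [82.86, 40.24, 5.0, 1.8, 0.59, 0.36, 0.2, 0.12, 0.07, 0.03]
})

# Log-transform the target; expm1 undoes it when converting predictions back.
toy["log_sales"] = np.log1p(toy["All_Sales"])
recovered = np.expm1(toy["log_sales"])

# Shuffle all rows, then cut into 60% train, 20% validate, 20% test.
shuffled = toy.sample(frac=1, random_state=42)
n = len(shuffled)
train_end = n * 6 // 10
val_end = n * 8 // 10
train = shuffled.iloc[:train_end]
val = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

print(len(train), len(val), len(test))
```

A model trained on `log_sales` predicts on the log scale, so its predictions would need `np.expm1` applied before being read as sales figures.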