# Enshitificator

- Before dirtying the dataset, I will check that the clean version loads correctly

In [202]:
import pandas as pd
import numpy as np


df = pd.read_csv("source_clean_dataset.csv")
df.head()

,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [203]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6169 entries, 0 to 6168
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          6169 non-null   object 
 1   Platform      6169 non-null   object 
 2   Year          6169 non-null   int64  
 3   Genre         6169 non-null   object 
 4   Publisher     6095 non-null   object 
 5   NA_Sales      6169 non-null   float64
 6   EU_Sales      6169 non-null   float64
 7   JP_Sales      6169 non-null   float64
 8   Other_Sales   6169 non-null   float64
 9   Global_Sales  6169 non-null   float64
dtypes: float64(5), int64(1), object(4)
memory usage: 482.1+ KB
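
Beyond `head()` and `info()`, a small baseline report makes it easier to confirm later that every problem in the dirty file was put there on purpose. The following is a minimal sketch on toy data, not the real file; the helper name `baseline_report` and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def baseline_report(frame: pd.DataFrame) -> dict:
    """Summarise row count, missing values per column and exact duplicate rows."""
    return {
        "rows": len(frame),
        "missing": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy stand-in for the clean dataset (hypothetical rows).
toy = pd.DataFrame({
    "Name": ["Wii Sports", "Tetris", "Tetris"],
    "Platform": ["Wii", "GB", "NES"],
    "Publisher": ["Nintendo", None, "Nintendo"],
})

report = baseline_report(toy)
print(report)
```

Running the same helper on `df` before and after each step would show exactly which counts changed.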


## Missing data

- I will erase 10 randomly selected values from the Year column

In [204]:
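erase_year = df.sample(n=10, random_state=42).index  # 10 reproducible random rows
# Cast first: NaN needs a float column, and newer pandas warns on implicit upcasting.
df['Year'] = df['Year'].astype('float64')
df.loc[erase_year, 'Year'] = np.nan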

In [205]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6169 entries, 0 to 6168
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          6169 non-null   object 
 1   Platform      6169 non-null   object 
 2   Year          6159 non-null   float64
 3   Genre         6169 non-null   object 
 4   Publisher     6095 non-null   object 
 5   NA_Sales      6169 non-null   float64
 6   EU_Sales      6169 non-null   float64
 7   JP_Sales      6169 non-null   float64
 8   Other_Sales   6169 non-null   float64
 9   Global_Sales  6169 non-null   float64
dtypes: float64(6), object(4)
memory usage: 482.1+ KB
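
The `info()` output shows a side effect of this step: Year changed from int64 to float64, because NaN is a float and cannot be stored in a plain integer column. A minimal sketch of the same effect on toy data follows; the toy years and the `Year_nullable` column are assumptions for illustration, and the nullable `Int64` dtype is shown only as one possible way to keep whole-number years next to missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Year column (hypothetical values).
toy = pd.DataFrame({"Year": [2006, 1985, 2008, 2009, 1996]})

# Cast explicitly so the NaN assignment does not depend on implicit upcasting.
toy["Year"] = toy["Year"].astype("float64")

# Same pattern as above: sample row labels reproducibly, then blank them out.
rows_to_blank = toy.sample(n=2, random_state=42).index
toy.loc[rows_to_blank, "Year"] = np.nan

# Nullable integer dtype: missing entries become <NA>, the rest stay integers.
toy["Year_nullable"] = toy["Year"].astype("Int64")

print(toy.dtypes)
print(toy)
```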


## Duplicated rows and outlier values

- The Tetris row (Game Boy version) will be copied and modified so that the dataset contains both a duplicated row and an outlier value

In [206]:
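# Random insertion point for the copy; unseeded, so it changes on every run.
pos = np.random.randint(0, len(df))

# Row 5 is Tetris (GB); copy it and give the copy an absurd Global_Sales value.
fila_cop = df.loc[5].copy()
fila_cop["Global_Sales"] = 9999.99
fila_cop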

Name              Tetris
Platform              GB
Year              1989.0
Genre             Puzzle
Publisher       Nintendo
NA_Sales            23.2
EU_Sales            2.26
JP_Sales            4.22
Other_Sales         0.58
Global_Sales     9999.99
Name: 5, dtype: object

In [207]:
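# Splice the modified copy into the frame at position pos.
df = pd.concat([df.iloc[:pos], fila_cop.to_frame().T, df.iloc[pos:]], ignore_index=True)
# The transposed Series is all-object, so restore the numeric dtypes.
df = df.infer_objects()
df.query("Name == 'Tetris'")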

,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
5,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26
72,Tetris,NES,1988.0,Puzzle,Nintendo,2.97,0.69,1.81,0.11,5.58
5242,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,0.58,9999.99
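
Because the copy differs from the original in Global_Sales, it is not an exact duplicate, so a plain `duplicated()` call will miss it. The sketch below rebuilds the three Tetris rows from the output above and shows one way the row could be caught: duplicates on a key subset, plus a consistency check of Global_Sales against the sum of the regional columns. The choice of key columns and the 0.05 rounding tolerance are assumptions, not part of the original notebook.

```python
import pandas as pd

# The three Tetris rows shown above, rebuilt by hand.
toy = pd.DataFrame({
    "Name": ["Tetris", "Tetris", "Tetris"],
    "Platform": ["GB", "NES", "GB"],
    "Year": [1989.0, 1988.0, 1989.0],
    "NA_Sales": [23.2, 2.97, 23.2],
    "EU_Sales": [2.26, 0.69, 2.26],
    "JP_Sales": [4.22, 1.81, 4.22],
    "Other_Sales": [0.58, 0.11, 0.58],
    "Global_Sales": [30.26, 5.58, 9999.99],
})

# Exact duplicates: none, because Global_Sales differs in the copied row.
exact_dupes = int(toy.duplicated().sum())

# Duplicates on an assumed identifying key (Name, Platform, Year).
key_dupes = toy.duplicated(subset=["Name", "Platform", "Year"], keep=False)

# Outlier check: Global_Sales should be close to the sum of the regions.
regional = toy[["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]].sum(axis=1)
suspicious = (toy["Global_Sales"] - regional).abs() > 0.05  # assumed tolerance

print(exact_dupes)
print(toy[key_dupes & suspicious])
```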


## Format inconsistencies, incorrect data type and extra punctuation symbol

- I will convert the Other_Sales column to strings, prefixing each value with a dollar sign and a space

In [208]:
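try:
    # Format each value as a string such as "$ 8.46".
    df['Other_Sales'] = df['Other_Sales'].apply(lambda x: "$ {:.2f}".format(x))
except (ValueError, TypeError):
    # Re-running the cell fails here because the values are already strings.
    print("Other_Sales is already in string format")

df.head()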

,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,$ 8.46,82.74
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,$ 0.77,40.24
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,$ 3.31,35.82
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,$ 2.96,33.0
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,$ 1.00,31.37


In [209]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6170 entries, 0 to 6169
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          6170 non-null   object 
 1   Platform      6170 non-null   object 
 2   Year          6160 non-null   float64
 3   Genre         6170 non-null   object 
 4   Publisher     6096 non-null   object 
 5   NA_Sales      6170 non-null   float64
 6   EU_Sales      6170 non-null   float64
 7   JP_Sales      6170 non-null   float64
 8   Other_Sales   6170 non-null   object 
 9   Global_Sales  6170 non-null   float64
dtypes: float64(5), object(5)
memory usage: 482.2+ KB
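
The sketch below repeats the formatting step on toy values, shows why applying it a second time raises an error (the `.2f` format code does not accept strings), and outlines one way the column could be turned back into numbers. The toy values and the variable names `second_pass_failed` and `restored` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Other_Sales column (values taken from the head() output).
toy = pd.DataFrame({"Other_Sales": [8.46, 0.77, 1.0]})

# Same formatting as above, using map with the bound str.format method.
toy["Other_Sales"] = toy["Other_Sales"].map("$ {:.2f}".format)

# Formatting an already formatted value fails: 'f' is not valid for strings.
try:
    "$ {:.2f}".format(toy.loc[0, "Other_Sales"])
    second_pass_failed = False
except ValueError:
    second_pass_failed = True

# One possible clean-up: drop the symbol, trim the space, parse as numbers.
restored = pd.to_numeric(
    toy["Other_Sales"].str.replace("$", "", regex=False).str.strip()
)

print(toy["Other_Sales"].tolist())
print(restored.tolist())
```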


## Typographical errors and incorrect header

- The Genre column will be renamed to the misspelled gerne, and the Name column to the uninformative Column_1

In [210]:
df.rename(columns={"Genre": "gerne","Name": "Column_1"}, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6170 entries, 0 to 6169
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Column_1      6170 non-null   object 
 1   Platform      6170 non-null   object 
 2   Year          6160 non-null   float64
 3   gerne         6170 non-null   object 
 4   Publisher     6096 non-null   object 
 5   NA_Sales      6170 non-null   float64
 6   EU_Sales      6170 non-null   float64
 7   JP_Sales      6170 non-null   float64
 8   Other_Sales   6170 non-null   object 
 9   Global_Sales  6170 non-null   float64
dtypes: float64(5), object(5)
memory usage: 482.2+ KB
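
A quick way to see what this step broke is to compare the headers against the schema of the clean file. This is a minimal sketch on an empty toy frame; the `expected` list is copied from the first `info()` output, and the variable names are assumptions.

```python
import pandas as pd

# Column names of the clean dataset, as listed by the first info() call.
expected = [
    "Name", "Platform", "Year", "Genre", "Publisher",
    "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales",
]

# Empty toy frame with the clean headers, then the same rename as above
# (without inplace=True, rename returns a new frame).
toy = pd.DataFrame(columns=expected).rename(
    columns={"Genre": "gerne", "Name": "Column_1"}
)

# Headers that disappeared, and headers that should not be there.
missing = [col for col in expected if col not in toy.columns]
unexpected = [col for col in toy.columns if col not in expected]

print(missing)
print(unexpected)
```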


## Extra categories

- The NaN values in the Publisher column will be replaced with the meaningless string 'N/A'

In [211]:
df.fillna({'Publisher': 'N/A'}, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6170 entries, 0 to 6169
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Column_1      6170 non-null   object 
 1   Platform      6170 non-null   object 
 2   Year          6160 non-null   float64
 3   gerne         6170 non-null   object 
 4   Publisher     6170 non-null   object 
 5   NA_Sales      6170 non-null   float64
 6   EU_Sales      6170 non-null   float64
 7   JP_Sales      6170 non-null   float64
 8   Other_Sales   6170 non-null   object 
 9   Global_Sales  6170 non-null   float64
dtypes: float64(5), object(5)
memory usage: 482.2+ KB


In [212]:
df.query("Publisher == 'N/A'").head(5)

,Column_1,Platform,Year,gerne,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
570,Shrek / Shrek 2 2-in-1 Gameboy Advance Video,GBA,2007.0,Misc,N/A,0.87,0.32,0.0,$ 0.02,1.21
754,Bentley's Hackpack,GBA,2005.0,Misc,N/A,0.67,0.25,0.0,$ 0.02,0.93
762,Teenage Mutant Ninja Turtles,GBA,2003.0,Action,N/A,0.67,0.25,0.0,$ 0.02,0.93
1040,Nicktoons Collection: Game Boy Advance Video V...,GBA,2004.0,Misc,N/A,0.46,0.17,0.0,$ 0.01,0.64
1042,SpongeBob SquarePants: Game Boy Advance Video ...,GBA,2004.0,Misc,N/A,0.46,0.17,0.0,$ 0.01,0.64
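
One caveat about this placeholder: 'N/A' is on the list of strings that `read_csv` treats as missing by default, so once the frame is written to CSV and read back with default settings, the extra category turns into NaN again. The sketch below shows this round trip on toy data held in memory; the toy rows are assumptions, and `keep_default_na=False` is shown only as the way to keep the literal string.

```python
import io

import pandas as pd

# Toy stand-in with one missing publisher (hypothetical rows).
toy = pd.DataFrame({
    "Column_1": ["Bentley's Hackpack", "Tetris"],
    "Publisher": [None, "Nintendo"],
})
toy = toy.fillna({"Publisher": "N/A"})

# Round trip through CSV text instead of a file on disk.
csv_text = toy.to_csv(index=False)

# Default parsing: 'N/A' is recognised as a missing-value marker.
default_read = pd.read_csv(io.StringIO(csv_text))

# With the default markers disabled, the literal string survives.
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["Publisher"].tolist())
print(literal_read["Publisher"].tolist())
```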


## Non-default file encoding

- I will write the file with latin1 encoding instead of UTF-8, so pandas, which assumes UTF-8 by default, should fail to read it

In [213]:
df.to_csv("dirty_dataset.csv", index=False, encoding="latin1", errors="replace")

In [214]:
try:
    # read_csv assumes UTF-8 unless an encoding is passed.
    df2 = pd.read_csv("dirty_dataset.csv")
    df2.head()
except UnicodeDecodeError:
    print("This file could not be read properly and my partner will have to fix it.")

This file could not be read properly and my partner will have to fix it.
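
The read fails only if the file contains at least one non-ASCII character, since plain ASCII text is byte-for-byte identical in latin1 and UTF-8. A minimal sketch of the mechanism on in-memory toy data follows; the game title with an accented letter is a made-up example, and passing `encoding="latin1"` is shown as the likely fix rather than as what my partner will necessarily do.

```python
import io

import pandas as pd

# Toy frame with one non-ASCII character (hypothetical title).
toy = pd.DataFrame({
    "Column_1": ["Pokémon", "Tetris"],
    "Global_Sales": [31.37, 30.26],
})

# Encode the CSV text as latin1, as the export above does.
raw = toy.to_csv(index=False).encode("latin1")

# In latin1 the accented letter is the single byte 0xE9, which is not valid UTF-8 here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling pandas the real encoding recovers the data.
recovered = pd.read_csv(io.BytesIO(raw), encoding="latin1")

print(utf8_ok)
print(recovered)
```

Note that `errors="replace"` in the export swaps any character that latin1 cannot represent for a question mark, so those characters cannot be recovered from the dirty file.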
