# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset: https://www.kaggle.com/datasets/antonkozyriev/game-recommendations-on-steam

Import the necessary libraries and create your dataframe(s).

In [1]:
import pandas as pd

df = pd.read_csv('games.csv', on_bad_lines='skip')
df.head()

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,price_original,price_discounted,discount,steam_deck
0,11190,Sherlock Holmes versus Jack the Ripper,2009-12-23,True,False,False,Mostly Positive,78,792,9.99,9.99,9.99,0.0,True
1,20700,Starscape,2008-11-03,True,False,False,Very Positive,81,80,7.99,7.99,7.99,0.0,True
2,94202,"Jamestown: Gunpowder, Treason, & Plot",2011-11-10,True,True,False,Positive,90,10,2.99,2.99,2.99,0.0,True
3,212673,Tom Clancy's Ghost Recon Future Soldier® - Khy...,2013-02-26,True,False,False,Mixed,60,10,9.99,9.99,9.99,0.0,True
4,222520,Champions of Regnum,2013-02-27,True,True,True,Mixed,67,1098,0.0,0.0,0.0,0.0,True


In [2]:
df.shape

(13469, 14)

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [3]:
# df was already loaded in In [1], so pandas does not need to be imported or the file read again.
# isnull() returns a True/False mask with the same shape as df.
df.isnull()


Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,price_original,price_discounted,discount,steam_deck
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13464,False,False,False,False,False,False,False,False,False,False,False,False,False,False
13465,False,False,False,False,False,False,False,False,False,False,False,False,False,False
13466,False,False,False,False,False,False,False,False,False,False,False,False,False,False
13467,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [4]:
df.isnull().sum()

# Only price_original and price_discounted have nulls (26 each).
# I need both columns, so I will fill the nulls instead of dropping the columns.

app_id               0
title                0
date_release         0
win                  0
mac                  0
linux                0
rating               0
positive_ratio       0
user_reviews         0
price_final          0
price_original      26
price_discounted    26
discount             0
steam_deck           0
dtype: int64
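
The raw counts above can be easier to judge as a share of all rows, and it can help to look at the affected rows directly. Below is a minimal sketch of that idea. It uses a small made-up frame (`toy`) instead of the real `games.csv`, so the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the games data (values are made up for illustration).
toy = pd.DataFrame({
    "price_final": [9.99, 7.99, 0.0, 4.99],
    "price_original": [9.99, None, 0.0, None],
    "price_discounted": [9.99, None, 0.0, 4.99],
})

# Count and percentage of missing values per column, keeping only columns with nulls.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
missing = missing[missing["n_missing"] > 0]
print(missing)

# Pull out the affected rows to see whether the nulls share a pattern.
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls)
```

On the real data the same pattern would be applied to `df` in place of `toy`.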

In [5]:
df['price_original']=df['price_original'].fillna(value='no original price')
df['price_discounted']=df['price_discounted'].fillna(value='no price discounted')
df.isnull().sum()

# All nulls have been filled.
# Note: filling with text placeholders turns both price columns into object dtype.

app_id              0
title               0
date_release        0
win                 0
mac                 0
linux               0
rating              0
positive_ratio      0
user_reviews        0
price_final         0
price_original      0
price_discounted    0
discount            0
steam_deck          0
dtype: int64
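
Filling numeric columns with text works, but it means the columns can no longer be averaged or plotted as numbers without extra conversion. The sketch below shows two alternatives that keep the column numeric. It runs on a made-up frame (`toy`), the `price_original_missing` flag column is a hypothetical name, and using `price_final` as a stand-in for a missing original price is my assumption, not something the dataset documents.

```python
import pandas as pd

# Toy frame standing in for the games data (values are made up for illustration).
toy = pd.DataFrame({
    "price_final": [9.99, 1.99, 4.99],
    "price_original": [9.99, None, 4.99],
})

# Record which rows were missing before touching the column (hypothetical flag column).
toy["price_original_missing"] = toy["price_original"].isnull()

# Option 1: leave NaN in place; pandas skips NaN in mean(), sum(), and similar methods.
mean_with_nan = toy["price_original"].mean()

# Option 2 (assumption): treat the final price as the original price when it is unknown.
filled = toy["price_original"].fillna(toy["price_final"])

# For contrast, a text placeholder turns the column into object dtype.
as_text = toy["price_original"].fillna("no original price")
print(filled.dtype, as_text.dtype)
```

Either option keeps the price columns usable in `describe()` and in visualizations, while the flag column preserves the information that a value was originally missing.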

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [6]:
df.describe()

# user_reviews is heavily right-skewed (max 6,870,243 vs. a 75th percentile of 146),
# but I don't see any outliers that would need to be removed.

Unnamed: 0,app_id,positive_ratio,user_reviews,price_final,discount
count,13469.0,13469.0,13469.0,13469.0,13469.0
mean,975663.5,83.520974,2979.55,8.48699,8.293043
std,560834.0,12.831449,70372.65,9.346087,22.459214
min,440.0,24.0,10.0,0.0,0.0
25%,489460.0,77.0,20.0,2.49,0.0
50%,910450.0,86.0,47.0,5.99,0.0
75%,1419290.0,93.0,146.0,9.99,0.0
max,2266310.0,100.0,6870243.0,149.99,90.0
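
Reading `describe()` by eye is a reasonable first pass. One common way to make the check explicit is the 1.5 × IQR rule, sketched below on made-up review counts. This is a generic rule of thumb, not a method the assignment prescribes, and a flagged value is only a candidate for review: a very high review count may simply belong to a very popular game.

```python
import pandas as pd

# Made-up review counts with one very large value, loosely mimicking the skew in user_reviews.
reviews = pd.Series([10, 20, 47, 146, 300, 6_870_243], name="user_reviews")

# Interquartile range: the spread of the middle 50% of the values.
q1 = reviews.quantile(0.25)
q3 = reviews.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are flagged, not deleted.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = reviews[(reviews < lower) | (reviews > upper)]
print(outliers)
```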


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address them as needed. Make sure to use code comments to illustrate your thought process.

In [7]:
df=df.drop(['date_release','win','mac','linux','steam_deck'],axis=1)
df.info()

# Dropped the release date, platform, and Steam Deck columns, which have nothing to do with my business issue.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13469 entries, 0 to 13468
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   app_id            13469 non-null  int64  
 1   title             13469 non-null  object 
 2   rating            13469 non-null  object 
 3   positive_ratio    13469 non-null  int64  
 4   user_reviews      13469 non-null  int64  
 5   price_final       13469 non-null  float64
 6   price_original    13469 non-null  object 
 7   price_discounted  13469 non-null  object 
 8   discount          13469 non-null  float64
dtypes: float64(2), int64(3), object(4)
memory usage: 947.2+ KB
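
Irrelevant columns are one kind of unnecessary data. Duplicate rows and columns that hold a single repeated value are two others. The sketch below checks for both on a made-up frame (`toy`); the titles and values are placeholders, not rows from the real dataset.

```python
import pandas as pd

# Toy frame with one repeated row and one constant column (values are made up).
toy = pd.DataFrame({
    "app_id": [11190, 20700, 20700, 94202],
    "title": ["Game A", "Game B", "Game B", "Game C"],
    "win": [True, True, True, True],
})

# Fully duplicated rows, and rows that repeat an app_id that was already seen.
n_duplicate_rows = toy.duplicated().sum()
n_duplicate_ids = toy.duplicated(subset="app_id").sum()

# Columns with only one distinct value carry no information for analysis.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]

# Drop the repeated rows and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicate_rows, n_duplicate_ids, constant_cols, len(deduped))
```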


## Inconsistent Data

Check for inconsistent data and address any that arise. As always, use code comments to illustrate your thought process.

In [8]:
df.head(25)

#I do not see any inconsistencies.

Unnamed: 0,app_id,title,rating,positive_ratio,user_reviews,price_final,price_original,price_discounted,discount
0,11190,Sherlock Holmes versus Jack the Ripper,Mostly Positive,78,792,9.99,9.99,9.99,0.0
1,20700,Starscape,Very Positive,81,80,7.99,7.99,7.99,0.0
2,94202,"Jamestown: Gunpowder, Treason, & Plot",Positive,90,10,2.99,2.99,2.99,0.0
3,212673,Tom Clancy's Ghost Recon Future Soldier® - Khy...,Mixed,60,10,9.99,9.99,9.99,0.0
4,222520,Champions of Regnum,Mixed,67,1098,0.0,0.0,0.0,0.0
5,222634,Train Simulator: Union Pacific GP50 Loco Add-On,Positive,100,10,19.99,19.99,19.99,0.0
6,269150,Luxuria Superbia,Very Positive,85,68,6.99,6.99,6.99,0.0
7,279990,Bridge Constructor Playground,Mostly Positive,74,95,1.99,9.99,1.99,80.0
8,289460,RC Cars,Mostly Positive,78,300,2.99,2.99,2.99,0.0
9,333880,Braveland Wizard,Very Positive,80,342,9.99,9.99,9.99,0.0


## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset? No, only some applied to my dataset.
2. Did the process of cleaning your data give you new insights into your dataset? No, nothing beyond what I already knew from the EDA step.
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations? No.

In [13]:
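# Write to a new file (the name is my choice) so the raw download is not overwritten,
# and skip the index so it does not come back as an extra "Unnamed: 0" column.
df.to_csv('games_cleaned.csv', index=False)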
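
When a frame is saved with the default settings, pandas also writes the row index, and reading the file back turns that index into a column named `Unnamed: 0`. The sketch below shows the difference `index=False` makes. It writes a made-up frame to in-memory buffers rather than to disk, so no real file is touched.

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned data (values are made up).
toy = pd.DataFrame({"app_id": [11190, 20700], "price_final": [9.99, 7.99]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```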