# Data Cleaning, Munging and Wrangling

### The steps involved in processing and merging data
 
Real-world data from different sources often doesn't play well together and requires pre-processing before merging. The trade-offs in this process depend on the end goal:

- Sometimes a smaller set of high quality data is required

- Sometimes a larger set with minor integrity issues is preferred

In [1]:
# Dependencies

import pandas as pd
import json

In [2]:
# Load the raw JSON file (it is converted to a dataframe in a later cell)

with open("./Resources/steam-reviews.json", 'r', encoding="utf-8") as reviews:
    json_raw = json.load(reviews)

In [3]:
# Display json format

# json_raw

# Handling messy data

In [4]:
# Normalize json data and convert to dataframe

review_raw_df = pd.json_normalize(json_raw)

review_df = review_raw_df.copy()

review_df.head(3)

Unnamed: 0,img_url,date,developer,publisher,popu_tags,price,name,categories,full_desc.sort,full_desc.desc,...,requirements.recommended.macOS.graphics,requirements.recommended.macOS.os,requirements.recommended.linux.processor,requirements.recommended.linux.memory,requirements.recommended.linux.graphics,requirements.recommended.linux.os,requirements.minimum.macOS.processor,requirements.minimum.macOS.memory,requirements.minimum.macOS.graphics,requirements.minimum.macOS.os
0,https://steamcdn-a.akamaihd.net/steam/apps/945...,"Nov 16, 2018",Innersloth,Innersloth,"[Multiplayer, Online, Space, Social, Deduction...",499,Among Us,"[Online PvPLAN, PvPOnline Co-opLAN, Co-opCross...",game,About This Game Play with 4-10 player online o...,...,,,,,,,,,,
1,https://steamcdn-a.akamaihd.net/steam/apps/730...,"Aug 21, 2012","Valve, Hidden Path Entertainment",Valve,"[Shooter, Multiplayer, Competitive, Action, Te...",free,Counter-Strike: Global Offensive,"[Steam Achievements Full, controller supportSt...",game,About This Game Counter-Strike: Global Offensi...,...,,,,,,,,,,
2,https://steamcdn-a.akamaihd.net/steam/apps/109...,"Aug 3, 2020",Mediatonic,Devolver Digital,"[Multiplayer, Funny, Battle, Royale, Online, F...",199,Fall Guys: Ultimate Knockout,"[MMOOnline PvPOnline, Co-opSteam Achievements ...",game,About This Game Fall Guys: Ultimate Knockout f...,...,,,,,,,,,,


In [5]:
# The date column is inconsistent in both format and quality
# Dates are one of the most common types of data that require cleaning

review_df.date.unique()

array(['Nov 16, 2018', 'Aug 21, 2012', 'Aug 3, 2020', ...,
       'time is subjective', 'Eventually. Check back often.',
       'Q3/Q4 2021'], dtype=object)
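
One possible way to separate usable dates from free-text entries is to parse with an explicit format and let everything else fall through as `NaT`. The following is a minimal sketch, not the notebook's own method: `sample_dates` is a hypothetical series built from values visible in the output above, and the format string assumes the `Nov 16, 2018` style and an English locale.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of values seen in the date column
sample_dates = pd.Series(
    ["Nov 16, 2018", "Aug 3, 2020", "time is subjective", "Q3/Q4 2021"]
)

# Parse with an explicit format; anything that does not match becomes NaT
parsed_dates = pd.to_datetime(sample_dates, format="%b %d, %Y", errors="coerce")

# Count how many values could not be parsed
unparsed_count = parsed_dates.isna().sum()
print(parsed_dates)
print("Unparsed values:", unparsed_count)
```

Coercing instead of raising keeps every row in place, so the decision to drop or keep unparsed rows can be made later, depending on the end goal.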

In [6]:
# Language differences are another common source of cleaning work

# Fill NaN with the string "None" before boolean masking
review_df["developer"] = review_df["developer"].fillna("None")

# Example of inconsistent naming across languages: developers with no Latin letters
review_df[["developer","name"]][~review_df['developer'].str.contains('[A-Za-z]')].head(5)

Unnamed: 0,developer,name
858,艺龙游戏,嗜血印 Bloody Spell
1520,上海烛龙信息科技有限公司,古剑奇谭三(Gujian3)
2067,墨鱼玩游戏,Chinese Parents
2193,搞快点工作室,探灵笔记/拾遗记-1V5(Notes of Soul)
2753,甘肃嘉元数字科技有限公司,The Wind Road 紫塞秋风


In [7]:
# Example of inconsistent naming with multiple languages - Count of instances
review_df[["developer","name"]][~review_df['developer'].str.contains('[A-Za-z]')].count()

developer    920
name         920
dtype: int64
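
Filling missing developers with the string `"None"` means they match `[A-Za-z]` and are silently grouped with Latin-script names. An alternative, sketched here under assumptions, is to keep missing values distinguishable by using the `na` argument of `str.contains`. The `developers` frame and the `has_latin` and `is_missing` column names are hypothetical; the two non-Latin names are taken from the output above.

```python
import pandas as pd

# Hypothetical mini-frame; None stands in for a missing developer
developers = pd.DataFrame(
    {"developer": ["Valve", "艺龙游戏", None, "墨鱼玩游戏"]}
)

# Flag names containing at least one Latin letter; missing values count as False
developers["has_latin"] = developers["developer"].str.contains("[A-Za-z]", na=False)

# Track missing values separately instead of filling them with a string
developers["is_missing"] = developers["developer"].isna()

# Rows with a developer name that contains no Latin letters
non_latin = developers[~developers["has_latin"] & ~developers["is_missing"]]
print(non_latin)
```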

In [8]:
# Example of extracting the date series as a sorted list
test = review_df.date.sort_values().tolist()

# First 10 "dates"
test[0:10]

['"Coming Soon"',
 '"Coming Soon"',
 '"On 100.000 wishlists!"',
 '"On 100.000 wishlists!"',
 "'coming soon'",
 '(Hopefully) Q3 2020',
 '06 2019',
 '09/20/2020',
 '1 Apr, 1994',
 '1 Apr, 2011']
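
The sorted list shows several date layouts mixed with placeholder text. One way this could be handled is to strip stray quotes and then try a short list of formats in turn, keeping the first successful parse per row. This is a sketch only: `raw_dates` is a hypothetical sample based on the values above, and `candidate_formats` is an assumption drawn from the visible entries, not a complete list for the dataset.

```python
import pandas as pd

# Hypothetical sample built from values shown in the sorted output
raw_dates = pd.Series(['"Coming Soon"', "09/20/2020", "1 Apr, 2011", "Nov 16, 2018"])

# Candidate formats are an assumption based on the visible values only
candidate_formats = ["%b %d, %Y", "%d %b, %Y", "%m/%d/%Y"]

# Strip stray quote characters and surrounding whitespace
cleaned = raw_dates.str.strip("\"' ")

# Start with the first format, then fill remaining gaps with each later format
parsed = pd.to_datetime(cleaned, format=candidate_formats[0], errors="coerce")
for fmt in candidate_formats[1:]:
    attempt = pd.to_datetime(cleaned, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

print(parsed)
```

Entries such as `06 2019` or `(Hopefully) Q3 2020` carry only partial date information, so whether to approximate them or leave them as `NaT` depends on how much imprecision the end goal can tolerate.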