# Data Extraction

We extract the original data stored in the `Data/Data-Original` directory and reshape it to fit our specific requirements.

### Steam Games Dataset

We import the libraries.

In [11]:
import pandas as pd
import gzip
import json
import ast

We decompress the file and load the data into a DataFrame.

In [3]:
data_games = []
with gzip.open('../Data/Data-Original/steam_games.json.gz', 'r') as f:
    for line in f:
        data_games.append(json.loads(line))

df_games = pd.DataFrame(data_games)
df_games.tail()


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
120440,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,2018-01-04,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,False,773640,"Nikita ""Ghost_RUS"""
120441,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,2018-01-04,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,False,733530,Sacada
120442,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,2018-01-04,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,False,610660,Laush Dmitriy Sergeevich
120443,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,False,658870,"xropi,stev3ns"
120444,,,Maze Run VR,,http://store.steampowered.com/app/681550/Maze_...,,"[Early Access, Adventure, Indie, Action, Simul...",http://steamcommunity.com/app/681550/reviews/?...,"[Single-player, Stats, Steam Leaderboards, HTC...",4.99,True,681550,


We unnest the `genres`, `tags`, and `specs` columns.

In [9]:
df_games = df_games.explode('genres')
df_games = df_games.explode('tags')
df_games = df_games.explode('specs')
df_games.tail()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
120444,,,Maze Run VR,,http://store.steampowered.com/app/681550/Maze_...,,VR,http://steamcommunity.com/app/681550/reviews/?...,HTC Vive,4.99,True,681550,
120444,,,Maze Run VR,,http://store.steampowered.com/app/681550/Maze_...,,VR,http://steamcommunity.com/app/681550/reviews/?...,Oculus Rift,4.99,True,681550,
120444,,,Maze Run VR,,http://store.steampowered.com/app/681550/Maze_...,,VR,http://steamcommunity.com/app/681550/reviews/?...,Tracked Motion Controllers,4.99,True,681550,
120444,,,Maze Run VR,,http://store.steampowered.com/app/681550/Maze_...,,VR,http://steamcommunity.com/app/681550/reviews/?...,Standing,4.99,True,681550,
120444,,,Maze Run VR,,http://store.steampowered.com/app/681550/Maze_...,,VR,http://steamcommunity.com/app/681550/reviews/?...,Room-Scale,4.99,True,681550,
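Because the three `explode` calls run one after another, each game ends up with one row per combination of genre, tag, and spec, which is why the same `tags` value repeats across several `specs` values in the output above. A minimal sketch on a made-up two-row frame (the sample values are hypothetical, not taken from the dataset) illustrates how the row count grows:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "app_name": ["Game A", "Game B"],
    "genres": [["Indie", "Casual"], ["Racing"]],
    "tags": [["Indie", "Puzzle", "Casual"], ["Racing", "Indie"]],
    "specs": [["Single-player", "Steam Cloud"], ["Single-player"]],
})

# Same sequence of calls as in the notebook, chained for brevity.
exploded = toy.explode("genres").explode("tags").explode("specs")

# Game A: 2 genres x 3 tags x 2 specs = 12 rows; Game B: 1 x 2 x 1 = 2 rows.
rows_per_game = exploded.groupby("app_name").size().to_dict()
print(rows_per_game)
```

If this multiplication is not wanted downstream, one option is to explode each column into its own table keyed by `id`, though that is a design choice outside what this notebook does.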


We save the extracted data to a CSV file.

In [10]:
df_games.to_csv('../Data/Data-Extraida/steam_games.csv', sep=',', index=False)

### User Reviews Dataset

We decompress the file and load the data into a DataFrame. Here we use Python's **`ast`** module because the strings in the file are delimited with single quotes rather than the double quotes that standard JSON requires.

In [24]:
data_reviews = []
with gzip.open('../Data/Data-Original/user_reviews.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        data_reviews.append(ast.literal_eval(line))
df_reviews = pd.DataFrame(data_reviews)
df_reviews.head()

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
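To illustrate why `json.loads` is not enough here, the sketch below parses a made-up line written in the same single-quoted style (the sample record is hypothetical). `json.loads` rejects it, while `ast.literal_eval` evaluates it as a Python literal:

```python
import ast
import json

# Hypothetical record written as a Python dict literal, which is not valid JSON.
line = "{'user_id': 'example_user', 'reviews': [{'item_id': '1250', 'recommend': True}]}"

try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings and lowercase true/false.
    parsed_as_json = False

# literal_eval only accepts Python literals (dicts, lists, strings, numbers, ...),
# so it does not execute arbitrary code the way eval would.
record = ast.literal_eval(line)
print(parsed_as_json, record["reviews"][0]["recommend"])
```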


Next, we unnest the `reviews` column.

In [25]:
df_reviews = df_reviews.explode('reviews')
df_reviews.reset_index(drop=True, inplace=True)
df_reviews.head()

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted July 15, 2011...."
2,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted April 21, 2011..."
3,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014...."
4,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted September 8, 2..."


In [26]:
# Expand each review dict into its own columns and append them to the frame.
df_reviews = pd.concat([df_reviews, pd.json_normalize(df_reviews['reviews'])], axis=1)

In [27]:
df_reviews.drop('reviews', axis=1, inplace=True)
df_reviews.head()

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,http://steamcommunity.com/id/js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
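The earlier `reset_index` call matters because `pd.json_normalize` returns a frame with a fresh 0..n-1 index, and `pd.concat(..., axis=1)` aligns rows by index. A minimal sketch on made-up data (all values are hypothetical) walks through the same explode, normalize, and concatenate steps:

```python
import pandas as pd

# Hypothetical toy data shaped like the reviews file: one list of dicts per user.
toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reviews": [
        [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
        [{"item_id": "30", "recommend": True}],
    ],
})

# One row per review; resetting the index makes it run 0..n-1 without repeats.
flat = toy.explode("reviews").reset_index(drop=True)

# Expand each dict into columns; to_list() passes json_normalize a plain list of dicts.
details = pd.json_normalize(flat["reviews"].to_list())

# Both frames now share the same index, so the rows line up one to one.
result = pd.concat([flat.drop(columns="reviews"), details], axis=1)
print(result)
```

Without the index reset, the exploded frame would keep repeated index labels and the column-wise concatenation could no longer pair each review with its user.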


We save the extracted data to a CSV file.

In [28]:
df_reviews.to_csv('../Data/Data-Extraida/user_reviews.csv', sep=',', encoding='utf-8', index=False)

### Users Items Dataset

We decompress the file and load the data into a DataFrame. As with the reviews, we use Python's **`ast`** module because the strings in the file are delimited with single quotes rather than the double quotes that standard JSON requires.

In [29]:
data_items = []
with gzip.open('../Data/Data-Original/users_items.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        data_items.append(ast.literal_eval(line))
df_items = pd.DataFrame(data_items)
df_items.head()

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."


Next, we unnest the `items` column.

In [30]:
df_items = df_items.explode('items')
df_items.reset_index(drop=True, inplace=True)
df_items.head()

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '10', 'item_name': 'Counter-Strike..."
1,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '20', 'item_name': 'Team Fortress ..."
2,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '30', 'item_name': 'Day of Defeat'..."
3,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '40', 'item_name': 'Deathmatch Cla..."
4,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '50', 'item_name': 'Half-Life: Opp..."


In [31]:
# Expand each item dict into its own columns and append them to the frame.
df_items = pd.concat([df_items, pd.json_normalize(df_items['items'])], axis=1)

In [33]:
df_items.drop('items', axis=1, inplace=True)
df_items.head()

Unnamed: 0,user_id,items_count,steam_id,user_url,item_id,item_name,playtime_forever,playtime_2weeks
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,10,Counter-Strike,6.0,0.0
1,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,20,Team Fortress Classic,0.0,0.0
2,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,30,Day of Defeat,7.0,0.0
3,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,40,Deathmatch Classic,0.0,0.0
4,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,50,Half-Life: Opposing Force,0.0,0.0


We save the extracted data to a CSV file.

In [34]:
df_items.to_csv('../Data/Data-Extraida/users_items.csv', sep=',', encoding='utf-8', index=False)