# ETL Process: User Items Dataset

We extract the data compressed in the file **users_items.json.gz**, then manipulate it and apply the necessary transformations. The end goal is a clean dataset that the API can consume.


We import the required libraries:

In [1]:
import pandas as pd
import gzip
import json
import ast

### Decompressing the Data

We decompress the file and load the records into a DataFrame. Each line is parsed with Python's **'ast'** module because the records use single quotes instead of the double quotes that standard JSON requires.

In [2]:
data_users_items = []
with gzip.open('../Data/Data-Original/users_items.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        data_users_items.append(ast.literal_eval(line))
df_users_items = pd.DataFrame(data_users_items)
df_users_items.head(3)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
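
To illustrate the choice of parser, here is a minimal sketch with a made-up record (the `line` string is hypothetical, not copied from the file). `json.loads` rejects single-quoted strings, while `ast.literal_eval` reads the same text as a Python literal without executing arbitrary code.

```python
import ast
import json

# Hypothetical record written with Python-style single quotes (not valid JSON).
line = "{'user_id': 'js41637', 'items_count': 1, 'items': [{'item_id': '10'}]}\n"

try:
    json.loads(line)
    json_ok = True
except json.JSONDecodeError:
    # Standard JSON only accepts double-quoted strings and keys.
    json_ok = False

# literal_eval only accepts Python literals (dicts, lists, strings, numbers...).
record = ast.literal_eval(line)

print(json_ok)
print(record['items'][0]['item_id'])
```

If the file had been valid JSON Lines, `json.loads` would be the more conventional choice; `ast.literal_eval` is used here only because of the quoting.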


Next, we unnest the **'items'** column so that each item gets its own row:

In [3]:
df_users_items = df_users_items.explode('items')
df_users_items.reset_index(drop=True,inplace=True)
df_users_items.head(3)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '10', 'item_name': 'Counter-Strike..."
1,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '20', 'item_name': 'Team Fortress ..."
2,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '30', 'item_name': 'Day of Defeat'..."
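
As a small illustration of what `explode` does, the sketch below uses a toy frame (the `toy` data is invented for the example). Each list element becomes its own row, the other columns are repeated, and an empty list yields a single row with a missing value, which is presumably where the null item columns seen later in this notebook come from.

```python
import pandas as pd

# Toy data: user 'a' owns two items, user 'b' owns none.
toy = pd.DataFrame({
    'user_id': ['a', 'b'],
    'items': [[{'item_id': '10'}, {'item_id': '20'}], []],
})

# explode() repeats the original index label for every element,
# so we reset it to get a clean 0..n-1 index.
exploded = toy.explode('items').reset_index(drop=True)

print(exploded)
print(exploded['items'].isna().sum())  # one missing value, from the empty list
```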


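We then expand each item dictionary into separate columns and drop the original **'items'** column:

In [4]: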
df_users_items = pd.concat([df_users_items, pd.json_normalize(df_users_items['items'])], axis=1)
df_users_items.drop('items', axis=1, inplace=True)
df_users_items.head(3)

Unnamed: 0,user_id,items_count,steam_id,user_url,item_id,item_name,playtime_forever,playtime_2weeks
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,10,Counter-Strike,6.0,0.0
1,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,20,Team Fortress Classic,0.0,0.0
2,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,30,Day of Defeat,7.0,0.0
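
The following sketch shows the same flattening step on invented data. `pd.concat(..., axis=1)` aligns rows by index label, so the frame is given a fresh, unique index before the normalized columns are attached. Passing a plain list to `pd.json_normalize` is a choice made for this example so that the result always carries a default 0..n-1 index.

```python
import pandas as pd

# Toy frame with a duplicated index, similar to what explode() leaves behind.
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b'],
    'items': [
        {'item_id': '10', 'playtime_forever': 6},
        {'item_id': '20', 'playtime_forever': 0},
        {'item_id': '30', 'playtime_forever': 7},
    ],
}, index=[0, 0, 1])

# A unique 0..n-1 index lets concat pair each row with its normalized record.
toy = toy.reset_index(drop=True)

# One column per dictionary key, one row per dictionary.
flat = pd.json_normalize(toy['items'].tolist())

result = pd.concat([toy.drop(columns='items'), flat], axis=1)
print(result)
```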


We count the null values in each column and then the number of duplicated rows:

In [6]:
df_users_items.isnull().sum()

user_id                 0
items_count             0
steam_id                0
user_url                0
item_id             16806
item_name           16806
playtime_forever    16806
playtime_2weeks     16806
dtype: int64

In [7]:
df_users_items.duplicated().sum()

59196

### Handling Duplicates and Nulls

We remove the duplicated rows:

In [8]:
df_users_items.drop_duplicates(inplace=True)
df_users_items.duplicated().sum()

0
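
For reference, a minimal sketch of how the duplicate helpers behave on invented data. `duplicated()` flags only the repeats after the first occurrence, `keep=False` flags every copy (useful for inspecting them before deleting), and `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Toy data: the first two rows are identical.
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b'],
    'item_id': ['10', '10', '20'],
})

n_repeats = int(toy.duplicated().sum())          # repeats after the first copy
all_copies = toy[toy.duplicated(keep=False)]     # every row that has a twin
deduped = toy.drop_duplicates()                  # keeps the first occurrence

print(n_repeats)
print(all_copies)
print(deduped)
```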

In [9]:
df_users_items.isnull().sum()

user_id                 0
items_count             0
steam_id                0
user_url                0
item_id             16714
item_name           16714
playtime_forever    16714
playtime_2weeks     16714
dtype: int64

We drop the rows in which the columns **'item_id', 'item_name', 'playtime_forever', 'playtime_2weeks'** are all null in the same record, since such rows carry no useful information.

In [10]:
df_users_items.dropna(subset=['item_id','item_name','playtime_forever','playtime_2weeks'], how='all', inplace=True)
df_users_items.isnull().sum()

user_id             0
items_count         0
steam_id            0
user_url            0
item_id             0
item_name           0
playtime_forever    0
playtime_2weeks     0
dtype: int64
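
To make the effect of `how='all'` concrete, here is a small sketch on invented data comparing it with `how='any'`. With `how='all'` a row is removed only when every column in `subset` is null; with `how='any'` a single null is enough.

```python
import numpy as np
import pandas as pd

# Toy data: row 'b' has no item data at all, row 'c' lacks only the name.
toy = pd.DataFrame({
    'user_id': ['a', 'b', 'c'],
    'item_id': ['10', np.nan, '30'],
    'item_name': ['Counter-Strike', np.nan, np.nan],
})
cols = ['item_id', 'item_name']

all_null_dropped = toy.dropna(subset=cols, how='all')  # removes only 'b'
any_null_dropped = toy.dropna(subset=cols, how='any')  # removes 'b' and 'c'

print(all_null_dropped['user_id'].tolist())
print(any_null_dropped['user_id'].tolist())
```

In this notebook the four item columns are null together, so both settings would remove the same rows; `how='all'` simply states the intent more explicitly.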

### Column Transformations

We convert the **'item_id'** column to integer type:

In [15]:
df_users_items['item_id'] = df_users_items['item_id'].astype(int)
df_users_items.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5094105 entries, 0 to 5170013
Data columns (total 8 columns):
 #   Column            Dtype  
---  ------            -----  
 0   user_id           object 
 1   items_count       int64  
 2   steam_id          object 
 3   user_url          object 
 4   item_id           int32  
 5   item_name         object 
 6   playtime_forever  float64
 7   playtime_2weeks   float64
dtypes: float64(2), int32(1), int64(1), object(4)
memory usage: 330.4+ MB
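
The `info()` output reports `int32` for **'item_id'**, which is what `astype(int)` can produce on platforms where the default integer is 32 bits wide. As a hedged sketch on invented data, the width can be requested explicitly, and pandas' nullable `'Int64'` dtype is an option when missing values have to be kept instead of dropped. Neither variant is part of the original notebook.

```python
import pandas as pd

# Toy column of numeric strings, like 'item_id' before the conversion.
item_id = pd.Series(['10', '20', '30'])

# Requesting the width explicitly gives the same dtype on every platform.
as_int64 = item_id.astype('int64')

# With a missing value, a plain integer cast would fail;
# the nullable 'Int64' dtype stores it as <NA> instead.
with_gap = pd.Series(['10', None, '30'])
nullable = pd.to_numeric(with_gap).astype('Int64')

print(as_int64.dtype)
print(nullable)
```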


We save the data to a file:

In [16]:
df_users_items.to_csv('../Data/Data-Limpia/users_items.csv', sep=',', encoding='utf-8', index=False)
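
CSV files do not store column types, so a consumer reading the file back may want to state them. The sketch below uses an in-memory buffer and invented rows to show one way of doing this: **'user_id'** and **'steam_id'** are read as strings so that all-digit identifiers are not inferred as integers. The `dtype` mapping is an assumption about how the API would load the file.

```python
import io

import pandas as pd

# Toy rows shaped like the cleaned dataset (values are illustrative).
cleaned = pd.DataFrame({
    'user_id': ['76561197970982479', 'js41637'],
    'steam_id': ['76561197970982479', '76561198035864385'],
    'item_id': [10, 10],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, sep=',', index=False)
buffer.seek(0)

# Declare the identifier columns as strings when reading back.
restored = pd.read_csv(buffer, dtype={'user_id': str, 'steam_id': str})

print(restored.dtypes)
```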