# ETL Process: User Reviews Dataset

We will extract the data compressed in the file **user_reviews.json.gz**, then manipulate it and apply the necessary transformations. The end goal is a clean dataset that the API can consume.


We import the required libraries:

In [39]:
import pandas as pd
import gzip
import json
import ast
from textblob import TextBlob

### Decompressing the data

We decompress the file and load the data into a DataFrame. Here we use Python's **'ast'** module because the records are quoted with single quotes rather than the double quotes that standard JSON requires.

In [40]:
data_reviews = []
with gzip.open('../Data/Data-Original/user_reviews.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        data_reviews.append(ast.literal_eval(line))
df_reviews = pd.DataFrame(data_reviews)
df_reviews.head()

| | user_id | user_url | reviews |
|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | [{'funny': '', 'posted': 'Posted November 5, 2... |
| 1 | js41637 | http://steamcommunity.com/id/js41637 | [{'funny': '', 'posted': 'Posted June 24, 2014... |
| 2 | evcentric | http://steamcommunity.com/id/evcentric | [{'funny': '', 'posted': 'Posted February 3.',... |
| 3 | doctr | http://steamcommunity.com/id/doctr | [{'funny': '', 'posted': 'Posted October 14, 2... |
| 4 | maplemage | http://steamcommunity.com/id/maplemage | [{'funny': '3 people found this review funny',... |
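The reason for choosing `ast.literal_eval` over `json.loads` can be illustrated with a minimal sketch. The sample line below is hypothetical, written only to mimic the single-quoted, Python-literal shape of the records in the file:

```python
import ast
import json

# Hypothetical sample line shaped like one record of the file
# (single quotes and Python's True instead of JSON's true).
line = "{'user_id': 'js41637', 'reviews': [{'funny': '', 'recommend': True}]}"

# json.loads rejects single-quoted keys and strings.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval safely evaluates the line as a Python literal.
record = ast.literal_eval(line)

print(parsed_as_json)
print(record['reviews'][0]['recommend'])
```

Unlike `eval`, `ast.literal_eval` only accepts literals (strings, numbers, lists, dicts, booleans, `None`), so it does not execute arbitrary code found in the file.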


We unnest the **'reviews'** column so that each review gets its own row:

In [41]:
df_reviews = df_reviews.explode('reviews')
df_reviews.reset_index(drop=True, inplace=True)
df_reviews.head()

| | user_id | user_url | reviews |
|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | {'funny': '', 'posted': 'Posted November 5, 20... |
| 1 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | {'funny': '', 'posted': 'Posted July 15, 2011.... |
| 2 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | {'funny': '', 'posted': 'Posted April 21, 2011... |
| 3 | js41637 | http://steamcommunity.com/id/js41637 | {'funny': '', 'posted': 'Posted June 24, 2014.... |
| 4 | js41637 | http://steamcommunity.com/id/js41637 | {'funny': '', 'posted': 'Posted September 8, 2... |


In [42]:
df_reviews = pd.concat([df_reviews, pd.json_normalize(df_reviews['reviews'])], axis=1)
df_reviews.drop('reviews', axis=1, inplace=True)
df_reviews.head()

| | user_id | user_url | funny | posted | last_edited | item_id | helpful | recommend | review |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | | Posted November 5, 2011. | | 1250 | No ratings yet | True | Simple yet with great replayability. In my opi... |
| 1 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | | Posted July 15, 2011. | | 22200 | No ratings yet | True | It's unique and worth a playthrough. |
| 2 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | | Posted April 21, 2011. | | 43110 | No ratings yet | True | Great atmosphere. The gunplay can be a bit chu... |
| 3 | js41637 | http://steamcommunity.com/id/js41637 | | Posted June 24, 2014. | | 251610 | 15 of 20 people (75%) found this review helpful | True | I know what you think when you see this title ... |
| 4 | js41637 | http://steamcommunity.com/id/js41637 | | Posted September 8, 2013. | | 227300 | 0 of 1 people (0%) found this review helpful | True | For a simple (it's actually not all that simpl... |
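The two unnesting steps above (one row per review, then one column per review field) can be sketched on a small example. The two-user sample is hypothetical and only mirrors the nested shape of the dataset:

```python
import pandas as pd

# Hypothetical sample with the same nested shape as the dataset.
df = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'reviews': [
        [{'item_id': '10', 'recommend': True}, {'item_id': '20', 'recommend': False}],
        [{'item_id': '30', 'recommend': True}],
    ],
})

# One row per review; resetting the index makes it line up with the
# 0..n-1 index that json_normalize produces.
exploded = df.explode('reviews').reset_index(drop=True)

# Expand each review dict into its own columns.
fields = pd.json_normalize(exploded['reviews'].tolist())

# Put the user columns and the review fields side by side.
flat = pd.concat([exploded.drop(columns='reviews'), fields], axis=1)
print(flat)
```

Resetting the index before concatenating matters: `explode` repeats the original index for every review of the same user, and `pd.concat(..., axis=1)` aligns rows by index.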


In [43]:
df_reviews.duplicated().sum()

874

### Handling duplicates and nulls

We drop the duplicated rows:

In [44]:
df_reviews.drop_duplicates(inplace=True)
df_reviews.duplicated().sum()

0

We check which rows contain null values:

In [45]:
nulos = df_reviews[df_reviews.isnull().any(axis=1)]
nulos

| | user_id | user_url | funny | posted | last_edited | item_id | helpful | recommend | review |
|---|---|---|---|---|---|---|---|---|---|
| 137 | gdxsd | http://steamcommunity.com/id/gdxsd | | | | | | | |
| 177 | 76561198094224872 | http://steamcommunity.com/profiles/76561198094... | | | | | | | |
| 2559 | 76561198021575394 | http://steamcommunity.com/profiles/76561198021... | | | | | | | |
| 10080 | cmuir37 | http://steamcommunity.com/id/cmuir37 | | | | | | | |
| 13767 | Jaysteeny | http://steamcommunity.com/id/Jaysteeny | | | | | | | |
| 15493 | ML8989 | http://steamcommunity.com/id/ML8989 | | | | | | | |
| 19184 | 76561198079215291 | http://steamcommunity.com/profiles/76561198079... | | | | | | | |
| 20223 | 76561198079342142 | http://steamcommunity.com/profiles/76561198079... | | | | | | | |
| 25056 | 76561198061996985 | http://steamcommunity.com/profiles/76561198061... | | | | | | | |
| 26257 | 76561198108286351 | http://steamcommunity.com/profiles/76561198108... | | | | | | | |


The previous output shows that only 28 rows contain null values. These rows are null in every review field, so they carry no useful information and we drop them.

In [47]:
df_reviews.dropna(inplace=True)
df_reviews.isnull().sum()

user_id        0
user_url       0
funny          0
posted         0
last_edited    0
item_id        0
helpful        0
recommend      0
review         0
dtype: int64

We check how many empty values (empty strings) each column contains:

In [51]:
# Count the empty strings in a column
def contar_vacios(columna):
    return columna.apply(lambda x: x == '').sum()

In [53]:
vacios = df_reviews.apply(contar_vacios)
vacios

user_id            0
user_url           0
funny          50421
posted             0
last_edited    52394
item_id            0
helpful            0
recommend          0
review            30
dtype: int64
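The same per-column count can also be obtained without a helper function. This is a minimal alternative sketch on hypothetical sample data, not the approach used in the notebook:

```python
import pandas as pd

# Hypothetical sample with some empty strings.
df = pd.DataFrame({
    'funny': ['', '', '3 people found this review funny'],
    'review': ['Great', '', 'Bad'],
})

# eq('') compares every cell with the empty string;
# summing the booleans gives the count per column.
empty_counts = df.eq('').sum()
print(empty_counts)
```

Note that empty strings are not treated as nulls by pandas, which is why `isnull()` reported zero missing values for these columns.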

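The output above shows that the **'funny'** and **'last_edited'** columns are empty in most rows (50,421 and 52,394 of the 58,431 rows, respectively), so they contribute little information and we drop them.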

In [56]:
df_reviews.drop(['funny','last_edited'], inplace=True, axis=1)

### Sentiment analysis

To address this we use natural language processing techniques to determine whether a review is positive, negative, or neutral. We create a new column called **sentiment_analysis** that holds this result, encoded as 0 (negative), 1 (neutral), or 2 (positive). If a review is missing (null), it is automatically assigned 1 (neutral).

In [57]:
# Function that performs the sentiment analysis
def analyze_sentiment(text):
    if text is None or pd.isnull(text):
        return 1  # Neutral value when the review is missing
    else:
        blob = TextBlob(text)
        sentiment_score = blob.sentiment.polarity
        
        if sentiment_score < 0:
            return 0  # Negative sentiment
        elif sentiment_score == 0:
            return 1  # Neutral sentiment
        else:
            return 2  # Positive sentiment
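TextBlob's polarity is a float in the range [-1.0, 1.0], and the function above only looks at its sign. The thresholding can be isolated in a small sketch that does not require TextBlob; `label_from_polarity` is a hypothetical helper introduced for illustration and is not part of the notebook:

```python
def label_from_polarity(score):
    """Map a polarity score to 0 (negative), 1 (neutral) or 2 (positive)."""
    if score is None:
        return 1  # A missing review is treated as neutral
    if score < 0:
        return 0
    if score == 0:
        return 1
    return 2


# Hypothetical polarity scores, including a missing one.
labels = [label_from_polarity(s) for s in (-0.4, 0.0, 0.65, None)]
print(labels)
```

Because only an exact 0 counts as neutral, even a very weak polarity such as 0.01 is labelled positive. A tolerance band around zero would be a possible refinement, but that is a design choice outside what the notebook does.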

We apply the sentiment analysis and create the new **'sentiment_analysis'** column:


In [58]:
df_reviews['sentiment_analysis'] = df_reviews['review'].apply(analyze_sentiment)

We drop the **'review'** column, which is now replaced by **'sentiment_analysis'**:

In [59]:
df_reviews.drop(columns=['review'], inplace=True)
df_reviews.head()


| | user_id | user_url | posted | item_id | helpful | recommend | sentiment_analysis |
|---|---|---|---|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | Posted November 5, 2011. | 1250 | No ratings yet | True | 2 |
| 1 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | Posted July 15, 2011. | 22200 | No ratings yet | True | 2 |
| 2 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | Posted April 21, 2011. | 43110 | No ratings yet | True | 2 |
| 3 | js41637 | http://steamcommunity.com/id/js41637 | Posted June 24, 2014. | 251610 | 15 of 20 people (75%) found this review helpful | True | 2 |
| 4 | js41637 | http://steamcommunity.com/id/js41637 | Posted September 8, 2013. | 227300 | 0 of 1 people (0%) found this review helpful | True | 0 |


### Column transformations

In [61]:
df_reviews['recommend'].unique()

array([True, False], dtype=object)

The values are already Python booleans stored with dtype object (not the strings 'True' and 'False'), so we cast the column directly to the pandas boolean dtype:

In [64]:
df_reviews['recommend'] = df_reviews['recommend'].astype(bool)
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58431 entries, 0 to 59332
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   user_id             58431 non-null  object
 1   user_url            58431 non-null  object
 2   posted              58431 non-null  object
 3   item_id             58431 non-null  object
 4   helpful             58431 non-null  object
 5   recommend           58431 non-null  bool  
 6   sentiment_analysis  58431 non-null  int64 
dtypes: bool(1), int64(1), object(5)
memory usage: 5.2+ MB
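How a column like **'recommend'** should be converted depends on whether it holds real booleans or their string representations. The following sketch uses hypothetical sample series to show the difference:

```python
import pandas as pd

# Hypothetical samples: real booleans stored as object, and their string form.
as_bools = pd.Series([True, False, True], dtype=object)
as_strings = pd.Series(['True', 'False', 'True'], dtype=object)

# Real booleans can be cast directly.
converted_bools = as_bools.astype(bool)

# Pitfall: any non-empty string is truthy, so 'False' becomes True.
naive = as_strings.astype(bool)

# Strings need an explicit mapping before the cast.
mapped = as_strings.map({'True': True, 'False': False}).astype(bool)

print(converted_bools.dtype, naive.tolist(), mapped.tolist())
```

Checking `unique()` first, as done above, is what tells the two cases apart.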


In [65]:
df_reviews['posted'].unique()

array(['Posted November 5, 2011.', 'Posted July 15, 2011.',
       'Posted April 21, 2011.', ..., 'Posted February 18, 2013.',
       'Posted November 13, 2012.', 'Posted November 3, 2012.'],
      dtype=object)

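We extract the date from the "posted" column and convert it to a datetime. Entries that cannot be parsed into a full date (for example those without a year, such as 'Posted February 3.') become NaT because of errors='coerce'.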


In [68]:
df_reviews['posted'] = df_reviews['posted'].str.extract(r'Posted ([\w\s\d,]+)')

In [71]:
df_reviews['posted'] = pd.to_datetime(df_reviews['posted'], errors='coerce')
df_reviews.head()

| | user_id | user_url | posted | item_id | helpful | recommend | sentiment_analysis |
|---|---|---|---|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | 2011-11-05 | 1250 | No ratings yet | True | 2 |
| 1 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | 2011-07-15 | 22200 | No ratings yet | True | 2 |
| 2 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | 2011-04-21 | 43110 | No ratings yet | True | 2 |
| 3 | js41637 | http://steamcommunity.com/id/js41637 | 2014-06-24 | 251610 | 15 of 20 people (75%) found this review helpful | True | 2 |
| 4 | js41637 | http://steamcommunity.com/id/js41637 | 2013-09-08 | 227300 | 0 of 1 people (0%) found this review helpful | True | 0 |
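The extract-then-parse step can be sketched on a few values. This is an alternative sketch under stated assumptions, not the notebook's exact code: it uses a stricter pattern that only matches dates with a year and passes an explicit `format`, and the sample strings are taken from the outputs shown earlier:

```python
import pandas as pd

posted = pd.Series([
    'Posted November 5, 2011.',
    'Posted February 3.',        # no year
    'Posted June 24, 2014.',
])

# Capture "<Month> <day>, <year>"; rows without a year do not match and yield NaN.
extracted = posted.str.extract(r'Posted (\w+ \d{1,2}, \d{4})', expand=False)

# Parse with an explicit format; unparseable or missing values become NaT.
dates = pd.to_datetime(extracted, format='%B %d, %Y', errors='coerce')
print(dates)
```

Counting `dates.isna().sum()` after this step is a quick way to see how many reviews lose their date in the conversion.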


We convert the **'item_id'** column to integer:

In [72]:
df_reviews['item_id'] = df_reviews['item_id'].astype(int)
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58431 entries, 0 to 59332
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   user_id             58431 non-null  object        
 1   user_url            58431 non-null  object        
 2   posted              48498 non-null  datetime64[ns]
 3   item_id             58431 non-null  int32         
 4   helpful             58431 non-null  object        
 5   recommend           58431 non-null  bool          
 6   sentiment_analysis  58431 non-null  int64         
dtypes: bool(1), datetime64[ns](1), int32(1), int64(1), object(3)
memory usage: 5.0+ MB


Finally, we drop the **'user_url'** column, since it only contains profile links that add no information to the dataset.

In [74]:
df_reviews.drop(['user_url'], inplace=True, axis=1)
df_reviews.head(3)

| | user_id | posted | item_id | helpful | recommend | sentiment_analysis |
|---|---|---|---|---|---|---|
| 0 | 76561197970982479 | 2011-11-05 | 1250 | No ratings yet | True | 2 |
| 1 | 76561197970982479 | 2011-07-15 | 22200 | No ratings yet | True | 2 |
| 2 | 76561197970982479 | 2011-04-21 | 43110 | No ratings yet | True | 2 |


We save the cleaned data to a CSV file:

In [75]:
df_reviews.to_csv('../Data/Data-Limpia/user_reviews.csv', sep=',', encoding='utf-8', index=False)
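CSV does not store column dtypes, so whoever reads the file back (for example the API) has to restore them. The sketch below round-trips a hypothetical one-row frame through an in-memory buffer instead of the real file; the `parse_dates` and `dtype` arguments are a suggestion, not something the notebook specifies:

```python
import io

import pandas as pd

# Hypothetical one-row frame with the same columns and dtypes as the cleaned data.
clean = pd.DataFrame({
    'user_id': ['js41637'],
    'posted': pd.to_datetime(['2014-06-24']),
    'item_id': [251610],
    'recommend': [True],
    'sentiment_analysis': [2],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the datetime column and keep user_id as text when reading back.
restored = pd.read_csv(buffer, parse_dates=['posted'], dtype={'user_id': str})
print(restored.dtypes)
```

Keeping `user_id` as text avoids mixing numeric Steam IDs and custom names in the same column with different types.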