# STEAM Video Game Recommendation System
## EDA

This project builds a recommendation system for STEAM video games.
This first part covers the ETL and EDA of the datasets.

### steam_games

This dataset contains information about the video games. Its attributes are described in the `Diccionario de Datos STEAM`.

In [1]:
import pandas as pd
import gzip
import json

In [2]:
# Extract the data from the gzip-compressed JSON file.

file_path = "./Datasets/steam_games.json.gz"

data_list = []  # List that stores the parsed JSON objects

# Line-by-line parsing is needed because the file contains nested data.
with gzip.open(file_path, 'rt', encoding='utf-8') as f:
    for line in f:
        try:
            obj = json.loads(line)
            data_list.append(obj)
        except json.JSONDecodeError:
            print(f"Error loading line: {line}")
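
The loading pattern above can be tried on a tiny file before running it on the full dataset. The following is a minimal sketch, assuming a hypothetical three-line sample (`sample_lines`) written to a temporary `sample.json.gz`. Neither name comes from the project, and the sketch counts malformed lines instead of printing them.

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample records; the nested list mimics columns such as `genres`.
sample_lines = [
    '{"id": "1", "genres": ["Action", "Indie"]}',
    '{"id": "2", "genres": ["RPG"]}',
    'this line is not valid JSON',
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json.gz")

    # Write the sample as gzip-compressed text, one record per line.
    with gzip.open(sample_path, "wt", encoding="utf-8") as f:
        f.write("\n".join(sample_lines) + "\n")

    records = []
    bad_lines = 0
    # Read it back line by line, parsing each line as an independent JSON object.
    with gzip.open(sample_path, "rt", encoding="utf-8") as f:
        for line in f:
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                bad_lines += 1  # Count malformed lines instead of printing them.

print(len(records), bad_lines)
```

Counting failures rather than printing each one keeps the output short if many lines turn out to be malformed.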

In [3]:
games_df = pd.DataFrame(data_list)
games_df.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


In [4]:
print(f"The DataFrame has {games_df.shape[0]} rows and {games_df.shape[1]} columns.")

The DataFrame has 120445 rows and 13 columns.


In [5]:
# Compute the fraction of null values in each column.
games_df.isnull().mean()

publisher       0.800050
genres          0.760455
app_name        0.733214
title           0.750218
url             0.733198
release_date    0.750359
tags            0.734551
reviews_url     0.733214
specs           0.738760
price           0.744630
early_access    0.733198
id              0.733214
developer       0.760588
dtype: float64

Every column is more than 70% null, and some rows are null in every column, so those rows will be dropped.

In [6]:
games_df = games_df.dropna(how="all").reset_index(drop=True)
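
To make the two steps concrete, here is a small sketch on a hypothetical toy DataFrame (`toy_df` and its values are invented for illustration). It shows that the mean of the null mask is the fraction of nulls per column, and that `how="all"` removes only rows in which every column is null.

```python
import pandas as pd

# Hypothetical toy data: the second row is entirely null, the third only partly.
toy_df = pd.DataFrame(
    {
        "id": ["10", None, "30"],
        "title": ["Game A", None, None],
    }
)

# The mean of the boolean null mask is the fraction of nulls per column.
null_fraction = toy_df.isnull().mean()

# how="all" drops a row only when every column in it is null.
cleaned_df = toy_df.dropna(how="all").reset_index(drop=True)

print(null_fraction)
print(cleaned_df)
```

The partly null third row survives, which is why some nulls are still expected after this step.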

In [7]:
# About 27% of the rows remain.
print(f"The DataFrame has {games_df.shape[0]} rows and {games_df.shape[1]} columns.")

The DataFrame has 32135 rows and 13 columns.


In [8]:
# The fraction of nulls dropped considerably.
games_df.isnull().mean()

publisher       0.250568
genres          0.102163
app_name        0.000062
title           0.063793
url             0.000000
release_date    0.064322
tags            0.005072
reviews_url     0.000062
specs           0.020850
price           0.042850
early_access    0.000000
id              0.000062
developer       0.102661
dtype: float64

In [9]:
games_df.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,False,767400,彼岸领域
4,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,False,773570,


In [10]:
games_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32135 entries, 0 to 32134
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 3.2+ MB


Only the columns that are useful for the EDA will be kept.

In [11]:
columnas_seleccionadas = ["id", "app_name", "title", "url", "genres", "release_date", 
                          "tags", "specs", "price"]

games_df = games_df[columnas_seleccionadas]
games_df.head(3)

Unnamed: 0,id,app_name,title,url,genres,release_date,tags,specs,price
0,761140,Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,"[Action, Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99
1,643980,Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,"[Free to Play, Indie, RPG, Strategy]",2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",Free To Play
2,670290,Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,"[Casual, Free to Play, Indie, Simulation, Sports]",2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",Free to Play


We start with the univariate exploration.

#### id

In [12]:
# The DataFrame has 32135 rows
games_df.id.isna().sum()

2

There are 2 null values; let's look at them.

In [13]:
games_df[games_df.id.isna()]

Unnamed: 0,id,app_name,title,url,genres,release_date,tags,specs,price
74,,,,http://store.steampowered.com/,,,,,19.99
30961,,Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260,"[Action, Adventure]",2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...","[Single-player, Steam Achievements, Steam Trad...",19.99


The first row (index 74) has almost all of its values empty, so it can be safely dropped.

The second row (index 30961) has a null id, but the URL shows that it should be 200260, so we will impute that value. This was confirmed on the Steam site.

In [14]:
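# Drop the almost empty row, then impute the id taken from the URL.
games_df.drop(74, axis=0, inplace=True)
games_df.loc[30961, "id"] = "200260"
# Reset the index only after both label-based operations above.
games_df.reset_index(drop=True, inplace=True)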
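
Reading the id from the URL can also be done programmatically. The sketch below assumes that the numeric id always follows `/app/` in a store URL, as it does in the rows shown here. The helper `extract_app_id` is hypothetical and is not part of the original notebook.

```python
import re
from typing import Optional

# Assumed URL pattern: the numeric id follows "/app/" in Steam store URLs.
APP_ID_PATTERN = re.compile(r"/app/(\d+)")


def extract_app_id(url: str) -> Optional[str]:
    """Return the digits found after '/app/' in the URL, or None if absent."""
    match = APP_ID_PATTERN.search(url)
    return match.group(1) if match else None


print(extract_app_id("http://store.steampowered.com/app/200260"))  # 200260
print(extract_app_id("http://store.steampowered.com/"))  # None
```

Under the same assumption, a helper like this could be mapped over the `url` column to cross-check the existing ids, although any imputed value is still worth confirming manually.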

In [33]:
games_df.id.duplicated().sum() # There are 2 duplicates, which are dropped
games_df.drop_duplicates(subset=["id"], inplace=True)
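
Before dropping duplicates, it can help to look at every row involved. This is a minimal sketch on a hypothetical toy DataFrame (`toy_df` is invented for illustration). It uses `keep=False` to list all members of each duplicated group and then drops the repeats.

```python
import pandas as pd

# Hypothetical toy data with one repeated id.
toy_df = pd.DataFrame(
    {
        "id": ["100", "200", "100", "300"],
        "app_name": ["Game A", "Game B", "Game A", "Game C"],
    }
)

# keep=False marks every member of a duplicated group, not only the repeats.
duplicate_rows = toy_df[toy_df["id"].duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default; reset the index afterwards.
deduped_df = toy_df.drop_duplicates(subset=["id"]).reset_index(drop=True)

print(duplicate_rows)
print(deduped_df)
```

Inspecting `duplicate_rows` first shows whether the repeated ids are exact copies or carry conflicting values in other columns.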

In [35]:
# games_df now has 32132 rows.
games_df.id.nunique() # The number of unique ids matches the number of rows.

32132

Although `id` is numeric, it is an identifier rather than a quantitative variable, so it can stay as an object type.

#### app_name and title

The `app_name` and `title` columns look very similar. Let's explore the differences.