This notebook prepares the datasets that will serve as input for the API queries.

In [59]:
import pandas as pd
import html

### PlayTimeGenre

**Task:**

def PlayTimeGenre(genero : str): Must return the release year with the most hours played for the given genre.
Example return value: {"Año de lanzamiento con más horas jugadas para Género X" : 2013}

Some considerations:
* The **game genre** is in the "genres" column of the "steam_games_procesado.csv" dataset.
* The **release year** is in the "year" column of the "steam_games_procesado.csv" dataset.
* The **playtime** is in the "playtime_forever" column of the "user_items_procesado.csv" dataset.
* The two datasets can be linked through the **"id"** column of "steam_games_procesado.csv" and the **"item_id"** column of "user_items_procesado.csv".

In [60]:
games_df = pd.read_csv("./Datasets/steam_games_procesado.csv")
games_df = games_df[["id", "year", "genres"]]

games_df.head()

,id,year,genres
0,761140,2018.0,"['Action', 'Casual', 'Indie', 'Simulation', 'S..."
1,643980,2018.0,"['Free to Play', 'Indie', 'RPG', 'Strategy']"
2,670290,2017.0,"['Casual', 'Free to Play', 'Indie', 'Simulatio..."
3,767400,2017.0,"['Action', 'Adventure', 'Casual']"
4,773570,,


In [61]:
games_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32125 entries, 0 to 32124
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      32125 non-null  int64  
 1   year    29955 non-null  float64
 2   genres  28850 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 753.1+ KB


In [62]:
# Inspect the null values.
games_df.isna().sum()

id           0
year      2170
genres    3275
dtype: int64

Records with null values in "year" or "genres" are of no use for this query, so we will drop them.

In [63]:
# Drop rows with null values in "year" or "genres"
games_df.dropna(subset=["year", "genres"], inplace=True, ignore_index=True)

# Cast "year" to integer
games_df["year"] = games_df["year"].astype("int32")
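
The order of these two steps matters: a float column that still contains NaN cannot be cast to a NumPy integer dtype. The following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical three-row frame mimicking the columns used above.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2018.0, None, 2017.0],
    "genres": ["['Action']", "['Indie']", None],
})

# Casting before dropping nulls fails because NaN has no integer representation.
try:
    toy["year"].astype("int32")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Dropping rows with a null in either column first makes the cast safe.
clean = toy.dropna(subset=["year", "genres"]).reset_index(drop=True)
clean["year"] = clean["year"].astype("int32")
```

Only the first toy row survives, since the other two each have a null in one of the two columns.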

Next, to be able to group by genre, each genre needs to occupy its own row. We will achieve this with the DataFrame.explode() method.

In [64]:
games_df.shape

(28728, 3)

In [65]:
# Use ast.literal_eval() to turn each string representation of a list into
# an actual Python list before calling explode(). Unlike eval(), it only
# parses literals and never executes arbitrary code.
import ast
games_df['genres'] = games_df['genres'].apply(ast.literal_eval)

games_df = games_df.explode("genres")

games_df.reset_index(inplace=True, drop=True)  # Reset the index

games_df.head()

,id,year,genres
0,761140,2018,Action
1,761140,2018,Casual
2,761140,2018,Indie
3,761140,2018,Simulation
4,761140,2018,Strategy
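
The parse-then-explode step can be illustrated in isolation. This is a sketch on a hypothetical two-row frame (the ids come from the output above, the genre lists are shortened for illustration); parsing uses `ast.literal_eval`, which only accepts Python literals:

```python
import ast
import pandas as pd

# Hypothetical rows; in the CSV each genre list is stored as a string.
toy = pd.DataFrame({
    "id": [761140, 767400],
    "year": [2018, 2017],
    "genres": ["['Action', 'Casual']", "['Adventure']"],
})

# Parse each string into a real Python list (literals only, no code execution).
toy["genres"] = toy["genres"].apply(ast.literal_eval)

# One row per (game, genre) pair; the other columns are repeated.
exploded = toy.explode("genres").reset_index(drop=True)
```

Without the parsing step, `explode()` would leave the column untouched, because a string is treated as a scalar rather than a list.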


In [68]:
games_df.genres.unique()

array(['Action', 'Casual', 'Indie', 'Simulation', 'Strategy',
       'Free to Play', 'RPG', 'Sports', 'Adventure', 'Racing',
       'Massively Multiplayer', 'Early Access', 'Animation & Modeling',
       'Video Production', 'Web Publishing', 'Education',
       'Software Training', 'Utilities', 'Design & Illustration',
       'Audio Production', 'Photo Editing', 'Accounting'], dtype=object)

In [67]:
# Decode HTML entities in the 'genres' column, then replace the quoted genre names
games_df['genres'] = games_df['genres'].apply(html.unescape)
games_df['genres'] = games_df['genres'].str.replace("'Animation & Modeling'", 'Animation & Modeling')
games_df['genres'] = games_df['genres'].str.replace("'Design & Illustration'", 'Design & Illustration')
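
As a small illustration of what `html.unescape` does, the sketch below assumes the raw genre names carried the `&amp;` entity; that is a guess about the input, which the notebook does not show before decoding:

```python
import html
import pandas as pd

# Hypothetical genre values as they might appear with HTML entities.
raw = pd.Series(["Animation &amp; Modeling", "Design &amp; Illustration", "Action"])

# html.unescape turns entities such as &amp; back into the characters they encode.
decoded = raw.apply(html.unescape)
```

Values without entities, such as "Action", pass through unchanged.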

In [69]:
games_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71207 entries, 0 to 71206
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      71207 non-null  int64 
 1   year    71207 non-null  int32 
 2   genres  71207 non-null  object
dtypes: int32(1), int64(1), object(1)
memory usage: 1.4+ MB


In [70]:
games_df.shape

(71207, 3)

Now we will load the second dataframe.

In [72]:
items_df = pd.read_csv("./Datasets/user_items_procesado.csv")
items_df = items_df[["item_id", "playtime_forever"]]

items_df.head(3)

,item_id,playtime_forever
0,10,6
1,20,0
2,30,7


In [73]:
items_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5094092 entries, 0 to 5094091
Data columns (total 2 columns):
 #   Column            Dtype
---  ------            -----
 0   item_id           int64
 1   playtime_forever  int64
dtypes: int64(2)
memory usage: 77.7 MB


In [74]:
items_df.isna().sum()  # There are no null values

item_id             0
playtime_forever    0
dtype: int64

Let's sum the hours played for each game.

In [75]:
total_hours_by_game = items_df.groupby("item_id").playtime_forever.sum()
total_hours_by_game

item_id
10        17107858
20          960524
30          756375
40          154424
50          726545
            ...   
528580           0
528660         182
529670           0
529820           0
530720           0
Name: playtime_forever, Length: 10978, dtype: int64

We convert the resulting series to a dataframe.

In [76]:
items_df = total_hours_by_game.to_frame().reset_index()
items_df.head(3)

,item_id,playtime_forever
0,10,17107858
1,20,960524
2,30,756375
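
The aggregation and the conversion to a dataframe can also be done in one call. A possible alternative, shown on hypothetical per-user rows, is to pass `as_index=False` to `groupby`:

```python
import pandas as pd

# Hypothetical per-user rows: the same item_id appears once per owner.
toy_items = pd.DataFrame({
    "item_id": [10, 10, 20, 30, 30],
    "playtime_forever": [6, 4, 0, 7, 1],
})

# as_index=False keeps item_id as a regular column, so the result is already a DataFrame.
totals = toy_items.groupby("item_id", as_index=False)["playtime_forever"].sum()
```

The result has the same two columns as the `to_frame().reset_index()` version, with one row per game.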


Next, we perform an inner join of the two dataframes.

In [81]:
PlayTimeGenre_df = games_df.merge(items_df, left_on='id', right_on='item_id', 
                                  how='inner')
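
To make the effect of the inner join concrete, here is a sketch on made-up frames (the `games` and `totals` names, the extra "Indie" row, and ids 999 and 5 are hypothetical). It also passes `validate="many_to_one"`, an optional check that is not part of the cell above:

```python
import pandas as pd

# Hypothetical exploded games table and per-game playtime totals.
games = pd.DataFrame({
    "id": [70, 70, 1640, 999],
    "year": [1998, 1998, 2006, 2015],
    "genres": ["Action", "Indie", "Strategy", "RPG"],
})
totals = pd.DataFrame({
    "item_id": [70, 1640, 5],
    "playtime_forever": [2650946, 27162, 100],
})

# validate="many_to_one" raises MergeError if item_id is duplicated on the right side.
merged = games.merge(totals, left_on="id", right_on="item_id",
                     how="inner", validate="many_to_one")

# Games without a playtime record (id 999) and items without metadata (item_id 5) are dropped.
```

Because the games table has one row per genre, each game's total playtime is repeated for every genre it belongs to, which is what allows the later sum per genre.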

In [82]:
PlayTimeGenre_df.head()

,id,year,genres,item_id,playtime_forever
0,282010,1997,Action,282010,9319
1,282010,1997,Indie,282010,9319
2,282010,1997,Racing,282010,9319
3,70,1998,Action,70,2650946
4,1640,2006,Strategy,1640,27162


We can now drop the "id" and "item_id" columns.

In [86]:
PlayTimeGenre_df.drop(["id", "item_id"], axis=1, inplace=True)

In [114]:
PlayTimeGenre_df = PlayTimeGenre_df.groupby(["genres", "year"]).playtime_forever.sum()
PlayTimeGenre_df

genres          year
Action          1983      3473
                1984       384
                1988     16001
                1989       607
                1990     18335
                         ...  
Web Publishing  2013    333678
                2014     33641
                2015    348673
                2016       136
                2017      9382
Name: playtime_forever, Length: 343, dtype: int64

In [115]:
# Pivot the genres index level into columns, keeping year as the index
PlayTimeGenre_df = PlayTimeGenre_df.unstack(0)
PlayTimeGenre_df

year,Action,Adventure,Animation & Modeling,Audio Production,Casual,Design & Illustration,Early Access,Education,Free to Play,Indie,...,Photo Editing,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing
1983,3473.0,3473.0,,,3473.0,,,,,,...,,,,,,,,,,
1984,384.0,384.0,,,384.0,,,,,,...,,,,,,,,,,
1987,,7712.0,,,,,,,,,...,,,,235.0,,,,,,
1988,16001.0,29963.0,,,,,,,,13962.0,...,,13962.0,,1.0,,,13962.0,,,
1989,607.0,9311.0,,,,,,,,,...,,,,5084.0,,,,,,
1990,18335.0,5492.0,,,,,,,,,...,,,,1926.0,,,60.0,,,
1991,2502.0,1722.0,,,1355.0,,,,,,...,,0.0,,144.0,,,0.0,,,
1992,1925.0,23795.0,,,2492.0,,,,,,...,,2164.0,,11.0,,,,,,
1993,24623.0,46772.0,,,22039.0,,,,,,...,,,,65.0,,,182863.0,,,
1994,120177.0,15890.0,,,45975.0,,,,,,...,,2092.0,289.0,9351.0,,,1169.0,,,
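
The group-then-unstack reshaping can be followed on a tiny example. This sketch uses hypothetical totals and also shows one way the wide table could be queried with `idxmax`, which is an illustration rather than a step of the notebook:

```python
import pandas as pd

# Hypothetical long-format totals, one row per (genre, year).
long_df = pd.DataFrame({
    "genres": ["Action", "Action", "RPG", "RPG"],
    "year": [2012, 2013, 2012, 2013],
    "playtime_forever": [100, 300, 250, 50],
})

totals = long_df.groupby(["genres", "year"])["playtime_forever"].sum()

# unstack(0) moves the first index level (genres) to the columns, leaving year as the index.
wide = totals.unstack(0)

# idxmax skips NaN and returns, per column, the index label (year) of the largest value.
best_year = wide.idxmax()
```

Genre and year combinations with no data become NaN in the wide table, which explains the many empty cells in the output above.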


In [116]:
PlayTimeGenre_df.to_csv("./Datasets_API/PlayTimeGenre.csv")
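
With the table saved, the API function from the task statement could be sketched roughly as follows. This is only one possible shape: the `table` parameter, the error response, and the in-memory CSV (a stand-in for "./Datasets_API/PlayTimeGenre.csv" with invented numbers) are all assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for ./Datasets_API/PlayTimeGenre.csv so the sketch runs without the file.
csv_text = """year,Action,RPG
2012,100.0,250.0
2013,300.0,
"""

def PlayTimeGenre(genero: str, table: pd.DataFrame) -> dict:
    """Return the release year with the most playtime for the given genre."""
    if genero not in table.columns:
        # Hypothetical error shape; the task statement does not define one.
        return {"error": f"Genre not found: {genero}"}
    year = int(table[genero].idxmax())  # idxmax ignores NaN cells
    return {f"Año de lanzamiento con más horas jugadas para Género {genero}": year}

# Years become the index, one column per genre, as in the saved table.
table = pd.read_csv(io.StringIO(csv_text), index_col="year")
result = PlayTimeGenre("Action", table)
```

Reading the CSV once and passing the table in keeps the lookup cheap per request; the dictionary key follows the Spanish wording given in the task statement.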