# MLOps Steam


Buenas y bienvenidos a este Notebook donde haremos el proceso de ETL a 3 datasets brindados por la plataforma de juegos Steam donde nosotros podremos practicar y brindar una solucion al problema que estan teniendo. Una vez que tratemos los datos nuestro objetivo sera hacer un analisis exploratorio de los datos y a raiz de esto sacar un modelo funcional de inteligencia artificial, que podra ser consumida desde una api por Render.


Comenzemos con la lectura de los datos y la limpieza de los mismos.

In [174]:
#instalamos todas las librerias necesarias
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import ast

In [175]:
import json

data = []
with open('data/output_steam_games.json', 'r') as f:
    for line in f:
        try:
            obj = json.loads(line)
            data.append(obj)
        except json.JSONDecodeError as e:
            print("Error en línea:", line)

# Convierte la lista de objetos en un DataFrame
steam = pd.DataFrame(data)

# Imprime el DataFrame
print(steam.shape)
steam.head()


(120445, 13)


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


In [176]:
steam = steam.dropna(thresh=3)
print(steam.shape)
steam.head(3)

(32135, 13)


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
88310,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro
88311,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
88312,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com


In [177]:
rows = []
with open('data/australian_users_items.json', 'r', encoding='UTF-8') as f:
    for line in f.readlines():
        rows.append(ast.literal_eval(line))

MemoryError: 

In [None]:
user_items = pd.DataFrame(rows)

In [None]:
user_items = user_items.dropna(thresh=3)
print(user_items.shape)
user_items.head(3)

(88310, 5)


Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."


In [None]:
rows = []
with open('data/australian_user_reviews.json', 'r', encoding='UTF-8') as f:
    for line in f.readlines():
        rows.append(ast.literal_eval(line))

In [None]:
user_reviews = pd.DataFrame(rows)

In [None]:
user_reviews = user_reviews.dropna(thresh=3)
print(user_reviews.shape)
user_reviews.head(3)


(25799, 3)


Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."


In [None]:
data_desanidada = []

for index, row in user_items.iterrows():
    user_id = row['user_id']
    items_count = row['items_count']
    steam_id = row['steam_id']
    user_url = row['user_url']
    items = row['items']
    
    for i in items:   
        new_row = {
        'user_id': user_id,
        'items_count': items_count,
        'steam_id' : steam_id,
        'user_url' : user_url,
        'item_id': i.get('item_id', ''),
        'item_name': i.get('item_name', ''),
        'playtime_forever': i.get('playtime_forever', ''),
        'playtime_2weeks': i.get('playtime_2weeks', '')
        }
        
        data_desanidada.append(new_row)

user_items_completo = pd.DataFrame(data_desanidada)

In [191]:
data_desanidada = []

for index, row in user_reviews.iterrows():
    user_id = row['user_id']
    user_url = row['user_url']
    reviews = row['reviews']
    
    for i in reviews:   
        new_row = {
        'user_id': user_id,
        'user_url': user_url,
        'reviews' : reviews,
        'funny': i.get('funny', ''),
        'posted': i.get('posted', ''),
        'last_edited': i.get('last_edited', ''),
        'item_id': i.get('item_id', ''),
        'helpful': i.get('helpful', ''),
        'recommend': i.get('recommend', bool),
        'review': i.get('review', '')
        }
        
        data_desanidada.append(new_row)

user_reviews_completo = pd.DataFrame(data_desanidada)

### En este momento ya poseemos los 3 dataframes necesarios para comenzar a trabajar a responder las preguntas solicitadas, vamos a ello una por una


In [None]:
endpoint1 = steam[['price', 'id', 'title']]
endpoint1['price'] = endpoint1['price'].apply(pd.to_numeric, errors='coerce')
endpoint1.price = endpoint1.price.fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  endpoint1['price'] = endpoint1['price'].apply(pd.to_numeric, errors='coerce')


In [183]:
evcentric = user_items_completo[user_items_completo['user_id'] == 'evcentric']
valores_evcentric = evcentric['item_id'].values
valores = []
for i in valores_evcentric:
    valores.append(i)
endpoint1[endpoint1['id'].isin(valores)].price.sum()

1584.9


### ENDPOINT1


In [203]:
def userdata(user_id):
    user_mask = user_items_completo[user_items_completo['user_id'] == user_id]
    valores_user = user_mask['item_id'].values
    valores = []
    for i in valores_user:
        valores.append(i)
    reviews_mask = user_reviews_completo[user_reviews_completo['user_id'] == user_id]
    porcentaje = reviews_mask['recommend'].sum() / reviews_mask.shape[0]
    precio = endpoint1[endpoint1['id'].isin(valores)].price.sum()
    total_items = len(valores)

    return 'porcentaje: ' + str(porcentaje) + ', ' + 'precio: ' + str(precio) + ', ' + 'total_items: '  + str(total_items)

In [204]:
userdata('evcentric')

'porcentaje: 1.0, precio: 1584.9, total_items: 137'

### ENDPOINT 2

In [208]:
def convertir_fecha(fecha_texto):
    # Utilizar expresiones regulares para extraer el mes, día y año
    match = re.search(r'(\w+) (\d+), (\d+)', fecha_texto)
    if match:
        mes = match.group(1)
        dia = match.group(2)
        año = match.group(3)
        
        # Mapear nombres de meses a números de meses
        meses = {'January': '01', 'February': '02', 'March': '03', 'April': '04', 'May': '05', 'June': '06',
                 'July': '07', 'August': '08', 'September': '09', 'October': '10', 'November': '11', 'December': '12'}
        
        # Crear una cadena en formato 'YYYY-MM-DD' y convertirla a objeto de fecha
        fecha_str = f'{año}-{meses[mes]}-{dia}'
        return pd.to_datetime(fecha_str)
    else:
        return None

# Aplicar la función de conversión a la columna 'fecha_texto' y crear una nueva columna 'fecha'
user_reviews_completo['posted'] = user_reviews_completo['posted'].apply(convertir_fecha)

In [238]:
def countreviews(fecha_inicio, fecha_final):
    filtro_fechas = (user_reviews_completo['posted'] >= fecha_inicio) & (user_reviews_completo['posted'] <= fecha_final)
    df_fechas_filtrado = user_reviews_completo[filtro_fechas]
    cantidad_usuarios = df_fechas_filtrado['user_id'].nunique()
    cantidad_trues = df_fechas_filtrado[df_fechas_filtrado['recommend'] == True].shape[0]
    porcentaje = (cantidad_trues / cantidad_usuarios) * 100
    return cantidad_usuarios, porcentaje

In [237]:
countreviews('2011-11-05', '2013-09-08')

(3249, 2167)

### ENDPOINT 3

In [271]:
steam['genres'].dropna(inplace=True)

In [275]:
generos_unicos = []
for index, row in steam.iterrows():
    generos_unicos.extend(row['genres'])
generos_unicos = set(generos_unicos)


In [278]:
ids_por_genero = {genero: [] for genero in generos_unicos}
for index, row in steam.iterrows():
    for genero in row['genres']:
        ids_por_genero[genero].append(row['id'])

In [285]:
resultados = []

# Iterar sobre cada género en el diccionario
for genero, ids_juegos in ids_por_genero.items():
    # Filtrar el DataFrame de tiempo jugado para incluir solo los IDs de juegos del género actual
    df_genero = user_items_completo[user_items_completo['item_id'].isin(ids_juegos)]
    
    # Calcular el tiempo total jugado para el género actual
    tiempo_total = df_genero['playtime_forever'].sum()
    
    # Agregar el resultado a la lista
    resultados.append({'genres': genero, 'TiempoTotal': tiempo_total})

# Crear un DataFrame a partir de la lista de resultados
df_resultados = pd.DataFrame(resultados)

# Ordenar el DataFrame por tiempo total jugado en orden descendente
endpoint3 = df_resultados.sort_values(by='TiempoTotal', ascending=False)

endpoint3['Puesto'] = endpoint3['TiempoTotal'].rank(ascending=False, method='min').astype(int)

In [286]:
endpoint3.head()

Unnamed: 0,genres,TiempoTotal,Puesto
21,Action,3113562606,1
13,Indie,1494622404,2
17,RPG,1041022718,3
18,Adventure,909995120,4
5,Simulation,867646306,5


In [299]:
def genre(genero):
    resultado = endpoint3[endpoint3['genres'] == genero]
    valor = resultado['Puesto'].values
    return valor[0]

In [301]:
genre('Simulation')

5

### ENDPOINT 4

In [306]:
genero_deseado = ids_por_genero.get('Action', [])

In [321]:
df_horas = user_items_completo[user_items_completo['item_id'].isin(genero_deseado)]

horas_por_jugador = df_horas.groupby('user_id')['playtime_forever'].sum()
top_5 = horas_por_jugador.nlargest(5).reset_index()
top_5 = top_5.merge(user_reviews, how='left', on='user_id')
top_5

Unnamed: 0,user_id,playtime_forever,user_url,reviews
0,Sp3ctre,1699307,http://steamcommunity.com/id/Sp3ctre,[{'funny': '142 people found this review funny...
1,shinomegami,1580428,http://steamcommunity.com/id/shinomegami,"[{'funny': '2,598 people found this review fun..."
2,REBAS_AS_F-T,1456212,http://steamcommunity.com/id/REBAS_AS_F-T,"[{'funny': '', 'posted': 'Posted March 20, 201..."
3,Terminally-Chill,1065742,http://steamcommunity.com/id/Terminally-Chill,[{'funny': '23 people found this review funny'...
4,DownSyndromeKid,1061193,,


In [326]:
def userforgenre(genre):
    genero_deseado = ids_por_genero.get(genre, [])
    df_horas = user_items_completo[user_items_completo['item_id'].isin(genero_deseado)]
    horas_por_jugador = df_horas.groupby('user_id')['playtime_forever'].sum()
    top_5 = horas_por_jugador.nlargest(5).reset_index()
    top_5 = top_5.merge(user_reviews, how='left', on='user_id')
    return top_5


In [327]:
userforgenre('Simulation')

Unnamed: 0,user_id,playtime_forever,user_url,reviews
0,jimmynoe,1062130,,
1,clawbot44,798416,http://steamcommunity.com/id/clawbot44,"[{'funny': '1 person found this review funny',..."
2,clawbot44,798416,http://steamcommunity.com/id/clawbot44,"[{'funny': '1 person found this review funny',..."
3,Evilutional,684723,,
4,REBAS_AS_F-T,676540,http://steamcommunity.com/id/REBAS_AS_F-T,"[{'funny': '', 'posted': 'Posted March 20, 201..."
5,tsunamitad,661309,http://steamcommunity.com/id/tsunamitad,[{'funny': '461 people found this review funny...


### ENDPOINT 5

In [363]:
steam.tail()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
120440,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,2018-01-04,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,False,773640,"Nikita ""Ghost_RUS"""
120441,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,2018-01-04,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,False,733530,Sacada
120442,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,2018-01-04,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,False,610660,Laush Dmitriy Sergeevich
120443,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,False,658870,"xropi,stev3ns"
120444,,N/D,Maze Run VR,,http://store.steampowered.com/app/681550/Maze_...,NaT,"[Early Access, Adventure, Indie, Action, Simul...",http://steamcommunity.com/app/681550/reviews/?...,"[Single-player, Stats, Steam Leaderboards, HTC...",4.99,True,681550,


In [355]:
steam['release_date'] = pd.to_datetime(steam['release_date'], format='%Y-%m-%d', errors='coerce')

In [356]:
steam['price'] = steam['price'].apply(pd.to_numeric, errors='coerce')
steam.price = steam.price.fillna(0)

In [359]:

def developer(desarrollador):
     # Filtrar el DataFrame para obtener solo los juegos del desarrollador especificado
    juegos_del_desarrollador = steam[steam['developer'] == desarrollador]
    
    # Inicializar un diccionario para almacenar los porcentajes por año
    porcentajes_por_anio = {}
    
    juegos_del_desarrollador['release_date'] = juegos_del_desarrollador['release_date'].dt.year

    # Obtener la lista de años únicos
    años_unicos = juegos_del_desarrollador['release_date'].unique()
    
    # Calcular el porcentaje de juegos gratis para cada año
    for año in años_unicos:
        juegos_del_año = juegos_del_desarrollador[juegos_del_desarrollador['release_date'] == año]
        juegos_gratis_del_año = juegos_del_año[juegos_del_año['price'] == 0]
        
        porcentaje_juegos_gratis = (len(juegos_gratis_del_año) / len(juegos_del_año)) * 100
        
        porcentajes_por_anio[año] = porcentaje_juegos_gratis
    
    return porcentajes_por_anio




### ENDPOINT 6

In [367]:
user_reviews_completo.head(3)

Unnamed: 0,user_id,user_url,reviews,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2...",,2011-11-05,,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2...",,2011-07-15,,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2...",,2011-04-21,,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...


In [371]:
from textblob import TextBlob

def analisis_sentimiento(review):
    analysis = TextBlob(review)
    
    # Determinar la polaridad del sentimiento (-1 = negativo, 0 = neutro, 1 = positivo)
    if analysis.sentiment.polarity > 0:
        return 'positiva'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negativa'

def contar_sentimientos_por_anio(año_deseado):
    # Filtrar el DataFrame para obtener solo las reviews del año especificado
    reviews_del_anio = user_reviews_completo[user_reviews_completo['posted'].dt.year == año_deseado]
    
    # Crear columnas de sentimiento aplicando la función de análisis de sentimiento
    reviews_del_anio['sentimiento'] = reviews_del_anio['review'].apply(analisis_sentimiento)
    
    # Contar la cantidad de reviews positivas, neutras y negativas
    count_sentimientos = reviews_del_anio['sentimiento'].value_counts().to_dict()
    
    return count_sentimientos




ModuleNotFoundError: No module named 'textblob'