<b><p style="font-size: XX-large"><font color = "cyan">Andrew Cai</font></p></b> </div>
<b><p style="font-size: XX-large"><font color = "cyan">Steam Game Rating Prediction</font></p></b> </div>

-------------------------

# Data Set Background and Project Purpose

The dataset in this project contains over 50,000 cleaned and preprocessed data on video games (reviews) from a Steam Store - a leading online platform for purchasing and downloading video games, DLC, and other gaming-related content.

The model building in this project consists of two main files from the Kaggle dataset:

- games.csv - a table of games (or add-ons) information on ratings, pricing in US dollars $, release date, etc. 
- games_metadata.json - A piece of extra non-tabular details on games, such as descriptions and tags, is in a metadata file

This project serves as a means to learn scikit-learn and TensorFlow to build deep learning models for binary classifcation. Specifically, this project will take game attributes such as mac support, windows support, discounted, pricing, and game tags to determine whether or not the game will receive a positive rating with customers.


[Game Recommendations on Steam](https://www.kaggle.com/datasets/antonkozyriev/game-recommendations-on-steam)

Game Recommendations on Steam (2023). Kaggle. 10.34740/kaggle/ds/2871694.

-------------------------

# Packages and Initial Setups

In [1]:
import os
import warnings
import logging

# Suppress TensorFlow logs
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Suppress warnings
warnings.filterwarnings('ignore', category=UserWarning, module='tensorflow')
warnings.filterwarnings('ignore', category=UserWarning, module='keras')
warnings.filterwarnings('ignore', category=UserWarning, module='keras_tuner')

# Suppress TensorFlow and Keras warnings
logging.getLogger('tensorflow').setLevel(logging.ERROR)

# Optionally, suppress specific warnings
warnings.filterwarnings('ignore', message='.*tf.reset_default_graph.*')

import pandas as pd
import json
import numpy as np
import random

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam, SGD
import keras_tuner as kt


-------------------------

# Exploratory Data Analysis

## Game Data

In [2]:
# Load in CSV
dfgames = pd.read_csv("games.csv")
dfgames.head()

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,price_original,discount,steam_deck
0,13500,Prince of Persia: Warrior Within™,2008-11-21,True,False,False,Very Positive,84,2199,9.99,9.99,0.0,True
1,22364,BRINK: Agents of Change,2011-08-03,True,False,False,Positive,85,21,2.99,2.99,0.0,True
2,113020,Monaco: What's Yours Is Mine,2013-04-24,True,True,True,Very Positive,92,3722,14.99,14.99,0.0,True
3,226560,Escape Dead Island,2014-11-18,True,False,False,Mixed,61,873,14.99,14.99,0.0,True
4,249050,Dungeon of the ENDLESS™,2014-10-27,True,True,False,Very Positive,88,8784,11.99,11.99,0.0,True


In [3]:
# Check for nulls / data types 
dfgames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50872 entries, 0 to 50871
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app_id          50872 non-null  int64  
 1   title           50872 non-null  object 
 2   date_release    50872 non-null  object 
 3   win             50872 non-null  bool   
 4   mac             50872 non-null  bool   
 5   linux           50872 non-null  bool   
 6   rating          50872 non-null  object 
 7   positive_ratio  50872 non-null  int64  
 8   user_reviews    50872 non-null  int64  
 9   price_final     50872 non-null  float64
 10  price_original  50872 non-null  float64
 11  discount        50872 non-null  float64
 12  steam_deck      50872 non-null  bool   
dtypes: bool(4), float64(3), int64(3), object(3)
memory usage: 3.7+ MB


## Extract JSON Game Tag Data

In [4]:
# File Name
jFile = "games_metadata.json"
# Extract JSON Data into Python List
gameMetaData = [] # Initiate Empty List
with open (jFile, "r", encoding='utf-8') as json_file:
    for line in json_file:
        gameMetaData.append(json.loads(line))

gameMetaData[5]

{'app_id': 250180,
 'description': '“METAL SLUG 3”, the masterpiece in SNK’s emblematic 2D run & gun action shooting game series, still continues to fascinate millions of fans worldwide to this day for its intricate dot-pixel graphics, and simple and intuitive game controls!',
 'tags': ['Arcade',
  'Classic',
  'Action',
  'Co-op',
  'Side Scroller',
  'Retro',
  'Local Co-Op',
  'Shooter',
  '2D',
  'Online Co-Op',
  'Great Soundtrack',
  "Shoot 'Em Up",
  'Platformer',
  'Multiplayer',
  'Pixel Graphics',
  'Old School',
  'Difficult',
  'Singleplayer',
  'Nostalgia',
  'Comedy']}

In [5]:
# Create Dictionary using app_id as the Key
gameTag = {} # Initialize empty dict
for n in range(len(gameMetaData)):
    gameTag[gameMetaData[n]['app_id']] = gameMetaData[n]['tags']

In [6]:
# Convert dict into pandas dataframe
dfTag = pd.DataFrame({'app_id':gameTag.keys(), 'tags':gameTag.values()})
dfTag.sample(5)

Unnamed: 0,app_id,tags
13246,1641500,"[Casual, Visual Novel, Anime, Text-Based, Sci-..."
30244,1103780,"[Racing, Free to Play, Simulation, Sports, Aut..."
46594,1012100,[Simulation]
10378,1921740,"[Runner, Side Scroller, 2D Platformer, Family ..."
11986,73031,"[Action, RPG]"


In [7]:
# Check game tag dataframe
dfTag.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50872 entries, 0 to 50871
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   app_id  50872 non-null  int64 
 1   tags    50872 non-null  object
dtypes: int64(1), object(1)
memory usage: 795.0+ KB


## Merge Game Tag Data with Game Data

In [8]:
# Merge Game Tags Data
dfgames2 = dfgames.copy() # Create a copy of original dataframe
dfgames2 = dfgames2.merge(dfTag, how='inner', on='app_id') # Merge tag data
dfgames2.sample(5)

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,price_original,discount,steam_deck,tags
1550,939960,Far Cry® New Dawn,2019-02-15,True,False,False,Mostly Positive,76,22055,39.99,39.99,0.0,True,"[Open World, FPS, Action, Co-op, Post-apocalyp..."
42705,1030830,Mafia II: Definitive Edition,2020-05-19,True,False,False,Mixed,69,11516,4.49,29.99,85.0,True,"[Action, Adventure, Crime, Open World, Story R..."
21616,1032750,Zombie Shooter: Ares Virus,2022-02-09,True,False,False,Mixed,42,118,4.99,4.99,0.0,True,"[Zombies, Shooter, Action, Adventure, Battle R..."
46924,1142800,DiRT Rally 2.0 Deluxe 2.0 (Season3+4),2019-08-26,True,False,False,Mostly Positive,78,52,11.99,11.99,0.0,True,"[Simulation, Sports, Racing]"
21964,1283220,Devolverland Expo,2020-07-11,True,False,False,Very Positive,89,5685,0.0,0.0,0.0,True,"[Free to Play, Action, Adventure, FPS, Indie, ..."


In [9]:
# Check merged game tag dataframe
dfgames2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50872 entries, 0 to 50871
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app_id          50872 non-null  int64  
 1   title           50872 non-null  object 
 2   date_release    50872 non-null  object 
 3   win             50872 non-null  bool   
 4   mac             50872 non-null  bool   
 5   linux           50872 non-null  bool   
 6   rating          50872 non-null  object 
 7   positive_ratio  50872 non-null  int64  
 8   user_reviews    50872 non-null  int64  
 9   price_final     50872 non-null  float64
 10  price_original  50872 non-null  float64
 11  discount        50872 non-null  float64
 12  steam_deck      50872 non-null  bool   
 13  tags            50872 non-null  object 
dtypes: bool(4), float64(3), int64(3), object(4)
memory usage: 4.1+ MB


In [10]:
# Explore output variable
dfgames2['rating'].value_counts()

rating
Positive                   13502
Very Positive              13139
Mixed                      12157
Mostly Positive             8738
Mostly Negative             1849
Overwhelmingly Positive     1110
Negative                     303
Very Negative                 60
Overwhelmingly Negative       14
Name: count, dtype: int64

In [11]:
# Exploring price average (exclude free games i.e. price = 0)
dfpriceCheck = dfgames2[dfgames2['price_original'] != 0]
dfpriceCheck['price_original'].mean()

10.885909715070376

## Feature Engineering

In [12]:
# Create column: classify discount price to have been discounted to 1
dfgames2['discounted'] = dfgames2['discount'].apply(lambda x: 1 if x > 0 else 0)
# Create column: classify all "positive" reviews as 1
dfgames2['positive_rating'] = dfgames2['rating'].apply(lambda x: 1 if "Positive" in x else 0)
# Based on the average original price, classify games over $10 as 1
dfgames2['over_10'] = dfgames2['price_original'].apply(lambda x: 1 if x > 10 else 0)

In [13]:
# Drop columns that are unnecessary for model for binary classification
dfgames2.drop(['app_id','title', 'positive_ratio', 'price_final', 'date_release', 'user_reviews', 'rating', 'price_original', 'discount'], axis=1, inplace=True)
dfgames2.sample(5)

Unnamed: 0,win,mac,linux,steam_deck,tags,discounted,positive_rating,over_10
12636,True,True,False,True,"[Story Rich, 2D, Puzzle, Action-Adventure, Han...",0,1,1
37900,True,False,False,True,"[Experimental, Difficult, Education, Puzzle, L...",0,0,1
12802,True,True,True,True,[],0,1,0
44605,True,False,False,True,"[Free to Play, Indie, Action, Casual, Simulati...",0,1,0
39373,True,True,True,True,"[Simulation, Sports, Racing, Strategy]",0,1,0


## Expand Tags as Boolean Columns

In [14]:
# Expand tags lists as presence columns
mlb = MultiLabelBinarizer()
tags_presence = pd.DataFrame(mlb.fit_transform(dfgames2['tags']), columns=mlb.classes_, index=dfgames2.index).astype(bool)
dfgames2 = dfgames2.join(tags_presence) # add the tag columns
dfgames2.drop(['tags'], axis=1, inplace=True) # remove the tags lists column
dfgames2.sample(5)

Unnamed: 0,win,mac,linux,steam_deck,discounted,positive_rating,over_10,1980s,1990's,2.5D,...,Well-Written,Werewolves,Western,Wholesome,Word Game,World War I,World War II,Wrestling,Zombies,eSports
37315,True,False,False,True,0,0,0,False,False,False,...,False,False,False,False,False,False,False,False,False,False
10239,True,True,True,True,0,1,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9849,True,False,False,True,0,1,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False
49797,True,False,False,True,1,1,0,False,False,False,...,False,False,False,False,False,False,False,False,False,False
33093,True,False,False,True,0,0,0,False,False,False,...,False,False,False,False,False,False,False,False,True,False


In [15]:
# Convert all data types to binary
dfgames2 = dfgames2.astype({col: int for col in dfgames2.columns})

In [16]:
X = dfgames2.drop(['positive_rating'], axis=1) # Select all features except target
y = dfgames2['positive_rating'] # Select target feature

-------------------------

# Model Architecture and Design

In [17]:
# Set up testing and training sets with 20% test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Build and Compile Model

### Parameter Testing and Tuning

In [18]:
def build_model(hp):
    """Testing/tuning several parameters for Sequential Deep Learning Model for 3 layers

    Args:
        hp (keras tuner): tuner for model

    Returns:
        JSON file: trained model parameters that had best accuracy score
    """
    # model selection, sequential ideal for bianry classifcaiton
    model = Sequential()
    # finding parameters for first layer
    model.add(Dense(units=hp.Int('units1', min_value=32, max_value=128, step=32),
                    activation=hp.Choice('activation1', values=['relu', 'leaky_relu', 'elu']), input_dim=len(X_train.columns)))
    model.add(Dropout(rate=hp.Float('dropout_rate', min_value=0.2, max_value=0.5, step=0.1)))
    # finding parameters for second layer
    model.add(Dense(units=hp.Int('units2', min_value=32, max_value=128, step=32), 
                    activation=hp.Choice('activation2', values=['relu', 'leaky_relu', 'elu'])))
    model.add(Dropout(rate=hp.Float('dropout_rate2', min_value=0.2, max_value=0.5, step=0.1)))
    # finding parameters for third layer
    model.add(Dense(units=hp.Int('units3', min_value=32, max_value=128, step=32), 
                    activation=hp.Choice('activation3', values=['relu', 'leaky_relu', 'elu'])))
    model.add(Dense(1, activation='sigmoid'))
    # Find best optimizer between SGD and ADAM
    optimizer = hp.Choice('optimizer', values=['adam', 'sgd'])
    # Test different learning rates to assist with fitting
    learning_rate = hp.Float('learning_rate', min_value=1e-5, max_value=1e-1, sampling='log')
    if optimizer == 'adam':
        optimizer_instance = Adam(learning_rate=learning_rate)
    else:
        optimizer_instance = SGD(learning_rate=learning_rate)
    # Compile the model
    model.compile(optimizer=optimizer_instance,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    return model

# Initialize the tuner
tuner = kt.Hyperband(build_model,
                     objective='val_accuracy',
                     max_epochs=100,
                     project_name='sd_tuner',
                     hyperband_iterations=2)

# Perform hyperparameter search
tuner.search(X_train, y_train, epochs=100, validation_split=0.2, verbose=0)

# Get the best model
best_model = tuner.get_best_models(num_models=1)[0]

# Get the best hyperparameters
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]

# Print the best model summary
print("Best Model Summary:")
best_model.summary()

# Print best hyperparameters
print("Best Hyperparameters:")
for key, value in best_hyperparameters.values.items():
    print(f"{key}: {value}")

Reloading Tuner from .\sd_tuner\tuner0.json

Best Model Summary:


Best Hyperparameters:
units1: 96
activation1: leaky_relu
dropout_rate: 0.2
units2: 128
activation2: leaky_relu
dropout_rate2: 0.4
units3: 128
activation3: elu
optimizer: adam
learning_rate: 0.00026655389110270963
tuner/epochs: 34
tuner/initial_epoch: 12
tuner/bracket: 4
tuner/round: 3
tuner/trial_id: 0140


### Model Design

In [19]:
# # Uncomment this entire cell to adjust new model parameters
# early_stopping = EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)
# lr_reduction= ReduceLROnPlateau(monitor='val_loss', patience=5, verbose=1, factor=0.5)
# # Create Adam optimizer with a custom learning rate
# optimizer = Adam(learning_rate=0.00026655389110270963)  # Adjust learning rate as needed started with 0.0001

# model = Sequential()
# model.add(Dense(units=96, activation='leaky_relu', input_dim=len(X_train.columns), kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.1))) # L2 regularization with lambda=0.01 l2(0.01) / , kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.01)
# model.add(BatchNormalization())
# model.add(Dropout(0.2)) # Dropout with 20% probability
# model.add(Dense(units=128, activation='leaky_relu', kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) 
# model.add(BatchNormalization())
# model.add(Dropout(0.4))  # Dropout with 40% probability
# model.add(Dense(units=128, activation='elu', kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)))
# model.add(Dense(units=1, activation='sigmoid'))


# model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

## Fit, Predict, and Evaluate

In [20]:
# # Uncomment cell to train new model
# model.fit(X_train, y_train, epochs=300, batch_size=32, callbacks=[early_stopping, lr_reduction]) 

In [21]:
# # Uncomment cell to save new model
# model.save('SD_Seq_model.keras')

In [22]:
# # Uncomment cell to delete model from memory
# del model

In [23]:
# Reload saved model
model = load_model('SD_Seq_model.keras')

In [24]:
# Testing model with testing data
y_hat = model.predict(X_test, verbose=0)
# Convert probabilties to binary classification
y_hat = [0 if val <0.5 else 1 for val in y_hat] 
# Checking accuracy between test and prediction
accuracy_score(y_test, y_hat)

0.7184275184275184

In [25]:
def SteamRating(cols, x_input):
    """Based on Feature Classification, returns 

    Args:
        cols (pandas columns): list of predictor features
        x_input (array): array of binary values of feature presence

    Returns:
        str: Postive or Not Positive Rating
        array: Array of features associated with rating
    """
    # Prediction based on x_input
    y_pred = model.predict(x_input, verbose=0)
    # Convert probability to 0 or 1 classification
    y_pred = 0 if y_pred <0.5 else 1
    # Find indices of where 1 is present from x_input
    attr_1 = np.where(x_input != 0)[1]
    # Have the features be in an array for index searching
    a_col = np.array(cols)

    if y_pred == 1:
        return ('Positive Rating', a_col[[attr_1]])
    else:
        return ('Negative Rating', a_col[[attr_1]])


In [26]:
def rating_trials(n, positive, negative, cols):
    """Runs model against randomly generated data feature presence of n satisfying conditions,
       this will be compiled in an array for positive/negative ratings

    Args:
        n (int): number of satisfied conditions
        positive (bool): set to True if want positive rating array (exclusive)
        negative (bool): set to True if want negative rating array (exclusive)
        cols (pandas columns): list of predictor features
    """
    if positive:
        # Generate count array
        featureCount = np.zeros((1, 447))
        # Initialize stopper
        count = 0
        while count != n: 
            # Random predictor feature presence
            x = np.random.randint(2, size=(1,447))
            # Condition satisfied summations and ticker
            if SteamRating(cols, x)[0] == 'Positive Rating':
              featureCount = featureCount +  x
              count += 1
        return(featureCount.astype(int))
    elif negative:
        # Generate count array
        featureCount = np.zeros((1, 447))
        # Initialize stopper
        count = 0
        while count != n: 
            # Random predictor feature presence
            x = np.random.randint(2, size=(1,447))
            # Condition satisfied summations and ticker
            if SteamRating(cols, x)[0] == 'Negative Rating':
                featureCount = featureCount +  x
                count += 1
        return(featureCount.astype(int))


### Predictions Between Postive and Negative Rating Attributes

In [27]:
# Run Trials
random.seed(0)
# Positive Rating Array
testP = rating_trials(100, True, False, X_test.columns)
# Negative Rating Array
testN = rating_trials(100, False, True, X_test.columns)

In [28]:
# Generate Summary Feature Counts
positive_df = pd.DataFrame(testP, columns=X_test.columns)
positive_df['Rating'] = 'Postive'
negative_df = pd.DataFrame(testN, columns=X_test.columns)
negative_df['Rating'] = 'Negative'

In [29]:
# Append the new row to the original DataFrame
dfFeatureCount = pd.concat([positive_df, negative_df], axis=0)
dfFeatureCount.reset_index(drop=True)

Unnamed: 0,win,mac,linux,steam_deck,discounted,over_10,1980s,1990's,2.5D,2D,...,Werewolves,Western,Wholesome,Word Game,World War I,World War II,Wrestling,Zombies,eSports,Rating
0,56,52,47,54,50,58,50,52,48,49,...,47,48,50,49,55,48,60,47,49,Postive
1,97,32,47,49,39,54,51,47,50,42,...,55,49,47,54,40,50,53,46,40,Negative


In [30]:
# Top 10 Features for Positive Rating
dfTopPositive = dfFeatureCount.drop(['Rating'], axis=1).iloc[0].nlargest(10).to_frame().T
dfTopPositive

Unnamed: 0,Lore-Rich,Medical Sim,City Builder,Clicker,Cycling,Hardware,Comic Book,FMV,Jet,Wrestling
0,62,62,61,61,61,61,60,60,60,60


In [31]:
# Top 10 Features for Negative Rating
dfTopNegative = dfFeatureCount.drop(['Rating'], axis=1).iloc[1].nlargest(10).to_frame().T
dfTopNegative

Unnamed: 0,win,Massively Multiplayer,Simulation,Clicker,Survival,Dark Fantasy,Early Access,MMORPG,Realistic,Turn-Based Combat
0,97,88,85,78,72,69,68,67,67,67


-------------------------

# Conclusion

After building a Sequential Model for binary classification with an accuracy score of ~72% and running trials until 100 positive and 100 negative ratings were produced, it seems that the top 10 feature combinations for a

**Positive Rating:** 
- Lore-Rich
- Medical Sim
- City Builder
- Clicker
- Cycling
- Hardware
- Comic Book
- FMV
- Jet
- Wrestling

**Negative Rating:** 
- win
- Massively Multiplayer	
- Simulation	
- Clicker	
- Survival	
- Dark Fantasy	
- Early Access	
- MMORPG	
- Realistic	
- Turn-Based Combat

To improve the accuracy score, some future work would be to clean up the raw data more and introduce more feature engineered variables. Collapsing redundant columns such as MMO versus Multiplayer could also simplyify the dataset/improve accuracy. Furthermore, in the tuning, add in regularizers to find parameters that reduce overfitting even further. 

However, it seems that overall many steam reviews are generally rated positively since when running the trials, the negative review output took signifcantly longer to produce. Assuming that features do not have dependencies on one another, the negative combination above are best to avoid. Further modelign and statistical work should be done to actually test for feature independence.