Programmer: Chris Heise (crheise@icloud.com)
School: New Mexico Highlands University
Course: BSSD 3850 Data Modeling
Instructor: Jonathan Lee
Date: 6 October 2022

Program: Midterm Project
Purpose: Use TensorFlow to attempt to determine trends in video game popularity.

# IMDB Video Games

## 1. Define the Problem
The purpose of this dataset, as described on [kaggle.com](https://www.kaggle.com/datasets/muhammadadiltalay/imdb-video-games), is to determine trends in popularity of video games based on genre and/or plot.

> Is there a relationship between popularity (votes), year, and genre when it comes to a game's rating?

#### The data was collected from the following nine genres of video games on IMDB:
1. Action
2. Adventure
3. Comedy
4. Crime
5. Family
6. Fantasy
7. Mystery
8. Sci-Fi
9. Thriller

- The data contains over 20K titles.

In [1]:
# Data Analysis
import numpy as np
import pandas as pd

# Machine Learning
import sklearn
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.utils import to_categorical

from ipywidgets import widgets

## 2. Acquire & Inspect the Data

In [2]:
video_game_data = pd.read_csv('./imdb-videogames.csv')

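# Here, I copy the data into a separate variable to maintain its integrity.
# .copy() is required: plain assignment would only create a second name for the same DataFrame.
df = video_game_data.copy()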

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,name,url,year,certificate,rating,votes,plot,Action,Adventure,Comedy,Crime,Family,Fantasy,Mystery,Sci-Fi,Thriller
0,0,Spider-Man,https://www.imdb.com/title/tt5807780/?ref_=adv...,2018.0,T,9.2,20759,"When a new villain threatens New York City, Pe...",True,True,False,False,False,True,False,False,False
1,1,Red Dead Redemption II,https://www.imdb.com/title/tt6161168/?ref_=adv...,2018.0,M,9.7,35703,Amidst the decline of the Wild West at the tur...,True,True,False,True,False,False,False,False,False
2,2,Grand Theft Auto V,https://www.imdb.com/title/tt2103188/?ref_=adv...,2013.0,M,9.5,59986,Three very different criminals team up for a s...,True,False,False,True,False,False,False,False,False
3,3,God of War,https://www.imdb.com/title/tt5838588/?ref_=adv...,2018.0,M,9.6,26118,"After wiping out the gods of Mount Olympus, Kr...",True,True,False,False,False,False,False,False,False
4,4,Uncharted 4: A Thief's End,https://www.imdb.com/title/tt3334704/?ref_=adv...,2016.0,T,9.5,28722,Thrown back into the dangerous underworld he'd...,True,True,False,False,False,False,False,False,False


As you can see, the data includes columns that won't be helpful to us. It's also important to check for null/incomplete data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20803 entries, 0 to 20802
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   20803 non-null  int64  
 1   name         20803 non-null  object 
 2   url          20803 non-null  object 
 3   year         20536 non-null  float64
 4   certificate  7903 non-null   object 
 5   rating       11600 non-null  float64
 6   votes        11600 non-null  object 
 7   plot         20803 non-null  object 
 8   Action       20803 non-null  bool   
 9   Adventure    20803 non-null  bool   
 10  Comedy       20803 non-null  bool   
 11  Crime        20803 non-null  bool   
 12  Family       20803 non-null  bool   
 13  Fantasy      20803 non-null  bool   
 14  Mystery      20803 non-null  bool   
 15  Sci-Fi       20803 non-null  bool   
 16  Thriller     20803 non-null  bool   
dtypes: bool(9), float64(2), int64(1), object(5)
memory usage: 1.4+ MB


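Certificate (the age rating of the game) is present for only 7,903 of the 20,803 titles, which is too sparse to fill in reliably. The year column is missing just 267 values (20,536 present), while rating and votes are each present for 11,600 titles.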
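
The missing-value counts above can also be read as a share per column, which makes it easier to decide which columns to drop. The following is a minimal sketch on a toy frame (the values are illustrative only, not taken from the dataset); on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "year": [2018.0, np.nan, 2013.0, 2016.0],
    "certificate": ["T", None, None, None],
    "rating": [9.2, np.nan, 9.5, 9.5],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```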

## 3. Prepare the Data

I'll start by dropping any games that are missing ratings, votes or year.

In [5]:
# NN for 'Non-Null'
nn_df = df.dropna(subset=['year', 'rating', 'votes'])

Next, I remove any columns that don't contain information relevant to what I'm attempting to determine.

In [6]:
clean_df = nn_df.drop(columns = ['Unnamed: 0', 'name', 'url', 'certificate', 'plot'])

Above, we saw that the votes column has dtype 'object' rather than the numeric type we need. Let's confirm that.

In [7]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11599 entries, 0 to 20791
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   year       11599 non-null  float64
 1   rating     11599 non-null  float64
 2   votes      11599 non-null  object 
 3   Action     11599 non-null  bool   
 4   Adventure  11599 non-null  bool   
 5   Comedy     11599 non-null  bool   
 6   Crime      11599 non-null  bool   
 7   Family     11599 non-null  bool   
 8   Fantasy    11599 non-null  bool   
 9   Mystery    11599 non-null  bool   
 10  Sci-Fi     11599 non-null  bool   
 11  Thriller   11599 non-null  bool   
dtypes: bool(9), float64(2), object(1)
memory usage: 464.4+ KB


Below, I convert the data to numeric values. I truncate rating to its integer part to reduce the number of output classes: instead of having ~100 possible ratings (0.0 - 9.9), there will now be 10 (0 - 9).

In [8]:
# Votes are stored as strings with thousands separators (e.g. '20,759')
clean_df['votes'] = clean_df['votes'].str.replace(',', '').astype(np.int64)
# astype truncates toward zero, so a rating of 9.7 becomes 9
clean_df['rating'] = clean_df['rating'].astype(np.int64)
clean_df['year'] = clean_df['year'].astype(np.int64)
# Encode the genre flags as True -> 1, False -> 0
genre_cols = clean_df.select_dtypes(include='bool').columns
clean_df[genre_cols] = clean_df[genre_cols].astype(np.int64)

In [9]:
clean_df.head()

Unnamed: 0,year,rating,votes,Action,Adventure,Comedy,Crime,Family,Fantasy,Mystery,Sci-Fi,Thriller
0,2018,9,20759,1,1,0,0,0,1,0,0,0
1,2018,9,35703,1,1,0,1,0,0,0,0,0
2,2013,9,59986,1,0,0,1,0,0,0,0,0
3,2018,9,26118,1,1,0,0,0,0,0,0,0
4,2016,9,28722,1,1,0,0,0,0,0,0,0


Finally, I can split the dataset.

In [10]:
labels = clean_df['rating']
features = clean_df.drop(columns = ['rating'])

In [11]:
ratio = int(len(labels)*0.7)

# Training data (70%)
train_feat = np.array(features[:ratio])
train_lbls = np.array(labels[:ratio])

# Testing data (30%)
test_feat = np.array(features[ratio:])
test_lbls = np.array(labels[ratio:])
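
The slice-based split above keeps the rows in file order. The first rows shown by `df.head()` are all heavily voted titles, so the file may be ordered by popularity; if that assumption holds, the last 30% would not resemble the first 70%. A minimal sketch of a shuffled split follows. The helper name `shuffled_split` and the seed are hypothetical, and `train_test_split` (already imported in the first cell) is an alternative way to get the same effect.

```python
import numpy as np

def shuffled_split(features, labels, train_frac=0.7, seed=0):
    """Shuffle rows with a fixed seed, then split into train and test parts."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))      # random row order
    cut = int(len(labels) * train_frac)       # same 70/30 ratio as above
    train_idx, test_idx = order[:cut], order[cut:]
    return (features[train_idx], features[test_idx],
            labels[train_idx], labels[test_idx])

# Toy data: 10 rows, 2 feature columns (illustrative only).
toy_feat = np.arange(20).reshape(10, 2)
toy_lbls = np.arange(10)
tr_f, te_f, tr_l, te_l = shuffled_split(toy_feat, toy_lbls)
print(tr_f.shape, te_f.shape)
```

Shuffling features and labels with the same index array keeps each row paired with its own label.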

In [12]:
# One-hot encode the labels
train_lbls = to_categorical(train_lbls)
test_lbls = to_categorical(test_lbls)
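
By default, `to_categorical` infers the number of classes from the largest label it sees, so the train and test encodings only share a width if both happen to contain the top rating. Passing `num_classes=10` fixes the width explicitly. The sketch below shows the same idea in plain NumPy; the `one_hot` helper and `NUM_CLASSES` constant are hypothetical names used only for illustration.

```python
import numpy as np

NUM_CLASSES = 10  # integer ratings 0-9, as described above

def one_hot(labels, num_classes=NUM_CLASSES):
    """One-hot encode integer labels with a fixed number of columns."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.eye(num_classes, dtype=np.float32)[labels]

encoded = one_hot([6, 7, 3])
print(encoded.shape)
```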

In [13]:
print(test_feat.shape)
print(test_lbls.shape)
print(train_feat.shape)
print(train_lbls.shape)

(3480, 11)
(3480, 10)
(8119, 11)
(8119, 10)


In [14]:
test_lbls[:5]

array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]], dtype=float32)

## 4. Define & Fit the Model

In [15]:
# I've tried tanh and relu activations in various configurations,
# but the best training accuracy achieved has been ~40%.

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
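
One possible contributor to the low accuracy is feature scale: year is around 2000 and votes run into the tens of thousands, while the genre flags are 0 or 1, and `tanh` saturates on large inputs. A common remedy is to standardize each column using statistics from the training set only. This is a minimal NumPy sketch with a hypothetical `standardize` helper and toy rows; Keras's `tf.keras.layers.Normalization` layer is an alternative that does this inside the model.

```python
import numpy as np

def standardize(train, test):
    """Scale columns to zero mean and unit variance using training statistics."""
    train = np.asarray(train, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Toy rows in the order year, votes, one genre flag (illustrative only).
toy_train = np.array([[2018, 20759, 1], [2013, 59986, 1], [2016, 28722, 0]])
toy_test = np.array([[2020, 1200, 0]])
scaled_train, scaled_test = standardize(toy_train, toy_test)
print(scaled_train.round(2))
```

The test rows are scaled with the training mean and standard deviation so that no information from the test set leaks into preprocessing.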

In [16]:
e = 5

model.fit(train_feat, train_lbls, epochs=e)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x281f80350>

In [17]:
model.evaluate(test_feat, test_lbls, verbose=2)

109/109 - 0s - loss: 1.5349 - accuracy: 0.3796 - 88ms/epoch - 811us/step


[1.534890055656433, 0.3795976936817169]

In [18]:
predictions = model(test_feat[:1])
print(predictions)
predictions = predictions.numpy()[0]
print(predictions)
max_pred = np.amax(predictions)
print(test_lbls[0], np.where(predictions == max_pred)[0])

tf.Tensor(
[[-5.4576254 -2.0109565 -1.7652036 -0.5239869  0.421166   1.3896167
   2.5748694  3.1871047  2.3275402 -1.2031752]], shape=(1, 10), dtype=float32)
[-5.4576254 -2.0109565 -1.7652036 -0.5239869  0.421166   1.3896167
  2.5748694  3.1871047  2.3275402 -1.2031752]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] [7]
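
The model outputs raw logits because the loss was built with `from_logits=True`. To read them as probabilities, apply a softmax; the index of the largest value is the predicted integer rating. A minimal sketch using the logits printed above (the `softmax` helper is written here for illustration):

```python
import numpy as np

# Logits copied from the output above.
logits = np.array([-5.4576254, -2.0109565, -1.7652036, -0.5239869, 0.421166,
                   1.3896167, 2.5748694, 3.1871047, 2.3275402, -1.2031752])

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax(logits)
predicted_rating = int(np.argmax(probs))
true_rating = int(np.argmax([0, 0, 0, 0, 0, 0, 1, 0, 0, 0]))
print(predicted_rating, true_rating)  # 7 6
```

For this game the model predicts a rating of 7 while the true label is 6, so the miss is by one class.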


**Observation:** It appears that either there isn't enough data to make accurate rating predictions, or there isn't any correlation between a game's rating and its genres, votes, and release year. I personally believe it's the latter of the two.
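
One way to put the ~38% test accuracy in context is to compare it with a baseline that always predicts the most common rating. The sketch below uses a hypothetical `majority_baseline_accuracy` helper and, as toy input, only the five test labels printed earlier (ratings 6, 6, 7, 8, 6); running it on the full `test_lbls` array would give the actual baseline.

```python
import numpy as np

def majority_baseline_accuracy(one_hot_labels):
    """Accuracy obtained by always predicting the most frequent class."""
    class_ids = np.argmax(one_hot_labels, axis=1)
    counts = np.bincount(class_ids)
    return counts.max() / len(class_ids)

# First five test labels shown earlier, rebuilt as one-hot rows.
toy_lbls = np.eye(10)[[6, 6, 7, 8, 6]]
baseline = majority_baseline_accuracy(toy_lbls)
print(baseline)  # 0.6
```

If the model's accuracy is close to this baseline on the full test set, it has likely learned little beyond the label distribution.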

## 5. Predict on a Custom Video Game

In [19]:
# Helper function: encode the selected genres as 0/1 flags,
# in the same column order as the training features
GENRES = ['Action', 'Adventure', 'Comedy', 'Crime', 'Family',
          'Fantasy', 'Mystery', 'Sci-Fi', 'Thriller']

def convert_genres(genres):
    return [1 if genre in genres else 0 for genre in GENRES]

In [20]:
def predict_rating(button):
    # Get data from user
    year = year_select.value
    votes = votes_select.value
    genres = genre_select.value
    oh_genres = convert_genres(genres)

    # Build one row in the training column order: year, votes, then genre flags
    usr_choices = [year, votes] + oh_genres

    # Convert to a batch of shape (1, 11) and make the prediction
    usr_game = np.array([usr_choices], dtype=np.float32)
    prediction = model(usr_game).numpy()[0]

    # The index of the largest logit is the predicted integer rating
    print('Predicted rating:', np.argmax(prediction))

In [21]:
year_select = widgets.IntSlider(
    value=2022,
    min=1972,
    max=2072,
    step=1,
    description='Year:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True
)

votes_select = widgets.IntSlider(
    value=25000,
    min=5,
    max=65000,
    step=1000,
    description='Votes:',
    disabled=False,
    orientation='horizontal',
    readout=True
)

genre_select = widgets.SelectMultiple(
    options=['Action', 'Adventure', 'Comedy', 'Crime', 'Family', 'Fantasy', 'Mystery', 'Sci-Fi', 'Thriller'],
    value=['Action'],
    description='Genres',
    disabled=False
)

predict_button = widgets.Button(
    description='Predict Game Rating',
    disabled=False,
    tooltip='Click to Predict Game Rating',
)
predict_button.on_click(predict_rating)

display(year_select, votes_select, genre_select, predict_button)


IntSlider(value=2022, continuous_update=False, description='Year:', max=2072, min=1972)

IntSlider(value=25000, description='Votes:', max=65000, min=5, step=1000)

SelectMultiple(description='Genres', index=(0,), options=('Action', 'Adventure', 'Comedy', 'Crime', 'Family', …

Button(description='Predict Game Rating', style=ButtonStyle(), tooltip='Click to Predict Game Rating')