This notebook contains the ML model that predicts the log of the sales. It could just as well predict the sales themselves, but given how widely sales values range, predicting the log makes more sense: an error of \\$50,000 matters far less for a game that sold \\$1.5 million than for one that sold \\$100,000.
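As a rough illustration of that argument, the sketch below compares the same absolute error on two made-up sales figures (the numbers mirror the example above and are not taken from the data set). In log space, the error on the smaller game comes out much larger.

```python
import numpy as np

# Hypothetical sales figures (in dollars), chosen to mirror the example above.
true_sales = np.array([1_500_000.0, 100_000.0])
predicted_sales = true_sales + 50_000.0  # the same absolute error on both games

absolute_error = np.abs(predicted_sales - true_sales)
log_error = np.abs(np.log(predicted_sales) - np.log(true_sales))

for true, abs_err, log_err in zip(true_sales, absolute_error, log_error):
    print(f"true={true:>11,.0f}  abs. error={abs_err:,.0f}  log error={log_err:.4f}")
```

The absolute errors are identical, while the log error depends only on the ratio between prediction and truth, which is the quantity of interest here.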

The notebook assumes that the data set is already cleaned and ready to be used. See the data engineering notebook for the data preparation.

There are three kinds of columns in the data set: numerical (e.g. sales), categorical (e.g. platform, genre), and free-form text (e.g. summary). The numerical data does not require any further preparation. The categorical data will be one-hot encoded: columns that hold a single value per row will be transformed using OneHotEncoder, while columns that hold a list of values per row will be transformed using DictVectorizer. Note that the majority of the categorical columns contain numerical values, but these simply represent different classes of whatever data the columns hold. The free-form text will be transformed using TfidfVectorizer.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

from tdi_capstone_common_functions import *
# the two functions imported are: standardize_string and pseudo_list_parser

The data, loaded into a pandas data frame, has a lot of columns. The data engineering notebook already dropped columns that were not useful for predictions, specifically a lot of metadata (e.g., the URL for the game on IGDB), but some of the remaining columns may or may not turn out to be useful. They were retained up to this point, but are now dropped. They may be used in future versions of the model. Some, however, can and should be dropped regardless, such as name, release_year, closest_match, match_score, id, first_release_date, and release_dates.

In [2]:
# load the data
drop_columns = [
    'name',
    'release_year',
    'closest_match',
    'match_score',
    'id',
    'first_release_date',
    'release_dates',
    'alternative_names',
    'external_games',
    'similar_games',
    'language_supports',
    'status',
    'bundles',
    'collections',
    'parent_game',
    'collection'
    ]

df = (pd.read_csv('data_complete.csv', index_col='index')
      .drop(drop_columns, axis=1))

Since different kinds of columns will be transformed using different transformers, it is convenient to create lists of columns according to their kind.

In [3]:
# columns with numerical data
numeric_columns = ['sales_na', 'sales_eu', 'sales_jp', 'sales_other', 'sales_global']

# columns with free-form text
text_columns = ['summary', 'storyline']

# columns that contain lists (imported as strings that look like lists)
list_columns = ['age_ratings', 'game_modes', 'genres', 'themes', 'involved_companies', 'keywords',
               'multiplayer_modes', 'franchises', 'game_engines', 'player_perspectives', 'game_localizations']

# parse the columns that contain pseudo-lists into lists and populate those with a non-list (NaN) with an empty list
df[list_columns] = df[list_columns].applymap(lambda x: pseudo_list_parser(x)).applymap(lambda x: x if isinstance(x, list) else [])

# since most columns won't use OneHotEncoder, it is easier to exclude columns from df.columns than to explicitly list them out
non_ohe_columns = numeric_columns + text_columns + list_columns
# creates a list with the columns that do not appear in any of the above lists
ohe_columns = [column_name for column_name in df.columns if column_name not in non_ohe_columns]
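The implementation of pseudo_list_parser lives in the imported module and is not shown in this notebook. The sketch below is a hypothetical stand-in (the name parse_pseudo_list and the use of ast.literal_eval are assumptions, not the actual helper) that illustrates the idea of turning strings that look like lists into real lists, with missing values becoming empty lists.

```python
import ast

import pandas as pd


def parse_pseudo_list(value):
    """Hypothetical stand-in for pseudo_list_parser: turn "[1, 2]" into [1, 2]."""
    if not isinstance(value, str):
        return []  # NaN/None and other non-string values become an empty list
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # strings that are not valid Python literals
    return parsed if isinstance(parsed, list) else []


# Toy column with made-up genre codes and one missing value.
toy = pd.DataFrame({'genres': ['[12, 31]', None, '[5]']})
toy['genres'] = toy['genres'].map(parse_pseudo_list)
print(toy['genres'].tolist())
```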

## Building the model

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import DictVectorizer

from sklearn.decomposition import TruncatedSVD # used for dimensionality reduction

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

As mentioned above, there are three kinds of columns: numerical, categorical, and free-form text. Numerical data does not require any transformation at this point. The categorical columns can be divided into two subtypes: columns with a single value per row, which will be transformed using OneHotEncoder, and columns with a list of values per row, which will be transformed with DictVectorizer. Lastly, free-form text will use TfidfVectorizer. Each of these kinds of transformations will be stored as a list of transformers with their relevant columns, so they can all eventually be put together into a single list of transformers passed to the ColumnTransformer of the model.

First to be treated are the OneHotEncoder categorical columns.

In [15]:
ohe_transformers = [('categorical', OneHotEncoder(handle_unknown='ignore'), ohe_columns)]
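The handle_unknown='ignore' setting matters because the test set may contain categories that never appear in the training set. A minimal sketch with made-up platform codes (not values from the data set) shows what happens to such a category: it is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'platform': [6, 48, 6]})  # made-up platform codes
test = pd.DataFrame({'platform': [48, 130]})    # 130 never appears in training

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# One output column per category seen during fit (here: 6 and 48).
encoded = encoder.transform(test).toarray()
print(encoded)
```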

Next are the categorical columns with lists of values. Each one of these lists is first encoded as a dictionary (using a custom class) and then vectorized.

In [16]:
class DictEncoder(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # X is a single column (1D) whose entries are lists of labels.
        # Each list becomes a dict mapping every label to 1, which is
        # the input format DictVectorizer expects.
        return [{key: 1 for key in row} for row in X]

Because the transformer first needs to encode and then vectorize the list, a pipeline is used.

In [17]:
list_column_vectorizer = Pipeline([
    ('dict_encoder', DictEncoder()),
    ('dict_vectorizer', DictVectorizer())
])

Then a list is created in which every element is a transformer tuple of the form (name, transformer, column(s)), one per list column. This way, each fitted transformer can later be accessed by name via the ColumnTransformer's named_transformers_ attribute.

In [18]:
list_column_transformers = [(f'list_vect_{column}', list_column_vectorizer, column) for column in list_columns]

Lastly, the free-form text columns are treated: the strings are first standardized and then vectorized using TF-IDF. Very rare words are excluded by setting min_df, the minimum number of documents a word must appear in. Additionally, a relatively low max_df excludes words that appear in a large share of the game descriptions, since such words do little to distinguish one game from another.

In [19]:
def text_column_preprocessing(series):
    """
    Standardizes a series containing free-form strings.
    
    Parameter
    ---------
    series : pandas.Series()
        A series of strings to be standardized, retaining only alphanumeric characters,
        changing East Asian characters into Latin ones, and removing symbols and parentheses.
    
    Returns
    -------
    pandas.Series
        A series of standardized strings.
    """
    
    return series.map(standardize_string)

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS = STOP_WORDS.difference({'he','his','her','hers'}).union({'ll', 've'})

text_column_vectorizer = Pipeline([
    ('standardize_text', FunctionTransformer(text_column_preprocessing)),
    ('tfidf', TfidfVectorizer(min_df=20, max_df=0.5, stop_words=list(STOP_WORDS)))
])
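The effect of min_df and max_df is easier to see on a tiny corpus. The sketch below uses four made-up descriptions and much smaller thresholds than the pipeline above, purely for illustration. An integer min_df is an absolute document count, while a float max_df is a proportion of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up game descriptions.
docs = [
    'game about space pirates',
    'game about space farming',
    'game about dragons',
    'game with puzzles',
]

# Keep words that appear in at least 2 documents and in at most 80% of them.
toy_tfidf = TfidfVectorizer(min_df=2, max_df=0.8)
toy_tfidf.fit(docs)

kept = sorted(toy_tfidf.vocabulary_)
print(kept)
```

Here "game" is dropped because it appears in every document, and words such as "pirates" are dropped because they appear only once.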

Much like in the case of the list columns, a list of tuples is created, each referring to an individual free-form text column.

In [22]:
text_column_transformers = [(f'text_{column}', text_column_vectorizer, column) for column in text_columns]

In [None]:
# Under construction: allows for a quick and easy creation of a regressor.
# This will include the grid search (GS) component and the
# dimensionality reduction (DR) component.

# def generate_regressor(model='ridge', param_grid=None):
#     """
#     Creates a regressor to be used by the model to allow for quick
#     customization of the model.
    
#     Parameters
#     ----------
#     model : string or sklearn model, default 'ridge'
#         The name of the model as a string or the model object itself.
#         Currently supported models are LinearRegression, Ridge,
#         RandomForestRegressor, KNeighborsRegressor. Defaults to Ridge
#         if anything else is given, as Ridge is currently the most
#         accurate regressor.
#     param_grid : dict, default None
#         The parameter grid for grid search cross-validation based on
#         the regressor. If no param_grid is passed, defaults to the one
#         hard-coded in the function.
    
#     Returns
#     -------
#     sklearn regressor
#         The regressor object requested or the default if that type of
#         regressor is not supported by the function.
#     """
    
#     if isinstance(model, str):
#         model = model.lower()
#         if model == 'linearregression':
#             model = LinearRegression()
#         elif model == 'randomforestregressor':
#             model = RandomForestRegressor()
#         elif model == 'kneighborsregressor':
#             model = KNeighborsRegressor()
#         else:
#             model = Ridge() # the default regressor
    
#     if isinstance(model, LinearRegression):
        

It is now possible to build the model. All the features will undergo their respective transformations (remember that the numerical columns do not need any at this point).

In [23]:
features = ColumnTransformer(
    transformers = list_column_transformers + ohe_transformers + text_column_transformers,
    remainder='drop')
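The following sketch shows, on a toy data frame with made-up values, how a ColumnTransformer of this shape behaves: columns not listed are dropped, and a fitted transformer can be retrieved by name through named_transformers_. Note that the text column is given as a plain string so that the vectorizer receives a 1D series, while the one-hot encoded column is given as a list so that the encoder receives a 2D frame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    'platform': [6, 48, 6],
    'summary': ['a space shooter', 'a farming game', 'a space opera'],
    'sales_global': [1.2, 0.4, 2.5],  # not listed below, so it is dropped
})

toy_features = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), ['platform']),
        ('text_summary', TfidfVectorizer(), 'summary'),
    ],
    remainder='drop')

transformed = toy_features.fit_transform(toy)

# Access the fitted TF-IDF vectorizer by the name given in the tuple.
fitted_tfidf = toy_features.named_transformers_['text_summary']
vocabulary = sorted(fitted_tfidf.vocabulary_)

print(transformed.shape)
print(vocabulary)
```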

The data set will end up having quite a lot of features, primarily due to the free-form text vectorizer, and that number is large relative to the number of observations. To balance that, some kind of dimensionality reduction can help to improve the model. Post-transformation, the data is stored in a sparse matrix, which PCA is poorly suited to because it centers the data. Instead, we use TruncatedSVD, which does not center the data and therefore works on sparse matrices directly.

In [24]:
svd = TruncatedSVD()
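As a quick check of how TruncatedSVD handles sparse input, the sketch below reduces a random sparse matrix (a stand-in for the transformed feature matrix, with arbitrary dimensions) to a small dense one. Note that TruncatedSVD defaults to n_components=2, which is why the number of components is set explicitly here and tuned through the parameter grid further down.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for the transformed features:
# 50 observations, 200 features, 5% non-zero entries.
X_sparse = sparse.random(50, 200, density=0.05, format='csr', random_state=0)

toy_svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = toy_svd.fit_transform(X_sparse)  # dense array with 10 columns

print(X_sparse.shape, '->', X_reduced.shape)
print(f"Explained variance retained: {toy_svd.explained_variance_ratio_.sum():.2f}")
```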

Several regressors were tried with appropriate parameter grids. These include LinearRegression, Ridge, RandomForestRegressor, and KNeighborsRegressor. Out of all of them, Ridge seemed to perform the best.

In [None]:
# regressor = Ridge()
regressor = RandomForestRegressor()
# regressor = KNeighborsRegressor()

param_grid = {
# relaxed dimensionality reduction:
    'dim_reduction__n_components': [100, 250, 500],
# aggressive dimensionality reduction (for, e.g., KNN):
#     'dim_reduction__n_components': [10, 20, 30],

# Ridge hyperparameters
#    'regressor__alpha': [0.1, 1.0, 10.0]
# KNN hyperparameters
#     'regressor__n_neighbors': [3, 5, 8, 10, 15]
# RandomForest hyperparameters:
    'regressor__n_estimators': [10, 50, 100, 200, 300],
    'regressor__max_depth': [None, 10, 20, 50],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4]
}

estimator = Pipeline([
    ('dim_reduction', svd),
    ('regressor', regressor)
])

gs = GridSearchCV(
    estimator,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1
)

pipe = Pipeline([
    ('features', features),
    ('main_regressor', gs)
])
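The keys of the parameter grid follow the pattern step name, double underscore, parameter name, which is how GridSearchCV routes values to the steps of the inner pipeline. The sketch below shows this on synthetic data with a deliberately small Ridge grid; the data and grid values are made up and say nothing about the best settings for the real model.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data: 60 observations, 20 features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 20))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=60)

toy_estimator = Pipeline([
    ('dim_reduction', TruncatedSVD(random_state=0)),
    ('regressor', Ridge()),
])

# '<step name>__<parameter name>' addresses a parameter of a pipeline step.
toy_grid = {
    'dim_reduction__n_components': [5, 10],
    'regressor__alpha': [0.1, 1.0],
}

toy_search = GridSearchCV(toy_estimator, param_grid=toy_grid, cv=3)
toy_search.fit(X_toy, y_toy)

best = toy_search.best_params_
print(best)
```

In the notebook's pipeline, the search object is the 'main_regressor' step, so after fitting its best_params_ would be read from pipe.named_steps['main_regressor'].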

## Training the model

As mentioned above, the numerical columns are only the sales, which are also the (basis of the) labels, so these can be dropped from the feature matrix passed to the model. The labels are going to be the log of the global sales. In the future, it would be possible to have the model predict regional sales (e.g. NA, EU, JP).

In [25]:
X = df.drop(numeric_columns, axis=1)
y = np.log(df['sales_global'])

The data is then split into a training set and a test set.

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)

The model is then fit with the training set.

In [None]:
pipe.fit(X_train, y_train)

In [None]:
#pipe.named_steps.main_regressor.best_params_

Predictions can then be made:

In [None]:
y_pred = pipe.predict(X_test)

It's then possible to observe some metrics on the performance of the model.

In [None]:
from sklearn import metrics

print(f"Mean absolute error: {metrics.mean_absolute_error(y_test, y_pred)}")
print(f"Mean squared error: {metrics.mean_squared_error(y_test, y_pred)}")
print(f"R^2: {metrics.r2_score(y_test, y_pred)}")
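Because the labels are logs, these metrics are in log units. One way to read them, sketched below on made-up numbers rather than the model's actual output, is to exponentiate: np.exp turns predictions back into sales, and the exponential of the mean absolute error gives the typical multiplicative factor by which predictions miss.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted sales, in whatever units the data set uses.
y_true_log = np.log(np.array([0.5, 2.0, 10.0]))
y_pred_log = np.log(np.array([0.6, 1.5, 12.0]))

mae_log = metrics.mean_absolute_error(y_true_log, y_pred_log)

# exp(MAE in log space) is the geometric mean of the factors by which
# the predictions are off (e.g. 1.25 means "off by about 25% on average").
typical_factor = np.exp(mae_log)

# Back-transform predictions from log space to sales.
predicted_sales = np.exp(y_pred_log)

print(f"MAE (log units): {mae_log:.3f}")
print(f"Typical multiplicative error: x{typical_factor:.2f}")
```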

In [None]:
pipe.score(X_test, y_test)