## Outline
This notebook contains the code for the project's ML model.
It assumes that the data it receives is already cleaned and ready for processing; the cleaning itself is done in the data engineering notebook.

It assumes that the dataset contains columns of three data types: numerical (e.g., sales), categorical (e.g., platform, genre), and descriptive (a free-text description of the game).

The numerical data does not require any further manipulation.

The categorical data will be one-hot encoded.

The text description will need to be transformed using NLP methods. One option is some kind of count vectorizer, which lets the model learn which words appear in better-selling games. Another way of thinking about it is to treat the text description as a "review" (as on Yelp) and the corresponding sales figure as the "rating." Can the model use the words in the text description to predict the sales of a game, either on their own or combined with the other features?

## Ideas about the model
What does the model do?

What does it need to do with the numerical, categorical, and descriptive data?

This is a regression problem!

Starting from the end:
1. A regressor: what kind of regressor would work here? Given the data, I think I can try a linear model first. I can withhold part of the data as a test set to see how the linear model performs. If it doesn't do well (to be judged by metrics such as MAE, MSE, and R^2 on the test set), I can try other options. I need to remember that this is a regression problem and should therefore focus on regression estimators.
2. The regressor will be fed the feature matrix.
3. The feature matrix will include numerical data, categorical data, and descriptive (free-text) data.
4. Does the numerical data need to be altered? The only numerical data I am currently thinking of is the sales data, which is already scaled, so I don't think I need to do anything with it.
5. Categorical data will need to be one-hot encoded (OHE). These are the majority of my features, from genre and platform to franchise and potentially age rating, so OHE can be applied to all of them. This will greatly increase the number of features. Combined with the NLP data, it may result in more features than observations.
6. The text data will need to be processed using NLP methods... HOW SO?
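
The "withhold a piece of the data" step from item 1 can be sketched as follows. The data here is synthetic (random features with a known linear relationship), so it only illustrates the mechanics of a hold-out evaluation, not the behavior on the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 200 observations, 5 features
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

# Withhold 20% of the rows as a test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training rows only, then score on the withheld rows
linear = LinearRegression().fit(X_tr, y_tr)
test_r2 = linear.score(X_te, y_te)  # R^2 on data the model has not seen
```

If the number of features approaches or exceeds the number of observations, as item 5 anticipates, a regularized model such as `Ridge` is usually a more stable starting point than plain `LinearRegression`.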

## Workflow of the model

(0. Transforming the numerical data by scaling it - this step is not necessary because the only numerical data currently is the sales values, which are already scaled.)
1. Transforming the categorical data - using OHE
2. Transforming the text data - using NLP methods... TBD
3. Training a regressor, starting out with LinearRegression
4. Using cross-validation (CV)

Then:

5. Test the model by splitting the data into a training set and a testing set, and see how well it predicts the sales of the withheld testing set.
6. ...

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

In [2]:
# Standardization of strings

import re
from unidecode import unidecode

def standardize_string(string):

    if not isinstance(string, str):
        return ''
        
    # transliterates Unicode text to its closest ASCII representation (handles diacritics as well as Chinese characters)
    string = unidecode(string)
    
    # removes parentheses (and their enclosed content) as well as any character that is not alphanumeric or whitespace
    regex = r'\([^)]*\)|[^a-zA-Z0-9\s]'
    string = re.sub(regex, '', string)
    
    # standardizes spacing that there is only one space between each word
    string = re.sub(r'\s+', ' ', string)
    
    # changes to lowercase and strips whitespaces
    return string.lower().strip()
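
To make the regex steps concrete, here is a small trace on a made-up title. It uses only the standard library and skips the `unidecode` step, so it covers the cleanup after transliteration only.

```python
import re

raw = "Super Game (2001): Turbo-Edition!"  # made-up input

# Step 1: drop parenthesized content and every character that is not
# a letter, a digit, or whitespace
no_symbols = re.sub(r'\([^)]*\)|[^a-zA-Z0-9\s]', '', raw)

# Step 2: collapse runs of whitespace into a single space
single_spaced = re.sub(r'\s+', ' ', no_symbols)

# Step 3: lowercase and trim
cleaned = single_spaced.lower().strip()
print(cleaned)  # super game turboedition
```

Note that punctuation is deleted rather than replaced by a space, so hyphenated words are joined (`Turbo-Edition` becomes `turboedition`). Whether that is desirable depends on how the tokens are used downstream.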

In [3]:
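def pseudo_list_parser(item, dtype=int, ignore_ws=True):
    """Parse a string such as '[1, 2, 3]' into a list of `dtype`; return non-strings unchanged."""

    if isinstance(item, str):
        if ignore_ws:
            item = item.replace(' ', '')
        # skip empty elements so that '[]' parses to [] instead of raising in dtype('')
        elements = item.replace('[', '').replace(']', '').split(',')
        return [dtype(x) for x in elements if x.strip()]
    
    return item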
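
A quick usage sketch of the parsing idea. `parse_pseudo_list` is a hypothetical, self-contained variant written for this example (it also tolerates an empty `'[]'`), and the input values are made up.

```python
def parse_pseudo_list(item, dtype=int):
    """Hypothetical variant of pseudo_list_parser that also tolerates '[]'."""
    if isinstance(item, str):
        elements = item.replace('[', '').replace(']', '').split(',')
        # Ignore empty fragments, which is what '[]' produces after splitting
        return [dtype(x) for x in elements if x.strip()]
    # Anything that is not a string (e.g. NaN) is returned unchanged
    return item

parsed = parse_pseudo_list('[12, 5, 31]')
empty = parse_pseudo_list('[]')
missing = parse_pseudo_list(float('nan'))

print(parsed)  # [12, 5, 31]
print(empty)   # []
```

Because NaN passes through unchanged, a second step is still needed to turn missing values into empty lists before vectorizing.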

In [4]:
# df['summary'].sample(5)

In [5]:
# df.iloc[1701]['summary']

In [6]:
# standardize_string('test$ing a random\nblah')#df.iloc[1700]['summary'])

In [7]:
# df.iloc[1700]['summary']

In [8]:
# from gensim.models import Word2Vec
# from gensim.utils import simple_preprocess
# import re

# def process_review(review):
#     """
#     Splits review into sentences, then sentences into tokens. Returns 
#     nested list.
#     """
#     words = [simple_preprocess(sentence, deacc=True) 
#              for sentence in re.split(r'\.|\?|\!', review)
#              if sentence]
#     return words

# # Flatten list to contain all sentences from all reviews

# def flatten_text(text):
    
#     sentences = [sentence for review in text 
#                  for sentence in process_review(review)]
    
#     return sentences

In [9]:
# process_review('test$ing a random\nblah')

In [10]:
# process_review(df.iloc[1700]['summary'])

## Loading the data and parsing it

In [11]:
# load the data
drop_columns = [
    'name',
    'release_year',
    'closest_match',
    'match_score',
    'id',
    'first_release_date',
    'external_games',
    'release_dates',
    'similar_games',
    'language_supports',
    'status',
    'alternative_names',
    'bundles',
    'collections',
    'parent_game',
    'collection'
    ]

df = (pd.read_csv('data_complete.csv', index_col='index')
      .drop(drop_columns, axis=1))

In [12]:
# columns with numerical data
numeric_columns = ['sales_na', 'sales_eu', 'sales_jp', 'sales_other', 'sales_global']

# columns with text descriptions
text_columns = ['summary', 'storyline']

# columns that contain lists
list_columns = ['age_ratings', 'game_modes', 'genres', 'themes', 'involved_companies', 'keywords',
               'multiplayer_modes', 'franchises', 'game_engines', 'player_perspectives', 'game_localizations']
# parse the columns that contain pseudo-lists into lists; non-list values (= NaN) become empty lists
# (note: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
df[list_columns] = df[list_columns].applymap(pseudo_list_parser).applymap(lambda x: x if isinstance(x, list) else [])

# since most columns are categorical, it is simpler to exclude columns from df.columns than to explicitly list them out
non_ohe_columns = numeric_columns + text_columns + list_columns

ohe_columns = [column_name for column_name in df.columns if column_name not in non_ohe_columns]

In [13]:
df.shape

(4202, 25)

In [14]:
# sum_d = {}
# store_d = {}

# for item in df['summary']:
    
#     sum_words = standardize_string(item).split()
#     for word in sum_words:
#         sum_d[word] = sum_d.get(word, 0) + 1

# for item in df['storyline']:
    
#     store_words = standardize_string(item).split()
    
#     for word in store_words:
#         store_d[word] = store_d.get(word, 0) + 1


In [15]:
# count = 0

# for k, v in store_d.items():
#     if v > 100:
#         count += 1
        
# print(count)

## The model

In [16]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer

In [18]:
from sklearn.decomposition import TruncatedSVD

In [19]:
class DictEncoder(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
#         X_trans = pd.DataFrame()
        
#         for column in X.columns:
#             column_trans = [{key: 1 for key in item} for item in X[column]]
#             X_trans[column] = column_trans
        
#         return X_trans
        # each row is a list of labels; map it to {label: 1} so DictVectorizer can build indicator features
        return [{key: 1 for key in row} for row in X]
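
As a sketch of what this encoder feeds into, the snippet below applies the same row-to-dict mapping to a made-up `genres` column and passes the result to `DictVectorizer`. The ids are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical 'genres' column: each row is a list of genre ids
genres = [[5, 12], [12], [], [5, 31]]

# Same mapping as DictEncoder.transform: one {label: 1} dict per row
as_dicts = [{key: 1 for key in row} for row in genres]

# DictVectorizer turns the dicts into a sparse indicator matrix,
# with one column per distinct label
vectorizer = DictVectorizer()
encoded = vectorizer.fit_transform(as_dicts)

print(vectorizer.feature_names_)  # [5, 12, 31]
print(encoded.toarray())
```

Each row can switch on several columns at once, which is what distinguishes this multi-label encoding from a plain one-hot encoding. A row with an empty list becomes an all-zero row.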

In [20]:
list_column_vectorizer = Pipeline([
    ('dict_encoder', DictEncoder()),
    ('dict_vectorizer', DictVectorizer())
#    (f'dv_{col}', DictVectorizer(), col) for col in 
])

In [21]:
list_column_transformers = [(f'list_vect_{column}', list_column_vectorizer, column) for column in list_columns]
#list_column_transformers

In [22]:
ohe_transformers = [('categorical', OneHotEncoder(handle_unknown='ignore'), ohe_columns)]
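
The `handle_unknown='ignore'` setting matters once the model sees a category at prediction time that was absent from the training data. A minimal sketch with made-up platform values:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical 'platform' values seen during training
train_platforms = [['PS4'], ['Switch'], ['PC'], ['PS4']]
encoder = OneHotEncoder(handle_unknown='ignore').fit(train_platforms)

# 'Dreamcast' never appeared in training, so its row is encoded as all zeros
# instead of raising an error
encoded = encoder.transform([['Switch'], ['Dreamcast']]).toarray()

print(encoder.categories_[0])  # ['PC' 'PS4' 'Switch']
print(encoded)
```

With the default `handle_unknown='error'`, the second row would raise a `ValueError` at transform time.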

In [23]:
def text_column_preprocessing(series):
    
    return series.map(standardize_string)

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS = STOP_WORDS.difference({'he','his','her','hers'}).union({'ll', 've'})

text_column_vectorizer = Pipeline([
    ('text_parse_and_split', FunctionTransformer(text_column_preprocessing)),
    ('tfidf', TfidfVectorizer(min_df=20, max_df=0.5, stop_words=list(STOP_WORDS)))
])

text_column_transformers = [(f'text_{column}', text_column_vectorizer, column) for column in text_columns]
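
The `min_df` and `max_df` arguments are easy to mix up: an integer is an absolute document count, while a float is a fraction of all documents. A toy sketch with made-up documents and much smaller thresholds than the ones used above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "space shooter game",
    "space puzzle game",
    "racing game",
    "farming game",
]

# min_df=2   -> keep terms that appear in at least 2 documents
# max_df=0.5 -> drop terms that appear in more than 50% of the documents
tfidf = TfidfVectorizer(min_df=2, max_df=0.5)
matrix = tfidf.fit_transform(docs)

kept_terms = sorted(tfidf.vocabulary_)
print(kept_terms)  # ['space']
```

Here `game` is dropped for being too common (4 of 4 documents) and the one-off words are dropped for being too rare, leaving only `space`.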

In [26]:
features = ColumnTransformer(
    transformers = list_column_transformers + ohe_transformers + text_column_transformers,
    remainder='drop')

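# TruncatedSVD operates directly on sparse matrices (it does not center the data),
# so it is used here for dimensionality reduction in place of PCA;
# n_components is set by the grid search below
svd = TruncatedSVD()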

# regressor = Ridge()
regressor = RandomForestRegressor()
# regressor = KNeighborsRegressor()

param_grid = {
# relaxed dimensionality reduction:
    'dim_reduction__n_components': [10, 20, 30, 50, 100, 250, 500],
# aggressive dimensionality reduction (for, e.g., KNN):
#     'dim_reduction__n_components': [10, 20, 30],

# Ridge hyperparameters
#    'regressor__alpha': [0.01, 0.1, 1, 10, 100]
# KNN hyperparameters
#     'regressor__n_neighbors': [3, 5, 8, 10, 15]
# RandomForest hyperparameters:
    'regressor__n_estimators': [10, 50, 100], #, 200, 300],
    'regressor__max_depth': [None, 10, 20, 50],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4]
}

estimator = Pipeline([
    ('dim_reduction', svd),
    ('regressor', regressor)
])

gs = GridSearchCV(
    estimator,
    param_grid=param_grid,
    cv=5,
    n_jobs=2
)

pipe = Pipeline([
    ('features', features),
    ('main_regressor', gs)
])
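
The parameter names in `param_grid` follow the `<step name>__<parameter name>` convention of the inner pipeline. The sketch below shows the same structure on a small synthetic sparse matrix, using `Ridge` (one of the regressors considered above) to keep it fast. The data is random, so the selected values carry no meaning.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic sparse matrix standing in for the ColumnTransformer output
X_sparse = sparse.random(120, 40, density=0.2, format='csr', random_state=0)
y_demo = rng.normal(size=120)

demo_estimator = Pipeline([
    ('dim_reduction', TruncatedSVD(random_state=0)),
    ('regressor', Ridge()),
])

# Keys are '<step name>__<parameter name>'
demo_grid = {
    'dim_reduction__n_components': [5, 10],
    'regressor__alpha': [0.1, 1.0],
}

# 2 x 2 = 4 parameter combinations, each evaluated with 3-fold CV
demo_search = GridSearchCV(demo_estimator, param_grid=demo_grid, cv=3)
demo_search.fit(X_sparse, y_demo)
best = demo_search.best_params_
```

One design consequence of placing the grid search after the `features` step, as in the pipeline above, is that the feature transformers are fitted once on the full training set rather than refitted inside each CV fold.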

In [27]:
X = df.drop(numeric_columns, axis=1)
y = np.log(df['sales_global'])
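
Since the target is the logarithm of global sales, predictions have to be exponentiated to get back to sales units. A small sketch with made-up values; `np.log1p` is shown only as a possible alternative, on the assumption that zero sales values could occur in some dataset.

```python
import numpy as np

sales = np.array([0.5, 2.0, 10.0])  # made-up sales values

log_sales = np.log(sales)      # what the model is trained on
recovered = np.exp(log_sales)  # inverse transform, back to sales units

# The logarithm of zero is -inf, which a regressor cannot be fitted on
with np.errstate(divide='ignore'):
    log_of_zero = np.log(0.0)

# log1p(x) = log(1 + x) stays finite at zero; its inverse is np.expm1
safe_log_of_zero = np.log1p(0.0)
```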

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02)

In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe.named_steps.main_regressor.best_params_

In [None]:
preds = pipe.predict(X_test)

In [None]:
from sklearn import metrics

print(f"Mean absolute error: {metrics.mean_absolute_error(y_test, preds)}")
print(f"Mean squared error: {metrics.mean_squared_error(y_test, preds)}")
print(f"R^2: {metrics.r2_score(y_test, preds)}")

In [55]:
# Ridge with 500 components (or less)

from sklearn import metrics

print(f"Mean absolute error: {metrics.mean_absolute_error(y_test, preds)}")
print(f"Mean squared error: {metrics.mean_squared_error(y_test, preds)}")
print(f"R^2: {metrics.r2_score(y_test, preds)}")

Mean absolute error: 0.5733643988756195
Mean squared error: 0.5429549244012856
R^2: 0.6581879515557436


In [56]:
pipe.score(X_test, y_test)

0.6581879515557436
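
`pipe.score` matches the `r2_score` output because the default `score` of a scikit-learn regressor is R^2. For reference, the three metrics can be computed by hand as below. The arrays are made up, and the last line reads the MAE reported above as a multiplicative factor, since the target is in log units.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # made-up log-sales
y_pred = np.array([1.5, 1.5, 3.0, 5.0])  # made-up predictions

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))  # mean absolute error
mse = np.mean(residuals ** 2)     # mean squared error

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse)  # 0.5 0.375

# An MAE of about 0.573 in log space means predictions are typically off
# by a factor of roughly exp(0.573) in the original sales units
typical_factor = np.exp(0.5733643988756195)
```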

In [33]:
# from sklearn.preprocessing import MultiLabelBinarizer

# mlb = MultiLabelBinarizer()

#trans_test = mlb.fit_transform(test['genres'])
#trans_columns = [f'genres_{column}' for column in mlb.classes_]
#trans_df = pd.DataFrame(trans_test, columns=trans_columns)
#result_df = pd.concat([test, trans_df], axis=1)
#result_df

In [34]:
# def SeriesEncoder(X):
    
#     return pd.Series([{key: 1 for key in row} for row in X], name=X.name)

In [35]:
# from sklearn.preprocessing import FunctionTransformer

# def DictEncoder(X):
    
#     if isinstance(X, pd.Series):
#         print (f'DictEncoder treating X as a Series (name={X.name})')
#         return SeriesEncoder(X)
    
#     else:
    
#         X_trans = pd.DataFrame()
#         print (f'DictEncoder treating X as a DataFrame (columns={X.columns})')
        
#         for column in X.columns:
# #             yield SeriesEncoder(X[column])

#             encoded_column = SeriesEncoder(X[column])

#             X_trans = pd.concat([X_trans, encoded_column], axis=1)

#         return X_trans
    
#     return [{key: 1 for key in item} for item in X]

In [36]:
# def ohe_list_column(series):
    
#     trans_series = mlb.fit_transform(series)
    
#     trans_columns_names = [f'{series.name}_{column}' for column in mlb.classes_]
    
#     trans_df = pd.DataFrame(trans_series, columns=trans_columns_names)
    
#     return trans_df