# GridSearch draft (classification)

## Pre-GridSearch: choosing which models to use

**Candidates**
- Linear SVC (baseline)
- SVC (non-linear)
- KNeighborsClassifier
- RandomForestClassifier
- DecisionTreeClassifier
- MLP (sklearn's neural network)

In [1]:
from preprocessing import train_and_evaluate_clf
from sklearn.model_selection import train_test_split
import pandas as pd

import time

def timeSince(since):
    """Return the seconds elapsed since `since`, a time.time_ns() timestamp."""
    now = time.time_ns()
    return (now - since) * 1e-9
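
The `timeSince` helper above measures elapsed wall-clock time from a `time.time_ns()` timestamp. As a minimal alternative sketch, assuming only the standard library, the same idea can be wrapped in a context manager built on `time.perf_counter()`, a monotonic clock meant for measuring short durations. The name `timed` and its `results` argument are hypothetical and not part of this notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the elapsed seconds.

    `timed` is a hypothetical helper, not part of the notebook.
    If `results` is a dict, the elapsed time is also stored under `label`.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("Time elapsed for {}: {:.3f} seconds".format(label, elapsed))

# Usage: time an arbitrary block of work.
timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Storing the timings in a dict makes it possible to compare models after the loop instead of reading them off the printed output.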

In [2]:
import re

def custom_features(dataframe_in):
    df = dataframe_in.copy(deep=True)

    df['month'] = pd.to_datetime(df['release_date']).dt.month
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.to_julian_date())

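    # Default revenue for publishers not matched below. A scalar avoids the
    # index alignment that pd.Series([...]) would apply on a non-default index.
    df['revenue'] = 0.0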

    df.loc[df.publisher.str.match('.*microsoft.*', flags=re.IGNORECASE).values, 'revenue'] = 10.260
    df.loc[df.publisher.str.match('.*netease.*', flags=re.IGNORECASE).values, 'revenue'] = 6.668
    df.loc[df.publisher.str.match('.*activision.*', flags=re.IGNORECASE).values, 'revenue'] = 6.388
    df.loc[df.publisher.str.match('.*electronic.*', flags=re.IGNORECASE).values, 'revenue'] = 5.537
    df.loc[df.publisher.str.match('.*bandai.*', flags=re.IGNORECASE).values, 'revenue'] = 3.018
    df.loc[df.publisher.str.match('.*square.*', flags=re.IGNORECASE).values, 'revenue'] = 2.386
    df.loc[df.publisher.str.match('.*nexon.*', flags=re.IGNORECASE).values, 'revenue'] = 2.286
    df.loc[df.publisher.str.match('.*ubisoft.*', flags=re.IGNORECASE).values, 'revenue'] = 1.446
    df.loc[df.publisher.str.match('.*konami.*', flags=re.IGNORECASE).values, 'revenue'] = 1.303
    df.loc[df.publisher.str.match('.*SEGA.*').values, 'revenue'] = 1.153  # case-sensitive, unlike the other patterns
    df.loc[df.publisher.str.match('.*capcom.*', flags=re.IGNORECASE).values, 'revenue'] = 0.7673
    df.loc[df.publisher.str.match('.*warner.*', flags=re.IGNORECASE).values, 'revenue'] = 0.7324

    return df
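
The revenue assignments in `custom_features` repeat one pattern per publisher. A table-driven version could look like the sketch below. It assumes that substring matching with `str.contains` is an acceptable stand-in for `str.match('.*name.*')`, and it matches every pattern case-insensitively, which differs from the original for `SEGA`. The names `PUBLISHER_REVENUE` and `add_revenue` are hypothetical.

```python
import re
import pandas as pd

# Publisher patterns and revenue values copied from custom_features above.
# Assumption: all patterns are matched case-insensitively here.
PUBLISHER_REVENUE = {
    'microsoft': 10.260,
    'netease': 6.668,
    'activision': 6.388,
    'electronic': 5.537,
    'bandai': 3.018,
    'square': 2.386,
    'nexon': 2.286,
    'ubisoft': 1.446,
    'konami': 1.303,
    'sega': 1.153,
    'capcom': 0.7673,
    'warner': 0.7324,
}

def add_revenue(df, table=PUBLISHER_REVENUE):
    """Return a copy of df with a 'revenue' column filled from `table`."""
    out = df.copy()
    out['revenue'] = 0.0
    for pattern, revenue in table.items():
        # na=False treats missing publishers as "no match".
        mask = out['publisher'].str.contains(pattern, flags=re.IGNORECASE, na=False)
        out.loc[mask, 'revenue'] = revenue
    return out

# Small illustrative frame; not the notebook's data.
demo = pd.DataFrame({'publisher': ['Ubisoft Montreal', 'Indie Studio', None, 'SQUARE ENIX']})
print(add_revenue(demo))
```

As in the original, later entries overwrite earlier ones when a publisher string matches more than one pattern, because the dict is iterated in insertion order.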

In [3]:
df_train = pd.read_pickle('train.pickle')
df_train = custom_features(df_train)
X_train, X_eval, y_train, y_eval = train_test_split(
    df_train, df_train['rating'],
    test_size=0.3, random_state=0, stratify=df_train['rating']
)
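
The split above passes `stratify=df_train['rating']`, which asks `train_test_split` to keep the class proportions of `rating` roughly equal in both subsets. The sketch below illustrates this on synthetic labels. The data is made up for the example and does not come from `train.pickle`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only; not the notebook's data).
labels = pd.Series(['Positive'] * 60 + ['Mixed'] * 30 + ['Negative'] * 10)
features = pd.DataFrame({'x': range(len(labels))})

X_tr, X_ev, y_tr, y_ev = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)

# With stratify, class proportions in both subsets track the full data.
full_share = labels.value_counts(normalize=True).sort_index()
train_share = y_tr.value_counts(normalize=True).sort_index()
eval_share = y_ev.value_counts(normalize=True).sort_index()
print(pd.DataFrame({'full': full_share, 'train': train_share, 'eval': eval_share}))
```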

In [4]:
from sklearn.svm import LinearSVC

baseline = LinearSVC(random_state=0,max_iter=10000)
train_and_evaluate_clf(baseline,X_train,y_train,X_eval,y_eval)

Resultados clasificación LinearSVC
                 precision    recall  f1-score   support

          Mixed       0.29      0.30      0.30       497
Mostly Positive       0.26      0.23      0.24       512
       Negative       0.42      0.33      0.37       387
       Positive       0.33      0.44      0.38       610
  Very Positive       0.41      0.30      0.35       359

       accuracy                           0.33      2365
      macro avg       0.34      0.32      0.33      2365
   weighted avg       0.33      0.33      0.33      2365
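
To read the report above, it helps to recompute the two averages by hand. The sketch below uses only the per-class f1-scores and supports printed in the LinearSVC report. Because those values are already rounded to two decimals, the results are approximate. It also computes the share of the largest class, which is the accuracy a trivial classifier would reach on this split by always predicting that class.

```python
# Per-class (f1-score, support), copied from the LinearSVC report above.
report = {
    'Mixed': (0.30, 497),
    'Mostly Positive': (0.24, 512),
    'Negative': (0.37, 387),
    'Positive': (0.38, 610),
    'Very Positive': (0.35, 359),
}

total = sum(support for _, support in report.values())

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

# Accuracy of always predicting the most frequent class in this split.
majority_share = max(support for _, support in report.values()) / total

print("macro avg f1    = {:.2f}".format(macro_f1))
print("weighted avg f1 = {:.2f}".format(weighted_f1))
print("majority share  = {:.2f}".format(majority_share))
```

The baseline accuracy of 0.33 is therefore only modestly above the roughly 0.26 that a constant prediction of the largest class would give on this evaluation split.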



In [5]:
from sklearn.svm import SVC
# from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

clasificadores = [
    SVC(random_state=0),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    BaggingClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
    MLPClassifier(early_stopping=True, max_iter=100, random_state=0)
]

In [6]:
for clf in clasificadores:
    start = time.time_ns()
    train_and_evaluate_clf(clf,X_train,y_train,X_eval,y_eval)
    print("Time elapsed for {} method: {} seconds\n".format(type(clf).__name__,timeSince(start)))

Resultados clasificación SVC
                 precision    recall  f1-score   support

          Mixed       0.33      0.28      0.30       497
Mostly Positive       0.25      0.15      0.19       512
       Negative       0.37      0.33      0.35       387
       Positive       0.33      0.65      0.43       610
  Very Positive       0.54      0.14      0.22       359

       accuracy                           0.33      2365
      macro avg       0.36      0.31      0.30      2365
   weighted avg       0.35      0.33      0.31      2365

Time elapsed for SVC method: 5.597117639 seconds

Resultados clasificación KNeighborsClassifier
                 precision    recall  f1-score   support

          Mixed       0.27      0.35      0.30       497
Mostly Positive       0.24      0.30      0.27       512
       Negative       0.36      0.23      0.28       387
       Positive       0.33      0.37      0.34       610
  Very Positive       0.35      0.14      0.20       359

       accuracy

In [11]:
clf1 = SVC(random_state=0, probability=True)
clf2 = KNeighborsClassifier()
clf3 = RandomForestClassifier(random_state=0)
clf4 = MLPClassifier(early_stopping=True, max_iter=100, random_state=0)

eclf_soft = VotingClassifier(
    estimators=[('svc', clf1), ('kn', clf2), ('rf', clf3), ('mlp', clf4)],
    voting='soft'
)

start = time.time_ns()
train_and_evaluate_clf(eclf_soft,X_train,y_train,X_eval,y_eval)
print("Time elapsed for voting (soft) method: {} seconds\n".format(timeSince(start)))

Resultados clasificación VotingClassifier
                 precision    recall  f1-score   support

          Mixed       0.31      0.27      0.29       497
Mostly Positive       0.27      0.27      0.27       512
       Negative       0.44      0.33      0.38       387
       Positive       0.35      0.54      0.42       610
  Very Positive       0.46      0.26      0.33       359

       accuracy                           0.35      2365
      macro avg       0.37      0.33      0.34      2365
   weighted avg       0.36      0.35      0.34      2365

Time elapsed for voting (soft) method: 37.920576699 seconds
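
The difference between soft and hard voting can be shown with plain arithmetic. In the sketch below, the probabilities are hypothetical values for a single sample and are not taken from the models in this notebook. Hard voting counts each classifier's top class, while soft voting averages the predicted probabilities before choosing.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE sample
# (rows: classifiers, columns: classes). Not taken from the notebook's models.
probas = np.array([
    [0.45, 0.55, 0.00],
    [0.40, 0.60, 0.00],
    [0.95, 0.05, 0.00],
])

# Hard voting: each classifier votes for its most probable class.
votes = probas.argmax(axis=1)
hard_choice = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the largest mean.
soft_choice = probas.mean(axis=0).argmax()

print("hard:", hard_choice, "soft:", soft_choice)
```

Here two lukewarm votes for class 1 win under hard voting, but one confident vote for class 0 wins under soft voting. Soft voting needs `predict_proba` from every estimator, which is why `SVC` is created with `probability=True` in the cell above.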



In [13]:
# probability=True is only required for soft voting; it is kept here so the
# recorded output and timing below still correspond to this cell.
clf1 = SVC(random_state=0, probability=True)
clf2 = KNeighborsClassifier()
clf3 = RandomForestClassifier(random_state=0)
clf4 = MLPClassifier(early_stopping=True, max_iter=100, random_state=0)

eclf_hard = VotingClassifier(
    estimators=[('svc', clf1), ('kn', clf2), ('rf', clf3), ('mlp', clf4)],
    voting='hard'
)

start = time.time_ns()
train_and_evaluate_clf(eclf_hard,X_train,y_train,X_eval,y_eval)
print("Time elapsed for voting (hard) method: {} seconds\n".format(timeSince(start)))

Resultados clasificación VotingClassifier
                 precision    recall  f1-score   support

          Mixed       0.32      0.37      0.34       497
Mostly Positive       0.27      0.25      0.26       512
       Negative       0.46      0.29      0.36       387
       Positive       0.35      0.54      0.42       610
  Very Positive       0.57      0.18      0.27       359

       accuracy                           0.35      2365
      macro avg       0.39      0.33      0.33      2365
   weighted avg       0.38      0.35      0.34      2365

Time elapsed for voting (hard) method: 35.801460836000004 seconds
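
Once a shortlist of models is chosen, the grid search itself could be set up roughly as in the sketch below. This is an illustration under stated assumptions and not the search used in this project: the data is synthetic (from `make_classification`), the parameter grid holds placeholder values, and `f1_macro` is an assumed scoring choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the notebook's train.pickle is not used here.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Illustrative grid; the values are placeholders, not tuned recommendations.
param_grid = {
    'n_estimators': [20, 50],
    'max_depth': [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='f1_macro',  # assumed metric for this sketch
    cv=3,
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("best CV f1_macro: {:.3f}".format(search.best_score_))
# score() uses the same scorer as the search, so this is f1_macro as well.
print("held-out f1_macro: {:.3f}".format(search.score(X_ev, y_ev)))
```

The grid has 2 x 2 = 4 combinations and 3 folds, so 12 fits plus one final refit on the full training split. The cost grows multiplicatively with each added parameter, which is one reason to narrow the candidate list first.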



# GridSearch draft (regression)

**Candidates**:
- Lasso
- ElasticNet
- Ridge
- Linear SVR
- Polynomial SVR
- RBF SVR
- Bagging
- DecisionTree
- RandomForest
- GradientBoosting
- ExtraTreesRegressor
- AdaBoostRegressor
- etc.

In [14]:
from sklearn.svm import SVR
from sklearn.linear_model import ElasticNet, Ridge, RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor, HistGradientBoostingRegressor, VotingRegressor, StackingRegressor

In [15]:
regresores = [
    ElasticNet(random_state=0),
    Ridge(random_state=0),
    SVR(kernel='linear'),
    SVR(kernel='poly'),
    SVR(kernel='rbf'),
    KNeighborsRegressor(),
    DecisionTreeRegressor(random_state=0),
    BaggingRegressor(random_state=0),
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(random_state=0),
    # ExtraTreesRegressor(random_state=0),
    # AdaBoostRegressor(random_state=0),
    # HistGradientBoostingRegressor(random_state=0),
    # VotingRegressor(estimators=[])
    # StackingRegressor(estimators=[])
]

In [16]:
df_train = pd.read_pickle('train.pickle')
df_train = custom_features(df_train)
X_train, X_eval, y_train, y_eval = train_test_split(
    df_train, df_train['estimated_sells'],
    test_size=0.3, random_state=0
)

In [17]:
from preprocessing import train_and_evaluate_reg

for reg in regresores:
    start = time.time_ns()
    train_and_evaluate_reg(reg,X_train,y_train,X_eval,y_eval)
    print("Time elapsed for {} method: {} seconds\n".format(type(reg).__name__,timeSince(start)))

Resultados regresión ElasticNet


  f = msb / msw


Error cuadrático medio = 1671385053276.5864
Score R2 = 0.06770034234323863
Time elapsed for ElasticNet method: 2.256597338 seconds

Resultados regresión Ridge


Error cuadrático medio = 1671827028592.083
Score R2 = 0.0674538082280014
Time elapsed for Ridge method: 1.8286654830000002 seconds

Resultados regresión SVR


Error cuadrático medio = 1826175581091.4065
Score R2 = -0.018641913624264594
Time elapsed for SVR method: 5.895251775 seconds

Resultados regresión SVR


Error cuadrático medio = 1828390721632.3743
Score R2 = -0.019877520442667995
Time elapsed for SVR method: 5.841998715000001 seconds

Resultados regresión SVR


Error cuadrático medio = 1828416301994.4126
Score R2 = -0.019891789185064734
Time elapsed for SVR method: 5.921423229 seconds

Resultados regresión KNeighborsRegressor


Error cuadrático medio = 1601210069958.7705
Score R2 = 0.10684399317044402
Time elapsed for KNeighborsRegressor method: 2.1025358830000003 seconds

Resultados regresión DecisionTreeRegressor


Error cuadrático medio = 4184152507003.635
Score R2 = -1.3339229594137318
Time elapsed for DecisionTreeRegressor method: 2.645488878 seconds

Resultados regresión BaggingRegressor


Error cuadrático medio = 730686091203.4044
Score R2 = 0.5924228283913218
Time elapsed for BaggingRegressor method: 7.905426198000001 seconds

Resultados regresión GradientBoostingRegressor


Error cuadrático medio = 808942717167.9564
Score R2 = 0.5487712320981176
Time elapsed for GradientBoostingRegressor method: 2.50803198 seconds

Resultados regresión RandomForestRegressor


Error cuadrático medio = 829064518289.4973
Score R2 = 0.5375472784913328
Time elapsed for RandomForestRegressor method: 62.640253447000006 seconds
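
The mean squared errors above are hard to interpret because they are in squared target units. The sketch below converts a few of the reported values to RMSE, which is in the same units as `estimated_sells`, and uses the identity R2 = 1 - MSE / Var(y) to recover the variance of the evaluation targets implied by each (MSE, R2) pair. The numbers are copied from the output above and nothing is refit.

```python
import math

# (MSE, R2) pairs copied from the regression output above.
results = {
    'ElasticNet': (1671385053276.5864, 0.06770034234323863),
    'KNeighborsRegressor': (1601210069958.7705, 0.10684399317044402),
    'DecisionTreeRegressor': (4184152507003.635, -1.3339229594137318),
    'BaggingRegressor': (730686091203.4044, 0.5924228283913218),
}

rmse = {}
implied_var = {}
for name, (mse, r2) in results.items():
    rmse[name] = math.sqrt(mse)            # same units as the target
    implied_var[name] = mse / (1.0 - r2)   # from R2 = 1 - MSE / Var(y)
    print("{:<22} RMSE = {:>12,.0f}   implied Var(y) = {:.4e}".format(
        name, rmse[name], implied_var[name]))
```

All rows imply the same target variance, as expected for models scored on the same evaluation split. A negative R2, as for `DecisionTreeRegressor` and the three `SVR` runs, means the model's MSE exceeds that variance, so it does worse on this split than predicting the mean of the evaluation targets.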



In [18]:
reg1 = GradientBoostingRegressor(random_state=0)
reg2 = RandomForestRegressor(random_state=0)
reg3 = BaggingRegressor(random_state=0)
ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('bg', reg3)])

start = time.time_ns()
train_and_evaluate_reg(ereg,X_train,y_train,X_eval,y_eval)
print("Time elapsed for voting-regressor method: {} seconds\n".format(timeSince(start)))

Resultados regresión VotingRegressor


Error cuadrático medio = 735845920098.8611
Score R2 = 0.5895446725149298
Time elapsed for voting-regressor method: 68.956703336 seconds
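
All scores in this section come from a single 70/30 split, and the gap between `BaggingRegressor` (R2 of about 0.592) and `VotingRegressor` (about 0.590) is small enough that it may fall within split-to-split variation. One way to check is k-fold cross-validation, sketched below on synthetic data from `make_regression`. The dataset, the number of folds, and the two candidate models are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; not the notebook's dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Shuffled 5-fold split, fixed seed so both models see the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    'bagging': BaggingRegressor(random_state=0),
    'gradient boosting': GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, reg in candidates.items():
    scores[name] = cross_val_score(reg, X, y, cv=cv, scoring='r2')
    print("{:<18} R2 = {:.3f} +/- {:.3f}".format(
        name, scores[name].mean(), scores[name].std()))
```

Comparing the mean scores together with their spread across folds gives a more cautious ranking than a single evaluation split.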

