# GridSearch Draft (Classification)

## Pre-grid-search: choosing which models to use

**Candidates**
- Linear SVC (baseline)
- SVC (non-linear)
- KNeighbors
- RandomForestClassifier
- DecisionTreeClassifier
- MLP (sklearn neural network)
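
As a rough sketch, the candidates above could be collected as scikit-learn estimators like this. The dictionary name `clf_candidates` and the fixed `random_state` values are assumptions made for illustration; they are not settings taken from this notebook.

```python
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Hypothetical container: hyperparameters stay at their defaults,
# and seeds are fixed only so that repeated runs are comparable.
clf_candidates = {
    "Linear SVC (baseline)": LinearSVC(random_state=0),
    "SVC (non-linear, RBF kernel)": SVC(kernel="rbf"),
    "KNeighbors": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "MLP": MLPClassifier(random_state=0),
}

for name, clf in clf_candidates.items():
    print(f"{name}: {type(clf).__name__}")
```

Keeping the candidates in one mapping makes it easy to loop over them with the same training and evaluation routine before committing to a grid search on the most promising ones.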

In [1]:
from preprocessing import train_and_evaluate_clf, custom_features
from sklearn.model_selection import train_test_split
import pandas as pd

import time
import math

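def timeSince(since):
    """Return the seconds elapsed since `since`, a timestamp from time.time_ns()."""
    elapsed_ns = time.time_ns() - since
    return elapsed_ns / 1e9  # nanoseconds -> seconds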
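
An alternative way to time a block of code is a small context manager built on `time.perf_counter`, which is a monotonic clock intended for measuring intervals. This is only a sketch; the name `timed` and the dummy workload are hypothetical and not part of the notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the timing is always reported.
        elapsed = time.perf_counter() - start
        print(f"Time elapsed for {label}: {elapsed:.3f} seconds")

# Usage with a dummy workload standing in for model training.
with timed("dummy workload"):
    total = sum(i * i for i in range(100_000))
```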

In [2]:
import re

# (publisher name fragment, regex flags, revenue figure), values kept from the original draft.
# Order matters: when several fragments match a publisher, the last one wins.
# SEGA is matched case-sensitively, as in the original; all others ignore case.
PUBLISHER_REVENUE = [
    ('microsoft', re.IGNORECASE, 10.260),
    ('netease', re.IGNORECASE, 6.668),
    ('activision', re.IGNORECASE, 6.388),
    ('electronic', re.IGNORECASE, 5.537),
    ('bandai', re.IGNORECASE, 3.018),
    ('square', re.IGNORECASE, 2.386),
    ('nexon', re.IGNORECASE, 2.286),
    ('ubisoft', re.IGNORECASE, 1.446),
    ('konami', re.IGNORECASE, 1.303),
    ('SEGA', 0, 1.153),
    ('capcom', re.IGNORECASE, 0.7673),
    ('warner', re.IGNORECASE, 0.7324),
]

def custom_features(dataframe_in):
    """Return a copy with 'month', a Julian 'release_date' and a 'revenue' column."""
    df = dataframe_in.copy(deep=True)

    release = pd.to_datetime(df['release_date'])
    df['month'] = release.dt.month
    df['release_date'] = release.apply(lambda x: x.to_julian_date())

    # A scalar float default avoids index misalignment (a fresh Series carries its
    # own 0..n-1 index) and keeps the column dtype compatible with the floats below.
    df['revenue'] = 0.0

    for fragment, flags, revenue in PUBLISHER_REVENUE:
        # na=False treats missing publishers as "no match" instead of failing.
        mask = df.publisher.str.contains(fragment, flags=flags, na=False)
        df.loc[mask, 'revenue'] = revenue

    return df
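
To make the date features in `custom_features` concrete, here is a minimal sketch on a toy frame. The two dates and the column name `julian` are made up for illustration; only the `month` and Julian-date transformations mirror the function above.

```python
import pandas as pd

# Toy data: two hypothetical release dates in a consistent format.
toy = pd.DataFrame({"release_date": ["2000-01-01 12:00:00", "2020-07-15 00:00:00"]})

dates = pd.to_datetime(toy["release_date"])
toy["month"] = dates.dt.month
# Julian date: a continuous day count, so the date becomes a single numeric feature.
toy["julian"] = dates.apply(lambda ts: ts.to_julian_date())

# 2000-01-01 12:00 corresponds to Julian date 2451545.0.
print(toy)
```

The month keeps the seasonal component as a small integer, while the Julian date preserves ordering and distance between releases.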

In [8]:
# df_train = pd.read_pickle('train.pickle')
# df_train = custom_features(df_train)
# X_train, X_eval, y_train, y_eval = train_test_split(df_train, df_train['rating'], test_size=0.3, random_state=1, stratify=df_train['rating'])

# GridSearch Draft (Regression)

**Candidates**:
- Lasso
- ElasticNet
- Ridge
- Linear SVR
- Polynomial SVR
- RBF SVR
- Bagging
- DecisionTree
- RandomForest
- GradientBoosting
- ExtraTreesRegressor
- AdaBoostRegressor
- etc.
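
A possible mapping from the named candidates to scikit-learn estimators is sketched below. The three SVR variants differ only in the `kernel` argument. The dictionary name `reg_candidates` and the fixed seeds are assumptions for illustration, and the open-ended "etc." entry is deliberately left out.

```python
from sklearn.linear_model import Lasso, ElasticNet, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)

# Hypothetical container with default hyperparameters; seeds fixed for repeatability.
reg_candidates = {
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "Ridge": Ridge(),
    "Linear SVR": SVR(kernel="linear"),
    "Polynomial SVR": SVR(kernel="poly"),
    "RBF SVR": SVR(kernel="rbf"),
    "Bagging": BaggingRegressor(random_state=0),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
    "ExtraTrees": ExtraTreesRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}

for name, reg in reg_candidates.items():
    print(f"{name}: {type(reg).__name__}")
```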

In [4]:
from sklearn.svm import SVR
from sklearn.linear_model import ElasticNet, Ridge, RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor, HistGradientBoostingRegressor, VotingRegressor, StackingRegressor

In [9]:
regresores = [
    BaggingRegressor(random_state=0),
    BaggingRegressor(max_samples=0.5, n_estimators=15, random_state=0)
    # GradientBoostingRegressor(random_state=0),
    # RandomForestRegressor(random_state=0)
]
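
The two Bagging configurations above differ in `max_samples` and `n_estimators`. The same comparison could be expressed as a grid search, sketched here on synthetic data. The data from `make_regression`, the grid values, and `cv=3` are all assumptions chosen to keep the example small; they are not the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook's real data comes from train.pickle.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid covering the two hyperparameters varied in the list above.
param_grid = {
    "max_samples": [0.5, 1.0],
    "n_estimators": [10, 15],
}

search = GridSearchCV(
    BaggingRegressor(random_state=0),
    param_grid,
    scoring="r2",  # same metric family as the R2 score reported by the notebook
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated R2: {search.best_score_:.3f}")
```

Unlike a single train/eval split, the cross-validated score averages over several folds, which makes the comparison between configurations less sensitive to one particular split.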

In [12]:
df_train = pd.read_pickle('train.pickle')
df_train = custom_features(df_train)
X_train, X_eval, y_train, y_eval = train_test_split(df_train, df_train['estimated_sells'], test_size=0.3, random_state=1)
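
Note that the split above passes the full frame as features, so `X_train` still contains the `estimated_sells` column. Whether `train_and_evaluate_reg` removes it internally is not visible here. If it does not, a split along the following lines would keep the target out of the features. This is a minimal sketch on toy data; the feature column names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with two made-up features and the target column.
toy = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": [x * 0.5 for x in range(10)],
    "estimated_sells": [100 * x for x in range(10)],
})

target = "estimated_sells"
X = toy.drop(columns=[target])  # features only, target removed
y = toy[target]

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(X_train.shape, X_eval.shape)
```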

In [13]:
import numpy as np
np.seterr(divide='ignore', invalid='ignore');

from preprocessing import train_and_evaluate_reg

for reg in regresores:
    start = time.time_ns()
    train_and_evaluate_reg(reg, X_train, y_train, X_eval, y_eval)
    print(f"Time elapsed for {type(reg).__name__} method: {timeSince(start)} seconds\n")

Regression results for BaggingRegressor
Mean squared error = 409512591825.20996
R2 score = 0.6921114967477814
Time elapsed for BaggingRegressor method: 10.053536840000001 seconds

Regression results for BaggingRegressor
Mean squared error = 649806977070.0111
R2 score = 0.5114482397690754
Time elapsed for BaggingRegressor method: 8.388083729 seconds
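
The mean squared errors above are hard to read because they are in squared target units. Taking the square root gives the RMSE, which is in the same units as `estimated_sells`. The sketch below uses the two reported values; the labels assume the outputs appear in the same order as the estimators in `regresores`.

```python
import math

# MSE values copied from the output above, labeled by list order (an assumption).
mse_by_model = {
    "BaggingRegressor (defaults)": 409512591825.20996,
    "BaggingRegressor (max_samples=0.5, n_estimators=15)": 649806977070.0111,
}

# RMSE = sqrt(MSE), expressed in the units of the target.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:,.0f}")
```

Both values come out in the hundreds of thousands (roughly 640,000 and 806,000), which matches the ranking given by the R2 scores: the default configuration does better on this split.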

