# Baseline draft

Idea:
- have a basic pipeline for each of the tasks
- settle on the pre-processing
- compatibility with the language model's output
- choose the best way to include the language model

**Columns with categories**

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

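class CategoriesTokenizer:
    """Tokenizer for CountVectorizer: one token per ';'-separated category."""

    def __init__(self):
        pass

    def __call__(self, doc):
        # Each cell holds several categories joined by ';'
        return doc.split(';')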
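
As a quick sanity check, the sketch below applies the tokenizer to a few made-up `;`-separated strings, both directly and through `CountVectorizer`. The sample values are illustrative assumptions, not rows from the dataset; `token_pattern=None` is only there to silence the warning about the unused default pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer


class CategoriesTokenizer:
    """Split a ';'-separated string into one token per category."""

    def __call__(self, doc):
        return doc.split(';')


# Made-up sample values (assumption: real cells look like 'a;b;c').
sample = ['windows;mac', 'windows', 'windows;linux']

# Direct call: one token per category.
tokens = CategoriesTokenizer()(sample[0])

# Through CountVectorizer: one column per distinct category.
vectorizer = CountVectorizer(tokenizer=CategoriesTokenizer(), token_pattern=None)
counts = vectorizer.fit_transform(sample)
vocabulary = sorted(vectorizer.vocabulary_)

print(tokens)
print(vocabulary)
print(counts.toarray())
```

Keep in mind that `CountVectorizer` lowercases its input by default, so categories that differ only in case end up in the same column.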

This version of the vectorizer is for columns with few possible categories (<1k):
- platforms (3 possible values)
- categories (29 possible values)
- genres (26 possible values)
- tags (306 possible values)

In [2]:
boc_some_values = CountVectorizer(
    tokenizer = CategoriesTokenizer(),
    max_df = 1.0,
    min_df = 0.05  # hyperparameter to tune
    # candidate values for GridSearch: [5%, 10%, 15%] ???
    )
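
The two vectorizers in this notebook differ in how `min_df` is read: a float is a fraction of documents, while an integer is an absolute document count. A minimal sketch on made-up data (the documents and the `split_categories` helper are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer


def split_categories(doc):
    """Hypothetical stand-in for the notebook's tokenizer."""
    return doc.split(';')


# 10 made-up documents: 'common' is in all of them, 'rare' in only one (10%).
docs = ['common'] * 9 + ['common;rare']


def kept_terms(min_df):
    """Return the vocabulary that survives pruning for a given min_df."""
    vec = CountVectorizer(tokenizer=split_categories, token_pattern=None,
                          min_df=min_df)
    vec.fit(docs)
    return sorted(vec.vocabulary_)


as_fraction = kept_terms(0.25)  # float: term must be in at least 25% of documents
as_count = kept_terms(1)        # int: term must be in at least 1 document

print(as_fraction)
print(as_count)
```

With the float threshold only `common` survives; with the integer threshold both terms are kept.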

This other version is for developers and publishers (5617 and 3961 possible values, respectively)

In [3]:
boc_many_values = CountVectorizer(
    tokenizer = CategoriesTokenizer(),
    max_df = 1.0,
    min_df = 1  # hyperparameter to tune
    # candidate values for GridSearch: [5, 10, 15] ???
    )
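
One possible way to explore the open question about `min_df` values is `GridSearchCV` over a nested parameter name. The sketch below runs on a tiny made-up dataset with hypothetical step names (`pre`, `boc_dev`, `clf`) and a small grid `[1, 2, 3]` so that no fold prunes every term; with this notebook's step names the key would presumably look like `Pre-procesamiento__BoC-dev__min_df`, and the candidate values would be the ones listed in the comment above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def split_categories(doc):
    """Hypothetical stand-in for the notebook's tokenizer."""
    return doc.split(';')


# Toy data (assumption): the label is 'Positive' whenever developer 'a' is present.
toy = pd.DataFrame({
    'developer': ['a;b', 'a', 'b;c', 'a;c', 'b', 'c'] * 5,
    'rating': ['Positive', 'Positive', 'Negative',
               'Positive', 'Negative', 'Negative'] * 5,
})

pipeline = Pipeline([
    ('pre', ColumnTransformer([
        ('boc_dev',
         CountVectorizer(tokenizer=split_categories, token_pattern=None),
         'developer'),
    ])),
    ('clf', LinearSVC(random_state=0, max_iter=10000)),
])

# Nested names follow <pipeline step>__<transformer name>__<parameter>.
param_grid = {'pre__boc_dev__min_df': [1, 2, 3]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(toy[['developer']], toy['rating'])
best_params = search.best_params_
print(best_params)
```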

**Putting it all together**

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, PowerTransformer


preprocesisng = ColumnTransformer(
    transformers=[
        ('BoC-plat',boc_some_values,'platforms'),
        ('BoC-cat',boc_some_values,'categories'),
        ('BoC-genres',boc_some_values,'genres'),
        ('BoC-tags',boc_some_values,'tags'),

        ('BoC-dev',boc_many_values,'developer'),
        ('BoC-pub',boc_many_values,'publisher'),

        # ('OneHotEncoder',OneHotEncoder(handle_unknown='ignore'),['...']),
        # ('StandardScaler',StandardScaler(), ['...']),
        ('MinMaxScaler',MinMaxScaler(),['required_age','price']),
        # note: despite the step name, this applies Yeo-Johnson, not Box-Cox
        ('BoxCox',PowerTransformer(method='yeo-johnson'),['achievements','average_playtime']),
        # ('unchanged','passthrough',['english'])  # 'passthrough' keeps the column as-is
])
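
On the question of how to leave a column untouched: `ColumnTransformer` accepts the string `'passthrough'` in place of a transformer. A minimal sketch on a made-up frame (the values, including the 0/1 encoding of `english`, are assumptions):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Made-up rows, only to show the mechanics.
toy = pd.DataFrame({'price': [0.0, 5.0, 10.0], 'english': [1, 0, 1]})

ct = ColumnTransformer(transformers=[
    ('MinMaxScaler', MinMaxScaler(), ['price']),
    ('unchanged', 'passthrough', ['english']),  # copied to the output as-is
])

out = ct.fit_transform(toy)
print(out)
```

The first output column is the scaled price and the second is `english` exactly as it came in. Columns not listed in any transformer are dropped by default (`remainder='drop'`).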

In [5]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

svm_lineal = Pipeline([
    ('Pre-procesamiento',preprocesisng),
    ('Clasificador',LinearSVC(random_state=0,max_iter=10000))
])

In [6]:
from sklearn.model_selection import train_test_split
import pandas as pd

df_train = pd.read_pickle('train.pickle')
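# X still contains the 'rating' column, but the ColumnTransformer only selects
# the listed feature columns (remainder='drop' by default), so it never reaches the model.
X_train, X_eval, y_train, y_eval = train_test_split(
    df_train, df_train['rating'],
    test_size=0.3, random_state=0, stratify=df_train['rating'],
)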

In [7]:
from sklearn.metrics import classification_report

print("Linear SVM classification results")
svm_lineal.fit(X_train, y_train)
y_svm = svm_lineal.predict(X_eval)
print(classification_report(y_eval,y_svm))

Linear SVM classification results
                 precision    recall  f1-score   support

          Mixed       0.30      0.30      0.30       497
Mostly Positive       0.26      0.21      0.23       512
       Negative       0.40      0.40      0.40       387
       Positive       0.32      0.40      0.36       610
  Very Positive       0.40      0.36      0.38       359

       accuracy                           0.33      2365
      macro avg       0.34      0.33      0.33      2365
   weighted avg       0.33      0.33      0.33      2365
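
For context, the support column in the report above is enough to estimate two trivial reference points: always predicting the most frequent class, and guessing uniformly at random. This is only an arithmetic sketch on the reported supports, not a result from the notebook.

```python
# Support per class, copied from the classification report above.
support = {
    'Mixed': 497,
    'Mostly Positive': 512,
    'Negative': 387,
    'Positive': 610,
    'Very Positive': 359,
}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of always predicting the most frequent class.
majority_accuracy = support[majority_class] / total
# Expected accuracy of uniform random guessing over the classes.
uniform_accuracy = 1 / len(support)

print(majority_class, round(majority_accuracy, 2), round(uniform_accuracy, 2))
```

The reported accuracy of 0.33 sits above both reference points, which come out at roughly 0.26 and 0.20.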

