# Baseline draft

Goals:
- have a basic pipeline for each of the tasks
- settle on the pre-processing
- keep compatibility with the language model's output
- choose the best way to include the language model

**Categorical columns**

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

class CategoriesTokenizer:
    def __init__(self):
        pass

    def __call__(self, doc):
        return doc.split(';')
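
To make the tokenizer's effect concrete, here is a minimal sketch that applies the same `;`-splitting idea to a toy column. The sample strings and the `split_categories` helper are illustrative assumptions, not values taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_categories(doc):
    # Same idea as CategoriesTokenizer: one token per ';'-separated category
    return doc.split(';')

# Hypothetical sample of a 'platforms'-like column
toy_platforms = ["windows;mac", "windows", "windows;mac;linux"]

# token_pattern=None silences the warning about the unused default pattern
vectorizer = CountVectorizer(tokenizer=split_categories, token_pattern=None)
counts = vectorizer.fit_transform(toy_platforms).toarray()

# Recover the column order from the fitted vocabulary (term -> column index)
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(counts)
```

Each row becomes a vector of category counts, one column per distinct category. Note that `CountVectorizer` lowercases the text by default before the tokenizer runs.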

This vectorizer is for columns with few possible categories (<1k):
- platforms (3 possible values)
- categories (29 possible values)
- genres (26 possible values)
- tags (306 possible values)

In [2]:
boc_some_values = CountVectorizer(
    tokenizer = CategoriesTokenizer(),
    max_df = 1.0,
    min_df = 0.05  # hyperparameter to tune (a float is a proportion of documents)
    # candidate values for GridSearch: [0.05, 0.10, 0.15] ???
    )

This second vectorizer is for developers and publishers (5617 and 3961 possible values, respectively).

In [3]:
boc_many_values = CountVectorizer(
    tokenizer = CategoriesTokenizer(),
    max_df = 1.0,
    min_df = 1  # hyperparameter to tune (an int is an absolute document count)
    # candidate values for GridSearch: [5, 10, 15] ???
    )
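
The two vectorizers differ only in how `min_df` is expressed. As a small sketch of that difference, the toy column below (made up for illustration, as is the `kept_terms` helper) shows which categories survive for integer and float thresholds.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_categories(doc):
    return doc.split(';')

# Hypothetical column: 'a' appears in 4 rows, 'b' in 2 rows, 'c' in 1 row
toy = ["a", "a;b", "a;b", "a;c"]

def kept_terms(min_df):
    # Fit a vectorizer with the given threshold and list the surviving categories
    vec = CountVectorizer(tokenizer=split_categories, token_pattern=None, min_df=min_df)
    vec.fit(toy)
    return sorted(vec.vocabulary_)

print(kept_terms(1))     # int: categories present in at least 1 row
print(kept_terms(2))     # int: categories present in at least 2 rows
print(kept_terms(0.75))  # float: categories present in at least 75% of the rows
```

A proportion suits the low-cardinality columns, where most categories are frequent. An absolute count is easier to reason about for developers and publishers, where most values appear only a handful of times.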

**Putting it all together**

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, PowerTransformer


preprocesisng = ColumnTransformer(
    transformers=[
        ('BoC-plat',boc_some_values,'platforms'),
        ('BoC-cat',boc_some_values,'categories'),
        ('BoC-genres',boc_some_values,'genres'),
        ('BoC-tags',boc_some_values,'tags'),

        ('BoC-dev',boc_many_values,'developer'),
        ('BoC-pub',boc_many_values,'publisher'),

        # ('OneHotEncoder',OneHotEncoder(handle_unknown='ignore'),['...']),
        # ('StandardScaler',StandardScaler(), ['...']),
        ('MinMaxScaler',MinMaxScaler(),['required_age','price']),
        ('BoxCox',PowerTransformer(method='yeo-johnson'),['achievements','average_playtime']),
        # ('unchanged','passthrough',['english'])  # 'passthrough' leaves the column untouched
])
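
On the open question of how to leave a column unchanged: `ColumnTransformer` accepts the string `'passthrough'` in place of a transformer. The sketch below uses a made-up two-column frame to show the behavior; the values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical mini-frame with one column to scale and one to keep as-is
toy = pd.DataFrame({"price": [0.0, 10.0, 20.0], "english": [1, 0, 1]})

ct = ColumnTransformer(
    transformers=[
        ("MinMaxScaler", MinMaxScaler(), ["price"]),
        ("unchanged", "passthrough", ["english"]),  # copied through without changes
    ]
)
out = ct.fit_transform(toy)
print(out)
```

Columns not listed in any transformer are dropped by default (`remainder='drop'`), so a column has to be named explicitly, as here, to reach the classifier.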

In [5]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

svm_lineal = Pipeline([
    ('Pre-procesamiento',preprocesisng),
    ('Clasificador',LinearSVC(random_state=0,max_iter=10000))
])
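
The `min_df` values flagged above as candidates for a grid search can be reached through the nested parameter names of the pipeline. The following is a minimal sketch on made-up data; the step names `pre`, `tags` and `clf`, the toy frame and the `f1_macro` scoring are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def split_categories(doc):
    return doc.split(';')

# Hypothetical toy data in which the tags fully determine the label
toy = pd.DataFrame({"tags": ["action;indie"] * 6 + ["casual;indie"] * 6})
labels = ["Positive"] * 6 + ["Negative"] * 6

pipe = Pipeline([
    ("pre", ColumnTransformer([
        ("tags", CountVectorizer(tokenizer=split_categories, token_pattern=None), "tags"),
    ])),
    ("clf", LinearSVC(random_state=0, max_iter=10000)),
])

# Nested parameters: <pipeline step>__<transformer name>__<parameter>
param_grid = {"pre__tags__min_df": [0.05, 0.10, 0.15]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(toy, labels)
print(search.best_params_, search.best_score_)
```

With the step names used in this notebook, the equivalent key would presumably be `'Pre-procesamiento__BoC-tags__min_df'`.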

In [6]:
from sklearn.model_selection import train_test_split
import pandas as pd

df_train = pd.read_pickle('train.pickle')
X_train, X_eval, y_train, y_eval = train_test_split(df_train, df_train['rating'], test_size=0.3, random_state=0, stratify=df_train['rating'])

In [7]:
from sklearn.metrics import classification_report

print("Linear SVM classification results")
svm_lineal.fit(X_train, y_train)
y_svm = svm_lineal.predict(X_eval)
print(classification_report(y_eval,y_svm))

Linear SVM classification results
                 precision    recall  f1-score   support

          Mixed       0.30      0.30      0.30       497
Mostly Positive       0.26      0.21      0.23       512
       Negative       0.40      0.40      0.40       387
       Positive       0.32      0.40      0.36       610
  Very Positive       0.40      0.36      0.38       359

       accuracy                           0.33      2365
      macro avg       0.34      0.33      0.33      2365
   weighted avg       0.33      0.33      0.33      2365
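
To put the 0.33 accuracy in context, it helps to compare it with two trivial strategies: always predicting the most frequent class, and guessing uniformly at random. The short calculation below uses only the support column of the report above.

```python
# Support per class, copied from the classification report above
support = {
    "Mixed": 497,
    "Mostly Positive": 512,
    "Negative": 387,
    "Positive": 610,
    "Very Positive": 359,
}

total = sum(support.values())
majority_class = max(support, key=support.get)
majority_accuracy = support[majority_class] / total  # always predict the largest class
uniform_accuracy = 1 / len(support)                  # expected accuracy of a uniform guess

print(f"total = {total}")
print(f"always '{majority_class}': accuracy = {majority_accuracy:.3f}")
print(f"uniform random guess: expected accuracy = {uniform_accuracy:.3f}")
```

The linear SVM is therefore above both trivial baselines, though not by a wide margin.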



## Adding embeddings

In [8]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np

MODEL = "distilbert-videogame-descriptions-rating"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

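def sentence_clf_output(text):
    """Return the SequenceClassifierOutput for a single text."""
    # truncation=True keeps long inputs within the model's maximum sequence length
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True)
    output = model(**encoded_input, return_dict=True, output_hidden_states=True)
    return output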

### Logits version

In [9]:
def logits_embedding(clf_output):
    # returns the vector of classification scores (before the softmax layer)
    return clf_output['logits'][0].detach().numpy().reshape(1,5)

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

class LogitsEmbedding(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        embed = lambda row: logits_embedding(sentence_clf_output(row))
        X_new = X.apply(embed)
        X_new = np.concatenate(X_new.values)
        return X_new

In [11]:
preprocesisng_logits = ColumnTransformer(
    transformers=[
        ('BoC-plat',boc_some_values,'platforms'),
        ('BoC-cat',boc_some_values,'categories'),
        ('BoC-genres',boc_some_values,'genres'),
        ('BoC-tags',boc_some_values,'tags'),

        ('BoC-dev',boc_many_values,'developer'),
        ('BoC-pub',boc_many_values,'publisher'),

        # ('OneHotEncoder',OneHotEncoder(handle_unknown='ignore'),['...']),
        # ('StandardScaler',StandardScaler(), ['...']),
        ('MinMaxScaler',MinMaxScaler(),['required_age','price']),
        ('BoxCox',PowerTransformer(method='yeo-johnson'),['achievements','average_playtime']),
        # ('unchanged','passthrough',['english'])  # 'passthrough' leaves the column untouched

        ('LogitsText',LogitsEmbedding(),'short_description')
])

svm_lineal_logits = Pipeline([
    ('Pre-procesamiento',preprocesisng_logits),
    ('Clasificador',LinearSVC(random_state=0,max_iter=10000))
])

In [12]:
print("Linear SVM classification results with logit embeddings")
svm_lineal_logits.fit(X_train, y_train)
y_svm = svm_lineal_logits.predict(X_eval)
print(classification_report(y_eval,y_svm))

Linear SVM classification results with logit embeddings
                 precision    recall  f1-score   support

          Mixed       0.29      0.28      0.28       497
Mostly Positive       0.26      0.22      0.24       512
       Negative       0.39      0.40      0.40       387
       Positive       0.33      0.37      0.35       610
  Very Positive       0.35      0.35      0.35       359

       accuracy                           0.32      2365
      macro avg       0.32      0.32      0.32      2365
   weighted avg       0.32      0.32      0.32      2365



### [CLS] token version

In [13]:
def first_tok_embedding(clf_output):
    # returns a numpy array with the contextualized embedding of the first token ([CLS])
    return clf_output['hidden_states'][-1][0][0].detach().numpy().reshape(1,768)

class CLFTokenEmbedding(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        embed = lambda row: first_tok_embedding(sentence_clf_output(row))
        X_new = X.apply(embed)
        X_new = np.concatenate(X_new.values)
        return X_new

In [14]:
preprocesisng_CLFToken = ColumnTransformer(
    transformers=[
        ('BoC-plat',boc_some_values,'platforms'),
        ('BoC-cat',boc_some_values,'categories'),
        ('BoC-genres',boc_some_values,'genres'),
        ('BoC-tags',boc_some_values,'tags'),

        ('BoC-dev',boc_many_values,'developer'),
        ('BoC-pub',boc_many_values,'publisher'),

        # ('OneHotEncoder',OneHotEncoder(handle_unknown='ignore'),['...']),
        # ('StandardScaler',StandardScaler(), ['...']),
        ('MinMaxScaler',MinMaxScaler(),['required_age','price']),
        ('BoxCox',PowerTransformer(method='yeo-johnson'),['achievements','average_playtime']),
        # ('unchanged','passthrough',['english'])  # 'passthrough' leaves the column untouched

        ('CLSText',CLFTokenEmbedding(),'short_description')
])

svm_lineal_CLFToken = Pipeline([
    ('Pre-procesamiento',preprocesisng_CLFToken),
    ('Clasificador',LinearSVC(random_state=0,max_iter=10000))
])

In [15]:
print("Linear SVM classification results with [CLS] token embeddings")
svm_lineal_CLFToken.fit(X_train, y_train)
y_svm = svm_lineal_CLFToken.predict(X_eval)
print(classification_report(y_eval,y_svm))

Linear SVM classification results with [CLS] token embeddings
                 precision    recall  f1-score   support

          Mixed       0.29      0.28      0.28       497
Mostly Positive       0.25      0.26      0.26       512
       Negative       0.41      0.40      0.40       387
       Positive       0.31      0.33      0.32       610
  Very Positive       0.32      0.30      0.31       359

       accuracy                           0.31      2365
      macro avg       0.32      0.31      0.32      2365
   weighted avg       0.31      0.31      0.31      2365



### Mean embedding version

In [16]:
def mean_embedding(clf_output):
    # returns a numpy array with the mean of the contextualized token vectors
    return clf_output['hidden_states'][-1][0].detach().numpy().mean(axis=0).reshape(1,768)

class MeanEmbedding(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        embed = lambda row: mean_embedding(sentence_clf_output(row))
        X_new = X.apply(embed)
        X_new = np.concatenate(X_new.values)
        return X_new
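
`mean_embedding` averages over every token of a single, unpadded text. If descriptions were ever encoded in batches with padding (an assumption; this notebook encodes one text at a time), the padding positions would have to be excluded from the average. A minimal NumPy sketch of such a masked mean, with made-up arrays standing in for real hidden states and a hypothetical `masked_mean` helper:

```python
import numpy as np

def masked_mean(hidden, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0.

    hidden: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1, None)  # guard against division by zero
    return summed / counts

# Toy stand-in for last-layer hidden states: 2 sequences, 3 positions, dimension 2
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(hidden, attention_mask)
print(pooled)
```

The first row averages only its two real tokens, so the padding vector does not leak into the embedding.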

In [17]:
preprocesisng_mean = ColumnTransformer(
    transformers=[
        ('BoC-plat',boc_some_values,'platforms'),
        ('BoC-cat',boc_some_values,'categories'),
        ('BoC-genres',boc_some_values,'genres'),
        ('BoC-tags',boc_some_values,'tags'),

        ('BoC-dev',boc_many_values,'developer'),
        ('BoC-pub',boc_many_values,'publisher'),

        # ('OneHotEncoder',OneHotEncoder(handle_unknown='ignore'),['...']),
        # ('StandardScaler',StandardScaler(), ['...']),
        ('MinMaxScaler',MinMaxScaler(),['required_age','price']),
        ('BoxCox',PowerTransformer(method='yeo-johnson'),['achievements','average_playtime']),
        # ('unchanged','passthrough',['english'])  # 'passthrough' leaves the column untouched

        ('MeanText',MeanEmbedding(),'short_description')
])

svm_lineal_mean = Pipeline([
    ('Pre-procesamiento',preprocesisng_mean),
    ('Clasificador',LinearSVC(random_state=0,max_iter=10000))
])

In [None]:
print("Linear SVM classification results with mean embeddings")
svm_lineal_mean.fit(X_train, y_train)
y_svm = svm_lineal_mean.predict(X_eval)

In [19]:
print(classification_report(y_eval,y_svm))

                 precision    recall  f1-score   support

          Mixed       0.28      0.27      0.27       497
Mostly Positive       0.24      0.24      0.24       512
       Negative       0.37      0.33      0.34       387
       Positive       0.30      0.33      0.32       610
  Very Positive       0.34      0.34      0.34       359

       accuracy                           0.30      2365
      macro avg       0.31      0.30      0.30      2365
   weighted avg       0.30      0.30      0.30      2365



### Language model results without other features

In [20]:
from scipy.special import softmax

def eval_text(text):
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    return np.argmax(scores), scores
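
A side note on `eval_text`: softmax is monotonic, so the predicted label is the same whether `argmax` is taken on the logits or on the probabilities; the softmax only matters if the scores themselves are used. The sketch below checks this on made-up logits, with a hand-written `softmax` in place of the SciPy one so that it is self-contained.

```python
import numpy as np

def softmax(scores):
    # Subtracting the maximum improves numerical stability without changing the result
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up logits for the five rating classes
logits = np.array([0.2, -1.3, 2.1, 0.7, -0.4])
probs = softmax(logits)

print(probs.round(3), probs.sum())
print(np.argmax(logits), np.argmax(probs))
```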

In [35]:
y_lm = []
label_names = ['Negative', 'Mixed', 'Mostly Positive', 'Positive', 'Very Positive']
# label_names = ['Very Positive','Positive' , 'Mostly Positive', 'Mixed','Negative' ]

for texto in X_eval['short_description']:
    label, _ = eval_text(texto)
    y_lm.append(label_names[label])

In [36]:
report = classification_report(y_eval, y_lm)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
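
The `_warn_prf` lines are the tail of a scikit-learn warning about undefined metrics, which most likely means that at least one class was never predicted by the language model. The sketch below, on made-up labels, shows one way to list such classes and to build the report without the warning through the `zero_division` argument.

```python
from sklearn.metrics import classification_report

# Made-up labels in which 'Mixed' is never predicted
y_true = ["Negative", "Mixed", "Positive", "Mixed"]
y_pred = ["Negative", "Negative", "Positive", "Positive"]

# Classes present in the ground truth but absent from the predictions
never_predicted = sorted(set(y_true) - set(y_pred))
print("never predicted:", never_predicted)

# zero_division=0 reports undefined precision as 0 instead of emitting a warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["Mixed"])
```

In this notebook the equivalent call would be `print(classification_report(y_eval, y_lm, zero_division=0))`, which would also display the report that the cell above only stores in `report`.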


### Classic bag-of-words

In [37]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk import word_tokenize 

stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

# Tokenizer with stop-word removal and stemming
class StemmerTokenizer:
    def __init__(self):
        self.ps = PorterStemmer()
    def __call__(self, doc):
        doc_tok = word_tokenize(doc)
        doc_tok = [t for t in doc_tok if t not in stop_words]
        return [self.ps.stem(t) for t in doc_tok]

bow = CountVectorizer(
    tokenizer= StemmerTokenizer(),
    ngram_range=(1,2),
    min_df = 0.05, max_df = 0.85
    )

In [40]:
preprocesisng_bow = ColumnTransformer(
    transformers=[
        ('BoC-plat',boc_some_values,'platforms'),
        ('BoC-cat',boc_some_values,'categories'),
        ('BoC-genres',boc_some_values,'genres'),
        ('BoC-tags',boc_some_values,'tags'),

        ('BoC-dev',boc_many_values,'developer'),
        ('BoC-pub',boc_many_values,'publisher'),

        # ('OneHotEncoder',OneHotEncoder(handle_unknown='ignore'),['...']),
        # ('StandardScaler',StandardScaler(), ['...']),
        ('MinMaxScaler',MinMaxScaler(),['required_age','price']),
        ('BoxCox',PowerTransformer(method='yeo-johnson'),['achievements','average_playtime']),
        # ('unchanged','passthrough',['english'])  # 'passthrough' leaves the column untouched

        ('BoWText',bow,'short_description')
])

svm_lineal_bow = Pipeline([
    ('Pre-procesamiento',preprocesisng_bow),
    ('Clasificador',LinearSVC(random_state=0,max_iter=10000))
])

In [41]:
print("Linear SVM classification results with bag-of-words")
svm_lineal_bow.fit(X_train, y_train)
y_svm = svm_lineal_bow.predict(X_eval)
print(classification_report(y_eval,y_svm))

Linear SVM classification results with bag-of-words
                 precision    recall  f1-score   support

          Mixed       0.29      0.27      0.28       497
Mostly Positive       0.25      0.23      0.24       512
       Negative       0.40      0.39      0.40       387
       Positive       0.33      0.38      0.35       610
  Very Positive       0.39      0.36      0.38       359

       accuracy                           0.32      2365
      macro avg       0.33      0.33      0.33      2365
   weighted avg       0.32      0.32      0.32      2365



## Regression baseline

In [8]:
X_train, X_eval, y_train, y_eval = train_test_split(df_train, df_train['estimated_sells'], test_size=0.3, random_state=0)

In [9]:
from sklearn.svm import SVR

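# Note: SVR() defaults to an RBF kernel; SVR(kernel='linear') or LinearSVR would give a strictly linear model
svr_lineal = Pipeline([
    ('Pre-procesamiento',preprocesisng),
    ('Regresor',SVR())
])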

In [10]:
svr_lineal.fit(X_train, y_train)
y_svm = svr_lineal.predict(X_eval)

In [11]:
from sklearn.metrics import r2_score, mean_squared_error

print("SVR regression results (default RBF kernel)")
print("Mean squared error = {}".format(mean_squared_error(y_eval,y_svm)))
print("R2 score = {}".format(r2_score(y_eval,y_svm)))

SVR regression results (default RBF kernel)
Mean squared error = 1828414958387.0896
R2 score = -0.019891039720035808
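
An R2 slightly below zero means the model does marginally worse than always predicting the mean of the evaluation targets. One plausible explanation, not verified here, is that the regressor outputs a nearly constant value close to the bulk of a right-skewed target. The sketch below illustrates the effect on made-up numbers with a hand-written `r2` helper, and also converts the reported mean squared error into an RMSE in the target's own units.

```python
import numpy as np

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - residual sum of squares / total sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up, heavily right-skewed target
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

r2_mean = r2(y, np.full_like(y, y.mean()))        # constant prediction at the mean
r2_median = r2(y, np.full_like(y, np.median(y)))  # constant prediction at the median
print(r2_mean, r2_median)

# RMSE implied by the mean squared error reported above
rmse = np.sqrt(1828414958387.0896)
print(rmse)
```

A constant prediction at the mean scores exactly zero, while a constant at the median of a skewed target scores below zero, which matches the sign of the result above.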
