# Text-based classification pre-model

A video game's description alone will almost certainly not be enough to predict its rating. We will train a model as if it were, then extract part of that model to add features to a more robust classifier.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Loading the dataset

In [2]:
df_train = pd.read_pickle('train.pickle')

X_train, X_eval, y_train, y_eval = train_test_split(df_train, df_train['rating'], test_size=0.3, random_state=0, stratify=df_train['rating'])
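
The split sizes that appear later in the notebook output (5516 training rows, 2365 evaluation rows) can be checked by hand. This is a small arithmetic sketch, assuming scikit-learn rounds the test fraction up and assigns the remaining rows to training; the total row count is inferred from those two reported sizes.

```python
import math

# Sizes reported later in the notebook output.
n_train_reported = 5516
n_eval_reported = 2365
n_total = n_train_reported + n_eval_reported  # rows before splitting

test_size = 0.3
# The test split is rounded up; the remainder goes to training.
n_eval = math.ceil(test_size * n_total)
n_train = n_total - n_eval

print(n_train, n_eval)  # 5516 2365
```

Because `stratify=df_train['rating']` is passed, each rating class should keep roughly the same share in both splits.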

In [3]:
# Keep only the description text and the target rating.
cols_to_drop = ['name','release_date','english','developer','publisher','platforms','required_age','categories','genres','tags','achievements','average_playtime','price','estimated_sells']
X_train = X_train.drop(columns=cols_to_drop)
X_eval = X_eval.drop(columns=cols_to_drop)

In [4]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5516 entries, 3997 to 2515
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   short_description  5516 non-null   object  
 1   rating             5516 non-null   category
dtypes: category(1), object(1)
memory usage: 91.6+ KB


In [5]:
X_train['label'] = X_train['rating'].astype(str)
X_eval['label'] = X_eval['rating'].astype(str)

In [6]:
X_train = X_train.drop(columns=["rating"])
X_eval = X_eval.drop(columns=["rating"])

In [7]:
from datasets import Dataset

data_train = Dataset.from_pandas(X_train)
data_eval = Dataset.from_pandas(X_eval)

In [8]:
data_train = data_train.remove_columns('__index_level_0__')
data_eval = data_eval.remove_columns('__index_level_0__')

In [10]:
data_train = data_train.rename_column("short_description", "text")
data_eval = data_eval.rename_column("short_description", "text")

In [11]:
print(data_train[0])
data_train

{'text': 'Play as the hunter girl Azel and battle monsters with equipment made of their corpses while searching for the Black Demon Beast that branded her.', 'label': 'Very Positive'}


Dataset({
    features: ['text', 'label'],
    num_rows: 5516
})

In [12]:
from datasets import ClassLabel

# One shared feature definition so both splits use the same label-to-id order.
rating_feature = ClassLabel(num_classes=5, names=['Negative', 'Mixed', 'Mostly Positive', 'Positive', 'Very Positive'])
data_train = data_train.cast_column("label", rating_feature)
data_eval = data_eval.cast_column("label", rating_feature)

In [13]:
data_train.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['Negative', 'Mixed', 'Mostly Positive', 'Positive', 'Very Positive'], id=None)}
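
The `ClassLabel` feature stores each rating as an integer id that follows the order of `names`. The sketch below mirrors that mapping with plain dictionaries; `id2label` and `label2id` are names chosen here for illustration. Dictionaries of this shape can be passed to `from_pretrained` through its `id2label` and `label2id` arguments so that predictions carry readable class names.

```python
# Same class names, in the same order, as the ClassLabel feature.
names = ['Negative', 'Mixed', 'Mostly Positive', 'Positive', 'Very Positive']

# Integer id -> class name, and the inverse mapping.
id2label = dict(enumerate(names))
label2id = {name: idx for idx, name in id2label.items()}

print(label2id['Very Positive'])  # 4
print(id2label[0])                # Negative
```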

We want to fine-tune DistilBERT (`distilbert-base-uncased`) to predict the ratings, which is a classification task. We will follow parts of [this tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification).

## Loading the tokenizer and pre-trained model

In [14]:
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [17]:
def preprocess_function(examples):
    # Truncate to the model's maximum length; padding is left to the data collator.
    return tokenizer(examples["text"], truncation=True)

In [18]:
data_train = data_train.map(preprocess_function, batched=True)
data_eval = data_eval.map(preprocess_function, batched=True)

In [19]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
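
`DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is built, rather than padding the whole dataset to one fixed length. The idea can be sketched in plain Python. `pad_batch` is a hypothetical helper, the token ids are invented, and a pad id of 0 with right-side padding is an assumption.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, not real tokenizer output.
batch = [[101, 2377, 102], [101, 2377, 2004, 1996, 102]]
padded = pad_batch(batch)

print(padded["input_ids"][0])       # [101, 2377, 102, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch keeps short descriptions from being stretched to the length of the longest description in the dataset.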

In [20]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=5)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

In [21]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="distilbert-videogame-descriptions-rating",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data_train,
    eval_dataset=data_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
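
The step counts and learning rates printed in the training log can be reproduced from these arguments. The sketch below assumes a single device, no gradient accumulation, and a linear decay schedule without warmup that is logged every 500 steps. Those assumptions are consistent with the logged values, but they are not set explicitly in this notebook. `linear_lr` is an illustrative helper, not part of `transformers`.

```python
import math

n_train, n_eval = 5516, 2365   # dataset sizes from the notebook output
batch_size, epochs = 16, 5     # values passed to TrainingArguments
base_lr = 2e-5

steps_per_epoch = math.ceil(n_train / batch_size)  # 345, matches checkpoint-345
total_steps = steps_per_epoch * epochs             # 1725 optimization steps
eval_steps = math.ceil(n_eval / batch_size)        # 148 evaluation batches

def linear_lr(step):
    """Linear decay from base_lr to 0 over total_steps, with no warmup."""
    return base_lr * (1 - step / total_steps)

for step in (500, 1000, 1500):
    # Prints the step, its fractional epoch, and the scheduled learning rate.
    print(step, round(step / steps_per_epoch, 2), linear_lr(step))
```

The three printed rows line up with the log entries at epochs 1.45, 2.9, and 4.35.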

## Fine-tuning

In [22]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5516
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1725


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2365
  Batch size = 16


Saving model checkpoint to distilbert-movie-descriptions-rating/checkpoint-345
Configuration saved in distilbert-movie-descriptions-rating/checkpoint-345/config.json


{'eval_loss': 1.551745057106018, 'eval_runtime': 12.622, 'eval_samples_per_second': 187.372, 'eval_steps_per_second': 11.726, 'epoch': 1.0}


Model weights saved in distilbert-movie-descriptions-rating/checkpoint-345/pytorch_model.bin
tokenizer config file saved in distilbert-movie-descriptions-rating/checkpoint-345/tokenizer_config.json
Special tokens file saved in distilbert-movie-descriptions-rating/checkpoint-345/special_tokens_map.json


{'loss': 1.5566, 'learning_rate': 1.420289855072464e-05, 'epoch': 1.45}


***** Running Evaluation *****
  Num examples = 2365
  Batch size = 16


Saving model checkpoint to distilbert-movie-descriptions-rating/checkpoint-690
Configuration saved in distilbert-movie-descriptions-rating/checkpoint-690/config.json


{'eval_loss': 1.5596345663070679, 'eval_runtime': 12.7998, 'eval_samples_per_second': 184.769, 'eval_steps_per_second': 11.563, 'epoch': 2.0}


Model weights saved in distilbert-movie-descriptions-rating/checkpoint-690/pytorch_model.bin
tokenizer config file saved in distilbert-movie-descriptions-rating/checkpoint-690/tokenizer_config.json
Special tokens file saved in distilbert-movie-descriptions-rating/checkpoint-690/special_tokens_map.json


{'loss': 1.4186, 'learning_rate': 8.405797101449275e-06, 'epoch': 2.9}


***** Running Evaluation *****
  Num examples = 2365
  Batch size = 16


Saving model checkpoint to distilbert-movie-descriptions-rating/checkpoint-1035
Configuration saved in distilbert-movie-descriptions-rating/checkpoint-1035/config.json


{'eval_loss': 1.587784767150879, 'eval_runtime': 12.9609, 'eval_samples_per_second': 182.471, 'eval_steps_per_second': 11.419, 'epoch': 3.0}


Model weights saved in distilbert-movie-descriptions-rating/checkpoint-1035/pytorch_model.bin
tokenizer config file saved in distilbert-movie-descriptions-rating/checkpoint-1035/tokenizer_config.json
Special tokens file saved in distilbert-movie-descriptions-rating/checkpoint-1035/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2365
  Batch size = 16


Saving model checkpoint to distilbert-movie-descriptions-rating/checkpoint-1380
Configuration saved in distilbert-movie-descriptions-rating/checkpoint-1380/config.json


{'eval_loss': 1.6469284296035767, 'eval_runtime': 13.0053, 'eval_samples_per_second': 181.849, 'eval_steps_per_second': 11.38, 'epoch': 4.0}


Model weights saved in distilbert-movie-descriptions-rating/checkpoint-1380/pytorch_model.bin
tokenizer config file saved in distilbert-movie-descriptions-rating/checkpoint-1380/tokenizer_config.json
Special tokens file saved in distilbert-movie-descriptions-rating/checkpoint-1380/special_tokens_map.json


{'loss': 1.2034, 'learning_rate': 2.6086956521739132e-06, 'epoch': 4.35}


***** Running Evaluation *****
  Num examples = 2365
  Batch size = 16


Saving model checkpoint to distilbert-movie-descriptions-rating/checkpoint-1725
Configuration saved in distilbert-movie-descriptions-rating/checkpoint-1725/config.json


{'eval_loss': 1.686038851737976, 'eval_runtime': 12.9764, 'eval_samples_per_second': 182.254, 'eval_steps_per_second': 11.405, 'epoch': 5.0}


Model weights saved in distilbert-movie-descriptions-rating/checkpoint-1725/pytorch_model.bin
tokenizer config file saved in distilbert-movie-descriptions-rating/checkpoint-1725/tokenizer_config.json
Special tokens file saved in distilbert-movie-descriptions-rating/checkpoint-1725/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from distilbert-movie-descriptions-rating/checkpoint-345 (score: 1.551745057106018).


{'train_runtime': 645.5208, 'train_samples_per_second': 42.725, 'train_steps_per_second': 2.672, 'train_loss': 1.3549741529381794, 'epoch': 5.0}


TrainOutput(global_step=1725, training_loss=1.3549741529381794, metrics={'train_runtime': 645.5208, 'train_samples_per_second': 42.725, 'train_steps_per_second': 2.672, 'train_loss': 1.3549741529381794, 'epoch': 5.0})
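
The log shows the training loss falling (1.5566, 1.4186, 1.2034) while the evaluation loss rises after the first epoch. That pattern is consistent with overfitting. With `load_best_model_at_end=True` and no explicit metric, the checkpoint with the lowest evaluation loss is reloaded. The sketch below repeats that selection by hand and compares the best loss with the cross-entropy of a uniform guess over five classes, which is used here only as a rough reference point.

```python
import math

# Evaluation loss per checkpoint, copied from the training log.
eval_loss = {
    345: 1.551745057106018,
    690: 1.5596345663070679,
    1035: 1.587784767150879,
    1380: 1.6469284296035767,
    1725: 1.686038851737976,
}

# Checkpoint with the lowest evaluation loss.
best_step = min(eval_loss, key=eval_loss.get)

# Cross-entropy of predicting probability 1/5 for every class.
uniform_baseline = math.log(5)

print(best_step)                                          # 345
print(round(uniform_baseline, 4))                         # 1.6094
print(round(uniform_baseline - eval_loss[best_step], 4))  # 0.0577
```

The best checkpoint beats a uniform guess by only about 0.06 nats, which fits the expectation that the description alone carries little signal about the rating.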

In [24]:
trainer.save_model("distilbert-videogame-descriptions-rating")

Saving model checkpoint to distilbert-videogame-descriptions-rating
Configuration saved in distilbert-videogame-descriptions-rating/config.json
Model weights saved in distilbert-videogame-descriptions-rating/pytorch_model.bin
tokenizer config file saved in distilbert-videogame-descriptions-rating/tokenizer_config.json
Special tokens file saved in distilbert-videogame-descriptions-rating/special_tokens_map.json
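
The notebook does not show which part of the saved model is reused downstream. One possible option, sketched here purely as an assumption, is to turn the five output logits into class probabilities and append them as extra columns for the more robust classifier. The logits and the `text_prob_` column prefix are hypothetical. Real logits would come from running the fine-tuned model on each description.

```python
import math

names = ['Negative', 'Mixed', 'Mostly Positive', 'Positive', 'Very Positive']

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one description, not real model output.
logits = [0.1, 0.4, 0.2, 0.3, 0.9]

# One feature per rating class, ready to be joined to the tabular data.
features = {f"text_prob_{name}": p for name, p in zip(names, softmax(logits))}
print(features)
```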
