---

## Appendix 2

### Text-based regression pre-model

A game's description alone will almost certainly not be enough to predict its commercial performance (here, the `estimated_sells` column). We will train a model as if it were, and then extract part of that model to add features to a more robust downstream model.
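One possible reading of "extracting part of the model" is to reuse the fine-tuned encoder's per-token hidden states as numeric features. The sketch below shows masked mean pooling, a common way to collapse those token vectors into one vector per description. Plain Python lists stand in for the encoder output; `mean_pool`, `hidden_states`, and `attention_mask` are illustrative names, and whether the final pipeline pools this way is an assumption.

```python
from typing import List


def mean_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, ignoring padded positions (mask == 0)."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy stand-in for an encoder output: 4 token positions, hidden size 3.
hidden_states = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],  # padding position, must not influence the result
    [9.0, 9.0, 9.0],  # padding position
]
attention_mask = [1, 1, 0, 0]

features = mean_pool(hidden_states, attention_mask)
print(features)
```

The resulting vector could then be concatenated with the tabular columns of each game before training the downstream model.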

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

### Loading the dataset

In [2]:
df_train = pd.read_pickle('train.pickle')

X_train, X_eval, y_train, y_eval = train_test_split(df_train, df_train['estimated_sells'], test_size=0.3, random_state=0)
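The outputs further down report 5516 training rows and 2365 evaluation rows. As a quick sanity check, the sketch below reproduces those sizes from `test_size=0.3`. It assumes the test share is rounded up and the remainder goes to training; the total row count is derived from the two reported sizes, since the notebook never prints it.

```python
import math

n_train, n_eval = 5516, 2365   # row counts reported in the outputs
n_total = n_train + n_eval      # derived here, not printed in the notebook
test_size = 0.3

# Assumed rounding rule: test size rounded up, remainder goes to training.
expected_eval = math.ceil(test_size * n_total)
expected_train = n_total - expected_eval
print(n_total, expected_train, expected_eval)
```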

In [3]:
# Keep only the description and the target; drop every other column.
cols_to_drop = ['name', 'release_date', 'english', 'developer', 'publisher', 'platforms', 'required_age', 'categories', 'genres', 'tags', 'achievements', 'average_playtime', 'price', 'rating']
X_train = X_train.drop(columns=cols_to_drop)
X_eval = X_eval.drop(columns=cols_to_drop)

In [4]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5516 entries, 3710 to 2732
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   short_description  5516 non-null   object
 1   estimated_sells    5516 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 129.3+ KB


In [5]:
X_train['label'] = X_train['estimated_sells'].astype(float)
X_eval['label'] = X_eval['estimated_sells'].astype(float)

In [6]:
X_train = X_train.drop(columns=["estimated_sells"])
X_eval = X_eval.drop(columns=["estimated_sells"])

In [7]:
from datasets import Dataset

data_train = Dataset.from_pandas(X_train)
data_eval = Dataset.from_pandas(X_eval)

In [8]:
data_train = data_train.remove_columns('__index_level_0__')
data_eval = data_eval.remove_columns('__index_level_0__')

In [9]:
data_train = data_train.rename_column("short_description", "text")
data_eval = data_eval.rename_column("short_description", "text")

In [10]:
print(data_train[0])
data_train

{'text': 'Begin a journey in the beautiful world! Build your own town anywhere and explore the myterious lands anytime. Nothing is impossible in the open world: Be a legendary warrior and 5-star chef at the same time!', 'label': 285348.0}


Dataset({
    features: ['text', 'label'],
    num_rows: 5516
})

In [11]:
data_train.features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='float64', id=None)}

We want to fine-tune DistilBERT (`distilbert-base-uncased`) to predict `estimated_sells` from the description. This is a regression task, so the model is loaded with `num_labels=1`. We follow parts of [this tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification), which covers sequence classification.

### Loading the tokenizer and pre-trained model

In [12]:
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [13]:
def preprocess_function(examples):
    # Truncate only; DataCollatorWithPadding pads each batch dynamically later.
    return tokenizer(examples["text"], truncation=True)

In [14]:
data_train = data_train.map(preprocess_function, batched=True)
data_eval = data_eval.map(preprocess_function, batched=True)

In [15]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
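`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of to a global maximum. The sketch below illustrates that idea in plain Python; `pad_batch` and the token ids are made up for illustration and do not reproduce the library's internals.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in this batch and build the mask."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three descriptions of different lengths.
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 102]]
padded = pad_batch(batch)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Because short descriptions are only padded up to the longest one in their own batch, batches of short texts stay small.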

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier

In [17]:
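from sklearn.metrics import mean_squared_error, r2_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # The regression head returns shape (n, 1); flatten it to match the labels.
    predictions = predictions.reshape(-1)
    # Take the root explicitly so this does not depend on the `squared` argument.
    rmse = mean_squared_error(labels, predictions) ** 0.5
    r2 = r2_score(labels, predictions)
    return {"rmse": rmse, "r2": r2}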
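To make the two metrics concrete, here is a dependency-free sketch of RMSE and R² on made-up numbers (the values are illustrative, not taken from the dataset). It also shows that R² is 0 when a model always predicts the mean of the labels, and drops below 0 when it does worse than that.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - SS_res / SS_tot: share of label variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [10.0, 20.0, 30.0, 40.0]  # made-up labels, mean = 25
mean_baseline = [25.0] * 4          # always predict the label mean
always_zero = [0.0] * 4             # always predict zero

print(r2(y_true, mean_baseline))    # exactly the mean baseline
print(r2(y_true, always_zero))      # worse than the mean baseline
print(rmse(y_true, always_zero))
```

Read this way, the `eval_r2` of roughly -0.0247 reported during training means the model does slightly worse than a constant prediction equal to the mean of the evaluation labels.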

In [18]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir = "distilbert-videogames-descriptions-sells",
    learning_rate = 2e-5,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    num_train_epochs = 5,
    weight_decay = 0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

trainer = Trainer(
    model = model,
    args=training_args,
    train_dataset = data_train,
    eval_dataset = data_eval,
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### Fine-tuning

In [19]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5516
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1725
  Number of trainable parameters = 66954241


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2365
  Batch size = 16


Saving model checkpoint to distilbert-videogames-descriptions-sells/checkpoint-345
Configuration saved in distilbert-videogames-descriptions-sells/checkpoint-345/config.json


{'eval_loss': 1837031358464.0, 'eval_rmse': 1355371.375, 'eval_r2': -0.02469729931328235, 'eval_runtime': 23.7644, 'eval_samples_per_second': 99.519, 'eval_steps_per_second': 6.228, 'epoch': 1.0}


Model weights saved in distilbert-videogames-descriptions-sells/checkpoint-345/pytorch_model.bin
tokenizer config file saved in distilbert-videogames-descriptions-sells/checkpoint-345/tokenizer_config.json
Special tokens file saved in distilbert-videogames-descriptions-sells/checkpoint-345/special_tokens_map.json


{'loss': 2149345650540.544, 'learning_rate': 1.420289855072464e-05, 'epoch': 1.45}


Saving model checkpoint to distilbert-videogames-descriptions-sells/checkpoint-690
Configuration saved in distilbert-videogames-descriptions-sells/checkpoint-690/config.json


{'eval_loss': 1837022445568.0, 'eval_rmse': 1355368.0, 'eval_r2': -0.024692377468323645, 'eval_runtime': 24.3062, 'eval_samples_per_second': 97.3, 'eval_steps_per_second': 6.089, 'epoch': 2.0}


Model weights saved in distilbert-videogames-descriptions-sells/checkpoint-690/pytorch_model.bin
tokenizer config file saved in distilbert-videogames-descriptions-sells/checkpoint-690/tokenizer_config.json
Special tokens file saved in distilbert-videogames-descriptions-sells/checkpoint-690/special_tokens_map.json


{'loss': 3084424321171.456, 'learning_rate': 8.405797101449275e-06, 'epoch': 2.9}


Saving model checkpoint to distilbert-videogames-descriptions-sells/checkpoint-1035
Configuration saved in distilbert-videogames-descriptions-sells/checkpoint-1035/config.json


{'eval_loss': 1837014712320.0, 'eval_rmse': 1355365.125, 'eval_r2': -0.02468801676772281, 'eval_runtime': 24.3805, 'eval_samples_per_second': 97.004, 'eval_steps_per_second': 6.07, 'epoch': 3.0}


Model weights saved in distilbert-videogames-descriptions-sells/checkpoint-1035/pytorch_model.bin
tokenizer config file saved in distilbert-videogames-descriptions-sells/checkpoint-1035/tokenizer_config.json
Special tokens file saved in distilbert-videogames-descriptions-sells/checkpoint-1035/special_tokens_map.json
Saving model checkpoint to distilbert-videogames-descriptions-sells/checkpoint-1380
Configuration saved in distilbert-videogames-descriptions-sells/checkpoint-1380/config.json


{'eval_loss': 1837009207296.0, 'eval_rmse': 1355363.125, 'eval_r2': -0.024684929634953612, 'eval_runtime': 24.5065, 'eval_samples_per_second': 96.505, 'eval_steps_per_second': 6.039, 'epoch': 4.0}


Model weights saved in distilbert-videogames-descriptions-sells/checkpoint-1380/pytorch_model.bin
tokenizer config file saved in distilbert-videogames-descriptions-sells/checkpoint-1380/tokenizer_config.json
Special tokens file saved in distilbert-videogames-descriptions-sells/checkpoint-1380/special_tokens_map.json


{'loss': 2099406589394.944, 'learning_rate': 2.6086956521739132e-06, 'epoch': 4.35}


Saving model checkpoint to distilbert-videogames-descriptions-sells/checkpoint-1725
Configuration saved in distilbert-videogames-descriptions-sells/checkpoint-1725/config.json


{'eval_loss': 1837007372288.0, 'eval_rmse': 1355362.375, 'eval_r2': -0.02468383518142403, 'eval_runtime': 24.5431, 'eval_samples_per_second': 96.361, 'eval_steps_per_second': 6.03, 'epoch': 5.0}


Model weights saved in distilbert-videogames-descriptions-sells/checkpoint-1725/pytorch_model.bin
tokenizer config file saved in distilbert-videogames-descriptions-sells/checkpoint-1725/tokenizer_config.json
Special tokens file saved in distilbert-videogames-descriptions-sells/checkpoint-1725/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from distilbert-videogames-descriptions-sells/checkpoint-1725 (score: 1837007372288.0).


{'train_runtime': 598.1181, 'train_samples_per_second': 46.111, 'train_steps_per_second': 2.884, 'train_loss': 2548196899534.284, 'epoch': 5.0}


TrainOutput(global_step=1725, training_loss=2548196899534.284, metrics={'train_runtime': 598.1181, 'train_samples_per_second': 46.111, 'train_steps_per_second': 2.884, 'train_loss': 2548196899534.284, 'epoch': 5.0})
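Several numbers in the training log can be cross-checked by hand. The sketch below assumes the `Trainer` defaults that were not overridden in `TrainingArguments`: a linear learning-rate decay with no warmup, and logging every 500 steps. Under those assumptions it reproduces the step counts, the logged learning rates, and the relation between `eval_loss` (a mean squared error) and `eval_rmse`.

```python
import math

train_examples, eval_examples = 5516, 2365  # from the log
batch_size, epochs, base_lr = 16, 5, 2e-5   # from TrainingArguments

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(eval_examples / batch_size)
print(steps_per_epoch, total_steps, eval_batches)


def linear_lr(step: int) -> float:
    """Linear decay from base_lr to 0 over total_steps (assumed: no warmup)."""
    return base_lr * (total_steps - step) / total_steps


for step in (500, 1000, 1500):  # assumed logging interval of 500 steps
    print(step, round(step / steps_per_epoch, 2), linear_lr(step))

# eval_loss is the MSE, so its square root should match eval_rmse
# up to float32 rounding.
eval_loss, eval_rmse = 1837031358464.0, 1355371.375  # epoch 1 values from the log
print(math.sqrt(eval_loss), eval_rmse)
```

The evaluation loss barely moves across the five epochs (from about 1.837031e12 to 1.837007e12), so the RMSE stays near 1.36 million sales throughout.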

In [20]:
trainer.save_model("distilbert-videogames-descriptions-sells")

Saving model checkpoint to distilbert-videogames-descriptions-sells
Configuration saved in distilbert-videogames-descriptions-sells/config.json
Model weights saved in distilbert-videogames-descriptions-sells/pytorch_model.bin
tokenizer config file saved in distilbert-videogames-descriptions-sells/tokenizer_config.json
Special tokens file saved in distilbert-videogames-descriptions-sells/special_tokens_map.json
