# Project Part 3

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/entrylevelcs/CS39AA-Project/blob/main/project_part3.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/entrylevelcs/CS39AA-Project/blob/main/project_part3.ipynb)


## 1. Introduction/Background

For this part of the project, we fine-tune the pretrained BERT base model (`bert-base-cased`) on each of our two data sets to see how accurate we can make its predictions.

## 2. Using pretrained models to improve accuracy

In [1]:
# import all of the python modules/packages you'll need here
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification,  TrainingArguments, Trainer
from datasets import Dataset, load_metric
import os
import wandb
import random
# ...



Set up Weights & Biases (wandb) to track the project. Tracking is disabled by default in this notebook.

In [2]:
"""
wandb.login()
os.environ["WANDB_DISABLED"] = "false"
os.environ["WANDB_LOG_MODEL"] = "true"
os.environ["WANDB_PROJECT"] = "real_data"
os.environ["WANDB_NOTEBOOK_NAME"] = "project_part3.ipynb"
"""
# To enable tracking, remove the triple quotes around the block above
# and delete the line below.
os.environ["WANDB_DISABLED"] = "true"

Load the data set of real Steam reviews. The data used in this notebook is a sample of 25,000 reviews from the full set at https://www.kaggle.com/datasets/andrewmvd/steam-reviews/.

In [3]:
human_data = 'https://raw.githubusercontent.com/entrylevelcs/CS39AA-Project/main/human_dataset.csv'
df = pd.read_csv(human_data)
df = df[df["review_text"].notnull()]

Load the data set that was generated by ChatGPT. These reviews were originally all about CS:GO, but the set has since been expanded to cover other games as well.

In [4]:
ai_data = 'https://raw.githubusercontent.com/entrylevelcs/CS39AA-Project/main/gpt3.5_generated_data.csv'
df2 = pd.read_csv(ai_data)

We need to reformat the data into a form the model can use. In particular, we need to remap the labels, because the model expects class indices starting at 0 and treats -1 as an invalid label. We keep the "recommended" label as 1 and replace the "not recommended" label, -1, with 0.

In [5]:
df = df.rename(columns={"review_text": "text", "review_score": "label"})
df2 = df2.rename(columns={"Review": "text", " Sentiment": "label"})
class_tok2idx = dict({1: 1, -1: 0})
df['label'] = df['label'].apply(lambda x: class_tok2idx[x])
df2['label'] = df2['label'].apply(lambda x: class_tok2idx[x])
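
The remapping above can be checked on a toy column before it is applied to the full data. The following is a minimal sketch, not part of the original notebook: `toy` and its values are made up for illustration, and `Series.map` is swapped in for the `apply`/`lambda` used above. One difference worth knowing is that `map` turns any label missing from the dictionary into `NaN` instead of raising a `KeyError`, so the sketch checks for that explicitly.

```python
import pandas as pd

# Same mapping as in the notebook: 1 (recommended) stays 1, -1 (not recommended) becomes 0.
class_tok2idx = {1: 1, -1: 0}

# Hypothetical toy labels, for illustration only.
toy = pd.DataFrame({"label": [1, -1, -1, 1]})

mapped = toy["label"].map(class_tok2idx)

# Any label that is not a key of the mapping would show up as NaN here.
assert mapped.notna().all(), "found a label outside {1, -1}"

toy["label"] = mapped.astype(int)
print(toy["label"].tolist())  # prints [1, 0, 0, 1]
```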

We compute the sizes of the two data sets so that both models are trained on the same amount of data. `proportion` is the fraction of the real data set that matches the size of the generated one.

In [6]:
ai_data_size = len(df2)
sample_size = len(df)
proportion = 1 - ((sample_size-ai_data_size)/sample_size)
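
The expression for `proportion` simplifies to `ai_data_size / sample_size`. As a hedged sketch (the sizes below are taken from the `Map:` progress output in this notebook, and `train_size` and `eval_size` are names introduced here for illustration), the same split point can also be computed directly in integers, which avoids relying on `int()` truncating a floating-point product:

```python
# Row counts as reported by the tokenization step in this notebook.
sample_size = 24971   # real reviews after dropping null text
ai_data_size = 2496   # generated reviews

# The notebook's formula, and its algebraic simplification.
proportion = 1 - ((sample_size - ai_data_size) / sample_size)
simplified = ai_data_size / sample_size
assert abs(proportion - simplified) < 1e-12

# Integer alternative (an assumption, not the notebook's code): use the smaller
# data set's size as the number of training rows.
train_size = min(sample_size, ai_data_size)
eval_size = sample_size - train_size
print(train_size, eval_size)  # prints 2496 22475
```

Because `int()` truncates, `int(sample_size * proportion)` can land one row below `ai_data_size` when the floating-point product falls just short of a whole number.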

Initialize our tokenizer and models.

In [7]:
real_data_raw = Dataset.from_pandas(df[['label', 'text']])
simulated_data_raw = Dataset.from_pandas(df2)
MODEL_NAME = 'bert-base-cased'
MAX_LENGTH = 55
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens({'pad_token': '<pad>'})
realModel = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2, max_length=MAX_LENGTH, output_attentions=False, output_hidden_states=False)
simulatedModel = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2, max_length=MAX_LENGTH, output_attentions=False, output_hidden_states=False)
realModel.resize_token_embeddings(len(tokenizer))
simulatedModel.resize_token_embeddings(len(tokenizer))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 28997. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: http

Embedding(28997, 768)

Tokenize our data sets so that they can be used in the models.

In [8]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True, max_length=MAX_LENGTH)

real_data = real_data_raw.map(tokenize_function, batched=True)
simulated_data = simulated_data_raw.map(tokenize_function, batched=True)

Map:   0%|          | 0/24971 [00:00<?, ? examples/s]

Map:   0%|          | 0/2496 [00:00<?, ? examples/s]
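
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and special tokens such as `[CLS]` and `[SEP]` are ignored. It only illustrates that every example ends up with exactly `max_length` ids plus an attention mask marking the real tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                      # truncation
    n_pad = max_length - len(kept)                     # how much padding is needed
    input_ids = kept + [pad_id] * n_pad                # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad     # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)  # prints [7, 8, 9, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # prints [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

With `MAX_LENGTH = 55`, any review longer than 55 tokens loses everything past the cutoff, so the models only see the beginning of long reviews.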

Load an additional sample from the full data set so that we can use both models to make predictions on it.

In [9]:
pdata = 'https://raw.githubusercontent.com/entrylevelcs/CS39AA-Project/main/extra.csv'
df3 = pd.read_csv(pdata)
df3 = df3[df3["review_text"].notnull()]
df3 = df3.rename(columns={"review_text": "text", "review_score": "label"})
df3['label'] = df3['label'].apply(lambda x: class_tok2idx[x])
predict_data_raw = Dataset.from_pandas(df3[['label', 'text']])
predict_data = predict_data_raw.map(tokenize_function, batched=True)




Map:   0%|          | 0/12500 [00:00<?, ? examples/s]

Split the real data into a training set and an evaluation set. Both models are evaluated on `real_data_eval`, whether they were trained on real or simulated data. The split is sized so that the real training set matches the size of the simulated training set (up to rounding).

In [10]:
train_prop = proportion
real_data_train = real_data.select(range(int(len(real_data)*train_prop)))
real_data_eval = real_data.select(range(int(len(real_data)*train_prop), len(real_data)))
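
The split above takes the first rows of the data set for training and the rest for evaluation, which assumes the row order of the CSV carries no signal. If that assumption is in doubt, the indices can be shuffled first. The following is only a sketch: the seed and the names `train_idx` and `eval_idx` are introduced here, and the row counts are the ones reported elsewhere in this notebook.

```python
import numpy as np

n_rows = 24971      # size of the real data set reported in this notebook
train_size = 2496   # size of the generated data set

rng = np.random.default_rng(seed=0)   # seed is arbitrary, chosen for repeatability
shuffled = rng.permutation(n_rows)

train_idx = shuffled[:train_size]
eval_idx = shuffled[train_size:]

# In the notebook these could be passed to real_data.select(train_idx) and
# real_data.select(eval_idx); that step is left out so the sketch runs on its own.
print(len(train_idx), len(eval_idx))  # prints 2496 22475
```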

Set up the model that is trained on the real data, along with the hyperparameter values to search over. I was unable to get the wandb sweep to work, so instead I reran the notebook eight times with randomly chosen values in order to "optimize" the hyperparameters. The best values found for the real model were batchSize = 32, learningRate = 1e-4, and weightDecay = 0.

In [11]:
"""
batchSize = [8, 32, 64, 128, 256]
learningRate = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
weightDecay = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
batchSize = batchSize[random.randint(0,4)]
learningRate = learningRate[random.randint(0,4)]
weightDecay = weightDecay[random.randint(0,6)]
"""
# remove the triple quotes above to sample random hyperparameter values
batchSize = 32
learningRate = 1e-4
weightDecay = 0.0

def compute_metrics(eval_pred):
    metrics = dict()

    accuracy_metric = load_metric('accuracy')
    precision_metric = load_metric('precision')
    recall_metric = load_metric('recall')
    f1_metric = load_metric('f1')

    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    
    metrics.update(accuracy_metric.compute(predictions=preds, references=labels))
    metrics.update(precision_metric.compute(predictions=preds, references=labels, average='weighted'))
    metrics.update(recall_metric.compute(predictions=preds, references=labels, average='weighted'))
    metrics.update(f1_metric.compute(predictions=preds, references=labels, average='weighted'))
    
    return metrics

real_training_args = TrainingArguments(num_train_epochs=3,
                                  do_train=True,
                                  report_to=None, #wandb #for tracking
                                  output_dir="real",
                                  evaluation_strategy="epoch",
                                  eval_steps=78,
                                  learning_rate=learningRate,
                                  weight_decay=weightDecay,
                                  per_device_train_batch_size=batchSize,
                                  per_device_eval_batch_size=32)

real_trainer = Trainer(model = realModel, 
                  args = real_training_args,
                  train_dataset = real_data_train, 
                  eval_dataset = real_data_eval,
                  compute_metrics = compute_metrics,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
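
The `average='weighted'` setting in `compute_metrics` computes each metric per class and then averages the per-class values, weighted by how many true examples each class has. The sketch below reimplements that idea in NumPy on made-up predictions; `weighted_scores` is a hypothetical helper, not the `load_metric` implementation. One consequence it illustrates is that weighted recall always equals accuracy, which is why `eval_recall` and `eval_accuracy` are identical in the results in this notebook.

```python
import numpy as np

def weighted_scores(preds, labels, num_labels=2):
    """Accuracy plus support-weighted precision, recall and F1."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    scores = {"accuracy": float((preds == labels).mean()),
              "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for c in range(num_labels):
        tp = int(np.sum((preds == c) & (labels == c)))   # correct predictions of class c
        n_pred = int(np.sum(preds == c))                 # times class c was predicted
        n_true = int(np.sum(labels == c))                # true examples of class c (support)
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_true if n_true else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n_true / len(labels)                    # weight by support
        scores["precision"] += weight * p
        scores["recall"] += weight * r
        scores["f1"] += weight * f
    return scores

# Made-up labels and predictions, for illustration only.
toy_labels = [1, 1, 1, 0, 1, 0, 1, 1]
toy_preds = [1, 1, 1, 1, 1, 0, 1, 0]
toy_scores = weighted_scores(toy_preds, toy_labels)
print(toy_scores)
```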


Print out the different hyperparameters so I can log them.

In [12]:
print(batchSize)
print(learningRate)
print(weightDecay)

32
0.0001
0.0
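
The manual search described above amounts to a small random search. As a sketch of how such trials could be drawn reproducibly and kept together for logging (the `sample_trial` helper and the seed are assumptions introduced here; the candidate values are the ones from the commented-out block), `random.choice` avoids hard-coding list indices:

```python
import random

# Candidate values copied from the commented-out block in this notebook.
BATCH_SIZES = [8, 32, 64, 128, 256]
LEARNING_RATES = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
WEIGHT_DECAYS = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

def sample_trial(rng):
    """Draw one hyperparameter combination uniformly at random."""
    return {
        "batchSize": rng.choice(BATCH_SIZES),
        "learningRate": rng.choice(LEARNING_RATES),
        "weightDecay": rng.choice(WEIGHT_DECAYS),
    }

rng = random.Random(0)  # arbitrary seed so the same trials can be redrawn
trials = [sample_trial(rng) for _ in range(8)]  # eight trials, as in the text
for trial in trials:
    print(trial)
```

Each printed dictionary could be written to a log next to the metrics of its run, so that every trial stays traceable to its settings.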


Set up the model that is trained on the simulated data. The best hyperparameter values found for it are set in the code block below.

In [13]:
batchSize = 32
learningRate = 1e-5
weightDecay = 0.1

sim_training_args = TrainingArguments(num_train_epochs=3,
                                  do_train=True,
                                  report_to=None,
                                  output_dir="simulated",
                                  evaluation_strategy="epoch",
                                  eval_steps=78,
                                  learning_rate=learningRate,
                                  weight_decay=weightDecay,
                                  per_device_train_batch_size=batchSize,
                                  per_device_eval_batch_size=32)

sim_trainer = Trainer(model = simulatedModel, 
                  args = sim_training_args,
                  train_dataset = simulated_data, 
                  eval_dataset = real_data_eval,
                  compute_metrics = compute_metrics,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [14]:
if torch.cuda.is_available():
    device = "cuda:0"
    print("Using GPU")
else: 
    device = "cpu"


Using GPU


Train our real data model.

In [15]:
realModel.to(device)
torch.set_grad_enabled(True)
real_trainer.train()
real_trainer.evaluate()

  0%|          | 0/234 [00:00<?, ?it/s]

  0%|          | 0/703 [00:00<?, ?it/s]



{'eval_loss': 0.36119842529296875, 'eval_accuracy': 0.8494838939313045, 'eval_precision': 0.8335323612613328, 'eval_recall': 0.8494838939313045, 'eval_f1': 0.8221807845141444, 'eval_runtime': 52.1292, 'eval_samples_per_second': 431.16, 'eval_steps_per_second': 13.486, 'epoch': 1.0}


  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.368147075176239, 'eval_accuracy': 0.8558907278875245, 'eval_precision': 0.8419890690399171, 'eval_recall': 0.8558907278875245, 'eval_f1': 0.8438892166379125, 'eval_runtime': 50.9698, 'eval_samples_per_second': 440.967, 'eval_steps_per_second': 13.792, 'epoch': 2.0}


  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.48628753423690796, 'eval_accuracy': 0.8546004627157857, 'eval_precision': 0.8417203637280933, 'eval_recall': 0.8546004627157857, 'eval_f1': 0.8446118211622624, 'eval_runtime': 51.4348, 'eval_samples_per_second': 436.981, 'eval_steps_per_second': 13.668, 'epoch': 3.0}
{'train_runtime': 208.185, 'train_samples_per_second': 35.954, 'train_steps_per_second': 1.124, 'train_loss': 0.28023388854458803, 'epoch': 3.0}


  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.48628753423690796,
 'eval_accuracy': 0.8546004627157857,
 'eval_precision': 0.8417203637280933,
 'eval_recall': 0.8546004627157857,
 'eval_f1': 0.8446118211622624,
 'eval_runtime': 51.1386,
 'eval_samples_per_second': 439.511,
 'eval_steps_per_second': 13.747,
 'epoch': 3.0}

Train our simulated data model.

In [16]:
simulatedModel.to(device)
sim_trainer.train()
sim_trainer.evaluate()

  0%|          | 0/234 [00:00<?, ?it/s]

  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.507163941860199, 'eval_accuracy': 0.7797650827549386, 'eval_precision': 0.8049571808803667, 'eval_recall': 0.7797650827549386, 'eval_f1': 0.7901366219028239, 'eval_runtime': 51.1279, 'eval_samples_per_second': 439.604, 'eval_steps_per_second': 13.75, 'epoch': 1.0}


  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.7969406843185425, 'eval_accuracy': 0.7334935041822388, 'eval_precision': 0.8066929532056002, 'eval_recall': 0.7334935041822388, 'eval_f1': 0.757704095772947, 'eval_runtime': 50.998, 'eval_samples_per_second': 440.723, 'eval_steps_per_second': 13.785, 'epoch': 2.0}


  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.9118539094924927, 'eval_accuracy': 0.7158302189001602, 'eval_precision': 0.8046701901215098, 'eval_recall': 0.7158302189001602, 'eval_f1': 0.7438469931408687, 'eval_runtime': 52.3374, 'eval_samples_per_second': 429.444, 'eval_steps_per_second': 13.432, 'epoch': 3.0}
{'train_runtime': 207.1823, 'train_samples_per_second': 36.142, 'train_steps_per_second': 1.129, 'train_loss': 0.15104535094693175, 'epoch': 3.0}


  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.9118539094924927,
 'eval_accuracy': 0.7158302189001602,
 'eval_precision': 0.8046701901215098,
 'eval_recall': 0.7158302189001602,
 'eval_f1': 0.7438469931408687,
 'eval_runtime': 51.4571,
 'eval_samples_per_second': 436.791,
 'eval_steps_per_second': 13.662,
 'epoch': 3.0}

Use the real model, trained with the best hyperparameters found, to predict the extra data.

In [17]:
real_trainer.predict(predict_data)

  0%|          | 0/391 [00:00<?, ?it/s]

PredictionOutput(predictions=array([[-0.65022147,  0.78339416],
       [-2.6378212 ,  3.0397048 ],
       [-1.5313379 ,  1.2616408 ],
       ...,
       [-3.0483782 ,  3.4318466 ],
       [-0.65022135,  0.78339404],
       [-2.6929402 ,  3.0046084 ]], dtype=float32), label_ids=array([0, 1, 0, ..., 1, 0, 1], dtype=int64), metrics={'test_loss': 0.4689282178878784, 'test_accuracy': 0.85944, 'test_precision': 0.8469438366844293, 'test_recall': 0.85944, 'test_f1': 0.8496350014566572, 'test_runtime': 30.63, 'test_samples_per_second': 408.096, 'test_steps_per_second': 12.765})
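
The `predictions` array above holds raw logits, one column per class, not probabilities or labels. Here is a minimal sketch of turning them into predicted labels and probabilities. It uses only the six rows visible in the truncated output, so the agreement count it prints says nothing about the full data set.

```python
import numpy as np

# The six logit rows and labels visible in the truncated output above.
logits = np.array([
    [-0.65022147, 0.78339416],
    [-2.6378212, 3.0397048],
    [-1.5313379, 1.2616408],
    [-3.0483782, 3.4318466],
    [-0.65022135, 0.78339404],
    [-2.6929402, 3.0046084],
])
labels = np.array([0, 1, 0, 1, 0, 1])

# Predicted class = index of the larger logit (same rule as in compute_metrics).
preds = np.argmax(logits, axis=-1)

# Softmax turns logits into probabilities; subtracting the row max keeps exp() stable.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

n_match = int((preds == labels).sum())
print(preds)                    # prints [1 1 1 1 1 1]
print(probs[:, 1].round(3))     # probability assigned to class 1 (recommended)
print(n_match, "of", len(labels), "visible rows match")  # prints 3 of 6 visible rows match
```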

Use the simulated model, trained with the best hyperparameters found, to predict the extra data.

In [18]:
sim_trainer.predict(predict_data)

  0%|          | 0/391 [00:00<?, ?it/s]

PredictionOutput(predictions=array([[-1.9441288,  0.7717161],
       [-3.2725382,  2.6734142],
       [ 1.6970272, -1.7514855],
       ...,
       [-1.7533967,  1.0316035],
       [-1.9441302,  0.7717168],
       [ 2.0968409, -1.9795   ]], dtype=float32), label_ids=array([0, 1, 0, ..., 1, 0, 1], dtype=int64), metrics={'test_loss': 0.9185799956321716, 'test_accuracy': 0.70984, 'test_precision': 0.8035024448038779, 'test_recall': 0.70984, 'test_f1': 0.7395756388256645, 'test_runtime': 31.5503, 'test_samples_per_second': 396.193, 'test_steps_per_second': 12.393})

For the full list of random hyperparameter trials, see hyper_variants.txt in this GitHub repository.