# Project Part 3

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/entrylevelcs/CS39AA-Project/blob/main/project_part3.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/entrylevelcs/CS39AA-Project/blob/main/project_part3.ipynb)


## 1. Introduction/Background

For this part of the project, we fine-tune a pretrained BERT model (`bert-base-cased`) on each of our two data sets, one of real Steam reviews and one of ChatGPT-generated reviews, to see how accurate we can get the predictions to be.

## 2. Using pretrained models to improve accuracy

In [1]:
# import all of the python modules/packages you'll need here
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification,  TrainingArguments, Trainer
from datasets import Dataset, load_metric
import os
os.environ["WANDB_DISABLED"] = "true"
# ...



Load the data set of real Steam reviews. The data used in this notebook comes from https://www.kaggle.com/datasets/andrewmvd/steam-reviews/, but it is only a sample of 25,000 reviews from the entire set.

In [2]:
human_data = 'https://raw.githubusercontent.com/entrylevelcs/CS39AA-Project/main/human_dataset.csv'
df = pd.read_csv(human_data)
df = df[df["review_text"].notnull()]

Load the data set that was generated by ChatGPT. This review data started out covering only CS:GO but has since been expanded to be more general and to cover other games.

In [3]:
ai_data = 'https://raw.githubusercontent.com/entrylevelcs/CS39AA-Project/main/gpt3.5_generated_data.csv'
df2 = pd.read_csv(ai_data)

We need to reformat the data into a form the model can use. In particular, we need to remap the labels, because the model expects class indices starting at 0 and treats "-1" as an invalid label. We keep the "recommended" label, 1, as 1 and replace the "not recommended" label, -1, with 0.

In [4]:
# Rename the columns to the names used in the rest of the notebook
df = df.rename(columns={"review_text": "text", "review_score": "label"})
df2 = df2.rename(columns={"Review": "text", " Sentiment": "label"})
# Map raw scores to class indices: recommended (1) -> 1, not recommended (-1) -> 0
class_tok2idx = {1: 1, -1: 0}
df['label'] = df['label'].apply(lambda x: class_tok2idx[x])
df2['label'] = df2['label'].apply(lambda x: class_tok2idx[x])
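
The remapping above matters because a two-class classification head expects targets 0 and 1. The sketch below shows the same mapping with an explicit check for unexpected values. `remap_labels` and `LABEL_MAP` are hypothetical names used only for illustration; they are not part of the notebook.

```python
# Hypothetical helper (not from the notebook): remap raw review scores to class indices.
LABEL_MAP = {1: 1, -1: 0}  # recommended -> 1, not recommended -> 0

def remap_labels(raw_labels, mapping=LABEL_MAP, num_labels=2):
    """Return class indices, failing loudly on any unexpected raw label."""
    remapped = []
    for raw in raw_labels:
        if raw not in mapping:
            raise ValueError(f"unexpected label: {raw!r}")
        remapped.append(mapping[raw])
    # Every index must lie in [0, num_labels) to be a valid classification target.
    assert all(0 <= idx < num_labels for idx in remapped)
    return remapped

example = remap_labels([1, -1, -1, 1])
print(example)
```

Failing on an unknown label, rather than silently passing it through, makes a stray value in either CSV visible before training starts.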

We compute the sizes of the two data sets, and the proportion of the real data that matches the size of the simulated data, so that both models are trained on the same amount of data.

In [5]:
ai_data_size = len(df2)
sample_size = len(df)
proportion = 1 - ((sample_size-ai_data_size)/sample_size)
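
As a worked example of the arithmetic above, the sketch below plugs in the row counts that appear in this notebook's own `Map` output (24971 real rows, 2496 simulated rows). It also illustrates that truncating a float product with `int()` can land one below the intended index. Using the integer size directly as the split index is an assumption about what one might prefer, not what the notebook does.

```python
# Sizes taken from the notebook's own output; the variable names mirror the cell above.
sample_size = 24971   # rows in the real data set
ai_data_size = 2496   # rows in the simulated data set

# The notebook's formula simplifies algebraically to ai_data_size / sample_size.
proportion = 1 - ((sample_size - ai_data_size) / sample_size)

# Truncating a float product can land one below the intended index.
split_from_float = int(sample_size * proportion)

# Assumption: using the integer size directly as the split index avoids rounding.
split_direct = ai_data_size

print(proportion, split_from_float, split_direct)
```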

Initialize our tokenizer and models.

In [6]:
real_data_raw = Dataset.from_pandas(df[['label', 'text']])
simulated_data_raw = Dataset.from_pandas(df2)
MODEL_NAME = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens({'pad_token': '<pad>'})
realModel = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2, max_length=55, output_attentions=False, output_hidden_states=False)
simulatedModel = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2, max_length=55, output_attentions=False, output_hidden_states=False)
realModel.resize_token_embeddings(len(tokenizer))
simulatedModel.resize_token_embeddings(len(tokenizer))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 28997. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: http

Embedding(28997, 768)

Tokenize our data sets so that they can be used in the models.

In [7]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True, max_length=55)

real_data = real_data_raw.map(tokenize_function, batched=True)
simulated_data = simulated_data_raw.map(tokenize_function, batched=True)

Map:   0%|          | 0/24971 [00:00<?, ? examples/s]

Map:   0%|          | 0/2496 [00:00<?, ? examples/s]
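
To make the effect of `padding='max_length'`, `truncation=True`, and `max_length=55` concrete, here is a toy sketch that works on plain lists of integers. It is not the real tokenizer: `pad_or_truncate` and `pad_id=0` are hypothetical, and the real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length=55, pad_id=0):
    """Toy illustration of padding='max_length' with truncation=True."""
    kept = token_ids[:max_length]                      # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad                # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad     # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = pad_or_truncate(list(range(1, 11)))    # 10 ids -> padded to 55
long = pad_or_truncate(list(range(1, 101)))    # 100 ids -> cut to 55
print(len(short["input_ids"]), sum(short["attention_mask"]))
print(len(long["input_ids"]), sum(long["attention_mask"]))
```

Every example ends up with exactly 55 positions, so reviews longer than that lose their tail and shorter reviews are filled with padding that the attention mask tells the model to ignore.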

Load an extra data set, sampled from the full set, that both models will be asked to predict.

In [8]:
pdata = 'https://raw.githubusercontent.com/entrylevelcs/CS39AA-Project/main/extra.csv'
df3 = pd.read_csv(pdata)
df3 = df3[df3["review_text"].notnull()]
df3 = df3.rename(columns={"review_text": "text", "review_score": "label"})
df3['label'] = df3['label'].apply(lambda x: class_tok2idx[x])
predict_data_raw = Dataset.from_pandas(df3[['label', 'text']])
predict_data = predict_data_raw.map(tokenize_function, batched=True)




Map:   0%|          | 0/12500 [00:00<?, ? examples/s]

Split the real data into a training set and an evaluation set. Both models are evaluated on the same `real_data_eval` set. The split is chosen so that the real training set matches the simulated training set in size, up to rounding of the split index.

In [9]:
train_prop = proportion
real_data_train = real_data.select(range(int(len(real_data)*train_prop)))
real_data_eval = real_data.select(range(int(len(real_data)*train_prop), len(real_data)))

Set up our real model that is trained on the real data.

In [10]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

real_training_args = TrainingArguments(num_train_epochs=5,
                                  do_train=True,
                                  report_to="none",
                                  output_dir="real",
                                  evaluation_strategy="steps",
                                  eval_steps=78,
                                  learning_rate=1e-5,
                                  per_device_train_batch_size=32,
                                  per_device_eval_batch_size=32)

real_trainer = Trainer(model = realModel, 
                  args = real_training_args,
                  train_dataset = real_data_train, 
                  eval_dataset = real_data_eval,
                  compute_metrics = compute_metrics,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
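
For reference, the accuracy metric used above can also be computed with NumPy alone. This is a minimal sketch that follows the same `(logits, labels)` contract as `compute_metrics`; `compute_accuracy` is a hypothetical name and the toy batch is made up for illustration.

```python
import numpy as np

def compute_accuracy(eval_pred):
    """Same contract as compute_metrics above: (logits, labels) -> {'accuracy': float}."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # pick the higher-scoring class per row
    return {"accuracy": float(np.mean(predictions == labels))}

# Tiny made-up batch: three examples, two classes.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2]])
toy_labels = np.array([0, 1, 0])
result = compute_accuracy((toy_logits, toy_labels))
print(result)
```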


Set up our simulated model that is trained on simulated data.

In [11]:
sim_training_args = TrainingArguments(num_train_epochs=5,
                                  do_train=True,
                                  report_to="none",
                                  output_dir="simulated",
                                  evaluation_strategy="steps",
                                  eval_steps=78,
                                  learning_rate=1e-5,
                                  per_device_train_batch_size=32,
                                  per_device_eval_batch_size=32)

sim_trainer = Trainer(model = simulatedModel, 
                  args = sim_training_args,
                  train_dataset = simulated_data, 
                  eval_dataset = real_data_eval,
                  compute_metrics = compute_metrics,
)



In [12]:
if torch.cuda.is_available():
    device = "cuda:0"
    print("Using GPU")
else: 
    device = "cpu"


Using GPU


Train our real data model.

In [13]:
realModel.to(device)
torch.set_grad_enabled(True)
real_trainer.train()
real_trainer.evaluate()

  0%|          | 0/390 [00:00<?, ?it/s]

  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.45092934370040894, 'eval_accuracy': 0.8182060864922585, 'eval_runtime': 46.7067, 'eval_samples_per_second': 481.216, 'eval_steps_per_second': 15.051, 'epoch': 1.0}


{'eval_loss': 0.3722735345363617, 'eval_accuracy': 0.8182060864922585, 'eval_runtime': 47.316, 'eval_samples_per_second': 475.019, 'eval_steps_per_second': 14.858, 'epoch': 2.0}


{'eval_loss': 0.3703669309616089, 'eval_accuracy': 0.8450347036839295, 'eval_runtime': 47.415, 'eval_samples_per_second': 474.027, 'eval_steps_per_second': 14.827, 'epoch': 3.0}


{'eval_loss': 0.3710778057575226, 'eval_accuracy': 0.8437889304146645, 'eval_runtime': 47.8575, 'eval_samples_per_second': 469.644, 'eval_steps_per_second': 14.689, 'epoch': 4.0}


{'eval_loss': 0.37922489643096924, 'eval_accuracy': 0.8480156611496708, 'eval_runtime': 47.3053, 'eval_samples_per_second': 475.127, 'eval_steps_per_second': 14.861, 'epoch': 5.0}
{'train_runtime': 322.3981, 'train_samples_per_second': 38.694, 'train_steps_per_second': 1.21, 'train_loss': 0.3390509972205529, 'epoch': 5.0}


{'eval_loss': 0.37922489643096924,
 'eval_accuracy': 0.8480156611496708,
 'eval_runtime': 46.5277,
 'eval_samples_per_second': 483.067,
 'eval_steps_per_second': 15.109,
 'epoch': 5.0}
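
The first two evaluations above report exactly the same accuracy. That is the pattern one would see if the model were predicting a single class for every review at that point, though this is only a guess from the logs. One way to check is to compare against a majority-class baseline. In the sketch below, `majority_baseline` is a hypothetical helper and the labels are made up; in the notebook one might pass `real_data_eval["label"]` instead.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Made-up labels for illustration only.
toy_labels = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1]
label, baseline = majority_baseline(toy_labels)
print(label, baseline)
```

If the baseline on the real evaluation set turned out to be close to the reported accuracy, the later epochs would be the ones showing what the model actually learned.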

Use the real model to predict the extra data.

In [14]:
real_trainer.predict(predict_data)

  0%|          | 0/391 [00:00<?, ?it/s]

PredictionOutput(predictions=array([[-0.3687703,  1.0426805],
       [-2.1956954,  2.237417 ],
       [-1.1692294,  1.6718305],
       ...,
       [-2.2064612,  2.3959916],
       [-0.36877  ,  1.0426807],
       [-2.2191145,  2.2227476]], dtype=float32), label_ids=array([0, 1, 0, ..., 1, 0, 1], dtype=int64), metrics={'test_loss': 0.3737613260746002, 'test_accuracy': 0.852, 'test_runtime': 26.4744, 'test_samples_per_second': 472.154, 'test_steps_per_second': 14.769})
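
The `predictions` array above holds raw logits, one pair per review. As a small sketch of how to read them, the code below applies a softmax to the first three rows printed above and takes the more probable class, then compares it with the first three `label_ids`.

```python
import numpy as np

# First three logit rows and labels as printed in the output above.
logits = np.array([[-0.3687703, 1.0426805],
                   [-2.1956954, 2.237417],
                   [-1.1692294, 1.6718305]])
labels = np.array([0, 1, 0])

# Numerically stable softmax: subtract the row maximum before exponentiating.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

predicted = probs.argmax(axis=-1)   # index of the more probable class per review
print(predicted, predicted == labels)
```

For these three rows the second logit is the larger one every time, so all three are predicted as class 1 and only the middle one matches its label.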

Train our simulated data model.

In [15]:
simulatedModel.to(device)
sim_trainer.train()
sim_trainer.evaluate()

  0%|          | 0/390 [00:00<?, ?it/s]

  0%|          | 0/703 [00:00<?, ?it/s]

{'eval_loss': 0.5354476571083069, 'eval_accuracy': 0.769398469478555, 'eval_runtime': 46.7639, 'eval_samples_per_second': 480.627, 'eval_steps_per_second': 15.033, 'epoch': 1.0}


{'eval_loss': 0.8680912256240845, 'eval_accuracy': 0.7299786438868126, 'eval_runtime': 47.6691, 'eval_samples_per_second': 471.5, 'eval_steps_per_second': 14.747, 'epoch': 2.0}


{'eval_loss': 1.1738061904907227, 'eval_accuracy': 0.6869994660971703, 'eval_runtime': 47.2405, 'eval_samples_per_second': 475.778, 'eval_steps_per_second': 14.881, 'epoch': 3.0}


{'eval_loss': 1.2126156091690063, 'eval_accuracy': 0.7019042534258765, 'eval_runtime': 47.047, 'eval_samples_per_second': 477.735, 'eval_steps_per_second': 14.942, 'epoch': 4.0}


{'eval_loss': 1.212938904762268, 'eval_accuracy': 0.7064869193806728, 'eval_runtime': 46.7173, 'eval_samples_per_second': 481.107, 'eval_steps_per_second': 15.048, 'epoch': 5.0}
{'train_runtime': 320.8086, 'train_samples_per_second': 38.902, 'train_steps_per_second': 1.216, 'train_loss': 0.1017233628493089, 'epoch': 5.0}


{'eval_loss': 1.212938904762268,
 'eval_accuracy': 0.7064869193806728,
 'eval_runtime': 46.4928,
 'eval_samples_per_second': 483.43,
 'eval_steps_per_second': 15.121,
 'epoch': 5.0}
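
For the simulated model, the evaluation loss rises after the first epoch while the training loss ends much lower, so the final checkpoint is not necessarily the best one. The sketch below picks the epoch with the lowest evaluation loss from the values logged in this notebook, rounded to four places. `best_epoch` is a hypothetical helper, not part of the notebook.

```python
# (epoch, eval_loss, eval_accuracy) copied from the logs above, rounded to 4 places.
real_logs = [(1, 0.4509, 0.8182), (2, 0.3723, 0.8182), (3, 0.3704, 0.8450),
             (4, 0.3711, 0.8438), (5, 0.3792, 0.8480)]
sim_logs = [(1, 0.5354, 0.7694), (2, 0.8681, 0.7300), (3, 1.1738, 0.6870),
            (4, 1.2126, 0.7019), (5, 1.2129, 0.7065)]

def best_epoch(logs):
    """Return the log entry with the lowest evaluation loss."""
    return min(logs, key=lambda entry: entry[1])

print(best_epoch(real_logs))
print(best_epoch(sim_logs))
```

`TrainingArguments` also has `load_best_model_at_end` and `metric_for_best_model` options that may be worth exploring for this purpose. They require checkpoints to be saved on the same schedule as evaluation.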

Use our simulated model to predict the extra data.

In [16]:
sim_trainer.predict(predict_data)

  0%|          | 0/391 [00:00<?, ?it/s]

PredictionOutput(predictions=array([[-2.4561958 ,  1.6606176 ],
       [-3.7050686 ,  3.1216106 ],
       [ 2.5778518 , -2.1714263 ],
       ...,
       [-0.90807104,  0.77992636],
       [-2.4561944 ,  1.6606168 ],
       [ 3.035918  , -2.6402717 ]], dtype=float32), label_ids=array([0, 1, 0, ..., 1, 0, 1], dtype=int64), metrics={'test_loss': 1.198638677597046, 'test_accuracy': 0.70792, 'test_runtime': 27.0811, 'test_samples_per_second': 461.577, 'test_steps_per_second': 14.438})
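
To put the two `test_accuracy` values side by side, the sketch below converts them to counts of correct predictions on the 12500 prediction examples and adds a rough 95% interval for each. The interval uses the normal approximation to a binomial proportion, which is an assumption made here for illustration; `accuracy_interval` is a hypothetical helper.

```python
import math

n = 12500  # size of the prediction set, from the Map output above
accuracies = {"real": 0.852, "simulated": 0.70792}  # test_accuracy values above

def accuracy_interval(p, n, z=1.96):
    """Approximate 95% interval for an accuracy via the normal approximation."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

for name, acc in accuracies.items():
    low, high = accuracy_interval(acc, n)
    print(f"{name}: {round(acc * n)} correct, interval {low:.3f} to {high:.3f}")

gap = accuracies["real"] - accuracies["simulated"]
print(f"gap: {gap:.5f}")
```

Under this approximation the two intervals do not overlap, which suggests that the gap between the model trained on real reviews and the one trained on simulated reviews is larger than sampling noise on this prediction set.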