<a href="https://colab.research.google.com/github/alexistassone/NLPProject/blob/main/NLP_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Comet and Dependencies

---

In [None]:
!pip install comet_ml torch datasets transformers scikit-learn accelerate



Initialize Comet

---



In [None]:
import comet_ml

comet_ml.init(project_name="imdb-distilbart")

Set Model Type

---



In [None]:
PRE_TRAINED_MODEL_NAME = "distilbert-base-uncased"
SEED = 20

Load Data

---



In [None]:
from transformers import AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

raw_datasets = load_dataset("imdb")




Setup Tokenizer

---



In [None]:
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

In [None]:
def tokenize_function(examples):
  return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)



In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
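The cells above combine `padding="max_length"` at tokenization time with a `DataCollatorWithPadding`. As a rough illustration of how the two padding strategies differ, the sketch below pads toy token-ID lists in pure Python. The helper names (`pad_to_max_length`, `pad_to_longest`), the pad ID of 0, and the toy sequences are assumptions made for this example only; none of them are part of the `transformers` API.

```python
# Toy illustration of fixed-length vs. per-batch (dynamic) padding.
# All names and values here are hypothetical; no transformers call is used.

PAD_ID = 0  # assumed pad token ID for this sketch


def pad_to_max_length(batch, max_length):
  """Pad (and truncate) every sequence to exactly max_length tokens."""
  return [seq[:max_length] + [PAD_ID] * max(0, max_length - len(seq))
          for seq in batch]


def pad_to_longest(batch):
  """Pad every sequence to the length of the longest one in the batch."""
  longest = max(len(seq) for seq in batch)
  return [seq + [PAD_ID] * (longest - len(seq)) for seq in batch]


# Three toy "tokenized reviews" of lengths 3, 5 and 2.
batch = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102]]

fixed = pad_to_max_length(batch, max_length=8)
dynamic = pad_to_longest(batch)

print([len(seq) for seq in fixed])    # every row has length 8
print([len(seq) for seq in dynamic])  # every row has length 5
```

Because the notebook already pads to the maximum length inside `tokenize_function`, every example that reaches the collator should have the same length, leaving the collator nothing to pad. Dropping `padding="max_length"` there would let the collator pad per batch instead, which usually means less computation on short reviews.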

Create Sample Datasets

---



In [None]:
train_dataset = tokenized_datasets["train"].shuffle(seed=SEED).select(range(200))
eval_dataset = tokenized_datasets["test"].shuffle(seed=SEED).select(range(200))



Setup Transformer Model

---



In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME, num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.

Setup Evaluation Function

---



In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def get_example(index):
  return eval_dataset[index]["text"]

def compute_metrics(pred):
  # Fetch the Comet experiment currently active in this process, if any
  experiment = comet_ml.get_global_experiment()

  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  precision, recall, f1, _ = precision_recall_fscore_support(labels, preds,
                                                             average="macro")
  acc = accuracy_score(labels, preds)

  if experiment:
    epoch = int(experiment.curr_epoch) if experiment.curr_epoch is not None else 0
    experiment.set_epoch(epoch)
    experiment.log_confusion_matrix(
        y_true=labels, y_predicted=preds,
        file_name=f"confusion-matrix-epoch-{epoch}.json",
        labels=["negative","positive"], index_to_example_function=get_example)

    # Log the first 20 evaluation texts with their true labels.
    # Kept inside the `if` so the function still works without an experiment.
    for i in range(20):
      experiment.log_text(get_example(i), metadata={"label": labels[i].item()})

  return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
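To make the `average="macro"` setting concrete, here is a minimal pure-Python sketch of what macro-averaged precision, recall, and F1 mean for two classes. It is an illustration on made-up logits and labels, not a replacement for `precision_recall_fscore_support`; the function name `macro_prf` and the toy data are assumptions.

```python
def macro_prf(labels, preds, classes=(0, 1)):
  """Compute per-class precision/recall/F1, then average them unweighted."""
  precisions, recalls, f1s = [], [], []
  for c in classes:
    tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
    fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
    fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
  n = len(classes)
  return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy logits for four reviews; column 0 = negative, column 1 = positive.
logits = [[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [-1.0, 0.4]]
labels = [0, 0, 1, 1]

# Same idea as `pred.predictions.argmax(-1)`: pick the highest-scoring class.
preds = [row.index(max(row)) for row in logits]

precision, recall, f1 = macro_prf(labels, preds)
accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
print(preds, accuracy, precision, recall, f1)
```

Macro averaging gives each class equal weight regardless of how many examples it has, which suits a two-class sentiment task where both classes matter equally.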

Run Training

---



In [None]:
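# Use the %env magic so the variables persist in the notebook process;
# `!env VAR=value` only prints the environment of a throwaway subshell.
%env COMET_MODE=ONLINE
%env COMET_LOG_ASSETS=TRUE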
!pip install accelerate

training_args = TrainingArguments(seed=SEED, output_dir="./results", 
                                  overwrite_output_dir=True, num_train_epochs=1,
                                  do_train=True, do_eval=True,
                                  evaluation_strategy="steps", eval_steps=25,
                                  save_strategy="steps", save_total_limit=10,
                                  save_steps=25, per_device_train_batch_size=8,
                                  report_to=["comet_ml"])

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset,
                  eval_dataset=eval_dataset, compute_metrics=compute_metrics,
                  data_collator=data_collator)
trainer.train()
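As a quick sanity check on the settings above, the arithmetic below estimates how many optimizer steps one epoch takes and at which steps evaluation and checkpointing would fire. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; the variable names are illustrative and not taken from the notebook.

```python
import math

num_examples = 200  # size of train_dataset after select(range(200))
batch_size = 8      # per_device_train_batch_size
num_epochs = 1      # num_train_epochs
eval_steps = 25     # eval_steps (save_steps uses the same value)

# Assumption: one device, gradient_accumulation_steps=1, last batch kept.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which an evaluation (and a checkpoint save) would be triggered.
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(steps_per_epoch, total_steps, eval_points)
```

Under these assumptions the run evaluates only once, at the final step. A smaller `eval_steps` value would produce more evaluation points, and therefore more confusion matrices, in Comet.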
