<a href="https://colab.research.google.com/github/dkay7223/Sentiment_analyzer_model/blob/main/training_papluca_for_sentiment_analyzing_Plus_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Course Project
**Group Members**
• Muhammad Usman - i202602
• Daniyal Khan - i201847
• Tayyba Saveera - i202314

Model: xlm-roberta-base-language-detection

#Model description
This model is an XLM-RoBERTa transformer model with a classification head on top (i.e. a linear layer on top of the pooled output).


# Sentiment Analysis for Project

Training the model:*italicised text* [papluca/xlm-roberta-base-language-detection](https://huggingface.co/papluca/xlm-roberta-base-language-detection), which is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification) dataset.

In [49]:
!pip install -q datasets transformers[sentencepiece] langid watermark

In [50]:
!pip install accelerate -U
!pip install transformers[torch]
!pip show transformers
!pip show accelerate


Name: transformers
Version: 4.35.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
Name: accelerate
Version: 0.25.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: sylvain@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: 


In [51]:
%load_ext watermark
%watermark -p torch,datasets,sklearn,transformers,langid

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
torch       : 2.1.0+cu121
datasets    : 2.15.0
sklearn     : 1.2.2
transformers: 4.35.2
langid      : 1.1.6



# Loading the model

In [52]:
import time
from pathlib import Path

import langid
import torch
from datasets import load_dataset
from sklearn.metrics import f1_score, accuracy_score, classification_report
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    pipeline,
    Trainer,
    TrainingArguments
)

This is the folder in which we'll store model checkpoints

In [53]:
gdrive_dir = Path('./checkpoints')

#Loading the Dataset

In [54]:
dataset = load_dataset("papluca/language-identification")

#Training the dataset

In [55]:
ds_train = dataset['train']
ds_valid = dataset['validation']
ds_test = dataset['test']

print(f"Train / valid / test samples: {len(ds_train)} / {len(ds_valid)} / {len(ds_test)}")

Train / valid / test samples: 70000 / 10000 / 10000


## Tokenization

We'll set up a tokenizer that truncates sequences longer than 128 tokens, but ignore padding for now. This is because we'll use dynamic padding, i.e. we'll pad to the length of the *longest sequence in each batch*.

In [56]:
model_ckpt = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [57]:
def tokenize_text(sequence):
    """Tokenize input sequence."""
    return tokenizer(sequence["text"], truncation=True, max_length=128)

Tokenize all sub-datasets:


In [58]:
tok_train = ds_train.map(tokenize_text, batched=True)
tok_valid = ds_valid.map(tokenize_text, batched=True)
tok_test = ds_test.map(tokenize_text, batched=True)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Prepare forward and backward mappings between labels strings and integers:

In [59]:

xnli_languages = ['ur']

all_langs = sorted(list(set( xnli_languages )))

In [60]:
id2label = {idx: all_langs[idx] for idx in range(len(all_langs))}
label2id = {v: k for k, v in id2label.items()}
label2id["pt"] = len(label2id)

In [61]:
def encode_labels(example):
    """Map string labels to integers."""
    example["labels"] = label2id[example["labels"]]
    return example

Encode targets:

In [62]:
tok_train = tok_train.filter(lambda example: example["labels"] in label2id)
tok_valid = tok_valid.filter(lambda example: example["labels"] in label2id)
tok_test = tok_test.filter(lambda example: example["labels"] in label2id)

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

Let's have a look at the corpus statistics to see e.g. whether our truncation length is fine.

In [63]:
from statistics import mean, stdev

_len = [len(sample) for sample in tok_train['input_ids']]
avg_len, std_len = mean(_len), stdev(_len)
min_len, max_len = min(_len), max(_len)

print('-'*10 + ' Corpus statistics ' + '-'*10)
print(f'\nAvg. length: {avg_len:.1f} (std. {std_len:.1f})')
print('Min. length:', min_len)
print('Max. length:', max_len)

---------- Corpus statistics ----------

Avg. length: 22.5 (std. 13.8)
Min. length: 3
Max. length: 128


Max. length is 128, but on average samples are much shorter (ca. 22.5 tokens). Padding batches

In [64]:
# Use dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Model Training

In [65]:
model = AutoModelForSequenceClassification.from_pretrained(
  model_ckpt, num_labels=len(all_langs), id2label=id2label, label2id=label2id
)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We define here the metrics that we're going to monitor during training:

In [66]:
def compute_metrics(pred):
    """Custom metric to be used during training."""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)  # Accuracy
    f1 = f1_score(labels, preds, average="weighted")  # F1-score
    return {
        "accuracy": acc,
        "f1": f1
        }

To train our model, we'll use the HF `Trainer`. The 1st step is to create an instance of the `TrainingArguments` class, which will contain all the hyperparameters the `Trainer` will use for training and evaluation.

In [67]:
epochs = 2
lr = 2e-5
train_bs = 64
eval_bs = train_bs * 2

# Log training loss at each epoch
logging_steps = len(tok_train) // train_bs
# Out dir
output_dir = gdrive_dir / "xlm-roberta-base-finetuned-language-detection"



In [71]:
from transformers import TrainingArguments


In [73]:
print(TrainingArguments)

<class 'transformers.training_args.TrainingArguments'>


In [None]:
training_args = TrainingArguments(
  output_dir=output_dir,
  num_train_epochs=epochs,
  learning_rate=lr,
  per_device_train_batch_size=train_bs,
  per_device_eval_batch_size=eval_bs,
  evaluation_strategy="epoch",
  logging_steps=logging_steps,
  fp16=True,  # Remove if GPU doesn't support it
)

In [None]:
trainer = Trainer(
  model,
  training_args,
  compute_metrics=compute_metrics,
  train_dataset=tok_train,
  eval_dataset=tok_valid,
  data_collator=data_collator,
  tokenizer=tokenizer,
)

Let's train the model!

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 70000
  Num Epochs = 2
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 2188
  Number of trainable parameters = 278059028
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [2188/2188 19:33, Epoch 2/2]
Epoch	Training Loss	Validation Loss	Accuracy	F1
1	0.248100	0.014772	0.996800	0.996815
2	0.009400	0.008678	0.998000	0.998000
Saving model checkpoint to drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-500
Configuration saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-500/config.json
Model weights saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-500/pytorch_model.bin
tokenizer config file saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-500/tokenizer_config.json
Special tokens file saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-500/special_tokens_map.json
Saving model checkpoint to drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1000
Configuration saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1000/config.json
Model weights saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1000/tokenizer_config.json
Special tokens file saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10000
  Batch size = 128
Saving model checkpoint to drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1500
Configuration saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1500/config.json
Model weights saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1500/tokenizer_config.json
Special tokens file saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-2000
Configuration saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-2000/config.json
Model weights saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-2000/tokenizer_config.json
Special tokens file saved in drive/MyDrive/Colab Notebooks/HuggingFace_course/HF_course_community_event/xlm-roberta-base-finetuned-language-detection/checkpoint-2000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10000
  Batch size = 128


Training completed. Do not forget to share your model on huggingface.co/models =)


TrainOutput(global_step=2188, training_loss=0.12860834635892063, metrics={'train_runtime': 1176.8649, 'train_samples_per_second': 118.96, 'train_steps_per_second': 1.859, 'total_flos': 8561635871314944.0, 'train_loss': 0.12860834635892063, 'epoch': 2.0})

# Bnechmarking

In [76]:
ds_test = ds_test.to_pandas()
ds_test.head(3)

Unnamed: 0,labels,text
0,nl,Een man zingt en speelt gitaar.
1,nl,De technologisch geplaatste Nasdaq Composite I...
2,es,Es muy resistente la parte trasera rígida y lo...


In [77]:
# Constrain the language set
langid.set_languages(all_langs)

In [None]:
print(classification_report(ds_test.labels.values.tolist(), langid_preds, digits=3))

precision    recall  f1-score   support

          ur      0.998     0.964     0.981       500
         
    accuracy                          0.985     10000
   macro avg      0.986     0.985     0.985     10000
weighted avg      0.986     0.985     0.985     10000

In [79]:
device = 0 if torch.cuda.is_available() else -1

model_ckpt = "papluca/xlm-roberta-base-language-detection"
pipe = pipeline("text-classification", model=model_ckpt, device=device)

config.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/502 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

In [None]:
start_time = time.perf_counter()
model_preds = [s['label'] for s in pipe(ds_test.text.values.tolist(), truncation=True, max_length=128)]
print(f"{time.perf_counter() - start_time:.2f} seconds")

#Classification Report

In [None]:
print(classification_report(ds_test.labels.values.tolist(), model_preds, digits=3))

              precision    recall  f1-score   support

        
          ur      0.998     0.964     0.981       500
          

    accuracy                          0.996     10000
   macro avg      0.996     0.996     0.996     10000
weighted avg      0.996     0.996     0.996     10000


# Testing and comparision

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification



tokenizer = AutoTokenizer.from_pretrained("dkay7223/SentimentAnalyzerForNLPProject")
model = AutoModelForSequenceClassification.from_pretrained("dkay7223/SentimentAnalyzerForNLPProject", id2label = {0: 'negative', 1: 'positive'})



In [None]:
from transformers import TextClassificationPipeline
text = "وہ ایک برا شخص ہے"
pipe = TextClassificationPipeline(model = model, tokenizer = tokenizer, top_k = 2, device = 0)
pipe(text)