<a href="https://colab.research.google.com/github/alturkim/nlp-notebooks/blob/main/Sentiment_Analysis_(with_HF_Trainer_API).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div dir=ltr>

# Sentiment Analysis Using Deep Learning


---


## Introduction:
Sentiment analysis is the task of identifying the sentiment expressed in a piece of content such as a text or an image.

## Example:
A well-known example is classifying customers' written product reviews as positive or negative.

## Notebook contents:
This notebook contains code that builds a sentiment analyzer by fine-tuning a pretrained language model.
The code uses the popular HuggingFace libraries.

## Video walkthrough:
A video walkthrough of this notebook is available at the following link:

</div>



In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, arrow_dataset
import evaluate

import numpy as np

## Data overview

In [None]:
raw_dataset = load_dataset("ar_res_reviews", split="train")
print(raw_dataset)
raw_dataset = raw_dataset.rename_column("polarity", "label")

def get_stat(dataset: arrow_dataset.Dataset) -> None:
    """Print the count and percentage of positive (1) and negative (0) reviews."""
    labels = dataset["label"]
    pos_count = sum(1 for label in labels if label == 1)
    neg_count = sum(1 for label in labels if label == 0)

    pos_pct = pos_count / (pos_count + neg_count)
    neg_pct = neg_count / (pos_count + neg_count)
    print(f"There are {pos_count} positive reviews, and {neg_count} negative reviews.")
    print(f"Percentage of positive reviews: {pos_pct*100:.2f}%")
    print(f"Percentage of negative reviews: {neg_pct*100:.2f}%")

print("dataset statistics: ")
get_stat(raw_dataset)

Split the data into three sets:<br>
train, eval, test
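
The two-stage split used in the next cell (hold out 40%, then halve the held-out part) can be checked with a little arithmetic. This is a minimal sketch in plain Python: `split_fractions` is a hypothetical helper, not part of `datasets`, and it ignores the rounding that happens when fractions are turned into row counts.

```python
def split_fractions(first_test_size: float, second_test_size: float) -> dict:
    """Fractions of the full dataset that land in train, eval and test.

    Mirrors a two-stage split: first hold out `first_test_size` of the data,
    then divide that held-out part into eval and test.
    """
    held_out = first_test_size
    return {
        "train": 1.0 - held_out,
        "eval": held_out * (1.0 - second_test_size),
        "test": held_out * second_test_size,
    }

fractions = split_fractions(first_test_size=0.4, second_test_size=0.5)
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%}")
```

With these settings the split is roughly 60/20/20. The statistics printed further down (5854 training and 2510 test reviews) correspond to roughly a 70/30 split, so they appear to come from a run with different settings.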


In [None]:
def train_eval_test_split(dataset):
    """Split into train (60%), eval (20%) and test (20%), stratified by label."""
    split_datasets = dict()
    # First hold out 40% of the data, preserving the label distribution.
    train_eval_test = dataset.train_test_split(test_size=0.4, stratify_by_column="label", seed=10)
    split_datasets["train"] = train_eval_test["train"]
    # Then divide the held-out 40% evenly into eval and test.
    eval_test = train_eval_test["test"].train_test_split(test_size=0.5, stratify_by_column="label", seed=10)
    split_datasets["eval"] = eval_test["train"]
    split_datasets["test"] = eval_test["test"]
    return split_datasets

split_datasets = train_eval_test_split(raw_dataset)



print("training data stats")
get_stat(split_datasets["train"])
print("test data stats")
get_stat(split_datasets["test"])


There are 4162 positive reviews, and 1692 negative reviews.
Percentage of positive reviews: 71.10%
Percentage of negative reviews: 28.90%
There are 1784 positive reviews, and 726 negative reviews.
Percentage of positive reviews: 71.08%
Percentage of negative reviews: 28.92%
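
Because about 71% of the reviews are positive, accuracy alone can flatter a model. One way to put later scores in context is the majority-class baseline, i.e. the accuracy of always predicting "positive". The sketch below uses the second set of counts printed above (the test split); `majority_baseline_accuracy` is a hypothetical helper name.

```python
def majority_baseline_accuracy(pos_count: int, neg_count: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    return max(pos_count, neg_count) / (pos_count + neg_count)

# Test-set counts taken from the statistics printed above.
baseline = majority_baseline_accuracy(pos_count=1784, neg_count=726)
print(f"Majority-class baseline accuracy: {baseline:.2%}")
```

Any fine-tuned model should clear this bar comfortably before its accuracy is considered meaningful.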


In [None]:
sample = split_datasets["train"].shuffle().select([3])
print(sample["text"])
print(sample["label"])

['مقهى شندويشات ممتاز واسعار رخيصة']
[1]


In [None]:
model_checkpoint = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def tokenize_function(examples):
    # Truncate to the model's maximum length; padding is applied per batch by the data collator.
    return tokenizer(examples["text"], truncation=True)

# Tokenize every split (train, eval, test) in batches.
tokenized_datasets = {
    name: split.map(tokenize_function, batched=True)
    for name, split in split_datasets.items()
}
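
The tokenizer call above truncates but does not pad; padding is left to `DataCollatorWithPadding`, which by default pads each batch to the length of its longest sequence. The plain-Python sketch below illustrates that idea only. `pad_batch` and the token ids are made up for illustration, and this is not the library's implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch.

    Returns the padded ids plus an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy batches: each is padded only to its own longest sequence.
short_batch = pad_batch([[101, 7, 102], [101, 7, 8, 102]])
long_batch = pad_batch([[101, 7, 8, 9, 10, 11, 102], [101, 102]])
print(len(short_batch["input_ids"][0]), len(long_batch["input_ids"][0]))
```

Padding per batch rather than to a fixed global length keeps batches of short reviews small, which is why the tokenization step can skip padding altogether.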

In [None]:
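# Evaluate at the end of every epoch. Newer transformers releases rename `evaluation_strategy` to `eval_strategy`.
training_args = TrainingArguments(output_dir="output", num_train_epochs=2, evaluation_strategy="epoch")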
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'pre_classifie

In [None]:
# Load each metric once instead of on every evaluation call.
metric_names = ["precision", "recall", "f1", "accuracy"]
metrics = [evaluate.load(name) for name in metric_names]

def compute_metrics(eval_preds):
    # eval_preds is an EvalPrediction object, which unpacks like a (logits, labels) tuple
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    results = dict()
    for metric in metrics:
        results.update(metric.compute(predictions=predictions, references=labels))
    return results

# Note: when a tokenizer is passed to the Trainer, its default data collator is
# already a DataCollatorWithPadding like the one defined earlier, so the
# data_collator=data_collator argument below could be omitted.

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    # Per-epoch evaluation runs on the test split here; the separate eval split is left unused.
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
print("results before finetuning ... ")
predictions = trainer.predict(tokenized_datasets["test"])
# predictions is a named tuple: predictions.predictions holds the logits and
# predictions.label_ids the gold labels (the dataset column is "label", not "labels")
results = compute_metrics((predictions.predictions, predictions.label_ids))
print(results)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, restaurant_id, user_id. If text, restaurant_id, user_id are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2510
  Batch size = 8


{'precision': {'precision': 0.6686390532544378}, 'recall': {'recall': 0.0633408071748879}, 'f1': {'f1': 0.11571940604198669}, 'accuracy': {'accuracy': 0.31195219123505974}}
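
For reference, the four scores reported above can be computed by hand from the logits: take the argmax of each row as the predicted class, count true and false positives and negatives, and combine the counts. The sketch below does this in plain Python on made-up logits. `binary_metrics` is a hypothetical helper, and it treats label 1 (positive) as the target class, which is assumed to match the default binary setting of the `evaluate` metrics.

```python
def binary_metrics(logits, labels):
    """Precision, recall, F1 and accuracy for binary labels, with class 1 as the positive class."""
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]

    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up logits for four reviews: columns are (negative, positive).
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 1]
toy_scores = binary_metrics(toy_logits, toy_labels)
print(toy_scores)
```

Read this way, the very low recall before fine-tuning means that the untrained classification head labels only a small share of the positive reviews as positive.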


In [None]:
trainer.train()  # logs the training loss every 500 steps by default; adjust with logging_steps in TrainingArguments
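
To see how often the loss will be reported, it helps to estimate the number of optimization steps. The sketch below assumes a single device, no gradient accumulation, a training batch size of 8 and logging every 500 steps (believed to be the `TrainingArguments` defaults), and takes the 5854 training reviews from the statistics printed earlier. `training_steps` is a hypothetical helper.

```python
import math

def training_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Total optimizer steps when the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

total_steps = training_steps(num_examples=5854, batch_size=8, num_epochs=2)
logging_steps = 500  # assumed default logging interval
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
print(total_steps, logged_at)
```

Under these assumptions the loss is reported only twice during the run, so a smaller logging interval may be worth setting for a dataset of this size.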


In [None]:
print("results after finetuning ... ")
predictions = trainer.predict(tokenized_datasets["test"])
# predictions is a named tuple: predictions.predictions holds the logits and
# predictions.label_ids the gold labels (the dataset column is "label", not "labels")
results = compute_metrics((predictions.predictions, predictions.label_ids))
print(results)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, restaurant_id, user_id. If text, restaurant_id, user_id are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2510
  Batch size = 8


results after finetuning ... 


{'precision': {'precision': 0.8729222520107238}, 'recall': {'recall': 0.9125560538116592}, 'f1': {'f1': 0.8922992600712524}, 'accuracy': {'accuracy': 0.8434262948207172}}
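
As a sanity check, the scores above are mutually consistent. Assuming the test set holds 1784 positive and 726 negative reviews (the counts printed earlier), the reported precision and recall imply the confusion-matrix counts used below. These counts are back-calculated from the reported scores, not taken from the notebook's output, and the variable names are illustrative.

```python
# Confusion-matrix counts back-calculated from the reported scores (an inference, not notebook output).
tp, fn = 1628, 156   # positive reviews: 1628 + 156 = 1784
fp, tn = 237, 489    # negative reviews: 237 + 489 = 726

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The resulting accuracy of about 84.3% compares with the 71.08% share of positive reviews in the test set, so fine-tuning clearly improves on always predicting "positive".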


In [None]:
# test your model here
sentences = [""]
# tokenized = 
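
One possible way to fill in the cell above is sketched here. It assumes the `model` and `tokenizer` objects from the earlier cells, a PyTorch backend, and the label convention used in `get_stat` (0 = negative, 1 = positive). `ID2LABEL`, `decode` and `predict_sentiment` are hypothetical names introduced for illustration.

```python
ID2LABEL = {0: "negative", 1: "positive"}  # assumed label convention (matches get_stat)

def decode(probabilities):
    """Map one row of class probabilities to a (label, probability) pair."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return ID2LABEL[best], probabilities[best]

def predict_sentiment(sentences, model, tokenizer):
    """Tokenize the sentences, run the model, and decode each prediction."""
    import torch  # imported here so the helpers above stay dependency-free

    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = torch.softmax(logits, dim=-1).tolist()
    return [decode(row) for row in probabilities]

# Usage in the notebook, with the fine-tuned model and tokenizer from the cells above:
# predict_sentiment(["مقهى شندويشات ممتاز واسعار رخيصة"], trainer.model, tokenizer)
```

Padding is requested explicitly here because the sentences are passed to the model directly rather than through the Trainer's data collator.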

In [None]:
# demo with gradio

# logging with tensorboard