# Sentiment Analysis

In this project, I predict the sentiment of 20 sentences, 10 in English and 10 in Persian, using two pretrained models from the Hugging Face Hub.

## Imports

In [None]:
# Install/update the transformers library
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.48.0-py3-none-any.whl.metadata (44 kB)
Downloading transformers-4.48.0-py3-none-any.whl (9.7 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.47.1
    Uninstalling transformers-4.47.1:
      Successfully uninstalled transformers-4.47.1
Successfully installed transformers-4.48.0
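
Because `--upgrade` installs whatever release is newest at run time, results may differ between runs. One way to make a run easier to reproduce is to record the installed version next to the results. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Prints the version string if transformers is installed, otherwise None
print("transformers:", installed_version("transformers"))
```

In the run logged here, this would be expected to report the version named in the install log.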


## Model Selection - English

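- The English model is `cardiffnlp/twitter-roberta-base-sentiment-latest` from Hugging Face, a RoBERTa-base model trained on about 124M tweets posted between January 2018 and December 2021 and fine-tuned for sentiment analysis on the TweetEval benchmark.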




In [None]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer for English text
model_name_en = 'cardiffnlp/twitter-roberta-base-sentiment-latest'
model_en = AutoModelForSequenceClassification.from_pretrained(model_name_en)
tokenizer_en = AutoTokenizer.from_pretrained(model_name_en)
sentiment_pipeline_en = pipeline('sentiment-analysis', model=model_en, tokenizer=tokenizer_en)


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


- Wrote 10 arbitrary English phrases with varied sentiment and stored the model's predictions in a pandas DataFrame.

In [None]:
import pandas as pd

phrases_en = [
    'Great job!',
    'I have a headache.',
    'What a great day.',
    'How are you?',
    'I like to read books',
    'Programming can be hard, but it\'s fun.',
    'I got full marks on the exam.',
    'That chef does not know how to cook.',
    'Python is very straight forward.',
    'Learning is great.'
]

# Make predictions
results_en = sentiment_pipeline_en(phrases_en)

# Save to a dataframe
df_en = pd.DataFrame({
    'Phrase': phrases_en,
    'Prediction': [result['label'] for result in results_en],
    'Probability': [result['score'] for result in results_en]
})

In [None]:
df_en.head(10)

| | Phrase | Prediction | Probability |
|---|---|---|---|
| 0 | Great job! | positive | 0.967827 |
| 1 | I have a headache. | negative | 0.721547 |
| 2 | What a great day. | positive | 0.982199 |
| 3 | How are you? | neutral | 0.851066 |
| 4 | I like to read books | positive | 0.830213 |
| 5 | Programming can be hard, but it's fun. | positive | 0.792353 |
| 6 | I got full marks on the exam. | positive | 0.869636 |
| 7 | That chef does not know how to cook. | negative | 0.868727 |
| 8 | Python is very straight forward. | positive | 0.783547 |
| 9 | Learning is great. | positive | 0.976359 |
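
The score reported for each phrase is, as far as I understand the pipeline's default behaviour for single-label models, the softmax probability of the winning class. The sketch below illustrates only that final step in plain Python. The label order, the logit values, and the helper name `logits_to_prediction` are assumptions made for illustration and are not taken from the model.

```python
import math

# Assumed label order for illustration: 0 = negative, 1 = neutral, 2 = positive
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def logits_to_prediction(logits):
    """Turn raw class logits into a {'label', 'score'} dict via softmax."""
    # Subtract the max logit for numerical stability before exponentiating
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index of the most probable class
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

# Hypothetical logits, not produced by the model
prediction = logits_to_prediction([-1.2, 0.3, 2.9])
print(prediction["label"], round(prediction["score"], 3))
```

Read this way, a score such as 0.72 means the remaining 0.28 of probability mass is spread over the other two classes, so it is a weaker prediction than one scored at 0.98.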


## Model Selection - Persian

- The Persian model is `rezaFarsh/ternary_persian_sentiment_analysis`, a version of `sentence-transformers/LaBSE` fine-tuned for sentiment analysis in Persian.

In [None]:
# Load the model and tokenizer for Persian text
model_name_fa = 'rezaFarsh/ternary_persian_sentiment_analysis'
model_fa = AutoModelForSequenceClassification.from_pretrained(model_name_fa)
tokenizer_fa = AutoTokenizer.from_pretrained(model_name_fa)
sentiment_pipeline_fa = pipeline('sentiment-analysis', model=model_fa, tokenizer=tokenizer_fa)


Device set to use cpu


1. Wrote 10 arbitrary Persian phrases with varied sentiment and stored the model's predictions in a pandas DataFrame.

2. Mapped the model's default labels (`LABEL_0`, `LABEL_1`, `LABEL_2`) to readable Persian labels meaning negative, neutral, and positive.
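
One detail worth keeping in mind for the label conversion: pandas `Series.map` with a dict yields `NaN` for any value that is not a key in the dict, so an unexpected label would silently disappear. Below is a minimal sketch of a mapping with a fallback, written in plain Python. It uses English names instead of the Persian ones used in this notebook, and `readable_label` is a hypothetical helper, not part of the original code.

```python
# Mapping assumed in this notebook: LABEL_0 = negative, LABEL_1 = neutral, LABEL_2 = positive
LABEL_NAMES = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}

def readable_label(raw_label):
    """Return a readable name, falling back to the raw label if it is unknown."""
    return LABEL_NAMES.get(raw_label, raw_label)

# Hypothetical raw outputs; "LABEL_7" is a deliberately unknown label
raw = ["LABEL_2", "LABEL_0", "LABEL_1", "LABEL_7"]
print([readable_label(r) for r in raw])
```

With a DataFrame, the same function could be passed to `Series.map` in place of the dict.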

In [None]:
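phrases_fa = [
    'اون کتاب عالی بود',  # That book was great
    'روزنامه امروز خیلی حوصله سربر بود',  # Today's newspaper was very boring
    'غذای امروز افتضاح بود',  # Today's food was awful
    'حالت چطوره؟',  # How are you?
    'کوهنوردی بسیار هیجان انگیز است',  # Mountain climbing is very exciting
    'من عاشق شطرنج هستم',  # I love chess
    'من در امتحان نمره کامل گرفتم',  # I got full marks on the exam
    'اون آشپز آشپزی بلد نیست',  # That chef does not know how to cook
    'امتحانات به هفته آینده موکول شد',  # The exams were postponed to next week
    'دیروز مدارس تعطیل شد'  # Schools were closed yesterday
]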

results_fa = sentiment_pipeline_fa(phrases_fa)

df_fa = pd.DataFrame({
    'Phrase': phrases_fa,
    'Prediction': [result['label'] for result in results_fa],
    'Probability': [result['score'] for result in results_fa]
})

# LABEL_2 = positive (مثبت), LABEL_1 = neutral (خنثی), LABEL_0 = negative (منفی)
df_fa['Prediction'] = df_fa['Prediction'].map({
    'LABEL_2': 'مثبت',
    'LABEL_1': 'خنثی',
    'LABEL_0': 'منفی'
})

In [None]:
df_fa.head(10)

| | Phrase | Prediction | Probability |
|---|---|---|---|
| 0 | اون کتاب عالی بود | مثبت | 0.952173 |
| 1 | روزنامه امروز خیلی حوصله سربر بود | منفی | 0.649566 |
| 2 | غذای امروز افتضاح بود | منفی | 0.984413 |
| 3 | حالت چطوره؟ | خنثی | 0.993157 |
| 4 | کوهنوردی بسیار هیجان انگیز است | مثبت | 0.934296 |
| 5 | من عاشق شطرنج هستم | خنثی | 0.829223 |
| 6 | من در امتحان نمره کامل گرفتم | مثبت | 0.641187 |
| 7 | اون آشپز آشپزی بلد نیست | منفی | 0.703528 |
| 8 | امتحانات به هفته آینده موکول شد | خنثی | 0.716493 |
| 9 | دیروز مدارس تعطیل شد | خنثی | 0.887283 |
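
Some of the Persian predictions carry fairly low scores, so it can help to flag them for manual review. The sketch below does this in plain Python for a handful of rows copied from the results above, with the labels written in English. The 0.75 threshold and the helper name `low_confidence` are arbitrary assumptions for illustration, not values from the original notebook.

```python
# (row index, predicted label, score) copied from the Persian results above
results = [
    (0, "positive", 0.952173),
    (1, "negative", 0.649566),
    (5, "neutral", 0.829223),
    (6, "positive", 0.641187),
    (7, "negative", 0.703528),
]

def low_confidence(rows, threshold=0.75):
    """Return the rows whose score falls below the (assumed) threshold."""
    return [row for row in rows if row[2] < threshold]

flagged = low_confidence(results)
print([idx for idx, _, _ in flagged])  # row indices to review by hand
```

Row 5 ("I love chess", predicted neutral with a score of about 0.83) would not be flagged by this threshold even though the label looks wrong, which shows that a high score alone does not guarantee a correct prediction.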
