# Getting Started with Sentiment Analysis using Python

Original Article: https://huggingface.co/blog/sentiment-analysis-python

## Install Hugging Face libraries

In [1]:
!pip install -q transformers emoji xformers datasets accelerate


## How to Use Pre-trained Sentiment Analysis Models with Python

On the [Hugging Face Hub](https://huggingface.co/models), we are building the largest collection of publicly available models and datasets in order to democratize machine learning 🚀. On the Hub, you can find more than 27,000 models shared by the AI community, with state-of-the-art performance on tasks such as sentiment analysis, object detection, text generation, speech recognition and more. The Hub is free to use, and most models have a widget that lets you test them directly in your browser!

There are more than [215 sentiment analysis models](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment) publicly available on the Hub and integrating them with Python just takes 5 lines of code:

In [2]:
from transformers import pipeline

In [3]:
sentiment_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [8]:
data = ["I love you", "I hate you", "I have normal feeling for you"]

In [9]:
sentiment_pipeline(data)

[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079},
 {'label': 'POSITIVE', 'score': 0.9992109537124634}]
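The `score` in each result is the probability the model assigns to the predicted label, obtained by applying a softmax to the model's raw outputs (logits). Because the default SST-2 model only has the two labels `NEGATIVE` and `POSITIVE`, even a fairly neutral sentence such as the third one has to land on one of them. The sketch below shows how two logits turn into a label and a score; the logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Pick the most probable label, mirroring the pipeline's output format."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits for one sentence: [negative, positive]
example_logits = [-4.2, 4.7]
print(to_prediction(example_logits))
```

A large gap between the two logits is what produces scores very close to 1, so a high score reflects the model's confidence between its two labels, not the strength of the sentiment.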

You can use a specific sentiment analysis model that is better suited to your language or use case by providing the name of the model. For example, if you want a sentiment analysis model for tweets, you can specify the [model id](https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis):

In [6]:
specific_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")  # uses a pretrained model


In [7]:
specific_model(data)

[{'label': 'POS', 'score': 0.9916695356369019},
 {'label': 'NEG', 'score': 0.9806600213050842}]
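Note that this model reports its labels as `POS`, `NEG` and `NEU`, while the default pipeline uses `POSITIVE` and `NEGATIVE`. (The output above lists only two predictions, presumably because this cell was executed before `data` was extended to three sentences; see the execution counts.) If you want to compare models, it can help to normalize the label names first. The following is a minimal sketch; the `LABEL_ALIASES` mapping and the `normalize_labels` helper are illustrative names, not part of `transformers`.

```python
# Illustrative mapping from model-specific label names to a common vocabulary.
LABEL_ALIASES = {"POS": "POSITIVE", "NEG": "NEGATIVE", "NEU": "NEUTRAL"}

def normalize_labels(results):
    """Return a copy of pipeline results with unified label names.

    Labels that are not in LABEL_ALIASES are passed through unchanged.
    """
    return [
        {"label": LABEL_ALIASES.get(r["label"], r["label"]), "score": r["score"]}
        for r in results
    ]

tweet_results = [
    {"label": "POS", "score": 0.9916695356369019},
    {"label": "NEG", "score": 0.9806600213050842},
]
print(normalize_labels(tweet_results))
```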

## Building Your Own Sentiment Analysis Model

## Activate GPU and Install Dependencies

Activate the GPU for faster training by clicking `Runtime` > `Change runtime type` and then selecting `GPU` as the hardware accelerator.
Then check whether a GPU is available:

In [10]:
import torch
torch.cuda.is_available()

True
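The `Trainer` used later moves the model to the GPU automatically when one is available, so this check is mostly a sanity test. If you write your own PyTorch code, a common pattern is to pick the device once and fall back to the CPU. The sketch below assumes nothing beyond the standard library and also runs where `torch` is not installed; `pick_device` is a hypothetical helper name.

```python
import importlib.util

def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Using device: {device}")
```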

## Preprocess data

### Load data

In [24]:
from datasets import load_dataset  # the Hugging Face datasets library

imdb = load_dataset("imdb")
print(imdb['train']) 




Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


In [25]:
!pip install pandas



In [29]:
import pandas as pd
df=pd.DataFrame(imdb['train'])
df.head(20)

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |
| 5 | I would put this at the top of my list of film... | 0 |
| 6 | Whoever wrote the screenplay for this movie ob... | 0 |
| 7 | When I first saw a glimpse of this movie, I qu... | 0 |
| 8 | Who are these "They"- the actors? the filmmake... | 0 |
| 9 | This is said to be a personal film for Peter B... | 0 |


### Create a smaller training dataset for faster training times

In [12]:
# Shuffle with a fixed seed so the subset is random but reproducible
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))

small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
print(small_train_dataset[0])
print(small_test_dataset[0])

{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...', 'label': 1}
{'text': "<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, 
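Shuffling before selecting matters here: the first rows of the training split shown earlier all have label 0, which suggests the data is ordered by label, so taking the first 3,000 rows without shuffling would likely give a one-sided sample. The idea of "shuffle with a fixed seed, then take the first n" can be sketched in plain Python. This toy version only illustrates the principle; it does not reproduce the exact permutation that `datasets` generates for `seed=42`.

```python
import random

def shuffled_subset(rows, n, seed=42):
    """Return n rows chosen by shuffling indices with a fixed seed."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run
    return [rows[i] for i in indices[:n]]

# Toy dataset ordered by label, like the first rows of the IMDB training split
toy_rows = [{"text": f"review {i}", "label": 0 if i < 5 else 1} for i in range(10)]

first_run = shuffled_subset(toy_rows, 4)
second_run = shuffled_subset(toy_rows, 4)
print(first_run == second_run)  # the fixed seed makes the subset reproducible
```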

### Set DistilBERT tokenizer

In [13]:
from transformers import AutoTokenizer  # splits text into tokens

In [14]:
MODEL_NAME = "distilbert-base-uncased"  # base: mid-sized variant; uncased: not case-sensitive
# DistilBERT: a small model obtained by distilling the larger BERT model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)  # converts tokens to numeric IDs
help(tokenizer)

### Prepare the text inputs for the model

In [19]:
def preprocess_function(examples):
    # Truncate reviews that exceed the model's maximum input length
    return tokenizer(examples["text"], truncation=True)

In [20]:
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
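To make the effect of `truncation=True` concrete, here is a toy sketch of what tokenization produces: a list of `input_ids` framed by special tokens, plus an `attention_mask`. It uses a whitespace split and a made-up eight-word vocabulary, so it is not the WordPiece algorithm DistilBERT actually uses, and the IDs are not the real ones. The tiny `max_length=8` is also an assumption for readability; DistilBERT's limit is 512 tokens.

```python
def toy_encode(text, vocab, max_length=8):
    """Whitespace 'tokenizer' that mimics truncation=True (not real WordPiece)."""
    unk = vocab["[UNK]"]
    ids = [vocab.get(word, unk) for word in text.lower().split()]
    ids = ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [vocab["[CLS]"], *ids, vocab["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

# Made-up vocabulary, for illustration only
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "movie": 5, "was": 6, "great": 7}

short = toy_encode("The movie was great", toy_vocab)
long = toy_encode("the movie was great " * 5, toy_vocab)  # 20 words, gets truncated

print(short["input_ids"])
print(len(long["input_ids"]))
```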


### Use data_collator to convert our samples to PyTorch tensors and concatenate them with the correct amount of *padding*

In [30]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
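Since the reviews were tokenized without padding, each example has a different length. The collator pads every batch only up to the longest sequence in that batch (dynamic padding). The plain-Python sketch below illustrates that idea with lists instead of tensors; `pad_batch` is a hypothetical helper and the token IDs are made up, so treat it as a simplified picture of what `DataCollatorWithPadding` does rather than its implementation.

```python
def pad_batch(features, pad_id=0):
    """Pad all sequences in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch

# Two made-up tokenized examples of different lengths
features = [
    {"input_ids": [2, 4, 5, 3]},
    {"input_ids": [2, 4, 5, 6, 7, 3]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short reviews small, which saves computation.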

## Training the model

### Define DistilBERT as our base model:

In [31]:
from transformers import AutoModelForSequenceClassification

In [33]:
num_labels = 2

In [34]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

### Define the evaluation metrics 

In [35]:
import numpy as np
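from datasets import load_metric

# Note: load_metric is deprecated in recent releases of datasets;
# the evaluate library (evaluate.load("accuracy")) is its replacement.
load_accuracy = load_metric("accuracy")
load_f1 = load_metric("f1")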

def compute_metrics(eval_pred):    
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
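To see what `compute_metrics` calculates, the following sketch works through accuracy and F1 by hand on four made-up examples, using only the standard library. The logits and labels are invented for illustration, and label 1 is treated as the positive class, which is the usual default for a binary F1 score.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    """Compute accuracy and binary F1 from raw logits and true labels."""
    preds = [argmax(row) for row in logits]
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Made-up logits ([negative, positive]) and true labels
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 1, 1]

# Predictions are [0, 1, 1, 0]: 3 of 4 correct, precision 1.0, recall 2/3
print(accuracy_and_f1(toy_logits, toy_labels))
```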


### Define a new Trainer with all the objects we constructed so far

In [36]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="runs",
    learning_rate=2e-5,
    per_device_train_batch_size=32,  # batch size per device; training can run on multiple devices (GPUs)
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_steps=45
)

In [37]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

### Train the model

In [38]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 45 | 0.5235 |
| 90 | 0.3326 |
| 135 | 0.2267 |
| 180 | 0.223 |


TrainOutput(global_step=188, training_loss=0.3220239738200573, metrics={'train_runtime': 286.1667, 'train_samples_per_second': 20.967, 'train_steps_per_second': 0.657, 'total_flos': 793355529763200.0, 'train_loss': 0.3220239738200573, 'epoch': 2.0})
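The `global_step=188` in the output follows directly from the settings above: 3,000 training examples in batches of 32 give 94 steps per epoch (the last batch is smaller), and two epochs give 188 steps. The short sketch below reproduces this arithmetic and the logged steps, assuming a single device and no gradient accumulation, as in this notebook.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

total = total_training_steps(num_examples=3000, batch_size=32, epochs=2)
logged = list(range(45, total + 1, 45))  # logging_steps=45

print(total)   # total optimizer steps
print(logged)  # steps at which the training loss is logged
```

This is also why the final checkpoint, saved at the end of the second epoch, ends up in a directory named `checkpoint-188`.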

### Compute the evaluation metrics

In [39]:
trainer.evaluate()

{'eval_loss': 0.3095719814300537,
 'eval_accuracy': 0.8566666666666667,
 'eval_f1': 0.8599348534201954,
 'eval_runtime': 5.2479,
 'eval_samples_per_second': 57.166,
 'eval_steps_per_second': 1.906,
 'epoch': 2.0}

## Analyzing new data with the model

Run inference with your new model using `pipeline`:

In [48]:

# Load the saved model
YOUR_LOCAL_MODEL = '/content/runs/checkpoint-188'

In [49]:
sentiment_model = pipeline(task="sentiment-analysis", model=YOUR_LOCAL_MODEL)

In [50]:
sentiment_model(["I love this movie", "This movie sucks!"])

[{'label': 'LABEL_1', 'score': 0.9396113753318787},
 {'label': 'LABEL_0', 'score': 0.8979520797729492}]
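The generic names `LABEL_0` and `LABEL_1` appear because the fine-tuned model was created without human-readable label names. In the IMDB dataset, label 0 is a negative review and label 1 a positive one, so the results can be translated after the fact. The sketch below does this with a small mapping; `ID2NAME` and `rename_labels` are hypothetical helper names. An alternative, not shown here, is to pass `id2label` and `label2id` dictionaries to `from_pretrained` before training so the names are stored in the model's config.

```python
# IMDB convention: 0 = negative review, 1 = positive review
ID2NAME = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}

def rename_labels(results, mapping=ID2NAME):
    """Replace generic LABEL_n names with readable ones, keeping scores."""
    return [
        {"label": mapping.get(r["label"], r["label"]), "score": r["score"]}
        for r in results
    ]

raw_results = [
    {"label": "LABEL_1", "score": 0.9396113753318787},
    {"label": "LABEL_0", "score": 0.8979520797729492},
]
print(rename_labels(raw_results))
```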

## Future Tasks

1. Try to remove the warnings in the notebook, if any
2. Load your own dataset
3. And if you are **really** interested, try writing your own training loop.