<a href="https://colab.research.google.com/github/ali-sowicz/sentiment-analyzer/blob/main/Alicja_Golisowicz_Copy_of_VL_summer_camp_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://public.3.basecamp.com/p/gKcVQuYMuHC9hNVHkY76Tguc/upload/representation)

# Introduction

Welcome to Voicelab AI Summer Camp Exam! You have been selected for the first round of the recruitment process to join the NLP Team at Voicelab.

## NLP TASK
This is a small task designed to check your basic NLP skills. 
Part of the code is ready to use, but some parts need to be added. 
The task is to prepare the data and train the selected T5 model for Polish sentiment analysis. Add your own evaluation metrics if you think they are necessary. Additionally, you can provide EDA (Exploratory Data Analysis) if you think it will help you train a better model.

**Input:** Tweet e.g.  *(string)*: "*Stan polskiego sportu, zwłaszcza gier zespołowych, jest smutnym odzwierciedleniem stanu polskiego państwa. :(*"
**Output:** One of the classes e.g. *Positive*
**Required classes:** *Positive, Neutral, Negative*

The model should work as well as possible on Polish text.
After you finish the task, please send us your Colab notebook with your comments. 
Additionally, send the *test_submission.csv* file with sentiment predictions. **We will evaluate your results, code, and how you approached the problem, so be sure to include all your comments.** The comments can be written in Polish.



![](https://d3caycb064h6u1.cloudfront.net/wp-content/uploads/2021/06/sentimentanalysishotelgeneric-2048x803-1.jpg)

## Sentiment Analysis with Text-To-Text Transfer Transformer 

> *Stan polskiego sportu, zwłaszcza gier zespołowych, jest smutnym odzwierciedleniem stanu polskiego państwa. :(*

~ Negative

> **Sentiment analysis** (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, also more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly.


## T5: Text-To-Text Transfer Transformer

> T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending *a different prefix to the input corresponding to each task*, e.g., for translation: "translate English to German: ", for summarization: "summarize: ", and others.

T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.

![](https://miro.medium.com/max/1400/1*oPH8tAGqu3aUp6qjMtqcHg.png)

More: https://huggingface.co/docs/transformers/model_doc/t5
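
To make the text-to-text framing concrete, here is a minimal sketch of how a single sentiment example could be cast as an input/target string pair. The `sentiment: ` prefix and the helper name `to_text_pair` are illustrative assumptions; neither T5 nor this notebook prescribes them.

```python
def to_text_pair(tweet: str, label: str, prefix: str = "sentiment: ") -> dict:
    """Cast one classification example as a text-to-text pair."""
    # The model reads `input` and is trained to generate `target` as plain text.
    return {"input": prefix + tweet, "target": label}


pair = to_text_pair("Tanie loty do Paryża :)", "Positive")
print(pair["input"])   # sentiment: Tanie loty do Paryża :)
print(pair["target"])  # Positive
```

The `append_prefix` argument of `TwitterDataset` further down plays the same role as `prefix` here.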

## PyTorch Lightning
This notebook uses PyTorch Lightning for training. It is a popular high-level wrapper around the PyTorch deep learning framework. Read more about it here: https://www.pytorchlightning.ai/#about-lightning

In [None]:
!pip install torch
!pip install transformers
!pip install pytorch_lightning
!pip install SentencePiece
!pip install stop-words


Now we seed everything so the results are reproducible.
**Don't change the seed.**

In [None]:
import random
import numpy as np
import torch
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything(2022)
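
As a quick illustration of what seeding buys us, the sketch below uses only Python's built-in `random` module: re-seeding with the same value reproduces the same sequence of draws. The helper name `draw` is made up for this example; the same idea applies to the NumPy and PyTorch generators seeded above.

```python
import random


def draw(seed: int, n: int = 3) -> list:
    """Seed the generator, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(2022)
second = draw(2022)
print(first == second)  # True: same seed, same sequence
```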

# Load data
Load the provided Twitter dataset from a file. The dataset contains tweets annotated with sentiment.

### **Data**

Training data: https://public.3.basecamp.com/p/y3TJTtKBS4hqajrtrjbb9S8z

Test data: https://public.3.basecamp.com/p/K6634dqwPEPvU43HreEwmkWR

Download Twitter Sentiment dataset and upload it below.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving test.csv to test.csv
Saving train.csv to train.csv


## Let's see example data rows

In [None]:
!head train.csv


index,text,label,lang
7364,"RT @mmagierowski: Drodzy, po naciskach Angeli i innych unijnych liderów postanowiłem stworzyć publiczną stronę na FB :-) Zapraszam! https:/…",__neutralny__,pl
786,@m_wawrzynowski I ja też tak uważam. Co innego baza z profilami zgryzu piłkarzy w hiszpańskiej La Liga. @escapade02 @zzyzynski,__neutralny__,pl
5627,#Egipt aresztowany to młodszy syn Mursiego Abdullah http://t.co/UURjMm2Jbi,__negatywny__,pl
7322,@aronsem po skonsumowaniu coś tam wykręci #CzasNaZmiany,__neutralny__,pl
8465,@CuleSynia395 a nooo .... nawet mi nie mów bo mnie telepie ;P mam na to radę kochanie nie ogladamy ;D mamy wyjebane na rosberga xD,__negatywny__,pl
8346,"""Rafałek dopiero zaczynał do nocniczka siusiać, kiedy Zenon Jaskuła wygrywał etapy. Bez majtek biegał"".

Czy ktoś to nagrywa?",__neutralny__,pl
37,Tanie loty do Paryża z Wrocławia oraz Warszawy od 88 PLN: W październiku http://t.co/ubdRNg0VZK #polecane :),__pozytywny__,pl
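
Note that one of the records above spans several physical lines because the tweet contains line breaks inside a quoted field. Line-based tools such as `head` split it, while a CSV parser keeps it together. Here is a small sketch with the standard `csv` module; the shortened tweet text is a stand-in for illustration, not actual data from the file.

```python
import csv
import io

# Stand-in data: the first record contains line breaks inside a quoted field.
raw = (
    "index,text,label,lang\n"
    '8346,"Line one.\n'
    "\n"
    'Line two?",__neutralny__,pl\n'
    "37,Tanie loty :),__pozytywny__,pl\n"
)

rows = list(csv.DictReader(io.StringIO(raw)))
print(len(raw.splitlines()) - 1)  # 4 physical lines after the header
print(len(rows))                  # 2 logical records
print(repr(rows[0]["text"]))      # 'Line one.\n\nLine two?'
```

`pandas.read_csv`, used later in the notebook, handles quoted line breaks in the same way.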


## Build custom PyTorch dataset
Load data from *train.csv*.

In [None]:
from torch.utils.data import Dataset
import string
import pandas as pd
import re
from stop_words import get_stop_words


class TwitterDataset(Dataset):
    def __init__(self, data_path, append_prefix=""):
        self.append_prefix = append_prefix
        #self.samples = list()
        self.data_path = data_path

        # load csv file here
        self.samples = self.readEdit() #this method loads csv file and returns list with cleaned data 
 
    def deleteNaN(self,df):         #Removes rows with NaN contents
        return df.dropna()

    def remove_stopwords(self,text): #Removes polish stop words - common words of little value
        stop = get_stop_words('polish')
        text = " ".join([word for word in text.split() if word not in (stop)])
        return text

    def remove_emojis(self,text):    #Removes emojis
        emoji_pattern = re.compile("["
          u"\U0001F600-\U0001F64F"  # emoticons
          u"\U0001F300-\U0001F5FF"  # symbols & pictographs
          u"\U0001F680-\U0001F6FF"  # transport & map symbols
          u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                                  "]+", flags=re.UNICODE)
        return emoji_pattern.sub(r'', str(text))

    def cleanTxt(self,text): 
        text = text.lower()          
        text = re.sub(r'@\w+', '', text) #Removes @mentions, including handles with underscores
        text = re.sub(r'#', '', text) #Removes # symbol        
        text = re.sub(r'^rt\s+', '', text) #Removes the leading RT marker only, not "rt" inside words such as "sportu"
        text = re.sub(r'\w+:\/\/\S+|https?\S*', '', text) #Removes links, including truncated ones
        text = self.remove_emojis(text)   #Removes emojis
        text = text.translate(str.maketrans('', '', string.punctuation)) #Removes punctuation
        text = self.remove_stopwords(text) #Removes Polish stop words
        return text

    def readEdit(self):   #cleans text and returns a list with text & label
        df = pd.read_csv(self.data_path)    #Load the csv file into a DataFrame
        df = df[df.lang != 'en']            #Removes english tweets - rows with 'en'
        df = df.drop(columns=['index'])     #Removes index column  
        df = df.drop(columns=['lang'])      #Removes lang column
        df = self.deleteNaN(df)             #Removes rows with no text
        df['text']=df['text'].apply(self.cleanTxt) #Cleans every tweet
        data = df.values.tolist()
        print(data)
        return data 

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = {}
        sample["input"] = self.append_prefix + self.samples[idx][0] # changed from index 1: text is now column 0
        sample["target"] = self.samples[idx][1]
        return sample
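
To see what the cleaning steps do to a single tweet, here is a stand-alone sketch that does not depend on the class above. It is a variation under stated assumptions rather than the exact method: `STOP_WORDS` is a made-up mini list standing in for the `stop_words` package, the function name `clean_tweet` is hypothetical, and the retweet pattern is anchored to the start of the string so that words containing "rt" (such as "sportu") stay intact.

```python
import re
import string

STOP_WORDS = {"i", "na", "to"}  # hypothetical mini stop list for the demo


def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"^rt\s+", "", text)      # leading retweet marker only
    text = re.sub(r"@\w+", "", text)        # @mentions, underscores included
    text = re.sub(r"https?\S*", "", text)   # links, including truncated ones
    text = text.replace("#", "")            # keep the hashtag word, drop '#'
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


print(clean_tweet("RT @m_wawrzynowski: Stan sportu i #Egipt http://t.co/UURjMm2Jbi"))
# stan sportu egipt
```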



In [None]:
import os
from typing import Tuple, List
from torch.utils.data import DataLoader, random_split
import torch
import csv

class DataloaderCreator:
    """
    DataloaderCreator creates a dataset and splits it into train and val subsets.
    """

    def __init__(self, data_path, ratio, batch_size, workers):
        self.data_path = data_path
        self.ratio = ratio
        self.batch_size = batch_size
        self.workers = workers

    def _get_split_length(
        self, dataset: torch.utils.data.ConcatDataset
    ) -> Tuple[int, int]:
        train_val_ratio = self.ratio
        train_len = round(len(dataset) * train_val_ratio)
        val_len = len(dataset) - train_len
        print(len(dataset))
        return train_len, val_len

    def get_dataloaders(self):
        
        train = TwitterDataset(self.data_path)
        train_len, val_len = self._get_split_length(train)
        
        train, val = random_split(
            train, [train_len, val_len], generator=torch.Generator().manual_seed(0)
        )
        print(train_len)
        print(val_len)
        dataloader_train = DataLoader(
            train,
            shuffle=True,
            batch_size=self.batch_size,
            num_workers=self.workers,
            drop_last=False,
        )

        dataloader_val = DataLoader(
            val,
            shuffle=False,
            batch_size=self.batch_size,
            num_workers=self.workers,
            drop_last=False,
        )
        return dataloader_train, dataloader_val
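
The split arithmetic in `_get_split_length` can be checked in isolation. The sketch below mirrors it as a plain function (the name `split_lengths` is made up for this example) and applies it to the dataset size of 1717 with the ratio of 0.9 used in this notebook.

```python
def split_lengths(n_samples: int, ratio: float) -> tuple:
    """Return (train_len, val_len) for a given train/val ratio."""
    train_len = round(n_samples * ratio)
    # The validation set receives whatever is left, so the two always sum to n_samples.
    return train_len, n_samples - train_len


print(split_lengths(1717, 0.9))  # (1545, 172)
```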


## Loading data
Now we will preprocess and load the training set.

Data parameters

In [None]:
data_path = "train.csv"
ratio = 0.9
batch_size = 8
workers = 2

Create DataLoaders

In [None]:
loader = DataloaderCreator(
        data_path,
        ratio,
        batch_size,
        workers
    )
dataloader_train, dataloader_val = loader.get_dataloaders()

1717
[['drodzy naciskach angeli i innych unijnych liderów postanowiłem stworzyć publiczną stronę fb zapraszam …', '__neutralny__'], ['wawrzynowski i też uważam innego baza z profilami zgryzu piłkarzy w hiszpańskiej la liga', '__neutralny__'], ['egipt aresztowany młodszy syn mursiego abdullah', '__negatywny__'], ['skonsumowaniu coś wykręci czasnazmiany', '__neutralny__'], ['a nooo nawet mów telepie p radę kochanie ogladamy d mamy wyjebane rosberga xd', '__negatywny__'], ['rafałek dopiero zaczynał nocniczka siusiać zenon jaskuła wygrywał etapy majtek biegał ktoś nagrywa', '__neutralny__'], ['tanie loty paryża z wrocławia oraz warszawy 88 pln w październiku polecane', '__pozytywny__'], ['no dajmy 410 ale za cenę schodzimy 110 chyba materiał lepszy focie', '__negatywny__'], ['tyde robi wykurwiste remixy', '__pozytywny__'], ['mikita strzelił kolejnego gola dolcanu dolcan 24 widzew', '__pozytywny__'], ['pieniazek żeby było kłamię te bagaże dostaliśmy blokadzie separatystów mh17 donbas ukrain

Validate if your dataloader works as expected (print single step)


In [None]:
for batch in dataloader_train:
  print(batch['input'])
  print(batch['target'])
  break

['pieniazek żeby było kłamię te bagaże dostaliśmy blokadzie separatystów mh17 donbas ukraina', 'rosyjskie służby poprosiły kijów o możliwość pomocy w akcji ratunkowej ria nowosti', 'rothna pewno w moskwie czarna skrzynka zostanie zmanipulowana i zamieniona było w przypadku katast…', 'super spotkanie hejlobuzy z gosia dzieki ze zaczepiałas', 'chyba koniec rewolucji w ataku gio wylocie podobno sampa zainteresowana', 'rosdziennikarka piszeże boeing prawdziwya cała tragedia spektakljak 1109 ślad cia', 'no w piździerniku 3 tygodnie w usa najs', 'dyktator pięknie ilustruje pan swoimi tłitami sekciarską naturę fanatyków korwinistów']
['__negatywny__', '__negatywny__', '__negatywny__', '__neutralny__', '__neutralny__', '__negatywny__', '__pozytywny__', '__negatywny__']
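
The targets in the batch above are Polish tags, while the task asks for the classes *Positive*, *Neutral* and *Negative*. One possible way to bridge the two is a small lookup table, sketched below. The names `LABEL_MAP` and `to_required_class` are hypothetical, and falling back to `Neutral` for unexpected strings is an assumption of this sketch, not a rule given in the task.

```python
LABEL_MAP = {
    "__pozytywny__": "Positive",
    "__neutralny__": "Neutral",
    "__negatywny__": "Negative",
}


def to_required_class(raw: str, default: str = "Neutral") -> str:
    """Map a dataset tag (or a generated string) to a required class name."""
    # Unknown strings fall back to `default`; this is a sketch-level assumption.
    return LABEL_MAP.get(raw.strip(), default)


print(to_required_class("__negatywny__"))   # Negative
print(to_required_class(" __pozytywny__ ")) # Positive
```

The same function can be applied either to the training targets before fine-tuning or to the generated strings at inference time.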


## Define the model
We wrap `T5ForConditionalGeneration` from Transformers in a PyTorch Lightning module.

In [None]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
import pytorch_lightning as pl
from torch.optim.lr_scheduler import MultiplicativeLR


class SentimentT5(pl.LightningModule):
    """SentimentT5 Model for sentiment analysis"""

    def __init__(
        self,
        lr,
        multiply_lr_step,
        warmup_steps,
        model_path,
        model_save_dir,
        model_load_dir,
    ):
        super().__init__()

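        self.lr = lr
        self.model_save_dir = model_save_dir
        self.model = T5ForConditionalGeneration.from_pretrained(model_path)
        self.tokenizer = T5Tokenizer.from_pretrained(model_path)
        self.warmup_steps = warmup_steps
        self.multiply_lr_step = multiply_lr_step
        # Optionally resume from weights saved earlier with torch.save(state_dict)
        if model_load_dir is not None:
            self.model.load_state_dict(torch.load(model_load_dir, map_location="cpu"))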


    def forward(self, input_sequences, output_sequences, **kwargs):
        input_sequences = [sequence for sequence in input_sequences]
        input_tokens = self.tokenizer(
            input_sequences,
            padding=True,
            truncation=False,
            return_tensors="pt",
        )
        input_ids = input_tokens.input_ids
        attention_mask = input_tokens.attention_mask

        target_encoding = self.tokenizer(
            output_sequences,
            padding=True,
            truncation=True,
        )
        
        labels = target_encoding.input_ids
        labels = [
            [
                (label if label != self.tokenizer.pad_token_id else -100)
                for label in labels_example
            ]
            for labels_example in labels
        ]
        labels = torch.tensor(labels)

        loss = self.model(
            input_ids=input_ids.to(self.device),
            attention_mask=attention_mask.to(self.device),
            labels=labels.to(self.device),
        ).loss
        return loss

    def training_step(self, batch, batch_idx):
        input_sequences, output_sequences = batch["input"], batch["target"]
        loss = self(input_sequences, output_sequences)
        self.log("loss", loss, batch_size=1)
        return {"loss": loss}

    def training_epoch_end(self, outputs):
        if self.trainer.global_step > 0:
            print("Saving model...")
            torch.save(self.model.state_dict(), self.model_save_dir)

    def validation_step(self, batch, batch_idx):
        input_sequences, output_sequences = batch["input"], batch["target"]
        loss = self(input_sequences, output_sequences)
        self.log("validation_loss", loss, batch_size=1)

    def validation_epoch_end(self, out):
        if self.trainer.global_step > 0:
            print("Saving model...")
            torch.save(self.model.state_dict(), self.model_save_dir)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)

        def lambd(epoch):
            return self.multiply_lr_step

        scheduler = MultiplicativeLR(optimizer, lr_lambda=lambd)
        return [optimizer], [scheduler]

    def optimizer_step(
        self,
        epoch,
        batch_idx,
        optimizer,
        optimizer_idx,
        optimizer_closure,
        on_tpu=False,
        using_native_amp=False,
        using_lbfgs=False,
    ):
        if self.trainer.global_step < self.warmup_steps:
            lr_scale = min(1.0, float(self.trainer.global_step + 1) / self.warmup_steps)
            for pg in optimizer.param_groups:
                pg["lr"] = lr_scale * self.lr

        optimizer.step(closure=optimizer_closure)
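
The label handling in `forward` replaces padding token ids with `-100`, the value that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss. Here is a framework-free sketch of that step; the token ids are invented for illustration and the helper name `mask_padding` is hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id):
    """Replace pad ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Hypothetical token ids; 0 plays the role of the pad token here.
batch = [[71, 12, 1, 0, 0], [55, 9, 33, 4, 1]]
print(mask_padding(batch, pad_token_id=0))
# [[71, 12, 1, -100, -100], [55, 9, 33, 4, 1]]
```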


## Training
Train the SentimentT5 model using the PyTorch Lightning Trainer. Find the parameters that give the best results.

Model parameters

In [None]:
lr = 0.01 #5e-5 2
multiply_lr_step = 0.1
warmup_steps = 0
model_path = 't5-base' # path to model from hugging face e.g. "t5-base"
model_save_dir = '/content/T5'
model_load_dir = None # can be used to load model from checkpoint
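
To get a feel for how `lr`, `multiply_lr_step` and `warmup_steps` interact, the sketch below reproduces the two schedules from `SentimentT5` as plain functions. It assumes the scheduler is stepped once per epoch and treats warm-up and decay separately, which is a simplification. The function names are made up for this example.

```python
def warmup_scale(step: int, warmup_steps: int) -> float:
    """Linear warm-up factor; 1.0 once warm-up is over or disabled."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)


def lr_at_epoch(base_lr: float, factor: float, epoch: int) -> float:
    """Learning rate after `epoch` multiplicative decay steps."""
    return base_lr * factor ** epoch


# Decay with the values above: the rate shrinks tenfold after every epoch.
print([lr_at_epoch(0.01, 0.1, e) for e in range(3)])

# Warm-up over 4 steps (with warmup_steps = 0 the factor is always 1.0).
print([warmup_scale(s, 4) for s in range(5)])  # [0.25, 0.5, 0.75, 1.0, 1.0]
```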

Trainer parameters

In [None]:
max_epochs = 3
gpus = [0]
progress_bar_refresh_rate = 50
accumulate_grad_batches = 5
log_every_n_steps = 5
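
Gradient accumulation changes how many samples contribute to each weight update. The sketch below works through the arithmetic for the values in this notebook. The figure of 1545 training samples follows from `round(1717 * 0.9)`, and the step count assumes that the final, incomplete group of batches still triggers an optimizer step. The function name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(n_train: int, batch_size: int, accumulate: int) -> dict:
    """Batches, effective batch size and optimizer steps for one epoch."""
    batches = math.ceil(n_train / batch_size)  # drop_last=False keeps the last partial batch
    return {
        "batches": batches,
        "effective_batch_size": batch_size * accumulate,
        "optimizer_steps": math.ceil(batches / accumulate),
    }


print(steps_per_epoch(1545, 8, 5))
# {'batches': 194, 'effective_batch_size': 40, 'optimizer_steps': 39}
```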

In [None]:
model = SentimentT5(lr=lr,
    multiply_lr_step=multiply_lr_step,
    warmup_steps=warmup_steps,
    model_path=model_path,
    model_save_dir=model_save_dir,
    model_load_dir=model_load_dir)

# Build one Trainer with all settings; creating a second pl.Trainer afterwards
# would discard the GPU and gradient-accumulation options.
trainer = pl.Trainer(
    max_epochs=max_epochs,
    gpus=gpus,
    progress_bar_refresh_rate=progress_bar_refresh_rate,
    accumulate_grad_batches=accumulate_grad_batches,
    log_every_n_steps=log_every_n_steps,
)

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [None]:
trainer.fit(model,dataloader_train, dataloader_val)

Missing logger folder: /content/lightning_logs

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

                not been set for this class (_ResultMetric). The property determines if `update` by
                default needs access to the full metric state. If this is not the case, significant speedups can be
                achieved and we recommend setting this to `False`.
                We provide an checking function
                `from torchmetrics.utilities import check_forward_no_full_state`
                that can be used to check if the `full_state_update=True` (old and potential slower behaviour,
                default for now) or if `full_state_update=False` can be used safely.
                


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Saving model...
Saving model...


Validation: 0it [00:00, ?it/s]

Saving model...
Saving model...
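
The task invites additional evaluation metrics. Validation loss alone says little about classification quality, so one option is to score predictions on the validation split with accuracy and macro-averaged F1. Below is a dependency-free sketch. The labels are toy values for illustration only, not results from this notebook, and the function names are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(classes)


# Toy labels for illustration only.
y_true = ["Positive", "Neutral", "Negative", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Negative"]
classes = ["Positive", "Neutral", "Negative"]
print(accuracy(y_true, y_pred))  # 0.75
print(round(macro_f1(y_true, y_pred, classes), 3))  # 0.6
```

Macro F1 weights all three classes equally, which matters when one class dominates the data.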


# Inference
Write your own inference code.
Documentation is available [here](https://huggingface.co/docs/transformers/model_doc/t5#inference)

In [None]:
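import pandas as pd
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration


test_df = pd.read_csv('test.csv')
# Assumption: test.csv stores tweets in a `text` column, like train.csv;
# otherwise fall back to the first column.
text_column = 'text' if 'text' in test_df.columns else test_df.columns[0]
texts = test_df[text_column].astype(str).tolist()
pred = list()            #list for predictions

# Training saved only a state_dict to `model_save_dir`, so rebuild the
# architecture from `model_path` and load the fine-tuned weights into it.
tokenizer = T5Tokenizer.from_pretrained(model_path)
t5 = T5ForConditionalGeneration.from_pretrained(model_path)
t5.load_state_dict(torch.load(model_save_dir, map_location='cpu'))
t5.eval()

# Apply the same cleaning that was used for the training tweets.
clean = dataloader_train.dataset.dataset.cleanTxt

# run model and generate predictions
with torch.no_grad():
  for text in texts:
    input_ids = tokenizer(clean(text), return_tensors="pt").input_ids
    outputs = t5.generate(input_ids)
    pred.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

# save predictions in two-column csv format: [tweet, predicted_class]
predictions = {'text': texts, 'label': pred}
df = pd.DataFrame(predictions)

df.to_csv('test_submission.csv', index=False)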


# Save test submission file

In [None]:
files.download('test_submission.csv')