<a href="https://colab.research.google.com/github/clayton-summitt/w266-final/blob/main/XLM_T_Run_a_classifier_on_a_text_file.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installs and imports

In [1]:
!pip install --upgrade pip
!pip install sentencepiece
!pip install transformers

Collecting pip
  Downloading pip-21.3.1-py3-none-any.whl (1.7 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-21.3.1
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96
Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)

In [3]:
import os

import numpy as np
from scipy.special import softmax
from torch.utils.data import DataLoader
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
from google.colab import drive

# Mount Google Drive, then work from the project data directory.
# The absolute path keeps this cell safe to re-run.
drive.mount('/content/drive', force_remount=True)
os.chdir("/content/drive/MyDrive/vaccine/data/")

Mounted at /content/drive


In [11]:
os.listdir('fine_tune_sentimnet/results/best_model/')

['config.json', 'pytorch_model.bin', 'training_args.bin']

## Data

In [4]:
def preprocess(corpus):
  """Replace user mentions with '@user' and links with 'http' in each text."""
  outcorpus = []
  for text in corpus:
    new_text = []
    for t in text.split(" "):
      # a lone '@' is left alone; only real mentions are masked
      t = '@user' if t.startswith('@') and len(t) > 1 else t
      t = 'http' if t.startswith('http') else t
      new_text.append(t)
    outcorpus.append(" ".join(new_text))
  return outcorpus
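To make the effect of `preprocess` concrete, here is a small sketch that runs it on two made-up tweets. The function body is repeated only so the cell runs on its own; the sample strings are hypothetical and do not come from the dataset.

```python
def preprocess(corpus):
    """Copy of the notebook's preprocess, repeated so this cell is self-contained."""
    outcorpus = []
    for text in corpus:
        new_text = []
        for t in text.split(" "):
            t = '@user' if t.startswith('@') and len(t) > 1 else t
            t = 'http' if t.startswith('http') else t
            new_text.append(t)
        outcorpus.append(" ".join(new_text))
    return outcorpus


# Hypothetical tweets, invented for illustration only.
samples = [
    "@cdcgov says flu shots are available https://example.com/flu",
    "email me @ noon",
]
cleaned = preprocess(samples)
for line in cleaned:
    print(line)
```

Mentions and links collapse to the placeholder tokens, while a lone `@` is kept. Running `preprocess` on already cleaned text leaves it unchanged, so applying it twice is harmless.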

In [None]:
!wget https://raw.githubusercontent.com/cardiffnlp/xlm-t/main/data/sentiment/all/test_text.txt

In [5]:
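dataset_path = 'train.txt'
# one example per line; the context manager closes the file after reading
with open(dataset_path, encoding='utf-8') as f:
  dataset = f.read().split('\n')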
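One thing to watch when splitting on `'\n'`: if the file ends with a newline, the resulting list gets an empty string as its last element, and that empty string is then classified like any other example. Whether `train.txt` ends that way is not known here, so the following is only a sketch of an alternative loader. `load_lines` is a hypothetical helper, and the temporary file stands in for the real data.

```python
import os
import tempfile


def load_lines(path):
    """Read a text file into a list of lines, dropping empty ones."""
    with open(path, encoding="utf-8") as f:
        return [line for line in f.read().split("\n") if line]


# Hypothetical two-line file standing in for train.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first tweet\nsecond tweet\n")
    tmp_path = tmp.name

with open(tmp_path, encoding="utf-8") as f:
    raw = f.read().split("\n")   # plain split keeps the trailing ''
lines = load_lines(tmp_path)     # filtered version drops it
os.remove(tmp_path)

print(raw)
print(lines)
```

Dropping empty lines also shifts indices when blank lines occur in the middle of a file, so this only makes sense if the predictions do not need to stay aligned with another line-indexed file.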

In [6]:
# spot-check a few sample lines (the ones printed below are in English, Italian, Portuguese, and Spanish)
for example in [0,870,1740,2610,3480,4350,5220,6090]:
  print(dataset[example])

"ADPH investigating 44 possible flu related deaths"
"Per lei è più importante il costo del vaccino, e no…"
"investigators are closing in on a Global influenza pollen"
"Dourado evita falar de Flu e diz que não conversou com Corinthians via"
"Both condoms and Sanitary wear are a necessity to women. To think of it, HIV is an incurable *illness…"
"Actually, 9/11 did happen and Elvis really is dead: how the rise of conspiracy theories leads to vaccine skepticis…"
"So bad news, I got influenza"
"Aún con influenza me gusta que llueva"


## Model

In [44]:
CUDA = True # set to True if using a GPU (Runtime -> Change runtime type -> GPU)
BATCH_SIZE = 32
MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
# tokenizer and label names come from the base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
config = AutoConfig.from_pretrained(MODEL) # used for id to label name
# weights come from the fine-tuned checkpoint saved on Drive
model = AutoModelForSequenceClassification.from_pretrained('fine_tune_sentimnet/results/best_model/')
if CUDA:
  model = model.to('cuda')
_ = model.eval() # inference mode (disables dropout)

## Forward

In [18]:
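import torch

def forward(text, cuda=True):
  """Return class probabilities, shape (batch, num_labels), for a batch of texts."""
  text = preprocess(text)
  # max_length is passed explicitly: this tokenizer has no predefined maximum,
  # so truncation=True on its own does not truncate
  encoded_input = tokenizer(text, return_tensors='pt', padding=True,
                            truncation=True, max_length=512)
  if cuda:
    encoded_input = encoded_input.to('cuda')
  with torch.no_grad(): # inference only, no gradients needed
    output = model(**encoded_input)
  scores = output[0].cpu().numpy() # raw logits
  return softmax(scores, axis=-1) # logits -> probabilities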
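The last two steps of the pipeline, turning logits into probabilities and probabilities into a label, can be sketched without the model. The NumPy `softmax_rows` below is a stand-in for `scipy.special.softmax`, the logits are invented, and the `id2label` mapping is an assumption that mirrors the label names printed later in this notebook rather than something read from `config`.

```python
import numpy as np


def softmax_rows(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Assumed mapping, for illustration only.
id2label = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Made-up logits for a batch of two texts.
logits = np.array([[2.0, 0.5, -1.0],
                   [-0.3, 0.1, 1.7]])

probs = softmax_rows(logits)      # each row sums to 1
preds = probs.argmax(axis=-1)     # index of the most likely class per row
labels = [id2label[int(i)] for i in preds]
print(labels)
```

Because softmax preserves the ordering within each row, taking `argmax` over the logits would give the same labels. The probabilities matter only when the scores themselves are saved or compared.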

In [45]:
dl = DataLoader(dataset, batch_size=BATCH_SIZE) # yields lists of up to BATCH_SIZE strings
all_preds = []
all_scores = []
for idx,batch in enumerate(dl):
  print('Batch ',idx+1,' of ',len(dl))
  # forward() already applies preprocess(), so the raw batch is passed as-is
  scores = forward(batch, cuda=CUDA)
  all_scores.extend(scores)
  preds = np.argmax(scores, axis=-1) # most likely class per example
  all_preds.extend(preds)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Streaming output truncated to the last 5000 lines.
Batch  52272  of  57271
Batch  52273  of  57271
Batch  52274  of  57271
Batch  52275  of  57271
Batch  52276  of  57271
Batch  52277  of  57271
Batch  52278  of  57271
Batch  52279  of  57271
Batch  52280  of  57271
Batch  52281  of  57271
Batch  52282  of  57271
Batch  52283  of  57271
Batch  52284  of  57271
Batch  52285  of  57271
Batch  52286  of  57271
Batch  52287  of  57271
Batch  52288  of  57271
Batch  52289  of  57271
Batch  52290  of  57271
Batch  52291  of  57271
Batch  52292  of  57271
Batch  52293  of  57271
Batch  52294  of  57271
Batch  52295  of  57271
Batch  52296  of  57271
Batch  52297  of  57271
Batch  52298  of  57271
Batch  52299  of  57271
Batch  52300  of  57271
Batch  52301  of  57271
Batch  52302  of  57271
Batch  52303  of  57271
Batch  52304  of  57271
Batch  52305  of  57271
Batch  52306  of  57271
Batch  52307  of  57271
Batch  52308  of  57271
Batch  52309  of  57271
Batch  52310  of  57271

In [34]:
# spot-check the predicted labels on a few sample lines
for example in [0,870,1740,2610,3480,4350,5220,6090,10000,18000,29000,99000]:
  pred = all_preds[example]
  print(dataset[example], '--->', config.id2label[pred])

"ADPH investigating 44 possible flu related deaths" ---> Neutral
"Per lei è più importante il costo del vaccino, e no…" ---> Neutral
"investigators are closing in on a Global influenza pollen" ---> Negative
"Dourado evita falar de Flu e diz que não conversou com Corinthians via" ---> Negative
"Both condoms and Sanitary wear are a necessity to women. To think of it, HIV is an incurable *illness…" ---> Negative
"Actually, 9/11 did happen and Elvis really is dead: how the rise of conspiracy theories leads to vaccine skepticis…" ---> Negative
"So bad news, I got influenza" ---> Negative
"Aún con influenza me gusta que llueva" ---> Positive
"COP7FCTC The 4 BigPharma to WHO are GlaxoSmithKline Novartis Sanofi Pasteur and Merck are the leading vaccine manufacturer" ---> Neutral
"Zambia News - HIV Activist Kasune Challenges MPs to Disclose Status" ---> Neutral
"Improving estimates of district HIV prevalence and burden in South Africa using small area estimation techniques…" ---> Neutral
"Trump

In [46]:
# comparative predictions after fine-tuning
for example in [0,870,1740,2610,3480,4350,5220,6090,10000,18000,29000,99000]:
  pred = all_preds[example]
  print(dataset[example], '--->', config.id2label[pred])

"ADPH investigating 44 possible flu related deaths" ---> Negative
"Per lei è più importante il costo del vaccino, e no…" ---> Neutral
"investigators are closing in on a Global influenza pollen" ---> Neutral
"Dourado evita falar de Flu e diz que não conversou com Corinthians via" ---> Neutral
"Both condoms and Sanitary wear are a necessity to women. To think of it, HIV is an incurable *illness…" ---> Neutral
"Actually, 9/11 did happen and Elvis really is dead: how the rise of conspiracy theories leads to vaccine skepticis…" ---> Neutral
"So bad news, I got influenza" ---> Negative
"Aún con influenza me gusta que llueva" ---> Positive
"COP7FCTC The 4 BigPharma to WHO are GlaxoSmithKline Novartis Sanofi Pasteur and Merck are the leading vaccine manufacturer" ---> Neutral
"Zambia News - HIV Activist Kasune Challenges MPs to Disclose Status" ---> Neutral
"Improving estimates of district HIV prevalence and burden in South Africa using small area estimation techniques…" ---> Neutral
"Trump Wi

In [32]:
len(all_preds),scores.shape

(1832670, (30, 3))
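These numbers line up with the batch counter printed during the loop. As a quick check, assuming the batch size of 32 set in the Model section, the arithmetic can be worked through as follows. `scores` still holds the final batch, which is why its first dimension is the remainder rather than the batch size.

```python
import math

n_examples = 1832670   # len(all_preds) reported above
batch_size = 32        # BATCH_SIZE from the Model section

# The last, partial batch still counts as a batch, hence the ceiling.
n_batches = math.ceil(n_examples / batch_size)

# Examples left over for the final batch.
last_batch = n_examples - (n_batches - 1) * batch_size

print(n_batches, last_batch)
```

The result is 57271 batches with 30 examples in the last one, matching the "of 57271" progress lines and the `(30, 3)` shape.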

In [39]:
from numpy import save

In [43]:
save('baseline_sentiment_scores.npy',np.array(all_scores))

In [47]:
save('best_model_sentiment_scores.npy',np.array(all_scores))
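With both score arrays saved, the two runs can be compared beyond the handful of spot-checked lines. The following is a minimal sketch, not part of the original analysis: `compare_predictions` is a hypothetical helper, the label order is an assumption, and the small arrays are invented stand-ins for the saved `.npy` files.

```python
import numpy as np

# Assumed label order, for illustration only.
LABELS = ["Negative", "Neutral", "Positive"]


def compare_predictions(baseline_scores, tuned_scores):
    """Return the agreement rate and a 3x3 table of label transitions.

    transitions[i, j] counts examples labeled i by the baseline
    and j by the fine-tuned model.
    """
    base = np.argmax(baseline_scores, axis=-1)
    tuned = np.argmax(tuned_scores, axis=-1)
    agreement = float(np.mean(base == tuned))
    transitions = np.zeros((len(LABELS), len(LABELS)), dtype=int)
    np.add.at(transitions, (base, tuned), 1)  # accumulate repeated index pairs
    return agreement, transitions


# Made-up probabilities for four examples.
baseline = np.array([[0.1, 0.8, 0.1],
                     [0.7, 0.2, 0.1],
                     [0.6, 0.3, 0.1],
                     [0.1, 0.2, 0.7]])
tuned = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])

agreement, transitions = compare_predictions(baseline, tuned)
print(agreement)
print(transitions)
```

On the real data the inputs would be `np.load('baseline_sentiment_scores.npy')` and `np.load('best_model_sentiment_scores.npy')`. The off-diagonal cells of the table show which labels the fine-tuning shifted, for example Negative to Neutral.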