# Usage of TweetEval and Twitter-specific RoBERTa models

In this notebook we show how to perform tasks such as masked language modeling, computing tweet similarity, or tweet classification using our Twitter-specific RoBERTa models.

- Paper: [_TweetEval_ benchmark (Findings of EMNLP 2020)](https://arxiv.org/pdf/2010.12421.pdf)
- Authors: Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke and Leonardo Neves.
- [GitHub](https://github.com/cardiffnlp/tweeteval)


## Preliminaries

We define a function to normalize a tweet to the format we used for TweetEval. Note that preprocessing is minimal: user names are replaced with `@user` and links with `http`.

In [1]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
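
As a quick check of the normalization, the sketch below applies the function to a made-up tweet (the handle and URL are invented for illustration). The function is repeated so the snippet runs on its own.

```python
def preprocess(text):
    """Replace user mentions with '@user' and links with 'http'."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical tweet: the handle and link are made up for this example.
example = "@fede check this out https://example.com/page 😊"
normalized = preprocess(example)
print(normalized)  # @user check this out http 😊

# A bare "@" is not treated as a mention and is left unchanged.
bare_at = preprocess("meet me @ noon")
print(bare_at)  # meet me @ noon
```

Because the function splits on single spaces only, a mention or link attached to other characters (for example after a newline) would not be replaced.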

We only need to install one dependency: the `transformers` library.

In [2]:
!pip install transformers



## Computing Tweet Similarity

In [None]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np
from scipy.spatial.distance import cosine
from collections import defaultdict

MODEL = "cardiffnlp/twitter-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def get_embedding(text):
  text = preprocess(text)
  encoded_input = tokenizer(text, return_tensors='pt')
  features = model(**encoded_input)
  features = features[0].detach().cpu().numpy() 
  features_mean = np.mean(features[0], axis=0) 
  return features_mean

query = "The book was awesome"

tweets = ["I just ordered fried chicken 🐣", 
          "The movie was great", 
          "What time is the next game?", 
          "Just finished reading 'Embeddings in NLP'"]

d = defaultdict(int)
for tweet in tweets:
  sim = 1-cosine(get_embedding(query),get_embedding(tweet))
  d[tweet] = sim

print('Most similar to: ',query)
print('----------------------------------------')
for idx,x in enumerate(sorted(d.items(), key=lambda x:x[1], reverse=True)):
  print(idx+1,x[0])
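
The ranking above relies on `1 - cosine(u, v)`, which is the cosine similarity of the two mean-pooled embeddings. A minimal sketch of that quantity follows, using plain Python lists and made-up 3-dimensional vectors instead of real 768-dimensional model outputs (the vectors and the `cosine_similarity` helper are illustrative assumptions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings": invented numbers, not model outputs.
query_vec = [1.0, 0.0, 1.0]
candidates = {
    "same direction": [2.0, 0.0, 2.0],
    "partly aligned": [1.0, 1.0, 0.0],
    "orthogonal": [0.0, 3.0, 0.0],
}

# Rank candidates from most to least similar, as the loop above does for tweets.
ranked = sorted(candidates,
                key=lambda name: cosine_similarity(query_vec, candidates[name]),
                reverse=True)
for rank, name in enumerate(ranked, start=1):
    print(rank, name, round(cosine_similarity(query_vec, candidates[name]), 4))
```

Cosine similarity ignores vector length, which is why the vector pointing in the same direction as the query ranks first even though its magnitude differs.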

## Feature Extraction

In [None]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
text = "Good night 😊"
text = preprocess(text)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Pytorch
encoded_input = tokenizer(text, return_tensors='pt')
model = AutoModel.from_pretrained(MODEL)
features = model(**encoded_input)
features = features[0].detach().cpu().numpy() 
features_mean = np.mean(features[0], axis=0) 
#features_max = np.max(features[0], axis=0)

# # Tensorflow
# encoded_input = tokenizer(text, return_tensors='tf')
# model = TFAutoModel.from_pretrained(MODEL)
# features = model(encoded_input)
# features = features[0].numpy()
# features_mean = np.mean(features[0], axis=0) 
# #features_max = np.max(features[0], axis=0)

features_mean.shape

(768,)
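
The model returns one vector per token, and `np.mean(features[0], axis=0)` averages them into a single tweet vector; the commented-out `np.max` line takes the per-dimension maximum instead. The sketch below illustrates both pooling choices on a made-up 3-token, 4-dimensional matrix, with plain lists standing in for the NumPy array (the helper names are hypothetical):

```python
def mean_pool(token_vectors):
    """Average over tokens (axis 0), giving one value per dimension."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """Per-dimension maximum over tokens (axis 0)."""
    return [max(column) for column in zip(*token_vectors)]

# Invented values: 3 tokens x 4 dimensions (the real model uses 768 dimensions).
token_vectors = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 0.0, 0.0, -1.0],
    [2.0, 3.0, 1.0, -4.0],
]

pooled_mean = mean_pool(token_vectors)
pooled_max = max_pool(token_vectors)
print(pooled_mean)  # [2.0, 1.0, 1.0, -2.0]
print(pooled_max)   # [3.0, 3.0, 2.0, -1.0]
```

Either way the result has one entry per hidden dimension, regardless of how many tokens the tweet contains.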

## Masked language modeling

Use Twitter-RoBERTa-base to predict words in context using the `fill-mask` pipeline in `transformers`.

In [None]:
from transformers import pipeline, AutoTokenizer
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def print_candidates(candidates):
    # Print the top five predictions with their scores
    for i in range(5):
        token = tokenizer.decode(candidates[i]['token'])
        score = np.round(candidates[i]['score'], 4)
        print(f"{i+1}) {token} {score}")

texts = [
 "I am so <mask> 😊",
 "I am so <mask> 😢"
]
for text in texts:
    t = preprocess(text)
    print(f"{'-'*30}\n{t}")
    candidates = fill_mask(t)
    print_candidates(candidates)

------------------------------
I am so <mask> 😊
1)  happy 0.402
2)  excited 0.1441
3)  proud 0.143
4)  grateful 0.0669
5)  blessed 0.0334
------------------------------
I am so <mask> 😢
1)  sad 0.2641
2)  sorry 0.1605
3)  tired 0.138
4)  sick 0.0278
5)  hungry 0.0232
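
The `fill-mask` pipeline returns a list of dictionaries, each carrying (among other keys) a `score` and a decoded `token_str`. Assuming that structure, candidates can be printed without loading a separate tokenizer. The sketch below uses a hand-written list that mimics the first three predictions shown above rather than calling the model, and `format_candidates` is a hypothetical helper:

```python
def format_candidates(candidates, top_k=5):
    """Format fill-mask style predictions as '<rank>) <token> <score>' lines."""
    lines = []
    for rank, cand in enumerate(candidates[:top_k], start=1):
        token = cand["token_str"].strip()  # drop any leading space on the token
        score = round(cand["score"], 4)
        lines.append(f"{rank}) {token} {score}")
    return lines

# Hand-written stand-in for pipeline output, copied from the scores printed above.
mock_candidates = [
    {"token_str": " happy", "score": 0.402},
    {"token_str": " excited", "score": 0.1441},
    {"token_str": " proud", "score": 0.143},
]

formatted = format_candidates(mock_candidates)
for line in formatted:
    print(line)
```

Slicing with `candidates[:top_k]` also avoids an index error if the pipeline is configured to return fewer than five predictions.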


## Use TweetEval Classifiers

We currently provide the following fine-tuned models for different tweet classification tasks:

- emoji prediction (`emoji`)
- emotion detection (`emotion`)
- hate speech detection (`hate`)
- irony detection (`irony`)
- offensive language identification (`offensive`)
- sentiment analysis (`sentiment`)
- _(coming soon)_ stance detection (`stance`) with 5 targets (`abortion`, `atheism`, `climate`, `feminist`, `hillary`), for example: `stance-abortion`
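
Each classifier is published under a name that follows the pattern used in the next cell, `cardiffnlp/twitter-roberta-base-{task}`. A small sketch that builds these identifiers for the tasks listed above (the `model_name` helper is hypothetical, and the stance names simply follow the `stance-abortion` example, so they may not resolve until those models are released):

```python
TASKS = ["emoji", "emotion", "hate", "irony", "offensive", "sentiment"]
STANCE_TARGETS = ["abortion", "atheism", "climate", "feminist", "hillary"]

def model_name(task):
    """Build the Hugging Face model identifier for a TweetEval task."""
    return f"cardiffnlp/twitter-roberta-base-{task}"

# Stance models are named per target, e.g. "stance-abortion".
stance_tasks = [f"stance-{target}" for target in STANCE_TARGETS]

all_models = [model_name(task) for task in TASKS + stance_tasks]
for name in all_models:
    print(name)
```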


In [4]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

task='emotion'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [5]:
# download label mapping
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]
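
The parsing step assumes `mapping.txt` holds one tab-separated `index<TAB>label` pair per line. The sketch below applies the same `csv.reader` logic to an in-memory string, so it runs without network access; the sample text is reconstructed from the emotion labels printed later in this notebook, not downloaded:

```python
import csv

# Assumed file content for the emotion task (index, tab, label).
sample_mapping = "0\tanger\n1\tjoy\n2\toptimism\n3\tsadness\n"

def parse_labels(mapping_text):
    """Return labels in file order, skipping blank or malformed lines."""
    rows = csv.reader(mapping_text.split("\n"), delimiter="\t")
    return [row[1] for row in rows if len(row) > 1]

labels = parse_labels(sample_mapping)
print(labels)  # ['anger', 'joy', 'optimism', 'sadness']
```

The `len(row) > 1` filter is what discards the empty string left after the final newline.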

In [8]:
# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)

# text = "Good night 😊"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

In [174]:
text = "@fede Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
ranking = np.argsort(scores)
print(scores)
print(ranking)
print(labels)

[0.01050839 0.9325733  0.02990752 0.02701083]
[0 3 2 1]
['anger', 'joy', 'optimism', 'sadness']
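
`softmax` turns the raw logits into probabilities that sum to 1, and `np.argsort` orders the label indices from lowest to highest score, so the last index is the predicted label. A pure-Python sketch of both steps follows; the logits are invented for illustration, and the second part reuses the scores printed above:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argsort(values):
    """Indices that sort values in ascending order (like np.argsort)."""
    return sorted(range(len(values)), key=lambda i: values[i])

labels = ['anger', 'joy', 'optimism', 'sadness']

# Invented logits, only to show that softmax yields a probability distribution.
probs = softmax([-1.0, 3.0, 0.5, 0.2])
print(round(sum(probs), 6))

# Scores printed above for "@user Good night 😊".
scores = [0.01050839, 0.9325733, 0.02990752, 0.02701083]
ranking = argsort(scores)
print(ranking)               # [0, 3, 2, 1]
print(labels[ranking[-1]])   # joy
```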


## New code

In [205]:
import pandas as pd
df = pd.DataFrame({'col1':["@fede Good night 😊", "I hate night 😖"]})
df

| | col1 |
|---|---|
| 0 | @fede Good night 😊 |
| 1 | I hate night 😖 |


In [211]:
emo_df = pd.DataFrame(0.0, index=df.index, columns=labels)
df['max_em'] = ''
for i in range(df.shape[0]):
    # Note: the raw text is used here (no preprocess()), so the scores differ
    # slightly from the single-tweet example above.
    text = df.col1[i]
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    ranking = np.argsort(scores)
    emo_df.iloc[i] = scores
    # Index of the highest score; named `top` to avoid shadowing the built-in max()
    top = ranking[-1]
    # Assign through .loc to avoid pandas' chained-assignment warning
    df.loc[i, 'max_em'] = labels[top]
emo_df


| | anger | joy | optimism | sadness |
|---|---|---|---|---|
| 0 | 0.011716 | 0.927626 | 0.029789 | 0.030869 |
| 1 | 0.187211 | 0.01581 | 0.00893 | 0.788049 |


In [194]:
df3 = pd.concat([df, emo_df], axis=1)
df3

| | col1 | max_em | anger | joy | optimism | sadness |
|---|---|---|---|---|---|---|
| 0 | @fede Good night 😊 | joy | 0.011716 | 0.927626 | 0.029789 | 0.030869 |
| 1 | I hate night 😖 | sadness | 0.187211 | 0.01581 | 0.00893 | 0.788049 |


In [206]:
# Label order: anger, joy, optimism, sadness
def my_fun(text):
    """Return the softmax emotion scores for one tweet as a 1-D array."""
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    return softmax(scores)

In [210]:
# One row of scores per tweet, one column per label
emo_df = pd.DataFrame(np.vstack([my_fun(t) for t in df.col1]),
                      index=df.index, columns=labels)
# Name of the highest-scoring label for each tweet
df['max_em'] = emo_df.idxmax(axis=1)

In [196]:
df

| | col1 | max_em |
|---|---|---|
| 0 | @fede Good night 😊 | joy |
| 1 | I hate night 😖 | sadness |


In [33]:
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

1) sadness 0.788
2) anger 0.1872
3) joy 0.0158
4) optimism 0.0089
