## Sentiment Analysis

In the following code we apply Twitter-RoBERTa-base for Sentiment Analysis.
This is a RoBERTa-base model trained on about 58 million tweets and fine-tuned for sentiment analysis with the TweetEval benchmark, proposed in “TWEETEVAL: Unified Benchmark and Comparative Evaluation for Tweet Classification” (Barbieri et al., 2020).

In [None]:
pip install transformers scipy

In [1]:
# Run this cell only once in the entire notebook; running it twice raises an error. If needed, interrupt the runtime and rerun everything.
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [2]:
# Preprocessing functions
def preprocess(text):
  new_text = []
  for t in text.split(" "):
    t = 'http' if t.startswith('http') else t
    new_text.append(t)
  return " ".join(new_text)

import re

def preprocess1(text):
   cleaned_text = re.sub(r'http[s]?://\S+', '', text)
   cleaned_text_no_specials = re.sub(r'[^a-zA-Z0-9\s]', '', cleaned_text)
   return cleaned_text_no_specials
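
The difference between the two preprocessing functions is easiest to see on a single string. The snippet below is a minimal sketch: it repeats both definitions so that it runs on its own, and the `sample` text and its URL are made up for illustration. `preprocess` keeps punctuation and swaps each URL for the placeholder `http`, while `preprocess1` drops URLs and every character that is not a letter, digit or whitespace, which is why apostrophes disappear from the lyrics later on ("It's" becomes "Its").

```python
import re

def preprocess(text):
    # Replace every whitespace-separated token starting with 'http' by 'http'
    return " ".join('http' if t.startswith('http') else t for t in text.split(" "))

def preprocess1(text):
    # Remove URLs, then remove everything except letters, digits and whitespace
    cleaned_text = re.sub(r'http[s]?://\S+', '', text)
    return re.sub(r'[^a-zA-Z0-9\s]', '', cleaned_text)

# Hypothetical input, not taken from the dataset
sample = "Check this out: https://example.com/song it's great!"

out_placeholder = preprocess(sample)
out_stripped = preprocess1(sample)

print(repr(out_placeholder))  # "Check this out: http it's great!"
print(repr(out_stripped))     # 'Check this out  its great' (two spaces where the URL was)
```

Note that `preprocess1` leaves a double space where the URL used to be; this is harmless for the tokenizer but visible if the cleaned text is inspected directly.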

In [3]:
# Truncate overly long posts to at most max_len whitespace-separated words
def truncate_doc(doc, max_len=300):
  tokens = doc.split()
  truncated_tokens = tokens[:max_len]
  truncated_doc = " ".join(truncated_tokens)
  return truncated_doc

# Score a single text: returns a dict with the text and one score per label
def score_text(text, model, labels):
  encoded_input = tokenizer(text, return_tensors='pt')
  output = model(**encoded_input)
  scores = output[0][0].detach().numpy()
  scores = softmax(scores)
  d = {"text": text}
  # Visit the labels from the highest score to the lowest
  ranking = np.argsort(scores)[::-1]
  for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    d[f"{l}"] = f'{np.round(float(s),4)}'
  return d

# Run the sentiment analysis over a list of texts
def sentiment_analysis(texts):
  # Download the mapping from class index to label name
  mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
  with urllib.request.urlopen(mapping_link) as f:
      html = f.read().decode('utf-8').split("\n")
      csvreader = csv.reader(html, delimiter='\t')
  labels = [row[1] for row in csvreader if len(row) > 1]

  model = AutoModelForSequenceClassification.from_pretrained(MODEL)
  # Save a local copy of the model in a directory named after MODEL
  model.save_pretrained(MODEL)
  res = []
  for text in texts:
    try:
      res.append(score_text(text, model, labels))
    except Exception:
      # Scoring the full text failed (typically because it is too long): retry on a truncated copy
      try:
        text = truncate_doc(text)
        res.append(score_text(text, model, labels))
      except Exception:
        d = {"text": text, "labels":"Error length"}
        res.append(d)
      # Printed once for each text that went through the truncation fallback
      print("done")
  return res


## Code

In [4]:
#load data
import pandas as pd
df1 = pd.read_csv("df_downsampled_10000.csv")

In [5]:
df1

Unnamed: 0,title,tag,artist,year,lyrics,Label
0,Killagram,rap,Esham,1996,It's the Mr. Unholy sinster Man I murdered y...,0
1,Riding and Surviving,rap,Prime Minister,2000,"I'm kickin this for my Lord and Savior, Jesus...",0
2,Win the G,rap,O.C.,1997,"O.C. ft. Bumpy Knuckles - “Win the G” Yo,...",0
3,Old to the New,rap,Nice & Smooth,1994,Artist: Nice & Smooth Album: Jewel of the Nil...,0
4,Well Never Stop,rap,Rakim,1999,We love to flirt To chase the skirts Get to k...,0
...,...,...,...,...,...,...
11075,Satisfy You,hip hop,Puff Daddy featuring R. Kelly,1999,All I want is somebody who's gonna love me for...,1
11076,Big Pimpin',hip hop,Jay-Z featuring UGK,2000,"Uh, uh, uh, uhIt's big pimpin', baby It's big ...",1
11077,Forgot About Dre,hip hop,Dr. Dre featuring Eminem,2000,"Y'all know me, still the same OGBut I been low...",1
11078,The Next Episode,hip hop,Dr. Dre featuring Snoop Dogg,2000,La-da-da-da-dahIt's the motherfuckin' D-O-doub...,1


In [6]:
df1['lyrics'] = df1.lyrics.apply(preprocess1)

In [7]:
texts = df1['lyrics'].tolist()

In [8]:
import pandas as pd
result = sentiment_analysis(texts)

done


In [11]:
sentiment = pd.DataFrame(result)
sentiment.to_csv("sentiment_scores.csv")

The model evaluates the song lyrics according to three labels: negative, neutral and positive. For
each document and each label it assigns a score from 0 to 1. These scores are the model’s confidence
levels: the higher the score, the more likely the model considers the text to express that label.
Because the scores come from a softmax over the three labels, they sum to one up to rounding (for
example, row 2 below sums to 0.9999) and can generally be interpreted as probabilities.
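
The normalisation applied inside `sentiment_analysis` can be illustrated with a small standalone sketch. It uses only the standard library rather than `scipy.special.softmax`, and the logits below are hypothetical values chosen for illustration, not real model output; `softmax_sketch` is likewise an illustrative name.

```python
import math

def softmax_sketch(logits):
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["negative", "neutral", "positive"]
logits = [2.0, 0.5, -1.0]  # hypothetical raw model outputs for one text

scores = softmax_sketch(logits)
score_sum = sum(scores)                        # equals 1 up to floating-point error
rounded = [round(s, 4) for s in scores]        # same 4-decimal rounding as the notebook
top_label = labels[scores.index(max(scores))]  # label with the highest score

print(dict(zip(labels, rounded)), score_sum, top_label)
```

Rounding each score to four decimals, as the notebook does before storing it, is what can make a row add up to 0.9999 or 1.0001 instead of exactly 1.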

In [12]:
sentiment

Unnamed: 0,text,negative,neutral,positive
0,Its the Mr Unholy sinster Man I murdered your ...,0.8366,0.1525,0.0109
1,Im kickin this for my Lord and Savior Jesus Ch...,0.108,0.7115,0.1805
2,OC ft Bumpy Knuckles Win the G Yo OC are you r...,0.409,0.5169,0.074
3,Artist Nice Smooth Album Jewel of the Nile Son...,0.0734,0.573,0.3536
4,We love to flirt To chase the skirts Get to k...,0.3223,0.5714,0.1063
...,...,...,...,...
11075,All I want is somebody whos gonna love me for ...,0.1174,0.5791,0.3035
11076,Uh uh uh uhIts big pimpin baby Its big pimpin ...,0.4372,0.458,0.1048
11077,Yall know me still the same OGBut I been lowke...,0.6672,0.2988,0.0341
11078,LadadadadahIts the motherfuckin DOdoubleG Lada...,0.5132,0.4103,0.0765


In [13]:
concatenated_df = pd.concat([df1, sentiment], axis=1)
concatenated_df

Unnamed: 0,title,tag,artist,year,lyrics,Label,text,negative,neutral,positive
0,Killagram,rap,Esham,1996,Its the Mr Unholy sinster Man I murdered you...,0,Its the Mr Unholy sinster Man I murdered your ...,0.8366,0.1525,0.0109
1,Riding and Surviving,rap,Prime Minister,2000,Im kickin this for my Lord and Savior Jesus C...,0,Im kickin this for my Lord and Savior Jesus Ch...,0.108,0.7115,0.1805
2,Win the G,rap,O.C.,1997,OC ft Bumpy Knuckles Win the G Yo OC are...,0,OC ft Bumpy Knuckles Win the G Yo OC are you r...,0.409,0.5169,0.074
3,Old to the New,rap,Nice & Smooth,1994,Artist Nice Smooth Album Jewel of the Nile S...,0,Artist Nice Smooth Album Jewel of the Nile Son...,0.0734,0.573,0.3536
4,Well Never Stop,rap,Rakim,1999,We love to flirt To chase the skirts Get to k...,0,We love to flirt To chase the skirts Get to k...,0.3223,0.5714,0.1063
...,...,...,...,...,...,...,...,...,...,...
11075,Satisfy You,hip hop,Puff Daddy featuring R. Kelly,1999,All I want is somebody whos gonna love me for ...,1,All I want is somebody whos gonna love me for ...,0.1174,0.5791,0.3035
11076,Big Pimpin',hip hop,Jay-Z featuring UGK,2000,Uh uh uh uhIts big pimpin baby Its big pimpin ...,1,Uh uh uh uhIts big pimpin baby Its big pimpin ...,0.4372,0.458,0.1048
11077,Forgot About Dre,hip hop,Dr. Dre featuring Eminem,2000,Yall know me still the same OGBut I been lowke...,1,Yall know me still the same OGBut I been lowke...,0.6672,0.2988,0.0341
11078,The Next Episode,hip hop,Dr. Dre featuring Snoop Dogg,2000,LadadadadahIts the motherfuckin DOdoubleG Lada...,1,LadadadadahIts the motherfuckin DOdoubleG Lada...,0.5132,0.4103,0.0765


In [14]:
concatenated_df.to_csv("df_downsampled_sent_10000.csv")