# Sentiment classification of Nepali Tweets using pre-trained models

This notebook demonstrates how state-of-the-art pre-trained models can be used to classify the sentiment of English and non-English (in this case, Nepali) tweets from a specified user. Nepali tweets are first translated to English and then passed to the models.

Here, I have fetched the last 200 tweets of [K.P. Sharma Oli](https://twitter.com/kpsharmaoli) and classified their sentiment using three different models for a side-by-side comparison.

**Sentiment classification models used in this notebook:**
- [RoBERTa-base](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment)
- [Flair TextClassifier](https://github.com/flairNLP/flair)
- [TextBlob Polarity](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis)

## Specify the Twitter Username and Tweet Limit

In [1]:
user = 'kpsharmaoli'   # Username of the target account
limit = 200   # No. of tweets to fetch, newest first (user_timeline returns at most 200 per request)

## Import Dependencies

In [7]:
# for Pre-processing
import pandas as pd
from langdetect import detect
from googletrans import Translator
import re

# for Twitter API
import tweepy, configparser

# for Sentiment Analysis
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
from textblob import TextBlob

## Fetch data from Twitter

In [8]:
# API stuff
config = configparser.ConfigParser()
config.read('config.ini') # config.ini stores the API Keys

api_key = config['twitter']['api_key']
api_key_secret = config['twitter']['api_key_secret']

access_token = config['twitter']['access_token']
access_token_secret = config['twitter']['access_token_secret']

auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)
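
For reference, `config.ini` is expected to contain a `[twitter]` section with the four keys read above. The sketch below parses an equivalent layout from an in-memory string and reports missing keys before any authentication is attempted. The placeholder values and the helper name `load_twitter_keys` are assumptions made for illustration, not part of the original notebook.

```python
import configparser

# Placeholder contents mirroring the layout config.ini is assumed to have.
SAMPLE_CONFIG = """
[twitter]
api_key = YOUR_API_KEY
api_key_secret = YOUR_API_KEY_SECRET
access_token = YOUR_ACCESS_TOKEN
access_token_secret = YOUR_ACCESS_TOKEN_SECRET
"""

REQUIRED_KEYS = ("api_key", "api_key_secret", "access_token", "access_token_secret")


def load_twitter_keys(config_text):
    """Parse INI text and return the four Twitter credentials as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section("twitter"):
        raise KeyError("missing [twitter] section")
    section = parser["twitter"]
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise KeyError(f"missing keys in [twitter]: {missing}")
    return {key: section[key] for key in REQUIRED_KEYS}


keys = load_twitter_keys(SAMPLE_CONFIG)
print(sorted(keys))
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the API client.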

In [14]:
# fetching tweet data

tweets = api.user_timeline(screen_name=user,
    count = limit,
    tweet_mode='extended')

columns = ['Timestamp','Tweet']
data = []

for tweet in tweets:
    data.append([tweet.created_at, tweet.full_text])
kpTweets = pd.DataFrame(data, columns=columns)
print(f'Successfully fetched the last {len(kpTweets)} tweets of @{user}')   # report the number actually returned

Successfully fetched the last 200 tweets of @kpsharmaoli


## Preprocessing & Translation

In [15]:
translator = Translator()
def nptoen(text):
    output = translator.translate(text, src='ne', dest='en').text  # change src to process texts of languages other than Nepali
    return output
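
Because `googletrans` relies on an unofficial web endpoint, individual calls can fail or be rate-limited. A minimal sketch of a wrapper that caches results and falls back to a placeholder on error is shown below. The names `make_safe_translator` and `fake_translate` are hypothetical; `fake_translate` only stands in for `nptoen` so that the example runs offline.

```python
def make_safe_translator(translate_fn, fallback='NaN'):
    """Wrap a translate function with a cache and an error fallback."""
    cache = {}

    def translate(text):
        if text in cache:
            return cache[text]      # reuse the earlier result, no new request
        try:
            result = translate_fn(text)
        except Exception:
            return fallback         # failures are not cached, so a retry is possible
        cache[text] = result
        return result

    return translate


# Stand-in for nptoen (an assumption for this sketch); it records each call.
calls = []


def fake_translate(text):
    calls.append(text)
    if not text:
        raise ValueError("empty text")
    return text.upper()


safe_translate = make_safe_translator(fake_translate)
print(safe_translate("namaste"))   # translated via fake_translate
print(safe_translate("namaste"))   # served from the cache
print(safe_translate(""))          # falls back to the placeholder
print(len(calls))                  # number of real translate calls made
```

In the notebook, `make_safe_translator(nptoen)` would give a drop-in replacement for `nptoen` in the preprocessing loop.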

In [16]:
processed_tweets = []
translated_tweets = []
langs = []
for tweet in kpTweets['Tweet']:
    
    tweet2 = re.sub(r'^RT\s+', '', tweet)    # remove the old-style retweet prefix "RT"
    tweet2 = re.sub(r'https?://\S+', '', tweet2)    # remove hyperlinks, keeping any text that follows them
    tweet2 = re.sub(r'#', '', tweet2)    # remove only the '#' sign, keeping the hashtag word
    
    processed_tweets.append(tweet2)

#     print(tweet2)
    if len(tweet2)>5:
        lang = detect(tweet2)
    else:
        lang = 'N/A'
    
    if lang == 'ne':
        translated = nptoen(tweet2)
    elif lang == 'en':
        translated = tweet2
    else:
        translated = 'NaN'
    translated_tweets.append(translated)
    
    
    if len(tweet2)<2:
        langs.append('NaN')
        continue

    langs.append(lang)

kpTweets['Language'] = pd.Series(langs)
kpTweets['Processed'] = pd.Series(processed_tweets)
kpTweets['Translated'] = pd.Series(translated_tweets)
# Note: the 'NaN' and 'N/A' placeholders above are ordinary strings, so dropna() keeps those rows
kpTweets.dropna(inplace=True)
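
The cleaning steps in the loop above can also be collected in one helper, which makes them easy to try on sample strings. This is a sketch rather than the notebook's exact behaviour: it assumes each URL should be removed up to the next whitespace, and it additionally collapses leftover whitespace. `clean_tweet` is a hypothetical name.

```python
import re


def clean_tweet(text):
    """Remove the retweet marker, URLs, and '#' signs from a tweet."""
    text = re.sub(r'^RT\s+', '', text)          # leading old-style "RT "
    text = re.sub(r'https?://\S+', '', text)    # each URL, up to the next whitespace
    text = text.replace('#', '')                # keep the hashtag word, drop the sign
    return re.sub(r'\s+', ' ', text).strip()    # tidy leftover whitespace


sample = "RT #Nepal wins! https://t.co/abc123 Congratulations"
print(clean_tweet(sample))
print(repr(clean_tweet("https://t.co/qLwag1MSe6")))   # a link-only tweet becomes empty
```

A tweet that consists only of a link ends up as an empty string, which is why the loop checks the text length before calling `detect`.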

In [17]:
kpTweets.head()

| | Timestamp | Tweet | Language | Processed | Translated |
|---|---|---|---|---|---|
| 0 | 2022-09-19 02:32:53+00:00 | https://t.co/qLwag1MSe6 | | | |
| 1 | 2022-09-16 18:48:56+00:00 | महिला साफ च्याम्पियनसिप फुटबलको सेमिफाइनलमा सा... | ne | महिला साफ च्याम्पियनसिप फुटबलको सेमिफाइनलमा सा... | Hearty congratulations to the Nepali women's f... |
| 2 | 2022-09-13 14:45:24+00:00 | आउनुहोस् संविधान दिवसको उत्सवलाई यस पटक सँगसँग... | ne | आउनुहोस् संविधान दिवसको उत्सवलाई यस पटक सँगसँग... | Let's celebrate Constitution Day together this... |
| 3 | 2022-09-04 08:05:16+00:00 | नेकपा (एमाले)का उपाध्यक्ष कमरेड रामबहादुर थापा... | ne | नेकपा (एमाले)का उपाध्यक्ष कमरेड रामबहादुर थापा... | I am saddened by the news of the death of Nand... |
| 4 | 2022-08-21 02:33:06+00:00 | चिन्तनशील राजनीतिकर्मी, मित्र प्रदीप गिरिको नि... | ne | चिन्तनशील राजनीतिकर्मी, मित्र प्रदीप गिरिको नि... | I am saddened by the death of thoughtful polit... |


In [None]:
# kpTweets.to_csv('kpTweetsPreprocessed.csv')

## RoBERTa

This section uses [twitter-roBERTa-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment), a state-of-the-art NLP model trained on ~58M tweets and fine-tuned for sentiment analysis with the TweetEval benchmark.

It uses Meta AI's [RoBERTa](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/), a robustly optimized method for pretraining natural language processing (NLP) systems that improves on [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), the self-supervised method released by Google in 2018.

In [18]:
roberta = 'cardiffnlp/twitter-roberta-base-sentiment'

model = AutoModelForSequenceClassification.from_pretrained(roberta)

tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

In [19]:
def sentiment(tweet_proc):
    # Tokenize the tweet and return PyTorch tensors
    encoded_tweet = tokenizer(tweet_proc, return_tensors='pt')

    # Forward pass; output[0] holds the logits, one row per input
    output = model(**encoded_tweet)

    # Convert the logits of the single input into probabilities
    scores = softmax(output[0][0].detach().numpy())

    # Return the most likely label and its probability
    best = scores.argmax()
    return [labels[best], scores[best]]
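
The model returns one raw score (logit) per label; softmax turns these into probabilities that sum to 1, and the largest one is reported as the confidence. The sketch below reproduces that step in plain Python with made-up logits, purely for illustration. The notebook itself relies on `scipy.special.softmax`; `softmax_py` and `top_label` are hypothetical names.

```python
import math

LABELS = ['Negative', 'Neutral', 'Positive']


def softmax_py(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)                              # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax_py(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


example_logits = [-1.2, 0.3, 2.5]   # made-up values, not real model output
label, prob = top_label(example_logits)
print(label, round(prob, 3))
```

Note that the confidence is relative to the three labels only: a value near 0.5 means the model was undecided between classes, not that the tweet is half positive.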

In [20]:
sentiments = []
confidence = []

for tweet in kpTweets.Translated:
    senti = sentiment(tweet)
    sentiments.append(senti[0])
    confidence.append(senti[1])

kpTweets['RoBERTa-Sentiment'] = sentiments
kpTweets['RoBERTa-Confidence'] = confidence

In [21]:
kpTweets.head()

| | Timestamp | Tweet | Language | Processed | Translated | RoBERTa-Sentiment | RoBERTa-Confidence |
|---|---|---|---|---|---|---|---|
| 0 | 2022-09-19 02:32:53+00:00 | https://t.co/qLwag1MSe6 | | | | Neutral | 0.543908 |
| 1 | 2022-09-16 18:48:56+00:00 | महिला साफ च्याम्पियनसिप फुटबलको सेमिफाइनलमा सा... | ne | महिला साफ च्याम्पियनसिप फुटबलको सेमिफाइनलमा सा... | Hearty congratulations to the Nepali women's f... | Positive | 0.978982 |
| 2 | 2022-09-13 14:45:24+00:00 | आउनुहोस् संविधान दिवसको उत्सवलाई यस पटक सँगसँग... | ne | आउनुहोस् संविधान दिवसको उत्सवलाई यस पटक सँगसँग... | Let's celebrate Constitution Day together this... | Positive | 0.942668 |
| 3 | 2022-09-04 08:05:16+00:00 | नेकपा (एमाले)का उपाध्यक्ष कमरेड रामबहादुर थापा... | ne | नेकपा (एमाले)का उपाध्यक्ष कमरेड रामबहादुर थापा... | I am saddened by the news of the death of Nand... | Neutral | 0.601057 |
| 4 | 2022-08-21 02:33:06+00:00 | चिन्तनशील राजनीतिकर्मी, मित्र प्रदीप गिरिको नि... | ne | चिन्तनशील राजनीतिकर्मी, मित्र प्रदीप गिरिको नि... | I am saddened by the death of thoughtful polit... | Neutral | 0.465284 |


## TextBlob

[TextBlob](https://pypi.org/project/textblob/) is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

In our case, we'll only be using its [sentiment analysis](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis) feature by calculating the polarity and subjectivity.

The polarity score is a float within the range [-1.0, 1.0], where -1.0 is the most negative and 1.0 the most positive. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
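
Turning a polarity score into a label is a simple thresholding step. The sketch below assumes an optional neutral band around zero, controlled by a hypothetical `margin` parameter; with the default `margin=0.0` it follows the strict rule of positive above zero, negative below zero, and neutral at exactly zero.

```python
def polarity_to_label(polarity, margin=0.0):
    """Map a polarity in [-1.0, 1.0] to a sentiment label."""
    if polarity > margin:
        return 'Positive'
    if polarity < -margin:
        return 'Negative'
    return 'Neutral'


# Compare the strict rule with a wider neutral band (margin chosen arbitrarily).
for value in (0.472222, 0.0, -0.3, 0.1):
    print(value, polarity_to_label(value), polarity_to_label(value, margin=0.15))
```

A small non-zero margin keeps weakly polar texts, such as a score of 0.1, from being counted as clearly positive; whether that is desirable depends on the data.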

In [22]:
tb_sentiment = []   # not named `sentiment`, which would overwrite the sentiment() function defined above
polarity = []
subjectivity = []

for tweet in kpTweets.Translated:
    blob = TextBlob(tweet)   # build the TextBlob once and reuse it
    polar = blob.polarity
    subjec = blob.subjectivity

    # Label by the sign of the polarity score
    if polar > 0:
        senti = 'Positive'
    elif polar < 0:
        senti = 'Negative'
    else:
        senti = 'Neutral'

    tb_sentiment.append(senti)
    polarity.append(polar)
    subjectivity.append(subjec)

kpTweets['TB-Sentiment'] = tb_sentiment
kpTweets['TB-Polarity'] = polarity
kpTweets['TB-Subjectivity'] = subjectivity

In [23]:
kpTweets.head()

| | Timestamp | Tweet | Language | Processed | Translated | RoBERTa-Sentiment | RoBERTa-Confidence | TB-Sentiment | TB-Polarity | TB-Subjectivity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-09-19 02:32:53+00:00 | https://t.co/qLwag1MSe6 | | | | Neutral | 0.543908 | Neutral | 0.0 | 0.0 |
| 1 | 2022-09-16 18:48:56+00:00 | महिला साफ च्याम्पियनसिप फुटबलको सेमिफाइनलमा सा... | ne | महिला साफ च्याम्पियनसिप फुटबलको सेमिफाइनलमा सा... | Hearty congratulations to the Nepali women's f... | Positive | 0.978982 | Positive | 0.472222 | 0.488889 |
| 2 | 2022-09-13 14:45:24+00:00 | आउनुहोस् संविधान दिवसको उत्सवलाई यस पटक सँगसँग... | ne | आउनुहोस् संविधान दिवसको उत्सवलाई यस पटक सँगसँग... | Let's celebrate Constitution Day together this... | Positive | 0.942668 | Neutral | 0.0 | 0.0 |
| 3 | 2022-09-04 08:05:16+00:00 | नेकपा (एमाले)का उपाध्यक्ष कमरेड रामबहादुर थापा... | ne | नेकपा (एमाले)का उपाध्यक्ष कमरेड रामबहादुर थापा... | I am saddened by the news of the death of Nand... | Neutral | 0.601057 | Positive | 0.1 | 0.85 |
| 4 | 2022-08-21 02:33:06+00:00 | चिन्तनशील राजनीतिकर्मी, मित्र प्रदीप गिरिको नि... | ne | चिन्तनशील राजनीतिकर्मी, मित्र प्रदीप गिरिको नि... | I am saddened by the death of thoughtful polit... | Neutral | 0.465284 | Positive | 0.4 | 0.5 |


## Flair

[Flair](https://github.com/flairNLP/flair) is a state-of-the-art NLP framework developed by Humboldt University of Berlin. Here we load its pre-trained `en-sentiment` text classifier.

In [None]:
from flair.models import TextClassifier
from flair.data import Sentence
# from segtok.segmenter import split_single

classifier = TextClassifier.load('en-sentiment')

In [None]:
def getSentiment(tweet):
    text = Sentence(tweet)
    # stacked_embeddings.embed(text)
    classifier.predict(text)
    return [text.labels[0].value, text.labels[0].score]

In [None]:
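flair_sentiments = []
flair_confidence = []

for tweet in kpTweets.Translated:
    if isinstance(tweet, str):
        result = getSentiment(tweet)   # `result`, not `sentiment`, to avoid reusing the RoBERTa function's name
    else:
        result = ['NaN', 'NaN']   # placeholder for non-string (missing) values

    flair_sentiments.append(result[0])
    flair_confidence.append(result[1])

# Assign plain lists so values are matched by position rather than by index label
kpTweets['Flair-Sentiment'] = flair_sentiments
kpTweets['Flair-Confidence'] = flair_confidence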

In [None]:
kpTweets.to_csv('kpTweets-output.csv')
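
For the side-by-side comparison, the label columns can be compared pairwise. The sketch below works on plain lists with made-up labels. It compares labels case-insensitively on the assumption that Flair's `en-sentiment` classifier reports upper-case `POSITIVE`/`NEGATIVE` while the other two columns use title case. `agreement_rate` is a hypothetical helper, not part of any of the libraries used here.

```python
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of positions where two label lists agree, ignoring case."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a.lower() == b.lower() for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up labels for four tweets, one list per model.
predictions = {
    'RoBERTa': ['Positive', 'Positive', 'Neutral', 'Neutral'],
    'TextBlob': ['Positive', 'Neutral', 'Positive', 'Positive'],
    'Flair': ['POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE'],
}

for first, second in combinations(predictions, 2):
    rate = agreement_rate(predictions[first], predictions[second])
    print(f'{first} vs {second}: {rate:.2f}')
```

In the notebook, the same function could be called with `list(kpTweets['RoBERTa-Sentiment'])` and `list(kpTweets['TB-Sentiment'])`. Because the Flair classifier is assumed to have no neutral class, its agreement with the other two models is bounded by how often they predict Neutral.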