# Sentiment Analysis
Comments and suggestions: `fernando.batista@iscte-iul.pt`

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fmmb/ADD_In-2023/blob/main/sa-strategies.ipynb)

Table of contents
* [Setup](#intro)
* [Approach 1: Using existing NLP tools](#approach1)
* [Approach 2: Using a sentiment lexicon](#approach2)
* [Approach 3: Using pre-trained transformer-based models](#approach3)
* [Approach 4: Machine Learning - Training your own classifier from scratch](#approach4)
* [Approach 5: ChatGPT-like models](#approach5)

If you are using Google Colab, please check the [instructions on how to use your files in Google Colab](README.md#files-in-google-colab).

## Setup

In [None]:
from textblob import TextBlob
import pandas as pd
import nltk
if 'google.colab' in str(get_ipython()):
    nltk.download('punkt')

<p id="approach1"></p>

## Approach 1: Using existing NLP tools
There are many existing tools for sentiment analysis, such as `TextBlob` and VADER (`vaderSentiment`), amongst others. This section presents examples of how to use these two tools.

### TextBlob
`TextBlob` will be used to perform our initial sentiment analysis tests.
Please consider the following list of texts, stored in the variable `texts`.

In [None]:
texts = [ "The movie was good.",
          "I hate the movie",
          "The movie was not good.",
          "I really think this product sucks.",
          "Really great product.",
          "I don't like this product"]

We can now print each of the individual texts, together with its sentiment score. A value above 0 means that the sentiment is positive, while a value below 0 means that it is negative.

In [None]:
for t in texts:
    print(t, "==>", TextBlob(t).sentiment.polarity)

The previous code assumes that the text is already split into sentences, which may not be the case for texts coming from sources such as *web pages* or *blogs*. However, you can give the whole text to `TextBlob`, which automatically splits it into sentences, as follows.

In [None]:
mytext = """The movie was good. The movie was not good. I really think this product sucks.
Really great product. I don't like this product."""
text=TextBlob(mytext)

In [None]:
for s in text.sentences:
    print(s, "==> ", s.sentiment.polarity)

### Evaluation

Let's automatically classify a set of texts and evaluate the corresponding performance. 

To do that, we need to provide the true labels together with the data. Note, however, that the classification system never sees the true labels; they are used only to check its predictions.

In [None]:
texts = ["I love chocolate",
        "I hate to eat", 
        "I don't love anyone",
        "I like cakes",
        "I don't fail"]
tags=["pos", "neg", "neg", "pos", "pos"]

Let's do a small test with one of these examples...

In [None]:
print("TEXT:", texts[0])
print("SENTIMENT:", TextBlob(texts[0]).sentiment.polarity)

In [None]:
correct=0
incorrect=0
for i in range(len(texts)):
    polarity = TextBlob(texts[i]).sentiment.polarity
    print(f"SCORE: {polarity:4}, TAG: {tags[i]}, TEXT: {texts[i]}")
    if polarity >=0 and tags[i] == "pos":
        correct +=1
    elif polarity <0 and tags[i] == "neg":
        correct += 1
    else:
        incorrect += 1
accuracy=(correct)/(correct+incorrect)
print(f"correct: {correct}, incorrect: {incorrect}, accuracy: {accuracy}")
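
The evaluation loop above mixes scoring, thresholding, and counting. A minimal sketch that separates these steps is shown below. It assumes the polarity scores have already been computed; the helper names `label_from_polarity` and `accuracy` are hypothetical, and the example scores are made up for illustration. Like the loop above, it treats a polarity of exactly 0 as positive.

```python
def label_from_polarity(polarity, threshold=0.0):
    """Map a polarity score to "pos"/"neg"; a score equal to the threshold counts as "pos"."""
    return "pos" if polarity >= threshold else "neg"


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Made-up polarity scores, used only to illustrate the calculation
example_scores = [0.5, -0.8, 0.5, 0.0, -0.5]
example_tags = ["pos", "neg", "neg", "pos", "pos"]

predicted = [label_from_polarity(s) for s in example_scores]
print(predicted)
print("accuracy:", accuracy(predicted, example_tags))
```

Keeping the threshold as a parameter makes it easy to check how sensitive the accuracy is to the decision boundary.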

### Vader Sentiment

In [None]:
!pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sa = SentimentIntensityAnalyzer()

texts = [
    "This food is amazing and tasty !",
    "Exoplanets are planets outside the solar system",
    "It is sad to see such a bad behavior"
]

for text in texts:
    score = sa.polarity_scores(text)["compound"]
    print('TEXT:', text)
    print('  SENTIMENT:', score)

Besides the `compound` score, `polarity_scores` also returns the proportion of the text that falls into each category, under the keys `"pos"`, `"neu"` and `"neg"`.

In [None]:
for text in texts:
    scores = sa.polarity_scores(text)
    print('TEXT:', text)
    print('  SCORES:', scores)
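
The `compound` value is a single normalized score between -1 and 1. The VADER documentation suggests cut-offs of 0.05 and -0.05 for turning it into a positive, neutral, or negative label. A small sketch of that rule follows; the function name `vader_label` is hypothetical, and the values passed to it are made up, not actual VADER output.

```python
def vader_label(compound, cutoff=0.05):
    """Turn a compound score into a three-way label using symmetric cut-offs."""
    if compound >= cutoff:
        return "pos"
    if compound <= -cutoff:
        return "neg"
    return "neu"


# Illustrative compound values only (not produced by VADER)
for value in (0.8, 0.0, -0.7):
    print(value, "=>", vader_label(value))
```

In the notebook, the function could be applied to `sa.polarity_scores(text)["compound"]` for each text.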

### Flair (optional)

<p id="approach2"></p>

## Approach 2: Applying a sentiment lexicon

Let's start by reading the sentiment lexicon

In [None]:
lexname="https://raw.githubusercontent.com/fmmb/ADD_In-2023/main/sample-lexicon.csv"
data = pd.read_csv(lexname, encoding="utf-8", index_col=["English"])

In [None]:
data.sample(5)

We will be using the dictionary `lex` instead of the dataframe, since dictionary lookups are much faster.

In [None]:
lex = data['polarity'].to_dict()

Let's test some well-known words

In [None]:
for w in ['good', 'ok', 'dislike', 'bad']:
    print(f"WORD: {w}, POLARITY: {lex.get(w)}")

Now, to calculate the sentiment of a sentence, we only have to sum the polarity of each individual word, and check the final result.

In [None]:
text = 'This is a lovely and beautiful place , but I hate the corner'
polarity = 0
for w in text.split():
    polarity += lex.get(w, 0)
    print(w, lex.get(w, 0) )
print("Final score:" , polarity)

Even better with a function ...

In [None]:
def sentiment(text):
    score = 0
    for w in text.split():
        score += lex.get(w, 0)
    if score >= 0:
        return "POS"
    else:
        return "NEG"

In [None]:
text = 'This is a lovely and beautiful place , but I hate the corner'
sentiment(text)

We could now process all our texts again with our classifier

In [None]:
for text in texts:
    print(text, "=>", sentiment(text))
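
The `sentiment` function above relies on `text.split()`, so it only matches words that are separated from punctuation and written in the same case as the lexicon entries. It also returns "POS" when no lexicon word is found. The sketch below shows one possible way to relax both points. It uses a tiny hypothetical lexicon (`toy_lex`, with made-up polarity values) rather than the lexicon loaded above, and a simple regular-expression tokenizer; all names in it are assumptions.

```python
import re

# Hypothetical mini-lexicon; the polarity values are made up for illustration
toy_lex = {"lovely": 1, "beautiful": 1, "amazing": 1, "hate": -1, "sad": -1, "bad": -1}


def tokenize(text):
    """Lowercase the text and keep runs of letters (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_sentiment(text, lexicon):
    """Sum the word polarities and return (label, score), with a neutral class."""
    score = sum(lexicon.get(token, 0) for token in tokenize(text))
    if score > 0:
        return "POS", score
    if score < 0:
        return "NEG", score
    return "NEU", score


print(lexicon_sentiment("This is a Lovely and beautiful place, but I hate the corner", toy_lex))
print(lexicon_sentiment("Exoplanets are planets outside the solar system", toy_lex))
```

Passing the lexicon as an argument means the same function can be called with `lex` once the real lexicon has been loaded.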

<p id="approach3"></p>

## Approach 3: Using pre-trained transformer-based models

### The easy way ...

In [None]:
!pip install -q transformers

In [None]:
from transformers import pipeline

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis")
#sentiment_pipeline = pipeline("sentiment-analysis", model="finiteautomata/bertweet-base-sentiment-analysis")

In [None]:
data = ["I love you", "I hate you", "the unemployment is increasing"]

In [None]:
sentiment_pipeline(data)
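
The pipeline returns one dictionary per input text, with a `label` and a confidence `score` between 0 and 1. To compare these results with the polarity scores of the previous approaches, one option is to fold both fields into a single signed value. The sketch below assumes that output shape and label names starting with "POS" or "NEG" (label names differ between models). The helper `signed_score` is hypothetical, and `example_output` contains made-up values, not real model output.

```python
def signed_score(result):
    """Convert {'label': ..., 'score': ...} into a value in [-1, 1]."""
    label = result["label"].upper()
    if label.startswith("POS"):
        return result["score"]
    if label.startswith("NEG"):
        return -result["score"]
    return 0.0  # any other label (for example a neutral class) is mapped to 0


# Made-up values that only mimic the shape of the pipeline output
example_output = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
]
print([signed_score(r) for r in example_output])
```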

### Fine-tuning (optional)

Some practical readings...
* [Getting Started with Sentiment Analysis using Python](https://huggingface.co/blog/sentiment-analysis-python)
* [Sentiment Analysis in 10 Minutes with BERT and TensorFlow](https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671)

<p id="approach4"></p>

## Approach 4: Training your own classifier from scratch (Optional)

Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), included in the `NLTK` library. It consists of 1000 positive and 1000 negative processed reviews. It was introduced by Pang and Lee (ACL 2004) and released in June 2004.

In [None]:
import nltk
nltk.download(['movie_reviews','punkt','stopwords','averaged_perceptron_tagger'])

from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string
from nltk.probability import FreqDist

import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
import random

In [None]:
from nltk.corpus import movie_reviews as mr
print("The data contains %d reviews"% len(mr.fileids()))

### Shuffling the documents

We start by shuffling the documents; otherwise, they would remain sorted by label (["neg", "neg", ..., "pos"]), and the final 20% used for testing would contain only positive reviews. Then we will proceed using scikit-learn.

In [None]:
# Shuffle
docnames=mr.fileids()
random.shuffle(docnames)

# create two separate lists: documents and tags
documents=[]
tags = []
for doc in docnames:
    documents.append(mr.raw(doc))
    tags.append( doc.split('/')[0])

Let's check the first few documents ...

In [None]:
for i in range(5):
    print("DOC:", documents[i][:400])
    print("TAG:", tags[i])

The first 80% of the documents will be used for training, and the final 20% will be used for testing...

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents
train_documents, test_documents = documents[:numtrain], documents[numtrain:]
train_tags, test_tags = tags[:numtrain], tags[numtrain:]
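
Because `random.shuffle` produces a different order on every run, the train/test split changes between runs, and so does the reported accuracy. Below is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. The function `shuffled_split`, its parameters, and the toy data are hypothetical.

```python
import random


def shuffled_split(items, labels, train_fraction=0.8, seed=42):
    """Shuffle items and labels together, then split them into train and test parts."""
    pairs = list(zip(items, labels))
    random.Random(seed).shuffle(pairs)  # private generator: same order for the same seed
    cut = int(len(pairs) * train_fraction)
    train_pairs, test_pairs = pairs[:cut], pairs[cut:]
    train_x = [item for item, _ in train_pairs]
    train_y = [label for _, label in train_pairs]
    test_x = [item for item, _ in test_pairs]
    test_y = [label for _, label in test_pairs]
    return train_x, test_x, train_y, test_y


# Toy data standing in for the reviews and their tags
toy_docs = [f"review {i}" for i in range(10)]
toy_tags = ["neg"] * 5 + ["pos"] * 5

train_x, test_x, train_y, test_y = shuffled_split(toy_docs, toy_tags)
print(len(train_x), "training documents,", len(test_x), "test documents")
print(list(zip(test_x, test_y)))
```

scikit-learn's `train_test_split` (in `sklearn.model_selection`) offers similar functionality, including a `random_state` parameter.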

Now that we have separated the training and testing sets, we will convert the texts into their vector representation. Scikit-learn provides two interesting classes for this: `CountVectorizer` and `TfidfVectorizer`. Please check the documentation if you want to explore other parameters.
- [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [None]:
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)
print("TRAIN SIZE:", train_X.shape)
print("TEST SIZE:", test_X.shape)

We can see that the features are actually the words from the texts, where some strange "words" can also be found.

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[600:700])
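
To make the roles of `fit_transform` and `transform` more concrete, the following sketch builds a bag-of-words representation by hand on toy sentences. It is a simplified illustration, not the actual `CountVectorizer` implementation: the vocabulary is learned from the training texts only, and words that appear only in new texts are ignored. All function and variable names in it are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(texts):
    """Map every word seen in the training texts to a column index (alphabetical order)."""
    words = sorted({token for text in texts for token in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def transform(texts, vocabulary):
    """Turn each text into a list of word counts; words outside the vocabulary are dropped."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


train_texts = ["a good good movie", "a bad movie"]
vocabulary = fit_vocabulary(train_texts)
print(vocabulary)
print(transform(train_texts, vocabulary))
print(transform(["a surprisingly good plot"], vocabulary))
```

In the last call, "surprisingly" and "plot" contribute nothing because they were never seen during fitting, which is why new sentences must be transformed with the vectorizer fitted on the training data.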

In [None]:
classifier = MultinomialNB()

In [None]:
classifier.fit(train_X, train_tags)

In [None]:
pred = classifier.predict(test_X)

In [None]:
score = sklearn.metrics.accuracy_score(test_tags, pred)
print("accuracy:   %0.3f" % score)
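
Accuracy alone does not show how the errors are distributed over the two classes. The sketch below computes per-class precision and recall by hand on toy labels; `classification_report`, imported earlier from scikit-learn, reports the same kind of figures. The function name `precision_recall` and the toy labels are hypothetical.

```python
def precision_recall(gold, predicted, target):
    """Precision and recall for one class, computed from paired label lists."""
    true_pos = sum(g == target and p == target for g, p in zip(gold, predicted))
    predicted_pos = sum(p == target for p in predicted)
    actual_pos = sum(g == target for g in gold)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Toy labels, used only to illustrate the calculation
toy_gold = ["pos", "pos", "neg", "neg", "neg"]
toy_pred = ["pos", "neg", "neg", "neg", "pos"]
for target in ("pos", "neg"):
    p, r = precision_recall(toy_gold, toy_pred, target)
    print(f"{target}: precision={p:.2f} recall={r:.2f}")
```

In the notebook, the same function could be called with `test_tags` and `pred`.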

### Using the classifier for processing new texts

1. Apply to the new sentences exactly the same processing steps that were used during training.
2. Then simply apply the classifier to the processed sentences.

In [None]:
frases = ["I love movies very much", 
          "I hate my stupid life",
          "I am disappointed with the argument"]
frases_X = vectorizer.transform(frases)
classifier.predict(frases_X)

<p id="approach5"></p>

## Approach 5: ChatGPT-like models

Please use [ChatGPT](https://chat.openai.com/) or a similar model, such as [Bard](https://bard.google.com/), to determine the sentiment of the following sentences:
* This food is amazing and tasty!
* Exoplanets are planets outside the solar system
* It is sad to see such a bad behavior

Try changing the prompt as follows:
> Calculate the sentiment of the following sentences. Produce the output in the following format: TEXT, SENTIMENT (Positive, Neutral, Negative)
            ...
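
If the replies are to be processed automatically, the requested `TEXT, SENTIMENT` format can be parsed with a few lines of Python. The sketch below only builds the prompt and parses a reply; it does not call any model. The function names are hypothetical, and `example_reply` is a hand-written stand-in for a model answer, which in practice may deviate from the requested format.

```python
PROMPT_HEADER = (
    "Calculate the sentiment of the following sentences. "
    "Produce the output in the following format: "
    "TEXT, SENTIMENT (Positive, Neutral, Negative)"
)


def build_prompt(sentences):
    """Join the instruction and the sentences into a single prompt string."""
    return PROMPT_HEADER + "\n" + "\n".join(f"* {s}" for s in sentences)


def parse_reply(reply):
    """Split each 'TEXT, SENTIMENT' line at its last comma; skip malformed lines."""
    allowed = {"positive", "neutral", "negative"}
    parsed = []
    for line in reply.splitlines():
        text, sep, label = line.rpartition(",")
        if sep and label.strip().lower() in allowed:
            parsed.append((text.strip(), label.strip().capitalize()))
    return parsed


sentences = [
    "This food is amazing and tasty!",
    "Exoplanets are planets outside the solar system",
    "It is sad to see such a bad behavior",
]
print(build_prompt(sentences))

# Hand-written stand-in for a model reply (not real model output)
example_reply = """This food is amazing and tasty!, Positive
Exoplanets are planets outside the solar system, Neutral
It is sad to see such a bad behavior, Negative"""
print(parse_reply(example_reply))
```

Splitting at the last comma keeps sentences that contain commas intact, as long as the label comes last on the line.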