# Sentiment Analysis
Comments and suggestions: `fernando.batista@iscte-iul.pt`

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fmmb/ADD_in/blob/main/SA.ipynb)

Table of contents
* [Intro](#intro)
* [Approach 1: Using existing NLP tools](#approach1)
* [Approach 2: Using a sentiment lexicon](#approach2)
* [Approach 3: Machine Learning - Training your own classifier](#approach3)
* [Approach 4: Using pre-trained transformer-based models](#approach4)
* [Approach 5: ChatGPT-like models](#approach5)

If you are using Google Colab, please check the [instructions on how to use your files in Google Colab](README.md#files-in-google-colab).

## Introduction

In [None]:
from textblob import TextBlob
import pandas as pd
import nltk
if 'google.colab' in str(get_ipython()):
    nltk.download('punkt')

<p id="approach1"></p>

## Approach 1: Using existing NLP tools
There are many existing tools for sentiment analysis, such as `TextBlob` and `VADER`, among others.

### TextBlob
`TextBlob` is one of these NLP tools and will be used to perform our initial tests.

In [None]:
texts = [ "The movie was good.",
          "I hate the movie",
          "The movie was not good.",
          "I really think this product sucks.",
          "Really great product.",
          "I don't like this product"]

In [None]:
for t in texts:
    print(t, "==>", TextBlob(t).sentiment.polarity)

The previous code assumes that the text is already split into sentences, which may not be the case for texts coming from sources such as *web pages* or *blogs*. An alternative is to give the whole text to `TextBlob` and let it split the sentences, as follows.

In [None]:
mytext = """The movie was good. The movie was not good. I really think this product sucks.
Really great product. I don't like this product."""
text=TextBlob(mytext)

In [None]:
for s in text.sentences:
    print(s, "==> ", s.sentiment.polarity)

### Evaluation: classifying a set of texts and evaluating the performance

In [None]:
texts = ["I love chocolate",
        "I hate to eat", 
        "I don't love anyone",
        "I like cakes",
        "I don't fail"]
tags=["pos", "neg", "neg", "pos", "pos"]

In [None]:
TextBlob(texts[0]).sentiment.polarity

In [None]:
correct=0
incorrect=0
for i in range(len(texts)):
    polarity = TextBlob(texts[i]).sentiment.polarity
    print(f"  => {polarity:4}, {tags[i]}, {texts[i]}")
    if polarity >=0 and tags[i] == "pos":
        correct +=1
    elif polarity <0 and tags[i] == "neg":
        correct += 1
    else:
        incorrect += 1
acc=(correct)/(correct+incorrect)
print(f"correct: {correct}, incorrect: {incorrect}, accuracy: {acc}")
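The evaluation loop above can be wrapped in a reusable function. This is a minimal sketch, assuming a scorer is any callable that returns a polarity value; `evaluate_polarity` and `toy_scorer` are hypothetical names, and `toy_scorer` is only a stand-in so that the example runs without TextBlob.

```python
def evaluate_polarity(texts, tags, scorer, threshold=0.0):
    """Return the accuracy of a polarity scorer against gold "pos"/"neg" tags.

    A score >= threshold counts as "pos", mirroring the loop above.
    """
    if len(texts) != len(tags):
        raise ValueError("texts and tags must have the same length")
    correct = 0
    for text, tag in zip(texts, tags):
        predicted = "pos" if scorer(text) >= threshold else "neg"
        correct += predicted == tag
    return correct / len(texts) if texts else 0.0


# Hypothetical stand-in scorer, NOT TextBlob: negative if the text contains "hate".
def toy_scorer(text):
    return -1.0 if "hate" in text.lower() else 0.5


sample_texts = ["I love chocolate", "I hate to eat", "I like cakes"]
sample_tags = ["pos", "neg", "pos"]
print(evaluate_polarity(sample_texts, sample_tags, toy_scorer))
```

To evaluate TextBlob itself, one could pass `lambda t: TextBlob(t).sentiment.polarity` as the scorer.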

### Vader Sentiment

In [None]:
!pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sa = SentimentIntensityAnalyzer()

sentences = [
    "This food is amazing and tasty !",
    "Exoplanets are planets outside the solar system",
    "This is sad to see such bad behavior"
]

for sentence in sentences:
    score = sa.polarity_scores(sentence)["compound"]
    print(f'{sentence}. sentiment score: {score}')

We can also inspect the proportion of the text that falls into each sentiment category, using the `pos`, `neu` and `neg` keys of the dictionary returned by `polarity_scores`.

In [None]:
for sentence in sentences:
    polarity = sa.polarity_scores(sentence)
    print(f"{sentence}\n  => {polarity}") 
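To turn the `compound` score into a discrete label, a common convention (suggested in the vaderSentiment documentation) is to use ±0.05 as cut-off points. The sketch below assumes that convention; `label_from_compound` is a hypothetical helper name, and the input values are illustrative numbers, not actual VADER outputs.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to "pos", "neg" or "neu"."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# Illustrative values only, not real analyzer output.
for value in (0.8, 0.0, -0.6):
    print(value, "=>", label_from_compound(value))
```

With the analyzer above, this could be applied as `label_from_compound(sa.polarity_scores(sentence)["compound"])`.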

### Flair (optional)

In [6]:
!pip install flair

In [5]:
from flair.data import Sentence
from flair.nn import Classifier

sentence = Sentence('I love Lisbon .')

# load and apply the Sentiment tagger
tagger = Classifier.load('sentiment')
tagger.predict(sentence)

# print the sentence with all annotations
print(sentence)

RuntimeError: Failed to import transformers.trainer_utils because of the following error (look up to see its traceback):
'EntryPoint' object has no attribute 'module'

<p id="approach2"></p>

## Approach 2: Applying a sentiment lexicon

Reading the sentiment lexicon

In [None]:
#data = pd.read_csv("../data/en/NCR-lexicon.csv", encoding="utf-8")
data = pd.read_csv("sample-lexicon.csv", encoding="utf-8", index_col=["English"])
data.sample(5)

In [None]:
lex = data['polarity'].to_dict()

In [None]:
text = 'I hate to say goodbye but I love chocolate'
polarity = 0
for w in text.split():
    polarity += lex.get(w, 0)
    print(w, lex.get(w, 0) )
print("Sum:" , polarity)

Even better with a function ...

In [None]:
def sentiment(text):
    polarity = 0
    for w in text.split():
        polarity += lex.get(w, 0)
    if polarity >= 0:
        return "POS"
    else:
        return "NEG"

In [None]:
text = 'I hate to say goodbye but I love chocolate'
sentiment(text)

In [None]:
for text in texts:
    print(sentiment(text))
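Splitting on whitespace misses words that are capitalized or followed by punctuation, and it ignores negation. The following is a rough sketch of one possible refinement, not a definitive method: it lowercases tokens, strips surrounding punctuation, and flips the polarity of the word that immediately follows a negator. `toy_lex`, `NEGATORS`, `simple_tokens` and `lexicon_score` are hypothetical names; `toy_lex` is a tiny made-up lexicon used only so that the example runs on its own.

```python
import string

# Hypothetical toy lexicon, only for illustration; use `lex` in the notebook.
toy_lex = {"hate": -1, "love": 1, "good": 1, "bad": -1}
# Assumed, deliberately short list of negation words.
NEGATORS = {"not", "no", "never", "don't"}


def simple_tokens(text):
    """Lowercase and strip surrounding punctuation from each whitespace token."""
    tokens = (w.strip(string.punctuation) for w in text.lower().split())
    return [t for t in tokens if t]


def lexicon_score(text, lexicon):
    """Sum word polarities, flipping the sign of the word right after a negator."""
    score = 0
    negate = False
    for token in simple_tokens(text):
        if token in NEGATORS:
            negate = True
            continue
        value = lexicon.get(token, 0)
        score += -value if negate else value
        negate = False  # negation only affects the next token
    return score


for example in ["I don't love anyone", "The movie was not good.", "I LOVE chocolate!"]:
    print(example, "==>", lexicon_score(example, toy_lex))
```

The negation rule is intentionally simple: it only looks one word ahead, so a phrase such as "not really good" would not be handled.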

<p id="approach3"></p>

## Approach 3: Training your own classifier

Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), included in the `NLTK` library.<br>
It consists of 1000 positive and 1000 negative processed reviews. Introduced in Pang/Lee ACL 2004. Released June 2004.

In [None]:
import nltk
nltk.download(['movie_reviews','punkt','stopwords','averaged_perceptron_tagger'])

from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string
from nltk.probability import FreqDist

In [None]:
from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews"% len(mr.fileids()))

### Shuffling

Let's shuffle the documents; otherwise they will remain sorted by class: ["neg", "neg", ..., "pos"].

In [None]:
import random
docnames=mr.fileids()
random.shuffle(docnames)
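Since `random.shuffle` produces a different order on every run, results such as accuracy will vary between runs. If reproducibility matters, one option is to shuffle with a seeded generator. This is a small sketch under that assumption; `seeded_shuffle` is a hypothetical helper, and `names` is a placeholder list standing in for `mr.fileids()`.

```python
import random

# Hypothetical placeholder names, standing in for mr.fileids().
names = ["neg/a.txt", "neg/b.txt", "pos/c.txt", "pos/d.txt"]


def seeded_shuffle(items, seed=42):
    """Return a shuffled copy; the same seed always yields the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


first = seeded_shuffle(names)
second = seeded_shuffle(names)
print(first == second)  # both calls use the same seed
```

Returning a copy leaves the original list untouched, which makes it easier to re-run cells in any order.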

### Let's do it using some useful scikit-learn functions 


In [None]:
import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

#### Assuming that documents are shuffled
Make sure `docnames` contains a shuffled list of documents.

In [None]:
documents=[]
tags = []
for doc in docnames:
    documents.append(mr.raw(doc))
    tags.append( doc.split('/')[0])

In [None]:
for i in range(5):
    print("{} -> {}...".format(tags[i], documents[i][:50]))

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents
train_documents, test_documents = documents[:numtrain], documents[numtrain:]
train_tags, test_tags = tags[:numtrain], tags[numtrain:]

Now that the training and test sets are separated, we need to convert the text of the documents into a vector representation. scikit-learn offers two useful classes for this: `CountVectorizer` and `TfidfVectorizer`. Both accept a rich set of parameters, which we do not explore here but which are worth consulting.
- [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
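To make the idea of a count-based vector representation concrete, here is a pure-Python sketch of what fitting a vocabulary and transforming documents roughly amounts to. It is an illustration of the concept, not scikit-learn's implementation; `bow_tokens`, `fit_vocabulary` and `transform` are hypothetical names, and the token pattern (lowercased words of two or more characters) is assumed to mirror the `CountVectorizer` defaults.

```python
import re
from collections import Counter


def bow_tokens(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(documents):
    """Map each distinct token in the training documents to a column index."""
    vocab = sorted({tok for doc in documents for tok in bow_tokens(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(documents, vocabulary):
    """Turn each document into a list of counts; unseen tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(bow_tokens(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


train = ["good movie", "bad movie bad plot"]
vocabulary = fit_vocabulary(train)
print(vocabulary)
print(transform(train, vocabulary))
print(transform(["good good film"], vocabulary))  # "film" is not in the vocabulary
```

The vocabulary is built from the training documents only, so words that appear only in new texts contribute nothing to their vectors.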

In [None]:
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)
print(train_X.shape, test_X.shape)

We can verify that the features are in fact the "words" of the texts, which also include numbers and other odd tokens.

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[600:700])

In [None]:
classifier = MultinomialNB()

In [None]:
classifier.fit(train_X, train_tags)

In [None]:
pred = classifier.predict(test_X)

In [None]:
score = sklearn.metrics.accuracy_score(test_tags, pred)
print("accuracy:   %0.3f" % score)
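Accuracy alone does not show how the classifier behaves for each class. The sketch below computes per-class precision, recall and F1 in plain Python; `per_class_scores` is a hypothetical helper, and the `gold_example`/`pred_example` lists are made-up labels for illustration, not results from the classifier above.

```python
def per_class_scores(gold, predicted):
    """Return precision, recall and F1 for every label seen in gold or predicted."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up labels for illustration only.
gold_example = ["pos", "pos", "neg", "neg"]
pred_example = ["pos", "neg", "neg", "neg"]
for label, values in per_class_scores(gold_example, pred_example).items():
    print(label, values)
```

In the notebook, `classification_report(test_tags, pred)`, already imported from `sklearn.metrics`, produces a comparable summary.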

### Using the classifier for processing new texts

1. Note that the new sentences must go through exactly the same processing steps that were used during training.
2. Then you only have to apply the classifier to the new sentences.

In [None]:
frases = ["I love movies very much", 
          "I hate my stupid life",
          "I am disapointed with the argument"]
frases_X = vectorizer.transform(frases)
classifier.predict(frases_X)

<p id="approach4"></p>

## Approach 4: Using pre-trained transformer-based models

### The easy way ...

In [None]:
!pip install -q transformers

In [None]:
from transformers import pipeline

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis")
#sentiment_pipeline = pipeline("sentiment-analysis", model="finiteautomata/bertweet-base-sentiment-analysis")

In [None]:
data = ["I love you", "I hate you", "the unemployment is increasing"]

In [None]:
sentiment_pipeline(data)
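The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. To compare these results with the polarity values from the earlier approaches, they can be mapped to a single signed number. This is a minimal sketch, assuming label names that start with "POS" or "NEG" (the exact label set depends on the model); `signed_score` is a hypothetical helper, and `example_results` contains placeholder values, not actual model output.

```python
def signed_score(result):
    """Map a {'label': ..., 'score': ...} dictionary to a value in [-1, 1]."""
    label = result["label"].upper()
    if label.startswith("POS"):
        return result["score"]
    if label.startswith("NEG"):
        return -result["score"]
    return 0.0  # e.g. a neutral label, if the model has one


# Placeholder values for illustration, NOT actual model output.
example_results = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "NEGATIVE", "score": 0.8},
]
print([signed_score(r) for r in example_results])
```

With the pipeline above, this could be applied as `[signed_score(r) for r in sentiment_pipeline(data)]`.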

### Fine-tuning (optional)

Some practical readings...
* [Getting Started with Sentiment Analysis using Python](https://huggingface.co/blog/sentiment-analysis-python)
* [Sentiment Analysis in 10 Minutes with BERT and TensorFlow](https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671)

<p id="approach5"></p>

## Approach 5: ChatGPT-like models

TODO