# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the [lexicon](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma? wordform? part-of-speech? polarity value? word meaning?). What do all the scores mean?


### [3.5 points] Question 1:

Consider the following sentences and their output as given by VADER. Explain the outcome **for each sentence** (1 to 7), taking into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain why the reasonable results come out right and what goes wrong in the unreasonable ones.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

## Sentence 1: "I love apples"

**Explanation:**  
In this sentence, VADER identifies the word "love" as strongly positive. The remaining words ("I" and "apples") are not associated with any sentiment in the lexicon. So, the overall sentiment is driven almost completely by "love," resulting in a high positive score.



## Sentence 2: "I don't love apples"

**Explanation:**  
The presence of the negation "don't" inverts the sentiment of the word "love." VADER’s negation rule flips the polarity of a positive term to negative and dampens it somewhat (compound -0.5216 here versus +0.6369 for Sentence 1), leading to an overall negative sentiment for the sentence. This shows how even a single negator can significantly change the sentiment outcome.



## Sentence 3: "I love apples :-)"

**Explanation:**  
The positive term "love" is again present, and the additional emoticon ":-)" provides an extra boost. VADER is designed to recognize common emoticons and assign them sentiment values. The combination results in an even higher overall positive sentiment than the sentence without the emoticon.



## Sentence 4: "These houses are ruins"

**Explanation:**  
The key term in this sentence is "ruins," which carries a negative valence in VADER’s lexicon, so the sentence is scored as negative. The context is unknown, however: “ruins” could also simply describe old structures. VADER does not disambiguate word senses, so it relies solely on the lexicon value.



## Sentence 5: "These houses are certainly not considered ruins"

**Explanation:**  
In this sentence, the negative word "ruins" is preceded by the negation "not," which causes VADER to flip its valence to positive. In addition, "certainly" appears to contribute a positive score of its own, which is consistent with the compound score (0.5867) being larger in magnitude than the one for Sentence 4 (0.4404). As a result, VADER rates the sentence as clearly positive, even though one might expect it to be closer to neutral.



## Sentence 6: "He lies in the chair in the garden"

**Explanation:**  
The word "lies" is ambiguous. Although it can mean "reclines" in a neutral sense, VADER’s lexicon associates "lies" with dishonesty, a negative trait. Without the ability to disambiguate between the meanings, VADER assigns a negative sentiment to the sentence, even though the intended meaning is simply descriptive (and, thus, one would expect the sentiment to be neutral).



## Sentence 7: "This house is like any house"

**Explanation:**  
In this case, the word "like" is interpreted by VADER as a positive signal. However, the sentence is meant to express a neutral comparison, that the house is unremarkable. Because VADER registers "like" with a positive bias, it results in a slight positive sentiment, which does not fully capture the neutral intent of the sentence.



## **Final Outcome:**  
VADER’s approach relies on a fixed lexicon and a set of heuristic rules. While it effectively captures clear positive and negative signals (handles negation and emoticons), its inability to disambiguate word meanings or interpret subtle contextual cues can lead to sentiment scores that do not always match the intended sentiment of the sentence.
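
The compound scores listed in Question 1 can be reproduced by hand, which makes the role of the lexicon and the negation rule more concrete. The sketch below assumes the normalization used in VADER's reference implementation (the summed valence `x` is mapped to `x / sqrt(x^2 + 15)`) and a negation factor of `-0.74`. The lexicon valences are assumptions chosen because they are consistent with the outputs above; check them against `vader_lexicon.txt` before relying on them.

```python
import math

ALPHA = 15               # normalization constant (assumed from VADER's reference implementation)
NEGATION_SCALAR = -0.74  # factor applied to a valence that follows a negation (assumed)

# Assumed lexicon valences; they reproduce the outputs listed in Question 1
VALENCE = {"love": 3.2, ":-)": 1.3, "ruins": -1.9, "lies": -1.8, "like": 1.5}


def normalize(valence_sum, alpha=ALPHA):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


sentence_1 = normalize(VALENCE["love"])                    # "I love apples"
sentence_2 = normalize(VALENCE["love"] * NEGATION_SCALAR)  # "I don't love apples"
sentence_3 = normalize(VALENCE["love"] + VALENCE[":-)"])   # "I love apples :-)"

for name, value in [("sentence 1", sentence_1),
                    ("sentence 2", sentence_2),
                    ("sentence 3", sentence_3)]:
    # Compare with the 'compound' values in the VADER output above
    print(name, round(value, 4))
```

Under these assumptions, negation does not simply mirror the score: the valence is multiplied by `-0.74`, so a negated "love" is less extreme than "love" itself.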


### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream. If you have trouble accessing Twitter, try to find an existing dataset (on websites like kaggle or huggingface).

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [1]:
import json
########################### imports below have been added manually for solving the questions in this NB
import nltk
from sklearn.datasets import load_files
from nltk.tokenize import word_tokenize

In [2]:
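# open the file in a context manager so that it is closed again after loading
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)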

In [3]:
for id_, tweet_info in list(my_tweets.items())[::-1]:
    print(id_, tweet_info)
    break

50 {'sentiment_label': 'neutral', 'text_of_tweet': 'Scientists discovered a new type of organism in the depths of the ocean.', 'tweet_url': 'manually created'}
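
Because the labels are typed in by hand, it may help to check the annotations before evaluating anything. The following is a minimal sketch, not part of the original notebooks: `check_annotations` and the two-entry `sample` dictionary are hypothetical, and only the key names follow the `my_tweets.json` format shown above.

```python
from collections import Counter

ALLOWED_LABELS = {"negative", "neutral", "positive"}


def check_annotations(tweets):
    """Return the label counts; raise ValueError on an unknown label or empty text."""
    counts = Counter()
    for id_, info in tweets.items():
        label = info.get("sentiment_label", "")
        if label not in ALLOWED_LABELS:
            raise ValueError(f"tweet {id_}: unexpected label {label!r}")
        if not info.get("text_of_tweet", "").strip():
            raise ValueError(f"tweet {id_}: empty text")
        counts[label] += 1
    return counts


# Hypothetical sample in the same format as my_tweets.json
sample = {
    "1": {"sentiment_label": "positive",
          "text_of_tweet": "What a lovely day!",
          "tweet_url": "manually created"},
    "2": {"sentiment_label": "neutral",
          "text_of_tweet": "The train leaves at nine.",
          "tweet_url": "manually created"},
}

print(check_annotations(sample))
```

Calling `check_annotations(my_tweets)` on the loaded file would then also show how balanced the three categories are.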


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER rules and the VADER lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

##### **NB!** I don't exactly understand whether by "explain which scores are most relevant and why" they refer to the scores that VADER produces (neg, pos, neu, compound), or the scores from the classification report (f1, P, R, accuracy, etc.). Therefore, I explain both.

In [4]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'

In [5]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import spacy

nlp = spacy.load("en_core_web_sm")
vader_model = SentimentIntensityAnalyzer()

Reuse `run_vader` from ***Lab3.2*** *notebook*.

In [6]:
def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a textual unit after processing it with spaCy
    
    :param str textual_unit: a textual unit, e.g., a sentence or several sentences (one string);
    its tokens are collected by looping over doc.sents
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-':  # placeholder lemma that spaCy v2 assigns to pronouns; fall back to the word form
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        print('INPUT SENTENCE', textual_unit) # change to textual_unit so the whole tweet is displayed
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [7]:
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(the_tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    

from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

# use scikit-learn's classification report
print(classification_report(gold, all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions: for single-label
# multiclass data, classification_report shows 'accuracy' in place of a micro average row (the two are equal)
precision_micro = precision_score(y_true=gold, y_pred=all_vader_output, average='micro')
recall_micro = recall_score(y_true=gold, y_pred=all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=gold, y_pred=all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(tweets):>10}")


              precision    recall  f1-score   support

    negative       1.00      0.75      0.86        16
     neutral       0.75      0.75      0.75        16
    positive       0.82      1.00      0.90        18

    accuracy                           0.84        50
   macro avg       0.86      0.83      0.84        50
weighted avg       0.85      0.84      0.84        50

   micro avg       0.84      0.84      0.84        50


#### **(A) Quantitative Evaluation**
VADER looks up the valence of each token in its lexicon, applies its rules, and produces 4 different sentiment scores for each text input:
* Positive (`pos`): The proportion of the text that is rated as positive
* Negative (`neg`): The proportion of the text that is rated as negative
* Neutral (`neu`): The proportion of the text that is neutral / lacks clear sentiment (`pos`, `neg`, and `neu` sum to 1)
* Compound (`compound`): An aggregated sentiment score, i.e., the sum of the valence scores normalized to range from -1 (extremely negative) to +1 (extremely positive)

After producing those 4 scores, the final sentiment label of the tweet is assigned based on the `compound` score, as detailed in the `vader_output_to_label` function. Therefore, the `compound` score is the most relevant score for the final label classification, which is done according to the following scheme:
* **`compound` > 0** → *positive* label
* **`compound` < 0** → *negative* label
* **`compound` = 0** → *neutral* label

In order to evaluate the performance of the VADER classifier, we used scikit-learn's classification report. We present and explain the insights it provides:
1. **Precision** (out of all tweets that VADER assigned to a certain category, how many actually belong to it according to the gold label):
* Negative (`1.00`) - every tweet that VADER labeled negative is indeed negative; no positive or neutral tweet was wrongly labeled negative
* Neutral (`0.75`) - *25%* of the tweets labeled neutral actually belong to another category (here: 4 negative tweets)
* Positive (`0.82`) - overall high precision on positive tweets; however, *18%* of the tweets labeled positive actually belong to another category (here: 4 neutral tweets)
2. **Recall** (out of all the tweets in a category, as determined by the gold label, how many did VADER classify correctly):
* Negative (`0.75`) - *25%* of the negative tweets (4 of 16) were missed; all of them were labeled neutral
* Neutral (`0.75`) - VADER manages to identify *75%* of the actual neutral tweets; the remaining 4 were labeled positive
* Positive (`1.00`) - all actual positive tweets were detected correctly
3. **F1-score** (harmonic mean of precision and recall):
* Negative (`0.86`) - strong performance in detecting negative tweets
* Neutral (`0.75`) - decent performance in classifying neutral tweets. Across the three categories, VADER has the lowest score for neutral tweets
* Positive (`0.90`) - highly effective at identifying positive tweets
4. **Accuracy** (overall percentage of correct classifications):
* `84%` - VADER correctly classified *84%* of all tweets
5. **Macro average**, **weighted average** & **micro average**
* Macro average - the unweighted mean of the per-class precision, recall, and F1-score, treating each class equally (`0.86` / `0.83` / `0.84`)
* Weighted average - the mean of the per-class precision, recall, and F1-score, weighted by the number of samples in each class (`0.85` / `0.84` / `0.84`)
* Micro average - computed over all individual classification decisions pooled together; for single-label classification it equals accuracy (`0.84` for all three metrics)
* Macro and weighted averages are nearly identical because the three classes have almost the same size (16, 16, and 18 tweets)

For evaluating the overall performance, the F1-scores and accuracy are most relevant, as together they give a good general idea of how well the classifier works. Accuracy alone can be misleading: it gives no insight into the types of errors the model makes and is sensitive to class imbalance, i.e., when one category is much more frequent than the rest (which is not the case here). Combining it with the per-class F1-scores therefore gives a better overview of the performance.
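
To make the relation between the per-class scores and the averages concrete, the numbers in the report can be recomputed by hand. The sketch below uses plain Python rather than scikit-learn. The confusion counts are reconstructed from the report and the misclassification summary (4 negative tweets labeled neutral, 4 neutral tweets labeled positive), and the helper `per_class_scores` is hypothetical.

```python
LABELS = ["negative", "neutral", "positive"]

# (gold label, predicted label) -> number of tweets, reconstructed from the report
confusion = {
    ("negative", "negative"): 12,
    ("negative", "neutral"): 4,
    ("neutral", "neutral"): 12,
    ("neutral", "positive"): 4,
    ("positive", "positive"): 18,
}


def per_class_scores(confusion, label):
    """Return (precision, recall, f1) for one label."""
    tp = confusion.get((label, label), 0)
    predicted = sum(n for (_, pred), n in confusion.items() if pred == label)
    actual = sum(n for (gold, _), n in confusion.items() if gold == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


scores = {label: per_class_scores(confusion, label) for label in LABELS}
total = sum(confusion.values())
correct = sum(confusion.get((label, label), 0) for label in LABELS)
accuracy = correct / total

# Macro average: unweighted mean of the per-class F1-scores
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(LABELS)

# Micro average: pool true positives, false positives and false negatives over all classes.
# Every error is a false positive for one class and a false negative for another,
# so micro precision and micro recall both equal accuracy.
errors = total - correct
micro_precision = correct / (correct + errors)
micro_recall = correct / (correct + errors)

for label in LABELS:
    print(label, [round(value, 2) for value in scores[label]])
print("accuracy", round(accuracy, 2), "macro F1", round(macro_f1, 2))
```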

**Error Analysis:**

In [8]:
misclassified_counts = {"positive": 0, "negative": 0, "neutral": 0}
misclassified_tweets = {"positive": [], "negative": [], "neutral": []}

for i in range(len(tweets)):
    if gold[i] != all_vader_output[i]:  #classification is incorrect
        true_label = gold[i]
        predicted_label = all_vader_output[i]

        misclassified_counts[true_label] += 1

        misclassified_tweets[true_label].append((tweets[i], predicted_label))

# Summary of misclassified tweets
print("\nMisclassification Summary")
print("-" * 50)
for sentiment, count in misclassified_counts.items():
    print(f"{sentiment.capitalize()} misclassified: {count}")
print("*" * 50)
for sentiment, errors in misclassified_tweets.items():
    print(f"\nMisclassified {sentiment.capitalize()} Tweets")
    print("~" * 50)

    for tweet, predicted in errors: # print the tweets
        print(f"\nTweet: {tweet}")
        print(f"Expected: {sentiment}; Predicted: {predicted}")
        
        print("\nVADER with verbose:") # see what vader assigns
        run_vader(tweet, lemmatize=True, verbose=1)
        print("*" * 50)


Misclassification Summary
--------------------------------------------------
Positive misclassified: 0
Negative misclassified: 4
Neutral misclassified: 4
**************************************************

Misclassified Positive Tweets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Misclassified Negative Tweets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tweet: The traffic today was unbearable. Took me whole two hours to get home.
Expected: negative; Predicted: neutral

VADER with verbose:

INPUT SENTENCE The traffic today was unbearable. Took me whole two hours to get home.
INPUT TO VADER ['the', 'traffic', 'today', 'be', 'unbearable', '.', 'take', 'I', 'whole', 'two', 'hour', 'to', 'get', 'home', '.']
VADER OUTPUT {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
**************************************************

Tweet: The food at the restaurant was cold and tasteless. Never going back.
Expected: negative; Predicted: neutral

VADER with verbose:

INPUT SENTENCE Th

#### **(B) Error Analysis**
In total, 0 positive tweets were misclassified, 4 negative tweets were misclassified as neutral, and 4 neutral were misclassified as positive.

1. Misclassified Positive Tweets
* All positive tweets (according to the gold labels) were correctly identified.

2. Misclassified Negative Tweets
* All 4 misclassified negative tweets were labeled neutral. In each case the `neu` score is 1.0 and the `compound` score is 0.0, which means that no token in the input matched an entry in VADER's lexicon, so VADER found no evidence of any sentiment. In other words, the lexicon fails to associate the negative words in these tweets with a negative sentiment. For example, the tweet `"The traffic today was unbearable. Took me whole two hours to get home."` clearly carries a negative sentiment because of the word `unbearable`, which seems to be missing from VADER's lexicon. Similarly, the words `cold` and `tasteless` in the second tweet `"The food at the restaurant was cold and tasteless. Never going back."` appear to be absent from the lexicon. The same holds for `slow` and `cold` (again) in the third tweet `"The service at the cafe was slow and the coffee was cold."`, and for `overpriced` and `mediocre` in the fourth tweet `"The restaurant was overpriced and the food was mediocre."`. In all four cases the tweet expresses a negative tone, yet it ends up classified as neutral.

3. Misclassified Neutral Tweets
* As shown in the quantitative evaluation, the classifier has the lowest F1-score on neutral tweets. All 4 of the misclassified neutral tweets were classified as positive, suggesting that VADER tends to assign a sentiment even when the intended tone is neutral. A common issue is that VADER overweights mildly positive words, such as `innovation`, `exploration`, `useful`, and `like`. These words do not necessarily indicate strong positive emotion, but VADER assigns them a positive valence. For example, in `"Reading about the latest tech innovations. There have been a lot of new developments."`, the word `innovations` is factual rather than an expression of enthusiasm, yet it leads to a positive classification. The same is true for the word `exploration` in `"Listening to a podcast about space exploration. So much new information..."`. Moreover, VADER lacks context awareness, which causes it to misinterpret the tone. In `"Would you like to watch the sunset at the park with me tonight?"`, the phrase `would you like` is not an expression of excitement but a neutral request, yet VADER assigns `like` a positive valence. Similarly, in `"Listening to a podcast about productivity. It provides many useful tips."`, the word `useful` describes a fact rather than expressing a feeling, yet VADER considers the tweet to be positive.
* In addition to the points made above, another possible reason for the observed misclassifications is that the threshold for classifying a tweet as positive is too lenient. Currently, any `compound` score greater than `0.0` results in a positive classification, even if the score is barely a positive number. This means that tweets with weakly positive words, even if the intended tone is neutral, are pushed towards the positive category. Raising the threshold slightly could help reduce these errors and better differentiate between truly neutral and positive tweets.
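
A stricter mapping could look like the sketch below. The cut-off of `0.05` follows the thresholds commonly cited from VADER's own documentation (positive at `compound >= 0.05`, negative at `compound <= -0.05`); the function name `vader_output_to_label_with_threshold` is hypothetical, and the best threshold for a given dataset would have to be tuned on held-out data rather than on the 50 evaluation tweets.

```python
def vader_output_to_label_with_threshold(vader_output, threshold=0.05):
    """
    Map a VADER output dict to a label, treating compound scores
    close to zero as neutral.

    :param dict vader_output: output dict from vader
    :param float threshold: minimum absolute compound score for a polar label

    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']

    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# A barely positive score is now neutral, a clearly positive one stays positive
print(vader_output_to_label_with_threshold({'compound': 0.01}))
print(vader_output_to_label_with_threshold({'compound': 0.3612}))
```

Note that a compound score driven by a single mildly positive word (such as `like` in Sentence 7 of Question 1, with 0.3612) would still exceed this cut-off, so a threshold alone is unlikely to remove all of the neutral-to-positive errors.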

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
  * Does lemmatisation help? Explain why or why not.
  * Are all parts of speech equally important for sentiment analysis? Explain why or why not.
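
The eight experiments are the combinations of four part-of-speech filters with lemmatization switched off or on. The sketch below only enumerates these settings; the names `POS_FILTERS` and `experiment_settings` are hypothetical, and the tags are the universal part-of-speech tags that spaCy exposes as `token.pos_`. The assignment still asks for a separate code cell per experiment, so this is meant as an overview of the settings rather than a replacement for those cells.

```python
from itertools import product

# Universal POS tags as found in spaCy's token.pos_; an empty set means "keep all tokens"
POS_FILTERS = {
    "all": set(),
    "adjectives": {"ADJ"},
    "nouns": {"NOUN"},
    "verbs": {"VERB"},
}


def experiment_settings():
    """Return one settings dict per experiment, in the order listed above."""
    settings = []
    for (name, pos), lemmatize in product(POS_FILTERS.items(), (False, True)):
        settings.append({
            "name": f"{name}, lemmatized" if lemmatize else name,
            "lemmatize": lemmatize,
            "parts_of_speech_to_consider": pos,
        })
    return settings


for setting in experiment_settings():
    print(setting["name"])
```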

In [9]:
# path to folder
from pathlib import Path

cur_dir = Path().resolve()
path_to_folder = cur_dir / "airlinetweets"

print(path_to_folder)

/Users/joanapetkova/Documents/VU AI/Text Mining/TextMiningGroup8/Lab3/airlinetweets


In [10]:
# load the data
import os

data = []
categories = ["positive", "negative", "neutral"]
for c in categories:
    category_path = os.path.join(path_to_folder, c)

    for txt_file in os.listdir(category_path):
        txt_file_path = os.path.join(category_path, txt_file)
        
        with open(txt_file_path, "r", encoding="utf-8") as file:
            tweet = file.read().strip()
            data.append((tweet, c))

# checking random example
print(data[5])

('@JetBlue great flight on a brand new jet. Great seating. Beautiful plane. Big fan of this airline.', 'positive')


### Experiment 1: Basic VADER

In [11]:
airline_tweets = []
airline_all_vader_output = []
airline_gold = []

# settings (to change for different experiments)
to_lemmatize = False 
pos = set()

# perform sentiment analysis with VADER on each tweet
for tweet, label in data:
    vader_output = run_vader(tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    airline_tweets.append(tweet)
    airline_all_vader_output.append(vader_label)
    airline_gold.append(label)
    
# use scikit-learn's classification report
print(classification_report(airline_gold, airline_all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions: for single-label
# multiclass data, classification_report shows 'accuracy' in place of a micro average row (the two are equal)
precision_micro = precision_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
recall_micro = recall_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(airline_tweets):>10}")

              precision    recall  f1-score   support

    negative       0.80      0.51      0.63      1750
     neutral       0.60      0.51      0.55      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.63      4755
   macro avg       0.65      0.64      0.62      4755
weighted avg       0.66      0.63      0.62      4755

   micro avg       0.63      0.63      0.63      4755


### Experiment 2: VADER with Lemmatized Text

In [12]:
airline_tweets = []
airline_all_vader_output = []
airline_gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

# perform sentiment analysis with VADER on each tweet
for tweet, label in data:
    vader_output = run_vader(tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    airline_tweets.append(tweet)
    airline_all_vader_output.append(vader_label)
    airline_gold.append(label)
    
# use scikit-learn's classification report
print(classification_report(airline_gold, airline_all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions: for single-label
# multiclass data, classification_report shows 'accuracy' in place of a micro average row (the two are equal)
precision_micro = precision_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
recall_micro = recall_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(airline_tweets):>10}")

              precision    recall  f1-score   support

    negative       0.79      0.52      0.63      1750
     neutral       0.60      0.49      0.54      1515
    positive       0.56      0.88      0.68      1490

    accuracy                           0.62      4755
   macro avg       0.65      0.63      0.62      4755
weighted avg       0.65      0.62      0.62      4755

   micro avg       0.62      0.62      0.62      4755


### Experiment 3: VADER with Only Adjectives

In [13]:
airline_tweets = []
airline_all_vader_output = []
airline_gold = []

# settings (to change for different experiments)
to_lemmatize = False 
pos = {"ADJ", "JJ", "JJR", "JJS"}  # run_vader compares against token.pos_ (universal tags), so only "ADJ" can match; the Penn Treebank tags (found in token.tag_) have no effect

# perform sentiment analysis with VADER on each tweet
for tweet, label in data:
    vader_output = run_vader(tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    airline_tweets.append(tweet)
    airline_all_vader_output.append(vader_label)
    airline_gold.append(label)
    
# use scikit-learn's classification report
print(classification_report(airline_gold, airline_all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions: for single-label
# multiclass data, classification_report shows 'accuracy' in place of a micro average row (the two are equal)
precision_micro = precision_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
recall_micro = recall_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(airline_tweets):>10}")

              precision    recall  f1-score   support

    negative       0.86      0.20      0.33      1750
     neutral       0.40      0.89      0.55      1515
    positive       0.67      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.64      0.51      0.47      4755
weighted avg       0.65      0.50      0.46      4755

   micro avg       0.50      0.50      0.50      4755


### Experiment 4: VADER with Lemmatized Text of Only Adjectives

In [14]:
airline_tweets = []
airline_all_vader_output = []
airline_gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = {"ADJ", "JJ", "JJR", "JJS"}  # run_vader compares against token.pos_ (universal tags), so only "ADJ" can match; the Penn Treebank tags (found in token.tag_) have no effect

# perform sentiment analysis with VADER on each tweet
for tweet, label in data:
    vader_output = run_vader(tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    airline_tweets.append(tweet)
    airline_all_vader_output.append(vader_label)
    airline_gold.append(label)
    
# use scikit-learn's classification report
print(classification_report(airline_gold, airline_all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions; for single-label
# data, classification_report prints an 'accuracy' row instead, because the micro averages equal accuracy
precision_micro = precision_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
recall_micro = recall_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(airline_tweets):>10}")

              precision    recall  f1-score   support

    negative       0.86      0.20      0.33      1750
     neutral       0.40      0.89      0.55      1515
    positive       0.67      0.44      0.53      1490

    accuracy                           0.50      4755
   macro avg       0.64      0.51      0.47      4755
weighted avg       0.65      0.50      0.46      4755

   micro avg       0.50      0.50      0.50      4755


### Experiment 5: VADER with Only Nouns

In [15]:
airline_tweets = []
airline_all_vader_output = []
airline_gold = []

# settings (to change for different experiments)
to_lemmatize = False 
pos = {"NOUN", "NN", "NNS"}  # include universal as well as Penn Treebank tags for nouns (proper nouns not included)

# perform sentiment analysis with VADER on each tweet
for tweet, label in data:
    vader_output = run_vader(tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    airline_tweets.append(tweet)
    airline_all_vader_output.append(vader_label)
    airline_gold.append(label)
    
# use scikit-learn's classification report
print(classification_report(airline_gold, airline_all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions; for single-label
# data, classification_report prints an 'accuracy' row instead, because the micro averages equal accuracy
precision_micro = precision_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
recall_micro = recall_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(airline_tweets):>10}")

              precision    recall  f1-score   support

    negative       0.73      0.14      0.23      1750
     neutral       0.36      0.82      0.50      1515
    positive       0.53      0.35      0.42      1490

    accuracy                           0.42      4755
   macro avg       0.54      0.44      0.39      4755
weighted avg       0.55      0.42      0.38      4755

   micro avg       0.42      0.42      0.42      4755


### Experiment 6: VADER with Lemmatized Text of Only Nouns

In [16]:
airline_tweets = []
airline_all_vader_output = []
airline_gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = {"NOUN", "NN", "NNS"}  # include universal as well as Penn Treebank tags for nouns (proper nouns not included)

# perform sentiment analysis with VADER on each tweet
for tweet, label in data:
    vader_output = run_vader(tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    airline_tweets.append(tweet)
    airline_all_vader_output.append(vader_label)
    airline_gold.append(label)
    
# use scikit-learn's classification report
print(classification_report(airline_gold, airline_all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions; for single-label
# data, classification_report prints an 'accuracy' row instead, because the micro averages equal accuracy
precision_micro = precision_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
recall_micro = recall_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(airline_tweets):>10}")

              precision    recall  f1-score   support

    negative       0.71      0.15      0.25      1750
     neutral       0.36      0.81      0.50      1515
    positive       0.52      0.34      0.41      1490

    accuracy                           0.42      4755
   macro avg       0.53      0.44      0.39      4755
weighted avg       0.54      0.42      0.38      4755

   micro avg       0.42      0.42      0.42      4755


### Experiment 7: VADER with Only Verbs

In [17]:
airline_tweets = []
airline_all_vader_output = []
airline_gold = []

# settings (to change for different experiments)
to_lemmatize = False 
pos = {"VERB", "MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}  # include universal as well as Penn Treebank tags for verbs

# perform sentiment analysis with VADER on each tweet
for tweet, label in data:
    vader_output = run_vader(tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    airline_tweets.append(tweet)
    airline_all_vader_output.append(vader_label)
    airline_gold.append(label)
    
# use scikit-learn's classification report
print(classification_report(airline_gold, airline_all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions; for single-label
# data, classification_report prints an 'accuracy' row instead, because the micro averages equal accuracy
precision_micro = precision_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
recall_micro = recall_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(airline_tweets):>10}")

              precision    recall  f1-score   support

    negative       0.79      0.29      0.42      1750
     neutral       0.38      0.81      0.52      1515
    positive       0.57      0.35      0.43      1490

    accuracy                           0.47      4755
   macro avg       0.58      0.48      0.46      4755
weighted avg       0.59      0.47      0.46      4755

   micro avg       0.47      0.47      0.47      4755


### Experiment 8: VADER with Lemmatized Text of Only Verbs

In [18]:
airline_tweets = []
airline_all_vader_output = []
airline_gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = {"VERB", "MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}  # include universal as well as Penn Treebank tags for verbs

# perform sentiment analysis with VADER on each tweet
for tweet, label in data:
    vader_output = run_vader(tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # use run_vader function
    vader_label = vader_output_to_label(vader_output) # use the predefined function above to get the labels based on scores
    
    airline_tweets.append(tweet)
    airline_all_vader_output.append(vader_label)
    airline_gold.append(label)
    
# use scikit-learn's classification report
print(classification_report(airline_gold, airline_all_vader_output))

# calculate the micro precision, recall and f1 score using the corresponding functions; for single-label
# data, classification_report prints an 'accuracy' row instead, because the micro averages equal accuracy
precision_micro = precision_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
recall_micro = recall_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')
f1_score_micro = f1_score(y_true=airline_gold, y_pred=airline_all_vader_output, average='micro')

print(f"{'micro avg':>12}{precision_micro:>11.2f}{recall_micro:>10.2f}{f1_score_micro:>10.2f}{len(airline_tweets):>10}")

              precision    recall  f1-score   support

    negative       0.75      0.29      0.42      1750
     neutral       0.38      0.78      0.51      1515
    positive       0.57      0.36      0.44      1490

    accuracy                           0.47      4755
   macro avg       0.57      0.48      0.46      4755
weighted avg       0.58      0.47      0.46      4755

   micro avg       0.47      0.47      0.47      4755


### **Comparison**

#### Lemmatization:
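Lemmatization has almost no effect in the POS-filtered runs. With nouns only (Experiments 5 and 6), accuracy stays at 0.42 and macro F1 at 0.39; with verbs only (Experiments 7 and 8), accuracy stays at 0.47 and macro F1 at 0.46. Per-class scores shift by at most 0.04 (for example, negative precision for verbs drops from 0.79 to 0.75). A plausible explanation is that the VADER lexicon already lists many inflected forms, so lemmatizing a token rarely changes whether it matches a lexicon entry.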

#### Importance of POS for Sentiment Analysis:
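Among the lemmatized single-POS runs, adjectives score highest (Experiment 4: accuracy 0.50, macro F1 0.47), followed by verbs (Experiment 8: 0.47 and 0.46) and nouns (Experiment 6: 0.42 and 0.39). This suggests that adjectives carry the most sentiment information for VADER and nouns the least. In Experiments 4 to 8, neutral recall is high (0.78 to 0.89) and negative recall is low (0.14 to 0.29), presumably because many tweets contain no lexicon word once all other parts of speech are filtered out, so VADER labels them neutral.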

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df = 10)
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?

In [19]:
# imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

In [20]:
# data split
df = pd.DataFrame(data, columns=["tweet", "sentiment"])
X_train, X_test, y_train, y_test = train_test_split(df["tweet"], df["sentiment"], test_size=0.2, random_state=42)

In [21]:
# training with default settings (TF-IDF representation, min_df=2)
vectorizer = TfidfVectorizer(min_df=2)
X_train_vectorizer = vectorizer.fit_transform(X_train)
X_test_vectorizer = vectorizer.transform(X_test)

# use Multinomial Naive Bayes, a common choice for text classification with count or TF-IDF features
model = MultinomialNB()
model.fit(X_train_vectorizer, y_train)
y_pred = model.predict(X_test_vectorizer)

print("Basic settings (TF-IDF representation, min_df=2)")
print(classification_report(y_test, y_pred))

Basic settings (TF-IDF representation, min_df=2)
              precision    recall  f1-score   support

    negative       0.79      0.93      0.85       328
     neutral       0.81      0.72      0.76       296
    positive       0.88      0.81      0.85       327

    accuracy                           0.82       951
   macro avg       0.83      0.82      0.82       951
weighted avg       0.83      0.82      0.82       951



In [22]:
# DIFFERENT SETTINGS
# with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count')

# TF-IDF ('airline_tfidf')
# the TF-IDF run with min_df=2 is the default-settings experiment in the previous cell, so it is not repeated here

In [23]:
# Bag of words representation ('airline_count')
# for BoW use CountVectorizer
vectorizer_count = CountVectorizer(min_df=2)  
X_train_counter = vectorizer_count.fit_transform(X_train)
X_test_counter = vectorizer_count.transform(X_test)

model_count = MultinomialNB()
model_count.fit(X_train_counter, y_train)
y_pred_count = model_count.predict(X_test_counter)

print("Bag of words representation ('airline_count')")
print(classification_report(y_test, y_pred_count))

Bag of words representation ('airline_count')
              precision    recall  f1-score   support

    negative       0.84      0.91      0.87       328
     neutral       0.81      0.78      0.79       296
    positive       0.87      0.83      0.85       327

    accuracy                           0.84       951
   macro avg       0.84      0.84      0.84       951
weighted avg       0.84      0.84      0.84       951



In [24]:
# DIFFERENT SETTINGS 2
# with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 

min_df_values = [2, 5, 10]
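
The cell above only defines the thresholds. A minimal sketch of how the sweep could be completed follows, assuming the same split size and classifier as in the earlier cells. The function name `sweep_min_df`, the toy tweets, and the `stratify` argument are assumptions added for illustration; in the notebook the function would be called with `df["tweet"]` and `df["sentiment"]` instead of the toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def sweep_min_df(texts, labels, min_df_values=(2, 5, 10)):
    """Train Naive Bayes for each (representation, min_df) pair.

    Returns {(representation, min_df): (vocabulary size, test accuracy)}.
    """
    x_tr, x_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )
    results = {}
    for name, vectorizer_class in (("tfidf", TfidfVectorizer), ("count", CountVectorizer)):
        for min_df in min_df_values:
            vec = vectorizer_class(min_df=min_df)
            model = MultinomialNB().fit(vec.fit_transform(x_tr), y_tr)
            accuracy = accuracy_score(y_te, model.predict(vec.transform(x_te)))
            results[(name, min_df)] = (len(vec.vocabulary_), accuracy)
    return results


# hypothetical stand-in for the airline tweets, repeated so that min_df=10 still keeps some terms
toy = [
    ("flight delayed again", "negative"),
    ("worst service ever", "negative"),
    ("cancelled flight no help", "negative"),
    ("what gate is my flight", "neutral"),
    ("flight leaves at noon", "neutral"),
    ("great crew thanks", "positive"),
    ("love the service", "positive"),
    ("best flight ever", "positive"),
] * 6
texts = [text for text, _ in toy]
labels = [label for _, label in toy]

for (name, min_df), (vocab_size, accuracy) in sweep_min_df(texts, labels).items():
    print(f"{name:>5}  min_df={min_df:<2}  vocabulary={vocab_size:<4}  accuracy={accuracy:.2f}")
```

Reporting the vocabulary size next to the score makes it easier to answer part b: a higher `min_df` can only shrink the vocabulary, so any change in the scores can be related to how many rare terms were dropped.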

### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test; Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [25]:
def important_features_per_class(vectorizer, classifier, n=80):
    """Print the n most frequent features per class, ranked by their raw count in that class."""
    class_labels = classifier.classes_
    # get_feature_names() was removed in newer scikit-learn versions; use get_feature_names_out() instead
    feature_names = vectorizer.get_feature_names_out()
    for index, class_label in enumerate(class_labels):
        if index > 0:
            print("-----------------------------------------")
        print(f"Important words in {class_label} documents")
        top_n = sorted(zip(classifier.feature_count_[index], feature_names), reverse=True)[:n]
        for count, feature in top_n:
            print(class_label, count, feature)

# example of how to call from notebook:
#important_features_per_class(airline_vec, clf)

# imports needed by this cell (harmless if they were already imported earlier in the notebook)
import nltk
from sklearn.datasets import load_files

airline_tweets = load_files(
    "airlinetweets",
    categories=["neutral", "positive", "negative"],
    encoding="utf-8",
    shuffle=True,
    random_state=0,
)
tweets = airline_tweets.data
label_names = airline_tweets.target_names
labels = [label_names[label] for label in airline_tweets.target]  # map integer targets to class names

x_train, x_test, y_train, y_test = train_test_split(tweets, labels, train_size=0.8, random_state=0)

# Optional: pre-tokenize with NLTK and drop the tokenizer argument below; CountVectorizer's default
# token pattern then discards punctuation tokens, which may improve the model.
# x_train = [" ".join(nltk.word_tokenize(text)) for text in x_train]
# x_test = [" ".join(nltk.word_tokenize(text)) for text in x_test]

vectorizer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)
train_counts = vectorizer.fit_transform(x_train)
test_counts = vectorizer.transform(x_test)

classifier = MultinomialNB()
classifier.fit(train_counts, y_train)

important_features_per_class(vectorizer, classifier)



Important words in negative documents
negative 1490.0 @
negative 1383.0 united
negative 1233.0 .
negative 734.0 to
negative 568.0 i
negative 538.0 the
negative 451.0 a
negative 414.0 ``
negative 394.0 flight
negative 393.0 ?
negative 374.0 you
negative 369.0 !
negative 361.0 and
negative 325.0 my
negative 322.0 on
negative 315.0 for
negative 314.0 #
negative 294.0 is
negative 257.0 in
negative 216.0 n't
negative 210.0 of
negative 196.0 it
negative 194.0 that
negative 188.0 your
negative 171.0 have
negative 167.0 not
negative 161.0 me
negative 157.0 with
negative 153.0 was
negative 153.0 ''
negative 152.0 no
negative 142.0 at
negative 129.0 this
negative 122.0 's
negative 114.0 service
negative 114.0 do
negative 111.0 :
negative 110.0 from
negative 104.0 now
negative 99.0 get
negative 98.0 be
negative 97.0 virginamerica
negative 92.0 delayed
negative 90.0 we
negative 89.0 been
negative 89.0 are
negative 88.0 plane
negative 88.0 just
negative 88.0 cancelled
negative 86.0 an
negative 85.0

### Expected Features

#### Negative Class
**Expected Words:** `"delay"`, `"cancelled"`, `"bad"`, `"late"`, `"rude"`  
**Why?** People usually express frustration on Twitter when their flight is delayed or cancelled, or when they experience poor service.

#### Neutral Class
**Expected Words:** `"flight"`, `"gate"`, `"airport"`  
**Why?** Neutral tweets are often purely factual and mention general travel-related words without expressing strong emotion. This also makes the neutral class harder to predict.

#### Positive Class
**Expected Words:** `"helpful"`, `"friendly"`, `"best"`, `"great"`, `"amazing"`  
**Why?** Happy customers tend to express gratitude and enthusiasm when they receive good service.

---

### Unexpected Features 
Several unexpected features appear in the output. Punctuation and symbols such as `"@"`, `"."`, `"!"`, `"?"`, and `"#"` are structural elements of tweets and do not carry sentiment on their own. Common stopwords such as `"the"`, `"a"`, `"is"`, `"with"`, and `"that"` also rank high, even though they are function words that occur in tweets of every class. This follows from how `important_features_per_class` works: it ranks features by their raw count in each class (`feature_count_`), so the most frequent tokens come first whether or not they distinguish the classes. Numbers and generic tokens such as `"2"`, `"hours"`, and `"http"` were unexpected as well. While `"hour"` and `"hours"` may sometimes indicate delays, numerical values and URLs generally do not convey sentiment by themselves.


---

### Words That Should Be Kept Or Removed
To improve the classification model, we should remove words that don’t add sentiment meaning and keep those that do.

#### Words to Remove that have No Sentiment Meaning
- **Punctuation**: `"."`, `"!"`, `"?"`, `"..."`  
- **Stopwords**: `"the"`, `"to"`, `"a"`, `"is"`, `"in"`, `"with"`, `"that"`, `"be"`  
- **Mentions, Hashtag Markers & URLs**: `"@"`, `"#"`, `"http"`  
- **Symbols**: `"&"`, `"-"`, `":"`, `"''"`

#### Words to Keep that have Strong Sentiment Meaning
- **Negative Sentiment:** `"worst"`, `"cancelled"`, `"delayed"`, `"rude"`, `"bad"`, `"no"`  
- **Neutral Words:** `"ticket"`, `"service"`, `"airport"`, `"flight"`, `"gate"`  
- **Positive Sentiment:** `"helpful"`, `"best"`, `"great"`, `"friendly"`, `"good"`, `"awesome"`, `"amazing"`  


---

### Model Improvement 
To make the classification more accurate, we can apply the following techniques:

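1. **Remove Stopwords**: Use `stop_words="english"` in `CountVectorizer` to filter out common words that don’t add meaning. Note that this built-in list also contains negations such as `"no"` and `"not"`, so a custom list is needed if they should be kept.  
2. **Exclude Punctuation**: Drop the `tokenizer=nltk.word_tokenize` argument so that `CountVectorizer` falls back to its default token pattern, which ignores punctuation.  
3. **Apply Stemming or Lemmatization**: Convert words to their root form (e.g., `"delayed"` → `"delay"`, `"cancelled"` → `"cancel"`).  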

These refinements should help the Naive Bayes model rely on sentiment-bearing words rather than on frequent but uninformative tokens, which may improve its classification of the tweets.


### [Optional! (will not  be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook