# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.
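
To make the quantitative part concrete, the following minimal sketch computes Precision, Recall, and F<sub>1</sub> per category by hand, together with the macro and micro averages. The toy `gold` and `predicted` lists and the helper name `per_class_scores` are made up for illustration; in the assignment itself these numbers come from scikit-learn's `classification_report`.

```python
from collections import Counter

def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} computed from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label p was wrong
            fn[g] += 1  # the gold label g was missed

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_predicted = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_predicted if n_predicted else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores

# Toy data (hypothetical, not taken from the assignment)
gold      = ['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']
predicted = ['positive', 'negative', 'negative', 'negative', 'neutral', 'positive']

scores = per_class_scores(gold, predicted)

# Macro average: unweighted mean over the categories
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Micro average: pooled over all items; for single-label data this equals accuracy
micro_f1 = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

for label, (p, r, f1) in scores.items():
    print(f"{label:>8}  P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
print(f"macro F1={macro_f1:.3f}  micro F1={micro_f1:.3f}")
```

A large gap between precision and recall within one category, or between the macro and micro averages, is the kind of imbalance the quantitative evaluation asks you to discuss.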

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?). What do all scores mean?


### [3.5 points] Question1:

Regard the following sentences and their output as given by VADER. Regard sentences 1 to 7, and explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain what is going wrong or not when correct and incorrect results are produced. 

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

### Question1 Answer
First thing to note is that the scores of "neg", "neu", and "pos" range from 0 to 1. The scores are ratios: the proportions of the text that fall in each category. </br>
According to the VADER documentation, the scores of "neg", "neu", and "pos" do not take VADER's rule-based enhancements into account; these are applied when calculating the compound score. In the outputs above, however, negation does appear to affect these ratios (compare the "neg" score of sentences 1 and 2). </br>
The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). </br>
Standardized thresholds for the compound score are:
- Compound score >= 0.05: Positive sentiment 
- Compound score <= -0.05: Negative sentiment 
- Compound score between -0.05 and 0.05: Neutral sentiment 

VADER also considers the intensity of the sentiment: Amplifiers enhance the sentiment of the words they modify; Punctuation and emoticons are also sentiment modifiers. </br>
When negation is applied, VADER flips the sentiment of the words the negation is applied to.
Wordforms (e.g., "love", "loves", and "loved") are treated as separate entries in the lexicon.  </br>
[https://github.com/cjhutto/vaderSentiment/tree/master] </br>
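
The normalization and thresholds described above can be sketched as follows. This is an illustration under stated assumptions, not VADER's actual code: the normalization constant `ALPHA = 15`, the negation scalar `-0.74`, and the valence `3.2` for "love" are taken to match VADER's reference implementation and lexicon, and the function names are hypothetical.

```python
import math

ALPHA = 15        # normalization constant (assumed, as in VADER's reference implementation)
NEGATION = -0.74  # scalar applied to the valence of a negated word (assumed, same source)

def normalize(valence_sum, alpha=ALPHA):
    """Squash an unbounded sum of valences into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

def compound_to_label(compound, threshold=0.05):
    """Map a compound score to a label using the standardized thresholds."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

love = 3.2  # assumed lexicon valence for "love"

compound_1 = normalize(love)             # "I love apples"
compound_2 = normalize(love * NEGATION)  # "I don't love apples"

print(round(compound_1, 4), compound_to_label(compound_1))
print(round(compound_2, 4), compound_to_label(compound_2))
```

Under these assumptions the two values come out close to the compound scores that VADER reports for sentences 1 and 2, which suggests that negation scales and flips the valence rather than simply mirroring it.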

- Sentence 1
    - This output is plausible. </br> 
    The verb "love" results in a high positive score as it has a high positive sentiment in VADER's lexicon. Furthermore, the words "I" and "apples" are neutral words. </br>
    The compound score is also reasonable as the proportions of the text mainly fall in the positive (and neutral) sentiment. Thus, having a high compound score that indicates a positive sentiment is correct.
- Sentence 2
    - This output is plausible. </br>
    The use of "don't" before the word "love" results in a high negative score. This is because "love" is a strong positive word and the "don't" negates it. VADER thus flips the sentiment to a negative score. Furthermore, the words "I" and "apples" are neutral words. </br>
    The compound score is also plausible. It indicates a negative sentiment, which is consistent with the words used.
- Sentence 3
    - This output is plausible. </br>
    The verb "love" results in a high positive score as it has a high positive sentiment in VADER's lexicon. Additionally, the smiley face at the end is also marked as a positive sentiment in the lexicon of VADER. Thus, the positive score increases. Furthermore, the words "I" and "apples" are neutral words. </br>
    The compound score is also plausible. It shows that the sentence has a positive sentiment. Additionally, it is higher than that of sentence 1, which makes sense as the smiley face reinforces the positive sentiment.
- Sentence 4
    - This output is plausible. </br>
    The word "ruins" is interpreted as a negative word, while the other words are neutral. We would expect the negative score to be higher than the neutral score, but the neutral score is slightly higher, most likely because three of the four words are neutral. </br>
    Compound score seems plausible too, because the compound score indicates a negative sentiment.
- Sentence 5
    - This output is plausible. </br>
    The word "ruins" is interpreted as a negative word and since it is negated by the "not", the sentiment is marked positive. Furthermore, the positive sentiment is further reinforced by the word "certainly", which appears before the word "not". The other words are neutral (like in sentence 4). </br>
    The compound score suggests a positive sentiment. Given the positive score of the sentence, this outcome seems plausible.
- Sentence 6
    - This output is NOT plausible. </br>
    The "neg" categorization happened due to the word "lies", which has a mean sentiment rating of -1.8 in VADER's `vader_lexicon.txt`. However, "lies" is used here in the sense of "to lie down" (to recline) and not "to lie" (to tell an untruth). Ideally, VADER would disambiguate the word based on the context, but a lexicon lookup on word forms cannot do that. </br>
    Furthermore, the sentence does not include any words that have an emotional value. Thus, it should only have a neutral score. </br>
    Additionally, the compound score indicates a negative sentiment. This is also wrong, since it follows from the wrongly assigned "neg" score. (Since the scores of "neg", "pos", and "neu" add up to 1, a wrongly given "neg" score also influences the scores of the other categories.)
- Sentence 7
    - This output is NOT plausible. </br>
    In this sentence, mostly neutral words are used. The positive score is due to the word "like" (mean sentiment rating: 1.5), which can be the verb "to like" but is used here for comparison. Ideally, VADER would tell the difference from the context and treat the comparative "like" as neutral.</br>
    The compound score suggests that there is a positive sentiment. However, it should be a neutral sentiment. Thus, this is not correct.

### [Points: 2.5] Exercise 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream. If you have trouble accessing Twitter, try to find an existing dataset (on websites like kaggle or huggingface).

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": "",
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [3]:
import json

In [29]:
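# Open the file in a context manager so it is closed after loading
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)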

In [30]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'negative', 'text_of_tweet': 'Michelle Trachtenberg has sadly passed away at the age of 39.', 'tweet_url': 'https://x.com/DiscussingFilm/status/1894801472610611220'}
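
Because the gold labels are typed by hand, a quick sanity check before the evaluation can catch typos such as a misspelled label. The sketch below is one possible way to do this; the helper `validate_tweets` and the in-memory `example` dictionary are hypothetical, and in the notebook you would pass `my_tweets` instead.

```python
ALLOWED_LABELS = {'negative', 'neutral', 'positive'}
REQUIRED_KEYS = {'sentiment_label', 'text_of_tweet', 'tweet_url'}

def validate_tweets(tweets):
    """Return a list of human-readable problems found in the annotations."""
    problems = []
    for id_, info in tweets.items():
        missing = REQUIRED_KEYS - set(info)
        if missing:
            problems.append(f"{id_}: missing keys {sorted(missing)}")
            continue
        if info['sentiment_label'] not in ALLOWED_LABELS:
            problems.append(f"{id_}: unexpected label {info['sentiment_label']!r}")
        if not info['text_of_tweet'].strip():
            problems.append(f"{id_}: empty tweet text")
    return problems

# Hypothetical annotations with two deliberate mistakes
example = {
    '1': {'sentiment_label': 'positive', 'text_of_tweet': 'Great day!',
          'tweet_url': 'https://example.org/1'},
    '2': {'sentiment_label': 'postive', 'text_of_tweet': 'Typo in the label',
          'tweet_url': 'https://example.org/2'},
    '3': {'sentiment_label': 'neutral', 'text_of_tweet': '',
          'tweet_url': 'https://example.org/3'},
}

problems = validate_tweets(example)
for problem in problems:
    print(problem)
```

An unnoticed label such as `postive` would otherwise show up as a fourth category in the classification report.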


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point.
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [31]:
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

vader_model = SentimentIntensityAnalyzer()
nlp = spacy.load('en_core_web_sm')

def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None):
    
    doc = nlp(textual_unit)       
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))

    return scores




In [32]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'

In [33]:
from sklearn.metrics import classification_report

tweets = []
all_vader_output = []
gold = []

# Settings (to change for different experiments)
to_lemmatize = True
pos = set()  # empty set: consider all parts of speech

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(the_tweet,
                             lemmatize=to_lemmatize,
                             parts_of_speech_to_consider=pos)
    vader_label = vader_output_to_label(vader_output)
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
# use scikit-learn's classification report
report = classification_report(gold, all_vader_output, digits = 3)
print(report)

              precision    recall  f1-score   support

    negative      0.727     0.842     0.780        19
     neutral      0.846     0.846     0.846        13
    positive      0.800     0.667     0.727        18

    accuracy                          0.780        50
   macro avg      0.791     0.785     0.785        50
weighted avg      0.784     0.780     0.778        50



**Answer 3a.:** Precision, recall, and F1-score are used to measure how well VADER classified the tweets.
<br> 

Precision tells us how many of the tweets that VADER labeled as a certain sentiment were actually correct. For example, VADER had a precision of 72.7% for negative tweets, meaning that when it said a tweet was negative, it was correct about 73% of the time. For neutral tweets, precision is higher (84.6%), making it the most reliable category. Positive sentiment had a precision of 80%, which is close to neutral but still leaves room for error. 
<br>

Recall, on the other hand, measures how many actual tweets of each sentiment were correctly identified. VADER correctly caught 84.2% of negative tweets, which is a strong result, and 84.6% of neutral tweets, meaning it was also good at neutral detection. However, recall for positive tweets was lower at 66.7%, meaning VADER missed quite a few of them. 
<br>

The F1-score balances precision and recall, and in this case, neutral tweets also had the highest score at 84.6%, negative tweets followed at 78%, and positive tweets were the weakest at 72.7%. 

<br>
To answer the question of which score matters most, we need to take into account what we want to achieve. If the goal is to catch as many tweets of a certain sentiment as possible, recall is the most important because it tells us how many we missed. This is particularly relevant for negative sentiment, where missing complaints or criticism could be a problem (for example, for companies looking to understand the sentiment about their products). If we want to make sure that when VADER labels something, it's actually correct, then precision is more important. That would matter more for positive sentiment, where we want to be sure that tweets labeled as positive are truly positive. However, since the F1-score balances them both, it is generally the best way to compare overall performance, and in this case, it confirms that VADER handled neutral and negative tweets well but struggled with positive ones. 
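
As a cross-check on the numbers discussed above, the summary rows of the report can be reproduced from the per-category rows. The sketch below uses the rounded values copied from the report, so the last digit may differ slightly; the variable and function names are illustrative.

```python
# (precision, recall, f1, support) per category, copied from the report above
report_rows = {
    'negative': (0.727, 0.842, 0.780, 19),
    'neutral':  (0.846, 0.846, 0.846, 13),
    'positive': (0.800, 0.667, 0.727, 18),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

total = sum(row[3] for row in report_rows.values())

# Macro average: every category counts equally
macro_precision = sum(row[0] for row in report_rows.values()) / len(report_rows)

# Weighted average: every category counts in proportion to its support
weighted_precision = sum(row[0] * row[3] for row in report_rows.values()) / total

# Accuracy: correctly classified tweets per category = recall * support
correct = sum(round(row[1] * row[3]) for row in report_rows.values())
accuracy = correct / total

for label, (p, r, reported_f1, _) in report_rows.items():
    print(label, round(f1_score(p, r), 3), 'reported:', reported_f1)
print(round(macro_precision, 3), round(weighted_precision, 3), round(accuracy, 3))
```

The weighted average is pulled slightly towards the negative and positive categories because they have more tweets than the neutral category.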

In [34]:
# --Task 3.b--

import pandas as pd
pd.set_option("display.max_colwidth", None)

misclassified_tweets = []

# Going through each tweet and comparing VADER prediction to human label
for idx, (true_label, predicted_label, text) in enumerate(zip(gold, all_vader_output, tweets)):
    if true_label != predicted_label:
        misclassified_tweets.append({
            "Tweet ID": idx + 1, 
            "Tweet Text": text,
            "Gold Label": true_label,
            "VADER Prediction": predicted_label
        })

df_misclassified = pd.DataFrame(misclassified_tweets)
df_positive_errors = df_misclassified[df_misclassified["Gold Label"] == "positive"].head(10)
df_negative_errors = df_misclassified[df_misclassified["Gold Label"] == "negative"].head(10)
df_neutral_errors = df_misclassified[df_misclassified["Gold Label"] == "neutral"].head(10)

In [35]:
print("MISCLASSIFIED POSITIVE TWEETS: \n")
print(df_positive_errors.to_string(index=False))

MISCLASSIFIED POSITIVE TWEETS: 

 Tweet ID Tweet Text                                                                                                                                                                                                              Gold Label VADER Prediction
11                                                                                                                           This is Brooks. He found a giant stick on his walk and stubbornly carried it all the way home. 14/10 positive   negative        
18                                                                                                                                                                                                   Spectacular photograph, Don! positive    neutral        
21        Good news, Elon Musk has tens of billions in loans secured against Tesla. If Tesla goes bankrupt he will go completely bankrupt as all his current possessions will be needed to pay his debts. Tim

**Answer 3.b (1):**
- Tweet ID 11: The word "stubbornly" is in the lexicon with a negative score of -1.4. Since there aren’t any strong positive words in the sentence that appear in the lexicon, this negative value skews the overall sentiment. Additionally, "14/10" is a highly positive rating in human interpretation, but VADER does not assign sentiment to numerical ratings, so it was ignored.
- Tweet ID 18: The word "spectacular" is not in the lexicon, so VADER doesn’t register it as positive. Since "photograph" and "Don" also aren’t in the lexicon, VADER fails to detect any positive sentiment. Additionally, punctuation only amplifies sentiment that is already present, so with no sentiment-bearing word the single exclamation mark could not push the score into the positive range.
- Tweet ID 21: The word "good" is in the lexicon with a positive score of 1.9, which means it does contribute to the sentiment calculation. However, "news" is not in the lexicon, so "good news" as a phrase is not explicitly recognized as positive sentiment. At the same time, the words "bankrupt" (-2.6) and "debt" (-1.5) are in the lexicon with strong negative scores, and these values heavily influence the total sentiment score. Since VADER follows a summation rule, the multiple high-magnitude negative words overpower the single positive word "good" (1.9). As a result, despite starting with a seemingly positive phrase, the overall sentiment score becomes negative due to the dominance of strongly negative financial terms.
- Tweet ID 23: The phrase "bad day" is commonly understood as negative, but "bad" is the only word from that phrase in the lexicon, and it has a very strong score of -2.5. Since VADER assigns more weight to high-scoring negative words, the sentiment gets pulled toward the negative side. The phrase "Good boi" is very positive to humans, but "boi" (which is social media slang) is not in the lexicon, meaning VADER does not recognize this as positive sentiment. Because the first strong word in the tweet was negative, VADER treated the tweet as overall negative.
- Tweet ID 45: The words "illegal" (-2.6) and "war" (-2.9) are strongly negative in the lexicon, which means the first half of the sentence gets assigned a high negative sentiment score. However, VADER does apply a rule for contrastive conjunctions like "but", meaning that the second part of the sentence ("Canada will be there, standing with Ukraine") is supposed to carry more weight in the final sentiment calculation. Despite this rule, the positive words in the second half of the sentence are not strong enough to fully neutralize the very high-magnitude negative words in the first half. Since words like "Canada" and "standing with Ukraine" are not in the lexicon, they do not contribute a clear positive sentiment score. The lack of high-scoring positive words likely resulted in the earlier negative words dominating the final classification despite the contrastive conjunction rule.
- Tweet ID 46: The VADER lexicon does not contain "BLUE" or "WOWWWWW", meaning these words were ignored in the sentiment analysis. However, "WOW" (without elongation) is in the lexicon with a strong positive score of 2.8, so if the tweet had used "WOW" instead of "WOWWWWW", it likely would have contributed a strong positive sentiment. The key reason VADER misclassified this tweet as negative is the presence of "ghost", which is in the lexicon with a negative score of -1.3. Since no other recognized positive words were present, the negative weight from "ghost" pulled the overall sentiment score into the negative range.

In [36]:
print("MISCLASSIFIED NEGATIVE TWEETS: \n")
print(df_negative_errors.to_string(index=False))

MISCLASSIFIED NEGATIVE TWEETS: 

 Tweet ID Tweet Text                                                                                                                                                            Gold Label VADER Prediction
 3                                                                                   President Trump and Elon Musk are running this government like a discount furniture store. negative   positive        
 7                                                                                                                                         JUST IN: Bitcoin falls under $84,000 negative    neutral        
40        Elon Musk rolls up to the cabinet meeting he's attending dressed like a bum, giggling about his 'tech support' t-shirt like a 13 year-old boy. What an embarrassment. negative   positive        


**Answer 3.b (2):**
- Tweet ID 3: The phrase "like a discount furniture store" was likely misinterpreted because of the word "like", which has a positive sentiment score of 1.5 in the lexicon. While in context, this phrase is meant as criticism, VADER sees "like" as expressing a positive comparison, which shifts the sentiment score upwards. Since VADER applies valence shifting rules but not deep semantic understanding, it doesn't recognize the sarcastic intent behind the comparison, leading to an incorrect positive classification.
- Tweet ID 7: The key issue here is that "falls" is not in the VADER lexicon. For humans, when discussing financial markets, "falls" generally carries a negative implication. Without a clear negative word in the lexicon, VADER doesn’t recognize this as a negative financial event, leading to a neutral classification. However, the label we chose for this tweet is debatable: some people see a price drop as a good thing, since it presents an opportunity to buy at a lower price, while others see it as a bad thing, since holders are losing money. Under that reading the tweet could be neutral and the VADER label might be correct. We nevertheless chose a negative sentiment because we believe it to be the most common interpretation: a falling price usually isn’t a positive event.
- Tweet ID 40: The word "giggling" has a positive score of 1.5, and "support" has an even stronger positive score of 1.7, which likely increased the overall sentiment score. Meanwhile, "embarrassment" is correctly listed in the lexicon as negative (-1.9), but it wasn’t enough to outweigh the positive words. While a human can easily tell that calling someone a "bum" (which isn’t included in the lexicon) and comparing them to a "13-year-old boy" is mocking, VADER sees "giggling" and "support" as positive and doesn't recognize the insulting intent behind the phrase.

In [37]:
print("MISCLASSIFIED NEUTRAL TWEETS: \n")
print(df_neutral_errors.to_string(index=False))

MISCLASSIFIED NEUTRAL TWEETS: 

 Tweet ID Tweet Text                                                                                                                                                          Gold Label VADER Prediction
29                                        CGI primates from “Better Man” and “Kingdom of the Planet of the Apes” face off in this year’s Oscars race for best visual effects. neutral    positive        
33        Artists including Kate Bush and Cat Stevens made an album of white noise in empty studios, protesting a U.K. proposal to give AI firms access to copyrighted music. neutral    negative        


**Answer 3.b (3):**
- Tweet ID 29: The tweet is simply reporting facts about the Oscars, so humans interpret it as neutral, but VADER assigns positive sentiment due to two words in its lexicon: "Better" (1.9) contributes positivity, "Best" (3.2) has an even stronger positive weight. Since VADER is a summation-based model, these two words alone likely pushed the compound score into the positive range, even though the tweet itself does not express an opinion. Additionally, VADER does not understand that "best" is referring to an awards category rather than an evaluative sentiment. This is a common issue with domain-specific words (like award categories) being misinterpreted as subjective sentiment.
- Tweet ID 33: Two words contributed to the misclassification as negative: "Protest" (-1.0) has a negative sentiment in the lexicon, even though in this case, it’s being used in a neutral factual sense, and "Empty" (-0.8) that also contributes negativity, even though it refers to a physical space rather than a negative emotional tone.
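
Each explanation in the error analyses above starts from the same step: looking up which tokens of a tweet carry a valence in the lexicon. The sketch below automates that lookup. The helper `sentiment_tokens` is hypothetical, `toy_lexicon` only contains valences quoted in the answers above, and the example text is a shortened paraphrase of tweet 21. In the notebook, the `lexicon` dictionary of the analyzer (`vader_model.lexicon`) could presumably be passed instead.

```python
def sentiment_tokens(text, lexicon):
    """Return (token, valence) pairs for the tokens that occur in the lexicon."""
    hits = []
    for raw in text.split():
        # Strip surrounding punctuation and lowercase before the lookup
        token = raw.strip('.,!?:;"\'()').lower()
        if token in lexicon:
            hits.append((token, lexicon[token]))
    return hits

# Toy lexicon holding only valences quoted in the answers above
toy_lexicon = {'good': 1.9, 'bankrupt': -2.6, 'debt': -1.5, 'like': 1.5}

# Shortened paraphrase of tweet 21 (illustrative only)
text = "Good news, if Tesla goes bankrupt he will go completely bankrupt."

hits = sentiment_tokens(text, toy_lexicon)
raw_sum = sum(valence for _, valence in hits)

print(hits)
print(round(raw_sum, 2))
```

The raw sum ignores VADER's rules (negation, boosters such as "completely", punctuation), so it only indicates which lexicon entries pull the score in which direction.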

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    - Does lemmatisation help? Explain why or why not.
    - Are all parts of speech equally important for sentiment analysis? Explain why or why not.

**Answers to Question 4** <br>
First, we unzip the `airlinetweets.zip` file.

In [4]:
import os
import zipfile

# Unzip the 'airlinetweets.zip' file if not already unzipped
extract_folder = 'airlinetweets'

if not os.path.exists(extract_folder):
    zip_file_path = 'airlinetweets.zip'

    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall()

The 'airlinetweets' folder in the `airlinetweets.zip` file contains three folders: 'negative', 'neutral', and 'positive'. 

Each of these three folders (representing gold labels) contains hundreds of .txt files, each file containing the raw text of an individual tweet.

E.g., "@VirginAmerica my group got their Cancelled Flightlation fees waived but I can't because my ticket is booked for 2/18? Your reps were no help either 😡" is the content of a .txt file in the 'negative' folder, as its sentiment is negative.

Now, we define a function to run for each experiment above.

In [11]:
import pandas as pd

# Define function to run for each experiment
def vader_experiment(
        lemmatize=False,
        parts_of_speech_to_consider=None
):
    """Run VADER on all airline tweets and print the classification report."""
    # Define the paths to the sentiment folders
    sentiment_folders = ['negative', 'neutral', 'positive']  # gold labels
    results = []

    # Iterate through each sentiment folder and process the tweets
    for sentiment in sentiment_folders:
        folder_path = os.path.join(extract_folder, sentiment)
        
        for filename in os.listdir(folder_path):
            if filename.endswith('.txt'):
                file_path = os.path.join(folder_path, filename)
                
                # Read the tweet from the file
                with open(file_path, 'r', encoding='utf-8') as file:
                    tweet = file.read().strip()
                    
                    # Run VADER on the tweet
                    scores = run_vader(tweet, lemmatize, parts_of_speech_to_consider)
                    vader_label = vader_output_to_label(scores)
                    
                    # Store the results
                    results.append({
                        'tweet': tweet,
                        'sentiment': sentiment,  # GOLD: from folder name
                        'scores': scores,  # from VADER
                        'label': vader_label
                    })

    # Create a DataFrame from the results
    results_df = pd.DataFrame(results)

    # Prepare the gold labels and VADER predictions for evaluation
    gold_labels = results_df['sentiment'].tolist()
    vader_predictions = results_df['label'].tolist()

    # Generate the classification report
    report = classification_report(gold_labels, vader_predictions, digits=3)

    print(report)

Finally, we perform the experiments.
* Run VADER (as it is) on the set of airline tweets 

In [12]:
# Settings (to change for different experiments)
to_lemmatize = False
pos = None

# Run experiment
vader_experiment(
        lemmatize=to_lemmatize, 
        parts_of_speech_to_consider=pos
)

              precision    recall  f1-score   support

    negative      0.797     0.515     0.625      1750
     neutral      0.605     0.506     0.551      1515
    positive      0.559     0.884     0.685      1490

    accuracy                          0.628      4755
   macro avg      0.654     0.635     0.621      4755
weighted avg      0.661     0.628     0.620      4755



* Run VADER on the set of airline tweets after having lemmatized the text

In [13]:
# Settings (to change for different experiments)
to_lemmatize = True
pos = None

# Run experiment
vader_experiment(
        lemmatize=to_lemmatize, 
        parts_of_speech_to_consider=pos
)

              precision    recall  f1-score   support

    negative      0.786     0.521     0.627      1750
     neutral      0.597     0.488     0.537      1515
    positive      0.557     0.881     0.683      1490

    accuracy                          0.623      4755
   macro avg      0.647     0.630     0.615      4755
weighted avg      0.654     0.623     0.616      4755



* Run VADER on the set of airline tweets with only adjectives

In [14]:
# Settings (to change for different experiments)
to_lemmatize = False
pos = {'ADJ'}

# Run experiment
vader_experiment(
        lemmatize=to_lemmatize, 
        parts_of_speech_to_consider=pos
)

              precision    recall  f1-score   support

    negative      0.858     0.204     0.330      1750
     neutral      0.402     0.892     0.554      1515
    positive      0.669     0.438     0.529      1490

    accuracy                          0.496      4755
   macro avg      0.643     0.511     0.471      4755
weighted avg      0.653     0.496     0.464      4755



* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text

In [15]:
# Settings (to change for different experiments)
to_lemmatize = True
pos = {'ADJ'}

# Run experiment
vader_experiment(
        lemmatize=to_lemmatize, 
        parts_of_speech_to_consider=pos
)

              precision    recall  f1-score   support

    negative      0.856     0.205     0.330      1750
     neutral      0.402     0.892     0.554      1515
    positive      0.668     0.438     0.529      1490

    accuracy                          0.497      4755
   macro avg      0.642     0.511     0.471      4755
weighted avg      0.653     0.497     0.464      4755



* Run VADER on the set of airline tweets with only nouns

In [16]:
# Settings (to change for different experiments)
to_lemmatize = False
pos = {'NOUN'}

# Run experiment
vader_experiment(
        lemmatize=to_lemmatize, 
        parts_of_speech_to_consider=pos
)

              precision    recall  f1-score   support

    negative      0.727     0.140     0.235      1750
     neutral      0.361     0.820     0.502      1515
    positive      0.535     0.350     0.423      1490

    accuracy                          0.423      4755
   macro avg      0.541     0.437     0.387      4755
weighted avg      0.550     0.423     0.379      4755



* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text

In [17]:
# Settings (to change for different experiments)
to_lemmatize = True
pos = {'NOUN'}

# Run experiment
vader_experiment(
        lemmatize=to_lemmatize, 
        parts_of_speech_to_consider=pos
)

              precision    recall  f1-score   support

    negative      0.714     0.154     0.253      1750
     neutral      0.361     0.811     0.499      1515
    positive      0.523     0.341     0.413      1490

    accuracy                          0.422      4755
   macro avg      0.532     0.435     0.388      4755
weighted avg      0.541     0.422     0.382      4755



* Run VADER on the set of airline tweets with only verbs

In [18]:
# Settings (to change for different experiments)
to_lemmatize = False
pos = {'VERB'}

# Run experiment
vader_experiment(
        lemmatize=to_lemmatize, 
        parts_of_speech_to_consider=pos
)

              precision    recall  f1-score   support

    negative      0.786     0.286     0.419      1750
     neutral      0.383     0.811     0.520      1515
    positive      0.569     0.350     0.433      1490

    accuracy                          0.473      4755
   macro avg      0.580     0.482     0.458      4755
weighted avg      0.590     0.473     0.456      4755



* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

In [19]:
# Settings (to change for different experiments)
to_lemmatize = True
pos = {'VERB'}

# Run experiment
vader_experiment(
        lemmatize=to_lemmatize, 
        parts_of_speech_to_consider=pos
)

              precision    recall  f1-score   support

    negative      0.751     0.291     0.420      1750
     neutral      0.378     0.782     0.509      1515
    positive      0.571     0.360     0.441      1490

    accuracy                          0.469      4755
   macro avg      0.567     0.478     0.457      4755
weighted avg      0.576     0.469     0.455      4755
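
To compare the eight runs, it can help to put the headline numbers side by side. The sketch below copies accuracy and macro F<sub>1</sub> from the reports above into a dictionary (the layout and the names `results` and `lemma_delta` are illustrative, not part of the notebook) and ranks the settings.

```python
# (POS filter, lemmatized) -> (accuracy, macro F1), copied from the reports above
results = {
    ("all", False): (0.628, 0.621),
    ("all", True): (0.623, 0.615),
    ("ADJ", False): (0.496, 0.471),
    ("ADJ", True): (0.497, 0.471),
    ("NOUN", False): (0.423, 0.387),
    ("NOUN", True): (0.422, 0.388),
    ("VERB", False): (0.473, 0.458),
    ("VERB", True): (0.469, 0.457),
}

# Rank the settings by macro F1, best first
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
print(f"{'POS filter':<12}{'lemmatized':<12}{'accuracy':>10}{'macro F1':>10}")
for (pos, lemmatized), (acc, f1) in ranked:
    print(f"{pos:<12}{str(lemmatized):<12}{acc:>10.3f}{f1:>10.3f}")

# Change in macro F1 caused by lemmatization, per POS setting
lemma_delta = {
    pos: round(results[(pos, True)][1] - results[(pos, False)][1], 3)
    for pos in ("all", "ADJ", "NOUN", "VERB")
}
print(lemma_delta)
```

In these runs the part-of-speech filter changes macro F<sub>1</sub> far more than lemmatization does: the largest lemmatization effect is 0.006.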



#TODO: Compare results and write answers to part 4b

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df = 10)
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ:
    + Which category performs best? Is this the case for every setting?
    + Does the frequency threshold affect the scores? Why or why not, in your view?
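
The experiment grid described above could be organised roughly as follows. This is a minimal sketch on a hypothetical toy corpus, not the airline data. It uses `TfidfVectorizer`, which combines counting and TF-IDF weighting in one step, so the course notebook may build the two representations differently. The names `run_experiment` and `vocab_sizes` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the airline tweets. It is repeated
# so that min_df = 2, 5 and 10 all leave a non-empty vocabulary; the scores
# themselves are therefore meaningless.
base_texts = [
    "great flight and great crew", "love the friendly crew",
    "great service thank you", "thank you for the great flight",
    "love this airline",
    "flight delayed again", "bad service and lost bag",
    "delayed flight and bad crew", "lost my bag again", "bad flight",
    "what time is the flight", "is the flight on time",
    "what is the bag limit", "flight to boston on monday",
    "what time is boarding",
]
base_labels = ["positive"] * 5 + ["negative"] * 5 + ["neutral"] * 5
texts, labels = base_texts * 4, base_labels * 4

# 80% training, 20% test, with the same split for every setting
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)


def run_experiment(vectorizer_cls, min_df):
    """Fit on the training split only, then evaluate on the test split."""
    vec = vectorizer_cls(min_df=min_df)
    clf = MultinomialNB().fit(vec.fit_transform(train_x), train_y)
    predictions = clf.predict(vec.transform(test_x))
    report = classification_report(test_y, predictions, digits=3, zero_division=0)
    return len(vec.vocabulary_), report


vocab_sizes = {}
for name, vectorizer_cls in [("tfidf", TfidfVectorizer), ("count", CountVectorizer)]:
    for min_df in (2, 5, 10):
        size, report = run_experiment(vectorizer_cls, min_df)
        vocab_sizes[(name, min_df)] = size
        print(f"--- {name}, min_df={min_df}, vocabulary size={size}")
        print(report)
```

Printing the vocabulary size next to each report makes it easier to relate score changes to the number of features that `min_df` removes.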

In [None]:
# Your code here


### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test; Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below)
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [None]:
def important_features_per_class(vectorizer, classifier, n=80):
    """Print the n most frequent features per class of a fitted Naive Bayes model."""
    # classes_ is sorted, so the order here is negative, neutral, positive
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    # feature_count_[i] holds the summed feature counts for class_labels[i]
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names), reverse=True)[:n]
    print("Important words in negative documents")
    for count, feat in topn_class1:
        print(class_labels[0], count, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for count, feat in topn_class2:
        print(class_labels[1], count, feat)
    print("-----------------------------------------")
    print("Important words in positive documents")
    for count, feat in topn_class3:
        print(class_labels[2], count, feat)

# example of how to call from notebook:
# important_features_per_class(airline_vec, clf)

### [Optional! (will not be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings.
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.
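
Before retraining, it can be useful to inspect how a setting changes the vocabulary. The sketch below compares three `CountVectorizer` configurations on hypothetical tweets. It uses scikit-learn's default tokenization, which may differ from the tokenizer passed in the course notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tweets for illustration
tweets = ["The flight was not good!!!", "Not a bad flight :)", "the crew was good"]

settings = {
    "default": {},
    "stop words removed": {"stop_words": "english"},
    "unigrams + bigrams": {"ngram_range": (1, 2)},
}

vocabularies = {}
for name, kwargs in settings.items():
    vec = CountVectorizer(**kwargs).fit(tweets)
    vocabularies[name] = sorted(vec.vocabulary_)
    print(f"{name}: {vocabularies[name]}")
```

With the built-in English stop word list, `not` is removed along with `the` and `was`, so "not good" and "good" end up with the same features. The bigram setting keeps `not good` and `not bad` as separate features, at the cost of a larger vocabulary.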

## End of this notebook