# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the LAB-3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit-learn's `classification_report`. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores.
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the [lexicon](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma? wordform? part-of-speech? polarity value? word meaning?). What do all the scores mean?
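
To make the lexicon question concrete, here is a minimal sketch of reading one lexicon line. It assumes the layout described in the VADER documentation: four tab-separated fields holding the token, the mean valence, the standard deviation, and the list of individual human ratings (on a scale from -4 to +4). The helper name `parse_lexicon_line` and the sample line are illustrative; check the real file for the actual values.

```python
import ast

def parse_lexicon_line(line):
    """Split one tab-separated lexicon line into its four fields."""
    token, mean, std, ratings = line.rstrip("\n").split("\t")
    return {
        "token": token,                        # surface form: a word form, emoticon, ...
        "mean": float(mean),                   # mean valence, the value VADER scores with
        "std": float(std),                     # spread of the human ratings
        "ratings": ast.literal_eval(ratings),  # the individual human ratings
    }

# Illustrative line in the assumed layout; the real file may hold other values.
sample = "love\t3.2\t0.4\t[3, 4, 3, 3, 3, 3, 4, 3, 3, 3]"
entry = parse_lexicon_line(sample)
print(entry["token"], entry["mean"], len(entry["ratings"]))
```

Note what is absent from such a line: there is no lemma, part of speech, or word sense, only a surface form with its ratings. That is worth keeping in mind for ambiguous words.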


### [3.5 points] Question 1:

Consider the following sentences and their output as given by VADER. For sentences 1 to 7, explain the outcome **for each sentence**. Take into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. Explain why VADER produces a correct or an incorrect result in each case.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}
```
- The words "I" and "apples" are neutral, while "love" is positive according to the VADER lexicon. At first glance it might seem odd that the positive score is higher than the neutral score, but this is reasonable since the word "love" has a very strong positive sentiment. The compound score of 0.6369 shows that VADER correctly identifies the sentence as positive overall.
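
The `neg`, `neu` and `pos` values for this sentence can be reproduced by hand. The sketch below rests on assumptions that should be verified against the lexicon and the NLTK source: "love" has a valence of 3.2, single-character tokens such as "I" are dropped before scoring, every sentiment-bearing token counts as |valence| + 1, and every neutral token counts as 1. The function name `proportions` is hypothetical.

```python
def proportions(valences):
    """Turn per-token valences into neg/neu/pos shares that sum to 1."""
    pos_sum = sum(v + 1 for v in valences if v > 0)
    neg_sum = sum(-v + 1 for v in valences if v < 0)  # kept as a positive magnitude
    neu_count = sum(1 for v in valences if v == 0)
    total = pos_sum + neg_sum + neu_count
    if total == 0:  # no tokens at all
        return {"neg": 0.0, "neu": 0.0, "pos": 0.0}
    return {
        "neg": round(neg_sum / total, 3),
        "neu": round(neu_count / total, 3),
        "pos": round(pos_sum / total, 3),
    }

# "I love apples": "I" dropped (assumption), "love" = 3.2 (assumption), "apples" = 0
print(proportions([3.2, 0.0]))  # {'neg': 0.0, 'neu': 0.192, 'pos': 0.808}
```

Under these assumptions the result matches the VADER output above, which explains why one strongly positive word outweighs the neutral share in such a short sentence.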

```
INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}
```
- With a compound score of -0.5216, VADER considers the sentence clearly negative overall. One might expect a positive score because of the word "love", but "don't" negates it: VADER's negation rule flips the polarity of "love", which explains the positive score of 0 and the negative score of 0.627. Overall, the results are reasonable.
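
As a rough check of the negation rule, the sketch below assumes that VADER multiplies the valence of a negated word by -0.74 and then normalises the summed valence x to x / sqrt(x² + 15). Both constants, and the valence of 3.2 for "love", are assumptions to verify against the VADER source and lexicon; the names are illustrative.

```python
import math

NEGATION_SCALAR = -0.74  # assumed damping factor for a negated word
LOVE_VALENCE = 3.2       # assumed lexicon value for "love"

def compound(valence_sum, alpha=15):
    """Normalise an unbounded valence sum into the interval (-1, 1)."""
    return round(valence_sum / math.sqrt(valence_sum ** 2 + alpha), 4)

plain = compound(LOVE_VALENCE)                      # "I love apples"
negated = compound(LOVE_VALENCE * NEGATION_SCALAR)  # "I don't love apples"
print(plain, negated)  # 0.6369 -0.5216
```

Because the scalar is smaller than 1 in magnitude, the negated sentence ends up somewhat less extreme than its positive counterpart, which agrees with the two outputs shown for sentences 1 and 2.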

```
INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}
```
- The sentence "I love apples :-)" has a compound score of 0.7579, which indicates an overall positive sentiment. The smiley face emoticon is recognised by VADER as having a positive effect, so the positive score (0.867) is slightly higher than in the sentence "I love apples" (which had a positive score of 0.808). Overall, the results are reasonable, as the smiley face further increases the sentiment.

```
INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}
```

- The negative score of 0.492 comes from the word "ruins", which carries a negative sentiment. All of the other words are neutral and thus explain the neutral score of 0.508. Since there are no positive words in the sentence, the positive score is 0.0. The compound score of -0.4404 marks the sentence as moderately rather than strongly negative; it is derived from the valence of "ruins" alone, not from the balance between the neutral and negative scores.

```
INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}
```

- VADER considers this sentence to be moderately positive, given the compound score of 0.5867. The word "certainly" indicates a strong sentiment. The word "ruins", as in the sentence above, has a negative sentiment, but the word "not" preceding it flips that sentiment to positive. The positive and neutral scores are not entirely reasonable: because "certainly" signals a stronger sentiment, the positive value should have been a bit higher and the neutral score correspondingly lower.

```
INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}
```

- The negative score of 0.286 comes from the word "lies", which can have both a neutral meaning (i.e., lying down) and a negative one (i.e., telling lies). Because the lexicon lists word forms without distinguishing word senses, VADER applies the negative value here, which is incorrect in this context. The rest of the sentence is neutral, which explains the high neutral score (0.714).

```
INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```
- The word "like" contributes to the positive score of 0.333 since it is associated with positive sentiment in the VADER lexicon. However, in this context the word "like" is used for comparison, so the positive score is incorrect. The neutral score of 0.667 is expected, as the sentence contains words such as "This", "house" and "is" and has no other lexicon entries with a strong sentiment.

### [2.5 points] Question 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream. If you have trouble accessing Twitter, try to find an existing dataset (on websites like kaggle or huggingface).

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": "",
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```
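
Since the file is edited by hand, a quick sanity check can catch empty or misspelled labels before any evaluation is run. This is a minimal sketch, assuming the three keys and three labels described above; the helper name `validate_tweets` and the second example entry are made up for illustration.

```python
import json

ALLOWED_LABELS = {"negative", "neutral", "positive"}
REQUIRED_KEYS = {"sentiment_label", "text_of_tweet", "tweet_url"}

def validate_tweets(tweets):
    """Return a list of readable problems; an empty list means no issue was found."""
    problems = []
    for id_, info in tweets.items():
        missing = REQUIRED_KEYS - set(info)
        if missing:
            problems.append(f"{id_}: missing keys {sorted(missing)}")
        if info.get("sentiment_label") not in ALLOWED_LABELS:
            problems.append(f"{id_}: unexpected label {info.get('sentiment_label')!r}")
        if not info.get("text_of_tweet", "").strip():
            problems.append(f"{id_}: empty tweet text")
    return problems

# Entry "2" is an unfilled template entry, included to show what gets reported.
example = json.loads("""
{
    "1": {"sentiment_label": "positive",
          "text_of_tweet": "So go keep changing the world in 2018.",
          "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"},
    "2": {"sentiment_label": "", "text_of_tweet": "", "tweet_url": ""}
}
""")
print(validate_tweets(example))
```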

You can load your tweets with human annotation in the following way.

source = https://www.kaggle.com/datasets/ahmedshahriarsakib/tweet-sample

In [28]:
import json

In [29]:
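# open the file in a context manager so it is closed after loading
with open('my_tweets.json', encoding='utf-8') as infile:
    my_tweets = json.load(infile)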

In [30]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'positive', 'text_of_tweet': 'All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.', 'tweet_url': 'https://twitter.com/BarackObama/status/946775615893655552'}


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point. 
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
        
        The classification report shows the model's performance across three sentiment classes: negative, neutral, and positive. Precision indicates how accurate the model's predictions were for each class, while recall measures how well it identified all instances of each sentiment. The F1-score balances both precision and recall, with a higher score reflecting a better overall performance. The model performs well for negative sentiment, with high precision (0.79) and recall (0.94), but struggles with neutral sentiment, showing a lower recall (0.44) and F1-score (0.56). The weighted average F1-score of 0.72 suggests that while the model performs reasonably well overall, there is room for improvement, especially with neutral tweets.

* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER rules and the VADER lexicon. Of course, if there are fewer than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

        POSITIVE TWEETS MISCLASSIFIED:
        1. Tweet: Nothing like a fresh cup of coffee in the morning to start the day right! | Predicted as: negative
        2. Tweet: What an incredible concert! One for the books! | Predicted as: neutral
        3. Tweet: Just booked my first solo trip! Can't wait to explore new places. | Predicted as: neutral

        NEGATIVE TWEETS MISCLASSIFIED:
        1. Tweet: Can't believe how much the rent prices have gone up in this area. Unbelievable. | Predicted as: positive

        NEUTRAL TWEETS MISCLASSIFIED:
        1. Tweet: Just finished watching the new episode. It was okay, not great but not terrible either. | Predicted as: positive
        2. Tweet: The new phone update has some interesting changes. | Predicted as: positive
        3. Tweet: The movie I watched last night was just okay, not as good as the reviews said. | Predicted as: negative
        4. Tweet: The news today was a mix of interesting stories. | Predicted as: positive
        5. Tweet: Just read an interesting article. Not sure how I feel about it yet. | Predicted as: positive
        6. Tweet: The book I’m reading is slow to start, but I’m hoping it picks up soon. | Predicted as: positive
        7. Tweet: I’ve been thinking about starting a new fitness routine. | Predicted as: positive
        8. Tweet: Not sure how to feel about the news today. | Predicted as: negative
        9. Tweet: Weather is looking mild for the weekend. No surprises expected. | Predicted as: negative

        The error analysis reveals that the VADER sentiment analysis model often misclassifies tweets because it relies on certain keywords that may not always capture the full context of the tweet. Positive tweets can be misclassified as negative or neutral because VADER sometimes misses subtle positive expressions or misinterprets neutral phrases as negative. In negative tweets, VADER might mistake words with dual meanings, like "unbelievable," for positive sentiment. Neutral tweets are frequently misclassified as positive because the model focuses too much on positive words like "interesting." These misclassifications show how difficult it can be to detect sentiment when context and nuance are involved.
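
The relation between the reported scores can be checked with a little arithmetic. The sketch below uses only figures quoted in the answer to part a (precision 0.79 and recall 0.94 for negative; recall 0.44 and F<sub>1</sub> 0.56 for neutral). Since those figures are rounded, the derived values are approximate, and the helper names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_from(f1_score, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    return f1_score * recall / (2 * recall - f1_score)

negative_f1 = f1(0.79, 0.94)
neutral_precision = precision_from(0.56, 0.44)
print(round(negative_f1, 2))        # 0.86
print(round(neutral_precision, 2))  # 0.77
```

If the rounded inputs are accurate, neutral combines a fairly high precision with a low recall: the classifier is usually right when it predicts neutral, but it predicts neutral too rarely.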

In [31]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
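
The mapping above labels a tweet neutral only when the compound score is exactly 0.0, so any tweet containing a single weakly scored word is already labelled positive or negative. A hedged variant to experiment with is sketched below, assuming the cut-offs of ±0.05 suggested in the VADER documentation; the function name and default threshold are illustrative and not part of the assignment code.

```python
def vader_output_to_label_thresholded(vader_output, threshold=0.05):
    """Map a VADER output dict to a label, with a neutral band around zero."""
    compound = vader_output["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(vader_output_to_label_thresholded({"compound": 0.01}))     # neutral
print(vader_output_to_label_thresholded({"compound": 0.4215}))   # positive
print(vader_output_to_label_thresholded({"compound": -0.5216}))  # negative
```

Widening the band trades recall on positive and negative tweets for recall on neutral ones, so the effect should be judged from the classification report rather than assumed.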

In [32]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report

vader_model = SentimentIntensityAnalyzer()

tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = vader_model.polarity_scores(the_tweet) # run vader
    vader_label = vader_output_to_label(vader_output)# convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])

    # Print tweet and its classification
    print(f"Tweet: {the_tweet}")
    print(f"Predicted Sentiment: {vader_label}")
    print(f"Actual Sentiment: {tweet_info['sentiment_label']}")
    print('-' * 50)
    
# use scikit-learn's classification report
print(classification_report(gold, all_vader_output))

Tweet: All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.
Predicted Sentiment: positive
Actual Sentiment: positive
--------------------------------------------------
Tweet: This traffic is absolutely ridiculous. Been stuck for an hour and haven’t moved an inch.
Predicted Sentiment: negative
Actual Sentiment: negative
--------------------------------------------------
Tweet: Just finished watching the new episode. It was okay, not great but not terrible either.
Predicted Sentiment: positive
Actual Sentiment: neutral
--------------------------------------------------
Tweet: Can’t believe I got the job! Excited for this new journey ahead! #blessed
Predicted Sentiment: positive
Actual Sentiment: positive
--------------------------------------------------
Tweet: Another rainy day... so much for making weekend plans. Ugh.
Predicted Sentiment: negative
Actual Sentiment:
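
For the error analysis, the misclassified tweets can be grouped by their gold label. A minimal sketch, assuming three parallel lists like `tweets`, `gold` and `all_vader_output` from the cell above; the helper name `misclassified_by_gold` and the shortened example lists are illustrative.

```python
from collections import defaultdict

def misclassified_by_gold(tweets, gold, predicted, limit=10):
    """Collect up to `limit` wrongly labelled tweets per gold category."""
    errors = defaultdict(list)
    for text, gold_label, predicted_label in zip(tweets, gold, predicted):
        if gold_label != predicted_label and len(errors[gold_label]) < limit:
            errors[gold_label].append((text, predicted_label))
    return dict(errors)

example_tweets = [
    "Just finished watching the new episode. It was okay, not great but not terrible either.",
    "This traffic is absolutely ridiculous. Been stuck for an hour and haven't moved an inch.",
    "What an incredible concert! One for the books!",
]
example_gold = ["neutral", "negative", "positive"]
example_predicted = ["positive", "negative", "neutral"]

example_errors = misclassified_by_gold(example_tweets, example_gold, example_predicted)
for label, items in example_errors.items():
    for text, predicted_label in items:
        print(f"gold={label} predicted={predicted_label}: {text}")
```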

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.

In [33]:
import nltk
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import spacy
nlp = spacy.load('en_core_web_sm')

def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a sentence from spacy
    
    :param str textual_unit: a textual unit, e.g., sentence, sentences (one string)
    (by looping over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        # print the full input: `sent` would only hold the last sentence of the loop
        print('INPUT SENTENCE', textual_unit)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [34]:
#Run VADER (as it is) on the set of airline tweets 

import os
rootdir = '/Users/gebruiker/Documents/Naamloos/lab3/airlinetweets'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        x = os.path.join(subdir, file)
        if x.endswith('.txt'):
            with open(x, 'r', encoding='utf-8') as f:
                content = f.read()
                print(content)
                scores = vader_model.polarity_scores(content)
                print(scores)
                print() 
    



In [35]:
#Run VADER on the set of airline tweets after having lemmatized the text

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        x = os.path.join(subdir, file)
        if x.endswith('.txt'):
            with open(x, 'r', encoding='utf-8') as f:
                content = f.read()
                # score the lemmatized text (the unused raw-text score is dropped)
                lemma = run_vader(content, lemmatize=True, verbose=1)
                print(lemma)


In [36]:
#Run VADER on the set of airline tweets with only adjectives

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        x = os.path.join(subdir, file)
        if x.endswith('.txt'):
            with open(x, 'r', encoding='utf-8') as f:
                content = f.read()
                print(content)

                adjectives = run_vader(content, 
                            lemmatize=False, 
                            parts_of_speech_to_consider={'ADJ'},
                            verbose=1)
               
                print(adjectives)
                print()

In [37]:
#Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        x = os.path.join(subdir, file)
        if x.endswith('.txt'):
            with open(x, 'r', encoding='utf-8') as f:
                content = f.read()
                print(content)

                adjectives = run_vader(content,lemmatize=True, parts_of_speech_to_consider={'ADJ'},verbose=1)
               
                print(adjectives)
                print()


In [38]:
#Run VADER on the set of airline tweets with only nouns


for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        x = os.path.join(subdir, file)
        if x.endswith('.txt'):
            with open(x, 'r', encoding='utf-8') as f:
                content = f.read()
                print(content)

                nouns = run_vader(content, 
                            lemmatize=False, 
                            parts_of_speech_to_consider={'NOUN'},
                            verbose=1)
               
                print(nouns)
                print()

In [39]:
#Run VADER on the set of airline tweets with only nouns and after having lemmatized the text

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        x = os.path.join(subdir, file)
        if x.endswith('.txt'):
            with open(x, 'r', encoding='utf-8') as f:
                content = f.read()
                print(content)

                nouns = run_vader(content, 
                            lemmatize=True, 
                            parts_of_speech_to_consider={'NOUN'},
                            verbose=1)
               
                print(nouns)
                print()

In [40]:
#Run VADER on the set of airline tweets with only verbs

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        x = os.path.join(subdir, file)
        if x.endswith('.txt'):
            with open(x, 'r', encoding='utf-8') as f:
                content = f.read()
                print(content)

                verbs = run_vader(content, 
                            lemmatize=False, 
                            parts_of_speech_to_consider={'VERB'},
                            verbose=1)
               
                print(verbs)
                print()

In [41]:
#Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        x = os.path.join(subdir, file)
        if x.endswith('.txt'):
            with open(x, 'r', encoding='utf-8') as f:
                content = f.read()
                print(content)

                verbs = run_vader(content, 
                            lemmatize=True, 
                            parts_of_speech_to_consider={'VERB'},
                            verbose=1)
               
                print(verbs)
                print()
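
The cells above print raw VADER scores, whereas a classification report needs a gold label and a predicted label per tweet. One way to collect both is sketched below. It assumes that each tweet sits in a file directly under a folder named after its gold category (the layout that `load_files` expects) and that the folder names match the strings returned by the label-mapping function. The helper name `collect_predictions`, the stub scorer and the demo corpus are made up for illustration.

```python
import tempfile
from pathlib import Path

def collect_predictions(rootdir, score_fn, to_label):
    """Return (gold, predicted) label lists for every <rootdir>/<label>/*.txt file."""
    gold, predicted = [], []
    for path in sorted(Path(rootdir).glob("*/*.txt")):
        text = path.read_text(encoding="utf-8")
        gold.append(path.parent.name)               # folder name is the gold label
        predicted.append(to_label(score_fn(text)))  # scorer output mapped to a label
    return gold, predicted

# Tiny self-made demo corpus with a stub scorer, so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    for label, text in [("positive", "great crew"), ("negative", "bad delay")]:
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "1.txt").write_text(text, encoding="utf-8")
    stub_score = lambda text: {"compound": 0.5 if "great" in text else -0.5}
    stub_label = lambda output: "positive" if output["compound"] > 0 else "negative"
    demo_gold, demo_pred = collect_predictions(tmp, stub_score, stub_label)

print(demo_gold, demo_pred)
```

With the real data, `score_fn` could be a wrapper such as `lambda t: run_vader(t, lemmatize=True, parts_of_speech_to_consider={'ADJ'})` and `to_label` could be `vader_output_to_label`, after which `classification_report(gold, predicted)` gives the report for that setting.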

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df = 10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?
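
To make the role of `min_df` concrete: with an integer value, a token is kept only if it occurs in at least that many documents. The sketch below re-implements just this filtering step on a made-up, pre-tokenized toy corpus. The function name `vocabulary_with_min_df` is hypothetical, and this is not scikit-learn's own code.

```python
from collections import Counter

def vocabulary_with_min_df(tokenized_docs, min_df):
    """Keep the tokens that occur in at least `min_df` different documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return sorted(token for token, df in doc_freq.items() if df >= min_df)

docs = [
    ["flight", "delayed", "again"],
    ["great", "flight", "crew"],
    ["delayed", "flight", "no", "crew"],
]
print(vocabulary_with_min_df(docs, 1))  # ['again', 'crew', 'delayed', 'flight', 'great', 'no']
print(vocabulary_with_min_df(docs, 2))  # ['crew', 'delayed', 'flight']
print(vocabulary_with_min_df(docs, 3))  # ['flight']
```

Raising the threshold shrinks the vocabulary quickly. Rare tokens are removed first, which reduces noise but can also discard infrequent words that carry sentiment.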

Which category performs best, is this the case for any setting?

- 

Does the frequency threshold affect the scores? Why or why not according to you?

- 

In [None]:
from sklearn.datasets import load_files
import nltk
import pathlib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report  # used for the reports below

cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
path = str(airline_tweets_folder)
airline_tweets_train = load_files(path)

# TF-IDF, min_df=2
#Feature extraction
airline_vec = CountVectorizer(min_df=2, 
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)
tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

#Training with tf-idf model
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_tfidf, # the tf-idf model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 
clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.82      0.91      0.86       364
           1       0.84      0.71      0.77       282
           2       0.84      0.86      0.85       305

    accuracy                           0.83       951
   macro avg       0.83      0.82      0.83       951
weighted avg       0.83      0.83      0.83       951



In [None]:
# TF-IDF, min_df=5
#Feature extraction
airline_vec = CountVectorizer(min_df=5,
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)
tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

#Training with tf-idf model
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_tfidf, # the tf-idf model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 
clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.81      0.88      0.84       364
           1       0.79      0.73      0.76       300
           2       0.83      0.80      0.81       287

    accuracy                           0.81       951
   macro avg       0.81      0.80      0.81       951
weighted avg       0.81      0.81      0.81       951



In [None]:
# TF-IDF, min_df=10
#Feature extraction
airline_vec = CountVectorizer(min_df=10, 
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)
tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

#Training with tf-idf model
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_tfidf, # the tf-idf model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 
clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.82      0.89      0.85       362
           1       0.79      0.72      0.75       297
           2       0.83      0.81      0.82       292

    accuracy                           0.81       951
   macro avg       0.81      0.81      0.81       951
weighted avg       0.81      0.81      0.81       951



In [None]:
# Bag of words representation, min_df=2
airline_vec = CountVectorizer(min_df=2, 
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

#Training with airline_counts model
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the airline_counts model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 
clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.84      0.88      0.86       348
           1       0.84      0.74      0.79       304
           2       0.83      0.89      0.86       299

    accuracy                           0.84       951
   macro avg       0.84      0.84      0.84       951
weighted avg       0.84      0.84      0.84       951



In [None]:
# Bag of words representation, min_df=5
airline_vec = CountVectorizer(min_df=5, 
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

#Training with airline_counts model
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the airline_counts model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 
clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.81      0.89      0.85       353
           1       0.85      0.71      0.77       305
           2       0.82      0.86      0.84       293

    accuracy                           0.82       951
   macro avg       0.82      0.82      0.82       951
weighted avg       0.82      0.82      0.82       951



In [None]:
# Bag of words representation, min_df=10
airline_vec = CountVectorizer(min_df=10, 
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

#Training with airline_counts model
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the airline_counts model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 
clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)

print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.83      0.92      0.88       346
           1       0.81      0.75      0.78       326
           2       0.85      0.81      0.83       279

    accuracy                           0.83       951
   macro avg       0.83      0.83      0.83       951
weighted avg       0.83      0.83      0.83       951



### [4 points] Question 6: Inspecting the best scoring features 

* Train the scikit-learn classifier (Naive Bayes) with the following settings: airline tweets, 80% training and 20% test; bag-of-words representation (`airline_counts`), min_df=2.
* [1 point] a. Generate the list of best scoring features per class (see the function **important_features_per_class** below).
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why?
    + [1 point] The list contains all kinds of words, such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [61]:
airline_vec = CountVectorizer(min_df=2, 
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

#Training with airline_counts model
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_counts, # the airline_counts model
    airline_tweets_train.target, # the category values for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 
clf = MultinomialNB().fit(docs_train, y_train)
y_pred = clf.predict(docs_test)




In [65]:
def important_features_per_class(vectorizer, classifier, n=80):
    """Print the n most frequent features per class, ranked by raw training counts."""
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out()
    # feature_count_[i] holds, for each feature, its summed count in the training documents of class i
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names), reverse=True)[:n]
    # the headings below assume the label order 0 = negative, 1 = neutral, 2 = positive
    print("Important words in negative documents")
    for count, feat in topn_class1:
        print(class_labels[0], count, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for count, feat in topn_class2:
        print(class_labels[1], count, feat)
    print("-----------------------------------------")
    print("Important words in positive documents")
    for count, feat in topn_class3:
        print(class_labels[2], count, feat)

# example of how to call from notebook:
important_features_per_class(airline_vec, clf)

Important words in negative documents
0 1492.0 @
0 1371.0 united
0 1219.0 .
0 420.0 ``
0 410.0 flight
0 387.0 ?
0 365.0 !
0 316.0 #
0 208.0 n't
0 163.0 ''
0 116.0 's
0 110.0 :
0 108.0 service
0 95.0 virginamerica
0 93.0 delayed
0 92.0 get
0 92.0 cancelled
0 88.0 time
0 88.0 bag
0 83.0 plane
0 82.0 customer
0 79.0 -
0 71.0 gate
0 70.0 ;
0 70.0 ...
0 68.0 http
0 68.0 'm
0 66.0 hours
0 65.0 still
0 64.0 late
0 63.0 hour
0 63.0 airline
0 63.0 &
0 59.0 2
0 57.0 would
0 53.0 help
0 53.0 amp
0 52.0 ca
0 51.0 one
0 49.0 like
0 49.0 flights
0 47.0 us
0 47.0 never
0 46.0 waiting
0 46.0 delay
0 45.0 flightled
0 44.0 3
0 43.0 $
0 42.0 due
0 41.0 worst
0 39.0 fly
0 38.0 wait
0 38.0 u
0 38.0 (
0 38.0 've
0 37.0 people
0 36.0 luggage
0 36.0 day
0 36.0 )
0 35.0 really
0 35.0 back
0 34.0 trying
0 34.0 another
0 33.0 lost
0 32.0 hold
0 32.0 got
0 32.0 bags
0 31.0 ticket
0 31.0 thanks
0 31.0 seats
0 31.0 last
0 31.0 ever
0 30.0 seat
0 30.0 check
0 29.0 today
0 29.0 problems
0 29.0 guys
0 29.0 going
0 29.

### [Optional! (will not be graded)] Question 7
Train the model on the airline tweets and test it on your own set of tweets.
* Train the model with the following settings: airline tweets, 80% training and 20% test; bag-of-words representation (`airline_counts`), min_df=2.
* Apply the model to your own set of tweets and generate the classification report.
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them.
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.
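
A minimal sketch of the steps in Question 7, assuming made-up training texts and a made-up test set (`train_texts`, `train_labels`, `my_tweets` and `my_gold` are hypothetical stand-ins for the airline tweets and your own annotated tweets). The important detail is that the new tweets are converted with `transform`, not `fit_transform`, so that they are mapped onto the vocabulary the classifier was trained on. The gold labels of your own tweets must also use the same label encoding as the training targets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data standing in for the airline tweets
train_texts = [
    "terrible delay lost bag",
    "worst service ever",
    "great crew thanks",
    "love this airline",
]
train_labels = ["negative", "negative", "positive", "positive"]

# Hypothetical stand-in for your own tweets and their gold labels
my_tweets = ["thanks great crew", "worst delay"]
my_gold = ["positive", "negative"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_texts)  # learns the vocabulary
clf = MultinomialNB().fit(train_counts, train_labels)

# Reuse the fitted vocabulary: transform only, do not fit again
my_counts = vec.transform(my_tweets)
my_pred = clf.predict(my_counts)

# zero_division=0 avoids warnings when a class is never predicted
print(classification_report(my_gold, my_pred, zero_division=0))
```

Words in your own tweets that never occurred in the training data are silently ignored by `transform`, which is one reason scores can drop when a model is applied to a different domain.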

### [Optional! (will not be graded)] Question 8: Trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects of these other settings might be.
* [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores.
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook