# Lab3 - Assignment Sentiment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the Lab 3 assignment of the Text Mining course. It is about sentiment analysis.

The aims of the assignment are:
* Learn how to run a rule-based sentiment analysis module (VADER)
* Learn how to run a machine learning sentiment analysis module (Scikit-Learn/ Naive Bayes)
* Learn how to run scikit-learn metrics for the quantitative evaluation
* Learn how to perform and interpret a quantitative evaluation of the outcomes of the tools (in terms of Precision, Recall, and F<sub>1</sub>)
* Learn how to evaluate the results qualitatively (by examining the data) 
* Get insight into differences between the two applied methods
* Get insight into the effects of using linguistic preprocessing
* Be able to describe differences between the two methods in terms of their results
* Get insight into issues when applying these methods across different  domains

In this assignment, you are going to create your own gold standard set from 50 tweets. You will apply the VADER and scikit-learn classifiers to these tweets and evaluate the results by using evaluation metrics and inspecting the data.

We recommend you go through the notebooks in the following order:
* **Read the assignment (see below)**
* **Lab3.2-Sentiment-analysis-with-VADER.ipynb**
* **Lab3.3-Sentiment-analysis.with-scikit-learn.ipynb**
* **Answer the questions of the assignment (see below) using the provided notebooks and submit**

In this assignment you are asked to perform both quantitative evaluations and error analyses:
* a quantitative evaluation concerns the scores (Precision, Recall, and F<sub>1</sub>) provided by scikit's classification_report. It includes the scores per category, as well as micro and macro averages. Discuss whether the scores are balanced or not between the different categories (positive, negative, neutral) and between precision and recall. Discuss the shortcomings (if any) of the classifier based on these scores
* an error analysis regarding the misclassifications of the classifier. It involves going through the texts and trying to understand what has gone wrong. It serves to gain insight into what could be done to improve the performance of the classifier. Do you observe patterns in misclassifications? Discuss why these errors are made and propose ways to solve them.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io) and [Isa Maks](https://research.vu.nl/en/persons/e-maks). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Part I: VADER assignments


### Preparation (nothing to submit):
To be able to answer the VADER questions you need to know how the tool works. 
* Read more about the VADER tool in [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).  
* VADER provides 4 scores (positive, negative, neutral, compound). Be sure to understand what they mean and how they are calculated.
* VADER uses rules to handle linguistic phenomena such as negation and intensification. Be sure to understand which rules are used, how they work, and why they are important.
* VADER makes use of a sentiment lexicon. Have a look at the lexicon (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Be sure to understand which information can be found there (lemma?, wordform?, part-of-speech?, polarity value?, word meaning?). What do all scores mean?


### [3.5 points] Question1:

Consider the following sentences and their output as given by VADER. Explain the outcome **for each sentence** (1 to 7), taking into account both the rules applied by VADER and the lexicon that is used. You will find that some of the results are reasonable, but others are not. For each result, explain why it is correct or what goes wrong when it is not.

```
INPUT SENTENCE 1 I love apples
VADER OUTPUT {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}

INPUT SENTENCE 2 I don't love apples
VADER OUTPUT {'neg': 0.627, 'neu': 0.373, 'pos': 0.0, 'compound': -0.5216}

INPUT SENTENCE 3 I love apples :-)
VADER OUTPUT {'neg': 0.0, 'neu': 0.133, 'pos': 0.867, 'compound': 0.7579}

INPUT SENTENCE 4 These houses are ruins
VADER OUTPUT {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

INPUT SENTENCE 5 These houses are certainly not considered ruins
VADER OUTPUT {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.5867}

INPUT SENTENCE 6 He lies in the chair in the garden
VADER OUTPUT {'neg': 0.286, 'neu': 0.714, 'pos': 0.0, 'compound': -0.4215}

INPUT SENTENCE 7 This house is like any house
VADER OUTPUT {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
```

### Sentence 1
Sentence 1 is labelled positive: the scores are 0% negative, 19.2% neutral, and 80.8% positive. This is due to the word "love", which has a sentiment rating of 3.2 in the lexicon.

### Sentence 2
Sentence 2 is labelled negative: the scores are 62.7% negative, 37.3% neutral, and 0% positive. This is due to the negation "don't", which triggers VADER's negation rule and switches the sentiment of "love" from positive to negative.

### Sentence 3
Sentence 3 is labelled positive: the scores are 0% negative, 13.3% neutral, and 86.7% positive. The sentence is scored as more positive than sentence 1 because the emoticon ":-)" adds its own sentiment rating of 1.3.

### Sentence 4
Sentence 4 produced scores of 50.8% neutral, 49.2% negative, and 0% positive. Although the neutral score is higher than the negative score, the split is close to equal. The word "ruins" has a sentiment rating of -1.9, which makes it the only source of sentiment in the sentence and results in the compound score of -0.4404.

### Sentence 5
Sentence 5 produced scores of 0% negative, 51% neutral, and 49% positive, meaning that it is almost equally split between neutral and positive. The words "certainly not" negate the negative sentiment of the word "ruins", so the sentence is scored as positive rather than negative.

### Sentence 6
Sentence 6 is 28.6% negative, 71.4% neutral, and 0% positive. The sentence is incorrectly labelled as negative because of the word "lies", which has a sentiment rating of -1.8. The lexicon does not distinguish between word senses: "lies" here describes a body position, but it receives the rating that belongs to the sense of not telling the truth. The sentence should have been labelled as neutral.

### Sentence 7
Sentence 7 is 0% negative, 66.7% neutral, and 33.3% positive, with an overall label of positive. This is incorrect: the sentence uses the word "like" to compare the house with other houses and to say that it is not unique. The positive label is caused by the lexicon entry for "like", which has a sentiment rating of 1.5, but in this context the word does not express a sentiment.
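
The lexicon ratings quoted in the answers for sentences 1 to 7 can be cross-checked against the reported compound scores. The sketch below is not VADER itself: it assumes that the compound score is the summed rating `x` normalised as `x / sqrt(x² + 15)` and that a negated word is multiplied by -0.74. Both constants are assumptions taken from memory of the reference implementation, and the helper name `compound` is illustrative. Sentence 5 is left out, because a single negated rating does not reproduce its score under these assumptions.

```python
import math

ALPHA = 15        # assumed normalisation constant
NEGATION = -0.74  # assumed scalar applied to a negated sentiment word


def compound(raw_sum, alpha=ALPHA):
    """Map a summed lexicon rating to the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# Summed ratings, built from the lexicon values quoted in the answers above
raw_scores = {
    "1: love": 3.2,
    "2: don't love": 3.2 * NEGATION,
    "3: love + :-)": 3.2 + 1.3,
    "4: ruins": -1.9,
    "6: lies": -1.8,
    "7: like": 1.5,
}

for name, raw in raw_scores.items():
    print(name, round(compound(raw), 4))
```

If the printed values agree with the VADER output listed in the question, the quoted ratings account for the whole compound score of those sentences.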

### [2.5 points] Question 2: Collecting 50 tweets for evaluation
Collect 50 tweets. Try to find tweets that are interesting for sentiment analysis, e.g., very positive, neutral, and negative tweets. These could be your own tweets (typed in) or collected from the Twitter stream.

We will store the tweets in the file **my_tweets.json** (use a text editor to edit).
For each tweet, you should insert:
* sentiment analysis label: negative | neutral | positive (this you determine yourself, this is not done by a computer)
* the text of the tweet
* the Tweet-URL

from:
```
    "1": {
        "sentiment_label": "",
        "text_of_tweet": "",
        "tweet_url": ""
    },
```
to:
```
    "1": {
        "sentiment_label": "positive",
        "text_of_tweet": "All across America people chose to get involved, get engaged and stand up. Each of us can make a difference, and all of us ought to try. So go keep changing the world in 2018.",
        "tweet_url": "https://twitter.com/BarackObama/status/946775615893655552"
    },
```

You can load your tweets with human annotation in the following way.

In [1]:
import json
import sklearn
import pathlib
import spacy
from sklearn import metrics
from sklearn.datasets import load_files
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report
nlp = spacy.load('en_core_web_sm')
vader_model = SentimentIntensityAnalyzer()

In [2]:
# open the file in a context manager so that it is closed after loading
with open('my_tweets.json') as infile:
    my_tweets = json.load(infile)

In [3]:
for id_, tweet_info in my_tweets.items():
    print(id_, tweet_info)
    break

1 {'sentiment_label': 'positive', 'text_of_tweet': 'The Northern Lights, an atmospheric phenomenon rarely seen in the Netherlands, were visible over large parts of the country on Sunday night.', 'tweet_url': 'https://twitter.com/DutchNewsNL/status/1630281109274599425'}
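
Because the file is edited by hand, it can help to check the annotations before evaluating. The sketch below is one possible check, not part of the original notebook: the helper `find_problems`, the constants, and the toy `sample` dictionary are all illustrative, and the function would be called on `my_tweets` in practice.

```python
ALLOWED_LABELS = {"negative", "neutral", "positive"}
REQUIRED_KEYS = {"sentiment_label", "text_of_tweet", "tweet_url"}


def find_problems(tweets):
    """Return a list of readable problems found in the annotation dict."""
    problems = []
    for id_, info in tweets.items():
        missing = REQUIRED_KEYS - set(info)
        if missing:
            problems.append(f"{id_}: missing {sorted(missing)}")
            continue
        if info["sentiment_label"] not in ALLOWED_LABELS:
            problems.append(f"{id_}: unexpected label {info['sentiment_label']!r}")
        if not info["text_of_tweet"].strip():
            problems.append(f"{id_}: empty text")
    return problems


# Toy data with two deliberate mistakes (wrong capitalisation, empty text)
sample = {
    "1": {"sentiment_label": "positive", "text_of_tweet": "Great day", "tweet_url": "https://example.org/1"},
    "2": {"sentiment_label": "Positive", "text_of_tweet": "Nice", "tweet_url": "https://example.org/2"},
    "3": {"sentiment_label": "neutral", "text_of_tweet": "", "tweet_url": "https://example.org/3"},
}
print(find_problems(sample))
```

A misspelled label would otherwise show up as an extra category in the classification report.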


### [5 points] Question 3:

Run VADER on your own tweets (see function **run_vader** from notebook **Lab3.2-Sentiment-analysis-with-VADER.ipynb**). You can use the code snippet below this explanation as a starting point.
* [2.5 points] a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.
* [2.5 points] b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are less than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.

In [4]:
def run_vader(textual_unit, 
              lemmatize=False,
              parts_of_speech_to_consider=set(),
              verbose=0):
    """
    Run VADER on a textual unit after processing it with spaCy
    (the tokens are collected by looping over doc.sents)
    
    :param str textual_unit: a textual unit, e.g., sentence, sentences (one string)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -empty set -> all parts of speech are provided
    -non-empty set: only these parts of speech are considered
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        # print the full input; `sent` would only hold the last sentence
        print('INPUT SENTENCE', textual_unit)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores

In [5]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
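
The mapping above treats every non-zero compound score as positive or negative, so only a score of exactly 0.0 counts as neutral. A commonly used alternative is a small band around zero; the sketch below assumes a band of ±0.05. The function name `label_with_threshold` and its default value are illustrative and are not part of the notebook.

```python
def label_with_threshold(vader_output, threshold=0.05):
    """Map a VADER output dict to a label, using a neutral band around 0."""
    compound = vader_output['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# weak scores fall inside the band and become neutral
print(label_with_threshold({'compound': 0.01}))
print(label_with_threshold({'compound': 0.3612}))
print(label_with_threshold({'compound': -0.4215}))
```

Whether such a band helps depends on the data, so it is best treated as one more experimental setting to evaluate.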

In [6]:
tweets = []
all_vader_output = []
gold = []

# settings (to change for different experiments)
to_lemmatize = True 
pos = set()

for id_, tweet_info in my_tweets.items():
    the_tweet = tweet_info['text_of_tweet']
    vader_output = run_vader(the_tweet, lemmatize=to_lemmatize, parts_of_speech_to_consider=pos) # run vader with the settings above
    vader_label = vader_output_to_label(vader_output) # convert vader output to category
    
    tweets.append(the_tweet)
    all_vader_output.append(vader_label)
    gold.append(tweet_info['sentiment_label'])
    
# use scikit-learn's classification report
report = classification_report(gold,all_vader_output,digits = 3)
print(report)

              precision    recall  f1-score   support

    negative      0.824     0.778     0.800        18
     neutral      0.800     0.571     0.667        14
    positive      0.652     0.833     0.732        18

    accuracy                          0.740        50
   macro avg      0.759     0.728     0.733        50
weighted avg      0.755     0.740     0.738        50



**[2.5 points]** a. Perform a quantitative evaluation. Explain the different scores, and explain which scores are most relevant and why.

The table above shows the output of the quantitative evaluation. Precision is the number of true positives divided by the sum of true positives and false positives; recall is the number of true positives divided by the sum of true positives and false negatives; the f1-score is the harmonic mean of precision and recall. Precision is highest for the negative tweets (0.824), recall is highest for the positive tweets (0.833), and the f1-score is highest for the negative tweets (0.800). The scores are not balanced: the neutral class combines a high precision (0.800) with a low recall (0.571), while the positive class shows the opposite pattern (0.652 precision, 0.833 recall), which indicates that VADER labels too many tweets as positive.
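
The definitions can be checked against the report with a few lines of arithmetic. The counts below are not notebook output: they are reconstructed from the reported precision, recall, and support (for example, a recall of 0.778 with a support of 18 implies 14 true positives), so they are an inference. The helper name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# (true positives, false positives, false negatives) per class,
# reconstructed from the report above (an inference, not logged output)
counts = {
    "negative": (14, 3, 4),
    "neutral": (8, 2, 6),
    "positive": (15, 8, 3),
}

scores = {label: precision_recall_f1(*c) for label, c in counts.items()}
for label, (p, r, f1) in scores.items():
    print(f"{label:9s} {p:.3f} {r:.3f} {f1:.3f}")

# macro average: unweighted mean over the classes
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
# accuracy: all correctly labelled tweets divided by the 50 tweets
accuracy = sum(tp for tp, _, _ in counts.values()) / 50
print(f"macro F1 {macro_f1:.3f}, accuracy {accuracy:.3f}")
```

The macro average treats the three classes equally, which makes it a sensible summary when the classes have different sizes.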

**[2.5 points]** b. Perform an error analysis: select 10 positive, 10 negative and 10 neutral tweets that are not correctly classified and try to understand why. Refer to the VADER-rules and the VADER-lexicon. Of course, if there are less than 10 errors for a category, you only have to check those. For example, if there are only 5 errors for positive tweets, you just describe those.
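
A possible starting point for the error analysis is to group the misclassified tweets by their gold label. The sketch below uses toy data; the helper `collect_errors` is illustrative, and in the notebook it could be called with the lists `tweets`, `gold`, and `all_vader_output` built in the cell above.

```python
from collections import defaultdict


def collect_errors(texts, gold, predicted, per_class=10):
    """Group misclassified items by gold label, at most per_class each."""
    errors = defaultdict(list)
    for text, gold_label, pred_label in zip(texts, gold, predicted):
        if gold_label != pred_label and len(errors[gold_label]) < per_class:
            errors[gold_label].append((pred_label, text))
    return dict(errors)


# Toy data (invented for illustration)
texts = ["Love it", "Flight delayed", "Boarding at gate 4", "Thanks a lot..."]
gold_labels = ["positive", "negative", "neutral", "negative"]
predicted_labels = ["positive", "neutral", "neutral", "positive"]

for label, items in collect_errors(texts, gold_labels, predicted_labels).items():
    for pred, text in items:
        print(f"gold={label} predicted={pred}: {text}")
```

Calling `run_vader` with `verbose=1` on each selected tweet then shows which tokens were passed to VADER.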

### [4 points] Question 4:
Run VADER on the set of airline tweets with the following settings:

* Run VADER (as it is) on the set of airline tweets 
* Run VADER on the set of airline tweets after having lemmatized the text
* Run VADER on the set of airline tweets with only adjectives
* Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
* Run VADER on the set of airline tweets with only nouns
* Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
* Run VADER on the set of airline tweets with only verbs
* Run VADER on the set of airline tweets with only verbs and after having lemmatized the text

* [1 point] a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F<sub>1</sub> scores per category as well as micro and macro averages. **Use a different code cell (or multiple code cells) for each experiment.**
* [3 points] b. Compare the scores and explain what they tell you.
    + Does lemmatisation help? Explain why or why not.
    + Are all parts of speech equally important for sentiment analysis? Explain why or why not.

In [7]:
# Your code here
# Load airline tweet files:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
airline_tweets= load_files(str(airline_tweets_folder))

# Function for running VADER with different settings:
def vader_tweets(tweets, lemmatize_value=False, pos_value=set()):
    """
    Run VADER on every tweet of a dataset loaded with load_files
    and print the classification report.
    """
    vader_label = []
    gold = []

    for tweet, label_int in zip(tweets.data, tweets.target):
        tweet_ = tweet.decode("utf-8")  # load_files returns bytes
        vader_output = run_vader(tweet_, lemmatize=lemmatize_value, parts_of_speech_to_consider=pos_value)
        vader_label.append(vader_output_to_label(vader_output))
        # use the dataset that was passed in, not the global airline_tweets
        gold.append(tweets.target_names[label_int])

    report = classification_report(gold, vader_label, digits=3)
    print(report)

**[1 point]** a. Generate for all separate experiments the classification report, i.e., Precision, Recall, and F1 scores per category as well as micro and macro averages. Use a different code cell (or multiple code cells) for each experiment.

In [8]:
# Run VADER (as it is) on the set of airline tweets
vader_tweets(airline_tweets)

              precision    recall  f1-score   support

    negative      0.797     0.515     0.625      1750
     neutral      0.605     0.506     0.551      1515
    positive      0.559     0.884     0.685      1490

    accuracy                          0.628      4755
   macro avg      0.654     0.635     0.621      4755
weighted avg      0.661     0.628     0.620      4755



In [9]:
# Run VADER on the set of airline tweets after having lemmatized the text
vader_tweets(airline_tweets, lemmatize_value = True)

              precision    recall  f1-score   support

    negative      0.786     0.522     0.628      1750
     neutral      0.598     0.488     0.538      1515
    positive      0.557     0.881     0.682      1490

    accuracy                          0.624      4755
   macro avg      0.647     0.630     0.616      4755
weighted avg      0.654     0.624     0.616      4755



In [10]:
# Run VADER on the set of airline tweets with only adjectives
vader_tweets(airline_tweets, pos_value = {'ADJ'})

              precision    recall  f1-score   support

    negative      0.870     0.210     0.339      1750
     neutral      0.403     0.892     0.555      1515
    positive      0.665     0.438     0.528      1490

    accuracy                          0.499      4755
   macro avg      0.646     0.513     0.474      4755
weighted avg      0.657     0.499     0.467      4755



In [11]:
# Run VADER on the set of airline tweets with only adjectives and after having lemmatized the text
vader_tweets(airline_tweets, lemmatize_value = True,pos_value = {'ADJ'})

              precision    recall  f1-score   support

    negative      0.868     0.210     0.339      1750
     neutral      0.403     0.892     0.556      1515
    positive      0.664     0.438     0.528      1490

    accuracy                          0.499      4755
   macro avg      0.645     0.513     0.474      4755
weighted avg      0.656     0.499     0.467      4755



In [12]:
# Run VADER on the set of airline tweets with only nouns
vader_tweets(airline_tweets, pos_value = {'NOUN'})

              precision    recall  f1-score   support

    negative      0.730     0.143     0.240      1750
     neutral      0.358     0.817     0.498      1515
    positive      0.532     0.340     0.415      1490

    accuracy                          0.420      4755
   macro avg      0.540     0.433     0.384      4755
weighted avg      0.549     0.420     0.377      4755



In [13]:
# Run VADER on the set of airline tweets with only nouns and after having lemmatized the text
vader_tweets(airline_tweets, lemmatize_value = True,pos_value = {'NOUN'})

              precision    recall  f1-score   support

    negative      0.715     0.157     0.257      1750
     neutral      0.358     0.809     0.496      1515
    positive      0.521     0.331     0.405      1490

    accuracy                          0.419      4755
   macro avg      0.531     0.432     0.386      4755
weighted avg      0.540     0.419     0.379      4755



In [14]:
# Run VADER on the set of airline tweets with only verbs
vader_tweets(airline_tweets, pos_value = {'VERB'})

              precision    recall  f1-score   support

    negative      0.774     0.288     0.420      1750
     neutral      0.383     0.810     0.520      1515
    positive      0.568     0.343     0.428      1490

    accuracy                          0.472      4755
   macro avg      0.575     0.480     0.456      4755
weighted avg      0.585     0.472     0.454      4755



In [15]:
# Run VADER on the set of airline tweets with only verbs and after having lemmatized the text
vader_tweets(airline_tweets,lemmatize_value = True, pos_value = {'VERB'})

              precision    recall  f1-score   support

    negative      0.741     0.295     0.422      1750
     neutral      0.377     0.780     0.508      1515
    positive      0.568     0.352     0.434      1490

    accuracy                          0.468      4755
   macro avg      0.562     0.476     0.455      4755
weighted avg      0.571     0.468     0.454      4755



**[3 points]** b. Compare the scores and explain what they tell you.

1. Does lemmatisation help? Explain why or why not.

The classification reports show that lemmatisation makes no noticeable difference on this dataset: the macro-average f1-score changes by at most 0.005 in any setting (for example, from 0.621 to 0.616 when all parts of speech are used). Most precision and recall scores decrease slightly after lemmatisation, although there are small improvements, such as the negative recall and the positive f1-score for verbs. A likely explanation is that lemmatisation replaces each word with its lemma (base form), so that an inflected form whose sentiment VADER would have recognised is changed or lost.

2. Are all parts of speech equally important for sentiment analysis? Explain why or why not.

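The three part-of-speech filters show a similar pattern to each other, but all of them score clearly below the full text: the macro-average f1-score falls from 0.621 with all words to 0.474 with only adjectives, 0.456 with only verbs, and 0.384 with only nouns. In every filtered setting the neutral recall rises (to between 0.78 and 0.89) while the negative recall drops (to between 0.14 and 0.30), which suggests that many tweets are left without any sentiment-bearing word and therefore receive a compound score of 0.0. The parts of speech are thus not equally important: adjectives retain the most sentiment information and nouns the least, but none of them is sufficient on its own, because the context is lost when only one part of speech is analysed, and without context it is difficult to tell what the writer intended.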
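
The macro-average f1-scores from the eight reports above can be put side by side to make both comparisons explicit. The numbers in the sketch are copied from those reports; only the dictionary layout and variable names are illustrative.

```python
# macro-average f1 from the reports above: (without lemmas, with lemmas)
macro_f1 = {
    "all": (0.621, 0.616),
    "ADJ": (0.474, 0.474),
    "NOUN": (0.384, 0.386),
    "VERB": (0.456, 0.455),
}

baseline = macro_f1["all"][0]  # full text, no lemmatisation

for setting, (plain, lemma) in macro_f1.items():
    print(f"{setting:5s} effect of lemmatisation: {lemma - plain:+.3f}   "
          f"difference from full text: {plain - baseline:+.3f}")
```

The first column stays within a few thousandths, while the second column shows the much larger loss caused by filtering on a single part of speech.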

## Part II: scikit-learn assignments
### [4 points] Question 5
Train the scikit-learn classifier (Naive Bayes) using the airline tweets.

+ Train the model on the airline tweets with 80% training and 20% test set and default settings (TF-IDF representation, min_df=2)
+ Train with different settings:
    + with respect to vectorizing: TF-IDF ('airline_tfidf') vs. Bag of words representation ('airline_count') 
    + with respect to the frequency threshold (min_df). Carry out experiments with increasing values for document frequency (min_df = 2; min_df = 5; min_df =10) 
* [1 point] a. Generate a classification_report for all experiments
* [3 points] b. Look at the results of the experiments with the different settings and try to explain why they differ: 
    + which category performs best, is this the case for any setting?
    + does the frequency threshold affect the scores? Why or why not according to you?
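
To make the difference between the two representations concrete, the sketch below recomputes TF-IDF weights by hand for a toy corpus. It assumes the default formulation of scikit-learn's `TfidfTransformer` (smoothed idf `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation); the toy documents and the helper `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus of tokenised documents
docs = [
    ["flight", "late", "late"],
    ["flight", "great"],
    ["flight", "cancelled"],
]

n_docs = len(docs)
# document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


print("counts:", dict(Counter(docs[0])))
print("tf-idf:", tfidf(docs[0]))
```

In the bag-of-words representation "flight" simply counts once; in TF-IDF it is down-weighted because it occurs in every document.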

In [16]:
# Your code here
import numpy
import nltk

from collections import Counter
from nltk.corpus import stopwords

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [17]:
def train_vectorize_tweets(min_df, vectorizer):
    """
    Train Naive Bayes on the airline tweets and return the classification report.

    :param int min_df: minimum document frequency for a term to be kept
    :param str vectorizer: 'tfidf' or 'count'
    """
    # bag-of-words counts, tokenised with NLTK and without English stop words
    airline_vectorizer = CountVectorizer(min_df=min_df, tokenizer=nltk.word_tokenize, stop_words=stopwords.words("english"))
    airline_counts = airline_vectorizer.fit_transform(airline_tweets.data)

    if vectorizer == "tfidf":
        features = TfidfTransformer().fit_transform(airline_counts)
        digits = 3
    elif vectorizer == "count":
        features = airline_counts
        digits = 2
    else:
        return "%s vectorizer is not defined" % vectorizer

    # 80% training, 20% test; no random_state is set, so every call draws a new split
    tweets_train, tweets_test, y_train, y_test = train_test_split(features, airline_tweets.target, test_size=0.20)
    clf = MultinomialNB().fit(tweets_train, y_train)
    pred = clf.predict(tweets_test)
    return sklearn.metrics.classification_report(y_true=y_test, y_pred=pred, digits=digits)

**[1 point]** a. Generate a classification_report for all experiments

In [18]:
print(train_vectorize_tweets(2,'tfidf'))



              precision    recall  f1-score   support

           1      0.794     0.913     0.850       334
           2      0.838     0.639     0.725       324
           3      0.781     0.853     0.816       293

    accuracy                          0.801       951
   macro avg      0.805     0.802     0.797       951
weighted avg      0.805     0.801     0.797       951



In [19]:
print(train_vectorize_tweets(2,'count'))



              precision    recall  f1-score   support

           1       0.83      0.92      0.87       356
           2       0.86      0.69      0.77       313
           3       0.79      0.86      0.82       282

    accuracy                           0.83       951
   macro avg       0.83      0.82      0.82       951
weighted avg       0.83      0.83      0.82       951



In [20]:
print(train_vectorize_tweets(5,'tfidf'))



              precision    recall  f1-score   support

           1      0.828     0.890     0.858       347
           2      0.808     0.747     0.776       304
           3      0.828     0.820     0.824       300

    accuracy                          0.822       951
   macro avg      0.822     0.819     0.820       951
weighted avg      0.822     0.822     0.821       951



In [21]:
print(train_vectorize_tweets(10,'tfidf'))



              precision    recall  f1-score   support

           1      0.825     0.864     0.844       360
           2      0.804     0.764     0.783       296
           3      0.857     0.851     0.854       295

    accuracy                          0.829       951
   macro avg      0.829     0.826     0.827       951
weighted avg      0.828     0.829     0.828       951



**[3 points]** b. Look at the results of the experiments with the different settings and try to explain why they differ:
1. which category performs best, is this the case for any setting?

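Category 1, the negative category, scores best overall. It has the highest recall in all four experiments and the highest f1-score in three of them; the exception is TF-IDF with min_df = 10, where category 3 (positive) has a slightly higher f1-score (0.854 against 0.844). The highest precision is never held by the negative category alone: it goes to category 2 for min_df = 2, to category 3 for min_df = 10, and is shared by categories 1 and 3 for min_df = 5. Note that every experiment draws a new random train/test split (the support per class differs between the reports), so small differences should be interpreted with caution.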


2. does the frequency threshold affect the scores? Why or why not according to you?

For the TF-IDF experiments, the accuracy and the macro and weighted averages increase when the frequency threshold is increased: the accuracy goes from 0.801 (min_df = 2) to 0.822 (min_df = 5) and 0.829 (min_df = 10). Most per-class precision, recall, and f1-scores increase as well. This could be because a higher threshold removes rare terms, so that the remaining, more frequent terms are more likely to be reliable indicators of a sentiment. The trend would probably not continue for much higher thresholds, as these would filter out too many terms.
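
What a document-frequency threshold does to the vocabulary can be shown with a plain-Python sketch. It mimics the idea behind `min_df` rather than calling scikit-learn; the toy documents and the helper `vocabulary` are invented for illustration.

```python
from collections import Counter


def vocabulary(tokenized_docs, min_df):
    """Return the sorted terms that occur in at least min_df documents."""
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Toy corpus of tokenised tweets
docs = [
    ["flight", "delayed", "again"],
    ["flight", "delayed", "worst"],
    ["great", "flight", "thanks"],
    ["thanks", "jetblue"],
]

for min_df in (1, 2, 3):
    print(min_df, vocabulary(docs, min_df))
```

Raising the threshold first removes one-off terms, but it soon also removes informative words such as "worst" and "great", which illustrates why very high thresholds are likely to hurt.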

### [4 points] Question 6: Inspecting the best scoring features 

+ Train the scikit-learn classifier (Naive Bayes) model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
* [1 point] a. Generate the list of best scoring features per class (see function **important_features_per_class** below) [1 point]
* [3 points] b. Look at the lists and consider the following issues: 
    + [1 point] Which features did you expect for each separate class and why?
    + [1 point] Which features did you not expect and why ? 
    + [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why? 

In [22]:
airline_vectorizer = CountVectorizer(min_df=2, tokenizer = nltk.word_tokenize,stop_words=stopwords.words("english"))
airline_counts = airline_vectorizer.fit_transform(airline_tweets.data)

tweets_train, tweets_test, y_train, y_test = train_test_split(airline_counts, airline_tweets.target, test_size = 0.20)
    
clf = MultinomialNB().fit(tweets_train, y_train)



**[1 point]** a. Generate the list of best scoring features per class (see function important_features_per_class below) [1 point]

In [23]:
def important_features_per_class(vectorizer,classifier,n=80):
    class_labels = classifier.classes_
    feature_names =vectorizer.get_feature_names_out()
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

# example of how to call from notebook:
important_features_per_class(airline_vectorizer, clf)

Important words in negative documents
1 1506.0 @
1 1384.0 united
1 1217.0 .
1 414.0 ``
1 403.0 flight
1 386.0 ?
1 374.0 !
1 326.0 #
1 207.0 n't
1 156.0 ''
1 131.0 's
1 109.0 service
1 109.0 :
1 101.0 virginamerica
1 97.0 cancelled
1 96.0 bag
1 92.0 customer
1 90.0 get
1 87.0 delayed
1 83.0 time
1 83.0 plane
1 81.0 -
1 76.0 'm
1 75.0 ...
1 73.0 ;
1 70.0 http
1 69.0 hours
1 69.0 gate
1 68.0 &
1 61.0 hour
1 60.0 still
1 59.0 2
1 58.0 would
1 58.0 amp
1 57.0 late
1 55.0 airline
1 54.0 help
1 53.0 worst
1 53.0 ca
1 52.0 one
1 51.0 like
1 51.0 flights
1 49.0 waiting
1 47.0 never
1 47.0 delay
1 45.0 've
1 44.0 flightled
1 43.0 (
1 42.0 back
1 41.0 $
1 40.0 us
1 40.0 really
1 40.0 3
1 40.0 )
1 38.0 lost
1 37.0 seat
1 37.0 ever
1 36.0 u
1 36.0 due
1 36.0 check
1 35.0 people
1 34.0 trying
1 34.0 seats
1 34.0 fly
1 34.0 airport
1 33.0 wait
1 33.0 hold
1 32.0 thanks
1 32.0 luggage
1 32.0 got
1 32.0 even
1 32.0 day
1 31.0 ticket
1 31.0 last
1 31.0 bags
1 31.0 baggage
1 31.0 another
1 30.0 need
1 30

**[3 points]** b. Look at the lists and consider the following issues:
1. [1 point] Which features did you expect for each separate class and why?

In the negative documents, the words "cancelled", "delayed", and "worst" would be expected as they all carry a negative connotation. In the neutral documents, airline names, "flights", and non-specific words such as "one", "tomorrow", and "would" are expected as they do not contain any specific sentiment. In the positive documents, "like", "nice", "amazing", and "love", are to be expected as they all carry a positive connotation. 

2. [1 point] Which features did you not expect and why ?

The word "thanks" was not expected in the negative documents, as it is typically used in a positive context; it may appear there because it is used sarcastically. The exclamation mark in the neutral class was also unexpected, as it usually signals strong emotion.

3. [1 point] The list contains all kinds of words such as names of airlines, punctuation, numbers and content words (e.g., 'delay' and 'bad'). Which words would you remove or keep when trying to improve the model and why?

To improve the model, punctuation such as quotation marks, "&", ";", and "-" should be removed, because it does not contribute to the sentiment of the text. Emoticons such as ":)", on the other hand, should be kept, as they do express sentiment. Words with sentiment, such as "cancel", "late", and "worst", should of course be kept as well, along with aspect words connected with the sentiment.
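
The filtering described in this answer could be sketched as a token filter that drops pure punctuation but keeps emoticons. The emoticon list and the helper `keep_token` are hypothetical. The sketch also assumes that the tokenizer delivers emoticons as single tokens; the separate ":", "(" and ")" entries in the feature list suggest that this may not hold for `nltk.word_tokenize`.

```python
# Hypothetical, deliberately small emoticon list for illustration
EMOTICONS = {":)", ":-)", ":(", ":-(", ":D", ";)"}


def keep_token(token):
    """Keep emoticons and any token that contains a letter or a digit."""
    if token in EMOTICONS:
        return True
    return any(ch.isalnum() for ch in token)


# Toy token list in the style of the feature list above
tokens = ["@", "united", "worst", "flight", "ever", "!", "``", ":)", "n't", "&", "2"]
print([t for t in tokens if keep_token(t)])
```

This filter keeps numbers and the negation token "n't"; whether numbers and airline names should also be removed is a separate decision that can be tested as its own experiment.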

### [Optional! (will not  be graded)] Question 7
Train the model on airline tweets and test it on your own set of tweets
+ Train the model with the following settings (airline tweets 80% training and 20% test;  Bag of words representation ('airline_count'), min_df=2)
+ Apply the model on your own set of tweets and generate the classification report
* [1 point] a. Carry out a quantitative analysis.
* [1 point] b. Carry out an error analysis on 10 correctly and 10 incorrectly classified tweets and discuss them
* [2 points] c. Compare the results (cf. classification report) with the results obtained by VADER on the same tweets and discuss the differences.

### [Optional! (will not be graded)] Question 8: trying to improve the model
* [2 points] a. Think of some ways to improve the scikit-learn Naive Bayes model by playing with the settings or applying linguistic preprocessing (e.g., by filtering on part-of-speech, or removing punctuation). Do not change the classifier but continue using the Naive Bayes classifier. Explain what the effects might be of these other settings 
+ [1 point] b. Apply the model with at least one new setting (train on the airline tweets using 80% training, 20% test) and generate the scores
* [1 point] c. Discuss whether the model achieved what you expected.

## End of this notebook