# SNLP Assignment 6 - Text Classification

Name 1: William LaCroix<br/>
Student id 1: 7038732<br/>
Email 1: williamplacroix@gmail.com<br/>


Name 2: Nicholas Jennings<br/>
Student id 2: 2573492<br/>
Email 2: s8nijenn@stud.uni-saarland.de<br/>

**Instructions:** Read each question carefully. <br/>
Make sure you appropriately comment your code wherever required. Your final submission should contain the completed Notebook and the respective Python files for any additional exercises necessary. There is no need to submit the data files should they exist. <br/>
Upload the zipped folder on CMS. Only one member of the group should make the submission.

---

## <span style="color:red">Sentiment Analysis</span>

   - Use the [FinancialPhraseBank](https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-for-financial-news) corpus with an 80:20 train-test split, with a) a Naive Bayes and b) an XGBoost classifier for sentiment analysis (**2 points**)

      - Naive Bayes: https://scikit-learn.org/stable/modules/naive_bayes.html

      - XGBoost: https://xgboost.readthedocs.io/en/stable/index.html

      - <span style="color:red">Note:</span> Only use all_data.csv file for your experiments

   - Use the [LoughranMcDonald master dictionary](https://drive.google.com/file/d/17CmUZM9hGUdGYjCXcjQLyybjTrcjrhik/view?usp=sharing) to get word-level polarity, replace words with their corresponding polarity, and repeat the same classification task as above (**4 points**)

      - What happens if you remove the stop words? 

      - What's the ratio of stopwords (taken from NLTK) amongst the total word count? 

      - Is it a good choice to do stopword removal? Explain with 2-3 examples why or why not?
   
   - Create cascaded Bi-Grams and tri-grams and repeat both the exercises from above. Which representation do you think is a better value in terms of accuracy and computational power required? (**3 points**)

      - Bi-grams example: 

         - Company A barely surpassed their profit expectations.
         
         - (Company, A), (A, barely), (barely, surpassed), (surpassed, their) ...
   
   - How would you represent the polarity in uni/bi/tri-gram models with Huffman Encoding? (**1 point**)


### Few points to remember
   - While splitting your dataset, use seed=42
   - Use type hint(s) in your code, and add a docstring to your functions or classes
   - Focus on the readability of your code, that helps us to give you better feedback on where the code went wrong
   - <span style="color:red">Do not submit the data or dictionary file.</span>
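
The first two points can be illustrated together. The sketch below is a stdlib-only stand-in for a seeded 80:20 split, written with type hints and a docstring. `split_80_20` is a hypothetical helper, not a function from `solution.py`; in a scikit-learn pipeline, `train_test_split(..., test_size=0.2, random_state=42)` would be the more usual way to get the same reproducibility.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_20(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `items` with a fixed seed and split it 80:20.

    Args:
        items: The samples to split.
        seed: Seed for the private random generator (42, as required above).

    Returns:
        A (train, test) pair holding 80% and 20% of the samples.
    """
    shuffled = list(items)
    # A private Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


train, test = split_80_20(range(10))
print(len(train), len(test))
```

Because the generator is seeded, calling `split_80_20` twice on the same data returns the same partition, which is what makes the reported scores reproducible.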

### Discussion:

- What happens if you remove the stop words?
    - Answer: In the polarized unigram setting, removing stop words leaves Naive Bayes accuracy unchanged at 0.59 but lowers precision by 0.06 (0.14 to 0.08), recall by 0.08 (0.28 to 0.20) and F1 by 0.07 (0.19 to 0.12). For XGBoost, removal lowers accuracy by 0.01 (0.60 to 0.59), precision by 0.01 (0.16 to 0.15), recall by 0.02 (0.31 to 0.29) and F1 by 0.02 (0.21 to 0.19). Both models therefore score slightly worse without stop words, and the drop is much larger for Naive Bayes.

- What's the ratio of stopwords (taken from NLTK) amongst the total word count? 
    - Answer: 35346 stopwords out of 99315 total words were removed (approx. 35.6% of total words)

- Is it a good choice to do stopword removal? Explain with 2-3 examples why or why not?
    - Answer: In general, removing stop words is a good choice, although some entries on the list do carry meaning. For example, take the following sentence with positive sentiment: "The chain posted sales of 298 million euros for full 2005 , a rise of 19.5 percent , year-on-year ."
    Common words such as "the", "of", "for" and "a", which are all in the NLTK English stopword list, don't provide any information about the sentiment of the sentence, so removing them is the correct choice.
    But consider the negative sentence: "Bosse added that Trygvesta does not have the financial strength to acquire the entire unit"
    Removing the word "not", which also appears in the NLTK stopword list, would hurt sentiment analysis, because dropping it reverses the meaning of the sentence.

  
- How would you represent the polarity in uni/bi/tri-gram models with Huffman Encoding? (**1 point**)
    - Answer: In the unigram case, the polarity of each word plays the role of a character to be encoded. For bigrams, trigrams, etc., each unique polarity n-gram plays that role. The first step would be to compute the frequency of all polarity n-grams and sort them by frequency. Assuming we are dealing with unigrams and the polarities rank by frequency as 1. Neutral, 2. Positive, 3. Negative, then Neutral would be encoded as 0, Positive as 11 and Negative as 10. For higher-order n-grams the same principle applies, the only difference being that the symbol vocabulary is larger.
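
The encoding described in the last answer can be sketched with a small Huffman builder. This is only an illustration under stated assumptions: `huffman_codes` is a hypothetical helper, the polarity sequence is made up, and the exact 0/1 assignment depends on tie-breaking, so only the code lengths are meaningful.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict, Hashable, Iterable


def huffman_codes(symbols: Iterable[Hashable]) -> Dict[Hashable, str]:
    """Build a Huffman code table (symbol -> bit string) from a symbol stream."""
    freqs = Counter(symbols)
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single symbol still needs one bit to be representable.
        return {next(iter(freqs)): "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(freq, next(tie), {sym: ""}) for sym, freq in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


# Made-up polarity stream: Neutral is most frequent, Negative least.
unigrams = ["Neutral"] * 6 + ["Positive"] * 3 + ["Negative"] * 1
uni_codes = huffman_codes(unigrams)

# For bigrams, each pair of adjacent polarities is one symbol.
bigrams = list(zip(unigrams, unigrams[1:]))
bi_codes = huffman_codes(bigrams)

print(uni_codes)
print(bi_codes)
```

The most frequent polarity receives a one-bit code and the two rarer ones receive two-bit codes, as in the answer. The bigram table is built by the same function, only over a larger symbol vocabulary.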


In [1]:
import solution
from importlib import reload
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier


In [2]:
# 1. Unigram Naive Bayes, unpolarized -> baseline NB
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data()

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file, since it's a multi-class classification
# Confusion matrix should give you a better idea of how well your model is performing
# Naive Bayes classification of 3 class labels
# NB_confusion_matrix on 3x3 matrix
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Baseline unpolarized unigram")


Predicted  Negative  Neutral  Positive
True                                  
Negative         53       33        24
Neutral          11      502        58
Positive         10      115       164

Baseline unpolarized unigram Naive Bayes accuracy: 0.74
Baseline unpolarized unigram Naive Bayes precision: 0.57
Baseline unpolarized unigram Naive Bayes recall: 0.74
Baseline unpolarized unigram Naive Bayes F1: 0.64
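
As a cross-check on the figures above, the sketch below derives accuracy and per-class precision, recall and F1 directly from the printed matrix (rows are true labels, columns are predictions). The helper names are hypothetical and the formulas are the textbook one-vs-rest definitions. `solution.test` is not shown in this notebook, so its precision, recall and F1 may follow a different scheme and need not match these per-class values.

```python
from typing import Dict, List

LABELS: List[str] = ["Negative", "Neutral", "Positive"]

# Baseline Naive Bayes confusion matrix, copied from the output above.
# Rows are true labels, columns are predicted labels.
MATRIX: List[List[int]] = [
    [53, 33, 24],
    [11, 502, 58],
    [10, 115, 164],
]


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_metrics(
    matrix: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """One-vs-rest precision, recall and F1 for every class."""
    metrics: Dict[str, Dict[str, float]] = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


print(f"accuracy: {accuracy(MATRIX):.2f}")
for label, scores in per_class_metrics(MATRIX, LABELS).items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

The accuracy works out to 719 / 970, which rounds to the 0.74 printed above. The same two functions can be applied to any of the later matrices by swapping in their counts.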


In [3]:
# 2. Unigram XGBoost, unpolarized -> baseline XGBoost
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data()

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Baseline unpolarized unigram")

Predicted  Negative  Neutral  Positive
True                                  
Negative         71       31         8
Neutral          11      541        19
Positive          4      123       162

Baseline unpolarized unigram XGBoost accuracy: 0.80
Baseline unpolarized unigram XGBoost precision: 0.56
Baseline unpolarized unigram XGBoost recall: 0.81
Baseline unpolarized unigram XGBoost F1: 0.66


In [4]:
# 3. Unigram Naive Bayes, polarized, include stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(polarize=True)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized unigram with stop words")

Predicted  Negative  Neutral  Positive
True                                  
Negative          7       84        19
Neutral          18      527        26
Positive         15      234        40

Polarized unigram with stop words Naive Bayes accuracy: 0.59
Polarized unigram with stop words Naive Bayes precision: 0.14
Polarized unigram with stop words Naive Bayes recall: 0.28
Polarized unigram with stop words Naive Bayes F1: 0.19
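
The `polarize=True` option replaces each word by its dictionary polarity before classification. A minimal sketch of that idea follows. It assumes a toy lexicon in place of the Loughran-McDonald master dictionary, and `TOY_LEXICON` and `polarize_tokens` are illustrative names, not the contents of `solution.py`.

```python
from typing import Dict, List

# Toy stand-in for the Loughran-McDonald dictionary (hypothetical entries).
TOY_LEXICON: Dict[str, str] = {
    "profit": "Positive",
    "rise": "Positive",
    "loss": "Negative",
    "decline": "Negative",
}


def polarize_tokens(
    tokens: List[str], lexicon: Dict[str, str], default: str = "Neutral"
) -> List[str]:
    """Map each token to its polarity label, falling back to `default`."""
    return [lexicon.get(token.lower(), default) for token in tokens]


sentence = "Operating profit rose despite a decline in sales".split()
polarized = polarize_tokens(sentence, TOY_LEXICON)
print(polarized)
```

After this step the vocabulary collapses to three symbols, so most lexical detail is discarded. That is consistent with the accuracy of about 0.59 reported above, which is roughly the share of Neutral sentences in the test set (571 of 970).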


In [5]:
# 4. Unigram Naive Bayes, polarized, exclude stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized unigram without stop words")

total word count: 99315
removed stopword count: 35346
ratio 0.3558979006192418
Predicted  Negative  Neutral  Positive
True                                  
Negative         11       93         6
Neutral          21      537        13
Positive         23      242        24

Polarized unigram without stop words Naive Bayes accuracy: 0.59
Polarized unigram without stop words Naive Bayes precision: 0.08
Polarized unigram without stop words Naive Bayes recall: 0.20
Polarized unigram without stop words Naive Bayes F1: 0.12
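
The ratio printed above is the removed stopword count divided by the total word count. The sketch below shows that computation on a single sentence, together with one way to protect negations. It assumes a small hand-picked stopword set instead of the full NLTK English list, and `remove_stopwords` with its `keep` parameter is a hypothetical helper rather than the function used in `solution.py`.

```python
from typing import AbstractSet, Iterable, List, Tuple

# Small hand-picked subset standing in for NLTK's English stopword list.
TOY_STOPWORDS = {"the", "that", "does", "not", "have", "to", "of", "for", "a"}


def remove_stopwords(
    tokens: Iterable[str],
    stopwords: AbstractSet[str],
    keep: AbstractSet[str] = frozenset(),
) -> Tuple[List[str], float]:
    """Drop stopwords (except those in `keep`).

    Returns the surviving tokens and the fraction of tokens removed.
    """
    all_tokens = list(tokens)
    kept = [
        token
        for token in all_tokens
        if token.lower() not in stopwords or token.lower() in keep
    ]
    removed = len(all_tokens) - len(kept)
    ratio = removed / len(all_tokens) if all_tokens else 0.0
    return kept, ratio


sentence = (
    "Bosse added that Trygvesta does not have the financial strength "
    "to acquire the entire unit"
).split()

plain, plain_ratio = remove_stopwords(sentence, TOY_STOPWORDS)
negation_safe, safe_ratio = remove_stopwords(sentence, TOY_STOPWORDS, keep={"not"})

print(plain, round(plain_ratio, 3))
print(negation_safe, round(safe_ratio, 3))
```

Keeping "not" costs very little in vocabulary size but preserves the reversal of meaning in sentences like this one.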


In [6]:
# 5. Unigram XGBoost, polarized, include stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(polarize=True)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized unigram with stop words")

Predicted  Negative  Neutral  Positive
True                                  
Negative          9       81        20
Neutral          11      529        31
Positive         18      225        46

Polarized unigram with stop words XGBoost accuracy: 0.60
Polarized unigram with stop words XGBoost precision: 0.16
Polarized unigram with stop words XGBoost recall: 0.31
Polarized unigram with stop words XGBoost F1: 0.21


In [7]:
# 6. Unigram XGBoost, polarized, exclude stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized unigram without stop words")

total word count: 99315
removed stopword count: 35346
ratio 0.3558979006192418
Predicted  Negative  Neutral  Positive
True                                  
Negative          5       82        23
Neutral           6      521        44
Positive         12      235        42

Polarized unigram without stop words XGBoost accuracy: 0.59
Polarized unigram without stop words XGBoost precision: 0.15
Polarized unigram without stop words XGBoost recall: 0.29
Polarized unigram without stop words XGBoost F1: 0.19


In [8]:
# 7. Bigram Naive Bayes, polarized, include stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(polarize=True, ngramize=2)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized bigram with stop words")

Predicted  Negative  Neutral  Positive
True                                  
Negative         23       69        18
Neutral          36      493        42
Positive         51      204        34

Polarized bigram with stop words Naive Bayes accuracy: 0.57
Polarized bigram with stop words Naive Bayes precision: 0.12
Polarized bigram with stop words Naive Bayes recall: 0.28
Polarized bigram with stop words Naive Bayes F1: 0.17
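
The `ngramize=2` setting used above builds bigrams over the token sequence. A minimal sketch of n-gram extraction is given below. The `ngrams` helper is hypothetical, and how `solution.py` joins or cascades the n-grams into features is not shown in this notebook.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of `tokens`, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The range is empty when the sequence is shorter than n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "Company A barely surpassed their profit expectations".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:4])

# With three polarity labels there are at most 3**n distinct polarity n-grams.
max_features = {n: 3 ** n for n in (1, 2, 3)}
print(max_features)
```

The first four bigrams reproduce the example from the task description. Since a polarized corpus has at most 3, 9 and 27 distinct unigrams, bigrams and trigrams, the feature space stays small even for trigrams.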


In [9]:
# 8. Bigram Naive Bayes, polarized, exclude stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True, ngramize=2)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized bigram without stop words")

total word count: 99315
removed stopword count: 35346
ratio 0.3558979006192418
Predicted  Negative  Neutral  Positive
True                                  
Negative         27       72        11
Neutral          40      504        27
Positive         52      209        28

Polarized bigram without stop words Naive Bayes accuracy: 0.58
Polarized bigram without stop words Naive Bayes precision: 0.10
Polarized bigram without stop words Naive Bayes recall: 0.25
Polarized bigram without stop words Naive Bayes F1: 0.14


In [10]:
# 9. Bigram XGBoost, polarized, include stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(polarize=True, ngramize=2)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized bigram with stop words")


Predicted  Negative  Neutral  Positive
True                                  
Negative         16       70        24
Neutral          21      500        50
Positive         24      207        58

Polarized bigram with stop words XGBoost accuracy: 0.59
Polarized bigram with stop words XGBoost precision: 0.20
Polarized bigram with stop words XGBoost recall: 0.38
Polarized bigram with stop words XGBoost F1: 0.26


In [11]:
# 10. Bigram XGBoost, polarized, exclude stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True, ngramize=2)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized bigram without stop words")

total word count: 99315
removed stopword count: 35346
ratio 0.3558979006192418
Predicted  Negative  Neutral  Positive
True                                  
Negative         11       78        21
Neutral          10      522        39
Positive         18      208        63

Polarized bigram without stop words XGBoost accuracy: 0.61
Polarized bigram without stop words XGBoost precision: 0.22
Polarized bigram without stop words XGBoost recall: 0.39
Polarized bigram without stop words XGBoost F1: 0.28


In [12]:
# 11. Trigram Naive Bayes, polarized, include stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(polarize=True, ngramize=3)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized trigram with stop words")

Predicted  Negative  Neutral  Positive
True                                  
Negative         36       59        15
Neutral          57      459        55
Positive         72      178        39

Polarized trigram with stop words Naive Bayes accuracy: 0.55
Polarized trigram with stop words Naive Bayes precision: 0.13
Polarized trigram with stop words Naive Bayes recall: 0.35
Polarized trigram with stop words Naive Bayes F1: 0.19


In [13]:
# 12. Trigram Naive Bayes, polarized, exclude stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True, ngramize=3)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized trigram without stop words")

total word count: 99315
removed stopword count: 35346
ratio 0.3558979006192418
Predicted  Negative  Neutral  Positive
True                                  
Negative         35       64        11
Neutral          50      492        29
Positive         72      191        26

Polarized trigram without stop words Naive Bayes accuracy: 0.57
Polarized trigram without stop words Naive Bayes precision: 0.09
Polarized trigram without stop words Naive Bayes recall: 0.26
Polarized trigram without stop words Naive Bayes F1: 0.13


In [14]:
# 13. Trigram XGBoost, polarized, include stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(polarize=True, ngramize=3)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized trigram with stop words")

Predicted  Negative  Neutral  Positive
True                                  
Negative         17       67        26
Neutral          19      491        61
Positive         25      194        70

Polarized trigram with stop words XGBoost accuracy: 0.60
Polarized trigram with stop words XGBoost precision: 0.24
Polarized trigram with stop words XGBoost recall: 0.43
Polarized trigram with stop words XGBoost F1: 0.31


In [15]:
# 14. Trigram XGBoost, polarized, exclude stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True, ngramize=3)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.calculate_confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized trigram without stop words")

total word count: 99315
removed stopword count: 35346
ratio 0.3558979006192418
Predicted  Negative  Neutral  Positive
True                                  
Negative         17       65        28
Neutral          18      516        37
Positive         27      200        62

Polarized trigram without stop words XGBoost accuracy: 0.61
Polarized trigram without stop words XGBoost precision: 0.21
Polarized trigram without stop words XGBoost recall: 0.40
Polarized trigram without stop words XGBoost F1: 0.28
