# SNLP Assignment 6 - Text Classification

Name 1: William LaCroix<br/>
Student id 1: 7038732<br/>
Email 1: williamplacroix@gmail.com<br/>


Name 2: Nicholas Jennings<br/>
Student id 2: 2573492<br/>
Email 2: s8nijenn@stud.uni-saarland.de<br/>

**Instructions:** Read each question carefully. <br/>
Make sure you appropriately comment your code wherever required. Your final submission should contain the completed Notebook and the respective Python files for any additional exercises necessary. There is no need to submit the data files should they exist. <br/>
Upload the zipped folder on CMS. Only one member of the group should make the submission.

---

## <span style="color:red">Sentiment Analysis</span>

   - Use the [FinancialPhraseBank](https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-for-financial-news) corpus with an 80:20 train/test split, with a) a Naive Bayes and b) an XGBoost classifier for sentiment analysis (**2 points**)

      - Naive Bayes: https://scikit-learn.org/stable/modules/naive_bayes.html

      - XGBoost: https://xgboost.readthedocs.io/en/stable/index.html

      - <span style="color:red">Note:</span> Only use the all_data.csv file for your experiments

   - Use the [LoughranMcDonald master dictionary](https://drive.google.com/file/d/17CmUZM9hGUdGYjCXcjQLyybjTrcjrhik/view?usp=sharing) to get word-level polarity, replace each word with its corresponding polarity, and then repeat the same classification task as above (**4 points**)

      - What happens if you remove the stop words? 

      - What's the ratio of stopwords (taken from NLTK) amongst the total word count? 

      - Is it a good choice to do stopword removal? Explain with 2-3 examples why or why not?
   
   - Create cascaded bi-grams and tri-grams and repeat both exercises from above. Which representation do you think offers the better trade-off between accuracy and the computational power required? (**3 points**)

      - Bi-grams example: 

         - Company A barely surpassed their profit expectations.
         
         - (Company, A), (A, barely), (barely, surpassed), (surpassed, their) ...
   
   - How would you represent the polarity in uni/bi/tri-gram models with Huffman Encoding? (**1 point**)
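
One way to make the Huffman question concrete is to treat each polarity (for unigrams) or each tuple of polarities (for bi-grams and tri-grams) as a symbol and build a prefix code from the symbol frequencies. The sketch below is only an illustration under that assumption, not the expected answer; the function name `huffman_codes` and the toy label sequence are hypothetical.

```python
import heapq
from collections import Counter
from typing import Dict, Hashable, Sequence


def huffman_codes(symbols: Sequence[Hashable]) -> Dict[Hashable, str]:
    """Build a Huffman code table (symbol -> bit string) from a symbol sequence."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, partial code table for that subtree).
    heap = [(weight, i, {sym: ""}) for i, (sym, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]


# Toy polarity sequence (hypothetical data, not taken from the corpus).
unigram_polarities = ["neutral"] * 6 + ["positive"] * 3 + ["negative"]
unigram_codes = huffman_codes(unigram_polarities)
encoded = "".join(unigram_codes[p] for p in unigram_polarities)

# For bi-grams, each symbol is a pair of adjacent polarities.
bigram_polarities = list(zip(unigram_polarities, unigram_polarities[1:]))
bigram_codes = huffman_codes(bigram_polarities)
```

The most frequent symbol receives the shortest code, so a sequence dominated by one polarity compresses well. With higher-order n-grams the symbol alphabet grows (up to 3^n polarity tuples for three classes), which makes the code table larger.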


### Few points to remember
   - While splitting your dataset, use seed=42
   - Use type hint(s) in your code, and add a docstring to your functions or classes
   - Focus on the readability of your code; that helps us give you better feedback on where the code went wrong
   - <span style="color:red">Do not submit the data or dictionary file.</span>
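
The seeded 80:20 split can be sketched with the standard library alone. This is a minimal illustration, assuming the test size is rounded up; `split_train_test` is a hypothetical helper, and it will not select the same rows as scikit-learn's `train_test_split(..., test_size=0.2, random_state=42)`, which is the more usual choice in practice.

```python
import math
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    rows: Sequence[T], test_ratio: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle row indices with a fixed seed and split them into train and test."""
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_ratio)
    test_rows = [rows[i] for i in indices[:n_test]]
    train_rows = [rows[i] for i in indices[n_test:]]
    return train_rows, test_rows


# 4846 is the number of rows reported for the corpus in this notebook.
train_rows, test_rows = split_train_test(list(range(4846)))
```

Rounding 20% of 4846 rows up gives 970 test rows, which is the total shown in the confusion matrices in this notebook.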

### Discussion:

- What happens if you remove the stop words?
    - Answer:

- What's the ratio of stopwords (taken from NLTK) amongst the total word count? 
    - Answer:

- Is it a good choice to do stopword removal? Explain with 2-3 examples why or why not?
    - Answer:
  
- How would you represent the polarity in uni/bi/tri-gram models with Huffman Encoding? (**1 point**)
    - Answer:
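
The stopword ratio asked about above can be computed as the number of stopword tokens divided by the total number of tokens. The sketch below assumes simple lowercased whitespace tokenization and takes the stop list as a parameter; `stopword_ratio` and the tiny `demo_stops` set are hypothetical. For the assignment, the stop list would instead come from `nltk.corpus.stopwords.words("english")` after running `nltk.download("stopwords")`.

```python
from typing import Iterable, Set


def stopword_ratio(sentences: Iterable[str], stop_words: Set[str]) -> float:
    """Return the share of tokens that are stopwords (0.0 for empty input)."""
    total = 0
    stops = 0
    for sentence in sentences:
        for token in sentence.lower().split():
            total += 1
            if token in stop_words:
                stops += 1
    return stops / total if total else 0.0


# Hypothetical stop list and sentences, used only to exercise the function.
demo_stops = {"the", "of", "to", "not", "is"}
demo_sentences = [
    "Profit is expected to rise",
    "The company did not meet the target",
]
ratio = stopword_ratio(demo_sentences, demo_stops)
```

In this toy input, 5 of the 12 tokens are stopwords. Note that "not" is part of NLTK's English stop list, so removing stopwords would turn "did not meet the target" into something close to "meet target", which is one kind of example worth weighing when discussing whether removal helps sentiment classification.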


In [68]:
import solution
from importlib import reload
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier


In [85]:
# 1. Unigram Naive Bayes, unpolarized -> baseline NB
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data()

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file, since it's a multi-class classification
# Confusion matrix should give you a better idea of how well your model is performing
# Naive Bayes classification of 3 class labels
# NB_confusion_matrix on 3x3 matrix
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Baseline unpolarized unigram")


Predicted  negative  neutral  positive  All
True                                       
negative         53       33        24  110
neutral          11      502        58  571
positive         10      115       164  289
All              74      650       246  970

Baseline unpolarized unigram Naive Bayes accuracy: 0.7412371134020619
Baseline unpolarized unigram Naive Bayes precision: 0.5674740484429066
Baseline unpolarized unigram Naive Bayes recall: 0.7420814479638009
Baseline unpolarized unigram Naive Bayes F1: 0.6431372549019608
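
For a three-class problem, precision and recall are defined per class, so it can help to recompute them directly from the confusion matrix. The sketch below assumes rows are true labels and columns are predicted labels, as in the table printed above; `per_class_metrics` is a hypothetical helper and is not the implementation in `solution.py`.

```python
from typing import Dict, List


def per_class_metrics(
    matrix: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and F1 per class (rows = true, columns = predicted)."""
    metrics: Dict[str, Dict[str, float]] = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column total
        actually_k = sum(matrix[k])                     # row total
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Counts copied from the Naive Bayes confusion matrix above (without the "All" margins).
nb_matrix = [[53, 33, 24], [11, 502, 58], [10, 115, 164]]
nb_metrics = per_class_metrics(nb_matrix, ["negative", "neutral", "positive"])
nb_accuracy = sum(nb_matrix[i][i] for i in range(3)) / sum(map(sum, nb_matrix))
```

Under this row/column reading, the positive class has precision 164/246 and recall 164/289. The value printed above as precision, 0.5674740484429066, equals 164/289, which suggests the helper in `solution.py` may be reading rows and columns differently; that is an inference from the numbers, not something the notebook states.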


In [86]:
# 2. Unigram XGBoost, unpolarized -> baseline XGBoost
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data()

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Baseline unpolarized unigram")

Predicted  negative  neutral  positive  All
True                                       
negative         71       31         8  110
neutral          11      541        19  571
positive          4      123       162  289
All              86      695       189  970

Baseline unpolarized unigram XGBoost accuracy: 0.797938144329897
Baseline unpolarized unigram XGBoost precision: 0.5605536332179931
Baseline unpolarized unigram XGBoost recall: 0.8059701492537313
Baseline unpolarized unigram XGBoost F1: 0.6612244897959184


In [87]:
# 3. Unigram Naive Bayes, polarized, include stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(polarize=True)

print(corpus)
y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized unigram with stop words")

      0                                                  1
0     1  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
1     1  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
2     0  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
3     2  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
4     2  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
...  ..                                                ...
4841  0  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
4842  1  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
4843  0  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
4844  0  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
4845  0  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...

[4846 rows x 2 columns]


ValueError: empty vocabulary; perhaps the documents only contain stop words
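
A likely cause of this error, assuming `solution.py` vectorizes the text with scikit-learn's `CountVectorizer` or `TfidfVectorizer` using default settings: the default `token_pattern` is `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. The polarized corpus printed above consists of single-character tokens, so every token is discarded and the vocabulary ends up empty. The sketch below reproduces the tokenization step with `re` alone; the `tokenize` helper is hypothetical.

```python
import re
from typing import List

# Default token pattern documented for scikit-learn's text vectorizers.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also accepts single-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def tokenize(document: str, pattern: str) -> List[str]:
    """Extract tokens from a document with the given regular expression."""
    return re.findall(pattern, document)


# A polarized sentence shaped like the rows in the printed corpus.
polarized_sentence = "1 1 0 2 1"
tokens_default = tokenize(polarized_sentence, DEFAULT_PATTERN)
tokens_single = tokenize(polarized_sentence, SINGLE_CHAR_PATTERN)
```

Here `tokens_default` is empty while `tokens_single` keeps all five polarity tokens. Two possible fixes, if the assumption holds, are passing `token_pattern=r"(?u)\b\w+\b"` to the vectorizer or mapping polarities to longer tokens such as `neg`, `neu` and `pos` before vectorizing.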

In [80]:
# 4. Unigram Naive Bayes, polarized, exclude stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized unigram without stop words")

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [81]:
# 5. Unigram XGBoost, polarized, include stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(polarize=True)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized unigram with stop words")

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [82]:
# 6. Unigram XGBoost, polarized, exclude stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized unigram without stop words")

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [83]:
# 7. Bigram Naive Bayes, polarized, include stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(polarize=True, ngramize=2)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized bigram with stop words")

AttributeError: 'list' object has no attribute 'lower'
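
This error typically appears when a text vectorizer receives a list of n-grams per document instead of a string, because its preprocessing step calls `.lower()` on each document. One way around it, assuming that is what happens in `solution.py`, is to join each n-gram into a single token and return one string per document. The sketch below reads "cascaded" as "all orders from 1 up to n", which is an assumption; `cascade_ngrams` is a hypothetical helper.

```python
from typing import List


def cascade_ngrams(tokens: List[str], n: int) -> str:
    """Return one string containing all k-grams for k = 1..n, separated by spaces.

    Tokens inside an n-gram are joined with "_" so that a whitespace-based
    tokenizer treats each n-gram as a single feature.
    """
    grams: List[str] = []
    for k in range(1, n + 1):
        for start in range(len(tokens) - k + 1):
            grams.append("_".join(tokens[start:start + k]))
    return " ".join(grams)


example_tokens = "Company A barely surpassed".split()
bigram_document = cascade_ngrams(example_tokens, 2)
trigram_document = cascade_ngrams(example_tokens, 3)
```

For the example sentence, `bigram_document` is `"Company A barely surpassed Company_A A_barely barely_surpassed"`. An alternative that avoids manual n-gram construction is to keep plain strings and pass `ngram_range=(1, 2)` or `ngram_range=(1, 3)` to `CountVectorizer`.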

In [84]:
# 8. Bigram Naive Bayes, polarized, exclude stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True, ngramize=2)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized bigram without stop words")

AttributeError: 'list' object has no attribute 'lower'

In [None]:
# 9. Bigram XGBoost, polarized, include stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(polarize=True, ngramize=2)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized bigram with stop words")


In [None]:
# 10. Bigram XGBoost, polarized, exclude stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True, ngramize=2)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized bigram without stop words")

In [None]:
# 11. Trigram Naive Bayes, polarized, include stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(polarize=True, ngramize=3)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized trigram with stop words")

In [None]:
# 12. Trigram Naive Bayes, polarized, exclude stop words
solution = reload(solution)
classification_model = MultinomialNB()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True, ngramize=3)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="Naive Bayes", preprocessing="Polarized trigram without stop words")

In [None]:
# 13. Trigram XGBoost, polarized, include stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(polarize=True, ngramize=3)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized trigram with stop words")

In [None]:
# 14. Trigram XGBoost, polarized, exclude stop words
solution = reload(solution)
classification_model = XGBClassifier()
corpus = solution.load_and_preprocess_data(remove_stops=True, polarize=True, ngramize=3)

y_test, y_pred = solution.train_and_fit_model(corpus, classification_model)

# Get the confusion matrix from solution.py file
confusion_matrix = solution.confusion_matrix(y_test, y_pred)
print(confusion_matrix)
solution.test(confusion_matrix, classifier="XGBoost", preprocessing="Polarized trigram without stop words")