# Assignment 3: Text Classification (CPSC 436N)
 

* Follow the instructions in this notebook to develop a Bag of Words (BOW) classifier (textbook Fig 4.1) and a word embedding-based classifier for text classification. 

* Test the implemented classifiers on subsets of the 20newsgroups corpus, and analyze the errors and successes of each model compared to the other.

* In this assignment, you will use sklearn (which you should have already used extensively in either cpsc330 or cpsc340) and spacy, which you briefly saw in Assignment 1.



## 1. Install Spacy and pipelines that will be used in later steps:

#### Install Spacy v3.0: 


In [1]:
import sys

# Quote the interpreter path so the command also works when the path contains spaces.
!"{sys.executable}" -m pip install spacy==3.0


#### Download and install the trained English pipeline ([en_core_web_lg](https://spacy.io/models/en)) provided by Spacy: 

In [2]:
# Use the notebook's own interpreter (quoted in case its path contains spaces),
# so the pipeline is installed into the same environment as spacy.
!"{sys.executable}" -m spacy download en_core_web_lg


## 2. Load dataset (20newsgroups)

For this assignment, we train and test classification models on the 20newsgroups corpus, which can be easily fetched with sklearn. This dataset comprises around 18000 newsgroup posts on 20 topics and has been split by sklearn into two subsets for model training and testing. 

To keep this assignment manageable and to avoid long training and inference times, we will use a subset of 20newsgroups that covers only samples belonging to one of two classes (such as '__talk.politics.misc__' and '__talk.religion.misc__', used below). With this setting, we will perform binary classification (instead of multiclass classification). 

Please **read carefully** the two links below which provide details about the 20newsgroups corpus and how to load and process it with sklearn:

* http://qwone.com/~jason/20Newsgroups/
* https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

In [None]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=['talk.politics.misc','talk.religion.misc'])
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=['talk.politics.misc','talk.religion.misc'])

**Q1:** Once the data has been loaded, please look at the testing set (__newsgroups_test__) and answer these questions:
- Who sent the shortest message? 
- Who sent the longest message? 
- What are the labels of the shortest and longest messages?
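
As a hint for the mechanics only, the sketch below finds the positions of the shortest and longest strings in a small made-up list. The names `toy_docs` and `toy_labels` are hypothetical stand-ins, not the newsgroups data, and measuring length in characters is an assumption; counting whitespace-separated words would be an equally reasonable choice.

```python
# Toy stand-ins for a list of documents and their integer labels (hypothetical data).
toy_docs = ["a short post", "hi", "a considerably longer post about something"]
toy_labels = [0, 1, 0]

# Length of every document in characters (assumption: length = character count).
lengths = [len(doc) for doc in toy_docs]

# Index of the shortest and of the longest document.
shortest_idx = min(range(len(toy_docs)), key=lengths.__getitem__)
longest_idx = max(range(len(toy_docs)), key=lengths.__getitem__)

print(shortest_idx, toy_labels[shortest_idx])
print(longest_idx, toy_labels[longest_idx])
```

The same indices can then be used to look up any other per-document information stored in parallel lists.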

In [None]:
#### SOLUTION Q1  ######



## 3. Create your own tokenizer with Spacy:

Tokenization is a critical preprocessing step when working with text data. [Tokenization](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/) separates input text into tokens, which can be words, characters, or subwords, and further processes these tokens to reduce the noise caused by informal expressions, typos, or common but uninformative words in the text.

There are different ways to customize your tokenization strategy. Some typical steps include __lowercasing__, __stop word and punctuation removal__, __lemmatization__, and __filtering tokens by part-of-speech tag__.
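
To make these steps concrete, here is a minimal library-free sketch. It is not the Spacy-based tokenizer requested below: `toy_tokenizer`, `TOY_STOP_WORDS`, and `TOY_LEMMAS` are hypothetical names, and the tiny stop-word set and lemma lookup table are hand-written assumptions for illustration.

```python
import string

# Hypothetical, hand-written resources for illustration only; a real tokenizer
# would take stop words and lemmas from a library such as Spacy.
TOY_STOP_WORDS = {"the", "a", "are", "on", "is"}
TOY_LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}

def toy_tokenizer(text):
    """Lowercase, strip punctuation, drop stop words, and lemmatize via lookup."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)  # remove leading/trailing punctuation
        if not word or word in TOY_STOP_WORDS:
            continue  # skip empty strings (pure punctuation) and stop words
        tokens.append(TOY_LEMMAS.get(word, word))  # fall back to the word itself
    return tokens

print(toy_tokenizer("The cats are sitting on the mats!"))  # ['cat', 'sit', 'mat']
```

A lookup table obviously does not scale; it only shows where lemmatization sits relative to the other steps.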

**Q2:** In the code cell below, please complete your tokenization function with Spacy (named __spacy_tokenizer__). It should cover:

* Lowercasing

* Removing stop words

* Removing punctuation

* Lemmatization


In [None]:
#import libs needed for tokenization.
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
import string

nlp = spacy.load('en_core_web_lg')  # Load trained English pipeline "en_core_web_lg".
punctuations = string.punctuation  # String containing the ASCII punctuation characters.
stop_words = STOP_WORDS  # Set of English stop words shipped with Spacy (imported above).

# Creating your own tokenizer function with functions built in Spacy.
def spacy_tokenizer(doc):

    tokens = nlp(doc)  # Splits the doc into tokens and applies the loaded pipeline 
    #(create all linguistic annotations for the doc, including POS etc).

    ######## FOR SOLUTION Q2 ########
    # Lemmatizing each token and converting each token into lowercase. You can use spacy Token.lemma_
    

    # Removing stop words and punctuations
    
    ######## FOR SOLUTION Q2 ########
    
    return tokens  # return preprocessed list of tokens.

## 4. Build the pipeline for BOW classification model

Now let's design the pipeline for the text classification model with sklearn!

We start with the BOW classifier. Its overall pipeline should contain:

* A BOW vectorizer applying the tokenizer implemented above

* A classifier, which should be set to logistic regression for now
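
For intuition about what a BOW vectorizer produces, the following is a toy sketch in plain Python. It is not how CountVectorizer is implemented; `toy_bow_vectors` is a hypothetical helper that assumes the documents have already been tokenized.

```python
from collections import Counter

def toy_bow_vectors(tokenized_docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    # A sorted vocabulary gives every distinct token a fixed column index.
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])  # 0 for absent tokens
    return vocab, vectors

vocab, vectors = toy_bow_vectors([["good", "movie", "good"], ["bad", "movie"]])
print(vocab)    # ['bad', 'good', 'movie']
print(vectors)  # [[0, 2, 1], [1, 0, 1]]
```

Each document becomes one row of token counts, and word order is discarded; the classifier only ever sees these rows.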

**Q3:** Please complete the code in the cell below by following the comments.

<br>

If needed, please read the sklearn guides:

* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html, to learn how to use CountVectorizer to obtain BOW vectors with a customized tokenizer;

* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html, to learn how to use the Pipeline object to implement classification models.



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

######## FOR SOLUTION Q4 ########
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))  # Create BOW vectorizer object.
# SOLUTION Q4 (a): change the BOW vectorizer to the unigram+bigram setting: 

classifier = LogisticRegression()  # Create Logistic Regression classifier object.
# SOLUTION Q4 (b): change the classifier to Naive Bayes



######## FOR SOLUTION Q3 ########

# Create pipeline for BOW classifier.
pipe = Pipeline([('vectorizer', bow_vector),
                 ('classifier', classifier)])

Now let's train the model on the training set obtained from 20newsgroups.

In [None]:
# >>> pipe.fit(X_train, y_train)
pipe.fit(newsgroups_train.data, newsgroups_train.target)

Now let's evaluate our BOW classifier's performance on the testing set obtained from 20newsgroups.

As an example, here we only compute the [accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) as the evaluation metric; more sophisticated evaluation metrics or techniques are needed for deeper model analysis.
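
Accuracy is simply the fraction of test samples whose predicted label equals the true label. A minimal plain-Python sketch of that definition follows; `toy_accuracy` is a hypothetical helper written for illustration, not sklearn's implementation.

```python
def toy_accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(toy_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

Because it collapses everything into one number, accuracy cannot tell you which class the errors fall in, which is why per-class metrics are useful for deeper analysis.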

In [None]:
from sklearn import metrics

predicted = pipe.predict(newsgroups_test.data)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(newsgroups_test.target, predicted))

**Q4:** In the code above, we apply the BOW vectorizer with the unigram setting. Now please change the BOW vectorizer to the unigram+bigram setting and report the accuracy you get.

Next, please try Naive Bayes as the classifier with the BOW vectorizer in both the unigram and unigram+bigram settings, then report the accuracies you get.
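
To see what the unigram+bigram setting adds to the feature set, here is a toy sketch of n-gram extraction over an already tokenized document. `toy_ngrams` is a hypothetical helper; its `ngram_range` argument follows the same inclusive `(min_n, max_n)` convention as the CountVectorizer parameter of that name, but the code is only an illustration.

```python
def toy_ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams of `tokens` for n in the inclusive range given."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["free", "speech", "debate"]
print(toy_ngrams(tokens, (1, 1)))  # ['free', 'speech', 'debate']
print(toy_ngrams(tokens, (1, 2)))  # ['free', 'speech', 'debate', 'free speech', 'speech debate']
```

Bigrams keep a little local word order that unigrams lose, at the cost of a much larger and sparser vocabulary.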

## 5. Build the pipeline for the word embedding-based classification model 

The pipeline of the BOW classifier implemented above consists of two components: the BOW vectorizer and the classifier (LogReg or NB). In that scenario, sklearn could call our customized tokenizer as part of the BOW vectorizer.

This new classifier will represent each input document as the average of the embeddings of the words it contains (a distributed representation). To compute this very different input we will use Spacy. This requires the preprocessing of text (tokenization) and the vectorization of documents as the average of word embeddings to be two separate steps. Thus, the pipeline for this classifier will include three components:

* a preprocessing component 

* a document vectorization component

* a classifier
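
The idea of averaging word embeddings into one document vector can be sketched without Spacy. In the toy example below, the 3-dimensional vectors in `TOY_EMBEDDINGS` are made up, and skipping unknown words is a choice made for this sketch that may differ from what Spacy does.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration;
# the embeddings used later in this notebook have 300 dimensions.
TOY_EMBEDDINGS = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}
DIM = 3

def toy_doc_vector(tokens):
    """Average the embeddings of the known tokens; zero vector if none is known."""
    known = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    # zip(*known) groups the values of each dimension together.
    return [sum(component) / len(known) for component in zip(*known)]

print(toy_doc_vector(["good", "movie", "unknownword"]))  # [2.0, 1.0, 1.0]
```

Unlike a BOW vector, the result has a fixed, small dimension regardless of vocabulary size, and documents with similar words end up close together.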

We will keep the classifier as Logistic Regression, but we need to implement the classes for the preprocessing and document vectorization components (named __Preprocessing_CMPT__ and __SpacyWordEmb_CMPT__ in the code cell below) and use them to construct the sklearn pipeline. Please note that SpacyWordEmb_CMPT must be implemented with the word embeddings provided by "en_core_web_lg" in Spacy.

**INCLUDE IN CANVAS ASSIGNMENT as Q5:** Where requested, please add comments to the transform functions of Preprocessing_CMPT and SpacyWordEmb_CMPT in the code cell below. The comments should explain as best you can what the code is doing and why.

Note that, to chain components smoothly into a pipeline, each component should have a sklearn __transform( )__ and __fit( )__ function:

* __transform( )__: transforms the input data into the format expected by the next component.

* __fit( )__: learns/calculates the parameters from the training data. The parameters learned from the training data are then used to transform the test data. **It will do nothing** (other than return the component itself) if there is no need to learn parameters from the training data.
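
The division of labour between the two functions can be illustrated with a library-free sketch. All names here (`ToyLowercaser`, `ToyLengthScaler`, `toy_fit_transform`) are hypothetical, and the loop only mimics the general idea of chaining components; it is not sklearn's Pipeline code.

```python
class ToyLowercaser:
    """Stateless component: nothing to learn, so fit() just returns self."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class ToyLengthScaler:
    """Stateful component: fit() learns the maximum length seen in training."""
    def fit(self, X, y=None):
        self.max_len_ = max(len(text) for text in X)
        return self

    def transform(self, X):
        # Reuses the parameter learned from the training data.
        return [len(text) / self.max_len_ for text in X]


def toy_fit_transform(components, X_train):
    """Fit each component in turn and pass its output to the next one."""
    data = X_train
    for comp in components:
        data = comp.fit(data).transform(data)
    return data


components = [ToyLowercaser(), ToyLengthScaler()]
train_out = toy_fit_transform(components, ["ABCD", "ab"])
print(train_out)  # [1.0, 0.5]

# At test time only transform() is called, reusing what fit() learned.
test_out = ["abcdefgh"]
for comp in components:
    test_out = comp.transform(test_out)
print(test_out)  # [2.0]
```

The second component shows why fit() matters when there is something to learn: the test value 2.0 is scaled by the maximum length found in the training data, not in the test data.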


In [None]:
# import libs and classes needed for component construction.
from sklearn.base import BaseEstimator 
import numpy as np


# define the class of the preprocessing component
class Preprocessing_CMPT(BaseEstimator): 
    def transform(self, X, **transform_params): # the function actually performs the preprocessing, X contains all samples in the input corpus.
        return [spacy_tokenizer(text) for text in X] # ADD COMMENT
        

    # in this case, we don't need to calculate values for later scaling, so fit() does nothing but return self.
    def fit(self, X, y=None, **fit_params):
        return self

# define the class of the document vectorization component based on word embeddings
class SpacyWordEmb_CMPT(BaseEstimator): 
    def __init__(self, nlp): # define the language model and dimension of word embeddings we want to use in this component.
        self.nlp = nlp
        self.dim = 300
    
    def transform(self, X): 
    # this function  converts documents into the average of their word embeddings, X contains all documents in the input corpus.
        X_str = []
        for text in X:
            X_str.append(' '.join(text)) # ADD COMMENT
        return [self.nlp(text).vector #ADD COMMENT see https://spacy.io/api/doc#vector
                for text in X_str] 

    # in this case, fit() is doing nothing as well.           
    def fit(self, X, y=None):
        return self

nlp = spacy.load('en_core_web_lg')   # Load the pretrained English pipeline "en_core_web_lg"

# Create pipeline for the word embedding-based classifier.
pipe = Pipeline([("preprocessing", Preprocessing_CMPT()),
                ('encoding', SpacyWordEmb_CMPT(nlp)),
                ("classifier", classifier)])

Now let's train this classifier.

In [None]:
pipe.fit(newsgroups_train.data, newsgroups_train.target)

And do prediction on the testing set and evaluate the performance of this classifier.

In [None]:
from sklearn import metrics

# Predicting with a test dataset
predicted = pipe.predict(newsgroups_test.data)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(newsgroups_test.target, predicted))