**Before Starting**: First, fill out the below code cell with your first name, last name, and student ID.

**Before Submission**: Make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).


**During Lab Tips**:
1. DO NOT write your written responses in the same markdown cell as the question. If you do this, your written response will be lost!


2. DO NOT use outside datasets. Use ONLY the datasets linked in this lab. While outside datasets might work locally for YOU, code is likely to fail when being graded as outside datasets can be formatted differently.


3. If possible, please use your local Jupyter Notebook to complete the lab. Online notebook editors like Colab can edit the notebook source code and break our auto-grader, which makes grading your lab more difficult for us!

**<font color='red'>WARNING: Some TODOs have `todo_check()` functions which will give you a rough estimate of whether you will receive points or not. <u>These checks are there simply to make sure you are on the right track and they DO NOT determine your final grade for the lab</u>. They are only here to provide you with real-time feedback.</font>**

In [1]:
FIRST_NAME = "Claude"
LAST_NAME = "Kouakou"
STUDENT_ID = "801438848"

---

$\newcommand{\xv}{\mathbf{x}}
 \newcommand{\wv}{\mathbf{w}}
 \newcommand{\yv}{\mathbf{y}}
 \newcommand{\zv}{\mathbf{z}}
 \newcommand{\uv}{\mathbf{u}}
 \newcommand{\vv}{\mathbf{v}}
 \newcommand{\Chi}{\mathcal{X}}
 \newcommand{\R}{\rm I\!R}
 \newcommand{\sign}{\text{sign}}
 \newcommand{\Tm}{\mathbf{T}}
 \newcommand{\Xm}{\mathbf{X}}
 \newcommand{\Zm}{\mathbf{Z}}
 \newcommand{\I}{\mathbf{I}}
 \newcommand{\Um}{\mathbf{U}}
 \newcommand{\Vm}{\mathbf{V}} 
 \newcommand{\muv}{\boldsymbol\mu}
 \newcommand{\Sigmav}{\boldsymbol\Sigma}
 \newcommand{\Lambdav}{\boldsymbol\Lambda}
$

# Text Data Classification


### ITCS 5156
### Minwoo "Jake" Lee


In [2]:
import pickle
import warnings
import os

def dump_spacy_embed(data, file):
    with open(file,'wb') as f:
        pickle.dump(data, f)
        
def load_spacy_embed(file):
    with open(file,'rb') as f:
        data = pickle.load(f)
    return data

# Goal

The goal of this activity is to become familiar with preprocessing text data and then classifying it. We will use the built-in 20 Newsgroups dataset in Scikit-Learn, which has approximately 20,000 newsgroup documents. Follow the TODO titles and comments to finish the activity! 

# Agenda

* Loading Newsgroup data from Scikit-Learn
* Preprocessing & Classification with
  * Bag of Words
  * TF-IDF
  * n-Gram
  * Word2Vec
  * RoBERTa


# Table of TODOs


1. [TODO1 (5 points)](#TODO1) 
2. [TODO2 (5 points)](#TODO2) 
3. [TODO3 (5 points)](#TODO3) 
4. [TODO4 (5 points)](#TODO4)  
5. [TODO5 (5 points)](#TODO5) 
6. [TODO6 (5 points)](#TODO6) 
7. [TODO7 (5 points)](#TODO7) 
8. [TODO8 (5 points)](#TODO8) 
9. [TODO9 (20 points)](#TODO9) 
10. [TODO10 (18 points)](#TODO10) 
11. [TODO11 (20 points)](#TODO11) 
12. [Feedback (2 points)](#TODO18) 


Let us start the lab by importing the common libraries. 

In [3]:
from copy import deepcopy as copy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline 

import seaborn as sns

# Loading 20 Newsgroup data from Scikit-Learn

Let us start our practice of working with text data by first loading some news-related text data using Sklearn's `20newsgroups` dataset. The newsgroup classes are given in the table below, and more information on the dataset can be found in the [Sklearn docs](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset):

<img src="https://d3i71xaburhd42.cloudfront.net/6eb6a45225d7d6a9ab0bb055b81d9aeaba867930/3-Table1-1.png" width=500 />

To save computation time, let us select the five categories given in the code below.



<div id="TODO1"></div>

### TODO1 (5 points) 
1. Use Sklearn's `fetch_20newsgroups` data loading function to first load the **training** data. Store the output into `train_data`. To do so, pass keyword arguments that match the following descriptions:
    1. Select the dataset to load.
    2. Shuffle the data.
    3. Set categories using the corresponding `cat` variable.
    4. Use a seed of 0 for the `random_state`. **WARNING: If you don't use this seed, you are likely to fail future TODOs even if your code is correct!**
    
    
2. Load the **testing** data using the same keyword arguments as in TODO 1.1, except for the dataset selection. Store the output into `test_data`. 

In [5]:
from sklearn.datasets import fetch_20newsgroups

cat = ['alt.atheism', 'soc.religion.christian',
       'comp.graphics', 'sci.med', 'rec.sport.baseball']

# TODO 1.1 - 1.2
# Load training data
train_data = fetch_20newsgroups(subset='train', shuffle=True, categories=cat, random_state=0)

# Load testing data
test_data = fetch_20newsgroups(subset='test', shuffle=True, categories=cat, random_state=0)

print(f"Length of training data: {len(train_data.data)}")
print(f"Length of testing data: {len(test_data.data)}")

Length of training data: 2854
Length of testing data: 1899


In [6]:
train_data.target_names

['alt.atheism',
 'comp.graphics',
 'rec.sport.baseball',
 'sci.med',
 'soc.religion.christian']

In [7]:
labels, label_count = np.unique(train_data.target, return_counts=True)

print(f"Unique labels: {labels}")
print(f"Unique label counts: {label_count}")

Unique labels: [0 1 2 3 4]
Unique label counts: [480 584 597 594 599]


In [8]:
train_data

{'data': ['From: rcasteto@watsol.uwaterloo.ca (Ron Castelletto)\nSubject: Orioles Phillies Red Sox\nKeywords: orioles phillies red sox baltimore philadelphia boston bosox\nOrganization: University of Waterloo\nDistribution: na\nLines: 20\n\n\nCan someone send me ticket ordering information for the\nfollowing teams:  Baltimore, Philadelphia and Boston.\n\nAlso, if you have a home schedule available - can you tell me the dates\nfor all home games between July26-Aug6 and between Aug30-Sept10 and if\nany of these games are promotion nights or special discount nights?\n\nThanks !!!  Ron\n\nPS: and also who the opponents are for these games :-)\n\nDo NOT reply to this account,\nplease reply to:  ronc@vnet.ibm.com\n\n __        _                 IBM Canada Lab Database Technology\n|  \\      / \\                Associate Development Analyst\n|__/ on   |  astelletto      (416) 448-2546 Tie Line: 778-2546\n| \\_      \\_/                Internal Mail: 51/843/895/TOR\n\n',
  'From: nahess@mir.ga

In [9]:
test_data

{'data': ["From:  (Rashid)\nSubject: Re: Yet more Rushdie [Re: ISLAMIC LAW]\nNntp-Posting-Host: nstlm66\nOrganization: NH\nLines: 19\n\nIn article <116171@bu.edu>, jaeger@buphy.bu.edu (Gregg Jaeger) wrote:\n> \nI have already made the clear claim that\n> Khomeini advocates views which are in contradition with the Qur'an\n> and have given my arguments for this. This is something that can be\n> checked by anyone sufficiently interested. Khomeini, being dead,\n> really can't respond, but another poster who supports Khomeini has\n> responded with what is clearly obfuscationist sophistry. This should\n> be quite clear to atheists as they are less susceptible to religionist\n> modes of obfuscationism. \n> \n\nDon't mind my saying this but the best example of obfuscation is to\ncondemn without having even your most basic facts straight. If you\nwant some examples, go back and look at your previous posts, where\nyou manage to get your facts wrong about the fatwa and Khomeini's \nsupposed infal

# Bag of Words

We now preprocess the text data with bag-of-words processing, which includes tokenizing, filtering of stopwords, and encoding. It is all implemented in `CountVectorizer`, so let us play with it. 
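To make the encoding concrete, here is a minimal sketch on a two-document toy corpus. The documents and the `toy_` variable names are made up for illustration and are not part of the lab data. Note that with no arguments, `CountVectorizer` keeps stop words such as "the".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the encoding
toy_docs = ["the cat sat", "the cat sat on the cat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each token to a column index; columns follow sorted token order
toy_vocab = sorted(toy_vect.vocabulary_)
toy_dense = toy_counts.toarray()

print(toy_vocab)
print(toy_dense)
```

Each row is one document and each column counts how often one vocabulary word occurs in it, so word order is lost.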



<div id="TODO2"></div>

### TODO2 (5 points)

1. Create an instance of the Sklearn class `CountVectorizer` with no passed arguments. Store the output into `vect`. 


2. Use the `fit_transform()` method to both fit and transform the training data. Store the output into `X_train_cnt`. 


3. Transform the testing data. Store the output into `X_test_cnt`.


In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# TODO 2.1 - 2.2
# 1. Create CountVectorizer instance
vect = CountVectorizer()
# 2. Fit and transform the training data
X_train_cnt = vect.fit_transform(train_data.data)



In [13]:
X_train_cnt.shape

(2854, 39806)

In [16]:
# TODO 2.3
# 3. Transform the testing data
X_test_cnt = vect.transform(test_data.data)

## Classification

<div id="TODO3"></div>

### TODO3 (5 points)

1. Create a Naive Bayes classifier using Sklearn's `MultinomialNB` class. Store the class instance into `nbc`.
    1. Hint: Pass arguments/hyper-parameters as you see fit.


2. Train the classifier with the training data `X_train_cnt`, which was transformed, and the training targets.
    1. Hint: Remember our original training data is contained in `train_data`. It might also contain our targets as well!


3. Make a prediction for the transformed test data `X_test_cnt`. Store the output into `pred`.


4. Using the testing predictions and targets, compute the F1-score to measure the performance using the `average=macro` keyword argument. Store the output into `f1_score`.
    1. Hint: Remember our original testing data is contained in `test_data`. It might also contain our targets as well!
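As a reminder of what `average="macro"` means, the sketch below computes the per-class F1 scores by hand on a small made-up label set and averages them without weighting, then compares against Sklearn. The labels `y_true` and `y_hat` are hypothetical, and the function is imported under the alias `sk_f1_score` only to keep this example's names separate from the lab's.

```python
import numpy as np
from sklearn.metrics import f1_score as sk_f1_score

# Hypothetical labels for a 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 0])

per_class = []
for c in np.unique(y_true):
    tp = np.sum((y_hat == c) & (y_true == c))  # true positives for class c
    fp = np.sum((y_hat == c) & (y_true != c))  # false positives for class c
    fn = np.sum((y_hat != c) & (y_true == c))  # false negatives for class c
    per_class.append(2 * tp / (2 * tp + fp + fn))  # F1 for class c

# Macro average: unweighted mean of the per-class F1 scores
macro_by_hand = float(np.mean(per_class))
macro_sklearn = sk_f1_score(y_true, y_hat, average="macro")

print(per_class, macro_by_hand, macro_sklearn)
```

Because every class contributes equally, a small class that is classified poorly lowers the macro score as much as a large one would.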


In [35]:
from sklearn.naive_bayes import MultinomialNB

# TODO 3.1 - 3.3
nbc = MultinomialNB(alpha=0.5)
nbc.fit(X_train_cnt,  train_data.target)
pred = nbc.predict(X_test_cnt)

pred

array([0, 4, 2, ..., 2, 3, 0])

In [36]:
# TODO 3.4
from sklearn.metrics import f1_score
f1_score_value = f1_score(test_data.target, pred, average="macro")

print(f"Testing F1 score: {f1_score_value}")

Testing F1 score: 0.9450687104012774


<div id="TODO4"></div>

### TODO4 (5 points)

To make this easy to work with, let us build a pipeline of vectorizer and our classifier. 

1. Create a `Pipeline` instance that connects `CountVectorizer` and `MultinomialNB`. Store the output into `nbclf`.
    1. Hint: Only pass kwargs for `MultinomialNB` as you see fit.


2. Train `nbclf` using the **raw** (untransformed) training data and targets.


3. Make predictions using the **raw** (untransformed) testing data. Store the output into `pred`.


4. Using the testing predictions and targets, compute the F1-score to measure the performance using the `average=macro` keyword argument. Store the output into `f1_score`.
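The sketch below illustrates, on made-up toy sentences, that a `Pipeline` fed raw text gives the same predictions as running the vectorizer and classifier by hand. All `toy_` and `manual_` names are hypothetical and chosen so they do not clash with the lab variables.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = positive, 0 = negative
toy_train = ["good great fun", "bad awful boring", "great fun", "awful bad"]
toy_y = [1, 0, 1, 0]
toy_test = ["fun and great", "boring and bad"]

# Manual route: vectorize first, then classify
manual_vect = CountVectorizer()
manual_nb = MultinomialNB()
manual_nb.fit(manual_vect.fit_transform(toy_train), toy_y)
manual_pred = manual_nb.predict(manual_vect.transform(toy_test))

# Pipeline route: the same two steps chained, fed with raw text
toy_pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
toy_pipe.fit(toy_train, toy_y)
pipe_pred = toy_pipe.predict(toy_test)

print(manual_pred, pipe_pred)
```

The pipeline calls `fit_transform` on each transformer during `fit` and only `transform` during `predict`, which is exactly the manual sequence.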

In [37]:
from sklearn.pipeline import Pipeline
# TODO 4.1 - 4.4
nbclf = Pipeline([
    ('vectorizer', CountVectorizer()),  # Text vectorization
    ('classifier', MultinomialNB(alpha=0.5))  # Naive Bayes classifier with additive (Lidstone) smoothing, alpha=0.5
])
# 2. Train the pipeline using the raw training data and targets
nbclf.fit(train_data.data, train_data.target)
# 3. Make predictions on the raw testing data
pred = nbclf.predict(test_data.data)

# 4. Compute the F1-score (avoid naming conflicts)
f1_score_result = f1_score(test_data.target, pred, average="macro")  # Use a unique variable name
print(f"Testing F1 score: {f1_score_result}")

Testing F1 score: 0.9450687104012774


We have very good first results. However, the text contains many words that carry little meaning, so let us discard the stop words and see how the performance differs.  

<div id="TODO5"></div>

### TODO5 (5 points)

1. Create a `Pipeline` instance that connects `CountVectorizer` and `MultinomialNB`. Now, set the `stop_words` kwarg to the correct string for `CountVectorizer` to stop any "useless" English words. Store the output into `nbclf`.
    1. Hint: Pass kwargs for `MultinomialNB` as you see fit.


2. Train `nbclf` using the **raw** (untransformed) training data and targets.


3. Make predictions using the **raw** (untransformed) testing data. Store the output into `pred`.


4. Run Sklearn's `classification_report()`using the testing predictions and targets. Store the output into `report`. 
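To see what stop-word filtering does to the vocabulary, here is a small sketch on a made-up sentence. It assumes the built-in English list selected with `stop_words="english"`; the sentence and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentence used only for illustration
toy_sentence = ["the cat and the dog"]

# Default settings keep every token of two or more characters
vocab_all = sorted(CountVectorizer().fit(toy_sentence).vocabulary_)

# With the built-in English stop-word list, common function words are dropped
vocab_no_stop = sorted(
    CountVectorizer(stop_words="english").fit(toy_sentence).vocabulary_
)

print(vocab_all)
print(vocab_no_stop)
```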


In [40]:
# TODO 5.1 - 5.4
from sklearn.metrics import classification_report
nbclf = Pipeline([
    ('vectorizer', CountVectorizer(stop_words="english")),  # Removes common English stop words
    ('classifier', MultinomialNB(alpha=0.5))  # Naive Bayes classifier with additive (Lidstone) smoothing, alpha=0.5
])
# 2. Train the pipeline using raw training data
nbclf.fit(train_data.data, train_data.target)

# 3. Make predictions on raw test data
pred = nbclf.predict(test_data.data)

# 4. Generate classification report
report = classification_report(test_data.target, pred)

print(report)

              precision    recall  f1-score   support

           0       0.94      0.91      0.93       319
           1       0.94      0.95      0.95       389
           2       0.96      0.98      0.97       397
           3       0.95      0.92      0.93       396
           4       0.94      0.96      0.95       398

    accuracy                           0.95      1899
   macro avg       0.95      0.95      0.95      1899
weighted avg       0.95      0.95      0.95      1899




The F1 score has increased slightly. Words that appear only once might not be key to describing a document, so this time let us see what happens if we ignore them.  



<div id="TODO6"></div>

### TODO6 (5 points)

1. Create a `Pipeline` instance that connects `CountVectorizer` and `MultinomialNB`. Now, set the `min_df` threshold kwarg to be 2 for `CountVectorizer`. This ignores words whose document frequency is below the threshold, that is, words that appear in fewer than 2 documents. Store the output into `nbclf`.
    1. Hint: Pass kwargs for `MultinomialNB` as you see fit.


2. Train `nbclf` using the **raw** (untransformed) training data and targets.


3. Make predictions using the **raw** (untransformed) testing data. Store the output into `pred`.


4. Run Sklearn's `classification_report()`using the testing predictions and targets. Store the output into `report`. 
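The following sketch shows the effect of `min_df` on a made-up corpus. The documents are hypothetical; the point is that the threshold applies to the number of documents containing a word, not to its total count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "cherry" occurs twice, but only in one document
toy_docs = ["apple banana", "apple cherry cherry", "apple banana date"]

vocab_df1 = sorted(CountVectorizer(min_df=1).fit(toy_docs).vocabulary_)
vocab_df2 = sorted(CountVectorizer(min_df=2).fit(toy_docs).vocabulary_)

print(vocab_df1)  # every word is kept
print(vocab_df2)  # only words found in at least 2 documents are kept
```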



In [41]:
# TODO 6.1 - 6.4
# 1. Create a pipeline with CountVectorizer (min_df=2) and MultinomialNB
nbclf = Pipeline([
    ('vectorizer', CountVectorizer(min_df=2)),  # Ignores words appearing in fewer than 2 documents
    ('classifier', MultinomialNB(alpha=0.5))  # Additive (Lidstone) smoothing, alpha=0.5
])
# 2. Train the pipeline using raw training data
nbclf.fit(train_data.data, train_data.target)

# 3. Make predictions on raw test data
pred = nbclf.predict(test_data.data)

# 4. Generate classification report
report = classification_report(test_data.target, pred)


print(report)

              precision    recall  f1-score   support

           0       0.92      0.91      0.92       319
           1       0.93      0.96      0.94       389
           2       0.96      0.97      0.97       397
           3       0.96      0.91      0.94       396
           4       0.95      0.96      0.96       398

    accuracy                           0.95      1899
   macro avg       0.94      0.94      0.94      1899
weighted avg       0.95      0.95      0.95      1899



The score is similar to the one obtained by removing stop words. This suggests that building the count vector from the more descriptive words helps classification accuracy. 

## TF-IDF

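We can extract a new feature representation from the term frequency and the inverse document frequency using the equation below:
$$
 tfidf(w, d) = tf(w, d) \times \Big( \log \Big( \frac{N+1}{N_w + 1} \Big) + 1 \Big),
$$
where $tf(w, d)$ is the number of times word $w$ occurs in document $d$, $N$ is the total number of documents, and $N_w$ is the number of documents that contain $w$. The $+1$ belongs to the inverse document frequency factor, so a word that appears in every document still keeps its term frequency.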

The goal of tf-idf is best captured by the Sklearn docs, given as follows.
> Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.
>
> The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

Let us transform the count vector into tf-idf feature representation and see how it affects the performance. 
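As a sanity check, the sketch below computes $tf \times \big(\log\frac{N+1}{N_w+1} + 1\big)$ by hand on a made-up corpus and compares it with `TfidfTransformer`. It assumes `norm=None` so that Sklearn's default L2 row normalization does not hide the raw values; the corpus and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical corpus; the sorted vocabulary is: bird, cat, dog, fish
toy_docs = ["cat dog", "cat cat bird", "cat fish"]
toy_counts_sparse = CountVectorizer().fit_transform(toy_docs)
toy_tf = toy_counts_sparse.toarray()          # term frequencies tf(w, d)

n_docs = toy_tf.shape[0]                      # N
doc_freq = (toy_tf > 0).sum(axis=0)           # N_w for every word
idf = np.log((n_docs + 1) / (doc_freq + 1)) + 1
tfidf_by_hand = toy_tf * idf

# norm=None switches off the L2 normalization applied by default
tfidf_sklearn = TfidfTransformer(norm=None).fit_transform(toy_counts_sparse).toarray()

print(np.round(tfidf_by_hand, 3))
print(np.round(tfidf_sklearn, 3))
```

The word "cat" appears in every document, so its inverse document frequency factor is exactly 1 and its weight equals its raw count, while rarer words are scaled up.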

<div id="TODO7"></div>

### TODO7 (5 points)

1. Create an instance of the Sklearn `TfidfTransformer` class to extract feature representations. Store the output into `tf_transformer`. 


2. Use the `fit_transform()` method to both fit and transform the `X_train_cnt` data. Store the output into `tfidf`.

In [42]:
# TODO 7.1 -7.2
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer()

tfidf = tf_transformer.fit_transform(X_train_cnt)

print(f"Shape of extracted features: {tfidf.shape}")

Shape of extracted features: (2854, 39806)


Now, let us insert this into our stop words pipeline and see how it performs.

<div id="TODO8"></div>

### TODO8 (5 points)

1. Create a `Pipeline` instance that connects `CountVectorizer`, `TfidfTransformer`, and `MultinomialNB`. Store the output into `nbclf`. Pass the keyword arguments for the following class instances.
    1. For `CountVectorizer`, set the `stop_words` kwarg to the correct string in order to stop any useless English words.
    2. For `MultinomialNB`, pass kwargs/hyper-parameters as you see fit.


2. Train `nbclf` using the **raw** (untransformed) training data and targets.


3. Make predictions using the **raw** (untransformed) testing data. Store the output into `pred`.


4. Run Sklearn's `classification_report()`using the testing predictions and targets. Store the output into `report`. 


In [43]:
# TODO 8.1 - 8.4
# 1. Create the Pipeline
nbclf = Pipeline([
    ('vectorizer', CountVectorizer(stop_words="english")),  # Remove common English stop words
    ('tfidf', TfidfTransformer()),  # Convert word counts to TF-IDF scores
    ('classifier', MultinomialNB(alpha=0.5))  # Naive Bayes classifier with additive (Lidstone) smoothing, alpha=0.5
])

# 2. Train the pipeline using raw training data
nbclf.fit(train_data.data, train_data.target )

# 3. Make predictions on raw test data
pred = nbclf.predict(test_data.data)

# 4. Generate classification report
report = classification_report(test_data.target, pred)

print(report)

              precision    recall  f1-score   support

           0       0.97      0.76      0.85       319
           1       0.96      0.95      0.95       389
           2       0.95      0.99      0.97       397
           3       0.97      0.88      0.92       396
           4       0.80      0.97      0.88       398

    accuracy                           0.92      1899
   macro avg       0.93      0.91      0.91      1899
weighted avg       0.93      0.92      0.92      1899



# n-Gram

We seem to be stuck at an F1 score of about 0.95. As we discussed, bag of words is limited because it ignores the order of words and their context in the text. An n-gram model builds each token from a sequence of $n$ consecutive words to address this limitation. Let us practice with n-grams to see if we can further improve the performance. 
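Here is a small sketch of how `ngram_range` changes the tokens that `CountVectorizer` extracts. The sentence is made up for illustration and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentence where word order carries the meaning
toy_text = ["not good at all"]

unigrams = sorted(CountVectorizer(ngram_range=(1, 1)).fit(toy_text).vocabulary_)
uni_and_bi = sorted(CountVectorizer(ngram_range=(1, 2)).fit(toy_text).vocabulary_)
bigrams = sorted(CountVectorizer(ngram_range=(2, 2)).fit(toy_text).vocabulary_)

print(unigrams)    # single words only
print(uni_and_bi)  # single words plus pairs of neighbouring words
print(bigrams)     # pairs of neighbouring words only
```

With bigrams, "not good" becomes its own feature, which a pure bag of words cannot distinguish from "good" appearing alone.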


<div id="TODO9"></div>

### TODO9 (20 points)

For n-grams, we first need to decide which $n$ to use, so we start this practice by searching over the hyper-parameters. Because we chose the alpha for Naive Bayes naively, we search for it as well using `GridSearchCV`.



1. Finish the `parameters` dictionary by adding your own list of custom ranges for $n$ using the `ngram_range` kwarg for `CountVectorizer`. 
    1. Hint: The hyper-parameter dictionary `parameters` needs to follow the proper naming convention. That is, the name of the step in the `ng_clf` pipeline must come first! It is then followed by `__` and the hyper-parameter name for that step. For instance, in the code below, for the classifier named 'clf' we want to search over different "alpha" values. Thus, the dictionary key is `clf__alpha`. You need to apply this same naming scheme to `ngram_range`.
    
    
2. Create a `GridSearchCV` instance with 3 fold cross validation. Store the output into `gs_clf`.
    1. Hint: pass `n_jobs` to increase speed based on your CPU cores.
    1. Hint: You can specify a higher `cv` if you have more time. A value of 5 is a common choice, but it can take much longer to run.


3. Access the best hyper-parameter **combination** that is stored inside the `gs_clf` instance after fitting. Store the output into `best_params`.


4. Access the best hyper-parameter **score** that is stored inside the `gs_clf` instance after fitting. Store the output into `best_score`.


5. Make predictions using the **raw** (untransformed) testing data. Store the output into `pred`.


6. Run Sklearn's `classification_report()`using the testing predictions and targets. Store the output into `report`. 
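The naming convention and the search itself can be sketched on a tiny made-up dataset as follows. The texts, labels, grid values, and all `toy_` names are hypothetical, and `cv=2` is used only because the toy set is so small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = positive, 0 = negative
toy_texts = [
    "great fun movie", "really great acting", "fun and great plot", "great fun cast",
    "boring bad movie", "really bad acting", "bad and boring plot", "boring bad cast",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Keys follow the pattern <step name>__<parameter name>
toy_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1.0, 0.1],
}

# get_params() lists every name the pipeline accepts, which helps catch typos
valid_names = set(toy_pipe.get_params().keys())
print(sorted(name for name in toy_grid if name in valid_names))

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2)
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)  # best combination found by cross-validation
print(toy_search.best_score_)   # mean cross-validated score of that combination
```

After fitting, calling `predict` on the search object uses the best combination refit on all of the training data, which is the default behaviour (`refit=True`).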

In [44]:
# 0. Create the Pipeline
ng_clf = Pipeline([
                ('vect', CountVectorizer(stop_words="english")),
                ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB(alpha=0.5))
        ])

# 1. Define the parameter grid for ngram_range and alpha
# TODO 9.1
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],  # Unigrams only, unigrams + bigrams, bigrams only
    'clf__alpha': (1, 0.1, 0.01),
    
}

In [46]:
# TODO 9.2
from sklearn.model_selection import GridSearchCV
# 2. Create GridSearchCV with 3-fold cross-validation
gs_clf = GridSearchCV(ng_clf, parameters, cv=3, n_jobs=-1)  # Use all CPU cores for faster computation

<font color="red">**WARNING: This cell will take a long time to run! If you have more parameters, it will take longer.**</font>

In [47]:
gs_clf.fit(train_data.data, train_data.target);

In [48]:
# TODO 9.3 - 9.4
# Access the best hyperparameters and best score

best_score = gs_clf.best_score_  # Best cross-validation score
best_params = gs_clf.best_params_  # Best hyperparameter combination

print(f"Best score: {best_score}")
print(f"Best params: {best_params}")

Best score: 0.973019333916532
Best params: {'clf__alpha': 0.01, 'vect__ngram_range': (1, 1)}


In [49]:
# TODO 9.5 - 9.6
pred = gs_clf.predict(test_data.data)
report = classification_report(test_data.target, pred)
print(report)

              precision    recall  f1-score   support

           0       0.96      0.88      0.92       319
           1       0.94      0.95      0.94       389
           2       0.96      0.98      0.97       397
           3       0.95      0.91      0.93       396
           4       0.90      0.97      0.93       398

    accuracy                           0.94      1899
   macro avg       0.94      0.94      0.94      1899
weighted avg       0.94      0.94      0.94      1899



# Word Embeddings

Word embeddings are numerical vectors that capture the meaning and context of words. In the lecture, we discussed the pretrained Skip-Gram word embeddings and RoBERTa in spaCy. Here, we apply these embeddings to the new data and check the classification accuracy.
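The idea can be sketched with hand-made vectors. The 3-dimensional `toy_vectors` below are invented for illustration and are not real spaCy embeddings (the `en_core_web_lg` vectors used later have 300 dimensions). The sketch assumes a document is represented by the average of its token vectors, which is how spaCy computes `Doc.vector` by default for models with static vectors.

```python
import numpy as np

# Hypothetical 3-d "embeddings": related words point in similar directions
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
    "stock": np.array([0.0, 0.1, 0.9]),
    "market": np.array([0.1, 0.0, 0.8]),
}

def doc_vector(tokens):
    """Represent a document as the mean of its token vectors."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pets = doc_vector(["cat", "dog"])
more_pets = doc_vector(["dog", "cat", "cat"])
finance = doc_vector(["stock", "market"])

print(cosine(pets, more_pets))  # close to 1: similar topics
print(cosine(pets, finance))    # much smaller: different topics
```

Unlike bag of words, two documents can be close here even if they share few exact words, as long as their words have similar vectors.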




## Word2Vec

Now, let us use the pretrained language model `en_core_web_lg` in spaCy to test Word2Vec embedding. 

<div id="TODO10"></div>

### TODO10 (18 points)


1. Using spaCy's `load()` function, load the `en_core_web_lg` language model. Store the language model output into `nlp`. 


2. Create a logistic regression model using Sklearn. Feel free to create a Sklearn `Pipeline` instance so that you can standardize the data before passing it to the logistic regression model. Store the output into `logreg`. Be sure to pass AT LEAST the arguments that correspond to the descriptions below for the logistic regression class.
    1. Use a seed of 0 for the `random_state`. **WARNING: If you don't use this seed, you might get different accuracies each time you rerun your algorithm. This can cause you to fail a TODO during grading.**
    

3. Train the logistic regression model using `X_train_lg` and the training targets.


4. Make predictions using the `X_test_lg` data. Store the output into `pred`.


5. Run Sklearn's `classification_report()`using the testing predictions and targets. Store the output into `report`. 

In [51]:
# Use this line if you are getting "Can't find model 'en_core_web_lg'. 
# It doesn't seem to be a shortcut link, a Python package or a 
# valid path to a data directory." Then restart the kernel.
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)

In [52]:
import spacy
# TODO 10.1
# 1. Load the en_core_web_lg model using spacy
nlp = spacy.load("en_core_web_lg")


Now we need to convert the train and test data into a numeric 2D array. The code for this is provided. Please read it and ask questions if you don't understand the list comprehension used. This part of the code takes a long time to run, so try to avoid running it multiple times. 
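The pattern of a list comprehension wrapped in `np.vstack` can be sketched without spaCy as follows. The function `fake_embed` is a hypothetical stand-in for `nlp(s).vector` that returns a short random vector, so the values are meaningless and only the shapes matter.

```python
import numpy as np

def fake_embed(text, dim=4):
    """Hypothetical stand-in for nlp(text).vector: one fixed-length vector per text."""
    rng = np.random.default_rng(len(text))  # seeded so the example is repeatable
    return rng.normal(size=dim)

# Hypothetical documents
toy_texts = ["first document", "a second, longer document", "third"]

# The list comprehension produces one 1-D vector per document ...
rows = [fake_embed(s) for s in toy_texts]

# ... and np.vstack stacks them into a 2-D array of shape (n_documents, dim)
toy_matrix = np.vstack(rows)

print(toy_matrix.shape)
```

Row `i` of the result is the vector of document `i`, which is the layout Sklearn estimators expect for their input features.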

<font color="red">**WARNING: This cell will take a long time to run the first time! On my laptop, it took 4 and 2.5 minutes, respectively, as you can see in the timing outputs. For faster reruns, you can pickle the embedded data, which saves it to disk (this is enabled by default; you can disable it by setting `pickle_save=False`). Once that is done, you can quickly rerun these cells.**</font>

In [53]:
def load_word2vec(data, file, pickle_save=True):
    if not os.path.exists(file):
        embed_data = np.vstack([nlp(s).vector for s in data.data])
        if pickle_save:
            dump_spacy_embed(embed_data, file)
    else: 
        embed_data = load_spacy_embed(file)
        
    return embed_data

In [54]:
%%time
train_pickle = 'x_train_lg.pickle'
X_train_lg = load_word2vec(train_data, train_pickle, pickle_save=True)

CPU times: total: 2min 45s
Wall time: 2min 48s


In [55]:
%%time
test_pickle = 'x_test_lg.pickle'
X_test_lg = load_word2vec(test_data, test_pickle, pickle_save=True)

CPU times: total: 1min 59s
Wall time: 2min 1s


In [56]:
X_train_lg.shape, X_test_lg.shape

((2854, 300), (1899, 300))

In [59]:
# TODO 10.2 -10.5
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
logreg = Pipeline([
    ('scaler', StandardScaler()),  # Standardize the features
    ('classifier', LogisticRegression(random_state=0))  # Logistic Regression with random seed
])
logreg.fit(X_train_lg, train_data.target)

pred = logreg.predict(X_test_lg)
report = classification_report(test_data.target, pred)

print(report)

              precision    recall  f1-score   support

           0       0.77      0.78      0.77       319
           1       0.95      0.95      0.95       389
           2       0.96      0.95      0.96       397
           3       0.93      0.88      0.91       396
           4       0.85      0.89      0.87       398

    accuracy                           0.89      1899
   macro avg       0.89      0.89      0.89      1899
weighted avg       0.90      0.89      0.90      1899



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Comparison with Bag-of-Words or n-Gram

Previously, we used `MultinomialNB` for classification, but for the word embeddings we used `LogisticRegression`. Because we used different classifiers, it is hard to tell which feature representation works better for the newsgroup dataset. Now, let us repeat the previous BOW and n-gram experiments with logistic regression.



<div id="TODO11"></div>

### TODO11 (20 points)

1. Finish `parameters` dictionary by adding your own list of custom ranges for $n$ using the `ngram_range` kwarg for `CountVectorizer` and your own list of different values for the `C` kwarg for `LogisticRegression`.
    1. Hint: Refer to TODO 9 for how to do this properly.

    
2. Create a `GridSearchCV` instance with 3 fold cross validation. Store the output into `gs_clf`.
    1. Hint: pass `n_jobs` to increase speed based on your CPU cores.
    1. Hint: You can specify a higher `cv` if you have more time. A value of 5 is a common choice, but it can take much longer to run.


3. Make predictions using the **raw** (untransformed) testing data. Store the output into `pred`.


4. Run Sklearn's `classification_report()`using the testing predictions and targets. Store the output into `report`.


5. Using `TODO 9` results, how does the performance of logistic regression compare to NB? What do you think about the embeddings? Do you think it is better than BOW or n-Gram encoding? 

In [60]:
ng_clf = Pipeline([
                ('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('scaler', StandardScaler(with_mean=False)), 
                ('clf', LogisticRegression(random_state=0))
                ])

parameters = {
    # TODO 11.1
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],  # Unigrams only, unigrams + bigrams, bigrams only
    'clf__C': [0.01, 0.1, 1, 10, 100]  # Regularization strength for Logistic Regression   
}

In [61]:
# TODO 11.2
# 2. Create GridSearchCV with 3-fold cross-validation
gs_clf = GridSearchCV(ng_clf, parameters, cv=3, n_jobs=-1)  # Use all CPU cores for faster computation


<font color="red">**WARNING: This cell will take a long time to run! If you have more parameters, it will take longer.**</font>

In [62]:
gs_clf.fit(train_data.data, train_data.target);

In [63]:
# TODO 11.3 - 11.4
pred = gs_clf.predict(test_data.data)
report = classification_report(test_data.target, pred)

print(report)

              precision    recall  f1-score   support

           0       0.97      0.84      0.90       319
           1       0.85      0.97      0.91       389
           2       0.95      0.98      0.97       397
           3       0.95      0.83      0.89       396
           4       0.88      0.94      0.91       398

    accuracy                           0.92      1899
   macro avg       0.92      0.91      0.92      1899
weighted avg       0.92      0.92      0.92      1899



`TODO 11.5` Using `TODO 9` results, how does the performance of logistic regression compare to NB? What do you think about the embeddings? Do you think it is better than BOW or n-Gram encoding? 


**DO NOT WRITE YOUR ANSWER IN THIS CELL!**


`Answer:`The Multinomial Naive Bayes model shows very strong performance, with high precision, recall, and F1 scores across the different classes. This is expected because NB works well with high-dimensional data like text, especially when the features (words) are conditionally independent given the class.

The Word Embeddings approach with Logistic Regression shows decent performance but significantly lower accuracy (0.89) compared to Naive Bayes (0.94).

Logistic Regression with BoW and n-Gram performs better than word embeddings (0.92 accuracy vs. 0.89), but slightly worse than Multinomial Naive Bayes (0.94 accuracy). However, it still produces very good precision and recall across the classes. 

<div id="TODO18"></div>

## Feedback (2 points)

Did you enjoy the lab? 

Please take time to answer the following feedback questions to help us further improve these labs! Your feedback is crucial to making these labs more useful!
    


* How do you rate the overall experience in this lab? (5-point Likert scale, i.e., 1 - poor ... 5 - amazing)  
Why do you think so? What was most/least useful?



`ANSWER` As always I enjoyed the lab and I will rate my overall experience as 5- amazing

* What did you find difficult about the lab? Were there any TODOs that were unclear? If so, what specifically did not make sense about them?



`ANSWER`I found the word embeddings a little challenging to understand, I would love to have a little more explanation on the concept

* Which concepts, if any, within the lab do you feel could use more explanation?

`ANSWER` the concept of word embedding to me could use more explaining.