## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. Cells in which "# YOUR CODE HERE" is found are the cells where your graded code should be written.
2. In order to test out or debug your code you may also create notebook cells or edit existing notebook cells other than "# YOUR CODE HERE". We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
3. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE".
4. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
5. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
6. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
7. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
8. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
9. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, there may be some code that you ran and deleted that is still being used and that was why your asserts were passing.

# Natural Language Processing

In [None]:
# Uncomment the lines below to install spaCy and its medium English model.
# (pprint is part of the Python standard library and needs no installation.)
# ! pip install spacy
# ! python -m spacy download en_core_web_md

In [None]:
import sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score
import numpy as np
import en_core_web_md
from pprint import pprint

import matplotlib
import matplotlib.pyplot as plt
plt.style.use("ggplot")

## Load and examine the data

The "20 newsgroups" dataset contains message posts ("documents") from newsgroup message boards, for 20 different topics.

The ultimate goal of the models you construct will be to predict which topic a message belongs to. In order to accomplish this, you'll need to convert the text messages into numerical features, using the various methods we discussed in class. After converting the messages to numeric features, you'll train and test Naive Bayes and SVM models to perform topic classification.

In [None]:
data = fetch_20newsgroups(subset="all")

In [None]:
print(data.DESCR)

In [None]:
# Extract the messages and topic labels, and view the topic labels

text = data["data"]
target = data["target"]
print('The following are the 20 topics that a message ("document") can belong to:')
pprint(data["target_names"])

In [None]:
# Let's look at an example data sample
i_sample = 0
print(f'A sample with label number "{target[i_sample]}", {data["target_names"][target[i_sample]]}')
print('')
print(text[i_sample])

In [None]:
# This "20 newsgroups" dataset has a pre-set train/test split, but in this
# assignment we've loaded all the data, and will use a random shuffling and
# split of the data, as we typically do.

X_train, X_test, y_train, y_test = train_test_split(text, target, random_state=0)

print(f"The training dataset contains {len(X_train)} messages.")
print(f"The test dataset contains {len(X_test)} messages.")

### Create feature representations

Below you will create numerical representations (vectors) for each message using three methods we discussed in class.
1. Bag of Words (BoW)
2. Term frequency - Inverse Document Frequency (TF-IDF)
3. Hashing

Note that TF-IDF representations build upon the BoW representations by scaling the term counts in the Bag of Words document term matrix.
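
The scaling described above can be sketched by hand on a made-up count matrix. The block below is only an illustration: it assumes scikit-learn's default `TfidfTransformer` settings (smoothed IDF followed by L2 normalization of each row), and the names `toy_counts`, `manual_tfidf`, and `sklearn_tfidf` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up BoW counts: 6 documents (rows) x 3 tokens (columns).
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 0, 0],
                       [4, 0, 0],
                       [3, 2, 0],
                       [3, 0, 2]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)          # documents containing each token

# Smoothed IDF, as used by scikit-learn's defaults (an assumption of this sketch).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = toy_counts * idf                      # scale the raw counts
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Tokens that appear in nearly every document (the first column here) receive the smallest IDF weight, so their counts contribute less to the final vector.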

Scikit-learn implements the BoW feature representation using [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), and it also has implementations for [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) and [hashed vector](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer) representations. Determine the feature representations of our dataset using each of those approaches.
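
As a warm-up before working with the full dataset, here is a minimal sketch of a BoW representation on a made-up corpus. The documents and the names `toy_counter`, `toy_train`, and `toy_test` are hypothetical, and the settings (unigrams and bigrams, no stop words) are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the cat chased a zebra"]

# Unigrams and bigrams; the vocabulary is learned from the training docs only.
toy_counter = CountVectorizer(ngram_range=(1, 2))
toy_train = toy_counter.fit_transform(train_docs)
toy_test = toy_counter.transform(test_docs)

print(sorted(toy_counter.vocabulary_))   # the learned tokens (features)
print(toy_train.shape, toy_test.shape)   # same number of columns in both
print(toy_test.toarray())                # "zebra" has no column, so it is not counted
```

The test matrix has exactly the columns learned from the training documents, which is why a token seen only at test time contributes nothing.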

#### Bag of Words

In [None]:
%%time
# Use "english" stopwords and produce a BoW representation for the data using n-grams with
# n up to 3 (that is, use unigrams, bigrams, and trigrams).
#
# Save the vectorizer (transformer) as "counter".
# Save the transformed data as "X_train_bow", and "X_test_bow".
#
# "Fit" the BoW transformer using the training data only. During the fitting process
# the transformer identifies/learns its unique set of tokens/features. When never-before-seen
# tokens are found in the test set (or any other non-training data), they are
# ignored. That is, they are not counted and included in the BoW representation matrix.
#
# Note that the CountVectorizer object is performing some of the text preprocessing
# discussed in class, including tokenization and removing stop words (but not lemmatization).
# Counting the occurrences of each unique token in each document (message) is done after
# this preprocessing.

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Let's see how many features (tokens) are in the BoW representations.
# It will be a very large number (> 3M) owing to the use of n-grams with n up
# to 3. It is also large due to the many typos, misspellings, and other
# character sequences that are not actual English words, each of which is
# treated as a new/unique token.
#
# You may want to alter the cell above to use only unigrams, or
# only unigrams and bigrams, and see how many features are created in
# those situations. But use n-grams with n up to three before moving
# forward in the notebook.

print(f'The training set has: {X_train_bow.shape[0]:5d} samples and {X_train_bow.shape[1]} features.')
print(f'The test set has:     {X_test_bow.shape[0]:5d} samples and {X_test_bow.shape[1]} features.')

In [None]:
assert counter
assert counter.stop_words=="english"
assert counter.ngram_range==(1, 3)
assert len(counter.vocabulary_)==3034327  # get_feature_names() was removed in scikit-learn 1.2
assert X_train_bow.shape==(14134, 3034327)
assert X_test_bow.shape==(4712, 3034327)

#### Term frequency - Inverse document frequency

Note that sklearn implements a [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [`TfidfTransformer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html). The main difference between the two is in the inputs to `fit_transform` and `transform`. The [Vectorizer's fit/transform](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit) take an input of text whereas the [Transformer's](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer.fit) take an input of a BoW matrix. Given that we already determined the BoW matrix, it would be more time efficient to use `TfidfTransformer`.
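
The relationship between the two classes can be checked on a made-up corpus. This is a sketch under default settings only; `toy_docs`, `two_step`, and `one_step` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up corpus for illustration only.
toy_docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "a bird sang on the mat"]

# Two steps: text -> BoW counts -> TF-IDF.
toy_counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(toy_counts)

# One step: text -> TF-IDF.
one_step = TfidfVectorizer().fit_transform(toy_docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```

With matching settings the two routes give the same matrix, so reusing an existing BoW matrix avoids tokenizing and counting the text a second time.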

In [None]:
%%time
# Use the BoW representation you just created above to produce a TF-IDF representation of the data
#
# Save the transformer to "tfidfer".
# Save the transformed data as "X_train_tfidf", and "X_test_tfidf".
#
# As with BoW, use only the training data representations for fitting, during
# which the TF-IDF transformer learns an inverse document frequency (IDF)
# weight for each token in the fixed BoW vocabulary.

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert tfidfer
assert X_train_tfidf.shape==(14134, 3034327)
assert X_test_tfidf.shape==(4712, 3034327)

#### Hashing vectorizer

So far, we've "vectorized" the raw text messages into numerical feature vectors (assembled into a matrix) using BoW and TF-IDF. Both of those techniques determine a vocabulary, a fixed number of tokens, from the training data. When new tokens are found in the test set data they are ignored.

The hashing vectorizer remedies this situation (but not without a downside) by using a hash algorithm to convert a text token into an index (location) in the representation vector. A hash algorithm converts any sequence of bits into a bit sequence of fixed length. That fixed-length sequence can thus be read as an integer. There are many facets to this, but as this is not a cryptography course (hashing is often used as a method to confirm that a digital document/stream has not been altered), we won't discuss those facets here.

The main point is, a hashing vectorizer can deal with test set tokens that it has never seen before, and "find" a place for them in the output feature representation.
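
A small sketch of this difference, on made-up documents. The tiny `n_features=2**4` and `norm=None` are chosen only so the output is easy to read, and the names `toy_counter` and `toy_hasher` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Made-up documents: the test token never appears in the training text.
train_docs = ["the cat sat on the mat"]
test_docs = ["zebra zebra zebra"]

# BoW learns its vocabulary from the training text.
toy_counter = CountVectorizer().fit(train_docs)

# The hashing vectorizer has no vocabulary to learn; it maps any token to
# one of n_features columns. norm=None keeps raw counts for readability.
toy_hasher = HashingVectorizer(n_features=2**4, alternate_sign=False, norm=None)

bow_test = toy_counter.transform(test_docs)
hash_test = toy_hasher.transform(test_docs)

print(bow_test.sum())    # the unseen token is ignored
print(hash_test.sum())   # the unseen token is counted in some column
print(hash_test.shape)
```

The downside is that different tokens can land in the same column (a collision), and there is no way to map a column back to the token that produced it.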

In [None]:
%%time 
# Use "english" stopwords and produce a hashed vector representation for the data using up to trigrams.
#
# Save the vectorizer as "hasher".
# Save the transformed data as "X_train_hash" and "X_test_hash".
#
# Make sure you set alternate_sign to False so we can use this representation with Multinomial Naive
# Bayes later in the notebook.

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Let's see how many features (tokens) are in the hashed representations

print(f'The training set has: {X_train_hash.shape[0]:5d} samples and {X_train_hash.shape[1]} features.')
print(f'The test set has:     {X_test_hash.shape[0]:5d} samples and {X_test_hash.shape[1]} features.')

In [None]:
assert hasher
assert hasher.stop_words == "english"
assert hasher.ngram_range == (1,3)
assert X_train_hash.shape == (14134, 1048576)
assert X_test_hash.shape == (4712, 1048576)

Having passed the asserts above, we see that the hashing vectorizer produced feature vectors with 1048576 features. Recall that the hash algorithm creates outputs with a fixed number of bits, and `2**n_bits` must be the number of possible features in the representation.

Just out of curiosity, let's see how many output bits were produced by the hash algorithm.

In [None]:
# Just out of curiosity, let's see how many output bits were produced by the hash algorithm.

n_features = X_test_hash.shape[1]
n_bits = int(np.log2(n_features))  # cast to int so the message shows a whole number of bits
print(f"The hash algorithm generated {n_bits}-bit representations, and thus 2^{n_bits}=={n_features} features.")

Compare the time it took to run the count vectorizer with the time it took to run the hashing vectorizer. Both iterate through all the words, but only the count vectorizer has to build and store a vocabulary.

Note that TF-IDF builds upon the BoW representations, so the BoW time should be added to the TF-IDF time.

## A Naive Bayes classifier model, with engineered features

Now you'll build classifier models that use the vector feature representations you just created.

Recall [Naive Bayes Classification](http://scikit-learn.org/stable/modules/naive_bayes.html) which we discussed early on in the supervised learning lectures. We will use Naive Bayes classifiers to predict the topic of the articles and compare our feature representations.

Use a Multinomial Naive Bayes classifier to predict the topics.
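
Multinomial Naive Bayes models count-like features, so it expects non-negative inputs; this is the reason `alternate_sign` is set to `False` for the hashed features. The sketch below uses a made-up count matrix, and the names `toy_mnb`, `toy_pred`, and `rejected` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up counts: 4 documents x 3 tokens, two classes.
X_counts = np.array([[2, 0, 1],
                     [3, 0, 0],
                     [0, 2, 1],
                     [0, 3, 0]])
y_toy = np.array([0, 0, 1, 1])

toy_mnb = MultinomialNB().fit(X_counts, y_toy)
toy_pred = toy_mnb.predict(np.array([[1, 0, 0], [0, 1, 0]]))
print(toy_pred)

# A signed feature (such as signed hashing could produce) is rejected.
X_signed = X_counts.astype(float)
X_signed[0, 0] = -2.0
try:
    MultinomialNB().fit(X_signed, y_toy)
    rejected = False
except ValueError as err:
    rejected = True
    print(f"Negative features are rejected: {err}")
```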

In [None]:
for feat_name, train_feat, test_feat in zip(["Bag of Words", "TF-IDF", "Hashing"],
                                            [X_train_bow, X_train_tfidf, X_train_hash],
                                            [X_test_bow, X_test_tfidf, X_test_hash]):
    
    # Create a Multinomial Naive Bayes model and save it to `mnb`
    # Fit the 'mnb' model to the training features and labels, for
    # the BoW, TF-IDF, or hashing features in this loop.
    # YOUR CODE HERE
    raise NotImplementedError()
    
    y_pred = mnb.predict(test_feat)
    print(f"Results for {feat_name}")
    print("-"*80)
    print(classification_report(y_test, y_pred))
    print("-"*80)

In [None]:
assert isinstance(mnb, MultinomialNB)

## Latent Dirichlet Allocation (optional)

LDA is a topic modeling approach that can be used to determine high level topics covered in a corpus. Identified topics can help us determine the concepts covered in the corpus and better understand what is being discussed. The topic relevance can be used as a useful feature representation for each document. 

Scikit-learn has an implementation available for [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

Fitting this data can take a **very long time** so this code is just provided for you to uncomment it and examine the results. You can continue with the rest of the notebook while this runs.

In [None]:
%%time
# Feel free to change the number of topics to find by updating n_components
# lda = LatentDirichletAllocation(n_components=30)
# lda.fit(X_train_bow)

In [None]:
def plot_top_words(model, feature_names, n_top_words, title):
    # Rows needed to show every topic, five per row (ceiling division)
    rows = -(-model.n_components // 5)
    fig, axes = plt.subplots(rows, 5, figsize=(45, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f'Topic {topic_idx +1}',
                     fontdict={'fontsize': 30})
        ax.invert_yaxis()
        ax.tick_params(axis='both', which='major', labelsize=20)
        for i in 'top right left'.split():
            ax.spines[i].set_visible(False)
    fig.suptitle(title, fontsize=40)  # set the figure title once, after all topics are drawn

    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

In [None]:
# plot_top_words(lda, counter.get_feature_names_out(), 10, "Topic Content")  # use get_feature_names() on scikit-learn < 1.0

In [None]:
print("The following names are the target classes:")
print(data.target_names)
print("What do you think of the similarity between topics you found and the classes? Are these useful topics?")

In [None]:
%%time
# X_train_lda = lda.transform(X_train_bow)
# X_test_lda = lda.transform(X_test_bow)
# svm = LinearSVC().fit(X_train_lda, y_train)
# y_pred = svm.predict(X_test_lda)
# print(classification_report(y_test, y_pred))

**PLEASE**  

**If you uncommented and ran the above Latent Dirichlet Allocation (LDA) code, recomment it afterwards, as this will help speed up grading time.**  

**Thank you!!**

## Learned Embeddings

We'll now move on to the use of __learned__ representations, rather than ones that were, to some degree, engineered (e.g., the decision to base the features on token counts). We will use [spacy](https://spacy.io/) for more sophisticated NLP. Make sure you downloaded the English model in the commented code at the top of the notebook (`en_core_web_md`) before proceeding. It may take some time to download.

Spacy allows us to parse text and automatically does the following:
- tokenization
- lemmatization
- sentence splitting
- entity recognition
- token vector representation


__We'll start by creating an example string, and observing the results of `spacy`'s preprocessing. Then you can create your own example string and do the same.__

In [None]:
%%time

# The spacy models are of small (sm), medium (md) and large (lg)
# sizes. We'll use the medium-sized model.
nlp = en_core_web_md.load()

In [None]:
# Use a new name so the dataset's "text" list loaded earlier is not overwritten.
sample_text = "This is the first sentence in this test string. The quick brown fox jumps over the lazy dog."

parsed_text = nlp(sample_text)

for sent in parsed_text.sents:
    print(f"Analyzing sentence: {sent}")
    print(f"Lemmatization: {sent.lemma_}")
    for token in sent:
        print(f"Analyzing token: {token}")
        if token.is_sent_start:
            print("This token is the first one in the sentence")
        if token.is_stop:
            print("Stop word")
        else:
            print("Not stop word")
        print(f"Entity type: {token.ent_type_}")
        print(f"Part of speech: {token.pos_}")
        print(f"Lemma: {token.lemma_}")
        print("-"*10)
    print("-"*50)

In [None]:
# Come up with a few sentences to test out (at least three, each ending in a
# period), put them into a single string (one pair of quotes), and name that
# string "my_text". Or, go to a website and copy a paragraph from there.

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(my_text) > 10
assert my_text.count(".") > 2  # More than two periods, i.e., at least three sentences

In [None]:
parsed = nlp(my_text)
for sent in parsed.sents:
    print(f"Analyzing sentence: {sent}")
    print(f"Lemmatization: {sent.lemma_}")
    for token in sent:
        print(f"Analyzing token: {token}")
        if token.is_sent_start:
            print("This token is the first one in the sentence")
        if token.is_stop:
            print("Stop word")
        else:
            print("Not stop word")
        print(f"Entity type: {token.ent_type_}")
        print(f"Part of speech: {token.pos_}")
        print(f"Lemma: {token.lemma_}")
        print("-"*10)
    print("-"*50)

If we use the medium or large `spacy` models, we'll get a __GloVe representation__ for each word in the model's vector vocabulary. GloVe representations are "pre-trained" (learned) from a large language corpus. In this case, the GloVe vectors should have 300 features.

In [None]:
token.vector.shape

## An SVM classifier model, with learned embedding features

Now you'll use `spacy` to get GloVe feature representations of a subset of the messages, and then train and test an SVM that makes topic predictions based on those features.
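
To build intuition for what a per-message feature vector looks like, here is a sketch in plain NumPy. By default `spacy` reports a document's vector as the average of its token vectors; the code below imitates that with made-up 4-dimensional vectors. The dictionary `toy_vectors` and the function `toy_doc_vector` are hypothetical and are not part of `spacy`.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors (the real ones here have 300 features).
toy_vectors = {
    "quick": np.array([1.0, 0.0, 0.0, 2.0]),
    "brown": np.array([0.0, 1.0, 0.0, 2.0]),
    "fox":   np.array([0.0, 0.0, 1.0, 2.0]),
}

def toy_doc_vector(tokens, vectors, dim=4):
    """Average the token vectors; unknown tokens are assumed to get a zero vector."""
    rows = [vectors.get(tok, np.zeros(dim)) for tok in tokens]
    return np.mean(rows, axis=0)

doc_vec = toy_doc_vector(["quick", "brown", "fox"], toy_vectors)
doc_vec_oov = toy_doc_vector(["quick", "brown", "fox", "zebra"], toy_vectors)

print(doc_vec.shape)   # one fixed-length vector per document
print(doc_vec)
print(doc_vec_oov)     # an unknown token pulls the average toward zero
```

Whatever the length of the message, the result is a single fixed-length vector, which is what lets the messages be stacked into a feature matrix for the SVM.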

Given that the parsing of text takes some time, we will only use __the first 1000 messages__ in our data. __You will get notably lower performance versus the NB models which used the engineered features, but that is almost solely due to the use of a much smaller data set here.__ You may edit the cell below and use more data, but __please__ reset it to 1000 afterwards, or the autograder may not be able to score your notebook.

In [None]:
new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(X_train[:1000], y_train[:1000], random_state=0)

In [None]:
%%time
# Using the "nlp" from above, parse every message in new_X_train and new_X_test, and
# save the document vectors to np.arrays called X_train_glove and X_test_glove

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert X_train_glove.shape == (len(new_X_train), 300)
assert X_test_glove.shape == (len(new_X_test), 300)

In [None]:
svm = LinearSVC().fit(X_train_glove, new_y_train)
y_pred = svm.predict(X_test_glove)

# Due to the smaller data set size, you may get "Precision and F-score are ill-defined"
# warnings from the classification_report() below. That is anticipated and of no concern.
print(classification_report(new_y_test, y_pred))

## Feedback

In [None]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    raise NotImplementedError()