<a href="https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Demystified | Simple Vectorization
https://nlpdemystified.org<br>
https://github.com/futuremojo/nlp-demystified

### spaCy upgrade and package installation.

At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy and download a statistical language model.
<br><br>
**IMPORTANT**<br>
If you're running this in the cloud rather than using a local Jupyter server on your machine, then the notebook will **time out** after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical packages.
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm
!python -m spacy info

# Basic Bag-of-Words (BOW)

Course module for this demo: https://www.nlpdemystified.org/course/basic-bag-of-words

In [None]:
import spacy

from scipy import spatial
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Plain frequency BOW

In [None]:
# A corpus of sentences.
corpus = [
  "Red Bull drops hint on F1 engine.",
  "Honda exits F1, leaving F1 partner Red Bull.",
  "Hamilton eyes record eighth F1 title.",
  "Aston Martin announces sponsor."
]

We want to build a basic bag-of-words (BOW) representation of our corpus. Based on what you now know from the lesson, you can probably do this from scratch using dictionaries and lists (and maybe that's a good exercise). Fortunately, there are robust libraries which make it easy.
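
As a rough sketch of the from-scratch approach, the snippet below builds a frequency BOW using only dictionaries and lists. The names `simple_tokenize` and `build_bow` and the whitespace-plus-punctuation-stripping tokenizer are illustrative assumptions, not part of the lesson or of scikit-learn, and the tokenization is cruder than what a library would do.

```python
import string

def simple_tokenize(text):
    # Hypothetical tokenizer: lower-case, split on whitespace, and strip
    # leading/trailing punctuation from each piece.
    pieces = (piece.strip(string.punctuation) for piece in text.lower().split())
    return [p for p in pieces if p]

def build_bow(docs):
    tokenized = [simple_tokenize(d) for d in docs]
    # "Fit": map every distinct token to a column index (alphabetical order).
    distinct = sorted({tok for doc in tokenized for tok in doc})
    vocab = {tok: i for i, tok in enumerate(distinct)}
    # "Transform": one row per document, one column per token, cells hold counts.
    rows = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok in doc:
            row[vocab[tok]] += 1
        rows.append(row)
    return vocab, rows

docs = ["Red Bull drops hint on F1 engine.",
        "Honda exits F1, leaving F1 partner Red Bull."]
vocab, rows = build_bow(docs)
print(vocab)
print(rows)
```

Each row has one slot per vocabulary entry, so most cells are zero even for this tiny example. That is why library implementations return a sparse matrix.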

We can use the scikit-learn **CountVectorizer** which takes a collection of text documents and creates a matrix of token counts:<br>
https://scikit-learn.org/stable/index.html<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html




In [None]:
vectorizer = CountVectorizer()

The *fit_transform* method does two things:
1. It learns a vocabulary dictionary from the corpus.
2. It returns a matrix where each row represents a document and each column represents a token (i.e. term).<br>

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform


In [None]:
bow = vectorizer.fit_transform(corpus)

We can take a look at the features and vocabulary dictionary. Notice the **CountVectorizer** took care of tokenization for us. It also removed punctuation and lower-cased everything.

In [None]:
# View features (tokens).
print(vectorizer.get_feature_names_out())

# View vocabulary dictionary.
vectorizer.vocabulary_

Specifically, the **CountVectorizer** generates a sparse matrix using an efficient, compressed representation. The sparse matrix object includes a number of useful methods:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

In [None]:
print(type(bow))

If we look at the raw structure, we'll see tuples where the first element represents the document, and the second element represents a token ID. It's then followed by a count of that token. So in the second document (index 1), token 8 ("f1") occurs twice.

In [None]:
print(bow)
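
To make the "compressed representation" less abstract, here is a minimal pure-Python sketch of the compressed sparse row (CSR) layout. The tiny matrix and the helper `csr_entries` are made up for illustration. On a real scipy CSR matrix, the three arrays are available as the `data`, `indices`, and `indptr` attributes.

```python
# Dense matrix being encoded (2 documents x 4 tokens):
#   [[1, 0, 2, 0],
#    [0, 0, 0, 3]]
data    = [1, 2, 3]   # the non-zero counts, row by row
indices = [0, 2, 3]   # the column (token ID) of each count
indptr  = [0, 2, 3]   # row i owns data[indptr[i]:indptr[i + 1]]

def csr_entries(data, indices, indptr):
    # Yield ((row, column), count) for every stored value.
    for row in range(len(indptr) - 1):
        for pos in range(indptr[row], indptr[row + 1]):
            yield (row, indices[pos]), data[pos]

for coords, count in csr_entries(data, indices, indptr):
    print(coords, count)
```

Only the non-zero cells are stored, which is what keeps a BOW over a large vocabulary manageable in memory.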

Before we explore further, we want to make a few modifications.
1. What if we want to use another tokenizer like spaCy's?
2. Instead of frequency, what if we want to have a binary BOW?


## Binary BOW with custom tokenizer

**CountVectorizer** supports using a custom tokenizer. For every document, it will call your tokenizer and expect a list of tokens returned. We'll create a simple callback below which has spaCy tokenize and filter tokens, and then return them.

In [None]:
# As usual, we start by importing spaCy and loading a statistical model.
nlp = spacy.load('en_core_web_sm')

# Create a tokenizer callback using spaCy under the hood. Here, we tokenize
# the passed-in text and return the tokens, filtering out punctuation.
def spacy_tokenizer(doc):
  return [t.text for t in nlp(doc) if not t.is_punct]


This time, we instantiate **CountVectorizer** with our custom tokenizer (*spacy_tokenizer*), turn off case-folding, and also set the *binary* parameter to *True* so we simply get 1s and 0s marking token presence rather than token frequency.

In [None]:
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True)
bow = vectorizer.fit_transform(corpus)

Looking at the resulting feature names and vocabulary dictionary, we can see our *spacy_tokenizer* being used. If you're not convinced, you can remove the punctuation filtering in our tokenizer and rerun the code.

In [None]:
print(vectorizer.get_feature_names_out())
vectorizer.vocabulary_

To get a dense array representation of our sparse matrix, use *toarray*.<br>
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.toarray.html#scipy.sparse.csr_matrix.toarray

We can also index and slice into the sparse matrix.

In [None]:
print('A dense representation like we saw in the slides.')
print(bow.toarray())
print()
print('Indexing and slicing.')
print(bow[0])
print()
print(bow[0:2])

## Cosine Similarity

Writing your own cosine similarity function is straightforward using numpy (left as an exercise). There are also multiple ways to calculate it using scipy and scikit-learn.
<br><br>
One way is using scipy's **spatial** package, which is a collection of spatial algorithms and data structures. It has a function to calculate cosine *distance*. To get the cosine *similarity*, we have to subtract the distance from 1.<br>
https://docs.scipy.org/doc/scipy/reference/spatial.html<br>
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine
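
For reference, the relationship between the two quantities can be written as follows, where the vector names $\mathbf{a}$ and $\mathbf{b}$ are just placeholders for two document vectors:

$$
\text{similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert},
\qquad
\text{distance}(\mathbf{a}, \mathbf{b}) = 1 - \text{similarity}(\mathbf{a}, \mathbf{b})
$$

As a small worked example, take the binary vectors $\mathbf{a} = (1, 1, 0)$ and $\mathbf{b} = (1, 0, 1)$. Their dot product is $1$ and each has length $\sqrt{2}$, so the similarity is $1 / (\sqrt{2} \cdot \sqrt{2}) = 0.5$ and the distance is $0.5$.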

In [None]:
# The cosine method expects array_like inputs, so we need to generate
# arrays from our sparse matrix.
doc1_vs_doc2 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[1].toarray()[0])
doc1_vs_doc3 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[2].toarray()[0])
doc1_vs_doc4 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[3].toarray()[0])

print(corpus)

print(f"Doc 1 vs Doc 2: {doc1_vs_doc2}")
print(f"Doc 1 vs Doc 3: {doc1_vs_doc3}")
print(f"Doc 1 vs Doc 4: {doc1_vs_doc4}")

Another approach is using scikit-learn's *cosine_similarity* which computes the metric between multiple vectors. Here, we pass it our BOW and get a matrix of cosine similarities between each document.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

In [None]:
# cosine_similarity can take either array-likes or sparse matrices.
print(cosine_similarity(bow))

## N-grams

**CountVectorizer** includes an *ngram_range* parameter to generate different n-grams. *ngram_range* is specified as a (minimum n, maximum n) tuple. By default, it is set to (1, 1), which generates only unigrams. Setting it to (1, 2) generates both unigrams and bigrams.
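
To see what a (minimum n, maximum n) range means in terms of tokens, here is a small pure-Python sketch. The helper `ngrams` is hypothetical and only mimics the idea. It joins tokens with a space, which matches how the n-gram feature names appear in the vectorizer output.

```python
def ngrams(tokens, min_n, max_n):
    # Collect every run of n consecutive tokens for each n in [min_n, max_n].
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["Aston", "Martin", "announces", "sponsor"]
print(ngrams(tokens, 1, 2))  # unigrams and bigrams
print(ngrams(tokens, 2, 2))  # bigrams only
```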

In [None]:
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True, ngram_range=(1,2))
bigrams = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print('Number of features: {}'.format(len(vectorizer.get_feature_names_out())))
print(vectorizer.vocabulary_)

In [None]:
# Setting ngram_range to (2, 2) generates only bigrams.
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True, ngram_range=(2,2))
bigrams = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(vectorizer.vocabulary_)

## Basic Bag-of-Words Exercises

In [None]:
#
# EXERCISE: Create a spacy_tokenizer callback which takes a string and returns
# a list of tokens (each token's text) with punctuation filtered out.
#
corpus = [
  "Students use their GPS-enabled cellphones to take birdview photographs of a land in order to find specific danger points such as rubbish heaps.",
  "Teenagers are enthusiastic about taking aerial photograph in order to study their neighbourhood.",
  "Aerial photography is a great way to identify terrestrial features that aren’t visible from the ground level, such as lake contours or river paths.",
  "During the early days of digital SLRs, Canon was pretty much the undisputed leader in CMOS image sensor technology.",
  "Syrian President Bashar al-Assad tells the US it will 'pay the price' if it strikes against Syria."
]

nlp = spacy.load('en_core_web_sm')

def spacy_tokenizer(doc):
  pass


In [None]:
#
# EXERCISE: Initialize a CountVectorizer object and set it to use
# your spacy_tokenizer with lower-casing off and to create a binary BOW.
#

# Instantiate a CountVectorizer object called 'vectorizer'.


# Create a binary BOW from the corpus using your CountVectorizer.



In [None]:
#
# The string below is a whole paragraph. We want to create another
# binary BOW but using the vocabulary of our *current* CountVectorizer. This means
# that words in this paragraph which AREN'T already in the vocabulary won't be
# represented. This is to illustrate how BOW can't handle out-of-vocabulary words
# unless you rebuild your whole vocabulary. Still, we'll see that if there's
# enough overlapping vocabulary, some similarity can still be picked up.
#
# Note that we call 'transform' only instead of 'fit_transform' because the
# fit step (i.e. vocabulary build) is already done and we don't want to re-fit here.
#
s = ["Teenagers take aerial shots of their neighbourhood using digital cameras sitting in old bottles which are launched via kites - a common toy for children living in the favelas. They then use GPS-enabled smartphones to take pictures of specific danger points - such as rubbish heaps, which can become a breeding ground for mosquitoes carrying dengue fever."]
new_bow = vectorizer.transform(s)

#
# EXERCISE: using the pairwise cosine_similarity method from sklearn,
# calculate the similarities between each document from the corpus against
# this new document (new_bow). HINT: You can pass two parameters to
# cosine_similarity in this case. See the docs:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
#
# Which document is the most similar? Which is the least similar? Do the results make sense
# based on what you see?
#



In [None]:
#
# EXERCISE: Implement your own cosine similarity method using numpy.
# It should take two numpy arrays and output the similarity metric.
# HINTS:
# https://numpy.org/doc/stable/reference/generated/numpy.dot.html
# https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
#
# Verify the similarity between the first document in the corpus and the
# paragraph is the same as the one you got from using pairwise cosine_similarity.
#
import numpy as np
def cos_sim(a, b):
  pass


In [None]:
#
# EXERCISE: In spacy_tokenizer, instead of returning the plain text,
# return the lemma_ attribute instead. How do the cosine similarity
# results differ? What if you filter out stop words as well?
#

# TF-IDF

Course module for this demo: https://www.nlpdemystified.org/course/tf-idf

**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**

In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm
!python -m spacy info

In [None]:
import spacy

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Fetching datasets

This time around, rather than using a short toy corpus, let's use a larger dataset. scikit-learn has a **datasets** module with utilities to load datasets of our own as well as fetch popular reference datasets online.<br>
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
<br><br>
We'll use the **20 newsgroups** dataset, which is a collection of 18,000 newsgroup posts across 20 topics.<br>
https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset
<br><br>
List of datasets available:<br>
https://scikit-learn.org/stable/datasets.html#datasets

The **datasets** module includes fetchers for each dataset in scikit-learn. For our purposes, we'll fetch only the posts from the *sci.space* topic, and skip on headers, footers, and quoting of other posts.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups
<br><br>
By default, the fetcher retrieves the *training* subset of the data only. If you don't know what that means, it'll become clear later in the course when we discuss modelling. For now, it doesn't matter for our purposes.

In [None]:
corpus = fetch_20newsgroups(categories=['sci.space'],
                            remove=('headers', 'footers', 'quotes'))

We get back a **Bunch** container object containing the data as well as other information.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html
<br><br>
The actual posts are accessed through the *data* attribute, which is a list of strings, each one representing a post.

In [None]:
print(type(corpus))

In [None]:
# Number of posts in our dataset.
len(corpus.data)

In [None]:
# View first two posts.
corpus.data[:2]

## Creating TF-IDF features

In [None]:
# Like before, if we want to use spaCy's tokenizer, we need
# to create a callback. Remember to upgrade spaCy if you need
# to (refer to beginning of file for commentary and instructions).
nlp = spacy.load('en_core_web_sm')

# We don't need named-entity recognition nor dependency parsing for
# this so these components are disabled. This will speed up the
# pipeline. We do need part-of-speech tagging however.
unwanted_pipes = ["ner", "parser"]

# For this exercise, we'll remove punctuation and spaces (which
# includes newlines), filter for tokens consisting of alphabetic
# characters, and return the lemma (which requires POS tagging).
def spacy_tokenizer(doc):
  with nlp.select_pipes(disable=unwanted_pipes):
    return [t.lemma_ for t in nlp(doc) if \
            not t.is_punct and \
            not t.is_space and \
            t.is_alpha]

Like **CountVectorizer**, which we used to create raw frequency and binary bag-of-words vectors, scikit-learn includes a similar class called **TfidfVectorizer** to create TF-IDF vectors from a corpus.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
<br><br>
The usage pattern is similar in that we call *fit_transform* on the corpus which generates the vocabulary dictionary (fit step), and generates the TF-IDF vectors (transform step).

In [None]:
%%time
# Use the default settings of TfidfVectorizer, apart from our custom tokenizer.
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
features = vectorizer.fit_transform(corpus.data)

In [None]:
# The number of unique tokens.
print(len(vectorizer.get_feature_names_out()))

In [None]:
# The dimensions of our feature matrix. X rows (documents) by Y columns (tokens).
print(features.shape)

In [None]:
# What the encoding of the first document looks like in sparse format.
print(features[0])

As we mentioned in the slides, TF-IDF has several variations. Among other things, scikit-learn's version adds **smoothing** (adds one to both the numerator and denominator in the IDF component) and normalizes each document vector by default. These can be disabled if desired using the *smooth_idf* and *norm* parameters respectively. See here for more information:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
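
To make the smoothing and normalization concrete, here is a minimal pure-Python sketch of the weighting as described in the scikit-learn documentation for the default settings: raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_rows` and the toy counts are assumptions for illustration. The real class has more options than this.

```python
import math

def tfidf_rows(count_rows):
    # count_rows: one list of raw term counts per document (a frequency BOW).
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: one is added to numerator and denominator inside the log,
    # and 1 is added to the result so terms in every document are not zeroed.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2 normalization so each document vector has unit length.
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

counts = [[1, 1, 0],
          [1, 0, 2]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first row, the term that appears in both documents ends up with a lower weight than the term unique to that document, even though both occur once.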


## Querying the data

The similarity measuring techniques we learned previously can be used here in the same way. In effect, we can query our data using this sequence:
1. *Transform* our query using the same vocabulary from our *fit* step on our corpus.
2. Calculate the pairwise cosine similarities between each document in our corpus and our query.
3. Sort them in descending order by score.

In [None]:
# Transform the query into a TF-IDF vector.
query = ["lunar orbit"]
query_tfidf = vectorizer.transform(query)

In [None]:
# Calculate the cosine similarities between the query and each document.
# We're calling flatten() here because cosine_similarity returns a 2-D
# array (one row per document, one column per query) and we want a 1-D array.
cosine_similarities = cosine_similarity(features, query_tfidf).flatten()

Now that we have our list of cosine similarities, we can use this utility function to return the indices of the top k documents with the highest cosine similarities.

In [None]:
import numpy as np

# numpy's argsort() method returns an array of *indices* that
# would sort an array:
# https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
#
# The sort is ascending, so the largest k cosine similarities end up
# at the end of the sorted order. We negate (k + 1) and slice backwards
# from the end to get the last k indices in reverse (descending) order.
# There are faster ways to do this using things like argpartition but
# this is more succinct.
def top_k(arr, k):
  kth_largest = (k + 1) * -1
  return np.argsort(arr)[:kth_largest:-1]
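
For comparison, here is a sketch of the argpartition route mentioned in the comment above. The name `top_k_partition` is hypothetical, and the sketch assumes `1 <= k <= len(arr)`. It avoids sorting the whole array by first isolating the k largest values and then sorting only those.

```python
import numpy as np

def top_k_partition(arr, k):
    arr = np.asarray(arr)
    # argpartition places the indices of the k largest values in the
    # last k positions, in no particular order.
    candidates = np.argpartition(arr, -k)[-k:]
    # Sort just those k candidates by score, highest first.
    return candidates[np.argsort(arr[candidates])[::-1]]

scores = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
print(top_k_partition(scores, 3))
```

For a corpus of this size the difference is negligible, so the simpler argsort version is a reasonable default.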

In [None]:
# So for our query above, these are the top five documents.
top_related_indices = top_k(cosine_similarities, 5)
print(top_related_indices)

In [None]:
# Let's take a look at their respective cosine similarities.
print(cosine_similarities[top_related_indices])

In [None]:
# Top match.
print(corpus.data[top_related_indices[0]])

In [None]:
# Second-best match.
print(corpus.data[top_related_indices[1]])

In [None]:
# Try a different query
query = ["satellite"]
query_tfidf = vectorizer.transform(query)

cosine_similarities = cosine_similarity(features, query_tfidf).flatten()
top_related_indices = top_k(cosine_similarities, 5)

print(top_related_indices)
print(cosine_similarities[top_related_indices])

In [None]:
print(corpus.data[top_related_indices[0]])

So here we have the beginnings of a simple search engine but we're a far cry from competing with commercial off-the-shelf search engines, let alone Google.
<br>
- For each query, we're scanning through our entire corpus, but in practice, you'll want to create an **inverted index**. Search applications such as Elasticsearch do that under the hood.
- You'd also want to evaluate the efficacy of your search using metrics like **precision** and **recall**.
- Document ranking also tends to be more sophisticated, using different ranking functions like Okapi BM25. With major search engines, ranking also involves hundreds of variables such as what the user searched for previously, what they tend to click on, where they are physically, and so on. These variables are part of the "secret sauce" and are closely guarded by companies.
- Beyond word presence, intent and meaning are playing a larger role.
<br>

Information Retrieval is a huge, rich topic and beyond search, it's also key in tasks such as question-answering.
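
To give a feel for the inverted index idea mentioned in the list above, here is a toy pure-Python sketch. The helpers `build_inverted_index` and `candidates`, the whitespace tokenization, and the three example strings are all illustrative assumptions. Production systems store far more than a set of document IDs per term.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each token to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def candidates(index, query):
    # Union of the document IDs for every query token. Only these documents
    # share a term with the query, so only they need to be scored.
    doc_ids = set()
    for token in query.lower().split():
        doc_ids |= index.get(token, set())
    return sorted(doc_ids)

docs = ["lunar orbit insertion burn",
        "satellite in low earth orbit",
        "shuttle launch delayed"]
index = build_inverted_index(docs)
print(candidates(index, "lunar orbit"))
```

With a lookup like this, cosine similarity only has to be computed for the candidate documents rather than for the entire corpus.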

## TF-IDF Exercises

**EXERCISE**<br>
Read up on these concepts we just mentioned if you're curious.<br>

https://en.wikipedia.org/wiki/Inverted_index<br>
https://en.wikipedia.org/wiki/Precision_and_recall<br>
https://en.wikipedia.org/wiki/Okapi_BM25<br>

In [None]:
#
# EXERCISE: fetch multiple topics from the 20 newsgroups
# dataset and query them using the approach we followed.
# A list of topics can be found here:
# https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset
#
# If you're feeling ambitious, incorporate n-grams or
# look at how you can measure precision and recall.
#