<img src="data/images/lecture-notebook-header.png" />

# Basic Applications

By representing documents as vectors and measuring the similarity between two documents with the cosine similarity, we can already address some basic text mining tasks. In this notebook, we look at Keyword Extraction and Document Search, in line with the topics covered in the lecture. Please note that we only sketch basic approaches and ideas here, so don't expect production-ready solutions :).


## Setting up the Notebook

### Required packages

Apart from the important parts of numpy and scikit-learn, we also need some packages for visualization. This includes the [`wordcloud`](https://anaconda.org/conda-forge/wordcloud) package to generate nice-looking word clouds. We also need a couple of auxiliary methods provided in `src/utils.py`. "Outsourcing" that code keeps the notebook clean.


In [None]:
import re
import numpy as np
import pandas as pd

from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from src.utils import get_articles, get_random_article, color_func, get_mask, compute_sparsity

As usual, we also need spaCy to handle the preprocessing for us.

In [None]:
import spacy

nlp = spacy.load("en_core_web_md")

---

## Data Collection

The file `data/news-articles-preprocessed.zip` contains a text file with 6k+ news articles collected from The Straits Times around Oct/Nov 2022. The articles are already somewhat preprocessed (punctuation removal, converted to lowercase, line break removal, lemmatization). Each line in the text file represents an individual article.

To get the articles, the method `get_articles()` reads this zip file, loops through the text file, and returns all articles in a list. The method also accepts a `search_term` to keep only articles that contain that search term. While not used by default in the following, you can check it out to get different results.


In [None]:
articles = get_articles('data/datasets/news-articles/news-articles-preprocessed.zip')
#articles = get_articles('data/datasets/news-articles/news-articles-preprocessed.zip', search_term="police")

print("Number of articles: {}".format(len(articles)))

There is also a method `get_random_article()` which, to the surprise of no-one, returns a random article from the list of 6k+ articles.

In [None]:
random_article = get_random_article('data/datasets/news-articles/news-articles-preprocessed.zip')

print(random_article)

From the output of the previous code cell, you can see which preprocessing steps have been performed and which have not. For example, stopwords have not been removed.

---

## Keyword Extraction

When we want to understand key information from specific documents, we typically turn towards keyword extraction. Keyword extraction is the automated process of extracting the words and phrases that are most relevant to an input text. There are many ways to approach this task as there are often subtle criteria that make a word or phrase relevant.

Using TF-IDF weights is a basic but intuitive approach, as it considers a word/phrase relevant for a text document if:

* The word/phrase appears frequently in the document

* The word/phrase does not appear frequently in many other documents
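
To make these two criteria concrete, here is a minimal sketch that computes TF-IDF weights by hand for a tiny, made-up corpus. It assumes the plain textbook weighting `tf * log(N / df)`, which is simpler than (and not identical to) the variant scikit-learn implements; the toy sentences and the helper `tfidf_weights()` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration; already lowercased and lemmatized)
docs = [
    "the police arrest the scam suspect",
    "the market be up and the bank be happy",
    "the weather be hot and the haze be back",
]

def tfidf_weights(doc, corpus):
    """Return {word: tf * idf} for one document, with idf = log(N / df)."""
    tokenized = [d.split() for d in corpus]
    n_docs = len(tokenized)
    tf = Counter(doc.split())
    weights = {}
    for word, count in tf.items():
        # Document frequency: number of documents that contain the word
        df = sum(1 for tokens in tokenized if word in tokens)
        weights[word] = count * math.log(n_docs / df)
    return weights

weights = tfidf_weights(docs[0], docs)
top = sorted(weights.items(), key=lambda item: item[1], reverse=True)
print(top)
```

Although "the" is the most frequent word in the first document, it occurs in every document, so its weight drops to zero, while words such as "police" or "scam" that are specific to this document receive the highest weights.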

Let's first fetch a random article that we will use throughout this section.


In [None]:
article = get_random_article('data/datasets/news-articles/news-articles-preprocessed.zip')

print(article)

### Baseline: Using Term Frequencies

As a baseline, let's first consider only the term frequency to identify keywords. Since we only look at a single article, we can calculate the term frequencies using very basic packages and methods provided by Python. First, we convert our article to a list of words. This is trivial since our article is already preprocessed, so we can simply split it on whitespace.


In [None]:
words = article.split()

print("Total number of words in the article: {}".format(len(words)))

Now we can use a [`collections.Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) to compute the number of occurrences of each word. The result is a dictionary with the keys being the words and the values being the counts.

In [None]:
word_freqs = Counter(words)

print(word_freqs)
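
Since a `Counter` is a dictionary subclass, it also offers the convenience method `most_common()` to list the most frequent entries. A small sketch on a made-up sentence (the variable names are illustrative only):

```python
from collections import Counter

# Made-up, already preprocessed sentence
toy_words = "the minister say the plan be good for the economy".split()

toy_freqs = Counter(toy_words)

# The 3 most frequent words as (word, count) pairs
top3 = toy_freqs.most_common(3)
print(top3)
```

Even in this tiny example, the most frequent word is a stopword.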

We can now plot the result as a word cloud. The [`wordcloud`](https://anaconda.org/conda-forge/wordcloud) package provides a method `generate_from_frequencies()` that directly takes the dictionary we have just created as input. By default, the generated word clouds don't look very nice, so the code cell below makes two extensions:

* We use a mask to enforce an oval shape (by default: rectangle)

* We use a function `color_func` that maps the relevance score of a word to a color; here we actually use the same color to ensure that all words are legible; this also means that the relevance of each word is only marked by its font size.

In [None]:
wc = WordCloud(color_func=color_func, background_color="white", mask=get_mask(), max_words=500,contour_width=0)

wc.generate_from_frequencies(word_freqs)

plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.tight_layout()
plt.show()

Since we consider frequent words as relevant -- and we didn't perform stopword removal! -- the most relevant keywords are stopwords. Of course, we know that stopwords don't really carry any interesting meaning. For more useful results, we have two options:

* Remove all stopwords and recalculate the term frequencies

* Use TF-IDF weights which will "penalize" stopwords as they appear in all documents.

In practice, the latter option is often preferred, particularly when larger n-grams are considered and stopwords as part of an n-gram might become important.
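
The first option can be sketched in a few lines. The stopword set below is a tiny made-up list for illustration only; in practice, one would use a curated list such as the one that ships with spaCy.

```python
from collections import Counter

# Tiny, hypothetical stopword list (real lists contain hundreds of words)
TOY_STOPWORDS = {"the", "be", "for", "a", "of", "to", "and"}

toy_words = "the minister say the plan be good for the economy and the plan be cheap".split()

# Count only the words that are not stopwords
content_freqs = Counter(w for w in toy_words if w not in TOY_STOPWORDS)
print(content_freqs.most_common(3))
```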


### Calculate TF-IDF Weights

We already saw in the other notebook how easy it is to calculate the TF-IDF weights for a given corpus using scikit-learn. So let's just do this here. The only notable difference is that we set the parameter `max_features`. This was not needed in the other notebook since we only worked with a toy dataset and thus a limited vocabulary. Now we deal with real-world data, where it's common to restrict the vocabulary to some maximum size. Recall from the lecture the concept of sparsity: a corpus typically contains many words that occur very infrequently.


In [None]:
# Create TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf=False, ngram_range = (1, 1), max_features=20000)

# Transform documents to tf-idf vectors
X_tfidf = tfidf_vectorizer.fit_transform(articles)

# Convert to pandas dataframe -- just for a nice visualization
# (toarray() is the explicit way to turn the sparse matrix into a dense array)
df_tfidf = pd.DataFrame(X_tfidf.toarray().T, columns=[ "d{}".format(d+1) for d in range(len(articles)) ])
df_tfidf = df_tfidf.set_index(pd.Index(tfidf_vectorizer.get_feature_names_out()))
df_tfidf
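
Conceptually, `max_features` keeps only the terms with the highest total frequency across the corpus. The following is a rough sketch of that idea on a made-up corpus using only the standard library; it is not scikit-learn's implementation, and the helper `restrict_vocabulary()` is hypothetical.

```python
from collections import Counter

def restrict_vocabulary(docs, max_features):
    """Keep the max_features terms with the highest total count across all docs."""
    counts = Counter(word for doc in docs for word in doc.split())
    return sorted(word for word, _ in counts.most_common(max_features))

# Made-up, already preprocessed documents
toy_docs = [
    "scam victim lose money",
    "police warn of scam",
    "scam money recover by police",
]

toy_vocabulary = restrict_vocabulary(toy_docs, max_features=3)
print(toy_vocabulary)
```

All words outside the restricted vocabulary are simply ignored when the documents are converted to vectors.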

Now using a real-world corpus the sparsity is also much higher compared to the values we saw using our toy dataset. This should also make it more obvious why sparse matrix representations are used to store and handle these document-term matrices.

In [None]:
compute_sparsity(X_tfidf.toarray())
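
The helper `compute_sparsity()` lives in `src/utils.py` and is not shown here. Assuming it reports the fraction of zero entries in the matrix, the idea can be sketched as follows; the function name `sparsity()` and the toy matrix are made up for illustration.

```python
import numpy as np

def sparsity(matrix):
    """Fraction of entries in the matrix that are exactly zero."""
    matrix = np.asarray(matrix)
    return 1.0 - np.count_nonzero(matrix) / matrix.size

# Toy document-term matrix: 2 documents, 4 terms, 2 non-zero weights
toy_matrix = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
])

toy_sparsity = sparsity(toy_matrix)
print("Sparsity: {:.2f}".format(toy_sparsity))
```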

### Convert Article & Plot Word Cloud

We can now use our trained vectorizer to convert our random article to a document vector with TF-IDF weights. Note that the method `transform()` expects a list as input, so we need to wrap our single article into a list.

In [None]:
article_tfidf = tfidf_vectorizer.transform([article])

article_tfidf = np.asarray(article_tfidf.todense())

print(article_tfidf[0])

Of course, the resulting document vector is a sparse vector, with most of its entries being 0.

For the word cloud, we first need to extract all the words with non-zero TF-IDF weights together with their respective TF-IDF weights. The code cell below accomplishes this, and again creates a dictionary with the words as the keys and the TF-IDF weights as the values.


In [None]:
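# Positions of all non-zero entries in the document vector
nonzero_idx = np.nonzero(article_tfidf[0])[0]

# Look up the words and the weights at these positions
words_tfidf = tfidf_vectorizer.get_feature_names_out()[nonzero_idx]
weights = article_tfidf[0][nonzero_idx]

# Map each word to its TF-IDF weight
word_freqs = dict(zip(words_tfidf, weights))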

#print(word_freqs)
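
A dictionary of this shape can also be turned into a plain keyword list by sorting the words by weight. A minimal sketch with made-up weights (the helper `top_keywords()` and the numbers are hypothetical):

```python
# Made-up TF-IDF weights for a single document
toy_weights = {"scam": 0.41, "the": 0.02, "victim": 0.35, "say": 0.05, "police": 0.28}

def top_keywords(weights, n=3):
    """Return the n words with the highest weights, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]

toy_keywords = top_keywords(toy_weights, n=3)
print(toy_keywords)
```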

The weights are, of course, no longer simple integers. However, the [`wordcloud`](https://anaconda.org/conda-forge/wordcloud) package accepts floats as well since the important information stems from the differences in the weights (not in the absolute values of the weights). So let's run the same code to generate a word cloud, only now using the TF-IDF weights.

In [None]:
wc = WordCloud(color_func=color_func, background_color="white", max_words=500, mask=get_mask(), contour_width=0)

wc.generate_from_frequencies(word_freqs)

plt.figure()
# show
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.tight_layout()
plt.show()

This result is arguably much more useful, as the TF-IDF weights basically ignore stopwords and instead emphasize those words that are indicative/informative for a given document (here: a news article). The exact word cloud will, of course, depend on the random article as well as on the parameter settings for both the `TfidfVectorizer` and the `WordCloud`. Most importantly, try different n-gram sizes (or ranges of n-gram sizes) and see how they affect the results.
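
As a reminder of what larger n-grams look like, here is a minimal sketch that builds word n-grams from a list of tokens. The helper `word_ngrams()` and the example sentence are made up; `TfidfVectorizer` handles this internally via its `ngram_range` parameter.

```python
def word_ngrams(tokens, n):
    """Return all n-grams of consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_tokens = "prime minister lee visit new york".split()

bigrams = word_ngrams(toy_tokens, 2)
print(bigrams)
```

With bigrams, phrases such as "prime minister" or "new york" become terms of their own and can receive their own TF-IDF weights.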

---

## Document Search & Ranking

Finding documents of interest in a large text corpus is a very important task. The corpus can be the set of websites on the Internet, where finding relevant documents translates to an online search. Most basically, document search takes an input query containing a set of search terms and then finds the most relevant documents w.r.t. these search terms.

We saw in the lecture that writing a good query is actually not a trivial task. If we require that all search terms must be included in a document, we might miss out on good results because we included a "weird" search term. However, if we include all documents that contain any of the search terms, we are likely to get many documents that are not relevant w.r.t. the whole query.

In the following, we replicate the basic 2-step approach from the lecture:

* Fetch all candidate documents (i.e., documents that contain at least one of the search terms)

* Rank all candidate documents based on their similarity to the query to identify the most relevant documents.

Of course, search engines such as Google or Bing perform far more sophisticated steps, even beyond methods such as PageRank. But this is outside our scope here and is a major topic in modules such as CS3245 Information Retrieval.

Just in case you tried different parameter settings above, let's first recalculate the TF-IDF weights for our corpus of news articles using the default parameter settings used in this notebook.


In [None]:
# Create TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf=False, ngram_range = (1, 1), max_features=20000)

# Transform documents to tf-idf vectors
X_tfidf = tfidf_vectorizer.fit_transform(articles)

### Find Candidate Documents

Since we identify candidate documents by checking if they contain at least one of the search terms -- again, in practice, more sophisticated approaches are used -- we first need to bring our search terms into the same shape as our articles. For example, since we lemmatized the words in the news articles, we also need to lemmatize the words in our query.


In [None]:
def extract_keywords(query):
    # Split query and do some very basic preprocessing (lemmatize, lowercase, keep only non-stopword words)
    # The preprocessing of the query should match the preprocessing of the documents
    return [ t.lemma_.lower() for t in nlp(query) if t.is_alpha and not t.is_stop ]

Let's assume the simple query *"money SCAM victims"*. Based on our preprocessing steps, we need to lowercase *"SCAM"* to *"scam"* and lemmatize *"victims"* to *"victim"*.

In [None]:
keywords = extract_keywords('money SCAM victims')

print(keywords)

Now we can implement a method `search()` that loops over all articles to check if an article contains at least one of the search terms from our query. While this is OK for just 6k+ short articles, looping over all documents would be impractical on large corpora. In practice, we would create secondary indices that map from a word to all indices of documents containing that word. But again, this is covered in Information Retrieval. Here, we allow ourselves to be naive :).

In [None]:
def search(docs, keywords):
    result_indices = []
    # Loop over each document and check if it should be part of the result
    # (NOTE: This is a very naive approach that would not scale in practice!)
    for idx, doc in enumerate(docs):
        # Split the document only once; a set allows for fast membership checks
        doc_words = set(doc.split())
        # Keep it simple: return documents that contain ANY of the keywords
        if any(w in doc_words for w in keywords):
            result_indices.append(idx)
    # We return the indices of the result documents, not the documents themselves
    return np.asarray(result_indices, dtype=int)

Let's execute the method to find all the candidate articles for our set of query terms. Note that we only return the indices of the articles in the `articles` list. This is sufficient, and even more convenient, since we can find the respective rows in our document-term matrix using these indices.

In [None]:
result_indices = search(articles, keywords)

print("Number of candidate documents: {}".format(len(result_indices)))
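
As mentioned above, a real system would avoid scanning all documents and instead use an index that maps each word to the documents containing it (commonly called an inverted index). A minimal sketch of this idea on made-up documents; the helpers `build_inverted_index()` and `search_with_index()` are hypothetical and not part of this notebook's utilities.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of indices of the documents containing it."""
    index = defaultdict(set)
    for idx, doc in enumerate(docs):
        for word in set(doc.split()):
            index[word].add(idx)
    return index

def search_with_index(index, keywords):
    """Return the sorted indices of all documents containing ANY keyword."""
    hits = set()
    for word in keywords:
        # .get() avoids adding unknown words to the index
        hits |= index.get(word, set())
    return sorted(hits)

# Made-up, already preprocessed documents
toy_docs = [
    "police warn of new scam",
    "bank return money to scam victim",
    "weather forecast for the weekend",
]

toy_index = build_inverted_index(toy_docs)
toy_hits = search_with_index(toy_index, ["money", "scam", "victim"])
print(toy_hits)
```

The index is built once, after which each query only touches the entries of its own search terms instead of every document.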

### Rank Candidate Documents

To rank the candidate documents, we need to calculate the cosine similarity between the query and all candidates. While we get the document vectors directly from the document-term matrix, we still need to "vectorize" our query. We can do this directly by using the `transform()` method of the vectorizer.


In [None]:
query_tfidf = tfidf_vectorizer.transform([' '.join(keywords)])

print(query_tfidf)
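
For reference, the cosine similarity itself is just the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch with made-up 3-dimensional vectors (the helper `cosine()` is hypothetical; the notebook itself relies on scikit-learn's implementation):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Made-up vectors over a vocabulary of 3 terms
toy_query = np.array([1.0, 1.0, 0.0])
toy_doc_a = np.array([2.0, 2.0, 0.0])  # same direction as the query
toy_doc_b = np.array([0.0, 0.0, 3.0])  # shares no terms with the query

sim_a = cosine(toy_query, toy_doc_a)
sim_b = cosine(toy_query, toy_doc_b)
print(sim_a, sim_b)
```

Note that the length of a vector does not matter: `toy_doc_a` is twice as long as the query but points in the same direction, so the similarity is (up to rounding) 1.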

The method `rank_results()` now performs the final required steps: (a) compute the cosine similarities between the query and all candidates, (b) sort the candidates w.r.t. the resulting similarity values, and (c) return the top-k candidates as specified by the input parameter `topk`. For this method, we make use of methods such as [`sklearn.metrics.pairwise.cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) and [`np.argpartition`](https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html) to make our lives easy. Not only is the code much simpler, but the performance is also likely to be much better.

In [None]:
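def rank_results(result_indices, query_tfidf, X_tfidf, topk=10):
    # We cannot return more results than there are candidates
    topk = min(topk, len(result_indices))
    if topk <= 0:
        return result_indices[:0]
    # Get all tf-idf vectors of the query result candidates (row indexing keeps them sparse)
    docs_tfidf = X_tfidf[result_indices]
    # Compute cosine similarities between all candidates and the query
    cosine_similarities = cosine_similarity(docs_tfidf, query_tfidf).ravel()
    # Consider only the top-k candidates (np.argpartition does not order them)
    top_result_indices = np.argpartition(cosine_similarities, -topk)[-topk:]
    # Sort the top-k candidates by decreasing similarity
    top_result_indices = top_result_indices[np.argsort(-cosine_similarities[top_result_indices])]
    # We have to return the indices of the documents
    return result_indices[top_result_indices]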
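
One detail worth knowing: `np.argpartition` only guarantees that the `k` largest values end up in the last `k` positions; it does not order them. If a ranked list is needed, the selected entries can be sorted afterwards. A small sketch with made-up similarity scores:

```python
import numpy as np

# Made-up similarity scores for 5 candidate documents
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
k = 3

# Indices of the k largest scores, in no particular order
top_unsorted = np.argpartition(scores, -k)[-k:]

# Order these k indices by decreasing score
top_sorted = top_unsorted[np.argsort(-scores[top_unsorted])]
print(top_sorted)
```

Partitioning first and sorting only the `k` selected entries avoids sorting the scores of all candidates.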

Now we can call the method `rank_results()` to give us the `topk` most relevant articles.

In [None]:
for index in rank_results(result_indices, query_tfidf, X_tfidf, topk=3):
    print(articles[index])
    print()

Of course, in a practical system, we would not return the preprocessed string of the article but rather IDs or links to the original articles.

---

## Summary

The purpose of this notebook was to provide some very intuitive ideas on how to use document vectors (particularly with TF-IDF weights) to implement important text mining methods, of course in a basic/simplified manner. Later we will cover other common methods that rely on vectorized text documents, including text clustering and text classification.
