# Word Embedding

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Shortcomings of Bag-of-Words</p>

<center>
<img src='https://i.postimg.cc/1zw5pqP1/wordcloud.jpg' width=500>
</center>
<br>

Last time we discussed that with a bag-of-words we **lose word order information**, although this can be partially remedied by using n-grams to encode context. 

When working with a **binary** bag-of-words there is another significant drawback: **all words are treated as equally important**, although we know this is **not the case** in language. 

We could use a **frequency** bag-of-words but then some words like 'the' and 'it' **occur very frequently** and affect similarity calculations. We could remove stop words but this won't remove all of the frequent and redundant words. For example, in a corpus of food recipes, words like 'mix', 'bowl' and 'teaspoon' will appear in almost all documents and so won't be very informative.

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Relative frequency</p>

An improvement then would be to use the **relative frequency** of each word in the corpus. This is **calculated** as the number of times a word appears in a document divided by the total number of times it appears in the corpus. 

<br>

$$
\large
\text{relative frequency} = \frac{\text{frequency in document}}{\text{frequency in corpus}}
$$

<br>

The idea is that words that appear **very frequently** in some documents and rarely in the rest are likely to be **meaningful to those documents** and will help distinguish between documents. On the other hand, words that appear roughly **uniformly** across all documents are **unlikely to be important**. 
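
As a rough illustration of the relative frequency formula, here is a minimal sketch on a made-up three-document corpus. The corpus, the whitespace tokenization and the helper name `relative_frequency` are assumptions for this example only.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "mix the flour and the sugar in a bowl",
    "mix the eggs in a bowl and whisk",
    "bake the salmon with lemon and dill",
]

# Naive whitespace tokenization (an assumption; a real tokenizer would do more)
tokenized = [doc.split() for doc in docs]

# Word counts per document and across the whole corpus
doc_counts = [Counter(tokens) for tokens in tokenized]
corpus_counts = Counter(token for tokens in tokenized for token in tokens)

def relative_frequency(word, doc_index):
    """Frequency in the document divided by frequency in the corpus.

    Assumes the word occurs at least once in the corpus.
    """
    return doc_counts[doc_index][word] / corpus_counts[word]

# 'salmon' only occurs in the third document, 'the' is spread over all three
print(relative_frequency("salmon", 2))
print(relative_frequency("the", 0))
```

A word that is concentrated in one document gets a value close to 1, while a word spread across the corpus gets a smaller value in each document.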

# TF-IDF

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Term Frequency</p>

<center>
<img src='https://i.postimg.cc/xCL5Dt2p/wren.jpg' width=600>
</center>
<br>

**TF-IDF** stands for **Term Frequency - Inverse Document Frequency** and is made up of **two components**. The first is the **term frequency**. 

Whilst some people define the term frequency to be the relative frequency, it is more common to use the **raw frequency** of the token/term $t$ in document $d$.

<br>

$$
\Large
\text{tf}(t,d) = f_{t,d}
$$

<br>

However, some documents may be much **longer** than others and so will naturally have higher frequencies across the board. For this reason, it is common to apply a **log-transform**, which dampens large counts and reduces this bias. The term frequency then becomes

<br>

$$
\Large
\text{tf}(t,d) = \log(1+f_{t,d})
$$
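
To see the dampening effect, the sketch below compares raw and log-scaled term frequencies for a short document and a ten-times longer copy of it. The function name `term_frequencies`, the toy documents and the use of the natural logarithm are assumptions, since the formula above does not fix a base.

```python
import math
from collections import Counter

def term_frequencies(tokens, log_scale=True):
    """Return tf(t, d) for every token t in one tokenized document d."""
    counts = Counter(tokens)
    if log_scale:
        # Log-scaled variant: log(1 + f_{t,d}), natural log assumed
        return {t: math.log(1 + f) for t, f in counts.items()}
    # Raw-count variant: f_{t,d}
    return dict(counts)

# Hypothetical documents: the long one is the short one repeated ten times
short_doc = "mars rover lands on mars".split()
long_doc = short_doc * 10

raw_short = term_frequencies(short_doc, log_scale=False)["mars"]
raw_long = term_frequencies(long_doc, log_scale=False)["mars"]
log_short = term_frequencies(short_doc)["mars"]
log_long = term_frequencies(long_doc)["mars"]

# The raw ratio grows linearly with length; the log-scaled ratio grows far slower
print(raw_long / raw_short)
print(log_long / log_short)
```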

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Inverse Document Frequency</p>

The second part of TF-IDF is the **inverse document frequency**. This is the part that will **emphasise the more important words** in each document. 

Given a token/term $t$ in a corpus $D$, we define

<br>

$$
\Large
\text{idf}(t,D) = \log \left(\frac{N}{n_t} \right)
$$

<br>

where $N$ is the number of documents in the corpus and $n_t$ is the number of documents in which $t$ appears. Notice that as $n_t$ decreases the idf increases, which corresponds to a token that is more likely to be **important**.
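
The idf formula can be sketched in a few lines. The toy corpus and the function name `inverse_document_frequency` are hypothetical, and the natural logarithm is an assumption.

```python
import math

def inverse_document_frequency(term, tokenized_docs):
    """idf(t, D) = log(N / n_t) for a list of tokenized documents."""
    n_docs = len(tokenized_docs)
    # n_t: number of documents that contain the term at least once
    n_t = sum(1 for tokens in tokenized_docs if term in tokens)
    if n_t == 0:
        # The plain formula is undefined for terms that never occur
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / n_t)

# Hypothetical recipe corpus: 'mix' is everywhere, 'salmon' is rare
corpus = [
    "mix flour and sugar".split(),
    "mix eggs and milk".split(),
    "mix butter and salt".split(),
    "mix salmon and dill".split(),
]

print(inverse_document_frequency("mix", corpus))     # appears in every document
print(inverse_document_frequency("salmon", corpus))  # appears in one document
```

A term that appears in every document gets an idf of 0, so it contributes nothing to the final score, whereas a rare term gets a large idf.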

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">TF-IDF score</p>

To get the final tf-idf score, we simply **multiply** the tf with the idf. That is,

<br>

$$
\Large
w_{t,d} = \text{tf}(t,d) \times \text{idf}(t,D)
$$

<br>

So the more frequently a word appears in a given document and the fewer documents it appears in overall, the **higher** its TF-IDF score. 
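
Putting the two parts together, here is a minimal end-to-end sketch using the log-scaled tf and the plain idf from the formulas above. The corpus and the function name `tf_idf` are made up for illustration, and the natural logarithm is assumed.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {token: weight} dict per document, with
    weight = log(1 + f_{t,d}) * log(N / n_t)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(t for tokens in tokenized_docs for t in set(tokens))
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights.append({
            t: math.log(1 + f) * math.log(n_docs / doc_freq[t])
            for t, f in counts.items()
        })
    return weights

# Hypothetical corpus: 'mix' and 'and' occur in every document
docs = [
    "mix flour and sugar in a bowl".split(),
    "mix eggs in a bowl and whisk".split(),
    "mix salmon with lemon and dill and more salmon".split(),
]

weights = tf_idf(docs)

# Highest-scoring token of the third document
best = max(weights[2], key=weights[2].get)
print(best, weights[2][best])
print(weights[0]["mix"])
```

Tokens shared by every document score 0, while a token repeated within a single document and absent elsewhere scores highest.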

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Variations</p>

There are **many variations** of the TF-IDF score. As we discussed earlier, instead of the raw frequency, we could use the **relative frequency** for the term frequency. This is actually how Wikipedia presents the formula.

The **sklearn implementation** of tf-idf doesn't apply the log-transform in the tf term by default. It also adds constants to the idf term to prevent division by zero and uses the **natural logarithm**. In particular, it uses the following formulas.

<br>

$$
\Large
\text{tf}(t,d) = f_{t,d}
$$

<br>

$$
\Large
\text{idf}(t,D) = \ln \left(\frac{N+1}{n_t+1} \right) + 1
$$

<br>

It also **normalizes** the output vector for each document so that, by default, each document has a vector of scores with **Euclidean (L2) norm equal to 1**. 
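
The following pure-Python sketch applies the two formulas above and then rescales each document vector. It is not sklearn's code: the function name `smoothed_tfidf` and the toy corpus are hypothetical, and the norm is assumed to be the Euclidean (L2) norm, which is sklearn's default.

```python
import math
from collections import Counter

def smoothed_tfidf(tokenized_docs):
    """Raw-count tf times smoothed idf, followed by L2 normalization."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(t for tokens in tokenized_docs for t in set(tokens))
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # tf = f_{t,d}, idf = ln((N + 1) / (n_t + 1)) + 1
        weights = {
            t: f * (math.log((n_docs + 1) / (doc_freq[t] + 1)) + 1)
            for t, f in counts.items()
        }
        # Rescale so that the vector has Euclidean length 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

# Hypothetical corpus: 'mix' occurs in every document
docs = [
    "mix flour and sugar".split(),
    "mix eggs and milk".split(),
    "mix salmon and more salmon".split(),
]

vectors = smoothed_tfidf(docs)
print(vectors[2])
```

Because of the added 1, a token that occurs in every document keeps a small non-zero weight instead of dropping to 0.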

# Application

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">TF-IDF with sklearn</p>

<br>

Import the **libraries**.

In [None]:
import spacy
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

For this demo, we'll use the **20 newsgroups dataset**, which is a collection of around 18,000 newsgroup posts across 20 topics. We'll take the posts relating to the `sci.space` topic as that will be enough for our application.

In [None]:
# Load corpus
corpus = fetch_20newsgroups(categories=['sci.space'], remove=('headers', 'footers', 'quotes'))

# Preview data
print(len(corpus.data))
print(corpus.data[0])

We need to **pre-process** the text first using a **tokenizer**. We'll do this using spaCy, like we have seen in previous notebooks. We apply lemmatization and drop punctuation, whitespace and any other non-alphabetic tokens. 

In [None]:
# Load English language model
nlp = spacy.load('en_core_web_sm')

# Disable named-entity recognition and parsing to save time
unwanted_pipes = ["ner", "parser"]

# Custom tokenizer using spacy
def custom_tokenizer(doc):
    with nlp.disable_pipes(*unwanted_pipes):
        return [t.lemma_ for t in nlp(doc) if not t.is_punct and not t.is_space and t.is_alpha]

Similar to the bag-of-words `CountVectorizer`, **sklearn** also provides a class to perform **TF-IDF**, namely `TfidfVectorizer`. To use our custom tokenizer, we pass it in as a parameter.

In [None]:
# Initialise tf-idf vectorizer
vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer)

# Fit vectorizer to corpus
features = vectorizer.fit_transform(corpus.data)

The output is a **sparse matrix** with dimensions (number of documents, size of vocabulary). The entries of the matrix are the **tf-idf scores** of each **document-token pair**. 

In [None]:
# Size of vocabulary
print(len(vectorizer.get_feature_names_out()))

# Dimensions of output matrix
print(features.shape)

In [None]:
# What the matrix looks like
print(features)

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">Document Search</p>

Now that we have the tf-idf matrix, we can **measure similarity** exactly as before, using **cosine similarity**. Given a document, we can find the other documents which are **most similar to the original one**. 

We will now go a step further and use it to build a basic **document search system**. Given a **query** (i.e. a search term), we **transform** the query, **measure its similarity** with every document in the corpus and finally **return the most similar documents**. 
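
Before running this on the real corpus, it may help to see the similarity and ranking steps on a tiny example. The sketch below is pure Python; the tf-idf weights, the function name `cosine` and the dict-based vector format are all invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {token: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tf-idf vectors for three documents and one query
doc_vectors = [
    {"mars": 0.8, "rover": 0.6},
    {"moon": 0.7, "orbit": 0.7},
    {"mars": 0.3, "orbit": 0.9},
]
query_vector = {"mars": 1.0}

# Score every document against the query, then rank from best to worst
scores = [cosine(query_vector, doc) for doc in doc_vectors]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Documents that share no tokens with the query score 0, and the ranking favours documents where the query term carries most of the weight.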

In [None]:
# Transform the query
query = ["Mars"]
query_tfidf = vectorizer.transform(query)

In [None]:
# Calculate pairwise similarity with all documents in corpus
cosine_similarities = cosine_similarity(features, query_tfidf).flatten()

In [None]:
# Return indices of the k largest values, in descending order
def top_k(arr, k):
    return np.argsort(arr)[::-1][:k]

# Return top 5 document indices
top_related_indices = top_k(cosine_similarities, 5)
print(top_related_indices)
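
Sorting the whole array is fine for a corpus of this size. As an alternative sketch, the standard library's `heapq.nlargest` can pick the top k without a full sort. The function name `top_k_indices` and the example scores are assumptions for illustration.

```python
import heapq

def top_k_indices(scores, k):
    """Indices of the k largest scores, ordered from best to worst."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Hypothetical similarity scores for five documents
example_scores = [0.1, 0.9, 0.0, 0.4, 0.7]
print(top_k_indices(example_scores, 3))
```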

In [None]:
# Corresponding cosine similarities
print(cosine_similarities[top_related_indices])

In [None]:
# Top match
print(corpus.data[top_related_indices[0]])

In [None]:
# Second-best match
print(corpus.data[top_related_indices[1]])