---
title: "NLP: Data Mining Intro"
author: "Chris Kelly"
date: '02-21-24'
categories: [NLP, TF-IDF, cosine similarity, LSA, topic extraction]
format:
  html:
    code-fold: true
    toc: true
execute: 
  enabled: true
draft: true
---

# Quick introduction:

Algorithms like dealing with numbers - they like structure, for input data to be *tidy*. So how can an algorithm start to process unstructured text? And even then - start to extract anything meaningful from it?

<img src="https://media.giphy.com/media/eMBKXi56D0EXC/giphy.gif">

### Some definitions/background

When text data is provided, it goes through the process of "tokenisation". This involves splitting the text into smaller pieces that the algorithm can encode into 0s and 1s (one-hot encoding). In our example, each unique word in a menu will be a "token".

A "document" is a collection of tokens associated with a particular sample. In our example, each document will be a restaurant menu. 

An "embedding" is an attempt to create a numerical representation of that document as a vector. This can be useful for similarity search, as we will see later (e.g. this user liked this restaurant, so let's recommend them a restaurant with a similar vector).

Finally, the "corpus" is the collection of all documents that the model can learn from. In our example, it is the entire collection of menus.
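
To make these definitions concrete, here is a minimal sketch in plain Python. The two menus, the `tokenise` helper and the `one_hot` helper are made up for illustration; real tokenisers handle punctuation, accents and casing more carefully.

```python
import re

# Two toy "documents" (menus); together they form the corpus.
corpus = [
    "Chicken Tikka Masala, Pilau Rice",
    "Chicken Chow Mein, Egg Fried Rice",
]

def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Tokenise every document, then build a sorted vocabulary of unique tokens.
docs = [tokenise(menu) for menu in corpus]
vocab = sorted({token for doc in docs for token in doc})

def one_hot(token):
    """Return a vector with a single 1 at the token's position in the vocabulary."""
    return [1 if token == v else 0 for v in vocab]

print(docs[0])          # ['chicken', 'tikka', 'masala', 'pilau', 'rice']
print(vocab)            # every unique token in the corpus
print(one_hot("rice"))  # one 1, everything else 0
```

Each token maps to one position in the vocabulary, which is what lets us describe a whole document as a row of numbers later on.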

# TF-IDF: text-frequency inverse-document-frequency

### What is it used for?

Text-frequency Inverse-Document-Frequency (TF-IDF), more commonly expanded as *term* frequency-inverse document frequency, is a technique to find the most important words in a document or, applied at the character level, the most important letters in a word.

It is good at determining 'global' statistics. By this, we mean it contrasts the frequency of tokens in a document with their prevalence across the entire corpus. It does not capture more detailed semantic relationships, particularly since it ignores the ordering of words in a document.
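
As a quick illustration of that last point, the sketch below (with a hypothetical `bag_of_words` helper) shows that two phrases with the same words in a different order produce identical counts, so any method built purely on counts cannot tell them apart.

```python
from collections import Counter

def bag_of_words(text):
    """Count word tokens, ignoring the order in which they appear."""
    return Counter(text.lower().split())

a = bag_of_words("egg fried rice")
b = bag_of_words("rice fried egg")

print(a == b)  # True: the word order is lost once we only keep counts
```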

<!-- Note further that it is heuristic-based. This means that although the TF-IDF -->

### What is the intuition here?

TF-IDF is based on the intuition that the most important tokens in a document are those that appear frequently in it, especially if they are rare across the rest of the corpus.

For example, let's contrast two takeaway menus:

* Chicken Kurma Masala, Chicken Tikka Masala, Chicken Curry, Saag Aloo, Pilau Rice, Plain Rice
* Chicken Chow Mein Noodles, Crispy Beef, Chicken and Cashew nut curry, Prawn Crackers, Egg Fried Rice, Steamed rice, Vegetable Noodles

<img src="https://media.giphy.com/media/12xu9HYTRo4Eg0/giphy.gif">

The word 'Chicken' appears the most in the first order, followed by 'Masala' and 'Rice'. This is *text frequency*.

But that's only half the story - because chicken and rice are quite common words in general. So we need a way to measure whether a word is rarer. The word 'masala' only appears in the first order, whereas 'chicken' and 'rice' appear in both orders. Hence, the rareness of a word can be determined by how infrequently it appears across all the documents. This is *inverse document frequency*.

Combining the text-frequency and inverse-document-frequency scores for each token gives it a TF-IDF score. This way, we can derive that the word 'masala' is the most important word in the first order, since it has both a high TF and a high IDF score.

::: {.column-margin}
TF-IDF can also be applied within words, for example by splitting the word `manner` into character tokens:

* `n` appears twice in the word
* `m` is rarer

Imagine now that three of the letters are dropped. We are much more likely to guess that the word `m_nn__` could be manner, whereas seeing `_a__er` is far less informative.
:::

#### Text Frequency

Let's say we wanted to build something to classify cuisines. We might first take orders from two different menus, and try to identify which words are some of the most important.

But how can we do this in an automated way?

<img src="https://media.giphy.com/media/B8Bp8MfpmKbWU/giphy.gif">

We might think that **words that are repeated many times in the menu are more characteristic of that restaurant. This is called 'text frequency'.**

So let's count them:

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Two takeaway orders, one string (document) per order
takeaway_orders = [
    'Chicken Kurma Masala, Chicken Tikka Masala, Chicken Curry, Pilau Rice, Plain Rice',
    'Chicken Chow Mein Noodles, Crispy Beef, Chicken curry, Egg Fried Rice, Plain rice, Vegetable Noodles',
]

# CountVectorizer lower-cases the text, splits it into word tokens and counts them
cnt_vec = CountVectorizer()
count_mat = cnt_vec.fit_transform(takeaway_orders)
pd.DataFrame(data=count_mat.toarray(), index=['Order_1', 'Order_2'], columns=cnt_vec.get_feature_names_out())

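| | beef | chicken | chow | crispy | curry | egg | fried | kurma | masala | mein | noodles | pilau | plain | rice | tikka | vegetable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Order_1 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | 1 | 0 |
| Order_2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 0 | 1 |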


Cool, so the most frequent word in the first order is `chicken`, followed by `masala` and `rice`. The second order has `chicken`, `noodles` and `rice` as its most frequent words.

Two things to note:

* Rather than taking the absolute counts, we might take the log of the counts instead. This captures the idea of diminishing returns: the difference in importance between the word 'Masala' appearing once vs twice in a document is greater than between it appearing nine times vs ten times. This matters more for longer documents.
* In this instance, we count 'uni-grams', with one token per word. In general, we can also count 'bi-grams' such as 'Chicken Kurma' and 'Kurma Masala', 'tri-grams', etc. See more under cosine similarity.
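
Both notes can be sketched in a few lines of plain Python. The `log_tf` helper below is hypothetical and assumes the `1 + ln(count)` form of log-scaling (the form scikit-learn applies when `sublinear_tf=True`); other log variants exist.

```python
import math

def log_tf(count):
    """Sub-linear text frequency: 1 + ln(count), or 0 if the token is absent."""
    return 1 + math.log(count) if count > 0 else 0.0

# Diminishing returns: going from 1 -> 2 occurrences adds more than 9 -> 10.
gain_1_to_2 = log_tf(2) - log_tf(1)
gain_9_to_10 = log_tf(10) - log_tf(9)
print(round(gain_1_to_2, 3), round(gain_9_to_10, 3))  # 0.693 0.105

# Bi-grams: pair each token with the token that follows it.
tokens = ["chicken", "kurma", "masala"]
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)  # ['chicken kurma', 'kurma masala']
```

In scikit-learn, n-grams are requested through the `ngram_range` argument of the vectorizer, e.g. `CountVectorizer(ngram_range=(1, 2))` for uni-grams plus bi-grams.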

#### Inverse Document Frequency

We are only getting half the information here then. For example in the first order, chicken in general is a common word, so it appearing frequently is less informative, whereas masala is a rare word, and more informative. We thus need to **introduce an additional concept of word uniqueness**, or equivalently the opposite of how frequently it appears across our entire 'corpus' of menu orders.

<img src="https://media.giphy.com/media/5vR6pNsjhoKwo/giphy.gif">

Enter Karen Spärck Jones, with the concept of ‘inverse document frequency’. This is the idea that it is not just how often a word appears, **but how unique the word is across all sentences (or ‘documents’), that determines how important it is. TF-IDF helps use this to turn words into vectors.**

For example, a rare word like Masala appears in few 'documents' (orders), so it will have a low document frequency and thus a high inverse document frequency score.

We usually calculate inverse document frequency using the following logic:

$$
\text{idf} = 1+\ln \left(\frac{\text{\# docs in corpus}}{\text{\# docs term appears in}} \right)
$$

In other words, the word chicken appears in both docs, so its IDF score is $1+\ln(\frac{2}{2})=1$.
On the other hand, the word masala only appears in one doc, so it gets an IDF score of $1+\ln(\frac{2}{1})\approx 1.69$.
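
The same arithmetic can be checked by hand. This is a minimal sketch in plain Python: the `idf` helper is hypothetical, and each order is represented simply as the set of tokens it contains.

```python
import math

# Each document is the set of unique tokens in one order.
docs = [
    {"chicken", "kurma", "masala", "tikka", "curry", "pilau", "plain", "rice"},
    {"chicken", "chow", "mein", "noodles", "crispy", "beef", "curry",
     "egg", "fried", "plain", "rice", "vegetable"},
]

def idf(term, docs):
    """idf = 1 + ln(# docs in corpus / # docs the term appears in)."""
    n_docs = len(docs)
    doc_freq = sum(term in doc for doc in docs)
    return 1 + math.log(n_docs / doc_freq)

print(idf("chicken", docs))           # 1.0
print(round(idf("masala", docs), 3))  # 1.693
```

Note that this helper would fail for a term that appears in no document at all, which is one reason library implementations offer a smoothed variant.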

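In [12]:
from sklearn.feature_extraction.text import TfidfTransformer

# smooth_idf=False and norm=None give the plain formula above: count * (1 + ln(N / df))
idf_vec = TfidfTransformer(smooth_idf=False, norm=None)
idf_mat = idf_vec.fit_transform(count_mat)
# Dividing TF-IDF by the raw counts recovers each token's IDF; tokens absent from an order show as NaN
pd.DataFrame(data=idf_mat / count_mat, index=['Order_1', 'Order_2'], columns=cnt_vec.get_feature_names_out())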

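| | beef | chicken | chow | crispy | curry | egg | fried | kurma | masala | mein | noodles | pilau | plain | rice | tikka | vegetable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Order_1 | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.693147 | 1.693147 | NaN | NaN | 1.693147 | 1.0 | 1.0 | 1.693147 | NaN |
| Order_2 | 1.693147 | 1.0 | 1.693147 | 1.693147 | 1.0 | 1.693147 | 1.693147 | NaN | NaN | 1.693147 | 1.693147 | NaN | 1.0 | 1.0 | NaN | 1.693147 |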


### TF x IDF

And to finish - TF-IDF is simply the product of the TF and IDF scores, which "combines" the text-frequency and inverse-document-frequency concepts into a single score for each token.

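In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# sublinear_tf=True uses 1 + ln(count) as the text frequency. By default the vectorizer also
# smooths the IDF and scales each row to unit length, so the scores differ from a plain TF x IDF product.
tfidf_vec = TfidfVectorizer(sublinear_tf=True)
tfidf_mat = tfidf_vec.fit_transform(takeaway_orders)
pd.DataFrame(data=tfidf_mat.toarray(), index=['Order_1', 'Order_2'], columns=tfidf_vec.get_feature_names_out())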

| | beef | chicken | chow | crispy | curry | egg | fried | kurma | masala | mein | noodles | pilau | plain | rice | tikka | vegetable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Order_1 | 0.0 | 0.459492 | 0.0 | 0.0 | 0.218951 | 0.0 | 0.0 | 0.307727 | 0.521028 | 0.0 | 0.0 | 0.307727 | 0.218951 | 0.370715 | 0.307727 | 0.0 |
| Order_2 | 0.269369 | 0.324505 | 0.269369 | 0.269369 | 0.191658 | 0.269369 | 0.269369 | 0.0 | 0.0 | 0.269369 | 0.456081 | 0.0 | 0.191658 | 0.324505 | 0.0 | 0.269369 |


Cool, so combining text frequency and inverse document frequency now reveals that the most important word in the first order is 'Masala', and in the second order it is 'noodles'. Nice.
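
To see where the first row of scores comes from, here is a hand-rolled sketch in plain Python. It assumes the scikit-learn settings used above: log-scaled counts (`1 + ln(count)`), the default smoothed IDF (`1 + ln((1 + N) / (1 + df))`) and L2 normalisation of each row. The dictionaries are copied from the count table earlier; the helper names are made up for illustration.

```python
import math

# Token counts for the first order, taken from the count table above.
order_1_counts = {"chicken": 3, "curry": 1, "kurma": 1, "masala": 2,
                  "pilau": 1, "plain": 1, "rice": 2, "tikka": 1}
# Number of orders (out of 2) that contain each token.
doc_freq = {"chicken": 2, "curry": 2, "kurma": 1, "masala": 1,
            "pilau": 1, "plain": 2, "rice": 2, "tikka": 1}
n_docs = 2

def sublinear_tf(count):
    """Log-scaled text frequency: 1 + ln(count)."""
    return 1 + math.log(count)

def smooth_idf(df):
    """Smoothed IDF: 1 + ln((1 + N) / (1 + df))."""
    return 1 + math.log((1 + n_docs) / (1 + df))

# Multiply TF by IDF, then scale the row so its Euclidean length is 1.
raw = {t: sublinear_tf(c) * smooth_idf(doc_freq[t]) for t, c in order_1_counts.items()}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: v / norm for t, v in raw.items()}

for token, score in sorted(tfidf.items(), key=lambda kv: -kv[1]):
    print(f"{token:8s} {score:.6f}")
```

The top two lines printed are `masala` (about 0.521) and `chicken` (about 0.459), matching the first row of the table.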

Finally, we don't have to limit ourselves to sentences, and can split strings into character tokens, applying the same logic: for the word $\text{queen}$, we would find the letters $q$ and $e$ to be the most informative because of their rarity (IDF) and being repeated (TF). This could be useful in predicting the word being typed or correcting misspellings - let's jump into a character-level example when discussing cosine similarity.
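
As a rough sketch of the character-level idea, the snippet below treats each word as a "document" of character tokens. The five-word corpus is hypothetical and chosen only for illustration, so the exact scores depend entirely on it. It uses raw counts for TF and the unsmoothed IDF formula from earlier.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of words; each word is a "document" of character tokens.
words = ["queen", "quiet", "green", "never", "tuner"]

def char_idf(char):
    """IDF of a character: 1 + ln(# words / # words containing the character)."""
    doc_freq = sum(char in word for word in words)
    return 1 + math.log(len(words) / doc_freq)

def char_tfidf(word):
    """Raw character counts (TF) multiplied by each character's IDF."""
    counts = Counter(word)
    return {char: count * char_idf(char) for char, count in counts.items()}

scores = char_tfidf("queen")
for char, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(char, round(score, 3))
```

With this particular corpus, `e` scores highly because it is repeated and `q` because it is rare. In scikit-learn, the equivalent would be to pass `analyzer='char'` to the vectorizer.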

# Cosine similarity

### What is it used for?

Cosine similarity can help measure how similar two words or documents are. For example, we could better match a search to a result using this, whereas 'keywords' would just weight every word equally. 

<img src="https://media.giphy.com/media/13cgadB959Y0BW/giphy.gif">

### What is the intuition?

TF-IDF creates a row of scores for each token in the text. For example, the first order had high TF-IDF scores for masala, chicken and rice, and low scores for noodles and fried. If we have another document that has high TF-IDF scores for masala, chicken and rice, and low scores for noodles and fried, we might think it is similar to the first order. Cosine similarity gives a measure between zero and one of how similar the two texts are.

#### Vectorization

To take this further, what we have done using TF-IDF is a form of 'vectorization'. If we were to plot the first order in 16-dimensional space, with one axis for each word, the line stays at zero along the 'beef' axis, travels 0.46 along the 'chicken' axis, etc.

**We can then measure the cosine of the angle between these lines to get an idea of how similar two documents are.** Cosine similarity is often used to match, for example, a search string to a result.
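
Before moving to the visual example, here is a minimal sketch of the calculation itself in plain Python, using the two rows of TF-IDF scores from the table above (columns in alphabetical order, from `beef` to `vegetable`). The `cosine_similarity` helper is written out by hand for illustration; libraries such as scikit-learn provide their own version.

```python
import math

# TF-IDF rows copied from the table above, one value per token (beef ... vegetable).
order_1 = [0.0, 0.459492, 0.0, 0.0, 0.218951, 0.0, 0.0, 0.307727,
           0.521028, 0.0, 0.0, 0.307727, 0.218951, 0.370715, 0.307727, 0.0]
order_2 = [0.269369, 0.324505, 0.269369, 0.269369, 0.191658, 0.269369, 0.269369, 0.0,
           0.0, 0.269369, 0.456081, 0.0, 0.191658, 0.324505, 0.0, 0.269369]

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

similarity = cosine_similarity(order_1, order_2)
print(round(similarity, 2))  # roughly 0.35
```

Only the tokens shared by both orders (chicken, curry, plain, rice) contribute to the dot product, which is why the two orders come out as only moderately similar.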

This can feel a bit abstract, but hang on in there because this is a key part of the intuition!

Let's dive into this with an example we can visualise in 3 dimensional space to make this clearer.