# NLP Notes

## Topic Modelling

- Topic modelling is an unsupervised machine learning technique that identifies semantic patterns in a collection of texts and extracts their key topics.
- The key idea is that text about a specific topic uses certain words more frequently than text about other topics.

### Latent Semantic Analysis (LSA)

- LSA is used to find relationships between many documents.
- It builds a large term-document matrix where each row represents a unique word and each column represents a document.
- It then reduces the dimensionality of this sparse matrix using *Singular Value Decomposition* while preserving the relationships between words and documents.
- *Cosine similarity* is then used to measure how similar two documents are.
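
The first and last steps above can be sketched in plain Python. This is a minimal illustration on a made-up three-document corpus: it builds the term-document matrix and compares document columns with cosine similarity. The SVD step is deliberately omitted here; in full LSA the comparison would be made on the reduced-rank document vectors instead of the raw columns.

```python
import math
from collections import Counter

# Toy corpus; the documents are illustrative only.
docs = [
    "cats chase mice",
    "dogs chase cats",
    "stocks rise on earnings",
]

# Vocabulary: one row per unique word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})
counts = [Counter(doc.split()) for doc in docs]

# Term-document matrix: matrix[i][j] is the count of word i in document j.
matrix = [[c[word] for c in counts] for word in vocab]


def column(j):
    """Return document j as a vector of word counts."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


sim_pets = cosine_similarity(column(0), column(1))   # share "cats" and "chase"
sim_mixed = cosine_similarity(column(0), column(2))  # no shared words
```

Documents that share vocabulary score closer to 1, and documents with no words in common score 0.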

### Latent Dirichlet Allocation (LDA)

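- LDA is a generative probabilistic model (a Bayesian network).
- It treats each document as a *bag-of-words*, models each document as a mixture of topics, and assigns each word in the document to one of those topics.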

### LSA vs LDA

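- LSA is an algebraic method that captures relationships between documents through matrix factorization, while LDA is a probabilistic model that describes each individual document as a mixture of topics.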

In [1]:
# Topic Modelling with gensim (LDA).

## TF-IDF

- Term frequency-inverse document frequency (TF-IDF) measures the importance of a word to a specific document.
- It is the product of two statistics: *term frequency (TF)* and *inverse document frequency (IDF)*.

### Term Frequency

- Term frequency is the relative frequency of a term in a document.
- Calculated by dividing the number of times the term appears in the document by the total number of terms in the document.
  - $t$ - Term.
  - $d$ - Document.
  - $f_{t, d}$ - Number of times term $t$ appears in document $d$.

$$\text{tf}(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}}$$
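
As a small worked example of this formula, the sketch below computes the term frequency of a word in a single sentence. The function name `term_frequency` and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import Counter


def term_frequency(term, document):
    """Relative frequency of `term` among the tokens of `document`."""
    tokens = document.lower().split()  # naive whitespace tokenizer (assumption)
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


# "the" appears 2 times among 6 tokens.
tf_the = term_frequency("the", "the cat sat on the mat")
```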

### Inverse Document Frequency (IDF)

- Inverse document frequency measures the amount of information a term provides.
- Calculated by dividing the total number of documents by the number of documents that contain the term, and taking the logarithm of the quotient.
  - $t$ - Term.
  - $d$ - Document.
  - $D$ - Set of all documents.
  - $N$ - Total number of documents.

$$\text{idf}(t, D) = \log \frac{N}{\left|\{d \in D : t \in d\}\right|}$$
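
A minimal sketch of this definition, assuming whitespace tokenization and the natural logarithm. The function name `inverse_document_frequency` and the corpus are hypothetical. Note that the formula is undefined for a term that appears in no document; this sketch raises an error in that case, whereas many libraries add smoothing terms to avoid it.

```python
import math


def inverse_document_frequency(term, documents):
    """log(N / number of documents containing `term`), natural logarithm."""
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(n_docs / n_containing)


corpus = ["the cat sat", "the dog barked", "a bird sang"]
idf_the = inverse_document_frequency("the", corpus)    # log(3 / 2)
idf_bird = inverse_document_frequency("bird", corpus)  # log(3 / 1)
```

The rarer term ("bird") receives the higher IDF, which matches the idea that it carries more information.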

### TF-IDF Formula

- To calculate TF-IDF, multiply the TF value of the term in the document by the IDF value of the term in the corpus.

$$\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)$$
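
Putting the two statistics together, the sketch below scores two words from the same document. It assumes whitespace tokenization, a non-empty document, and a term that occurs in at least one document; the function name `tfidf` and the corpus are illustrative.

```python
import math
from collections import Counter


def tfidf(term, document, documents):
    """tf(t, d) * idf(t, D), assuming `term` occurs somewhere in `documents`."""
    tokens = document.split()
    tf = Counter(tokens)[term] / len(tokens)
    n_containing = sum(1 for doc in documents if term in doc.split())
    idf = math.log(len(documents) / n_containing)
    return tf * idf


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
score_the = tfidf("the", corpus[0], corpus)  # in every document, so idf = 0
score_mat = tfidf("mat", corpus[0], corpus)  # only in this document
```

Although "the" is the most frequent word in the first document, it appears in every document, so its score is zero, while the rarer "mat" gets a positive score.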

In [2]:
# TF-IDF with Scikit-Learn.

## Transformers

In [3]:
# transformers example.

## BERT

In [4]:
# BERT example with transformers

## Sentence-BERT

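- **Context**: Sometimes we want to encode entire sentences, but BERT only creates embeddings for individual tokens.
- Simply averaging the word vectors is ineffective.
- SBERT uses a *Siamese network*, meaning the two sentences in a pair are each passed independently through the same BERT model.
  - Because each sentence is encoded on its own, its embedding can be computed once, stored, and compared with any other sentence embedding (for example, using cosine similarity) without running the model again.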
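
The Siamese idea can be illustrated without a real model. In the sketch below, `toy_encode` is a hypothetical stand-in for "BERT plus pooling" (it only counts words from a tiny fixed vocabulary), so the numbers mean nothing linguistically. The point is the structure: the same encoder is applied to each sentence separately, and only the resulting vectors are compared.

```python
import math

# Hypothetical stand-in for a BERT model plus pooling: it maps a sentence to a
# fixed-size vector by counting words from a tiny fixed vocabulary.
VOCAB = ["cat", "dog", "sat", "ran", "mat", "park"]


def toy_encode(sentence):
    tokens = sentence.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def siamese_similarity(encode, sentence_a, sentence_b):
    """Encode each sentence independently with the SAME encoder, then compare."""
    return cosine_similarity(encode(sentence_a), encode(sentence_b))


# Because the sentences never interact inside the encoder, embeddings can be
# computed once, cached, and reused for any number of comparisons.
cache = {s: toy_encode(s) for s in ["the cat sat", "the cat ran", "the dog ran"]}
close = siamese_similarity(toy_encode, "the cat sat", "the cat ran")
```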

In [None]:
# SBERT with sentence_transformers.

## BERTopic

- BERTopic is a topic modelling algorithm using BERT.
- The algorithm consists of five steps, plus an optional sixth:
  - Generate embeddings with Sentence-BERT.
  - Dimensionality reduction with UMAP, so that the Sentence-BERT embeddings are easier to cluster and visualize.
  - Clustering with HDBSCAN.
  - Tokenizing with CountVectorizer to find the most representative words for each topic.
  - Weighting with c-TF-IDF.
  - Optional representation tuning.


### Embed Documents

- Start by converting documents to vectors using Sentence-BERT.

### Dimensionality Reduction

- The output from Sentence-BERT is a high-dimensional vector that is difficult to cluster.
- Therefore, we use the UMAP algorithm to reduce the dimensions of our embeddings while retaining as much of their information as possible.
- Dimensionality reduction also helps with visualization, since it's impossible to plot more than three dimensions directly.

### Cluster Documents

- Use the HDBSCAN clustering algorithm to cluster similar documents together.

### Bag-of-Words

- Different clustering algorithms create different types of clusters, and a cluster is not guaranteed to be compact around its centre, so we don't want to use the centroid to represent the cluster.
- Instead, the algorithm merges all documents in each cluster into one giant document. This giant document now represents the cluster.
- The algorithm then employs a bag-of-words technique to count the words in each giant document.
- This can be done with scikit-learn's `CountVectorizer`, for example.
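
The merging and counting described above can be sketched with the standard library alone. This is a simplified stand-in for what a vectorizer would do; the documents and the cluster labels are made up, and the labels would normally come from the clustering step.

```python
from collections import Counter, defaultdict

# Hypothetical clustering output: one cluster label per document.
docs = ["the cat sat", "a cat purred", "stocks fell today", "stocks rose"]
labels = [0, 0, 1, 1]

# Merge every document in a cluster into one combined document.
merged = defaultdict(list)
for doc, label in zip(docs, labels):
    merged[label].extend(doc.lower().split())

# Bag-of-words: word counts for each combined document.
bags = {label: Counter(tokens) for label, tokens in merged.items()}
```

Each entry of `bags` is the word count table of one cluster, which is the input that the topic representation step works on.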

### Topic Representation

- Finally, we want to describe each cluster (topic) with a set of representative words.
- We use TF-IDF to find the importance of each word to its cluster's combined document and pick the most important words to represent the topic.
- Since TF-IDF is applied to whole clusters (classes) rather than to individual documents, the technique is called class-based TF-IDF (c-TF-IDF).
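
A minimal sketch of a class-based weighting, assuming the formulation $W_{t,c} = \text{tf}_{t,c} \cdot \log(1 + A / f_t)$, where $\text{tf}_{t,c}$ is the normalized frequency of term $t$ in cluster $c$, $f_t$ is the frequency of $t$ across all clusters, and $A$ is the average number of words per cluster. The word counts below are invented, and the exact weighting in a given library version may differ.

```python
import math
from collections import Counter

# Hypothetical per-cluster word counts, e.g. from the bag-of-words step.
bags = {
    0: Counter({"cat": 3, "the": 2, "sat": 1}),
    1: Counter({"stocks": 3, "the": 2, "fell": 1}),
}

# f_t: frequency of each word across all clusters.
total = Counter()
for bag in bags.values():
    total.update(bag)

# A: average number of words per cluster.
avg_words = sum(sum(bag.values()) for bag in bags.values()) / len(bags)


def c_tf_idf(bag):
    """Score each word in one cluster: normalized tf * log(1 + A / f_t)."""
    n_words = sum(bag.values())
    return {
        word: (count / n_words) * math.log(1 + avg_words / total[word])
        for word, count in bag.items()
    }


scores = {label: c_tf_idf(bag) for label, bag in bags.items()}
top_words = {
    label: sorted(s, key=s.get, reverse=True)[:2] for label, s in scores.items()
}
```

In this toy data, "the" occurs in both clusters and is therefore down-weighted, so words that are specific to one cluster end up as its top words.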

In [None]:
# Multiple examples using different configurations.