# Topic Modelling

## Where Topic Modelling Fits in NLP and Machine Learning

1. NLP Tasks: Topic modeling is part of unsupervised learning in NLP, often used for text mining, information retrieval, and content recommendation.
2. Machine Learning: It utilizes unsupervised machine learning algorithms to discover hidden patterns in data without predefined labels or categories.

Topic modelling is a technique in natural language processing (NLP) used to uncover the underlying topics that are present in a collection of documents. It helps in identifying patterns and organizing large sets of textual data by clustering similar words and phrases into topics. This technique is particularly useful for summarizing, categorizing, and analyzing text data.

## Bag of Words (BoW) Model

- Definition: The Bag of Words model is a simplifying representation used in natural language processing (NLP). **In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.**
- Feature Extraction: Texts are converted into fixed-length vectors. Common techniques include simple word counts, term frequency (TF), and term frequency-inverse document frequency (TF-IDF). 
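The "bag" idea can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn tokenizes: the helper `bag_of_words` and the two sentences are made up for the example, and tokens are simply split on whitespace.

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated tokens in `text`."""
    return Counter(text.lower().split())

bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a == bag_b)   # True: word order is discarded
print(bag_a["the"])     # 2: multiplicity is kept
```

The two sentences mean different things, yet they map to the same bag, which is exactly the information a BoW model gives up.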

LSA (Latent Semantic Analysis), NMF (Non-negative Matrix Factorization), and LDA (Latent Dirichlet Allocation) are all considered "bag of words" models, for three reasons:

- Input Representation: All three models (LSA, NMF, LDA) take a document-term matrix as input (e.g. the sparse matrices produced by CountVectorizer or TfidfVectorizer), which is constructed using the bag of words approach.
- Word Order Ignored: The inherent characteristic of BoW models is that they do not consider the order of words. All three models operate on this premise.
- Focus on Frequency/Presence: They focus on the frequency or presence of words in documents, which is a key aspect of the bag of words approach.

## Common Algorithms for Topic Modeling

1. Latent Dirichlet Allocation (LDA): One of the most popular methods. LDA is a generative probabilistic model that assumes each document is a mixture of a small number of topics, each topic is a mixture of words, and each word in a document is attributable to one of the document's topics.
2. Non-negative Matrix Factorization (NMF): A linear algebra approach that factorizes the document-term matrix into two non-negative matrices (document-topic and topic-word).
3. Latent Semantic Analysis (LSA): A technique that applies singular value decomposition (SVD) to the document-term matrix to reduce its dimensions and uncover the latent structure in the data.

## Code samples

### Example of a simple document term matrix


In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The quick brown fox",
    "jumps over the lazy dog",
    "The fox",
    "The dog is lazy"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert to array and print
print(X.toarray())

# Feature names (vocabulary)
print(vectorizer.get_feature_names_out())


[[1 0 1 0 0 0 0 1 1]
 [0 1 0 0 1 1 1 0 1]
 [0 0 1 0 0 0 0 0 1]
 [0 1 0 1 0 1 0 0 1]]
['brown' 'dog' 'fox' 'is' 'jumps' 'lazy' 'over' 'quick' 'the']
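
The count matrix above can be reweighted into TF-IDF. The following is a sketch in plain Python, assuming the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper `tfidf_row` is a hypothetical name, and whitespace tokenization stands in for the vectorizer's own tokenizer.

```python
import math

# Same four documents as above, lowercased and split on whitespace.
docs = [
    "the quick brown fox".split(),
    "jumps over the lazy dog".split(),
    "the fox".split(),
    "the dog is lazy".split(),
]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in docs) for term in vocab}
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

weights = tfidf_row(docs[2])  # the document "the fox"
for term, weight in zip(vocab, weights):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Because "the" occurs in every document, it receives the smallest possible idf, so "fox" ends up with the larger weight even though both words appear once in that document.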


### LDA

In [1]:
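# LDA
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Fetch the tokenizer and stop-word data used by preprocess() (no-op if already present;
# recent NLTK releases may also need 'punkt_tab')
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)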

# Sample documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement"
]

# Preprocessing function
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'\W', ' ', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Stemming
    ps = PorterStemmer()
    tokens = [ps.stem(word) for word in tokens]
    return ' '.join(tokens)

# Preprocess all documents
preprocessed_documents = [preprocess(doc) for doc in documents]

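# Vectorize the documents using TF-IDF
# (LDA is formulated for raw word counts, so CountVectorizer is the more usual choice;
# scikit-learn also accepts the TF-IDF weights used here)
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_documents)

# Train the LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_tfidf)

# Print the 10 highest-weighted terms per topic (argsort is ascending, so the strongest term comes last)
feature_names = tfidf_vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"Topic: {idx} \nWords: {', '.join(top_terms)}\n")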


Topic: 0 
Words: machin, abc, applic, measur, perceiv, error, relat, manag, user, interfac

Topic: 1 
Words: comput, respons, time, user, engin, test, survey, opinion, ep, system



### NMF

In [2]:
# NMF
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Sample documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement"
]

# Preprocessing function
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'\W', ' ', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Stemming
    ps = PorterStemmer()
    tokens = [ps.stem(word) for word in tokens]
    return ' '.join(tokens)

# Preprocess all documents
preprocessed_documents = [preprocess(doc) for doc in documents]

# Vectorize the documents using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_documents)

# Train the NMF model
nmf = NMF(n_components=2, random_state=0)
nmf.fit(X_tfidf)

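# Print the 10 highest-weighted terms per topic (argsort is ascending, so the strongest term comes last)
feature_names = tfidf_vectorizer.get_feature_names_out()
for idx, topic in enumerate(nmf.components_):
    top_terms = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"Topic: {idx} \nWords: {', '.join(top_terms)}\n")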



Topic: 0 
Words: lab, machin, user, engin, test, human, manag, interfac, ep, system

Topic: 1 
Words: comput, opinion, survey, measur, perceiv, relat, error, user, time, respons



### LSA

In [5]:
# LSA
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# Sample documents
documents = [
    "Human machine interface for lab abc computer applications.",
    "A survey of user opinion of computer system response time.",
    "The EPS user interface management system.",
    "System and human system engineering testing of EPS.",
    "Relation of user perceived response time to error measurement."
]

# Preprocessing function
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'\W', ' ', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Stemming
    ps = PorterStemmer()
    tokens = [ps.stem(word) for word in tokens]
    return ' '.join(tokens)

# Preprocess all documents
preprocessed_documents = [preprocess(doc) for doc in documents]
# Vectorize the preprocessed documents using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_documents)

# Apply Truncated SVD (LSA)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X_tfidf)

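# Print the 10 highest-loading terms per component (ascending order, so the strongest term comes last)
terms = tfidf_vectorizer.get_feature_names_out()
for idx, component in enumerate(lsa.components_):
    print(f"Topic {idx}:")
    for i in component.argsort()[-10:]:
        print(f"{terms[i]}", end=' ')
    print("\n")

# lsa.components_ has shape (n_topics, n_terms): each column is the 2-d latent vector of one term.
# Print the first 10 columns to show these term embeddings.
print("Word-Topic Matrix (first 10 terms):\n", lsa.components_[:, :10])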

Topic 0:
opinion human comput time respons manag interfac ep user system 

Topic 1:
comput opinion survey user measur perceiv relat error time respons 

Word-Topic Matrix (first 10 terms):
 [[ 0.09173987  0.09173987  0.21909378  0.16600004  0.32477867  0.10543331
   0.20794293  0.26486604  0.09173987  0.09173987]
 [-0.1077054  -0.1077054   0.05139182 -0.17741717 -0.25731592  0.27463756
  -0.23003509 -0.2010729  -0.1077054  -0.1077054 ]]



In [6]:
terms.shape

(21,)

In [7]:
lsa.components_.shape

(2, 21)

## Metrics and evaluation

### Topic coherence

Topic Coherence refers to the degree to which the words within a single topic are semantically related to each other. It helps evaluate how interpretable the topics generated by a model are. A high coherence score generally indicates that the topics are meaningful and make sense to humans.

#### Evaluation metrics

##### Perplexity

   
1. Definition: Perplexity is a measure of how well a probabilistic model predicts a sample of unseen data. In the context of topic modeling, perplexity evaluates how well the model predicts the words in unseen documents given the learned topics.

Intuition behind Perplexity:

- Model Prediction: The perplexity metric quantifies how surprised a model is by new unseen data.
- Word Prediction: Lower perplexity means the model is less surprised by new documents and can predict the words more accurately.


2. Calculation

   $$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i)\right)$$

   Where:

   - $N$ is the total number of words in the test set.
   - $p(w_i)$ is the probability assigned by the model to word $w_i$.

   For a given test set of documents, perplexity is calculated using the probabilities assigned to the words in the documents by the topic model.

3. Interpretation of Perplexity

- Lower Perplexity: Indicates better model performance. The model is less surprised by the test data and can predict the words more accurately.
- Higher Perplexity: Indicates poorer model performance. The model struggles to predict the words in the test data accurately.
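
As a small worked example of the formula, the sketch below computes perplexity directly from per-word probabilities. The two probability lists are hypothetical values standing in for what two models might assign to the same four test words; they do not come from a fitted model.

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-probability of the observed words."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Hypothetical probabilities assigned by two models to the same 4 test words.
uniform_model = [0.25, 0.25, 0.25, 0.25]
sharper_model = [0.5, 0.5, 0.25, 0.25]

print(perplexity(uniform_model))   # about 4
print(perplexity(sharper_model))   # about 2.83
```

A model that spreads its probability uniformly over k words has perplexity k, so perplexity can be read as the effective number of words the model is choosing between. The second model assigns more probability to the observed words and is therefore "less surprised".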

##### Coherence Score

1. Definition: The coherence score measures the degree of semantic similarity between high scoring words in the topics.
2. Importance: It aligns more closely with human judgment compared to perplexity.
3. Common Implementations: Tools like Gensim provide functions to compute coherence scores for LDA models.
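
To make the idea concrete, here is a sketch of one of the simpler coherence measures, the UMass score, which sums log((D(w_m, w_l) + 1) / D(w_l)) over pairs of top words, where D counts the documents containing the given words and w_l is the higher-ranked word of the pair. The function name, the toy corpus and the two topics are made up for illustration. Gensim exposes this measure as `coherence='u_mass'`, and its `c_v` measure is a different, more involved calculation.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    top_words: topic words ordered from most to least probable.
    documents: list of token lists used to count co-occurrences.
    Every word in top_words must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # combinations() keeps the input order, so `higher` always outranks `lower`.
    for higher, lower in combinations(top_words, 2):
        score += math.log((doc_count(higher, lower) + 1) / doc_count(higher))
    return score

# Toy corpus, made up for illustration.
documents = [
    ["dog", "barks", "loudly"],
    ["cat", "meows", "softly"],
    ["dog", "cat", "pets"],
    ["dog", "pets", "barks"],
]

coherent_topic = ["dog", "barks", "pets"]   # words that co-occur
mixed_topic = ["dog", "meows", "softly"]    # "dog" never appears with the others

print(umass_coherence(coherent_topic, documents))
print(umass_coherence(mixed_topic, documents))
```

Higher (less negative) values indicate that the top words of a topic tend to appear in the same documents.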

##### Human Interpretability

1. Definition: Evaluating how understandable and meaningful the topics are to human users.
2. Methods:
   - Human Judgment: Experts or annotators assess the quality of the topics by checking if the top words in each topic form a coherent and interpretable set.
   - Qualitative Analysis: Reviewing the most representative documents for each topic to see if they align with the expected topic themes.
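
The qualitative check can be supported with a few lines of code that pull out the documents most strongly associated with each topic. This is a sketch with a hypothetical document-topic matrix and placeholder document names. With scikit-learn, such a matrix could come from `lda.transform(X)`.

```python
# Hypothetical document-topic weights: one row per document, one column per topic.
doc_topic = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.05, 0.95],
]
documents = ["doc A", "doc B", "doc C", "doc D"]

def most_representative(doc_topic, topic, top_n=2):
    """Indices of the `top_n` documents with the largest weight for `topic`."""
    ranked = sorted(range(len(doc_topic)), key=lambda d: doc_topic[d][topic], reverse=True)
    return ranked[:top_n]

for topic in range(2):
    print(topic, [documents[d] for d in most_representative(doc_topic, topic)])
```

Reading the selected documents side by side with the topic's top words is usually the quickest way to decide whether a topic label makes sense.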



### Code samples

##### Perplexity

In [12]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The dog barks loudly",
    "The cat meows softly",
    "The dog and cat are pets",
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly"
]

# Preprocessing and Vectorization
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Train LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Calculate Perplexity on the same data (usually you use a held-out test set)
perplexity = lda.perplexity(X)

print(f'Perplexity: {perplexity}')

Perplexity: 24.072064586265917


##### Coherence score 

This example uses Gensim, since scikit-learn does not provide a built-in coherence metric.

In [10]:
from gensim.models import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

# Sample documents
documents = [
    ["dog", "barks", "loudly"],
    ["cat", "meows", "softly"],
    ["dog", "and", "cat", "are", "pets"]
]

# Create a dictionary and corpus
dictionary = Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42)

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=documents, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model_lda.get_coherence()

print(f'Coherence Score: {coherence_score}')


Coherence Score: 0.29927940849732204


In [11]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (the same three sentences as the Gensim example above, here used to compute perplexity)
documents = [
    "The dog barks loudly",
    "The cat meows softly",
    "The dog and cat are pets"
]

# Preprocessing and Vectorization
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Train LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Calculate Perplexity on the same data (usually you use a held-out test set)
perplexity = lda.perplexity(X)

print(f'Perplexity: {perplexity}')


Perplexity: 13.519769559569319


## Term Embeddings in BoW Topic Analysis

Term embeddings are dense vector representations of words that capture semantic meaning. They are not inherently part of traditional algorithms like LDA, NMF, and LSA, although embeddings can be used in conjunction with these algorithms to enhance their performance. The algorithms relate to embeddings as follows:

1. LDA and NMF: Typically do not use term embeddings directly. They work with the document-term matrix (from CountVectorizer or TF-IDF).
2. LSA: Can be considered a precursor to modern embedding techniques. It reduces the dimensionality of the document-term matrix, uncovering latent semantic structures, but does not produce embeddings in the modern sense (like Word2Vec or GloVe).

### LSA and Embeddings

#### LSA (Latent Semantic Analysis):
- *Reduction to Latent Space:* LSA uses Singular Value Decomposition (SVD) to reduce the high-dimensional term-document matrix to a lower-dimensional space. This reduced space can be seen as capturing latent semantic structures.

- *Document-Topic and Word-Topic Matrices:* In LSA, the matrices obtained from SVD can be viewed as embeddings in a reduced semantic space. Writing the truncated SVD as X ≈ UΣV^T, the X_lsa matrix returned by `fit_transform` corresponds to UΣ and represents documents in terms of latent topics, while `lsa.components_` corresponds to V^T and relates terms to topics.

- *Similarity to Embeddings:* The key similarity is that **both embeddings and LSA capture semantic relationships and reduce dimensionality. However, LSA’s vectors are derived from linear algebra rather than neural network-based optimization.**
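
The decomposition behind LSA can be sketched directly with NumPy. The small count matrix is made up for illustration, and the variable names are not taken from the notebook above. scikit-learn's `TruncatedSVD` should give the same document and term vectors up to sign and small numerical differences between solvers.

```python
import numpy as np

# Small document-term count matrix (4 documents x 5 terms), made up for illustration.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

k = 2  # number of latent dimensions to keep

# Thin SVD: X = U @ diag(S) @ Vt, with singular values in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

doc_embeddings = U[:, :k] * S[:k]   # documents in the latent space (U_k Σ_k)
term_embeddings = Vt[:k, :].T       # terms in the latent space (rows of V_k)

# Rank-k approximation of the original matrix.
X_approx = doc_embeddings @ term_embeddings.T

print(doc_embeddings.shape, term_embeddings.shape)
print(np.round(X_approx, 2))
```

Each document and each term is now a dense vector of length k, which is the sense in which LSA output resembles an embedding.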

#### Why NMF and LDA Aren't Typically Considered Embeddings

##### NMF (Non-negative Matrix Factorization):
- *Matrix Factorization:* NMF factorizes the term-document matrix into two lower-dimensional matrices (document-topic and topic-word) with non-negative constraints. These matrices can be interpreted similarly to embeddings, but they are not typically referred to as embeddings because they are not trained in the same way as traditional embeddings.
- *Interpretability:* NMF’s components are directly interpretable as topics, which is different from the general-purpose semantic vectors produced by embeddings.
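
To show what the factorization looks like, here is a sketch of NMF using the classic Lee-Seung multiplicative updates for the Frobenius-norm objective. The function name, the toy matrix and the iteration count are assumptions for the example. scikit-learn's `NMF` uses a coordinate descent solver by default, so its factors will generally differ numerically.

```python
import numpy as np

def nmf_multiplicative(X, n_topics, n_iter=200, eps=1e-9, seed=0):
    """Factorize non-negative X (docs x terms) as W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_topics)) + eps    # document-topic weights
    H = rng.random((n_topics, n_terms)) + eps   # topic-term weights
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term counts: two documents per "theme", made up for illustration.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

W, H = nmf_multiplicative(X, n_topics=2)
print(np.round(W, 2))                   # document-topic matrix
print(np.round(H, 2))                   # topic-term matrix
print(np.linalg.norm(X - W @ H))        # reconstruction error
```

Because every entry of `W` and `H` is non-negative, a row of `H` can be read as a set of additive term weights, which is what makes the components easy to interpret as topics.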

##### LDA (Latent Dirichlet Allocation):

- Probabilistic Model: LDA is a generative probabilistic model that assumes documents are mixtures of topics, and topics are mixtures of words. It outputs distributions over topics for each document and distributions over words for each topic.
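- Distributional Output: The outputs are probability distributions rather than free-form vectors: each document-topic vector sums to 1 over the topics, and each topic-word vector sums to 1 over the whole vocabulary. They can be high-dimensional and sparse, contrasting with the dense, fixed-length vectors of embeddings.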
- Interpretability: Like NMF, LDA’s outputs are highly interpretable in terms of topics but don’t function as general-purpose semantic embeddings.
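
The generative story that LDA assumes can be sketched with the standard library. This is only the forward (sampling) direction, not inference: the vocabulary, the two topic-word distributions and the Dirichlet parameter `alpha` are hypothetical, and the topics are fixed by hand here, whereas full LDA also draws them from a Dirichlet prior.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, alpha, n_words, rng):
    """Generate one document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)   # document-topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]   # pick a topic
        w = rng.choices(vocab, weights=topics[z])[0]            # pick a word from that topic
        words.append(w)
    return theta, words

vocab = ["dog", "cat", "pets", "system", "user", "interface"]
# Hypothetical topic-word distributions (each row sums to 1).
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # an "animals" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "software" topic
]

rng = random.Random(0)
theta, words = generate_document(topics, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(p, 3) for p in theta])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the document-topic vectors (`theta`) and the topic-word distributions that most plausibly produced them.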