# Topic Modelling: Modern Approaches

## Where Topic Modelling Fits in NLP and Machine Learning

1. NLP Tasks: Topic modelling is an unsupervised NLP technique, commonly used for text mining, information retrieval, and content recommendation.
2. Machine Learning: It relies on unsupervised learning algorithms that discover hidden patterns in data without predefined labels or categories.

Topic modelling is a natural language processing (NLP) technique for uncovering the underlying topics in a collection of documents. It identifies patterns and organizes large sets of textual data by clustering similar words and phrases into topics, which makes it particularly useful for summarizing, categorizing, and analyzing text.

## Code Samples

### Example of a simple document-term matrix


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The quick brown fox",
    "jumps over the lazy dog",
    "The fox",
    "The dog is lazy"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert to array and print
print(X.toarray())

# Feature names (vocabulary)
print(vectorizer.get_feature_names_out())


[[1 0 1 0 0 0 0 1 1]
 [0 1 0 0 1 1 1 0 1]
 [0 0 1 0 0 0 0 0 1]
 [0 1 0 1 0 1 0 0 1]]
['brown' 'dog' 'fox' 'is' 'jumps' 'lazy' 'over' 'quick' 'the']
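
To make the matrix above less of a black box, here is a minimal pure-Python sketch of how such counts can be built by hand. The `tokenize` helper is a hypothetical stand-in that only approximates `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters); it is not the library's implementation.

```python
import re

documents = [
    "The quick brown fox",
    "jumps over the lazy dog",
    "The fox",
    "The dog is lazy",
]

def tokenize(text):
    # Assumption: lowercase and keep tokens of 2+ word characters,
    # which approximates CountVectorizer's defaults.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: sorted unique tokens across all documents
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
index = {term: i for i, term in enumerate(vocabulary)}

# One row per document, one column per vocabulary term
matrix = []
for doc in documents:
    row = [0] * len(vocabulary)
    for tok in tokenize(doc):
        row[index[tok]] += 1
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a document and each column a vocabulary term, so the entry at row `i`, column `j` is the number of times term `j` occurs in document `i`.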


## Term Embedding

Embeddings are dense vector representations of words or phrases that capture semantic meaning. Unlike sparse representations such as one-hot encoding, embeddings map words to a continuous vector space in which semantically similar words have similar representations. These vectors typically have far fewer dimensions (e.g., 100-300) than a one-hot vector, whose length equals the vocabulary size.


### Importance of Embeddings

1. Semantic Similarity: Words with similar meanings are closer in the embedding space.
2. Dimensionality Reduction: Embeddings reduce the dimensionality of text data while preserving meaningful relationships.
3. Improved Performance: Embeddings improve the performance of NLP models by providing more informative features compared to traditional methods.
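
The semantic-similarity point can be illustrated with a small sketch that compares one-hot vectors with dense vectors using cosine similarity. The 2-dimensional `dense` vectors below are made up by hand purely for illustration; they do not come from a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

words = ["cat", "dog", "car"]

# One-hot: one dimension per vocabulary word, a single 1 per vector
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(words))]
    for i, w in enumerate(words)
}

# Hypothetical dense embeddings, chosen by hand (not learned)
dense = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal, so similarity is 0
print(cosine(one_hot["cat"], one_hot["dog"]))

# Dense vectors can place related words closer than unrelated ones
print(cosine(dense["cat"], dense["dog"]))
print(cosine(dense["cat"], dense["car"]))
```

With one-hot encoding every pair of distinct words is equally dissimilar, whereas the dense vectors can encode that "cat" is closer to "dog" than to "car".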


### Popular Embedding Strategies

1. Word2Vec: Predicts a word given its context (CBOW) or predicts the context given a word (Skip-gram).
2. GloVe (Global Vectors for Word Representation): Uses co-occurrence statistics to learn word embeddings.
3. FastText: Extends Word2Vec by considering subword information, which helps with out-of-vocabulary words.
4. BERT (Bidirectional Encoder Representations from Transformers): Uses transformer-based architecture to create context-aware embeddings.
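
The two Word2Vec training objectives differ mainly in which side of the (word, context) relationship is the input. The sketch below generates training examples for each; the helper names `skipgram_pairs` and `cbow_pairs` are hypothetical and only show the data layout, not the actual training procedure of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: the center word is the input, each context word is a target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: the surrounding context words are the input, the center word is the target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

In Gensim, the `sg` parameter of `Word2Vec` selects between the two objectives (`sg=1` for Skip-gram, `sg=0` for CBOW, which is the default).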

## Implementation Example: Word2Vec with Gensim

### Code Samples

In [None]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.test.utils import common_texts  # Example dataset

# Tokenize and preprocess the text
sentences = [simple_preprocess(" ".join(doc)) for doc in common_texts]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Save the model
model.save("word2vec.model")

# Load the model
model = Word2Vec.load("word2vec.model")

# Get the vector for a specific word
vector = model.wv['computer']
print(vector)


[-0.00515774 -0.00667028 -0.0077791   0.00831315 -0.00198292 -0.00685696
 -0.0041556   0.00514562 -0.00286997 -0.00375075  0.0016219  -0.0027771
 -0.00158482  0.0010748  -0.00297881  0.00852176  0.00391207 -0.00996176
  0.00626142 -0.00675622  0.00076966  0.00440552 -0.00510486 -0.00211128
  0.00809783 -0.00424503 -0.00763848  0.00926061 -0.00215612 -0.00472081
  0.00857329  0.00428459  0.0043261   0.00928722 -0.00845554  0.00525685
  0.00203994  0.0041895   0.00169839  0.00446543  0.0044876   0.0061063
 -0.00320303 -0.00457706 -0.00042664  0.00253447 -0.00326412  0.00605948
  0.00415534  0.00776685  0.00257002  0.00811905 -0.00138761  0.00808028
  0.0037181  -0.00804967 -0.00393476 -0.0024726   0.00489447 -0.00087241
 -0.00283173  0.00783599  0.00932561 -0.0016154  -0.00516075 -0.00470313
 -0.00484746 -0.00960562  0.00137242 -0.00422615  0.00252744  0.00561612
 -0.00406709 -0.00959937  0.00154715 -0.00670207  0.0024959  -0.00378173
  0.00708048  0.00064041  0.00356198 -0.00273993 -0.0

In [None]:
uwords = set(w for doc in common_texts for w in doc)
uwords, len(uwords)

({'computer',
  'eps',
  'graph',
  'human',
  'interface',
  'minors',
  'response',
  'survey',
  'system',
  'time',
  'trees',
  'user'},
 12)

In [None]:
print(model.wv)
len(vector)

KeyedVectors<vector_size=100, 12 keys>


100

## Implementation using BERTopic

The BERTopic documentation is a great reference.

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load dataset
newsgroups_train = fetch_20newsgroups(subset='train')
docs = newsgroups_train.data

# Initialize BERTopic
topic_model = BERTopic()

# Fit the model
topics, _ = topic_model.fit_transform(docs)

# Get the topics
print(topic_model.get_topic_info())

We can also extract information at the document level.

In [None]:
# per document info
topic_model.get_document_info(docs).head()

In [None]:
topic_df = topic_model.get_topic_info()
row = topic_df.sample(1)
topic_df.head()

In [None]:
similar_topics, similarity = topic_model.find_topics("car", top_n=5)
print(similar_topics)
topic_model.get_topic(similar_topics[0])