# Using LDA (Latent Dirichlet Allocation) for Topic Modeling
* LDA is a probabilistic generative model.
* It assumes each document is a mixture of hidden topics and each topic is a probability distribution over words. Training infers the set of topics most likely to have generated the words observed in the collection.
* We'll keep working with the newsgroup dataset.
* We'll use scikit-learn's built-in LDA decomposition model (`LatentDirichletAllocation`).
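
The generative story described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the vocabulary, the two topic-word distributions, and the `alpha` value are hypothetical, and the code samples documents from fixed topics rather than learning topics from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 6-word vocabulary
vocab = np.array(["god", "belief", "moral", "orbit", "rocket", "launch"])
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "religion"-like topic
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # a "space"-like topic
])

def generate_document(n_words, alpha=0.5):
    # 1. Draw the document's topic mixture from a Dirichlet prior
    theta = rng.dirichlet([alpha] * len(topic_word))
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture
        z = rng.choice(len(topic_word), p=theta)
        # 3. Pick a word from that topic's word distribution
        words.append(str(rng.choice(vocab, p=topic_word[z])))
    return theta, words

theta, words = generate_document(8)
print(theta.round(2), " ".join(words))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's topic mixture.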


# Step 1: Loading and preprocessing the data 
* We'll use TF-IDF-vectorized data (`TfidfVectorizer`) instead of the raw term counts from `CountVectorizer`.
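
To make the difference between the two vectorizers concrete, here is a minimal sketch on a hypothetical three-document corpus (the documents are invented for illustration and are not part of the newsgroup data). Both vectorizers build the same vocabulary, but one stores integer counts and the other stores reweighted, L2-normalized floats.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = [
    "space shuttle launch orbit",
    "graphics image file format image",
    "space image orbit",
]

count_matrix = CountVectorizer().fit_transform(docs)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Same vocabulary, so both matrices have the same shape
print(count_matrix.shape, tfidf_matrix.shape)

# Raw integer counts for the second document
print(count_matrix[1].toarray())

# TF-IDF weights for the same document (each row has unit L2 norm by default)
print(tfidf_matrix[1].toarray().round(2))
```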


In [1]:
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

# Defining our categories (the ones we'll use to fetch the data)
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space'
]

groups = fetch_20newsgroups(subset = 'all', categories=categories)

# Getting our labels and label names
labels = groups.target
label_names = groups.target_names

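# Removing names and lemmatizing
# Note: the names corpus stores capitalized entries (e.g. 'Kent'), while each document
# is lowercased below, so the name check is case-sensitive and matches very few tokens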
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

# An empty list to store our cleaned data
data_cleaned = []

for doc in groups.data:
    doc = doc.lower()
    doc_cleaned = " ".join(lemmatizer.lemmatize(word) for word in doc.split() if word.isalpha() and word not in all_names)
    data_cleaned.append(doc_cleaned)
    
# Using TfidfVectorizer instead of CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vector = TfidfVectorizer(stop_words = 'english', max_features = None,
                              max_df=0.5, min_df = 2)

# Fitting the vectorizer and transforming the cleaned documents into a TF-IDF matrix
vectorized_data = tfidf_vector.fit_transform(data_cleaned)



# Step 2: Training the LDA model

In [2]:
from sklearn.decomposition import LatentDirichletAllocation
t = 20
lda = LatentDirichletAllocation(n_components=t, 
                               learning_method='batch', random_state=42,
                               max_iter = 10)

lda.fit(vectorized_data)
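
Besides learning topics, a fitted model can describe each document as a mixture of those topics. The sketch below assumes a hypothetical four-document corpus and only two topics so that it runs quickly. `toy_lda` and `docs` are illustrative names, not part of the notebook above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = [
    "space shuttle launch orbit moon",
    "orbit moon rocket launch",
    "image file graphic format",
    "graphic image program file",
]

X = TfidfVectorizer().fit_transform(docs)

toy_lda = LatentDirichletAllocation(n_components=2,
                                    learning_method='batch', random_state=42,
                                    max_iter=10)

# fit_transform returns one row per document and one column per topic
doc_topics = toy_lda.fit_transform(X)

print(doc_topics.shape)           # (number of documents, number of topics)
print(doc_topics.sum(axis=1))     # each row is a distribution over topics
print(doc_topics.argmax(axis=1))  # index of the dominant topic per document
```

On the full dataset, the same call on `vectorized_data` would give a 20-column matrix that can be compared against `labels`.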

In [3]:
# Inspecting the topic-term weights: one row per topic, one column per vocabulary term
lda.components_

array([[0.05      , 0.05000001, 0.05000001, ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.30041961,
        0.05      ],
       ...,
       [0.05      , 0.05000001, 0.05000001, ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.39388581]])

In [6]:
# Displaying the 10 highest-weighted terms per topic
# (argsort is ascending, so the strongest term is printed last)
terms = tfidf_vector.get_feature_names_out()

for topic_index, topic in enumerate(lda.components_):
    print(f"Topic {topic_index}: ")
    print(" ".join([terms[i] for i in topic.argsort()[-10:]]))

Topic 0: 
rle blood davidians private activity bureau tourist cookamunga kent ksand
Topic 1: 
bissell swallow sex tribe lawrence jeremy penn liar pope walla
Topic 2: 
detector batse salvation timer habitable punishable bottle denver chade meng
Topic 3: 
suitable sect sean compassion xv davidians mcmains hernandez convenient ansi
Topic 4: 
mr relates spec tatoos virile buffer nazi double instinctive act
Topic 5: 
article like space know program file graphic wa university image
Topic 6: 
people believe atheist god say article morality think moral wa
Topic 7: 
middle cobb george ezekiel ureply nicholls tax greg illinois tossed
Topic 8: 
petri temperature christmas served leftover truelove turkey cruel gas solid
Topic 9: 
leigh langley film compaq orion cview mccreary magellan vax oliveira
Topic 10: 
fast notre bob tektronix queen manhattan sank blew beauchaine bronx
Topic 11: 
forming normal delaunay fermi sign redesign option accelerator sphere chimp
Topic 12: 
burdett hussein bond grego

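* Some topics map clearly onto our categories: Topic 5 (space, program, file, graphic, image) reflects `comp.graphics` and `sci.space`, and Topic 6 (atheist, god, morality, moral) reflects `alt.atheism` and `talk.religion.misc`. Most of the other topics are dominated by noise such as personal names and rare words.
* The project in this chapter was about finding hidden similarities underneath the newsgroups data, whether as semantic groups, themes, or word clouds.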
