# Using NMF for topic modeling
* Decomposing a matrix V into W and H. 
* We'll use the NMF module from sklearn.decomposition
* To load the data, we'll use the same cleaned data as with the k_means clustering
* Specifying 20 topics

# Step 1: Loading and Preprocessing the data

In [3]:
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

# Defining our categories (the ones we'll use to fetch the data)
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space'
]

groups = fetch_20newsgroups(subset = 'all', categories=categories)

# Getting our labels and label names
labels = groups.target
label_names = groups.target_names

# Removing names and lemmatizing 
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

# An empty list to store our cleaned data
data_cleaned = []

for doc in groups.data:
    doc = doc.lower()
    doc_cleaned = " ".join(lemmatizer.lemmatize(word) for word in doc.split() if word.isalpha() and word not in all_names)
    data_cleaned.append(doc_cleaned)
    
# Using TFidfVectorizer instead of CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vector = TfidfVectorizer(stop_words = 'english', max_features = None,
                              max_df=0.5, min_df = 2)

# Fitting our model
vectorized_data = tfidf_vector.fit_transform(data_cleaned)



# Step 2: Fitting the NMF model on the term matrix
* The idea is to obtain the topic-feature rank W after the model is trained


In [4]:
from sklearn.decomposition import NMF

t = 20
nmf = NMF(n_components=t, random_state=42, max_iter = 200, tol = 1e-4)
nmf.fit(vectorized_data)

In [8]:
# Obtaining the top 10 terms for each topic, based on their ranks
terms = tfidf_vector.get_feature_names_out()

for topic_index, topic in enumerate(nmf.components_):
    print(f"Topic {topic_index}: ")
    print(" ".join([terms[i] for i in topic.argsort()[-10:]]))

Topic 0: 
right know ha good make want like just think people
Topic 1: 
ftp information thanks looking software university library package computer graphic
Topic 2: 
forwarded program sci digest international launch nasa station shuttle space
Topic 3: 
objectively mean basis article christian value say moral objective morality
Topic 4: 
say law faith bible doe believe love christian jesus god
Topic 5: 
software display colour weather processing jpeg xv bit color image
Topic 6: 
solar array oms mass day scheduled servicing shuttle mission hst
Topic 7: 
day message fbi biblical said david article did koresh wa
Topic 8: 
know jpeg cview tiff ftp gif program convert format file
Topic 9: 
kipling temperature dick collision dunn resembles spencer henry toronto zoology
Topic 10: 
data galileo loss jet timer propulsion comet spacecraft command orbit
Topic 11: 
rushdie activity islam bureau tourist cookamunga private kent ksand islamic
Topic 12: 
awful discussing forum convenience just post gro

* Topics 1, 5, 8 and 14 seem to be computer/software related.
* Topics 2, 6, 10 and 15 seem related to the space. 
* Tipics 3, 4 and 16 seem religion-oriented.
* Some topics are hard to interpret, but that's fine since topic modeling is a kind of free-form learning. 