# Implementing LSA and Topic Modeling

This notebook covers two techniques:

LSA (Latent Semantic Analysis)

Topic Modeling (LDA – Latent Dirichlet Allocation)

Both are used for discovering hidden topics and relationships in text documents.

## Understanding LSA (Latent Semantic Analysis)

Goal: Reduce high-dimensional text data into fewer dimensions while preserving meaning.

Works on the term-document matrix, here weighted with TF-IDF.

Uses SVD (Singular Value Decomposition) to reduce dimensions.
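
As a sketch of what this reduction means: if $X$ is the TF-IDF matrix with $n$ documents and $m$ terms, a rank-$k$ truncated SVD keeps only the $k$ largest singular values. The symbols $n$, $m$, $k$, $U_k$, $\Sigma_k$ and $V_k$ are notation introduced here for illustration; they do not appear in the original text.

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad U_k \in \mathbb{R}^{n \times k},\; \Sigma_k \in \mathbb{R}^{k \times k},\; V_k \in \mathbb{R}^{m \times k}
$$

In scikit-learn's `TruncatedSVD`, `fit_transform` returns $U_k \Sigma_k$ (one row per document), and `components_` holds $V_k^{\top}$ (one row per latent dimension).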

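Captures synonymy and semantic similarity: words that occur in similar contexts get similar low-dimensional representations.

Example:
Words: "car", "automobile"
Both appear in similar contexts → LSA places them close together in the reduced space.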
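
The car/automobile intuition can be checked on a toy corpus. The following is a minimal sketch, assuming a made-up six-document corpus (`toy_docs`) in which "car" and "automobile" never appear in the same document but share the same neighbouring words. All names here (`toy_docs`, `term_vectors`, `cosine`) are illustrative and not part of the notebook's main example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus: "car" and "automobile" never co-occur,
# but they appear next to the same context words.
toy_docs = [
    "car engine wheels",
    "automobile engine wheels",
    "car road driver",
    "automobile road driver",
    "banana fruit sweet",
    "apple fruit sweet",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

# Project onto 2 latent dimensions
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_svd.fit(toy_X)

# Each row of term_vectors is one term's coordinates in the latent space
term_vectors = toy_svd.components_.T
# Each row of raw_columns is one term's TF-IDF weights across the documents
raw_columns = toy_X.toarray().T
vocab = toy_vectorizer.vocabulary_

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

car, auto, banana = vocab["car"], vocab["automobile"], vocab["banana"]

raw_sim = cosine(raw_columns[car], raw_columns[auto])
lsa_sim = cosine(term_vectors[car], term_vectors[auto])
lsa_sim_unrelated = cosine(term_vectors[car], term_vectors[banana])

print("Raw TF-IDF similarity (car, automobile):", round(raw_sim, 3))
print("LSA similarity (car, automobile):", round(lsa_sim, 3))
print("LSA similarity (car, banana):", round(lsa_sim_unrelated, 3))
```

In the raw TF-IDF space the two words share no documents, so their similarity is zero; after the projection they should end up much closer to each other than to an unrelated word such as "banana". On a corpus this small the effect is exaggerated, so treat it as an illustration rather than a benchmark.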

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Sample documents
documents = [
    "I love machine learning and natural language processing",
    "Deep learning improves natural language understanding",
    "Artificial intelligence and machine learning are related fields",
    "Hospitals use AI to improve patient care",
    "Doctors apply medicine to treat patients"
]

# Step 1: Build the TF-IDF matrix (rows = documents, columns = terms)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Step 2: Apply LSA (truncated SVD) to project onto 2 latent dimensions
lsa_model = TruncatedSVD(n_components=2, random_state=42)
lsa_topic_matrix = lsa_model.fit_transform(X)  # shape: (n_documents, 2)

# Step 3: Print the 5 highest-weighted terms for each component
print("LSA Topics (Word Importance):")
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:5]
    print("Topic", i, ":", [t[0] for t in sorted_terms])


LSA Topics (Word Importance):
Topic 0 : ['learning', 'language', 'natural', 'machine', 'love']
Topic 1 : ['apply', 'doctors', 'medicine', 'patients', 'treat']
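
The cell above prints only the term side of the decomposition. `fit_transform` also returns a document-by-component matrix (`lsa_topic_matrix`) that is never inspected. The following is a minimal sketch of how it could be read, assuming the same sample documents; choosing the "dominant" component by largest absolute weight is a heuristic used here for illustration, not something the original specifies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "I love machine learning and natural language processing",
    "Deep learning improves natural language understanding",
    "Artificial intelligence and machine learning are related fields",
    "Hospitals use AI to improve patient care",
    "Doctors apply medicine to treat patients",
]

X = TfidfVectorizer(stop_words="english").fit_transform(documents)
lsa_model = TruncatedSVD(n_components=2, random_state=42)
lsa_topic_matrix = lsa_model.fit_transform(X)  # one row per document

# Show each document's coordinates in the 2-D latent space.
# LSA weights can be negative, so compare by absolute value.
for doc_idx, weights in enumerate(lsa_topic_matrix):
    dominant = int(abs(weights).argmax())
    print(f"Document {doc_idx}: weights={weights.round(3)}, dominant component={dominant}")

# Fraction of the TF-IDF matrix's variance captured by each component
print("Explained variance ratio:", lsa_model.explained_variance_ratio_.round(3))
```

With only five short documents the explained variance is a rough guide at best; on a real corpus it can help in choosing `n_components`.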


## Understanding LDA (Latent Dirichlet Allocation)

In [2]:
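from sklearn.decomposition import LatentDirichletAllocation

# Step 1: Fit LDA on the same TF-IDF features (X) used for LSA.
# Note: LDA models word counts, so it is more commonly fit on raw counts
# (e.g. from CountVectorizer); TF-IDF is reused here to keep the example short.
lda_model = LatentDirichletAllocation(n_components=2, random_state=42)
lda_matrix = lda_model.fit_transform(X)  # shape: (n_documents, 2); each row sums to 1

# Step 2: Print the 5 highest-weighted terms for each topic
print("\nLDA Topics (Word Importance):")
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda_model.components_):
    # argsort is ascending, so [:-6:-1] takes the 5 largest indices in descending order
    sorted_terms = [terms[j] for j in topic.argsort()[:-6:-1]]
    print("Topic", i, ":", sorted_terms)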



LDA Topics (Word Importance):
Topic 0 : ['machine', 'learning', 'love', 'processing', 'related']
Topic 1 : ['improves', 'understanding', 'deep', 'treat', 'patients']
