# Lesson 3: Topic Modeling


Topic models, in a nutshell, are a type of statistical language model used for uncovering hidden structure in a collection of texts.


In practice, and more intuitively, you can think of topic modeling as a task of:


- Dimensionality Reduction, where rather than representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in Vocabulary}, you can represent it in a topic space as {Topic_i: Weight(Topic_i, T) for Topic_i in Topics}


- Unsupervised Learning, where it can be compared to clustering. As with the number of clusters, the number of topics is a parameter you choose before fitting rather than something learned from the data. By doing topic modeling, we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each having a specific weight


- Tagging, where abstract “topics” that occur in a collection of documents are used to summarize the information in them

There are several algorithms you can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).

In this lesson, we’ll take a closer look at LDA and implement our first topic model using scikit-learn (sklearn).
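To make the "word space versus topic space" idea concrete, here is a minimal sketch on a made-up four-document corpus. The documents, the `toy_` variable names, and the choice of two topics are assumptions for illustration only; they are not part of the 20 Newsgroups experiment.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
toy_docs = [
    "the team won the hockey game",
    "the players lost the game last season",
    "the encryption key protects the chip",
    "the government wants the encryption keys",
]

# Word space: one column per vocabulary word, entries are raw counts
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)

# Topic space: one column per topic, entries are topic weights
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topic = toy_lda.fit_transform(toy_counts)

print("word space shape: ", toy_counts.shape)
print("topic space shape:", toy_doc_topic.shape)
print(toy_doc_topic.round(2))  # each row is one document's topic mixture
```

Each document goes from one number per vocabulary word to one number per topic, and each row of the topic representation sums to 1.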


In [None]:
!pip install pyldavis

In [None]:
# Load the data
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

from sklearn.datasets import fetch_20newsgroups

# categories = ['alt.atheism', 
#               'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', 
                                  # categories=categories,
                                  shuffle=True,
                                  random_state=11)

twenty_test = fetch_20newsgroups(subset='test',
                                #  categories=categories,
                                 shuffle=True,
                                 random_state=11)


categories = twenty_test.target_names

In [None]:
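import numpy as np

def display_topics(H, W, feature_names, no_top_words, n_top_documents):
    # H is the topic-word matrix, W is the document-topic matrix
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        # Words with the largest weights in this topic, highest first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        # Indices of the documents that load most heavily on this topic
        # (computed here but not printed)
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:n_top_documents]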

            
sample_text = twenty_train.data[1]
sample_text

'From: chrism@cirrus.com (Chris Metcalfe)\nSubject: Nazi Eugenic Theories Circulated by CPR => (unconventianal peace)\nOrganization: Cirrus Logic Inc.\nLines: 85\n\nNow we have strong evidence of where the CPR really stands.\nUnbelievable and disgusting.  It only proves that we must\nnever forget...\n\n\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n\nIn article <1483500348@igc.apc.org> cpr@igc.apc.org (Center for Policy Research) writes:\n>\n>From: Center for Policy Research <cpr>\n>Subject: Unconventional peace proposal\n>\n>\n>A unconventional proposal for peace in the Middle-East.\n\nNot so unconventional.  Eugenic solutions to the Jewish Problem\nhave been suggested by Northern Europeans in the past.\n\n  Eugenics: a science that deals with the improvement (as by\n  control of human mating) of hereditory qualities of race\n  or breed.  -- Webster\'s Ninth Collegiate Dictionary.\n\n>5.      The emergence of a considerable number of \'mixed\'\n>marriages in Israe

In [None]:
dir(LatentDirichletAllocation)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_approx_bound',
 '_check_feature_names',
 '_check_n_features',
 '_check_non_neg_array',
 '_check_params',
 '_e_step',
 '_em_step',
 '_get_param_names',
 '_get_tags',
 '_init_latent_vars',
 '_more_tags',
 '_perplexity_precomp_distr',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_unnormalized_transform',
 '_validate_data',
 'fit',
 'fit_transform',
 'get_params',
 'partial_fit',
 'perplexity',
 'score',
 'set_params',
 'transform']

In [None]:
import numpy as np

docs = twenty_train.data

# LDA is a probabilistic model of word counts, so it needs raw term counts rather than tf-idf weights
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(docs)
# get_feature_names_out() requires scikit-learn >= 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()

n_topics = len(categories)  # one topic per newsgroup (20)
n_top_words = 5
n_top_documents = 5
# Run LDA
lda_model = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                      learning_method='online',
                                      learning_offset=50., random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
display_topics(lda_H, lda_W, tf_feature_names, n_top_words, n_top_documents)



LDA Topics
Topic 0:
nasa space 00 gov jpl
Topic 1:
__ ___ mit lcs fr
Topic 2:
gerard alleg de7 uccxkvb dps
Topic 3:
cx w7 c_ mv uw
Topic 4:
db mov nwu een acns
Topic 5:
uchicago frank objective morality midway
Topic 6:
temple ge dane keele ocis
Topic 7:
edu subject lines organization com
Topic 8:
toronto henry udel spencer zoo
Topic 9:
uk ac liverpool liv archbishop
Topic 10:
ax max g9v b8f a86
Topic 11:
nfotis plplot virginia plot roy
Topic 12:
key file use com chip
Topic 13:
edu people com writes subject
Topic 14:
adobe smokeless nichols sherri snichols
Topic 15:
drive 55 16 ide entry
Topic 16:
radar ncr detector detectors waterloo
Topic 17:
stratus sw wpi cdt atf
Topic 18:
ncsu harris eos hernlem uoregon
Topic 19:
georgia ai uga michael athens
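Each topic line printed by `display_topics` comes from sorting one row of the topic-word matrix and keeping the highest-weight words. A minimal sketch of that slicing idiom, using made-up words and weights (the `toy_` names are hypothetical):

```python
import numpy as np

# Hypothetical topic row: one weight per vocabulary word
toy_features = np.array(["space", "nasa", "car", "engine", "orbit"])
toy_topic = np.array([0.9, 0.7, 0.1, 0.05, 0.4])

n_top = 3
# argsort() returns indices in ascending order of weight; the slice
# [:-n_top - 1:-1] walks backwards from the end and stops after n_top items
top_idx = toy_topic.argsort()[:-n_top - 1:-1]
top_words = [str(w) for w in toy_features[top_idx]]
print(top_words)  # ['space', 'nasa', 'orbit']
```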


## Removing some data

Now let's remove some of the metadata to see if there is any improvement.

In [None]:
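# No trailing comma after the closing parenthesis: it would wrap the tuple
# in a second tuple and fetch_20newsgroups would then remove nothing
remove_info = ('headers', 'footers', 'quotes')
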
twenty_train = fetch_20newsgroups(subset='train', 
                                  remove=remove_info,
                                  categories=categories,
                                  shuffle=True,
                                  random_state=11)

twenty_test = fetch_20newsgroups(subset='test',
                                 remove=remove_info,
                                 categories=categories,
                                 shuffle=True,
                                 random_state=11)
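One Python detail is worth checking when building the tuple passed to `remove`. Assuming the loader tests for each section name with a membership check such as `'headers' in remove` (which is how the scikit-learn source reads, but treat it as an assumption), a stray trailing comma silently disables the removal. The variable names here are hypothetical:

```python
# Parentheses alone build the tuple of section names
remove_ok = ('headers', 'footers', 'quotes')

# A trailing comma after the closing parenthesis builds a tuple
# whose single element is the inner tuple
remove_nested = ('headers', 'footers', 'quotes'),

print(len(remove_ok), 'headers' in remove_ok)          # 3 True
print(len(remove_nested), 'headers' in remove_nested)  # 1 False
```

A quick sanity check after loading is to print one document and confirm it no longer starts with a `From:` header.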


In [None]:

import numpy as np

docs = twenty_train.data

n_topics = 20
n_top_words = 11
n_top_documents = 5
n_features = 1000

# LDA is a probabilistic model of word counts, so it needs raw term counts rather than tf-idf weights
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=5, 
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(docs)
# get_feature_names_out() requires scikit-learn >= 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()

n_topics = len(categories)


# Run LDA
lda_model = LatentDirichletAllocation(n_components=n_topics, max_iter=5, 
                                      learning_method='online', 
                                      learning_offset=50.,random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
display_topics(lda_H, lda_W, tf_feature_names, n_top_words, n_top_documents)



LDA Topics
Topic 0:
game team year play games ca edu season hockey win players
Topic 1:
car uiuc time just went bike left cars didn cso like
Topic 2:
cs edu information science computer list berkeley colorado faq sci pitt
Topic 3:
file window program mit files image available server ftp use code
Topic 4:
key uk chip encryption clipper ac keys security government algorithm privacy
Topic 5:
card __ speed ___ ca use drivers driver bus performance tom
Topic 6:
com netcom edu writes article jim au virginia brian fbi david
Topic 7:
gun government law state people rights guns right control states american
Topic 8:
bit using use work problem memory video com mouse time disk
Topic 9:
00 10 16 scsi 15 drive 25 apr 20 14 11
Topic 10:
windows dos graphics ibm ms pc os color washington software purdue
Topic 11:
israel jews israeli armenian turkish people armenians jewish war men said
Topic 12:
com posting host nntp access writes ca hp distribution reply article
Topic 13:
ax max g9v b8f a86 145 pl 1

In [None]:
ng_train = fetch_20newsgroups(subset='train', 
                                  remove=remove_info,
                                  categories=categories,
                                  shuffle=True,
                                  random_state=11)

ng_train.data[-1]

'From: behanna@syl.nj.nec.com (Chris BeHanna)\nSubject: Re: Should liability insurance be required?\nOrganization: NEC Systems Laboratory, Inc.\nDistribution: usa\nLines: 32\n\nIn article <tcora-140493155620@b329-gator-3.pica.army.mil> tcora@pica.army.mil (Tom Coradeschi) writes:\n>In article <1993Apr14.125209.21247@walter.bellcore.com>,\n>fist@iscp.bellcore.com (Richard Pierson) wrote:\n>> \n>> Lets get this "No Fault" stuff straight, I lived in NJ\n>> when NF started, my rates went up, ALOT. Moved to PA\n>> and my rates went down ALOT, the NF came to PA and it\n>> was a different story. If you are sitting in a parking\n>> lot having lunch or whatever and someone wacks you guess\n>> whose insurance pays for it ? give up ?  YOURS.\n>\n>BZZZT! If it is the other driver\'s fault, your insurance co pays you, less\n>deductible, then recoups the total cost from the other guy/gal\'s company\n>(there\'s a fancy word for it, which escapes me right now), and pays you the\n>deductible. Or: you c

In [None]:

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()


tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda_model, tf_feature_names, n_top_words)
    

Topic #0: game team year play games ca edu season hockey win players
Topic #1: car uiuc time just went bike left cars didn cso like
Topic #2: cs edu information science computer list berkeley colorado faq sci pitt
Topic #3: file window program mit files image available server ftp use code
Topic #4: key uk chip encryption clipper ac keys security government algorithm privacy
Topic #5: card __ speed ___ ca use drivers driver bus performance tom
Topic #6: com netcom edu writes article jim au virginia brian fbi david
Topic #7: gun government law state people rights guns right control states american
Topic #8: bit using use work problem memory video com mouse time disk
Topic #9: 00 10 16 scsi 15 drive 25 apr 20 14 11
Topic #10: windows dos graphics ibm ms pc os color washington software purdue
Topic #11: israel jews israeli armenian turkish people armenians jewish war men said
Topic #12: com posting host nntp access writes ca hp distribution reply article
Topic #13: ax max g9v b8f a86 145 p



In [None]:
# Document-topic weights: one row per document, one column per topic
transformed = lda_model.transform(tf)

# Normalize each row to sum to 1 (only needed if you want to work with the probabilities)
doc_topic_dist = transformed / transformed.sum(axis=1, keepdims=True)

# Most probable topic for every document, as a plain Python list
res = doc_topic_dist.argmax(axis=1).tolist()
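The same normalize-then-argmax step can be checked by hand on a small matrix. This sketch uses made-up weights for three documents and four topics; the `toy_` names are hypothetical.

```python
import numpy as np

# Hypothetical unnormalized document-topic weights (3 documents, 4 topics)
toy_weights = np.array([
    [2.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 3.0, 1.0],
    [1.0, 1.0, 1.0, 7.0],
])

# Divide each row by its sum so every row is a probability distribution
toy_dist = toy_weights / toy_weights.sum(axis=1, keepdims=True)

# Index of the highest-weight topic for each document
dominant_topic = toy_dist.argmax(axis=1).tolist()
print(dominant_topic)  # [0, 2, 3]
```

Normalizing does not change which topic wins for a document; it only matters when the weights are read as probabilities.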

In [None]:

from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

data, _ = fetch_20newsgroups(shuffle=True, 
                             random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
data_samples = data[:n_samples]

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(data_samples)

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

tf = tf_vectorizer.fit_transform(data_samples)
print()

# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)

print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print_top_words(nmf, tfidf_feature_names, n_top_words)

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

Extracting tf-idf features for NMF...
Extracting tf features for LDA...

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...





Topics in NMF model (generalized Kullback-Leibler divergence):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files format win sound ftp pub read save site help image available create copy running memory self version
Topic #7: game team games y



done in 4.372s.

Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic



## LDAvis

A better way to explore the LDA topics is to use pyLDAvis, which renders them as an interactive visualization.

In [None]:
from __future__ import division
!pip install pyLDAvis
import pandas as pd
import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

docs_raw = twenty_train.data


dtm_tf = tf


In [None]:
pyLDAvis.sklearn.prepare(lda_model, dtm_tf, tf_vectorizer)




TypeError: ignored

In [None]:
pyLDAvis.sklearn.prepare(lda_model, dtm_tf, tf_vectorizer, mds='tsne')

## Exercise

Explore different parameters for the LDA model and visualize the results. Create a new pipeline and experiment with HashingVectorizer instead of CountVectorizer.
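As a possible starting point for the second part, here is a minimal sketch of a hashing-based pipeline on a made-up corpus. The documents, the pipeline step names, and the parameter values are assumptions, not a reference solution. `alternate_sign=False` and `norm=None` are set so that the features stay non-negative counts, which LDA expects.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical, for illustration only)
toy_docs = [
    "the team won the hockey game",
    "the players lost the game last season",
    "the encryption key protects the chip",
    "the government wants the encryption keys",
]

hash_lda = Pipeline([
    # Hash tokens into a fixed number of columns; keep raw non-negative counts
    ("hash", HashingVectorizer(n_features=2**10, alternate_sign=False,
                               norm=None, stop_words="english")),
    ("lda", LatentDirichletAllocation(n_components=2, random_state=0)),
])

hash_doc_topic = hash_lda.fit_transform(toy_docs)
print(hash_doc_topic.round(2))
```

One trade-off to keep in mind: a hashing vectorizer stores no vocabulary, so column indices cannot be mapped back to words and the top words per topic cannot be printed the way they can with CountVectorizer.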


# Non-Negative Matrix Factorization


In [None]:
# Importing Necessary packages

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [None]:
# Let's check the first 3 articles
text_data= fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data
text_data[:3]

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 "A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't an

In [None]:
# Convert the raw text into a tf-idf document-term matrix

vectorizer = TfidfVectorizer(max_features=1500, min_df=10, stop_words='english')
X = vectorizer.fit_transform(text_data)
# get_feature_names_out() requires scikit-learn >= 1.0
words = np.array(vectorizer.get_feature_names_out())

# The vectorizer tokenizes each document and assigns a tf-idf weight to each term.
# X is sparse: each printed entry is (document index, term index) followed by the weight.
print(X)
print("words = ", words)

  (0, 829)	0.13596515131134768
  (0, 809)	0.1439640091285723
  (0, 707)	0.16068505607893963
  (0, 672)	0.16927150728890597
  (0, 1495)	0.1274990882101728
  (0, 506)	0.19413995565094086
  (0, 887)	0.17648781190400797
  (0, 757)	0.09424560560725692
  (0, 247)	0.17513150125349702
  (0, 1158)	0.1651151431885443
  (0, 1218)	0.19781957502373113
  (0, 128)	0.190572546028195
  (0, 1256)	0.153503242191245
  (0, 1118)	0.12154002727766956
  (0, 273)	0.14279390121865662
  (0, 484)	0.1714763727922697
  (0, 767)	0.18711856186440218
  (0, 808)	0.18303366583393096
  (0, 469)	0.2009979730339519
  (0, 411)	0.14249215589040326
  (0, 1191)	0.17201525862610714
  (0, 278)	0.630558141606117
  (0, 1472)	0.1855076564575762
  (1, 1355)	0.12138696862814867
  (1, 653)	0.1728163048656526
  :	:
  (11312, 1027)	0.45507155319966874
  (11312, 647)	0.21811161764585577
  (11312, 1302)	0.2391477981479836
  (11312, 1276)	0.39611960235510485
  (11312, 1100)	0.1839292570975713
  (11312, 926)	0.2458009890045144
  (11312, 140


Let's apply NMF to our data and view the topics generated. For simplicity, we will look at 10 topics that the model has generated. 

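A commonly used optimization method is the multiplicative update rule. Starting from non-negative initial values, W and H are updated in turn, with the multiplication and division taken element-wise:

$$H \leftarrow H \circ \frac{W^\top X}{W^\top W H}$$

$$W \leftarrow W \circ \frac{X H^\top}{W H H^\top}$$

Each step keeps W and H non-negative and never increases the Frobenius norm of X − WH. An implementation is available in scikit-learn through NMF(solver="mu").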
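For intuition, here is a minimal NumPy sketch of multiplicative updates for the Frobenius-norm objective, assuming the standard Lee and Seung form of the rule. The function name, the small `eps` added to the denominators to avoid division by zero, and the random test matrix are all assumptions for illustration; scikit-learn's own implementation is more elaborate.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative matrix X into W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Non-negative random initialization
    W = rng.random((n_samples, n_components))
    H = rng.random((n_components, n_features))
    errors = []
    for _ in range(n_iter):
        # Element-wise updates; non-negative factors stay non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))
    return W, H, errors

# Hypothetical non-negative data: 6 "documents", 5 "terms"
X_toy = np.random.default_rng(1).random((6, 5))
W_toy, H_toy, errors = nmf_multiplicative(X_toy, n_components=2)
print("first error: %.4f, last error: %.4f" % (errors[0], errors[-1]))
```

Because the updates only multiply by non-negative ratios, no explicit clipping or projection step is needed to keep the factors non-negative.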


In [None]:
# Applying Non-Negative Matrix Factorization
 
nmf = NMF(n_components=10, solver="mu")
W = nmf.fit_transform(X)
H = nmf.components_

for i, topic in enumerate(H):
     print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in words[topic.argsort()[-10:]]])))

Topic 1: way,people,time,ve,good,know,think,like,just,don
Topic 2: info,help,looking,card,hi,know,advance,mail,does,thanks
Topic 3: church,does,christians,christian,faith,believe,christ,bible,jesus,god
Topic 4: league,win,hockey,play,players,season,year,games,team,game
Topic 5: bus,floppy,card,controller,ide,hard,drives,disk,scsi,drive
Topic 6: 20,price,condition,shipping,offer,space,10,sale,new,00
Topic 7: running,problem,using,program,use,files,window,dos,file,windows
Topic 8: nsa,law,algorithm,escrow,government,keys,clipper,encryption,chip,key
Topic 9: state,war,turkish,armenians,government,armenian,jews,israeli,israel,people
Topic 10: email,internet,pub,com,article,ftp,university,cs,soon,edu


When we decompose the tf-idf matrix into two matrices, words that tend to co-occur end up with large weights in the same component. The word “eat” is likely to appear in food-related articles and therefore to co-occur with words like “tasty” and “food”. These words would probably be grouped together into a “food” component vector, and each article would have a certain weight for the “food” topic.
An NMF decomposition of the document-term matrix therefore yields components that can be read as “topics”, and expresses each document as a weighted sum of those topics. This is topic modeling, and it is an important application of NMF.

Both the underlying components (topics) and their weights should be non-negative here, since a word or a topic cannot contribute a negative amount to a document.
Another useful property of NMF is that it tends to produce sparse representations: many entries of W and H are exactly or nearly zero, as the printouts below show.
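The "weighted sum of topics" reading can be checked on a tiny example. This sketch uses a made-up count matrix with two obvious word groups; the matrix, the `toy_` names, and the NMF settings are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical document-term counts: 4 documents, 6 terms.
# Terms 0-2 stand for "food" words, terms 3-5 for "sports" words.
X_toy = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 1, 2, 3],
], dtype=float)

toy_nmf = NMF(n_components=2, init="nndsvd", max_iter=500, random_state=0)
W_toy = toy_nmf.fit_transform(X_toy)  # document-topic weights
H_toy = toy_nmf.components_           # topic-term weights

# Each document row is approximated by a weighted sum of the topic rows
approx = W_toy @ H_toy
print(np.round(W_toy, 2))
print(np.round(H_toy, 2))
print("share of exactly-zero entries in W:", np.mean(W_toy == 0))
```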

In [None]:
print(H[:10,:10])

[[1.73056098e-17 1.36668423e-02 2.53748965e-05 1.10637295e-02
  5.52709051e-07 1.44044380e-05 1.60522855e-08 7.40583315e-06
  2.64988025e-68 3.33684954e-54]
 [1.99640281e-12 0.00000000e+00 1.58904902e-09 2.42163785e-12
  2.63606445e-03 5.53213460e-04 4.91903700e-04 6.25551731e-10
  3.37267223e-29 5.33354304e-36]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [8.00543659e-11 4.51766792e-02 1.74974107e-03 2.19629994e-03
  2.12219682e-03 3.35820275e-06 2.38710570e-03 2.61996182e-04
  1.11511012e-07 3.44085840e-07]
 [6.36113869e-13 4.59738488e-03 0.00000000e+00 9.30771870e-03
  0.00000000e+00 0.00000000e+00 4.46894025e-03 0.00000000e+00
  2.62850141e-09 5.88084949e-11]
 [9.96690595e-01 2.35187247e-01 8.04633131e-02 5.30344540e-02
  3.72116368e-02 7.34792940e-02 4.60705088e-02 4.26739192e-02
  4.64046591e-03 2.50667205e-03]
 [0.00000000e+00 0.00000000e+00 2.15656893e-02 0.00000000e

In [None]:
print(W[:10,:10])

[[3.19899200e-02 2.92240867e-02 0.00000000e+00 3.31929055e-03
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [6.13370596e-03 2.98563482e-02 9.45242384e-09 5.59946663e-04
  3.17299443e-02 8.05593970e-03 0.00000000e+00 5.17698394e-03
  8.46700302e-08 6.79058186e-04]
 [6.51443869e-02 6.11518854e-02 0.00000000e+00 8.40811885e-03
  0.00000000e+00 0.00000000e+00 0.00000000e+00 1.09107226e-03
  0.00000000e+00 0.00000000e+00]
 [4.35713959e-03 2.75717939e-02 0.00000000e+00 0.00000000e+00
  0.00000000e+00 2.27908873e-02 0.00000000e+00 8.82905660e-02
  0.00000000e+00 2.39390397e-16]
 [3.43456763e-02 5.79639104e-04 3.06596547e-03 0.00000000e+00
  0.00000000e+00 2.50565357e-02 1.05897780e-02 0.00000000e+00
  0.00000000e+00 9.20000663e-03]
 [1.61820688e-02 0.00000000e+00 3.75697641e-03 0.00000000e+00
  6.35541109e-11 3.95591922e-03 8.51394041e-03 0.00000000e+00
  1.68173246e-02 0.00000000e+00]
 [7.83745392e-03 6.40798550e-02 3.48897365e-04 2.57943711e

We may then want to impose stronger sparsity constraints or prevent the weights from becoming too large. To address both, we can add L1 and L2 regularization penalties on the entries of the two matrices.
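One generic way to write such a regularized objective is shown here. The symbols $\alpha$ (overall strength) and $\rho$ (share of the penalty given to the L1 term) are assumptions introduced for illustration, and the exact scaling of each term differs between libraries and versions:

$$
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\lVert X - WH \rVert_F^2
+ \alpha \rho \left( \lVert W \rVert_1 + \lVert H \rVert_1 \right)
+ \tfrac{\alpha (1 - \rho)}{2} \left( \lVert W \rVert_F^2 + \lVert H \rVert_F^2 \right)
$$

The L1 term pushes entries to exactly zero, which encourages sparsity, while the squared Frobenius term discourages large weights. In scikit-learn's NMF the mix between the two is set with `l1_ratio`, and the strength is set with `alpha` in older releases or `alpha_W` and `alpha_H` in newer ones.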

# LSA (Latent Semantic Analysis)


Latent Semantic Analysis, or LSA, is one of the foundational techniques in topic modeling. The core idea is to take a matrix of what we have — documents and terms — and decompose it into a separate document-topic matrix and a topic-term matrix.
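A minimal NumPy sketch of this decomposition, assuming a plain truncated SVD on a made-up document-term matrix. The matrix values, the variable names, and the choice of two topics are hypothetical.

```python
import numpy as np

# Hypothetical document-term matrix A: 4 documents (rows), 5 terms (columns)
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Full (thin) SVD: A = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(A, full_matrices=False)

t = 2  # number of topics to keep
doc_topic = U[:, :t] * s[:t]  # m x t document-topic matrix
topic_term = Vt[:t, :]        # t x n topic-term matrix

# Rank-t approximation of the original matrix
A_approx = doc_topic @ topic_term
print(doc_topic.round(2))
print(topic_term.round(2))
```

Unlike NMF, the entries of these matrices can be negative, which makes individual LSA topics harder to interpret directly.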

The first step is generating our document-term matrix. Given m documents and n words in our vocabulary, we can construct an m × n matrix A in which each row represents a document and each column represents a word. In the simplest version of LSA, each entry can simply be a raw count of the number of times the j-th word appeared in the i-th document. In practice, however, raw counts do not work particularly well because they do not account for the significance of each word in the document. For example, the word “nuclear” probably informs us more about the topic(s) of a given document than the word “test.”

Consequently, LSA models typically replace raw counts in the document-term matrix with a tf-idf score. Tf-idf, or term frequency-inverse document frequency, assigns a weight for term j in document i as follows:
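In its most common textbook form (an assumption here; other variants exist), the weight is:

$$
w_{i,j} = \mathrm{tf}_{i,j} \times \log \frac{N}{\mathrm{df}_j}
$$

where $\mathrm{tf}_{i,j}$ is the number of times term $j$ occurs in document $i$, $\mathrm{df}_j$ is the number of documents that contain term $j$, and $N$ is the total number of documents. A term gets a large weight when it is frequent in the document but rare across the corpus. scikit-learn's TfidfVectorizer uses a smoothed variant: with `smooth_idf=True` the idf factor becomes $\ln\frac{1 + N}{1 + \mathrm{df}_j} + 1$, and each document row is then L2-normalized by default.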


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
  
# tf-idf matrix: 
vectorizer = TfidfVectorizer(stop_words='english', 
                             use_idf=True, 
                             smooth_idf=True)


In [None]:
# SVD to reduce dimensionality: 
svd_model = TruncatedSVD(n_components=100,
                         algorithm='randomized',
                         n_iter=10)



In [None]:
# pipeline of tf-idf + SVD, fit to and applied to documents:
svd_transformer = Pipeline([('tfidf', vectorizer), 
                            ('svd', svd_model)])

In [None]:
svd_matrix = svd_transformer.fit_transform(text_data)
# svd_matrix can later be used to compare documents, compare words, or compare queries with documents

In [None]:
svd_matrix

array([[ 0.10844544,  0.01526925, -0.03565005, ..., -0.02215143,
        -0.01419781, -0.03082219],
       [ 0.07258907,  0.06080842,  0.01094119, ...,  0.01814924,
        -0.01513074, -0.00087727],
       [ 0.21922509,  0.05713034,  0.00277719, ...,  0.00439044,
        -0.04791822,  0.02120843],
       ...,
       [ 0.0536384 ,  0.03130991, -0.01026261, ...,  0.00100542,
        -0.00196158,  0.00296829],
       [ 0.07042151, -0.02059494, -0.00911099, ..., -0.02152662,
         0.02946954, -0.0337031 ],
       [ 0.05823994,  0.01816512, -0.03403748, ..., -0.01148088,
         0.00450796,  0.0343873 ]])