

# **Python Latent Semantic Analysis (LSA) Tutorial**



Import dependencies:

In [None]:
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import rand
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

This tutorial follows the scikit-learn convention: rows are samples (here, documents) and columns are features.

Generate a random sparse binary 150x100 matrix (150 samples, 100 features) in which roughly 30% of the entries are non-zero:

In [None]:
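# rand fills about 30% of the entries with values drawn uniformly from [0, 1)
B = rand(150, 100, density=0.3, format='csr')
# overwrite every stored (non-zero) value with 1 to make the matrix binary
B.data[:] = 1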
print("B shape: " + str(B.shape))
print(B)

B shape: (150, 100)
  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 4)	1.0
  (0, 6)	1.0
  (0, 8)	1.0
  (0, 10)	1.0
  (0, 13)	1.0
  (0, 17)	1.0
  (0, 20)	1.0
  (0, 25)	1.0
  (0, 26)	1.0
  (0, 27)	1.0
  (0, 28)	1.0
  (0, 35)	1.0
  (0, 38)	1.0
  (0, 43)	1.0
  (0, 45)	1.0
  (0, 48)	1.0
  (0, 52)	1.0
  (0, 58)	1.0
  (0, 60)	1.0
  (0, 67)	1.0
  (0, 71)	1.0
  (0, 72)	1.0
  :	:
  (149, 33)	1.0
  (149, 37)	1.0
  (149, 39)	1.0
  (149, 43)	1.0
  (149, 44)	1.0
  (149, 48)	1.0
  (149, 49)	1.0
  (149, 52)	1.0
  (149, 53)	1.0
  (149, 54)	1.0
  (149, 57)	1.0
  (149, 61)	1.0
  (149, 64)	1.0
  (149, 65)	1.0
  (149, 67)	1.0
  (149, 68)	1.0
  (149, 72)	1.0
  (149, 74)	1.0
  (149, 75)	1.0
  (149, 76)	1.0
  (149, 87)	1.0
  (149, 88)	1.0
  (149, 89)	1.0
  (149, 92)	1.0
  (149, 98)	1.0


Generate a random binary query (1x100 vector):

In [None]:
query = rand(1, 100, density=0.3, format='csr')
query.data[:] = 1
print("Query shape: " + str(query.shape))
print(query)

Query shape: (1, 100)
  (0, 2)	1.0
  (0, 5)	1.0
  (0, 6)	1.0
  (0, 8)	1.0
  (0, 9)	1.0
  (0, 10)	1.0
  (0, 11)	1.0
  (0, 15)	1.0
  (0, 16)	1.0
  (0, 17)	1.0
  (0, 24)	1.0
  (0, 29)	1.0
  (0, 31)	1.0
  (0, 32)	1.0
  (0, 34)	1.0
  (0, 37)	1.0
  (0, 41)	1.0
  (0, 46)	1.0
  (0, 47)	1.0
  (0, 48)	1.0
  (0, 52)	1.0
  (0, 70)	1.0
  (0, 73)	1.0
  (0, 80)	1.0
  (0, 85)	1.0
  (0, 87)	1.0
  (0, 88)	1.0
  (0, 90)	1.0
  (0, 93)	1.0
  (0, 97)	1.0


Reduce B to k latent dimensions using a truncated SVD:


*   trunc_SVD_model is a TruncatedSVD object;
*   fit_transform is a TruncatedSVD method that computes the rank-k SVD of B and returns B projected onto the k latent dimensions (a 150xk matrix, not a full 150x100 approximation of B);
*   the fitted decomposition is stored in the state of trunc_SVD_model.

In this case k=5:

In [None]:
trunc_SVD_model = TruncatedSVD(n_components=5)
approx_B = trunc_SVD_model.fit_transform(B)
print("Approximated B shape: " + str(approx_B.shape))

Approximated B shape: (150, 5)
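
The distinction between the reduced matrix returned by fit_transform and a rank-k approximation of B in the original feature space can be illustrated with a minimal sketch. It assumes a freshly generated stand-in matrix (`B_demo`, a hypothetical name) built with NumPy so the example is self-contained, and it uses `inverse_transform` to map the reduced rows back to 100 features:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for B: random binary 150x100 matrix, ~30% ones.
rng = np.random.default_rng(0)
B_demo = csr_matrix((rng.random((150, 100)) < 0.3).astype(float))

k = 5
svd = TruncatedSVD(n_components=k, random_state=0)

# Reduced representation: one k-dimensional row per sample.
reduced = svd.fit_transform(B_demo)

# Map the reduced rows back to the original 100-dimensional feature space.
# The result has the shape of B_demo but rank at most k.
rank_k_B = svd.inverse_transform(reduced)

print(reduced.shape)
print(rank_k_B.shape)
print(np.linalg.matrix_rank(rank_k_B))
```

For retrieval only the reduced matrix is needed; the back-projection is shown purely to make the difference between the two objects concrete.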


Project the query into the same k-dimensional space using the transform method of trunc_SVD_model:



In [None]:
transformed_query = trunc_SVD_model.transform(query)
print("Transformed query: " + str(transformed_query))
print("Query shape: " + str(transformed_query.shape))

Transformed query: [[ 2.92657792 -0.13633113  0.44262241 -0.07563084  0.50171465]]
Query shape: (1, 5)
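
As a hedged illustration of what transform does, the sketch below checks that it amounts to multiplying the query by the transposed `components_` matrix of the fitted model. The matrices `B_demo` and `query_demo` are hypothetical stand-ins generated here so the block runs on its own:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-ins for B and the query.
rng = np.random.default_rng(0)
B_demo = csr_matrix((rng.random((150, 100)) < 0.3).astype(float))
query_demo = csr_matrix((rng.random((1, 100)) < 0.3).astype(float))

svd = TruncatedSVD(n_components=5, random_state=0).fit(B_demo)

# Projection done by the library.
via_transform = svd.transform(query_demo)

# Same projection by hand: (1 x 100) times (100 x 5).
via_matmul = query_demo.toarray() @ svd.components_.T

print(np.allclose(via_transform, via_matmul))
```

Because the query is projected with the components learned from B, the query and the documents end up in the same latent space and can be compared directly.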


Compute the cosine similarity between the transformed query and each row of approx_B (one row per document):

In [None]:
similarities = cosine_similarity(approx_B, transformed_query)
print("Similarities shape: " + str(similarities.shape))
print(similarities)

Similarities shape: (150, 1)
[[0.8934839 ]
 [0.84462397]
 [0.80340683]
 [0.94575057]
 [0.96345244]
 [0.77970967]
 [0.90874733]
 [0.9182778 ]
 [0.97617859]
 [0.90687963]
 [0.94176384]
 [0.96686605]
 [0.88603408]
 [0.71063741]
 [0.88268144]
 [0.98725853]
 [0.89103844]
 [0.83972007]
 [0.76214606]
 [0.96489901]
 [0.73040484]
 [0.89806628]
 [0.87854962]
 [0.88574412]
 [0.84057716]
 [0.79710854]
 [0.93806154]
 [0.8795631 ]
 [0.7690248 ]
 [0.80258448]
 [0.86202563]
 [0.75742092]
 [0.81955435]
 [0.9123917 ]
 [0.91484657]
 [0.89147405]
 [0.98252963]
 [0.90050718]
 [0.83865397]
 [0.86750468]
 [0.89849733]
 [0.92581713]
 [0.88088155]
 [0.83986179]
 [0.77806228]
 [0.83426649]
 [0.81780456]
 [0.91516863]
 [0.80551264]
 [0.94092467]
 [0.89172797]
 [0.81355134]
 [0.91285809]
 [0.73249929]
 [0.91382419]
 [0.81668688]
 [0.79849338]
 [0.67257302]
 [0.95191962]
 [0.81024958]
 [0.998709  ]
 [0.82685278]
 [0.92723247]
 [0.92017889]
 [0.81241731]
 [0.94838725]
 [0.86837658]
 [0.79252057]
 [0.78354931]
 ...
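
To make the similarity measure concrete, here is a small sketch that computes cosine similarity by hand and compares it with cosine_similarity. The arrays `docs` and `q` are made-up toy values, not taken from the run above:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy reduced documents (3 documents, 2 latent dimensions) and one query.
docs = np.array([[1.0, 0.0],
                 [1.0, 1.0],
                 [0.0, 2.0]])
q = np.array([[2.0, 0.0]])

# cosine(d, q) = (d . q) / (||d|| * ||q||), evaluated for every row d.
dot_products = docs @ q.T
norms = np.linalg.norm(docs, axis=1, keepdims=True) * np.linalg.norm(q)
manual = dot_products / norms

print(manual.ravel())
print(np.allclose(manual, cosine_similarity(docs, q)))
```

The result has one row per document and one column per query, which is why the similarities above have shape (150, 1).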

Take the indexes of the n most similar documents (np.argsort sorts in ascending order, so the best match comes last):

In [None]:
n=3
indexes = np.argsort(similarities.flat)[-n:]
print("Top n documents: " + str(indexes))
print("Top n similarities: " + str(similarities.flat[indexes]))

Top n documents: [108 122  60]
Top n similarities: [0.98972664 0.99338836 0.998709  ]
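
If a best-first ordering is preferred, one option is to reverse the argsort result before slicing. The sketch below uses a small made-up similarity column (`sims`, a hypothetical name) rather than the values above:

```python
import numpy as np

# Toy similarity column with the same (n_documents, 1) layout as above.
sims = np.array([[0.2], [0.9], [0.5], [0.7]])
n = 2

# argsort is ascending; reverse it, then keep the first n indexes.
order = np.argsort(sims.ravel())[::-1][:n]

print(order)                # [1 3]
print(sims.ravel()[order])  # [0.9 0.7]
```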


Convert a text corpus to a TF-IDF matrix:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']
vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
               'and', 'one']

# use CountVectorizer to count word occurrences, restricted to the given vocabulary
vectorizer = CountVectorizer(vocabulary=vocabulary)

# transform the count matrix to a normalized tf-idf representation
# norm='l2' scales each row to unit Euclidean length (cosine normalization); norm=None disables normalization
transformer = TfidfTransformer(norm='l2')
TFIDF = transformer.fit_transform(vectorizer.fit_transform(corpus))

print(TFIDF.shape)
print(TFIDF.toarray())

(4, 8)
[[0.38408524 0.46979139 0.58028582 0.38408524 0.         0.38408524
  0.         0.        ]
 [0.28108867 0.6876236  0.         0.28108867 0.53864762 0.28108867
  0.         0.        ]
 [0.31091996 0.         0.         0.31091996 0.         0.31091996
  0.59581303 0.59581303]
 [0.38408524 0.46979139 0.58028582 0.38408524 0.         0.38408524
  0.         0.        ]]
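
The TF-IDF matrix can be fed into the same LSA steps used earlier for the random matrix. The following is a minimal sketch under stated assumptions: k=2 is an arbitrary choice for this tiny corpus, and the query text `'first document'` is a made-up example.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']
vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
              'and', 'one']

vectorizer = CountVectorizer(vocabulary=vocabulary)
transformer = TfidfTransformer(norm='l2')
tfidf_docs = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Assumption: k=2 latent dimensions, chosen only for illustration.
lsa = TruncatedSVD(n_components=2, random_state=0)
lsa_docs = lsa.fit_transform(tfidf_docs)

# Hypothetical query; it goes through the same fitted steps as the corpus.
query_text = ['first document']
tfidf_query = transformer.transform(vectorizer.transform(query_text))
lsa_query = lsa.transform(tfidf_query)

# Rank the documents from most to least similar to the query.
sims = cosine_similarity(lsa_docs, lsa_query).ravel()
ranking = np.argsort(sims)[::-1]

print(ranking)
print(sims[ranking])
```

The first and fourth documents contain the same words, so they receive the same TF-IDF row and therefore the same similarity score.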
