<div style="text-align: center;"> <!-- This div will center all its contents -->
  <img src="https://scontent.fopo6-1.fna.fbcdn.net/v/t39.30808-6/327345211_708012977623591_5371889953719216000_n.png?_nc_cat=104&ccb=1-7&_nc_sid=5f2048&_nc_eui2=AeGA4Epi5DPgQWGmwJnzDzYwlTHqnE4dPp2VMeqcTh0-ndnVzTPGmZ1C7LYJvEsh0wc&_nc_ohc=oHf3AV_aUB0AX_auBWi&_nc_ht=scontent.fopo6-1.fna&oh=00_AfCTA0yaHCQugeMu_44t-6cLSKGa53d67a0DpQQ-fVTGYg&oe=654F295F" width="570" height="250" style="display: block; margin: auto;"/> <!-- This will center the image -->
  <div><strong style="color: #4F5B63;">Master in Data Science for Social Sciences</strong></div>
  <div><strong style="color: #4F5B63;">University of Aveiro</strong></div>
</div>


<div style="display: flex; justify-content: space-around; align-items: flex-start;">
  <div style="width: 100%; padding: 10px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); margin: 10px;">
    <h2><h1 style="text-align: center; font-size: 4em; color: #46627F; margin-top: 0; margin-bottom: 0; line-height: 1;">Latent Semantic Analysis</h1>
<h1 style="text-align: center; color: #B1C0CF; margin-top: 0; margin-bottom: 0; line-height: 1;"> -Deduce the hidden topic from the document- </h1></h2>
      </div>
</div>


## Topic Model

... is a analytical method used in the realm of natural language processing to uncover the latent thematic structure within a large corpus of text. At its core, it discovers pattherns of words that frequently occur together in documents. These pattherns form what we refer to as "topics," which are abstract themes that pervade the text. Each topic is essentially a collection of terms that are statistically significant for defining the content of the documents.

One of the most compelling applications of topic modeling is in the organization and summarization of large datasets of unstructured text. The power of topic modeling lies in its ability to distill large volumes of text into the essence of what is being discussed, which can be invaluable for information retrieval, understanding content trends, and data organization

![Analytic vidhya — Topic Modeling](https://cdn-images-1.medium.com/max/2000/1*EECZMH6ZpM8QjKl0joa0fw.png)

## Latent Semantic Analysis (LSA)

... is a technique in natural language processing that helps in identifying patterns and relationships within a collection of texts. By analyzing the context in which words appear, LSA can effectively deduce the underlying, or "latent," topics that are present in a body of text.

LSA operates on the principle that words that are used in the same contexts tend to have similar meanings. It transforms the text into a matrix of terms and documents, applying singular value decomposition (SVD) to reduce the number of rows while preserving the similarity structure among columns. This process condenses the information, capturing the essence of the text in a way that highlights the relationships between the terms and their associated concepts.

The "topics" uncovered by LSA are not explicit labels like those we might manually assign to articles (e.g., "finance" or "health"). Instead, they are patterns of word usage that represent the text's content without preconceived categories. These patterns can then be used to group similar documents together, making LSA a powerful tool for document clustering.

As an unsupervised method, LSA does not require pre-labeled training data. It doesn't start with a known topic for each document; rather, it discovers the topics from the text itself. This makes LSA particularly useful for exploring large sets of unstructured data where the topics are not previously known. It's widely used in fields such as information retrieval, text mining, and document categorization, helping to uncover the hidden structure in text and facilitate a deeper understanding of the content.

## Why LSA?

Most simple way of finding similar documents is by using vector representation of text and cosine similarity. Vector representation represents each document in the form of vector. This vector is known as document-term matrix.

...offers a more nuanced approach to text analysis compared to the straightforward vector representation and cosine similarity method. While the basic vector space model represents documents as vectors of term frequencies in a high-dimensional space, it has limitations, particularly in handling synonyms and polysemy (words with multiple meanings).

Here's why LSA can be a better choice:

**Contextual Meaning**: LSA goes beyond mere word frequency counts. It attempts to capture the contextual-usage patterns of words in documents. This allows LSA to understand different meanings of the same word in different contexts (polysemy) and to recognize that different words may share similar meanings (synonymy).

**Dimensionality Reduction**: LSA applies singular value decomposition (SVD) to the document-term matrix, reducing the dimensionality of the data. This reduction filters out noise and insignificant details, helping to highlight the underlying structure of the text.

**Conceptual Grouping**: By reducing dimensions, LSA groups together terms that are conceptually related, even if they do not co-occur frequently. This helps in identifying the latent concepts or topics within the text.

**Handling Ambiguity**: LSA's ability to deal with synonyms and polysemy makes it superior for understanding the true meaning of words in context, which is a common challenge in natural language processing.

**Scalability**: Although computationally intensive, LSA can be scaled to handle large datasets, making it suitable for big data applications in text analysis.

**Improved Similarity Measures**: When documents are represented in the reduced LSA space, similarity measures such as cosine similarity become more meaningful, as they now reflect conceptual rather than just term-based similarity.


For example:

In [2]:
a1 = "the petrol in this car is low"
a2 = "the vehicle is short on fuel"

Taking into account the two aforementioned strings, we can discern from the context that they bear a resemblance to each other. We will endeavor to quantify the degree of similarity between these strings by employing vector representation.

The document-term matrix for the given example is as follows:

![image.png](attachment:image.png)

The dimensions of the document-term matrix are determined by the number of documents multiplied by the size of the vocabulary. The size of the vocabulary refers to the total count of unique words found across all documents. In this scenario, the vocabulary size is 11, and there are 2 documents.

The similarity between the documents is calculated using the cosine similarity measure applied to their respective vectors in the matrix. For the documents labeled a1 and a2, the similarity score is 0.3086067, which is unexpectedly low given that the documents share a similar context. This highlights a limitation of using a simple document-term matrix and vector representation: it sometimes fails to accurately reflect the true semantic similarity between texts. Another drawback is the potentially large size of the vocabulary, which can lead to a vast and computationally demanding matrix.

These shortcomings of vector representation necessitate a more advanced method for assessing document similarity and uncovering the implicit topics within them. The desired technique would need to address the issue of synonymous terms and remain computationally efficient. Latent Semantic Analysis (LSA) emerged as the proposed solution to these challenges, offering a way to overcome the limitations of the traditional vector space model.

# Working of LSA
## Term Co-occurrence Matrix
The term co-occurrence matrix is a square matrix with dimensions equal to the size of the vocabulary, which means it has as many rows and columns as there are unique words in the dataset. Each cell in this matrix indicates how frequently pairs of words appear together within a certain context in the dataset. This matrix is instrumental in identifying which words tend to co-occur, thereby providing insights into the relationships and associations between different words.

For the example provided, the term co-occurrence matrix would be structured as follows:

![image.png](attachment:image.png)

As we can see the words the and is are the most common but are not very useful in meanings of the sentence. We’ll see how to use this matrix and it’s benefits later in this blog.

***Concepts***
- The LSA returns concepts instead of topics which represents the given document. The concepts are list of words which represents the document in the best possible way.

For example, in the dataset of sports document the concepts can be

*Concept 1: ball, shoes, goals, win*

*Concept 2: ball, bat, score, umpire*

In the scenario described, there are two concepts — Concept 1 is associated with football, and Concept 2 is linked to cricket. It's evident that there can be an overlap of words between these concepts, indicating that a group of words collectively represents a concept rather than each word individually signifying a distinct idea. Latent Semantic Analysis (LSA) leverages this understanding by using the term co-occurrence matrix to identify the optimal grouping of words, referred to as a 'concept', that best represents the content of the documents. LSA's strength lies in its ability to discern these conceptual groupings, which encapsulate the essence of the documents more effectively than individual words could.

*Concept is also a way of representing the document through dimension reduction.*

**Singular Value Decomposition (SVD)** 

The document-term matrix is typically characterized by its sparsity and substantial size. When dealing with such matrices, computational tasks can become quite resource-intensive, and the results may not always be meaningful, especially considering that many of the values in the matrix are zero. To mitigate these computational challenges and to extract more pertinent and valuable insights, Singular Value Decomposition (SVD) is employed.

SVD is a matrix factorization technique that decomposes a matrix into three distinct matrices: an orthogonal matrix representing the columns, a diagonal matrix of singular values, and an orthogonal matrix representing the rows. This decomposition helps in distilling the essence of the original matrix, reducing its dimensionality while preserving its most significant features. The process effectively captures the underlying structure in the data, which can then be used for more efficient and insightful analysis.

![image.png](attachment:image.png)

The principal benefit of Singular Value Decomposition (SVD) lies in its ability to significantly reduce the size of a matrix. In practical terms, this means that a matrix with potentially millions of dimensions can be approximated by a much smaller one, with dimensions in the order of hundreds (100 or 1000), depending on the rank (K) chosen for the matrix. The rank (K) represents the number of singular values we decide to keep, which corresponds to the number of columns and rows used in the approximation. This truncated version of the matrix preserves the most significant data features, allowing for a close approximation of the original matrix A without substantial information loss.

During the SVD process, the product of matrix A and its transpose (A*(A'T)) is computed, which effectively results in the term co-occurrence matrix. In this matrix, the value at position (i,j) indicates how frequently term(i) and term(j) co-occur across the entire dataset of documents. This matrix is a key element in understanding the relationships and associations between terms in the dataset. To gain a deeper understanding of SVD and its applications, one can explore further resources and literature on the subject.

## **Implementation**

Let's consider a small, illustrative example to understand how LSA works.

Suppose we have a set of documents, which could be sentences or paragraphs from articles, books, or any other text source. For the sake of this example, let's assume we have the following five documents:

In [11]:
a1 = "He is a good dog."
a2 = "The dog is too lazy."
a3 = "That is a brown cat."
a4 = "The cat is very active."
a5 = "I have brown cat and dog."

In this instance, it is apparent that two distinct concepts should be formulated: one representing 'cat' and the other 'dog'.

Please proceed to transform the given list of documents into a DataFrame structure:

In [12]:
import pandas as pd
df = pd.DataFrame()
df["documents"] = [a1,a2,a3,a4,a5]
df

#The df would look like:

Unnamed: 0,documents
0,He is a good dog.
1,The dog is too lazy.
2,That is a brown cat.
3,The cat is very active.
4,I have brown cat and dog.


## **Preprocessing**
The most important part of any machine learning algorithm is data preprocessing. More the noise present in data lesser the accuracy of model.

We’ll perform four types of processing on data:

 1. Remove all the special characters from the text.
 2. Remove all the words with less than 3 letters.
 3. Lowercase all the characters.
 4. Remove stop words.

In [13]:
#remove special characters
df['clean_documents'] = df['documents'].str.replace("[^a-zA-Z#]", " ")
#remove words have letters less than 3
df['clean_documents'] = df['clean_documents'].fillna('').apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
#lowercase all characters
df['clean_documents'] = df['clean_documents'].fillna('').apply(lambda x: x.lower())

For removing stop words we’ll tokenise the string and than again append all the words which are not stop words.

In [14]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# tokenization
tokenized_doc = df['clean_documents'].fillna('').apply(lambda x: x.split())
# remove stop-words
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
# de-tokenization
detokenized_doc = []
for i in range(len(df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)
df['clean_documents'] = detokenized_doc

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JLM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


After this preprocessing our data will look like:

In [15]:
df['clean_documents']

0         good dog.
1         dog lazy.
2        brown cat.
3       cat active.
4    brown cat dog.
Name: clean_documents, dtype: object

## Document-Term matrix

We’ll use sklearn for generating the document-term matrix.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
X = vectorizer.fit_transform(df['clean_documents'])

We opted for TfidfVectorizer over CountVectorizer because tf-idf offers a more effective approach to vectorization. To understand the different parameters that can be adjusted within TfidfVectorizer, you can refer to this resource. For a deeper understanding of the tf-idf mechanism, this link provides a comprehensive explanation.

The dimensions of matrix X are (5,6), indicating that there are 5 documents (as represented by the rows) and 6 unique terms (as represented by the columns).

To review the terms,

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer and transform your documents
vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
X = vectorizer.fit_transform(df['clean_documents'])

# Get the feature names using the new method
dictionary = vectorizer.get_feature_names_out()
dictionary


#which will give an array of words


array(['active', 'brown', 'cat', 'dog', 'good', 'lazy'], dtype=object)

# **Singular Value Decomposition**

In [26]:
from sklearn.decomposition import TruncatedSVD
# SVD represent documents and terms in vectors 
svd_model = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=100, random_state=122)
lsa = svd_model.fit_transform(X)

TruncatedSVD in scikit-learn applies Singular Value Decomposition (SVD) to the document-term matrix, yielding a vector representation with reduced dimensions. To obtain the matrix prior to dimension reduction, one should utilize the fit method rather than fit_transform.

The parameter n_components determines the size of the output data, correlating to the count of distinct themes identified within the data. The selection of n_components thus reflects the number of topics you're attempting to discern. For further details on SVD as implemented in scikit-learn, you can consult their documentation.

Next, we will examine the topics that have been assigned to our documents.

In [27]:
pd.options.display.float_format = '{:,.16f}'.format
topic_encoded_df = pd.DataFrame(lsa, columns = ["topic_1", "topic_2"])
topic_encoded_df["documents"] = df['clean_documents']
display(topic_encoded_df[["documents", "topic_1", "topic_2"]])

#The output will look like:

Unnamed: 0,documents,topic_1,topic_2
0,good dog.,0.3413834191239968,0.7199781067501032
1,dog lazy.,0.3413834191239964,0.719978106750103
2,brown cat.,0.8609490919302161,-0.3659836550739518
3,cat active.,0.5166658991993199,-0.3850046207843264
4,brown cat dog.,0.9494117370834864,0.0236302940661143


The topics corresponding to each document are observable. Documents related to dogs align with topic_2, while those related to cats align with topic_1. Documents that mention both cats and dogs are predominantly associated with topic_1, although they also share a connection with topic_2. This stronger association with topic_1 may be due to the presence of the words "brown" and "cat," which carry more significance in topic_1.

Additionally, we can examine the significance attributed to the terms within each topic.

In [28]:
encoding_matrix = pd.DataFrame(svd_model.components_, index = ["topic_1","topic_2"], columns = (dictionary)).T
encoding_matrix

Unnamed: 0,topic_1,topic_2
active,0.2003541259081104,-0.2424408501618364
brown,0.5965117122287046,-0.2018098984872581
cat,0.6293380994160945,-0.329885908871532
dog,0.4158307960649449,0.6169033286639753
good,0.1323826028466497,0.4533766476433693
lazy,0.1323826028466492,0.453376647643369


From the observation, it is clear that the terms "brown" and "cat" hold greater importance in topic_1 compared to topic_2. This higher weighting in topic_1 suggests that these terms are more influential or prevalent within the documents categorized under this particular topic.

**Conclusion**

1. Latent Semantic Analysis (LSA) is leveraged for cutting down the dimensions of a vector. It has the capability to condense vector dimensions from millions down to just thousands, all while preserving the original context. This reduction is beneficial for diminishing the computational power required and the time it takes to execute these computations.

2. In the realm of search engines, LSA underpins the Latent Semantic Indexing (LSI) algorithm. LSI employs the vector created by LSA to locate documents that match a given search query, facilitating more efficient and relevant search results.

3. Furthermore, LSA finds its application in document clustering. With LSA's ability to assign topics to documents, these assignments can be used to group documents into clusters, enhancing the organization and retrieval of information based on subject matter.

References:
https://medium.com/towards-data-science/latent-semantic-analysis-deduce-the-hidden-topic-from-the-document-f360e8c0614b