Random Stuff:

### Why is PoS (Part-of-Speech) tagging needed in text preprocessing?

- *Apple* is a great company. $\rightarrow$ Here "Apple" is a **Proper Noun**
- Did you eat that *Apple*? $\rightarrow$ Here "Apple" is a **Common Noun**

The same word plays a different grammatical role, and carries a different meaning, in each sentence. PoS tagging makes this distinction explicit, which is why it is an important preprocessing step.

## Latent Semantic Analysis(LSA)

### **Intuition behind LSA:**

When we write text, the words are not chosen at random from a vocabulary.

Rather, we think about a theme (or topic) and then choose words that express our thoughts to others in a meaningful way. This theme or topic is usually treated as a latent dimension.

It is latent because we can’t see the dimension explicitly; we understand it only after going through the text. This means that most words are **semantically linked** to other words in order to express a theme. So, if words occur in a collection of documents with varying frequencies, those frequencies should indicate how different people express different topics or themes using different words.

In other words, word frequencies in different documents play a key role in extracting the latent topics.

- *One Line Definition of What LSA is:*
    - ***LSA tries to extract the underlying theme/context/topics present in the text documents using Singular Value Decomposition(SVD).***

* **

### **LSA - Wikipedia**

- Latent semantic analysis (LSA) is a technique in NLP for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts/topics related to the documents and terms.

- LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). 

    - The distributional hypothesis in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to purport similar meanings.
    
    - The underlying idea that "a word is characterized by the company it keeps" was popularized by Firth in the 1950s<br></br>

- A matrix of word counts per document (rows represent unique words, columns represent documents) is constructed from a large body of text. Singular value decomposition (SVD) is then used to reduce the number of rows while preserving the similarity structure among the columns. Documents are compared by the cosine similarity between their columns: values close to 1 indicate very similar documents, while values close to 0 indicate very dissimilar ones.
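
To make the count matrix and the column-wise cosine similarity concrete, here is a minimal sketch in plain Python. The three-sentence corpus and the helper names (`column`, `cosine_similarity`) are assumptions made up for illustration, and no SVD is applied yet.

```python
import math
from collections import Counter

# Hypothetical toy corpus, chosen only for illustration.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stocks fell as markets closed",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Term-document count matrix: one row per unique word, one column per document.
counts = [Counter(tokens) for tokens in tokenized]
X = [[c[word] for c in counts] for word in vocab]

def column(matrix, j):
    """Return column j of a list-of-rows matrix (the vector for document j)."""
    return [row[j] for row in matrix]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_01 = cosine_similarity(column(X, 0), column(X, 1))
sim_02 = cosine_similarity(column(X, 0), column(X, 2))
print(f"doc0 vs doc1: {sim_01:.3f}")  # 0.750 (many shared words)
print(f"doc0 vs doc2: {sim_02:.3f}")  # 0.000 (no shared words)
```

On raw counts, two documents that share no words always score 0, even if they are about the same topic. That limitation is what the SVD step is meant to soften.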


* **

- LSA uses a **document-term matrix**, a sparse matrix that records how often each term occurs in each document of a collection. In a document-term matrix, rows correspond to documents and columns correspond to terms. It is also common to encounter the transpose, the **term-document matrix**, where terms are the rows and documents are the columns.


- In practice, however, raw counts do not work particularly well because they do not account for the significance of each word in the document.

- Instead of simply using raw term frequencies in the matrix, we can weight the counts using **tf-idf** (term frequency–inverse document frequency): the weight of each element grows with the number of times the term appears in the document and shrinks with the number of documents that contain the term, so rare terms are upweighted to reflect their relative importance.
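
A rough sketch of the tf-idf weighting follows. It assumes the plain textbook variant (raw count times $\log(N / \text{df})$); libraries often use smoothed or normalized variants, so the exact numbers will differ from theirs. The corpus and the helper `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; "the" appears in every document.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "the stocks fell as markets closed",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}

# Plain (unsmoothed) inverse document frequency: assumption, one of several variants.
idf = {w: math.log(n_docs / df[w]) for w in vocab}

def tfidf_row(tokens):
    """tf-idf weights for one document, ordered like `vocab`."""
    tf = Counter(tokens)
    return [tf[w] * idf[w] for w in vocab]

# Document-term layout: one row per document, one column per term.
tfidf = [tfidf_row(tokens) for tokens in tokenized]

the_idx = vocab.index("the")
mat_idx = vocab.index("mat")
print("weight of 'the' in doc0:", tfidf[0][the_idx])            # common word, weight 0
print("weight of 'mat' in doc0:", round(tfidf[0][mat_idx], 3))  # rare word, upweighted
```

Even though "the" occurs twice in the first document, its weight drops to zero because it occurs in every document, while the rarer "mat" keeps a positive weight.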

* **

### **How does LSA work?**

1. LSA creates a document-term matrix or a term-document matrix. $$\large{X_{m \times n}} \rightarrow \normalsize{\text{term-document matrix}}$$ where $m$ is the number of unique terms and $n$ is the number of documents (the roles of rows and columns swap for a document-term matrix).

<div align='center'>
    <img src="images/lsa_1.png" width='1000'>
</div>


> The dot product $\textbf{t}_i^T\textbf{t}_p$ between two term vectors **gives the correlation between the terms over the set of documents**. The matrix product $\large{XX^T}$ contains all these dot products, and $\large{XX^T}$ is symmetric, i.e. $\textbf{t}_i^T\textbf{t}_p = \textbf{t}_p^T\textbf{t}_i$.

> Similarly, the matrix $\large{X^TX}$ contains the dot products between all the document vectors, **giving their correlation over the terms**. The matrix $\large{X^TX}$ is also symmetric, i.e. $\textbf{d}_j^T\textbf{d}_q = \textbf{d}_q^T\textbf{d}_j$.
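
The two Gram matrices can be checked numerically. This is a small NumPy sketch on a made-up $4 \times 3$ term-document matrix. Note that the entries are raw (unnormalized) dot products, so "correlation" here is meant loosely.

```python
import numpy as np

# Hypothetical 4-term x 3-document count matrix (rows = terms, columns = documents).
X = np.array([
    [2, 2, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 3],
], dtype=float)

term_gram = X @ X.T  # (m x m): entry [i, p] is the dot product of term rows i and p
doc_gram = X.T @ X   # (n x n): entry [j, q] is the dot product of document columns j and q

# Both matrices are symmetric.
assert np.allclose(term_gram, term_gram.T)
assert np.allclose(doc_gram, doc_gram.T)

# Individual entries match the explicit dot products.
assert np.isclose(term_gram[0, 1], X[0] @ X[1])
assert np.isclose(doc_gram[0, 2], X[:, 0] @ X[:, 2])

print(term_gram)
print(doc_gram)
```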



2. By **Singular Value Decomposition (SVD)**, any $m \times n$ matrix can be decomposed into three matrices as: $$\large{X = U \Sigma V^{T}}$$ where $\large{U_{m \times m}}$ and $\large{V_{n \times n}}$ are orthogonal matrices and $\large{\Sigma_{m \times n}}$ is a diagonal matrix, not necessarily square.

    - The elements along the diagonal of $\large{\Sigma_{m \times n}}$ are known as the **singular values of the matrix $\large{X}$**. They are the square roots of the non-zero eigenvalues of $\large{XX^T}$ (equivalently, of $\large{X^TX}$).

    - The columns of $\large{U_{m \times m}}$ are known as the **left-singular vectors**, which are eigenvectors of $\large{XX^T}$.

    - The columns of $\large{V_{n \times n}}$ are known as the **right-singular vectors**, which are eigenvectors of $\large{X^TX}$.
    
    - By convention, the singular values in the diagonal matrix, and the corresponding singular vectors in $\large{U}$ and $\large{V}$, are arranged in descending order of singular value.<br></br>
    

3. When you keep only the $\large{k}$ largest singular values $(k \ll \min(m, n))$ and their corresponding singular vectors from $\large{U}$ and $\large{V}$, you get the rank-$\large{k}$ approximation to $\large{X}$ with the smallest possible error, measured in the Frobenius norm (the Eckart-Young theorem). We can now treat the term and document vectors as living in a ["semantic space"](https://en.wikipedia.org/wiki/Semantic_space).  $$\large{X_{k} = U_{k} \Sigma_{k} V_{k}^{T}}$$

In terms of shapes, $U_k$ is $m \times k$, $\Sigma_k$ is $k \times k$, and $V_k^T$ is $k \times n$:

$$\large{X_{k} = U_{m \times k} \Sigma_{k \times k} V_{k \times n}^{T}}$$
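
The decomposition and the rank-$k$ truncation can be sketched with `numpy.linalg.svd`. The random $6 \times 4$ matrix below is only a stand-in for a real term-document matrix, and $k = 2$ is an arbitrary choice for illustration.

```python
import numpy as np

# Stand-in for a term-document matrix: 6 terms x 4 documents of small random counts.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 4)).astype(float)

# Thin SVD: U is (6 x 4), s holds the 4 singular values, Vt is (4 x 4).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# NumPy returns the singular values in descending order.
assert np.all(s[:-1] >= s[1:])

# Squared singular values match the largest eigenvalues of X X^T.
eigvals = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
assert np.allclose(s ** 2, eigvals[: len(s)])

# Rank-k approximation: keep the k largest singular values and their vectors.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the root of the sum of squared discarded singular values.
frob_error = np.linalg.norm(X - X_k, ord="fro")
assert np.isclose(frob_error, np.sqrt(np.sum(s[k:] ** 2)))

print("singular values:", np.round(s, 3))
print("rank-k Frobenius error:", round(float(frob_error), 3))
```

The last assertion is the numerical face of the minimal-error property: no other rank-$k$ matrix gets closer to $X$ in the Frobenius norm.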


* **

### Interpretation of Mathematical Symbols

4. **This is what LSA does:** Essentially, we have reduced the dimensions of the term and document vectors.

    - Each row in the matrix $\large{U_{k}}$, which is an $m \times k$ matrix, **represents a term vector** reduced to $k$ dimensions.
    
        - **The matrix $\large{U}$ is called the term-topic matrix.** Each value represents the relation between a term and a (latent) topic, i.e. how closely that particular term is related to the topic. Entries with a large absolute value indicate a strong association.<br></br>
    
    - Each column in $\large{V_{k}^{T}}$, which is a ${k \times n}$ matrix, **represents a document vector** reduced to $k$ dimensions.
    
        - **The matrix $\large{V}$ is called the document-topic matrix.** Each value represents the relation between a document and a (latent) topic, i.e. how closely that particular text/document is related to the topic. Entries with a large absolute value indicate a strong association.<br></br>
        
        
    - The matrix $\large{\Sigma}$ is diagonal. Its entries, the singular values, indicate the strength (importance) of each identified (latent) topic in the collection.
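
As a rough illustration of how the three factors are read, the sketch below pulls term and document vectors out of a truncated SVD. The five terms and the count matrix are hypothetical, and scaling the vectors by the singular values is one common convention rather than the only one.

```python
import numpy as np

# Hypothetical term-document counts (rows = terms, columns = documents).
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Row i of U_k is term i in k latent dimensions (term-topic matrix).
# Column j of Vt_k is document j in k latent dimensions (document-topic matrix, transposed).
print("topic strengths (singular values):", np.round(s_k, 3))
for term, row in zip(terms, U_k):
    print(f"{term:>6}: {np.round(row, 3)}")

# Optional convention (an assumption here): scale by the singular values
# so that stronger topics contribute more to distances and similarities.
term_vectors = U_k * s_k               # equals U_k @ diag(s_k), shape (m, k)
doc_vectors = (s_k[:, None] * Vt_k).T  # equals (diag(s_k) @ Vt_k).T, shape (n, k)
```

The signs of the entries depend on the SVD routine, so only relative signs and magnitudes within a topic are meaningful.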


<div align='center'>
    <img src="images/lsa_2.png" width='1000'>
</div>


- For a visual explanation, [read this article](https://www.geeksforgeeks.org/latent-semantic-analysis/).

### Application of Document and Term Vectors

With these document vectors and term vectors, we can calculate measures such as cosine similarity to evaluate:

1. The similarity of different documents.
2. The similarity of different words.
3. The similarity between terms (or queries) and documents, which is useful in information retrieval when we want to retrieve the passages most relevant to a search query.
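
The retrieval use case can be sketched as follows. The sketch assumes the query is projected into the latent space with $U_k^T$ (so documents become the columns of $U_k^T X = \Sigma_k V_k^T$), which is one common way of "folding in" a query. The terms, counts, and the one-word query are made up.

```python
import numpy as np

# Hypothetical term-document counts (rows = terms, columns = documents).
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k = U[:, :k]

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D arrays."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents in the k-dimensional latent space: rows of (U_k^T X)^T.
doc_vectors = (U_k.T @ X).T

# Query containing only the word "dog", as a term-count vector.
query = np.array([1.0 if t == "dog" else 0.0 for t in terms])
query_vector = U_k.T @ query  # same projection as the documents

sims = [cosine_similarity(query_vector, d) for d in doc_vectors]
raw_sims = [cosine_similarity(query, X[:, j]) for j in range(X.shape[1])]

print("latent-space similarities:", np.round(sims, 3))
print("raw-count similarities:   ", np.round(raw_sims, 3))
```

Document 1 never mentions "dog", so its raw-count similarity to the query is 0. In the latent space it still scores high, because it shares the "cat/pet" topic with the document that does mention "dog".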

### Code-Implementation of LSA

[Databricks Academy - LSA](https://youtube.com/playlist?list=PLroeQp1c-t3qwyrsq66tBxfR6iX6kSslt)

- References:
    1. [Wikipedia](https://en.wikipedia.org/wiki/Latent_semantic_analysis)
    2. [TDS](https://towardsdatascience.com/latent-semantic-analysis-intuition-math-implementation-a194aff870f8)
    3. [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/09/latent-semantic-analysis-and-its-uses-in-natural-language-processing/)
    4. [GFG](https://www.geeksforgeeks.org/latent-semantic-analysis/)
    5. [Databricks Academy - LSA](https://youtube.com/playlist?list=PLroeQp1c-t3qwyrsq66tBxfR6iX6kSslt)