# LSA Demonstrator
In this tutorial, you will learn how to use Latent Semantic Analysis (LSA) to discover hidden topics in a set of documents in an unsupervised way.
Later, you'll use the LSA values as feature vectors to classify documents with known categories.

## Imports

In [None]:
# Load file from drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
import os
os.chdir("drive/MyDrive/NLP @ X_HEC - 2K21/Cours 3 - Embedding part 1/") # path to your drive folder

In [None]:
!pwd
!ls

In [None]:
!pip3 install nltk

In [None]:
#import modules
import os
import pandas as pd
import numpy as np
from string import punctuation

import nltk
from nltk import WordNetLemmatizer, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

nltk.download("stopwords")
nltk.download('punkt')
nltk.download("wordnet")

## Preprocessing function

In [None]:
stop_words = nltk.corpus.stopwords.words("english")
stop_char = stop_words + list(punctuation)

In [None]:
def preprocessing(sentence):
    """ Basic processing of a document, word by word. 
    Outputs a list of processed tokens
    """
    # Tokenization
    tokens = word_tokenize(sentence)
    # Lowercase, strip apostrophes, and drop stopwords and punctuation
    tokens = [token.lower().replace("'", "") for token in tokens if token.lower() not in stop_char]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Drop tokens of two characters or fewer
    tokens = [token for token in tokens if len(token) > 2]
    
    return tokens

## Work on your data!

In [None]:
## import your cleaned data
path_to_your_dataset = 'data/clean_text_scrapped_data_2021.csv'

reviews = pd.read_csv(path_to_your_dataset,
            low_memory=False,
            parse_dates=['rating_date', 'diner_date']
            )

### Preprocessing

In [None]:
## apply preprocessing on each document (optional)
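
One way to apply a token-level preprocessing function to every document is `Series.apply`, followed by joining the tokens back into a string so a vectorizer can consume them. The sketch below is only illustrative: `preprocessing_stub` is a hypothetical stand-in for the `preprocessing` function above (it avoids the NLTK downloads), and the `toy_reviews` frame and its `review` column are assumptions, not the real dataset schema.

```python
import pandas as pd


def preprocessing_stub(sentence):
    """Hypothetical stand-in for preprocessing(): lowercase, split on
    whitespace, and drop tokens of two characters or fewer."""
    return [token for token in sentence.lower().split() if len(token) > 2]


# Toy data; the column name "review" is an assumption
toy_reviews = pd.DataFrame({"review": ["The pizza was great", "Service was so slow"]})

# One list of tokens per document
toy_reviews["tokens"] = toy_reviews["review"].apply(preprocessing_stub)

# Join the tokens back into a single string per document
toy_reviews["clean_text"] = toy_reviews["tokens"].str.join(" ")

print(toy_reviews[["review", "clean_text"]])
```

With the real notebook, you would swap `preprocessing_stub` for `preprocessing` and `review` for whichever column holds the text.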

### TF-IDF vectorization
To convert the text data into a document-term matrix, we are going to use `TfidfVectorizer` from the `sklearn` library.

In [None]:
## Build TF-idf matrix from your data
## terms in columns, document in rows

#dictionary = vectorizer.get_feature_names_out()
#df_tfidf = pd.DataFrame(vect_corpus.toarray(), columns=dictionary)
#df_tfidf.head()

### Singular Value Decomposition

To perform Singular Value Decomposition, you can use `TruncatedSVD`. You specify the number of topics (latent features) you expect through `n_components`; the default value is 2. Here we keep 2 components, since we expect to discover 2 topics in this corpus. Later, you'll see how to optimize this number.
Keep in mind that the latent features are sorted by decreasing importance.

In [None]:
## Build Singular Value Decomposition using TruncatedSVD. You can choose the number of components you want to use.

#n_components = 5
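
As a hint for the cell above, the following sketch fits `TruncatedSVD` on the TF-IDF matrix of a made-up four-document corpus. The corpus and `random_state=0` are assumptions for reproducibility of the illustration; they are not part of the original exercise.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two rough themes (food and service)
corpus = [
    "great pizza and tasty pasta",
    "tasty pizza with great cheese",
    "slow service and rude waiter",
    "rude staff and very slow service",
]

tfidf = TfidfVectorizer().fit_transform(corpus)  # documents x terms

n_components = 2  # must stay below the number of terms
svd = TruncatedSVD(n_components=n_components, random_state=0)

# fit_transform returns the document coordinates in the latent space;
# svd.components_ holds the term loadings of each latent feature
lsa = svd.fit_transform(tfidf)

print(lsa.shape)             # (n_documents, n_components)
print(svd.singular_values_)  # sorted in decreasing order
```

Each row of `lsa` describes one document with `n_components` values instead of one value per term.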

In [None]:
# Convert your LSA output into a document-concept matrix

#topic_encoded_df = pd.DataFrame(lsa, columns=[f'topic_{i+1}' for i in range(n_components)])
#topic_encoded_df['corpus'] = corpus
#topic_encoded_df

### Deep dive into dictionary

Use the `components_` attribute of `svd` to get your term-concept similarity matrix.

In [None]:
#encoding_matrix = pd.DataFrame(svd.components_, index=[f'topic_{i+1}' for i in range(n_components)], columns=dictionary).T
#encoding_matrix

Have a look at the top words of each topic (rank them by the absolute value of their weights, since strongly negative weights matter as much as strongly positive ones).

In [None]:
## top words by topics

### Plot topic encoded data

We are going to represent each document by its first two latent features.
You can use the rating of each review as the target to color your documents.

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

colors = ['red', 'orange', 'yellow', 'green', 'blue']

for i, val in enumerate(topic_encoded_df['rating'].unique()):
    topic_1 = topic_encoded_df[topic_encoded_df['rating']==val]['topic_1'].values
    topic_2 = topic_encoded_df[topic_encoded_df['rating']==val]['topic_2'].values
    # Cycle through the palette if there are more rating values than colors
    color = colors[i % len(colors)]
    ax.scatter(topic_1, topic_2, alpha=0.5, label=val, color=color)
    
ax.set_xlabel('First Topic')
ax.set_ylabel('Second Topic')
ax.axvline(linewidth=0.5)
ax.axhline(linewidth=0.5)
ax.legend()

## Select the best number of components for SVD

Create a function that computes the number of components required to reach a given explained-variance threshold.
This function takes as parameters a long list of explained variance ratios (computed with a number of components close to the original number of features/terms) and the threshold. You can use the `explained_variance_ratio_` attribute of your SVD.

In [None]:
def select_n_components(var_ratio, var_threshold):
    # Set initial variance explained so far
    total_variance = 0.0
    
    # Set initial number of features
    n_components = 0
    
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        
        # Add the explained variance to the total
        total_variance += explained_variance
        
        # Add one to the number of components
        n_components += 1
        
        # If we reach our goal level of explained variance
        if total_variance >= var_threshold:
            # End the loop
            break
            
    # Return the number of components
    return n_components
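
The loop above accumulates explained variance until the threshold is reached. As a cross-check, the same count can be sketched with NumPy's cumulative sum. The name `select_n_components_vectorized` and the sample ratios are hypothetical, introduced only for this illustration.

```python
import numpy as np


def select_n_components_vectorized(var_ratio, var_threshold):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches var_threshold."""
    # Non-decreasing, because variance ratios are never negative
    cumulative = np.cumsum(var_ratio)
    # Index of the first cumulative value >= var_threshold
    first_hit = int(np.searchsorted(cumulative, var_threshold, side="left"))
    # If the threshold is never reached, fall back to using every component
    return min(first_hit + 1, len(cumulative))


ratios = [0.4, 0.3, 0.2, 0.1]  # made-up explained variance ratios
print(select_n_components_vectorized(ratios, 0.5))  # prints 2
```

With these ratios, the first component explains 40% of the variance and the first two explain 70%, so two components are needed to pass a 50% threshold.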

Now, perform LSA with a large number of components (one less than the number of features of your input matrix), then use your function to find a good number of components.

In [None]:
large_svd = TruncatedSVD(n_components=df_tfidf.shape[1]-1)
large_lsa = large_svd.fit_transform(df_tfidf)
threshold = 0.5

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

explained_variance = pd.Series(large_svd.explained_variance_ratio_.cumsum())
explained_variance.plot()

ax.xaxis.set_ticks(np.arange(0, len(explained_variance), 100))

ax.set_xlabel('Number of Topics')
ax.set_ylabel('Cumulative explained variance ratio')
ax.set_title('Cumulative explained variance ratio by number of topics')

In [None]:
n_opt = select_n_components(large_svd.explained_variance_ratio_, threshold)
print(f"The optimal number of components to explain {threshold*100}% of the variance is {n_opt}")

In [None]:
optimal_svd = TruncatedSVD(n_components=n_opt)
optimal_lsa = optimal_svd.fit_transform(df_tfidf)

In [None]:
optimal_encoding_matrix = pd.DataFrame(optimal_svd.components_, index=[f'topic_{i+1}' for i in range(n_opt)], columns=dictionary).T

In [None]:
for i in range(min(10, n_opt)):  # show at most the first 10 topics
    optimal_encoding_matrix[f'abs_topic_{i+1}'] = np.abs(optimal_encoding_matrix[f'topic_{i+1}'])
    top_words = optimal_encoding_matrix.sort_values(f'abs_topic_{i+1}', ascending=False).index[:5]
    print(f"Top words for topic {i+1} are : ")
    print(top_words)
    print()
    print()