# Dr. Data Science 101: NLP

## Latent Semantic Analysis

### Introduction

In the real world, we humans organize our things in a variety of ways. For example, if you're a fan of books, you might file yours alphabetically on your bookshelf. Or you might instead leave them strewn about in piles according to subject.

Similarly, computers need ways to store unstructured data so that the information in it can be retrieved quickly when needed.

**Latent semantic analysis (LSA)** is one such method of indexing unstructured electronic text data.

LSA is a method of unsupervised learning: there is no target variable. Instead, we are trying to mine documents for similarities.

This can be done through a combination of three steps:

1) Determine the frequency of each term within each document (term frequency)
2) Determine the inverse document frequency of each term across all documents
3) Apply **Singular Value Decomposition** to the resulting weighted term-document matrix
            
We will not get into the technical details of this procedure but instead focus on its implementation in Python. 
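As a rough orientation before the walkthrough, the three steps can be sketched with the two scikit-learn classes this notebook imports, `TfidfVectorizer` and `TruncatedSVD`. The toy corpus, the `toy_*` variable names, and the choice of two components are illustrative assumptions, not the settings used on the real data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
toy_docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'stocks fell as markets closed',
    'markets rallied and stocks rose',
]

# Steps 1 and 2: term frequencies weighted by inverse document frequency
toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# Step 3: truncated SVD projects each document onto a few latent components
toy_svd = TruncatedSVD(n_components = 2, random_state = 100)
toy_topics = toy_svd.fit_transform(toy_tfidf)

print(toy_tfidf.shape)   # (number of documents, vocabulary size)
print(toy_topics.shape)  # (number of documents, number of components)
```

Each row of `toy_topics` is a document expressed in the reduced latent space, which is the representation LSA uses to compare documents.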

#### Text Processing

Extracting features and attributes from unstructured data is no easy task. Before we can make sense of the information at hand, we must first process it so that the data can be more easily ingested by the computer. 

This entails several steps. They do not need to be performed in one fixed order; in practice, the order is dictated by the constraints of the programming language and the style of the coder. Regardless, some important components are **removing punctuation / special characters**, **removing "stop" words**, and **"stemming"**.

In addition to these steps, other optional pre-processing methods can also affect the accuracy and results of a model. These include *n-gram counting, lemmatizing, etc.*
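To make these steps concrete, here is a minimal sketch on a single sentence using only the standard library. The tiny stop word list and the `ngrams` helper are hypothetical stand-ins for illustration; stemming is left out because it needs NLTK, which the notebook cells use on the real data.

```python
import string
from collections import Counter

# Hypothetical miniature stop word list (NLTK's English list is far longer)
toy_stop_words = {'the', 'is', 'a', 'and', 'of'}

def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = 'The quick fox, and the lazy dog!'

# Lowercase, strip punctuation, then split on whitespace
cleaned = sentence.lower().translate(str.maketrans('', '', string.punctuation))

# Drop stop words
toy_tokens = [t for t in cleaned.split() if t not in toy_stop_words]

# Count bigrams (n = 2) over the remaining tokens
bigram_counts = Counter(ngrams(toy_tokens, 2))

print(toy_tokens)
print(bigram_counts)
```

Note that the bigrams are counted after stop word removal, so `('fox', 'lazy')` appears as a pair even though the two words were not adjacent in the original sentence. Whether that is desirable depends on the model.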

In [1]:
# Import the required libraries
import os
import pandas as pd
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
import umap

In [2]:
# Set the working directory
wd = '/Users/zxs/Documents/code/kaggle/sentiment/'
os.chdir(wd)

In [3]:
# Extract a small subset of data for demonstration
df = pd.read_csv('train.csv.zip', compression = 'zip')
df = df.sample(frac = .1, random_state = 100)

In [None]:
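# Load the English stop word list (requires the NLTK 'stopwords' corpus to be downloaded)
stop_words = set(stopwords.words('english'))

# Initialize the stemmer
ps = PorterStemmer()

# Fix the casing
text = [i.lower() for i in df['comment_text']]

# Remove punctuation (build the translation table once and reuse it)
punct_table = str.maketrans('', '', string.punctuation)
no_punct = [i.translate(punct_table) for i in text]

# Tokenize the words (requires NLTK's Punkt tokenizer data to be downloaded)
tokens = [word_tokenize(x) for x in no_punct]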

# Remove stop words
no_stops = [[x for x in i if x not in stop_words] for i in tokens]

# Stemming
stems = [[ps.stem(x) for x in i] for i in no_stops]

# Rejoin content
joined = [' '.join(i) for i in stems]