# Dimensionality Reduction

In this lab, we will work with the IMDB dataset to estimate the sentiment of movie reviews. We will study PCA and Sparse PCA in this context, and use Singular Value Decomposition (SVD) to perform topic analysis. In the context of text mining, SVD is known as *Latent Semantic Analysis* (LSA).

LSA is already implemented in Python in scikit-learn, in the class [*TruncatedSVD*](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html). We will use it along with the Natural Language Processing library [*NLTK*](https://www.nltk.org/) for our methods.

The general process can be summarized as follows:

1. Load the text in free form.
2. Preprocess the text to normalize it.
3. Calculate LSA.
4. Explore the results.

## Loading text: IMDB database.

This dataset comes from the website Internet Movie Database, and represents 25,000 reviews which were labeled (by humans) as positive or negative. See [here](http://ai.stanford.edu/~amaas/data/sentiment/) for more details. It is a pretty big dataset, so we will work with a small sample of 500 positive cases and 500 negative cases.

The uncompressed data is simply a series of text documents, each in its own text file, stored in two classes, one per folder.

The first step is to load the data and create a "corpus". A corpus is, quite simply, a set of documents. Here, we will read the files from our folders and assign each one a sentiment. We need to read the documents one by one and store them in a dataset that holds the texts together with a tag indicating whether each one is positive or negative.

### Reading the text

The first step is to read the data into a list. We need to list and open the files under the document path, which requires talking to the operating system. The module for this is called *os* and is part of Python's standard library.


In [17]:
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import sklearn.feature_extraction.text as sktext
from sklearn.decomposition import PCA, SparsePCA, TruncatedSVD
import re
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
# https://umap-learn.readthedocs.io/en/latest/
# Import umap. Install first if not available!
# !pip install umap-learn 
# !pip install datashader bokeh holoviews scikit-image colorcet ipywidgets
import umap
import umap.plot

In [18]:
# Download the data
# !gdown https://drive.google.com/uc?id=15AL-2F2Vdg9xlVmHfmeIOs3opnXoxzcP

In [19]:
# !unzip LSA_Sample.zip

In [20]:
# List all files in the positive samples. Replace with your own!
dir = 'Lecture_Sample/train/pos/'
fileList = os.listdir(dir)
fileList[:10] # see first 10 files in the directory

['1_7.txt',
 '397_9.txt',
 '280_8.txt',
 '264_7.txt',
 '209_8.txt',
 '377_7.txt',
 '69_10.txt',
 '198_8.txt',
 '122_9.txt',
 '114_10.txt']

In [21]:
# Create list with texts
outtexts = []

# Read each file in the directory as raw bytes and append it to the list
for eachFile in fileList:
    # The "with" block closes the file automatically, so no explicit close() is needed
    with open(os.path.join(dir, eachFile), 'rb') as _fp:
        fileData = _fp.read()
        outtexts.append(fileData)
    
# Create dataframe from outputs
texts = pd.DataFrame({'texts': outtexts, 'class': 1})
texts.head()

Unnamed: 0,texts,class
0,"b""If you like adult comedy cartoons, like Sout...",1
1,b'I have to admit that Tsui Hark is one of a k...,1
2,"b""Undying is a very good game which brings som...",1
3,"b""Hickory Dickory Dock was a good Poirot myste...",1
4,"b""Walter Matthau and George Burns just work so...",1


In [22]:
# Repeat for negative values
# List all files in the "neg" directory
dir = 'Lecture_Sample/train/neg/'
fileList = os.listdir(dir)

# Create list with texts
outtexts = []

# Read each file in the directory as raw bytes and append it to the list
for eachFile in fileList:
    # The "with" block closes the file automatically, so no explicit close() is needed
    with open(os.path.join(dir, eachFile), 'rb') as _fp:
        fileData = _fp.read()
        outtexts.append(fileData)
    
# Create dataframe from outputs
texts = pd.concat((texts, pd.DataFrame({'texts': outtexts, 'class': 0})), ignore_index = True)
texts.tail()

Unnamed: 0,texts,class
995,b'You may consider a couple of facts in the di...,0
996,b'What is this crap? My little cousin picked t...,0
997,"b""This film was choppy, incoherent and contriv...",0
998,"b""This film, once sensational for its forward-...",0
999,b'It as absolutely incredible to me that anyon...,0
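The two loading cells above repeat the same loop for each folder. As a minimal sketch, the loop could be factored into a helper. The function name `read_reviews` and the throwaway demo directory are assumptions made for illustration and are not part of the lab's code; the demo only needs the standard library.

```python
import os
import tempfile


def read_reviews(folder, label):
    """Read every .txt file in `folder` as raw bytes and tag it with `label`."""
    records = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.txt'):
            continue  # skip anything that is not a review file
        with open(os.path.join(folder, name), 'rb') as fp:
            records.append({'texts': fp.read(), 'class': label})
    return records


# Demo on a temporary directory that mimics the pos/neg layout (hypothetical data).
with tempfile.TemporaryDirectory() as root:
    for sub, text in (('pos', b'Loved it.'), ('neg', b'Hated it.')):
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, '0_1.txt'), 'wb') as fp:
            fp.write(text)
    records = (read_reviews(os.path.join(root, 'pos'), 1)
               + read_reviews(os.path.join(root, 'neg'), 0))

print(records)
```

Passing the resulting list to `pd.DataFrame(records)` should give the same two-column layout (`texts`, `class`) used in this lab.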


In [23]:
texts.describe()

Unnamed: 0,class
count,1000.0
mean,0.5
std,0.50025
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


In [24]:
texts

Unnamed: 0,texts,class
0,"b""If you like adult comedy cartoons, like Sout...",1
1,b'I have to admit that Tsui Hark is one of a k...,1
2,"b""Undying is a very good game which brings som...",1
3,"b""Hickory Dickory Dock was a good Poirot myste...",1
4,"b""Walter Matthau and George Burns just work so...",1
...,...,...
995,b'You may consider a couple of facts in the di...,0
996,b'What is this crap? My little cousin picked t...,0
997,"b""This film was choppy, incoherent and contriv...",0
998,"b""This film, once sensational for its forward-...",0


The text is quite dirty: the reviews are stored as raw bytes and contain HTML tags and entities. We'll clean it with regular expressions (regex), available in Python through the built-in `re` module (see this [regex quickstart](https://www.rexegg.com/regex-quickstart.html)). Regex can be daunting, but it is very rewarding to learn. Do spend some time with it!

In [25]:
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
def cleanhtml(raw_html):
    html = raw_html.decode('ISO-8859-1') # Change the encoding to your locale!
    cleantext = re.sub(CLEANR, '', html)
    return cleantext

texts['texts'] = texts['texts'].apply(cleanhtml)
texts

Unnamed: 0,texts,class
0,"If you like adult comedy cartoons, like South ...",1
1,I have to admit that Tsui Hark is one of a kin...,1
2,Undying is a very good game which brings some ...,1
3,Hickory Dickory Dock was a good Poirot mystery...,1
4,Walter Matthau and George Burns just work so w...,1
...,...,...
995,You may consider a couple of facts in the disc...,0
996,What is this crap? My little cousin picked thi...,0
997,"This film was choppy, incoherent and contrived...",0
998,"This film, once sensational for its forward-th...",0
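To see what the pattern actually removes, here is a small sketch that applies the same `CLEANR` expression to a made-up review (the sample bytes are an assumption for illustration). It also shows one side effect worth knowing: replacing matches with an empty string can fuse neighbouring words, so a variant that substitutes a space and then collapses whitespace is included for comparison.

```python
import re

# Same pattern as in the lab: HTML tags, or HTML entities such as &amp; or &#39;
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

# Hypothetical review, stored as bytes like the files read above.
sample = b"Great film!<br /><br />Tom &amp; Jerry would agree."
decoded = sample.decode('ISO-8859-1')

# Lab behaviour: matches are deleted outright.
cleaned = CLEANR.sub('', decoded)

# Variant: replace matches with a space, then collapse repeated whitespace.
spaced = re.sub(r'\s+', ' ', CLEANR.sub(' ', decoded)).strip()

print(repr(cleaned))
print(repr(spaced))
```

Whether the variant matters depends on the tokenizer: `"film!Tom"` still splits on the punctuation here, but two words separated only by a tag would end up glued together.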


Now we will transform the text. The following code uses sklearn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which applies a [Term Frequency - Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) transformation to the text. TF-IDF weighs how often a term appears in a document against how many documents in the corpus contain it, so terms that are frequent in one review but rare overall receive the highest weights. We configure the vectorizer to do the following:

1. Eliminate accents and other characters.
2. Eliminate the so-called "stopwords", or words that are irrelevant to the learning given they are only connectors. These words are [here](https://gist.github.com/ethen8181/d57e762f81aa643744c2ffba5688d33a).
3. Eliminate concepts that are rare (min_df) or too common (max_df). Here we eliminate concepts that appear in fewer than 5% of documents and those that appear in more than 90%.

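The last argument, `sublinear_tf=True`, applies a logarithmic (sublinear) scaling that replaces each raw term count tf with 1 + log(tf). This dampens the effect of a word repeated many times within a single review, which makes the weights more robust. The transformation effectively turns our dataset into a fully numeric one!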


In [26]:
# Transform the text
TfIDFTransformer = sktext.TfidfVectorizer(strip_accents='unicode', # Eliminate accents and special characters
                      stop_words='english', # Eliminate stop words
                      min_df = 0.05, # Eliminate words that appear in fewer than 5% of texts
                      max_df = 0.90, # Eliminate words that appear in more than 90% of texts
                      sublinear_tf=True # Use sublinear term frequency: 1 + log(tf)
                      )
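To make the weighting concrete, the sketch below computes sublinear TF-IDF by hand on a toy corpus, using only the standard library. It assumes the formulas documented for scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The three documents and the whitespace tokenizer are assumptions for illustration, and no stop-word or frequency filtering is applied, so this is not a replacement for the vectorizer above.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace only.
docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_row(tokens):
    """Sublinear TF-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    weights = {t: (1 + math.log(c)) * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

In the first document, "good" appears twice, yet its weight is less than double that of "movie": that is the dampening effect of the sublinear term frequency.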

Models in scikit-learn always follow the same structure:

1. We define the model using the appropriate function directly from the package (as above).

2. We train the model using the "fit" method over the object we created in 1.

3. We apply the model to new data using the "transform" method.

In cases where we want to fit *and* transform the same inputs - such as a TF-IDF transform, which is applied over the same data where the weights are "trained" - we can use the "fit_transform" method directly, which performs steps 2 and 3 in a single call.
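The three steps can be illustrated with a toy class that mimics this interface. `ToyCountVectorizer` is a hypothetical stand-in written in plain Python, not a scikit-learn class; it only counts words, but it shows the key point that `transform` reuses whatever was learned during `fit`.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in that mimics the fit / transform interface."""

    def fit(self, docs):
        # "Training": learn the vocabulary from the documents.
        vocab = sorted({w for d in docs for w in d.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(vocab)}
        return self

    def transform(self, docs):
        # "Applying": count only words that were seen during fit.
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for w in d.lower().split():
                idx = self.vocabulary_.get(w)
                if idx is not None:
                    row[idx] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        # Steps 2 and 3 in one call, on the same data.
        return self.fit(docs).transform(docs)


vec = ToyCountVectorizer()                    # 1. define
X_train = vec.fit_transform(["good movie", "bad movie"])  # 2 + 3 on training data
X_new = vec.transform(["good good film"])     # 3. apply to new data

print(vec.vocabulary_)
print(X_train)
print(X_new)
```

Note that "film" is silently ignored in `X_new` because it was not in the fitted vocabulary; fitted text vectorizers generally behave this way on unseen words.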

In [27]:
TfIDF_IMDB = TfIDFTransformer.fit_transform(texts['texts'])
TfIDF_IMDB

<1000x230 sparse matrix of type '<class 'numpy.float64'>'
	with 23848 stored elements in Compressed Sparse Row format>

The output is a **sparse matrix** with 1,000 rows (one per review) and 230 columns (one per word that survived the frequency filters). Sparse matrices store only the non-zero entries, here 23,848 of the 230,000 cells (about 10%), so they are *much* more efficient in memory.

We can check the non-zero entries of the first row with the code below.

In [28]:
print(TfIDF_IMDB[0,:])

  (0, 25)	0.26221561360196516
  (0, 102)	0.19553930525319846
  (0, 43)	0.20636328414550276
  (0, 27)	0.20986235766373093
  (0, 167)	0.2666753264211833
  (0, 79)	0.21743149588993194
  (0, 80)	0.2936185733749881
  (0, 195)	0.18088119475091288
  (0, 76)	0.2666753264211833
  (0, 87)	0.25904739767431134
  (0, 75)	0.27268042611426274
  (0, 180)	0.47280717413125994
  (0, 33)	0.24255769565192892
  (0, 107)	0.27221410897561893
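The `(row, column)  value` pairs printed above list only the non-zero cells. As a rough sketch of how the Compressed Sparse Row layout holds them, the plain-Python example below builds the three CSR arrays (`data`, `indices`, `indptr`) from a small dense matrix. The matrix values and the helper `row_entries` are assumptions for illustration; SciPy's own implementation is far more optimized.

```python
# Hypothetical dense matrix: mostly zeros, like a TF-IDF matrix.
dense = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.9],
]

data = []      # the non-zero values, row by row
indices = []   # the column index of each stored value
indptr = [0]   # indptr[i]:indptr[i + 1] is the slice of `data` belonging to row i

for row in dense:
    for col, value in enumerate(row):
        if value != 0.0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))


def row_entries(i):
    """Return the (column, value) pairs stored for row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


# Fraction of cells that are actually stored.
density = len(data) / (len(dense) * len(dense[0]))

print(data, indices, indptr)
print(row_entries(2), density)
```

An all-zero row costs only one extra entry in `indptr`, which is why the format scales well to matrices with many empty cells.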


Each column index corresponds to a word in the vocabulary. The following code prints the words associated with indexes 30 to 39.

In [29]:
print(TfIDFTransformer.get_feature_names_out()[30:40])

# Let's save the full index-to-word array for later.
word_index = TfIDFTransformer.get_feature_names_out()

['cinema' 'classic' 'come' 'comedy' 'comes' 'completely' 'couldn' 'couple'
 'course' 'day']
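The index-to-word array lets us read a sparse row in terms of words. The sketch below uses only values copied from the outputs above: the ten words for indexes 30 to 39, and the (column, weight) pairs of the first review. Because only part of the vocabulary was printed, the lookup here is deliberately partial; the names `partial_index`, `first_row` and `named` are introduced for illustration.

```python
# Words for indexes 30 to 39, copied from the output above.
partial_index = dict(enumerate(
    ['cinema', 'classic', 'come', 'comedy', 'comes',
     'completely', 'couldn', 'couple', 'course', 'day'],
    start=30))

# (column index, weight) pairs of the first review, copied from the earlier output.
first_row = [
    (25, 0.26221561360196516), (102, 0.19553930525319846),
    (43, 0.20636328414550276), (27, 0.20986235766373093),
    (167, 0.2666753264211833), (79, 0.21743149588993194),
    (80, 0.2936185733749881), (195, 0.18088119475091288),
    (76, 0.2666753264211833), (87, 0.25904739767431134),
    (75, 0.27268042611426274), (180, 0.47280717413125994),
    (33, 0.24255769565192892), (107, 0.27221410897561893),
]

# Keep only the entries whose word we know from the printed slice.
named = [(partial_index[i], w) for i, w in first_row if i in partial_index]
print(named)
```

With the full `word_index` array saved above, an expression along the lines of `word_index[TfIDF_IMDB[0].indices]` should return the words for every non-zero entry of the first review.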
