# Numerical Representation of Text Data
### Author: Ariel Cintron, Ph.D.

This notebook has the following objectives:
1. Read the text data.
2. Convert text into smaller units called tokens.
3. Evaluate basic functions for text processing involving stop word and punctuation removal.
4. Transform text data into a numerical matrix of token counts using Python tools.
5. Visualize document-term matrices with heatmap plots.
6. Explore rank and singular values for document-term matrices.


### Required Python Modules
If needed, install missing modules with `pip install` and then restart the kernel.

In [None]:
import pandas as pd
import re
import string
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.max_colwidth', 100)

### Reading Data Files

The dataset stored in the file `SMSSpamCollection.tsv` is available in the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [None]:
data1 = pd.read_csv("data/SMSSpamCollection.tsv", sep='\t')
data1.columns = ['label', 'body_text']
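
Whether the first line of the file is a header row matters for the row count. The following is a minimal sketch, using a hypothetical two-line in-memory TSV rather than the real file, of how `pd.read_csv` treats the first row as column names by default, while `header=None` keeps it as data. Whether `SMSSpamCollection.tsv` contains a header row is an assumption to check against the file itself.

```python
import io
import pandas as pd

# Hypothetical two-message TSV with no header row (not taken from the real dataset).
demo_tsv = "ham\tSee you at noon\nspam\tWin a free prize now\n"

# Default behavior: the first line is consumed as the header.
df_default = pd.read_csv(io.StringIO(demo_tsv), sep="\t")
df_default.columns = ["label", "body_text"]

# header=None keeps the first line as data; names= supplies the column labels.
df_no_header = pd.read_csv(io.StringIO(demo_tsv), sep="\t", header=None,
                           names=["label", "body_text"])

print(len(df_default), len(df_no_header))
```

If the two row counts differ for the real file, the default call has absorbed one message into the column names.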

In [None]:
data1.info() # summary of tabular data information

In [None]:
data1.shape # number of rows and columns in the data frame

In [None]:
data1.head() # inspecting the top rows in the data frame

In [None]:
data1['label'].unique() # distinct classes in categorical feature

### Functions for Processing Text Data

In [None]:
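def remove_punct(text):
    # Drop every character listed in string.punctuation
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

def tokenize(text):
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    return tokens

# Requires the NLTK stop-word corpus: run nltk.download('stopwords') once if it is missing
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()
def remove_stopwords(tokenized_list):
    # Remove English stop words, then stem the remaining tokens
    text = [ps.stem(word) for word in tokenized_list if word not in stopwords]
    return text

def clean_text(text):
    # Lowercase, strip punctuation, tokenize, remove stop words, and stem in one function
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text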

In [None]:
data1['body_text_nopunct'] = data1['body_text'].apply(remove_punct)
data1['body_text_tokenized'] = data1['body_text_nopunct'].apply(tokenize)
data1['body_nostopw'] = data1['body_text_tokenized'].apply(remove_stopwords)

In [None]:
data1.info()

In [None]:
data1.head()
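
The cleaning steps can be traced on a single message. The sketch below follows the same sequence (lowercase, strip punctuation, tokenize, drop stop words) but makes two simplifying assumptions so that it runs without downloading NLTK corpora: it substitutes a tiny hand-picked stop-word set (`DEMO_STOPWORDS`, hypothetical) for NLTK's English list, and it omits stemming. It also discards empty tokens, which the notebook's functions keep.

```python
import re
import string

# Hypothetical, hand-picked stop words; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"the", "is", "a", "to", "and", "of"}

def demo_clean_text(text):
    # Lowercase and drop punctuation character by character.
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters (only whitespace remains at this point).
    tokens = re.split(r"\W+", no_punct)
    # Discard stop words and empty strings.
    return [tok for tok in tokens if tok and tok not in DEMO_STOPWORDS]

demo_tokens = demo_clean_text("Free entry: text WIN to 80086, and claim the prize!")
print(demo_tokens)
# ['free', 'entry', 'text', 'win', '80086', 'claim', 'prize']
```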

### Document-Term Matrix

https://en.wikipedia.org/wiki/Document-term_matrix

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. 

Document-term matrices may be calculated with `CountVectorizer()` from the Scikit-Learn module:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
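
To make the definition concrete, here is a hand-rolled sketch that builds a document-term matrix for a hypothetical three-document corpus using only the standard library. It is meant as an illustration of the row and column layout, not as a description of how `CountVectorizer` is implemented; the names `demo_docs`, `demo_vocab`, and `demo_dtm` are made up for this example.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and free of punctuation.
demo_docs = ["free prize now", "call me now", "free free call"]

# Vocabulary: sorted unique tokens, one column per term.
demo_vocab = sorted({tok for doc in demo_docs for tok in doc.split()})

# One row per document; each entry counts how often the term occurs in it.
demo_dtm = []
for doc in demo_docs:
    counts = Counter(doc.split())
    demo_dtm.append([counts[term] for term in demo_vocab])

print(demo_vocab)   # ['call', 'free', 'me', 'now', 'prize']
for row in demo_dtm:
    print(row)
# [0, 1, 0, 1, 1]
# [1, 0, 1, 1, 0]
# [1, 2, 0, 0, 0]
```

`CountVectorizer.fit_transform` produces the same kind of count matrix but returns it in sparse form, which is why the notebook converts it to a dense matrix before plotting.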


### Using CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer=clean_text) # Instantiating CountVectorizer() with clean_text as the analyzer

X1_counts = count_vect.fit_transform(data1['body_text'])

### Sample or subset of original matrix
scalar_multiple = 7 # set to a positive integer; the sample size is this multiple of 10 (7 gives 70 rows)
rows_in_sample = scalar_multiple * 10
data1_sample = data1[0:rows_in_sample]
count_vect_sample1 = CountVectorizer(analyzer=clean_text)
X1_counts_sample = count_vect_sample1.fit_transform(data1_sample['body_text']) # fit the sample's own vectorizer so count_vect is not refitted

#### What is the number of rows and columns for the original document-term matrix versus a sample?

In [None]:
print('Dimensions of original document-term matrix = {}'.format(X1_counts.shape))

In [None]:
print('Dimensions of sample document-term matrix = {}'.format(X1_counts_sample.shape))

### Color-Coded Representation of Matrices

The heat map is a color-coded representation of the entries of a matrix, where
zero is coded with the color white and larger positive integer values
are coded in the darker-color spectrum.

The original 5567-by-8104 document-term matrix is very sparse (that is, most of
its entries equal zero). Thus, its heat map
representation displays most regions in tones ranging from light blue to white.
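
Sparsity can also be quantified as the fraction of nonzero entries. A minimal sketch on a hypothetical 3-by-5 count matrix (`demo_matrix`, not the notebook's data):

```python
import numpy as np

# Hypothetical small count matrix standing in for a document-term matrix.
demo_matrix = np.array([[0, 1, 0, 1, 1],
                        [1, 0, 1, 1, 0],
                        [1, 2, 0, 0, 0]])

nonzero = np.count_nonzero(demo_matrix)   # number of entries different from zero
density = nonzero / demo_matrix.size      # fraction of nonzero entries
print('nonzero entries = {}, density = {:.3f}'.format(nonzero, density))
```

For a SciPy sparse matrix such as `X1_counts`, the number of stored entries is available as `X1_counts.nnz`, so the same ratio can be computed without converting to a dense array.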

In [None]:
fig, axs = plt.subplots(ncols=1, nrows=2, figsize=(15,12),layout="constrained")
sns.heatmap(X1_counts.todense(),cmap='Blues',ax=axs[0])
axs[0].set_title('Document-Term Matrix Dimensions = {}'.format(X1_counts.shape))
sns.heatmap(X1_counts_sample.todense(),cmap='Blues',ax=axs[1])
axs[1].set_title('Document-Term Matrix Dimensions = {}'.format(X1_counts_sample.shape))
plt.show()

### Singular Value Decomposition

In [None]:
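# np.linalg.svd returns the singular values S in descending order and V transposed as the third output
U,S,Vt = np.linalg.svd(X1_counts.toarray())
U1,S1,Vt1 = np.linalg.svd(X1_counts_sample.toarray())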

#### What is the rank of the original matrix versus sample matrix?
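Hint: The attribute `.shape` of the array of singular values gives their number, which equals the smaller of the two matrix dimensions and is only an upper bound on the rank. The rank is the number of singular values that exceed a small numerical tolerance.

In [None]:
tol = S.max() * max(X1_counts.shape) * np.finfo(S.dtype).eps # same default tolerance as np.linalg.matrix_rank
print('The rank of the original document-term matrix is {}'.format(int((S > tol).sum())))

In [None]:
tol1 = S1.max() * max(X1_counts_sample.shape) * np.finfo(S1.dtype).eps
print('The rank of the sample document-term matrix is {}'.format(int((S1 > tol1).sum())))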
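
The difference between the number of singular values and the rank is easiest to see on a small matrix. In this sketch, a hypothetical 3-by-4 matrix (`demo_A`) has a third row equal to the sum of the first two, so SVD returns three singular values but only two of them are numerically nonzero. The tolerance follows the default rule documented for `np.linalg.matrix_rank`.

```python
import numpy as np

# Hypothetical 3-by-4 matrix whose third row is the sum of the first two.
demo_A = np.array([[1., 0., 2., 0.],
                   [0., 1., 0., 3.],
                   [1., 1., 2., 3.]])

# Singular values only; min(3, 4) = 3 values are returned.
demo_S = np.linalg.svd(demo_A, compute_uv=False)

# Count the singular values above a tolerance scaled by the largest one.
demo_tol = demo_S.max() * max(demo_A.shape) * np.finfo(demo_S.dtype).eps
demo_rank = int((demo_S > demo_tol).sum())

print('number of singular values = {}'.format(len(demo_S)))
print('numerical rank = {}'.format(demo_rank))
print('np.linalg.matrix_rank = {}'.format(np.linalg.matrix_rank(demo_A)))
```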

#### What are the smallest and largest singular values?

In [None]:
print('The smallest singular value in the original document-term matrix is {}'.format(S.min()))

In [None]:
print('The smallest singular value in the sample matrix is {}'.format(S1.min()))

In [None]:
print('The largest singular value in the original document-term matrix equals {}'.format(S.max()))

In [None]:
print('The largest singular value in the sample matrix is {}'.format(S1.max()))

Next we display the singular values on a semi-logarithmic scale for both the original document-term matrix and the sample matrix. The singular values are plotted on the vertical axis.

In [None]:
figsvd, axssvd = plt.subplots(ncols=2, nrows=1, figsize=(15,10),layout="constrained")
axssvd[0].semilogy(S)
axssvd[0].set_title('Document-Term Matrix Dimensions = {}'.format(X1_counts.shape))
axssvd[0].set_ylabel('Singular value (log scale)')
axssvd[1].semilogy(S1) 
axssvd[1].set_title('Document-Term Matrix Dimensions = {}'.format(X1_counts_sample.shape))
axssvd[1].set_ylabel('Singular value (log scale)')
plt.show()