<a id='top'></a><a name='top'></a>
# Chapter 4: Finding meaning in word counts (semantic analysis)

## 4.3 Singular value decomposition

* [Introduction](#introduction)
* [4.0 Imports and Setup](#4.0)
* [4.3 Singular value decomposition](#4.3)
    - [4.3.1 U-left singular vectors](#4.3.1)
    - [4.3.2 S-singular values](#4.3.2)
    - [4.3.3 V<sup>T</sup>-right singular vectors](#4.3.3)
    - [4.3.4 SVD matrix orientation](#4.3.4)
    - [4.3.5 Truncating the topics](#4.3.5)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Datasets

* cats_and_dogs_sorted.txt: [script](#cats_and_dogs_sorted.txt), [source](https://github.com/totalgood/nlpia/raw/master/src/nlpia/data/cats_and_dogs_sorted.txt)

### Explore

* Analyzing semantics (meaning) to create topic vectors
* Semantic search using the similarity between topic vectors
* Scalable semantic analysis and semantic search for large corpora
* Using semantic components (topics) as features in your NLP pipeline
* Navigating high-dimensional vector spaces


### Key points

* You can use SVD for semantic analysis to decompose and transform TF-IDF vectors into topic vectors
* Use LDiA when you need to compute explainable topic vectors
* No matter how you create your topic vectors, they can be used for semantic search to find documents based on their meaning
* Topic vectors can be used to predict whether a social post is spam or is likely to be "liked"
* We can sidestep the curse of dimensionality to approximate nearest neighbors in a semantic vector space


---
<a name='4.0'></a><a id='4.0'></a>
# 4.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
import os
if not os.path.exists('setup'):
    os.mkdir('setup')

In [2]:
req_file = "setup/requirements_04.txt"

In [3]:
%%writefile {req_file}
isort
scikit-learn-intelex
scrapy
watermark

Overwriting setup/requirements_04.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [5]:
#if IS_COLAB:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [6]:
%%writefile setup/chp04_4.3_imports.py
import locale
import os
import pprint
import random
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D  # noqa
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.auto import tqdm
from watermark import watermark

Overwriting setup/chp04_4.3_imports.py


In [7]:
!isort setup/chp04_4.3_imports.py --sl
!cat setup/chp04_4.3_imports.py

import locale
import os
import pprint
import random

import numpy as np
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D  # noqa
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.auto import tqdm
from watermark import watermark


In [8]:
import locale
import os
import pprint
import random
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D  # noqa
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.auto import tqdm
from watermark import watermark

In [9]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)
random.seed(42)
np.random.seed(42)

print(watermark(iversions=True,globals_=globals(),python=True,machine=True))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

pandas : 1.5.3
sys    : 3.8.12 (default, Dec 13 2021, 20:17:08) 
[Clang 13.0.0 (clang-1300.0.29.3)]
seaborn: 0.12.1
numpy  : 1.23.5



---
<a name='4.3'></a><a id='4.3'></a>
# 4.3 Singular value decomposition
<a href="#top">[back to top]</a>

Problem: What is the mathematical algorithm behind LSA?

Idea: Singular value decomposition (SVD).

Importance: A matrix of word counts per document is constructed from the text, with one row per unique word and one column per document. This can be a term-document matrix, a TF-IDF matrix, or any other vector space model. SVD is then applied to reduce the number of rows while preserving the similarity structure among the columns.

Further notes:
* http://www.scholarpedia.org/article/Latent_semantic_analysis
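
To make the "fewer rows, same similarity structure" claim concrete, here is a minimal sketch on a hypothetical toy matrix (the counts, the names `tdm_toy` and `cosine_matrix`, and the choice of two topics are all illustrative assumptions, not part of the dataset used below). It projects four word dimensions onto two topic dimensions and compares the document-to-document cosine similarities before and after.

```python
import numpy as np

# Hypothetical toy term-document counts (NOT taken from the cats-and-dogs data).
# Rows are terms, columns are five short documents.
tdm_toy = np.array([
    [1, 1, 0, 0, 0],   # cat
    [1, 0, 0, 0, 0],   # lion
    [0, 0, 1, 1, 1],   # nyc
    [0, 0, 1, 0, 1],   # apple
], dtype=float)


def cosine_matrix(columns):
    """Cosine similarity between every pair of columns."""
    unit = columns / np.linalg.norm(columns, axis=0)
    return unit.T @ unit


U, s, Vt = np.linalg.svd(tdm_toy, full_matrices=False)

k = 2                            # keep the two strongest topics
reduced = U[:, :k].T @ tdm_toy   # shape (2, 5): topics x documents

print(cosine_matrix(tdm_toy).round(2))   # similarities in the 4-D word space
print(cosine_matrix(reduced).round(2))   # similarities in the 2-D topic space
```

In this toy case the animal documents stay close to each other and unrelated to the city documents after the reduction, even though half of the rows are gone.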

<a id='cats_and_dogs_sorted.txt'></a><a name='cats_and_dogs_sorted.txt'></a>
### Dataset: cats_and_dogs_sorted.txt
<a href="#top">[back to top]</a>

In [10]:
data_dir = 'data/data_cats_dogs'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
    
data_cats_dogs = f"{data_dir}/cats_and_dogs_sorted.txt"
!wget -P {data_dir} -nc https://github.com/totalgood/nlpia/raw/master/src/nlpia/data/cats_and_dogs_sorted.txt
!ls -l {data_cats_dogs}

File ‘data/data_cats_dogs/cats_and_dogs_sorted.txt’ already there; not retrieving.

-rw-r--r--  1 gb  staff  10095 Mar 26 16:19 data/data_cats_dogs/cats_and_dogs_sorted.txt


In [11]:
!head {data_cats_dogs}

NYC is the Big Apple.
NYC is known as the Big Apple.
I love NYC!
I wore a hat to the Big Apple party in NYC.
Come to NYC. See the Big Apple!
Manhattan is called the Big Apple.
New York is a big city for a small cat.
The lion, a big cat, is the king of the jungle.
I love my pet cat.
I love New York City (NYC).


In [12]:
with open(data_cats_dogs, 'r') as f:
    contents_raw = [stripped for line in f if (stripped := line.strip())]
    
print(contents_raw[:5])
HR()
corpus = ' '.join(contents_raw)
print(corpus[:100])

['NYC is the Big Apple.', 'NYC is known as the Big Apple.', 'I love NYC!', 'I wore a hat to the Big Apple party in NYC.', 'Come to NYC. See the Big Apple!']
----------------------------------------
NYC is the Big Apple. NYC is known as the Big Apple. I love NYC! I wore a hat to the Big Apple party


In [13]:
# Use np.linalg.svd directly to illustrate LSA on a small corpus. 

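# Split the string into a list of terms; a bare string is not a valid vocabulary.
VOCABULARY = vocabulary = 'cat dog apple lion NYC love'.lower().split()  # alternative: 'cat dog apple lion NYC love big small bright'.lower().split()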
DOCS = contents_raw

def docs_to_tdm(docs=DOCS, vocabulary=VOCABULARY, verbosity=0):
    tfidfer = TfidfVectorizer(min_df=1, max_df=.99, stop_words=None, token_pattern=r'(?u)\b\w+\b',
                              vocabulary=vocabulary)
    tfidf_dense = pd.DataFrame(tfidfer.fit_transform(docs).todense())
    id_words = [(i, w) for (w, i) in tfidfer.vocabulary_.items()]
    tfidf_dense.columns = list(zip(*sorted(id_words)))[1]

    tfidfer.use_idf = False
    tfidfer.norm = None
    bow_dense = pd.DataFrame(tfidfer.fit_transform(docs).todense())
    bow_dense.columns = list(zip(*sorted(id_words)))[1]
    bow_dense = bow_dense.astype(int)
    tfidfer.use_idf = True
    tfidfer.norm = 'l2'
    if verbosity:
        print(tfidf_dense.T)
    return bow_dense.T, tfidf_dense.T, tfidfer


def prettify_tdm(tdm=None, docs=[], vocabulary=[], **kwargs):
    bow_pretty = tdm.T.copy()[vocabulary]
    bow_pretty['text'] = docs
    for col in vocabulary:
        bow_pretty.loc[bow_pretty[col] == 0, col] = ''
    return bow_pretty


def accuracy_study(tdm=None, u=None, s=None, vt=None, verbosity=0, **kwargs):
    """ Reconstruct the term-document matrix and measure error as SVD terms are truncated
    """
    smat = np.zeros((len(u), len(vt)))
    np.fill_diagonal(smat, s)
    smat = pd.DataFrame(smat, columns=vt.index, index=u.index)
    if verbosity:
        print()
        print('Sigma:')
        print(smat.round(2))
        print()
        print('Sigma without zeroing any dim:')
        print(np.diag(smat.round(2)))
    tdm_prime = u.values.dot(smat.values).dot(vt.values)
    if verbosity:
        print()
        print('Reconstructed Term-Document Matrix')
        print(tdm_prime.round(2))

    err = [np.sqrt(((tdm_prime - tdm).values.flatten() ** 2).sum() / np.prod(tdm.shape))]
    if verbosity:
        print()
        print('Error without reducing dimensions:')
        print(err[-1])

    smat2 = smat.copy()
    for numdim in range(len(s) - 1, 0, -1):
        smat2.iloc[numdim, numdim] = 0
        if verbosity:
            print('Sigma after zeroing out dim {}'.format(numdim))
            print(np.diag(smat2.round(2)))

        tdm_prime2 = u.values.dot(smat2.values).dot(vt.values)
        err += [np.sqrt(((tdm_prime2 - tdm).values.flatten() ** 2).sum() / np.prod(tdm.shape))]
        if verbosity:
            print('Error after zeroing out dim {}'.format(numdim))
            print(err[-1])
    return err


def lsa(tdm, verbosity=0):
    if verbosity:
        print(tdm)

    u, s, vt = np.linalg.svd(tdm)

    u = pd.DataFrame(u, index=tdm.index)
    if verbosity:
        print('U')
        print(u.round(2))

    vt = pd.DataFrame(vt, index=['d{}'.format(i) for i in range(len(vt))])
    if verbosity:
        print('VT')
        print(vt.round(2))

    # Reconstruct the original term-document matrix.
    # The sum of the squares of the error is 0.
    return {'u': u, 's': s, 'vt': vt, 'tdm': tdm}


def lsa_models(vocabulary='cat dog apple lion NYC love'.lower().split(), docs=11, verbosity=0):
    # vocabulary = 'cat dog apple lion NYC love big small bright'.lower().split()
    if isinstance(docs, int):
        docs = contents_raw[:docs]
        
    tdm, tfidfdm, tfidfer = docs_to_tdm(docs=docs, vocabulary=vocabulary)
    lsa_bow_model = lsa(tdm)  # (tdm - tdm.mean(axis=1)) # SVD fails to converge if you center, like PCA does
    lsa_bow_model['vocabulary'] = tdm.index.values
    lsa_bow_model['docs'] = docs
    err = accuracy_study(verbosity=verbosity, **lsa_bow_model)
    lsa_bow_model['err'] = err
    lsa_bow_model['accuracy'] = list(1. - np.array(err))
    
    lsa_tfidf_model = lsa(tdm=tfidfdm)
    lsa_tfidf_model['vocabulary'] = tfidfdm.index.values
    lsa_tfidf_model['docs'] = docs
    err = accuracy_study(verbosity=verbosity, **lsa_tfidf_model)
    lsa_tfidf_model['err'] = err
    lsa_tfidf_model['accuracy'] = list(1. - np.array(err))

    return lsa_bow_model, lsa_tfidf_model


bow_svd, tfidf_svd = lsa_models()

prettify_tdm(**bow_svd)

Unnamed: 0,cat,dog,apple,lion,nyc,love,text
0,,,1.0,,1.0,,NYC is the Big Apple.
1,,,1.0,,1.0,,NYC is known as the Big Apple.
2,,,,,1.0,1.0,I love NYC!
3,,,1.0,,1.0,,I wore a hat to the Big Apple party in NYC.
4,,,1.0,,1.0,,Come to NYC. See the Big Apple!
5,,,1.0,,,,Manhattan is called the Big Apple.
6,1.0,,,,,,New York is a big city for a small cat.
7,1.0,,,1.0,,,"The lion, a big cat, is the king of the jungle."
8,1.0,,,,,1.0,I love my pet cat.
9,,,,,1.0,1.0,I love New York City (NYC).


This is a document-term matrix: each row is the bag-of-words vector for one document. Its transpose is the term-document matrix that SVD is applied to in this section.

SVD works equally well on TF-IDF matrices or any other vector space model.
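
As a smaller illustration of how such a matrix is built, the sketch below uses scikit-learn's `CountVectorizer` with a fixed vocabulary. This is a swapped-in technique: the notebook's own `docs_to_tdm` helper reuses a `TfidfVectorizer` with IDF and normalization switched off. The three documents are taken from the dataset; the names `dtm` and `tdm_small` are assumptions made for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "NYC is the Big Apple.",
    "I love NYC!",
    "I love my pet cat.",
]
vocabulary = "cat dog apple lion nyc love".split()

# A fixed vocabulary keeps only the six words of interest.
# Text is lowercased by default, so "NYC" is counted as "nyc".
vectorizer = CountVectorizer(vocabulary=vocabulary, token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs).toarray()

dtm = pd.DataFrame(counts, columns=vocabulary)  # document-term: one row per document
tdm_small = dtm.T                               # term-document: one row per term

print(dtm)
print(tdm_small)
```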

In [14]:
tdm = bow_svd['tdm']
tdm

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
cat,0,0,0,0,0,0,1,1,1,0,1
dog,0,0,0,0,0,0,0,0,0,0,1
apple,1,1,0,1,1,1,0,0,0,0,0
lion,0,0,0,0,0,0,0,1,0,0,0
nyc,1,1,1,1,1,0,0,0,0,1,0
love,0,0,1,0,0,0,0,0,1,1,0


<a name='4.3.1'></a><a id='4.3.1'></a>
## 4.3.1 U-left singular vectors
<a href="#top">[back to top]</a>

Problem: What does the U (left singular) matrix represent?

Idea: The U matrix is the term-topic matrix, which captures "the company a word keeps." It holds the "left singular vectors," so called because it contains row vectors that are multiplied by a matrix of column vectors from the left. U is the cross-correlation between words and topics based on word co-occurrence in the same document. By default (`full_matrices=True` in `np.linalg.svd`) it is a square matrix.

Importance: This is the most important matrix for semantic analysis in NLP. 

In [15]:
# Return Numpy arrays
U, s, Vt = np.linalg.svd(tdm)
pd.DataFrame(U, index=tdm.index).round(2)

Unnamed: 0,0,1,2,3,4,5
cat,-0.04,0.83,-0.38,-0.0,0.11,-0.38
dog,-0.0,0.21,-0.18,-0.71,-0.39,0.52
apple,-0.62,-0.21,-0.51,0.0,0.49,0.27
lion,-0.0,0.21,-0.18,0.71,-0.39,0.52
nyc,-0.75,0.0,0.24,-0.0,-0.52,-0.32
love,-0.22,0.42,0.69,0.0,0.41,0.37


The U-matrix contains all the topic vectors for each word in the corpus as columns. This means it can be used as a transformation to convert a word-document vector (a TF-IDF vector or a BOW vector) into a topic-document vector. 
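
A minimal sketch of that transformation, assuming the term-document counts shown in the `tdm` table above (copied here so the block runs on its own). The vector `new_doc` is a hypothetical document invented for illustration.

```python
import numpy as np

# Term-document counts copied from the `tdm` table above.
# Rows: cat, dog, apple, lion, nyc, love. Columns: documents 0..10.
tdm = np.array([
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(tdm)

# Project every document (column) from word space into topic space.
topic_doc = U.T @ tdm            # shape (6, 11): topics x documents

# Because tdm = U S Vt and U is orthogonal, U.T @ tdm equals S @ Vt.
S = np.zeros(tdm.shape)
np.fill_diagonal(S, s)
print(np.allclose(topic_doc, S @ Vt))

# Hypothetical unseen document that mentions "cat" and "nyc" once each.
new_doc = np.array([1, 0, 0, 0, 1, 0], dtype=float)
new_topic = U.T @ new_doc        # its coordinates in topic space
print(new_topic.round(2))
```

With the full, untruncated U this is only a rotation of the word space, so vector lengths and angles are unchanged; the dimension reduction comes later, when columns of U are dropped.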

<a name='4.3.2'></a><a id='4.3.2'></a>
## 4.3.2 S-singular values
<a href="#top">[back to top]</a>

Problem: What does the S (singular) matrix represent?

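Idea: The Sigma or S matrix holds the topic "singular values" on the diagonal of an otherwise zero matrix. It is square in the compact form of SVD; with the full-size U and V<sup>T</sup> that NumPy returns by default, it is padded with zero columns to 6 x 11 for this corpus.

Importance: The S matrix tells you how much information is captured by each dimension in the new semantic (topic) vector space.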

numpy saves space by returning the singular values as an array.

In [16]:
s.round(1)

array([3.1, 2.2, 1.8, 1. , 0.8, 0.5])

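Convert it to a diagonal matrix. `numpy.diag` would build a square 6 x 6 matrix, so the code below instead fills the diagonal of a 6 x 11 zero matrix with `np.fill_diagonal`, which matches the shapes of U and V<sup>T</sup>.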

In [17]:
S = np.zeros((len(U), len(Vt)))
np.fill_diagonal(S, s)
pd.DataFrame(S).round(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,3.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,2.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0
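
One common way to read "how much information" from the singular values is to look at their squares, which sum to the squared Frobenius norm of the matrix. The sketch below is an illustration under that interpretation, not a step from the notebook; `energy` and `cumulative` are names chosen for this example, and the counts are copied from the `tdm` table.

```python
import numpy as np

# Term-document counts copied from the `tdm` table in this section.
tdm = np.array([
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
], dtype=float)

s = np.linalg.svd(tdm, compute_uv=False)   # singular values only

# The squared singular values add up to the sum of squared matrix entries.
print(np.isclose((s ** 2).sum(), (tdm ** 2).sum()))

energy = s ** 2 / (s ** 2).sum()   # share of the total captured by each topic
cumulative = np.cumsum(energy)     # share captured by the first k topics

print(energy.round(2))
print(cumulative.round(2))
```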


<a name='4.3.3'></a><a id='4.3.3'></a>
## 4.3.3 V<sup>T</sup>-right singular vectors
<a href="#top">[back to top]</a>

Problem: What does the V<sup>T</sup> (right singular) matrix represent?

Idea: The V<sup>T</sup> matrix is the document-document matrix. Its rows, which are the columns of V, are the "right singular vectors."

Importance: This gives you the shared meaning between documents, because it measures how often documents use the same topics in the new semantic model of the documents. 


In [18]:
pd.DataFrame(Vt).round(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,-0.44,-0.44,-0.31,-0.44,-0.44,-0.2,-0.01,-0.01,-0.08,-0.31,-0.01
1,-0.09,-0.09,0.19,-0.09,-0.09,-0.09,0.37,0.47,0.56,0.19,0.47
2,-0.16,-0.16,0.52,-0.16,-0.16,-0.29,-0.22,-0.32,0.17,0.52,-0.32
3,0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,0.71,-0.0,-0.0,-0.71
4,-0.04,-0.04,-0.14,-0.04,-0.04,0.58,0.13,-0.33,0.62,-0.14,-0.33
5,-0.09,-0.09,0.1,-0.09,-0.09,0.51,-0.73,0.27,-0.01,0.1,0.27
6,-0.57,0.21,0.11,0.33,-0.31,0.34,0.34,0.0,-0.34,0.23,0.0
7,-0.32,0.47,0.25,-0.63,0.41,0.07,0.07,0.0,-0.07,-0.18,0.0
8,-0.5,0.29,-0.2,0.41,0.16,-0.37,-0.37,-0.0,0.37,-0.17,0.0
9,-0.15,-0.15,-0.59,-0.15,0.42,0.04,0.04,-0.0,-0.04,0.63,-0.0
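
The sketch below checks two properties of V<sup>T</sup> on the same counts (copied from the `tdm` table so the block is self-contained). It is an illustration, and the `*_compact` names are assumptions for this example. First, the rows of V<sup>T</sup> are orthonormal. Second, only the first six rows are paired with a singular value; `full_matrices=False` returns just those rows, and in them documents 0 and 1, which use the same vocabulary words, get identical coordinates.

```python
import numpy as np

# Term-document counts copied from the `tdm` table in section 4.3.
tdm = np.array([
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
], dtype=float)

# Full SVD: Vt has one row and one column per document.
U, s, Vt = np.linalg.svd(tdm)
print(Vt.shape)
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))   # rows are orthonormal

# Compact SVD: keep only the rows of Vt that have a singular value.
U_compact, s_compact, Vt_compact = np.linalg.svd(tdm, full_matrices=False)
print(Vt_compact.shape)

# Documents 0 and 1 contain the same vocabulary words (apple, nyc),
# so their columns agree in the compact Vt.
print(np.allclose(Vt_compact[:, 0], Vt_compact[:, 1]))

# The compact factors still reproduce the original matrix.
print(np.allclose((U_compact * s_compact) @ Vt_compact, tdm))
```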


<a name='4.3.4'></a><a id='4.3.4'></a>
## 4.3.4 SVD matrix orientation
<a href="#top">[back to top]</a>

Problem: In traditional linear algebra, vectors are usually treated as column vectors, so any SVD math done directly on the matrix expects term-document format (one column per document). But in NLP training sets, vectors are row vectors.

Idea: Before training an ML model, transpose the term-document or topic-document matrix into the orientation scikit-learn expects: one row per document (sample) and one column per term or topic (feature).
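
As a rough sketch of the orientation change, the block below transposes the term-document counts (copied from the `tdm` table in section 4.3) and passes them to scikit-learn's `TruncatedSVD`. Choosing `TruncatedSVD`, two components, and `random_state=42` are assumptions made for illustration; the notebook itself calls `np.linalg.svd` directly.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Term-document counts: one row per term, one column per document.
tdm = np.array([
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
], dtype=float)

# scikit-learn expects samples as rows, so transpose to document-term format.
dtm = tdm.T                                    # shape (11, 6)

svd = TruncatedSVD(n_components=2, random_state=42)
doc_topics = svd.fit_transform(dtm)            # shape (11, 2): documents x topics

print(dtm.shape, doc_topics.shape)
print(svd.components_.shape)                   # (2, 6): topics x terms
```

The result already has one row per document, so it can be used directly as a feature matrix for a downstream scikit-learn estimator.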

<a name='4.3.5'></a><a id='4.3.5'></a>
## 4.3.5 Truncating the topics
<a href="#top">[back to top]</a>

Problem: We have created a topic model, a way to transform word frequency vectors into topic weight vectors. But we still have not reduced the number of dimensions. Because we have just as many topics as words, the vector space model has just as many dimensions as the original BOW vectors.

Idea: Truncate the topics by lopping off columns on the right-hand side of U. We can ignore the S matrix, because the rows and columns of the U matrix are already arranged so that the most important topics (with the largest singular values) are on the left. Also, we can ignore S because most of the word-document vectors we want to use with this model, like TF-IDF vectors, have already been normalized.

In [19]:
tdm = bow_svd['tdm']
U, s, Vt = np.linalg.svd(tdm)
S = np.zeros((len(U), len(Vt)))
np.fill_diagonal(S, s)

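# Zero out the singular values one at a time, weakest topic first, and record
# the RMS error of the reconstructed term-document matrix after each step.
err = []
for numdim in range(len(s), 0, -1):
    S[numdim - 1, numdim - 1] = 0
    reconstructed_tdm = U.dot(S).dot(Vt)
    err.append(
        np.sqrt(
            ((reconstructed_tdm - tdm).values.flatten() ** 2)
                .sum() / np.prod(tdm.shape)  # np.prod replaces the deprecated np.product
        )
    )

# Despite the name, these values are reconstruction errors (lower is better).
reconstruction_accuracy = np.array(err).round(2)
reconstruction_accuracy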

array([0.06, 0.12, 0.17, 0.28, 0.39, 0.55])
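
The truncation described above can be sketched directly, without zeroing entries of S. The block below is an illustration that assumes the same term-document counts as the `tdm` table in section 4.3 and keeps `k = 2` topics (an arbitrary choice). It also checks a standard property of SVD: the RMS reconstruction error equals the square root of the sum of the discarded squared singular values divided by the number of matrix entries.

```python
import numpy as np

# Term-document counts copied from the `tdm` table in section 4.3.
tdm = np.array([
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(tdm, full_matrices=False)

k = 2                              # number of topics to keep (illustrative)
U_k = U[:, :k]                     # lop off the right-hand columns of U

# Topic vectors for every document: 2 dimensions instead of 6.
topic_vectors = U_k.T @ tdm        # shape (2, 11)

# Rank-k reconstruction and its RMS error.
reconstructed = (U_k * s[:k]) @ Vt[:k]
rmse = np.sqrt(((reconstructed - tdm) ** 2).mean())

# The same error, predicted from the discarded singular values alone.
predicted = np.sqrt((s[k:] ** 2).sum() / tdm.size)

print(topic_vectors.shape)
print(np.isclose(rmse, predicted))
```

Keeping two topics here is the same computation as the step in the loop above where four singular values have been zeroed, so `rmse` should correspond to the fourth entry of that array.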