<a id='top'></a><a name='top'></a>
# Chapter 4: Finding meaning in word counts (semantic analysis)

## 4.2 Latent semantic analysis

* [Introduction](#introduction)
* [4.0 Imports and Setup](#4.0)
* [4.2 Latent semantic analysis](#4.2)
    - [4.2.1 Your thought experiment made real](#4.2.1)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Datasets

* cats_and_dogs_sorted.txt: [script](#cats_and_dogs_sorted.txt), [source](https://github.com/totalgood/nlpia/raw/master/src/nlpia/data/cats_and_dogs_sorted.txt)

### Explore

* Analyzing semantics (meaning) to create topic vectors
* Semantic search using the similarity between topic vectors
* Scalable semantic analysis and semantic search for large corpora
* Using semantic components (topics) as features in your NLP pipeline
* Navigating high-dimensional vector spaces


### Key points

* You can use SVD for semantic analysis to decompose and transform TF-IDF
* Use LDiA when you need to compute explainable topic vectors
* No matter how you create your topic vectors, they can be used for semantic search to find documents based on their meaning
* Topic vectors can be used to predict whether a social post is spam or is likely to be "liked"
* We can sidestep the curse of dimensionality to approximate nearest neighbors in a semantic vector space

---

Latent semantic analysis is based on the oldest and most commonly used technique for dimension reduction, singular value decomposition. SVD was in widespread use long before the term "machine learning" even existed. SVD decomposes a matrix into three matrices, one of which is diagonal.
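As a minimal sketch of that decomposition, NumPy's `np.linalg.svd` returns the two orthogonal factors plus the singular values that sit on the diagonal of the middle factor. The 3x4 matrix here is a made-up toy, not data from this chapter.

```python
import numpy as np

# Hypothetical toy matrix: 3 terms (rows) x 4 documents (columns)
A = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])

# Full SVD: U is 3x3, Vt is 4x4, s holds the min(3, 4) = 3 singular values
U, s, Vt = np.linalg.svd(A)

# Place the singular values on the main diagonal of a 3x4 matrix
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)

# Multiplying the three factors back together recovers A (up to rounding)
A_rebuilt = U @ Sigma @ Vt
print(U.shape, Sigma.shape, Vt.shape)
print(np.allclose(A, A_rebuilt))
```

Note that only the outer factors are square here; the middle factor has the same shape as the original matrix and is zero everywhere off its main diagonal.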

Using SVD, LSA can break down your TF-IDF term-document matrix into three simpler matrices. And they can be multiplied back together to produce the original matrix, without any changes. This is like factorization of a large integer. Big whoop. But these three simpler matrices from SVD reveal properties about the original TF-IDF matrix that you can exploit to simplify it. You can truncate those matrices (ignore some rows and columns) before multiplying them back together, which reduces the number of dimensions you have to deal with in your vector space model.
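The truncation step can be sketched as follows, assuming a random stand-in for a 6-term by 16-document matrix (the values and the choice `k = 2` are illustrative only). Keeping the first `k` singular values and their vectors gives a matrix of the same shape but lower rank.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 16))  # hypothetical stand-in for a term-document matrix

# Economy-size SVD: U is 6x6, s has 6 values, Vt is 6x16
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of dimensions (topics) to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Same shape as A, but only rank k
print(A_k.shape, np.linalg.matrix_rank(A_k))

# The Frobenius-norm error equals the "energy" in the discarded singular values
error = np.linalg.norm(A - A_k)
print(np.isclose(error, np.sqrt(np.sum(s[k:] ** 2))))
```

Because the singular values come back sorted from largest to smallest, dropping the trailing ones discards the directions that contribute least to the matrix.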

These truncated matrices don’t give the exact same TF-IDF matrix you started with; for many purposes they give you a better one. Your new representation of the documents contains the essence, the “latent semantics” of those documents. That’s why SVD is used in other fields for things such as compression. It captures the essence of a dataset and ignores the noise. A JPEG image can be around ten times smaller than the original bitmap, yet it still retains most of the visually important information in the original image.

When you use SVD this way in natural language processing, you call it latent semantic analysis. LSA uncovers the semantics, or meaning, of words that is hidden and waiting to be uncovered.

Latent semantic analysis is a mathematical technique for finding the “best” way to linearly transform (rotate and stretch) any set of NLP vectors, like your TF-IDF vectors or bag-of-words vectors. And the “best” way for many applications is to line up the axes (dimensions) in your new vectors with the greatest “spread” or variance in the word frequencies. You can then eliminate those dimensions in the new vector space that don’t contribute much to the variance in the vectors from document to document.
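To make the "line up the axes with the greatest variance" idea concrete, here is a small sketch using scikit-learn's `PCA` on made-up 2-D data (the data and variable names are assumptions for illustration, not part of the chapter's corpus). The two features are almost perfectly correlated, so nearly all of the spread lies along one rotated axis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical 2-D data stretched along the diagonal
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=2).fit(X)

# Fraction of total variance captured by each new axis, largest first
print(pca.explained_variance_ratio_.round(3))

# The first principal axis points roughly along the diagonal,
# so both of its entries have similar magnitude
print(pca.components_[0].round(2))

# Dropping the low-variance axis leaves one number per data point
X_reduced = PCA(n_components=1).fit_transform(X)
print(X_reduced.shape)
```

The same reasoning applies to TF-IDF vectors, only with many more dimensions: the axes with little variance from document to document are the ones you can discard.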

---
<a name='4.0'></a><a id='4.0'></a>
# 4.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
import os
if not os.path.exists('setup'):
    os.mkdir('setup')

In [2]:
req_file = "setup/requirements_04.txt"

In [3]:
%%writefile {req_file}
isort
scikit-learn-intelex
scrapy
watermark

Overwriting setup/requirements_04.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [5]:
# if IS_COLAB:
from sklearnex import patch_sklearn 
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [6]:
%%writefile setup/chp04_4.2_imports.py
import locale
import os
import pprint
import random
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D  # noqa
from nltk.tokenize import casual_tokenize
from nltk.tokenize.casual import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.auto import tqdm
from watermark import watermark

Overwriting setup/chp04_4.2_imports.py


In [7]:
!isort setup/chp04_4.2_imports.py --sl
!cat setup/chp04_4.2_imports.py

import locale
import os
import pprint
import random

import numpy as np
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D  # noqa
from nltk.tokenize import casual_tokenize
from nltk.tokenize.casual import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.auto import tqdm
from watermark import watermark


In [8]:
import locale
import os
import pprint
import random
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D  # noqa
from nltk.tokenize.casual import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.auto import tqdm
from watermark import watermark

In [9]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)
random.seed(42)
np.random.seed(42)

print(watermark(iversions=True,globals_=globals(),python=True,machine=True))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

numpy  : 1.23.5
pandas : 1.5.3
seaborn: 0.12.1
sys    : 3.8.12 (default, Dec 13 2021, 20:17:08) 
[Clang 13.0.0 (clang-1300.0.29.3)]



---
<a name='4.2'></a><a id='4.2'></a>
# 4.2 Latent semantic analysis (in-depth explanation)
<a href="#top">[back to top]</a>

Problem: What is the underlying mechanism of LSA?

Idea: SVD linear algebra

Importance: Using SVD, LSA can break down a TF-IDF term-document matrix into three simpler matrices. These matrices can be truncated and then multiplied back together. The result is not the exact TF-IDF term-document matrix, but a simpler representation containing the essence, the "latent semantics", of those documents. This is akin to the lossy compression used in JPEG images. When SVD is used this way in NLP, it is called latent semantic analysis. LSA uncovers the hidden semantics, or meaning, of words. It is a mathematical technique for finding the "best" way to linearly transform any set of NLP vectors.

LSA is a way to train a machine to recognize the meaning (semantics) of words and phrases by giving the machine some example usages. Like people, machines can learn semantics from example usages of words much faster and more easily than they can from dictionary definitions.

<a name='4.2.1'></a><a id='4.2.1'></a>
## 4.2.1 Your thought experiment made real
<a href="#top">[back to top]</a>

Problem: Explore computing specified topics from our thought experiment.

Idea: Use an algorithm to try specifying topics such as "animalness", "petness", etc., from our thought experiment.

<a id='cats_and_dogs_sorted.txt'></a><a name='cats_and_dogs_sorted.txt'></a>
### Dataset: cats_and_dogs_sorted.txt
<a href="#top">[back to top]</a>

In [10]:
data_dir = 'data/data_cats_dogs'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
    
data_cats_dogs = f"{data_dir}/cats_and_dogs_sorted.txt"
!wget -P {data_dir} -nc https://github.com/totalgood/nlpia/raw/master/src/nlpia/data/cats_and_dogs_sorted.txt
!ls -l {data_cats_dogs}

File ‘data/data_cats_dogs/cats_and_dogs_sorted.txt’ already there; not retrieving.

-rw-r--r--  1 gb  staff  10095 Mar 26 16:19 data/data_cats_dogs/cats_and_dogs_sorted.txt


In [11]:
!head {data_cats_dogs}

NYC is the Big Apple.
NYC is known as the Big Apple.
I love NYC!
I wore a hat to the Big Apple party in NYC.
Come to NYC. See the Big Apple!
Manhattan is called the Big Apple.
New York is a big city for a small cat.
The lion, a big cat, is the king of the jungle.
I love my pet cat.
I love New York City (NYC).


In [12]:
with open(data_cats_dogs, 'r') as f:
    contents_raw = [stripped for line in f if (stripped := line.strip())]
    
print(contents_raw[:5])
HR()
corpus = ' '.join(contents_raw)
print(corpus[:100])

['NYC is the Big Apple.', 'NYC is known as the Big Apple.', 'I love NYC!', 'I wore a hat to the Big Apple party in NYC.', 'Come to NYC. See the Big Apple!']
----------------------------------------
NYC is the Big Apple. NYC is known as the Big Apple. I love NYC! I wore a hat to the Big Apple party


---
Topic-word matrix for LSA on 16 short sentences about cats, dogs, and NYC.

In [13]:
# From ch04_catdog_lsa_3x6x16.py

NUM_TOPICS = 3
NUM_WORDS = 6
NUM_DOCS = NUM_PRETTY = 16
SAVE_SORTED_CORPUS = ''  # 'cats_and_dogs_sorted.txt'
STOPWORDS = []
SYNONYMS = {}

stemmer = None  # PorterStemmer()

def normalize_corpus_words(corpus, stemmer=stemmer, synonyms=SYNONYMS, stopwords=STOPWORDS):
    docs = [doc.lower() for doc in corpus]
    docs = [casual_tokenize(doc) for doc in docs]
    docs = [[synonyms.get(w, w) for w in words if w not in stopwords] for words in docs]
    if stemmer:
        docs = [[stemmer.stem(w) for w in words if w not in stopwords] for words in docs]
    docs = [[synonyms.get(w, w) for w in words if w not in stopwords] for words in docs]
    docs = [' '.join(w for w in words if w not in stopwords) for words in docs]
    return docs

def tokenize(text, vocabulary, synonyms=SYNONYMS, stopwords=STOPWORDS):
    doc = normalize_corpus_words([text.lower()], synonyms=synonyms, stopwords=stopwords)[0]
    stems = [w for w in doc.split() if w in vocabulary]
    return stems

fun_words = vocabulary = 'cat dog apple lion nyc love big small'
fun_stems = normalize_corpus_words([fun_words])[0].split()[:NUM_WORDS]
fun_words = fun_words.split()


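# do it all over again on a tiny portion of the corpus and vocabulary
# `corpus` was built above as one joined string; iterating over a string yields
# single characters, so rebuild it as a list of the first NUM_DOCS sentences.
corpus = contents_raw[:NUM_DOCS]
docs = normalize_corpus_words(corpus)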

tfidfer = TfidfVectorizer(
    min_df=1, 
    max_df=.99, 
    stop_words=None, 
    token_pattern=r'(?u)\b\w+\b',
    vocabulary=fun_stems
)

tfidf_dense = pd.DataFrame(tfidfer.fit_transform(docs).todense())
id_words = [(i, w) for (w, i) in tfidfer.vocabulary_.items()]
tfidf_dense.columns = list(zip(*sorted(id_words)))[1]
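
# Build the raw-count (bag-of-words) table with a separate vectorizer instead of
# toggling use_idf/norm on the already-fitted tfidfer, which may leave the
# fitted TF-IDF weights out of sync on some scikit-learn versions.
bow_vectorizer = TfidfVectorizer(
    min_df=1,
    max_df=.99,
    stop_words=None,
    token_pattern=r'(?u)\b\w+\b',
    vocabulary=fun_stems,
    use_idf=False,
    norm=None,
)
bow_dense = pd.DataFrame(bow_vectorizer.fit_transform(docs).todense())
bow_dense.columns = list(zip(*sorted(id_words)))[1]
bow_dense = bow_dense.astype(int)
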
bow_pretty = bow_dense.copy()
bow_pretty = bow_pretty[fun_stems]
bow_pretty['text'] = corpus
for col in fun_stems:
    bow_pretty.loc[bow_pretty[col] == 0, col] = ''

# print(bow_pretty)
word_tfidf_dense = pd.DataFrame(tfidfer.transform(fun_stems).todense())
word_tfidf_dense.columns = list(zip(*sorted(id_words)))[1]
word_tfidf_dense.index = fun_stems

tfidf_pretty = tfidf_dense.copy()
tfidf_pretty = tfidf_pretty[fun_stems]
tfidf_pretty = tfidf_pretty.round(2)
for col in fun_stems:
    tfidf_pretty.loc[tfidf_pretty[col] == 0, col] = ''

tfidf_zeros = tfidf_dense.T.sum()[tfidf_dense.T.sum() == 0]

[corpus[i] for i in tfidf_zeros.index]

pcaer = PCA(n_components=NUM_TOPICS)

doc_topic_vectors = pd.DataFrame(
    pcaer.fit_transform(tfidf_dense.values), 
    columns=['top{}'.format(i) for i in range(NUM_TOPICS)]
)

doc_topic_vectors['text'] = corpus
pd.options.display.max_colwidth = 55

word_topic_vectors = pd.DataFrame(
    pcaer.transform(word_tfidf_dense.values),
    columns=['top{}'.format(i) for i in range(NUM_TOPICS)]
)

word_topic_vectors.index = fun_stems

def tfidf_search(text, corpus=tfidf_dense, corpus_text=corpus):
    """ search for the most relevant document """
    tokens = tokenize(text, vocabulary=corpus.columns)
    tfidf_vector_query = np.array(tfidfer.transform([' '.join(tokens)]).todense())[0]
    query_series = pd.Series(tfidf_vector_query, index=corpus.columns)

    return corpus_text[query_series.dot(corpus.T).values.argmax()]

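def topic_search(text, corpus=doc_topic_vectors, pcaer=pcaer, corpus_text=corpus):
    """ search for the most relevant document using topic vectors """
    # Tokenize against the TF-IDF vocabulary, not the topic column names
    tokens = tokenize(text, vocabulary=tfidf_dense.columns)
    tfidf_vector_query = np.array(tfidfer.transform([' '.join(tokens)]).todense())[0]
    # PCA.transform returns a 2-D array; take its single row
    topic_vector_query = pcaer.transform([tfidf_vector_query])[0]
    # Compare on the numeric topic columns only (skip the 'text' column)
    topic_cols = [c for c in corpus.columns if c != 'text']
    query_series = pd.Series(topic_vector_query, index=topic_cols)
    return corpus_text[query_series.dot(corpus[topic_cols].T).values.argmax()]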


U, Sigma, VT = np.linalg.svd(tfidf_dense.T)  # <1> Transpose the doc-word tfidf matrix, because SVD works on column vectors
S = Sigma.copy()
S[4:] = 0
doc_labels = ['doc{}'.format(i) for i in range(len(tfidf_dense))]
U_df = pd.DataFrame(U, index=fun_stems, columns=fun_stems)
VT_df = pd.DataFrame(VT, index=doc_labels, columns=doc_labels)
ndim = 2
# Keep the top `ndim` singular values and vectors to rebuild a rank-`ndim` approximation
truncated_tfidf = U[:, :ndim].dot(np.diag(Sigma[:ndim])).dot(VT[:ndim, :])


word_topic_vectors.T.round(1)

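      cat  dog  apple  lion  nyc  love
top0  1.0  0.0    0.0   0.0  0.0   0.0
top1  0.0  1.0    0.0   0.0  0.0   0.0
top2  0.0  0.0    1.0   0.0  0.0   0.0

An identity-like result such as this one, where each topic loads on exactly one word, is typically what PCA returns when it is fit on an all-zero TF-IDF matrix. That happens when `corpus` reaches `normalize_corpus_words` as a single string and is split into characters instead of sentences. Fit on a list of sentences, the topics mix weights across several words.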
