<a id='top'></a><a name='top'></a>
# Chapter 4: Finding meaning in word counts (semantic analysis)

## 4.7 Steering with Feedback
## 4.8 Topic vector power


* [Introduction](#introduction)
* [4.0 Imports and Setup](#4.0)
* [4.7 Steering with feedback](#4.7)
    - [4.7.1 Linear discriminant analysis](#4.7.1)
* [4.8 Topic vector power](#4.8)
    - [4.8.1 Semantic search](#4.8.1)
    - [4.8.2 Improvements](#4.8.2)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Datasets

* sms-spam.csv: [script](#sms-spam.csv), [source](https://github.com/totalgood/nlpia/raw/master/src/nlpia/data/sms-spam.csv)


### Explore

* Analyzing semantics (meaning) to create topic vectors
* Semantic search using the similarity between topic vectors
* Scalable semantic analysis and semantic search for large corpora
* Using semantic components (topics) as features in your NLP pipeline
* Navigating high-dimensional vector spaces


### Key points

* You can use SVD for semantic analysis to decompose and transform TF-IDF
* Use LDiA when you need to compute explainable topic vectors
* No matter how you create your topic vectors, they can be used for semantic search to find documents based on their meaning
* Topic vectors can be used to predict whether a social post is spam or is likely to be "liked"
* We can sidestep the curse of dimensionality to approximate nearest neighbors in a semantic vector space


---
<a name='4.0'></a><a id='4.0'></a>
# 4.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
import os
if not os.path.exists('setup'):
    os.mkdir('setup')

In [2]:
req_file = "setup/requirements_04.txt"

In [3]:
%%writefile {req_file}
isort
plyfile
scikit-learn-intelex
scrapy
watermark

Overwriting setup/requirements_04.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [5]:
if IS_COLAB:
    # Patch scikit-learn only on Colab: with this notebook, patching seemed to crash the local machine
    from sklearnex import patch_sklearn
    patch_sklearn()

In [9]:
%%writefile setup/chp04_4.7_imports.py
import locale
import os
import random
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from nltk.tokenize import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from watermark import watermark

Overwriting setup/chp04_4.7_imports.py


In [7]:
!isort setup/chp04_4.7_imports.py --sl
!cat setup/chp04_4.7_imports.py

Fixing /Users/gb/Desktop/examples/setup/chp04_4.7_imports.py
import locale
import os
import random

import numpy as np
import pandas as pd
import seaborn as sns
from nltk.tokenize import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from watermark import watermark


In [8]:
import locale
import os
import random
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from nltk.tokenize import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from watermark import watermark

In [9]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")
random.seed(42)
np.random.seed(42)

print(watermark(iversions=True,globals_=globals(),python=True,machine=True))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

sys    : 3.8.12 (default, Dec 13 2021, 20:17:08) 
[Clang 13.0.0 (clang-1300.0.29.3)]
numpy  : 1.23.5
seaborn: 0.12.1
pandas : 1.5.3



---

<a name='4.7'></a><a id='4.7'></a>
# 4.7 Steering with feedback
<a href="#top">[back to top]</a>

Problem: How can we improve the distance metric used by distance-based applications?

Idea: Perform "distance metric learning", which learns a set of latent factors under which the distances between data points better reflect what we care about. By adjusting the distance scores reported to clustering and embedding algorithms, we can "steer" the vectors so that they minimize some cost function. In this way, we can force the vectors to focus on the aspect of the information content we are interested in.
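As a rough illustration of how changing the distance function "steers" which vectors count as close, the sketch below uses a per-dimension weighted Euclidean distance. The points and weights are made up for this example; a real metric-learning method would learn the weights from labeled feedback rather than set them by hand.

```python
import numpy as np


def weighted_distance(a, b, weights):
    """Euclidean distance after scaling each squared difference by a weight."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(weights, dtype=float) * diff ** 2)))


# Hypothetical 2-D vectors: a query and two candidate documents
query = [0.0, 0.0]
doc_b = [1.0, 0.0]
doc_c = [0.0, 2.0]

plain = [1.0, 1.0]     # ordinary Euclidean distance
steered = [4.0, 0.25]  # hand-picked weights that emphasize dimension 0

# With plain weights doc_b is the nearer neighbor; with steered weights doc_c is
print(weighted_distance(query, doc_b, plain), weighted_distance(query, doc_c, plain))
print(weighted_distance(query, doc_b, steered), weighted_distance(query, doc_c, steered))
```

The same two documents swap rank once dimension 0 is weighted more heavily, which is the effect a learned metric exploits.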

<a name='4.7.1'></a><a id='4.7.1'></a>
## 4.7.1 Linear discriminant analysis
<a href="#top">[back to top]</a>

Problem: Test how well a linear discriminant analysis (LDA) model performs on a supervised classification problem with labeled data.

Idea: LDA is a dimensionality reduction technique, related to LSA, that is commonly used in supervised classification problems. Rather than maximizing the variance between all vectors in the new space, LDA maximizes the distance between the centroids of the classes. To do this, we have to tell the LDA algorithm what "topics" we want to model by giving it examples (labeled vectors).
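The centroid idea can be sketched in a few lines of NumPy. This is a simplified illustration on synthetic 2-D data, not what scikit-learn's `LinearDiscriminantAnalysis` does internally: it projects each vector onto the line between the two class centroids and ignores the within-class covariance that full LDA also takes into account. The class locations and sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for labeled document vectors (hypothetical data)
ham = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
spam = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
X = np.vstack([ham, spam])
y = np.array([0] * 50 + [1] * 50)

# One centroid per class
ham_centroid = X[y == 0].mean(axis=0)
spam_centroid = X[y == 1].mean(axis=0)

# Project onto the line joining the centroids, measured from their midpoint
direction = spam_centroid - ham_centroid
midpoint = (spam_centroid + ham_centroid) / 2
spaminess = (X - midpoint) @ direction

# Positive scores fall on the spam side of the midpoint
predicted = (spaminess > 0).astype(int)
accuracy = (predicted == y).mean()
print(f"Training accuracy: {accuracy:.3f}")
```

The single `spaminess` score plays the role of the one-dimensional topic that `LDA(n_components=1)` produces for a two-class problem.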

<a id='sms-spam.csv'></a><a name='sms-spam.csv'></a>
### Dataset: sms-spam.csv
<a href="#top">[back to top]</a>

In [10]:
data_dir = 'data/data_sms_spam'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
    
data_sms_spam = f"{data_dir}/sms-spam.csv"
!wget -P {data_dir} -nc https://github.com/totalgood/nlpia/raw/master/src/nlpia/data/sms-spam.csv
!ls -l {data_sms_spam}

sms = pd.read_csv(data_sms_spam, index_col=0)

# Append '!' to the index label of every spam message so spam rows are easy to spot
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms.spam = sms.spam.astype(int)
sms.head(6)

File ‘data/data_sms_spam/sms-spam.csv’ already there; not retrieving.

-rw-r--r--  1 gb  staff  493232 Mar 25 11:17 data/data_sms_spam/sms-spam.csv


Unnamed: 0,spam,text
sms0,0,"Go until jurong point, crazy.. Available only ..."
sms1,0,Ok lar... Joking wif u oni...
sms2!,1,Free entry in 2 a wkly comp to win FA Cup fina...
sms3,0,U dun say so early hor... U c already then say...
sms4,0,"Nah I don't think he goes to usf, he lives aro..."
sms5!,1,FreeMsg Hey there darling it's been 3 week's n...


In [11]:
# Calculate the TF-IDF vectors for each of these messages.
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)

tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
tfidf_docs = pd.DataFrame(tfidf_docs, index=index)
tfidf_docs = tfidf_docs - tfidf_docs.mean()

print(tfidf_docs.shape)
print(sms.spam.sum())

(4837, 9232)
638


In [12]:
%%time
# Requires a long time to run.

lda = LDA(n_components=1)
lda = lda.fit(tfidf_docs, sms.spam)
lda

CPU times: user 3min 24s, sys: 6.4 s, total: 3min 30s
Wall time: 2min 11s


In [13]:
%%time
sms['lda_spaminess'] = lda.predict(tfidf_docs)
sms['lda_spaminess']

CPU times: user 209 ms, sys: 34.9 ms, total: 244 ms
Wall time: 224 ms


sms0        0
sms1        0
sms2!       1
sms3        0
sms4        0
           ..
sms4832!    1
sms4833     0
sms4834     0
sms4835     0
sms4836     0
Name: lda_spaminess, Length: 4837, dtype: int64

In [14]:
((sms.spam - sms.lda_spaminess) ** 2.).sum() ** .5

0.0

In [15]:
(sms.spam == sms.lda_spaminess).sum()

4837

In [16]:
len(sms)

4837

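The model labels every training message correctly, which suggests it is overfitting.

Cross-validation gives a fairer estimate of how it performs on unseen messages.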

In [17]:
%%time
# Requires a long time to run.

lda = LDA(n_components=1)
scores = cross_val_score(lda, tfidf_docs, sms.spam, cv=2)

"Accuracy: {:.2f} (+/-{:.2f})".format(scores.mean(), scores.std() * 2)

CPU times: user 1min 27s, sys: 3.79 s, total: 1min 31s
Wall time: 54.6 s


'Accuracy: 0.66 (+/-0.12)'
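For a classifier and an integer `cv`, `cross_val_score` splits the data into stratified folds, fits a fresh copy of the model on all folds but one, and scores it on the fold left out. The sketch below spells that loop out on a small synthetic dataset (generated with `make_classification`, an assumption of this example, not the SMS data) and compares it with the one-line call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic two-class problem (hypothetical stand-in for the TF-IDF matrix)
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    model = LinearDiscriminantAnalysis()
    model.fit(X[train_idx], y[train_idx])  # fit on one half
    manual_scores.append(model.score(X[test_idx], y[test_idx]))  # score on the other

auto_scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=2)

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

Because every score comes from messages the model never saw during fitting, the mean is a more honest estimate than accuracy on the training set.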

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_docs, 
    sms.spam, 
    test_size=0.33, 
    random_state=271828
)

print(np.nanmin(X_train))
print(np.nanmax(X_train))
test4 = pd.DataFrame(X_train)
test4.head()

-0.06573907666856706
0.9997932602853008


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9222,9223,9224,9225,9226,9227,9228,9229,9230,9231
sms207,0.114506,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,0.524417,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
sms68!,0.089771,0.407319,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
sms1440,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
sms2636,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
sms3541,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05


In [19]:
test4.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3240 entries, sms207 to sms2178
Columns: 9232 entries, 0 to 9231
dtypes: float64(9232)
memory usage: 228.2+ MB


In [20]:
%%time
# This requires time to run.
lda = LDA(n_components=1)
lda.fit(X_train, y_train) 

lda.score(X_test, y_test).round(3)

CPU times: user 1min 16s, sys: 2.41 s, total: 1min 19s
Wall time: 44.4 s


0.764

A held-out accuracy of 0.764, far below the perfect score on the training data, confirms that the model overfits.

----

Next, try LSA combined with LDA to create a more accurate model that also generalizes well, so it can handle new SMS messages better.

In [29]:
%%time
# Specify 16-D vectors
pca = PCA(n_components=16)

pca = pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)

# Create DataFrame for more convenience
columns = ['topic{}'.format(i) for i in range(pca.n_components)]
columns

CPU times: user 6.84 s, sys: 722 ms, total: 7.56 s
Wall time: 3.25 s


['topic0',
 'topic1',
 'topic2',
 'topic3',
 'topic4',
 'topic5',
 'topic6',
 'topic7',
 'topic8',
 'topic9',
 'topic10',
 'topic11',
 'topic12',
 'topic13',
 'topic14',
 'topic15']

In [30]:
pca_topic_vectors = pd.DataFrame(
    pca_topic_vectors, 
    columns=columns, 
    index=index
)

pca_topic_vectors.round(3).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.201,0.003,0.037,0.011,-0.019,-0.053,0.039,-0.065,0.012,-0.082,0.008,0.011,0.007,-0.035,-0.01,0.035
sms1,0.404,-0.094,-0.078,0.051,0.1,0.047,0.023,0.065,0.024,-0.024,-0.005,-0.038,0.042,-0.016,0.049,-0.041
sms2!,-0.03,-0.048,0.09,-0.067,0.091,-0.043,-0.0,-0.002,-0.057,0.051,0.124,-0.024,0.027,-0.016,-0.043,0.061
sms3,0.329,-0.033,-0.035,-0.016,0.052,0.056,-0.165,-0.075,0.063,-0.108,0.02,-0.027,0.073,-0.039,0.028,-0.072
sms4,0.002,0.031,0.038,0.034,-0.075,-0.093,-0.044,0.062,-0.045,0.029,0.028,0.01,0.025,0.031,-0.079,-0.015


In [31]:
print(type(pca_topic_vectors))
print(type(pca_topic_vectors.values))

<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>


In [32]:
X_train, X_test, y_train, y_test = train_test_split(
    pca_topic_vectors.values, 
    sms.spam, 
    test_size=0.3, 
    random_state=271828
)

print(np.nanmin(X_train))
print(np.nanmax(X_train))
test5 = pd.DataFrame(X_train)
test5.head()

# -2.527663924870674e+303
# 5.2270877941723415e+299

-0.454355120437229
0.6905163499445719


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,-0.024001,0.152769,-0.018366,-0.014253,0.100766,0.010496,0.004555,0.001028,0.017042,0.044738,-0.075004,-0.041092,-0.067349,-0.017618,-0.035576,0.021848
1,0.015476,0.007608,0.04718,-0.063238,-0.028474,-0.021096,0.013783,-0.000787,-0.104146,-0.00567,-0.039918,0.014758,-0.025962,-0.084905,0.046081,-0.066619
2,-0.009608,0.112268,0.042762,-0.002024,0.017786,-0.027263,-0.049576,0.061489,-0.143477,0.044998,0.146418,0.009608,0.000156,0.032144,0.077956,0.077623
3,0.090047,-0.044906,-0.022838,-0.050271,-0.156378,-0.004107,0.005525,-0.044773,0.101857,0.09342,0.036763,0.129166,-0.045691,-0.096757,-0.086781,0.112243
4,-0.133593,-0.123237,0.091927,0.261289,-0.039362,0.04552,-0.06203,-0.110969,0.207442,0.194476,0.047076,0.207852,-0.100322,0.156661,0.103972,-0.100201


In [33]:
test5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3385 entries, 0 to 3384
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3385 non-null   float64
 1   1       3385 non-null   float64
 2   2       3385 non-null   float64
 3   3       3385 non-null   float64
 4   4       3385 non-null   float64
 5   5       3385 non-null   float64
 6   6       3385 non-null   float64
 7   7       3385 non-null   float64
 8   8       3385 non-null   float64
 9   9       3385 non-null   float64
 10  10      3385 non-null   float64
 11  11      3385 non-null   float64
 12  12      3385 non-null   float64
 13  13      3385 non-null   float64
 14  14      3385 non-null   float64
 15  15      3385 non-null   float64
dtypes: float64(16)
memory usage: 423.2 KB


In [34]:
pd.DataFrame(X_train).isnull().any().any()

False

In [35]:
pd.DataFrame(X_train).isnull().any()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
dtype: bool

In [36]:
np.any(np.isnan(X_train))

False

In [37]:
np.all(np.isfinite(X_train))

True

In [38]:
lda = LDA(n_components=1)
lda

In [39]:
%%time

lda.fit(
    X_train, 
    y_train
) 

CPU times: user 15.8 ms, sys: 3.22 ms, total: 19.1 ms
Wall time: 17.1 ms


In [40]:
print(f"Test set accuracy: {lda.score(X_test, y_test).round(3)}")
HR()

lda = LDA(n_components=1)
scores = cross_val_score(lda, pca_topic_vectors, sms.spam, cv=10)

print(f"Cross-validated accuracy: {scores.mean():.3f} (+/-{(scores.std() * 2):.3f})")

Test set accuracy: 0.963
----------------------------------------
Cross-validated accuracy: 0.956 (+/-0.023)


Note: With LSA, you can characterize an SMS message with only 16 dimensions and still have plenty of information to classify it as spam or not. This low-dimensional model is also much less likely to overfit. It should generalize well and be able to classify as-yet-unseen SMS messages or chats.

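On the training data, the plain LDA model on the full TF-IDF vectors scores higher (it labeled all 4,837 messages correctly), but its held-out accuracy was only 0.764, compared with 0.963 for LDA on the 16 LSA topic vectors. A further advantage of the topic-vector model is that you can create vectors that represent the semantics of a statement in more than just a single dimension.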
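One way to package the LSA and LDA steps is a scikit-learn `Pipeline`, so the vectorizer and the topic model are refit inside each training split rather than on the whole corpus. The sketch below is a minimal example under several assumptions: the eight messages and their labels are invented, and `TruncatedSVD` is swapped in for the `PCA` used above because it accepts the sparse TF-IDF matrix directly (it does not center the data, so the topic vectors will differ from the PCA ones).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = spam, 0 = not spam
texts = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "urgent you have won a prize",
    "are we still on for lunch",
    "see you at home tonight",
    "can you call me after lunch",
    "running late see you soon",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

spam_pipeline = make_pipeline(
    TfidfVectorizer(),                              # text -> sparse TF-IDF vectors
    TruncatedSVD(n_components=2, random_state=42),  # TF-IDF -> 2-D topic vectors
    LinearDiscriminantAnalysis(),                   # topic vectors -> spam / not spam
)
spam_pipeline.fit(texts, labels)

predictions = spam_pipeline.predict(["win a free prize", "see you at lunch"])
print(predictions)
```

Passing `spam_pipeline` to `cross_val_score` with the raw texts would then repeat all three steps per fold.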

---
<a name='4.8'></a><a id='4.8'></a>
# 4.8 Topic vector power
<a href="#top">[back to top]</a>

Problem: What can be done with topic vectors?

Idea: Compare the meaning of words, documents, statements, and corpora. You can find "clusters" of similar documents and statements. We can find documents that are relevant to the meaning of a query, not just documents whose word statistics happen to match it. This is called "semantic search."
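Comparing meaning usually comes down to measuring the similarity between topic vectors. The sketch below ranks documents by cosine similarity to a query vector. The 3-D topic vectors are invented for illustration, and the function name `cosine_rank` is hypothetical; in practice the vectors would come from a fitted LSA or LDiA model, and a zero-length vector would need special handling.

```python
import numpy as np


def cosine_rank(query_vec, doc_vecs):
    """Return document indices sorted by decreasing cosine similarity to the query."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return order, sims[order]


# Hypothetical topic vectors: one row per document, one column per topic
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])
query_vec = [1.0, 0.0, 0.0]  # a query that is entirely about topic 0

order, sims = cosine_rank(query_vec, doc_vecs)
print(order)          # document indices, most similar first
print(sims.round(3))  # matching cosine similarities
```

Documents 0 and 2, which load mostly on topic 0, rank ahead of the others even though no words are compared directly.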

<a name='4.8.1'></a><a id='4.8.1'></a>
## 4.8.1 Semantic search
<a href="#top">[back to top]</a>

Problem: What is semantic search?

Idea: It is full-text search that takes into account the meaning of the words in the query and in the documents being searched. We can use LSA and LDiA to compute topic vectors that capture the semantics of words and documents. *Approximate nearest neighbors* algorithms based on locality-sensitive hashing (LSH) make semantic search efficient while remaining accurate.
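To give a feel for how LSH narrows the search, here is a minimal sketch of one common variant, random-hyperplane hashing for cosine similarity. Everything in it is an assumption made for illustration: the random 16-D "topic vectors", the 8-bit signatures, and the helper names. Production libraries use several hash tables and probe neighboring buckets to improve recall.

```python
from collections import defaultdict

import numpy as np


def make_signature_fn(n_dims, n_bits=8, seed=0):
    """Build a hash function from n_bits random hyperplanes through the origin."""
    planes = np.random.default_rng(seed).normal(size=(n_bits, n_dims))

    def signature(vec):
        # One bit per hyperplane: which side of the plane the vector falls on
        return tuple(int(b) for b in (planes @ np.asarray(vec, dtype=float)) > 0)

    return signature


def build_index(doc_vecs, signature):
    """Group document ids into buckets keyed by their signature."""
    buckets = defaultdict(list)
    for doc_id, vec in enumerate(doc_vecs):
        buckets[signature(vec)].append(doc_id)
    return buckets


def search(query_vec, doc_vecs, buckets, signature):
    """Rank only the documents that share the query's bucket, by cosine similarity."""
    q = np.asarray(query_vec, dtype=float)
    candidates = buckets.get(signature(q), [])

    def cosine(doc_id):
        d = doc_vecs[doc_id]
        return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q)))

    return sorted(candidates, key=cosine, reverse=True)


# Hypothetical 16-D topic vectors for 100 documents
doc_vecs = np.random.default_rng(1).normal(size=(100, 16))
signature = make_signature_fn(n_dims=16)
buckets = build_index(doc_vecs, signature)

# A rescaled copy of document 7 points the same way, so it lands in the same bucket
hits = search(doc_vecs[7] * 3.0, doc_vecs, buckets, signature)
print(hits)
```

Only the documents in the query's bucket are compared exactly, which is where the speedup over a brute-force scan comes from; the price is that a true neighbor in a different bucket can be missed.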

<a name='4.8.2'></a><a id='4.8.2'></a>
## 4.8.2 Improvements
<a href="#top">[back to top]</a>

Problem: What is the next step for topic vectors?

Idea: Make the vectors associated with words more precise and useful. Neural networks can be used to do this, which improves the pipeline's ability to extract meaning from short texts or even solitary words.