<a id='top'></a><a name='top'></a>
# Chapter 4: Finding meaning in word counts (semantic analysis)

## 4.5 Latent Dirichlet allocation (LDiA)

* [Introduction](#introduction)
* [4.0 Imports and Setup](#4.0)
* [4.5 Latent Dirichlet allocation (LDiA)](#4.5)
    - [4.5.1 The LDiA idea](#4.5.1)
    - [4.5.2 LDiA topic model for SMS messages](#4.5.2)
    - [4.5.3 LDiA + LDA = spam classifier](#4.5.3)
    - [4.5.4 A fairer comparison: 32 LDiA topics](#4.5.4)
* [4.6 Distance and similarity](#4.6)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Datasets

* sms-spam.csv: [script](#sms-spam.csv), [source](https://github.com/totalgood/nlpia/raw/master/src/nlpia/data/sms-spam.csv)

### Explore

* Analyzing semantics (meaning) to create topic vectors
* Semantic search using the similarity between topic vectors
* Scalable semantic analysis and semantic search for large corpora
* Using semantic components (topics) as features in your NLP pipeline
* Navigating high-dimensional vector spaces


### Key points

* You can use SVD for semantic analysis to decompose and transform TF-IDF
* Use LDiA when you need to compute explainable topic vectors
* No matter how you create your topic vectors, they can be used for semantic search to find documents based on their meaning
* Topic vectors can be used to predict whether a social post is spam or is likely to be "liked"
* We can sidestep the curse of dimensionality to approximate nearest neighbors in a semantic vector space


---
<a name='4.0'></a><a id='4.0'></a>
# 4.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
import os
if not os.path.exists('setup'):
    os.mkdir('setup')

In [2]:
req_file = "setup/requirements_04.txt"

In [3]:
%%writefile {req_file}
isort
plyfile
scikit-learn-intelex
scrapy
watermark

Overwriting setup/requirements_04.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [5]:
if IS_COLAB:
    # Patching scikit-learn with sklearnex seems to crash a local machine
    # for this notebook, so it is only applied on Colab.
    from sklearnex import patch_sklearn
    patch_sklearn()

In [9]:
%%writefile setup/chp04_4.5_imports.py
import locale
import os
import random
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from nltk.tokenize import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from watermark import watermark

Overwriting setup/chp04_4.5_imports.py


In [7]:
!isort setup/chp04_4.5_imports.py --sl
!cat setup/chp04_4.5_imports.py

Fixing /Users/gb/Desktop/examples/setup/chp04_4.5_imports.py
import locale
import os
import random

import numpy as np
import pandas as pd
import seaborn as sns
from nltk.tokenize import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from watermark import watermark


In [8]:
import locale
import os
import random
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from nltk.tokenize import casual_tokenize
from sklearn.decomposition import PCA
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from watermark import watermark

In [9]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
np.seterr(invalid='warn')
sns.set_style("darkgrid")
random.seed(42)
np.random.seed(42)

print(watermark(iversions=True,globals_=globals(),python=True,machine=True))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

pandas : 1.5.3
sys    : 3.8.12 (default, Dec 13 2021, 20:17:08) 
[Clang 13.0.0 (clang-1300.0.29.3)]
numpy  : 1.23.5
seaborn: 0.12.1



---
<a name='4.5'></a><a id='4.5'></a>
# 4.5 Latent Dirichlet allocation (LDiA)
<a href="#top">[back to top]</a>

Problem: What is the Latent Dirichlet Allocation (LDiA) model?
 
Idea: It is a generative probabilistic model for collections of discrete data such as text corpora. It is also a topic model used for discovering abstract topics in a collection of documents.

Importance: LDiA can give better results than LSA in certain situations. LDiA topic models can be easier to understand than LSA models because the words assigned to topics and the topics assigned to documents tend to make more sense than those from LSA. In general, LSA focuses on reducing matrix dimensionality, while LDiA is designed for topic modeling.

<a name='4.5.1'></a><a id='4.5.1'></a>
## 4.5.1 The LDiA idea
<a href="#top">[back to top]</a>

Problem: What is the core concept of LDiA?

Idea: It is a generative statistical model that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar. Observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document is assumed to contain only a small number of topics.
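
The generative story can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how scikit-learn fits the model. The toy vocabulary, the topic count, the Dirichlet parameters, and the helper name `generate_document` are all assumptions invented for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy setup: 3 topics over an 8-word vocabulary.
vocab = np.array(["free", "win", "prize", "call", "home", "later", "love", "ok"])
n_topics = 3

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)


def generate_document(mean_len=21, alpha=0.5):
    """Sample one document following the LDiA generative story."""
    # 1. Pick a document length (Poisson around the mean length).
    doc_len = max(1, rng.poisson(mean_len))
    # 2. Pick the document's mixture of topics.
    theta = rng.dirichlet(np.full(n_topics, alpha))
    # 3. Pick a topic for every word slot.
    topics = rng.choice(n_topics, size=doc_len, p=theta)
    # 4. Pick a word from the chosen topic's word distribution.
    words = [rng.choice(vocab, p=topic_word[z]) for z in topics]
    return theta, topics, words


theta, topics, words = generate_document()
print(theta.round(2))
print(" ".join(words))
```

Fitting LDiA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that could plausibly have produced them.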

<a id='sms-spam.csv'></a><a name='sms-spam.csv'></a>
### Dataset: sms-spam.csv
<a href="#top">[back to top]</a>

In [10]:
data_dir = 'data/data_sms_spam'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
    
data_sms_spam = f"{data_dir}/sms-spam.csv"
!wget -P {data_dir} -nc https://github.com/totalgood/nlpia/raw/master/src/nlpia/data/sms-spam.csv
!ls -l {data_sms_spam}

sms = pd.read_csv(data_sms_spam, index_col=0)

# https://github.com/totalgood/nlpia/blob/master/src/nlpia/book/examples/ch04.rst
# Label each message "sms<N>", appending "!" to the label of every spam message.
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms.spam = sms.spam.astype(int)
sms.head(6)

File ‘data/data_sms_spam/sms-spam.csv’ already there; not retrieving.

-rw-r--r--  1 gb  staff  493232 Mar 25 11:17 data/data_sms_spam/sms-spam.csv


Unnamed: 0,spam,text
sms0,0,"Go until jurong point, crazy.. Available only ..."
sms1,0,Ok lar... Joking wif u oni...
sms2!,1,Free entry in 2 a wkly comp to win FA Cup fina...
sms3,0,U dun say so early hor... U c already then say...
sms4,0,"Nah I don't think he goes to usf, he lives aro..."
sms5!,1,FreeMsg Hey there darling it's been 3 week's n...


In [11]:
# Calculate the TF-IDF vectors for each of these messages.
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)

tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
tfidf_docs = pd.DataFrame(tfidf_docs, index=index)
tfidf_docs = tfidf_docs - tfidf_docs.mean()

print(tfidf_docs.shape)
print(sms.spam.sum())

(4837, 9232)
638


In [12]:
%%time
# Using sklearnex on PCA seems to cause an error.

# Specify 16-D vectors
pca = PCA(n_components=16)

pca = pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)

# Create DataFrame for more convenience
columns = ['topic{}'.format(i) for i in range(pca.n_components)]
columns

CPU times: user 6.53 s, sys: 702 ms, total: 7.23 s
Wall time: 3.98 s


['topic0',
 'topic1',
 'topic2',
 'topic3',
 'topic4',
 'topic5',
 'topic6',
 'topic7',
 'topic8',
 'topic9',
 'topic10',
 'topic11',
 'topic12',
 'topic13',
 'topic14',
 'topic15']

In [13]:
total_corpus_len = 0
for document_text in sms.text:
    total_corpus_len += len(casual_tokenize(document_text))

mean_document_len = total_corpus_len / len(sms)
round(mean_document_len, 2)

21.35

<a name='4.5.2'></a><a id='4.5.2'></a>
## 4.5.2 LDiA topic model for SMS messages
<a href="#top">[back to top]</a>

Problem: How does LDiA treat topics?

Idea: The topics produced by LDiA tend to be more understandable and "explainable". This is because words that frequently occur together are assigned the same topic. Where LSA (PCA) tries to keep things spread apart that were spread apart to start with, LDiA tries to keep things close together that started out close together. 
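
A minimal sketch of this behavior on a toy corpus follows. The six messages are invented for illustration, and `random_state=0` is an arbitrary choice. With so little data the topics may not separate cleanly, so treat the printed words as indicative only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, invented for illustration only.
docs = [
    "win free prize call now",
    "free prize win cash call",
    "call now to win free cash",
    "see you at home later",
    "ok see you later at home",
    "home later ok love you",
]

counter = CountVectorizer()
bow = counter.fit_transform(docs)
# Terms ordered by column index, the same ordering the notebook builds.
terms = sorted(counter.vocabulary_, key=counter.vocabulary_.get)

ldia_toy = LatentDirichletAllocation(
    n_components=2, learning_method="batch", random_state=0
)
doc_topics = ldia_toy.fit_transform(bow)

# Show the highest-weighted terms of each topic.
for k, weights in enumerate(ldia_toy.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic{k}: {top_terms}")

# Each row is one document's mixture over the two topics.
print(doc_topics.round(2))
```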

In [14]:
counter = CountVectorizer(tokenizer=casual_tokenize)
bow_docs = pd.DataFrame(counter.fit_transform(raw_documents=sms.text).toarray(), index=index)
column_nums, terms = zip(*sorted(zip(counter.vocabulary_.values(), counter.vocabulary_.keys())))
bow_docs.columns = terms

In [15]:
sms.loc['sms0'].text

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [16]:
bow_docs.loc['sms0'][bow_docs.loc['sms0'] > 0].head()

,            1
..           1
...          2
amore        1
available    1
Name: sms0, dtype: int64

Use LDiA to create topic vectors for the SMS corpus.

In [17]:
%%time

ldia = LDiA(n_components=16, learning_method='batch')
ldia = ldia.fit(bow_docs)
ldia.components_.shape

CPU times: user 23 s, sys: 1.01 s, total: 24 s
Wall time: 28.7 s


(16, 9232)

Examine the first few words and how they're allocated to the 16 topics.

In [18]:
pd.set_option('display.width', 75)

components = pd.DataFrame(ldia.components_.T, index=terms, columns=columns)
components.round(2).head(10)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
!,4.63,33.22,202.06,11.01,13.23,144.89,86.72,73.59,487.08,107.13,35.32,29.73,7.32,69.83,61.54,21.71
"""",0.06,0.06,4.0,195.89,0.06,35.85,8.42,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,11.16
#,0.06,0.06,0.06,0.06,0.06,3.66,0.06,1.41,0.06,0.06,0.06,0.06,0.06,0.06,2.12,0.06
#150,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,1.06
#5000,0.06,0.06,0.06,0.06,0.06,0.06,3.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06,0.06
$,6.14,0.06,9.32,0.06,0.06,4.6,0.06,0.06,0.06,0.06,0.06,0.06,2.06,0.06,1.19,0.06
%,1.62,0.06,0.06,0.06,0.06,4.51,0.06,0.06,0.06,1.06,0.06,1.06,2.06,0.06,0.06,0.06
&,17.05,5.87,18.83,0.06,2.82,37.45,58.58,0.06,7.59,102.87,0.06,36.77,0.06,0.57,0.27,6.07
',0.06,0.06,13.44,5.52,0.06,7.73,0.06,10.88,8.78,2.49,107.42,4.24,0.06,0.06,0.06,0.06
(,10.36,0.06,1.58,0.06,0.06,24.59,11.58,22.13,0.06,9.73,0.06,1.06,0.06,6.45,0.06,0.06
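
The values in `components_` are pseudo-counts rather than probabilities. The recurring 0.06 floor appears consistent with scikit-learn's default topic-word prior of `1 / n_components` (here 1/16). A minimal sketch of turning pseudo-counts into per-topic word distributions, using a small made-up matrix loosely based on a few entries from the table above (truncated to 4 terms, so the resulting numbers are not the real probabilities):

```python
import numpy as np

# Hypothetical pseudo-counts laid out like ldia.components_: 2 topics x 4 terms.
components_toy = np.array([
    [4.63, 0.0625, 6.14, 17.05],
    [33.22, 0.0625, 0.0625, 5.87],
])

# Dividing each row by its sum gives one word distribution per topic.
topic_word_probs = components_toy / components_toy.sum(axis=1, keepdims=True)
print(topic_word_probs.round(3))

# With 16 topics the default prior is 1/16, which displays as 0.06.
print(round(1 / 16, 2))
```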


In [19]:
components.topic3.sort_values(ascending=False)[:10]

"        195.891765
..       172.154417
,        114.853307
call      78.200003
.         70.338211
d         69.603604
sorry     58.616796
u         56.246730
later     46.006989
of        42.608127
Name: topic3, dtype: float64

In [20]:
ldia16_topic_vectors = ldia.transform(bow_docs)
ldia16_topic_vectors = pd.DataFrame(ldia16_topic_vectors, index=index, columns=columns)
ldia16_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.0,0.46,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.51,0.0,0.0,0.0,0.0
sms1,0.01,0.9,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01
sms2!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.98,0.0,0.0,0.0,0.0,0.0,0.0
sms3,0.0,0.86,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.0,0.0
sms4,0.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
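
Because each row is a mixture over topics, a quick way to read a topic vector is to look at its largest entry. The sketch below rebuilds two rows from the rounded values shown above (`sms0` and `sms2!`) and reports the dominant topic of each. On the full DataFrame, `ldia16_topic_vectors.idxmax(axis=1)` should give the same kind of answer.

```python
import numpy as np

columns = [f"topic{i}" for i in range(16)]

# Rounded rows copied from the table above; all other entries display as 0.0.
doc_topics = {"sms0": np.zeros(16), "sms2!": np.zeros(16)}
doc_topics["sms0"][[1, 11]] = [0.46, 0.51]
doc_topics["sms2!"][9] = 0.98

dominant = {}
for label, vector in doc_topics.items():
    k = int(vector.argmax())          # position of the largest topic weight
    dominant[label] = columns[k]
    print(label, columns[k], vector[k])
```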



<a name='4.5.3'></a><a id='4.5.3'></a>
## 4.5.3 LDiA + LDA = spam classifier
<a href="#top">[back to top]</a>

Problem: Examine how good LDiA topics are at predicting something useful, such as spaminess.

Idea: Use the LDiA topic vectors to train an LDA model, then test it via inference on held-out messages.

In [21]:
# Use the LDiA topic vectors to train an LDA model.
X_train, X_test, y_train, y_test = train_test_split(
    ldia16_topic_vectors, 
    sms.spam, 
    test_size=0.5, 
    random_state=271828
)

print(np.nanmin(X_train))
print(np.nanmax(X_train))
test1 = pd.DataFrame(X_train)
test1.head()

9.936412490739722e-06
0.9919871750616998


Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms929!,0.004167,0.004167,0.159604,0.004167,0.004167,0.178012,0.004167,0.004167,0.004167,0.608217,0.004167,0.004167,0.004167,0.004167,0.004167,0.004167
sms2321,0.003906,0.003906,0.003906,0.003906,0.003906,0.003906,0.003906,0.003906,0.292796,0.003906,0.652517,0.003906,0.003906,0.003906,0.003906,0.003906
sms4443!,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.308197,0.002083,0.002083,0.002083,0.662637,0.002083,0.002083
sms3615!,0.002016,0.002016,0.002016,0.002016,0.002016,0.002016,0.002016,0.002016,0.002016,0.002016,0.002016,0.969758,0.002016,0.002016,0.002016,0.002016
sms2313!,0.002404,0.002404,0.002404,0.002404,0.002404,0.963942,0.002404,0.002404,0.002404,0.002404,0.002404,0.002404,0.002404,0.002404,0.002404,0.002404


In [22]:
test1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2418 entries, sms929! to sms2178
Data columns (total 16 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   topic0   2418 non-null   float64
 1   topic1   2418 non-null   float64
 2   topic2   2418 non-null   float64
 3   topic3   2418 non-null   float64
 4   topic4   2418 non-null   float64
 5   topic5   2418 non-null   float64
 6   topic6   2418 non-null   float64
 7   topic7   2418 non-null   float64
 8   topic8   2418 non-null   float64
 9   topic9   2418 non-null   float64
 10  topic10  2418 non-null   float64
 11  topic11  2418 non-null   float64
 12  topic12  2418 non-null   float64
 13  topic13  2418 non-null   float64
 14  topic14  2418 non-null   float64
 15  topic15  2418 non-null   float64
dtypes: float64(16)
memory usage: 321.1+ KB


In [23]:
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['ldia16_spam'] = lda.predict(ldia16_topic_vectors)

print(f"Test set accuracy: {round(float(lda.score(X_test, y_test)), 2)}")

Test set accuracy: 0.9
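
Accuracy needs a reference point when the classes are imbalanced. Using the counts printed earlier (4,837 messages, 638 of them spam), this short calculation gives the accuracy of a classifier that always predicts "not spam". It is computed on the whole corpus, so the baseline on the 50% test split may differ slightly.

```python
n_messages = 4837   # rows in tfidf_docs.shape
n_spam = 638        # sms.spam.sum()

spam_rate = n_spam / n_messages
majority_baseline = 1 - spam_rate   # accuracy of always predicting "not spam"

print(f"spam rate: {spam_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Against a baseline of roughly 0.87, a test accuracy of 0.9 is only a modest improvement.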


---

Compare the LDiA model with a much higher-dimensional model based on the TF-IDF vectors.

The TF-IDF vectors have many more features (9,232 unique terms, compared with 16 topics), so you’re likely to experience overfitting and poor generalization. This is where the generalization of LDiA and PCA should help.

In [24]:
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)

tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()

tfidf_docs = tfidf_docs - tfidf_docs.mean(axis=0)

X_train, X_test, y_train, y_test = train_test_split(
    tfidf_docs, 
    sms.spam.values, 
    test_size=0.5, 
    random_state=271828
)

print(np.nanmin(X_train))
print(np.nanmax(X_train))
test2 = pd.DataFrame(X_train)
test2.head()

-0.06573907666856706
0.9997932602853008


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9222,9223,9224,9225,9226,9227,9228,9229,9230,9231
0,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,0.205146,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
1,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
2,0.219732,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,0.113991,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
3,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,0.119565,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
4,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05


In [25]:
test2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2418 entries, 0 to 2417
Columns: 9232 entries, 0 to 9231
dtypes: float64(9232)
memory usage: 170.3 MB


In [26]:
# Fitting an LDA model to all these thousands of features will take 
# quite a long time. It’s slicing up your 9,232-dimension vector space
# with a hyperplane.
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)

print(f"Training set accuracy: {round(float(lda.score(X_train, y_train)), 3)}")
print(f"Test set accuracy: {round(float(lda.score(X_test, y_test)), 3)}")

Training set accuracy: 1.0
Test set accuracy: 0.748
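
The gap between the two scores is the usual sign of overfitting. The numbers below are copied from the outputs above; the calculation only makes the feature-to-sample ratio and the train/test gap explicit.

```python
n_train, n_features = 2418, 9232    # from test2.info() above
train_acc, test_acc = 1.0, 0.748    # reported by the cell above

features_per_sample = n_features / n_train
generalization_gap = train_acc - test_acc

print(f"features per training sample: {features_per_sample:.2f}")
print(f"train/test accuracy gap: {generalization_gap:.3f}")
```

With almost four features for every training message, the classifier has enough freedom to memorize the training set, which is consistent with the perfect training score.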


<a name='4.5.4'></a><a id='4.5.4'></a>
## 4.5.4 A fairer comparison: 32 LDiA topics
<a href="#top">[back to top]</a>

Problem: LDiA may not be as efficient as LSA (PCA), so it may need more topics to allocate words to.

Idea: Try 32 topics (components)
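
One way to explore the topic count is to fit several models and compare a score for each. The sketch below does this on an invented mini-corpus, using scikit-learn's `perplexity` method on the training counts. The corpus, the candidate sizes, and `random_state=0` are assumptions for illustration. Training perplexity is only a rough guide, and held-out data or downstream classifier accuracy (the approach this notebook takes) is usually the more trustworthy criterion.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the SMS messages.
docs = [
    "win free prize call now",
    "free prize win cash call",
    "call now to win free cash",
    "see you at home later",
    "ok see you later at home",
    "home later ok love you",
]
bow = CountVectorizer().fit_transform(docs)

perplexity_by_size = {}
for n_topics in (2, 3, 4):
    model = LatentDirichletAllocation(
        n_components=n_topics, learning_method="batch", random_state=0
    ).fit(bow)
    # Lower perplexity means the model assigns higher likelihood to the counts.
    perplexity_by_size[n_topics] = model.perplexity(bow)

for n_topics, perplexity in perplexity_by_size.items():
    print(n_topics, round(perplexity, 1))
```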

In [27]:
%%time
# Try 32 topics (components)
ldia32 = LDiA(n_components=32, learning_method='batch')
ldia32 = ldia32.fit(bow_docs)
ldia32.components_.shape

CPU times: user 24.3 s, sys: 1.12 s, total: 25.5 s
Wall time: 32 s


(32, 9232)

In [28]:
# Compute new 32-D topic vectors for all the documents (SMS messages)
ldia32_topic_vectors = ldia32.transform(bow_docs)
columns32 = ['topic{}'.format(i) for i in range(ldia32.n_components)]
ldia32_topic_vectors = pd.DataFrame(ldia32_topic_vectors, index=index, columns=columns32)
ldia32_topic_vectors.round(2).head()

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic22,topic23,topic24,topic25,topic26,topic27,topic28,topic29,topic30,topic31
sms0,0.0,0.0,0.0,0.0,0.47,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.49,0.0,0.0,0.0,0.0,0.0,0.0
sms1,0.0,0.0,0.0,0.0,0.13,0.0,0.12,0.0,0.0,0.0,...,0.0,0.0,0.0,0.49,0.0,0.0,0.0,0.0,0.0,0.0
sms2!,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.91
sms3,0.0,0.0,0.0,0.0,0.27,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.66,0.0,0.0,0.0,0.0,0.0,0.0
sms4,0.0,0.0,0.0,0.0,0.31,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# LDA model (classifier) training, using 32-D LDiA topic vectors
X_train, X_test, y_train, y_test = train_test_split(
    ldia32_topic_vectors, 
    sms.spam, 
    test_size=0.5, 
    random_state=271828
)

print(np.nanmin(X_train))
print(np.nanmax(X_train))
test3 = pd.DataFrame(X_train)
test3.head()

4.968203497626435e-06
0.991720085470069


Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic22,topic23,topic24,topic25,topic26,topic27,topic28,topic29,topic30,topic31
sms929!,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,...,0.002083,0.002083,0.166457,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083
sms2321,0.340144,0.001953,0.001953,0.178571,0.001953,0.424644,0.001953,0.001953,0.001953,0.001953,...,0.001953,0.001953,0.001953,0.001953,0.001953,0.001953,0.001953,0.001953,0.001953,0.001953
sms4443!,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,...,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042,0.001042
sms3615!,0.001008,0.001008,0.001008,0.001008,0.001008,0.001008,0.001008,0.001008,0.001008,0.001008,...,0.001008,0.001008,0.001008,0.001008,0.001008,0.001008,0.001008,0.078617,0.001008,0.001008
sms2313!,0.001202,0.001202,0.001202,0.001202,0.001202,0.001202,0.161379,0.001202,0.001202,0.001202,...,0.001202,0.001202,0.001202,0.001202,0.001202,0.001202,0.001202,0.745291,0.001202,0.001202


In [30]:
test3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2418 entries, sms929! to sms2178
Data columns (total 32 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   topic0   2418 non-null   float64
 1   topic1   2418 non-null   float64
 2   topic2   2418 non-null   float64
 3   topic3   2418 non-null   float64
 4   topic4   2418 non-null   float64
 5   topic5   2418 non-null   float64
 6   topic6   2418 non-null   float64
 7   topic7   2418 non-null   float64
 8   topic8   2418 non-null   float64
 9   topic9   2418 non-null   float64
 10  topic10  2418 non-null   float64
 11  topic11  2418 non-null   float64
 12  topic12  2418 non-null   float64
 13  topic13  2418 non-null   float64
 14  topic14  2418 non-null   float64
 15  topic15  2418 non-null   float64
 16  topic16  2418 non-null   float64
 17  topic17  2418 non-null   float64
 18  topic18  2418 non-null   float64
 19  topic19  2418 non-null   float64
 20  topic20  2418 non-null   float64
 21  topic21  2

In [31]:
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['ldia32_spam'] = lda.predict(ldia32_topic_vectors)

# Check number of dimensions in topic vectors
X_train.shape

(2418, 32)

In [32]:
print(f"Train accuracy: {round(float(lda.score(X_train, y_train)), 3)}")

Train accuracy: 0.922


In [33]:
print(f"Test accuracy: {round(float(lda.score(X_test, y_test)), 3)}")

Test accuracy: 0.924


---
<a name='4.6'></a><a id='4.6'></a>
# 4.6 Distance and similarity
<a href="#top">[back to top]</a>

Problem: Test how well LSA topic models agree with the higher-dimensional TF-IDF model.

Idea: Use similarity scores and distances.