# Dimension Reduction

## Singular value decomposition (SVD)

SVD factors any matrix into three matrices whose product reproduces the original matrix.

SVD can be applied to a bag-of-words (BOW) matrix as well as to a TF-IDF matrix.

Whether you run SVD on a BOW term-document matrix or a TF-IDF term-document matrix, SVD will find combinations of words that belong together. SVD finds those co-occurring words by calculating the correlation between the rows (terms) of your term-document matrix.

$$W_{m\times n} = U_{m\times p}S_{p\times p}V_{p\times n}^T$$

- m is the number of terms in your vocabulary
- n is the number of documents in your corpus
- p is the number of topics in your corpus
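
The factorization above can be checked numerically. The following is a minimal sketch on a made-up 3-term by 4-document matrix (the matrix `W` and its values are assumptions for illustration). Without truncation, $p = \min(m, n)$.

```python
import numpy as np

# Toy term-document matrix W: m=3 terms (rows), n=4 documents (columns).
# The values are made up for illustration.
W = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0]], dtype=float)

# full_matrices=False returns the "economy" factors:
# U is m x k, s holds k singular values, Vt is k x n, with k = min(m, n).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
S = np.diag(s)  # expand the 1-D singular values into a diagonal matrix

print(U.shape, S.shape, Vt.shape)

# Multiplying the three factors gives back the original matrix.
W_rebuilt = U @ S @ Vt
print(np.allclose(W, W_rebuilt))
```

Note that `np.linalg.svd` returns $V^T$ directly, so no extra transpose is needed when multiplying the factors back together.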

### U—left singular vectors

#### term-topic matrix (left singular vectors, left eigenvectors, row eigenvectors)
It tells you about "the company a word keeps". This is the most important matrix for semantic analysis in NLP.

U is the **cross-correlation** between **words** and **topics** based on word co-occurrence in the same document

⚠️ If the input matrix is a document-term matrix, the term-topic information has to be extracted from $V^T$ instead. This is the case, for example, with the PCA module in scikit-learn.
scikit-learn always arranges data as row vectors, so your term-document matrix in tdm is transposed into a document-term matrix when you use PCA.fit() or any other sklearn model training.

The original U is a square matrix ($m\times m$). After truncation (dropping columns), U becomes an $m\times p$ matrix.

U matrix contains all the topic vectors for each word in your corpus as columns (represent how important each word is to each topic).

In [1]:
import numpy as np
import pandas as pd


In [38]:
doc_to_term_df = pd.read_csv('data/svd.csv', index_col=0)

In [39]:
doc_to_term_df

text,cat,dog,apple,lion,nyc,love
NYC is the Big Apple.,0,0,1,0,1,0
NYC is known as the Big Apple.,0,0,1,0,1,0
I love NYC!,0,0,0,0,1,1
I wore a hat to the Big Apple party in NYC.,0,0,1,0,1,0
Come to NYC. See the Big Apple!,0,0,1,0,1,0
Manhattan is called the Big Apple.,0,0,1,0,0,0
New York is a big city for a small cat.,1,0,0,0,0,0
The lion - a big cat - is the king of the jungle.,1,0,0,1,0,0
I love my pet cat.,1,0,0,0,0,1
I love New York City (NYC).,0,0,1,0,1,1
Your dog chased my cat.,1,1,0,0,0,0


In [40]:
term_to_doc_df = doc_to_term_df.T
term_to_doc_df

text,NYC is the Big Apple.,NYC is known as the Big Apple.,I love NYC!,I wore a hat to the Big Apple party in NYC.,Come to NYC. See the Big Apple!,Manhattan is called the Big Apple.,New York is a big city for a small cat.,The lion - a big cat - is the king of the jungle.,I love my pet cat.,I love New York City (NYC).,Your dog chased my cat.
cat,0,0,0,0,0,0,1,1,1,0,1
dog,0,0,0,0,0,0,0,0,0,0,1
apple,1,1,0,1,1,1,0,0,0,1,0
lion,0,0,0,0,0,0,0,1,0,0,0
nyc,1,1,1,1,1,0,0,0,0,1,0
love,0,0,1,0,0,0,0,0,1,1,0


In [41]:
doc_to_term_matrix = doc_to_term_df['cat dog apple lion nyc love'.split()].to_numpy()
doc_to_term_matrix

array([[0, 0, 1, 0, 1, 0],
       [0, 0, 1, 0, 1, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 1, 0, 1, 0],
       [0, 0, 1, 0, 1, 0],
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 1],
       [1, 1, 0, 0, 0, 0]])

In [42]:
term_to_doc_matrix = doc_to_term_matrix.T
term_to_doc_matrix

array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]])

In [43]:
# term_to_document_matrix = np.array([[0,0,0,0,0,0,1,1,1,0,1],
#                                    [0,0,0,0,0,0,0,0,0,0,1],
#                                    [1,1,0,1,1,1,0,0,0,0,0],
#                                    [0,0,0,0,0,0,0,1,0,0,0],
#                                    [1,1,1,1,1,0,0,0,0,1,0],
#                                    [0,0,1,0,0,0,0,0,1,1,0]])

In [44]:
U, s, Vt = np.linalg.svd(term_to_doc_matrix)

In [45]:
pd.DataFrame(U, index=term_to_doc_df.index).round(2)

,0,1,2,3,4,5
cat,-0.03,0.87,-0.28,-0.0,0.06,-0.39
dog,-0.0,0.22,-0.19,-0.71,-0.26,0.59
apple,-0.67,-0.13,-0.42,0.0,0.57,0.16
lion,-0.0,0.22,-0.19,0.71,-0.26,0.59
nyc,-0.7,-0.04,0.13,-0.0,-0.66,-0.23
love,-0.25,0.35,0.81,0.0,0.31,0.26


### S—singular values

Contains the topic “singular values” (the square roots of the eigenvalues of $WW^T$) in a **square diagonal matrix**: only the diagonal elements are non-zero.

The singular values tell you how much information is captured by each dimension in your new semantic (topic) vector space.

In [49]:
pd.DataFrame(s).round(1)  # to save space, NumPy returns only the diagonal values of the diagonal matrix

,0
0,3.4
1,2.2
2,1.6
3,1.0
4,0.9
5,0.6


In [50]:
S = np.zeros((len(s), len(s)))
np.fill_diagonal(S, s)  # pd.np was removed from pandas; call NumPy directly
pd.DataFrame(S).round(1)

,0,1,2,3,4,5
0,3.4,0.0,0.0,0.0,0.0,0.0
1,0.0,2.2,0.0,0.0,0.0,0.0
2,0.0,0.0,1.6,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.9,0.0
5,0.0,0.0,0.0,0.0,0.0,0.6


As shown above, the first dimension contains the most information (“explained variance”) about your corpus.
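
To put a number on "most information", one common approach is to look at each squared singular value as a share of their sum. The sketch below uses the rounded singular values printed above. Because the matrix was not centered, this is strictly a share of the matrix's total squared magnitude rather than a true explained variance, so treat it as an approximation.

```python
import numpy as np

# Singular values copied (rounded) from the output above.
s = np.array([3.4, 2.2, 1.6, 1.0, 0.9, 0.6])

# Share of the total squared singular values carried by each dimension.
share = s**2 / np.sum(s**2)
cumulative = np.cumsum(share)

for i, (p, c) in enumerate(zip(share, cumulative)):
    print(f"topic{i}: share={p:.2f} cumulative={c:.2f}")
```

The cumulative column is a simple guide for choosing how many topics to keep when truncating.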



### $V^T$—right singular vectors

This gives you the shared meaning between documents, because it measures how often documents use the same topics in your new semantic model of the documents.

Like the S matrix, you’ll ignore the $V^T$ matrix whenever you’re transforming new word-document vectors into your topic vector space. You’ll only use it to check the accuracy of your topic vectors for recreating the original word-document vectors that you used to “train” it.
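
A minimal sketch of that projection, assuming the six-term vocabulary from the example above: a new BOW vector is mapped into topic space with the truncated $U$ alone. The new document and the choice of `p = 2` are made up for illustration.

```python
import numpy as np

# Term-document matrix from the example above
# (rows: cat, dog, apple, lion, nyc, love).
W = np.array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
              [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(W, full_matrices=False)

p = 2              # number of topics to keep (arbitrary choice for this sketch)
U_p = U[:, :p]     # truncated term-topic matrix, shape (6, 2)

# Hypothetical new document containing "cat", "nyc" and "love".
new_doc = np.array([1, 0, 0, 0, 1, 1], dtype=float)

# Projection into topic space: only U is needed, neither S nor V^T.
topic_vector = U_p.T @ new_doc
print(topic_vector.round(2))

# Sanity check: projecting the training documents reproduces S_p @ Vt_p.
print(np.allclose(U_p.T @ W, np.diag(s[:p]) @ Vt[:p]))
```

The sanity check is where $V^T$ comes back in: it is only used to confirm that the projection agrees with the decomposition of the training data.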

In [51]:
pd.DataFrame(Vt).round(2)

,0,1,2,3,4,5,6,7,8,9,10
0,-0.4,-0.4,-0.28,-0.4,-0.4,-0.2,-0.01,-0.01,-0.08,-0.48,-0.01
1,-0.08,-0.08,0.14,-0.08,-0.08,-0.06,0.39,0.49,0.55,0.08,0.49
2,-0.18,-0.18,0.6,-0.18,-0.18,-0.27,-0.18,-0.3,0.34,0.33,-0.3
3,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,0.71,0.0,0.0,-0.71
4,-0.1,-0.1,-0.41,-0.1,-0.1,0.66,0.07,-0.22,0.43,0.25,-0.22
5,-0.13,-0.13,0.05,-0.13,-0.13,0.28,-0.68,0.34,-0.23,0.33,0.34
6,-0.57,0.21,0.2,0.33,-0.35,0.2,0.37,-0.0,-0.37,0.18,0.0
7,-0.34,0.45,0.08,-0.66,0.47,0.08,0.08,0.0,-0.08,-0.0,0.0
8,-0.49,0.3,-0.23,0.42,0.17,-0.23,-0.41,-0.0,0.41,-0.18,0.0
9,-0.07,-0.07,-0.51,-0.07,0.06,-0.51,0.14,-0.0,-0.14,0.65,-0.0


#### SVD matrix orientation

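- **scikit-learn**: document-term matrix with one row per document (sample) and one column per term (feature), so the number of columns equals the size of the lexicon. Rows and columns are swapped relative to $W$ above.

- **svd**: term-document matrix, i.e. the transpose of the matrix above. If we do not flip the matrix, the interpretation of U and V has to be adjusted accordingly.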
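
The effect of the orientation can be checked directly. In this sketch (random data, all names hypothetical), decomposing the transposed matrix gives the same singular values, and the term-topic information moves from $U$ to $V$, up to a sign flip per column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical term-document matrix: 5 terms x 8 documents.
tdm = rng.random((5, 8))
dtm = tdm.T  # document-term layout, as scikit-learn expects

U_t, s_t, Vt_t = np.linalg.svd(tdm, full_matrices=False)
U_d, s_d, Vt_d = np.linalg.svd(dtm, full_matrices=False)

# The singular values do not depend on the orientation.
print(np.allclose(s_t, s_d))

# U of the term-document SVD matches V (= Vt.T) of the document-term SVD.
# Singular vectors are only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(U_t), np.abs(Vt_d.T)))
```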

### Truncate

truncate: zero out the dimensions at the lower right. 

You’ve just created some new words and called them “topics” because they each combine words together in various ratios. The next step is to reduce the number of dimensions.

How many topics will be enough to capture the essence of a document? 
One way to measure the accuracy of LSA is to see how accurately you can recreate a **term-document matrix** from a **topic-document matrix**. 



The more the dimensions are compressed, the higher the reconstruction error.
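
The trade-off can be sketched by truncating the SVD of the small term-document matrix from the example above and measuring how well the truncated factors rebuild it. RMSE is used here as the error measure; that choice is an assumption of this sketch, not the only option.

```python
import numpy as np

# Term-document matrix from the example above
# (rows: cat, dog, apple, lion, nyc, love).
W = np.array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
              [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(W, full_matrices=False)

errors = []
for p in range(len(s), 0, -1):
    # Keep only the first p topics and rebuild the term-document matrix.
    W_p = U[:, :p] @ np.diag(s[:p]) @ Vt[:p]
    rmse = np.sqrt(np.mean((W - W_p) ** 2))
    errors.append((p, rmse))
    print(f"{p} topics: RMSE = {rmse:.3f}")
```

With all six topics the reconstruction is exact up to floating-point error, and every topic dropped adds error.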

## Principal Component Analysis (PCA)

sparse matrices: most are empty, a few meaningful values scattered around.

#### scikit learn PCA

use dense matrices rather than sparse matrices:
- faster 
- waste of RAM
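
To get a feel for the RAM cost, the sketch below stores the same mostly-empty matrix once as a dense NumPy array and once as a SciPy CSR matrix. The matrix size and sparsity are made up for illustration.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical document-term matrix: 1000 documents x 5000 terms,
# with at most 5000 non-zero entries (about 0.1% of all cells).
dense = np.zeros((1000, 5000))
rows = rng.integers(0, 1000, size=5000)
cols = rng.integers(0, 5000, size=5000)
dense[rows, cols] = 1.0

csr = sparse.csr_matrix(dense)

# A CSR matrix stores only the non-zero values plus two index arrays.
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense:  {dense.nbytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.3f} MB")
```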

#### Truncated SVD model

#### 1. load sms data

In [52]:
import pandas as pd

pd.options.display.width = 120

sms = pd.read_csv('data/sms-spam.csv', index_col=0)

In [53]:
sms.head(10)

,spam,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
5,1,FreeMsg Hey there darling it's been 3 week's n...
6,0,Even my brother is not like to speak with me. ...
7,0,As per your request 'Melle Melle (Oru Minnamin...
8,1,WINNER!! As a valued network customer you have...
9,1,Had your mobile 11 months or more? U R entitle...


Next, we change the index format: `sms{i}{!}` denotes the i-th message, with an exclamation mark appended if it is spam.

In [54]:
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index
sms.head(10)

,spam,text
sms0,0,"Go until jurong point, crazy.. Available only ..."
sms1,0,Ok lar... Joking wif u oni...
sms2!,1,Free entry in 2 a wkly comp to win FA Cup fina...
sms3,0,U dun say so early hor... U c already then say...
sms4,0,"Nah I don't think he goes to usf, he lives aro..."
sms5!,1,FreeMsg Hey there darling it's been 3 week's n...
sms6,0,Even my brother is not like to speak with me. ...
sms7,0,As per your request 'Melle Melle (Oru Minnamin...
sms8!,1,WINNER!! As a valued network customer you have...
sms9!,1,Had your mobile 11 months or more? U R entitle...


#### 2. Compute TF-IDF vectors

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()


In [56]:
len(tfidf.vocabulary_)

9232

Center your vectorized documents (here the TF-IDF vectors) by subtracting the mean.

In [57]:
tfidf_docs = pd.DataFrame(tfidf_docs)
tfidf_docs = tfidf_docs - tfidf_docs.mean()

In [59]:
tfidf_docs.head(5)

,0,1,2,3,4,5,6,7,8,9,...,9222,9223,9224,9225,9226,9227,9228,9229,9230,9231
0,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
1,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
2,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,0.096125,0.12734,0.124007,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
3,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
4,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05


In [60]:
tfidf_docs.shape

(4837, 9232)

In [61]:
sms.spam.sum()

638

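tfidf_docs is a document-term matrix.

There are 4837 documents (SMS messages) in total, 638 of which are labeled as spam. The lexicon contains 9232 tokens.

This setup is prone to overfitting:
- the number of features exceeds the number of samples
- the classes are imbalanced (spam makes up only about 1/8 of the messages)

A likely outcome is that the model latches onto a handful of "spammy" words, i.e. words that appear only in spam messages. Such features are too specific, which causes overfitting. As a consequence, a spammer can evade detection by replacing spammy words with synonyms.

One way to avoid overfitting is to use a larger sample, so that the number of samples far exceeds the number of features. That makes it much less likely for a few spammy words to dominate.

Dimension reduction is another way to reduce overfitting. The same spam filter may overfit on bag-of-words or TF-IDF vectors but generalize better on topic vectors. Instead of asking which words a spam message contains, the model asks which topics it covers, which is a broader criterion and generalizes more easily.

A related research area is **One-Shot Learning**, which aims to reach the same accuracy while learning from far less data.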


#### 3. Using PCA for SMS message semantic analysis

In [65]:
from sklearn.decomposition import PCA

pca = PCA(n_components=16)  # reduce to 16-dimensional topic vectors
pca = pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)



In [67]:
columns = ['topic{}'.format(i) for i in range(pca.n_components)]  # column names for the pandas display
pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns, index=index)  # convert the result to a DataFrame

pca_topic_vectors.round(3).head(5)

,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.201,0.003,0.037,0.011,-0.019,-0.053,0.039,-0.066,0.013,-0.08,0.007,0.011,0.019,-0.028,-0.019,0.021
sms1,0.404,-0.094,-0.078,0.051,0.1,0.047,0.023,0.065,0.023,-0.025,-0.004,-0.034,0.041,-0.011,0.055,-0.039
sms2!,-0.03,-0.048,0.09,-0.067,0.091,-0.043,-0.0,-0.002,-0.058,0.053,0.124,-0.024,0.039,-0.01,-0.043,0.035
sms3,0.329,-0.033,-0.035,-0.016,0.052,0.056,-0.166,-0.074,0.062,-0.109,0.021,-0.025,0.072,-0.042,0.036,-0.073
sms4,0.002,0.031,0.038,0.034,-0.075,-0.092,-0.044,0.063,-0.043,0.027,0.028,0.004,0.023,0.031,-0.074,-0.017


In [69]:
pca.components_

array([[-7.11140633e-02,  8.17691034e-03, -1.21158347e-03, ...,
         5.71855488e-04,  5.71855488e-04,  5.71855488e-04],
       [ 6.35086981e-02,  7.61037184e-03,  2.67224842e-04, ...,
         1.02158963e-03,  1.02158963e-03,  1.02158963e-03],
       [ 7.08719053e-02,  2.70732209e-02,  1.34877785e-04, ...,
        -9.52766139e-04, -9.52766139e-04, -9.52766139e-04],
       ...,
       [ 1.00509374e-01, -2.60529443e-02,  3.04911538e-03, ...,
         3.48877813e-04,  3.48877813e-04,  3.48877813e-04],
       [-2.35445889e-02, -2.58424896e-02, -2.36249338e-06, ...,
         4.69628561e-04,  4.69628561e-04,  4.69628561e-04],
       [-4.21326074e-02, -4.37024961e-02, -1.48951818e-04, ...,
        -1.09568012e-03, -1.09568012e-03, -1.09568012e-03]])

In [70]:
tfidf.vocabulary_

{'go': 3807,
 'until': 8487,
 'jurong': 4675,
 'point': 6296,
 ',': 13,
 'crazy': 2549,
 '..': 21,
 'available': 1531,
 'only': 5910,
 'in': 4396,
 'bugis': 1973,
 'n': 5594,
 'great': 3894,
 'world': 8977,
 'la': 4811,
 'e': 3056,
 'buffet': 1971,
 '...': 25,
 'cine': 2277,
 'there': 8071,
 'got': 3855,
 'amore': 1296,
 'wat': 8736,
 'ok': 5874,
 'lar': 4848,
 'joking': 4642,
 'wif': 8875,
 'u': 8395,
 'oni': 5906,
 'free': 3604,
 'entry': 3195,
 '2': 471,
 'a': 1054,
 'wkly': 8933,
 'comp': 2386,
 'to': 8192,
 'win': 8890,
 'fa': 3328,
 'cup': 2608,
 'final': 3450,
 'tkts': 8180,
 '21st': 497,
 'may': 5272,
 '2005': 487,
 '.': 15,
 'text': 8020,
 '87121': 948,
 'receive': 6688,
 'question': 6574,
 '(': 9,
 'std': 7651,
 'txt': 8379,
 'rate': 6628,
 ')': 10,
 't': 7889,
 '&': 7,
 "c's": 2020,
 'apply': 1383,
 '08452810075': 115,
 'over': 6003,
 '18': 438,
 "'": 8,
 's': 6959,
 'dun': 3041,
 'say': 7034,
 'so': 7438,
 'early': 3069,
 'hor': 4207,
 'c': 2019,
 'already': 1268,
 'then': 

In [71]:
column_nums, terms = zip(*sorted(zip(tfidf.vocabulary_.values(), tfidf.vocabulary_.keys())))

In [72]:
terms

('!',
 '"',
 '#',
 '#150',
 '#5000',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '. .',
 '. . .',
 '. . . .',
 '. . . . .',
 '. ..',
 '..',
 '.. .',
 '.. . . .',
 '.. ... ...',
 '...',
 '... . . . .',
 '/',
 '0',
 '00',
 '00870405040',
 '0089',
 '01',
 '0121 2025050',
 '01223585236',
 '01223585334',
 '01256987',
 '02',
 '02/06',
 '02/09',
 '0207 153 9153',
 '0207 153 9996',
 '0207-083-6089',
 '02072069400',
 '02073162414',
 '02085076972',
 '03',
 '03530150',
 '04',
 '04/09',
 '05',
 '050703',
 '06',
 '06.05',
 '06/11',
 '07/11',
 '07008009200',
 '07046744435',
 '07090201529',
 '07090298926',
 '07099833605',
 '07123456789',
 '07732584351',
 '07734396839',
 '07742676969',
 '07753741225',
 '0776xxxxxxx',
 '07786200117',
 '077xxx',
 '078',
 '07801543489',
 '07808',
 '07808247860',
 '07808726822',
 '07815296484',
 '07821230901',
 '078498',
 '07880867867',
 '0789xxxxxxx',
 '07946746291',
 '0796xxxxxx',
 '07973788240',
 '07xxxxxxxxx',
 '08',
 '0800',
 '0800 0721072',
 '

Next, let's look at how each token maps to each topic.

In [74]:
weights = pd.DataFrame(pca.components_, columns=terms, index=['topic{}'.format(i) for i in range(16)])
pd.options.display.max_columns = 8
weights.head(4).round(3)

,!,"""",#,#150,...,…,┾,〨ud,鈥
topic0,-0.071,0.008,-0.001,-0.0,...,-0.002,0.001,0.001,0.001
topic1,0.064,0.008,0.0,-0.0,...,0.003,0.001,0.001,0.001
topic2,0.071,0.027,0.0,0.001,...,0.002,-0.001,-0.001,-0.001
topic3,-0.059,-0.032,-0.001,-0.0,...,0.001,0.001,0.001,0.001


In [75]:
pd.options.display.max_columns = 12

deals = weights['! ;) :) half off free crazy deal only $ 80 %'.split()].round(3) * 100

deals

,!,;),:),half,off,free,crazy,deal,only,$,80,%
topic0,-7.1,0.1,-0.5,-0.0,-0.4,-2.0,-0.0,-0.1,-2.2,0.3,-0.0,-0.0
topic1,6.4,0.0,7.4,0.1,0.4,-2.3,-0.2,-0.1,-3.8,-0.1,-0.0,-0.2
topic2,7.1,0.2,-0.2,0.1,0.3,4.4,0.1,-0.1,0.7,0.0,0.0,0.1
topic3,-5.9,-0.3,-7.1,0.2,0.3,-0.2,0.0,0.1,-2.3,0.1,-0.1,-0.3
topic4,38.1,-0.1,-12.5,-0.1,-0.2,9.9,0.1,-0.2,3.0,0.3,0.1,-0.1
topic5,-26.5,0.1,-1.5,-0.3,-0.7,-1.4,-0.6,-0.2,-1.8,-0.9,0.0,0.0
topic6,-10.9,-0.5,19.8,-0.4,-0.9,-0.6,-0.2,-0.1,-1.4,-0.0,-0.0,-0.1
topic7,16.0,0.1,-17.6,0.8,0.8,-2.9,0.0,0.1,-1.8,-0.3,0.0,-0.1
topic8,34.4,0.1,5.2,-0.4,-0.5,0.2,-0.4,-0.4,3.2,-0.6,-0.0,-0.2
topic9,7.8,-0.3,15.5,1.5,-0.9,6.4,-0.5,-0.4,3.1,-0.5,-0.0,0.0


In [77]:
deals.T.sum()

topic0    -11.9
topic1      7.6
topic2     12.7
topic3    -15.5
topic4     38.3
topic5    -33.8
topic6      4.7
topic7     -4.9
topic8     40.6
topic9     31.7
topic10   -29.7
topic11   -48.4
topic12     7.6
topic13    46.7
topic14    23.0
topic15    -4.7
dtype: float64

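One challenge of LSA is interpreting the topics, i.e. working out what each of topic0 to topic15 actually means.

Because LSA looks for linear relationships between words, it may produce results that people cannot make sense of.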
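
One simple aid for interpretation is to list the terms with the largest absolute weight in each topic. The sketch below assumes a topic-by-term DataFrame shaped like `weights` above. The helper `top_terms` and the numbers in `toy_weights` are hypothetical.

```python
import pandas as pd

# Hypothetical topic-term weights, shaped like `weights` above
# (rows: topics, columns: terms). The numbers are made up for illustration.
toy_weights = pd.DataFrame(
    [[0.38, -0.12, 0.10, 0.03],
     [-0.26, -0.02, -0.01, 0.20]],
    index=['topic0', 'topic1'],
    columns=['!', ':)', 'free', 'ok'],
)


def top_terms(weights, topic, n=2):
    """Return the n terms with the largest absolute weight for one topic."""
    row = weights.loc[topic]
    order = row.abs().sort_values(ascending=False).index[:n]
    return row[order]


for topic in toy_weights.index:
    print(topic, top_terms(toy_weights, topic).to_dict())
```

Sorting by absolute value keeps strongly negative weights in view, since a term that pushes a document away from a topic is also informative.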

#### 4. Using truncated SVD

- you can see what’s going on inside the PCA wrapper
- It can handle sparse matrices, so if you’re working with large datasets you’ll want to use TruncatedSVD instead of PCA anyway

In [78]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=16, n_iter=100)
svd_topic_vectors = svd.fit_transform(tfidf_docs.values)

svd_topic_vectors = pd.DataFrame(svd_topic_vectors, columns=columns, index=index)
svd_topic_vectors.round(3).head(6)

,topic0,topic1,topic2,topic3,topic4,topic5,...,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.201,0.003,0.037,0.011,-0.019,-0.053,...,0.007,-0.007,0.002,-0.036,-0.014,0.037
sms1,0.404,-0.094,-0.078,0.051,0.1,0.047,...,-0.004,0.036,0.043,-0.021,0.051,-0.042
sms2!,-0.03,-0.048,0.09,-0.067,0.091,-0.043,...,0.125,0.023,0.026,-0.02,-0.042,0.052
sms3,0.329,-0.033,-0.035,-0.016,0.052,0.056,...,0.022,0.023,0.073,-0.046,0.022,-0.07
sms4,0.002,0.031,0.038,0.034,-0.075,-0.093,...,0.028,-0.009,0.027,0.034,-0.083,-0.021
sms5!,-0.016,0.059,0.014,-0.006,0.122,-0.04,...,0.041,0.055,-0.037,0.075,-0.001,0.02


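As the output shows, the vectors produced by truncated SVD closely match those from PCA: the leading topics agree to about three decimals, while the later topics (topic11 to topic15) differ slightly.

- n_iter: the number of iterations for the randomized SVD solver. More iterations give a more accurate approximation; 100 is used here, well above the scikit-learn default of 5.
- centered: TruncatedSVD does not center the data, whereas PCA does. The TF-IDF vectors were already centered in step 2, which is why the two results agree.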
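
The role of centering can be illustrated on synthetic data. In this sketch, PCA on the raw data and TruncatedSVD on manually centered data give the same projections up to a sign flip per component. The data, the solver choices (`svd_solver='full'`, `algorithm='arpack'`), and all variable names are assumptions for illustration. They differ from the randomized solver with `n_iter=100` used above.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples x 30 features with 3 dominant directions.
latent = rng.normal(size=(200, 3)) * np.array([5.0, 3.0, 2.0])
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 30))

# PCA centers the data internally; TruncatedSVD does not, so center by hand.
X_centered = X - X.mean(axis=0)

pca_vecs = PCA(n_components=3, svd_solver='full').fit_transform(X)
svd_vecs = TruncatedSVD(n_components=3, algorithm='arpack').fit_transform(X_centered)

# Each component is only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(pca_vecs), np.abs(svd_vecs), atol=1e-6))
```

If the centering step is skipped, the first TruncatedSVD component tends to track the mean of the data instead of the direction of largest variance.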

#### 5. spam classification

Next, we use the topic vectors for spam classification.

One way to find out how well a **vector space model** will work for classification is to see how **cosine similarities** between vectors correlate with membership in the same class.



In [79]:
svd_topic_vectors = (svd_topic_vectors.T / np.linalg.norm(svd_topic_vectors, axis=1)).T
svd_topic_vectors.iloc[:10].dot(svd_topic_vectors.iloc[:10].T).round(1)

,sms0,sms1,sms2!,sms3,sms4,sms5!,sms6,sms7,sms8!,sms9!
sms0,1.0,0.6,-0.1,0.6,-0.0,-0.3,-0.3,-0.1,-0.3,-0.3
sms1,0.6,1.0,-0.2,0.8,-0.2,0.0,-0.2,-0.2,-0.1,-0.1
sms2!,-0.1,-0.2,1.0,-0.2,0.1,0.4,0.0,0.3,0.5,0.4
sms3,0.6,0.8,-0.2,1.0,-0.2,-0.3,-0.1,-0.3,-0.2,-0.1
sms4,-0.0,-0.2,0.1,-0.2,1.0,0.2,0.0,0.1,-0.4,-0.2
sms5!,-0.3,0.0,0.4,-0.3,0.2,1.0,-0.1,0.1,0.3,0.4
sms6,-0.3,-0.2,0.0,-0.1,0.0,-0.1,1.0,0.1,-0.2,-0.2
sms7,-0.1,-0.2,0.3,-0.3,0.1,0.1,0.1,1.0,0.1,0.4
sms8!,-0.3,-0.1,0.5,-0.2,-0.4,0.3,-0.2,0.1,1.0,0.3
sms9!,-0.3,-0.1,0.4,-0.1,-0.2,0.4,-0.2,0.4,0.3,1.0


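The table shows that sms0 has negative similarity with sms2!, sms5!, sms8!, and sms9!, while sms2! has positive similarity with sms5!, sms8!, and sms9!. This suggests that spam messages share similar semantics: they talk about similar topics.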
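
The same observation can be summarized numerically. This sketch copies the rounded 10x10 similarity table above and compares the average spam-to-spam similarity with the average spam-to-ham similarity. Ten messages are far too few for a real evaluation, so this is only an illustration.

```python
import numpy as np

# Rounded cosine similarities copied from the table above (sms0 .. sms9!).
sims = np.array([
    [ 1.0,  0.6, -0.1,  0.6, -0.0, -0.3, -0.3, -0.1, -0.3, -0.3],
    [ 0.6,  1.0, -0.2,  0.8, -0.2,  0.0, -0.2, -0.2, -0.1, -0.1],
    [-0.1, -0.2,  1.0, -0.2,  0.1,  0.4,  0.0,  0.3,  0.5,  0.4],
    [ 0.6,  0.8, -0.2,  1.0, -0.2, -0.3, -0.1, -0.3, -0.2, -0.1],
    [-0.0, -0.2,  0.1, -0.2,  1.0,  0.2,  0.0,  0.1, -0.4, -0.2],
    [-0.3,  0.0,  0.4, -0.3,  0.2,  1.0, -0.1,  0.1,  0.3,  0.4],
    [-0.3, -0.2,  0.0, -0.1,  0.0, -0.1,  1.0,  0.1, -0.2, -0.2],
    [-0.1, -0.2,  0.3, -0.3,  0.1,  0.1,  0.1,  1.0,  0.1,  0.4],
    [-0.3, -0.1,  0.5, -0.2, -0.4,  0.3, -0.2,  0.1,  1.0,  0.3],
    [-0.3, -0.1,  0.4, -0.1, -0.2,  0.4, -0.2,  0.4,  0.3,  1.0],
])
# Spam labels for sms0 .. sms9 (the "!" suffix marks spam).
is_spam = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1], dtype=bool)

# Spam-to-spam block, excluding each message's similarity with itself.
spam_spam = sims[np.ix_(is_spam, is_spam)]
off_diag = ~np.eye(int(is_spam.sum()), dtype=bool)
mean_spam_spam = spam_spam[off_diag].mean()

# Spam-to-ham block.
mean_spam_ham = sims[np.ix_(is_spam, ~is_spam)].mean()

print(f"mean spam-spam similarity: {mean_spam_spam:.2f}")
print(f"mean spam-ham similarity:  {mean_spam_ham:.2f}")
```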

Likewise, cosine distance can be used for semantic search, i.e. finding the message that is semantically closest to a query.
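
A minimal sketch of such a search, assuming each message already has a topic vector. The function `most_similar` and the toy vectors are hypothetical.

```python
import numpy as np


def most_similar(query, vectors, top_n=3):
    """Return indices and cosine similarities of the top_n rows closest to query."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = unit @ q                      # cosine similarity with every row
    order = np.argsort(-sims)[:top_n]    # indices sorted by descending similarity
    return order, sims[order]


# Hypothetical 3-dimensional topic vectors for four messages.
topic_vectors = np.array([[0.9, 0.1, 0.0],
                          [0.8, 0.2, 0.1],
                          [-0.1, 0.9, 0.3],
                          [0.0, 0.7, 0.6]])

# Use message 0 as the query; it matches itself first, then its nearest neighbor.
idx, sims = most_similar(topic_vectors[0], topic_vectors, top_n=2)
print(idx, sims.round(2))
```

When the query is itself part of the collection, the first hit is always the query, so the second hit is the one of interest.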

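The table also reveals some errors: sms7, for example, is fairly close to several spam messages. This suggests that the problem is not linearly separable, so some misclassifications are unavoidable. The next step is to evaluate the error rate.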



## Notes

topics = features = **meaningful** words

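meaningful: from a statistical point of view, these are the words that contribute variance. Words that occur frequently in all documents (for example stop words) are uniformly distributed across all documents and contribute little variance.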