# Chapter 04: Semantic Analysis

Previously, we used TF-IDF to score the importance of each word. TF-IDF vectors and matrices tell you how important each word is to the overall meaning of a bit of text in a document collection.

#### Latent Semantic Analysis (LSA)
LSA is an algorithm for revealing the meaning of word combinations and computing vectors to represent this meaning.

This chapter mainly covers how to use TF-IDF vectors to compute topic vectors (semantic vectors):
- group words together in topics
- the linear combinations of words that make up the dimensions of your topic vectors

Topic vectors are useful for:
- semantic search: search based on meaning, which works better than keyword search
- summarization: finding the most meaningful words for a document, a set of keywords that summarizes its meaning
- document comparison: checking how close two documents are in meaning to each other



## 1. From word counts to topic scores 

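### 1.1 Limitations of TF-IDF

TF-IDF only measures frequency; it cannot tell you anything about meaning.

Texts that restate the same meaning will have completely different TF-IDF vector representations if they spell things differently or use different words. This can break document similarity comparisons: similar documents must use similar tokens, which is a demanding requirement.

Stemming and lemmatization assume that similar spelling implies similar meaning. In effect they cluster tokens by spelling, then replace every token in a cluster with its stem or lemma.

- Both methods only recognize tokens with similar spellings, so they rarely find synonyms. These are false negatives: similar words that are not matched.
- Sometimes they treat antonyms as synonyms because the spellings are very close. These are false positives: matched words that are not similar.

As a result, documents with similar content can end up far apart in TF-IDF vector space because they use different words. In addition, adding or subtracting these vectors does not produce very meaningful results (compared to word embeddings).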
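
The synonym problem can be seen in a small sketch. The three sentences below are made up for illustration, and the default `TfidfVectorizer` settings are an assumption rather than the setup used elsewhere in these notes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical example sentences: the first two share a meaning but no tokens.
docs = [
    "physicians heal sick people",
    "doctors cure ill patients",
    "physicians heal sick dogs",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse matrix, one row per document
sims = cosine_similarity(tfidf)                 # pairwise cosine similarities

synonym_sim = sims[0, 1]   # same meaning, different words
overlap_sim = sims[0, 2]   # different meaning, mostly the same words
print(f"synonyms: {synonym_sim:.2f}, shared tokens: {overlap_sim:.2f}")
```

Because the first two sentences have no token in common, their TF-IDF vectors are orthogonal, while the third sentence scores as similar to the first purely because of shared spelling.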

### 1.2 Topic Vectors

We need additional information: the meaning, or topic.
- an estimate of what the words in a document “signify”
- what that combination of words means in a particular document

We then turn this meaning into a more compact, meaningful vector: the topic vector.
- Adding and subtracting topic vectors is meaningful.
- The distance between vectors is useful for clustering documents or semantic search.
- Each token has a word topic vector, and from these you compute a document topic vector for each document.
- Each word can be given a weight, and the weight can be negative.

#### Challenges
- Polysemy: the existence of words and phrases with more than one meaning
    - many different interpretations of the same words
- In text:
    - Homonyms: words with the same spelling and pronunciation, but different meanings
    - Zeugma: use of two meanings of a word simultaneously in the same sentence
- In speech (an NLP challenge with voice interfaces):
    - Homographs: words spelled the same, but with different pronunciations and meanings
    - Homophones: words with the same pronunciation, but different spellings and meanings

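### 1.3 Thought Experiment

⚠️ This is not a real algorithm or implementation, only a way of thinking through how the problem could be solved.

Think about how much each word contributes to each topic.

Suppose we have three topics:
- pets
- animals
- cities

Suppose our lexicon contains the following words: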

```python
['cat', 'dog', 'apple', 'lion', 'NYC', 'love']
```

We can assign each word a weight for its contribution to each topic, then compute the topic vector by weighting the word frequencies.

**weight**: how likely the word is associated with a topic

```python
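>>> import numpy as np
>>> topic = {}
>>> # placeholder TF-IDF values: random numbers, for illustration only
>>> tfidf = dict(zip('cat dog apple lion NYC love'.split(), np.random.rand(6)))
>>> topic['petness'] = (.3 * tfidf['cat'] +\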
... .3 * tfidf['dog'] +\
... 0 * tfidf['apple'] +\
... 0 * tfidf['lion'] -\
... .2 * tfidf['NYC'] +\
... .2 * tfidf['love'])
>>> topic['animalness'] = (.1 * tfidf['cat'] +\
... .1 * tfidf['dog'] -\
... .1 * tfidf['apple'] +\
... .5 * tfidf['lion'] +\
... .1 * tfidf['NYC'] -\
... .1 * tfidf['love'])
>>> topic['cityness'] = ( 0 * tfidf['cat'] -\
... .1 * tfidf['dog'] +\
... .2 * tfidf['apple'] -\
... .1 * tfidf['lion'] +\
... .5 * tfidf['NYC'] +\
... .1 * tfidf['love'])
```
    
This weight matrix can be flipped (transposed) to express each word as a combination of topics:

```python
>>> word_vector = {}
>>> word_vector['cat'] = .3*topic['petness'] +\
... .1*topic['animalness'] +\
... 0*topic['cityness']
>>> word_vector['dog'] = .3*topic['petness'] +\
... .1*topic['animalness'] -\
... .1*topic['cityness']
>>> word_vector['apple']= 0*topic['petness'] -\
... .1*topic['animalness'] +\
... .2*topic['cityness']
>>> word_vector['lion'] = 0*topic['petness'] +\
... .5*topic['animalness'] -\
... .1*topic['cityness']
>>> word_vector['NYC'] = -.2*topic['petness'] +\
... .1*topic['animalness'] +\
... .5*topic['cityness']
>>> word_vector['love'] = .2*topic['petness'] -\
... .1*topic['animalness'] +\
... .1*topic['cityness']
```

The six vectors above are points in a 3-D space (the topic vector space), as shown in the figure below.

<img src="img/topic_vectors.png" alt="drawing" width="600"/>

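In this example, we compressed 6-D vectors into 3-D vectors. Each original vector had 6 dimensions because the lexicon has 6 words; in a real scenario this could be 10,000 dimensions. The 3 is the number of topics, which in a real scenario might be only around 100.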


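#### Next step: how should these weights be set?

In the example above we set the weight matrix by hand, which has obvious drawbacks:
- labor-intensive
- Common sense is hard to code into an algorithm.
    - with more data, it is unclear how much weight to assign
- How many topics should there be?

Looking at the problem again, it is really a dimension reduction problem: transform a vector from a higher-dimensional vector space (TF-IDF space) to a lower-dimensional vector space (topic space).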

- inputs:
    - weight matrix: 3*6 (3 topics, 6 tokens in the lexicon)
    - TF-IDF vector: 6*1
- outputs:
    - topic vector: 3*1
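
The shapes listed above correspond to a single matrix-vector product. A minimal sketch, reusing the hand-picked weights from the thought experiment; the TF-IDF vector is a made-up example, not data from these notes.

```python
import numpy as np

lexicon = ['cat', 'dog', 'apple', 'lion', 'NYC', 'love']

# Rows: petness, animalness, cityness; columns follow `lexicon`.
# The weights are the hand-picked ones from the thought experiment.
weights = np.array([
    [.3,  .3,   0,   0, -.2,  .2],   # petness
    [.1,  .1, -.1,  .5,  .1, -.1],   # animalness
    [ 0, -.1,  .2, -.1,  .5,  .1],   # cityness
])

# Hypothetical TF-IDF vector for a document that mentions only "cat" and "dog".
tfidf_vec = np.array([.5, .5, 0, 0, 0, 0])

topic_vec = weights @ tfidf_vec   # (3, 6) @ (6,) -> (3,)
print(dict(zip(['petness', 'animalness', 'cityness'], topic_vec.round(3))))
```

The document scores highest on petness, a little on animalness, and slightly negative on cityness, which is the behavior the hand-set weights were meant to produce.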

Next, let's look at the actual algorithms.


### 1.4 An algorithm for scoring topics

**You shall know a word by the company it keeps.**

**company** = co-occurrences

**LSA (Latent Semantic Analysis)** is an algorithm that analyzes your TF-IDF matrix (table of TF-IDF vectors) to gather words into topics.
- It works on bag-of-words vectors, too, but TF-IDF vectors give slightly better results.
- It optimizes the topics to maintain diversity in the topic dimensions.
- LSA is often referred to as a dimension reduction technique.
    - PCA uses the same underlying math (SVD) as LSA; PCA additionally centers the data first.
    - LSA can be thought of as PCA for semantic analysis.
    - In information retrieval, where the focus is on building an index for search, LSA is known as LSI (Latent Semantic Indexing).
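
As a rough sketch of what LSA looks like in practice, scikit-learn's `TruncatedSVD` can be applied to a TF-IDF matrix. The toy corpus and the choice of 2 topics are assumptions made for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two documents about pets, two about cities.
corpus = [
    "the cat and the dog love to play",
    "my dog chased the cat",
    "NYC is a big city",
    "the city of NYC never sleeps",
]

tfidf_docs = TfidfVectorizer().fit_transform(corpus)   # (4 documents, n_words)
svd = TruncatedSVD(n_components=2, random_state=0)     # keep 2 topics (arbitrary choice)
topic_vectors = svd.fit_transform(tfidf_docs)          # (4 documents, 2 topics)

print(topic_vectors.shape)
print(svd.components_.shape)   # (2 topics, n_words): the word weights for each topic
```

Here `svd.components_` plays the role of the hand-set weight matrix from the thought experiment, except that the weights are learned from word co-occurrences.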
    
    
Two algorithms are similar to LSA:
- Linear Discriminant Analysis (LDA)
- Latent Dirichlet allocation (LDiA)


### 1.5 An LDA Classifier

LDA is simple and straightforward, so we introduce it as a warm-up. We look at some fancier methods later.

- supervised: it needs labels
- it does not need many samples

The model “training” has only three steps:
1. Compute the average position (centroid) of all the TF-IDF vectors within the class (such as spam SMS messages).
2. Compute the average position (centroid) of all the TF-IDF vectors not in the class (such as nonspam SMS messages).
3. Compute the vector difference between the centroids (the line that connects them).
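
The three steps above can be sketched on a toy dataset. The 2-D vectors and labels below are invented for illustration, and thresholding at the midpoint between the centroids is an assumption of this sketch rather than the scaling used later in these notes.

```python
import numpy as np

# Hypothetical 2-D "TF-IDF" vectors; real vectors have one dimension per word.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]])
y = np.array([1, 1, 0, 0], dtype=bool)       # True marks the class of interest

in_class_centroid = X[y].mean(axis=0)                  # step 1
out_class_centroid = X[~y].mean(axis=0)                # step 2
model_vec = in_class_centroid - out_class_centroid     # step 3

# Project each vector onto the line between the centroids.
scores = X @ model_vec

# Assumed decision rule: compare with the projection of the midpoint,
# which is the same as asking which centroid is closer.
midpoint = (in_class_centroid + out_class_centroid) / 2
threshold = midpoint @ model_vec
predictions = scores > threshold
print(predictions)
```

The only trained parameters are the entries of `model_vec`, which is why this model needs so few samples.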

**Core idea**:
- training: to find the vector (line) between the two centroids for your binary class
- classifying: determine which class centroid a vector is closer to

Next, we use an LDA classifier to build a very simple spam filter.

#### 1. load data

In [6]:
import pandas as pd

pd.options.display.width = 120

sms = pd.read_csv('data/sms-spam.csv', index_col=0)

In [7]:
sms['spam'] = sms.spam.astype(int)

In [8]:
len(sms)  # 4837 messages in total

4837

In [11]:
sms.spam.sum()  # 638 messages are labeled as spam

638

In [12]:
sms.head(10)

|   | spam | text |
|---|------|------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| 5 | 1 | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | 0 | Even my brother is not like to speak with me. ... |
| 7 | 0 | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | 1 | WINNER!! As a valued network customer you have... |
| 9 | 1 | Had your mobile 11 months or more? U R entitle... |


#### 2. tokenization and TF-IDF vector transformation

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize

tfidf_model = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf_model.fit_transform(raw_documents=sms.text).toarray()

In [14]:
tfidf_docs.shape

(4837, 9232)

In [15]:
sms.spam.sum()

638

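After tokenization the lexicon has 9,232 tokens, so there are now more features than samples (4,837 messages, of which only 638 are spam). In this situation some classifiers, such as a Naive Bayes classifier, do not perform well, and semantic analysis techniques can help.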



#### 3. Compute the centroids of the two classes

In [16]:
mask = sms.spam.astype(bool).values

spam_centroid = tfidf_docs[mask].mean(axis=0)
ham_centroid = tfidf_docs[~mask].mean(axis=0)

In [17]:
spam_centroid.round(2)

array([0.06, 0.  , 0.  , ..., 0.  , 0.  , 0.  ])

In [18]:
ham_centroid.round(2)

array([0.02, 0.01, 0.  , ..., 0.  , 0.  , 0.  ])

#### 4. Compute the vector connecting the two centroids

The vector points from the ham centroid to the spam centroid.

`model_vec`: the arrow from the nonspam centroid to the spam centroid is the line that defines your trained model.

In [24]:
model_vec = spam_centroid - ham_centroid
model_vec

array([ 4.39266024e-02, -1.92685506e-03,  3.84287194e-04, ...,
       -6.31869803e-05, -6.31869803e-05, -6.31869803e-05])

#### 5. Classify the known samples

Each document is a TF-IDF vector. Take the dot product of that vector with `model_vec`:

In [26]:
spamminess_score = tfidf_docs.dot(model_vec)
spamminess_score

array([-0.01469806, -0.02007376,  0.03856095, ..., -0.01014774,
       -0.00344281,  0.00395752])

In [27]:
max(spamminess_score)

0.06904539440075813

In [28]:
min(spamminess_score)

-0.03935727183816804

#### normalization

We want this score to lie between 0 and 1, like a probability, so we rescale it with min-max scaling (`MinMaxScaler`).

When the normalized score is greater than 0.5 we predict spam; otherwise we predict nonspam.

In [29]:
from sklearn.preprocessing import MinMaxScaler
sms['lda_score'] = MinMaxScaler().fit_transform(spamminess_score.reshape(-1,1))
sms['lda_predict'] = (sms.lda_score > .5).astype(int)
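
The min-max scaling can be checked by hand with the formula `(x - min) / (max - min)`. This sketch uses only the first three scores and the minimum and maximum printed earlier, copied from those outputs; it is a sanity check, not a replacement for the cell above.

```python
import numpy as np

# First three spamminess scores and the min/max, copied from the outputs above.
scores = np.array([-0.01469806, -0.02007376, 0.03856095])
score_min, score_max = -0.03935727183816804, 0.06904539440075813

scaled = (scores - score_min) / (score_max - score_min)   # maps [min, max] onto [0, 1]
predicted = (scaled > .5).astype(int)                     # 1 = spam, 0 = nonspam
print(scaled.round(6), predicted)
```

Only the third message crosses the 0.5 threshold, so it is the only one of the three predicted as spam.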

In [30]:
sms.head(10)

|   | spam | text | lda_score | lda_predict |
|---|------|------|-----------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | 0.227478 | 0 |
| 1 | 0 | Ok lar... Joking wif u oni... | 0.177888 | 0 |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | 0.718785 | 1 |
| 3 | 0 | U dun say so early hor... U c already then say... | 0.184565 | 0 |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | 0.286944 | 0 |
| 5 | 1 | FreeMsg Hey there darling it's been 3 week's n... | 0.548003 | 1 |
| 6 | 0 | Even my brother is not like to speak with me. ... | 0.324953 | 0 |
| 7 | 0 | As per your request 'Melle Melle (Oru Minnamin... | 0.499636 | 0 |
| 8 | 1 | WINNER!! As a valued network customer you have... | 0.892853 | 1 |
| 9 | 1 | Had your mobile 11 months or more? U R entitle... | 0.766372 | 1 |


The results above show that we predicted all of the first 10 rows correctly. Next, let's compute the accuracy over all samples.

In [31]:
(1. - (sms.spam - sms.lda_predict).abs().sum() / len(sms)).round(3)

0.977

In [39]:
label = sms['spam'].tolist()
predict = sms['lda_predict'].tolist()

In [42]:
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 
  
results = confusion_matrix(label, predict) 
  
print('Confusion Matrix :')
print(results) 

Confusion Matrix :
[[4135   64]
 [  45  593]]


In [43]:
print('Accuracy Score :',accuracy_score(label, predict))

Accuracy Score : 0.9774653710977879


In [44]:
print('Report : ')
print(classification_report(label, predict))

Report : 
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      4199
           1       0.90      0.93      0.92       638

    accuracy                           0.98      4837
   macro avg       0.95      0.96      0.95      4837
weighted avg       0.98      0.98      0.98      4837
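
The spam-class figures in the report can be reproduced from the confusion matrix by hand. The sketch below hard-codes the four counts printed above and assumes scikit-learn's layout (rows are true labels, columns are predictions).

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[4135, 64],
               [45, 593]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)      # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)         # of the spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 3), round(precision, 2), round(recall, 2), round(f1, 2))
```

Reading the matrix this way also shows where the errors are: 64 ham messages were flagged as spam and 45 spam messages were missed.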



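LDA is a very simple model, with few parameters, so it should generalize well. Note, however, that the accuracy above is measured on the same messages that were used to compute the centroids, so by itself it does not show how the model performs on unseen messages.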

### 1.6 LDiA: latent Dirichlet allocation

#### limitations
- takes longer to train
- less practical for many real-world applications

#### advantages
- topics are easier to interpret

#### Use cases
- document summarization: 
    - identify the most "central" sentences of a document
    - put sentences together to create a machine-generated summary
    
- For classification and regression problems, LSA is better.

#### tools
- gensim

## 2. Latent Semantic Analysis

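LSA is based on SVD (singular value decomposition). One application of SVD is **matrix inversion**: an invertible square matrix can be inverted by decomposing it into three simpler square matrices, transposing the two orthogonal factors and inverting the diagonal one, and then multiplying them back together in reverse order.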
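
A minimal NumPy sketch of the inversion idea: if `A = U @ diag(s) @ Vt`, then the inverse is `Vt.T @ diag(1 / s) @ U.T`. The 2×2 matrix is an arbitrary invertible example chosen for illustration.

```python
import numpy as np

# Hypothetical invertible 2x2 matrix.
A = np.array([[4., 1.],
              [2., 3.]])

U, s, Vt = np.linalg.svd(A)          # A = U @ diag(s) @ Vt

# Inverse: transpose U and Vt, and take reciprocals of the singular values.
A_inv = Vt.T @ np.diag(1 / s) @ U.T

print(np.allclose(A_inv, np.linalg.inv(A)))
print(np.allclose(A @ A_inv, np.eye(2)))
```

LSA uses the same decomposition but, instead of inverting, discards the smallest singular values to reduce the number of dimensions.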

Latent semantic analysis is a mathematical technique for finding the “best” way to linearly transform (rotate and stretch) any set of NLP vectors, like your TF-IDF vectors or bag-of-words vectors.

Eliminate those dimensions in the new vector space that don’t contribute much to the variance in the vectors from document to document.
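
Dropping the low-contribution dimensions can be sketched directly with NumPy's SVD. The small count matrix, its documents-as-rows orientation, and the choice `k = 2` are assumptions for illustration; the retained share is measured here by squared singular values.

```python
import numpy as np

# Hypothetical document-term matrix: 4 documents (rows) x 5 words (columns).
docs = np.array([
    [1., 1., 0., 0., 0.],
    [1., 1., 1., 0., 0.],
    [0., 0., 0., 1., 1.],
    [0., 0., 1., 1., 1.],
])

U, s, Vt = np.linalg.svd(docs, full_matrices=False)   # singular values sorted descending

k = 2                                   # number of topics to keep (an arbitrary choice)
topic_vectors = U[:, :k] * s[:k]        # each document as a k-dimensional topic vector
approx = topic_vectors @ Vt[:k]         # rank-k reconstruction of `docs`

kept = (s[:k] ** 2).sum() / (s ** 2).sum()   # share of squared singular values retained
print(topic_vectors.shape, round(float(kept), 3))
```

The rows of `Vt[:k]` are the word weights of the kept topics; the discarded rows correspond to the directions along which the documents differ least.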

some tricks that help improve the accuracy of LSA vectors

The machine doesn’t “understand” what the combinations of words mean, just that they go together. When it sees words like “dog,” “cat,” and “love” together a lot, it puts them together in a topic. It doesn’t know that such a topic is likely about “pets.” We have to give each topic a name ourselves.

#### Awas! Awas! Tom is behind you! Run!

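From the context we can roughly guess what “Awas” means. It is a bit like the fill-in-the-blank exercises we did as children.

Just focusing on the language context, the words, you can often “transfer” a lot of the significance or meaning of words you do know to words that you don’t.

A document here is usually a single sentence rather than a longer unit, because the meaning of a word is usually closely related to the meanings of the words in the sentence that contains it. This is reminiscent of word2vec with its window size set to 5.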