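# Semantic analysis, or meaning generation
Semantic analysis is the process of determining the meaning of character sequences or word sequences. It can be used for disambiguation tasks.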

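When an input sentence is analysed, its semantic analysis can begin once the sentence structure has been built.

Semantic interpretation maps a meaning onto a sentence.

Contextual interpretation maps the logical form onto a knowledge representation.

The basic unit of semantic analysis is called a meaning or sense.

Several tools that deal with senses: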

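    * ELIZA: analyses sentences with substitution and pattern-matching techniques and produces an output for a given input.
    * MARGIE: represents all English verbs using 11 primitives.
    * Script Applier Mechanism (SAM): translates sentences from different languages, such as English, Chinese, Russian, Dutch, and Spanish.
    * TextBlob: a Python library for processing textual data.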
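
To make the substitution and pattern-matching idea behind ELIZA concrete, here is a minimal sketch. It is not ELIZA's actual rule set: the rules, the `REFLECTIONS` table, and the function names `reflect` and `respond` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules for illustration only: (pattern, response template).
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
]

# Pronoun substitutions applied to the captured text before it is echoed back.
REFLECTIONS = {"my": "your", "me": "you", "i": "you", "am": "are"}


def reflect(fragment):
    """Swap first-person words for second-person ones, word by word."""
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in fragment.split())


def respond(sentence):
    """Return the response of the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(sentence)
        if match:
            return template.format(reflect(match.group(1).rstrip(".!?")))
    return "Please tell me more."


print(respond("I need my coffee."))    # Why do you need your coffee?
print(respond("The weather is nice"))  # Please tell me more.
```

The pattern captures part of the input, the substitution step rewrites pronouns, and the template turns the result into a reply; no meaning representation is built at any point.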
    
Semantic analysis can be used to query a database and retrieve information.

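    * Gensim: another Python library, which can be used for document indexing, topic modeling, and similarity retrieval.
    * Polyglot: an NLP tool that supports multilingual applications. It provides NER for 40 languages, tokenization for 165 languages, and further tools such as language detection, POS tagging, and sentiment analysis.
    * MontyLingua: used for semantic interpretation of English text.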
    
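# Sentences can be represented in logical form
In propositional logic, basic expressions or sentences are represented by propositional symbols such as P, Q, and R.
Complex expressions are built from these symbols with Boolean operators.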

For example, the sentence "If it is raining, I'll wear a raincoat" can be represented in propositional logic as follows:

    • P: It is raining.
    • Q: I'll wear a raincoat.
    • P->Q: If it is raining, I'll wear a raincoat.
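
The implication P->Q is false only when P is true and Q is false (it rains, yet no raincoat is worn). A minimal sketch that enumerates the truth table in plain Python; the helper name `implies` is an illustrative choice, not an NLTK function.

```python
from itertools import product


def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


# Enumerate every assignment of truth values to P and Q.
rows = [(p, q, implies(p, q)) for p, q in product([True, False], repeat=2)]
for p, q, result in rows:
    print(f"P={p!s:5} Q={q!s:5} P->Q={result}")
```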

## Some operator symbols:

In [1]:
import nltk
nltk.boolean_ops()

negation       	-
conjunction    	&
disjunction    	|
implication    	->
equivalence    	<->


## How well-formed formulas (WFF) are built:
A well-formed formula consists either of propositional symbols alone, or of propositional symbols combined with the Boolean operators above.

#### Logical expressions are parsed into different expression classes:

In [5]:
import nltk
input_expr = nltk.sem.Expression.fromstring
input_expr('X | (Y -> Z)')

<OrExpression (X | (Y -> Z))>

In [6]:
input_expr('-(X & Y)')

<NegatedExpression -(X & Y)>

In [7]:
input_expr('X & Y')

<AndExpression (X & Y)>

In [8]:
input_expr('X <-> -- X')

<IffExpression (X <-> --X)>

Use a ***Valuation*** to map propositional symbols to **True** or **False**, then evaluate logical expressions against it:

In [9]:
import nltk
value = nltk.Valuation([('X', True), ('Y', False), ('Z', True)])
value['Z']

True

In [10]:
domain = set()
v = nltk.Assignment(domain)
u = nltk.Model(domain, value)
print(u.evaluate('(X & Y)', v))
print(u.evaluate('-(X & Y)',v))
print(u.evaluate('(X & Z)', v))
print(u.evaluate('(X | Y)', v))

False
True
True
True
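
What the model does when it evaluates a formula can be sketched without NLTK. The snippet below is a simplified stand-in, not NLTK's implementation: formulas are written as nested tuples rather than parsed from strings, and the function name `evaluate` is hypothetical.

```python
def evaluate(expr, valuation):
    """Evaluate a nested-tuple formula against a dict of truth values."""
    if isinstance(expr, str):  # a bare propositional symbol
        return valuation[expr]
    op, *args = expr
    vals = [evaluate(a, valuation) for a in args]
    if op == "-":
        return not vals[0]
    if op == "&":
        return vals[0] and vals[1]
    if op == "|":
        return vals[0] or vals[1]
    if op == "->":
        return (not vals[0]) or vals[1]
    if op == "<->":
        return vals[0] == vals[1]
    raise ValueError(f"unknown operator: {op}")


valuation = {"X": True, "Y": False, "Z": True}
print(evaluate(("&", "X", "Y"), valuation))         # False
print(evaluate(("-", ("&", "X", "Y")), valuation))  # True
print(evaluate(("&", "X", "Z"), valuation))         # True
print(evaluate(("|", "X", "Y"), valuation))         # True
```

The four results agree with the output of the NLTK model above for the same valuation.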


#### First-order predicate logic with constants and predicates:

In [11]:
import nltk
input_expr = nltk.sem.Expression.fromstring
expression = input_expr('run(marcus)', type_check=True)
expression.argument

<ConstantExpression marcus>

In [12]:
expression.argument.type

e

In [13]:
expression.function

<ConstantExpression run>

In [14]:
expression.function.type

<e,?>

In [15]:
sign = {'run': '<e, t>'}
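# NOTE: the new parse is bound to `expresion` (misspelled), so the next line
# still inspects the `expression` parsed in In [11] and prints the same type
expresion = input_expr('run(marcus)', signature = sign)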
expression.function.type

<e,?>

A signature maps non-logical constants to their associated types.

#### Generating a query and retrieving data from a database:

In [17]:
import nltk
nltk.data.show_cfg('grammars/book_grammars/sql1.fcfg')

% start S
S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
VP[SEM=(?v + ?np)] -> TV[SEM=?v] NP[SEM=?np]
VP[SEM=(?vp1 + ?c + ?vp2)] -> VP[SEM=?vp1] Conj[SEM=?c] VP[SEM=?vp2]
NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
NP[SEM=(?n + ?pp)]  -> N[SEM=?n] PP[SEM=?pp]
NP[SEM=?n]  -> N[SEM=?n]  | CardN[SEM=?n] 
CardN[SEM='1000'] -> '1,000,000' 
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Det[SEM='SELECT'] -> 'Which' | 'What'
Conj[SEM='AND'] -> 'and'
N[SEM='City FROM city_table'] -> 'cities'
N[SEM='Population'] -> 'populations'
IV[SEM=''] -> 'are'
TV[SEM=''] -> 'have'
A -> 'located'
P[SEM=''] -> 'in'
P[SEM='>'] -> 'above'


In [19]:
from nltk import load_parser
test = load_parser('grammars/book_grammars/sql1.fcfg')
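q = ' What cities are in Greece'  # question: which cities are in Greece
t = list(test.parse(q.split()))  # the parse yields fields such as city and country, which are assembled into an SQL statement for the database lookup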
t

[Tree(S[SEM=(SELECT, City FROM city_table, WHERE, , , Country="greece")], [Tree(NP[SEM=(SELECT, City FROM city_table)], [Tree(Det[SEM='SELECT'], ['What']), Tree(N[SEM='City FROM city_table'], ['cities'])]), Tree(VP[SEM=(, , Country="greece")], [Tree(IV[SEM=''], ['are']), Tree(PP[SEM=(, Country="greece")], [Tree(P[SEM=''], ['in']), Tree(NP[SEM='Country="greece"'], ['Greece'])])])])]

In [22]:
t[0].label()

S[SEM=(SELECT, City FROM city_table, WHERE, , , Country="greece")]

In [24]:
ans = t[0].label()['SEM']
ans = [s for s in ans if s]
ans  

['SELECT', 'City FROM city_table', 'WHERE', 'Country="greece"']

In [25]:
q = ' '.join(ans)
q   # the resulting SQL query string

'SELECT City FROM city_table WHERE Country="greece"'

In [28]:
from nltk.sem import chat80
r = chat80.sql_query('corpora/city_database/city.db',q) # q is the SQL statement; run it as a query against the city database
r

<sqlite3.Cursor at 0x11856cb20>

In [30]:
for p in r:    # iterate over the result rows
    print(p[0], end=' ') # Athens

athens 
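
The last two steps (join the non-empty SEM fragments, then run the string as SQL) can be tried without the NLTK corpus. This is a sketch under stated assumptions: the in-memory table and its two rows are hypothetical stand-ins for `city.db`, and the country filter uses a single-quoted string literal, which is the standard SQL form.

```python
import sqlite3

# Hypothetical stand-in for the city database; the rows are for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_table (City TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO city_table VALUES (?, ?)",
    [("athens", "greece"), ("beijing", "china")],
)

# SEM fragments shaped like the parser output above, including empty strings.
parts = ["SELECT", "City FROM city_table", "WHERE", "", "", "Country='greece'"]
query = " ".join(p for p in parts if p)  # drop the empty fragments, then join

cities = [row[0] for row in conn.execute(query)]
print(query)   # SELECT City FROM city_table WHERE Country='greece'
print(cities)  # ['athens']
conn.close()
```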

# Introducing NER
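Named entity recognition (NER) is the process of locating proper nouns, or named entities, in a document. These named entities are then classified into categories such as person, location, organization, and so on.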

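One application of NER is information extraction. In NLTK, information extraction can be carried out by storing tuples of the form (entity, relation, entity), after which the entity values can be retrieved.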

### information extraction :

In [1]:
import nltk
locations = [('Jaipur','IN','Rajasthan'), ('Ajmer','IN','Rajasthan'),
             ('Udaipur','IN','Rajasthan'),('Mumbai','IN','Maharashtra'),('Ahmedabad','IN','Gujarat')]
# keep the first entity of every (entity, relation, entity) tuple whose second entity is Rajasthan
q = [x1 for (x1, relation, x2) in locations if x2=='Rajasthan']
print(q)

['Jaipur', 'Ajmer', 'Udaipur']


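### NLTK ships with a pre-trained named entity classifier:
nltk.ne_chunk() detects named entities in text.
If the `binary` parameter is set to True, detected named entities are simply tagged NE; otherwise they are tagged with a category such as PERSON, GPE, or ORGANIZATION.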

In [13]:
import nltk
sentences1 = nltk.corpus.treebank.tagged_sents()[17]
sentences1

[('The', 'DT'),
 ('total', 'NN'),
 ('of', 'IN'),
 ('18', 'CD'),
 ('deaths', 'NNS'),
 ('from', 'IN'),
 ('malignant', 'JJ'),
 ('mesothelioma', 'NN'),
 (',', ','),
 ('lung', 'NN'),
 ('cancer', 'NN'),
 ('and', 'CC'),
 ('asbestosis', 'NN'),
 ('was', 'VBD'),
 ('far', 'RB'),
 ('higher', 'JJR'),
 ('than', 'IN'),
 ('*', '-NONE-'),
 ('expected', 'VBN'),
 ('*?*', '-NONE-'),
 (',', ','),
 ('the', 'DT'),
 ('researchers', 'NNS'),
 ('said', 'VBD'),
 ('0', '-NONE-'),
 ('*T*-1', '-NONE-'),
 ('.', '.')]

In [14]:
print(nltk.ne_chunk(sentences1, binary=True)) # detect entities; any that are found are tagged NE

(S
  The/DT
  total/NN
  of/IN
  18/CD
  deaths/NNS
  from/IN
  malignant/JJ
  mesothelioma/NN
  ,/,
  lung/NN
  cancer/NN
  and/CC
  asbestosis/NN
  was/VBD
  far/RB
  higher/JJR
  than/IN
  */-NONE-
  expected/VBN
  *?*/-NONE-
  ,/,
  the/DT
  researchers/NNS
  said/VBD
  0/-NONE-
  *T*-1/-NONE-
  ./.)


In [15]:
sentences2 = nltk.corpus.treebank.tagged_sents()[7]
print(nltk.ne_chunk(sentences2, binary=True))

(S
  A/DT
  (NE Lorillard/NNP)
  spokewoman/NN
  said/VBD
  ,/,
  ``/``
  This/DT
  is/VBZ
  an/DT
  old/JJ
  story/NN
  ./.)


In [16]:
print(nltk.ne_chunk(sentences2))

(S
  A/DT
  (ORGANIZATION Lorillard/NNP)
  spokewoman/NN
  said/VBD
  ,/,
  ``/``
  This/DT
  is/VBZ
  an/DT
  old/JJ
  story/NN
  ./.)


### detect named entities:

In [19]:
import nltk
from nltk.corpus import conll2002
for documents in conll2002.chunked_sents('ned.train')[25]:
    print(documents)

(PER Vandenbussche/Adj)
('zelf', 'Pron')
('besloot', 'V')
('dat', 'Conj')
('het', 'Art')
('hof', 'N')
('"', 'Punc')
('de', 'Art')
('politieke', 'Adj')
('zeden', 'N')
('uit', 'Prep')
('het', 'Art')
('verleden', 'N')
('"', 'Punc')
('heeft', 'V')
('willen', 'V')
('veroordelen', 'V')
('.', 'Punc')


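## A chunker splits plain text into sequences of semantically related words.
NER in NLTK uses the default chunkers, which are based on classifiers trained on the ACE corpus.

Other chunkers are trained on parsed or chunked NLTK corpora.

NLTK chunkers cover Dutch, Spanish, Portuguese, and English.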

In [20]:
# recognise named entities and assign them to named entity classes:
import nltk
sentence = "I went to Greece to meet John"
tok = nltk.word_tokenize(sentence) # tokenize
pos_tag = nltk.pos_tag(tok)  # POS tagging
print(nltk.ne_chunk(pos_tag)) # chunk and label named entities

(S
  I/PRP
  went/VBD
  to/TO
  (GPE Greece/NNP)
  to/TO
  meet/VB
  (PERSON John/NNP))


# A NER system using Hidden Markov Model
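The hidden Markov model (HMM) is a statistical approach commonly used for NER.
An HMM is defined as a stochastic finite state automaton (SFSA): a finite set of states, each associated with a probability distribution. The states themselves are unknown (hidden, or unobserved).
An HMM produces the optimal state sequence as its output.
An HMM relies on the Markov chain property: the probability of the next state depends only on the current state (here, the current tag).
A drawback of HMMs is that they require a large amount of training.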

## Components of an HMM:

    • Set of states, S, where |S| = N. Here, N is the total number of states.
    • Start state, S0.
    • Output alphabet, O, where |O| = k. Here, k is the total number of output symbols.
    • Transition probabilities, A.
    • Emission probabilities, B.
    • Initial state probabilities, π.

An HMM is represented by the tuple λ = (A, B, π).


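The Baum-Welch algorithm is used to find the maximum likelihood estimate and the posterior mode estimate of the HMM parameters. Given a sequence of emissions (observations), the forward-backward algorithm computes the posterior marginals of all hidden state variables.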
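
As an illustration of the forward-backward computation, here is a minimal pure-Python sketch for a two-state HMM. Everything specific in it is an assumption made for the example: the states `NE` and `O`, the observation symbols `cap` and `low` (capitalised versus lower-case token), and all probability values.

```python
def forward_backward(obs, states, start_p, trans_p, emit_p):
    """Return, for each time step, the posterior P(state | all observations)."""
    n = len(obs)
    # Forward pass: alpha[t][s] = P(obs[0..t], state at t = s)
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, n):
        alpha.append({
            s: emit_p[s][obs[t]] * sum(alpha[t - 1][r] * trans_p[r][s] for r in states)
            for s in states
        })
    # Backward pass: beta[t][s] = P(obs[t+1..] | state at t = s)
    beta = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in states:
            beta[t][s] = sum(
                trans_p[s][r] * emit_p[r][obs[t + 1]] * beta[t + 1][r] for r in states
            )
    evidence = sum(alpha[-1][s] for s in states)  # P(all observations)
    return [{s: alpha[t][s] * beta[t][s] / evidence for s in states} for t in range(n)]


# Hypothetical toy parameters (A = trans_p, B = emit_p, pi = start_p).
states = ("NE", "O")
start_p = {"NE": 0.3, "O": 0.7}
trans_p = {"NE": {"NE": 0.4, "O": 0.6}, "O": {"NE": 0.2, "O": 0.8}}
emit_p = {"NE": {"cap": 0.9, "low": 0.1}, "O": {"cap": 0.2, "low": 0.8}}
obs = ["low", "cap", "cap", "low"]

posteriors = forward_backward(obs, states, start_p, trans_p, emit_p)
for symbol, dist in zip(obs, posteriors):
    print(symbol, {s: round(p, 3) for s, p in dist.items()})
```

Baum-Welch would reuse these posteriors as expected counts to re-estimate A, B, and π; that re-estimation step is not shown here.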

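## NER with a hidden Markov model involves three steps:

    1. HMM annotation: annotate the raw text to turn it into trainable data
    2. HMM train: compute the HMM parameters A, B, and π
    3. HMM test: use the Viterbi algorithm to find the best tag sequence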
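
The third step can be sketched as follows. This is a minimal Viterbi decoder under assumptions, not the code NLTK uses: the states `NE` and `O`, the observation symbols `cap` and `low`, and the probabilities are hypothetical values standing in for parameters that step 2 would normally estimate from annotated data.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        best.append({
            s: max(
                (best[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states
            )
            for s in states
        })
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy parameters (A = trans_p, B = emit_p, pi = start_p).
states = ("NE", "O")
start_p = {"NE": 0.3, "O": 0.7}
trans_p = {"NE": {"NE": 0.4, "O": 0.6}, "O": {"NE": 0.2, "O": 0.8}}
emit_p = {"NE": {"cap": 0.9, "low": 0.1}, "O": {"cap": 0.2, "low": 0.8}}

path = viterbi(["low", "cap", "cap", "low"], states, start_p, trans_p, emit_p)
print(path)  # ['O', 'NE', 'NE', 'O']
```

With these numbers the two capitalised tokens in the middle are decoded as one entity span, because staying in `NE` across them is more probable than leaving and re-entering it.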

#### Chunking with an HMM yields noun phrase (NP) and verb phrase (VP) chunks

NP chunks can be processed further to obtain proper nouns or named entities:

In [21]:
import nltk
nltk.tag.hmm.demo_pos()


HMM POS tagging demo

Training HMM...
Testing...
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Entropy: 18.7331739705

------------------------------------------------------------
Test: the/AT jury/NN further/RBR said/VBD in/IN term-end/NN presentments/NNS that/CS the/AT city/NN executive/JJ committee/NN ,/, which/WDT had/HVD over-all/JJ charge/NN of/IN the/AT election/NN ,/

Each result of a named entity tagger can be viewed as a (response, answer key) pair, which leads to the following definitions:

    • Correct: If the response is exactly the same as answer key
    • Incorrect: If the response is not the same as answer key
    • Missing: If answer key is found tagged, but response is not tagged
    • Spurious: If response is found tagged, but answer key is not tagged
    
These counts give the performance measures for an NER model:

    • Precision (P): correct / (correct + incorrect + spurious)
    • Recall (R): correct / (correct + incorrect + missing)
    • F-Measure: 2 x P x R / (P + R)
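
A small sketch of how these measures can be computed from the four counts. It takes precision over everything the tagger produced (correct + incorrect + spurious) and recall over everything in the answer key (correct + incorrect + missing); the function name `ner_scores` and the example counts are hypothetical.

```python
def ner_scores(correct, incorrect, missing, spurious):
    """Return (precision, recall, f_measure) from the four response counts."""
    produced = correct + incorrect + spurious  # entities the tagger output
    expected = correct + incorrect + missing   # entities in the answer key
    precision = correct / produced if produced else 0.0
    recall = correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts: 6 correct, 2 incorrect, 2 missing, 0 spurious.
p, r, f = ner_scores(correct=6, incorrect=2, missing=2, spurious=0)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.75 0.6 0.667
```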

# Training NER using Machine Learning Toolkits
NER can be performed using the following approaches:

    • Rule-based or Handcrafted approach:
          List Lookup approach
          Linguistic approach
    • Machine Learning-based approach or Automated approach:
          Hidden Markov Model
          Maximum Entropy Markov Model
          Conditional Random Fields
          Support Vector Machine
          Decision Trees

# NER using POS tagging
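Tag the parts of speech first, then detect named entities from the POS tags.
For example, tokens tagged NNP (proper noun) are treated as named entities.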

In [1]:
import nltk
from nltk import pos_tag, word_tokenize
pos_tag(word_tokenize("John and Smith are going to NY and Germany"))

[('John', 'NNP'),
 ('and', 'CC'),
 ('Smith', 'NNP'),
 ('are', 'VBP'),
 ('going', 'VBG'),
 ('to', 'TO'),
 ('NY', 'NNP'),
 ('and', 'CC'),
 ('Germany', 'NNP')]

John, Smith, NY, and Germany are tagged NNP; they are also named entities that denote people and places.
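
Picking the proper nouns out of the tagged output can be sketched in a few lines. The helper name `proper_noun_candidates` is hypothetical, and grouping consecutive NNP tokens into one candidate is a simplifying assumption rather than a full NER method.

```python
def proper_noun_candidates(tagged_tokens):
    """Group consecutive NNP/NNPS tokens into candidate entity strings."""
    candidates, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:  # flush a candidate that ends the sentence
        candidates.append(" ".join(current))
    return candidates


# The tagged sentence from the cell above.
tagged = [('John', 'NNP'), ('and', 'CC'), ('Smith', 'NNP'), ('are', 'VBP'),
          ('going', 'VBG'), ('to', 'TO'), ('NY', 'NNP'), ('and', 'CC'),
          ('Germany', 'NNP')]
print(proper_noun_candidates(tagged))  # ['John', 'Smith', 'NY', 'Germany']
```

This approach finds candidates only; it does not say whether a candidate is a person, a place, or an organization.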

In [2]:
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger
tagger = UnigramTagger(brown.tagged_sents(categories='news')[:700]) # training is limited to 700 sentences, so some tokens get the tag None
sentence = ['John','and','Smith','are','going','to','NY','and','Germany']
for word,tag in tagger.tag(sentence):
    print(word,'->',tag)

John -> NP
and -> CC
Smith -> None
are -> BER
going -> VBG
to -> TO
NY -> None
and -> CC
Germany -> None


John is tagged NP and is a named entity. Some words are tagged None because those tokens did not occur in the training data.

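# Generating synset IDs from WordNet
WordNet can be defined as a lexical database of English.
Conceptual relations between words, such as hypernyms, synonyms, antonyms, and hyponyms, can be found through synsets.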

### look up a word using synsets:

In [4]:
import nltk
from nltk.corpus import wordnet
wordnet.synsets('cat')

[Synset('cat.n.01'),
 Synset('guy.n.01'),
 Synset('cat.n.03'),
 Synset('kat.n.01'),
 Synset('cat-o'-nine-tails.n.01'),
 Synset('caterpillar.n.02'),
 Synset('big_cat.n.01'),
 Synset('computerized_tomography.n.01'),
 Synset('cat.v.01'),
 Synset('vomit.v.01')]

In [5]:
wordnet.synsets('cat',pos=wordnet.VERB)

[Synset('cat.v.01'), Synset('vomit.v.01')]

In [14]:
wordnet.synset('cat.n.01') # note: synset, not synsets

Synset('cat.n.01')

***cat.n.01*** means that **cat** is of the **noun** category and that this is the **first** of its noun senses (the list above shows several others, such as *cat.n.03*):

In [15]:
print(wordnet.synset('cat.n.01').definition())

feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats


In [17]:
len(wordnet.synset('cat.n.01').examples())

0

In [19]:
wordnet.synset('cat.n.01').lemmas()

[Lemma('cat.n.01.cat'), Lemma('cat.n.01.true_cat')]

In [21]:
[str(lemma.name()) for lemma in wordnet.synset('cat.n.01').lemmas()]

['cat', 'true_cat']

In [22]:
wordnet.lemma('cat.n.01.cat').synset()

Synset('cat.n.01')

### the use of Synsets and Open Multilingual Wordnet using ISO 639 language codes:

In [25]:
import nltk
from nltk.corpus import wordnet as wn
sorted(wn.langs())

['als',
 'arb',
 'bul',
 'cat',
 'cmn',
 'dan',
 'ell',
 'eng',
 'eus',
 'fas',
 'fin',
 'fra',
 'glg',
 'heb',
 'hrv',
 'ind',
 'ita',
 'jpn',
 'nno',
 'nob',
 'pol',
 'por',
 'qcn',
 'slv',
 'spa',
 'swe',
 'tha',
 'zsm']

In [26]:
wn.synset('cat.n.01').lemma_names('ita')

['gatto']

In [27]:
sorted(wn.synset('cat.n.01').lemmas('dan'))

[Lemma('cat.n.01.kat'), Lemma('cat.n.01.mis'), Lemma('cat.n.01.missekat')]

In [28]:
sorted(wn.synset('cat.n.01').lemmas('por'))

[Lemma('cat.n.01.Gato-doméstico'),
 Lemma('cat.n.01.Gato_doméstico'),
 Lemma('cat.n.01.bichano'),
 Lemma('cat.n.01.gata'),
 Lemma('cat.n.01.gato'),
 Lemma('cat.n.01.gato-doméstico')]

In [29]:
len(wordnet.all_lemma_names(pos='n', lang='jpn'))

64797

In [33]:
sorted(wn.synset('cat.n.01').lemmas('jpn'))

[Lemma('cat.n.01.にゃんにゃん'),
 Lemma('cat.n.01.キャット'),
 Lemma('cat.n.01.ネコ'),
 Lemma('cat.n.01.猫')]

In [34]:
cat = wn.synset('cat.n.01')
cat.hypernyms()

[Synset('feline.n.01')]

In [35]:
cat.hyponyms()

[Synset('domestic_cat.n.01'), Synset('wildcat.n.03')]

In [36]:
cat.root_hypernyms()

[Synset('entity.n.01')]

In [37]:
wn.synset('cat.n.01').lowest_common_hypernyms(wn.synset('dog.n.01'))

[Synset('carnivore.n.01')]

# Disambiguation
WordNet-based semantic similarity can support disambiguation. NLTK provides the following similarity measures:

### Path Distance Similarity:


In [38]:
import nltk
from nltk.corpus import wordnet as wn
lion = wn.synset('lion.n.01')
cat = wn.synset('cat.n.01')
lion.path_similarity(cat)

0.25
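
Path similarity is 1 / (shortest path length + 1), so 0.25 corresponds to three edges between the two synsets. The sketch below reproduces that on a hand-written fragment of the hypernym hierarchy; the `HYPERNYM` table is a simplified assumption for illustration, not data read from WordNet, and the function names are hypothetical.

```python
from collections import deque

# Hand-written fragment of a hypernym hierarchy (child -> parent); illustration only.
HYPERNYM = {
    "lion": "big_cat",
    "big_cat": "feline",
    "cat": "feline",
    "feline": "carnivore",
    "dog": "canine",
    "canine": "carnivore",
}


def shortest_path_length(a, b):
    """Number of edges on the shortest path between a and b (undirected BFS)."""
    neighbours = {}
    for child, parent in HYPERNYM.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path


def path_similarity(a, b):
    """1 / (shortest path length + 1), or None when the nodes are not connected."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)


print(path_similarity("lion", "cat"))  # 0.25
print(path_similarity("cat", "cat"))   # 1.0
```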

### Leacock Chodorow Similarity:

In [40]:
import nltk
from nltk.corpus import wordnet as wn
lion = wn.synset('lion.n.01')
cat = wn.synset('lion.n.01')  # note: this binds cat to lion.n.01, so the score below compares lion with itself
lion.lch_similarity(cat)

3.6375861597263857
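
Leacock-Chodorow similarity is -log((p + 1) / (2 * D)), where p is the shortest path length and D the maximum depth of the taxonomy. The value above equals -log(1/38), which is what the formula yields for p = 0, that is, a synset compared with itself. A minimal sketch follows; `NOUN_DEPTH = 19` is an assumption inferred from that output, not a constant taken from the NLTK documentation.

```python
import math


def lch_similarity(path_length, taxonomy_depth):
    """Leacock-Chodorow score from a path length and the taxonomy depth."""
    return -math.log((path_length + 1) / (2.0 * taxonomy_depth))


NOUN_DEPTH = 19  # assumption: inferred from exp(3.6375861597...) = 38 = 2 * 19

print(lch_similarity(0, NOUN_DEPTH))  # identical synsets, about 3.6376
print(lch_similarity(3, NOUN_DEPTH))  # three edges apart, about 2.2513
```

Under the same assumption, two synsets that are three edges apart (path similarity 0.25) would score about 2.25.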

### Wu-Palmer Similarity:


In [42]:
lion.wup_similarity(cat)  # cat still refers to lion.n.01 from In [40], hence the maximum score 1.0

1.0

### Resnik Similarity, Lin Similarity, and Jiang-Conrath Similarity:

In [43]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [44]:
import nltk
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic


In [46]:
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
from nltk.corpus import genesis
genesis_ic = wn.ic(genesis, False, 0.0)
lion = wn.synset('lion.n.01')
cat = wn.synset('cat.n.01')


In [50]:
lion.res_similarity(cat, brown_ic)

8.663481537685325

In [49]:
lion.res_similarity(cat, genesis_ic)

7.339696591781996

In [52]:
lion.jcn_similarity(cat, brown_ic)

0.36425897775957294

In [51]:
lion.jcn_similarity(cat, genesis_ic)

0.30578008567889475

In [53]:
lion.lin_similarity(cat, semcor_ic)

0.8560734335071154
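
All three measures are built from information content (IC), the negative log probability of a concept in a corpus, which is why the scores change with the IC file. A minimal sketch of the standard formulas, where `ic_lcs` is the IC of the lowest common hypernym; the function names and the IC values are hypothetical and not taken from any of the corpora above.

```python
def resnik(ic_lcs):
    """Resnik: the information content of the lowest common hypernym."""
    return ic_lcs


def lin(ic_a, ic_b, ic_lcs):
    """Lin: shared information relative to the information of both concepts."""
    return 2.0 * ic_lcs / (ic_a + ic_b)


def jiang_conrath(ic_a, ic_b, ic_lcs):
    """Jiang-Conrath: inverse of the IC-based distance between the concepts."""
    distance = ic_a + ic_b - 2.0 * ic_lcs
    return float("inf") if distance == 0 else 1.0 / distance


# Hypothetical information content values, for illustration only.
ic_a, ic_b, ic_lcs = 10.0, 9.0, 8.0
print(resnik(ic_lcs))                               # 8.0
print(round(lin(ic_a, ic_b, ic_lcs), 4))            # 0.8421
print(round(jiang_conrath(ic_a, ic_b, ic_lcs), 4))  # 0.3333
```

Resnik depends only on the shared ancestor, whereas Lin and Jiang-Conrath also account for how specific each of the two concepts is.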