# Information Extraction

## 1. Named entities and relations

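The goal is to extract information and facts from text, so that the chatbot knows what the user is saying. This involves two tasks:

- Extracting named entities
- Extracting the relations between named entities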

### 1.1 Knowledge base

A chatbot usually stores the extracted information in a knowledge base.

global knowledge base:

context: knowledge related to the current session. A session's context can be stored in the global knowledge base, or in a separate knowledge base of its own.
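As a rough illustration of the two stores described above, the sketch below keeps global facts and per-session context side by side. The class name `KnowledgeBase`, its methods, and the session id `"s1"` are hypothetical and exist only for this example; a real chatbot would more likely use a database or a graph store.

```python
from collections import defaultdict


class KnowledgeBase:
    """Hypothetical in-memory store: global facts plus per-session context."""

    def __init__(self):
        self.global_facts = set()          # facts shared by every session
        self.context = defaultdict(set)    # session id -> facts for that session

    def add_fact(self, subject, relation, obj, session_id=None):
        """Store a fact globally, or in one session's context if an id is given."""
        fact = (subject, relation, obj)
        if session_id is None:
            self.global_facts.add(fact)
        else:
            self.context[session_id].add(fact)

    def facts_for(self, session_id):
        """Return the global facts together with the context of one session."""
        return self.global_facts | self.context.get(session_id, set())


kb = KnowledgeBase()
kb.add_fact("Desoto", "met", "Pascagoula")                      # global fact
kb.add_fact("user", "asked_about", "Desoto", session_id="s1")   # session context
```

Here the context lives in a separate structure; storing it in the global store instead would only require tagging each fact with its session id.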



## 3. Extracting relationships 

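Earlier we covered extracting information such as times and latitude/longitude pairs. Next we look at how to extract relationships.

To understand a sentence, you have to understand the relationships between its entities. Relationships are represented internally as a knowledge graph, also called a knowledge base:
- Nodes: the extracted entities
- Edges: the relationships between entities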
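A minimal sketch of this node/edge representation, assuming each extracted fact is a (subject, relation, object) triple. The triples are written by hand from the example sentences in these notes, and the adjacency-list layout is just one possible choice.

```python
from collections import defaultdict

# Hand-written (subject, relation, object) triples, for illustration only.
triples = [
    ("Desoto", "met", "Pascagoula"),
    ("Gorbachev", "met", "Reagan"),
]

# Adjacency list: each entity maps to its outgoing (relation, entity) edges.
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))

# The nodes are all entities that appear as a subject or an object.
nodes = {entity for s, _, o in triples for entity in (s, o)}
```

Looking up `graph["Desoto"]` then returns the labeled edges that start at that entity.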

To identify these relationships, we usually rely on part-of-speech (POS) analysis.

### 3.1 POS Tagging

POS tagging is done with language models that contain dictionaries of words with all their possible parts of speech.
- The models are trained on tagged sentences
- NLTK and spaCy both provide POS taggers; spaCy is somewhat more accurate
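To make "a dictionary of words with their possible tags, trained on tagged sentences" concrete, here is a toy most-frequent-tag baseline. It is a deliberate simplification and not how spaCy's tagger works (spaCy uses a statistical model that takes context into account). The tiny corpus is made up for illustration, and the `tag` helper is hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (made up for illustration), using Penn Treebank tags.
tagged_sentences = [
    [("Desoto", "NNP"), ("met", "VBD"), ("the", "DT"), ("Pascagoula", "NNP")],
    [("the", "DT"), ("people", "NNS"), ("ranged", "VBD"), ("north", "RB")],
    [("north", "NN"), ("of", "IN"), ("the", "DT"), ("rivers", "NNS")],
    [("they", "PRP"), ("ranged", "VBD"), ("north", "RB")],
]

# "Training": record every tag seen for each word, with counts.
tag_counts = defaultdict(Counter)
for sent in tagged_sentences:
    for word, pos in sent:
        tag_counts[word.lower()][pos] += 1


def tag(words, default="NN"):
    """Give each word its most frequent tag; unseen words get `default`."""
    result = []
    for w in words:
        key = w.lower()
        if key in tag_counts:
            result.append((w, tag_counts[key].most_common(1)[0][0]))
        else:
            result.append((w, default))
    return result


tagged = tag(["Desoto", "ranged", "north"])
```

The word "north" has two possible tags in the dictionary (`RB` and `NN`); this baseline always picks the more frequent one, which is exactly the kind of ambiguity a context-aware tagger handles better.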

In [1]:
import spacy

In [2]:
en_model = spacy.load('en_core_web_md')

In [3]:
sentence = ("In 1541 Desoto wrote in his journal that the Pascagoula people ranged as far north as the confluence of the Leaf and Chickasawhay rivers at 30.4, -88.5.")


In [4]:
parsed_sent = en_model(sentence)

In [5]:
parsed_sent.ents

(1541, Desoto, Pascagoula, Leaf, Chickasawhay, 30.4)

In [6]:
' '.join(['{}:{}'.format(tok, tok.tag_) for tok in parsed_sent])

'In:IN 1541:CD Desoto:NNP wrote:VBD in:IN his:PRP$ journal:NN that:IN the:DT Pascagoula:NNP people:NNS ranged:VBD as:RB far:RB north:RB as:IN the:DT confluence:NN of:IN the:DT Leaf:NNP and:CC Chickasawhay:NNP rivers:NNS at:IN 30.4:CD ,:, -88.5:NNP .:.'

#### Dependency Tree

In [7]:
from spacy.displacy import render

In [8]:
sentence = "In 1541 Desotomet the Pascagoula."

In [9]:
parsed_sent = en_model(sentence)

In [10]:
render(docs=parsed_sent, page=True, options=dict(compact=True))

#### Tabular view

In [11]:
import pandas as pd
from collections import OrderedDict

In [12]:
def token_dict(token):
    return OrderedDict(ORTH=token.orth_, LEMMA=token.lemma_,
                       POS=token.pos_, TAG=token.tag_, DEP=token.dep_)


In [13]:
def doc_dataframe(doc):
    return pd.DataFrame([token_dict(tok) for tok in doc])

In [14]:
doc_dataframe(en_model(sentence))

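| | ORTH | LEMMA | POS | TAG | DEP |
|---|---|---|---|---|---|
| 0 | In | in | ADP | IN | ROOT |
| 1 | 1541 | 1541 | NUM | CD | pobj |
| 2 | Desotomet | Desotomet | PROPN | NNP | pobj |
| 3 | the | the | DET | DT | det |
| 4 | Pascagoula | Pascagoula | PROPN | NNP | appos |
| 5 | . | . | PUNCT | . | punct |

Because the sentence from In [8] is missing the space in "Desoto met", spaCy reads "Desotomet" as a single proper noun, so this parse contains no verb and "In" ends up as ROOT.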


#### matcher

In [55]:
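pattern = [{'TAG': 'NNP', 'OP': '+'},       # one or more proper nouns
           {'IS_ALPHA': True, 'OP': '*'},   # any number of alphabetic tokens
           {'LEMMA': 'meet'},               # any form of the verb "meet"
           {'IS_ALPHA': True, 'OP': '*'},   # any number of alphabetic tokens
           {'TAG': 'NNP', 'OP': '+'}        # one or more proper nouns
          ]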


In [56]:
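from spacy.matcher import Matcher

doc = en_model("In 1541 Desoto met the Pascagoula.")
matcher = Matcher(en_model.vocab)
# spaCy v2 signature; in spaCy v3 this call is matcher.add('met', [pattern])
matcher.add('met', None, pattern)
m = matcher(doc)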

In [57]:
m

[(14332210279624491740, 2, 6)]

In [58]:
doc[m[0][1]:m[0][2]]

Desoto met the Pascagoula

#### multiple matches

In [59]:
doc = en_model("October 24: Lewis and Clark met their first Mandan Chief, Big White.")
m = matcher(doc)[0]  # the first match

In [60]:
m

(14332210279624491740, 5, 10)

In [61]:
doc[m[1]:m[2]]

Clark met their first Mandan

In [62]:
m = matcher(doc)  # all matches

In [63]:
m

[(14332210279624491740, 5, 10),
 (14332210279624491740, 3, 10),
 (14332210279624491740, 5, 11),
 (14332210279624491740, 3, 11)]

In [64]:
for match_id, start, end in m:
    print(doc[start:end])  # longest match is the last one in the list

Clark met their first Mandan
Lewis and Clark met their first Mandan
Clark met their first Mandan Chief
Lewis and Clark met their first Mandan Chief
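Rather than relying on the longest match being last in the list, it can be selected explicitly by span length. The sketch below works on plain `(match_id, start, end)` tuples copied from the output above, so it runs without spaCy; `longest_match` is a hypothetical helper name.

```python
# Matches in the (match_id, start, end) form shown in the output above.
matches = [
    (14332210279624491740, 5, 10),
    (14332210279624491740, 3, 10),
    (14332210279624491740, 5, 11),
    (14332210279624491740, 3, 11),
]


def longest_match(matches):
    """Return the match covering the most tokens, or None if there are none."""
    return max(matches, key=lambda match: match[2] - match[1], default=None)


match_id, start, end = longest_match(matches)
```

With the `doc` from the cells above, `doc[start:end]` would then give the span for the longest match.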


#### add pattern

In [65]:
doc = en_model("On 11 October 1986, Gorbachev and Reagan met at a house")

matcher(doc)  # no match: the existing pattern needs a proper noun after "meet"

[]

In [44]:
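doc = en_model("On 11 October 1986, Gorbachev and Reagan met at a house")
# Second pattern: proper nouns joined by "and" that precede the verb ("A and B met")
pattern = [{'TAG': 'NNP', 'OP': '+'},
           {'LEMMA': 'and'},
           {'TAG': 'NNP', 'OP': '+'},
           {'IS_ALPHA': True, 'OP': '*'},
           {'LEMMA': 'meet'}
          ]
# Adding under the existing key 'met' extends its patterns
# (spaCy v2 signature; in spaCy v3 this call is matcher.add('met', [pattern]))
matcher.add('met', None, pattern)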


In [45]:
m = matcher(doc)

In [46]:
m

[(14332210279624491740, 5, 9)]

In [67]:
doc[m[-1][1]:m[-1][2]]

Gorbachev and Reagan met
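A matched span is still only text. To use it as an edge in a knowledge graph it has to become a triple. The sketch below is one possible way to do that and is an assumption of these notes, not a spaCy feature: `meeting_triples` is a hypothetical helper, and the `(text, tag, lemma)` tuples are written by hand so that the example runs without a model.

```python
from itertools import combinations


def meeting_triples(tokens):
    """Turn one matched span into ("A", "met", "B") triples.

    `tokens` is a list of (text, tag, lemma) tuples for the span.
    """
    lemmas = [lemma for _, _, lemma in tokens]
    if "meet" not in lemmas:
        return []
    verb = lemmas.index("meet")
    before = [text for text, pos, _ in tokens[:verb] if pos == "NNP"]
    after = [text for text, pos, _ in tokens[verb + 1:] if pos == "NNP"]
    if after:
        # "A met B": subjects precede the verb, objects follow it.
        return [(a, "met", b) for a in before for b in after]
    # "A and B met": all participants precede the verb.
    return [(a, "met", b) for a, b in combinations(before, 2)]


# Hand-written tokens for the two kinds of span matched above.
span1 = [("Desoto", "NNP", "Desoto"), ("met", "VBD", "meet"),
         ("the", "DT", "the"), ("Pascagoula", "NNP", "Pascagoula")]
span2 = [("Gorbachev", "NNP", "Gorbachev"), ("and", "CC", "and"),
         ("Reagan", "NNP", "Reagan"), ("met", "VBD", "meet")]

triples = meeting_triples(span1) + meeting_triples(span2)
```

With spaCy, the input tuples could be built from a match as `[(t.text, t.tag_, t.lemma_) for t in doc[start:end]]`.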