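### 50. Sentence segmentation

Treat the pattern (. or ; or : or ? or !) → whitespace → uppercase letter as a sentence boundary, and output the input document in one-sentence-per-line format.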

In [37]:
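import re
def nlp_lines():
    """Yield the sentences of nlp.txt one at a time."""
    # group(1): shortest prefix ending in . ; : ? or !
    # group(2): the remainder, which must start with an uppercase letter
    pattern=re.compile(r'(^.*?[.;:?!])\s([A-Z].*)')
    with open("nlp.txt") as f:
        for line in f:
            line=line.rstrip()
            while len(line)>0:
                match=pattern.match(line)
                if match:
                    yield match.group(1)
                    line=match.group(2)
                else:
                    # no further boundary: the rest of the line is one sentence
                    yield line
                    line=''
# stop after the first few sentences
for i,line in enumerate(nlp_lines()):
    print(line)
    if i==10:
        break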

Natural language processing
From Wikipedia, the free encyclopedia
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of humani-computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
History
The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would be a 
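
The boundary rule above can also be written as a single `re.split` call. The following is a minimal sketch, not the notebook's method: `BOUNDARY`, `split_sentences`, and the sample text are names and data made up for illustration, and it works on an in-memory string instead of `nlp.txt`.

```python
import re

# Boundary: one of . ; : ? ! followed by whitespace and an uppercase letter.
# The lookbehind and lookahead keep the punctuation and the capital letter
# in the output; only the whitespace between them is consumed.
BOUNDARY = re.compile(r'(?<=[.;:?!])\s+(?=[A-Z])')

def split_sentences(text):
    """Split text into sentences at the boundary pattern."""
    return [s for s in BOUNDARY.split(text.strip()) if s]

sample = "NLP is fun. It has many uses: Parsing is one! is it? Yes."
for sentence in split_sentences(sample):
    print(sentence)
```

Because the lookahead requires an uppercase letter, `one! is it?` stays inside a single sentence in this sample.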

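### 51. Word extraction

Treat whitespace as the word delimiter, take the output of 50 as input, and output it in one-word-per-line format. Output an empty line at the end of each sentence.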

In [24]:
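def nlp_words():
    """Yield the words of each sentence, followed by '' at the sentence end."""
    for line in nlp_lines():
        for word in line.split():
            # drop trailing punctuation only; leading characters such as "(" stay
            yield word.rstrip(".,;:?!")
        yield ''
# stop after the first few words
for i,word in enumerate(nlp_words()):
    print(word)
    if i==20:
        break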

Natural
language
processing

From
Wikipedia
the
free
encyclopedia

Natural
language
processing
(NLP)
is
a
field
of
computer
science
artificial
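
As the output shows, stripping only trailing punctuation leaves tokens such as `(NLP)` intact. A possible variation, sketched here under the assumption that surrounding punctuation should be removed on both sides, uses `string.punctuation`. `iter_words` and the sample sentences are hypothetical and do not read `nlp.txt`.

```python
import string

def iter_words(sentences):
    """Yield words one by one, with an empty string after each sentence."""
    for sentence in sentences:
        for token in sentence.split():
            # strip punctuation on both sides, e.g. "(NLP)" -> "NLP"
            word = token.strip(string.punctuation)
            if word:  # skip tokens that were punctuation only
                yield word
        yield ''  # sentence terminator

sample = ["Natural language processing (NLP) is a field.", "As such, it works."]
for word in iter_words(sample):
    print(word)
```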


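### 52. Stemming

Take the output of 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-separated format. In Python, the stemming module is a convenient implementation of Porter's stemming algorithm.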

In [29]:
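import snowballstemmer
# 'english' selects the Snowball English stemmer (Porter2), a revision of the
# original Porter algorithm; snowballstemmer also provides 'porter'.
stemmer=snowballstemmer.stemmer('english')
for i,word in enumerate(nlp_words()):
    # word and stem, separated by a tab
    print('{}\t{}'.format(word,stemmer.stemWord(word)))
    if i==30:
        break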

Natural	Natur
language	languag
processing	process
	
From	From
Wikipedia	Wikipedia
the	the
free	free
encyclopedia	encyclopedia
	
Natural	Natur
language	languag
processing	process
(NLP)	(NLP)
is	is
a	a
field	field
of	of
computer	comput
science	scienc
artificial	artifici
intelligence	intellig
and	and
linguistics	linguist
concerned	concern
with	with
the	the
interactions	interact
between	between
computers	comput
and	and


### 53. Tokenization

Use Stanford Core NLP to obtain the analysis of the input text in XML format. Then read this XML file and output the input text in one-word-per-line format.

In [2]:
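import xml.etree.ElementTree as ET
# nlp.txt.xml: XML output of Stanford Core NLP for nlp.txt
tree=ET.parse('nlp.txt.xml')
root=tree.getroot()
# print the first 21 <word> elements
for i,word in enumerate(root.findall(".//word")):
    print(word.text)
    if i>=20:
        break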

Natural
language
processing
From
Wikipedia
,
the
free
encyclopedia
Natural
language
processing
-LRB-
NLP
-RRB-
is
a
field
of
computer
science
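
The tokens `-LRB-` and `-RRB-` in the output are Penn Treebank escapes for the parentheses in the original text. If the original characters are wanted, they can be mapped back with a small lookup table. This is only a sketch: `PTB_ESCAPES` and `unescape_ptb` are hypothetical names, and the table covers only the common bracket and quote escapes.

```python
# Penn Treebank escapes that may appear in place of brackets and quotes
PTB_ESCAPES = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
    "``": '"', "''": '"',
}

def unescape_ptb(word):
    """Return the original character for an escaped token, else the token."""
    return PTB_ESCAPES.get(word, word)

tokens = ["processing", "-LRB-", "NLP", "-RRB-", "is"]
print([unescape_ptb(t) for t in tokens])
```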


### 54. Part-of-speech tagging

Read the analysis XML produced by Stanford Core NLP and output each word, lemma, and part of speech in tab-separated format.

In [46]:
i=0
for token in root.findall(".//token"):
    word=token.find('word').text
    lemma=token.find('lemma').text
    pos=token.find('POS').text
    print("{}\t{}\t{}".format(word,lemma,pos))
    i+=1
    if i>15:
        break

Natural	natural	JJ
language	language	NN
processing	processing	NN
From	from	IN
Wikipedia	Wikipedia	NNP
,	,	,
the	the	DT
free	free	JJ
encyclopedia	encyclopedia	NN
Natural	natural	JJ
language	language	NN
processing	processing	NN
-LRB-	-lrb-	-LRB-
NLP	nlp	NN
-RRB-	-rrb-	-RRB-
is	be	VBZ
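
The same extraction can be tried without `nlp.txt.xml` by parsing a small inline document. The sketch below assumes the element layout seen in the cells above (`token` elements with `word`, `lemma`, and `POS` children); `SAMPLE_XML` and `token_rows` are made up for illustration, and `itertools.islice` replaces the manual counter.

```python
import xml.etree.ElementTree as ET
from itertools import islice

# Hand-written sample that imitates the layout of the CoreNLP XML output
SAMPLE_XML = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>NLP</word><lemma>nlp</lemma><POS>NN</POS></token>
<token id="2"><word>is</word><lemma>be</lemma><POS>VBZ</POS></token>
</tokens></sentence></sentences></document></root>"""

def token_rows(root, limit=None):
    """Return (word, lemma, POS) tuples for at most `limit` tokens."""
    tokens = root.iterfind("./document/sentences/sentence/tokens/token")
    return [
        (t.findtext("word"), t.findtext("lemma"), t.findtext("POS"))
        for t in islice(tokens, limit)  # limit=None means no limit
    ]

for row in token_rows(ET.fromstring(SAMPLE_XML)):
    print("\t".join(row))
```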


### 55. Named entity extraction

Extract all person names in the input text.

In [47]:
for token in root.findall(".//token"):
    if token.find("NER").text=="PERSON":
        print(token.find("word").text)

Alan
Turing
Joseph
Weizenbaum
MARGIE
Schank
Wilensky
Meehan
Lehnert
Carbonell
Lehnert
Racter
Jabberwacky
Moore
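
The loop above prints one token per line, so a name like "Alan Turing" is split in two. One way to join adjacent `PERSON` tokens is `itertools.groupby`, applied per sentence so that names never merge across sentence boundaries. This is a sketch on a hand-written sample; `SAMPLE_XML` and `person_names` are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hand-written sample that imitates the layout of the CoreNLP XML output
SAMPLE_XML = """<root><document><sentences>
<sentence id="1"><tokens>
<token id="1"><word>In</word><NER>O</NER></token>
<token id="2"><word>1950</word><NER>DATE</NER></token>
<token id="3"><word>Alan</word><NER>PERSON</NER></token>
<token id="4"><word>Turing</word><NER>PERSON</NER></token>
<token id="5"><word>published</word><NER>O</NER></token>
</tokens></sentence>
<sentence id="2"><tokens>
<token id="1"><word>Weizenbaum</word><NER>PERSON</NER></token>
<token id="2"><word>wrote</word><NER>O</NER></token>
</tokens></sentence>
</sentences></document></root>"""

def person_names(root):
    """Return person names, joining consecutive PERSON tokens in a sentence."""
    names = []
    for sentence in root.iterfind("./document/sentences/sentence"):
        tokens = sentence.findall("./tokens/token")
        # groupby yields runs of adjacent tokens that share the same NER tag
        for tag, run in groupby(tokens, key=lambda t: t.findtext("NER")):
            if tag == "PERSON":
                names.append(" ".join(t.findtext("word") for t in run))
    return names

print(person_names(ET.fromstring(SAMPLE_XML)))
```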


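### 56. Coreference resolution

Based on the coreference resolution results from Stanford Core NLP, replace each mention in the text with its representative mention. When replacing, keep the original mention recoverable by writing it in the form "representative mention (mention)".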



In [30]:
# Map (sentence id, start token id) -> (end token id, representative text)
rep_dict={}
for coreference in root.findall("./document/coreference/coreference"):
    rep_text=None
    for mention in coreference.findall("mention"):
        if rep_text is None:
            # this code assumes the first mention of a chain is the representative one
            rep_text=mention.findtext("text")
        else:
            sent_id=int(mention.findtext("sentence"))
            start=int(mention.findtext("start"))
            end=int(mention.findtext("end"))
            rep_dict[(sent_id,start)]=(end,rep_text)
j=0
for sentence in root.findall('./document/sentences/sentence'):
    sent_id=int(sentence.get('id'))
    b=0  # number of tokens left in the mention being replaced

    for token in sentence.iterfind('./tokens/token'):
        token_id = int(token.get('id'))

        if (sent_id,token_id) in rep_dict:
            # a mention starts here: print "representative(" before it
            (end, rep_text)=rep_dict[(sent_id,token_id)]
            print(rep_text+'(',end='')
            b=end-token_id
        print(token.findtext('word'),end='')
        if b>0:
            b-=1
            if b==0:
                # last token of the mention: close the parenthesis
                print(')',end='')
        print(' ',end='')
    print()
    # stop after the first six sentences
    j+=1
    if j>5:
        break

Natural language processing From Wikipedia , the free encyclopedia Natural language processing -LRB- NLP -RRB- is the free encyclopedia Natural language processing -LRB- NLP -RRB-(a field of computer science) , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages . 
As such , NLP is related to the area of humani-computer interaction . 
Many challenges in NLP involve natural language understanding , that is , enabling computers(computers) to derive meaning from human or natural language input , and others involve natural language generation . 
History The history of NLP generally starts in the 1950s , although work can be found from earlier periods . 
In 1950 , Alan Turing published an article titled `` Computing Machinery and Intelligence '' which proposed what is now called the Alan Turing(Turing) test as a criterion of intelligence . 
The Georgetown experiment in 1954 involved fully automatic translation o
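
The replacement step can be isolated from the XML handling and checked on plain lists. The sketch below makes the same simplifying assumption as the cell above, namely that mentions do not nest or overlap. `annotate_mentions` and its arguments are hypothetical; `mentions` maps a 1-based start index to an exclusive end index and the representative text.

```python
def annotate_mentions(tokens, mentions):
    """Rewrite each mention as 'representative(mention)'.

    tokens   -- list of words of one sentence
    mentions -- dict: 1-based start index -> (exclusive end index, representative)
    """
    out = []
    close_at = None  # index of the token that ends the current mention
    for idx, word in enumerate(tokens, start=1):
        if idx in mentions:
            end, rep = mentions[idx]
            word = rep + "(" + word
            close_at = end - 1
        if idx == close_at:
            word += ")"
            close_at = None
        out.append(word)
    return " ".join(out)

print(annotate_mentions(["the", "Turing", "test"], {2: (3, "Alan Turing")}))
```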

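### 57. Dependency parsing

Visualize the dependency parse (collapsed-dependencies) from Stanford Core NLP as a directed graph. A good approach is to convert the dependency tree to the DOT language and render it with Graphviz. To visualize a directed graph directly from Python, pydot is a good choice.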

In [70]:
from graphviz import Digraph
G=Digraph(format="png")
G.attr('node',shape='circle')
# collapsed dependencies of the first sentence
dependencies=root.findall("./document/sentences/sentence[@id='1']/dependencies[@type='collapsed-dependencies']/dep")
for a in dependencies:
    # skip punctuation edges
    if a.get("type")!="punct":
        governor=a.find("./governor")
        dependent=a.find("./dependent")
        # the token index is the node id, the word is the node label
        G.node(governor.get("idx"),governor.text)
        G.node(dependent.get("idx"),dependent.text)
        G.edge(governor.get("idx"),dependent.get("idx"))
G.render("graphs")

'graphs.png'

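### 58. Tuple extraction

Using the dependency parse (collapsed-dependencies) from Stanford Core NLP, output subject, predicate, object triples in tab-separated format: for each predicate that governs both an nsubj and a dobj relation, print the nsubj dependent, the predicate, and the dobj dependent.
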
In [17]:
k=1
for sentence in root.findall("./document/sentences/sentence"):
    verbs={}     # governor idx -> predicate word
    subjects={}  # governor idx -> nsubj dependent
    objects={}   # governor idx -> dobj dependent
    for dep in sentence.findall("./dependencies[@type='collapsed-dependencies']/dep"):
        dep_type=dep.get("type")
        if dep_type in ("nsubj","dobj"):
            governor=dep.find("governor")
            idx=governor.get("idx")
            verbs[idx]=governor.text
            if dep_type=="nsubj":
                subjects[idx]=dep.findtext("dependent")
            else:
                objects[idx]=dep.findtext("dependent")
    # print only predicates that have both a subject and an object
    for idx,verb in verbs.items():
        nsubj=subjects.get(idx)
        dobj=objects.get(idx)
        if nsubj is not None and dobj is not None:
            print("{}\t{}\t{}".format(nsubj,verb,dobj))
    # stop after the first 50 sentences
    k+=1
    if k>50:
        break

understanding	enabling	computers
others	involve	generation
Turing	published	article
experiment	involved	translation
ELIZA	provided	interaction
patient	exceeded	base
ELIZA	provide	response
which	structured	information
underpinnings	discouraged	sort
that	underlies	approach
Some	produced	systems
which	make	decisions
systems	rely	which
that	contains	errors
implementations	involved	coding
algorithms	take	set
Some	produced	systems
which	make	decisions
models	have	advantage
they	express	certainty
Systems	have	advantages
Automatic	make	use
that	make	decisions
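
The three dictionaries can be folded into one `defaultdict` keyed by the governor index. The sketch below runs on a hand-written sample in the same layout as the dependencies used above; `SAMPLE_XML` and `svo_tuples` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample that imitates the layout of the CoreNLP XML output
SAMPLE_XML = """<root><document><sentences><sentence id="1">
<dependencies type="collapsed-dependencies">
<dep type="nsubj"><governor idx="2">published</governor><dependent idx="1">Turing</dependent></dep>
<dep type="det"><governor idx="4">article</governor><dependent idx="3">an</dependent></dep>
<dep type="dobj"><governor idx="2">published</governor><dependent idx="4">article</dependent></dep>
</dependencies></sentence></sentences></document></root>"""

def svo_tuples(sentence):
    """Return (subject, predicate, object) for predicates with nsubj and dobj."""
    args = defaultdict(dict)  # governor idx -> {"verb", "nsubj", "dobj"}
    path = "./dependencies[@type='collapsed-dependencies']/dep"
    for dep in sentence.iterfind(path):
        rel = dep.get("type")
        if rel in ("nsubj", "dobj"):
            governor = dep.find("governor")
            entry = args[governor.get("idx")]
            entry["verb"] = governor.text
            entry[rel] = dep.findtext("dependent")
    return [
        (a["nsubj"], a["verb"], a["dobj"])
        for a in args.values()
        if "nsubj" in a and "dobj" in a
    ]

root_sample = ET.fromstring(SAMPLE_XML)
for sentence in root_sample.iterfind("./document/sentences/sentence"):
    for triple in svo_tuples(sentence):
        print("\t".join(triple))
```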


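### 59. S-expression parsing

Read the phrase structure parse (S-expression) produced by Stanford Core NLP and print all noun phrases (NP) in the text. Nested noun phrases must all be printed as well.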

In [19]:
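import re
# group(1): tag of the outermost bracket, group(2): everything inside it
pattern=re.compile(r"^\((.*?)\s(.*)\)",re.DOTALL)
def s_parse(str1,list_np):
    """Return the words under str1 and append the text of every NP to list_np."""
    match=pattern.match(str1)
    tag=match.group(1)
    value=match.group(2)
    depth=0   # bracket nesting depth inside value
    chunk=''  # current child S-expression or bare word
    words=[]
    for c in value:
        if c=="(":
            chunk+=c
            depth+=1
        elif c==')':
            chunk+=c
            depth-=1
            if depth==0:
                # a complete child expression: parse it recursively
                words.append(s_parse(chunk, list_np))
                chunk=''
        else:
            # ignore whitespace between sibling expressions
            if not(depth==0 and c.isspace()):
                chunk+=c
    if chunk!='':
        # a leaf: the remaining text is the word itself
        words.append(chunk)
    result=' '.join(words)
    if tag=='NP':
        list_np.append(result)
    return result

# print the noun phrases of the second sentence
for s in root.findall("./document/sentences/sentence[@id='2']/parse"):
    result=[]
    s_parse(s.text,result)
    for a in result:
        print(a)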

such
NLP
the area
humani-computer interaction
the area of humani-computer interaction
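
An alternative to the recursive, character-by-character approach is to tokenize the S-expression and keep a stack of open brackets. The following is a sketch under the assumption that every opening bracket is followed by a label, as in the parse above; `TOKEN`, `noun_phrases`, and `SAMPLE` are made up for illustration.

```python
import re

# Tokens are "(", ")", or any run of characters without spaces or brackets
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def noun_phrases(sexp):
    """Return the text of every NP in a bracketed parse, inner NPs first."""
    stack = []  # one [label, words] entry per open bracket
    found = []
    expect_label = False
    for tok in TOKEN.findall(sexp):
        if tok == "(":
            expect_label = True
        elif tok == ")":
            label, words = stack.pop()
            if label == "NP":
                found.append(" ".join(words))
            if stack:
                # hand the words up to the enclosing phrase
                stack[-1][1].extend(words)
        elif expect_label:
            stack.append([tok, []])
            expect_label = False
        else:
            stack[-1][1].append(tok)
    return found

SAMPLE = """
(ROOT
  (S
    (NP (NNP NLP))
    (VP (VBZ is)
      (VP (VBN related)
        (PP (TO to)
          (NP
            (NP (DT the) (NN area))
            (PP (IN of)
              (NP (NN interaction)))))))
    (. .)))
"""
for np_text in noun_phrases(SAMPLE):
    print(np_text)
```

Since whitespace is dropped by the tokenizer, this version accepts single-line and multi-line parses alike.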
