# Chapter 6: Processing English Text

Perform the following processing on the English text (nlp.txt).

In [10]:
!head -n 8 nlp.txt

Natural language processing
From Wikipedia, the free encyclopedia

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of humani-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.

History

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.


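## 50. Sentence segmentation
Treat the pattern (. or ; or : or ? or !) → whitespace character → uppercase letter as a sentence boundary, and output the input document with one sentence per line.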

In [81]:
test_doc = "Hello,World.:;  How are you ? : "  # assumed input, inferred from the output below
test_doc.replace(".","<EOS>")

'Hello,World<EOS>:;  How are you ? : '

In [8]:
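import re

def multiple_replace(text, adict):
    """
    Replace several strings in one pass.
    - Find every substring of text that matches a dictionary key and replace it with the corresponding value.
    """
    # Turn the keys into a single regex alternation, e.g. (a1|a2|a3...)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

def multiple_replace_re(text, adict):
    # Same idea, but the keys are regular expressions rather than literal strings
    rx = re.compile('|'.join(adict))

    def dedictkey(text):
        # Return the first key (regex) that matches the given text
        for key in adict.keys():
            if re.search(key, text):
                return key

    def one_xlat(match):
        return adict[dedictkey(match.group(0))]

    return rx.sub(one_xlat, text)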

In [45]:
# test of replacing multiple patterns
test = "Hello,World. How are you? I'm fine thank you, and you? This Pen is mine: But lost? \
I am sad."
adict = {r"(\. |; |: |\? |! )": "\n"}
result = multiple_replace_re(test, adict)
print(result)

Hello,World
How are you
I'm fine thank you, and you
This Pen is mine
But lost
I am sad.


In [46]:
re.compile("|".join(adict))

re.compile(r'(\. |; |: |\? |! )', re.UNICODE)

In [47]:
adict.keys()

dict_keys(['(\\. |; |: |\\? |! )'])

In [52]:
# Use a regular expression
import re

# Insert a newline (\n) between <last> and <first> wherever the pattern matches
pattern = re.compile(r"(?P<last>[.;:?!]) (?P<first>[A-Z])")
ans = ""
doc_list = []

with open("nlp.txt", "r") as f:
    for line in f:
        split_line = pattern.sub(r"\g<last>\n\g<first>", line)
        ans += split_line
        doc_list.append(split_line)

print(ans)

Natural language processing
From Wikipedia, the free encyclopedia

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of humani-computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.

History

The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would b
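
The same boundary rule can also be written with zero-width lookarounds, so that `re.split` consumes only the whitespace and returns the sentences as a list. The following is a minimal sketch, not the notebook's method: `split_sentences` and `BOUNDARY` are hypothetical names, and it assumes any single whitespace character (`\s`) counts as the separator.

```python
import re

# Boundary: one whitespace character preceded by . ; : ? or !
# and followed by an uppercase letter. Both checks are zero-width,
# so the punctuation and the capital letter stay in the sentences.
BOUNDARY = re.compile(r"(?<=[.;:?!])\s(?=[A-Z])")

def split_sentences(text):
    """Split text into sentences at the boundary pattern (hypothetical helper)."""
    return BOUNDARY.split(text)

sample = "NLP is a field. As such, it is broad: Many challenges remain? Yes."
for sentence in split_sentences(sample):
    print(sentence)
```

Because the rule requires an uppercase letter after the whitespace, a string such as "e.g. this stays" is left unsplit.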

## 51. Word extraction
Treating whitespace as the word delimiter, take the output of 50 as input and output one word per line. Output an empty line at the end of each sentence.

In [57]:
doc_list

['Natural language processing\n',
 'From Wikipedia, the free encyclopedia\n',
 '\n',
 'Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.\nAs such, NLP is related to the area of humani-computer interaction.\nMany challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.\n',
 '\n',
 'History\n',
 '\n',
 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.\nIn 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.\n',
 '\n',
 'The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.\nThe authors claimed 

In [66]:
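words = []
with open("nlp.txt", "r") as f:
    for line in f:
        # Insert a newline (\n) between <last> and <first> wherever the pattern matches
        word_list = pattern.sub(r"\g<last>\n\g<first>", line).split(' ')
        for word in word_list:
            # Note: replacing '\n' with ' ' leaves a trailing space on line-final words
            word = word.replace('\n', ' ')
            words.append(word)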

In [67]:
words[0:10]

['Natural',
 'language',
 'processing ',
 'From',
 'Wikipedia,',
 'the',
 'free',
 'encyclopedia ',
 ' ',
 'Natural']
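
In the output above, line-final words keep a trailing space (`'processing '`) and no empty line marks the end of a sentence. A minimal sketch of one way to meet the task statement follows. It assumes the input is a list with one sentence per element, and `words_per_line` is a hypothetical helper name.

```python
def words_per_line(sentences):
    """Return one word per entry, with an empty string after each sentence (hypothetical helper)."""
    lines = []
    for sentence in sentences:
        tokens = sentence.split()  # split on any run of whitespace
        if not tokens:
            continue  # skip blank lines so they do not add extra markers
        lines.extend(tokens)
        lines.append("")  # empty line marks the end of the sentence
    return lines

sentences = ["Natural language processing", "", "As such, NLP is broad."]
print("\n".join(words_per_line(sentences)))
```

With the notebook's data, the sentences could be obtained by splitting each element of `doc_list` on `'\n'`.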

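## 52. Stemming
Take the output of 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-separated format. In Python, the stemming module is a convenient implementation of Porter's stemming algorithm.

(´･ω･｀) I'm not sure whether stemming is available for Python 3, so for now I installed NLTK.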
```zsh
% pip install nltk
```

In [68]:
from nltk import stem

In [81]:
stemmer = stem.PorterStemmer()
res=[]
for word in words:
    res.append( stemmer.stem(word))
print(len(res))

1239


In [94]:
res[0:10]

['natur',
 'languag',
 'processing ',
 'from',
 'wikipedia,',
 'the',
 'free',
 'encyclopedia ',
 ' ',
 'natur']
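
The task asks for the word and its stem in tab-separated form, and the output above shows that tokens with a trailing space (`'processing '`) pass through unstemmed. The sketch below strips each token first and formats the pairs. `format_word_stem` is a hypothetical helper, and `toy_stem` is only a stand-in so that the block runs without NLTK; it is not Porter's algorithm.

```python
def format_word_stem(words, stem_fn):
    """Return 'word<TAB>stem' lines; whitespace-only tokens become empty lines (hypothetical helper)."""
    lines = []
    for word in words:
        word = word.strip()  # drop stray leading/trailing whitespace
        lines.append(f"{word}\t{stem_fn(word)}" if word else "")
    return lines

def toy_stem(word):
    # Stand-in for a real stemmer (NOT Porter's algorithm):
    # lowercase the word and drop a final "s".
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

for line in format_word_stem(["Natural", "languages ", " "], toy_stem):
    print(line)
```

In the notebook, passing `stemmer.stem` as `stem_fn` would apply the NLTK Porter stemmer created above.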

## 53. Tokenization
Use Stanford Core NLP to obtain the analysis of the input text in XML format. Then read this XML file and output the input text with one word per line.

```zsh
% java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file nlp.txt
```

In [102]:
!head -n 30 nlp.txt.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Natural</word>
            <lemma>natural</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>7</CharacterOffsetEnd>
            <POS>JJ</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
          <token id="2">
            <word>language</word>
            <lemma>language</lemma>
            <CharacterOffsetBegin>8</CharacterOffsetBegin>
            <CharacterOffsetEnd>16</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
          <token id="3">
            <word>processing</word>
            <lemma>processing</lemma>
            <CharacterOffsetBegin>17</

(´･ω･｀) < Output the input text with one word (<word>) per line!!

In [97]:
pattern_word = re.compile(r"<word>(?P<word>.+)</word>")

res = []
with open("nlp.txt.xml") as f:
    for line in f:
        # Each <word> element sits on its own line in the CoreNLP output
        m = pattern_word.search(line)
        if m:
            res.append(m.group("word"))

In [98]:
res[0:10]

['Natural',
 'language',
 'processing',
 'From',
 'Wikipedia',
 ',',
 'the',
 'free',
 'encyclopedia',
 'Natural']
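
Matching tags line by line works because each `<word>` element sits on its own line. An XML parser does not depend on that layout. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`; the inline `SAMPLE_XML` is a shortened, hand-written stand-in for nlp.txt.xml, and `extract_words` is a hypothetical helper that assumes `<word>` appears only inside `<token>` elements.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample shaped like the head of nlp.txt.xml (illustration only).
SAMPLE_XML = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Natural</word><lemma>natural</lemma><POS>JJ</POS><NER>O</NER></token>
<token id="2"><word>language</word><lemma>language</lemma><POS>NN</POS><NER>O</NER></token>
</tokens></sentence></sentences></document></root>"""

def extract_words(root):
    """Return the text of every <word> element in document order (hypothetical helper)."""
    return [elem.text for elem in root.iter("word")]

root = ET.fromstring(SAMPLE_XML)
for word in extract_words(root):
    print(word)
```

For the real file, `ET.parse("nlp.txt.xml").getroot()` would supply the root element instead of `ET.fromstring`.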

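## 54. Part-of-speech tagging
Read the analysis result XML from Stanford Core NLP and output each word, lemma, and part of speech in tab-separated format.

(´･ω･｀) > Looking at the structure, it seems enough to grab each <token id> and pull out the <word>, <lemma>, and <POS> inside it?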

In [115]:
# patterns
pattern_word = re.compile(r"<word>(?P<word>.+)</word>")
pattern_lemma = re.compile(r"<lemma>(?P<lemma>.+)</lemma>")
pattern_pos = re.compile(r"<POS>(?P<POS>.+)</POS>")

s = []  # fields collected for the current token: word, lemma, POS
w = []  # one tab-separated "word<TAB>lemma<TAB>POS" string per token
with open("nlp.txt.xml") as f:
    for line in f:
        # The fields are expected in the order word -> lemma -> POS
        if len(s) == 0 and pattern_word.search(line):
            s.append(pattern_word.search(line).group("word"))
        elif len(s) == 1 and pattern_lemma.search(line):
            s.append(pattern_lemma.search(line).group("lemma"))
        elif len(s) == 2 and pattern_pos.search(line):
            s.append(pattern_pos.search(line).group("POS"))
            print("\t".join(s))
            w.append("\t".join(s))
            s = []

Natural	natural	JJ
language	language	NN
processing	processing	NN
From	from	IN
Wikipedia	Wikipedia	NNP
,	,	,
the	the	DT
free	free	JJ
encyclopedia	encyclopedia	NN
Natural	natural	JJ
language	language	NN
processing	processing	NN
-LRB-	-lrb-	-LRB-
NLP	nlp	NN
-RRB-	-rrb-	-RRB-
is	be	VBZ
a	a	DT
field	field	NN
of	of	IN
computer	computer	NN
science	science	NN
,	,	,
artificial	artificial	JJ
intelligence	intelligence	NN
,	,	,
and	and	CC
linguistics	linguistics	NNS
concerned	concern	VBN
with	with	IN
the	the	DT
interactions	interaction	NNS
between	between	IN
computers	computer	NNS
and	and	CC
human	human	JJ
-LRB-	-lrb-	-LRB-
natural	natural	JJ
-RRB-	-rrb-	-RRB-
languages	language	NNS
.	.	.
As	as	IN
such	such	JJ
,	,	,
NLP	nlp	NN
is	be	VBZ
related	relate	VBN
to	to	TO
the	the	DT
area	area	NN
of	of	IN
humani-computer	humani-computer	JJ
interaction	interaction	NN
.	.	.
Many	many	JJ
challenges	challenge	NNS
in	in	IN
NLP	nlp	NN
involve	involve	VBP
natural	natural	JJ
language	language	NN
understanding	unde

ValueError: too many values to unpack (expected 2)
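
Reading the three fields per `<token>` element avoids relying on the order in which the lines appear. The following is a minimal sketch with `xml.etree.ElementTree`; `SAMPLE_XML` is a hand-written stand-in for nlp.txt.xml and `token_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample shaped like nlp.txt.xml (illustration only).
SAMPLE_XML = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Natural</word><lemma>natural</lemma><POS>JJ</POS></token>
<token id="2"><word>is</word><lemma>be</lemma><POS>VBZ</POS></token>
</tokens></sentence></sentences></document></root>"""

def token_rows(root):
    """Return one 'word<TAB>lemma<TAB>POS' string per <token> (hypothetical helper)."""
    rows = []
    for token in root.iter("token"):
        # findtext returns the default when a child element is missing
        fields = [token.findtext(tag, default="") for tag in ("word", "lemma", "POS")]
        rows.append("\t".join(fields))
    return rows

for row in token_rows(ET.fromstring(SAMPLE_XML)):
    print(row)
```

A token that lacks one of the fields yields an empty column rather than shifting the following values.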

## 55. Named entity extraction
Extract all person names from the input text.

In [128]:
w[0]

'Natural\tnatural\tJJ'
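
One possible approach is to read the `<NER>` element of each token and join consecutive person tokens into a single name. The following is a minimal sketch under stated assumptions: it assumes person tokens carry the label `PERSON`, the NER values in `SAMPLE_XML` are hand-written for illustration rather than taken from the real output, and `person_names` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the NER labels here are illustrative assumptions.
SAMPLE_XML = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>In</word><NER>O</NER></token>
<token id="2"><word>1950</word><NER>DATE</NER></token>
<token id="3"><word>,</word><NER>O</NER></token>
<token id="4"><word>Alan</word><NER>PERSON</NER></token>
<token id="5"><word>Turing</word><NER>PERSON</NER></token>
<token id="6"><word>published</word><NER>O</NER></token>
</tokens></sentence></sentences></document></root>"""

def person_names(root):
    """Collect runs of consecutive PERSON tokens as names (hypothetical helper)."""
    names = []
    for sentence in root.iter("sentence"):
        current = []  # words of the name being built
        for token in sentence.iter("token"):
            if token.findtext("NER") == "PERSON":
                current.append(token.findtext("word"))
            elif current:
                names.append(" ".join(current))
                current = []
        if current:  # a name that ends the sentence
            names.append(" ".join(current))
    return names

print(person_names(ET.fromstring(SAMPLE_XML)))
```

Resetting `current` at each sentence keeps a name at the end of one sentence from merging with a name at the start of the next.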