Stanford Core NLPを用い，入力テキストの解析結果をXML形式で得よ．また，このXMLファイルを読み込み，入力テキストを1行1単語の形式で出力せよ．

In [1]:
import re
from nltk import stem

In [2]:
fname = './nlp.txt'
stemmer = stem.PorterStemmer()

In [3]:
def nlp_lines():
    '''nlp.txtを1文ずつ読み込むジェネレータ
    nlp.txtを順次読み込んで1文ずつ返す

    戻り値：
    1文の文字列
    '''
    with open(fname) as lines:

        # 文切り出しの正規表現コンパイル
        pattern = re.compile(r'''
            (
                ^                   # 行頭
                .*?                 # 任意のn文字、最少マッチ
                [\.|;|:|\?|!]       # . or ; or : or ? or !
            )
            \s                      # 空白文字
            (
                [A-Z].*             # 英大文字以降（＝次の文以降)

            )
        ''', re.MULTILINE + re.VERBOSE + re.DOTALL)
        for line in lines:

            line = line.strip()     # 前後の空白文字除去
            while len(line) > 0:

                # 行から1文を取得
                match = pattern.match(line)
                if match:

                    # 切り出した文を返す
                    yield match.group(1)        # 先頭の文
                    line = match.group(2)       # 次の文以降

                else:
                    
                    # 区切りがないので、最後までが1文(表題などピリオド等で終わらない行)
                    yield line
                    line = ''

In [4]:
def nlp_words():
    '''nlp.txtを1単語ずつ返すジェネレータ
    文の終わりでは空文字を返す。

    戻り値：
    1単語、ただし文の終わりでは空文字を返す
    '''
    for line in nlp_lines():

        # 単語に分解、終端の区切り文字は除去して返す(スペース連続を考慮してオリジナルから変更)
        # for word in line.split(' '):
        for word in re.split(" +", line):
            yield word.rstrip('.,;:?!')

        # 文の終わりは空文字
        yield ''

In [5]:
# 読み込み
for word in nlp_words():
    
    # 単語とステミングされた部分をタブ区切り出力
    print('{}\t{}'.format(word, stemmer.stem(word)))

Natural	natur
language	languag
processing	process
	
From	from
Wikipedia	wikipedia
the	the
free	free
encyclopedia	encyclopedia
	
Natural	natur
language	languag
processing	process
(NLP)	(nlp)
is	is
a	a
field	field
of	of
computer	comput
science	scienc
artificial	artifici
intelligence	intellig
and	and
linguistics	linguist
concerned	concern
with	with
the	the
interactions	interact
between	between
computers	comput
and	and
human	human
(natural)	(natural)
languages	languag
	
As	As
such	such
NLP	nlp
is	is
related	relat
to	to
the	the
area	area
of	of
humani-computer	humani-comput
interaction	interact
	
Many	mani
challenges	challeng
in	in
NLP	nlp
involve	involv
natural	natur
language	languag
understanding	understand
that	that
is	is
enabling	enabl
computers	comput
to	to
derive	deriv
meaning	mean
from	from
human	human
or	or
natural	natur
language	languag
input	input
and	and
others	other
involve	involv
natural	natur
language	languag
generation	gener
	
History	histori
	
The	the
history	histori
of	of
NL

algorithms	algorithm
-	-
often	often
although	although
not	not
always	alway
grounded	ground
in	in
statistical	statist
inference	infer
-	-
to	to
automatically	automat
learn	learn
such	such
rules	rule
through	through
the	the
analysis	analysi
of	of
large	larg
corpora	corpora
of	of
typical	typic
real-world	real-world
examples	exampl
	
A	A
corpus	corpu
(plural	(plural
"corpora")	"corpora")
is	is
a	a
set	set
of	of
documents	document
(or	(or
sometimes	sometim
individual	individu
sentences)	sentences)
that	that
have	have
been	been
hand-annotated	hand-annot
with	with
the	the
correct	correct
values	valu
to	to
be	be
learned	learn
	
Many	mani
different	differ
classes	class
of	of
machine	machin
learning	learn
algorithms	algorithm
have	have
been	been
applied	appli
to	to
NLP	nlp
tasks	task
	
These	these
algorithms	algorithm
take	take
as	as
input	input
a	a
large	larg
set	set
of	of
"features"	"features"
that	that
are	are
generated	gener
from	from
the	the
input	input
data	data
	
Some	some
of	of
the	the
