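Design your own features that are likely to be useful for polarity analysis, and extract them from the training data. A minimal baseline is to remove stop words from the reviews and stem each remaining word.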
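
As a rough illustration of the baseline described above, the following sketch uses only the standard library. The stop-word list `TOY_STOP_WORDS`, the suffix rules in `TOY_SUFFIXES`, and the helper names `toy_stem` and `extract_features` are all hypothetical and invented for this example; the suffix stripper is a toy and is not the Porter stemmer or any NLTK API.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
TOY_STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "it", "this"}

# Toy suffix rules; a real stemmer (e.g. Porter) applies many more rules.
TOY_SUFFIXES = ("ing", "ed", "ly", "s")


def toy_stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_features(review):
    """Return a bag-of-stems Counter for one review string."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return Counter(toy_stem(t) for t in tokens if t not in TOY_STOP_WORDS)


features = extract_features("The acting was amazing and the plot moved quickly")
print(features)
```

A bag of stems like this could be fed to a classifier as sparse count features; in practice a proper stemmer or lemmatizer would likely replace `toy_stem`.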

In [1]:
import warnings
from collections import Counter
import csv

from nltk.corpus import stopwords
import stanfordnlp

In [2]:
# Define as a set so that membership tests are fast
STOP_WORDS = set(stopwords.words('english'))

In [3]:
# The tags appear to follow the Universal POS tag set
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT',   # punctuation
           'X',       # other
           'SYM',     # symbol
           'PART',    # particle (e.g. 's)
           'NUM'}     # numeral

In [4]:
# Running all of the default processors was slow, so restrict them to the minimum needed
# https://stanfordnlp.github.io/stanfordnlp/processors.html
nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma')

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/home/i348221/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/home/i348221/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/home/i348221/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/home/i348221/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
Done loading processors!
---


In [5]:
# Return True if the word should be treated as a stop word:
# its lemma is in STOP_WORDS or its POS tag is in EXC_POS
def is_stopword(word):
    return word.lemma in STOP_WORDS or word.upos in EXC_POS
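
The filter above can be tried without loading any model. In this minimal sketch, `Word` is a hypothetical stand-in exposing only the `lemma` and `upos` attributes, and `stop_words` and `excluded_pos` are small illustrative subsets rather than the full lists used in the notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed word; only the two attributes used here.
Word = namedtuple("Word", ["lemma", "upos"])

# Small illustrative subsets, not the full lists used in the notebook.
stop_words = {"the", "be"}
excluded_pos = {"PUNCT", "NUM"}


def is_filtered(word):
    """True if the lemma is a stop word or the POS tag is excluded."""
    return word.lemma in stop_words or word.upos in excluded_pos


words = [
    Word("the", "DET"),
    Word("movie", "NOUN"),
    Word("be", "AUX"),
    Word("great", "ADJ"),
    Word("!", "PUNCT"),
]
kept = [w.lemma for w in words if not is_filtered(w)]
print(kept)
```

Only the content words survive, which is the behavior the lemma list relies on.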

In [6]:
# Suppress warnings
warnings.simplefilter('ignore', UserWarning)

In [7]:
lemma = []

In [8]:
%%time
with open('./sentiment.txt') as file:
    for i, line in enumerate(file):
        print("\r{0}".format(i), end="")
        
        # The first 3 characters only mark the polarity label, so skip them to save a little NLP processing time
        doc = nlp(line[3:])
        for sentence in doc.sentences:
            lemma.extend(word.lemma for word in sentence.words if not is_stopword(word))

10661
CPU times: user 1h 49min 54s, sys: 2min 57s, total: 1h 52min 52s
Wall time: 45min 21s
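
The loop above discards the 3-character label prefix. If the label is needed later, for instance to pair each review's features with its polarity, it could be kept with a small helper like the one below. The name `split_label` and the sample lines are hypothetical, and the `+1`/`-1` label format is an assumption about the input file.

```python
def split_label(line):
    """Split a line into (label, text), assuming a 3-character label prefix."""
    label = line[:3].strip()
    text = line[3:].strip()
    return label, text


# Hypothetical sample lines; the "+1"/"-1" label format is an assumption.
samples = ["+1 a gripping and heartfelt film\n", "-1 dull and overlong\n"]
pairs = [split_label(s) for s in samples]
print(pairs)
```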


In [9]:
freq_lemma = Counter(lemma)

In [10]:
# newline='' lets the csv module control line endings itself
with open('./lemma_all.txt', 'w', newline='') as f_out:
    writer = csv.writer(f_out, delimiter='\t')
    writer.writerow(['Char', 'Freq'])
    for key, value in freq_lemma.items():
        writer.writerow([key, value])
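
To check the saved table, it can be read back and the most frequent lemmas inspected. This is a minimal sketch that writes to an in-memory buffer instead of `lemma_all.txt` so that it is self-contained; the counts in `freq` are made up for illustration and are not results from the data.

```python
import csv
import io
from collections import Counter

# Hypothetical counts standing in for freq_lemma.
freq = Counter({"film": 3, "good": 2, "plot": 1})

# Write to an in-memory buffer instead of a file so the sketch is self-contained.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Char", "Freq"])
for key, value in freq.most_common():
    writer.writerow([key, value])

# Read the table back and rebuild the counter.
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t")
restored = Counter({row["Char"]: int(row["Freq"]) for row in reader})
top_two = restored.most_common(2)
print(top_two)
```

Writing rows with `most_common()` instead of `items()` also sorts the file by descending frequency, which may make it easier to scan by eye.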