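#### Preparing the dataset  
Download the data for each category  
Split the text into train, test, and other  
Save the result per category  
Write a function that takes a category name and returns downsampled train_data and test_data  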
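
A minimal sketch of such a loader, assuming the per-category splits are already in memory as lists of review texts. The function name `get_category_data`, the `splits` layout, and the sample sizes are hypothetical and do not come from the notebook.

```python
import random

def get_category_data(splits, category, n_train, n_test, seed=1234):
    """Return downsampled (train_data, test_data) for one category.

    splits: {category: {"train": [text, ...], "test": [text, ...]}}
            (hypothetical in-memory layout, for illustration only)
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    train = splits[category]["train"]
    test = splits[category]["test"]
    # Downsample without replacement; never ask for more rows than exist.
    train_data = rng.sample(train, min(n_train, len(train)))
    test_data = rng.sample(test, min(n_test, len(test)))
    return train_data, test_data

# Made-up reviews, only to exercise the function.
splits = {
    "vg": {
        "train": ["great game", "fun to play", "too short", "nice graphics"],
        "test": ["awesome game", "boring"],
    }
}
train_data, test_data = get_category_data(splits, "vg", n_train=3, n_test=5)
```

Capping the request at the available size keeps the call from failing on small categories.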

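#### Storing the full vocabulary  
Build a Vocab that holds the vocabulary of the data from all categories  
Vocab handles word ⇄ index conversion and keeps track of the vocabulary size  
Online learning (handling unknown words) is not yet well understood  
Save the populated vocab  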
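
One way the save step could look, sketched with `pickle`. The notebook does not show how the vocab is persisted, so the file name and the dict-based stand-in for `Vocab` below are assumptions.

```python
import os
import pickle
import tempfile

# Stand-in for the notebook's Vocab: the plain dicts it ends up holding.
vocab_state = {
    "word2index": {"this": 0, "game": 1},
    "index2word": {0: "this", 1: "game"},
    "word2count": {"this": 2, "game": 1},
    "n_words": 2,
}

# Hypothetical location; a temporary directory keeps the sketch side-effect free.
path = os.path.join(tempfile.mkdtemp(), "vocab.pkl")

with open(path, "wb") as f:
    pickle.dump(vocab_state, f)

with open(path, "rb") as f:
    restored = pickle.load(f)
```

`pickle` keeps the integer keys of `index2word` intact, whereas JSON would turn them into strings.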

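#### Building the DataIterator  
Define a class that is instantiated from train_data and test_data and produces batches automatically when iterated in a for loop  
Every word in a batch is converted to its index, and each batch is zero-padded to the length of its longest sentence  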
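
The batching described above can be sketched as follows. The class name `PaddedBatchIterator` and its arguments are hypothetical, and the sketch yields plain lists rather than tensors. Sentences are sorted longest first inside each batch, which `pack_padded_sequence` requires in PyTorch 0.4.

```python
class PaddedBatchIterator:
    """Yield (padded_indices, lengths, targets) batches from raw sentences."""

    def __init__(self, sentences, labels, word2index, batch_size, pad_index=0):
        # Convert every word to its index once, up front.
        self.data = [([word2index[w] for w in s.split()], y)
                     for s, y in zip(sentences, labels)]
        self.batch_size = batch_size
        self.pad_index = pad_index

    def __iter__(self):
        for start in range(0, len(self.data), self.batch_size):
            batch = self.data[start:start + self.batch_size]
            # Longest sentence first within the batch.
            batch = sorted(batch, key=lambda pair: len(pair[0]), reverse=True)
            lengths = [len(ids) for ids, _ in batch]
            max_len = lengths[0]
            # Pad every sentence up to the longest one in this batch.
            padded = [ids + [self.pad_index] * (max_len - len(ids))
                      for ids, _ in batch]
            targets = [y for _, y in batch]
            yield padded, lengths, targets

# Made-up vocabulary; indices start at 1 here so 0 is free for padding.
word2index = {"this": 1, "game": 2, "is": 3, "fun": 4}
sentences = ["this game", "this game is fun", "fun"]
labels = [1, 1, 0]
batches = list(PaddedBatchIterator(sentences, labels, word2index, batch_size=2))
```

In the notebook the lists would then be wrapped in `torch.LongTensor` before being passed to the model.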

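#### Building the model  
Input: a sentence  
Output: scalar (membership probability) → 2-element vector (leaves room to extend to multi-class classification; review how to do that)  
→ Binary classification was already too accurate even without attention, so switch to multi-class classification to make the problem harder  
Build a model with this input and output  
Note: the target can be passed directly as the integer label ({0,1}); this is how PyTorch's loss functions work  
Build two variants: a plain LSTM and an LSTM with self-attention  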
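
To make the note about targets concrete, here is a small pure-Python sketch of what a negative log-likelihood loss does with an integer label: it picks out the log-probability at that index. It assumes the loss is paired with a `log_softmax` output; the notebook does not name the loss it uses.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw class scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def nll_loss(log_probs, target):
    """Negative log-likelihood for one sample; target is a class index."""
    return -log_probs[target]

log_probs = log_softmax([2.0, 0.5])   # made-up scores for a 2-class model
loss_if_class0 = nll_loss(log_probs, target=0)
loss_if_class1 = nll_loss(log_probs, target=1)
```

Because the label is only used as an index, no one-hot encoding is needed, and the same code works unchanged for more than two classes.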

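#### Defining the training functions and running training  
Set up the loss function, the training function, and the parameters  
Define a function that bundles the processing done within a single iteration  
Split the data into 5 folds with KFold and use one fold for validation (so each iteration trains 5 times)  
Stratify the split by label value so that the labels stay balanced  
The valid_loss of an iteration is the mean of the values over the folds  
Set up monitoring of the learning_curve  
early stopping  
Training  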
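
The stratified split, the averaged valid_loss, and early stopping can be sketched in plain Python as below. The helper names, the `patience` value, and all loss numbers are made up for illustration; the notebook may well rely on library utilities instead.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=1234):
    """Assign sample indices to folds so each fold keeps the label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal each label's samples round-robin across the folds.
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)
    return folds

class EarlyStopping:
    """Signal a stop once valid_loss has not improved for `patience` steps."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, valid_loss):
        if valid_loss < self.best:
            self.best = valid_loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience

labels = [0] * 10 + [1] * 10
folds = stratified_folds(labels)

# One iteration: average the validation losses of the folds (made-up values).
fold_losses = [0.50, 0.40, 0.60, 0.50, 0.50]
mean_valid_loss = sum(fold_losses) / len(fold_losses)

# Feed a made-up learning curve to the stopper.
stopper = EarlyStopping(patience=2)
stopped_at = None
for iteration, valid_loss in enumerate([0.9, 0.7, 0.72, 0.71, 0.6]):
    if stopper.step(valid_loss):
        stopped_at = iteration
        break
```

With `patience=2` the loop stops at the second consecutive iteration that fails to beat the best loss so far, so the later improvement to 0.6 is never seen; a larger patience trades training time for robustness against such plateaus.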

#### Saving the trained model  
Save the model and its parameters  

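#### Evaluating model performance  
Compute and show the confusion matrix and F1-score on test_data  
For the per-category classifiers, show one that works well and one that is mediocre  
Inspect sentences that were classified correctly and incorrectly, together with their self-attention weights  
Summarize how much the F1-score differs between the models with and without self-attention  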
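
As a reference for the metrics above, a pure-Python sketch of a confusion matrix and per-class F1. The labels and predictions are made up, and in practice a library such as scikit-learn would likely be used instead.

```python
def confusion_matrix(y_true, y_pred, n_class):
    """matrix[t][p] counts samples with true label t predicted as p."""
    matrix = [[0] * n_class for _ in range(n_class)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def f1_per_class(matrix):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    scores = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fp = sum(row[c] for row in matrix) - tp   # predicted c, but wrong
        fn = sum(matrix[c]) - tp                  # true c, but missed
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

y_true = [0, 0, 1, 1, 2, 2]   # made-up labels
y_pred = [0, 1, 1, 1, 2, 0]   # made-up predictions
cm = confusion_matrix(y_true, y_pred, n_class=3)
f1 = f1_per_class(cm)
macro_f1 = sum(f1) / len(f1)  # unweighted mean over classes
```

Per-class scores make it easy to spot which category classifier is the good one and which is the mediocre one.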

#### Conclusion  
Summarize the findings

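It would probably be better to split this notebook  
This notebook builds and trains a model that predicts whether a review belongs to the Video Game category  
Build the training data for a single category and run predictions  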

Comparing all categories looks like a lot of work  
For now, build a classifier for one category and extend it to all categories if time allows

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm_notebook as tqdm
import re
from collections import defaultdict
import glob

np.random.seed(1234)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
sns.set_style('darkgrid')

%matplotlib inline



In [2]:
import torch
import torch.nn as nn
from torch import optim
from torch.nn import functional as F

In [3]:
torch.manual_seed(1234)

<torch._C.Generator at 0x7f4888709590>

In [4]:
device = torch.device( "cuda" if torch.cuda.is_available() else "cpu")

In [5]:
print(torch.__version__)

0.4.0


In [6]:
class Vocab:
    def __init__(self):
        self.word2index = defaultdict(int)
        self.word2count = defaultdict(int)
        self.index2word = defaultdict(str)
        self.n_words = 0
    def add_sentence(self, sentence):
        for word in sentence.split(" "):
            self.add_word(word)
    
    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1
        # word2count is a defaultdict(int), so the first occurrence is counted too
        self.word2count[word] += 1

In [7]:
%%time
vocab = Vocab()

for path in glob.glob('../preprocessed/*.csv'):
    series = pd.read_csv(path, header=None, dtype={0: str}, encoding='utf-8').dropna(axis=0)[0]
    for sentence in series:
        vocab.add_sentence(sentence)

# A defaultdict creates a default value whenever an unknown key is looked up.
# To prevent bugs later on, convert to plain dicts to lock the contents.
vocab.word2index = dict(vocab.word2index)
vocab.index2word = dict(vocab.index2word)
vocab.word2count = dict(vocab.word2count)

CPU times: user 1min 3s, sys: 0 ns, total: 1min 3s
Wall time: 1min 3s


In [8]:
vg_test = pd.read_csv('../preprocessed/vg_test.csv', header=None, encoding='utf-8')
hk_test = pd.read_csv('../preprocessed/hk_test.csv', header=None, encoding='utf-8')
so_test = pd.read_csv('../preprocessed/so_test.csv', header=None, encoding='utf-8')
csj_test = pd.read_csv('../preprocessed/csj_test.csv', header=None, encoding='utf-8')
hpc_test = pd.read_csv('../preprocessed/hpc_test.csv', header=None, encoding='utf-8')
aa_test = pd.read_csv('../preprocessed/aa_test.csv', header=None, encoding='utf-8')

# Building the model

In [9]:
class LSTMClassifer(nn.Module):
    def __init__(self, emb_dim, h_dim, v_size, n_class=2, bidirectional=True,
                 batch_first=True):
        super(LSTMClassifer, self).__init__()
        self.h_dim = h_dim
        self.bi = 2 if bidirectional else 1
        self.emb = nn.Embedding(v_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, h_dim, batch_first=batch_first, 
                            bidirectional = bidirectional)
        
        self.attn = nn.Sequential(
            nn.Linear(h_dim, 24),
            nn.ReLU(True),
            nn.Linear(24, 1)
        )
        
        self.affine = nn.Linear(self.h_dim, n_class)
        
    def init_hidden(self, b_size):
        h0 = torch.zeros(self.bi, b_size, self.h_dim, device=device)
        return (h0, h0) # an LSTM needs two initial states: hidden and cell
    
    def forward(self, sentences, lengths):
        batch_len = sentences.shape[0]
        hidden, cell = self.init_hidden(batch_len)
        embed = self.emb(sentences)
        packed_input = nn.utils.rnn.pack_padded_sequence(embed, lengths, batch_first=True)
        output, hidden = self.lstm(packed_input, (hidden, cell))
        output = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)[0] # (b, s, h)
        if self.bi == 2:
            # sum the forward and backward hidden states
            output = output[:, :, :self.h_dim] + output[:, :, self.h_dim:]
        
        # Attention
        attn = self.attn(output.view(-1, self.h_dim)) # (b,s,h) → (b*s,h) → (b*s,1)
        attn = F.softmax(attn.view(batch_len, -1), dim=1).unsqueeze(2) # (b*s,1) → (b,s) → (b,s,1)
        
        output = (output * attn).sum(dim=1) # (b,s,h) → (b,h)
        output = self.affine(output) # (b,h) → (b,c)
        output = F.log_softmax(output, dim=1) # (b,c), log-probability of each class for each sample
        return output, attn
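
The attention step in `forward` reduces to "softmax the per-word scores, then take the weighted sum of the hidden states". A tiny pure-Python walk-through for a single sentence, with made-up numbers (`s=3` words, `h=2` hidden units) and a hypothetical helper name:

```python
import math

def softmax(xs):
    m = max(xs)                           # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, scores):
    """Weighted sum of per-word hidden states: (s, h) -> (h,)."""
    weights = softmax(scores)             # one weight per word, sums to 1
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up LSTM outputs
scores = [0.0, 0.0, math.log(2.0)]                    # made-up attn scores
pooled, weights = attention_pool(hidden_states, scores)
```

The third word has twice the unnormalized weight of the others, so it receives weight 0.5 and the first two receive 0.25 each. These per-word weights are what the visualization later in the notebook colors.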

# Prediction function

In [10]:
def predict_model(review):
    review_idxes = [vocab.word2index[w] for w in str(review).split()]
    review_tensor = torch.LongTensor(review_idxes).to(device).unsqueeze(0)
    length_tensor = torch.LongTensor([len(review_idxes)]).to(device)
    model.eval()
    with torch.no_grad():
        out, attn = model(review_tensor, length_tensor)
    
    return out.max(dim=1)[1].item(), attn

# Loading the trained model weights

In [11]:
model = LSTMClassifer(100, 32, vocab.n_words, n_class=6).to(device)

In [21]:
model.load_state_dict(torch.load("../output/"))
model.eval()

LSTMClassifer(
  (emb): Embedding(210819, 100)
  (lstm): LSTM(100, 32, batch_first=True, bidirectional=True)
  (attn): Sequential(
    (0): Linear(in_features=32, out_features=24, bias=True)
    (1): ReLU(inplace)
    (2): Linear(in_features=24, out_features=1, bias=True)
  )
  (affine): Linear(in_features=32, out_features=6, bias=True)
)

# Visualizing the attended words with HTML

https://github.com/nn116003/self-attention-classification/blob/master/view_attn.py  
The code below is based on this script

In [59]:
def highlight(word, attn):
    html_color = '#%02X%02X%02X' % (255, int(255*(1 - attn)), int(255*(1 - attn)))
    return '<span style="background-color: {}">{}</span>'.format(html_color, word)

In [60]:
def mk_html(sentence, attns):
    html = ""
    for word, attn in zip(sentence, attns):
        html += ' ' + highlight(word, attn)
    return html + "<br><br>\n"
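
A slightly hardened variant of the highlighter, as a sketch: it clamps the weight to [0, 1] and HTML-escapes the word. The name `highlight_safe` is hypothetical and this is not the referenced script's code. The color mapping is the same as in `highlight`: weight 0 gives white and weight 1 gives pure red.

```python
import html

def highlight_safe(word, attn):
    """Wrap a word in a span whose red intensity grows with its attention."""
    attn = min(max(float(attn), 0.0), 1.0)   # clamp to the valid range
    fade = int(255 * (1 - attn))             # green/blue channels shrink
    color = '#%02X%02X%02X' % (255, fade, fade)
    return '<span style="background-color: {}">{}</span>'.format(
        color, html.escape(word))

no_attention = highlight_safe("game", 0.0)
full_attention = highlight_safe("game", 1.0)
half_attention = highlight_safe("game", 0.5)
escaped = highlight_safe("<b>", 0.0)
```

Escaping matters when reviews contain characters such as `<` or `&`, which would otherwise be interpreted as markup when the string is rendered, for example with `IPython.display.HTML`.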

In [76]:
s = csj_test.iloc[0,0]

In [77]:
s

'this is far more beautiful than this picture shows and i have received so many compliments on it when i have worn it i would highly recommend this to others'

In [80]:
label

3

In [78]:
wl = [str(w) for w in s.split()]
label, attn = predict_model(s)
attn = attn.cpu().numpy().flatten()

In [79]:
mk_html(wl, attn)

' <span style="background-color: #FFEAEA">this</span> <span style="background-color: #FFFCFC">is</span> <span style="background-color: #FFF6F6">far</span> <span style="background-color: #FFE8E8">more</span> <span style="background-color: #FFFEFE">beautiful</span> <span style="background-color: #FFFEFE">than</span> <span style="background-color: #FFECEC">this</span> <span style="background-color: #FFFDFD">picture</span> <span style="background-color: #FFF9F9">shows</span> <span style="background-color: #FFFEFE">and</span> <span style="background-color: #FFF9F9">i</span> <span style="background-color: #FFFEFE">have</span> <span style="background-color: #FFFCFC">received</span> <span style="background-color: #FFFEFE">so</span> <span style="background-color: #FFF1F1">many</span> <span style="background-color: #FFF3F3">compliments</span> <span style="background-color: #FFEBEB">on</span> <span style="background-color: #FFFDFD">it</span> <span style="background-color: #FFFCFC">when</span> <sp

<span style="background-color: #FFFDFD">this</span> <span style="background-color: #FFBBBB">game</span> <span style="background-color: #FFFEFE">was</span> <span style="background-color: #FFFEFE">incredibly</span> <span style="background-color: #FFF4F4">fun</span> <span style="background-color: #FFFEFE">to</span> <span style="background-color: #FFEAEA">play</span> <span style="background-color: #FFFCFC">the</span> <span style="background-color: #FFF3F3">graphics</span> <span style="background-color: #FFFEFE">are</span> <span style="background-color: #FFFCFC">incredible</span> <span style="background-color: #FFFEFE">and</span> <span style="background-color: #FFFAFA">the</span> <span style="background-color: #FFA3A3">game</span> <span style="background-color: #FFFEFE">has</span> <span style="background-color: #FFFEFE">twist</span> <span style="background-color: #FFFEFE">and</span> <span style="background-color: #FFFCFC">turns</span> <span style="background-color: #FFFEFE">whereever</span> <span style="background-color: #FFFEFE">you</span> <span style="background-color: #FFFEFE">go</span> <span style="background-color: #FFFEFE">i</span> <span style="background-color: #FFFEFE">had</span> <span style="background-color: #FFFEFE">alot</span> <span style="background-color: #FFFEFE">of</span> <span style="background-color: #FFFCFC">fun</span> <span style="background-color: #FFFEFE">sawing</span> <span style="background-color: #FFFEFE">the</span> <span style="background-color: #FFFEFE">monsters</span> <span style="background-color: #FFFEFE">in</span> <span style="background-color: #FFFEFE">half</span> <span style="background-color: #FFFEFE">its</span> <span style="background-color: #FFFDFD">fun</span> <span style="background-color: #FFFEFE">because</span> <span style="background-color: #FFFEFE">you</span> <span style="background-color: #FFFEFE">cant</span> <span style="background-color: #FFFEFE">just</span> <span style="background-color: #FFFEFE">shoot</span> <span 
style="background-color: #FFFEFE">them</span> <span style="background-color: #FFFEFE">you</span> <span style="background-color: #FFFEFE">have</span> <span style="background-color: #FFFEFE">to</span> <span style="background-color: #FFFEFE">saw</span> <span style="background-color: #FFFEFE">off</span> <span style="background-color: #FFFEFE">their</span> <span style="background-color: #FFFEFE">limbs</span> <span style="background-color: #FFFEFE">awesome</span> <span style="background-color: #FFDCDC">game</span><br><br>

<span style="background-color: #FFFEFE">this</span> <span style="background-color: #FFFEFE">has</span> <span style="background-color: #FFFEFE">come</span> <span style="background-color: #FFFEFE">in</span> <span style="background-color: #FFFEFE">handy</span> <span style="background-color: #FFFEFE">on</span> <span style="background-color: #FFFEFE">more</span> <span style="background-color: #FFFEFE">than</span> <span style="background-color: #FFFEFE">one</span> <span style="background-color: #FFFEFE">occasion</span> <span style="background-color: #FFFDFD">this</span> <span style="background-color: #FFFEFE">is</span> <span style="background-color: #FFFEFE">a</span> <span style="background-color: #FFFDFD">great</span> <span style="background-color: #FF0505">app</span> <span style="background-color: #FFFEFE">i</span> <span style="background-color: #FFFEFE">would</span> <span style="background-color: #FFFEFE">recommend</span> <span style="background-color: #FFFEFE">this</span> <span style="background-color: #FFFEFE">to</span> <span style="background-color: #FFFEFE">anyone</span><br><br>

<span style="background-color: #FFFEFE">the</span> <span style="background-color: #FFF4F4">kettle</span> <span style="background-color: #FFFEFE">i</span> <span style="background-color: #FFFEFE">received</span> <span style="background-color: #FFF1F1">has</span> <span style="background-color: #FFFEFE">smaller</span> <span style="background-color: #FFEAEA">printing</span> <span style="background-color: #FFFEFE">but</span> <span style="background-color: #FFFEFE">it</span> <span style="background-color: #FFFEFE">seems</span> <span style="background-color: #FFFEFE">to</span> <span style="background-color: #FFFEFE">say</span> <span style="background-color: #FFFEFE">the</span> <span style="background-color: #FFFEFE">same</span> <span style="background-color: #FFFEFE">thing</span> <span style="background-color: #FFFEFE">made</span> <span style="background-color: #FFFEFE">in</span> <span style="background-color: #FFFDFD">germany</span> <span style="background-color: #FFFEFE">some</span> <span style="background-color: #FFFEFE">instructions</span> <span style="background-color: #FFFDFD">looks</span> <span style="background-color: #FFFEFE">delicate</span> <span style="background-color: #FFFEFE">but</span> <span style="background-color: #FFFEFE">works</span> <span style="background-color: #FFFEFE">fine</span> <span style="background-color: #FFFEFE">seems</span> <span style="background-color: #FFFEFE">easy</span> <span style="background-color: #FFFEFE">enough</span> <span style="background-color: #FFFEFE">to</span> <span style="background-color: #FFF7F7">clean</span> <span style="background-color: #FFFEFE">and</span> <span style="background-color: #FFFEFE">care</span> <span style="background-color: #FFFEFE">for</span> <span style="background-color: #FFFCFC">the</span> <span style="background-color: #FFEBEB">glass</span> <span style="background-color: #FFFEFE">obviously</span> <span style="background-color: #FFFEFE">gets</span> <span style="background-color: #FFFEFE">hot</span> 
<span style="background-color: #FFFEFE">but</span> <span style="background-color: #FFFEFE">the</span> <span style="background-color: #FFFEFE">handle</span> <span style="background-color: #FFFEFE">and</span> <span style="background-color: #FFD9D9">lid</span> <span style="background-color: #FFFDFD">stay</span> <span style="background-color: #FFFEFE">cool</span> <span style="background-color: #FFFEFE">enough</span> <span style="background-color: #FFFEFE">to</span> <span style="background-color: #FFFCFC">handle</span> <span style="background-color: #FFFDFD">barely</span> <span style="background-color: #FFE3E3">warm</span> <span style="background-color: #FFFEFE">after</span> <span style="background-color: #FFA0A0">boiling</span> <span style="background-color: #FFF9F9">water</span> <span style="background-color: #FFFEFE">the</span> <span style="background-color: #FFFEFE">whistle</span> <span style="background-color: #FFFEFE">isn\'t</span> <span style="background-color: #FFFEFE">very</span> <span style="background-color: #FFFEFE">loud</span> <span style="background-color: #FFFEFE">but</span> <span style="background-color: #FFFEFE">that\'s</span> <span style="background-color: #FFFEFE">actually</span> <span style="background-color: #FFFEFE">my</span> <span style="background-color: #FFFEFE">preference</span> <span style="background-color: #FFFEFE">very</span> <span style="background-color: #FFFEFE">happy</span> <span style="background-color: #FFFEFE">with</span> <span style="background-color: #FFFEFE">this</span><br><br>

<span style="background-color: #FFEAEA">this</span> <span style="background-color: #FFFCFC">is</span> <span style="background-color: #FFF6F6">far</span> <span style="background-color: #FFE8E8">more</span> <span style="background-color: #FFFEFE">beautiful</span> <span style="background-color: #FFFEFE">than</span> <span style="background-color: #FFECEC">this</span> <span style="background-color: #FFFDFD">picture</span> <span style="background-color: #FFF9F9">shows</span> <span style="background-color: #FFFEFE">and</span> <span style="background-color: #FFF9F9">i</span> <span style="background-color: #FFFEFE">have</span> <span style="background-color: #FFFCFC">received</span> <span style="background-color: #FFFEFE">so</span> <span style="background-color: #FFF1F1">many</span> <span style="background-color: #FFF3F3">compliments</span> <span style="background-color: #FFEBEB">on</span> <span style="background-color: #FFFDFD">it</span> <span style="background-color: #FFFCFC">when</span> <span style="background-color: #FFFDFD">i</span> <span style="background-color: #FFFEFE">have</span> <span style="background-color: #FFA2A2">worn</span> <span style="background-color: #FFFCFC">it</span> <span style="background-color: #FFFDFD">i</span> <span style="background-color: #FFFEFE">would</span> <span style="background-color: #FFFEFE">highly</span> <span style="background-color: #FFFDFD">recommend</span> <span style="background-color: #FFF5F5">this</span> <span style="background-color: #FFF4F4">to</span> <span style="background-color: #FFFEFE">others</span><br><br>

In [81]:
import sys

In [82]:
sys.__version__

AttributeError: module 'sys' has no attribute '__version__'

In [83]:
sys.version

'3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58) \n[GCC 7.2.0]'

In [None]:
print(pyt)