# 03 Trying out fastText word embeddings
* This notebook is meant to be run in a local environment rather than on Google Colab.
 * On Google Colab it would probably take quite a long time.
* If necessary, create a conda environment before running this notebook.

`$ conda create -n D_wordvec`

`$ source activate D_wordvec`

## 03-01 Install fastText

In [2]:
# !pip install fasttext

### Download the English data from "Word vectors for 157 languages"
* fastText documentation https://fasttext.cc/docs/en/crawl-vectors.html
* Paper https://arxiv.org/abs/1802.06893

In [3]:
import fasttext.util

fasttext.util.download_model('en', if_exists='ignore')

'cc.en.300.bin'

## 03-02 Download the IMDb dataset

### Download from the original site
* There are other ways to get the data, but here we download it directly from the original site.

In [4]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2020-10-17 03:18:32--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.1’


2020-10-17 03:18:45 (6.52 MB/s) - ‘aclImdb_v1.tar.gz.1’ saved [84125825/84125825]



In [5]:
!tar zxf aclImdb_v1.tar.gz
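
Where the `tar` command is not available, the same extraction can be done with Python's standard `tarfile` module. The following is only a sketch: it builds a tiny stand-in archive (the name `toy.tar.gz` and its contents are made up for illustration) so that it runs without the real 80 MB download.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Create a tiny stand-in archive so the sketch does not need the real file.
    src = tmp / "aclImdb"
    src.mkdir()
    (src / "README").write_text("toy archive", encoding="utf-8")
    archive = tmp / "toy.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="aclImdb")

    # Equivalent of `tar zxf <archive>`: extract everything into a directory.
    out_dir = tmp / "out"
    out_dir.mkdir()
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
    extracted = sorted(p.name for p in (out_dir / "aclImdb").iterdir())

print(extracted)
```

For the real dataset, the archive path would be `aclImdb_v1.tar.gz` and the target directory the current directory.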

### Install ml-datasets
* https://pypi.org/project/ml-datasets/
* A loader for machine learning datasets. It also makes IMDb easy to handle.

In [6]:
!pip install ml-datasets



### Load the fastText word vectors
* Load the file that was downloaded and decompressed earlier.

In [7]:
import fasttext

model_path = 'cc.en.300.bin'
print(f'# loading {model_path} ...', flush=True) 
ft = fasttext.load_model(model_path)

# loading cc.en.300.bin ...


### Load the IMDb dataset
* Use ml_datasets to load the data that was downloaded from the original site and extracted.

In [8]:
from ml_datasets import imdb

train_valid_data, test_data = imdb(loc='./aclImdb')

In [9]:
train_valid_data[:3]

[("Once again, Pia Zadora, the woman who owes her entire career to her husband, proves she can't act. This disaster of a film butchers the Harold Robbins novel. Ray Liotta must have been hogtied and carried to the set to appear in this one.\n\n\n\nAvoid this at all costs. I doubt even doing the MST3K thing would save it.",
  0),
 ('Just finished this movie... saw it on the video shelf and being a Nick Stahl fan I just had to rent it. In all honesty, it probably should have stayed on the shelf. The concept was an interesting one and there were several fairly smart twists and turns but somehow I guessed almost all of them before they came along. And the movie just went a little too far in the end in my opinion... if you have to suffer through a viewing of it you\'ll see what I mean!\n\n\n\nOn a positive note, Nick Stahl\'s acting was great (especially considering what he had to work with). Eddie Kaye Thomas was also good but he always plays the same type of character... too much Paul Fin
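
The `(text, label)` pairs shown above can be pictured as coming from the `pos` and `neg` subdirectories of each split in the extracted archive. The sketch below is not the ml_datasets implementation: `read_imdb_split` is a hypothetical helper, and the replacement of `<br />` with blank lines is an assumption that would explain the `\n\n\n\n` runs in the output. It builds a tiny fake dataset in a temporary directory so that it runs without the real data.

```python
import tempfile
from pathlib import Path


def read_imdb_split(split_dir):
    """Hypothetical helper: read one split (e.g. aclImdb/train) into (text, label) pairs."""
    examples = []
    for subdir, label in (("pos", 1), ("neg", 0)):
        for path in sorted(Path(split_dir, subdir).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            # Assumption: HTML line breaks are turned into blank lines.
            text = text.replace("<br />", "\n\n")
            examples.append((text, label))
    return examples


# Build a tiny fake dataset with the same pos/neg layout.
with tempfile.TemporaryDirectory() as tmp:
    for subdir, review in (("pos", "Great film.<br /><br />Loved it."),
                           ("neg", "Avoid this at all costs.")):
        folder = Path(tmp, "train", subdir)
        folder.mkdir(parents=True)
        (folder / "0_1.txt").write_text(review, encoding="utf-8")
    toy_examples = read_imdb_split(Path(tmp, "train"))

print(toy_examples)
```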

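### Shuffle everything except the test set
* Shuffle before separating texts from labels: `zip(*...)` builds new tuples, so shuffling `train_valid_data` afterwards would leave them unchanged.

In [10]:
import random

random.seed(123)
random.shuffle(train_valid_data)

### Separate the texts from the 0/1 labels

In [11]:
train_valid_texts, train_valid_labels = zip(*train_valid_data)
test_texts, test_labels = zip(*test_data)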

### Manually split into training and validation data

In [12]:
split = int(len(train_valid_data) * 0.8)
train_texts, train_labels = train_valid_texts[:split], train_valid_labels[:split]
valid_texts, valid_labels = train_valid_texts[split:], train_valid_labels[split:]

In [13]:
print(f'# {len(train_texts)} training, {len(valid_texts)} validation, and {len(test_texts)} test docs')

# 20000 training, 5000 validation, and 25000 test docs
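
A minimal, self-contained sketch of the shuffle-and-split pattern on toy data may help here. The names `toy_data`, `toy_train`, and `toy_valid` are made up for illustration. The point it illustrates is that the pairs are shuffled while texts and labels are still together, so every text keeps its own label, and only then are they separated and cut at the 80% mark.

```python
import random

# Toy stand-in for train_valid_data: ten (text, label) pairs.
toy_data = [(f"review {i}", i % 2) for i in range(10)]

# Shuffle the pairs while each text is still attached to its label.
random.seed(123)
random.shuffle(toy_data)

# Separate texts from labels, then cut at the 80% mark.
toy_texts, toy_labels = zip(*toy_data)
cut = int(len(toy_data) * 0.8)
toy_train = (toy_texts[:cut], toy_labels[:cut])
toy_valid = (toy_texts[cut:], toy_labels[cut:])

print(len(toy_train[0]), len(toy_valid[0]))
```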


In [14]:
splits = {
    'train': (train_texts, train_labels),
    'valid': (valid_texts, valid_labels),
    'test': (test_texts, test_labels)
}

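### Get embeddings for all documents and save them to files
* Use fastText's get_sentence_vector to obtain a vector representation of each document.
* Convert the vectors of all documents to an ndarray and save it in `.npy` format.
* Convert the labels of all documents to an ndarray as well and save it in `.npy` format.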

In [15]:
import numpy as np

for tag in splits:
    print(f'# {tag} set: ', end='', flush=True)
    cnt = 0
    X = list()
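    for text in splits[tag][0]:
        # get_sentence_vector does not accept newlines, so replace them with spaces
        vec = ft.get_sentence_vector(' '.join(text.split('\n')))
        X.append(vec)
        cnt += 1
        # Progress marker: '*' every 10000 documents, '-' every 1000
        if cnt % 10000 == 0: print('*', end='', flush=True)
        elif cnt % 1000 == 0: print('-', end='', flush=True)
    X = np.array(X)
    # Save the document vectors and the labels as separate .npy files
    with open(f'{tag}.npy', 'wb') as f:
        np.save(f, X, allow_pickle=False)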
    with open(f'{tag}_labels.npy', 'wb') as f:
        np.save(f, np.array(splits[tag][1]), allow_pickle=False)
    print(flush=True)

# train set: ---------*---------*
# valid set: -----
# test set: ---------*---------*-----
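
To give a rough idea of what a sentence vector is, the sketch below averages L2-normalized word vectors. It reflects my understanding of how get_sentence_vector behaves for pretrained (unsupervised) models, and details such as tokenization may differ in the real library. The 3-dimensional `toy_vectors` and the function `toy_sentence_vector` are hypothetical. Real fastText vectors here have 300 dimensions, and unknown words would get a vector built from subwords rather than being skipped.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only.
toy_vectors = {
    "good": [3.0, 0.0, 4.0],
    "movie": [0.0, 2.0, 0.0],
}


def toy_sentence_vector(text, vectors):
    """Average the L2-normalized vectors of the whitespace-separated words."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    count = 0
    for word in text.split():
        vec = vectors.get(word)
        if vec is None:
            continue  # simplification: this toy version skips unknown words
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            # Normalize to unit length before adding to the running sum.
            total = [t + x / norm for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total


sentence_vec = toy_sentence_vector("good movie", toy_vectors)
print(sentence_vec)
```

Because each word vector is scaled to unit length first, long and short vectors contribute equally to the average.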


In [15]:
!ls -al *.npy

-rw-rw-rw- 1 root root 30000128 Oct 10 11:23 test.npy
-rw-rw-rw- 1 root root   200128 Oct 10 11:23 test_labels.npy
-rw-rw-rw- 1 root root 24000128 Oct 10 11:23 train.npy
-rw-rw-rw- 1 root root   160128 Oct 10 11:23 train_labels.npy
-rw-rw-rw- 1 root root  6000128 Oct 10 11:23 valid.npy
-rw-rw-rw- 1 root root    40128 Oct 10 11:23 valid_labels.npy
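
The file sizes in the listing above can be sanity-checked with a little arithmetic. The sketch assumes 300-dimensional float32 vectors (4 bytes per value), int64 labels (8 bytes each), and a 128-byte `.npy` header. These values are inferred from the listing and the model name, not read from the files themselves.

```python
# Sizes in bytes, copied from the `ls -al` listing.
listed_sizes = {
    "train.npy": 24000128, "train_labels.npy": 160128,
    "valid.npy": 6000128, "valid_labels.npy": 40128,
    "test.npy": 30000128, "test_labels.npy": 200128,
}
n_docs = {"train": 20000, "valid": 5000, "test": 25000}

DIM = 300              # dimensionality of cc.en.300.bin
HEADER = 128           # assumed .npy header size, inferred from the listing
FLOAT32, INT64 = 4, 8  # assumed bytes per vector element and per label

checks = {}
for tag, n in n_docs.items():
    expected_vectors = n * DIM * FLOAT32 + HEADER
    expected_labels = n * INT64 + HEADER
    checks[tag] = (expected_vectors == listed_sizes[f"{tag}.npy"]
                   and expected_labels == listed_sizes[f"{tag}_labels.npy"])
    print(tag, checks[tag])
```

If a size does not match, a likely cause is a different dtype or a different number of documents in that split.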
