<a href="https://colab.research.google.com/github/asakoRaven/Colab/blob/master/Tensor2Tensor_Intro_my_first_trial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor) Colab

Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and [accelerate ML research](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html). T2T is actively used and maintained by researchers and engineers within the [Google Brain team](https://research.google.com/teams/brain/) and a community of users. This colab shows you some datasets we have in T2T, how to download and use them, some models we have, how to download pre-trained models and use them, and how to create and train your own models.

In [0]:
#@title
# Copyright 2018 Google LLC.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# My first trial

References  
[Tensor2Tensorを使って独自データでseq2seqしてみる (Trying seq2seq on custom data with Tensor2Tensor)](https://www.madopro.net/entry/t2t_seq2seq)

[GoogleCloudPlatform/training-data-analyst](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/09_sequence/poetry.ipynb)

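**Goal:** Load a Japanese conversation (chit-chat) dataset into Tensor2Tensor running on Colab.
  
Then check whether the model produces natural-sounding Japanese answers, one reply per question.  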

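My rough understanding of seq2seq:  


1.   Prepare a set of sentence pairs, each with an in sentence and the matching out sentence
2.   Build a network from the in sentences
3.   Build a network from the out sentences  
4.  Given an input, find and return the out sentence whose vector matches


  It sounds a bit like turning the world's languages into a meta-language...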
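
Step 4 is looser than what a trained seq2seq model usually does: the decoder generates the reply one token at a time, conditioned on the encoder's summary of the input, instead of retrieving a stored sentence. The sketch below shows only the shape of that loop. The encoder and decoder are hypothetical stand-ins written in plain Python (no real network), and every name in it is made up for illustration.

```python
EOS = "<EOS>"

def toy_encoder(tokens):
    """Stand-in encoder: folds the input tokens into a single state value."""
    return " ".join(tokens)

def toy_decoder_step(state, generated):
    """Stand-in decoder: returns the next token given the state and the tokens so far."""
    # Hypothetical "learned" behaviour; a real decoder computes this with a network.
    replies = {"good morning": ["morning", "!", EOS]}
    reply = replies.get(state, ["yes", EOS])
    return reply[len(generated)]

def greedy_decode(tokens, max_len=10):
    """Encodes the input once, then emits tokens until EOS or max_len."""
    state = toy_encoder(tokens)
    generated = []
    while len(generated) < max_len:
        token = toy_decoder_step(state, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated

print(greedy_decode(["good", "morning"]))
print(greedy_decode(["hello"]))
```

The point of the sketch is the control flow: one encoding pass, then a loop that stops at the end-of-sequence token.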
  
 

## Initial setup


### Prepare the data/train directories


In [2]:
# mount google drive to keep train data
from google.colab import drive
drive.mount('/content/gdrive')
! mkdir -p "./gdrive/My Drive/Colab Notebooks/T2T_my_first_trail/data"
! mkdir -p "./gdrive/My Drive/Colab Notebooks/T2T_my_first_trail/train"
! ls "./gdrive/My Drive/Colab Notebooks/T2T_my_first_trail"


Mounted at /content/gdrive
data  train


### Install Python packages


In [0]:
# Install deps
!pip install -q -U tensor2tensor
!pip install -q tensorflow matplotlib


### Set environment variables

The Tensor2Tensor README uses environment variables, so this notebook does the same.

The model used in the tutorial is replaced with a seq2seq model.


In [0]:
import os
# os.environ['DATA_DIR'] = '~/t2t/data'
os.environ['DATA_DIR'] = "./gdrive/My Drive/Colab Notebooks/T2T_my_first_trail/data"
os.environ['TMP_DIR']  ='~/t2t/tmp'
# os.environ['TRAIN_DIR'] = '~/t2t/train'
os.environ['TRAIN_DIR'] = "./gdrive/My Drive/Colab Notebooks/T2T_my_first_trail/train"

os.environ['USR_DIR']  = '~/myproblem'

os.environ['PROBLEM'] = 'my_problem'
os.environ['MODEL']   = 'lstm_seq2seq_attention_bidirectional_encoder'
os.environ['HPARAMS'] = 'lstm_luong_attention_multi'
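
Some of these values start with `~`. `os.environ` stores them literally, and a shell does not tilde-expand the result of a quoted `"$TMP_DIR"` either, so Python code that uses them as paths has to expand them itself. A minimal sketch, where the helper name `expand_dir_var` is a hypothetical one introduced only for illustration:

```python
import os

def expand_dir_var(name):
    """Returns the environment variable `name` with a leading "~" expanded."""
    return os.path.expanduser(os.environ[name])

os.environ["TMP_DIR"] = "~/t2t/tmp"  # same value as in the cell above
print(os.environ["TMP_DIR"])         # stored literally, still starts with "~"
print(expand_dir_var("TMP_DIR"))     # "~" replaced by the home directory
```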


### Import Python packages

These imports are reused from the tutorial.

The line `import matplotlib.pyplot as plt` prints the following warning:  

---  
`This call to matplotlib.use() has no effect because the backend has already been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot, or matplotlib.backends is imported for the first time.`

---
To get rid of the warning, I added an `import matplotlib` line at the top of the cell.  
--> That did not fix it, and I do not know why. Execution continues regardless.  
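
One possible way to hide the message, assuming it is harmless here and is emitted through Python's `warnings` module, is a message filter installed before the imports. This is a sketch under those assumptions, not a confirmed fix; the function name is hypothetical.

```python
import warnings

def silence_backend_warning():
    """Ignores warnings whose text starts with the matplotlib.use() notice."""
    # The message pattern is a regular expression matched at the start of the
    # warning text; \s* tolerates a leading newline in the message.
    warnings.filterwarnings(
        "ignore",
        message=r"\s*This call to matplotlib\.use\(\) has no effect",
    )

# Call this before importing matplotlib.pyplot or tensor2tensor.
silence_backend_warning()
```

Other warnings are unaffected, because the filter matches only on this message text.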


In [0]:
# to avoid the warning
#   This call to matplotlib.use() has no effect because the backend has already
#   been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
#   or matplotlib.backends is imported for the first time.
import matplotlib

# Imports we need.
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os
import collections

from tensor2tensor import models
from tensor2tensor import problems
from tensor2tensor.layers import common_layers
from tensor2tensor.utils import trainer_lib
from tensor2tensor.utils import t2t_model
from tensor2tensor.utils import registry
from tensor2tensor.utils import metrics

# Enable TF Eager execution
tfe = tf.contrib.eager
tfe.enable_eager_execution()

# Other setup
Modes = tf.estimator.ModeKeys

# Setup some directories
data_dir  = os.path.expanduser(os.environ['DATA_DIR'] )
tmp_dir   = os.path.expanduser(os.environ['TMP_DIR'] )
train_dir = os.path.expanduser(os.environ['TRAIN_DIR'])
usr_dir   = os.path.expanduser(os.environ['USR_DIR'])

# tf.gfile.MakeDirs(data_dir)
tf.gfile.MakeDirs(tmp_dir)
# tf.gfile.MakeDirs(train_dir)
tf.gfile.MakeDirs(usr_dir)

gs_data_dir = "gs://tensor2tensor-data"
gs_ckpt_dir = "gs://tensor2tensor-checkpoints/"

## Download conversation data

The data comes from the conversation dataset published by Nagoya University.

The script from https://qiita.com/knok/items/df7a155d17e3c9a12e94 is used to build it. Thank you, stranger.


In [0]:
if not os.path.isfile(os.path.join(data_dir, 'sequence.txt')):
    ! git clone https://github.com/knok/make-meidai-dialogue.git
    ! cd make-meidai-dialogue; make all
    # Check what was generated
    ! ls make-meidai-dialogue
    ! head make-meidai-dialogue/sequence.txt

    # Copy the data into the configured data directory
    ! cp make-meidai-dialogue/sequence.txt "${DATA_DIR}/"
    ! ls -ltr "$DATA_DIR"
    

In [7]:
# Check the copied file
! head "${DATA_DIR}/sequence.txt"

input: ＊＊＊でも録音した方がいいじゃん。
output: そうしましょう。
input: そうしましょう。
output: はい。
input: 何？
output: 作ってくるわ、瀬戸資料館で。
input: もう始まってんだよね。
output: うん、始まってる。
input: うん、始まってる。
output: 私もここに潜んでいていいかしら？


## Create my problem

In [8]:
%%writefile ~/myproblem/myproblem.py
import os
import re

from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

@registry.register_problem
class MyProblem(text_problems.Text2TextProblem):
  """Predict the reply to an utterance in a Japanese conversation corpus."""

  @property
  def approx_vocab_size(self):
    return 2**13  # ~8k

  @property
  def is_generate_per_split(self):
    # generate_data will shard the data into TRAIN and EVAL for us.
    return False

  @property
  def dataset_splits(self):
    """Splits of data to produce and number of output shards for each."""
    # 10% evaluation data
    return [{
        "split": problem.DatasetSplit.TRAIN,
        "shards": 9,
    }, {
        "split": problem.DatasetSplit.EVAL,
        "shards": 1,
    }]


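  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    """Yields one inputs/targets pair per 'input: '/'output: ' line pair."""
    filename = os.path.join(data_dir, 'sequence.txt')

    def remove_prefix(line, prefix):
      # str.strip('input: ') treats its argument as a set of characters to
      # trim from both ends and leaves the trailing newline in place (the
      # '\10;' seen in subword vocabularies), so cut the prefix explicitly.
      if line.startswith(prefix):
        line = line[len(prefix):]
      return line.strip()

    with open(filename, 'r') as f:
      while True:
        line1 = f.readline()
        line2 = f.readline()
        if not line2:
          break  # EOF
        yield {
            'inputs': remove_prefix(line1, 'input: '),
            'targets': remove_prefix(line2, 'output: '),
        }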



Writing /root/myproblem/myproblem.py
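
A note on the prefix handling in `generate_samples`: `str.strip` takes a set of characters, not a prefix, so `strip('input: ')` behaves differently from how it reads. The sketch below contrasts it with explicit prefix removal. The helper name `remove_prefix` and the sample sentence are illustrative assumptions, chosen so that the difference is obvious.

```python
def remove_prefix(line, prefix):
    """Drops a literal prefix, then trims surrounding whitespace."""
    if line.startswith(prefix):
        line = line[len(prefix):]
    return line.strip()

raw = "input: put it in\n"

# strip() trims any of the characters i, n, p, u, t, ':' and ' ' from both
# ends. Every character of this sentence is in that set, so only the
# newline survives.
print(repr(raw.strip("input: ")))           # '\n'
print(repr(remove_prefix(raw, "input: ")))  # 'put it in'
```

Japanese sentences rarely start or end with those ASCII characters, so the visible effect in this dataset is mostly the trailing newline that `strip('input: ')` leaves behind.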


In [9]:
%%writefile ~/myproblem/__init__.py
from . import myproblem

Writing /root/myproblem/__init__.py


In [10]:
! find ~/myproblem

/root/myproblem
/root/myproblem/myproblem.py
/root/myproblem/__init__.py


## Generate the dataset

In [0]:
import sys
sys.path.append(os.environ['HOME'])

In [0]:
from myproblem import myproblem


# Fetch the problem registered as my_problem
my_problem = problems.problem("my_problem")

if not os.path.isfile(os.path.join(data_dir, 'vocab.my_problem.8192.subwords')):
    # The generate_data method of a problem will download data and process it into
    # a standard format ready for training and evaluation.
    my_problem.generate_data(data_dir, tmp_dir)

## Train on the data

### Check the current state

In [13]:
! ls -ltr "$DATA_DIR"

total 6292
-rw------- 1 root root 3720966 Oct  5 08:08 sequence.txt
-rw------- 1 root root   85688 Oct  5 08:12 vocab.my_problem.8192.subwords
-rw------- 1 root root  262423 Oct  5 08:12 my_problem-train-00008-of-00009
-rw------- 1 root root  263787 Oct  5 08:12 my_problem-train-00007-of-00009
-rw------- 1 root root  264467 Oct  5 08:12 my_problem-train-00006-of-00009
-rw------- 1 root root  262788 Oct  5 08:12 my_problem-train-00005-of-00009
-rw------- 1 root root  262585 Oct  5 08:12 my_problem-train-00004-of-00009
-rw------- 1 root root  264082 Oct  5 08:12 my_problem-train-00003-of-00009
-rw------- 1 root root  262037 Oct  5 08:12 my_problem-train-00002-of-00009
-rw------- 1 root root  263995 Oct  5 08:12 my_problem-train-00001-of-00009
-rw------- 1 root root  262837 Oct  5 08:12 my_problem-train-00000-of-00009
-rw------- 1 root root  263875 Oct  5 08:12 my_problem-dev-00000-of-00001


In [14]:
# Fetch the problem
my_problem = problems.problem("my_problem")

# The vocab file was written to data_dir by generate_data; it is needed to
# encode inputs and decode model outputs.
vocab_name = "vocab.my_problem.8192.subwords"
vocab_file = os.path.join(data_dir, vocab_name)
print(vocab_file)
!head "{vocab_file}"

# Get the encoders from the problem
encoders = my_problem.feature_encoders(data_dir)

# Setup helper functions for encoding and decoding
def encode(input_str, output_str=None):
  """Input str to features dict, ready for inference"""
  inputs = encoders["inputs"].encode(input_str) + [1]  # add EOS id
  batch_inputs = tf.reshape(inputs, [1, -1, 1])  # Make it 3D.
  return {"inputs": batch_inputs}

def decode(integers):
  """List of ints to str"""
  integers = list(np.squeeze(integers))
  if 1 in integers:
    integers = integers[:integers.index(1)]
  return encoders["inputs"].decode(np.squeeze(integers))


./gdrive/My Drive/Colab Notebooks/T2T_my_first_trail/data/vocab.my_problem.8192.subwords
'<pad>_'
'<EOS>_'
'、_'
'。\10;_'
'_'
'うん_'
'？\10;_'
'）_'
'、（_'
'ね_'


 Japanese text also appears to be split into subword pieces, each with its own ID.  
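
As a rough illustration of what "split into pieces and given IDs" means, the sketch below implements a toy greedy longest-match tokenizer over a tiny hypothetical vocabulary. It is not T2T's `SubwordTextEncoder`, whose vocabulary is learned from the data and which also handles escaping. Only the conventions `<pad>` = 0 and `<EOS>` = 1 mirror the vocabulary file shown above.

```python
# Hypothetical toy vocabulary; a token's position in the list is its ID.
TOY_VOCAB = ["<pad>", "<EOS>", "、", "。", "うん", "そう", "ね"]
EOS_ID = 1
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB) if i > EOS_ID}

def toy_encode(text):
    """Greedy longest-match split of text into vocabulary IDs, plus EOS."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError("no vocabulary entry covers %r" % text[pos])
    return ids + [EOS_ID]

def toy_decode(ids):
    """Inverse of toy_encode: cut at the first EOS and join the pieces."""
    ids = list(ids)
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return "".join(TOY_VOCAB[i] for i in ids)

ids = toy_encode("そう、うんね。")
print(ids)              # [5, 2, 4, 6, 3, 1]
print(toy_decode(ids))  # そう、うんね。
```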

In [15]:
# Generate and view the data

example = tfe.Iterator(my_problem.dataset(Modes.TRAIN, data_dir)).next()
inputs = [int(x) for x in example["inputs"].numpy()] # Cast to ints.
targets = [int(x) for x in example["targets"].numpy()] # Cast to ints.


# Example inputs as int-tensor.
print("Inputs, encoded:")
print(inputs)
print("Inputs, decoded:")
# Example inputs as a sentence.
print(decode(inputs))
# Example targets as int-tensor.
print("Targets, encoded:")
print(targets)
# Example targets as a sentence.
print("Targets, decoded:")
print(decode(targets))

INFO:tensorflow:Reading data files from ./gdrive/My Drive/Colab Notebooks/T2T_my_first_trail/data/my_problem-train*
INFO:tensorflow:partition: 0 num_data_files: 9
Inputs, encoded:
[19, 2, 761, 604, 1609, 4044, 821, 12, 2, 5, 3, 1]
Inputs, decoded:
そう、機内持ち込みにしてー、うん。

Targets, encoded:
[5224, 24, 3874, 66, 3, 1]
Targets, decoded:
そっちの方が迷惑じゃん。



### t2t-trainer

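The `t2t-trainer` command takes a very long time, and the host ends up cancelling the process.  
  
For that reason the training output directory is placed on Google Drive, so that training can resume even after Colaboratory disconnects the session.  


The default appears to be 250,000 steps, which probably will not finish even after half a day; the session gets disconnected before that.  
When training is interrupted, run this cell again.  
Nobody knows when the 250K-step run will finish.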
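
To see how far a resumed run has progressed, one option is to look at the checkpoint files in the training directory. The sketch below assumes the usual TensorFlow naming, `model.ckpt-<step>.index`; the function name is hypothetical, and the directory is read from `TRAIN_DIR` only if it is set.

```python
import os
import re

def latest_checkpoint_step(train_dir):
    """Returns the highest step that has a checkpoint index file, or None."""
    pattern = re.compile(r"^model\.ckpt-(\d+)\.index$")
    steps = []
    for name in os.listdir(train_dir):
        match = pattern.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None

train_dir = os.path.expanduser(os.environ.get("TRAIN_DIR", "."))
if os.path.isdir(train_dir):
    print("latest checkpoint step:", latest_checkpoint_step(train_dir))
```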

In [0]:
#! t2t-trainer \
#  --data_dir=$DATA_DIR \
#  --t2t_usr_dir=$USR_DIR \
#  --problem=$PROBLEM \
#  --model=$MODEL \
#  --hparams_set=$HPARAMS \
#  --output_dir=$TRAIN_DIR

! t2t-trainer \
  --data_dir="$DATA_DIR" \
  --t2t_usr_dir="$USR_DIR" \
  --problem="$PROBLEM" \
  --model="$MODEL" \
  --hparams_set="$HPARAMS" \
  --output_dir="$TRAIN_DIR" \
  --keep_checkpoint_max=3 \
  --keep_checkpoint_every_n_hours=1


In [0]:
! ls -ltr  "$TRAIN_DIR"

Even though training stopped partway, checkpoints have been saved, so let's see how the model behaves.  


In [0]:
# BEAM_SIZE=4
# ALPHA=0.6
! t2t-decoder \
   --data_dir="$DATA_DIR" \
   --problem=$PROBLEM \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir="$TRAIN_DIR" \
   --decode_hparams="beam_size=4,alpha=0.6" \
   --decode_interactive=true \
   --t2t_usr_dir="$USR_DIR"
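
The `alpha=0.6` in `decode_hparams` controls length normalisation in beam search. The sketch below assumes the GNMT-style penalty `((5 + length) / 6) ** alpha`, which is the form T2T's beam search is understood to use, and applies it to made-up log-probabilities. The function names and numbers are illustrative only.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form): ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob, length, alpha=0.6):
    """Beam score: summed log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha)

# Made-up candidates: (summed log-probability, length in tokens).
short_reply = (-2.0, 1)
long_reply = (-2.9, 7)

for alpha in (0.0, 0.6):
    scores = [normalized_score(lp, n, alpha) for lp, n in (short_reply, long_reply)]
    print(alpha, scores)
```

With `alpha=0` the short reply scores higher; with `alpha=0.6` the longer one overtakes it, which is the effect the parameter is meant to have.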


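Perhaps because training has not progressed far, the model answers 「うん」 ("yeah") to everything.  
With more training it should get a little smarter...  


After about 70,000 training steps, answers other than 「うん」 started to appear. That was a lot of work.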


```
>おはよう、世界
INFO:tensorflow:うん。

>今日は暑いね
INFO:tensorflow:気出ないよ。

>なにか、おもしろいことあった？
INFO:tensorflow:はいはい、はい。

>「うん」以外云えないの？
INFO:tensorflow:そうそうそう。

```

