Results below come from a single 90/10 train/validation split, not from cross-validation.

### Polarity classification

    INFO:tensorflow:***** Eval results *****
    INFO:tensorflow:  eval_accuracy = 0.8264151
    INFO:tensorflow:  eval_loss = 0.4085315
    INFO:tensorflow:  global_step = 744
    INFO:tensorflow:  loss = 0.41608095

### Category classification

    INFO:tensorflow:***** Eval results *****
    INFO:tensorflow:  eval_accuracy = 0.49811321
    INFO:tensorflow:  eval_loss = 1.4121249
    INFO:tensorflow:  global_step = 223
    INFO:tensorflow:  loss = 1.3853042


In [1]:
from dlcliche.utils import *
from dlcliche.nlp_mecab import *

In [2]:
# Download dataset & stop_words_ja.txt
! wget https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/chABSA-dataset.zip
! unzip -q chABSA-dataset.zip && rm chABSA-dataset.zip && rm -r __MACOSX
! ls chABSA-dataset
! cd chABSA-dataset && wget https://raw.githubusercontent.com/chakki-works/chABSA-dataset/master/notebooks/resource/stop_words_ja.txt

--2018-11-08 17:57:48--  https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/chABSA-dataset.zip
Resolving s3-ap-northeast-1.amazonaws.com (s3-ap-northeast-1.amazonaws.com)... 52.219.0.80
Connecting to s3-ap-northeast-1.amazonaws.com (s3-ap-northeast-1.amazonaws.com)|52.219.0.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 722777 (706K) [application/zip]
Saving to: ‘chABSA-dataset.zip’


2018-11-08 17:57:48 (7.71 MB/s) - ‘chABSA-dataset.zip’ saved [722777/722777]

e00008_ann.json  e01118_ann.json  e02289_ann.json  e04291_ann.json
e00017_ann.json  e01151_ann.json  e02353_ann.json  e04298_ann.json
e00024_ann.json  e01156_ann.json  e02367_ann.json  e04304_ann.json
e00026_ann.json  e01173_ann.json  e02380_ann.json  e04319_ann.json
e00030_ann.json  e01183_ann.json  e02382_ann.json  e04329_ann.json
e00033_ann.json  e01197_ann.json  e02390_ann.json  e04331_ann.json
e00034_ann.json  e01216_ann.json  e02414_ann.json  e04360_ann.json
e00035_ann.js

In [3]:
! wget https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip
! unzip multilingual_L-12_H-768_A-12.zip 
! rm multilingual_L-12_H-768_A-12.zip

--2018-11-08 18:00:18--  https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.26.48, 2404:6800:4004:818::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.26.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 623781697 (595M) [application/zip]
Saving to: ‘multilingual_L-12_H-768_A-12.zip’


2018-11-08 18:01:11 (11.3 MB/s) - ‘multilingual_L-12_H-768_A-12.zip’ saved [623781697/623781697]

Archive:  multilingual_L-12_H-768_A-12.zip
   creating: multilingual_L-12_H-768_A-12/
  inflating: multilingual_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: multilingual_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: multilingual_L-12_H-768_A-12/vocab.txt  
  inflating: multilingual_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: multilingual_L-12_H-768_A-12/bert_config.json  


In [4]:
DATA = Path('chABSA-dataset')

def check_data_existence(folder):
    file_count = len(list(folder.glob("e*_ann.json")))
    if file_count == 0:
        raise Exception("Processed Data does not exist.")
    else:
        print("{} files ready.".format(file_count))

check_data_existence(DATA)

stop_words = []
with (DATA/"stop_words_ja.txt").open(encoding="utf-8") as f:
    stop_words = f.readlines()
    stop_words = [w.strip() for w in stop_words]

print("{} stop words ready.".format(len(stop_words)))

230 files ready.
310 stop words ready.


In [5]:
labels = []
# Build entity#attribute labels (NULL and OOD are excluded).
# "market" only takes the "general" attribute, so stop after the first one.
for e in ["market", "company", "business", "product"]:
    for a in ["general", "sales", "profit", "amount", "price", "cost"]:
        labels.append(e + "#" + a)
        if e == "market":
            break
print(labels)

['market#general', 'company#general', 'company#sales', 'company#profit', 'company#amount', 'company#price', 'company#cost', 'business#general', 'business#sales', 'business#profit', 'business#amount', 'business#price', 'business#cost', 'product#general', 'product#sales', 'product#profit', 'product#amount', 'product#price', 'product#cost']


In [7]:
import json
import numpy as np
import pandas as pd
from collections import Counter

sentences = []
dataset = []
tokenizer = get_mecab_tokenizer()

for f in DATA.glob("e*_ann.json"):
    with f.open(encoding="utf-8") as j:
        d = json.load(j)
        for s in d["sentences"]:
            tokenized = tokenizer.tokenize(s["sentence"].upper())
            for o in s["opinions"]:
                if o["category"] in labels:
                    # row = (sentence index, category, polarity)
                    dataset.append((len(sentences), o["category"], o["polarity"]))
            sentences.append(tokenized)
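
To make the data flow concrete, here is a minimal sketch of how one annotation file turns into `dataset` rows. The toy document is hypothetical and only mirrors the keys the loop above reads (`sentences`, `sentence`, `opinions`, `category`, `polarity`); a plain whitespace split stands in for the MeCab tokenizer, and the excluded category value is made up for illustration.

```python
# Hypothetical toy annotation with the same keys the loop above reads.
toy_doc = {
    "sentences": [
        {"sentence": "sales rose", "opinions": [
            {"category": "company#sales", "polarity": "positive"},
            {"category": "NULL#general", "polarity": "neutral"},  # made-up excluded category
        ]},
        {"sentence": "costs rose while profit fell", "opinions": [
            {"category": "company#cost", "polarity": "negative"},
            {"category": "company#profit", "polarity": "negative"},
        ]},
    ]
}
toy_labels = {"company#sales", "company#cost", "company#profit"}

toy_sentences, toy_dataset = [], []
for s in toy_doc["sentences"]:
    tokens = s["sentence"].upper().split()  # stand-in for tokenizer.tokenize
    for o in s["opinions"]:
        if o["category"] in toy_labels:
            # len(toy_sentences) is the index this sentence is about to receive.
            toy_dataset.append((len(toy_sentences), o["category"], o["polarity"]))
    toy_sentences.append(tokens)

print(toy_dataset)
```

A sentence with several opinions yields several rows that share one sentence index, so a row-level split such as the one used below can place the same sentence in both the training and the validation file.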

## Polarity classification

In [8]:
from sklearn.model_selection import train_test_split
Y = 2
dataset = np.array(dataset)
Xtrn, Xval, ytrn, yval = train_test_split(dataset[:, 0], dataset[:, Y], test_size=0.1, random_state=0)

def write_dataset(filename, X, y):
    with open(filename, 'w', encoding='utf-8') as f:
        for _x, _y in zip(X, y):
            w = list(sentences[int(_x)])
            f.write(_y+'\t'+' '.join(w)+'\n')
write_dataset(DATA/'train.tsv', Xtrn, ytrn)
write_dataset(DATA/'valid.tsv', Xval, yval)
len(Xtrn), len(Xval), list(set(dataset[:, Y]))

(2381, 265, ['negative', 'positive', 'neutral'])
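
The next cell passes `--task_name=ChABSA`, which is not one of the tasks that ship with BERT's `run_classifier.py`, so the script must have been extended with a custom processor that this notebook does not show. The following is a minimal standalone sketch of what such a processor might look like. `ChABSAProcessor` and the `InputExample` stand-in are assumptions modeled on the `DataProcessor` pattern in `run_classifier.py`, not the code actually used for the results here.

```python
import csv
import os
from collections import namedtuple

# Stand-in for run_classifier.InputExample (assumption: same four fields).
InputExample = namedtuple("InputExample", ["guid", "text_a", "text_b", "label"])


class ChABSAProcessor:
    """Hypothetical reader for the label<TAB>text files written above."""

    def __init__(self, labels=("negative", "neutral", "positive")):
        self._labels = list(labels)

    def get_labels(self):
        return self._labels

    def get_train_examples(self, data_dir):
        return self._read(os.path.join(data_dir, "train.tsv"), "train")

    def get_dev_examples(self, data_dir):
        # This notebook writes valid.tsv rather than the more common dev.tsv.
        return self._read(os.path.join(data_dir, "valid.tsv"), "dev")

    def _read(self, path, set_type):
        examples = []
        with open(path, encoding="utf-8", newline="") as f:
            reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            for i, row in enumerate(reader):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                label, text = row[0], row[1]
                guid = "%s-%d" % (set_type, i)
                examples.append(InputExample(guid, text, None, label))
        return examples
```

In the real script a class like this would subclass `DataProcessor` and be registered in the `processors` dict in `main()` under the lowercased task name (`"chabsa"`). The category task (`ChABSA2`) would presumably reuse the same reader with the category labels passed in instead of the three polarities.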

In [9]:
! export BERT_BASE_DIR=./multilingual_L-12_H-768_A-12 && export ChABSA_DIR=./chABSA-dataset && python run_classifier.py   --task_name=ChABSA   --do_train=true   --do_eval=true   --data_dir=$ChABSA_DIR   --vocab_file=$BERT_BASE_DIR/vocab.txt  --bert_config_file=$BERT_BASE_DIR/bert_config.json   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt   --max_seq_length=128   --train_batch_size=32   --learning_rate=2e-5   --num_train_epochs=3.0   --output_dir=/tmp/chabsa_output/

INFO:tensorflow:Using config: {'_model_dir': '/tmp/chabsa_output/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f5adc8597b8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:Writing example 0 of 2381
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: train-0
INFO:tens

INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 2381
INFO:tensorflow:  Batch size = 32
INFO:tensorflow:  Num steps = 223
INFO:tensorflow:Skipping training since max_steps has already saved.
INFO:tensorflow:Writing example 0 of 265
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: dev-0
INFO:tensorflow:tokens: [CLS] 個 人 保 険 、 個 人 年 金 保 険 を 合 計 し た 新 契 約 高 は 、 米 トル 建 保 険 およひ 定 期 保 険 の 販 売 は 好 調 て あ ##っ た も ##の ##の 、 変 額 保 険 の 販 売 減 少 により 、 0 兆 0 , 0 億 円 ( 前 年 度 比 0 . 0 % 減 ) と なり ま ##し た [SEP]
INFO:tensorflow:input_ids: 101 1941 1763 1916 9164 1482 1941 1763 3498 8688 1916 9164 1562 2429 7790 1523 1527 4299 2998 6552 9602 1538 1482 6498 71517 3568 1916 9164 20417 3199 4459 1916 9164 1537 8098 2951 1538 3019 7868 1531 1508 97383 1527 1547 10711 10711 1482 2960 9352 1916 9164 1537 8098 2951 5180 3274 13723 1482 121 2054 121 117 121 2031 2094 113 2203 3498 3530 4901 121 119 121 110 5180 114 1532 48331 1543 14440 1527 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

INFO:tensorflow:Graph was finalized.
2018-11-08 18:02:57.280678: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-11-08 18:02:58.104551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-11-08 18:02:58.104583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-11-08 18:02:58.319121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-08 18:02:58.319160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-11-08 18:02:58.319165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
2018-11-08 18:02:58.319379: I tensorflow/core/common_runtime/gpu/gpu_device.cc

## Category classification

In [11]:
from sklearn.model_selection import train_test_split
Y = 1
dataset = np.array(dataset)
Xtrn, Xval, ytrn, yval = train_test_split(dataset[:, 0], dataset[:, Y], test_size=0.1, random_state=0)

def write_dataset(filename, X, y):
    with open(filename, 'w', encoding='utf-8') as f:
        for _x, _y in zip(X, y):
            w = list(sentences[int(_x)])
            f.write(_y+'\t'+' '.join(w)+'\n')
write_dataset(DATA/'train.tsv', Xtrn, ytrn)
write_dataset(DATA/'valid.tsv', Xval, yval)
len(Xtrn), len(Xval), list(set(dataset[:, Y]))

(2381,
 265,
 ['business#profit',
  'business#general',
  'market#general',
  'company#general',
  'product#sales',
  'business#amount',
  'company#sales',
  'product#price',
  'company#profit',
  'business#price',
  'product#general',
  'company#amount',
  'business#cost',
  'business#sales',
  'product#profit',
  'product#cost',
  'company#cost',
  'product#amount'])
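
Both experiments in this notebook rely on a single 90/10 hold-out split. As a rough sketch of how the cross-validation mentioned at the top could be set up, the helper below deals the row indices of each label round-robin into `k` folds so that every fold keeps roughly the same label mix. `stratified_folds` is a hypothetical helper written with the standard library only; scikit-learn's `StratifiedKFold` is the usual off-the-shelf choice for this.

```python
import random
from collections import defaultdict


def stratified_folds(y, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; hypothetical helper, not part of the notebook."""
    by_label = defaultdict(list)
    for i, label in enumerate(y):
        by_label[label].append(i)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        # Deal this label's rows round-robin, continuing where the previous
        # label stopped so that fold sizes stay balanced overall.
        for j, i in enumerate(idx):
            folds[(offset + j) % k].append(i)
        offset += len(idx)

    for f in range(k):
        valid = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, valid


# Toy labels only; in the notebook this would be dataset[:, Y].
y_toy = ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 1
for train_idx, valid_idx in stratified_folds(y_toy, k=3):
    print(len(train_idx), len(valid_idx))
```

Each pair of index lists could then be passed through `write_dataset` and `run_classifier.py` in turn. A separate `--output_dir` per fold would likely be needed, since the polarity log above shows the script skipping training when the output directory already holds a checkpoint at the final step.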

In [13]:
! export BERT_BASE_DIR=./multilingual_L-12_H-768_A-12 && export ChABSA_DIR=./chABSA-dataset && python run_classifier.py   --task_name=ChABSA2   --do_train=true   --do_eval=true   --data_dir=$ChABSA_DIR   --vocab_file=$BERT_BASE_DIR/vocab.txt  --bert_config_file=$BERT_BASE_DIR/bert_config.json   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt   --max_seq_length=128   --train_batch_size=32   --learning_rate=2e-5   --num_train_epochs=3.0   --output_dir=/tmp/chabsa2_output/

INFO:tensorflow:Using config: {'_model_dir': '/tmp/chabsa2_output/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc6c7fd7eb8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:Writing example 0 of 2381
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: train-0
INFO:ten

INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 2381
INFO:tensorflow:  Batch size = 32
INFO:tensorflow:  Num steps = 223
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (32, 128)
INFO:tensorflow:  name = input_mask, shape = (32, 128)
INFO:tensorflow:  name = label_ids, shape = (32,)
INFO:tensorflow:  name = segment_ids, shape = (32, 128)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (105879, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKP

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-11-08 18:05:05.184215: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-11-08 18:05:05.998511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-11-08 18:05:05.998540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-11-08 18:05:06.181846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-08 18:05:06.181878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-11-08 18:05:06.181886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: 

INFO:tensorflow:***** Running evaluation *****
INFO:tensorflow:  Num examples = 265
INFO:tensorflow:  Batch size = 8
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running eval on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (?, 128)
INFO:tensorflow:  name = input_mask, shape = (?, 128)
INFO:tensorflow:  name = label_ids, shape = (?,)
INFO:tensorflow:  name = segment_ids, shape = (?, 128)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (105879, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encode

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-11-08-09:07:15
INFO:tensorflow:Graph was finalized.
2018-11-08 18:07:16.161504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-11-08 18:07:16.161545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-08 18:07:16.161550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-11-08 18:07:16.161556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
2018-11-08 18:07:16.161667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10397 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /tmp/chabsa2_output/model.ckpt-223
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:D