# Dialogue Act Classification with CNNs and RNNs

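We classify dialogue acts (intents) using either a CNN or an RNN. The dataset is ATIS (Airline Travel Information Systems), a corpus of flight-booking queries. It consists of 4,478 training utterances and 893 test utterances and covers 21 intents. Of these, 17 intents are used for training.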

In [26]:
import os
import re

import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import (LSTM, Conv1D, Dense, Embedding, GlobalMaxPooling1D, MaxPooling1D, TextVectorization)
from tensorflow.keras.models import Sequential

## Loading the dataset and model


In [3]:
# First lines of the training set and the test set; note that the two files use different formats

!head ./data/atis.train.w-intent.iob

BOS i want to fly from boston at 838 am and arrive in denver at 1110 in the morning EOS	 O O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day atis_flight
BOS what flights are available from pittsburgh to baltimore on thursday morning EOS	O O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name B-depart_time.period_of_day atis_flight
BOS what is the arrival time in san francisco for the 755 am flight leaving washington EOS	O O O O B-flight_time I-flight_time O B-fromloc.city_name I-fromloc.city_name O O B-depart_time.time I-depart_time.time O O B-fromloc.city_name atis_flight_time
BOS cheapest airfare from tacoma to orlando EOS	O B-cost_relative O O B-fromloc.city_name O B-toloc.city_name atis_airfare
BOS round trip fares from pittsburgh to philadelphia under 1000 dollars EOS	O B-round_trip I-round_trip O O B-fromloc.city_name O B-toloc.city_name B-cost_relative B-fare_amo

In [4]:
!head ./data/atis.test.w-intent.iob

BOS O
i O
would O
like O
to O
find O
a O
flight O
from O
charlotte B-fromloc.city_name


In [5]:
# Download and extract GloVe (pretrained word embeddings)
!wget  https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip -d data

--2022-04-09 15:53:11--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-04-09 15:53:12--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-04-09 15:55:54 (5.08 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: data/glove.6B.50d.txt   
  inflating: data/glove.6B.100d.txt  
  inflating: data/glove.6B.200d.txt  
  inflating: data/glove.6B.300d.txt  


In [6]:
train_data_path = "./data/atis.train.w-intent.iob"
test_data_path = "./data/atis.test.w-intent.iob"

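The following function reads the training data and splits each line into three parts, the sentence, the per-word labels, and the intent, storing each in its own list. Take the first line of the training file as an example:
```
BOS i want to fly from boston at 838 am and arrive in denver at 1110 in the morning EOS	 O O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day atis_flight
```
Everything from BOS to EOS is the sentence, and everything after the tab is the label sequence. The last element of the label sequence is the intent. Each word is labeled by the type of named entity it belongs to: `i` and `want` are labeled `O` (not part of a named entity), while `boston` is labeled `B-fromloc.city_name` (a city name). Intents have the form `atis_<name>`, so the leading `atis_` is removed before storing. `BOS`, `EOS`, and their label positions are dropped as well. For this line, the three lists `sents`, `labels`, and `intents` receive:
```python
sents[0] = ['i', 'want', 'to', 'fly', 'from', 'boston', 'at', '838', 'am', 'and', 'arrive', 'in', 'denver', 'at', '1110', 'in', 'the', 'morning']
labels[0] = ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-depart_time.time', 'I-depart_time.time', 'O', 'O', 'O', 'B-toloc.city_name', 'O', 'B-arrive_time.time', 'O', 'O', 'B-arrive_time.period_of_day']
intents[0] = 'flight'
```

In [14]:
def load_train_data(filename, remove_validation=True):
    sents, labels, intents = [], [], []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            # Each line is "<words>\t<labels>". split() without arguments also
            # discards the stray space that follows the tab on some lines,
            # which would otherwise shift the labels by one position.
            words, labs = [part.split() for part in line.strip().split('\t')]
            # The last label is the intent; skip lines whose intent contains "#".
            if remove_validation and "#" in labs[-1]:
                continue
            # Drop BOS/EOS and the first/last label positions.
            sents.append(words[1:-1])
            labels.append(labs[1:-1])
            intents.append(re.sub(r"^atis_", "", labs[-1]))
    return sents, labels, intents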
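
As a quick check of the line format described above, the sketch below parses a single training line held in a string instead of a file. The helper name `parse_train_line` is hypothetical and not part of the notebook; the sample line is the fourth line of the `head` output shown earlier.

```python
import re


def parse_train_line(line):
    """Split one tab-separated training line into (words, slot labels, intent)."""
    word_part, label_part = line.strip().split("\t")
    words = word_part.split()
    labs = label_part.split()
    intent = re.sub(r"^atis_", "", labs[-1])
    # Drop BOS/EOS and the label positions that belong to them.
    return words[1:-1], labs[1:-1], intent


sample = (
    "BOS cheapest airfare from tacoma to orlando EOS\t"
    "O B-cost_relative O O B-fromloc.city_name O B-toloc.city_name atis_airfare"
)
words, slots, intent = parse_train_line(sample)
for word, slot in zip(words, slots):
    print(f"{word:10s} {slot}")
print("intent:", intent)
```

Printing the word and label side by side makes it easy to see whether the two sequences are still aligned after trimming.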

In [15]:
# load train
sents, labels, intents = load_train_data(train_data_path)
train_texts = [" ".join(words) for words in sents]
train_labels = intents

print("Number of training sentences :", len(train_texts))
print("Number of unique intents :", len(set(train_labels)))
for i in zip(train_texts[:5], train_labels[:5]):
    print(i)

Number of training sentences : 4952
Number of unique intents : 17
('i want to fly from boston at 838 am and arrive in denver at 1110 in the morning', 'flight')
('what flights are available from pittsburgh to baltimore on thursday morning', 'flight')
('what is the arrival time in san francisco for the 755 am flight leaving washington', 'flight_time')
('cheapest airfare from tacoma to orlando', 'airfare')
('round trip fares from pittsburgh to philadelphia under 1000 dollars', 'airfare')


In [16]:
def load_test_data(filename, remove_validation=True):
    sents, labels, intents = [], [], []
    with open(filename, encoding="utf-8") as f:
        words, tags = [], []
        for line in f:
            line = line.strip()
            if line:
                # The test file holds one "<word> <tag>" pair per line.
                word, tag = line.split()
                words.append(word)
                tags.append(tag)
            elif words:
                # A blank line ends the sentence; its last tag holds the intent.
                if not (remove_validation and "#" in tags[-1]):
                    sents.append(words[1:-1])
                    labels.append(tags[1:-1])
                    intents.append(re.sub(r"^atis_", "", tags[-1]))
                words, tags = [], []
    return sents, labels, intents
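
The test file stores one token per line with a blank line between sentences, so it has to be grouped before it matches the training format. The sketch below shows that grouping on an in-memory sample. The generator `iter_sentences` is a hypothetical helper, and the sample sentence is made up for illustration (it is not taken from the dataset). Unlike the loader above, it also emits a final sentence that is not followed by a blank line.

```python
import re


def iter_sentences(lines):
    """Yield (words, tags, intent) for each blank-line-separated sentence."""

    def finish(words, tags):
        # The last tag carries the intent; BOS/EOS positions are dropped.
        return words[1:-1], tags[1:-1], re.sub(r"^atis_", "", tags[-1])

    words, tags = [], []
    for raw in lines:
        line = raw.strip()
        if line:
            word, tag = line.split()
            words.append(word)
            tags.append(tag)
        elif words:
            yield finish(words, tags)
            words, tags = [], []
    if words:  # input ended without a trailing blank line
        yield finish(words, tags)


# Made-up sample in the one-token-per-line format.
sample = """BOS O
show O
flights O
to O
denver B-toloc.city_name
EOS atis_flight
"""
parsed = list(iter_sentences(sample.splitlines()))
print(parsed)
```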

In [17]:
sents, labels, intents = load_test_data(test_data_path)

test_texts = [" ".join(words) for words in sents]
test_labels = intents

new_labels = set(test_labels) - set(train_labels)

# Remove labels that appear only in the test data
vals = []
for i in range(len(test_labels)):
    if test_labels[i] in new_labels:
        print(test_labels[i])
        vals.append(i)
for i in vals[::-1]:
    test_labels.pop(i)
    test_texts.pop(i)

print("Number of testing sentences :", len(test_texts))
print("Number of unique intents :", len(set(test_labels)))
for i in zip(test_texts[:5], test_labels[:5]):
    print(i)

day_name
day_name
Number of testing sentences : 876
Number of unique intents : 15
('i would like to find a flight from charlotte to las vegas that makes a stop in st. louis', 'flight')
('on april first i need a ticket from tacoma to san jose departing before 7 am', 'airfare')
('on april first i need a flight going from phoenix to san diego', 'flight')
('i would like a flight traveling one way from phoenix to san diego on april first', 'flight')
('i would like a flight from orlando to salt lake city for april first on delta airlines', 'flight')
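
The removal of test-only labels can also be written without tracking indices and popping in reverse. The following is a minimal sketch on toy data; `drop_unseen_labels` is a hypothetical helper and the texts and labels are made up.

```python
def drop_unseen_labels(texts, labels, known_labels):
    """Keep only the (text, label) pairs whose label is in known_labels."""
    known = set(known_labels)
    kept = [(t, l) for t, l in zip(texts, labels) if l in known]
    return [t for t, _ in kept], [l for _, l in kept]


# Toy data for illustration only.
toy_train_labels = ["flight", "airfare", "flight"]
toy_test_texts = ["flights to denver", "what day is it", "fares to boston"]
toy_test_labels = ["flight", "day_name", "airfare"]

toy_test_texts, toy_test_labels = drop_unseen_labels(
    toy_test_texts, toy_test_labels, toy_train_labels
)
print(toy_test_texts)
print(toy_test_labels)
```

Filtering texts and labels together in one pass keeps the two lists aligned by construction.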


## Preprocessing

In [24]:
BASE_DIR = "data"
GLOVE_PATH = os.path.join(BASE_DIR,"glove.6B.100d.txt")
MAX_SEQUENCE_LENGTH = 300
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.3

In [20]:
# Convert words to integer IDs
vectorize_layer = TextVectorization(
    max_tokens = MAX_NUM_WORDS,
    output_mode = "int",
    output_sequence_length = MAX_SEQUENCE_LENGTH)

vectorize_layer.adapt(train_texts)
vectorize_layer.vocabulary_size()

896
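
To make the word-to-ID step concrete, here is a rough pure-Python approximation of what an integer-mode vectorizer does: ID 0 is reserved for padding, ID 1 for out-of-vocabulary words, and the remaining IDs go to words in order of decreasing frequency. The functions `build_vocab` and `vectorize` are hypothetical, the sentences are made up, and the tie-breaking order here (alphabetical) is an assumption that may differ from Keras.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Return a vocabulary list: padding, OOV token, then words by frequency."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Sort by descending count, then alphabetically (assumed tie-break).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ordered[: max_tokens - 2]


def vectorize(text, vocab, seq_len):
    """Map a sentence to a fixed-length list of IDs, truncating or zero-padding."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.lower().split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


toy_texts = ["show flights to denver", "show fares to boston"]
vocab = build_vocab(toy_texts, max_tokens=20)
ids = vectorize("show flights to paris", vocab, seq_len=6)
print(vocab)
print(ids)
```

Because every sequence is padded to `MAX_SEQUENCE_LENGTH` in the notebook, most positions of a short ATIS query are padding zeros.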

In [21]:
x_train = vectorize_layer(train_texts).numpy()
x_test = vectorize_layer(test_texts).numpy()

In [22]:
# Convert labels to integer IDs
le = LabelEncoder()
le.fit(train_labels)
y_train = le.transform(train_labels)
y_test = le.transform(test_labels)
y_train[:10]

array([ 9,  9, 11,  2,  2,  9,  1,  9,  9, 13])

In [23]:
train_labels[:5]

['flight', 'flight', 'flight_time', 'airfare', 'airfare']
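
The label encoding amounts to sorting the distinct label strings and using each label's position as its ID. A minimal pure-Python sketch of that idea follows; `fit_label_index` and `transform_labels` are hypothetical names, not scikit-learn functions. A label missing from the fitted index cannot be encoded, which is why test-only labels had to be removed beforehand.

```python
def fit_label_index(labels):
    """Map each distinct label to its rank in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def transform_labels(labels, index):
    """Encode labels as integers; raises KeyError for a label not seen in fit."""
    return [index[label] for label in labels]


toy_labels = ["flight", "flight", "flight_time", "airfare", "airfare"]
index = fit_label_index(toy_labels)
encoded = transform_labels(toy_labels, index)
print(index)
print(encoded)
```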

In [25]:
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train, y_train, test_size=VALIDATION_SPLIT, random_state=42
)
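
A short piece of arithmetic shows how many sentences end up in each part after the split. It assumes that the held-out share is rounded up and that the last, partially filled batch still counts as a training step; both are assumptions about library behavior rather than something shown in the notebook output.

```python
import math

n_samples = 4952        # training sentences reported by the loader above
validation_split = 0.3  # VALIDATION_SPLIT

# Assumption: the validation share is rounded up to a whole sentence.
n_valid = math.ceil(n_samples * validation_split)
n_train = n_samples - n_valid


def steps_per_epoch(n, batch_size):
    """Number of batches needed to cover n samples once."""
    return math.ceil(n / batch_size)


print("train:", n_train, "valid:", n_valid)
print("steps at batch_size=128:", steps_per_epoch(n_train, 128))
print("steps at batch_size=32:", steps_per_epoch(n_train, 32))
```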

In [27]:
# Prepare the embedding matrix.
# First, build a mapping from each word to its GloVe vector.
embeddings_index = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors in Glove embeddings.' % len(embeddings_index))

# Fill the embedding matrix.
# Rows correspond to vocabulary IDs, columns to the GloVe embedding dimensions.
num_words = min(MAX_NUM_WORDS, vectorize_layer.vocabulary_size()) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for i, word in enumerate(vectorize_layer.get_vocabulary()):
    if i >= num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    # Words that are not found in GloVe keep an all-zero vector.
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Load the pretrained word embeddings into an Embedding layer.
# Note that trainable=False is set so that the embeddings are not updated.
embedding_layer = Embedding(
    num_words,
    EMBEDDING_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False,
    mask_zero=True,
)

Found 400000 word vectors in Glove embeddings.
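
The matrix construction can be illustrated on a tiny example without downloading GloVe. The sketch below uses plain lists instead of NumPy, three-dimensional made-up vectors, and hypothetical helpers `parse_glove` and `build_matrix`. It also collects the vocabulary entries that have no pretrained vector, which is a simple way to check coverage.

```python
def parse_glove(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_matrix(vocab, vectors, dim):
    """Row i holds the vector of vocab[i]; missing words stay all-zero."""
    matrix = [[0.0] * dim for _ in vocab]
    missing = []
    for i, word in enumerate(vocab):
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)
        else:
            matrix[i] = vec
    return matrix, missing


# Made-up 3-dimensional vectors in the GloVe text format.
toy_glove = ["flight 0.1 0.2 0.3", "boston 0.4 0.5 0.6"]
toy_vocab = ["", "[UNK]", "flight", "boston", "tacoma"]

matrix, missing = build_matrix(toy_vocab, parse_glove(toy_glove), dim=3)
print(matrix)
print("no pretrained vector for:", missing)
```

The padding and out-of-vocabulary entries always land in the missing list, so their rows remain zero vectors.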


## Modeling

### CNN with pretrained embeddings

In [28]:
cnnmodel = Sequential()
cnnmodel.add(embedding_layer)
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation="relu"))
cnnmodel.add(Dense(len(le.classes_), activation="softmax"))

cnnmodel.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["acc"]
)

cnnmodel.summary()

cnnmodel.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=1,
    validation_data=(x_valid, y_valid)
)
score, acc = cnnmodel.evaluate(x_test, y_test)
print("Test accuracy with CNN:", acc)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 100)          89700     
                                                                 
 conv1d (Conv1D)             (None, 296, 128)          64128     
                                                                 
 max_pooling1d (MaxPooling1D  (None, 59, 128)          0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, 55, 128)           82048     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 11, 128)          0         
 1D)                                                             
                                                                 
 conv1d_2 (Conv1D)           (None, 7, 128)            8



Test accuracy with CNN: 0.7214611768722534
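
The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes the Keras defaults of `"valid"` padding with stride 1 for `Conv1D` and a pooling stride equal to the pool size for `MaxPooling1D`; the helper names are hypothetical.

```python
def conv1d_len(n, kernel):
    """Output length of a stride-1 convolution without padding."""
    return n - kernel + 1


def pool1d_len(n, pool):
    """Output length of non-overlapping pooling without padding."""
    return (n - pool) // pool + 1


def conv1d_params(in_channels, filters, kernel):
    """Kernel weights plus one bias per filter."""
    return kernel * in_channels * filters + filters


lengths = [300]                               # MAX_SEQUENCE_LENGTH
lengths.append(conv1d_len(lengths[-1], 5))    # conv1d
lengths.append(pool1d_len(lengths[-1], 5))    # max_pooling1d
lengths.append(conv1d_len(lengths[-1], 5))    # conv1d_1
lengths.append(pool1d_len(lengths[-1], 5))    # max_pooling1d_1
lengths.append(conv1d_len(lengths[-1], 5))    # conv1d_2
print(lengths)

print(conv1d_params(100, 128, 5))  # first convolution on 100-dim embeddings
print(conv1d_params(128, 128, 5))  # later convolutions on 128 channels
```

After the third convolution only 7 positions remain, so an input of length 300 is close to the shortest this stack of layers can handle.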


### CNN without pretrained embeddings

In [29]:
cnnmodel = Sequential()
cnnmodel.add(Embedding(MAX_NUM_WORDS, 128, mask_zero=True))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation="relu"))
cnnmodel.add(Dense(len(le.classes_), activation="softmax"))

cnnmodel.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["acc"]
)

cnnmodel.summary()

cnnmodel.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=1,
    validation_data=(x_valid, y_valid)
)
score, acc = cnnmodel.evaluate(x_test, y_test)
print("Test accuracy with CNN:", acc)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 128)         2560000   
                                                                 
 conv1d_3 (Conv1D)           (None, None, 128)         82048     
                                                                 
 max_pooling1d_2 (MaxPooling  (None, None, 128)        0         
 1D)                                                             
                                                                 
 conv1d_4 (Conv1D)           (None, None, 128)         82048     
                                                                 
 max_pooling1d_3 (MaxPooling  (None, None, 128)        0         
 1D)                                                             
                                                                 
 conv1d_5 (Conv1D)           (None, None, 128)        

### LSTM with pretrained embeddings

In [30]:
rnnmodel2 = Sequential()
rnnmodel2.add(embedding_layer)
rnnmodel2.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel2.add(Dense(len(le.classes_), activation="softmax"))
rnnmodel2.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

rnnmodel2.summary()

rnnmodel2.fit(
    x_train,
    y_train,
    batch_size=32,
    epochs=1,
    validation_data=(x_valid, y_valid)
)
score, acc = rnnmodel2.evaluate(x_test, y_test, batch_size=32)
print("Test accuracy with RNN:", acc)

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 100)          89700     
                                                                 
 lstm (LSTM)                 (None, 128)               117248    
                                                                 
 dense_4 (Dense)             (None, 17)                2193      
                                                                 
Total params: 209,141
Trainable params: 119,441
Non-trainable params: 89,700
_________________________________________________________________
Test accuracy with RNN: 0.8162100315093994


### LSTM without pretrained embeddings

In [31]:
rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128, mask_zero=True))
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel.add(Dense(len(le.classes_), activation="softmax"))
rnnmodel.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

rnnmodel.summary()

rnnmodel.fit(
    x_train,
    y_train,
    batch_size=32,
    epochs=1,
    validation_data=(x_valid, y_valid)
)
score, acc = rnnmodel.evaluate(x_test, y_test, batch_size=32)
print("Test accuracy with RNN:", acc)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, None, 128)         2560000   
                                                                 
 lstm_1 (LSTM)               (None, 128)               131584    
                                                                 
 dense_5 (Dense)             (None, 17)                2193      
                                                                 
Total params: 2,693,777
Trainable params: 2,693,777
Non-trainable params: 0
_________________________________________________________________
Test accuracy with RNN: 0.7557077407836914
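
The parameter counts in the two LSTM summaries above follow from standard formulas, which the sketch below works through. It assumes a standard LSTM layer with four gates, each having input weights, recurrent weights, and a bias; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units


# Pretrained model: 897 x 100 frozen embedding, LSTM(128), Dense(17).
frozen = embedding_params(897, 100)
trainable = lstm_params(100, 128) + dense_params(128, 17)
print(frozen, trainable, frozen + trainable)

# Model trained from scratch: Embedding(20000, 128), LSTM(128), Dense(17).
scratch = (
    embedding_params(20000, 128)
    + lstm_params(128, 128)
    + dense_params(128, 17)
)
print(scratch)
```

Almost all of the 2.69 million parameters of the second model sit in its embedding table, because it is sized for `MAX_NUM_WORDS` = 20000 rather than for the 896 words actually in the vocabulary.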
