## DSSM and beyond

Повторяем идею из [Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/publication/learning-deep-structured-semantic-models-for-web-search-using-clickthrough-data/)


<img src="https://raw.githubusercontent.com/v-liaha/v-liaha.github.io/master/assets/dssm.png" width=600>

В качестве энкодера используем **conv - maxpooling**

Скачиваем данные [Quora Question Pairs](https://www.kaggle.com/quora/question-pairs-dataset)

**Описание данных:**

* id - the id of a training set question pair
* qid1, qid2 - unique ids of each question (only available in train.csv)
* question1, question2 - the full text of each question
* is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

In [1]:
import os
import re

import tensorflow as tf
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.preprocessing.text import Tokenizer

  return f(*args, **kwds)
Using TensorFlow backend.


In [2]:
# choose the GPU to use
# os.environ["CUDA_VISIBLE_DEVICES"] = '3'

In [3]:
time_steps = 12
vocab_size = 7000

**Задание 1**

Написать функцию, которая приводит строку к нижнему регистру, оставляет запятые, числа, вопросительный и восклицательный знаки

In [4]:
def tokenize_string(string):
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`’:]", " ", string)  
    string = re.sub(r"’", "'", string) 
    string = re.sub(r"`", "'", string) 
    string = re.sub(r"\'s", " \'s", string) 
    string = re.sub(r"\'ve", " \'ve", string) 
    string = re.sub(r"n\'t", " n\'t", string) 
    string = re.sub(r"\'re", " \'re", string) 
    string = re.sub(r"\'d", " \'d", string) 
    string = re.sub(r"\'ll", " \'ll", string) 
    string = re.sub(r",", " , ", string) 
    string = re.sub(r":", " : ", string) 
    string = re.sub(r"!", " ! ", string) 
    string = re.sub(r"\(", " ( ", string) 
    string = re.sub(r"\)", " ) ", string) 
    string = re.sub(r"\?", " ? ", string) 
    string = re.sub(r"\s{2,}", " ", string)    
    return string.strip().lower()

def vectorize(data, tokenizer, time_steps=time_steps):
    data = tokenizer.texts_to_sequences(data)
    data = pad_sequences(data, maxlen=time_steps, padding='post')
    return data

## Обработка данных

Поменяем постановку задачи: теперь вместо того, чтобы предсказывать, с какой вероятностью данные примеры являются дубликатами, будем находить дубликаты среди пула примеров.

In [5]:
# nrows -- сколько строк с *.csv файла загрузить в память
data = pd.read_csv('questions.csv', nrows=50000)

# оставляем только дубликаты
data = data[data['is_duplicate'] == 1]
data = data.dropna()
data = data.rename({'question1': 'query', 'question2': 'd+'}, axis=1)

# очищаем данные от шума
data['query'] = data['query'].apply(lambda x: tokenize_string(x))
data['d+'] = data['d+'].apply(lambda x: tokenize_string(x))

# создаем K=4 не дубликатов для данного примера
data['d1-'] = np.random.permutation(data['d+'].values)
data['d2-'] = np.random.permutation(data['d+'].values)
data['d3-'] = np.random.permutation(data['d+'].values)
data['d4-'] = np.random.permutation(data['d+'].values)

# первый пример всегда является дубликатом, все остальные --- нет
y = np.zeros((data.shape[0], 5), dtype=int)
y[:,0] = 1

In [6]:
# фитим токенайзер

corpus = data['query'].tolist() + data['d+'].tolist()
tok = Tokenizer(num_words=vocab_size)
tok.fit_on_texts(corpus)

In [7]:
# векторизуем данные

q = vectorize(data['query'].values, tok)
d0 = vectorize(data['d+'].values, tok)
d1 = vectorize(data['d1-'].values, tok)
d2 = vectorize(data['d2-'].values, tok)
d3 = vectorize(data['d3-'].values, tok)
d4 = vectorize(data['d4-'].values, tok)

In [8]:
# делим датасет на обучение и валидацию

x = np.hstack((q, d0, d1, d2, d3, d4)).reshape((-1, 6, time_steps))
xtr, xev, ytr, yev = train_test_split(x, y, test_size=0.1, random_state=24)

## input_fn

С помощью tf.data создаем итератор, который будет подавать данные в модель

In [9]:
def expand_x(x):
    return {'q': x[:,0],
            'd0': x[:,1],
            'd1': x[:,2],
            'd2': x[:,3],
            'd3': x[:,4],
            'd4': x[:,5]}

# функция, которая подает данные в модель
def input_fn(x, labels, params, is_training):
    dataset = tf.data.Dataset.from_tensor_slices((x, labels))

    if is_training:
        dataset = dataset.shuffle(buffer_size=params['train_size'])
        dataset = dataset.repeat(count=params['num_epochs'])

    dataset = dataset.batch(params['batch_size'])
    dataset = dataset.map(lambda x, y: (expand_x(x), y))
    dataset = dataset.prefetch(buffer_size=100)
    return dataset

# Model

**Задание 2**

Реализуйте функцию, котора считает косинусную близость между тензорами размера **(batch_size, dim)**

In [10]:
# hint: try to use tf.nn.l2_normalize, tf.multiply
from sklearn.metrics.pairwise import cosine_similarity

def cosine_sim(x, y):
    """
    Подсчет косинусной близости между двумя тензорами размера (batch_size, dim)
    """
#     cos_sim = tf.losses.cosine_distance(labels=x, predictions=y, axis=1)
    dot_product = tf.reduce_sum(tf.multiply(x,y), axis=1)

    return dot_product

**Задание 3**

Реализуйте энкодер, который переводит тензор размера **(batch_size, time_steps, emb_size)** в тензор **(batch_size, new_dim)**

<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/73d826d4c2363701b88e3e234fe3b8756c0f9671/3-Figure1-1.png" width=600>


Применить два типа свертки: **[kernel_size=3, strides=2, filters=32], [kernel_size=5, strides=3, filters=32]**

Над выходами **average-pooling, max-pooling** соответственно. Полученные тензоры сконкатенировать.

In [11]:
def build_model(features, params, is_training):
    
    emb_matrix = tf.get_variable('embedding_matrix',
                             shape=[params['vocab_size'], params['emb_size']],
                             dtype=tf.float32)
    
    def encode(sentences):
        """
        Args:
            sentences: (batch_size, time_steps) последовательности индексов
        Returns:
            out: (batch_size, new_dim) представление текста в новом пространстве
        """
        
        # hints: use tf.nn.embedding_lookup, tf.layers.conv1d, tf.reduce_max
        # tf.reduce_mean, tf.concat
        embs = tf.nn.embedding_lookup(emb_matrix, sentences)
#         sum_vector = tf.reduce_sum(embd, axis=1)
#         out = tf.layers.dense(sum_vector, 256, activation=tf.nn.relu)
#         out = tf.layers.dense(out, 128, activation=tf.nn.relu)
#         out = tf.layers.dense(out, 64)
        out = tf.layers.conv1d(inputs=embs, filters=32, kernel_size=3, strides=2)
        out = tf.reduce_mean(out, axis=1)
        out = tf.layers.conv1d(inputs=embs, filters=32, kernel_size=5, strides=3)
        out = tf.reduce_max(out, axis=1)
#         out = tf.concat(out, axis=1)

        return out
    
    # энкодим все документы
    encoded_features = {}        
    
    with tf.variable_scope('enc'):
        encoded_features['q'] = encode(features['q'])
    
    for key, value in features.items():
        if key != 'q':
            with tf.variable_scope('enc', reuse=True):
                encoded_features[key] = encode(value)
    
    # считаем косинусные близости между q и всеми документами
    cos_sims = {}
    
    for key, value in encoded_features.items():
        if key != 'q':
            cos_sims[key] = cosine_sim(encoded_features['q'], encoded_features[key])
    
    # конкатинируем косинусные близости
    to_concatenate = [cos_sims['d0'], cos_sims['d1'], cos_sims['d2'], cos_sims['d3'], cos_sims['d4']]
    concatenated = tf.stack(to_concatenate)
    
    return concatenated, encoded_features

Функция потерь:

$$J(\theta) = - \sum_i y_i \ln(\hat{y_i})$$

Мы хотим, чтобы $cosine\_similarity(q, d_0) = 1$, а $cosine\_similarity(q, d_j) = 0$, где $j \in \{1,2,3,4\}$, тогда лосс будет стремиться к нулю.


**Задание 4**

Реализовать метрики:

* Accuracy
* MSE

In [12]:
def model_fn(features, labels, mode, params):
    
    is_training = (mode == tf.estimator.ModeKeys.TRAIN)
    
    with tf.variable_scope('model'):
        logits, _ = build_model(features, params, is_training)
        
    preds = tf.argmax(logits, axis=1)
    
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {'preds': preds, 'logits': logits}
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions)
    
    # hints: tf.equal, tf.square, tf.substract, tf.cast, tf.reduce_mean
    logits = tf.transpose(logits)
    labels = tf.to_float(labels)
    accuracy, acc_op = tf.metrics.accuracy(labels=tf.argmax(labels, 1), predictions=tf.argmax(logits,1))
    mse, mse_op = tf.metrics.mean_squared_error(labels=tf.argmax(labels, 1), predictions=tf.argmax(logits,1))
#     print(accuracy)
#     print(mse)
    
#     prediction = tf.argmax(prob, 1)
#     equality = tf.equal(prediction, correct_answer)
#     accuracy = tf.reduce_mean(tf.cast(equality, tf.float32))
    
    
    loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
    
    if mode == tf.estimator.ModeKeys.EVAL:
        with tf.variable_scope('metrics'):
            eval_metrics = {'accuracy': (accuracy, acc_op),
                           'mse': (mse, mse_op)}
        
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=eval_metrics)
    
    tf.summary.scalar('accuracy', accuracy)
    tf.summary.scalar('mse', mse)
    tf.summary.scalar('loss', loss)
    
    optimizer = tf.train.AdamOptimizer()
    
    global_step = tf.train.get_global_step()
    train_op = optimizer.minimize(loss, global_step=global_step)
    
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

In [13]:
model_params = {
    'vocab_size': vocab_size,
    'emb_size': 300
}

config = tf.estimator.RunConfig(tf_random_seed=123,
                                model_dir='masha',
                                save_summary_steps=5)

estimator = tf.estimator.Estimator(model_fn,
                                   params=model_params,
                                   config=config)

INFO:tensorflow:Using config: {'_model_dir': 'masha', '_tf_random_seed': 123, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x117d2c978>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [14]:
params = {
    'batch_size': 256,
    'num_epochs': 5,
    'train_size': int(len(xtr) * 0.9)
}

In [15]:
estimator.train(lambda: input_fn(xtr, ytr, params=params, is_training=True))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into masha/model.ckpt.
INFO:tensorflow:loss = 1.6077139, step = 1
INFO:tensorflow:global_step/sec: 10.7509
INFO:tensorflow:loss = 0.65985274, step = 101 (9.303 sec)
INFO:tensorflow:global_step/sec: 10.306
INFO:tensorflow:loss = 0.19193456, step = 201 (9.703 sec)
INFO:tensorflow:global_step/sec: 10.0203
INFO:tensorflow:loss = 0.09169913, step = 301 (9.980 sec)
INFO:tensorflow:Saving checkpoints for 328 into masha/model.ckpt.
INFO:tensorflow:Loss for final step: 0.06080568.


<tensorflow.python.estimator.estimator.Estimator at 0x117d2c898>

In [16]:
eval_results = estimator.evaluate(lambda: input_fn(xev, yev, params=params, is_training=False))

for key, value in eval_results.items():
    print(f'{key}: {value}')

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-10-21-20:33:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from masha/model.ckpt-328
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-10-21-20:33:16
INFO:tensorflow:Saving dict for global step 328: accuracy = 0.7436997, global_step = 328, loss = 0.8626567, mse = 1.8772118
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 328: masha/model.ckpt-328
accuracy: 0.7436997294425964
loss: 0.8626567125320435
mse: 1.8772118091583252
global_step: 328


In [17]:
preds = estimator.predict(lambda: input_fn(xev, yev, params=params, is_training=False))

In [18]:
logits = []

for el in preds:
    logits.append(el['logits'])
    
logits = np.array(logits)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from masha/model.ckpt-328
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
