# Doc2Vec for Kaggle Quora competition

Here we're tackling Kaggle's Quora question pairs competition mostly using the Paragraph Vector idea by [Le and Mikolov](https://arxiv.org/pdf/1405.4053v2.pdf). 

As training data we get ~400K question pairs labeled 1 if the questions in the pair are duplicates/have the same intent, and labeled 0 if they are considered to be different.

We then have to classify a test set of ~2.3M question pairs. The classification should be as correct as possible since the score is log loss, which punishes how wrong guesses are.

TL;DR: we get __best results with tuned XGBoost__ (surprise!) against a very basic, untuned deep net. Also, Doc2Vec vectorization sets a performance ceiling.

![Ask More Questions poster](../notebooks-img/ask_more_questions_part.jpeg?raw=True "Ask More Questions poster")

Unlike many approaches currently discussed in the competition [Kernels](https://www.kaggle.com/c/quora-question-pairs/kernels), which take into account the [distribution of labels](https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb) [in the test data](https://www.kaggle.com/davidthaler/how-many-1-s-are-in-the-public-lb), for the code below I don't take into account any information other than the training data.

See [the Kaggle Quora competition webiste](https://www.kaggle.com/c/quora-question-pairs) for full details and for the data!

---

We start by importing libraries

In [2]:
from __future__ import print_function
import sys, os
import pandas as pd
import numpy as np
from numpy.linalg import norm
from six.moves import cPickle as pickle
from collections import Counter
from gensim.models.doc2vec import Doc2Vec
from gensim import matutils
import multiprocessing
from scipy.stats import expon, bernoulli, geom
from itertools import izip
from scipy.spatial.distance import cdist
import time
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from skopt import gbrt_minimize
from skopt.space import Categorical
from sklearn.cross_validation import cross_val_score
import tensorflow as tf

# note that I'm using a special gensim that allows use of Doc2Vec with pretrained word embeddings
# you can get it at https://github.com/jhlau/gensim
# most of the code is the same if using regular gensim, except for small syntactical fixes
# also, if using regular gensim, don't use the pretrained_emb parameter in Doc2Vec. the difference isn't huge.
import gensim

## Data management



We then define the function to read in the csv file and process the question pairs. We also have another function that takes a dataframe with question pairs with tags and reduces them to a list of unique questions for further processing.

Note the pickling of variables along the way. I favor this approach in case I need to step away and resume processing later.

In [3]:
def processRows(tokens_only=False, train_size=None):

    if tokens_only: # test data
        pickle_filename = "pickles/test_processed_df.pickle"
    else: # train data
        # can extend this if-statement in case we want to 
        # work with fractions of the data
        if train_size is None: # full data, ~404K
            pickle_filename = "pickles/train_processed_df.pickle"
        elif train_size == "1000": # 0.25%, ~1000 rows
            pickle_filename = "pickles/train_processed_df_1000.pickle"
    
    # load data from pickle if we've previously processed it
    if os.path.exists(pickle_filename):
        
        print('%s already present - skipping pickling.' % pickle_filename)
        with open(pickle_filename, 'rb') as f:
            data_df = pickle.load(f)

    # otherwise, process and pickle
    else:

        if tokens_only:
            data_df = pd.read_csv("test.csv")
        else:
            data_df = pd.read_csv("train.csv")

        print("Processing training rows" if not tokens_only else "Processing test rows")
        for index, row in data_df.iterrows():
            
            # lowercases, tokenizes
            q1 = gensim.utils.simple_preprocess(str(row.question1))
            q2 = gensim.utils.simple_preprocess(str(row.question2))

            if tokens_only:
                data_df.set_value(index, "question1", q1)
                data_df.set_value(index, "question2", q2)
            else:
                # for training data, add tags
                # using doc2vec.TaggedDocument for Doc2Vec later on
                data_df.set_value(index, "question1", gensim.models.doc2vec.TaggedDocument(q1, [row.qid1]))
                data_df.set_value(index, "question2", gensim.models.doc2vec.TaggedDocument(q2, [row.qid2]))


        try:
            with open(pickle_filename, 'wb') as f:
                pickle.dump(data_df, f, pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            print('Unable to save data to', pickle_filename, ':', e)

    return data_df

In [4]:
def getUniqueTrainQuestions(train_processed_df=None):
    # since there's ~808K questions but only ~537K are unique, 
    # consider only unique ones to make training faster

    print("getting unique questions for data")

    if train_processed_df is None:
        train_processed_df = processRows()

    qid_set = set()
    q_list = []

    for q1, q2 in izip(train_processed_df.question1, train_processed_df.question2):
        if q1.tags[0] not in qid_set:
            q_list.append(q1)
            qid_set.add(q1.tags[0])
        if q2.tags[0] not in qid_set:
            q_list.append(q2)
            qid_set.add(q2.tags[0])

    return q_list

## Doc2Vec (+ random search for near-optimal parameters)

With very lightly processed text, we can start vectorizing sentences. 

There's many ways of doing this. Using TF-IDF, for instance, is an obvious choice. There's a few other newer vectorizing methods that I've been meaning to try, like Doc2Vec and its Paragraph Vector idea by [Le and Mikolov](https://arxiv.org/pdf/1405.4053v2.pdf). One of its main insights is that "The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph."

Doc2Vec is a Paragraph Vector implementation in gensim. It has many parameters that I used random search to comb through, according to this paper from [Bergstra and Bengio](http://www.jmlr.org/papers/v13/bergstra12a.html) on how, compared to grid search, "random search over the same domain is able to find models that are as good or better within a small fraction of the computation time." (Admittedly, after having spent some time doing grid search.)

Here's a rough, brief and very illuminating image (from the paper itself) explaining why. 

![Random Search vs. Grid Search](../notebooks-img/bergstra_bengio_grid_vs_random.png?raw=True "Random Search vs. Grid Search")

Below is a somewhat inelegant function to perform random search for the best parameters for Doc2Vec, as well as a sanity check to assess the quality of the models, taken from [gensim's own code](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb).

In [10]:
def sanity(data, model, iterations):

    print("generating sanity check on", len(data), "docs")
    ranks = []
    second_ranks = []

    # check the same 1000 random documents to have comparable model assessments
    np.random.seed(0)
    doc_id_sample = np.random.choice(xrange(len(data)), 1000)

    # iterate over documents
    for i, doc_id in enumerate(doc_id_sample):
        
        sys.stdout.write("\rProcessing doc #{:d} (doc_id = {:d}) of 1000 sampled docs".format(i, doc_id))
        sys.stdout.flush()
        
        # infer vector from document's words
        inferred_vector = model.infer_vector(data[doc_id].words, steps=iterations)
        
        # take most similar documents to inferred vector
        sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
        
        # index of current doc on sorted list of documents most similar to inferred vector
        # should be 0 or close to 0
        rank = [docid for docid, sim in sims].index(doc_id)
        
        # collect index
        ranks.append(rank)

    print()
    print(Counter(ranks).most_common(5))

Notice that in ``randomSearch()`` we do 100 iterations, which may be too many. According to [Alice Zheng from Dato](https://stats.stackexchange.com/a/209409) (then Turi, which was acquired by Apple), "if the close-to-optimal region of hyperparameters occupies at least 5% of the grid surface, then random search with 60 trials will find that region with high probability."

In [19]:
def randomSearch(data, train_size):

    n_jobs = multiprocessing.cpu_count()

    # only train in unique questions
    print("pre-unique questions:", data.shape[0]*2)
    data = getUniqueTrainQuestions(data)
    print("post-unique questions:", len(data))

    n_iter = 100

    # sampling at different orders of magnitude. manually tuned.
    dimensions = np.ceil(expon.rvs(scale=1.5, size=n_iter)).astype(int)*100
    min_count = np.ceil(expon.rvs(scale=1.5, size=n_iter)).astype(int)
    iterations = np.ceil(expon.rvs(scale=23, size=n_iter)).astype(int)*10
    window_size = geom.rvs(0.085, loc=2, size=100)
    sampling_threshold = expon.rvs(scale=2e-4, loc=1e-8, size=n_iter)
    dm_concat = bernoulli.rvs(p=0.5, size=n_iter)
    dm = bernoulli.rvs(p=0.5, size=n_iter)
    dbow_words = bernoulli.rvs(p=0.5, size=n_iter)
    negative = np.ceil(expon.rvs(scale=5, loc=-1, size=n_iter)).astype(int)

    # visually ensure we have enough range
    # I was personally looking for sampling_threshold the 
    # min in the e-07 range, and max in the e-03 range.
    print("min(sampling_threshold)", min(sampling_threshold))
    print("max(sampling_threshold)", max(sampling_threshold))

    # create search space list
    random_search_list = [(dimensions[i], min_count[i], iterations[i], window_size[i], 
                           sampling_threshold[i], dm_concat[i], dm[i], dbow_words[i], 
                           negative[i]) for i in xrange(n_iter)]

    # iterate over search space list
    for dimensions, min_count, iterations, window_size, sampling_threshold, \
        dm_concat, dm, dbow_words, negative in random_search_list:

        print("generating model")
        print("train_size", train_size, end=" | ")
        print("dimensions", dimensions, end=" | ")
        print("min_count", min_count, end=" | ")
        print("iterations", iterations, end=" | ")
        print("window_size", window_size, end=" | ")
        print("sampling_threshold", sampling_threshold, end=" | ")
        print("dm_concat", dm_concat, end=" | ")
        print("dm", dm, end=" | ")
        print("dbow_words", dbow_words, end=" | ")
        print("negative", negative, end="\n")


        model = Doc2Vec(size=dimensions, min_count=min_count, iter=iterations, 
                        window=window_size, workers=n_jobs, sample=sampling_threshold, 
                        dm_concat=dm_concat, dm=dm, dbow_words=dbow_words, negative=negative)

        
        model.build_vocab(data)
        print("training started on model")
        # for regular gensim:
        #    model.train(data, total_examples=model.corpus_count, epochs=model.iter)
        model.train(data)
        
        sanity(data, model, iterations)

As a demonstration, let's run the code so far on only 1K documents (as opposed to ~400K).

In [20]:
train_size = "1000"
data = processRows(train_size = train_size)
randomSearch(data, train_size)

pickles/train_processed_df_1000.pickle already present - skipping pickling.
pre-unique questions: 2020
getting unique questions for data
post-unique questions: 2014
min(sampling_threshold) 4.39123478884e-07
max(sampling_threshold) 0.000847762810627
generating model
train_size 1000 | dimensions 100 | min_count 2 | iterations 280 | window_size 7 | sampling_threshold 1.56535675299e-05 | dm_concat 0 | dm 0 | dbow_words 1 | negative 9
training started on model
generating sanity check on 2014 docs
Processing doc #999 (doc_id = 707) of 1000 sampled docss
[(0, 11), (1, 5), (3, 4), (10, 4), (275, 4)]
generating model
train_size 1000 | dimensions 400 | min_count 1 | iterations 180 | window_size 13 | sampling_threshold 0.000131573208883 | dm_concat 0 | dm 0 | dbow_words 1 | negative 2
training started on model
generating sanity check on 2014 docs
Processing doc #999 (doc_id = 707) of 1000 sampled docss
[(1, 225), (2, 50), (0, 33), (3, 21), (5, 15)]
generating model
train_size 1000 | dimensions 10

training started on model
generating sanity check on 2014 docs
Processing doc #999 (doc_id = 707) of 1000 sampled docss
[(1, 36), (2, 20), (0, 18), (3, 16), (4, 9)]
generating model
train_size 1000 | dimensions 100 | min_count 1 | iterations 850 | window_size 21 | sampling_threshold 0.000211952796599 | dm_concat 0 | dm 0 | dbow_words 0 | negative 32
training started on model
generating sanity check on 2014 docs
Processing doc #999 (doc_id = 707) of 1000 sampled docss
[(4, 4), (155, 4), (357, 4), (1125, 4), (1148, 4)]
generating model
train_size 1000 | dimensions 300 | min_count 1 | iterations 100 | window_size 28 | sampling_threshold 0.000370616869288 | dm_concat 1 | dm 1 | dbow_words 1 | negative 10
training started on model
generating sanity check on 2014 docs
Processing doc #999 (doc_id = 707) of 1000 sampled docss
[(1, 58), (0, 39), (2, 22), (3, 14), (4, 7)]
generating model
train_size 1000 | dimensions 200 | min_count 1 | iterations 280 | window_size 38 | sampling_threshold 0.0002

train_size 1000 | dimensions 300 | min_count 1 | iterations 10 | window_size 53 | sampling_threshold 0.000274604633654 | dm_concat 1 | dm 1 | dbow_words 1 | negative 2
training started on model
generating sanity check on 2014 docs
Processing doc #999 (doc_id = 707) of 1000 sampled docss
[(1022, 4), (1235, 4), (0, 3), (1, 3), (32, 3)]
generating model
train_size 1000 | dimensions 200 | min_count 1 | iterations 260 | window_size 3 | sampling_threshold 0.000208025491248 | dm_concat 1 | dm 0 | dbow_words 1 | negative 1
training started on model
generating sanity check on 2014 docs
Processing doc #999 (doc_id = 707) of 1000 sampled docss
[(1, 197), (0, 45), (2, 38), (3, 25), (4, 18)]
generating model
train_size 1000 | dimensions 200 | min_count 1 | iterations 280 | window_size 5 | sampling_threshold 0.000191358243958 | dm_concat 1 | dm 1 | dbow_words 1 | negative 9
training started on model
generating sanity check on 2014 docs
Processing doc #999 (doc_id = 707) of 1000 sampled docss
[(1, 28

When I ran this on the training set, I got the following best parameters:

In [7]:
dimensions = 300 
min_count = 1 
iterations = 300 
window_size = 17 # average of top performers
sampling_threshold = 0.000232331506591 # median of top performers
dm_concat = 1 
dm = 0 
dbow_words = 1
negative = 4 # median of top performers

## Doc2Vec and word embeddings

Then I wanted to see the role of pretrained word embeddings on training. I used the following function:

In [None]:
def explorePretrained(data):

    print("pre-unique questions:", data.shape[0]*2)
    data = getUniqueTrainQuestions(data)
    print("post-unique questions:", len(data))

    n_jobs = multiprocessing.cpu_count()

    # fetched mostly through https://github.com/3Top/word2vec-api
    for pretrained_emb in [ "word_emb/toy_pretrained_word_embeddings.txt",
                            "word_emb/glove.840B.300d.modified_header.txt", 
                            "word_emb/glove.twitter.27B/glove.twitter.27B.100d.modified_header.txt", 
                            "word_emb/glove.twitter.27B/glove.twitter.27B.200d.modified_header.txt", 
                            "word_emb/glove.twitter.27B/glove.twitter.27B.25d.modified_header.txt", 
                            "word_emb/glove.twitter.27B/glove.twitter.27B.50d.modified_header.txt", 
                            "word_emb/glove.wikiplusgigaword5.6B/glove.6B.100d.modified_header.txt", 
                            "word_emb/glove.wikiplusgigaword5.6B/glove.6B.200d.modified_header.txt", 
                            "word_emb/glove.wikiplusgigaword5.6B/glove.6B.300d.modified_header.txt", 
                            "word_emb/glove.wikiplusgigaword5.6B/glove.6B.50d.modified_header.txt",
                            "word_emb/freebase-vectors-skipgram1000-en-without-en.txt",
                            "word_emb/GoogleNews-vectors-negative300.txt"]:

        try:

            print("pretrained_emb", pretrained_emb)

            # note that the values of the parameters are assigned above e.g. dimensions = 300
            model = Doc2Vec(size=dimensions, min_count=min_count, iter=iterations, 
                    window=window_size, workers=n_jobs, sample=sampling_threshold, 
                    dm_concat=dm_concat, dm=dm, dbow_words=dbow_words, negative=negative,
                    pretrained_emb=pretrained_emb)

            sanity(data, model, iterations)

        except Exception as e:
            print("failed with", pretrained_emb, "with exception", e)
            
    return

In [8]:
explorePretrained(data)

My results on the whole data indicated that ``glove.twitter.27B/glove.twitter.27B.50d.modified_header.txt`` yielded the best results.

![Word Embedding experiments results](../notebooks-img/word_embed_results.png?raw=True "Word Embedding experiments results")

Note that the results aren't spectacular, however. The ``sanity()`` function picks 1000 random documents and tries to make sure that each document's already-known vector is close to the vector inferred by the model purely from the document's words. 

Here, as with the ``sanity()`` assessment for the random search, we see about 20-30% of the 1000 docs taking up one of the top 5 most similar spots to their own inferred vector. Ideally, we'd like close to 100% of the documents taking the single top similar spot. This indicated that perhaps Doc2Vec vectorization, as powerful as it is, has its limits as currently implemented.

## Training Doc2Vec

Now that we have the right hyper-parameters to train our model, we can use ``trainModel()`` below:

In [None]:
def trainModel():

    data = processRows() # get all training data
    data = getUniqueTrainQuestions(data) # unique training data

    n_jobs = multiprocessing.cpu_count()

    dimensions = 300 
    min_count = 1 
    iterations = 300 
    window_size = 17 
    sampling_threshold = 0.000232331506591
    dm_concat = 1 
    dm = 0 
    dbow_words = 1
    negative = 4
    pretrained_emb = "word_emb/glove.twitter.27B/glove.twitter.27B.50d.modified_header.txt"

    try:

        model = Doc2Vec(size=dimensions, min_count=min_count, iter=iterations, 
                window=window_size, workers=n_jobs, sample=sampling_threshold, 
                dm_concat=dm_concat, dm=dm, dbow_words=dbow_words, negative=negative,
                pretrained_emb=pretrained_emb)


        start_time = time.time()    
        model.build_vocab(data)
        print(time.time()-start_time, "secs to build vocab")

        start_time = time.time()    
        print("training started on model")
        model.train(data)
        print(time.time()-start_time, "secs to train")

        pickle_filename = "pickles/best_model.pickle"

        try:
            with open(pickle_filename, 'wb') as f:
                pickle.dump(model, f, pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            print('Unable to save data to', pickle_filename, ':', e)
            
        return model

    except Exception as e:
        print("failed with exception", e)

In [None]:
model = trainModel()

pickles/train_processed_df.pickle already present - skipping pickling.
getting unique questions for data
23.2628879547 secs to build vocab
training started on model
28809.9361 secs to train

## (Efficiently) vectorizing millions of sentences

Now that we've trained our Doc2Vec model to (likely) close to its best possible performance in vectorizing documents, we want to vectorize all test questions.

This process can be _extremely slow_. The process starts really fast but then seems to be hindered by memory. In order to avoid that, I split the data in chunks of 10K question pairs at a time and processed them separately. This brought down an expected processing time of several days to about ~7h on my laptop!

In [None]:
def partitionData(data):

    partitions = np.split(data, xrange(10000,data.shape[0],10000), axis=0)

    for i, p in enumerate(partitions):

        pickle_filename = "pickles/test_processed/test_processed_df_"+str(i+1)+"-"+str(len(partitions))+".pickle"

        try:
            with open(pickle_filename, 'wb') as f:
                pickle.dump(p, f, pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            print('Unable to save data to', pickle_filename, ':', e)

    return len(partitions)

In [10]:
num_partitions = partitionData(data)

I processed these text vectors in parallel manually. Each line defining a for-loop in ``inferTestVectorsDivideAndConquer()`` below used its own processor.

Making this code nicer as a proper python-multiprocessing function is a project for a future afternoon.

In [None]:
def inferTestVectorsDivideAndConquer(num_partitions, model):

    dimensions = 300 

    for number in [1, 8, 15, 22, 29, 36, 43, 50, 57, 64, 71, 78, 85, 92, 99, 106, 113, 120, 127, 134, 141, 148, 155, 162, 169, 176, 183, 190, 197, 204, 211, 218, 225, 232, 239, 246]:
    # for number in [2, 9, 16, 23, 30, 37, 44, 51, 58, 65, 72, 79, 86, 93, 100, 107, 114, 121, 128, 135, 142, 149, 156, 163, 170, 177, 184, 191, 198, 205, 212, 219, 226, 233]:
    # for number in [3, 10, 17, 24, 31, 38, 45, 52, 59, 66, 73, 80, 87, 94, 101, 108, 115, 122, 129, 136, 143, 150, 157, 164, 171, 178, 185, 192, 199, 206, 213, 220, 227, 234]:
    # for number in [4, 11, 18, 25, 32, 39, 46, 53, 60, 67, 74, 81, 88, 95, 102, 109, 116, 123, 130, 137, 144, 151, 158, 165, 172, 179, 186, 193, 200, 207, 214, 221, 228, 235]:
    # for number in [5, 12, 19, 26, 33, 40, 47, 54, 61, 68, 75, 82, 89, 96, 103, 110, 117, 124, 131, 138, 145, 152, 159, 166, 173, 180, 187, 194, 201, 208, 215, 222, 229]:
    # for number in [6, 13, 20, 27, 34, 41, 48, 55, 62, 69, 76, 83, 90, 97, 104, 111, 118, 125, 132, 139, 146, 153, 160, 167, 174, 181, 188, 195, 202, 209, 216, 223, 230]:
    # for number in [7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98, 105, 112, 119, 126, 133, 140, 147, 154, 161, 168, 175, 182, 189, 196, 203, 210, 217, 224, 231]:


        pickle_filename = "pickles/test_processed/test_processed_df_" + str(number) +"-" + num_partitions + ".pickle"
        output_pickle = "pickles/inferred/inferred_" + str(number) +"-" + num_partitions + ".pickle"

        with open(pickle_filename, 'rb') as f:
            data_df = pickle.load(f)

        print("inferring test vectors for " + output_pickle)

        start_time = time.time()

        q1_vec = pd.DataFrame(columns=range(0,dimensions))
        q2_vec = pd.DataFrame(columns=range(0,dimensions))

        for index, row in data_df.iterrows():

            sys.stdout.write("\rProcessing doc #{:d} of {:d} docs".format(index, data_df.shape[0]))
            sys.stdout.flush()
                
            q1_vec.loc[index] = model.infer_vector(row.question1, steps=300)
            q2_vec.loc[index] = model.infer_vector(row.question2, steps=300)

        print()
        print("file, q1_vec.shape, q2_vec.shape")
        print(output_pickle, q1_vec.shape, q2_vec.shape)

        print(time.time()-start_time, "sec to infer test vectors for", output_pickle)

        q_vec = {"q1_vec": q1_vec, "q2_vec": q2_vec}

        start_time = time.time()
        try:
            with open(output_pickle, 'wb') as f:
                pickle.dump(q_vec, f, pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            print('Unable to save data to', output_pickle, ':', e)

        print(time.time()-start_time, "sec to write inferred test vectors for", output_pickle)

    return q1_vec, q2_vec

In [11]:
inferTestVectorsDivideAndConquer(num_partitions, model)

Given that the test data is now partitioned, we have to put it back together:

In [None]:
def uniteTestData():
    
    output_pickle = "pickles/inferred_test_vectors.pickle"
    
    if os.path.exists(output_pickle):
        
        print('%s already present - skipping pickling.' % pickle_filename)
        with open(output_pickle, 'rb') as f:
            data_df = pickle.load(f)
            
            q1_vec = data["q1_vec"]
            q2_vec = data["q2_vec"]
    
    else:
        
        start_time = time.time()
    
        for number in xrange(1,236):

            print(number, end="/235...")

            input_pickle = "pickles/inferred/inferred_" + str(number) +"-235.pickle"

            with open(input_pickle, 'rb') as f:
                data_df = pickle.load(f)

                if number == 1:
                    q1_vec = data_df["q1_vec"].copy()
                    q2_vec = data_df["q2_vec"].copy()
                else:
                    q1_vec = pd.concat([q1_vec, data_df["q1_vec"]])
                    q2_vec = pd.concat([q2_vec, data_df["q2_vec"]])

        
        print(time.time()-start_time, "sec to gather inferred test vectors into", output_pickle)
                    
        print()

        q_vec = {"q1_vec": q1_vec, "q2_vec": q2_vec}

        start_time = time.time()

        try:
            with open(output_pickle, 'wb') as f:
                pickle.dump(q_vec, f, pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            print('Unable to save data to', output_pickle, ':', e)

        print(time.time()-start_time, "sec to write inferred test vectors for", output_pickle)
                
    return q1_vec, q2_vec

In [12]:
q1_vec, q2_vec = uniteTestData()

1/235...2/235...3/235...4/235...5/235...6/235...7/235...8/235...9/235...10/235...11/235...12/235...13/235...14/235...15/235...16/235...17/235...18/235...19/235...20/235...21/235...22/235...23/235...24/235...25/235...26/235...27/235...28/235...29/235...30/235...31/235...32/235...33/235...34/235...35/235...36/235...37/235...38/235...39/235...40/235...41/235...42/235...43/235...44/235...45/235...46/235...47/235...48/235...49/235...50/235...51/235...52/235...53/235...54/235...55/235...56/235...57/235...58/235...59/235...60/235...61/235...62/235...63/235...64/235...65/235...66/235...67/235...68/235...69/235...70/235...71/235...72/235...73/235...74/235...75/235...76/235...77/235...78/235...79/235...80/235...81/235...82/235...83/235...84/235...85/235...86/235...87/235...88/235...89/235...90/235...91/235...92/235...93/235...94/235...95/235...96/235...97/235...98/235...99/235...100/235...101/235...102/235...103/235...104/235...105/235...106/235...107/235...108/235...109/235...110/235...111/235.

As expressed [here](http://stackoverflow.com/a/38246020/583834), ``cPickle`` has its limits when it comes to large files. We use instead the pandas builtin HDF5 format output function ``to_hdf()``.

In [32]:
output_pickle = "pickles/inferred_test_vectors.h5"

start_time = time.time()

q_vec = pd.concat([q1_vec, q2_vec], ignore_index=True, axis=1)

try:
    q_vec.to_hdf(output_pickle, 'q_vec', mode='w')
    del q1_vec
    del q2_vec
except Exception as e:
    print('Unable to save data to', output_pickle, ':', e)

print(time.time()-start_time, "sec to write inferred test vectors for", output_pickle)

1658.35788798 sec to write inferred test vectors for pickles/inferred_test_vectors.h5


A quick note that this will be a bit heavy on your memory...

In [36]:
print(q_vec.shape)
print(q_vec.info())

(2345796, 600)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2345796 entries, 0 to 2345795
Columns: 600 entries, 0 to 599
dtypes: float64(600)
memory usage: 10.5 GB
None


## (Unsupervised) cosine similarity predictions

From here we can submit a prediction simply by obtaining a measure of similarity between the first and the second question for each pair. We calculate it [the same way gensim's ``doc2vec`` does](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec.py), through a cosine similarity. 

We implement cosine similarity code because Doc2Vec doesn't yet support calculating similarity with provided vectors. Also, it's surprising that row-wise cosine distance for matrices is not implemented in any python library (that I'm aware of). [``einsum``](https://docs.scipy.org/doc/numpy-1.12.0/reference/generated/numpy.einsum.html) is used, thanks to [this post](http://stackoverflow.com/a/15622926/583834). [This other post](http://stackoverflow.com/a/33641428/583834) is a good primer on ``einsum``.

In [110]:
q1_vec = q_vec.loc[:,:299]
q2_vec = q_vec.loc[:,300:]

start_time = time.time()

# cosine similarity
cosine_predictions = np.einsum('ij,ij->i', q1_vec, q2_vec) / (norm(q1_vec, axis=1) * norm(q2_vec, axis=1))

# replace all values less than 0 for 0
cosine_predictions[cosine_predictions < 0] = 0

# make dataframe for output
cosine_predictions = pd.concat([pd.Series(np.arange(cosine_predictions.shape[0])), pd.Series(cosine_predictions)], axis=1)

# rename dataframe's columns
cosine_predictions.columns = ["test_id","is_duplicate"]

# write csv submission
cosine_predictions.to_csv("similarity_predictions.csv", index=False)

print(time.time()-start_time, "sec to generate", cosine_predictions.shape[0], "predictions")

245.740324974 sec to generate 2345796 predictions


The results are not optimal, since a good score would be closer to 0.4 and below:

![Paragraph Vector unsupervised similarity score: 0.70604](../notebooks-img/doc2vec_unsupervised_score_wide.png?raw=True "Paragraph Vector unsupervised similarity score: 0.70604")

This score would likely be improved by running Doc2Vec over both training and test data, since we only have ~400K question pairs in training and ~2.34M pairs in testing.

## Supervised training methods

## XGBoost (untuned)

We can also improve the score by using XGBoost on the training data and predicting the test data from it. In order to switch to a supervised training solution, we want to treat each question pair as a single vector. For this, we use the 600-dimension concatenation of the 300-dimension vectors of both questions in each pair. Some other operations such as addition or subtraction on the two 300-dimenstional vectors _could_ work, but usually vectors are concatenated to avoid data loss.

Here's a basic XGBoost application. First, we load the data (because I restarted the server since I ran the code above...):

In [5]:
pickle_filename = "pickles/best_model.pickle"

# training data and model
train_data = processRows()
labels = train_data.is_duplicate

with open(pickle_filename, 'rb') as f:
    model = pickle.load(f)

# vectorized questions
q1_vec = model.docvecs[train_data.qid1]
q2_vec = model.docvecs[train_data.qid2]
q_vec = pd.concat([pd.DataFrame(q1_vec), pd.DataFrame(q2_vec)], ignore_index=True, axis=1)

print(q_vec.shape)
print(labels.shape)

pickles/train_processed_df.pickle already present - skipping pickling.
(404290, 600)
(404290,)


Then we perform XGBoost training and extract predictions: 

In [9]:
test_size = 0.25
X_train, X_test, y_train, y_test = train_test_split(q_vec, labels, test_size=test_size)

start_time = time.time()

xgb_model = XGBClassifier(nthread=-1)
xgb_model.fit(X_train, y_train)

print(time.time()-start_time, "sec to train XGBoost model")

# pickle model
output_pickle = "pickles/xgboost_model.pickle"
try:
    with open(output_pickle, 'wb') as f:
        pickle.dump(xgb_model, f, pickle.HIGHEST_PROTOCOL)
except Exception as e:
    print('Unable to save data to', output_pickle, ':', e)

# make and report predictions scores for test data
predictions = xgb_model.predict(X_test)
logloss_score = log_loss(y_test, predictions)
print("Log Loss: {:f}".format(logloss_score))

874.805408001 sec to train XGBoost model
Log Loss: 10.880096


This result is _extremely_ bad, but ameliorated by the fact that ``model.predict`` outputs label values, while ``model.predict_proba`` outputs probabilities. Since the ``log_loss`` metric penalizes wrong results, getting at least close to the right result is important, and probabilities help us with that. We use ``predict_proba()`` this time:

In [21]:
predictions = xgb_model.predict_proba(X_test)
logloss_score = log_loss(y_test, predictions)
print("Log Loss: {:f}".format(logloss_score))

Log Loss: 0.571065


This result is much more reasonable. Now we can generate predictions for the test values and submit them:

In [22]:
qtest_vec = pd.read_hdf("pickles/inferred_test_vectors.h5")

print(qtest_vec.shape)

start_time = time.time()
predictions = xgb_model.predict_proba(qtest_vec)
print(time.time()-start_time, "sec to generate", predictions.shape[0], "predictions")

(2345796, 600)
529.756315231 sec to generate 2345796 predictions


We need to use only the second column of the prediction result because, [as the source reveals](http://xgboost/python-package/xgboost/sklearn.py), a horizontal stack of ``classzero_probs`` and ``classone_probs`` is returned.

In [24]:
# prepare for submission
predictions = pd.concat([pd.Series(np.arange(predictions.shape[0])), pd.Series(predictions[:,1])], axis=1)
predictions.columns = ["test_id","is_duplicate"]
predictions.to_csv("xgboost_predictions.csv", index=False)

The result of the submission is a bit worse than that of the similarity predictions:

![Vanilla (untuned) XGBoost score: 0.82654](../notebooks-img/vanilla_xgboost_score_wide.png?raw=True "Vanilla (untuned) XGBoost score: 0.82654")

## Cross-validated XGBoost

We can improve this by tuning the XGBoost parameters with cross validation. The code below is taken from Bernstein and and Potts' excellent [Optimizing the hyperparameter of which hyperparameter optimizer to use](http://roamanalytics.com/2016/09/15/optimizing-the-hyperparameter-of-which-hyperparameter-optimizer-to-use/) [\[github\]](https://github.com/roaminsight/roamresearch/tree/master/BlogPosts/Hyperparameter_tuning_comparison).

We first define the cross validation functions.

In [None]:
def cross_validated_scorer(X_train, y_train, model_class, params, loss, kfolds=5):
    """
    The scoring function used through this module, by all search
    functions.
    """
    mod = model_class(**params)
    cv_score = -1 * cross_val_score(
        mod,
        X_train,
        y=y_train,
        scoring=loss,
        cv=kfolds,
        n_jobs=multiprocessing.cpu_count()).mean()
    return cv_score

def skopt_search(X_train, y_train, model_class, param_grid, loss, skopt_method, n_calls=100):
    """
    General method for applying `skopt_method` to the data.
    """
    param_keys, param_vecs = zip(*param_grid.items())
    param_keys = list(param_keys)
    param_vecs = list(param_vecs)

    def skopt_scorer(param_vec):
        params = dict(zip(param_keys, param_vec))
        err = cross_validated_scorer(
            X_train, y_train, model_class, params, loss)
        return err
    outcome = skopt_method(skopt_scorer, list(param_vecs), n_calls=n_calls, verbose=True)
    results = []
    for err, param_vec in zip(outcome.func_vals, outcome.x_iters):
        params = dict(zip(param_keys, param_vec))
        results.append({'loss': err, 'params': params})
    return results

def skopt_gbrt_search(
        X_train, y_train, model_class, param_grid, loss, n_calls=100):
    """
    `skopt` according to the gradient-boosting-tree search method.
    """
    return skopt_search(
        X_train, y_train, model_class, param_grid, loss, gbrt_minimize, n_calls=n_calls)

Note that "``scikit-optimize`` asks you to specify just the upper and lower bounds of the space to be searched."

We define these bounds.

In [None]:
skopt_grid = {
    'max_depth': (4, 12),
    'learning_rate': (0.01, 0.5),
    'n_estimators': (20, 200),
    'objective' : Categorical(('binary:logistic',)),
    'gamma': (0, 0.5),
    'min_child_weight': (1, 5),
    'subsample': (0.1, 1),
    'colsample_bytree': (0.1, 1)}

Also note that, at least for my AWS machine, I had an issue with the package multiprocessing v0.70a1 and pandas v0.20.1 similar to the one [here](https://github.com/scikit-learn/scikit-learn/issues/7981). I reverted to pandas v0.19.2, which I knew was working in another machine, and everything ran well. For the output, I had to tweak [``gbrt.py``](https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/optimizer/gbrt.py) due to a [small bug](https://github.com/scikit-optimize/scikit-optimize/issues/326) in passing the ``verbose`` parameter. It should be resolved by scikit-optimize v0.4.

We now run the cross validation (and wait...).

In [7]:
LOG_LOSS = 'neg_log_loss'

start_time = time.time()

n_calls = 50
# temporary redirection in case the kernel tends to interrupt and mess up the output
# stdout = sys.stdout
# with open('output.txt', 'w') as sys.stdout:
#     res = skopt_gbrt_search(q_vec, labels, XGBClassifier, skopt_grid, LOG_LOSS, n_calls=n_calls)
# sys.stdout = stdout

res = skopt_gbrt_search(q_vec, labels, XGBClassifier, skopt_grid, LOG_LOSS, n_calls=n_calls)


print(time.time()-start_time, "sec to optimize hyperparameters for XGBoost with cross-validation")

Iteration No: 1 started. Evaluating function at random point.
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 4465.8883
Function value obtained: 0.5287
Current minimum: 0.5287
Iteration No: 2 started. Evaluating function at random point.
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 544.0995
Function value obtained: 0.5615
Current minimum: 0.5287
Iteration No: 3 started. Evaluating function at random point.
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 1703.3487
Function value obtained: 0.5177
Current minimum: 0.5177
Iteration No: 4 started. Evaluating function at random point.
Iteration No: 4 ended. Evaluation done at random point.
Time taken: 1754.8015
Function value obtained: 0.5727
Current minimum: 0.5177
Iteration No: 5 started. Evaluating function at random point.
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 1233.1887
Function value obtained: 0.5124
Current minimum: 0.5124
Iteration No: 6 start

Iteration No: 41 ended. Search finished for the next optimal point.
Time taken: 1120.7123
Function value obtained: 0.4993
Current minimum: 0.4419
Iteration No: 42 started. Searching for the next optimal point.
Iteration No: 42 ended. Search finished for the next optimal point.
Time taken: 353.5213
Function value obtained: 0.5569
Current minimum: 0.4419
Iteration No: 43 started. Searching for the next optimal point.
Iteration No: 43 ended. Search finished for the next optimal point.
Time taken: 2070.9181
Function value obtained: 2.4857
Current minimum: 0.4419
Iteration No: 44 started. Searching for the next optimal point.
Iteration No: 44 ended. Search finished for the next optimal point.
Time taken: 1378.0682
Function value obtained: 0.5624
Current minimum: 0.4419
Iteration No: 45 started. Searching for the next optimal point.
Iteration No: 45 ended. Search finished for the next optimal point.
Time taken: 811.8483
Function value obtained: 1.1005
Current minimum: 0.4419
Iteration No: 46

50 calls took 73,925.84 seconds $\approx$ 20.5 hours to run.

We _could_ make more calls and potentially get a better set of parameters. On AWS, each extra call increases the running time on by ~25 minutes. I decided against it for now. We can then take a look at the 5 best parameter sets:

In [20]:
res_tuples = [(x['loss'], x['params']) for x in res]
for r in sorted(res_tuples)[:5]:
    print(r)

(0.44190591057460116, {'n_estimators': 177, 'subsample': 0.9547615647987954, 'colsample_bytree': 0.14751315421666406, 'gamma': 0, 'objective': 'binary:logistic', 'learning_rate': 0.20374312390199137, 'max_depth': 12, 'min_child_weight': 1})
(0.450003499736888, {'n_estimators': 173, 'subsample': 0.7724272099539228, 'colsample_bytree': 0.14301412377293404, 'gamma': 0, 'objective': 'binary:logistic', 'learning_rate': 0.20035580997038052, 'max_depth': 12, 'min_child_weight': 1})
(0.4678065646592991, {'n_estimators': 172, 'subsample': 0.8083830575540177, 'colsample_bytree': 0.1466420600454964, 'gamma': 0, 'objective': 'binary:logistic', 'learning_rate': 0.3236444877922251, 'max_depth': 8, 'min_child_weight': 1})
(0.46877660968947515, {'n_estimators': 175, 'subsample': 0.9154769880774636, 'colsample_bytree': 0.14608616503688332, 'gamma': 0, 'objective': 'binary:logistic', 'learning_rate': 0.3480749063995038, 'max_depth': 7, 'min_child_weight': 1})
(0.46903095714082, {'n_estimators': 191, 'su

And now we can train a model based on the best parameters over the whole training data. 

In [None]:
best_params = sorted(res_tuples)[0][1]
for param in best_params.keys():
    print(param, best_params[param])

start_time = time.time()

xgb_model = XGBClassifier(nthread=multiprocessing.cpu_count(), 
                          colsample_bytree = best_params['colsample_bytree'],
                          learning_rate = best_params['learning_rate'],
                          min_child_weight = best_params['min_child_weight'],
                          n_estimators = best_params['n_estimators'],
                          subsample = best_params['subsample'],
                          objective = best_params['objective'],
                          max_depth = best_params['max_depth'],
                          gamma = best_params['gamma'])
xgb_model.fit(q_vec, labels)

print(time.time()-start_time, "sec to train XGBoost model")

n_estimators 177
subsample 0.954761564799
colsample_bytree 0.147513154217
gamma 0
objective binary:logistic
learning_rate 0.203743123902
max_depth 12
min_child_weight 1
238.549133062 sec to train XGBoost model


Finally we generate and export predictions over the training data:

In [3]:
# make and report predictions scores for test data
start_time = time.time()
predictions = xgb_model.predict_proba(qtest_vec)
print(time.time()-start_time, "sec to generate", predictions.shape[0], "predictions")

predictions = pd.concat([pd.Series(np.arange(predictions.shape[0])), pd.Series(predictions[:,1])], axis=1)
predictions.columns = ["test_id","is_duplicate"]
predictions.to_csv("xgboost_cv_predictions.csv", index=False)

864.285174847 sec to generate 2345796 predictions


The results are the following:

![Cross-validation-tuned XGBoost score: 0.53307](../notebooks-img/cv_xgboost_score_wide.png?raw=True "Cross-validation-tuned XGBoost score: 0.53307")

This is a significant improvement over any of the previous classifiers! However, as mentioned before, it seems that even with optimized parameters and a powerful classifier such as XGBoost, Doc2Vec may not be producing the best vectorized sentences given our potentially-too-small training data.

## Deep neural net (untuned)

We can also try a deep neural net. Once again, this is adapted from a somewhat user-friendly ``DeepClassifier`` class using tensorflow by Roam's [Dingwall, Potts and Senaratna](https://roamanalytics.com/2016/09/13/prescription-based-prediction/#Deep-classifiers) [\[github\]](https://github.com/roaminsight/roamresearch/tree/master/BlogPosts/Prescription_based_prediction). You can see more details about the parameters and return values of the functions in ``DeepClassifier`` in [github](https://github.com/roaminsight/roamresearch/tree/master/BlogPosts/Prescription_based_prediction).

For more nuts and bolts of implementing a net like this in tensorflow, check out [this repo](https://github.com/arturomp/udacity-deep-learning/blob/master/3_regularization.ipynb) or [this other one](https://github.com/arturomp/kaggle/blob/master/digit-recognition/digit_recognition.py) where I use deep nets for optical character recognition.

In [None]:
class DeepClassifier:
    """Defines a feed-forward neural network with two hidden layers.
    Roughly,
    h1 = f(xW1 + b1)
    h2 = g(h1W2 + b2)
    y = softmax(h2W3 + b3)
    where drop-out is applied to h1 and h2.    
    """
    def __init__(self,
            hidden_dim1=200,
            hidden_dim2=100,
            activation1=tf.nn.relu,
            activation2=tf.nn.relu,
            keep_prob1=0.7,
            keep_prob2=0.7,
            eta=0.01,
            max_iter=100,
            tol=1e-05,
            verbose=True):
        
        self.hidden_dim1 = hidden_dim1
        self.hidden_dim2 = hidden_dim2
        self.activation1 = activation1
        self.activation2 = activation2
        self.keep_prob1 = keep_prob1
        self.keep_prob2 = keep_prob2
        self.eta = eta
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose
        self.params = ('hidden_dim1', 'hidden_dim2',
                       'activation1', 'activation2',
                       'keep_prob1', 'keep_prob2',                       
                       'eta', 'max_iter', 'tol',
                       'verbose')
        
    def fit(self, X, y):
        """Specifies the model graph and performs training.
        """
        # Set-up the dataset:
        self.input_dim = X.shape[1]        
        self.classes_ = sorted(set(y))
        self.output_dim = len(self.classes_)
        y_ = self.onehot_encode(y)        
        # Begin the tf session:
        tf.reset_default_graph()
        self.sess = tf.InteractiveSession()        
        # Model:
        self._build_graph()        
        # Optimization:
        cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(self.model, self.outputs))

        # from previous deep nets
        global_step = tf.Variable(0)  # count the number of steps taken.
        decay_steps = 100
        decay_rate = 0.9
        learning_rate = tf.train.exponential_decay(self.eta, global_step, decay_steps, decay_rate)
        
        optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(cost)
        # Initialization:
        init = tf.initialize_all_variables()
        self.sess.run(init)                        
        for i in range(1, self.max_iter+1):
            # Training step:
            _, loss = self.sess.run([optimizer, cost],
                feed_dict={self.inputs: X,
                           self.outputs: y_,
                           self.keepprob_holder1: self.keep_prob1,
                           self.keepprob_holder2: self.keep_prob2})
            # Progress report:
            self._progressbar(loss, i)
            if loss <= self.tol:
                sys.stderr.write('Stopping criteria reached.')
        if self.verbose:
            sys.stderr.write("\n")

    def _build_graph(self):
        """Builds the core computation graph."""
        # Inputs and outputs:
        self.inputs = tf.placeholder(tf.float32, [None, self.input_dim])
        self.outputs = tf.placeholder(tf.float32, [None, self.output_dim])
        # Layer 1:        
        W1 = self._weight_init(self.input_dim, self.hidden_dim1, name='W1')
        b1 = self._bias_init(self.hidden_dim1, name='b1')
        hidden1 = self.activation1(tf.matmul(self.inputs, W1) + b1)
        # Dropout 1:
        self.keepprob_holder1 = self._dropout_init('keep_prob1')
        dropout_layer1 = tf.nn.dropout(hidden1, self.keepprob_holder1)
        # Layer 2:
        W2 = self._weight_init(self.hidden_dim1, self.hidden_dim2, name='W2')
        b2 = self._bias_init(self.hidden_dim2, name='b2')
        hidden2 = self.activation2(tf.matmul(dropout_layer1, W2) + b2)
        # Dropout 2:
        self.keepprob_holder2 = self._dropout_init('keep_prob2')
        dropout_layer2 = tf.nn.dropout(hidden2, self.keepprob_holder2)
        # Output layer:
        W3 = self._weight_init(self.hidden_dim2, self.output_dim, name='W3')
        b3 = self._bias_init(self.output_dim, name='b3')
        # No softmax here; that's handled by the cost function.
        self.model = tf.matmul(dropout_layer2, W3) + b3

    def predict(self, X, prob=False):
        """Predict method that mimics `sklearn` by accepting
        an np.array and returning a vector of predictions.
        """
        predictions = self.sess.run(self.model,
            feed_dict={self.inputs: X,
                       self.keepprob_holder1: 1.0,
                       self.keepprob_holder2: 1.0})
        # changed to output probabilities
        if prob:
            return tf.nn.softmax(predictions)
        else:
            return self._predictionvecs2class(predictions)                               
    
    def _weight_init(self, m, n, name):
        """Weight initialization according to the heuristic
        that the values should be uniformly distributed around
        """
        x = np.sqrt(6.0/(m+n))
        with tf.name_scope(name) as scope: 
            return tf.Variable(
                tf.random_uniform(
                    [m, n], minval=-x, maxval=x), name=name)

    def _bias_init(self, dim, name, constant=0.0):
        """Bias initialization, by default as all 0s.
        """
        with tf.name_scope(name) as scope:            
            return tf.Variable(
                tf.constant(constant, shape=[dim]), name=name)

    def _dropout_init(self, name):
        """Initialize a placeholder for a dropout value."""        
        with tf.name_scope(name) as scope:
            return tf.placeholder(tf.float32, name=name)
                
    def onehot_encode(self, y, on_value=1.0):
        """Turns the list of class labels `y` into a matrix of
        one-hot encoded vectors. This could be replaced by
        `tf.one_hot`, but this native version does the job.
        """        
        classmap = dict(zip(self.classes_, range(self.output_dim)))        
        y_ = np.zeros((len(y), self.output_dim))
        for i, cls in enumerate(y):
            y_[i][classmap[cls]] = on_value            
        return y_

    def _progressbar(self, loss, index):
        """Overwriting progress bar for feedback on training process.
        Prints to standard error.
        """        
        if self.verbose:        
            sys.stderr.write('\r')
            sys.stderr.write("Iteration {}: loss is {}".format(index, loss))
            sys.stderr.flush()

    def get_params(self, deep=True):
        """Gets the hyperparameters for the model, as given by the
        `self.params` attribute. This is called `get_params` for
        compatibility with sklearn. `deep=True` is ignored, but is
        needed for sklearn.
        """
        return {p: getattr(self, p) for p in self.params}

    def set_params(self, **params):
        """Use the params dict to set attribute values. This
        is needed for sklearn `GridSearchCV` compatibility.
        """        
        for key, val in six.iteritems(params):
            setattr(self, key, val)
        return self

    def _predictionvecs2class(self, predictions):
        """
        Convert the matrix of prediction probabilities into classes.
        In cases of ties, a random choices is made to avoid spurious
        patterns resulting from guessing classes that are earlier
        in ordering in case of ties.              
        """
        maxprobs = predictions.max(axis=1)
        cats = []
        for row, maxprob in zip(predictions, maxprobs):
            i = np.random.choice([i for i, val in enumerate(row)
                                  if val==maxprob])
            cats.append(self.classes_[i])        
        return cats

After defining the class, we train a model based on it.

In [6]:
start_time = time.time()

mod = DeepClassifier(
           hidden_dim1=200,
           hidden_dim2=100,
           keep_prob1=0.7,
           keep_prob2=0.7,
           activation1=tf.nn.relu,
           activation2=tf.nn.relu,
           verbose=True,
           max_iter=3000,
           eta=0.01)

mod.fit(q_vec, labels)

print(time.time()-start_time, "sec to train deep net")

start_time = time.time()
# Test predictions and scoring:
predictions = tf.nn.softmax(mod.predict(qtest_vec, prob=True)).eval()[:,1:]
print(time.time()-start_time, "sec to generate", predictions.shape[0], "predictions")

Iteration 3000: loss is 0.563651680946


38095.938349 sec to train deep net
422.033621073 sec to generate 2345796 predictions


Training this deep net took 38095.94 seconds $\approx$ 11.5 hours to run.

Ideally, we'd like to cross-validate different parameters for our particular architecture and, even further, try out a couple of different architectures with their own cross-validations. However, that would be a bit too resource-intensive for a cursory view at a deep net's performance, as is evident from running only one of them, and as Roam's Dingwall, Potts and Senaratna themselves acknowledge.

Now we export the predictions to csv and submit them:

In [15]:
deep_predictions = pd.concat([pd.Series(np.arange(predictions.shape[0])), pd.Series(predictions.ravel())], axis=1)
deep_predictions.columns = ["test_id","is_duplicate"]
deep_predictions.to_csv("deep_predictions.csv", index=False)

The result is better than untuned XGBoost, but not better than just plain cosine similarity, and considerably worse than XGBoost tuned with scikit-optimize.

![Deep (untuned) neural net score: 0.53307](../notebooks-img/deep_score_wide.png?raw=True "Deep (untuned) neural net score: 0.53307")

## Opportunities and conclusions

Taken together, these results point toward an opportunity to better vectorize our sentences. Further experiments can easily incorporate TFIDF-vectorization. Another possibility is to use the test data, or even non-uniqued training data, to train the Doc2Vec model. At the same time, it may be the case that Doc2Vec is not the best way to vectorize documents. 

Once we have a vector representaion for sentences, we can achieve good results with cosine similarity, tuned XGBoost or optimized deep nets classifiers. Coupled with a better way of representing sentences, these classifiers should be able to improve the current best score on Kaggle data of 0.53307.