In this solution I use cosine similarity in TF-IDF space to pick, for each phrase removed from a dialogue, the dialogue it most likely came from. Top-1 accuracy is about 6%, measured on two random halves of the training set.
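
As a rough illustration of the matching step, here is a minimal sketch on made-up toy data (the `dialogues` and `phrases` lists are invented for this example, and the code targets Python 3 while the notebook below is Python 2). It relies on `TfidfVectorizer` L2-normalizing rows by default, so a plain dot product between rows is already the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data, invented for illustration only
dialogues = [
    "where is the money hidden",
    "the weather is lovely today",
    "call the police right now",
]
phrases = ["hidden money", "lovely weather"]

# norm='l2' is the default, so every transformed row has unit length
vect = TfidfVectorizer()
vect.fit(dialogues + phrases)

X_doc = vect.transform(dialogues)
X_phr = vect.transform(phrases)

# sim[i, j] = cosine similarity between phrase i and dialogue j
sim = (X_phr @ X_doc.T).toarray()

# Sanity check against scikit-learn's explicit cosine similarity
assert np.allclose(sim, cosine_similarity(X_phr, X_doc))

# Index of the best-matching dialogue for each phrase
best = sim.argmax(axis=1)
print(best)
```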

Running it: 

- `python train_vectorizer.py`
- `python match_dialogs.py challenge_data/test_dialogs.txt challenge_data/test_missing.txt`

Answers:

- I chose TF-IDF because it is simple to implement and often performs well, which also makes it a good baseline to compare later models against.
- To evaluate performance I split the training set so that it resembles the test set. The training set is twice as large as the test set, so I split it into two parts and evaluated the method on each. I used accuracy, but a ranking metric such as MAP@3 or MAP@5 would probably make more sense.
- Weaknesses: this representation (and the cleaning pipeline) discards a lot of information that could otherwise be useful, for example word order beyond trigrams and the stop words and punctuation removed during cleaning.
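
The MAP@k metric mentioned above could be computed from the same similarity matrix. The sketch below is not part of the original solution: `map_at_k` is a hypothetical helper, written for Python 3, and it assumes each missing phrase has exactly one correct dialogue. Under that assumption the average precision of a query reduces to 1/rank of the correct dialogue if it appears in the top k, and 0 otherwise.

```python
import numpy as np

def map_at_k(scores, truth, k=5):
    """MAP@k assuming exactly one correct candidate per query.

    scores: (n_queries, n_candidates) similarity matrix
    truth:  index of the correct candidate for each query
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth)
    # Indices of the k highest-scoring candidates, best first
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = top_k == truth[:, None]
    found = hits.any(axis=1)
    # 1-based rank of the correct candidate (only meaningful where found)
    ranks = hits.argmax(axis=1) + 1
    ap = np.where(found, 1.0 / ranks, 0.0)
    return ap.mean()

# Toy similarity matrix, invented for illustration
scores = [
    [0.9, 0.1, 0.0],  # correct candidate 0 is ranked 1st
    [0.2, 0.5, 0.3],  # correct candidate 2 is ranked 2nd
    [0.4, 0.6, 0.1],  # correct candidate 0 is ranked 2nd
]
truth = [0, 2, 0]

print(map_at_k(scores, truth, k=1))
print(map_at_k(scores, truth, k=3))
```

With k=1 this is the same as the accuracy used in the notebook, so the two metrics can be compared directly.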

Other things I tried (see the other `notcleaned` notebook):

- similarity in the LSI space
- for each missing phrase, select the top 20 candidates with TF-IDF and then train a model to pick the best one among them. This improved only marginally over the TF-IDF baseline
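
The first idea in the list, similarity in the LSI space, might look roughly like the sketch below. The notes do not say how LSI was implemented in the other notebook, so everything here is an assumption: `TruncatedSVD` on top of the TF-IDF matrix is one common way to get an LSI projection, `n_components=2` is chosen only because the toy data is tiny, and the texts are invented. The code targets Python 3.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy data, invented for illustration only
dialogues = [
    "where is the money hidden tell me now",
    "the weather is lovely today is it not",
    "call the police right now please hurry",
    "money money money all you want is money",
]
phrases = ["hidden money", "lovely weather"]

tfidf = TfidfVectorizer().fit(dialogues + phrases)

# Assumed LSI step: low-rank projection of the TF-IDF matrix
svd = TruncatedSVD(n_components=2, random_state=0)
D = svd.fit_transform(tfidf.transform(dialogues))
P = svd.transform(tfidf.transform(phrases))

# Re-normalize rows so the dot product is a cosine similarity again
D = normalize(D)
P = normalize(P)

sim = P.dot(D.T)            # sim[i, j]: phrase i vs dialogue j
best = sim.argmax(axis=1)   # best-matching dialogue per phrase
print(best)
```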

There are many other things one could try here; we can discuss them during the interview

In [1]:
from collections import defaultdict
from tqdm import tqdm
import pandas as pd
import numpy as np
import re

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

import cPickle

In [2]:
en_stopwords = set(stopwords.words('english')) | {"n't", 's', 're', 've', 'm', 'would', 'could'}

def remove_html(text):
    return re.sub('<.+?>', ' ', text)

def tokenize(phrase):
    phrase = remove_html(phrase)
    tokens = word_tokenize(phrase)
    tokens = [t.strip("'.\"?:;") for t in tokens]
    tokens = [t for t in tokens if t and t[0].isalpha() and t not in en_stopwords]
    return ' '.join(tokens)
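
A quick check of what the tag-stripping regex in `remove_html` does, using only the standard library. The input strings are invented for illustration. The second call shows a limitation worth keeping in mind: any text between a literal `<` and a later `>` on the same line is removed, whether or not it is an HTML tag.

```python
import re

def remove_html(text):
    # Same pattern as in the cell above: lazily match '<', any characters, '>'
    return re.sub('<.+?>', ' ', text)

# Each tag is replaced by a single space
print(repr(remove_html("<i>hello</i> there")))   # ' hello  there'

# Non-tag text between '<' and '>' is removed as well
print(repr(remove_html("a < b and c > d")))      # 'a   d'
```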

In [3]:
dialogs_all = defaultdict(list)
train = set()
test = set()

with open('./challenge_data/train_dialogs.txt', 'r') as f:
    for line in tqdm(f):
        did, phrase = line.lower().strip().split(' +++$+++ ')
        phrase = tokenize(phrase)
        train.add(did)
        dialogs_all[did].append(phrase)

with open('./challenge_data/test_dialogs.txt', 'r') as f:
    for line in tqdm(f):
        did, phrase = line.lower().strip().split(' +++$+++ ')
        phrase = tokenize(phrase)
        test.add(did)
        dialogs_all[did].append(phrase)



In [19]:
missing_phrases = {}
test_ids = []
missing_phrases_test_orig = []

with open('./challenge_data/train_missing.txt', 'r') as f:
    for line in tqdm(f):
        did, phrase = line.lower().strip().split(' +++$+++ ')
        phrase = tokenize(phrase)
        missing_phrases[did] = phrase

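with open('./challenge_data/test_missing.txt', 'r') as f:
    for i, line in enumerate(tqdm(f)):
        # the first field is ignored; test phrases are keyed by line order
        _, phrase = line.lower().strip().split(' +++$+++ ')
        missing_phrases_test_orig.append(phrase)
        phrase = tokenize(phrase)
        # synthetic key; named test_id to avoid shadowing the builtin id()
        test_id = 't%05d' % i
        missing_phrases[test_id] = phrase
        test_ids.append(test_id)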



In [5]:
all_texts = []
for sentences in dialogs_all.values():
    all_texts.append(' '.join(sentences))
all_texts.extend(missing_phrases.values())

In [6]:
vect = TfidfVectorizer(ngram_range=(1, 3), min_df=4)
vect.fit(all_texts)

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=4,
        ngram_range=(1, 3), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [7]:
with open('tfidf.bin', 'wb') as f:
    cPickle.dump(vect, f)

In [8]:
df_train_dialogs = []

for c in train:
    doc = ' '.join(dialogs_all[c])
    mis = missing_phrases[c]
    df_train_dialogs.append((c, doc, mis))

df_train_dialogs = pd.DataFrame(df_train_dialogs)
df_train_dialogs.columns = ["cid", "dialogue", "missing"]
df_train_dialogs.head()

| | cid | dialogue | missing |
|---|---|---|---|
| 0 | c10424 | come guys life short tell playing chicken mean... | seems fella trying come back d take settlement... |
| 1 | c15145 | jenny says know cook whatever honey send kentu... | looks like whole roast |
| 2 | c15146 | guns robbers bank got guns yeah lot guns oh yeah | well stay away get close |
| 3 | c10427 | galvin look many years ago give shit lawyer ca... | damn right done going ask mistrial going reque... |
| 4 | c15140 | whatsa deal jet comin let em ground got ta kil... | let em ground |


In [9]:
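# the vectorizer uses norm='l2', so rows are unit length and
# the dot product below is the cosine similarity
X_mis = vect.transform(df_train_dialogs.missing).astype('float32')
X_doc = vect.transform(df_train_dialogs.dialogue).astype('float32')

# dot[i, j] = similarity between missing phrase i and dialogue j
dot = (X_mis * X_doc.T).toarray()
# index of the most similar dialogue for each missing phrase
most_similar = (-dot).argsort(axis=1)[:, 0]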

In [10]:
truth = np.arange(len(df_train_dialogs))
np.mean(most_similar == truth)

0.045298958560920546
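
The full-training-set run above materializes a dense 13731 x 13731 similarity matrix. If memory becomes a problem, the top-1 match could be computed block by block instead. The sketch below is an assumption-laden alternative, not what the notebook does: `top1_in_chunks` and `chunk_size` are hypothetical names, it uses `argmax` in place of `(-dot).argsort(axis=1)[:, 0]` (the two can differ only in how ties are broken), and it is written for Python 3 with random sparse matrices standing in for the TF-IDF output.

```python
import numpy as np
from scipy import sparse

def top1_in_chunks(X_mis, X_doc, chunk_size=1000):
    """Best-matching column index per row of X_mis, one row block at a time.

    X_mis, X_doc: CSR matrices with the same number of columns.
    Only a (chunk_size x n_docs) dense block is held in memory at once.
    """
    X_doc_T = X_doc.T.tocsr()
    n = X_mis.shape[0]
    best = np.empty(n, dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        block = (X_mis[start:stop] @ X_doc_T).toarray()
        best[start:stop] = block.argmax(axis=1)
    return best

# Random sparse stand-ins for the TF-IDF matrices (illustration only)
A = sparse.random(50, 30, density=0.2, format='csr', random_state=0)
B = sparse.random(40, 30, density=0.2, format='csr', random_state=1)

chunked = top1_in_chunks(A, B, chunk_size=7)
full = (A @ B.T).toarray().argmax(axis=1)
print(np.array_equal(chunked, full))
```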

In [11]:
len(train), len(test), len(train) / 2

(13731, 6865, 6865)

In [12]:
np.random.seed(1)
df_train_dialogs['fold'] = np.random.choice([0, 1], size=len(train))

In [13]:
for f in [0, 1]:
    fold = df_train_dialogs[df_train_dialogs.fold == f]
    X_mis = vect.transform(fold.missing).astype('float32')
    X_doc = vect.transform(fold.dialogue).astype('float32')

    dot = (X_mis * X_doc.T).toarray()
    most_similar = (-dot).argsort(axis=1)[:, 0]
    truth = np.arange(len(fold))
    print np.mean(most_similar == truth)

0.0629840713138
0.0667828106852


In [14]:
df_test_dialogs = []

for c in test:
    doc = ' '.join(dialogs_all[c])
    df_test_dialogs.append((c, doc))

df_test_dialogs = pd.DataFrame(df_test_dialogs)
df_test_dialogs.columns = ["cid", "dialogue"]
df_test_dialogs.head()

| | cid | dialogue |
|---|---|---|
| 0 | c15144 | sonny gettin real bad vibes jackie talking lat... |
| 1 | c18238 | unemployed alfie boss dead plan ooh-kay |
| 2 | c10421 | kathy price yes yes |
| 3 | c15142 | hey sonny watchin tv kids know sent neighbors ... |
| 4 | c18231 | quite see work done found real treasure terri ... |


In [15]:
test_phrases = [missing_phrases[i] for i in test_ids]

In [16]:
X_mis = vect.transform(test_phrases).astype('float32')
X_doc = vect.transform(df_test_dialogs.dialogue).astype('float32')

dot = (X_mis * X_doc.T).toarray()
most_similar = (-dot).argsort(axis=1)[:, 0]

In [17]:
predictions = list(df_test_dialogs.cid.iloc[most_similar])

In [21]:
zip(predictions, missing_phrases_test_orig)

[('c15144', 'why?'),
 ('c21564', 'you be careful.'),
 ('c04275',
  "there'll be spoils aplenty if you guide us there.  once we breach the walls, help yourself to all you can carry."),
 ('c00468', 'everything is lovely, ted, but much too expensive.'),
 ('c14836',
  'i know, i know! but what more can you expect of me?! i have pared this story down to the marrow to save money but to cut more would be to--!'),
 ('c06603', "i mean, you can play.  you're okay."),
 ('c12273', 'yes.'),
 ('c17029', 'come on, what were you going to say?'),
 ('c05089',
  'well, where i come from it kind of goes with the territory.  texas.'),
 ('c23266',
  'but even here we were supposed to find who knows what... and all we bring back with us is a crate of cigarettes.'),
 ('c23001', "well, there's some flaws in her..."),
 ('c22968', 'neither do i.'),
 ('c15144', 'why?'),
 ('c23983', 'doug, the traffic light...'),
 ('c10332', 'the phone company was broken up.'),
 ('c27433', 'you sound funny. did you do cocaine?'),
