Goal: Analyze duplicate matrix of questions.

So far we've been focusing efforts on just looking at question pairs. But the question IDs also give some useful information, an undirected graph between questions. Let's see if we can extract some information from them.

In [1]:
import pandas as pd
# Make my plots pretty!
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['savefig.dpi'] = 100
mpl.rcParams['figure.dpi'] = 100

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
# Read training data.
train = pd.read_csv('../data/train.csv')

In [3]:
train[:10]

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [4]:
qid1 = train['qid1']
qid2 = train['qid2']
qid1_unique = qid1.unique()
qid2_unique = qid2.unique()
print('Q1 unique: {0}/{2} ({1}%)'.format(len(qid1_unique), len(qid1_unique) * 100.0 / len(qid1), len(qid1)))
print('Q2 unique: {0}/{2} ({1}%)'.format(len(qid1_unique), len(qid1_unique) * 100.0 / len(qid1), len(qid2)))
qid1_qid2 = set(list(qid1_unique) + list(qid2_unique))
print('Q1 U Q2 unique: {0}/{2} ({1}%)'.format(len(qid1_qid2), len(qid1_qid2) * 100.0 / len(qid1), len(qid1)))

Q1 unique: 290654/404290 (71.8924534369%)
Q2 unique: 290654/404290 (71.8924534369%)
Q1 U Q2 unique: 537933/404290 (133.056222019%)


In [5]:
dups = train[train['is_duplicate'] == 1][['qid1', 'qid2']]
len(dups)

149263

In [11]:
q1 = train[['qid1', 'question1']]
q1.columns = ['qid', 'question']
q1[1253:1256]

Unnamed: 0,qid,question
1253,2498,Were India and China part of Turkey at any poi...
1254,2500,What are the biggest regrets of your life?
1255,2502,What are the places to visit for a honeymoon i...


In [12]:
q2 = train[['qid2', 'question2']]
q2.columns = ['qid', 'question']
q2[1253:1256]

Unnamed: 0,qid,question
1253,2499,Who were the most powerful countries in the wo...
1254,2501,What's your biggest regret in life?
1255,2503,What all places can one visit on a two day tri...


In [13]:
qs = pd.concat([q1, q2], axis=0).drop_duplicates('qid')
len(qs)

537933

In [20]:
qs.reset_index(drop=True).to_csv('../data/all_questions.csv', header=False, index=False)
!head -10000 ../data/all_questions.csv | tail

19757,What movies or TV series are about playing mind games?
19759,How long does it take Apple to approve a company's enrollment?
19761,Why the greatest empire in Indian history is centered around Bihar?
19763,How do I download videos in smartphone?
19765,Which is the best college for hotel management worldwide?
19767,Can anyone remember past life?
19769,What never fails to make you smile?
19771,Carolina Panthers Live Streaming | Watch Carolina Panthers Live Stream NFL Games Today Online?
19773,Can I start a BPO without experiance?
19775,What were your experiences when you had roll no. 1?


In [23]:
!wc -l ../data/all_questions.csv # Some rows have newlines on them.

537942 ../data/all_questions.csv


# After Getting Results

Used a mapreduce job https://github.com/jfkelley/hadoop-matrix-mult to get the transitive closure (i.e. link up all the duplicate chains)

From dumbo:

In [24]:
!tail ../data/all_duplicates.tsv

116636	116636	3.0
122039	122039	2.0
275072	275072	1.0
114203	114203	1.0
338360	338360	1.0
477209	477209	1.0
80153	17456	109.0
260825	101894	3.0
290732	290732	1.0
270374	270374	1.0


In [26]:
qs[qs['qid'] == 80153]

Unnamed: 0,qid,question
44674,80153,How should I loose weight?


I don't know what I expected.

In [27]:
alldups = pd.read_csv('../data/all_duplicates.tsv', header=None, index_col=None, sep='\t')

In [31]:
alldups.columns = ['qid1', 'qid2', 'num_dups']
unique_dup = alldups[alldups['qid1'] != alldups['qid2']]

In [125]:
unique_dup[['qid1', 'qid2']].to_csv('../data/unique_duplicates.csv', index=False, header=False)
!tail ../data/unique_duplicates.csv

139301,38558
104567,6740
75410,18554
155918,207206
88700,45413
404612,4904
384662,500480
44294,255809
80153,17456
260825,101894


In [72]:
duplist = list((t.qid1, t.qid2) for t in unique_dup.itertuples())
dupset = set(duplist)

In [33]:
len(duplist)

457096

How to generate interesting batches? Each batch should have a number of sentences, some duplicates

In [52]:
dup_lookup = {qid: [] for qid in list(alldups['qid1'])}
for qid1, qid2 in duplist:
    dup_lookup[qid1].append(qid2)

In [None]:
itertools.

In [121]:
np.random.shuffle(duplist)
dups = duplist[:5]
id_select = [qid1 for qid1, qid2 in dups]
dds = (dup_lookup[qid][:10] for qid in id_select)
dups_for_id = list(x for t in dds for x in t)
np.random.shuffle(dups_for_id)
batch = (id_select + dups_for_id)[:25]

mtx = np.zeros((len(batch), len(batch)), dtype=np.int32)
for i, q1 in enumerate(batch):
    for j, q2 in enumerate(batch):
        mtx[i,j] = (q1, q2) in dupset

print(len(batch))
print(batch)
print(mtx.sum() * 1.0 / len(batch) / len(batch))
mtx

25
[118411, 67680, 74366, 459349, 92946, 83082, 407394, 1096, 160276, 67679, 459348, 145009, 25852, 99695, 156420, 118410, 307268, 83081, 65400, 101153, 49223, 106454, 5043, 23366, 112180]
0.2464


array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
        0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
        1, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
        0, 1, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
        1, 0, 1],
       [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
        0, 1, 0],
       [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
        0, 1, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      

In [240]:
# Sentences with the most duplicates
dups = list(qcounts[-500:].index)

In [241]:
def sampler(sz=10):
    np.random.shuffle(dups)
    d = dups[:sz]
    print(d)
    sample = np.zeros((sz, sz))
    for i, x in enumerate(d):
        for j, y in enumerate(d[i+1:]):
            if (x, y) in dupset:
                sample[i,j] = 1
    return sample

In [242]:
(80153, 17456) in dupset

True

In [243]:
sampler(10)

[57359, 19329, 69550, 42615, 80153, 17258, 178157, 13748, 72620, 62742]


array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [237]:
len(qcounts[qcounts < 2])

388283

In [238]:
len(alldups)

995029

In [239]:
len(qcounts[qcounts < 2]) * 1.0 / len(alldups)

0.39022279752650424

Basically, 380k questions (40%) don't have duplicates in the database. The duplicate graphs are also a tiny part of the entire database. Only 1 out of 1 million possible pairs are duplicates.