### Task Background and Data Pre-processing
#### This Jupyter notebook pre-processes the data. At the end you will have a vocabulary, a label set, and training/validation/test sets.

 In this task, you are asked to predict the top 5 topics for a question, given its title and description.
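
To make the prediction target concrete, here is a minimal sketch of picking the 5 highest-scoring topics from a vector of model scores. The function name `top_k_labels`, the toy scores, and the toy label names are illustrative assumptions, not part of the original notebook.

```python
import heapq

def top_k_labels(scores, index2label, k=5):
    """Return the k labels with the highest scores, best first."""
    best = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return [index2label[i] for i in best]

# Hypothetical scores for a toy label set of 7 topics.
toy_index2label = ['t0', 't1', 't2', 't3', 't4', 't5', 't6']
toy_scores = [0.10, 0.80, 0.05, 0.60, 0.30, 0.90, 0.20]
top5 = top_k_labels(toy_scores, toy_index2label)
print(top5)
```

In the real task `index2label` would be the inverse of the `label2index` dict built later in this notebook.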
 
###### source files: 
 
 1. question_train_set.txt. It gives each question's id and its text (as tokens); you can transform it into train_X.
 
     It has 5 columns, separated by '\t', in the format below:
     
     question_id ct1,ct2,ct3,...,ctn wt1,wt2,wt3,...,wtn cd1,cd2,cd3,...,cdn wd1,wd2,wd3,...,wdn
     
     The second column holds the character tokens of the title, the third the word tokens of the title, the fourth the character tokens of the description, and the fifth the word tokens of the description.
     
 2. question_topic_train_set.txt. It gives each question's id and its labels; you can transform it into train_Y.
 
     It lists the topics associated with each question. It has two columns, separated by '\t'.
     
 3. question_eval_set.txt. It gives each question's id and its text; this will be valid_X. It has the same format as question_train_set.txt.
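
The five-column layout described above can be sketched as a small parser for one raw line. The function name `parse_question_line` and the sample line are hypothetical; the notebook itself reads the files with pandas instead.

```python
def parse_question_line(line):
    """Split one tab-separated record into its id and four token lists."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated columns, got %d" % len(fields))
    # An empty column (e.g. a missing description) becomes an empty list.
    token_lists = [[t for t in col.split(',') if t] for col in fields[1:]]
    title_char, title_word, desc_char, desc_word = token_lists
    return {
        'question_id': fields[0],
        'title_char': title_char,
        'title_word': title_word,
        'desc_char': desc_char,
        'desc_word': desc_word,
    }

# Made-up record with an empty description.
sample_record = parse_question_line("123\tc1,c2,c3\tw1,w2\t\t\n")
print(sample_record)
```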
 
 
###### additional stats information:

1. averaged_length:

   {'desc_char': 117.39879138670524, 'title_char': 22.207077611187056, 'desc_word': 58.272774333851004, 'title_word': 12.841507923253822}

2. Average total length of an input (word and character tokens of title + description): 210.

3. Average number of word tokens in title + description: 71.

4. Average number of character tokens in title + description: 139.

5. Judging from the embedding files, there are about 11k character tokens and 410k word tokens that occur more than 5 times in the data set.
  
6. total unique labels: 1999

###### basic processes

1. In this notebook we use the character tokens of the title and description. The max sequence length is set to 200: any longer sequence is truncated, and any shorter sequence is padded.

2. We generate the vocabulary and label dicts and the training/validation/test data, then save them to cache files (an HDF5 file for the data and a pickle file for the dicts), so they load quickly during training.
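
The truncate-or-pad rule can be sketched in plain Python as follows. This is only an illustration under the assumption that both truncation and padding happen at the end of the sequence; the library call used later in the notebook may be configured differently. The helper name `pad_or_truncate` is hypothetical.

```python
def pad_or_truncate(ids, max_len=200, pad_id=0):
    """Cut a sequence to max_len, or right-pad it with pad_id."""
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

short_example = pad_or_truncate([5, 6, 7], max_len=5)
long_example = pad_or_truncate(range(10), max_len=4)
print(short_example, long_example)
```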


In [39]:
# import some packages
import pandas as pd
from collections import Counter
from tflearn.data_utils import pad_sequences
import random
import numpy as np
import h5py
import pickle
print("import package successful...")

import package successful...


In [40]:
# read source file as csv
base_path='data/ieee_zhihu_cup/'
train_data_x=pd.read_csv(base_path+'question_train_set3.txt',sep='\t', encoding="utf-8")
train_data_y=pd.read_csv(base_path+'question_topic_train_set3.txt',sep='\t', encoding="utf-8")
valid_data_x=pd.read_csv(base_path+'question_eval_set3.txt', sep='\t',encoding="utf-8")

train_data_x=train_data_x.fillna('')
train_data_y=train_data_y.fillna('')
valid_data_x=valid_data_x.fillna('')
print("train_data_x:",train_data_x.shape)
print("train_data_y:",train_data_y.shape)
print("valid_data_x:",valid_data_x.shape)

('train_data_x:', (2999967, 5))
('train_data_y:', (2999967, 2))
('valid_data_x:', (217360, 5))


In [41]:
# understand your data: let's take a look at it
train_data_x.head()

Unnamed: 0,question_id,title_char,title_word,desc_char,desc_word
0,6555699376639805223,"c324,c39,c40,c155,c180,c180,c181,c17,c4,c1153,...","w305,w13549,w22752,w11,w7225,w2565,w1106,w16,w...","c335,c101,c611,c189,c97,c144,c147,c101,c15,c76...","w231,w54,w1681,w54,w11506,w5714,w7,w54,w744,w1..."
1,2887834264226772863,"c44,c110,c101,c286,c106,c150,c101,c892,c632,c1...","w377,w54,w285,w57,w349,w54,w108215,w6,w47986,w...","c1265,c518,c74,c131,c274,c57,c768,c769,c368,c3...","w12508,w1380,w72,w27045,w276,w111"
2,-2687466858632038806,"c15,c768,c769,c1363,c650,c1218,c2361,c11,c90,c...","w875,w15450,w42394,w15863,w6,w95421,w25,w803,w...","c693,c100,c279,c99,c189,c532,c101,c189,c145,c1...","w140340,w54,w48398,w54,w140341,w54,w12856,w54,..."
3,-5698296155734268,"c473,c1528,c528,c428,c295,c15,c101,c188,c146,c...","w8646,w2744,w1462,w9,w54,w138,w54,w50,w110,w14...",,
4,-6719100304248915192,"c190,c147,c105,c219,c220,c101,c647,c219,c220,c...","w380,w54,w674,w133,w54,w134,w614,w54,w929,w307...","c644,c1212,c253,c199,c431,c452,c424,c207,c2,c1...","w4821,w1301,w16003,w928,w1961,w2565,w50803,w11..."


In [44]:
# compute average length of title_char, title_word, desc_char, desc_word

dict_length_columns={'title_char':0,'title_word':0,'desc_char':0,'desc_word':0}
num_examples=len(train_data_x)
train_data_x_small=train_data_x.sample(frac=0.01)
for index, row in train_data_x_small.iterrows():
    title_char_length=len(row['title_char'].split(","))
    title_word_length=len(row['title_word'].split(","))
    desc_char_length=len(row['desc_char'].split(","))
    desc_word_length=len(row['desc_word'].split(","))
    dict_length_columns['title_char']=dict_length_columns['title_char']+title_char_length
    dict_length_columns['title_word']=dict_length_columns['title_word']+title_word_length
    dict_length_columns['desc_char']=dict_length_columns['desc_char']+desc_char_length
    dict_length_columns['desc_word']=dict_length_columns['desc_word']+desc_word_length
dict_length_columns={k:float(v)/float(num_examples*0.01) for k,v in dict_length_columns.items()}
print("dict_length_columns:",dict_length_columns)

# average total length of an input (word + character tokens of title + desc): 210.
# average word tokens of title + desc: 71
# average character tokens of title + desc: 139

('dict_length_columns:', {'desc_char': 117.39879138670524, 'title_char': 22.207077611187056, 'desc_word': 58.272774333851004, 'title_word': 12.841507923253822})
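
One detail worth knowing about the computation above: `''.split(',')` returns `['']`, so an empty title or description is counted as one token. The sketch below shows a variant that counts empty fields as zero and divides by the actual number of sampled rows. The helper names and the toy rows are assumptions for illustration; plain dicts stand in for DataFrame rows.

```python
def count_tokens(field):
    """Number of comma-separated tokens; an empty field counts as zero."""
    return len([t for t in field.split(',') if t.strip()])

def average_lengths(rows, columns):
    """Average token count per column over an iterable of dict-like rows."""
    totals = {c: 0 for c in columns}
    n = 0
    for row in rows:
        n += 1
        for c in columns:
            totals[c] += count_tokens(row[c])
    if n == 0:
        return {c: 0.0 for c in columns}
    return {c: totals[c] / float(n) for c in columns}

# Toy rows; the first one has an empty description.
toy_rows = [
    {'title_char': 'c1,c2', 'desc_char': ''},
    {'title_char': 'c3', 'desc_char': 'c4,c5,c6,c7'},
]
toy_averages = average_lengths(toy_rows, ['title_char', 'desc_char'])
print(toy_averages)
```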


In [42]:
train_data_y.head()

Unnamed: 0,question_id,topic_ids
0,6555699376639805223,"7739004195693774975,3738968195649774859"
1,2887834264226772863,-3149765934180654494
2,-2687466858632038806,-760432988437306018
3,-5698296155734268,"-6758942141122113907,3195914392210930723"
4,-6719100304248915192,"3804601920633030746,4797226510592237555,435133..."


In [53]:
# average number of labels per input
train_data_y_small=train_data_y.sample(frac=0.01)
num_examples=len(train_data_y_small)
num_labels=0
for index, row in train_data_y_small.iterrows():
    topic_ids=row['topic_ids']
    topic_id_list=topic_ids.split(",")
    num_labels+=len(topic_id_list)
average_num_labels=float(num_labels)/float(num_examples)
print("average_num_labels:",average_num_labels)


('average_num_labels:', 2.3440333333333334)


In [43]:
valid_data_x.head()

Unnamed: 0,question_id,title_char,title_word,desc_char,desc_word
0,6215603645409872328,"c924,c531,c102,c284,c188,c104,c98,c107,c11,c11...","w1340,w1341,w55,w1344,w58,w6,w24178,w26959,w47...","c1128,c529,c636,c572,c1321,c139,c540,c223,c510...","w4094,w1618,w20104,w19234,w1097,w1005,w4228,w2..."
1,6649324930261961840,"c346,c1549,c413,c294,c675,c504,c183,c74,c541,c...","w40132,w1357,w1556,w1380,w2464,w33,w16791,w109...",,
2,-4251899610700378615,"c96,c97,c97,c98,c99,c100,c101,c141,c42,c42,c10...","w53,w54,w1779,w54,w1309,w54,w369,w949,w65587,w...","c149,c148,c148,c42,c185,c95,c95,c186,c186,c186...",
3,6213817087034420233,"c504,c157,c221,c221,c633,c468,c469,c1637,c1072...","w5083,w12537,w10427,w29724,w6,w2566,w11,w18476...","c15,c131,c39,c40,c85,c166,c969,c2456,c17,c636,...","w2550,w24,w239,w98,w19456,w11,w108710,w3483,w2..."
4,-8930652370334418373,"c0,c310,c35,c122,c123,c11,c317,c91,c175,c476,c...","w33792,w21,w83,w6,w21542,w21,w140670,w25,w1110...",,


In [46]:
# create vocabulary dict and label dict, generate training/validation data, and save them to disk

# create the vocabulary of character tokens by reading char_embedding.txt
with open(base_path+'unused_current/char_embedding.txt') as word_embedding_object:
    lines_wv=word_embedding_object.readlines()
char_list=[]
char_list.extend(['PAD','UNK','CLS','SEP','unused1','unused2','unused3','unused4','unused5'])
PAD_ID=0
UNK_ID=1
for i, line in enumerate(lines_wv):
    if i==0: continue
    char_embedding_list=line.split(" ")
    char_token=char_embedding_list[0]
    char_list.append(char_token)    
    
# write to vocab.txt under data/ieee_zhihu_cup
vocab_path=base_path+'vocab.txt'
vocab_char_object=open(vocab_path,'w')

word2index={}
for i, char in enumerate(char_list):
    if i<10:print(i,char)
    word2index[char]=i
    vocab_char_object.write(char+"\n")
vocab_char_object.close()
print("vocabulary of char generated....")

(0, 'PAD')
(1, 'UNK')
(2, 'CLS')
(3, 'SEP')
(4, 'unused1')
(5, 'unused2')
(6, 'unused3')
(7, 'unused4')
(8, 'unused5')
(9, '</s>')
vocabulary of char generated....
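
The vocabulary construction above can be condensed into one function, sketched below on toy data. It assumes, as the cell above does by skipping line 0, that the first line of the embedding file is a header, and that each following line starts with the token followed by its vector. The function name `build_vocab` and the toy lines are hypothetical.

```python
SPECIAL_TOKENS = ['PAD', 'UNK', 'CLS', 'SEP',
                  'unused1', 'unused2', 'unused3', 'unused4', 'unused5']

def build_vocab(embedding_lines, special_tokens=SPECIAL_TOKENS):
    """Map each token to an index; special tokens take the first indices."""
    tokens = list(special_tokens)
    for i, line in enumerate(embedding_lines):
        if i == 0:
            continue  # assumed header line, e.g. "<count> <dim>"
        parts = line.split(' ')
        if parts and parts[0].strip():
            tokens.append(parts[0])
    return {tok: idx for idx, tok in enumerate(tokens)}

# Made-up embedding file content with 2-dimensional vectors.
toy_lines = ["3 2\n", "</s> 0.1 0.2\n", "c1 0.3 0.4\n", "c2 0.5 0.6\n"]
toy_vocab = build_vocab(toy_lines)
print(toy_vocab['PAD'], toy_vocab['SEP'], toy_vocab['</s>'], toy_vocab['c1'])
```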


In [47]:
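# generate the label list, and save it to the file system
c_labels=Counter()
# note: only the first 100000 rows are scanned; a label that never appears in them will be missing from label2index
train_data_y_small=train_data_y[0:100000]#.sample(frac=0.1)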
for index, row in train_data_y_small.iterrows():
    topic_ids=row['topic_ids']
    topic_list=topic_ids.split(',')
    c_labels.update(topic_list)

label_list=c_labels.most_common()
label2index={}
label_target_object=open(base_path+'label_set.txt','w')
for i, label_freq in enumerate(label_list):
    label,freq=label_freq
    label2index[label]=i
    label_target_object.write(label+"\n")
    if i<20: print(label,freq)
label_target_object.close()
print("generate label dict successful...")

(u'7476760589625268543', 2308)
(u'4697014490911193675', 1746)
(u'-4653836020042332281', 1579)
(u'-8175048003539471998', 1475)
(u'-8377411942628634656', 1382)
(u'-7046289575185911002', 1338)
(u'-5932391056759866388', 1283)
(u'2787171473654490487', 1145)
(u'-7129272008741138808', 1085)
(u'2587540952280802350', 1079)
(u'-4931965624608608932', 1079)
(u'-6748914495015758455', 1049)
(u'-5513826101327857645', 993)
(u'2347973810368732059', 970)
(u'9069451131871918127', 958)
(u'-8132909213241034354', 904)
(u'-3517637179126242000', 867)
(u'-5872443091340192918', 834)
(u'-3522198575349379632', 830)
(u'1127459907694805235', 829)
generate label dict successful...


In [48]:
def transform_multilabel_as_multihot(label_list,label_size):
    """
    convert a list of label indices to a multi-hot vector
    :param label_list: e.g. [0,1,4]; each index marks a position that is set to 1
    :param label_size: total number of labels, e.g. 1999
    :return: e.g. [1,1,0,0,1,0,0,...]
    """
    result=np.zeros(label_size)
    # set those positions to 1; everything else stays 0.
    result[label_list] = 1
    return result

label_list=[0,1,2,10]
label_size=20
label_list_sparse=transform_multilabel_as_multihot(label_list,label_size)
print("label_list_sparse:",label_list_sparse)

('label_list_sparse:', array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.]))


In [49]:

def get_X_Y(train_data_x,train_data_y,label_size, test_mode=False):
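    """
    get X and Y given input and labels
    input:
    train_data_x: DataFrame with 'title_char' and 'desc_char' columns
    train_data_y: DataFrame with a 'topic_ids' column
    label_size: total number of unique labels (e.g. 1999 in this task)
    test_mode: if True, use only the first 1000 rows
    output:
    X: list of token-id lists (title ids + SEP + description ids)
    Y: list of multi-hot label vectors
    """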
    X=[]
    Y=[]
    if test_mode:
        train_data_x_tiny_test=train_data_x[0:1000] # small slice for a quick test run
        train_data_y_tiny_test=train_data_y[0:1000] # matching slice of the labels
    else:
        train_data_x_tiny_test=train_data_x
        train_data_y_tiny_test=train_data_y

    for index, row in train_data_x_tiny_test.iterrows():
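        if index==0: continue # note: this skips the first row; the label loop below skips it too, so X and Y stay aligned
        # get character tokens of title and desc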
        title_char=row['title_char']
        desc_char=row['desc_char']
        # split into list
        title_char_list=title_char.split(',')
        desc_char_list=desc_char.split(",")
        # transform to indices
        title_char_id_list=[word2index.get(x,UNK_ID) for x in title_char_list if x.strip()]
        desc_char_id_list=[word2index.get(x,UNK_ID) for x in desc_char_list if x.strip()]
        # merge title and desc: in the middle is special token 'SEP'
        title_char_id_list.append(word2index['SEP'])
        title_char_id_list.extend(desc_char_id_list)
        X.append(title_char_id_list)
        if index<3: print(index,title_char_id_list)
        if index%100000==0: print(index,title_char_id_list)

    for index, row in train_data_y_tiny_test.iterrows():
        if index==0: continue
        topic_ids=row['topic_ids']
        topic_id_list=topic_ids.split(",")
        label_list_dense=[label2index[l] for l in topic_id_list if l.strip()]
        label_list_sparse=transform_multilabel_as_multihot(label_list_dense,label_size)
        Y.append(label_list_sparse)
        if index%100000==0: print(index,";label_list_dense:",label_list_dense)

    return X,Y
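
As a small worked example of the per-row encoding inside `get_X_Y`, the sketch below encodes one title/description pair with a toy vocabulary. The helper name `encode_example` and the toy vocabulary are assumptions made for illustration only.

```python
def encode_example(title_char, desc_char, vocab, unk_id=1):
    """Encode one question as: title ids + SEP id + description ids."""
    title_ids = [vocab.get(t, unk_id) for t in title_char.split(',') if t.strip()]
    desc_ids = [vocab.get(t, unk_id) for t in desc_char.split(',') if t.strip()]
    return title_ids + [vocab['SEP']] + desc_ids

toy_vocab = {'PAD': 0, 'UNK': 1, 'CLS': 2, 'SEP': 3, 'c1': 4, 'c2': 5}
# 'c999' is not in the toy vocabulary, so it maps to UNK (1).
encoded = encode_example('c1,c2', 'c2,c999', toy_vocab)
# An empty description leaves only the title followed by SEP.
encoded_no_desc = encode_example('c1', '', toy_vocab)
print(encoded, encoded_no_desc)
```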

In [50]:
word2index['SEP']

3

In [51]:
def save_data(cache_file_h5py,cache_file_pickle,word2index,label2index,train_X,train_Y,vaild_X,valid_Y,test_X,test_Y):
    # train/valid/test data using h5py
    f = h5py.File(cache_file_h5py, 'w')
    f['train_X'] = train_X
    f['train_Y'] = train_Y
    f['vaild_X'] = vaild_X
    f['valid_Y'] = valid_Y
    f['test_X'] = test_X
    f['test_Y'] = test_Y
    f.close()
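    # save word2index, label2index
    # open with 'wb' so a re-run overwrites the file; with 'ab', a later pickle.load would return the stale first object
    with open(cache_file_pickle, 'wb') as target_file:
        pickle.dump((word2index,label2index), target_file)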
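
For the pickle part of the cache, here is a minimal round-trip sketch: write the two dicts, write them again, and read them back. It opens the file in `'wb'` mode, because with append mode a single `pickle.load` returns only the first object ever written. The helper names and the temporary path are hypothetical.

```python
import os
import pickle
import tempfile

def save_dicts(path, word2index, label2index):
    # 'wb' truncates the file, so a re-run never leaves stale data behind.
    with open(path, 'wb') as f:
        pickle.dump((word2index, label2index), f)

def load_dicts(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

toy_path = os.path.join(tempfile.mkdtemp(), 'vocab_label.pik')
save_dicts(toy_path, {'PAD': 0}, {'a': 0})                      # first run
save_dicts(toy_path, {'PAD': 0, 'UNK': 1}, {'a': 0, 'b': 1})    # re-run
loaded_w2i, loaded_l2i = load_dicts(toy_path)
print(loaded_w2i, loaded_l2i)
```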

In [52]:
# generate training/validation/test data using source file and vocabulary/label set.
#  get X,Y---> shuffle and split data----> save to file system.
test_mode=False
label_size=len(label2index)
cache_path_h5py=base_path+'data.h5'
cache_path_pickle=base_path+'vocab_label.pik'
max_sentence_length=200

# step 1: get (X,y) 
X,Y=get_X_Y(train_data_x,train_data_y,label_size,test_mode=test_mode)

# pad and truncate to a max_sequence_length
X = pad_sequences(X, maxlen=max_sentence_length, value=0.)  # padding to max length

# step 2. shuffle, split,
xy=list(zip(X,Y))
random.Random(10000).shuffle(xy)
X,Y=zip(*xy)
X=np.array(X); Y=np.array(Y)
num_examples=len(X)
num_valid=20000
num_test=20000
num_train=num_examples-(num_valid+num_test)
train_X, train_Y=X[0:num_train], Y[0:num_train]
vaild_X, valid_Y=X[num_train:num_train+num_valid], Y[num_train:num_train+num_valid]
test_X, test_Y=X[num_train+num_valid:], Y[num_train+num_valid:]
print("num_examples:",num_examples,";X.shape:",X.shape,";Y.shape:",Y.shape)
print("train_X:",train_X.shape,";train_Y:",train_Y.shape,";vaild_X.shape:",vaild_X.shape,";valid_Y:",valid_Y.shape,";test_X:",test_X.shape,";test_Y:",test_Y.shape)

# step 3: save to file system
save_data(cache_path_h5py,cache_path_pickle,word2index,label2index,train_X,train_Y,vaild_X,valid_Y,test_X,test_Y)
print("save cache files to file system successfully!")

del X,Y,train_X, train_Y,vaild_X, valid_Y,test_X, test_Y


(1, [110, 143, 11, 31, 35, 28, 11, 522, 1392, 197, 667, 12, 194, 915, 1611, 509, 58, 67, 33, 15, 60, 64, 84, 1417, 648, 268, 66, 143, 109, 16, 3, 543, 96, 64, 26, 73, 19, 67, 33, 363, 601, 16])
(2, [58, 67, 33, 2152, 562, 1354, 822, 12, 137, 1690, 165, 13, 134, 95, 93, 12, 356, 529, 43, 119, 16, 3, 624, 24, 91, 120, 106, 203, 11, 106, 52, 106, 14, 120, 120, 359, 11, 55, 24, 14, 401, 52, 11, 14, 21, 11, 37, 11, 90, 57, 83, 21, 36, 52, 11, 14, 83, 34, 11, 52, 21, 57, 55, 52, 11, 76, 359, 11, 20, 28, 11, 3662, 11, 20, 27, 11, 90, 57, 83, 21, 36, 52, 11, 345, 742, 84, 669, 239, 36, 21, 21, 55, 185, 38, 38, 39, 39, 39, 35, 79, 14, 52, 21, 46, 57, 401, 30, 34, 52, 35, 57, 46, 79, 38, 42, 57, 83, 21, 24, 83, 21, 38, 28, 28, 38, 24, 83, 38, 624, 24, 91, 120, 106, 203, 431, 28, 23, 906, 431, 28, 23, 310, 426, 624, 455, 38, 624, 24, 91, 120, 106, 203, 431, 28, 23, 906, 431, 28, 23, 310, 426, 624, 455, 431, 28, 23, 30, 83, 431, 28, 23, 318, 83, 91, 14, 83, 21, 52, 35, 36, 21, 90, 120])
(3, [260, 

(1600000, [135, 1054, 1612, 53, 12, 325, 339, 10, 430, 109, 15, 114, 737, 10, 924, 527, 524, 81, 61, 168, 60, 892, 536, 10, 54, 337, 15, 337, 1703, 1383, 16, 3, 17, 32, 135, 1054, 47, 1612, 986, 53, 18, 43, 18, 667, 12, 2029, 144, 10, 270, 392, 109, 78, 236, 3782, 1897, 657, 854, 10, 63, 13, 78, 17, 406, 118, 375, 213, 746, 194, 18, 299, 430, 292, 10, 322, 104, 1612, 986, 53, 12, 430, 292, 18, 220, 172, 17, 1691, 115, 84, 22, 54, 13, 408, 277, 195, 10, 17, 192, 96, 138, 937, 698, 16, 868, 612, 61, 18, 118, 10, 924, 527, 12, 241, 116, 204, 421, 164, 12, 15, 189, 10, 1612, 986, 53, 12, 325, 339, 392, 109, 15, 114, 737, 66, 108, 71, 59, 245, 195, 22])
(1700000, [532, 413, 1128, 566, 13, 18, 183, 94, 102, 12, 235, 433, 16, 3, 32, 653, 18, 471, 532, 413, 110, 1128, 566, 10, 32, 532, 413, 268, 123, 130, 268, 93, 74, 515, 231, 1395, 110, 99, 19, 131, 145, 235, 433, 16, 11, 86, 253, 30, 34, 24, 57, 11, 30, 34, 50, 25, 31, 29, 28, 20, 40, 25, 11, 34, 14, 21, 14, 45, 52, 39, 91, 106, 46, 120, 50

save cache files to file system successfully!
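
The shuffle-and-split step can be sketched on toy data as below. The function name `shuffle_and_split` is hypothetical; the seed value mirrors the one used in the cell above, and lists of pairs stand in for the NumPy arrays.

```python
import random

def shuffle_and_split(X, Y, num_valid, num_test, seed=10000):
    """Shuffle (x, y) pairs together, then cut them into train/valid/test."""
    pairs = list(zip(X, Y))
    random.Random(seed).shuffle(pairs)  # fixed seed gives a repeatable split
    num_train = len(pairs) - num_valid - num_test
    if num_train < 0:
        raise ValueError("not enough examples for the requested split")
    train = pairs[:num_train]
    valid = pairs[num_train:num_train + num_valid]
    test = pairs[num_train + num_valid:]
    return train, valid, test

toy_X = list(range(10))
toy_Y = [x * 10 for x in toy_X]
toy_train, toy_valid, toy_test = shuffle_and_split(toy_X, toy_Y, 2, 2)
print(len(toy_train), len(toy_valid), len(toy_test))
```

Shuffling the pairs rather than X and Y separately is what keeps each input attached to its own label vector.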


### TODO 1: use topic information
Below are some things you can do to get a better model.

If you want better performance, you can use pre-trained word and character embeddings.

Additionally, if you want to model this task in a better way, you can use topic information. You can find it in topic_info.txt, where each topic is associated with:

1. its parent topics (zero, one or more);

2. the character tokens of the topic's name;

3. the word tokens of the topic's name;

4. the character tokens of the topic's description;

5. the word tokens of the topic's description.

In [13]:
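# note: the file appears to have no header row and header=None is not passed, so the first record ends up as the column names below
topic_info_data=pd.read_csv(base_path+'topic_info.txt', sep='\t',encoding="utf-8")
topic_info_data.head()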

Unnamed: 0,738845194850773558,-5833678375673307423,"c0,c1",w0,"c0,c1,c2,c3,c4,c5,c6,c7,c0,c1,c8,c9,c10,c11,c12,c13,c14,c15,c16,c11,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c20,c31,c24,c25,c26,c27,c11,c24,c32,c33,c34,c35,c36,c31,c8,c37,c38","w0,w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12,w13,w14,w15,w16,w17,w18,w15,w6,w19,w20,w21,w22,w23"
0,3738968195649774859,2027693463582123305,"c39,c40",w24,"c41,c42,c43,c39,c40,c4,c44,c45,c46,c47,c48,c49...","w24,w25,w26,w27,w28,w6,w29,w30,w11,w31,w32,w33..."
1,4738849194894773882,1127459907694805235,"c172,c31,c0,c1",w102,,
2,7739004195693774975,"2904932941037075699,1160326435131345730,725917...","c39,c40,c5,c173",w103,"c39,c40,c23,c21,c174,c74,c5,c173,c17,c35,c39,c...","w104,w105,w11,w21,w24,w6,w106,w23,w54,w24,w107..."
3,-7261194805221226386,-5833678375673307423,"c36,c31,c45,c237",w148,"c238,c239","w149,w150"
4,-3689337711138901728,"-2689200710357900655,-1689319711084901730","c215,c147,c105,c284,c97,c97,c168,c101,c146,c14...","w205,w54,w206","c196,c197,c0,c1,c313,c314,c315,c316,c317,c200,...","w125,w207,w208,w209,w166,w167,w23"


### TODO 2: use both character and word tokens
In this notebook we use only character tokens, which is fine. However, as many people have observed, representing inputs with word tokens may perform better, and using both word and character tokens can perform better still.

One drawback of word tokens is that there are far more words than characters. For example, in this task about 410k word tokens occur more than 5 times, while only about 11k character tokens do, so much more memory is needed.

But you can still give it a try. With word tokens only, sequences are shorter than with character tokens: only about 50% of the length is needed.
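
To put the memory remark in numbers, here is a rough estimate of the embedding table size for both vocabularies. The embedding dimension of 256 and the 4 bytes per value (float32) are assumptions chosen for illustration; only the vocabulary sizes come from the text above.

```python
def embedding_table_megabytes(vocab_size, embed_dim, bytes_per_value=4):
    """Approximate size of a dense embedding table in MiB."""
    return vocab_size * embed_dim * bytes_per_value / (1024.0 * 1024.0)

EMBED_DIM = 256  # assumed embedding dimension, not stated in the notebook

word_mb = embedding_table_megabytes(410000, EMBED_DIM)
char_mb = embedding_table_megabytes(11000, EMBED_DIM)
print("word table: %.1f MiB, char table: %.1f MiB" % (word_mb, char_mb))
```

Whatever dimension is chosen, the word table is about 37 times larger than the character table, since the ratio depends only on the vocabulary sizes.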

### TODO 3: use pre-trained character and word embedding
