## Use Theanets to Implement Word/Image Vector Fusion
- [source of both idea and code](https://github.com/mganjoo/zslearning)
- [a Theano-based implementation](http://nbviewer.ipython.org/github/renruoxu/data-fusion/blob/master/deprecated/mapping%20(1).ipynb)
- the model is a standard MLP with one hidden layer and a customized cost function
- the data we use here: X (image vectors from DeCAF) and Y (word vectors from word2vec)
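
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass of a one-hidden-layer MLP with the layer sizes used later in this notebook (4096 inputs, 200 tanh units, 100 outputs). The helper names `init_params`, `forward`, and `mse` are hypothetical. The loss shown is plain mean squared error, used as a stand-in, and it leaves out the L1/L2 regularization terms passed to the trainer below.

```python
import numpy as np

def init_params(n_in, n_hidden, n_out, seed=0):
    """Randomly initialise weights for a 1-hidden-layer MLP (illustrative only)."""
    rng = np.random.RandomState(seed)
    W1 = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=1.0 / np.sqrt(n_hidden), size=(n_hidden, n_out))
    b2 = np.zeros(n_out)
    return W1, b1, W2, b2

def forward(params, X):
    """Map image vectors X (n_samples, n_in) to predicted word vectors."""
    W1, b1, W2, b2 = params
    hidden = np.tanh(X.dot(W1) + b1)   # tanh hidden layer
    return hidden.dot(W2) + b2         # linear output layer

def mse(pred, target):
    """Mean squared error, assumed here as a stand-in for the training cost."""
    return np.mean((pred - target) ** 2)

# Toy data with the same dimensionality as the real X and Y
params = init_params(4096, 200, 100)
X_toy = np.random.RandomState(1).normal(size=(8, 4096))
Y_toy = np.random.RandomState(2).normal(size=(8, 100))
pred = forward(params, X_toy)
loss = mse(pred, Y_toy)
```

Training adjusts `W1`, `b1`, `W2`, and `b2` to reduce this loss; in the notebook that step is delegated to theanets.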

In [71]:
import theanets
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.cross_validation import train_test_split
from sklearn.metrics import mean_squared_error, pairwise_distances_argmin, confusion_matrix
from sklearn.preprocessing import StandardScaler
from gensim.models import Word2Vec

In [32]:
LABELS = np.array(["airplane", "automobile", "bird","cat",
                        "deer","dog","frog", "horse", "ship","truck"])
word2vec = Word2Vec.load_word2vec_format("../data/word2vec.bin", binary = True)
label_vecs = np.vstack([word2vec[w] for w in LABELS],)

store = pd.HDFStore("../data/cifa_XY.hd5")
X = store["X/"].get_values()
y = store["Y/"].get_values()
labels = pairwise_distances_argmin(y, label_vecs)

In [74]:
## we train a mapping model with data excluding trucks
## and later test whether the truck images are correctly mapped to word truck

truck_index = (labels == 9)  ## index into LABELS: 9 is truck (5 would be dog)
X_notruck, y_notruck = X[~truck_index], y[~truck_index]
X_truck, y_truck= X[truck_index], y[truck_index]
label_notruck, label_truck = labels[~truck_index], labels[truck_index]

In [75]:
def train_model(X, y):
    
    train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 10000)
    
    ss_X = StandardScaler().fit(train_X)
    ss_y = StandardScaler().fit(train_y)
    scaled_train_X, scaled_valid_X = ss_X.transform(train_X), ss_X.transform(valid_X)
    scaled_train_y, scaled_valid_y = ss_y.transform(train_y), ss_y.transform(valid_y)
    
    exp = theanets.Experiment(theanets.Regressor, layers = (4096, (200, "tanh"), 100))
    for train, valid in exp.itertrain(train_set = (scaled_train_X, scaled_train_y), 
                                  valid_set = (scaled_valid_X, scaled_valid_y), 
                                  optimize = "sgd", learning_rate = 0.005, validate_every = 5,
                                  hidden_l1 = 0.01, weight_l2 = 1e-4):
        print 'train loss(err)', train['loss'], "(%g)" % train["err"], 'valid loss(err)', valid['loss'], "(%g)" % valid['err']
    return ss_X, ss_y, exp.network

ss_X, ss_y, model = train_model(X_notruck, y_notruck)

train loss(err) 1.63725567084 (0.589381) valid loss(err) 3.59682703396 (2.27917)
train loss(err) 1.29476741721 (0.434238) valid loss(err) 3.59682703396 (2.27917)
train loss(err) 1.14287490996 (0.396441) valid loss(err) 3.59682703396 (2.27917)
train loss(err) 1.03678626229 (0.373359) valid loss(err) 3.59682703396 (2.27917)
train loss(err) 0.958347991959 (0.35711) valid loss(err) 3.59682703396 (2.27917)
train loss(err) 0.896924026019 (0.344361) valid loss(err) 0.9409754833 (0.364293)
train loss(err) 0.847685098418 (0.333805) valid loss(err) 0.9409754833 (0.364293)
train loss(err) 0.808607263705 (0.325642) valid loss(err) 0.9409754833 (0.364293)
train loss(err) 0.775400124158 (0.318583) valid loss(err) 0.9409754833 (0.364293)
train loss(err) 0.746353911948 (0.312587) valid loss(err) 0.9409754833 (0.364293)
train loss(err) 0.719666413906 (0.306712) valid loss(err) 0.755750297344 (0.332407)
train loss(err) 0.696232671611 (0.301982) valid loss(err) 0.755750297344 (0.332407)
train loss(err) 0

In [76]:
def predict_by_model(xscaler, yscaler, model, X):
    yhat = yscaler.inverse_transform(model.predict(xscaler.transform(X)))
    return pairwise_distances_argmin(yhat, label_vecs)
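
The decoding step in `predict_by_model` assigns each predicted word vector to its nearest label vector (Euclidean distance is the default metric of `pairwise_distances_argmin`). A NumPy-only sketch of the same idea on made-up 2-D vectors follows; `nearest_label` and all the toy values are hypothetical.

```python
import numpy as np

def nearest_label(pred_vecs, label_vecs):
    """Return, for each predicted vector, the index of the closest label vector."""
    # Broadcast to (n_samples, n_labels, dim), then sum squared differences
    diffs = pred_vecs[:, None, :] - label_vecs[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

# Toy 2-D "word vectors" for three labels (made-up values)
toy_labels = np.array(["cat", "dog", "truck"])
toy_label_vecs = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [-1.0, -1.0]])

# Toy model outputs, each lying close to one label vector
toy_preds = np.array([[0.9, 0.1],
                      [0.2, 0.7],
                      [-0.8, -0.9]])

decoded = toy_labels[nearest_label(toy_preds, toy_label_vecs)]
print(decoded)
```

Because the candidate set is the full list of label vectors, a class that was held out of training can still be chosen at prediction time, provided the model's output lands closer to its word vector than to any other.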

In [82]:
## for images already seen
yhat_notruck = predict_by_model(ss_X, ss_y, model, X_notruck)
NOTRUCK_LABELS = [l for l in LABELS if l != "truck"]
cm = pd.DataFrame(confusion_matrix(label_notruck, yhat_notruck), 
                  index = NOTRUCK_LABELS, columns=NOTRUCK_LABELS)
cm

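| true / predicted | airplane | automobile | bird | cat | deer | dog | frog | horse | ship |
|---|---|---|---|---|---|---|---|---|---|
| airplane | 5431 | 58 | 113 | 31 | 48 | 14 | 24 | 32 | 249 |
| automobile | 77 | 5795 | 8 | 33 | 8 | 7 | 7 | 16 | 49 |
| bird | 153 | 7 | 4907 | 336 | 270 | 104 | 150 | 50 | 23 |
| cat | 41 | 9 | 310 | 4440 | 219 | 672 | 175 | 104 | 30 |
| deer | 52 | 2 | 283 | 204 | 5143 | 96 | 102 | 97 | 21 |
| dog | 9 | 1 | 187 | 640 | 152 | 4852 | 50 | 101 | 8 |
| frog | 25 | 3 | 154 | 233 | 168 | 56 | 5335 | 22 | 4 |
| horse | 43 | 9 | 90 | 244 | 253 | 200 | 18 | 5128 | 15 |
| ship | 186 | 65 | 23 | 27 | 8 | 10 | 3 | 12 | 5666 |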


In [78]:
## for unseen (truck) images - how are they mapped to the words?
yhat_truck = predict_by_model(ss_X, ss_y, model, X_truck)
Counter(LABELS[yhat_truck])

Counter({'automobile': 3746, 'airplane': 1115, 'ship': 553, 'horse': 221, 'cat': 185, 'bird': 66, 'deer': 53, 'dog': 42, 'frog': 19})

**Even though the model never saw truck images during training, most of them (3746 of 6000) are mapped to the word "automobile". Unfortunately, none of the truck images is mapped to the word "truck" itself: the mapping does a better job at interpolation than at extrapolation. For example, the arithmetic relation between the words "truck" and "automobile" is NOT captured by the mapping from images to words, because the image vectors themselves do not have arithmetic relations similar to those in word2vec. This could be caused by having too few words paired with the images.**
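
Since the nearest label is never "truck", one possible way to quantify how close the predictions come is to look at where the held-out label ranks among all label vectors for each prediction. The sketch below is not part of the original experiment; `label_rank` and the toy vectors are hypothetical, and it only illustrates the bookkeeping.

```python
import numpy as np

def label_rank(pred_vecs, label_vecs, target_index):
    """1-based rank of label `target_index` by distance, for each prediction."""
    diffs = pred_vecs[:, None, :] - label_vecs[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    order = dists.argsort(axis=1)  # label indices, nearest first
    # Column position at which the target label appears in each sorted row
    return (order == target_index).argmax(axis=1) + 1

# Made-up 2-D label vectors; index 1 plays the role of the unseen class
toy_label_vecs = np.array([[1.0, 0.0],    # "automobile"
                           [0.8, 0.6],    # "truck" (held out)
                           [-1.0, 0.0]])  # "bird"

# Made-up model outputs for three held-out images
toy_preds = np.array([[0.95, 0.1],
                      [0.85, 0.5],
                      [0.90, 0.2]])

ranks = label_rank(toy_preds, toy_label_vecs, target_index=1)
top2_rate = np.mean(ranks <= 2)  # share of images with the target in the top 2
```

With the real data this would be called as `label_rank(yhat, label_vecs, 9)`, where `yhat` is the inverse-transformed model output for `X_truck`, that is, the value computed inside `predict_by_model` before the argmin.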