## Exploration of explicitly encoding Lexical Patterns into Neural Models

In [1]:
%ls

Lexical Exploration.ipynb  composition_learning.py
README.md                  file_util.py
__pycache__/               repeval_exploration.ipynb


In [2]:
%ls ..

GoogleNews-vectors-negative300.bin  notes.txt
amod.train.filtered                 repeval2017/


In [2]:
import numpy as np
import composition_learning
from gensim.models import KeyedVectors as embedding_model

Using Theano backend.


First, load the goal nouns, nouns, and adjectives from the TSV file:

In [6]:
import csv
path_to_adj_noun_file = "../amod.train.filtered"
gnoun_noun_adj_list = [] #a list of lists, each sublist containing a goal-noun, noun, adjective triple
goal_noun_list = [] #only the goal nouns, needed for functions from composition_learning.py.
with open(path_to_adj_noun_file, "r") as tsv_file:
    tsv_reader = csv.reader(tsv_file, delimiter='\t')
    for row in tsv_reader:
        gnoun_noun_adj_list.append(row)
        goal_noun_list.append(row[0].upper())

Next, load the word embeddings from the pretrained model:

In [3]:
path_to_word_embeddings = "../data/GoogleNews-vectors-negative300.bin"
vector_space = embedding_model.load_word2vec_format(path_to_word_embeddings, binary=True)

Convert the data into the format expected by the Keras model:

In [7]:
training_data, training_labels, not_in_list = composition_learning.construct_data_and_labels(gnoun_noun_adj_list,
                                                                                             vector_space,
                                                                                             goal_noun_list,
                                                                                             verbosity = 0
                                                                                            )

Train the models. `tensor_model` is the model in which the adjective vector is multiplied with a tensor to produce a matrix, which is then multiplied with the noun vector to yield a vector again.
`weighted_model` is an additive model with one weight matrix each for the adjective and the noun.
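As a rough illustration of the two composition functions described above, the following NumPy sketch uses randomly initialised parameters. The function names `compose_tensor` and `compose_weighted_add`, the tensor shape, and the toy dimensionality are assumptions made for illustration; the actual layers live in `composition_learning.py` and may differ.

```python
import numpy as np

def compose_tensor(tensor, adj_vec, noun_vec):
    """Contract a (d, d, d) tensor with the adjective vector to get a (d, d)
    matrix, then multiply that matrix with the noun vector."""
    adj_matrix = np.einsum('ijk,k->ij', tensor, adj_vec)
    return adj_matrix @ noun_vec

def compose_weighted_add(w_adj, w_noun, adj_vec, noun_vec):
    """Additive model with one weight matrix per input word."""
    return w_adj @ adj_vec + w_noun @ noun_vec

d = 4  # toy dimensionality (assumption); the GoogleNews vectors have 300 dimensions
rng = np.random.default_rng(0)
adj_vec, noun_vec = rng.normal(size=d), rng.normal(size=d)

tensor = rng.normal(size=(d, d, d))
phrase_tensor = compose_tensor(tensor, adj_vec, noun_vec)

# With identity weight matrices the additive model reduces to plain vector addition.
phrase_add = compose_weighted_add(np.eye(d), np.eye(d), adj_vec, noun_vec)
print(phrase_tensor.shape, phrase_add.shape)
```

Both functions map two d-dimensional word vectors to one d-dimensional phrase vector, so the result can be compared directly with the embedding of the goal noun.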

In [21]:
tensor_model = composition_learning.train_model(training_data, training_labels, 
                                         composition_mode = 'tensor_mult_identity', verbosity=1)

Trainiere NN mit mode weighted_adj_and_noun_add_identity...
Trainiere NN mit mode tensor_mult_identity...
Finished Training of the Models!


In [15]:
weighted_model = composition_learning.train_model(training_data, training_labels,
                                                composition_mode = 'weighted_adj_and_noun_add_identity', verbosity=1)
print("Finished Training of the Models!")

Trainiere NN mit mode weighted_adj_and_noun_add_identity...
Finished Training of the Models!


Save the models for later:

In [25]:
%cd .. 
%mkdir models 
%cd models

/Users/Fabian/Documents/Arbeit/AG_SD/repeval2017
/Users/Fabian/Documents/Arbeit/AG_SD/repeval2017/models


In [34]:
from keras.models import save_model
save_model(tensor_model, "../models/tensor_model")
save_model(weighted_model, "../models/weighted_model")

In [30]:
%cd ../repeval2017/

/Users/Fabian/Documents/Arbeit/AG_SD/repeval2017/repeval2017


Load the models again:

In [12]:
from keras.models import load_model
from composition_learning import MagicOperation
tensor_model = load_model("../models/tensor_model", custom_objects={'MagicOperation':MagicOperation})
weighted_model = load_model("../models/weighted_model", custom_objects={'MagicOperation':MagicOperation})

TypeError: __init__() missing 1 required positional argument: 'output_dim'
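
One plausible cause of this error, stated as an assumption because the definition of `MagicOperation` is not shown here: Keras rebuilds a custom layer by calling its constructor with the saved config, so a constructor argument that the layer's `get_config` does not return is missing at load time. The sketch below imitates that mechanism in plain Python without importing Keras; `FakeLayer`, `BrokenOperation`, and `FixedOperation` are hypothetical stand-ins, not the real classes.

```python
class FakeLayer:
    """Simplified stand-in for a Keras layer base class (hypothetical, not Keras itself)."""
    def __init__(self, name=None):
        self.name = name

    def get_config(self):
        # The base class only knows about its own arguments.
        return {'name': self.name}

    @classmethod
    def from_config(cls, config):
        # The constructor is called with whatever was saved in the config.
        return cls(**config)


class BrokenOperation(FakeLayer):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.output_dim = output_dim
    # get_config is inherited unchanged, so output_dim is never saved.


class FixedOperation(BrokenOperation):
    def get_config(self):
        # Add the extra constructor argument to the saved config.
        config = super().get_config()
        config['output_dim'] = self.output_dim
        return config


try:
    BrokenOperation.from_config(BrokenOperation(300, name='op').get_config())
    error_message = None
except TypeError as err:
    error_message = str(err)
print(error_message)

restored = FixedOperation.from_config(FixedOperation(300, name='op').get_config())
print(restored.output_dim)
```

If this is indeed the cause, adding a `get_config` method that includes `output_dim` to `MagicOperation` and saving the models again should let `load_model` succeed.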

To compute vectors with the models, proceed as follows:

In [None]:
goal_noun, noun, adj = gnoun_noun_adj_list[0]


goal_noun_tensor_vector = tensor_model.predict(np.asarray([[vector_space[adj], vector_space[noun]]]))[0, 0]
goal_noun_weighted_add_vector = weighted_model.predict(np.asarray([[vector_space[adj], vector_space[noun]]]))[0]

Todo:
+ Reproduce Matthias's Keras error
+ Load the Turney dataset and bring it into a tidy structure
+ Evaluate:
    - Let the model pick the best candidate for each sample
         - how many are chosen correctly? how many wrong ones are correctly not chosen? (read up)
         - average rank of the correct candidate (read up)
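
The two measures from the list above could be computed roughly as follows. This is a sketch under the assumption that each sample yields a list of candidate indices sorted from nearest to farthest, together with the index of the correct candidate; the function names and the toy rankings are invented for illustration.

```python
from statistics import mean

def top1_accuracy(rankings, gold_indices):
    """Fraction of samples whose nearest candidate is the correct one."""
    hits = [ranking[0] == gold for ranking, gold in zip(rankings, gold_indices)]
    return sum(hits) / len(hits)

def average_rank(rankings, gold_indices):
    """Mean 1-based position of the correct candidate in the sorted list."""
    return mean(ranking.index(gold) + 1 for ranking, gold in zip(rankings, gold_indices))

# Toy data: four samples with three candidates each, correct candidate has index 1.
rankings = [[1, 3, 2], [2, 1, 3], [3, 2, 1], [1, 2, 3]]
gold_indices = [1, 1, 1, 1]

print(top1_accuracy(rankings, gold_indices))  # 0.5
print(average_rank(rankings, gold_indices))   # 1.75
```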

### Evaluation

Download the dataset from here: http://jair.org/media/3640/live-3640-6413-jair.txt

In [3]:
%cd ../data/
!wget "http://jair.org/media/3640/live-3640-6413-jair.txt"
%ls
%cd ../repeval2017/

/Users/Fabian/Documents/Arbeit/AG_SD/repeval2017/data
--2017-04-06 10:53:48--  http://jair.org/media/3640/live-3640-6413-jair.txt
Resolving jair.org... 141.213.4.6
Connecting to jair.org|141.213.4.6|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 191220 (187K) [text/plain]
Saving to: 'live-3640-6413-jair.txt'


2017-04-06 10:53:49 (258 KB/s) - 'live-3640-6413-jair.txt' saved [191220/191220]

GoogleNews-vectors-negative300.bin  snli_1.0/
live-3640-6413-jair.txt             snli_1.0.zip
/Users/Fabian/Documents/Arbeit/AG_SD/repeval2017/repeval2017


Read in the evaluation dataset:

In [16]:
import csv
path_to_eval_file = "../data/live-3640-6413-jair.txt"
data = []
with open(path_to_eval_file, "r") as tsv_file:
    tsv_reader = csv.reader(tsv_file, delimiter='|')
    for row in tsv_reader:
        # skip empty rows and comment lines starting with '#'
        if row and not row[0].startswith('#'):
            data.append([elem.strip() for elem in row])

training_data = data[0:680] # split as specified by Turney
test_data = data[680:]
print(len(training_data))
print(len(test_data))

680
1500


In [21]:
from scipy.spatial.distance import cosine
import sys

def compute_distances_to_candidates(model, vector_space, sample):
    #print("Processing {}".format(sample))
    distances = []
    
    phrase = sample[0]
    adj,noun = phrase.split()
    # compute the phrase vector with the model
    try:
        phrase_vec = model.predict(np.asarray([[vector_space[adj], vector_space[noun]]]))[0]  # TODO: adapt the input format to the model
    except KeyError:
        print("{} or {} not in vectorspace".format(adj, noun))
        return distances
    
    #print(np.shape(phrase_vec))

    for i in range(1,len(sample)):
        # look up the vectors of all candidate targets
        try:
            #print(sample[i], type(sample[i]))
            
            target_vec = vector_space[sample[i]]
            distance = cosine(phrase_vec, target_vec)
            distances.append((i, distance))
        except KeyError:
            distances.append((i, np.inf))  # words missing from the vector space are treated as infinitely far away
            print("{} is not in vectorspace".format(sample[i]))
            
    # sort by distance
    sorted_distances = sorted(distances, key=lambda x:x[1], reverse=False)
    #print(sorted_distances)
    return sorted_distances
    
def eval_model(model, vector_space, test_data, mode='average_rank'):
    if mode == 'average_rank':
        ranks = []
        for sample in test_data:
            distances = compute_distances_to_candidates(model, vector_space, sample)
            if distances:
                for i in range(0, len(distances)):
                    if distances[i][0] == sample[1]: #find rank of gold label
                        
                        ranks.append(i)
        return np.mean(ranks)

print(eval_model(weighted_model, vector_space, test_data))
        
        # implement the second evaluation variant that Matthias suggested


incumbrance is not in vectorspace
syncarp is not in vectorspace
absorptance is not in vectorspace
squirearchy is not in vectorspace
perfective is not in vectorspace
alizarine is not in vectorspace
superorder is not in vectorspace
dalesman is not in vectorspace
copestone is not in vectorspace
scorbutus is not in vectorspace
husbandman is not in vectorspace
gravitative is not in vectorspace
chrysomelid is not in vectorspace
leafage is not in vectorspace
instructress is not in vectorspace
langouste is not in vectorspace
lithops is not in vectorspace
favour is not in vectorspace
masterwort is not in vectorspace
incrustation is not in vectorspace
pycnogonid is not in vectorspace
eardrop is not in vectorspace
tiepin is not in vectorspace
basidium is not in vectorspace
dimity is not in vectorspace
betatron is not in vectorspace
pleasance is not in vectorspace
clepsydra is not in vectorspace
soddy is not in vectorspace
canella is not in vectorspace
lipectomy is not in vectorspace
peristome is 

  out=out, **kwargs)


TODO next:
+ why are the ranks empty?
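
A likely explanation for the empty `ranks` list, assuming the correct answer is always the first candidate (`sample[1]`), as the gold-label comment in `eval_model` suggests: `compute_distances_to_candidates` stores integer candidate positions, while `sample[1]` is a string, so the equality test never succeeds. The toy sample and distances below are made up for illustration.

```python
# Invented toy sample: phrase first, then the candidates (gold candidate at position 1).
sample = ['dog house', 'kennel', 'dog', 'house']
# (candidate position, distance) pairs, already sorted by distance.
sorted_distances = [(2, 0.21), (1, 0.35), (3, 0.40)]

# Comparison as written in eval_model: an int position against a string label.
ranks_as_written = [rank for rank, (position, _) in enumerate(sorted_distances)
                    if position == sample[1]]

# Comparing against the position of the gold candidate instead (assumed to be 1).
gold_position = 1
ranks_fixed = [rank for rank, (position, _) in enumerate(sorted_distances)
               if position == gold_position]

print(ranks_as_written)  # []
print(ranks_fixed)       # [1]
```

With an empty list, `np.mean(ranks)` returns `nan` and emits a warning, which would be consistent with the truncated warning fragment in the output above.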