# Challenges in Computational Linguistics
## SemEval 2020: Commonsense Validation and Explanation

We're participating in task 4 of the SemEval 2020 Challenges for our seminar Challenges in Computational Linguistics, University Tübingen.

This notebook is meant as playground and first steps, to get the ball rolling. It can later be used as a template to build the final notebook (or program).

## The Data
### Read-in for task 1

First I use the given data from the tasks github respo for subtask A.

https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/tree/master/Training%20%20Data


In [16]:
"""Imports for data"""
import pandas as pd
import numpy as np
import matplotlib as mpl

In [17]:
"""Read in the data directly from github"""
url_data_task_A = "https://raw.githubusercontent.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/master/Training%20%20Data/subtaskA_data_all.csv"
url_answers_task_A = "https://raw.githubusercontent.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/master/Training%20%20Data/subtaskA_answers_all.csv"

data_task_A = pd.read_csv(url_data_task_A,header=0, index_col=0)
answers_task_A = pd.read_csv(url_answers_task_A, index_col=0, header=None)

print(data_task_A[:3])
answers_task_A[:3]

                                    sent0                          sent1
id                                                                      
0   He poured orange juice on his cereal.  He poured milk on his cereal.
1                        He drinks apple.                He drinks milk.
2                   Jeff ran a mile today   Jeff ran 100,000 miles today


Unnamed: 0_level_0,1
0,Unnamed: 1_level_1
0,0
1,0
2,1


In [7]:
# check data type, shape, etc
print(type(data_task_A), 'shape of data:',data_task_A.shape, 'shape of answers:',
      answers_task_A.shape, 'one line is missing \nbecause no header here\n')

print('To get first column, first row:', data_task_A['sent0'].iloc[0]) # iloc only takes integers
print('\nTo get both colums for given row:',data_task_A.loc[0])

<class 'pandas.core.frame.DataFrame'> shape of data: (10000, 2) shape of answers: (9999, 1) one line is missing 
because no header here

To get first column, first row: He poured orange juice on his cereal.

To get both colums for given row: sent0    He poured orange juice on his cereal.
sent1            He poured milk on his cereal.
Name: 0, dtype: object




## Using spacy for Natural Language Processing
### What about NLTK ?

In [3]:
"""now the fun starts..."""
import nltk

In [19]:
# just checking and loading stuff...
nltk.__version__
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/max/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Creating a tokenized DataFrame

With help of Pythons list comprehension, we're transforming the string sentences into list of tokens. 


In [7]:
tokens_per_sentence=pd.DataFrame([[nltk.word_tokenize(row['sent0']), nltk.word_tokenize(row['sent1'])
                                  ] for i, row in data_task_A.iterrows()], columns=['sent0','sent1'])




In [28]:
tokens_per_sentence[:3]

Unnamed: 0,sent0,sent1
0,"[He, poured, orange, juice, on, his, cereal, .]","[He, poured, milk, on, his, cereal, .]"
1,"[He, drinks, apple, .]","[He, drinks, milk, .]"
2,"[Jeff, ran, a, mile, today]","[Jeff, ran, 100,000, miles, today]"


### Counting number of distinct words

This could be done together with the above tokenization inside one loop, for better readability, I separated it.

In [14]:
number_dist_words = nltk.FreqDist()

# there should be a way to skip the double lookup...
for i, rows in tokens_per_sentence.iterrows():
    for word1, word2 in zip(rows['sent0'], rows['sent1']):
        
        # if it's the same word, we don't want to count it as double
        if word1.lower() == word2.lower():
            number_dist_words[word1.lower()] += 1
        else:
            number_dist_words[word1.lower()] += 1
            number_dist_words[word2.lower()] += 1


In [15]:
print('number of distinct words:', len(number_dist_words))
print('Output first 10 words:\n')
i = 0
for key, val in number_dist_words.items():
    print('"{}"'.format(key), 'occures', val, 'number of times.')
    i += 1
    if i == 10: break

number of distinct words: 8078
Output first 10 words:

"he" occures 1782 number of times.
"poured" occures 20 number of times.
"orange" occures 17 number of times.
"milk" occures 110 number of times.
"juice" occures 26 number of times.
"on" occures 1098 number of times.
"his" occures 859 number of times.
"cereal" occures 11 number of times.
"." occures 3560 number of times.
"drinks" occures 40 number of times.


### POS-tagger

A quick implementation of a part-of-speech-tagger. This again could be done inside the above forloop.

In [16]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/max/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [17]:
pos_per_sentence = pd.DataFrame([[nltk.pos_tag(row['sent0']), nltk.pos_tag(row['sent1'])
                                  ] for i, row in data_task_A.iterrows()], columns=['sent0','sent1'])



In [18]:
pos_per_sentence[:3]

Unnamed: 0,sent0,sent1
0,"[(H, NNP), (e, NN), ( , NNP), (p, NN), (o, NN)...","[(H, NNP), (e, NN), ( , NNP), (p, NN), (o, NN)..."
1,"[(H, NNP), (e, NN), ( , NNP), (d, NN), (r, NN)...","[(H, NNP), (e, NN), ( , NNP), (d, NN), (r, NN)..."
2,"[(J, NNP), (e, NN), (f, NN), (f, NN), ( , NNP)...","[(J, NNP), (e, NN), (f, NN), (f, NN), ( , NNP)..."


## Parser

### BllipParser


Let's use, for simplicity purposes, the bllip parser module that comes with the nltk package.

Sadly, this didn't work in the beginning, so I quickly downloaded the bllipparser directly from their github:
https://github.com/BLLIP/bllip-parser and
https://github.com/BLLIP/bllip-parser/blob/master/README-python.rst

If you want to try the next few lines, make sure to follow the steps and install the package and WSJ+Gigaword-v2 Model.

In [20]:
from nltk.parse.bllip import BllipParser # this doesn't work: ImportError: Couldn't import bllipparser module: No module named 'bllipparser'
from bllipparser import RerankingParser


In [None]:
nltk_parser = BllipParser('/Users/max/.local/share/bllipparser/WSJ+Gigaword-v2')

In [21]:
parser_path = '/Users/max/.local/share/bllipparser/WSJ+Gigaword-v2'

"""This is a time consuming operation (1min)"""
parser = RerankingParser.from_unified_model_dir(parser_path)

In [27]:
"""
parser.parse outputs a list of N most probable parses, default n = 50
parser.set_parser_options(nbest=10) sets new options and outputs a dict with current ones

"""

parser_options = parser.set_parser_options(nbest=2)
print(parser_options)

test_a = 'I put an elephant in the fridge'
test_b = 'I put a turkey in the fridge'

parsed_sen_a = parser.parse(test_a)
parsed_sen_b = parser.parse(test_b)
print(test_a)
print(parsed_sen_b)
print(test_b)
print(parsed_sen_b)

{'language': 'En', 'case_insensitive': False, 'nbest': 2, 'small_corpus': True, 'overparsing': 21, 'debug': 0, 'smooth_pos': 0}
I put an elephant in the fridge
2 x
-15.839081581410 -63.140991134783
(S1 (S (NP (PRP I)) (VP (VBP put) (NP (DT a) (NN turkey)) (PP (IN in) (NP (DT the) (NN fridge))))))
-16.692901096060 -65.869600587365
(S1 (S (NP (PRP I)) (VP (VB put) (NP (DT a) (NN turkey)) (PP (IN in) (NP (DT the) (NN fridge))))))

I put a turkey in the fridge
2 x
-15.839081581410 -63.140991134783
(S1 (S (NP (PRP I)) (VP (VBP put) (NP (DT a) (NN turkey)) (PP (IN in) (NP (DT the) (NN fridge))))))
-16.692901096060 -65.869600587365
(S1 (S (NP (PRP I)) (VP (VB put) (NP (DT a) (NN turkey)) (PP (IN in) (NP (DT the) (NN fridge))))))



### Summary BllipParser
As we can see, we get the same scores for both sentences. This parser is useful for getting the dependencies of the tokens, or rather the grammatical structure.

While this can certainly help us, we additionally want a parser that computs the propability P of a word W2, given a W1: P(W2|W1). Then We can compute P(S) for sentence S = W1 W2 W3 as P(S)=P(W1|start)+P(W2|W1)+P(W3|W2)+P(end|W3)

BTW; I'm not sure on that probability calculation above, i think it differs for depenendent/independent events, so we have to check that again.

We can probably make more use of the sentence structure for task B

### StanfordParser

As suggested by Cagri, we could also make use of the StanfordParser. It usually runs on Java, so we have to make some adjustments to our python script if we want to use it. I found couple good-looking lines from: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

Also on the official site https://nlp.stanford.edu/software/lex-parser.shtml#Download an interface to python is linked:http://projects.csail.mit.edu/spatial/Stanford_Parser which I have yet to try.

In [32]:
import os
from nltk.parse.stanford import StanfordParser

java_path = r'/Library/Java/JavaVirtualMachines/jdk-10.jdk/bin/java.exe'
os.environ['JAVAHOME'] = java_path

stanford_parser = StanfordParser(path_to_jar='/Users/max/Documents/GitHub/Commonsense2020/Parsers/stanford-parser-full-2015-04-20/stanford-parser.jar',
                     path_to_models_jar='/Users/max/Documents/GitHub/Commonsense2020/Parsers/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar')

# we again use test sentences test_a & test_b from above
result = list(stanford_parser.raw_parse(test_a))
print(result[0])

Please use [91mnltk.parse.corenlp.CoreNLPParser[0m instead.
  


(ROOT
  (S
    (NP (PRP I))
    (VP
      (VBD put)
      (NP (DT an) (NN elephant))
      (PP (IN in) (NP (DT the) (NN fridge))))))


We could also try out the CoreNLPParser as suggested in the DeprecationWarning above.

I tried drawing the tree like the tutorial linked above suggsted, but ran into some dependency-problems which I wont delve into. 

In [51]:
"""StanfordDependencyParser"""
from nltk.parse.stanford import StanfordDependencyParser
depend_parser = StanfordParser(path_to_jar='/Users/max/Documents/GitHub/Commonsense2020/Parsers/stanford-parser-full-2015-04-20/stanford-parser.jar',
                     path_to_models_jar='/Users/max/Documents/GitHub/Commonsense2020/Parsers/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar')

result = list(depend_parser.raw_parse(test_a))  
print(result[0])


Please use [91mnltk.parse.corenlp.CoreNLPParser[0m instead.
  after removing the cwd from sys.path.


(ROOT
  (S
    (NP (PRP I))
    (VP
      (VBD put)
      (NP (DT an) (NN elephant))
      (PP (IN in) (NP (DT the) (NN fridge))))))


### Summary StanfordParser

Seems like both run on the CoreNLPParser, so we should look into this one. Also we should research if these Parser can also output the propability we actually want out of our parses.

# BERT

The thing always going through my mind with a propability calculation of the whole sentence is called a Language Model (should've known that) and spacy provides us with a pretrained one...

Alright, following I'm trying to implement the BERT model to compute word embeddings for three words out of our dataset. Pipeline will be as follows:

- go through dataset and see which words differ
- get dependent words through pos tags
- run these words through BERT
- get word embeddings as output
- compute vector distance
- whichever distance is lower is better
- output is a measure of the difference of the distances

All of this (without first, maybe second step) cann happen in a model?

In [11]:
"""https://github.com/huggingface/transformers"""

import tensorflow as tf
# FullTokenizer from file tokenization.py, method copied from bert-github:https://github.com/google-research/bert
# with tensorflow 2, the normal implementation doesnt work anymore, you have to run the upgrade script on the file.
import tokenization_v2 


## Different words
We want some more information about our data like
- longest word sequence
- word pairs that differ between the sentences
- dependents on the different words (looking at POS-tags should we a way to find them.)

In [49]:
l = [[[11],[12]],[[21],[22]],[[31],[32]]]
print(np.asarray(l).shape)
for i in range(len(l)):
    l[i][0] = [i]
    l[i][1] = [i+1]
#l[0][0] = [0]
print(l)
different_words = [[[None],[None]]] * (data_task_A.shape[0])
np.asarray(different_words).shape

(3, 2, 1)
[[[0], [1]], [[1], [2]], [[2], [3]]]


(10000, 2, 1)

In [60]:
test_a = 'I put an elephant in the fridge'
test_b = 'I put a turkey in the fridge'

"""We use the official FullTokenizer just to make sure"""
tokenizer = tokenization_v2.FullTokenizer('BERT/vocab.txt')
sen_tok = tokenizer.tokenize(test_a)

toks_task_A = []
max_seq_len = 0
different_words = [[[None],[None]]] * (data_task_A.shape[0])
# data_task_A, answers_task_A
for index, row in data_task_A.iterrows():
    
    sen_0_tok = tokenizer.tokenize(row['sent0'])
    sen_1_tok = tokenizer.tokenize(row['sent1'])
    
    len_sen_0_tok = len(sen_0_tok)
    len_sen_1_tok = len(sen_1_tok)
    
    """Here we get the longest token-sequence length"""
    if len_sen_0_tok >= len_sen_1_tok:
        if len_sen_0_tok > max_seq_len:
            max_seq_len = len_sen_0_tok
    elif len_sen_1_tok > max_seq_len:
            max_seq_len = len_sen_1_tok
    
    """Here we look for the words that differ"""
    if len_sen_0_tok == len_sen_1_tok:
        for tok_0, tok_1 in zip(sen_0_tok, sen_1_tok):
            if tok_0 != tok_1:
                #print(tok_0, tok_1)
                #different_words.append([tok_0, tok_1])
                different_words[0][0] = [tok_0]
                different_words[0][1] = [tok_1]
    # We should make sure that the different words stay in the spots for sent0 sent1,
    # therefor we probl need a 3dim list, or we work with pandas...
    
    ### 
    #BUG: somehow it overrides all lines every new
    ###
    else:
        if len_sen_0_tok > len_sen_1_tok:
            print(index)
            s = set(sen_1_tok)
            different_words[index][0] = [x for x in sen_0_tok if x not in s]
            different_words[index][1] = [x for x in sen_1_tok if x not in set(sen_0_tok)]
            #word_from_other_sen = [x for x in sen_1_tok if x not in set(sen_0_tok)]
            #for word in word_from_other_sen:
             #   different_words[index].append(word)
        else:
            different_words[index][0] = [x for x in sen_0_tok if x not in set(sen_1_tok)]
            s = set(sen_0_tok)
            different_words[index][1] = [x for x in sen_1_tok if x not in s]
            
            #word_from_other_sen = [x for x in sen_0_tok if x not in set(sen_1_tok)]
            #for word in word_from_other_sen:
             #   different_words[index].append(word)
        
    sen_0_tok.insert(0,'[CLS]')
    sen_0_tok.append('[SEP]')
    sen_1_tok.insert(0, '[CLS]')
    sen_1_tok.append('[SEP]')
    
    toks_task_A.append([sen_0_tok,sen_1_tok])

print(toks_task_A[:3], '\nLongest sentence length:', max_seq_len, 
      '\nFirst three word-pairs that differ:', different_words[:3],
      '\nshape of different_words:', np.asarray(different_words).shape)

0
3
7
10
11
12
14
15
16
19
26
29
33
36
39
49
52
56
65
68
71
75
78
83
90
91
102
107
114
115
117
122
127
131
132
135
137
152
154
160
162
176
177
182
185
190
193
195
203
205
206
212
218
222
223
233
239
240
248
249
250
254
256
258
271
295
302
307
308
311
317
318
330
331
332
345
346
349
355
358
364
372
374
375
376
389
390
392
394
400
406
407
421
426
428
434
439
445
453
461
463
467
471
477
490
497
509
511
525
528
529
537
542
549
550
559
561
566
569
570
571
584
587
591
598
600
604
621
622
623
628
637
644
645
646
652
659
661
662
663
664
669
676
678
681
683
687
690
701
710
711
712
720
727
729
738
740
747
748
751
776
777
789
796
799
806
826
827
842
854
857
866
868
871
874
882
884
889
890
892
894
903
904
909
914
918
923
926
930
934
947
954
960
964
977
981
984
992
994
995
996
1003
1004
1005
1015
1018
1020
1021
1023
1026
1041
1058
1064
1067
1071
1073
1074
1075
1087
1090
1094
1103
1104
1105
1118
1121
1122
1123
1133
1138
1139
1144
1148
1149
1160
1164
1165
1167
1168
1171
1173
1188
1209
1213
1214
1215


9354
9373
9382
9383
9405
9410
9415
9424
9425
9431
9440
9446
9450
9455
9456
9482
9488
9490
9492
9498
9510
9518
9526
9535
9537
9538
9545
9554
9558
9564
9577
9582
9586
9587
9590
9595
9603
9612
9616
9625
9638
9643
9645
9650
9653
9658
9662
9668
9673
9675
9683
9689
9725
9726
9727
9730
9738
9742
9743
9747
9749
9751
9755
9758
9761
9766
9767
9800
9808
9822
9831
9832
9834
9835
9846
9857
9887
9892
9897
9911
9915
9918
9934
9940
9945
9960
9965
9968
9979
9986
[[['[CLS]', 'he', 'poured', 'orange', 'juice', 'on', 'his', 'cereal', '.', '[SEP]'], ['[CLS]', 'he', 'poured', 'milk', 'on', 'his', 'cereal', '.', '[SEP]']], [['[CLS]', 'he', 'drinks', 'apple', '.', '[SEP]'], ['[CLS]', 'he', 'drinks', 'milk', '.', '[SEP]']], [['[CLS]', 'jeff', 'ran', 'a', 'mile', 'today', '[SEP]'], ['[CLS]', 'jeff', 'ran', '100', ',', '000', 'miles', 'today', '[SEP]']]] 
Longest sentence length: 25 
First three word-pairs that differ: [[['lamp'], ['desk']], [['lamp'], ['desk']], [['lamp'], ['desk']]] 
shape of different_words: 

In [57]:
for i in different_words:
    print(i)

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hamburger', '##s']]
[['vegetables'], ['hambur

In [31]:
"""
Saver below loads the graph-operations of pretrained model.
First we need to load the meta graph, then the weights.
"""
sess = tf.Session()

saver = tf.train.import_meta_graph('BERT/bert_model.ckpt.meta')
saver.restore(sess, 'BERT/bert_model.ckpt')

"""
the saver imports the graph of the saved model and with restore, loads the weights into the different operations.
So if we want to load the model from file and do a classification with it, we need to modify the graph.
Precisely, we need to add 3 Input layers (operations) like two cells below from: https://towardsdatascience.com/simple-bert-using-tensorflow-2-0-132cb19e9b22
Also a classification ops like https://towardsdatascience.com/bert-in-keras-with-tensorflow-hub-76bcbc9417b

"""

### Working with tensorflow hub

Because working directly with the graph and operations can be quite tiresome, we'll try tensorflow hub from https://towardsdatascience.com/simple-bert-using-tensorflow-2-0-132cb19e9b22. This promises us quicker results. Ill still want to figure out how to work with the graph later in the game.

Also this tutorial works with tensorflow 2, so make sure you have that installed.
Check with tensorflow.\__version__

In [5]:
import tensorflow_hub as hub
from tensorflow.keras.models import Model


'2.0.0'

In [6]:
max_seq_length = max_seq_len  # We take the max length from above cell "different words"

input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")

bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=True)

pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])

NameError: name 'max_seq_len' is not defined

In [None]:
graph = tf.get_default_graph()


## NExt STEPs:

- go through dataset and see which words differ
- get dependent words through pos tags
- run these words through BERT
- get word embeddings as output
- compute vector distance
- whichever distance is lower is better
- output is a measure of the difference of the distances

All of this (without first, maybe second step) cann happen in a model?