# Challenges in Computational Linguistics
## SemEval 2020: Commonsense Validation and Explanation

We're participating in task 4 of the SemEval 2020 Challenges for our seminar Challenges in Computational Linguistics, University Tübingen.

This notebook is meant as playground and first steps, to get the ball rolling. It can later be used as a template to build the final notebook (or program).

## The Data
### Read-in for task 1

First I use the given data from the tasks github respo for subtask A.

https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/tree/master/Training%20%20Data


In [1]:
"""Imports for data"""
import pandas as pd
import numpy as np
import matplotlib as mpl

In [2]:
"""Read in the data directly from github"""
url_data_task_A = "https://raw.githubusercontent.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/master/Training%20%20Data/subtaskA_data_all.csv"
url_answers_task_A = "https://raw.githubusercontent.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/master/Training%20%20Data/subtaskA_answers_all.csv"

data_task_A = pd.read_csv(url_data_task_A,header=0, index_col=0)
answers_task_A = pd.read_csv(url_answers_task_A, index_col=0)

data_task_A[:3]

Unnamed: 0_level_0,sent0,sent1
id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,He poured orange juice on his cereal.,He poured milk on his cereal.
1,He drinks apple.,He drinks milk.
2,Jeff ran a mile today,"Jeff ran 100,000 miles today"


In [7]:
# check data type, shape, etc
print(type(data_task_A), 'shape of data:',data_task_A.shape, 'shape of answers:',
      answers_task_A.shape, 'one line is missing \nbecause no header here\n')

print('To get first column, first row:', data_task_A['sent0'].iloc[0]) # iloc only takes integers
print('\nTo get both colums for given row:',data_task_A.loc[0])

<class 'pandas.core.frame.DataFrame'> shape of data: (10000, 2) shape of answers: (9999, 1) one line is missing 
because no header here

To get first column, first row: He poured orange juice on his cereal.

To get both colums for given row: sent0    He poured orange juice on his cereal.
sent1            He poured milk on his cereal.
Name: 0, dtype: object


This should be enough in respect to the data for now, next step is manipuating the data to our needs.

## Using spacy for Natural Language Processing
### What about NLTK ?

In [3]:
"""now the fun starts..."""
import nltk

In [19]:
# just checking and loading stuff...
nltk.__version__
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/max/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Creating a tokenized DataFrame

With help of Pythons list comprehension, we're transforming the string sentences into list of tokens. 


In [7]:
tokens_per_sentence=pd.DataFrame([[nltk.word_tokenize(row['sent0']), nltk.word_tokenize(row['sent1'])
                                  ] for i, row in data_task_A.iterrows()], columns=['sent0','sent1'])




In [28]:
tokens_per_sentence[:3]

Unnamed: 0,sent0,sent1
0,"[He, poured, orange, juice, on, his, cereal, .]","[He, poured, milk, on, his, cereal, .]"
1,"[He, drinks, apple, .]","[He, drinks, milk, .]"
2,"[Jeff, ran, a, mile, today]","[Jeff, ran, 100,000, miles, today]"


### Counting number of distinct words

This could be done together with the above tokenization inside one loop, for better readability, I separated it.

In [14]:
number_dist_words = nltk.FreqDist()

# there should be a way to skip the double lookup...
for i, rows in tokens_per_sentence.iterrows():
    for word1, word2 in zip(rows['sent0'], rows['sent1']):
        
        # if it's the same word, we don't want to count it as double
        if word1.lower() == word2.lower():
            number_dist_words[word1.lower()] += 1
        else:
            number_dist_words[word1.lower()] += 1
            number_dist_words[word2.lower()] += 1


In [15]:
print('number of distinct words:', len(number_dist_words))
print('Output first 10 words:\n')
i = 0
for key, val in number_dist_words.items():
    print('"{}"'.format(key), 'occures', val, 'number of times.')
    i += 1
    if i == 10: break

number of distinct words: 8078
Output first 10 words:

"he" occures 1782 number of times.
"poured" occures 20 number of times.
"orange" occures 17 number of times.
"milk" occures 110 number of times.
"juice" occures 26 number of times.
"on" occures 1098 number of times.
"his" occures 859 number of times.
"cereal" occures 11 number of times.
"." occures 3560 number of times.
"drinks" occures 40 number of times.


### POS-tagger

A quick implementation of a part-of-speech-tagger. This again could be done inside the above forloop.

In [16]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/max/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [17]:
pos_per_sentence = pd.DataFrame([[nltk.pos_tag(row['sent0']), nltk.pos_tag(row['sent1'])
                                  ] for i, row in data_task_A.iterrows()], columns=['sent0','sent1'])



In [18]:
pos_per_sentence[:3]

Unnamed: 0,sent0,sent1
0,"[(H, NNP), (e, NN), ( , NNP), (p, NN), (o, NN)...","[(H, NNP), (e, NN), ( , NNP), (p, NN), (o, NN)..."
1,"[(H, NNP), (e, NN), ( , NNP), (d, NN), (r, NN)...","[(H, NNP), (e, NN), ( , NNP), (d, NN), (r, NN)..."
2,"[(J, NNP), (e, NN), (f, NN), (f, NN), ( , NNP)...","[(J, NNP), (e, NN), (f, NN), (f, NN), ( , NNP)..."


## Parser

### BllipParser


Let's use, for simplicity purposes, the bllip parser module that comes with the nltk package.

Sadly, this didn't work in the beginning, so I quickly downloaded the bllipparser directly from their github:
https://github.com/BLLIP/bllip-parser and
https://github.com/BLLIP/bllip-parser/blob/master/README-python.rst

If you want to try the next few lines, make sure to follow the steps and install the package and WSJ+Gigaword-v2 Model.

In [20]:
from nltk.parse.bllip import BllipParser # this doesn't work: ImportError: Couldn't import bllipparser module: No module named 'bllipparser'
from bllipparser import RerankingParser


In [28]:
nltk_parser = BllipParser('/Users/max/.local/share/bllipparser/WSJ+Gigaword-v2')

RuntimeError: Parser is already loaded and can only be loaded once.

In [21]:
parser_path = '/Users/max/.local/share/bllipparser/WSJ+Gigaword-v2'

"""This is a time consuming operation (1min)"""
parser = RerankingParser.from_unified_model_dir(parser_path)

In [27]:
"""
parser.parse outputs a list of N most probable parses, default n = 50
parser.set_parser_options(nbest=10) sets new options and outputs a dict with current ones

"""

parser_options = parser.set_parser_options(nbest=2)
print(parser_options)

test_a = 'I put an elephant in the fridge'
test_b = 'I put a turkey in the fridge'

parsed_sen_a = parser.parse(test_a)
parsed_sen_b = parser.parse(test_b)
print(test_a)
print(parsed_sen_b)
print(test_b)
print(parsed_sen_b)

{'language': 'En', 'case_insensitive': False, 'nbest': 2, 'small_corpus': True, 'overparsing': 21, 'debug': 0, 'smooth_pos': 0}
I put an elephant in the fridge
2 x
-15.839081581410 -63.140991134783
(S1 (S (NP (PRP I)) (VP (VBP put) (NP (DT a) (NN turkey)) (PP (IN in) (NP (DT the) (NN fridge))))))
-16.692901096060 -65.869600587365
(S1 (S (NP (PRP I)) (VP (VB put) (NP (DT a) (NN turkey)) (PP (IN in) (NP (DT the) (NN fridge))))))

I put a turkey in the fridge
2 x
-15.839081581410 -63.140991134783
(S1 (S (NP (PRP I)) (VP (VBP put) (NP (DT a) (NN turkey)) (PP (IN in) (NP (DT the) (NN fridge))))))
-16.692901096060 -65.869600587365
(S1 (S (NP (PRP I)) (VP (VB put) (NP (DT a) (NN turkey)) (PP (IN in) (NP (DT the) (NN fridge))))))



### Summary BllipParser
As we can see, we get the same scores for both sentences. This parser is useful for getting the dependencies of the tokens, or rather the grammatical structure.

While this can certainly help us, we additionally want a parser that computs the propability P of a word W2, given a W1: P(W2|W1). Then We can compute P(S) for sentence S = W1 W2 W3 as P(S)=P(W1|start)+P(W2|W1)+P(W3|W2)+P(end|W3)

BTW; I'm not sure on that probability calculation above, i think it differs for depenendent/independent events, so we have to check that again.

We can probably make more use of the sentence structure for task B

### StanfordParser

As suggested by Cagri, we could also make use of the StanfordParser. It usually runs on Java, so we have to make some adjustments to our python script if we want to use it. I found couple good-looking lines from: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

Also on the official site https://nlp.stanford.edu/software/lex-parser.shtml#Download an interface to python is linked:http://projects.csail.mit.edu/spatial/Stanford_Parser which I have yet to try.

In [32]:
import os
from nltk.parse.stanford import StanfordParser

java_path = r'/Library/Java/JavaVirtualMachines/jdk-10.jdk/bin/java.exe'
os.environ['JAVAHOME'] = java_path

stanford_parser = StanfordParser(path_to_jar='/Users/max/Documents/GitHub/Commonsense2020/Parsers/stanford-parser-full-2015-04-20/stanford-parser.jar',
                     path_to_models_jar='/Users/max/Documents/GitHub/Commonsense2020/Parsers/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar')

# we again use test sentences test_a & test_b from above
result = list(stanford_parser.raw_parse(test_a))
print(result[0])

Please use [91mnltk.parse.corenlp.CoreNLPParser[0m instead.
  


(ROOT
  (S
    (NP (PRP I))
    (VP
      (VBD put)
      (NP (DT an) (NN elephant))
      (PP (IN in) (NP (DT the) (NN fridge))))))


We could also try out the CoreNLPParser as suggested in the DeprecationWarning above.

I tried drawing the tree like the tutorial linked above suggsted, but ran into some dependency-problems which I wont delve into. 

In [51]:
"""StanfordDependencyParser"""
from nltk.parse.stanford import StanfordDependencyParser
depend_parser = StanfordParser(path_to_jar='/Users/max/Documents/GitHub/Commonsense2020/Parsers/stanford-parser-full-2015-04-20/stanford-parser.jar',
                     path_to_models_jar='/Users/max/Documents/GitHub/Commonsense2020/Parsers/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar')

result = list(depend_parser.raw_parse(test_a))  
print(result[0])


Please use [91mnltk.parse.corenlp.CoreNLPParser[0m instead.
  after removing the cwd from sys.path.


(ROOT
  (S
    (NP (PRP I))
    (VP
      (VBD put)
      (NP (DT an) (NN elephant))
      (PP (IN in) (NP (DT the) (NN fridge))))))


### Summary StanfordParser

Seems like both run on the CoreNLPParser, so we should look into this one. Also we should research if these Parser can also output the propability we actually want out of our parses.