# Challenges in Computational Linguistics
## SemEval 2020: Commonsense Validation and Explanation

We're participating in task 4 of the SemEval 2020 Challenges for our seminar Challenges in Computational Linguistics, University Tübingen.

This notebook is meant as playground and first steps, to get the ball rolling. It can later be used as a template to build the final notebook (or program).

## The Data
### Read-in for task 1

First I use the given data from the tasks github respo for subtask A.

https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/tree/master/Training%20%20Data


In [1]:
"""Imports for data"""
import pandas as pd
import numpy as np
import matplotlib as mpl

In [2]:
"""Read in the data directly from github"""
url_data_task_A = "https://raw.githubusercontent.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/master/Training%20%20Data/subtaskA_data_all.csv"
url_answers_task_A = "https://raw.githubusercontent.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation/master/Training%20%20Data/subtaskA_answers_all.csv"

data_task_A = pd.read_csv(url_data_task_A,header=0, index_col=0)
answers_task_A = pd.read_csv(url_answers_task_A, index_col=0)

data_task_A[:3]

Unnamed: 0_level_0,sent0,sent1
id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,He poured orange juice on his cereal.,He poured milk on his cereal.
1,He drinks apple.,He drinks milk.
2,Jeff ran a mile today,"Jeff ran 100,000 miles today"


In [7]:
# check data type, shape, etc
print(type(data_task_A), 'shape of data:',data_task_A.shape, 'shape of answers:',
      answers_task_A.shape, 'one line is missing \nbecause no header here\n')

print('To get first column, first row:', data_task_A['sent0'].iloc[0]) # iloc only takes integers
print('\nTo get both colums for given row:',data_task_A.loc[0])

<class 'pandas.core.frame.DataFrame'> shape of data: (10000, 2) shape of answers: (9999, 1) one line is missing 
because no header here

To get first column, first row: He poured orange juice on his cereal.

To get both colums for given row: sent0    He poured orange juice on his cereal.
sent1            He poured milk on his cereal.
Name: 0, dtype: object


This should be enough in respect to the data for now, next step is manipuating the data to our needs.

## Using spacy for Natural Language Processing
### What about NLTK ?

In [3]:
"""now the fun starts..."""
import nltk

In [19]:
# just checking and loading stuff...
nltk.__version__
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/max/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Creating a tokenized DataFrame

With help of Pythons list comprehension, we're transforming the string sentences into list of tokens. 


In [27]:
tokens_per_sentence=pd.DataFrame([[nltk.word_tokenize(row['sent0']), nltk.word_tokenize(row['sent1'])
                                  ] for i, row in data_task_A.iterrows()], columns=['sent0','sent1'])




In [28]:
tokens_per_sentence[:3]

Unnamed: 0,sent0,sent1
0,"[He, poured, orange, juice, on, his, cereal, .]","[He, poured, milk, on, his, cereal, .]"
1,"[He, drinks, apple, .]","[He, drinks, milk, .]"
2,"[Jeff, ran, a, mile, today]","[Jeff, ran, 100,000, miles, today]"


### Counting number of distinct words

This could be done together with the above tokenization inside one loop, for better readability, I separated it.

In [41]:
number_dist_words = nltk.FreqDist()

# there should be a way to skip the double lookup...
for i, (row['sent0'], row['sent1']) in tokens_per_sentence.iterrows():
    for word1, word2 in zip(row['sent0'], row['sent1']):
        
        # if it's the same word, we don't want to count it as double
        if word1.lower() == word2.lower():
            number_dist_words[word1.lower()] += 1
        else:
            number_dist_words[word1.lower()] += 1
            number_dist_words[word2.lower()] += 1


In [49]:
print('number of distinct words:', len(number_dist_words))
print('Output first 10 words:\n')
i = 0
for key, val in number_dist_words.items():
    print('"{}"'.format(key), 'occures', val, 'number of times.')
    i += 1
    if i == 10: break

number of distinct words: 8078
Output first 10 words:

"he" occures 1782 number of times.
"poured" occures 20 number of times.
"orange" occures 17 number of times.
"milk" occures 110 number of times.
"juice" occures 26 number of times.
"on" occures 1098 number of times.
"his" occures 859 number of times.
"cereal" occures 11 number of times.
"." occures 3560 number of times.
"drinks" occures 40 number of times.


### POS-tagger

A quick implementation of a part-of-speech-tagger. This again could be done inside the above forloop.

In [52]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/max/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [53]:
pos_per_sentence = pd.DataFrame([[nltk.pos_tag(row['sent0']), nltk.pos_tag(row['sent1'])
                                  ] for i, row in data_task_A.iterrows()], columns=['sent0','sent1'])



In [54]:
pos_per_sentence[:3]

Unnamed: 0,sent0,sent1
0,"[(I, PRP), (have, VBP), (a, DT), (desk, NN), (...","[(I, PRP), (have, VBP), (a, DT), (lamp, NN), (..."
1,"[(H, NNP), (e, NN), ( , NNP), (d, NN), (r, NN)...","[(H, NNP), (e, NN), ( , NNP), (d, NN), (r, NN)..."
2,"[(J, NNP), (e, NN), (f, NN), (f, NN), ( , NNP)...","[(J, NNP), (e, NN), (f, NN), (f, NN), ( , NNP)..."
