# Lab1-Assignment

This notebook describes the assignment for Lab 1. 

**Points**: each exercise is prefixed with the number of points you can obtain for the exercise.

We assume you have worked through the following notebooks:
* **Lab1.1-introduction**
* **Lab1.2-introduction-to-NLTK**
* **Lab1.3-introduction-to-spaCy** 

In this assignment, you will process an English text (**Lab1-apple-samsung-example.txt**) with both NLTK and spaCy and discuss the similarities and differences.

## Who to contact for questions
* Piek Vossen (piek.vossen@vu.nl)

## Tip: how to read a file from disk
Let's open the file **Lab1-apple-samsung-example.txt** from disk. It should be located in the same folder as this notebook. The simplest way is to specify the full path to the file, e.g.:

```
path_to_file='/Users/piek/Desktop/HLT-2019/hlt-ma-labs/lab1.toolkits/Lab1-apple-samsung-example.txt'
```

This works on my machine but probably not on yours, since the file is unlikely to have the same path there.

We can use the **pathlib** module to find the directory of this notebook. Once we have that, we only need to append the name of the text file to this path:

In [5]:
from pathlib import Path

In [6]:
cur_dir = Path().resolve() # this should provide you with the folder in which this notebook is placed
print('Current directory of this notebook:', cur_dir)
path_to_file = cur_dir.joinpath('Lab1-apple-samsung-example.txt')
print('Path to the text file:', path_to_file)

Current directory of this notebook: /Users/piek/Desktop/HLT-2019/hlt-ma-labs/lab1.toolkits
Path to the text file: /Users/piek/Desktop/HLT-2019/hlt-ma-labs/lab1.toolkits/Lab1-apple-samsung-example.txt


If you are unsure whether the path is correct, you can check whether the file exists at that location:

In [7]:
print('does path exist? ->', path_to_file.exists())

does path exist? -> True


If the output from the code cell above states that **does path exist? -> False**, please check that the file **Lab1-apple-samsung-example.txt** is in the same directory as this notebook.

Now we can open the file and access its content. Let's read the complete content and ask for its length using the 'len' function, which will tell us how many characters it has:

In [8]:
with open(path_to_file) as infile:
    text = infile.read()

print('number of characters', len(text))

number of characters 1139
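The text contains non-ASCII characters such as “ and £, so the result of reading it can depend on the default encoding of your platform. The following is a minimal sketch, assuming the file is stored as UTF-8 (the notebook does not state this); it uses a small hypothetical sample written to a temporary folder instead of the lab file:

```python
import tempfile
from pathlib import Path

# Hypothetical sample that stands in for the lab text, which also contains “ and £.
sample = 'Apple stated it had “acted quickly” and was owed £0.66bn.'

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample.txt'
    sample_path.write_text(sample, encoding='utf-8')
    # Passing the encoding explicitly avoids relying on the platform default.
    with open(sample_path, encoding='utf-8') as infile:
        loaded = infile.read()

print('number of characters', len(loaded))
```

If the characters look garbled after reading the lab file, passing `encoding='utf-8'` to `open` in the same way is a reasonable first thing to try.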


In [52]:
print(text)

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.
The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
Apple stated it had “acted quickly and diligently" in order to "determine that these newly released products do infringe many of the same claims already asserted by Apple."
In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices. Samsung, which is the world's top mobile phone maker, is appealing the ruling.
A similar case in the UK found in Samsung's fav

We have now created a string object with the name 'text' that we can use for the assignment below.

## [total points: 4] Exercise 1: NLTK
In this exercise, we use NLTK to apply **Part-of-speech (POS) tagging**, **Named Entity Recognition (NER)**, and **Constituency parsing**. The following code snippet already performs sentence splitting and tokenization. 

In [13]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize

In [29]:
sentences_nltk = sent_tokenize(text)

In [30]:
tokens_per_sentence = [] # this will become a list of lists!!!

for sentence_nltk in sentences_nltk:
    sent_tokens = word_tokenize(sentence_nltk)
    # We append the tokens of this sentence to the result list
    tokens_per_sentence.append(sent_tokens)

We will use lists to keep track of the output of the NLP tasks, so we can inspect the output of each task using the index of the sentence. Let's look at the second sentence (index 1).

In [16]:
sentence_id = 1
print('SENTENCE', sentences_nltk[sentence_id])
print('TOKENS', tokens_per_sentence[sentence_id])

SENTENCE The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
TOKENS ['The', 'six', 'phones', 'and', 'tablets', 'affected', 'are', 'the', 'Galaxy', 'S', 'III', ',', 'running', 'the', 'new', 'Jelly', 'Bean', 'system', ',', 'the', 'Galaxy', 'Tab', '8.9', 'Wifi', 'tablet', ',', 'the', 'Galaxy', 'Tab', '2', '10.1', ',', 'Galaxy', 'Rugby', 'Pro', 'and', 'Galaxy', 'S', 'III', 'mini', '.']


### [point: 1] Exercise 1a: Part-of-speech (POS) tagging
Use *nltk.pos_tag* to perform part-of-speech tagging on a single sentence.

Use **print** to show the output in the notebook (and hence also in the exported PDF!).

In [31]:
sentence_id=2
sentence_tokens = tokens_per_sentence[sentence_id]
pos_tagged_sentence_tokens= [] #put here the call to nltk pos tagger
print(pos_tagged_sentence_tokens)

[]


# KEY

In [43]:
sentence_id=2
sentence_tokens = tokens_per_sentence[sentence_id]
pos_tagged_sentence_tokens= nltk.pos_tag(sentence_tokens)
print(pos_tagged_sentence_tokens)

[('Apple', 'NNP'), ('stated', 'VBD'), ('it', 'PRP'), ('had', 'VBD'), ('“', 'NNP'), ('acted', 'VBD'), ('quickly', 'RB'), ('and', 'CC'), ('diligently', 'RB'), ("''", "''"), ('in', 'IN'), ('order', 'NN'), ('to', 'TO'), ('``', '``'), ('determine', 'VB'), ('that', 'IN'), ('these', 'DT'), ('newly', 'RB'), ('released', 'VBN'), ('products', 'NNS'), ('do', 'VBP'), ('infringe', 'VB'), ('many', 'JJ'), ('of', 'IN'), ('the', 'DT'), ('same', 'JJ'), ('claims', 'NNS'), ('already', 'RB'), ('asserted', 'VBN'), ('by', 'IN'), ('Apple', 'NNP'), ('.', '.'), ("''", "''")]
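A long list of (token, tag) tuples is hard to read at a glance. The sketch below is one possible way to summarize it: it counts how often each tag occurs and groups the tokens per tag. It uses only the first nine pairs copied from the output above; the variable names are chosen for illustration.

```python
from collections import Counter, defaultdict

# The first nine (token, tag) pairs copied from the tagger output above.
tagged = [('Apple', 'NNP'), ('stated', 'VBD'), ('it', 'PRP'), ('had', 'VBD'),
          ('“', 'NNP'), ('acted', 'VBD'), ('quickly', 'RB'), ('and', 'CC'),
          ('diligently', 'RB')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Which tokens received each tag?
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

print(tag_counts.most_common())
print(tokens_by_tag['NNP'])
```

Grouping the tokens per tag makes questionable assignments easier to spot, such as the opening quotation mark “ that ends up among the proper nouns (NNP).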


### [point: 1] Exercise 1b: Named Entity Recognition (NER)
Use *nltk.chunk.ne_chunk* to perform Named Entity Recognition (NER) on each sentence.

Use **print** to show the output in the notebook (and hence also in the exported PDF!).

In [58]:
tokens_pos_tagged_and_named_entities = []
print(tokens_pos_tagged_and_named_entities)

[]


# KEY

In [44]:
from nltk.chunk import ne_chunk

tokens_pos_tagged_and_named_entities = ne_chunk(pos_tagged_sentence_tokens)
print(tokens_pos_tagged_and_named_entities)


(S
  (PERSON Apple/NNP)
  stated/VBD
  it/PRP
  had/VBD
  “/NNP
  acted/VBD
  quickly/RB
  and/CC
  diligently/RB
  ''/''
  in/IN
  order/NN
  to/TO
  ``/``
  determine/VB
  that/IN
  these/DT
  newly/RB
  released/VBN
  products/NNS
  do/VBP
  infringe/VB
  many/JJ
  of/IN
  the/DT
  same/JJ
  claims/NNS
  already/RB
  asserted/VBN
  by/IN
  (PERSON Apple/NNP)
  ./.
  ''/'')


### [points: 2] Exercise 1c: Constituency parsing
Use the *nltk.RegexpParser* to perform constituency parsing on each sentence.

Use **print** to show the output in the notebook (and hence also in the exported PDF!).

In [60]:
constituent_parser_v1 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')

In [61]:
constituency_output_for_sentence = []
# add your code here to assign the output of the parser 'constituent_parser_v1' to the variable 'constituency_output_for_sentence'

In [62]:
print(constituency_output_for_sentence)

[]


Augment the RegexpParser so that it also detects Named Entity Phrases (NEP), e.g., that it detects phrases such as *Galaxy S III* and *Ice Cream Sandwich* as entity phrases, which we give the label 'NEP'.

In [63]:
constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {}             # ???''')

# KEY

In [64]:
constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition 
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {<NNP>+}       # NEP -> one or more proper nouns''')
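To get a feeling for what a rule that chunks consecutive proper nouns captures, the grouping can be imitated in plain Python. This is only a sketch of the idea, not how NLTK implements chunk rules; the tokens are tagged by hand for illustration and the helper name `nnp_runs` is made up.

```python
from itertools import groupby

# Hand-tagged tokens (an assumption for illustration, not tagger output).
tagged = [('the', 'DT'), ('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NNP'),
          (',', ','), ('running', 'VBG'), ('the', 'DT'), ('new', 'JJ'),
          ('Jelly', 'NNP'), ('Bean', 'NNP'), ('system', 'NN')]

def nnp_runs(tagged_tokens):
    """Return each maximal run of consecutive NNP tokens as one phrase."""
    phrases = []
    for is_nnp, group in groupby(tagged_tokens, key=lambda pair: pair[1] == 'NNP'):
        if is_nnp:
            phrases.append(' '.join(token for token, _ in group))
    return phrases

entity_phrases = nnp_runs(tagged)
print(entity_phrases)
```

With these hand-tagged tokens the sketch yields 'Galaxy S III' and 'Jelly Bean'. Whether the parser finds the same phrases in the real text depends on the tags that the PoS tagger assigns.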

In [65]:
constituency_v2_output_per_sentence = []
constituency_v2_output_per_sentence = constituent_parser_v2.parse(pos_tagged_sentence_tokens)

In [33]:
print(constituency_v2_output_per_sentence)


## [total points: 1] Exercise 2: spaCy
Use spaCy to process the same text as you analyzed with NLTK.

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [9]:
doc = nlp(text) # insert code here

**Tip**: You can use **sents = list(doc.sents)** to access a sentence by index, e.g. **sents[2]** for the third sentence. Use the previously defined "sentence_id" to get the output for the same sentence as with NLTK. Note that we assume that the sentences are split in the same way.
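The assumption that both tools split the text into the same sentences can be checked rather than taken for granted. The sketch below compares two lists of sentence strings and reports the first index at which they differ. The function name `first_mismatch` and the two example lists are hypothetical.

```python
def first_mismatch(sentences_a, sentences_b):
    """Return the first index at which the two segmentations differ, or None."""
    for index, (a, b) in enumerate(zip(sentences_a, sentences_b)):
        # Normalize whitespace so that line breaks do not count as differences.
        if ' '.join(a.split()) != ' '.join(b.split()):
            return index
    if len(sentences_a) != len(sentences_b):
        return min(len(sentences_a), len(sentences_b))
    return None

# Hypothetical segmentations: the second splitter breaks after "8.9".
nltk_like = ['Apple sued Samsung.', 'The Galaxy Tab 8.9 Wifi tablet is affected.']
spacy_like = ['Apple sued Samsung.', 'The Galaxy Tab 8.9', 'Wifi tablet is affected.']

print(first_mismatch(nltk_like, spacy_like))
```

In the notebook, the function could be called with the NLTK sentence list and `[sent.text for sent in doc.sents]`. If it returns an index at or below "sentence_id", the two tools are not looking at the same sentence.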

## [total points: 5] Exercise 3: Comparison NLTK and spaCy
We will now compare the output of NLTK and spaCy. Take the same sentence that you just processed with NLTK and check the spaCy output for that sentence: in what ways do they differ?

### [points: 2] Exercise 3a: Part of speech tagging
Compare the output from NLTK and spaCy regarding part-of-speech tagging for the same sentence selected before. You already have the PoS tags from NLTK. Get the tokens and their PoS tags (**token.tag_**) from spaCy. Print both and describe any differences. This is not a trick question; it is possible that there are no differences.

In [45]:
sentence=sents[sentence_id]
print(sentence)

Apple stated it had “acted quickly and diligently" in order to "determine that these newly released products do infringe many of the same claims already asserted by Apple.


In [46]:
for ent in sentence.ents:
    print(ent)
print(sentence.ents)

Apple
Apple
[Apple, Apple]


# KEY

In [47]:
sents = list(doc.sents)
for i, sentence_nltk in enumerate(sentences_nltk):
    print('SENTENCE:', i)
    print('NLTK', sentence_nltk)
    print()
    print('SpaCy', sents[i])
    print()
    print()

SENTENCE: 0
NLTK https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.

SpaCy https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.



SENTENCE: 1
NLTK The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.

SpaCy The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9

# KEY

In [49]:
print('NLTK TOKENS:', sentence_tokens)
spacy_tokens=sents[sentence_id]
print('SPACY TOKENS:', spacy_tokens)

NLTK TOKENS: ['Apple', 'stated', 'it', 'had', '“', 'acted', 'quickly', 'and', 'diligently', "''", 'in', 'order', 'to', '``', 'determine', 'that', 'these', 'newly', 'released', 'products', 'do', 'infringe', 'many', 'of', 'the', 'same', 'claims', 'already', 'asserted', 'by', 'Apple', '.', "''"]
SPACY TOKENS: Apple stated it had “acted quickly and diligently" in order to "determine that these newly released products do infringe many of the same claims already asserted by Apple.


In [55]:
spacy_tokens=sents[sentence_id]
# zip stops at the shorter sequence, so a different number of tokens cannot raise an IndexError
for nltk_token, spacy_token in zip(pos_tagged_sentence_tokens, spacy_tokens):
    print('NLTK:', nltk_token, 'SPACY:', spacy_token.text, spacy_token.tag_)

NLTK: ('Apple', 'NNP') SPACY: Apple NNP
NLTK: ('stated', 'VBD') SPACY: stated VBD
NLTK: ('it', 'PRP') SPACY: it PRP
NLTK: ('had', 'VBD') SPACY: had VBD
NLTK: ('“', 'NNP') SPACY: “ ``
NLTK: ('acted', 'VBD') SPACY: acted VBN
NLTK: ('quickly', 'RB') SPACY: quickly RB
NLTK: ('and', 'CC') SPACY: and CC
NLTK: ('diligently', 'RB') SPACY: diligently RB
NLTK: ("''", "''") SPACY: " ``
NLTK: ('in', 'IN') SPACY: in IN
NLTK: ('order', 'NN') SPACY: order NN
NLTK: ('to', 'TO') SPACY: to TO
NLTK: ('``', '``') SPACY: " ``
NLTK: ('determine', 'VB') SPACY: determine VB
NLTK: ('that', 'IN') SPACY: that IN
NLTK: ('these', 'DT') SPACY: these DT
NLTK: ('newly', 'RB') SPACY: newly RB
NLTK: ('released', 'VBN') SPACY: released VBN
NLTK: ('products', 'NNS') SPACY: products NNS
NLTK: ('do', 'VBP') SPACY: do VBP
NLTK: ('infringe', 'VB') SPACY: infringe VB
NLTK: ('many', 'JJ') SPACY: many JJ
NLTK: ('of', 'IN') SPACY: of IN
NLTK: ('the', 'DT') SPACY: the DT
NLTK: ('same', 'JJ') SPACY: same JJ
NLTK: ('claims', 'NNS')

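**Observation**: Most of the PoS tags are the same between the two modules. For the sentence shown above, they differ in three places: NLTK tags the opening quotation mark “ as 'NNP' whereas spaCy gives it the opening-quote tag; 'acted' is 'VBD' according to NLTK and 'VBN' according to spaCy; and the quotation mark after 'diligently' gets the closing-quote tag from NLTK but the opening-quote tag from spaCy. In other sentences of the text there are further differences. For example, they disagree on the kind of verb for 'filed'. NLTK consistently marks 'to' as 'TO', whereas spaCy marks it as 'IN' in some cases (in the sentence above, both tag 'to' as 'TO'). 'operating' is a noun according to NLTK whereas it is a verb according to spaCy.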
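Scanning two long columns of tags by eye is error-prone. A sketch of one way to print only the disagreements is given below. It uses five token pairs copied from the output above and assumes that both tools produced the same number of tokens in the same order.

```python
# Five aligned (token, tag) pairs copied from the comparison output above.
nltk_tagged = [('had', 'VBD'), ('“', 'NNP'), ('acted', 'VBD'),
               ('quickly', 'RB'), ("''", "''")]
spacy_tagged = [('had', 'VBD'), ('“', '``'), ('acted', 'VBN'),
                ('quickly', 'RB'), ('"', '``')]

# Keep only the positions where the two tags disagree.
differences = [(nltk_token, nltk_tag, spacy_tag)
               for (nltk_token, nltk_tag), (_, spacy_tag)
               in zip(nltk_tagged, spacy_tagged)
               if nltk_tag != spacy_tag]

for nltk_token, nltk_tag, spacy_tag in differences:
    print(nltk_token, 'NLTK:', nltk_tag, 'SPACY:', spacy_tag)
```

If the tokenizations differ, the pairs shift against each other from that point on and the comparison is no longer meaningful, so it is worth checking the token counts first.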

### [points: 1] Exercise 3b: Named Entity Recognition (NER)
* For the same sentence, describe the differences between the output from NLTK and spaCy for Named Entity Recognition. Which one do you think performs better?

In [58]:
print(sentence_id)
print('NLTK named entity & pos tags:', tokens_pos_tagged_and_named_entities)

print('NLTK named entities:')
for chunk in tokens_pos_tagged_and_named_entities:
    if hasattr(chunk, 'label'):
            ent_type=chunk.label()
            ent_mention=' '.join(c[0] for c in chunk)
            print(ent_mention, ent_type)
            
print('SPACY named entities:')
for ent in sentence.ents:
    print(ent.text, ent.label_)

2
NLTK named entity & pos tags: (S
  (PERSON Apple/NNP)
  stated/VBD
  it/PRP
  had/VBD
  “/NNP
  acted/VBD
  quickly/RB
  and/CC
  diligently/RB
  ''/''
  in/IN
  order/NN
  to/TO
  ``/``
  determine/VB
  that/IN
  these/DT
  newly/RB
  released/VBN
  products/NNS
  do/VBP
  infringe/VB
  many/JJ
  of/IN
  the/DT
  same/JJ
  claims/NNS
  already/RB
  asserted/VBN
  by/IN
  (PERSON Apple/NNP)
  ./.
  ''/'')
NLTK named entities:
Apple PERSON
Apple PERSON
SPACY named entities:
Apple ORG
Apple ORG


**Observations:** spaCy seems to perform better. In this sentence, NLTK labels both mentions of 'Apple' as PERSON, whereas spaCy labels them ORG. Elsewhere in the text, NLTK makes other obvious mistakes, such as classifying 'UK' and 'San Jose' as organizations. Both tools mistakenly classify 'iPhone' as an organization and 'Galaxy S III' as a person.

### [points: 1] Exercise 3c: Constituency/dependency parsing
For the same sentence from the text, run constituency parsing using NLTK and dependency parsing using spaCy.
* Briefly describe the difference between constituency parsing and dependency parsing.


In [52]:
constituent_parser = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')

In [53]:
constituent_structure = constituent_parser.parse(pos_tagged_sentence_tokens)
constituent_structure.draw()

In [54]:
from spacy import displacy
displacy.render(list(doc.sents)[sentence_id], jupyter=True, style='dep')

**Answer:** 

A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled.

A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that are dependent on the parent, and edges are labeled by the relationship.

There are quite a few differences, some of which stem from the differences in the task definition (constituency vs. dependency), and some of which are decisions that the tools made within the task. For instance, regarding the latter, in spaCy 'that', 'these', etc. are associated with the verb 'infringe'; in NLTK, 'that' and 'these' form a phrase with each other and group with other words in the higher layers, only merging with the phrase of 'infringe' at the root node.
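The contrast between the two representations can also be made concrete as data structures. The sketch below encodes a hand-built analysis of the short sentence "Samsung lost a case" in both styles. Both analyses are illustrative assumptions and not the output of NLTK or spaCy.

```python
# Constituency: nested phrases; the first element of each tuple is the phrase label.
constituency = ('S',
                ('NP', 'Samsung'),
                ('VP', 'lost',
                 ('NP', 'a', 'case')))

# Dependency: (head, relation, dependent) triples; 'ROOT' marks the sentence head.
dependencies = [('ROOT', 'root', 'lost'),
                ('lost', 'nsubj', 'Samsung'),
                ('lost', 'dobj', 'case'),
                ('case', 'det', 'a')]

def leaves(tree):
    """Collect the words (terminals) of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the phrase label
        words.extend(leaves(child))
    return words

words = leaves(constituency)
# Every word has exactly one head in a dependency analysis.
heads = {dependent: (head, relation) for head, relation, dependent in dependencies}

print(words)
print(heads['case'])
```

In the constituency structure the labels sit on the phrases (NP, VP), while the words are only the leaves. In the dependency structure there are no phrase nodes at all: every word points to exactly one head, and the label sits on that link.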

# End of assignment 1