# Lab1-Assignment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the assignment for Lab 1 of the text mining course. 

**Points**: each exercise is prefixed with the number of points you can obtain for the exercise.

We assume you have worked through the following notebooks:
* **Lab1.1-introduction**
* **Lab1.2-introduction-to-NLTK**
* **Lab1.3-introduction-to-spaCy** 

In this assignment, you will process an English text (**Lab1-apple-samsung-example.txt**) with both NLTK and spaCy and discuss the similarities and differences.

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Tip: how to read a file from disk
Let's open the file **Lab1-apple-samsung-example.txt** from disk.

In [5]:
from pathlib import Path

In [6]:
cur_dir = Path().resolve()  # this should be the folder in which this notebook is placed
path_to_file = cur_dir / 'Lab1-apple-samsung-example.txt'
print(path_to_file)
print('does path exist? ->', path_to_file.exists())

/Users/fredrik/Documents/GitHub/ba-text-mining/lab_sessions/lab1/Lab1-apple-samsung-example.txt
does path exist? -> True


If the output from the code cell above states that **does path exist? -> False**, please check that the file **Lab1-apple-samsung-example.txt** is in the same directory as this notebook.
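
The same check-then-read pattern can be wrapped in a small helper. This is a minimal sketch, not part of the assignment: the function name `read_text_file` and the throwaway file `demo.txt` are made up for illustration, and UTF-8 is an assumed encoding.

```python
from pathlib import Path
import tempfile


def read_text_file(path: Path) -> str:
    """Return the file's contents, or raise a clear error if it is missing."""
    if not path.exists():
        raise FileNotFoundError(f"expected to find {path.name} in {path.parent}")
    # Being explicit about the encoding avoids platform-dependent defaults.
    return path.read_text(encoding="utf-8")


# Demonstration with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("Apple v. Samsung", encoding="utf-8")
    demo_text = read_text_file(demo_path)
    print("number of characters", len(demo_text))
```

For the assignment itself, the argument would be the `path_to_file` built above.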

In [7]:
with open(path_to_file, encoding='utf-8') as infile:  # explicit encoding avoids platform-dependent defaults
    text = infile.read()

print('number of characters', len(text))

number of characters 1139


## [total points: 4] Exercise 1: NLTK
In this exercise, we use NLTK to apply **Part-of-speech (POS) tagging**, **Named Entity Recognition (NER)**, and **Constituency parsing**. The following code snippet already performs sentence splitting and tokenization. 

In [8]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize

In [9]:
sentences_nltk = sent_tokenize(text)

In [10]:
tokens_per_sentence = []
for sentence_nltk in sentences_nltk:
    sent_tokens = word_tokenize(sentence_nltk)
    tokens_per_sentence.append(sent_tokens)

We will use lists to keep track of the output of the NLP tasks. We can hence inspect the output for each task using the index of the sentence.

In [11]:
sent_id = 1
print('SENTENCE', sentences_nltk[sent_id])
print('TOKENS', tokens_per_sentence[sent_id])

SENTENCE The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
TOKENS ['The', 'six', 'phones', 'and', 'tablets', 'affected', 'are', 'the', 'Galaxy', 'S', 'III', ',', 'running', 'the', 'new', 'Jelly', 'Bean', 'system', ',', 'the', 'Galaxy', 'Tab', '8.9', 'Wifi', 'tablet', ',', 'the', 'Galaxy', 'Tab', '2', '10.1', ',', 'Galaxy', 'Rugby', 'Pro', 'and', 'Galaxy', 'S', 'III', 'mini', '.']


### [point: 1] Exercise 1a: Part-of-speech (POS) tagging
Use `nltk.pos_tag` to perform part-of-speech tagging on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [12]:
pos_tags_per_sentence = []
for tokens in tokens_per_sentence:
    pos_tagged = nltk.pos_tag(tokens)
    pos_tags_per_sentence.append(pos_tagged)

    print(pos_tagged)
    print()

[('https', 'NN'), (':', ':'), ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ'), ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'), ('California', 'NNP'), ('on', 'IN'), ('November', 'NNP'), ('23', 'CD'), ('list', 'NN'), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), ('operating', 'VBG'), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), ('Apple', 'NNP'), ('claims', 'VBZ'), ('infringe', 'VB'), ('its', 'PRP$'), ('patents', 'NNS'), ('.', '.')]

[('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'), ('tablets', 'NNS'), ('affected', 'VBN'), ('are', 'VBP'), ('the', 'DT'), ('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NNP

In [13]:
# print(pos_tags_per_sentence)
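
Once the tagged output is stored as lists of `(token, tag)` pairs, it can be summarised with the standard library. The sketch below uses a hand-copied excerpt of the second sentence's output instead of calling the tagger, so it runs without any NLTK data; the variable names are illustrative.

```python
from collections import Counter

# Hand-copied excerpt of the tagger output shown above (second sentence).
sample_tagged = [
    ('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'),
    ('tablets', 'NNS'), ('affected', 'VBN'), ('are', 'VBP'), ('the', 'DT'),
    ('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NNP'),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in sample_tagged)

# Which tokens were tagged as proper nouns?
proper_nouns = [token for token, tag in sample_tagged if tag == 'NNP']

print(tag_counts.most_common(3))
print(proper_nouns)
```

The same two lines work on any element of `pos_tags_per_sentence`.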

### [point: 1] Exercise 1b: Named Entity Recognition (NER)
Use `nltk.chunk.ne_chunk` to perform Named Entity Recognition (NER) on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [14]:
from nltk.chunk import ne_chunk

In [15]:
ner_tags_per_sentence = []

for pos_tagged in pos_tags_per_sentence:
    ne_chunked = ne_chunk(pos_tagged)
    ner_tags_per_sentence.append(ne_chunked)

    print(ne_chunked)
    print()

(S
  https/NN
  :/:
  //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ
  Documents/NNS
  filed/VBN
  to/TO
  the/DT
  (ORGANIZATION San/NNP Jose/NNP)
  federal/JJ
  court/NN
  in/IN
  (GPE California/NNP)
  on/IN
  November/NNP
  23/CD
  list/NN
  six/CD
  (ORGANIZATION Samsung/NNP)
  products/NNS
  running/VBG
  the/DT
  ``/``
  Jelly/RB
  (GPE Bean/NNP)
  ''/''
  and/CC
  ``/``
  Ice/NNP
  Cream/NNP
  Sandwich/NNP
  ''/''
  operating/VBG
  systems/NNS
  ,/,
  which/WDT
  (PERSON Apple/NNP)
  claims/VBZ
  infringe/VB
  its/PRP$
  patents/NNS
  ./.)

(S
  The/DT
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  affected/VBN
  are/VBP
  the/DT
  (ORGANIZATION Galaxy/NNP)
  S/NNP
  III/NNP
  ,/,
  running/VBG
  the/DT
  new/JJ
  (PERSON Jelly/NNP Bean/NNP)
  system/NN
  ,/,
  the/DT
  (ORGANIZATION Galaxy/NNP)
  Tab/NNP
  8.9/CD
  Wifi/NNP
  tablet/NN
  ,/,
  the/DT
  (ORGANIZATION Galaxy/NNP)
  Tab/NNP
  2/CD
  10.1/CD
  ,/,
  (P

In [16]:
# print(ner_tags_per_sentence)
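
The result of `ne_chunk` is a tree in which plain tokens are `(word, tag)` tuples and recognised entities are labelled subtrees. The sketch below shows one way to pull out only the entities. It builds a small tree by hand (an excerpt of the first sentence above) so that no NLTK models are needed; the helper name `extract_entities` is made up for illustration.

```python
from nltk import Tree

# Hand-built stand-in for one ne_chunk result (excerpt of the first sentence above).
sample_tree = Tree('S', [
    ('the', 'DT'),
    Tree('ORGANIZATION', [('San', 'NNP'), ('Jose', 'NNP')]),
    ('federal', 'JJ'),
    ('court', 'NN'),
    ('in', 'IN'),
    Tree('GPE', [('California', 'NNP')]),
])


def extract_entities(tree):
    """Return (label, text) pairs for every labelled chunk directly below the root."""
    entities = []
    for child in tree:
        if isinstance(child, Tree):  # plain tokens are (word, tag) tuples
            text = ' '.join(word for word, _ in child.leaves())
            entities.append((child.label(), text))
    return entities


print(extract_entities(sample_tree))
```

Checking `isinstance(child, Tree)` states the intent more directly than testing that the child is not a tuple.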

### [points: 2] Exercise 1c: Constituency parsing
Use the `nltk.RegexpParser` to perform constituency parsing on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [17]:
constituent_parser = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')

In [18]:
constituency_output_per_sentence = []

for pos_tagged in pos_tags_per_sentence:
    const_parsed = constituent_parser.parse(pos_tagged)
    constituency_output_per_sentence.append(const_parsed)

    print(const_parsed)
    print()

(S
  (NP https/NN)
  :/:
  (NP
    //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ)
  Documents/NNS
  (VP (V filed/VBN))
  to/TO
  (NP the/DT)
  San/NNP
  Jose/NNP
  (NP federal/JJ court/NN)
  (P in/IN)
  California/NNP
  (P on/IN)
  November/NNP
  23/CD
  (NP list/NN)
  six/CD
  Samsung/NNP
  products/NNS
  (VP (V running/VBG) (NP the/DT))
  ``/``
  Jelly/RB
  Bean/NNP
  ''/''
  and/CC
  ``/``
  Ice/NNP
  Cream/NNP
  Sandwich/NNP
  ''/''
  (VP (V operating/VBG))
  systems/NNS
  ,/,
  which/WDT
  Apple/NNP
  (VP (V claims/VBZ))
  (VP (V infringe/VB))
  its/PRP$
  patents/NNS
  ./.)

(S
  (NP The/DT)
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  (VP (V affected/VBN))
  (VP (V are/VBP) (NP the/DT))
  Galaxy/NNP
  S/NNP
  III/NNP
  ,/,
  (VP (V running/VBG) (NP the/DT new/JJ))
  Jelly/NNP
  Bean/NNP
  (NP system/NN)
  ,/,
  (NP the/DT)
  Galaxy/NNP
  Tab/NNP
  8.9/CD
  Wifi/NNP
  (NP tablet/NN)
  ,/,
  (NP the/DT)
  Galaxy/NNP

In [19]:
# print(constituency_output_per_sentence)
# constituency_output_per_sentence[1].draw()
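
In the output above, fragments such as `(NP the/DT)` and `(NP the/DT new/JJ)` appear because every element of the NP rule is optional, and because `<NN>` matches only the exact tag `NN`. The sketch below contrasts that rule with a hypothetical stricter variant that requires at least one noun of any kind. The variant is an illustration, not the expected answer to the exercise, and the input is hand-tagged so that no tagger model is needed.

```python
import nltk

# Hand-tagged excerpt of the second sentence (tags as in the output above).
tagged = [('the', 'DT'), ('new', 'JJ'), ('Jelly', 'NNP'),
          ('Bean', 'NNP'), ('system', 'NN')]

# Rule as used above: determiner, adjectives and nouns are all optional.
loose_parser = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN>*}')

# Hypothetical variant: at least one noun tag starting with NN (NN, NNS, NNP, NNPS).
strict_parser = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')

loose_tree = loose_parser.parse(tagged)
strict_tree = strict_parser.parse(tagged)

print(loose_tree)
print(strict_tree)
```

With the stricter rule the whole span becomes a single NP, whereas the original rule leaves the proper nouns outside any chunk. Whether proper nouns belong inside NP or in a separate chunk type is a design choice.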

Augment the RegexpParser so that it also detects Named Entity Phrases (NEP), e.g., *Galaxy S III* and *Ice Cream Sandwich*.

In [20]:
constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {<NNP>+}       # Named Entities''')

In [21]:
constituency_v2_output_per_sentence = []

for pos_tagged in pos_tags_per_sentence:
    const_parsed_v2 = constituent_parser_v2.parse(pos_tagged)
    constituency_v2_output_per_sentence.append(const_parsed_v2)

    print(const_parsed_v2)
    print()

(S
  (NP https/NN)
  :/:
  (NP
    //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ)
  Documents/NNS
  (VP (V filed/VBN))
  to/TO
  (NP the/DT)
  (NEP San/NNP Jose/NNP)
  (NP federal/JJ court/NN)
  (P in/IN)
  (NEP California/NNP)
  (P on/IN)
  (NEP November/NNP)
  23/CD
  (NP list/NN)
  six/CD
  (NEP Samsung/NNP)
  products/NNS
  (VP (V running/VBG) (NP the/DT))
  ``/``
  Jelly/RB
  (NEP Bean/NNP)
  ''/''
  and/CC
  ``/``
  (NEP Ice/NNP Cream/NNP Sandwich/NNP)
  ''/''
  (VP (V operating/VBG))
  systems/NNS
  ,/,
  which/WDT
  (NEP Apple/NNP)
  (VP (V claims/VBZ))
  (VP (V infringe/VB))
  its/PRP$
  patents/NNS
  ./.)

(S
  (NP The/DT)
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  (VP (V affected/VBN))
  (VP (V are/VBP) (NP the/DT))
  (NEP Galaxy/NNP S/NNP III/NNP)
  ,/,
  (VP (V running/VBG) (NP the/DT new/JJ))
  (NEP Jelly/NNP Bean/NNP)
  (NP system/NN)
  ,/,
  (NP the/DT)
  (NEP Galaxy/NNP Tab/NNP)
  8.9/CD
  (NEP Wifi/NN

In [22]:
# print(constituency_v2_output_per_sentence)
# constituency_v2_output_per_sentence[0].draw()

## [total points: 1] Exercise 2: spaCy
Use spaCy to process the same text as you analyzed with NLTK.

In [23]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [24]:
doc = nlp(text) # insert code here

Small tip: you can use **sents = list(doc.sents)** to access sentences by index, e.g., **sents[2]** for the third sentence.


## [total points: 7] Exercise 3: Comparison NLTK and spaCy
We will now compare the output of NLTK and spaCy, i.e., how do they differ?

### [points: 3] Exercise 3a: Part of speech tagging
Compare the output from NLTK and spaCy regarding part of speech tagging.

* To compare, you will probably want to go sentence by sentence. Describe whether the sentence splitting is different for NLTK than for spaCy. If so, where do they differ?
* After checking the sentence splitting, select a sentence for which you expect interesting results and perhaps differences. Motivate your choice.
* Compare the output in `token.tag_` from spaCy to the part of speech tagging from NLTK for each token in your selected sentence. Are there any differences? This is not a trick question; it is possible that there are no differences.

### Sentence Splitting Check

In [25]:
nltk_list, spacy_list = list(sentences_nltk), list(doc.sents)
print(spacy_list[1].text == '\n')
del spacy_list[1]
nltk_iter, spacy_iter = iter(nltk_list), iter(spacy_list)

for _ in range(max(len(nltk_list), len(spacy_list))):
    print('NLTK:')
    print(next(nltk_iter, 'N/A'))
    print()
    print('Spacy:')
    print(next(spacy_iter, 'N/A'))
    print('---------------------------')

True
NLTK:
https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.

Spacy:
https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.
---------------------------
NLTK:
The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.

Spacy:
The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Gala

#### Answer
spaCy returns the newline "\n" as a separate sentence (index 1 in `spacy_list`). Apart from that, the sentence splitting is identical. I removed that extra sentence to make the comparison easier to read.
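
The manual check above can also be written as a small function that drops whitespace-only sentences and reports only the positions where the two splitters disagree. This is a sketch with made-up miniature input; the function name `compare_sentence_splits` is an assumption for illustration. For spaCy, the input would be the `.text` of each sentence span.

```python
from itertools import zip_longest


def compare_sentence_splits(left, right):
    """Pair sentences by position, ignoring whitespace-only items.

    Returns a list of (index, left_sentence, right_sentence) for mismatches.
    """
    left = [s.strip() for s in left if s.strip()]
    right = [s.strip() for s in right if s.strip()]
    mismatches = []
    for idx, (a, b) in enumerate(zip_longest(left, right, fillvalue='N/A')):
        if a != b:
            mismatches.append((idx, a, b))
    return mismatches


# Made-up miniature example: the second splitter emits "\n" as its own sentence.
splitter_a = ['First sentence.', 'Second sentence.']
splitter_b = ['First sentence.', '\n', 'Second sentence.']
print(compare_sentence_splits(splitter_a, splitter_b))
```

An empty result means the two splitters agree once whitespace-only sentences are ignored.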

### POS tagging comparison

In [26]:
sentences_spacy = doc.sents
pos_tags_per_sentence_spacy = []

for idx, sentence in enumerate(sentences_spacy):
    pos_tags_per_sentence_spacy.append([])

    for token in sentence:
        pos_tags_per_sentence_spacy[idx].append([token.text, token.tag_])

In [27]:
for pos_tag in pos_tags_per_sentence_spacy[0]:
    print(pos_tag[0], pos_tag[1])

print("--------------------")

for pos_tag in pos_tags_per_sentence[0]:
    print(pos_tag[0], pos_tag[1])

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html NNP


 _SP
Documents NNS
filed VBD
to IN
the DT
San NNP
Jose NNP
federal JJ
court NN
in IN
California NNP
on IN
November NNP
23 CD
list NN
six CD
Samsung NNP
products NNS
running VBG
the DT
" ``
Jelly NNP
Bean NNP
" ''
and CC
" ``
Ice NNP
Cream NNP
Sandwich NNP
" ''
operating NN
systems NNS
, ,
which WDT
Apple NNP
claims NNS
infringe VBP
its PRP$
patents NNS
. .
--------------------
https NN
: :
//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html JJ
Documents NNS
filed VBN
to TO
the DT
San NNP
Jose NNP
federal JJ
court NN
in IN
California NNP
on IN
November NNP
23 CD
list NN
six CD
Samsung NNP
products NNS
running VBG
the DT
`` ``
Jelly RB
Bean NNP
'' ''
and CC
`` ``
Ice NNP
Cream NNP
Sandwich NNP
'' ''
operating VBG
systems NNS
, ,
which WDT
Apple NNP
claims VBZ
infringe VB
its PRP$
patents NNS
. .


#### Answer
I chose the first sentence because it contains a URL; in all other respects the sentences are similar (they contain complex sentence structures and named entities).

The differences between spaCy and NLTK POS tagging are the following:

spaCy:
- Tags the whole URL as a single token (NNP)
- Tags "filed" as VBD (past tense)
- Correctly tags "Jelly" and "Bean" as NNP (proper noun)
- Correctly (I think) tags "operating" as NN (noun), because the term is "operating systems"
- Incorrectly tags "claims" as NNS (plural noun)
- Tags "infringe" as VBP (non-3rd-person-singular present verb), which I think is correct

NLTK:
- Splits the URL into three tokens (`https`, `:` and the rest) and tags them NN, `:` and JJ
- Tags "filed" as VBN (past participle)
- Incorrectly tags "Jelly" as RB (adverb); only "Bean" is tagged NNP (proper noun)
- Incorrectly (I think) tags "operating" as VBG (a verb form)
- Correctly tags "claims" as VBZ (verb)
- Tags "infringe" as VB (base form verb)
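
Because the two tokenizers do not produce the same tokens (the URL is one token for spaCy and three for NLTK), comparing tags position by position does not work directly. One option, sketched below with the standard library's `difflib`, is to align the two token sequences first and compare tags only where the token text matches. The function name `tag_disagreements` is made up, and the two lists are abridged excerpts copied from the outputs above.

```python
from difflib import SequenceMatcher


def tag_disagreements(tagged_a, tagged_b):
    """Align two (token, tag) sequences on token text and list differing tags."""
    tokens_a = [tok for tok, _ in tagged_a]
    tokens_b = [tok for tok, _ in tagged_b]
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    diffs = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            tok, tag_a = tagged_a[block.a + offset]
            _, tag_b = tagged_b[block.b + offset]
            if tag_a != tag_b:
                diffs.append((tok, tag_a, tag_b))
    return diffs


# Abridged excerpts copied from the two outputs above (spaCy first, NLTK second).
spacy_tags = [('Documents', 'NNS'), ('filed', 'VBD'), ('to', 'IN'), ('the', 'DT'),
              ('operating', 'NN'), ('systems', 'NNS'), ('claims', 'NNS'),
              ('infringe', 'VBP')]
nltk_tags = [('https', 'NN'), (':', ':'), ('Documents', 'NNS'), ('filed', 'VBN'),
             ('to', 'TO'), ('the', 'DT'), ('operating', 'VBG'), ('systems', 'NNS'),
             ('claims', 'VBZ'), ('infringe', 'VB')]

for token, spacy_tag, nltk_tag in tag_disagreements(spacy_tags, nltk_tags):
    print(token, spacy_tag, nltk_tag)
```

Tokens whose text differs between the two tools (for example the quotation marks, which NLTK rewrites) are not aligned and therefore not compared.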

### [points: 2] Exercise 3b: Named Entity Recognition (NER)
* Describe differences between the output from NLTK and spaCy for Named Entity Recognition. Which one do you think performs better?

In [28]:
for sentence in ner_tags_per_sentence:
    for ner_tag in sentence:
        if not isinstance(ner_tag, tuple):  # entity chunks are subtrees, plain tokens are tuples
            print(ner_tag)
    print()

(ORGANIZATION San/NNP Jose/NNP)
(GPE California/NNP)
(ORGANIZATION Samsung/NNP)
(GPE Bean/NNP)
(PERSON Apple/NNP)

(ORGANIZATION Galaxy/NNP)
(PERSON Jelly/NNP Bean/NNP)
(ORGANIZATION Galaxy/NNP)
(ORGANIZATION Galaxy/NNP)
(PERSON Galaxy/NNP Rugby/NNP Pro/NNP)
(PERSON Galaxy/NNP S/NNP)

(PERSON Apple/NNP)
(PERSON Apple/NNP)

(GPE August/NNP)
(PERSON Samsung/NNP)
(GSP US/NNP)
(GPE Apple/NNP)
(ORGANIZATION iPad/NN)
(ORGANIZATION iPhone/NN)
(GPE Galaxy/NNP)

(GPE Samsung/NNP)

(ORGANIZATION UK/NNP)
(GPE Samsung/NNP)
(PERSON Apple/NNP)
(LOCATION South/JJ Korean/JJ)



In [29]:
for sentence in doc.sents:
    for ner_tag in sentence.ents:
        print(ner_tag.label_, ner_tag.text)
    print()

GPE San Jose
GPE California
DATE November 23
CARDINAL six
ORG Samsung
WORK_OF_ART Jelly Bean
ORG Apple


CARDINAL six
GPE the Galaxy S III
ORG Jelly Bean
ORG the Galaxy Tab 2 10.1

ORG Apple

DATE August
ORG Samsung
GPE US
ORG Apple
MONEY 1.05bn
ORG iPad
ORG iPhone

ORG Samsung

GPE UK
ORG Samsung
ORG Apple
NORP South Korean
ORG iPad



1:
spaCy detects "November 23", "six" and "Jelly Bean", but it labels "Jelly Bean" as WORK_OF_ART, which is wrong.
NLTK does not detect the first two and detects only "Bean", not the complete "Jelly Bean". NLTK also labels Apple as PERSON.
NLTK never tags dates, times or numbers, so I won't mention those for the next sentences.

2:
spaCy detects the whole name of two device models ("the Galaxy S III" and "the Galaxy Tab 2 10.1"), whereas NLTK mostly detects only "Galaxy". Both label the devices incorrectly (NLTK as ORGANIZATION or PERSON, spaCy as GPE or ORG). In this sentence both NLTK and spaCy detect "Jelly Bean", but both label it incorrectly (PERSON and ORG).

3:
NLTK detects both instances of "Apple" but incorrectly labels them as PERSON. spaCy detects only one "Apple" but correctly labels it as an organization.

4:
This time NLTK detects a month ("August") but incorrectly labels it as GPE (geo-political entity). spaCy correctly labels Samsung as an organization; NLTK labels it as PERSON. Both spaCy and NLTK detect iPad and iPhone, but both label them as organizations. Only NLTK detects "Galaxy", but it labels it as GPE. spaCy also detects 1.05bn as MONEY.

Overall I think spaCy performs better:
- It detects times, amounts of money, dates, etc., which can be useful.
- Its labels are more often correct.

One caveat: spaCy sometimes misses named entities that NLTK does detect (for example the last "Galaxy" in the fourth sentence and the second "Apple" in the third sentence). So NLTK may be preferable if detecting as many entities as possible matters more than labelling them correctly.
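
A comparison like the one above can be made more systematic by treating each tool's output as a set of `(text, label)` pairs. The sketch below does this for the first sentence, with the entities copied from the two outputs above. The mapping from NLTK's `ORGANIZATION` to spaCy's `ORG` is an assumption made for illustration, since the two tools use different label inventories.

```python
# Entities for the first sentence, copied from the two outputs above.
nltk_entities = {('San Jose', 'ORGANIZATION'), ('California', 'GPE'),
                 ('Samsung', 'ORGANIZATION'), ('Bean', 'GPE'), ('Apple', 'PERSON')}
spacy_entities = {('San Jose', 'GPE'), ('California', 'GPE'), ('November 23', 'DATE'),
                  ('six', 'CARDINAL'), ('Samsung', 'ORG'),
                  ('Jelly Bean', 'WORK_OF_ART'), ('Apple', 'ORG')}

# Assumed mapping between the two label sets, for illustration only.
LABEL_MAP = {'ORGANIZATION': 'ORG'}


def normalise(entities):
    """Rename labels so that both tools use the same names where possible."""
    return {(text, LABEL_MAP.get(label, label)) for text, label in entities}


def spans(entities):
    """Return only the entity texts, ignoring the labels."""
    return {text for text, _ in entities}


nltk_norm = normalise(nltk_entities)
agreed = nltk_norm & spacy_entities                        # same text, same label
label_conflicts = (spans(nltk_norm) & spans(spacy_entities)) - spans(agreed)
only_spacy = spans(spacy_entities) - spans(nltk_norm)
only_nltk = spans(nltk_norm) - spans(spacy_entities)

print('agreed:', sorted(agreed))
print('same text, different label:', sorted(label_conflicts))
print('only spaCy:', sorted(only_spacy))
print('only NLTK:', sorted(only_nltk))
```

This counts agreement between the tools, not correctness; judging which label is right still requires reading the sentence.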


### [points: 2] Exercise 3c: Constituency/dependency parsing
Choose one sentence from the text and run constituency parsing using NLTK and dependency parsing using spaCy.
* describe briefly the difference between constituency parsing and dependency parsing
* describe differences between the output from NLTK and spaCy.

In [30]:
from spacy import displacy

In [31]:
sentence4_spacy = spacy_list[-1]
sentence4_nltk = pos_tags_per_sentence[-1]

#### Spacy Dependency Parsing

In [32]:
displacy.render(sentence4_spacy, jupyter=True, style='dep')

#### Constituency Parser NLTK

In [33]:
constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {<NNP>+}       # Named Entities''')

In [34]:
const_parsed_for_comparison = constituent_parser_v2.parse(sentence4_nltk)

print(const_parsed_for_comparison)

(S
  (NP A/DT similar/JJ case/NN)
  (PP (P in/IN) (NP the/DT))
  (NEP UK/NNP)
  (VP (V found/VBD))
  (P in/IN)
  (NEP Samsung/NNP)
  's/POS
  (NP favour/NN)
  and/CC
  (VP (V ordered/VBD))
  (NEP Apple/NNP)
  to/TO
  (VP (V publish/VB) (NP an/DT apology/NN))
  (VP
    (V making/VBG)
    (NP clear/JJ)
    (PP (P that/IN) (NP the/DT South/JJ Korean/JJ firm/NN)))
  (VP (V had/VBD))
  not/RB
  (VP (V copied/VBN))
  its/PRP$
  (NP iPad/NN)
  when/WRB
  (VP (V designing/VBG))
  its/PRP$
  (NP own/JJ)
  devices/NNS
  ./.)


#### Answer

A constituency parser breaks a sentence into different types of sub-phrases (for example noun phrase, prepositional phrase, verb phrase, etc.) and shows the part of speech of individual words.

A dependency parser shows the part of speech of words and the type of dependency from one word to another. For example, the word "case" has a "det" (determiner) relationship to the word "A" and an "amod" (adjectival modifier) relationship to the word "similar".
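
The difference between the two representations can be made concrete with plain Python data structures. The sketch below encodes the phrase "A similar case" in both ways, using the NP chunk from the NLTK output above and the two relations named in the answer. The tuple layouts and the helper `dependents_of` are illustrative choices, not the internal formats of NLTK or spaCy.

```python
# Constituency view: a phrase is a labelled bracket around a span of tokens.
constituency = ('NP', [('A', 'DT'), ('similar', 'JJ'), ('case', 'NN')])

# Dependency view: each arc links a dependent to its head with a relation label.
# Layout assumed here: (dependent, relation, head).
dependencies = [('A', 'det', 'case'), ('similar', 'amod', 'case')]


def dependents_of(head, arcs):
    """Return (dependent, relation) pairs for all arcs pointing at `head`."""
    return [(dep, rel) for dep, rel, arc_head in arcs if arc_head == head]


phrase_label, phrase_tokens = constituency
print(phrase_label, [word for word, _ in phrase_tokens])
print(dependents_of('case', dependencies))
```

The constituency structure groups the three words under one label, while the dependency structure has no phrase node at all: the grouping is implied by both arcs sharing the head "case".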

# End of this notebook