# Exercises Lecture 7: Preprocessing Text

# NLTK and SpaCy

* Tokenizing, POS tagging, Stemming, Lemmatizing, Parsing, Named Entity Recognition (NER)

To install NLTK, run the following cell. If this does not work, follow the [documentation](http://www.nltk.org/install.html).

In [None]:
%%bash
pip install nltk

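To install SpaCy, try the commands below. If they do not work, look [here](https://spacy.io/usage/).

In [None]:
# spaCy 3 and later require the full model name; the 'en' shortcut is not supported
#conda install -c conda-forge spacy
#python -m spacy download en_core_web_sm
#spacy.load('en_core_web_sm')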

## 1. Tokenization and Sentence Segmentation

**Exercise 1:** Tokenizing a text file 


* Download the data files used for this exercise sheet [here](https://mastertal.gitlab.io/UE803/l7_data.tgz)
* Open and read the file 'data/hp.txt'. Then tokenize it using NLTK's `word_tokenize()` function.

In [5]:
import nltk
nltk.download('punkt')
with open("data/hp.txt") as infile:
    content = infile.read()
tokens = nltk.word_tokenize(content)
print(tokens)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


['Harry', 'Potter', 'and', 'the', 'Sorcerer', "'s", 'Stone', 'CHAPTER', 'ONE', 'THE', 'BOY', 'WHO', 'LIVED', 'Mr.', 'and', 'Mrs.', 'Dursley', ',', 'of', 'number', 'four', ',', 'Privet', 'Drive', 'in', 'Stansted', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you', "'d", 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'did', "n't", 'hold', 'with', 'such', 'nonsense', '.', 'Mr.', 'Dursley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'Grunnings', ',', 'which', 'made', 'drills', '.', 'He', 'was', 'a', 'big', ',', 'beefy', 'man', 'with', 'hardly', 'any', 'neck', ',', 'although', 'he', 'did', 'have', 'a', 'very', 'large', 'mustache', '.', 'Mrs.', 'Dursley', 'was', 'thin', 'and', 'blonde', 'and', 'had', 'nearly', 'twice', 'the', 'usual', 'amount', 'of', 'neck', ',', 'which', 'came', 'in', 'very', 

**Exercise 2:** Breaking the text into Sentences

* Now use the `sent_tokenize()` function to segment the text into sentences.

In [6]:
with open("data/hp.txt") as infile:
    content = infile.read()
sentences = nltk.sent_tokenize(content)
print(sentences)

Help on function sent_tokenize in module nltk.tokenize:

sent_tokenize(text, language='english')
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).
    
    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus



**Exercise 3:** Get all words that end in "ed"

* Use the Python string method `endswith()`.

In [7]:
with open("data/hp.txt") as infile:
    content = infile.read()
tokens = nltk.word_tokenize(content)
# Keep only the tokens that end in "ed"
ed_words = []
for token in tokens:
    if token.endswith("ed"):
        ed_words.append(token)
print(ed_words)

['Stansted', 'involved', 'called', 'called', 'wanted', 'pretended', 'shuddered', 'arrived']
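
The same filter can also be written as a list comprehension. The sketch below uses a short hand-written token list (an assumption standing in for the output of `word_tokenize`), so it runs without NLTK. It also shows that `endswith()` accepts a tuple of suffixes.

```python
# Hand-written tokens standing in for the output of nltk.word_tokenize
tokens = ["Mr.", "Dursley", "was", "called", "Stansted", "arrived", "drills", "making"]

# List comprehension equivalent of the explicit loop
ed_words = [token for token in tokens if token.endswith("ed")]
print(ed_words)

# endswith() accepts a tuple and matches any of the given suffixes
ed_or_ing = [token for token in tokens if token.endswith(("ed", "ing"))]
print(ed_or_ing)
```

As in the exercise, the purely string-based test still lets the proper noun "Stansted" through.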


## 2. Part-of-speech (POS) tagging

[NLTK Book Chapter](https://www.nltk.org/book/ch05.html)

**Exercise 4:** Use POS tagging to retrieve only those words which end in "ed" and are verbs (the output list should no longer contain the noun "Stansted").

* To see how `pos_tag()` can be used, use the `help()` function.
* `pos_tag()` takes a tokenized text as input and returns a list of tuples in which the first element corresponds to the token and the second to its POS tag.
* The tags belong to the POS tag set of the Penn Treebank Project, which can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
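
To make the shape of the `pos_tag()` result concrete, here is a minimal sketch that works on hand-assigned (token, tag) pairs. The tags are illustrative assumptions, not actual tagger output, so the snippet runs without NLTK.

```python
# Hand-assigned (token, tag) pairs in the shape returned by nltk.pos_tag;
# the tags are illustrative, not actual tagger output
tagged_tokens = [
    ("Dursley", "NNP"), ("was", "VBD"), ("the", "DT"), ("director", "NN"),
    ("of", "IN"), ("a", "DT"), ("firm", "NN"), ("called", "VBN"),
    ("Grunnings", "NNP"),
]

# Each element unpacks into a token and its tag
for token, tag in tagged_tokens[:3]:
    print(token, tag)

# The Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ) all start with "VB"
verbs = [token for token, tag in tagged_tokens if tag.startswith("VB")]
print(verbs)
```

Testing the tag prefix is one possible alternative to listing every verb tag explicitly.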

In [11]:
# Start by reading the documentation:
help(nltk.pos_tag)

nltk.download('averaged_perceptron_tagger')

with open("data/hp.txt") as infile:
    content = infile.read()
tokens = nltk.word_tokenize(content)
tagged_tokens = nltk.pos_tag(tokens)

# All Penn Treebank verb tags, including the base form "VB"
verb_tags = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]

verbs = []

for token, tag in tagged_tokens:
    # Keep the token only if it is tagged as a verb and ends in "ed"
    if tag in verb_tags and token.endswith('ed'):
        verbs.append(token)

print(verbs)
    

Help on function pos_tag in module nltk.tag:

pos_tag(tokens, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.
    
        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
    
    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.
    
    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be u

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [12]:
import spacy
import pandas as pd
# Load the small English model by its full name
nlp = spacy.load('en_core_web_sm')
sentence = "Mr. Dursley was the director of a firm called Grunnings, which made drills."
nlp_sentence = nlp(sentence)
# For each token: its text, fine-grained tag (tag_) and coarse-grained POS (pos_)
spacy_pos_tagged = [(w.text, w.tag_, w.pos_) for w in nlp_sentence]

pd.DataFrame(spacy_pos_tagged,
             columns=['Word', 'POS tag', 'Tag type'])

## 3. Lemmatization

Now we would like to collect all verbs that appear in the text. However, the same verb occurs in different forms (e.g., 'is', 'was', 'were'). To map these forms to a single form ('be'), we can use the NLTK lemmatizer (`nltk.stem.wordnet.WordNetLemmatizer`), which, given a token and its POS tag, returns its lemma, i.e., the word form usually found in dictionary entries.

**Exercise 5:** Return all verb lemmas contained in the text.

Hint: use the `lemmatize()` method of `WordNetLemmatizer`.


In [14]:
nltk.download('wordnet')
with open("data/hp.txt") as infile:
    content = infile.read()
# A single sentence is used here; pass `content` to word_tokenize to process the whole file
sentence = "Mr. Dursley was the director of a firm called Grunnings, which made drills."
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
# All Penn Treebank verb tags, including the base form "VB"
verb_tags = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
verbs = []
for token, tag in tagged_tokens:
    if tag in verb_tags:
        verbs.append(token)
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
verb_lemmas = []
for word_form in verbs:
    # "v" tells the lemmatizer to treat the word form as a verb
    lemma = lemmatizer.lemmatize(word_form, "v")
    verb_lemmas.append((word_form, lemma))
print(verb_lemmas)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


[('was', 'be'), ('called', 'call'), ('made', 'make')]
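
Once (word form, lemma) pairs are available, the forms can be gathered under their lemma. The following is a minimal sketch, assuming pairs in the same shape as `verb_lemmas`. The list is hand-written for illustration and extends the output above with a few invented pairs.

```python
from collections import defaultdict

# (word form, lemma) pairs in the shape of verb_lemmas; hand-written for illustration
verb_lemmas = [("was", "be"), ("called", "call"), ("made", "make"),
               ("were", "be"), ("is", "be"), ("makes", "make")]

# Map each lemma to the set of word forms observed for it
forms_by_lemma = defaultdict(set)
for word_form, lemma in verb_lemmas:
    forms_by_lemma[lemma].add(word_form)

for lemma in sorted(forms_by_lemma):
    print(lemma, sorted(forms_by_lemma[lemma]))
```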


**Exercise 6:** The WordNet Lemmatizer needs to know the POS tag of the word form to be lemmatized.

Write a program that, for each word in the input text:
* gets its POS tag
* converts it to the corresponding WordNet POS tag (wn.NOUN, wn.VERB, wn.ADV or wn.ADJ)
* lemmatizes the word form
* outputs the resulting list of lemmas


In [None]:
from nltk.corpus import wordnet as wn

def penn_to_wn(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS tag (None if there is no match)."""
    if penn_tag in ["NN", "NNS", "NNP", "NNPS"]:
        wn_tag = wn.NOUN
    elif penn_tag in ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]:
        wn_tag = wn.VERB
    elif penn_tag in ["RB", "RBR", "RBS"]:
        wn_tag = wn.ADV
    elif penn_tag in ["JJ", "JJR", "JJS"]:
        wn_tag = wn.ADJ
    else:
        wn_tag = None
    return wn_tag

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

lemmas = []

for token, pos in tagged_tokens:
    wn_tag = penn_to_wn(pos)
    if wn_tag is not None:
        lemma = lemmatizer.lemmatize(token, wn_tag)
    else:
        # Without a POS argument, lemmatize() treats the token as a noun
        lemma = lemmatizer.lemmatize(token)
    lemmas.append(lemma)

# Print once, after all tokens have been processed
print(lemmas)
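
The tag conversion can also be written as a dictionary lookup on the first two letters of the Penn tag. This is only a sketch of an alternative: `penn_to_wn_prefix` and `PREFIX_TO_WN` are hypothetical names, and the WordNet constants are written as their plain string values so the snippet runs without NLTK.

```python
# WordNet POS constants are single-letter strings:
# wn.NOUN == "n", wn.VERB == "v", wn.ADV == "r", wn.ADJ == "a"
PREFIX_TO_WN = {"NN": "n", "VB": "v", "RB": "r", "JJ": "a"}

def penn_to_wn_prefix(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter via its two-letter prefix.

    Returns None for tags without a WordNet counterpart (e.g. "DT" or "RP").
    """
    return PREFIX_TO_WN.get(penn_tag[:2])

for tag in ["NNPS", "VBD", "RBR", "JJ", "DT"]:
    print(tag, penn_to_wn_prefix(tag))
```

Using two letters rather than one keeps the particle tag "RP" from being treated as an adverb.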

## 4. Processing Multiple Files

**Exercise 7:** For each file in "data/", print out the number of sentences and the number of words per sentence.

* Use Python's [glob](https://docs.python.org/3/library/glob.html) module to iterate over all files in the directory.

In [17]:
import glob
for filename in glob.glob("data/*.txt"): 
    with open(filename, "r", encoding="utf8", errors='ignore') as infile:
        content = infile.read()
    sentences = nltk.sent_tokenize(content)
    print("File %s has %d sentences"%(filename, len(sentences)))
    counter = 0
    
    for sentence in sentences:
        counter += 1
        tokens = nltk.word_tokenize(sentence)
        # print("Sentence %d has %d tokens"%(counter, len(tokens)))
    print()

File data\Harry Potter 1 - Sorcerer's Stone.txt has 6396 sentences

File data\Harry Potter 2 - Chamber of Secrets.txt has 6938 sentences

File data\Harry Potter 3 - The Prisoner Of Azkaban.txt has 8585 sentences

File data\hp.txt has 12 sentences
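
For files with thousands of sentences, printing one line per sentence quickly becomes unwieldy. One possible alternative is a per-file summary of the sentence lengths. The sketch below assumes the sentences are already tokenized and uses a small hand-written sample in place of real `word_tokenize` output.

```python
from statistics import mean

# Hand-written tokenized sentences standing in for the output of
# nltk.word_tokenize applied to each sentence of a file
tokenized_sentences = [
    ["Mr.", "Dursley", "made", "drills", "."],
    ["He", "was", "a", "big", ",", "beefy", "man", "."],
    ["They", "were", "perfectly", "normal", "."],
]

# Number of tokens in each sentence
lengths = [len(tokens) for tokens in tokenized_sentences]

print("sentences: %d" % len(lengths))
print("tokens per sentence: min %d, max %d, mean %.1f"
      % (min(lengths), max(lengths), mean(lengths)))
```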



**Exercise 8:** Extract all noun and verb lemmas from the data/ files

* create a list of all the files we want to process
* open and read the files
* tokenize the texts
* perform pos-tagging
* collect all the tokens analyzed as nouns
* print out the number of distinct nouns found


In [18]:
def tag_tokens_file(filepath):
    """Read a file, tokenize it and return its POS-tagged tokens."""
    with open(filepath, 'r', encoding='utf8', errors='ignore') as infile:
        content = infile.read()
    tokens = nltk.word_tokenize(content)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

nouns = []
for filename in glob.glob("data/*.txt"):
    tagged_tokens = tag_tokens_file(filename)
    
    for token, pos in tagged_tokens:
        if pos in ["NN", "NNP"]:
            nouns.append(token)
nouns = set(nouns)
print("There are %d nouns"%(len(nouns)))


There are 6995 nouns
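
A set only records which nouns occur. If frequencies are also of interest, `collections.Counter` is one option. The sketch below uses hand-assigned (token, tag) pairs (an assumption, not real tagger output) and the same tag filter as the cell above.

```python
from collections import Counter

# Hand-assigned (token, tag) pairs in the shape returned by nltk.pos_tag
tagged_tokens = [("Harry", "NNP"), ("saw", "VBD"), ("the", "DT"), ("owl", "NN"),
                 ("and", "CC"), ("the", "DT"), ("owl", "NN"), ("saw", "VBD"),
                 ("Harry", "NNP"), ("'s", "POS"), ("wand", "NN")]

# Count every token tagged as a singular common or proper noun
noun_counts = Counter(token for token, pos in tagged_tokens if pos in ("NN", "NNP"))

print("There are %d distinct nouns" % len(noun_counts))
print(noun_counts.most_common(2))
```

The number of keys in the counter equals the size of the corresponding set, so the distinct-noun count is still available.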


## 5. Parsing 

**Exercise 9:** Extract all Noun Phrases from the first 10 sentences of each txt file in the data/ directory.

* Open each file
* Apply sentence segmentation
* Apply the Stanford constituency parser to the first 10 sentences of each file
* Use the NLTK parse tree methods (see `help(nltk.tree.Tree)`) to retrieve all NP subtrees and extract their leaves
* N.B. The Stanford parser can assign several parses to the same sentence. Only consider the first parse.
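
The tree methods can be tried out without the Stanford parser. The following sketch builds a tree from a hand-written bracketed parse (an assumption standing in for real parser output) and collects the leaves of every NP subtree.

```python
from nltk.tree import Tree

# Hand-written bracketed parse, standing in for a Stanford parser result
tree = Tree.fromstring(
    "(S (NP (NNP Mr.) (NNP Dursley)) "
    "(VP (VBD was) (NP (NP (DT the) (NN director)) "
    "(PP (IN of) (NP (DT a) (NN firm))))))"
)

# subtrees() accepts a filter function; leaves() returns the tokens of a subtree
noun_phrases = [subtree.leaves()
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]

for np_tokens in noun_phrases:
    print(" ".join(np_tokens))
```

Nested NPs are returned as well, so "the director of a firm" is listed alongside the smaller phrases it contains.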



In [None]:
import os
java_path = r'/usr/lib/jvm/java-8-oracle/jre/bin/java'
os.environ['JAVAHOME'] = java_path
from nltk.parse.stanford import StanfordParser
scp = StanfordParser(path_to_jar='/home/Lenovo/stanford-parser-full-2018-10-17/stanford-parser.jar',
           path_to_models_jar='/home/Lenovo/stanford-parser-full-2018-10-17/stanford-parser-3.9.2-models.jar')

for filename in glob.glob("data/*.txt"): 
    with open(filename, "r", encoding="utf8", errors='ignore') as infile:
        content = infile.read()
    sentences = nltk.sent_tokenize(content)
    counter = 0
    
    for sentence in sentences[0:10]:
        counter +=1
        parse_trees = list(scp.raw_parse(sentence))
        tree = parse_trees[0]
        
        # Print the leaves of every NP subtree of the first parse
        for subtree in tree.subtrees(lambda t: t.label() == "NP"):
            print(subtree.leaves())

## 6. NER 

**Exercise 10:** Extract all Person names from the data/ files

* Use either SpaCy or Stanford NER

In [None]:
from nltk.tag import StanfordNERTagger
import os
import pandas as pd
java_path = r'/usr/lib/jvm/java-8-oracle/jre/bin/java'
os.environ['JAVAHOME'] = java_path

sner = StanfordNERTagger('/home/Lenovo/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
           path_to_jar='/home/Lenovo/stanford-ner-2018-10-16/stanford-ner.jar')

person_tokens = []

for filename in glob.glob("data/*.txt"):
    with open(filename, "r", encoding="utf8", errors='ignore') as infile:
        content = infile.read()
    sentences = nltk.sent_tokenize(content)

    for sentence in sentences:
        ner_tagged_sent = sner.tag(nltk.word_tokenize(sentence))
        # Keep only tokens tagged PERSON; the letter 'O' marks tokens outside any entity
        person_tokens += [token for token, tag in ner_tagged_sent if tag == 'PERSON']
print(set(person_tokens))
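
The tagger labels each token separately, so a full name such as "Harry Potter" comes back as two tagged tokens. One way to rebuild names is to merge runs of adjacent PERSON tokens, sketched below with `itertools.groupby` on hand-written (token, tag) pairs (an assumption standing in for real tagger output).

```python
from itertools import groupby

# Hand-written (token, tag) pairs in the shape returned by a token-level NER tagger
tagged = [("Harry", "PERSON"), ("Potter", "PERSON"), ("lived", "O"), ("with", "O"),
          ("Vernon", "PERSON"), ("Dursley", "PERSON"), ("in", "O"), ("Surrey", "LOCATION")]

# groupby() yields runs of consecutive pairs that share the same tag;
# each run of PERSON tokens is joined into one name
person_names = [
    " ".join(token for token, _ in group)
    for tag, group in groupby(tagged, key=lambda pair: pair[1])
    if tag == "PERSON"
]
print(person_names)
```

Two different people mentioned directly after one another would be merged into a single name by this heuristic.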