## Tasks

1. present a creative solution to the event detection in a few single sentences in 1/3 pippi longstocking chapters by answering the 5W1h questions.
2. count the number of separate events (location &v time &v protagonist &v object changes etc.) in 1/3 chapters.
Work with different libraries (e.g., gimme5w.., allennlp, zalando, spacy etc.)

## Giveme5W1H

In [None]:
import Giveme5W1H

In [None]:
from Giveme5W1H.extractor.preprocessors.preprocessor_core_nlp import Preprocessor
import pandas, os, geopy
from geopy.geocoders import Nominatim
import nltk
from nltk.tokenize import sent_tokenize

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/esra/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
text = """Way out at the end of a tiny little town was an old overgrown garden, and in the garden was an old house, and in the house lived Pippi Longstocking. She was nine years old, and she lived there all alone. She had no mother and no father, and that was of course very nice because there was no one to tell her to go to bed just when she was having the most fun, and no one who could make her take cod liver oil when she much preferred caramel candy.
Once upon a time Pippi had had a father of whom she was extremely fond. Naturally she had had a mother too, but that was so long ago that Pippi didn't remember her at all. Her mother had died when Pippi was just a tiny baby and lay in a cradle and howled so that nobody could go anywhere near her. Pippi was sure that her mother was now up in Heaven, watching her little girl through a peephole in the sky, and Pippi often waved up at her and called, "Don't you worry about me. I'll always come out on top."
Pippi had not forgotten her father. He was a sea captain who sailed on the great ocean, and Pippi had sailed with him in his ship until one day her father was blown overboard in a storm and disappeared. But Pippi was absolutely certain that he would come back. She would never believe that he had drowned; she was sure he had floated until he landed on an island inhabited by cannibals. And she thought he had become the king of all the cannibals and went around with a golden crown on his head all day long.
"My papa is a cannibal king; it certainly isn't every child who has such a stylish papa," Pippi used to say with satisfaction."""

In [None]:
sentences = sent_tokenize(text)

In [None]:
from Giveme5W1H.extractor.document import Document
from Giveme5W1H.extractor.extractor import MasterExtractor

geopy.geocoders.options.default_user_agent = "fu-nlp-sose21"


# NOTE: Even I tried in Windows and Ubuntu, the Stanford Servr couldn't resolve coreference module, 
# however I saw the server as running.
preprocessor = Preprocessor()
extractor = MasterExtractor(preprocessor=preprocessor)
for sent in sentences:
    doc = Document.from_text(sent,"2021-07-06")
    res = extractor.parse(doc)
    top_who_answer = res.get_top_answer('what').get_parts_as_text()
    print(top_who_answer)

KeyboardInterrupt: 

## Spacy - Event Detection

Resource link: https://andrewhalterman.com/post/event-data-in-30-lines-of-python/"

In [None]:
import numpy as np
import spacy
import os
import re
from nltk.stem import WordNetLemmatizer

In [None]:
data_dir = '/home/esra/Desktop/'

In [None]:
with open(os.path.join(data_dir, '01villa.txt')  , "r") as f:
    raw_text = f.read()

In [None]:
sentences = nltk.sent_tokenize(raw_text)

In [None]:
# Home-made normalizer - New version for tokenization
def normalize_corpus(sent_corpus, remove_whitespace = False, remove_punc = False, 
                     remove_stopwords = False, all_lower = False, text_lemmatization = False, tokenize = False ):
    cleaned_array=[]
    lemmatizer = WordNetLemmatizer()
    for sent in sent_corpus:
        cleaned_sent = ""
        for item in nltk.word_tokenize(sent):
            if text_lemmatization == True:
                item = lemmatizer.lemmatize(item)
            if all_lower == True:
                item = item.lower() 
            if remove_punc == True:
                item = re.sub(r'[^A-Za-z0-9.,]','', item)
                item = re.sub('nt', 'not',item)
            if remove_stopwords == True and item.lower() not in stopword_list:
                cleaned_sent += item+' ' 
            elif remove_stopwords == False:
                cleaned_sent += item+' '
        if tokenize == True:   
            #print(cleaned_sent)
            cleaned_array.append(nltk.word_tokenize(cleaned_sent))
        else:
            cleaned_array.append(cleaned_sent)
    return cleaned_array

In [None]:
sent_list =normalize_corpus(sentences, remove_punc = True)

In [None]:
import en_core_web_lg
nlp = en_core_web_lg.load()

In [None]:
dobj_list =[]
for sent in sent_list:
    for token in nlp(sent):
        if(token.dep_ == 'dobj'):
            dobj_list.append(token.text)
            #print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            #token.shape_, token.is_alpha, token.is_stop)

In [None]:
verb_list =[]
for sent in sent_list:
    for token in nlp(sent):
        # we can remove 'have' from list - it is not a verb for event action
        if (token.pos_ == 'VERB' and token.lemma_ not in  ('have')):
            verb_list.append(str(token.lemma_))

In [None]:
processed_docs = list(nlp.pipe(sent_list))

In [None]:
def detect_event(doc, verb_list, dobj_list):
    for word in doc:
        if word.dep_ == "ROOT" and word.lemma_ in verb_list:
            for subword in word.children:
                # dobj = direct object
                if subword.dep_ == "dobj" and subword.lemma_ in dobj_list: 
                    return word, subword
                    
def actor_extractor(root):
    for child in root.children:
        # nsubj = real noun subject
        if child.dep_ == "nsubj":
            nsubj = child.text
            nsubj_subtree = ''.join(w.text_with_ws for w in child.subtree).strip()
            return nsubj_subtree

In [None]:
eventCounter = 0
for doc in processed_docs:
    root_and_obj = detect_event(doc, verb_list, dobj_list)
    if root_and_obj != None:
        actor = actor_extractor(root_and_obj[0])
        if actor:
            eventCounter += 1
            print("actor: ", actor, '|', "action: ", root_and_obj[0] ,'|', "object:", root_and_obj[1] )

print('# of Events: ',eventCounter)

actor:  Pippi | action:  forgotten | object: father
actor:  Her father | action:  bought | object: house
actor:  she | action:  said | object: goodby
actor:  she | action:  lift | object: horse
actor:  she | action:  lifted | object: garden
actor:  she | action:  took | object: care
actor:  Pippi herself | action:  made | object: it
actor:  she | action:  wore | object: pair
actor:  you | action:  know | object:   
actor:  I | action:  forget | object: it
actor:  you | action:  expect | object: child
actor:  she | action:  conotinued | object:  
actor:  first 1 | action:  inotroduce | object: you
actor:  Tommy and Annika | action:  patted | object: horse
actor:  They | action:  seen | object: king
actor:  No , not the least little tiny bit of a one , | action:  said | object: Pippi
actor:  who | action:  tells | object: you
actor:  I | action:  tell | object: myself
actor:  I | action:  tell | object: myself
actor:  they | action:  come | object: inoto
actor:  she | action:  took | obj

## AllenNLP

In [2]:
from allennlp.predictors import Predictor 
predictor = Predictor.from_path("https://allennlp.s3.amazonaws.com/models/ner-model-2018.04.26.tar.gz") 

downloading:   0%|          | 0/711852086 [00:00<?, ?B/s]

KeyboardInterrupt: 

In [None]:
results = predictor.predict(sentence="Did Uriah honestly think he could beat The Legend of Zelda in under three hours?") 
for word, tag in zip(results["words"], results["tags"]):
    print(f"{word}\t{tag}")