# Build the Guidance Signal (train, val and test splits)

### Steps for Guidance signal extraction (for all splits)
1) Reformat **NarraSum's data** to allow for Event Detection 
2) Extract potential candidates for event detection (**pos-tagging**)
3) **Infer events** on newly augmented (v1) NarraSum's data
4) Filter None-events
5) Reformat NarraSum's data to allow for Event Relation Extraction (ERE)
6) Add all newly detected events to this dataset
7) **Infer event relations** on secondly augmented (v2) NarraSum's data

(Some of the steps above were carried out on DFKI's cluster to allow for distributed processing.)

- Result: **Raw guidance signal** for E-BART

### Treatment for Raw guidance (see notebook "Several Guidance Signal")


In [1]:
# Dependencies
import json
import uuid

import nltk
import pandas as pd
import spacy

# Load spaCy's transformer-based English pipeline once (used for POS tagging in step 2)
nlp = spacy.load("en_core_web_trf")

## 1. Reformat NarraSum's data to allow for ED

In [2]:
file_path= "/Users/clementgillet/Desktop/Master_Hub/datasets/NarraSum/train.json"
trainNarraSum = pd.read_json(path_or_buf=file_path, lines=True)
trainNarraSum

Unnamed: 0,id,document,summary
0,6b1ad56f71ab26b0d44c4223be26518d,"""Sound Tracks presents Quick Hits"" is a 20-par...","""Sound Tracks presents Quick Hits"" is a 20-par..."
1,e473f32fc0a5df963c1d60ccc66c8989,Set wholly in a secondary school in a working-...,Teacher Francois Marin and his colleagues are ...
2,b199c5e2616eb4826cbb02fa0e4b58c3,"Lois follows a trial about a possible killer, ...","Eugene Laderman, convicted of murdering Henry ..."
3,3b35856b7511e65319cda8165d9056a9,Cat told Vincent that she wouldn't be second c...,Cat and Vincent are unsure if they can trust A...
4,be382214372e901980d96959180fa0ac,An escaped Norman Osborn is traveling by train...,Norman Osborn's memory of the Green Goblin is ...
...,...,...,...
109762,c0481fde39300813627ef17f2ef25d09,Previously on 'Survivor': Caleb and Hayden tri...,"Upon returning to camp after Tribal Council, H..."
109763,73420b51d3f742c4d6f293deefae7bf0,The entire Simpson family stays at home during...,"One snowy day in Springfield, Lisa informs Bar..."
109764,0e19a51c6e70d1ff163223feb931ea9c,"At the film's opening, Dan and Sara Anderson a...",A married couple (he is an aging NFL quarterba...
109765,c857b08ae31107aa089a7cc020206c2c,A group of journalists are investigating a hig...,In spring 1938 in the mountains in the north o...


In [3]:
# Split each document into sentences (column-wise, which avoids chained assignment)
trainNarraSum["document"] = trainNarraSum["document"].apply(nltk.tokenize.sent_tokenize)

In [5]:
tokensList = []
for i, doc in enumerate(trainNarraSum.document):
    sents = []
    for sent in doc:
        sents.append(nltk.word_tokenize(sent))
    tokensList.append(sents)
trainNarraSum['tokens'] = tokensList

In [8]:
content = []
for ind in trainNarraSum.index:
    ls = []
    for sent, tok in zip(trainNarraSum['document'][ind], trainNarraSum['tokens'][ind]):
        dic = {}
        dic['sentence']= sent
        dic['tokens']= tok
        ls.append(dic)
    content.append(ls)
    

    
trainNarraSum['content']= content
trainNarraSum["content"]

0         [{'sentence': '"Sound Tracks presents Quick Hi...
1         [{'sentence': 'Set wholly in a secondary schoo...
2         [{'sentence': 'Lois follows a trial about a po...
3         [{'sentence': 'Cat told Vincent that she would...
4         [{'sentence': 'An escaped Norman Osborn is tra...
                                ...                        
109762    [{'sentence': 'Previously on 'Survivor': Caleb...
109763    [{'sentence': 'The entire Simpson family stays...
109764    [{'sentence': 'At the film's opening, Dan and ...
109765    [{'sentence': 'A group of journalists are inve...
109766    [{'sentence': 'Francis Dolarhyde sits in a caf...
Name: content, Length: 109767, dtype: object

In [11]:
trainNarraSum.content[5]

[{'sentence': 'A working-class schlub, desperate to escape his mundane life, finds guarded hope when he meets a traveling dream salesman.',
  'tokens': ['A',
   'working-class',
   'schlub',
   ',',
   'desperate',
   'to',
   'escape',
   'his',
   'mundane',
   'life',
   ',',
   'finds',
   'guarded',
   'hope',
   'when',
   'he',
   'meets',
   'a',
   'traveling',
   'dream',
   'salesman',
   '.']},
 {'sentence': 'Larry has been down on his luck for so long, he forgets what happiness feels like.',
  'tokens': ['Larry',
   'has',
   'been',
   'down',
   'on',
   'his',
   'luck',
   'for',
   'so',
   'long',
   ',',
   'he',
   'forgets',
   'what',
   'happiness',
   'feels',
   'like',
   '.']},
 {'sentence': 'On the desolate streets of industrial Philadelphia, he peddles satellite TV packages to pay the bills.',
  'tokens': ['On',
   'the',
   'desolate',
   'streets',
   'of',
   'industrial',
   'Philadelphia',
   ',',
   'he',
   'peddles',
   'satellite',
   'TV',
   'pa

In [9]:
# Keep only 'id' and 'content'
trainNarraSum.drop(columns=["document", "tokens", "summary"], inplace=True)
trainNarraSum

Unnamed: 0,id,content
0,6b1ad56f71ab26b0d44c4223be26518d,"[{'sentence': '""Sound Tracks presents Quick Hi..."
1,e473f32fc0a5df963c1d60ccc66c8989,[{'sentence': 'Set wholly in a secondary schoo...
2,b199c5e2616eb4826cbb02fa0e4b58c3,[{'sentence': 'Lois follows a trial about a po...
3,3b35856b7511e65319cda8165d9056a9,[{'sentence': 'Cat told Vincent that she would...
4,be382214372e901980d96959180fa0ac,[{'sentence': 'An escaped Norman Osborn is tra...
...,...,...
109762,c0481fde39300813627ef17f2ef25d09,[{'sentence': 'Previously on 'Survivor': Caleb...
109763,73420b51d3f742c4d6f293deefae7bf0,[{'sentence': 'The entire Simpson family stays...
109764,0e19a51c6e70d1ff163223feb931ea9c,"[{'sentence': 'At the film's opening, Dan and ..."
109765,c857b08ae31107aa089a7cc020206c2c,[{'sentence': 'A group of journalists are inve...


In [None]:
# Write one JSON record per line. The built-in open() does not expand "~",
# so the path is handed to pandas directly.
trainNarraSum.to_json('~/ready_for_pos.json', orient='records', lines=True)

## 2. Extract potential candidates for ED

### 2.1 spaCy POS tagging

In [None]:
file_path1 = "~/ready_for_pos.json"
trainNarraSum = pd.read_json(path_or_buf=file_path1, lines=True)

In [None]:
# Keep only PROPN, NOUN and VERB tokens as event-trigger candidates.
# Each candidate is stored as follows:
# {"trigger_word": "Conquest", "sent_id": 0, "offset": [1, 2], "id": "f3d95fd23f790fb12875f8fe02bf5fb0"}

candidatesList = []
for q, elem in enumerate(trainNarraSum.content):
    candidates = []
    for i, sent in enumerate(elem):
        doc = nlp(sent['sentence'])
        for j, w in enumerate(doc):
            if w.pos_ in ["PROPN", "NOUN", "VERB"]:
                dic = {}
                dic["trigger_word"] = w.text
                dic["sent_id"] = i
                # NOTE: j is the spaCy token index. It matches the NLTK tokens stored in
                # 'tokens' only where both tokenizers split the sentence identically.
                dic["offset"] = [j, j+1]
                dic["id"] = str(uuid.uuid4()).replace("-","")
                candidates.append(dic)
    # progress in percent
    print(q/len(trainNarraSum)*100,"%")
    candidatesList.append(candidates)

In [None]:
trainNarraSum["candidates"] = candidatesList
trainNarraSum

In [None]:
# pandas expands "~" itself; the built-in open() does not
trainNarraSum.to_json("~/ready_for_ed.jsonl", orient='records', lines=True)

## 3. ED-Inference

- Terminal Command for inferring events with **BERT+CRF**

`bash /netscratch/gillet/projects/pegasus-bridle/wrapper.sh 
python run_maven.py 
--data_dir /netscratch/gillet/MAVEN_Event_Detection/NS 
--model_type bertcrf 
--model_name_or_path bert-base-uncased 
--output_dir ./MAVEN 
--max_seq_length 128 
--do_lower_case 
--per_gpu_train_batch_size 16 
--per_gpu_eval_batch_size 16 
--gradient_accumulation_steps 8 
--learning_rate 5e-5 
--num_train_epochs 5 
--save_steps 100 
--logging_steps 100 
--seed 0 
--do_infer`

## 4. Filter None-events

In [None]:
# Load the main data in chunks of 1,000 records (about 110 chunks)
import os

# open() does not expand "~", so resolve the home directory explicitly
with open(os.path.expanduser('~/ready_for_ed.jsonl'), encoding='utf-8') as f:
    chunks = list(pd.read_json(f, lines=True, chunksize=1000))

# Merge all chunks, however many were read
trainNarraSum = pd.concat(chunks)
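
The number of chunks depends only on the record count and the chunk size: 109,767 records at 1,000 per chunk give 110 chunks, the last one partial. The sketch below illustrates the same read-in-chunks-then-merge pattern with the standard library only, so that the merge never relies on a fixed chunk count. `read_jsonl_in_chunks` is a hypothetical helper written for this illustration, not a pandas function, and the seven toy records are made up.

```python
import io
import json
from itertools import islice

def read_jsonl_in_chunks(handle, chunksize):
    """Yield lists of at most `chunksize` parsed records from a JSON Lines stream."""
    while True:
        lines = list(islice(handle, chunksize))
        if not lines:
            break
        yield [json.loads(line) for line in lines if line.strip()]

# Toy stream with 7 records, read 3 at a time
records = [{"id": i} for i in range(7)]
stream = io.StringIO("\n".join(json.dumps(r) for r in records) + "\n")

chunks = list(read_jsonl_in_chunks(stream, chunksize=3))
merged = [rec for chunk in chunks for rec in chunk]
print([len(chunk) for chunk in chunks], len(merged))
```

Iterating over whatever the reader yields keeps the cell valid for the val and test splits as well, whose chunk counts differ from the train split.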

In [None]:
# Load the inference results in chunks of 1,000 records
import os

# open() does not expand "~", so resolve the home directory explicitly
with open(os.path.expanduser('~/ed_results.jsonl'), encoding='utf-8') as f:
    chunks = list(pd.read_json(f, lines=True, chunksize=1000))

# Merge all chunks, however many were read
results = pd.concat(chunks)

In [None]:
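# Create the skeleton for reformatting before ERE
NS_ERE = pd.DataFrame()
NS_ERE["id"] = trainNarraSum.id
# 'content' holds a list of {'sentence', 'tokens'} dicts per document, so unpack it row by row
NS_ERE["tokens"] = trainNarraSum.content.apply(lambda sents: [s["tokens"] for s in sents])
NS_ERE["sentences"] = trainNarraSum.content.apply(lambda sents: [s["sentence"] for s in sents])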

In [None]:
# The 168 MAVEN event types, preceded by "None" at index 0 (no event)

mavenTypes=["None", "Know", "Warning", "Catastrophe", "Placing", "Causation", "Arriving", "Sending", "Protest", 
             "Preventing_or_letting", "Motion", "Damaging", "Destroying", "Death", "Perception_active", "Presence", 
             "Influence", "Receiving", "Check", "Hostile_encounter", "Killing", "Conquering", "Releasing", "Attack", 
             "Earnings_and_losses", "Choosing", "Traveling", "Recovering", "Using", "Coming_to_be", 
             "Cause_to_be_included", "Process_start", "Change_event_time", "Reporting", "Bodily_harm", "Suspicion", 
             "Statement", "Cause_change_of_position_on_a_scale", "Coming_to_believe", "Expressing_publicly", 
             "Request", "Control", "Supporting", "Defending", "Building", "Military_operation", "Self_motion", 
             "GetReady", "Forming_relationships", "Becoming_a_member", "Action", "Removing", "Surrendering", 
             "Agree_or_refuse_to_act", "Participation", "Deciding", "Education_teaching", "Emptying", "Getting", 
             "Besieging", "Creating", "Process_end", "Body_movement", "Expansion", "Telling", "Change", 
             "Legal_rulings", "Bearing_arms", "Giving", "Name_conferral", "Arranging", "Use_firearm", 
             "Committing_crime", "Assistance", "Surrounding", "Quarreling", "Expend_resource", "Motion_directional", 
             "Bringing", "Communication", "Containing", "Manufacturing", "Social_event", "Robbery", "Competition", 
             "Writing", "Rescuing", "Judgment_communication", "Change_tool", "Hold", "Being_in_operation", "Recording", 
             "Carry_goods", "Cost", "Departing", "GiveUp", "Change_of_leadership", "Escaping", "Aiming", "Hindering", 
             "Preserving", "Create_artwork", "Openness", "Connect", "Reveal_secret", "Response", "Scrutiny", "Lighting", 
             "Criminal_investigation", "Hiding_objects", "Confronting_problem", "Renting", "Breathing", "Patrolling", 
             "Arrest", "Convincing", "Commerce_sell", "Cure", "Temporary_stay", "Dispersal", "Collaboration", "Extradition", 
             "Change_sentiment", "Commitment", "Commerce_pay", "Filling", "Becoming", "Achieve", "Practice", 
             "Cause_change_of_strength", "Supply", "Cause_to_amalgamate", "Scouring", "Violence", "Reforming_a_system", 
             "Come_together", "Wearing", "Cause_to_make_progress", "Legality", "Employment", "Rite", "Publishing", 
             "Adducing", "Exchange", "Ratification", "Sign_agreement", "Commerce_buy", "Imposing_obligation", 
             "Rewards_and_punishments", "Institutionalization", "Testing", "Ingestion", "Labeling", "Kidnapping", 
             "Submitting_documents", "Prison", "Justifying", "Emergency", "Terrorism", "Vocalizations", "Risk", 
             "Resolve_problem", "Revenge", "Limiting", "Research", "Having_or_lacking_access", "Theft", "Incident", "Award"]

In [None]:
# Filter out candidates that were classified as "None" (type_id 0)
# Look up the type name in mavenTypes above
# Add the typed candidate to `new`

newEvents = []
count = 0  # number of candidates that received an event type
for candidates, preds in zip(trainNarraSum.candidates, results.predictions):
    new = []
    for cand, pred in zip(candidates, preds):
        if pred["type_id"] != 0:
            cand["type"] = mavenTypes[pred["type_id"]]
            cand["type_id"] = pred["type_id"]
            new.append(cand)
            count += 1
    newEvents.append(new)

# Store all valid events in a file
# About 10 million of the 30 million candidates were given a type (across roughly 110,000 texts)

# newEvents is a ragged list of lists and has no to_json(); as a DataFrame it gets
# one column per event slot, with shorter rows padded with None
pd.DataFrame(newEvents).to_json("~/events.jsonl", orient='records', lines=True)
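
The filtering step can be tried on a toy document without running the ED model. The sketch below assumes that predictions arrive in the same order as the candidates (which is what the `zip` in the cell above relies on) and that only `type_id` is read from each prediction. The helper name `filter_events`, the two candidates and their predictions are made up for illustration, and `type_names` holds just the first four entries of `mavenTypes`.

```python
def filter_events(candidates, predictions, type_names):
    """Keep candidates whose predicted type_id is not 0 and attach the type name."""
    kept = []
    for cand, pred in zip(candidates, predictions):
        if pred["type_id"] != 0:
            kept.append({**cand,
                         "type": type_names[pred["type_id"]],
                         "type_id": pred["type_id"]})
    return kept

type_names = ["None", "Know", "Warning", "Catastrophe"]  # first four entries of mavenTypes

# Made-up candidates and predictions for one document
candidates = [
    {"trigger_word": "storm", "sent_id": 0, "offset": [1, 2], "id": "a"},
    {"trigger_word": "house", "sent_id": 0, "offset": [4, 5], "id": "b"},
]
predictions = [{"type_id": 3}, {"type_id": 0}]

events = filter_events(candidates, predictions, type_names)
print(events)
```

Unlike the cell above, this variant copies each candidate instead of modifying it in place, so the original candidate list stays untouched.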

## 5. Reformat NarraSum's data to allow for ERE

In [None]:
file_path= "~/train.json"
trainNarraSum = pd.read_json(path_or_buf=file_path, lines=True)

In [None]:
# Split each document into sentences
trainNarraSum["document"] = trainNarraSum["document"].apply(nltk.tokenize.sent_tokenize)

In [None]:
trainNarraSum = trainNarraSum.rename(columns = {'document':'sentences'})

In [None]:
# Tokenize every sentence of every document
tokensList = []
for i, doc in enumerate(trainNarraSum.sentences):
    sents = []
    for sent in doc:
        sents.append(nltk.word_tokenize(sent))
    tokensList.append(sents)

In [None]:
trainNarraSum['tokens']= tokensList
trainNarraSum = trainNarraSum[['id', 'tokens', 'sentences']]

## 6. Add Event-mentions

In [None]:
import os

# Read the events written in step 4 in chunks of 1,000 rows
# (open() does not expand "~", so resolve the home directory explicitly)
with open(os.path.expanduser('~/events.jsonl'), encoding='utf-8') as f:
    chunks = list(pd.read_json(f, lines=True, chunksize=1000))

# Merge all chunks, however many were read
events = pd.concat(chunks)
events

In [None]:
# We put all events in a list of lists while filtering out NaN and None results
# We also have to merge all event columns (1215 for the train split) into 1
# We then assign this list to a column of the v2 of NarraSum called "event_mentions"

lsls = []
for i in range(len(events)):
    if i%1000==0:
        print(i)
    # Padding cells are None/NaN; real event mentions are dicts
    lsls.append([item for item in events.iloc[i] if isinstance(item, dict)])

# Here is an example: the event mentions of the 1st input document

lsls[0]
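
The reason the events are spread over many columns is that a rectangular table has to pad ragged rows: every document gets as many event slots as the document with the most events, and the unused slots are empty. The sketch below mimics this in plain Python (it is an analogy, not the pandas internals). `pad_rows` and `unpad_rows` are hypothetical helpers and the rows are toy data.

```python
def pad_rows(rows, fill=None):
    """Pad ragged rows to a common width, as a rectangular table would."""
    width = max(len(row) for row in rows)
    return [row + [fill] * (width - len(row)) for row in rows]

def unpad_rows(table):
    """Recover the ragged rows by keeping only real event mentions (dicts)."""
    return [[cell for cell in row if isinstance(cell, dict)] for row in table]

# Three toy documents with 3, 1 and 0 event mentions
rows = [
    [{"id": "a"}, {"id": "b"}, {"id": "c"}],
    [{"id": "d"}],
    [],
]

table = pad_rows(rows, fill=float("nan"))
restored = unpad_rows(table)
print([len(row) for row in table], restored == rows)
```

Filtering on `isinstance(cell, dict)` handles `None` and `NaN` padding alike, which is why a single pass is enough.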

In [None]:
trainNarraSum["event_mentions"]= lsls
trainNarraSum

In [None]:
# pandas expands "~" itself; the built-in open() does not
trainNarraSum.to_json('~/ready_for_ere.json', orient='records', lines=True)
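
Before starting ERE inference it can be worth checking that every event mention still points at its trigger word, since the offsets were computed on spaCy tokens while `tokens` comes from NLTK. The sketch below is one possible sanity check under the record layout used in this notebook (`id`, `tokens`, `sentences`, `event_mentions`). The helper `misaligned_mentions` and the toy record are assumptions made for illustration, and the `type` fields are omitted because the check does not need them.

```python
def misaligned_mentions(record):
    """Return the ids of event mentions whose offset does not select their trigger word."""
    bad = []
    for mention in record["event_mentions"]:
        start, end = mention["offset"]
        span = record["tokens"][mention["sent_id"]][start:end]
        if span != [mention["trigger_word"]]:
            bad.append(mention["id"])
    return bad

# Toy record: mention "m2" deliberately carries a wrong offset
record = {
    "id": "doc-0",
    "sentences": ["Larry has been down on his luck.", "He meets a salesman."],
    "tokens": [
        ["Larry", "has", "been", "down", "on", "his", "luck", "."],
        ["He", "meets", "a", "salesman", "."],
    ],
    "event_mentions": [
        {"id": "m1", "trigger_word": "meets", "sent_id": 1, "offset": [1, 2]},
        {"id": "m2", "trigger_word": "luck", "sent_id": 0, "offset": [5, 6]},
    ],
}

bad_ids = misaligned_mentions(record)
print(bad_ids)
```

Mentions reported here could be dropped or re-aligned before writing `ready_for_ere.json`; which option is appropriate depends on how many there are.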

## 7. ERE-Inference

- Terminal Command for inferring event relations **with RoBERTa-large**

`bash /netscratch/gillet/projects/pegasus-bridle/wrapper.sh python main.py --eval_steps 200 --epochs 100 --lr 3e-4 --bert_lr 2e-5 --accumulation_steps 4 --batch_size 8`

__-> Raw Guidance Signal Acquired for the three splits (train, val and test)__

In [None]:
# Load the final train file in chunks of 1,000 records (about 110 chunks)

with open('/Users/clementgillet/Desktop/4SUBMISSION/files/final/hallelujah_train.json', encoding='utf-8') as f:
    chunks = list(pd.read_json(f, lines=True, chunksize=1000))

# Merge all chunks, however many were read
trainNarraSum = pd.concat(chunks)

In [None]:
trainNarraSum["TIMEX"]= ""
trainNarraSum

In [None]:
with open('/Users/clementgillet/Desktop/4SUBMISSION/files/final/hallelujah_train.json', 'w') as f:
    f.write(trainNarraSum.to_json(orient='records', lines=True))

In [None]:
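# If we desire to do TIMEX annotation later -->

# For every data point
#    For every sentence:
#        if it contains an expression from the lists below:
#            create an entry for the TIMEX column

# TODO: allow for fuzzy string matching, e.g. with the datefinder library on GitHub

import re

weekList = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
timeList = ["now", "yesterday", "tomorrow", "the day before yesterday", "the day after tomorrow", "5 o'clock", "5am", "5pm", "time", "2 days before", "2 weeks before"]
durationList = ["5 seconds", "5 minutes", "5 hours", "5 months", "5 years", "during the whole week", "all day long", "from january to may", "for 10 years"]

# Placeholder matcher: exact, case-insensitive, whole-word matching only
patterns = [(expr, re.compile(r"\b" + re.escape(expr) + r"\b", re.IGNORECASE))
            for expr in weekList + timeList + durationList]

timexList = []
for count, sents in enumerate(trainNarraSum.sentences):
    if count%100==0:
        print(count)
    found = []
    for sent_id, sent in enumerate(sents):
        for expr, pattern in patterns:
            if pattern.search(sent):
                # Entry layout (sent_id + matched expression) is an assumption, not a fixed format
                found.append({"sent_id": sent_id, "expression": expr})
    timexList.append(found)

trainNarraSum["TIMEX"] = timexList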
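
The fuzzy string matching mentioned above could be prototyped with the standard library before committing to a dedicated package. The sketch below uses `difflib.SequenceMatcher` as a stand-in (it is not the datefinder library) and slides a window of as many words as the expression has over the sentence. The helper `fuzzy_find`, the 0.8 threshold and the example sentence are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def fuzzy_find(expressions, sentence, threshold=0.8):
    """Return the expressions that approximately occur in `sentence`."""
    words = sentence.lower().split()
    hits = []
    for expr in expressions:
        target = expr.lower()
        n = len(target.split())
        # Compare the expression with every window of n consecutive words
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if SequenceMatcher(None, window, target).ratio() >= threshold:
                hits.append(expr)
                break
    return hits

# Misspelled and punctuated variants are still found
found = fuzzy_find(["yesterday", "tomorrow", "Monday"],
                   "He arrived yesterdy and left on monday.")
print(found)
```

A lower threshold catches more variants but also more false positives, so the value would need tuning on real NarraSum sentences.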
