An event has a set of keywords C_e

For a existing story tree S, there is a set of different C_s that is a union of all the keywords of the events inside

To know if an event is in that Story, we use compatibility between event E and story tree S with Jaccard similarity between C_s and C_e.

If we match to no stories, we create a new one

When we find a story that match, we apply either: Merge, Extend, Insert

Merge: Merge event in the story tree (merge the event with another event) (we will never have this use this)
Extend: append the event as a child node
Insert: appends the event to the root node of story tree

We extend or insert if there is no overlap of event, we find the parent event node, and measure the 'connection strength' between the new event and all the events in the story tree.  'Connection strength' is defined in page page 14 of paper.

In [1]:
!pip install nltk

You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.[0m


In [7]:
import pandas as pd
import pickle
import numpy as np
import nltk

In [3]:
#news_dataset = pd.read_pickle("/work/IFT6010_Story_Tree/data/short_news_dataset_2_with_extractedkeyword.pickle").drop_duplicates(subset=['TEXT']).drop_duplicates(subset=['TITLE'])
#news_dataset = pd.read_pickle("/work/IFT6010_Story_Tree/data/news_with_extracted_keywords.pkl")

#### Get the communities 

In [4]:
temp_defaultdict_communities = pickle.load(open("../data/extracted_communities_newset.pickle",'rb'))
dict_communities = {}
for i,v in temp_defaultdict_communities.items():
    if len(v)==1:
        dict_communities[i] = v[0]

keywords_clusters = pd.DataFrame(dict_communities.items())
keywords_clusters.columns = ['keyword', 'cluster']

# Get data

In [None]:
import sys
sys.path.insert(1, '/work/IFT6010_Story_Tree/src/features/')

In [None]:
#from utils_cosine_tf_idf import latest_tfidf, preprocessing

#news_dataset['text_and_title'] = news_dataset[['content','title']].apply(lambda x :" ".join(x), axis=1)
#news_dataset['preprocessed_text'] = news_dataset['text_and_title'].apply(preprocessing)


#corpus = news_dataset['preprocessed_text']
#corpus = corpus.to_list()

#news_dataset['vector'] = news_dataset['preprocessed_text'].apply(latest_tfidf, allDocs=corpus)

#news_dataset.to_pickle("news_with_extracted_keywords_and_vectors.pkl")

In [116]:
news_dataset = pd.read_pickle("../data/news_with_extracted_keywords_and_vectors.pkl")
news_dataset_pt2 = pd.read_pickle("../data/25k_27k_news_with_extracted_keywords.pkl")
dsk_dataset = pd.read_pickle("../data/news_dsk_with_extracted_keywords_5000_25april.pkl").drop_duplicates(["text"]).drop_duplicates(["title"]).sample(n=1000)

In [117]:
dsk_dataset = dsk_dataset.rename(columns={'text': 'content'})

del news_dataset['publication']
del news_dataset['author']
del news_dataset['vector']

del dsk_dataset['summary']

del news_dataset_pt2['publication']
del news_dataset_pt2['author']
del news_dataset_pt2['year']
del news_dataset_pt2['month']

frames = [dsk_dataset, news_dataset_pt2, news_dataset]

total_news_dataset = pd.concat(frames)

# Define if event in the story

In [8]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [50]:
def jaccard_similarity(keyword_list_1, keyword_list_2):
    list1_as_set = set(keyword_list_1)
    
    intersection = list1_as_set.intersection(keyword_list_2) 

    set_union = set(keyword_list_1 + keyword_list_2)

    return len(intersection) / len(set_union)

In [60]:
def count_similar_word_in_title(title1, title2):

    title1.replace(' - The New York Times','')
    title2.replace(' - The New York Times','')

    title1_low = title1.lower()
    title2_low = title2.lower()

    title1_list = title1_low.split(" ")
    title2_list = title2_low.split(" ")

    stop_words = stopwords.words('english')

    title1_tokens = [ token for token in title1_list if token not in stop_words]
    title2_tokens = [ token for token in title2_list if token not in stop_words]
    
    return len(list(set(title1_tokens)&set(title2_tokens)))

In [127]:
# Needs to be true with at least 1 event in the story
# event_keyword: keywords of the event
# event_title: title of the event
# story: Story object
def is_event_in_story(event_keyword, event_title, story):
    similarity = jaccard_similarity(event_keyword, story.get_list_of_keywords())

    one_event_common_title = False

    for event_of_story in story.get_list_of_events():
        common_words_title = count_similar_word_in_title(event_title, event_of_story.get_title())
        if common_words_title >= 1 and not common_words_title>4:
            one_event_common_title = True
            break

    if  similarity > 0.24 and one_event_common_title:
        return True
    
    return False

### Extend and insert event in story

Measure connection strength between our event X, and all the events X_s of a story. We look at:

1) The time distance between both events

2) The compatibility of the two events

3) The storyline coherence if we append event X to story tree of X_s

connection_strength(X,X_s) = compatibility()+coherence()+time_penalty

In [121]:
def compatiblity(tf_new_event, tf_event_story):

    #tf_new_event = np.array(tf_new_event)
    #tf_event_story = np.array(tf_event_story)

    if len(tf_new_event) < len(tf_event_story):
        temp = tf_new_event
        vector_a = tf_event_story
        vector_b = temp
    else:
        vector_a = tf_new_event
        vector_b = tf_event_story

    list_1={}
    list_2={}

    for elem in vector_a:
        if elem in vector_b:
            list_1[elem] = vector_a[elem]
            list_2[elem] = vector_b[elem]
        else:
            list_1[elem] = vector_a[elem]
            list_2[elem] = 0

    for elem in vector_b:
        if not elem in list_1:
            list_1[elem] = 0 
            list_2[elem] = vector_b[elem]

    # turn dictionary to numpy array
    list_1_vector = np.fromiter(list_1.values(), dtype=float)
    list_2_vector = np.fromiter(list_2.values(), dtype=float)

    prod = np.dot(list_1_vector, list_2_vector)

    
    return prod / np.linalg.norm(list_1_vector) * np.linalg.norm(list_2_vector)

In [17]:
def conherence():
    event1 = news_dataset['VECTOR'].iloc[2]

    sum_ = 0
    
    for event in stories:
        event2 = news_dataset['VECTOR'].iloc[3]

        sum_ += compatiblity(event1, event2)

    return sum_/len(stories)

In [18]:
def time_penalty(delta, time1, time2):
    if time1 < time2:
        return math.exp(delta)
    
    return 0

# Create stories

### Update the related story tree

We calculate the connection strength between the new event E and each existing event Ej ∈ S based on the following three factors: 

(1) the time distance between E and Ej

(2) the compatibility of the two events

(3) the storyline coherence if E is appended to Ej in the tree

In [122]:
class Event:
    def __init__(self, title, content, keywords, date, vector):
        self.title = title
        self.content = content
        self.keywords = keywords
        self.date = date
        self.vector = vector

    def get_title(self):
        return self.title
    
    def get_content(self):
        return self.content

    def get_keywords(self):
        return self.keywords

    def get_vector(self):
        return self.vector

class Story:
    def __init__(self, event):
        self.list_of_events = [event]
        self.list_keywords = event.get_keywords()

    def add_event(self, new_event):
        self.list_of_events.append(new_event)
        # A story keywords, is the union of all keywords
        self.list_keywords.extend(new_event.get_keywords())

    def get_list_of_keywords(self):
        return self.list_keywords

    def get_list_of_events(self):
        return self.list_of_events

In [124]:
news_dataset = total_news_dataset

list_of_stories = []

# Go through events to add to stories 
for i in range(len(news_dataset)):
    
    if i%100 == 0:
        print(i)

    row = news_dataset.iloc[i]
    title    = row["title"]
    content  = row["content"]
    keywords = row["extracted_keywords"]
    date     = row["date"]
    #vector   = row["vector"]
    vector   = [0,0,0]

    new_event = Event(title, content, keywords, date, vector)

    # We create the first story
    if i==0:
        list_of_stories.append(Story(new_event))

    # Do we add the event to an existing story, or create a new one ?
    else:

        found_a_story_for_event = False

        # Iterate through the story to associate an event to a story
        for story in list_of_stories:
            # if the event is the story, we append it to the story
            if is_event_in_story(keywords, title, story):
                found_a_story_for_event = True
                story.add_event(new_event)

        # If we found no story to associate the event, we create a new story
        if not found_a_story_for_event:
            list_of_stories.append(Story(new_event))

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900


In [125]:
len(list_of_stories)

3782

In [126]:
for story in list_of_stories:
    events = story.get_list_of_events()
    if(len(events) > 2)
        for event in events:
            print(event.get_title())
        print("=======================================")

DSK Granted Bail But Indicted, Subjected To House Arrest
Judge grants Strauss-Kahn bail
Ex-IMF leader to be released on $1m bail
Ex-IMF chief gets NZ$1.26m bail in sex assault case
Ex-IMF chief gets $1m bail, house arrest in sex assault
Former IMF chief gets bail set at $1 million but remains under house arrest
Strauss-Kahn puts up $1M, released on bail
Strauss-Kahn gets bail in sex assault case
Robinson: Powerful with privileges belong in an age of dinosaurs
Eugene Robinson: IMF chief from dinosaur age
Robinson: IMF's Mr. Big takes a N.Y. 'perp walk'
ROBINSON: Perp walk not the way to persuade French voters
There's no coming back from the perp walk
Strauss-Kahn house arrest draws media, tourists Posted: 22 May 2011 0158 hrs
Strauss-Kahn plots defense in house arrest
Strauss-Kahn house arrest draws media, tourists
Strauss-Kahn is released from NYC jail - Forbes.com
Strauss-Kahn's NYC apartment new tourist hot spot - Forbes.com
Strauss-Kahn apartment is new tourist attraction
Ex-IMF chi

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=902703f2-430f-48f3-ba3f-6c2fee66cf11' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>