An event has a set of keywords C_e.

For an existing story tree S, C_s is the union of the keywords of all the events it contains.

To decide whether an event E belongs to a story tree S, we measure their compatibility as the Jaccard similarity between C_s and C_e.
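
As a minimal sketch of this matching step, the example below builds C_s as the union of the keywords of a story's events and compares it with C_e. The keyword lists are made up for illustration and the helper name `jaccard` is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two keyword collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:  # two empty keyword sets: define the similarity as 0
        return 0.0
    return len(set_a & set_b) / len(union)

# Hypothetical keywords of the events already in a story
story_events_keywords = [
    ["obamacare", "repeal", "senate"],
    ["repeal", "republicans", "budget"],
]
# C_s: union of the keywords of all the events in the story
C_s = set().union(*story_events_keywords)

# C_e: keywords of the incoming event (also made up)
C_e = {"obamacare", "repeal", "states"}

similarity = jaccard(C_e, C_s)
print(similarity)
```

Here two keywords are shared out of six distinct ones, so the similarity is 2/6.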

If the event matches no story, we create a new one.

When we find a story that matches, we apply one of: Merge, Extend, Insert

Merge: merge the event into the story tree (merge it with an existing event node) (we will never use this one)
Extend: append the event as a child node of an existing event node
Insert: append the event to the root node of the story tree
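
The difference between Extend and Insert can be sketched with a small tree structure. The class `StoryNode` and the functions `extend` and `insert` are hypothetical names used only for illustration; Merge is left out since these notes do not use it.

```python
class StoryNode:
    """A node of a story tree; event is None for the root."""

    def __init__(self, event=None):
        self.event = event
        self.children = []


def extend(parent_node, event):
    """Extend: append the event as a child of an existing event node."""
    node = StoryNode(event)
    parent_node.children.append(node)
    return node


def insert(root, event):
    """Insert: append the event directly under the root, starting a new branch."""
    node = StoryNode(event)
    root.children.append(node)
    return node


root = StoryNode()
first = insert(root, "event A")    # first branch of the story
second = extend(first, "event B")  # continues the branch of event A
third = insert(root, "event C")    # second branch, directly under the root
```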

We extend or insert when the new event does not overlap an existing event: to find the parent event node, we measure the 'connection strength' between the new event and every event in the story tree. 'Connection strength' is defined on page 14 of the paper.

In [1]:
import pandas as pd
import pickle
import numpy as np

In [2]:
#news_dataset = pd.read_pickle("/work/IFT6010_Story_Tree/data/short_news_dataset_2_with_extractedkeyword.pickle").drop_duplicates(subset=['TEXT']).drop_duplicates(subset=['TITLE'])
news_dataset = pd.read_pickle("/work/IFT6010_Story_Tree/data/news_with_extracted_keywords.pkl")

In [3]:
# Use a context manager so the file is closed after loading
with open("../data/extracted_communities_newset.pickle", 'rb') as f:
    temp_defaultdict_communities = pickle.load(f)
dict_communities = {}
for i,v in temp_defaultdict_communities.items():
    if len(v)==1:
        dict_communities[i] = v[0]

keywords_clusters = pd.DataFrame(dict_communities.items())
keywords_clusters.columns = ['keyword', 'cluster']

In [4]:
keywords_clusters['cluster'].value_counts()

# Define the Jaccard similarity

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

In [5]:
def jaccard_similarity(keyword_list_1, keyword_list_2):
    # Jaccard similarity: size of the intersection over size of the union
    set_1 = set(keyword_list_1)
    set_2 = set(keyword_list_2)

    set_union = set_1 | set_2
    if not set_union:
        # Two empty keyword lists: avoid dividing by zero
        return 0.0

    return len(set_1 & set_2) / len(set_union)

In [None]:
def count_similar_word_in_title(title1, title2):
    title1_low = title1.lower()
    title2_low = title2.lower()

    title1_list = title1_low.split(" ")
    title2_list = title2_low.split(" ")

    stop_words = stopwords.words('english')

    title1_tokens = [ token for token in title1_list if token not in stop_words]
    title2_tokens = [ token for token in title2_list if token not in stop_words]
    
    return len(set(title1_tokens) & set(title2_tokens))

In [None]:
for i in news_dataset.iterrows():
    for j in news_dataset.iterrows():
        similarity = jaccard_similarity(i[1]["extracted_keywords"], j[1]["extracted_keywords"])
        if  similarity> 0.3 and similarity != 1.0:
            print(i[1]["title"])
            print(j[1]["title"])
            print("---------------")

The Parliamentary Tactic That Could Obliterate Obamacare - The New York Times
Republicans’ 4-Step Plan to Repeal the Affordable Care Act - The New York Times
---------------
Republicans’ 4-Step Plan to Repeal the Affordable Care Act - The New York Times
The Parliamentary Tactic That Could Obliterate Obamacare - The New York Times
---------------
Republicans’ 4-Step Plan to Repeal the Affordable Care Act - The New York Times
Senators Propose Giving States Option to Keep Affordable Care Act - The New York Times
---------------
Four Movies You Should Know About Before the Golden Globes - The New York Times
Obama’s Last Battle: His Legacy - The New York Times
---------------
Senate Confirmation Hearings to Begin Without All Background Checks - The New York Times
Mike Pompeo Is Confirmed to Lead C.I.A., as Rex Tillerson Advances - The New York Times
---------------
What We Know and Don’t Know About the Trump-Russia Dossier - The New York Times
Trump Received Unsubstantiated Report That Russ

# Preprocessing and vector creation

In [6]:
import sys
sys.path.insert(1, '/work/IFT6010_Story_Tree/src/features/')

In [7]:
from utils_cosine_tf_idf import latest_tfidf, preprocessing

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [8]:
news_dataset['text_and_title'] = news_dataset[['content','title']].apply(lambda x :" ".join(x), axis=1)
news_dataset['preprocessed_text'] = news_dataset['text_and_title'].apply(preprocessing)


corpus = news_dataset['preprocessed_text']
corpus = corpus.to_list()

news_dataset['vector'] = news_dataset['preprocessed_text'].apply(latest_tfidf, allDocs=corpus)

In [12]:
news_dataset.to_pickle("news_with_extracted_keywords_and_vectors.pkl")

# Extend and insert event in story

Measure connection strength between our event X, and all the events X_s of a story. We look at:

1) The time distance between both events

2) The compatibility of the two events

3) The storyline coherence if we append event X to story tree of X_s

connection_strength(X,X_s) = compatibility()+coherence()+time_penalty
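
The formula above can be sketched as follows. This is only an illustration under stated assumptions: compatibility is taken to be the cosine similarity of sparse `{term: weight}` vectors, coherence is assumed to be the mean compatibility of consecutive events along the storyline, and the time penalty is assumed to be a step function. The three terms are summed exactly as the formula is written in these notes; the toy vectors and timestamps are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence(path_vectors):
    """Assumed definition: mean compatibility of consecutive events on a storyline."""
    pairs = list(zip(path_vectors, path_vectors[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def time_penalty(delta, time_story_event, time_new_event):
    """Assumed step function: exp(delta) if the story event is earlier, else 0."""
    return math.exp(delta) if time_story_event < time_new_event else 0.0


def connection_strength(x, x_s, path_to_x_s, delta=0.0):
    """Sum of the three factors, following the formula as written in the notes.

    x and x_s are (vector, time) pairs; path_to_x_s lists the vectors of the
    events from the start of the storyline down to x_s (inclusive).
    """
    (vec_x, t_x), (vec_s, t_s) = x, x_s
    return (cosine(vec_x, vec_s)
            + coherence(path_to_x_s + [vec_x])
            + time_penalty(delta, t_s, t_x))


# Made-up toy vectors and integer timestamps
e1 = ({"senate": 1.0, "repeal": 1.0}, 1)
e2 = ({"repeal": 1.0, "budget": 1.0}, 2)
new = ({"repeal": 1.0, "budget": 1.0}, 3)
strength = connection_strength(new, e2, [e1[0], e2[0]])
print(strength)
```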

In [13]:
def compatiblity(tf_new_event, tf_event_story):

    #tf_new_event = np.array(tf_new_event)
    #tf_event_story = np.array(tf_event_story)

    if len(tf_new_event) < len(tf_event_story):
        temp = tf_new_event
        vector_a = tf_event_story
        vector_b = temp
    else:
        vector_a = tf_new_event
        vector_b = tf_event_story

    list_1={}
    list_2={}

    for elem in vector_a:
        if elem in vector_b:
            list_1[elem] = vector_a[elem]
            list_2[elem] = vector_b[elem]
        else:
            list_1[elem] = vector_a[elem]
            list_2[elem] = 0

    for elem in vector_b:
        if not elem in list_1:
            list_1[elem] = 0 
            list_2[elem] = vector_b[elem]

    # turn dictionary to numpy array
    list_1_vector = np.fromiter(list_1.values(), dtype=float)
    list_2_vector = np.fromiter(list_2.values(), dtype=float)

    prod = np.dot(list_1_vector, list_2_vector)

    # Cosine similarity: the dot product is divided by the product of both norms
    return prod / (np.linalg.norm(list_1_vector) * np.linalg.norm(list_2_vector))
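
As a quick sanity check of what a cosine-based compatibility should return, the sketch below aligns two `{term: weight}` dicts on a shared vocabulary with NumPy. The function name `cosine_from_dicts` and the toy weights are made up; the dict layout of the vectors is an assumption based on how the function above indexes them.

```python
import numpy as np


def cosine_from_dicts(vec_a, vec_b):
    """Cosine similarity of two {term: weight} dicts, aligned on a shared vocabulary."""
    vocab = sorted(set(vec_a) | set(vec_b))
    a = np.array([vec_a.get(term, 0.0) for term in vocab], dtype=float)
    b = np.array([vec_b.get(term, 0.0) for term in vocab], dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


doc_a = {"obamacare": 0.5, "repeal": 0.5}
doc_b = {"repeal": 0.5, "senate": 0.5}

same = cosine_from_dicts(doc_a, doc_a)                # identical vectors
partial = cosine_from_dicts(doc_a, doc_b)             # one shared term out of two
disjoint = cosine_from_dicts(doc_a, {"golden": 1.0})  # no shared term
```

Identical vectors should score close to 1, vectors with no shared term should score 0, and the result should not depend on the order of the arguments.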

In [14]:
def conherence(new_event_vector, story_event_vectors):
    # Average compatibility between the new event and the events of a story
    sum_ = 0

    for event_vector in story_event_vectors:
        sum_ += compatiblity(new_event_vector, event_vector)

    return sum_ / len(story_event_vectors)

In [15]:
import math

def time_penalty(delta, time1, time2):
    if time1 < time2:
        return math.exp(delta)
    
    return 0

# Create stories

### Identifying the related story tree

In [None]:
# Needs to be true with at least 1 event in the story
def is_event_in_story(keywords_of_event, keyword_of_stories, title1, title2):
    # title1 and title2 are the titles of the new event and of an event of the story
    similarity = jaccard_similarity(keywords_of_event, keyword_of_stories)
    common_words_title = count_similar_word_in_title(title1, title2)

    return similarity > 0.3 and common_words_title >= 1

### Update the related story tree

We calculate the connection strength between the new event E and each existing event Ej ∈ S based on the following three factors: 

(1) the time distance between E and Ej

(2) the compatibility of the two events

(3) the storyline coherence if E is appended to Ej in the tree
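
Once the connection strengths are known, choosing between Extend and Insert can be sketched as below. This assumes the new event is attached to the existing event with the highest connection strength, and to the root when no strength reaches a threshold. Both the rule and the `threshold` parameter are assumptions for illustration, and `choose_parent` is a hypothetical helper.

```python
def choose_parent(strengths, threshold):
    """Return the index of the best parent event, or None to insert at the root.

    strengths[j] is the connection strength between the new event and event j
    of the story; threshold is a hypothetical cut-off, not a value from the notes.
    """
    if not strengths:
        return None
    best = max(range(len(strengths)), key=lambda j: strengths[j])
    return best if strengths[best] >= threshold else None


# Made-up connection strengths for the three events of a story
extend_target = choose_parent([0.2, 1.4, 0.9], threshold=1.0)  # extend under event 1
insert_case = choose_parent([0.2, 0.4, 0.1], threshold=1.0)    # None: insert at the root
```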

In [18]:
class Event:
    def __init__(self, title, content, keywords, date, vector):
        self.title = title
        self.content = content
        self.keywords = keywords
        self.date = date
        self.vector = vector

    def get_title(self):
        return self.title

    def get_content(self):
        return self.content

    def get_keywords(self):
        return self.keywords

    def get_vector(self):
        return self.vector



class Story:
    def __init__(self, event):
        self.list_of_event = [event]
        # Copy the keywords so the story does not modify the event's own list
        self.list_keywords = list(event.get_keywords())

    def add_event(self, new_event):
        self.list_of_event.append(new_event)

        # A story's keywords are the union of the keywords of all its events
        for keyword in new_event.get_keywords():
            if keyword not in self.list_keywords:
                self.list_keywords.append(keyword)

    def get_list_of_keywords(self):
        return self.list_keywords

In [None]:
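list_of_stories = []

# Go through the events and assign each one to a story
for i in range(len(news_dataset)):

    row = news_dataset.iloc[i]
    event = Event(row["title"], row["content"], row["extracted_keywords"], row["date"], row["vector"])

    # Do we add the event to an existing story, or create a new one ?
    matched_story = None
    for story in list_of_stories:
        # Story-level match: Jaccard similarity between C_e and C_s (same 0.3 threshold as is_event_in_story)
        if jaccard_similarity(list(event.keywords), list(story.list_keywords)) > 0.3:
            matched_story = story
            break

    if matched_story is None:
        # No story matches (this also covers the very first event): create a new story
        list_of_stories.append(Story(event))
    else:
        # TODO: choose between extend and insert using the connection strength
        matched_story.add_event(event)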

