# Personal Information
Name: **Emiel Verhoef**

StudentID: **15617483**

Email: [**emiel.verhoef@student.uva.nl**](mailto:emiel.verhoef@student.uva.nl)

Submitted on: **23.03.2025**

Github: https://github.com/Verrem98/structural_narrative_similarity

# Data Context
Popular current techniques for measuring similarity between narratives rely mostly on surface-level semantics and do not explicitly take other features of the narrative into account (see BERT, SPLADE, ColBERT, etc.). We will look at narrative structure (the relationships between entities and actions, and possibly their ordering) as an indicator of similarity. The accuracy of this structural narrative similarity can be measured by reframing it as an information retrieval task: fetching the narratives most similar to a query narrative.
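As a minimal sketch of this retrieval framing, each narrative can be used as a query against all others and scored on whether the top hit belongs to the same group. The word-overlap `jaccard` function is only a surface-level placeholder where a structural similarity would be plugged in; the function names, the `(group_id, text)` layout, and the toy texts are all assumptions made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Placeholder similarity: overlap between the word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top1_accuracy(items, similarity=jaccard):
    """items: list of (group_id, text) pairs.

    A query counts as a hit when its most similar *other* item
    carries the same group_id.
    """
    hits = 0
    for i, (group, text) in enumerate(items):
        candidates = [
            (similarity(text, other_text), other_group)
            for j, (other_group, other_text) in enumerate(items)
            if j != i
        ]
        best_group = max(candidates, key=lambda c: c[0])[1]
        hits += best_group == group
    return hits / len(items)


# Hypothetical toy narratives; "m1" and "m2" stand in for two movies.
toy = [
    ("m1", "a knight rescues a princess from a dragon"),
    ("m1", "the knight saves the princess from the dragon"),
    ("m2", "two detectives chase a thief through the city"),
    ("m2", "the detectives hunt a thief in the city"),
]
accuracy = top1_accuracy(toy)
print(accuracy)
```

Any other similarity function with the same two-string signature can be passed as `similarity`, which keeps the evaluation loop independent of the method being tested.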
In this EDA we will look at two different datasets:

### Story Similarity Data (https://cs.uky.edu/~sgware/projects/storysimilarity/index.php)
This dataset consists of 512 annotations of [reference_story, story A, story B, participant_id, choice] where the choice column indicates which story, out of story A and B, is most similar to the reference story according to the participant. The reference stories have been artificially generated. An example:

"Tom walks to the Crossroads.

Tom walks to the Market.

Tom buys the Merchant's Sword from the Merchant using his coin.

Tom robs The Merchant and takes the Potion.

Tom walks to the Crossroads.

Tom walks to his Cottage."

### Tell Me Again! (https://github.com/uhh-lt/tell-me-again)

This dataset consists of 96831 individual summaries of movie plots across 9505 unique movies. The summaries were taken from Wikipedia. Uniquely, this dataset contains multiple summaries of the same property, written in different languages and machine-translated to English. An example:

"The film is set in post-World War I Czechoslovakia in 1925: there is high unemployment and political extremists recruit followers. Professor Boskovic learns about a supernatural giant rat from an old book. It takes advantage of humans and transforms them into deaf giant monkeys that resemble humans in appearance. So the followers of the giant rat are not immediately recognizable. The giant rat wants to rule the world and infiltrates society with his loyalists. Together with a young writer and his daughter, the professor fights the project. The young writer, who has fallen in love with the professor's daughter, discovers a secret event and escapes the clutches of the giant rat at the last second. When the city government is firmly under the control of the rat people, the three agree: the professor has developed an anti-rat spray and is fighting the rodents. When he is murdered, the writer completes his work and can ultimately win."

In [1]:
# Imports
import os
import numpy as np
import pandas as pd
import json

### Data Loading

### Story Similarity Data

In [2]:
story_similarity_data_annotations = pd.read_csv("story_similarity_data/data.csv")

# parse english.txt: a line holding only digits starts a new story,
# every following non-empty line is one event of that story
data = {}
ID = None
with open("story_similarity_data/english.txt", "r") as file:
    for line in file:
        line = line.strip()
        if line.isdigit():
            ID = int(line)
            data[ID] = []
        elif line and ID is not None:
            data[ID].append(line)
# one row per story, with its events joined into a single string
story_similarity_data_stories = pd.DataFrame(
    {"ID": list(data.keys()), "story": [" ".join(events) for events in data.values()]}
)


In [3]:
def fetch_story(df, ID):
    """Return the story text of the first row whose ID matches."""
    return df[df['ID'] == ID].story.tolist()[0]

In [4]:
story_similarity_data = story_similarity_data_annotations.copy()

In [5]:
story_similarity_data["story_reference"] = story_similarity_data["reference"].apply(lambda x: fetch_story(story_similarity_data_stories, x))
story_similarity_data["story_A"] = story_similarity_data["A"].apply(lambda x: fetch_story(story_similarity_data_stories, x))
story_similarity_data["story_B"] = story_similarity_data["B"].apply(lambda x: fetch_story(story_similarity_data_stories, x))

In [6]:
story_similarity_data.iloc[0].choice

'A'

In [7]:
story_similarity_data.iloc[0].story_reference

'The Bandit walks to the Market. Tom walks to the Crossroads. The Bandit walks to the Crossroads. The Guard walks to the Crossroads. The Bandit attacks Tom.'

In [8]:
story_similarity_data.iloc[0].story_A

"Tom walks to the Crossroads. Tom walks to the Merchant's House. The Bandit walks to the Merchant's House. The Bandit attacks Tom."

In [9]:
story_similarity_data.iloc[0].story_B

'Tom walks to the Crossroads. Tom walks to the Market. Tom buys the Potion from the Merchant using his coin. Tom waits for night. Later that night, Tom walks to the Crossroads. Tom walks to his Cottage.'

In [10]:
story_similarity_data

| | reference | A | B | participant | choice | story_reference | story_A | story_B |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 56 | 4 | 1 | A | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. Tom walks to the ... | Tom walks to the Crossroads. Tom walks to the ... |
| 1 | 3 | 56 | 4 | 2 | A | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. Tom walks to the ... | Tom walks to the Crossroads. Tom walks to the ... |
| 2 | 3 | 56 | 4 | 3 | A | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. Tom walks to the ... | Tom walks to the Crossroads. Tom walks to the ... |
| 3 | 3 | 56 | 4 | 4 | A | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. Tom walks to the ... | Tom walks to the Crossroads. Tom walks to the ... |
| 4 | 3 | 56 | 4 | 5 | A | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. Tom walks to the ... | Tom walks to the Crossroads. Tom walks to the ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 507 | 5 | 42 | 41 | 60 | A | Tom waits for night. Later that night, Tom wal... | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. The Bandit attack... |
| 508 | 5 | 42 | 41 | 61 | A | Tom waits for night. Later that night, Tom wal... | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. The Bandit attack... |
| 509 | 5 | 42 | 41 | 62 | A | Tom waits for night. Later that night, Tom wal... | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. The Bandit attack... |
| 510 | 5 | 42 | 41 | 63 | A | Tom waits for night. Later that night, Tom wal... | The Bandit walks to the Market. Tom walks to t... | Tom walks to the Crossroads. The Bandit attack... |
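The rows shown above repeat the same (reference, A, B) triple for several participants, so a per-triple label can be obtained by aggregating the votes. The following is a rough sketch on toy data that reuses the column names of the annotation table; the `majority_choice`, `agreement` and `n_votes` names and the toy values are assumptions, and ties are not handled specially.

```python
import pandas as pd

# Toy annotations (made-up values) with the same columns as the table above.
annotations = pd.DataFrame(
    {
        "reference": [3, 3, 3, 5, 5],
        "A": [56, 56, 56, 42, 42],
        "B": [4, 4, 4, 41, 41],
        "participant": [1, 2, 3, 60, 61],
        "choice": ["A", "A", "B", "B", "B"],
    }
)

rows = []
for (reference, a, b), group in annotations.groupby(["reference", "A", "B"]):
    counts = group["choice"].value_counts()
    rows.append(
        {
            "reference": reference,
            "A": a,
            "B": b,
            # most frequent choice among the participants of this triple
            "majority_choice": counts.idxmax(),
            # share of participants that voted for the majority choice
            "agreement": counts.max() / counts.sum(),
            "n_votes": int(counts.sum()),
        }
    )
majority = pd.DataFrame(rows)
print(majority)
```

The `agreement` column gives a quick view of how contested each triple is, which may matter when the human choices are later used as ground truth.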


### Tell Me Again!

In [11]:
def load_all_json_files(root_folder):
    data_dict = {}
    
    for subdir, _, files in os.walk(root_folder):
        for filename in files:
            if filename.endswith(".json"):
                file_path = os.path.join(subdir, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        data_dict[file_path] = json.load(f)
                except json.JSONDecodeError as e:
                    print(f"Error loading {file_path}: {e}")
    
    return data_dict

root_folder = "tell_me_again_v1/summaries"
json_data = load_all_json_files(root_folder)

In [12]:
tell_me_again_data = pd.DataFrame(json_data).T

In [13]:
tell_me_again_data = tell_me_again_data[["wikidata_id", "title", "en_translated_summaries"]]
# create rows for every unique language summary
exploded = tell_me_again_data.explode('en_translated_summaries').rename(columns = {'en_translated_summaries':'language'})

In [14]:
tell_me_again_data = exploded.merge(tell_me_again_data[["wikidata_id", "en_translated_summaries"]], on="wikidata_id", how="left").dropna()


In [15]:
# unpack summary dicts
tell_me_again_data["unpacked_summary"] = tell_me_again_data.apply(lambda row: row["en_translated_summaries"].get(row["language"], {}).get("text", ""), axis=1)
tell_me_again_data["unpacked_summary_sents"] = tell_me_again_data.apply(lambda row: row["en_translated_summaries"].get(row["language"], {}).get("sentences", ""), axis=1)

tell_me_again_data = tell_me_again_data.drop(columns = ['en_translated_summaries'])

In [53]:
for lang in list(set(tell_me_again_data[(tell_me_again_data.title == 'Wrestling Ernest Hemingway')].language.tolist())):
    print(f"{lang.upper()} to English summary:\n")
    print(tell_me_again_data[(tell_me_again_data.title == 'Wrestling Ernest Hemingway') & (tell_me_again_data.language == lang)].unpacked_summary.tolist()[0])
    print()

IT to English summary:

In the Florida home owned by widow Helen, the air conditioner is broken. This infuriates the tenant, retired sea captain Frank, whose son sent him a double-sided hat for his seventy-fifth birthday. Frank seeks refreshment in a bookstore, but is banished to the cinema, where he chats with the elegant and well-dressed Georgia. Disappointed by his son's late arrival, Frank meets Walt, another lonely pensioner, at the park, with whom he goes to the snack where the latter invariably orders bacon sandwiches to the pretty Elaine, of whom he is a devoted admirer and, who accompanies him home on the bus after work.
It's the Fourth of July, and to see the fireworks, the two use Frank's tent, disappointed that his son has canceled the planned visit. Meanwhile, they become friends: Walt reveals himself to be a Cuban exile, and Frank lingers to recall his first love encounter and his four marriages. Because on Saturday, Georgia, very elegant, escapes the shabby Frank, his fr

In [67]:
tell_me_again_data

| | wikidata_id | title | language | unpacked_summary | unpacked_summary_sents |
|---|---|---|---|---|---|
| 0 | Q1000352 | Reckless | de | New York: As a 12-year-old boy, Jacob, longing... | [New York: As a 12-year-old boy, Jacob, longin... |
| 1 | Q1000352 | Reckless | it | Jacob Recless lives alone with his mother and ... | [Jacob Recless lives alone with his mother and... |
| 2 | Q1000394 | This Modern Age | de | Diane Winters, a middle-aged housewife, is abo... | [Diane Winters, a middle-aged housewife, is ab... |
| 3 | Q1000764 | The Buddenbrooks | de | # Part one\nThe Buddenbrooks are a respected a... | [# Part one, The Buddenbrooks are a respected ... |
| 4 | Q1000764 | The Buddenbrooks | fr | Part one\nThe Buddenbrooks were a respected an... | [Part one, The Buddenbrooks were a respected a... |
| ... | ... | ... | ... | ... | ... |
| 83273 | Q999587 | 206 Bones | de | Tempe Brennan travels with her colleague and e... | [Tempe Brennan travels with her colleague and ... |
| 83274 | Q999587 | 206 Bones | it | Dr. Brennan wakes up, bound hand and foot, in ... | [Dr. Brennan wakes up, bound hand and foot, in... |
| 83275 | Q99982097 | BigBug | de | In the year 2045, artificial intelligence is u... | [In the year 2045, artificial intelligence is ... |
| 83276 | Q99982097 | BigBug | fr | By 2045, every resident of a house will have a... | [By 2045, every resident of a house will have ... |


#### Checking the length of the texts after BERT tokenization

In [76]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# number of BERT tokens per summary (the count includes the [CLS] and [SEP] special tokens)
tokenized_summaries = [len(tokenizer(x)["input_ids"]) for x in tell_me_again_data.unpacked_summary.tolist()]


Token indices sequence length is longer than the specified maximum sequence length for this model (723 > 512). Running this sequence through the model will result in indexing errors


In [81]:
tell_me_again_data['token_count'] = tokenized_summaries
tell_me_again_data.token_count.describe()

count    83277.000000
mean       431.674676
std        496.204027
min          4.000000
25%        114.000000
50%        268.000000
75%        612.000000
max      29047.000000
Name: token_count, dtype: float64

In [86]:
tell_me_again_data.token_count.quantile(0.69)

508.0
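The quantile lookup above can also be turned around to ask directly which share of summaries fits within the model limit. A small sketch follows, in which `share_within_limit` is a hypothetical helper and the toy counts are made up; since the tokenizer call adds the special tokens by default, the counts can be compared against 512 without further adjustment.

```python
def share_within_limit(token_counts, limit=512):
    """Fraction of texts whose token count does not exceed the limit."""
    counts = list(token_counts)
    return sum(count <= limit for count in counts) / len(counts)


# Made-up token counts, only for illustration.
toy_counts = [4, 114, 268, 612, 29047]
share = share_within_limit(toy_counts)
print(share)  # 0.6
```

On the real data the same helper could be applied to the `token_count` column, for example `share_within_limit(tell_me_again_data.token_count)`.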

## Analysis Story Similarity Data

In [21]:
story_similarity_data.story_reference.apply(lambda x: len(x.split("."))).describe()

count    512.000000
mean       6.125000
std        0.331042
min        6.000000
25%        6.000000
50%        6.000000
75%        6.000000
max        7.000000
Name: story_reference, dtype: float64

The Story Similarity Data is already completely clean. Looking at the sample, it is immediately clear that it does not reflect natural language in the way the summaries do; it reads more like an ontology. This has the advantage that the stories are very concise, almost pure structure. It also means that the dataset generalizes much worse, as it is so distinctly different from narratives in the most general sense. The narratives are also much shorter than the movie summaries: the period-based count above has a mean of about 6, and because splitting on "." also counts the empty string after the final period, this corresponds to roughly 5 sentences per story. A big advantage of this is that, when using transformer encoder models as a baseline, we will not have to worry about exceeding the token limit of 512.
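One caveat on the sentence count used above: `x.split(".")` returns an extra empty string when the text ends with a period, so the reported numbers are one higher than the number of sentences. A small sketch of a count that skips empty segments, using the first reference story shown earlier (`count_sentences` is a hypothetical helper name):

```python
def count_sentences(story: str) -> int:
    """Count the non-empty segments between periods."""
    return sum(1 for part in story.split(".") if part.strip())


story = (
    "The Bandit walks to the Market. Tom walks to the Crossroads. "
    "The Bandit walks to the Crossroads. The Guard walks to the Crossroads. "
    "The Bandit attacks Tom."
)
naive_count = len(story.split("."))
sentence_count = count_sentences(story)
print(naive_count)     # 6, includes the empty string after the last period
print(sentence_count)  # 5
```

This only works because the generated stories use periods exclusively as sentence terminators; for the movie summaries, the pre-split `unpacked_summary_sents` column is the safer source.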

## Analysis Tell Me Again!

In [40]:
tell_me_again_data.unpacked_summary_sents.apply(lambda x: len(x)).describe()

count    83277.00000
mean        16.48245
std         19.03143
min          1.00000
25%          4.00000
50%         10.00000
75%         24.00000
max       1188.00000
Name: unpacked_summary_sents, dtype: float64

The Tell Me Again! dataset is also already practically clean, with only a few stray '\n' characters here and there. The creators of the dataset do warn that it contains some duplicates, but for our purpose this is not much of a problem. There are also some outliers in the number of sentences per summary (up to 1188), which can be removed quite simply. The texts reflect natural narratives better than those in the Story Similarity Data. Summaries of the same property do, somewhat surprisingly, differ a lot in content. While you would expect summaries of the same property to be most similar to each other, these differences might be big enough for this not to be entirely true. This would lower baseline performance, but does not prevent us from using the dataset in our experiments. The higher number of sentences isn't necessarily a problem: because there are so many summaries (~70% below 512 tokens), we can simply select the properties whose summaries fall beneath a specific number of tokens/sentences if necessary.
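The selection step mentioned above could look roughly like the following sketch. It uses toy data with the `wikidata_id`, `language` and `token_count` column names from this notebook; the `MAX_TOKENS` constant and the rule of keeping only properties that retain at least two short summaries (so that every query still has a matching summary to retrieve) are assumptions, not decisions made in this EDA.

```python
import pandas as pd

# Toy stand-in for tell_me_again_data (made-up values).
summaries = pd.DataFrame(
    {
        "wikidata_id": ["Q1", "Q1", "Q2", "Q2", "Q3"],
        "language": ["de", "it", "de", "fr", "de"],
        "token_count": [120, 480, 300, 900, 200],
    }
)

MAX_TOKENS = 512  # assumed cut-off, matching the BERT input limit

# Drop summaries that would not fit into the encoder.
short = summaries[summaries["token_count"] <= MAX_TOKENS]

# Keep only properties that still have at least two summaries left.
kept = short.groupby("wikidata_id").filter(lambda group: len(group) >= 2)
print(kept)
```

The same pattern works for a sentence-based cut-off by filtering on the length of `unpacked_summary_sents` instead of `token_count`.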

## Conclusion

Both datasets look perfectly usable for our task. Story Similarity Data represents human similarity judgements, while Tell Me Again! provides a more objective measure (whether two summaries describe the same property). It is entirely possible that humans do not factor structural similarity into their similarity judgement in a significant way, as some research suggests (humans preferring more directly obvious factors, such as the subject of the text or the writing style). Still, it will at least be interesting to experiment with.