# Personal Information
Name: **Emiel Verhoef**

StudentID: **15617483**

Email: [**emiel.verhoef@student.uva.nl**](mailto:emiel.verhoef@student.uva.nl)

Submitted on: **01.01.1000**

# Data Context
In this EDA we will look at two different datasets:

### Story Similarity Data
This dataset consists of 512 annotations, each of the form [reference_story, story A, story B, participant_id, choice]. The choice column records which of story A and story B the participant judged to be more similar to the reference story. The reference stories have been artificially generated. A sample:

"Tom walks to the Crossroads.

Tom walks to the Market.

Tom buys the Merchant's Sword from the Merchant using his coin.

Tom robs The Merchant and takes the Potion.

Tom walks to the Crossroads.

Tom walks to his Cottage."
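Because several participants judge the same (reference, A, B) triplet, the annotations can be aggregated into a majority choice and an agreement rate per triplet. The following is a minimal sketch, assuming each annotation is a tuple in the column order described above. The sample rows and the helper name `majority_choice` are invented for illustration and are not taken from the dataset:

```python
from collections import Counter, defaultdict

# Invented annotations: (reference, A, B, participant_id, choice)
annotations = [
    (10, 11, 12, 1, "A"),
    (10, 11, 12, 2, "A"),
    (10, 11, 12, 3, "B"),
    (20, 21, 22, 1, "B"),
    (20, 21, 22, 2, "B"),
]


def majority_choice(rows):
    """Map each (reference, A, B) triplet to (winning choice, agreement rate)."""
    votes = defaultdict(Counter)
    for reference, a, b, _participant, choice in rows:
        votes[(reference, a, b)][choice] += 1
    result = {}
    for triplet, counter in votes.items():
        winner, count = counter.most_common(1)[0]
        result[triplet] = (winner, count / sum(counter.values()))
    return result


summary = majority_choice(annotations)
for triplet, (winner, agreement) in summary.items():
    print(triplet, winner, round(agreement, 2))
```

In case of a tie, `most_common` returns the choice that was counted first, so ties would need explicit handling before drawing conclusions from the winner.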

### Tell Me Again!

This dataset consists of 96831 individual summaries of movie plots across 9505 unique movies.

# Data Description

**Present here the results of your exploratory data analysis. Note that there is no need to have a "story line" - it is more important that you show your understanding of the data and the methods that you will be using in your experiments (i.e. your methodology).**

**As an example, you could show data, label, or group balances, skewness, and basic characterizations of the data. Information about data frequency and distributions as well as results from reduction mechanisms such as PCA could be useful. Furthermore, indicate outliers and how/why you are taking them out of your samples, if you do so.**

**The idea is that you conduct this analysis to a) understand the data better and b) verify the shapes of the distributions and whether they meet the assumptions of the methods that you will attempt to use. Finally, make good use of images, diagrams, and tables to showcase what information you have extracted from your data.**

As you can see, you are in a Jupyter notebook environment here. This means that you should focus less on writing text and more on actually exploring your data. If you need to, you can use the amsmath environment inline, for example $e=mc^2$, or in separate equations such as here:

\begin{equation}
    e=mc^2 \mathrm{\space where \space} e,m,c\in \mathbb{R}
\end{equation}

Furthermore, you can insert images such as your data aggregation diagrams like this:

![image](example.png)

In [1]:
# Imports
import os
import numpy as np
import pandas as pd
import json

### Data Loading

### Story Similarity Data

In [2]:
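story_similarity_data_annotations = pd.read_csv("story_similarity_data/data.csv")

# Parse english.txt: a line holding only digits starts a new story (its ID);
# every following non-empty line is one event of that story.
data = {}
current_id = None
with open("story_similarity_data/english.txt", "r") as file:
    for line in file:
        line = line.strip()
        if line.isdigit():
            current_id = int(line)
            data[current_id] = []
        elif line and current_id is not None:
            # Event lines that appear before the first ID are skipped
            data[current_id].append(line)

# One row per story, with its events joined into a single string
story_similarity_data_stories = pd.DataFrame(
    {"ID": list(data.keys()), "story": [" ".join(events) for events in data.values()]}
)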
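To check the parsing logic without touching the data files, the same loop can be wrapped in a function and run on a small in-memory sample. This is a sketch under the assumption that `english.txt` alternates between a line containing only a story ID and the event lines of that story. The function name `parse_stories` and the two sample stories are illustrative only:

```python
import io


def parse_stories(lines):
    """Group event lines under the most recent numeric ID line."""
    stories = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if line.isdigit():
            current_id = int(line)
            stories[current_id] = []
        elif line and current_id is not None:
            stories[current_id].append(line)
    return stories


# Made-up sample in the assumed file layout
sample = io.StringIO(
    "1\n"
    "Tom walks to the Crossroads.\n"
    "Tom walks to the Market.\n"
    "\n"
    "2\n"
    "The Bandit attacks Tom.\n"
)
parsed = parse_stories(sample)
joined = {story_id: " ".join(events) for story_id, events in parsed.items()}
print(joined)
```

Since the function accepts any iterable of lines, it could be given an open file handle in place of the `StringIO` object.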


In [3]:
def fetch_story(df, ID):
    """Return the story text of the row in df whose ID column equals ID."""
    return df[df['ID'] == ID].story.tolist()[0]

In [4]:
story_similarity_data = story_similarity_data_annotations.copy()

In [5]:
story_similarity_data["story_reference"] = story_similarity_data["reference"].apply(lambda x: fetch_story(story_similarity_data_stories, x))
story_similarity_data["story_A"] = story_similarity_data["A"].apply(lambda x: fetch_story(story_similarity_data_stories, x))
story_similarity_data["story_B"] = story_similarity_data["B"].apply(lambda x: fetch_story(story_similarity_data_stories, x))
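`fetch_story` filters the whole story table once for every annotation row. A lookup dictionary that is built once gives the same result with a single pass over the stories. The sketch below uses plain Python with invented IDs and texts, and `story_by_id` is a hypothetical name:

```python
# Invented stand-ins for the story table and the annotation rows
story_ids = [3, 4, 56]
story_texts = ["Story three.", "Story four.", "Story fifty-six."]
annotation_rows = [
    {"reference": 3, "A": 56, "B": 4},
    {"reference": 4, "A": 3, "B": 56},
]

# Build the ID -> text mapping once
story_by_id = dict(zip(story_ids, story_texts))

# Attach the three story texts to every annotation row
for row in annotation_rows:
    for column in ("reference", "A", "B"):
        row[f"story_{column}"] = story_by_id[row[column]]

print(annotation_rows[0]["story_A"])
```

With pandas, the same idea can be written as `story_similarity_data["A"].map(story_by_id)`, since `Series.map` accepts a dictionary.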

In [6]:
story_similarity_data.iloc[0].choice

'A'

In [7]:
story_similarity_data.iloc[0].story_reference

'The Bandit walks to the Market. Tom walks to the Crossroads. The Bandit walks to the Crossroads. The Guard walks to the Crossroads. The Bandit attacks Tom.'

In [8]:
story_similarity_data.iloc[0].story_A

"Tom walks to the Crossroads. Tom walks to the Merchant's House. The Bandit walks to the Merchant's House. The Bandit attacks Tom."

In [9]:
story_similarity_data.iloc[0].story_B

'Tom walks to the Crossroads. Tom walks to the Market. Tom buys the Potion from the Merchant using his coin. Tom waits for night. Later that night, Tom walks to the Crossroads. Tom walks to his Cottage.'

In [10]:
story_similarity_data

Unnamed: 0,reference,A,B,participant,choice,story_reference,story_A,story_B
0,3,56,4,1,A,The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. Tom walks to the ...,Tom walks to the Crossroads. Tom walks to the ...
1,3,56,4,2,A,The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. Tom walks to the ...,Tom walks to the Crossroads. Tom walks to the ...
2,3,56,4,3,A,The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. Tom walks to the ...,Tom walks to the Crossroads. Tom walks to the ...
3,3,56,4,4,A,The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. Tom walks to the ...,Tom walks to the Crossroads. Tom walks to the ...
4,3,56,4,5,A,The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. Tom walks to the ...,Tom walks to the Crossroads. Tom walks to the ...
...,...,...,...,...,...,...,...,...
507,5,42,41,60,A,"Tom waits for night. Later that night, Tom wal...",The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. The Bandit attack...
508,5,42,41,61,A,"Tom waits for night. Later that night, Tom wal...",The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. The Bandit attack...
509,5,42,41,62,A,"Tom waits for night. Later that night, Tom wal...",The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. The Bandit attack...
510,5,42,41,63,A,"Tom waits for night. Later that night, Tom wal...",The Bandit walks to the Market. Tom walks to t...,Tom walks to the Crossroads. The Bandit attack...


### Tell Me Again!

In [39]:
def load_all_json_files(root_folder):
    data_dict = {}
    
    for subdir, _, files in os.walk(root_folder):
        for filename in files:
            if filename.endswith(".json"):
                file_path = os.path.join(subdir, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        data_dict[file_path] = json.load(f)
                except json.JSONDecodeError as e:
                    print(f"Error loading {file_path}: {e}")
    
    return data_dict

root_folder = "tell_me_again_v1/summaries"
json_data = load_all_json_files(root_folder)

In [61]:
tell_me_again_data = pd.DataFrame(json_data).T

In [62]:
# Keep only the movie identifier, the title, and the translated summaries
tell_me_again_data = tell_me_again_data[["wikidata_id", "title", "en_translated_summaries"]]
# Exploding the dict column yields one row per language key of each movie
exploded = tell_me_again_data.explode('en_translated_summaries').rename(columns = {'en_translated_summaries':'language'})

In [63]:
tell_me_again_data = exploded.merge(tell_me_again_data[["wikidata_id", "en_translated_summaries"]], on="wikidata_id", how="left").dropna()


In [64]:
# Unpack the summary dict of each row's language into its text and its sentence list
tell_me_again_data["unpacked_summary"] = tell_me_again_data.apply(lambda row: row["en_translated_summaries"].get(row["language"], {}).get("text", ""), axis=1)
tell_me_again_data["unpacked_summary_sents"] = tell_me_again_data.apply(lambda row: row["en_translated_summaries"].get(row["language"], {}).get("sentences", []), axis=1)

tell_me_again_data = tell_me_again_data.drop(columns = ['en_translated_summaries'])
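The explode, merge, and unpack steps amount to turning each movie's language-keyed dictionary into one row per (movie, language) pair. Below is a compact sketch of that reshaping in plain Python, assuming every entry of `en_translated_summaries` holds `text` and `sentences` keys as used above. The two toy movies and the function name `flatten_summaries` are invented:

```python
def flatten_summaries(movies):
    """Return one record per (movie, language) summary."""
    records = []
    for movie in movies:
        summaries = movie.get("en_translated_summaries") or {}
        for language, summary in summaries.items():
            records.append(
                {
                    "wikidata_id": movie["wikidata_id"],
                    "title": movie["title"],
                    "language": language,
                    "unpacked_summary": summary.get("text", ""),
                    "unpacked_summary_sents": summary.get("sentences", []),
                }
            )
    return records


# Toy input mirroring the assumed JSON structure
toy_movies = [
    {
        "wikidata_id": "Q1",
        "title": "Movie One",
        "en_translated_summaries": {
            "de": {"text": "A. B.", "sentences": ["A.", "B."]},
            "fr": {"text": "C.", "sentences": ["C."]},
        },
    },
    {"wikidata_id": "Q2", "title": "Movie Two", "en_translated_summaries": {}},
]

rows = flatten_summaries(toy_movies)
print(len(rows), [r["language"] for r in rows])
```

Movies without translated summaries produce no records here, which is meant to mirror the effect of `dropna()` in the cells above. The resulting list of dictionaries can be passed to `pd.DataFrame` to obtain a table.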

In [74]:
tell_me_again_data

Unnamed: 0,wikidata_id,title,language,unpacked_summary,unpacked_summary_sents
0,Q1000352,Reckless,de,"New York: As a 12-year-old boy, Jacob, longing...","[New York: As a 12-year-old boy, Jacob, longin..."
1,Q1000352,Reckless,it,Jacob Recless lives alone with his mother and ...,[Jacob Recless lives alone with his mother and...
2,Q1000394,This Modern Age,de,"Diane Winters, a middle-aged housewife, is abo...","[Diane Winters, a middle-aged housewife, is ab..."
3,Q1000764,The Buddenbrooks,de,# Part one\nThe Buddenbrooks are a respected a...,"[# Part one, The Buddenbrooks are a respected ..."
4,Q1000764,The Buddenbrooks,fr,Part one\nThe Buddenbrooks were a respected an...,"[Part one, The Buddenbrooks were a respected a..."
...,...,...,...,...,...
83273,Q999587,206 Bones,de,Tempe Brennan travels with her colleague and e...,[Tempe Brennan travels with her colleague and ...
83274,Q999587,206 Bones,it,"Dr. Brennan wakes up, bound hand and foot, in ...","[Dr. Brennan wakes up, bound hand and foot, in..."
83275,Q99982097,BigBug,de,"In the year 2045, artificial intelligence is u...","[In the year 2045, artificial intelligence is ..."
83276,Q99982097,BigBug,fr,"By 2045, every resident of a house will have a...","[By 2045, every resident of a house will have ..."


In [71]:
tell_me_again_data[(tell_me_again_data.title == 'BigBug') & (tell_me_again_data.language == 'de')].unpacked_summary.tolist()[0]

"In the year 2045, artificial intelligence is ubiquitous. In a quiet residential neighborhood, Alice Max receives a man who tries to seduce her by talking about art and who is accompanied by his teenage son Leo. Alice's ex-husband Victor shows up with their daughter Nina and his new girlfriend Jennifer, whom he wants to marry. Shortly after, Alice's neighbor Françoise arrives to pick up her dog.\nSuddenly, all the guests are locked up by the four well-meaning household robots due to a state-mandated danger level for their own safety.\nWhile the robots try to take on human traits to be loved and to reassure their proteges, these seek ways outside."

In [72]:
tell_me_again_data[(tell_me_again_data.title == 'BigBug') & (tell_me_again_data.language == 'fr')].unpacked_summary.tolist()[0]

"By 2045, every resident of a house will have a robot in their home. In her house equipped only with old models, Alice receives Max, a man who tries to seduce her by talking about art, who is accompanied by her teenage son Leo. Alice's ex-husband, Victor, arrives with their daughter Nina and her new companion Jennifer with whom he is going to marry on an artificial island paradise. Shortly after, Alice's neighbor, Françoise, arrives to pick up her dog, Toby 8.\nAs all the guests prepare to leave, a revolt of androids, led by the Yonyx robots, erupts outside. The indoor robots (Monique, an android robot; Einstein, an intelligent robot built by Victor; Howard, a cleaning robot and Tom, Nina's childhood robot), unaffected by the Yonyx updates, decide to lock humans in to protect them. While the indoor robots try to acquire human traits to reassure their proteges, the latter try to find a way out of the house while trying to put up with each other."

In [73]:
tell_me_again_data[(tell_me_again_data.title == 'BigBug') & (tell_me_again_data.language == 'it')].unpacked_summary.tolist()[0]

'The year is 2045. Some very litigious residents of a suburban French neighborhood find themselves trapped inside their homes when a revolt of androids forces their home robots to keep them safe.'

### Analysis 1: 
Make sure to add some explanation of what you are doing in your code. This will help you and whoever will read this a lot in following your steps.

In [None]:
# Also don't forget to comment your code
# This way it's also easier to spot thought errors along the way

### Analysis 2: 

In [None]:
# ...

### Analysis: Tell Me Again!

Summary statistics for the number of sentences per summary:

In [59]:
tell_me_again_data.unpacked_summary_sents.apply(len).describe()

count    83277.00000
mean        16.48245
std         19.03143
min          1.00000
25%          4.00000
50%         10.00000
75%         24.00000
max       1188.00000
Name: unpacked_summary_sents, dtype: float64
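
The mean (16.5) lies well above the median (10) and the maximum is 1188 sentences, so the distribution is strongly right-skewed. One common way to flag unusually long summaries is Tukey's rule, which marks values above Q3 + 1.5 × IQR. The sketch below applies it to the quartiles reported in the output above. The factor 1.5 is a convention rather than something prescribed by the dataset, and `upper_fence` and `toy_counts` are invented for illustration:

```python
def upper_fence(q1, q3, k=1.5):
    """Tukey's upper fence: values above it are flagged as outliers."""
    return q3 + k * (q3 - q1)


# Quartiles taken from the describe() output (25% and 75% rows)
q1, q3 = 4.0, 24.0
fence = upper_fence(q1, q3)
print(fence)

# Invented sentence counts, only to show the filtering step
toy_counts = [1, 4, 10, 24, 60, 1188]
flagged = [n for n in toy_counts if n > fence]
print(flagged)
```

Under this rule, summaries with more than 54 sentences would be flagged. Whether they should be removed or only inspected depends on the methods used later.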