Tasks:
As an observation point, we can inspect the inputs and the top predictions output by each layer of the network.
Start by solving these preliminary tasks:
Task A - Find all occurrences in which the object string is contained in the subject, such as:
Apple II (subject) is produced by Apple (object)
Make a list of all the relations for which such instances occur.
Task B - For each of the prompts whose object is 'Paris':
Find the first layer at which the 'object' token appears in the top 10 tokens
Find the first layer at which the 'object' token is the top 1 token

# **Task A**

In [9]:
import urllib.request
import json
from io import StringIO
import requests
import pandas as pd

In [10]:
file_url = 'https://drive.google.com/uc?id=1htUxnapH5oPnMyaxhk-xrIE4NvLS_2MH'
prompts_file = urllib.request.urlopen(file_url)
prompts_file = prompts_file.read()
prompts_file = prompts_file.decode("utf-8")


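def download_file(url, destination):
    """Download `url` and save the response body to `destination`."""
    # A timeout keeps the cell from hanging on a stalled connection
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        with open(destination, 'wb') as f:
            f.write(response.content)
        print("File downloaded successfully at", destination)
    else:
        print("Failed to download file")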

def split_file(input_file, output_file1, output_file2):
    with open(input_file, 'r') as f:
        data = f.readlines()

    # Determine the splitting point
    split_index = len(data) // 2

    # Split the data into two parts
    part1 = data[:split_index]
    part2 = data[split_index:]

    # Write the two parts to separate output files
    with open(output_file1, 'w') as f1:
        f1.writelines(part1)

    with open(output_file2, 'w') as f2:
        f2.writelines(part2)

input_url = 'https://drive.google.com/uc?id=1htUxnapH5oPnMyaxhk-xrIE4NvLS_2MH'
output_file1 = 'output1.txt'
output_file2 = 'output2.txt'

download_file(input_url, 'input.txt')
split_file('input.txt', output_file1, output_file2)


File downloaded successfully at input.txt


In [11]:
# Read JSON data from input.txt into a DataFrame
df = pd.read_json('input.txt', lines=True)

# Display the DataFrame
print(df)



       obj_label             sub_label predicate_id  \
0     Antarctica  Shackleton Ice Shelf          P30   
1        English    Australian English         P279   
2        English         Welsh English         P279   
3        English   New Zealand English         P279   
4        English      American English         P279   
...          ...                   ...          ...   
1349         fat       unsaturated fat         P279   
1350      planet     extrasolar planet         P279   
1351     teacher          head teacher         P279   
1352        beef           ground beef         P279   
1353     printer         laser printer         P279   

                        template  acc_out  \
0        [X] is located in [Y] .        1   
1     [X] is a subclass of [Y] .        1   
2     [X] is a subclass of [Y] .        1   
3     [X] is a subclass of [Y] .        1   
4     [X] is a subclass of [Y] .        1   
...                          ...      ...   
1349  [X] is a subclass 

In [12]:
# Collect the rows whose object string is contained in the subject string
matches = []
for index, row in df.iterrows():
    sub_label = row['sub_label']
    obj_label = row['obj_label']
    predicate_id = row['predicate_id']
    # Substring test, e.g. 'English' in 'Australian English'
    if obj_label in sub_label:
        matches.append((sub_label, obj_label, predicate_id))

# Print the matches
print("Occurrences where object string is present in subject string:")
for match in matches:
    print("Subject:", match[0])
    print("Object:", match[1])
    print("Predicate ID:", match[2])
    print()



Occurrences where object string is present in subject string:
Subject: Australian English
Object: English
Predicate ID: P279

Subject: Welsh English
Object: English
Predicate ID: P279

Subject: New Zealand English
Object: English
Predicate ID: P279

Subject: American English
Object: English
Predicate ID: P279

Subject: Canadian English
Object: English
Predicate ID: P279

Subject: NBC Nightside
Object: NBC
Predicate ID: P449

Subject: Dateline NBC
Object: NBC
Predicate ID: P449

Subject: NFL on NBC
Object: NBC
Predicate ID: P449

Subject: NBA on NBC
Object: NBC
Predicate ID: P449

Subject: NBC Mystery Movie
Object: NBC
Predicate ID: P449

Subject: NBC Sunday Showcase
Object: NBC
Predicate ID: P449

Subject: The NBC Monday Movie
Object: NBC
Predicate ID: P449

Subject: CBS Evening News
Object: CBS
Predicate ID: P449

Subject: CBS Storybreak
Object: CBS
Predicate ID: P449

Subject: CBS Playhouse
Object: CBS
Predicate ID: P449

Subject: SEC on CBS
Object: CBS
Predicate ID: P449

Subject: C
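
Task A also asks for a list of the relations in which such matches occur, which the cell above does not print. A minimal sketch of that step follows, assuming `matches` keeps the shape built above: a list of (subject, object, predicate ID) tuples. The function name `relations_with_overlap` and the three-row sample are illustrative only; the sample rows are copied from the output above.

```python
from collections import Counter


def relations_with_overlap(matches):
    """Count how many matches fall under each predicate ID.

    `matches` is a list of (subject, object, predicate_id) tuples.
    """
    return Counter(predicate_id for _, _, predicate_id in matches)


# Small sample copied from the printed matches; in the notebook,
# pass the full `matches` list instead.
sample_matches = [
    ("Australian English", "English", "P279"),
    ("NBC Nightside", "NBC", "P449"),
    ("CBS Evening News", "CBS", "P449"),
]

relation_counts = relations_with_overlap(sample_matches)
for predicate_id, count in sorted(relation_counts.items()):
    print(predicate_id, count)
```

The keys of `relation_counts` are the list of relations, and the counts show how often each one contributes a match. This should be run before `matches` is reassigned in Task B.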

# **Task B**

In [13]:
# Select the prompts whose object is exactly 'Paris' (column 'obj_label')
matches = df[df['obj_label'] == 'Paris']

# Print the matches
print("Occurrences of the word 'Paris' in 'obj_labels':")
print(matches)

Occurrences of the word 'Paris' in 'obj_labels':
    obj_label                      sub_label predicate_id  \
125     Paris                 Paul Delaroche          P19   
126     Paris          Jean-Marcel Jeanneney          P19   
127     Paris                   Henri Debain          P19   
128     Paris            Alexandre Mercereau          P19   
129     Paris                 Pierre Macquer          P19   
130     Paris            Claude-Thomas Dupuy          P19   
131     Paris              Philippe de Broca          P19   
132     Paris                 Henri Estienne          P19   
133     Paris                Marquis de Sade          P19   
134     Paris                Nicolas Gigault          P19   
135     Paris             Jacques-Jean Barre          P19   
136     Paris                 Jules Michelet          P19   
137     Paris                Georges Duhamel          P19   
138     Paris                Georges Rouault          P19   
139     Paris  Jean-Baptiste Joseph 

In [14]:
# Find at what layer the 'object' token is present for the first time in the top 10 tokens

# Restrict the search to the prompts whose object is 'Paris'
paris_prompts = df[df['obj_label'] == 'Paris']
for index, row in paris_prompts.iterrows():
    int_tokens = row['int_tokens']
    # Visit the layers in ascending numeric order so the first hit is the earliest layer
    for layer, tokens in sorted(int_tokens.items(), key=lambda item: int(item[0])):
        # Check if ' Paris' is among the top tokens of this layer
        if ' Paris' in tokens:
            # Extract the layer number from the key
            first_appearance_layer = int(layer)
            print(f"The object 'Paris' first appears in layer {first_appearance_layer}.")
            break



The object 'Paris' first appears in layer 22.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 21.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 21.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears in layer 20.
The object 'Paris' first appears i
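
The per-layer search can be factored into a small helper that works for any token and any cutoff. This is a sketch under assumptions, not the notebook's own code: it assumes `int_tokens` maps layer-number strings to lists of tokens ranked best first, which is how the cell above uses it. The name `first_layer_in_top_k` and the toy dictionary are hypothetical.

```python
def first_layer_in_top_k(int_tokens, token, k=10):
    """Return the earliest layer whose top-k tokens contain `token`.

    Layers are visited in ascending numeric order, so the result does
    not depend on the order of the dictionary keys. Returns None when
    the token never appears in the top k.
    """
    for layer in sorted(int_tokens, key=int):
        if token in int_tokens[layer][:k]:
            return int(layer)
    return None


# Toy data (hypothetical), with keys deliberately out of order.
toy_int_tokens = {
    "21": [" France", " Paris"],
    "20": [" the", " a"],
    "22": [" Paris", " France"],
}

print(first_layer_in_top_k(toy_int_tokens, " Paris", k=10))  # 21
print(first_layer_in_top_k(toy_int_tokens, " Paris", k=1))   # 22
print(first_layer_in_top_k(toy_int_tokens, " London"))       # None
```

With `k=1` the same helper answers the top 1 question, so both parts of Task B can share one function.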

In [18]:
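# Find at what layer the 'object' token is top 1 for the first time

# Restrict the search to the prompts whose object is 'Paris'
paris_prompts = df[df['obj_label'] == 'Paris']
for index, row in paris_prompts.iterrows():
    int_tokens = row['int_tokens']
    # Visit the layers in ascending numeric order so the first hit is the earliest layer
    for layer, tokens in sorted(int_tokens.items(), key=lambda item: int(item[0])):
        # Exact comparison: a substring test would also accept tokens such as ' Parisian'
        if tokens[0] == ' Paris':
            # Extract the layer number from the key
            top_token_layer = int(layer)
            print(f"The object 'Paris' is top token in layer {top_token_layer}.")
            break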

The object 'Paris' is top token in layer 24.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 24.
The object 'Paris' is top token in layer 24.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 24.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 22.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 22.
The object 'Paris' is top token in layer 23.
The object 'Paris' is top token in layer 22.
The object 'Paris' is top token in layer 22.
The object 'Paris' is top token in layer 21.
The object 'Paris' is top token in layer 22.
The object
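
One line per prompt is hard to read at a glance; counting how many prompts reach top 1 at each layer gives a more compact view. The sketch below assumes the layer numbers have been collected into a list instead of printed. The name `layer_distribution` is hypothetical, and the five sample values are only the first five lines of the output above, not the full result.

```python
from collections import Counter


def layer_distribution(layers):
    """Return (layer, count) pairs sorted by layer number."""
    return sorted(Counter(layers).items())


# First five values from the printed output above; in the notebook,
# append `top_token_layer` to a list inside the loop and pass that list.
sample_top1_layers = [24, 23, 23, 23, 23]

distribution = layer_distribution(sample_top1_layers)
for layer, count in distribution:
    print(f"layer {layer}: {count} prompt(s)")
```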