
# **Mapping Semantic Proximity in Homeric Verses: A Framework for Text-List Generation from 'κλέος ἄφθιτον'**

The Ancient Greek phrase κλέος ἄφθιτον, translated as "imperishable fame," resonates as a cultural artifact of significant weight. Its cognate in the Rig-Vedic tradition, śráva(s) ákṣitam, highlights its origins within the strata of the oldest Indo-European traditions (see Nagy, 1974).

Reflecting its own ideal - a renown that transcends time and geography - the phrase invites closer examination of its semantic and contextual proximities.

Here, computational methods attempt to map the semantic landscape surrounding κλέος ἄφθιτον within the Iliad, where it occurs only once. A Principal Components Analysis (PCA) performed on a matrix of pairwise Jaro similarity scores organizes a sample of Iliad verses (n=5000) into a three-dimensional space, enabling the identification of the verses nearest to and farthest from the verse containing κλέος ἄφθιτον.

Nouns are then extracted from these closest and farthest verses, lemmatized, and sorted into two lists according to their verses' proximity to the verse containing κλέος ἄφθιτον:

* Noun List 1: Nouns from verses flagged as 'closest' to 'κλέος ἄφθιτον.'
* Noun List 2: Nouns from verses flagged as 'farthest' from 'κλέος ἄφθιτον.'

To meet Colab's free tier constraints, each analysis session is limited to five thousand verses, requiring several iterations.
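As a rough illustration of why the sample size matters, the number of pairwise comparisons grows with the square of the number of verses. The sketch below assumes 8-byte values (as in a NumPy `float64` matrix) and counts every ordered pair, which is how `create_similarity_matrix` fills the matrix. The helper name `matrix_cost` and the alternative sample sizes are hypothetical and serve only as an example.

```python
def matrix_cost(n_verses, bytes_per_value=8):
    """Return (ordered_pairs, unique_pairs, megabytes) for an n x n similarity matrix."""
    ordered_pairs = n_verses * n_verses             # every (i, j) cell of the full matrix
    unique_pairs = n_verses * (n_verses + 1) // 2   # upper triangle including the diagonal
    megabytes = ordered_pairs * bytes_per_value / 1_000_000
    return ordered_pairs, unique_pairs, megabytes


for n in (1000, 5000, 15000):
    ordered, unique, mb = matrix_cost(n)
    print(f"n={n}: {ordered:,} comparisons, {unique:,} unique pairs, {mb:,.0f} MB")
```

Because Jaro similarity is symmetric, only the unique pairs would strictly need to be computed. The matrix itself is only part of the cost, since the list of string pairs built before the computation also has to fit in memory.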

**Later work will assess whether semantic differences between these noun lists align with their proximity to κλέος ἄφθιτον. If so, it would lend support to the methodological soundness of the processes developed here: we would see what nouns - what things in the culture - are more or less associated with κλέος ἄφθιτον.**




# Environment setup and model loading


In [1]:
%%capture
# Installing specific Python packages
!pip install anyascii
!pip install kneed
!pip install spacy-transformers
!pip install gensim fasttext

# Downloading and installing a specific model wheel from Hugging Face
!wget -O grc_odycy_joint_trf.whl https://huggingface.co/chcaa/grc_odycy_joint_trf/resolve/main/grc_odycy_joint_trf-any-py3-none-any.whl
!mv /content/grc_odycy_joint_trf.whl /content/grc_odycy_joint_trf-0.1-py3-none-any.whl
!pip install /content/grc_odycy_joint_trf-0.1-py3-none-any.whl

In [2]:
# General-purpose libraries
import string
import time
import random
import itertools
import unicodedata
from collections import Counter
from multiprocessing import Pool, cpu_count

# Numerical and data manipulation libraries
import pandas as pd
import numpy as np

# Scikit-learn (Machine Learning)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances, silhouette_score

# Knee point detection
from kneed import KneeLocator

# Similarity and distance metrics
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

# Word Embeddings (Gensim)
from gensim.models import Word2Vec, FastText

# Natural Language Processing (NLP)
import spacy
from spacy_transformers import Transformer

# Data visualization libraries
import plotly.graph_objs as go
import matplotlib.pyplot as plt

# Special-purpose libraries
import jellyfish
from anyascii import anyascii

# Distance computation
from scipy.spatial.distance import cdist




In [3]:
# Load the joint model
nlp = spacy.load("grc_odycy_joint_trf")

# Ensure the transformer is part of the pipeline
if "transformer" not in nlp.pipe_names:
    transformer = nlp.add_pipe("transformer")



# Input parameters

In [4]:
#Colab can't seem to handle more than 5000 at the free tier.
SAMPLE_SIZE = 5000

In [5]:
# Nouns must appear more than ROWS times to be kept
ROWS=7

#  Data processing, similarity computation, and visualization

This code cleans transliterated verses and computes pairwise string similarities in parallel. `compute_similarity` strips pipes and punctuation, lowercases each verse, and scores the pair with the Jaro similarity metric; `create_similarity_matrix` distributes these comparisons across CPU cores with `multiprocessing` and assembles the scores into a square matrix. The cell then loads each listed document, keeps the rows flagged `KEY == 'X'` (the verse containing κλέος ἄφθιτον), transliterates the text to ASCII with `anyascii`, and combines those rows with a random sample of `SAMPLE_SIZE` rows into a final DataFrame (`final_df`). This dataset includes both original and transliterated text, ready for further analysis.
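To make the metric concrete, here is a minimal pure-Python sketch of the textbook Jaro similarity, together with a cleaning step modeled on the notebook's helper. The notebook itself calls `jellyfish.jaro_similarity`; the functions `clean` and `jaro` below are illustrative reimplementations, not the library's code, and may differ from it in edge cases such as empty strings.

```python
import string


def clean(verse):
    """Strip pipes and ASCII punctuation, lowercase, and trim surrounding whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return verse.replace('|', '').translate(table).lower().strip()


def jaro(s1, s2):
    """Textbook Jaro similarity: 0.0 means no matching characters, 1.0 means identical."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they sit within this many positions of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


print(round(jaro("martha", "marhta"), 4))  # classic example from the literature: 0.9444
print(clean("Menin aeide, thea, | Peleiadeo Achileos"))
```

The score depends only on shared characters and their order, so two verses score as similar when their transliterated spellings overlap, regardless of what the words mean.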



In [6]:
%%capture


# Helper function to compute string similarity
def compute_similarity(pair):
    i, j, str_1, str_2 = pair
    str_1_clean = str_1.replace('|', '').replace('||', '').translate(str.maketrans('', '', string.punctuation)).lower().strip()
    str_2_clean = str_2.replace('|', '').replace('||', '').translate(str.maketrans('', '', string.punctuation)).lower().strip()
    similarity_score = jellyfish.jaro_similarity(str_1_clean, str_2_clean)
    return (i, j, similarity_score)

# Function to create similarity matrix using multiprocessing
def create_similarity_matrix(transliterated_series):
    # Create pairs of indices and strings for comparison
    combinations_list = [(i, j, transliterated_series[i], transliterated_series[j])
                         for i in range(len(transliterated_series))
                         for j in range(len(transliterated_series))]

    # Use multiprocessing to parallelize the computation
    with Pool(cpu_count()) as pool:
        results = pool.map(compute_similarity, combinations_list)

    # Initialize an empty similarity matrix
    similarity_matrix = np.zeros((len(transliterated_series), len(transliterated_series)))

    # Fill the matrix with the results
    for i, j, score in results:
        similarity_matrix[i, j] = score

    return pd.DataFrame(similarity_matrix, index=transliterated_series, columns=transliterated_series)

# Constants

DOCUMENTS = ['14']

# Create a list to store the processed data for each document
processed_documents_list = []

# Define a regex pattern to remove numbers and special characters
pattern = r'[0-9!@#$%^&*()_+{}\[\]:;"\'<>,.?/\|\\]'

# Process each document
for document_name in DOCUMENTS:
    df = pd.read_csv(f'/content/drive/MyDrive/DARBY/{document_name}.csv')
    df['Document_ID'] = document_name
    df['Original_Text'] = df['TRANSLIT']

    # Remove special characters from all string columns using the pattern
    df = df.apply(lambda col: col.str.replace(pattern, '') if col.dtype == 'object' else col)

    # Filter rows where 'KEY' column is 'X' (copy so the column assignments below act on an independent frame)
    key_x_rows = df[df['KEY'] == 'X'].copy()
    key_x_rows['Transliterated_Text'] = key_x_rows['TRANSLIT'].apply(
        lambda x: anyascii(x) if isinstance(x, str) else x
    )
    key_x_rows['Original_or_Transliterated_Text'] = key_x_rows.get('OGTXT', key_x_rows['TRANSLIT'])

    df['Transliterated_Text'] = df['TRANSLIT'].apply(lambda x: anyascii(x) if isinstance(x, str) else x)
    sampled_rows = df[:-1].sample(SAMPLE_SIZE, replace=False)
    sampled_rows['Original_or_Transliterated_Text'] = df.get('OGTXT', sampled_rows['TRANSLIT'])




    combined_rows = pd.concat([key_x_rows, sampled_rows]).drop_duplicates(subset=['Transliterated_Text'])

    #combined_rows = pd.concat([key_x_rows, sampled_rows]).drop_duplicates()
    processed_documents_list.append(combined_rows)

final_df = pd.concat(processed_documents_list, ignore_index=True)


In [7]:

# Copy the final DataFrame to add further columns
df = final_df.copy()
df['Document_Name'] = df['Document_ID']
df['Original_Index'] = df.index
df['Label'] = df['Document_Name']
df = df[['INDEX', 'Transliterated_Text', 'Original_Text', 'Document_Name', 'Original_Index', 'Original_or_Transliterated_Text', 'KEY', 'Label']]

# Reset index for the DataFrame and extract 'Transliterated_Text' column for similarity computation
transliterated_series = df['Transliterated_Text'].reset_index(drop=True)

# Create the similarity matrix using multiprocessing
similarity_df = create_similarity_matrix(transliterated_series)

# Normalize the similarity matrix data for PCA and clustering
scaler = StandardScaler()
similarity_matrix_scaled = scaler.fit_transform(similarity_df)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(similarity_matrix_scaled)

# **First 3D Plot (Before Optimization)**
pca = PCA(n_components=3)
similarity_matrix_3d = pca.fit_transform(similarity_matrix_scaled)

df_3d = pd.DataFrame(similarity_matrix_3d, columns=['PC1', 'PC2', 'PC3'])
df_3d['Label'] = df['Label']
df_3d['Original_Text'] = df['Original_Text']
df_3d['KEY'] = df['KEY']
df_3d['INDEX'] = df['INDEX']

color_map = {'16': 'blue', '14': 'green'}
df_3d['color'] = df_3d['Label'].map(color_map)

df_X = df_3d[df_3d['KEY'] == 'X']
df_not_X = df_3d[df_3d['KEY'] != 'X']

trace_X = go.Scatter3d(
    x=df_X['PC1'], y=df_X['PC2'], z=df_X['PC3'], mode='markers',
    marker=dict(size=9, symbol='x', color='red'),
    text=df_X['Original_Text'], name='KEY == X'
)

trace_not_X = go.Scatter3d(
    x=df_not_X['PC1'], y=df_not_X['PC2'], z=df_not_X['PC3'], mode='markers',
    marker=dict(size=2, symbol='circle', color=df_not_X['color']),
    text=df_not_X['Original_Text'], name='Other'
)

fig_before = go.Figure(data=[trace_X, trace_not_X])
fig_before.update_layout(
    title='3D Plot of Transliterated Text Data with Red "X" Markers for Iliad Verse Containing "κλέος ἄφθιτον" (PCA with 3 Components)',
    scene=dict(xaxis_title='PC1', yaxis_title='PC2', zaxis_title='PC3'),
    legend_title="Marker Legend"
)


# 3D Plot of Transliterated Text Data with Red "X" Markers for Iliad Verse Containing "κλέος ἄφθιτον" (PCA with 3 Components)

In [8]:
fig_before.show()

# Distance computation and visualization relative to the "κλέος ἄφθιτον" verse

This code calculates the Euclidean distances between red `X` markers and other markers. It selects the closest and farthest markers based on the distances. First, it extracts the 3D coordinates of both sets of markers, then computes distances. The closest and farthest *n* markers are identified and retrieved. Finally, the red `X` markers, closest, and farthest markers are combined into a DataFrame (`df_optimized`) for further analysis or visualization.
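The selection step can be sketched without any third-party libraries. The example below is a simplified stand-in for the notebook's `cdist` and `np.argsort` calls: it assumes a single target point, and the coordinates are invented for illustration. The function name `rank_by_distance` is hypothetical.

```python
import math


def rank_by_distance(target, points, n):
    """Return (closest, farthest): index lists of the n nearest and n farthest points."""
    distances = [math.dist(target, p) for p in points]
    # Sort point indices by their distance to the target, nearest first
    order = sorted(range(len(points)), key=distances.__getitem__)
    return order[:n], order[-n:]


target = (0.0, 0.0, 0.0)  # stands in for the PCA coordinates of the key verse
points = [
    (1.0, 0.0, 0.0),
    (0.0, 5.0, 0.0),
    (0.0, 0.0, 2.0),
    (3.0, 4.0, 12.0),
    (0.5, 0.5, 0.0),
]
closest, farthest = rank_by_distance(target, points, n=2)
print("closest:", closest)    # indices of the two nearest points
print("farthest:", farthest)  # indices of the two farthest points
```

If `n` exceeds half the number of points, the two index lists overlap, so the same verse would be labeled both closest and farthest.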



In [9]:
# Number of closest and farthest verses to keep: 20% of the sample (1000 when SAMPLE_SIZE is 5000)
TOP = int(SAMPLE_SIZE * 0.20)

# Step 1: Extract coordinates of red X markers and other markers
df_X_coords = df_X[['PC1', 'PC2', 'PC3']].values  # Coordinates of the red X marker(s)
df_not_X_coords = df_not_X[['PC1', 'PC2', 'PC3']].values  # Coordinates of other markers

# Step 2: Compute distances from red X markers to all other markers
distances = cdist(df_X_coords, df_not_X_coords, metric='euclidean').flatten()

# Step 3: Get the TOP closest and TOP farthest markers based on the distances
# (the variable names keep a "_25" suffix, but each set holds TOP entries)
closest_25_indices = np.argsort(distances)[:TOP]  # Indices of the TOP closest markers
farthest_25_indices = np.argsort(distances)[-TOP:]  # Indices of the TOP farthest markers

# Select the closest and farthest rows from df_not_X
closest_25_markers = df_not_X.iloc[closest_25_indices]
farthest_25_markers = df_not_X.iloc[farthest_25_indices]

# Step 4: Combine the closest, farthest, and red X markers for visualization
df_optimized = pd.concat([df_X, closest_25_markers, farthest_25_markers])

# Step 5: Create the new 3D scatter plot with closest, farthest, and red X markers
trace_X_new = go.Scatter3d(
    x=df_X['PC1'], y=df_X['PC2'], z=df_X['PC3'], mode='markers',
    marker=dict(size=9, symbol='x', color='red'),
    text=df_X['Original_Text'], name='KEY == X'
)

trace_closest = go.Scatter3d(
    x=closest_25_markers['PC1'], y=closest_25_markers['PC2'], z=closest_25_markers['PC3'],
    mode='markers', marker=dict(size=2, symbol='circle', color='blue'),
    text=closest_25_markers['Original_Text'], name=f'Closest {TOP}'
)

trace_farthest = go.Scatter3d(
    x=farthest_25_markers['PC1'], y=farthest_25_markers['PC2'], z=farthest_25_markers['PC3'],
    mode='markers', marker=dict(size=2, symbol='circle', color='green'),
    text=farthest_25_markers['Original_Text'], name=f'Farthest {TOP}'
)

# Combine all traces into a single plot
fig_optimized = go.Figure(data=[trace_X_new, trace_closest, trace_farthest])
fig_optimized.update_layout(
    title='3D Visualization of Verse Similarities Relative to κλέος ἄφθιτον (Closest and Farthest Verses)',
    scene=dict(xaxis_title='PC1', yaxis_title='PC2', zaxis_title='PC3'),
    legend_title="Marker Legend"
)

# The optimized plot is displayed in the next cell


# 3D Visualization of Verse Similarities Relative to 'κλέος ἄφθιτον' (Closest and Farthest Verses)

In [10]:
fig_optimized.show()

# Creation of a closest and farthest verses table and data normalization

This code builds a table linking each red `X` marker with its closest and farthest verses (`TOP` of each, i.e. 20% of `SAMPLE_SIZE`). For each red `X` marker, the closest and farthest markers are appended to a list. The data is then consolidated into a DataFrame (`final_table_df`), where strings are normalized to handle variations in Greek characters. Additional columns are mapped to match indexes and transliterated text from the original data. Duplicates are removed, and the DataFrame is indexed and sorted by the combined index of red markers and relation (closest/farthest). Finally, the unique strings are grouped and combined into a new DataFrame (`combined_df`).
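The normalization step matters because the same accented Greek letter can be encoded in more than one way. The following sketch uses only the standard library to show three encodings of κλέος that look identical on screen but compare as unequal until they are normalized to NFC. The helper `same_text` is a hypothetical name used for illustration.

```python
import unicodedata

# Three encodings of the same word, written with escapes so the difference is explicit
tonos = "\u03ba\u03bb\u03ad\u03bf\u03c2"            # epsilon with tonos (U+03AD)
oxia = "\u03ba\u03bb\u1f73\u03bf\u03c2"             # epsilon with oxia (U+1F73)
combining = "\u03ba\u03bb\u03b5\u0301\u03bf\u03c2"  # plain epsilon + combining acute accent


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


print(tonos == oxia, tonos == combining)                        # raw comparison
print(same_text(tonos, oxia), same_text(tonos, combining))      # after normalization
print(len(tonos), len(combining))                               # code point counts differ
```

Normalizing both sides of a comparison, as the notebook does for the target verse and the `Red X Marker` column, avoids missed matches caused by encoding differences alone.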



In [11]:
# Create a list to store the rows for the final table
table_data = []

# Loop over each red X marker and append its TOP closest and TOP farthest points
# (iterating over df_X only; the filter on the target verse below selects the same rows)
for red_x_marker in df_X['Original_Text']:

    # Get the closest points for this red X marker
    closest_points = df_not_X.iloc[closest_25_indices]  # Use the indices to get the actual rows

    # Add each closest point as a separate row (there should be up to TOP rows)
    for j in range(len(closest_points)):
        row_data = {
            'Red X Marker': red_x_marker,
            'Relation': 'Closest',
            'Associated String': closest_points.iloc[j]['Original_Text']  # Adjust based on the column needed
        }
        table_data.append(row_data)

    # Get the farthest points for this red X marker
    farthest_points = df_not_X.iloc[farthest_25_indices]  # Use the indices to get the actual rows

    # Add each farthest point as a separate row (there should be up to TOP rows)
    for k in range(len(farthest_points)):
        row_data = {
            'Red X Marker': red_x_marker,
            'Relation': 'Farthest',
            'Associated String': farthest_points.iloc[k]['Original_Text']  # Adjust based on the column needed
        }
        table_data.append(row_data)

# Convert the list of rows into a DataFrame
final_table_df = pd.DataFrame(table_data)

# Save the final table to a CSV file
final_table_df.to_csv('closest_farthest_table.csv', index=False)

# Display the final table
final_table_df

import unicodedata

# Normalize both the string in the dataframe and the target string
normalized_target = unicodedata.normalize('NFC', 'ὤλετο μέν μοι νόστος, ἀτὰρ κλέος ἄφθιτον ἔσται:')

final_table_df['Red X Marker'] = final_table_df['Red X Marker'].apply(lambda x: unicodedata.normalize('NFC', x))

# Filter after normalization (literal match; .copy() avoids chained-assignment warnings below)
final_table_df = final_table_df[
    final_table_df['Red X Marker'].str.contains(normalized_target, regex=False, na=False)
].copy()

# Map 'Matched_INDEX' and 'Transliterated_Text' as before
final_table_df['Matched_INDEX'] = final_table_df['Associated String'].map(
    lambda x: df[df['Original_Text'] == x]['INDEX'].values[0] if any(df['Original_Text'] == x) else None
)

final_table_df['Transliterated_Text'] = final_table_df['Associated String'].map(
    lambda x: df[df['Original_Text'] == x]['Transliterated_Text'].values[0] if any(df['Original_Text'] == x) else None
)

# Ensure that 'Matched_INDEX' is converted to strings for the .str accessor
final_table_df['Matched_INDEX'] = final_table_df['Matched_INDEX'].astype(str)

# Now remove rows where 'Matched_INDEX' contains any letter character
final_table_df = final_table_df[~final_table_df['Matched_INDEX'].str.contains(r'[a-zA-Z]', na=False)]

# Display the final table
pd.set_option('display.max_rows', None)  # Display all rows
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.width', None)  # Adjust the width to fit all columns
pd.set_option('display.max_colwidth', None)  # Show full content of each cell

# Show the final table
final_table_df



import pandas as pd

# Step 1: Create 'Combined_Index'
final_table_df['Combined_Index'] = final_table_df['Red X Marker'].astype(str) + '_' + final_table_df['Relation'].astype(str)

# Step 2: Set the new index
final_table_df.set_index('Combined_Index', inplace=True)

# Step 3: Sort the dataframe by the new index
final_table_df.sort_index(inplace=True)

# Step 4: Remove duplicates within each group and combine the strings in 'Associated String'
# Group by the index, remove duplicates in 'Associated String', and then combine the unique strings
combined_df = final_table_df.groupby(final_table_df.index)['Associated String'] \
    .apply(lambda x: ' '.join(x.drop_duplicates())) \
    .reset_index()

# Rename the column to indicate it's a combined string
combined_df.rename(columns={'Associated String': 'Combined_String'}, inplace=True)

# Now, combined_df contains 'Combined_Index' and the combined unique strings in 'Combined_String'

# Add quotes around each combined string in the 'Combined_String' column
#combined_df['Combined_String'] = combined_df['Combined_String'].apply(lambda x: f'"{x}"')

# Noun Lemma Frequency Analysis and Filtering

This code processes textual data to extract and analyze noun lemmas from the `Combined_String` column of a DataFrame while applying a unified exclusion list to filter out specific terms and named entities. Using `spaCy`, the text is tokenized, lemmatized, and filtered to retain nouns, excluding those that match normalized terms from the exclusion list or belong to identified entities like persons or locations. Lemmas are then counted, and only those exceeding a predefined frequency threshold are retained. The results are aggregated into a combined DataFrame, refined by removing irrelevant rows, transposed for further analysis, and flagged as "closest," "farthest," or "equal" based on frequency comparisons. As the final stage of the Colab notebook, this code generates cleaned lists of the closest and farthest noun lemmas, completing the pipeline and preparing the data for downstream analysis.
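The counting and flagging logic can be illustrated independently of spaCy. The sketch below assumes the lemmas have already been extracted into two plain lists; the toy data, the threshold value, and the function name `flag_lemmas` are invented for illustration. As in the notebook, a lemma is kept only if its count exceeds the threshold, and lemmas that pass the threshold in both groups are dropped.

```python
from collections import Counter


def flag_lemmas(closest_lemmas, farthest_lemmas, threshold):
    """Return (closest_only, farthest_only) lemmas whose counts exceed the threshold."""
    close_counts = {l: c for l, c in Counter(closest_lemmas).items() if c > threshold}
    far_counts = {l: c for l, c in Counter(farthest_lemmas).items() if c > threshold}
    # Lemmas frequent in both groups do not distinguish them, so they are excluded
    closest_only = sorted(set(close_counts) - set(far_counts))
    farthest_only = sorted(set(far_counts) - set(close_counts))
    return closest_only, farthest_only


# Toy input: lemma lists as they might look after lemmatization (not real results)
closest = ['δόρυ'] * 3 + ['μάχη'] * 3 + ['ἔπος']
farthest = ['μάχη'] * 4 + ['κλισία'] * 3

closest_only, farthest_only = flag_lemmas(closest, farthest, threshold=2)
print("Closest only:", closest_only)
print("Farthest only:", farthest_only)
```

Here `μάχη` passes the threshold in both groups and is therefore excluded, while `ἔπος` occurs too rarely to be kept at all.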

In [12]:
# List of lemmas to exclude (see list generation at https://medium.com/@PRIVATE.LIBRARY/core-vocabulary-in-homers-iliad-a-quantitative-analysis-of-noun-lemma-frequencies-2024-f071658aae0a)
exclusion_list = [
'Ἀνήρ',
'ναῦς',
'ζεύς',
'ἵππος',
'θυμός',
'θεός',
'ἕκτωρ',
'υἱός',
'χείρ',
'Τρώς',
'ἀχιλλεύς',
'τρώς',
'λαός',
'ἔγχος',
'Ἀγαμέμνων',
'ἄναξ'
]

exclusion_list = [unicodedata.normalize('NFC', word.lower()) for word in exclusion_list]

# Step 1: Initialize an empty list to hold each row's result as a DataFrame
result_dfs = []

# `combined_df` (built above) holds one combined text per group in 'Combined_String';
# ROWS (set in the input parameters) is the frequency threshold


# Step 2: Loop through each row of the 'Combined_String' column
for i, text in enumerate(combined_df['Combined_String']):
    print(f"Processing row {i+1}...")

    # Process the document using spaCy
    doc = nlp(text)

    # Step 2a: Identify named entities to exclude persons (PERSON) and places (GPE, LOC, ORG)
    names_and_places = {unicodedata.normalize('NFC', ent.text.lower()) for ent in doc.ents if ent.label_ in ['PERSON', 'GPE', 'LOC', 'ORG']}

    # Lemmatize, get POS tags, filter unwanted characters, and keep only nouns
    noun_lemmas = [
        f"{token.lemma_.strip()}_{token.pos_}" for token in doc
        if token.pos_ == 'NOUN' and
        unicodedata.normalize('NFC', token.lemma_.lower()) not in exclusion_list and
        unicodedata.normalize('NFC', token.text.lower()) not in names_and_places and
        not token.is_punct and
        not token.is_stop and
        not token.lemma_ == '\xa0'
    ]

    # Remove empty strings and filter invalid lemmas
    noun_lemmas = [lemma for lemma in noun_lemmas if lemma]

    # Count the frequency of each lemma_POS pair (which are nouns)
    noun_lemma_freq = Counter(noun_lemmas)

    # Filter frequencies where count is greater than ROWS
    filtered_freq = {word: count for word, count in noun_lemma_freq.items() if count > ROWS}

    # Convert the frequencies into a DataFrame where lemma_POS are columns
    if filtered_freq:
        row_df = pd.DataFrame([filtered_freq])
        row_df['Row'] = i + 1  # Add the row index for identification
        result_dfs.append(row_df)
    else:
        print(f"Row {i+1} has no nouns with a frequency greater than {ROWS}.")

    print("\n" + "="*50 + "\n")

# Step 3: Combine all the individual row DataFrames into one
if result_dfs:
    combined_result_df = pd.concat(result_dfs, axis=0).reset_index(drop=True)
    combined_result_df = combined_result_df.fillna(0)  # Fill NaNs with 0

    # Remove rows where ALL non-"Row" columns have values greater than 0
    combined_result_df = combined_result_df[
        ~((combined_result_df.drop(columns=['Row']) > 0).all(axis=1))
    ]

    # Step 4: Transpose the DataFrame so that each lemma becomes a row
    transposed_df = combined_result_df.T

    # Remove any rows that do not have relevant data (e.g., "Row")
    transposed_df = transposed_df.drop(index='Row', errors='ignore')  # Remove "Row" if present

    # Step 5: Drop rows where BOTH column 0 and column 1 have values greater than 0
    transposed_df = transposed_df[~((transposed_df[0] > 0) & (transposed_df[1] > 0))]

    # Step 6: Add "Flag" column based on which group (column 0 or column 1) has the higher count
    def assign_flag(row):
        if row[0] > row[1]:
            return 'closest'
        elif row[1] > row[0]:
            return 'farthest'
        else:
            return 'equal'

    # Apply the function to each row (axis=1) to create the Flag column
    transposed_df['Flag'] = transposed_df.apply(assign_flag, axis=1)

    # Normalize the index of the transposed DataFrame to ensure proper comparison
    transposed_df.index = [unicodedata.normalize('NFC', idx.lower()) for idx in transposed_df.index]

    # Step 7: Define a function that checks if any exclusion string is a partial match in the index
    def should_drop(index_value):
        for exclusion in exclusion_list:
            if exclusion in index_value:
                return True
        return False

    # Drop rows where the index contains a partial match to any string in the exclusion list
    filtered_transposed_df = transposed_df[~transposed_df.index.to_series().apply(should_drop)]

    # Split the index values into two lists based on the 'Flag' column
    closest_list = filtered_transposed_df[filtered_transposed_df['Flag'] == 'closest'].index.tolist()
    farthest_list = filtered_transposed_df[filtered_transposed_df['Flag'] == 'farthest'].index.tolist()

    # Remove '_noun' from the index values in both lists
    closest_list_clean = [item.replace('_noun', '') for item in closest_list]
    farthest_list_clean = [item.replace('_noun', '') for item in farthest_list]

    # Output the two lists
    print("Closest List:", closest_list_clean)
    print("Farthest List:", farthest_list_clean)


Processing row 1...
Processing row 2...

Closest List: ['δόρυ', 'ποταμός', 'ἄρης', 'μένος', 'ἄστυ', 'ἔπος', 'πάτροκλος', 'ἀπόλλων', 'μῦθος', 'χαλκός', 'ἥρα', 'ὦμος', 'ἴλιος', 'δαναοί']
Farthest List: ['ἀθήνη', 'αἴας', 'κλισία', 'βία', 'αἰνείας', 'μάχη', 'φρήν', 'κεφαλή']


In [13]:
closest_list_clean

['δόρυ',
 'ποταμός',
 'ἄρης',
 'μένος',
 'ἄστυ',
 'ἔπος',
 'πάτροκλος',
 'ἀπόλλων',
 'μῦθος',
 'χαλκός',
 'ἥρα',
 'ὦμος',
 'ἴλιος',
 'δαναοί']

In [15]:
farthest_list_clean

['ἀθήνη', 'αἴας', 'κλισία', 'βία', 'αἰνείας', 'μάχη', 'φρήν', 'κεφαλή']

# Conclusion
This initial analysis is based on a single random sample of verses. Future work will integrate data from numerous iterations across multiple random samples of the Iliad, enhancing robustness, reducing sampling bias, and providing a stronger basis for investigating lexical patterns within Homeric poetry.

This study employed Principal Components Analysis and the Jaro similarity metric to map Iliad verses into a three-dimensional space, identifying those nearest to and farthest from the verse containing 'κλέος ἄφθιτον.' By extracting and lemmatizing nouns from these verses, we created a methodological framework for generating lexical data to explore semantic structures.