# Investigating Roget's Thesaurus Classification with Machine Learning


In this Jupyter notebook, we embark on a journey to explore the timeless structure of Roget's Thesaurus using modern Machine Learning techniques. Dating back to the 19th century, Roget's Thesaurus has provided a hierarchical organization of words, aiding writers and scholars in finding synonyms and understanding semantic relationships. Our assignment is to leverage Machine Learning to analyze and compare Roget's classification with contemporary word embeddings, seeking insights into how well they align.

## Malandrakis Georgios
## Assignment 3

We start by reading a text file containing Roget's Thesaurus. After removing text within square brackets, we define regular expressions that identify classes, sections, divisions, and word patterns. We then find the classes and their respective sections or divisions, extract the start number of each section, and print the identified classes, sections, and start numbers.




In [2]:
import re

# Open and read the content of the text file
with open('roget.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Remove text enclosed within square brackets and the brackets themselves
cleaned_text = re.sub(r'\[.*?\]', '', text)
text = cleaned_text

# Define regular expressions to identify different components of the text
class_pattern = re.compile(r'^CLASS\s+(\w+)', re.MULTILINE)
section_pattern = re.compile(r'^SECTION\s+(\w+)\.', re.MULTILINE)
division_pattern = re.compile(r'^DIVISION\s+(\w+)', re.MULTILINE)
word_pattern = re.compile(r'\d+\.\s+([\w\s]+)-')
number_pattern = re.compile(r'\n.*?(\d+)\..*?--')

# Find all classes in the text
classes = class_pattern.findall(text)

# Function to extract sections or divisions from text based on provided pattern
def extract_sections(text, pattern):
    return pattern.findall(text)

# Find all sections or divisions in each class
class_sections = {}
for class_name in classes:
    class_text = text.split(f"CLASS {class_name}", 1)[-1]
    if class_name != classes[-1]:
        next_class = classes[classes.index(class_name) + 1]
        class_text = class_text.split(f"CLASS {next_class}", 1)[0]
    sections = extract_sections(class_text, section_pattern)
    divisions = extract_sections(class_text, division_pattern)
    if not divisions:
        class_sections[class_name] = sections
    else:
        class_sections[class_name] = divisions

# Find the start number for each section in each class
class_start_numbers = {}
for class_name, sections in class_sections.items():
    class_start_numbers[class_name] = {}
    class_text = text.split(f"CLASS {class_name}", 1)[-1]
    if class_name != classes[-1]:
        next_class = classes[classes.index(class_name) + 1]
        class_text = class_text.split(f"CLASS {next_class}", 1)[0]
    for section in sections:
        section_text = class_text.split(f"SECTION {section}.", 1)[-1]
        if len(sections) > 1:
            if section != sections[-1]:
                section_text = section_text.split(f"SECTION {sections[sections.index(section)+1]}.", 1)[0]
            else:
                section_text = section_text.split("CLASS")[0]
        # Find and store the start number using pattern
        match = number_pattern.search(section_text)
        start_number = match.group(1) if match else None
        # Store the start number for the section
        class_start_numbers[class_name][section] = start_number

# Print the results
print("Classes found:")
print(classes)
print("\nSections found in each class:")
for class_name, sections in class_sections.items():
    print(f"{class_name}: {sections}")
print("\nStart numbers found in each section of each class:")
for class_name, sections in class_start_numbers.items():
    print(f"{class_name}:")
    for section, start_number in sections.items():
        print(f"  Section {section}: Start Number: {start_number}")


Classes found:
['I', 'II', 'III', 'IV', 'V', 'VI']

Sections found in each class:
I: ['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII']
II: ['I', 'II', 'III', 'IV']
III: ['I', 'II', 'III']
IV: ['I', 'II']
V: ['I', 'II']
VI: ['I', 'II', 'III', 'IV', 'V']

Start numbers found in each section of each class:
I:
  Section I: Start Number: 1
  Section II: Start Number: 9
  Section III: Start Number: 25
  Section IV: Start Number: 58
  Section V: Start Number: 84
  Section VI: Start Number: 106
  Section VII: Start Number: 140
  Section VIII: Start Number: 153
II:
  Section I: Start Number: 180
  Section II: Start Number: 192
  Section III: Start Number: 240
  Section IV: Start Number: 264
III:
  Section I: Start Number: 316
  Section II: Start Number: 321
  Section III: Start Number: 357
IV:
  Section I: Start Number: 450
  Section II: Start Number: 455
V:
  Section I: Start Number: 600
  Section II: Start Number: 620
VI:
  Section I: Start Number: 820
  Section II: Start Number: 827
  Section III: Start Number: 888
  Section IV: Start Number: 922
  Section V: Start Number: 976
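
The slicing logic above can also be expressed with match positions instead of repeated `str.split` calls. The following is a minimal sketch, not the notebook's method: the `sample` text is a hypothetical miniature of the thesaurus layout, and `blocks`, `class_re`, `section_re`, and `entry_re` are illustrative names.

```python
import re

# Hypothetical miniature of the thesaurus layout; the real roget.txt is much longer.
sample = """CLASS I
SECTION I. EXISTENCE
1. Existence -- N. existence, being.
SECTION II. RELATION
9. Relation -- N. relation, bearing.
CLASS II
SECTION I. SPACE IN GENERAL
180. Space -- N. space, extension.
"""

class_re = re.compile(r'^CLASS\s+(\w+)', re.MULTILINE)
section_re = re.compile(r'^SECTION\s+(\w+)\.', re.MULTILINE)
entry_re = re.compile(r'^(\d+)\.', re.MULTILINE)


def blocks(text, pattern):
    """Yield (label, body) pairs; each body runs up to the next match of pattern."""
    matches = list(pattern.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        yield match.group(1), text[match.end():end]


start_numbers = {}
for class_name, class_body in blocks(sample, class_re):
    for section_name, section_body in blocks(class_body, section_re):
        first_entry = entry_re.search(section_body)
        start_numbers[(class_name, section_name)] = (
            int(first_entry.group(1)) if first_entry else None
        )

print(start_numbers)
```

Keying the result by `(class, section)` tuples keeps equal section numerals in different classes apart, and using match offsets avoids depending on the first occurrence of a string such as `"SECTION I."`.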

We construct a DataFrame df19 from the class_start_numbers dictionary, which contains information about classes, sections, and their respective start numbers.

In [3]:
import pandas as pd

# Create a list to store the data
data = []

# Iterate over the class_start_numbers dictionary and append the data
for class_name, sections in class_start_numbers.items():
    for section, start_number in sections.items():
        data.append([class_name, section, start_number])

# Create a DataFrame
df19 = pd.DataFrame(data, columns=['Class', 'Section', 'Start Number'])

# Set display options to show all rows
pd.set_option('display.max_rows', None)

# Display the DataFrame
print(df19)

pd.reset_option('display.max_rows')


   Class Section Start Number
0      I       I            1
1      I      II            9
2      I     III           25
3      I      IV           58
4      I       V           84
5      I      VI          106
6      I     VII          140
7      I    VIII          153
8     II       I          180
9     II      II          192
10    II     III          240
11    II      IV          264
12   III       I          316
13   III      II          321
14   III     III          357
15    IV       I          450
16    IV      II          455
17     V       I          600
18     V      II          620
19    VI       I          820
20    VI      II          827
21    VI     III          888
22    VI      IV          922
23    VI       V          976


Next, we read the text file 'allwords.txt', which lists words together with their associated numbers. A regular expression matches each word and its number; after stripping whitespace and dropping empty strings, we store the result in a pandas DataFrame named df20 for further analysis.




In [4]:
# Read the text file with 'latin-1' encoding
with open('allwords.txt', 'r', encoding='latin-1') as file:
    text = file.read()

# Locate the two markers that delimit the word list: 'abode 186' and the title line
start_index = text.find('abode 186')
end_index = text.find("Roget's Thesaurus of English Words and Phrases")

# Extract the text between these two markers
extracted_text = text[start_index:end_index]

# Define the pattern to extract words and their associated numbers
pattern = re.compile(r'\n([^\t\n]+)\n\t+([^\n]+)\s+(\d+)')

# Find all matches in the extracted text
matches = pattern.findall(extracted_text)

# Filter out empty strings and strip whitespace from words
matches = [(word.strip(), number) for word, _, number in matches if word.strip()]

# Create a DataFrame to store the data
df20 = pd.DataFrame(matches, columns=['Word', 'Number'])

# Display the DataFrame
df20

Unnamed: 0,Word,Number
0,A 1,648
1,a being,3
2,a blue moon,107
3,a bright thought,498
4,a can of worms,248
...,...,...
55536,zouave,726
55537,zounds!,900
55538,Zulu,876
55539,zygote,357


On closer examination, this approach captures only one number per word, which is insufficient because a word can be listed under several numbers. We therefore revise the extraction so that every number following a word is recorded as a separate (word, number) row.




In [5]:
# Read the text file with 'latin-1' encoding
with open('allwords.txt', 'r', encoding='latin-1') as file:
    text = file.read()

# Define the pattern to extract words
word_pattern = re.compile(r'\n([^\t\n]+)\n')

# Find all matches for words in the text
words = word_pattern.findall(text)

# Create a dictionary to store word positions for faster access
word_positions = {word: match.start() for word, match in zip(words, word_pattern.finditer(text))}

# Process matches
data = []
for i in range(len(words)):
    word = words[i].strip()
    # Get the start index after the word
    start_index = word_positions.get(word, len(text)) + len(word) + 1
    # Get the end index before the next word
    end_index = word_positions.get(words[i+1], len(text)) if i < len(words) - 1 else len(text)
    text_between_words = text[start_index:end_index]

    # Find all numbers in the text between words
    numbers = re.findall(r'\b\d+\b', text_between_words)

    # Add data for each number found
    for number in numbers:
        data.append((word, int(number)))

# Create a DataFrame to store the data
df = pd.DataFrame(data, columns=['Word', 'Number'])

# Display the DataFrame
df = df.iloc[1:]
display(df)

Unnamed: 0,Word,Number
1,A 1,648
2,a being,3
3,a blue moon,107
4,a bright thought,498
5,a can of worms,248
...,...,...
91086,zounds!,870
91087,Zulu,876
91088,zygote,357
91089,zymotic,657


The code below was executed once to retrieve word embeddings using Mistral AI, and the resulting embeddings were stored in a CSV file. Due to the considerable time and cost involved, it's impractical to repeatedly execute this code. Therefore, we must utilize the pre-existing embeddings for further analysis and avoid re-running the code unnecessarily.




In [6]:
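#import os
#from mistralai.client import MistralClient


# The words to embed come from the DataFrame df20 built above

# Initialize MistralClient; the key is read from an environment variable
# (MISTRAL_API_KEY is an assumed variable name) instead of being hard-coded
#client = MistralClient(api_key=os.environ["MISTRAL_API_KEY"])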

# Initialize an empty list to store embeddings
#embeddings = []

# Batch size for batching requests
#batch_size = 1000  # You can adjust this value based on your needs

# Split words into batches
#word_batches = [df20['Word'][i:i+batch_size] for i in range(0, len(df20['Word']), batch_size)]

# Loop through each word batch
#for batch in word_batches:
    # Call embeddings API for the batch of words
#    embeddings_response = client.embeddings(
#        model="mistral-embed",
#        input=batch.tolist(),
#    )
    # Extract the embedding vectors
#    embedding_vectors = [item.embedding for item in embeddings_response.data]
    # Extend the embeddings list with the embedding vectors
#    embeddings.extend(embedding_vectors)
    
# Add the embeddings to the dataframe
#df20['Embeddings'] = embeddings

# Display the dataframe with embeddings
#display(df20)

#df20.to_csv('df20.csv', index=False)
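
The batching pattern in the commented code can be isolated from the API client so that it can be tested without network calls or cost. The following is a sketch only: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for a real embedding call, not the Mistral API.

```python
def embed_in_batches(words, embed_fn, batch_size=1000):
    """Call embed_fn on consecutive slices of words and concatenate the results."""
    vectors = []
    for start in range(0, len(words), batch_size):
        batch = list(words[start:start + batch_size])
        batch_vectors = embed_fn(batch)
        # Guard against a response that silently drops or adds items.
        if len(batch_vectors) != len(batch):
            raise ValueError(
                f"expected {len(batch)} vectors, got {len(batch_vectors)}"
            )
        vectors.extend(batch_vectors)
    return vectors


def fake_embed(batch):
    """Stand-in embedding: a 2-dimensional vector built from the word length."""
    return [[float(len(word)), 0.0] for word in batch]


vectors = embed_in_batches(['a', 'bb', 'ccc'], fake_embed, batch_size=2)
print(vectors)
```

Checking that each response has as many vectors as the batch had words keeps the final list aligned with the DataFrame rows before it is assigned as a column.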



Now, let's import the CSV file that was generated from the previous code snippet and store its contents into a new DataFrame, newdf.

In [7]:
import numpy as np
from sklearn.cluster import KMeans

file_path = "df20.csv"

# Create a DataFrame from the CSV data
newdf = pd.read_csv(file_path)

newdf

Unnamed: 0,Word,Number,Embeddings
0,A 1,648,"[-0.0179901123046875, 0.0180511474609375, 0.05..."
1,a being,3,"[-0.0487060546875, 0.018829345703125, 0.048492..."
2,a blue moon,107,"[-0.030792236328125, 0.01812744140625, 0.03109..."
3,a bright thought,498,"[-0.0408935546875, 0.0291900634765625, 0.03305..."
4,a can of worms,248,"[-0.029022216796875, -0.012054443359375, 0.062..."
...,...,...,...
55537,zounds!,900,"[-0.013427734375, 0.035125732421875, 0.0615844..."
55538,Zulu,876,"[-0.044830322265625, 0.0084381103515625, 0.036..."
55539,zygote,357,"[-0.03826904296875, -0.0031299591064453125, 0...."
55540,zymotic,657,"[-0.0560302734375, 0.0162200927734375, 0.06240..."


In [8]:
print(newdf.dtypes)

Word          object
Number         int64
Embeddings    object
dtype: object


In [9]:
# Apply lambda function to convert string representations of arrays into NumPy arrays
newdf['Embeddings'] = newdf['Embeddings'].apply(lambda x: np.fromstring(x[1:-1], sep=', '))

# Convert the lists to numpy arrays
embeddings_array = np.stack(newdf['Embeddings'])
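
As an alternative to slicing off the brackets by hand, each cell can be parsed as a JSON array. This is a minimal sketch that assumes the cells hold plain bracketed lists of finite numbers, as in the table above; `parse_embedding` is a hypothetical helper, and the two short `cells` are made-up truncations used only for illustration.

```python
import json


def parse_embedding(cell):
    """Parse one CSV cell such as '[-0.01, 0.02]' into a list of floats."""
    values = json.loads(cell)
    if not isinstance(values, list):
        raise ValueError(f"expected a list, got {type(values).__name__}")
    return [float(value) for value in values]


# Shortened, illustrative cells; real embeddings have many more components.
cells = [
    "[-0.0179901123046875, 0.0180511474609375]",
    "[-0.0487060546875, 0.018829345703125]",
]
rows = [parse_embedding(cell) for cell in cells]

# All vectors must have the same length before they can be stacked.
dimensions = {len(row) for row in rows}
print(rows, dimensions)
```

The resulting list of equal-length rows can be passed to `np.array(rows)` to obtain the same kind of 2-D float array as `embeddings_array`.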


In [14]:
# Now embeddings_array contains the embeddings as a float array
print(embeddings_array)

[[-0.01799011  0.01805115  0.05004883 ...  0.00063086  0.00768661
   0.00561905]
 [-0.04870605  0.01882935  0.04849243 ...  0.00041652 -0.02864075
  -0.01551056]
 [-0.03079224  0.01812744  0.03109741 ...  0.00817871 -0.01091766
  -0.00113487]
 ...
 [-0.03826904 -0.00312996  0.0368042  ... -0.00620651  0.03479004
  -0.01180267]
 [-0.05603027  0.01622009  0.06240845 ... -0.04476929  0.01213837
  -0.02575684]
 [-0.05603027  0.01622009  0.06240845 ... -0.04476929  0.01213837
  -0.02575684]]


## Clustering

Now that our data is formatted appropriately, we can proceed with clustering using the K-means algorithm.

In [15]:
from sklearn.cluster import KMeans

# Apply k-means clustering
k = 5  # number of clusters
# n_init=10 matches the default used in this run and avoids the n_init FutureWarning
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings_array)

# Add cluster labels to the DataFrame
newdf['Cluster'] = cluster_labels

# newdf now contains the words, their embeddings, and the assigned cluster labels
display(newdf)


Unnamed: 0,Word,Number,Embeddings,Cluster
0,A 1,648,"[-0.0179901123046875, 0.0180511474609375, 0.05...",2
1,a being,3,"[-0.0487060546875, 0.018829345703125, 0.048492...",4
2,a blue moon,107,"[-0.030792236328125, 0.01812744140625, 0.03109...",4
3,a bright thought,498,"[-0.0408935546875, 0.0291900634765625, 0.03305...",4
4,a can of worms,248,"[-0.029022216796875, -0.012054443359375, 0.062...",4
...,...,...,...,...
55537,zounds!,900,"[-0.013427734375, 0.035125732421875, 0.0615844...",4
55538,Zulu,876,"[-0.044830322265625, 0.0084381103515625, 0.036...",4
55539,zygote,357,"[-0.03826904296875, -0.0031299591064453125, 0....",1
55540,zymotic,657,"[-0.0560302734375, 0.0162200927734375, 0.06240...",1


In [16]:
# Calculate the number of data points in each cluster
cluster_counts = newdf['Cluster'].value_counts()

# Display the counts of data points in each cluster
display(cluster_counts)


Cluster
2    13846
1    12640
4    11299
0     9488
3     8269
Name: count, dtype: int64

In [17]:
df19

Unnamed: 0,Class,Section,Start Number
0,I,I,1
1,I,II,9
2,I,III,25
3,I,IV,58
4,I,V,84
5,I,VI,106
6,I,VII,140
7,I,VIII,153
8,II,I,180
9,II,II,192


In [18]:
df

Unnamed: 0,Word,Number
1,A 1,648
2,a being,3
3,a blue moon,107
4,a bright thought,498
5,a can of worms,248
...,...,...
91086,zounds!,870
91087,Zulu,876
91088,zygote,357
91089,zymotic,657


Next, we assign each (word, number) pair its Roget class and section by combining the two datasets: a number belongs to the last section in df19 whose start number does not exceed it, as the following code snippet shows.




In [19]:
import pandas as pd

# Convert 'Number' column of df to integer type
df['Number'] = df['Number'].astype(int)
# Convert 'Start Number' column of df19 to integer type
df19['Start Number'] = df19['Start Number'].astype(int)

# Iterate over rows of df
for index, row in df.iterrows():
    number = row['Number']
    # Find the corresponding row in df19
    corresponding_row = df19[df19['Start Number'] <= number].iloc[-1]
    # Extract Class and Section
    class_val = corresponding_row['Class']
    section_val = corresponding_row['Section']
    # Assign values to new columns
    df.at[index, 'Class'] = class_val
    df.at[index, 'Section'] = section_val

# Display the modified DataFrame
display(df)




Unnamed: 0,Word,Number,Class,Section
1,A 1,648,V,II
2,a being,3,I,I
3,a blue moon,107,I,VI
4,a bright thought,498,IV,II
5,a can of worms,248,II,III
...,...,...,...,...
91086,zounds!,870,VI,II
91087,Zulu,876,VI,II
91088,zygote,357,III,III
91089,zymotic,657,V,II
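
The row-by-row lookup above amounts to a "last start number less than or equal to the entry number" search, which a binary search on the sorted start numbers answers directly. The sketch below uses only the standard library; `SECTION_STARTS` copies the values from the df19 table, and `locate` is a hypothetical helper name.

```python
from bisect import bisect_right

# (start number, class, section) rows copied from the df19 table above.
SECTION_STARTS = [
    (1, 'I', 'I'), (9, 'I', 'II'), (25, 'I', 'III'), (58, 'I', 'IV'),
    (84, 'I', 'V'), (106, 'I', 'VI'), (140, 'I', 'VII'), (153, 'I', 'VIII'),
    (180, 'II', 'I'), (192, 'II', 'II'), (240, 'II', 'III'), (264, 'II', 'IV'),
    (316, 'III', 'I'), (321, 'III', 'II'), (357, 'III', 'III'),
    (450, 'IV', 'I'), (455, 'IV', 'II'),
    (600, 'V', 'I'), (620, 'V', 'II'),
    (820, 'VI', 'I'), (827, 'VI', 'II'), (888, 'VI', 'III'),
    (922, 'VI', 'IV'), (976, 'VI', 'V'),
]
STARTS = [row[0] for row in SECTION_STARTS]


def locate(number):
    """Return (class, section) of the last section whose start number is <= number."""
    position = bisect_right(STARTS, number) - 1
    if position < 0:
        raise ValueError(f"{number} precedes the first section")
    _, class_name, section_name = SECTION_STARTS[position]
    return class_name, section_name


for number in (3, 107, 248, 648):
    print(number, locate(number))
```

On DataFrames, `pd.merge_asof` with `left_on='Number'` and `right_on='Start Number'` performs the same backward search in one call, provided both frames are sorted on their keys and the key columns share a dtype.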


In [20]:
# Create a dictionary mapping words to clusters
word_to_cluster = dict(zip(newdf['Word'], newdf['Cluster']))

# Add a new column 'Cluster' to df based on the 'Word' column
df['Cluster'] = df['Word'].map(word_to_cluster)

display(df)

Unnamed: 0,Word,Number,Class,Section,Cluster
1,A 1,648,V,II,2.0
2,a being,3,I,I,4.0
3,a blue moon,107,I,VI,4.0
4,a bright thought,498,IV,II,4.0
5,a can of worms,248,II,III,4.0
...,...,...,...,...,...
91086,zounds!,870,VI,II,4.0
91087,Zulu,876,VI,II,4.0
91088,zygote,357,III,III,1.0
91089,zymotic,657,V,II,1.0


The Rand Index is 0.674, but this raw score is inflated by chance agreement: every pair of words that falls in different classes and in different clusters counts as an agreement, and with five or six groups such pairs are common. The chance-corrected scores computed below are all close to zero (Adjusted Rand Index 0.006, homogeneity 0.015, Adjusted Mutual Information 0.016), so the five k-means clusters align only marginally better than chance with Roget's six classes.








In [21]:
from sklearn import metrics
df.dropna(inplace=True)

labels_true = df['Class']
labels_pred = df['Cluster']
metrics.rand_score(labels_true, labels_pred)

0.6740497284807659

In [22]:
metrics.adjusted_rand_score(labels_true, labels_pred)

0.0056421158207219215

In [23]:
metrics.homogeneity_score(labels_true, labels_pred)

0.014902502286694907

In [24]:
metrics.adjusted_mutual_info_score(labels_true, labels_pred)  

0.015803953463643806
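
The gap between the Rand Index and its adjusted version can be reproduced on a toy example. The sketch below implements both scores from pair counts using only the standard library; the function names and the six-item labelings are illustrative and are not taken from the notebook's data.

```python
from collections import Counter
from math import comb


def pair_counts(labels_true, labels_pred):
    """Count all pairs, and the pairs grouped together in both, true, and predicted."""
    total = comb(len(labels_true), 2)
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return total, same_both, same_true, same_pred


def rand_index(labels_true, labels_pred):
    total, same_both, same_true, same_pred = pair_counts(labels_true, labels_pred)
    # Pairs separated in both partitions also count as agreements.
    apart_both = total - same_true - same_pred + same_both
    return (same_both + apart_both) / total


def adjusted_rand_index(labels_true, labels_pred):
    total, same_both, same_true, same_pred = pair_counts(labels_true, labels_pred)
    expected = same_true * same_pred / total  # agreement expected by chance
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0
    return (same_both - expected) / (maximum - expected)


# Toy labelings: no pair of items is grouped together in both partitions.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 2, 0, 1, 2]
toy_ri = rand_index(toy_true, toy_pred)
toy_ari = adjusted_rand_index(toy_true, toy_pred)
print(toy_ri, toy_ari)
```

In this toy case the two partitions never agree on which items belong together, yet 9 of the 15 pairs are separated in both, which is enough for a Rand Index of 0.6, while the adjusted score is negative. The same mechanism explains how 0.674 and 0.006 can describe the same clustering.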

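In the observed clustering results, the travel-related words "airplane," "pilot," "turbine," "passenger," "air," "baggage," "travel," "attendant," and "luggage" all fall in cluster 2. Because each word has a single embedding, every Roget entry of a word receives the same cluster, whereas Roget can place the different senses of a word in different classes. "airplane," "turbine," "passenger," and "travel" appear only in Class II; "pilot" and "attendant" have a Class II entry alongside entries in other classes; "air," "baggage," and "luggage" have no Class II entry at all (for example, "baggage" is listed under Classes V and VI).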




In [26]:
df[df['Word'] == 'airplane']

Unnamed: 0,Word,Number,Class,Section,Cluster
2087,airplane,266,II,IV,2.0


In [27]:
df[df['Word'] == 'pilot']

Unnamed: 0,Word,Number,Class,Section,Cluster
59719,pilot,694,V,II,2.0
59720,pilot,527,IV,II,2.0
59721,pilot,693,V,II,2.0
59722,pilot,269,II,IV,2.0


In [28]:
df[df['Word'] == 'turbine']

Unnamed: 0,Word,Number,Class,Section,Cluster
83817,turbine,284,II,IV,2.0


In [29]:
df[df['Word'] == 'passenger']

Unnamed: 0,Word,Number,Class,Section,Cluster
58061,passenger,268,II,IV,2.0


In [30]:
df[df['Word'] == 'air']

Unnamed: 0,Word,Number,Class,Section,Cluster
2045,air,689,V,II,2.0
2046,air,4,I,I,2.0
2047,air,415,III,III,2.0
2048,air,359,III,III,2.0
2049,air,320,III,I,2.0
2050,air,852,VI,II,2.0
2051,air,66,I,IV,2.0
2052,air,448,III,III,2.0
2053,air,338,III,II,2.0
2054,air,349,III,II,2.0


In [31]:
df[df['Word'] == 'baggage']

Unnamed: 0,Word,Number,Class,Section,Cluster
5914,baggage,962,VI,IV,2.0
5915,baggage,635,V,II,2.0
5916,baggage,780,V,II,2.0


In [32]:
df[df['Word'] == 'travel']

Unnamed: 0,Word,Number,Class,Section,Cluster
83012,travel,266,II,IV,2.0


In [33]:
df[df['Word'] == 'attendant']

Unnamed: 0,Word,Number,Class,Section,Cluster
5208,attendant,746,V,II,2.0
5209,attendant,88,I,V,2.0
5210,attendant,281,II,IV,2.0


In [34]:
df[df['Word'] == 'luggage']

Unnamed: 0,Word,Number,Class,Section,Cluster
48443,luggage,780,V,II,2.0


In [35]:
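# Concatenate 'Class' and 'Section' to create new section names.
# Note: without a separator the labels are ambiguous, e.g. 'V' + 'II' and 'VI' + 'I' both give 'VII'
df['Section'] = df['Class'] + df['Section']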

display(df)

Unnamed: 0,Word,Number,Class,Section,Cluster
1,A 1,648,V,VII,2.0
2,a being,3,I,II,4.0
3,a blue moon,107,I,IVI,4.0
4,a bright thought,498,IV,IVII,4.0
5,a can of worms,248,II,IIIII,4.0
...,...,...,...,...,...
91086,zounds!,870,VI,VIII,4.0
91087,Zulu,876,VI,VIII,4.0
91088,zygote,357,III,IIIIII,1.0
91089,zymotic,657,V,VII,1.0
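
Joining two Roman numerals without a separator can map different (class, section) pairs to the same string. The sketch below counts how many distinct labels survive, assuming the section counts per class shown in the df19 table; `SECTIONS_PER_CLASS`, `ROMAN`, and the `'-'` separator are illustrative choices.

```python
# Number of sections in each class, taken from the df19 table above.
SECTIONS_PER_CLASS = {'I': 8, 'II': 4, 'III': 3, 'IV': 2, 'V': 2, 'VI': 5}
ROMAN = ['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII']

pairs = [
    (class_name, ROMAN[i])
    for class_name, count in SECTIONS_PER_CLASS.items()
    for i in range(count)
]

# Group the (class, section) pairs by their separator-free label.
plain = {}
for class_name, section_name in pairs:
    plain.setdefault(class_name + section_name, []).append((class_name, section_name))

collisions = {label: members for label, members in plain.items() if len(members) > 1}
separated = {f"{class_name}-{section_name}" for class_name, section_name in pairs}

print(len(pairs), len(plain), len(separated))
for label, members in collisions.items():
    print(label, members)
```

Under these assumptions the 24 pairs collapse to 17 distinct labels without a separator, while a label such as `'V-II'` keeps all 24 apart. Any metric computed on the separator-free labels therefore compares against fewer target groups than Roget defines.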


In [36]:
from sklearn.cluster import KMeans

# Apply k-means clustering
k = 23  # number of clusters
# n_init=10 matches the default used in this run and avoids the n_init FutureWarning
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings_array)

# Add cluster labels to the DataFrame
newdf['ClusterSection'] = cluster_labels

# newdf now contains the words, their embeddings, and both sets of cluster labels
display(newdf)


Unnamed: 0,Word,Number,Embeddings,Cluster,ClusterSection
0,A 1,648,"[-0.0179901123046875, 0.0180511474609375, 0.05...",2,2
1,a being,3,"[-0.0487060546875, 0.018829345703125, 0.048492...",4,22
2,a blue moon,107,"[-0.030792236328125, 0.01812744140625, 0.03109...",4,0
3,a bright thought,498,"[-0.0408935546875, 0.0291900634765625, 0.03305...",4,17
4,a can of worms,248,"[-0.029022216796875, -0.012054443359375, 0.062...",4,3
...,...,...,...,...,...
55537,zounds!,900,"[-0.013427734375, 0.035125732421875, 0.0615844...",4,3
55538,Zulu,876,"[-0.044830322265625, 0.0084381103515625, 0.036...",4,2
55539,zygote,357,"[-0.03826904296875, -0.0031299591064453125, 0....",1,4
55540,zymotic,657,"[-0.0560302734375, 0.0162200927734375, 0.06240...",1,20


In [37]:
# Create a dictionary mapping words to clusters
word_to_cluster = dict(zip(newdf['Word'], newdf['ClusterSection']))

# Add a new column 'ClusterSection' to df based on the 'Word' column
df['ClusterSection'] = df['Word'].map(word_to_cluster)

display(df)

Unnamed: 0,Word,Number,Class,Section,Cluster,ClusterSection
1,A 1,648,V,VII,2.0,2
2,a being,3,I,II,4.0,22
3,a blue moon,107,I,IVI,4.0,0
4,a bright thought,498,IV,IVII,4.0,17
5,a can of worms,248,II,IIIII,4.0,3
...,...,...,...,...,...,...
91086,zounds!,870,VI,VIII,4.0,3
91087,Zulu,876,VI,VIII,4.0,2
91088,zygote,357,III,IIIIII,1.0,4
91089,zymotic,657,V,VII,1.0,20


For some related words the two groupings agree: "airplane," "passenger," "travel," and "turbine" share both the same Roget section (Class II, Section IV) and the same k-means cluster (5).

In [43]:
df[df['Word'] == 'airplane']

Unnamed: 0,Word,Number,Class,Section,Cluster,ClusterSection
2087,airplane,266,II,IIIV,2.0,5


In [44]:
df[df['Word'] == 'passenger']

Unnamed: 0,Word,Number,Class,Section,Cluster,ClusterSection
58061,passenger,268,II,IIIV,2.0,5


In [45]:
df[df['Word'] == 'travel']

Unnamed: 0,Word,Number,Class,Section,Cluster,ClusterSection
83012,travel,266,II,IIIV,2.0,5


In [46]:
df[df['Word'] == 'turbine']

Unnamed: 0,Word,Number,Class,Section,Cluster,ClusterSection
83817,turbine,284,II,IIIV,2.0,5


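The Rand Index rises to 0.8352 for the section-level comparison, but as before this raw score mostly reflects pairs of words that are separated in both groupings, which become more frequent as the number of groups grows. The chance-corrected Adjusted Rand Index (0.006) and the homogeneity score (0.041) computed below remain close to zero, so the section-level clusters also align only weakly with Roget's sections.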


In [47]:
from sklearn import metrics
df.dropna(inplace=True)

labels_true = df['Section']
labels_pred = df['ClusterSection']
metrics.rand_score(labels_true, labels_pred)

0.8352034219524224

In [48]:
metrics.adjusted_rand_score(labels_true, labels_pred)

0.005900696941439428

In [49]:
metrics.homogeneity_score(labels_true, labels_pred)

0.04053914280030881

## Classification

In this phase, our focus shifts to classification. We'll employ two distinct models: one targeting class prediction and the other section prediction. 

In [50]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [51]:
df

Unnamed: 0,Word,Number,Class,Section,Cluster,ClusterSection
1,A 1,648,V,VII,2.0,2
2,a being,3,I,II,4.0,22
3,a blue moon,107,I,IVI,4.0,0
4,a bright thought,498,IV,IVII,4.0,17
5,a can of worms,248,II,IIIII,4.0,3
...,...,...,...,...,...,...
91086,zounds!,870,VI,VIII,4.0,3
91087,Zulu,876,VI,VIII,4.0,2
91088,zygote,357,III,IIIIII,1.0,4
91089,zymotic,657,V,VII,1.0,20


In [52]:
newdf

Unnamed: 0,Word,Number,Embeddings,Cluster,ClusterSection
0,A 1,648,"[-0.0179901123046875, 0.0180511474609375, 0.05...",2,2
1,a being,3,"[-0.0487060546875, 0.018829345703125, 0.048492...",4,22
2,a blue moon,107,"[-0.030792236328125, 0.01812744140625, 0.03109...",4,0
3,a bright thought,498,"[-0.0408935546875, 0.0291900634765625, 0.03305...",4,17
4,a can of worms,248,"[-0.029022216796875, -0.012054443359375, 0.062...",4,3
...,...,...,...,...,...
55537,zounds!,900,"[-0.013427734375, 0.035125732421875, 0.0615844...",4,3
55538,Zulu,876,"[-0.044830322265625, 0.0084381103515625, 0.036...",4,2
55539,zygote,357,"[-0.03826904296875, -0.0031299591064453125, 0....",1,4
55540,zymotic,657,"[-0.0560302734375, 0.0162200927734375, 0.06240...",1,20


In [53]:
import pandas as pd

# Attach each word's embedding from newdf to the rows of df.
# Note: a word that occurs more than once in newdf produces one merged row per occurrence.
mergeddf = df.merge(newdf[['Word', 'Embeddings']], on='Word', how='left')

# Display the merged DataFrame
display(mergeddf)


Unnamed: 0,Word,Number,Class,Section,Cluster,ClusterSection,Embeddings
0,A 1,648,V,VII,2.0,2,"[-0.0179901123046875, 0.0180511474609375, 0.05..."
1,a being,3,I,II,4.0,22,"[-0.0487060546875, 0.018829345703125, 0.048492..."
2,a blue moon,107,I,IVI,4.0,0,"[-0.030792236328125, 0.01812744140625, 0.03109..."
3,a bright thought,498,IV,IVII,4.0,17,"[-0.0408935546875, 0.0291900634765625, 0.03305..."
4,a can of worms,248,II,IIIII,4.0,3,"[-0.029022216796875, -0.012054443359375, 0.062..."
...,...,...,...,...,...,...,...
91088,zygote,357,III,IIIIII,1.0,4,"[-0.03826904296875, -0.0031299591064453125, 0...."
91089,zymotic,657,V,VII,1.0,20,"[-0.0560302734375, 0.0162200927734375, 0.06240..."
91090,zymotic,657,V,VII,1.0,20,"[-0.0560302734375, 0.0162200927734375, 0.06240..."
91091,zymotic,655,V,VII,1.0,20,"[-0.0560302734375, 0.0162200927734375, 0.06240..."


The logistic regression model reaches an accuracy of approximately 55.13% for class prediction. With six classes, uniform random guessing would score about 17%, so the embeddings carry real information about Roget's classes, but nearly half of the test entries are still misclassified.

In [54]:
# Split the data into features (X) and target labels (y)
X = mergeddf['Embeddings'].values.tolist()

y_class = mergeddf['Class'].values

X_train, X_test, y_class_train, y_class_test = train_test_split(X, y_class, test_size=0.2, random_state=42)

logistic_class_model = LogisticRegression(max_iter=1000, random_state=42)

logistic_class_model.fit(X_train, y_class_train)

logistic_class_pred = logistic_class_model.predict(X_test)

logistic_class_accuracy = accuracy_score(y_class_test, logistic_class_pred)

print("Logistic Regression Accuracy (Class):", logistic_class_accuracy)





Logistic Regression Accuracy (Class): 0.5513474943740052



Section prediction is a harder task because there are far more categories: df19 lists 24 sections. Logistic regression reaches an accuracy of approximately 47.10%, well above the roughly 4% that uniform random guessing over 24 categories would give, but still modest in absolute terms.

In [55]:
# Split the data into features (X) and section labels (y)
X = mergeddf['Embeddings'].values.tolist()

y_section = mergeddf['Section'].values

X_train, X_test, y_section_train, y_section_test = train_test_split(X, y_section, test_size=0.2, random_state=42)

logistic_section_model = LogisticRegression(max_iter=1000, random_state=42)

logistic_section_model.fit(X_train, y_section_train)

logistic_section_pred = logistic_section_model.predict(X_test)

logistic_section_accuracy = accuracy_score(y_section_test, logistic_section_pred)

print("Logistic Regression Accuracy (Section):", logistic_section_accuracy)

Logistic Regression Accuracy (Section): 0.47099182172457327
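
A simple reference point for both accuracies is a baseline that always predicts the most frequent training label. The sketch below is illustrative only: `majority_baseline_accuracy` is a hypothetical helper, and the toy labels are made up rather than drawn from the dataset.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Made-up labels, used only to show the calculation.
toy_train = ['V', 'V', 'I', 'II', 'V', 'VI']
toy_test = ['V', 'I', 'V', 'III']
baseline_label, baseline_accuracy = majority_baseline_accuracy(toy_train, toy_test)
print(baseline_label, baseline_accuracy)
```

Applied to the training and test labels from the cells above, this gives the accuracy a model must beat to show that it learned more than the label frequencies; scikit-learn's `DummyClassifier(strategy='most_frequent')` offers the same baseline through the estimator interface.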
