I have two datasets: [English morphemes](https://colingoldberg.github.io/morphemes/morpheme/dataset/2019/06/03/morpheme-dataset.html) with characteristics derived from WordNet, and [frequency counts](https://www.kaggle.com/datasets/rtatman/english-word-frequency) of English words from the Google Web Trillion Word Corpus. My goal is to create a morpheme dataset that includes, for each morpheme, the cumulative frequency of the words in which it occurs. I got far enough to accumulate frequencies for most morphemes, but my approach does not account for allomorphy or for affixes that do not sit at the very start or end of a word. I later found a dataset that contains exactly what I was trying to create, but I would still like to figure out how to implement this fully.

In [1]:
import pandas as pd
import json

In [2]:
# Transforming JSON of morphemes to Pandas df
with open("data/morphemes.json") as f:
    json_data = json.load(f)

df_list = []

'''
A different method I considered: the JSON data contains examples of words the morphemes can belong to, and I could look up the
frequencies of those words in the other dataset, but that list is not exhaustive.
'''

# Removed meaning, etymology, examples, categories
for key, value in json_data.items():
    forms_data = value.get('forms', [{}])[0]
    entry = {
        'Affix': forms_data.get('form', ''),
        'Type': forms_data.get('loc', ''),
        #'Form_Type': forms_data.get('type', ''), # Too many NaN values
        'Origin': value.get('origin', '')
    }
    df_list.append(entry)

df = pd.DataFrame(df_list)

# Removed non-affixes
df = df[df['Type'] != 'embedded']
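
The loop above keeps only the first entry of each morpheme's `forms` list, and `value.get('forms', [{}])[0]` would raise an `IndexError` if a morpheme had an empty list. If the remaining entries turn out to be alternate spellings of the same morpheme (an assumption about the dataset that I have not verified), keeping all of them would be one way to start covering allomorphy. A minimal sketch on made-up data, where `sample_json` and `all_forms` are hypothetical names:

```python
import pandas as pd

# Toy stand-in for the morpheme JSON; only the keys used above are assumed.
sample_json = {
    "in-1": {
        "forms": [
            {"form": "in", "loc": "prefix"},
            {"form": "im", "loc": "prefix"},
            {"form": "il", "loc": "prefix"},
        ],
        "origin": "Latin",
    },
    "able": {"forms": [{"form": "able", "loc": "suffix"}], "origin": ""},
    "empty": {"forms": [], "origin": "Greek"},  # contributes no rows
}

def all_forms(json_data):
    """Return one row per listed form, tagged with its morpheme key."""
    rows = []
    for key, value in json_data.items():
        for form in value.get("forms", []):
            rows.append({
                "Morpheme": key,
                "Affix": form.get("form", ""),
                "Type": form.get("loc", ""),
                "Origin": value.get("origin", ""),
            })
    return pd.DataFrame(rows)

forms_df = all_forms(sample_json)
```

With the `Morpheme` column in place, frequencies counted per form could later be summed per morpheme with a `groupby`.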

In [3]:
print(df.describe(include='all'))

       Affix    Type Origin
count   2374    2374   2374
unique  2325       2     24
top       en  prefix  Latin
freq       4    1998    857


In [4]:
df['Origin'].unique()

array(['', 'Greek', 'Latin, Greek', 'Latin', 'French', 'throw',
       'Latin from Greek', 'Latin and Greek', 'Irish', 'German',
       'G. entoma', 'L. flagrum', 'L.gene', 'Greek, Latin', 'Latin Greek',
       'Italian', 'G. merus', 'G. meion', 'nectere, nexus',
       'Middle English', 'English', 'greek', 'Greek from Hebrew',
       'Greek '], dtype=object)

In [5]:
# Standardizing the origin column
origin_mapping = {
    '': 'Unspecified',
    'Latin from Greek': 'Latin, Greek',
    'Greek, Latin': 'Latin, Greek',
    'Latin Greek': 'Latin, Greek',
    '"Latin, Greek"': 'Latin, Greek',
    'Latin and Greek': 'Latin, Greek',
    'Greek ': 'Greek',
    'greek': 'Greek',
    'G. entoma': 'Greek',
    'G. merus': 'Greek',
    'G. meion': 'Greek',
    'nectere, nexus': 'Latin',
    'L. flagrum': 'Latin',
    'L.gene': 'Latin',
    'Greek from Hebrew': 'Greek, Hebrew'
}

# Applying the new mapping
df['Origin'] = df['Origin'].replace(origin_mapping)

# Remove throw
df = df[df['Origin'] != 'throw']

# Also make all affixes lowercase
df['Affix'] = df['Affix'].str.lower()

In [6]:
# The new dataset
df.head(15)

Unnamed: 0,Affix,Type,Origin
0,afro,prefix,Unspecified
1,anglo,prefix,Unspecified
2,euro,prefix,Unspecified
3,franco,prefix,Unspecified
4,indo,prefix,Unspecified
5,hell,prefix,Greek
6,abac,prefix,Greek
7,abil,prefix,Unspecified
8,able,suffix,Unspecified
9,ably,suffix,Unspecified


In [7]:
# Import the frequency dataset
freq = pd.read_csv('data/unigram_freq.csv')
freq.head()

Unnamed: 0,word,count
0,the,23135851162
1,of,13151942776
2,and,12997637966
3,to,12136980858
4,a,9081174698


In [8]:
# Check for duplicated values in 'word' column
duplicated_words = freq[freq.duplicated('word')]['word']
print(f'Duplicates:\n {duplicated_words}')

# Check for missing values in 'word' column
missing_words = freq[freq['word'].isnull()]['word']
print(f'Missing Values:\n {missing_words}')

# Drop those instances
freq = freq.drop_duplicates('word')
freq = freq.dropna(subset=['word'])

#Make all words lowercase
freq['word'] = freq['word'].str.lower()

#freq.info()

Duplicates:
 12819    NaN
Name: word, dtype: object
Missing Values:
 2577     NaN
12819    NaN
Name: word, dtype: object
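
A possible explanation for the two missing values, which I have not checked against the file: by default `pd.read_csv` converts strings such as "null" and "nan" to `NaN`, so real words spelled that way would show up as missing. Passing `keep_default_na=False` keeps them as text. A small sketch on a made-up CSV:

```python
import io
import pandas as pd

# Toy CSV standing in for unigram_freq.csv; the counts are made up.
csv_text = "word,count\nthe,100\nnull,7\nnan,3\n"

# Default parsing turns "null" and "nan" into missing values.
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False leaves them as ordinary strings.
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

n_missing_default = int(default_read["word"].isna().sum())
n_missing_literal = int(literal_read["word"].isna().sum())
```

If that is what happened here, reading the file this way would keep those two words and their counts instead of dropping them.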


This is my first attempt:

I take each affix and check whether it appears at the start or end of each word (the start for a prefix, the end for a suffix). I then sum the frequencies of the matching words and add this total to the morpheme dataset. This is problematic because affixes can also appear in the middle of words.

In [9]:
# Get the unique affixes
unique_affixes_in_df = df[['Affix', 'Type']].drop_duplicates()

results = []

for _, row in unique_affixes_in_df.iterrows():
    affix = row['Affix']
    affix_type = row['Type']

    # Regex: Prefixes at the start of words, suffixes at the end
    pattern = f'^{affix}' if affix_type == 'prefix' else f'{affix}$'

    # Get words with this pattern
    contains_affix = freq['word'].str.contains(pattern, regex=True)
    words_with_affix_df = freq[contains_affix]

    # Accumulate frequencies
    total_frequency = words_with_affix_df['count'].sum()
    result_entry = {'Affix': affix, 'Type': affix_type, 'TotalFrequency': total_frequency}

    results.append(result_entry)

result_df = pd.DataFrame(results)

# Appending the value and cleaning the new dataframe
merged_df = pd.merge(df, result_df, on='Affix', how='left')
merged_df['TotalFrequency'] = merged_df['TotalFrequency'].fillna(0)
merged_df['TotalFrequency'] = merged_df['TotalFrequency'].astype(int)

print(merged_df.shape)
merged_df.head()


(2420, 5)


Unnamed: 0,Affix,Type_x,Origin,Type_y,TotalFrequency
0,afro,prefix,Unspecified,prefix,2647152
1,anglo,prefix,Unspecified,prefix,3591523
2,euro,prefix,Unspecified,prefix,225990569
3,franco,prefix,Unspecified,prefix,6772175
4,indo,prefix,Unspecified,prefix,44708427
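
The `Type_x` and `Type_y` columns, and the row count of 2420, are consistent with merging on `Affix` alone while some affixes are listed more than once or under more than one type. Merging on both `Affix` and `Type` would avoid the duplicated column and the extra rows. A sketch on toy frames (the values are made up, and "en" being both a prefix and a suffix is only for illustration):

```python
import pandas as pd

# Toy frames; "en" appears once as a prefix and once as a suffix.
affixes = pd.DataFrame({
    "Affix": ["en", "en", "afro"],
    "Type": ["prefix", "suffix", "prefix"],
    "Origin": ["Latin", "Unspecified", "Unspecified"],
})
totals = pd.DataFrame({
    "Affix": ["en", "en", "afro"],
    "Type": ["prefix", "suffix", "prefix"],
    "TotalFrequency": [500, 900, 40],
})

# Merging on Affix only pairs every "en" row with every "en" total.
on_affix_only = pd.merge(affixes, totals, on="Affix", how="left")

# Merging on both keys keeps one row per (Affix, Type) pair.
on_both_keys = pd.merge(affixes, totals, on=["Affix", "Type"], how="left")
```

Here `on_affix_only` has five rows and both `Type_x` and `Type_y`, while `on_both_keys` has three rows and a single `Type` column.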


This is an alternative to regex that uses `str.startswith` and `str.endswith`; it might be faster, but I have not timed it.

In [10]:

# Drop-in replacement for the regex lines inside the loop above:
#
# if affix_type == 'prefix':
#     contains_affix = freq['word'].str.startswith(affix)
# elif affix_type == 'suffix':
#     contains_affix = freq['word'].str.endswith(affix)
# else:
#     print(f"Unsupported affix type: {affix_type}")
#     continue
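
As a runnable version of the same idea, here is a sketch on toy data. The helper name `total_frequency` and the sample words and counts are made up for illustration. One side benefit of plain string matching is that the affix is compared literally, so characters with a special meaning in regex cannot change the match.

```python
import pandas as pd

# Toy data; counts are made up.
words = pd.DataFrame({
    "word": ["afrobeat", "readable", "enable", "golden"],
    "count": [10, 20, 30, 40],
})
affix_table = pd.DataFrame({
    "Affix": ["afro", "able", "en", "en"],
    "Type": ["prefix", "suffix", "prefix", "suffix"],
})

def total_frequency(affix, affix_type, freq_df):
    """Sum counts of words that start (prefix) or end (suffix) with affix."""
    if affix_type == "prefix":
        mask = freq_df["word"].str.startswith(affix)
    elif affix_type == "suffix":
        mask = freq_df["word"].str.endswith(affix)
    else:
        raise ValueError(f"Unsupported affix type: {affix_type}")
    return int(freq_df.loc[mask, "count"].sum())

# One total per (Affix, Type) row, in table order.
affix_table["TotalFrequency"] = [
    total_frequency(a, t, words)
    for a, t in zip(affix_table["Affix"], affix_table["Type"])
]
```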

This is my second attempt, which makes the first method iterative: after finding an affix, I remove it and check whether the remainder contains more affixes. For example, "incarcerated" -> "incarcerate" -> "carcerate" -> "ate". The problem with this method is that it is computationally expensive, and it still does not account for allomorphy, where a morpheme changes form slightly depending on its context.

In [11]:
# This is the same as before
unique_affixes_in_df = df[['Affix', 'Type']].drop_duplicates()

results = []

for _, row in unique_affixes_in_df.iterrows():
    affix = row['Affix']
    affix_type = row['Type']

    pattern = f'.*?{affix}' if affix_type == 'prefix' else f'{affix}.*?'
    contains_affix = freq['word'].str.contains(pattern, regex=True)
    words_with_affix_df = freq[contains_affix]

    # Here is the change:
    # While any word still has the affix, remove it and check again.
    # Work on a copy so freq['word'] stays intact for the next affix.
    remaining = freq['word'].copy()
    while contains_affix.any():
        # Remove the first instance of the affix from each word
        remaining = remaining.str.replace(pattern, '', n=1, regex=True)
        contains_affix = remaining.str.contains(pattern, regex=True)
        # Accumulate freq of next found affix?
        # ...

    # the rest is also the same
    total_frequency = words_with_affix_df['count'].sum()
    result_entry = {'Affix': affix, 'Type': affix_type, 'TotalFrequency': total_frequency}

    results.append(result_entry)

result_df = pd.DataFrame(results)

merged_df = pd.merge(df, result_df, on='Affix', how='left')
merged_df


KeyboardInterrupt: 
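
A different way to organise the iterative idea is to loop over words instead of affixes: peel affixes off each word until nothing matches, then add the word's count to every affix that was removed. The sketch below is only an illustration on toy data, not a tested solution for the full datasets. The function `peel_affixes`, the `min_stem` cutoff, and the affix lists are all assumptions.

```python
from collections import Counter

# Toy inputs; the affix lists and counts are illustrative only.
prefixes = ["in", "un"]
suffixes = ["ed", "ate", "able"]
word_counts = {"incarcerated": 5, "unable": 3, "rated": 2}

def peel_affixes(word, prefixes, suffixes, min_stem=3):
    """Repeatedly strip the longest matching prefix or suffix from word.

    Stops when nothing matches or the remainder would be shorter than
    min_stem characters. Returns the (affix, type) pairs that were removed.
    """
    found = []
    changed = True
    while changed:
        changed = False
        for p in sorted(prefixes, key=len, reverse=True):
            if word.startswith(p) and len(word) - len(p) >= min_stem:
                found.append((p, "prefix"))
                word = word[len(p):]
                changed = True
                break
        for s in sorted(suffixes, key=len, reverse=True):
            if word.endswith(s) and len(word) - len(s) >= min_stem:
                found.append((s, "suffix"))
                word = word[:-len(s)]
                changed = True
                break
    return found

# Add each word's count to every affix peeled from it.
totals = Counter()
for word, count in word_counts.items():
    for pair in peel_affixes(word, prefixes, suffixes):
        totals[pair] += count
```

On this toy input "incarcerated" yields "in" and "ed" but not "ate": removing "ed" leaves "carcerat", which no longer ends in "ate". That is the same spelling and allomorphy problem described above, so this sketch shares that limitation.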

In [12]:
# Exporting to csv
merged_df.to_csv("data/affixes.csv", index=False)

In [13]:
# How many affixes were not counted:
merged_df[merged_df['TotalFrequency'] == 0]

Unnamed: 0,Affix,Type_x,Origin,Type_y,TotalFrequency
50,ailur,prefix,Greek,prefix,0
65,alphit,prefix,Greek,prefix,0
282,camisi,prefix,Latin,prefix,0
297,carcer,prefix,Latin,prefix,0
334,centesim,prefix,Latin,prefix,0
...,...,...,...,...,...
2233,trit-,prefix,Greek,prefix,0
2253,tymb,prefix,Greek,prefix,0
2265,uligin,prefix,Latin,prefix,0
2401,xera,prefix,Greek,prefix,0
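
At least one zero-count entry in this table, `trit-`, still carries a hyphen, so it can never match plain word text. Stripping leading and trailing hyphens before matching might recover some of these rows. A small sketch on toy data (the `Clean` column and the sample words are made up):

```python
import pandas as pd

# Toy rows; "trit-" mirrors the hyphenated entry in the table above.
affixes = pd.DataFrame({"Affix": ["trit-", "-able", "afro"]})
words = pd.Series(["tritium", "readable", "afrobeat"])

# Strip leading/trailing hyphens so the affix can match plain word text.
affixes["Clean"] = affixes["Affix"].str.strip("-")

# Literal (non-regex) substring checks before and after cleaning.
raw_hits = [bool(words.str.contains(a, regex=False).any()) for a in affixes["Affix"]]
clean_hits = [bool(words.str.contains(a, regex=False).any()) for a in affixes["Clean"]]
```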
