Reads the manually queried bases from `bases_manual_query_done.csv`, adds a row for each new base and its frequency to the dataset of the corresponding suffix, and annotates that row with `true_base == 1`.

After adding this information to each dataset, the script also merges the frequencies of the lemmas and bases indicated in the `merge` column.
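
The merge step can be sketched on a small invented example. The column names follow the script; the rows, the values, and the helper name `merge_frequencies` are assumptions made for illustration only. A row whose `merge` cell names another lemma has its frequencies added to that lemma's row (the one with `true_lemma == 1` and `true_base == 1`) and is then dropped.

```python
import pandas as pd

# Toy data: column names follow the script, the values are invented.
df = pd.DataFrame({
    'lemma':      ['Lesung', 'Lesungen', 'Leser'],
    'lemma_freq': [100, 20, 50],
    'base_freq':  [400, 30, 400],
    'true_lemma': [1, 0, 1],
    'true_base':  [1, 0, 1],
    'merge':      [None, 'Lesung', None],
})


def merge_frequencies(df):
    """Add the frequencies of rows flagged in `merge` to their target row,
    then drop the flagged rows and the `merge` column (hypothetical helper)."""
    out = df.copy()
    to_merge = out[out['merge'].notna()]
    for _, row in to_merge.iterrows():
        # The target is the single row for that lemma marked as a true pair.
        target = out.index[(out['lemma'] == row['merge'])
                           & (out['true_lemma'] == 1)
                           & (out['true_base'] == 1)]
        if len(target) != 1:
            raise ValueError(f"expected one target row for {row['merge']!r}, "
                             f"found {len(target)}")
        out.loc[target[0], 'lemma_freq'] += row['lemma_freq']
        out.loc[target[0], 'base_freq'] += row['base_freq']
    return (out[out['merge'].isna()]
            .drop(columns=['merge'])
            .reset_index(drop=True))


merged = merge_frequencies(df)
print(merged)
```

Raising an error when the target row is missing or ambiguous keeps a bad annotation from silently adding frequencies to the wrong row.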

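Finally, it excludes every pair whose base has a frequency of 0: the script sets `true_lemma` to 0 and `true_base` to NaN for those rows, so they no longer count as valid pairs.
The resulting data is saved in `6_backform_base_cutoff/`.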
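
A minimal sketch of the cutoff, again on invented toy data: pairs whose base never occurs are flagged (`true_lemma` set to 0, `true_base` set to NaN), after which they can be filtered out with a simple mask. The frame `pairs` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: column names follow the script, the values are invented.
pairs = pd.DataFrame({
    'lemma':      ['Lesung', 'Zeitung'],
    'base_freq':  [430, 0],
    'true_lemma': [1, 1],
    'true_base':  [1.0, 1.0],
})

# Flag every pair whose base has frequency 0.
zero_base = pairs['base_freq'] == 0
pairs.loc[zero_base, 'true_lemma'] = 0
pairs.loc[zero_base, 'true_base'] = np.nan

# Downstream code can then keep only the pairs that are still valid.
kept = pairs[pairs['true_lemma'] == 1]
print(kept)
```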

In [14]:
import pandas as pd
import os

ANNOT_FILES = [fn for fn in os.listdir('2_backform_samples_nano_annot') if fn.endswith('base_annot.csv')]
SFXS = [fn.split('_')[0] for fn in ANNOT_FILES]
MANUAL_FREQS = pd.read_csv('bases_manual_query_done.csv')

for idx in range(len(SFXS)):
    
    curr_file = ANNOT_FILES[idx]
    curr_sfx = SFXS[idx]
    # Read from the same directory that ANNOT_FILES was listed from.
    curr_df = pd.read_csv(os.path.join('2_backform_samples_nano_annot', curr_file))
    curr_df = curr_df.drop(columns=['query_by_hand', 'query_pos'])
    
    # First, we add in the frequency from the manually queried bases.
    # Subset the MANUAL_FREQS dataset for only the current suffix. If there is any content in that subset,
    # concatenate with curr_df.
    curr_manfreq = MANUAL_FREQS[MANUAL_FREQS.sfx == curr_sfx]
    
    if len(curr_manfreq) > 0:
        
        # Drop cols we no longer need (related to original data or manual querying)
        # and rename columns with the new values to the names in the original data.
        curr_manfreq = (
            curr_manfreq
            .drop(columns=['unique_candidates', 'pos', 'sfx', 'cql'])
            .rename(columns={'query_by_hand': 'unique_candidates', 'query_pos': 'pos'})
        )
        curr_manfreq['true_base'] = 1
        
        # Reorder cols to match curr_df, then concatenate. ignore_index=True gives every row
        # a unique index label, so the .at lookups in the merge step below are unambiguous.
        curr_manfreq = curr_manfreq.reindex(columns=curr_df.columns)
        curr_df = pd.concat([curr_manfreq, curr_df], ignore_index=True)
    
    # Next, we want to sum up the lemma_freq and base_freq values for the rows in which the merge value
    # is equal to the lemma value, and add those values to the values beside lemma.
    mg_subdf = curr_df[~curr_df['merge'].isna()]
    
    if len(mg_subdf) > 0:
        for _, row in mg_subdf.iterrows():
            
            # Get the important values from this row.
            target_lem = row['merge']
            lem_freq = row.lemma_freq
            bas_freq = row.base_freq

            # We should be able to ID the correct row to add these values to by looking at whether both true_lemma and true_base
            # have values of 1 (since now all the manually-queried bases should also have true_base values of 1)

            # So, find the index of the one row that meets this description, use .at to augment the values in curr_df
            # in that row and the correct columns.
            target_row = curr_df[(curr_df.lemma == target_lem) & (curr_df.true_lemma == 1) & (curr_df.true_base == 1)].index
            if len(target_row) == 0:
                target_row = curr_df[(curr_df.manual_lemma == target_lem) & (curr_df.true_lemma == 1) & (curr_df.true_base == 1)].index
            assert len(target_row) == 1, f'{curr_sfx}\t{target_lem}\t{list(target_row)}'

            curr_df.at[target_row[0], 'lemma_freq'] += lem_freq
            curr_df.at[target_row[0], 'base_freq'] += bas_freq

        # Remove the merged rows (those in mg_subdf) from curr_df and reset the index.
        curr_df = curr_df[curr_df['merge'].isna()].reset_index(drop=True)
    
    # Drop the merge column; no longer needed.
    curr_df = curr_df.drop(columns=['merge'])
    
    # Finally, wherever base_freq == 0, set true_lemma to 0 and true_base to NaN.
    # We will not consider pairs in which the base never appears.
    # Save in 6_backform_base_cutoff/.
    zero_base = curr_df.base_freq == 0
    curr_df.loc[zero_base, 'true_lemma'] = 0
    curr_df.loc[zero_base, 'true_base'] = float('nan')
    curr_df.to_csv('6_backform_base_cutoff/' + curr_sfx + '_bases.csv', index=False)
    
    print('Done', curr_sfx)

Done -age
Done -ament
Done -and
Done -ant
Done -anz
Done -ateur
Done -ation
Done -ator
Done -atur
Done -eA
Done -el
Done -ement
Done -end
Done -ent
Done -enz
Done -er
Done -eur
Done -eV
Done -heit
Done -ie
Done -iker
Done -ikum
Done -ik
Done -iment
Done -ismus
Done -ist
Done -itaet
Done -iteur
Done -ition
Done -itur
Done -ium
Done -ling
Done -nis
Done -schaft
Done -ung
