Read in the SMOR results and use them to remove compounds from `simplex_filtered1.csv`, yielding `simplex_filtered2.csv`.

In [1]:
import pandas as pd
import re
import string

def smored_to_dict(smor_out):
    """
    Reads SMOR output into dictionary format.
    
    Arg:
        smor_out: A list of lines output from SMOR.
    Returns:
        A dictionary in which the unique words analysed by SMOR are the keys, and the lists of analyses are the values.
    """
    
    # Build the dictionary incrementally while walking through the SMOR output line by line.
    analyses = dict()
    curr_wd = None
    new_token = False
    for line in smor_out:

        # Skip blank lines so that the prefix check below never sees an empty string.
        if not line:
            continue

        # Get the current word. (This works because the word, formatted '> Word', always precedes its analyses.)
        if line.startswith(">"):
            curr_wd = line[2:]

            # Flag whether this word is new. If it is already in the dictionary, its analyses were
            # collected at an earlier occurrence, so the lines that follow are skipped.
            new_token = curr_wd not in analyses

        # Otherwise the line is an analysis of the current word: create the key on the first
        # analysis line and append to its list on subsequent ones.
        elif new_token:
            analyses.setdefault(curr_wd, []).append(line)
            
    return analyses

**Part 1:** Convert SMOR output to a dictionary and trim it to remove unparsable words, NN compounds, words containing numerals or punctuation, and single-character words.

In [2]:
# Read in the SMOR output.
with open('infiles/simplex_filtered1.smored', encoding='utf-8') as file:
    smored = [line.strip() for line in file]

smored_dict = smored_to_dict(smored)
len(smored_dict)

13864

`smored_dict` is shorter than the original file because words with the same form but different POSs are collapsed into a single key.
Since, in the end, we only care about string matching, it's OK to lose these duplicates at this point.

We can safely ignore anything that can't be analysed by SMOR, indicated by a value containing `no result for`.

In [3]:
# Remove no-result-for items.
smored_dict = {key: val for key, val in smored_dict.items() if "no result for" not in val[0]}
len(smored_dict)

12162
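The two steps so far can be illustrated on toy data. This is a minimal sketch: the sample lines are hand-written to imitate the `> Word` / analysis layout described above rather than taken from the real file, and `group_analyses` is a hypothetical, compact restatement of `smored_to_dict` so that the cell runs on its own.

```python
# Hand-written lines imitating SMOR output; the analyses are illustrative only.
sample = [
    "> Haus",
    "Haus<+NN><Neut><Nom><Sg>",
    "Haus<+NN><Neut><Acc><Sg>",
    "> Haustür",
    "Haus<NN>Tür<+NN><Fem><Nom><Sg>",
    "> Haus",                      # repeated word: its analyses are skipped
    "Haus<+NN><Neut><Nom><Sg>",
    "> xyzzy",
    "no result for xyzzy",         # unanalysable word
]

def group_analyses(lines):
    """Hypothetical compact equivalent of smored_to_dict, for illustration."""
    grouped = {}
    current, is_new = None, False
    for line in lines:
        if line.startswith(">"):
            current = line[2:]
            is_new = current not in grouped
        elif is_new:
            grouped.setdefault(current, []).append(line)
    return grouped

toy = group_analyses(sample)

# Same filter as in the cell above: drop words SMOR could not analyse.
toy_kept = {w: a for w, a in toy.items() if "no result for" not in a[0]}

print(len(toy), len(toy_kept))
```

Four `> Word` headers yield three keys because `Haus` occurs twice, and dropping `xyzzy` leaves two.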

We particularly want to identify NN compounds, and we can find them by matching a sequence of `<NN>` followed by `<+NN>` (indicating a noun followed by a head noun).

In [5]:
smored_dict = {key: val for key, val in smored_dict.items() if re.search(r'<NN>.*<\+NN>', "".join(val)) is None}
# (The join() call combines all analyses in the val list into one string for easier searching)
len(smored_dict)

8503
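To make the pattern's behaviour concrete, here is a small sketch on invented analyses (the tag strings are illustrative, not copied from the SMOR output). It also shows one consequence of joining the analyses before searching: the pattern can match across the boundary between two analyses. The per-analysis check at the end is an alternative for comparison only; it is not what the cell above does and would give different counts.

```python
import re

NN_COMPOUND = re.compile(r"<NN>.*<\+NN>")

# Invented analyses, for illustration only.
toy_analyses = {
    "Haustür": ["Haus<NN>Tür<+NN><Fem><Nom><Sg>"],   # noun + head noun
    "Haus": ["Haus<+NN><Neut><Nom><Sg>"],            # simplex noun
    "ab": ["a<NN>b<+ADJ>", "ab<+NN>"],               # made-up boundary case
}

# Joined search, as in the notebook cell.
joined_match = {
    word: NN_COMPOUND.search("".join(analyses)) is not None
    for word, analyses in toy_analyses.items()
}

# Hypothetical alternative: search each analysis separately.
per_analysis_match = {
    word: any(NN_COMPOUND.search(a) for a in analyses)
    for word, analyses in toy_analyses.items()
}

print(joined_match)
print(per_analysis_match)
```

In this toy data `ab` is flagged by the joined search but not by the per-analysis search, because `<NN>` and `<+NN>` come from different analyses. The joined form is therefore the stricter filter: it can discard a word that has no single NN-compound analysis.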

Also remove any words that contain numerals or punctuation and any that are only one character long.

In [6]:
smored_dict = {key: val for key, val in smored_dict.items() if re.search(r'\d', key) is None}
len(smored_dict)

8462

In [7]:
smored_dict = {key:val for key, val in smored_dict.items() if not any(punct in key for punct in string.punctuation)}
len(smored_dict)

8292

In [8]:
smored_dict = {key:val for key, val in smored_dict.items() if len(key) > 1}
len(smored_dict)

8268

^ This is how many unique lemmata there are to annotate.
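The three key-based filters above (numerals, punctuation, length) could equally be written as a single predicate. The sketch below assumes a hypothetical helper `is_annotatable` and a made-up word list; it applies the same tests as the cells above.

```python
import re
import string

def is_annotatable(lemma):
    """Hypothetical helper: True if the lemma passes all three key filters."""
    has_digit = re.search(r"\d", lemma) is not None
    has_punct = any(ch in string.punctuation for ch in lemma)
    return not has_digit and not has_punct and len(lemma) > 1

# Made-up candidates, for illustration only.
candidates = ["Haus", "A4", "E-Mail", "x", "Tür"]
kept = [w for w in candidates if is_annotatable(w)]
print(kept)  # ['Haus', 'Tür']
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic characters such as `„` or `–` would pass this check.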

**Part 2:** Read in `outfiles/simplex_filtered1.csv`, only keep the rows in which the value of the `lemma` column is in `smored_dict.keys()`, and save as `outfiles/simplex_filtered2.csv`.

In [9]:
filt1 = pd.read_csv('outfiles/simplex_filtered1.csv')
filt2 = filt1[filt1['lemma'].isin(smored_dict.keys())]
filt2.to_csv('outfiles/simplex_filtered2.csv', index=False)

Also deduplicate the lemmata for slightly more efficient annotation, saving them as `outfiles/simplex_filtered2_toannot.csv`.
The annotation takes place in `infiles/simplex_filtered2_annotated.csv`, which will be merged with `outfiles/simplex_filtered2.csv` in the next step to produce `outfiles/simplex_filtered3.csv`.

In [10]:
pd.Series(filt2.lemma.unique(), name='lemma').to_csv('outfiles/simplex_filtered2_toannot.csv', index=False, header=True)
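As a quick check of what Part 2 does, the same row filter and deduplication can be run on an in-memory toy frame. This is only a sketch: the `pos` column and all values are assumptions made for illustration, and nothing is read from or written to disk.

```python
import pandas as pd

# Toy stand-in for simplex_filtered1.csv; only the `lemma` column is known
# from the notebook, the `pos` column is a made-up extra.
toy_filt1 = pd.DataFrame({
    "lemma": ["Haus", "Haus", "Haustür", "Tür"],
    "pos": ["NN", "NN", "NN", "NN"],
})
toy_keep = {"Haus", "Tür"}  # stand-in for smored_dict.keys()

# Keep every row whose lemma survived the SMOR-based filtering.
toy_filt2 = toy_filt1[toy_filt1["lemma"].isin(toy_keep)]

# One row per lemma for annotation, in order of first appearance.
toy_toannot = pd.Series(toy_filt2["lemma"].unique(), name="lemma")

print(len(toy_filt2), toy_toannot.tolist())
```

The filtered frame keeps repeated lemmata (three rows here), while the list to annotate contains each lemma once.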