This script reads in the list of most frequent lemmas from DECOW and filters it for those that are most likely to be monomorphemic.
The tabular output is saved in `simplex_filtered1.csv`, and the output to be fed into SMOR is saved in `simplex_filtered1.tosmor`.

(SMOR can't be used on its own because it fails to recognise certain affixes.
But it's pretty good at compounds.)

In [1]:
import pandas as pd

In [2]:
freqlist = pd.read_csv('infiles/decow16bx.lp', sep='\t', names=['lemma', 'POS', 'freq'], keep_default_na=False)

# Only keep lemmata with frequency > 10000. (Impressionistically, there aren't many simplexes below that point at all.)
freqlist = freqlist[freqlist['freq'] > 10000]
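
A side note on `keep_default_na=False` in the cell above: by default, `pd.read_csv` turns strings such as `null` or `nan` into missing values, which would silently lose any lemma spelled that way. The sketch below shows the difference on two made-up rows (the rows and their tags are invented for illustration and are not taken from DECOW).

```python
import io
import pandas as pd

# Two made-up rows in the same tab-separated layout as the frequency list.
sample = "null\tCARD\t50000\nHaus\tNN\t40000\n"
cols = ['lemma', 'POS', 'freq']

# Default behaviour: the string "null" is parsed as a missing value.
default = pd.read_csv(io.StringIO(sample), sep='\t', names=cols)

# With keep_default_na=False, "null" stays an ordinary string.
literal = pd.read_csv(io.StringIO(sample), sep='\t', names=cols, keep_default_na=False)

n_missing_default = int(default['lemma'].isna().sum())
n_missing_literal = int(literal['lemma'].isna().sum())
print(n_missing_default, n_missing_literal)
```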

In [3]:
# Filter for POSs that can possibly be monomorphemic (i.e., excluding stuff like finite verbs and various cliticised pronouns)
pos_to_keep = set(['NN', 'ADJA', 'ADJD', 'VVINF', 'ADV', 'PTKVZ', 'APPR', 
               'KOUS' , 'KON', 'APPO', 'PTKNEG', 'PWAV', 'PWS', 'PWAT', 
               'PRF', 'VMINF', 'PPOSS', 'APZR', 'PRELS', 'PRELAT', 'ART',
               'PTKANT', 'KOUI', 'VAINF', 'KOKOM', 'PTKZU'])

freqlist = freqlist[ freqlist['POS'].isin(pos_to_keep) ]

In [4]:
print(len(freqlist))

31577


Remove words from the sample that contain particular affixes. 
The question of how strict these criteria should be is definitely very subjective, but I'm going for a high-precision, low-recall approach: I really want the things in my sample to be simplexes, and it's OK if there are some true simplexes that don't make it into my final sample.
Some of these are definitely not productive suffixes of German, but they appear a lot in loan words, so this is also a not-perfect-but-okay way to home in on core German phonology.
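
To make the precision/recall trade-off concrete, here is a minimal sketch on made-up data. The five nouns and the two-item suffix list `toy_noun_endswith` are chosen for illustration only and are not the lists used in this notebook. The filter has the same shape as the real one: a noun is dropped if it ends in any listed suffix.

```python
import pandas as pd

# Made-up sample: 'Vater', 'Haus' and 'Tisch' are simplexes;
# 'Lehrer' and 'Zeitung' are derived.
toy = pd.DataFrame({
    'lemma': ['Vater', 'Lehrer', 'Haus', 'Zeitung', 'Tisch'],
    'POS': ['NN'] * 5,
})
toy_noun_endswith = ['ung', 'er']  # illustrative subset of suffixes

# Mark every noun that ends in one of the suffixes.
is_noun = toy['POS'] == 'NN'
mask = pd.Series(False, index=toy.index)
for sfx in toy_noun_endswith:
    mask |= is_noun & toy['lemma'].str.endswith(sfx)

kept = toy[~mask]['lemma'].tolist()
dropped = toy[mask]['lemma'].tolist()
print(kept)     # ['Haus', 'Tisch']
print(dropped)  # ['Vater', 'Lehrer', 'Zeitung']
```

Everything that is kept is a simplex (high precision), but the true simplex 'Vater' is lost along with the derived words because it happens to end in 'er' (lower recall).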

In [5]:
verb_adj_adv_startswith = ['un', 'ge', 'ab', 'an', 'auf', 'aus', 'ein', 'mit', 'nach', 'weg', 'zu', 'be', 'ent', 'er', 'ver', 'zer',
                      'durch', 'um', 'über', 'unter', 'vor', 'hin', 'her', 'ur', 'ik', 'iv']
noun_startswith = [pfx.capitalize() for pfx in verb_adj_adv_startswith]
noun_endswith = ['ung', 'heit', 'keit', 'tum', 'nis', 'er', 'ion', 'in', 'schaft', 'ende', 'chen', 'tät', 'ist', 'ment',
                'ant', 'ar', 'är', 'äß', 'ik', 'um', 'ur', 'age',  'ei', 'erie', 'al']
adj_adv_endswith = ['isch', 'end', 'elnd', 'ernd', 'bar', 'lich', 'lichst', 'los', 'ional', 'voll', 'ell', 'iert', 'är', 'iv', 
                    'ig', 'os', 'ös', 'weise', 'er', 'el', 'frei', 'haft', 'al', 'wert', 'stens']

In [6]:
# Drop adjectives, adverbs and verbs that begin with one of the listed prefixes.
prefixable_pos = ['ADJA', 'ADJD', 'ADV', 'VVINF']
for pfx in verb_adj_adv_startswith:
    freqlist = freqlist[ ~(freqlist['POS'].isin(prefixable_pos) & freqlist['lemma'].str.startswith(pfx)) ]

# Drop nouns that begin with the capitalised version of one of those prefixes.
for pfx in noun_startswith:
    freqlist = freqlist[ ~((freqlist['POS'] == 'NN') & (freqlist['lemma'].str.startswith(pfx))) ]

# Drop nouns that end in one of the listed noun suffixes.
for sfx in noun_endswith:
    freqlist = freqlist[ ~((freqlist['POS'] == 'NN') & (freqlist['lemma'].str.endswith(sfx))) ]

# Drop adjectives and adverbs that end in one of the listed suffixes.
suffixable_pos = ['ADJA', 'ADJD', 'ADV']
for sfx in adj_adv_endswith:
    freqlist = freqlist[ ~(freqlist['POS'].isin(suffixable_pos) & freqlist['lemma'].str.endswith(sfx)) ]

In [7]:
print(len(freqlist))

14498


In [8]:
freqlist.to_csv('outfiles/simplex_filtered1.csv', index=False)
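# One lemma per line. `lineterminator` is the pandas >= 1.5 spelling of the older `line_terminator` argument, which was removed in pandas 2.0.
freqlist['lemma'].to_csv('outfiles/simplex_filtered1.tosmor', index=False, header=False, sep="\t", lineterminator='\n')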

The next step happens off-stage.
I apply SMOR to `outfiles/simplex_filtered1.tosmor` to yield `infiles/simplex_filtered1.smored`.