### Amod preprocessing

This notebook preprocesses amod collocations: pairs of words connected by the `amod` (adjectival modifier) dependency relation. For example, the pair ("aasta", "eelmine"), meaning "year" and "previous", appears 52675 times in the Estonian Koondkorpus, the corpus from which the collocations were extracted. An example sentence is "Ugalasse läks Ott Aardam **eelmisel aastal** koos viie kursusekaaslasega pärast EMA Kõrgema Lavakunstikooli lõpetamist.", where the words forming the collocation are marked in bold.

### Imports

In [1]:
from collections import defaultdict
from data_preprocessing import fetch_entries, connected_entries, matrix_creation

### Analysing the data

In [2]:
entries = fetch_entries(db_name='amod_collocations_20211206.db', table_name='amod_koondkorpus')
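
`fetch_entries` comes from the project's own `data_preprocessing` module, whose source is not shown here. As a rough sketch of what such a loader might do, assuming the `.db` file is an SQLite database and each row comes back as a tuple, the hypothetical `fetch_rows` helper below reads every row of one table with the standard `sqlite3` module. The in-memory table, its column names and the part-of-speech tags are invented for illustration; only the word pairs and counts are taken from this notebook.

```python
import sqlite3

def fetch_rows(connection, table_name):
    """Return all rows of `table_name` as a list of tuples (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    cursor = connection.execute(f"SELECT * FROM {table_name}")
    return cursor.fetchall()

# Toy in-memory database. The column layout is an assumption, chosen so that
# indices 0, 2 and 4 hold the first word, the second word and the count,
# matching how the later cells index each entry.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amod_demo (word1 TEXT, pos1 TEXT, word2 TEXT, pos2 TEXT, count INTEGER)"
)
conn.executemany(
    "INSERT INTO amod_demo VALUES (?, ?, ?, ?, ?)",
    [("aasta", "S", "eelmine", "A", 52675), ("aeg", "S", "viimane", "A", 27393)],
)
demo_entries = fetch_rows(conn, "amod_demo")
conn.close()
```

Each element of `demo_entries` is then a plain tuple, so `entry[0]`, `entry[2]` and `entry[4]` can be used exactly as in the cells that follow.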

#### Removing pairs that are not connected to others

In [3]:
first = list(dict.fromkeys([entry[0] for entry in entries]))
second = list(dict.fromkeys([entry[2] for entry in entries]))

In [4]:
connected = connected_entries(entries, first, second)
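
The implementation of `connected_entries` is not shown in this notebook, so the following is only one possible reading of "connected": a pair is kept if its first word or its second word also occurs in at least one other pair. The helper name `drop_isolated_pairs` and the toy entries are hypothetical, and the sketch assumes each (first, second) pair occurs once in the input.

```python
from collections import defaultdict

def drop_isolated_pairs(entries):
    """Keep entries whose first or second word also occurs in another pair."""
    partners_of_first = defaultdict(set)
    partners_of_second = defaultdict(set)
    for entry in entries:
        partners_of_first[entry[0]].add(entry[2])
        partners_of_second[entry[2]].add(entry[0])
    # A pair is isolated when each of its words has exactly one partner.
    return [
        entry
        for entry in entries
        if len(partners_of_first[entry[0]]) > 1
        or len(partners_of_second[entry[2]]) > 1
    ]

toy_entries = [
    ("aasta", "S", "eelmine", "A", 52675),
    ("aasta", "S", "järgmine", "A", 37528),
    ("keel", "S", "eesti", "A", 29119),  # shares no word with any other toy pair
]
toy_connected = drop_isolated_pairs(toy_entries)
```

In this toy input the two "aasta" pairs are kept because they share a first word, while ("keel", "eesti") is dropped.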

#### Reducing the number of first words to a reasonable amount

In [5]:
first_counts = defaultdict(int)

for entry in connected:
    first_counts[entry[0]] += entry[4]

In [6]:
first_to_keep = sorted(first_counts.items(), key=lambda kv: kv[1], reverse=True)[:15000]

In [7]:
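final_words = [f for f, count in first_to_keep]
final_words_set = set(final_words)  # set membership is O(1); a 15000-item list scan per entry is slow
final_entries = [entry for entry in connected if entry[0] in final_words_set]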
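
The three cells above sum the counts per first word, keep the most frequent first words, and filter the entries accordingly. A compact sketch of the same idea on toy data, using `collections.Counter` instead of `defaultdict` plus `sorted`, might look like this. The toy entries, the part-of-speech tags and `top_n = 2` are illustrative assumptions; the notebook itself keeps 15000 words.

```python
from collections import Counter

toy_connected = [
    ("aasta", "S", "eelmine", "A", 52675),
    ("aasta", "S", "järgmine", "A", 37528),
    ("aeg", "S", "viimane", "A", 27393),
    ("keel", "S", "eesti", "A", 29119),
]

# Total frequency of each first word over all of its pairs.
totals = Counter()
for entry in toy_connected:
    totals[entry[0]] += entry[4]

top_n = 2  # illustrative cut-off
kept_words = [word for word, _ in totals.most_common(top_n)]

# Filter against a set so each membership test is constant time.
kept_set = set(kept_words)
kept_entries = [entry for entry in toy_connected if entry[0] in kept_set]
```

`most_common` returns the words in descending order of total count, so `kept_words` keeps the same frequency ordering that `final_words` has.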

In [8]:
# Order-preserving de-duplication, same idiom as in cell 3
final_second = list(dict.fromkeys(entry[2] for entry in final_entries))

### Creating the matrix used for LDA

In [9]:
df = matrix_creation(final_entries, final_words, final_second, save_to_csv=True, filename="results\\amod.csv")
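
`matrix_creation` is also part of the unseen `data_preprocessing` module. Judging by the output, it produces a table with first words as rows, second words as columns and co-occurrence counts as cells. A minimal pure-Python sketch of that layout, assuming this interpretation is right, is shown below. The helper `build_count_matrix` is hypothetical and returns nested lists rather than a DataFrame, and it does not write a CSV file.

```python
def build_count_matrix(entries, row_words, column_words):
    """Build a rows-by-columns count matrix from (first, _, second, _, count) entries."""
    row_index = {word: i for i, word in enumerate(row_words)}
    col_index = {word: j for j, word in enumerate(column_words)}
    matrix = [[0] * len(column_words) for _ in row_words]
    for entry in entries:
        first, second, count = entry[0], entry[2], entry[4]
        # Pairs whose words are outside the chosen vocabularies are skipped.
        if first in row_index and second in col_index:
            matrix[row_index[first]][col_index[second]] += count
    return matrix

toy_entries = [
    ("aasta", "S", "eelmine", "A", 52675),
    ("aasta", "S", "järgmine", "A", 37528),
    ("aeg", "S", "eelmine", "A", 102),
]
toy_matrix = build_count_matrix(
    toy_entries, ["aasta", "aeg"], ["eelmine", "järgmine"]
)
```

Pairs that never occur, such as ("aeg", "järgmine") in this toy input, stay at zero, which is why the real matrix is mostly zeros in its rare columns.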

In [10]:
df.head()

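| | eelmine | järgmine | eesti | esimene | viimane | kohalik | suur | 1999. | 1998. | 1997. | ... | süvim | energiakullane | voimalustene | asustuslik | aeronavigatsiooniline | sihukne | paarisajamegane | sajamegane | päritav | endis-eesti |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aasta | 52675 | 37528 | 26 | 3933 | 24376 | 7 | 31 | 18611 | 17056 | 16967 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| aeg | 102 | 68 | 228 | 126 | 27393 | 1359 | 149 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| keel | 9 | 11 | 29119 | 87 | 24 | 448 | 89 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| osa | 195 | 429 | 14 | 3125 | 920 | 14 | 21798 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| inimene | 9 | 66 | 726 | 800 | 336 | 1149 | 532 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |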
