### Subject-verb preprocessing

This file preprocesses subject-verb collocations, that is, subject-verb pairs that occur together in sentences. For example, the pair ("see", "olema") appears 498627 times in the Estonian Koondkorpus, the corpus from which the collocations were extracted. An example sentence for this pair is "**See** **on** ju telemäng, ega me mõisa peale mänginud.", where the words forming the collocation are marked in bold.

### Imports

In [10]:
from collections import defaultdict
from data_preprocessing import fetch_entries, connected_entries, matrix_creation

### Analysing the data

In [2]:
entries = fetch_entries(db_name='subj_verb_collocations_20211110.db', table_name='subj_verb_koondkorpus')
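
`fetch_entries` comes from the local `data_preprocessing` module, whose source is not shown here. As a rough sketch only, assuming it simply returns every row of the named SQLite table as a tuple, a stand-in could look like the following. The name `fetch_entries_sketch` is hypothetical, and the real helper may behave differently.

```python
import sqlite3


def fetch_entries_sketch(db_name, table_name):
    """Hypothetical stand-in for fetch_entries: return all rows of a table as tuples."""
    # Table names cannot be bound as SQL parameters, so accept plain identifiers only.
    if not table_name.isidentifier():
        raise ValueError(f"unexpected table name: {table_name!r}")
    connection = sqlite3.connect(db_name)
    try:
        return connection.execute(f"SELECT * FROM {table_name}").fetchall()
    finally:
        connection.close()
```

Judging by the indexing in the cells below, each row appears to hold the subject, its part-of-speech tag, the verb, its part-of-speech tag, and the pair's count, in that order.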

#### Removing words that don't fit the criteria

In [3]:
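# entry[1] holds the subject's part-of-speech tag; collect every entry whose
# subject is tagged as neither 'S' nor 'P' (presumably noun and pronoun).
non_noun = [entry for entry in entries if entry[1] not in ('P', 'S')]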

In [4]:
{entry[1] for entry in non_noun}

{'A', 'D', 'G', 'J', 'N', 'V', 'X', 'Y', 'Z'}

In [5]:
# Sanity check: print any entry whose second word (tag in entry[3]) is not tagged as a verb
for entry in entries:
    if entry[3] != 'V':
        print(entry)

In [6]:
# Keep only the entries whose subject is tagged 'S' or 'P'
entries_to_keep = [entry for entry in entries if entry[1] in ('S', 'P')]

In [7]:
subjects_non_dup = list(dict.fromkeys([entry[0] for entry in entries_to_keep]))
verbs_non_dup = list(dict.fromkeys([entry[2] for entry in entries_to_keep]))

#### Removing pairs that are not connected to others

In [8]:
connected = connected_entries(entries_to_keep, subjects_non_dup, verbs_non_dup)
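
The implementation of `connected_entries` is not shown. One possible reading of "not connected to others" is that a pair is dropped when its subject occurs with no other verb and its verb occurs with no other subject. The sketch below implements only that assumed reading; `drop_isolated_pairs` is a hypothetical name, the toy entries are made up, and the real helper (which also takes the subject and verb lists) may apply a different criterion.

```python
from collections import Counter


def drop_isolated_pairs(entries):
    """Keep entries whose subject or verb also occurs in at least one other entry.

    Assumptions: item 0 of an entry is the subject, item 2 is the verb,
    and each (subject, verb) pair appears in at most one entry.
    """
    subject_degree = Counter(entry[0] for entry in entries)
    verb_degree = Counter(entry[2] for entry in entries)
    return [
        entry for entry in entries
        if subject_degree[entry[0]] > 1 or verb_degree[entry[2]] > 1
    ]


toy_entries = [
    ("see", "P", "olema", "V", 10),
    ("see", "P", "tegema", "V", 4),
    ("koer", "S", "haukuma", "V", 2),  # shares neither subject nor verb with another pair
]
connected_toy = drop_isolated_pairs(toy_entries)
print(connected_toy)
```

In this toy data the two "see" pairs survive because they share a subject, while the "koer" pair is removed.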

#### Reducing the number of subjects to a reasonable amount

In [11]:
# Total number of occurrences per subject, summed over all of its verbs
subject_counts = defaultdict(int)

for entry in connected:
    subject_counts[entry[0]] += entry[4]

In [12]:
# Keep the 15000 most frequent subjects as (subject, count) pairs, most frequent first
subjects_to_keep = sorted(subject_counts.items(), key=lambda kv: kv[1], reverse=True)[:15000]

In [14]:
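final_subjects = [subj for subj, count in subjects_to_keep]
# Test membership against a set; scanning the list for every entry is needlessly slow
final_subject_set = set(final_subjects)
final_entries = [entry for entry in connected if entry[0] in final_subject_set]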
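
To make the reduction step concrete, here is a small self-contained illustration on made-up data. It is a sketch, not the notebook's code: it uses `collections.Counter` and its `most_common` method in place of the `defaultdict` plus `sorted` combination, and `top_subjects` is a hypothetical helper name.

```python
from collections import Counter

# Made-up toy entries in the assumed (subject, subject tag, verb, verb tag, count) layout
toy_entries = [
    ("tema", "P", "olema", "V", 5),
    ("tema", "P", "tegema", "V", 2),
    ("mina", "P", "olema", "V", 4),
    ("koer", "S", "haukuma", "V", 1),
]


def top_subjects(entries, limit):
    """Return the `limit` subjects with the highest summed counts, most frequent first."""
    totals = Counter()
    for entry in entries:
        totals[entry[0]] += entry[4]
    return [subject for subject, _ in totals.most_common(limit)]


kept_subjects = top_subjects(toy_entries, limit=2)
kept_set = set(kept_subjects)
filtered = [entry for entry in toy_entries if entry[0] in kept_set]

print(kept_subjects)  # ['tema', 'mina']
print(len(filtered))  # 3
```

With a limit of 2, "tema" (total 7) and "mina" (total 4) are kept and the single "koer" entry is dropped.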

In [15]:
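# Unique verbs in order of first appearance (same approach as for verbs_non_dup);
# avoids the quadratic cost of repeated `not in` checks on a growing list
final_verbs = list(dict.fromkeys(entry[2] for entry in final_entries))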

### Creating the matrix used for LDA

In [16]:
df = matrix_creation(final_entries, final_subjects, final_verbs)
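
`matrix_creation` is also defined in `data_preprocessing` and is not shown. Assuming it produces a subject-by-verb table of counts, the core of such a step can be sketched in plain Python as below. `count_matrix` is a hypothetical name, the toy entries are made up, and the real function may differ (for example, it may build the table with pandas directly).

```python
def count_matrix(entries, subjects, verbs):
    """Build a len(subjects) x len(verbs) nested list of co-occurrence counts."""
    row_of = {subject: i for i, subject in enumerate(subjects)}
    col_of = {verb: j for j, verb in enumerate(verbs)}
    matrix = [[0] * len(verbs) for _ in subjects]
    for entry in entries:
        subject, verb, count = entry[0], entry[2], entry[4]
        # Add rather than assign, in case a pair appears in more than one entry
        matrix[row_of[subject]][col_of[verb]] += count
    return matrix


toy_entries = [
    ("tema", "P", "olema", "V", 5),
    ("tema", "P", "tegema", "V", 2),
    ("mina", "P", "olema", "V", 4),
]
toy_matrix = count_matrix(toy_entries, ["tema", "mina"], ["olema", "tegema"])
print(toy_matrix)  # [[5, 2], [4, 0]]
```

Passing such a nested list to `pandas.DataFrame(toy_matrix, index=subjects, columns=verbs)` would give a labelled table with subjects as rows and verbs as columns.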

In [17]:
df.head()

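| subject | olema | teadma | ütlema | tahtma | saama | tähendama | tegema | lisama | arvama | nägema | ... | kiduma | klõbisema | runnima | viidsima | pritsuma | ketaalima | seiduma | ücima | müübima | juksima |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tema | 210031 | 15303 | 43006 | 22397 | 37392 | 829 | 33115 | 31796 | 7440 | 11059 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mina | 122824 | 48357 | 22122 | 41302 | 39695 | 131 | 29568 | 659 | 30362 | 21060 | ... | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 3 | 0 | 0 |
| see | 498627 | 2305 | 1064 | 868 | 7488 | 38195 | 10684 | 820 | 201 | 1968 | ... | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| mis | 159528 | 607 | 2033 | 1970 | 17968 | 12900 | 11499 | 564 | 739 | 3730 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| kes | 81930 | 8050 | 4846 | 15278 | 16658 | 34 | 13727 | 303 | 2673 | 3735 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |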
