## Creating rulesets

This step filters the entity lists to pick out the entities we want to turn into rulesets. Not all extracted entities are of good quality, and sometimes only specific entity types are of interest. The examples below show how to work with the long entity lists.

We can work with the full data by reading in the CSV file.

In [2]:
import pandas as pd

data = pd.read_csv('ner_counts.csv')

In [3]:
data

Unnamed: 0.1,Unnamed: 0,entity,label,count
0,0,"('Ühendriigid',)",LOC,1
1,1,"('Leedusse',)",LOC,1
2,2,"('Leedu', 'välisministeeriumi')",ORG,1
3,3,"('Interfaxile',)",ORG,1
4,4,"('Bill', 'Clinton')",PER,2
...,...,...,...,...
4126,4126,"('Külli', 'Hansen', 'Sebra')",PER,1
4127,4127,"('Fancy',)",LOC,1
4128,4128,"('Fun', '&', 'Fancy')",ORG,4
4129,4129,"('Hinnad', 'Sebra', 'galeriist')",ORG,1
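
Picking out one entity type from the full table could look like the following minimal sketch. It uses a small stand-in DataFrame with rows copied from the output above instead of reading the CSV file, and the chosen label and count threshold are illustrative assumptions. Because the entity column stores tuples as text, the sketch parses them with ast.literal_eval first.

```python
import ast
import pandas as pd

# Stand-in for the full table; rows are copied from the output above.
data = pd.DataFrame({
    'entity': ["('Ühendriigid',)", "('Leedu', 'välisministeeriumi')",
               "('Bill', 'Clinton')", "('Fun', '&', 'Fancy')"],
    'label': ['LOC', 'ORG', 'PER', 'ORG'],
    'count': [1, 1, 2, 4],
})

# The entity column holds tuples written as strings, so parse them back.
data['entity'] = data['entity'].apply(ast.literal_eval)

# Keep ORG entities seen at least twice (label and threshold are illustrative).
# Bracket access is needed for 'count' because DataFrame.count is a method.
selected = data[(data['label'] == 'ORG') & (data['count'] >= 2)]

# Join the words of each entity into one surface string.
phrases = selected['entity'].apply(' '.join).tolist()
print(phrases)
```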


However, it might be more practical to use the lists of first words, last words and singletons created in the previous step.

In [5]:
first_words = pd.read_csv('data/first_counts.csv')

In [7]:
last_words = pd.read_csv('data/last_counts.csv')

In [8]:
single_words = pd.read_csv('data/single_counts.csv')

As an example, let's create a ruleset from the 50 most frequent LOC last words.

In [14]:
last_words = last_words.sort_values('LOC',ascending=False)
last_words

Unnamed: 0.1,Unnamed: 0,ORG,PER,LOC
3,Liidu,15.0,0.0,18.0
1776,linna,0.0,0.0,7.0
1786,riik,0.0,0.0,6.0
1773,riikides,0.0,0.0,4.0
70,Liidus,2.0,0.0,4.0
...,...,...,...,...
1668,Annabelile,0.0,1.0,0.0
1719,Veski,0.0,1.0,0.0
1721,Aarma,0.0,1.0,0.0
1749,Kõrtsini,0.0,1.0,0.0


In [17]:
last_words.rename( columns={'Unnamed: 0':'Entity'}, inplace=True )

In [45]:
word_list = last_words['Entity'][:50]
word_list

3                Liidu
1776             linna
1786              riik
1773          riikides
70              Liidus
1835              meri
1833         osariigis
1817              mere
69                Liit
1834              jõgi
1771           Araabia
1810               jõe
1782         Kesklinna
1790      Ühendriikide
1781              lahe
1783              linn
1808             riigi
91              Yorgis
1804             külas
1809            vallas
1793            Jersey
1774             Leedu
1814           Liiduga
1797             külla
1799             Citys
324            Erakond
1801         maakonnas
1824             Kenya
1820           Aafrika
1837         Tammsaare
1822           riikide
1823             Diego
1826     Beltsville'is
1825         Bethesdas
1819        Vabariigis
1827          osariigi
1828           riigile
1829         Tõnismäel
1805              NSVs
1798           16 0 16
1802            riigis
1803           Kaunase
1807    Lelle-Viljandi
1811     Fö
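
Sorting by the LOC count alone also keeps words such as 'Liidu', which appear almost as often under ORG. One possible way to drop such ambiguous words is sketched below, under assumptions: the stand-in rows are copied from the table above, and the loc_share column and both cut-off values are hypothetical choices, not part of the original workflow.

```python
import pandas as pd

# Stand-in for last_words after the word column has been renamed to 'Entity'.
last_words = pd.DataFrame({
    'Entity': ['Liidu', 'linna', 'riik', 'Liidus', 'Veski'],
    'ORG': [15.0, 0.0, 0.0, 2.0, 0.0],
    'PER': [0.0, 0.0, 0.0, 0.0, 1.0],
    'LOC': [18.0, 7.0, 6.0, 4.0, 0.0],
})

# Share of each word's occurrences that were labelled LOC.
total = last_words[['ORG', 'PER', 'LOC']].sum(axis=1)
last_words['loc_share'] = last_words['LOC'] / total

# Hypothetical cut-offs: at least 4 LOC occurrences and a LOC share of 0.9 or more.
min_count, min_share = 4, 0.9
mask = (last_words['LOC'] >= min_count) & (last_words['loc_share'] >= min_share)

loc_words = (last_words[mask]
             .sort_values('LOC', ascending=False)['Entity']
             .tolist())
print(loc_words)
```

With these stand-in rows, 'Liidu' and 'Liidus' are dropped because part of their occurrences are labelled ORG.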

Rulesets can be created from CSV files with the load() method. The first row of the CSV file holds the column names and the second row holds the column types. Every following row contains values that are turned into rules. The file must not contain an index column, so index=False has to be passed when writing it with pandas.
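
To make the file layout described above concrete, here is a minimal sketch that writes a three-word list to an in-memory buffer and prints the resulting text. The three words are taken from the list above; the buffer is used only so the example does not touch the disk.

```python
import io
import pandas as pd

words = pd.Series(['Liidu', 'linna', 'riik'])

# Prepend the type row, as required for the second line of the file.
rows = pd.concat([pd.Series(['string']), words], ignore_index=True)

# Write without the index column; the unnamed column gets the header "0".
buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Line 1: column name, line 2: column type, remaining lines: rule patterns.
print(csv_text)
```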

In [46]:
word_list = pd.concat([pd.Series(['string']), word_list])


In [47]:
pd.DataFrame(word_list).to_csv('top_50.csv',index=False)

In [48]:
from estnltk.taggers.system.rule_taggers import Ruleset
ruleset = Ruleset()
ruleset.load('top_50.csv')

In [49]:
ruleset.static_rules

[StaticExtractionRule(pattern='Liidu', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='linna', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='riik', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='riikides', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Liidus', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='meri', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='osariigis', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='mere', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Liit', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='jõgi', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Araabia', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='jõe', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Kesklinna', attributes={}, group=0, priority=0),
 Stat

-------------------

Here are a few more examples based on the files created in the previous examples.

This is a list of last words labelled LOC, manually filtered to contain geographical entities only.

In [9]:
import pandas as pd
df = pd.read_csv('outputs/geo_loc_100.csv')
word_list = pd.concat([pd.Series(['string']), df.Entity])

In [10]:
pd.DataFrame(word_list).to_csv('geo_rules.csv',index=False)

In [13]:
from estnltk.taggers.system.rule_taggers import Ruleset
ruleset = Ruleset()
ruleset.load('geo_rules.csv')

In [14]:
ruleset.static_rules

[StaticExtractionRule(pattern='jõe', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='mere', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='lahe', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='järve', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='ookeani', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='saarel', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='saare', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='lahel', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='lahes', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='poolsaare', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='poolsaarel', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='saarte', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='saarele', attributes={}, group=0, priority=0),
 

In [15]:
import pickle

with open('geo_ruleset.pkl','wb') as f:
    pickle.dump(ruleset,f)
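
Reading a pickled object back is the mirror image of the dump above. The sketch below shows the round trip with a plain list as a stand-in for the Ruleset instance and a temporary directory instead of the working directory, both of which are assumptions made to keep the example self-contained. When loading a real ruleset, the class of the pickled object has to be importable in the loading session.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the Ruleset instance.
ruleset_stand_in = ['jõe', 'mere', 'lahe']

path = os.path.join(tempfile.mkdtemp(), 'geo_ruleset.pkl')

with open(path, 'wb') as f:
    pickle.dump(ruleset_stand_in, f)

# Only unpickle files from trusted sources: loading can execute arbitrary code.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == ruleset_stand_in)
```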

----------

This example uses the 'entities/' folder created earlier and puts together the ruleset for a LOC tagger from all the manually curated LOC entities.

In [11]:
import pandas as pd
import os
df = pd.concat((pd.read_csv('entities/loc/'+f) for f in os.listdir('entities/loc')))


This is the last chance to filter the data before turning it into rules.

In [14]:
df = df[df['LOC']>100]
df

Unnamed: 0.1,Unnamed: 0,Entity,LOC
0,0,Eesti,6355.0
1,1,Venemaa,1110.0
2,2,Soome,730.0
3,3,Saksamaa,463.0
4,4,Läti,453.0
5,5,Rootsi,448.0
6,6,Leedu,329.0
7,7,Prantsusmaa,329.0
8,8,Itaalia,289.0
9,9,Poola,274.0


In [15]:
word_list = pd.concat([pd.Series(['string']), df.Entity])

In [16]:
word_list

0            string
0             Eesti
1           Venemaa
2             Soome
3          Saksamaa
4              Läti
5            Rootsi
6             Leedu
7       Prantsusmaa
8           Itaalia
9             Poola
10            Iraak
11            Hiina
12        Hispaania
13            Tonga
14          Iisrael
15            Taani
16          Ukraina
17           Jaapan
18    Suurbritannia
19       Austraalia
0             Norra
1            Belgia
dtype: object
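
The repeated index labels in the output above (0 and 1 appear twice) come from concatenating several files that each start numbering at 0. They do not end up in the CSV because the index is not written, but a clean index makes the list easier to inspect. A minimal sketch, assuming two small stand-in frames in place of the curated files (the count for 'Norra' is illustrative) and assuming that repeated entities across files are unwanted:

```python
import pandas as pd

# Stand-ins for two curated files from the LOC folder.
part_a = pd.DataFrame({'Entity': ['Eesti', 'Venemaa'], 'LOC': [6355.0, 1110.0]})
part_b = pd.DataFrame({'Entity': ['Norra', 'Eesti'], 'LOC': [120.0, 6355.0]})

# ignore_index=True renumbers the rows 0..n-1 across all files.
df = pd.concat([part_a, part_b], ignore_index=True)

# Drop repeated entities so that each pattern appears once in the list.
df = df.drop_duplicates(subset='Entity').reset_index(drop=True)

# Prepend the type row, again with a fresh index.
word_list = pd.concat([pd.Series(['string']), df['Entity']], ignore_index=True)
print(word_list.tolist())
```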

In [17]:
pd.DataFrame(word_list).to_csv('all_loc_rules.csv',index=False)

In [18]:
from estnltk.taggers.system.rule_taggers import Ruleset
ruleset = Ruleset()
ruleset.load('all_loc_rules.csv')

In [19]:
ruleset.static_rules

[StaticExtractionRule(pattern='Eesti', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Venemaa', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Soome', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Saksamaa', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Läti', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Rootsi', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Leedu', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Prantsusmaa', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Itaalia', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Poola', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Iraak', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Hiina', attributes={}, group=0, priority=0),
 StaticExtractionRule(pattern='Hispaania', attributes={}, group=0, priority