### Manual entity curation

This notebook filters the entity counts to produce the files that are used for manually curating the rules.

File names used in this notebook:

In [1]:
entities = 'ner_counts_5000_lemmas.csv'
country_list = 'riigid.txt'
first_list = 'data/first_counts_lemma_5000.csv'
last_list = 'data/last_counts_lemma_5000.csv'
single_list = 'data/single_counts_lemma_5000.csv'

This example shows how country names could be curated. In the previous step, all entities were divided into three lists: single-word entities, the first words of multiword entities, and the last words of multiword entities. For simplicity, since most country names consist of one word, this notebook only looks at the single-word entities.

In [2]:
import pandas as pd

single_table = pd.read_csv(single_list)

In [3]:
single_table

Unnamed: 0.1,Unnamed: 0,PER,ORG,LOC
0,Eminem,19.0,0.0,0.0
1,People,2.0,1.0,0.0
2,Hailie,2.0,0.0,0.0
3,Jimmy,7.0,0.0,0.0
4,Jimmyl,1.0,0.0,0.0
...,...,...,...,...
12652,RSbeta@xxxxx.ee,0.0,0.0,6.0
12653,MinNike-vs-mimmi-vs-Liisk,0.0,0.0,2.0
12654,LoLL@xxxxx.ee,0.0,0.0,1.0
12655,Syru,0.0,0.0,1.0


In [4]:
# The 'Unnamed: 0' column holds the entity strings, so give it a meaningful name.
single_table.rename(columns={'Unnamed: 0': 'Entity'}, inplace=True)

Read the country names from the country list file, one name per line.

In [5]:
countries = []
with open(country_list,encoding='UTF-8') as f:
    for line in f:
        countries.append(line.rstrip())

In [6]:
countries

['Ameerika Ühendriigid',
 'Argentina',
 'Austraalia',
 'Belgia',
 'Boliivia',
 'Brasiilia',
 'Colombia',
 'Costa Rica',
 'Dominikaani Vabariik',
 'Egiptus',
 'El Salvador',
 'Ecuador',
 'Etioopia',
 'Filipiinid',
 'Guatemala',
 'Haiti',
 'Hiina',
 'Holland',
 'Honduras',
 'India',
 'Iraak',
 'Iraan',
 'Jugoslaavia',
 'Kanada',
 'Kreeka',
 'Kuuba',
 'Libeeria',
 'Liibanon',
 'Luksemburg',
 'Lõuna-Aafrika Vabariik',
 'Mehhiko',
 'Nicaragua',
 'Norra',
 'Panama',
 'Paraguay',
 'Peruu',
 'Poola',
 'Prantsusmaa',
 'Saudi Araabia',
 'Suurbritannia',
 'Süüria',
 'Taani',
 'Tšehhoslovakkia',
 'Tšiili',
 'Türgi',
 'Ukraina',
 'Uruguay',
 'Uus-Meremaa',
 'Valgevene',
 'Venemaa',
 'Venezuela',
 'Afganistan',
 'Island',
 'Rootsi',
 'Tai',
 'Jeemen',
 'Pakistan',
 'Myanmar',
 'Iisrael',
 'Indoneesia',
 'Albaania',
 'Austria',
 'Bulgaaria',
 'Hispaania',
 'Iirimaa',
 'Itaalia',
 'Jordaania',
 'Kambodža',
 'Laos',
 'Liibüa',
 'Nepal',
 'Portugal',
 'Rumeenia',
 'Sri Lanka',
 'Soome',
 'Ungari',
 'Jaa

Each entity is looked up in the country list to build a filtered table of the countries and the frequencies with which they appeared. The table is then sorted by curation priority, which here is the number of times the entity received a LOC tag.

In [7]:
country_rows = []
for row in single_table.itertuples():
    if row.Entity in countries:
        country_rows.append(row)

In [8]:
country_rows = pd.DataFrame(country_rows).sort_values('LOC',ascending=False)
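
The same filter-and-sort step can also be written without an explicit loop. The following is a minimal sketch, assuming a small made-up table with the same `Entity`, `PER`, `ORG` and `LOC` columns; the names `demo_table`, `demo_countries` and `demo_country_rows` and all counts are illustrative and not taken from the data.

```python
import pandas as pd

# Illustrative data only; the real table is read from single_list.
demo_table = pd.DataFrame({
    'Entity': ['Soome', 'Eminem', 'Rootsi', 'Jimmy'],
    'PER': [0.0, 19.0, 1.0, 7.0],
    'ORG': [1.0, 0.0, 0.0, 0.0],
    'LOC': [5.0, 0.0, 12.0, 0.0],
})
demo_countries = ['Soome', 'Rootsi', 'Norra']

# Keep rows whose Entity is a known country, highest LOC count first.
demo_country_rows = (
    demo_table[demo_table['Entity'].isin(demo_countries)]
    .sort_values('LOC', ascending=False)
)
print(demo_country_rows)
```

Unlike the `itertuples` loop, this variant keeps the original column names and does not add an `Index` column to the result, so it is not a drop-in replacement if later cells rely on that column.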

The subset of the data that we are interested in is saved to a file in the data folder.

In [9]:
country_rows.to_csv('data/countries.csv')

Now the files for curation are created. The function takes three arguments: the input file with the counts (saved in the previous cell), the name of the file to which the rows to curate are written, and the number of rows to write to the output file.

The function compares the input list to the entities already saved under the 'entities/' folder. An entity that is missing from at least one of the category folders goes to the curation file. An entity that has already been curated for all the categories is filtered out and not shown to the user. If the output file already exists, a numeric suffix is added to its name instead of overwriting it. Note that the 'entities/' folder must already contain a subfolder for each category.
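
The filtering rule described above can be illustrated on in-memory data. This is a minimal sketch, not the notebook's function: the helper `fully_curated`, the `curated` dictionary and the candidate names are hypothetical stand-ins for the contents of the category folders.

```python
from collections import Counter

# Hypothetical curated entities per category folder (illustrative only).
curated = {
    'PER': ['Soome', 'Rootsi'],
    'ORG': ['Soome'],
    'LOC': ['Soome', 'Rootsi', 'Norra'],
}

def fully_curated(curated_by_category):
    """Return the entities that appear in every category."""
    counts = Counter()
    for names in curated_by_category.values():
        counts.update(set(names))  # count each entity once per category
    return {name for name, n in counts.items() if n == len(curated_by_category)}

candidates = ['Soome', 'Rootsi', 'Norra', 'Taani']
done = fully_curated(curated)
to_curate = [name for name in candidates if name not in done]
print(to_curate)
```

Here only 'Soome' is present in all three categories, so it is the only candidate left out. The sketch counts each entity once per category; the function in the next cell counts every occurrence across all files, so the two can differ when an entity is listed more than once in the same category.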

In [10]:
import os
from collections import Counter

def create_curation_files(counts_file, output_file, n_next):
    # Load the counts and drop the index column that to_csv wrote out.
    curr_list = pd.read_csv(counts_file)
    curr_list.drop('Unnamed: 0', axis=1, inplace=True)

    # Collect all entities that are already saved in the category folders.
    existing = []
    isdircount = 0
    for column in curr_list.columns:
        category_dir = os.path.join('entities', column)
        if os.path.isdir(category_dir):
            isdircount += 1
            for file in os.listdir(category_dir):
                df = pd.read_csv(os.path.join(category_dir, file))
                existing.extend(df.Entity)

    # Ignore an entity only if it was found once for every category folder.
    counts = Counter(existing)
    to_ignore = {el for el, n in counts.items() if n == isdircount}
    to_curate = curr_list.drop(curr_list[curr_list.Entity.isin(to_ignore)].index)[:n_next]

    # Never overwrite an existing file: append _2, _3, ... to the base name.
    base, ext = os.path.splitext(output_file)
    counter = 2
    while os.path.exists(output_file):
        output_file = base + '_' + str(counter) + ext
        counter += 1
    to_curate.to_csv(output_file)
    print('Created file ' + output_file)

In [11]:
create_curation_files('data/countries.csv','inputs/countries.csv',n_next=20)

Created file inputs/countries_18.csv
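
The `_18` suffix in the output comes from the rule that an existing curation file is never overwritten. The numbering on its own can be sketched as follows; `next_free_path` is a hypothetical helper written for this illustration, and the demonstration runs in a temporary directory rather than in 'inputs/'.

```python
import os
import tempfile

def next_free_path(path):
    """Return path, or path with _2, _3, ... before the extension if taken."""
    base, ext = os.path.splitext(path)
    candidate = path
    counter = 2
    while os.path.exists(candidate):
        candidate = f'{base}_{counter}{ext}'
        counter += 1
    return candidate

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'countries.csv')
    first = next_free_path(target)    # nothing exists yet
    open(first, 'w').close()
    second = next_free_path(target)   # countries.csv is now taken
    open(second, 'w').close()
    third = next_free_path(target)    # countries_2.csv is taken as well
    names = [os.path.basename(p) for p in (first, second, third)]
print(names)
```

Splitting with `os.path.splitext` keeps the whole base name intact, so the numbering still works when the file name itself contains underscores or when a folder name contains a dot.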


At this point, the user should curate the file that was created. Afterwards, the following function saves the results to the 'entities/' folder and places a copy of the curated file in the 'outputs/' folder.

The function takes two arguments: the file that was just curated, and the file name under which the results are saved in the entities folder. Note that an existing file with that name in the entities folder will be overwritten.

In [97]:
def save_manual_curation(curated_file, target_file):
    df = pd.read_csv(curated_file)
    # Keep a copy of the curated file under outputs/ with the same file name.
    df.to_csv(os.path.join('outputs', os.path.basename(curated_file)))
    # Write one Entity/label file per category column.
    for column in df.columns:
        if column not in ('Unnamed: 0', 'Index', 'Entity'):
            series = df[['Entity', column]]
            series.to_csv(os.path.join('entities', column.lower(), target_file))

In [98]:
save_manual_curation('inputs/countries_17.csv','countries_2.csv')
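
The per-category split that the saving step performs can be tried on in-memory data before touching any files. This is a minimal sketch: `split_by_category` is a hypothetical helper, and the 0/1 values in the table are an assumption, since the format of the curated labels is not specified here.

```python
import pandas as pd

# Illustrative curated table; the 0/1 labels are assumed, not from the data.
curated = pd.DataFrame({
    'Entity': ['Soome', 'Rootsi'],
    'PER': [0, 0],
    'ORG': [0, 1],
    'LOC': [1, 1],
})

def split_by_category(df, skip=('Unnamed: 0', 'Index', 'Entity')):
    """Map each lower-cased category name to its Entity/label table."""
    return {
        column.lower(): df[['Entity', column]]
        for column in df.columns
        if column not in skip
    }

per_category = split_by_category(curated)
print(sorted(per_category))
print(per_category['loc'])
```

Each dictionary key corresponds to a subfolder name under 'entities/', and each value is the two-column table that would be written there.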