## Cleaning factRuEval-2016

This dataset is somewhat challenging to work with: the tokens, spans, objects (labels), and text are all kept in separate files for every single training/test observation.

In summary, the goal of the work below is to get the data into an easily readable text file with tokens on the left and their labels on the right. To do this, we need to join the token file to the span file, and the span file to the object file, because the token file alone does not contain enough information to join it to the object file directly.

Steps:

1. Read in the token data.
2. Read in and process the span data.
3. Process the label data, removing duplicates / "recursive" labels.
4. Combine the tokens, spans, and labels.
5. Filter out unwanted label types ("fact" labels vs. NER labels, because fact-finding was a separate task for this dataset).
6. **(To do next)** Use the B-/I- prefix convention (e.g. B-PER and I-PER) to mark which tokens belong to the same entity (for example, "National Bank of Ukraine").

A couple of these steps are implemented in ways that are unlikely to scale well to large datasets, but they should be adequate for now.
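As a rough illustration of the target format described above (tokens on the left, labels on the right), here is a minimal sketch. The function name `write_token_label_lines`, the tab separator, and the `O` tag for unlabeled tokens are assumptions made for this example; the notebook itself leaves unlabeled tokens as NaN and does not fix an output format.

```python
import io


def write_token_label_lines(rows, out, unlabeled="O"):
    """Write one 'token<TAB>label' line per (token, label) pair.

    `unlabeled` is an assumed placeholder for tokens without a label.
    """
    for token, label in rows:
        tag = label if label is not None else unlabeled
        out.write(f"{token}\t{tag}\n")


# Sample rows taken from the first tokens of book_194
rows = [("Россия", "B-LocOrg"), ("дала", None), ("Украине", "B-LocOrg")]
buffer = io.StringIO()
write_token_label_lines(rows, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `StringIO` buffer would write the same lines to disk.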

In [53]:
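import numpy as np  # used in later cells (np.nan, np.where)
import pandas as pd

# Columns of a .tokens file: token id, character offset, length, token text
token_col_names = ['id', 'begin_index', 'length', 'text']
# Any blank lines are kept (skip_blank_lines=False) and become all-NaN rows,
# which would explain why the numeric columns are displayed as floats below.
test_obj = pd.read_csv('../data/ru/factRuEval-2016/devset/book_194.tokens', 
                       sep=' ', skip_blank_lines=False, header=None)
test_obj.columns = token_col_names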

test_obj[:15]

Unnamed: 0,id,begin_index,length,text
0,210354.0,0.0,6.0,Россия
1,210355.0,7.0,4.0,дала
2,210356.0,12.0,7.0,Украине
3,210357.0,20.0,6.0,кредит
4,210358.0,27.0,2.0,на
5,210359.0,30.0,5.0,сумму
6,210360.0,36.0,1.0,2
7,210361.0,38.0,9.0,миллиарда
8,210362.0,48.0,8.0,долларов
9,210363.0,56.0,1.0,","


In [201]:
test_file = '../data/ru/factRuEval-2016/devset/book_194.spans'

def extract_span_info(filename):
    
    with open(filename) as s:
        info = s.readlines()
    
    intermediate = [l.split() for l in info]
    
    # Keep only the fields needed downstream, in this order:
    # token ids + phrase text (fields 7 onward), span id, mention type,
    # begin index of the span, and number of tokens in the span/phrase
    fields = [(i_rep[7:], i_rep[0], i_rep[1], i_rep[2], i_rep[5]) 
              for i_rep in intermediate]
    
    return fields

extracted = extract_span_info(test_file)
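To make the field positions used by `extract_span_info` easier to follow, here is a hedged sketch that parses a single span line into named fields. The sample line is reconstructed from the tables in this notebook; the meanings given to fields 3 and 4 (`char_length`, `first_token_id`) and their sample values are assumptions, since the notebook never reads them. `parse_span_line` is a hypothetical helper, not part of the original code.

```python
def parse_span_line(line):
    """Split one .spans line into named fields (layout partly assumed)."""
    head, _, tail = line.partition("#")
    (span_id, mention_type, begin_index,
     char_length, first_token_id, n_tokens) = head.split()
    n = int(n_tokens)
    rest = tail.split()
    return {
        "span_id": int(span_id),
        "mention_type": mention_type,
        "begin_index": int(begin_index),
        "char_length": int(char_length),        # assumed meaning of field 3
        "first_token_id": int(first_token_id),  # assumed meaning of field 4
        "n_tokens": n,
        "token_ids": [int(t) for t in rest[:n]],  # first n items after '#'
        "text": " ".join(rest[n:]),               # remaining items are the text
    }


# Reconstructed example line; fields 3 and 4 are illustrative values
example = "26216 org_descr 124 17 210372 2 # 210372 210373 Национальный банк"
parsed = parse_span_line(example)
print(parsed["token_ids"], parsed["text"])
```

Using `n_tokens` to separate ids from text avoids relying on the ids and the text having the same number of whitespace-separated pieces.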

### Let's try something else: treat this as a pure, span-only task where we do NOT bring in any token information from the span data and rely only on the beginning index to join with the tokens.

#### Unfortunately, the span data gives the beginning index only at the span level, not at the token level, so the token-level indices will have to be calculated manually.
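One possible way to recover the token-level information, sketched under two assumptions: a span's `begin_index` equals the `begin_index` of its first token, and the tokens of a span are consecutive in the token file. `tokens_in_span` is a hypothetical helper written for this example; the sample offsets and ids come from the tables in this notebook.

```python
def tokens_in_span(tokens, span_begin, n_tokens):
    """Return the n_tokens consecutive tokens starting at offset span_begin."""
    ordered = sorted(tokens, key=lambda t: t["begin_index"])
    starts = [t["begin_index"] for t in ordered]
    try:
        first = starts.index(span_begin)
    except ValueError:
        return []  # no token starts exactly at the span offset
    return ordered[first:first + n_tokens]


tokens = [
    {"id": 210372, "begin_index": 124, "text": "Национальный"},
    {"id": 210373, "begin_index": 137, "text": "банк"},
    {"id": 210374, "begin_index": 142, "text": "Украины"},
]
# Span 26216 starts at offset 124 and covers 2 tokens
covered = tokens_in_span(tokens, span_begin=124, n_tokens=2)
print([t["text"] for t in covered])
```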

In [324]:
# Drop column 0 (token ids + text); extract_span_info could later be
# rewritten so that it does not return that column at all
span_df = pd.DataFrame(extracted).iloc[:, 1:]
span_df.columns = ['position_id', 'mention_type', 'begin_index', 'n_tokens']

# Convert dtypes for joining with token df
for col in ['position_id', 'n_tokens', 'begin_index']:
    span_df[col] = span_df[col].astype('int')
    
# Reuse `label_dict` from the .objects cell (In [283]), which must be run first.
# Its keys are strings, hence the str() conversion.
get_label = lambda x: label_dict.get(str(x), np.nan)
span_df['label'] = span_df['position_id'].apply(get_label)

span_df

Unnamed: 0,position_id,mention_type,begin_index,n_tokens,label
0,26213,loc_name,0,1,LocOrg
1,26214,loc_name,12,1,LocOrg
2,26215,org_name,67,1,Org
3,26216,org_descr,124,2,Org
4,26217,loc_name,142,1,Org
5,26218,org_descr,137,1,Org
6,38269,org_descr,104,1,Org
7,38270,org_descr,93,2,Org
8,77695,geo_adj,93,1,
9,26219,org_name,226,3,Org


In [367]:
labeled_tokens_df_2 = test_obj.merge(span_df, 
                                    left_on='begin_index',
                                    right_on='begin_index',
                                    how='left')

# Position IDs being associated with multiple mention types created 
# duplicate tokens in the join, so we drop those now
labeled_tokens_df_2.drop_duplicates(subset=['begin_index', 'text'], 
                                    keep='first', inplace=True)

labeled_tokens_df_2['lag(n_tokens)'] = labeled_tokens_df_2['n_tokens'].shift(1)

# label1: a single-token span starts here and the previous token starts no span.
# Note that cond1 is a special case of cond2 below, so label1 never adds
# anything that label2 does not already cover.
cond1 = (labeled_tokens_df_2['n_tokens'] == 1) & (labeled_tokens_df_2['lag(n_tokens)'].isna())
labeled_tokens_df_2['label1'] = np.where(cond1, 'B-' + labeled_tokens_df_2['label'], np.nan)

# label2: a span of any length starts here and the previous token starts no span
cond2 = (labeled_tokens_df_2['n_tokens'] >= 1) & (labeled_tokens_df_2['lag(n_tokens)'].isna())
labeled_tokens_df_2['label2'] = np.where(cond2, 'B-' + labeled_tokens_df_2['label'], np.nan)

# label3: label and lag(n_tokens) are both non-null (indicates being inside of a phrase).
# Caveat: span_df has one row per span, so a token only matches here
# if some span starts at that token.
cond3 = (labeled_tokens_df_2['label'].notna()) & (labeled_tokens_df_2['lag(n_tokens)'].notna())
labeled_tokens_df_2['label3'] = np.where(cond3, 'I-' + labeled_tokens_df_2['label'], np.nan)

# Take the first non-null of label1, label2, label3; if all three are null, use label
# (label will be NaN for non-labeled words, which is what we want)
labeled_tokens_df_2['final_label'] = \
    np.where(labeled_tokens_df_2['label1'].notna(), 
             labeled_tokens_df_2['label1'],
             np.where(labeled_tokens_df_2['label2'].notna(), 
                      labeled_tokens_df_2['label2'],
                     np.where(labeled_tokens_df_2['label3'].notna(), 
                              labeled_tokens_df_2['label3'],
                             labeled_tokens_df_2['label'])))

labeled_tokens_df_2.loc[:, 'text':]

# The four possible labels are Org, Location, Person, and LocOrg
# Now: How do I go from this to a series of labels based on whether a token is
# the beginning or inside of the label?

# Outline the possible situations:
# 1. (label1) A single token has a label and the word before it is not labeled
# 2. (label2) A token is labeled and is the first word of a multi-word phrase
# 3. (label3) A token is labeled and is a non-first word of a multi-word phrase
# 4. Others? (I think that due to sufficient separation of tokens--no two NE words 
#    or phrases sitting directly next to each other--this should be correct, but need
#    to verify)

Unnamed: 0,text,position_id,mention_type,n_tokens,label,lag(n_tokens),label1,label2,label3,final_label
0,Россия,26213.0,loc_name,1.0,LocOrg,,B-LocOrg,B-LocOrg,,B-LocOrg
1,дала,,,,,1.0,,,,
2,Украине,26214.0,loc_name,1.0,LocOrg,,B-LocOrg,B-LocOrg,,B-LocOrg
3,кредит,,,,,1.0,,,,
4,на,,,,,,,,,
5,сумму,,,,,,,,,
6,2,,,,,,,,,
7,миллиарда,,,,,,,,,
8,долларов,,,,,,,,,
9,",",,,,,,,,,
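The situations outlined in the comments above can also be handled by expanding each span over the tokens it covers, rather than by comparing a row with its predecessor. The sketch below is one way to do that, not the notebook's method: `bio_tags` is a hypothetical helper, spans are given as (position of first token in the token list, number of tokens, label), and the rule that earlier spans win on overlap is an assumption chosen to mirror the "keep the first" behaviour used elsewhere in this notebook.

```python
def bio_tags(n_tokens_total, spans):
    """Assign B-/I- tags; spans are (first_token_pos, n_tokens, label)."""
    tags = [None] * n_tokens_total
    for first, length, label in spans:
        positions = range(first, min(first + length, n_tokens_total))
        if any(tags[p] is not None for p in positions):
            continue  # assumed rule: skip spans overlapping a tagged token
        for offset, p in enumerate(positions):
            tags[p] = ("B-" if offset == 0 else "I-") + label
    return tags


tokens = ["Национальный", "банк", "Украины"]
spans = [
    (0, 2, "Org"),  # span 26216: "Национальный банк"
    (1, 1, "Org"),  # span 26218: "банк", overlaps the span above
    (2, 1, "Org"),  # span 26217: "Украины"
]
tags = bio_tags(len(tokens), spans)
print(list(zip(tokens, tags)))
```

Because each span is expanded on its own, two entities that sit directly next to each other still get separate `B-` tags, which is the case the comments above flag as needing verification.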


In [202]:
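def separate_ids_and_text(id_text_list):
    """Pair each token id with its text.

    The input is a flat list whose first half holds token ids and whose
    second half holds the matching token texts,
    e.g. ['210372', '210373', 'Национальный', 'банк'].
    Returns a list of single-entry dicts {token_id: text}.
    """
    text_idx = len(id_text_list) // 2
    return [{token_id: text} for token_id, text
            in zip(id_text_list[:text_idx], id_text_list[text_idx:])]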
    
df = pd.DataFrame(extracted)
df[0] = df[0].apply(separate_ids_and_text)
df = df.explode(0)
df['token_id'] = df[0].apply(lambda x: list(x.keys())[0])
df['text'] = df[0].apply(lambda x: list(x.values())[0])
df.columns = ['0', 'span_id', 'mention_type', 'begin_index', 'span_length', 'token_id', 'text']
df = df[['token_id', 'span_id', 'text', 'span_length', 'begin_index', 'mention_type']]

# Convert dtypes for joining with token df
for col in ['token_id', 'span_length', 'begin_index']:
    df[col] = df[col].astype('int')

In [300]:
df[:10]

Unnamed: 0,token_id,span_id,text,span_length,begin_index,mention_type,label
0,210354,26213,Россия,1,0,loc_name,LocOrg
1,210356,26214,Украине,1,12,loc_name,LocOrg
2,210365,26215,Лента.ру,1,67,org_name,Org
3,210372,26216,Национальный,2,124,org_descr,Org
3,210373,26216,банк,2,124,org_descr,Org
4,210374,26217,Украины,1,142,loc_name,Org
5,210373,26218,банк,1,137,org_descr,Org
6,210370,38269,правительство,1,104,org_descr,Org
7,210369,38270,украинское,2,93,org_descr,Org
7,210370,38270,правительство,2,93,org_descr,Org


In [283]:
# Read in and process the labels
test_labels = '../data/ru/factRuEval-2016/devset/book_194.objects'

# This is NER, so each label can have one or more span IDs associated with it.
# Each line/list `l` holds the label at index 1 and the span IDs from index 2
# up to the hash symbol.
with open(test_labels) as t:
    labs = [l.split() for l in t.readlines()]
    ids_labs_only = [(l[2: l.index('#')], l[1]) for l in labs]
 
# Some label files have redundant tags (spans have multiple labels),
# so we need to only include the first instance of each span.
# Open book_194.objects for an example of this; span 26217, 'Украины',
# is tagged both as part of an Org AND on its own as LocOrg, even though
# it only occurs once in the sentence.
# `label_dict` is a dict with span_ids (as strings) as keys and their labels
# as values. `setdefault` leaves an existing entry untouched, so the first
# label seen for a span wins.
label_dict = {}
for span_ids, label in ids_labs_only:
    for span_id in span_ids:
        label_dict.setdefault(span_id, label)

get_label = lambda x: label_dict.get(x, np.nan)
df['label'] = df['span_id'].apply(get_label)

In [298]:
labeled_tokens_df = test_obj.merge(df, 
                                    left_on=['id', 'begin_index', 'text'], 
                                    right_on=['token_id', 'begin_index', 'text'], 
                                    how='left')

labeled_tokens_df[:20]

# Note to self: span_id may more accurately be called position_id. Rename later

Unnamed: 0,id,begin_index,length,text,token_id,span_id,span_length,mention_type,label
0,210354.0,0.0,6.0,Россия,210354.0,26213.0,1.0,loc_name,LocOrg
1,210355.0,7.0,4.0,дала,,,,,
2,210356.0,12.0,7.0,Украине,210356.0,26214.0,1.0,loc_name,LocOrg
3,210357.0,20.0,6.0,кредит,,,,,
4,210358.0,27.0,2.0,на,,,,,
5,210359.0,30.0,5.0,сумму,,,,,
6,210360.0,36.0,1.0,2,,,,,
7,210361.0,38.0,9.0,миллиарда,,,,,
8,210362.0,48.0,8.0,долларов,,,,,
9,210363.0,56.0,1.0,",",,,,,


In [301]:
# `cond` checks if the join was successful, but the text is unlabeled.
# (that is, "token_id NOT NULL, label IS NULL")
# If this is the case, we know it's a duplicate of a previous token,
# and was duplicated in the join with the span + label data (due to the 
# "redundant tags" issue mentioned above), so we can safely drop these.
cond = ~labeled_tokens_df['token_id'].isna() & labeled_tokens_df['label'].isna() 

# Invert the condition to get only the non-duplicate tokens
labeled_tokens_df = labeled_tokens_df[~cond]

# Now we're quite close to having a df of text and labels!
# Next step is to detect if the span length is > 1 and indicate in the
# label which tokens belong to the same entity. For example,
# "National Bank of Ukraine" should be labeled as a single entity.
# We can use B-/I- prefixes for this (B-PER / I-PER, B-ORG / I-ORG),
# as is usual for NER datasets.
labeled_tokens_df[:13]

Unnamed: 0,id,begin_index,length,text,token_id,span_id,span_length,mention_type,label
0,210354.0,0.0,6.0,Россия,210354.0,26213.0,1.0,loc_name,LocOrg
1,210355.0,7.0,4.0,дала,,,,,
2,210356.0,12.0,7.0,Украине,210356.0,26214.0,1.0,loc_name,LocOrg
3,210357.0,20.0,6.0,кредит,,,,,
4,210358.0,27.0,2.0,на,,,,,
5,210359.0,30.0,5.0,сумму,,,,,
6,210360.0,36.0,1.0,2,,,,,
7,210361.0,38.0,9.0,миллиарда,,,,,
8,210362.0,48.0,8.0,долларов,,,,,
9,210363.0,56.0,1.0,",",,,,,


In [303]:
# The four possible labels are Org, Location, Person, and LocOrg
labeled_tokens_df[:50]

# Oh no, my filter trick didn't remove all the duplicates.
# (See indices 39-40; the output shown below is cut off before them.)

Unnamed: 0,id,begin_index,length,text,token_id,span_id,span_length,mention_type,label
0,210354.0,0.0,6.0,Россия,210354.0,26213.0,1.0,loc_name,LocOrg
1,210355.0,7.0,4.0,дала,,,,,
2,210356.0,12.0,7.0,Украине,210356.0,26214.0,1.0,loc_name,LocOrg
3,210357.0,20.0,6.0,кредит,,,,,
4,210358.0,27.0,2.0,на,,,,,
5,210359.0,30.0,5.0,сумму,,,,,
6,210360.0,36.0,1.0,2,,,,,
7,210361.0,38.0,9.0,миллиарда,,,,,
8,210362.0,48.0,8.0,долларов,,,,,
9,210363.0,56.0,1.0,",",,,,,
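A likely reason the filter misses some duplicates is that a token can belong to several spans that all carry a label (for example, "банк" appears in spans 26216 and 26218, both labeled Org), so the "token_id NOT NULL, label IS NULL" condition never matches those rows. A minimal sketch of a more direct fix is to keep only the first row per token id. `keep_first_per_token` is a hypothetical helper for illustration, and whether the first match is the right one to keep is an assumption.

```python
def keep_first_per_token(rows, key="id"):
    """Keep only the first row seen for each token id, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue  # a later duplicate of an already kept token
        seen.add(row[key])
        unique.append(row)
    return unique


rows = [
    {"id": 210373, "text": "банк", "span_id": 26216, "label": "Org"},
    {"id": 210373, "text": "банк", "span_id": 26218, "label": "Org"},
    {"id": 210374, "text": "Украины", "span_id": 26217, "label": "Org"},
]
deduped = keep_first_per_token(rows)
print([(r["text"], r["span_id"]) for r in deduped])
```

On a DataFrame, `drop_duplicates(subset='id', keep='first')` performs the same operation, as already done for `labeled_tokens_df_2` earlier in this notebook (there with `subset=['begin_index', 'text']`).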
