In [1]:
import pandas as pd
from google.colab import drive
import os

Let's read in the UN data downloaded from: https://github.com/leslie-huang/UN-named-entity-recognition

Each document is stored in its own .txt file, with one token per row, tab-separated from its named entity label.
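
To make the file layout concrete, here is a minimal sketch that parses a small made-up sample using only the standard library. The sample text and the helper name `parse_tagged_document` are hypothetical, and the sketch assumes exactly two tab-separated fields per row, which may not hold for every file in the repository.

```python
import csv
import io

# Hypothetical sample in the assumed layout: token, tab, label (one pair per row).
SAMPLE = "I\tO\nthank\tO\nBan\tI-PER\nKi-moon\tI-PER\n.\tO\n"

def parse_tagged_document(handle):
    """Return (tokens, labels) read from a tab-separated token/label file."""
    tokens, labels = [], []
    # QUOTE_NONE keeps quote characters as ordinary tokens instead of
    # treating them as field delimiters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed rows
        tokens.append(row[0])
        labels.append(row[1])
    return tokens, labels

tokens, labels = parse_tagged_document(io.StringIO(SAMPLE))
print(tokens)
print(labels)
```

The same function would accept a real file handle, for example one opened with `open(path, encoding='utf-8', newline='')`.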

Mount your Google Drive (for example with `drive.mount('/content/drive')`) and the code below will be able to access this data.

In [3]:
DATA_DIR = '/content/drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/content-metadata-2021/Data'
UN_DATA_DIR = DATA_DIR + '/un_ner_data'

Collect the paths of all the UN files stored in Google Drive, from both the test and training folders.

In [4]:
# folder names as they appear in the shared drive
UN_TEST_DIR = os.path.join(UN_DATA_DIR, 'leslie-huang UN-named-entity-recognition master tagged-test')
UN_TRAIN_DIR = os.path.join(UN_DATA_DIR, 'leslie-huang UN-named-entity-recognition master tagged-training')
un_test_paths = [os.path.join(UN_TEST_DIR, f) for f in os.listdir(UN_TEST_DIR)]
un_train_paths = [os.path.join(UN_TRAIN_DIR, f) for f in os.listdir(UN_TRAIN_DIR)]
un_paths = un_test_paths + un_train_paths

In [5]:
un_paths[:3]

['/content/drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/content-metadata-2021/Data/un_ner_data/leslie-huang UN-named-entity-recognition master tagged-test/ID07653_2015_Myanmar.txt',
 '/content/drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/content-metadata-2021/Data/un_ner_data/leslie-huang UN-named-entity-recognition master tagged-test/ID11153_2013_Turkey.txt',
 '/content/drive/Shareddrives/GOV.UK teams/2020-2021/Data labs/content-metadata-2021/Data/un_ner_data/leslie-huang UN-named-entity-recognition master tagged-test/ID08716_2013_Peru.txt']
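
Note that `os.listdir` returns entries in arbitrary order and includes every entry in the folder, not just .txt files. If a reproducible document order matters, one option is to filter and sort the listing. The sketch below is an assumption about what might be useful rather than part of the original pipeline; the helper name `list_txt_paths` and the demo file names are made up.

```python
import tempfile
from pathlib import Path

def list_txt_paths(directory):
    """Return the .txt files directly inside `directory`, sorted by path."""
    return sorted(str(p) for p in Path(directory).glob("*.txt"))

# Demonstrate on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ID2_2013_B.txt", "ID1_2015_A.txt", "notes.md"]:
        (Path(tmp) / name).write_text("")
    demo_paths = list_txt_paths(tmp)
    demo_names = [Path(p).name for p in demo_paths]

print(demo_names)
```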

Read each UN file into a record: a pair of tuples, the first holding the document's tokens and the second holding the corresponding labels.

In [6]:
un_data = [list(zip(*pd.read_csv(path, sep='\t', header=None).to_records(index=False))) for path in un_paths]
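
The `zip(*...)` idiom in the cell above transposes per-row (token, label) records into one tuple of tokens and one tuple of labels. A minimal illustration with made-up rows, independent of pandas:

```python
# Made-up rows standing in for the (token, label) records of one document.
rows = [("I", "O"), ("thank", "O"), ("Ban", "I-PER"), ("Ki-moon", "I-PER")]

# Unpacking the rows into zip groups the first items together and the
# second items together, i.e. it turns rows into columns.
columns = list(zip(*rows))
tokens, labels = columns

print(tokens)
print(labels)
```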

Convert the records into a data frame with one row per document.

In [7]:
# build the data frame: one row per document
un_df = pd.DataFrame.from_records(un_data, columns=['text_token', 'label_list'])

# drop documents that have no label list
un_df = un_df.loc[~un_df.label_list.isnull()]

# the zip function above creates tuples, so cast them to lists to be consistent
# with our existing data from GOV.UK
# (note: pandas 2.1+ deprecates applymap in favour of DataFrame.map)
un_df = un_df.applymap(list)

# create a column holding each document's tokens joined into a single string
un_df['text'] = un_df.text_token.apply(' '.join)

un_df.head()

,text_token,label_list,text
0,"[First, of, all, ,, I, should, like, to, join,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","First of all , I should like to join previous ..."
1,"[I, wish, to, start, by, extending, our, since...","[O, O, O, O, O, O, O, O, O, O, O, I-PER, I-PER...",I wish to start by extending our sincere congr...
2,"[I, am, pleased, to, congratulate, you, ,, Sir...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","I am pleased to congratulate you , Sir , on yo..."
3,"[I, am, honoured, to, address, the, General, A...","[O, O, O, O, O, O, I-ORG, I-ORG, O, O, O, O, O...",I am honoured to address the General Assembly ...
4,"[On, behalf, of, the, Malian, delegation, ,, I...","[O, O, O, O, I-MISC, O, O, O, O, O, O, O, O, O...","On behalf of the Malian delegation , I would f..."
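
Because `' '.join` only accepts strings, a token that pandas parsed as a missing value would raise a `TypeError` (by default `read_csv` treats strings such as `NA` or `null` as NaN). Whether that happens anywhere in the UN data is not established here. The sketch below shows one way to look for such items, using made-up lists and a hypothetical helper name.

```python
def find_non_string_items(sequence):
    """Return (position, value) pairs for items that are not strings."""
    return [(i, item) for i, item in enumerate(sequence) if not isinstance(item, str)]

# Made-up document in which one token was parsed as a missing value.
demo_tokens = ["I", "thank", float("nan"), "."]
demo_labels = ["O", "O", "O", "O"]

bad_tokens = find_non_string_items(demo_tokens)
bad_labels = find_non_string_items(demo_labels)

print([position for position, _ in bad_tokens])
print(bad_labels)
```

Applied to the data frame, the same helper could be mapped over `un_df.text_token` and `un_df.label_list` to flag documents that need attention.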


Check whether the B- prefix occurs in any of the labels, printing every label list that contains it.

In [8]:
for ll in un_df.label_list:
  if 'B-' in ' '.join(ll):
    print(ll)
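
Printing whole label lists can be hard to scan for long documents. As an alternative sketch, the prefixes can be tallied across all documents so the presence or absence of B- is visible at a glance. The helper name `count_label_prefixes` and the demo labels are made up; with the real data the function would be called on `un_df.label_list`.

```python
from collections import Counter

def count_label_prefixes(label_lists):
    """Count the part of each label before the first hyphen (e.g. 'B', 'I', 'O')."""
    counts = Counter()
    for labels in label_lists:
        for label in labels:
            # 'I-PER' -> 'I'; labels without a hyphen, such as 'O', are kept whole.
            counts[str(label).split("-", 1)[0]] += 1
    return counts

# Made-up label lists for two documents.
demo_label_lists = [["O", "I-PER", "I-PER", "O"], ["I-ORG", "B-ORG", "O"]]
prefix_counts = count_label_prefixes(demo_label_lists)
print(prefix_counts)
```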