# Named Entity Recognition on European Parliamentary Proceedings

This project uses Named Entity Recognition (NER) to identify prominent entities (dates, persons, organisations, etc.) mentioned in the European Parliament proceedings corpus. Tagging entities lets institutions working with European data find relevant information more quickly than browsing the full archives. The raw dataset can be found on Kaggle at: https://www.kaggle.com/datasets/nltkdata/europarl

This project only features the English documents listed under 'en' in this dataset. However, it may be possible to use other language models to parse information from other translations in the main corpus. The goal is to produce a dataframe that lists individual documents alongside their respective entities.

The project has the following main sections:
- Importing and Preprocessing text files
- Rendering HTML documents labelling entities
- Creating a dataframe listing the entities within each document

## Importing and Preprocessing text files ##

First, let's list the text files using the glob library.

In [None]:
# Import library
import glob

# The document files are contained in this folder
folder = "C:/Users/dalin/Dropbox/MachineLearning/Entity Recognition/Euro_Parliament/txt/en/"

# List all the .txt files and sort them alphabetically
files = glob.glob(folder + "*.txt")
files.sort()
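
The same listing can also be written with `pathlib`, which avoids building paths by string concatenation. The sketch below is only an illustration under stated assumptions: `list_text_files` is a hypothetical helper that is not part of this project, and the temporary folder with made-up file names stands in for the real `en` folder.

```python
import tempfile
from pathlib import Path


def list_text_files(folder):
    """Return the .txt files in `folder`, sorted alphabetically by name."""
    return sorted(Path(folder).glob("*.txt"), key=lambda p: p.name)


# Demonstration on a throwaway folder (a stand-in for the real corpus folder)
with tempfile.TemporaryDirectory() as tmp:
    for name in ["doc_b.txt", "doc_a.txt", "notes.md"]:
        (Path(tmp) / name).write_text("sample", encoding="utf-8")
    # Path.stem drops the extension, which gives the document title directly
    demo_titles = [p.stem for p in list_text_files(tmp)]

print(demo_titles)
```

Only the two `.txt` files are returned, in alphabetical order, and `notes.md` is ignored.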

NER using spaCy:

spaCy is an open-source Natural Language Processing library that can be used for various tasks. It includes a fast statistical entity recognition system with built-in methods for Named Entity Recognition.

spaCy makes NER tasks straightforward. Although business-specific needs often call for training a model on custom data, the pretrained spaCy models generally perform well on general-purpose text.

Let's get started with the code: first we import spaCy and load the small English model.

In [None]:
import spacy
from spacy import displacy

# Load the spacy model
NER = spacy.load('en_core_web_sm')

Now, we'll preprocess the text to remove unnecessary characters for a cleaner text dataset.

In [None]:
# Import libraries
import re, os
from tqdm import tqdm

# Initialize the lists that will contain the texts and titles of each document
txts = []
titles = []

for n in tqdm(files):
    # Open each file; the with-block closes it once it has been read
    with open(n, encoding='utf-8-sig') as f:
        # Replace every run of characters other than letters, digits, underscores,
        # periods, question marks and exclamation marks with a single space
        data = re.sub('[^a-zA-Z0-9_.?!]+', ' ', f.read())
    # Store the texts and titles of the documents in two separate lists
    txts.append(data)
    titles.append(os.path.basename(n).replace(".txt", ""))

# Print the length, in characters, of each text
[len(t) for t in txts]
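
To see what the cleaning step does to a piece of text, the following minimal sketch applies the same pattern to two made-up sample strings. The helper name `clean` is an assumption introduced for illustration only.

```python
import re

# Same pattern as in the preprocessing loop above
PATTERN = re.compile(r'[^a-zA-Z0-9_.?!]+')


def clean(text):
    """Replace each run of disallowed characters with a single space."""
    return PATTERN.sub(' ', text)


# Made-up sample strings, not taken from the corpus
cleaned_sentence = clean("Mr. Prodi's plan, 1999!")
cleaned_name = clean("José Barroso")

print(cleaned_sentence)  # Mr. Prodi s plan 1999!
print(cleaned_name)      # Jos Barroso
```

Commas, apostrophes, hyphens and accented letters are all replaced, so some names may be tagged differently than they would be in the raw text. Whether that trade-off is acceptable depends on the use case.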

The titles of each document will be converted to a dataframe column. This column can be used as an index to identify each document and its respective entities in the final dataframe.

In [None]:
import pandas as pd

In [None]:
title_dataframe = pd.DataFrame(titles)
title_dataframe.rename(columns = {0 : 'File_Name'}, inplace = True)

This dataframe will be used to indicate the filename in the final table.

In [None]:
title_dataframe.head(10)

## Rendering HTML documents labelling entities ##

In the next section, we'll render HTML documents that clearly highlight the entities within the documents themselves. This should allow a reader to quickly scan a given document in the corpus and see which entities have been tagged. (It should be noted that some documents may be short enough that no entities are discovered.)

In [None]:
for title, t in tqdm(zip(titles, txts), total=len(titles)):
    doc = NER(t)
    # Create an HTML render for each document highlighting the location of known entities.
    html = displacy.render(doc, style="ent", jupyter=False)
    # Build the filename for each document from the titles list created earlier.
    file_name = title + ".html"
    # The with-block closes the file automatically, so no explicit close() is needed.
    with open("C:/Users/dalin/Dropbox/MachineLearning/Entity Recognition/Euro_Parliament/txt/en/Renders/" + file_name, 'w', encoding="utf-8") as fp:
        fp.write(html)
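
Writing the renders fails if the `Renders` folder does not exist yet. One possible way to guard against that is sketched below with `pathlib`. `save_render` is a hypothetical helper, and the placeholder markup stands in for real displacy output.

```python
import tempfile
from pathlib import Path


def save_render(html, title, out_dir):
    """Write `html` to <out_dir>/<title>.html, creating the folder if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{title}.html"
    target.write_text(html, encoding="utf-8")
    return target


# Demonstration in a throwaway folder with placeholder markup
with tempfile.TemporaryDirectory() as tmp:
    saved = save_render("<div>placeholder</div>", "doc_a", Path(tmp) / "Renders")
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
```

In the rendering loop, such a helper could take the string returned by `displacy.render` together with the document title.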


## Creating a dataframe listing the entities within each document ##

Now we will generate a dataframe that adds each document's entities as a new row. The final dataframe can be used as a database to identify the location of prominent entities within the Europarl corpus.

In [None]:
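# DataFrame.append was removed in pandas 2.0, so collect one single-row frame
# per document and concatenate them once at the end
rows = []
for t in tqdm(txts):
    doc = NER(t)
    # Each entity is stored as a (text, label) tuple
    entity_list = [(ent.text, ent.label_) for ent in doc.ents]
    rows.append(pd.DataFrame([entity_list]))
df = pd.concat(rows)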

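Each document's row was created with the index 0, so we reset the index to number the rows consecutively. Resetting moves the old index into an extra column that we do not need, so it can be dropped.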

In [None]:
df.reset_index(inplace = True)
df.head()

In [None]:
# Drop the extra index column
del df[df.columns[0]]
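
The behaviour of the index and of rows with different numbers of entities can be checked on a toy example. This is a minimal sketch, assuming hand-written `(text, label)` tuples in place of real spaCy output. The names `doc_entities` and `demo` are illustrative only.

```python
import pandas as pd

# Hand-written (text, label) tuples standing in for spaCy output
doc_entities = [
    [("Europe", "LOC"), ("1999", "DATE")],
    [("Commission", "ORG")],
]

# One single-row frame per document, stacked on top of each other
frames = [pd.DataFrame([ents]) for ents in doc_entities]
demo = pd.concat(frames)
index_before = list(demo.index)      # every single-row frame starts at 0

# drop=True renumbers the rows without keeping the old index as a column
demo = demo.reset_index(drop=True)
print(index_before, list(demo.index), demo.shape)
```

The shorter row is padded with a missing value so that both rows have the same width. Passing `drop=True` to `reset_index` is one way to reset the index and discard the old one in a single step.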

Join the title dataframe and the entity dataframe so that the title of each document fills the first column under "File_Name".

In [None]:
main_data = title_dataframe.join(df)
main_data.head()

This dataframe can be further summarized into a final dataframe that lists how many times each entity appears in each document. The result is a sparse table (most cells are empty, since each entity appears in only a few documents) that can be used to gauge the importance of an entity within each document.

In [None]:
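# Count how often each (text, label) tuple occurs in every row.
# Series.value_counts is used because the top-level pd.value_counts is deprecated.
summary_data = df.apply(lambda row: row.value_counts(), axis=1)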
summary_data.head()
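
The counting step can be illustrated without pandas. The sketch below uses `collections.Counter` on hand-written tuples. It is not the method used in this project, only a small model of what the summary table contains, and the document titles and entities are made up.

```python
from collections import Counter

# Hand-written (text, label) tuples standing in for spaCy output
doc_entities = {
    "doc_a": [("Europe", "LOC"), ("Commission", "ORG"), ("Europe", "LOC")],
    "doc_b": [("Commission", "ORG")],
}

# One Counter per document: entity tuple -> number of occurrences
counts = {title: Counter(ents) for title, ents in doc_entities.items()}

# Columns of the summary table: every entity seen in any document
all_entities = sorted({ent for ents in doc_entities.values() for ent in ents})

for title, counter in counts.items():
    # Counter returns 0 for entities that do not occur in the document
    print(title, [counter[ent] for ent in all_entities])
```

Here an absent entity is counted as 0, whereas the pandas summary leaves the corresponding cell empty (NaN).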

Add the document titles back to this dataframe as the first column.

In [None]:
final_data = title_dataframe.join(summary_data)

Save the table as a .csv file.

In [None]:
final_data.to_csv("C:/Users/dalin/Dropbox/MachineLearning/Entity Recognition/Euro_Parliament/txt/en/Entity_list.csv")