### Import external dependencies

In [1]:
import numpy as np
import pandas as pd
import textwrap
from pprint import pprint
import re
import sys
import os
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt', quiet=True)

True

### Import internal dependencies

In [2]:
import data_cleaning.data_cleaning as dc

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arthu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Open Unannotated Documents
We get the unnanotatted texts as key, value pairs. This will make it easy to convert them into a dataframe.

In [3]:
path_to_raw = "../data/raw_documents/"

In [4]:
documents = {}

# Get documents that require manual labeling
for filename in os.listdir(path_to_raw):
   if filename.endswith(".txt"):
      with open(os.path.join(path_to_raw, filename), 'r', encoding="utf8") as f:
         key = int(filename.replace('.txt', ''))
         value = f.read()
         documents[key] = value

# Transform documents dictionary into a pandas dataframe
df_docs = pd.DataFrame.from_dict(documents, orient='index')
df_docs.columns = ['text']

print(df_docs.shape)

(568, 1)


#### Data Cleaning
We will use the **data_processing.clear_text** function which was built to clean the specific document we are using here. Since we already know they will be used for NER, the objective of the cleaning is getting the text into a format which will be easy to split into sentences. Each sentence will later be used as an input to our neural network.
There are three aspects we are cleaning:
1. Removing line breaks as they make it very hard for the sentence tokenizer to correctly recognize the begining and ending of sentences in the text
2. Removing repeated whitespaces which are used for the visual formatting of the documents but will generate unnecessary tokens for our neural network.
3. There are a lot of law and document numbers in this documents. There isn't a consistent writing of these numbers which can start as "nº.123", "nº. 123" and "nº 123". We will padronize this occurences to appear as "nº 123" since the punctuation after "nº" makes it harder for the sentence tokenizer to correctly separate the sentences.

In [5]:
df_docs['text'] = df_docs['text'].apply(dc.clear_text)

#### Sentence tokenization
Having cleaned our documents we will split them into sentences. The **data_processing.split_text_sentences** function will return a DataFrame with: The sentences, a unique ID for each sentence and the index of the document each sentence is a part of.

In [None]:
df_sentences = dc.split_text_sentences(df_docs['text'])
pd.set_option("display.max_colwidth", 0)
print(df_sentences.shape)

(2648, 3)


In [8]:
# We create a new column doc_sentence_id that serves as a unique identifier for each sentence in our dataset.add
df_sentences['doc_sentence_id'] = df_sentences['document'].astype(str) + '-' + df_sentences['sentence_id'].astype(str)
df_sentences.head()

Unnamed: 0,document,sentence,sentence_id,doc_sentence_id
0,105798,"Documento gerado sob autenticação Nº LKB.506.405.IRF, disponível no endereço http://www.ufrgs.br/autenticacao Documento certificado eletronicamente, conforme Portaria nº 3362/2016, que institui o Sistema de Documentos Eletrônicos da UFRGS.",0,105798-0
1,105798,"1/1 PORTARIA Nº 1955 de 05/03/2020 O PRÓ-REITOR DE GESTÃO DE PESSOAS DA UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL, no uso de suas atribuições que lhe foram conferidas pela Portaria nº 7684, de 03 de outubro de 2016, do Magnífico Reitor, e conforme o Laudo Médico n°60131, RESOLVE.",1,105798-1
2,105798,"Designar, temporariamente, nos termos da Lei nº 8.112, de 11 de dezembro de 1990, com redação dada pela Lei nº 9.527, de 10 de dezembro de 1997, a ocupante do cargo de ASSISTENTE EM ADMINISTRAÇÃO, do Quadro de Pessoal desta Universidade, FRANCIELE MARQUES ZIQUINATTI (Siape: 1091092 ), para substituir TURENE ANDRADE E SILVA NETO (Siape: 0356721 ), Diretor da Divisão de Preparo da Licitação do Departamento de Aquisição de Bens e Serviços, Código FG-4, em seu afastamento por motivo de Laudo Médico do titular da Função, no período de 30/01/2020 a 15/03/2020, com o decorrente pagamento das vantagens por 46 dias.",2,105798-2
3,105798,MAURÍCIO VIÉGAS DA SILVA Pró-Reitor de Gestão de Pessoas,3,105798-3
4,105799,"Documento gerado sob autenticação Nº BOA.507.615.IRF, disponível no endereço http://www.ufrgs.br/autenticacao Documento certificado eletronicamente, conforme Portaria nº 3362/2016, que institui o Sistema de Documentos Eletrônicos da UFRGS.",0,105799-0


#### Dealing with duplicate sentences
Now that we have our dataset of sentences we'll go to the last step which is identifying and categorizing duplicate sentences. Since we need to manually annotate the dataset for training we can group duplicate sentences to facilitate the anottation process. The document we have are very standardized so there are multiple duplicate sentences in them.

Let's group sentences by their text. Each group will be composed of identical sentences.

In [10]:
duplicate_group = df_sentences.groupby('sentence')

We loop each group of duplicated sentences and create a dictionary where the key will be the index of the first sentence of the group and the value is the doc_sentence_id of all sentences equal to the first.

In [11]:
duplicates = {}
for i in duplicate_group:
    dup_list = i[1]['doc_sentence_id'].tolist()
    duplicates[dup_list[0]] = dup_list

In [18]:
duplicates

{'69967-3': ['69967-3'],
 '69967-4': ['69967-4'],
 '55632-37': ['55632-37'],
 '72999-3': ['72999-3'],
 '55631-32': ['55631-32'],
 '55631-6': ['55631-6'],
 '55631-7': ['55631-7'],
 '55631-26': ['55631-26'],
 '55631-34': ['55631-34'],
 '55631-28': ['55631-28'],
 '55631-35': ['55631-35'],
 '55632-11': ['55632-11'],
 '55632-20': ['55632-20'],
 '55632-21': ['55632-21'],
 '55632-19': ['55632-19'],
 '55631-36': ['55631-36'],
 '55632-12': ['55632-12'],
 '55632-31': ['55632-31'],
 '55631-31': ['55631-31'],
 '55631-29': ['55631-29'],
 '55631-30': ['55631-30'],
 '55632-32': ['55632-32'],
 '55631-17': ['55631-17'],
 '55632-24': ['55632-24'],
 '55632-28': ['55632-28'],
 '55632-25': ['55632-25'],
 '55632-22': ['55632-22'],
 '55631-9': ['55631-9'],
 '55631-33': ['55631-33'],
 '55631-37': ['55631-37'],
 '55631-2': ['55631-2'],
 '55631-3': ['55631-3'],
 '55631-5': ['55631-5'],
 '55631-4': ['55631-4'],
 '32132-1': ['32132-1'],
 '32186-1': ['32186-1'],
 '32077-1': ['32077-1'],
 '32080-1': ['32080-1'],
 '

We create a dataframe which will contain only unique sentences. The sentences that have identic pairs will have all of their indexes in a list in the 'duplicates' column. This will allow us to replicate the labels after the anotattion process.

In [19]:
df_unique_sentences = df_sentences.copy()
df_unique_sentences['duplicates'] = pd.Series(duplicates)#, index=df_sentences.index)
#df_unique_sentences = df_unique_sentences.dropna(subset=['duplicates'])
df_unique_sentences['label'] = ""

In [26]:
pd.Series(duplicates)

69967-3     [69967-3] 
69967-4     [69967-4] 
55632-37    [55632-37]
72999-3     [72999-3] 
55631-32    [55631-32]
               ...    
32135-3     [32135-3] 
37697-5     [37697-5] 
37697-9     [37697-9] 
55715-3     [55715-3] 
55521-3     [55521-3] 
Length: 2070, dtype: object

We can see that the new dataframe has less sentences since each entry is a unique sentence.

In [25]:
df_unique_sentences['duplicates'].unique()

array([nan], dtype=object)

In [21]:
pprint(df_sentences.shape)
pprint(df_unique_sentences.shape)

(2648, 4)
(2648, 6)


#### Bad Sentence Filter
Some sentences are not going to be helpful on our final train and test data. They either have just one or no entities we are interested in. Because of that we'll remove them from our dataset.

In [26]:
pprint(df_unique_sentences.shape)
df_unique_sentences = df_unique_sentences[df_unique_sentences['sentence'].str.len() > 20]
df_unique_sentences = df_unique_sentences[~df_unique_sentences['sentence'].str.startswith("Documento gerado sob")]
df_unique_sentences = df_unique_sentences[~df_unique_sentences['sentence'].str.startswith("Solicitação nº")]
df_unique_sentences = df_unique_sentences[~df_unique_sentences['sentence'].str.startswith("Processo")]
df_unique_sentences = df_unique_sentences[~df_unique_sentences['sentence'].str.contains("RESOLVE")]
pprint(df_unique_sentences.shape)

(2070, 5)
(768, 5)


In [27]:
#pprint(df_unique_sentences.to_json(lines=True, orient = 'records'))
save_path = "../data/unannotated/"
df_docs.to_json(os.path.join(save_path, 'clean_documents_05-12-24.jsonl'),lines=True, orient = 'records')
df_unique_sentences.to_json(os.path.join(save_path, 'unique_sentences-05-12-24.jsonl'),lines=True, orient = 'records')

In [None]:
### 