# Data Augmentation

In this notebook we'll develop the functions needed to replace tagged and untagged strings in a text in order to build a larger dataset. It is imperative to preserve the tag information, whose positions may shift when the replacement strings differ in length.

## Imports
### Libraries

In [1]:
import re, os, json, pickle
import spacy
from spacy import displacy
import pandas as pd
import numpy as np

### Datasets

In [2]:
data_dir = '../../data/estatutos/tagged/spacy_dataset_2020-5-6.pkl'

In [None]:
with open(data_dir, 'rb') as file:
    dataset = pickle.load(file)

## Dataset structure and objective function
As we can see below, we have a dictionary with the name of the document as keys and more dictionaries as values.
The structure is as follows:
```python
{
    "company_name" :{
        "doc_id" : "original_name",
        "pages" : {
            'page_number': {
                'text_body_coords': [
                    [int, int],
                    [int, int],
                    [int, int],
                    [int, int]
                ],
                'lectura': 'text_containing in that page'
            }
        },
        'text' : 'the concatenated text string of every page',
        'entities': {
            'search': ['list', 'of', 'entities', 'available', 'to', 'tag'],
            'tags': [
                {
                    'tag': 'name_of_tag',
                    'start': int, # where the entity starts on the string
                    'finish': int, # same logic as above
                    'text': 'string', # with the slice of the text marked before
                    'value': 'string', # what it would look like after parsing
                    'extra1': 'string', # not used
                    'extra2': 'string', #not used
                    'index': int, # for multi entity tags
                }
            ]
        }
         
    }

}
```

#### We're interested in modifying this part
```python
{
    'doc_id' : 'original_name',
    'pages' : {
        'page_number': {
            'text_body_coords': [
                [int, int],
                [int, int],
                [int, int],
                [int, int]
            ],
            'lectura': 'text_containing in that page'
        }
    },
    'text' : 'the concatenated text string of every page',
    'entities': {
        'search': ['list', 'of', 'entities', 'available', 'to', 'tag'],
        'tags': [
            {
                'tag': 'name_of_tag',
                'start': int, # where the entity starts on the string
                'finish': int, # same logic as above
                'text': 'string', # with the slice of the text marked before
                'value': 'string', # what it would look like after parsing
                'extra1': 'string', # not used
                'extra2': 'string', #not used
                'index': int, # for multi entity tags
            }
        ]
    }
}
```


#### To achieve this, we'll create a class that takes that dictionary as input and modifies it accordingly.

In [4]:
d1 = dataset['confecom_srl.']

##### We delete the key that is not currently used

In [5]:
d1.pop('pages')

{'0': {'text_body_coords': [[237, 348],
   [1178, 332],
   [260, 1686],
   [1201, 1671]],
  'lectura': 'O DE SOCIEDAD DE RESPONSABILIDAD LIMITADA. “CONFECOM S.\n\n12)\n\n \n\nROA, argentina, casada en primeras nupcias con Raul Alberto LESCA-\n\nNO, nacida el 10 de marzo de 1966, comerciante, titular del docu-\nmento nacional de identidad número 17.636.488, C.U.I.T.: 27-\n17636488-8, y Yamila Belén LESCANO, argentina, soltera, nacida el\n6 de marzo de 1998, comerciante, titular del documento nácional de\nidentidad número 41.048.026, C.U.I.L.: 27-41048026-9, ambos domici-\nliados en Pico 4785, Ciudad Autónoma de Buenos Aires; resuelven de-\njar constituida una sociedad de responsabilidad limitada que se re-\n\ngirá por la Ley número 19.550/22.903 y el siguiente estatuto:\n\nPRIMERA: La sociedad se denomina: "CONFECOM S.R.L."” y tiene su do-\nmicilio legal en jurisdicción de la Ciudad Autónoma de Buenos Ai-\nres, pudiendo establecer agencias o sucursales en cualquier lugar\n\ndel país o d

### Visualization
We'll create a simple function that transforms the data so it can be displayed in notebooks or HTML using displacy. To achieve this, the final shape of the dictionary should be:

```python
{
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None
}
```

In [6]:
d1['entities'].pop('search')

['firmantes_nombre',
 'firmantes_dni',
 'socios_nombre',
 'razon_social',
 'tipicidad',
 'vigencia',
 'capital',
 'num_escritura',
 'fecha_contrato',
 'mandato_dir',
 'tipo_rep_legal',
 'facultades_firm',
 'aporte_socios',
 'fecha_cierre_ejercicio',
 'objeto_social',
 'cargo_firmantes',
 'legal_domicilio',
 'sede_domicilio',
 'firma_domicilio',
 'poderes_especiales',
 'reservas_legales',
 'garantias_directorio',
 'aumentos_capital',
 'firmante_domicilio',
 'firmante_familiar',
 'fecha_nacimiento',
 'fecha_final_doc']

### We check our output

In [14]:
class TaggedDoc:
    """Wraps one tagged document and exposes it in displacy's manual format."""

    def __init__(self, document):
        self.document = document
        self.spacy_ents = self._displacy_transform().get('ents')
        self.title = document.get('doc_id')
        self.ents = document.get('entities').get('tags')
        self.text = document.get('text')
        self.text_len = len(self.text)
        self.ents_df = pd.DataFrame(self.ents)

    def _displacy_transform(self) -> dict:
        """Build the {'text', 'ents', 'title'} dictionary that displacy expects."""
        ent_list = []
        for tag in self.document['entities']['tags']:
            ent_list.append(
                {
                    'start': tag.get('start'),
                    'end': tag.get('end'),
                    'label': tag.get('tag')
                }
            )

        return {
            'text': self.document.get('text'),
            'ents': ent_list,
            'title': self.document.get('doc_id')
        }

    def render(self, style='ent', jupyter=True, manual=True, page=False):
        """Render the entities; manual=True passes the dictionary as-is to displacy."""
        displacy.render(self._displacy_transform(), style=style, jupyter=jupyter, manual=manual, page=page)

In [15]:
doc = TaggedDoc(d1)

In [9]:
doc.render()

##### With the following list comprehension we get the same list structure as the one found in TaggedDoc.ents
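
A minimal sketch of such a comprehension, assuming the entities live in a DataFrame shaped like `TaggedDoc.ents_df`. The two sample tags and their offsets are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical tags, only to illustrate the round trip list -> DataFrame -> list.
ents = [
    {'tag': 'razon_social', 'start': 4, 'end': 19, 'text': 'CONFECOM S.R.L.'},
    {'tag': 'legal_domicilio', 'start': 40, 'end': 49, 'text': 'Pico 4785'},
]
ents_df = pd.DataFrame(ents)

# One dictionary per row, in row order.
ents_list = [ent for ent in ents_df.T.to_dict().values()]
```

`ents_df.to_dict('records')` produces the same list of row dictionaries without the transpose.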

--------------------------------------------------------------------
## Augmentation functions
There are 2 main functions to be developed:
1) One or more that create random entities.

2) Another that plugs the new entities into the original text and keeps track of the positions.


We need to create one or more functions that generate new entities.
- Name generator: from AFIP
- Date generator: 3 different formats × 366 days = 1098 possibilities.
- Vigencia generator: both spelled and numerical
- Tipicidad generator: 3 types, 6 categories, rather easy.
- Fecha cierre ejercicio generator: spelled, numerical and mixed.
- Mandato dir generator: written years, fiscal years, numeric, mixed and different types of undefined.
- Razon Social generator: from AFIP
- Capital generator: might be tricky
- DNI generator: from AFIP
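
As an illustration of the simplest item in the list, here is a rough sketch of a date generator. The three formats (`numeric`, `spelled`, `dashed`) and the function names are assumptions, since the notebook does not specify which formats it intends to use. The spelled form follows the style seen in the source text ("10 de marzo de 1966").

```python
import random
from datetime import date, timedelta

# Spanish month names, since the documents are written in Spanish.
MONTHS = ['enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio',
          'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre']

# Hypothetical format names; the real three formats are not given in the notebook.
STYLES = ['numeric', 'spelled', 'dashed']


def format_date(d, style):
    """Return the date as a string in one of the assumed formats."""
    if style == 'numeric':
        return f'{d.day:02d}/{d.month:02d}/{d.year}'
    if style == 'spelled':
        return f'{d.day} de {MONTHS[d.month - 1]} de {d.year}'
    if style == 'dashed':
        return f'{d.day:02d}-{d.month:02d}-{d.year}'
    raise ValueError(f'unknown style: {style}')


def random_date(rng, year=2020):
    """Pick a random day of `year` and a random format."""
    days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
    d = date(year, 1, 1) + timedelta(days=rng.randrange(days_in_year))
    return format_date(d, rng.choice(STYLES))


rng = random.Random(0)  # seeded so that runs are reproducible
sample = random_date(rng)
```

With a leap year such as 2020, the three formats over 366 days give the 1098 distinct strings mentioned above.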

Let's begin with the easier ones. We acquired a dataset from AFIP that contains Names and CUITs for each registered individual in the country. We need to parse and randomize the generation of names, for each labeled name category.

The more complicated part is inserting strings of different lengths into the text without losing track of the positions of the other entities.
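
One possible way to do this, sketched in plain Python: walk the entities from left to right, copy the untouched text between them, and carry a running offset. The function name `replace_entities` and the `new_text` key are assumptions for illustration. The `start`/`end` key names follow the code in this notebook, and entities are assumed not to overlap.

```python
def replace_entities(text, ents):
    """Swap each entity for its 'new_text' and return (new_text, new_ents)."""
    pieces = []
    new_ents = []
    cursor = 0  # position reached in the original text
    offset = 0  # accumulated length change of the replacements so far

    for ent in sorted(ents, key=lambda e: e['start']):
        # Copy the untouched text before this entity, then the replacement.
        pieces.append(text[cursor:ent['start']])
        pieces.append(ent['new_text'])

        new_start = ent['start'] + offset
        new_end = new_start + len(ent['new_text'])
        new_ents.append({**ent, 'start': new_start, 'end': new_end,
                         'text': ent['new_text']})

        offset += len(ent['new_text']) - (ent['end'] - ent['start'])
        cursor = ent['end']

    pieces.append(text[cursor:])  # tail after the last entity
    return ''.join(pieces), new_ents


text = 'Ana vive en Pico 4785.'
ents = [
    {'tag': 'socios_nombre', 'start': 0, 'end': 3, 'new_text': 'Yamila'},
    {'tag': 'firmante_domicilio', 'start': 12, 'end': 21, 'new_text': 'Av. Rivadavia 100'},
]
new_text, new_ents = replace_entities(text, ents)
```

A useful sanity check after any replacement is that slicing the new text with each entity's new `start` and `end` returns exactly the replacement string.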

#### Index augmentation

This function allows us to maintain the relationship between the slices of each entity. For every extra character introduced by a replacement, the entities that follow shift their start and finish values by the same amount. The same happens, in the opposite direction, if the new string is shorter.

In [13]:
def index_augmentation(df_ents):
    # Work on a copy sorted by position, so "following" entities are the later rows
    new_ents = df_ents.copy()
    new_ents = new_ents.sort_values('start').reset_index(drop=True)
    new_ents['len'] = new_ents['text'].apply(len)
    new_ents['new_len'] = new_ents['new_text'].apply(len)
    new_ents['diff'] = new_ents['new_len'] - new_ents['len']
    # Bracket access is required: `new_ents.diff` is the DataFrame.diff method, not the column
    indices = new_ents.loc[new_ents['diff'] != 0].index.to_list()

    for index in indices:
        diff = new_ents.loc[index, 'diff']
        # The replaced entity keeps its start; only its end moves
        new_ents.loc[index, 'end'] = new_ents.loc[index, 'end'] + diff
        # Every later entity shifts by the same amount
        new_ents.loc[index+1:, 'start'] = new_ents.loc[index+1:, 'start'] + diff
        new_ents.loc[index+1:, 'end'] = new_ents.loc[index+1:, 'end'] + diff

    return new_ents.to_dict('records')
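
The same bookkeeping can also be expressed without an explicit loop, using a cumulative sum of the length differences. This is an alternative sketch, not the notebook's function: the name `shift_offsets` is hypothetical, and it assumes the same input columns (`start`, `end`, `text`, `new_text`).

```python
import pandas as pd


def shift_offsets(df_ents):
    """Vectorized variant: shift 'start'/'end' by the accumulated length changes."""
    ents = df_ents.sort_values('start').reset_index(drop=True)
    delta = ents['new_text'].str.len() - ents['text'].str.len()
    cumulative = delta.cumsum()

    # A start moves by the changes of all earlier entities;
    # an end additionally absorbs the entity's own change.
    ents['start'] = ents['start'] + (cumulative - delta)
    ents['end'] = ents['end'] + cumulative
    return ents.to_dict('records')


# Made-up example: two entities in 'Ana vive en Pico 4785.'
df = pd.DataFrame([
    {'tag': 'firmante_domicilio', 'start': 12, 'end': 21,
     'text': 'Pico 4785', 'new_text': 'Av. Rivadavia 100'},
    {'tag': 'socios_nombre', 'start': 0, 'end': 3,
     'text': 'Ana', 'new_text': 'Yamila'},
])
shifted = shift_offsets(df)
```

Here the first replacement adds 3 characters and the second adds 8, so the second entity starts 3 characters later and ends 11 characters later than in the original text.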