# Data Augmentation Module

This module provides a suite of tools for working with tagged documents. You can render them or 
create synthetic documents to oversample your dataset. <br>


## Classes

### TaggedDoc

Takes a tagged document in dictionary format and creates an object. <br>

Attributes: <br>

- __TaggedDoc.document__ <br>
Returns the same dictionary that was provided.

- __TaggedDoc.displacy_ents__ <br>
Returns a list of dictionaries with every entity tagged. Each dictionary contains keys: 
`start`, `end` and `label`. <br>

- __TaggedDoc.title__ <br>
Returns the `doc_id` of the document provided. <br>

- __TaggedDoc.text__ <br>
Returns the entire text of the document provided. <br>

- __TaggedDoc.text_len__ <br>
Returns the length of the document measured in characters. <br>

- __TaggedDoc.ents_df__ <br>
Returns a pandas.DataFrame object with the entities tagged. <br>

Methods: <br>

- __TaggedDoc.render__
- __TaggedDoc.index_augmentation__
- __TaggedDoc.save_render__
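
The behavior of these methods is not documented here. As a rough illustration of the general idea behind augmenting a tagged document, the following sketch swaps the text of one entity for a new value and shifts the offsets of later entities so they still point at the right characters. `replace_entity` is a hypothetical helper, not part of `TaggedDoc`, and it assumes non-overlapping entities stored as dictionaries with `start`, `end` and `label` keys.

```python
from typing import Any, Dict, List, Tuple

Entity = Dict[str, Any]


def replace_entity(
    text: str, ents: List[Entity], index: int, new_value: str
) -> Tuple[str, List[Entity]]:
    """Replace the text of ents[index] and re-align all entity offsets.

    Hypothetical helper for illustration; assumes non-overlapping entities.
    """
    start, end = ents[index]["start"], ents[index]["end"]
    # Difference in length between the new value and the old span.
    shift = len(new_value) - (end - start)
    new_text = text[:start] + new_value + text[end:]

    new_ents = []
    for i, ent in enumerate(ents):
        ent = dict(ent)  # copy so the input list is left untouched
        if i == index:
            ent["end"] = start + len(new_value)
        elif ent["start"] >= end:
            # Entities located after the replaced span move by `shift`.
            ent["start"] += shift
            ent["end"] += shift
        new_ents.append(ent)
    return new_text, new_ents


text = "nacida el 10 de marzo de 1966, DNI 17.636.488"
ents = [
    {"start": 10, "end": 29, "label": "fecha_nacimiento"},
    {"start": 35, "end": 45, "label": "firmantes_dni"},
]
new_text, new_ents = replace_entity(text, ents, 0, "6 de marzo de 1998")
```

After the call, slicing `new_text` with the offsets in `new_ents` still yields the date and the DNI, even though the replaced date is one character shorter.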

#### Examples

In [1]:
from nlptools.data_augmentation import TaggedDoc
from nlptools.example import example_data

In [2]:
tagged = TaggedDoc(example_data)

In [3]:
tagged.document

{'doc_id': 'confecom_srl',
 'pages': {'0': {'text_body_coords': [[237, 348],
    [1178, 332],
    [260, 1686],
    [1201, 1671]],
   'lectura': 'O DE SOCIEDAD DE RESPONSABILIDAD LIMITADA. “CONFECOM S.\n\n12)\n\n \n\nROA, argentina, casada en primeras nupcias con Raul Alberto LESCA-\n\nNO, nacida el 10 de marzo de 1966, comerciante, titular del docu-\nmento nacional de identidad número 17.636.488, C.U.I.T.: 27-\n17636488-8, y Yamila Belén LESCANO, argentina, soltera, nacida el\n6 de marzo de 1998, comerciante, titular del documento nácional de\nidentidad número 41.048.026, C.U.I.L.: 27-41048026-9, ambos domici-\nliados en Pico 4785, Ciudad Autónoma de Buenos Aires; resuelven de-\njar constituida una sociedad de responsabilidad limitada que se re-\n\ngirá por la Ley número 19.550/22.903 y el siguiente estatuto:\n\nPRIMERA: La sociedad se denomina: "CONFECOM S.R.L."” y tiene su do-\nmicilio legal en jurisdicción de la Ciudad Autónoma de Buenos Ai-\nres, pudiendo establecer agencias o sucu

In [4]:
tagged.displacy_ents

[{'start': 106, 'end': 126, 'label': 'firmante_familiar'},
 {'start': 138, 'end': 157, 'label': 'fecha_nacimiento'},
 {'start': 223, 'end': 233, 'label': 'firmantes_dni'},
 {'start': 261, 'end': 281, 'label': 'firmantes_nombre'},
 {'start': 312, 'end': 330, 'label': 'fecha_nacimiento'},
 {'start': 396, 'end': 406, 'label': 'firmantes_dni'},
 {'start': 438, 'end': 496, 'label': 'firmante_domicilio'},
 {'start': 529, 'end': 565, 'label': 'tipicidad'},
 {'start': 669, 'end': 685, 'label': 'razon_social'},
 {'start': 736, 'end': 767, 'label': 'legal_domicilio'},
 {'start': 884, 'end': 886, 'label': 'vigencia'},
 {'start': 973, 'end': 1766, 'label': 'objeto_social'},
 {'start': 1822, 'end': 1845, 'label': 'capital'},
 {'start': 2081, 'end': 2104, 'label': 'tipo_rep_legal'},
 {'start': 2113, 'end': 2145, 'label': 'mandato_dir'},
 {'start': 2146, 'end': 2382, 'label': 'facultades_firm'},
 {'start': 2384, 'end': 2642, 'label': 'garantias_directorio'},
 {'start': 3367, 'end': 3378, 'label': 'fe

In [5]:
tagged.title

'confecom_srl'

##### Raw Text
The `text` attribute gives access to the raw text of the document.

In [6]:
tagged.text

'O DE sociedad DE RESPONSABILIDAD LIMITADA. “CONFECOM S. 12) ROA, argentina casada en primeras nupcias con Raul Alberto LESCANO, nacida el 10 de marzo de 1966, comerciante, titular del documento nacional de identidad numero 17.636.488, C.U.I.T. 27-17636488-8, y Yamila Belen LESCANO, argentina soltera, nacida el 6 de marzo de 1998, comerciante, titular del documento nacional de identidad numero 41.048.026, C.U.I.L. 27-41048026-9, ambos domiciliado en Pico 4785, ciudad autonoma de buenos Aires; resuelven dejar constituida una sociedad de responsabilidad limitada que se regira por la Ley numero 19.550/22.903 y el siguiente estatuto primera La sociedad se denomina "CONFECOM S.R.L."” y tiene su domicilio legal en jurisdiccion de la ciudad autonoma de buenos Aires, pudiendo establecer agencias o sucursales en cualquier lugar del pais o del extranjero.- segunda Su duracion es de 99 años contados a partir de la inscripcion en el registro Publico de Comercio. - TERCERA La sociedad tiene por obj
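
The entity dictionaries and the text can be combined to recover the tagged substrings. The sketch below assumes that `start` and `end` are character offsets into the text and that `end` is exclusive, as in Python slicing; the sample text, offsets and the `entity_spans` helper are made up for illustration.

```python
from typing import Dict, List, Tuple


def entity_spans(text: str, ents: List[Dict]) -> List[Tuple[str, str]]:
    """Return (label, substring) pairs for each entity; `end` is exclusive."""
    return [(ent["label"], text[ent["start"]:ent["end"]]) for ent in ents]


sample_text = "Yamila LESCANO, DNI 41.048.026"
sample_ents = [
    {"start": 0, "end": 14, "label": "firmantes_nombre"},
    {"start": 20, "end": 30, "label": "firmantes_dni"},
]
pairs = entity_spans(sample_text, sample_ents)
# pairs == [('firmantes_nombre', 'Yamila LESCANO'), ('firmantes_dni', '41.048.026')]
```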

##### Render
The `render` method outputs the text of the document with all the entities highlighted.

In [7]:
tagged.render()
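
The rendered output is not reproduced here. To give an idea of what entity highlighting involves, this standard-library sketch wraps each entity span in a `<mark>` tag. It is a stand-in, not the module's renderer: `highlight_entities` is a hypothetical name, and the code assumes non-overlapping entities with exclusive `end` offsets.

```python
import html
from typing import Dict, List


def highlight_entities(text: str, ents: List[Dict]) -> str:
    """Wrap each entity span in a <mark> tag titled with its label."""
    parts = []
    cursor = 0
    for ent in sorted(ents, key=lambda e: e["start"]):
        # Plain text between the previous entity and this one.
        parts.append(html.escape(text[cursor:ent["start"]]))
        span = html.escape(text[ent["start"]:ent["end"]])
        label = html.escape(ent["label"])
        parts.append(f'<mark title="{label}">{span}</mark>')
        cursor = ent["end"]
    # Remaining text after the last entity.
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


markup = highlight_entities(
    "DNI 41.048.026",
    [{"start": 4, "end": 14, "label": "firmantes_dni"}],
)
# markup == 'DNI <mark title="firmantes_dni">41.048.026</mark>'
```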

##### Save to disk
The `save_render` method saves an HTML file with the entities highlighted in the text.

In [8]:
tagged.save_render('path/to/file')

## Functions

### `random_date_generator`
```python
Creates a random date in the time span provided. Returns a datetime.date object.

Parameters
----------
start_year : int, optional
    The initial year of the target time span, by default 1900.
start_month : int, optional
    The initial month of the target time span, by default 1.
start_day : int, optional
    The initial day of the target time span, by default 1.
end_year : int, optional
    The final year of the target time span, by default 2050.
end_month : int, optional
    The final month of the target time span, by default 12.
end_day : int, optional
    The final day of the target time span, by default 31.
mapper : dict, optional
    A dictionary-like object with the previous arguments as keys and integers as values, by default None.
seed : int, optional
    The random seed if you're interested in replicating the results, by default None.

Returns
-------
datetime.date
    A random datetime.date object contained in the time span provided.
```
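
One common way to draw a random date from a time span, shown here as a sketch and not as the module's actual implementation, is to sample an integer between the ordinals of the two endpoints. `sample_date` is a hypothetical name; a local `random.Random(seed)` instance makes the result reproducible without touching the global random state.

```python
import datetime
import random
from typing import Optional


def sample_date(
    start: datetime.date, end: datetime.date, seed: Optional[int] = None
) -> datetime.date:
    """Draw a uniformly random date in the inclusive range [start, end]."""
    rng = random.Random(seed)  # local generator, global state is untouched
    # Dates map to consecutive integers, so sampling an ordinal samples a day.
    ordinal = rng.randint(start.toordinal(), end.toordinal())
    return datetime.date.fromordinal(ordinal)


span_start = datetime.date(2005, 5, 10)
span_end = datetime.date(2023, 10, 23)
first = sample_date(span_start, span_end, seed=5)
second = sample_date(span_start, span_end, seed=5)
# first == second, and both fall between span_start and span_end
```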

In [9]:
from nlptools.data_augmentation import random_date_generator

In [10]:
random_date_generator()

datetime.date(1942, 3, 27)

In [11]:
example = {'start_year': 2005,
           'start_month': 5,
           'start_day': 10,
           'end_year': 2023,
           'end_month': 10,
           'end_day': 23}

In [12]:
random_date_generator(mapper=example)

datetime.date(2008, 11, 26)

In [13]:
date = random_date_generator(seed=5)

In [14]:
from nlptools.data_augmentation import date_formatter

-------
### `date_formatter`
```python
Converts a datetime.date object into a formatted string.

Parameters
----------
date : datetime.date
    The date object intended to convert to string.
formality : str, optional
    Different types of formatting, by default 'random'.
include_year : bool, optional
    Whether to include the year in the output, by default True.

Returns
-------
str
    A string with the date in the desired format.

Raises
------
KeyError
    If the formality provided is not among the accepted values.
```
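
The accepted `formality` values are not listed in the docstring. The sketch below shows one way such a formatter could be organized: a dictionary of templates keyed by style, so that an unknown style naturally raises `KeyError`. The style names `'long'` and `'numeric'` and the function `format_date_es` are assumptions made for this example, not the module's real options.

```python
import datetime

MONTHS = [
    "Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio",
    "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre",
]

# Each style maps to (template with year, template without year).
# The style names are hypothetical.
FORMATS = {
    "long": ("{day:02d} de {month} de {year}", "{day:02d} de {month}"),
    "numeric": ("{day:02d}/{num:02d}/{year}", "{day:02d}/{num:02d}"),
}


def format_date_es(
    date: datetime.date, formality: str = "long", include_year: bool = True
) -> str:
    """Format a date in Spanish; raises KeyError for an unknown formality."""
    with_year, without_year = FORMATS[formality]  # KeyError if not accepted
    template = with_year if include_year else without_year
    return template.format(
        day=date.day, num=date.month, month=MONTHS[date.month - 1], year=date.year
    )


sample = datetime.date(2011, 10, 8)
long_form = format_date_es(sample)                        # '08 de Octubre de 2011'
no_year = format_date_es(sample, include_year=False)      # '08 de Octubre'
numeric = format_date_es(sample, formality="numeric")     # '08/10/2011'
```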

In [15]:
date_formatter(date)

'08 de Octubre de 2011'

In [16]:
date_formatter(date, include_year=True)

'ocho del mes de Octubre de dos mil once'

In [17]:
date_formatter(date, include_year=False)

'08 de Octubre'

-------
### `random_name_generator`
```python
random_name_generator(n: int, name_type: str = 'any', seed: int = None) -> list

Creates a list of n names for persons, companies or both.

Parameters
----------
n : int
    The number of names to generate. Maximum supported: 20,000.
name_type : str, optional
    The type of name wanted, can be 'company', 'person' or 'any', by default 'any'.

Returns
-------
list
    A list of length n with the names.

Raises
------
KeyError
    If the type of name provided does not match any of the supported ones.
```

In [18]:
from nlptools.data_augmentation import random_name_generator

In [19]:
random_name_generator(n=3)

['VACA CUA S A', 'SUPER UP S.A.S.', 'Kraupl Walter Denis']

------
### `mandato_generator`
```python
mandato_generator(seed: int = None) -> str

Creates a string typically used for the mandato length of companies.

Parameters
----------
seed : int, optional
    If specified, will return the same value always, by default None.

Returns
-------
str
    A string with a phrase typically used for mandato length.
```
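
The `seed` behavior described in the docstring can be illustrated with a small sketch. It builds a phrase of the form `'9 (nueve) años'` from a seeded generator, so the same seed always yields the same string. The function name `sample_term` and the range of 2 to 10 years are assumptions made for this example, not the module's actual values.

```python
import random
from typing import Optional

# Spanish words for the sampled numbers (the range is illustrative only).
NUMBER_WORDS = {
    2: "dos", 3: "tres", 4: "cuatro", 5: "cinco", 6: "seis",
    7: "siete", 8: "ocho", 9: "nueve", 10: "diez",
}


def sample_term(seed: Optional[int] = None) -> str:
    """Return a phrase such as '9 (nueve) años'; reproducible when seeded."""
    rng = random.Random(seed)  # same seed -> same sequence of choices
    years = rng.choice(sorted(NUMBER_WORDS))
    return f"{years} ({NUMBER_WORDS[years]}) años"


repeatable = sample_term(seed=7) == sample_term(seed=7)  # True
```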

In [20]:
from nlptools.data_augmentation import mandato_generator

In [21]:
mandato_generator()

'9 (nueve) años'

------
### `vigencia_generator`
```python
vigencia_generator(seed: int = None) -> str

Creates a string typically used for the duration of the company.

Parameters
----------
seed : int, optional
    If specified, will return the same value always, by default None.

Returns
-------
str
    A string with a phrase or number used to determine the duration of the company.
```

In [22]:
from nlptools.data_augmentation import vigencia_generator

In [23]:
vigencia_generator()

'cien'

------
### `tipicidad_generator`
```python
tipicidad_generator(seed: int = None) -> str

Creates a string that determines the type of company.

Parameters
----------
seed : int, optional
    If specified, will return the same value always, by default None.

Returns
-------
str
    A string with a phrase containing the type of company.
```

In [24]:
from nlptools.data_augmentation import tipicidad_generator

In [25]:
tipicidad_generator()

'sociedad de responsabilidad limitada'

-------

### `id_generator`
```python
id_generator(cuit=False, seed: int = None) -> str

Generates an Argentine DNI or CUIT in string format.

Parameters
----------
cuit : bool, optional
    If True, will return a CUIT, otherwise a DNI, by default False.
seed : int, optional
    If specified, will return the same value always, by default None.

Returns
-------
str
    A string containing the typical format of a DNI or a CUIT.
```
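
The docstring does not say whether the generated CUIT carries a valid verification digit. For reference, the sketch below applies the standard modulo-11 scheme used for CUIT/CUIL numbers; it reproduces the two identifiers that appear in the example document (`27-17636488-8` and `27-41048026-9`). The helper names are hypothetical and not part of the module.

```python
def cuit_check_digit(first_ten: str) -> int:
    """Compute the CUIT/CUIL verification value for the first ten digits.

    Returns 10 when no single-digit check value exists for the input.
    """
    if len(first_ten) != 10 or not first_ten.isdigit():
        raise ValueError("expected exactly ten digits")
    weights = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)
    total = sum(int(d) * w for d, w in zip(first_ten, weights))
    result = 11 - total % 11
    # A result of 11 maps to 0; a result of 10 is not a usable digit.
    return 0 if result == 11 else result


def format_cuit(prefix: str, dni: str) -> str:
    """Build a 'PP-NNNNNNNN-D' string from a 2-digit prefix and 8-digit DNI."""
    digit = cuit_check_digit(prefix + dni)
    if digit == 10:
        raise ValueError("no valid check digit for this prefix and number")
    return f"{prefix}-{dni}-{digit}"


cuit = format_cuit("27", "17636488")  # '27-17636488-8'
```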

In [26]:
from nlptools.data_augmentation import id_generator

In [27]:
id_generator()

'14.380.480'

------
### `capital_generator`
```python
capital_generator(style: str = 'any', seed: int = None) -> str

Creates a random string to define the funding of a company.

Parameters
----------
style : str, optional
    If set to 'written' will return the words of the amount, 
    if set to 'number', will return just the amount,
    if set to 'mixed', will return a mixture of the previous, by default 'any'.
seed : int, optional
    If specified, will return the same value always, by default None.

Returns
-------
str
    A string containing a phrase, a number or both expressing a money amount.

Raises
------
KeyError
    If the style specified doesn't exist.
```
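
The numeric part of the output uses a dot as the thousands separator, as in `$ 485.000`. A minimal sketch of that formatting, assuming integer amounts (`format_amount` is a hypothetical helper, not the module's code):

```python
def format_amount(amount: int) -> str:
    """Format an integer amount with dots as thousands separators."""
    # Python groups thousands with commas; swap them for dots.
    return "$ " + f"{amount:,}".replace(",", ".")


formatted = format_amount(485000)  # '$ 485.000'
```

The same grouping trick produces DNI-style strings such as `14.380.480` from the integer `14380480`.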

In [28]:
from nlptools.data_augmentation import capital_generator

In [29]:
capital_generator()

'cuatrocientos ochenta y cinco mil ($ 485.000)'

----------
### `aporte_generator`
```python
aporte_generator(share_type: str = 'any', seed: int = None) -> str

Creates a random string containing the amount of shares
that a shareholder is contributing to the company.

Parameters
----------
share_type : str, optional [currently not in use]
    If you want cuotas or acciones in the output, by default 'any'.
seed : int, optional
    If specified, will return the same value always, by default None.

Returns
-------
str
    A string containing the number that a shareholder is giving
    to be part of the company.
```

In [30]:
from nlptools.data_augmentation import aporte_generator

In [31]:
aporte_generator()

'15.000'

-------
### `address_generator`
```python
address_generator(n: int, legal: bool = False, seed: int = None) -> List[str]

Creates a list of strings with fictitious addresses,
using real streets, districts and provinces.

Parameters
----------
n : int
    The length of the output list.
legal : bool, optional
    If set to True, will not return a street, number
    or apartment number, by default False.
seed : int, optional
    If specified, will return the same value always, by default None.

Returns
-------
List[str]
    A list containing n strings that simulate real addresses.
```

In [32]:
from nlptools.data_augmentation import address_generator

In [33]:
aporte_generator()

'355.000'

In [34]:
address_generator(n=1)

['Del Pacifico 1136, de esta ciudad']