# Data Augmentation

In this notebook we'll develop the functions required to replace tagged and untagged strings in a text in order to build a bigger dataset. It is imperative to preserve the tag information, whose positions may change when the replacement strings differ in length.

## Imports
### Libraries

In [7]:
import re, os, json, pickle
import spacy
from spacy import displacy
import pandas as pd
import numpy as np

### Datasets

In [8]:
data_dir = '../../data/estatutos/tagged/spacy_dataset_2020-5-6.pkl'

In [9]:
with open(data_dir, 'rb') as file:
    dataset = pickle.load(file)

## Dataset structure and objective function
As we can see below, we have a dictionary with the name of the document as keys and more dictionaries as values.
The structure is as follows:
```python
{
    "company_name" :{
        "doc_id" : "original_name",
        "pages" : {
            'page_number': {
                'text_body_coords': [
                    [int, int],
                    [int, int],
                    [int, int],
                    [int, int]
                ],
                'lectura': 'text_containing in that page'
            }
        },
        'text' : 'the concatenated text string of every page',
        'entities': {
            'search': ['list', 'of', 'entities', 'available', 'to', 'tag'],
            'tags': [
                {
                    'tag': 'name_of_tag',
                    'start': int, # where the entity starts on the string
                    'finish': int, # same logic as above
                    'text': 'string', # with the slice of the text marked before
                    'value': 'string', # what it would look like after parsing
                    'extra1': 'string', # not used
                    'extra2': 'string', #not used
                    'index': int, # for multi entity tags
                }
            ]
        }
         
    }

}
```

#### We're interested in modifying this part
```python
{
    'doc_id' : 'original_name',
    'pages' : {
        'page_number': {
            'text_body_coords': [
                [int, int],
                [int, int],
                [int, int],
                [int, int]
            ],
            'lectura': 'text_containing in that page'
        }
    },
    'text' : 'the concatenated text string of every page',
    'entities': {
        'search': ['list', 'of', 'entities', 'available', 'to', 'tag'],
        'tags': [
            {
                'tag': 'name_of_tag',
                'start': int, # where the entity starts on the string
                'finish': int, # same logic as above
                'text': 'string', # with the slice of the text marked before
                'value': 'string', # what it would look like after parsing
                'extra1': 'string', # not used
                'extra2': 'string', #not used
                'index': int, # for multi entity tags
            }
        ]
    }
}
```
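Since every augmentation step depends on the offsets being right, it may help to check that each tag's `text` really is the slice of the document between `start` and `finish`. The following is a minimal sketch under the assumption that `finish` is an exclusive end offset (as in Python slicing); `find_misaligned_tags` and the `example` document are hypothetical and not part of the dataset.

```python
def find_misaligned_tags(document: dict) -> list:
    """Return the tags whose stored `text` differs from the slice of the document text."""
    text = document['text']
    misaligned = []
    for tag in document['entities']['tags']:
        # Assumption: `finish` is exclusive, so text[start:finish] should equal tag['text'].
        if text[tag['start']:tag['finish']] != tag['text']:
            misaligned.append(tag)
    return misaligned


# Hypothetical document following the structure described above.
example = {
    'doc_id': 'example',
    'text': 'Constituida el 05/01/1990 en Buenos Aires.',
    'entities': {
        'search': ['fecha'],
        'tags': [
            {'tag': 'fecha', 'start': 15, 'finish': 25, 'text': '05/01/1990'},
            {'tag': 'fecha', 'start': 16, 'finish': 26, 'text': '05/01/1990'},  # shifted by one
        ],
    },
}
bad = find_misaligned_tags(example)
```

Running the same check after an augmentation pass would be a cheap way to catch offsets that drifted.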


#### In order to achieve this, we'll create a class that takes that dictionary as input and modifies it accordingly.

In [10]:
d1 = dataset['confecom_srl.']

### Visualization
We'll create a simple function to transform data to be displayed in notebooks or html using displacy. To achieve this, the final shape of the dictionary should be:

```python
{
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None
}
```
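The transformation itself can be sketched without spaCy. Note that the dataset structure above names the end offset `finish`, while displacy expects `end`, so this sketch takes the source key as a parameter. The function name `to_displacy` and the `end_key` parameter are assumptions made for illustration, not part of the class defined below.

```python
def to_displacy(document: dict, end_key: str = 'finish') -> dict:
    """Build the dictionary that displacy expects in manual mode from a tagged document."""
    ents = [
        {'start': tag['start'], 'end': tag[end_key], 'label': tag['tag']}
        # Entities are sorted by their start offset before being handed over.
        for tag in sorted(document['entities']['tags'], key=lambda t: t['start'])
    ]
    return {'text': document['text'], 'ents': ents, 'title': document.get('doc_id')}


# Hypothetical document reusing the sentence from the example above.
sample = {
    'doc_id': None,
    'text': 'But Google is starting from behind.',
    'entities': {'tags': [{'tag': 'ORG', 'start': 4, 'finish': 10, 'text': 'Google'}]},
}
displacy_input = to_displacy(sample)
```

The result could then be passed to `displacy.render(displacy_input, style='ent', manual=True)`.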

### We check our output

In [11]:
""" Here you can find a suite of tools to work with tagged documents. You can render them or 
    create synthetic documents to supersample your dataset.

Classes
-------
- TaggedDoc
    Intended to open tagged documents using the Text Tag Tool.
"""

import pandas as pd
from spacy import displacy


class TaggedDoc:
    """
        Takes a tagged document in dictionary format and creates an object.
    
    Attributes
    --------
    - TaggedDoc.document
        Returns the same dictionary that was provided.
    - TaggedDoc.displacy_ents
        Returns a list of dictionaries with every entity tagged. Each dictionary contains keys: 
        `start`, `end` and `label`.
    - TaggedDoc.title
        Returns the `doc_id` of the document provided.
    - TaggedDoc.text
        Returns the entire text of the document provided.
    - TaggedDoc.text_len
        Returns the length of the document measured in characters.
    - TaggedDoc.ents_df
        Returns a pandas.DataFrame object with the entities tagged.

    Methods
    -------
    - TaggedDoc.render
    - TaggedDoc.index_augmentation


    
    """
    def __init__(self, document):
        if not isinstance(document, dict):
            raise TypeError(f'This class only takes a dictionary as input. You provided a {type(document)}.')
        elif [key for key in ['doc_id', 'text', 'entities'] if key not in document.keys()]:
            raise KeyError('The dictionary must contain the keys `doc_id`, `text` and `entities`.')
        else:
            self.document = document
            self.title = document.get('doc_id')
            self.ents_df = pd.DataFrame(document['entities']['tags'], index=[self.title for x in range(len(document['entities']['tags']))]).sort_values('start')
            self.ents = [values.to_dict() for index, values in self.ents_df.iterrows()]
            self.document['entities']['tags'] = self.ents
            self.displacy_format = self._displacy_transform()
            self.displacy_ents = self.displacy_format.get('ents')
            self.text = document.get('text')
            


    def _displacy_transform(self) -> dict:
        ent_list = []
        
        for tag in self.ents:
            ent_list.append(
                    {
                        'start' : tag.get('start'),
                        'end' : tag.get('end'),
                        'label': tag.get('tag')
                    }
                )

        return {
            'text' : self.document.get('text'),
            'ents' : ent_list,
            'title' : self.document.get('doc_id') 
        }
    
    
    def render(self, style='ent', jupyter=True, manual=True, page=False):
        ents = self.displacy_format.get('ents')
        # Fall back to a placeholder document when there are no entities or when any of them
        # lacks `start`, `end` or `label`. Compare against None because 0 is a valid offset.
        if not ents or any(ent.get(key) is None for ent in ents for key in ('start', 'end', 'label')):
            to_render = {'text': 'Esto es un texto completo con entidades, pero vos no sabes de donde sacarlo.',
                         'ents': [{'start': 46, 'end': 49, 'label': 'GIL'}],
                         'title': 'Por favor leé la documentación.'}
        else:
            to_render = self.displacy_format
        displacy.render(to_render, style=style, jupyter=jupyter, manual=manual, page=page)


    def index_augmentation(self):
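        new_ents = self.ents_df.copy() # work on a copy so the original entities stay untouched
        new_ents = new_ents.sort_values('start')
        new_ents.index = range(new_ents.shape[0])
        # Requires a `new_text` column holding the replacement string of each entity.
        new_ents['len'] = new_ents.text.apply(len)
        new_ents['new_len'] = new_ents.new_text.apply(len)
        new_ents['diff'] = new_ents['new_len'] - new_ents['len']
        # `new_ents.diff` is the DataFrame.diff method, so the column needs bracket access.
        indices = new_ents.loc[new_ents['diff'] != 0].index.to_list()

        for index in indices: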
            diff = new_ents.loc[index, 'diff']
            new_ents.loc[index, 'end'] = new_ents.loc[index, 'end'] + diff
            new_ents.loc[index+1:,'start'] = new_ents.loc[index+1:,'start'] + diff
            new_ents.loc[index+1:,'end'] = new_ents.loc[index+1:,'end'] + diff

        return [tagged_entity for tagged_entity in new_ents.T.to_dict().values()]

In [12]:
doc = TaggedDoc(dataset['krbg.'])

--------------------------------------------------------------------
## Augmentation functions
There are 2 main functions to be developed:
1) One or more that create random entities.

2) Another that plugs the new entities into the original text and keeps track of the positions.


We need to create one or more functions that generate new entities.
- ~Name generator: from AFIP~
- ~Date generator: 3 different formats, 366 days = 1098 possibilities.~
- ~Vigencia generator: both spelled and numerical~ 
- ~Tipicidad generator: 3 types, 6 categories, rather easy.~
- ~Fecha cierre ejercicio generator: spelled, numerical and mixed.~
- ~Mandato dir generator: written years, fiscal years, numeric, mixed and different types of undefined.~
- ~Razon Social generator: from AFIP~
- ~Capital generator: might be tricky~
- ~DNI generator: from AFIP~ Random
- ~Aportes generator~

The rather complicated part is to fit strings of different lengths into the text without losing track of the positions of the other entities.

### 1) Synthetic Dates
We are going to generate 2 functions:
- One to create dates within a certain time interval.
- The other to change the level of formality

In [13]:
import datetime, random, warnings
from tqdm.autonotebook import tqdm, trange

  


In [14]:
def random_date_generator(start_year:int=1900, end_year:int=2050, start_month:int=1, end_month:int=12, start_day:int=1, end_day:int=31, mapper:dict=None, seed:int=None) -> datetime.date:
    if 0 in [start_day, start_month, start_year, end_month, end_year, end_day]:
        warnings.warn('An argument was specified with value 0, returning to default values.')
        mapper = {'start_year':1900, 
                  'end_year':2050, 
                  'start_month':1, 
                  'end_month':12, 
                  'start_day':1, 
                  'end_day':31}
    
    if mapper:
        start_year = mapper['start_year']
        end_year = mapper['end_year']
        start_month = mapper['start_month']
        end_month = mapper['end_month']
        start_day = mapper['start_day']
        end_day = mapper['end_day']
    
    if seed is not None:  # a seed of 0 is valid too
        random.seed(seed)
        
    start_date = datetime.date(start_year, start_month, start_day)
    end_date = datetime.date(end_year, end_month, end_day)
    time_between_dates = end_date - start_date
    days_between_dates = time_between_dates.days
    random_number_of_days = random.randrange(days_between_dates)
    random_date = start_date + datetime.timedelta(days=random_number_of_days)
    
    return random_date

In [15]:
random_date_generator(start_year=2000, end_year=2110)

datetime.date(2096, 11, 17)
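One caveat: `datetime.date(end_year, end_month, end_day)` raises a `ValueError` for dates that do not exist, which happens for instance with `end_month=2` and the default `end_day=31`. A possible workaround, sketched here with a hypothetical helper `safe_date` that is not part of the notebook, is to clamp the day to the last valid day of the month.

```python
import calendar
import datetime


def safe_date(year: int, month: int, day: int) -> datetime.date:
    """Build a date, clamping `day` to the last valid day of the given month."""
    # calendar.monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, min(day, last_day))


end_of_february = safe_date(2021, 2, 31)
end_of_leap_february = safe_date(2020, 2, 31)
```

Whether clamping or raising is preferable depends on how strictly the interval bounds should be interpreted.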

In [16]:
from num2words import num2words
def date_formatter(date: datetime.date, formality: str = 'random', include_year: bool = True) -> str:
    formality_list = ['basic', 'basic2', 'mixed', 'mixed2', 'regular', 'formal', 'veryformal', 'random']
    
    if formality not in formality_list:
        raise KeyError(f'Keyword `{formality}` not found. Argument `formality` must be one of {formality_list}.')
    elif formality == 'random':
        # random.randint is inclusive on both ends: indices 0-5 leave `veryformal` out of the random pool.
        formality = formality_list[random.randint(0,5)]
    
    month_mapper = {
        '01':'Enero',
        '02':'Febrero',
        '03':'Marzo',
        '04':'Abril',
        '05':'Mayo',
        '06':'Junio',
        '07':'Julio',
        '08':'Agosto',
        '09':'Septiembre',
        '10':'Octubre',
        '11':'Noviembre',
        '12':'Diciembre'}

    # Zero-padded day and month, e.g. 5 -> '05'.
    day = f'{date.day:02d}'
    month = f'{date.month:02d}'
    
    # These are computed for every formality because the dictionary below evaluates
    # all of its entries eagerly, whichever one ends up being selected.
    day_words = num2words(date.day, lang='es', to='ordinal' if formality == 'veryformal' else 'cardinal')
    month_words = month_mapper[month]
    year = f'{date.year}'
    year_words = num2words(date.year, lang='es')
    
    format_mapper = {
        'basic': f'{day}-{month}-{year}' if include_year else f'{day}-{month}',
        'basic2': f'{day}/{month}/{year}' if include_year else f'{day}/{month}',
        'mixed': f'{day} de {month_words} de {year}' if include_year else \
                 f'{day} de {month_words}',
        'mixed2':f'{day} de {month_words} de {year_words}' if include_year else \
                 f'{day} de {month_words}',
        'regular': f'{day_words} de {month_words} de {year_words}' if include_year else \
                   f'{day_words} de {month_words}',
        'formal': f'{day_words} del mes de {month_words} de {year_words}' if include_year\
                  else f'{day_words} del mes de {month_words}',
        'veryformal': f'{day_words} día del mes de {month_words} del año {year_words}' \
                    if include_year else f'{day_words} día del mes de {month_words}'
    }
    result = format_mapper[formality]
    
    return result

In [17]:
date_formatter(random_date_generator(start_year=1450, end_year=2110))

'05 de Enero de 1844'

### 2) Index augmentation

This function lets us maintain the alignment between each entity and its slice of the text. For every extra character introduced by a replacement, all the following entities shift their start and finish values by the same amount. The same happens, in the opposite direction, if the new string is shorter.

In [18]:
def index_augmentation(self):
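    new_ents = self.ents_df.copy() # work on a copy so the original entities stay untouched
    new_ents = new_ents.sort_values('start')
    new_ents.index = range(new_ents.shape[0])
    # Requires a `new_text` column holding the replacement string of each entity.
    new_ents['len'] = new_ents.text.apply(len)
    new_ents['new_len'] = new_ents.new_text.apply(len)
    new_ents['diff'] = new_ents['new_len'] - new_ents['len']
    # `new_ents.diff` is the DataFrame.diff method, so the column needs bracket access.
    indices = new_ents.loc[new_ents['diff'] != 0].index.to_list()

    for index in indices: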
        diff = new_ents.loc[index, 'diff']
        new_ents.loc[index, 'end'] = new_ents.loc[index, 'end'] + diff
        new_ents.loc[index+1:,'start'] = new_ents.loc[index+1:,'start'] + diff
        new_ents.loc[index+1:,'end'] = new_ents.loc[index+1:,'end'] + diff

    return [tagged_entity for tagged_entity in new_ents.T.to_dict().values()]
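To make the offset bookkeeping concrete, here is a dependency-free sketch of the same idea that also rebuilds the text. It assumes each entity carries `start`, `end`, `text` and a `new_text` replacement, with `end` exclusive; the function `replace_entities` and the sample sentence are hypothetical and only illustrate the technique.

```python
def replace_entities(text: str, ents: list) -> tuple:
    """Swap each entity's text for its `new_text` and recompute every offset."""
    parts = []
    new_ents = []
    cursor = 0  # position reached in the original text
    shift = 0   # accumulated length difference of the replacements made so far
    for ent in sorted(ents, key=lambda e: e['start']):
        parts.append(text[cursor:ent['start']])  # untouched text before the entity
        parts.append(ent['new_text'])
        new_start = ent['start'] + shift
        shift += len(ent['new_text']) - (ent['end'] - ent['start'])
        new_ents.append({**ent, 'start': new_start, 'end': ent['end'] + shift,
                         'text': ent['new_text']})
        cursor = ent['end']
    parts.append(text[cursor:])  # untouched text after the last entity
    return ''.join(parts), new_ents


original = 'Juan Perez, DNI 1.234.567, constituye ACME SA.'
entities = [
    {'tag': 'dni', 'start': 16, 'end': 25, 'text': '1.234.567', 'new_text': '22.333.444'},
    {'tag': 'nombre', 'start': 0, 'end': 10, 'text': 'Juan Perez', 'new_text': 'Maria Gomez Lopez'},
]
augmented_text, augmented_ents = replace_entities(original, entities)
```

After the call, slicing `augmented_text` with each new `start` and `end` should give back the entity's new text, which is the invariant the augmentation must preserve.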

### 3) Synthetic Names, DNIs and CUITs
We're going to create a function that lets us draw random names from a database obtained from AFIP.

In [19]:
import pandas as pd
names = pd.read_csv('../../data/estatutos/external_sources/afip_names_cuits.csv', dtype='str')

In [20]:
import numpy as np
names['name'] = names.concatenado.apply(lambda x: x[:30] if isinstance(x, str) else np.nan)

In [21]:
names['name'] = names['name'].str.replace(r'\s{2,}', '', regex=True)  # raw string avoids the invalid-escape warning

In [22]:
names = names.dropna()

#### Companies and people
Since the list includes companies, we can use them to create synthetic "Razon Social" values. Keep in mind that we'll also include them as people's names, since a company can be a founder of a new company just like any regular person. We're also going to throw away governments, municipalities, companies in formation and those that filed for bankruptcy.

As we can see, there are 68352 different companies. We'll use that as a mask to create a new dataframe

In [23]:
company_mask = names['name'].str.contains(r'\bSA$|\bSRL$|\bSAS$|\bSAU$|\bSASU$|\bSAIC$|\bSACIF$')

In [24]:
names.loc[company_mask]

Unnamed: 0,concatenado,cuit,dni,name
68968,DE MIRANDA EURICO SA,20056190415,05619041,DE MIRANDA EURICO SA
138919,DE LAVALLAZ ALBERTO ANTONIO SAACEXA,20078365863,07836586,DE LAVALLAZ ALBERTO ANTONIO SA
269324,GARCIA CASTELLANOS FERNANDO SANINIF,20111927082,11192708,GARCIA CASTELLANOS FERNANDO SA
334800,GRECO FRANCISCO ANTONIO SA NINIA,20120626575,12062657,GRECO FRANCISCO ANTONIO SA
402647,SOLDANO ALVARO GERMAN DEL SA NINIA,20128898558,12889855,SOLDANO ALVARO GERMAN DEL SA
...,...,...,...,...
4955893,BE3 SRL,34686133305,68613330,BE3 SRL
4955894,DIFEMA SRL,34686233318,68623331,DIFEMA SRL
4955895,RUBRO PUBLICIDAD SA,34686233482,68623348,RUBRO PUBLICIDAD SA
4955896,EAGLEMP TURISMO SRL,34686333320,68633332,EAGLEMP TURISMO SRL


We found a problem: the dataset wasn't properly cleaned. As we can see, some people have "SA" at the end of their entry, which might be part of their names. The good part is that the data frame seems to be sorted: first people, then companies, then government bodies. Let's begin with government offices, since we really don't need them.

In [25]:
guv_name = names['concatenado'].str.contains('GOBIER|MUNICIP|UNIVERSIDA|BIBLIO|ASOCIAC|ESCUELA|COLEGIO|HOSPITAL|SANATORIO|CLINICA|INSTITUT|SECRETA|MINISTER|OFICINA')
guv_cuit = names.cuit.str.contains(r'^30|^33|^34')

In [26]:
names.loc[guv_name & guv_cuit]

Unnamed: 0,concatenado,cuit,dni,name
4482664,ASOCIACION MUTUAL DAN,30500045198,50004519,ASOCIACION MUTUAL DAN
4482667,ASOCIACION ARGENTINA DE COMPA#,30500048782,50004878,ASOCIACION ARGENTINA DE COMPA#
4482685,INSTITUTO AUTARQUICO,30500055509,50005550,INSTITUTO AUTARQUICO
4482696,INSTITUTO COOPERATIVO,30500060588,50006058,INSTITUTO COOPERATIVO
4482700,INSTITUTO ASEGURADOR,30500063242,50006324,INSTITUTO ASEGURADOR
...,...,...,...,...
4955887,BIBLIOTECA POPULAR DE LOS,34684033427,68403342,BIBLIOTECA POPULAR DE LOS
4955891,ASOCIACION CIVIL COMITE DE,34685233411,68523341,ASOCIACION CIVIL COMITE DE
4955899,GOBIERNO DE LA CIUDAD DE,34999032089,99903208,GOBIERNO DE LA CIUDAD DE
4955901,MUNICIPALIDAD DE LA MATANZA,34999257560,99925756,MUNICIPALIDAD DE LA MATANZA


So, those 30653 entities go.

In [27]:
names = names.loc[~(guv_name & guv_cuit)]

We go back to extract **Companies** with our acquired knowledge. Plus, we found that some companies have SAS and others S.A.S., so we'll include those as well

In [28]:
company_mask = names['name'].str.contains(r'\bSA$|\bSRL$|\bSAS$|\bSAU$|\bSASU$|\bSAIC$|\bSACIF$')
company_dots = names['name'].str.contains(r'\bS\.A\.$|\bS\.R\.L\.$|\bS\.A\.S\.$|\bS\.A\.U\.$|\bS\.A\.S\.U\.$|\bS\.A\.I\.C\.$|\bS\.A\.C\.I\.F\.$')
company_spaces = names['name'].str.contains(r'\bS\sA$|\bS\sR\sL$|\bS\sA\sS$|\bS\sA\sU$|\bS\sA\sS\sU$|\bS\sA\sI\sC$|\bS\sA\sC\sI\sF$')
company_cuit = names.cuit.str.contains(r'^30|^33|^34')

In [29]:
companies = names.loc[(company_mask | company_dots | company_spaces) & (company_cuit)]
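The three masks above could also be expressed as a single pattern. The sketch below builds one with the standard `re` module, allowing an optional dot or whitespace after each letter of the suffix. It is slightly more permissive than the original masks (it also accepts mixed forms such as `S.A` without the final dot), and the names `SUFFIXES`, `suffix_pattern` and `COMPANY_RE` are hypothetical.

```python
import re

SUFFIXES = ['SA', 'SRL', 'SAS', 'SAU', 'SASU', 'SAIC', 'SACIF']


def suffix_pattern(suffix: str) -> str:
    """Turn 'SRL' into a pattern that matches 'SRL', 'S.R.L.', 'S R L' and similar forms."""
    letters = [re.escape(ch) for ch in suffix]
    # Each letter may be followed by a dot or a whitespace; the last one by an optional dot.
    return r'[.\s]?'.join(letters) + r'\.?'


# Non-capturing group, anchored at a word boundary and at the end of the name.
COMPANY_RE = re.compile(r'\b(?:' + '|'.join(suffix_pattern(s) for s in SUFFIXES) + r')$')

samples = ['BANCO DE CORRIENTES SA', 'BANCO DEL CHUBUT S.A.', 'BANCO DE SAN JUAN S A',
           'DIFEMA S.R.L.', 'MARIA ROSA', 'GARCIA LOPEZ JUAN']
matches = [bool(COMPANY_RE.search(name)) for name in samples]
```

With pandas, `COMPANY_RE.pattern` can be passed to `Series.str.contains`. Like the original masks, the pattern alone would still match people whose entry ends in "SA", so the CUIT prefix filter remains necessary.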

In [30]:
companies

Unnamed: 0,concatenado,cuit,dni,name
4482617,BANCO BBVA ARGENTINA S.A.,30500003193,50000319,BANCO BBVA ARGENTINA S.A.
4482625,BANCO DE SAN JUAN S A,30500009442,50000944,BANCO DE SAN JUAN S A
4482627,BANCO DE CORRIENTES SA,30500010602,50001060,BANCO DE CORRIENTES SA
4482634,BANCO DEL CHUBUT S.A.,30500012990,50001299,BANCO DEL CHUBUT S.A.
4482636,CHUBB SEGUROS ARGENTINA S A,30500016260,50001626,CHUBB SEGUROS ARGENTINA S A
...,...,...,...,...
4955893,BE3 SRL,34686133305,68613330,BE3 SRL
4955894,DIFEMA SRL,34686233318,68623331,DIFEMA SRL
4955895,RUBRO PUBLICIDAD SA,34686233482,68623348,RUBRO PUBLICIDAD SA
4955896,EAGLEMP TURISMO SRL,34686333320,68633332,EAGLEMP TURISMO SRL


Now LOOK AT THAT! We went from 68000 to 205500 companies.

In [31]:
companies = companies.drop('concatenado', axis=1)

In [32]:
companies.to_csv('../../data/estatutos/external_sources/companies.csv', index=False)

In [33]:
persons = names.loc[~((company_mask | company_dots | company_spaces) & (company_cuit)), ['cuit', 'dni', 'name']]

In [34]:
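# Parenthesize the whole disjunction so that `~` negates all three conditions, not just the first one.
persons = persons.loc[~((persons.name=='') | (persons.name=='  ') | (persons.name=='   '))]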

In [35]:
persons.to_csv('../../data/estatutos/external_sources/persons.csv', index=False)

In [36]:
def random_name_generator(name_type:str, n:int):
    if n > 20000:
        warnings.warn('Number of samples too big, could cause the application to break. Returning 20.000 samples.')
        n = 20000
    possible_types = ['company', 'person', 'any']
    if name_type not in possible_types:
        raise KeyError(f'{name_type} is not a valid option. Please choose one of the following {possible_types}')
    
    if name_type == 'company':
        names = pd.read_csv('../../data/estatutos/external_sources/companies.csv', dtype=str)['name']
    elif name_type == 'person':
        names = pd.read_csv('../../data/estatutos/external_sources/persons.csv', dtype=str)['name']
    elif name_type == 'any':
        persons = pd.read_csv('../../data/estatutos/external_sources/persons.csv', dtype=str)['name'].sample(10000)
        companies = pd.read_csv('../../data/estatutos/external_sources/companies.csv', dtype=str)['name'].sample(10000)
        names = pd.concat([persons, companies])
        del persons, companies
    
    
    result = names.sample(n).to_numpy()
    
    return result

### 4) Random Mandato
We created a function that returns all the values expected for mandato.
We use `np.random.randint(0,3)` at the beginning to give a 2/3 chance of returning a finite term, and a 1/3 chance of returning a term that lasts as long as the company itself.

In [37]:
def mandato_generator():
    if np.random.randint(0,3):
        years = np.random.randint(1,11)
        keywords = ['años', 'ejercicios'][np.random.randint(0,2)]
        years_words = num2words(years, lang='es')
        random_year = [years, years_words, f'{years_words} ({years})', f'{years} ({years_words})'][np.random.randint(0,4)]
        result = f'{random_year} {keywords}'
    else:
        result = ['término de duración de la sociedad', 'plazo de duración de la sociedad', 'vencimiento de la sociedad'][np.random.randint(0,3)]
    
    return result

In [38]:
mandato_generator()

'dos (2) años'
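The 2/3 versus 1/3 split follows from `np.random.randint(0,3)` drawing uniformly from {0, 1, 2} (the upper bound is exclusive) and from 0 being the only falsy value. A quick empirical check is sketched below; it uses the standard library's `random.randrange(3)`, which draws from the same range, only to keep the snippet free of dependencies.

```python
import random

rng = random.Random(0)  # local generator so the global random state is left alone
draws = [rng.randrange(3) for _ in range(30_000)]  # same range as np.random.randint(0, 3)

# A draw of 1 or 2 is truthy and would take the "finite term" branch.
finite_share = sum(1 for d in draws if d) / len(draws)
```

`finite_share` should land close to 0.667, and the remaining third corresponds to the open-ended wording.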

### 5) Random Vigencia

In [39]:
def vigencia_generator():
    years = np.random.randint(1,101)
    years_words = num2words(years, lang='es')
    salad = [years, years_words, f'{years_words} ({years})', f'{years} ({years_words})'][np.random.randint(0,4)]
    
    return salad

In [40]:
vigencia_generator()

'trece'

### 6) Random tipicidad

In [41]:
def tipicidad_generator():
    company_type = ['sociedad de responsabilidad limitada', 
                    'sociedad anónima',
                    'sociedad por acciones simplificada',
                    'sociedad anónima unipersonal',
                    'sociedad por acciones simplificada unipersonal'][np.random.randint(0, 5)]
    style = ['lower', 'upper', 'title'][np.random.randint(0,3)]
    
    if style == 'lower':
        result = company_type
    elif style == 'upper':
        result = company_type.upper()
    elif style == 'title':
        result = company_type.title()
    
    return result

In [42]:
tipicidad_generator()

'sociedad de responsabilidad limitada'

### 7) Random DNI and Random CUIT

In [43]:
def id_generator(cuit=False):
    millions = np.random.randint(0,100)
    thousands = np.random.randint(0,1000)
    hundreds = np.random.randint(0,1000)
    
    
    if thousands < 10:
        thousands = f'00{thousands}'
    elif thousands < 100:
        thousands = f'0{thousands}'
    else: 
        thousands = f'{thousands}'

    if hundreds < 10:
        hundreds = f'00{hundreds}'
    elif hundreds < 100:
        hundreds = f'0{hundreds}'
    else: 
        hundreds = f'{hundreds}'
        
    result = f'{millions}.{thousands}.{hundreds}'

    if cuit:
        millions = f'0{millions}' if millions < 10 else f'{millions}'
        beginning = np.random.randint(20,36)
        end = np.random.randint(1,10)
        result = f'{beginning}-{millions}{thousands}{hundreds}-{end}'
    
    return result

In [44]:
id_generator(True)

'21-13349191-9'
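The generator above picks the last digit of the CUIT at random. If realistic verification digits were ever needed, they could be derived with the commonly described modulo 11 scheme sketched below. This is an assumption added for illustration and not part of the original notebook, although the sample CUITs in the AFIP listings shown earlier are consistent with it. `CUIT_WEIGHTS` and `cuit_check_digit` are hypothetical names.

```python
CUIT_WEIGHTS = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)


def cuit_check_digit(first_ten: str):
    """Return the verification digit for the first ten digits of a CUIT, or None if there is none."""
    if len(first_ten) != 10 or not first_ten.isdigit():
        raise ValueError('Expected exactly ten digits.')
    total = sum(int(digit) * weight for digit, weight in zip(first_ten, CUIT_WEIGHTS))
    check = 11 - total % 11
    if check == 11:
        return 0
    if check == 10:
        return None  # this prefix and number combination has no single-digit verifier
    return check


# First ten digits of three CUITs that appear in the listings above.
digits = [cuit_check_digit(c) for c in ('2005619041', '3050000319', '3499903208')]
```

A generator built on this would redraw the number (or change the prefix) whenever the function returns `None`.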

### 8) Random Capital

In [86]:
def capital_generator(style:str='any'):
    millions = np.random.randint(0,2)
    thousands = [x for x in range(0,1000, 5)][np.random.randint(0,200)]
    hundreds = ['000', '500'][np.random.randint(0,2)]
    
    if millions:
        if len(str(thousands)) == 1:
            thousands = f'00{thousands}'
        elif len(str(thousands)) == 2:
            thousands = f'0{thousands}'
        else:
            thousands = str(thousands)
        
        millions = np.random.randint(1,11)
        result = f'{millions}.{thousands}.{hundreds}'
    else:
        result = f'{thousands}.{hundreds}'
    
    if style == 'any':
        style = ['written', 'number', 'mixed'][np.random.randint(0,3)]
    
    if style == 'written':
        result = ''.join(result.split('.'))
        result = num2words(result, lang='es')
    elif style == 'mixed':
        number = ''.join(result.split('.'))
        words = num2words(number, lang='es')
        result = [f'{words} ($ {result})', f'${result} ({words})'][np.random.randint(0,2)]
    elif style == 'number':
        pass
    else:
        raise KeyError('Wrong specification of style. Must be `written`, `number`, `mixed` or `any`')
    return result
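`capital_generator` assembles the amount from string pieces and pads the thousands by hand. An alternative, sketched here under the assumption that the amount is first drawn as an integer, is to let Python's format specification insert the separators and then swap commas for dots. `format_pesos` is a hypothetical helper.

```python
def format_pesos(amount: int) -> str:
    """Format an integer using dots as thousands separators."""
    # The ',' format option groups digits by thousands; the notebook's amounts use dots instead.
    return f'{amount:,}'.replace(',', '.')


formatted = [format_pesos(n) for n in (500, 5000, 1500000)]
# Removing the dots gives back the plain number, as the generator does before calling num2words.
round_trip = [int(''.join(s.split('.'))) for s in formatted]
```

This keeps the numeric value and its textual form in sync, which simplifies passing the same number to `num2words`.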

### 9) Aportes generator

In [269]:
def aporte_generator(share_type:str = 'any'):
    thousands = [x for x in range(0,1000, 5)][np.random.randint(0,200)]
    hundreds = ['000', '500'][np.random.randint(0,2)]
    
    if np.random.randint(0,2):
        result = f'{thousands}.{hundreds}'
    else:
        result = str(thousands)  # return a string in both branches
    return result

In [292]:
aporte_generator()

'5.000'

### 10) Domicilio generator

In [3]:
import pandas as pd
import json
with open('../../data/estatutos/external_sources/calles.json', 'r') as f:
    direcciones = json.load(f)

In [4]:
calles = [pd.DataFrame(calle) for calle in direcciones['calles']]
df_calles = pd.concat(calles)

In [5]:
del calles, direcciones

In [6]:
df_calles = df_calles.dropna(subset=['provincia'])

In [7]:
mask = df_calles.provincia.apply(lambda x: x.isdigit())

In [8]:
clean_calles = df_calles.loc[~mask]

In [9]:
df_calles = clean_calles.sort_values('id').reset_index()

In [10]:
df_calles = df_calles.drop(columns=['index', 'fuente', 'altura'])

In [11]:
df_calles['full'] = df_calles['categoria'].str.lower() + ' ' + df_calles['nombre'].str.title()

In [11]:
df_calles.loc[~df_calles.provincia.str.contains('Autónoma'), 'provincia'] = 'Provincia de ' + df_calles.loc[~df_calles.provincia.str.contains('Autónoma'), 'provincia']

In [31]:
df_calles.to_csv('../../data/estatutos/external_sources/calles.csv', index=False)

In [93]:
df_calles = pd.read_csv('../../data/estatutos/external_sources/calles.csv', dtype=str)

In [94]:
df_calles

Unnamed: 0,categoria,departamento,nombre,id,provincia,localidad_censal,full
0,CALLE,Comuna 1,15 DE NOVIEMBRE DE 1889,0200701000025,Ciudad Autónoma de Buenos Aires,Ciudad Autónoma de Buenos Aires,calle 15 De Noviembre De 1889
1,CALLE,Comuna 1,25 DE MAYO,0200701000055,Ciudad Autónoma de Buenos Aires,Ciudad Autónoma de Buenos Aires,calle 25 De Mayo
2,CALLE,Comuna 1,5 DE JULIO,0200701000065,Ciudad Autónoma de Buenos Aires,Ciudad Autónoma de Buenos Aires,calle 5 De Julio
3,CALLE,Comuna 1,ACCESO,0200701000100,Ciudad Autónoma de Buenos Aires,Ciudad Autónoma de Buenos Aires,calle Acceso
4,CALLE,Comuna 1,ACCESO A ALVEAR,0200701000120,Ciudad Autónoma de Buenos Aires,Ciudad Autónoma de Buenos Aires,calle Acceso A Alvear
...,...,...,...,...,...,...,...
151367,CALLE,Ushuaia,WULAIA,9401502002795,"Provincia de Tierra del Fuego, Antártida e Isl...",Ushuaia,calle Wulaia
151368,CALLE,Ushuaia,YAGANES,9401502002800,"Provincia de Tierra del Fuego, Antártida e Isl...",Ushuaia,calle Yaganes
151369,CALLE,Ushuaia,YARKEN,9401502002805,"Provincia de Tierra del Fuego, Antártida e Isl...",Ushuaia,calle Yarken
151370,CALLE,Ushuaia,YOWEN,9401502002810,"Provincia de Tierra del Fuego, Antártida e Isl...",Ushuaia,calle Yowen


In [372]:
def address_generator(n:int, legal:bool=False):
    
    df_address = pd.read_csv('../../data/estatutos/external_sources/calles.csv', dtype=str)
    
    if legal:
        result = df_address['departamento'] + ', ' + df_address['provincia']
        result = result.sample(n).to_list()  # return a list, like the other branch
    else:
        altura = pd.Series([np.random.randint(0,5000) for i in range(n*5)])
        
        numeros = [str(np.random.randint(0,10)) for i in range(10)]
        letras = 'A B C D E F G H I J'.split(' ')
        letras_numeros = letras + numeros
        piso = pd.Series([np.random.randint(0,20) for i in range(n*3)])
        tipo = pd.Series([['departamento', 'oficina'][np.random.randint(0,2)] for i in range(n*3)])
        enumeracion = pd.Series([letras_numeros[np.random.randint(0,20)] for i in range(n*3)])
        full = 'piso ' + piso.astype(str) + ', ' + tipo + ' ' + enumeracion.astype(str)
        casa = 'casa ' + piso.astype(str) + ', ' + 'manzana' + ' ' + enumeracion.astype(str)
        full = pd.concat([full, full, casa]).sample(n*3)
        connector = [[', de la localidad de ', ', partido de ', ', ', ', departamento de ']\
                     [np.random.randint(0,4)] for i in range(n*3)]
        
        completo =  pd.Series(df_address['nombre'].sample(n*3).str.title().values) + \
                ' ' +\
                pd.Series(altura.sample(n*3).astype(str).values) + \
                ', ' + \
                pd.Series(full.values) + \
                pd.Series(connector) + \
                pd.Series(df_address.sample(n*3)['departamento'].values) + \
                ', ' + \
                pd.Series(df_address.sample(n*3)['provincia'].values)
            
        solo_altura = pd.Series(df_address['nombre'].sample(n*3).str.title().values) + \
                      ' ' + \
                      pd.Series(altura.sample(n*3).astype(str).values) + \
                      ', ' +\
                      pd.Series(df_address['departamento'].sample(n*3).values) + \
                      ', ' +\
                      pd.Series(df_address['provincia'].sample(n*3).values)
        
        esta_ciudad = pd.Series(df_address['nombre'].sample(n).str.title().values) + \
                      ' ' + \
                      pd.Series(altura.sample(n).astype(str).values) + \
                      ', ' +\
                      pd.Series(['de esta ciudad' for i in range(n)])
        
        esta_ciudad_2 = pd.Series(df_address['nombre'].sample(n).str.title().values) + \
                        ' ' +\
                        pd.Series(altura.sample(n).astype(str).values) + \
                        ', ' + \
                        pd.Series(full.sample(n).values) + \
                        pd.Series([', de esta ciudad' for i in range(n)])
        
        
        result = pd.concat([completo, solo_altura, esta_ciudad, esta_ciudad_2]).sample(n).to_list()
    
    return result