# Normalizing transparencia.gob.sv public procurement dataset

This notebook normalizes public procurement data obtained from `transparencia.gob.sv`.

The data was obtained using two scripts:

- `build_list` builds a list of the public procurement orders recorded on that website
- `getcontracts` retrieves the data for each contract and creates the dataset

The notebook applies the following normalizations:

- Column names are standardized
- Order amounts are converted from strings to numeric values
- Supplier names are standardized
- Order dates are converted to YYYY-MM-DD
- Order modes are mapped to standard categories
- Order links are converted to directly usable URLs

Output is saved to `normalized_contracts.csv`.

TODO:

- Improve scripts to retrieve data from `transparencia.gob.sv`
- Build a probabilistic model to associate supplier names

## Data load

In [1]:
import pandas as pd
import unidecode

In [2]:
data = pd.read_csv('contracts.csv')
data.describe()

| | Código de adquisición o contratación | Área institucional | Objeto | Monto | Nombre de la contraparte | Plazos de cumplimiento | Tipo de contratación | Fecha de contrato / Órden de compra | Código de contrato / Órden de compra | Características de la contraparte | Archivo adjunto | Fecha de creación | Fecha de última actualización | office |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 53350 | 39060 | 75212 | 75214 | 70720 | 74730 | 75107 | 74897 | 59639 | 74976 | 75214 | 75214 | 75214 | 75214 |
| unique | 26327 | 1622 | 48083 | 38378 | 21854 | 22478 | 3978 | 2493 | 43836 | 7022 | 74056 | 1199 | 699 | 69 |
| top | OC | SALUD | SERVICIO DE ALIMENTACIÓN | $600.00 | 50001612 NIPRO MEDICAL CORPORATION SUC. EL S | 31/12/2018 | LIBRE GESTION | 05 de noviembre de 2015 | 4618000780 | PERSONA JURIDICA | /system/procurements/attachments/000/108/591/o... | 10/04/2019 | 18/07/2017 | isss |
| freq | 3712 | 2192 | 230 | 447 | 627 | 3610 | 16092 | 476 | 291 | 13432 | 5 | 1887 | 7610 | 16512 |


## Column names standardization

In [3]:
data.rename(columns={
    'Código de adquisición o contratación': 'code',
    'Área institucional': 'request_office',
    'Objeto': 'description',
    'Monto': 'amount',
    'Nombre de la contraparte': 'supplier_name',
    'Tipo de contratación': 'order_mode',
    'Fecha de contrato / Órden de compra': 'order_date',
    'Código de contrato / Órden de compra': 'order_code',
    'Características de la contraparte': 'supplier_kind',
    'Archivo adjunto': 'order_link',
}, inplace=True)

In [4]:
data.columns

Index(['code', 'request_office', 'description', 'amount', 'supplier_name',
       'Plazos de cumplimiento', 'order_mode', 'order_date', 'order_code',
       'supplier_kind', 'order_link', 'Fecha de creación',
       'Fecha de última actualización', 'office'],
      dtype='object')

## Order amounts to numeric

In [5]:
data['amount'] = pd.to_numeric(
    data['amount']
        .str.replace('$', '', regex=False)  # literal '$', not the regex end-of-string anchor
        .str.replace(',', '', regex=False)  # thousands separator
)
data['amount'].dtype

dtype('float64')
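
The conversion above assumes every amount looks like `$1,250.50`. As a minimal sketch of a more defensive per-value parser, the hypothetical helper `parse_amount` below (not part of the original notebook) returns `NaN` for anything it cannot parse. The sample strings in the usage lines are made up; the vectorized pandas counterpart would be `pd.to_numeric(..., errors='coerce')`.

```python
import math


def parse_amount(raw):
    """Parse a string such as '$1,250.50' into a float; return NaN if it cannot be parsed."""
    if not isinstance(raw, str):
        # Missing values (None, NaN) are not strings.
        return math.nan
    cleaned = raw.replace('$', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


# Made-up sample values, not taken from the dataset.
for value in ['$600.00', '$1,250.50', 'N/A', None]:
    print(repr(value), '->', parse_amount(value))
```

Such a helper could be passed to `Series.apply`, at the cost of being slower than the vectorized string methods.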

## Supplier names normalization

In [6]:
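data['supplier_name_norm'] = (
    data['supplier_name']
    .str.upper()
    .str.replace(r'[^\w\s]', '', regex=True)        # drop punctuation
    .str.replace(r'\s+', ' ', regex=True)           # collapse runs of whitespace
    .str.replace('SA DE CV', '', regex=False)       # drop the common legal suffix
    .str.strip()
    .apply(lambda s: unidecode.unidecode(str(s)))   # strip accents; NaN becomes 'nan'
    .str.replace('nan', '', regex=False)            # blank out the stringified NaN values
)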
data['supplier_name_norm']

0                           DATA GRAPHICS
1        CARLOS REMBERTO VILLEGAS MORALES
2                                JUGUESAL
3                    NATANAEL LOPEZ GOMEZ
4                     INVERSIONES MONTOYA
                       ...               
75209                                    
75210                                    
75211                                    
75212                                    
75213                                    
Name: supplier_name_norm, Length: 75214, dtype: object
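
To make the individual steps easier to test on single values, the same pipeline can be sketched as a plain function. This is an illustration under assumptions, not the notebook's code: `normalize_supplier` is a hypothetical name, the raw input strings are invented, and accents are removed with the standard library's `unicodedata` instead of `unidecode`, which can give different results for characters that have no accent-free decomposition. Accents are also stripped before punctuation here, which should not change the result for ordinary Spanish names.

```python
import re
import unicodedata


def normalize_supplier(name):
    """Return an upper-case, accent-free, punctuation-free supplier name ('' if missing)."""
    if not isinstance(name, str):
        return ''
    text = name.upper()
    # Decompose accented characters, then drop the combining marks
    # (a stdlib stand-in for unidecode, used here as an assumption).
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation
    text = re.sub(r'\s+', ' ', text)      # collapse runs of whitespace
    text = text.replace('SA DE CV', '')   # drop the common legal suffix
    return text.strip()


# Invented raw spellings for illustration.
print(normalize_supplier('Inversiones Montoya, S.A. de C.V.'))
print(normalize_supplier('Natanael López Gómez'))
```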

## Order dates normalization

In [7]:
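def date_convert(s):
    """Convert 'DD de <month> de YYYY' to 'YYYY-MM-DD'; return '' if parsing fails."""
    months = [
        'enero', 'febrero', 'marzo',
        'abril', 'mayo', 'junio',
        'julio', 'agosto', 'septiembre',
        'octubre', 'noviembre', 'diciembre'
    ]
    try:
        parts = str(s).split()
        return "{}-{}-{}".format(
            parts[4].zfill(4),
            str(months.index(parts[2]) + 1).zfill(2),
            parts[0].zfill(2)
        )
    except (IndexError, ValueError):
        # Too few tokens, or an unknown month name
        return ''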

In [8]:
data['order_date_norm'] = data['order_date'].apply(date_convert)
data['year'] = data['order_date_norm'].apply(lambda s: s[:4])

In [9]:
data['year'].unique()

array(['2019', '2018', '2017', '2016', '2015', '2014', '2020', '2010', '',
       '2000', '2009', '2012', '2013', '1995', '2002', '2005', '2004',
       '2003', '2011', '2007', '2063', '2918', '2006', '1979'],
      dtype=object)
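
The list of years includes values such as `2063` and `2918`, which are later than the date this notebook was created and therefore look like data-entry errors. One possible way to catch them, sketched below under assumptions, is to validate each date while parsing it. The function name `parse_spanish_date` and the bounds `min_year` and `max_year` are hypothetical choices, not values taken from the source. Unlike `date_convert`, this sketch also lower-cases the month name and rejects impossible calendar dates.

```python
from datetime import date

MONTHS = [
    'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio',
    'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre',
]


def parse_spanish_date(text, min_year=1900, max_year=2020):
    """Convert 'DD de <month> de YYYY' to 'YYYY-MM-DD'.

    Returns '' when the text does not match, the calendar date is
    impossible, or the year falls outside [min_year, max_year]
    (both bounds are assumptions to adjust for the real data).
    """
    try:
        day, _, month_name, _, year = str(text).lower().split()
        parsed = date(int(year), MONTHS.index(month_name) + 1, int(day))
    except (ValueError, OverflowError):
        return ''
    if not min_year <= parsed.year <= max_year:
        return ''
    return parsed.isoformat()


print(parse_spanish_date('05 de noviembre de 2015'))
print(repr(parse_spanish_date('12 de mayo de 2918')))
```

Returning an empty string keeps the convention `date_convert` uses for unparseable values, so the `year` column would show `''` for rejected dates.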

## Order modes categorization

In [10]:
order_modes = ['LIBRE GESTION', 'LICITACION', 'CONTRATACION DIRECTA', 'PRORROGA']

data['order_mode_norm'] = data['order_mode'] \
    .str.upper() \
    .apply(lambda s: unidecode.unidecode(str(s))) \
    .str.replace('.*LIBRE GESTION.*', order_modes[0], regex=True) \
    .str.replace('.*LICI.*', order_modes[1], regex=True) \
    .str.replace('.*DIRECTA.*', order_modes[2], regex=True) \
    .str.replace('.*PRORROGA.*', order_modes[3], regex=True) \
    .apply(lambda s: s if s in order_modes else 'OTRA')
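
The chain of replacements above amounts to a "first matching keyword wins, otherwise `OTRA`" rule. A minimal sketch of that rule as a standalone function follows, so it can be checked against individual values. The names `RULES` and `categorize_order_mode` and the sample inputs are hypothetical, and accents are stripped with the standard library's `unicodedata` rather than `unidecode`.

```python
import unicodedata

# (keyword, category) pairs, checked in order; the first keyword found wins.
RULES = [
    ('LIBRE GESTION', 'LIBRE GESTION'),
    ('LICI', 'LICITACION'),
    ('DIRECTA', 'CONTRATACION DIRECTA'),
    ('PRORROGA', 'PRORROGA'),
]


def categorize_order_mode(raw):
    """Map a free-text order mode to one of the standard categories, or 'OTRA'."""
    if not isinstance(raw, str):
        return 'OTRA'  # missing values fall into the catch-all category
    # Upper-case and strip accents so 'Licitación' matches 'LICI'.
    text = unicodedata.normalize('NFKD', raw.upper())
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    for keyword, category in RULES:
        if keyword in text:
            return category
    return 'OTRA'


# Invented sample values for illustration.
for value in ['Libre Gestión', 'Licitación Pública', 'Prórroga de contrato', 'Convenio']:
    print(value, '->', categorize_order_mode(value))
```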

## Order links to a usable form

In [11]:
data['order_link'] = data['order_link'].apply(lambda s: 'https://transparencia.gob.sv' + s)
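
Plain concatenation fails on missing values and prefixes the host twice if the cell is run again. As a sketch under assumptions, the hypothetical helper `absolute_link` below uses `urllib.parse.urljoin`, which leaves already absolute URLs unchanged. The example path is invented.

```python
from urllib.parse import urljoin

BASE_URL = 'https://transparencia.gob.sv'


def absolute_link(path):
    """Turn a site-relative attachment path into an absolute URL ('' if missing)."""
    if not isinstance(path, str) or not path:
        return ''
    # urljoin returns `path` unchanged when it is already an absolute URL,
    # so applying this function twice gives the same result.
    return urljoin(BASE_URL, path)


# Invented path for illustration.
link = absolute_link('/system/example.pdf')
print(link)
print(absolute_link(link))
```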

## Saving output

In [12]:
data.to_csv('normalized_contracts.csv', index=False)  # skip the row index so no unnamed column appears on reload

## History

2020-08-26: This notebook was created.