In [1]:
import pandas as pd
from config.schemas import dtypes

In [2]:
df = pd.read_csv("data/tce_licitations.csv", dtype=dtypes)

# 2. **Data Cleaning**

### 2.1. **Selecting necessary data**

In [3]:
df.columns

Index(['CD_ORGAO', 'NM_ORGAO', 'NR_LICITACAO', 'ANO_LICITACAO',
       'CD_TIPO_MODALIDADE', 'NR_COMISSAO', 'TP_OBJETO', 'CD_TIPO_FASE_ATUAL',
       'TP_LICITACAO', 'TP_NIVEL_JULGAMENTO', 'TP_CARACTERISTICA_OBJETO',
       'TP_NATUREZA', 'DS_OBJETO', 'VL_LICITACAO', 'BL_PERMITE_CONSORCIO',
       'DT_ABERTURA', 'DT_HOMOLOGACAO', 'DT_ADJUDICACAO',
       'BL_LICIT_PROPRIA_ORGAO', 'VL_HOMOLOGADO', 'DS_OBSERVACAO'],
      dtype='object')

First, we keep only the bids where `CD_TIPO_FASE_ATUAL` equals `ADH`, which are the approved ones according to page 27 of the TCE [documentation](https://tcers.tc.br/repo/cex/licitacon/eValidador_LicitaCon_Manual_Leiaute_1.4.pdf).

In [4]:
df = df[(df["CD_TIPO_FASE_ATUAL"]=="ADH")]
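Before filtering on a code like this, it can help to see how many rows each phase has. The following is a minimal sketch on a toy frame, not the real dataset: the `"XXX"` phase code and the `toy` variable are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; "XXX" is a made-up phase code.
toy = pd.DataFrame(
    {
        "NR_LICITACAO": [1, 2, 3, 4],
        "CD_TIPO_FASE_ATUAL": ["ADH", "XXX", "ADH", None],
    }
)

# Count rows per phase, including missing values, before filtering.
phase_counts = toy["CD_TIPO_FASE_ATUAL"].value_counts(dropna=False)
print(phase_counts)

# Boolean mask: True only where the phase is exactly "ADH".
# Missing values compare as False, so they are dropped by the filter.
is_approved = toy["CD_TIPO_FASE_ATUAL"] == "ADH"
approved = toy[is_approved]
print(approved)
```

Note that rows with a missing phase are silently excluded by the equality test, which is the behavior the filter above relies on.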

In [5]:
df.sample(5)

Unnamed: 0,CD_ORGAO,NM_ORGAO,NR_LICITACAO,ANO_LICITACAO,CD_TIPO_MODALIDADE,NR_COMISSAO,TP_OBJETO,CD_TIPO_FASE_ATUAL,TP_LICITACAO,TP_NIVEL_JULGAMENTO,...,TP_NATUREZA,DS_OBJETO,VL_LICITACAO,BL_PERMITE_CONSORCIO,DT_ABERTURA,DT_HOMOLOGACAO,DT_ADJUDICACAO,BL_LICIT_PROPRIA_ORGAO,VL_HOMOLOGADO,DS_OBSERVACAO
151925,52300,PM DE MONTENEGRO,84.0,2016.0,PRE,4009.0,COM,ADH,MPR,L,...,N,Aquisição de peças para veículo (C.170) da SMVSU,2701.0,N,2016-07-07,2016-07-15,2016-07-15,S,2275.0,
796392,84300,PM DE FAZENDA VILANOVA,4.0,2021.0,PRP,48.0,COM,ADH,MPR,I,...,R,"REFERENTE AQUISIÇÃO DE GALERIAS, MEIO FIO E TU...",1169964.0,N,2021-03-25,2021-03-25,2021-03-25,S,,
713661,52300,PM DE MONTENEGRO,20.0,2018.0,PRE,4009.0,COM,ADH,MPR,L,...,N,Aquisição de material de construção para recon...,2237.16,N,2018-04-19,2018-04-25,2018-04-25,S,1604.0,
193872,47800,PM DE FARROUPILHA,10.0,2022.0,PDE,729.0,COM,ADH,MPR,I,...,N,Aquisição de materiais de manutenção de bens i...,19491.66,N,2022-03-04,2022-03-07,2022-03-07,S,,
495266,55100,PM DE PORTO XAVIER,22.0,2023.0,PRP,1.0,COM,ADH,MPR,I,...,N,Aquisicao de Materiais para Extensao de Redes ...,162350.0,S,2023-05-18,2023-05-24,2023-05-24,S,116410.9,


In [6]:
df.shape

(283721, 21)

Next, we drop some columns based on their descriptions in `colums_definitions/licitations.md`. Note that several columns were already dropped upstream, when the data was extracted by the script `get_data_tce.py`.

In [7]:
df.columns

Index(['CD_ORGAO', 'NM_ORGAO', 'NR_LICITACAO', 'ANO_LICITACAO',
       'CD_TIPO_MODALIDADE', 'NR_COMISSAO', 'TP_OBJETO', 'CD_TIPO_FASE_ATUAL',
       'TP_LICITACAO', 'TP_NIVEL_JULGAMENTO', 'TP_CARACTERISTICA_OBJETO',
       'TP_NATUREZA', 'DS_OBJETO', 'VL_LICITACAO', 'BL_PERMITE_CONSORCIO',
       'DT_ABERTURA', 'DT_HOMOLOGACAO', 'DT_ADJUDICACAO',
       'BL_LICIT_PROPRIA_ORGAO', 'VL_HOMOLOGADO', 'DS_OBSERVACAO'],
      dtype='object')

In [8]:
cols_to_keep = [
    "CD_ORGAO",
    "NM_ORGAO",
    "ANO_LICITACAO",
    "DS_OBJETO",
    "VL_LICITACAO",
    "DT_HOMOLOGACAO",
    "VL_HOMOLOGADO",
]


In [9]:
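# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
df = df[cols_to_keep].copy()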

In [10]:
df.isna().sum()

CD_ORGAO               0
NM_ORGAO               0
ANO_LICITACAO          0
DS_OBJETO              0
VL_LICITACAO       13533
DT_HOMOLOGACAO      6304
VL_HOMOLOGADO     138643
dtype: int64

### 2.2. **Dealing with null values**

For `VL_HOMOLOGADO`, the best available approximation is `VL_LICITACAO`, so I'll fill the missing values with it:

In [11]:
df["VL_HOMOLOGADO"] = df["VL_HOMOLOGADO"].fillna(df["VL_LICITACAO"])

In [12]:
df.isna().sum()

CD_ORGAO              0
NM_ORGAO              0
ANO_LICITACAO         0
DS_OBJETO             0
VL_LICITACAO      13533
DT_HOMOLOGACAO     6304
VL_HOMOLOGADO      8954
dtype: int64

We still have some null values, but we'll only remove them later on, during the EDA (Exploratory Data Analysis).
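As a small illustration of why some nulls survive the fill: `fillna` with another Series aligns on the index and can only fill a row if the other column has a value there. The sketch below uses a toy frame with made-up numbers, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: the last row is missing in both columns.
toy = pd.DataFrame(
    {
        "VL_LICITACAO": [100.0, 250.0, np.nan],
        "VL_HOMOLOGADO": [90.0, np.nan, np.nan],
    }
)

# Fill missing homologated values with the bid value from the same row.
toy["VL_HOMOLOGADO"] = toy["VL_HOMOLOGADO"].fillna(toy["VL_LICITACAO"])

# Rows that are still missing are exactly those where both columns were null.
still_missing = toy["VL_HOMOLOGADO"].isna()
print(toy)
print(still_missing.sum())
```

By the same logic, the nulls left in `VL_HOMOLOGADO` after the fill should be rows where `VL_LICITACAO` is null as well.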

### 2.3. **Asserting the correct data types**

In [13]:
df.dtypes

CD_ORGAO          object
NM_ORGAO          object
ANO_LICITACAO     object
DS_OBJETO         object
VL_LICITACAO      object
DT_HOMOLOGACAO    object
VL_HOMOLOGADO     object
dtype: object

**CD_ORGAO**

In [14]:
df["CD_ORGAO"] = df["CD_ORGAO"].astype(int)

**ANO_LICITACAO**

In [15]:
df["ANO_LICITACAO"].value_counts()

ANO_LICITACAO
2022      45179
2023      38535
2021      38369
2019      37210
2018      36773
2017      32976
2020      31037
2016      17839
2023.0     3212
2024       2304
2024.0      287
Name: count, dtype: int64

In [16]:
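# Safeguard: drop rows whose year holds a non-year code such as "PRD" or "PDE"
df = df[~df["ANO_LICITACAO"].isin(["PRD", "PDE"])]
# Normalize float-formatted year strings ("2023.0") to plain years ("2023")
df["ANO_LICITACAO"] = df["ANO_LICITACAO"].replace({"2023.0": "2023", "2024.0": "2024"})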

In [17]:
df["ANO_LICITACAO"] = df["ANO_LICITACAO"].astype(int)
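A more general alternative to listing each float-formatted year by hand is to parse the column numerically first. This is only a sketch of that idea on a toy Series, not the approach used in this notebook; unlike the cells above, it turns unparseable entries into `NaN` and drops them.

```python
import pandas as pd

# Toy Series mixing plain years, float-formatted years, and a stray code.
years = pd.Series(["2022", "2023.0", "2024", "2024.0", "PRD"])

# errors="coerce" turns anything that is not a number into NaN.
numeric = pd.to_numeric(years, errors="coerce")

# Drop the unparseable entries, then cast the remaining floats to int.
clean_years = numeric.dropna().astype(int)
print(clean_years.tolist())
```

The trade-off is that coercion hides which bad values were present, so it is worth inspecting `value_counts()` first, as done above.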

**VL_LICITACAO**

In [18]:
df["VL_LICITACAO"] = df["VL_LICITACAO"].astype(float)

**DT_HOMOLOGACAO**

In [19]:
df['DT_HOMOLOGACAO'] = pd.to_datetime(df['DT_HOMOLOGACAO'])
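If the column might contain malformed dates, the parsing can be made explicit. The sketch below assumes the `YYYY-MM-DD` layout seen in the sampled rows; the `"not a date"` entry is made up to show the effect of `errors="coerce"`.

```python
import pandas as pd

# Toy Series: two valid dates, one missing value, one malformed entry.
dates = pd.Series(["2016-07-15", "2021-03-25", None, "not a date"])

# An explicit format avoids per-value format inference;
# errors="coerce" maps unparseable entries to NaT instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed)
print(parsed.isna().sum())
```

Comparing the number of `NaT` values before and after parsing is a quick way to see how many entries were malformed rather than simply missing.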

**VL_HOMOLOGADO**

In [20]:
# Drop rows where the value is a run of '#' characters instead of a number
df = df[~df['VL_HOMOLOGADO'].isin(['###############', '#################'])]

In [21]:
df["VL_HOMOLOGADO"] = df["VL_HOMOLOGADO"].astype(float)
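The filter above matches two specific placeholder lengths. As a hedged sketch, a pattern match would catch a run of `#` of any length; the toy values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy Series: numeric strings, a missing value, and '#' placeholders.
values = pd.Series(
    ["2275.0", "###############", "116410.9", np.nan, "#################"]
)

# True where the whole string consists only of '#' characters.
# na=False makes missing values count as "not a placeholder".
is_placeholder = values.str.fullmatch(r"#+", na=False)

# Keep the remaining rows and convert them to float (missing stays NaN).
cleaned = values[~is_placeholder].astype(float)
print(cleaned)
```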

Checking the new data types:

In [22]:
df.dtypes

CD_ORGAO                   int64
NM_ORGAO                  object
ANO_LICITACAO              int64
DS_OBJETO                 object
VL_LICITACAO             float64
DT_HOMOLOGACAO    datetime64[ns]
VL_HOMOLOGADO            float64
dtype: object
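Instead of checking the printed dtypes by eye, the expectation can be encoded as a small check. This is a sketch under assumptions: `checks` and `find_dtype_problems` are hypothetical helpers, not part of this project, and the toy rows reuse values from the sample shown earlier.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical mapping from column name to the dtype test it should pass.
checks = {
    "CD_ORGAO": ptypes.is_integer_dtype,
    "ANO_LICITACAO": ptypes.is_integer_dtype,
    "VL_LICITACAO": ptypes.is_float_dtype,
    "DT_HOMOLOGACAO": ptypes.is_datetime64_any_dtype,
    "VL_HOMOLOGADO": ptypes.is_float_dtype,
}


def find_dtype_problems(frame):
    """Return the columns whose dtype fails its expected check."""
    return [col for col, check in checks.items() if not check(frame[col])]


# Toy frame with the expected dtypes.
toy = pd.DataFrame(
    {
        "CD_ORGAO": [52300],
        "ANO_LICITACAO": [2016],
        "VL_LICITACAO": [2701.0],
        "DT_HOMOLOGACAO": pd.to_datetime(["2016-07-15"]),
        "VL_HOMOLOGADO": [2275.0],
    }
)
problems = find_dtype_problems(toy)

# Same frame, but with the year stored as a string again.
bad = toy.assign(ANO_LICITACAO=["2016"])
bad_problems = find_dtype_problems(bad)
print(problems, bad_problems)
```

Using the predicate functions rather than exact dtype strings keeps the check tolerant of details such as datetime resolution.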