In [1]:
import pandas as pd
from config.schemas import dtypes

In [2]:
df = pd.read_csv("data/tce_licitations.csv", dtype=dtypes)

In [3]:
df["ANO_LICITACAO"].value_counts()

ANO_LICITACAO
2022      118084
2023      115113
2021       98703
2019       81358
2020       78722
2018       74622
2017       64108
2016       30709
2024       19644
2023.0      9505
2024.0      3605
PRD           15
PDE            1
Name: count, dtype: int64

# 2. **Data Cleaning**

### 2.1. **Selecting necessary data**

In [4]:
df.columns

Index(['CD_ORGAO', 'NM_ORGAO', 'NR_LICITACAO', 'ANO_LICITACAO',
       'CD_TIPO_MODALIDADE', 'NR_COMISSAO', 'TP_OBJETO', 'CD_TIPO_FASE_ATUAL',
       'TP_LICITACAO', 'TP_NIVEL_JULGAMENTO', 'TP_CARACTERISTICA_OBJETO',
       'TP_NATUREZA', 'DS_OBJETO', 'VL_LICITACAO', 'BL_PERMITE_CONSORCIO',
       'DT_ABERTURA', 'DT_HOMOLOGACAO', 'DT_ADJUDICACAO',
       'BL_LICIT_PROPRIA_ORGAO', 'VL_HOMOLOGADO', 'DS_OBSERVACAO'],
      dtype='object')

First, we select only the bids where `CD_TIPO_FASE_ATUAL` equals `ADH`, which are the ones that have been approved, according to page 27 of the TCE [documentation](https://tcers.tc.br/repo/cex/licitacon/eValidador_LicitaCon_Manual_Leiaute_1.4.pdf).

In [5]:
df = df[(df["CD_TIPO_FASE_ATUAL"]=="ADH")]

In [6]:
df.sample(5)

Unnamed: 0,CD_ORGAO,NM_ORGAO,NR_LICITACAO,ANO_LICITACAO,CD_TIPO_MODALIDADE,NR_COMISSAO,TP_OBJETO,CD_TIPO_FASE_ATUAL,TP_LICITACAO,TP_NIVEL_JULGAMENTO,...,TP_NATUREZA,DS_OBJETO,VL_LICITACAO,BL_PERMITE_CONSORCIO,DT_ABERTURA,DT_HOMOLOGACAO,DT_ADJUDICACAO,BL_LICIT_PROPRIA_ORGAO,VL_HOMOLOGADO,DS_OBSERVACAO
154399,88048,PM DE WESTFÁLIA,11.0,2016,PRP,1003.0,COM,ADH,MPR,I,...,N,"Aquisição de veículo novo, tipo Sedan médio.",103000.0,N,2016-09-02,2016-09-08,2016-09-08,S,,
114650,45004,CODECA - CIA. DE DESENV. DE CAXIAS DO SUL,28.0,2019,PRP,1.0,COM,ADH,MPR,I,...,R,Contratação de empresa para fornecimento de ca...,81600.0,N,2019-05-08,2019-05-10,2019-05-08,S,,
173740,43000,PM DE CACHOEIRA DO SUL,14527.0,2016,PRE,577.0,COM,ADH,MPR,I,...,N,PARA UNIDADE DVA,17163.0,N,2016-11-28,2016-12-19,2016-12-19,S,5334.5,Sim
146559,57200,PM DE SANTA ROSA,13.0,2016,PRP,,CSE,ADH,MPR,I,...,N,AQUISIÇÃO,578875.0,N,2016-04-15,2016-04-18,2016-04-18,S,,
275520,58900,PM DE SÃO LUIZ GONZAGA,25.0,2022,PRE,558.0,CSE,ADH,MPR,I,...,N,Aquisicao de moveis para diversas Secretarias ...,279863.74,S,2022-09-29,2022-10-07,2022-10-07,S,191460.01,


In [7]:
df.shape

(283721, 21)

Let's drop some columns based on their descriptions in `colums_definitions/licitations.md`. Note that several columns were already dropped when the data was extracted with the script `get_data_tce.py`.

In [8]:
df.columns

Index(['CD_ORGAO', 'NM_ORGAO', 'NR_LICITACAO', 'ANO_LICITACAO',
       'CD_TIPO_MODALIDADE', 'NR_COMISSAO', 'TP_OBJETO', 'CD_TIPO_FASE_ATUAL',
       'TP_LICITACAO', 'TP_NIVEL_JULGAMENTO', 'TP_CARACTERISTICA_OBJETO',
       'TP_NATUREZA', 'DS_OBJETO', 'VL_LICITACAO', 'BL_PERMITE_CONSORCIO',
       'DT_ABERTURA', 'DT_HOMOLOGACAO', 'DT_ADJUDICACAO',
       'BL_LICIT_PROPRIA_ORGAO', 'VL_HOMOLOGADO', 'DS_OBSERVACAO'],
      dtype='object')

In [9]:
cols_to_keep = [
    "CD_ORGAO",
    "NM_ORGAO",
    "ANO_LICITACAO",
    "DS_OBJETO",
    "VL_LICITACAO",
    "DT_HOMOLOGACAO",
    "VL_HOMOLOGADO",
]


In [10]:
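# Take an explicit copy so later column assignments do not act on a view of the original frame.
df = df[cols_to_keep].copy()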

In [11]:
df.isna().sum()

CD_ORGAO               0
NM_ORGAO               0
ANO_LICITACAO          0
DS_OBJETO              0
VL_LICITACAO       13533
DT_HOMOLOGACAO      6304
VL_HOMOLOGADO     138643
dtype: int64

### 2.2. **Dealing with null values**

For `VL_HOMOLOGADO`, the best available approximation is `VL_LICITACAO`, so I'll fill the missing values with it:

In [12]:
df["VL_HOMOLOGADO"] = df["VL_HOMOLOGADO"].fillna(df["VL_LICITACAO"])

In [13]:
df.isna().sum()

CD_ORGAO              0
NM_ORGAO              0
ANO_LICITACAO         0
DS_OBJETO             0
VL_LICITACAO      13533
DT_HOMOLOGACAO     6304
VL_HOMOLOGADO      8954
dtype: int64

We still have some null values, but we'll only remove them later on, during the EDA (Exploratory Data Analysis).

### 2.3. **Asserting the correct data types**

In [14]:
df.dtypes

CD_ORGAO          object
NM_ORGAO          object
ANO_LICITACAO     object
DS_OBJETO         object
VL_LICITACAO      object
DT_HOMOLOGACAO    object
VL_HOMOLOGADO     object
dtype: object

**CD_ORGAO**

In [15]:
df["CD_ORGAO"] = df["CD_ORGAO"].astype(int)

**ANO_LICITACAO**

In [16]:
df["ANO_LICITACAO"].value_counts()

ANO_LICITACAO
2022      45179
2023      38535
2021      38369
2019      37210
2018      36773
2017      32976
2020      31037
2016      17839
2023.0     3212
2024       2304
2024.0      287
Name: count, dtype: int64

In [17]:
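# Drop rows whose year holds a non-numeric code (none appear in the counts above, so this is only a safeguard).
df = df[~df["ANO_LICITACAO"].isin(["PRD", "PDE"])]
# Normalize float-like year strings so the column can be cast to int.
df["ANO_LICITACAO"] = df["ANO_LICITACAO"].replace({"2023.0": "2023", "2024.0": "2024"})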

In [18]:
df["ANO_LICITACAO"] = df["ANO_LICITACAO"].astype(int)
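
As an alternative to listing each bad value by hand, the same cleanup could be sketched with `pd.to_numeric`. This is not the approach used in this notebook; it is a sketch on a toy Series that mimics the mixed strings seen in `ANO_LICITACAO`, and it assumes every unparseable value should be dropped.

```python
import pandas as pd

# Toy series mimicking the mixed string values seen in ANO_LICITACAO.
years = pd.Series(["2022", "2023.0", "2024", "PRD", "2024.0"])

# Non-numeric codes such as "PRD" become NaN instead of raising an error.
numeric = pd.to_numeric(years, errors="coerce")

# Drop the unparseable rows, then cast the floats (e.g. 2023.0) to int.
clean_years = numeric.dropna().astype(int)
print(clean_years.tolist())
```

The trade-off is that any new unexpected code is silently discarded, so it may be worth counting the NaN values before dropping them.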

**VL_LICITACAO**

In [19]:
df["VL_LICITACAO"] = df["VL_LICITACAO"].astype(float)

**DT_HOMOLOGACAO**

In [20]:
df['DT_HOMOLOGACAO'] = pd.to_datetime(df['DT_HOMOLOGACAO'])
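
If the column ever contains malformed dates, the conversion above would raise. A hedged sketch of a stricter variant follows, assuming the dates use the `YYYY-MM-DD` layout seen in the sample rows; the toy values, including the invalid one, are made up for illustration.

```python
import pandas as pd

# Toy series: one valid date, one invalid string, one missing value.
raw_dates = pd.Series(["2016-09-08", "not a date", None])

# An explicit format avoids per-value inference; errors="coerce" turns bad values into NaT.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

n_unparsed = int(parsed.isna().sum())
print(parsed)
print(n_unparsed)
```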

**VL_HOMOLOGADO**

In [21]:
# Drop rows where the value is a '#' placeholder string that cannot be parsed as a number.
df = df[~df["VL_HOMOLOGADO"].isin(["###############", "#################"])]

In [22]:
df["VL_HOMOLOGADO"] = df["VL_HOMOLOGADO"].astype(float)
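
The filter above matches two specific placeholder lengths. A sketch of a length-independent variant, assuming any string made up only of `#` characters is a placeholder, could look like this (toy values, with the numeric ones borrowed from the sample rows above):

```python
import pandas as pd

# Toy series mixing numeric strings with '#' placeholders of different lengths.
values = pd.Series(["5334.5", "###############", "191460.01", "#################"])

# True for strings consisting solely of '#'; na=False treats missing values as non-matches.
is_placeholder = values.str.fullmatch(r"#+", na=False)

# Keep the remaining rows and cast them to float.
clean_values = values[~is_placeholder].astype(float)
print(clean_values.tolist())
```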

Checking the new data types:

In [23]:
df.dtypes

CD_ORGAO                   int64
NM_ORGAO                  object
ANO_LICITACAO              int64
DS_OBJETO                 object
VL_LICITACAO             float64
DT_HOMOLOGACAO    datetime64[ns]
VL_HOMOLOGADO            float64
dtype: object
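
Rather than checking the dtypes by eye, the expectation can be asserted in code. The sketch below uses a small toy frame built from two of the sample rows shown earlier; the `checks` mapping is a hypothetical helper and not part of the project.

```python
import pandas as pd
from pandas.api import types as ptypes

# Toy frame with the dtypes expected after cleaning.
toy = pd.DataFrame(
    {
        "CD_ORGAO": [88048, 45004],
        "ANO_LICITACAO": [2016, 2019],
        "VL_LICITACAO": [103000.0, 81600.0],
        "DT_HOMOLOGACAO": pd.to_datetime(["2016-09-08", "2019-05-10"]),
    }
)

# Hypothetical mapping from column name to the dtype predicate it must satisfy.
checks = {
    "CD_ORGAO": ptypes.is_integer_dtype,
    "ANO_LICITACAO": ptypes.is_integer_dtype,
    "VL_LICITACAO": ptypes.is_float_dtype,
    "DT_HOMOLOGACAO": ptypes.is_datetime64_any_dtype,
}

# Collect every column whose dtype fails its predicate.
failed = [col for col, check in checks.items() if not check(toy[col])]
assert not failed, f"Unexpected dtypes in columns: {failed}"
```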