# Arboviral disease record data - Dengue and Chikungunya, Brazil, 2013–2020

This work presents a unified data set with clinical, sociodemographic, and laboratory data on patients with confirmed Dengue or Chikungunya, as well as patients for whom infection with these diseases was ruled out. The data set is built from case notifications submitted to the Brazilian Information System for Notifiable Diseases, *Sistema de Informação de Agravos de Notificação* (SINAN), from 2013 to 2020. The original data set comprised 13,421,230 records and 118 attributes. After pre-processing, the final data set contains 7,632,542 records and 56 attributes.

A data dictionary for the data set is available, in Portuguese, at the links below:
- [common and sociodemographic data](http://portalsinan.saude.gov.br/images/documentos/Agravos/Notificacao_Individual/DIC_DADOS_NET---Notificao-Individual_rev.pdf)
- [clinical and laboratory data](http://portalsinan.saude.gov.br/images/documentos/Agravos/Dengue/DIC_DADOS_ONLINE.pdf)

The data set resulting from this project can be found [at this link](https://data.mendeley.com/datasets/2d3kr8zynf/2).

## Imports and data uploads

The libraries required to run the code are imported below, and the original data set is loaded.

In [None]:
# Imports
import pandas as pd
from collections import Counter

# Path to the original data set
path_data = "path_to_data_in_your_computer"

# Path where the processed data set will be saved
path_save = "path_to_save_the_data_in_your_computer"

df = pd.read_csv(path_data)
df.shape

## Pre-processing

### Correction of data

Records confirmed as Chikungunya but not marked as such in the CLASSI_FIN attribute are corrected.

In [None]:
df.loc[(df.RES_CHIKS1 == 1) & (df.CLASSI_FIN != 13), 'CLASSI_FIN'] = 13
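
To see the effect of this rule in isolation, the sketch below applies it to a small made-up table. The `toy` frame and its values are hypothetical; the sketch assumes, as the rule above implies, that `RES_CHIKS1 == 1` marks a confirmed Chikungunya result and that 13 is the Chikungunya code in `CLASSI_FIN`.

```python
import pandas as pd

# Hypothetical records, not taken from SINAN.
toy = pd.DataFrame({
    "RES_CHIKS1": [1, 1, 2, None],
    "CLASSI_FIN": [13, 5, 5, 10],
})

# Rows with a confirmed Chikungunya result whose classification disagrees.
mask = (toy["RES_CHIKS1"] == 1) & (toy["CLASSI_FIN"] != 13)
n_fixed = int(mask.sum())

# Overwrite the classification only on those rows.
toy.loc[mask, "CLASSI_FIN"] = 13
```

Counting the masked rows before overwriting them (`n_fixed`) gives a quick record of how many notifications the correction touched. Rows where `RES_CHIKS1` is missing are left untouched, because a comparison with a missing value evaluates to false.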

### Null data removal

Attributes that were null in 60% or more of the notifications were removed.

In [None]:
df = df.loc[:, df.isnull().mean() < .60]
df.shape
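
As a rough illustration of how this filter behaves, the following sketch computes the per-column null fraction on a made-up frame and keeps only the columns below the threshold. The frame `toy` and the column names `A` to `D` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 5 rows and increasing shares of missing values.
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],                          # 0% null
    "B": [1, np.nan, np.nan, 4, 5],                # 40% null
    "C": [np.nan, np.nan, np.nan, np.nan, 5],      # 80% null
    "D": [np.nan] * 5,                             # 100% null
})

# isnull() yields booleans; their column-wise mean is the null fraction.
null_fraction = toy.isnull().mean()

# Keep only the columns whose null fraction is below 60%.
kept = toy.loc[:, null_fraction < 0.60]
dropped = sorted(set(toy.columns) - set(kept.columns))
```

Inspecting `null_fraction` before filtering makes it easy to check which attributes sit close to the threshold.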

### Removing unnecessary attributes

Attributes that would not be useful for the final result were removed: those related to system configuration and those holding the same value in every record.

In [None]:
df.drop(columns=["CS_FLXRET", "TP_SISTEMA", "CRITERIO", "TP_NOT"], inplace=True)
df.shape

### Standardization of column values

Values that were entered in different ways are standardized to a single representation.

In [None]:
# ID_AGRAVO was entered in slightly different forms in some records.
df.loc[df['ID_AGRAVO']=='A92.', 'ID_AGRAVO'] = 'A92'
df.loc[df['ID_AGRAVO']=='A92.0', 'ID_AGRAVO'] = 'A92'
df.loc[df['ID_AGRAVO']=='A920', 'ID_AGRAVO'] = 'A92'

# DENGUE
df.loc[df['CLASSI_FIN']==1, 'CLASSI_FIN'] = 'Dengue'
df.loc[df['CLASSI_FIN']==2, 'CLASSI_FIN'] = 'Dengue'
df.loc[df['CLASSI_FIN']==10, 'CLASSI_FIN'] = 'Dengue'
df.loc[df['CLASSI_FIN']==11, 'CLASSI_FIN'] = 'Dengue'
df.loc[df['CLASSI_FIN']==12, 'CLASSI_FIN'] = 'Dengue'

# CHIKUNGUNYA
df.loc[df['CLASSI_FIN']==13, 'CLASSI_FIN'] = 'Chikungunya'

# Discarded/Inconclusive
df.loc[df['CLASSI_FIN']==5, 'CLASSI_FIN'] = 'Discarded/Inconclusive'
df.loc[df['CLASSI_FIN']==8, 'CLASSI_FIN'] = 'Discarded/Inconclusive'
df.loc[df['CLASSI_FIN']==6, 'CLASSI_FIN'] = 'Discarded/Inconclusive'
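
The same standardization can also be written as dictionary lookups, which keeps every code-to-label pair in one place. This is only an alternative sketch, not the code used to build the data set: the `toy` frame, the dictionary names, and the `relabel` helper are hypothetical, while the codes and labels are the ones listed above.

```python
import pandas as pd

# Code-to-label pairs copied from the assignments above.
id_agravo_labels = {"A92.": "A92", "A92.0": "A92", "A920": "A92"}
classi_fin_labels = {
    1: "Dengue", 2: "Dengue", 10: "Dengue", 11: "Dengue", 12: "Dengue",
    13: "Chikungunya",
    5: "Discarded/Inconclusive",
    6: "Discarded/Inconclusive",
    8: "Discarded/Inconclusive",
}


def relabel(series, labels):
    """Replace values found in `labels`; leave every other value unchanged."""
    return series.map(lambda value: labels.get(value, value))


# Hypothetical records, not taken from SINAN.
toy = pd.DataFrame({
    "ID_AGRAVO": ["A92.", "A92.0", "A920", "A90"],
    "CLASSI_FIN": [10, 13, 5, 8],
})
toy["ID_AGRAVO"] = relabel(toy["ID_AGRAVO"], id_agravo_labels)
toy["CLASSI_FIN"] = relabel(toy["CLASSI_FIN"], classi_fin_labels)
```

Values without an entry in the dictionary pass through unchanged, which matches the behavior of the row-by-row assignments.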

### Filling null data with default values

Null values remaining in the retained attributes were filled with the default values defined in the data dictionary.

In [None]:
# In these attributes, the value for "not informed" is 4.
colunas_exames = [
    "RESUL_SORO",
    "RESUL_NS1",
    "RESUL_VI_N",
    "RESUL_PCR_",
    "HISTOPA_N",
    "IMUNOH_N"
]
for coluna in colunas_exames:
    df.loc[df[coluna].isnull(), coluna] = 4

# In the other attributes, the value of "not informed" is 9.
for nome in df.columns:
    df.loc[df[nome].isnull(), nome] = 9

df.shape
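
The two-step fill can be sketched more compactly with `fillna`. The example below is a hedged equivalent on a made-up frame: `toy`, `exam_columns`, and the column `OTHER_ATTR` are hypothetical stand-ins, and only the default values 4 and 9 come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the exam-result attributes.
exam_columns = ["RESUL_SORO", "RESUL_NS1"]

# Hypothetical records; OTHER_ATTR stands in for any non-exam attribute.
toy = pd.DataFrame({
    "RESUL_SORO": [1.0, np.nan],
    "RESUL_NS1": [np.nan, 2.0],
    "OTHER_ATTR": [1.0, np.nan],
})

# Step 1: exam-result attributes receive the default value 4.
toy[exam_columns] = toy[exam_columns].fillna(4)

# Step 2: every null that is still left receives the default value 9.
toy = toy.fillna(9)
```

The order matters: if the general fill ran first, the exam-result attributes would receive 9 instead of 4.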

### Transformation of values

The CS_SEXO attribute values were converted to numeric codes (F to 0, M to 1, I to 2).

In [None]:
df.loc[df['CS_SEXO'] == "F", 'CS_SEXO'] = 0
df.loc[df['CS_SEXO'] == "M", 'CS_SEXO'] = 1
df.loc[df['CS_SEXO'] == "I", 'CS_SEXO'] = 2
df.shape

## Saving the data set

In [None]:
df.to_csv(path_save, sep=",", index=False)