# Clinical cases of Dengue and Chikungunya - 2015-2020

This data set presents clinical and sociodemographic information on patients with confirmed Dengue or Chikungunya, as well as on patients whose cases were discarded for these same diseases. The data come from two databases. The first is the Health Problem and Notification Information System (in Portuguese, Sistema de Informação de Agravo de Notificação, SINAN), with cases that occurred in the state of Amazonas from 2015 to 2020. The second is Dados Recife, an open data portal of the city of Recife, in the state of Pernambuco, also covering 2015 to 2020. The data set has 17,172 records and 27 attributes.

The data dictionary is available, in Portuguese, at the links below:
- [common and sociodemographic data](http://portalsinan.saude.gov.br/images/documentos/Agravos/Notificacao_Individual/DIC_DADOS_NET---Notificao-Individual_rev.pdf)
- [clinical and laboratory data](http://portalsinan.saude.gov.br/images/documentos/Agravos/Dengue/DIC_DADOS_ONLINE.pdf)

The data set resulting from this project can be found [at this link](https://data.mendeley.com/datasets/2d3kr8zynf/2).

## Imports and data uploads

Libraries needed for code execution.

In [None]:
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
from collections import Counter  #Counter of classes
from imblearn.under_sampling import RandomUnderSampler  #UnderSampler

# Path where the original data set is located
path_data_sinan = "path_to_sinan_data_in_your_computer"
path_data_recife = "path_to_recife_data_in_your_computer"

path_save = "path_to_save_the_data_in_your_computer"

sinan_df = pd.read_csv(path_data_sinan)
recife_df = pd.read_csv(path_data_recife)

## Reading Data sets

### Recife-db
recife-db uses a different column nomenclature from sinan-db. The data dictionary lists both nomenclatures, so the first step is to rename the columns of recife-db to their counterparts in sinan-db.

In [None]:
columns = {
    "nu_notificacao": "NU_NOTIFIC",
    "tp_notificacao": "TP_NOT",
    "dt_notificacao": "DT_NOTIFIC",
    "ds_semana_notificacao": "SEM_NOT",
    "notificacao_ano": "NU_ANO",
    "co_uf_notificacao": "SG_UF_NOT",
    "co_municipio_notificacao": "ID_MUNICIP",
    "id_regional": "ID_REGIONA",
    "co_unidade_notificacao": "ID_UNIDADE",
    "dt_diagnostico_sintoma": "DT_SIN_PRI",
    "ds_semana_sintoma": "SEM_PRI",
    "dt_nascimento": "DT_NASC",
    "nu_idade": "NU_IDADE_N",
    "tp_sexo": "CS_SEXO",
    "tp_gestante": "CS_GESTANT",
    "tp_raca_cor": "CS_RACA",
    "tp_escolaridade": "CS_ESCOL_N",
    "co_uf_residencia": "SG_UF",
    "co_municipio_residencia": "ID_MN_RESI",
    "co_regional_residencia": "ID_RG_RESI",
    "co_distrito_residencia": "ID_DISTRIT",
    "co_bairro_residencia": "ID_BAIRRO",
    "no_bairro_residencia": "NM_BAIRRO",
    "tp_zona_residencia": "CS_ZONA",
    "co_pais_residencia": "ID_PAIS",
    "dt_investigacao": "DT_INVEST",
    "co_cbo_ocupacao": "ID_OCUPA_N",
    "febre": "FEBRE",
    "mialgia": "MIALGIA",
    "cefaleia": "CEFALEIA",
    "exantema": "EXANTEMA",
    "vomito": "VOMITO",
    "nausea": "NAUSEA",
    "dor_costas": "DOR_COSTAS",
    "conjutivite": "CONJUNTVIT",
    "artrite": "ARTRITE",
    "artralgia": "ARTRALGIA",
    "petequia_n": "PETEQUIA_N",
    "leucopenia": "LEUCOPENIA",
    "laco": "LACO",
    "dor_retro": "DOR_RETRO",
    "diabetes": "DIABETES",
    "hematolog": "HEMATOLOG",
    "hepatopat": "HEPATOPAT",
    "renal": "RENAL",
    "hipertensao": "HIPERTENSA",
    "acido_pept": "ACIDO_PEPT",
    "auto_imune": "AUTO_IMUNE",
    "dt_chil_s1": "DT_CHIK_S1",
    "dt_chil_s2": "DT_CHIK_S2",
    "dt_prnt": "DT_PRNT",
    "res_chiks1": "RES_CHIKS1",
    "res_chiks2": "RES_CHIKS2",
    "resul_prnt": "RESUL_PRNT",
    "dt_coleta_exame": "DT_SORO",
    "tp_result_exame": "RESUL_SORO",
    "dt_coleta_NS1": "DT_NS1",
    "Tp_result_NS1": "RESUL_NS1",
    "dt_coleta_isolamento": "DT_VIRAL",
    "tp_result_isolamento": "RESUL_VI_N",
    "dt_coleta_rtpcr": "DT_PCR",
    "tp_result_rtpcr": "RESUL_PCR_",
    "tp_sorotipo": "SOROTIPO",
    "tp_result_histopatologia": "HISTOPA_N",
    "tp_result_imunohistoquimica": "IMUNOH_N",
    "st_ocorreu_hospitalizacao": "HOSPITALIZ",
    "dt_internacao": "DT_INTERNA",
    "co_uf_hospital": "UF",
    "co_municipio_hospital": "MUNICIPIO",
    "tp_autoctone_residencia": "TPAUTOCTO",
    "co_uf_infeccao": "COUFINF",
    "co_pais_infeccao": "COPAISINF",
    "co_municipio_infeccao": "COMUNINF",
    "co_distrito_infeccao": "CODISINF",
    "co_bairro_infeccao": "CO_BAINF",
    "no_bairro_infeccao": "NOBAIINF",
    "tp_classificacao_final": "CLASSI_FIN",
    "tp_criterio_confirmacao": "CRITERIO",
    "st_doenca_trabalho": "DOENCA_TRA",
    "clinc_chik": "CLINC_CHIK",
    "tp_evolucao_caso": "EVOLUCAO",
    "dt_obito": "DT_OBITO",
    "dt_encerramento": "DT_ENCERRA",
    "alrm_hipot": "ALRM_HIPOT",
    "alrm_plaq": "ALRM_PLAQ",
    "alrm_vom": "ALRM_VOM",
    "alrm_sang": "ALRM_SANG",
    "alrm_hemat": "ALRM_HEMAT",
    "alrm_abdom": "ALRM_ABDOM",
    "alrm_letar": "ALRM_LETAR",
    "alrm_hepat": "ALRM_HEPAT",
    "alrm_liq": "ALRM_LIQ",
    "dt_alrm": "DT_ALRM",
    "grav_pulso": "GRAV_PULSO",
    "grav_conv": "GRAV_CONV",
    "grav_ench": "GRAV_ENCH",
    "grav_insuf": "GRAV_INSUF",
    "grav_taqui": "GRAV_TAQUI",
    "grav_extre": "GRAV_EXTRE",
    "grav_hipot": "GRAV_HIPOT",
    "grav_hemat": "GRAV_HEMAT",
    "grav_melen": "GRAV_MELEN",
    "grav_metro": "GRAV_METRO",
    "grav_sang": "GRAV_SANG",
    "grav_ast": "GRAV_AST",
    "grav_mioc": "GRAV_MIOC",
    "grav_consc": "GRAV_CONSC",
    "grav_orgao": "GRAV_ORGAO",
    "dt_grav": "DT_GRAV",
    "mami_hemor": "MANI_HEMOR",
    "epistaxe": "EPISTAXE",
    "gengivo": "GENGIVO",
    "metro": "METRO",
    "petequias": "PETEQUIAS",
    "hematura": "HEMATURA",
    "sangram": "SANGRAM",
    "laco_n": "LACO_N",
    "plasmatico": "PLASMATICO",
    "evidencia": "EVIDENCIA",
    "plaq_menor": "PLAQ_MENOR",
    "complica": "COMPLICA"
}

recife_df.rename(columns=columns, inplace=True)
recife_df.head()
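
`DataFrame.rename` silently ignores mapping keys that match no column, so a misspelled key would go unnoticed. The sketch below shows one way to list such keys on a toy frame. The names `toy_recife`, `toy_columns`, and `extra_col` are hypothetical and are not part of the original data.

```python
import pandas as pd

# Toy frame using two recife-db column names from the mapping above.
toy_recife = pd.DataFrame({
    "febre": [1, 2],
    "tp_sexo": ["F", "M"],
    "extra_col": [0, 0],  # hypothetical column that is absent from the mapping
})
toy_columns = {"febre": "FEBRE", "tp_sexo": "CS_SEXO", "mialgia": "MIALGIA"}

# Columns without a mapping entry keep their original name.
renamed = toy_recife.rename(columns=toy_columns)

# Mapping keys that matched no column (possible typos or absent fields).
unmatched = sorted(set(toy_columns) - set(toy_recife.columns))

print(list(renamed.columns))
print(unmatched)
```

Running the same check against `recife_df.columns` before the rename would show which entries of `columns` have no effect.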

## Unification of the two data sets

In [None]:
df = pd.concat([sinan_df, recife_df], join='inner')
df.head()
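
With `join='inner'`, `pd.concat` keeps only the columns present in both frames. The toy example below illustrates this; the frames `a` and `b` are hypothetical, and `ignore_index=True` is an optional addition (not used in the notebook) that gives the result a fresh row index.

```python
import pandas as pd

a = pd.DataFrame({"FEBRE": [1, 2], "MIALGIA": [1, 1], "ONLY_A": [0, 0]})
b = pd.DataFrame({"FEBRE": [2], "MIALGIA": [2], "ONLY_B": [5]})

# Inner join on the column axis: ONLY_A and ONLY_B are dropped.
merged = pd.concat([a, b], join="inner", ignore_index=True)

print(list(merged.columns))
print(merged.shape)
```

Without `ignore_index=True`, the row labels of both inputs are kept, so the combined frame can contain repeated index values.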

## Removal of empty rows
Some rows have no information about the patients' symptoms. We removed them because the symptoms are the most important information in this work.

In [None]:
# Only the FEBRE (fever) column is checked: we observed that when one symptom is missing, all of them are.
df.dropna(subset=['FEBRE'], how='any', inplace=True)

## Removal of empty columns
Columns in which more than 50% of the values are null were removed.

In [None]:
empty_columns = [
    'ID_DISTRIT',
    'ID_OCUPA_N',
    'DT_CHIK_S1',
    'DT_CHIK_S2',
    'DT_PRNT',
    'RESUL_PRNT',
    'DT_SORO',
    'DT_NS1',
    'DT_VIRAL',
    'DT_PCR',
    'RESUL_PCR_',
    'SOROTIPO',
    'HISTOPA_N',
    'IMUNOH_N',
    'DT_INTERNA',
    'UF',
    'MUNICIPIO',
    'TPAUTOCTO',
    'COUFINF',
    'COPAISINF',
    'COMUNINF',
    'CODISINF',
    'CO_BAINF',
    'NOBAIINF',
    'DOENCA_TRA',
    'DT_OBITO',
    'ALRM_HIPOT',
    'ALRM_PLAQ',
    'ALRM_VOM',
    'ALRM_SANG',
    'ALRM_HEMAT',
    'ALRM_ABDOM',
    'ALRM_LETAR',
    'ALRM_HEPAT',
    'ALRM_LIQ',
    'DT_ALRM',
    'DT_GRAV',
    'EPISTAXE',
    'GENGIVO',
    'METRO',
    'PETEQUIAS',
    'HEMATURA',
    'SANGRAM',
    'LACO_N',
    'PLASMATICO',
    'EVIDENCIA',
    'PLAQ_MENOR',
    'COMPLICA',
]

df.drop(columns=empty_columns, inplace=True)
df.head()
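
The list above is hard-coded. As a minimal sketch, assuming the 50% threshold stated in the text, such a list could also be derived from the data. The frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FEBRE": [1, 2, 1, 2],
    "DT_OBITO": [np.nan, np.nan, np.nan, "2016-01-01"],  # 75% null
    "CS_SEXO": ["F", np.nan, "M", "F"],                  # 25% null
})

# Fraction of null values per column.
null_fraction = toy.isnull().mean()

# Columns with more than half of their values missing.
mostly_empty = null_fraction[null_fraction > 0.5].index.tolist()

reduced = toy.drop(columns=mostly_empty)
print(mostly_empty)
```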

## Removing unimportant columns
Columns whose information is not relevant to this work were removed:

* IDs
* Constant columns (location, health problem, etc.)
* Admission and departure dates
* etc.

In [None]:
unimportant_columns = [
    'NU_NOTIFIC', 'TP_NOT', 'SG_UF_NOT', 'ID_MUNICIP', 'ID_REGIONA', 'SEM_NOT',
    'NU_ANO', 'ID_UNIDADE', 'SEM_PRI', 'DT_NASC', 'CS_ESCOL_N', 'SG_UF',
    'ID_MN_RESI', 'ID_RG_RESI', 'ID_BAIRRO', 'NM_BAIRRO', 'ID_PAIS',
    'DT_INVEST', 'CRITERIO', 'EVOLUCAO', 'DT_ENCERRA', 'GRAV_PULSO',
    'GRAV_CONV', 'GRAV_ENCH', 'GRAV_INSUF', 'GRAV_TAQUI', 'GRAV_EXTRE',
    'GRAV_HIPOT', 'GRAV_HEMAT', 'GRAV_MELEN', 'GRAV_METRO', 'GRAV_SANG',
    'GRAV_AST', 'GRAV_MIOC', 'GRAV_CONSC', 'GRAV_ORGAO', 'HOSPITALIZ',
    'RESUL_NS1', 'RESUL_VI_N', 'RES_CHIKS1', 'RES_CHIKS2', 'RESUL_SORO',
    'CLINC_CHIK'
]

df.drop(columns=unimportant_columns, inplace=True)
df.head()

## Data transformation

In [None]:
# Encoding the CS_SEXO (sex) column as integers
df.loc[df['CS_SEXO'] == "F", 'CS_SEXO'] = 0
df.loc[df['CS_SEXO'] == "M", 'CS_SEXO'] = 1
df.loc[df['CS_SEXO'] == "I", 'CS_SEXO'] = 2

# Creating and filling in the DIAS (days) column
# It holds the difference between the notification date (DT_NOTIFIC) and the symptom onset date (DT_SIN_PRI)
# After this step, the two date columns are removed
df['DIAS'] = 0
df['DT_NOTIFIC'] = pd.to_datetime(df['DT_NOTIFIC'], dayfirst=True)
df['DT_SIN_PRI'] = pd.to_datetime(df['DT_SIN_PRI'], dayfirst=True)
df['DIAS'] = df['DT_NOTIFIC'] - df['DT_SIN_PRI']
df.drop(columns=['DT_NOTIFIC', 'DT_SIN_PRI'], inplace=True)
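
Subtracting two datetime columns yields a timedelta column. If a plain number of days is preferred, the `.dt.days` accessor can be applied, as in the sketch below. This is an optional variation, not the notebook's code, and the dd/mm/yyyy strings in `toy` are assumed for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "DT_NOTIFIC": ["10/01/2016", "05/03/2017", "01/02/2018"],
    "DT_SIN_PRI": ["07/01/2016", "01/03/2017", None],  # last onset date missing
})

for col in ["DT_NOTIFIC", "DT_SIN_PRI"]:
    toy[col] = pd.to_datetime(toy[col], dayfirst=True)

# Whole days between symptom onset and notification; missing dates give NaN.
toy["DIAS"] = (toy["DT_NOTIFIC"] - toy["DT_SIN_PRI"]).dt.days
print(toy["DIAS"].tolist())
```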

## Filling in null values
Columns that still had null values were filled with 9, the value that the data dictionary assigns to unfilled fields.

In [None]:
for name in df.columns:
    df.loc[df[name].isnull(), name] = 9
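
For reference, the same fill can be written in one call with `DataFrame.fillna`. The sketch below also counts the remaining nulls first, which helps confirm which columns are affected. The frame `toy` is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FEBRE": [1.0, np.nan, 2.0],
    "CS_RACA": [np.nan, 4.0, np.nan],
})

# Number of nulls per column before filling.
remaining_nulls = toy.isnull().sum()

# Replace every null with the "unfilled" code used in the notebook.
filled = toy.fillna(9)
print(remaining_nulls.to_dict())
```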

## Standardization of results in the output target
The CLASSI_FIN column uses several different codes for the same disease. These codes were grouped into a single label per disease.

In [None]:
# DENGUE
# 1  - Classic Dengue
# 2  - Dengue with complications
# 10 - Dengue
# 11 - Dengue with warning signs
# 12 - Severe Dengue
df.loc[df['CLASSI_FIN'] == 1, 'CLASSI_FIN'] = 'DENGUE'
df.loc[df['CLASSI_FIN'] == 2, 'CLASSI_FIN'] = 'DENGUE'
df.loc[df['CLASSI_FIN'] == 10, 'CLASSI_FIN'] = 'DENGUE'
df.loc[df['CLASSI_FIN'] == 11, 'CLASSI_FIN'] = 'DENGUE'
df.loc[df['CLASSI_FIN'] == 12, 'CLASSI_FIN'] = 'DENGUE'

# CHIKUNGUNYA
# 13 - Chikungunya
df.loc[df['CLASSI_FIN'] == 13, 'CLASSI_FIN'] = 'CHIKUNGUNYA'

# OTHER
# 5 - Discarded
# 8 - Inconclusive
df.loc[df['CLASSI_FIN'] == 5, 'CLASSI_FIN'] = 'OTHER'
df.loc[df['CLASSI_FIN'] == 8, 'CLASSI_FIN'] = 'OTHER'
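
The grouping above can also be expressed as a single dictionary, which keeps the code-to-label table in one place. This is a sketch of an alternative, not the notebook's code. Note one difference in behavior: `Series.map` turns codes missing from the dictionary into NaN, whereas the `df.loc` assignments leave them unchanged. The names `class_map` and `toy` are hypothetical.

```python
import pandas as pd

class_map = {
    1: "DENGUE", 2: "DENGUE", 10: "DENGUE", 11: "DENGUE", 12: "DENGUE",
    13: "CHIKUNGUNYA",
    5: "OTHER", 8: "OTHER",
}

# The last value, 9, is the "unfilled" code and has no entry in class_map.
toy = pd.Series([1, 13, 5, 12, 8, 9], name="CLASSI_FIN")

labels = toy.map(class_map)

# Codes that were not covered by the mapping.
unmapped = toy[labels.isna()]
print(unmapped.tolist())
```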

## Applying the undersampling
The undersampling technique is applied to balance the classes.

In [None]:
# Shows that the data set is unbalanced
print(Counter(df.CLASSI_FIN))

X = df.drop('CLASSI_FIN', axis=1)
Y = df.CLASSI_FIN
X = np.array(X)
Y = np.array(Y)

# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='not minority')
# fit and apply the transform
X_under, y_under = undersample.fit_resample(X, Y)
# summarize class distribution
print(Counter(y_under))

# Transforming the undersampling base into a dataframe to save the csv

myData = np.c_[X_under, y_under]
columns = df.drop('CLASSI_FIN', axis=1).columns.tolist()
columns.append('CLASSI_FIN')

df = pd.DataFrame(data=myData, columns=columns)
df
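
To illustrate what random undersampling does, the sketch below reproduces the idea with plain pandas: every class is sampled down to the size of the smallest one. It is a stand-in for illustration only and does not use `RandomUnderSampler`. The frame `toy` and the fixed `random_state` are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "FEBRE": [1, 1, 2, 1, 2, 1, 1, 2, 1],
    "CLASSI_FIN": ["DENGUE"] * 5 + ["CHIKUNGUNYA"] * 2 + ["OTHER"] * 2,
})

# Size of the smallest class.
minority_size = toy["CLASSI_FIN"].value_counts().min()

# Draw the same number of rows from every class.
balanced = toy.groupby("CLASSI_FIN").sample(n=minority_size, random_state=0)
print(balanced["CLASSI_FIN"].value_counts().to_dict())
```

`RandomUnderSampler` also accepts a `random_state` argument, which makes the selected rows reproducible between runs.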

## Saving data set

In [None]:
import os

# os.path.join adds the path separator when path_save has no trailing one
df.to_csv(os.path.join(path_save, 'data_set.csv'), sep=';', index=False)