# Malaria Notifications Dataset in the Legal Amazon, Brazil (2003-2022): A Comprehensive Resource for Research and Surveillance

This dataset contains notifications of confirmed malaria cases in the Legal Amazon, Brazil. The data are derived from compulsory notifications stored in the Malaria Epidemiological Surveillance Information System (SIVEP-Malaria) and cover the period from 2003 to 2022. Making this dataset available is expected to help researchers develop effective strategies to mitigate and eradicate malaria in the region.

The dataset resulting from this project is available [at this link](https://data.mendeley.com/datasets/9n6b97fsbd/1).

## Imports and data uploads

Libraries needed for code execution.

In [None]:
# Imports
import glob
import os

import pandas as pd

# Path where the original data set is located
path_data = "path_to_data_in_your_computer"

# Path of the CSV file where the processed data set will be saved
path_save = "path_to_save_the_data_in_your_computer"

# Yearly notification files (NOTI*.csv) inside the dataset folder
files = sorted(glob.glob(os.path.join(path_data, 'dataset', 'NOTI*.csv')))



## Preprocessing

### Dataset Integration

The first step was to consolidate the yearly datasets into a single dataset.

In [None]:
df = pd.concat((pd.read_csv(f, on_bad_lines='skip', sep=';') for f in files), ignore_index=True)
df.shape
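
The consolidation step can be tried on a small scale before pointing it at the real files. The sketch below is a minimal illustration, assuming the yearly files share the same `;`-separated header; the helper `load_notifications`, the temporary folder, and the toy rows are hypothetical and not part of the original notebook.

```python
import glob
import os
import tempfile

import pandas as pd


def load_notifications(folder, pattern="NOTI*.csv"):
    """Read every CSV in `folder` matching `pattern` and stack them row-wise."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if not paths:
        # Fail early instead of letting pd.concat raise on an empty list.
        raise FileNotFoundError(f"No files match {pattern!r} in {folder!r}")
    frames = [pd.read_csv(p, sep=";", on_bad_lines="skip") for p in paths]
    return pd.concat(frames, ignore_index=True)


# Toy demonstration with two made-up yearly files (hypothetical values).
header = "COD_NOTI;DT_NOTIF;MUN_NOTI;RES_EXAM\n"
toy_files = {
    "NOTI2003.csv": "1;01/02/2003;10;2\n",
    "NOTI2004.csv": "2;05/03/2004;11;1\n3;06/03/2004;12;3\n",
}

with tempfile.TemporaryDirectory() as tmp:
    for name, rows in toy_files.items():
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(header + rows)
    combined = load_notifications(tmp)

print(combined.shape)
```

Raising when no file matches makes a wrong path visible immediately; `pd.concat` would otherwise fail with a less direct message.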



### Discarding negative malaria cases

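Notifications with a negative test result were discarded by keeping only the rows whose `RES_EXAM` value is greater than 1. Attributes that were null in more than 60% of notifications were also removed.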

In [None]:
# Keep only rows with RES_EXAM > 1; .copy() gives an independent DataFrame
df_1 = df.loc[df['RES_EXAM'] > 1].copy()
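
Removing attributes that are null in more than 60% of notifications is not shown in the cell above. A minimal sketch of how such a filter could look is given here, followed by the `RES_EXAM > 1` row filter. The helper `drop_sparse_columns`, the column `MOSTLY_EMPTY`, the toy values, and the order of the two steps are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_null_share=0.60):
    """Return `frame` without the columns whose null share exceeds `max_null_share`."""
    null_share = frame.isna().mean()  # fraction of null values per column
    keep = null_share[null_share <= max_null_share].index
    return frame[keep]


# Toy data (hypothetical values); MOSTLY_EMPTY is 80% null.
toy = pd.DataFrame({
    "COD_NOTI": [1, 2, 3, 4, 5],
    "RES_EXAM": [1, 2, 3, 1, 4],
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, np.nan, 7.0],
})

cleaned = drop_sparse_columns(toy)
positives = cleaned.loc[cleaned["RES_EXAM"] > 1].copy()

print(list(cleaned.columns))
print(positives)
```

Computing the null share with `isna().mean()` keeps the threshold explicit, so the 60% cut-off can be changed in one place.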


### Feature Selection

Attributes that are not needed for the final dataset were removed, keeping only the four columns selected below.

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the columns are modified later
dataset = df_1[['COD_NOTI', 'DT_NOTIF', 'MUN_NOTI', 'RES_EXAM']].copy()

### Data Transformation

After this filtering process, the notification date variable (`DT_NOTIF`), initially in numeric format, was converted to a DateTime variable. The attributes were also renamed to more descriptive names.

In [None]:
# Rename columns (RES_EXAM is mapped to a single new name)
dataset = dataset.rename(columns={
    'DT_NOTIF': 'Date',
    'MUN_NOTI': 'Municipality',
    'RES_EXAM': 'Types of Malaria',
})

# Convert the renamed 'Date' column to DateTime and use it as the index
dataset['Date'] = pd.to_datetime(dataset['Date'], dayfirst=True)
dataset = dataset.set_index('Date')
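
The renaming and date conversion can be checked on toy data first. The sketch below assumes the dates arrive as `dd/mm/yyyy` strings, which is what `dayfirst=True` suggests but is not confirmed by the source; the toy values are hypothetical. It uses an explicit `format` with `errors="coerce"` so that unparseable values become `NaT` and can be counted rather than raising an error.

```python
import pandas as pd

# Toy data (hypothetical values); the third date is deliberately invalid.
toy = pd.DataFrame({
    "COD_NOTI": [1, 2, 3],
    "DT_NOTIF": ["01/02/2003", "15/03/2004", "not a date"],
    "MUN_NOTI": [10, 11, 12],
    "RES_EXAM": [2, 3, 2],
})

renamed = toy.rename(columns={
    "DT_NOTIF": "Date",
    "MUN_NOTI": "Municipality",
    "RES_EXAM": "Types of Malaria",
})

# Day-first parsing; invalid entries are turned into NaT instead of raising.
renamed["Date"] = pd.to_datetime(renamed["Date"], format="%d/%m/%Y", errors="coerce")
unparsed = int(renamed["Date"].isna().sum())

# Drop the rows that could not be parsed and index the rest by date.
indexed = renamed.dropna(subset=["Date"]).set_index("Date").sort_index()

print(unparsed)
print(indexed)
```

Counting the `NaT` values before dropping them shows how many notifications would be lost to malformed dates.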


## Saving the dataset

In [None]:
# 'Date' is the index, so it must be written out; path_save should point to a CSV file
dataset.to_csv(path_save, sep=",", index=True)
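
When the date is stored as the index, a quick round trip confirms that it survives saving and reloading. This is a minimal sketch on toy data; the file name `malaria_notifications.csv`, the temporary folder, and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame indexed by date (hypothetical values).
toy = pd.DataFrame(
    {"COD_NOTI": [1, 2], "Municipality": [10, 11]},
    index=pd.DatetimeIndex(["2003-02-01", "2004-03-15"], name="Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, "malaria_notifications.csv")
    # index=True writes the 'Date' index as the first column of the file.
    toy.to_csv(out_file, sep=",", index=True)
    # Read it back, restoring 'Date' as a parsed DateTime index.
    restored = pd.read_csv(out_file, index_col="Date", parse_dates=True)

print(restored.index)
```

Saving with `index=False` in this situation would drop the dates from the file, since they are held only in the index.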