<a href="https://colab.research.google.com/github/acoiman/pdt/blob/main/asthma_mortality/notebooks/Python/01_Asthma_Mortality_PP_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Asthma Mortality Data Preprocessing (Part 1)

In Part 1 of the data preprocessing, we will clean the asthma mortality data. The processing will consist of: i) Missing data handling; ii) Duplicate removal; and iii) Data validation.

## Load libraries

The libraries required for the analysis will be loaded.

In [None]:
# DataFrame libraries
import pandas as pd
from pandas.api.types import CategoricalDtype

# other libraries
import csv

## Load data

The data on asthma mortality will be loaded and displayed for preprocessing and analysis.

In [None]:
# change directory to the work folder (the Docker container starts in /home/jovyan/)
%cd work

/home/jovyan/work


In [None]:
# Read an Excel file into a pandas DataFrame
df = pd.read_excel("pdt/asthma_mortality/data/xls/def_asma_2001_2023.xlsx")

In [None]:
# Display the first few rows of the DataFrame
df.head()

In [None]:
# Return the number of rows in the DataFrame.
len(df)

In [None]:
# Filter the DataFrame `df` to exclude rows where the value in the 'Año' column is equal to 2023.
df = df[df['Año'] != 2023]

In [None]:
# Return unique values in the "Año" column of a DataFrame
df["Año"].unique()

In [None]:
# Return the number of rows in the DataFrame.
len(df)

In [None]:
# difference in length of dataframe
8404-8178

226 records (8404-8178) corresponding to the year 2023 were excluded.
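The number of excluded rows can also be computed from the DataFrame itself instead of typing the lengths by hand. The following is a minimal sketch on a toy table; the `toy` data and the variable names are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Toy data standing in for the real table (values are illustrative only)
toy = pd.DataFrame({"Año": [2021, 2022, 2023, 2023], "Cantidad": [1, 2, 1, 3]})

# Record the length before filtering, then compare it with the length after
rows_before = len(toy)
toy_filtered = toy[toy["Año"] != 2023]
rows_removed = rows_before - len(toy_filtered)

print(f"{rows_removed} rows removed")
```

Computing the difference this way keeps the reported count correct if the input file changes.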

## Missing values

Missing values will be identified in each column.


In [None]:
# Display a concise summary of a DataFrame including the index dtype and column dtypes, non-null values, and memory usage.
df.info()
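As a complement to `df.info()`, missing values can be counted per column directly. This is a small sketch on made-up data; the `toy` table is an assumption used only for illustration.

```python
import pandas as pd

# Toy table with some missing entries (None becomes NaN in numeric columns)
toy = pd.DataFrame({
    "Sexo": [1, 2, None],
    "Cantidad": [1, None, None],
})

# isna() flags missing cells; sum() counts them column by column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values.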

## Duplicate values
For this dataset, it is not necessary to remove duplicate values.
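If the number of fully identical rows is of interest, it can be inspected without dropping anything. The sketch below uses a toy table (an assumption for illustration). Whether identical rows in the real data are separate valid records is an interpretation, not something this example can confirm.

```python
import pandas as pd

# Toy table in which the first two rows are identical
toy = pd.DataFrame({
    "ANIO": [2001, 2001, 2001],
    "SEXO": [1, 1, 2],
    "CANTIDAD": [1, 1, 1],
})

# duplicated() marks every repeat of an earlier row; the first occurrence is not marked
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```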

## Data validation
Data validation involves verifying and modifying data so that its values are consistent with its corresponding classes. The following operations will be performed:

* Generate a column for department ID (IDDPTO) to match Argentina's official administrative divisions.

* Check the values of the sex column (e.g., "Male," "Female," "Other/Unknown").

* Verify that age group categories are consistent (e.g., "0-4", "5-14", "15-24", etc.)

* Verify that the death toll is not negative.





### Creating a column for department ID (IDDPTO)



Records with undefined provinces and departments will be eliminated, starting with the province codes 98 and 99.

In [None]:
# Filter a DataFrame to exclude rows where the 'Prov_resid' column is equal to 98 or 99.
df = df[(df['Prov_resid'] != 98) & (df['Prov_resid'] != 99)]

In [None]:
# Return the number of rows in the DataFrame.
len(df)

In [None]:
# difference in length of dataframe
8178-8155

23 records (8178-8155) belonging to other countries and unspecified provinces were eliminated.
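The same exclusion can be written with `isin`, which may be easier to extend if more codes need to be dropped. This is a sketch on toy data; the `toy` table and the `undefined_codes` name are illustrative assumptions.

```python
import pandas as pd

# Toy province codes, including the two codes excluded above (98 and 99)
toy = pd.DataFrame({"Prov_resid": [2, 6, 98, 99, 94]})

# Codes treated as undefined; keep every row whose code is NOT in this list
undefined_codes = [98, 99]
kept = toy[~toy["Prov_resid"].isin(undefined_codes)]

print(kept["Prov_resid"].tolist())  # [2, 6, 94]
```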

To ensure that department IDs match the official administrative divisions of Argentina, two functions will be created: 1) one that prepends a 0 to single-digit values in the "Prov_resid" column; 2) one that left-pads values below 100 in the "Dep_Resid" column with zeros to three digits.

In [None]:
# For df['Prov_resid'], prepend 0 to codes less than 10

# Convert 'Prov_resid' to string type to allow string operations
df['Prov_resid'] = df['Prov_resid'].astype(str)

# Function to prepend '0' if the code has a single digit
def append_zero(prov):
    if len(prov) < 2:  # Check if the string length is less than 2.
        return "0" + prov
    return prov

In [None]:
# Apply the function to the 'Prov_resid' column
df['Prov_resid'] = df['Prov_resid'].apply(append_zero)

In [None]:
# Return unique values of the 'Prov_resid' column in the DataFrame 'df'
df['Prov_resid'].unique()
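The zero-padding described above can also be expressed with the vectorized pandas method `Series.str.zfill`, which left-pads strings with zeros to a fixed width. The sketch below is an alternative on toy values, not the notebook's own function; the series names are illustrative assumptions.

```python
import pandas as pd

# Toy province and department codes stored as integers
prov_codes = pd.Series([2, 6, 10, 94])
dep_codes = pd.Series([7, 28, 441])

# Convert to string, then pad to two digits (province) and three digits (department)
prov_padded = prov_codes.astype(str).str.zfill(2)
dep_padded = dep_codes.astype(str).str.zfill(3)

print(prov_padded.tolist())  # ['02', '06', '10', '94']
print(dep_padded.tolist())   # ['007', '028', '441']
```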

DEIS (Dirección de Estadísticas e Información de la Salud) considered the City of Buenos Aires (CABA) as a single geographic unit, and the subdivision of CABA was not homogeneous throughout the requested period (2001–2022). Therefore, CABA will be treated as a single department by assigning the value '000' to all its polygons in the 'Dep_Resid' column.

In [None]:
# For the City of Buenos Aires (Prov_resid == '02'), set Dep_Resid to '000'

# Convert 'Prov_resid' to string type if it's not already
df['Prov_resid'] = df['Prov_resid'].astype(str)

# Create a boolean mask where Prov_resid is '02'
mask = df['Prov_resid'] == '02'

# Convert 'Dep_Resid' to string type, then assign '000' where the mask is True
df['Dep_Resid'] = df['Dep_Resid'].astype(str)
df.loc[mask, 'Dep_Resid'] = '000'

In [None]:
# Display the first few rows of the DataFrame.
df.head()

In [None]:
# Return unique values of the 'Dep_Resid' column in the DataFrame 'df'
df['Dep_Resid'].unique()

In [None]:
# select all samples where Dep_Resid == "999"
selected_samples = df[df['Dep_Resid'] == "999"]
selected_samples

In [None]:
# Remove rows where 'Dep_Resid' is 999 (unspecified)
df = df[(df['Dep_Resid'] != "999")]

In [None]:
# Return unique values of the 'Dep_Resid' column in the DataFrame 'df'
df['Dep_Resid'].unique()

In [None]:
# Return the number of rows in the DataFrame.
len(df)

In [None]:
# difference in length of dataframe
8155-8070

85 records (8155-8070) belonging to unspecified departments were deleted.

In [None]:
# For df['Dep_Resid'], left-pad codes below 100 with zeros to three digits

# Convert 'Dep_Resid' to string type to allow string operations
df['Dep_Resid'] = df['Dep_Resid'].astype(str)

# Function to prepend zeros until the code has three characters
def append_zero(dep):
    if len(dep) < 2:  # one character: prepend two zeros
        return "00" + dep
    if len(dep) < 3:  # two characters: prepend one zero
        return "0" + dep
    return dep

In [None]:
# Apply the function to the 'Dep_Resid' column
df['Dep_Resid'] = df['Dep_Resid'].apply(append_zero)

In [None]:
# Return unique values of the 'Dep_Resid' column in the DataFrame 'df'
df['Dep_Resid'].unique()

Create a unique department identification (IDDPTO) based on "Prov_resid" + "Dep_Resid" according to INDEC (Instituto Nacional de Estadística y Censos) geographic codes (https://www.indec.gob.ar/indec/web/Nivel3-Tema-1-39).

In [None]:
# Concatenate the string columns 'Prov_resid' and 'Dep_Resid' and assign the result to a new column 'IDDPTO'.
df['IDDPTO'] = df['Prov_resid'] + df['Dep_Resid']

In [None]:
# Convert the 'IDDPTO' column in the DataFrame 'df' to a string data type.
df['IDDPTO'] = df['IDDPTO'].astype(str)
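Because both columns are strings, `+` concatenates them rather than adding numbers. A quick format check can confirm that every resulting ID has exactly five digits (two for the province, three for the department). The sketch below runs on toy codes; the `toy` table and the `is_valid` name are illustrative assumptions.

```python
import pandas as pd

# Toy codes that are already zero-padded strings
toy = pd.DataFrame({
    "Prov_resid": ["02", "06", "94"],
    "Dep_Resid": ["000", "028", "007"],
})

# String concatenation: '02' + '000' gives '02000'
toy["IDDPTO"] = toy["Prov_resid"] + toy["Dep_Resid"]

# Every ID should consist of exactly five digits
is_valid = toy["IDDPTO"].str.fullmatch(r"\d{5}")
print(toy["IDDPTO"].tolist())  # ['02000', '06028', '94007']
print(bool(is_valid.all()))    # True
```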

In [None]:
# Display the first few rows of the DataFrame
df.head()

In [None]:
# Rename the columns of a DataFrame.
df = df.rename(columns={'Año': 'ANIO', "Sexo": "SEXO", "Grupedad": "GRUPEDAD", "Cantidad": "CANTIDAD"})

In [None]:
# Select specific columns from a DataFrame.
df = df[['ANIO', 'IDDPTO', 'SEXO', 'GRUPEDAD', 'CANTIDAD']]

In [None]:
# Display the first few rows of the DataFrame
df.head()

In [None]:
# Save the DataFrame to a CSV file without sex and age group modifications
df.to_csv("pdt/asthma_mortality/data/csv/def_asma_2001_2022_clean_01.csv", index=False)

### Check sex values

In [None]:
# Read a CSV file into a pandas DataFrame
df = pd.read_csv("pdt/asthma_mortality/data/csv/def_asma_2001_2022_clean_01.csv", dtype={'IDDPTO': str})
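The `dtype={'IDDPTO': str}` argument matters because, by default, `read_csv` parses all-digit columns as integers, which drops leading zeros. The sketch below shows the difference on an in-memory CSV; the `csv_text` content is made up for illustration.

```python
import io
import pandas as pd

# A tiny CSV with zero-padded IDs (illustrative values)
csv_text = "ANIO,IDDPTO\n2001,02000\n2001,06028\n"

# Default parsing infers integers, so the leading zeros are lost
as_default = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the IDs exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"IDDPTO": str})

print(as_default["IDDPTO"].tolist())  # [2000, 6028]
print(as_text["IDDPTO"].tolist())     # ['02000', '06028']
```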

In [None]:
# Display the first few rows of the DataFrame
df.head()

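|   | ANIO | IDDPTO | SEXO | GRUPEDAD | CANTIDAD |
|---|------|--------|------|----------|----------|
| 0 | 2001 | 2000 | 2 | 03.45 a 64 | 1 |
| 1 | 2001 | 2000 | 2 | 05.75 y más | 1 |
| 2 | 2001 | 2000 | 1 | 04.65 a 74 | 1 |
| 3 | 2001 | 2000 | 1 | 04.65 a 74 | 1 |
| 4 | 2001 | 2000 | 2 | 05.75 y más | 1 |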


In [None]:
# Return the number of rows in the DataFrame.
len(df)

8070

In [None]:
# Return unique values in the 'SEXO' column of the DataFrame 'df'
df['SEXO'].unique()

array([2, 1, 9])

In [None]:
# Filter the DataFrame `df` to exclude rows where the 'SEXO' column has a value of 9.
df = df[df['SEXO'] != 9]

In [None]:
# Return unique values in the 'SEXO' column of the DataFrame 'df'
df['SEXO'].unique()

array([2, 1])

In [None]:
# Return the number of rows in the DataFrame.
len(df)

8057

In [None]:
# difference in length of dataframe
8070-8057

13 records (8070-8057) with undefined sex were eliminated.

In [None]:
# Convert the 'SEXO' column in a pandas DataFrame to a categorical data type.
df['SEXO'] = pd.Categorical(df['SEXO'])

In [None]:
# Return unique values in the 'SEXO' column of the DataFrame 'df'
df['SEXO'].unique()

[2, 1]
Categories (2, int64): [1, 2]

### Check age group categories

In [None]:
# Return unique values in the 'GRUPEDAD' column of the DataFrame 'df'.
df['GRUPEDAD'].unique()

array(['03.45 a 64', '05.75 y más', '04.65 a 74', '02.20 a 44',
       '01.5 a 19', '00.0 a 4', '10.Sin esp'], dtype=object)

In [None]:
# Filter the DataFrame to exclude rows where the 'GRUPEDAD' column is equal to '10.Sin esp'.
df = df[df['GRUPEDAD'] != '10.Sin esp']

In [None]:
# Return the number of rows in the DataFrame
len(df)

8028

In [None]:
# difference in length of dataframe
8057-8028

29

29 records (8057-8028) with unspecified GRUPEDAD were deleted.

Rename values in the GRUPEDAD column


In [None]:
# Replace specific values in the 'GRUPEDAD' column of a DataFrame with more readable ranges
df['GRUPEDAD'] = df['GRUPEDAD'].replace({
    '00.0 a 4': '0-4',
    '01.5 a 19': '5-19',
    '02.20 a 44': '20-44',
    '03.45 a 64': '45-64',
    '04.65 a 74': '65-74',
    '05.75 y más': '>= 75',
})

In [None]:
# Return unique values in the 'GRUPEDAD' column of the DataFrame 'df'.
df['GRUPEDAD'].unique()

array(['45-64', '>= 75', '65-74', '20-44', '5-19', '0-4'], dtype=object)
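The renaming can also be driven by a lookup table with `Series.map`, which has the side benefit of exposing any label that has no entry in the table. This is a sketch on toy values; `age_labels`, `raw`, and `unmapped` are illustrative names, and `'10.Sin esp'` is kept in the toy data on purpose to show the check.

```python
import pandas as pd

# Lookup table from the source labels to the readable ranges used above
age_labels = {
    "00.0 a 4": "0-4",
    "01.5 a 19": "5-19",
    "02.20 a 44": "20-44",
    "03.45 a 64": "45-64",
    "04.65 a 74": "65-74",
    "05.75 y más": ">= 75",
}

# Toy column that still contains an unspecified label
raw = pd.Series(["00.0 a 4", "05.75 y más", "10.Sin esp"])

# map() returns NaN for labels missing from the table
mapped = raw.map(age_labels)
unmapped = raw[mapped.isna()].unique()

print(list(unmapped))  # ['10.Sin esp']
```

Unlike `replace`, which leaves unknown values untouched, `map` turns them into NaN, so an empty `unmapped` array indicates that every label was covered.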

In [None]:
# Convert 'GRUPEDAD' to ordinal data

# Define the desired order of categories
categories = ['0-4', '5-19', '20-44', '45-64', '65-74', '>= 75']

# Create a categorical data type with the specified order
cat_type = CategoricalDtype(categories=categories, ordered=True)

# Convert the 'GRUPEDAD' column to the ordinal categorical type
df['GRUPEDAD'] = df['GRUPEDAD'].astype(cat_type)


In [None]:
# Return unique values in the 'GRUPEDAD' column of the DataFrame 'df'.
df['GRUPEDAD'].unique()

['45-64', '>= 75', '65-74', '20-44', '5-19', '0-4']
Categories (6, object): ['0-4' < '5-19' < '20-44' < '45-64' < '65-74' < '>= 75']
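Declaring the column as an ordered categorical means that sorting and comparisons follow the age order rather than alphabetical order. The sketch below illustrates this on a toy series; `ages` and the result names are illustrative assumptions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same category order as defined above
categories = ['0-4', '5-19', '20-44', '45-64', '65-74', '>= 75']
cat_type = CategoricalDtype(categories=categories, ordered=True)

# Toy values converted to the ordered categorical type
ages = pd.Series(['45-64', '0-4', '>= 75', '20-44']).astype(cat_type)

# Sorting follows the category order, not the string order
sorted_ages = ages.sort_values().tolist()

# Comparisons against a category are allowed because the type is ordered
is_65_plus = (ages >= '65-74').tolist()

print(sorted_ages)  # ['0-4', '20-44', '45-64', '>= 75']
print(is_65_plus)   # [False, False, True, False]
```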

### Check that the death toll is not negative

In [None]:
# Return unique values in the 'CANTIDAD' column of the DataFrame 'df'
df['CANTIDAD'].unique()

array([ 1,  2,  3,  5,  4,  7, 13,  6,  8, 10,  9, 11, 12, 16, 14, 15])
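Inspecting the unique values works for a small set of counts. An explicit check scales better and fails loudly if a negative value ever appears. The sketch below uses a few of the counts shown above as toy input; `counts` and `n_negative` are illustrative names.

```python
import pandas as pd

# Toy counts taken from the unique values listed above
counts = pd.Series([1, 2, 3, 16], name="CANTIDAD")

# Count how many entries are below zero
n_negative = int((counts < 0).sum())

# Stop with a clear message if any negative count is found
assert n_negative == 0, f"{n_negative} negative death counts found"
print(n_negative)  # 0
```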

In [None]:
# Display a concise summary of a DataFrame including the index dtype and column dtypes, non-null values, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8028 entries, 0 to 8069
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   ANIO      8028 non-null   int64   
 1   IDDPTO    8028 non-null   object  
 2   SEXO      8028 non-null   category
 3   GRUPEDAD  8028 non-null   category
 4   CANTIDAD  8028 non-null   int64   
dtypes: category(2), int64(2), object(1)
memory usage: 266.9+ KB


In [None]:
# Save the DataFrame to a CSV file with sex and age group modifications
df.to_csv("pdt/asthma_mortality/data/csv/def_asma_2001_2022_clean_02.csv", index=False)
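Note that a CSV file stores only text, so the categorical types set above are not preserved in the saved file. When the file is read back, the types can be restored by passing them to `read_csv`. The following is a minimal sketch using an in-memory buffer instead of the real file; the `toy` table and the variable names are illustrative assumptions.

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same ordered age categories as defined above
age_type = CategoricalDtype(
    categories=['0-4', '5-19', '20-44', '45-64', '65-74', '>= 75'],
    ordered=True,
)

# Toy table with a string ID and an ordered categorical column
toy = pd.DataFrame({
    "IDDPTO": ["02000", "06028"],
    "GRUPEDAD": pd.Series(['>= 75', '0-4']).astype(age_type),
})

# Write to an in-memory buffer, then read it back with explicit dtypes
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"IDDPTO": str, "GRUPEDAD": age_type})

print(restored.dtypes)
```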