<a href="https://colab.research.google.com/github/eric-castillo05/HackatecITZacatepec2022/blob/main/HackatecITZacapetec2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this project, we used the ["Exceso de Mortalidad Historico 2022"](https://www.dgis.salud.gob.mx/contenidos/basesdedatos/da_exceso_mortalidad_mexico_gobmx.html) dataset to predict the sex of a deceased person from several features: the state, the municipality, the age at death, the date of death, and a possible COVID-19 diagnosis.

# Load the data
We uploaded all the datasets to Google Drive so that they can be read directly from Colab. The data is currently split across multiple CSV files.

In [1]:
import pandas as pd

In [2]:
df1 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE09.csv')
df2 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE13.csv')
df3 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE19.csv')
df4 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE24.csv')
df5 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE28.csv')
df6 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE31.csv')
df7 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE36.csv')
df8 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE41.csv')

To avoid managing multiple files in the rest of the analysis, we concatenated all of them into a single DataFrame.

In [3]:
df = pd.concat([df1, df2[1:], df3[1:], df4[1:], df5[1:], df6[1:], df7[1:], df8[1:]], axis=0)
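The load-and-concatenate step can also be sketched with a loop. This is a minimal sketch rather than the code that produced the results in this notebook: the helper name `load_all` and the two in-memory demo files are hypothetical stand-ins for the CSVs on Drive. By default `pd.read_csv` already consumes the first line as the header, so the sketch does not slice any rows off before concatenating, and `ignore_index=True` gives the combined frame a fresh index.

```python
import io
import pandas as pd

def load_all(sources):
    """Read each CSV source and stack the results into one DataFrame."""
    frames = [pd.read_csv(src) for src in sources]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's index
    return pd.concat(frames, axis=0, ignore_index=True)

# Hypothetical in-memory files standing in for the weekly CSVs on Drive
demo_a = io.StringIO("ID_REGISTRO,EDAD\n1,70\n2,45\n")
demo_b = io.StringIO("ID_REGISTRO,EDAD\n3,81\n4,33\n")

combined = load_all([demo_a, demo_b])
print(len(combined))  # 4
```

With real files, `sources` would be the list of CSV paths used above.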

# Data Cleaning: Ensuring Quality and Consistency
To guarantee the integrity and reliability of our analysis, we performed thorough data cleaning, addressing the following:

* NA Values: We meticulously identified and handled missing values to prevent biases and inaccuracies in our results.
* Out-of-Range Values: We carefully scrutinized and corrected values that fell outside expected ranges for:
  * Dates: Ensuring adherence to valid date formats and chronological logic.
  * States: Verifying consistency with established state codes and boundaries.
  * Age: Taking particular care to rectify any implausible or erroneous age entries.

Key Considerations for Age:
* Plausibility: We ensured that ages aligned with realistic human lifespans.
* Typos and Mismatches: We identified and corrected potential errors in age data entry.
* Outliers: We investigated any age values that significantly deviated from typical patterns, assessing their validity and impact on analysis.

Through this rigorous data cleaning, we established a foundation of clean, reliable data, enabling accurate and meaningful insights from our subsequent analysis.
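The range checks listed above could be sketched as a single boolean mask. This is an illustration under stated assumptions, not the exact checks we ran: the helper `flag_out_of_range`, the state codes 1 to 32 (one per Mexican federal entity), and the `max_age=120` cutoff are all hypothetical choices, and the real dataset may use additional codes for unspecified values.

```python
import pandas as pd

# Assumption: valid state codes are 1..32, one per federal entity
VALID_STATES = set(range(1, 33))

def flag_out_of_range(frame, max_age=120):
    """Return a boolean Series that is True for rows failing any range check."""
    # Unparseable dates become NaT and are flagged
    dates = pd.to_datetime(frame["FECHA_DEFUNCION"], errors="coerce")
    bad_date = dates.isna()
    bad_state = ~frame["ENTIDAD_REG"].isin(VALID_STATES)
    bad_age = ~frame["EDAD"].between(0, max_age)
    return bad_date | bad_state | bad_age

demo = pd.DataFrame({
    "FECHA_DEFUNCION": ["2022-01-15", "not a date", "2022-03-02", "2022-04-10"],
    "ENTIDAD_REG": [9, 15, 40, 17],
    "EDAD": [67, 54, 80, 999],
})
mask = flag_out_of_range(demo)
print(mask.tolist())  # [False, True, True, True]
```

Rows where the mask is True can then be inspected or dropped with `demo[~mask]`.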

In [4]:
df.isna().any()

FECHA_ACTUALIZACION    False
ID_REGISTRO            False
ENTIDAD_REG            False
MUNICIPIO_REG          False
FECHA_DEFUNCION        False
FECHA_DE_REGISTRO      False
SEXO                   False
EDAD                   False
POSIBLE-COVID19        False
dtype: bool

In [5]:
df['FECHA_DEFUNCION'] = pd.to_datetime(df['FECHA_DEFUNCION'])

In [6]:
print(f'The percentage of records with age > 100 is {(len(df[df.EDAD > 100]) * 100) / len(df)}')

The percentage of records with age > 100 is 0.875716803636558


In [7]:
# Keep only records with EDAD below 99 (note: stricter than the > 100 check above)
df = df.loc[df['EDAD'] < 99]

# Model Selection: Logistic Regression

In the context of predicting binary outcomes (such as sex), logistic regression is a powerful tool. It allows us to model the probability of an event occurring based on one or more predictor variables. Specifically, we’ll explore how variations in the following independent variables influence the likelihood of a specific outcome (in this case, sex):

* ENTIDAD_REG (State): The state where the individual resides.
* MUNICIPIO_REG (Municipality): The specific municipality within the state.
* FECHA_DEFUNCION (Date of Death): The date when the individual passed away.
* EDAD (Age): The age of the individual.
* POSIBLE-COVID19 (Possible COVID-19): Whether the individual was suspected of having had COVID-19.

Logistic regression provides insights into the relationship between these predictors and the binary outcome (sex).
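To make "modeling the probability" concrete, here is a minimal sketch of how logistic regression turns a weighted sum of features into a probability. The weights, bias, and feature values are hypothetical and chosen only for illustration; they are not the coefficients of the model fitted later in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights, for illustration only (not the fitted model)
p = predict_proba(features=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(p)  # 0.5, because the linear score is 0.5*1.0 - 0.25*2.0 = 0
```

A probability above 0.5 is assigned to one class and below 0.5 to the other, which is the default decision rule behind `model.predict`.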

Logistic regression requires numeric inputs, so we convert FECHA_DEFUNCION into a number of days. The next cell does this in two steps:

* Subtracts Minimum Date: Subtracts the earliest date in the dataset from each individual date, ensuring a starting point of zero.
* Extracts Days: Uses the dt.days attribute to convert the resulting timedeltas (representing durations) into numerical values representing days since the earliest date.

In [8]:
df['FECHA_DEFUNCION'] = (df['FECHA_DEFUNCION'] - df['FECHA_DEFUNCION'].min()).dt.days
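A small worked example of the same conversion, using three made-up dates rather than values from the dataset:

```python
import pandas as pd

# Made-up dates, for illustration only
dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-03", "2022-02-01"]))

# Subtract the earliest date, then read the day count from each timedelta
days = (dates - dates.min()).dt.days
print(days.tolist())  # [0, 2, 31]
```

The earliest date maps to 0 and every other date to its offset in whole days.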

In [9]:
# Assigns predictor variables to X
X = df[['ENTIDAD_REG', 'MUNICIPIO_REG', 'FECHA_DEFUNCION', 'EDAD', 'POSIBLE-COVID19']]

# Assigns target variable (sex) to y
y = df['SEXO']

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

To estimate how well the model generalizes to unseen data, we set aside 30% of the data for testing and trained the model on the remaining 70%.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
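The effect of `test_size=0.3` can be checked on a toy array. The sketch below uses made-up data, and it also passes `stratify`, which is an addition for illustration and was not used in the split above; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features, two classes labeled 1 and 2
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

# stratify=y_demo is an assumption added here, not part of the original split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
print(len(X_tr), len(X_te))  # 7 3
```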

In [20]:
model = LogisticRegression()
model.fit(X_train, y_train)

# Model Evaluation



In [21]:
y_pred = model.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(model.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.59


In [22]:
from sklearn.metrics import confusion_matrix

# Use a separate name so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[ 410266 2170467]
 [ 328362 3246462]]


In [23]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.56      0.16      0.25   2580733
           2       0.60      0.91      0.72   3574824

    accuracy                           0.59   6155557
   macro avg       0.58      0.53      0.48   6155557
weighted avg       0.58      0.59      0.52   6155557
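The reported figures can be re-derived from the confusion matrix printed above, which also allows a comparison against a majority-class baseline (always predicting class 2). The baseline comparison is an addition for context; the counts are copied from the output above.

```python
# Confusion matrix from the output above: rows are true classes (1, 2),
# columns are predicted classes (1, 2)
cm = [[410266, 2170467],
      [328362, 3246462]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
recall_1 = cm[0][0] / sum(cm[0])
recall_2 = cm[1][1] / sum(cm[1])
# Accuracy of always predicting the most frequent true class
majority_baseline = max(sum(row) for row in cm) / total

print(f"accuracy={accuracy:.2f} recall_1={recall_1:.2f} "
      f"recall_2={recall_2:.2f} baseline={majority_baseline:.2f}")
# accuracy=0.59 recall_1=0.16 recall_2=0.91 baseline=0.58
```

The model's accuracy is only about one percentage point above the baseline, and the low recall for class 1 shows that it predicts class 2 for most records.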

