<a href="https://colab.research.google.com/github/eric-castillo05/HackatecITZacatepec2022/blob/main/HackatecITZacapetec2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this project, we used the ["Exceso de Mortalidad Historico 2022"](https://www.dgis.salud.gob.mx/contenidos/basesdedatos/da_exceso_mortalidad_mexico_gobmx.html) dataset to predict the sex of a deceased person from several features: the state and municipality of registration, the age at death, the date of death, and a possible COVID-19 diagnosis.

# Load the data
We uploaded all the dataset files to Google Drive so that the notebook can read them from a single location. The data is split across multiple CSV files.

In [1]:
import pandas as pd

In [2]:
df1 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE09.csv')
df2 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE13.csv')
df3 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE19.csv')
df4 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE24.csv')
df5 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE28.csv')
df6 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE31.csv')
df7 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE36.csv')
df8 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE41.csv')

We then concatenated the eight files into a single DataFrame so that the rest of the analysis works on one table instead of several.

In [3]:
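# Stack the eight frames row-wise.
# Note: read_csv already consumes each file's header line, so the [1:] slices
# skip the first data row of every file after the first.
df = pd.concat([df1, df2[1:], df3[1:], df4[1:], df5[1:], df6[1:], df7[1:], df8[1:]], axis=0)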
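
As a minimal sketch of the same consolidation step, the snippet below reads several CSVs into a list and concatenates them in one call. The two in-memory CSVs and their values are made up for illustration, and `ignore_index=True` is an optional choice that the cell above does not use.

```python
import io
import pandas as pd

# Two tiny in-memory CSVs standing in for the weekly files (hypothetical values).
csv_a = "ID_REGISTRO,SEXO,EDAD\na1,1,70\na2,2,65\n"
csv_b = "ID_REGISTRO,SEXO,EDAD\nb1,2,81\nb2,1,54\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]

# read_csv has already consumed the header line, so every row in each frame is data.
# ignore_index=True gives the combined frame a fresh 0..n-1 row index.
combined = pd.concat(frames, axis=0, ignore_index=True)

print(combined.shape)
print(combined.index.tolist())
```

With a fresh index, row labels are unique across the combined table, which can make later row lookups less ambiguous.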

# Data Cleaning: Ensuring Quality and Consistency
To keep the analysis reliable, we cleaned the data by checking the following:

* NA Values: We identified missing values so that they would not bias the results.
* Out-of-Range Values: We reviewed values that fell outside expected ranges for:
  * Dates: Valid date formats and chronological logic.
  * States: Consistency with established state codes.
  * Age: Implausible or erroneous age entries.

Key Considerations for Age:
* Plausibility: Ages should align with realistic human lifespans.
* Typos and Mismatches: Potential errors in age data entry.
* Outliers: Age values that deviate strongly from typical patterns, assessed for validity and impact on the analysis.

This cleaning gives the subsequent analysis a more reliable starting point.
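
The checks listed above could be sketched roughly as follows. The toy records, the `YYYY-MM-DD` date format, the 1 to 32 range for state codes, and the 0 to 120 range for age are all assumptions made for illustration; the real dataset may use other formats or special codes.

```python
import pandas as pd

# Toy records (hypothetical values) using the dataset's column names.
toy = pd.DataFrame({
    "ENTIDAD_REG": [9, 17, 45],
    "EDAD": [34, 150, 67],
    "FECHA_DEFUNCION": ["2021-03-01", "2021-13-40", "2020-11-15"],
})

# Count missing values per column.
na_counts = toy.isna().sum()

# Unparseable dates become NaT instead of raising an error.
parsed_dates = pd.to_datetime(toy["FECHA_DEFUNCION"], format="%Y-%m-%d", errors="coerce")
bad_date = parsed_dates.isna()

# Assumed valid ranges: state codes 1-32, ages 0-120 (both inclusive).
bad_state = ~toy["ENTIDAD_REG"].between(1, 32)
bad_age = ~toy["EDAD"].between(0, 120)

report = pd.DataFrame({"bad_date": bad_date, "bad_state": bad_state, "bad_age": bad_age})
print(na_counts)
print(report)
```

Flagging rows first, rather than dropping them immediately, makes it possible to inspect how many records each rule would remove.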

In [4]:
df.isna().any()

FECHA_ACTUALIZACION    False
ID_REGISTRO            False
ENTIDAD_REG            False
MUNICIPIO_REG          False
FECHA_DEFUNCION        False
FECHA_DE_REGISTRO      False
SEXO                   False
EDAD                   False
POSIBLE-COVID19        False
dtype: bool

In [5]:
df['FECHA_ACTUALIZACION'] = pd.to_datetime(df['FECHA_ACTUALIZACION'])
df['FECHA_DE_REGISTRO'] = pd.to_datetime(df['FECHA_DE_REGISTRO'])
df['FECHA_DEFUNCION'] = pd.to_datetime(df['FECHA_DEFUNCION'])

In [6]:
print(f'Percentage of records with age > 100: {(len(df[df.EDAD>100]) *100)/len(df)}')

Percentage of records with age > 100: 0.875716803636558


In [7]:
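# Keep only records with age below 99 (stricter than the > 100 check above).
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned later.
df = df.loc[df['EDAD'] < 99].copy()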
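
The notebook reports the share of ages above 100 but filters on ages below 99. As a small sketch of how one cutoff could drive both steps, the snippet below uses a hypothetical `MAX_AGE` constant and made-up ages; it is not the threshold the notebook actually applies.

```python
import pandas as pd

ages = pd.DataFrame({"EDAD": [25, 60, 99, 100, 101, 120]})  # hypothetical values
MAX_AGE = 100  # assumed cutoff, used for both the report and the filter

# The mean of a boolean mask is the fraction of True values.
share_above = (ages["EDAD"] > MAX_AGE).mean() * 100

# Keep the complementary rows, so the report and the filter agree.
filtered = ages.loc[ages["EDAD"] <= MAX_AGE].copy()

print(f"Share above {MAX_AGE}: {share_above:.2f}%")
print(len(filtered))
```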

# Model Selection: Multiple Linear Regression

To model the relationship between the target variable (sex) and the predictors, we chose multiple linear regression. Because SEXO is stored as a numeric code (1 or 2 in this dataset), the model treats it as a continuous value and estimates how much of its variation is explained by the combined influence of the following independent variables:

* ENTIDAD_REG (State)
* MUNICIPIO_REG (Municipality)
* FECHA_DEFUNCION (Date of Death)
* EDAD (Age)
* POSIBLE-COVID19 (Possible COVID-19)

Linear regression needs numeric inputs, so FECHA_DEFUNCION is converted to a number of days in two steps:

* Subtract the minimum date: The earliest date in the dataset is subtracted from each date, so the earliest record maps to zero.
* Extract days: The `dt.days` accessor converts the resulting timedeltas into integers representing days since the earliest date.

In [8]:
df['FECHA_DEFUNCION'] = (df['FECHA_DEFUNCION'] - df['FECHA_DEFUNCION'].min()).dt.days
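
To illustrate the conversion on a small scale, here is a sketch with three made-up dates (not taken from the dataset):

```python
import pandas as pd

# Hypothetical dates; 2020 is a leap year, so the last gap is 366 days.
dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-31", "2021-01-01"]))

# Subtracting the minimum yields timedeltas; .dt.days turns them into integers.
days_since_first = (dates - dates.min()).dt.days

print(days_since_first.tolist())  # [0, 30, 366]
```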

In [9]:
# Assigns predictor variables to X
X = df[['ENTIDAD_REG', 'MUNICIPIO_REG', 'FECHA_DEFUNCION', 'EDAD', 'POSIBLE-COVID19']]

# Assigns target variable (sex) to y
y = df['SEXO']

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

To estimate how well the model generalizes to unseen data, we set aside 20% of the data for testing and trained the model on the remaining 80%.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
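
The following sketch shows what an 80/20 split produces on a tiny made-up array. The `stratify` argument is an optional addition that the notebook does not use; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features and labels coded 1/2.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([1, 2] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```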

In [12]:
model = LinearRegression()
model.fit(X_train, y_train)
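
After fitting, the learned intercept and coefficients can be inspected through the `intercept_` and `coef_` attributes. The sketch below uses made-up data that follows an exact linear rule, so the recovered values are known in advance; it says nothing about the coefficients of the model trained above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 1.0 + 0.25 * x1 + 0.5 * x2.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = 1.0 + 0.25 * X_demo[:, 0] + 0.5 * X_demo[:, 1]

demo = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature column, plus a single intercept.
print(demo.intercept_, demo.coef_)
```

On the real model, pairing `model.coef_` with the column names of `X` would show which predictors move the prediction and in which direction.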

# Model Evaluation



In [13]:
y_pred = model.predict(X_test)

In [14]:
pred_y_df = pd.DataFrame({'Actual Value': y_test, 'Pred': y_pred, 'Difference': y_test - y_pred})
pred_y_df

| | Actual Value | Pred | Difference |
|---|---|---|---|
| 833454 | 2 | 1.537605 | 0.462395 |
| 1673497 | 2 | 1.560039 | 0.439961 |
| 1988044 | 2 | 1.512233 | 0.487767 |
| 1027039 | 2 | 1.635105 | 0.364895 |
| 1025412 | 2 | 1.649082 | 0.350918 |
| ... | ... | ... | ... |
| 1590201 | 2 | 1.720360 | 0.279640 |
| 1680350 | 2 | 1.587350 | 0.412650 |
| 118226 | 1 | 1.513437 | -0.513437 |
| 668408 | 1 | 1.597691 | -0.597691 |


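Mean Absolute Error (MAE): The MAE is 0.4775, so on average a prediction is off by 0.4775 units. Since SEXO only takes the values 1 and 2, always predicting 1.5 would give an MAE of exactly 0.5, so the model improves only slightly on that constant baseline.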


In [20]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

Mean Absolute Error: 0.4775420302082085
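
To put the MAE in context, the sketch below compares it with a constant prediction of 1.5 and converts the continuous predictions into class labels. It uses only the nine rows displayed in the table above, not the full test set, and the 1.5 threshold is an assumption (the midpoint between the two codes), so the numbers it prints are illustrative only.

```python
import numpy as np

# The nine (actual, predicted) pairs shown in the table above.
actual = np.array([2, 2, 2, 2, 2, 2, 2, 1, 1])
pred = np.array([1.537605, 1.560039, 1.512233, 1.635105, 1.649082,
                 1.720360, 1.587350, 1.513437, 1.597691])

# MAE of the model on these rows versus a constant prediction of 1.5.
model_mae = np.mean(np.abs(actual - pred))
baseline_mae = np.mean(np.abs(actual - 1.5))

# Assumed decision rule: predictions above 1.5 map to code 2, otherwise code 1.
labels = np.where(pred > 1.5, 2, 1)
accuracy = np.mean(labels == actual)

print(f"model MAE: {model_mae:.4f}, baseline MAE: {baseline_mae:.4f}")
print(f"labels: {labels.tolist()}, accuracy: {accuracy:.3f}")
```

On these rows every prediction falls above 1.5, so the thresholded model assigns the same label to every record.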
