<a href="https://colab.research.google.com/github/eric-castillo05/HackatecITZacatepec2022/blob/main/EntityForecast.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
For this project, we leveraged the ["Exceso de Mortalidad Historico 2022"](http://www.dgis.salud.gob.mx/contenidos/basesdedatos/da_exceso_mortalidad_mexico_gobmx.html)  dataset to build a predictive model for determining the sex of an individual. The features considered in our analysis encompassed information such as the state and municipality of the deceased, age at the time of death, date of death, and the presence of a potential COVID-19 diagnosis.

# Objectives
The primary objective of this project is to employ the k-Nearest Neighbors (KNN) algorithm to build a predictive model for determining the sex of an individual. This model utilizes various features extracted from the "Exceso de Mortalidad Historico 2022" dataset, including the deceased's state, municipality, age at death, date of death.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the data
To streamline our workflow and facilitate efficient data retrieval, we opted to upload all datasets to Google Drive. This centralized storage solution ensures convenient access to the data, which is currently organized into multiple CSV files.

In [2]:
df1 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE09.csv')
df2 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE13.csv')
df3 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE19.csv')
df4 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE24.csv')
df5 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE28.csv')
df6 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE31.csv')
df7 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE36.csv')
df8 = pd.read_csv('/content/drive/MyDrive/datasets/Exceso_Mortalidad_Historico_MX_2022/DDAAxsom2022SE41.csv')

To further enhance workflow efficiency, we consolidated all datasets into a single comprehensive file. This unification eliminates the need for managing multiple files, streamlining data access and analysis processes.

In [3]:
df = pd.concat([df1, df2[1:], df3[1:], df4[1:], df5[1:], df6[1:], df7[1:], df8[1:]], axis=0)

# Data Cleaning

In [4]:
# Conserve only valid ages
df = df.loc[df['EDAD'] < 99]

# Convert the date to a pandas-compatible format
df['FECHA_DEFUNCION'] = pd.to_datetime(df['FECHA_DEFUNCION'], dayfirst=True)

# Transform dates to the number of days elapsed since the minimum date in the dataset
df['FECHA_DEFUNCION'] =  (df['FECHA_DEFUNCION'] -df['FECHA_DEFUNCION'].min()).dt.days

# Model Selection: k-Nearest Neighbors (KNN)

In [5]:
# Assigns predictor variables to X
X = df[['ENTIDAD_REG','FECHA_DEFUNCION', 'EDAD', 'SEXO']]

# Assigns target classes (MUNICIPIO_REG) to y
y = df['MUNICIPIO_REG']

To avoid overfitting, we set aside 30% of the data for testing and trained the model on the remaining 70%



In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [7]:
knn_model = KNeighborsClassifier(n_neighbors=7)
knn_model.fit(X_train, y_train)

# Model Evaluation

In [8]:
# Classifications on the test set
predictions = knn_model.predict(X_test)

# Model evaluation
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.6103330697774385
Classification Report:
               precision    recall  f1-score   support

           1       0.75      0.78      0.76    160452
           2       0.68      0.70      0.69    152296
           3       0.68      0.70      0.69     85380
           4       0.77      0.80      0.79    181251
           5       0.57      0.61      0.59    223190
           6       0.70      0.72      0.71     97108
           7       0.68      0.69      0.69    137517
           8       0.71      0.71      0.71     56384
           9       0.63      0.64      0.63     59883
          10       0.54      0.57      0.55    216405
          11       0.66      0.66      0.66     55766
          12       0.70      0.69      0.70     61252
          13       0.50      0.51      0.50     69322
          14       0.58      0.55      0.56    214742
          15       0.58      0.59      0.58    284915
          16       0.58      0.54      0.56     58564
          17       0.67     

  _warn_prf(average, modifier, msg_start, len(result))
