### Heart Attack

Source: https://www.kaggle.com/datasets/ankushpanday1/heart-attack-in-youth-vs-adult-in-nigeria

### About the Dataset
The dataset contains detailed records of heart attack cases among youth (15-35 years) and adults (35+ years) in Nigeria, spanning the 36 states and the Federal Capital Territory (FCT). It includes demographic, lifestyle, medical, and socioeconomic features, which supports a thorough analysis of heart attack trends.

### Analysis

### For Beginners
#### Basic Distributions:
- Identify the prevalence of heart attacks by age group, gender, and state.
- Analyze the proportion of smokers versus non-smokers.
- Explore how BMI (Body Mass Index) varies by age group.
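
The basic distributions listed above can be sketched with `value_counts` and `groupby`. The sample below is hypothetical: it reuses the dataset's column names, but the values are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample using the dataset's column names; the values are illustrative only.
sample = pd.DataFrame({
    "Age_Group": ["Youth", "Youth", "Adult", "Adult", "Adult", "Youth"],
    "Smoking_Status": ["Smoker", "Non-Smoker", "Smoker", "Smoker", "Non-Smoker", "Non-Smoker"],
    "BMI": [22.0, 30.0, 27.5, 33.5, 19.0, 26.0],
})

# Share of smokers versus non-smokers, as percentages.
smoker_share = sample["Smoking_Status"].value_counts(normalize=True) * 100

# Mean and median BMI per age group.
bmi_by_age = sample.groupby("Age_Group")["BMI"].agg(["mean", "median"])

print(smoker_share)
print(bmi_by_age)
```

On the real data, the same two calls would run directly on `df` once it has been loaded.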

#### Visualizations:
- Create bar or pie charts to visualize the distribution by state or gender.
- Use histograms to explore the distributions of BMI and stress levels.

### For Intermediate Users
#### Correlations:
- Explore correlations between lifestyle factors (exercise, smoking) and heart attack severity.
- Identify relationships between medical conditions (hypertension, diabetes) and hospitalization rates.
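
Most columns in this dataset are categorical, so a plain Pearson correlation does not apply directly. One simple starting point is a row-normalized contingency table. The sketch below uses a hypothetical sample with the dataset's column names; the values are illustrative only.

```python
import pandas as pd

# Hypothetical sample; column names follow the dataset, values are illustrative only.
sample = pd.DataFrame({
    "Hypertension": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"],
    "Hospitalized": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes"],
})

# Row-normalized contingency table: hospitalization rate (%) within each hypertension group.
rates = pd.crosstab(sample["Hypertension"], sample["Hospitalized"], normalize="index") * 100
print(rates)
```

A large gap between the two rows would suggest an association worth testing formally, for example with a chi-square test.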

#### Feature Analysis:
- Analyze the impact of living in urban versus rural areas on heart attack prevalence.
- Evaluate how income levels correlate with hospitalization and survival rates.

#### Segmentation:
- Group the data by state or region to understand regional trends in heart attack cases.

### For Advanced Users
#### Predictive Modeling:
- Build classification models to predict heart attack severity from lifestyle, medical, and demographic features.
- Create survival prediction models using logistic regression or random forests.
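
A survival model along these lines could be sketched with scikit-learn, which this notebook does not otherwise use, so treat it as an assumption. The data below is synthetic and random, generated only so the pipeline runs; the column names follow the dataset, and the choice of features is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (assumption): random values under the dataset's column names.
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "Smoking_Status": rng.choice(["Smoker", "Non-Smoker"], size=n),
    "Hypertension": rng.choice(["Yes", "No"], size=n),
    "Age_Group": rng.choice(["Youth", "Adult"], size=n),
    "BMI": rng.uniform(15, 40, size=n).round(1),
    "Survived": rng.choice(["Yes", "No"], size=n),
})

categorical = ["Smoking_Status", "Hypertension", "Age_Group"]
X = data[categorical + ["BMI"]]
y = (data["Survived"] == "Yes").astype(int)

# One-hot encode the categorical columns and pass BMI through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)

# Predicted survival probabilities and held-out accuracy.
proba = model.predict_proba(X_test)[:, 1]
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Because the labels here are random, accuracy near 0.5 is expected; the point is the shape of the pipeline, not the score.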

#### Feature Importance:
- Apply advanced techniques such as SHAP (SHapley Additive exPlanations) to interpret model predictions and identify critical risk factors.

#### State-Level Policy Recommendations:
- Run a cluster analysis to identify states with similar risk profiles and recommend targeted health interventions.

#### In-Depth Analysis:
- Use time series analysis (if the dataset is extended with dates) to study trends in heart attack cases over time.

---

In [1]:
# Read the dataset
import pandas as pd
import numpy as np
import plotly.express as px

# Load the CSV file
df = pd.read_csv('./datasets_data/heart_attack_youth_vs_adult_nigeria.csv', sep=',', encoding='latin1')

df.head()

| | State | Age_Group | Gender | BMI | Smoking_Status | Alcohol_Consumption | Exercise_Frequency | Hypertension | Diabetes | Cholesterol_Level | Family_History | Stress_Level | Diet_Type | Heart_Attack_Severity | Hospitalized | Survived | Income_Level | Urban_Rural | Employment_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ondo | Youth | Female | 34.5 | Non-Smoker | High | Occasionally | No | Yes | Borderline | Yes | Low | Unhealthy | Mild | No | Yes | Medium | Urban | Employed |
| 1 | FCT | Youth | Male | 15.2 | Non-Smoker | High | Occasionally | No | Yes | High | No | Moderate | Unhealthy | Mild | Yes | No | High | Rural | Employed |
| 2 | Borno | Youth | Female | 25.0 | Non-Smoker | High | Weekly | No | Yes | High | No | Moderate | Mixed | Moderate | Yes | No | Medium | Rural | Student |
| 3 | Katsina | Youth | Male | 19.7 | Non-Smoker | High | Occasionally | No | No | High | Yes | High | Healthy | Mild | Yes | No | Low | Rural | Unemployed |
| 4 | Kaduna | Adult | Female | 35.6 | Non-Smoker | Low | Rarely | No | Yes | High | Yes | High | Healthy | Moderate | No | Yes | Low | Rural | Student |


In [2]:
# Percentage of missing values per column
df.isnull().sum()/len(df)*100

State                     0.00000
Age_Group                 0.00000
Gender                    0.00000
BMI                       0.00000
Smoking_Status            0.00000
Alcohol_Consumption      25.09891
Exercise_Frequency        0.00000
Hypertension              0.00000
Diabetes                  0.00000
Cholesterol_Level         0.00000
Family_History            0.00000
Stress_Level              0.00000
Diet_Type                 0.00000
Heart_Attack_Severity     0.00000
Hospitalized              0.00000
Survived                  0.00000
Income_Level              0.00000
Urban_Rural               0.00000
Employment_Status         0.00000
dtype: float64
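
`Alcohol_Consumption` is the only column with missing values, at roughly 25%. One possibility worth ruling out is that the raw file stores a literal category such as `None`, which recent pandas versions treat as a missing-value marker by default when reading CSV files. This is an assumption, not something the output above confirms. The sketch below uses a tiny hypothetical CSV to show how `keep_default_na=False` preserves such strings.

```python
import io
import pandas as pd

# Tiny in-memory CSV (hypothetical rows) where one category is literally the word "None".
csv_text = "State,Alcohol_Consumption\nOndo,High\nFCT,None\nBorno,Low\nKano,None\n"

# keep_default_na=False stops pandas from converting marker strings such as "None" to NaN.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(raw["Alcohol_Consumption"].value_counts())
print(raw["Alcohol_Consumption"].isnull().sum())
```

Re-reading the real file with the same option and inspecting `value_counts()` would show whether the gaps are truly missing or a category in their own right.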

In [3]:
print(f"Total rows in the dataset: {len(df)}")
df.dtypes

Total rows in the dataset: 898796


State                     object
Age_Group                 object
Gender                    object
BMI                      float64
Smoking_Status            object
Alcohol_Consumption       object
Exercise_Frequency        object
Hypertension              object
Diabetes                  object
Cholesterol_Level         object
Family_History            object
Stress_Level              object
Diet_Type                 object
Heart_Attack_Severity     object
Hospitalized              object
Survived                  object
Income_Level              object
Urban_Rural               object
Employment_Status         object
dtype: object

In [4]:
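# Count records for each combination of state, age group, and gender
prevalence = df.groupby(['State', 'Age_Group', 'Gender']).size().reset_index(name='Heart_Attack_Count')

# Share (%) of each gender within its state and age group
prevalence['Heart_Attack_Percentage'] = prevalence.groupby(['State', 'Age_Group'])['Heart_Attack_Count'].transform(lambda x: x / x.sum() * 100)
prevalence.head()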

| | State | Age_Group | Gender | Heart_Attack_Count | Heart_Attack_Percentage |
|---|---|---|---|---|---|
| 0 | Abia | Adult | Female | 6074 | 50.119647 |
| 1 | Abia | Adult | Male | 6045 | 49.880353 |
| 2 | Abia | Youth | Female | 6061 | 49.574677 |
| 3 | Abia | Youth | Male | 6165 | 50.425323 |
| 4 | Adamawa | Adult | Female | 6224 | 50.246226 |
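
Note that `Heart_Attack_Percentage` is the gender split within each state and age group, not a prevalence rate over the population. A wide layout can make that split easier to read. The sketch below rebuilds the four Abia rows from the output above and pivots them; `wide` and `share` are illustrative names.

```python
import pandas as pd

# The four Abia rows shown in the output above.
prevalence = pd.DataFrame({
    "State": ["Abia"] * 4,
    "Age_Group": ["Adult", "Adult", "Youth", "Youth"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Heart_Attack_Count": [6074, 6045, 6061, 6165],
})

# Wide layout: one row per state and age group, one column per gender.
wide = prevalence.pivot_table(
    index=["State", "Age_Group"], columns="Gender", values="Heart_Attack_Count", aggfunc="sum"
)

# Gender share (%) within each row.
share = wide.div(wide.sum(axis=1), axis=0) * 100
print(share.round(2))
```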


In [5]:
# Number of states in the dataset
print(f"Number of states in the dataset (including the FCT): {df['State'].nunique()}")

# States in the dataset
print(f"{df['State'].unique()} \n states of Nigeria")

Number of states in the dataset (including the FCT): 37
['Ondo' 'FCT' 'Borno' 'Katsina' 'Kaduna' 'Kogi' 'Ebonyi' 'Kwara' 'Yobe'
 'Akwa Ibom' 'Kebbi' 'Adamawa' 'Osun' 'Rivers' 'Edo' 'Lagos' 'Niger'
 'Ogun' 'Gombe' 'Zamfara' 'Benue' 'Cross River' 'Jigawa' 'Anambra' 'Enugu'
 'Nasarawa' 'Kano' 'Taraba' 'Imo' 'Bayelsa' 'Sokoto' 'Delta' 'Oyo' 'Abia'
 'Bauchi' 'Ekiti' 'Plateau'] 
 states of Nigeria


In [6]:
len(df.columns)

19

In [7]:
df.describe()

| | BMI |
|---|---|
| count | 898796.0 |
| mean | 27.502941 |
| std | 7.220324 |
| min | 15.0 |
| 25% | 21.2 |
| 50% | 27.5 |
| 75% | 33.8 |
| max | 40.0 |
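
`describe()` reports only `BMI` because it is the only numeric column. The text columns can be summarized by excluding numeric dtypes, as in this sketch. The sample reuses a few columns from the first five rows shown earlier in the notebook.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset as shown earlier, restricted to a few columns.
sample = pd.DataFrame({
    "State": ["Ondo", "FCT", "Borno", "Katsina", "Kaduna"],
    "Age_Group": ["Youth", "Youth", "Youth", "Youth", "Adult"],
    "BMI": [34.5, 15.2, 25.0, 19.7, 35.6],
    "Stress_Level": ["Low", "Moderate", "Moderate", "High", "High"],
})

# Excluding numeric dtypes makes describe() report count, unique, top, and freq for text columns.
categorical_summary = sample.describe(exclude=[np.number])
print(categorical_summary)
```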


In [8]:
df['Gender'].unique()

array(['Female', 'Male'], dtype=object)

### Transform data

In [9]:
# Encode gender as integers in one step (Female = 0, Male = 1)
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})
df.head()


| | State | Age_Group | Gender | BMI | Smoking_Status | Alcohol_Consumption | Exercise_Frequency | Hypertension | Diabetes | Cholesterol_Level | Family_History | Stress_Level | Diet_Type | Heart_Attack_Severity | Hospitalized | Survived | Income_Level | Urban_Rural | Employment_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ondo | Youth | 0 | 34.5 | Non-Smoker | High | Occasionally | No | Yes | Borderline | Yes | Low | Unhealthy | Mild | No | Yes | Medium | Urban | Employed |
| 1 | FCT | Youth | 1 | 15.2 | Non-Smoker | High | Occasionally | No | Yes | High | No | Moderate | Unhealthy | Mild | Yes | No | High | Rural | Employed |
| 2 | Borno | Youth | 0 | 25.0 | Non-Smoker | High | Weekly | No | Yes | High | No | Moderate | Mixed | Moderate | Yes | No | Medium | Rural | Student |
| 3 | Katsina | Youth | 1 | 19.7 | Non-Smoker | High | Occasionally | No | No | High | Yes | High | Healthy | Mild | Yes | No | Low | Rural | Unemployed |
| 4 | Kaduna | Adult | 0 | 35.6 | Non-Smoker | Low | Rarely | No | Yes | High | Yes | High | Healthy | Moderate | No | Yes | Low | Rural | Student |



In [11]:
diccionario_edades = {"Youth": 0, "Adult": 1}

df["Age_Group"]=df["Age_Group"].map(diccionario_edades)
df.head()

| | State | Age_Group | Gender | BMI | Smoking_Status | Alcohol_Consumption | Exercise_Frequency | Hypertension | Diabetes | Cholesterol_Level | Family_History | Stress_Level | Diet_Type | Heart_Attack_Severity | Hospitalized | Survived | Income_Level | Urban_Rural | Employment_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ondo | 0 | 0 | 34.5 | Non-Smoker | High | Occasionally | No | Yes | Borderline | Yes | Low | Unhealthy | Mild | No | Yes | Medium | Urban | Employed |
| 1 | FCT | 0 | 1 | 15.2 | Non-Smoker | High | Occasionally | No | Yes | High | No | Moderate | Unhealthy | Mild | Yes | No | High | Rural | Employed |
| 2 | Borno | 0 | 0 | 25.0 | Non-Smoker | High | Weekly | No | Yes | High | No | Moderate | Mixed | Moderate | Yes | No | Medium | Rural | Student |
| 3 | Katsina | 0 | 1 | 19.7 | Non-Smoker | High | Occasionally | No | No | High | Yes | High | Healthy | Mild | Yes | No | Low | Rural | Unemployed |
| 4 | Kaduna | 1 | 0 | 35.6 | Non-Smoker | Low | Rarely | No | Yes | High | Yes | High | Healthy | Moderate | No | Yes | Low | Rural | Student |
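
The same dictionary-based encoding extends to the remaining Yes/No columns, and ordinal columns can keep their order through an ordered categorical. The sketch below uses values from the five rows shown above. `Income_Code` is a hypothetical column name, and the level list is an assumption based only on those rows.

```python
import pandas as pd

# First five rows shown above, restricted to a few columns.
sample = pd.DataFrame({
    "Hypertension": ["No", "No", "No", "No", "No"],
    "Diabetes": ["Yes", "Yes", "Yes", "No", "Yes"],
    "Survived": ["Yes", "No", "No", "No", "Yes"],
    "Income_Level": ["Medium", "High", "Medium", "Low", "Low"],
})

# Encode every Yes/No column with a single mapping.
yes_no = {"No": 0, "Yes": 1}
binary_cols = ["Hypertension", "Diabetes", "Survived"]
for col in binary_cols:
    sample[col] = sample[col].map(yes_no)

# Ordinal levels keep their order when encoded through an ordered categorical.
# The level names come from the rows above; the full dataset may contain others.
levels = ["Low", "Medium", "High"]
sample["Income_Code"] = pd.Categorical(
    sample["Income_Level"], categories=levels, ordered=True
).codes

# map() returns NaN for any value missing from the dictionary, so check for that.
assert not sample[binary_cols].isnull().any().any()
print(sample)
```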


### Mask and filter

In [12]:
df["Smoking_Status"].unique()

array(['Non-Smoker', 'Smoker'], dtype=object)

In [13]:
len(df)

898796

In [14]:
mask = df['Smoking_Status'] == "Smoker"
df_filtrado = df[mask]

In [15]:
porcentaje_fumadores = len(df_filtrado)/len(df)*100 

In [16]:
print(f"Percentage of smokers in the dataset: {porcentaje_fumadores}")

Percentage of smokers in the dataset: 49.98698258559228


In [17]:
df_smoker_nonsmoker = df.groupby(["Gender", "Smoking_Status"]).size().reset_index(name='Total_poblacion')
df_smoker_nonsmoker.head()

| | Gender | Smoking_Status | Total_poblacion |
|---|---|---|---|
| 0 | 0 | Non-Smoker | 225044 |
| 1 | 0 | Smoker | 224649 |
| 2 | 1 | Non-Smoker | 224471 |
| 3 | 1 | Smoker | 224632 |
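
Raw counts are hard to compare across groups, so it can help to convert them into shares within each gender. The sketch below rebuilds the four rows from the table above; `Share_Pct` is a hypothetical column name added for illustration.

```python
import pandas as pd

# Counts copied from the table above (Gender: 0 = Female, 1 = Male).
df_smoker_nonsmoker = pd.DataFrame({
    "Gender": [0, 0, 1, 1],
    "Smoking_Status": ["Non-Smoker", "Smoker", "Non-Smoker", "Smoker"],
    "Total_poblacion": [225044, 224649, 224471, 224632],
})

# Percentage of each smoking status within each gender.
gender_totals = df_smoker_nonsmoker.groupby("Gender")["Total_poblacion"].transform("sum")
df_smoker_nonsmoker["Share_Pct"] = df_smoker_nonsmoker["Total_poblacion"] / gender_totals * 100
print(df_smoker_nonsmoker.round(2))
```

The four counts add up to 898796, the dataset's row count, which is a quick check that the grouping dropped no records.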


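In [None]:
# Draft aggregation, left commented out. Income_Level is categorical (object dtype),
# so it is counted here rather than summed; BMI is averaged per group.
#df.groupby(["Gender", "Smoking_Status"]).agg({"Income_Level": "count", "BMI": "mean"})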