## Exploratory Analysis of the Framingham Dataset

#### Data loading

In [None]:
import pandas as pd
data = pd.read_csv("C:/Users/Eduardo/Desktop/Master/Capstone/bbdd/framingham_clean.csv")
data

Unnamed: 0.1,Unnamed: 0,gender,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD,age_groups,heart_rate_groups
0,0,1,39,4,0,0,0,0,0,0,195,106.0,70.0,26.97,80,77,0,0,1
1,1,0,46,2,0,0,0,0,0,0,250,121.0,81.0,28.73,95,76,0,1,1
2,2,1,48,1,1,20,0,0,0,0,245,127.5,80.0,25.34,75,70,0,1,1
3,3,0,61,3,1,30,0,0,1,0,225,150.0,95.0,28.58,65,103,1,2,1
4,4,0,46,3,1,23,0,0,0,0,285,130.0,84.0,23.10,85,85,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4235,4235,0,48,2,1,20,Missing,0,0,0,248,131.0,72.0,22.00,84,86,0,1,1
4236,4236,0,44,1,1,15,0,0,0,0,210,126.5,87.0,19.16,86,78,0,1,1
4237,4237,0,52,2,0,0,0,0,0,0,269,133.5,83.0,21.47,80,107,0,1,1
4238,4238,1,40,3,0,0,0,0,1,0,185,141.0,98.0,25.60,67,72,0,0,1
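
The `Unnamed: 0` column in the output above usually appears when a DataFrame index was written to the CSV by an earlier `to_csv` call. The following is a minimal sketch, assuming that is what happened here, of reading that column back as the index with `index_col=0`. The inline CSV is a small stand-in built from a few columns of the first rows shown above, not the real file.

```python
import io
import pandas as pd

# Stand-in for framingham_clean.csv: a few columns from the first rows shown above.
# The empty first header cell mimics an index saved by to_csv.
sample_csv = io.StringIO(
    ",gender,age,education,TenYearCHD\n"
    "0,1,39,4,0\n"
    "1,0,46,2,0\n"
    "2,1,48,1,0\n"
)

# index_col=0 uses the first CSV column as the row index
# instead of loading it as an extra "Unnamed: 0" column.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

Whether to drop the column in the real dataset depends on whether later cells rely on it.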


#### Data Description
#### Demographic

Sex: male or female ("M" or "F")

Age: age of the patient (Continuous: although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

Education: 1: Higher Secondary, 2: Graduation, 3: Post Graduation, 4: PhD

#### Behavioral

Current Smoker: whether or not the patient is a current smoker ("YES" or "NO")

Cigs Per Day: the number of cigarettes that the person smoked on average in one day (can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette)

#### Medical (history)

BP Meds: whether or not the patient was on blood pressure medication (Nominal)

Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)

Prevalent Hyp: whether or not the patient was hypertensive (Nominal)

Diabetes: whether or not the patient had diabetes (Nominal)

#### Medical (current)

Tot Chol: total cholesterol level (Continuous)

Sys BP: systolic blood pressure (Continuous)

Dia BP: diastolic blood pressure (Continuous)

BMI: Body Mass Index (Continuous)

Heart Rate: heart rate (Continuous: in medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of the large number of possible values)

Glucose: glucose level (Continuous)

#### Predicted variable (desired target)

TenYearCHD: 10-year risk of coronary heart disease (CHD) (binary: "1" means "Yes", "0" means "No"). This is the dependent variable (DV).
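
The description above lists several variables as binary (nominal). A quick sanity check, sketched here under the assumption that they are coded as 0/1 integers as in the table shown earlier, is to confirm that each such column contains only those two values. The rows below are copied from the first five rows of that table; `binary_cols` is an illustrative name.

```python
import pandas as pd

# Illustrative rows copied from the first five rows shown earlier (subset of columns).
sample = pd.DataFrame({
    "prevalentStroke": [0, 0, 0, 0, 0],
    "prevalentHyp": [0, 0, 0, 1, 0],
    "diabetes": [0, 0, 0, 0, 0],
    "TenYearCHD": [0, 0, 0, 1, 0],
})

binary_cols = ["prevalentStroke", "prevalentHyp", "diabetes", "TenYearCHD"]

# For each column, check that every value is either 0 or 1.
is_binary = sample[binary_cols].isin([0, 1]).all()
print(is_binary)
```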

#### General exploration

In [None]:
# General summary
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         4240 non-null   int64  
 1   gender             4240 non-null   int64  
 2   age                4240 non-null   int64  
 3   education          4240 non-null   int64  
 4   currentSmoker      4240 non-null   int64  
 5   cigsPerDay         4240 non-null   int64  
 6   BPMeds             4240 non-null   object 
 7   prevalentStroke    4240 non-null   int64  
 8   prevalentHyp       4240 non-null   int64  
 9   diabetes           4240 non-null   int64  
 10  totChol            4240 non-null   int64  
 11  sysBP              4240 non-null   float64
 12  diaBP              4240 non-null   float64
 13  BMI                4240 non-null   float64
 14  heartRate          4240 non-null   int64  
 15  glucose            4240 non-null   int64  
 16  TenYearCHD         4240 

In [None]:
# Column types of the dataset
data.dtypes

Unnamed: 0             int64
gender                 int64
age                    int64
education              int64
currentSmoker          int64
cigsPerDay             int64
BPMeds                object
prevalentStroke        int64
prevalentHyp           int64
diabetes               int64
totChol                int64
sysBP                float64
diaBP                float64
BMI                  float64
heartRate              int64
glucose                int64
TenYearCHD             int64
age_groups             int64
heart_rate_groups      int64
dtype: object
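
When a single column shows up as `object` among otherwise numeric columns, as in the output above, it helps to see which values prevent numeric parsing. A hedged sketch on a small stand-in DataFrame (the `sample` frame and its values are illustrative, not taken from the real file):

```python
import pandas as pd

# Stand-in data: one numeric column and one text column with a non-numeric entry.
sample = pd.DataFrame({
    "age": [39, 46, 48],
    "BPMeds": pd.Series(["0", "Missing", "1"], dtype=object),
})

# Columns stored with the generic object dtype.
object_cols = sample.select_dtypes(include="object").columns.tolist()

# For each of them, collect the distinct values that cannot be parsed as numbers.
non_numeric = {
    col: sample.loc[pd.to_numeric(sample[col], errors="coerce").isna(), col]
    .unique()
    .tolist()
    for col in object_cols
}
print(non_numeric)
```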

In [None]:
# Dataset dimensions
data.shape

(4240, 19)

#### Data cleaning
The BPMeds column has type object, unlike the rest. This is because it holds either a boolean numeric value (0 or 1) or the text "Missing" where no data is available. Here the "Missing" entries are replaced with the numeric value 2 so that the field can be converted to a numeric type.

The gender column is changed from numeric to text, replacing the value 0 with 'M' and 1 with 'F'.

The currentSmoker column is changed from numeric to text, replacing the value 0 with 'NO' and 1 with 'YES'.

In [None]:
# Replace 'Missing' with 2
data['BPMeds'] = data['BPMeds'].replace('Missing',2)
# Convert 'BPMeds' to numeric
data['BPMeds'] = pd.to_numeric(data['BPMeds'])
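
Encoding "Missing" as 2 makes the column numeric, but the sentinel then takes part in aggregates such as the mean. As a point of comparison only (this is an alternative, not what the notebook does), the sketch below coerces unparseable text to missing values and keeps a nullable integer dtype. The `bp_meds` series is illustrative.

```python
import pandas as pd

# Illustrative stand-in for the raw BPMeds column.
bp_meds = pd.Series(["0", "1", "Missing", "0"], dtype=object, name="BPMeds")

# Notebook approach: encode "Missing" as the sentinel value 2.
sentinel = pd.to_numeric(bp_meds.replace("Missing", 2))

# Alternative: turn unparseable text into NaN, then use the nullable Int64 dtype.
nullable = pd.to_numeric(bp_meds, errors="coerce").astype("Int64")

print(sentinel.mean())  # the sentinel 2 is counted as a real value
print(nullable.mean())  # the missing entry is skipped
```

Which option is preferable depends on the downstream model: some estimators reject missing values, in which case an explicit category such as 2 may be the more practical choice.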

In [None]:
# Convert 'gender' to text
data['gender'] = data['gender'].astype(str)
# Replace '0' with 'M' and '1' with 'F' (the values are strings after astype(str))
data['gender'] = data['gender'].replace(["0","1"],["M","F"])

In [None]:
# Convert 'currentSmoker' to text
data['currentSmoker'] = data['currentSmoker'].astype(str)
# Replace '0' with 'NO' and '1' with 'YES'
data['currentSmoker'] = data['currentSmoker'].replace(["0","1"],["NO","YES"])
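
The recoding described in this section (0 to 'M' and 1 to 'F'; 0 to 'NO' and 1 to 'YES') can also be written with `Series.map`, which works on the integer codes directly, so there is no need to match the string form of the values. A minimal sketch on illustrative data; the dictionary names are hypothetical:

```python
import pandas as pd

# Illustrative integer-coded columns, as they come out of the CSV.
sample = pd.DataFrame({"gender": [1, 0, 1], "currentSmoker": [0, 0, 1]})

# Code-to-label mappings taken from the text of this section.
gender_labels = {0: "M", 1: "F"}
smoker_labels = {0: "NO", 1: "YES"}

# map() looks up each value in the dictionary; unmapped values would become NaN.
sample["gender"] = sample["gender"].map(gender_labels)
sample["currentSmoker"] = sample["currentSmoker"].map(smoker_labels)
print(sample)
```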

In [None]:
data.dtypes

Unnamed: 0             int64
gender                object
age                    int64
education              int64
currentSmoker         object
cigsPerDay             int64
BPMeds                 int64
prevalentStroke        int64
prevalentHyp           int64
diabetes               int64
totChol                int64
sysBP                float64
diaBP                float64
BMI                  float64
heartRate              int64
glucose                int64
TenYearCHD             int64
age_groups             int64
heart_rate_groups      int64
dtype: object

#### 3. Outlier analysis and EDA

In [None]:
data.describe()

Unnamed: 0.1,Unnamed: 0,age,education,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD,age_groups,heart_rate_groups
count,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0,4240.0
mean,2119.5,49.580189,1.955189,8.94434,0.054245,0.005896,0.310613,0.025708,236.667689,132.354599,82.897759,25.799005,75.878774,81.600943,0.151887,1.1,0.921698
std,1224.1269,8.572942,1.018522,11.904777,0.276262,0.076569,0.462799,0.15828,44.32848,22.0333,11.910394,4.070775,12.023937,22.86034,0.358953,0.665533,0.348895
min,0.0,32.0,1.0,0.0,0.0,0.0,0.0,0.0,107.0,83.5,48.0,15.54,44.0,40.0,0.0,0.0,0.0
25%,1059.75,42.0,1.0,0.0,0.0,0.0,0.0,0.0,206.0,117.0,75.0,23.0775,68.0,72.0,0.0,1.0,1.0
50%,2119.5,49.0,2.0,0.0,0.0,0.0,0.0,0.0,234.0,128.0,82.0,25.4,75.0,78.0,0.0,1.0,1.0
75%,3179.25,56.0,3.0,20.0,0.0,0.0,1.0,0.0,262.0,144.0,90.0,28.0325,83.0,85.0,0.0,2.0,1.0
max,4239.0,70.0,4.0,70.0,2.0,1.0,1.0,1.0,696.0,295.0,142.5,56.8,143.0,394.0,1.0,2.0,2.0
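
The quartiles in the summary above are enough to apply a first outlier screen. The sketch below uses Tukey's 1.5 × IQR fences, a common convention that is assumed here rather than prescribed by the notebook, on three columns whose `min`, `25%`, `75%` and `max` values are copied from the `describe()` output.

```python
import pandas as pd

# Values copied from the describe() output above (three columns only).
summary = pd.DataFrame(
    {
        "totChol": {"min": 107.0, "25%": 206.0, "75%": 262.0, "max": 696.0},
        "sysBP": {"min": 83.5, "25%": 117.0, "75%": 144.0, "max": 295.0},
        "glucose": {"min": 40.0, "25%": 72.0, "75%": 85.0, "max": 394.0},
    }
)

# Interquartile range and the 1.5 * IQR fences for each column.
iqr = summary.loc["75%"] - summary.loc["25%"]
lower_fence = summary.loc["25%"] - 1.5 * iqr
upper_fence = summary.loc["75%"] + 1.5 * iqr

# Flag columns whose extremes fall outside the fences.
fences = pd.DataFrame(
    {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "min_below": summary.loc["min"] < lower_fence,
        "max_above": summary.loc["max"] > upper_fence,
    }
)
print(fences)
```

A flag here only says that at least one observation lies beyond the fence. Whether such values are errors or genuine clinical extremes needs a look at the individual rows.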


In [None]:
from ydata_profiling import ProfileReport
ProfileReport(data,minimal=True)



