# Analysis

Before starting anything, let's take a quick look at our data and determine some boundaries to validate our dataset.

## Rules (to start with)

- **Age (days, int):** range 10 - 110 years (the column itself is stored in days)
- **Height (cm, int):** range 60 - 230
- **Weight (kg, int):** range 30 - 250
- **Gender:** -> to bool
- **BMI** -> range 
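
To compare the age column against a bound in years, and to derive BMI from the height and weight columns, a minimal sketch could look like the one below. The helper names and the 365.25 days-per-year constant are assumptions for illustration, not something taken from `scripts`, and no BMI bounds are applied because none are fixed yet.

```python
# Assumption: average year length, including leap years.
DAYS_PER_YEAR = 365.25


def age_in_years(age_days: int) -> float:
    """Convert an age stored in days to (fractional) years."""
    return age_days / DAYS_PER_YEAR


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100
    return weight_kg / height_m**2


# Hypothetical patient: 18 250 days old, 180 cm, 72 kg.
print(round(age_in_years(18_250), 1))
print(round(bmi(72, 180), 1))
```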

**Ap High:**
- Low (Hypotension): Less than 90 mm Hg systolic
- Normal: 90-119 mm Hg systolic
- Elevated: 120-129 mm Hg systolic
- High (Hypertension):
    - Stage 1: 130-139 mm Hg systolic
    - Stage 2: 140 mm Hg or higher systolic
- Critical: 180 mm Hg and above
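
The systolic bands above can be sketched as a simple chain of threshold checks. The function name and the label strings are made up for this example; the one interpretation added here is that "Critical" takes precedence over "Stage 2" from 180 mm Hg upwards, since the two bands overlap as written.

```python
def classify_systolic(ap_hi: int) -> str:
    """Map a systolic reading (mm Hg) to one of the bands listed above."""
    if ap_hi < 90:
        return "low"
    if ap_hi < 120:
        return "normal"
    if ap_hi < 130:
        return "elevated"
    if ap_hi < 140:
        return "hypertension stage 1"
    if ap_hi < 180:
        return "hypertension stage 2"
    # Assumption: critical wins over stage 2 at 180 mm Hg and above.
    return "critical"


for reading in (85, 118, 125, 135, 150, 190):
    print(reading, classify_systolic(reading))
```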

**Ap Low:**
- Low: Less than 60 mm Hg diastolic
- Normal: 60-79 mm Hg diastolic
- Elevated: 80-89 mm Hg diastolic
- High (Hypertension):
    - Stage 1: 90-99 mm Hg diastolic
    - Stage 2: 100 mm Hg or higher diastolic
- Critical (made that one up): 120 mm Hg and above diastolic
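
For the diastolic bands, a table-driven lookup keeps the thresholds in one place. This is only a sketch: the names are hypothetical, the lower bound of each band is treated as inclusive, and the made-up critical band is assumed to start at 120 mm Hg.

```python
from bisect import bisect_right

# Lower bounds (mm Hg) at which the next band starts.
_DIASTOLIC_BOUNDS = [60, 80, 90, 100, 120]
_DIASTOLIC_LABELS = [
    "low",                   # below 60
    "normal",                # 60-79
    "elevated",              # 80-89
    "hypertension stage 1",  # 90-99
    "hypertension stage 2",  # 100-119
    "critical",              # 120 and above (assumption)
]


def classify_diastolic(ap_lo: int) -> str:
    """Map a diastolic reading (mm Hg) to one of the bands listed above."""
    # bisect_right counts how many lower bounds are <= ap_lo,
    # which is exactly the index of the matching label.
    return _DIASTOLIC_LABELS[bisect_right(_DIASTOLIC_BOUNDS, ap_lo)]


for reading in (55, 70, 85, 95, 110, 125):
    print(reading, classify_diastolic(reading))
```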

## Let's start with those and check that our data is plausible for real people
- In `scripts`, you can find the first change we made: boolean columns are now loaded as booleans.


In [1]:
from scripts.generic_methods import load_dataset, drop_by_filter
from scripts.patient import Patient

In [2]:
from pathlib import Path


_CARDIO_DATASET_PATH = Path("../data/object_compatible/cardio_train.csv")
CARDIO_DATASET = load_dataset(_CARDIO_DATASET_PATH, ";")

In [3]:
CARDIO_DATASET.head()
CARDIO_DATASET.info()
CARDIO_DATASET.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  bool   
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  bool   
 10  alco         70000 non-null  bool   
 11  active       70000 non-null  bool   
 12  cardio       70000 non-null  bool   
dtypes: bool(5), float64(1), int64(7)
memory usage: 4.6 MB


| | id | age | height | weight | ap_hi | ap_lo | cholesterol | gluc |
|---|---|---|---|---|---|---|---|---|
| count | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 |
| mean | 49972.4199 | 19468.865814 | 164.359229 | 74.20569 | 128.817286 | 96.630414 | 1.366871 | 1.226457 |
| std | 28851.302323 | 2467.251667 | 8.210126 | 14.395757 | 154.011419 | 188.47253 | 0.68025 | 0.57227 |
| min | 0.0 | 10798.0 | 55.0 | 10.0 | -150.0 | -70.0 | 1.0 | 1.0 |
| 25% | 25006.75 | 17664.0 | 159.0 | 65.0 | 120.0 | 80.0 | 1.0 | 1.0 |
| 50% | 50001.5 | 19703.0 | 165.0 | 72.0 | 120.0 | 80.0 | 1.0 | 1.0 |
| 75% | 74889.25 | 21327.0 | 170.0 | 82.0 | 140.0 | 90.0 | 2.0 | 1.0 |
| max | 99999.0 | 23713.0 | 250.0 | 200.0 | 16020.0 | 11000.0 | 3.0 | 3.0 |
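
The minimum and maximum rows of this summary already show impossible values (negative pressures, a 10 kg adult). One way to count how many rows break each rule is sketched below with pandas. The sample rows are made up to echo those extremes rather than read from `cardio_train.csv`, the height and weight bounds come from the rules list, and "blood pressure must be positive" is an assumption added for the example.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "height": [165, 55, 170, 250],
        "weight": [72.0, 10.0, 82.0, 200.0],
        "ap_hi": [120, -150, 140, 16020],
        "ap_lo": [80, -70, 90, 11000],
    }
)

# One boolean column per rule: True marks a violation.
out_of_range = pd.DataFrame(
    {
        "height": ~sample["height"].between(60, 230),  # inclusive bounds
        "weight": ~sample["weight"].between(30, 250),
        "ap_hi": sample["ap_hi"] <= 0,  # assumption: pressure must be positive
        "ap_lo": sample["ap_lo"] <= 0,
    }
)

violations_per_column = out_of_range.sum()
rows_with_any_violation = int(out_of_range.any(axis=1).sum())

print(violations_per_column)
print(rows_with_any_violation)
```

Counting per column first makes it easier to see which rule removes the most rows before anything is dropped.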


Let's take a peek at the values we flagged as outliers in our set.

In [4]:
def patient_is_valid(patient: Patient) -> bool:
    """Check if a patient is valid."""
    return patient.is_valid
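
The real `Patient` class lives in `scripts.patient` and is not shown here. Purely as an illustration of what an `is_valid` property built on the starting rules might look like, here is a hypothetical stand-in; the class name, field names, the 365.25 days-per-year conversion and the blood-pressure condition are all assumptions, and the actual rules in `scripts` may well be stricter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientSketch:
    """Hypothetical stand-in for scripts.patient.Patient."""

    age_days: int
    height_cm: int
    weight_kg: float
    ap_hi: int
    ap_lo: int

    @property
    def is_valid(self) -> bool:
        """True when every field falls inside the starting rules."""
        age_years = self.age_days / 365.25  # assumption: average year length
        return (
            10 <= age_years <= 110
            and 60 <= self.height_cm <= 230
            and 30 <= self.weight_kg <= 250
            # Assumption: pressures are positive and diastolic < systolic.
            and 0 < self.ap_lo < self.ap_hi
        )


print(PatientSketch(18_250, 165, 72.0, 120, 80).is_valid)
print(PatientSketch(18_250, 165, 72.0, -150, 80).is_valid)
```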

In [5]:
VALID_DATASET = drop_by_filter(CARDIO_DATASET, lambda p: not patient_is_valid(p))
VALID_DATASET.to_csv("../data/filtered/cardio_train.csv", index=False)
VALID_DATASET.head()
VALID_DATASET.info()
VALID_DATASET.describe()

<class 'pandas.core.frame.DataFrame'>
Index: 58989 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           58989 non-null  int64  
 1   age          58989 non-null  int64  
 2   gender       58989 non-null  bool   
 3   height       58989 non-null  int64  
 4   weight       58989 non-null  float64
 5   ap_hi        58989 non-null  int64  
 6   ap_lo        58989 non-null  int64  
 7   cholesterol  58989 non-null  int64  
 8   gluc         58989 non-null  int64  
 9   smoke        58989 non-null  bool   
 10  alco         58989 non-null  bool   
 11  active       58989 non-null  bool   
 12  cardio       58989 non-null  bool   
dtypes: bool(5), float64(1), int64(7)
memory usage: 4.3 MB


| | id | age | height | weight | ap_hi | ap_lo | cholesterol | gluc |
|---|---|---|---|---|---|---|---|---|
| count | 58989.0 | 58989.0 | 58989.0 | 58989.0 | 58989.0 | 58989.0 | 58989.0 | 58989.0 |
| mean | 49986.275441 | 19355.196732 | 164.496177 | 73.091339 | 121.895574 | 79.541999 | 1.33335 | 1.213769 |
| std | 28857.726227 | 2477.644014 | 7.77967 | 13.445919 | 11.05652 | 7.813638 | 0.658455 | 0.56193 |
| min | 0.0 | 10798.0 | 120.0 | 30.0 | 90.0 | 60.0 | 1.0 | 1.0 |
| 25% | 24989.0 | 17551.0 | 159.0 | 64.0 | 120.0 | 80.0 | 1.0 | 1.0 |
| 50% | 50101.0 | 19637.0 | 165.0 | 71.0 | 120.0 | 80.0 | 1.0 | 1.0 |
| 75% | 74890.0 | 21251.0 | 170.0 | 80.0 | 130.0 | 80.0 | 1.0 | 1.0 |
| max | 99999.0 | 23713.0 | 207.0 | 180.0 | 140.0 | 100.0 | 3.0 | 3.0 |
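
Filtering took the set from 70000 to 58989 rows, so 11011 rows (about 15.7%) were dropped. The body of `drop_by_filter` is not shown in this notebook; a minimal sketch of how such a helper could work with pandas follows. It is an assumption, not the code in `scripts.generic_methods`: in particular, it hands each raw row to the predicate, whereas the real helper appears to pass `Patient` objects.

```python
from typing import Callable

import pandas as pd


def drop_by_filter_sketch(
    frame: pd.DataFrame, should_drop: Callable[[pd.Series], bool]
) -> pd.DataFrame:
    """Return the rows of `frame` for which `should_drop` is False."""
    # Evaluate the predicate once per row (axis=1), then keep the rest.
    drop_mask = frame.apply(should_drop, axis=1)
    return frame[~drop_mask]


# Made-up rows for illustration only.
sample = pd.DataFrame({"id": [0, 1, 2], "ap_hi": [120, -150, 130]})
kept = drop_by_filter_sketch(sample, lambda row: row["ap_hi"] <= 0)

print(kept.index.tolist())
print(len(sample) - len(kept))
```

Boolean-mask filtering keeps the original index labels, which would be consistent with the `Index: 58989 entries, 0 to 69999` line in the output above; calling `reset_index(drop=True)` afterwards would renumber the rows if that is preferred.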
