# Cardio Data Analysis

## Contents

* About
    * Project Development
    * Problem Definition
    * Objective
* Data
    * Libraries
    * Importing
    * Variables
* Exploratory Data Analysis (EDA)
    * Plotting Objectives
    * Functions
    * Shape and Size
    * Types
    * Unique Values
    * Missing Values
    * Units Conversion
    * Continuous and Categorical Variables
        * Continuous Variables
            * Summary statistics
            * Probability Distribution
            * Making Sense of the (Continuous) Data
        * Categorical Variables
            * Bar Plots
            * Making Sense of the (Categorical) Data
    * Class Imbalance
* Feature Engineering
    * Units Conversion
    * Continuous Variables
        * Feature Scaling - Standardization (or Z-score Normalization)
        * Outliers Detection and Treatment
    * Categorical Variables
        * Label Encoding
* Feature Selection
    * Inferential Statistics and Hypothesis Testing
    * Feature Importance
    * Correlation Matrix Heatmap
* Model Training
* Model Evaluation
* Class Imbalance ?

## About

### Project Development
This project was developed locally with Visual Studio Code and GitHub version control.

This project is available on the [GitHub page](https://caiocvelasco.github.io/) or in the [GitHub Repository - Cardio Data Analysis](https://github.com/caiocvelasco/health-data-analysis/blob/a4fafbcd8148a6d501f42a10ae9d313fc3b268e1/cardio-data-analysis-project.ipynb).

### Problem Definition

A client would like to understand some key descriptive statistics about patients' cardiovascular health.

### Objective
Our goal is to calculate some descriptive statistics using NumPy, a Python package for scientific computing.
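
As a minimal sketch of what "descriptive statistics with NumPy" can look like, the snippet below summarizes five weights (in kg) taken from the first rows of the dataset. The helper name `describe` is hypothetical and is not part of the notebook.

```python
import numpy as np

def describe(values):
    """Return basic descriptive statistics for a 1-D collection of numbers."""
    arr = np.asarray(values, dtype=float)
    return {
        "count": int(arr.size),
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        "std": float(arr.std(ddof=1)),  # sample standard deviation
        "min": float(arr.min()),
        "max": float(arr.max()),
    }

# Weights (kg) of the first five rows of the dataset
weights = [62.0, 85.0, 64.0, 82.0, 56.0]
summary = describe(weights)
print(summary)
```

Using `ddof=1` gives the sample standard deviation; NumPy's default (`ddof=0`) gives the population version, so the choice should match how the client expects the figure to be reported.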

## Data
The data was already available in _CSV_ format.

### Libraries

In [188]:
# !pip install seaborn pandas matplotlib numpy
import pandas as pd              # for data analysis
import numpy as np               # for scientific computing
import os                        # for file interactions in the user's operating system
import warnings                  # for dealing with warning messages if need be
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt  # for data visualization
# import matplotlib as mpl
import seaborn as sns            # for data visualization

### Importing

In [189]:
# Basic Settings
csv_folder_name = "health_dataset"  # please, change the folder name (where the CSV files are stored) if need be
notebook_location = "C:\\Users\\caiov\\OneDrive - UCLA IT Services\\Documentos\\DataScience\\Datasets" # set the location where this notebook is saved
csv_folder_path = notebook_location + "\\" + csv_folder_name  # set path for the CSV files
os.chdir(csv_folder_path)                                     # set location of CSV files

# Save CSV Data in a Pandas DataFrame
df = pd.read_csv("cardio_base.csv", sep = ",", skipinitialspace = True) #skip space after delimiter if need be

# Save a Copy of the Dataframe
data = df.copy()

# Dataset Manipulation
data.name = "Cardio Base Dataset" # label the dataset (plain attribute; not carried over to new DataFrames returned by pandas)
cols = data.columns               # create an Index with the feature names

# Quick Overview of a Sample from the Data
pd.set_option("display.max_columns", None) # changing the max_columns value
data.sample(5)

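|       | id    | age   | gender | height | weight | ap_hi | ap_lo | cholesterol | smoke |
|:------|:------|:------|:-------|:-------|:-------|:------|:------|:------------|:------|
| 14016 | 20000 | 18269 | 1      | 159    | 63.0   | 110   | 70    | 1           | 0     |
| 32848 | 46919 | 20336 | 1      | 160    | 80.0   | 120   | 80    | 1           | 0     |
| 55270 | 78843 | 22032 | 1      | 162    | 70.0   | 140   | 90    | 2           | 0     |
| 64394 | 91910 | 18078 | 2      | 167    | 75.0   | 120   | 80    | 1           | 0     |
| 54671 | 77997 | 22699 | 2      | 175    | 82.0   | 120   | 70    | 2           | 0     |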


### Variables

Let's take a closer look at the variables and their documentation.

| Feature | Variable name | Variable type |
|:-------|:---------------|:--------------|
| Id (unique ID)           | id          | int |
| Age                      | age         | int (days) |
| Gender                   | gender      | int (binary) |
| Height                   | height      | int (cm) |
| Weight                   | weight      | float (kg) |
| Systolic blood pressure  | ap_hi       | int |
| Diastolic blood pressure | ap_lo       | int |
| Cholesterol              | cholesterol | int (1: normal, 2: above normal, 3: well above normal) |
| Smoking                  | smoke       | int (binary) |

## Exploratory Data Analysis (EDA)

### Plotting Objectives
Before diving into the EDA, it is good to have a clear goal in mind. Our goal is to calculate some descriptive statistics.

Given our goal, the following points should help explore and visualize data accordingly:
 * Check each feature and its distribution individually (univariate analysis).
 * Check the correlation between pairs of features (bivariate analysis).
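
The two points above can be sketched with pandas as follows. This is only an illustration: the small frame `sample` is built from the first five rows of the dataset, which is far too few observations for a meaningful correlation estimate, and the variable names `univariate` and `bivariate` are hypothetical.

```python
import pandas as pd

# First five rows of the dataset (continuous features only)
sample = pd.DataFrame({
    "height": [168, 156, 165, 169, 156],
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0],
    "ap_hi": [110, 140, 130, 150, 100],
    "ap_lo": [80, 90, 70, 100, 60],
})

# Univariate view: one column of summary statistics per feature
univariate = sample.describe()
print(univariate)

# Bivariate view: pairwise Pearson correlation between features
bivariate = sample.corr()
print(bivariate)
```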

### Defining Functions for EDA

In [190]:
### DATA ANALYSIS PART ###

# Checking Shape
def data_shape(data):
    print("Dataset shape: " + str(data.shape[0]) + " observations and " + str(data.shape[1]) + " features.")

# Check Size
def data_size(data):
    print("This dataset has a total of: " + str(data.size) + " entries.")

# Check Information
def data_info(data):
    print(data.name)
    print("--------------------------------------")
    data.info()
    print("--------------------------------------")  
    
# Get Unique Values - Indicator Variables
def unique_values(data):                                          # define a function (output: unique values for indicator variables)
    indicator_cols = ['gender', 'cholesterol', 'smoke']           # indicator (discrete) variables to inspect
    for i in indicator_cols:
        print('Unique values in', i, 'are', data[i].unique())     # calls unique() to get the unique values
        print('----------------------------------------------------------------------------------------------------')

# Check for Missing Values
def missing_values(data):
    print('Checking for missing values in the', data.name) # data.name has been defined previously in the "Importing" section
    print('------------------------------------------------------------')
    print(data.isna().sum())
    print('------------------------------------------------------------')

# Save Data - Continuous Variables
def save_cont_data(data):
    cont_data = data.select_dtypes(include = 'number')
    return cont_data
    
# Save Data - Discrete Variable
def save_cat_data(data):
    cat_data = data.select_dtypes(exclude = 'number')
    return cat_data

# IQR Method - Detecting Outliers
def iqr_method(potential_outliers, data_copy): # arg 1: list of features with potential outliers; arg 2: DataFrame to modify in place
    for i, col in enumerate(potential_outliers, start = 1):
        Q1 = data_copy[col].quantile(0.25)
        Q3 = data_copy[col].quantile(0.75)
        IQR = Q3 - Q1
        print(f'column {i}: {col}\n------------------------')
        print('1st quartile => ', Q1)
        print('3rd quartile => ', Q3)
        print('IQR =>', IQR)

        lower_bound = Q1 - (1.5 * IQR)
        print('lower_bound => ' + str(lower_bound))

        upper_bound = Q3 + (1.5 * IQR)
        print('upper_bound => ' + str(upper_bound))
        print("\n------------------------")

        # Replace outliers with NaN; Series.mask avoids chained assignment, which may act on a temporary copy
        is_outlier = (data_copy[col] < lower_bound) | (data_copy[col] > upper_bound)
        data_copy[col] = data_copy[col].mask(is_outlier)


### VISUALIZATION PART ###

# Plot Probability Distributions - Continuous Variables
def pdf_plot_cont(cont_data):
    for i in cont_data:
        ax = sns.displot(cont_data[i])
        plt.show()

# Plot Bar Plots - Discrete Variables (and order by value_counts within them)
def bar_plot_cat(cat_data):
    for i in cat_data:
        plt.figure(figsize=(20,4)) # create a new figure per variable so that every plot gets this size
        ax = sns.countplot(y = cat_data[i], order = cat_data[i].value_counts().index)
        plt.show()

# Plot Box Plots - Continuous Variables
def box_plot(potential_outliers, cont_data): # the first argument takes a list of features and the second the dataset
    for i in potential_outliers:
        ax = sns.boxplot(x = cont_data[i], orient = 'h')
        plt.show()
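
To make the IQR rule used in `iqr_method` concrete, here is a minimal worked sketch on a toy series. The values are made up for illustration and the helper `iqr_bounds` is hypothetical; it only mirrors the 1.5 × IQR rule from the function above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy heights in cm (invented values, not taken from the dataset)
heights = pd.Series([150, 160, 162, 165, 170, 250], name="height")

lower, upper = iqr_bounds(heights)
is_outlier = (heights < lower) | (heights > upper)

# Replace flagged values with NaN, as iqr_method does
cleaned = heights.mask(is_outlier)

print(lower, upper)
print(cleaned)
```

Here Q1 = 160.5 and Q3 = 168.75, so the fences are 148.125 and 181.125, and only the value 250 is flagged. Note that replacing values with NaN turns an integer column into a float column.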


### Shape and Size

In [191]:
# Check Shape and Size
data_shape(data) # calls shape function
data_size(data)  # calls size function

Dataset shape: 70000 observations and 9 features.
This dataset has a total of: 630000 entries.


### Types

In [192]:
# Check Data Type
data_info(data) # calls info function

Cardio Base Dataset
--------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   smoke        70000 non-null  int64  
dtypes: float64(1), int64(8)
memory usage: 4.8 MB
--------------------------------------


All features have the expected type. 

Notice that all non-null counts are the same, so the dataset does not seem to have missing values. 

However, it is always good to check whether discrete variables have the expected values. These are: __gender, cholesterol, smoke__. 

For that, we will look into _Unique values_.

### Unique Values
Let's take a closer look at the discrete variables.

In [193]:
# Check for unique values - Discrete variables
unique_values(data) # calls unique values function

Unique values in gender are [2 1]
----------------------------------------------------------------------------------------------------
Unique values in cholesterol are [1 3 2]
----------------------------------------------------------------------------------------------------
Unique values in smoke are [0 1]
----------------------------------------------------------------------------------------------------


All indicator variables have the expected values. There are no inconsistent values.
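
Rather than reading the unique values by eye, the same check can be automated. The sketch below assumes the expected value sets are exactly those observed above (1/2 for gender, 1/2/3 for cholesterol, 0/1 for smoke); the names `expected_values` and `unexpected_counts` are hypothetical.

```python
import pandas as pd

# Assumed valid codes for each indicator variable
expected_values = {
    "gender": {1, 2},
    "cholesterol": {1, 2, 3},
    "smoke": {0, 1},
}

def unexpected_counts(df, expected):
    """Count, per column, the rows whose value is outside the allowed set."""
    return {
        col: int((~df[col].isin(allowed)).sum())
        for col, allowed in expected.items()
    }

# Indicator columns of the first five rows of the dataset
sample = pd.DataFrame({
    "gender": [2, 1, 1, 2, 1],
    "cholesterol": [1, 3, 3, 1, 1],
    "smoke": [0, 0, 0, 0, 0],
})

counts = unexpected_counts(sample, expected_values)
print(counts)
```

A count of zero for every column means that no inconsistent codes were found.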

### Missing Values

In [194]:
# Check for missing values
missing_values(data) # calls missing values function

Checking for missing values in the Cardio Base Dataset
------------------------------------------------------------
id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
smoke          0
dtype: int64
------------------------------------------------------------


In this dataset, there are no missing values stored as NaN.
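
A zero count from `isna()` only rules out values stored as NaN. Missing or invalid entries can also hide behind placeholder numbers such as 0 or -1. Below is a minimal sketch, assuming that non-positive values are impossible for physical measurements such as height and weight; the helper `count_non_positive` and the toy data are hypothetical and are not part of the notebook.

```python
import pandas as pd

def count_non_positive(df, columns):
    """Count, per column, the values that are zero or negative."""
    return {col: int((df[col] <= 0).sum()) for col in columns}

# Toy data with two placeholder-like entries (invented values)
toy = pd.DataFrame({
    "height": [168, 0, 165],
    "weight": [62.0, 85.0, -1.0],
    "ap_hi": [110, 140, 130],
})

nan_counts = toy.isna().sum()  # all zeros: nothing is stored as NaN
suspicious = count_non_positive(toy, ["height", "weight", "ap_hi"])

print(nan_counts)
print(suspicious)
```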

### Units Conversion

Let's check whether any of the given features needs a unit conversion.

The variable `age` is in days, so we need to convert it to years and round it down, as requested. The other variables are fine as they are.

In [195]:
# Converting Age from days to years (rounded down with np.floor)
data['age_years_float'] = (data['age'] / 365)
data['age_years'] = (data['age'] / 365).apply(np.floor).astype(int)

# Quick overview of age conversion
print(data[['age', 'age_years_float', 'age_years']].sample(5))

# Drop unnecessary feature
data = data.drop(columns='age_years_float')

# Quickly check new data
data.head()

         age  age_years_float  age_years
3373   22223        60.884932         60
59282  20408        55.912329         55
42209  14601        40.002740         40
48050  17988        49.282192         49
53427  19686        53.934247         53


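|   | id | age   | gender | height | weight | ap_hi | ap_lo | cholesterol | smoke | age_years |
|:--|:---|:------|:-------|:-------|:-------|:------|:------|:------------|:------|:----------|
| 0 | 0  | 18393 | 2      | 168    | 62.0   | 110   | 80    | 1           | 0     | 50        |
| 1 | 1  | 20228 | 1      | 156    | 85.0   | 140   | 90    | 3           | 0     | 55        |
| 2 | 2  | 18857 | 1      | 165    | 64.0   | 130   | 70    | 3           | 0     | 51        |
| 3 | 3  | 17623 | 2      | 169    | 82.0   | 150   | 100   | 1           | 0     | 48        |
| 4 | 4  | 17474 | 1      | 156    | 56.0   | 100   | 60    | 1           | 0     | 47        |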
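
As a side note on the conversion above, dividing and then flooring gives the same result as integer (floor) division for non-negative ages, which avoids the intermediate float column. A minimal sketch using the five ages printed in the sample above (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Ages in days, taken from the sample printed above
age_days = pd.Series([22223, 20408, 14601, 17988, 19686], name="age")

# Approach used in the notebook: divide, floor, cast to int
age_years_floor = np.floor(age_days / 365).astype(int)

# Equivalent shortcut: floor division keeps the integer dtype
age_years_div = age_days // 365

print(age_years_div.tolist())
print(bool((age_years_floor == age_years_div).all()))
```

Dividing by 365 ignores leap years. Dividing by 365.25 instead would, for example, map 14601 days to 39 rather than 40, so the divisor should follow whatever convention was requested.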
