# Cardio Data Analysis

## Contents

* About
    * Project Development
    * Problem Definition
    * Objective
* Data
    * Libraries
    * Importing
    * Variables
* Exploratory Data Analysis (EDA)
    * Plotting Objectives
    * Functions
    * Shape and Size
    * Types
    * Unique Values
    * Missing Values
    * Units Conversion
    * Continuous and Categorical Variables
        * Continuous Variables
            * Summary statistics
            * Probability Distribution
            * Making Sense of the (Continuous) Data
        * Categorical Variables
            * Bar Plots
            * Making Sense of the (Categorical) Data
    * Class Imbalance
* Feature Engineering
    * Units Conversion
    * Continuous Variables
        * Feature Scaling - Standardization (or Z-score Normalization)
        * Outliers Detection and Treatment
    * Categorical Variables
        * Label Encoding
* Feature Selection
    * Inferential Statistics and Hypothesis Testing
    * Feature Importance
    * Correlation Matrix Heatmap
* Model Training
* Model Evaluation
* Class Imbalance ?

## About

### Project Development
This project was developed locally with Visual Studio Code and GitHub version control.

This project is also available on the [GitHub page](https://caiocvelasco.github.io/) and in the [GitHub Repository - Cardio Data Analysis](https://github.com/caiocvelasco/health-data-analysis/blob/a4fafbcd8148a6d501f42a10ae9d313fc3b268e1/cardio-data-analysis-project.ipynb).

### Problem Definition

A client would like to understand key cardio-related descriptive statistics about patients.

### Objective
Our goal is to calculate these descriptive statistics using NumPy, a package for scientific computing in Python.
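
As a rough illustration of the kind of statistics this objective refers to, the sketch below computes a few of them with NumPy. The `weights_kg` array is a hypothetical stand-in holding the five `weight` values from the data sample shown in the Importing section, not the full dataset.

```python
import numpy as np

# Hypothetical stand-in: the five `weight` values (kg) from the sample rows,
# not the full dataset.
weights_kg = np.array([56.0, 83.0, 72.0, 58.0, 93.0])

mean_weight = np.mean(weights_kg)             # arithmetic mean
median_weight = np.median(weights_kg)         # middle value of the sorted data
std_weight = np.std(weights_kg, ddof=1)       # sample standard deviation
q1, q3 = np.percentile(weights_kg, [25, 75])  # first and third quartiles

print(f"mean={mean_weight:.1f}, median={median_weight:.1f}, std={std_weight:.1f}")
print(f"IQR={q3 - q1:.1f}")
```

With real data, the same calls could be applied to a column converted via `data["weight"].to_numpy()`.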

## Data
The data was already available in _CSV_ format.

### Libraries

In [2]:
# !pip install seaborn pandas matplotlib numpy
import pandas as pd              # for data analysis
import numpy as np               # for scientific computing
import os                        # for file interactions in the user's operating system
import warnings                  # for dealing with warning messages if need be
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt  # for data visualization
# import matplotlib as mpl
import seaborn as sns            # for data visualization

### Importing

In [3]:
# Basic Settings
csv_folder_name = "health_dataset"  # change the folder name (where the CSV files are stored) if need be
notebook_location = "C:\\Users\\caiov\\OneDrive - UCLA IT Services\\Documentos\\DataScience\\Datasets"  # location where this notebook is saved
csv_folder_path = os.path.join(notebook_location, csv_folder_name)  # build the path to the CSV files
os.chdir(csv_folder_path)                                           # make the CSV folder the working directory

# Load the CSV Data into a Pandas DataFrame
df = pd.read_csv("cardio_base.csv", sep=",", skipinitialspace=True)  # skip spaces after the delimiter if need be

# Save a Copy of the DataFrame
data = df.copy()

# Dataset Manipulation
data.name = "Cardio Base Dataset"  # label the dataset (a plain attribute, not a column)
cols = data.columns                # Index object holding the feature (column) names

# Quick Overview of a Sample from the Data
data.sample(5)

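| Unnamed: 0 | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | smoke |
|---|---|---|---|---|---|---|---|---|---|
| 52134 | 74362 | 21852 | 1 | 162 | 56.0 | 110 | 80 | 1 | 0 |
| 55547 | 79250 | 17714 | 2 | 171 | 83.0 | 120 | 80 | 1 | 0 |
| 22444 | 32064 | 21150 | 1 | 160 | 72.0 | 150 | 80 | 1 | 0 |
| 13360 | 19072 | 16665 | 1 | 162 | 58.0 | 110 | 60 | 1 | 0 |
| 44837 | 64030 | 16119 | 1 | 166 | 93.0 | 120 | 80 | 1 | 0 |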
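
The `Unnamed: 0` column in the output above usually appears when a DataFrame's index was written to the CSV file and then read back as ordinary data. Assuming that is what happened here (the source does not confirm it), one way to handle it is to tell `read_csv` to use the first column as the index. The sketch below uses an in-memory, two-row excerpt of the sample instead of the real file so that it is self-contained.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt of the sample, with an empty first header field,
# mimicking a CSV that was saved together with its index.
csv_text = (
    ",id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,smoke\n"
    "52134,74362,21852,1,162,56.0,110,80,1,0\n"
    "55547,79250,17714,2,171,83.0,120,80,1,0\n"
)

# Default read: the unnamed first column becomes a regular column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

If the column turns out to carry no information, dropping it with `data.drop(columns="Unnamed: 0")` would be an alternative.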


### Variables

Let's take a closer look at the variables and their documentation.

__Feature | variable name | variable type | data type (unit or values)__

* Id                       | id          | identifier           | int (unique ID)
* Age                      | age         | continuous variable  | int (days)
* Gender                   | gender      | categorical variable | binary
* Height                   | height      | continuous variable  | int (cm)
* Weight                   | weight      | continuous variable  | float (kg)
* Systolic blood pressure  | ap_hi       | continuous variable  | int
* Diastolic blood pressure | ap_lo       | continuous variable  | int
* Cholesterol              | cholesterol | categorical variable | 1: normal, 2: above normal, 3: well above normal
* Smoking                  | smoke       | categorical variable | binary
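
Since `age` is recorded in days and several variables are categorical, a possible preparation step is sketched below. It assumes a year of 365.25 days and uses a hypothetical three-row DataFrame built from the sample values; the column name `age_years` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical excerpt built from three of the sample rows.
sample = pd.DataFrame({
    "age": [21852, 17714, 21150],   # days
    "gender": [1, 2, 1],
    "cholesterol": [1, 1, 1],
    "smoke": [0, 0, 0],
})

# Convert age from days to years (assumption: 365.25 days per year).
sample["age_years"] = sample["age"] / 365.25

# Mark the categorical variables so pandas does not treat them as numbers.
categorical_cols = ["gender", "cholesterol", "smoke"]
sample = sample.astype({col: "category" for col in categorical_cols})

print(sample.dtypes)
print(sample["age_years"].round(1).tolist())
```

Casting to `category` keeps summary statistics such as the mean from being computed on coded values like `cholesterol`, where the numbers are labels rather than quantities.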