<div align="center">
    <h1>Heart Disease</h1>
</div>

<div>
    <h1>Introduction to Artificial Intelligence | Project 2 | Universidad del Valle</h1>
</div>

![Heart illustration](https://img.webmd.com/dtmcms/live/webmd/consumer_assets/site_images/articles/health_tools/did_you_know_this_could_lead_to_heart_disease_slideshow/493ss_thinkstock_rf_heart_illustration.jpg)

## Authors
- Bryan Steven Biojó     - 1629366
- Julián Andrés Castaño  - 1625743
- Juan Sebastián Saldaña - 1623447
- Juan Pablo Rendón      - 1623049

## Objective
- Apply the concept of Machine Learning (ML) to solve a **classification problem** using the methods seen in the course.

## 1. Importing libraries
First, we import the libraries used throughout the project:

In [8]:
# Common libraries
import numpy as np
import pandas as pd
import seaborn as sb
import tensorflow as tf
import matplotlib.pyplot as plt
import math
import re
import os
import sys
import warnings
warnings.filterwarnings('ignore')
from matplotlib.legend_handler import HandlerLine2D
from IPython.display import SVG, display
from graphviz import Source

# Sklearn libraries
import sklearn
from sklearn import tree
from sklearn.tree import export_graphviz
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
#from sklearn.externals.six import StringIO

# Keras libraries
from keras.models import Sequential
from keras.layers import Dense

## 2. Loading the dataset
Next, we load the heart disease dataset from a CSV file. The data was taken from **Kaggle** (https://www.kaggle.com/ronitf/heart-disease-uci) and re-uploaded to the following **GitHub** repository (https://github.com/bryansbr/heart-disease-AI):

In [9]:
# Dataset
url = "https://raw.githubusercontent.com/bryansbr/heart-disease-AI/main/heart.csv"
data = pd.read_csv(url)
print(data.columns)
print(data.shape)
#data.head()
data.describe()

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
(303, 14)


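|           | **age**   | **sex**  | **cp**   | **trestbps** | **chol**   | **fbs**  | **restecg** | **thalach** | **exang** | **oldpeak** | **slope** | **ca**   | **thal** | **target** |
|-----------|-----------|----------|----------|--------------|------------|----------|-------------|-------------|-----------|-------------|-----------|----------|----------|------------|
| **count** | 303.0     | 303.0    | 303.0    | 303.0        | 303.0      | 303.0    | 303.0       | 303.0       | 303.0     | 303.0       | 303.0     | 303.0    | 303.0    | 303.0      |
| **mean**  | 54.366337 | 0.683168 | 0.966997 | 131.623762   | 246.264026 | 0.148515 | 0.528053    | 149.646865  | 0.326733  | 1.039604    | 1.39934   | 0.729373 | 2.313531 | 0.544554   |
| **std**   | 9.082101  | 0.466011 | 1.032052 | 17.538143    | 51.830751  | 0.356198 | 0.52586     | 22.905161   | 0.469794  | 1.161075    | 0.616226  | 1.022606 | 0.612277 | 0.498835   |
| **min**   | 29.0      | 0.0      | 0.0      | 94.0         | 126.0      | 0.0      | 0.0         | 71.0        | 0.0       | 0.0         | 0.0       | 0.0      | 0.0      | 0.0        |
| **25%**   | 47.5      | 0.0      | 0.0      | 120.0        | 211.0      | 0.0      | 0.0         | 133.5       | 0.0       | 0.0         | 1.0       | 0.0      | 2.0      | 0.0        |
| **50%**   | 55.0      | 1.0      | 1.0      | 130.0        | 240.0      | 0.0      | 1.0         | 153.0       | 0.0       | 0.8         | 1.0       | 0.0      | 2.0      | 1.0        |
| **75%**   | 61.0      | 1.0      | 2.0      | 140.0        | 274.5      | 0.0      | 1.0         | 166.0       | 1.0       | 1.6         | 2.0       | 1.0      | 3.0      | 1.0        |
| **max**   | 77.0      | 1.0      | 3.0      | 200.0        | 564.0      | 1.0      | 2.0         | 202.0       | 1.0       | 6.2         | 2.0       | 4.0      | 3.0      | 1.0        |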
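
Because the file is fetched from a remote URL, it can help to confirm that the frame has the expected layout before going further. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `check_layout` are hypothetical names introduced here, and the two-row sample is made up so that the snippet runs offline.

```python
import io

import pandas as pd

# Column names as printed by data.columns above.
EXPECTED_COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]


def check_layout(frame):
    """Raise ValueError if the frame's columns differ from the expected ones."""
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    extra = [c for c in frame.columns if c not in EXPECTED_COLUMNS]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}; unexpected columns: {extra}")


# Made-up rows in the same layout, used only for illustration.
sample_csv = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "60,1,0,130,250,0,1,150,0,1.0,1,0,2,1\n"
    "45,0,2,120,210,0,0,170,0,0.0,2,0,2,0\n"
)
sample = pd.read_csv(sample_csv)
check_layout(sample)   # passes silently
print(sample.shape)    # (2, 14)
```

With the notebook's own frame, the call would be `check_layout(data)`.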


## 3. Data description
In total we have 14 columns with the following information:

- **age:** Age in years.
- **sex:** Where (1 = male; 0 = female).
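- **cp:** Chest pain type. Where (1 = angina; 2 = pain without angina; 3 = asymptomatic). Note that in this CSV the column takes values from 0 to 3 (see the summary above), so the codes may be shifted with respect to this list.
- **trestbps:** Resting blood pressure (in mm Hg on admission to the hospital).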
- **chol:** Serum cholesterol of the person in mg/dl.
- **fbs:** Fasting blood sugar > 120 mg/dl. Where (1 = true; 0 = false).
- **restecg:** Resting electrocardiographic results. Where (0 = normal; 1 = with ST-T wave abnormality (T wave inversions and/or ST elevation or depression > 0.05 mV); 2 = showing probable or definite left ventricular hypertrophy according to the Romhilt-Estes criteria).
- **thalach:** Maximum heart rate achieved.
- **exang:** Exercise induced angina (1 = yes; 0 = no).
- **oldpeak:** ST depression induced by exercise relative to rest.
- **slope:** The slope of the peak exercise ST segment. Where (0 = ascending slope; 1 = flat; 2 = descending slope).
- **ca:** Number of major vessels (0 - 3) colored by fluoroscopy. The summary above shows a maximum of 4 in this CSV.
- **thal:** Blood disorder known as 'Thalassemia'. Where (3 = normal; 6 = fixed defect; 7 = reversible defect). In this CSV the column takes values from 0 to 3, so it uses a different encoding from the one listed here.
- **target:** Indicates whether the person has heart disease (1 = yes; 0 = no). This is the column that we want to **predict** with our ML models from the information in the preceding columns.
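
Since the code lists above come from the dataset's documentation, one way to confirm how a given CSV actually encodes each categorical column is to list its distinct values. A minimal sketch, assuming a frame with the columns described above; `observed_codes` is a hypothetical helper and the toy rows are made up for illustration:

```python
import pandas as pd


def observed_codes(frame, columns):
    """Return the sorted distinct (non-missing) values found in each column."""
    return {col: sorted(frame[col].dropna().unique().tolist()) for col in columns}


# Made-up rows; with the real data, pass `data` instead of `toy`.
toy = pd.DataFrame({
    "cp":   [0, 2, 1, 3, 0],
    "thal": [2, 3, 2, 1, 2],
    "ca":   [0, 0, 1, 4, 2],
})

for col, codes in observed_codes(toy, ["cp", "thal", "ca"]).items():
    print(f"{col}: {codes}")
```

Comparing the printed values with the documented codes shows quickly whether a column follows the description or a different encoding.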

## 4. Types of variables
Next, we classify each variable as numerical or categorical. **Numerical variables** take numeric values that measure a quantity and can be discrete or continuous, while **categorical variables** take one of a limited, usually fixed, set of values that represent a qualitative characteristic.

Based on these definitions, the variables are grouped as follows:

|   **Variable**  |   **Type**  |
|-----------------|-------------|
|    **age**      |  numerical  |
|     **sex**     | categorical |
|    **cp**       | categorical |
|   **trestbps**  |  numerical  |
|     **chol**    |  numerical  |
|     **fbs**     | categorical |
|   **restecg**   | categorical |
|   **thalach**   |  numerical  |
|    **exang**    | categorical |
|   **oldpeak**   |  numerical  |
|    **slope**    | categorical |
|     **ca**      | categorical |
|    **thal**     | categorical |
|   **target**    | categorical |
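
The grouping in the table can be written down as two lists so that later steps can select columns by type. A minimal sketch under stated assumptions: the list names, the helper `mark_categoricals`, and the toy frame are introduced here for illustration, and casting to pandas' `category` dtype is one optional way to record the distinction.

```python
import pandas as pd

# Grouping taken from the table above (column names as they appear in the CSV).
NUMERICAL = ["age", "trestbps", "chol", "thalach", "oldpeak"]
CATEGORICAL = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target"]


def mark_categoricals(frame):
    """Return a copy in which the categorical columns use the 'category' dtype."""
    dtypes = {col: "category" for col in CATEGORICAL if col in frame.columns}
    return frame.astype(dtypes)


# Made-up rows with a subset of the columns, for illustration only.
toy = pd.DataFrame({"age": [60, 45], "chol": [250, 210], "sex": [1, 0], "target": [1, 0]})
typed = mark_categoricals(toy)
print(typed.dtypes)
```

With the notebook's frame, `data[NUMERICAL]` and `data[CATEGORICAL]` would then select each group.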

## 5. Graphing the variables
Based on this grouping, we plot each variable: numerical variables as **histograms** and categorical variables as **pie charts**.

### 5.1. Checking for missing or null data
Before plotting, we check the dataset for missing or null values. If any exist, they must be filled in (imputed) or removed as appropriate.

In [14]:
# isna() and isnull() are aliases in pandas, so both checks report the same result.
print("NaN data exists in the dataset?: ")
print(data.isna().any().any())
print("---------------------------------")
print("null data exists in the dataset?:")
print(data.isnull().any().any())

NaN data exists in the dataset?: 
False
---------------------------------
null data exists in the dataset?:
False
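
Had the check reported `True`, a per-column count would show where the gaps are, and they could then be filled in or dropped. The following is a minimal sketch on made-up data; filling a numerical column with its median and dropping the rows that still have gaps is one common choice, not necessarily what this project would have used.

```python
import numpy as np
import pandas as pd

# Made-up frame with two gaps, for illustration only.
toy = pd.DataFrame({
    "age":  [60.0, np.nan, 45.0, 52.0],
    "chol": [250.0, 210.0, 230.0, 190.0],
    "sex":  [1.0, 0.0, np.nan, 1.0],
})

# Number of missing values in each column.
print(toy.isna().sum())

# Complete: fill the numerical column with its median.
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())

# Remove: drop the rows that still contain a missing value.
cleaned = filled.dropna()
print(cleaned.shape)  # (3, 3)
```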


In this case, there are no null or missing values, so we can proceed to plot the variables according to their grouping.
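
The plots described in this section could be produced along these lines. This is a minimal sketch with matplotlib on made-up data; `plot_histogram` and `plot_pie` are hypothetical helper names, and the bin count is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_histogram(frame, column, bins=10):
    """Draw a histogram of a numerical column and return the figure."""
    fig, ax = plt.subplots()
    ax.hist(frame[column], bins=bins)
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    ax.set_title(f"Distribution of {column}")
    return fig


def plot_pie(frame, column):
    """Draw a pie chart of a categorical column's value counts and return the figure."""
    counts = frame[column].value_counts().sort_index()
    fig, ax = plt.subplots()
    ax.pie(counts.values, labels=[str(v) for v in counts.index], autopct="%1.1f%%")
    ax.set_title(f"Share of each {column} value")
    return fig


# Made-up rows; with the real data, pass `data` instead of `toy`.
toy = pd.DataFrame({"age": [60, 45, 52, 67, 58, 41], "sex": [1, 0, 1, 1, 0, 1]})
hist_fig = plot_histogram(toy, "age", bins=5)
pie_fig = plot_pie(toy, "sex")
```

In a notebook the figures render inline at the end of the cell; outside a notebook, call `plt.show()` to display them.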