[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alikn/intro_to_ai/blob/master/exploratory_data_analysis_assignment.ipynb)

# Heart disease prediction

A medical services company has asked your company to create a model that predicts which patients are at risk of heart disease. This will help them focus their services on the patients who need them most. The patients flagged by the model will go through a series of tests to confirm the prediction. Running all the tests on all patients would be very costly and time-consuming.

The goal of the prediction task is to create a model that predicts whether a patient has heart disease with at least 90% recall and 50% precision.
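
As a reminder of what the two thresholds mean, here is a minimal sketch that computes precision and recall from confusion-matrix counts. The counts below are made up for illustration and do not come from the dataset; `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp)  # share of flagged patients who are truly sick
    recall = tp / (tp + fn)     # share of sick patients the model flags
    return precision, recall


# Hypothetical counts: 100 patients have the disease, the model flags 170.
tp, fp, fn = 90, 80, 10
precision, recall = precision_recall(tp, fp, fn)

# The project target: recall >= 90% and precision >= 50%.
meets_goal = recall >= 0.90 and precision >= 0.50
print(f"precision={precision:.2f}, recall={recall:.2f}, meets goal: {meets_goal}")
```

High recall matters here because a missed patient gets no follow-up tests, while the precision floor keeps the number of unnecessary tests in check.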

To help us create the model, they have given us a dataset with more than 70 attributes. A colleague of yours has gone through the dataset and narrowed it down to 13 attributes which they think are associated with heart disease.

You have been given the processed dataset and asked to do a limited-scope exploratory data analysis on it.

## About this assignment

- Open this notebook in Colab.
- Go through the Colab and answer the questions. Questions are in code cells marked "Question x". Add your code to the cell and run it.
- The cells which start with "[Discuss in your team]" are for you to discuss the topic with each other. No need to update those cells.
- Materials covered in session 6 of the course can be helpful for this assignment.
- Once you are done, download an *ipynb* file of the Colab (File > Download > Download .ipynb) and submit it as the deliverable.

## Dataset description

Here are the columns in the dataset and their description:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
  - Value 1: upsloping
  - Value 2: flat
  - Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: (target column) diagnosis of heart disease (angiographic disease status)
  - Value 0: < 50% diameter narrowing
  - Value 1: > 50% diameter narrowing


The target attribute is the "num" column. This means that the eventual goal of the project is to create a model which predicts the value of column 14 given columns 1 to 13.

Note that the "num" column, as well as some other columns, is a categorical variable whose categories are denoted by numbers.
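
To illustrate what numerically coded categories look like in Pandas, here is a small sketch on a made-up frame. The `toy` dataframe and its values are hypothetical; only the column names `age` and `cp` are borrowed from the list above.

```python
import pandas as pd

# Hypothetical toy data: "age" is a true number, "cp" is a category code.
toy = pd.DataFrame({"age": [63, 41, 57, 50], "cp": [1, 4, 3, 4]})

# Pandas infers an integer dtype for both columns, so by default it
# cannot tell the category codes apart from real measurements.
print(toy.dtypes)

# Casting to the "category" dtype marks the codes as labels, not quantities.
toy["cp"] = toy["cp"].astype("category")
cp_counts = toy["cp"].value_counts()
print(cp_counts)
```

Once a column is categorical, counting how often each code occurs is meaningful, while averaging the codes is not.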

Here is the [source of this dataset](http://archive.ics.uci.edu/ml/datasets/Heart+Disease). Take a look there if you want to see more info about the dataset.

## Your names

In [None]:
## Add your names here: 

## Loading the dataset

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://alik-courses.web.app/intro-to-ai/datasets/cleveland_heart_disease.csv')

## Taking a look at the raw data

In [None]:
## Question 1: Show the data in the top 10 rows of the dataset.

## Add your answer here and run the cell.



[Discuss in your team] With your teammate, discuss what you can infer just by looking at the raw data.

## Using Pandas to summarize the data

In [None]:
## Question 2: What is the shape of the dataset?

## Add your answer here.


In [None]:
## Question 3: Use Pandas to get a summary of the data (number of non-null elements, type of elements) in columns.

## Add your answer here.


In [1]:
## Question 4: Use Pandas to describe the columns (some stats for numerical columns and some info about unique categories for non-numerical columns).

## Add your answer here.


[Discuss in your team] Using Pandas summary functions gives us some info about the columns. From this info, can you tell if there are null values in any of the columns? 

Pay attention to the medians of the different columns. How did Pandas treat the categorical columns which were coded with numbers?

## Numerical features distributions

In [None]:
## Question 5: Plot the feature distributions.

## Add your answer here.


[Discuss in your team] How do the distributions of numerically encoded categorical variables differ from those of true numerical features?

## Target feature distribution

In [None]:
# Question 6: Plot the distribution of the target feature as a bar plot.

# Add your answer here.


## Effect of numerical features on output

In [None]:
## Question 7: Plot segmentation distributions between the output feature and two of the numerical features that intuitively seem to be good predictors of heart disease.

## Add your answer here: first segmentation plot


In [None]:
## Add your answer here: second segmentation plot


[Discuss in your team] Do the boxplots differ between the case where there is no heart disease (num = 0) and the other cases?

## Update the target column

The model we eventually want to create aims to predict whether or not a patient has a heart problem. The target column indicates not only whether the patient has a healthy heart (num = 0) but also the severity of the disease (num values 1-4). We need to create a new target column with a boolean value: False for no disease and True for disease.

In [None]:
## Question 8: Using Pandas conditional operators, add the new target column to the dataframe.
## Hint: take a look at Intro to Pandas notebook if you need a refresher. 

## Add your answer here.
