# **Heart Disease Analysis and Detection**
* The dataset is provided by Heart Trek.
* The dataset contains patient details along with a label indicating whether each patient has heart disease.

**<h3>Coronary Artery Disease:</h3>** 

- It refers to the **narrowing or blockage** of the **coronary arteries**, usually caused by the **build-up** of **cholesterol** and **fatty deposits** (called plaques) on the **inner walls** of the **arteries**.

- These plaques can restrict blood flow to the heart muscle by **physically** **clogging** the **artery** or by causing **abnormal** artery **tone** and **function**.

- This can **cause chest pain** called **Angina**. When **one or more** of the **coronary arteries** are completely **blocked**, a **heart attack** may occur.


There are **various factors** that **affect** coronary **heart disease**. These are as follows:

- Risk factors like gender, family history, race, ethnicity etc. are **non-modifiable (cannot be changed)**.

- Risk factors like cigarette smoking, high blood cholesterol levels, high blood pressure, physical inactivity, etc. are **modifiable**.

# **Problem Statement**
---

- It is often **difficult** to **identify** whether a person has **heart disease** because **diagnosis** relies on a **combination of clinical symptoms and test results** that are **assessed through traditional processes**.

- **Because of** the **large number of risk factors** involved, it is **difficult** to **achieve accurate results** consistently.

In [None]:
!pip install -q yellowbrick   

In [None]:
import numpy as np
import pandas as pd

#For Generating the statistical Report
from pandas_profiling import ProfileReport

# For Random seed values
from random import randint

# For Scientific Python
from scipy import stats

# For datetime
from datetime import datetime as dt

# For Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# For Preprocessing
from sklearn.preprocessing import StandardScaler

# For Feature Selection
from sklearn.feature_selection import SelectFromModel

# For Feature Importances
from yellowbrick.model_selection import FeatureImportances

# For metrics evaluation
from sklearn.metrics import precision_recall_curve, classification_report, plot_confusion_matrix

# For Data Modeling
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('../input/heartdisease/dataset.csv')

In [None]:
df.head()

#### Dataset Consists of 366 Records and 14 Features
| Records | Features |
| :-- | :-- |
| 366 | 14 |

#### Datatypes: float64, object, int64



| # | Features | Description |
| :-- | :--| :--| 
|01|**age**| Age of the patient.|
|02|**sex**| Gender of the patient [male = 1, female = 0].|
|03|**chest_pain_type**|Type of chest-pain experienced by the patient [typical angina, atypical angina, non-anginal pain, asymptomatic].|
|04|**resting_blood_pressure**|Resting blood pressure value of an individual in mmHg.|
|05|**cholesterol**|Serum cholesterol in mg/dl.|
|06|**fasting_blood_sugar**|Compares the fasting blood sugar value of an individual with 120 mg/dl [value > 120 : 1, otherwise : 0].|
|07|**rest_ecg**|Resting electrocardiographic results [normal : 0, ST-T wave abnormality : 1, left ventricular hypertrophy : 2].|
|08|**max_heart_rate_achieved**|Max heart rate achieved by an individual (bpm).|
|09|**exercise_induced_angina**|Exercise induced angina [yes : 1, no : 0].|
|10|**st_depression**|ST depression induced by exercise relative to rest.|
|11|**st_slope**|Slope of the peak exercise ST segment [upsloping : 1, flat : 2, downsloping : 3].|
|12|**num_major_vessels**|Number of major vessels (0–3) colored by fluoroscopy. The count reflects the vessels that appear colored during the examination.|
|13|**thalassemia**|Thalassemia is an inherited blood disorder in which the body makes an abnormal form or an inadequate amount of hemoglobin [normal : 3, fixed defect : 6, reversible defect : 7].|
|14|**target**|Whether the patient has heart disease [No : 0, Yes : 1].|

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum()/len(df) * 100

**Observation:**

- **On average**, **patients** are **54 years old**.

- **25% of patients** are **aged 47 or younger**, while **50% and 75%** of **patients** are **aged 55 or younger and 61 or younger**, respectively.

- **On average**, **patients** have a **resting blood pressure of 131.56 mmHg**.

- **25% of patients** have a **resting blood pressure <= 120 mmHg**, while **50% and 75% of patients** have a **resting blood pressure <= 130 mmHg and <= 140 mmHg**, respectively.

- **On average**, **patients** have a **cholesterol level of 247.56 mg/dl**.

- **25% of patients** have **cholesterol <= 212 mg/dl**, while **50% and 75% of patients** have **cholesterol <= 243 mg/dl and <= 275 mg/dl**, respectively.

- **On average**, **patients** reach a **maximum heart rate of 149.4 bpm**.

- **25% of patients** have a **max heart rate achieved <= 133.5 bpm**, while **50% and 75% of patients** have a **max heart rate achieved <= 152 bpm and <= 165 bpm**, respectively.

- **On average**, **patients** have an **ST depression of about 1**.

- **25% of patients** have **no ST depression**, while **50% and 75% of patients** have an **ST depression <= 0.8 and <= 1.8**, respectively.

- The **cholesterol, max_heart_rate_achieved, and exercise_induced_angina** features have missing values.
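
The quartile and missing-value figures above come from `df.describe()` and `df.isnull()`. As a minimal sketch of how those numbers arise, the snippet below computes them on a small hypothetical frame (`toy` and its values are invented for illustration and are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "age": [40, 47, 55, 61, 70],
    "cholesterol": [200.0, np.nan, 243.0, 275.0, np.nan],
})

# Quartiles: 25%, 50% and 75% of the values lie at or below these cut points.
age_quartiles = toy["age"].quantile([0.25, 0.50, 0.75])

# Percentage of missing values per column (mean of a boolean mask = fraction of True).
missing_pct = toy.isnull().mean() * 100

print(age_quartiles)
print(missing_pct)
```

`isnull().mean() * 100` gives the same result as `isnull().sum() / len(df) * 100`, the expression used in the cell above.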

In [None]:
# Collect the names of all numeric (integer or float) columns.
# select_dtypes avoids comparing against the platform-dependent built-in int type.
num_feature = df.select_dtypes(include = 'number').columns.tolist()

print('Total Numerical Features:', len(num_feature))
print('Features:', num_feature)

##### These are the numerical columns in the dataset. Next, we plot their values to analyse the distribution of the data.

### Graphical Analysis

#### Creating Subplots

In [None]:
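fig, axes = plt.subplots(nrows = 4, ncols = 4, sharex = False, figsize = (15,12))

# histplot with kde = True replaces the deprecated sns.distplot
for ax, col in zip(axes.flat, num_feature):
    sns.histplot(x = df[col], bins = 50, kde = True, stat = 'density', ax = ax)
    ax.set_title(col)
    ax.set_xlabel('')
    ax.grid(False)

# Hide the grid cells left over when there are fewer features than axes
for ax in axes.ravel()[len(num_feature):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()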

**Observation:**

- **Positively Skewed Features:**
  - resting_blood_pressure
  - cholesterol
  - st_depression
  - num_major_vessels
  - fasting_blood_sugar
  - rest_ecg
  - exercise_induced_angina
  - st_slope
  - thalassemia
- **Negatively Skewed Features:**
  - age
  - sex
- **Symmetric (Non-Skewed) Features:**
  - None
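
The grouping above is read off the plots by eye. A numeric cross-check is possible with `DataFrame.skew()`, where a positive value indicates a longer right tail and a negative value a longer left tail. The sketch below uses a hypothetical frame (`toy`, with invented column names and values) purely to illustrate the sign convention.

```python
import pandas as pd

# Hypothetical columns chosen to show each kind of skew.
toy = pd.DataFrame({
    "right_tailed": [1, 1, 2, 2, 3, 10],   # one large value stretches the right tail
    "left_tailed": [1, 8, 9, 9, 10, 10],   # one small value stretches the left tail
    "symmetric": [1, 2, 3, 4, 5, 6],       # evenly spread around the mean
})

# Sample skewness per column: > 0 is positive skew, < 0 is negative skew.
skewness = toy.skew()
print(skewness)
```

Applied to the notebook's data, `df[num_feature].skew()` would give one value per numeric feature to compare against the visual grouping.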

In [None]:
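plt.figure(figsize = (12,8))
# Restrict to numeric columns: object columns such as chest_pain_type cannot be correlated
sns.heatmap(df.select_dtypes(include = 'number').corr(), annot = True)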

#### Finding Duplicate rows in the dataset

In [None]:
duplicate = df[df.duplicated()]
print(len(duplicate))

**Observation:**

- The **report shows** a total of **14 variables**, of which **6 are numeric, 4 are boolean and 4 are categorical**.
- There are **41 duplicate rows**.
- Features like **st_depression and num_major_vessels contain zeros**.
- **age** is **negatively correlated** with **max_heart_rate_achieved**.
- **st_depression** is **negatively correlated** with **st_slope**.
- The **cholesterol** feature has **18 null values** and appears approximately **normally distributed**.
- The **max_heart_rate_achieved** feature is **left-skewed** and has **27 null values**.
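
A correlation such as the one between age and max_heart_rate_achieved can be read directly from the correlation matrix. The sketch below shows the idea on a hypothetical frame (`toy`; the values are invented and only mimic the direction of the relationship reported above).

```python
import pandas as pd

# Hypothetical values: heart rate falls as age rises.
toy = pd.DataFrame({
    "age": [35, 45, 55, 65, 75],
    "max_heart_rate_achieved": [185, 172, 160, 150, 138],
    "chest_pain_type": ["asymptomatic", "typical angina", "asymptomatic",
                        "non-anginal pain", "atypical angina"],
})

# Keep numeric columns only; text columns have no Pearson correlation.
corr = toy.select_dtypes(include="number").corr()

# Pearson coefficient for one pair of features (negative = inverse relationship).
r = corr.loc["age", "max_heart_rate_achieved"]
print(round(r, 3))
```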

## Data Preprocessing

### Identification and Handling of Missing values

#### Handling NULL values

In [None]:
null_frame = pd.DataFrame(index = df.columns.values)
null_frame['Null Frequency'] = df.isnull().sum().values
percent = df.isnull().sum().values/df.shape[0]
null_frame['Missing %age'] = np.round(percent, decimals = 4) * 100
null_frame.transpose()

- **cholesterol**:
  - Missing values (18) &rarr; replace with the median value.
- **max_heart_rate_achieved**:
  - Missing values (27) &rarr; replace with the median value.
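
As a minimal sketch of the median replacement described above, the snippet below imputes a hypothetical column (`toy` and its values are invented for illustration). It also shows why the median can be a safer fill value than the mean when a column contains an extreme reading.

```python
import numpy as np
import pandas as pd

# Hypothetical cholesterol readings with one missing value and one extreme value.
toy = pd.DataFrame({"cholesterol": [212.0, np.nan, 243.0, 275.0, 564.0]})

# median() and mean() both skip NaN by default.
median_value = toy["cholesterol"].median()  # middle of 212, 243, 275, 564
mean_value = toy["cholesterol"].mean()      # pulled upwards by 564

# Fill the gap with the median.
toy["cholesterol"] = toy["cholesterol"].fillna(median_value)

print(median_value, mean_value)
print(toy)
```

Here the median is 259.0 while the mean is 323.5, so the single extreme reading shifts the mean far more than the median.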

In [None]:
df['cholesterol'] = df['cholesterol'].fillna(df['cholesterol'].median())

In [None]:
df['max_heart_rate_achieved'] = df['max_heart_rate_achieved'].fillna(df['max_heart_rate_achieved'].median())

In [None]:
df.isnull().sum()

**Observation:**

- **All the NULL values have been replaced successfully**.

### Handling Duplicate Data

In [None]:
print('Contains Redundant Records?:', df.duplicated().any())
print('Duplicate Count:', df.duplicated().sum())

- **There are 41 duplicate rows in the dataset.**

In [None]:
df.drop_duplicates(inplace = True)

In [None]:
print('Contains Redundant Records?:', df.duplicated().any())
print('Duplicate Count:', df.duplicated().sum())

- **We have successfully removed the duplicate rows.**

In [None]:
df.info()

- **After dropping the duplicates, the dataset contains 325 rows (366 - 41).**
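
The row count after deduplication follows from simple bookkeeping: rows before minus rows flagged by `duplicated()`. The sketch below checks this on a hypothetical frame (`toy`, invented for illustration).

```python
import pandas as pd

# Hypothetical frame in which rows 1 and 4 repeat earlier rows.
toy = pd.DataFrame({
    "age": [54, 54, 61, 47, 61],
    "target": [1, 1, 0, 0, 0],
})

n_before = len(toy)
# duplicated() marks every repeat of an earlier row, keeping the first occurrence.
n_duplicates = int(toy.duplicated().sum())

deduped = toy.drop_duplicates().reset_index(drop=True)

# Rows remaining = rows before - duplicates removed.
print(n_before, n_duplicates, len(deduped))
```

The same identity can be asserted in the notebook after `drop_duplicates` as a quick sanity check.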

In [None]:
df['chest_pain_type'].unique()

**Observation:**

- We can see that the label **typical angina was misspelled as typical anginia**.
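
One way to catch label typos of this kind systematically is to compare the observed categories against the set listed in the feature table. The sketch below assumes the labels are stored in lowercase exactly as written in that table; `EXPECTED` and `toy` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Categories from the feature description table (lowercase spelling is an assumption).
EXPECTED = {"typical angina", "atypical angina", "non-anginal pain", "asymptomatic"}

# Hypothetical column containing one misspelled label.
toy = pd.Series(["typical anginia", "atypical angina", "asymptomatic", "non-anginal pain"])

# Any observed label outside the expected set is a candidate typo.
unexpected = sorted(set(toy.unique()) - EXPECTED)

# Series.replace with a dict swaps whole values, so other labels are untouched.
cleaned = toy.replace({"typical anginia": "typical angina"})

print(unexpected)
print(cleaned.unique())
```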

In [None]:
df['chest_pain_type'] = df['chest_pain_type'].str.replace(pat = 'typical anginia', repl = 'typical angina')

## Exploratory Data Analysis

**<h4>Question 1: What is the proportion of males and females having heart disease or not?</h4>**

In [None]:
male_data = df[df['sex'] == 1]
female_data = df[df['sex'] == 0]

In [None]:
figure = plt.figure(figsize = (12,7))
plt.subplot(1,2,1)
space = np.ones(2)/10
male_data['target'].value_counts().plot(kind = 'pie', explode = space, fontsize = 14, autopct = '%3.1f%%', wedgeprops = dict(width = 0.15), 
                                    shadow = True, startangle = 160, figsize = [13.66, 7.68], legend = True, labels = ['', ''])
plt.legend()
plt.ylabel('Male', size = 14)

plt.subplot(1,2,2)
space = np.ones(2)/10
female_data['target'].value_counts().plot(kind = 'pie', explode = space, fontsize = 14, autopct = '%3.1f%%', wedgeprops = dict(width=0.15), 
                                    shadow = True, startangle = 160, figsize = [13.66, 7.68], legend = True, labels = ['', ''])
plt.legend(['No Heart Disease', 'Heart Disease'])
plt.ylabel('Female', size = 14)
plt.suptitle('Proportion of Males and Females vs Heart Disease', size = 16)
plt.show()

**Observation:**

- Around **56% of male patients don't have heart disease** while **~44% of male patients have heart disease**.
- Around **74% of female patients don't have heart disease** while **~26% of female patients have heart disease**.
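
The percentages shown in the pie charts can also be produced as a table with `pd.crosstab` and row-wise normalisation. The sketch below uses a hypothetical frame (`toy`; the counts are invented and do not reproduce the figures above).

```python
import pandas as pd

# Hypothetical data: sex (1 = male, 0 = female) and target (1 = heart disease).
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 0, 0, 0, 0],
    "target": [0, 0, 1, 1, 0, 0, 0, 1],
})

# normalize="index" makes each row (each sex) sum to 1; scale to percentages.
proportions = pd.crosstab(toy["sex"], toy["target"], normalize="index") * 100

print(proportions)
```

In the notebook, `pd.crosstab(df['sex'], df['target'], normalize = 'index')` would give the same breakdown without relying on the order of `value_counts()`.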

**<h4>Question 2: What is the proportion of males and females having different types of chest pain?</h4>**

- **Typical Angina**: It is the presence of substernal chest pain or discomfort that was provoked by exertion or emotional stress and was relieved by rest and/or nitroglycerin.
- **Non-Anginal Pain**: It is chest pain that does not have the characteristics of angina. The term is preferred because it avoids vaguer diagnoses such as "atypical chest pain" or "atypical angina."
- **Atypical Angina**: It implies that the complaint is actually angina pectoris, though not conforming in every way to the expected or classic description.
- **Asymptomatic**: It means neither causing nor exhibiting symptoms of disease.