# **Heart Failure Prediction Dataset Assignment** (*200 Marks*)

---

> **Important Note:**  
> All code in this assignment must be **clean, readable, and well-formatted**. This includes:
>
> - Clear and meaningful variable names
> - Consistent indentation and spacing
> - Proper use of comments to explain logic
> - Organized code blocks in separate cells
> - Avoidance of redundant or repeated code
>
> Submissions with poorly written or unreadable code may result in **mark deductions**, even if the logic is correct.


## **About Dataset**

## **Context**
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease, or who are at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease), need early detection and management, where a machine learning model can be of great help.

## **Attribute Information**
* Age: age of the patient [years]
* Sex: sex of the patient [M: Male, F: Female]
* ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* RestingBP: resting blood pressure [mm Hg]
* Cholesterol: serum cholesterol [mg/dl]
* FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
* ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
* Oldpeak: ST depression induced by exercise relative to rest [numeric value]
* ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* HeartDisease: output class [1: heart disease, 0: Normal]

## **Objective:**

The objective of this assignment is to analyze a heart disease dataset and build predictive models that classify whether a patient is likely to have heart disease, based on clinical and demographic attributes.

```python
# Mount Google Drive so the dataset file can be read from Colab
from google.colab import drive
drive.mount('/content/drive')
```

### **1. Load the Data** — *[5 Marks]*
- Load dataset using pandas
- Display first few rows
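
As a rough illustration of this step, the sketch below reads CSV data with `pandas.read_csv` and shows the first rows. The in-memory `csv_source` and its three illustrative rows are stand-ins (assumptions), since the actual file path is not given in this assignment; in the notebook you would pass the path to your own copy of the dataset instead.

```python
import io

import pandas as pd

# Stand-in for the real file: a few illustrative rows using the documented
# column names. Replace `csv_source` with the path to your dataset file.
csv_source = io.StringIO(
    "Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,"
    "MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease\n"
    "40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0\n"
    "49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1\n"
    "37,M,ATA,130,283,0,ST,98,N,0.0,Up,0\n"
)

heart_df = pd.read_csv(csv_source)

# Display the first few rows to confirm the file parsed as expected.
print(heart_df.head())
```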

---


### **2. Data Inspection** — *[10 Marks]*
- Shape of data
- Info (data types, nulls)
- Value counts for each column
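
One possible way to cover these checks is sketched below on a small hypothetical frame (`sample_df` and its values are made up for illustration). The same calls apply unchanged to the full dataset.

```python
import pandas as pd

# Hypothetical sample with one missing value, for illustration only.
sample_df = pd.DataFrame({
    "Age": [40, 49, 37, 54],
    "Sex": ["M", "F", "M", "M"],
    "Cholesterol": [289.0, 180.0, None, 195.0],
    "HeartDisease": [0, 1, 0, 1],
})

# Shape: number of rows and columns.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Info: data types and non-null counts per column.
sample_df.info()

# Explicit count of missing values per column.
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)

# Value counts for a single column; repeat for each column of interest.
sex_counts = sample_df["Sex"].value_counts()
print(sex_counts)
```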

---


### **3. Data Cleaning** — *[15 Marks]*
- Handle missing values (if any)
- Encode categorical variables
- Convert data types if required
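
The following is a minimal sketch of the three cleaning steps, under assumptions: median imputation, a binary mapping for `Sex`, and one-hot encoding for `ChestPainType` are just one reasonable set of choices, and `raw_df` is a made-up frame. Other strategies (dropping rows, label encoding) may suit the real data better.

```python
import pandas as pd

# Hypothetical raw data: a missing value and a numeric column stored as text.
raw_df = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "Cholesterol": [289.0, None, 283.0, 214.0],
    "FastingBS": ["0", "1", "0", "0"],
})

clean_df = raw_df.copy()

# 1. Handle missing values: fill with the column median (an assumed strategy).
cholesterol_median = clean_df["Cholesterol"].median()
clean_df["Cholesterol"] = clean_df["Cholesterol"].fillna(cholesterol_median)

# 2. Convert data types: FastingBS arrives as text here, so cast it to int.
clean_df["FastingBS"] = clean_df["FastingBS"].astype(int)

# 3. Encode categorical variables: binary map for Sex, one-hot for the rest.
clean_df["Sex"] = clean_df["Sex"].map({"M": 1, "F": 0})
clean_df = pd.get_dummies(clean_df, columns=["ChestPainType"], dtype=int)

print(clean_df)
```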

---



### **4. Outlier Detection & Treatment** — *[10 Marks]*
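
The assignment does not prescribe a method for this step. One common option is the interquartile range (IQR) rule, sketched here on made-up `RestingBP` values. The 1.5 multiplier and the choice to cap (rather than drop) outliers are assumptions you should justify for your own analysis.

```python
import pandas as pd

# Hypothetical readings with one suspiciously high value.
resting_bp = pd.Series([120, 130, 125, 140, 135, 128, 132, 200], name="RestingBP")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1 = resting_bp.quantile(0.25)
q3 = resting_bp.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outlier_mask = (resting_bp < lower_bound) | (resting_bp > upper_bound)
print("Outliers:", resting_bp[outlier_mask].tolist())

# Treatment option: cap values at the bounds instead of removing rows.
resting_bp_capped = resting_bp.clip(lower=lower_bound, upper=upper_bound)
print(resting_bp_capped.tolist())
```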

### **5. Data Description** — *[10 Marks]*
- Describe numerical features (mean, std, min, max)
- Unique values for categorical variables
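
A short sketch of both bullet points, using a hypothetical frame `description_df`: `describe()` supplies the numerical summary, and non-numeric columns are treated as categorical (an assumption that holds for this dataset's text columns).

```python
import pandas as pd

# Hypothetical sample for illustration.
description_df = pd.DataFrame({
    "Age": [40, 49, 37, 54],
    "MaxHR": [172, 156, 98, 122],
    "ST_Slope": ["Up", "Flat", "Up", "Flat"],
})

# Numerical summary, restricted to the statistics the task asks for.
numeric_summary = description_df.describe().loc[["mean", "std", "min", "max"]]
print(numeric_summary)

# Unique values for every non-numeric (categorical) column.
categorical_columns = description_df.select_dtypes(exclude="number").columns
unique_values = {
    column: sorted(description_df[column].unique())
    for column in categorical_columns
}
print(unique_values)
```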

---



### **6. Univariate Analysis** — *[20 Marks]*
Analyze **each column individually**, one by one. Do not use a loop or a shared helper function to process all columns at once. Include:
- Histograms for numerical variables
- Bar plots for categorical
- Comments on distributions
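
To illustrate the expected shape of one numerical and one categorical analysis, here is a hedged sketch with made-up data (`univariate_df`). Each column gets its own figure, written out individually rather than in a loop. The `Agg` backend line is only there so the example runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample for illustration.
univariate_df = pd.DataFrame({
    "Age": [40, 49, 37, 54, 48, 39],
    "ChestPainType": ["ASY", "ATA", "ASY", "NAP", "ASY", "ATA"],
})

# Numerical column: histogram of Age.
fig_age, ax_age = plt.subplots()
age_counts, age_bin_edges, _ = ax_age.hist(
    univariate_df["Age"], bins=5, edgecolor="black"
)
ax_age.set_title("Distribution of Age")
ax_age.set_xlabel("Age [years]")
ax_age.set_ylabel("Number of patients")

# Categorical column: bar plot of ChestPainType frequencies.
chest_pain_counts = univariate_df["ChestPainType"].value_counts()
fig_cp, ax_cp = plt.subplots()
ax_cp.bar(chest_pain_counts.index.tolist(), chest_pain_counts.to_numpy())
ax_cp.set_title("Chest Pain Type Frequency")
ax_cp.set_xlabel("Chest pain type")
ax_cp.set_ylabel("Number of patients")
```

After each plot, add a short written comment on what the distribution shows, such as skew, dominant categories, or implausible values.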

---


### **7. Bivariate Analysis** — *[20 Marks]*
Analyze the relationship of **each independent variable** with the target variable `HeartDisease`.
- Box plots, violin plots, groupby means
- Separate plots/analysis for each column
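
The groupby-means part of this task could look like the sketch below, which uses a hypothetical frame `bivariate_df`. For a numerical feature it compares the mean per target class; for a categorical feature it computes the share of patients with heart disease in each category.

```python
import pandas as pd

# Hypothetical sample for illustration.
bivariate_df = pd.DataFrame({
    "MaxHR": [172, 156, 98, 108, 124, 170],
    "ExerciseAngina": ["N", "N", "Y", "Y", "N", "N"],
    "HeartDisease": [0, 0, 1, 1, 1, 0],
})

# Numerical feature vs target: mean MaxHR in each HeartDisease class.
mean_maxhr_by_target = bivariate_df.groupby("HeartDisease")["MaxHR"].mean()
print(mean_maxhr_by_target)

# Categorical feature vs target: the mean of a 0/1 target within a group
# is the proportion of that group with heart disease.
disease_rate_by_angina = bivariate_df.groupby("ExerciseAngina")["HeartDisease"].mean()
print(disease_rate_by_angina)
```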

---


### **8. Multivariate Analysis** — *[10 Marks]*
- Pairplot
- Interactions between 2+ variables
- Comments on how combinations impact heart disease
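
For the interaction bullet, one simple approach is a pivot table of heart disease rate across two categorical features at once. This is only a sketch with made-up data (`interaction_df`); the pair of features chosen here is an arbitrary example.

```python
import pandas as pd

# Hypothetical sample for illustration.
interaction_df = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "M", "F", "M", "F"],
    "ExerciseAngina": ["Y", "N", "Y", "N", "Y", "N", "N", "Y"],
    "HeartDisease": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Heart disease rate for every Sex x ExerciseAngina combination.
disease_rate_table = interaction_df.pivot_table(
    index="Sex",
    columns="ExerciseAngina",
    values="HeartDisease",
    aggfunc="mean",
)
print(disease_rate_table)
```

Reading across a row or down a column of the table shows whether one feature changes the effect of the other, which is the kind of observation the written comments should capture.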

---


### **9. Heatmap - Correlation Matrix** — *[10 Marks]*
- Correlation matrix
- Use `seaborn.heatmap()`
- Identify top correlations with target
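
A minimal sketch of these three steps, assuming all columns are already numeric (that is, categorical features have been encoded) and using a hypothetical frame `numeric_df`. The colour map and annotation format are stylistic choices, not requirements.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric sample for illustration.
numeric_df = pd.DataFrame({
    "Age": [40, 49, 37, 54, 60, 65],
    "MaxHR": [172, 156, 170, 122, 110, 98],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 2.0, 2.5],
    "HeartDisease": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation between every pair of columns.
corr_matrix = numeric_df.corr()

# Heatmap with a fixed -1..1 colour scale so colours are comparable.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Matrix")

# Rank features by absolute correlation with the target.
target_corr = (
    corr_matrix["HeartDisease"]
    .drop("HeartDisease")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```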

---



### **10. Model Building** — *[50 Marks]*
Build the following classification models:
- Logistic Regression
- Naive Bayes (choose appropriate types)
- K-Nearest Neighbors (KNN)
- Decision Tree
- Support Vector Machine (SVM)
- Random Forest
- Bagging Classifier
- Boosting Algorithms:
  - AdaBoost
  - Gradient Boosting (GBM)
  - XGBoost
- Stacking Ensemble
- Voting Classifier
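
The sketch below shows one way to organise several of these models so that fitting and scoring share the same code. It is deliberately partial: synthetic data from `make_classification` stands in for the cleaned dataset, hyperparameters are arbitrary defaults, and only a subset of the required models is included. Bagging, the boosting algorithms, and stacking can be added to the same dictionary following the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned, fully numeric feature matrix and target.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Gaussian Naive Bayes": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=42)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Soft voting averages predicted probabilities of its base estimators.
models["Soft Voting"] = VotingClassifier(
    estimators=[
        ("lr", models["Logistic Regression"]),
        ("nb", models["Gaussian Naive Bayes"]),
        ("rf", models["Random Forest"]),
    ],
    voting="soft",
)

# Fit every model on the training split and record test accuracy.
test_accuracy = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[model_name] = model.score(X_test, y_test)
    print(f"{model_name}: {test_accuracy[model_name]:.3f}")
```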

---



### **11. Model Evaluation** — *[20 Marks]*
- Accuracy, Precision, Recall, F1-Score
- Confusion Matrix
- ROC-AUC Curve (where applicable)
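
As a worked example of these metrics, the sketch below evaluates hand-written labels and scores (`y_true` and `y_score` are made up, and the 0.5 threshold is an assumption). With a real model, `y_score` would typically come from `predict_proba` and the hard predictions from `predict`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth and predicted probabilities for class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.55, 0.4, 0.35])

# Hard predictions at an assumed decision threshold of 0.5.
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
conf_matrix = confusion_matrix(y_true, y_pred)

# ROC-AUC uses the scores, not the thresholded predictions.
roc_auc = roc_auc_score(y_true, y_score)

print(f"Accuracy:  {accuracy:.3f}")   # 0.700
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.667
print(f"F1-score:  {f1:.3f}")         # 0.727
print(f"ROC-AUC:   {roc_auc:.3f}")    # 0.875
print(conf_matrix)
```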

---


### **12. Interpretation of Metrics** — *[10 Marks]*
- Explain what the evaluation metrics mean
- Discuss trade-offs (e.g., precision vs recall)
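
The precision vs recall trade-off can be made concrete by varying the decision threshold, as in this sketch with made-up labels and scores. The three thresholds are arbitrary illustration points.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predicted probabilities for class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.55, 0.4, 0.35])

# Record (precision, recall) at each threshold.
tradeoff = {}
for threshold in (0.3, 0.5, 0.65):
    y_pred = (y_score >= threshold).astype(int)
    tradeoff[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    precision, recall = tradeoff[threshold]
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

In this toy example, raising the threshold from 0.3 to 0.65 lifts precision from 0.75 to 1.0 while recall falls from 1.0 to 0.5. For a screening task like heart disease prediction, a missed case (false negative) is often considered costlier than a false alarm, which would argue for favouring recall, though that judgement should be stated and defended in your write-up.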

---



### **13. Final Conclusion** — *[10 Marks]*
- Summarize findings
- Which model performed best?
- Possible improvements

---

**Total: 200 Marks**
