
### Heart Disease Prediction: End-to-End Machine Learning Project  

This notebook walks through building a machine learning model that predicts whether a patient has heart disease from clinical data. Using Python and common data science libraries (pandas, NumPy, Matplotlib, seaborn, and scikit-learn), we work through each step of the process in order.

---

### **Roadmap**  
1. **Problem Definition**: Defining the goal and framing the prediction task.  
2. **Data Source**: Identifying where the dataset comes from.  
3. **Evaluation Metric**: Setting the performance benchmark used to judge model success.  
4. **Features**: Reviewing the variables in the dataset and their relevance to the prediction task.  
5. **Modeling**: Training, testing, and refining machine learning models.  
6. **Experimentation**: Testing alternative approaches to improve performance.  

---

### **1. Problem Definition**  
The task is to predict whether a patient has heart disease based on their clinical health parameters. This is a binary classification problem: each patient is assigned to one of two classes, heart disease or no heart disease.  

**Objective Statement**:  
Given a set of clinical parameters about a patient, determine whether they are likely to have heart disease (`yes` or `no`).  

---

### **2. Data Source**  
The data is the Heart Disease dataset from the **UCI Machine Learning Repository**, a public collection of datasets widely used in machine learning.  

---

### **3. Evaluation Metric**  
The project will be considered a successful proof of concept if:  
- The model reaches at least **80% accuracy** at predicting heart disease during the initial evaluation phase.  
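
As a rough sketch of how the 80% target could be checked, accuracy is the fraction of predictions that match the true labels. The names `TARGET_ACCURACY`, `accuracy`, and `meets_target` are hypothetical helpers introduced here for illustration, and the labels are made up, not taken from the dataset.

```python
import numpy as np

TARGET_ACCURACY = 0.80  # proof-of-concept threshold stated above


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def meets_target(y_true, y_pred, target=TARGET_ACCURACY):
    """Return True when accuracy reaches the target (inclusive)."""
    return accuracy(y_true, y_pred) >= target


# Made-up labels for illustration only (1 = heart disease, 0 = no heart disease)
example_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
example_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

example_acc = accuracy(example_true, example_pred)
print(f"accuracy = {example_acc:.2f}")
print(f"meets target: {meets_target(example_true, example_pred)}")
```

In practice, `y_true` and `y_pred` would come from a held-out test split rather than hand-written lists.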

---

### **4. Features**  
Here’s a breakdown of the dataset’s features and their roles:  

| Variable Name | Role        | Type         | Description                                                | Units      | Missing Values |
|---------------|-------------|--------------|------------------------------------------------------------|------------|----------------|
| **age**       | Feature     | Integer      | Age of the patient                                         | years      | No             |
| **sex**       | Feature     | Categorical  | Sex of the patient                                         | -          | No             |
| **cp**        | Feature     | Categorical  | Chest pain type                                            | -          | No             |
| **trestbps**  | Feature     | Integer      | Resting blood pressure (on hospital admission)            | mm Hg      | No             |
| **chol**      | Feature     | Integer      | Serum cholesterol level                                    | mg/dl      | No             |
| **fbs**       | Feature     | Categorical  | Fasting blood sugar > 120 mg/dl                            | -          | No             |
| **restecg**   | Feature     | Categorical  | Resting electrocardiographic results                      | -          | No             |
| **thalach**   | Feature     | Integer      | Maximum heart rate achieved                                | bpm        | No             |
| **exang**     | Feature     | Categorical  | Exercise-induced angina                                    | -          | No             |
| **oldpeak**   | Feature     | Float        | ST depression induced by exercise relative to rest         | -          | No             |
| **slope**     | Feature     | Categorical  | Slope of the peak exercise ST segment                     | -          | No             |
| **ca**        | Feature     | Integer      | Number of major vessels (0–3) colored by fluoroscopy       | -          | Yes            |
| **thal**      | Feature     | Categorical  | Thallium stress test result                                | -          | Yes            |
| **num**       | Target      | Integer      | Indicates presence of heart disease (1 = yes, 0 = no)      | -          | No             |  
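
The table flags `ca` and `thal` as the only columns with missing values. A minimal sketch of how that could be verified with pandas follows. It assumes the raw file marks missing entries with `?`; the three rows are made up for illustration and are not real patient records.

```python
import io

import pandas as pd

# Three made-up rows in the column layout of the table above.
# Assumption: "?" is the missing-value marker in the raw file.
raw_csv = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num\n"
    "60,1,1,140,230,1,2,150,0,2.0,3,0,6,0\n"
    "50,0,3,130,220,0,0,170,0,0.0,1,?,3,0\n"
    "40,1,3,120,180,0,0,175,0,0.0,1,0,?,1\n"
)

# na_values tells pandas to read "?" as NaN instead of a string
df_example = pd.read_csv(raw_csv, na_values="?")

# Count missing values per column and show only the affected columns
missing_counts = df_example.isna().sum()
print(missing_counts[missing_counts > 0])

# One simple option among several: drop rows with any missing value
df_complete = df_example.dropna()
print(f"rows before: {len(df_example)}, rows after: {len(df_complete)}")
```

Dropping rows is only one way to handle the gaps; imputing a value is another, and the better choice depends on how many rows are affected.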

--- 





## Preparing the Tools

We use pandas and NumPy for data manipulation and analysis, Matplotlib and seaborn for plotting, and scikit-learn for modeling and evaluation.

In [2]:
# Import all the tools we need

# Regular EDA and plotting libraries 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

%matplotlib inline 

# Models from Scikit-Learn 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay
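
To show how these imports fit together, here is a minimal sketch that splits data and scores the three classifiers. It runs on synthetic data from `make_classification` as a stand-in, so the scores say nothing about the heart disease dataset. The split size, `random_state` values, and `max_iter` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 13 features and a binary target (not real patients)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training split and record test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers

print(scores)
```

Keeping the models in a dictionary makes it easy to compare them side by side against the 80% accuracy target.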
