## PREDICTING HEART DISEASE USING AN ENSEMBLE TECHNIQUE

### Introduction

The dataset dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The "target" field indicates the presence of heart disease in the patient. In the processed Cleveland file loaded below, it is an integer from 0 (no disease) to 4; the binary form of the task treats 0 as no disease and any value from 1 to 4 as disease.

Features:
- age: Age of the patient
- sex: Gender (1 = male, 0 = female)
- cp: Chest pain type
- trestbps: Resting blood pressure (mm Hg)
- chol: Serum cholesterol (mg/dl)
- fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- restecg: Resting electrocardiographic results
- thalach: Maximum heart rate achieved
- exang: Exercise-induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: Slope of the peak exercise ST segment
- ca: Number of major vessels (0-3) colored by fluoroscopy
- thal: Thalassemia

### Objective

Predict the presence of heart disease based on various health-related features.

### Step 1: Loading and Exploring the Data

In [1]:
import pandas as pd

# Load the Heart Disease UCI dataset (processed Cleveland file)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
heart_data = pd.read_csv(url, names=names, na_values="?")

# Display basic information about the dataset
print(heart_data.info())
print(heart_data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
 13  target    303 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 33.3 KB
None
    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3   
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0   

### Step 2: Data Preprocessing


In [2]:
# Handle missing values by dropping incomplete rows (ca and thal contain NaNs)
heart_data = heart_data.dropna()
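
Before dropping rows, it can help to check how many values are missing and how many rows the drop removes. The `info()` output above shows `ca` with 299 and `thal` with 301 non-null values out of 303. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for heart_data; the values are illustrative only.
toy = pd.DataFrame({
    "age": [63.0, 67.0, 41.0, 56.0],
    "ca": [0.0, np.nan, 2.0, 1.0],
    "thal": [6.0, 3.0, np.nan, 7.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed.
cleaned = toy.dropna()
rows_removed = len(toy) - len(cleaned)
print(f"Rows removed: {rows_removed}")
```

With only a handful of incomplete rows, dropping them is a reasonable choice; imputation would be an alternative if many more rows were affected.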

In [3]:
# Convert categorical variables to indicator columns using one-hot encoding
heart_data = pd.get_dummies(heart_data, columns=["sex", "cp", "fbs", "restecg", "exang", "slope", "thal"])
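
To illustrate what one-hot encoding produces, here is a small sketch on a made-up frame (`toy` is an assumption for illustration). Because the real columns are loaded as floats, the sketch casts the category to integers first so the generated column names are cleaner; that cast is an optional extra, not part of the original cell.

```python
import pandas as pd

# Toy frame standing in for heart_data; the values are illustrative only.
toy = pd.DataFrame({"age": [63.0, 67.0, 41.0], "cp": [1.0, 4.0, 2.0]})

# Optional: cast the category codes to int so the dummy columns are
# named from integer codes rather than float codes.
toy["cp"] = toy["cp"].astype(int)

# Each distinct value of cp becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["cp"])
print(encoded.columns.tolist())
print(encoded)
```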

In [4]:
# Separate features and target variable
X = heart_data.drop("target", axis=1)
y = heart_data["target"]
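
The processed Cleveland file stores the target as an integer from 0 to 4, which is why the classification report later in this document lists five classes. If the goal is the binary presence-or-absence task described in the introduction, the target can be collapsed before training. A minimal sketch on a made-up frame (`toy` and the `target_binary` column name are assumptions for illustration):

```python
import pandas as pd

# Toy frame standing in for heart_data; the values are illustrative only.
toy = pd.DataFrame({"age": [63, 67, 41, 56, 60], "target": [0, 2, 1, 0, 4]})

# Any value above 0 indicates disease, so map 1-4 to 1 and keep 0 as 0.
toy["target_binary"] = (toy["target"] > 0).astype(int)

print(toy[["target", "target_binary"]])
```

Applied to the notebook's data, the equivalent step would overwrite `heart_data["target"]` before `X` and `y` are separated. Results after such a change would differ from the output shown in this document.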

### Step 3: Splitting the Data

In [5]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
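
The split above is random with respect to the target. Since the classification report later in this document shows an imbalanced test set (36 of 60 samples in class 0), a stratified split is one option for keeping class proportions the same in both sets. The sketch below uses made-up data (`X_toy` and `y_toy` are assumptions) and the `stratify` parameter of `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 negatives and 20 positives.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```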

### Step 4: Building and Training the Random Forest Classifier

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create a Random Forest classifier with 100 trees
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

RandomForestClassifier(random_state=42)
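
A single 80/20 split on roughly 300 rows gives a noisy estimate of performance. One common complement is k-fold cross-validation. The sketch below is an illustration on synthetic data from `make_classification` (an assumption, not the heart dataset), using the same Random Forest settings as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for X and y.
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation; for classifiers an integer cv uses stratified folds.
scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```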

### Step 6: Making Predictions and Evaluation

In [7]:
# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)

# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Display additional metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 60.00%

Classification Report:
              precision    recall  f1-score   support

           0       0.77      1.00      0.87        36
           1       0.00      0.00      0.00         9
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.00      0.00      0.00         3

    accuracy                           0.60        60
   macro avg       0.15      0.20      0.17        60
weighted avg       0.46      0.60      0.52        60
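
The report above shows zero precision and recall for classes 1 through 4. A confusion matrix makes this kind of failure easier to inspect, because it shows which class each sample was assigned to. The sketch below uses small made-up label lists (`y_true_toy` and `y_pred_toy` are assumptions, not the notebook's predictions):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels: the model mostly predicts class 0.
y_true_toy = [0, 0, 0, 1, 2, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])
print(cm)

# zero_division=0 reports 0.0 for classes that were never predicted
# instead of emitting a warning.
print(classification_report(y_true_toy, y_pred_toy, zero_division=0))
```

A column of zeros in the matrix means the class was never predicted, which is the pattern that produces 0.00 precision in a classification report.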



### Step 7: Creating a Python script named 'app.py'

In the script:

- Load the trained model.
- Use Streamlit to create a simple web app with a sidebar for the user's input features.
- Collect the input features using sliders and select boxes.
- Make a prediction with the trained model and display the result.
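
The non-UI part of these steps can be sketched as follows. This is not the shared `app.py`; it is a minimal illustration, under stated assumptions, of saving a model with `joblib`, reloading it, and aligning one row of user input with the one-hot columns seen during training. The file name `heart_model.joblib`, the toy data, and the bundle layout are all hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training data with one-hot columns; stands in for X_train / y_train.
X_toy = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "cp_1": [1, 0, 0, 1],
    "cp_4": [0, 1, 1, 0],
})
y_toy = [0, 1, 1, 0]
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Persist the model together with the column order it was trained on.
# The file name is hypothetical.
model_path = os.path.join(tempfile.mkdtemp(), "heart_model.joblib")
joblib.dump({"model": model, "columns": list(X_toy.columns)}, model_path)

# In the app: reload the bundle, encode the user's input the same way,
# and add any indicator columns that this single row did not produce.
bundle = joblib.load(model_path)
user_input = pd.DataFrame({"age": [55], "cp": [4]})
user_row = pd.get_dummies(user_input, columns=["cp"])
user_row = user_row.reindex(columns=bundle["columns"], fill_value=0).astype(float)

prediction = bundle["model"].predict(user_row)
print(prediction)
```

Storing the column list next to the model is one way to avoid a mismatch between the training columns and the columns built from a single user-supplied row.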

A copy of the file has also been shared with you along with the assignment.

![image.png](attachment:image.png)

### Step 8: Installing Streamlit

We install the Streamlit library using the following command:

- pip install streamlit

![image.png](attachment:image.png)

### Step 9: Changing the Directory to the Directory where the '.py' file is Saved

We saved the file on the Desktop, so we changed the terminal's working directory to the Desktop folder.

![image.png](attachment:image.png)

### Step 10: Running the Streamlit app locally

We run the following command in the terminal:
- streamlit run app.py

The output is shown below.

![image.png](attachment:image.png)

### Step 11: Launching a Local Server

The Streamlit app is now available at the following local address:
- http://localhost:8501

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## CONCLUSION

This assignment built a machine learning model using an ensemble technique, specifically a Random Forest classifier. The case study focused on predicting heart disease using a real-world dataset, the "Heart Disease UCI" dataset. The workflow covered data loading, exploration, preprocessing, model training, and evaluation.

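The Random Forest reached 60% accuracy on the 60-sample test set, but the classification report shows that every correct prediction was for class 0: precision and recall are 0.00 for classes 1 through 4. The model therefore does not yet detect heart disease reliably, and the five-valued target and the class imbalance should be addressed before relying on it. The Streamlit app still provides a practical, interactive way for users to query the trained model.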