# <b>Software Requirements:</b>
<ul>
<li><b>Python</b> (3.6 or later)</li>
<li><b>Necessary libraries:</b> pandas, numpy, scikit-learn, matplotlib (install using pip install pandas numpy scikit-learn matplotlib)</li></ul>

## <b>Data Acquisition</b>

The data is the publicly available Heart Disease dataset from the OpenML repository (https://www.openml.org/search?type=data&sort=runs&id=43823&status=active).

# <b>1. Import Modules for Data Analysis and Load Data:</b>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# Load the heart disease data
data = pd.read_csv("HEART DISEASE PREDICTION USING LOGISTIC REGRESSION.csv")


<ul>
<li><b>pandas:</b> Used for data manipulation and analysis.</li>
<li><b>scikit-learn:</b> Provides machine learning tools.</li>
<li><b>train_test_split:</b> Splits data into training and testing sets.</li>
<li><b>LogisticRegression:</b> Implements the logistic regression algorithm.</li>
<li><b>HEART DISEASE PREDICTION USING LOGISTIC REGRESSION.csv:</b> The dataset file read by <b>pd.read_csv</b>; replace it with the actual path to your downloaded dataset.</li></ul>

# <b>2. Data Exploration and Preprocessing:</b>
<ul>
    <li>
        <h3>
            <b>
                <i>Get basic information about data:</i>
            </b>
        </h3>
    </li>
</ul>

In [2]:
print(data.head())  # View the first few rows
data.info()  # Check data types and missing values

   Age  Sex  Chest_pain_type   BP  Cholesterol  FBS_over_120  EKG_results  \
0   70    1                4  130          322             0            2   
1   67    0                3  115          564             0            2   
2   57    1                2  124          261             0            0   
3   64    1                4  128          263             0            0   
4   74    0                2  120          269             0            2   

   Max_HR  Exercise_angina  ST_depression  Slope_of_ST  \
0     109                0            2.4            2   
1     160                0            1.6            2   
2     141                0            0.3            1   
3     105                1            0.2            2   
4     121                1            0.2            1   

   Number_of_vessels_fluro  Thallium Heart_Disease  
0                        3         3      Presence  
1                        0         7       Absence  
2                        0   

<ul>
<li><b>head():</b> Displays the first few rows of the data.</li>
<li><b>info():</b> Summarizes each column's data type and non-null count, which reveals missing values.</li>
</ul>

<ul>
    <li>
        <h3>
            <b>
                <i>Handle missing values (imputation or removal of incomplete rows):</i>
            </b>
        </h3>
    </li>
</ul>

In [3]:
data = data.dropna()  # Remove rows with missing values
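
The heading above mentions imputation, while the cell only removes incomplete rows. As a minimal sketch of the imputation alternative, the example below fills gaps with the column median using scikit-learn's `SimpleImputer`. The toy frame and its values are made up for illustration and are not part of the heart disease data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with one gap in each column
toy = pd.DataFrame({
    "Age": [70, 67, np.nan, 64],
    "BP": [130, np.nan, 124, 128],
})

# Replace each missing value with the median of its own column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(toy_imputed)
```

Imputation keeps every row, which matters on small datasets; dropping rows is simpler and is reasonable when only a few rows are incomplete.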

<ul>
    <li>
        <h3>
            <b>
                <i>Encode categorical features:</i>
            </b>
        </h3>
    </li>
</ul>

In [4]:
from sklearn.preprocessing import LabelEncoder

# Encode categorical features (e.g., Heart_Disease)
le = LabelEncoder()
data["Heart_Disease"] = le.fit_transform(data["Heart_Disease"])

<ul>
<li><b>LabelEncoder:</b> Converts categorical labels into numerical values.</li>
<li><b>fit_transform:</b> Fits the encoder to the data and transforms the column.</li>
</ul>
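
To see which integer each label receives, the encoder's `classes_` attribute can be inspected. The sketch below uses a small made-up series with the same two labels as the `Heart_Disease` column; `LabelEncoder` sorts the labels, so the codes follow alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels using the same two categories as Heart_Disease
labels = pd.Series(["Presence", "Absence", "Absence", "Presence"])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ holds the sorted original labels; each label's position is its code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the original labels
print(le_demo.inverse_transform([0, 1]))
```

Knowing the mapping matters later: precision and recall treat the label encoded as 1 as the positive class by default.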

In [5]:
# After encoding, Heart_Disease is an integer column
print(data.head())

   Age  Sex  Chest_pain_type   BP  Cholesterol  FBS_over_120  EKG_results  \
0   70    1                4  130          322             0            2   
1   67    0                3  115          564             0            2   
2   57    1                2  124          261             0            0   
3   64    1                4  128          263             0            0   
4   74    0                2  120          269             0            2   

   Max_HR  Exercise_angina  ST_depression  Slope_of_ST  \
0     109                0            2.4            2   
1     160                0            1.6            2   
2     141                0            0.3            1   
3     105                1            0.2            2   
4     121                1            0.2            1   

   Number_of_vessels_fluro  Thallium  Heart_Disease  
0                        3         3              1  
1                        0         7              0  
2                        0

<ul>
    <li>
        <h3>
            <b>
                <i>Feature scaling: (standardization)</i>
            </b>
        </h3>
    </li>
</ul>

In [6]:
from sklearn.preprocessing import StandardScaler

# Standardize features (consider normalization as well)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data.drop("Heart_Disease", axis=1))

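# Combine scaled features with the target variable.
# Reuse data's index: dropna() can leave gaps, and a fresh 0..n-1 index
# would misalign the target column and introduce NaN values.
feature_cols = data.columns.drop("Heart_Disease")
data_prepared = pd.DataFrame(data_scaled, columns=feature_cols, index=data.index)
data_prepared["Heart_Disease"] = data["Heart_Disease"]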

<ul>
<li><b>StandardScaler:</b> Standardizes features by removing the mean and scaling to unit variance.</li>
<li><b>fit_transform:</b> Fits the scaler and transforms the data.</li>
<li><b>data.drop("Heart_Disease", axis=1):</b> Drops the target variable ("Heart_Disease") from the data so that only the features are scaled.</li>
</ul>
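
Standardization computes z = (x - mean) / std for each column. The sketch below checks this by hand on four rows of two features (the Age and Cholesterol values from the first rows printed earlier); the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two feature columns on very different scales (Age, Cholesterol)
X_demo = np.array([[70.0, 322.0], [67.0, 564.0], [57.0, 261.0], [64.0, 263.0]])

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates the model simply because of its units.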

# <b>3. Split Data into Training and Testing Sets:</b>

In [7]:
X = data_prepared.drop(["Heart_Disease", "Sex"], axis=1)  # Features: drop the target and the "Sex" column (axis=1 drops columns, axis=0 drops rows)
y = data_prepared["Heart_Disease"]  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


<ul>
<li><b>train_test_split:</b> Splits data into training and testing sets.</li>
<li><b>X:</b> Feature matrix.</li>
<li><b>y:</b> Target variable.</li>
<li><b>test_size:</b> Proportion of data for the testing set (20% in this case).</li>
<li><b>random_state:</b> Ensures reproducibility (set a seed value).</li>
</ul>
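
One optional variation, not used in the cell above, is to pass `stratify=y` so that both sets keep the same class balance. The sketch below shows the effect on made-up data; the array names and the 30% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features, 30 positive and 70 negative labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo preserves the 30/70 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratification is most useful when the dataset is small or the classes are imbalanced, since a purely random split can leave the test set with very few positive cases.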

# <b>4. Train the Logistic Regression Model:</b>

In [8]:
model = LogisticRegression()
model.fit(X_train, y_train)


<ul>
<li><b>LogisticRegression:</b> Creates a logistic regression classifier object.</li>
<li><b>fit:</b> Trains the model on the training data.</li>
</ul>
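
In step 2 the scaler is fitted on the full dataset before the split, so the test rows influence the scaling statistics. A common alternative is to wrap the scaler and the classifier in a scikit-learn `Pipeline`, which fits the scaler on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart data; the sample count, feature count, and `max_iter` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 13 features, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only,
# then predict/predict_proba reuse those statistics on the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

proba = pipe.predict_proba(X_te)[:, 1]  # estimated probability of class 1
coefs = pipe.named_steps["logisticregression"].coef_[0]  # one weight per feature

print(proba[:5].round(3))
print(coefs.shape)
```

Because the features are standardized, the magnitudes of the coefficients are roughly comparable across features, which gives a first hint of which inputs the model relies on.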

# <b>5. Evaluate the Model:</b>

In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Accuracy: 0.9259259259259259
Precision: 1.0
Recall: 0.8095238095238095
F1 Score: 0.8947368421052632


<ol type="1">
    <li><b>Import Metrics:</b></li>
    <ul>
        <li>from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score: This line imports the necessary functions from the sklearn.metrics module to calculate various evaluation metrics for classification models.</li>
    </ul>

<li><b>Prediction:</b></li>
<ul>
    <li>y_pred = model.predict(X_test): This line uses the trained logistic regression model (model) to make predictions on the unseen testing data (X_test). The predicted labels are stored in the y_pred variable.</li>
</ul>
<li><b>Evaluation Metrics Calculation:</b></li>
<ul>
    <li>accuracy = accuracy_score(y_test, y_pred): Calculates the accuracy, which is the proportion of correct predictions made by the model.</li>
    <li>precision = precision_score(y_test, y_pred): Calculates the precision, which is the ratio of true positives (correctly predicted positive cases) to all predicted positive cases. It measures how trustworthy a positive prediction is: a precision of 1.0 means the model raised no false alarms.</li>
    <li>recall = recall_score(y_test, y_pred): Calculates the recall, which is the ratio of true positives to all actual positive cases. It measures how well the model finds all the actual positives.</li>
    <li>f1 = f1_score(y_test, y_pred): Calculates the F1 score, which is the harmonic mean of precision and recall. It provides a balance between the two metrics.</li>
</ul>
<li><b>Printing Results:</b></li>
<ul><li>The calculated metrics (accuracy, precision, recall, F1 score) are printed to the console along with descriptive labels.</li></ul>
</ol>
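
The four metrics can also be derived by hand from a confusion matrix. The sketch below uses hypothetical labels with 17 true positives, 4 false negatives, 0 false positives, and 33 true negatives. These counts were chosen because they are consistent with the metrics printed above if the test set has 54 rows; they are not taken from the actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 21 actual positives (17 found, 4 missed), 33 actual negatives
y_true = np.array([1] * 21 + [0] * 33)
y_hat = np.array([1] * 17 + [0] * 4 + [0] * 33)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
print(np.isclose(f1, f1_score(y_true, y_hat)))
```

Read this way, the results say the model raised no false alarms (precision 1.0) but missed roughly one in five actual cases (recall about 0.81), which is the more costly kind of error in a screening setting.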