# Logistic Regression

Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report, confusion_matrix, roc_curve,
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import streamlit as st
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## 1. Data Exploration

### a. Load the dataset

In [None]:
df = pd.read_csv("diabetes.csv")
df.head()

### b. Examine features, data types, and summary statistics

In [None]:
df.shape

In [None]:
df.info()
df.describe().T

In [None]:
df.isnull().sum() # No NaN values; zeros in some columns stand in for missing data (see Data Preprocessing)

### c. Visualizations

In [None]:
# Histogram.
df.hist(bins=30, figsize=(12,8))
plt.suptitle('Feature Distributions')
plt.show()

In [None]:
# Boxplot (to check for outliers)
plt.figure(figsize=(10,6))
sns.boxplot(data=df)
plt.xticks(rotation=45)
plt.show()

In [None]:
# Correlation Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Pairplot (with hue on Outcome) - a quick way to see relationships
sns.pairplot(df,
             vars=['Glucose','BMI','Age','BloodPressure','Insulin'],
             hue='Outcome',
             plot_kws={'alpha':0.5},
             diag_kind='kde')
plt.show()

### d. Pattern & Correlation Analysis

* Glucose and BMI show strong positive correlation with Outcome
* Presence of outliers in Insulin and SkinThickness
* Target variable: Outcome (0 = Non-diabetic, 1 = Diabetic)
* Some features contain 0 values, which are biologically invalid (treated as missing)

## 2. Data Preprocessing 

### a. Handle missing values

Columns where 0 is invalid:

* Glucose, BloodPressure, SkinThickness, Insulin, BMI

In [None]:
cols_with_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_zero] = df[cols_with_zero].replace(0, np.nan)

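# Median imputation. Assign the result back: fillna(inplace=True) on a column
# selection raises a FutureWarning in pandas 2.x and does not update df under Copy-on-Write.
for col in cols_with_zero:
    df[col] = df[col].fillna(df[col].median())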
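The loop above computes each median over the full dataset, so rows that later land in the test split influence the fill values. A minimal sketch of a leakage-free alternative, assuming scikit-learn's `SimpleImputer` is fitted on the training rows only. The two small frames `toy_train` and `toy_test` are made-up stand-ins, not rows from `diabetes.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for two of the zero-coded columns.
toy_train = pd.DataFrame({"Glucose": [100.0, 0.0, 140.0, 120.0],
                          "BMI": [30.0, 25.0, 0.0, 35.0]})
toy_test = pd.DataFrame({"Glucose": [0.0, 90.0],
                         "BMI": [28.0, 0.0]})

# Treat 0 as missing, mirroring the replace(0, np.nan) step above.
toy_train = toy_train.replace(0, np.nan)
toy_test = toy_test.replace(0, np.nan)

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(toy_train)  # medians learned from training rows only
test_filled = imputer.transform(toy_test)        # the same medians are reused on test rows

print(imputer.statistics_)  # per-column medians learned from toy_train
print(test_filled)
```

The test rows are filled with the training medians, so nothing about the test set feeds back into preprocessing.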

### b. Encode Categorical Variables

* The diabetes dataset does not contain any categorical features. All input variables are numeric, and the target variable `Outcome` is already binary encoded (0 and 1).

* Therefore, no categorical encoding such as Label Encoding or One-Hot Encoding is required.

### c. Feature / Target Split and Scaling

In [None]:
X = df.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
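The scaler is fitted on the training split and only applied to the test split. A small sketch of what that means numerically, using made-up one-column arrays (`toy_train`, `toy_test` and `toy_scaler` are illustrative names): the test value is standardized with the training mean and the training standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, chosen only to keep the arithmetic visible.
toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_test = np.array([[5.0]])

toy_scaler = StandardScaler().fit(toy_train)
scaled_by_sklearn = toy_scaler.transform(toy_test)

# Same calculation by hand: z = (x - training mean) / training std (population std, ddof=0).
scaled_by_hand = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(scaled_by_sklearn, scaled_by_hand)
```

Both results agree, which confirms that `transform` reuses the statistics learned in `fit` instead of recomputing them on the new data.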

## 3. Model Building

In [None]:
model = LogisticRegression(max_iter=1000) # building the logistic regression model
model.fit(X_train_scaled, y_train) # training the model on the training data

## 4. Model Evaluation

### a. Performance Metrics

In [None]:
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:,1]

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1-score:', f1_score(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, y_prob))

print('\nClassification Report:\n')
print(classification_report(y_test, y_pred))

In [None]:
# Confusion matrix (rows = actual class, columns = predicted class)
confusion_matrix(y_test, y_pred)
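The metrics printed earlier can all be read off the confusion matrix. A short sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not the notebook's predictions) that unpacks the four cells and recomputes precision, recall and accuracy by hand:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, chosen only to make the counts easy to follow.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)                   # share of predicted positives that are correct
recall_manual = tp / (tp + fn)                      # share of actual positives that were found
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct

print(tn, fp, fn, tp)
print(precision_manual, recall_manual, accuracy_manual)
```

The hand-computed values match `precision_score` and `recall_score` on the same labels.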

### b. ROC Curve

In [None]:
fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0,1], [0,1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

## 5. Interpretation of Coefficients

### a. Coefficient Interpretation

In [None]:
coeff_df = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_[0],
    "Odds_Ratio": np.exp(model.coef_[0])
}).sort_values(by="Odds_Ratio", ascending=False)
coeff_df

### b. Significance of Features in Predicting Diabetes

#### Explanation:

- A positive coefficient indicates that an increase in the feature value increases the likelihood of a patient being diabetic.

- A negative coefficient indicates that an increase in the feature value decreases the likelihood of diabetes.

- The magnitude of the coefficient reflects the strength of the feature's influence on the prediction, assuming all other features remain constant.

Since the features were standardized before training, the coefficients can be directly compared to understand relative importance.

#### Insights:

- Glucose - strongest predictor

- BMI, Age & Pregnancies - moderate impact

Features with smaller or near-zero coefficients contribute less to the prediction and have a weaker relationship with the target variable.
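Because the model was trained on standardized features, each coefficient describes a change of one standard deviation, not one raw unit. A small worked sketch of how an odds ratio could be converted between the two scales. The coefficient and standard deviation below are hypothetical numbers, not values from the fitted model.

```python
import math

beta_std = 1.1      # hypothetical coefficient for a standardized feature
feature_sd = 30.0   # hypothetical standard deviation of the raw feature

# Odds multiplier for a one-standard-deviation increase.
odds_ratio_per_sd = math.exp(beta_std)

# Odds multiplier for a one-unit increase on the raw scale: beta / sd per unit.
odds_ratio_per_unit = math.exp(beta_std / feature_sd)

# Odds multiplier for a ten-unit increase on the raw scale.
odds_ratio_per_10_units = math.exp(10 * beta_std / feature_sd)

print(odds_ratio_per_sd, odds_ratio_per_unit, odds_ratio_per_10_units)
```

Applying the per-unit multiplier `feature_sd` times gives back the per-standard-deviation multiplier, which is why the two views are consistent.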

## 6. Streamlit Deployment

### a. Save Model and Scaler

In [None]:
joblib.dump(model, "logistic_model.pkl")
joblib.dump(scaler, "scaler.pkl")
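Only the model and the scaler are saved here, while the app in the next step also looks for an `imputer.pkl`. One possible way to avoid juggling several files is to bundle imputation, scaling and the classifier into a single `Pipeline` and persist that. The sketch below assumes synthetic data from `make_classification` and a hypothetical file name `diabetes_pipeline.pkl`; it is not the notebook's actual training run.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight diabetes features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

full_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
full_pipeline.fit(X_demo, y_demo)

# Save and reload in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "diabetes_pipeline.pkl")  # hypothetical file name
    joblib.dump(full_pipeline, path)
    restored = joblib.load(path)

same_predictions = bool((restored.predict(X_demo) == full_pipeline.predict(X_demo)).all())
print(same_predictions)
```

With this layout, the app would need a single `joblib.load` call and could pass raw inputs straight to `predict_proba`.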

### b. Streamlit App (app.py)

In [None]:
import streamlit as st
import pandas as pd
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

# ------------------------------------------------------------------
# 1. Load the artefacts you created with train_and_save.py
# ------------------------------------------------------------------
model   = joblib.load("logistic_model.pkl")   # LogisticRegression
scaler  = joblib.load("scaler.pkl")          # StandardScaler
# The imputer was saved in the training script; if it is missing we skip it
try:
    imputer = joblib.load("imputer.pkl")    # SimpleImputer (median)
    pipeline = make_pipeline(imputer, scaler, model)
except Exception:
    pipeline = make_pipeline(scaler, model)   # fallback - no imputation

# ------------------------------------------------------------------
# 2. Feature definition (label, min, max, step, default, dtype)
# ------------------------------------------------------------------
FEATURES = [
    ("Pregnancies",            0,  20, 1,   2,   int),
    ("Glucose",                0, 200, 1, 120,   int),
    ("BloodPressure",          0, 150, 1,  70,   int),
    ("SkinThickness",          0, 100, 1,  20,   int),
    ("Insulin",                0,1000, 1,  80,   int),
    ("BMI",                 0.0,  80, 0.1, 30.0, float),
    ("DiabetesPedigreeFunction",0.0,2.5,0.01,0.5, float),
    ("Age",                    1, 120, 1,  33,   int),
]

def get_user_df() -> pd.DataFrame:
    """Create the eight sidebar number-inputs **once** and return a 1-row DataFrame."""
    vals = {}
    for name, lo, hi, step, default, typ in FEATURES:
        if typ is int:
            vals[name] = st.sidebar.number_input(
                label=name,
                min_value=int(lo),
                max_value=int(hi),
                value=int(default),
                step=int(step),
                key=name,                 # unique key - prevents duplicate-ID errors
                format="%d",
            )
        else:  # float
            vals[name] = st.sidebar.number_input(
                label=name,
                min_value=float(lo),
                max_value=float(hi),
                value=float(default),
                step=float(step),
                key=name,
                format="%.2f",
            )
    return pd.DataFrame([vals])

# ------------------------------------------------------------------
# 3. Streamlit page layout
# ------------------------------------------------------------------
st.set_page_config(page_title="Diabetes predictor", page_icon="🩺")
st.title("🩺 Diabetes Prediction - Logistic Regression")
st.write(
    "Enter the eight clinical measurements in the left sidebar, "
    "press **Predict**, and see the probability of diabetes."
)

# Build the input DataFrame **once**
user_df = get_user_df()

if st.button("🔮 Predict"):
    prob = pipeline.predict_proba(user_df)[0, 1]    # probability of class 1 (diabetes)
    pred = int(prob >= 0.5)                         # binary decision at a 0.5 threshold

    col1, col2 = st.columns(2)
    col1.metric("Probability of Diabetes", f"{prob*100:.1f}%")
    col2.metric("Predicted class (0 = No, 1 = Yes)", pred)

    if pred:
        st.error("⚠️ High risk - the model predicts diabetes.")
    else:
        st.success("✅ Low risk - the model predicts no diabetes.")

# ------------------------------------------------------------------
# 4. Show the data that was fed to the model (optional)
# ------------------------------------------------------------------
with st.expander("🔎 Input data (what the model sees)"):
    st.dataframe(user_df)


* The Streamlit application for diabetes prediction (app.py) has been created and is ready to run.
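One gap worth noting: training treated 0 as missing in five columns, but the sidebar accepts 0 for those fields and passes it on unchanged. A minimal sketch of a helper that could be applied to the input frame before prediction. `zeros_to_nan` and `ZERO_IS_MISSING` are hypothetical names, not part of the original app, and the step only makes sense when an imputer is part of the pipeline, since `LogisticRegression` does not accept NaN.

```python
import numpy as np
import pandas as pd

# Columns where 0 was treated as missing during preprocessing.
ZERO_IS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def zeros_to_nan(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which 0 in the listed columns is replaced by NaN."""
    cleaned = frame.copy()
    cleaned[ZERO_IS_MISSING] = cleaned[ZERO_IS_MISSING].replace(0, np.nan)
    return cleaned

# Hypothetical one-row input, shaped like the frame built from the sidebar.
sample = pd.DataFrame([{
    "Pregnancies": 0, "Glucose": 0, "BloodPressure": 70, "SkinThickness": 20,
    "Insulin": 0, "BMI": 30.0, "DiabetesPedigreeFunction": 0.5, "Age": 33,
}])

cleaned_sample = zeros_to_nan(sample)
print(cleaned_sample)
```

`Pregnancies` keeps its 0 because zero pregnancies is a valid value, while the zero `Glucose` and `Insulin` entries become NaN for the imputer to fill.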

### c. Run Locally

In [None]:
# Run `streamlit run app.py` in a terminal to start the app.

### d. Online Deployment (Streamlit Cloud)

The Logistic Regression model has been deployed on Streamlit Community Cloud from a Git repository.

🔗 Live Application Link:  
https://diab-pred-model.streamlit.app/

The application loads the trained model and scaler, accepts user inputs for all features, and predicts the diabetes outcome.


## Interview Questions

#### 1. Difference between Precision and Recall?

* Precision: Of all predicted positives, how many are actually positive
* Recall: Of all actual positives, how many were correctly predicted

Prioritize precision when false positives are costly. Prioritize recall when false negatives (missed positives) are costly, as in disease screening.
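The trade-off between the two becomes visible when the decision threshold moves. A small sketch with made-up labels and probabilities (`y_true_demo` and `y_prob_demo` are hypothetical) that scores the same predictions at three thresholds:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities for the positive class.
y_true_demo = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
y_prob_demo = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict the positive class whenever the probability reaches the threshold.
    y_pred_demo = (y_prob_demo >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true_demo, y_pred_demo),
        recall_score(y_true_demo, y_pred_demo),
    )
    print(threshold, results[threshold])
```

On this toy data, raising the threshold increases precision and lowers recall, so the choice of threshold depends on which kind of error is more expensive.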

#### 2. What is Cross-Validation and why is it important?

Cross-validation splits the data into multiple folds and repeatedly trains the model on some folds while testing it on the held-out fold. It is important because it:

* Helps detect overfitting
* Provides a more reliable performance estimate than a single train/test split
* Shows how well the model generalizes to unseen data

A common method is k-fold cross-validation.
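A minimal sketch of stratified 5-fold cross-validation for a scaled logistic regression, assuming synthetic data from `make_classification` in place of the diabetes dataset. Wrapping the scaler and the model in one pipeline means the scaler is refitted inside every fold, so the held-out fold never influences preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with eight features.
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)

cv_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=folds, scoring="roc_auc")

print(scores)
print(scores.mean(), scores.std())
```

The spread of the five scores gives a rough sense of how much a single train/test split could vary.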