# Logistic Regression | Assignment

**Question 1: What is Logistic Regression, and how does it differ from Linear
Regression?**

**Logistic Regression vs. Linear Regression**

### ✅ **Logistic Regression:**

* **Purpose:** Used for **classification** problems, typically binary classification (e.g., yes/no, spam/not spam).

* **Output:** Predicts the **probability** that a given input belongs to a particular class (value between **0 and 1**).

* **Function Used:** Uses the **logistic (sigmoid)** function to squash the output of a linear equation into the open interval $(0, 1)$.

  $$
  \sigma(z) = \frac{1}{1 + e^{-z}}, \text{ where } z = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_nx_n
  $$

* **Decision Boundary:** A threshold (commonly 0.5) is applied to classify the result into one of the categories.

---

### ✅ **Linear Regression:**

* **Purpose:** Used for **regression** problems, i.e., predicting continuous numerical values.

* **Output:** Predicts an output that can range from **-∞ to +∞**.

* **Function Used:** Fits a **straight line** (or hyperplane in higher dimensions) to the data.

  $$
  y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_nx_n
  $$

* **Loss Function:** Minimizes the **Mean Squared Error (MSE)** between predicted and actual values.

---

### 🔑 **Key Differences:**

| Feature                 | Linear Regression      | Logistic Regression                |
| ----------------------- | ---------------------- | ---------------------------------- |
| **Type of Problem**     | Regression             | Classification                     |
| **Output**              | Continuous value       | Probability (0 to 1)               |
| **Prediction Function** | Linear equation        | Logistic (sigmoid) function        |
| **Loss Function**       | Mean Squared Error     | Log Loss (Cross-Entropy)           |
| **Use Case Example**    | Predicting house price | Predicting if email is spam or not |
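
The contrast in the table can be illustrated with a minimal sketch. The toy data (hours studied versus pass/fail) and the variable names are assumptions made up for this illustration. Both models are fitted to the same binary labels and then queried outside the training range: the linear model's outputs leave the 0 to 1 range, while the logistic model's probabilities stay inside it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied -> fail (0) or pass (1)
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Query both models at inputs outside the training range
X_new = np.array([[-5], [4.5], [20]])
lin_pred = lin.predict(X_new)              # unbounded real values
log_prob = log.predict_proba(X_new)[:, 1]  # P(class 1), bounded by 0 and 1

print("Linear regression outputs: ", lin_pred.round(3))
print("Logistic regression P(y=1):", log_prob.round(3))
```

The linear outputs at -5 and 20 fall below 0 and above 1 respectively, which is why a raw linear fit is not interpreted as a probability.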



**Question 2: Explain the role of the Sigmoid function in Logistic Regression.**

### ✅ **Role of the Sigmoid Function in Logistic Regression**

The **sigmoid function** is at the heart of logistic regression. Its primary role is to **map any real-valued number (from $-\infty$ to $+\infty$) into a value between 0 and 1**, which can then be interpreted as a **probability**.

---

### 📌 **Definition of the Sigmoid Function:**

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:

* $z$ is the output of the linear combination of features:

  $$
  z = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n
  $$

---

### 🔍 **Role in Logistic Regression:**

1. **Transforms Linear Output to Probability:**

   * Logistic regression first computes a **linear score** using the input features and weights.
   * The **sigmoid function squashes** this linear score into a range between 0 and 1.
   * This makes it suitable for **binary classification**, as it gives the probability of the input belonging to the **positive class** (usually labeled as 1).

2. **Helps with Classification:**

   * The model uses a **threshold** (commonly 0.5) on the output of the sigmoid function to make a **classification decision**:

     $$
     \text{If } \sigma(z) \geq 0.5 \Rightarrow \text{class 1 (positive)}
     $$

     $$
     \text{If } \sigma(z) < 0.5 \Rightarrow \text{class 0 (negative)}
     $$

3. **Probabilistic Interpretation:**

   * The sigmoid function enables logistic regression to output **probabilities**, which is useful not just for classification, but also for **confidence estimation** and **ranking**.

---

### 📈 **Graphical Behavior:**

* The sigmoid function has an "S"-shaped curve.
* As $z \to +\infty$, $\sigma(z) \to 1$
* As $z \to -\infty$, $\sigma(z) \to 0$
* At $z = 0$, $\sigma(z) = 0.5$
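
The behavior listed above can be checked numerically. This is a minimal NumPy sketch; the helper name `sigmoid` and the example scores in `z` are chosen here for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # example linear scores
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)           # apply the 0.5 threshold

print("sigmoid(z):", probs.round(4))
print("labels:    ", labels)
```

Large negative scores give probabilities near 0, large positive scores give probabilities near 1, and a score of exactly 0 sits on the 0.5 threshold and is assigned to class 1.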

---

### 🧠 In Summary:

> The sigmoid function converts the linear output of the model into a **probability**, enabling logistic regression to perform **binary classification** in a smooth and differentiable way.


**Question 3: What is Regularization in Logistic Regression and why is it needed?**

### ✅ **What is Regularization in Logistic Regression?**

**Regularization** is a technique used in **logistic regression (and other models)** to **prevent overfitting** by discouraging overly complex models. It works by adding a **penalty term** to the loss function, which **controls the size of the model coefficients**.

---

### 🔍 **Why is Regularization Needed?**

* In logistic regression, the model learns weights ($\beta$) for each feature.
* If the model **fits the training data too well**, it might learn noise or irrelevant patterns — this is **overfitting**.
* Overfitting leads to **poor generalization** to new (unseen) data.
* Regularization helps by **penalizing large weights**, which keeps the coefficients small and the model simpler.

---

### 🧮 **How Does It Work?**

#### Original Loss Function (Log Loss):

$$
J(\beta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]
$$

> Where:
>
> * $m$ is the number of training examples
> * $y^{(i)}$ is the true label (0 or 1) of example $i$
> * $\hat{y}^{(i)}$ is the predicted probability that example $i$ belongs to class 1
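
As a sanity check, the log loss formula can be written directly in NumPy and compared with scikit-learn's `log_loss`. The function name `binary_log_loss` and the five example labels and probabilities are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y_true, y_prob):
    """Average cross-entropy for binary labels and predicted P(y=1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = [1, 0, 1, 1, 0]            # illustrative labels
y_prob = [0.9, 0.2, 0.7, 0.6, 0.1]  # illustrative predicted probabilities

manual = binary_log_loss(y_true, y_prob)
reference = log_loss(y_true, y_prob)
print(f"manual: {manual:.4f}, sklearn: {reference:.4f}")
```

Both values should agree, since `log_loss` computes the same average for binary labels when given the probability of the positive class.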

---

### ✳️ **With Regularization:**

There are two common types:

#### 1. **L2 Regularization (Ridge):**

Adds the **sum of squares of weights** to the loss:

$$
J(\beta) = \text{Log Loss} + \lambda \sum_{j=1}^{n} \beta_j^2
$$

* **Effect:** Penalizes large weights smoothly.
* This is the **default penalty** in many libraries, including scikit-learn's `LogisticRegression`.

---

#### 2. **L1 Regularization (Lasso):**

Adds the **sum of absolute values of weights**:

$$
J(\beta) = \text{Log Loss} + \lambda \sum_{j=1}^{n} |\beta_j|
$$

* **Effect:** Drives some weights to exactly **zero**, performing **feature selection**.
* Useful when you have many irrelevant or noisy features.
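
The sparsity effect of L1 can be observed by fitting the same model with both penalties and counting coefficients that are exactly zero. This is a rough sketch: the breast cancer dataset, the `liblinear` solver, and `C=0.1` are choices made for this illustration, and the exact counts depend on them.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put features on a comparable scale

# Same regularization strength for both penalties (C=0.1 is an arbitrary choice)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print(f"Zero coefficients with L1: {n_zero_l1} of {l1.coef_.size}")
print(f"Zero coefficients with L2: {n_zero_l2} of {l2.coef_.size}")
```

L1 typically zeroes out a sizeable share of the 30 features here, while L2 only shrinks them toward zero.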

---

### ⚖️ **What is λ (lambda)?**

* It's the **regularization parameter**. scikit-learn exposes its inverse as `C` (`C = 1/λ`), so a smaller `C` means stronger regularization.
* Controls the strength of the penalty:

  * **Large λ → more regularization → simpler model**
  * **Small λ → less regularization → more complex model**
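
The effect of the penalty strength can be seen by refitting an L2-penalized model for a few values of `C` and printing the size of the coefficient vector. The dataset and the three `C` values below are assumptions for this sketch only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Refit the same L2-penalized model for several C values (C = 1/lambda)
norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} lambda={1 / C:<6g} ||beta||_2 = {norms[C]:.3f}")
```

The coefficient norm grows as `C` increases, that is, as λ decreases and the penalty weakens.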

---

### 🧠 **In Summary:**

> **Regularization in logistic regression** helps control **model complexity** by adding a penalty for large coefficients. This **prevents overfitting**, improves **generalization**, and can even perform **feature selection** (in L1).


**Question 4: What are some common evaluation metrics for classification models, and
why are they important?**

### ✅ **Common Evaluation Metrics for Classification Models**

Evaluation metrics are essential to understand **how well a classification model performs**, especially in real-world applications where accuracy alone may not tell the full story.

---

### 📊 **1. Accuracy**

$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$$

* **Good for:** Balanced datasets (roughly equal number of samples per class).
* **Limitation:** Misleading on **imbalanced datasets** (e.g., predicting all "No" in a 95% "No" dataset gives 95% accuracy but zero usefulness).

---

### 📊 **2. Precision**

$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}
$$

* Measures: **How many of the predicted positives are actually correct**.
* **High precision** = Few false positives among the predicted positives.
* **Useful when**: False positives are costly (e.g., spam filters, medical diagnosis).

---

### 📊 **3. Recall (Sensitivity / True Positive Rate)**

$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}
$$

* Measures: **How many actual positives were correctly identified**.
* **High recall** = Low false negative rate.
* **Useful when**: Missing a positive is costly (e.g., cancer detection, fraud detection).

---

### 📊 **4. F1 Score**

$$
\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision + Recall}}
$$

* **Harmonic mean** of precision and recall.
* Balances precision and recall when both are important.
* **Useful when:** You need a balance between false positives and false negatives.

---

### 📊 **5. Confusion Matrix**

A **2x2 table** for binary classification:

|                     | Predicted Positive  | Predicted Negative  |
| ------------------- | ------------------- | ------------------- |
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |

* Gives a **complete breakdown** of classification results.
* Basis for many other metrics.
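
The metrics above can be computed together from one small set of predictions. The ten labels below are invented for illustration. Note that `confusion_matrix` orders classes ascending by default, so `labels=[1, 0]` is passed here to reproduce the positive-first layout of the table above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted: [[TP, FN], [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

With 3 true positives, 1 false negative, 1 false positive, and 5 true negatives, accuracy is 0.80 while precision, recall, and F1 are all 0.75.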

---

### 📊 **6. ROC Curve & AUC (Area Under Curve)**

* **ROC Curve:** Plots **True Positive Rate vs False Positive Rate** at various thresholds.

* **AUC (Area Under Curve):** Measures the **overall ability** of the model to discriminate between classes.

  * **AUC = 1:** Perfect classifier
  * **AUC = 0.5:** Random guessing

* **Useful when:** Evaluating performance across different classification thresholds.
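
A small sketch of how the ROC curve and AUC are obtained from predicted scores rather than hard labels. The four labels and scores are illustrative values.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]             # illustrative labels
y_score = [0.1, 0.4, 0.35, 0.8]   # illustrative predicted P(y=1)

# Each threshold yields one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.2f}")
```

Here the AUC is 0.75: of the four positive/negative pairs, three are ranked correctly (the positive has the higher score).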

---

### 📊 **7. Log Loss (Cross-Entropy Loss)**

* Measures how far the predicted **probabilities** are from the true labels (lower is better).
* Penalizes confident but **wrong predictions** more heavily.
* Used especially when models output **probabilities**.
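
To see why confident mistakes are penalized heavily, consider a single example whose true label is 1. Its loss is $-\log(p)$, where $p$ is the predicted probability of class 1. The probabilities below are illustrative.

```python
import numpy as np

# True label is 1, so the per-example loss is -log(p)
probs = np.array([0.9, 0.6, 0.1, 0.01])  # predicted P(y=1), from good to bad
losses = -np.log(probs)

for p, loss in zip(probs, losses):
    print(f"predicted P(y=1) = {p:<5} -> loss = {loss:.3f}")
```

The loss grows slowly while the prediction is roughly right and rises sharply as the model becomes confidently wrong.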

---

### 🧠 **Why Are These Metrics Important?**

* **Different metrics reveal different strengths and weaknesses** of your model.
* They help choose the **right model** for your specific use case.
* In many applications, the **cost of false positives and false negatives differs** — so **accuracy alone is not enough**.

---

### 🚨 Example:

In a **disease detection model**:

* Accuracy = 95% might sound great.
* But if 95% of people are healthy, the model might just be predicting “healthy” all the time.
* You need **recall** to ensure it catches sick patients and **precision** to avoid false alarms.
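
The scenario above can be reproduced in a few lines. This sketch assumes 100 patients, 5 of them sick (label 1), and a "model" that always predicts healthy (label 0).

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)  # 95 healthy, 5 sick
y_pred = np.zeros_like(y_true)         # always predict "healthy"

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# No positive predictions exist, so precision is 0/0; report it as 0
precision = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Accuracy is 0.95, yet recall is 0.00: the model does not catch a single sick patient.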


**Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.**

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset from sklearn
data = load_breast_cancer()

# 2. Convert to pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# 3. Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# 4. Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Create and train the Logistic Regression model
model = LogisticRegression(max_iter=10000)  # Increased max_iter to ensure convergence
model.fit(X_train, y_train)

# 6. Make predictions
y_pred = model.predict(X_test)

# 7. Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")


Test Accuracy: 0.9561
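
The program above builds the DataFrame directly from scikit-learn's dataset object. Since the question asks for a CSV file, here is a hedged variant that goes through `pd.read_csv`. To keep it self-contained, it first writes the same dataset to a temporary file; the file name `breast_cancer.csv` and the label column name `target` are assumptions, and with real data you would pass your own path and column name.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create a CSV to read from (stand-in for a real data file)
data = load_breast_cancer()
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")  # hypothetical name
frame = pd.DataFrame(data.data, columns=data.feature_names)
frame["target"] = data.target
frame.to_csv(csv_path, index=False)

# 1. Load the CSV file into a DataFrame
df = pd.read_csv(csv_path)

# 2. Separate features and label, then split 80/20
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train and evaluate
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
csv_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test Accuracy: {csv_accuracy:.4f}")
```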


**Question 6: Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.**

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train Logistic Regression with L2 Regularization (default)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=10000)
model.fit(X_train, y_train)

# 4. Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 5. Print model coefficients and accuracy
print("Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

print(f"\nIntercept: {model.intercept_[0]:.4f}")
print(f"\nTest Accuracy: {accuracy:.4f}")


Model Coefficients:
mean radius: 1.0274
mean texture: 0.2215
mean perimeter: -0.3621
mean area: 0.0255
mean smoothness: -0.1562
mean compactness: -0.2377
mean concavity: -0.5326
mean concave points: -0.2837
mean symmetry: -0.2267
mean fractal dimension: -0.0365
radius error: -0.0971
texture error: 1.3706
perimeter error: -0.1814
area error: -0.0872
smoothness error: -0.0225
compactness error: 0.0474
concavity error: -0.0429
concave points error: -0.0324
symmetry error: -0.0347
fractal dimension error: 0.0116
worst radius: 0.1117
worst texture: -0.5089
worst perimeter: -0.0156
worst area: -0.0169
worst smoothness: -0.3077
worst compactness: -0.7727
worst concavity: -1.4286
worst concave points: -0.5109
worst symmetry: -0.7469
worst fractal dimension: -0.1009

Intercept: 28.6487

Test Accuracy: 0.9561


**Question 7: Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.**

In [3]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Load a multiclass dataset (Iris)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train Logistic Regression with One-vs-Rest strategy
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# 4. Predict on test set
y_pred = model.predict(X_test)

# 5. Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
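
Recent scikit-learn releases appear to emit a deprecation warning for the `multi_class` argument (check the release notes for your installed version). A possible alternative, sketched here under that assumption, is to wrap the estimator in `OneVsRestClassifier`, which trains one binary logistic regression per class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# One binary LogisticRegression is fitted per class (3 for Iris)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_model.fit(X_train, y_train)
y_pred = ovr_model.predict(X_test)

print("Number of binary classifiers:", len(ovr_model.estimators_))
print(classification_report(y_test, y_pred, target_names=data.target_names))
```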





**Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.**

In [4]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Set up parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],         # Regularization strength (smaller C = more regularization)
    'penalty': ['l1', 'l2'],              # Regularization type
    'solver': ['liblinear']               # 'liblinear' supports both l1 and l2
}

# 4. Initialize Logistic Regression and GridSearchCV
log_reg = LogisticRegression(max_iter=1000)
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# 5. Fit the model
grid_search.fit(X_train, y_train)

# 6. Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# 7. Print results
print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")
print(f"Test Set Accuracy: {test_accuracy:.4f}")


Best Parameters: {'C': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.9670
Test Set Accuracy: 0.9825


**Question 9: Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.**

In [5]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# -----------------------------
# Model WITHOUT Standardization
# -----------------------------
model_no_scaling = LogisticRegression(max_iter=1000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -----------------------------
# Model WITH Standardization
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_with_scaling = LogisticRegression(max_iter=1000)
model_with_scaling.fit(X_train_scaled, y_train)
y_pred_scaled = model_with_scaling.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# -----------------------------
# Compare and Print Results
# -----------------------------
print(f"Accuracy WITHOUT Scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy WITH Scaling:    {accuracy_scaled:.4f}")


Accuracy WITHOUT Scaling: 0.9561
Accuracy WITH Scaling:    0.9737


The run above also emitted scikit-learn's convergence warning, most likely from the unscaled model, whose solver reached the `max_iter=1000` limit:

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
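
Scaling and training can also be bundled into a single estimator. This is a minimal sketch using `make_pipeline`; the variable names are chosen for illustration. The scaler is fitted only on the training data when `fit` is called and is reapplied automatically at prediction time.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# StandardScaler statistics come from X_train only; X_test is just transformed
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
pipe_accuracy = pipe.score(X_test, y_test)
print(f"Accuracy with scaling pipeline: {pipe_accuracy:.4f}")
```

A pipeline like this can be passed to cross-validation utilities as a single model, so the scaler is refitted inside each fold rather than on the full dataset.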
