## Linear Regression Overview

In this module, we implement linear regression using the **closed-form solution** (also known as the normal equation) to score how likely a tumor is to be malignant based on features of its cell nuclei. Although the original task is a binary classification problem (benign vs. malignant), we treat the class labels as continuous values, $0$ for benign and $1$ for malignant, and use linear regression to predict a real-valued score. This score is not a calibrated probability and is not constrained to $[0, 1]$: predictions can fall below $0$ or above $1$.

The model assumes a linear relationship between the input features $\mathbf{x} \in \mathbb{R}^d$ and the target value $y \in \mathbb{R}$, defined as:

$$
\hat{y} = \mathbf{w}^\top \mathbf{x} + b
$$

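To find the optimal weights $\mathbf{w}$ and bias $b$, we first augment the input matrix $X$ with a column of ones, so the bias is learned as one more weight. In the normal equation that follows, $X$ is the augmented matrix, $y$ is the vector of training labels, and $\mathbf{w}$ is the stacked vector of feature weights and bias that minimizes the sum of squared errors: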

$$
\mathbf{w} = (X^\top X)^{-1} X^\top y
$$
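The fit described above can be sketched in NumPy as follows. This is a minimal illustration, not necessarily the project's implementation: the function names `fit_normal_equation` and `predict` are hypothetical, the column of ones is assumed to be appended as the last column, and `np.linalg.solve` is used instead of an explicit matrix inverse because it is generally more stable numerically.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Least-squares weights for the augmented matrix [X, 1] (hypothetical helper)."""
    # Assumption: the column of ones goes last, so the last weight is the bias b.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve (X^T X) w = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

def predict(X, w):
    """Compute y_hat = X w[:-1] + b, where b = w[-1]."""
    return X @ w[:-1] + w[-1]

# Synthetic, noise-free check: y = 2*x1 - 3*x2 + 0.5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -3.0]) + 0.5
w_demo = fit_normal_equation(X_demo, y_demo)
```

On this synthetic data, `w_demo` recovers the generating coefficients and bias up to floating-point error, because the targets are an exact linear function of the inputs.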

We evaluate the model using **mean squared error (MSE)** to measure how close the predicted outputs $\hat{y}$ are to the true labels $y$. To interpret the results in a classification setting, we apply a threshold at $0.5$: predictions above this value are labeled malignant ($1$), and those below are labeled benign ($0$).
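The evaluation step can be sketched like this. The helper names are hypothetical, the scores are made-up values for illustration, and assigning a score of exactly $0.5$ to the benign class is an assumption, since the text does not specify how ties are handled.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between labels and predicted scores."""
    return float(np.mean((y_true - y_pred) ** 2))

def to_labels(y_pred, threshold=0.5):
    """Map real-valued scores to class labels: 1 = malignant, 0 = benign."""
    # Assumption: a score exactly equal to the threshold is labeled benign (0).
    return (y_pred > threshold).astype(int)

# Illustrative values only; note that a score may fall outside [0, 1].
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_score = np.array([0.1, 0.8, 0.4, -0.2])

mse = mean_squared_error(y_true, y_score)
labels = to_labels(y_score)
```

Here the third sample is a malignant case scored at `0.4`, so it is labeled benign after thresholding even though its squared error is moderate.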

While this approach can provide a simple baseline, linear regression is not ideal for classification tasks, especially when the classes are not linearly separable: its outputs are unbounded, and squared error also penalizes predictions that land far on the correct side of the threshold.

---

## Advantages of Linear Regression

- **Simple and Interpretable**: The model is easy to understand and implement, with clear relationships between inputs and outputs.
- **Closed-Form Solution**: The normal equation gives the exact least-squares solution without iterative optimization, provided $X^\top X$ is invertible.
- **Fast to Train**: Efficient on small datasets because the solution comes from a single linear solve; the cost grows roughly as $O(nd^2 + d^3)$ for $n$ samples and $d$ features.
- **Baseline for Comparison**: Serves as a simple reference point for evaluating more complex models.
- **Works Well with Linearly Related Data**: Performs effectively when the underlying relationship between features and target is linear.

## Disadvantages of Linear Regression

- **Assumes Linearity**: Poor performance if the relationship between features and target is nonlinear.
- **Sensitive to Outliers**: Outliers can heavily influence the model's predictions and skew the results.
- **Not Ideal for Classification**: It predicts unbounded continuous values, so a threshold is required for classification, and the raw outputs have no probabilistic interpretation.
- **No Feature Interactions**: Linear regression can't capture interactions between variables unless they are explicitly added as new features.
- **Collinearity Issues**: Highly correlated features make $X^\top X$ nearly singular, which destabilizes the weight estimates and reduces model reliability.
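The collinearity point can be illustrated with a small synthetic experiment. This is a sketch on made-up data, not the tumor dataset: it compares the condition number of $X^\top X$ for two independent features against two nearly identical ones. A large condition number means small changes in the data can produce large changes in the solved weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2_independent = rng.normal(size=200)
# Nearly a copy of x1: the two columns are almost perfectly correlated.
x2_collinear = x1 + 1e-4 * rng.normal(size=200)

X_indep = np.column_stack([x1, x2_independent])
X_coll = np.column_stack([x1, x2_collinear])

# Condition number of the matrix that the normal equation has to invert.
cond_indep = np.linalg.cond(X_indep.T @ X_indep)
cond_coll = np.linalg.cond(X_coll.T @ X_coll)
```

With independent columns the condition number stays small, while the near-duplicate column makes it larger by many orders of magnitude.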

---

## Data Overview

This project uses the **Breast Cancer Wisconsin (Diagnostic) Dataset** from the UCI Machine Learning Repository. It contains **569 samples**, each describing a breast tumor based on features extracted from a digitized image of a fine needle aspirate (FNA). Each sample includes **30 numeric features** that summarize ten characteristics of the cell nuclei, such as **Radius**, **Texture**, **Perimeter**, **Area**, and **Smoothness**. For each characteristic, the dataset reports the **mean**, **standard error**, and **worst** value (the mean of the three largest values) across the nuclei in the image.

The target variable is **diagnosis**:
- **M** = Malignant → encoded as **1**
- **B** = Benign → encoded as **0**
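The encoding above amounts to a single comparison. In this sketch, the `diagnosis` array is a hypothetical stand-in for the dataset's diagnosis column.

```python
import numpy as np

# Hypothetical sample of the diagnosis column.
diagnosis = np.array(["M", "B", "B", "M"])

# Malignant ("M") becomes 1.0 and benign ("B") becomes 0.0.
y = (diagnosis == "M").astype(float)
```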

Before training, the dataset was **shuffled** and **split manually** into 80% training and 20% testing sets, and the features were **standardized** using z-score normalization (subtracting each feature's mean and dividing by its standard deviation).
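A minimal sketch of this preprocessing is shown here. The function name, the fixed seed, and the random placeholder data are hypothetical. Computing the mean and standard deviation from the training rows only is also an assumption, chosen so that no test-set statistics leak into training; the text does not say which rows the statistics came from.

```python
import numpy as np

def shuffle_split_standardize(X, y, train_frac=0.8, seed=0):
    """Shuffle rows, split into train/test, and z-score the features."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])        # shuffled row indices
    n_train = int(train_frac * X.shape[0])   # 80% of the rows by default
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    X_train, X_test = X[train_idx], X[test_idx]

    # Assumption: statistics come from the training rows only.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                  # guard against constant features

    return ((X_train - mu) / sigma, (X_test - mu) / sigma,
            y[train_idx], y[test_idx])

# Random placeholder data with the dataset's shape (569 samples, 30 features).
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(569, 30))
y_demo = rng.integers(0, 2, size=569).astype(float)
X_tr, X_te, y_tr, y_te = shuffle_split_standardize(X_demo, y_demo)
```

With 569 samples, this split yields 455 training rows and 114 test rows. The test features are scaled with the training statistics, so their means are close to zero but not exactly zero.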

