# In Depth Analysis of Linear Regression
- **ML Overview**
    - Example, Algorithms vs Model
- **Supervised Learning**
    - Definition, Examples
- **Supervised Learning Setup**
    - Nomenclature, Formulation(`Regression` & `Classification`),  Example,  Learning,  Hypothesis Class.
    - Performance Evaluation
        - Loss Function, 0/1 Loss Function, Squared Loss, Root Mean squared error, Absolute Loss
    - Generalization: The Train-Test Split, Generalization loss.
    
[Linear Regression](#Linear-Regression)

- **Single Feature**
- **Multiple Feature**
- **Model Formulation and Setup**
- **Loss Function**
   - How to solve?
   - Reformulation
   - Consequently
- **Solve Optimization Problem (Analytical Solution employing Calculus)**
- **Model Evaluation Techniques**
- **Polynomial Regression**
- **How to Handle Overfitting?**
- **Regularization (Ridge Regression and Lasso Regression)**
- **Gradient Descent Algorithm**
   - Formulation
   - Algorithm
   - Types 
- **Linear Regression Implementation in Python**
- **Linear Regression Implementation using sklearn**
- **Interview Questions**

<h2 align='center' > ML Overview</h2> 

### What is Machine Learning?
- Automating the process of automation
- Getting computers to program themselves

<figure>
<img src="images/p1.png" width=300px height=300px>
<figcaption align = "center" ><b> Given examples(Training data), make a machine learn system behavior or discover patterns. </b> </figcaption>
</figure>


> **A simple definition of machine learning is the ability of a computer or machine to improve its performance on a specific task through experience. It involves training a model on a data set, and then using that model to make predictions or take actions based on new, unseen data.**

> **In other words, machine learning is a way for computers to learn and make decisions on their own, without being explicitly programmed to perform a specific task.**
-----

<img src="images/p2.png">

&ensp;

**Note**: An algorithm is a set of steps that a model follows to learn from data and make predictions. A model, on the other hand, is a representation of the relationships between the input data and the output predictions. It is trained on a dataset and is able to make predictions on new data based on what it has learned from the training data.

### Rules vs. Learning
- `Rules-based` approaches involve following a predetermined set of rules or instructions to solve a problem or make a decision. These rules are typically established in advance and are followed regardless of the specific circumstances or context.

- On the other hand, `learning-based` approaches involve using data and experience to improve decision-making over time. In the context of machine learning, this involves training a model on a dataset, allowing the model to learn patterns and relationships in the data, and using this learned knowledge to make predictions on new data.

- Both rules-based and learning-based approaches have their own strengths and weaknesses, and the most appropriate approach will depend on the specific problem or task at hand. Rules-based approaches are often simpler and easier to implement, but they may not be able to adapt to changing circumstances or new data as well as learning-based approaches. Learning-based approaches, on the other hand, may be more complex and require more data and computation, but they have the ability to improve over time and adapt to new situations.

<h2 align="center">Supervised Learning</h2> 

**Predicting the labels for unseen data based on labelled
instances.** 

<img src="images/p4.png" height=500px width=500px align='center'>

- Each column is a feature and adds one dimension to the data
- The number of columns defines the total number of features and hence the data dimensionality.
- `Inputs`: referred to as Features.
- `Output`: referred to as Label.
- `Training data`: (input, output) for which the output is known and is used for training a model by ML algorithm.
- `A loss, objective, or cost function`: measures how well a trained model approximates the training data.
- `Test data`: (input, output) for which the output is known and is used for the evaluation of the performance of the trained model

#### A Recipe for Applying Supervised Learning

To apply supervised learning, we define a dataset and a learning algorithm.

$$ \text{Dataset} + \text{Learning Algorithm} \to \text{Predictive Model} $$

The output is a predictive model that maps inputs to targets. For instance, it can predict targets on new inputs.

-----
The learning algorithm would receive a set of inputs along with the corresponding correct outputs to train a model.

<img src="images/p5.png" height=600px width=600px align="center">
&ensp;

<img src="images/p8.png" height=600px width=600px align="center">


### Algorithms vs Model
- `Linear regression` algorithm produces a model, that is, a vector of values of the coefficients of the model.
- `Decision tree` algorithm produces a model comprised of a tree of if-then statements with specific values.
- `Neural network` along with backpropagation + gradient descent: produces a model comprised of a trained (weights assigned) neural network.
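
To make the distinction concrete, the sketch below treats least squares as the *algorithm* and the fitted coefficient vector as the *model*. The toy data and the helper name `fit_linear` are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

def fit_linear(x, y):
    """Algorithm: least-squares fit of y ~ intercept + slope * x.

    Returns the model, which here is just a vector of two coefficients.
    """
    # Design matrix with a column of ones for the intercept
    A = np.column_stack([np.ones_like(x), x])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

# Toy data lying exactly on the line y = 1 + 2x (assumed for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

model = fit_linear(x, y)                # the model: [intercept, slope]
prediction = model[0] + model[1] * 4.0  # use the model on a new input
print(model, prediction)
```

The algorithm (`fit_linear`) can be rerun on any dataset; each run yields a different model (a different coefficient vector).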

<h2 align="center">Supervised Learning Setup</h2>

##### A Supervised Learning Dataset: Notation

Using the adopted notation, we can formalize the supervised machine learning setup. We represent the entire training data as 
$$\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\} \subseteq \mathcal{X} \times \mathcal{Y}$$

Where
- $\mathcal{D}$ is the dataset
- $\mathcal{X}$ is the $d$-dimensional feature space ($\mathbb{R}^d$)
- $x^{(i)}$ is the input vector of the $i$-th sample
- $\mathcal{Y}$ is the label space

Each $x^{(i)}$ denotes an input (e.g., the measurements for patient $i$), and each $y^{(i)} \in \mathcal{Y}$ is a target (e.g., the Heart Disease). 

<b><u>Regression</u></b>:    $\mathcal{Y} = \mathbb{R}$ (prediction on a continuous scale)

<b><u>Classification</u></b>:    $\mathcal{Y} = \{0,1\}$ or $\{-1,1\}$ or $\{1,2\}$ (binary classification)

$\mathcal{Y} = \{1,2,3,\ldots,M\}$ ($M$-class classification)




Together, $(x^{(i)}, y^{(i)})$ form a *training example*.
&emsp;
<h3 align='center'>Example</h3>
<img src="images/9.png" height=500px width=500px >

&emsp;

<h3 align='center'>Learning </h3>

We want to develop a model that can predict the label for an input whose label is **unknown.**

We assume that the data points $(x^{(i)},y^{(i)})$ are drawn from some **unknown** distribution $P(X,Y)$.
<img src="images/10.png" align='center'>

Our goal is to learn a machine (model, function or hypothesis) $h\in H$ such that for a new pair/instance $(x,y) \sim P$, we can use $h$ to obtain
$$h(x) =y$$

with high probability or 
$$h(x)\approx y$$

in some optimal sense.

**Note:** We do not know the exact distribution $P$; with **ML**, we try to estimate it from data. Here, $h$ is our hypothesis (the machine): a mapping from inputs to outputs, chosen from the hypothesis class $H$. In short, we want to learn from the dataset a function that takes an input and produces its output.

<h3 align='center'> Model/Function/Hypothesis: Notation</h3>

We'll say that a model is a function
$$ f : \mathcal{X} \to \mathcal{Y} $$
that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

Often, models have *parameters* $\theta \in \Theta$ living in a set $\Theta$. We will then write the model as
$$ f_\theta : \mathcal{X} \to \mathcal{Y} $$
to denote that it's parametrized by $\theta$.

<h3 align="center">Feature Space</h3>  

**Definition:** A feature space is a set of mathematical features or variables that are used to represent the data. It is a multi-dimensional space where each dimension corresponds to a feature or variable, and each data point is represented as a point in this space.

**Example:** If you have a dataset containing three features (age, height, and weight), the feature space would be a three-dimensional space where each data point is represented by its age, height, and weight.

- Student data(e.g. for predicting grades), an input $x^{(i)} \in \mathcal{X}$ is a $d$-dimensional vector of the form
$$ x^{(i)} = \begin{bmatrix}
x^{(i)}_1 \\
x^{(i)}_2 \\
\vdots \\
x^{(i)}_d
\end{bmatrix}$$
- Where each $x^{(i)}_j$ is the value of the $j$-th feature for student $i$.
- Examples of features $x^{(i)}_j$:
    - Scores in assignments, quizzes, exams
    - Educational records, grades in previous courses
    - Rankings of previous educational institutes
    - Interaction with online tools? Missed instruments?
- A small number of features, relatively few of which have the value 0: **Dense vectors**.

The set $\mathcal{X}$ is called the feature space. Often, we have, $\mathcal{X} = \mathbb{R}^d$.

**Attributes:** We refer to the numerical variables describing the patient as *attributes*. Examples of attributes include:
* The age of a patient.
* The patient's gender.
* The patient's BMI.

**Features:** Often, an input object has many attributes, and we want to use these attributes to define more complex descriptions of the input.

* Is the patient old and a man? (Useful if old men are at risk).
* Is the BMI above the obesity threshold?

We call these custom attributes *features*.

**Feature:** We may denote features via a function $\phi : \mathcal{X} \to \mathbb{R}^p$ that takes an input $x^{(i)} \in \mathcal{X}$ and outputs a $p$-dimensional vector
$$ \phi(x^{(i)}) = \left[\begin{array}{@{}c@{}}
\phi(x^{(i)})_1 \\
\phi(x^{(i)})_2 \\
\vdots \\
\phi(x^{(i)})_p
\end{array} \right]$$
We say that $\phi(x^{(i)})$ is a *featurized* input, and each $\phi(x^{(i)})_j$ is a *feature*.
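
As a rough sketch of such a feature map, the function below turns three raw patient attributes into four features, including the two custom ones mentioned above. The attribute order, the age cutoff `OLD_AGE`, and the threshold `OBESITY_BMI` are assumptions chosen only for illustration.

```python
import numpy as np

OBESITY_BMI = 30.0  # assumed threshold, for illustration only
OLD_AGE = 65        # assumed cutoff for "old", for illustration only

def phi(x):
    """Map raw attributes x = [age, is_male, bmi] to a feature vector."""
    age, is_male, bmi = x
    return np.array([
        age,
        bmi,
        float(age >= OLD_AGE and is_male == 1),  # old and a man?
        float(bmi > OBESITY_BMI),                # BMI above the threshold?
    ])

patient = np.array([70, 1, 32.5])  # hypothetical patient
features = phi(patient)
print(features)
```

Here $p = 4$: two features are copied attributes and two are derived indicator features.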

**Features vs Attributes**:
In practice, the terms attribute and features are often used interchangeably. Most authors refer to $x^{(i)}$ as a vector of features. We will follow this convention and use the term "attribute" only when there is ambiguity between features and attributes. Features can be either discrete or continuous.

<h3 align='center'>Label Space</h3>
Formally, when $(x^{(i)}, y^{(i)})$ form a *training example*, each $y^{(i)} \in \mathcal{Y}$ is a target. We call $\mathcal{Y}$ the target space.

- **Binary (one-of-two) – Binary classification**
    - Sentiment: positive / negative
    - Email: Spam / Not Spam
    - Online Transactions: Fraudulent (Yes / No)
    - Tumor: Malignant / Benign
    - $y \in \{0,1\}$, e.g. 0: Negative class, 1: Positive class
    - $y \in \{-1,1\}$, e.g. -1: Negative class, 1: Positive class
- **Multi-class (one-of-many, many-of-many problems) – Multi-class classification**
    - Sentiment: Positive / negative / neutral
    - Emotion: Happy, Sad, Surprised, Angry,...
    - Part-of-Speech tag: Noun / verb / adjective / adverb /...
    - Recognize a word: One of |V| tags
    - $y \in \{0,1,2,3,\ldots\}$, e.g. 0: Happy, 1: Sad, 2: Angry, ...
- **Real-valued – Regression**
    - Temperature, height, age, length, weight, duration, price...
    
**Regression vs. Classification**


1. __Regression__: The target variable $y$ is continuous. We are fitting a curve in a high-dimensional feature space that approximates the shape of the dataset.
2. __Classification__: The target variable $y$ is discrete. Each discrete value corresponds to a *class* and we are looking for a decision boundary (a hyperplane, in the linear case) that separates the different classes.

<h3 align='center'>Feature Matrix</h3>

Suppose that we have a dataset of size $n$ (e.g., $n$ patients), indexed by $i=1,2,...,n$. Each $x^{(i)}$ is a vector of $d$ features.

Machine learning algorithms are most easily defined in the language of linear algebra. Therefore, it will be useful to represent the entire dataset as one matrix $X \in \mathbb{R}^{n \times d}$, whose $i$-th row holds the features of the $i$-th sample:
$$ X = \begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_d \\
x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_d \\
\vdots \\
x^{(n)}_1 & x^{(n)}_2 & \ldots & x^{(n)}_d
\end{bmatrix}.$$

Similarly, we can vectorize the target variables into a vector $y \in \mathbb{R}^n$ of the form
$$ y = \begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(n)}
\end{bmatrix}.$$
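
A minimal sketch of this vectorization in NumPy follows. The three patients and their values are made up for illustration; only the shapes matter.

```python
import numpy as np

# Hypothetical dataset: n = 3 patients, d = 2 features (age, BMI)
samples = [
    (np.array([63.0, 27.1]), 1),
    (np.array([41.0, 22.4]), 0),
    (np.array([55.0, 30.8]), 1),
]

# Stack the inputs row-wise: row i of X is the feature vector of sample i
X = np.vstack([x for x, _ in samples])
# Collect the targets into a single vector
y = np.array([label for _, label in samples])

print(X.shape)  # (n, d)
print(y.shape)  # (n,)
```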

<h3 align="center">Hypothesis Space</h3>

- We call the set of possible functions or candidate models `the hypothesis class`.
- The hypothesis $h$ is chosen from a hypothesis space $H$.
$$h \in H$$

- $H$ can be thought of as containing classes of hypotheses that share a set of assumptions, such as
    - Decision trees
    - Perceptrons
    - Neural networks
    - Support Vector Machines
    
**Example:** For $H$ = decision trees, each $h \in H$ is a particular decision tree with its own height, arity, thresholds, etc.

- In machine learning, the `hypothesis space` is the set of all possible hypotheses that a learning algorithm can consider when making predictions. It is an important concept because the choice of hypotheses can significantly affect the performance of a learning algorithm.

In [54]:
# Import necessary libraries
import numpy as np

# Define the input space X as a matrix of four rows and two columns,
# representing four possible input examples with two features each.
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# Define the output space Y as a vector of four values, representing
# the corresponding output labels for each of the four input examples.
Y = np.array([0, 1, 1, 0])

# Define the hypothesis space H
H = []

# Iterate over all integer combinations of weights w1 and w2 in [-5, 5]
for w1 in range(-5, 6):
    for w2 in range(-5, 6):
        # Append the hypothesis h(x) = w1 * x1 + w2 * x2 to the hypothesis space H.
        # The default arguments bind the current w1 and w2 to each lambda;
        # without them, every hypothesis would use the final loop values.
        h = lambda x, w1=w1, w2=w2: w1 * x[0] + w2 * x[1]
        H.append(h)

# Print the size of the hypothesis space
print(len(H))

121

>**Note:** The code builds the `hypothesis space` $H$ as a list of 121 candidate hypotheses (all linear functions $h(x) = w_1x_1 + w_2x_2$ with integer weights from $-5$ to $5$) that the learning algorithm can consider, and prints its size.

<h3><b><i>Question:</i></b> 
    For a given problem, how can we select/choose a hypothesis (machine) $h \in H$?</h3>
    
**Answer:**
- **Randomly**
    - May not work well
    - Like using a random program to solve a sorting problem
    - May work if $H$ is constrained enough

- **Exhaustively**
    - Would be very slow
    - The space $H$ is usually very large (if not infinite)

- $H$ is usually chosen by data scientists (you!) based on their experience!
    - $h \in H$ is estimated efficiently using various optimization techniques. Define hypothesis class $H$ for a given learning algorithm. Evaluate the performance of each candidate function and choose the best one.
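
For a small, finite hypothesis class, "evaluate each candidate and choose the best one" can be carried out literally. The sketch below does this by exhaustive search over linear hypotheses with integer weights; the toy data and the weight range are assumptions for illustration, and practical methods replace the double loop with an optimization technique.

```python
import numpy as np

# Toy data generated by y = 2*x1 - 3*x2 (assumed for illustration)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 2 * X[:, 0] - 3 * X[:, 1]

best_w, best_loss = None, np.inf
# Hypothesis class: h(x) = w1*x1 + w2*x2 with integer weights in [-5, 5]
for w1 in range(-5, 6):
    for w2 in range(-5, 6):
        w = np.array([w1, w2])
        loss = np.mean((X @ w - y) ** 2)  # mean squared error of this candidate
        if loss < best_loss:
            best_w, best_loss = w, loss

print(best_w, best_loss)
```

The search visits all 121 candidates, which is feasible here but quickly becomes impractical as the number of weights or their range grows.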

<h3><i>Question</i>: How do we evaluate the performance? </h3>

**Answer:** 
    
Define a loss function to quantify/calculate the accuracy of the prediction.

<h3 align="center">Loss Function</h3>

- Loss function should quantify/calculate the average error in predicting $y$ using hypothesis function $h$ and input $x$. It is denoted by $L$.

- Smaller is better
    - 0 loss: No error
    - 100% loss: Could not even get one instance right
    - 50% loss: Your h is as informative as a coin toss
    
&emsp;    


<h3 align="center"><b>0/1 Loss</b></h3>

- The `0/1 loss` is a loss function that is used to evaluate the performance of a machine learning model in binary classification tasks. It is defined as the number of incorrect predictions made by the model, divided by the total number of predictions.

<img src="https://www.baeldung.com/wp-content/ql-cache/quicklatex.com-3fc482ec51a32e213970a07a3de41d10_l3.svg" height=600px width=600px>

- Counts the average number of mistakes in predicting $y$
- Returns the training error rate
- Not used directly for training because it is non-continuous and non-differentiable
    - Difficult to utilize in optimization
- Used to evaluate classifiers in binary/multiclass settings

In [55]:
# Import necessary libraries
import numpy as np

# Define the true labels y_true as a numpy array of four values, 
# representing the correct class labels for four input examples. 
y_true = np.array([0, 1, 0, 1])

# Define the predicted labels y_pred as a numpy array of 
# four values, representing the class labels predicted by the model.
y_pred = np.array([0, 0, 1, 1])

# Calculate the 0/1 loss
loss = np.mean(y_true != y_pred)

# Equivalent explicit loop:
# def zero_one_loss(y_true, y_pred):
#     loss = 0
#     for yt, yp in zip(y_true, y_pred):
#         if yt != yp:
#             loss += 1
#     return loss / len(y_true)

# Print the loss
print(loss)

0.5


> **Note:** The output will be `0.5`, indicating that the model made 2 incorrect predictions out of a total of 4, or 50% error.

<h3 align="center"><b>Squared Loss Function</b></h3>

The squared loss function, also known as the mean squared error (MSE) loss, is a common loss function used in regression tasks. It measures the average squared difference between the predicted values and the true values. The squared loss function is defined as:

$$L_{sq}(h) = \frac{1}{n} \sum_{i=1}^n \left( h(x^{(i)}) - y^{(i)} \right)^2$$

Where $h(x^{(i)})$ is the predicted value, $y^{(i)}$ is the true value, and $n$ is the number of samples. These are defined for a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$.


- Typically used in regression settings
- The loss is always non-negative
- The loss grows quadratically with the absolute magnitude of mis-prediction
- Encourages no predictions to be really far off
- If a prediction is very close to correct, its squared error is tiny, so that example receives little further attention during optimization.

In [56]:
def squared_loss(y_true, y_pred):
    loss = 0
    for yt, yp in zip(y_true, y_pred):
        loss += (yt - yp) ** 2
    loss /= len(y_true)
    return loss


**Note:** Function `squared_loss` takes in two arguments: `y_true` and `y_pred`. `y_true` is a list of true values, and `y_pred` is a list of predicted values. The function calculates the squared difference between each pair of true and predicted values and sums them up. It then divides the total loss by the number of samples to get the average loss. The function returns the average loss as a float.

In [61]:
y_true = [1, 2, 3, 4, 5]
y_pred = [1.5, 2.5, 2.5, 4.5, 6.5]

loss = squared_loss(y_true, y_pred)
print(loss)  

0.65


<h3 align="center"><b>Absolute Loss Function</b></h3>

The absolute loss function, also known as the mean absolute error (MAE) loss, is another common loss function used in regression tasks. It measures the average absolute difference between the predicted values and the true values. The absolute loss function is defined as:

$$L_{abs}(h) = \frac{1}{n} \sum_{i=1}^n \left| h(x^{(i)}) - y^{(i)} \right|$$

Where $h(x^{(i)})$ is the predicted value, $y^{(i)}$ is the true value, and $n$ is the number of samples. These are defined for a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$.


- The loss is always non-negative
- The loss grows linearly with the absolute magnitude of mis-prediction
- Better suited for noisy data: outliers influence it less than they influence the squared loss.

In [58]:
def absolute_loss(y_true, y_pred):
    loss = 0
    for yt, yp in zip(y_true, y_pred):
        loss += abs(yt - yp)
    loss /= len(y_true)
    return loss

**Note:** The `absolute_loss` function takes in two arguments: `y_true` and `y_pred`. `y_true` is a list of true values, and `y_pred` is a list of predicted values. The function calculates the absolute difference between each pair of true and predicted values and sums them up. It then divides the total loss by the number of samples to get the average loss. The function returns the average loss as a float.

In [60]:
y_true = [1, 2, 3, 4, 5]
y_pred = [1.5, 2.5, 2.5, 4.5, 6.5]

loss = absolute_loss(y_true, y_pred)
print(loss)

0.7


<h4 align="center"><b>Comparison</b></h4>
<img src="images/p12.png" align="left">
<img src="images/p13.png" align="right" height=400px width=400px>
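
The difference between the two losses shows up most clearly when one prediction is far off. The sketch below compares both losses on two made-up prediction vectors that differ only in the last entry; the helper names `mse` and `mae` are illustrative.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_close = np.array([1.5, 2.5, 2.5, 4.5, 5.5])     # every error is 0.5
y_outlier = np.array([1.5, 2.5, 2.5, 4.5, 15.0])  # one error of 10

def mse(y_true, y_pred):
    """Mean squared error (squared loss)."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error (absolute loss)."""
    return np.mean(np.abs(y_true - y_pred))

print(mse(y_true, y_close), mae(y_true, y_close))
print(mse(y_true, y_outlier), mae(y_true, y_outlier))
```

The single large error raises the squared loss from 0.25 to 20.2 (roughly 80 times), while the absolute loss rises from 0.5 to 2.4 (less than 5 times).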


# Linear Regression

**Regression:** Quantitative Prediction on a continuous scale.
<img src="images/p11.png">

> Here, `PROCESS` or `SYSTEM` refers to any underlying physical or logical phenomenon which maps our input data to our observed and noisy output data.

- **One Variable Regression:** $\;\;y$ is a scalar.
- **Multi-Variable Regression:** $\;\;y$ is a vector.
- **Single feature Regression:** $\;\;x$ is a scalar.
- **Multiple feature Regression:** $\;\;x$ is a vector.

<h3 align="center"> Model Formulation and Setup</h3>

#### *True Model:*
We assume there is an inherent but unknown relationship between
input and output.
$y = f(x) + n$, where $n$ here denotes the observation noise.

<img src="images/p16.png" align="right" height=400px width=400px>

#### *Goal:* 
Given noisy observations, we need to estimate the unknown functional
relationship as accurately as possible.

We have:
- For some input $x$, $\hat y$ is our model output.
- Assume that our model is $\hat f(x,\theta)$, characterized by the parameter(s) $\theta$.
- Model $\hat f(x,\theta)$ has
    - A structure (e.g. linear, polynomial, inverse).
    - Parameters in the vector $\theta = [\theta_1,\theta_2, \theta_3,....,\theta_M]$
- Our model error is $\epsilon = y - \hat y$.

<img src="images/p17.png" align="left" height=400px width=400px>

<img src="images/p18.png" align="right" height=500px width=500px>



<h3 align="center">Model</h3>

We have :

$$\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\} \subseteq \mathcal{X} \times \mathcal{Y}$$

Model is a linear function of the features, that is 

$$\hat f(x,\theta) = \theta_0 + \sum_{i=1}^d \theta_i x_i = \theta_0 + \theta^T x$$
- Linear Structure
- Model Parameters: $\theta_0$ and $\theta = [\theta_1,\theta_2, \theta_3,....,\theta_d]$
    - $\theta_0$ is the bias or intercept.
    - $\theta = [\theta_1,\theta_2, \theta_3,....,\theta_d]$ represents the weights or slopes.
    - $\theta_i$ quantifies the contribution of the $i$-th feature $x_i$.
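
The linear model translates almost directly into NumPy. In the sketch below, the function name `predict` and the parameter values are hypothetical; each row of `X` is one input $x$.

```python
import numpy as np

def predict(X, theta0, theta):
    """Linear model: theta0 + theta^T x for each row x of X."""
    return theta0 + X @ theta

# Hypothetical parameters and inputs with d = 3 features
theta0 = 0.5
theta = np.array([1.0, -2.0, 0.25])
X = np.array([[1.0, 0.0, 4.0],
              [2.0, 1.0, 0.0]])

y_hat = predict(X, theta0, theta)
print(y_hat)
```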

<h3 align="center">What is Linear Model?</h3>

#### When d = 1:
$$\hat f(x,\theta) = \theta_0 + \theta_1x$$
- This represents the equation of a `line`.
<img src="images/p19.png" align="right">

#### When d=2:
$$\hat f(x,\theta) = \theta_0 + \theta_1x_1 + \theta_2x_2$$
- This represents the equation of a `plane`.
<img src="images/p20.png" align="right">


#### For general d:
$$\hat f(x,\theta) = \theta_0 + \theta^Tx$$
- This represents a `Hyper-plane` in $\mathbb{R}^{d+1}$.

----
- For different $\theta_0$ and <b>$\theta$</b>, we have different hyper-planes.
- How do we find the `best` line?
- What do we mean by the `best`?

<h3 align="center">Linear Regression with one Variable </h3>

**Notation:**

- **m** = Number of training samples
- **x** = Feature
- **y** = Label
- $(x^{(i)}, y^{(i)})$: the $i$-th sample in the dataset

<img src="https://cdn.scribbr.com/wp-content/uploads//2020/02/simple-linear-regression-graph.png" align="right">
<img src="images/21.png" align="left">



$$h_\theta(x) = \theta_0+ \theta_1x$$

> **Linear regression with one variable is also called univariate linear regression or simple linear regression.**

**Parameters:**
$$\theta_0, \theta_1$$

**Cost Function**
$$J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$$

**Goal:**
$$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$$

In [62]:
import numpy as np
def cost_function(y_pred, y):
    n = len(y)
    cost = 1/n * np.sum((y_pred - y)**2)
    return cost
y_pred = np.array([1, 2, 3])
y = np.array([1, 2, 2])
cost = cost_function(y_pred, y)
print(cost)

0.3333333333333333


<h3 align="center">A simplified case</h3>

<img src="images/21.jpeg">
<img src="images/22.jpeg">


<h3 align="center">Using both of the <i>knobs</i> </h3>

**Hypothesis:**
$$h_\theta(x) = \theta_0 + \theta_1x$$

**Parameters:**
$$\theta_0, \theta_1$$

**Cost Function**
$$J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$$

**Goal:**
$$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$$

<h3><i>Question:</i> How do we find the <i>best</i> line? What do we mean by <i>best</i>?</h3>

**Answer:**

#### Define Loss Function:
>Loss function should be a function of model parameters.

- For input $x$, our model error is $e = y - \hat y = y - \hat f(x,\theta) = y - \theta_0 - \theta^Tx$.
- $e$ is also termed the residual error, as it is the difference between the observed value and the predicted value.
- **d=1**

<img src="images/23.png" align="center">

- For $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\} \subseteq \mathcal{X} \times \mathcal{Y}$, we have
$$e_i = y_i-\theta_0-\theta^Tx_i,\;\;\;\; i=1,2,3,4,.....,n$$

- Using the residual error, we can define different loss functions:
   $$L_{LSE}(\theta_0,\theta) = \sum_{i=1}^n(y_i-\theta_0-\theta^Tx_i)^2$$
   $$L_{MSE}(\theta_0,\theta) = \frac{1}{n} \sum_{i=1}^n(y_i-\theta_0-\theta^Tx_i)^2$$
   $$L_{RMSE}(\theta_0,\theta) = \sqrt{\frac{1}{n} \sum_{i=1}^n(y_i-\theta_0-\theta^Tx_i)^2}$$
   
> One minimizer for all loss functions: MSE is LSE scaled by $1/n$, and RMSE is the square root of MSE, so all three are minimized by the same parameters.
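
The claim that these losses share a minimizer can be checked numerically. The sketch below evaluates the sum of squared errors, the mean squared error, and the root mean squared error on a grid of candidate slopes for a one-feature model with the intercept fixed at zero; the data and the grid are assumptions for illustration.

```python
import numpy as np

# Toy 1-D data (assumed), model y_hat = theta1 * x with theta0 fixed at 0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

thetas = np.linspace(0.0, 4.0, 401)  # candidate slopes, step 0.01
# residuals[k, i] = y_i - thetas[k] * x_i
residuals = y[None, :] - thetas[:, None] * x[None, :]

lse = np.sum(residuals ** 2, axis=1)  # sum of squared errors
mse = lse / len(x)                    # mean squared error
rmse = np.sqrt(mse)                   # root mean squared error

best = [thetas[np.argmin(loss)] for loss in (lse, mse, rmse)]
print(best)
```

All three entries of `best` are the same slope, because scaling by a positive constant and taking a square root are both monotone transformations of the sum of squared errors.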

- We minimize the following loss function:
   $$L_{MSE}(\theta_0,\theta) = \frac{1}{n} \sum_{i=1}^n(y_i-\theta_0-\theta^Tx_i)^2$$
   
- We have an **optimization problem**: find the parameters which minimize the loss function. We write the optimization problem (with no constraints) as 
   $$\min_{\theta_0,\theta}\;\; L_{MSE}(\theta_0,\theta) = \frac{1}{n} \sum_{i=1}^n(y_i-\theta_0-\theta^Tx_i)^2$$
   
   
### How to solve?
- **Analytically:** Determine a critical point at which the derivative (if it exists) equals zero.
- **Numerically:** Solve the optimization with an algorithm that iteratively takes us closer to the critical point that minimizes the objective function.

### Example - 1

In [63]:
import numpy as np

# y = mx + b
# m is slope, b is y-intercept
def compute_error_for_line_given_points(b, m, points):
    # Mean squared error of the line y = m*x + b over all points
    totalError = 0
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(points))

def step_gradient(b_current, m_current, points, learningRate):
    # One gradient descent step on the mean squared error
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # Accumulate the partial derivatives with respect to b and m
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, np.array(points), learning_rate)
    return [b, m]

def run():
    points = np.genfromtxt("datasets/data.csv", delimiter=",")
    learning_rate = 0.0001
    initial_b = 0 # initial y-intercept guess
    initial_m = 0 # initial slope guess
    num_iterations = 1000
    print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
    print("Running...")
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
    print("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))

if __name__ == '__main__':
    run()


Starting gradient descent at b = 0, m = 0, error = 5565.107834483211
Running...
After 1000 iterations b = 0.08893651993741346, m = 1.4777440851894448, error = 112.61481011613473


<img src="images/gradient_descent_example.gif">

### Example - 2

In [64]:
import numpy as np
# Generate some fake data for linear regression
N = 100
x = np.linspace(-1, 1, N)
y = 2 * x + 1 + np.random.randn(N) * 0.2
# Initialize weight and bias with random values
w = np.random.randn()
b = np.random.randn()
# Set the learning rate
alpha = 0.1
# Set the number of iterations
num_iterations = 100
# Iterate through the gradient descent algorithm
for i in range(num_iterations):
    # Calculate the predicted values
    y_pred = w * x + b
    # Calculate the cost function
    cost = 1/N * np.sum((y_pred - y) ** 2)
    # Calculate the gradients
    dw = 2/N * np.sum((y_pred - y) * x)
    db = 2/N * np.sum(y_pred - y)
    # Update the weights and biases
    w = w - alpha * dw
    b = b - alpha * db
    # Print the cost every 10 iterations
    if i % 10 == 0:
        print(f'Iteration {i}: Cost = {cost}')
# Print the final weights and biases
print(f'Final w: {w}')
print(f'Final b: {b}')

Iteration 0: Cost = 1.5427137296536395
Iteration 10: Cost = 0.4055899449885261
Iteration 20: Cost = 0.12820367641883942
Iteration 30: Cost = 0.06040293217462261
Iteration 40: Cost = 0.04382901547954687
Iteration 50: Cost = 0.03977749736664156
Iteration 60: Cost = 0.038787097644943355
Iteration 70: Cost = 0.03854499293476682
Iteration 80: Cost = 0.03848581007209732
Iteration 90: Cost = 0.03847134273178296
Final w: 1.9890302120562438
Final b: 0.9697502088675694


<h3 align="center">Define Loss Function: Reformulation</h3>

   $$L_{MSE}(\theta_0,\theta) = \frac{1}{n} \sum_{i=1}^n(y_i-\theta_0-\theta^Tx_i)^2 = \frac{1}{n}e^Te$$
   
**Explanation:**
The transpose of a column vector $v$, written $v^T$, is the row vector with the same entries. For example, if

$$v = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}$$

then the transpose of $v$ is:

$$v^T = \begin{bmatrix} v_1 & v_2 & v_3 \end{bmatrix}$$

The `dot product` of two column vectors $u$ and $v$ is the sum of the products of their corresponding elements, and it can be written as $u^Tv$:

$$u^Tv = u_1v_1 + u_2v_2 + u_3v_3$$

Taking $u = v = e$, where $e=[e_1,e_2,....,e_n]^T$ is the column vector of residuals with
$$e_i = y_i-\theta_0-\theta^Tx_i,\;\;\;\; i=1,2,3,4,.....,n$$

we get

$$e^Te = e_1^2 + e_2^2 + \ldots + e_n^2 = \sum_{i=1}^n(y_i-\theta_0-\theta^Tx_i)^2$$

which is exactly the sum of squared errors. As a sanity check, if every element of $e$ equals $1$, then $e^Te = 1\cdot1 + 1\cdot1 + \ldots + 1\cdot1 = n$.

<img src="images/p26.png" align="left">
<img src="images/p24.png" align="right">


**Consequently:** 


Recall that we may fit a linear model by choosing $\theta$ that minimizes the squared error:
$$J(\theta_0,\theta)=\frac{1}{2}\sum_{i=1}^n(y_i-\theta_0-\theta^\top x_i)^2 = \frac{1}{2}e^Te$$
We can write this sum in matrix-vector form as:
$$J(\theta_0,\theta)=J(w) = \frac{1}{2} (y-Xw)^\top(y-Xw) = \frac{1}{2} \|y-Xw\|^2,$$
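where $X$ is the design matrix (here augmented with a leading column of ones), $w = [\theta_0, \theta^\top]^\top$ stacks the intercept and the weights, and $\|\cdot\|$ denotes the Euclidean norm.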
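
The equality between the sum form and the matrix-vector form can be checked numerically. The sketch below assumes the design matrix is built by prepending a column of ones to the raw inputs, so that `w` stacks $\theta_0$ and $\theta$; the random data and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2
features = rng.normal(size=(n, d))          # raw inputs, one row per sample
y = rng.normal(size=n)
theta0, theta = 0.3, np.array([1.5, -0.7])  # arbitrary parameters

# Sum form: J = 1/2 * sum_i (y_i - theta0 - theta^T x_i)^2
e = y - theta0 - features @ theta
J_sum = 0.5 * np.sum(e ** 2)

# Matrix form: prepend a column of ones so that w = [theta0, theta]
X = np.column_stack([np.ones(n), features])
w = np.concatenate([[theta0], theta])
J_mat = 0.5 * np.linalg.norm(y - X @ w) ** 2

print(J_sum, J_mat)
```

Both values agree up to floating-point rounding, which is what allows the intercept to be absorbed into a single weight vector.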

### Solve Optimization Problem: (Analytical Solution employing Calculus)
- very beautiful, elegant function we have here!

We first write the loss function as
<img src="images/p27.png" align="center">

-  To further solve this, let us quickly talk about the concept of a gradient of a function.

#### Gradient of a function: Overview
- For a function $f(x)$ that maps $x \in \mathbb{R}^d$ to $\mathbb{R}$, we define the gradient (the vector of partial derivatives) with respect to $x$ as
<img src="images/p28.png" align="center">

- Derivative quantifies the rate of change along different directions.

**Question:** Calculate $\nabla$ of following functions.
- $f(x) = a^Tx = x^Ta$
- $f(x) = x^Tx$
- $f(x) = x^TPx$
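
The standard results are $\nabla_x (a^Tx) = a$, $\nabla_x (x^Tx) = 2x$, and $\nabla_x (x^TPx) = (P + P^T)x$, which reduces to $2Px$ when $P$ is symmetric. One way to convince yourself is a finite-difference check. The sketch below uses a hypothetical helper `numerical_gradient` and random test data chosen only for illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = h
        grad[k] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
d = 4
a = rng.normal(size=d)
P = rng.normal(size=(d, d))    # deliberately not symmetric
x = rng.normal(size=d)

g_linear = numerical_gradient(lambda v: a @ v, x)
g_square = numerical_gradient(lambda v: v @ v, x)
g_quad = numerical_gradient(lambda v: v @ P @ v, x)

print(np.allclose(g_linear, a, atol=1e-5))            # gradient of a^T x
print(np.allclose(g_square, 2 * x, atol=1e-5))        # gradient of x^T x
print(np.allclose(g_quad, (P + P.T) @ x, atol=1e-5))  # gradient of x^T P x
```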



**We have a loss function:**
$$J(w) = \frac{1}{2}(y^Ty - 2w^TX^Ty + w^TX^TXw)$$

- Take the gradient with respect to $w$:
\begin{align*}
\nabla_w J(w) 
& = \nabla_w \frac{1}{2} (X w - y)^\top  (X w - y) \\
& = \frac{1}{2} \nabla_w \left( (Xw)^\top  (X w) - (X w)^\top y - y^\top (X w) + y^\top y \right) \\
& = \frac{1}{2} \nabla_w \left( w^\top  (X^\top X) w - 2(X w)^\top y \right) \\
& = \frac{1}{2} \left( 2(X^\top X) w - 2X^\top y \right) \\
& = (X^\top X) w - X^\top y
\end{align*}

We used the facts that $a^\top b = b^\top a$ (line 3), that $\nabla_x b^\top x = b$ (line 4), and that $\nabla_x x^\top A x = 2 A x$ for a symmetric matrix $A$ (line 4).

> We know from calculus that a differentiable function can only attain a minimum at a point where its gradient is zero. In our case, the objective function is a convex (multivariate) quadratic; hence, when $X^\top X$ is invertible, it has exactly one stationary point, which is the global minimum.

- Setting the above derivative to zero, we obtain the *normal equations*:
$$ (X^\top X) w = X^\top y.$$

Hence, provided $X^\top X$ is invertible, the value $w^*$ that minimizes this objective is given by:
$$ w^* = (X^\top X)^{-1} X^\top y.$$


- We have determined the weights that minimize the sum of squared errors, and therefore also the MSE, the RMSE, and the norm of the residual.
- This solution is referred to as the least-squares solution, as it minimizes the squared error.
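
A minimal NumPy sketch of the closed-form solution follows. The synthetic data (slope 2, intercept 1, small noise) and the variable names are assumptions made for illustration. Rather than forming $(X^\top X)^{-1}$ explicitly, the sketch solves the normal equations with `np.linalg.solve` and compares the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1, 1, size=n)
y = 2 * x + 1 + 0.1 * rng.normal(size=n)    # synthetic data, illustration only

X = np.column_stack([np.ones(n), x])        # design matrix with intercept column

# Solve (X^T X) w = X^T y without forming the inverse explicitly
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient (X^T X) w - X^T y should vanish at the solution
grad = X.T @ X @ w_normal - X.T @ y

print("normal equations:", w_normal)
print("lstsq:           ", w_lstsq)
print("gradient norm:   ", np.linalg.norm(grad))
```

In practice a least-squares solver is usually preferred over an explicit inverse, since $X^\top X$ can be badly conditioned when features are nearly collinear.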

<h2 align="center">Gradient Descent Algorithm</h2>

**Goal:**   Minimize cost function $J(\theta_0, \theta_1)$.

**Definition:** Used all over machine learning for minimization.

**Problem:**
- We have $J(\theta_0, \theta_1)$.
- We want to get $min\;J(\theta_0, \theta_1)$.

**Solution**:
- Start with some initial values of $(\theta_0,\theta_1)$, for example $(0,0)$.


#### How does it work?
- Start with an initial guess
    - Start at $(0,0)$ (or any other value)
    - Keep changing $\theta_0$ and $\theta_1$ a little bit to try to reduce $J(\theta_0,\theta_1)$.
- Each time you change the parameters, you step in the direction of the negative gradient, which is the direction in which $J(\theta_0,\theta_1)$ decreases fastest
- Repeat
- Do so until you converge to a local minimum
- Has an interesting property
    - Where you start can determine which minimum you end up in
    - Here we can see one initialization point led to one local minimum
    - The other led to a different one
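
The dependence on the starting point can be sketched in one dimension. The non-convex function, the learning rate, and the helper `gradient_descent` below are assumptions chosen for illustration; they are not the cost surface shown in the figure.

```python
import numpy as np

def f(x):
    return x**2 + 10 * np.sin(x)       # non-convex: more than one local minimum

def grad_f(x):
    return 2 * x + 10 * np.cos(x)

def gradient_descent(x0, alpha=0.1, tol=1e-6, max_iter=10_000):
    """Plain gradient descent; max_iter guards against non-convergence."""
    x = x0
    for _ in range(max_iter):
        g = grad_f(x)
        if abs(g) < tol:
            break
        x = x - alpha * g
    return x

x_from_right = gradient_descent(5.0)    # start to the right
x_from_left = gradient_descent(-5.0)    # start to the left

print(f"start  5.0 -> x = {x_from_right:.4f}, f(x) = {f(x_from_right):.4f}")
print(f"start -5.0 -> x = {x_from_left:.4f}, f(x) = {f(x_from_left):.4f}")
```

The two runs settle in different minima, and only one of them reaches the lower function value.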

<img src="images/p29.png" align="center">

<h3 align="center">A simplified version of gradient descent</h3>

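Assume again that we set $\theta_0 = 0$, so that our hypothesis and cost function effectively have only one coefficient, $\theta_1$.
   $$h_\theta(x) = \theta_1x$$

repeat until convergence {
$$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$$
}

where $\alpha$ is the learning rate.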

In [53]:
import numpy as np

# Define the function to be minimized
def f(x):
    return x**2 + 10*np.sin(x)

# Define the gradient of the function
def grad_f(x):
    return 2*x + 10*np.cos(x)

# Choose the step size (learning rate)
alpha = 0.1

# Set the initial value of x
x = 5

# Set the tolerance for the convergence criterion
tol = 1e-6

# Initialize a list to store the values of x at each iteration
x_values = [x]

# Iterate until convergence
while True:
    # Compute the gradient at the current value of x
    grad = grad_f(x)
    
    # Update the value of x using gradient descent
    x = x - alpha * grad
    
    # Store the new value of x
    x_values.append(x)
    
    # Check for convergence
    if np.abs(grad) < tol:
        break

# Print the minimum value found
print(f"The minimum value of the function is {f(x):.4f} at x = {x:.4f}")


The minimum value of the function is 8.3156 at x = 3.8375


In this example, the function `f(x)` is defined as `x**2 + 10*np.sin(x)`, and its gradient, `grad_f(x)`, as `2*x + 10*np.cos(x)`. The step size (learning rate) is `alpha = 0.1`, and the initial value of `x` is 5. The tolerance for the convergence criterion is `tol = 1e-6`, which means that the algorithm stops once the absolute value of the gradient falls below this threshold.

The algorithm iteratively updates `x` using the gradient descent rule `x = x - alpha * grad` and stores the value of `x` at each iteration in the list `x_values`. When the loop terminates, `x` is (approximately) a stationary point of `f`. Because `f` is not convex, the point reached from this starting value is a local minimum and not the global one: `f` takes lower values elsewhere, for example near `x = -1.3`.

# Deriving the Gradient Descent Algorithm for Linear Regression

Linear regression is a supervised learning algorithm that is used to predict a continuous target variable given a set of input features. The goal of linear regression is to find the best model parameters (i.e., the weights and bias) that minimize the error between the predicted value and the true value of the target variable.

One way to find the best model parameters is to use the gradient descent algorithm. In this notebook, we will derive the gradient descent algorithm for linear regression.

## Linear Regression Model

In linear regression, we assume that the relationship between the input features and the target variable is linear. This can be expressed as:

$$ \hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b $$

where $\hat{y}$ is the predicted value, $x_1, x_2, \dots, x_d$ are the $d$ input features, $w_1, w_2, \dots, w_d$ are the model weights, and $b$ is the bias term.

## Loss Function

To measure the error between the predicted value and the true value of the target variable, we can use a loss function. One common loss function is the mean squared error (MSE) loss, which is defined as:

$$ L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 $$

where $n$ is the number of samples, $y_i$ is the true value of the target variable, and $\hat{y_i}$ is the predicted value.

## Gradient Descent

To minimize the MSE loss, we can use the gradient descent algorithm. The gradient descent algorithm works by iteratively updating the model weights and bias to reduce the loss.

The update rule for the weights is:

$$ w_j = w_j - \alpha \frac{\partial L}{\partial w_j} $$

where $\alpha$ is the learning rate and $\frac{\partial L}{\partial w_j}$ is the partial derivative of the loss with respect to $w_j$.

The update rule for the bias is:

$$ b = b - \alpha \frac{\partial L}{\partial b} $$

where $\alpha$ is the learning rate and $\frac{\partial L}{\partial b}$ is the partial derivative of the loss with respect to $b$.

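To compute the partial derivatives, we can use the chain rule, summing over the samples:

$$ \frac{\partial L}{\partial w_j} = \sum_{i=1}^{n} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w_j} $$

$$ \frac{\partial L}{\partial b} = \sum_{i=1}^{n} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial b} $$

Substituting $\frac{\partial L}{\partial \hat{y}_i} = -\frac{2}{n}(y_i - \hat{y}_i)$, $\frac{\partial \hat{y}_i}{\partial w_j} = x_{i,j}$ and $\frac{\partial \hat{y}_i}{\partial b} = 1$, we get:

$$ \frac{\partial L}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y_i}) (-x_{i,j}) $$

$$ \frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y_i}) (-1) $$

where $x_{i,j}$ is the value of feature $j$ in sample $i$.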
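
These update rules can be sketched in vectorized form for several features at once. The synthetic data, the learning rate, and the iteration count below are assumptions chosen for illustration, and every update uses the full dataset (batch gradient descent).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3                                   # samples, features (illustrative)
X = rng.normal(size=(n, d))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.7
y = X @ true_w + true_b + 0.1 * rng.normal(size=n)

w, b = np.zeros(d), 0.0
alpha = 0.1                                     # learning rate

for _ in range(500):
    y_hat = X @ w + b
    residual = y - y_hat
    grad_w = -(2 / n) * X.T @ residual          # dL/dw_j for all j at once
    grad_b = -(2 / n) * residual.sum()          # dL/db
    w -= alpha * grad_w
    b -= alpha * grad_b

print("w:", w)
print("b:", b)
```

On a well-conditioned problem like this one, the result should match the least-squares solution closely; badly scaled features generally need a smaller learning rate or feature scaling.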




# Deriving the Sigmoid Function for Logistic Regression

Logistic regression is a supervised learning algorithm that is used to predict a binary target variable given a set of input features. The goal of logistic regression is to find the model parameters (i.e., the weights and bias) that maximize the likelihood of the observed labels.

To represent the probability of the target variable being 1, we can use the sigmoid function. The sigmoid function is defined as:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

where $x$ is the input to the function.

The output of the sigmoid function is always between 0 and 1, which makes it suitable for representing probabilities. For example, if the output of the sigmoid function is 0.8, we can interpret this as an 80% probability of the target variable being 1.

In logistic regression, we use the sigmoid function to transform the linear combination of the input features and the model parameters into a probability. This can be expressed as:

$$ \hat{y} = \sigma(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b) $$

where $\hat{y}$ is the predicted probability, $x_1, x_2, \dots, x_n$ are the input features, $w_1, w_2, \dots, w_n$ are the model weights, and $b$ is the bias term.

To find the best model parameters, we can use an optimization algorithm (e.g., gradient descent) to maximize the likelihood of the observed labels, which is equivalent to minimizing the cross-entropy loss.



# Deriving the Sigmoid Function for Logistic Regression

Logistic regression is a supervised learning algorithm that is used for classification tasks. It is based on the idea of using a linear model to predict the probability of a binary outcome (i.e., 0 or 1).

To map the output of the linear model, which can be any real number, to the range 0 to 1, we can use a sigmoid function. The sigmoid function is defined as:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

where $z$ is the input to the function.

The sigmoid function has the following properties:

- It maps any real-valued number to the open interval $(0, 1)$
- It has an "S" shape, with a smooth curve that approaches 0 as the input becomes more negative, and approaches 1 as the input becomes more positive
- It is differentiable, which makes it suitable for use in gradient descent algorithms
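
The properties above can be checked numerically. The sketch below uses a small set of test inputs chosen for illustration and compares a finite-difference derivative with the known identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(z)
print(s)    # increases from near 0 to near 1; exactly 0.5 at z = 0

# Derivative identity: sigma'(z) = sigma(z) * (1 - sigma(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(np.allclose(numeric, s * (1 - s), atol=1e-8))
```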

The sigmoid function can be used in logistic regression by taking the dot product of the input features and the model weights, and passing the result through the sigmoid function:

$$ \hat{y} = \sigma(\mathbf{w}^T \mathbf{x}) $$

where $\hat{y}$ is the predicted probability, $\mathbf{w}$ is the vector of model weights, and $\mathbf{x}$ is the vector of input features.

To train the model, we can use a loss function such as the cross-entropy loss to measure the difference between the predicted probability and the true label. The model parameters can then be updated using a gradient descent algorithm to minimize the loss.
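
A minimal sketch of this training loop is shown below. The synthetic data, learning rate, and iteration count are assumptions for illustration, and a bias term $b$ is included alongside the weights. For the cross-entropy loss, the gradient with respect to the weights works out to $\frac{1}{n} X^\top (\hat{y} - y)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
# Synthetic labels from a hypothetical "true" linear rule plus noise
score = X @ np.array([2.0, -1.0]) + 0.5 + 0.5 * rng.normal(size=n)
y = (score > 0).astype(float)

w, b = np.zeros(2), 0.0
alpha = 0.5                                   # learning rate
losses = []

for _ in range(300):
    p = sigmoid(X @ w + b)                    # predicted probabilities
    p_safe = np.clip(p, 1e-12, 1 - 1e-12)     # avoid log(0)
    loss = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
    losses.append(loss)
    grad_w = X.T @ (p - y) / n                # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= alpha * grad_w
    b -= alpha * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}, train accuracy: {accuracy:.2f}")
```

With all parameters initialized to zero, every prediction starts at 0.5, so the first loss value is $\ln 2$.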

By using the sigmoid function, logistic regression is able to predict probabilities.


Here is pseudocode for the gradient descent algorithm:

1. Initialize the parameters of the model, such as the weights and biases, with random values.
2. Calculate the cost function (also called the loss function) for the current values of the parameters.
3. Calculate the gradient of the cost function with respect to the parameters.
4. Update the parameters using the gradient and a learning rate $\alpha$:
    $$ \text{parameter} = \text{parameter} - \alpha \cdot \text{gradient} $$
5. Repeat steps 2-4 until the cost function reaches a minimum or the maximum number of iterations is reached.


Here is a mathematical example of the gradient descent algorithm for a simple linear regression model with one weight $w$ and one bias $b$:

1. Initialize $w$ and $b$ with random values.
2. Calculate the cost function:
    $$ J(w,b) = \frac{1}{N} \sum_{i=1}^{N} (y_{pred,i} - y_i)^2 $$
    where $y_{pred,i} = w x_i + b$ and $N$ is the number of data points.
3. Calculate the gradient of the cost function with respect to $w$ and $b$:
    $$ dw = \frac{2}{N} \sum_{i=1}^{N} (y_{pred,i} - y_i)\, x_i $$
    $$ db = \frac{2}{N} \sum_{i=1}^{N} (y_{pred,i} - y_i) $$
4. Update the parameters using the gradient and a learning rate $\alpha$:
    $$ w = w - \alpha \cdot dw $$
    $$ b = b - \alpha \cdot db $$
5. Repeat steps 2-4 until the cost function reaches a minimum or the maximum number of iterations is reached.

In [3]:
import numpy as np
# Generate some fake data for linear regression
N = 100
x = np.linspace(-1, 1, N)
y = 2 * x + 1 + np.random.randn(N) * 0.2
# Initialize weight and bias with random values
w = np.random.randn()
b = np.random.randn()
# Set the learning rate
alpha = 0.1
# Set the number of iterations
num_iterations = 100
# Iterate through the gradient descent algorithm
for i in range(num_iterations):
    # Calculate the predicted values
    y_pred = w * x + b
    # Calculate the cost function
    cost = 1/N * np.sum((y_pred - y) ** 2)
    # Calculate the gradients
    dw = 2/N * np.sum((y_pred - y) * x)
    db = 2/N * np.sum(y_pred - y)
    # Update the weights and biases
    w = w - alpha * dw
    b = b - alpha * db
    # Print the cost every 10 iterations
    if i % 10 == 0:
        print(f'Iteration {i}: Cost = {cost}')
# Print the final weights and biases
print(f'Final w: {w}')
print(f'Final b: {b}')

Iteration 0: Cost = 4.544354874302624
Iteration 10: Cost = 0.2711390881537866
Iteration 20: Cost = 0.08401917395160328
Iteration 30: Cost = 0.04816343170598533
Iteration 40: Cost = 0.03951241913418109
Iteration 50: Cost = 0.037398980098625904
Iteration 60: Cost = 0.03688236187932412
Iteration 70: Cost = 0.03675607394872026
Iteration 80: Cost = 0.036725202675043214
Iteration 90: Cost = 0.03671765614551247
Final w: 1.9616102370712132
Final b: 0.9940260130434699


In [4]:
import warnings
warnings.filterwarnings("ignore")

### Interview Questions on Linear Regression
Here are some common interview questions on linear regression in machine learning:
- What is linear regression and how does it work?
- What is the difference between simple linear regression and multiple linear regression?
- How do you evaluate the performance of a linear regression model?
- What is the ordinary least squares (OLS) method and how is it used in linear regression?
- How can you prevent overfitting in linear regression?
- Can you give an example of how linear regression can be used in a real-world problem?
- How do you decide which variables to include in a linear regression model?
- How do you handle collinearity in a linear regression model?
- What are some common assumptions made in linear regression?
- Can you discuss the limitations of linear regression?