### From Linear Regression to Logistic Classification

Linear regression and logistic regression (classification):
In both, we’re trying to learn a function that approximates $P(y|x)$. Both assume that $P(y|x)$ can be approximated as some function of a linear combination of the input features, $\mathbf{w}^T \mathbf{x} + b$. The key difference is that linear regression uses this combination directly as its prediction, $y = \mathbf{w}^T \mathbf{x} + b$, while logistic regression passes it through a sigmoid, $\sigma(\mathbf{w}^T \mathbf{x} + b)$. As a result, the models differ in their outputs, the interpretation of their parameters, and the loss functions used to optimise them:

**The outputs:**
- Linear regression: output is a continuous value from $-\infty$ to $+\infty$ representing the predicted value of $y$ ($y = \mathbf{w}^T \mathbf{x} + b$)
- Logistic regression: output is a probability value between 0 and 1, representing the likelihood of an instance belonging to a specific class.

**Interpretation of coefficients:**
   - In linear regression, the coefficients represent the change in the output variable for a one-unit change in the corresponding input feature, assuming all other features remain constant.
   - In logistic regression, the coefficients represent the change in the log-odds (logit) of the output variable for a one-unit change in the corresponding input feature. The coefficients can be exponentiated to obtain the odds ratios, which represent the multiplicative change in the odds of the output variable.
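To make the log-odds interpretation concrete, here is a minimal sketch for a one-feature model. The values of `w`, `b`, and `x` are hypothetical and chosen only for illustration; the helper `odds` is likewise an assumption of this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    # odds of the positive class: p / (1 - p)
    return p / (1.0 - p)

# Hypothetical fitted parameters, chosen only for illustration.
w, b = 0.7, -1.0

x = 2.0
p_before = sigmoid(w * x + b)          # probability at x
p_after = sigmoid(w * (x + 1.0) + b)   # probability after a one-unit increase

log_odds_change = math.log(odds(p_after)) - math.log(odds(p_before))
odds_ratio = odds(p_after) / odds(p_before)

print(log_odds_change)  # matches w up to floating-point error
print(odds_ratio)       # matches exp(w) up to floating-point error
```

The change in log-odds does not depend on the starting value of `x`, which is why a single coefficient can be read as one odds ratio.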

**Loss function:**
   - Linear regression typically uses the mean squared error (MSE) or mean absolute error (MAE).
   - Logistic regression uses the binary cross-entropy loss function (also known as log loss) to measure the dissimilarity between the predicted probabilities and the actual class labels.
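As a rough sketch of the two loss functions named above, the following computes MSE for real-valued predictions and binary cross-entropy for predicted probabilities. Averaging over the examples is an assumption of this sketch (the derivation later in this section works with the un-averaged sum), and the function names are illustrative.

```python
import math

def mse(y_true, y_pred):
    # mean squared error for real-valued targets
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    # y_true holds 0/1 labels, p_pred holds probabilities strictly between 0 and 1
    total = 0.0
    for t, p in zip(y_true, p_pred):
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

print(mse([1.0, 2.0], [1.0, 4.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

A confident wrong probability (for example 0.01 for a true label of 1) is penalised far more heavily by cross-entropy than a mildly wrong one, because of the logarithm.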

**The Math**

Logistic regression performs binary classification: given an input $x$, it predicts whether $x$ belongs to a certain class ($y = 1$) or not ($y = 0$). The output of logistic regression is therefore the probability that the label is $y = 1$, and each label is treated as a Bernoulli random variable, $Y \sim Bern(p)$.

Probability mass function of a Bernoulli random variable $X \sim Bern(p)$:
$$P(X = x) = p \quad \text{if } x = 1$$
$$P(X = x) = 1 - p \quad \text{if } x = 0$$
In one line:
$$P(X = x) = p^x \cdot (1 - p)^{(1-x)} \quad \text{(setting } x = 1 \text{ or } x = 0 \text{ recovers the two equations above)}$$
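The one-line form can be checked directly in code. This is a small sketch; the function name `bernoulli_pmf` and the value `p = 0.3` are illustrative choices.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for X ~ Bern(p), with x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return (p ** x) * ((1 - p) ** (1 - x))

p = 0.3
print(bernoulli_pmf(1, p))  # the x = 1 case, equal to p
print(bernoulli_pmf(0, p))  # the x = 0 case, equal to 1 - p
```

Because one of the two exponents is always zero, exactly one factor survives for each value of `x`.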

Logistic regression models $P(y|x)$, i.e., $P(Y = 1 | X = x)$ and $P(Y = 0 | X = x)$.
It assumes $P(y | x)$ can be approximated as a sigmoid function applied to a linear combination of input features.

So, the probability that a single data point belongs to class 1 is:
$$P(Y = 1 | X = x) = \sigma(\theta_0x_0 + \theta_1x_1 + \ldots + \theta_mx_m)$$

Where $x$ (the input) is a feature vector of $m$ dimensions, e.g., House [size, num rooms, num bathrooms, price, …], and the model has one parameter $\theta_j$ per input feature plus $\theta_0$, which multiplies $x_0$.

If we always set $x_0 = 1$ (so that $\theta_0$ acts as the intercept), we can write
$$P(Y = 1 | X = x) = \sigma(\theta^T x)$$

and
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$
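As a side note on implementation, evaluating `math.exp(-z)` directly raises `OverflowError` in Python when `z` is a large negative number. A common workaround, sketched below under the name `stable_sigmoid` (an illustrative name, not something used elsewhere in these notes), is to pick an algebraically equivalent form depending on the sign of `z`.

```python
import math

def stable_sigmoid(z):
    # For z >= 0, exp(-z) <= 1, so the textbook form is safe.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)),
    # where exp(z) <= 1 and cannot overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(stable_sigmoid(0.0))  # 0.5
print(stable_sigmoid(-1000.0), stable_sigmoid(1000.0))
```

For moderate inputs the two forms agree, so this changes nothing about the model itself.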

So, analogous to what we did with the Bernoulli:
$$P(Y = 1 | X = x) = \sigma(\theta^T x)$$
$$P(Y = 0 | X = x) = 1 - \sigma(\theta^T x)$$

*$\sigma(\theta^T x)$ plays the role of $p$ in $Bern(p)$*

In one line, the probability of one data point (the probability mass function of a Bernoulli with $p = \sigma(\theta^T x)$):
$$P(Y = y | X = x) = \sigma(\theta^T x)^y \cdot (1 - \sigma(\theta^T x))^{(1-y)}$$

Given $n$ independent training examples, the likelihood of all the data points is:

$$L(\theta) = \prod_{i=1}^{n} P(Y=y^{(i)} | X=x^{(i)})$$
Substituting the Bernoulli form:
$$L(\theta) = \prod_{i=1}^{n} \left[\sigma(\theta^T x^{(i)})^{y^{(i)}} \cdot (1 - \sigma(\theta^T x^{(i)}))^{(1-y^{(i)})}\right]$$

Take its log to convert multiplication to addition:
$$LL(\theta) = \sum_{i=1}^{n} \left[y^{(i)} \cdot \log(\sigma(\theta^T x^{(i)})) + (1-y^{(i)}) \cdot \log(1-\sigma(\theta^T x^{(i)}))\right]$$
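A quick numerical check that the sum form really is the log of the product form. The toy data and the parameter vector `theta` below are made up for illustration, and each `x` carries a leading 1 as the bias input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # sigma(theta^T x) for a single example
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def likelihood(theta, xs, ys):
    # product of per-example Bernoulli probabilities
    total = 1.0
    for x, y in zip(xs, ys):
        p = predict(theta, x)
        total *= (p ** y) * ((1 - p) ** (1 - y))
    return total

def log_likelihood(theta, xs, ys):
    # sum of per-example log probabilities
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(theta, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Illustrative toy data: x_0 = 1 is the bias input.
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
ys = [1, 0, 1]
theta = [0.1, 0.8]

print(math.log(likelihood(theta, xs, ys)))
print(log_likelihood(theta, xs, ys))
```

Beyond convenience for differentiation, the sum form also avoids the underflow that a product of many small probabilities would eventually hit.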

Now that we have a function for the log-likelihood, we simply need to choose the values of $\theta$ that maximize it. We can find them with an optimization algorithm such as gradient ascent (equivalently, gradient descent on the negative log-likelihood). To use such an algorithm, we first need the partial derivative of the log-likelihood with respect to each parameter.

**How to take derivative of this complicated function using chain rule:**
$$LL(\theta) = \sum_{i=1}^{n} \left[y^{(i)} \cdot \log(\sigma(\theta^T x^{(i)})) + (1-y^{(i)}) \cdot \log(1-\sigma(\theta^T x^{(i)}))\right]$$

Since the derivative of a sum is the sum of the derivatives, we’ll focus on the derivative for one example (the per-example results are then summed):
$$LL(\theta) = y \log(\sigma(\theta^T x)) + (1-y) \log(1-\sigma(\theta^T x))$$

Since we know:
$$p=\sigma(\theta^T x)$$
and
$$z=\theta^T x$$

Let’s simplify it:
$$LL(\theta) = y \log(p) + (1-y) \log(1-p)$$

By the chain rule:
$$\frac{\partial LL(\theta)}{\partial \theta_j} = \frac{\partial LL(\theta)}{\partial p} \cdot \frac{\partial p}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$$

Now find each of these terms:

$p = \sigma(z)$ -> $\frac{\partial p}{\partial z} = \sigma(z)[1-\sigma(z)]$ (by taking the derivative of the sigmoid)

$z = \theta^T x$ -> $\frac{\partial z}{\partial \theta_j} = x_j$ (only $x_j$ interacts with $\theta_j$)

$LL(\theta) = y \log(p) + (1-y) \log(1-p)$ -> $\frac{\partial LL(\theta)}{\partial p} = \frac{y}{p} - \frac{1-y}{1-p}$

Each of those derivatives was much easier to calculate. Now we simply multiply them together.

$$\frac{\partial LL(\theta)}{\partial \theta_j} = \left[\frac{y}{p} - \frac{1-y}{1-p}\right] \cdot \sigma(z)[1 - \sigma(z)] \cdot x_j$$

Since $p = \sigma(z)$:
$$= \left[\frac{y}{p} - \frac{1-y}{1-p}\right] \cdot p[1 - p] \cdot x_j$$

Multiplying $p[1 - p]$ into the bracket:
$$= \left[y(1-p) - p(1-y)\right] \cdot x_j$$

Expanding, the $yp$ terms cancel:
$$= \left[y - p\right]x_j = \left[y - \sigma(\theta^T x)\right]x_j$$
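One way to sanity-check the result $[y - \sigma(\theta^T x)]x_j$ is to compare it against a finite-difference approximation of the single-example log-likelihood. The sketch below does that; the values of `theta`, `x`, and `y` are arbitrary illustrative choices, and `numeric_gradient` uses central differences with a step `eps` that is also an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood_one(theta, x, y):
    # log-likelihood of a single example
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return y * math.log(p) + (1 - y) * math.log(1 - p)

def analytic_gradient(theta, x, y):
    # the closed form derived above: (y - p) * x_j for each j
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - p) * xj for xj in x]

def numeric_gradient(theta, x, y, eps=1e-6):
    # central differences, one parameter at a time
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        diff = log_likelihood_one(up, x, y) - log_likelihood_one(down, x, y)
        grad.append(diff / (2 * eps))
    return grad

theta = [0.2, -0.5, 1.0]
x = [1.0, 0.3, 2.0]
y = 1

analytic = analytic_gradient(theta, x, y)
numeric = numeric_gradient(theta, x, y)
for a, n in zip(analytic, numeric):
    print(a, n)
```

If the two columns disagree beyond small numerical noise, the usual suspects are a sign error or a missing factor in the chain rule.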

*NB:*

To find the derivative of $\log(1 - p)$ with respect to $p$, we can use the chain rule:

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

In this case, let's define:
- $f(x) = \log(x)$
- $g(p) = 1 - p$

Then, $\log(1 - p) = f(g(p))$

**Step 1:** Find the derivative of $f(x)$ with respect to $x$.
- $f(x) = \log(x)$
- $f'(x) = \frac{1}{x}$

**Step 2:** Find the derivative of $g(p)$ with respect to $p$.
- $g(p) = 1 - p$
- $g'(p) = -1$

**Step 3:** Apply the chain rule.
$$\frac{d}{dp} \log(1 - p) = f'(g(p)) \cdot g'(p) = \frac{1}{1 - p} \cdot (-1) = -\frac{1}{1 - p}$$

Therefore:

$$\frac{d}{dp} \log(1 - p) = -\frac{1}{1 - p}$$
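The same chain-rule approach gives the sigmoid derivative $\sigma(z)[1-\sigma(z)]$ used in the derivation. A short worked version, writing $\sigma(z) = (1 + e^{-z})^{-1}$:

$$\frac{d}{dz}\sigma(z) = -(1 + e^{-z})^{-2} \cdot \frac{d}{dz}(1 + e^{-z}) = -(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$$

Splitting the fraction into two factors:

$$\frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \cdot \left[1 - \sigma(z)\right]$$

where the last step uses $1 - \sigma(z) = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}$.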

### In code:

First, we'll generate a synthetic dataset to work with. Features are drawn uniformly from [0, 1] and the labels are assigned at random.

In [33]:
import random

In [34]:
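def generate_dataset(num_samples, num_features):
    """Return random features in [0, 1] and random 0/1 labels."""
    xs, ys = [], []
    for _ in range(num_samples):
        features = [random.uniform(0, 1) for _ in range(num_features)]
        xs.append(features)
        # labels are drawn independently of the features
        ys.append(random.choice([0, 1]))
    return xs, ys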

In [35]:
num_features = 2
num_samples = 10
xs, ys = generate_dataset(num_samples, num_features)

In [36]:
list(zip(xs, ys))

[([0.2605950289782042, 0.5574727514210197], 0),
 ([0.5721455288417747, 0.06878027736984715], 1),
 ([0.5143613269858414, 0.3553292437524699], 1),
 ([0.9761390402284876, 0.7078873052442007], 0),
 ([0.9655319251899727, 0.5652320334915052], 1),
 ([0.6189379895865476, 0.02696606885399666], 0),
 ([0.6315529568513656, 0.5676285036286293], 0),
 ([0.3470826325291636, 0.7144521743133823], 1),
 ([0.4989689701228194, 0.3912761508767272], 0),
 ([0.4880566177936432, 0.006889842454781747], 1)]

In [37]:
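def generate_model_weights(num_samples, num_features):
    # Note: this creates one weight vector per sample (num_samples x num_features),
    # so every example gets its own parameters. Standard logistic regression
    # shares a single vector of num_features weights across all examples.
    w = []
    for _ in range(num_samples):
        w.append([random.random() for _ in range(num_features)])
    return w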

In [38]:
weights = generate_model_weights(num_samples, num_features)

In [39]:
import math

In [40]:
# model is sigmoid(wx)
def sigmoid(z): return 1 / (1 + math.exp(-z))

In [41]:
def dot_product(w, x):
    # row-wise dot product: pairs the n-th weight vector with the n-th example
    res = []
    for x_vec, w_vec in zip(x, w):
        res.append(sum(xi * wi for xi, wi in zip(x_vec, w_vec)))
    return res

In [42]:
dot_product(weights, xs)

[0.36465061406789456,
 0.10428776033012813,
 0.11900962949692745,
 0.5000708141638529,
 0.8398581744159634,
 0.10227433501167904,
 0.5459675700646139,
 0.6180184060811562,
 0.5009334195997212,
 0.22400145924726633]

In [43]:
ps = [sigmoid(i) for i in dot_product(weights, xs)]

In [44]:
def log_likelihood(y, p):
    N = len(p)
    total_ll = 0
    for i in range(N):
        ll = y[i] * math.log(p[i]) + (1 - y[i])*(math.log(1 - p[i]))
        total_ll += ll
    return total_ll

In [45]:
log_likelihood(ys, ps)

-7.244542532770307

In [46]:
# gradient computation and a single gradient ascent step
# (ascent, because we are maximizing the log-likelihood)
for n, x in enumerate(xs):
    for m, xi in enumerate(x):
        gradient = (ys[n] - ps[n]) * xi
        weights[n][m] += 1.0 * gradient

In [47]:
# full training loop
weights = generate_model_weights(num_samples, num_features)

for epoch in range(50):
    # forward pass
    ps = [sigmoid(z) for z in dot_product(weights, xs)]
    # log-likelihood of the current predictions (computed before the update)
    ll = log_likelihood(ys, ps)
    # gradient ascent update with learning rate 1.0
    for n, x in enumerate(xs):
        for m, xi in enumerate(x):
            gradient = (ys[n] - ps[n]) * xi
            weights[n][m] += 1.0 * gradient
    if epoch % 10 == 0:
        print(f"Negative Log Likelihood: {-ll}")

Negative Log Likelihood: 7.1381280818011685
Negative Log Likelihood: 1.938974948429435
Negative Log Likelihood: 1.0882342944665075
Negative Log Likelihood: 0.7470847634565279
Negative Log Likelihood: 0.5659523917625765


In [48]:
list(zip(ys, ps))

[(0, 0.05667587183880696),
 (1, 0.9392469086860377),
 (1, 0.9499221535198927),
 (0, 0.014191869518466701),
 (1, 0.9840701024407983),
 (0, 0.05527682744732184),
 (0, 0.028831062184110582),
 (1, 0.9677331060701538),
 (0, 0.05416766887041442),
 (1, 0.9174002515459945)]

In [49]:
[round(p) for p in ps]

[0, 1, 1, 0, 1, 0, 0, 1, 0, 1]

In [50]:
ys

[0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
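The cells above allocate a separate weight vector for every training example, which is why the predictions can match purely random labels exactly. For comparison, here is a minimal sketch of the shared-parameter model from the math section, with a single $\theta$ and $x_0 = 1$ as the bias input. Several things here are assumptions made for this example and are not part of the original notebook: the labelling rule `x1 + x2 > 1` (so that there is something to learn), the seed, the learning rate, the number of epochs, and averaging the gradient over the batch.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # sigma(theta^T x) with one theta shared by all examples
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_likelihood(theta, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(theta, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def train(xs, ys, lr=0.5, epochs=200):
    theta = [0.0] * len(xs[0])
    n = len(xs)
    for _ in range(epochs):
        # accumulate (y - p) * x_j over the whole batch
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            p = predict(theta, x)
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj
        # gradient ascent step on the averaged gradient
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta

# Assumed synthetic data: the label depends on the features.
random.seed(0)
xs, ys = [], []
for _ in range(50):
    x1, x2 = random.uniform(0, 1), random.uniform(0, 1)
    xs.append([1.0, x1, x2])  # x_0 = 1 acts as the bias input
    ys.append(1 if x1 + x2 > 1 else 0)

theta = train(xs, ys)
ll_before = log_likelihood([0.0, 0.0, 0.0], xs, ys)
ll_after = log_likelihood(theta, xs, ys)
print(theta)
print(ll_before, ll_after)
```

With a shared `theta` the model has only three parameters, so it can no longer memorise arbitrary labels; how well it fits depends on whether the labels are actually related to the features.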