
<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60"></center>

AI TECH - Academy of Innovative Applications of Digital Technologies. Operational Programme Digital Poland 2014-2020
<hr>

<center><img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'></center>

<center>
Project co-financed by the European Union under the European Regional Development Fund
Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation"
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)" (Academy of Innovative Applications of Digital Technologies)
    </center>

Let's start with the simplest example: points on a plane that are linearly separable, up to a small amount of label noise.

In [None]:
import numpy as np

np.random.seed(123)

# these parametrize the line
a = 0.3
b = -0.2
c = 0.001

# True/False mapping
def lin_rule(x, noise=0.):
    return a * x[0] + b * x[1] + c + noise < 0.

# Just for plotting
def get_y_fun(a, b, c):
    def y(x):
        return - x * a / b - c / b
    return y

lin_fun = get_y_fun(a, b, c)

In [None]:
# Training data

n = 500
range_points = 1
sigma = 0.05

X = range_points * 2 * (np.random.rand(n, 2) - 0.5)
y = [lin_rule(x, sigma * np.random.normal()) for x in X]

print(X[:10])
print(y[:10])

[[ 0.39293837 -0.42772133]
 [-0.54629709  0.10262954]
 [ 0.43893794 -0.15378708]
 [ 0.9615284   0.36965948]
 [-0.0381362  -0.21576496]
 [-0.31364397  0.45809941]
 [-0.12285551 -0.88064421]
 [-0.20391149  0.47599081]
 [-0.63501654 -0.64909649]
 [ 0.06310275  0.06365517]]
[False, True, False, False, False, True, False, True, True, False]


Let's plot the data.

In [None]:
import plotly.express as px

# plotly has a problem with coloring boolean values, hence stringify
# see https://community.plotly.com/t/plotly-express-scatter-color-not-showing/25962
fig = px.scatter(x=X[:, 0], y=X[:, 1], color=list(map(str, y)))
# the border is drawn over the range of the first coordinate only
x_range = [np.min(X[:, 0]), np.max(X[:, 0])]
fig.add_scatter(x=x_range, y=list(map(lin_fun, x_range)), name='ground truth border')
fig.show()

# Logistic regression

In this exercise you will train a logistic regression model via gradient descent in two simple scenarios.

The general setup is as follows:
* we are given a set of $n$ pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^D$ is a vector of real numbers representing the features, and $y_i \in \{0,1\}$ is the target,
* for a given $x$ we model the probability of $y=1$ by $h(x):=g(w^Tx)$, where $g$ is the sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$,
* to find the right $w$ we will minimize the so-called logarithmic loss: $J(w) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log{h(x_i)} + (1-y_i) \log{(1-h(x_i))} \right]$,
* with the loss function in hand we can improve our guesses iteratively:
    * $w_j^{t+1} = w_j^t - \alpha \cdot \frac{\partial J(w^t)}{\partial w_j}$, where $\alpha$ is the step size (learning rate),
* we can end the process after some predefined number of epochs (or when the changes are no longer meaningful).
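
The update rule needs the partial derivatives of $J$. For the sigmoid model they simplify to $\frac{\partial J(w)}{\partial w_j} = \frac{1}{n}\sum_{i=1}^n (h(x_i) - y_i)\, x_{ij}$. The sketch below is only a sanity check of that formula against central finite differences; the helper names and the random demo data (`X_demo`, `y_demo`, `w_demo`) are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    # logarithmic loss J(w) averaged over all samples
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(w, X, y):
    # closed-form gradient: X^T (h - y) / n
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def numerical_gradient(w, X, y, eps=1e-6):
    # central finite differences, one coordinate at a time
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (log_loss(w + step, X, y) - log_loss(w - step, X, y)) / (2 * eps)
    return grad

# hypothetical demo data, unrelated to the datasets used in the exercises
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.5).astype(float)
w_demo = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_gradient(w_demo, X_demo, y_demo)
                         - numerical_gradient(w_demo, X_demo, y_demo)))
print(max_diff)
```

If the two gradients agree up to a tiny numerical error, the closed form can be used directly in the gradient descent loop.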

Now, let's implement and train a logistic regression model. You should obtain accuracy of at least 96%.

In [None]:
################################################################
# TODO: Implement logistic regression and compute its accuracy #
################################################################

class logistic_regression:
    # initialise lr and number of iterations
    def __init__(self, lr=0.01, num_iter=10000):
        self.lr = lr
        self.num_iter = num_iter

    def add_intercept(self, X):
        # np.c_ concatenates along the second axis (columns), np.r_ along the first;
        # here it appends a column of ones, so the last weight acts as the intercept
        return np.c_[X, np.ones(X.shape[0])]

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def loss(self, h, y):
        # computes logarithmic loss (vectorized; y may be a list of 0/1 or booleans)
        y = np.asarray(y, dtype=float)
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    def accuracy(self, probs, y):
        return np.mean(probs.round() == y)

    def fit(self, X, y):
        X = self.add_intercept(X)

        # weights initialization
        self.weights = np.zeros(X.shape[1])

        for i in range(self.num_iter):
            h = self.predict(X)
            gradient = np.dot(X.T, (h - y)) / len(y)
            self.weights -= self.lr * gradient

            h = self.predict(X)
            loss = self.loss(h, y)
            accuracy = self.accuracy(h, y)

            if i % 1000 == 0:
                print(f'loss: {loss}, accuracy: {accuracy} \t')


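    def predict(self, X, add_intercept=False):
        # returns probabilities of the positive class; set add_intercept=True
        # when X does not yet contain the column of ones
        if add_intercept:
            X = self.add_intercept(X)
        return self.sigmoid(np.dot(X, self.weights))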

In [None]:
log_reg = logistic_regression()
log_reg.fit(X, y)
weights = log_reg.weights

loss: 0.6925617280155203, accuracy: 0.962 	
loss: 0.3991331607038633, accuracy: 0.958 	
loss: 0.3086509020333399, accuracy: 0.964 	
loss: 0.26426610466726164, accuracy: 0.966 	
loss: 0.23719856803349046, accuracy: 0.968 	
loss: 0.2186215351376166, accuracy: 0.97 	
loss: 0.20490283042866578, accuracy: 0.968 	
loss: 0.1942568706658405, accuracy: 0.968 	
loss: 0.18569519307301224, accuracy: 0.968 	
loss: 0.17862214783156521, accuracy: 0.966 	


In [None]:
print(weights)

[-4.74839233  3.21438095  0.06002603]


Let's visually assess our model. We can do this by using the learned weights as our estimates of $a, b, c$.
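
The learned weights are not expected to equal $a, b, c$ themselves: the decision boundary $ax + by + c = 0$ is unchanged when all three parameters are multiplied by the same nonzero constant, and here the sign is flipped as well, because `lin_rule` labels a point `True` when $ax + by + c < 0$. A rough way to compare the two lines is therefore to look at the slope and intercept of each. The sketch below assumes the weights printed above (copied by hand); `slope_intercept` is a hypothetical helper.

```python
import numpy as np

def slope_intercept(params):
    # rewrite a*x + b*y + c = 0 as y = slope * x + intercept
    a, b, c = params
    return -a / b, -c / b

true_params = np.array([0.3, -0.2, 0.001])
# weights copied from the printed output above (assumed, not recomputed)
learned_params = np.array([-4.74839233, 3.21438095, 0.06002603])

true_slope, true_intercept = slope_intercept(true_params)
est_slope, est_intercept = slope_intercept(learned_params)

print(f"true line:      y = {true_slope:.3f} * x + {true_intercept:.3f}")
print(f"estimated line: y = {est_slope:.3f} * x + {est_intercept:.3f}")
```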

In [None]:
#################################################################
# TODO: Pass your estimates for a,b,c to the get_y_fun function #
#################################################################
lin_fun2 = get_y_fun(weights[0], weights[1], weights[2])

fig = px.scatter(x=X[:, 0], y=X[:, 1], color=list(map(str, y)))
# the borders are drawn over the range of the first coordinate only
x_range = [np.min(X[:, 0]), np.max(X[:, 0])]
fig.add_scatter(x=x_range, y=list(map(lin_fun, x_range)), name='ground truth border')
fig.add_scatter(x=x_range, y=list(map(lin_fun2, x_range)), name='estimated border')
fig.show()

Let's now complicate things a little and make our next problem nonlinear.

In [None]:
# Parameters of the ellipse
s1 = 1.
s2 = 2.
r = 0.75
m1 = 0.15
m2 = 0.125

# 0/1 mapping, checks whether we are inside the ellipse
def circle_rule(x, y, noise=0.):
    return 1 if s1 * (x - m1) ** 2 + s2 * (y - m2) ** 2 + noise < r ** 2 else 0

In [None]:
# Training data

n = 500
range_points = 1

sigma = 0.1

X = range_points * 2 * (np.random.rand(n, 2) - 0.5)

# x1, x2 are the two coordinates of a point (named so as not to shadow the target list y)
y = [circle_rule(x1, x2, sigma * np.random.normal()) for x1, x2 in X]

print(X[:10])
print(y[:10])

[[ 0.18633789  0.87560968]
 [-0.81999293  0.61838609]
 [ 0.22604784  0.28001611]
 [ 0.9846182  -0.35783437]
 [-0.27962406  0.07170775]
 [ 0.2501677  -0.37650776]
 [ 0.41264707 -0.8357508 ]
 [-0.61039043 -0.97349628]
 [ 0.49924022  0.89579621]
 [ 0.537422   -0.65425777]]
[0, 0, 1, 0, 1, 1, 0, 0, 0, 0]


Let's plot the data.

In [None]:
import plotly.graph_objects as go

fig = px.scatter(x=X[:, 0], y=X[:, 1], color=list(map(str, y)))

xgrid = np.arange(np.min(X[:, 0]), np.max(X[:, 0]), 0.003)
ygrid = np.arange(np.min(X[:, 1]), np.max(X[:, 1]), 0.003)
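# With indexing="ij" the grid is indexed as [x, y], while go.Contour expects
# z[row][col] with rows along y and columns along x, hence the transpose.
contour = go.Contour(
        z=np.vectorize(circle_rule)(*np.meshgrid(xgrid, ygrid, indexing="ij")).T,
        x=xgrid,
        y=ygrid
    )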
fig.add_trace(contour)
fig.show()

Now, let's train a logistic regression model to tackle this problem. Note that we now need a nonlinear decision boundary. You should obtain accuracy of at least 90%.

Hint:
<sub><sup><sub><sup><sub><sup>
Use feature engineering.
</sup></sub></sup></sub></sup></sub>
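
One way to see why extra features can help: expanding $s_1 (x - m_1)^2 + s_2 (y - m_2)^2 - r^2$ gives an expression that is linear in $x$, $y$, $x^2$, $y^2$ and a constant term, so a linear model on those features can represent the elliptical border. The sketch below checks this identity numerically on a few random points; `coeffs`, `pts` and the other names are illustrative assumptions, with the ellipse parameters repeated from the cell above.

```python
import numpy as np

# ellipse parameters repeated from the cell that defines circle_rule
s1, s2, r, m1, m2 = 1.0, 2.0, 0.75, 0.15, 0.125

# coefficients of x, y, x^2, y^2 and the constant term after expanding the squares
coeffs = np.array([
    -2 * s1 * m1,
    -2 * s2 * m2,
    s1,
    s2,
    s1 * m1 ** 2 + s2 * m2 ** 2 - r ** 2,
])

# a few hypothetical test points in [-1, 1]^2
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(5, 2))

features = np.c_[pts, pts[:, 0] ** 2, pts[:, 1] ** 2, np.ones(len(pts))]
direct = s1 * (pts[:, 0] - m1) ** 2 + s2 * (pts[:, 1] - m2) ** 2 - r ** 2
expanded = features @ coeffs

print(np.allclose(direct, expanded))
```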

In [None]:
################################################################
# TODO: Implement logistic regression and compute its accuracy #
################################################################

def add_features(X):
  return np.c_[X, X[:, 0]**2, X[:, 1]**2]  # the ellipse formula is based on x, y, x^2, y^2

new_X = add_features(X)
log_reg = logistic_regression()
log_reg.fit(new_X, y)

loss: 0.6923348184492631, accuracy: 0.706 	
loss: 0.5030357755877368, accuracy: 0.706 	
loss: 0.45744107814903645, accuracy: 0.706 	
loss: 0.42452765165690504, accuracy: 0.744 	
loss: 0.3984697206813147, accuracy: 0.814 	
loss: 0.37697904902696444, accuracy: 0.842 	
loss: 0.35882825488425046, accuracy: 0.868 	
loss: 0.34324512723080886, accuracy: 0.896 	
loss: 0.3296957185342701, accuracy: 0.9 	
loss: 0.31779016261831705, accuracy: 0.91 	


In [None]:
weights = log_reg.weights
print(weights)

[ 1.04200759  1.21334966 -2.39726094 -4.02381066  0.74113797]


Let's visually assess our model.

Contrary to the previous scenario, converting our weights to parameters of the ground truth curve may not be straightforward. It's easier to just provide predictions for a set of points in $R^2$.
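
For the particular feature set $x, y, x^2, y^2$ plus an intercept, a partial conversion is still possible by completing the square: the border $w_1 x + w_2 y + w_3 x^2 + w_4 y^2 + w_5 = 0$ is centred at $\left(-\frac{w_1}{2 w_3}, -\frac{w_2}{2 w_4}\right)$. The sketch below is only a rough check under that assumption, using the weights printed above (copied by hand); the variable names are hypothetical.

```python
import numpy as np

# weights copied from the printed output above (assumed order: x, y, x^2, y^2, intercept)
w1, w2, w3, w4, w5 = np.array([1.04200759, 1.21334966, -2.39726094, -4.02381066, 0.74113797])

# centre of the estimated ellipse, obtained by completing the square
est_m1 = -w1 / (2 * w3)
est_m2 = -w2 / (2 * w4)

# ratio of the quadratic coefficients, to be compared with s2 / s1
est_ratio = w4 / w3

print(f"estimated centre: ({est_m1:.3f}, {est_m2:.3f}), ground truth: (0.15, 0.125)")
print(f"estimated s2/s1:  {est_ratio:.3f}, ground truth: 2.0")
```

The estimates are only approximate, since the labels are noisy and gradient descent was stopped after a fixed number of iterations.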

In [None]:
h = .02

xgrid = np.arange(np.min(X[:, 0]), np.max(X[:, 0]), h)
ygrid = np.arange(np.min(X[:, 1]), np.max(X[:, 1]), h)

xx, yy = np.meshgrid(xgrid, ygrid, indexing="ij")
X_plot = np.c_[xx.ravel(), yy.ravel()] # .ravel() = np.reshape(-1)


_X = np.concatenate([X_plot, X_plot**2], axis=1)
print(_X.shape)

preds = log_reg.predict(_X, add_intercept = True)
print(preds.shape)


(10000, 4)
(10000,)


In [None]:
fig = px.scatter(x=X[:, 0], y=X[:, 1], color=list(map(str, y)))

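# preds were computed on an "ij" grid, i.e. indexed as [x, y], while go.Contour
# expects z[row][col] with rows along y and columns along x, hence the transpose.
contour = go.Contour(z=preds.reshape(len(xgrid), len(ygrid)).T, x=xgrid, y=ygrid)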
fig.add_trace(contour)
fig.show()
