<a href="https://colab.research.google.com/github/Uysim/logistic-regression/blob/master/logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

In [0]:
breast_cancer = load_breast_cancer()
data = breast_cancer["data"]
target = breast_cancer["target"]

In [0]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, shuffle=True)

For inputs with a very large magnitude, the sigmoid function saturates and its floating-point output rounds to exactly 0 or 1. The loss then takes the logarithm of zero, which is negative infinity and cannot be used in further computations. That's why we manually bound the sigmoid output to the range [0.0001, 0.9999]:

In [0]:
def sigmoid(x):
  # Clip the output to [0.0001, 0.9999] so that log() never receives 0
  return np.clip(1 / (1 + np.exp(-x)), 0.0001, 0.9999)
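
To see why the bounds matter, the sketch below compares an unbounded sigmoid with the bounded one. The helper name `raw_sigmoid` and the input value 40 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def raw_sigmoid(x):
  # Hypothetical unbounded variant, used only for comparison
  return 1 / (1 + np.exp(-x))

def sigmoid(x):
  # Same bounds as the notebook's sigmoid
  return np.maximum(np.minimum(1 / (1 + np.exp(-x)), 0.9999), 0.0001)

x = 40.0  # large enough that exp(-x) vanishes next to 1 in float64

# Unbounded: the sigmoid rounds to exactly 1.0, so log(1 - 1.0) is log(0)
with np.errstate(divide="ignore"):
  raw_loss_term = np.log(1 - raw_sigmoid(x))

# Bounded: the sigmoid is capped at 0.9999, so the logarithm stays finite
bounded_loss_term = np.log(1 - sigmoid(x))

print(raw_loss_term, bounded_loss_term)
```

The unbounded term is negative infinity, while the bounded term is roughly log(0.0001). The price of the bounds is that the loss of a single example can never exceed that magnitude.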

![alt text](https://i.imgur.com/Ke0U8oG.png)

In [0]:
def lost_function(x, y, theta):
  # Total log loss (binary cross-entropy) summed over all examples
  h = sigmoid(x.dot(theta))
  return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))


Next, we need to define the cost function for logistic regression. The idea is to sum, over the training data, a measure of the distance between our hypothesis and the true labels (here, the log loss defined above). The better our parameters fit, the smaller this distance becomes. But how can we minimize it? We will use gradient descent, implemented in the cells that follow the cost function.

The most common approach is to iterate over the training examples and apply the sigmoid to each one, then iterate once more to sum the losses. With numpy, however, we can apply the sigmoid to the whole array and compute the loss over all examples in just a few lines of code:

In [0]:
def cost_function(x, y, theta):
  return lost_function(x, y, theta) / x.shape[0]
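
As a sanity check on the vectorized approach, the sketch below compares it with an explicit loop over examples. The names `cost_loop` and `cost_vectorized` and the random demo data are assumptions made for illustration; only the formulas mirror the notebook.

```python
import numpy as np

def sigmoid(x):
  return np.maximum(np.minimum(1 / (1 + np.exp(-x)), 0.9999), 0.0001)

def cost_vectorized(x, y, theta):
  # Apply the sigmoid to all examples at once, then average the log loss
  h = sigmoid(x.dot(theta))
  return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / x.shape[0]

def cost_loop(x, y, theta):
  # Same computation, one training example at a time
  total = 0.0
  for xi, yi in zip(x, y):
    h = sigmoid(xi.dot(theta))
    total += yi * np.log(h) + (1 - yi) * np.log(1 - h)
  return -total / len(y)

# Made-up demo data: 20 examples with 3 features each
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(20, 3))
y_demo = rng.integers(0, 2, size=20)
theta_demo = rng.normal(size=3)

loop_cost = cost_loop(x_demo, y_demo, theta_demo)
vectorized_cost = cost_vectorized(x_demo, y_demo, theta_demo)
print(np.isclose(loop_cost, vectorized_cost))
```

Both versions should agree up to floating-point rounding. The vectorized one avoids the Python-level loop, which usually makes it much faster on larger datasets.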


In [0]:
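def gradient_cost_function(x, y, theta):
  # Note the sign: (y - sigmoid) gives the NEGATIVE gradient of the cost,
  # i.e. the direction in which the cost decreases
  t = x.dot(theta)
  return x.T.dot(y - sigmoid(t)) / x.shape[0]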

In [0]:
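def update_theta(x, y, theta, learning_rate):
  # One gradient-descent step: gradient_cost_function already returns the
  # descent direction, so it is added rather than subtracted
  return theta + learning_rate * gradient_cost_function(x, y, theta)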
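
The sign convention in the update step can be checked numerically. The sketch below approximates the gradient of the cost with central finite differences and compares it with `gradient_cost_function`. The helper `numerical_gradient`, the step size `eps`, and the random demo data are assumptions added for illustration.

```python
import numpy as np

def sigmoid(x):
  return np.maximum(np.minimum(1 / (1 + np.exp(-x)), 0.9999), 0.0001)

def cost_function(x, y, theta):
  h = sigmoid(x.dot(theta))
  return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / x.shape[0]

def gradient_cost_function(x, y, theta):
  # Same expression as in the notebook
  return x.T.dot(y - sigmoid(x.dot(theta))) / x.shape[0]

def numerical_gradient(x, y, theta, eps=1e-6):
  # Hypothetical helper: central finite differences, one coordinate at a time
  grad = np.zeros_like(theta)
  for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    grad[j] = (cost_function(x, y, theta + step)
               - cost_function(x, y, theta - step)) / (2 * eps)
  return grad

# Small made-up problem; theta is kept small so the sigmoid never hits its bounds
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 4))
y_demo = rng.integers(0, 2, size=50)
theta_demo = rng.normal(scale=0.1, size=4)

analytic = gradient_cost_function(x_demo, y_demo, theta_demo)
numeric = numerical_gradient(x_demo, y_demo, theta_demo)

# The notebook's function should match the negated numerical gradient
print(np.allclose(analytic, -numeric, atol=1e-6))
```

If the check passes, adding `learning_rate * gradient_cost_function(...)` to `theta` moves the parameters downhill on the cost, which is ordinary gradient descent.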

In [0]:
def test_train(x, y, learning_rate, iterations=500, threshold=0.0005):
  theta = np.zeros(x.shape[1])
  costs = []
  print(theta)
test_train(X_train, y_train, learning_rate=0.0001)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]


In [0]:
def train(x, y, learning_rate, iterations=500, threshold=0.0005):
  theta = np.zeros(x.shape[1])
  costs = []
  print("Start training")
  
  for i in range(iterations):
    theta = update_theta(x, y, theta, learning_rate)
    cost = cost_function(x, y, theta)
    print(f"[Training step #{i}] — Cost function: {cost:.4f}")
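    costs.append({"cost": cost, "weights": theta})

    # Stop early once the cost changes by less than `threshold` between
    # consecutive steps (checked only after the first 16 steps)
    if i > 15 and abs(costs[-2]["cost"] - costs[-1]["cost"]) < threshold:
      break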
  return theta, costs

theta, costs = train(X_train, y_train, learning_rate=0.0001)

In [0]:
def predict(x, theta):
  return (sigmoid(x.dot(theta)) >= 0.5).astype(int)

Let's compare how the predicted labels differ from the true ones:

In [0]:
def get_accuracy(x, y, theta):
  y_pred = predict(x, theta)
  return (y_pred == y).sum() / y.shape[0]
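
A tiny worked example may help show what `predict` and `get_accuracy` compute. The toy data and the weight below are made up for illustration; the three functions repeat the notebook's definitions so the snippet runs on its own.

```python
import numpy as np

def sigmoid(x):
  return np.maximum(np.minimum(1 / (1 + np.exp(-x)), 0.9999), 0.0001)

def predict(x, theta):
  return (sigmoid(x.dot(theta)) >= 0.5).astype(int)

def get_accuracy(x, y, theta):
  y_pred = predict(x, theta)
  return (y_pred == y).sum() / y.shape[0]

# Made-up data: four examples with a single feature, and one weight
x_toy = np.array([[-2.0], [-0.5], [0.5], [2.0]])
y_toy = np.array([0, 1, 1, 1])
theta_toy = np.array([1.0])

# sigmoid(z) >= 0.5 exactly when z >= 0, so negative inputs map to class 0
y_pred_toy = predict(x_toy, theta_toy)
accuracy_toy = get_accuracy(x_toy, y_toy, theta_toy)

print(y_pred_toy)    # [0 0 1 1]
print(accuracy_toy)  # 0.75
```

Three of the four predictions match the labels, so the accuracy is 0.75. Note that the value is a fraction between 0 and 1, not a percentage.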

In [0]:
print(f"Accuracy on the training set: {get_accuracy(X_train, y_train, theta)}")
print(f"Accuracy on the test set: {get_accuracy(X_test, y_test, theta)}")

In [0]:
plt.figure(figsize=(15,10))
plt.title("Model accuracy depending on the training step")
plt.plot(np.arange(0, len(costs)), [get_accuracy(X_train, y_train, c["weights"]) for c in costs], alpha=0.7, label="Train", color="r")
plt.plot(np.arange(0, len(costs)), [get_accuracy(X_test, y_test, c["weights"]) for c in costs], alpha=0.7, label="Test", color="b")
plt.xlabel("Number of iterations")
plt.ylabel("Accuracy (fraction of correct predictions)")
plt.legend(loc="best")
plt.grid(True)
plt.xticks(np.arange(0, len(costs)+1, 40))
plt.yticks(np.arange(0.5, 1, 0.1))
plt.show()