# Logistic Regression
In logistic regression, we want to predict a probability between 0 and 1 instead of an unbounded continuous value.\
Recall the formula for Linear Regression:
$$f(w,b) = wx + b$$
## Sigmoid Function
$$s(x) = \frac{1}{1+e^{-x}}$$
## Logistic Approximation
$$\hat{y} = h_{\theta}(x) = \frac{1}{1+e^{-(wx+b)}}$$

## Cost Function

$$J(w,b) = J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}[y^i\log(h_{\theta}(x^i))+(1-y^i)\log(1-h_{\theta}(x^i))]$$
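
As a rough illustration, the cross-entropy cost can be computed with NumPy as sketched below. The sketch includes a leading minus sign so that the cost is non-negative and lower is better. The function name `binary_cross_entropy`, the `eps` clipping value, and the toy arrays are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean cross-entropy between 0/1 labels and predicted probabilities."""
    # Clip so that log() never receives exactly 0 or 1 (eps is an assumed value)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Toy data (hypothetical): the same labels scored by good and bad predictions
y = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_wrong = 1 - confident_right

print("cost, good predictions:", binary_cross_entropy(y, confident_right))
print("cost, bad predictions: ", binary_cross_entropy(y, confident_wrong))
```

Predictions that put high probability on the wrong class are penalized much more heavily than predictions that are merely unsure, which is why this cost suits probability outputs.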

To minimize $J$ with respect to $w$ and $b$, we use gradient descent.

## Update Rules

$\begin{aligned}&w = w - \alpha\cdot dw\\
&b = b - \alpha\cdot db\end{aligned}$

$$J'(\theta) = \begin{bmatrix}\frac{dJ}{dw}\\ \frac{dJ}{db}\end{bmatrix} = \begin{bmatrix}\frac{1}{N}\sum_{i=1}^{N}x^i(\hat{y}^i - y^i)\\ \frac{1}{N}\sum_{i=1}^{N}(\hat{y}^i - y^i)\end{bmatrix}$$
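
One way to sanity-check the gradient is to compare it with a finite-difference estimate of the cross-entropy cost. The sketch below uses the form that the `fit` method in this notebook implements, $\frac{1}{N}X^T(\hat{y} - y)$, with no factor of 2 (that factor belongs to the mean-squared-error gradient of linear regression). All helper names and the random toy data are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    # Cross-entropy cost with the leading minus sign
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, b, X, y):
    # Same expressions as dw and db in the fit method
    err = sigmoid(X @ w + b) - y
    return X.T @ err / len(y), err.mean()

def numeric_grad(w, b, X, y, h=1e-6):
    # Central differences, one parameter at a time
    dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        dw[j] = (cost(w + step, b, X, y) - cost(w - step, b, X, y)) / (2 * h)
    db = (cost(w, b + h, X, y) - cost(w, b - h, X, y)) / (2 * h)
    return dw, db

# Small random problem (hypothetical data, fixed seed for repeatability)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) > 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

dw_a, db_a = analytic_grad(w, b, X, y)
dw_n, db_n = numeric_grad(w, b, X, y)
print("max |dw difference|:", np.max(np.abs(dw_a - dw_n)))
print("|db difference|:    ", abs(db_a - db_n))
```

If the two gradients agree to several decimal places, the analytic expressions are consistent with the cost function.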

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
import matplotlib.pyplot as plt

In [2]:
class LogisticRegression:
    def __init__(self, lr = 0.001, n_iters = 1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None
    
    def fit(self,X,y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
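        # gradient descent
        for _ in range(self.n_iters):
            # predicted probabilities: sigmoid of the linear model
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self._sigmoid(linear_model)
            
            # gradients of the cross-entropy cost with respect to weights and bias
            dw = (1/n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1/n_samples) * np.sum(y_predicted - y)
            
            # step against the gradient, scaled by the learning rate
            self.weights -= self.lr * dw
            self.bias -= self.lr * db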
    
    def predict(self,X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self._sigmoid(linear_model)
        # label a sample 1 when its predicted probability exceeds 0.5, else 0
        y_predicted_cls = [1 if (i > 0.5) else 0 for i in y_predicted]
        return y_predicted_cls
    
    def _sigmoid(self,x):
        return 1/(1+np.exp(-x))

In [3]:
# import data
bc = datasets.load_breast_cancer()    # 2 class problem
X,y = bc.data, bc.target
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 412)

In [4]:
def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    return np.sum(y_true == y_pred) / len(y_true)

In [5]:
regressor = LogisticRegression(lr=0.0001,n_iters=1000)
regressor.fit(X_train,y_train)
predictions = regressor.predict(X_test)
print("LR classification accuracy: ", accuracy(y_test,predictions))

LR classification accuracy:  0.9385964912280702


In [6]:
# Problems I observed:
    # The accuracy stays the same as long as random_state and the other parameters are unchanged
    # The accuracy is 0.5 when random_state = 42
    # Changing lr to 0.001 results in a runtime warning
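
Regarding the third observation: the notebook does not show the warning text, so the following is an assumption. A likely cause is overflow in `np.exp(-x)` inside `_sigmoid`, because the features are unscaled and a larger learning rate can push the linear model to very large negative values (float64 `exp` overflows once its argument exceeds roughly 709). A minimal sketch of a sigmoid that avoids this overflow is shown below; `stable_sigmoid` is a hypothetical name, not part of the class above.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid for array input that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)), where exp(z) < 1
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(stable_sigmoid(z))
```

This only removes the overflow in the exponential. If the warning comes from very large update steps, rescaling the features or keeping the smaller learning rate may still be needed.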

The following points, quoted from Ler Wei Han on $\href{https://towardsdatascience.com/manipulating-machine-learning-results-with-random-state-2a6f49b31081}{towardsdatascience.com}$, answer my questions about the random state above:
1) Fix the random state from the start
    - Fix a global random seed so that randomness does not come into play
2) Report the prediction results as an interval
    - Repeat the run with different seeds to produce a confidence interval that you can report
    - This is the range within which one can comfortably say the model's performance really lies
3) Reduce imbalance/randomness in the data split
    - Make sure the split does not change the composition of the data too much
        - Stratify your data to reduce randomness
            - The data for your train/test split, OOB error, or cross-validation then has the same ratio of survivors to non-survivors in the train and test sets
            - This preserves the percentage of each class in the splits
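
Points 2 and 3 can be sketched as follows: repeat a stratified split over several seeds and report the spread of the test accuracy. To keep the block self-contained, scikit-learn's own `LogisticRegression` (imported here as `SkLogisticRegression`) with a `StandardScaler` stands in for the class defined in this notebook. The number of seeds and the mean/standard-deviation summary are assumptions for illustration, not a formal confidence interval.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)

scores, ratio_gaps = [], []
for seed in range(5):  # five seeds is an arbitrary choice
    # stratify=y keeps the class proportions the same in train and test
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # gap between the positive-class fraction in train and in test
    ratio_gaps.append(abs(y_tr.mean() - y_te.mean()))

    model = make_pipeline(StandardScaler(), SkLogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))  # test accuracy for this seed

scores = np.array(scores)
print("accuracy per seed:", scores)
print("mean and std:", scores.mean(), scores.std())
print("largest class-ratio gap:", max(ratio_gaps))
```

Reporting the spread across seeds gives a more honest picture than a single accuracy figure from one favorable `random_state`.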