<a href="https://colab.research.google.com/github/findalexli/ML_algo_with_numpy/blob/main/Logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np

In [2]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
# Binary target: 1 for class 1 (versicolor), 0 for the other two classes
y = np.array(y == 1, dtype = int)

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In logistic regression we find a line or plane (more generally, a hyperplane), called the decision boundary, that (almost) linearly separates the two classes in binary classification.

$$ \text{Log loss} = \frac{1}{n} \sum_{i=1}^{n} \left[ -y_i \log(y'_i) - (1 - y_i) \log(1 - y'_i) \right] $$

where $n$ is the number of data points and $y'$ is the vector of predicted probabilities.

$$ \text{Gradient} = \frac{1}{n} X^T (y' - y) $$
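
As a sanity check on the loss and its gradient, the sketch below compares the analytic gradient of the mean log loss, $\frac{1}{n} X^T (y' - y)$, with a central finite-difference estimate. The helper names (`log_loss`, `analytic_grad`, `numeric_grad`) and the random data are assumptions made for illustration; they are not part of the class defined in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    # Mean binary cross-entropy for predicted probabilities sigmoid(X @ theta)
    p = sigmoid(X @ theta)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def analytic_grad(theta, X, y):
    # (1/n) * X^T (y' - y)
    p = sigmoid(X @ theta)
    return X.T @ (p - y) / len(y)

def numeric_grad(theta, X, y, h=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        g[j] = (log_loss(theta + step, X, y) - log_loss(theta - step, X, y)) / (2 * h)
    return g

# Synthetic data, used only for this check
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.5).astype(int)
theta_demo = rng.normal(size=3)

max_diff = np.max(np.abs(
    analytic_grad(theta_demo, X_demo, y_demo) - numeric_grad(theta_demo, X_demo, y_demo)
))
print(max_diff)  # should be close to zero if the formula is right
```

At `theta = 0` every predicted probability is 0.5, so the loss equals $\log 2 \approx 0.693$, which is the first loss value printed when the model below is trained.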


In [7]:
class LogisitcRegression():
    """
    cls = LogisitcRegression()
    cls.fit(X, y)
    cls.predict(X_test)
    """
    def __init__(self, learning_rate = 0.001, max_epochs = 1000, eps = 1e-7, fit_intercept=True):
        self.learning_rate = learning_rate
        self.max_epochs = max_epochs
        self.eps = eps
        self.fit_intercept = fit_intercept


    def fit(self, X, y):
        if self.fit_intercept:
            X = self._add_intercept(X)
        n, dim = X.shape
        self.theta = np.zeros(dim)
        self.hist_loss = []  # loss per epoch, kept for inspection after training
        for iter_idx in range(self.max_epochs):
            # Copy, because the in-place update below would otherwise change prev_theta too
            prev_theta = self.theta.copy()
            prob = _sigmoid(np.dot(X, self.theta))
            loss = (-y * np.log(prob) - (1 - y) * np.log(1 - prob)).mean()
            self.hist_loss.append(loss)

            # Gradient of the mean log loss with respect to theta
            grad = (1/n) * np.dot(X.T, (prob - y))
            if iter_idx % 1000 == 0:
                print(f"Loss = {loss}")
                print(f"Grad = {grad}")

            self.theta -= grad * self.learning_rate
            # Stop once the parameter update is smaller than eps
            if iter_idx > 10 and np.linalg.norm(prev_theta - self.theta) < self.eps:
                print('Early stopping')
                break

    def predict(self, X_test, threshold = 0.5):
        if self.fit_intercept:
            X_test = self._add_intercept(X_test)

        bool_array =  _sigmoid(np.dot(X_test, self.theta)) >= threshold
        return bool_array.astype(int)

    def _add_intercept(self, X):
        intercept_term = np.ones(len(X))
        return np.column_stack((intercept_term, X))

def _sigmoid(x):
    return 1 / (1 + np.exp(-x))


In [8]:
from sklearn.metrics import classification_report, confusion_matrix
cls = LogisitcRegression()

cls.fit(X_train, y_train)
preds = cls.predict(X_test)
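# scikit-learn's signature is classification_report(y_true, y_pred). With the
# arguments in this order, the precision and recall columns in the output are swapped.
classification_report(preds, y_test)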




Loss = 0.6931471805599453
Grad = [ 0.15833333 -0.04595683  0.22225439 -0.10079432 -0.06321999]


'              precision    recall  f1-score   support\n\n           0       0.71      0.83      0.77        18\n           1       0.67      0.50      0.57        12\n\n    accuracy                           0.70        30\n   macro avg       0.69      0.67      0.67        30\nweighted avg       0.70      0.70      0.69        30\n'

F1 = 2 * precision * recall / (precision + recall)

In [6]:
from sklearn.metrics import roc_auc_score
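# scikit-learn's signature is roc_auc_score(y_true, y_score); y_score is normally
# a probability or score rather than a hard 0/1 prediction.
roc_auc_score(preds, y_test)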

0.6666666666666666

The ROC curve plots the true positive rate against the false positive rate at various classification
thresholds. The area under the curve (AUC) can be thought of as the degree of separability between the two
classes: it is the probability that a randomly chosen positive example is scored higher than a randomly
chosen negative one.
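
To make the threshold sweep concrete, here is a minimal NumPy sketch that traces the ROC points and integrates the area with the trapezoid rule. The function names `roc_points` and `auc_trapezoid` and the four-sample toy data are assumptions for illustration; in practice `sklearn.metrics.roc_curve` and `roc_auc_score` do this work.

```python
import numpy as np

def roc_points(y_true, scores):
    # Sweep thresholds from high to low; each threshold gives one (FPR, TPR) point
    thresholds = np.unique(scores)[::-1]
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    for t in thresholds:
        pred = scores >= t
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    # Area under the piecewise-linear curve through the ROC points
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Toy labels and scores, used only for this example
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo = roc_points(y_true_demo, scores_demo)
auc_demo = auc_trapezoid(fpr_demo, tpr_demo)
print(auc_demo)  # 0.75
```

The curve is traced from continuous scores. With hard 0/1 predictions only a single threshold is available, so the curve collapses to one interior point.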

Precision is defined as P = TP / (TP + FP): of all the observations we predicted as positive, how many were
actually positive? Recall is defined as R = TP / (TP + FN): of all the observations that are actually positive,
how many did we find?
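
These definitions can be computed directly from the confusion counts. The sketch below is a small NumPy illustration; the function name `precision_recall_f1` and the toy label arrays are assumptions, not part of the notebook.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    # Confusion counts for the positive class (label 1)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # Guard against division by zero when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 2 false negatives
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 1, 0, 1, 0, 0, 0, 0])

p_toy, r_toy, f1_toy = precision_recall_f1(y_true_toy, y_pred_toy)
print(p_toy, r_toy, f1_toy)  # 2/3, 1/2, 4/7
```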

ROC curves can look overly optimistic on very imbalanced datasets, even when the model misclassifies most of
the minority class, because the false positive rate is dominated by the large number of true negatives. The
precision-recall curve does not use true negatives, so it is usually more informative in these situations.

Extending to multiple classes

1. one vs all others
2. use the softmax function instead of the sigmoid function $$\text{softmax}(Z)_i = \frac{e^{Z_i}}{\sum_j e^{Z_j}}$$
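
For the second option, a minimal NumPy sketch of softmax is shown below. Subtracting the row-wise maximum before exponentiating is a common stability trick that leaves the result unchanged; the function name and the demo scores are assumptions for illustration. With two classes and scores `(0, z)`, softmax reduces to the sigmoid of `z`, which is why it is the natural multi-class replacement.

```python
import numpy as np

def softmax(Z):
    # Z holds one row of class scores per sample
    Z = np.atleast_2d(Z)
    # Shift by the row maximum so np.exp never overflows; the ratio is unchanged
    shifted = Z - Z.max(axis=1, keepdims=True)
    exp_Z = np.exp(shifted)
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

# Demo scores: the second row would overflow without the shift
scores_multi = np.array([[1.0, 2.0, 3.0],
                         [1000.0, 1000.0, 1000.0]])
probs_multi = softmax(scores_multi)
pred_class = probs_multi.argmax(axis=1)  # predicted class = highest probability

# Two-class case: softmax of (0, z) matches sigmoid(z)
z = 0.7
two_class = softmax(np.array([0.0, z]))[0, 1]
sigmoid_z = 1.0 / (1.0 + np.exp(-z))
print(probs_multi, pred_class, two_class, sigmoid_z)
```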