## Logistic Regression

Logistic regression is a classic classification method.

The cumulative distribution function of the logistic distribution is an S-shaped curve, the sigmoid function:

$$g(x) = {1\over 1+e^{-x}}$$
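The formula above can overflow in floating point when $x$ is a large negative number, because $e^{-x}$ becomes huge. The snippet below is a minimal sketch of one common workaround, not part of the original notebook; the function name `stable_sigmoid` is an assumption made for this example. It evaluates the same function through two algebraically equivalent forms, chosen by the sign of the input.

```python
import numpy as np


def stable_sigmoid(x):
    """Sigmoid g(x) = 1 / (1 + e^{-x}) without overflow for large |x|."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form e^x / (1 + e^x),
    # so exp() never receives a large positive argument.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out


# Extreme inputs saturate cleanly at 0 and 1; the midpoint is 0.5.
print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```

For the small, well-scaled inputs used later in this section, the direct form `1 / (1 + np.exp(-x))` gives the same values.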

A binary classification problem can be viewed as a Bernoulli trial:

$$P(y=1|x) = p, \quad P(y=0|x) = 1-p$$

Both cases can be written as a single expression:

$$P(y|x)=p^y(1-p)^{1-y}$$

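Assuming the samples are independent and modeling each $p_i$ by the sigmoid output $g(x_i)$, the corresponding likelihood function is:

$$\mathcal{L}(\theta)=\prod_{i=1}^n P(y_i|x_i) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i} = \prod_{i=1}^n g(x_i)^{y_i} (1-g(x_i))^{1-y_i}$$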

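Taking the logarithm gives the log-likelihood, where the last step writes the model output explicitly as $g(\theta^Tx_i)$:

$$L(\theta) = \log \prod_{i=1}^n g(x_i)^{y_i} (1-g(x_i))^{1-y_i}$$

$$=\sum_{i=1}^n \left[\log g(x_i)^{y_i} + \log (1-g(x_i))^{1-y_i}\right]$$

$$=\sum_{i=1}^n \left[y_i\log g(x_i) + (1-y_i)\log (1-g(x_i))\right]$$

$$=\sum_{i=1}^n \left[y_i\log g(\theta^Tx_i) + (1-y_i)\log (1-g(\theta^Tx_i))\right]$$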
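Besides simplifying the algebra, the logarithm also helps numerically: a product of many probabilities underflows to zero in floating point, while the sum of their logarithms stays finite. The snippet below is an illustrative sketch only; the probabilities and labels are randomly generated stand-ins, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predicted probabilities and 0/1 labels for 2000 samples.
p = rng.uniform(0.05, 0.95, size=2000)
y = rng.integers(0, 2, size=2000)

# Per-sample Bernoulli likelihood p^y * (1 - p)^(1 - y).
per_sample = p**y * (1 - p) ** (1 - y)

# The raw product is far below the smallest float64 and underflows to 0.0.
likelihood = np.prod(per_sample)

# The log-likelihood is an ordinary finite negative number.
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(likelihood, log_likelihood)

# On a small subset, where nothing underflows, both routes agree.
print(np.isclose(np.log(np.prod(per_sample[:10])),
                 np.sum(np.log(per_sample[:10]))))
```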

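Maximizing the log-likelihood yields the optimal parameter estimate. Equivalently, we negate it and divide by $n$ to obtain the final loss function, which is then minimized:

$$J(\theta) = -{1\over n}L(\theta)=-{1\over n}\sum_{i=1}^n \left[y_i\log g(\theta^Tx_i) + (1-y_i)\log (1-g(\theta^Tx_i))\right]$$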
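To connect this loss to a gradient-based update, one more step is needed. Using the identity $g'(z) = g(z)(1-g(z))$, differentiating $J$ with respect to $\theta$ gives:

$$\nabla_\theta J(\theta) = {1\over n}\sum_{i=1}^n \left(g(\theta^Tx_i) - y_i\right)x_i$$

The snippet below is a minimal sketch that evaluates $J(\theta)$ and checks this gradient against a finite-difference estimate. The helper names `loss`, `gradient`, and `numerical_gradient`, and the toy data, are assumptions made for this example and do not come from the notebook.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(theta, X, y):
    """Average negative log-likelihood J(theta)."""
    p = sigmoid(X @ theta)
    # Clip to avoid log(0) if a prediction saturates at 0 or 1.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def gradient(theta, X, y):
    """Analytic gradient of J: (1/n) * X^T (g(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite-difference estimate of the gradient of J."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (loss(theta + step, X, y) - loss(theta - step, X, y)) / (2 * eps)
    return grad


# Toy data (hypothetical): 4 samples, a bias column plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.1, -0.2])

# At theta = 0 every prediction is 0.5, so J = log(2), about 0.6931.
print(loss(np.zeros(2), X, y))

# The analytic and numerical gradients should agree closely.
print(np.allclose(gradient(theta, X, y), numerical_gradient(theta, X, y), atol=1e-6))
```

Gradient descent on $J$ moves $\theta$ along $\sum_i (y_i - g(\theta^Tx_i))x_i$, which is the direction used by the `fit` method in the implementation that follows, with the factor $1/n$ absorbed into the learning rate.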

In [61]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


In [62]:
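def get_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    data = np.array(df)

    # Keep only the first 100 rows (classes 0 and 1) to get a binary problem;
    # the last column is the label, the remaining columns are the features.
    return data[:100, :-1], data[:100, -1]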
    
    
data, label = get_data()
train_x, test_x, train_y, test_y = train_test_split(data, label, test_size=0.3)

In [336]:
class LogisticRegression:
    def __init__(self, max_iter=200, learning_rate=0.01):
        self.max_iter = max_iter
        self.learning_rate = learning_rate
        
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def data_matrix(self, x):
        # Prepend a column of ones so that the first weight acts as the bias term.
        x = np.asarray(x, dtype=np.float64)
        return np.hstack([np.ones((x.shape[0], 1)), x])
    
    def fit(self, x, y):
        data_mat = self.data_matrix(x)
        # Labels as a column vector of shape (n_samples, 1).
        y = np.asarray(y, dtype=np.float64).reshape(-1, 1)
        # One weight per column of data_mat (bias first), initialized to zero.
        self.w = np.zeros((data_mat.shape[1], 1), dtype=np.float64)

        for _ in range(self.max_iter):
            # Predicted probabilities g(theta^T x) for every sample.
            result = self.sigmoid(np.dot(data_mat, self.w))
            # Batch gradient ascent on the log-likelihood,
            # whose gradient is X^T (y - g).
            error = y - result
            self.w += self.learning_rate * np.dot(data_mat.T, error)
            
    def score(self, x_test, y_test):
        x_test = self.data_matrix(x_test)
        # Predicted probability of class 1 for each test sample.
        prob = self.sigmoid(np.dot(x_test, self.w)).ravel()
        # Threshold at 0.5 and compare with the true labels.
        prediction = (prob >= 0.5).astype(int)
        # Return the accuracy as a plain float instead of printing it.
        return float(np.mean(prediction == np.asarray(y_test)))

In [337]:
lr = LogisticRegression()
lr.fit(train_x, train_y)
lr.score(test_x, test_y)

1.0
