# Logistic-Regression

### Logistic regression analysis
▶ A generalized linear model  
▶ Used for classification more often than for regression  
▶ Outputs a probability  

<img src="https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F40159%2F5db61a65-6902-076b-370a-79169280751f.png?ixlib=rb-1.2.2&auto=compress%2Cformat&fit=max&w=1400&s=55de4694cec4e635f499f98d3be44a16" width=30%>

$$ \large y = \frac{1}{ 1 + e^{-(β_0 + β_1x_1)}}$$

The parameters $β_0$ and $β_1$ are optimized.
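
As a minimal sketch of the formula above, the model can be evaluated with NumPy. The parameter values `beta_0` and `beta_1` and the inputs `x_1` are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameter values (not taken from any fitted model)
beta_0, beta_1 = -1.0, 2.0
x_1 = np.array([-2.0, 0.5, 3.0])

# Predicted probabilities; every value lies strictly between 0 and 1
y_hat = sigmoid(beta_0 + beta_1 * x_1)
print(y_hat)
```

When the linear term $β_0 + β_1x_1$ is 0 (here at `x_1 = 0.5`), the output is exactly 0.5, which is the usual decision boundary.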

## Optimisation
### Likelihood function
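▶ Probability of the n-th observation, where $y_n$ is the predicted probability and $t_n \in \{0, 1\}$ is the true label  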
  
$$ \large P_n = y_n^{t_n}(1 - y_n)^{1-t_n} $$

▶ Apply this to the entire dataset  

$$ \large L(β) = \prod_{n = 1}^{N}y_n^{t_n}(1 - y_n)^{1-t_n} $$

The goal is to maximize this expression.  

→ Take the logarithm (this prevents underflow) and flip the sign  
$$ \large -\log L(β) = - \sum_{n=1}^{N}\bigl(t_n\log y_n + (1 - t_n)\log (1 - y_n) \bigr) $$
This expression is minimized.
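
The relationship between the likelihood product and the negative log-likelihood can be checked numerically. This is a small sketch with made-up labels `t` and probabilities `y`. The `eps` clipping is an added assumption to avoid `log(0)` and is not part of the formula itself.

```python
import numpy as np

def negative_log_likelihood(y, t, eps=1e-12):
    """-sum(t*log(y) + (1-t)*log(1-y)); eps is an assumed guard against log(0)."""
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Hypothetical labels and predicted probabilities
t = np.array([1, 0, 1])
y = np.array([0.9, 0.2, 0.6])

# Likelihood as a product over samples
likelihood = np.prod(y**t * (1 - y) ** (1 - t))
nll = negative_log_likelihood(y, t)

print(likelihood, nll, -np.log(likelihood))
```

With many samples the product shrinks toward zero and can underflow, while the sum of logarithms stays in a comfortable numeric range.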

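▶ Gradient descent  
Differentiate the negative log-likelihood with respect to the parameter β:
$$ \large \frac{\partial \bigl(-\log L(β)\bigr)}{\partial β} = \sum_{n=1}^{N}x_n(y_n - t_n) $$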
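
The gradient $\sum_n x_n(y_n - t_n)$ can be plugged into a plain gradient descent loop. The sketch below is illustrative only: the function name `fit_logistic_gd`, the learning rate `lr`, the iteration count `n_iter`, the division of the gradient by the sample count, and the toy data are all assumptions, and scikit-learn's `LogisticRegression` uses a different solver with regularization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, t, lr=0.1, n_iter=1000):
    """Batch gradient descent on the negative log-likelihood (hypothetical helper)."""
    # Prepend a column of ones so that beta[0] plays the role of the intercept
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Xb @ beta)      # current predicted probabilities
        grad = Xb.T @ (y - t)       # sum over n of x_n * (y_n - t_n)
        beta -= lr * grad / len(t)  # averaged step (an assumed scaling choice)
    return beta

# Toy one-dimensional data: negative x is class 0, positive x is class 1
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
t = np.array([0, 0, 0, 1, 1, 1])

beta = fit_logistic_gd(X, t)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
pred = (sigmoid(Xb @ beta) >= 0.5).astype(int)
print(beta, pred)
```

Each step moves β against the gradient, so the negative log-likelihood decreases as long as the learning rate is small enough.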

# Implementation

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df

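| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |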


In [3]:
X = iris.data[50:]
y = iris.target[50:] - 1
print(X.shape)
print(y.shape)

(100, 4)
(100,)


In [4]:
X

array([[7. , 3.2, 4.7, 1.4],
       [6.4, 3.2, 4.5, 1.5],
       [6.9, 3.1, 4.9, 1.5],
       [5.5, 2.3, 4. , 1.3],
       [6.5, 2.8, 4.6, 1.5],
       [5.7, 2.8, 4.5, 1.3],
       [6.3, 3.3, 4.7, 1.6],
       [4.9, 2.4, 3.3, 1. ],
       [6.6, 2.9, 4.6, 1.3],
       [5.2, 2.7, 3.9, 1.4],
       [5. , 2. , 3.5, 1. ],
       [5.9, 3. , 4.2, 1.5],
       [6. , 2.2, 4. , 1. ],
       [6.1, 2.9, 4.7, 1.4],
       [5.6, 2.9, 3.6, 1.3],
       [6.7, 3.1, 4.4, 1.4],
       [5.6, 3. , 4.5, 1.5],
       [5.8, 2.7, 4.1, 1. ],
       [6.2, 2.2, 4.5, 1.5],
       [5.6, 2.5, 3.9, 1.1],
       [5.9, 3.2, 4.8, 1.8],
       [6.1, 2.8, 4. , 1.3],
       [6.3, 2.5, 4.9, 1.5],
       [6.1, 2.8, 4.7, 1.2],
       [6.4, 2.9, 4.3, 1.3],
       [6.6, 3. , 4.4, 1.4],
       [6.8, 2.8, 4.8, 1.4],
       [6.7, 3. , 5. , 1.7],
       [6. , 2.9, 4.5, 1.5],
       [5.7, 2.6, 3.5, 1. ],
       [5.5, 2.4, 3.8, 1.1],
       [5.5, 2.4, 3.7, 1. ],
       [5.8, 2.7, 3.9, 1.2],
       [6. , 2.7, 5.1, 1.6],
       [5.4, 3

In [5]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [7]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=0)
print(X_train.shape)
print(X_test.shape)

(75, 4)
(25, 4)


In [9]:
log_reg = LogisticRegression().fit(X_train, y_train)
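
The model is fitted on the training split, so a natural follow-up is to score it on the held-out split. The sketch below repeats the preceding cells so that it runs on its own. The variable names `test_accuracy` and `proba` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same two-class subset as in the cells above (classes 1 and 2, relabelled 0 and 1)
iris = load_iris()
X = iris.data[50:]
y = iris.target[50:] - 1

X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=0
)
log_reg = LogisticRegression().fit(X_train, y_train)

# Accuracy on the 25 held-out samples
test_accuracy = log_reg.score(X_test, y_test)

# Class probabilities for the first three test samples; each row sums to 1
proba = log_reg.predict_proba(X_test[:3])
print(test_accuracy)
print(proba)
```

`predict_proba` returns one column per class, which ties the fitted model back to the probability interpretation of the sigmoid output.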

In [10]:
from sklearn.model_selection import cross_val_score

In [11]:
cv_score = cross_val_score(log_reg, X_scaled, y, cv=5, scoring="accuracy")
pd.DataFrame(cv_score).rename(columns={0: 'accuracy'})

| | accuracy |
|---|---|
| 0 | 0.95 |
| 1 | 1.0 |
| 2 | 0.9 |
| 3 | 0.9 |
| 4 | 1.0 |


In [12]:
cv_score.mean()

0.95
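
In the cross-validation above, the scaler was fitted on all 100 samples beforehand, so each validation fold contributed to the scaling statistics. One common way to avoid this, sketched here as an alternative rather than as the notebook's own method, is to place the scaler inside a pipeline so it is refitted on the training part of every fold. The names `pipe` and `pipe_scores` are introduced for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same two-class subset as above, left unscaled on purpose
iris = load_iris()
X = iris.data[50:]
y = iris.target[50:] - 1

# The scaler is fitted only on the training portion of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print(pipe_scores)
print(pipe_scores.mean())
```

`cross_val_score` clones and refits the estimator for each fold, so passing an unfitted pipeline is sufficient.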