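## Key points
- model_selection: how to split the data a model needs: train_test_split, KFold, GroupKFold
- feature_selection: feature-selection methods such as RFE, which recursively drops the features the model ranks as least important (dropping low-variance features is a separate method, VarianceThreshold)
- feature_extraction: turning raw data into features, e.g. CountVectorizer for text mining (whether and how often a word occurs) and TF-IDF (term frequency-inverse document frequency)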
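
The three modules listed above can be tried on a small made-up dataset. This is only a sketch: the toy arrays, the `groups` labels, and the three example sentences are assumptions chosen for illustration, not data from these notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GroupKFold
from sklearn.feature_selection import RFE
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): 12 samples, 4 features, 3 groups of 4 samples each.
rng = np.random.RandomState(0)
X = rng.normal(size=(12, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.01, size=12)
groups = np.repeat([0, 1, 2], 4)

# model_selection: a single hold-out split, K-fold, and group-aware K-fold.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
kfold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=3).split(X)]
# GroupKFold keeps every group entirely in train or entirely in test.
group_overlap = [
    set(groups[train_idx]) & set(groups[test_idx])
    for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups)
]

# feature_selection: RFE refits the model and drops the weakest feature each round.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
selected = rfe.support_  # boolean mask of the features that were kept

# feature_extraction: word counts and TF-IDF weights for a tiny corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(kfold_sizes, selected, counts.shape)
```

Because `y` here depends only on the first two columns, RFE should keep those two and discard the rest.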

> Declare the model (create an instance) -> fit -> predict -> evaluate
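
The four steps in that workflow map onto scikit-learn calls roughly as follows. This is a minimal sketch, assuming a `LinearRegression` model, an 80/20 hold-out split, and R-squared as the evaluation metric; none of those choices is prescribed by the notes.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with one feature; the split ratio is an illustrative choice.
X, y = make_regression(
    n_samples=200, n_features=1, bias=100, noise=10, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()        # 1. declare the model (create an instance)
model.fit(X_train, y_train)       # 2. fit on the training data
y_pred = model.predict(X_test)    # 3. predict on held-out data
score = r2_score(y_test, y_pred)  # 4. evaluate
print(score)
```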

* Evaluation: accuracy, precision, and f1_score all matter.
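
To see why a single metric is not enough, the sketch below computes the metrics by hand from confusion-matrix counts. The label arrays are made up for illustration, and recall is included because F1 is defined from precision and recall together. `sklearn.metrics` offers `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` for the same purpose.

```python
import numpy as np

# Hypothetical binary labels: 4 positives and 6 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives  -> 2
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives -> 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives -> 2
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives  -> 5

accuracy = (tp + tn) / len(y_true)                  # 7/10
precision = tp / (tp + fp)                          # 2/3
recall = tp / (tp + fn)                             # 2/4
f1 = 2 * precision * recall / (precision + recall)  # 4/7

print(accuracy, precision, recall, f1)
```

Here accuracy is 0.7 even though half of the positives are missed, which is the kind of gap precision, recall, and F1 expose.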

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

In [5]:
x=np.array([0.0,1.0,2.0,3.0,4.0,5.0])
y=np.array([0.0,0.8,0.9,0.1,-0.8,-1.0])

# Find the polynomial coefficients by fitting a degree-3 (cubic) polynomial
z = np.polyfit(x,y,3)
print(z,'\n')

p=np.poly1d(z)
print('Equation:\n',p,'\n')

print(p(0.5),'\n')
print(p(3.5),'\n')
print(p(10),'\n')
print(p(3.0),'\n')

[ 0.08703704 -0.81349206  1.69312169 -0.03968254] 

Equation:
          3          2
0.08704 x - 0.8135 x + 1.693 x - 0.03968 

0.6143849206349201 

-0.347321428571432 

22.579365079365022 

0.06825396825396512 
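
`np.polyfit` solves a least-squares problem, so the same coefficients can be recovered from a Vandermonde matrix. The sketch below is an illustration of that equivalence, not how the cell above was produced; the names `A`, `coef`, and `residuals` are introduced here.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])

# Columns are x^3, x^2, x^1, x^0, matching polyfit's highest-degree-first order.
A = np.vander(x, 4)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

z = np.polyfit(x, y, 3)
p = np.poly1d(z)

# Differences between the data and the fitted cubic at the sample points.
residuals = y - p(x)
print(coef)
print(residuals)
```

Note that `p(10)` lies outside the sampled range 0 to 5, so that value is an extrapolation and should be trusted less than `p(0.5)` or `p(3.5)`.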



In [17]:
from sklearn.datasets import make_regression
import statsmodels.api as sm
bias = 100

X0,y,w=make_regression(n_samples=200, n_features=1, bias=bias, noise=10, coef=True,random_state=1)

X=sm.add_constant(X0)
y=y.reshape(len(y),1)

# print(X)
# print(y)
print(w)

86.44794300545998


In [18]:
# @ is the matrix-multiplication operator (also used heavily in ANNs, artificial neural networks)
# Normal equation: w = (X^T X)^-1 X^T y
w = np.linalg.inv(X.T@X)@X.T@y
w

array([[99.79150869],
       [86.96171201]])
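
Explicitly inverting `X.T @ X` works for this small problem, but solving the linear system or calling a least-squares routine is usually the more numerically stable choice. A minimal sketch, assuming synthetic data generated with NumPy only (the true intercept 100 and slope 86 are made-up values, not the data from the cell above):

```python
import numpy as np

# Hypothetical data: y = 100 + 86 * x + noise.
rng = np.random.RandomState(1)
x0 = rng.normal(size=200)
y = 100.0 + 86.0 * x0 + rng.normal(scale=10.0, size=200)

# Design matrix with a leading column of ones, like sm.add_constant.
X = np.column_stack([np.ones_like(x0), x0])

w_inv = np.linalg.inv(X.T @ X) @ X.T @ y         # explicit inverse (normal equation)
w_solve = np.linalg.solve(X.T @ X, X.T @ y)      # solve the system, no explicit inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solver

print(w_inv, w_solve, w_lstsq)
```

All three give the same `[intercept, slope]` here; they start to differ only when `X.T @ X` is badly conditioned.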

In [27]:
import statsmodels.api as sm

model=sm.OLS(y,X) 
print(model)

result=model.fit()
print(result.summary())

result.params

<statsmodels.regression.linear_model.OLS object at 0x0000014A11B433C8>
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                 1.278e+04
Date:                Tue, 14 Jan 2020   Prob (F-statistic):          8.17e-182
Time:                        01:41:36   Log-Likelihood:                -741.28
No. Observations:                 200   AIC:                             1487.
Df Residuals:                     198   BIC:                             1493.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------

array([99.79150869, 86.96171201])
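
The `R-squared` and `Adj. R-squared` entries in the summary can be recomputed from the residuals. The sketch below regenerates data with the same `make_regression` arguments as the earlier cell and fits it with `np.linalg.lstsq` instead of statsmodels, so treat it as an illustration of the formulas rather than a reproduction of the summary.

```python
import numpy as np
from sklearn.datasets import make_regression

X0, y = make_regression(
    n_samples=200, n_features=1, bias=100, noise=10, random_state=1
)
X = np.column_stack([np.ones(len(X0)), X0])  # add the constant column

w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

n, k = X.shape  # k counts the constant, so n - k is "Df Residuals"
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)

print(r2, adj_r2)
```

With a single predictor plus an intercept, `r2` also equals the squared Pearson correlation between the feature and `y`.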

# LinearRegression

In [40]:
from sklearn.linear_model import LinearRegression
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston

boston = load_boston()
boston

dfX=pd.DataFrame(boston.data, columns=boston.feature_names)
dfy = pd.DataFrame(boston.target, columns=['MEDV'])
print(dfX.shape)
print(dfX.head())
print(dfy.head())

model=LinearRegression().fit(boston.data, boston.target)
print(model)

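# Inspect the fitted coefficients and the intercept
print(model.coef_)
print(model.intercept_)

# Predicted values
predictions=model.predict(boston.data)
predictions

# Prediction error (predicted - actual); the conventional residual, actual - predicted, has the opposite sign
predictions-boston.target 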

(506, 13)
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  
   MEDV
0  24.0
1  21.6
2  34.7
3  33.4
4  36.2
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]
36.459488385089855


array([ 6.00384338e+00,  3.42556238e+00, -4.13240328e+00, -4.79296351e+00,
       -8.25647577e+00, -3.44371554e+00,  1.01808268e-01, -7.56401157e+00,
       -4.97636315e+00,  2.02621071e-02,  3.99949651e+00,  2.68679568e+00,
       -7.93478472e-01, -8.47097189e-01,  1.08348205e+00, -6.02516792e-01,
       -2.57249021e+00, -5.88598653e-01, -4.02198894e+00,  2.06136033e-01,
       -1.07614247e+00, -1.92896331e+00,  6.32881292e-01, -6.93714654e-01,
        7.83383155e-02, -5.13314391e-01, -1.13602345e+00, -9.15257194e-02,
        1.14737285e+00, -1.23571798e-01, -1.24488241e+00,  3.55923295e+00,
       -4.38894264e+00,  1.18275814e+00,  2.06758913e-01,  4.91463526e+00,
        2.34193708e+00,  2.10891142e+00, -1.78497388e+00,  5.57625688e-01,
       -6.84897746e-01,  1.42056414e+00, -9.61337195e-02, -9.02072745e-02,
        1.74149176e+00,  2.79669817e+00,  4.23200323e-01,  1.43655088e+00,
       -5.29344623e+00, -2.19392249e+00,  1.58152535e+00,  3.47222285e+00,
        2.65585080e+00,  
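
Since `load_boston` is no longer available in scikit-learn 1.2 and later, the same fit-predict-residual steps can be tried on another bundled dataset. The sketch below uses `load_diabetes` purely as a stand-in (an assumption, not the dataset from these notes) and adds two summary metrics for the residuals.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = load_diabetes()  # stand-in dataset bundled with scikit-learn
X, y = data.data, data.target

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# Conventional residuals: actual minus predicted.
residuals = y - predictions

mse = mean_squared_error(y, predictions)  # mean of the squared residuals
r2 = r2_score(y, predictions)             # same value as model.score(X, y)

print(model.coef_)
print(model.intercept_)
print(mse, r2)
```

These scores are computed on the training data, so they describe goodness of fit rather than how well the model generalizes.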

# Logistic Regression