### Lv1 Modeling 1/6 Python scikit-learn

Modeling with the scikit-learn library.

### Lv1 Modeling Python 2/6 Model concept - Decision tree

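- An algorithm that picks a threshold value for a feature in the data and uses it as the criterion for splitting all rows into two nodes.
- This kind of split is called a binary decision, or binary split (choosing two threshold values would give a three-way split). Repeating the process until every row in the data has been classified is the decision tree algorithm.
- The threshold for each feature is chosen by impurity: the split that leaves the resulting nodes least impure is preferred.
- CART, a representative binary decision tree algorithm, uses Gini impurity. Its general form for a node $t$ is shown below, where $P_j$ is the proportion of rows in $t$ that belong to class $j$ and $J$ is the number of classes.
$$imp(t) = 1 - \sum_{j=1}^{J} P_j^2$$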
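
The impurity formula and the binary split described above can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn implements CART internally; the function names `gini_impurity` and `best_binary_split` and the toy `temperature` / `demand` values are assumptions made up for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_j P_j^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_binary_split(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    n = len(values)
    best = (None, float("inf"))
    # Candidate thresholds: midpoints between consecutive distinct values.
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of the split: child impurities weighted by child size.
        weighted = (len(left) / n) * gini_impurity(left) \
            + (len(right) / n) * gini_impurity(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy data: one feature and a two-class label.
temperature = [10.0, 12.0, 15.0, 20.0, 22.0, 25.0]
demand = ["low", "low", "low", "high", "high", "high"]

print(gini_impurity(demand))                   # 0.5 (two equally frequent classes)
print(best_binary_split(temperature, demand))  # (17.5, 0.0)
```

A node with a single class has impurity 0, so the split at 17.5 separates the toy data perfectly. A full tree would repeat this search over every feature and then recurse into each child node.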

In [1]:
# Import the decision tree algorithm

import sklearn
from sklearn.tree import DecisionTreeClassifier

### Lv1 Modeling Python 3/6 Declaring the model (decision tree)

In [2]:
# Declare the model
model = DecisionTreeClassifier()

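### Lv1 Modeling Python 4/6 Training the model (decision tree)

After declaring the model, train it with the fit(X, Y) function.

- X data: the variables (features) used for prediction
- Y data: the variable to be predicted (target)
<br><br>
- Calling drop(['column_to_exclude'], axis=1) on the train data returns a DataFrame without that feature

- train['column_to_predict'] selects the Y data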
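
As a small illustration of the two bullets above, the sketch below splits a toy DataFrame into features and target. The frame `toy` and its column values are hypothetical and are not taken from the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "hour": [8, 13, 18],
    "temperature": [12.5, 20.1, 17.3],
    "count": [150.0, 80.0, 210.0],
})

X = toy.drop(["count"], axis=1)  # features: every column except the target
Y = toy["count"]                 # target: a single column (a pandas Series)

print(list(X.columns))  # ['hour', 'temperature']
print(Y.tolist())       # [150.0, 80.0, 210.0]
```

Note that drop() returns a new DataFrame and leaves `toy` unchanged, so the target column is still available for indexing afterwards.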

In [3]:
# Download the data

!wget 'https://bit.ly/3gLj0Q6'

import zipfile
with zipfile.ZipFile('3gLj0Q6', 'r') as existing_zip:
    existing_zip.extractall('data')

--2022-09-07 16:08:05--  https://bit.ly/3gLj0Q6
Resolving bit.ly (bit.ly)... 67.199.248.10, 67.199.248.11
Connecting to bit.ly (bit.ly)|67.199.248.10|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://drive.google.com/uc?export=download&id=1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E [following]
--2022-09-07 16:08:05--  https://drive.google.com/uc?export=download&id=1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E
Resolving drive.google.com (drive.google.com)... 172.217.174.110, 2404:6800:400a:80b::200e
Connecting to drive.google.com (drive.google.com)|172.217.174.110|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0c-10-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/mkd1jg40ta1fbruoq17cj292hjp3oilp/1662534450000/17946651057176172524/*/1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E?e=download&uuid=351aa401-aee1-49de-9214-ca2ae8a8236b [following]
--2022-09-07 16:08:06--  https://doc-0c-10-docs.googleuserc

In [4]:
# Import pandas and DecisionTreeRegressor

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [5]:
# Load the CSV files as DataFrame objects

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [7]:
# Check the train, test data information and its shape

print('======== TRAIN DATA ========')
print('Train DataFrame information\n')
print(train.info(), '\n')
print('Train DataFrame shape: ', train.shape, '\n')
print('Number of null values: ', train.isnull().sum())
print('\n')
print('======== TEST DATA ========')
print('Test DataFrame information\n')
print(test.info(), '\n')
print('Test DataFrame shape: ', test.shape, '\n')
print('Number of null values: ', test.isnull().sum())

Train DataFrame information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      1459 non-null   int64  
 1   hour                    1459 non-null   int64  
 2   hour_bef_temperature    1457 non-null   float64
 3   hour_bef_precipitation  1457 non-null   float64
 4   hour_bef_windspeed      1450 non-null   float64
 5   hour_bef_humidity       1457 non-null   float64
 6   hour_bef_visibility     1457 non-null   float64
 7   hour_bef_ozone          1383 non-null   float64
 8   hour_bef_pm10           1369 non-null   float64
 9   hour_bef_pm2.5          1342 non-null   float64
 10  count                   1459 non-null   float64
dtypes: float64(9), int64(2)
memory usage: 125.5 KB
None 

Train DataFrame shape:  (1459, 11) 

Number of null values:  id                          0
hour                     

### Removing missing values

In [10]:
train = train.dropna()
test = test.dropna()

# Check that the rows with null values have been removed.
print(train.isnull().sum())
print(test.isnull().sum())

id                        0
hour                      0
hour_bef_temperature      0
hour_bef_precipitation    0
hour_bef_windspeed        0
hour_bef_humidity         0
hour_bef_visibility       0
hour_bef_ozone            0
hour_bef_pm10             0
hour_bef_pm2.5            0
count                     0
dtype: int64
id                        0
hour                      0
hour_bef_temperature      0
hour_bef_precipitation    0
hour_bef_windspeed        0
hour_bef_humidity         0
hour_bef_visibility       0
hour_bef_ozone            0
hour_bef_pm10             0
hour_bef_pm2.5            0
dtype: int64
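
To make the effect of dropna() concrete, here is a minimal sketch on a hypothetical toy frame (the column names and values are invented for illustration). With its default settings, dropna() removes every row that contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two incomplete rows.
toy = pd.DataFrame({
    "hour": [8, 13, 18, 23],
    "temperature": [12.5, np.nan, 17.3, 9.8],
    "windspeed": [1.5, 2.0, np.nan, 0.7],
})

cleaned = toy.dropna()  # default how="any": drop rows with at least one NaN

print(toy.shape, "->", cleaned.shape)  # (4, 3) -> (2, 3)
print(list(cleaned.index))             # [0, 3]: original row labels are kept
```

Because the original index labels survive, the cleaned frame's index can have gaps.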


The 'count' feature is the value we want to predict.<br>
So we use the drop() function with the axis=1 option to create a pandas DataFrame from the train data without the 'count' feature.<br><br>
Let's name this DataFrame 'X_train',<br>
and name the observed values of the target, the 'count' feature, 'Y_train'.

In [11]:
X_train = train.drop(['count'], axis=1)
Y_train = train['count']

In [12]:
print('X_train\n', X_train, '\n')
print('Y_train\n', Y_train, '\n')

X_train
         id  hour  hour_bef_temperature  hour_bef_precipitation  \
0        3    20                  16.3                     1.0   
1        6    13                  20.1                     0.0   
2        7     6                  13.9                     0.0   
3        8    23                   8.1                     0.0   
4        9    18                  29.5                     0.0   
...    ...   ...                   ...                     ...   
1454  2174     4                  16.8                     0.0   
1455  2175     3                  10.8                     0.0   
1456  2176     5                  18.3                     0.0   
1457  2178    21                  20.7                     0.0   
1458  2179    17                  21.1                     0.0   

      hour_bef_windspeed  hour_bef_humidity  hour_bef_visibility  \
0                    1.5               89.0                576.0   
1                    1.4               48.0                916

Declare the model again under the variable name 'model', this time as a DecisionTreeRegressor, because 'count' is a numeric target.<br>
Pass X_train as the input and Y_train as the output to the fit() function.<br>
Calling fit() trains the model.

In [14]:
model = DecisionTreeRegressor()
model.fit(X_train, Y_train)  # train the model
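
To check what fit() has learned, the same steps can be run on a tiny dataset. The sketch below assumes hypothetical data (`toy_X`, `toy_Y`) rather than the tutorial's files, and fixes `random_state` only so the result is repeatable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: six rows, each with a distinct target value.
toy_X = pd.DataFrame({
    "hour": [6, 9, 12, 15, 18, 21],
    "temperature": [10.0, 14.0, 19.0, 21.0, 18.0, 13.0],
})
toy_Y = pd.Series([30.0, 120.0, 90.0, 100.0, 250.0, 140.0], name="count")

toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(toy_X, toy_Y)  # learn the split rules from the toy data

# An unrestricted tree keeps splitting until every leaf is pure,
# so here each training row ends up in its own leaf.
print(toy_model.get_n_leaves())

# Predictions on the training rows come from those leaves.
fitted = toy_model.predict(toy_X)
print(fitted)
```

Reproducing the training targets this way only shows that the tree memorized the data; it says nothing about how well the model predicts unseen rows.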