### Lv1 Modeling 1/6 Python scikit-learn

Modeling with the scikit-learn library.

### Lv1 Modeling Python 2/6 Model Concept - Decision Tree

A decision tree is an algorithm that picks a threshold value for a feature and then splits the rows into two nodes according to that threshold.<br>
This kind of split is called a binary split.<br>
If two threshold values are set instead of one, the rows are split three ways, which is a ternary (3진 or 3항 in Korean) split.<br>
The algorithm repeats this process on each resulting node until every row in the data has been classified.<br>
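
To make the splitting step concrete, here is a minimal sketch of a single binary split in plain Python. The rows, the feature name `temperature`, and the threshold `20.0` are made up for illustration and are not taken from the dataset used later.

```python
# Toy rows; the feature name, labels, and values are hypothetical.
rows = [
    {"temperature": 16.3, "label": "low"},
    {"temperature": 20.1, "label": "high"},
    {"temperature": 13.9, "label": "low"},
    {"temperature": 29.5, "label": "high"},
]


def binary_split(rows, feature, threshold):
    """Send a row to the left node if feature <= threshold, else to the right node."""
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    return left, right


left, right = binary_split(rows, "temperature", 20.0)
print(len(left), len(right))
```

A real tree would try many candidate thresholds and keep the one that produces the cleanest pair of nodes, then repeat the same step inside each node.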

The criterion for choosing the threshold for each feature is impurity (불순도 in Korean).<br>
CART, a representative decision tree algorithm, uses Gini impurity.<br>
The general expression for the Gini impurity of a node $t$ is:
$$imp(t) = 1 - \sum_{j=1}^{J} P_j^2$$
where $P_j$ is the proportion of rows in node $t$ that belong to class $j$, and $J$ is the number of classes.
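
As a rough illustration of the formula, the sketch below computes the Gini impurity of a node from a list of class labels. The function name `gini_impurity` and the sample labels are assumptions made for this example; scikit-learn computes the same quantity internally and does not expose it under this name.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum of squared class proportions for the labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure = gini_impurity(["a", "a", "a", "a"])   # only one class in the node
mixed = gini_impurity(["a", "a", "b", "b"])  # two classes, evenly mixed
print(pure, mixed)
```

A node holding a single class has impurity 0, and the value grows as the classes become more evenly mixed, so a good split is one whose child nodes have low impurity.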

In [1]:
# Import the decision tree algorithm

import sklearn
from sklearn.tree import DecisionTreeClassifier

### Lv1 Modeling Python 3/6 Model Declaration (Decision Tree)

In [2]:
# Declare the model
model = DecisionTreeClassifier()

### Lv1 Modeling Python 4/6 Model Training (Decision Tree)

After declaring a model, we can train it with the `fit(X, Y)` method.

- X: the features used to make the prediction
- Y: the target values that the prediction should match<br>

- We can exclude a feature from the data with the `.drop(['ColumnNameToExclude'], axis=1)` method.

- We can select the Y data by indexing with `train['ColumnNameToIndex']`.
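
The following sketch shows how these pieces fit together on a tiny, made-up DataFrame. The column names `feature_a`, `feature_b`, and `target`, and the variable `toy_model`, are hypothetical and exist only for this example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two features and a 'target' column to predict.
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4],
    "feature_b": [0, 0, 1, 1],
    "target": [0, 0, 1, 1],
})

X = toy.drop(["target"], axis=1)  # every column except the target
Y = toy["target"]                 # the target column only

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X, Y)  # learn the split rules from X and Y

print(list(X.columns))
print(toy_model.predict(X))
```

Note that `drop()` returns a new DataFrame and leaves `toy` unchanged, so the target column is still available for indexing afterwards.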

In [3]:
# Download the data

!wget 'https://bit.ly/3gLj0Q6'

import zipfile
with zipfile.ZipFile('3gLj0Q6', 'r') as existing_zip:
    existing_zip.extractall('data')

In [4]:
# Import pandas and DecisionTreeRegressor

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [5]:
# Load the CSV files as DataFrame objects

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [6]:
# Check the train, test data information and its shape

print('======== TRAIN DATA ========')
print('Train DataFrame information\n')
print(train.info(), '\n')
print('Train DataFrame shape: ', train.shape, '\n')
print('Number of null values: ', train.isnull().sum())
print('\n')
print('======== TEST DATA ========')
print('Test DataFrame information\n')
print(test.info(), '\n')
print('Test DataFrame shape: ', test.shape, '\n')
print('Number of null values: ', test.isnull().sum())

Train DataFrame information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      1459 non-null   int64  
 1   hour                    1459 non-null   int64  
 2   hour_bef_temperature    1457 non-null   float64
 3   hour_bef_precipitation  1457 non-null   float64
 4   hour_bef_windspeed      1450 non-null   float64
 5   hour_bef_humidity       1457 non-null   float64
 6   hour_bef_visibility     1457 non-null   float64
 7   hour_bef_ozone          1383 non-null   float64
 8   hour_bef_pm10           1369 non-null   float64
 9   hour_bef_pm2.5          1342 non-null   float64
 10  count                   1459 non-null   float64
dtypes: float64(9), int64(2)
memory usage: 125.5 KB
None 

Train DataFrame shape:  (1459, 11) 

Number of null values:  id                          0
hour                     

### Erasing missing values

In [7]:
train = train.dropna()
test = test.dropna()

# Check that the rows with null values were deleted.
print(train.isnull().sum())
print(test.isnull().sum())

id                        0
hour                      0
hour_bef_temperature      0
hour_bef_precipitation    0
hour_bef_windspeed        0
hour_bef_humidity         0
hour_bef_visibility       0
hour_bef_ozone            0
hour_bef_pm10             0
hour_bef_pm2.5            0
count                     0
dtype: int64
id                        0
hour                      0
hour_bef_temperature      0
hour_bef_precipitation    0
hour_bef_windspeed        0
hour_bef_humidity         0
hour_bef_visibility       0
hour_bef_ozone            0
hour_bef_pm10             0
hour_bef_pm2.5            0
dtype: int64
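
To see what `dropna()` does row by row, here is a small sketch on a made-up DataFrame. The column names and values are hypothetical and only loosely mimic the dataset above.

```python
import pandas as pd

# Hypothetical toy frame with missing values in two different rows.
toy = pd.DataFrame({
    "hour": [1, 2, 3, 4],
    "temperature": [16.3, None, 13.9, 8.1],
    "humidity": [89.0, 48.0, None, 54.0],
})

cleaned = toy.dropna()  # default: drop any row with at least one missing value

print(toy.shape, "->", cleaned.shape)
print(cleaned.isnull().sum().sum())  # total number of remaining nulls
```

With no arguments, `dropna()` removes a row as soon as any single column is missing, so the number of rows can shrink noticeably. The same applies to the test data, which ends up with fewer rows to predict on.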


The 'count' feature is the target we want to predict.<br>
So we use the drop() method with the axis=1 option to create a pandas DataFrame that holds the train data without the 'count' feature.<br><br>
Let's name this input data 'X_train',<br>
and name the observed values of the target, the 'count' feature, 'Y_train'.

In [8]:
X_train = train.drop(['count'], axis=1)
Y_train = train['count']

In [9]:
print('======== X_train ========\n', X_train, '\n\n')
print('======== Y_train ========\n', Y_train, '\n')

         id  hour  hour_bef_temperature  hour_bef_precipitation  \
0        3    20                  16.3                     1.0   
1        6    13                  20.1                     0.0   
2        7     6                  13.9                     0.0   
3        8    23                   8.1                     0.0   
4        9    18                  29.5                     0.0   
...    ...   ...                   ...                     ...   
1454  2174     4                  16.8                     0.0   
1455  2175     3                  10.8                     0.0   
1456  2176     5                  18.3                     0.0   
1457  2178    21                  20.7                     0.0   
1458  2179    17                  21.1                     0.0   

      hour_bef_windspeed  hour_bef_humidity  hour_bef_visibility  \
0                    1.5               89.0                576.0   
1                    1.4               48.0                916.0   
2 

Because 'count' is a numeric target, declare the model again as a DecisionTreeRegressor, reusing the variable name 'model' from above.<br>
The fit() method takes X_train as the input and Y_train as the expected output.<br>
Train the model by calling fit().

In [10]:
model = DecisionTreeRegressor()
model.fit(X_train, Y_train)  # train the model

DecisionTreeRegressor()
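
As a quick sanity check of what `fit()` produces, the sketch below trains a regressor on a tiny made-up dataset and inspects the resulting tree. The names `X_toy`, `y_toy`, and `toy_reg` and all values are hypothetical stand-ins for `X_train`, `Y_train`, and `model`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data standing in for X_train / Y_train.
X_toy = pd.DataFrame({
    "hour": [6, 13, 18, 23],
    "temperature": [13.9, 20.1, 29.5, 8.1],
})
y_toy = pd.Series([25.0, 120.0, 310.0, 40.0], name="count")

toy_reg = DecisionTreeRegressor(random_state=0)
toy_reg.fit(X_toy, y_toy)

# With default settings the tree keeps splitting until each leaf holds one
# distinct row, so predictions on the training rows equal the training targets.
train_pred = toy_reg.predict(X_toy)
print(train_pred)
print(toy_reg.get_depth(), toy_reg.get_n_leaves())
```

Reproducing the training targets exactly says nothing about accuracy on unseen data. It only confirms that the tree has been fitted and shows how far an unrestricted tree will grow.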