### Lv2 Modeling 1/2: Random forest in Python, concept and declaration

A random forest is also called a **random decision forest**.<br>

Random forests correct for decision trees' habit of overfitting to their training set.<br>

Random forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.<br>

For classification:
- The output of the random forest is the class selected by most trees.<br>

For regression:
- The output is the mean of the individual trees' predictions.<br>
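
The two aggregation rules above can be sketched in plain Python. The per-tree outputs are made-up values for a single sample, and `majority_vote` and `mean_prediction` are hypothetical helper names, not scikit-learn functions. Note that scikit-learn's classifier averages per-tree class probabilities rather than counting hard votes, so this is an illustration of the idea only.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_classes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_classes).most_common(1)[0][0]


def mean_prediction(tree_values):
    """Return the average of the per-tree numeric predictions."""
    return mean(tree_values)


# Made-up outputs of five trees for one sample (illustrative only)
class_votes = ["cat", "dog", "cat", "cat", "dog"]
value_preds = [10.0, 12.0, 11.0, 9.0, 13.0]

forest_class = majority_vote(class_votes)    # classification: most common class
forest_value = mean_prediction(value_preds)  # regression: average of the values
print(forest_class, forest_value)
```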

Random forests generally outperform single decision trees, but their accuracy is often **lower** than that of ***gradient boosted trees***.<br>

However, data characteristics can affect their performance.

### Practical understanding

A random forest raises its accuracy by building multiple decision trees and then averaging their predictions.<br>

From the given data, we draw several random samples (typically with replacement) and train a separate model on each.
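
A minimal sketch of that sampling step, assuming sampling with replacement (bootstrap sampling). `bootstrap_sample` is a hypothetical helper, the row ids are toy values, and the seed is arbitrary.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows, with replacement."""
    return rng.choices(rows, k=len(rows))


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # toy row ids standing in for a dataset

# Each sample would be used to train one tree of the forest
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for i, sample in enumerate(samples):
    print(f"sample {i}: {sorted(sample)}")
```

Because rows are drawn with replacement, a sample usually repeats some rows and omits others, which is what makes the trees differ from one another.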

In [1]:
# Random forest regression model
# Import RandomForestRegressor from sklearn.ensemble to declare a model.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

### Lv2 Modeling 2/2: Training a random forest in Python to match the evaluation metric

`RandomForestRegressor` takes a `criterion` option.<br>

It sets the function used to measure the quality of each split while the trees are grown.<br>

The evaluation metric here is RMSE.<br>

RMSE is the square root of MSE, so splits that minimize MSE also minimize RMSE; `criterion = 'squared_error'` selects MSE as the split criterion.<br>

The option used to be spelled `criterion = 'mse'`; that name was deprecated in scikit-learn 1.0 in favor of `'squared_error'` and removed in 1.2.

The equation for MSE is:
$$MSE = \frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2$$
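
As a worked example of the formula, here is a small sketch in plain Python. The helper names `mse` and `rmse` and the three sample values are illustrative assumptions.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 2.0]   # made-up targets
y_pred = [2.0, 5.0, 4.0]   # made-up predictions

# Squared errors are 1, 0 and 4, so MSE = (1 + 0 + 4) / 3
mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mse_value, rmse_value)
```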

In [2]:
# Downloading data
!wget 'https://bit.ly/3gLj0Q6'

# Unzip the downloaded data
import zipfile
with zipfile.ZipFile('3gLj0Q6', 'r') as existing_zip:
    existing_zip.extractall('data')

--2022-09-08 15:07:36--  https://bit.ly/3gLj0Q6
Resolving bit.ly (bit.ly)... 67.199.248.11, 67.199.248.10
Connecting to bit.ly (bit.ly)|67.199.248.11|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://drive.google.com/uc?export=download&id=1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E [following]
--2022-09-08 15:07:36--  https://drive.google.com/uc?export=download&id=1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E
Resolving drive.google.com (drive.google.com)... 142.250.76.142, 2404:6800:400a:80a::200e
Connecting to drive.google.com (drive.google.com)|142.250.76.142|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0c-10-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/2dkumjcaubnjn160t42hrtgl9gkpdabd/1662617250000/17946651057176172524/*/1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E?e=download&uuid=7ec655fe-463e-43dd-8ad5-3617a5a04932 [following]
--2022-09-08 15:07:37--  https://doc-0c-10-docs.googleusercon

In [3]:
# Import pandas, RandomForestRegressor
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

In [4]:
# Load data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [5]:
# Check that the data is loaded
# DataFrame.info() prints its summary itself and returns None,
# so it is called directly rather than inside print().
print('============ Train Data ============\n')
train.info()
print('\nTrain Data Shape: ', train.shape, '\n')

print('============ Test Data ============')
test.info()
print('\nTest Data Shape: ', test.shape, '\n')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      1459 non-null   int64  
 1   hour                    1459 non-null   int64  
 2   hour_bef_temperature    1457 non-null   float64
 3   hour_bef_precipitation  1457 non-null   float64
 4   hour_bef_windspeed      1450 non-null   float64
 5   hour_bef_humidity       1457 non-null   float64
 6   hour_bef_visibility     1457 non-null   float64
 7   hour_bef_ozone          1383 non-null   float64
 8   hour_bef_pm10           1369 non-null   float64
 9   hour_bef_pm2.5          1342 non-null   float64
 10  count                   1459 non-null   float64
dtypes: float64(9), int64(2)
memory usage: 125.5 KB

Train Data Shape:  (1459, 11) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 715 entries, 0 to 714
Data columns (to

In [6]:
# Check if there are missing values
print(train.isnull().sum())

id                          0
hour                        0
hour_bef_temperature        2
hour_bef_precipitation      2
hour_bef_windspeed          9
hour_bef_humidity           2
hour_bef_visibility         2
hour_bef_ozone             76
hour_bef_pm10              90
hour_bef_pm2.5            117
count                       0
dtype: int64


In [7]:
# Missing value handling: interpolation
# Linear interpolation is used here.
train.interpolate(inplace=True)
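
To see what linear interpolation does, here is a small sketch on a toy Series (the values are made up). With the default settings, each interior gap is filled on a straight line between its neighbours, treating the rows as equally spaced. A missing value at the very start has no earlier neighbour and stays missing under the defaults, which is one reason to re-check the null counts afterwards.

```python
import pandas as pd

# Toy column: one leading gap and a two-row interior gap
toy = pd.Series([float("nan"), 1.0, float("nan"), float("nan"), 4.0])

# The default method is 'linear'; rows are treated as equally spaced
filled = toy.interpolate()
print(filled.tolist())

# Count what is still missing after interpolation
print(filled.isnull().sum())
```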

In [8]:
# Check that no missing values remain
print(train.isnull().sum())

id                        0
hour                      0
hour_bef_temperature      0
hour_bef_precipitation    0
hour_bef_windspeed        0
hour_bef_humidity         0
hour_bef_visibility       0
hour_bef_ozone            0
hour_bef_pm10             0
hour_bef_pm2.5            0
count                     0
dtype: int64


In [9]:
# Separate the features from the target: drop the 'count' column
X_train = train.drop(['count'], axis=1)  # axis=1: drop a column, not a row

# Keep the target 'count' on its own as a Series
Y_train = train['count']

In [10]:
# Train the model using random forest
model = RandomForestRegressor(criterion = 'squared_error')
model.fit(X_train, Y_train)

RandomForestRegressor()