### Lv2 Tuning 1/4: Checking Random Forest Feature Importance in Python

Once training with `.fit()` is finished, we can inspect how important each feature was using **`model.feature_importances_`**.<br>

Feature importance measures how much each feature contributes when the model predicts the target. In scikit-learn's random forest it is the normalized total reduction of the split criterion brought by that feature, so the values sum to 1.<br>

If a feature contributes very little, we can try removing it and check whether the accuracy of the model improves.
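
As a minimal sketch of what this attribute reports, the example below fits a forest on synthetic data (not the dataset downloaded in this tutorial) in which the target depends on only one of two columns. The column names `signal` and `noise`, the sample size, and the `random_state` values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on 'signal' only; 'noise' is unrelated
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'signal': rng.uniform(0, 10, size=300),
    'noise': rng.uniform(0, 10, size=300),
})
y = 3 * X['signal'] + rng.normal(0, 0.5, size=300)

# Fix random_state so the forest is reproducible
demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(X, y)

# One importance value per column, in column order
importances = pd.Series(demo_model.feature_importances_, index=X.columns)
print(importances)
```

The `signal` column should receive almost all of the importance, while `noise` stays close to zero.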

In [1]:
# Downloading data
!wget 'https://bit.ly/3gLj0Q6'

# Unzip the downloaded data
import zipfile
with zipfile.ZipFile('3gLj0Q6', 'r') as existing_zip:
    existing_zip.extractall('data')

--2022-09-09 14:50:37--  https://bit.ly/3gLj0Q6
Resolving bit.ly (bit.ly)... 67.199.248.11, 67.199.248.10
Connecting to bit.ly (bit.ly)|67.199.248.11|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://drive.google.com/uc?export=download&id=1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E [following]
--2022-09-09 14:50:37--  https://drive.google.com/uc?export=download&id=1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E
Resolving drive.google.com (drive.google.com)... 142.251.42.174, 2404:6800:4004:81f::200e
Connecting to drive.google.com (drive.google.com)|142.251.42.174|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0c-10-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/4lmbksq3d47utu7tgmd34l658f6ij1np/1662702600000/17946651057176172524/*/1or_QN1ksv81DNog6Tu_kWcZ5jJWf5W9E?e=download&uuid=13343fb5-52f1-472e-8702-94167c059288 [following]
--2022-09-09 14:50:38--  https://doc-0c-10-docs.googleusercon

In [2]:
# Import pandas and RandomForestRegressor
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

In [3]:
# Load data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [4]:
# Check that the data loaded successfully
# info() prints its report directly and returns None, so call it on its own line
print('============ Train Data ============\n')
print('Train Data Information')
train.info()
print('\nTrain Data Shape: ', train.shape, '\n')

print('============ Test Data ============')
print('Test Data Information')
test.info()
print('\nTest Data Shape: ', test.shape, '\n')


Train Data Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      1459 non-null   int64  
 1   hour                    1459 non-null   int64  
 2   hour_bef_temperature    1457 non-null   float64
 3   hour_bef_precipitation  1457 non-null   float64
 4   hour_bef_windspeed      1450 non-null   float64
 5   hour_bef_humidity       1457 non-null   float64
 6   hour_bef_visibility     1457 non-null   float64
 7   hour_bef_ozone          1383 non-null   float64
 8   hour_bef_pm10           1369 non-null   float64
 9   hour_bef_pm2.5          1342 non-null   float64
 10  count                   1459 non-null   float64
dtypes: float64(9), int64(2)
memory usage: 125.5 KB

Train Data Shape:  (1459, 11) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 715 entries, 0 to 714
Data columns (to

In [5]:
# Check if there are missing values
print(train.isnull().sum())

id                          0
hour                        0
hour_bef_temperature        2
hour_bef_precipitation      2
hour_bef_windspeed          9
hour_bef_humidity           2
hour_bef_visibility         2
hour_bef_ozone             76
hour_bef_pm10              90
hour_bef_pm2.5            117
count                       0
dtype: int64


In [6]:
# Fill the missing values using linear interpolation
train.interpolate(inplace=True)

In [7]:
# Check the result of the linear interpolation
print(train.isnull().sum())

id                        0
hour                      0
hour_bef_temperature      0
hour_bef_precipitation    0
hour_bef_windspeed        0
hour_bef_humidity         0
hour_bef_visibility       0
hour_bef_ozone            0
hour_bef_pm10             0
hour_bef_pm2.5            0
count                     0
dtype: int64
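
To make the interpolation step concrete, here is a small sketch on a toy Series (the values are made up for illustration). With the default linear method, pandas fills each interior gap evenly between the nearest known values in row order, but a gap before the first known value has nothing to interpolate from and stays missing, which is one reason to re-check with `isnull().sum()` as above.

```python
import numpy as np
import pandas as pd

# Toy column with a gap at the start and a two-row gap in the middle
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0])

# The default method is linear: interior gaps are filled evenly
# between the known neighbors above and below them
filled = s.interpolate()
print(filled.tolist())
```

Because the fill depends on neighboring rows, it implicitly assumes that adjacent rows are related to each other.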


In [8]:
# Check whether the test data has null values
print(test.isnull().sum())

id                         0
hour                       0
hour_bef_temperature       1
hour_bef_precipitation     1
hour_bef_windspeed         1
hour_bef_humidity          1
hour_bef_visibility        1
hour_bef_ozone            35
hour_bef_pm10             37
hour_bef_pm2.5            36
dtype: int64


In [9]:
# Replace the null values in the test dataset with zero
test.fillna(0, inplace=True)

In [10]:
# Check that the replacement was done properly
print(test.isnull().sum())

id                        0
hour                      0
hour_bef_temperature      0
hour_bef_precipitation    0
hour_bef_windspeed        0
hour_bef_humidity         0
hour_bef_visibility       0
hour_bef_ozone            0
hour_bef_pm10             0
hour_bef_pm2.5            0
dtype: int64


In [11]:
# Separate the features (X) and the target (Y); note that 'id' is still among the features here
X_train = train.drop(['count'], axis=1)
Y_train = train['count']

# Train the model
model = RandomForestRegressor(criterion = 'squared_error')
model.fit(X_train, Y_train)

RandomForestRegressor()

In [12]:
# Print the feature importances
model.feature_importances_

array([0.02570432, 0.58994633, 0.18560472, 0.01767652, 0.02571379,
       0.03750465, 0.03178874, 0.03327152, 0.03288451, 0.0199049 ])
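
The raw array is easier to read once each value is paired with its column name. The sketch below does that with the values printed above, assuming the column order of `X_train` is the order shown by `train.info()` with `count` removed. In a live notebook one would pass `model.feature_importances_` and `X_train.columns` directly instead of the hard-coded copies used here.

```python
import numpy as np
import pandas as pd

# Columns of X_train in order (every column of train except 'count')
feature_names = [
    'id', 'hour', 'hour_bef_temperature', 'hour_bef_precipitation',
    'hour_bef_windspeed', 'hour_bef_humidity', 'hour_bef_visibility',
    'hour_bef_ozone', 'hour_bef_pm10', 'hour_bef_pm2.5',
]

# Values copied from the output above
importances = np.array([
    0.02570432, 0.58994633, 0.18560472, 0.01767652, 0.02571379,
    0.03750465, 0.03178874, 0.03327152, 0.03288451, 0.0199049,
])

# Label each value with its feature name and sort from most to least important
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In this particular run, `hour` and `hour_bef_temperature` account for most of the importance. A random forest is randomized, so the exact values change between runs unless `random_state` is fixed.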

### Lv2 Tuning 2/4: Removing Features in Python

After evaluating the feature importances, we can retrain the model with the less important features removed.<br>

First of all, 'id' is only a row identifier and 'count' is the target we want to predict, so neither should be used as an input feature.<br>
Thus, we will create a new dataframe by dropping the 'id' and 'count' columns.<br>

At prediction time, the test dataset must have exactly the same features as the training dataset, so we will drop the same columns from the test dataset wherever they exist.<br>

We can also create further datasets by dropping additional, less important features.<br>

Here, let's say we also drop 'hour_bef_windspeed' and then 'hour_bef_pm2.5'.
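
One way to keep the two datasets aligned is to apply a single drop list to both. The sketch below uses two tiny made-up frames (`train_demo` and `test_demo` are hypothetical stand-ins, not the tutorial data) and relies on the `errors='ignore'` option of `DataFrame.drop`, which skips labels that are absent, such as 'count' in the test set.

```python
import pandas as pd

# Toy frames that reuse a few of the tutorial's column names
train_demo = pd.DataFrame({
    'id': [1, 2],
    'hour': [0, 1],
    'hour_bef_windspeed': [1.5, 2.0],
    'count': [10.0, 20.0],
})
test_demo = pd.DataFrame({
    'id': [3, 4],
    'hour': [2, 3],
    'hour_bef_windspeed': [0.5, 1.0],
})

# A single list of columns to remove from both datasets
drop_cols = ['count', 'id', 'hour_bef_windspeed']

# errors='ignore' skips labels that a frame does not contain
X_demo = train_demo.drop(columns=drop_cols, errors='ignore')
test_X_demo = test_demo.drop(columns=drop_cols, errors='ignore')

# The model expects the same columns in the same order at prediction time
assert list(X_demo.columns) == list(test_X_demo.columns)
print(list(X_demo.columns))
```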

In [13]:
# Create train datasets
X_train1 = train.drop(['count', 'id'], axis=1)
X_train2 = train.drop(['count', 'id', 'hour_bef_windspeed'], axis=1)
X_train3 = train.drop(['count', 'id', 'hour_bef_windspeed', 'hour_bef_pm2.5'], axis=1)

# Create test datasets
test1 = test.drop(['id'], axis=1)
test2 = test.drop(['id', 'hour_bef_windspeed'], axis=1)
test3 = test.drop(['id', 'hour_bef_windspeed', 'hour_bef_pm2.5'], axis=1)

In [14]:
# Check that the datasets were formed properly
print('X_train1.shape :', X_train1.shape)
print('X_train2.shape :', X_train2.shape)
print('X_train3.shape :', X_train3.shape)
print('test1.shape :', test1.shape)
print('test2.shape :', test2.shape)
print('test3.shape :', test3.shape)

X_train1.shape : (1459, 9)
X_train2.shape : (1459, 8)
X_train3.shape : (1459, 7)
test1.shape : (715, 9)
test2.shape : (715, 8)
test3.shape : (715, 7)


In [15]:
# Declare the models
# We create a separate model for each dataset
model1 = RandomForestRegressor(criterion = 'squared_error')
model2 = RandomForestRegressor(criterion = 'squared_error')
model3 = RandomForestRegressor(criterion = 'squared_error')

In [16]:
# Train each model
model1.fit(X_train1, Y_train)
model2.fit(X_train2, Y_train)
model3.fit(X_train3, Y_train)

RandomForestRegressor()

In [17]:
# Predict with models
prediction1 = model1.predict(test1)
prediction2 = model2.predict(test2)
prediction3 = model3.predict(test3)

In [18]:
# Save the predictions
result1 = pd.read_csv('data/submission.csv')
result2 = pd.read_csv('data/submission.csv')
result3 = pd.read_csv('data/submission.csv')

result1['count'] = prediction1
result2['count'] = prediction2
result3['count'] = prediction3

result1.to_csv('result1.csv', index=False)
result2.to_csv('result2.csv', index=False)
result3.to_csv('result3.csv', index=False)
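
The three nearly identical pipelines above can also be written as one loop over the feature sets. The following is only a sketch under stated assumptions: it runs on small synthetic frames (`train_demo` and `test_demo` are hypothetical stand-ins for the real data), uses fewer trees and a fixed `random_state` to stay quick and reproducible, and keeps the predictions in a dictionary instead of writing CSV files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for train/test (made-up values, tutorial column names)
rng = np.random.default_rng(0)
cols = ['id', 'hour', 'hour_bef_windspeed', 'hour_bef_pm2.5']
train_demo = pd.DataFrame(rng.uniform(0, 24, size=(60, 4)), columns=cols)
train_demo['count'] = 5 * train_demo['hour'] + rng.normal(0, 1, size=60)
test_demo = pd.DataFrame(rng.uniform(0, 24, size=(20, 4)), columns=cols)

# One entry per experiment: result name -> extra columns to drop
experiments = {
    'result1': [],
    'result2': ['hour_bef_windspeed'],
    'result3': ['hour_bef_windspeed', 'hour_bef_pm2.5'],
}

y = train_demo['count']
predictions = {}
for name, extra in experiments.items():
    drop_cols = ['count', 'id'] + extra
    X = train_demo.drop(columns=drop_cols)
    # 'count' does not exist in the test frame, so ignore missing labels
    X_test = test_demo.drop(columns=drop_cols, errors='ignore')
    model_demo = RandomForestRegressor(n_estimators=20, random_state=0)
    model_demo.fit(X, y)
    predictions[name] = model_demo.predict(X_test)

print({name: pred.shape for name, pred in predictions.items()})
```

In the notebook itself, each array in `predictions` would be assigned to the `count` column of the submission frame and saved under its result name, as in the cell above.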