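# Type 2
- Load the building and facility data for each district (시군구) of Seoul.
- Use the training data to build a model that predicts total gas usage (`총가스사용량`).
- Apply the model to the evaluation data to predict total gas usage.
    - Predictions are evaluated by RMSE.
    - Some values of the target variable (`총가스사용량`) are recorded as 0; these are missing values that were filled in with 0.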
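
Since RMSE is the evaluation metric, it may help to see what it computes. The following is a minimal sketch with NumPy; the helper name `rmse` and the numbers are illustrative and not taken from the data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values only (not from the gas data)
y_true = [9000.0, 10000.0, 11000.0]
y_pred = [9100.0, 9900.0, 11200.0]
score = rmse(y_true, y_pred)
print("RMSE:", score)
```

Because the residuals are squared, a single large miss weighs more than several small ones. This is why rows whose target is recorded as 0 can dominate the score.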

In [65]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split

In [66]:
train_df = pd.read_csv("data/gas_train.csv")
test_df = pd.read_csv("data/gas_test.csv")

In [67]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3196 entries, 0 to 3195
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   시군구명    3196 non-null   object 
 1   생활및판매   3196 non-null   int64  
 2   공공문화    3196 non-null   int64  
 3   복지의료    3196 non-null   int64  
 4   업무오락체육  3196 non-null   int64  
 5   총가스사용량  3196 non-null   float64
dtypes: float64(1), int64(4), object(1)
memory usage: 149.9+ KB


In [68]:
len(train_df.loc[train_df['총가스사용량'] == 0])

57
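
Because these zeros stand for missing values rather than real readings, one option is to mark them as NaN so that pandas treats them as missing. This is a small sketch on a toy frame; the values are made up and only the column name comes from the data.

```python
import numpy as np
import pandas as pd

# Toy data: two of the five targets are recorded as 0
toy = pd.DataFrame({"총가스사용량": [9077.8, 0.0, 8603.6, 0.0, 10781.4]})

# Treat 0 as "missing" rather than as a real reading
usage = toy["총가스사용량"].replace(0, np.nan)

n_missing = int(usage.isna().sum())         # number of rows recorded as 0
missing_rate = float(usage.isna().mean())   # share of such rows
mean_observed = float(usage.mean())         # NaN is skipped, so only observed rows count
```

In the training data above, 57 of 3196 rows are zero, which is roughly 1.8% of the rows.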

In [69]:
train_df.head()

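| | 시군구명 | 생활및판매 | 공공문화 | 복지의료 | 업무오락체육 | 총가스사용량 |
|---|---|---|---|---|---|---|
| 0 | 구로구 | 2 | 0 | 0 | 0 | 9077.8 |
| 1 | 구로구 | 6 | 0 | 1 | 2 | 10105.5 |
| 2 | 구로구 | 27 | 0 | 0 | 0 | 8603.6 |
| 3 | 구로구 | 2 | 0 | 0 | 0 | 11076.8 |
| 4 | 구로구 | 83 | 0 | 1 | 19 | 10781.4 |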


In [70]:
train_df['시군구명'].value_counts()

시군구명
강남구     715
종로구     336
서초구     312
중구      247
영등포구    169
마포구     130
용산구     130
구로구     130
관악구     117
송파구     117
동대문구    117
동작구      91
강동구      78
광진구      78
성북구      78
금천구      78
양천구      65
강북구      52
노원구      52
은평구      52
성동구      52
Name: count, dtype: int64

In [71]:
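# One-hot encode the district name (시군구명): one indicator column per district
dummy_df = pd.get_dummies(train_df['시군구명'])

In [72]:
# Append the indicator columns and drop the original string column
train_df = pd.concat([train_df, dummy_df], axis=1).drop('시군구명', axis=1)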

In [73]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1476 entries, 0 to 1475
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   시군구명    1476 non-null   object
 1   생활및판매   1476 non-null   int64 
 2   공공문화    1476 non-null   int64 
 3   복지의료    1476 non-null   int64 
 4   업무오락체육  1476 non-null   int64 
dtypes: int64(4), object(1)
memory usage: 57.8+ KB


In [74]:
# Separate the features from the target
X, Y = train_df.drop('총가스사용량', axis=1), train_df['총가스사용량']

In [75]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [76]:
# Baseline: keep the zero targets as they are
rf_model = RandomForestRegressor(random_state=42)
xg_model = XGBRegressor(random_state=42)
lb_model = LGBMRegressor(random_state=42)

rf_model.fit(X_train, y_train)
xg_model.fit(X_train, y_train)
lb_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
xg_pred = xg_model.predict(X_test)
lb_pred = lb_model.predict(X_test)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000177 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 290
[LightGBM] [Info] Number of data points in the train set: 2556, number of used features: 25
[LightGBM] [Info] Start training from score 10341.989008


In [77]:
rf_rmse = root_mean_squared_error(y_test, rf_pred)
xg_rmse = root_mean_squared_error(y_test, xg_pred)
lb_rmse = root_mean_squared_error(y_test, lb_pred)

print("RF RMSE:", rf_rmse)
print("XG RMSE:", xg_rmse)
print("LB RMSE:", lb_rmse)


RF RMSE: 973.6574568141835
XG RMSE: 926.4150716491562
LB RMSE: 1068.8678707429824
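
A single 80/20 split yields one RMSE per model, and gaps of this size can shift with the split. The following is a hedged sketch of k-fold cross-validation with scikit-learn's `cross_val_score`. It runs on synthetic data rather than the gas data; the data generator, the 5 folds, and `n_estimators=50` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the features and target (not the gas data)
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 50, size=(200, 4)).astype(float)
y_demo = 10000 + 20 * X_demo[:, 0] + rng.normal(0, 100, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximizes scores, so RMSE comes back negated
neg_rmse = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -neg_rmse
print("mean RMSE:", rmse_per_fold.mean(), "std:", rmse_per_fold.std())
```

The spread across folds gives a rough sense of whether a difference between two models is larger than the split-to-split noise.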


In [78]:
def model_test(X, y):
    """Fit RF, XGBoost and LightGBM on an 80/20 split and print each validation RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    rf_model = RandomForestRegressor(random_state=42)
    xg_model = XGBRegressor(random_state=42)
    lb_model = LGBMRegressor(random_state=42)

    rf_model.fit(X_train, y_train)
    xg_model.fit(X_train, y_train)
    lb_model.fit(X_train, y_train)

    rf_pred = rf_model.predict(X_test)
    xg_pred = xg_model.predict(X_test)
    lb_pred = lb_model.predict(X_test)

    rf_rmse = root_mean_squared_error(y_test, rf_pred)
    xg_rmse = root_mean_squared_error(y_test, xg_pred)
    lb_rmse = root_mean_squared_error(y_test, lb_pred)
    print("RF RMSE:", rf_rmse)
    print("XG RMSE:", xg_rmse)
    print("LB RMSE:", lb_rmse)

    return rf_model, xg_model, lb_model

In [79]:
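train_df2 = train_df.copy()
# Replace zero (missing) targets with the column mean; note that this mean still includes the zero rows
mean_usage = train_df2['총가스사용량'].mean()
train_df2['총가스사용량'] = train_df2['총가스사용량'].replace(0, mean_usage)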

In [80]:
model_test(train_df2.drop('총가스사용량', axis=1), train_df2['총가스사용량'])

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000080 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 290
[LightGBM] [Info] Number of data points in the train set: 2556, number of used features: 25
[LightGBM] [Info] Start training from score 10528.285173
RF RMSE: 619.8340842748598
XG RMSE: 552.9004523618427
LB RMSE: 711.8575776285679


(RandomForestRegressor(random_state=42),
 XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              feature_weights=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parallel_tree=None, ...),
 LGBMRegressor(random_state=42))

In [81]:
# Drop the rows whose target is 0 (missing) instead of imputing them
train_df3 = train_df.loc[train_df['총가스사용량'] > 0].copy()

In [82]:
rf, xg, gbm = model_test(train_df3.drop('총가스사용량', axis=1), train_df3['총가스사용량'])

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000126 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 289
[LightGBM] [Info] Number of data points in the train set: 2511, number of used features: 25
[LightGBM] [Info] Start training from score 10540.864072
RF RMSE: 526.0944782129594
XG RMSE: 502.8475000530377
LB RMSE: 618.5819952963213
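
The three runs above are not scored on the same rows. In the second and third runs the zero targets are replaced or removed before the split, so they also disappear from the validation set, and part of the drop in RMSE comes from that. One way to compare the strategies on equal footing is to split once, apply each strategy to the training part only, and score on validation rows whose target is non-zero. This is a sketch on synthetic data; the data generator, the three helper functions, and the model settings are illustrative assumptions, and `impute_mean` here uses the mean of the observed rows only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the gas data, with 20 targets stored as 0
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"x1": rng.integers(0, 50, n), "x2": rng.integers(0, 10, n)})
df["y"] = 10000 + 15 * df["x1"] + 40 * df["x2"] + rng.normal(0, 100, n)
df.loc[rng.choice(n, size=20, replace=False), "y"] = 0.0

# Split once, before any handling of the zeros
train, valid = train_test_split(df, test_size=0.2, random_state=42)
valid = valid[valid["y"] > 0]  # score only where the true value is known


def keep_zeros(t):
    return t


def impute_mean(t):
    observed_mean = t.loc[t["y"] > 0, "y"].mean()  # mean over non-zero rows only
    return t.assign(y=t["y"].replace(0, observed_mean))


def drop_zeros(t):
    return t[t["y"] > 0]


strategies = {"keep": keep_zeros, "mean": impute_mean, "drop": drop_zeros}
results = {}
for name, prepare in strategies.items():
    t = prepare(train)
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(t[["x1", "x2"]], t["y"])
    pred = model.predict(valid[["x1", "x2"]])
    results[name] = float(np.sqrt(np.mean((valid["y"].to_numpy() - pred) ** 2)))

print(results)
```

With this setup every strategy is judged on identical validation rows, so the RMSE values can be compared directly.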


In [83]:
# Apply the same one-hot encoding to the evaluation data
test_dummy = pd.get_dummies(test_df['시군구명'])
test_df = pd.concat([test_df, test_dummy], axis=1).drop('시군구명', axis=1)
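
Encoding the training and evaluation data separately only works when both contain the same set of districts. If a district were absent from one of them, the column sets would differ and `predict` would receive mismatched features. The following is a minimal sketch of aligning the evaluation columns to the training columns with `DataFrame.reindex`. The toy frames are illustrative, and `get_dummies(..., columns=...)` adds a `시군구명_` prefix that the cells above do not use.

```python
import pandas as pd

# Toy frames: the test set has no 종로구 row
train_toy = pd.DataFrame({"시군구명": ["강남구", "종로구", "구로구"], "공공문화": [1, 0, 2]})
test_toy = pd.DataFrame({"시군구명": ["강남구", "구로구"], "공공문화": [3, 0]})

train_enc = pd.get_dummies(train_toy, columns=["시군구명"])
test_enc = pd.get_dummies(test_toy, columns=["시군구명"])

# Reorder to the training columns; districts missing from the test set become all-zero columns
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

When the column sets already match, as they do in this notebook, the reindex leaves the frame unchanged.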

In [84]:
xg.predict(test_df)

array([10412.463, 10662.005,  8874.568, ...,  9470.274,  9492.741,
       10495.441], shape=(1476,), dtype=float32)
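
Predictions for the evaluation set are usually written to a file for submission. This is a hedged sketch: the file name `result.csv`, the column name `pred`, and the three stand-in values are assumptions, since the task description does not state a submission format.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for the array returned by xg.predict(test_df)
pred = np.array([10412.463, 10662.005, 8874.568], dtype=np.float32)

# Hypothetical output location and column name
out_path = os.path.join(tempfile.gettempdir(), "result.csv")
pd.DataFrame({"pred": pred}).to_csv(out_path, index=False)

# Read the file back to confirm the shape and that no index column was written
check = pd.read_csv(out_path)
print(check.shape)
```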