### Airnology 2023

#### Descriptions

**datetime**            : Time at which the data was computed (UNIX timestamp).

**datetime_iso**        : Time in ISO 8601 format, including the time zone.

**time-zone**           : Time zone offset from UTC, in seconds.

**temp**                : Current temperature in Celsius.

**visibility**          : Average visibility in meters.

**d_point**             : Current dew point in Celsius.

**feels**               : Current feels-like temperature in Celsius.

**min_temp**            : Minimum temperature over a given time window, in Celsius.

**max_temp**            : Maximum temperature over a given time window, in Celsius.

**pressure**            : Atmospheric pressure in hPa (the column is named `prssr` in the dataset).

**sea_level**           : Atmospheric pressure at sea level in hPa.

**grnd_level**          : Atmospheric pressure at ground level in hPa.

**hum**                 : Current air humidity, in percent.

**wind_spd**            : Current wind speed in m/s.

**wind_deg**            : Wind direction in degrees.

**rain_1h**             : Rainfall over the last 1 hour in mm (target variable).

**rain_3h**             : Rainfall over the last 3 hours in mm.

**snow_1h**             : Snowfall over the last 1 hour in mm.

**snow_3h**             : Snowfall over the last 3 hours in mm.

**clouds**              : Current cloud cover, in percent.
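
The three time columns describe the same instant in different forms. As a minimal sketch of how they relate, the snippet below renders a UNIX timestamp as ISO 8601 for a given UTC offset in seconds. The helper name `unix_to_iso` and the sample values are hypothetical and are not taken from the dataset.

```python
from datetime import datetime, timedelta, timezone

def unix_to_iso(ts, offset_seconds):
    """Render a UNIX timestamp as ISO 8601 in the zone given by a UTC offset in seconds."""
    tz = timezone(timedelta(seconds=offset_seconds))
    return datetime.fromtimestamp(ts, tz=tz).isoformat()

# Hypothetical values, for illustration only
print(unix_to_iso(283996800, 0))      # UTC
print(unix_to_iso(283996800, 25200))  # UTC+7, i.e. an offset of 7 * 3600 seconds
```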

#### Libraries

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import scipy.stats

#### Methods

In [51]:
# cleaning methods

def clean_temp(temp) :
    if isinstance(temp, str) :
        temp = temp.replace(' Celcius', '')
        temp = temp.replace(' C', '')
        temp = temp.replace('°C', '')
    return temp

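def clean_rain(rain) :
    if isinstance(rain, str) :
        try :
            float(rain)
            return rain
        except ValueError :
            # strings that cannot be parsed as a number are treated as no rain
            return 0
    # non-string values (floats, NaN) pass through unchanged
    return rain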
        
def clean_wind(wind) :
    if isinstance(wind, str) :
        wind = wind.replace('°', '')
        wind = wind.replace('m/s', '')
    return wind

def clean_visibility(visibility) :
    if isinstance(visibility, str) :
        if visibility in ['unidentified', ' ', 'unrecognized', 'unknown', 'empty', 'undefined', 'missing'] :
            return 'unknown'
        elif visibility in ['-1m', '-1 m'] :
            return '-1m'
        elif visibility in ['-1km', '-1 km'] :
            return '-1km'
    return visibility

def clean_ground_and_sea(ground_and_sea) :
    if isinstance(ground_and_sea, str) :
        if ground_and_sea in ['undetermined', 'unsettled', 'unestablished', 'not recorded', 'unknown', 'not_recorded', 'not-recorded','unspecified'] :
            return 'unknown'
    return ground_and_sea
    
def clean_prssr(prssr) :
    if isinstance(prssr, str) :
        if prssr in ['-100.0 hPa.', '-100.0 hPa', '-100'] :
            return 99.0
        prssr = prssr.replace('hPa.', '')
        prssr = prssr.replace('hPa', '')
    return prssr

def clean_hum(hum) :
    if isinstance(hum, str) :
        hum = hum.replace('%', '')
    return hum

def clean_cloud(cloud) :
    if isinstance(cloud, str) :
        cloud = cloud.replace('%', '')
    return cloud

def categorize_hour(hour) :
    if hour in [4,5]:
        return "dawn"
    elif hour in [6,7]:
        return "early morning"
    elif hour in [8,9,10]:
        return "late morning"
    elif hour in [11,12,13]:
        return "noon"
    elif hour in [14,15,16]:
        return "afternoon"
    elif hour in [17, 18,19]:
        return "evening"
    elif hour in [20, 21, 22]:
        return "night"
    elif hour in [23,0,1,2,3]:
        return "midnight"

# impute
def knn_impute(df, na_target) :
    # fill the missing values of `na_target` with KNN predictions
    # made from the numeric columns that have no missing values
    df = df.copy()

    numeric_df = df.select_dtypes(np.number)
    non_na_columns = numeric_df.loc[:, numeric_df.isna().sum() == 0].columns
    missing = numeric_df[na_target].isna()

    # nothing to impute; predicting on an empty frame would raise an error
    if not missing.any() :
        return df

    knn = KNeighborsRegressor()
    knn.fit(numeric_df.loc[~missing, non_na_columns], numeric_df.loc[~missing, na_target])

    df.loc[missing, na_target] = knn.predict(numeric_df.loc[missing, non_na_columns])

    return df
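
The cleaning helpers above strip unit suffixes value by value before the columns are cast to `float64`. A vectorized variant can be sketched with pandas string methods. The helper `strip_units` and the sample values are hypothetical, and unlike the helpers above this sketch silently turns anything unparseable into `NaN` through `errors="coerce"`, which is an assumption about the desired behavior.

```python
import pandas as pd

def strip_units(series, units):
    """Remove each unit suffix from the values, then cast to float (unparseable -> NaN)."""
    cleaned = series.astype(str)
    for unit in units:
        cleaned = cleaned.str.replace(unit, "", regex=False)
    return pd.to_numeric(cleaned.str.strip(), errors="coerce")

# Made-up raw readings mixing the suffixes handled by clean_temp
raw = pd.Series(["24.75 Celcius", "26.6°C", "27.31 C", 25.0, None])
temps = strip_units(raw, [" Celcius", "°C", " C"])
print(temps)
```

The suffixes are removed in the same order as in `clean_temp`, so the longer `' Celcius'` is stripped before the shorter `' C'`.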


#### Data Overview

In [52]:
train = pd.read_csv('../../datasets/train.csv')
test = pd.read_csv('../../datasets/test.csv')

TARGET = 'rain_1h'
# train.drop('rain_1h', axis=1, inplace=True)

In [53]:
train.isna().sum()

datetime             0
datetime_iso         0
time-zone            0
temp                 0
visibility      290768
d_point              0
feels                0
min_temp             0
max_temp             0
prssr                0
sea_level       148916
grnd_level      148961
hum                  0
wind_spd             0
wind_deg             0
rain_1h              0
rain_3h         149551
snow_1h         149184
snow_3h         149181
clouds               0
dtype: int64

In [54]:
print(f'train shape : {train.shape}')
print(f'test shape : {test.shape}')

train shape : (341880, 20)
test shape : (49368, 19)


#### Merging train and test

In [55]:
# # merging train2 and test data
# merged = pd.concat([train2, test], axis = 0).reset_index(drop=True) 
# merged.drop(['datetime', 'snow_1h', 'snow_3h', 'time-zone'], axis=1, inplace=True)
# merged.set_index('datetime_iso', drop=True, inplace=True)

train2 = train.copy()


drop_cols = ['datetime', 'snow_1h', 'snow_3h', 'time-zone']

train2.drop(drop_cols, axis=1, inplace=True)
test.drop(drop_cols, axis=1, inplace=True)

# converting temp dtypes
for column in ['temp','d_point','feels','min_temp','max_temp'] :
    train2[column] = train2[column].apply(lambda x: clean_temp(x))
    train2[column] = train2[column].astype('float64')

    test[column] = test[column].apply(lambda x: clean_temp(x))
    test[column] = test[column].astype('float64')

# converting rain dtypes
for column in [TARGET, 'rain_3h'] :
    train2[column] = train2[column].apply(lambda x: clean_rain(x))
    train2[column] = train2[column].astype('float64')

for column in ['rain_3h'] :
    test[column] = test[column].apply(lambda x: clean_rain(x))
    test[column] = test[column].astype('float64')

# converting wind dtypes
for column in ['wind_spd', 'wind_deg'] :
    train2[column] = train2[column].apply(lambda x: clean_wind(x))
    train2[column] = train2[column].astype('float64')

    test[column] = test[column].apply(lambda x: clean_wind(x))
    test[column] = test[column].astype('float64')

# cleaning visibility
for column in ['visibility'] :
    train2[column] = train2[column].apply(lambda x: clean_visibility(x))
    train2[column] = train2[column].fillna(train2[column].mode().iloc[0])

    test[column] = test[column].apply(lambda x: clean_visibility(x))
    test[column] = test[column].fillna(train2[column].mode().iloc[0])

# cleaning ground and sea level
for column in ['sea_level', 'grnd_level'] :
    train2[column] = train2[column].apply(lambda x: clean_ground_and_sea(x))
    train2[column] = train2[column].fillna(train2[column].mode().iloc[0])

    test[column] = test[column].apply(lambda x: clean_ground_and_sea(x))
    test[column] = test[column].fillna(train2[column].mode().iloc[0])

# cleaning pressure
for column in ['prssr'] :
    train2[column] = train2[column].apply(lambda x: clean_prssr(x))
    train2[column] = train2[column].astype('float64')

    test[column] = test[column].apply(lambda x: clean_prssr(x))
    test[column] = test[column].astype('float64')
    
# cleaning humidity
for column in ['hum'] :
    train2[column] = train2[column].apply(lambda x: clean_hum(x))
    train2[column] = train2[column].astype('float64')

    test[column] = test[column].apply(lambda x: clean_hum(x))
    test[column] = test[column].astype('float64')

# cleaning clouds 
for column in ['clouds'] :
    train2[column] = train2[column].apply(lambda x: clean_cloud(x))
    train2[column] = train2[column].astype('float64')

    test[column] = test[column].apply(lambda x: clean_cloud(x))
    test[column] = test[column].astype('float64')

# preprocessing train
train2['datetime_iso'] = pd.to_datetime(train2['datetime_iso'])

train2['month'] = train2['datetime_iso'].dt.month

# train2['date_issued:day_of_week'] = train2['datetime_iso'].dt.day_of_week
# train2['date_issued:day_of_week'] = train2['date_issued:day_of_week'].astype(str)

train2['day_of_year'] = train2['datetime_iso'].dt.day_of_year
# train2['date_issued:day_of_year'] = train2['date_issued:day_of_year'].astype(str)

train2['day'] = train2['datetime_iso'].dt.day
train2['day'] = train2['day'].astype(str)

train2['hour'] = train2['datetime_iso'].dt.hour
train2['hour_cat'] = train2['hour'].apply(categorize_hour)
train2['hour'] = train2['hour'].astype(str)

train2 = train2.set_index('datetime_iso')

# preprocessing test
test['datetime_iso'] = pd.to_datetime(test['datetime_iso'])

test['month'] = test['datetime_iso'].dt.month

# test['date_issued:day_of_week'] = test['datetime_iso'].dt.day_of_week
# test['date_issued:day_of_week'] = test['date_issued:day_of_week'].astype(str)

test['day_of_year'] = test['datetime_iso'].dt.day_of_year
# test['date_issued:day_of_year'] = test['date_issued:day_of_year'].astype(str)

test['day'] = test['datetime_iso'].dt.day
test['day'] = test['day'].astype(str)

test['hour'] = test['datetime_iso'].dt.hour
test['hour_cat'] = test['hour'].apply(categorize_hour)
test['hour'] = test['hour'].astype(str)

test = test.set_index('datetime_iso')

# column transform from month to season
train2['month']= np.sin(-0.449 * (train2['month'] - 3))
test['month']= np.sin(-0.449 * (test['month'] - 3))

# column transform day of year
train2['day_of_year']= 55 * (np.sin(0.015 * (train2['day_of_year'] - 60)))
test['day_of_year']= 55 * (np.sin(0.015 * (test['day_of_year'] - 60)))
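
To see what the two sine transforms above produce, the sketch below evaluates them on plain calendar values. The function names are hypothetical; the constants are copied from the cell. Note that the implied periods are 2π/0.449 ≈ 14 months and 2π/0.015 ≈ 419 days rather than exactly one year, so January and December land on nearly the same value (about 0.782 and 0.783).

```python
import numpy as np

def encode_month(month):
    """Sine transform of the calendar month, equal to zero in March."""
    return np.sin(-0.449 * (np.asarray(month, dtype=float) - 3))

def encode_day_of_year(day):
    """Scaled sine transform of the day of year, equal to zero on day 60."""
    return 55 * np.sin(0.015 * (np.asarray(day, dtype=float) - 60))

months = np.arange(1, 13)
for m, v in zip(months, encode_month(months)):
    print(f"month {m:2d} -> {v:+.3f}")
```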

#### Filtering Outliers

In [56]:
# drop rows with a negative target
train3 = train2[train2["rain_1h"] >= 0]

# upper bounds; rows at or above a bound are dropped, not capped
MAX_TEMP = 50
MAX_HUM = 150
MAX_MAX_TEMP = 50
MAX_FEELS = 60
MAX_PRESSURE  = 1200
MAX_MIN_TEMP = 60

print(train3.shape)
train3 = train3[train3["max_temp"] < MAX_MAX_TEMP]
train3 = train3[train3["hum"] < MAX_HUM]
train3 = train3[train3["feels"] < MAX_FEELS]
train3 = train3[train3["prssr"] < MAX_PRESSURE]
train3 = train3[train3["min_temp"] < MAX_MIN_TEMP]
# copy so the assignment below does not act on a view of train2
train3 = train3[train3["temp"] < MAX_TEMP].copy()
# wrap wind directions above 360 degrees back into range
train3["wind_deg"] = train3["wind_deg"].apply(lambda x: x % 360 if x > 360 else x)
print(train3.shape)

(341880, 20)
(312278, 20)
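
The chain of filters above can also be written as a single boolean mask built from a dictionary of bounds. This is only a sketch of the same idea: `drop_above_bounds` is a hypothetical helper, the bounds are copied from the cell, and the toy readings are made up.

```python
import pandas as pd

# Upper bounds copied from the cell above
UPPER_BOUNDS = {"temp": 50, "max_temp": 50, "min_temp": 60,
                "feels": 60, "hum": 150, "prssr": 1200}

def drop_above_bounds(df, bounds):
    """Keep only the rows where every listed column is strictly below its bound."""
    keep = pd.Series(True, index=df.index)
    for column, upper in bounds.items():
        keep &= df[column] < upper
    return df[keep].copy()

# Toy frame with made-up readings: the second and third rows each break one bound
toy = pd.DataFrame({
    "temp":     [24.7, 999.0, 26.1],
    "max_temp": [25.2, 30.0, 27.0],
    "min_temp": [24.2, 20.0, 25.5],
    "feels":    [25.7, 28.0, 27.3],
    "hum":      [95.0, 90.0, 200.0],
    "prssr":    [1012.0, 1010.0, 1009.0],
})
filtered = drop_above_bounds(toy, UPPER_BOUNDS)
print(filtered.shape)
```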


#### Imputer

In [57]:
train3.isna().sum()

temp                0
visibility          0
d_point             0
feels               0
min_temp            0
max_temp            0
prssr               0
sea_level           0
grnd_level          0
hum                 0
wind_spd            0
wind_deg            0
rain_1h             0
rain_3h        136644
clouds              0
month               0
day_of_year         0
day                 0
hour                0
hour_cat            0
dtype: int64

In [58]:
# imputing with knn
train4 = train3.copy()
for column in ['rain_3h'] :
    train4 = knn_impute(train4, column)
    test = knn_impute(test, column)
    
for column in ['d_point'] :
    test = knn_impute(test, column)

train4

# # imputing with mean
# merged2 = merged.copy()
# for column in ['d_point','rain_3h'] :
#     merged2[column] = merged[column].fillna(merged[column].mean())

# merged2

datetime_iso,temp,visibility,d_point,feels,min_temp,max_temp,prssr,sea_level,grnd_level,hum,wind_spd,wind_deg,rain_1h,rain_3h,clouds,month,day_of_year,day,hour,hour_cat
1979-01-01 00:00:00+00:00,24.75,unknown,23.89,25.76,24.28,25.22,1012.0,unknown,unknown,95.0,0.82,320.0,0.00,0.0,100.0,0.782082,-42.565324,1,0,midnight
1979-01-01 01:00:00+00:00,24.58,unknown,23.73,25.57,23.99,25.26,1012.0,unknown,unknown,95.0,0.96,338.0,0.00,0.0,100.0,0.782082,-42.565324,1,1,midnight
1979-01-01 02:00:00+00:00,26.60,unknown,24.06,26.60,26.10,27.39,1012.0,unknown,unknown,86.0,1.22,339.0,0.00,0.0,99.0,0.782082,-42.565324,1,2,midnight
1979-01-01 03:00:00+00:00,27.31,unknown,24.37,30.90,26.59,28.36,1012.0,unknown,unknown,84.0,1.08,342.0,0.13,0.0,94.0,0.782082,-42.565324,1,3,midnight
1979-01-01 04:00:00+00:00,27.41,unknown,25.05,31.54,26.58,28.31,1011.0,unknown,unknown,87.0,0.86,336.0,0.34,0.0,100.0,0.782082,-42.565324,1,4,dawn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2017-12-31 17:00:00+00:00,25.34,unknown,24.83,26.46,24.75,26.04,1008.0,unknown,unknown,97.0,0.12,188.0,2.66,0.0,100.0,0.782958,-54.481733,31,17,evening
2017-12-31 18:00:00+00:00,25.11,unknown,24.60,26.21,24.59,25.62,1007.0,unknown,unknown,97.0,0.98,21.0,0.00,0.0,100.0,0.782958,-54.481733,31,18,evening
2017-12-31 20:00:00+00:00,24.51,unknown,24.17,25.58,23.89,25.13,1006.0,unknown,unknown,98.0,0.85,21.0,0.00,0.0,100.0,0.782958,-54.481733,31,20,night
2017-12-31 22:00:00+00:00,26.68,unknown,24.71,29.76,25.02,27.25,1008.0,unknown,unknown,89.0,1.46,17.0,0.00,0.0,98.0,0.782958,-54.481733,31,22,night
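
The idea behind `knn_impute` can be traced on a small example: rows where the target column is present train a `KNeighborsRegressor` on the fully observed columns, and the fitted model predicts the missing entries. The toy values and `n_neighbors=2` below are assumptions for illustration; the notebook uses the scikit-learn default number of neighbors.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Made-up readings; rain_3h is missing in rows 1 and 4
toy = pd.DataFrame({
    "temp":    [24.0, 25.0, 26.0, 27.0, 28.0, 29.0],
    "hum":     [95.0, 93.0, 90.0, 88.0, 85.0, 83.0],
    "rain_3h": [1.0, np.nan, 0.6, 0.4, np.nan, 0.0],
})

features = ["temp", "hum"]            # columns without missing values
missing = toy["rain_3h"].isna()

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(toy.loc[~missing, features], toy.loc[~missing, "rain_3h"])

# Each missing value becomes the mean of its two nearest observed neighbors
toy.loc[missing, "rain_3h"] = knn.predict(toy.loc[missing, features])
print(toy)
```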


In [59]:
train4.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 312278 entries, 1979-01-01 00:00:00+00:00 to 2017-12-31 23:00:00+00:00
Data columns (total 20 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   temp         312278 non-null  float64
 1   visibility   312278 non-null  object 
 2   d_point      312278 non-null  float64
 3   feels        312278 non-null  float64
 4   min_temp     312278 non-null  float64
 5   max_temp     312278 non-null  float64
 6   prssr        312278 non-null  float64
 7   sea_level    312278 non-null  object 
 8   grnd_level   312278 non-null  object 
 9   hum          312278 non-null  float64
 10  wind_spd     312278 non-null  float64
 11  wind_deg     312278 non-null  float64
 12  rain_1h      312278 non-null  float64
 13  rain_3h      312278 non-null  float64
 14  clouds       312278 non-null  float64
 15  month        312278 non-null  float64
 16  day_of_year  312278 non-null  float64
 17  day          312278 n

#### Transformations

In [60]:
very_skewed_col = []
num_cols = train4.select_dtypes(np.number)
for column in num_cols :
    skew = scipy.stats.skew(train4[column])
    print(f'{column} skewness : {skew}')
    if skew > 0.5 or skew < -0.5 :
        very_skewed_col.append(column)

print(very_skewed_col)

temp skewness : 0.6603544548266891
d_point skewness : 7.908324917944134
feels skewness : 0.6995901014134408
min_temp skewness : 0.5625930127787475
max_temp skewness : 0.6445112697468711
prssr skewness : -0.20421068827097205
hum skewness : -1.1640478084317862
wind_spd skewness : 5.200931091784641
wind_deg skewness : -0.32973978725682096
rain_1h skewness : 6.7224247502743015
rain_3h skewness : nan
clouds skewness : -1.784178447731747
month skewness : 0.18767254222517527
day_of_year skewness : -0.20257793535651178
['temp', 'd_point', 'feels', 'min_temp', 'max_temp', 'hum', 'wind_spd', 'rain_1h', 'clouds']
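
The 0.5 threshold above flags columns whose distribution is noticeably asymmetric. As a rough illustration of why a log transform (one of the options tried in the commented-out cell that follows) is a common response, the sketch below compares skewness before and after `np.log1p` on synthetic right-skewed data. The sample is generated here and has no relation to the dataset.

```python
import numpy as np
import scipy.stats

# Synthetic right-skewed sample (log-normal), for illustration only
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_before = scipy.stats.skew(sample)
skew_after = scipy.stats.skew(np.log1p(sample))

print(f"skewness before log1p : {skew_before:.3f}")
print(f"skewness after log1p  : {skew_after:.3f}")
```

`log1p` is only defined for values greater than -1, so it would not apply directly to columns that can go below that.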


In [61]:
# train5 = train4.copy()
# for column in ['temp', 'feels', 'min_temp', 'max_temp', 'wind_spd', 'clouds', 'd_point', 'hum'] :
#     train5[column] = np.log1p(train5[column])
#     test[column] = np.log1p(test[column])

# for column in ['d_point', 'hum'] :
#     train5[column], _lambda = scipy.stats.boxcox(train5[column])
#     test[column] = scipy.stats.boxcox(test[column], lmbda=_lambda)

# num_cols = train5.select_dtypes(np.number)
# for column in num_cols :
#     skew = scipy.stats.skew(train5[column])
#     print(f'{column} skewness : {skew}')

In [62]:
test.isna().sum()

temp           0
visibility     0
d_point        0
feels          0
min_temp       0
max_temp       0
prssr          0
sea_level      0
grnd_level     0
hum            0
wind_spd       0
wind_deg       0
rain_3h        0
clouds         0
month          0
day_of_year    0
day            0
hour           0
hour_cat       0
dtype: int64

#### Target Transformation

In [63]:
train5 = train4.copy()
# train5[TARGET] = np.log(train5[TARGET])

#### Encoding

In [64]:
train6 = pd.get_dummies(train5)
test = pd.get_dummies(test)

In [65]:
print(train6.shape)
print(test.shape)

(312278, 85)
(49368, 84)
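
Calling `pd.get_dummies` separately on train and test only yields matching columns when both contain the same categories. The shapes above differ by exactly one column, which is consistent with the target being the only difference, but that is not checked explicitly. A minimal sketch of a guard, using made-up toy frames, is to reindex the test dummies onto the training columns:

```python
import pandas as pd

# Toy frames with different category sets (made-up values)
train_toy = pd.DataFrame({"temp": [24.7, 26.6, 27.4],
                          "hour_cat": ["midnight", "dawn", "noon"]})
test_toy = pd.DataFrame({"temp": [25.1, 24.5],
                         "hour_cat": ["night", "dawn"]})

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# Categories absent from test become all-zero columns;
# categories that never appeared in train are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

With a real training frame, the target column would need to be dropped from the training columns before reindexing.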


In [66]:
train6

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_1h,...,hour_8,hour_9,hour_cat_afternoon,hour_cat_dawn,hour_cat_early morning,hour_cat_evening,hour_cat_late morning,hour_cat_midnight,hour_cat_night,hour_cat_noon
1979-01-01 00:00:00+00:00,24.75,23.89,25.76,24.28,25.22,1012.0,95.0,0.82,320.0,0.00,...,0,0,0,0,0,0,0,1,0,0
1979-01-01 01:00:00+00:00,24.58,23.73,25.57,23.99,25.26,1012.0,95.0,0.96,338.0,0.00,...,0,0,0,0,0,0,0,1,0,0
1979-01-01 02:00:00+00:00,26.60,24.06,26.60,26.10,27.39,1012.0,86.0,1.22,339.0,0.00,...,0,0,0,0,0,0,0,1,0,0
1979-01-01 03:00:00+00:00,27.31,24.37,30.90,26.59,28.36,1012.0,84.0,1.08,342.0,0.13,...,0,0,0,0,0,0,0,1,0,0
1979-01-01 04:00:00+00:00,27.41,25.05,31.54,26.58,28.31,1011.0,87.0,0.86,336.0,0.34,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2017-12-31 17:00:00+00:00,25.34,24.83,26.46,24.75,26.04,1008.0,97.0,0.12,188.0,2.66,...,0,0,0,0,0,1,0,0,0,0
2017-12-31 18:00:00+00:00,25.11,24.60,26.21,24.59,25.62,1007.0,97.0,0.98,21.0,0.00,...,0,0,0,0,0,1,0,0,0,0
2017-12-31 20:00:00+00:00,24.51,24.17,25.58,23.89,25.13,1006.0,98.0,0.85,21.0,0.00,...,0,0,0,0,0,0,0,0,1,0
2017-12-31 22:00:00+00:00,26.68,24.71,29.76,25.02,27.25,1008.0,89.0,1.46,17.0,0.00,...,0,0,0,0,0,0,0,0,1,0


#### Scaling

In [67]:
# splitting the dataset

X = train6.drop(TARGET, axis=1)
y = train6[TARGET]

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=101)

In [68]:
X_train.columns

Index(['temp', 'd_point', 'feels', 'min_temp', 'max_temp', 'prssr', 'hum',
       'wind_spd', 'wind_deg', 'rain_3h', 'clouds', 'month', 'day_of_year',
       'visibility_-1', 'visibility_-1km', 'visibility_-1m',
       'visibility_unknown', 'sea_level_-1', 'sea_level_unknown',
       'grnd_level_-1', 'grnd_level_unknown', 'day_1', 'day_10', 'day_11',
       'day_12', 'day_13', 'day_14', 'day_15', 'day_16', 'day_17', 'day_18',
       'day_19', 'day_2', 'day_20', 'day_21', 'day_22', 'day_23', 'day_24',
       'day_25', 'day_26', 'day_27', 'day_28', 'day_29', 'day_3', 'day_30',
       'day_31', 'day_4', 'day_5', 'day_6', 'day_7', 'day_8', 'day_9',
       'hour_0', 'hour_1', 'hour_10', 'hour_11', 'hour_12', 'hour_13',
       'hour_14', 'hour_15', 'hour_16', 'hour_17', 'hour_18', 'hour_19',
       'hour_2', 'hour_20', 'hour_21', 'hour_22', 'hour_23', 'hour_3',
       'hour_4', 'hour_5', 'hour_6', 'hour_7', 'hour_8', 'hour_9',
       'hour_cat_afternoon', 'hour_cat_dawn', 'hour_cat_early mor

In [69]:
train5.isna().sum()

temp           0
visibility     0
d_point        0
feels          0
min_temp       0
max_temp       0
prssr          0
sea_level      0
grnd_level     0
hum            0
wind_spd       0
wind_deg       0
rain_1h        0
rain_3h        0
clouds         0
month          0
day_of_year    0
day            0
hour           0
hour_cat       0
dtype: int64

In [70]:
all_cols = X_train.columns
num_cols = ['temp', 'd_point', 'feels', 'min_temp', 'max_temp', 'prssr', 'hum',
       'wind_spd', 'wind_deg', 'rain_3h', 'clouds', 'month',
       'day_of_year']
cat_cols = np.setdiff1d(all_cols, num_cols)

scaler = StandardScaler()
scaler.fit(X_train[num_cols])

X_train_scaled = pd.concat([pd.DataFrame(scaler.transform(X_train[num_cols]), columns=num_cols, index=X_train.index), X_train[cat_cols]], axis=1)
X_test_scaled = pd.concat([pd.DataFrame(scaler.transform(X_test[num_cols]), columns=num_cols, index=X_test.index), X_test[cat_cols]], axis=1)

# full data
X_scaled = pd.concat([pd.DataFrame(scaler.transform(X[num_cols]), columns=num_cols, index=X.index), X[cat_cols]], axis=1)
test_scaled = pd.concat([pd.DataFrame(scaler.transform(test[num_cols]), columns=num_cols, index=test.index), test[cat_cols]], axis=1)
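
The cell above fits the scaler on the training split only and reuses its statistics for every other frame, leaving the dummy columns untouched. The same pattern can be sketched more compactly as below. `scale_numeric` is a hypothetical helper and the toy values are made up; unlike the notebook, it refits the scaler on each call, which is a simplification.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_numeric(fit_df, apply_df, num_cols):
    """Standardize num_cols of apply_df with statistics learned from fit_df only."""
    scaler = StandardScaler().fit(fit_df[num_cols])
    scaled = apply_df.copy()
    scaled[num_cols] = scaler.transform(apply_df[num_cols])
    return scaled

# Toy frames: two numeric columns and one dummy column
train_toy = pd.DataFrame({"temp": [24.0, 26.0, 28.0],
                          "hum": [80.0, 90.0, 100.0],
                          "hour_cat_dawn": [0, 1, 0]})
new_toy = pd.DataFrame({"temp": [26.0], "hum": [90.0], "hour_cat_dawn": [1]})

num_cols = ["temp", "hum"]
train_scaled = scale_numeric(train_toy, train_toy, num_cols)
new_scaled = scale_numeric(train_toy, new_toy, num_cols)
print(new_scaled)
```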

In [71]:
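# note: early_stopping_rounds only takes effect when an eval_set is passed to fit();
# without one, as below, all 35000 iterations are run
cb_regressor = CatBoostRegressor(iterations=35000, early_stopping_rounds=10)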
xgb_regressor = XGBRegressor()
lgbm_regressor = LGBMRegressor()


cb_regressor.fit(X_scaled, y)

Learning rate set to 0.005636
0:	learn: 0.7048569	total: 27.4ms	remaining: 16m
1:	learn: 0.7044853	total: 50.1ms	remaining: 14m 37s
2:	learn: 0.7041122	total: 78.5ms	remaining: 15m 16s
3:	learn: 0.7037428	total: 108ms	remaining: 15m 44s
4:	learn: 0.7033775	total: 133ms	remaining: 15m 32s
5:	learn: 0.7030175	total: 156ms	remaining: 15m 9s
6:	learn: 0.7026605	total: 183ms	remaining: 15m 13s
7:	learn: 0.7023082	total: 205ms	remaining: 14m 57s
8:	learn: 0.7019570	total: 232ms	remaining: 15m 2s
9:	learn: 0.7016106	total: 263ms	remaining: 15m 20s
10:	learn: 0.7012678	total: 287ms	remaining: 15m 12s
11:	learn: 0.7009269	total: 312ms	remaining: 15m 8s
12:	learn: 0.7005999	total: 334ms	remaining: 14m 59s
13:	learn: 0.7002729	total: 361ms	remaining: 15m 2s
14:	learn: 0.6999441	total: 385ms	remaining: 14m 56s
15:	learn: 0.6996233	total: 412ms	remaining: 15m
16:	learn: 0.6993033	total: 436ms	remaining: 14m 57s
17:	learn: 0.6989854	total: 463ms	remaining: 15m
18:	learn: 0.6986751	total: 488ms	remai

<catboost.core.CatBoostRegressor at 0x23391461b80>

In [72]:
xgb_regressor.fit(X_scaled, y)

In [73]:
lgbm_regressor.fit(X_scaled, y)

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2350
[LightGBM] [Info] Number of data points in the train set: 312278, number of used features: 83
[LightGBM] [Info] Start training from score 0.189211


In [74]:
# cb_y_pred = cb_regressor.predict(X_test)
# xgb_y_pred = xgb_regressor.predict(X_test)
# lgbm_y_pred = lgbm_regressor.predict(X_test)

# ensemble_y_pred = (
#     (0.6 * cb_y_pred) +
#     (0.4 * xgb_y_pred)
#     # (0.2 * lgbm_y_pred) 
# )
# print('RMSE:', np.sqrt(mean_squared_error(y_test, ensemble_y_pred))) 

In [75]:
submission = pd.read_csv("../../datasets/sample_submission.csv")

cat_y_pred = cb_regressor.predict(test_scaled)
xgb_y_pred = xgb_regressor.predict(test_scaled)
lgbm_y_pred = lgbm_regressor.predict(test_scaled)

ensembled_y_pred = (
   ( 0.6 * cat_y_pred ) +
   ( 0.1 * xgb_y_pred ) +
   ( 0.3 * lgbm_y_pred ) 
)

# rainfall cannot be negative, so clip predictions at zero
submission['rain_1h'] = np.clip(ensembled_y_pred, 0, None)
submission.to_csv('../predictions/submission5_rang.csv', index=False)
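
The blending step is a fixed-weight average of the three models followed by clipping at zero. The sketch below isolates that logic and adds an RMSE helper of the kind used in the commented-out hold-out check. `blend` and `rmse` are hypothetical names and the prediction arrays are made up; only the weights 0.6, 0.1, and 0.3 come from the cell above.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of model predictions, clipped at zero (rainfall cannot be negative)."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights should sum to 1")
    stacked = np.vstack(predictions)           # shape: (n_models, n_samples)
    return np.clip(weights @ stacked, 0.0, None)

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up predictions from three models for three samples
cat_pred = np.array([0.10, -0.05, 2.00])
xgb_pred = np.array([0.20, -0.10, 1.80])
lgbm_pred = np.array([0.00, -0.20, 2.20])

blended = blend([cat_pred, xgb_pred, lgbm_pred], [0.6, 0.1, 0.3])
print(blended)
```

To score a blend honestly with `rmse`, the models would have to be fit on `X_train_scaled` rather than on the full `X_scaled`, so that `X_test_scaled` stays unseen.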