### Airnology 2023

#### Descriptions

**datetime**            : Time at which the data was computed (UNIX timestamp).

**datetime_iso**        : Time in ISO 8601 format, including the time zone.

**time-zone**           : Time zone offset from UTC, in seconds.

**temp**                : Current temperature in Celsius.

**visibility**          : Average visibility in meters.

**d_point**             : Current dew point in Celsius.

**feels**               : Current "feels like" temperature in Celsius.

**min_temp**            : Minimum temperature over a given time window, in Celsius.

**max_temp**            : Maximum temperature over a given time window, in Celsius.

**pressure**            : Atmospheric pressure in hPa (this column appears as `prssr` in the data).

**sea_level**           : Atmospheric pressure at sea level in hPa.

**grnd_level**          : Atmospheric pressure at ground level in hPa.

**hum**                 : Current relative humidity, in percent.

**wind_spd**            : Current wind speed in m/s.

**wind_deg**            : Wind direction in degrees.

**rain_1h**             : Rainfall over the last 1 hour in mm. (target variable)

**rain_3h**             : Rainfall over the last 3 hours in mm.

**snow_1h**             : Snowfall over the last 1 hour in mm.

**snow_3h**             : Snowfall over the last 3 hours in mm.

**clouds**              : Current cloud cover, in percent.
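
The relationship between `datetime`, `time-zone`, and `datetime_iso` can be illustrated with a small sketch. It assumes `datetime` holds seconds since the UNIX epoch and `time-zone` holds an offset in seconds; the helper name `to_iso` is hypothetical and is not part of this notebook.

```python
from datetime import datetime, timedelta, timezone

def to_iso(unix_seconds: int, offset_seconds: int = 0) -> str:
    """Render a UNIX timestamp as ISO 8601 in the zone given by a UTC offset in seconds."""
    tz = timezone(timedelta(seconds=offset_seconds))
    return datetime.fromtimestamp(unix_seconds, tz=tz).isoformat()

# Same instant, shown in UTC and with a +7 hour offset (25200 seconds)
print(to_iso(283996800))
print(to_iso(283996800, 25200))
```

The first call corresponds to the first row of the dataset (1979-01-01 00:00 UTC).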

#### Libraries

In [176]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

#### Methods

In [177]:
# cleaning methods

def clean_temp(temp) :
    if isinstance(temp, str) :
        temp = temp.replace(' Celcius', '')
        temp = temp.replace(' C', '')
        temp = temp.replace('°C', '')
    return temp

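def clean_rain(rain) :
    if isinstance(rain, str) :
        try :
            return float(rain)
        except ValueError :
            # non-numeric strings are treated as no rain
            return 0.0
    # numeric and missing values pass through instead of silently becoming None
    return rain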
        
def clean_wind(wind) :
    if isinstance(wind, str) :
        wind = wind.replace('°', '')
        wind = wind.replace('m/s', '')
    return wind

def clean_visibility(visibility) :
    if isinstance(visibility, str) :
        if visibility in ['unidentified', ' ', 'unrecognized', 'unknown', 'empty', 'undefined', 'missing'] :
            return 'unknown'
        elif visibility in ['-1m', '-1 m'] :
            return '-1m'
        elif visibility in ['-1km', '-1 km'] :
            return '-1km'
    return visibility

def clean_ground_and_sea(ground_and_sea) :
    if isinstance(ground_and_sea, str) :
        if ground_and_sea in ['undetermined', 'unsettled', 'unestablished', 'not recorded', 'unknown', 'not_recorded', 'not-recorded','unspecified'] :
            return 'unknown'
    return ground_and_sea
    
def clean_prssr(prssr) :
    if isinstance(prssr, str) :
        if prssr in ['-100.0 hPa.', '-100.0 hPa', '-100'] :
            return 99.0
        prssr = prssr.replace('hPa.', '')
        prssr = prssr.replace('hPa', '')
    return prssr

def clean_hum(hum) :
    if isinstance(hum, str) :
        hum = hum.replace('%', '')
    return hum

def clean_cloud(cloud) :
    if isinstance(cloud, str) :
        cloud = cloud.replace('%', '')
    return cloud

# impute
def knn_impute(df, na_target) :
    df = df.copy()

    numeric_df = df.select_dtypes(np.number)
    non_na_columns = numeric_df.loc[:, numeric_df.isna().sum() == 0].columns

    # rows where the target is known are used for fitting, the rest are predicted
    known = numeric_df[na_target].notna()

    y_train = numeric_df.loc[known, na_target]
    X_train = numeric_df.loc[known, non_na_columns]
    X_test = numeric_df.loc[~known, non_na_columns]

    knn = KNeighborsRegressor()
    knn.fit(X_train, y_train)

    y_pred = knn.predict(X_test)

    df.loc[~known, na_target] = y_pred

    return df
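
The idea behind `knn_impute` can be tried on a toy frame: fit a KNN regressor on the rows where the target column is known, using the fully populated columns as features, then predict the missing entries. The sketch below is a standalone illustration with made-up values and `n_neighbors=2` (an assumption for the tiny sample; the notebook uses the default).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Toy data (made up): d_point has one missing value
toy = pd.DataFrame({
    "temp":    [24.0, 25.0, 26.0, 30.0, 31.0],
    "hum":     [95.0, 94.0, 90.0, 70.0, 68.0],
    "d_point": [23.0, 24.0, np.nan, 24.0, 25.0],
})

missing = toy["d_point"].isna()
features = ["temp", "hum"]  # columns without missing values

# Fit on rows with a known target, predict the rows without one
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(toy.loc[~missing, features], toy.loc[~missing, "d_point"])
toy.loc[missing, "d_point"] = knn.predict(toy.loc[missing, features])

print(toy)
```

Because the features are not scaled here, columns with larger ranges dominate the distance computation; scaling before imputation may be worth testing.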


#### Data Overview

In [178]:
train = pd.read_csv('../../datasets/train.csv')
test = pd.read_csv('../../datasets/test.csv')

TARGET = train['rain_1h']
train.drop('rain_1h', axis=1, inplace=True)

In [179]:
print(f'train shape : {train.shape}')
print(f'test shape : {test.shape}')

train shape : (341880, 19)
test shape : (49368, 19)


#### Merging train and test

In [180]:
# merging train and test data
merged = pd.concat([train, test], axis = 0).reset_index(drop=True) 
merged.drop(['datetime', 'snow_1h', 'snow_3h', 'time-zone'], axis=1, inplace=True)
merged.set_index('datetime_iso', drop=True, inplace=True)

# converting temp dtypes
for column in ['temp','d_point','feels','min_temp','max_temp'] :
    merged[column] = merged[column].apply(lambda x: clean_temp(x))
    merged[column] = merged[column].astype('float64')

# converting rain dtypes
for column in ['rain_3h'] :
    merged[column] = merged[column].apply(lambda x: clean_rain(x))
    merged[column] = merged[column].astype('float64')

# converting wind dtypes
for column in ['wind_spd', 'wind_deg'] :
    merged[column] = merged[column].apply(lambda x: clean_wind(x))
    merged[column] = merged[column].astype('float64')

# cleaning visibility
for column in ['visibility'] :
    merged[column] = merged[column].apply(lambda x: clean_visibility(x))
    merged[column] = merged[column].fillna(merged[column].mode().iloc[0])

# cleaning ground and sea level
for column in ['sea_level', 'grnd_level'] :
    merged[column] = merged[column].apply(lambda x: clean_ground_and_sea(x))
    merged[column] = merged[column].fillna(merged[column].mode().iloc[0])

# cleaning pressure
for column in ['prssr'] :
    merged[column] = merged[column].apply(lambda x: clean_prssr(x))
    merged[column] = merged[column].astype('float64')
    
# cleaning humidity
for column in ['hum'] :
    merged[column] = merged[column].apply(lambda x: clean_hum(x))
    merged[column] = merged[column].astype('float64')

# cleaning clouds 
for column in ['clouds'] :
    merged[column] = merged[column].apply(lambda x: clean_cloud(x))
    merged[column] = merged[column].astype('float64')
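
As a possible alternative to one cleaning function per unit, the leading number can be extracted with a regular expression and converted in a single vectorized step. This is only a sketch: the helper name `to_number` is hypothetical, the sample values are made up, and it does not reproduce the special cases handled above (for example mapping unparsable rain values to 0 or `-100` pressure readings to 99.0).

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Extract the first (optionally signed) decimal number from each value."""
    extracted = series.astype(str).str.extract(r"(-?\d+(?:\.\d+)?)")[0]
    # Values without any number become NaN
    return pd.to_numeric(extracted, errors="coerce")

raw = pd.Series(["24.75 Celcius", "25.1°C", "1012.0 hPa.", "95%", "unknown", None])
cleaned = to_number(raw)
print(cleaned)
```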

In [181]:
merged.head()

datetime_iso,temp,visibility,d_point,feels,min_temp,max_temp,prssr,sea_level,grnd_level,hum,wind_spd,wind_deg,rain_3h,clouds
1979-01-01 00:00:00+00:00,24.75,unknown,23.89,25.76,24.28,25.22,1012.0,unknown,unknown,95.0,0.82,320.0,0.0,100.0
1979-01-01 01:00:00+00:00,24.58,unknown,23.73,25.57,23.99,25.26,1012.0,unknown,unknown,95.0,0.96,338.0,0.0,100.0
1979-01-01 02:00:00+00:00,26.6,unknown,24.06,26.6,26.1,27.39,1012.0,unknown,unknown,86.0,1.22,339.0,0.0,99.0
1979-01-01 03:00:00+00:00,27.31,unknown,24.37,30.9,26.59,28.36,1012.0,unknown,unknown,84.0,1.08,342.0,0.0,94.0
1979-01-01 04:00:00+00:00,27.41,unknown,25.05,31.54,26.58,28.31,1011.0,unknown,unknown,87.0,0.86,336.0,0.0,100.0


In [182]:
# data is formatted, now take care of missing values
merged.isna().sum()

temp               0
visibility         0
d_point            1
feels              0
min_temp           0
max_temp           0
prssr              0
sea_level          0
grnd_level         0
hum                0
wind_spd           0
wind_deg           0
rain_3h       171078
clouds             0
dtype: int64

In [183]:
# # imputing with knn
# merged2 = merged.copy()
# for column in ['d_point','rain_3h'] :
#     merged2 = knn_impute(merged2, column)

# merged2

# imputing with mean
merged2 = merged.copy()
for column in ['d_point','rain_3h'] :
    merged2[column] = merged[column].fillna(merged[column].mean())

merged2

datetime_iso,temp,visibility,d_point,feels,min_temp,max_temp,prssr,sea_level,grnd_level,hum,wind_spd,wind_deg,rain_3h,clouds
1979-01-01 00:00:00+00:00,24.75,unknown,23.89,25.76,24.28,25.22,1012.0,unknown,unknown,95.0,0.82,320.0,0.000000,100.0
1979-01-01 01:00:00+00:00,24.58,unknown,23.73,25.57,23.99,25.26,1012.0,unknown,unknown,95.0,0.96,338.0,0.000000,100.0
1979-01-01 02:00:00+00:00,26.60,unknown,24.06,26.60,26.10,27.39,1012.0,unknown,unknown,86.0,1.22,339.0,0.000000,99.0
1979-01-01 03:00:00+00:00,27.31,unknown,24.37,30.90,26.59,28.36,1012.0,unknown,unknown,84.0,1.08,342.0,0.000000,94.0
1979-01-01 04:00:00+00:00,27.41,unknown,25.05,31.54,26.58,28.31,1011.0,unknown,unknown,87.0,0.86,336.0,0.000000,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-08-19 19:00:00+00:00,24.37,unknown,23.34,25.32,22.70,28.20,1011.0,unknown,unknown,94.0,1.57,239.0,0.000016,84.0
2023-08-19 20:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,unknown,unknown,95.0,1.53,235.0,0.000000,70.0
2023-08-19 21:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,unknown,unknown,95.0,1.53,235.0,0.000016,70.0
2023-08-19 22:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,unknown,unknown,95.0,1.53,235.0,0.000016,70.0


In [184]:
merged2.info() 

<class 'pandas.core.frame.DataFrame'>
Index: 391248 entries, 1979-01-01 00:00:00+00:00 to 2023-08-19 23:00:00+00:00
Data columns (total 14 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   temp        391248 non-null  float64
 1   visibility  391248 non-null  object 
 2   d_point     391248 non-null  float64
 3   feels       391248 non-null  float64
 4   min_temp    391248 non-null  float64
 5   max_temp    391248 non-null  float64
 6   prssr       391248 non-null  float64
 7   sea_level   391248 non-null  object 
 8   grnd_level  391248 non-null  object 
 9   hum         391248 non-null  float64
 10  wind_spd    391248 non-null  float64
 11  wind_deg    391248 non-null  float64
 12  rain_3h     391248 non-null  float64
 13  clouds      391248 non-null  float64
dtypes: float64(11), object(3)
memory usage: 44.8+ MB


#### Encoding

In [185]:
merged3 = pd.get_dummies(merged2).reset_index()
merged3['datetime_iso'] = pd.to_datetime(merged3['datetime_iso'])
merged3['month'] = merged3['datetime_iso'].dt.month
merged3['hour'] = merged3['datetime_iso'].dt.hour
merged3 = merged3.set_index('datetime_iso')

# map month to a smooth seasonal index with a period of roughly 12 months (0.524 ≈ 2π/12)
merged3['month'] = 2.7 * np.cos(0.524 * (merged3['month'] + 5.5)) + 0.7
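
To see what the month transform produces, the same formula can be evaluated for all twelve months. The function name `season_index` is hypothetical; the constants are taken from the cell above.

```python
import numpy as np

def season_index(month):
    """Cosine transform of the month number, using the constants from the notebook."""
    return 2.7 * np.cos(0.524 * (np.asarray(month) + 5.5)) + 0.7

months = np.arange(1, 13)
for m, s in zip(months, season_index(months)):
    print(f"{m:2d} -> {s: .3f}")
```

Since a single cosine maps two different months to the same value, a sine/cosine pair is a common alternative when the model needs to tell them apart.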

In [186]:
merged3

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,...,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,sea_level_-1,sea_level_unknown,grnd_level_-1,grnd_level_unknown,month,hour
1979-01-01 00:00:00+00:00,24.75,23.89,25.76,24.28,25.22,1012.0,95.0,0.82,320.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,0
1979-01-01 01:00:00+00:00,24.58,23.73,25.57,23.99,25.26,1012.0,95.0,0.96,338.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,1
1979-01-01 02:00:00+00:00,26.60,24.06,26.60,26.10,27.39,1012.0,86.0,1.22,339.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,2
1979-01-01 03:00:00+00:00,27.31,24.37,30.90,26.59,28.36,1012.0,84.0,1.08,342.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,3
1979-01-01 04:00:00+00:00,27.41,25.05,31.54,26.58,28.31,1011.0,87.0,0.86,336.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-08-19 19:00:00+00:00,24.37,23.34,25.32,22.70,28.20,1011.0,94.0,1.57,239.0,0.000016,...,0,0,0,1,0,1,0,1,2.598819,19
2023-08-19 20:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.000000,...,0,0,0,1,0,1,0,1,2.598819,20
2023-08-19 21:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.000016,...,0,0,0,1,0,1,0,1,2.598819,21
2023-08-19 22:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.000016,...,0,0,0,1,0,1,0,1,2.598819,22


In [187]:
merged3.sort_values(by='wind_spd', ascending=False).head(10) # try capping wind_spd at 25

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,...,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,sea_level_-1,sea_level_unknown,grnd_level_-1,grnd_level_unknown,month,hour
2023-06-09 23:00:00+00:00,26.01,24.97,26.01,25.35,27.04,1010.0,94.0,9999.0,90.0,1.6e-05,...,0,0,0,1,0,1,0,1,3.311196,23
2023-06-27 22:00:00+00:00,25.11,24.6,26.21,24.3,26.64,1012.0,97.0,9999.0,90.0,1.6e-05,...,0,0,0,1,0,1,0,1,3.311196,22
2023-07-22 05:00:00+00:00,30.94,24.34,36.85,30.04,31.65,1011.0,68.0,9999.0,90.0,1.6e-05,...,0,0,0,1,0,1,0,1,3.304462,5
2023-03-11 00:00:00+00:00,26.25,24.85,26.25,25.81,28.04,1012.0,92.0,9999.0,90.0,1.6e-05,...,0,0,0,1,0,1,0,1,0.010087,0
2023-03-02 11:00:00+00:00,28.43,25.46,33.95,27.04,28.94,1011.0,84.0,9999.0,90.0,0.0,...,0,0,0,1,0,1,0,1,0.010087,11
2023-06-20 10:00:00+00:00,25.02,23.81,26.01,24.48,25.56,1012.0,93.0,9999.0,90.0,1.6e-05,...,0,0,0,1,0,1,0,1,3.311196,10
2023-07-19 08:00:00+00:00,30.4,25.47,37.4,29.94,31.09,1007.0,75.0,9999.0,90.0,1.6e-05,...,0,0,0,1,0,1,0,1,3.304462,8
2023-05-02 10:00:00+00:00,28.18,26.19,34.14,27.51,28.74,1007.0,89.0,9999.0,360.0,1.6e-05,...,0,0,0,1,0,1,0,1,2.617214,10
2015-10-18 07:00:00+00:00,32.62,19.84,34.76,30.02,173.18,1009.0,47.0,25.0,175.0,1.6e-05,...,0,0,0,1,0,1,0,1,-0.015017,7
1997-09-14 04:00:00+00:00,28.33,21.86,31.02,26.93,29.7,1012.0,68.0,23.24,180.0,1.6e-05,...,0,0,0,1,0,1,0,1,1.383627,4


#### Outliers

In [188]:
# capping outliers
merged4 = merged3.copy()
merged4.loc[merged3['wind_spd'] > 25.0, 'wind_spd'] = 25.0
merged4.loc[merged3['wind_deg'] > 360.0, 'wind_deg'] = 360.0
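
Capping at an upper bound can also be written with `Series.clip`, which keeps the column and its limit on one line. A minimal sketch on made-up values, using the same limits as above:

```python
import pandas as pd

# Made-up sample including a 9999.0 sentinel like the one seen in wind_spd
winds = pd.DataFrame({
    "wind_spd": [0.82, 23.24, 9999.0],
    "wind_deg": [320.0, 90.0, 400.0],
})

capped = winds.assign(
    wind_spd=winds["wind_spd"].clip(upper=25.0),
    wind_deg=winds["wind_deg"].clip(upper=360.0),
)
print(capped)
```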

#### Scaling

In [189]:
# splitting the dataset

X = merged4[:341880].reset_index()

y = TARGET.apply(lambda x : clean_rain(x))
y = y.astype('float64')

final_train = pd.concat([X,y], axis=1).set_index('datetime_iso')

final_test = merged4[341880:].reset_index()


In [190]:
data_no_outliers = final_train.loc[final_train['temp'] <= 35]
data_no_outliers = data_no_outliers.loc[data_no_outliers['d_point'] <= 35]

X_no_outliers = data_no_outliers.drop('rain_1h', axis=1)
y_no_outliers = data_no_outliers[['rain_1h']]

In [191]:
# Train/test split
# final_train is already indexed by datetime_iso, so no further set_index is needed
X_train, X_test, y_train, y_test = train_test_split(X_no_outliers, y_no_outliers, test_size=0.2, random_state=101)

In [192]:
scaler = RobustScaler()
scaler.fit(X_train)

X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

X_test_scaled

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,...,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,sea_level_-1,sea_level_unknown,grnd_level_-1,grnd_level_unknown,month,hour
1983-12-12 15:00:00+00:00,-0.522796,-0.362205,-0.157609,-0.644118,-0.648649,0.5,0.500000,-0.297872,0.784615,1.0,...,0.0,0.0,0.0,0.0,1.0,-1.0,0.0,0.0,-0.858679,0.250000
1989-12-09 19:00:00+00:00,-0.854103,-1.212598,-0.320652,-0.791176,-0.963964,0.5,0.500000,-0.244681,1.069231,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.858679,0.583333
1987-10-08 15:00:00+00:00,-0.279635,-0.417323,-0.055707,-0.232353,-0.384384,0.5,0.083333,0.053191,0.061538,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.364325,0.250000
2005-11-09 18:00:00+00:00,-0.325228,-0.118110,-0.067935,-0.294118,-0.252252,0.5,0.333333,0.617021,0.169231,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.678670,0.500000
2006-10-21 12:00:00+00:00,0.319149,0.385827,0.601902,0.255882,0.243243,0.0,-0.333333,-0.148936,-0.553846,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.364325,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2017-02-28 22:00:00+00:00,-0.325228,-0.535433,-0.078804,-0.426471,-0.468468,0.0,0.083333,-0.340426,1.046154,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.673889,0.833333
1991-09-09 02:00:00+00:00,-0.012158,-1.866142,-0.073370,0.000000,-0.090090,2.0,-1.083333,1.882979,-0.176923,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,-0.833333
1982-10-15 20:00:00+00:00,-0.699088,-1.086614,-0.251359,-0.629412,-0.771772,-0.5,0.333333,-0.212766,0.376923,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.364325,0.666667
2003-06-07 00:00:00+00:00,-0.413374,-0.488189,-0.114130,-0.332353,0.018018,0.5,0.250000,0.861702,0.076923,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.502102,-1.000000
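
For reference, `RobustScaler` with default settings subtracts each column's median and divides by its interquartile range, which is why a few extreme values barely affect the scaling of the rest. The sketch below checks this by hand on made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One column with a single extreme value (made up)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Manual equivalent: (x - median) / (Q3 - Q1)
median = np.median(values, axis=0)
q1, q3 = np.percentile(values, [25, 75], axis=0)
manual = (values - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

Tree-based models such as CatBoost are generally insensitive to this kind of monotonic rescaling, so the scaled frames mainly matter for distance-based steps such as KNN.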


In [193]:
model = CatBoostRegressor(iterations=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))   # 195062

Learning rate set to 0.173728
0:	learn: 0.6978402	total: 33.9ms	remaining: 16.9s
1:	learn: 0.6905033	total: 64.3ms	remaining: 16s
2:	learn: 0.6847285	total: 93.5ms	remaining: 15.5s
3:	learn: 0.6802185	total: 124ms	remaining: 15.3s
4:	learn: 0.6768417	total: 156ms	remaining: 15.5s
5:	learn: 0.6745974	total: 184ms	remaining: 15.2s
6:	learn: 0.6728805	total: 213ms	remaining: 15s
7:	learn: 0.6714154	total: 239ms	remaining: 14.7s
8:	learn: 0.6703555	total: 265ms	remaining: 14.5s
9:	learn: 0.6692782	total: 292ms	remaining: 14.3s
10:	learn: 0.6685175	total: 317ms	remaining: 14.1s
11:	learn: 0.6676310	total: 343ms	remaining: 14s
12:	learn: 0.6671107	total: 370ms	remaining: 13.9s
13:	learn: 0.6666848	total: 397ms	remaining: 13.8s
14:	learn: 0.6664159	total: 423ms	remaining: 13.7s
15:	learn: 0.6658120	total: 451ms	remaining: 13.7s
16:	learn: 0.6654304	total: 478ms	remaining: 13.6s
17:	learn: 0.6650920	total: 506ms	remaining: 13.5s
18:	learn: 0.6648381	total: 531ms	remaining: 13.4s
19:	learn: 0.6
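
The RMSE printed by the cell above is the square root of the mean squared error. A small sketch with made-up values shows that the scikit-learn call and the manual formula agree:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, in mm of rain
y_true = np.array([0.0, 0.5, 1.0, 0.0])
y_hat = np.array([0.1, 0.4, 0.7, 0.0])

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(rmse_sklearn, rmse_manual)
```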

In [194]:
X = X.set_index('datetime_iso')

In [195]:
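model.fit(X_no_outliers, y_no_outliers)
# final_test keeps datetime_iso as a column for the submission; drop it so the features match training
submission_pred = model.predict(final_test.drop(columns='datetime_iso'))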

submission = pd.DataFrame({'datetime_iso' : final_test['datetime_iso'],
                            'rain_1h' : submission_pred})
submission.loc[submission['rain_1h'] < 0, 'rain_1h'] = 0
submission

Learning rate set to 0.179963
0:	learn: 0.6939570	total: 36.9ms	remaining: 18.4s
1:	learn: 0.6859155	total: 70.6ms	remaining: 17.6s
2:	learn: 0.6801270	total: 103ms	remaining: 17.1s
3:	learn: 0.6761436	total: 137ms	remaining: 17s
4:	learn: 0.6729387	total: 179ms	remaining: 17.7s
5:	learn: 0.6706089	total: 216ms	remaining: 17.8s
6:	learn: 0.6687880	total: 251ms	remaining: 17.7s
7:	learn: 0.6673538	total: 283ms	remaining: 17.4s
8:	learn: 0.6663234	total: 314ms	remaining: 17.1s
9:	learn: 0.6655112	total: 346ms	remaining: 16.9s
10:	learn: 0.6646688	total: 377ms	remaining: 16.8s
11:	learn: 0.6639258	total: 414ms	remaining: 16.8s
12:	learn: 0.6635039	total: 451ms	remaining: 16.9s
13:	learn: 0.6630903	total: 498ms	remaining: 17.3s
14:	learn: 0.6625642	total: 543ms	remaining: 17.5s
15:	learn: 0.6622494	total: 603ms	remaining: 18.2s
16:	learn: 0.6618072	total: 646ms	remaining: 18.4s
17:	learn: 0.6615647	total: 679ms	remaining: 18.2s
18:	learn: 0.6613347	total: 711ms	remaining: 18s
19:	learn: 0.

,datetime_iso,rain_1h
0,2018-01-01 00:00:00+00:00,0.000000
1,2018-01-01 01:00:00+00:00,0.057817
2,2018-01-01 02:00:00+00:00,0.115673
3,2018-01-01 03:00:00+00:00,0.070298
4,2018-01-01 04:00:00+00:00,0.000000
...,...,...
49363,2023-08-19 19:00:00+00:00,0.059427
49364,2023-08-19 20:00:00+00:00,0.034460
49365,2023-08-19 21:00:00+00:00,0.028148
49366,2023-08-19 22:00:00+00:00,0.028192


In [196]:
submission.to_csv('submission2_rang.csv', index=False)