### Airnology 2023

#### Descriptions

**datetime**            : Time at which the data was computed (UNIX timestamp).

**datetime_iso**        : Time in ISO 8601 format, including the time zone.

**time-zone**           : Time zone offset from UTC, in seconds.

**temp**                : Current temperature in Celsius.

**visibility**          : Average visibility in meters.

**d_point**             : Current dew point in Celsius.

**feels**               : Current feels-like temperature in Celsius.

**min_temp**            : Minimum temperature over a given time window, in Celsius.

**max_temp**            : Maximum temperature over a given time window, in Celsius.

**pressure**            : Atmospheric pressure in hPa (this column appears as `prssr` in the data).

**sea_level**           : Atmospheric pressure at sea level in hPa.

**grnd_level**          : Atmospheric pressure at ground level in hPa.

**hum**                 : Current air humidity as a percentage.

**wind_spd**            : Current wind speed in m/s.

**wind_deg**            : Wind direction in degrees.

**rain_1h**             : Rainfall over the last 1 hour in mm. (target variable)

**rain_3h**             : Rainfall over the last 3 hours in mm.

**snow_1h**             : Snowfall over the last 1 hour in mm.

**snow_3h**             : Snowfall over the last 3 hours in mm.

**clouds**              : Current cloud cover as a percentage.
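As a quick illustration of how `datetime` and `time-zone` relate to `datetime_iso`, the sketch below converts a UNIX timestamp to UTC and then shifts it by the offset. The sample values (including the 25200-second, UTC+7 offset) and the helper column names `utc` and `local` are assumptions made for this example, not rows or columns from the dataset.

```python
import pandas as pd

# Hypothetical two-row sample; values are illustrative only.
sample = pd.DataFrame({
    "datetime": [283996800, 284000400],   # UNIX seconds
    "time-zone": [25200, 25200],          # offset from UTC in seconds (assumed UTC+7)
})

# UNIX seconds -> timezone-aware UTC timestamps
sample["utc"] = pd.to_datetime(sample["datetime"], unit="s", utc=True)

# Local wall-clock time = UTC wall-clock time + offset
offset = pd.to_timedelta(sample["time-zone"], unit="s")
sample["local"] = sample["utc"].dt.tz_localize(None) + offset

print(sample[["utc", "local"]])
```

The notebook itself drops `datetime` and `time-zone` and keeps only `datetime_iso`, so this conversion is only needed if the raw columns are used directly.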

#### Libraries

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

#### Methods

In [56]:
# cleaning methods

def clean_temp(temp) :
    if isinstance(temp, str) :
        temp = temp.replace(' Celcius', '')
        temp = temp.replace(' C', '')
        temp = temp.replace('°C', '')
    return temp

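def clean_rain(rain) :
    if isinstance(rain, str) :
        try :
            float(rain)
            return rain
        except ValueError :
            # non-numeric strings are treated as no rain
            return 0
    # non-string values (floats, NaN) pass through unchanged
    return rain
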
def clean_wind(wind) :
    if isinstance(wind, str) :
        wind = wind.replace('°', '')
        wind = wind.replace('m/s', '')
    return wind

def clean_visibility(visibility) :
    if isinstance(visibility, str) :
        if visibility in ['unidentified', ' ', 'unrecognized', 'unknown', 'empty', 'undefined', 'missing'] :
            return 'unknown'
        elif visibility in ['-1m', '-1 m'] :
            return '-1m'
        elif visibility in ['-1km', '-1 km'] :
            return '-1km'
    return visibility

def clean_ground_and_sea(ground_and_sea) :
    if isinstance(ground_and_sea, str) :
        if ground_and_sea in ['undetermined', 'unsettled', 'unestablished', 'not recorded', 'unknown', 'not_recorded', 'not-recorded','unspecified'] :
            return 'unknown'
    return ground_and_sea
    
def clean_prssr(prssr) :
    if isinstance(prssr, str) :
        if prssr in ['-100.0 hPa.', '-100.0 hPa', '-100'] :
            return 99.0
        prssr = prssr.replace('hPa.', '')
        prssr = prssr.replace('hPa', '')
    return prssr

def clean_hum(hum) :
    if isinstance(hum, str) :
        hum = hum.replace('%', '')
    return hum

def clean_cloud(cloud) :
    if isinstance(cloud, str) :
        cloud = cloud.replace('%', '')
    return cloud

# impute
def knn_impute(df, na_target) :
    df = df.copy()

    # use only numeric columns without missing values as predictors
    numeric_df = df.select_dtypes(np.number)
    non_na_columns = numeric_df.loc[:, numeric_df.isna().sum() == 0].columns

    # rows with a known target train the model; rows with a missing target get predicted
    missing = df[na_target].isna()
    y_train = numeric_df.loc[~missing, na_target]
    X_train = numeric_df.loc[~missing, non_na_columns]
    X_test = numeric_df.loc[missing, non_na_columns]

    knn = KNeighborsRegressor()
    knn.fit(X_train, y_train)

    y_pred = knn.predict(X_test)

    df.loc[missing, na_target] = y_pred

    return df
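The string cleaners above strip unit suffixes one pattern at a time. As a rough alternative sketch (not the notebook's method), the leading number can be pulled out with a single regular expression and coerced with `pd.to_numeric`. The sample values are hypothetical, and unlike `clean_rain`, this version turns unparseable entries into `NaN` rather than `0`.

```python
import pandas as pd

# Hypothetical raw values in the styles the cleaners above handle.
raw = pd.Series(["24.75 Celcius", "24.58°C", "26.6 C", "27.31", None])

# Keep only the leading signed decimal number; everything else is discarded.
number = raw.str.extract(r"^\s*(-?\d+(?:\.\d+)?)", expand=False)

# Convert to float; entries with no leading number become NaN.
cleaned = pd.to_numeric(number, errors="coerce")

print(cleaned)
```

Whether `NaN` or a sentinel such as `'unknown'` is the better outcome depends on the column, which is why the notebook keeps separate cleaners for visibility and pressure levels.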


#### Data Overview

In [86]:
train = pd.read_csv('../../datasets/train.csv')
test = pd.read_csv('../../datasets/test.csv')

TARGET = train['rain_1h']
train.drop('rain_1h', axis=1, inplace=True)

In [58]:
print(f'train shape : {train.shape}')
print(f'test shape : {test.shape}')

train shape : (341880, 19)
test shape : (49368, 19)


#### Merging train and test

In [59]:
# merging train and test data
merged = pd.concat([train, test], axis = 0).reset_index(drop=True) 
merged.drop(['datetime', 'snow_1h', 'snow_3h', 'time-zone'], axis=1, inplace=True)
merged.set_index('datetime_iso', drop=True, inplace=True)

# converting temp dtypes
for column in ['temp','d_point','feels','min_temp','max_temp'] :
    merged[column] = merged[column].apply(lambda x: clean_temp(x))
    merged[column] = merged[column].astype('float64')

# converting rain dtypes
for column in ['rain_3h'] :
    merged[column] = merged[column].apply(lambda x: clean_rain(x))
    merged[column] = merged[column].astype('float64')

# converting wind dtypes
for column in ['wind_spd', 'wind_deg'] :
    merged[column] = merged[column].apply(lambda x: clean_wind(x))
    merged[column] = merged[column].astype('float64')

# cleaning visibility
for column in ['visibility'] :
    merged[column] = merged[column].apply(lambda x: clean_visibility(x))
    merged[column] = merged[column].fillna(merged[column].mode().iloc[0])

# cleaning ground and sea level
for column in ['sea_level', 'grnd_level'] :
    merged[column] = merged[column].apply(lambda x: clean_ground_and_sea(x))
    merged[column] = merged[column].fillna(merged[column].mode().iloc[0])

# cleaning pressure
for column in ['prssr'] :
    merged[column] = merged[column].apply(lambda x: clean_prssr(x))
    merged[column] = merged[column].astype('float64')
    
# cleaning humidity
for column in ['hum'] :
    merged[column] = merged[column].apply(lambda x: clean_hum(x))
    merged[column] = merged[column].astype('float64')

# cleaning clouds 
for column in ['clouds'] :
    merged[column] = merged[column].apply(lambda x: clean_cloud(x))
    merged[column] = merged[column].astype('float64')

In [60]:
merged.head()

datetime_iso,temp,visibility,d_point,feels,min_temp,max_temp,prssr,sea_level,grnd_level,hum,wind_spd,wind_deg,rain_3h,clouds
1979-01-01 00:00:00+00:00,24.75,unknown,23.89,25.76,24.28,25.22,1012.0,unknown,unknown,95.0,0.82,320.0,0.0,100.0
1979-01-01 01:00:00+00:00,24.58,unknown,23.73,25.57,23.99,25.26,1012.0,unknown,unknown,95.0,0.96,338.0,0.0,100.0
1979-01-01 02:00:00+00:00,26.6,unknown,24.06,26.6,26.1,27.39,1012.0,unknown,unknown,86.0,1.22,339.0,0.0,99.0
1979-01-01 03:00:00+00:00,27.31,unknown,24.37,30.9,26.59,28.36,1012.0,unknown,unknown,84.0,1.08,342.0,0.0,94.0
1979-01-01 04:00:00+00:00,27.41,unknown,25.05,31.54,26.58,28.31,1011.0,unknown,unknown,87.0,0.86,336.0,0.0,100.0


In [61]:
# data is formatted, now take care of missing values
merged.isna().sum()

temp               0
visibility         0
d_point            1
feels              0
min_temp           0
max_temp           0
prssr              0
sea_level          0
grnd_level         0
hum                0
wind_spd           0
wind_deg           0
rain_3h       171078
clouds             0
dtype: int64

In [62]:
# # imputing with knn
# merged2 = merged.copy()
# for column in ['d_point','rain_3h'] :
#     merged2 = knn_impute(merged2, column)

# merged2

# imputing with mean
merged2 = merged.copy()
for column in ['d_point','rain_3h'] :
    merged2[column] = merged[column].fillna(merged[column].mean())

merged2

datetime_iso,temp,visibility,d_point,feels,min_temp,max_temp,prssr,sea_level,grnd_level,hum,wind_spd,wind_deg,rain_3h,clouds
1979-01-01 00:00:00+00:00,24.75,unknown,23.89,25.76,24.28,25.22,1012.0,unknown,unknown,95.0,0.82,320.0,0.000000,100.0
1979-01-01 01:00:00+00:00,24.58,unknown,23.73,25.57,23.99,25.26,1012.0,unknown,unknown,95.0,0.96,338.0,0.000000,100.0
1979-01-01 02:00:00+00:00,26.60,unknown,24.06,26.60,26.10,27.39,1012.0,unknown,unknown,86.0,1.22,339.0,0.000000,99.0
1979-01-01 03:00:00+00:00,27.31,unknown,24.37,30.90,26.59,28.36,1012.0,unknown,unknown,84.0,1.08,342.0,0.000000,94.0
1979-01-01 04:00:00+00:00,27.41,unknown,25.05,31.54,26.58,28.31,1011.0,unknown,unknown,87.0,0.86,336.0,0.000000,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-08-19 19:00:00+00:00,24.37,unknown,23.34,25.32,22.70,28.20,1011.0,unknown,unknown,94.0,1.57,239.0,0.000016,84.0
2023-08-19 20:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,unknown,unknown,95.0,1.53,235.0,0.000000,70.0
2023-08-19 21:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,unknown,unknown,95.0,1.53,235.0,0.000016,70.0
2023-08-19 22:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,unknown,unknown,95.0,1.53,235.0,0.000016,70.0


In [63]:
merged2.info() 

<class 'pandas.core.frame.DataFrame'>
Index: 391248 entries, 1979-01-01 00:00:00+00:00 to 2023-08-19 23:00:00+00:00
Data columns (total 14 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   temp        391248 non-null  float64
 1   visibility  391248 non-null  object 
 2   d_point     391248 non-null  float64
 3   feels       391248 non-null  float64
 4   min_temp    391248 non-null  float64
 5   max_temp    391248 non-null  float64
 6   prssr       391248 non-null  float64
 7   sea_level   391248 non-null  object 
 8   grnd_level  391248 non-null  object 
 9   hum         391248 non-null  float64
 10  wind_spd    391248 non-null  float64
 11  wind_deg    391248 non-null  float64
 12  rain_3h     391248 non-null  float64
 13  clouds      391248 non-null  float64
dtypes: float64(11), object(3)
memory usage: 44.8+ MB


In [90]:
merged2_train = merged2[:341880].reset_index()
merged2_test = merged2[341880:].reset_index()

In [91]:
merged2_train.to_csv('train_cleaned1.csv')
merged2_test.to_csv('test_cleaned1.csv')

#### Encoding

In [64]:
merged3 = pd.get_dummies(merged2).reset_index()
merged3['datetime_iso'] = pd.to_datetime(merged3['datetime_iso'])
merged3['month'] = merged3['datetime_iso'].dt.month
merged3['hour'] = merged3['datetime_iso'].dt.hour
merged3 = merged3.set_index('datetime_iso')

# column transform from month to season
merged3['month']= 2.7 * np.cos(0.524 * (merged3['month'] - (-5.5))) + 0.7
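The factor 0.524 is approximately 2π/12, so the transform above is a cosine with a 12-month period; the amplitude 2.7, offset 0.7, and 5.5-month phase shift are presumably tuned to the local seasonal pattern, though the notebook does not say so. The sketch below simply evaluates the same expression for each month to show the values the `month` column ends up holding.

```python
import numpy as np

months = np.arange(1, 13)

# Same expression as the notebook's transform, evaluated for months 1..12.
season = 2.7 * np.cos(0.524 * (months + 5.5)) + 0.7

for m, s in zip(months, season):
    print(f"month {m:2d} -> {s:+.3f}")
```

January maps to about -1.906 and August to about 2.599, which matches the `month` values in the output table that follows.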

In [65]:
merged3

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,...,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,sea_level_-1,sea_level_unknown,grnd_level_-1,grnd_level_unknown,month,hour
1979-01-01 00:00:00+00:00,24.75,23.89,25.76,24.28,25.22,1012.0,95.0,0.82,320.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,0
1979-01-01 01:00:00+00:00,24.58,23.73,25.57,23.99,25.26,1012.0,95.0,0.96,338.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,1
1979-01-01 02:00:00+00:00,26.60,24.06,26.60,26.10,27.39,1012.0,86.0,1.22,339.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,2
1979-01-01 03:00:00+00:00,27.31,24.37,30.90,26.59,28.36,1012.0,84.0,1.08,342.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,3
1979-01-01 04:00:00+00:00,27.41,25.05,31.54,26.58,28.31,1011.0,87.0,0.86,336.0,0.000000,...,0,0,0,1,0,1,0,1,-1.906168,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-08-19 19:00:00+00:00,24.37,23.34,25.32,22.70,28.20,1011.0,94.0,1.57,239.0,0.000016,...,0,0,0,1,0,1,0,1,2.598819,19
2023-08-19 20:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.000000,...,0,0,0,1,0,1,0,1,2.598819,20
2023-08-19 21:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.000016,...,0,0,0,1,0,1,0,1,2.598819,21
2023-08-19 22:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.000016,...,0,0,0,1,0,1,0,1,2.598819,22


In [66]:
merged3.sort_values(by='wind_deg', ascending=False).head(10) # try capping wind_spd at 25

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,...,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,sea_level_-1,sea_level_unknown,grnd_level_-1,grnd_level_unknown,month,hour
1994-12-03 14:00:00+00:00,25.32,24.81,26.44,24.64,26.0,1011.0,97.0,5.53,1810.8,0.0,...,0,0,0,1,0,1,0,1,-1.912842,14
1983-12-20 21:00:00+00:00,24.54,24.37,25.63,23.73,25.7,1009.0,99.0,1.16,1810.8,0.0,...,0,0,0,1,0,1,0,1,-1.912842,21
1993-06-01 00:00:00+00:00,24.74,24.57,25.85,24.03,25.42,1010.0,99.0,1.67,1810.8,0.0,...,0,0,0,1,0,1,0,1,3.311196,0
1997-05-09 22:00:00+00:00,25.26,24.75,26.37,24.5,26.29,1008.0,97.0,1.04,1810.8,1.6e-05,...,0,0,0,1,0,1,0,1,2.617214,22
1987-02-13 00:00:00+00:00,25.27,23.88,26.25,24.8,25.77,1013.0,92.0,1.81,1810.8,1.6e-05,...,0,0,0,1,0,1,0,1,-1.203435,0
2013-02-02 01:00:00+00:00,25.31,24.45,26.38,24.86,26.04,1013.0,95.0,2.17,1810.8,1.6e-05,...,0,0,0,1,0,1,0,1,-1.203435,1
1979-05-07 10:00:00+00:00,25.28,25.11,26.45,24.6,25.94,1006.0,99.0,1.05,1805.77,0.0,...,0,0,0,1,0,1,1,0,2.617214,10
2001-11-28 13:00:00+00:00,25.52,24.66,26.61,24.89,26.09,1011.0,95.0,0.79,1805.77,0.0,...,0,0,0,1,1,0,0,1,-1.221786,13
2008-01-30 19:00:00+00:00,25.43,24.39,26.48,24.71,26.14,1008.0,94.0,1.49,1805.77,0.0,...,0,0,0,1,0,1,0,1,-1.906168,19
2013-12-15 20:00:00+00:00,24.87,24.87,26.02,24.18,25.8,1008.0,100.0,0.96,1805.77,1.6e-05,...,0,0,1,0,0,1,0,1,-1.912842,20


#### Outliers

In [67]:
# capping outliers
merged4 = merged3.copy()
merged4.loc[merged3['wind_spd'] > 25.0, 'wind_spd'] = 25
merged4.loc[merged3['wind_deg'] > 360.0, 'wind_deg'] = 360.0
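Capping of this kind can also be written with `Series.clip`, which names the column only once per limit and so avoids mixing up the filter column and the assigned column. This is a minimal sketch on hypothetical readings; the limits 25 and 360 are the ones used in the cell above.

```python
import pandas as pd

# Hypothetical readings, including out-of-range values like those in the output above.
df = pd.DataFrame({
    "wind_spd": [0.82, 5.53, 40.0],
    "wind_deg": [320.0, 1810.8, 90.0],
})

capped = df.copy()
capped["wind_spd"] = capped["wind_spd"].clip(upper=25.0)    # cap speed at 25 m/s
capped["wind_deg"] = capped["wind_deg"].clip(upper=360.0)   # cap direction at 360 degrees

print(capped)
```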

#### Scaling

In [68]:
# splitting the dataset

X = merged4[:341880].reset_index()

y = TARGET.apply(lambda x : clean_rain(x))
y = y.astype('float64')

final_train = pd.concat([X,y], axis=1).set_index('datetime_iso')

final_test = merged4[341880:].reset_index()


In [69]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=101)
X_train = X_train.set_index('datetime_iso')
X_test = X_test.set_index('datetime_iso')

In [70]:
scaler = RobustScaler()
scaler.fit(X_train)

X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

X_test_scaled

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,...,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,sea_level_-1,sea_level_unknown,grnd_level_-1,grnd_level_unknown,month,hour
1987-04-25 01:00:00+00:00,0.085044,0.108527,-0.022358,0.141176,-0.008982,1.0,-0.166667,-0.936170,-1.290076,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006543,-0.833333
1990-12-05 11:00:00+00:00,-0.258065,0.186047,-0.029133,-0.223529,-0.374251,0.5,0.416667,-0.414894,-0.022901,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.858679,0.000000
2017-04-25 22:00:00+00:00,-0.076246,0.255814,14.817751,-0.429412,-0.137725,0.0,0.166667,0.138298,0.480916,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006543,0.916667
1990-01-25 20:00:00+00:00,-0.692082,-0.697674,-0.243225,-0.670588,-0.736527,0.0,0.583333,0.319149,-1.526718,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.856941,0.750000
1983-04-30 05:00:00+00:00,1.029326,1.341085,1.362466,1.038235,1.014970,0.0,-0.833333,-0.627660,0.526718,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006543,-0.500000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-02-12 03:00:00+00:00,0.492669,0.271318,0.778455,0.505882,0.703593,0.5,-0.666667,-0.106383,-1.496183,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.673889,-0.666667
2000-09-16 16:00:00+00:00,-0.231672,0.116279,-0.019648,-0.182353,-0.356287,0.5,0.333333,-1.031915,1.030534,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.416667
2007-12-27 05:00:00+00:00,1.228739,-0.775194,1.161924,1.326471,1.113772,-2.0,-2.083333,1.638298,0.389313,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.858679,-0.500000
1996-06-01 15:00:00+00:00,-0.020528,0.674419,-0.071138,-0.105882,-0.005988,-0.5,32.573333,-1.223404,1.007634,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.502102,0.333333
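Values such as 14.817751 or 32.573333 in the scaled table are expected with `RobustScaler`: by default it subtracts the median and divides by the interquartile range, so outliers stay far from zero instead of compressing the rest of the column. The sketch below checks that equivalence on a hypothetical single feature.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent: subtract the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```

Note that the model cell that follows fits on `X_train` rather than `X_train_scaled`; tree-based models such as CatBoost are generally insensitive to monotonic feature scaling, so this is unlikely to change the result much.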


In [71]:
model = CatBoostRegressor(iterations=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))   # 195062

Learning rate set to 0.174565
0:	learn: 0.6963992	total: 32.3ms	remaining: 16.1s
1:	learn: 0.6889582	total: 55.7ms	remaining: 13.9s
2:	learn: 0.6833660	total: 79.5ms	remaining: 13.2s
3:	learn: 0.6792898	total: 105ms	remaining: 13s
4:	learn: 0.6760895	total: 127ms	remaining: 12.6s
5:	learn: 0.6736408	total: 149ms	remaining: 12.3s
6:	learn: 0.6719353	total: 174ms	remaining: 12.3s
7:	learn: 0.6703644	total: 198ms	remaining: 12.1s
8:	learn: 0.6690780	total: 226ms	remaining: 12.3s
9:	learn: 0.6683357	total: 256ms	remaining: 12.6s
10:	learn: 0.6676268	total: 282ms	remaining: 12.5s
11:	learn: 0.6669929	total: 307ms	remaining: 12.5s
12:	learn: 0.6663150	total: 329ms	remaining: 12.3s
13:	learn: 0.6659434	total: 353ms	remaining: 12.2s
14:	learn: 0.6655601	total: 375ms	remaining: 12.1s
15:	learn: 0.6651008	total: 400ms	remaining: 12.1s
16:	learn: 0.6646944	total: 426ms	remaining: 12.1s
17:	learn: 0.6643232	total: 456ms	remaining: 12.2s
18:	learn: 0.6641611	total: 479ms	remaining: 12.1s
19:	learn:

In [35]:
X = X.set_index('datetime_iso')

In [37]:
model.fit(X, y)
submission_pred = model.predict(final_test)

submission = pd.DataFrame({'datetime_iso' : final_test.index,
                            'rain_1h' : submission_pred})
submission

Learning rate set to 0.102928
0:	learn: 0.7017702	total: 44.1ms	remaining: 44.1s
1:	learn: 0.6989591	total: 78.6ms	remaining: 39.2s
2:	learn: 0.6966717	total: 111ms	remaining: 36.7s
3:	learn: 0.6948673	total: 145ms	remaining: 36.1s
4:	learn: 0.6933904	total: 186ms	remaining: 37s
5:	learn: 0.6920744	total: 225ms	remaining: 37.2s
6:	learn: 0.6908084	total: 296ms	remaining: 42s
7:	learn: 0.6897983	total: 329ms	remaining: 40.8s
8:	learn: 0.6889433	total: 362ms	remaining: 39.8s
9:	learn: 0.6881773	total: 400ms	remaining: 39.6s
10:	learn: 0.6875544	total: 439ms	remaining: 39.5s
11:	learn: 0.6870114	total: 474ms	remaining: 39s
12:	learn: 0.6865176	total: 509ms	remaining: 38.6s
13:	learn: 0.6860737	total: 546ms	remaining: 38.4s
14:	learn: 0.6856831	total: 576ms	remaining: 37.8s
15:	learn: 0.6853687	total: 607ms	remaining: 37.3s
16:	learn: 0.6850823	total: 643ms	remaining: 37.2s
17:	learn: 0.6847424	total: 688ms	remaining: 37.5s
18:	learn: 0.6845424	total: 724ms	remaining: 37.4s
19:	learn: 0.68

,datetime_iso,rain_1h
0,2018-01-01 00:00:00+00:00,0.238933
1,2018-01-01 01:00:00+00:00,0.141375
2,2018-01-01 02:00:00+00:00,0.728784
3,2018-01-01 03:00:00+00:00,0.824885
4,2018-01-01 04:00:00+00:00,0.281660
...,...,...
49363,2023-08-19 19:00:00+00:00,0.013762
49364,2023-08-19 20:00:00+00:00,0.019350
49365,2023-08-19 21:00:00+00:00,0.019350
49366,2023-08-19 22:00:00+00:00,0.019350


In [40]:
submission.to_csv('submission1_rang.csv', index=False)