### Airnology 2023

#### Descriptions

**datetime**            : Time at which the data was calculated (as a UNIX timestamp).

**datetime_iso**        : Time in ISO 8601 format, including the time zone.

**time-zone**           : Time-zone offset from UTC, in seconds.

**temp**                : Current temperature in Celsius.

**visibility**          : Average visibility in meters.

**d_point**             : Current dew point in Celsius.

**feels**               : Current feels-like temperature in Celsius.

**min_temp**            : Minimum temperature within a given time range, in Celsius.

**max_temp**            : Maximum temperature within a given time range, in Celsius.

**pressure**            : Atmospheric pressure in hPa.

**sea_level**           : Atmospheric pressure at sea level in hPa.

**grnd_level**          : Atmospheric pressure at ground level in hPa.

**hum**                 : Current air humidity as a percentage.

**wind_spd**            : Current wind speed in m/s.

**wind_deg**            : Wind direction in degrees.

**rain_1h**             : Rainfall over the last 1 hour in mm. (target variable)

**rain_3h**             : Rainfall over the last 3 hours in mm.

**snow_1h**             : Snowfall over the last 1 hour in mm.

**snow_3h**             : Snowfall over the last 3 hours in mm.

**clouds**              : Current cloud cover as a percentage.
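
The relationship between `datetime`, `time-zone`, and `datetime_iso` can be sketched as follows. This is a minimal illustration, assuming `datetime` holds seconds since the UNIX epoch and `time-zone` holds an offset in seconds; the sample values and the `utc_iso` / `local_time` column names are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the values are illustrative, not taken from the dataset
sample = pd.DataFrame({
    "datetime": [283996800, 284000400],   # seconds since the UNIX epoch
    "time-zone": [25200, 25200],          # offset from UTC in seconds (assumed UTC+7)
})

# Interpret the UNIX timestamp as UTC
utc_time = pd.to_datetime(sample["datetime"], unit="s", utc=True)

# Shift by the per-row offset to obtain local wall-clock time (timezone-naive)
local_time = utc_time.dt.tz_localize(None) + pd.to_timedelta(sample["time-zone"], unit="s")

sample["utc_iso"] = utc_time.dt.strftime("%Y-%m-%dT%H:%M:%S+00:00")
sample["local_time"] = local_time
print(sample)
```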

#### Libraries

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from pycaret.regression import setup, compare_models
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

#### Methods

In [2]:
# cleaning methods

def clean_temp(temp) :
    if isinstance(temp, str) :
        temp = temp.replace(' Celcius', '')
        temp = temp.replace(' C', '')
        temp = temp.replace('°C', '')
    return temp

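def clean_rain(rain) :
    # non-numeric strings are treated as "no rain"
    if isinstance(rain, str) :
        try :
            float(rain)
            return rain
        except ValueError :
            return 0
    # non-string values (floats, NaN) pass through unchanged instead of becoming None
    return rain
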
def clean_wind(wind) :
    if isinstance(wind, str) :
        wind = wind.replace('°', '')
        wind = wind.replace('m/s', '')
    return wind

def clean_visibility(visibility) :
    if isinstance(visibility, str) :
        if visibility in ['unidentified', ' ', 'unrecognized', 'unknown', 'empty', 'undefined', 'missing'] :
            return 'unknown'
        elif visibility in ['-1m', '-1 m'] :
            return '-1m'
        elif visibility in ['-1km', '-1 km'] :
            return '-1km'
    return visibility
    
def clean_prssr(prssr) :
    if isinstance(prssr, str) :
        if prssr in ['-100.0 hPa.', '-100.0 hPa', '-100'] :
            return 99.0
        prssr = prssr.replace('hPa.', '')
        prssr = prssr.replace('hPa', '')
    return prssr

def clean_hum(hum) :
    if isinstance(hum, str) :
        hum = hum.replace('%', '')
    return hum

def clean_cloud(cloud) :
    if isinstance(cloud, str) :
        cloud = cloud.replace('%', '')
    return cloud

# impute
def knn_impute(df, na_target) :
    # fill missing values in `na_target` with a KNN regressor trained on
    # the numeric columns that contain no missing values
    df = df.copy()

    numeric_df = df.select_dtypes(np.number)
    non_na_columns = numeric_df.loc[:, numeric_df.isna().sum() == 0].columns
    missing = numeric_df[na_target].isna()

    y_train = numeric_df.loc[~missing, na_target]
    X_train = numeric_df.loc[~missing, non_na_columns]
    X_test = numeric_df.loc[missing, non_na_columns]

    knn = KNeighborsRegressor()
    knn.fit(X_train, y_train)

    y_pred = knn.predict(X_test)

    df.loc[missing, na_target] = y_pred

    return df
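
The unit-stripping helpers above could also be expressed as one vectorized step. The sketch below is an alternative, not the notebook's method: the helper name `to_numeric_column` and the suffix pattern are assumptions based on the suffixes handled above (`Celcius` is spelled as in the raw strings). Unlike `clean_rain` and `clean_prssr`, it turns unparseable entries into NaN and does not handle sentinel values such as `-100 hPa`.

```python
import pandas as pd

def to_numeric_column(series: pd.Series) -> pd.Series:
    """Strip a trailing unit suffix and convert to float; unparseable entries become NaN."""
    cleaned = (
        series.astype(str)
        .str.replace(r"\s*(°C|Celcius|C|hPa\.?|m/s|°|%)\s*$", "", regex=True)
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Illustrative raw values in the formats the cleaning functions expect
raw = pd.Series(["24.75 Celcius", "24.58°C", "26.6 C", "1012 hPa.", "95%", "unknown", None])
result = to_numeric_column(raw)
print(result.tolist())
```

Coercing to NaN keeps missing readings visible to a later imputation step, whereas mapping them to 0 (as `clean_rain` does) records them as dry hours.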


#### Data Overview

In [3]:
train = pd.read_csv('../../datasets/train.csv')
test = pd.read_csv('../../datasets/test.csv')

TARGET = train['rain_1h']
train.drop('rain_1h', axis=1, inplace=True)

In [4]:
print(f'train shape : {train.shape}')
print(f'test shape : {test.shape}')

train shape : (341880, 19)
test shape : (49368, 19)


#### Merging train and test

In [5]:
# merging train and test data
merged = pd.concat([train, test], axis = 0).reset_index(drop=True) 
merged.drop(['datetime', 'snow_1h', 'snow_3h', 'sea_level', 'grnd_level', 'time-zone'], axis=1, inplace=True)
merged.set_index('datetime_iso', drop=True, inplace=True)

# converting temp dtypes
for column in ['temp','d_point','feels','min_temp','max_temp'] :
    merged[column] = merged[column].apply(lambda x: clean_temp(x))
    merged[column] = merged[column].astype('float64')

# converting rain dtypes
for column in ['rain_3h'] :
    merged[column] = merged[column].apply(lambda x: clean_rain(x))
    merged[column] = merged[column].astype('float64')

# converting wind dtypes
for column in ['wind_spd', 'wind_deg'] :
    merged[column] = merged[column].apply(lambda x: clean_wind(x))
    merged[column] = merged[column].astype('float64')

# cleaning visibility
for column in ['visibility'] :
    merged[column] = merged[column].apply(lambda x: clean_visibility(x))
    merged[column] = merged[column].fillna(merged[column].mode().iloc[0])

# cleaning pressure
for column in ['prssr'] :
    merged[column] = merged[column].apply(lambda x: clean_prssr(x))
    merged[column] = merged[column].astype('float64')
    
# cleaning humidity
for column in ['hum'] :
    merged[column] = merged[column].apply(lambda x: clean_hum(x))
    merged[column] = merged[column].astype('float64')

# cleaning clouds 
for column in ['clouds'] :
    merged[column] = merged[column].apply(lambda x: clean_cloud(x))
    merged[column] = merged[column].astype('float64')

In [6]:
merged.head()

datetime_iso,temp,visibility,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,clouds
1979-01-01 00:00:00+00:00,24.75,unknown,23.89,25.76,24.28,25.22,1012.0,95.0,0.82,320.0,0.0,100.0
1979-01-01 01:00:00+00:00,24.58,unknown,23.73,25.57,23.99,25.26,1012.0,95.0,0.96,338.0,0.0,100.0
1979-01-01 02:00:00+00:00,26.6,unknown,24.06,26.6,26.1,27.39,1012.0,86.0,1.22,339.0,0.0,99.0
1979-01-01 03:00:00+00:00,27.31,unknown,24.37,30.9,26.59,28.36,1012.0,84.0,1.08,342.0,0.0,94.0
1979-01-01 04:00:00+00:00,27.41,unknown,25.05,31.54,26.58,28.31,1011.0,87.0,0.86,336.0,0.0,100.0


In [7]:
# data is formatted, now take care of missing values
merged.isna().sum()

temp               0
visibility         0
d_point            1
feels              0
min_temp           0
max_temp           0
prssr              0
hum                0
wind_spd           0
wind_deg           0
rain_3h       171078
clouds             0
dtype: int64

In [8]:
# imputing with knn
merged2 = merged.copy()
for column in ['d_point','rain_3h'] :
    merged2 = knn_impute(merged2, column)

merged2

datetime_iso,temp,visibility,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,clouds
1979-01-01 00:00:00+00:00,24.75,unknown,23.89,25.76,24.28,25.22,1012.0,95.0,0.82,320.0,0.0,100.0
1979-01-01 01:00:00+00:00,24.58,unknown,23.73,25.57,23.99,25.26,1012.0,95.0,0.96,338.0,0.0,100.0
1979-01-01 02:00:00+00:00,26.60,unknown,24.06,26.60,26.10,27.39,1012.0,86.0,1.22,339.0,0.0,99.0
1979-01-01 03:00:00+00:00,27.31,unknown,24.37,30.90,26.59,28.36,1012.0,84.0,1.08,342.0,0.0,94.0
1979-01-01 04:00:00+00:00,27.41,unknown,25.05,31.54,26.58,28.31,1011.0,87.0,0.86,336.0,0.0,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2023-08-19 19:00:00+00:00,24.37,unknown,23.34,25.32,22.70,28.20,1011.0,94.0,1.57,239.0,0.0,84.0
2023-08-19 20:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.0,70.0
2023-08-19 21:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.0,70.0
2023-08-19 22:00:00+00:00,23.87,unknown,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.0,70.0


In [9]:
merged2.info() 

<class 'pandas.core.frame.DataFrame'>
Index: 391248 entries, 1979-01-01 00:00:00+00:00 to 2023-08-19 23:00:00+00:00
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   temp        391248 non-null  float64
 1   visibility  391248 non-null  object 
 2   d_point     391248 non-null  float64
 3   feels       391248 non-null  float64
 4   min_temp    391248 non-null  float64
 5   max_temp    391248 non-null  float64
 6   prssr       391248 non-null  float64
 7   hum         391248 non-null  float64
 8   wind_spd    391248 non-null  float64
 9   wind_deg    391248 non-null  float64
 10  rain_3h     391248 non-null  float64
 11  clouds      391248 non-null  float64
dtypes: float64(11), object(1)
memory usage: 38.8+ MB


#### Encoding

In [10]:
merged3 = pd.get_dummies(merged2).reset_index()
merged3['datetime_iso'] = pd.to_datetime(merged3['datetime_iso'])
merged3['month'] = merged3['datetime_iso'].dt.month
merged3 = merged3.set_index('datetime_iso')

# map month to a seasonal signal: a cosine with a period of about 12 months (0.524 ≈ 2π/12)
merged3['month'] = 2.7 * np.cos(0.524 * (merged3['month'] + 5.5)) + 0.7
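
To see what the month transform produces, the sketch below evaluates the same cosine for all twelve months. The amplitude (2.7), offset (0.7), and phase shift (5.5) are copied from the cell above; the `months`, `season`, and `period` names are only for illustration.

```python
import numpy as np

months = np.arange(1, 13)

# Same formula as the notebook's month transform
season = 2.7 * np.cos(0.524 * (months + 5.5)) + 0.7

# 0.524 is approximately 2*pi/12, so the curve repeats about every 12 months
period = 2 * np.pi / 0.524

for m, s in zip(months, season):
    print(f"month {m:2d} -> {s: .3f}")
print(f"period ~ {period:.2f} months")
```

January and August evaluate to roughly -1.906 and 2.599, matching the `month` column in the output of the next cell. Months on opposite sides of the curve can share a value, so the feature encodes position within the seasonal cycle rather than a unique month.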

In [11]:
merged3

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,clouds,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,month
1979-01-01 00:00:00+00:00,24.75,23.89,25.76,24.28,25.22,1012.0,95.0,0.82,320.0,0.0,100.0,0,0,0,1,-1.906168
1979-01-01 01:00:00+00:00,24.58,23.73,25.57,23.99,25.26,1012.0,95.0,0.96,338.0,0.0,100.0,0,0,0,1,-1.906168
1979-01-01 02:00:00+00:00,26.60,24.06,26.60,26.10,27.39,1012.0,86.0,1.22,339.0,0.0,99.0,0,0,0,1,-1.906168
1979-01-01 03:00:00+00:00,27.31,24.37,30.90,26.59,28.36,1012.0,84.0,1.08,342.0,0.0,94.0,0,0,0,1,-1.906168
1979-01-01 04:00:00+00:00,27.41,25.05,31.54,26.58,28.31,1011.0,87.0,0.86,336.0,0.0,100.0,0,0,0,1,-1.906168
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-08-19 19:00:00+00:00,24.37,23.34,25.32,22.70,28.20,1011.0,94.0,1.57,239.0,0.0,84.0,0,0,0,1,2.598819
2023-08-19 20:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.0,70.0,0,0,0,1,2.598819
2023-08-19 21:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.0,70.0,0,0,0,1,2.598819
2023-08-19 22:00:00+00:00,23.87,23.02,24.79,21.91,28.01,1011.0,95.0,1.53,235.0,0.0,70.0,0,0,0,1,2.598819


#### Scaling

In [12]:
scaler = RobustScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(merged3), columns=merged3.columns, index=merged3.index)
data_scaled

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,clouds,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,month
1979-01-01 00:00:00+00:00,-0.445748,-0.553846,-0.126498,-0.372781,-0.641791,1.0,0.230769,-0.494505,0.906250,0.0,0.157895,0.0,0.0,0.0,0.0,-0.856941
1979-01-01 01:00:00+00:00,-0.495601,-0.676923,-0.151798,-0.458580,-0.629851,1.0,0.230769,-0.340659,1.046875,0.0,0.157895,0.0,0.0,0.0,0.0,-0.856941
1979-01-01 02:00:00+00:00,0.096774,-0.423077,-0.014647,0.165680,0.005970,1.0,-0.461538,-0.054945,1.054688,0.0,0.105263,0.0,0.0,0.0,0.0,-0.856941
1979-01-01 03:00:00+00:00,0.304985,-0.184615,0.557923,0.310651,0.295522,1.0,-0.615385,-0.208791,1.078125,0.0,-0.157895,0.0,0.0,0.0,0.0,-0.856941
1979-01-01 04:00:00+00:00,0.334311,0.338462,0.643142,0.307692,0.280597,0.5,-0.384615,-0.450549,1.031250,0.0,0.157895,0.0,0.0,0.0,0.0,-0.856941
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-08-19 19:00:00+00:00,-0.557185,-0.976923,-0.185087,-0.840237,0.247761,0.5,0.153846,0.329670,0.273438,0.0,-0.684211,0.0,0.0,0.0,0.0,0.316539
2023-08-19 20:00:00+00:00,-0.703812,-1.223077,-0.255659,-1.073964,0.191045,0.5,0.230769,0.285714,0.242188,0.0,-1.421053,0.0,0.0,0.0,0.0,0.316539
2023-08-19 21:00:00+00:00,-0.703812,-1.223077,-0.255659,-1.073964,0.191045,0.5,0.230769,0.285714,0.242188,0.0,-1.421053,0.0,0.0,0.0,0.0,0.316539
2023-08-19 22:00:00+00:00,-0.703812,-1.223077,-0.255659,-1.073964,0.191045,0.5,0.230769,0.285714,0.242188,0.0,-1.421053,0.0,0.0,0.0,0.0,0.316539
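
The scaled values above follow from how `RobustScaler` works with its default settings: each column has its median subtracted and is divided by its interquartile range. The sketch below reproduces that by hand on toy data (the `toy` frame and its values are illustrative only). It also shows why a near-constant dummy column such as `visibility_unknown` ends up as 0.0 in most rows: its median equals the dominant value and its zero IQR is replaced by a scale of 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "temp": [24.0, 25.0, 26.0, 27.0, 60.0],   # contains one outlier
    "flag": [1.0, 1.0, 1.0, 1.0, 0.0],        # near-constant column: IQR is 0
})

scaled = RobustScaler().fit_transform(toy)

# Manual equivalent: subtract the median, divide by the interquartile range
median = toy.median()
iqr = toy.quantile(0.75) - toy.quantile(0.25)
iqr = iqr.replace(0.0, 1.0)   # scikit-learn uses a scale of 1 for zero-spread columns
manual = (toy - median) / iqr

print(np.allclose(scaled, manual.to_numpy()))
print(scaled)
```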


In [13]:
n_train = len(train)   # 341880 rows; the train rows come first in `merged`
X = data_scaled[:n_train].reset_index()

y = TARGET.apply(clean_rain)
y = y.astype('float64')

final_train = pd.concat([X,y], axis=1).set_index('datetime_iso')

final_test = data_scaled[n_train:]
train.shape

(341880, 19)

In [15]:
final_train

datetime_iso,temp,d_point,feels,min_temp,max_temp,prssr,hum,wind_spd,wind_deg,rain_3h,clouds,visibility_-1,visibility_-1km,visibility_-1m,visibility_unknown,month,rain_1h
1979-01-01 00:00:00+00:00,-0.445748,-0.553846,-0.126498,-0.372781,-0.641791,1.0,0.230769,-0.494505,0.906250,0.0,0.157895,0.0,0.0,0.0,0.0,-0.856941,0.00
1979-01-01 01:00:00+00:00,-0.495601,-0.676923,-0.151798,-0.458580,-0.629851,1.0,0.230769,-0.340659,1.046875,0.0,0.157895,0.0,0.0,0.0,0.0,-0.856941,0.00
1979-01-01 02:00:00+00:00,0.096774,-0.423077,-0.014647,0.165680,0.005970,1.0,-0.461538,-0.054945,1.054688,0.0,0.105263,0.0,0.0,0.0,0.0,-0.856941,0.00
1979-01-01 03:00:00+00:00,0.304985,-0.184615,0.557923,0.310651,0.295522,1.0,-0.615385,-0.208791,1.078125,0.0,-0.157895,0.0,0.0,0.0,0.0,-0.856941,0.13
1979-01-01 04:00:00+00:00,0.334311,0.338462,0.643142,0.307692,0.280597,0.5,-0.384615,-0.450549,1.031250,0.0,0.157895,0.0,0.0,0.0,0.0,-0.856941,0.34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2017-12-31 19:00:00+00:00,-0.354839,-0.046154,13.957390,-0.325444,-0.501493,-1.5,0.384615,-0.406593,-1.492188,0.0,0.105263,0.0,0.0,0.0,0.0,-0.858679,0.00
2017-12-31 20:00:00+00:00,-0.516129,-0.338462,-0.150466,-0.488166,-0.668657,-2.0,0.461538,-0.461538,-1.429688,0.0,0.157895,0.0,0.0,0.0,0.0,-0.858679,0.00
2017-12-31 21:00:00+00:00,-0.480938,-0.246154,13.663116,-0.455621,29.728358,-1.5,0.461538,0.296703,-1.390625,0.0,0.000000,0.0,0.0,0.0,0.0,-0.858679,0.00
2017-12-31 22:00:00+00:00,0.120235,0.076923,0.406125,-0.153846,-0.035821,-1.0,-0.230769,0.208791,-1.460938,0.0,0.052632,0.0,0.0,0.0,0.0,-0.858679,0.00


In [21]:
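# Train test split
# datetime_iso is still a column of X here; exclude it so the validation features match the final fit
X_train, X_test, y_train, y_test = train_test_split(X.drop(columns='datetime_iso'), y, test_size=0.2, random_state=101)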

In [27]:
model = CatBoostRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))   # 195062

Learning rate set to 0.099362
0:	learn: 0.7040391	total: 215ms	remaining: 3m 34s
1:	learn: 0.7012969	total: 245ms	remaining: 2m 2s
2:	learn: 0.6990158	total: 275ms	remaining: 1m 31s
3:	learn: 0.6971530	total: 302ms	remaining: 1m 15s
4:	learn: 0.6956885	total: 331ms	remaining: 1m 5s
5:	learn: 0.6943299	total: 356ms	remaining: 59s
6:	learn: 0.6932369	total: 381ms	remaining: 54s
7:	learn: 0.6922951	total: 404ms	remaining: 50.2s
8:	learn: 0.6914179	total: 434ms	remaining: 47.8s
9:	learn: 0.6906380	total: 464ms	remaining: 45.9s
10:	learn: 0.6900220	total: 493ms	remaining: 44.4s
11:	learn: 0.6894020	total: 522ms	remaining: 43s
12:	learn: 0.6888942	total: 545ms	remaining: 41.4s
13:	learn: 0.6884127	total: 572ms	remaining: 40.3s
14:	learn: 0.6879924	total: 594ms	remaining: 39s
15:	learn: 0.6876426	total: 617ms	remaining: 37.9s
16:	learn: 0.6873242	total: 642ms	remaining: 37.1s
17:	learn: 0.6870303	total: 672ms	remaining: 36.7s
18:	learn: 0.6867786	total: 706ms	remaining: 36.5s
19:	learn: 0.686
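
A validation RMSE is easier to judge next to trivial baselines, especially if most hours in the target are dry. The sketch below is a hedged illustration with made-up numbers, not results from this notebook; `rmse`, `baseline_zero`, and `baseline_mean` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only, not taken from the notebook's data
y_true = np.array([0.0, 0.0, 0.13, 0.34, 2.0])
y_model = np.array([0.1, 0.0, 0.10, 0.30, 1.0])

def rmse(actual, predicted):
    """Root mean squared error, computed the same way as in the cell above."""
    return float(np.sqrt(mean_squared_error(actual, predicted)))

baseline_zero = np.zeros_like(y_true)                 # always predict "no rain"
baseline_mean = np.full_like(y_true, y_true.mean())   # always predict the mean

print("model:", rmse(y_true, y_model))
print("zero :", rmse(y_true, baseline_zero))
print("mean :", rmse(y_true, baseline_mean))
```

In the notebook, the same comparison would use `y_test` and `y_pred`, with the mean baseline taken from `y_train`.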

In [35]:
X = X.set_index('datetime_iso')

In [37]:
model.fit(X, y)
submission_pred = model.predict(final_test)

submission = pd.DataFrame({'datetime_iso' : final_test.index,
                            'rain_1h' : submission_pred})
submission

Learning rate set to 0.102928
0:	learn: 0.7017702	total: 44.1ms	remaining: 44.1s
1:	learn: 0.6989591	total: 78.6ms	remaining: 39.2s
2:	learn: 0.6966717	total: 111ms	remaining: 36.7s
3:	learn: 0.6948673	total: 145ms	remaining: 36.1s
4:	learn: 0.6933904	total: 186ms	remaining: 37s
5:	learn: 0.6920744	total: 225ms	remaining: 37.2s
6:	learn: 0.6908084	total: 296ms	remaining: 42s
7:	learn: 0.6897983	total: 329ms	remaining: 40.8s
8:	learn: 0.6889433	total: 362ms	remaining: 39.8s
9:	learn: 0.6881773	total: 400ms	remaining: 39.6s
10:	learn: 0.6875544	total: 439ms	remaining: 39.5s
11:	learn: 0.6870114	total: 474ms	remaining: 39s
12:	learn: 0.6865176	total: 509ms	remaining: 38.6s
13:	learn: 0.6860737	total: 546ms	remaining: 38.4s
14:	learn: 0.6856831	total: 576ms	remaining: 37.8s
15:	learn: 0.6853687	total: 607ms	remaining: 37.3s
16:	learn: 0.6850823	total: 643ms	remaining: 37.2s
17:	learn: 0.6847424	total: 688ms	remaining: 37.5s
18:	learn: 0.6845424	total: 724ms	remaining: 37.4s
19:	learn: 0.68

,datetime_iso,rain_1h
0,2018-01-01 00:00:00+00:00,0.238933
1,2018-01-01 01:00:00+00:00,0.141375
2,2018-01-01 02:00:00+00:00,0.728784
3,2018-01-01 03:00:00+00:00,0.824885
4,2018-01-01 04:00:00+00:00,0.281660
...,...,...
49363,2023-08-19 19:00:00+00:00,0.013762
49364,2023-08-19 20:00:00+00:00,0.019350
49365,2023-08-19 21:00:00+00:00,0.019350
49366,2023-08-19 22:00:00+00:00,0.019350


In [40]:
submission.to_csv('submission1_rang.csv', index=False)