# [WiDS Datathon 2023](https://www.kaggle.com/competitions/widsdatathon2023/overview)

### Dependent Variable
<details>
    <summary> (click to expand)</summary>

- **contest-tmp2m-14d__tmp2m**: the arithmetic mean of the max and min observed temperature over the next 14 days for each location and start date, computed as (measured max temperature + measured min temperature) / 2

</details>
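
As a small illustration of the definition above, the sketch below averages daily max and min temperatures over a 14-day window. The column names `tmax` and `tmin` and all values are invented for the example, and taking the mean of daily midpoints is one reading of the definition; the competition data ships the target precomputed.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings for one location (values invented for illustration).
days = pd.date_range("2014-09-01", periods=14, freq="D")
daily = pd.DataFrame({
    "tmax": np.linspace(30.0, 17.0, 14),  # daily maximum temperature
    "tmin": np.linspace(18.0, 5.0, 14),   # daily minimum temperature
}, index=days)

# Daily midpoint (max + min) / 2, then its mean over the 14-day window.
daily["tmp2m"] = (daily["tmax"] + daily["tmin"]) / 2
tmp2m_14d = daily["tmp2m"].mean()
print(round(tmp2m_14d, 2))
```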


### Independent Variables
<details>
    <summary> (click to expand)</summary>
    
- **contest-slp-14d**: file containing sea level pressure (slp)
- **nmme0-tmp2m-34w**: file containing the most recent monthly NMME model forecasts for tmp2m (**cancm30, cancm40, ccsm30, ccsm40, cfsv20, gfdlflora0, gfdlflorb0, gfdl0, nasa0**) and the average forecast across those models (**nmme0mean**)
- **contest-pres-sfc-gauss-14d**: pressure
- **mjo1d**: MJO phase and amplitude
- **contest-pevpr-sfc-gauss-14d**: potential evaporation
- **contest-wind-h850-14d**: geopotential height at 850 millibars
- **contest-wind-h500-14d**: geopotential height at 500 millibars
- **contest-wind-h100-14d**: geopotential height at 100 millibars
- **contest-wind-h10-14d**: geopotential height at 10 millibars
- **contest-wind-vwnd-925-14d**: longitudinal wind at 925 millibars
- **contest-wind-vwnd-250-14d**: longitudinal wind at 250 millibars
- **contest-wind-uwnd-250-14d**: zonal wind at 250 millibars
- **contest-wind-uwnd-925-14d**: zonal wind at 925 millibars
- **contest-rhum-sig995-14d**: relative humidity
- **contest-prwtr-eatm-14d**: precipitable water for entire atmosphere
- **nmme-prate-34w**: weeks 3-4 weighted average of monthly NMME model forecasts for precipitation
- **nmme-prate-56w**: weeks 5-6 weighted average of monthly NMME model forecasts for precipitation
- **nmme0-prate-56w**: weeks 5-6 weighted average of most recent monthly NMME model forecasts for precipitation
- **nmme0-prate-34w**: weeks 3-4 weighted average of most recent monthly NMME model forecasts for precipitation
- **nmme-tmp2m-34w**: weeks 3-4 weighted average of most recent monthly NMME model forecasts for target label, contest-tmp2m-14d__tmp2m
- **nmme-tmp2m-56w**: weeks 5-6 weighted average of monthly NMME model forecasts for target label, contest-tmp2m-14d__tmp2m
- **mei**: MEI (mei), MEI rank (rank), and Niño Index Phase (nip)
- **elevation**: elevation
- **contest-precip-14d**: measured precipitation
- **climateregions**: Köppen-Geiger climate classifications, string


- **lat**: latitude of location (anonymized)
- **lon**: longitude of location (anonymized)
- **startdate**: start date of the 14-day period
- **sst**: sea surface temperature
- **icec**: sea ice concentration
- **cancm30, cancm40, ccsm30, ccsm40, cfsv20, gfdlflora0, gfdlflorb0, gfdl0, nasa0, nmme0mean**: most recent forecasts from weather models

</details>

In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
from fastai.tabular.core import df_shrink
import torch
from zipfile import ZipFile
import warnings
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

warnings.filterwarnings('ignore')

seed = 100
np.random.seed(seed);
torch.manual_seed(seed);

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Load Data

In [2]:
z = ZipFile('data/widsdatathon2023.zip')
train = df_shrink(pd.read_csv(z.open('train_data.csv'), parse_dates=["startdate"])).drop('index', axis=1)
test = df_shrink(pd.read_csv(z.open('test_data.csv'), parse_dates=["startdate"])).drop('index', axis=1)
submit = df_shrink(pd.read_csv(z.open('sample_solution.csv')))
target = 'contest-tmp2m-14d__tmp2m'
print(f'{train.isna().any().sum()} cols with null: {train.columns[train.isna().any()].tolist()}')
print(f'cat vars: {list(train.select_dtypes(exclude=np.number).columns)}')

8 cols with null: ['nmme0-tmp2m-34w__ccsm30', 'nmme-tmp2m-56w__ccsm3', 'nmme-prate-34w__ccsm3', 'nmme0-prate-56w__ccsm30', 'nmme0-prate-34w__ccsm30', 'nmme-prate-56w__ccsm3', 'nmme-tmp2m-34w__ccsm3', 'ccsm30']
cat vars: ['startdate', 'climateregions__climateregion']


## Feature Engineering

In [3]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

def imputer(df): # multivariate imputer: fill missing values
    temp = df.select_dtypes(include=np.number)
    df1 = df.copy()
    df1[temp.columns] = IterativeImputer(n_nearest_features=10, skip_complete=True, 
                                         random_state=seed).fit_transform(temp)
    return df1

def drop_outliers(df, thres=3): # drop rows with outliers
    col = df.drop(columns=['lat', 'lon', 'startdate', 'climateregions__climateregion'], 
                  errors='ignore').columns
    return df[(np.abs(stats.zscore(df[col])) < thres).all(axis=1)]

def corr(df, thres=0.85): # find highly correlated columns
    matrix = df.corr(numeric_only=True).abs()  # skip non-numeric columns such as startdate
    matrix = matrix.where(np.triu(np.ones(matrix.shape), k=1).astype(bool))
    drop = [c for c in matrix.columns if any(matrix[c] > thres)]
    if target in drop: drop.remove(target)
    return drop

def encode(df): 
    df['coor']  = df.groupby(['lat','lon']).ngroup()
    df['climateregions__climateregion'] = LabelEncoder().fit_transform(df['climateregions__climateregion'])
    df = df.drop(columns=['lat', 'lon'])
    return df

def add_col(df): 
    df['year']  = df['startdate'].dt.year
    df['month'] = df['startdate'].dt.month
    df['day']   = df['startdate'].dt.day
    df = df.drop(columns=['startdate'])
    return df

def xy(df): 
    X = df.drop(columns=target)
    y = df[target]
    return X, y

def split(df, date):
    train = df[df.startdate <  date]
    test  = df[df.startdate >= date]
    X_train, y_train = xy(train)
    X_test,  y_test  = xy(test)
    return (X_train.reset_index(drop=True), y_train.reset_index(drop=True), 
            X_test.reset_index(drop=True), y_test.reset_index(drop=True))

def feature_engineering(train, test):
    train1 = imputer(train)
#     print(f'{train1.isna().any().sum()} cols with null')

    # train1 = drop_outliers(train1)
    # print(f'{len(train) - len(train1)} rows dropped: ')
    
    # train1 = train1.drop(columns=corr(train1))
#     print(f'{len(corr(train1))} / {train.shape[1]} cols dropped')
    
    df = pd.concat([train1, test[train1.columns.intersection(test.columns)]], 
                   ignore_index=True)
    
    df = encode(df)
    X, y, X_test, _ = split(df, date='2020')
    
    return X, y, X_test
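
To make the `corr` helper above easier to follow, here is a minimal sketch of the same upper-triangle trick on a toy frame. The column names (`a`, `a_copy`, `b`) and the data are invented for illustration. Masking everything except the strict upper triangle means each pair is inspected once, so only the later column of a highly correlated pair is flagged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),                        # independent column
})

# Absolute correlations, keeping only entries strictly above the diagonal.
matrix = demo.corr().abs()
upper = matrix.where(np.triu(np.ones(matrix.shape), k=1).astype(bool))

# A column is flagged if it correlates strongly with any earlier column.
to_drop = [c for c in upper.columns if (upper[c] > 0.85).any()]
print(to_drop)
```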

## Model

In [4]:
def scale(train, test):
    scaler = MinMaxScaler()
    train_sc = scaler.fit_transform(train)
    test_sc  = pd.DataFrame(scaler.transform(test), columns=train.columns)
    train_sc = pd.DataFrame(train_sc, columns=train.columns)
    return train_sc, test_sc

def savePred(pred, index, filename=None, save=False):
    df = pd.DataFrame({target:pred, 'index': index})
    if save==True:
        df.to_csv(f'{filename}.csv', index=False)
    return df

def rmse(actual, predicted):
    # Root of the MSE; avoids the `squared` argument, which newer scikit-learn releases no longer accept
    return np.sqrt(mean_squared_error(actual, predicted))

X, y, X_test = feature_engineering(train.copy(), test.copy())

X_train, y_train, X_val, y_val = split(pd.concat([X, y], axis=1), date='2016-7-1')
print(f'train: {X_train.shape} | val: {X_val.shape} | test: {X_test.shape}')

train: (343866, 243) | val: (31868, 243) | test: (31354, 243)
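
The `split` helper partitions rows by date rather than at random, so validation rows always come after training rows in time. A minimal sketch of that idea on a toy frame follows; the data is invented and the cutoff is written as an explicit `pd.Timestamp`.

```python
import pandas as pd

demo = pd.DataFrame({
    "startdate": pd.date_range("2016-06-28", periods=6, freq="D"),
    "value": range(6),
})

cutoff = pd.Timestamp("2016-07-01")
train_part = demo[demo.startdate < cutoff]   # strictly before the cutoff
val_part = demo[demo.startdate >= cutoff]    # cutoff date and later

print(len(train_part), len(val_part))
```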


In [5]:
tmp2m_34 = train.columns[train.columns.str.contains('nmme0-tmp2m-34w')].tolist()
tmp2m_56 = train.columns[train.columns.str.contains('tmp2m-56w')].tolist()
featr_14 = train.columns[train.columns.str.contains('14d')].tolist()
useful = tmp2m_34 + tmp2m_56 + featr_14
if target in useful: useful.remove(target)
# [c for c in train.columns.values if c not in useful]

In [None]:
from statsmodels.graphics.tsaplots import plot_pacf

In [None]:
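# Partial autocorrelation of the target at one location (coor == 0 picked as an example series)
plot_pacf(y[X.coor == 0], lags=30);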

In [6]:
scaler = StandardScaler()

y_res = y.copy()
nVal = int(len(X_val) / X.coor.nunique()) -1
rr = []
rr1 = []

for j in range(X.coor.nunique()):
    x_T = X[X.coor == j][useful]
    x_T = pd.DataFrame(scaler.fit_transform(x_T), 
                       columns=x_T.columns, index=x_T.index)
    y_T = y[X.coor == j]
    x_t = X_test[X_test.coor == j][useful]
    x_t = pd.DataFrame(scaler.transform(x_t), 
                       columns=x_t.columns, index=x_t.index)
    x = pd.concat([x_T, x_t])
    
    p = -28 ## Shift the nmme tmp2m 56w features up by 28 rows, so each row sees the value from 28 days later
    
    ## Creating shifted features 
    x[tmp2m_34] = x[tmp2m_34].shift(-14).ffill()
    x[tmp2m_56] = x[tmp2m_56].shift(p).ffill(limit=20)
    x2 = x.iloc[:-nVal,:]
    x_test2 = x.iloc[-nVal:p+20,:]
    
    las = Lasso(alpha=0.019, max_iter=10000)
    
    ## Prediction without shift
    las.fit(x_T, y_T)
    train_res= las.predict(x_T)
    test_res = las.predict(x_t)
    
    ## Prediction with shift
    las.fit(x2, y_T)
#     train_res_1 = las.predict(x2)
    test_res_1  = las.predict(x_test2)

    df_test   = pd.DataFrame({target:test_res},   index=x_t.index)
    df_test_1 = pd.DataFrame({target:test_res_1}, index=x_t.index[:p+20])
    
    ## calculating the residuals 
    y_res[X.coor==j] = y_res[X.coor==j] - train_res
    
    rr.append(df_test)
    rr1.append(df_test_1)
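
The shifted features in the loop above rely on how `shift` with a negative period interacts with `ffill`. The toy series below (values invented) sketches that behaviour. Under this reading, `shift(-28)` followed by `ffill(limit=20)` would leave the last 8 rows of each location unfilled, which is consistent with the shifted test slice stopping at `p+20`, though that interpretation is an assumption.

```python
import pandas as pd

s = pd.Series([10, 11, 12, 13, 14, 15], name="forecast")

shifted = s.shift(-2)               # row i now holds the value from row i + 2
filled = shifted.ffill()            # trailing NaNs take the last known value
limited = shifted.ffill(limit=1)    # fill at most one consecutive NaN

print(pd.DataFrame({"shifted": shifted, "filled": filled, "limited": limited}))
```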

In [7]:
sub = pd.concat(rr)
sub1 = pd.concat(rr1)
sub.loc[sub.index.isin(sub1.index.values), target] = sub1[target].values
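
The overwrite above assigns `.values`, so it depends on both frames listing their rows in the same order. A possible alternative, sketched here on toy frames with invented values, is to assign a Series through `.loc` so that pandas aligns on index labels instead of position.

```python
import pandas as pd

base = pd.DataFrame({"pred": [1.0, 2.0, 3.0, 4.0]}, index=[0, 1, 2, 3])
# Replacement predictions for a subset of rows, deliberately out of order.
better = pd.DataFrame({"pred": [20.0, 10.0]}, index=[1, 0])

# Assigning a Series via .loc aligns on index labels, not on position.
base.loc[better.index, "pred"] = better["pred"]
print(base)
```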

In [8]:
drop = corr(X)
X1 = drop_outliers(X.drop(columns=drop, errors='ignore'))
X1 = add_col(X1)
X1_test = X_test.drop(columns=drop, errors='ignore')
X1_test = add_col(X1_test)
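
For reference, the z-score filter used by `drop_outliers` can be sketched on a toy frame as follows. The columns and values are invented; a row is kept only if every column lies within 3 standard deviations of that column's mean.

```python
import numpy as np
import pandas as pd
from scipy import stats

demo = pd.DataFrame({
    "x": [float(v) for v in range(19)] + [500.0],  # last value is an extreme outlier
    "y": [float(v) for v in range(20)],            # no outliers
})

# Column-wise z-scores; keep rows where all columns satisfy |z| < 3.
z = np.abs(stats.zscore(demo[["x", "y"]]))
kept = demo[(z < 3).all(axis=1)]
print(len(demo) - len(kept), "row(s) dropped")
```

Because the filter removes whole rows, any target vector has to be re-indexed with the surviving index afterwards, as the notebook does with `y_res.loc[X1.index]`.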

In [9]:
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

cbr = CatBoostRegressor(verbose=0, loss_function="RMSE", l2_leaf_reg=2.3, 
                        max_depth=6, iterations=25000)
cbr.fit(X1, y_res.loc[X1.index], cat_features=['coor'])
y_pred_cat = cbr.predict(X1_test)


lgb = LGBMRegressor(objective='regression', metric='rmse', 
                    categorical_feature=X1.columns.get_loc('coor'))
lgb.fit(X1, y_res.loc[X1.index], verbose=0)
y_pred_lgb = lgb.predict(X1_test)


xgb = XGBRegressor(objective="reg:squarederror", eval_metric='rmse', 
                   random_state=seed)
xgb.fit(X1, y_res.loc[X1.index], verbose=0)
y_pred_xgb = xgb.predict(X1_test)
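
The cells above appear to follow a two-stage pattern: a linear model produces a baseline, and tree ensembles are then fitted to the residuals. The sketch below shows the mechanics on synthetic data. It swaps in scikit-learn's `GradientBoostingRegressor` for the CatBoost/LightGBM/XGBoost models, and the data, the helper `rmse_np`, and all names are invented for illustration. The comparison is in-sample, so it shows only how the pieces fit together, not generalisation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(500, 2))
# Synthetic target: a linear part plus a nonlinear part a linear model cannot capture.
y_demo = (2.0 * X_demo[:, 0] + 3.0 * np.sin(3.0 * X_demo[:, 1])
          + rng.normal(scale=0.1, size=500))

def rmse_np(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Stage 1: linear baseline.
base = Lasso(alpha=0.019, max_iter=10000).fit(X_demo, y_demo)
base_pred = base.predict(X_demo)

# Stage 2: a tree ensemble fitted to what the baseline left unexplained.
residual = y_demo - base_pred
booster = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X_demo, residual)

# Final prediction = baseline + predicted residual.
final_pred = base_pred + booster.predict(X_demo)

print(f"baseline RMSE: {rmse_np(y_demo, base_pred):.3f}")
print(f"stacked  RMSE: {rmse_np(y_demo, final_pred):.3f}")
```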

In [14]:
y_pred = y_pred_lgb*0.10+y_pred_cat*0.60+y_pred_xgb*0.30

sub["contest-tmp2m-14d__tmp2m"] = sub["contest-tmp2m-14d__tmp2m"] + y_pred
# sub = sub[["contest-tmp2m-14d__tmp2m", "index"]]
sub['index'] = submit['index']
sub.to_csv('submission.csv', index = False)
sub

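|       | contest-tmp2m-14d__tmp2m | index  |
|-------|--------------------------|--------|
| 0     | 30.246001                | 375734 |
| 1     | 30.155734                | 375735 |
| 2     | 30.136013                | 375736 |
| 3     | 30.061622                | 375737 |
| 4     | 30.067243                | 375738 |
| ...   | ...                      | ...    |
| 31349 | 7.101493                 | 407083 |
| 31350 | 7.253449                 | 407084 |
| 31351 | 6.636917                 | 407085 |
| 31352 | 6.785509                 | 407086 |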
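
The final residual prediction is a fixed-weight blend of the three boosters. A minimal sketch of that blend, using invented prediction arrays and the same 0.10 / 0.60 / 0.30 weights, is shown here; the check that the weights sum to 1 is an addition for the example, not something the notebook enforces.

```python
import numpy as np

# Toy predictions from three models (values invented for illustration).
preds = {
    "lgb": np.array([1.0, 2.0]),
    "cat": np.array([2.0, 2.0]),
    "xgb": np.array([3.0, 2.0]),
}
weights = {"lgb": 0.10, "cat": 0.60, "xgb": 0.30}

# Weights that sum to 1 keep the blend on the same scale as the inputs.
assert abs(sum(weights.values()) - 1.0) < 1e-9

blend = sum(weights[name] * preds[name] for name in preds)
print(blend)
```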


In [None]:
# from catboost import CatBoostRegressor

# params = {'iterations': 15000,
#           'learning_rate': 0.01,
#           'depth': 6,
#           'l2_leaf_reg': 3,
#           'bagging_temperature': 0,
#           'border_count': 128,
#           'loss_function': 'RMSE',
#           'random_seed': seed,
#           'verbose': False}

# model = CatBoostRegressor(**params)
# model.fit(X_train, y_train,  eval_set=(X_val, y_val))
# savePred(model.predict(X_test), testIndex, filename='s2', save=False)