# Rainfall Prediction with CatBoost and Optuna

In this notebook, we use CatBoost, a gradient boosting library, together with Optuna for hyperparameter tuning to predict rainfall from weather data. We also engineer additional features and select features by CatBoost importance, with the aim of improving model performance.



## Importing Libraries
We begin by importing the libraries we need: CatBoost for modeling, Optuna for hyperparameter optimization, scikit-learn for cross-validation and metrics, and pandas and numpy for data manipulation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import optuna
import warnings
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

warnings.filterwarnings('ignore')


## Loading the Dataset
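We load the training and test datasets. The training set contains the target variable (`rainfall`) that we will predict. The test set is what the final model generates predictions for; the notebook does not score those predictions, so model quality is estimated by cross-validation on the training set.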

In [None]:
train = pd.read_csv('path/to/your/train.csv')
test = pd.read_csv('path/to/your/test.csv')

y = train['rainfall']
X = train.drop(['id', 'rainfall'], axis=1)
X_test = test.drop(['id'], axis=1)

## Feature Engineering
Feature engineering is crucial for improving model performance. Here, we create new features based on the existing ones to capture additional information from the data, such as the difference between maximum and minimum temperatures, wind direction components, and other interactions between weather features.

In [None]:
def add_features(df):
    """Add derived weather features; works on a copy so the input frame is not modified."""
    df = df.copy()
    df['temp_range'] = df['maxtemp'] - df['mintemp']
    df['dewpoint_diff'] = df['temparature'] - df['dewpoint']
    # Decompose wind into Cartesian components so that 0 and 360 degrees coincide
    df['wind_x'] = df['windspeed'] * np.cos(np.radians(df['winddirection']))
    df['wind_y'] = df['windspeed'] * np.sin(np.radians(df['winddirection']))
    # The +1 avoids division by zero when sunshine is 0
    df['cloud_sun_ratio'] = df['cloud'] / (df['sunshine'] + 1)
    df['sun_intensity'] = df['sunshine'] * df['temparature']
    df['humidity_pressure'] = df['humidity'] * df['pressure']
    df['humidity_temp_ratio'] = df['humidity'] / (df['temparature'] + 1)
    return df

X = add_features(X)
X_test = add_features(X_test)
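
To illustrate why the wind direction is split into components, here is a minimal sketch on a toy frame. The values are made up and the frame `toy` is hypothetical; only the column names `windspeed` and `winddirection` come from the notebook. It shows that 0 and 350 degrees are far apart as raw numbers but close together once projected onto `wind_x` and `wind_y`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names match those used in add_features.
toy = pd.DataFrame({
    "windspeed": [10.0, 10.0, 10.0],
    "winddirection": [0.0, 90.0, 350.0],  # degrees
})

theta = np.radians(toy["winddirection"])
toy["wind_x"] = toy["windspeed"] * np.cos(theta)
toy["wind_y"] = toy["windspeed"] * np.sin(theta)

# Distance between the first and last rows, measured two ways.
gap_raw = abs(toy.loc[0, "winddirection"] - toy.loc[2, "winddirection"])
gap_xy = np.hypot(
    toy.loc[0, "wind_x"] - toy.loc[2, "wind_x"],
    toy.loc[0, "wind_y"] - toy.loc[2, "wind_y"],
)

print(toy.round(3))
print("raw gap in degrees:", gap_raw)
print("gap in (wind_x, wind_y) space:", round(float(gap_xy), 3))
```

A tree model splitting on the raw angle would treat 350 and 0 as opposite ends of the range, whereas the two components place them next to each other.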


## Feature Selection with CatBoost
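We fit a CatBoost model on all features and keep the 20 with the highest importance scores. Reducing the number of features can shorten training and may improve performance. Note that if the frame has 20 or fewer columns after feature engineering, `head(20)` keeps every feature and this step only reorders the columns.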

In [None]:
cat_feat_model = CatBoostClassifier(iterations=700, learning_rate=0.03, depth=6, random_seed=42, verbose=0)
cat_feat_model.fit(X, y)
importances = pd.Series(cat_feat_model.get_feature_importance(Pool(X, label=y)), index=X.columns)
top_features = importances.sort_values(ascending=False).head(20).index.tolist()

X_top = X[top_features]
X_test_top = X_test[top_features]
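
The selection step above boils down to sorting a pandas Series and taking its head. The sketch below isolates that logic; the helper name `select_top_k` and the importance values are hypothetical and only illustrate the mechanics, including what happens when `k` exceeds the number of features.

```python
import pandas as pd


def select_top_k(importances: pd.Series, k: int) -> list:
    """Return the names of the k highest-importance features (all of them if fewer than k exist)."""
    return importances.sort_values(ascending=False).head(k).index.tolist()


# Made-up importance scores, not results from the rainfall data.
toy_importances = pd.Series(
    {"humidity": 30.0, "cloud": 45.0, "wind_x": 5.0, "pressure": 20.0}
)

top2 = select_top_k(toy_importances, 2)
top20 = select_top_k(toy_importances, 20)

print(top2)
print(len(top20))
```

Checking `len(top_features)` against `X.shape[1]` in the notebook is a quick way to see whether any feature was actually dropped.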


## Hyperparameter Tuning with Optuna
To improve the model’s performance, we use Optuna for hyperparameter optimization. The `objective` function below defines the search space for the CatBoost parameters and scores each trial by the mean ROC AUC over 5 stratified cross-validation folds. This cell only defines the objective; the study itself is run in the next section.

In [None]:
def objective(trial):
    params = {
        'iterations': 700,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'depth': trial.suggest_int('depth', 4, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1.0, 10.0),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0.0, 1.0),
        'random_strength': trial.suggest_float('random_strength', 0.0, 10.0),
        'border_count': trial.suggest_int('border_count', 32, 255),
        'verbose': 0,
        'random_seed': 42,
        'loss_function': 'Logloss',
        'eval_metric': 'AUC'
    }

    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    aucs = []

    for train_idx, val_idx in kf.split(X_top, y):
        X_tr, X_val = X_top.iloc[train_idx], X_top.iloc[val_idx]
        y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

        model = CatBoostClassifier(**params)
        model.fit(X_tr, y_tr)
        preds = model.predict_proba(X_val)[:, 1]
        aucs.append(roc_auc_score(y_val, preds))

    return np.mean(aucs)
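
The objective relies on `StratifiedKFold` so that every validation fold has roughly the same share of rainy samples as the full training set. The following sketch demonstrates this on synthetic labels; `y_toy` and `X_toy` are made-up stand-ins, not the rainfall data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 25 positives out of 100 (illustrative only).
y_toy = np.array([1] * 25 + [0] * 75)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

val_rates = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    # Share of positives in this validation fold
    val_rates.append(float(y_toy[val_idx].mean()))

print("overall positive rate:", y_toy.mean())
print("per-fold positive rates:", val_rates)
```

Keeping the class ratio stable across folds makes the per-fold AUC values more comparable, which matters because the objective averages them.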


## Final Model Training
We first run the Optuna study for 30 trials. Then, using the best hyperparameters it found, we train the final CatBoost model on the full training dataset and predict rainfall probabilities for the test dataset.

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

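# study.best_params holds only the values chosen via trial.suggest_*,
# so the fixed settings used in the objective are added back here.
best_params = study.best_params
best_params.update({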
    'iterations': 700,
    'verbose': 0,
    'random_seed': 42
})

final_model = CatBoostClassifier(**best_params)
final_model.fit(X_top, y)
final_probs = final_model.predict_proba(X_test_top)[:, 1]
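
Optuna's `study.best_params` contains only the parameters chosen through `trial.suggest_*` calls, which is why the fixed settings are re-added before the final fit. The sketch below shows one way to do that merge without mutating the tuned dictionary. The helper `merge_params` and the values in `tuned` are hypothetical; `tuned` merely stands in for `study.best_params`.

```python
def merge_params(tuned: dict, fixed: dict) -> dict:
    """Combine tuned and fixed settings into a new dict; fixed values win on key clashes."""
    return {**tuned, **fixed}


# Hypothetical stand-in for study.best_params (values are made up).
tuned = {"learning_rate": 0.05, "depth": 6}

# Fixed settings taken from the notebook's objective.
fixed = {"iterations": 700, "verbose": 0, "random_seed": 42}

final_params = merge_params(tuned, fixed)
print(final_params)
```

Building a new dictionary leaves the tuned values untouched, so they can still be logged or reused separately.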


## Conclusion
In this notebook, we built a CatBoost classifier to predict rainfall, combining feature engineering, importance-based feature selection, and Optuna hyperparameter tuning scored by cross-validated ROC AUC.

### If you found this notebook useful:
If you have any feedback or suggestions to improve the code, feel free to open an **issue** or **pull request**. Contributions are welcome!
