# Importing Libraries

In this section, we import various libraries that are essential for data processing, model building, and evaluation.

- **pandas**: A powerful data manipulation library for Python. It is used for reading data, data manipulation, and data analysis.
- **numpy**: A fundamental package for scientific computing with Python. It is used for array operations and mathematical functions.
- **os**: A library that provides a way of using operating system dependent functionality like reading or writing to the file system.
- **warnings**: A library to handle warnings in Python, allowing us to ignore them if necessary.
- **lightgbm**: LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed for fast performance and efficiency.
- **xgboost**: XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.
- **catboost**: CatBoost is a gradient boosting library that supports categorical features directly and is designed to be fast and easy to use.
- **optuna**: A hyperparameter optimization framework to automate hyperparameter search.
- **sklearn.preprocessing**: Contains preprocessing utilities and transformers that are used to transform data before feeding it into the model.
  - **MinMaxScaler**: Scales and translates each feature individually such that it is in the given range on the training set.
  - **StandardScaler**: Standardizes features by removing the mean and scaling to unit variance.
  - **RobustScaler**: Scales features using statistics that are robust to outliers.
  - **QuantileTransformer**: Transforms features using quantile information to map data to a uniform or normal distribution (uniform by default).
  - **LabelEncoder**: Encodes labels as integers between 0 and n_classes-1.
- **sklearn.metrics**: Provides metrics to evaluate the performance of models.
  - **mean_absolute_error**: Computes the mean absolute error between the ground truth and predicted values.
- **sklearn.model_selection**: Contains utilities to split the dataset into training and testing subsets and perform cross-validation.
  - **TimeSeriesSplit**: Provides train/test indices to split time series data samples sequentially.
- **sklearn.ensemble**: Contains ensemble-based methods for regression and classification.
  - **HistGradientBoostingRegressor**: A gradient boosting regressor for large datasets that uses a histogram-based algorithm.

Below is the code that imports these libraries and sets up some initial configurations:


In [1]:
import pandas as pd
import numpy as np
import os
import warnings
warnings.simplefilter('ignore')
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)

import lightgbm as lgb
import xgboost as xgb
import catboost as catb
import optuna
optuna.logging.disable_default_handler()


from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer, LabelEncoder
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import HistGradientBoostingRegressor

# Data Preprocessing Function

This section contains a function to preprocess the date-related data in the dataframe. The function, `process_date`, performs the following operations:

1. **Convert Date Column**: Converts the 'tarih' column to datetime format.
2. **Sort Data**: Sorts the dataframe by 'ilce' (district) and 'tarih' (date) in ascending order.
3. **Set Index**: Sets the 'tarih' column as the index of the dataframe.
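
The three steps above can be tried on a small frame. This is a minimal sketch on made-up values; only the column names `tarih`, `ilce`, and `bildirimsiz_sum` come from the notebook.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are invented.
df = pd.DataFrame({
    "tarih": ["2023-01-02", "2023-01-01", "2023-01-01"],
    "ilce": ["b", "b", "a"],
    "bildirimsiz_sum": [3, 1, 2],
})

df["tarih"] = pd.to_datetime(df["tarih"], format="%Y-%m-%d")  # step 1: parse dates
df = df.sort_values(by=["ilce", "tarih"])                     # step 2: sort by district, then date
df = df.set_index("tarih")                                    # step 3: date becomes the index

print(df)
```

After these steps the rows are ordered by district first and by date within each district, and the index is a `DatetimeIndex`.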


In [None]:
def process_date(dataframe):
    dataframe['tarih'] = pd.to_datetime(dataframe['tarih'], format='%Y-%m-%d')
    dataframe.sort_values(by=['ilce', 'tarih'], ascending=True, inplace=True)
    dataframe.set_index('tarih', inplace=True)

    return dataframe

# Data Concatenation Function

This section contains a function to ensure that each district (`ilce`) in the dataframe has a complete date range. The function, `concatenate`, performs the following operations:

1. **Create Complete Date Ranges**: For each district, create a complete date range from the minimum to the maximum date present in the data.
2. **Merge Dataframes**: Merge the original dataframe with the complete date range dataframe for each district, filling in missing values.
3. **Concatenate Dataframes**: Concatenate all district-specific dataframes into a single dataframe.
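
The gap-filling idea for a single district can be sketched as follows. Note that this sketch uses `reindex` instead of the `pd.merge` call in the notebook's function; the result should be equivalent for a unique date index, but that is an assumption, and the values are invented.

```python
import pandas as pd

# Toy single-district frame with a missing day (2023-01-02).
ilce_df = pd.DataFrame(
    {"ilce": ["a", "a"], "bildirimsiz_sum": [2, 5], "bildirimli_sum": [1, 0]},
    index=pd.to_datetime(["2023-01-01", "2023-01-03"]),
)

# Build the complete daily range between the first and last observed date.
full_range = pd.date_range(ilce_df.index.min(), ilce_df.index.max(), freq="D")

# Reindexing inserts the missing day as a row of NaN values ...
filled = ilce_df.reindex(full_range)
# ... which are then filled with the district name and zero counts.
filled = filled.fillna({"ilce": "a", "bildirimsiz_sum": 0, "bildirimli_sum": 0})

print(filled)
```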

In [None]:
def concatenate(dataframe):
    all_dates_dfs = {}
    for ilce, ilce_df in dataframe.groupby('ilce'):  # string key, so the dict keys match the second loop
        min_date = ilce_df.index.min()
        max_date = ilce_df.index.max()
        date_range = pd.date_range(start=min_date, end=max_date)
        all_dates_dfs[ilce] = pd.DataFrame(index=date_range)

    merged_dfs = {}
    for ilce, ilce_df in dataframe.groupby('ilce'):
        merged_dfs[ilce] = pd.merge(all_dates_dfs[ilce], ilce_df, how='left', left_index=True, right_index=True)
        merged_dfs[ilce].fillna({'ilce': ilce, 'bildirimsiz_sum': 0, 'bildirimli_sum': 0}, inplace=True)

    dataframe = pd.concat(merged_dfs.values())

    return dataframe

# Data Grouping Function

This section contains a function to group the dataframe by district (`ilce`), year, month, and day, and then aggregate the sum of specific columns. The function, `process_group`, performs the following operations:

1. **Extract Date Components**: Extracts the year, month, and day from the index.
2. **Group by Columns**: Groups the dataframe by `ilce`, `year`, `month`, and `day`.
3. **Aggregate Sum**: Aggregates the sum of `bildirimsiz_sum` and `bildirimli_sum` columns.
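
As a small illustration of the grouping and summing described above, here is a sketch on invented values. It selects the two sum columns with a list (double brackets), which is the form current pandas versions expect.

```python
import pandas as pd

# Two records on 2023-01-01 and one on 2023-01-02 for the same district.
df = pd.DataFrame(
    {"ilce": ["a", "a", "a"], "bildirimsiz_sum": [1, 2, 4], "bildirimli_sum": [0, 1, 1]},
    index=pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02"]),
)
df["year"] = df.index.year
df["month"] = df.index.month
df["day"] = df.index.day

# One row per (district, year, month, day) with the summed counts.
daily = df.groupby(["ilce", "year", "month", "day"], as_index=False)[
    ["bildirimsiz_sum", "bildirimli_sum"]
].sum()

print(daily)
```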


In [None]:
def process_group(dataframe):
    dataframe['year'] = dataframe.index.year
    dataframe['month'] = dataframe.index.month
    dataframe['day'] = dataframe.index.day
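    # Select the columns with a list; tuple-style selection on a groupby is not supported in recent pandas.
    dataframe = dataframe.groupby(['ilce', 'year', 'month', 'day'], as_index=False)[['bildirimsiz_sum', 'bildirimli_sum']].agg('sum')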

    return dataframe

# Weather Feature Extraction Function

This section contains a function to extract and create new features from the existing weather data in the dataframe. The function, `extract_features_on_weather`, performs the following operations:

1. **Calculate Temperature Difference**: Computes the difference between the actual temperature (`t_2m:C`) and the apparent temperature (`t_apparent:C`).
2. **Calculate Wind Direction Components**: Computes the sine and cosine components of the wind direction (`wind_dir_10m:d`).
3. **Calculate Wind Speed Components**: Computes the wind speed components based on the sine and cosine of the wind direction.

### Feature Descriptions
- `temp_diff`: The difference between the actual temperature (`t_2m:C`) and the apparent temperature (`t_apparent:C`).
- `sin_direction`: The sine of the wind direction (`wind_dir_10m:d`).
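- `cos_direction`: The cosine of the wind direction (`wind_dir_10m:d`). It is used to build `WindSpeed_cos_component` and is then dropped, together with `t_2m:C` and `t_apparent:C`, before the dataframe is returned.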
- `WindSpeed_sin_component`: The wind speed component in the sine direction.
- `WindSpeed_cos_component`: The wind speed component in the cosine direction.
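
The reason for encoding the direction with sine and cosine is that 0° and 360° describe the same direction, which a raw angle does not express. A minimal sketch with invented readings (a constant wind speed of 2 m/s is assumed purely for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "wind_dir_10m:d": [0.0, 90.0, 180.0, 360.0],
    "wind_speed_10m:ms": [2.0, 2.0, 2.0, 2.0],
})

# Degrees to radians, then project onto the sine and cosine axes.
radians = np.deg2rad(df["wind_dir_10m:d"])
df["sin_direction"] = np.sin(radians)
df["cos_direction"] = np.cos(radians)

# Scale each projection by the wind speed to get signed speed components.
df["WindSpeed_sin_component"] = df["wind_speed_10m:ms"] * df["sin_direction"]
df["WindSpeed_cos_component"] = df["wind_speed_10m:ms"] * df["cos_direction"]

print(df.round(3))
```

The rows for 0° and 360° end up with the same encoded values (up to floating-point error), while 90° and 180° differ in both sign and component.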

In [None]:
def extract_features_on_weather(dataframe: pd.DataFrame) -> pd.DataFrame:

    # dataframe["wind_coolinf_effect"] = 33 - (10 * (dataframe["wind_speed_10m:ms"]**0.5)) + (10.45 * dataframe["t_2m:C"]) - (0.2778 * (dataframe["wind_speed_10m:ms"] * dataframe["t_2m:C"]))
    # dataframe['temperature_relative_humidity_interaction'] = dataframe['t_2m:C'] * (1 + 0.33 * dataframe['relative_humidity_2m:p'] / 100)
    # dataframe["sunlight_intensity"] = dataframe["global_rad:W"] / (1 + dataframe["effective_cloud_cover:p"])
    # dataframe["wind_chill_index"] = 13.12 + 0.6215 * dataframe["t_2m:C"] - 11.37 * (dataframe["wind_speed_10m:ms"] ** 0.16) + 0.3965 * dataframe["t_2m:C"] * (dataframe["wind_speed_10m:ms"] ** 0.16)
    # dataframe['precipitation_cloud_interaction'] = dataframe['prob_precip_1h:p'] * (1 + 0.5 * dataframe['effective_cloud_cover:p'] / 100)
    # dataframe['temperature_ratio'] = dataframe['t_apparent:C'] / (dataframe['t_2m:C'] + 1e-10)
    dataframe['temp_diff'] = (dataframe['t_2m:C'] - dataframe['t_apparent:C'])
    dataframe['sin_direction'] = np.sin(dataframe['wind_dir_10m:d']*np.pi/180)
    dataframe['cos_direction'] = np.cos(dataframe['wind_dir_10m:d']*np.pi/180)
    dataframe["WindSpeed_sin_component"] = dataframe['wind_speed_10m:ms'] * dataframe["sin_direction"]
    dataframe["WindSpeed_cos_component"] = dataframe['wind_speed_10m:ms'] * dataframe["cos_direction"]
    # dataframe['is_rainy'] = np.where(dataframe['prob_precip_1h:p'] > 0, 1, 0)
    # dataframe['is_windy'] = np.where(dataframe['wind_speed_10m:ms'] > 10, 1, 0)
    # dataframe['log_wind_speed'] = np.log(dataframe['wind_speed_10m:ms'] + 1)
    dataframe.drop(['cos_direction', 't_2m:C', 't_apparent:C'], axis=1, inplace=True)

    return dataframe

# Weather Data Processing Function

This section contains a function to process and aggregate weather data. The function, `weather_df_process`, performs the following operations:

1. **Extract Date Components**: Extracts the year, month, and day from the index.
2. **Drop Unnecessary Columns**: Drops the 'lat' and 'lon' columns as they are not needed for further processing.
3. **Group and Aggregate Data**: Groups the dataframe by `ilce`, year, month, day, and daily frequency, and then aggregates the mean, minimum, and maximum values for each group.
4. **Rename Columns**: Renames the columns to include the aggregation type for clarity.

### Column Descriptions
- The new columns will include the original column names suffixed with '_mean', '_min', and '_max' to indicate the type of aggregation performed.
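
The aggregation and the column renaming can be sketched on a tiny frame. This simplified version groups only by `ilce` and a daily `pd.Grouper`, and assumes sub-daily readings; the values are invented.

```python
import pandas as pd

idx = pd.to_datetime(["2023-01-01 00:00", "2023-01-01 12:00", "2023-01-02 00:00"])
weather = pd.DataFrame({"ilce": ["a", "a", "a"], "t_2m:C": [1.0, 5.0, 3.0]}, index=idx)

# Aggregating with several statistics yields two-level column labels
# such as ('t_2m:C', 'mean').
daily = weather.groupby(["ilce", pd.Grouper(freq="D")]).agg(["mean", "min", "max"])

# Flatten the two levels into single names with a statistic suffix.
daily.columns = [f"{col}_{stat}" for col, stat in daily.columns]

print(daily)
```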

In [None]:
def weather_df_process(dataframe):
    dataframe['year'] = dataframe.index.year
    dataframe['month'] = dataframe.index.month
    dataframe['day'] = dataframe.index.day
    dataframe.drop(columns=['lat', 'lon'], inplace=True)
    weather_df = dataframe.groupby(['ilce', 'year', 'month', 'day', pd.Grouper(freq='D')]).agg(['mean', 'min', 'max'])
    weather_df.columns = [f'{col}_{stat}' for col, stat in weather_df.columns]

    return weather_df

# Data Merging Function

This section contains a function to merge the main dataframe with weather data and holidays data. The function, `merge_data`, performs the following operations:

1. **Merge with Weather Data**: Merges the main dataframe with the weather dataframe on `ilce`, `year`, `month`, and `day` using an inner join.
2. **Merge with Holidays Data**: Merges the resulting dataframe with the holidays dataframe on `year`, `month`, and `day` using a left join.
3. **Sort Data**: Sorts the dataframe by `ilce`, `year`, `month`, and `day` in ascending order.
4. **Set Date Index**: Converts the year, month, and day columns into a datetime format and sets it as the index.
5. **Handle Missing Holiday Data**: Fills missing values in the 'Tatil Adı' column with 0 and converts it to a binary indicator (1 if it is a holiday, 0 otherwise).

### Steps Involved
1. Merge the main dataframe with the weather dataframe.
2. Merge the resulting dataframe with the holidays dataframe.
3. Sort the dataframe by `ilce`, `year`, `month`, and `day`.
4. Create a datetime index from the `year`, `month`, and `day` columns.
5. Fill missing holiday names with 0 and create a binary indicator for holidays.
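
The two joins and the holiday indicator can be sketched as below. The frames and the holiday name are invented, and the indicator is built with `notna()` rather than the notebook's `fillna(0)` followed by `apply`; this is an assumed shortcut that should give the same 0/1 values.

```python
import pandas as pd

keys = {"year": [2023, 2023], "month": [1, 1], "day": [1, 2]}
main = pd.DataFrame({"ilce": ["a", "a"], **keys, "bildirimsiz_sum": [3, 4]})
weather = pd.DataFrame({"ilce": ["a", "a"], **keys, "t_2m:C_mean": [3.0, 3.5]})
holidays = pd.DataFrame({"year": [2023], "month": [1], "day": [1],
                         "Tatil Adı": ["New Year's Day"]})

# Inner join keeps only district-days that have weather data.
merged = main.merge(weather, on=["ilce", "year", "month", "day"], how="inner")
# Left join keeps every row; non-holidays get NaN in 'Tatil Adı'.
final = merged.merge(holidays, on=["year", "month", "day"], how="left")

final = final.sort_values(by=["ilce", "year", "month", "day"])
final["tarih"] = pd.to_datetime(final[["year", "month", "day"]])
final = final.set_index("tarih")

# 1 where a holiday name is present, 0 otherwise.
final["Tatil Adı"] = final["Tatil Adı"].notna().astype(int)

print(final)
```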

In [None]:
def merge_data(dataframe, weather_df, holidays):
    merged_df = pd.merge(dataframe, weather_df, on=['ilce', 'year', 'month', 'day'], how='inner')
    final_df = pd.merge(merged_df, holidays, on=['year', 'month', 'day'], how='left')
    final_df.sort_values(by=['ilce', 'year', 'month', 'day'], ascending=True, inplace=True)
    final_df['tarih'] = pd.to_datetime(final_df[['year', 'month', 'day']])
    final_df.set_index('tarih', inplace=True)
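    # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
    final_df['Tatil Adı'] = final_df['Tatil Adı'].fillna(0)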
    final_df['Tatil Adı'] = final_df['Tatil Adı'].apply(lambda x: 1 if x != 0 else 0)
    
    return final_df

# Date Feature Extraction Function

This section contains a function to extract various date-related features from the dataframe's index. The function, `extract_date_features`, performs the following operations:

1. **Extract Quarter**: Extracts the quarter of the year from the date index.
2. **Extract Day of Year**: Extracts the day of the year from the date index.
3. **Extract Week of Year**: Extracts the week of the year from the date index.
4. **Extract Day of Week**: Extracts the day of the week from the date index.

### New Features
- `quarter`: The quarter of the year (1 to 4).
- `dayofyear`: The day of the year (1 to 365/366).
- `weekofyear`: The week of the year (1 to 52/53).
- `dayofweek`: The day of the week (0 for Monday to 6 for Sunday).
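
A short sketch of these four features on two dates in 2024. It reads the week number through `isocalendar().week`, which is how recent pandas versions expose the ISO week; the dates are chosen only to show the leap-year and year-boundary cases.

```python
import pandas as pd

df = pd.DataFrame(index=pd.to_datetime(["2024-01-01", "2024-12-31"]))

df["quarter"] = df.index.quarter
df["dayofyear"] = df.index.dayofyear
# ISO week number, converted to a plain integer array.
df["weekofyear"] = df.index.isocalendar().week.astype("int64").to_numpy()
df["dayofweek"] = df.index.dayofweek  # Monday is 0

print(df)
```

Because ISO weeks are used, 31 December 2024 falls in week 1 of the following ISO year, so `weekofyear` is not always monotonic within a calendar year.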

In [None]:
def extract_date_features(dataframe):

    dataframe['quarter'] = dataframe.index.quarter
    dataframe['dayofyear'] = dataframe.index.dayofyear
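    # DatetimeIndex.weekofyear is no longer available in recent pandas; use the ISO calendar week instead.
    dataframe['weekofyear'] = dataframe.index.isocalendar().week.astype('int64').to_numpy()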
    dataframe['dayofweek'] = dataframe.index.dayofweek
    # dataframe['weekday'] = dataframe.index.weekday
    # dataframe['is_Mon'] = np.where(dataframe['dayofweek'] == 1, 1, 0)
    # dataframe['is_Tue'] = np.where(dataframe['dayofweek'] == 2, 1, 0)
    # dataframe['is_Wed'] = np.where(dataframe['dayofweek'] == 3, 1, 0)
    # dataframe['is_Thu'] = np.where(dataframe['dayofweek'] == 4, 1, 0)
    # dataframe['is_Fri'] = np.where(dataframe['dayofweek'] == 5, 1, 0)
    # dataframe['is_Sat'] = np.where(dataframe['dayofweek'] == 6, 1, 0)

    return dataframe

# Date Sinusoidal and Cosine Feature Calculation Function

This section contains a function to calculate sinusoidal and cosine transformations of various date-related features. The function, `calculate_date_sin_cos`, performs the following operations:

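1. **Calculate Sinusoidal and Cosine Transformations**: For each date-related feature (month, day of year, day of month, quarter), calculate the sine and cosine values to capture cyclical patterns. The code uses fixed periods of 12 (month), 365 (day of year), 30 (day of month), and 4 (quarter), so leap years and 31-day months are only approximated.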

### New Features
- `sin_month`: The sine transformation of the month.
- `cos_month`: The cosine transformation of the month.
- `sin_dayofyear`: The sine transformation of the day of the year.
- `cos_dayofyear`: The cosine transformation of the day of the year.
- `sin_dayofmonth`: The sine transformation of the day of the month.
- `cos_dayofmonth`: The cosine transformation of the day of the month.
- `sin_quarter`: The sine transformation of the quarter.
- `cos_quarter`: The cosine transformation of the quarter.
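
To see why the pair of sine and cosine values captures cyclical patterns, the sketch below compares distances in the encoded space. The helper name `cyclical` is hypothetical and not part of the notebook.

```python
import numpy as np

def cyclical(value, period):
    """Return the (sin, cos) encoding of a value that repeats every `period`."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# December, January, June
sin_m, cos_m = cyclical([12, 1, 6], period=12)

# Euclidean distance between encoded months.
dist_dec_jan = np.hypot(sin_m[0] - sin_m[1], cos_m[0] - cos_m[1])
dist_dec_jun = np.hypot(sin_m[0] - sin_m[2], cos_m[0] - cos_m[2])

print(dist_dec_jan, dist_dec_jun)
```

December and January are neighbours on the unit circle, whereas December and June sit on opposite sides; the raw month numbers 12 and 1 would suggest the opposite.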

In [None]:
def calculate_date_sin_cos(dataframe):
    dataframe['sin_month'] = np.sin(2 * np.pi * dataframe['month']/12)
    dataframe['cos_month'] = np.cos(2 * np.pi * dataframe['month']/12)
    dataframe['sin_dayofyear'] = np.sin(2 * np.pi * dataframe['dayofyear']/365)
    dataframe['cos_dayofyear'] = np.cos(2 * np.pi * dataframe['dayofyear']/365)
    dataframe['sin_dayofmonth'] = np.sin(2 * np.pi * dataframe['day']/30)
    dataframe['cos_dayofmonth'] = np.cos(2 * np.pi * dataframe['day']/30)
    dataframe['sin_quarter'] = np.sin(2*np.pi*dataframe.quarter/4)
    dataframe['cos_quarter'] = np.cos(2*np.pi*dataframe.quarter/4)
    
    return dataframe

# Categorization Process Function

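This section contains a function, `categorize_process`, that would add categorical features to the dataframe. The whole function is commented out in this notebook, so it is not applied to the data. Within it, only the seasonal category derived from the month is written as active code; the weather-parameter categorizations are commented out a second time.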

### Categorical Features
- `season`: Categorizes the month into seasons:
  - "Spring" for months 3 to 5
  - "Summer" for months 6 to 8
  - "Autumn" for months 9 to 11
  - "Winter" for months 12, 1, and 2

### Potential Categorizations (Commented Out)
- Wind Speed (`wind_speed_10m:ms`): Categorized into "Low", "Ideal", "Pre-high", and "High".
- Relative Humidity (`relative_humidity_2m:p`): Categorized into "Low", "Ideal", "Pre-high", and "High".
- Effective Cloud Cover (`effective_cloud_cover:p`): Categorized into "Low", "Pre-high", "Ideal", and "High".
- Wind Direction (`wind_dir_10m:d`): Categorized into compass sectors ("N", "NE", "E", "SE", "S", "SW", "W", "NW").
- Global Radiation (`global_rad:W`): Categorized into "Low", "Ideal", "Pre-High", and "High".
- Probability of Precipitation (`prob_precip_1h:p`): Categorized into "Low", "Ideal", "Pre-High", and "High".
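
Two of the rules listed in this section, the season mapping and the compass-sector binning, can be sketched as standalone functions. They mirror the commented-out logic; the sample inputs are invented.

```python
import pandas as pd

def categorize_season(month: int) -> str:
    """Map a month number (1-12) to a season name."""
    if 3 <= month <= 5:
        return "Spring"
    if 6 <= month <= 8:
        return "Summer"
    if 9 <= month <= 11:
        return "Autumn"
    return "Winter"

def wind_direction_category(degrees: float) -> str:
    """Map a direction in degrees to one of eight 45-degree compass sectors."""
    sectors = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    # Shift by half a sector so that 'N' covers 337.5 to 22.5 degrees.
    return sectors[int((degrees + 22.5) / 45) % 8]

seasons = pd.Series([1, 4, 7, 10, 12]).apply(categorize_season)
sectors = pd.Series([0.0, 44.0, 90.0, 350.0]).apply(wind_direction_category)

print(seasons.tolist())
print(sectors.tolist())
```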

In [None]:
# def categorize_process(dataframe):
#     # dataframe.loc[(dataframe["wind_speed_10m:ms"] < 0.5), 'NEW_WIND_CAT'] = 'Low'
#     # dataframe.loc[(dataframe["wind_speed_10m:ms"] >= 0.5) & (dataframe["wind_speed_10m:ms"] < 1.5), 'NEW_WIND_CAT'] = 'Ideal'
#     # dataframe.loc[(dataframe["wind_speed_10m:ms"] >= 1.5) & (dataframe["wind_speed_10m:ms"] < 2.5), 'NEW_WIND_CAT'] = 'Pre-high'
#     # dataframe.loc[(dataframe["wind_speed_10m:ms"] >= 2.5), 'NEW_WIND_CAT'] = 'High'

#     # dataframe.loc[(dataframe["relative_humidity_2m:p"] < 35), 'NEW_Humidity_CAT'] = 'Low'
#     # dataframe.loc[(dataframe["relative_humidity_2m:p"] >= 35) & (dataframe["relative_humidity_2m:p"] < 55), 'NEW_Humidity_CAT'] = 'Ideal'
#     # dataframe.loc[(dataframe["relative_humidity_2m:p"] >= 55) & (dataframe["relative_humidity_2m:p"] < 72), 'NEW_Humidity_CAT'] = 'Pre-high'
#     # dataframe.loc[(dataframe["relative_humidity_2m:p"] >= 72), 'NEW_Humidity_CAT'] ='High'

#     # dataframe['NEW_CLD_CAT'] = pd.cut(dataframe['effective_cloud_cover:p'], bins=[-1, 25, 50, 75, 101], labels=['Low', 'Pre-high', 'Ideal', 'High'])

#     # def wind_direction_category(degrees):
#     #     sectors = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']
#     #     index = int((degrees + 22.5) / 45) % 8
#     #     return sectors[index]

#     # dataframe['wind_direction_sector'] = dataframe['wind_dir_10m:d'].apply(wind_direction_category)
#     # dataframe.drop('wind_dir_10m:d', axis=1, inplace=True)
    
#     # def categorize_solar_radiation(global_rad_mean):
#     #     if global_rad_mean <= 50:
#     #         return "Low"
#     #     elif 50 < global_rad_mean <= 200:
#     #         return "Ideal"
#     #     elif 200 < global_rad_mean <= 300:
#     #         return "Pre-High" 
#     #     else:
#     #         return "High"

#     # dataframe['NEW_global_rad_CAT'] = dataframe['global_rad:W'].apply(categorize_solar_radiation)

#     # def categorize_precipitation(prob_precip_1h):
#     #     if prob_precip_1h <= 5:
#     #         return "Low"
#     #     elif 5 < prob_precip_1h <= 15:
#     #         return "Ideal"
#     #     elif 15 < prob_precip_1h <= 25:
#     #         return "Pre-High"
#     #     else:
#     #         return "High"
    
#     # dataframe['NEW_precipitation_CAT'] = dataframe['prob_precip_1h:p'].apply(categorize_precipitation)

#     def categorize_season(month):
#         if 3 <= month <= 5:
#             return "Spring"
#         elif 5 < month <= 8:
#             return "Summer"
#         elif 8 < month <= 11:
#             return "Autumn" 
#         else:
#             return "Winter"

#     dataframe['season'] = dataframe['month'].apply(categorize_season)

#     return dataframe

# train = categorize_process(train)
# test = categorize_process(test)


# Save District Data to CSV Function

This section contains a function to save data for each district (`ilce`) to separate CSV files. The function, `save_districts_csv`, performs the following operations:

1. **Create District Dataframes**: Creates a dictionary of dataframes, one for each district.
2. **Save to CSV**: Iterates over each district dataframe and saves it as `<district>.csv` in the `dataframe_dir_name` subfolder of the project directory that is hard-coded in the function.

### Steps Involved
1. **Identify Unique Districts**: Extracts unique district names from the dataframe.
2. **Create District-specific Dataframes**: Creates a dictionary where each key is a district name, and each value is a dataframe containing data for that district.
3. **Save Each District Dataframe to CSV**: Saves each district-specific dataframe to a CSV file named after the district in the specified directory.
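
A portable variant of the same idea is sketched below. It writes to a temporary directory instead of the notebook's fixed location and uses `groupby` to split the districts; both choices are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {"ilce": ["a", "b", "a"], "bildirimsiz_sum": [1, 2, 3]},
    index=pd.DatetimeIndex(["2023-01-01", "2023-01-01", "2023-01-02"], name="tarih"),
)

out_dir = Path(tempfile.mkdtemp())  # hypothetical output directory

# One CSV per district, named after the district.
for ilce, ilce_df in df.groupby("ilce"):
    ilce_df.to_csv(out_dir / f"{ilce}.csv")

written = sorted(p.name for p in out_dir.glob("*.csv"))
print(written)
```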

In [None]:
def save_districts_csv(dataframe, dataframe_dir_name):
    ilceler = dataframe['ilce'].unique()
    ilce_dict = {}
    for ilce in ilceler:
        ilce_dict[ilce] = dataframe[dataframe['ilce'] == ilce].copy()

    for ilce, ilce_df in ilce_dict.items():
        name = ilce + ".csv"
        ilce_df.to_csv(rf'C:\Users\ahmet\VSCode\GDZ-Datathon\{dataframe_dir_name}\{name}')

# Load Data from Folder Function

This section contains a function to load CSV files from a specified folder into a dictionary of dataframes. The function, `load_from_folder`, performs the following operations:

1. **List Files in Directory**: Lists all files in the specified folder.
2. **Filter CSV Files**: Filters the list to include only CSV files.
3. **Load Data**: Loads each CSV file into a dataframe and sets the 'tarih' column as a datetime index.
4. **Store in Dictionary**: Stores each dataframe in a dictionary with the district name as the key.

### Steps Involved
1. **Identify CSV Files**: Lists and filters the files in the specified folder to include only CSV files.
2. **Load Dataframes**: Loads each CSV file into a dataframe, converts the 'tarih' column to datetime, and sets it as the index.
3. **Store in Dictionary**: Stores each loaded dataframe in a dictionary with the district name (extracted from the file name) as the key.
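
The loading steps can be sketched against a temporary folder. The file names here are invented, and `parse_dates`/`index_col` are used in place of the separate conversion and `set_index` calls, which is an assumed equivalent.

```python
import os
import tempfile

import pandas as pd

folder = tempfile.mkdtemp()  # hypothetical folder holding per-district files
pd.DataFrame({"tarih": ["2023-01-01", "2023-01-02"], "bildirimsiz_sum": [1, 3]}).to_csv(
    os.path.join(folder, "a.csv"), index=False
)
with open(os.path.join(folder, "notes.txt"), "w") as fh:
    fh.write("not a CSV file")

frames = {}
for file in sorted(os.listdir(folder)):
    name, ext = os.path.splitext(file)
    if ext != ".csv":
        continue  # skip anything that is not a CSV file
    # Parse 'tarih' as datetime and use it as the index in one call.
    frames[name] = pd.read_csv(
        os.path.join(folder, file), parse_dates=["tarih"], index_col="tarih"
    )

print(frames["a"])
```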

In [None]:
def load_from_folder(folder_path):
    ilce_dataframes = {}
    for file in os.listdir(folder_path):
        if file.endswith(".csv"):
            ilce = file[:-4] 
            ilce_df = pd.read_csv(os.path.join(folder_path, file))
            ilce_df['tarih'] = pd.to_datetime(ilce_df['tarih'])
            ilce_df.set_index('tarih', inplace=True)
            ilce_dataframes[ilce] = ilce_df
    return ilce_dataframes

# Feature Extraction Class

This section contains a class for extracting additional features from the data. The class, `ExtractFeatures`, provides methods to process both training and testing datasets by adding time-based and differencing features.

## Class: `ExtractFeatures`

This class is designed to process dataframes by adding new features based on time and differences.

##### Parameters
- `ilce_train_df`: DataFrame containing the training data.
- `ilce_test_df`: DataFrame containing the testing data.
- `target_col`: The name of the target column.

##### Returns
- `ilce_train_df`: The processed training dataframe with new features.
- `ilce_test_df`: The processed testing dataframe with new features.

##### Steps Involved
1. **Define Time Features**: Specifies a list of time-related features to include.
2. **Identify Numerical Columns**: Identifies numerical columns to use for differencing.
3. **Concatenate DataFrames**: Concatenates the training and testing dataframes for consistent feature engineering.
4. **Calculate Differences**: Calculates daily and weekly differences for numerical columns.
5. **Split DataFrames**: Splits the concatenated dataframe back into training and testing sets.
6. **Drop Missing Values**: Drops rows with missing values from the training dataframe.
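
The differencing step can be illustrated on a single column. The column name `t_mean` and its values are hypothetical; the point is that the first row has no daily difference and the first seven rows have no weekly difference, which is why rows with missing values are dropped afterwards.

```python
import pandas as pd

s = pd.DataFrame({"t_mean": [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0, 37.0]})

# Difference to the previous row and to the row seven steps earlier.
s["t_mean_shift_daily_diff"] = s["t_mean"] - s["t_mean"].shift()
s["t_mean_shift_weekly_diff"] = s["t_mean"] - s["t_mean"].shift(7)

# Only rows with both differences available survive dropna().
complete = s.dropna()

print(s)
print(len(complete))
```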

In [None]:
class ExtractFeatures:
    def __init__(self, cfg):
        self.cfg = cfg
    
    def process(self, ilce_train_df, ilce_test_df, target_col):
        time_features = ['year', 'month', 'day', 'dayofweek' ,'sin_month', 'cos_month', 'sin_dayofmonth', 'cos_dayofmonth', 'Tatil Adı','quarter','dayofyear','weekofyear']
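        # Concatenate the lists: a column name never equals the nested list itself, so `[time_features, target_col]` would exclude nothing but the target.
        num_cols = [col for col in ilce_train_df.columns if ilce_train_df[col].dtype in ['int64', 'float64'] and col not in time_features + [target_col]]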
        concatenated_df = pd.concat([ilce_train_df, ilce_test_df], axis=0)

        for col in num_cols:
            concatenated_df[f'{col}_shift_daily_diff'] = concatenated_df[col] - concatenated_df[col].shift()
            concatenated_df[f'{col}_shift_weekly_diff'] = concatenated_df[col] - concatenated_df[col].shift(7)
            # concatenated_df[f'{col}_shift_monthly_diff'] = concatenated_df[col] - concatenated_df[col].shift(30)
       
        n_train = len(ilce_train_df)
        # Build new frames instead of modifying slices of concatenated_df in place.
        ilce_train_df = concatenated_df.iloc[:n_train].dropna(axis=0)
        ilce_test_df = concatenated_df.iloc[n_train:].copy()

        return ilce_train_df, ilce_test_df

# Data Scaling Class

This section contains a class for scaling the data. The class, `Scaler`, provides methods to scale both training and testing datasets using different scaling techniques based on the configuration.

## Class: `Scaler`

This class is designed to process dataframes by applying scaling to numerical features.

##### Parameters
- `ilce_train_df`: DataFrame containing the training data.
- `ilce_test_df`: DataFrame containing the testing data.
- `target_col`: The name of the target column.

##### Returns
- `ilce_train_df`: The scaled training dataframe.
- `ilce_test_df`: The scaled testing dataframe.

##### Steps Involved
1. **Identify Numerical Columns**: Identifies numerical columns to be scaled, excluding the target column.
2. **Select Scaler**: Chooses the appropriate scaler based on the configuration:
   - `MinMaxScaler`: Scales features to a given range, typically [0, 1].
   - `StandardScaler`: Standardizes features by removing the mean and scaling to unit variance.
   - `RobustScaler`: Scales features using statistics that are robust to outliers.
   - `QuantileTransformer`: Transforms features using quantile information to map data to a normal distribution (the code sets `output_distribution='normal'`).
3. **Apply Scaling**: Fits the scaler on the training data and transforms both training and testing data.
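
The fit-on-train, transform-both pattern in step 3 can be sketched with `StandardScaler` on invented numbers. The test value is scaled with the mean and standard deviation learned from the training data only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"x": [0.0, 2.0, 4.0]})
test = pd.DataFrame({"x": [6.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[["x"]])  # learns mean and std from train
test_scaled = scaler.transform(test[["x"]])        # reuses the train statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

Fitting on the test data as well would leak information about the evaluation period into the features.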

In [3]:
class Scaler:
    def __init__(self, cfg):
        self.cfg = cfg
    
    def process(self, ilce_train_df, ilce_test_df, target_col):
        num_cols = [col for col in ilce_train_df.columns if ilce_train_df[col].dtype in ['int64', 'float64'] and col != target_col]

        if self.cfg.min_max_scaler:
            scaler = MinMaxScaler(feature_range=(0, 1))
        elif self.cfg.standard_scaler:
            scaler = StandardScaler()
        elif self.cfg.robust_scaler:
            scaler = RobustScaler()
        elif self.cfg.quantile_transformer:
            scaler = QuantileTransformer(output_distribution='normal')
        else:
            return ilce_train_df, ilce_test_df

        for col in num_cols:
            ilce_train_df[col] = scaler.fit_transform(ilce_train_df[[col]])
            ilce_test_df[col] = scaler.transform(ilce_test_df[[col]])

        return ilce_train_df, ilce_test_df

# Categorical Encoding Class

This section contains a class for encoding categorical features. The class, `Encoder`, provides methods to encode both training and testing datasets using label encoding based on the configuration.

## Class: `Encoder`

This class is designed to process dataframes by encoding categorical features.

##### Parameters
- `ilce_train_df`: DataFrame containing the training data.
- `ilce_test_df`: DataFrame containing the testing data.
- `unique_col`: The unique identifier column that should not be encoded.

##### Returns
- `ilce_train_df`: The encoded training dataframe.
- `ilce_test_df`: The encoded testing dataframe.

##### Steps Involved
1. **Identify Categorical Columns**: Identifies categorical columns to be encoded, excluding the unique identifier column and 'tarih'.
2. **Select Encoder**: Chooses the appropriate encoder based on the configuration:
   - `LabelEncoder`: Encodes target labels with values between 0 and `n_classes-1`.
3. **Apply Encoding**: Fits the encoder on the training data and transforms both training and testing data.
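
The difference between reusing the fitted encoder and refitting it on the test data is easy to see on a few invented labels:

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["Spring", "Winter", "Summer", "Winter"]
test_labels = ["Winter", "Spring"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_labels)  # classes are sorted alphabetically
test_codes = encoder.transform(test_labels)        # same mapping as for train

# Refitting on the test labels alone produces a different mapping.
refit_codes = LabelEncoder().fit_transform(test_labels)

print(list(encoder.classes_))
print(train_codes.tolist(), test_codes.tolist(), refit_codes.tolist())
```

With the shared mapping, "Winter" is encoded as 2 in both sets; after refitting on the test labels it becomes 1. Note that `transform` raises an error for labels that were not seen during fitting.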

In [4]:
class Encoder:
    def __init__(self, cfg):
        self.cfg = cfg
    
    def process(self, ilce_train_df, ilce_test_df, unique_col):
        cat_cols = [col for col in ilce_train_df.columns if ilce_train_df[col].dtype in ['object', 'category'] and col not in [unique_col, 'tarih']]

        if self.cfg.label_encoder:
            encoder = LabelEncoder()
            for col in cat_cols:
                ilce_train_df[col] = encoder.fit_transform(ilce_train_df[col])
                # Reuse the mapping learned on the training data (raises on labels unseen in training).
                ilce_test_df[col] = encoder.transform(ilce_test_df[col])

        return ilce_train_df, ilce_test_df

# Optuna Hyperparameter Tuning Class

This section contains a class for hyperparameter tuning using Optuna. The class, `Optuna`, provides methods to optimize the hyperparameters of various machine learning models using time series cross-validation (TimeSeriesSplit).

## Class: `Optuna`

This class is designed to perform hyperparameter tuning on different machine learning models using Optuna, which is an automatic hyperparameter optimization framework.

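#### `__init__(self, cfg)`

Constructor for the `Optuna` class. It stores the configuration object `cfg`, whose flags (`histgbr_optuna`, `lgb_optuna`, `xgb_optuna`, `catb_optuna`) select which models are tuned.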

#### `process(self, dataframe, unique_col, target_col, n_trial)`

This method initiates the hyperparameter tuning process for the specified models using the provided data.

##### Parameters
- `dataframe`: The input dataframe containing the features and target column.
- `unique_col`: The unique identifier column to be excluded from features.
- `target_col`: The target column for prediction.
- `n_trial`: The number of trials for Optuna optimization.

##### Returns
- `best_params`: A dictionary containing the best parameters and scores for each model.

##### Steps Involved
1. **Drop Columns**: Drops the unique identifier and target columns from the dataframe to create features (X) and target (y).
2. **Initialize Best Parameters Dictionary**: Creates a dictionary to store the best parameters for each model.
3. **Optimize Models**: Calls the appropriate optimization method for each model based on the configuration flags:
   - `histgbr_optuna`: Calls `_optuna_histgbr` to optimize `HistGradientBoostingRegressor`.
   - `lgb_optuna`: Calls `_optuna_lgbm` to optimize `LGBMRegressor`.
   - `xgb_optuna`: Calls `_optuna_xgb` to optimize `XGBRegressor`.
   - `catb_optuna`: Calls `_optuna_catb` to optimize `CatBoostRegressor`.

#### `_optuna_histgbr(self, X, y, n_trial)`

This private method performs hyperparameter tuning for `HistGradientBoostingRegressor` using Optuna.

##### Parameters
- `X`: The features dataframe.
- `y`: The target series.
- `n_trial`: The number of trials for Optuna optimization.

##### Returns
- `best_params`: The best parameters from the optimization.
- `best_value`: The best MAE value from the optimization.

##### Steps Involved
1. **Objective Function**: Defines the objective function for Optuna, which trains the model using the suggested hyperparameters and evaluates it using TimeSeriesSplit cross-validation.
2. **Study Creation**: Creates an Optuna study to minimize the objective function.
3. **Study Optimization**: Runs the optimization for the specified number of trials.

#### `_optuna_lgbm(self, X, y, n_trial)`

This private method performs hyperparameter tuning for `LGBMRegressor` using Optuna.

##### Parameters
- `X`: The features dataframe.
- `y`: The target series.
- `n_trial`: The number of trials for Optuna optimization.

##### Returns
- `best_params`: The best parameters from the optimization.
- `best_value`: The best MAE value from the optimization.

##### Steps Involved
1. **Objective Function**: Defines the objective function for Optuna, which trains the model using the suggested hyperparameters and evaluates it using TimeSeriesSplit cross-validation.
2. **Study Creation**: Creates an Optuna study to minimize the objective function.
3. **Study Optimization**: Runs the optimization for the specified number of trials.

#### `_optuna_xgb(self, X, y, n_trial)`

This private method performs hyperparameter tuning for `XGBRegressor` using Optuna.

##### Parameters
- `X`: The features dataframe.
- `y`: The target series.
- `n_trial`: The number of trials for Optuna optimization.

##### Returns
- `best_params`: The best parameters from the optimization.
- `best_value`: The best MAE value from the optimization.

##### Steps Involved
1. **Objective Function**: Defines the objective function for Optuna, which trains the model using the suggested hyperparameters and evaluates it using TimeSeriesSplit cross-validation.
2. **Study Creation**: Creates an Optuna study to minimize the objective function.
3. **Study Optimization**: Runs the optimization for the specified number of trials.

#### `_optuna_catb(self, X, y, n_trial)`

This private method performs hyperparameter tuning for `CatBoostRegressor` using Optuna.

##### Parameters
- `X`: The features dataframe.
- `y`: The target series.
- `n_trial`: The number of trials for Optuna optimization.

##### Returns
- `best_params`: The best parameters from the optimization.
- `best_value`: The best MAE value from the optimization.

##### Steps Involved
1. **Objective Function**: Defines the objective function for Optuna, which trains the model using the suggested hyperparameters and evaluates it using TimeSeriesSplit cross-validation.
2. **Study Creation**: Creates an Optuna study to minimize the objective function.
3. **Study Optimization**: Runs the optimization for the specified number of trials.

### Example Usage
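Instantiate `Optuna` with a configuration object whose boolean flags (`histgbr_optuna`, `lgb_optuna`, `xgb_optuna`, `catb_optuna`) select which models to tune, then call `process`. It returns a dictionary keyed by model name (`'histgbr'`, `'lgbm'`, `'xgb'`, `'catb'`), where each value is the `(best_params, best_value)` tuple produced by the corresponding private method.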
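
How the notebook chooses `best_model` from the tuning results is not shown in this section. As one possible approach, the sketch below picks the model key with the lowest MAE from a dictionary shaped like the return value of `process`. The numbers are invented placeholders, not results.

```python
# Invented placeholder output shaped like the return value of Optuna.process:
# {model_key: (best_params, best_value)}.
tuning_results = {
    'histgbr': ({'max_iter': 500, 'learning_rate': 0.05}, 12.4),
    'lgbm': ({'n_estimators': 900, 'learning_rate': 0.03}, 11.8),
    'catb': ({'iterations': 700, 'learning_rate': 0.04}, 12.1),
}

# Pick the model key whose best MAE is smallest.
best_model = min(tuning_results, key=lambda name: tuning_results[name][1])
best_params, best_mae = tuning_results[best_model]

print(best_model, best_params, best_mae)
```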

In [5]:
class Optuna:
    def __init__(self, cfg):
        self.cfg = cfg

    def process(self, dataframe, unique_col, target_col, n_trial):
        
        X = dataframe.drop(columns=[unique_col, target_col])
        y = dataframe[target_col]

        best_params = {}

        if self.cfg.histgbr_optuna:
            print("Executing HistGradientBoostingRegressor..")
            best_params['histgbr'] = self._optuna_histgbr(X, y, n_trial)

        if self.cfg.lgb_optuna:
            print("Executing LGBMRegressor..")
            best_params['lgbm'] = self._optuna_lgbm(X, y, n_trial)

        if self.cfg.xgb_optuna:
            print("Executing XGBRegressor..")
            best_params['xgb'] = self._optuna_xgb(X, y, n_trial)

        if self.cfg.catb_optuna:
            print("Executing CatBoostRegressor..")
            best_params['catb'] = self._optuna_catb(X, y, n_trial)

        return best_params

    def _optuna_histgbr(self, X, y, n_trial):
        def hist_gradient_objective(trial):
            params = {
                'loss': 'absolute_error',
                'verbose': 0,
                'max_iter': trial.suggest_int('max_iter', 100, 2000),
                'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.2, log=True),
                'max_leaf_nodes': trial.suggest_int('max_leaf_nodes', 2, 128),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 2, 100),
                'max_depth': trial.suggest_int('max_depth', 3, 7),
                'l2_regularization': trial.suggest_float('l2_regularization', 1e-5, 1),
            }

            model = HistGradientBoostingRegressor(**params)
            
            tscv = TimeSeriesSplit(n_splits=5)
            mae_scores = []

            for train_index, test_index in tscv.split(X):
                X_train, X_test = X.iloc[train_index], X.iloc[test_index]
                y_train, y_test = y.iloc[train_index], y.iloc[test_index]

                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
                mae = mean_absolute_error(y_test, y_pred)
                mae_scores.append(mae)

            return sum(mae_scores) / len(mae_scores)

        study = optuna.create_study(direction='minimize')
        study.optimize(hist_gradient_objective, n_trials=n_trial)
        return study.best_params, study.best_value

    def _optuna_lgbm(self, X, y, n_trial):
        def lgbm_objective(trial):
            params = {
                'objective': 'regression',
                'metric': 'mae',
                'verbosity': -1,
                'boosting_type': 'gbdt',
                'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
                'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.2, log=True),
                'num_leaves': trial.suggest_int('num_leaves', 2, 128),
                'max_depth': trial.suggest_int('max_depth', 3, 7),
                'reg_alpha': trial.suggest_float('reg_alpha', 1e-5, 1),
                'reg_lambda': trial.suggest_float('reg_lambda', 1e-5, 1),
                'min_child_samples': trial.suggest_int('min_child_samples', 5, 50)
            }

            tscv = TimeSeriesSplit(n_splits=5)
            mae_scores = []

            for train_index, test_index in tscv.split(X):
                X_train, X_test = X.iloc[train_index], X.iloc[test_index]
                y_train, y_test = y.iloc[train_index], y.iloc[test_index]

                train_data = lgb.Dataset(X_train, label=y_train)
                test_data = lgb.Dataset(X_test, label=y_test)

                model = lgb.train(params, train_data, valid_sets=[test_data])
                y_pred = model.predict(X_test)
                mae = mean_absolute_error(y_test, y_pred)

                mae_scores.append(mae)

            return sum(mae_scores) / len(mae_scores)
            
        study = optuna.create_study(direction='minimize')
        study.optimize(lgbm_objective, n_trials=n_trial)
        return study.best_params, study.best_value

    def _optuna_xgb(self, X, y, n_trial):
        def xgb_objective(trial):
            params = {
                'objective': 'reg:squarederror',
                'verbosity': 0,
                'eval_metric': 'mae',
                'grow_policy': trial.suggest_categorical('grow_policy', ['depthwise', 'lossguide']),
                'n_estimators': trial.suggest_int('n_estimators', 100, 3000),
                'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.2, log=True),
                'max_depth': trial.suggest_int('max_depth', 3, 9),
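                # suggest_loguniform is deprecated in Optuna 3.x; suggest_float(log=True) is the equivalent call.
                'lambda': trial.suggest_float('lambda', 1e-3, 1.0, log=True),
                'alpha': trial.suggest_float('alpha', 1e-3, 1.0, log=True),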
                'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
                'min_child_weight': trial.suggest_int('min_child_weight', 1, 10), 
                'gamma': trial.suggest_float('gamma', 0, 5), 
                # 'gpu_hist' needs a CUDA-enabled XGBoost build; drop it from the list on CPU-only machines.
                'tree_method': trial.suggest_categorical('tree_method', ['auto', 'exact', 'approx', 'hist', 'gpu_hist']),
                'booster': trial.suggest_categorical('booster', ['gbtree', 'gblinear', 'dart']), 
            }

            if params['booster'] == 'dart':
                params['sample_type'] = trial.suggest_categorical('sample_type', ['uniform', 'weighted'])
                params['normalize_type'] = trial.suggest_categorical('normalize_type', ['tree', 'forest'])
                params['rate_drop'] = trial.suggest_float('rate_drop', 1e-8, 1.0, log=True)
                params['skip_drop'] = trial.suggest_float('skip_drop', 1e-8, 1.0, log=True)

            tscv = TimeSeriesSplit(n_splits=5)
            mae_scores = []

            for train_index, test_index in tscv.split(X):
                X_train, X_test = X.iloc[train_index], X.iloc[test_index]
                y_train, y_test = y.iloc[train_index], y.iloc[test_index]

                dtrain = xgb.DMatrix(X_train, label=y_train)
                dtest = xgb.DMatrix(X_test, label=y_test)

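                # Note: xgb.train does not read 'n_estimators' from params; the number of
                # boosting rounds is set by num_boost_round, which defaults to 10 here.
                model = xgb.train(params, dtrain)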
                y_pred = model.predict(dtest)
                mae = mean_absolute_error(y_test, y_pred)
                mae_scores.append(mae)

            return sum(mae_scores) / len(mae_scores)

        study = optuna.create_study(direction='minimize')
        study.optimize(xgb_objective, n_trials=n_trial)
        return study.best_params, study.best_value

    def _optuna_catb(self, X, y, n_trial):
        def catboost_objective(trial):
            params = {
                'objective': 'MAE', 
                'verbose': False,
                'eval_metric': 'MAE',
                'grow_policy': 'Lossguide',
                'iterations': trial.suggest_int('iterations', 100, 2000),
                'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.2, log=True),
                'depth': trial.suggest_int('depth', 3, 7),  
                # suggest_float(log=True) replaces the deprecated suggest_loguniform.
                'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-3, 1.0, log=True),
                'max_leaves': trial.suggest_int('max_leaves', 10, 200),  
                'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 50),  
                'bagging_temperature': trial.suggest_float('bagging_temperature', 0.0, 10.0), 
                'border_count': trial.suggest_int('border_count', 32, 255),  
                'random_strength': trial.suggest_float('random_strength', 0.0, 100.0), 
            }


            model = catb.CatBoostRegressor(**params)
            tscv = TimeSeriesSplit(n_splits=5)
            mae_scores = []

            for train_index, test_index in tscv.split(X):
                X_train, X_test = X.iloc[train_index], X.iloc[test_index]
                y_train, y_test = y.iloc[train_index], y.iloc[test_index]

                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
                mae = mean_absolute_error(y_test, y_pred)
                mae_scores.append(mae)

            return sum(mae_scores) / len(mae_scores)

        study = optuna.create_study(direction='minimize')
        study.optimize(catboost_objective, n_trials=n_trial)
        return study.best_params, study.best_value

# Machine Learning Models Prediction Class

This section contains a class for making predictions using different machine learning models. The class, `MLModels`, provides methods to fit and predict using the best model and parameters identified through hyperparameter tuning.

## Class: `MLModels`

This class fits one of four models and predicts with it: `HistGradientBoostingRegressor`, LightGBM, XGBoost, or `CatBoostRegressor`. The model is selected by key (`'histgbr'`, `'lgbm'`, `'xgb'`, `'catb'`) and configured with the best parameters found during hyperparameter tuning.

#### `__init__(self, cfg)`

Constructor for the `MLModels` class. Stores the configuration object `cfg` as `self.cfg`.

#### `process(self, ilce, ilce_train_df, ilce_test_df, unique_col, target_col, best_model, best_params)`

This method fits the best model using the training data and makes predictions on the test data.

##### Parameters
- `ilce`: The district identifier.
- `ilce_train_df`: DataFrame containing the training data.
- `ilce_test_df`: DataFrame containing the testing data.
- `unique_col`: The unique identifier column to be excluded from features.
- `target_col`: The target column for prediction.
- `best_model`: The best model identified during hyperparameter tuning.
- `best_params`: The best parameters for the best model.

##### Returns
- `predictions`: A dictionary that maps `ilce` to the array of predictions for the test data. It is empty if `best_model` is not one of the supported keys.

##### Steps Involved
1. **Prepare Training and Testing Data**: Drops the unique identifier and target columns from both dataframes to build the feature matrices `X_train` and `X_test`, and takes the target column of the training dataframe as `y_train`. Because both columns are dropped by name, the test dataframe must also contain `target_col`.
2. **Model Selection and Prediction**: Based on the `best_model` key, initializes, fits, and predicts using the corresponding model:
   - **HistGradientBoostingRegressor** (`'histgbr'`): Fits a `HistGradientBoostingRegressor` and predicts on the test data.
   - **LGBMRegressor** (`'lgbm'`): Trains LightGBM through the native `lgb.train` API and predicts on the test data.
   - **XGBRegressor** (`'xgb'`): Trains XGBoost through the native `xgb.train` API on `DMatrix` inputs and predicts on the test data.
   - **CatBoostRegressor** (`'catb'`): Fits a `CatBoostRegressor`, with `verbose=False` and `grow_policy='Lossguide'` added to the parameters, and predicts on the test data.
3. **Store Predictions**: Stores the predictions in a dictionary with the district identifier as the key.
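
As a minimal sketch of step 1, the snippet below builds the feature and target objects from two small, invented dataframes. The column names `id`, `temp`, and `load` are hypothetical; in the notebook they come from `unique_col`, `target_col`, and the real feature columns.

```python
import pandas as pd

unique_col, target_col = 'id', 'load'  # hypothetical column names

# Invented rows standing in for one district's train and test data.
train_df = pd.DataFrame({'id': [1, 2, 3], 'temp': [10.0, 12.5, 9.0], 'load': [100.0, 120.0, 95.0]})
test_df = pd.DataFrame({'id': [4, 5], 'temp': [11.0, 13.0], 'load': [0.0, 0.0]})

# Features: every column except the identifier and the target.
X_train = train_df.drop(columns=[unique_col, target_col])
y_train = train_df[target_col]
# drop() raises KeyError for a missing column, so the test frame here
# carries a placeholder target column as well.
X_test = test_df.drop(columns=[unique_col, target_col])

print(list(X_train.columns), list(X_test.columns), y_train.tolist())
```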

### Example Usage
Call `process` once per district, passing the model key and parameters selected during hyperparameter tuning. It refits that model on the district's training data and returns the test predictions keyed by district.
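
`study.best_params` holds only the values proposed through `trial.suggest_*` calls, so constants fixed inside an objective (for example `verbose` and `grow_policy` in the CatBoost objective) have to be added back before refitting. A minimal sketch of doing this without modifying the caller's dictionary, using invented parameter values:

```python
# Invented values shaped like study.best_params for the CatBoost objective.
best_params = {'iterations': 700, 'learning_rate': 0.04, 'depth': 5, 'max_leaves': 31}

# Constants that the objective fixed outside of trial.suggest_* calls.
fixed_params = {'verbose': False, 'grow_policy': 'Lossguide'}

# Build a new dictionary; best_params itself is left unchanged.
model_params = {**best_params, **fixed_params}

print(model_params)
```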

In [6]:
class MLModels:
    def __init__(self, cfg):
        self.cfg = cfg

    def process(self, ilce, ilce_train_df, ilce_test_df, unique_col, target_col, best_model, best_params):
        predictions = {}

        X_train = ilce_train_df.drop(columns=[unique_col, target_col])
        y_train = ilce_train_df[target_col]
        X_test = ilce_test_df.drop(columns=[unique_col, target_col])
        
        if best_model == 'histgbr':
            model = HistGradientBoostingRegressor(**best_params)
            model.fit(X_train, y_train)
            y_pred_test = model.predict(X_test)
            predictions[ilce] = y_pred_test
            print(f'{ilce} test data is predicted with HistGradientBoostingRegressor')


        elif best_model == 'lgbm':
            train_data = lgb.Dataset(X_train, label=y_train)
            # The test features go to predict() directly; no Dataset is needed for them.
            model = lgb.train(best_params, train_data)
            y_pred_test = model.predict(X_test)
            predictions[ilce] = y_pred_test
            print(f'{ilce} test data is predicted with LGBMRegressor')

        elif best_model == 'xgb':
            
            dtrain = xgb.DMatrix(X_train, label=y_train)
            dtest = xgb.DMatrix(X_test)
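            # Note: xgb.train does not read 'n_estimators' from best_params; it runs
            # num_boost_round rounds (default 10), the same as during tuning.
            model = xgb.train(best_params, dtrain)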
            y_pred_test = model.predict(dtest)
            predictions[ilce] = y_pred_test
            print(f'{ilce} test data is predicted with XGBRegressor')

        elif best_model == 'catb':
            # Re-add the constants fixed during tuning without mutating the caller's dict.
            catb_params = {**best_params, 'verbose': False, 'grow_policy': 'Lossguide'}
            model = catb.CatBoostRegressor(**catb_params)
            model.fit(X_train, y_train)
            y_pred_test = model.predict(X_test)
            predictions[ilce] = y_pred_test
            print(f'{ilce} test data is predicted with CatBoostRegressor')

        else:
            # Unknown model key: report it and fall through to return the empty dictionary.
            print('No valid model found.')

        return predictions