# Utility Functions

The helper functions and classes used throughout this notebook are defined in a separate Jupyter notebook, `utils.ipynb`.

## Command: `%run utils.ipynb`

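The `%run` magic command executes another notebook from within the current one, so the functions, classes, and variables it defines become available here. The cells below also use `pd` and `np` without importing them, so `utils.ipynb` is presumably where pandas and NumPy are imported.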


In [None]:
%run utils.ipynb

# Configuration Class

This section defines a configuration class named `CFG` that stores the flags for the preprocessing and model tuning steps. The flags control how the data is scaled and encoded, and which models are tuned with Optuna.

## Class: `CFG`

All settings are class-level boolean attributes, so they can be read from the class itself or from an instance (`CFG()`), which is how the later cells pass the configuration to each processing step.

### Attributes

- **Scaling Options**:
  - `min_max_scaler` (bool): If `True`, use `MinMaxScaler` for scaling numerical features. Default is `True`.
  - `standard_scaler` (bool): If `True`, use `StandardScaler` for scaling numerical features. Default is `False`.
  - `robust_scaler` (bool): If `True`, use `RobustScaler` for scaling numerical features. Default is `False`.
  - `quantile_transformer` (bool): If `True`, use `QuantileTransformer` for scaling numerical features. Default is `False`.

- **Encoding Options**:
  - `label_encoder` (bool): If `True`, use `LabelEncoder` for encoding categorical features. Default is `False`.
  - `one_hot_encoder` (bool): If `True`, use one-hot encoding for categorical features. Default is `False`.

- **Model Tuning Options**:
  - `histgbr_optuna` (bool): If `True`, optimize `HistGradientBoostingRegressor` using Optuna. Default is `True`.
  - `lgb_optuna` (bool): If `True`, optimize `LGBMRegressor` using Optuna. Default is `True`.
  - `xgb_optuna` (bool): If `True`, optimize `XGBRegressor` using Optuna. Default is `True`.
  - `catb_optuna` (bool): If `True`, optimize `CatBoostRegressor` using Optuna. Default is `True`.


In [None]:
class CFG:

    min_max_scaler = True
    standard_scaler = False
    robust_scaler = False
    quantile_transformer = False

    label_encoder = False
    one_hot_encoder = False
    
    histgbr_optuna = True
    lgb_optuna = True
    xgb_optuna = True
    catb_optuna = True
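
Because several scaler flags could be switched on at the same time, it can help to check the configuration before running the pipeline. The sketch below is not part of `utils.ipynb`: `ExampleCFG`, `SCALER_FLAGS`, and `enabled_flags` are hypothetical names introduced here for illustration, and the rule that exactly one scaler should be active is an assumption about how the flags are meant to be used.

```python
class ExampleCFG:
    # Same names and defaults as the scaling options documented above.
    min_max_scaler = True
    standard_scaler = False
    robust_scaler = False
    quantile_transformer = False


# Hypothetical grouping of the scaler flags, for illustration only.
SCALER_FLAGS = (
    "min_max_scaler",
    "standard_scaler",
    "robust_scaler",
    "quantile_transformer",
)


def enabled_flags(cfg, names):
    """Return the names in `names` whose attribute on `cfg` is truthy."""
    return [name for name in names if getattr(cfg, name, False)]


active_scalers = enabled_flags(ExampleCFG, SCALER_FLAGS)

# Assumption: the pipeline expects exactly one scaler to be enabled.
if len(active_scalers) != 1:
    raise ValueError(f"Expected exactly one scaler flag, got: {active_scalers}")

print("Active scaler:", active_scalers[0])
```

The same helper works on an instance (`enabled_flags(ExampleCFG(), SCALER_FLAGS)`), since class attributes are visible through instances.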

In [None]:
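# pandas and NumPy may already be imported by utils.ipynb; importing them again is harmless.
import numpy as np
import pandas as pd

train = pd.read_csv(r"C:\Users\ahmet\VSCode\GDZ-Datathon\gdz-datathon-data\train.csv")
test = pd.read_csv(r"C:\Users\ahmet\VSCode\GDZ-Datathon\gdz-datathon-data\test.csv")
weather = pd.read_csv(r"C:\Users\ahmet\VSCode\GDZ-Datathon\gdz-datathon-data\weather.csv")
holidays = pd.read_csv(r"C:\Users\ahmet\VSCode\GDZ-Datathon\gdz-datathon-data\holidays.csv")
# Translate the Turkish column names (Yıl = year, Ay = month, Gün = day).
holidays.rename(columns={'Yıl': 'year', 'Ay': 'month', 'Gün': 'day'}, inplace=True)
# Rename the weather columns to Turkish names (tarih = date, ilce = district).
weather.rename(columns={'date': 'tarih', 'name': 'ilce'}, inplace=True)
weather['ilce'] = weather['ilce'].str.lower()
# The target is unknown for the test set; add it as a NaN placeholder column.
test['bildirimsiz_sum'] = np.nan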

In [None]:
train = process_date(train)
test = process_date(test)
weather = process_date(weather)

In [None]:
train = concatenate(train)

In [None]:
train = process_group(train)
test = process_group(test)

In [None]:
weather = extract_features_on_weather(weather)

In [None]:
weather_df = weather_df_process(weather)

In [None]:
train = merge_data(train, weather_df, holidays)
test = merge_data(test, weather_df, holidays)

In [None]:
train = extract_date_features(train)
test = extract_date_features(test)

In [None]:
train = calculate_date_sin_cos(train)
test = calculate_date_sin_cos(test)

In [None]:
# train = categorize_process(train)
# test = categorize_process(test)

In [None]:
save_districts_csv(train, "trains")
save_districts_csv(test, "tests")

In [None]:
ilce_train_dataframes = load_from_folder(r"C:\Users\ahmet\VSCode\GDZ-Datathon\trains")
ilce_test_dataframes = load_from_folder(r"C:\Users\ahmet\VSCode\GDZ-Datathon\tests")

In [None]:
for ilce, ilce_train_df in ilce_train_dataframes.items():
    ilce_test_df = ilce_test_dataframes[ilce]
    ilce_train_dataframes[ilce], ilce_test_dataframes[ilce] = ExtractFeatures(CFG()).process(ilce_train_df, ilce_test_df, 'bildirimsiz_sum')

In [None]:
ilce_train_scaled_dataframes = {}
ilce_test_scaled_dataframes = {}
for ilce, ilce_train_df in ilce_train_dataframes.items():
    ilce_test_df = ilce_test_dataframes[ilce]
    ilce_train_scaled_dataframes[ilce], ilce_test_scaled_dataframes[ilce] = Scaler(CFG()).process(ilce_train_df, ilce_test_df, 'bildirimsiz_sum')

In [None]:
ilce_train_encoded_dataframes = {}
ilce_test_encoded_dataframes = {}
for ilce, ilce_train_df in ilce_train_scaled_dataframes.items():
    ilce_test_df = ilce_test_scaled_dataframes[ilce]
    ilce_train_encoded_dataframes[ilce], ilce_test_encoded_dataframes[ilce] = Encoder(CFG()).process(ilce_train_df, ilce_test_df, 'ilce')

In [None]:
ilce_train_results = {}
for ilce, ilce_df in ilce_train_encoded_dataframes.items():
    print(f"\nProcessing {ilce} data...")
    ilce_train_results[ilce] = Optuna(CFG()).process(ilce_df, 'ilce', 'bildirimsiz_sum', 50)

# Calculating and Printing the Average Lowest MAE Score

This section calculates the average of the lowest Mean Absolute Error (MAE) scores across different districts (`ilce`) from the training results. The goal is to summarize the performance of the models by considering the best MAE score for each district.

## Steps Involved

1. **Initialize List for Lowest Scores**: An empty list `lowest_scores` is initialized to store the lowest MAE scores for each district.
2. **Iterate Over Training Results**: Iterate over the `ilce_train_results` dictionary, which contains the training results for each district.
3. **Extract Lowest MAE Score for Each District**: For each district, find the lowest MAE score from the available results and append it to the `lowest_scores` list.
4. **Calculate Total and Average Scores**: Calculate the total and average of the lowest MAE scores across all districts.
5. **Print the Average MAE Score**: Print the calculated average MAE score.

In [None]:
lowest_scores = []

# Each result is indexed as (params, score); keep the lowest score per district.
for ilce, results in ilce_train_results.items():
    lowest_score = min(result[1] for result in results.values())
    lowest_scores.append(lowest_score)

total_lowest_score = sum(lowest_scores)
average_lowest_score = total_lowest_score / len(lowest_scores)
print("Average of per-district lowest MAE scores:", average_lowest_score)
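
To make the expected shape of the results concrete, here is a minimal sketch of the same calculation on toy data. The nested layout (district, then model name, then a `(params, score)` pair) is inferred from how the notebook indexes `result[0]` and `result[1]`; the district names, model names, parameters, and scores are made up, and `average_lowest_mae` is a hypothetical helper.

```python
# Toy stand-in for `ilce_train_results`: district -> model name -> (params, MAE).
toy_results = {
    "district_a": {
        "lgb": ({"n_estimators": 100}, 2.0),
        "xgb": ({"max_depth": 4}, 3.0),
    },
    "district_b": {
        "lgb": ({"n_estimators": 200}, 5.0),
        "xgb": ({"max_depth": 6}, 4.0),
    },
}


def average_lowest_mae(results_by_district):
    """Average the best (lowest) MAE of each district."""
    lowest = [
        min(score for _params, score in results.values())
        for results in results_by_district.values()
    ]
    return sum(lowest) / len(lowest)


# Best scores are 2.0 and 4.0, so the average is 3.0.
print(average_lowest_mae(toy_results))
```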

# Generating Predictions for Each District

This section generates predictions for each district (`ilce`) on the test data, using the model and parameters that achieved the lowest score for that district during tuning.

## Steps Involved

1. **Initialize Predictions Dictionary**: An empty dictionary `predictions` is initialized to store the predictions for each district.
2. **Iterate Over Training Data**: Iterate over the encoded training dataframes for each district.
3. **Select Corresponding Test Data**: For each district, retrieve the corresponding encoded test dataframe.
4. **Identify Best Model and Parameters**: Determine the best model and its parameters for each district by iterating over the training results.
5. **Generate Predictions**: Use the best model and parameters to generate predictions on the test data for each district.
6. **Handle Cases with No Valid Model**: If no model beats the initial score of infinity (for example, when a district has no results or every score is NaN), print a message and skip the prediction for that district.


In [None]:
predictions = {}

for ilce, ilce_train_df in ilce_train_encoded_dataframes.items():
    ilce_test_df = ilce_test_encoded_dataframes[ilce]
    results = ilce_train_results[ilce]
    best_model = None
    best_score = float('inf') 
    best_params = None

    for model, result in results.items():
        if result[1] < best_score:
            best_score = result[1]
            best_params = result[0]
            best_model = model

    if best_model is not None:
        predictions[ilce] = MLModels(CFG()).process(ilce, ilce_train_df, ilce_test_df, 'ilce', 'bildirimsiz_sum', best_model, best_params)
    else:
        print(f"No valid model found for {ilce}. Skipping...")
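
The selection loop above can also be written with `min` and a key function. This is a sketch on toy data rather than the notebook's own code: `pick_best_model` is a hypothetical helper, each result is assumed to be exactly a `(params, score)` pair, and non-finite scores are filtered out explicitly.

```python
import math


def pick_best_model(results):
    """Return (model_name, params, score) for the lowest finite score, or None."""
    candidates = {
        name: (params, score)
        for name, (params, score) in results.items()
        if score is not None and math.isfinite(score)
    }
    if not candidates:
        return None
    best_name = min(candidates, key=lambda name: candidates[name][1])
    best_params, best_score = candidates[best_name]
    return best_name, best_params, best_score


# Made-up model names, parameters, and scores.
toy = {
    "lgb": ({"n_estimators": 100}, 2.5),
    "xgb": ({"max_depth": 4}, 1.5),
    "catb": ({}, float("nan")),  # NaN scores are ignored
}

print(pick_best_model(toy))  # ('xgb', {'max_depth': 4}, 1.5)
print(pick_best_model({}))   # None
```

Returning `None` when nothing qualifies mirrors the "no valid model" branch, so the caller can skip that district.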

# Adding Predictions to Test Data

This section writes the predictions generated for each district into the corresponding test dataframes. The function `add_predictions_to_test_data` assigns the predicted values to the `bildirimsiz_sum` column of each district's test dataframe.

### Parameters
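- `test_dataframes`: A dictionary of test dataframes, keyed by district.
- `predictions`: A dictionary keyed by district. Each value is itself indexed by the district name (`pred_dict[ilce]`) to obtain that district's predicted values.

### Returns
- `test_dataframes`: The same dictionary that was passed in. Its dataframes are modified in place, with the predictions written to `bildirimsiz_sum`.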

### Steps Involved
1. **Iterate Over Predictions**: Iterate over the predictions dictionary.
2. **Check for Matching District**: For each district, check if it exists in the test dataframes.
3. **Add Predictions**: Add the predicted values to the `bildirimsiz_sum` column in the corresponding test dataframe.

In [None]:
def add_predictions_to_test_data(test_dataframes, predictions):
    for ilce, pred_dict in predictions.items():
        if ilce in test_dataframes:
            test_dataframes[ilce]['bildirimsiz_sum'] = pred_dict[ilce]
    return test_dataframes

test_dataframes_with_predictions = add_predictions_to_test_data(ilce_test_dataframes, predictions)
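
The following sketch shows the input shape this function expects, using toy data. The nested `predictions` layout follows the `pred_dict[ilce]` lookup above; the district names and values are made up, and `add_predictions` is a renamed copy of the function so that the example is self-contained.

```python
import pandas as pd


def add_predictions(test_frames, predictions):
    """Write each district's predictions into its test frame; skip unknown districts."""
    for district, pred_dict in predictions.items():
        if district in test_frames:
            test_frames[district]["bildirimsiz_sum"] = pred_dict[district]
    return test_frames


toy_tests = {
    "district_a": pd.DataFrame({"bildirimsiz_sum": [float("nan")] * 2}),
}
toy_predictions = {
    "district_a": {"district_a": [1.5, 2.5]},
    "district_z": {"district_z": [9.0]},  # no matching test frame, so it is ignored
}

updated = add_predictions(toy_tests, toy_predictions)
print(updated["district_a"]["bildirimsiz_sum"].tolist())  # [1.5, 2.5]
print(updated is toy_tests)  # True: the input dictionary is modified in place
```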

# Merging Predictions into a Single DataFrame

This section consolidates the per-district predictions into a single dataframe that contains the predicted `bildirimsiz_sum` for every district, along with a unique identifier for each prediction.

## Steps Involved

1. **Initialize a List for DataFrames**: An empty list `all_dataframes` is initialized to store the individual district dataframes with predictions.
2. **Iterate Over Test DataFrames**: Iterate over the test dataframes that contain the predictions.
3. **Process Each District's DataFrame**: For each district's dataframe:
   - Sort the dataframe by index (date).
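   - Create a `unique_id` column of the form `YYYY-MM-DD-<district>` by combining the formatted date index with the district name. This requires the dataframe to have a datetime index.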
   - Create a new dataframe with `unique_id` and `bildirimsiz_sum` columns.
   - Append the new dataframe to the `all_dataframes` list.
4. **Concatenate All DataFrames**: Concatenate all the individual district dataframes into a single dataframe.


In [None]:
all_dataframes = []

for ilce, test_df in test_dataframes_with_predictions.items():
    if 'bildirimsiz_sum' in test_df.columns:
        test_df = test_df.sort_index()
        test_df['unique_id'] = test_df.index.strftime('%Y-%m-%d') + '-' + ilce
        ilce_df = pd.DataFrame({
            'unique_id': test_df['unique_id'],
            'bildirimsiz_sum': test_df['bildirimsiz_sum'],
        })
        all_dataframes.append(ilce_df)

merged_df = pd.concat(all_dataframes, ignore_index=True)
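
As an illustration of the resulting layout, the sketch below runs the same steps on two toy districts. The district names, dates, and values are made up; only the `unique_id` format and the column names follow the cell above.

```python
import pandas as pd

# Toy per-district test frames with a DatetimeIndex (deliberately unsorted).
toy_frames = {
    "district_a": pd.DataFrame(
        {"bildirimsiz_sum": [3.0, 1.0]},
        index=pd.to_datetime(["2024-02-02", "2024-02-01"]),
    ),
    "district_b": pd.DataFrame(
        {"bildirimsiz_sum": [2.0]},
        index=pd.to_datetime(["2024-02-01"]),
    ),
}

parts = []
for district, frame in toy_frames.items():
    frame = frame.sort_index()  # sort by date; returns a new frame
    ids = frame.index.strftime("%Y-%m-%d") + "-" + district
    parts.append(
        pd.DataFrame(
            {
                "unique_id": ids,
                "bildirimsiz_sum": frame["bildirimsiz_sum"].to_numpy(),
            }
        )
    )

toy_merged = pd.concat(parts, ignore_index=True)
print(toy_merged["unique_id"].tolist())
# ['2024-02-01-district_a', '2024-02-02-district_a', '2024-02-01-district_b']
```

Passing the values through `to_numpy()` drops the date index, so each part starts from a fresh integer index before concatenation.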

# Saving the Merged Predictions to CSV

This section saves the consolidated `merged_df` dataframe, which holds a `unique_id` and the predicted `bildirimsiz_sum` for every district and date, to a CSV file at the specified path.

## Steps Involved

1. **Save DataFrame to CSV**: Use the `to_csv` method to save the `merged_df` dataframe to a CSV file. The file is saved without the index.


In [None]:
merged_df.to_csv(r"C:\Users\ahmet\VSCode\GDZ-Datathon\submissions\submission1.csv", index=False)