# **Libraries**

In this project, various libraries have been utilized for different data processing, modeling, and evaluation procedures. Here's a brief overview of these libraries:

> ### Data Processing

- **Pandas**: A library for tabular data analysis and manipulation, used here to load, clean, and reshape datasets.
- **NumPy**: A library for fast numerical computing on n-dimensional arrays, well suited to vectorized and matrix operations.

> ### Visualization

- **Matplotlib**: A fundamental library for data visualization. Used to create various visual presentations such as graphs, histograms, and plots.
- **Seaborn**: A visualization library built on Matplotlib that offers a higher-level interface and more polished default styles for statistical plots.
- **Plotly Express and Plotly Graph Objects**: Two interfaces of the Plotly library, used to create interactive and customizable charts.

> ### Data Preprocessing and Modeling

- **Scikit-learn**: A general-purpose machine learning library, used here for feature scaling, train/test splitting, outlier detection, and regression metrics.
- **Keras Tuner and Optuna**: Libraries for hyperparameter tuning that automate the search over model architectures and parameter values.
- **LightGBM, XGBoost, CatBoost**: Gradient boosting libraries designed for fast training, which makes them practical on large tabular datasets.
- **TensorFlow**: A powerful library used for building deep learning and artificial neural network models.

These libraries were chosen to perform diverse analyses, experiment with different modeling techniques, and leverage various functionalities for data processing, visualization, modeling, and evaluation within the project.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
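import plotly.graph_objects as go
from IPython.display import display  # Explicit import so display() is defined outside notebooks too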

# Import required modules from scikit-learn
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest, HistGradientBoostingRegressor

# Import modules for hyperparameter tuning
from keras_tuner.tuners import RandomSearch
import optuna

# Import machine learning frameworks
import lightgbm as lgb
import xgboost as xgb
import catboost as catb
import tensorflow as tf


# **Describe Method**

This function aims to generate a summary statistics table for a given dataset and display it with color-coded styling to highlight variations. Here's a breakdown of its components:

> ### Input Parameters

- **dataframe_name**: Represents the name or identifier of the dataset being analyzed.
- **dataframe**: Refers to the DataFrame object containing the dataset.

> ### Functionality

- **Compute Statistics**: Utilizes the describe() method to calculate descriptive statistics (count, mean, standard deviation, minimum, percentiles, and maximum) for numerical columns within the dataset. The 1st, 5th, 25th, 50th, 75th, 95th, and 99th percentiles are requested, and the result is transposed so that each row describes one column of the dataset.
- **Style Presentation**: Applies a background gradient to the statistics table, with color-coding based on the values, using a light green palette.
- **Display**: Presents the styled statistics table using the display() function, allowing a more visual and comprehensive understanding of the dataset's numeric attributes.
- **Final Output**: The function returns the original DataFrame, allowing further analysis or processing.

> ### Output:

The descriptive statistics table, styled with a color gradient, showcasing the distribution and summary of numerical features within the dataset.

> ### Purpose:

This function aids in quickly examining and understanding the central tendencies, dispersion, and distribution of numerical columns in a given dataset, enabling effective initial exploratory data analysis (EDA).

This function is valuable for providing an organized and visually appealing summary of the dataset's numeric attributes, facilitating an initial understanding of the data distribution and statistical characteristics.

In [None]:
def describe_dataset(dataframe_name, dataframe):
    # Print the name of the checked dataframe
    print(f"Checked {dataframe_name}")

    # Generate descriptive statistics with specified percentiles for the dataframe
    describe = dataframe.describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T

    # Create a color gradient style for the descriptive statistics
    cm = sns.light_palette("green", as_cmap=True)
    styled_describe = describe.style.background_gradient(cmap=cm)
    
    # Display the styled descriptive statistics
    display(styled_describe)
    print("\n\n")  # Add an empty line for better separation
    
    return dataframe  # Return the original dataframe
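
As a quick illustration of the statistics table this function builds, the minimal sketch below runs the same describe() call on a toy DataFrame. The column names (speed, power) and the values are made up for the example, and the styling step is left out so the snippet also runs outside a notebook.

```python
import pandas as pd

# Toy data: the column names and values are illustrative assumptions
df = pd.DataFrame({
    "speed": range(1, 101),
    "power": [2 * v for v in range(1, 101)],
})

# Same percentile list as in describe_dataset, transposed so each row is a column of df
percentiles = [0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]
summary = df.describe(percentiles=percentiles).T

print(summary.columns.tolist())
print(summary.loc["speed", ["min", "50%", "max"]])
```

The extra percentiles (1%, 5%, 95%, 99%) make it easier to see how far the extremes sit from the bulk of the data than the default quartiles alone.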


# **Log Method**

This LogOperation class encapsulates a processing operation applied to a DataFrame based on a configuration parameter (self.cfg.log_operation). Here's an explanation of its functionality:

> ### Attributes

- **cfg**: Represents a configuration object that contains parameters for the operation.

> ### Methods

- **\_\_init\_\_(self, cfg)**: Initializes the LogOperation object with the provided configuration.

- **process(self, dataframe, drop_columns)**: This method executes the data processing operation based on the configuration flag (self.cfg.log_operation).

> ### Parameters

- **dataframe**: The DataFrame object to be processed.
- **drop_columns**: Columns to be dropped from the DataFrame.

> ### Processing Steps

If self.cfg.log_operation is True:

- Copies the DataFrame (dataframe) to a new variable (data).
- Drops specified columns (drop_columns) from the copied DataFrame.
- Replaces any occurrences of zero (0) with a small positive value (0.001).
- Applies a logarithmic transformation (np.log()) to the DataFrame.

If self.cfg.log_operation is False:

- Copies the DataFrame (dataframe) to a new variable (data).
- Drops specified columns (drop_columns) from the copied DataFrame.

> ### Returns

The processed DataFrame (data). The specified columns are dropped in both cases, and the remaining columns are log-transformed only when self.cfg.log_operation is True.

> ### Purpose

This class always drops the specified columns and, when self.cfg.log_operation is True, additionally applies a logarithmic transformation to every remaining column. Because np.log() returns NaN for negative values, the remaining columns are expected to be non-negative.

The LogOperation class offers a way to apply conditional data transformations or column drops to a DataFrame based on the configuration parameter, allowing flexibility in preprocessing based on the specified conditions.

In [None]:
class LogOperation:
    def __init__(self, cfg):
        self.cfg = cfg  # Initialize LogOperation class with a configuration parameter

    def process(self, dataframe, drop_columns):
        # Check if log operation is enabled in the configuration
        if self.cfg.log_operation:
            # If enabled, create a copy of the dataframe and perform log transformation on selected columns
            data = dataframe.copy()
            data = data.drop(drop_columns, axis=1)  # Drop specified columns
            data = data.replace(0, 0.001)  # Replace zeros with a small value to avoid log(0) error
            data = np.log(data)  # Perform log transformation on the data

        else:
            # If log operation is not enabled, create a copy of the dataframe and drop specified columns
            data = dataframe.copy()
            data = data.drop(drop_columns, axis=1)  # Drop specified columns

        return data  # Return the processed data (with or without log transformation)
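
The processing steps with the log operation enabled can be sketched on a toy DataFrame as follows. The column names (id, power) are hypothetical, and the snippet inlines the steps rather than calling the class so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy data: "id" stands in for a column passed via drop_columns (hypothetical name)
df = pd.DataFrame({"id": [1, 2, 3], "power": [0.0, 1.0, 100.0]})

data = df.copy()
data = data.drop(["id"], axis=1)   # drop the non-feature column
data = data.replace(0, 0.001)      # zeros would otherwise map to -inf
logged = np.log(data)              # element-wise natural log

print(logged)
```

Since log(0.001) is roughly -6.9, replaced zeros end up well below log(1) = 0. Whether that gap is acceptable depends on the scale of the data being transformed.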


# **Missing Method**

This function is designed to visualize and analyze missing values within a DataFrame by creating a horizontal bar chart that illustrates the percentage of null and non-null values for each column.

> ### Input Parameters

- **dataframe_name**: Represents the name or identifier of the dataset being analyzed.
- **dataframe**: Refers to the DataFrame object containing the dataset.

> ### Functionality

- **Calculate Missing Values**: Computes the count of null values for each column in the DataFrame using dataframe.isnull().sum().
- **Compute Percentages**: Calculates the percentage of null values for each column by dividing the null count by the number of rows and multiplying by 100; the non-null percentage is 100 minus that value.
- **Prepare Visualization**: Constructs a horizontal bar chart using Matplotlib to visually represent the percentage of null and non-null values for each column.
- **Styling**: Applies distinctive colors (red for null values, orange for non-null values) to the bars for clear differentiation.
- **Annotations**: Adds annotations displaying the percentage values on the bars to enhance readability.
- **Display Chart**: Shows the created bar chart, representing the distribution of missing and non-missing values for each column.

> ### Output

An informative bar chart illustrating the percentage of null and non-null values for each column in the dataset.

> ### Purpose

This function aids in visually assessing the extent of missing values across different columns of a dataset, facilitating quick insights into data completeness and potential imputation needs.

In [None]:
def check_missing_values(dataframe_name, dataframe):
    # Calculate the count and percentage of missing values in the dataframe
    df_null_values = dataframe.isnull().sum().to_frame().rename(columns={0: 'Count'})
    df_null_values['Percentage_nulls'] = (df_null_values['Count'] / len(dataframe)) * 100
    df_null_values['Percentage_no_nulls'] = 100 - df_null_values['Percentage_nulls']

    n = len(df_null_values.index)
    x = np.arange(n)

    # Create a horizontal bar plot to visualize null and non-null percentages
    fig, ax = plt.subplots(figsize=(10, 6))

    bar_width = 0.4

    # Offset each bar by half its thickness so the two bars of a column sit side by side without overlapping
    rects1 = ax.barh(x - bar_width / 2, df_null_values['Percentage_nulls'], bar_width, label='Null values', color='red')
    rects2 = ax.barh(x + bar_width / 2, df_null_values['Percentage_no_nulls'], bar_width, label='No null values', color='orange')

    # Set plot properties and labels
    ax.set_title(f'{dataframe_name} Null Values and Non-null Values', fontsize=15, fontweight='bold')
    ax.set_xlabel('% Percentage', fontsize=12, fontweight='bold')
    ax.set_yticks(x)
    ax.set_yticklabels(df_null_values.index, fontsize=10, fontweight='bold')

    # Hide the top and right spines of the plot, add legend, and label bars with percentages
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.legend()

    def autolabel(rects):
        # Function to label bars with their respective percentages
        for rect in rects:
            width = rect.get_width()
            ax.annotate(f'{width:.2f}%',
                        xy=(width, rect.get_y() + rect.get_height() / 2),
                        xytext=(2, 0),
                        textcoords="offset points",
                        ha='left', va='center', size=10, weight='bold')

    autolabel(rects1)
    autolabel(rects2)

    fig.tight_layout()
    plt.show()
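
The percentage calculation behind the chart can be checked without plotting. The sketch below uses a toy DataFrame with made-up columns (a, b) and reproduces only the three derived quantities, under the assumption that the DataFrame has at least one row.

```python
import numpy as np
import pandas as pd

# Toy data: column "a" has two missing values out of four, column "b" has none
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
})

null_count = df.isnull().sum()
null_pct = null_count / len(df) * 100   # share of missing values per column
non_null_pct = 100 - null_pct           # complement

print(pd.DataFrame({
    "Count": null_count,
    "Percentage_nulls": null_pct,
    "Percentage_no_nulls": non_null_pct,
}))
```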

# **Correlation Matrix Method**

This function generates a heatmap representing the correlation matrix among numeric columns in a DataFrame, offering insights into the relationships and dependencies between these columns.

> ### Input Parameters

- **dataframe_name**: Represents the name or identifier of the dataset being analyzed.
- **dataframe**: Refers to the DataFrame object containing the dataset.

> ### Functionality

- **Select Numeric Columns**: Selects the numeric columns (both int and float types) from the DataFrame using the select_dtypes() method with include=[int, float].
- **Extract Relevant Data**: Extracts the numerical data columns from the DataFrame based on the identified numeric columns.
- **Calculate Correlation**: Computes the correlation matrix of the extracted numerical data using the corr() method, measuring the pairwise correlations between columns.
- **Create Heatmap**: Constructs a heatmap using Seaborn (sns.heatmap()) to visualize the correlation matrix with annotations (annot=True), color spectrum (cmap='coolwarm'), formatting to three decimal places (fmt=".3f"), and specified linewidths.
- **Title and Display**: Sets the title of the heatmap based on the provided dataframe_name and displays the generated heatmap.

> ### Output

A heatmap visualizing the correlation matrix among numeric columns in the dataset, showing the strength and direction of correlations between variables.

> ### Purpose

This function aids in identifying relationships and patterns among numeric features within the dataset, assisting in feature selection, identifying multicollinearity, and understanding interdependencies between variables.

The function enables a quick visualization of the correlation structure between numeric columns, providing insights into the strength and direction of relationships within the dataset.

In [None]:
def correlation_matrix(dataframe_name, dataframe):
    # Select numerical columns from the dataframe
    num_cols = dataframe.select_dtypes(include=[int, float]).columns

    # Get the numerical data based on selected columns
    data = dataframe[num_cols]

    # Compute the correlation matrix (named corr to avoid shadowing the function name)
    corr = data.corr()

    # Plot the correlation matrix as a heatmap
    plt.figure(figsize=(12, 8))
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".3f", linewidths=1)
    plt.title(f'{dataframe_name} Correlation Matrix')
    plt.show()
    print("\n\n")  # Add an empty line for better separation
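
To show what the values in the heatmap mean, here is a small sketch that computes the correlation matrix for a toy DataFrame without plotting it. The column names are invented, and include="number" is used as a shorthand for selecting numeric columns, which is an assumption of this example rather than the call used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],       # y = 2x, so it rises with x
    "z": [4, 3, 2, 1],       # z falls as x rises
    "label": list("abcd"),   # non-numeric, excluded below
})

# Pearson correlation between every pair of numeric columns
corr = df.select_dtypes(include="number").corr()
print(corr.round(3))
```

Values close to 1 or -1 indicate a strong linear relationship, and pairs of features with such values are candidates for a multicollinearity check.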


# **About Columns**

This function is designed to categorize columns within a DataFrame based on their data types and unique value counts, aiding in feature categorization and differentiation.

> ### Input Parameters

- **dataframe_name**: Represents the name or identifier of the dataset being analyzed.
- **dataframe**: Refers to the DataFrame object containing the dataset.
- **cat_th**: Unique-value threshold below which a numeric column is treated as categorical (default set to 10).
- **car_th**: Unique-value threshold above which a categorical column is treated as high-cardinality, i.e. "categorical but cardinal" (default set to 20).
- **print_results**: A Boolean parameter controlling the printing of results (default set to True).

> ### Categorize Columns

- **cat_cols**: Identifies columns with data types of "category," "object," or "bool" using list comprehension.
- **num_but_cat**: Selects columns with fewer than cat_th unique values and having data types of "int64" or "float64".
- **cat_but_car**: Finds columns with more than car_th unique values and data types of "category" or "object".

> ### Create Categorical and Numerical Lists

- **cat_cols**: Removes columns in cat_but_car from cat_cols and appends columns from num_but_cat.
- **num_cols**: Selects columns with data types of "int64" or "float64" that are not in cat_cols.

> ### Print Summary (Optional)
If print_results is True, displays a summary containing dataset information, such as the number of observations, variables, categorical columns (cat_cols), numerical columns (num_cols), and additional categorized columns.

> ### Return
Returns a list of numerical columns (num_cols).

> ### Output

Optionally prints a summary and returns a list of numerical columns.

> ### Purpose

Facilitates the identification and categorization of columns into numerical and categorical types within the dataset, assisting in subsequent analysis, preprocessing, or modeling tasks.

This function is useful for classifying columns based on their data types and unique value counts, aiding in the understanding and handling of different types of features in the dataset.

In [None]:
def grab_col_names(dataframe_name, dataframe, cat_th=10, car_th=20, print_results=True):
    # Identify categorical columns based on data type and unique value counts
    cat_cols = [col for col in dataframe.columns if str(dataframe[col].dtypes) in ["category", "object", "bool"]]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and dataframe[col].dtypes in ["int64", "float64"]]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and str(dataframe[col].dtypes) in ["category", "object"]]
    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    cat_cols = cat_cols + num_but_cat
    
    # Identify numerical columns excluding previously identified categorical columns
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes in ["int64", "float64"]]
    num_cols = [col for col in num_cols if col not in cat_cols]

    # Print results if specified
    if print_results:
        print(f"{dataframe_name} Dataset")
        print("*"*20, "\n")
        print(f'Observations {dataframe.shape[0]}')
        print(f'Variables:  {dataframe.shape[1]}')
        print(f'cat_cols:  {len(cat_cols)}')
        print(f'num_cols:  {len(num_cols)}')
        print(f'cat_but_car:  {len(cat_but_car)}')
        print(f'num_but_cat:  {len(num_but_cat)}\n\n')

    return num_cols  # Return the list of numerical columns
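
The effect of the cat_th threshold is easiest to see on a tiny example. The sketch below is a simplified version of the rule: it uses pd.api.types.is_numeric_dtype instead of comparing dtype names, skips the cat_but_car step, and lowers cat_th to 3 so that the toy data triggers it. All column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "A", "C"],            # text column
    "n_rooms": [1, 2, 1, 2],                 # numeric with only 2 unique values
    "price": [100.5, 200.1, 150.0, 99.9],    # numeric with 4 unique values
})

cat_th = 3  # lowered from the default of 10 for this toy data

numeric = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
num_but_cat = [c for c in numeric if df[c].nunique() < cat_th]
cat_cols = [c for c in df.columns if c not in numeric] + num_but_cat
num_cols = [c for c in numeric if c not in cat_cols]

print("cat_cols:", cat_cols)
print("num_cols:", num_cols)
```

Here n_rooms is stored as a number but behaves like a category, so it is moved out of the numerical list.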


# **BoxPlot Method**

This function is used to create a grid of box plots for numerical columns in a dataset, aiding in the visualization of their distributions and statistical characteristics.

> ### Input Parameters

- **dataset_name**: Represents the name or identifier of the dataset being visualized.
- **dataframe**: Refers to the DataFrame object containing the dataset.
- **num_cols**: A list of numerical columns from the dataset to be visualized using box plots.
- **ncols**: Number of columns in the grid layout (default set to 2).

> ### Functionality

Calculates the number of required rows (nrows) based on the number of numerical columns and the specified ncols.

> ### Create Subplots

Generates a grid of subplots (nrows x ncols) with a specified figure size.

> ### Plot Box Plots

- Iterates through each numerical column in num_cols.
- Checks if the column exists in the DataFrame and creates a box plot for the column using Seaborn's boxplot() function.
- Each box plot represents the distribution of values for a specific numerical column.
- Assigns titles to each subplot based on the column name.

> ### Adjust Layout and Display

- Adjusts the layout of subplots to prevent overlapping.
- Displays the grid of box plots.

> ### Output

Displays a grid of box plots for the specified numerical columns in the dataset.

> ### Purpose

Facilitates a visual examination of the distribution, central tendency, and spread of numerical data in the dataset through box plots, aiding in identifying potential outliers and understanding data variability.

This function offers an efficient way to visualize multiple numerical columns simultaneously, allowing for a quick overview of their distributions and statistical summaries using box plots.

In [None]:
def boxplot(dataset_name, dataframe, num_cols, ncols=2):
    # Calculate the number of rows needed for subplots based on the number of numerical columns and ncols
    nrows = (len(num_cols) + ncols - 1) // ncols
    
    # Create subplots based on nrows and ncols (squeeze=False keeps axes 2-D even when nrows == 1)
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(10, 3*nrows), squeeze=False)
    fig.suptitle(f"{dataset_name} Visualization")  # Set the main title of the visualization

    colors = sns.color_palette('Set3', n_colors=len(dataframe.columns))  # Define colors for each boxplot
    
    # Plot boxplots for each numerical column in the dataframe
    for i, col in enumerate(num_cols):
        if col in dataframe.columns:
            ax = axes[i // ncols, i % ncols]  # Define the axes for each subplot
            sns.boxplot(x=dataframe[col], ax=ax, color=colors[i])  # Create the boxplot
            ax.set_title(f"Box Plot of {col}")  # Set the title for each subplot

    plt.tight_layout()  # Adjust layout for better visualization
    plt.show()  # Display the plot
    print("\n\n") 
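
The grid arithmetic used above can be sketched in isolation. The helper names grid_shape and cell_for are hypothetical and exist only for this illustration.

```python
def grid_shape(n_plots, ncols=2):
    """Return (nrows, ncols) with enough cells for n_plots subplots."""
    nrows = (n_plots + ncols - 1) // ncols  # ceiling division without floats
    return nrows, ncols


def cell_for(i, ncols=2):
    """Map the i-th plot to its (row, column) position, filling row by row."""
    return divmod(i, ncols)


for n in (1, 4, 5):
    print(n, grid_shape(n), [cell_for(i) for i in range(n)])
```

With five columns and ncols=2 the grid has three rows and one unused cell. Note that plt.subplots returns a 1-D array of axes for a single-row grid unless squeeze=False is passed, which matters for two-index lookups such as axes[row, col].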


# **HistPlot Method**

This function is designed to generate a grid layout of histograms for numerical columns in a dataset, providing visual insights into their distributions.

> ### Input Parameters

- **dataframe_name**: Represents the name or identifier of the dataset being visualized.
- **dataframe**: Refers to the DataFrame object containing the dataset.
- **num_cols**: A list of numerical columns from the dataset to be visualized using histograms.
- **ncols**: Number of columns in the grid layout (default set to 2).
- **bins**: Number of bins for the histograms (default set to 10).

> ### Functionality

Calculates the number of required rows (nrows) based on the number of numerical columns and the specified ncols.

> ### Create Subplots

Generates a grid of subplots (nrows x ncols) with a specified figure size.

> ### Plot Histograms

- Iterates through each numerical column in num_cols.
- Checks if the column exists in the DataFrame and creates a histogram for the column using Seaborn's histplot() function.
- Each histogram represents the distribution of values for a specific numerical column.
- Assigns titles to each subplot based on the column name.

> ### Adjust Layout and Display

- Adjusts the layout of subplots to prevent overlapping.
- Displays the grid of histograms.

> ### Output

Displays a grid of histograms for the specified numerical columns in the dataset.

> ### Purpose

Provides an overview of the distribution patterns of numerical data in the dataset through histograms, enabling insights into the data's central tendency, spread, and potential skewness or outliers.

This function offers a convenient way to visualize the distribution of multiple numerical columns simultaneously by creating a grid of histograms. It allows for a quick examination of the distribution characteristics of each numerical variable in the dataset.

In [None]:
def hist_plot(dataframe_name, dataframe, num_cols, ncols=2, bins=10):
    # Calculate the number of rows needed for subplots based on the number of numerical columns and ncols
    nrows = (len(num_cols) + ncols - 1) // ncols
    
    # Create subplots based on nrows and ncols (squeeze=False keeps axes 2-D even when nrows == 1)
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(10, 3*nrows), squeeze=False)
    fig.suptitle(f'{dataframe_name} Histogram Visualization')  # Set the main title of the visualization

    # Plot histograms for each numerical column in the dataframe
    for i, col in enumerate(num_cols):
        if col in dataframe.columns:
            ax = axes[i // ncols, i % ncols]  # Define the axes for each subplot
            sns.histplot(dataframe[col], ax=ax, bins=bins, kde=True)  # Create the histogram plot
            ax.set_title(f"Histogram of {col}")  # Set the title for each subplot

    plt.tight_layout()  # Adjust layout for better visualization
    plt.show()  # Display the plot
    print("\n\n") 
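
To make the bins parameter concrete, the sketch below counts values per bin with NumPy's histogram function. This is a stand-in chosen to keep the example free of plotting; it is not the Seaborn call used above, though both split the value range into equal-width bins when given an integer.

```python
import numpy as np

values = np.arange(100)                       # toy data: the integers 0..99
counts, edges = np.histogram(values, bins=10)

print(counts)   # number of values falling in each bin
print(edges)    # 11 edges delimit 10 bins
```

A small bins value smooths over detail, while a large one can make a skewed or multi-modal column look noisy, so it is worth trying a few values per dataset.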


# **Check Outlier Method**

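This function checks for potential outliers within numerical columns in a dataset using an interquartile range (IQR) style rule. The default quantiles are the 10th and 90th percentiles rather than the classic 25th and 75th, which widens the limits and makes the check more conservative than the textbook IQR method.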

> ### Input Parameters

- **dataframe_name**: Name or identifier of the dataset being checked for outliers.
- **dataframe**: DataFrame object containing the dataset.
- **num_cols**: List of numerical columns in the dataset to be checked for outliers.
- **low_quantile**: Lower quantile threshold (default set to 0.10, representing 10th percentile).
- **up_quantile**: Upper quantile threshold (default set to 0.90, representing 90th percentile).

> ### Functionality

Loops through each numerical column specified in num_cols.

> ### Outlier Detection

- Computes the lower quantile (quantile_one) and the upper quantile (quantile_three) at low_quantile and up_quantile. These are the true first and third quartiles only when 0.25 and 0.75 are passed.
- Calculates the interquantile range (interquantile_range) as the difference between the upper and lower quantiles.
- Determines the limits as quantile_one - 1.5 * interquantile_range and quantile_three + 1.5 * interquantile_range.
- Checks if any values in the column fall outside these limits.

> ### Output Display

Prints a message for each column indicating the presence or absence of outliers based on the defined limits.

> ### Purpose

- Aims to identify potential outliers within numerical columns by analyzing their positions relative to the IQR boundaries.
- Offers an insight into columns where values significantly deviate from the central data distribution.

This function serves as a basic outlier detection method by utilizing the IQR technique to flag potential outliers in the numerical columns of a dataset. It helps in identifying data points that may require further investigation due to their extreme values in comparison to the rest of the dataset.

In [None]:
def check_outlier(dataframe_name, dataframe, num_cols, low_quantile=0.10, up_quantile=0.90):
    # Check for outliers in numerical columns within the specified quantiles
    print(f'\n Checking for {dataframe_name} Dataset')
    for col in num_cols:
        quantile_one = dataframe[col].quantile(low_quantile)  # Get the lower quantile
        quantile_three = dataframe[col].quantile(up_quantile)  # Get the upper quantile

        interquantile_range = quantile_three - quantile_one  # Calculate the interquantile range
        up_limit = quantile_three + 1.5 * interquantile_range  # Calculate the upper limit
        low_limit = quantile_one - 1.5 * interquantile_range  # Calculate the lower limit

        # Check if any value in the column falls outside the limits
        if ((dataframe[col] > up_limit) | (dataframe[col] < low_limit)).any():
            print(f'{col} column has outliers..')
        else:
            print(f'{col} column has no outliers..')

    print("\n")
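
The choice of quantiles changes which values are flagged. The sketch below compares the classic 0.25/0.75 quartiles with the 0.10/0.90 defaults on a toy series; the helper names fence_limits and has_outlier are hypothetical and only serve this illustration.

```python
import pandas as pd


def fence_limits(series, low_q, up_q):
    """Return (low_limit, up_limit) using 1.5 times the spread between two quantiles."""
    q_low = series.quantile(low_q)
    q_up = series.quantile(up_q)
    spread = q_up - q_low
    return q_low - 1.5 * spread, q_up + 1.5 * spread


def has_outlier(series, low_q, up_q):
    """True if at least one value lies outside the limits."""
    low_limit, up_limit = fence_limits(series, low_q, up_q)
    return bool(((series < low_limit) | (series > up_limit)).any())


# Toy data: 30 is the suspicious value
s = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 30])

print(fence_limits(s, 0.25, 0.75), has_outlier(s, 0.25, 0.75))
print(fence_limits(s, 0.10, 0.90), has_outlier(s, 0.10, 0.90))
```

On this series the value 30 is flagged with the classic quartiles but not with the wider 0.10/0.90 limits, so a "no outliers" message under the defaults only rules out fairly extreme values.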


# **Outlier Handling Method**

This class provides methods for outlier detection in a dataset using Isolation Forest (IForest) or Local Outlier Factor (LOF) algorithms.

> ### Constructor (__init__)

Initializes the class instance with configuration settings (cfg) used to determine the outlier detection method to apply.


> ### Input Parameters

- **dataframe_name**: Name or identifier of the dataset.
- **dataframe**: DataFrame object containing the dataset.
- **num_cols**: List of numerical columns in the dataset to perform outlier detection.
- **contamination**: The proportion of outliers to expect in the dataset.
- **n_neighbors**: Number of neighbors to consider (for LOF method).

> ### IForest (Isolation Forest)

- Applies Isolation Forest algorithm if cfg.IForest is set to True.
- Detects outliers using Isolation Forest algorithm.
- Marks anomalies in the dataset by assigning 1 to an 'anomaly' column for detected outliers.
- Prints a message indicating the application of Isolation Forest for the specified dataset.

> ### LOF (Local Outlier Factor)

- Applies Local Outlier Factor algorithm if cfg.LOF is set to True and cfg.IForest is False (Isolation Forest takes precedence when both are enabled).
- Detects outliers using Local Outlier Factor algorithm.
- Marks anomalies in the dataset by assigning 1 to an 'anomaly' column for detected outliers.
- Prints a message indicating the application of LOF for the specified dataset.

> ### Return

Returns the DataFrame with an additional 'anomaly' column (1 for detected outliers, 0 otherwise), or the original DataFrame if outlier detection is disabled. Note that the 'anomaly' column is added to the input DataFrame in place rather than to a copy.

> ### Purpose

- Facilitates outlier detection by applying specific algorithms based on the provided configuration settings.
- Identifies anomalies in the dataset to highlight potential data points that deviate significantly from the general pattern.

This class enables the automatic detection of outliers using Isolation Forest or Local Outlier Factor techniques based on the user-defined configurations, providing a convenient way to identify potential anomalies within a dataset.

In [None]:
class OutlierDetection:
    def __init__(self, cfg):
        self.cfg = cfg

    def process(self, dataframe_name, dataframe, num_cols, contamination, n_neighbors):
        # Apply outlier detection algorithms based on configuration settings
        if self.cfg.IForest:  # Check if Isolation Forest is enabled
            # Isolation Forest algorithm implementation
            X = dataframe[num_cols]
            # Note: n_estimators is set to the number of rows, so fitting cost grows quickly with dataset size
            model = IsolationForest(n_estimators=len(dataframe), contamination=contamination)
            model.fit(X)
            anomaly_label = model.predict(X)
            anomaly = X[anomaly_label == -1]
            # Mark anomalies in the dataset
            dataframe['anomaly'] = 0
            dataframe.loc[anomaly.index, 'anomaly'] = 1
            print(f'Isolation Forest applied for {dataframe_name} dataset..')
        
        elif self.cfg.LOF:  # Check if Local Outlier Factor is enabled
            # Local Outlier Factor algorithm implementation
            X = dataframe[num_cols]
            lof_model = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
            outlier_labels = lof_model.fit_predict(X)
            anomaly = X[outlier_labels == -1]
            # Mark anomalies in the dataset
            dataframe['anomaly'] = 0
            dataframe.loc[anomaly.index, 'anomaly'] = 1
            print(f'Local Outlier Factor applied for {dataframe_name} dataset..')
        
        else:
            print('Outlier Detection process is disabled. Returning original dataframe.')
            
        return dataframe
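
The step that turns model output into the 'anomaly' column can be sketched without fitting a model. In the example below the labels are hard-coded to mimic scikit-learn's convention of -1 for outliers and 1 for inliers; the data and the copy-before-modify choice are assumptions of this sketch, not part of the class above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, 50.0, 1.2, 0.9, -40.0]})

# Stand-in for model.predict(X) or fit_predict(X): -1 marks outliers, 1 marks inliers.
# These labels are hard-coded for illustration only.
labels = np.array([1, -1, 1, 1, -1])

flagged = df.copy()                               # leave the input untouched
flagged["anomaly"] = (labels == -1).astype(int)   # 1 = outlier, 0 = inlier

# Typical follow-up: keep only the rows not marked as anomalies
inliers = flagged[flagged["anomaly"] == 0]
print(flagged)
print(len(inliers), "rows kept")
```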


# **Feature Engineering (for wind_plant)**

This class provides a method for extracting new features from a dataset.

> ### Constructor (__init__)

Initializes the class instance with configuration settings (cfg) used to determine whether to extract features.

> ### Input Parameters

- **dataframe_name**: Name or identifier of the dataset.
- **dataframe**: DataFrame object containing the dataset.
- **target_col**: The column considered as the target variable.

> ### Conditional Feature Extraction

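- When cfg.extract_features is True, derives additional features from the dataset based on numerical columns.
- Computes per-row statistics (mean, standard deviation, and variance) across the numerical columns, excluding the target column.
- Applies sine and cosine to the wind direction columns ('windDir10m' and 'windDir100m') after converting degrees to radians, so that directions near 0 and 360 degrees are encoded as close together.
- Calculates a rolling mean for the 'windSpeed10m' and 'windSpeed100m' columns. With the window size of 1 used in the code, the rolling mean equals the original value, so a larger window is needed for actual smoothing.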

> ### Return

Returns the modified DataFrame with added features or the original DataFrame if feature extraction is disabled.

> ### Purpose

- Facilitates the creation of new derived features from the dataset, such as statistical aggregates and trigonometric transformations.
- Enhances the dataset with additional information that might improve model performance during training or analysis.

This class enables the extraction of supplementary features from the dataset based on user-defined conditions, providing an enriched dataset with new information that might be beneficial for modeling or analysis purposes.

In [None]:
class ExtractFeature:
    def __init__(self, cfg):
        self.cfg = cfg

    def process(self, dataframe_name, dataframe, target_col):
        # Check if feature extraction is enabled in the configuration
        if self.cfg.extract_features:
            # Select numerical columns excluding the target column
            num_cols = dataframe.select_dtypes(include=['int64', 'float64']).columns.tolist()
            num_cols = [col for col in num_cols if col != target_col]

            # Perform feature extraction
            dataframe['mean'] = dataframe[num_cols].mean(axis=1)
            dataframe['std'] = dataframe[num_cols].std(axis=1)
            dataframe['var'] = dataframe[num_cols].var(axis=1)
            dataframe['windir10m_sin'] = np.sin(np.deg2rad(dataframe['windDir10m']))
            dataframe['windDir10m_cos'] = np.cos(np.deg2rad(dataframe['windDir10m']))
            dataframe['windir100m_sin'] = np.sin(np.deg2rad(dataframe['windDir100m']))
            dataframe['windDir100m_cos'] = np.cos(np.deg2rad(dataframe['windDir100m']))
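            # NOTE: with window=1 the rolling mean equals the original value;
            # a larger window is required for any actual smoothing
            dataframe['windSpeed10m_hourly_mean'] = dataframe['windSpeed10m'].rolling(window=1).mean()
            dataframe['windSpeed100m_hourly_mean'] = dataframe['windSpeed100m'].rolling(window=1).mean()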

            print(f"All features extracted for {dataframe_name} dataset..")
            
        else:
            print("Extract features is disabled. Returning original dataframe.")
        
        return dataframe


# **About Datetime**

This function parses a DataFrame's date column and derives additional temporal features from it.

> ### Input Parameters

- **dataframe**: DataFrame containing the date-related information.
- **date_column**: Column name representing the date or timestamp.

> ### Date Processing Steps

- Converts the specified date_column to a Pandas datetime object.
- Sets the datetime column as the index for the DataFrame.
- Sorts the DataFrame based on the datetime index.
- **year**: Extracts the year.
- **quarter**: Extracts the quarter of the year.
- **month**: Extracts the month.
- **week**: Extracts the week of the year using ISO week numbering.
- **day_of_week**: Extracts the day of the week (0 = Monday, 6 = Sunday).
- **day_of_year**: Extracts the day of the year.
- **is_weekend**: Flags whether the day falls on a weekend (Saturday or Sunday).
- **hour**: Extracts the hour of the day.

> ### Return

Returns the modified DataFrame with added temporal features based on the date_column.

> ### Purpose

- Facilitates the extraction of various temporal attributes from a date column.
- Enriches the dataset with time-related information that could be useful for time-series analysis or modeling.

This function enables the creation of a more detailed dataset by incorporating temporal information extracted from the specified date column, allowing for better analysis or modeling, especially in time-oriented datasets.
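
A minimal sketch of the feature derivation, using two hypothetical timestamps chosen to show an ISO-week edge case: 1 January 2023 is a Sunday and belongs to ISO week 52 of the previous year, so the `week` feature does not always restart at 1 on New Year's Day.

```python
import pandas as pd

# Hypothetical timestamps: a Sunday afternoon and the following Monday midnight
stamps = pd.DatetimeIndex(["2023-01-01 13:00", "2023-01-02 00:00"])

features = pd.DataFrame(index=stamps)
# isocalendar() returns a nullable unsigned integer column, so cast it to int
features["week"] = stamps.isocalendar().week.astype(int)
features["day_of_week"] = stamps.dayofweek  # 0 = Monday, 6 = Sunday
features["is_weekend"] = (stamps.dayofweek >= 5).astype(int)
features["hour"] = stamps.hour

print(features)
```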

In [None]:
def datetime_process(dataframe, date_column):
    # Convert the specified column to datetime format
    dataframe[date_column] = pd.to_datetime(dataframe[date_column])
    
    # Set the datetime column as the index
    dataframe = dataframe.set_index(date_column, drop=True)
    
    # Sort the dataframe based on the datetime index
    dataframe.sort_index(inplace=True)
    
    # Extract various date and time-related features
    dataframe['year'] = dataframe.index.year
    dataframe['quarter'] = dataframe.index.quarter
    dataframe['month'] = dataframe.index.month
    dataframe['week'] = dataframe.index.isocalendar().week
    dataframe['day_of_week'] = dataframe.index.dayofweek
    dataframe['day_of_year'] = dataframe.index.dayofyear
    dataframe['is_weekend'] = dataframe.index.dayofweek.isin([5, 6]).astype(int)
    dataframe['hour'] = dataframe.index.hour

    return dataframe


In [None]:
def fig_specs(fig1, fig2, dataframe, year_col, x_col, y_col):
    # Set visibility of initial traces to False
    fig1.data[0].visible = False
    fig2.data[0].visible = False

    # Create buttons to toggle visibility of traces for different years
    buttons = [{'label': 'Choose Option', 'method': 'update', 'args': [{'visible': [False] * len(fig1.data)}]}]

    for year in dataframe[year_col].unique():
        button = {'label': str(year), 'method': 'update',
                  'args': [{'visible': [trace.name.startswith(str(year)) for trace in fig1.data]}]}
        buttons.append(button)

    # Create a menu specification dictionary for the interactive buttons
    menu_spec_dict = {'buttons': buttons, 'direction': 'down', 'showactive': True,
                      'x': 0.5, 'xanchor': 'center', 'y': 1.15, 'yanchor': 'top'}

    # Update layout settings and titles for both figures
    fig1.update_layout(updatemenus=[menu_spec_dict],
                       title=f"{x_col}ly Total {y_col} Usage by Year",
                       xaxis_title=x_col, yaxis_title=y_col, barmode='stack')

    fig2.update_layout(updatemenus=[menu_spec_dict],
                       title=f'{x_col}ly Mean {y_col} Usage by Year',
                       xaxis_title=x_col, yaxis_title=y_col, barmode='stack')

    # Show the figures with interactive controls
    fig1.show()
    fig2.show()


In [None]:
def target_analysis_quarterly(dataframe, year_col, quarter_col, target_col):
    # Initialize figures for total and mean analyses
    fig_total = go.Figure()
    fig_mean = go.Figure()

    # Define color scheme for each year
    colors = px.colors.qualitative.Set2

    # Loop through unique years in the dataset
    for idx, year in enumerate(dataframe[year_col].unique()):
        # Extract data for the current year
        data_of_yearly = dataframe[dataframe[year_col] == year]
        
        # Calculate total demand per quarter for the current year
        quarterly_total_demand = data_of_yearly.groupby(quarter_col)[target_col].sum().reset_index()
        # Add a bar trace for total demand to the total figure
        # (the colour index wraps around if there are more years than colours)
        fig_total.add_trace(go.Bar(x=quarterly_total_demand[quarter_col], 
                                   y=quarterly_total_demand[target_col], 
                                   name=str(year),
                                   visible=False,
                                   marker_color=colors[idx % len(colors)]))

        # Calculate mean demand per quarter for the current year
        quarterly_mean_demand = data_of_yearly.groupby(quarter_col)[target_col].mean().reset_index()
        # Add a bar trace for mean demand to the mean figure
        fig_mean.add_trace(go.Bar(x=quarterly_mean_demand[quarter_col], 
                                  y=quarterly_mean_demand[target_col], 
                                  name=f'{year}', 
                                  marker_color=colors[idx % len(colors)],
                                  visible=False, 
                                  opacity=0.5))

    # Use the fig_specs function to display both figures with interactive controls
    fig_specs(fig_total, fig_mean, dataframe, year_col, quarter_col, target_col)



In [None]:
def target_analysis_monthly(dataframe, year_col, month_col, target_col):
    # Initialize figures for total and mean analyses
    fig_total = go.Figure()
    fig_mean = go.Figure()

    # Define color scheme for each year
    colors = px.colors.qualitative.Vivid

    # Loop through unique years in the dataset
    for idx, year in enumerate(dataframe[year_col].unique()):
        # Extract data for the current year
        data_of_yearly = dataframe[dataframe[year_col] == year]
        
        # Calculate total demand per month for the current year
        monthly_total_demand = data_of_yearly.groupby(month_col)[target_col].sum().reset_index()
        # Add a bar trace for total demand to the total figure
        fig_total.add_trace(go.Bar(x=monthly_total_demand[month_col], 
                                   y=monthly_total_demand[target_col], 
                                   name=str(year),
                                   visible=False,
                                   marker_color=colors[idx]))

        # Calculate mean demand per month for the current year
        monthly_mean_demand = data_of_yearly.groupby(month_col)[target_col].mean().reset_index()
        # Add a bar trace for mean demand to the mean figure
        fig_mean.add_trace(go.Bar(x=monthly_mean_demand[month_col], 
                                  y=monthly_mean_demand[target_col], 
                                  name=f'{year}', 
                                  marker_color=colors[idx],
                                  visible=False, 
                                  opacity=0.5))

    # Use the fig_specs function to display both figures with interactive controls
    fig_specs(fig_total, fig_mean, dataframe, year_col, month_col, target_col)


In [None]:
def target_analysis_weekly(dataframe, year_col, week_col, target_col):
    # Initialize figures for total and mean analyses
    fig_total = go.Figure()
    fig_mean = go.Figure()

    # Define color scheme for each year
    colors = px.colors.qualitative.Vivid

    # Loop through unique years in the dataset
    for idx, year in enumerate(dataframe[year_col].unique()):
        # Extract data for the current year
        data_of_yearly = dataframe[dataframe[year_col] == year]
        
        # Calculate total demand per week for the current year
        weekly_total_demand = data_of_yearly.groupby(week_col)[target_col].sum().reset_index()
        # Add a bar trace for total demand to the total figure
        fig_total.add_trace(go.Bar(x=weekly_total_demand[week_col], 
                                   y=weekly_total_demand[target_col], 
                                   name=str(year),
                                   visible=False,
                                   marker_color=colors[idx]))

        # Calculate mean demand per week for the current year
        weekly_mean_demand = data_of_yearly.groupby(week_col)[target_col].mean().reset_index()
        # Add a bar trace for mean demand to the mean figure
        fig_mean.add_trace(go.Bar(x=weekly_mean_demand[week_col], 
                                  y=weekly_mean_demand[target_col], 
                                  name=f'{year}', 
                                  marker_color=colors[idx],
                                  visible=False, 
                                  opacity=0.5))

    # Use the fig_specs function to display both figures with interactive controls
    fig_specs(fig_total, fig_mean, dataframe, year_col, week_col, target_col)


In [None]:
def target_analysis_hourly(dataframe, year_col, month_col, hour_col, target_col):
    # Group data by year, month, and hour to calculate hourly usage and mean
    hourly_usage = dataframe.groupby([year_col, month_col, hour_col])[target_col].sum().reset_index()
    hourly_mean = dataframe.groupby([year_col, month_col, hour_col])[target_col].mean().reset_index()

    # Initialize figures for total and mean analyses
    fig_total = go.Figure()
    fig_mean = go.Figure()

    # Loop through unique years in the dataset
    for year in dataframe[year_col].unique():
        year_data = hourly_usage[hourly_usage[year_col] == year]
        year_mean = hourly_mean[hourly_mean[year_col] == year]

        # Check if the year's data is available
        if not year_data.empty:
            for month in year_data[month_col].unique():
                monthly_data = year_data[year_data[month_col] == month]
                monthly_mean = year_mean[year_mean[month_col] == month]

                # Check if both monthly data and mean are available
                if not monthly_data.empty and not monthly_mean.empty:
                    # Add a bar trace for hourly total usage to the total figure
                    fig_total.add_trace(go.Bar(
                        x=monthly_data[hour_col],
                        y=monthly_data[target_col],
                        name=f'{year} - {month}',
                        visible=False,
                    ))

                    # Add a bar trace for hourly mean usage to the mean figure
                    fig_mean.add_trace(go.Bar(
                        x=monthly_mean[hour_col],
                        y=monthly_mean[target_col],
                        name=f'{year} - {month}',
                        visible=False,
                        opacity=0.5
                    ))

    # Use the fig_specs function to display both figures with interactive controls
    fig_specs(fig_total, fig_mean, dataframe, year_col, hour_col, target_col)


# **Scaler Method**

This class contains a method to scale numerical columns within DataFrames according to specific scaling techniques.

> ### Attributes

- **cfg**: Configuration settings that determine the type of scaling to be applied.

> ### Inputs

- **dataframe**: The training dataset containing numerical columns to be scaled.
- **dataframe_test**: The testing dataset with the same numerical columns as the training set.
- **week_col**: The column representing weeks, converted to integer type for consistency.

> ### MinMaxScaler

Scales numerical columns within the DataFrame using MinMaxScaler to a range of 0 to 1.

> ### StandardScaler

Utilizes StandardScaler to standardize numerical columns with mean=0 and standard deviation=1.

> ### RobustScaler

Applies RobustScaler, which is robust to outliers by scaling data based on median and interquartile range.
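
To make the difference between the three techniques concrete, the sketch below applies the underlying formulas directly with NumPy to a made-up array containing one outlier. It is an illustration of the arithmetic under default settings (0 to 1 range, population standard deviation, 25th to 75th percentile range), not the scikit-learn implementation itself.

```python
import numpy as np

# Hypothetical feature values; the last one is an outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: map the smallest value to 0 and the largest to 1
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, divide by the population standard deviation
standard = (values - values.mean()) / values.std()

# Robust scaling: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
robust = (values - median) / (q3 - q1)

print(min_max.round(3))
print(standard.round(3))
print(robust.round(3))
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near 0, whereas robust scaling keeps them evenly spread around 0.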

> ### Operations

- Checks the specified configuration settings to determine the scaling technique to be applied.
- Scales numerical columns of the training and testing datasets accordingly.
- Prints a message indicating the successful application of the selected scaling process.

> ### Returns

Modified training and testing DataFrames with scaled numerical columns.

> ### Purpose

- Facilitates the scaling of numerical features in datasets based on specified configurations.
- Ensures that training and testing datasets undergo the same scaling process for consistency in modeling or analysis.

This class allows for the uniform application of scaling techniques to numerical columns in both training and testing datasets, providing consistent and standardized data for modeling or analysis purposes.
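
The key idea, fitting the scaler on the training data and reusing it unchanged on the test data, can be sketched as follows. The two small DataFrames and their column names are hypothetical, and only `MinMaxScaler` is shown.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with one numeric and one non-numeric column
train = pd.DataFrame({"speed": [0.0, 5.0, 10.0], "site": ["a", "b", "c"]})
test = pd.DataFrame({"speed": [2.5, 20.0], "site": ["a", "b"]})

# Only numeric columns are scaled; other columns are left untouched
num_columns = train.select_dtypes(include="number").columns.tolist()

# Fit on the training data only, so no test-set statistics leak into the scaler
scaler = MinMaxScaler().fit(train[num_columns])
train[num_columns] = scaler.transform(train[num_columns])
test[num_columns] = scaler.transform(test[num_columns])

print(train)
print(test)
```

Test values outside the training range fall outside 0 to 1 after scaling, which is expected when the scaler is fitted on the training set alone.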

In [None]:
class Scaler:
    def __init__(self, cfg):
        self.cfg = cfg

    def process(self, dataframe, dataframe_test, week_col):
        # Convert week_col to integer type
        dataframe[week_col] = dataframe[week_col].astype(int)
        dataframe_test[week_col] = dataframe_test[week_col].astype(int)
        
        # Extract numerical columns
        num_columns = dataframe.select_dtypes(include=(int, float)).columns.tolist()

        if self.cfg.min_max_scaler:
            # Apply MinMaxScaler
            print("MinMaxScaler process worked..")
            mm = MinMaxScaler(feature_range=(0, 1))
            # Fit on the training data only, then reuse the fitted scaler on the test data
            scaler = mm.fit(dataframe[num_columns])
            # The transformed array has one column per numerical column, so label it with num_columns
            dataframe[num_columns] = pd.DataFrame(scaler.transform(dataframe[num_columns]), 
                                     index=dataframe.index, columns=num_columns)
            dataframe_test[num_columns] = pd.DataFrame(scaler.transform(dataframe_test[num_columns]), 
                                          index=dataframe_test.index, columns=num_columns)
            
            return dataframe, dataframe_test, mm

        elif self.cfg.standard_scaler:
            # Apply StandardScaler
            print("StandardScaler process worked..")
            ss = StandardScaler()
            scaler = ss.fit(dataframe[num_columns])
            dataframe[num_columns] = pd.DataFrame(scaler.transform(dataframe[num_columns]), 
                                     index=dataframe.index, columns=num_columns)
            dataframe_test[num_columns] = pd.DataFrame(scaler.transform(dataframe_test[num_columns]), 
                                          index=dataframe_test.index, columns=num_columns)

            return dataframe, dataframe_test, ss

        elif self.cfg.robust_scaler:
            # Apply RobustScaler
            print("RobustScaler process worked..")
            rs = RobustScaler()
            scaler = rs.fit(dataframe[num_columns])
            dataframe[num_columns] = pd.DataFrame(scaler.transform(dataframe[num_columns]), 
                                     index=dataframe.index, columns=num_columns)
            dataframe_test[num_columns] = pd.DataFrame(scaler.transform(dataframe_test[num_columns]), 
                                          index=dataframe_test.index, columns=num_columns)

            return dataframe, dataframe_test, rs
        
        else:
            print('All scaler processes are disabled. Returning original dataframes..')
        
            # No scaler was fitted, so there is nothing to return in its place
            return dataframe, dataframe_test, None


# **Hyperparameter Tuning Method**

This class aids in optimizing hyperparameters for various regression models using the Optuna library.

> ### Attributes

- **cfg**: Configuration settings determining the type of hyperparameter tuning to be performed.

> ### Inputs

- **dataframe**: The dataset containing predictor variables.
- **target_col**: The target column (dependent variable) to be predicted.
- **n_trial**: Number of trials (iterations) for hyperparameter tuning.


> ### HistGradientBoostingRegressor

Executes hyperparameter optimization for the HistGradientBoostingRegressor model.

> ### LGBMRegressor

Performs hyperparameter optimization for the LGBMRegressor model.

> ### XGBRegressor

Executes hyperparameter optimization for the XGBRegressor model.

> ### CatBoostRegressor

Conducts hyperparameter tuning for the CatBoostRegressor model.

> ### Operations

- Utilizes Optuna's optimization functionalities to search for optimal hyperparameters.
- Fits regression models with different hyperparameter configurations.
- Evaluates models using Root Mean Squared Error (RMSE) metric.
- Identifies and returns the best hyperparameters found during the optimization process.

> ### Prints

Logs messages indicating the optimization process for each regression model.

> ### Returns

Best hyperparameters discovered through the optimization process for the specified regression model.

> ### Purpose

Enhances regression models by identifying optimal hyperparameters, improving predictive performance.

This class facilitates the optimization of hyperparameters for regression models, aiming to enhance their performance and accuracy in predicting target variables.

In [None]:
class Optuna:
    def __init__(self, cfg):
        self.cfg = cfg

    def process(self, dataframe, target_col, n_trial):
        X = dataframe.drop(columns=[target_col]) # Select independent variables on dataframe
        y = dataframe[target_col] # Select dependent variables on dataframe
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Split the data for training and test
        
        if self.cfg.histgbr_optuna:
            # Optuna for HistGradientBoostingRegressor
            print('Optuna is executing for HistGradientBoostingRegressor...')
            def hist_gradient_objective(trial):
                # Hyperparameters to optimize
                params = {
                    'loss': 'squared_error',
                    'scoring': 'loss',
                    'verbose': 0,
                    'max_iter': trial.suggest_int('max_iter', 100, 2000),
                    'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.25, log=True),
                    'max_leaf_nodes': trial.suggest_int('max_leaf_nodes', 2, 256),
                    'min_samples_leaf': trial.suggest_int('min_samples_leaf', 2, 200),
                    'max_depth': trial.suggest_int('max_depth', 2, 9),
                    'l2_regularization': trial.suggest_float('l2_regularization', 1e-5, 10),
                }
                
                # Create HistGradientBoostingRegressor with suggested parameters
                model = HistGradientBoostingRegressor(**params)
                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
                # Take the square root explicitly; the `squared` argument is not available in newer scikit-learn releases
                rmse = np.sqrt(mean_squared_error(y_test, y_pred))
                return rmse
            
            study_hist = optuna.create_study(direction='minimize')
            study_hist.optimize(hist_gradient_objective, n_trials=n_trial)
            best_params = study_hist.best_params

            return best_params
        

        elif self.cfg.lgb_optuna:
            # Optuna for LGBMRegressor
            print('Optuna is executing for LGBMRegressor...')
            def lgbm_objective(trial):
                # Hyperparameters to optimize
                params = {
                    'objective': 'regression',
                    'metric': 'rmse',
                    'verbosity': -1,
                    'boosting_type': trial.suggest_categorical('boosting_type', ['gbdt', 'dart']),
                    'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
                    'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.25, log=True),
                    'num_leaves': trial.suggest_int('num_leaves', 2, 256),
                    'max_depth': trial.suggest_int('max_depth', 3, 9),
                    'reg_alpha': trial.suggest_float('reg_alpha', 1e-5, 10),
                    'reg_lambda': trial.suggest_float('reg_lambda', 1e-5, 10),
                    'min_child_samples': trial.suggest_int('min_child_samples', 5, 100)
                }

                # Create LGBMRegressor with suggested parameters
                model = lgb.LGBMRegressor(**params)
                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
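                # Take the square root explicitly; the `squared` argument is not available in newer scikit-learn releases
                rmse = np.sqrt(mean_squared_error(y_test, y_pred))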
                return rmse
            
            study_lgbm = optuna.create_study(direction='minimize')
            study_lgbm.optimize(lgbm_objective, n_trials=n_trial)
            best_params = study_lgbm.best_params

            return best_params
        

        elif self.cfg.xgb_optuna:
            # Optuna for XGBRegressor
            print('Optuna is executing for XGBRegressor...')
            def xgb_objective(trial):
                # Hyperparameters to optimize
                params = {
                    'objective': 'reg:squarederror',
                    'verbosity': 0,
                    'eval_metric': 'rmse',
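                    # XGBoost names this parameter `booster`; tree-based boosters are used here
                    # so that max_depth, subsample and colsample_bytree actually apply
                    'booster': trial.suggest_categorical('booster', ['gbtree', 'dart']),
                    'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
                    'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.25, log=True),
                    'max_depth': trial.suggest_int('max_depth', 3, 9),
                    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
                    'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
                    'alpha': trial.suggest_float('alpha', 1e-3, 10.0, log=True),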
                    'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                    'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0)
                }

                # Create XGBRegressor with suggested parameters
                model = xgb.XGBRegressor(**params)
                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
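                # Take the square root explicitly; the `squared` argument is not available in newer scikit-learn releases
                rmse = np.sqrt(mean_squared_error(y_test, y_pred))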
                return rmse

            study_xgb = optuna.create_study(direction='minimize')
            study_xgb.optimize(xgb_objective, n_trials=n_trial)
            best_params = study_xgb.best_params

            return best_params
        

        elif self.cfg.catb_optuna:
            # Optuna for CatBoostRegressor
            print('Optuna is executing for CatBoostRegressor...')
            def catboost_objective(trial):
                # Hyperparameters to optimize
                params = {
                    'objective': 'RMSE',
                    'verbose': False,
                    'eval_metric': 'RMSE',
                    'grow_policy': 'Lossguide',
                    'iterations': trial.suggest_int('iterations', 100, 2000),
                    'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.25, log=True),
                    'depth': trial.suggest_int('depth', 3, 9),
                    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
                    'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-3, 10.0, log=True),
                    'random_strength': trial.suggest_float('random_strength', 1e-3, 10.0),
                    'max_leaves': trial.suggest_int('max_leaves', 10, 200),
                    'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 20)
                }
                
                # Create CatBoostRegressor with suggested parameters
                model = catb.CatBoostRegressor(**params)
                model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=0)
                y_pred = model.predict(X_test)
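                # Take the square root explicitly; the `squared` argument is not available in newer scikit-learn releases
                rmse = np.sqrt(mean_squared_error(y_test, y_pred))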
                return rmse

            study_catboost = optuna.create_study(direction='minimize')
            study_catboost.optimize(catboost_objective, n_trials=n_trial)
            best_params = study_catboost.best_params

            return best_params
        
        else:
            # When no Optuna process is enabled
            print('All Optuna processes are disabled. Process passed')
            return None

# **Machine Learning Methods**

This class is responsible for executing Machine Learning models for regression tasks.

> ### Attributes

- **cfg**: Configuration settings determining the models to be processed.

> ### Inputs

- **dataframe**: The dataset containing predictor variables and the target column.
- **target_col**: The target column (dependent variable) to be predicted.
- **params**: Hyperparameters for the selected models.


> ### HistGradientBoostingRegressor

Validates the HistGradientBoostingRegressor model.

> ### LGBMRegressor

Validates the LGBMRegressor model.

> ### XGBRegressor

Validates the XGBRegressor model.

> ### CatBoostRegressor

Executes the CatBoostRegressor model.

> ### Operations

- Fits regression models with specified hyperparameters on training data.
- Evaluates model performance using Root Mean Squared Error (RMSE), R-squared (R2), and Mean Absolute Error (MAE) metrics.
- Logs messages indicating the execution of each model and displays evaluation metrics.
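
For reference, the three metrics can be computed by hand on a tiny made-up example. The target and prediction values below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))

# R2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}   R2: {r2:.4f}   MAE: {mae:.4f}")
```

RMSE penalizes the single large error (2.0) more heavily than MAE does, which is why the two values differ even on three points.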

> ### Returns

The original dataframe.

> ### Purpose

Executes and evaluates different regression models to predict the target variable, providing insights into model performance.

This class aids in the execution and evaluation of various regression models, providing insights into their predictive performance concerning the target variable.

In [None]:
class MLModels:
    def __init__(self, cfg):
        self.cfg = cfg

    def process(self, dataframe, target_col, params):
        # Splitting features and target
        X = dataframe.drop(columns=target_col)
        y = dataframe[target_col] 

        # Train-test split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=42)
        
        if self.cfg.histgradient:
            # HistGradientBoostingRegressor model with best parameters from Optuna
            model = HistGradientBoostingRegressor(**params)
            print(f"Validation is Executing with {type(model).__name__}\n")

            # Training the model
            model.fit(X_train, y_train)

            # Predictions and evaluation
            y_pred_test = model.predict(X_test)
            rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
            r2_test = r2_score(y_test, y_pred_test)
            mae_test = mean_absolute_error(y_test, y_pred_test)

            print(f"\n RMSE: {(rmse_test):.4f}   R2: {(r2_test):.4f}   MAE: {(mae_test):.4f}")

        elif self.cfg.lightgbm:
            # LGBMRegressor model with best parameters from Optuna
            model = lgb.LGBMRegressor(**params, verbosity=-1)
            print(f"Validation is Executing with {type(model).__name__}\n")

            # Training the model
            model.fit(X_train, y_train)

            # Predictions and evaluation
            y_pred_test = model.predict(X_test)
            rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
            r2_test = r2_score(y_test, y_pred_test)
            mae_test = mean_absolute_error(y_test, y_pred_test)

            print(f"\n RMSE: {(rmse_test):.4f}   R2: {(r2_test):.4f}   MAE: {(mae_test):.4f}")

        elif self.cfg.xgboost:
            # XGBRegressor model with best parameters from Optuna
            model = xgb.XGBRegressor(**params)
            print(f"Validation is Executing with {type(model).__name__}")
            
            # Training the model
            model.fit(X_train, y_train)
            
            # Predictions and evaluation
            y_pred_test = model.predict(X_test)
            rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
            r2_test = r2_score(y_test, y_pred_test)
            mae_test = mean_absolute_error(y_test, y_pred_test)

            print(f"\n RMSE: {(rmse_test):.4f}   R2: {(r2_test):.4f}   MAE: {(mae_test):.4f}")

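        elif self.cfg.catboost:
            # CatBoostRegressor model with best parameters from Optuna
            # (copy the dict so the caller's params are not modified)
            params = {**params, 'verbose': False, 'grow_policy': 'Lossguide'}
            model = catb.CatBoostRegressor(**params)
            print(f"Validation is Executing with {type(model).__name__}")

            # Training the model
            model.fit(X_train, y_train)

            # Predictions and evaluation
            y_pred_test = model.predict(X_test)
            rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
            r2_test = r2_score(y_test, y_pred_test)
            mae_test = mean_absolute_error(y_test, y_pred_test)

            print(f"\n RMSE: {(rmse_test):.4f}   R2: {(r2_test):.4f}   MAE: {(mae_test):.4f}")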

        else:
            print('All ML models are disabled. Process passed...')
        
        return dataframe

# **Deep Learning Methods**

These are functions for building LSTM, GRU, and combined LSTM-GRU models using TensorFlow's Keras API for sequence prediction tasks. They utilize hyperparameters for tuning the architecture and optimization of the models.

## LSTM Model

> ### Architecture

- Three LSTM layers whose sizes are derived from one tuned `units` value (`units`, `units // 2`, `units // 4`).
- Dropout layers to mitigate overfitting.
- Dense output layer.

> ### Hyperparameters

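- **units**: Number of LSTM units in the first layer; later layers use a half and a quarter of it.
- **l2_rate**: L2 kernel regularization strength applied to each LSTM layer.
- **dropout**: Dropout rate for regularization.
- **learning_rate**: Learning rate for the Adam optimizer.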

> ### Compilation

- **Optimizer**: Adam
- **Loss**: Mean Squared Error

## GRU Model

> ### Architecture

- Three GRU layers with varying units (hyperparameter).
- Dropout layers to prevent overfitting.
- Dense output layer.

> ### Hyperparameters

- **units**: Number of GRU units in each layer.
- **dropout**: Dropout rate for regularization.
- **learning_rate**: Learning rate for the Adam optimizer.

> ### Compilation

- **Optimizer**: Adam
- **Loss**: Mean Squared Error

## LSTM-GRU Model

> ### Architecture

- Alternating LSTM and GRU layers with different units (hyperparameter).
- Dropout layers after each LSTM/GRU layer.
- Dense output layer.

> ### Hyperparameters

- **units**: Number of units in each LSTM and GRU layer.
- **dropout**: Dropout rate for regularization.
- **learning_rate**: Learning rate for the Adam optimizer.

> ### Compilation

- **Optimizer**: Adam
- **Loss**: Mean Squared Error

These functions allow you to build LSTM, GRU, or hybrid LSTM-GRU models for sequence prediction tasks with flexibility in architecture via hyperparameter tuning.
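
In `build_lstm_model`, every call to `hp.Int('units', ...)` uses the same name, so Keras Tuner returns the same value each time and the three layer widths are tied together. The sketch below reproduces only that arithmetic in plain Python. The helper `layer_units` is hypothetical and no TensorFlow or Keras Tuner code is executed.

```python
def layer_units(units):
    """Return the widths of the three stacked recurrent layers."""
    return [units, units // 2, units // 4]

# Candidate values matching hp.Int('units', min_value=64, max_value=256, step=64)
search_space = {units: layer_units(units) for units in range(64, 257, 64)}

for units, widths in search_space.items():
    print(units, widths)
```

Because the widths are coupled, the tuner explores four architectures along this axis rather than every combination of three independent layer sizes.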

In [None]:
# Function to build an LSTM model
def build_lstm_model(hp, X):
    """
    Builds an LSTM model with hyperparameters.
    
    Args:
    - hp: Hyperparameters to be tuned by the optimization process.
    - X: Input array of shape (samples, timesteps, features); only its shape is used.
    
    Returns:
    - LSTM model architecture compiled with specified optimizer and loss.
    """

    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.LSTM(units=hp.Int('units', min_value=64, max_value=256, step=64), 
                                   return_sequences=True, 
                                   input_shape=(X.shape[1], X.shape[2]),
                                   kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    # Add a dropout layer
    model.add(tf.keras.layers.Dropout(hp.Float('dropout', min_value=0.1, max_value=0.3, step=0.1)))
    # Add another LSTM layer
    model.add(tf.keras.layers.LSTM(units=int(hp.Int('units', min_value=64, max_value=256, step=64)) // 2, 
                                   return_sequences=True, 
                                   activation='relu',
                                   kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    model.add(tf.keras.layers.Dropout(hp.Float('dropout', min_value=0.1, max_value=0.3, step=0.1)))
    # Final LSTM layer with a quarter of the tuned units; returns only the last time step
    model.add(tf.keras.layers.LSTM(units=int(hp.Int('units', min_value=64, max_value=256, step=64)) // 4, 
                                   activation='relu',
                                   kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    # Output layer
    model.add(tf.keras.layers.Dense(units=1)) 

    # Compile the model
    optimizer = tf.keras.optimizers.Adam(learning_rate=hp.Choice('learning_rate', values=[1e-4, 1e-3, 1e-2]))
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model


# Function to build a GRU model
def build_gru_model(hp, X):
    """
    Builds a GRU model with hyperparameters.
    
    Args:
    - hp: Hyperparameters to be tuned by the optimization process.
    - X: Input array of shape (samples, timesteps, features); only its shape is used.
    
    Returns:
    - GRU model architecture compiled with specified optimizer and loss.
    """

    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.GRU(units=hp.Int('units', min_value=64, max_value=256, step=64), 
                                  return_sequences=True, 
                                  input_shape=(X.shape[1], X.shape[2]),
                                  kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    # Add a dropout layer
    model.add(tf.keras.layers.Dropout(hp.Float('dropout', min_value=0.1, max_value=0.3, step=0.1)))
    # Add another GRU layer
    model.add(tf.keras.layers.GRU(units=int(hp.Int('units', min_value=64, max_value=256, step=64)) // 2, 
                                  return_sequences=True, 
                                  activation='relu',
                                  kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    # Add a dropout layer
    model.add(tf.keras.layers.Dropout(hp.Float('dropout', min_value=0.1, max_value=0.3, step=0.1)))
    # Add another GRU layer
    model.add(tf.keras.layers.GRU(units=int(hp.Int('units', min_value=64, max_value=256, step=64)) // 4, 
                                  activation='relu',
                                  kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    # Output layer
    model.add(tf.keras.layers.Dense(units=1)) 

    # Compile the model
    optimizer = tf.keras.optimizers.Adam(learning_rate=hp.Choice('learning_rate', values=[1e-4, 1e-3, 1e-2]))
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model


# Function to build a combined LSTM-GRU model
def build_lstm_gru_model(hp, X):
    """
    Builds a combined LSTM-GRU model with hyperparameters.
    
    Args:
    - hp: Hyperparameters to be tuned by the optimization process.
    - X: Input array of shape (samples, timesteps, features); only its shape is used.
    
    Returns:
    - Combined LSTM-GRU model architecture compiled with specified optimizer and loss.
    """

    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.LSTM(units=hp.Int('units', min_value=64, max_value=256, step=64),
                                   return_sequences=True,
                                   input_shape=(X.shape[1], X.shape[2]),
                                   kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    model.add(tf.keras.layers.Dropout(hp.Float('dropout', min_value=0.1, max_value=0.3, step=0.1)))
    model.add(tf.keras.layers.GRU(units=hp.Int('units', min_value=64, max_value=256, step=64) // 2,
                                   return_sequences=True,
                                   activation='relu',
                                   kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    model.add(tf.keras.layers.Dropout(hp.Float('dropout', min_value=0.1, max_value=0.3, step=0.1)))
    model.add(tf.keras.layers.LSTM(units=hp.Int('units', min_value=64, max_value=256, step=64) // 4,
                                   return_sequences=True,
                                   activation='relu',
                                   kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    model.add(tf.keras.layers.Dropout(hp.Float('dropout', min_value=0.1, max_value=0.3, step=0.1)))
    model.add(tf.keras.layers.GRU(units=hp.Int('units', min_value=64, max_value=256, step=64) // 8,
                                   activation='relu',
                                   kernel_regularizer=tf.keras.regularizers.l2(hp.Choice('l2_rate', values=[0.001, 0.01, 0.1]))))
    model.add(tf.keras.layers.Dense(units=1))

    optimizer = tf.keras.optimizers.Adam(learning_rate=hp.Choice('learning_rate', values=[1e-4, 1e-3, 1e-2]))
    model.compile(optimizer=optimizer, loss='mean_squared_error')

    return model


# **KerasTuner Method**

The KerasTuner class performs hyperparameter tuning for the neural network models used in the sequence prediction task.


> ### Initialization

Takes a configuration parameter (cfg) to define which tuner process is enabled.

> ### Methods

- **process**: Conducts hyperparameter tuning for the LSTM, GRU, or LSTM-GRU hybrid model, depending on which tuner is enabled in the configuration. It takes two arguments:
  - **dataframe**: The input data.
  - **target_col**: The target column for prediction.

> ### Functionality

- Uses Keras Tuner's RandomSearch to explore the hyperparameter space, with up to 50 trials and one execution per trial.
- Searches for the best hyperparameters for one model type only: the first enabled flag wins, checked in the order lstm_tuner, gru_tuner, lstm_gru_tuner.
- Scores each trial by validation loss on the last 20% of rows (held out without shuffling), training for up to 20 epochs and stopping early as soon as the validation loss fails to improve for one epoch.
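
As a rough sense of scale, the search space defined by the builder functions can be enumerated directly. This sketch assumes the `dropout` range (0.1 to 0.3 in steps of 0.1) yields three discrete values and treats the space as a plain grid. It does not call Keras Tuner, and the dictionary below is only an illustration of the ranges used in the builders.

```python
from itertools import product

# Hyperparameter values taken from the hp.Int / hp.Float / hp.Choice calls in the builders
search_space = {
    "units": [64, 128, 192, 256],
    "dropout": [0.1, 0.2, 0.3],
    "l2_rate": [0.001, 0.01, 0.1],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}

# Every distinct configuration a trial could draw
combinations = list(product(*search_space.values()))

max_trials = 50  # value passed to RandomSearch in this notebook
coverage = max_trials / len(combinations)

print(len(combinations))   # number of distinct configurations
print(round(coverage, 2))  # upper bound on the fraction a 50-trial search can visit
```

Under these assumptions, 50 trials can visit at most about half of the configurations, so the reported "best" hyperparameters are the best among those sampled rather than a guaranteed optimum.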

> ### Returned Values

Returns the best hyperparameters found for the selected model type, or None if no tuner is enabled.

> ### Output

Prints the best hyperparameters discovered for the enabled model type once the search completes.

This class provides an automated way to search for the best hyperparameters for LSTM, GRU, or LSTM-GRU models using Keras Tuner, facilitating the optimization of these models for sequence prediction tasks.
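
The `process` method holds out validation data by calling `train_test_split` with `test_size=0.2` and `shuffle=False`, which keeps the rows in order and reserves the final block for validation. The sketch below reproduces that behavior in plain Python, assuming the held-out size is rounded up. `chronological_split` is a hypothetical name, not a scikit-learn function.

```python
import math


def chronological_split(rows, test_size=0.2):
    """Split an ordered sequence into a leading training part and a trailing hold-out part."""
    n_test = math.ceil(test_size * len(rows))  # assumed rounding: hold-out size rounded up
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


rows = list(range(10))  # stand-in for 10 time-ordered samples
train, holdout = chronological_split(rows)
print(train)    # the earliest 80% of rows
print(holdout)  # the most recent 20% of rows
```

Keeping the order intact means every validation row comes after every training row, which avoids leaking future observations into training for time-ordered data.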

In [None]:
class KerasTuner:
    def __init__(self, cfg):
        """
        Initialize KerasTuner class with configuration parameters.

        Args:
        - cfg: Configuration parameters for KerasTuner.
        """
        self.cfg = cfg

    def process(self, dataframe, target_col):
        """
        Perform hyperparameter tuning using KerasTuner.

        Args:
        - dataframe: Input dataframe for training.
        - target_col: Column containing the target variable.

        Returns:
        - best_hps: Best hyperparameters found by KerasTuner.
        """

        # Split independent variables and target
        X = dataframe.drop(columns=target_col).values
        y = dataframe[target_col].values
        X = X.reshape((X.shape[0], 1, X.shape[1]))

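        # Chronological train/validation split: no shuffling, so the last 20% of rows is held out
        # (random_state is omitted because it has no effect when shuffle=False)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)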
        
        if self.cfg.lstm_tuner:
            # KerasTuner for LSTM model
            tuner = RandomSearch(lambda hp: build_lstm_model(hp, X_train), 
                                 objective='val_loss', 
                                 max_trials=50,
                                 executions_per_trial=1)

            # Tuning on dataset for the best hyperparameters
            tuner.search(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), 
                         callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)])
            best_hps = tuner.get_best_hyperparameters(num_trials=1)[0] # Pick best parameters 
            print("Best LSTM parameters:", best_hps.values)

            return best_hps

        elif self.cfg.gru_tuner:
            # KerasTuner for GRU model
            tuner = RandomSearch(lambda hp: build_gru_model(hp, X_train), 
                                 objective='val_loss', 
                                 max_trials=50,
                                 executions_per_trial=1)
            
            # Tuning on dataset for the best hyperparameters
            tuner.search(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), 
                         callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)])
            best_hps = tuner.get_best_hyperparameters(num_trials=1)[0] # Pick best parameters
            print("Best GRU parameters:", best_hps.values)

            return best_hps

        elif self.cfg.lstm_gru_tuner:
            # KerasTuner for combined LSTM-GRU model
            tuner = RandomSearch(lambda hp: build_lstm_gru_model(hp, X_train), 
                                 objective='val_loss', 
                                 max_trials=50,
                                 executions_per_trial=1)
            
            # Tuning on dataset for the best hyperparameters
            tuner.search(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), 
                         callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)])
            best_hps = tuner.get_best_hyperparameters(num_trials=1)[0] # Pick best parameters
            print("Best LSTM-GRU parameters:", best_hps.values)

            return best_hps
        
        else:
            print("KerasTuner process is disabled. Process passed..")
            return None

# **Test Prediction Deep Learning Method**

This DLModels class is designed to build and train different deep learning models for sequence prediction tasks using LSTM, GRU, or LSTM-GRU architectures based on the configuration settings.

> ### Initialization

Takes a configuration parameter (cfg) to determine which DL model is enabled for processing.

> ### Methods

- **process**: Builds and trains the LSTM, GRU, or LSTM-GRU model, depending on which flag is enabled in the configuration (lstm_model, gru_model, lstm_gru_model). It takes the following arguments:
  - **dataframe**: The training data.
  - **dataframe_test**: The test data.
  - **target_col**: The target column for prediction.
  - **unit**: The number of units/neurons in the first recurrent layer; later layers use a half, a quarter, and (in the hybrid model) an eighth of it.
  - **dropout_rate**: The dropout rate for regularization.
  - **l2_rate**: The L2 regularization rate.
  - **learning_rate**: The learning rate for optimization.

> ### Functionality

- Builds and trains LSTM, GRU, or LSTM-GRU models based on the enabled model configuration using TensorFlow/Keras.
- Each model architecture stacks several recurrent layers with L2 kernel regularization, with dropout layers between them.
- Uses the Adam optimizer for the LSTM and GRU models and Nadam for the LSTM-GRU model, all with mean squared error loss.
- Trains each model for up to 20 epochs with a batch size of 32, stopping early if the training loss fails to improve for 2 consecutive epochs.
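
The effect of the early-stopping patience can be seen with a small simulation. This is a simplified sketch of patience-based stopping on a made-up loss curve. It ignores options such as a minimum improvement threshold, and `epochs_run` is a hypothetical helper rather than a Keras function.

```python
def epochs_run(losses, patience):
    """Return how many epochs are trained before patience-based early stopping triggers."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # training stops at the end of this epoch
    return len(losses)  # patience never exhausted


# Illustrative loss values only; not results from this project
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.6]
print(epochs_run(losses, patience=2))
print(epochs_run(losses, patience=1))
```

On this curve, both settings stop before the final epoch, where the loss would have reached a new minimum. A small patience therefore saves time but can end training before a later improvement.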

> ### Returned Values

Returns the predicted values for the test data from the trained model. If no model is enabled, the original training dataframe is returned unchanged.

> ### Output

- Prints the execution status of the selected DL model (LSTM, GRU, or LSTM-GRU).

This class facilitates the creation and training of deep learning models for sequence prediction tasks, allowing you to select different architectures and hyperparameters based on your dataset and problem requirements.
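
Before training, `process` reshapes the feature matrix from `(samples, features)` to `(samples, 1, features)`, so each row becomes a sequence with a single timestep. The dependency-free sketch below shows the same transformation on nested lists. `add_time_axis` and `shape` are hypothetical helpers, and the notebook itself does this with NumPy's `reshape`.

```python
def add_time_axis(rows):
    """Wrap each feature row in a length-1 sequence: (samples, features) -> (samples, 1, features)."""
    return [[list(row)] for row in rows]


def shape(nested):
    """Return the dimensions of a regular nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 samples, 3 features
X_seq = add_time_axis(X)
print(shape(X), shape(X_seq))
```

Because the time axis has length 1, the recurrent layers see each sample in isolation. Any temporal context has to come from the features themselves (for example lagged columns), not from the sequence dimension.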

In [None]:
class DLModels:
    def __init__(self, cfg):
        """
        Initializes the DLModels class with a configuration object.

        Args:
        - cfg: Configuration object containing model flags (e.g., lstm_model, gru_model, lstm_gru_model).
        """
        self.cfg = cfg

    def process(self, dataframe, dataframe_test, target_col, unit, dropout_rate, l2_rate, learning_rate):
        """
        Processes the data and executes the selected deep learning model based on configuration flags.

        Args:
        - dataframe: Training data in pandas DataFrame format.
        - dataframe_test: Test data in pandas DataFrame format.
        - target_col: Name of the target column.
        - unit: Number of units/neurons in the layers.
        - dropout_rate: Dropout rate for regularization.
        - l2_rate: L2 regularization rate.
        - learning_rate: Learning rate for optimization.

        Returns:
        - Predictions if a model is executed; otherwise, returns the original dataframe.
        """

        # Extracting features and target variable from the training data
        X = dataframe.drop(columns=target_col).values
        y = dataframe[target_col].values

        # Reshaping input data to be compatible with LSTM/GRU input shape
        X = X.reshape((X.shape[0], 1, X.shape[1]))

        # Extracting features from the test data and reshaping it
        test = dataframe_test.drop(columns=target_col).values
        test = test.reshape((test.shape[0], 1, test.shape[1]))

        # Model selection based on configuration flags
        if self.cfg.lstm_model:
            # LSTM Model
            print("LSTM Model is executing..\n")
            
            # Initializing an LSTM model
            model = tf.keras.models.Sequential()
            model.add(tf.keras.layers.LSTM(units=unit, return_sequences=True, input_shape=(X.shape[1], X.shape[2]),
                                           kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            model.add(tf.keras.layers.Dropout(dropout_rate))  
            
            model.add(tf.keras.layers.LSTM(units=unit//2, return_sequences=True, activation='relu',
                                           kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            model.add(tf.keras.layers.Dropout(dropout_rate))

            model.add(tf.keras.layers.LSTM(units=unit//4, activation='relu', 
                                           kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            
            model.add(tf.keras.layers.Dense(units=1)) 

            # Compiling the model
            model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), loss='mean_squared_error')

            # Training the model
            model.fit(X, y, epochs=20, batch_size=32, 
                      callbacks=[tf.keras.callbacks.EarlyStopping(monitor='loss',  patience=2)])

            # Making predictions
            y_pred = model.predict(test)

            return y_pred

        elif self.cfg.gru_model:
            # GRU Model
            print("GRU Model is executing..\n")

            # Initializing a GRU model
            model = tf.keras.models.Sequential()
            model.add(tf.keras.layers.GRU(units=unit, return_sequences=True, input_shape=(X.shape[1], X.shape[2]), 
                                          kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            model.add(tf.keras.layers.Dropout(dropout_rate)) 

            model.add(tf.keras.layers.GRU(units=unit//2, return_sequences=True, activation='relu', 
                                          kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            model.add(tf.keras.layers.Dropout(dropout_rate))

            # Last GRU layer returns only the final timestep so the Dense layer outputs one value per sample
            model.add(tf.keras.layers.GRU(units=unit//4, activation='relu', 
                                          kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            
            model.add(tf.keras.layers.Dense(units=1)) 

            # Compiling the model
            model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), loss='mean_squared_error')

            # Training the model
            model.fit(X, y, epochs=20, batch_size=32, 
                      callbacks=[tf.keras.callbacks.EarlyStopping(monitor='loss',  patience=2)])
        
            # Making predictions
            y_pred = model.predict(test)

            return y_pred

        elif self.cfg.lstm_gru_model:
            # LSTM-GRU Model
            print("LSTM-GRU Model is executing..\n")

            # Initializing an LSTM-GRU model
            model = tf.keras.models.Sequential()
            model.add(tf.keras.layers.LSTM(units=unit, return_sequences=True, input_shape=(X.shape[1], X.shape[2]), 
                                           kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            model.add(tf.keras.layers.Dropout(dropout_rate))

            model.add(tf.keras.layers.GRU(units=unit//2, return_sequences=True, activation='relu', 
                                          kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            model.add(tf.keras.layers.Dropout(dropout_rate))

            model.add(tf.keras.layers.LSTM(units=unit//4, return_sequences=True, activation='relu',
                                           kernel_regularizer=tf.keras.regularizers.l2(l2_rate)))
            model.add(tf.keras.layers.Dropout(dropout_rate))

            model.add(tf.keras.layers.GRU(units=unit//8, activation='relu', 
                                            kernel_regularizer=tf.keras.regularizers.l2(l2_rate))) 
            
            model.add(tf.keras.layers.Dense(units=1)) 

            # Compiling the model
            model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=learning_rate), loss='mean_squared_error')

            # Training the model
            model.fit(X, y, epochs=20, batch_size=32, 
                      callbacks=[tf.keras.callbacks.EarlyStopping(monitor='loss',  patience=2)])

            # Making predictions
            y_pred = model.predict(test)

            return y_pred

        else:
            # No model selected
            print("All Models are disabled. Returning original dataframe..")
            return dataframe
