<a href="https://colab.research.google.com/github/carlos-alves-one/-Energy-Comp/blob/main/project_energy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Understanding the Problem and the Data

**Understand the specific problem of energy imbalance caused by prosumers and how the model can help Enefit**

Specific Problem: Energy Imbalance Resulting from Prosumers

The primary concern is energy imbalance: the gap between the energy expected to be consumed or produced and the amount actually consumed or produced. Prosumers, who both consume and produce energy, aggravate the problem because their consumption and generation can be erratic. This creates logistical and financial difficulties for energy companies like Enefit, including the difficulty of matching supply with demand and the costs incurred by the resulting imbalance.

The Role of the Model in Facilitating Enefit
The model aims to address these difficulties by offering precise forecasts of prosumers' energy usage and production. By doing so, the model will:

1. Increase Forecasting Precision: Enhance Enefit's capacity to anticipate energy demands and production levels accurately.
   
2. Minimise Imbalance Expenses: Enefit can optimise its energy allocation by utilising more accurate forecasts, hence decreasing the expenses linked to energy imbalance.

3. Enhance Resource Allocation Efficiency: Precise predictions will allow Enefit to distribute resources more optimally, thereby minimising waste and decreasing operational expenses.

4. Enhance Strategic Decision-Making: Enefit can improve its ability to make strategic decisions on infrastructure investments and policy changes by gaining deeper insights into consumer behaviour.

5. Encourage Sustainable Habits: Enefit may encourage prosumers to use renewable energy sources through efficient energy management, facilitating the shift towards more environmentally friendly energy habits.

The model should incorporate the variables that influence prosumer behaviour, such as weather patterns, past energy usage trends, and price fluctuations. Its performance will be assessed by Mean Absolute Error (MAE), the average absolute difference between predicted and actual values, so predictions must match the actual values as closely as possible.
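As a minimal sketch of the evaluation metric, MAE can be computed as the mean of the absolute differences between actual and predicted values. The helper name `mean_absolute_error` and the numbers below are hypothetical and chosen only for illustration; they are not taken from the competition data.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Return the mean of |y_true - y_pred| (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical values, for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 6.0]

# Absolute errors are 0.5, 0.0, 1.0 and 2.0, so the mean is 3.5 / 4
mae = mean_absolute_error(actual, predicted)
print(mae)  # 0.875
```

A lower MAE means the predictions are, on average, closer to the actual values.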

The competition provides a dataset of historical meteorological data, energy prices, and prosumer attributes. The provided Python time-series API ensures that the model complies with the competition's rules, including the prohibition on looking ahead in time: predictions may use only the data available at prediction time.
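The same no-look-ahead idea can be mirrored in local validation by splitting the data by time rather than at random. The sketch below is an assumption about how such a split might look; it is not part of the competition API. The function name `time_ordered_split`, the `valid_fraction` parameter, and the toy values are hypothetical, while the `datetime` and `prediction_unit_id` column names follow the training data.

```python
import pandas as pd

def time_ordered_split(df, time_col="datetime", valid_fraction=0.2):
    """Split so that every validation row is later than every training row."""
    df = df.sort_values(time_col).reset_index(drop=True)
    # Index of the first row that belongs to the validation period
    cutoff_idx = len(df) - int(len(df) * valid_fraction)
    cutoff_time = df.loc[cutoff_idx, time_col]
    # Rows sharing the cutoff timestamp all go to the validation set
    train = df[df[time_col] < cutoff_time]
    valid = df[df[time_col] >= cutoff_time]
    return train, valid

# Toy frame: 5 hourly timestamps, 2 prediction units per hour (hypothetical values)
hours = pd.date_range("2021-09-01 00:00:00", periods=5, freq="h")
toy = pd.DataFrame({
    "datetime": hours.repeat(2),
    "prediction_unit_id": [0, 1] * 5,
    "target": [float(i) for i in range(10)],
})

train, valid = time_ordered_split(toy)
print(len(train), len(valid))
```

Splitting on a timestamp rather than a row index keeps all rows from the same hour on the same side of the boundary.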

#Study the Data

#Data Collection

##1. Load the Data
   - Connect to Google Drive to access the dataset
   - Load the data from the provided CSV file.

In [1]:

# Imports the 'drive' module from 'google.colab' and mounts the Google Drive to
# the '/content/drive' directory in the Colab environment.
from google.colab import drive

# This function mounts Google Drive
def mount_google_drive():
    drive.mount('/content/drive')

# Call the function to mount Google Drive
mount_google_drive()

Mounted at /content/drive


In [2]:
# Unit tests
import unittest
from unittest.mock import patch

class TestDriveMount(unittest.TestCase):
    @patch('google.colab.drive.mount')
    def test_drive_mount(self, mock_drive_mount):
        # Mock the drive.mount function
        mock_drive_mount.return_value = None

        # Call the function we want to test
        mount_google_drive()

        # Assert that drive.mount was called with the correct arguments
        mock_drive_mount.assert_called_once_with('/content/drive')

# Running the tests
unittest.main(argv=[''], exit=False)


.
----------------------------------------------------------------------
Ran 1 test in 0.009s

OK


<unittest.main.TestProgram at 0x7c5a786b6e00>

In [3]:
# Import the pandas library to read the data
import pandas as pd

# Importing the os module
import os

# Function to load the dataset
def load_dataset(path):
    data = pd.read_csv(path)
    return data

class TestDatasetLoading(unittest.TestCase):

    def setUp(self):
        # Define the dataset path
        self.dataset_path = '/content/drive/MyDrive/project_energy/train.csv'

    def test_file_exists(self):
        # Test whether the file exists
        self.assertTrue(os.path.isfile(self.dataset_path), "Dataset file does not exist at the specified path.")

    def test_load_dataset(self):
        # Test loading the dataset
        data = load_dataset(self.dataset_path)
        self.assertIsInstance(data, pd.DataFrame, "Loaded data is not a pandas DataFrame.")

    def test_dataset_not_empty(self):
        # Test that the dataset is not empty
        data = load_dataset(self.dataset_path)
        self.assertFalse(data.empty, "The loaded dataset is empty.")

# Running the tests
if __name__ == '__main__':
    unittest.main(argv=[''], exit=False)

# Load the dataset and display the first row, transposed for readability
data = load_dataset('/content/drive/MyDrive/project_energy/train.csv')
print(data.head(1).T)


....
----------------------------------------------------------------------
Ran 4 tests in 0.327s

OK


                                      0
county                                0
is_business                           0
product_type                          1
target                            0.713
is_consumption                        0
datetime            2021-09-01 00:00:00
data_block_id                         0
row_id                                0
prediction_unit_id                    0


In [4]:
# Import the pandas library to read the data
import pandas as pd

# Reuse the DataFrame loaded in the previous cell as the training data
train_data = data

# Get the number of rows and columns
num_rows, num_columns = train_data.shape

# Format and print the numbers with dots as thousand separators
print("Number of Rows.....:", "{:,}".format(num_rows).replace(",", "."))
print("Number of Columns..:", "{:,}".format(num_columns).replace(",", "."))


Number of Rows.....: 2.018.352
Number of Columns..: 9
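Before imputing anything, it can help to see which columns actually contain missing values. The following is a small sketch on a toy frame; the helper name `missing_summary` and the values are hypothetical, and only the column names come from the training data.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and share of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    return summary[summary["missing"] > 0]

# Toy frame with hypothetical values; two of the four targets are missing
toy = pd.DataFrame({
    "county": [0, 0, 1, 1],
    "target": [0.713, np.nan, 12.5, np.nan],
    "is_consumption": [0, 1, 0, 1],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the loaded DataFrame would show whether any imputation is needed at all, and for which columns.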


#Preparing the Data

In [9]:
# Import Dask's dataframe module for parallel data processing.
import dask.dataframe as dd

# Load data into a Dask DataFrame
# Replace 'data' with the data source
train_data = dd.from_pandas(data, npartitions=10)  # Adjust npartitions based on your system's capabilities

# Handling missing values
# For numerical columns, fill missing values with the mean
numerical_columns = train_data.select_dtypes(include=['number']).columns
for col in numerical_columns:
    train_data[col] = train_data[col].fillna(train_data[col].mean())

# For categorical data, fill missing values with the mode
categorical_columns = train_data.select_dtypes(include=['object', 'category']).columns
for col in categorical_columns:
    mode = train_data[col].mode().compute()[0]
    train_data[col] = train_data[col].fillna(mode)

# Handle outliers
# Assuming 'target' column could have outliers, use IQR
Q1 = train_data['target'].quantile(0.25).compute()
Q3 = train_data['target'].quantile(0.75).compute()
IQR = Q3 - Q1

# Define the outlier condition
outlier_condition = ((train_data['target'] < (Q1 - 1.5 * IQR)) | (train_data['target'] > (Q3 + 1.5 * IQR)))

# Filter out the outliers
train_data = train_data[~outlier_condition]

# Convert back to pandas DataFrame if needed (optional)
# train_data = train_data.compute()  # Uncomment this line if you need to work with a pandas DataFrame
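The IQR rule used above can be checked on a small pandas Series to see which bounds it produces and how many rows it would drop. This is only a sketch: the helper name `iqr_bounds`, the multiplier parameter `k`, and the toy values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the IQR outlier rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy target values (hypothetical); 100.0 is far from the rest
toy_target = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

lower, upper = iqr_bounds(toy_target)
is_outlier = (toy_target < lower) | (toy_target > upper)
removed = int(is_outlier.sum())

print(lower, upper, removed)  # -3.0 13.0 1
```

Counting the flagged rows before filtering shows how much data the rule removes, which is worth checking when large values of `target` may be genuine peaks rather than errors.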
