# Mount Google Drive

In [1]:
# Mount Google Drive at '/content/drive' so the project CSV files are
# readable from the Colab runtime.
from google.colab import drive

def mount_google_drive():
    """Mount Google Drive at '/content/drive'."""
    drive.mount('/content/drive')

mount_google_drive()


Mounted at /content/drive


# Importing Necessary Libraries and Packages

In [2]:
import pandas as pd              # Import pandas for data manipulation and analysis


# Input Data Files

In [3]:
test        = pd.read_csv('/content/drive/MyDrive/project_energy/test.csv')
targets     = pd.read_csv('/content/drive/MyDrive/project_energy/revealed_targets.csv')
gas         = pd.read_csv('/content/drive/MyDrive/project_energy/gas_prices.csv')
electricity = pd.read_csv('/content/drive/MyDrive/project_energy/electricity_prices.csv')
client      = pd.read_csv('/content/drive/MyDrive/project_energy/client.csv')
forecast    = pd.read_csv('/content/drive/MyDrive/project_energy/forecast_weather.csv')
historical  = pd.read_csv('/content/drive/MyDrive/project_energy/historical_weather.csv')


# Merge Test and Targets Datasets

The merge should start from the dataset that acts as a central hub, that is, the one whose key identifiers appear in the most other datasets. Based on which key columns are present in each dataset, the **`test`** dataset is a suitable starting point, for three reasons:

1. **Universal Identifier**: The `test` dataset contains the `data_block_id` column, which is present in all other datasets. This makes `data_block_id` a natural join key for linking the datasets. It is not a primary key, however: many rows share the same `data_block_id`, so it has to be combined with other columns to identify a single row.

2. **Additional Common Identifiers**: It also includes `county`, `product_type`, and `is_business`, which are present in the `revealed_targets` and `client` datasets. These columns narrow each match and carry additional information into the merged result.

3. **Role in Analysis**: The `test` dataset likely represents the primary data structure into which other data (like weather, prices, and client information) will be integrated. This makes it a logical starting point for building a comprehensive dataset.

Starting with the `test` dataset, we can incrementally merge other datasets like `revealed_targets`, `client`, and then bring in weather and price data, ensuring alignment of time-related variables (like `datetime`, `forecast_date`, `origin_date`) and geographical data (like `latitude` and `longitude`) where applicable.

In [5]:
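# Merge test and revealed_targets on their shared identifier columns.
# Note: these four keys do not uniquely identify a row in either frame, so this
# is a many-to-many join and the result has more rows than either input.
merged_df = pd.merge(test, targets,
                     on=['county', 'is_business', 'product_type', 'data_block_id'],
                     how='inner')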

# Rename the columns that came from revealed_targets so their origin is clear
merged_df = merged_df.rename(columns={'datetime': 'target_datetime', 'target': 'actual_target'})

# Displaying the merged dataframe
merged_df.head(3).T


| | 0 | 1 | 2 |
|---|---|---|---|
| county | 0 | 0 | 0 |
| is_business | 0 | 0 | 0 |
| product_type | 1 | 1 | 1 |
| is_consumption_x | 0 | 0 | 0 |
| prediction_datetime | 2023-05-28 00:00:00 | 2023-05-28 00:00:00 | 2023-05-28 00:00:00 |
| data_block_id | 634 | 634 | 634 |
| row_id_x | 2005872 | 2005872 | 2005872 |
| prediction_unit_id_x | 0 | 0 | 0 |
| currently_scored | False | False | False |
| actual_target | 2.675 | 471.887 | 2.138 |


In [6]:
test.shape

(12480, 9)

In [7]:
targets.shape

(12576, 9)

In [8]:
merged_df.shape

(599040, 14)
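The merged frame has 599,040 rows, which is 48 times the 12,480 rows of `test` and far more than either input. That is the signature of a many-to-many join: the four merge keys repeat on both sides, so every `test` row is paired with every `targets` row that shares its key values. The sketch below reproduces the effect on made-up toy data and shows two ways to detect it, a uniqueness check (the `keys_are_unique` helper is hypothetical) and the `validate` argument of `pd.merge`.

```python
import pandas as pd

# Toy data: the key value 1 repeats on both sides.
left = pd.DataFrame({"key": [1, 1, 2], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 1, 1, 2], "right_val": [10, 20, 30, 40]})

# Each left row pairs with every right row that has the same key:
# key 1 gives 2 * 3 = 6 rows, key 2 gives 1 * 1 = 1 row.
merged = pd.merge(left, right, on="key", how="inner")
print(len(merged))

def keys_are_unique(df, keys):
    """Return True if no two rows of df share the same values in `keys`."""
    return not df.duplicated(subset=keys).any()

print(keys_are_unique(left, ["key"]))

# validate= makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(left, right, on="key", how="inner", validate="one_to_one")
except pd.errors.MergeError as exc:
    print("Merge rejected:", exc)
```

Whether the extra rows are wanted here depends on how `test` and `revealed_targets` are meant to line up; if one row per prediction is the goal, the key list would need additional columns that make each row unique.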

In [9]:
print(merged_df.isnull().sum())

county                  0
is_business             0
product_type            0
is_consumption_x        0
prediction_datetime     0
data_block_id           0
row_id_x                0
prediction_unit_id_x    0
currently_scored        0
actual_target           0
is_consumption_y        0
target_datetime         0
row_id_y                0
prediction_unit_id_y    0
dtype: int64
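A zero null count after an inner join says little about coverage, because an inner join drops rows with no match instead of filling them with missing values. One way to see how many rows found no partner is a left join with `indicator=True`, sketched here on made-up toy data:

```python
import pandas as pd

# Toy data: key 3 exists only on the left side.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 2], "right_val": [10.0, 20.0]})

# The inner join silently drops key 3, so it contains no nulls at all.
inner = pd.merge(left, right, on="key", how="inner")
print(inner.isnull().sum().sum())

# A left join keeps every left row; the added `_merge` column records
# whether each row was matched ("both") or not ("left_only").
checked = pd.merge(left, right, on="key", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

Running the same check with `test` on the left would show whether any prediction rows are missing from the merged result.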


In [10]:
# Save the merged test/targets data to CSV; index=False omits the row index column
output_file_path = "/content/drive/MyDrive/project_energy/merged_data.csv"
merged_df.to_csv(output_file_path, index=False)

In [11]:
# Load the merged dataset and display the first 3 records
data = pd.read_csv("/content/drive/MyDrive/project_energy/merged_data.csv")
data.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| county | 0 | 0 | 0 |
| is_business | 0 | 0 | 0 |
| product_type | 1 | 1 | 1 |
| is_consumption_x | 0 | 0 | 0 |
| prediction_datetime | 2023-05-28 00:00:00 | 2023-05-28 00:00:00 | 2023-05-28 00:00:00 |
| data_block_id | 634 | 634 | 634 |
| row_id_x | 2005872 | 2005872 | 2005872 |
| prediction_unit_id_x | 0 | 0 | 0 |
| currently_scored | False | False | False |
| actual_target | 2.675 | 471.887 | 2.138 |
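One detail of the CSV round trip is easy to miss: `pd.read_csv` does not parse dates unless asked, so datetime columns come back as plain text. The sketch below uses a small in-memory frame with made-up values (an `io.StringIO` buffer stands in for the file on Drive) to compare a default reload with one that passes `parse_dates`.

```python
import io
import pandas as pd

# Toy frame with one datetime column; the values are illustrative.
df = pd.DataFrame({
    "prediction_datetime": pd.to_datetime(
        ["2023-05-28 00:00:00", "2023-05-28 01:00:00"]
    ),
    "actual_target": [2.675, 471.887],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default reload: the datetime column is read back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload with parse_dates: the column is converted back to datetimes.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["prediction_datetime"])

print(plain["prediction_datetime"].dtype)
print(parsed["prediction_datetime"].dtype)
```

For the merged file, the columns to list in `parse_dates` would be `prediction_datetime` and `target_datetime`.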
