# ETL

In this notebook we build the ETL (Extract, Transform, Load) pipeline.
<br>
To download the data, you need a Kaggle account, the Kaggle API installed, and to have joined the competition [Enefit - Predict Energy Behavior of Prosumers](https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers).
<br>
Before using the API, you need to authenticate with an API token.
To learn more about the Kaggle API, see https://www.kaggle.com/discussions/getting-started/524433.
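
Before running the download cell, it can help to confirm that a token is actually in place. The sketch below is only an illustration: it assumes the default locations the Kaggle API is known to read (`~/.kaggle/kaggle.json`, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables) and may not cover every configuration. The helper name `kaggle_credentials_available` is hypothetical and not part of the Kaggle package.

```python
import os
from pathlib import Path


def kaggle_credentials_available(home=None, env=None):
    """Return True if a Kaggle API token seems to be configured.

    Hypothetical helper: it only looks at the default token file and
    the two standard environment variables.
    """
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env
    # Option 1: token file downloaded from the Kaggle account page.
    token_file = home / ".kaggle" / "kaggle.json"
    # Option 2: credentials exported as environment variables.
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    return token_file.is_file() or has_env


print("Kaggle credentials found:", kaggle_credentials_available())
```

If this prints `False`, the `kaggle competitions download` command is likely to fail with an authentication error.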

We'll also use boto3 to store the data on AWS S3, in a bucket called `enefit-competition`. S3 bucket names are globally unique, so replace this name with a bucket you own.
<br>
To do so, you'll have to [install](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and [configure](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html#getting-started-quickstart-new) the AWS CLI if you haven't already done so (this assumes that you already have an AWS account).

In [1]:
import pandas as pd
import boto3
from zipfile import ZipFile 

In [2]:
!kaggle competitions download -c predict-energy-behavior-of-prosumers

Downloading predict-energy-behavior-of-prosumers.zip to /Users/gabriel/Documents/Git/End-to-end MLOps for Time Series
100%|████████████████████████████████████████| 233M/233M [02:22<00:00, 1.71MB/s]


In [3]:
# Unzip data file using ZipFile
with ZipFile("./predict-energy-behavior-of-prosumers.zip", "r") as archive:
    archive.extractall(path="./predict-energy-behavior-of-prosumers")

In [4]:
# Delete zip file
!rm predict-energy-behavior-of-prosumers.zip
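
Since this notebook only reads `train.csv`, extracting that single file is a possible alternative to unpacking the whole archive. The following is a minimal sketch under that assumption; `extract_member` is a hypothetical helper name, and `Path.unlink` stands in for `!rm` so the cleanup also works on systems without that shell command.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_member(zip_path, member, dest_dir, delete_zip=False):
    """Extract one file from a zip archive and optionally delete the archive."""
    zip_path = Path(zip_path)
    with ZipFile(zip_path, "r") as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        extracted = archive.extract(member, path=dest_dir)
    if delete_zip:
        zip_path.unlink()  # portable equivalent of `rm`
    return Path(extracted)


# Possible usage in this notebook (not executed here):
# extract_member(
#     "predict-energy-behavior-of-prosumers.zip",
#     "train.csv",
#     "predict-energy-behavior-of-prosumers",
#     delete_zip=True,
# )
```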

In [None]:
# Read data
data = pd.read_csv("predict-energy-behavior-of-prosumers/train.csv")
print("data shape :", data.shape)
data.head(5)

data shape : (2018352, 9)


Unnamed: 0,county,is_business,product_type,target,is_consumption,datetime,data_block_id,row_id,prediction_unit_id
0,0,0,1,0.713,0,2021-09-01 00:00:00,0,0,0
1,0,0,1,96.59,1,2021-09-01 00:00:00,0,1,0
2,0,0,2,0.0,0,2021-09-01 00:00:00,0,2,1
3,0,0,2,17.314,1,2021-09-01 00:00:00,0,3,1
4,0,0,3,2.904,0,2021-09-01 00:00:00,0,4,2


In [3]:
# Separate consumption and production data
consumption = data[data["is_consumption"]==1]
print("consumption data shape :", consumption.shape)
display(consumption.head())
production = data[data["is_consumption"]==0]
print("production data shape :", production.shape)
display(production.head())

consumption data shape : (1009176, 9)


Unnamed: 0,county,is_business,product_type,target,is_consumption,datetime,data_block_id,row_id,prediction_unit_id
1,0,0,1,96.59,1,2021-09-01 00:00:00,0,1,0
3,0,0,2,17.314,1,2021-09-01 00:00:00,0,3,1
5,0,0,3,656.859,1,2021-09-01 00:00:00,0,5,2
7,0,1,0,59.0,1,2021-09-01 00:00:00,0,7,3
9,0,1,1,501.76,1,2021-09-01 00:00:00,0,9,4


production data shape : (1009176, 9)


Unnamed: 0,county,is_business,product_type,target,is_consumption,datetime,data_block_id,row_id,prediction_unit_id
0,0,0,1,0.713,0,2021-09-01 00:00:00,0,0,0
2,0,0,2,0.0,0,2021-09-01 00:00:00,0,2,1
4,0,0,3,2.904,0,2021-09-01 00:00:00,0,4,2
6,0,1,0,0.0,0,2021-09-01 00:00:00,0,6,3
8,0,1,1,0.0,0,2021-09-01 00:00:00,0,8,4
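
The split above relies on `is_consumption` only ever being 0 or 1. As a sketch of how that could be checked, the hypothetical helper below builds one boolean mask and uses its negation for production, so every row lands in exactly one of the two frames. Note that this differs slightly from the cell above: any value other than 1 would be treated as production. The toy frame reuses a few values from the preview and is for illustration only.

```python
import pandas as pd


def split_consumption_production(data):
    """Split rows into (consumption, production) using the is_consumption flag."""
    mask = data["is_consumption"] == 1
    consumption = data[mask]
    production = data[~mask]  # everything that is not flagged as consumption
    return consumption, production


toy = pd.DataFrame(
    {
        "prediction_unit_id": [0, 0, 1, 1],
        "is_consumption": [0, 1, 0, 1],
        "target": [0.713, 96.59, 0.0, 17.314],
    }
)
toy_consumption, toy_production = split_consumption_production(toy)
# Every row lands in exactly one of the two frames.
assert len(toy_consumption) + len(toy_production) == len(toy)
```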


In [4]:
# Keep only target values, datetime and unit ID of consumption data
consumption = consumption.loc[:, ["datetime", "prediction_unit_id", "target"]].rename(columns={"target": "consumption"})
print("consumption data shape :", consumption.shape)
consumption.head()

consumption data shape : (1009176, 3)


Unnamed: 0,datetime,prediction_unit_id,consumption
1,2021-09-01 00:00:00,0,96.59
3,2021-09-01 00:00:00,1,17.314
5,2021-09-01 00:00:00,2,656.859
7,2021-09-01 00:00:00,3,59.0
9,2021-09-01 00:00:00,4,501.76


In [5]:
# save data in ./data directory
!mkdir -p data
consumption.to_csv("./data/consumption.csv", index=False)

In [6]:
# Check that the data has been saved correctly by reloading it
consumption_loaded = pd.read_csv("./data/consumption.csv")
print("consumption data shape :", consumption_loaded.shape)
consumption_loaded.head()

consumption data shape : (1009176, 3)


Unnamed: 0,datetime,prediction_unit_id,consumption
0,2021-09-01 00:00:00,0,96.59
1,2021-09-01 00:00:00,1,17.314
2,2021-09-01 00:00:00,2,656.859
3,2021-09-01 00:00:00,3,59.0
4,2021-09-01 00:00:00,4,501.76
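
Comparing shapes and the first rows by eye is a weak check. A stricter option, sketched below, is to reload the CSV and compare it with the in-memory frame using `pandas.testing.assert_frame_equal`. The helper name `check_csv_round_trip` is hypothetical. It assumes the file was written with `index=False`, so the original index is reset before comparing, and it compares floats approximately rather than bit for bit.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def check_csv_round_trip(frame, csv_path):
    """Reload csv_path and compare it with frame, ignoring the original index."""
    reloaded = pd.read_csv(csv_path)
    # The CSV has no index column, so compare against a fresh 0..n-1 index.
    expected = frame.reset_index(drop=True)
    # check_exact=False tolerates tiny float differences from text parsing.
    assert_frame_equal(reloaded, expected, check_exact=False)
    return reloaded


# Possible usage in this notebook (not executed here):
# consumption_loaded = check_csv_round_trip(consumption, "./data/consumption.csv")
```

An `AssertionError` here would indicate that the saved file and the frame in memory have diverged.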


In [7]:
# Load data to AWS S3 (Optional)
s3_client = boto3.client('s3')
with open("./data/consumption.csv", "rb") as file:
    s3_client.upload_fileobj(file, "enefit-competition", "data/consumption.csv")

In [8]:
# Check that data has been uploaded correctly
response = s3_client.list_objects_v2(Bucket="enefit-competition")
# "Contents" is missing when the bucket is empty, so default to an empty list
for obj in response.get("Contents", []):
    print(obj.get("Key"))

data/consumption.csv
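
A single listing call returns at most 1,000 keys, which is enough here but not once the bucket grows. The sketch below shows one way to list every key with the boto3 `list_objects_v2` paginator. The helper name `list_bucket_keys` is hypothetical, and the client is passed in as an argument so the function makes no assumption about how credentials are configured.

```python
def list_bucket_keys(s3_client, bucket, prefix=""):
    """Return all object keys under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Possible usage in this notebook (not executed here):
# keys = list_bucket_keys(s3_client, "enefit-competition", prefix="data/")
# print(keys)
```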
