# Ingestion

In this notebook, we ingest the data from its source and save it both locally and in an S3 bucket.
<br>
To download the data, you will need to create a Kaggle account if you don't already have one, install the Kaggle API, and join the competition [Enefit - Predict Energy Behavior of Prosumers](https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers).
<br>
Before accessing the API, you will need to authenticate using an API token.
Follow this link if you want to learn more about the Kaggle API: https://www.kaggle.com/discussions/getting-started/524433.
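
Before running the download cell, it can help to confirm that a token is actually visible to the Kaggle client. The sketch below is only an illustration: it assumes the token is provided either through the `KAGGLE_USERNAME` / `KAGGLE_KEY` environment variables or as a `kaggle.json` file in `~/.kaggle` (or in the folder named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_available` is hypothetical, and your Kaggle client version may look in other places as well.

```python
import os
from pathlib import Path


def kaggle_credentials_available(config_dir=None, environ=None):
    """Return True if Kaggle credentials are found in the assumed locations."""
    environ = os.environ if environ is None else environ

    # Option 1: credentials passed through environment variables.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return True

    # Option 2 (assumption): a kaggle.json token file in ~/.kaggle,
    # unless KAGGLE_CONFIG_DIR points somewhere else.
    if config_dir is None:
        config_dir = environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    return (Path(config_dir) / "kaggle.json").is_file()


print("Kaggle credentials found:", kaggle_credentials_available())
```

If this prints `False`, the download command will most likely fail with an authentication error, so it is worth fixing the token first.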

We'll also use boto3 to store the data on AWS S3. We'll put the data in an S3 bucket called `enefit-competition`.
<br>
To do so, you'll have to [install](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and [configure](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html#getting-started-quickstart-new) the AWS CLI if you haven't already done so (this assumes that you already have an AWS account).
<br>
You can also fill in the `.env` file with your AWS credentials to access your account.
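
If you go the `.env` route, the values have to reach the environment before boto3 is used. Here is a minimal sketch using only the standard library. It assumes the file contains plain `KEY=VALUE` lines and uses the variable names boto3 reads from the environment (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`). The layout of the project's `.env` file is not shown here, so treat both the format and the helper `load_env_file` as assumptions.

```python
import os


def load_env_file(path=".env", environ=None):
    """Copy KEY=VALUE pairs from a .env file into the environment mapping.

    Variables that are already set keep their current value.
    Returns a dict of the pairs that were read from the file.
    """
    environ = os.environ if environ is None else environ
    loaded = {}
    if not os.path.isfile(path):
        return loaded

    with open(path, encoding="utf-8") as env_file:
        for line in env_file:
            line = line.strip()
            # Skip blank lines, comments, and lines without '='.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip("'\"")  # drop optional surrounding quotes
            environ.setdefault(key, value)
            loaded[key] = value
    return loaded


loaded = load_env_file(".env")
# Print only the variable names so that no secret ends up in the notebook output.
print("Variables read from .env:", sorted(loaded))
```

A dedicated package such as `python-dotenv` does the same job with more robust parsing. The hand-rolled version is used here only to keep the example free of extra dependencies.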

In [None]:
import pandas as pd
import boto3
from zipfile import ZipFile
import os
import shutil
from glob import glob

In [2]:
!kaggle competitions download -c predict-energy-behavior-of-prosumers

Downloading predict-energy-behavior-of-prosumers.zip to /Users/gabriel/Documents/Git/End-to-end MLOps for Time Series/notebooks
 99%|███████████████████████████████████████▌| 230M/233M [00:13<00:00, 21.1MB/s]
100%|████████████████████████████████████████| 233M/233M [00:13<00:00, 17.7MB/s]


In [3]:
# Unzip the downloaded archive into a folder of the same name
with ZipFile("./predict-energy-behavior-of-prosumers.zip", "r") as zip_file:
    zip_file.extractall(path="./predict-energy-behavior-of-prosumers")

In [4]:
# Delete the zip file (os.remove also works on systems without the `rm` command)
os.remove("predict-energy-behavior-of-prosumers.zip")

In [12]:
def create_dir(dir_path):
    """Create a single directory and report the outcome instead of raising."""
    try:
        # os.mkdir creates one level only, so the parent directory must already exist
        os.mkdir(dir_path)
        print(f"Directory '{dir_path}' created successfully.")
    except FileExistsError:
        print(f"Directory '{dir_path}' already exists.")
    except PermissionError:
        print(f"Permission denied: Unable to create '{dir_path}'.")
    except Exception as e:
        print(f"An error occurred: {e}")


In [15]:
create_dir("../data")
create_dir("../data/raw")

Directory '../data' already exists.
Directory '../data/raw' already exists.


In [None]:
# Move the data we are interested in to the 'data/raw/' directory
source = "./predict-energy-behavior-of-prosumers"
destination = "../data/raw"

# Collect the CSV files at the top level of the extracted archive.
# (`recursive=True` was dropped: it has no effect without a '**' pattern.)
files = glob(os.path.join(source, "*.csv"))
files.append(os.path.join(source, "county_id_to_name_map.json"))

# Iterate over all files and move them to the destination folder
for file_path in files:
    dst_path = os.path.join(destination, os.path.basename(file_path))
    shutil.move(file_path, dst_path)
    print(f"Moved {file_path} -> {dst_path}")

Moved ./predict-energy-behavior-of-prosumers/client.csv -> ../data/raw/client.csv
Moved ./predict-energy-behavior-of-prosumers/weather_station_to_county_mapping.csv -> ../data/raw/weather_station_to_county_mapping.csv
Moved ./predict-energy-behavior-of-prosumers/gas_prices.csv -> ../data/raw/gas_prices.csv
Moved ./predict-energy-behavior-of-prosumers/forecast_weather.csv -> ../data/raw/forecast_weather.csv
Moved ./predict-energy-behavior-of-prosumers/electricity_prices.csv -> ../data/raw/electricity_prices.csv
Moved ./predict-energy-behavior-of-prosumers/train.csv -> ../data/raw/train.csv
Moved ./predict-energy-behavior-of-prosumers/historical_weather.csv -> ../data/raw/historical_weather.csv
Moved ./predict-energy-behavior-of-prosumers/county_id_to_name_map.json -> ../data/raw/county_id_to_name_map.json
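
After the move, a quick check can confirm that nothing was left behind. The sketch below compares the contents of `../data/raw` against the eight file names listed in the output above. The helper `missing_files` is a hypothetical name introduced for illustration.

```python
import os

# File names taken from the "Moved ..." output of the previous cell.
EXPECTED_FILES = [
    "client.csv",
    "weather_station_to_county_mapping.csv",
    "gas_prices.csv",
    "forecast_weather.csv",
    "electricity_prices.csv",
    "train.csv",
    "historical_weather.csv",
    "county_id_to_name_map.json",
]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(directory, name))
    ]


missing = missing_files("../data/raw")
print("Missing files:", missing if missing else "none")
```

An empty result means every expected file is in place. Any name that shows up points to a download or move step that should be rerun.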


In [27]:
# Move the remaining files into 'data/'
shutil.move("./predict-energy-behavior-of-prosumers", "../data")

'../data/predict-energy-behavior-of-prosumers'

In [None]:
# TODO: upload the data to S3.
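
One possible way to fill in this step is sketched below. It is not a finished implementation. It assumes the files in `../data/raw` should be stored under a `raw/` key prefix, which is a hypothetical layout, and that the bucket already exists and your credentials allow writing to it. The helpers `plan_uploads` and `upload_dir` are illustrative names. The only boto3 call involved is the S3 client's `upload_file(Filename, Bucket, Key)` method.

```python
import os


def plan_uploads(local_dir, prefix="raw"):
    """Return (local_path, s3_key) pairs for the files directly inside `local_dir`."""
    pairs = []
    for name in sorted(os.listdir(local_dir)):
        path = os.path.join(local_dir, name)
        if os.path.isfile(path):  # subdirectories are ignored
            pairs.append((path, f"{prefix}/{name}"))
    return pairs


def upload_dir(s3_client, local_dir, bucket, prefix="raw"):
    """Upload every planned file through the client's `upload_file` method."""
    uploaded = []
    for path, key in plan_uploads(local_dir, prefix):
        s3_client.upload_file(path, bucket, key)
        uploaded.append(key)
    return uploaded


# Usage (boto3 is imported at the top of the notebook):
# s3 = boto3.client("s3")
# upload_dir(s3, "../data/raw", "enefit-competition")
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before touching a real bucket. Since S3 bucket names are globally unique, you will probably need to replace `enefit-competition` with a bucket of your own.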