<a href="https://colab.research.google.com/github/fastai-energetic-engineering/ashrae/blob/master/kaggle_data_to_parquet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Kaggle Data for ASHRAE Energy Prediction
> "How to download Kaggle data from Colab."

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [kaggle, preprocessing]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: false


In [None]:
#collapse
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
#collapse
from fastbook import *
import os
from google.colab import files
import pandas as pd
import datetime

This notebook demonstrates how I downloaded the [ASHRAE Energy Prediction Data](https://www.kaggle.com/c/ashrae-energy-prediction/overview) from Kaggle in Colab, merged the tables, and saved the training set as a Parquet file.

First, we need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api#api-credentials).

In [None]:
!pip install kaggle --upgrade -q

I will download the data into a folder in my Google Drive, so I first change the working directory to that folder. This assumes Drive is already mounted at `/content/gdrive`.

In [None]:
%cd /content/gdrive/MyDrive/Colab Notebooks/ashrae/
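
The `%cd` above only succeeds once Drive is mounted. A minimal sketch of that step, assuming the same `/content/gdrive` mount point used in the path above; `mount_drive` is a hypothetical helper name, and the function simply returns `False` when it is run outside Colab:

```python
def mount_drive(mount_point="/content/gdrive"):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True


mounted = mount_drive()
print("Drive mounted:", mounted)
```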

We need to download a Kaggle API token and put the `kaggle.json` file in the `~/.kaggle` folder. We can upload the token directly from Colab.

In [None]:
files.upload() # use this to upload your API json key
!mkdir -p ~/.kaggle # create the folder (no error if it already exists)
!cp kaggle.json ~/.kaggle/ # copy the key into the folder
!chmod 600 ~/.kaggle/kaggle.json # make the key readable and writable by the owner only
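
The same setup can also be done in plain Python, which makes it easy to check the token before copying it. This is only a sketch: `install_kaggle_token` is a hypothetical helper, not part of the Kaggle API, and it assumes the token is a JSON object with `username` and `key` fields.

```python
import json
import os
import stat
from pathlib import Path


def install_kaggle_token(src, home=None):
    """Copy a Kaggle token to <home>/.kaggle/kaggle.json with 600 permissions."""
    src = Path(src)
    home = Path(home) if home is not None else Path.home()

    # Fail early if the uploaded file does not look like an API token.
    token = json.loads(src.read_text())
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"token is missing fields: {sorted(missing)}")

    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like mkdir -p
    target = target_dir / "kaggle.json"
    target.write_text(json.dumps(token))
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # like chmod 600
    return target
```

In Colab this would be called as `install_kaggle_token("kaggle.json")` right after `files.upload()`.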

In [None]:
!rm -rf data # remove any previous download (no error if the folder does not exist)

In [None]:
%mkdir data
%cd data

We can finally download the data!

In [None]:
!kaggle competitions download -c ashrae-energy-prediction

In [None]:
# extract zip files then remove the .zip
for item in os.listdir(): # for every item in the folder
    if item.endswith('.zip'): # check if it is a .zip file
        file_extract(item) # if it is, then extract file
        os.remove(item) # and then remove the .zip
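
The loop above relies on `file_extract`, which comes in through `from fastbook import *`. If fastai is not available, a rough equivalent can be sketched with the standard library only; `extract_and_remove_zips` is a hypothetical name, and it only handles `.zip` archives:

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder="."):
    """Extract every .zip archive in `folder` in place, then delete the archive."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)  # unpack next to the archive
        archive.unlink()  # remove the .zip once its contents are extracted
        extracted.append(archive.name)
    return extracted
```

Called with no arguments from inside `data`, it does the same job as the cell above and returns the names of the archives it processed.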

In [None]:
os.chdir("..") # return to initial folder

In [None]:
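def prepare_data(dataset="train"):
    """Load the meter, building, and weather tables for `dataset` and merge them into one DataFrame."""
    assert dataset in ["train", "test"]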

    # read data
    building_df = pd.read_csv("data/building_metadata.csv")
    weather_df = pd.read_csv(f"data/weather_{dataset}.csv")
    data_df = pd.read_csv(f"data/{dataset}.csv")

    # convert datetime
    data_df["timestamp"] = pd.to_datetime(data_df["timestamp"])

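    # hours to subtract from each site's weather timestamps (site_id -> offset)
    # so that they line up with the meter timestamps
    timediff = {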
        0: 4,
        1: 0,
        2: 7,
        3: 4,
        4: 7,
        5: 0,
        6: 4,
        7: 4,
        8: 4,
        9: 5,
        10: 7,
        11: 4,
        12: 0,
        13: 5,
        14: 4,
        15: 4,
    }
    # convert the per-site offset to a timedelta and shift the weather timestamps
    weather_df["time_diff"] = pd.to_timedelta(
        weather_df["site_id"].map(timediff), unit="h"
    )
    weather_df["timestamp_gmt"] = pd.to_datetime(weather_df["timestamp"])
    weather_df["timestamp"] = weather_df["timestamp_gmt"] - weather_df["time_diff"]

    # attach building metadata, then the weather for the matching site and hour
    data_df = data_df.merge(building_df, on="building_id", how="left")
    data_df = data_df.merge(
        weather_df.drop(columns=["time_diff", "timestamp_gmt"]),
        on=["site_id", "timestamp"],
        how="left",
    )

    return data_df
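
To see why the weather timestamps are shifted before merging, here is a small self-contained sketch on made-up data. The column values, the single site, and the 4-hour offset are illustrative assumptions, not rows from the competition files; only the shift-then-merge pattern mirrors `prepare_data`.

```python
import pandas as pd

# Toy meter readings for one hypothetical site, in the meter's own clock.
meter = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"]),
    "meter_reading": [10.0, 12.0],
})

# Toy weather observations whose timestamps run 4 hours ahead of the meter clock.
weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": pd.to_datetime(["2016-01-01 04:00", "2016-01-01 05:00"]),
    "air_temperature": [3.5, 4.0],
})

offset_hours = {0: 4}  # illustrative offset, same shape as `timediff`

# Shift the weather timestamps back by the site's offset so both tables share a clock.
shift = pd.to_timedelta(weather["site_id"].map(offset_hours), unit="h")
weather["timestamp"] = weather["timestamp"] - shift

# Left merge keeps every meter row and attaches the weather for the same site and hour.
merged = meter.merge(weather, on=["site_id", "timestamp"], how="left")
print(merged)
```

Without the shift, neither weather row would match a meter row and `air_temperature` would be missing for both readings.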

In [None]:
train_combined_df = prepare_data('train')

In [None]:
train_combined_df.to_parquet(Path("/content/gdrive/MyDrive/Colab Notebooks/ashrae/train_combined.parquet.snappy"))