<a href="https://colab.research.google.com/github/fastai-energetic-engineering/ashrae/blob/master/_notebooks/2021-06-27-Getting-ASHRAE-Energy-Prediction-Data-from-Kaggle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Kaggle Data for ASHRAE Energy Prediction
> "How to download Kaggle data from Colab."

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [kaggle, preprocessing]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: false


In [None]:
#collapse
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()


In [None]:
#collapse
from fastbook import *
import os
from google.colab import files
import pandas as pd
import datetime

This notebook demonstrates how I downloaded the [ASHRAE Energy Prediction Data](https://www.kaggle.com/c/ashrae-energy-prediction/overview) from Kaggle.

First, we need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api#api-credentials).

In [None]:
!pip install kaggle --upgrade -q

I will download the data into a folder in my Google Drive, so I first change the working directory to that folder.

In [None]:
p = Path('drive/MyDrive/Colab Notebooks/ashrae/')
os.chdir(p) # change directory

We need to download a Kaggle API token and put the `kaggle.json` file in the `~/.kaggle` folder. We can upload the token directly from Colab.

In [None]:
files.upload() # use this to upload your kaggle.json API token
!mkdir -p ~/.kaggle # create the folder if it does not already exist
!cp kaggle.json ~/.kaggle/ # copy the token into the folder
!chmod 600 ~/.kaggle/kaggle.json # restrict the file to owner read/write
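
For reference, the same three steps (create the folder, copy the token, restrict its permissions) can be sketched in plain Python. This is only an illustration: `install_kaggle_token` is a hypothetical helper, not part of the Kaggle API, and the demo uses a dummy token in a temporary directory rather than the real `~/.kaggle`.

```python
import os
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, config_dir):
    """Copy a Kaggle API token into config_dir and restrict its permissions.

    Hypothetical helper for illustration; not part of the Kaggle API.
    """
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = config_dir / "kaggle.json"
    target.write_bytes(Path(token_path).read_bytes())  # copy the token
    os.chmod(target, 0o600)  # owner read/write only
    return target


# Demo with a dummy token in a temporary directory (not a real credential).
with tempfile.TemporaryDirectory() as tmp:
    dummy = Path(tmp) / "kaggle.json"
    dummy.write_text('{"username": "demo", "key": "not-a-real-key"}')
    installed = install_kaggle_token(dummy, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(oct(mode))  # permission bits of the installed file
```

To use it for real, the call would be something like `install_kaggle_token("kaggle.json", Path.home() / ".kaggle")`, assuming the uploaded token sits in the current directory.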

We can finally download the file!

In [None]:
os.chdir('data') # move to data folder
!kaggle competitions download -c ashrae-energy-prediction

In [None]:
# extract zip files then remove the .zip
for item in os.listdir(): # for every item in the folder
    if item.endswith('.zip'): # check if it is a .zip file
        file_extract(item) # if it is, then extract file
        os.remove(item) # and then remove the .zip
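
If fastai is not available, roughly the same clean-up can be done with the standard library. The sketch below is an alternative, not a description of how `file_extract` works internally; `extract_and_remove_zips` is a hypothetical name, and the demo runs on a throwaway archive.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in folder into that folder, then delete the archive."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # reached only if extraction did not raise
    return extracted


# Demo on a throwaway archive in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    with zipfile.ZipFile(tmp / "demo.zip", "w") as zf:
        zf.writestr("train.csv", "building_id,meter\n0,0\n")
    names = extract_and_remove_zips(tmp)
    remaining = sorted(p.name for p in tmp.iterdir())
    print(names, remaining)
```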

In [None]:
os.chdir("..") # return to initial folder

## Joining Tables

Our training data comprises three tables:
- `building_metadata.csv`
- `weather_train.csv`
- `train.csv`

We need to join the tables. First, let's see what's in the tables.

In [None]:
building = pd.read_csv('data/building_metadata.csv')
weather = pd.read_csv('data/weather_train.csv')
train = pd.read_csv('data/train.csv')

`building` contains the buildings' metadata.

In [None]:
building.head()

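| | site_id | building_id | primary_use | square_feet | year_built | floor_count |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Education | 7432 | 2008.0 | NaN |
| 1 | 0 | 1 | Education | 2720 | 2004.0 | NaN |
| 2 | 0 | 2 | Education | 5376 | 1991.0 | NaN |
| 3 | 0 | 3 | Education | 23685 | 2002.0 | NaN |
| 4 | 0 | 4 | Education | 116607 | 1975.0 | NaN |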


- `site_id` - Foreign key for the weather files.
- `building_id` - Foreign key for `train.csv`
- `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
- `square_feet` - Gross floor area of the building
- `year_built` - Year building was opened
- `floor_count` - Number of floors of the building

`weather` contains weather data from the closest meteorological station.

In [None]:
weather.head()

| | site_id | timestamp | air_temperature | cloud_coverage | dew_temperature | precip_depth_1_hr | sea_level_pressure | wind_direction | wind_speed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-01-01 00:00:00 | 25.0 | 6.0 | 20.0 | NaN | 1019.7 | 0.0 | 0.0 |
| 1 | 0 | 2016-01-01 01:00:00 | 24.4 | NaN | 21.1 | -1.0 | 1020.2 | 70.0 | 1.5 |
| 2 | 0 | 2016-01-01 02:00:00 | 22.8 | 2.0 | 21.1 | 0.0 | 1020.2 | 0.0 | 0.0 |
| 3 | 0 | 2016-01-01 03:00:00 | 21.1 | 2.0 | 20.6 | 0.0 | 1020.1 | 0.0 | 0.0 |
| 4 | 0 | 2016-01-01 04:00:00 | 20.0 | 2.0 | 20.0 | -1.0 | 1020.0 | 250.0 | 2.6 |


- `site_id`
- `air_temperature` - Degrees Celsius
- `cloud_coverage` - Portion of the sky covered in clouds, in oktas
- `dew_temperature` - Degrees Celsius
- `precip_depth_1_hr` - Millimeters
- `sea_level_pressure` - Millibar/hectopascals
- `wind_direction` - Compass direction (0-360)
- `wind_speed` - Meters per second

Finally, `train` contains the target variable, `meter_reading`, which represents energy consumption in kWh.

In [None]:
train.head()

| | building_id | meter | timestamp | meter_reading |
|---|---|---|---|---|
| 0 | 0 | 0 | 2016-01-01 00:00:00 | 0.0 |
| 1 | 1 | 0 | 2016-01-01 00:00:00 | 0.0 |
| 2 | 2 | 0 | 2016-01-01 00:00:00 | 0.0 |
| 3 | 3 | 0 | 2016-01-01 00:00:00 | 0.0 |
| 4 | 4 | 0 | 2016-01-01 00:00:00 | 0.0 |


- `building_id` - Foreign key for the building metadata.
- `meter` - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
- `timestamp` - When the measurement was taken
- `meter_reading` - The target variable. Energy consumption in kWh (or equivalent).

There is an issue with the timestamps, as noted in [this post](https://www.kaggle.com/c/ashrae-energy-prediction/discussion/115040#latest-667889): the timestamps in the weather table are in GMT, while those in the meter reading table are in local time. We have to align the two before merging the tables.
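
To make the mismatch concrete, here is a minimal sketch of the adjustment on made-up data. The two offsets and the three weather rows are illustrative assumptions, and `timestamp_local` is a column name chosen for this example only.

```python
import pandas as pd

# Illustrative offsets in hours (GMT minus local time) for two sites.
utc_offset_hours = {0: 4, 1: 0}

# Toy weather table with GMT timestamps.
weather = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 04:00:00", "2016-01-01 05:00:00", "2016-01-01 04:00:00"]
    ),
})

# Look up each row's offset by site and subtract it to get local time.
offset = pd.to_timedelta(weather["site_id"].map(utc_offset_hours), unit="h")
weather["timestamp_local"] = weather["timestamp"] - offset

print(weather)
```

For site 0, the GMT reading at 04:00 lines up with local midnight, which is the timestamp a meter reading taken at that moment would carry.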

I wrote a function that prepares the train or test data accordingly.

In [None]:
def prepare_data(split='train'):
    """Load the train or test readings and join building metadata and weather."""
    assert split in ['train', 'test']

    # read data
    building = pd.read_csv('data/building_metadata.csv')
    weather = pd.read_csv(f'data/weather_{split}.csv')
    data = pd.read_csv(f'data/{split}.csv')

    # parse the meter timestamps (local time)
    data['timestamp'] = pd.to_datetime(data['timestamp'])

    # hours to subtract from the GMT weather timestamps to get local time, keyed by site_id
    timediff = {0:4,1:0,2:7,3:4,4:7,5:0,6:4,7:4,8:4,9:5,10:7,11:4,12:0,13:5,14:4,15:4}
    weather['time_diff'] = pd.to_timedelta(weather['site_id'].map(timediff), unit='h')
    weather['timestamp_gmt'] = pd.to_datetime(weather['timestamp'])
    weather['timestamp'] = weather['timestamp_gmt'] - weather['time_diff']  # now local time

    # left joins keep every meter reading, even when metadata or weather is missing
    data = data.merge(building, on='building_id', how='left')
    data = data.merge(weather, on=['site_id','timestamp'], how='left')

    return data

Let's try this function out!

In [None]:
prepare_data('train').head()

| | building_id | meter | timestamp | meter_reading | site_id | primary_use | square_feet | year_built | floor_count | air_temperature | cloud_coverage | dew_temperature | precip_depth_1_hr | sea_level_pressure | wind_direction | wind_speed | time_diff | timestamp_gmt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2016-01-01 | 0.0 | 0 | Education | 7432 | 2008.0 | NaN | 20.0 | 2.0 | 20.0 | -1.0 | 1020.0 | 250.0 | 2.6 | 0 days 04:00:00 | 2016-01-01 04:00:00 |
| 1 | 1 | 0 | 2016-01-01 | 0.0 | 0 | Education | 2720 | 2004.0 | NaN | 20.0 | 2.0 | 20.0 | -1.0 | 1020.0 | 250.0 | 2.6 | 0 days 04:00:00 | 2016-01-01 04:00:00 |
| 2 | 2 | 0 | 2016-01-01 | 0.0 | 0 | Education | 5376 | 1991.0 | NaN | 20.0 | 2.0 | 20.0 | -1.0 | 1020.0 | 250.0 | 2.6 | 0 days 04:00:00 | 2016-01-01 04:00:00 |
| 3 | 3 | 0 | 2016-01-01 | 0.0 | 0 | Education | 23685 | 2002.0 | NaN | 20.0 | 2.0 | 20.0 | -1.0 | 1020.0 | 250.0 | 2.6 | 0 days 04:00:00 | 2016-01-01 04:00:00 |
| 4 | 4 | 0 | 2016-01-01 | 0.0 | 0 | Education | 116607 | 1975.0 | NaN | 20.0 | 2.0 | 20.0 | -1.0 | 1020.0 | 250.0 | 2.6 | 0 days 04:00:00 | 2016-01-01 04:00:00 |
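
A quick sanity check after a join like this is to confirm that no rows were duplicated and to count readings that found no matching weather row. The sketch below does this on toy data; the two small frames are made up for illustration, and the `validate` and `indicator` arguments of `DataFrame.merge` are an optional addition, not something the function above uses.

```python
import pandas as pd

# Toy meter readings: one row has no matching weather observation.
readings = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]
    ),
    "meter_reading": [0.0, 1.5, 2.0],
})
weather = pd.DataFrame({
    "site_id": [0, 1],
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 00:00"]),
    "air_temperature": [20.0, 5.0],
})

merged = readings.merge(
    weather,
    on=["site_id", "timestamp"],
    how="left",
    validate="many_to_one",  # raise if weather has duplicate (site_id, timestamp) keys
    indicator=True,          # add a _merge column recording where each row matched
)

# A left join on unique right-hand keys must preserve the number of readings.
assert len(merged) == len(readings)

# Readings without weather data show up as "left_only".
unmatched = int((merged["_merge"] == "left_only").sum())
print(unmatched)
```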


That's it! In the next blog post, I will show how to load this data into fastai's `dataloaders`.