<a href="https://colab.research.google.com/github/fastai-energetic-engineering/ashrae/blob/master/_notebooks/2021-07-23-tabular1online-presentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ASHRAE Energy Prediction
> by Energetic Engineering, the tabular-1-online group in the 2021 fast.ai course

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [fastai, kaggle]
- image: images/joey-kyber-Pihl8kTtX-s-unsplash.jpg
- hide: false
- search_exclude: false


The [ASHRAE Great Energy Predictor III](https://www.kaggle.com/c/ashrae-energy-prediction) is a Kaggle competition to predict energy use in buildings.

The training dataset is 3.2 GB and contains 20 million energy meter readings for 1449 buildings over one year. Four types of energy meter are sampled: chilled water, electric, hot water, and steam. Each meter reading (the dependent variable) comes with building metadata and weather data as independent variables. The [Kaggle data page](https://www.kaggle.com/c/ashrae-energy-prediction/data) lists the available files, with histograms for each variable.

Our goal was to use fast.ai to build a model that predicts energy usage for each meter in each building.

We first downloaded the dataset from Kaggle, joined the tables, fixed timestamp inconsistencies, and converted some columns to smaller data types to save memory. We aimed to work in Colab, so memory was precious.
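
As a rough illustration of the join step, here is a minimal sketch on tiny made-up frames. The frame names (`train`, `building`, `weather`) and the toy values are assumptions for illustration; the column names follow the Kaggle data page. It is not our actual preprocessing code.

```python
import pandas as pd

# Toy stand-ins for the Kaggle tables (values are made up).
train = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]),
    "meter_reading": [12.5, 13.0, 250.0],
})
building = pd.DataFrame({
    "building_id": [0, 1],
    "site_id": [0, 1],
    "square_feet": [7432, 2720],
})
weather = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]),
    "air_temperature": [25.0, 24.4, 3.8],
})

# Left joins keep every meter reading, even if metadata or weather is missing.
combined = (
    train
    .merge(building, on="building_id", how="left")
    .merge(weather, on=["site_id", "timestamp"], how="left")
)
print(combined)
```

Left joins are used so that a reading with no matching weather row is kept with missing values rather than silently dropped.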


In [None]:
#collapse
!pip install -Uqq fastbook dtreeviz
import fastbook
fastbook.setup_book()

In [None]:
#collapse
import os
import gc
import pandas as pd
import datetime as dt
from tqdm.auto import tqdm
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

In [None]:
#collapse
%cd '/content/gdrive/MyDrive/Colab Notebooks/ashrae'
train_valid = pd.read_parquet("feature_enhanced_train_combined.parquet.snappy")

In [None]:
#collapse
## Memory optimization

# Original code from https://www.kaggle.com/gemartin/load-data-reduce-memory-usage by @gemartin
# Modified to support timestamp type, categorical type
# Modified to add option to use float16

from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype

def reduce_mem_usage(df, use_float16=False):
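    """
    Downcast each numeric column of `df` to the smallest dtype that holds its values.

    Datetime, categorical and string columns are left unchanged; any remaining
    object columns become categoricals. float16 is used only if `use_float16` is True.
    """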
    
    start_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))
    
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]) or is_string_dtype(df[col]):
            continue
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Pick the smallest signed integer type whose (inclusive) range holds the column.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # Anything larger stays int64.
            else:
                # float16 keeps only about three significant decimal digits, so it is opt-in.
                if use_float16 and c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype("category")

    end_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))
    print("Decreased by {:.1f}%".format(100 * (start_mem - end_mem) / start_mem))
    
    return df

train_valid = reduce_mem_usage(train_valid, use_float16=True)
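
To make the trade-off concrete, the following minimal sketch downcasts a small synthetic frame by hand. The frame `demo` and its values are made up for illustration and are not part of the ASHRAE data. It shows both the memory saving and the precision that float16 gives up.

```python
import numpy as np
import pandas as pd

# Synthetic frame: pandas defaults to 8 bytes per value for ints and floats.
demo = pd.DataFrame({
    "meter": (np.arange(1000) % 4).astype(np.int64),
    "air_temperature": np.linspace(-30.0, 45.0, 1000),
})
before = int(demo.memory_usage(index=False).sum())

# Downcast by hand: int64 -> int8 (1 byte), float64 -> float16 (2 bytes).
small = demo.astype({"meter": np.int8, "air_temperature": np.float16})
after = int(small.memory_usage(index=False).sum())
print(f"{before} bytes -> {after} bytes")

# The cost: float16 cannot represent every integer above 2048.
rounded = float(np.float16(2049.0))
print(rounded)
```

Small integer codes and temperatures survive this well, but large values lose resolution in float16, which is worth keeping in mind for any column with a wide range.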

To get a feel for the data, we browsed the exploratory data analysis notebooks posted to Kaggle. We learnt that this is a gnarly dataset, with many missing values and outliers. For example, the `meter_reading` dependent variable has mostly small values but also some very large ones:

In [None]:
train_valid[['meter_reading']].hist()
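
A common way to make such a skewed variable easier to inspect is to look at `log(1 + x)` instead of the raw value. The sketch below uses a handful of made-up readings (the series `readings` is hypothetical, not taken from the dataset) to show how the transform compresses the large values. The competition's RMSLE metric is defined on the same log scale.

```python
import numpy as np
import pandas as pd

# Made-up readings: mostly small values plus one extreme outlier.
readings = pd.Series(
    [0.0, 1.2, 3.5, 8.0, 15.0, 40.0, 120.0, 2.0e6], name="meter_reading")

# log1p(x) = log(1 + x) is defined at zero and compresses large values.
log_readings = np.log1p(readings)

# Compare how far the maximum sits from the median on each scale.
ratio_raw = readings.max() / readings.median()
ratio_log = log_readings.max() / log_readings.median()
print(f"max/median raw: {ratio_raw:.0f}, log1p: {ratio_log:.1f}")
```

On the real data, the same idea would be `np.log1p(train_valid['meter_reading'].astype('float32')).hist()`, casting up first because the column may be stored as float16.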