# HDF5 Tutorial and Demonstration

***

### Welcome! This notebook will show you how to use the functions associated with ARCTIC's HDF5 data storage, and give a brief overview of its intended usage. We at the ARCTIC Dev Team hope that you find it useful and clear.

This data storage is done through the use of HDF5, or [Hierarchical Data Format 5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), a file format designed for storing large, structured datasets, with optional compression. It stores data in a self-describing binary format, so files are transferable across operating systems.

Our intention is for the HDF5 system to be used hierarchically, to help organize and sort data, though you may use it any way that you wish. Just note that if you choose a different layout, certain components of the code may need to be reorganized. We think we've chosen a reasonable structure, and we hope that you'll find it to your liking!

The structure we have chosen is to store all data in a single HDF5 file, a choice largely made for ease of loading data into later components of the package. The first level of hierarchy within the HDF5 file (the groups directly underneath the file root) consists of names that correspond to locations in Alaska (e.g., Juneau, Fairbanks, Anchorage). This serves as our first delineator: all data is sorted by location prior to insertion into the HDF5 file. Note that grouping is based on the exact location name, so if one location is named Juneau and another is named Juneau, AK, they will not be stored in the same group!

Within each location group, there is one group per solar installation that stores all the data for that installation. These can be named whatever you would like. Inside the installation group, the various parameters of that installation, such as energy produced, are stored as individual datasets. The installation group also carries an attribute that lists the system's DC capacity, to allow for energy production normalization.
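To make the layout concrete, here is a minimal sketch that builds the same hierarchy directly with `h5py` in a throwaway file. The location, installation name, and values are made up for illustration, and this is not how the ARCTIC functions are implemented internally; it only mirrors the structure described above.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a throwaway file that mirrors the layout described above.
# All names and values here are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "layout_sketch.h5")

with h5py.File(path, "w") as f:
    location = f.create_group("Juneau")               # first level: location
    panel = location.create_group("example_install")  # second level: installation
    panel.create_dataset("Energy", data=np.array([310.0, 295.5, 402.1]))
    panel.create_dataset("Month", data=np.array([1, 2, 3]))
    panel.create_dataset("Year", data=np.array([2019, 2019, 2019]))
    panel.attrs["DC Capacity"] = 5.0                  # attribute on the installation group

with h5py.File(path, "r") as f:
    layout = []
    f.visit(layout.append)  # collects the path of every group and dataset
    dc_capacity = float(f["Juneau/example_install"].attrs["DC Capacity"])

print(layout)
print(dc_capacity)
```

Printing `layout` lists one path per group and dataset, which is a quick way to check that a file follows the expected structure.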

With that description done, let's walk through how this plays out in our functions. One thing to note is that most of these functions do not return a result that you can readily inspect, so we'll be somewhat limited in our examples here. Other functionality makes use of reading this data, and detailed examples of reading it from an HDF5 file are provided there.

**NOTE!** All data is expected to be sorted into a specific format prior to storage in the HDF5 file, otherwise all of these functions will fail. Please see the example.xlsx file for an example of the expected format, found here at our [Github page.](https://github.com/acep-solar/ACEP_solar)

***

### Creating the HDF5 File

The first function creates the HDF5 file on your disk. If you already have an HDF5 file that you would like to use for data storage, you can skip this step.

The function is `create_hdf5_file`. It takes one argument, `hdf5_filename`, which is the name that you would like the file to be saved under. The function will throw an error if you attempt to create a file whose name already exists. We'll call that now.

In [1]:
from ARCTIC import hdf5_interface

In [2]:
hdf5_interface.create_hdf5_file('demo_filename')

In [3]:
#Let's call that again, to see what type of error we end up with:
hdf5_interface.create_hdf5_file('demo_filename')

OSError: Unable to create file (unable to open file: name = 'demo_filename.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = 502)

Great! We now have an HDF5 file in our directory. Note that we passed only 'demo_filename'; the function appends the file extension for us, as the filename in the error message shows. Next, we'll begin to create the internal structure of our HDF5 file with our other functions.
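For readers curious about how "create, but fail if it already exists" can be done in `h5py`, the following is a small sketch. The helper name `create_file_sketch` is hypothetical and this is not necessarily how `create_hdf5_file` is written; it only shows h5py's exclusive-create mode producing the same kind of error.

```python
import os
import tempfile

import h5py


def create_file_sketch(path):
    """Create an empty HDF5 file, failing if the path already exists."""
    # Mode "x" means "create, but raise if the file is already there".
    with h5py.File(path, "x"):
        pass


path = os.path.join(tempfile.mkdtemp(), "demo_sketch.h5")
create_file_sketch(path)

try:
    create_file_sketch(path)  # second call on the same path
    second_call_failed = False
except OSError as err:        # h5py reports "File exists" as an OSError
    second_call_failed = True
    print("Refused to overwrite:", err)
```

Failing loudly here is a deliberate safety choice: opening with mode `"w"` instead would silently truncate an existing data file.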

***

### Creating groups and datasets within the HDF5 File 

The second function, `add_to_hdf5_file`, allows the entry of individual solar installation data. It takes three inputs, in this order: the filename of the HDF5 file, the name of the file that the solar installation data is saved in, and the name of the solar installation. The data file can be either a csv or an xlsx file. Please refer to the example.xlsx and example.csv files in the GitHub repo (linked in the note above) for examples of how to format these files. *Note!* The data filename should include the file extension! If the function isn't passed a file that ends with .csv or .xlsx, it will throw an error.

This function reads the data file and pulls a location from it. It searches the HDF5 file structure for that location, and if it is not found, it adds the location to the HDF5 file structure. After finding or adding the location, the function reads the data from the data file and saves it into the HDF5 file as separate datasets: Energy, Month, and Year for monthly data, or Energy, Day, Month, and Year for daily data.

In [4]:
import os

In [6]:
hdf5_interface.add_to_hdf5_file('demo_filename', 'example.xlsx', 'solar_installation_name')

This location already exists. Navigating there now.
You've already entered this panel's data!
You should use the `update_panel_data` function instead.


ValueError: This panel already exists in the HDF5 structure

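Ok! That shows the command the software uses to read and add a brand new solar installation entry. Because this installation had already been entered, the call above was rejected: the function only works for *NEW* entries. If you want to update an existing entry, say with new data, you will need to use a different function. Though this seems odd, it follows from how HDF5 works: a new group or dataset cannot be created under a name that already exists, so the old datasets have to be deleted before they can be replaced.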
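The find-or-create logic described in this section can be sketched with plain `h5py` as follows. The function name `add_panel_sketch`, its arguments, and the data are hypothetical; the real `add_to_hdf5_file` reads its data from a csv or xlsx file, which this sketch skips.

```python
import os
import tempfile

import h5py
import numpy as np


def add_panel_sketch(h5_path, location, panel_name, energy, month, year):
    """Add a brand new installation; refuse if the name is already taken."""
    with h5py.File(h5_path, "a") as f:
        # Reuse the location group if it exists, otherwise create it.
        loc = f.require_group(location)
        if panel_name in loc:
            raise ValueError("This panel already exists in the HDF5 structure")
        panel = loc.create_group(panel_name)
        panel.create_dataset("Energy", data=np.asarray(energy, dtype=float))
        panel.create_dataset("Month", data=np.asarray(month, dtype=int))
        panel.create_dataset("Year", data=np.asarray(year, dtype=int))


path = os.path.join(tempfile.mkdtemp(), "add_sketch.h5")
add_panel_sketch(path, "Juneau", "example_install", [310.0, 295.5], [1, 2], [2019, 2019])

try:
    # Same location and installation name again: this should be rejected.
    add_panel_sketch(path, "Juneau", "example_install", [1.0], [3], [2019])
    duplicate_rejected = False
except ValueError as err:
    duplicate_rejected = True
    print(err)
```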

***

### Updating existing entries in the HDF5 system

The third function, `update_existing_panel_entry`, enables the updating of an existing panel entry. Like the addition of a new entry, this function takes three inputs, in this order: the filename of the HDF5 file, the name of the file that the solar installation data is saved in, and the name of the solar installation. The data file again can be either a csv or an xlsx file. Please refer to the example.xlsx and example.csv files in the GitHub repo (linked in the note above) for examples of how to format these files. *Note!* The data filename should include the file extension! If the function isn't passed a file that ends with .csv or .xlsx, it will throw an error.

This function finds the currently existing entry in the HDF5 file, and deletes the existing entry's datasets (Energy, Month, and Year, or Energy, Day, Month, and Year), and then adds in the new datasets. 

In [7]:
hdf5_interface.update_existing_panel_entry('demo_filename', 'example.xlsx', 'solar_installation_name')

And with that function call, we have updated the "solar_installation_name" installation data within our "demo_filename" file!
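The delete-then-recreate step can be sketched directly in `h5py`. The helper `replace_datasets_sketch` and the sample data are hypothetical, and the real `update_existing_panel_entry` may differ in its details; the point is only that an existing dataset name has to be removed before new data can be written under it.

```python
import os
import tempfile

import h5py
import numpy as np


def replace_datasets_sketch(h5_path, panel_path, new_data):
    """Replace each named dataset under an existing installation group."""
    with h5py.File(h5_path, "r+") as f:
        panel = f[panel_path]  # raises KeyError if the panel does not exist
        for name, values in new_data.items():
            if name in panel:
                del panel[name]  # remove the old dataset first
            panel.create_dataset(name, data=np.asarray(values))


# Set up a small file with one installation and two months of data.
path = os.path.join(tempfile.mkdtemp(), "update_sketch.h5")
with h5py.File(path, "w") as f:
    panel = f.create_group("Juneau/example_install")
    panel.create_dataset("Energy", data=np.array([310.0, 295.5]))
    panel.create_dataset("Month", data=np.array([1, 2]))

# Replace it with three months of data.
replace_datasets_sketch(
    path,
    "Juneau/example_install",
    {"Energy": [310.0, 295.5, 402.1], "Month": [1, 2, 3]},
)

with h5py.File(path, "r") as f:
    updated_energy = f["Juneau/example_install/Energy"][:]

print(updated_energy)
```

Deleting and recreating, rather than overwriting in place, also lets the new data have a different length from the old.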

### Interpolation

A lot of the time, installations will not have a fully reported dataset, with values reported from every day or every month. These gaps in data result in issues, especially when calculating things like rolling 12-month averages. In order to address this, we introduced a function that enables interpolation of this missing data, to fill in the data based on the values that are around it. 

Interpolation is an option available whenever you are adding or updating data. By default, interpolation is turned off, but if you pass interpolate = True to either of these functions, any missing data points will be replaced with interpolated values. To keep track of these values, a column entitled "Interpolate" is used. By default, all values in the "Interpolate" column are set to zero; whenever a value is interpolated, the function flips the corresponding entry in the "Interpolate" column to a one.

The interpolation is done using a polynomial fitting function. By default, we've set the order of the polynomial to 5. We've found that it works best for most datasets; however, feel free to adjust this value as you like to suit your own data.

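Let's call the update function, this time passing a value of True for `interpolate`. The polynomial order can be changed in the same call (for example, to 4), and our second example spreadsheet, example2.xlsx, has a couple of missing data points to experiment with. Note that the recorded cell below uses example.xlsx and the default polynomial order; the ValueError in its output reports that an empty array was passed to the fitting routine, which suggests there were no points for it to work with.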

In [9]:
hdf5_interface.update_existing_panel_entry('demo_filename', 'example.xlsx', 'solar_installation_name', interpolate = True)

ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
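To illustrate the idea of polynomial gap filling together with an "Interpolate" flag, here is a small sketch using `numpy.polyfit`. This is an assumption-laden stand-in rather than ARCTIC's implementation: the function name `fill_gaps_sketch` is hypothetical, missing values are assumed to be stored as NaN, and the sample data is an exact quadratic so that a second-order fit recovers it.

```python
import numpy as np


def fill_gaps_sketch(values, order=5):
    """Fill NaN entries with a polynomial fitted to the known points.

    Returns the filled array and a 0/1 flag array marking filled entries.
    """
    values = np.asarray(values, dtype=float)
    x = np.arange(values.size)
    missing = np.isnan(values)
    # Fit only on the points that were actually reported.
    # Note: the order should be smaller than the number of known points.
    coeffs = np.polyfit(x[~missing], values[~missing], order)
    filled = values.copy()
    filled[missing] = np.polyval(coeffs, x[missing])
    return filled, missing.astype(int)


# Hypothetical monthly energy values with two gaps.
energy = [1.0, 4.0, np.nan, 16.0, 25.0, np.nan, 49.0]
filled, interpolate_flag = fill_gaps_sketch(energy, order=2)

print(filled)
print(interpolate_flag)
```

A lower order is used here because the example has only five known points; with real data, the order is a tuning choice, as noted above.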

***

### Deleting Panels

Sometimes we make mistakes, and accidentally add the wrong data, or give something the wrong name. The delete function enables the removal of a panel entry. Simply pass the name of the HDF5 file, the location name, and the panel name, and the function deletes that panel. Don't worry, it'll check to make sure you actually want to do so, in case you get cold feet.
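As a rough illustration of what deleting a panel involves at the `h5py` level, here is a sketch. The name `delete_panel_sketch` is hypothetical (it is not the ARCTIC function), and the confirmation step is modeled as a `confirmed` argument rather than an interactive prompt so that the example runs unattended.

```python
import os
import tempfile

import h5py
import numpy as np


def delete_panel_sketch(h5_path, location, panel_name, confirmed=False):
    """Delete an installation group, but only if the caller confirms."""
    if not confirmed:
        return False  # nothing is touched without confirmation
    with h5py.File(h5_path, "r+") as f:
        del f[location][panel_name]  # raises KeyError if it does not exist
    return True


# Set up a small file with one installation.
path = os.path.join(tempfile.mkdtemp(), "delete_sketch.h5")
with h5py.File(path, "w") as f:
    f.create_group("Juneau/example_install").create_dataset(
        "Energy", data=np.array([310.0, 295.5])
    )

first_try = delete_panel_sketch(path, "Juneau", "example_install")
with h5py.File(path, "r") as f:
    still_there = "example_install" in f["Juneau"]

second_try = delete_panel_sketch(path, "Juneau", "example_install", confirmed=True)
with h5py.File(path, "r") as f:
    gone = "example_install" not in f["Juneau"]

print(first_try, still_there, second_try, gone)
```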

***

In addition to these user-facing functions, there are three 'hidden' helper functions: `extract_file_to_dataframe`, `month_string_to_int`, and `HDF5_to_dataframe`. The first opens the Excel spreadsheet or csv file from your hard drive and loads it into a pandas dataframe. This dataframe is the structure through which all the other functions interact with your data, so it's an important step, but it isn't something a user is likely to need to call directly; however, if you find that some modification to how the data is read in is needed, this is the place to make it. The second function converts month values to integers if they are given as strings. HDF5 gets difficult to interact with if you use strings, and this approach was simpler for us than working with HDF5's string storage, but again, modification here should be straightforward. The third function is used by a lot of the later functionality: it takes data that is stored in an HDF5 file and inserts it into a pandas dataframe.
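As an example of the kind of conversion the month helper performs, here is a small standalone sketch. The name `month_to_int_sketch` and the accepted spellings (full English names, three-letter abbreviations, or numbers) are assumptions made for illustration; the actual `month_string_to_int` may accept a different set of inputs.

```python
MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
# Map the first three letters of each month to its number (1-12).
MONTH_LOOKUP = {name[:3]: number for number, name in enumerate(MONTH_NAMES, start=1)}


def month_to_int_sketch(value):
    """Convert 'January', 'jan', '1', or 1 to the integer month number."""
    if isinstance(value, str):
        text = value.strip().lower()
        if text.isdigit():
            return int(text)
        return MONTH_LOOKUP[text[:3]]  # raises KeyError for unknown names
    return int(value)


print([month_to_int_sketch(m) for m in ["January", "sep", " Dec ", "7", 3]])
```

Storing months as integers keeps the Month dataset numeric, which sidesteps HDF5 string handling entirely.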

***

## Interacting with data in the HDF5 file

### Note! This section is optional. If you intend to interact with the HDF5 files only through our built-in functions, you can skip it.

Interacting with HDF5 files using the python package `h5py` should feel relatively familiar to those already comfortable with python's numpy and pandas packages. Additionally, it is possible to reference internal groups using a POSIX-style naming convention. For example, if you have a file loaded as my_file, and an internal location called Fairbanks, you can directly refer to the Fairbanks location as `my_file['/Fairbanks']`. You can also store the path in a variable and use that as a reference. We'll walk through both here.

In order to interact with the internal structure of the HDF5 files, we'll first need to import the `h5py` package. Then, we'll load our demo file from above into this notebook. There are several modes in which an HDF5 file can be opened. Here we open the file in read-only mode, "r", as we only want to read the file, not write to it. We also want to be sure that the file exists; if it doesn't, opening it in "r" mode will raise an error.

Note that you'll get yourself into a lot of trouble if you open the SAME HDF5 file in multiple places and try to read and write it. This will likely corrupt the data. If you're going to be doing data editing, we highly recommend backing up your HDF5 file prior to these edits, in case you cause any issues to the file.
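One way to reduce the risk described above is to back up the file before editing and to open it inside a `with` block, so that it is always closed when you are done. The sketch below uses a throwaway file with made-up names; it is a suggestion, not part of the ARCTIC package.

```python
import os
import shutil
import tempfile

import h5py

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo_sketch.h5")
with h5py.File(path, "w") as f:
    f.create_group("Juneau")

# Copy the closed file before doing any editing.
backup_path = path + ".bak"
shutil.copy2(path, backup_path)

# Read-only access; the file is closed automatically at the end of the block.
with h5py.File(path, "r") as f:
    locations = list(f.keys())
    mode = f.mode

# Opening a file that does not exist in "r" mode raises an error.
try:
    h5py.File(os.path.join(workdir, "missing.h5"), "r")
    missing_raised = False
except OSError:
    missing_raised = True

print(locations, mode, missing_raised)
```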

In [6]:
import h5py

In [7]:
#First step is to load our earlier file from the hard drive
my_file = h5py.File("demo_filename.hdf5", 'r')

In [11]:
#Then, check to make sure that everything went well by printing.
my_file

<HDF5 file "demo_filename.hdf5" (mode r+)>

Now, we'll move to loading the subgroup underneath it, which in our case represents the location of the panel. In the examples above, we've given data for a fictitious panel in Juneau. So, to load that location, we'll have to look within our broader HDF5 file for that location name. To do this, we use the `.get` command on the "my_file" variable that we loaded our HDF5 file into. The `.get` command looks for the given name at the first layer of the HDF5 file, and if it finds it, loads that location into a new variable. If it doesn't find it, it returns `None`.

In [8]:
#Make sure to pass the location as a string!
my_subgroup = my_file.get('Juneau')

In [10]:
#Check to be sure that the value isn't None!
my_subgroup

<HDF5 group "/Juneau" (10 members)>

We can see that when we print the whole subgroup, the program returns `<HDF5 group "/Juneau" (10 members)>` as a response. This shows us not only that it found the location subgroup of Juneau, but also that Juneau contains ten panels. We can make more specific queries of what Juneau contains by using functions to probe the internal structure, rather than just printing the subgroup. The most important function is `.keys`, which we will go through here; more detailed info can be found in the [h5py documentation](http://docs.h5py.org/en/stable/index.html).

In [12]:
my_subgroup.keys()

<KeysViewHDF5 ['solar_installation_name', 'solar_installation_name1', 'solar_installation_name10', 'solar_installation_name2', 'solar_installation_name3', 'solar_installation_name4', 'solar_installation_name5', 'solar_installation_name6', 'solar_installation_name8', 'solar_installation_name9']>

We can see that it returns a list-like view which contains the names of all of the panels stored within the subgroup location, Juneau. Note that h5py sorts the output alphabetically, not by order of input into the HDF5 location; this is why 'solar_installation_name10' appears before 'solar_installation_name2' above.

If we have a specific panel we'd like to interact with within the subgroup, we can also explicitly call it using notation very similar to querying a column in a pandas dataframe.

In [14]:
my_subgroup['solar_installation_name']

<HDF5 group "/Juneau/solar_installation_name" (0 members)>

In [15]:
my_subgroup['Not_in_the_subgroup']

KeyError: "Unable to open object (object 'Not_in_the_subgroup' doesn't exist)"

We can see that if we call something that doesn't exist within our location, we'll get an error. 
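The different lookup styles can be compared side by side in a self-contained sketch. The file, location, and installation names below are made up; the calls themselves (`.get`, square-bracket indexing, `in`, and POSIX-style paths) are standard `h5py`.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "lookup_sketch.h5")
with h5py.File(path, "w") as f:
    f.create_group("Juneau/example_install")

with h5py.File(path, "r") as f:
    juneau = f.get("Juneau")       # returns the group
    nome = f.get("Nome")           # returns None instead of raising
    has_panel = "example_install" in juneau

    # Two routes to the same installation group.
    via_steps = juneau["example_install"]
    via_path = f["/Juneau/example_install"]
    same_name = via_steps.name == via_path.name

    # Square brackets raise KeyError for a missing name.
    try:
        f["Nome"]
        key_error_raised = False
    except KeyError:
        key_error_raised = True

print(nome, has_panel, same_name, key_error_raised)
```

Use `.get` (or an `in` check) when a missing name is an expected case, and square brackets when a missing name should stop the program.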

Ok, let's now look into loading a specific panel, and exploring the data within. We'll load the panel just like we loaded the Juneau location, although instead of referencing the name of our HDF5 file in the call, we'll reference the name of our Juneau location variable. Note that you can also skip the Juneau location reference, and directly call the HDF5 file using a POSIX style call.

In [16]:
#Using a hierarchical navigation
my_panel = my_subgroup.get('solar_installation_name')

In [17]:
my_panel

<HDF5 group "/Juneau/solar_installation_name" (0 members)>

In [25]:
#Using POSIX-style navigation
also_my_panel = my_file['/Juneau/solar_installation_name']

In [24]:
also_my_panel

<HDF5 group "/Juneau/solar_installation_name" (0 members)>

Ok, we now have a specific panel loaded. Let's start exploring what's inside of that panel data. A combination of `.keys` and `.get` provides a good tool to begin to do that. 

In [28]:
my_panel.keys()

<KeysViewHDF5 []>

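For a populated panel holding monthly data, the `.keys` call returns three keys: Energy, Month, and Year. In the recorded run above the panel was empty, so the key list is blank and the lookups that follow return `None` or raise errors. With data present, we would next interact with the "Energy" key, which loads the Energy dataset for that specific panel.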

In [29]:
my_panel_energy = my_panel.get("Energy")

We can also load our Energy data by using a pandas dataframe-like reference:

In [30]:
also_my_panel_energy = my_panel["Energy"]

KeyError: "Unable to open object (object 'Energy' doesn't exist)"

Once we have loaded our energy into a variable, we can interact with it much like a numpy array. 

In [31]:
#To access everything from entry number 4 onward:
my_panel_energy[4:]

TypeError: 'NoneType' object is not subscriptable

In [32]:
#We can print the contents as well, if desired.
print(my_panel_energy)

None


One thing to note! If you reference a loaded dataset as a whole, for example by typing `print(my_panel_energy)`, you will not receive its contents, but rather a description of the object: the fact that it is an HDF5 dataset, along with its shape and data type (i.e., an h5py Dataset object). If you want all of the values from inside "my_panel_energy", you will instead have to call it as below:

In [None]:
my_panel_energy[:]
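Since the recorded cells above ran against an empty panel, here is a self-contained sketch of the same dataset operations on a throwaway file with made-up values. It shows the difference between the dataset object and its contents, and numpy-style slicing.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "dataset_sketch.h5")
with h5py.File(path, "w") as f:
    panel = f.create_group("Juneau/example_install")
    panel.create_dataset("Energy", data=np.arange(6, dtype=float) * 10)

with h5py.File(path, "r") as f:
    energy = f["/Juneau/example_install/Energy"]
    summary = repr(energy)    # describes the dataset object, not its values
    all_values = energy[:]    # reads everything into a numpy array
    after_four = energy[4:]   # reads entries from index 4 onward

# The arrays stay usable after the file is closed; the dataset object does not.
print(summary)
print(all_values)
print(after_four)
```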

The final component to discuss is attributes. For our work, we've used attributes as a way to store the DC capacity of the panel installation, so it's always readily accessible. We'll first check what the names of the various attributes are, then we'll access the value stored under that name.

In [35]:
my_panel.attrs.keys()

<KeysViewHDF5 []>

In [36]:
my_panel.attrs["DC Capacity"]

KeyError: "Can't open attribute (can't locate attribute: 'DC Capacity')"

Great, so we can see how we would access the DC Capacity of our panels using attributes. Other information can also be stored in attributes, such as tilt angles, or inverter capacity, but our data did not contain that information consistently, so we have not chosen to include that information. It is relatively simple to add those keys by changing the `add_to_hdf5_file` and `update_existing_panel_entry` functions, if desired. 
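Here is a self-contained sketch of writing and reading attributes on a throwaway file. "DC Capacity" is the attribute name used above; "Tilt Angle", the installation name, and all values are hypothetical and only show how an extra attribute could be stored and how DC capacity could be used to normalize energy.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "attrs_sketch.h5")
with h5py.File(path, "w") as f:
    panel = f.create_group("Juneau/example_install")
    panel.create_dataset("Energy", data=np.array([300.0, 450.0]))
    panel.attrs["DC Capacity"] = 5.0
    panel.attrs["Tilt Angle"] = 30.0  # hypothetical extra attribute

with h5py.File(path, "r") as f:
    panel = f["Juneau/example_install"]
    attr_names = sorted(panel.attrs.keys())
    dc_capacity = float(panel.attrs["DC Capacity"])
    # .get returns a default instead of raising when the attribute is absent.
    inverter = panel.attrs.get("Inverter Capacity", None)
    normalized = panel["Energy"][:] / dc_capacity

print(attr_names, dc_capacity, inverter)
print(normalized)
```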

***

We hope that you found this demonstration useful in understanding our homebuilt functions, as well as the HDF5 functionality. 
~The ARCTIC Dev Team