## Industrial Machines Dataset for Electrical Load Disaggregation
This notebook converts the factory load disaggregation dataset from 'nilm' format to 'json' format and analyzes it. The 'json' version is easier to inspect, since the 'nilmtk' module for Python has some flaws during installation and usage.
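The notebook is divided into 2 parts:
1) SKIPPABLE - nilm dataset loading and conversion: use of 'nilmtk' to load and convert the raw dataset (from https://ieee-dataport.org/open-access/industrial-machines-dataset-electrical-load-disaggregation)
2) Loading the json dataset: use of the Dataset class from dataset_functions.py to load the converted dataset and extract time intervals (day, week, month)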

### 1) NILM DATASET LOADING AND CONVERSION
The following installation instructions are needed only if you want to try out the nilmtk package.

To load and use the dataset, you can just call the Dataset class from dataset_functions.py (SEE SECOND PART OF THE NOTEBOOK).

Create a conda environment that uses Python 3.8 (in whatever way you prefer), then add this channel to your conda config:

In [None]:
!conda config --add channels conda-forge

Install the nilmtk package from a terminal: running the command from the notebook doesn't let you press 'y' to confirm the installation of the module.

In [None]:
!conda install -c nilmtk nilmtk

Copy the nilm_metadata folder into the conda environment, under .conda/lib/python3.8/site-packages/

Check that nilmtk and nilm_metadata are installed

In [None]:
!conda list

### Dataset conversion from nilm format file to json
Data is filtered to keep only entries of the form {date: {machine_name: {'power_apparent': value, 'current': value, 'voltage': value}}} for every date and every useful machine.
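To make the target layout concrete, here is a minimal sketch of one converted entry. It assumes each timestamp maps to a dictionary keyed by machine name; the timestamp, the machine names (`machine_a`, `machine_b`) and all values are made up for illustration and are not taken from the dataset.

```python
import json

# Hypothetical entry: timestamp, machine names and values are illustrative only.
sample = {
    "2018-01-01 00:00:00": {
        "machine_a": {"power_apparent": 1200.0, "current": 5.4, "voltage": 220.1},
        "machine_b": {"power_apparent": 800.0, "current": 3.6, "voltage": 219.8},
    }
}

# Round-trip through JSON to confirm the structure serializes cleanly.
restored = json.loads(json.dumps(sample))

# Walk the nesting: timestamp -> machine -> measurement names.
for timestamp, machines in restored.items():
    for name, measurements in machines.items():
        print(timestamp, name, sorted(measurements))
```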

Timestamps are shifted BACKWARDS by 12h since the machines in the dataset work at night, so 'original_date' becomes 'original_date-12h'.
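The shift can be sketched with pandas as follows. This is only an illustration of the arithmetic, not the code used by the converter; the two timestamps are hypothetical. One plausible effect of the shift, shown here, is that readings from a single night that crosses midnight end up on the same calendar date.

```python
import pandas as pd

# Hypothetical timestamps from one night, before and after midnight.
original = pd.to_datetime(["2018-01-01 22:00:00", "2018-01-02 03:00:00"])

# Shift every timestamp 12 hours backwards.
shifted = original - pd.Timedelta(hours=12)

# Both shifted timestamps now fall on the same calendar date.
print(shifted.date)
```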

In [1]:
from dataset_parser import load_dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


#change this to your dataset path
path_to_dataset = '../../brazilian_dataset/IMDELD.hdf5'
output_path_to_json_dataset = 'output/IMDELD.json'
#loads dataset (internally loads all the machines data)
dsh = load_dataset(path_to_dataset)

#print of machines names to check if everything is loaded correctly
machines = dsh.get_machines_ids()
for machine in machines:
    print(dsh.get_machine_name(machine))



In [None]:
#print of the first 5 rows of the first machine loaded
print(dsh.loaded_data[dsh.get_machines_ids()[0]].head())

In [None]:
#actual conversion to json
dsh.convert_nilm_to_json()

In [None]:
#save the dataset to a json file
dsh.save_dataset_to_json(output_path_to_json_dataset)

### 2) LOADING THE JSON DATASET
The original dataset does not cover every hour for every machine from start_time to end_time, so there are gaps. These gaps carry over to the .json dataset, so when requesting data for an interval that is not entirely covered, keep in mind that some entries may be missing.
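One way to spot such missing hours is to compare the timestamps that are present against a full hourly range. The sketch below uses hypothetical timestamps and plain pandas; it does not rely on the Dataset class.

```python
import pandas as pd

# Hypothetical hourly keys with two missing hours (02:00 and 03:00).
present = pd.to_datetime([
    "2018-01-01 00:00:00",
    "2018-01-01 01:00:00",
    "2018-01-01 04:00:00",
])

# Full hourly range between the first and last available timestamps.
expected = pd.date_range(present.min(), present.max(), freq=pd.Timedelta(hours=1))

# Hours that should exist but are not in the data.
missing = expected.difference(present)
print(missing)
```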

Check the file 'dataset_functions.py' to see how the Dataset class works. Here is an example that loads the .json version of the dataset and extracts some time intervals (day, week, month).

Check the .json file to see how data is formatted
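As a rough idea of what extracting a day from hourly entries involves, here is a minimal sketch on a plain dictionary. The helper `select_day` is hypothetical and is not the implementation in dataset_functions.py; the key format and the values are assumptions made for illustration.

```python
import pandas as pd

def select_day(data, day):
    """Return the entries of `data` whose timestamp key falls on `day`.

    Hypothetical helper: assumes keys are timestamp strings parseable by pandas.
    """
    start = pd.Timestamp(day).normalize()
    end = start + pd.Timedelta(days=1)
    return {
        key: value
        for key, value in data.items()
        if start <= pd.Timestamp(key) < end
    }

# Illustrative hourly entries (made-up machine name and values).
data = {
    "2018-01-01 23:00:00": {"machine_a": {"power_apparent": 1.0}},
    "2018-01-02 00:00:00": {"machine_a": {"power_apparent": 2.0}},
    "2018-01-02 05:00:00": {"machine_a": {"power_apparent": 3.0}},
    "2018-01-03 00:00:00": {"machine_a": {"power_apparent": 4.0}},
}

day_data = select_day(data, "2018-01-02")
print(list(day_data))
```

Hours missing from the data are simply absent from the result, so the returned dictionary can hold fewer than 24 entries.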

In [None]:
from dataset_functions import Dataset, plot_data
import pandas as pd
import matplotlib.pyplot as plt

datasetjson_path = 'output/IMDELD.json'

datasetjson = Dataset('IMDELD', datasetjson_path)
datasetjson.load()

machine_names = datasetjson.get_machine_names()
print(machine_names)

#the keys of the hourly entries are timestamps
print(datasetjson.data.keys())
#select the first row of the dataset
key0 = list(datasetjson.data.keys())[0]
print(datasetjson.data[key0])


#get the start and end time of the dataset (useful to check bounds)
start, end = datasetjson.get_start_end_time()

In [None]:
#pick a day (in this case the second day of the dataset)
day = start + pd.Timedelta(days=1)
#get the data for that day
hour_datas = datasetjson.get_data_day(day)
#daily data is a dictionary with the hour as key and the data as value
#each hour is a dictionary with the machine as key and the data as value
#each machine is a dictionary with 3 elements: 'power_apparent', 'voltage' and 'current'

#plot the data of all the machines for the entire day
plot_data(machine_names, hour_datas)

In [None]:
day = start + pd.Timedelta(days=5)
week_data = datasetjson.get_data_week(day)

plot_data(machine_names, week_data)

#print each timestamp followed by the data of every machine for that hour
for timestamp, machines_data in week_data.items():
    print(timestamp)
    for machine_entry in machines_data.items():
        print(machine_entry)
    print("---")

In [None]:
day = start + pd.Timedelta(days=1)
month_data = datasetjson.get_data_month(day)

plot_data(machine_names, month_data)

print(f"Month data for {day.month}/{day.year}, number of elements: {len(month_data)}")
#print each timestamp followed by the data of every machine for that hour
for timestamp, machines_data in month_data.items():
    print(timestamp)
    for machine_entry in machines_data.items():
        print(machine_entry)
    print("---")

In [None]:
#get the whole dataset (alternatively, just use datasetjson.data)
whole_data = datasetjson.get_data_start_end(start, end)

plot_data(machine_names, whole_data)