This notebook uses utility functions from utils.pv_utils to read in and preprocess the PV data for exploratory data analysis (EDA). To run this notebook, first change FILEPATH to the location of the data on your machine.

In [6]:
from utils.pv_utils import read_files, read_hdf
import pandas as pd
import logging
logging.basicConfig(level=logging.INFO)

FILEPATH = '/Users/carterdemars/Desktop/pv_data_italy'

The Italy PV data is stored across three files. Two are CSV files containing metadata such as the names of the solar production sites, their geographical locations, output capacities, and panel install dates. The third contains the time series PV output data and is stored in HDF5 format. The tables can be joined on the name and system_id columns to provide a complete picture of the data.
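As a rough illustration of that join, here is a minimal sketch using two toy DataFrames in place of the metadata CSVs. It assumes that `name` in one table holds the same identifiers as `system_id` in the other; that reading, the extra column names (`capacity_kw`, `latitude`), and all values are invented for the example and should be checked against the real files.

```python
import pandas as pd

# Toy stand-ins for the two metadata CSVs; every value here is made up.
systems = pd.DataFrame({
    "system_id": [10441, 10708, 10878],
    "capacity_kw": [3.0, 5.5, 4.2],
})
locations = pd.DataFrame({
    "name": [10441, 10708, 99999],  # assumed to hold the same IDs as system_id
    "latitude": [45.1, 41.9, 40.0],
})

# Left join on the differently named key columns.
# validate="one_to_one" raises if either key column contains duplicates.
metadata = systems.merge(
    locations,
    left_on="system_id",
    right_on="name",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows marked "left_only" had no matching entry in the second table.
unmatched = metadata.loc[metadata["_merge"] == "left_only", "system_id"].tolist()
print(unmatched)
```

Using `indicator=True` makes it easy to spot systems that appear in one table but not the other before moving on to the time series.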

In [7]:
file_info = read_files(FILEPATH)

The HDF file contains 436 unique keys, most of which represent PV system IDs. There is also information about missing values, as well as some general statistics.
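To see how the keys break down, one option is to group them by their top-level prefix. The sketch below uses a short hand-written list of keys in the form this file uses (`/missing_dates`, `/statistics`, `/timeseries/<system_id>`); in the notebook, the list would instead come from `hdf.keys()`.

```python
from collections import Counter

# A small sample of keys; in the notebook, use the list returned by hdf.keys().
keys = [
    "/missing_dates",
    "/statistics",
    "/timeseries/10441",
    "/timeseries/10708",
]

# Count keys by their top-level group, e.g. "timeseries".
groups = Counter(key.strip("/").split("/")[0] for key in keys)

# Pull the numeric system IDs out of the time series keys.
system_ids = [
    int(key.rsplit("/", 1)[1])
    for key in keys
    if key.startswith("/timeseries/")
]

print(groups)
print(system_ids)
```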

In [8]:
# Collect every key in the HDF5 store
with pd.HDFStore(path=file_info['hdf_path']) as hdf:
    keys = hdf.keys()

for counter, key in enumerate(keys):
    logging.info(f"Reading {key}: {counter} out of {len(keys)}")

    # Only the key at index 2 (the first time series) is actually read
    if counter == 2:
        read_hdf(hdf_path=file_info['hdf_path'], key=key)

INFO:root:Reading /missing_dates: 0 out of 436
INFO:root:Reading /statistics: 1 out of 436
INFO:root:Reading /timeseries/10441: 2 out of 436
INFO:root:Reading /timeseries/10708: 3 out of 436
INFO:root:Reading /timeseries/10878: 4 out of 436
INFO:root:Reading /timeseries/10897: 5 out of 436
INFO:root:Reading /timeseries/10970: 6 out of 436
INFO:root:Reading /timeseries/10979: 7 out of 436
INFO:root:Reading /timeseries/10997: 8 out of 436
INFO:root:Reading /timeseries/11017: 9 out of 436
INFO:root:Reading /timeseries/11080: 10 out of 436
INFO:root:Reading /timeseries/11126: 11 out of 436
INFO:root:Reading /timeseries/11175: 12 out of 436
INFO:root:Reading /timeseries/11176: 13 out of 436
INFO:root:Reading /timeseries/11177: 14 out of 436
INFO:root:Reading /timeseries/11178: 15 out of 436
INFO:root:Reading /timeseries/11182: 16 out of 436
INFO:root:Reading /timeseries/12415: 17 out of 436
INFO:root:Reading /timeseries/12499: 18 out of 436
INFO:root:Reading /timeseries/12822: 19 out of 436

                     cumulative_energy_gen_Wh  instantaneous_power_gen_W  \
datetime                                                                   
2012-08-07 15:00:00                   19162.0                     2436.0   
2012-08-07 15:05:00                   19369.0                     2425.0   
2012-08-07 15:10:00                   19560.0                     2413.0   
2012-08-07 15:15:00                   19763.0                     2355.0   
2012-08-07 15:20:00                   19949.0                     2315.0   

                     temperature_C  voltage  
datetime                                     
2012-08-07 15:00:00           64.4    237.0  
2012-08-07 15:05:00           64.1    236.5  
2012-08-07 15:10:00           63.9    235.4  
2012-08-07 15:15:00           63.6    236.1  
2012-08-07 15:20:00           63.2    235.4  

Looking at the HDF file, there are