# OSIsoft Academic Hub Datasets Python Library

Version 0.9

Academic Hub datasets are hosted by OSIsoft Cloud Services (OCS, https://www.osisoft.com/solutions/cloud/vision/), a cloud-native real-time data infrastructure that lets users perform enterprise-wide analytics with the tools and languages of their choice.

**Raw operational data has specific characteristics making it difficult to deal with directly**, among them:

* variable data collection frequencies
* bad values (system error codes)
* data gaps 


**But data science projects need operational data to be:**

* **Time-aligned** to deal with the characteristics above in a consistent way according to the data type (e.g. interpolation for float values, repeating the last good value for categorical data)
* **Context-aware** so that the data is understandable across as many real-world assets as you need
* **Shaped and filtered** to ensure you have the data you need, in the form you need it
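
To make the time-alignment requirement concrete, here is a minimal local sketch in pandas. It only illustrates the idea (interpolate floats, repeat the last good value for categories) and makes no claim about how OCS implements it. The sample readings, the `I/O Timeout` error string, and the helper names `align_float` and `align_category` are all made up for this example.

```python
import pandas as pd

# Hypothetical raw float stream: irregular timestamps and one system error code
raw_temp = pd.Series(
    [60.0, "I/O Timeout", 62.0, 66.0],
    index=pd.to_datetime(
        ["2017-01-19 00:00", "2017-01-19 00:10", "2017-01-19 00:20", "2017-01-19 01:00"]
    ),
)
# Hypothetical raw categorical stream, recorded only when the value changes
raw_status = pd.Series(
    ["Fermentation", "Cooling"],
    index=pd.to_datetime(["2017-01-19 00:00", "2017-01-19 00:45"]),
)

# Common 30-minute grid on which both streams are aligned
grid = pd.date_range("2017-01-19 00:00", "2017-01-19 01:00", freq="30min")


def align_float(series, grid):
    """Drop bad values, then interpolate linearly in time onto the grid."""
    good = pd.to_numeric(series, errors="coerce").dropna()
    combined = good.reindex(good.index.union(grid))
    return combined.interpolate(method="time").reindex(grid)


def align_category(series, grid):
    """Repeat the last known value at each grid point."""
    combined = series.reindex(series.index.union(grid))
    return combined.ffill().reindex(grid)


aligned = pd.DataFrame(
    {"Temperature": align_float(raw_temp, grid), "Status": align_category(raw_status, grid)}
)
print(aligned)
```

Both streams end up on the same three timestamps, so they can sit side by side in one dataframe even though they were recorded at different times.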

**The OCS solution for application-ready data is Data Views:**

![](https://academichub.blob.core.windows.net/images/piworld-dse-dataview-p2.png)

**Each Academic Hub dataset comes with a set of asset-centric data views.** The goal of the Academic Hub Python library is to provide generic and consistent access to:

* the list of existing datasets
* for a given dataset:
  * the list of its assets
  * the OCS namespace where the dataset is hosted
* for a given asset, the list of data views it belongs to

The rest of this notebook is a working example of the functionality listed above. 

## Install Academic Hub Python library 

In [1]:
!pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple ocs-academic-hub==0.78.0

Looking in indexes: https://test.pypi.org/simple/, https://pypi.org/simple


## Use `pip uninstall` only in case of library issues

In [2]:
# It is sometimes necessary to uninstall previous versions: uncomment and run the following line, then restart the kernel and reinstall using the previous cell
# !pip uninstall -y ocs-academic-hub ocs-sample-library-preview

# WARNING: uncomment only for testing
#%env OCS_HUB_CONFIG=config.ini

env: OCS_HUB_CONFIG=config.ini


## Import HubClient, necessary to connect and interact with OCS

In [3]:
from ocs_academic_hub import HubClient

## Running the following cell initiates the login sequence

Return to this web page when done

In [4]:
hub = HubClient()

> configuration file: config.ini


## Get list of published hub datasets

NOTE: currently only `Deschutes.Brewery` supports the new interface. This notebook is specifically about this dataset.

In [5]:
hub.datasets()

['Deschutes-v1', 'UCDavis.Facilities', 'Deschutes.Brewery']

## Display current active dataset

NOTE: switching the active dataset will become possible once other datasets support the new asset interface.

In [6]:
hub.current_dataset()

'Deschutes.Brewery'

## Get list of assets with Data Views

In [7]:
hub.assets()

['BB11',
 'BB12',
 'BB13',
 'BB14',
 'BB15',
 'C2_BBL1',
 'C2_FT1',
 'C2_PS1',
 'C2_PS2',
 'FV31',
 'FV32',
 'FV33',
 'FV34',
 'FV35',
 'FV36']

## List of all Data Views

These are all single-asset Data Views.

In [8]:
hub.asset_dataviews()

['deschutes-adf_prediction-fv31',
 'deschutes-adf_prediction-fv32',
 'deschutes-adf_prediction-fv33',
 'deschutes-adf_prediction-fv34',
 'deschutes-adf_prediction-fv35',
 'deschutes-adf_prediction-fv36',
 'deschutes-all_columns-bb11',
 'deschutes-all_columns-bb12',
 'deschutes-all_columns-bb13',
 'deschutes-all_columns-bb14',
 'deschutes-all_columns-bb15',
 'deschutes-all_columns-c2_bbl1',
 'deschutes-all_columns-c2_ft1',
 'deschutes-all_columns-c2_ps1',
 'deschutes-all_columns-c2_ps2',
 'deschutes-all_columns-fv31',
 'deschutes-all_columns-fv32',
 'deschutes-all_columns-fv33',
 'deschutes-all_columns-fv34',
 'deschutes-all_columns-fv35',
 'deschutes-all_columns-fv36',
 'deschutes-cooling_prediction-fv31',
 'deschutes-cooling_prediction-fv32',
 'deschutes-cooling_prediction-fv33',
 'deschutes-cooling_prediction-fv34',
 'deschutes-cooling_prediction-fv35',
 'deschutes-cooling_prediction-fv36',
 'deschutes-pca-fv31',
 'deschutes-pca-fv32',
 'deschutes-pca-fv33',
 'deschutes-pca-fv34',
 '

## List of Data Views exclusive to Fermenter Vessel #32 (FV32)

In [9]:
dvs_fv32 = hub.asset_dataviews(asset="FV32")
dvs_fv32

['deschutes-adf_prediction-fv32',
 'deschutes-all_columns-fv32',
 'deschutes-cooling_prediction-fv32',
 'deschutes-pca-fv32']
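
To build the same listing for every asset at once, one option is to loop over `hub.assets()` and call `hub.asset_dataviews(asset=...)` for each, using only the two calls shown above. In the sketch below, `FakeHub` is a hypothetical stand-in (not part of the library) that returns a few of the values listed in this notebook so the code runs without logging in, and `dataviews_by_asset` is a helper name made up here. With a real session, pass the `hub` object instead.

```python
class FakeHub:
    """Hypothetical offline stand-in for HubClient (not part of the library)."""

    _views = {
        "FV31": ["deschutes-adf_prediction-fv31", "deschutes-all_columns-fv31"],
        "FV32": ["deschutes-adf_prediction-fv32", "deschutes-all_columns-fv32"],
    }

    def assets(self):
        return sorted(self._views)

    def asset_dataviews(self, asset=None):
        return list(self._views[asset])


def dataviews_by_asset(hub):
    """Map each asset name to the list of its single-asset data views."""
    return {asset: hub.asset_dataviews(asset=asset) for asset in hub.assets()}


mapping = dataviews_by_asset(FakeHub())
for asset, views in mapping.items():
    print(asset, len(views), "data views")
```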

## List Multi-Asset Data Views Containing FV32

The column `Asset_Id` in data view results indicates which asset the row of data belongs to 

In [10]:
hub.asset_dataviews(asset="FV32", single_asset=False)

['deschutes-adf_prediction-fv31-36',
 'deschutes-all_columns-fv31-36',
 'deschutes-cooling_prediction-fv31-36',
 'deschutes-pca-fv31-36']
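
The identifiers listed in this notebook all appear to follow a `<dataset>-<view>-<asset(s)>` pattern. This is an observation from the output above, not a documented guarantee, so the helper below (`split_dataview_id` is a name made up here) should be read as a convenience sketch that may break if the naming changes.

```python
from collections import defaultdict


def split_dataview_id(dataview_id):
    """Split an ID such as 'deschutes-pca-fv31-36' into (dataset, view, assets).

    Assumes the '<dataset>-<view>-<assets>' pattern observed in this notebook.
    """
    dataset, view, assets = dataview_id.split("-", 2)
    return dataset, view, assets


# A few identifiers copied from the listings above
ids = [
    "deschutes-all_columns-fv32",
    "deschutes-pca-fv32",
    "deschutes-all_columns-fv31-36",
    "deschutes-pca-fv31-36",
]

# Group the asset part of each identifier by view name
by_view = defaultdict(list)
for dv_id in ids:
    _, view, assets = split_dataview_id(dv_id)
    by_view[view].append(assets)

print(dict(by_view))
```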

## Get the OCS namespace associated to the dataset

Each dataset belongs to a namespace within the Academic Hub OCS account. Since a dataset may move over time, the function below always returns the active namespace for the given dataset.

In [11]:
namespace_id = hub.namespace_of("Deschutes.Brewery")
namespace_id

'academic_hub_01'

## Get Data View structure

The structure lists, for each stream, its name, the column name under which the stream data appears, its value type, and its engineering units if available. The cell below displays the structure of the default data view.

In [12]:
dataview_id = hub.asset_dataviews(asset="FV32", filter="default")[0]
print(dataview_id)
hub.dataview_definition(namespace_id, dataview_id)

deschutes-all_columns-fv32


| | Asset_Id | OCS_StreamName | DV_Column | Value_Type | EngUnits |
|---|---|---|---|---|---|
| 0 | FV32 | B2_CL_C2_FV32/ADF | ADF | Float | |
| 1 | FV32 | B2_CL_C2_FV32/BRAND.CV | Brand | Category | |
| 2 | FV32 | B2_CL_C2_FV32/DcrsFvFullPlato | FV Full Plato | Float | Plato |
| 3 | FV32 | B2_CL_C2_FV32/Diacetyl | Diacetyl | Integer | ppb |
| 4 | FV32 | B2_CL_C2_FV32/EndPhaseTime.CV | End Phase Time | Float | m |
| 5 | FV32 | B2_CL_C2_FV32/Fermentation_Start_Time | Fermentation Start Time | Timestamp | |
| 6 | FV32 | B2_CL_C2_FV32/Integrator Key | Integrator Key | Float | |
| 7 | FV32 | B2_CL_C2_FV32/Phase Duration | Phase Duration | Integer | |
| 8 | FV32 | B2_CL_C2_FV32/Plato | Plato | Float | Plato |
| 9 | FV32 | B2_CL_C2_FV32/Predicted Transition | Predicted Transition | String | |
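
The definition can help decide how to treat each column, for example by separating numeric columns from categorical ones. The sketch below assumes the definition is available as a pandas DataFrame with the `DV_Column` and `Value_Type` columns shown above. To stay runnable offline, it rebuilds a few rows by hand rather than calling the hub.

```python
import pandas as pd

# A few rows copied from the definition shown above
definition = pd.DataFrame(
    {
        "DV_Column": ["ADF", "Brand", "Diacetyl", "Plato", "Predicted Transition"],
        "Value_Type": ["Float", "Category", "Integer", "Float", "String"],
        "EngUnits": [None, None, "ppb", "Plato", None],
    }
)

# Column names grouped by value type
columns_by_type = definition.groupby("Value_Type")["DV_Column"].apply(list).to_dict()

# Columns that can reasonably be treated as numeric
numeric_columns = columns_by_type.get("Float", []) + columns_by_type.get("Integer", [])
print(numeric_columns)
```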


## Getting data from a Data View

Returns interpolated data between a start and an end date, at the requested interpolation interval (format is HH:MM:SS).

In [13]:
df_fv32 = hub.dataview_interpolated_pd(namespace_id, dataview_id, "2017-01-19", "2020-01-19", "00:30:00")
df_fv32

+++++++++++++++++++
  ==> Finished 'dataview_interpolated_pd' in       52.7742 secs [ 996 rows/sec ]


| | Timestamp | Asset_Id | ADF | FV Full Plato | Diacetyl | End Phase Time | Fermentation Start Time | Integrator Key | Phase Duration | Plato | ... | Bottom TIC SP | Middle TIC OUT | Middle TIC PV | Middle TIC SP | Top TIC OUT | Top TIC PV | Top TIC SP | Brand | Status | Yeast Strain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-19 00:00:00 | FV32 | 0.104108 | 13.506092 | | | 2017-01-18T05:59:56.6180112Z | 12.10000 | | 12.10000 | ... | 63.0 | 17.355146 | 63.032608 | 63.0 | 8.699567 | 63.067005 | 63.0 | Realtime Hops | Fermentation | NCYC1187 |
| 1 | 2017-01-19 00:30:00 | FV32 | 0.104108 | 13.506092 | | | 2017-01-18T05:59:56.6180112Z | 12.10000 | | 12.10000 | ... | 63.0 | 7.568751 | 63.035385 | 63.0 | 20.299335 | 63.055607 | 63.0 | Realtime Hops | Fermentation | NCYC1187 |
| 2 | 2017-01-19 01:00:00 | FV32 | 0.104108 | 13.506092 | | | 2017-01-18T05:59:56.6180112Z | 12.10000 | | 12.10000 | ... | 63.0 | 4.256329 | 63.019897 | 63.0 | 46.984615 | 63.241917 | 63.0 | Realtime Hops | Fermentation | NCYC1187 |
| 3 | 2017-01-19 01:30:00 | FV32 | 0.104108 | 13.506092 | | | 2017-01-18T05:59:56.6180112Z | 12.10000 | | 12.10000 | ... | 63.0 | 17.092400 | 63.079906 | 63.0 | 74.284065 | 63.305080 | 63.0 | Realtime Hops | Fermentation | NCYC1187 |
| 4 | 2017-01-19 02:00:00 | FV32 | 0.104108 | 13.506092 | | | 2017-01-18T05:59:56.6180112Z | 12.10000 | | 12.10000 | ... | 63.0 | 46.463210 | 63.176716 | 63.0 | 42.702755 | 63.150780 | 63.0 | Realtime Hops | Fermentation | NCYC1187 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52556 | 2020-01-18 22:00:00 | FV32 | 0.756629 | 16.295610 | 25.0 | 2880.0 | 2019-12-16T09:28:19.6750028Z | 3.96588 | 6803.0 | 3.96588 | ... | 30.0 | 6.131247 | 30.028662 | 30.0 | 0.000000 | 29.951517 | 30.0 | Trois Lacs | Ready to Transfer | a38 |
| 52557 | 2020-01-18 22:30:00 | FV32 | 0.756629 | 16.295610 | 25.0 | 2880.0 | 2019-12-16T09:28:19.6750028Z | 3.96588 | 6833.0 | 3.96588 | ... | 30.0 | 0.000000 | 29.966312 | 30.0 | 38.112650 | 30.199999 | 30.0 | Trois Lacs | Ready to Transfer | a38 |
| 52558 | 2020-01-18 23:00:00 | FV32 | 0.756629 | 16.295610 | 25.0 | 2880.0 | 2019-12-16T09:28:19.6750028Z | 3.96588 | 6863.0 | 3.96588 | ... | 30.0 | 0.000000 | 30.089066 | 30.0 | 42.780502 | 30.199999 | 30.0 | Trois Lacs | Ready to Transfer | a38 |
| 52559 | 2020-01-18 23:30:00 | FV32 | 0.756629 | 16.295610 | 25.0 | 2880.0 | 2019-12-16T09:28:19.6750028Z | 3.96588 | 6893.0 | 3.96588 | ... | 30.0 | 0.000000 | 30.011000 | 30.0 | 42.780502 | 30.199999 | 30.0 | Trois Lacs | Ready to Transfer | a38 |
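
Because these requests can take a while, it can help to estimate the result size before choosing an interval. The sketch below assumes one row per grid point with both the start and the end date included. That assumption is inferred from the row counts reported by `info()` in this notebook, not from library documentation. The helper name `expected_rows` is made up here.

```python
import pandas as pd


def expected_rows(start, end, interval, n_assets=1):
    """Estimate the number of rows of an interpolated request.

    Assumes one row per grid point, start and end both included,
    repeated once per asset in the data view.
    """
    steps = (pd.Timestamp(end) - pd.Timestamp(start)) // pd.Timedelta(interval)
    return (steps + 1) * n_assets


# Single-asset request used in this notebook (three years, every 30 minutes)
print(expected_rows("2017-01-19", "2020-01-19", "00:30:00"))  # 52561

# Multi-asset request used in this notebook (six vessels, every 2 minutes)
print(expected_rows("2017-02-01", "2017-08-01", "00:02:00", n_assets=6))  # 781926
```

Both estimates agree with the `RangeIndex` sizes reported in this notebook (52561 and 781926 entries).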


In [14]:
# Information about the dataframe
df_fv32.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52561 entries, 0 to 52560
Data columns (total 34 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Timestamp                52561 non-null  datetime64[ns]
 1   Asset_Id                 52561 non-null  object        
 2   ADF                      30540 non-null  float64       
 3   FV Full Plato            36987 non-null  float64       
 4   Diacetyl                 32934 non-null  float64       
 5   End Phase Time           37646 non-null  float64       
 6   Fermentation Start Time  44539 non-null  object        
 7   Integrator Key           51627 non-null  float64       
 8   Phase Duration           5244 non-null   float64       
 9   Plato                    34735 non-null  float64       
 10  Predicted Transition     16973 non-null  object        
 11  Deviation                21289 non-null  float64       
 12  VesselID                 52561 n
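
The `Fermentation Start Time` column has dtype `object`, meaning it holds ISO 8601 strings rather than datetimes. The sketch below parses it with `pd.to_datetime` and derives the elapsed time since fermentation start. It uses a small hand-built sample so it runs offline, and it assumes the `Timestamp` column is expressed in UTC, which this notebook does not state. `Hours Since Start` is a column name made up for this example.

```python
import pandas as pd

# Sample rows with values taken from the table above, plus one missing start time
sample = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            ["2017-01-19 00:00:00", "2017-01-19 00:30:00", "2017-01-19 01:00:00"]
        ),
        "Fermentation Start Time": [
            "2017-01-18T05:59:56.6180112Z",
            "2017-01-18T05:59:56.6180112Z",
            None,
        ],
    }
)

# Parse the ISO 8601 strings; missing values become NaT
start = pd.to_datetime(sample["Fermentation Start Time"], utc=True)

# Assumption: the Timestamp column is in UTC
now = sample["Timestamp"].dt.tz_localize("UTC")

# Elapsed time in hours; rows without a start time yield NaN
sample["Hours Since Start"] = (now - start).dt.total_seconds() / 3600
print(sample["Hours Since Start"].round(2).tolist())
```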

## Data Views with multiple assets

Some Data Views return data for fermenter vessels 31 through 36. The cell below shows how to get their names.

In [15]:
multi_asset_dvs = hub.asset_dataviews(single_asset=False)
multi_asset_dvs

['deschutes-adf_prediction-fv31-36',
 'deschutes-all_columns-fv31-36',
 'deschutes-cooling_prediction-fv31-36',
 'deschutes-pca-fv31-36']

## Get data from a multi-asset Data View

The column "Asset_Id" indicates which asset the data row belongs to. The data order is all data for FV31 in increasing time, followed by FV32 and so on up to FV36. 

NOTE: the resulting dataframe has almost 800K rows. 

In [16]:
df_fv31_36 = hub.dataview_interpolated_pd(namespace_id, multi_asset_dvs[0], "2017-02-01", "2017-08-01", "00:02:00")
df_fv31_36

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  ==> Finished 'dataview_interpolated_pd' in       31.8837 secs [ 24.52K rows/sec ]


| | Timestamp | Asset_Id | ADF | FV Full Plato | Fermentation Start Time | Plato | Brand | Status |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-01 00:00:00 | FV31 | 0.719046 | 17.084625 | 2017-01-26T07:30:03.2369995Z | 4.80000 | Grey Horse | Cooling |
| 1 | 2017-02-01 00:02:00 | FV31 | 0.719046 | 17.084625 | 2017-01-26T07:30:03.2369995Z | 4.80000 | Grey Horse | Cooling |
| 2 | 2017-02-01 00:04:00 | FV31 | 0.719046 | 17.084625 | 2017-01-26T07:30:03.2369995Z | 4.80000 | Grey Horse | Cooling |
| 3 | 2017-02-01 00:06:00 | FV31 | 0.719046 | 17.084625 | 2017-01-26T07:30:03.2369995Z | 4.80000 | Grey Horse | Cooling |
| 4 | 2017-02-01 00:08:00 | FV31 | 0.719046 | 17.084625 | 2017-01-26T07:30:03.2369995Z | 4.80000 | Grey Horse | Cooling |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 781921 | 2017-07-31 23:52:00 | FV36 | 0.787345 | 17.026010 | 2017-07-25T00:43:17.1560056Z | 3.62067 | Grey Horse | Cooling |
| 781922 | 2017-07-31 23:54:00 | FV36 | 0.787345 | 17.026010 | 2017-07-25T00:43:17.1560056Z | 3.62067 | Grey Horse | Cooling |
| 781923 | 2017-07-31 23:56:00 | FV36 | 0.787345 | 17.026010 | 2017-07-25T00:43:17.1560056Z | 3.62067 | Grey Horse | Cooling |
| 781924 | 2017-07-31 23:58:00 | FV36 | 0.787345 | 17.026010 | 2017-07-25T00:43:17.1560056Z | 3.62067 | Grey Horse | Cooling |


In [17]:
df_fv31_36.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 781926 entries, 0 to 781925
Data columns (total 8 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   Timestamp                781926 non-null  datetime64[ns]
 1   Asset_Id                 781926 non-null  object        
 2   ADF                      458834 non-null  float64       
 3   FV Full Plato            627173 non-null  float64       
 4   Fermentation Start Time  646059 non-null  object        
 5   Plato                    480450 non-null  float64       
 6   Brand                    781926 non-null  object        
 7   Status                   781926 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(4)
memory usage: 47.7+ MB
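
Since all assets are stacked in one dataframe and identified by `Asset_Id`, a common next step is to separate them again. The sketch below shows two plain pandas options on a small stand-in dataframe. The values are illustrative and only the column names come from this notebook.

```python
import pandas as pd

# Small stand-in for a multi-asset result (values are illustrative)
df = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(["2017-02-01 00:00", "2017-02-01 00:02"] * 2),
        "Asset_Id": ["FV31", "FV31", "FV32", "FV32"],
        "Plato": [4.8, 4.8, 5.1, 5.0],
    }
)

# Option 1: one dataframe per asset, keyed by asset name
per_asset = {
    asset: frame.reset_index(drop=True) for asset, frame in df.groupby("Asset_Id")
}

# Option 2: a wide table with one row per timestamp and one Plato column per asset
wide = df.pivot(index="Timestamp", columns="Asset_Id", values="Plato")
print(wide)
```

The dictionary form suits per-vessel models, while the wide form makes it easy to compare vessels at the same timestamp.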


## Refresh datasets information 

When new datasets are published or existing ones are extended, you can access the updated information using `refresh_datasets`.

In [18]:
hub.refresh_datasets()

@ Hub data file: hub_datasets.json
@ Current dataset: Deschutes.Brewery
