# Brewery Dataset Quick Start

Version 2.1

The Brewery dataset of Academic Hub is hosted on OSIsoft Cloud Services (OCS, https://www.osisoft.com/solutions/cloud/vision/), a cloud-native real-time data infrastructure that lets users perform enterprise-wide analytics using the tools and languages of their choice.

<div class="alert alert-info">
<b>For documentation about the Brewery dataset itself, please go to <a href="https://data.academic.osisoft.com/nbviewer/github/academic-hub/datasets/blob/master/Brewery_Dataset_Doc.ipynb">https://data.academic.osisoft.com/nbviewer/github/academic-hub/datasets/blob/master/Brewery_Dataset_Doc.ipynb</a></b>
</div>


**Raw operational data has specific characteristics making it difficult to deal with directly**, among them:

* variable data collection frequencies
* bad values (system error codes)
* data gaps 


**But data science projects against operational data need data that is:**

* **Time-aligned** to deal with the characteristics above in a consistent way according to the data type (e.g. interpolation for float values, repeating the last good value for categorical data, etc.)
* **Context-aware** so that the data is understandable across as many real-world assets as you need
* **Shaped and filtered** to ensure you have the data you need, in the form you need it
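
As an illustration of time alignment, the sketch below uses plain pandas on made-up readings (the column names `temperature` and `status` and all values are hypothetical, not taken from the Brewery dataset). It places irregular samples onto a regular grid, interpolating the float column and repeating the last known value for the categorical one. It is not a description of how OCS builds Data Views; it only illustrates the idea.

```python
import pandas as pd

# Hypothetical irregular readings: one float sensor and one categorical status.
raw = pd.DataFrame(
    {
        "temperature": [63.0, 64.0, 66.0],
        "status": ["Fermentation", "Fermentation", "Cooling"],
    },
    index=pd.to_datetime(
        ["2017-01-19 00:00", "2017-01-19 00:20", "2017-01-19 01:00"]
    ),
)

# Regular 10-minute grid covering the raw samples.
grid = pd.date_range(raw.index.min(), raw.index.max(), freq="10min")

# Work on the union of raw and grid timestamps so no raw sample is lost,
# then keep only the grid points.
combined = raw.index.union(grid)

# Float values: linear interpolation in time between neighbouring samples.
temperature = (
    raw["temperature"].reindex(combined).interpolate(method="time").reindex(grid)
)

# Categorical values: repeat the last known value (forward fill).
status = raw["status"].reindex(combined).ffill().reindex(grid)

aligned = pd.DataFrame({"temperature": temperature, "status": status})
print(aligned)
```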

**The OCS solution for application-ready data is Data Views:**

![](https://academichub.blob.core.windows.net/images/piworld-dse-dataview-p2.png)

**Each Academic Hub dataset comes with a set of asset-centric data views.** The goal of the Academic Hub Python library is to provide a generic and consistent way to:

* list the existing datasets
* for a given dataset:
  * get the list of its assets
  * get the OCS namespace where the dataset is hosted
* for a given asset, get the list of data views it belongs to

<div class="alert alert-info">
<b>The rest of this notebook is a working example of the functionality listed above for the Brewery dataset</b>
</div>

## Install Academic Hub Python library 

In [55]:
!pip install ocs-academic-hub==0.99.40





## Use `pip uninstall` only in case of library issues

In [56]:
# It is sometimes necessary to uninstall previous versions. If so, uncomment and run the following line, then restart the kernel and reinstall with the previous cell.
# !pip uninstall -y ocs-academic-hub ocs-sample-library-preview

## Import hub_login, necessary to connect and interact with OCS

In [57]:
from ocs_academic_hub.datahub import hub_login

## Running the following cell initiates the login sequence

**Execute the cell below and follow the indicated steps to log in (an AVEVA banner will appear)**

In [58]:
widget, hub = hub_login()
widget

## Refresh datasets information

Over time, existing datasets are updated and new ones are added. The cell below makes sure you have the latest version of the production datasets.

Note: after this method runs, a file named `hub_datasets.json` is created in the same directory as this notebook. The data in this file supersedes the dataset information built into the `ocs_academic_hub` module. To get back to the built-in dataset information, move, rename or delete `hub_datasets.json`.

In [59]:
hub.refresh_datasets()
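
To return to the built-in dataset information as described in the note above, one option is to rename the file rather than delete it. A minimal sketch using only the standard library, assuming the working directory is the one containing `hub_datasets.json`; the helper name `set_aside_datasets_file` and the backup name `hub_datasets.json.bak` are arbitrary choices, not part of the library:

```python
from pathlib import Path


def set_aside_datasets_file(directory="."):
    """Rename hub_datasets.json, if present, and return the new path (or None)."""
    datasets_file = Path(directory) / "hub_datasets.json"
    if not datasets_file.exists():
        return None
    # Arbitrary backup name chosen for this sketch.
    backup = datasets_file.with_name("hub_datasets.json.bak")
    # Path.replace overwrites an older backup if one exists.
    datasets_file.replace(backup)
    return backup
```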

## Get list of published hub datasets


In [60]:
hub.datasets()

['Brewery',
 'Campus_Energy',
 'Classroom_Data',
 'MIT',
 'Pilot_Plant',
 'USC_Well_Data',
 'Wind_Farms']

## Display current active dataset

The default dataset is Brewery. Only one dataset can be active. 

In [61]:
hub.current_dataset()

'Brewery'

## Get list of assets with Data Views

The result is a pandas dataframe with columns `Asset_Id` and `Description`. Each asset is identified by a unique `Asset_Id`. The Brewery dataset has a mix of bright tanks (prefixed `BB`), fermenter vessels (prefixed `FV`) and miscellaneous pieces of equipment.

Using `print` with `.to_string()` in the cell below displays the whole dataframe content.

In [62]:
print(hub.assets().to_string())

          Asset_Id            Description
0        Acid Tank                    AT1
1              BA1                       
2              BA2                       
3             BB02            Bright Tank
4        BB02 Line                       
5             BB03            Bright Tank
6             BB04            Bright Tank
7             BB05            Bright Tank
8             BB06            Bright Tank
9             BB07            Bright Tank
10            BB08            Bright Tank
11            BB09            Bright Tank
12            BB11            Bright Tank
13            BB12            Bright Tank
14            BB13            Bright Tank
15            BB14            Bright Tank
16            BB15            Bright Tank
17   Beer Transfer  Beer TransferTemplate
18         C1_BBL1            Bright Tank
19          C1_BL1     Cellar 1 Beer Line
20          C1_PS1                    PS1
21          C1_PS2                    PS2
22          C1_YL1    Cellar 1 Yea
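
Because the result is a regular pandas dataframe, assets can be filtered with standard string methods. The sketch below uses a small hand-built stand-in for `hub.assets()` (a few rows copied from the output above) so that it runs without a login. It selects the bright tanks by their `BB` prefix:

```python
import pandas as pd

# Stand-in for hub.assets(): a few rows copied from the output above.
assets = pd.DataFrame(
    {
        "Asset_Id": ["Acid Tank", "BA1", "BB02", "BB02 Line", "BB03", "C1_BBL1"],
        "Description": ["AT1", "", "Bright Tank", "", "Bright Tank", "Bright Tank"],
    }
)

# Keep IDs made of the BB prefix followed only by digits,
# which excludes related entries such as "BB02 Line".
bright_tanks = assets[assets["Asset_Id"].str.match(r"BB\d+$")]
print(bright_tanks["Asset_Id"].tolist())
```

The same pattern with `FV` in place of `BB` would select the fermenter vessels.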

## List of all Data Views

These are all single-asset default Data Views, each containing all the data available for its asset.

In [63]:
hub.asset_dataviews()

['brewery-acid.tank',
 'brewery-ba1',
 'brewery-ba2',
 'brewery-bb02',
 'brewery-bb02.line',
 'brewery-bb03',
 'brewery-bb04',
 'brewery-bb05',
 'brewery-bb06',
 'brewery-bb07',
 'brewery-bb08',
 'brewery-bb09',
 'brewery-bb11',
 'brewery-bb12',
 'brewery-bb13',
 'brewery-bb14',
 'brewery-bb15',
 'brewery-beer.transfer',
 'brewery-c1_bbl1',
 'brewery-c1_bl1',
 'brewery-c1_ps1',
 'brewery-c1_ps2',
 'brewery-c1_yl1',
 'brewery-c2_bbl1',
 'brewery-c2_ft1',
 'brewery-c2_ps1',
 'brewery-c2_ps2',
 'brewery-c3_bl1',
 'brewery-c3_ft1',
 'brewery-c3_yl1',
 'brewery-caustic.tank',
 'brewery-clean.in.place',
 'brewery-cnt1',
 'brewery-cnt2',
 'brewery-cst1',
 'brewery-fv01',
 'brewery-fv02',
 'brewery-fv08',
 'brewery-fv09',
 'brewery-fv10',
 'brewery-fv11',
 'brewery-fv12',
 'brewery-fv13',
 'brewery-fv14',
 'brewery-fv15',
 'brewery-fv16',
 'brewery-fv17',
 'brewery-fv18',
 'brewery-fv19',
 'brewery-fv1__fv2.line',
 'brewery-fv20',
 'brewery-fv21',
 'brewery-fv22',
 'brewery-fv23',
 'brewery-fv

## List of Data Views exclusive to Fermenter Vessel #32 (FV32)

An empty filter (`filter=""`) returns all data views for the asset instead of only the default one.

In [64]:
dvs_fv32 = hub.asset_dataviews(asset="FV32", filter="")
dvs_fv32

['brewery-fv32',
 'brewery-fv32-adf_prediction',
 'brewery-fv32-cooling_prediction',
 'brewery-fv32-pca']

## List Multi-Asset Data Views Containing FV32

The column `Asset_Id` in data view results indicates which asset the row of data belongs to 

In [65]:
hub.asset_dataviews(asset="FV32", multiple_asset=True, filter="")

['brewery-fv31--36',
 'brewery-fv31--36-adf_prediction',
 'brewery-fv31--36-cooling_prediction',
 'brewery-fv31--36-pca']

## Get the OCS namespace associated with the dataset

Each dataset belongs to a namespace within the Academic Hub OCS account. Since a dataset may move over time, the function below always returns the active namespace for the given dataset.

In [66]:
dataset = hub.current_dataset()
namespace_id = hub.namespace_of(dataset)
namespace_id

'academic_hub_01'

## Get Data View structure

The structure lists, for each stream, its OCS stream name, the column name under which its data appears, its value type and its engineering units if available. The structure of the default data view is displayed below.

In [67]:
dataview_id = hub.asset_dataviews(asset="FV32", filter="default")[0]
print(dataview_id)
print(hub.dataview_definition(namespace_id, dataview_id).to_string(index=False))

brewery-fv32
Asset_Id             Column_Name Stream_Type Stream_UOM                       OCS_Stream_Name
    FV32                     ADF       Float                                B2_CL_C2_FV32/ADF
    FV32          Bottom TIC OUT       Float          %         B2_CL_C2_FV32_TIC1380A/OUT.CV
    FV32           Bottom TIC PV       Float         °F          B2_CL_C2_FV32_TIC1380A/PV.CV
    FV32           Bottom TIC SP       Float         °F          B2_CL_C2_FV32_TIC1380A/SP.CV
    FV32                   Brand    Category                           B2_CL_C2_FV32/BRAND.CV
    FV32               Deviation       Float               B2_CL_C2_FV32/Prediction.Deviation
    FV32                Diacetyl     Integer        ppb                B2_CL_C2_FV32/Diacetyl
    FV32          End Phase Time       Float          m         B2_CL_C2_FV32/EndPhaseTime.CV
    FV32           FV Full Plato       Float      Plato         B2_CL_C2_FV32/DcrsFvFullPlato
    FV32 Fermentation Start Time   Timestamp   
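
The definition dataframe can also drive column selection. The sketch below rebuilds a few rows of the output above by hand (so it runs without a login) and lists the columns whose `Stream_Type` is `Float`, for example to restrict a numerical analysis to them:

```python
import pandas as pd

# Hand-copied subset of the definition shown above.
definition = pd.DataFrame(
    {
        "Column_Name": ["ADF", "Bottom TIC PV", "Brand", "Diacetyl"],
        "Stream_Type": ["Float", "Float", "Category", "Integer"],
        "Stream_UOM": ["", "°F", "", "ppb"],
    }
)

# Names of the columns that hold float-valued streams.
float_columns = definition.loc[
    definition["Stream_Type"] == "Float", "Column_Name"
].tolist()
print(float_columns)
```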

## Getting data from a Data View

**There are two kinds of Data Views: interpolated and stored (archived data)**

### Interpolated Data View

Returns interpolated data between a start and end date at the requested interpolation interval (format `HH:MM:SS`).

In [68]:
# Uncomment the next line to retrieve a full 3 years' worth of data
# df_fv32 = hub.dataview_interpolated_pd(namespace_id, dataview_id, "2017-01-19", "2020-01-19", "00:30:00")
#
# The call below retrieves a single month of data
df_fv32 = hub.dataview_interpolated_pd(
    namespace_id, dataview_id, "2017-01-19", "2017-02-19", "00:30:00"
)
df_fv32


  ==> Finished 'dataview_interpolated_pd' in       4.1337 secs [ 360 rows/sec ]


Unnamed: 0,Timestamp,Asset_Id,ADF,Volume,Volume In,Volume Out,Top TIC OUT,Top TIC PV,Top TIC SP,Middle TIC OUT,...,Deviation,Yeast Out Totalizer,End Phase Time,Vessel Procedure,Predicted Transition,Integrator Key,Phase Duration,Yeast Strain,Brand,Status
0,2017-01-19 00:00:00,FV32,0.104108,719.617,719.61707,0.0,8.699567,63.067005,63.0,17.355146,...,,,,,,12.1,,NCYC1187,Realtime Hops,Fermentation
1,2017-01-19 00:30:00,FV32,0.104108,719.617,719.61707,0.0,20.299335,63.055607,63.0,7.568751,...,,,,,,12.1,,NCYC1187,Realtime Hops,Fermentation
2,2017-01-19 01:00:00,FV32,0.104108,719.617,719.61707,0.0,46.984615,63.241917,63.0,4.256329,...,,,,,,12.1,,NCYC1187,Realtime Hops,Fermentation
3,2017-01-19 01:30:00,FV32,0.104108,719.617,719.61707,0.0,74.284065,63.305080,63.0,17.092400,...,,,,,,12.1,,NCYC1187,Realtime Hops,Fermentation
4,2017-01-19 02:00:00,FV32,0.104108,719.617,719.61707,0.0,42.702755,63.150780,63.0,46.463210,...,,,,,,12.1,,NCYC1187,Realtime Hops,Fermentation
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1484,2017-02-18 22:00:00,FV32,0.653248,725.297,725.29740,0.0,0.000000,63.400000,68.0,0.000000,...,,,,,,337.0,,NCYC1187,Kerberos,Diacetyl Rest
1485,2017-02-18 22:30:00,FV32,0.653248,725.297,725.29740,0.0,0.000000,63.350674,68.0,0.000000,...,,,,,,337.0,,NCYC1187,Kerberos,Diacetyl Rest
1486,2017-02-18 23:00:00,FV32,0.653248,725.297,725.29740,0.0,0.000000,63.199997,68.0,0.000000,...,,,,,,337.0,,NCYC1187,Kerberos,Diacetyl Rest
1487,2017-02-18 23:30:00,FV32,0.653248,725.297,725.29740,0.0,0.000000,63.199997,68.0,0.000000,...,,,,,,337.0,,NCYC1187,Kerberos,Diacetyl Rest


In [69]:
# Information about the dataframe - this is a Pandas operation 
df_fv32.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1489 entries, 0 to 1488
Data columns (total 34 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Timestamp                1489 non-null   datetime64[ns]
 1   Asset_Id                 1489 non-null   object        
 2   ADF                      753 non-null    float64       
 3   Volume                   1486 non-null   float64       
 4   Volume In                1486 non-null   float64       
 5   Volume Out               1486 non-null   float64       
 6   Top TIC OUT              1486 non-null   float64       
 7   Top TIC PV               1486 non-null   float64       
 8   Top TIC SP               1486 non-null   float64       
 9   Middle TIC OUT           1486 non-null   float64       
 10  Middle TIC PV            1486 non-null   float64       
 11  Middle TIC SP            1486 non-null   float64       
 12  Bottom TIC OUT           1486 non-
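
The number of rows in an interpolated result can be estimated before making the request. The sketch below assumes that both the start and the end timestamps are included, which matches the 1489 rows reported above for one month at a 30-minute interval. The helper name `expected_rows` is hypothetical and not part of the library:

```python
import pandas as pd


def expected_rows(start, end, interval, n_assets=1):
    """Estimate the row count of an interpolated Data View request.

    Assumes start and end are both included and that `interval`
    uses the same HH:MM:SS format as the request.
    """
    span = pd.Timestamp(end) - pd.Timestamp(start)
    steps = span // pd.Timedelta(interval)
    return (steps + 1) * n_assets


print(expected_rows("2017-01-19", "2017-02-19", "00:30:00"))
```

With `n_assets=6`, the same calculation for 2017-02-01 to 2017-03-01 gives 8070, which agrees with the multi-asset result shown later in this notebook.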

### Stored Data View

When actual archived data is required, `hub.dataview_stored_pd` returns a dataframe of the events recorded between a start and end date. The format is narrow, always with the same 4 columns:

* **Timestamp:** time associated with the event
* **Asset_Id:** asset that produced the data
* **Field:** name of a sensor (corresponds to a column in the interpolated data view)
* **Value:** actual value for the event (can be a digital state)

Values for a given sensor are consecutive, sorted by timestamps. 

In [70]:
df_fv32_stored = hub.dataview_stored_pd(
    namespace_id, dataview_id, "2017-01-19", "2017-01-24"
)
df_fv32_stored

  ==> Finished 'dataview_stored_pd' in             9.0903 secs [ 12.87K rows/sec ]


Unnamed: 0,Timestamp,Asset_Id,Field,Value
0,2017-01-19 05:39:13.047012300+00:00,FV32,ADF,0.104108
1,2017-01-19 05:49:00+00:00,FV32,ADF,0.296614
2,2017-01-19 13:29:13.340011500+00:00,FV32,ADF,0.296614
3,2017-01-19 16:13:00+00:00,FV32,ADF,0.474311
4,2017-01-19 21:19:13.616012500+00:00,FV32,ADF,0.474311
...,...,...,...,...
16992,2017-01-22 13:39:15.449005100+00:00,FV32,Status,Diacetyl Rest
16993,2017-01-22 21:29:15.699005100+00:00,FV32,Status,Diacetyl Rest
16994,2017-01-23 05:19:15.915008500+00:00,FV32,Status,Diacetyl Rest
16995,2017-01-23 13:09:16.191009500+00:00,FV32,Status,Diacetyl Rest


**To see how many values of each sensor/status were returned, use Pandas `value_counts()`**

In [71]:
df_fv32_stored["Field"].value_counts()

PIC OUT             60335
Top TIC PV          11868
Middle TIC PV       11274
Bottom TIC PV       10485
Top TIC OUT          6581
Middle TIC OUT       6254
Bottom TIC OUT       6253
PIC PV               3721
PIC SP                 23
ADF                    21
Bottom TIC SP          19
Middle TIC SP          18
Top TIC SP             18
Volume Out             16
Yeast Generation       16
Status                 16
Volume                 15
Volume In              15
Yeast Strain           15
Brand                  15
Integrator Key         11
Plato                   5
Diacetyl                3
Name: Field, dtype: int64
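
When a wide layout is more convenient, the narrow format can be reshaped with pandas. A minimal sketch on a hand-built stand-in for `df_fv32_stored` (values are illustrative). It keeps the last value when a field has several events at the same timestamp, which is a choice made for this sketch and not something the library prescribes:

```python
import pandas as pd

# Stand-in for a stored Data View result in narrow format.
narrow = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            ["2017-01-19 05:39", "2017-01-19 05:49", "2017-01-19 05:39"], utc=True
        ),
        "Asset_Id": ["FV32", "FV32", "FV32"],
        "Field": ["ADF", "ADF", "Status"],
        "Value": [0.104108, 0.296614, "Fermentation"],
    }
)

# One row per (Timestamp, Asset_Id), one column per Field.
wide = (
    narrow.groupby(["Timestamp", "Asset_Id", "Field"])["Value"]
    .last()
    .unstack("Field")
    .reset_index()
)
print(wide)
```

Because sensors report at different times, the wide table is sparse: a field has a missing value wherever it recorded no event at that timestamp.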

<div class="alert alert-block alert-info">
<b>NOTE: unlike interpolated data views, the number of rows returned by a stored data view for a given start and end date cannot be known in advance. <tt>hub.dataview_stored_pd()</tt> caps the returned dataframe at 2 million rows.</b>
</div>
    
### Stored Data Views bigger than 2M rows
    
If a stored data view result has 2M rows, it is almost certain that more data remains. To check whether this is the case, call `hub.remaining_data()`, which returns a boolean. To get the remaining values, call `hub.dataview_stored_pd()` with the parameter `resume` set to `True`. Note that you may need to call `hub.dataview_stored_pd` with `resume` multiple times (until `hub.remaining_data()` returns `False`).
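
The resume procedure can be written as a loop. This is a sketch only: it relies on `hub.remaining_data()` and the `resume` parameter as described in this section, the helper name `fetch_all_stored` is hypothetical, and passing `resume` as a keyword argument is an assumption about the call signature:

```python
import pandas as pd


def fetch_all_stored(hub, namespace_id, dataview_id, start, end):
    """Collect a stored Data View in chunks until no data remains."""
    # First call returns up to the row cap.
    chunks = [hub.dataview_stored_pd(namespace_id, dataview_id, start, end)]
    # Keep resuming while the hub reports remaining data.
    # Assumption: `resume` is accepted as a keyword argument.
    while hub.remaining_data():
        chunks.append(
            hub.dataview_stored_pd(
                namespace_id, dataview_id, start, end, resume=True
            )
        )
    return pd.concat(chunks, ignore_index=True)
```

Concatenating several 2M-row chunks can use a lot of memory, so narrowing the time range may be preferable when the full result is not needed.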

## Data Views with multiple assets

Some Data Views return data for several assets at once, for example fermenter vessels 31 through 36. The cell below shows how to get their names.

In [72]:
multi_asset_dvs = hub.asset_dataviews(multiple_asset=True)
multi_asset_dvs

['brewery-fv01--28', 'brewery-fv31--36', 'brewery-fv37--46']

## Get results from a multi-asset Data View

The column "Asset_Id" indicates which asset the data row belongs to. The data order is all data for FV31 in increasing time, followed by FV32 and so on up to FV36. 


In [73]:
df_fv31_36 = hub.dataview_interpolated_pd(namespace_id, multi_asset_dvs[1], "2017-02-01", "2017-03-01", "00:30:00")
df_fv31_36

  ==> Finished 'dataview_interpolated_pd' in       14.9141 secs [ 541 rows/sec ]


Unnamed: 0,Timestamp,Asset_Id,ADF,Volume,Volume In,Volume Out,Top TIC OUT,Top TIC PV,Top TIC SP,Middle TIC OUT,...,Deviation,Yeast Out Totalizer,End Phase Time,Vessel Procedure,Predicted Transition,Integrator Key,Phase Duration,Yeast Strain,Brand,Status
0,2017-02-01 00:00:00,FV31,0.719046,698.33700,698.33730,0.0,100.0,53.674717,30.0,100.000000,...,,,,,,-1.0,,NCYC1187,Grey Horse,Cooling
1,2017-02-01 00:30:00,FV31,0.719046,698.33700,698.33730,0.0,100.0,52.928270,30.0,100.000000,...,,,,,,-1.0,,NCYC1187,Grey Horse,Cooling
2,2017-02-01 01:00:00,FV31,0.719046,698.33700,698.33730,0.0,100.0,52.507034,30.0,100.000000,...,,,,,,-1.0,,NCYC1187,Grey Horse,Cooling
3,2017-02-01 01:30:00,FV31,0.719046,698.33700,698.33730,0.0,100.0,51.933790,30.0,100.000000,...,,,,,,-1.0,,NCYC1187,Grey Horse,Cooling
4,2017-02-01 02:00:00,FV31,0.719046,698.33700,698.33730,0.0,100.0,51.517723,30.0,100.000000,...,,,,,,-1.0,,NCYC1187,Grey Horse,Cooling
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8065,2017-02-28 22:00:00,FV36,0.328751,743.10034,743.10034,0.0,0.0,64.711590,65.0,25.075735,...,,,,,,8.4,,NCYC1187,Adele,Fermentation
8066,2017-02-28 22:30:00,FV36,0.328751,743.10034,743.10034,0.0,0.0,64.600006,65.0,60.707240,...,,,,,,8.4,,NCYC1187,Adele,Fermentation
8067,2017-02-28 23:00:00,FV36,0.328751,743.10034,743.10034,0.0,0.0,64.600006,65.0,36.788890,...,,,,,,8.4,,NCYC1187,Adele,Fermentation
8068,2017-02-28 23:30:00,FV36,0.328751,743.10034,743.10034,0.0,0.0,64.412796,65.0,39.635340,...,,,,,,8.4,,NCYC1187,Adele,Fermentation


In [74]:
df_fv31_36.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8070 entries, 0 to 8069
Data columns (total 34 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Timestamp                8070 non-null   datetime64[ns]
 1   Asset_Id                 8070 non-null   object        
 2   ADF                      4051 non-null   float64       
 3   Volume                   8070 non-null   float64       
 4   Volume In                8070 non-null   float64       
 5   Volume Out               8070 non-null   float64       
 6   Top TIC OUT              8070 non-null   float64       
 7   Top TIC PV               8070 non-null   float64       
 8   Top TIC SP               8070 non-null   float64       
 9   Middle TIC OUT           8070 non-null   float64       
 10  Middle TIC PV            8070 non-null   float64       
 11  Middle TIC SP            8070 non-null   float64       
 12  Bottom TIC OUT           8070 non-

**NOTE: Stored Data Views for multiple assets also work. The column `Asset_Id` indicates which asset the event (row) belongs to** 
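
Since every row carries its `Asset_Id`, per-asset processing reduces to a pandas `groupby`. A minimal sketch on a hand-built stand-in for a multi-asset result (values are illustrative, column names follow the output above):

```python
import pandas as pd

# Stand-in for a multi-asset interpolated result.
multi = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(["2017-02-01 00:00", "2017-02-01 00:30"] * 2),
        "Asset_Id": ["FV31", "FV31", "FV36", "FV36"],
        "Top TIC PV": [53.7, 52.9, 64.7, 64.6],
    }
)

# One dataframe per asset, keyed by Asset_Id.
per_asset = {asset: frame for asset, frame in multi.groupby("Asset_Id")}

# Or aggregate directly, e.g. the mean of "Top TIC PV" per vessel.
mean_pv = multi.groupby("Asset_Id")["Top TIC PV"].mean()
print(mean_pv)
```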
