# PI World 2019: *** OCS Interpolation Investigation *** 

## Instruction: execute cells until reaching the Dataviews / PI Web API plugin comparison (starting at cell [14]) 

![PIWorld 2019](./images/piworld-2019.png)

![Hub Overview](./images/hub-overview.png)

# Learning module developed in partnership with Deschutes Brewery and Lehigh University

![Partnership with Deschutes and Lehigh University](./images/lehigh-deschute-partnership.png)

![CHE_396](./images/lehigh-che-396.png)

![Learning Outcomes](./images/learning-outcomes.png)

![Problem Setting](./images/adf-problem-settings.png)

![Segmented](./images/lehigh-fitted-model-slide.png)

## Imports 

Standard Python module imports plus the OCS-specific module

In [1]:
# For interaction with OCS
from ocs_datascience import OCSClient
# For HTTP request
import requests
# Pandas dataframe to manipulate table data
import pandas as pd
# Utilities from Python standard library 
import configparser
import datetime as dt
from dateutil import parser 
import json
import io
# For plots 
import plotly.graph_objs as go
import plotly_express as px

# Learning Outcome: Data access from cloud server using web service calls

![Tenant, namespace concepts](https://apimgmtstelkv30lahnuj362.blob.core.windows.net/content/MediaLibrary/lehigh/ocs/tenant-namespace2.png)

## Content of file `config.ini`
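
The file contents are not shown in this notebook. Judging only from the `config.get(...)` calls in the next cell, the file has three sections. The sketch below parses a stand-in and checks that every key the notebook reads is present. All values are placeholders (assumptions); only the section and key names come from the notebook.

```python
import configparser

# Placeholder values only: section and key names are taken from the
# config.get(...) calls in this notebook, the values are hypothetical.
SAMPLE_CONFIG = """
[Access]
ApiVersion = <api-version>
Tenant = <tenant-id>
Resource = <resource-url>

[Credentials]
ClientId = <client-id>
ClientSecret = <client-secret>

[Configurations]
Namespace = <namespace-id>
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CONFIG)

# Report any key the notebook reads that the file does not define
required = {
    'Access': ['ApiVersion', 'Tenant', 'Resource'],
    'Credentials': ['ClientId', 'ClientSecret'],
    'Configurations': ['Namespace'],
}
missing = [(section, key) for section, keys in required.items()
           for key in keys if not sample.has_option(section, key)]
print('Missing keys:', missing)
```

Running the same check against the real `config.ini` (with `sample.read('config.ini')`) before creating the client gives a clearer message than a `NoOptionError` raised halfway through the cell.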

## Read in configuration file and create OCS client object

In [2]:
config = configparser.ConfigParser()
config.read('config.ini')

ocs_client = OCSClient(config.get('Access', 'ApiVersion'),config.get('Access', 'Tenant'), config.get('Access', 'Resource'), 
                     config.get('Credentials', 'ClientId'), config.get('Credentials', 'ClientSecret'))

namespace_id = config.get('Configurations', 'Namespace')

## Get the authorization header with bearer token for access to the OCS API 

In [3]:
headers = ocs_client.authorization_headers()
headers

{'Authorization': 'bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjJDQjI4MzFEREJFRDc1NzAyM0NCMTM5OUVBRjRDMjkxQzE3MkQ5RjQiLCJ0eXAiOiJKV1QiLCJ4NXQiOiJMTEtESGR2dGRYQWp5eE9aNnZUQ2tjRnkyZlEifQ.eyJuYmYiOjE1NTUwNzU3MDAsImV4cCI6MTU1NTA3OTMwMCwiaXNzIjoiaHR0cHM6Ly9kYXQtYi5vc2lzb2Z0LmNvbS9pZGVudGl0eSIsImF1ZCI6WyJodHRwczovL2RhdC1iLm9zaXNvZnQuY29tL2lkZW50aXR5L3Jlc291cmNlcyIsIm9jc2FwaSJdLCJjbGllbnRfaWQiOiIxNDE1ZjgzZC01OTQwLTRmYjctYTJjNy1lYTE1ODU1OGE2YmMiLCJ0aWQiOiI2NTI5MmI2Yy1lYzE2LTQxNGEtYjU4My1jZTdhZTA0MDQ2ZDQiLCJqdGkiOiI2OTgzOWJhZjAyMDlhMzRiMjNiY2JhYTllNzIyODQwYiIsInNjb3BlIjpbIm9jc2FwaSJdfQ.GCIs9ARiEQFt12eGhvilec3Gl9Lw23EQsy1HI7OkgHTUPt1_YCBPK9UYKiZd-wb2qnOlo0dNv-RBKYC_KnfDTC4qkt-EAG6LH6MzLvbiQlSlIcKeOmnNjsFr-rt5FzDmHrlYHuJ7sg3O-7VPNQaAI8emLrjqA1UyUTkmGtu0YrD1GhrNa9JnRg7NycW-ftgUr6JM4Txd4QN-HPLiw25FEDE9s2nFfcAnezuTbdz_H0BHxua8u5XNP-KlPtqBQwCcoPpi1A_HR586taHclBDWP0gUzs93yGB9Ug4f4U7fPCzHhnNwEkIZ4vHWUv3ofPr763rwZ1sbfe0pHcloruSgJQ',
 'Content-type': 'application/json',
 'Accept': 'text/plain',
 'Request-Timeout
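
The bearer token is a JSON Web Token: three base64url segments separated by dots, with an `exp` (expiry, Unix seconds) claim in the middle segment. A minimal sketch for checking whether a token has expired before reusing the header, for example after a 401 response. The helper names are hypothetical, the demo token is built locally, and the signature is not verified.

```python
import base64
import json
import time

def jwt_payload(token):
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = token.split('.')[1]
    # base64url segments drop their '=' padding; restore it before decoding
    payload_b64 += '=' * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def bearer_token_expired(headers, now=None):
    """True if the 'exp' claim of the bearer token in `headers` has passed."""
    token = headers['Authorization'].split(' ', 1)[1]
    now = time.time() if now is None else now
    return jwt_payload(token)['exp'] <= now

def _b64(obj):
    """Encode a dict as an unpadded base64url JSON segment (demo only)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b'=').decode()

# Hypothetical token for the demo, not a real credential
demo_token = '.'.join([_b64({'alg': 'none'}), _b64({'exp': 1555079300}), 'signature'])
demo_headers = {'Authorization': 'bearer ' + demo_token}
print(bearer_token_expired(demo_headers, now=1555079301))
```

If the check returns `True`, calling `ocs_client.authorization_headers()` again is the refresh step used elsewhere in this notebook.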

### URL to access `fermenter_vessels` namespace and its dataviews 

In [4]:
# Endpoint for dataview access
namespace_url = ocs_client.namespace_url(namespace_id)  
dataview_url = namespace_url + '/dataviews/'
namespace_url

'https://dat-b.osisoft.com/api/v1-preview/Tenants/65292b6c-ec16-414a-b583-ce7ae04046d4/Namespaces/fermenter__vessels'

## Get the list of all data streams for Fermenter Vessel #31 and extract information for Bottom Temperature data stream

#### 1) Build a stream query URL

In [5]:
streams_url = namespace_url + '/Streams?query=name:*FV31*'
print('Stream Query URL:', streams_url)

Stream Query URL: https://dat-b.osisoft.com/api/v1-preview/Tenants/65292b6c-ec16-414a-b583-ce7ae04046d4/Namespaces/fermenter__vessels/Streams?query=name:*FV31*


#### 2) Make a web request using URL and authorization header

In [6]:
fv31_streams = requests.get(streams_url, headers=headers)

#### 3) Verify that request status code indicates success (should be 200 for GET) 

In [7]:
fv31_streams.status_code

200

#### 4) Display information about Bottom Temperature Control Value data stream

In [8]:
print([stream for stream in fv31_streams.json() if 'Bottom Temperature Control Value' in stream['Description']])

[{'TypeId': 'PI-Float32', 'Id': 'PI_acad-pida-vm0_2593', 'Name': 'acsbrew.BREWERY.B2_CL_C2_FV31_TIC1360A/PV.CV', 'Description': 'FV31 Bottom Temperature Control Value', 'InterpolationMode': 0, 'ExtrapolationMode': 2}]
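
The `Id` field of this record is what the data request in the next section needs. A small sketch that pulls it out programmatically instead of copying it by hand; `find_stream_id` is a hypothetical helper and the record is copied from the output above.

```python
# Record copied from the output above
streams = [{'TypeId': 'PI-Float32', 'Id': 'PI_acad-pida-vm0_2593',
            'Name': 'acsbrew.BREWERY.B2_CL_C2_FV31_TIC1360A/PV.CV',
            'Description': 'FV31 Bottom Temperature Control Value',
            'InterpolationMode': 0, 'ExtrapolationMode': 2}]

def find_stream_id(streams, description_fragment):
    """Return the Id of the single stream whose Description contains the fragment."""
    matches = [s['Id'] for s in streams
               if description_fragment in s.get('Description', '')]
    if len(matches) != 1:
        raise LookupError(f'expected exactly one match, found {len(matches)}')
    return matches[0]

stream_id = find_stream_id(streams, 'Bottom Temperature Control Value')
print(stream_id)
```

With the live response, `streams` would be `fv31_streams.json()`.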


# Learning Outcome: Build a URL query to get back data from a data stream

### Get data for Bottom Temperature Sensor of Fermentor Vessel 31

In [9]:
start_index = 'startIndex=2017-03-17T00:00'
end_index =     'endIndex=2017-03-17T02:00' # 2 hours later
fv31_bottom_temp_url = namespace_url + f'/Streams/PI_acad-pida-vm0_2593/Data?{start_index}&{end_index}'
fv31_bottom_temp_url 

'https://dat-b.osisoft.com/api/v1-preview/Tenants/65292b6c-ec16-414a-b583-ce7ae04046d4/Namespaces/fermenter__vessels/Streams/PI_acad-pida-vm0_2593/Data?startIndex=2017-03-17T00:00&endIndex=2017-03-17T02:00'
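
The two index strings can also be derived from a start time and a duration, which avoids editing both by hand when the window changes. A minimal sketch; `data_window_query` is a hypothetical helper and the timestamp format simply mirrors the strings used in the cell above.

```python
import datetime as dt

def data_window_query(start, hours):
    """Build the startIndex/endIndex query string for a window of `hours`."""
    end = start + dt.timedelta(hours=hours)
    fmt = '%Y-%m-%dT%H:%M'
    return f'startIndex={start.strftime(fmt)}&endIndex={end.strftime(fmt)}'

query = data_window_query(dt.datetime(2017, 3, 17, 0, 0), hours=2)
print(query)
```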

#### Perform the web service request for stream data and check its status 

In [10]:
fv31_bottom_temp = requests.get(fv31_bottom_temp_url, headers=headers)
fv31_bottom_temp.status_code

200

#### 7) Check raw data in JSON format

In [11]:
fv31_bottom_temp.json()

[{'Timestamp': '2017-03-17T00:14:07Z', 'Value': 30.5642},
 {'Timestamp': '2017-03-17T00:22:36Z', 'Value': 30.2},
 {'Timestamp': '2017-03-17T00:53:25Z', 'Value': 30.2},
 {'Timestamp': '2017-03-17T00:57:55Z', 'Value': 30.06531},
 {'Timestamp': '2017-03-17T00:58:48Z', 'Value': 29.7681},
 {'Timestamp': '2017-03-17T00:59:00Z', 'Value': 29.73208},
 {'Timestamp': '2017-03-17T01:00:00Z', 'Value': 29.67508},
 {'Timestamp': '2017-03-17T01:01:01Z', 'Value': 29.54451},
 {'Timestamp': '2017-03-17T01:02:00Z', 'Value': 29.55049},
 {'Timestamp': '2017-03-17T01:03:00Z', 'Value': 29.65275},
 {'Timestamp': '2017-03-17T01:04:00Z', 'Value': 29.755},
 {'Timestamp': '2017-03-17T01:05:00Z', 'Value': 29.72754},
 {'Timestamp': '2017-03-17T01:06:01Z', 'Value': 29.83604},
 {'Timestamp': '2017-03-17T01:07:00Z', 'Value': 29.93986},
 {'Timestamp': '2017-03-17T01:08:00Z', 'Value': 30.04601},
 {'Timestamp': '2017-03-17T01:09:00Z', 'Value': 30.09564},
 {'Timestamp': '2017-03-17T01:10:00Z', 'Value': 30.06812},
 {'Timest

#### 8) Store the result as a pandas DataFrame for further manipulation

In [12]:
df = pd.DataFrame(fv31_bottom_temp.json())
df

Unnamed: 0,Timestamp,Value
0,2017-03-17T00:14:07Z,30.5642
1,2017-03-17T00:22:36Z,30.2
2,2017-03-17T00:53:25Z,30.2
3,2017-03-17T00:57:55Z,30.06531
4,2017-03-17T00:58:48Z,29.7681
5,2017-03-17T00:59:00Z,29.73208
6,2017-03-17T01:00:00Z,29.67508
7,2017-03-17T01:01:01Z,29.54451
8,2017-03-17T01:02:00Z,29.55049
9,2017-03-17T01:03:00Z,29.65275
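
The recorded values are not evenly spaced in time. A short sketch that measures the gaps between the first six timestamps shown above, using only the standard library (on the dataframe, `pd.to_datetime(df['Timestamp']).diff()` would give the same information).

```python
import datetime as dt

# First six timestamps copied from the output above
timestamps = ['2017-03-17T00:14:07Z', '2017-03-17T00:22:36Z', '2017-03-17T00:53:25Z',
              '2017-03-17T00:57:55Z', '2017-03-17T00:58:48Z', '2017-03-17T00:59:00Z']

parsed = [dt.datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ') for ts in timestamps]
# Gap in seconds between each pair of consecutive recorded values
gaps_seconds = [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]
print(gaps_seconds)
```

The gaps range from 12 seconds to more than 30 minutes, which is why a request at regular intervals needs interpolation.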


#### 9) Plot the time-series data 

In [13]:
layout = dict(title='Bottom Temperature')
data = [go.Scattergl(x = df['Timestamp'], y = df['Value'], mode='lines+markers')]
fig = go.FigureWidget(data=data, layout=layout)
fig

FigureWidget({
    'data': [{'mode': 'lines+markers',
              'type': 'scattergl',
              'uid': …

## Often analysis requires a single table with data from:

### 1- Values for multiple sensors, settings, and calculations organized as rows of observations

### 2- Multiple similar assets: consistent data shape

### 3- Data at regular intervals

## OSIsoft's answer to the above is the Dataview 

## Academic Hub provides a set of ready-to-use dataviews for students with each dataset hosted on the Academic Hub
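
To make "data at regular intervals" concrete, here is a minimal sketch of linear interpolation of irregular samples onto a regular grid. It only illustrates the idea; it is an assumption, not the algorithm used by OCS or by PI Web API, and `interpolate_linear` is a hypothetical helper.

```python
def interpolate_linear(samples, t):
    """Linearly interpolate sorted (time, value) samples at time t.

    Returns None outside the sampled range (no extrapolation in this sketch).
    """
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return None

# Seconds after 2017-03-17T00:00:00Z, values from the first four rows
# of the Bottom Temperature stream output shown earlier
samples = [(847, 30.5642), (1356, 30.2), (3205, 30.2), (3475, 30.06531)]

# Regular 10-minute grid inside the sampled range
grid = range(1200, 3600, 600)
table = [(t, interpolate_linear(samples, t)) for t in grid]
print(table)
```

How a service treats gaps, bad values, and the edges of the range is exactly where two interpolation implementations can differ, which is the subject of the comparison that follows.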

--- 
--- 
# Comparison between PI Web API and OCS interpolations 
---
--- 

## Creation of the Dataviews, for fermenters 31 up to 36

* Status 201 from POST request indicates success
* Status 401 indicates unauthorized (try refreshing authorization header)
* Status 409 when a Dataview with same Id already exists (go to last cell of this notebook to perform a clean up)
* One Dataview per fermenter vessel 

In [14]:
# Valid Fermenter Vessel IDs are 31 up to 36.
# range(31, 32) creates the dataview for FV31 only; use range(31, 37) for all six vessels.
for fv_id in range(31, 32):
    dataviews_id = ocs_client.create_fermenter_dataview(fv_id)

Status: 409 Dataview Id: DV_FV31 Error: {"OperationId":"c3d7da66-d61c-4518-bec2-22c5cb3081c1","Error":"Data view with specified id already exists.","DataViewId":"DV_FV31"}


### Prepare OCS dataview request for Fermenter Vessel 31. We're interested in the Volume column 

In [15]:
# Start of data in OCS
start_index = '2017-03-17T07:00'
# Convert to a datetime
start_time = parser.parse(start_index)
# Build a time delta 
delta_time = dt.timedelta(days=1)
# Get back an ISO 8601 timestamp string which is delta_time later than start_index 
end_index = (start_time + delta_time).isoformat()
interval = '00:01:00'
dataview_id = 'DV_FV31'
dataview_url = namespace_url + f'/Dataviews/{dataview_id}/preview/interpolated' + \
            f'?startIndex={start_index}&endIndex={end_index}' + \
            f'&interval={interval}&form=csvh&maxcount=20000' 
dataview_url 

'https://dat-b.osisoft.com/api/v1-preview/Tenants/65292b6c-ec16-414a-b583-ce7ae04046d4/Namespaces/fermenter__vessels/Dataviews/DV_FV31/preview/interpolated?startIndex=2017-03-17T07:00&endIndex=2017-03-18T07:00:00&interval=00:01:00&form=csvh&maxcount=20000'

In [16]:
# Perform dataview request
fv31_dataview_result = requests.get(dataview_url, headers=headers)
print(fv31_dataview_result.status_code)
df_ocs = pd.read_csv(io.StringIO(fv31_dataview_result.text), parse_dates=['Timestamp'])
df_ocs

200


Unnamed: 0,Timestamp,Volume,Top TIC PV,Top TIC OUT,Plato,Middle TIC PV,Middle TIC OUT,FV Full Plato,Fermentation ID,Brand,Bottom TIC PV,Bottom TIC OUT,ADF,Status
0,2017-03-17 07:00:00+00:00,716.566,29.613152,0.0,,29.356380,0.000000,,Fermentor 31201731179653,4.0,29.884571,10.935327,,12.0
1,2017-03-17 07:01:00+00:00,716.566,29.552847,0.0,,29.378868,0.000000,,Fermentor 31201731179653,4.0,29.931492,12.445787,,12.0
2,2017-03-17 07:02:00+00:00,716.566,29.497858,0.0,,29.400892,0.000000,,Fermentor 31201731179653,4.0,29.978569,13.956246,,12.0
3,2017-03-17 07:03:00+00:00,716.566,29.456203,0.0,,29.423298,0.000000,,Fermentor 31201731179653,4.0,30.033358,18.254084,,12.0
4,2017-03-17 07:04:00+00:00,716.566,29.438652,0.0,,29.440790,0.000000,,Fermentor 31201731179653,4.0,30.083404,25.958717,,12.0
5,2017-03-17 07:05:00+00:00,716.566,29.421194,0.0,,29.458190,0.000000,,Fermentor 31201731179653,4.0,30.133188,33.663350,,12.0
6,2017-03-17 07:06:00+00:00,716.566,29.430151,0.0,,29.489014,0.000000,,Fermentor 31201731179653,4.0,30.193794,41.367980,,12.0
7,2017-03-17 07:07:00+00:00,716.566,29.476150,0.0,,29.432340,0.000000,,Fermentor 31201731179653,4.0,30.191100,45.145805,,12.0
8,2017-03-17 07:08:00+00:00,716.566,29.468636,0.0,,29.416090,0.000000,,Fermentor 31201731179653,4.0,30.184795,48.042100,,12.0
9,2017-03-17 07:09:00+00:00,716.566,29.464607,0.0,,29.429405,0.000000,,Fermentor 31201731179653,4.0,30.179586,50.938390,,12.0


In [17]:
df_ocs[['Timestamp', 'Volume']].to_csv('ocs_beer_fv31_03_17_1day.csv')
df_ocs[['Timestamp', 'Volume']]

Unnamed: 0,Timestamp,Volume
0,2017-03-17 07:00:00+00:00,716.566
1,2017-03-17 07:01:00+00:00,716.566
2,2017-03-17 07:02:00+00:00,716.566
3,2017-03-17 07:03:00+00:00,716.566
4,2017-03-17 07:04:00+00:00,716.566
5,2017-03-17 07:05:00+00:00,716.566
6,2017-03-17 07:06:00+00:00,716.566
7,2017-03-17 07:07:00+00:00,716.566
8,2017-03-17 07:08:00+00:00,716.566
9,2017-03-17 07:09:00+00:00,716.566


In [22]:
# Perform similar request but towards Hub Plugin
# ----------- User parameters ------------
# Absolute time 
start_time = '2017-03-17T07:00:00Z'  # UTC
end_time = '2017-03-18T07:00:00Z'
# Relative time 
# start_time = '*-1d'
# end_time = '*'
interpolation_interval = '1m'
equipment_path = '?path=\\\\PIAF-ACAD\\Food and Beverage\\Brewery\\Double Hop Brewery\\Assets\\Fermentors\\Fermentor 31'
max_count = '10000' 
# Credential
username = 'reader0'
password = 'OSIsoft2017'

base_url = 'https://academicpi.azure-api.net/hub/api/' 
time_params = '&startTime={0}&endTime={1}&interval={2}&maxCount={3}'

interpolated_url_template = base_url + 'Csv/ElementInterpolated' + equipment_path + time_params  
interpolated_url = interpolated_url_template.format(start_time, end_time, interpolation_interval, max_count)

response = requests.get(interpolated_url, auth=(username, password))
if response.status_code != 200:
    print('# Request failed with code:', response.status_code, response.text)
else:
    df_plugin = pd.read_csv(io.StringIO(response.text), parse_dates=['Timestamp'])
    # Save only after a successful request, so a failed call never reaches an undefined df_plugin here
    df_plugin[['Timestamp', 'Volume']].to_csv('hub_beer_fv31_03_17_1day_0700.csv')
df_plugin

Unnamed: 0,Element,Timestamp,Quality,Performance,Availability,Batch Active Tag,Active Production (last 24h),Yeast Strain,Yeast Generation,Volume Out,...,Process Cell,Cone Temperature,Bottom Temperature,Active Status,OEE,Brewery Name,Production Status,Yeast Status,Vessel Name,ADF
0,Fermentor 31,2017-03-17 07:00:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
1,Fermentor 31,2017-03-17 07:01:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
2,Fermentor 31,2017-03-17 07:02:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
3,Fermentor 31,2017-03-17 07:03:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
4,Fermentor 31,2017-03-17 07:04:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
5,Fermentor 31,2017-03-17 07:05:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
6,Fermentor 31,2017-03-17 07:06:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
7,Fermentor 31,2017-03-17 07:07:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
8,Fermentor 31,2017-03-17 07:08:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725
9,Fermentor 31,2017-03-17 07:09:00+00:00,No Data,No Data,No Data,1,24,NCYC1187,No Data,0,...,C2,0,0,1.0,,Assets,1.0,,Fermentor 31,0.659725


In [27]:
recorded_url_template = base_url + 'Csv/ElementRecorded' + equipment_path + '&startTime={0}&endTime={1}'
recorded_url = recorded_url_template.format(start_time, end_time)
response = requests.get(recorded_url, auth=(username, password))
response

<Response [200]>

In [28]:
if response.status_code != 200:
    print('# Request failed with code:', response.status_code, response.text)
else:
    df_plugin_rec = pd.read_csv(io.StringIO(response.text), parse_dates=['Timestamp'])
#df_plugin_rec[['Timestamp', 'Volume']].to_csv('hub_beer_fv31_03_17_1day_rec.csv')
#df_plugin_rec[['Timestamp', 'Volume']]
df_plugin_rec[df_plugin_rec.Attribute == 'Volume']

Unnamed: 0,Element,Attribute,Timestamp,Value
32,Fermentor 31,Volume,2017-03-17 07:26:12+00:00,716.565979003906
33,Fermentor 31,Volume,2017-03-17 15:01:06+00:00,Bad Input
34,Fermentor 31,Volume,2017-03-17 15:16:54+00:00,716.565979003906
35,Fermentor 31,Volume,2017-03-17 15:26:00+00:00,Bad Input
36,Fermentor 31,Volume,2017-03-17 15:36:15+00:00,716.565979003906
37,Fermentor 31,Volume,2017-03-17 23:36:01+00:00,716.565979003906
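
The recorded Volume values mix numbers with the string `Bad Input`, so they need to be separated before any arithmetic. A minimal standard-library sketch on the rows shown above; `split_good_bad` is a hypothetical helper (on the dataframe, `pd.to_numeric(..., errors='coerce')` is the usual pandas route).

```python
# (timestamp, value) pairs copied from the recorded Volume rows above
recorded = [
    ('2017-03-17 07:26:12+00:00', '716.565979003906'),
    ('2017-03-17 15:01:06+00:00', 'Bad Input'),
    ('2017-03-17 15:16:54+00:00', '716.565979003906'),
    ('2017-03-17 15:26:00+00:00', 'Bad Input'),
    ('2017-03-17 15:36:15+00:00', '716.565979003906'),
    ('2017-03-17 23:36:01+00:00', '716.565979003906'),
]

def split_good_bad(rows):
    """Split rows into numeric values and non-numeric system states."""
    good, bad = [], []
    for timestamp, raw in rows:
        try:
            good.append((timestamp, float(raw)))
        except ValueError:
            bad.append((timestamp, raw))
    return good, bad

good, bad = split_good_bad(recorded)
print(len(good), 'numeric values;', 'bad values at:', [ts for ts, _ in bad])
```

The two bad values fall between 15:01 and 15:36, so that is the time range to inspect when comparing the two interpolated series.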


In [29]:
data = []
trace_ocs = go.Scattergl(x = df_ocs['Timestamp'], 
                             y = df_ocs['Volume'], 
                             mode = 'markers', 
                             name = 'OCS',
                             hoverlabel = {'namelength': -1})
data.append(trace_ocs)

In [30]:
trace_hub = go.Scattergl(x = df_plugin['Timestamp'], 
                             y = df_plugin['Volume'], 
                             mode = 'markers', 
                             name = 'PIWeb',
                             hoverlabel = {'namelength': -1})
data.append(trace_hub)

In [32]:
layout = dict(title='OCS Versus PI Web API Interpolation around Bad Input values')
fig = go.FigureWidget(data=data, layout=layout)
fig

FigureWidget({
    'data': [{'hoverlabel': {'namelength': -1},
              'mode': 'markers',
              …
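
Besides the visual comparison, the two series can be compared numerically at the timestamps they share. A minimal sketch on made-up values (not results from this notebook); `max_abs_difference` is a hypothetical helper, and with the dataframes above the same idea could be expressed with a pandas merge on `Timestamp`.

```python
def max_abs_difference(series_a, series_b):
    """Return (timestamp, |a - b|) for the largest gap at shared timestamps."""
    common = sorted(set(series_a) & set(series_b))
    if not common:
        return None
    worst = max(common, key=lambda ts: abs(series_a[ts] - series_b[ts]))
    return worst, abs(series_a[worst] - series_b[worst])

# Synthetic values for illustration only
series_ocs = {'t0': 10.0, 't1': 10.0, 't2': 10.0}
series_piweb = {'t0': 10.0, 't1': 10.5, 't3': 10.0}

worst = max_abs_difference(series_ocs, series_piweb)
print(worst)
```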

### Multiple sensors - Similar assets have consistent data shape - Regular time intervals

### Note that the resulting dataframe has about 28801 rows

**This is for 1 fermenter X 20 days X 1440 rows per day (24 hours at 1 minute interval)**
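
The row count can be checked with a few lines of arithmetic. The sketch assumes that both the start and the end timestamp are returned, which is what the extra row in 28801 suggests; `expected_rows` is a hypothetical helper.

```python
import datetime as dt

def expected_rows(days, interval=dt.timedelta(minutes=1)):
    """Rows on a regular grid spanning `days`, counting both endpoints."""
    return dt.timedelta(days=days) // interval + 1

print(expected_rows(20))
```

The same function gives 1441 rows for the one-day window requested earlier, well under the `maxcount=20000` passed in that request.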

## Save data into CSV file locally 

## The file `beer_ocs_all_20days.csv` can be opened with Excel for inspection 


# ADF Prediction 

## First start by reading all fermenter vessels data, 300 days

### (Previously saved on file, takes a little while to read back)

In [None]:
# Warnings while running this cell are normal
# df = pd.read_csv('beer_ocs_all_300days.csv.zip', parse_dates = ['Timestamp'], compression='infer')
# df

### Note: the result CSV above has 2.6 million rows 

### List all unique Fermentation Batch IDs, filter out bad ones

In [None]:
all_ferm_ids = sorted(df['Fermentation ID'].unique()) 
# Check for invalid Fermentation ID
for ferm_id in all_ferm_ids:
    if not ('FV' in ferm_id or 'Fermentor' in ferm_id):
        print('Bad Ferm ID:', ferm_id)
# Keep only valid ones, should contain 'FV' or 'Fermentor'
all_valid_ferm_ids = [ferm_id for ferm_id in all_ferm_ids if 'FV' in ferm_id or 'Fermentor' in ferm_id]
print(all_valid_ferm_ids)
df = df[df['Fermentation ID'].isin(all_valid_ferm_ids)]
df
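
The validity rule can be factored into one function so the check and the filter cannot drift apart. A minimal sketch; `is_valid_ferm_id` is a hypothetical helper, the `isinstance` guard for missing values is an addition that the cell above does not have, and every sample ID except the first is invented.

```python
def is_valid_ferm_id(ferm_id):
    """Same rule as the cell above: the ID must mention 'FV' or 'Fermentor'."""
    return isinstance(ferm_id, str) and ('FV' in ferm_id or 'Fermentor' in ferm_id)

# First ID appears in the dataview output earlier; the others are hypothetical
sample_ids = ['Fermentor 31201731179653', 'Bad Input', float('nan'), 'FV32 demo']
valid_ids = [ferm_id for ferm_id in sample_ids if is_valid_ferm_id(ferm_id)]
print(valid_ids)
```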

### Apparent Degree of Fermentation for a set of fermentations

In [None]:
data = []
end_time = parser.parse('2017-04-04').isoformat()
dft = df[df.Timestamp <= end_time]
for ferm_id in all_valid_ferm_ids:
    df_ferm_id = dft[(dft['Fermentation ID'] == ferm_id) & (dft['Status'] == 'Fermentation')][['Timestamp', 'ADF']]
    if len(df_ferm_id) > 0:
        trace = go.Scattergl(x = df_ferm_id['Timestamp'], 
                             y = df_ferm_id['ADF'], 
                             mode = 'lines+markers', 
                             name = str(ferm_id),
                             hoverlabel = {'namelength': -1})
        data.append(trace)

### Add a range slider 

With a few time range selectors: 1 day, 3 days and everything 

In [None]:
layout = dict(
    title='ADF during fermentation stage, 20 days',
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label='1d',
                     step='day',
                     stepmode='backward'),
                dict(count=3,
                    label='3d',
                    step='day',
                    stepmode='backward'),
                dict(step='all')
            ])
        ),
        rangeslider=dict(
            visible = True
        ),
        type='date'
    )
)
 
fig = go.FigureWidget(data=data, layout=layout)

In [None]:
fig

# Learning Outcome: Data cleansing and preparation 
## Example: student selects relevant data for analysis

###  (We only want to look at data while the fermentor is in the fermentation stage)

In [None]:
# Keep only the relevant columns
ndf = df[(df['Status'] == 'Fermentation')][['Timestamp', 'ADF', 'Brand', 'FV Full Plato', 'Plato', 'Fermentation ID']]
ndf

In [None]:
# Filter out rows without valid ADF value
ndf = ndf[~(ndf['ADF'].isin(['Calc Failed']))]
ndf

## Step 2: ADF is an offline measurement, keep only data that corresponds to a new measurement

In [None]:
dff = ndf[ndf.ADF.shift() != ndf.ADF].reset_index()
dff.to_csv('df_step2.csv')
dff[['Timestamp', 'ADF', 'Brand', 'Fermentation ID']]
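
The `shift()` comparison keeps a row only when its ADF differs from the ADF of the row just before it. The same logic written out as a plain loop, as a sketch on invented values; `keep_changes` is a hypothetical helper.

```python
def keep_changes(rows, key):
    """Keep the first row and every row whose `key` differs from the previous row."""
    kept = []
    previous = object()  # sentinel that never equals a real value
    for row in rows:
        if row[key] != previous:
            kept.append(row)
        previous = row[key]
    return kept

# Hypothetical ADF readings: an offline measurement repeated until the next one
rows = [{'ADF': adf} for adf in [0.0, 0.0, 0.1, 0.1, 0.25, 0.1]]
changes = keep_changes(rows, 'ADF')
print([row['ADF'] for row in changes])
```

Note that a value returning to an earlier level (the final 0.1 here) still counts as a new measurement, because only adjacent rows are compared.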

## Step 3: We want to analyze all fermentations together, so we need to look at elapsed time. Here we find the index that corresponds to the beginning of each fermentation

In [None]:
# Make sure that the ADF column is of type float, necessary for arithmetic operations on it
dff = dff.astype({'ADF': 'float'})
dfferm = dff[abs(dff.ADF) <= 0.000001]
# Row labels of dff at which a fermentation starts (ADF numerically zero)
ferm_start_indexes = list(dfferm.index)
print(ferm_start_indexes)

## Step 4: Create new columns (computed in step 5)

* **Elapsed**: elapsed time since fermentation starts

* **tdif**: time difference with row just before

* **adfdif**: ADF difference with row just before 

In [None]:
dff['Elapsed'] = dff['Timestamp'] - dff['Timestamp']
dff['tdif'] = dff['Timestamp'] - dff['Timestamp']
dff['adfdif'] = dff['ADF'] 
dff.to_csv('dff_step4.csv', index=False)
dff

## Step 5: Compute values for 3 new columns

In [None]:
ferm_start_indexes.append(-1)
adf_start_time = dfferm.iloc[0]['Timestamp']
count = 1
for i, row in dff.iterrows():
    try: 
        dff.at[i,'Elapsed'] = row.Timestamp - dfferm.iloc[count-1].Timestamp
        if i != ferm_start_indexes[count-1]:
            dff.at[i,'tdif'] = row.Timestamp - last_timestamp
            dff.at[i,'adfdif'] = row.ADF - last_adf
        last_timestamp = row.Timestamp 
        last_adf = row.ADF 
        if i+1 == ferm_start_indexes[count]:
            count += 1 
    except IndexError:
        pass
dff.to_csv('dff_step5.csv', index=False)
dff[['Timestamp', 'Elapsed', 'tdif', 'ADF', 'adfdif', 'Brand']]
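
The loop above is easier to follow on a tiny example. The sketch below recomputes the three columns for plain dictionaries, treating every row with ADF equal to zero as the start of a new fermentation. It is an illustration under assumptions, not a drop-in replacement: `add_elapsed_columns` is a hypothetical helper, the data are invented, and rows before the first detected start are skipped here, whereas the cell above keeps them with a negative elapsed time.

```python
import datetime as dt

def add_elapsed_columns(rows, tol=1e-6):
    """Annotate time-ordered rows with Elapsed, tdif and adfdif."""
    out, start, prev = [], None, None
    for row in rows:
        if abs(row['ADF']) <= tol:
            # ADF of zero marks the start of a new fermentation
            start, prev = row['Timestamp'], None
        if start is None:
            continue  # no fermentation start seen yet
        first = prev is None
        out.append({
            **row,
            'Elapsed': row['Timestamp'] - start,
            'tdif': dt.timedelta(0) if first else row['Timestamp'] - prev['Timestamp'],
            'adfdif': row['ADF'] if first else row['ADF'] - prev['ADF'],
        })
        prev = row
    return out

t0 = dt.datetime(2017, 3, 17)  # hypothetical timestamps and ADF values
demo = [{'Timestamp': t0 + dt.timedelta(hours=h), 'ADF': adf}
        for h, adf in [(0, 0.0), (6, 0.2), (12, 0.5), (48, 0.0), (50, 0.1)]]
annotated = add_elapsed_columns(demo)
print([str(r['Elapsed']) for r in annotated])
```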

### Number of fermentations for brand Realtime Hops 
This brand has the most data to work with for the analysis part

In [None]:
len(dff[(dff.Brand == 'Realtime Hops') & (abs(dff.ADF) <= 0.000001)])

## All the way up to Step 6:

### Only keep data for Realtime Hops and remove inconsistent data

* All elapsed time must be positive
* A fermentation cannot last more than 4 days; remove rows whose elapsed time exceeds that

In [None]:
delta0 = dt.timedelta(days=0)
delta4 = dt.timedelta(days=4)
rh_df = dff[(dff.Brand == 'Realtime Hops') & (dff.Elapsed >= delta0) & (dff.Elapsed < delta4)]
rh_df

### Compute new columns with Elapsed and tdif in seconds
Note: warnings are OK

In [None]:
rh_df.loc[:, 'Elapsed_seconds'] = rh_df['Elapsed'].dt.days * (60*60*24) + rh_df['Elapsed'].dt.seconds
rh_df.loc[:, 'tdif_seconds'] = rh_df.Elapsed_seconds - rh_df.Elapsed_seconds.shift()
l = [i for i, _ in rh_df[(rh_df.tdif_seconds < 0)].iterrows()]
for i in l:
    # rh_df.at[i, 'tdif_seconds'] = 0
    rh_df.loc[i, 'tdif_seconds'] = 0
rh_df

### Plot all relative fermentation curves for Realtime Hops brand

In [None]:
bad_ferm_ids = ['Fermentor 362017423070', 'Fermentor 36201742557961']  # known bad fermentation batch
rh_df = rh_df[~rh_df['Fermentation ID'].isin(bad_ferm_ids)]
for i in ferm_start_indexes:
    rh_df = rh_df.drop(i, errors='ignore')
data = []
for ferm_id in sorted(rh_df['Fermentation ID'].unique()):
    df_ferm_id = rh_df[rh_df['Fermentation ID'] == ferm_id][['Elapsed_seconds', 'ADF']]
    trace = go.Scattergl(x = df_ferm_id['Elapsed_seconds'], 
                         y = df_ferm_id['ADF'], 
                         mode = 'markers', 
                         name = str(ferm_id),
                         hoverlabel = {'namelength': -1})
    data.append(trace)
    
fig = go.FigureWidget(data=data, layout = dict(title='ADF - Realtime Hops'))

# Learning Outcome: data exploration and visual analysis

In [None]:
fig

### The last cell of the data preprocessing part
The output file `regression_ocs.csv` is the input of the analysis part which follows

In [None]:
# This cell produces a CSV in exactly the same format as the one used by the original R code 

df_csv = rh_df[['Elapsed_seconds', 'ADF', 'tdif_seconds', 'adfdif', 'FV Full Plato', 'Plato']]
df_csvR = df_csv.rename(columns = {'Elapsed_seconds': 'time', 'ADF': 'adf', 'tdif_seconds': 'tdif', 'FV Full Plato': 'A', 'Plato': 'B'})
df_csvR.to_csv('regression_ocs.csv', index=False)
df_csvR

## Learning Outcome: application of analytical techniques
![Linear](./lehigh-linear-model.png)

### Setting up analysis environment (using R)

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
#-------------------------------------------
# LOAD LIBRARIES - 
# NOTE: run at least once with install.packages uncommented
#-------------------------------------------
# install.packages('RCurl', repos='http://cran.us.r-project.org')
library(RCurl);
# install.packages('tibble', repos='http://cran.us.r-project.org')
library(tibble)
# install.packages('ggplot2', repos='http://cran.us.r-project.org')
library(ggplot2)

# install.packages('data.table', repos='http://cran.us.r-project.org')
library(data.table)
# install.packages('segmented', repos='http://cran.us.r-project.org')
library(segmented)

### Plot data to analyse: Realtime Hops brand fermentations

In [None]:
%%R
# Header: time,adf,tdif,adfdif,A,B
MyData <- read.csv(file="regression_ocs.csv", header=TRUE, sep=",")
MyData$time <- MyData$time/60/60
plot(MyData$time, MyData$adf,xlab = "Time Since Fermentation [hours]", ylab='ADF', main = "All data")

### Create filter for outliers

In [None]:
%%R
outliers <- (MyData$time< 50000/60/60 & MyData$adf > 0.2) + (MyData$time< 70000/60/60 & MyData$adf > 0.3) + (MyData$time< 120000/60/60 & MyData$adf > 0.6)

goodData <- !outliers

goodData
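
The thresholds in the R filter are written in seconds and converted to hours on the fly, which makes them hard to read. The sketch below restates the same rule in Python with the conversion made explicit: a point is an outlier when its ADF is too high for the elapsed time. `is_outlier` is a hypothetical helper and the test points are invented.

```python
# (elapsed-time limit in seconds, maximum plausible ADF before that time),
# copied from the R expression above
LIMITS = [(50000, 0.2), (70000, 0.3), (120000, 0.6)]

def is_outlier(time_hours, adf):
    """Mirror of the R filter: ADF too high for the elapsed time."""
    return any(time_hours < seconds / 3600 and adf > max_adf
               for seconds, max_adf in LIMITS)

for seconds, max_adf in LIMITS:
    print(f'before {seconds / 3600:.1f} h, ADF above {max_adf} is an outlier')
```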

## First try with a simple linear model 

In [None]:
%%R
ADF <- MyData$adf[goodData]
Times <- MyData$time[goodData]

In [None]:
%%R -o s 
plot(Times, ADF,xlab = "Time Since Fermentation [hours]", ylab='ADF', main = "ADF ~ Time")
linMod <- lm(ADF ~ Times)
abline(linMod$coefficients)
s <- summary(linMod) 

### Information summary from linear regression library

In [None]:
print(s)
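
For readers who do not use R, the fit performed by `lm(ADF ~ Times)` is ordinary least squares for a straight line. A minimal Python sketch of the closed-form solution on invented data; `fit_line` and `residuals` are hypothetical helpers, and the R summary reports much more (standard errors, R-squared) than this does.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys, intercept, slope):
    """Observed minus fitted value for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical, perfectly linear data: hours since start vs ADF
hours = [0, 10, 20, 30, 40]
adf = [0.05, 0.25, 0.45, 0.65, 0.85]
intercept, slope = fit_line(hours, adf)
print(intercept, slope)
```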

![Error](./images/lehigh-linear-check.png)

## Actual residuals 

In [None]:
%%R 
predicted <- fitted(linMod)
errors <- resid(linMod)
plot( predicted ,errors, main = "Residuals vs. predicted ADF" )

## Normality test - standardized residuals should follow the reference line if the model is OK

In [None]:
%%R
stdres = rstandard(linMod)
qqnorm(stdres, ylab="Standardized Residuals", xlab="Normal Scores", main="Normality test") 
qqline(stdres)

![Change model](./images/lehigh-change-model.png)

![Justification](./images/lehigh-justification-new-model.png)

![Segmented](./images/lehigh-segmented-model.png)

## Improved model: segmented linear 

Using the R package `segmented` 

In [None]:
%%R -o o
plot(Times, ADF,xlab = "Time Since Fermentation [hours]", ylab='ADF', main = "ADF ~ Time")
linMod <- lm(ADF ~ Times)
set.seed(12)
xx <- Times
yy <- ADF
dati <- data.frame(x = xx, y = yy )
out.lm <- lm(y ~ x, data = dati)
o <- segmented(out.lm, seg.Z = ~x, psi = list(x = c(10,40)), control = seg.control(display = FALSE) )
dat2 = data.frame(x = xx, y = broken.line(o)$fit)
o

### Information summary from segmented linear regression library

In [None]:
print(o)
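
A segmented linear model is a continuous line whose slope changes at each breakpoint. One common way to write it is a base line plus one hinge term per breakpoint, sketched below in Python. The coefficients are invented for illustration; only the breakpoint positions 10 and 40 hours echo the starting guesses passed as `psi` above, and the fitted values reported by R will differ.

```python
def segmented_line(x, intercept, slope, breakpoints, slope_changes):
    """Continuous piecewise-linear curve.

    The slope changes by slope_changes[k] once x passes breakpoints[k].
    """
    y = intercept + slope * x
    for psi, delta in zip(breakpoints, slope_changes):
        y += delta * max(0.0, x - psi)
    return y

# Hypothetical coefficients: slow start, fast middle phase, plateau
params = dict(intercept=0.0, slope=0.005,
              breakpoints=[10, 40], slope_changes=[0.02, -0.024])
curve = [(x, segmented_line(x, **params)) for x in (0, 10, 40, 50)]
print(curve)
```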

### Graphical verification of the segmented linear model prediction

In [None]:
%%R 
ggplot(dati, aes(x = x, y = y)) + xlab("Time Since Fermentation [hours]") + ylab('ADF') + geom_point() +
  geom_line(data = dat2, color = 'blue')

### Error analysis (again, should be randomly distributed)

In [None]:
%%R 
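# Use the fitted values of the segmented model, not those left over from the linear model
predicted <- fitted(o)
errors <- resid(o)

plot( predicted ,errors, main = "Residuals vs. predicted ADF" )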

In [None]:
%%R
stdres = rstandard(o)
qqnorm(stdres, ylab="Standardized Residuals", xlab="Normal Scores", main="Normality test") 
qqline(stdres)

![noimg](./images/osi_thank_you_slide.png)