# Big data for industrial operations 

#### What is the topic?
**"Industry 4.0"**: the idea that data generated by industrial equipment holds untapped business value. Raw equipment data is used to support management decisions, with the aim of reducing maintenance costs and improving customer service.

In particular:  

To succeed in the current market, the **oil and gas industry** needs to take full advantage of digital transformation opportunities. Doing so allows companies to make better decisions, reduce risk and use resources efficiently.  
Big data is available from virtually every aspect of drilling, production, operations and maintenance.
Businesses need to keep pace in their ability to process and analyze that data, so that they can verify that machines are working properly, anticipate breakdowns or failures, and improve production performance.


-----------------------------------------------------------------------------------------------------------------------------

I am going to analyse data from https://openindustrialdata.com/

The data originates from a single **compressor on Aker BP’s Valhall oil platform in the North Sea.**



It includes *time series data*, *maintenance history data*, and *Process & Instrumentation Diagrams (P&IDs)* for Valhall’s **first stage compressor and associated process equipment**: first stage suction cooler, first stage suction scrubber, first stage compressor and first stage discharge coolers. In addition, data from the compressor’s lubrication system, dry gas seal system and condition monitoring system (temperature and vibration) is available.

(The idea is to use Time Series Data to see how factors change over time)

#### Some of the questions that might be interesting to answer (not all):

- **Sensors on oil platforms can be very expensive to install and maintain. Can you find a combination of other sensor readings that predicts the reading of a given sensor?**  
 (Linear regression, MLP, RandomForest...)  
 
 
- **What is the typical difference between the input and output pressure of the compressor?**  
 (Correlation between input and output over time, AR model?)
   
   
- **Can you infer the thermal efficiency of the 1st stage suction cooler from sensor data?**  
 (?)
  
  
- **How do the above compare to the specifications in the process diagram?**  
 (?)
  
  
- **What is the typical start-up time of the compressor, from standstill to stable operation?**  
 (?)
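
As a rough illustration of the first question, here is a minimal sketch of predicting one sensor from a combination of others with ordinary least squares. It runs on synthetic data only: the variable names (`suction_pressure`, `suction_temp`, `shaft_speed`, `discharge_pressure`) and the linear relationship between them are assumptions made up for the example, not properties of the Valhall data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for three sensor readings (NOT real Valhall data).
n = 500
suction_pressure = rng.normal(3.0, 0.2, n)
suction_temp = rng.normal(35.0, 2.0, n)
shaft_speed = rng.normal(9000.0, 150.0, n)

# Hypothetical "expensive" sensor: a linear mix of the others plus noise.
discharge_pressure = (
    2.5 * suction_pressure
    + 0.01 * suction_temp
    + 0.0005 * shaft_speed
    + rng.normal(0.0, 0.05, n)
)

# Design matrix with a constant column for the intercept.
X = np.column_stack([suction_pressure, suction_temp, shaft_speed, np.ones(n)])
y = discharge_pressure

# Chronological split: fit on the first 80%, evaluate on the last 20%.
split = int(0.8 * n)
coef, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)

# Coefficient of determination on the held-out part.
pred = X[split:] @ coef
ss_res = np.sum((y[split:] - pred) ** 2)
ss_tot = np.sum((y[split:] - y[split:].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"held-out R^2: {r2:.3f}")
```

With real time series the split should stay chronological, as here, so that the model is never evaluated on data older than what it was fitted on. An MLP or a random forest could replace the least-squares fit if the relationship turns out not to be linear.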

I'm going to fetch the data from the Cognite Data Platform (CDP) through its Python SDK.  
Here is the documentation: https://cognite-sdk-python.readthedocs-hosted.com/en/latest/index.html

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

import os
from datetime import datetime, timedelta
from getpass import getpass
from getpass import getpass


from cognite import CogniteClient

In [2]:
client = CogniteClient(api_key=getpass("API-KEY: "))

API-KEY: ········


#### What is an asset? 

Assets are digital representations of physical objects or groups of objects, and assets are organized into an asset hierarchy. For example, an asset can represent a water pump which is part of a subsystem on an oil platform. They are used to connect related data together.

Fetch the assets (as the output shows, they include first and second stage compressors, valves and other equipment):

In [3]:
asset_df=client.assets.get_assets().to_pandas()
asset_df

Unnamed: 0,createdTime,depth,description,id,lastUpdatedTime,metadata,name,parentId,path
0,0,2,GAS COMPRESSION AND RE-INJECTION (PH),3111454725058294,0,"{'SOURCE_DB': 'workmate', 'SOURCE_TABLE': 'wma...",23,4650652196144007,"[6687602007296940, 4650652196144007, 311145472..."
1,0,7,1ST STAGE COMP DRY GAS SEAL SYS ON PH,3904753668320840,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-1ST STAGE COMP DRY GAS SEAL SYS-PH,4856008121737468,"[6687602007296940, 4650652196144007, 311145472..."
2,0,7,1ST STAGE COMP ENCLOSURE ON PH,2499711953216311,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-1ST STAGE COMP ENCLOSURE-PH,4856008121737468,"[6687602007296940, 4650652196144007, 311145472..."
3,0,7,1ST STAGE COMP LUBE OIL SYS ON PH,2137557577165478,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-1ST STAGE COMP LUBE OIL SYS-PH,4856008121737468,"[6687602007296940, 4650652196144007, 311145472..."
4,0,4,1ST STAGE COMPRESSION ON PH,4518112062673878,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-1ST STAGE COMPRESSION-PH,6895991969886325,"[6687602007296940, 4650652196144007, 311145472..."
5,0,5,1ST STAGE COMPRESSOR ON PH,7372310232665628,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-1ST STAGE COMPRESSOR-PH,4518112062673878,"[6687602007296940, 4650652196144007, 311145472..."
6,0,7,2ND STAGE COMP DRY GAS SEAL SYS ON PH,6658342189327214,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-2ND STAGE COMP DRY GAS SEAL SYS-PH,4222791488928479,"[6687602007296940, 4650652196144007, 311145472..."
7,0,4,2ND STAGE COMPRESSION ON PH,5786472304680477,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-2ND STAGE COMPRESSION-PH,6895991969886325,"[6687602007296940, 4650652196144007, 311145472..."
8,0,5,2ND STAGE COMPRESSOR ON PH,4074033093163622,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-2ND STAGE COMPRESSOR-PH,5786472304680477,"[6687602007296940, 4650652196144007, 311145472..."
9,0,6,VRD - PH 1STSTGSUCT SCRUB INLET,1145062594143414,0,"{'ELC_STATUS_ID': '1211', 'RES_ID': '531669', ...",23-AE-92527-S1,53231887945301,"[6687602007296940, 4650652196144007, 311145472..."


In [4]:
asset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
createdTime        1000 non-null int64
depth              1000 non-null int64
description        1000 non-null object
id                 1000 non-null int64
lastUpdatedTime    1000 non-null int64
metadata           1000 non-null object
name               1000 non-null object
parentId           1000 non-null int64
path               1000 non-null object
dtypes: int64(5), object(4)
memory usage: 70.4+ KB


We can visualize a single asset (so a single row) and its properties

In [5]:
asset_name = asset_df["name"][1]  # 23-1ST STAGE COMP DRY GAS SEAL SYS-PH

In [6]:
asset_id = asset_df[asset_df["name"] == asset_name].iloc[0]['id']
client.assets.get_asset(asset_id=asset_id).to_pandas()

Unnamed: 0,0
id,3904753668320840
depth,7
name,23-1ST STAGE COMP DRY GAS SEAL SYS-PH
parentId,4856008121737468
description,1ST STAGE COMP DRY GAS SEAL SYS ON PH
metadata,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma..."
createdTime,0
lastUpdatedTime,0
path,"[6687602007296940, 4650652196144007, 311145472..."


As we have seen, each asset has various properties, among which:
- Id: the asset's own id
- Depth: the asset's level in the hierarchy, i.e. the number of edges between it and the root node
- Name: the asset's name
- ParentId: the id of its parent asset
- Description: a short free-text description, which includes information such as the platform and the type of equipment or sensor
- Metadata: additional key-value pairs attached to the asset (for example the source database)


etc.
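
To make the relation between these properties concrete, here is a small sketch on a toy hierarchy. It assumes that `path` lists the ids from the root down to the asset and that `depth` counts the edges from the root, which is how the rows above read. The ids, the `parents` mapping and the helper names `path_to` and `depth_of` are made up for the illustration and are not part of the SDK.

```python
# Toy hierarchy: asset id -> parent id (None marks the root).
# The ids and the equipment they stand for are hypothetical.
parents = {
    1: None,  # platform
    2: 1,     # gas compression system
    3: 2,     # first stage compressor
    4: 3,     # pressure transmitter on the compressor
    5: 2,     # second stage compressor
}


def path_to(asset_id, parents):
    """Return the list of ids from the root down to asset_id."""
    path = []
    current = asset_id
    while current is not None:
        path.append(current)
        current = parents[current]
    return path[::-1]


def depth_of(asset_id, parents):
    """Depth is the number of edges between the root and the asset."""
    return len(path_to(asset_id, parents)) - 1


print(path_to(4, parents))   # [1, 2, 3, 4]
print(depth_of(4, parents))  # 3
```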


We can fetch an asset subtree using get_asset_subtree().   
We can start by specifying a depth of 1, so that we get the subassets one level below the asset selected above (23-1ST STAGE COMP DRY GAS SEAL SYS-PH).

In [7]:
subassets = client.assets.get_asset_subtree(asset_id=asset_id, depth=1).to_pandas()
subassets.head()

Unnamed: 0,createdTime,depth,description,id,lastUpdatedTime,metadata,name,parentId,path
0,0,7,1ST STAGE COMP DRY GAS SEAL SYS ON PH,3904753668320840,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-1ST STAGE COMP DRY GAS SEAL SYS-PH,4856008121737468,"[6687602007296940, 4650652196144007, 311145472..."
1,0,8,VRD - PH 1STSTG COMP SEPAR GAS,311006305196759,0,"{'ELC_STATUS_ID': '1211', 'RES_ID': '532915', ...",23-PT-96160-01,3904753668320840,"[6687602007296940, 4650652196144007, 311145472..."
2,0,8,VRD - 1ST STAGE COMPRESSOR NITROGEN FILTER B,928299407571359,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-CB-9129B,3904753668320840,"[6687602007296940, 4650652196144007, 311145472..."
3,0,8,VRD - PH 1STSTG COMP OUTER SEAL NDE,1129133419719764,0,"{'ELC_STATUS_ID': '1211', 'RES_ID': '495058', ...",23-PT-96157-02,3904753668320840,"[6687602007296940, 4650652196144007, 311145472..."
4,0,8,VRD - 1ST STAGE COMPRESSOR GAS SEAL FILTER A,1510541513998137,0,"{'ELC_STATUS_ID': '1211', 'SOURCE_DB': 'workma...",23-CB-9104A,3904753668320840,"[6687602007296940, 4650652196144007, 311145472..."
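
The same kind of selection can be reproduced offline on a dataframe that has `id`, `depth` and `path` columns. The sketch below is only an illustration of what a depth-limited subtree means, not the SDK's implementation. The toy table and the helper `subtree` are hypothetical, and like the output above it returns the root asset together with its descendants.

```python
import pandas as pd

# Toy asset table with the same column names as asset_df.
# Ids and names are hypothetical.
toy_assets = pd.DataFrame(
    {
        "id": [10, 20, 21, 30, 40],
        "name": ["system", "seal-sys", "lube-sys", "sensor-A", "sensor-A-part"],
        "depth": [0, 1, 1, 2, 3],
        "path": [[10], [10, 20], [10, 21], [10, 20, 30], [10, 20, 30, 40]],
    }
)


def subtree(assets, root_id, depth):
    """Rows for root_id and its descendants at most `depth` levels below it."""
    root_depth = assets.loc[assets["id"] == root_id, "depth"].iloc[0]
    # An asset belongs to the subtree if the root id appears in its path.
    in_subtree = assets["path"].apply(lambda p: root_id in p)
    close_enough = assets["depth"] <= root_depth + depth
    return assets[in_subtree & close_enough]


print(subtree(toy_assets, root_id=20, depth=1)["name"].tolist())
# ['seal-sys', 'sensor-A']
```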


Now, for each subasset, we fetch its time series and collect them into a single dataframe.

In [8]:
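ts_frames = []

for index, asset in subassets.iterrows():
    ts_df = client.time_series.get_time_series(asset_id=int(asset['id'])).to_pandas()

    # Keep only assets that actually have time series attached
    if not ts_df.empty:
        ts_frames.append(ts_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
all_ts_df = pd.concat(ts_frames, ignore_index=True) if ts_frames else pd.DataFrame()
all_ts_df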

Unnamed: 0,assetId,createdTime,description,id,isStep,isString,lastUpdatedTime,name
0,311006305196759,0,PH 1st Stg Comp Separation Gas,8448039626216272,False,False,0,VAL_23-PT-96160:Z.X1.Value
1,1129133419719764,0,PH 1st Stg Outer Seal NDE,527653744772094,False,False,0,VAL_23-PT-96157:Z.X2.Value
2,2456441663607972,0,PH 1st Stg Outer Seal NDE,813184639383284,False,False,0,VAL_23-PT-96157:Z.X1.Value
3,2539007469802785,0,PH 1stStg Seal Gas Heater,5769675878102582,False,False,0,VAL_23-FE-9122-H01-F-EL:XS.MeasuredValues.Curr...
4,2539007469802785,0,PH 1stStg Seal Gas Heater,8417055044933511,False,False,0,VAL_23-FE-9122-H01-F:Z.Y.Value
5,4050790831683279,0,PH 1st Stg Comp Over Inner Seal,713200317738867,False,False,0,VAL_23-PDI-96150:X.Value
6,4050790831683279,0,PH 1st Stg Inner Seal NDE,3054990494708797,False,False,0,VAL_23-PT-96150:Z.X1.Value
7,4239585628663887,0,PH 1stStg Comp Inner Seal DE,712404469104819,False,False,0,VAL_23-FT-96151:X.Value
8,5784138902448314,0,PH 1st Stg Comp Separation Gas,4249054303737874,False,False,0,VAL_23-PT-96160:Z.X2.Value
9,6223287279641772,0,PH 1st Stage Comp Inner Seal DE,362924890178640,False,False,0,VAL_23-PT-96149:Z.X2.Value


#### View datapoints for each one of the time series
A datapoint in the CDP is stored as a key-value pair:
- timestamp is the time since the Unix epoch in milliseconds
- value is the value read from the sensor

In [9]:
client.datapoints.get_datapoints(name=all_ts_df['name'][0], start="30d-ago").to_pandas().head()

Unnamed: 0,timestamp,value
0,1547819099291,57.301441
1,1547819100291,57.354084
2,1547819101291,57.617302
3,1547819102291,57.196156
4,1547819103291,57.617302
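
Since the timestamps are integers in milliseconds, a natural next step for time series analysis is to convert them to datetimes and use them as the index. The sketch below does this with plain pandas on the five datapoints shown above, typed in by hand so that it runs without API access. The names `dp`, `series` and `step` are just illustrative.

```python
import pandas as pd

# The five datapoints from the output above, copied by hand.
dp = pd.DataFrame(
    {
        "timestamp": [
            1547819099291,
            1547819100291,
            1547819101291,
            1547819102291,
            1547819103291,
        ],
        "value": [57.301441, 57.354084, 57.617302, 57.196156, 57.617302],
    }
)

# Interpret the integers as milliseconds since the Unix epoch (UTC).
dp["time"] = pd.to_datetime(dp["timestamp"], unit="ms")
series = dp.set_index("time")["value"]

# Spacing between consecutive samples.
step = series.index.to_series().diff().dropna()
print(series)
print(step.unique())
```

For this sample the readings are one second apart, and the first one falls on 18 January 2019 (UTC). A datetime index also makes resampling and rolling-window statistics straightforward later on.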
