##Follow the Section Links to read the TalentLMS content before running code examples

##[1. Environment Set Up](https://cognite.talentlms.com/unit/view/id:2116)

###Install the Cognite SDK package

In [0]:
!pip install cognite-sdk
!pip install --upgrade numpy

###Import other required packages

In [0]:
%matplotlib inline

import os
from datetime import datetime, timedelta
from getpass import getpass

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from cognite.client.cognite_client import CogniteClient

### Connect to Cognite Data Fusion
All queries to the Cognite API are sent through this client object.

When prompted for your API key, use the key you generated and stored earlier in the course.

In [0]:
client = CogniteClient(api_key=getpass("Open Industrial Data API-KEY: "))

##[2. Retrieving Lists of Assets](https://cognite.talentlms.com/unit/view/id:2118)

###Random Search
The *client.assets.get_assets()* call retrieves the assets, *to_pandas()* converts the result to a pandas DataFrame, and *.head()* displays the first rows of that DataFrame.

In [0]:
client.assets.get_assets().to_pandas().head()

###Fuzzy Search

In [0]:
asset_name = "23-HA-9103"
asset_df = client.assets.search_for_assets(name=asset_name).to_pandas()
asset_df.head()

###Specific Search
The _get_asset()_ interface provides the same information for one specific asset, identified by its ID.

In [0]:
asset_id = asset_df[asset_df["name"] == asset_name].iloc[0]['id']
asset = client.assets.get_asset(asset_id=asset_id).to_pandas()
asset
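
A fuzzy search can return several assets whose names resemble the query, which is why the cell above filters on an exact name before taking the first row. Below is a minimal sketch of that selection using plain Python records instead of a DataFrame. The asset names and IDs are made up for illustration, and *first_exact_match_id* is a hypothetical helper, not part of the SDK.

```python
# Hypothetical search results; names and IDs are invented for illustration.
search_results = [
    {"id": 101, "name": "23-HA-9103-M01"},
    {"id": 102, "name": "23-HA-9103"},
    {"id": 103, "name": "23-HA-9103-E01"},
]


def first_exact_match_id(records, name):
    """Return the id of the first record whose name equals `name` exactly."""
    for record in records:
        if record["name"] == name:
            return record["id"]
    raise LookupError(f"no asset named {name!r}")


exact_id = first_exact_match_id(search_results, "23-HA-9103")
print(exact_id)
```

In the pandas version, an exact-name filter that matches nothing leaves an empty DataFrame, and *.iloc[0]* then raises an IndexError.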

##[3. Asset Hierarchy and Relationships](https://cognite.talentlms.com/unit/view/id:2124)

We will generate a list of all children of the main asset of interest. This is done by specifying a depth of 1.

In [0]:
subtree_df = client.assets.get_asset_subtree(asset_id=asset_id, depth=1).to_pandas()
subtree_df.head()
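
To make the idea of depth concrete, here is a toy traversal over a hypothetical parent-to-children mapping. It is a sketch only and not the SDK's implementation; in particular, including the starting asset in the result is a choice made here, not something confirmed about the API response.

```python
# Hypothetical hierarchy: each key is an asset, each value lists its direct children.
children = {
    "root": ["pump", "valve"],
    "pump": ["motor", "bearing"],
    "valve": [],
    "motor": [],
    "bearing": [],
}


def subtree(asset, depth):
    """Collect `asset` and its descendants down to `depth` levels below it."""
    found = [asset]
    if depth > 0:
        for child in children.get(asset, []):
            found.extend(subtree(child, depth - 1))
    return found


depth_one = subtree("root", depth=1)   # the asset and its direct children
depth_two = subtree("root", depth=2)   # also includes grandchildren
print(depth_one)
print(depth_two)
```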

##[4. Collecting Time Series Data](https://cognite.talentlms.com/unit/view/id:2119)

###Compile a list of time series objects under the asset
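For each of the child assets, we want the associated time series objects in a DataFrame. The call below lists time series objects and converts them into a DataFrame; as written it passes no asset filter, so the result is not limited to the child assets.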

In [0]:
all_timeseries = client.time_series.get_time_series().to_pandas()
print(len(all_timeseries))
all_timeseries.head()

If you are curious about which asset a time series is attached to, you can retrieve more information about that asset with:

In [0]:
client.assets.get_asset(asset_id=2853212781345885).to_pandas()

###View datapoints for specific time series
Datapoints are retrieved by time series name, which is the name column of the DataFrame above.

In [0]:
client.datapoints.get_datapoints("VAL_45-PT-92508:X.Value", start="10d-ago").to_pandas().head()

##[5. Use Cases of CDF Data](https://cognite.talentlms.com/unit/view/id:2120)

###Collect datapoints from CDF
The input time series names are defined in the in_ts_names list, and the output time series name in the out_ts_name variable below.

In [0]:
in_ts_names = ["VAL_23-FT-92512:X.Value", "VAL_23-PT-92512:X.Value", "VAL_23-TT-92502:X.Value"]
out_ts_name = "VAL_23-PT-92504:X.Value"

###Retrieve Data Points from CDF


In [0]:
ts_names = in_ts_names + [out_ts_name]

train_start_date = datetime(2018, 8, 1)
# 30 days of training data chosen arbitrarily
train_end_date = train_start_date + timedelta(days=30)

datapoints_df = client.datapoints.get_datapoints_frame(time_series=ts_names,
                                                       aggregates=['avg'],
                                                       granularity='1m',
                                                       start=train_start_date,
                                                       end=train_end_date
                                                       )
datapoints_df.fillna(method="ffill", inplace = True)
datapoints_df.head()

In [0]:
datapoints_df.isna().any()
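
Forward filling carries the last observed value into each gap that follows it. The sketch below reproduces that behaviour on a plain list, with None standing in for a missing value; *forward_fill* is a hypothetical helper written for this illustration.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


filled = forward_fill([None, 3.1, None, None, 3.4, None])
print(filled)
```

A gap at the very start has nothing to carry forward and stays missing, which is the kind of leftover the *isna().any()* check above would reveal.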

In [0]:
# Remove "|average"
datapoints_df.rename(columns=lambda x: x[:-8] if x != "timestamp" else x, inplace=True)
datapoints_df.head()
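
The rename above relies on the suffix "|average" being exactly 8 characters long, so *x[:-8]* cuts it off. A slightly more defensive variant, sketched here on plain strings, only strips the suffix when it is actually present; *strip_aggregate_suffix* is a hypothetical helper name.

```python
SUFFIX = "|average"


def strip_aggregate_suffix(column):
    """Drop a trailing '|average' if present; leave other names untouched."""
    if column.endswith(SUFFIX):
        return column[: -len(SUFFIX)]
    return column


columns = ["timestamp", "VAL_23-FT-92512:X.Value|average"]
renamed = [strip_aggregate_suffix(c) for c in columns]
print(renamed)
```

With this version the "timestamp" column needs no special case, and running the rename twice does not shorten the names a second time.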

###Visualize the Time Series Data
The bottom-right plot is the output time series; the other three are the inputs used to create an estimate for the output.

In [0]:
cols = list(datapoints_df.columns)
cols.remove('timestamp')

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,10))
for i, col in enumerate(cols):
    datapoints_df.plot(x='timestamp', y=col, ax=axes[int(i>1), i%2]);
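
The expression *axes[int(i>1), i%2]* maps the column counter onto the 2x2 grid row by row. The short sketch below lists the resulting positions. That the output series ends up bottom right assumes the DataFrame columns follow the order of *ts_names*, where the output name comes last.

```python
# Row/column position of each of the four plots in the 2x2 grid.
positions = [(int(i > 1), i % 2) for i in range(4)]
print(positions)

# divmod(i, 2) gives the same mapping and extends to grids with more rows.
assert positions == [divmod(i, 2) for i in range(4)]
```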

##[6. Model Creation](https://cognite.talentlms.com/unit/view/id:2121)

###Gathering Training Data

In [0]:
train_X = datapoints_df[in_ts_names].values
train_y = datapoints_df[out_ts_name].values

###Get a separate DataFrame from CDF
The data which we will use to predict the output pressure will be stored in a separate DataFrame, collected below.

In [0]:
predict_start_date = train_end_date
# Make the prediction on 1 hour of data
predict_end_date = train_end_date + timedelta(hours=1)

predict_df = client.datapoints.get_datapoints_frame(time_series=ts_names,
                                                       aggregates=['avg'],
                                                       granularity='1m',
                                                       start=predict_start_date,
                                                       end=predict_end_date
                                                       )
predict_df.fillna(method="ffill", inplace =True)
# Remove "|average"
predict_df.rename(columns=lambda x: x[:-8] if x != "timestamp" else x, inplace=True)
predict_df.head()
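
As a quick sanity check on the two windows, the sketch below counts how many one-minute buckets each one spans. These are upper bounds on the number of rows, assuming at most one row per minute; the actual DataFrames may be shorter if the API omits empty buckets.

```python
from datetime import datetime, timedelta

train_start = datetime(2018, 8, 1)
train_end = train_start + timedelta(days=30)
predict_end = train_end + timedelta(hours=1)

one_minute = timedelta(minutes=1)
# Dividing one timedelta by another with // yields a whole number of buckets.
train_buckets = (train_end - train_start) // one_minute
predict_buckets = (predict_end - train_end) // one_minute

print(train_end, train_buckets, predict_buckets)
```

The prediction window starts at the instant the training window ends, so the model is evaluated on the hour immediately following its training data.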

In [0]:
cols = list(predict_df.columns)
cols.remove('timestamp')

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,10))
for i, col in enumerate(cols):
    predict_df.plot(x='timestamp', y=col, ax=axes[int(i>1), i%2]);

##[7. Linear Regression Model](https://cognite.talentlms.com/unit/view/id:2122)
As a simple starting point, we will check how well a linear regression model predicts the output pressure.

###Utilize sklearn to create a basic linear regression model
Sklearn is a common package for building and deploying data science models. Linear regression is only one of many options for constructing models.

In [0]:
lin_reg = LinearRegression()
lin_reg.fit(train_X, train_y)

X = predict_df[in_ts_names].values
predict_df["prediction_lin_reg1"] = lin_reg.predict(X)

# print out mse of the prediction
mse = mean_squared_error(predict_df[out_ts_name], predict_df["prediction_lin_reg1"])
r2_s = r2_score(predict_df[out_ts_name], predict_df["prediction_lin_reg1"])
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 4)))
print('The R2 score of our forecasts is {}'.format(round(r2_s, 4)))

predict_df.plot(x="timestamp", y=[out_ts_name, "prediction_lin_reg1"], figsize=(10,10));
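
For readers who want to see what the two reported metrics measure, here is a plain-Python sketch of their standard definitions. The function names and the sample values are made up for illustration; in the notebook the sklearn functions are used.

```python
def mean_squared_error_py(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_py(y_true, y_pred):
    """1 minus the ratio of residual variation to variation around the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 3.5, 4.0, 4.5]   # invented pressure readings
y_pred = [3.1, 3.4, 4.2, 4.3]   # invented model output

print(round(mean_squared_error_py(y_true, y_pred), 4))
print(round(r2_score_py(y_true, y_pred), 4))
```

An R2 of 1 means a perfect fit, 0 means the model does no better than always predicting the mean of the actual values, and a negative score means it does worse than that.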

###Look at the fit for the training data

In [0]:
lin_reg = LinearRegression()
lin_reg.fit(train_X, train_y)

X = datapoints_df[in_ts_names].values
datapoints_df["prediction_lin_reg1"] = lin_reg.predict(X)

# print out mse and R2 on the training data
mse = mean_squared_error(datapoints_df[out_ts_name], datapoints_df["prediction_lin_reg1"])
r2_s = r2_score(datapoints_df[out_ts_name], datapoints_df["prediction_lin_reg1"])
print('The Mean Squared Error on the training data is {}'.format(round(mse, 4)))
print('The R2 score of our training data is {}'.format(round(r2_s, 4)))

datapoints_df.plot(x="timestamp", y=[out_ts_name, "prediction_lin_reg1"], figsize=(10,10));

###Add dummy variable for anomalous period

In [0]:
datapoints_df['state'] = (datapoints_df[out_ts_name]< 4)*1
predict_df['state'] = (predict_df[out_ts_name]< 4)*1

In [0]:
train_X2 = datapoints_df[in_ts_names + ['state']].values

lin_reg = LinearRegression()
lin_reg.fit(train_X2, train_y)

X = predict_df[in_ts_names + ['state']].values
predict_df["prediction_lin_reg2"] = lin_reg.predict(X)

# print out mse of the prediction
mse = mean_squared_error(predict_df[out_ts_name], predict_df["prediction_lin_reg2"])
r2_s = r2_score(predict_df[out_ts_name], predict_df["prediction_lin_reg2"])
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 4)))
print('The R2 score of our forecasts is {}'.format(round(r2_s, 4)))

predict_df.plot(x="timestamp", y=[out_ts_name, "prediction_lin_reg2"], figsize=(10,10));

###Look at the fit for the training data

In [0]:
lin_reg = LinearRegression()
lin_reg.fit(train_X2, train_y)

X = datapoints_df[in_ts_names + ['state']].values
datapoints_df["prediction_lin_reg2"] = lin_reg.predict(X)

# print out mse of the prediction
mse = mean_squared_error(datapoints_df[out_ts_name], datapoints_df["prediction_lin_reg2"])
r2_s = r2_score(datapoints_df[out_ts_name], datapoints_df["prediction_lin_reg2"])
print('The Mean Squared Error on the training data is {}'.format(round(mse, 4)))
print('The R2 score of our training data is {}'.format(round(r2_s, 4)))

datapoints_df.plot(x="timestamp", y=[out_ts_name, "prediction_lin_reg2"], figsize=(10,10));

###Remove Outliers

In [0]:
quantiles = [0.95, 0.975, 0.98, 0.99]
quantiles_df = pd.DataFrame([np.array(quantiles), np.quantile(datapoints_df[out_ts_name], q=quantiles)]).T
quantiles_df.columns = ['quantile', 'value']
quantiles_df
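
A table of high quantiles like this one can help in choosing a cutoff, because it shows where the unusually large values begin. The sketch below computes a quantile with linear interpolation between neighbouring sorted values, which is NumPy's default method. The helper name and the pressure values are invented for illustration.

```python
import math


def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)          # fractional index into the sorted data
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)   # stay in range when q == 1
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


pressures = [3.2, 3.4, 3.3, 3.5, 9.0]  # made-up readings with one spike

for q in (0.5, 0.75, 0.95):
    print(q, quantile_linear(pressures, q))
```

In this toy data the median and the 0.75 quantile stay near the typical readings, while the 0.95 quantile is pulled towards the single spike.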

In [0]:
datapoints_df_adj = datapoints_df.loc[datapoints_df[out_ts_name]< 4,:]


In [0]:
train_X_adj = datapoints_df_adj[in_ts_names].values
train_y_adj = datapoints_df_adj[out_ts_name].values

In [0]:
lin_reg = LinearRegression()
lin_reg.fit(train_X_adj, train_y_adj)

X = predict_df[in_ts_names].values
predict_df["prediction_lin_reg3"] = lin_reg.predict(X)

# print out mse of the prediction
mse = mean_squared_error(predict_df[out_ts_name], predict_df["prediction_lin_reg3"])
r2_s = r2_score(predict_df[out_ts_name], predict_df["prediction_lin_reg3"])
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 4)))
print('The R2 score of our forecasts is {}'.format(round(r2_s, 4)))

predict_df.plot(x="timestamp", y=[out_ts_name, "prediction_lin_reg3"], figsize=(10,10));

###Look at the fit for the training data

In [0]:
lin_reg = LinearRegression()
lin_reg.fit(train_X_adj, train_y_adj)

X = datapoints_df[in_ts_names].values
datapoints_df["prediction_lin_reg3"] = lin_reg.predict(X)

# print out mse of the prediction
mse = mean_squared_error(datapoints_df[out_ts_name], datapoints_df["prediction_lin_reg3"])
r2_s = r2_score(datapoints_df[out_ts_name], datapoints_df["prediction_lin_reg3"])
print('The Mean Squared Error on the training data is {}'.format(round(mse, 4)))
print('The R2 score of our training data is {}'.format(round(r2_s, 4)))

datapoints_df.plot(x="timestamp", y=[out_ts_name, "prediction_lin_reg3"], figsize=(10,10));

##[8. Random Forest Ensemble Model](https://cognite.talentlms.com/unit/view/id:2123)

In [0]:
rnd_forest_reg = RandomForestRegressor(n_estimators=10, min_samples_split=20, max_depth=5)
rnd_forest_reg.fit(train_X, train_y)

X = predict_df[in_ts_names].values
predict_df["prediction_rnd_forest"] = rnd_forest_reg.predict(X)

# print out mse of the prediction
mse = mean_squared_error(predict_df[out_ts_name], predict_df["prediction_rnd_forest"])
r2_s = r2_score(predict_df[out_ts_name], predict_df["prediction_rnd_forest"])
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 4)))
print('The R2 score of our forecasts is {}'.format(round(r2_s, 4)))

predict_df.plot(x="timestamp", y=[out_ts_name, "prediction_rnd_forest"], figsize=(10,10));

In [0]:
rnd_forest_reg = RandomForestRegressor(n_estimators=10, min_samples_split=20, max_depth=5)
rnd_forest_reg.fit(train_X, train_y)

X = datapoints_df[in_ts_names].values
datapoints_df["prediction_rnd_forest"] = rnd_forest_reg.predict(X)

datapoints_df.plot(x="timestamp", y=[out_ts_name, "prediction_rnd_forest"], figsize=(10,10));

###Anomaly Detection

In [0]:
#Train up until 100 timestamps before anomalous period
predict_start_index = min(datapoints_df[datapoints_df[out_ts_name]> 5].index)-100

datapoints_df_ad = datapoints_df.loc[:predict_start_index, :]
train_X = datapoints_df_ad[in_ts_names].values
train_y = datapoints_df_ad[out_ts_name].values

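# .copy() gives an independent DataFrame, so adding columns below avoids pandas' SettingWithCopyWarning
predict_df_ad = datapoints_df.loc[predict_start_index+1:, in_ts_names + [out_ts_name, "timestamp"]].copy()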
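
The split above can be sketched on a plain list. This assumes the DataFrame has the default integer index, so labels equal positions, and it mirrors the fact that label slicing with *.loc* includes the end label. The function name, the sample values, and the guard against a negative start are additions for this illustration, not part of the notebook.

```python
def split_before_first_exceedance(values, threshold, margin):
    """Split so training ends `margin` points before the first value above `threshold`."""
    first_high = next(i for i, v in enumerate(values) if v > threshold)
    # Guard (not in the notebook): never let the split point go below zero.
    last_train = max(first_high - margin, 0)
    # Like .loc[:last_train], the training part includes position last_train.
    return values[: last_train + 1], values[last_train + 1 :]


series = [3.0, 3.1, 3.2, 3.1, 3.0, 6.5, 6.8, 3.2]  # invented readings
train, predict = split_before_first_exceedance(series, threshold=5, margin=2)
print(train)
print(predict)
```

Every point lands in exactly one of the two parts, and the high readings all fall in the part reserved for prediction.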

In [0]:
plt.figure(figsize=(10,10))
plt.plot(datapoints_df_ad["timestamp"], datapoints_df_ad[out_ts_name], label="train")
plt.plot(predict_df_ad["timestamp"], predict_df_ad[out_ts_name], label="predict")
plt.legend()
plt.xlabel("timestamp")
plt.title(out_ts_name)

In [0]:
rnd_forest_reg = RandomForestRegressor(n_estimators=10, min_samples_split=20, max_depth=5)
rnd_forest_reg.fit(train_X, train_y)

X = predict_df_ad[in_ts_names].values
predict_df_ad["prediction_rnd_forest"] = rnd_forest_reg.predict(X)
predict_df_ad["residual"] = np.abs(predict_df_ad["prediction_rnd_forest"]-predict_df_ad[out_ts_name])

In [0]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,15))
predict_df_ad.plot(x="timestamp", y=[out_ts_name, "prediction_rnd_forest"], figsize=(12,7), ax=ax1, 
                color=["C1", "C2"]);
predict_df_ad.plot(x="timestamp", y=["residual"], figsize=(12,7), ax=ax2, color="C3");
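
The residual plotted above can be turned into explicit anomaly flags by comparing it with a threshold. The sketch below shows one simple way to do that; the function name, the sample values, and the threshold of 1.0 are assumptions for illustration, and a suitable threshold depends on how large the residuals are during normal operation.

```python
def flag_anomalies(actual, predicted, threshold):
    """Return absolute residuals and a flag per point where the residual exceeds `threshold`."""
    residuals = [abs(p - a) for a, p in zip(actual, predicted)]
    flags = [r > threshold for r in residuals]
    return residuals, flags


actual = [3.1, 3.0, 6.5, 6.8]      # invented output readings
predicted = [3.0, 3.1, 3.2, 3.1]   # invented model predictions

residuals, flags = flag_anomalies(actual, predicted, threshold=1.0)
print(flags)
```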