![](assets/solutions-microsoft-logo-small.png)
<img src="assets/ai.jpg" style="height:200px;float:right;vertical-align:text-top">

## Artificial Intelligence on IaaS++

This is part 3 of a 7-part workshop. The Jupyter Notebooks we are using are arranged in the same order as the Team Data Science Process:

0 - [Introduction and Setup](./0%20-%20Introduction.ipynb)

1 - [Business Understanding](./1%20-%20Business%20Understanding.ipynb)

2 - *(This Module)* [Data Acquisition and Understanding](./2%20-%20Data%20Acquisition%20and%20Understanding.ipynb)

3 - [Modeling](./3%20-%20Modeling.ipynb)

4 - [Deployment](./4%20-%20Deployment.ipynb)

5 - [Customer Acceptance](./5%20-%20Customer%20Acceptance.ipynb)

6 - [Workshop Wrap-up](./6%20-%20Workshop%20Wrap-up.ipynb)

<p style="border-bottom: 3px solid lightgrey;"></p> 

<h3><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/check.png">Phase Two - Data Acquisition and Understanding</h3>

Read the [Documentation Reference here](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle-data)

In the Data Acquisition and Understanding phase of the TDSP, you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, you’ll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, you’ll replace missing values and add or change columns. You’ll cover more extensive Data Wrangling tasks in other labs.

In this section, we’ll use a file-based dataset source to train our model, using three sources of data.

**Goals**

  - Produce a clean, high-quality data set whose relationship to the target variables is understood. Locate the data set in the appropriate analytics environment so you are ready to model.
  - Develop a solution architecture of the data pipeline that refreshes and scores the data regularly.

**How to do it**

  - Ingest the data into the target analytic environment.
  - Explore the data to determine if the data quality is adequate to answer the question.
  - Set up a data pipeline to score new or regularly refreshed data.

<p><img style="float: right; margin: 0px 15px 15px 0px;" src="./assets/aml-logo.png"><b>More information on Using Azure Machine Learning for this Phase:</b></p>

<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Load data into storage environments for analytics](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/ingest-data)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Explore data in the Team Data Science Process](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/explore-data)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Sample data in Azure blob containers, SQL Server, and Hive tables](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/sample-data)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Access datasets with Python using the Azure Machine Learning Python client library](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/python-data-access)</p>

### Data description

<img src="assets/files.jpg" style="height:25px;float:right;vertical-align:text-top">

This workshop uses three data sets as inputs in the files **PM_train.txt**, **PM_test.txt**, and **PM_truth.txt** that we will download in code and persist locally:

*Train data:* This is the aircraft engine run-to-failure data. The train data (*PM_train.txt*) consists of multiple, multivariate time series with cycle as the time unit. It includes 21 sensor readings for each cycle.
  Each time series is generated from a different engine of the same type. Each engine starts with different degrees of initial wear and some unique manufacturing variation. This information is unknown to the user.
  In this simulated data, the engine is assumed to be operating normally at the start of each time series. It starts to degrade at some point during the series of the operating cycles. The degradation progresses and grows in magnitude.
  When a predefined threshold is reached, the engine is considered unsafe for further operation. The last cycle of each time series is the failure point of that engine.

*Test data:* The aircraft engine operating data, without failure events recorded. The test data (*PM_test.txt*) has the same data schema as the training data. The only difference is that the data does not indicate when the failure occurs (the last time period does not represent the failure point). It is not known how many more cycles this engine can last before it fails.

*Truth data:* The information of true remaining cycles for each engine in the testing data. The ground truth data provides the number of remaining working cycles for the engines in the testing data.

This data comes from a simulation of 21 aircraft engine sensors, and is used to predict when an engine will fail so that maintenance can be planned in advance. The data ingestion notebook will download the simulated predictive maintenance data sets from public Azure Blob Storage. Labels are created from the `truth` data and joined to the `training` and `test` data. After some preliminary data cleaning and verification, the results are stored in a local (to the notebook server) folder for analysis.
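The file layout described above can be sketched as a list of column names. This is only an illustration: the helper name `build_column_names` is hypothetical, and the layout (an engine `id`, a `cycle` counter, 3 operational settings, and 21 sensors) is taken from the column list used later in this notebook.

```python
# Assumed layout: engine id, cycle counter, 3 settings, 21 sensor readings.
N_SETTINGS = 3
N_SENSORS = 21


def build_column_names(n_settings=N_SETTINGS, n_sensors=N_SENSORS):
    """Return the column names for the train and test files (hypothetical helper)."""
    settings = [f"setting{i}" for i in range(1, n_settings + 1)]
    sensors = [f"s{i}" for i in range(1, n_sensors + 1)]
    return ["id", "cycle"] + settings + sensors


columns = build_column_names()
print(len(columns))
print(columns[:6])
```

Generating the names this way avoids typing all 26 of them by hand, at the cost of hiding the list from a reader who wants to see it spelled out.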

<p style="border-bottom: 1px solid lightgrey;"></p> 

### Lab 2.0 - Ingest data from a local source

<img src="assets/checkmark.jpg" style="float:right;vertical-align:text-top">

We will be reusing the raw simulated data files from the [Predictive Maintenance](https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Template-3) tutorial. The notebook programmatically downloads these files from http://azuremlsamples.azureml.net/templatedata/.

The three data files are:

  - `PM_train.txt`
  - `PM_test.txt`
  - `PM_truth.txt`
    
This notebook labels the train and test set and does some preliminary cleanup. We'll also create some summary graphics for each data set to verify the data download, and store the resulting data sets in a local folder.

Instructions:
 1. Run the Python Code in the cells below. 
 
 #### Lab verification
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">Ensure that you have the three datasets loaded into the variables successfully, and that you see the graphical output of the basic analysis.</p>

In [None]:
# Project and Data Ingestion setup:
# Import libraries you'll need throughout the labs
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Libraries to display graphics of your exploration
%matplotlib inline
import matplotlib.pyplot as plt
import glob
import urllib.request  # the submodule must be imported explicitly to use urlretrieve

# Logging could optionally be placed here

In [None]:
# The raw train data is stored at this location:
basedataurl = "http://azuremlsamples.azureml.net/templatedata/"
# <TODO: Find or make the real data source>

# We will store each of these data sets in a local persistence folder. Change this 
# location for production, or use a call out to Azure or other online storage.
SHARE_ROOT = '/gpuclass/data/'

# These file names detail where we store each data file. 
TRAIN_DATA = 'PM_train_files.pkl'
TEST_DATA = 'PM_test_files.pkl'
TRUTH_DATA = 'PM_truth_files.pkl'

#### Data Ingestion
In this section, we ingest the training, test and ground truth datasets from a remote location. 

The training data consists of multiple multivariate time series with `cycle` as the time unit, together with 21 sensor readings and 3 settings for each cycle. Each time series can be assumed to come from a different engine of the same type. The testing data has the same data schema as the training data, except that the data does not indicate *when* the failure occurs. The ground truth data provides the number of remaining working cycles for the engines in the testing data. (You can find more details about the type of data used for this notebook at [Predictive Maintenance Template](https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Template-3).)

The training data consists of data from 100 engines (`id`) in the form of multivariate time series with `cycle` as the unit of time with 21 sensor readings `s1:s21` and 3 operational `setting` features for each `cycle`. In this simulated data, an engine is assumed to be operating normally at the start of each time series. Engine degradation progresses and grows in magnitude until a predefined threshold is reached where the engine is considered unsafe for further operation. In this simulation, the last cycle in each time series can be considered as the failure point of the corresponding engine:

In [None]:
# Load raw training data from Azure blob
train_file = 'PM_train.txt'

# Download the file once, and only once.
if not os.path.isfile(train_file):
    urllib.request.urlretrieve(basedataurl + train_file, train_file)

# Read the training data. The file is space-delimited and has no header row.
train_df = pd.read_csv(train_file, sep=" ", header=None)
train_df.drop(train_df.columns[[26, 27]], axis=1, inplace=True)
train_df.columns = ['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3',
                    's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14',
                    's15', 's16', 's17', 's18', 's19', 's20', 's21']

# Display the results
train_df.head()
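
The cell above drops columns 26 and 27 after reading the file. A likely reason, stated here as an assumption rather than something this notebook confirms, is that each line of the raw file ends in trailing spaces, which `pd.read_csv` with `sep=" "` turns into extra empty columns. The sketch below reproduces the effect on two made-up rows:

```python
import io

import pandas as pd

# Two hypothetical rows, each ending in two trailing spaces.
raw = "1 10 0.5  \n2 11 0.6  \n"

# With a single-space separator, every trailing space starts a new, empty field.
raw_df = pd.read_csv(io.StringIO(raw), sep=" ", header=None)
print(raw_df.shape)

# Dropping columns that contain no data at all removes the padding.
clean_df = raw_df.dropna(axis=1, how="all")
print(clean_df.shape)
```

Using `dropna(axis=1, how="all")` is an alternative to dropping by position. It does not depend on knowing how many padding columns there are, but it would also remove a genuine column that happened to be entirely empty.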

The testing data has the same data schema as the training data except that the failure point is unknown:

In [None]:
# Load raw test data from Azure blob
test_file = 'PM_test.txt'

# Download the file once, and only once.
if not os.path.isfile(test_file):
    urllib.request.urlretrieve(basedataurl + test_file, test_file)

# Read the test data, which uses the same layout as the training file.
test_df = pd.read_csv(test_file, sep=" ", header=None)
test_df.drop(test_df.columns[[26, 27]], axis=1, inplace=True)
test_df.columns = train_df.columns

test_df.head()
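
Since the lab verification asks you to know about any missing values, a short sanity check on each loaded frame can help. The following is a minimal sketch on a made-up frame; the function name `summarize` and the toy data are hypothetical, and the same call could be applied to `train_df` or `test_df`.

```python
import pandas as pd


def summarize(df, id_col="id", cycle_col="cycle"):
    """Return a few sanity-check numbers for an engine time-series frame."""
    return {
        "rows": len(df),
        "engines": int(df[id_col].nunique()),
        "missing_values": int(df.isna().sum().sum()),
        "longest_series": int(df.groupby(id_col)[cycle_col].max().max()),
    }


# Hypothetical toy frame standing in for train_df or test_df.
toy = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
    "s1":    [0.1, 0.2, None, 0.4, 0.5],
})

summary = summarize(toy)
print(summary)
```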

The ground truth data provides the number of remaining working cycles (Remaining Useful Life, or RUL) for the engines in the testing data. We use this data to evaluate the model after training it with the training data set only:

In [None]:
# Load raw ground truth data from Azure blob
truth_file = 'PM_truth.txt'

# Download the file once, and only once.
if not os.path.isfile(truth_file):
    urllib.request.urlretrieve(basedataurl + truth_file, truth_file)

# Read the ground truth data: one row per engine in the test set.
truth_df = pd.read_csv(truth_file, sep=" ", header=None)
truth_df.drop(truth_df.columns[[1]], axis=1, inplace=True)

truth_df.head()
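
The three ingestion cells repeat the same download-once pattern. One way to factor it out is sketched below. The helper `fetch_once` is hypothetical, and the demonstration uses a local `file://` URL and a temporary folder so that it runs without network access.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def fetch_once(base_url, file_name, target_dir="."):
    """Download base_url + file_name into target_dir unless it is already there."""
    target = os.path.join(target_dir, file_name)
    if not os.path.isfile(target):
        urllib.request.urlretrieve(base_url + file_name, target)
    return target


# Demonstration with a local file:// URL instead of the real blob location.
with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "source"
    cache_dir = Path(tmp) / "cache"
    source_dir.mkdir()
    cache_dir.mkdir()
    (source_dir / "demo.txt").write_text("1 2 3\n")
    base_url = source_dir.as_uri() + "/"

    first = fetch_once(base_url, "demo.txt", str(cache_dir))
    # Remove the source: the second call must be served from the cached copy.
    (source_dir / "demo.txt").unlink()
    second = fetch_once(base_url, "demo.txt", str(cache_dir))
    fetched_text = Path(second).read_text()

print(fetched_text)
```

In the notebook, the equivalent call would pass `basedataurl` and one of the three file names. Note that this pattern never refreshes a file that already exists locally, so a partial or stale download has to be deleted by hand.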

#### Data Preprocessing
We next generate labels for the training data. Since the last observation is assumed to be a failure point, we can calculate the Remaining Useful Life (`RUL`) for every cycle in the data:

In [None]:
# Data Labeling - generate column RUL
rul = pd.DataFrame(train_df.groupby('id')['cycle'].max()).reset_index()
rul.columns = ['id', 'max']
train_df = train_df.merge(rul, on=['id'], how='left')
train_df['RUL'] = train_df['max'] - train_df['cycle']
train_df.drop('max', axis=1, inplace=True)
train_df.head()
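
The labeling step above can be checked by hand on a small example. The sketch below uses a made-up frame with two engines and computes the same `RUL` column with `groupby(...).transform("max")`, which is an alternative formulation to the merge used in the cell above, not the notebook's own code.

```python
import pandas as pd

# Hypothetical data: engine 1 runs for 3 cycles, engine 2 for 2 cycles.
toy = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})

# transform("max") broadcasts each engine's last cycle back onto its rows,
# so the subtraction gives the number of cycles left before failure.
toy["RUL"] = toy.groupby("id")["cycle"].transform("max") - toy["cycle"]
print(toy)
```

The last row of each engine has `RUL` equal to 0, which matches the assumption that the final observed cycle is the failure point.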

Using RUL, we can create labels indicating time to failure. We define a binary value (1 or 0) for `label1` indicating whether the engine will fail within the next 30 cycles (RUL $\le 30$). We also define a multiclass `label2` $\in \{0, 1, 2\}$, where 0 means healthy (RUL $> 30$), 1 means failure within 30 cycles ($15 <$ RUL $\le 30$), and 2 means failure within 15 cycles (RUL $\le 15$):

In [None]:
# generate label columns for training data
w1 = 30
w0 = 15

# Label1 indicates a failure will occur within the next 30 cycles.
# 1 indicates failure, 0 indicates healthy 
train_df['label1'] = np.where(train_df['RUL'] <= w1, 1, 0 )

# label2 is multiclass, value 1 is identical to label1,
# value 2 indicates failure within 15 cycles
train_df['label2'] = train_df['label1']
train_df.loc[train_df['RUL'] <= w0, 'label2'] = 2
train_df.head()
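
To make the thresholds concrete, the following sketch applies the same two windows to a handful of made-up `RUL` values. It uses `np.select` as an alternative way to express the multiclass label. The constant names `W1` and `W0` mirror `w1` and `w0` above, and the data is hypothetical.

```python
import numpy as np
import pandas as pd

W1 = 30  # failure within 30 cycles
W0 = 15  # failure within 15 cycles

toy = pd.DataFrame({"RUL": [45, 31, 30, 16, 15, 0]})

# Binary label: 1 when the engine fails within W1 cycles, otherwise 0.
toy["label1"] = (toy["RUL"] <= W1).astype(int)

# Multiclass label: conditions are checked in order, so the tighter
# window (W0) must come first.
toy["label2"] = np.select(
    [toy["RUL"] <= W0, toy["RUL"] <= W1],
    [2, 1],
    default=0,
)
print(toy)
```

Both thresholds are inclusive, so an `RUL` of exactly 30 is labeled 1 and an `RUL` of exactly 15 is labeled 2.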

In the [Predictive Maintenance Template](https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Template-3), the `cycle` column is also used for training, so we will also include it as `cycle_norm`. Here, we apply min-max normalization to the settings, sensor, and `cycle_norm` columns in the training data:

In [None]:
# MinMax normalization
train_df['cycle_norm'] = train_df['cycle']
cols_normalize = train_df.columns.difference(['id','cycle','RUL','label1','label2'])
min_max_scaler = MinMaxScaler()
norm_train_df = pd.DataFrame(min_max_scaler.fit_transform(train_df[cols_normalize]), 
                             columns=cols_normalize, 
                             index=train_df.index)
join_df = train_df[train_df.columns.difference(cols_normalize)].join(norm_train_df)
train_df = join_df.reindex(columns = train_df.columns)
train_df.head()

Next, we prepare the test data. We normalize the data using the same parameters from the training data normalization:

In [None]:
test_df['cycle_norm'] = test_df['cycle']
norm_test_df = pd.DataFrame(min_max_scaler.transform(test_df[cols_normalize]), 
                            columns=cols_normalize, 
                            index=test_df.index)
test_join_df = test_df[test_df.columns.difference(cols_normalize)].join(norm_test_df)
test_df = test_join_df.reindex(columns = test_df.columns)
test_df = test_df.reset_index(drop=True)
test_df.head()
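
The arithmetic behind reusing the training parameters can be sketched with plain NumPy. This is an illustration of min-max scaling under stated assumptions, not scikit-learn's implementation: the function names and the toy arrays are hypothetical, and constant columns are given a range of 1 here so that they map to 0 instead of dividing by zero.

```python
import numpy as np


def fit_min_max(train):
    """Learn the per-column minimum and range from the training data only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return col_min, col_range


def apply_min_max(values, col_min, col_range):
    """Scale values using parameters learned elsewhere."""
    return (values - col_min) / col_range


train = np.array([[10.0, 5.0], [20.0, 5.0], [30.0, 5.0]])
test = np.array([[15.0, 5.0], [40.0, 5.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)

print(train_scaled)
print(test_scaled)
```

Because the test data is scaled with the training minimum and range, a test value outside the training range (40 in this example) lands outside the interval from 0 to 1. That is expected, and it is the price of keeping information about the test set out of the fitted parameters.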

Next, we use the ground truth dataset to generate labels for the test data:

In [None]:
# generate column max for test data
rul = pd.DataFrame(test_df.groupby('id')['cycle'].max()).reset_index()
rul.columns = ['id', 'max']
truth_df.columns = ['more']
truth_df['id'] = truth_df.index + 1
truth_df['max'] = rul['max'] + truth_df['more']
truth_df.drop('more', axis=1, inplace=True)

# generate RUL for test data
test_df = test_df.merge(truth_df, on=['id'], how='left')
test_df['RUL'] = test_df['max'] - test_df['cycle']
test_df.drop('max', axis=1, inplace=True)
test_df.head()
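
The cell above adds `rul['max']` and `truth_df['more']` row by row, which relies on both frames listing the engines in the same order. The sketch below shows the same calculation on made-up data with an explicit merge on `id`, an alternative that does not depend on row order. All values here are hypothetical.

```python
import pandas as pd

# Hypothetical test data: engine 1 observed for 2 cycles, engine 2 for 3.
test_toy = pd.DataFrame({
    "id":    [1, 1, 2, 2, 2],
    "cycle": [1, 2, 1, 2, 3],
})

# Hypothetical truth data: remaining cycles after the last observation,
# one row per engine, in engine order.
truth_toy = pd.DataFrame({"more": [10, 4]})
truth_toy["id"] = truth_toy.index + 1

# Last observed cycle per engine, joined to the truth data by engine id.
last_cycle = test_toy.groupby("id")["cycle"].max().rename("last_cycle").reset_index()
failure = last_cycle.merge(truth_toy, on="id", how="left")
failure["max"] = failure["last_cycle"] + failure["more"]

# RUL for every test row is the failure cycle minus the current cycle.
test_toy = test_toy.merge(failure[["id", "max"]], on="id", how="left")
test_toy["RUL"] = test_toy["max"] - test_toy["cycle"]
test_toy = test_toy.drop(columns="max")
print(test_toy)
```

Unlike the training data, the last row of each engine does not reach an `RUL` of 0, because the test series stop before the failure.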

We then create the same labels as used for the `training` data:

In [None]:
# generate label columns w0 and w1 for test data
test_df['label1'] = np.where(test_df['RUL'] <= w1, 1, 0 )
test_df['label2'] = test_df['label1']
test_df.loc[test_df['RUL'] <= w0, 'label2'] = 2
test_df.head()

<p style="border-bottom: 1px solid lightgrey;"></p> 

### Lab 2.1 - Data Exploration and Understanding

<img src="assets/checkmark.jpg" style="float:right;vertical-align:text-top">

One critical advantage of LSTMs is their ability to remember information across long sequences (window sizes). This is hard to achieve with traditional feature engineering: computing rolling averages over large window sizes (e.g. 50 cycles) may lose information, because values are smoothed and abstracted over such a long period. While feature engineering over large window sizes may not make sense, LSTMs are able to use all the information in the window as input. We first look at an example of the sensor values for the 50 cycles leading up to the last recorded cycle of engine `id = 3` in the test data.

Instructions:
 1. Run the cells below one at a time.

#### Lab verification
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">Ensure that you understand the data and its layout, and that you know of any missing values in the data.</p>

In [None]:
# preparing data for visualizations 
# window of 50 cycles prior to a failure point for engine id 3
engine_id3 = test_df[test_df['id'] == 3]
engine_id3_50cycleWindow = engine_id3[engine_id3['RUL'] <= engine_id3['RUL'].min() + 50]
cols1 = ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10']
engine_id3_50cycleWindow1 = engine_id3_50cycleWindow[cols1]
cols2 = ['s11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
engine_id3_50cycleWindow2 = engine_id3_50cycleWindow[cols2]

# plotting sensor data for engine ID 3 prior to a failure point - sensors 1-10 
ax1 = engine_id3_50cycleWindow1.plot(subplots=True, sharex=True, figsize=(20,20))

In [None]:
# plotting sensor data for engine ID 3 prior to a failure point - sensors 11-21 
ax2 = engine_id3_50cycleWindow2.plot(subplots=True, sharex=True, figsize=(20,20))
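
The contrast drawn at the start of this lab, between averaging over a window and passing the whole window to an LSTM, can be sketched with NumPy. How the modeling notebook actually builds its sequences is not shown here, so the `sliding_windows` helper, the window length of 4, and the toy series are all assumptions made for illustration.

```python
import numpy as np


def sliding_windows(values, window):
    """Return every run of `window` consecutive rows as one 3-D array."""
    n_rows, n_features = values.shape
    if n_rows < window:
        return np.empty((0, window, n_features))
    return np.stack([values[i:i + window] for i in range(n_rows - window + 1)])


# Hypothetical series: 6 cycles of 2 sensors.
series = np.arange(12, dtype=float).reshape(6, 2)

# Sequence input: each sample keeps every cycle in its window.
windows = sliding_windows(series, window=4)
print(windows.shape)

# Rolling-average feature: each window collapses to one value per sensor.
rolling_mean = windows.mean(axis=1)
print(rolling_mean.shape)
```

The windowed array keeps one row per cycle for each sample, while the rolling mean reduces the same window to a single row. That reduction is the smoothing that discards the shape of the signal inside the window.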

#### Persist the data sets

With the training and testing data created, we can turn our attention to modeling the engine failures. To pass the data sets to our next notebook, we will write the data to a folder shared within the Azure ML project. For details, see https://docs.microsoft.com/en-us/azure/machine-learning/preview/how-to-read-write-files

The code in `Code\2_model_building_and_evaluation.ipynb` will then read these data files and train an LSTM network to predict the probability of engine failure within the next 30 cycles using the previous 50 cycles:

In [None]:
# The data was read in using Pandas data frames. We'll pickle them
# so they can be loaded in subsequent notebooks.
os.makedirs(SHARE_ROOT, exist_ok=True)  # create the folder if it is missing
train_df.to_pickle(SHARE_ROOT + TRAIN_DATA)
test_df.to_pickle(SHARE_ROOT + TEST_DATA)

print("Data sets saved at: " + SHARE_ROOT + TRAIN_DATA + " and " + SHARE_ROOT + TEST_DATA)
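
Before moving on, it can be worth confirming that a pickled frame reads back unchanged. The sketch below does a round trip in a temporary folder with a made-up frame. The folder and file name are hypothetical stand-ins for `SHARE_ROOT` and the `.pkl` names defined earlier.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for train_df or test_df.
toy = pd.DataFrame({
    "id":    [1, 1, 2],
    "cycle": [1, 2, 1],
    "RUL":   [1, 0, 0],
})

# A temporary folder plays the role of the shared data folder.
with tempfile.TemporaryDirectory() as share_root:
    path = os.path.join(share_root, "toy_files.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, dtypes, and labels.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle preserves column types and the index, which is convenient between notebooks in the same environment. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.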

<p style="border-bottom: 3px solid lightgrey;"></p> 

### Phase 2 wrap-up

<img src="assets/wrapup.jpg" style="float:right;vertical-align:text-top">

<p>This module covered the Data Acquisition and Understanding phase of the solution.</p>

<p>The Notebooks are arranged in the same order as the Team Data Science Process:</p> 

0 - [Introduction and Setup](./0%20-%20Introduction.ipynb)

1 - [Business Understanding](./1%20-%20Business%20Understanding.ipynb)

2 - *(This module)* [Data Acquisition and Understanding](./2%20-%20Data%20Acquisition%20and%20Understanding.ipynb)

3 - *(Proceed to this Notebook Next)* [Modeling](./3%20-%20Modeling.ipynb)

4 - [Deployment](./4%20-%20Deployment.ipynb)

5 - [Customer Acceptance](./5%20-%20Customer%20Acceptance.ipynb)

6 - [Workshop Wrap-up](./6%20-%20Workshop%20Wrap-up.ipynb)