## Use Case: Predictive maintenance (PdM) on IoT data for early fault detection with Delta
#### Introduction 
* <b> Predictive Maintenance (PdM) </b> differs from <em> routine or time-based maintenance </em> approaches: it combines sensor readings with sophisticated analytics over thousands of logged events in near real time, and it promises severalfold cost savings, mainly because tasks are performed only when <b>warranted</b>. 
* A key enabler of PdM is <b> Continuous Monitoring </b>that needs to be done while the equipment is running during production to help establish baselines for normal running conditions. 
* <b> IoT </b>allows for easier integration of the sensor data to follow the health of the equipment along its lifecycle by transmitting the monitoring data to systems where the baselining and analysis can be done. The top industries leading the IoT revolution include <em> manufacturing, transportation, utilities, healthcare, consumer electronics & automobiles. </em>
* By constantly checking data trends for anomalies, mishaps can be prevented, which translates directly to lower maintenance costs and improved safety. 
* <b> PdM + IoT = Smart Factories (Industry 4.0) </b> IDC forecasts connected IoT devices to generate 79.4ZB of data in 2025 

#### Motivation 
* The global market size for PdM is expected to grow at a <b> CAGR of 28% </b>, with organizational savings projected to reach about $188B by 2024. 
* Customers report <em> 10-15% reduction in maintenance costs </em> which translates directly to increased ROI from their PdM initiatives. 
* This is driven by parallel advances in many fields, including advanced sensing technologies, better connectivity options, and breakthroughs in ML & AI techniques.
* PdM plays a key role in Industry 4.0 (digitization of manufacturing) to help corporations not only <b> reduce unplanned downtime</b>, but also <b>improve productivity and safety</b>. 
* <em> In other words, it is far cheaper for a company to be able to predict when a part will fail and have a replacement ready, than to wait for it to fail and take the equipment offline until repairs can be completed.</em> 
##### PdM with AI-based analytics makes it possible to monitor, analyze and predict the health of the machines that drive our everyday lives.

* A <b> variety </b> of sensor data ingested at high <b> velocity </b> and stored over time leading to huge <b>volume</b> processed at scale retaining its <b>veracity</b> to deliver <b> value </b> to the business 
* A <b>consolidated</b> view of all the data and the ability to seamlessly transition workloads across <b> different personas </b>
* Pipelines have a mix of <b>streaming and some batch</b> and should be both <b>reliable and performant</b>
* Ingestion should be able to tolerate <b>delays</b> in upstream data and <b>errors aka changes</b> to previously pushed data at scale
* A mix of statistical & ML based tools and frameworks will be used to determine the best predictive algorithm
* Analysis done to generate insights should be <b>sound and defensible</b>
##### Delta along with MLflow is a great technology fit for PdM use cases: ingesting & processing sensor data and running ML models to generate valuable insights.
* <b> Delta </b> with its integration to Structured Streaming provides a simple, reliable & performant end-to-end multi-hop streaming pipeline for all data personas to work in collaboration.
* <b> MLflow </b>  with its Tracking server component helps bring discipline to the series of experiment runs an ML practitioner has to do to find the best prediction model to detect engine failure
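
As an illustration of what tracking a series of experiment runs can look like, here is a minimal sketch that logs one MLflow run per parameter combination. The parameter names (`max_depth`, `n_estimators`), the metric name `rmse` and the `train_and_score` callback are hypothetical placeholders and not part of this notebook; `mlflow.start_run`, `mlflow.log_params` and `mlflow.log_metric` are standard MLflow Tracking calls.

```python
from itertools import product


def param_grid(space):
    """Expand {'name': [values, ...]} into a list of parameter dicts."""
    names = sorted(space)
    return [dict(zip(names, combo)) for combo in product(*(space[n] for n in names))]


def track_runs(space, train_and_score):
    """Log one MLflow run per parameter combination.

    train_and_score is a hypothetical callback: params dict -> RMSE (float).
    """
    import mlflow  # imported lazily so the grid helper works without MLflow installed

    for params in param_grid(space):
        with mlflow.start_run():
            mlflow.log_params(params)
            mlflow.log_metric("rmse", train_and_score(params))


# Hypothetical search space for an RUL regression model
space = {"max_depth": [3, 5], "n_estimators": [50, 100]}
print(len(param_grid(space)))
```

Each run then appears in the Tracking UI with its parameters and metric, which makes it easier to compare candidates and pick the best model.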

* Affordable, highly available & durable
* The elasticity and scalability offered by cloud computing together with the increased maturity of enterprise-grade features including security have enabled companies to consolidate data operations in the cloud, breaking traditional silos and leading to a rise of <b> Data Lakes </b> backed by cloud storage such as S3  which has enabled advanced analytics with proven ROIs.
* Most traditional data lakes suffer from <b> reliability and performance issues </b> which is an impediment to their ability to generate sound business insights. 
* One side effect of dumping files into an S3 bucket is that as the bucket grows, listing the files takes longer and requires more compute, which translates to increased latency and cost. To counter this, file-notification-based options such as the <b> S3-SQS connector</b> and <b> Auto Loader </b> let you find new files written to an S3 bucket without repeatedly listing all of the files.
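
A minimal sketch of reading newly arrived files with Auto Loader in file-notification mode is shown below. It assumes a Databricks cluster where `spark` is the active session; the function names and the `landing_path` argument are illustrative, and `schema` would be something like the `sensor_schema` used later in this notebook.

```python
def autoloader_options(file_format="csv", use_notifications=True):
    """Options for Databricks Auto Loader (the `cloudFiles` streaming source)."""
    return {
        "cloudFiles.format": file_format,
        # File-notification mode: discover new files via queue events, not directory listing
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def read_new_sensor_files(spark, landing_path, schema):
    """Return a streaming DataFrame over files arriving under landing_path.

    `spark` is the active SparkSession on a Databricks cluster; `schema` is the
    schema of the incoming sensor files.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options())
        .option("delimiter", " ")
        .schema(schema)
        .load(landing_path)
    )
```

Setting `use_notifications=False` falls back to directory listing, which is simpler to set up but shows the listing cost described above as the bucket grows.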

| Traditional Systems | Benefit   | The Delta Promise     |
| :--------------------| :------------------: | -----------: |
| Streaming            |  <b> Speed & Low Latency </b> | Transactionally incorporates new data in seconds    |
| Data Lake    | <b> Scale & Affordability </b> |Cloud Storage provides low cost; Massive Scalability, Support for concurrent access, High R/W throughput    |
|Data Warehouse       | <b> Reliability and Performance  </b> | ACID transactions, Updates/Deletes/Merges/Upserts, Compaction & Caching  |
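
The upsert capability in the last row is what lets ingestion absorb late or corrected sensor rows. A minimal sketch using the Delta Lake Python API follows; the helper names are hypothetical, and the choice of `unit_num` and `cycle_time` as the merge key is an assumption based on the columns of the sensor dataset.

```python
def merge_condition(keys, target="t", source="s"):
    """Build the join predicate that pairs an incoming row with its stored version."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert_readings(spark, updates_df, table_path, keys=("unit_num", "cycle_time")):
    """Upsert corrected or late sensor rows into the Delta table at table_path."""
    from delta.tables import DeltaTable  # available on Databricks / delta-spark

    (
        DeltaTable.forPath(spark, table_path).alias("t")
        .merge(updates_df.alias("s"), merge_condition(keys))
        .whenMatchedUpdateAll()      # a changed reading overwrites the stored row
        .whenNotMatchedInsertAll()   # a new or late reading is appended
        .execute()
    )


print(merge_condition(("unit_num", "cycle_time")))
```

Because the merge runs as a single ACID transaction, readers see either the table before the correction or after it, never a partially updated state.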

<img src="/files/tables/anindita/images/MLFlowComponents.png" width="850"/>
##### Design Philosophy
* <b> API first </b>: built around REST APIs that allow submitting runs, models, etc. from any library & language
* <b>Modular </b>: independent components; Easy to integrate into existing ML platforms & workflows
* <b>Easy </b>: minimal, ubiquitous dependencies. Runs the same way anywhere (local, cloud platforms). Easy for a single dev to use locally, or for very large teams
* <b>Ecosystem first </b>: Make it easy to use ecosystem frameworks, rather than competing with those frameworks.
* <b> Open Source & Open interface </b>: MLflow is designed to work with any ML library, algorithm, deployment tool or language.

* A multi-hop strategy provides the opportunity for different SLAs on the consumption access pattern. 
* As data progresses along the pipeline, its quality is incrementally refined. 
* The definitions of the bronze, silver and gold zones of the pipeline are quality guidelines for the consumer. 
* A modern data platform is a multi-tenant environment with different personas, e.g.:
  * A Business Analyst may want aggregated data from the gold zone which uses a window function on the last few min of data; 
  * A Data Scientist may want to tap into the silver zone to repeat an experiment from data produced a day back; 
  * A Data Engineer may want to look into the bronze zone to re-apply a different set of business transformations on data that arrived a week back. 
* To ensure business continuity & quality of analytics, at no point should bad data be presented for consumption. 
* With Delta’s support for ACID transactions and reader/writer isolation, partial data is never visible.
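
One hop of such a pipeline can be sketched as a Structured Streaming job that reads from one Delta table and appends refined rows to the next. The `start_hop` helper and its `condition` argument are hypothetical; the streaming calls themselves (`readStream`, `writeStream`, `checkpointLocation`) are standard Spark and Delta APIs.

```python
def start_hop(spark, source_path, sink_path, checkpoint_path, condition):
    """Start one hop: stream from a Delta table, keep rows matching `condition`
    (a SQL expression string), and append them to the next Delta table."""
    return (
        spark.readStream.format("delta").load(source_path)
        .filter(condition)
        .writeStream.format("delta")
        .outputMode("append")
        # The checkpoint lets the stream resume where it left off after a restart
        .option("checkpointLocation", checkpoint_path)
        .start(sink_path)
    )
```

With the path variables used later in this notebook, a bronze-to-silver hop might be started as `start_hop(spark, bronzeOutPath, silverOutPath, silverCheckpointPath, "unit_num IS NOT NULL")`; the filter shown is only an example of a quality rule.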

<img src="/files/tables/anindita/Multi_hop_Delta_Streaming_Pipeline-27d73.png" width="850"/>
<img src="/files/tables/anindita/multi_hop-11fbd.png" width="900"/>

<img src="/files/tables/anindita/NotebookLayout.png" width="800"/>

# AWS services setup

* The S3 bucket used for this demo is <b> delta-streaming-demo </b> and a folder called <b>sensor</b> is the receiving point for incoming sensor files 
* The SQS queue used is called <b> delta-sqs-stream </b>
* Set permissions for SQS as shown below
* From S3 - Properties choose <b>events</b> and add a new notification - use 'sensor' as the prefix so that only files whose key starts with it trigger a notification
* Create an <b> IAM role </b> that gives EC2 access privileges to both S3 & SQS
* Add the IAM role to the workspace
* At the time of Databricks cluster launch, choose the latest <b> ML Runtime </b>to leverage popular ML libraries out of the box & attach the IAM role to give the EC2 instance the required privileges to access S3 & SQS

In [10]:
{
  "Version": "2012-10-17",
  "Id": "arn:aws:sqs:us-east-2:<ACCOUNT>:delta-sqs-stream/SQSDefaultPolicy",
  "Statement": [
    {
      "Sid": "Sid1571094836461",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "SQS:SendMessage",
      "Resource": "arn:aws:sqs:us-east-2:<ACCOUNT>:delta-sqs-stream",
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": "arn:aws:s3:*:*:delta-streaming-demo"
        }
      }
    }
  ]
}
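
If the queue and bucket names change, the same policy can be generated rather than edited by hand. The sketch below rebuilds the document above from a queue ARN and a bucket name; the function name is hypothetical, and `<ACCOUNT>` remains a placeholder for the AWS account id.

```python
import json


def sqs_policy_for_s3(queue_arn, bucket_name):
    """Queue policy letting S3 event notifications from one bucket send messages."""
    return {
        "Version": "2012-10-17",
        "Id": f"{queue_arn}/SQSDefaultPolicy",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "SQS:SendMessage",
                "Resource": queue_arn,
                # Only events originating from this bucket may publish to the queue
                "Condition": {
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:*:*:{bucket_name}"}
                },
            }
        ],
    }


# "<ACCOUNT>" is a placeholder for the AWS account id
policy = sqs_policy_for_s3(
    "arn:aws:sqs:us-east-2:<ACCOUNT>:delta-sqs-stream", "delta-streaming-demo"
)
print(json.dumps(policy, indent=2))
```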

<img src="/files/tables/anindita/S3.png" width="400"/>
<img src="/files/tables/anindita/SQS_.png" width="700">
<img src="/files/tables/anindita/s3_sqs-14df1.png" width="400"/>

* IAM role is <b>databricks-s3-access </b>
* It has an inline policy to allow SQS resource <b> delta-sqs-stream </b>
* It has an inline policy to allow S3 resource <b> delta-streaming-demo </b>
* Add it to the workspace
* Attach this IAM role to the cluster 
  
<img src="/files/tables/anindita/iam.png" />

In [13]:
%run "./Include"

In [14]:
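# Remount the demo bucket; mount_name and s3Bucket are expected to come from ./Include
mount_point = "/mnt/%s" % mount_name
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)  # unmount only if it is currently mounted
dbutils.fs.mount("s3a://%s" % s3Bucket, mount_point)
dbutils.fs.mount("s3a://%s" % "delta-autoloader", "/mnt/%s" % "delta-autoload")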

# Kaggle Dataset

##### https://www.kaggle.com/behrad3d/nasa-cmaps/metadata
* Public data set for asset degradation modeling from NASA : https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/
* Multiple multivariate time series from different engines : 3 operational settings and 21 sensor readings
* Predict the remaining useful life (RUL) of each engine in the test dataset (#operational cycles left)
<br><br>
* 3 Datasets for 100 engines across different cycle runs
  * <b> Training data </b>: It is the aircraft engine run-to-failure data. (20631 data points over 100 engines)
  * <b> Testing data: </b> It is the aircraft engine operating data without failure events recorded. (13096 data points over the same engines)
  * <b> Ground truth data: </b> It contains the information of true remaining cycles for each engine in the testing data.


* Engine Metadata: simulated for each of the 100 engines; additional metadata that needs to be joined with the incoming data
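
For the run-to-failure training data, a common labeling convention is to take an engine's last recorded cycle as the failure point, so that RUL at any cycle is the last cycle minus the current one. The sketch below applies that convention to toy `(unit_num, cycle_time)` pairs in plain Python; it is an assumption about how labels could be derived, not necessarily how this notebook does it.

```python
from collections import defaultdict


def add_rul_labels(rows):
    """rows: iterable of (unit_num, cycle_time). Returns (unit_num, cycle_time, rul)."""
    rows = list(rows)
    last_cycle = defaultdict(int)
    for unit, cycle in rows:
        last_cycle[unit] = max(last_cycle[unit], cycle)
    # RUL = cycles remaining until the engine's final (failure) cycle
    return [(unit, cycle, last_cycle[unit] - cycle) for unit, cycle in rows]


# Toy data: engine 1 fails after cycle 3, engine 2 after cycle 2
labeled = add_rul_labels([(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)])
print(labeled)
```

The test data stops before failure, which is why the ground truth file is needed there: it supplies the true remaining cycles at the last observed cycle of each engine.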

In [17]:
dbutils.fs.ls("/demo-datasets/sensor/")

In [18]:
train_df=spark.read.format("csv").option("delimiter", " ").schema(sensor_schema).load("dbfs:/demo-datasets/sensor/train_FD001.txt")
display(train_df)

unit_num,cycle_time,ops_1,ops_2,ops_3,s_1,s_2,s_3,s_4,s_5,s_6,s_7,s_8,s_9,s_10,s_11,s_12,s_13,s_14,s_15,s_16,s_17,s_18,s_19,s_20,s_21
1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392.0,2388.0,100.0,39.06,23.419
1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392.0,2388.0,100.0,39.0,23.4236
1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390.0,2388.0,100.0,38.95,23.3442
1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392.0,2388.0,100.0,38.88,23.3739
1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393.0,2388.0,100.0,38.9,23.4044
1,6,-0.0043,-0.0001,100.0,518.67,642.1,1584.47,1398.37,14.62,21.61,554.67,2388.02,9049.68,1.3,47.16,521.68,2388.03,8132.85,8.4108,0.03,391.0,2388.0,100.0,38.98,23.3669
1,7,0.001,0.0001,100.0,518.67,642.48,1592.32,1397.77,14.62,21.61,554.34,2388.02,9059.13,1.3,47.36,522.32,2388.03,8132.32,8.3974,0.03,392.0,2388.0,100.0,39.1,23.3774
1,8,-0.0034,0.0003,100.0,518.67,642.56,1582.96,1400.97,14.62,21.61,553.85,2388.0,9040.8,1.3,47.24,522.47,2388.03,8131.07,8.4076,0.03,391.0,2388.0,100.0,38.97,23.3106
1,9,0.0008,0.0001,100.0,518.67,642.12,1590.98,1394.8,14.62,21.61,553.69,2388.05,9046.46,1.3,47.29,521.79,2388.05,8125.69,8.3728,0.03,392.0,2388.0,100.0,39.05,23.4066
1,10,-0.0033,0.0001,100.0,518.67,641.71,1591.24,1400.46,14.62,21.61,553.59,2388.05,9051.7,1.3,47.03,521.79,2388.06,8129.38,8.4286,0.03,393.0,2388.0,100.0,38.95,23.4694


In [19]:
display(train_df.describe())

summary,unit_num,cycle_time,ops_1,ops_2,ops_3,s_1,s_2,s_3,s_4,s_5,s_6,s_7,s_8,s_9,s_10,s_11,s_12,s_13,s_14,s_15,s_16,s_17,s_18,s_19,s_20,s_21
count,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0
mean,51.50656778634095,108.80786195530996,-8.870146866366216e-06,2.350831273326574e-06,100.0,518.6699999999346,642.6809335466004,1590.5231186079263,1408.933781687751,14.619999999996605,21.60980320875388,553.3677112112824,2388.0966516407484,9065.242940720273,1.299999999999534,47.54116814502418,521.4134700208454,2388.0961523920328,8143.752722117232,8.44214558189131,0.0299999999999844,393.2106538703892,2388.0,100.0,38.81627066065643,23.289705360864783
stddev,29.22763290879931,68.88099017721827,0.0021873134490151,0.000293062124566145,0.0,0.0,0.5000532700605589,6.131149519690709,9.000604780543718,0.0,0.0013889849127084,0.8850922576633927,0.070985478890557,22.082879525067696,0.0,0.2670873986396923,0.7375533922096187,0.0719189156975736,19.076175975950623,0.0375050379519658,0.0,1.5487630246142872,0.0,0.0,0.1807464278736582,0.1082508747449128
min,1.0,1.0,-0.0087,-0.0006,100.0,518.67,641.21,1571.04,1382.25,14.62,21.6,549.85,2387.9,9021.73,1.3,46.85,518.69,2387.88,8099.94,8.3249,0.03,388.0,2388.0,100.0,38.14,22.8942
max,100.0,362.0,0.0087,0.0006,100.0,518.67,644.53,1616.91,1441.49,14.62,21.61,556.06,2388.56,9244.59,1.3,48.53,523.38,2388.56,8293.72,8.5848,0.03,400.0,2388.0,100.0,39.43,23.6184


In [20]:
test_df=spark.read.format("csv").option("delimiter", " ").schema(sensor_schema).load("dbfs:/demo-datasets/sensor/test_FD001.txt")
display(test_df.describe())

summary,unit_num,cycle_time,ops_1,ops_2,ops_3,s_1,s_2,s_3,s_4,s_5,s_6,s_7,s_8,s_9,s_10,s_11,s_12,s_13,s_14,s_15,s_16,s_17,s_18,s_19,s_20,s_21
count,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0,13096.0
mean,51.543906536346974,76.83651496640195,-1.117898594990834e-05,4.237935247403789e-06,100.0,518.6699999999398,642.475087813072,1588.099204337202,1404.735361942574,14.619999999997328,21.60970067195389,553.757522907757,2388.070964416578,9058.40736331701,1.2999999999996843,47.41620418448384,521.7477237324363,2388.0710247403417,8138.947816890653,8.425843967623669,0.0299999999999911,392.5716249236408,2388.0,100.0,38.89250152718381,23.335742692425278
stddev,28.289423226608932,53.05774946991259,0.0022026850808076,0.00029403056709582817,0.0,0.0,0.4008993420739847,5.003273930117532,6.688309292918633,0.0,0.0017040847560489,0.6812861095938021,0.057441784144966,11.43626052134709,0.0,0.195917245313219,0.5596267536792706,0.0569343103208322,10.18860497999252,0.0290093275455669,0.0,1.233576825647673,0.0,0.0,0.141680755304967,0.0841202801556253
min,1.0,1.0,-0.0082,-0.0006,100.0,518.67,641.13,1569.04,1384.39,14.62,21.6,550.88,2387.89,9024.53,1.3,46.8,519.38,2387.89,8108.5,8.3328,0.03,389.0,2388.0,100.0,38.31,22.9354
max,100.0,303.0,0.0078,0.0007,100.0,518.67,644.3,1607.55,1433.36,14.62,21.61,555.84,2388.3,9155.03,1.3,48.26,523.76,2388.32,8220.48,8.5414,0.03,397.0,2388.0,100.0,39.41,23.6419


In [21]:
rul_df=spark.read.format("csv").option("delimiter", " ").schema(engine_metadata_schema).load("dbfs:/demo-datasets/sensor/RUL_FD001.txt")
display(rul_df.describe())

summary,rul,engine_id,engine_name,description
count,100.0,0.0,0.0,0.0
mean,75.52,,,
stddev,41.76497009783112,,,
min,7.0,,,
max,145.0,,,


# Other Setup

In [23]:
#bronze table paths
dbutils.fs.rm(bronzeOutPath, True)
dbutils.fs.rm(bronzeCheckpointPath, True)

dbutils.fs.mkdirs(bronzeOutPath)
dbutils.fs.mkdirs(bronzeCheckpointPath)

#silver table paths
dbutils.fs.rm(silverOutPath, True)
dbutils.fs.rm(silverCheckpointPath, True)

dbutils.fs.mkdirs(silverOutPath)
dbutils.fs.mkdirs(silverCheckpointPath)

#gold table paths
dbutils.fs.rm(goldOutPath, True)
dbutils.fs.rm(goldCheckpointPath, True)

dbutils.fs.mkdirs(goldOutPath)
dbutils.fs.mkdirs(goldCheckpointPath)

#landing directory (recursive delete, so leftover files do not make rm fail)
dbutils.fs.rm(landing_dir, True)
dbutils.fs.mkdirs(landing_dir)

In [24]:
dbutils.fs.ls(basePath)

In [25]:
database_query = "CREATE DATABASE IF NOT EXISTS {}".format(databaseName)
spark.sql(database_query)

In [26]:
from pyspark.sql.functions import row_number,lit, concat, col
from pyspark.sql.window import Window
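# Constant ordering key so row_number() can assign engine ids 1..N;
# Spark does not guarantee that this matches the order of rows in the file
w = Window.orderBy(lit('A'))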

rul_df=spark.read.format("csv").option("delimiter", " ").schema(engine_metadata_schema).load("dbfs:/demo-datasets/aircraft-sensor/RUL_FD001.txt")
rul_df=rul_df.withColumn("engine_id", row_number().over(w))
rul_df=rul_df.withColumn("engine_name", concat(lit("Engine:"),col("engine_id"))).withColumn("description", concat(lit("Details ..."),col("engine_id")))

EngineDataTable=databaseName+".EngineMetadata"
rul_df.write.mode("overwrite").saveAsTable(EngineDataTable)

display(rul_df)

rul,engine_id,engine_name,description
112,1,Engine:1,Details ...1
98,2,Engine:2,Details ...2
69,3,Engine:3,Details ...3
82,4,Engine:4,Details ...4
91,5,Engine:5,Details ...5
93,6,Engine:6,Details ...6
91,7,Engine:7,Details ...7
95,8,Engine:8,Details ...8
111,9,Engine:9,Details ...9
96,10,Engine:10,Details ...10
