# IoT Data

Chapter 3 also contains an end-to-end example with IoT data. Unfortunately, this example is not available in the book's GitHub repo, so we define our own examples here.

The data is not available on the book's GitHub page either. It can be created with the notebook `chapter_03_create_IoT-data.ipynb` in our repo and can also be found in the directory `data`.

An interesting Databricks notebook with similar data and experiments can be found [here](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/2201444230243664/3601578643761083/latest.html).

*Christoph Windheuser - April 11, 2022*

In [25]:
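# Locate the Spark installation and make pyspark importable.
# This must run before the pyspark imports below.
import findspark
findspark.init()

# Import required Python Spark libraries
import pyspark

# Note: the wildcard import of pyspark.sql.functions shadows Python built-ins
# such as min, max and sum with their Spark column counterparts.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create a SparkSession. getOrCreate() reuses a running session, so this cell
# can be re-run safely.
spark = (SparkSession
         .builder
         .appName("Example_chapt_03")
         .getOrCreate())

# Take the SparkContext from the session instead of constructing a second one.
# Calling pyspark.SparkContext(appName="chapter_3") again in the same kernel raised:
# ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=chapter_3, master=local[*]) created by __init__ at <ipython-input-3-a8f18e26ff71>:12
sc = spark.sparkContext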

Checking the data in the `data` directory:

In [1]:
!head ./data/iot_devices.json

{"device_id": 1, "device_name": "meter-gauge-1HGVsyUwYig", "humidity": 33, "ip": "202.67.93.34", "lat": 65, "long": 81, "scale": "Celius", "temp": 28, "timestamp": "1649692100.0120409", "zipcode": 96132}
{"device_id": 2, "device_name": "sensor-pad-2JRBkwxFk2T", "humidity": 31, "ip": "149.5.43.88", "lat": 96, "long": 75, "scale": "Celius", "temp": 12, "timestamp": "1649692100.0121286", "zipcode": 95521}
{"device_id": 3, "device_name": "device-mac-34laB6pMWAt", "humidity": 50, "ip": "124.59.42.6", "lat": 22, "long": 25, "scale": "Celius", "temp": 27, "timestamp": "1649692100.0121949", "zipcode": 96935}
{"device_id": 4, "device_name": "sensor-pad-4WVBUDpLFYG", "humidity": 35, "ip": "120.88.72.6", "lat": 65, "long": 18, "scale": "Celius", "temp": 13, "timestamp": "1649692100.0122566", "zipcode": 95071}
{"device_id": 5, "device_name": "therm-stick-5URzWV3PHmC", "humidity": 68, "ip": "72.33.109.52", "lat": 77, "long": 36, "scale": "Celius", "temp": 26, "timestamp": "1649692100.0123167", 
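Each line of the file is a standalone JSON object, and `timestamp` is stored as a string holding epoch seconds. As a minimal sketch using only the Python standard library, the first record shown above can be parsed and its timestamp converted to a UTC datetime. The variable names `line`, `record` and `created_at` are illustrative and not part of the notebook.

```python
import json
from datetime import datetime, timezone

# First record from the `head` output above, copied verbatim.
line = ('{"device_id": 1, "device_name": "meter-gauge-1HGVsyUwYig", "humidity": 33, '
        '"ip": "202.67.93.34", "lat": 65, "long": 81, "scale": "Celius", "temp": 28, '
        '"timestamp": "1649692100.0120409", "zipcode": 96132}')

record = json.loads(line)

# The timestamp is a string of epoch seconds, so convert it to float first.
created_at = datetime.fromtimestamp(float(record["timestamp"]), tz=timezone.utc)

print(record["device_name"], record["temp"], created_at.isoformat())
```

Because the timestamp arrives as a string, declaring it as `StringType` in a Spark schema matches the raw data; a conversion step like the one above would be needed before doing time arithmetic.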

Defining the schema for the data:

In [4]:
iot_schema = (StructType([
        StructField("device_id", LongType(), False),
        StructField("device_name", StringType(), False),
        StructField("humidity", IntegerType(), False),
        StructField("ip",StringType(), False),
        StructField("lat", IntegerType(), False),
        StructField("long", IntegerType(), False),
        StructField("scale", StringType(), False),
        StructField("temp", IntegerType(), False),
        StructField("timestamp", StringType(), False),
        StructField("zipcode", IntegerType(), False)]))


In [5]:
jsonFile = "data/iot_devices.json"

# Reading with or without an explicit schema yields the same rows.
# Without a schema, Spark infers it from the JSON data, which requires an
# extra pass over the input and is impractical for large data sets.

# iot_data_df =  spark.read.schema(iot_schema).json(jsonFile)
iot_data_df =  spark.read.json(jsonFile)


In [6]:
iot_data_df.show()

+---------+--------------------+--------+--------------+---+----+------+----+------------------+-------+
|device_id|         device_name|humidity|            ip|lat|long| scale|temp|         timestamp|zipcode|
+---------+--------------------+--------+--------------+---+----+------+----+------------------+-------+
|        1|meter-gauge-1HGVs...|      33|  202.67.93.34| 65|  81|Celius|  28|1649692100.0120409|  96132|
|        2|sensor-pad-2JRBkw...|      31|   149.5.43.88| 96|  75|Celius|  12|1649692100.0121286|  95521|
|        3|device-mac-34laB6...|      50|   124.59.42.6| 22|  25|Celius|  27|1649692100.0121949|  96935|
|        4|sensor-pad-4WVBUD...|      35|   120.88.72.6| 65|  18|Celius|  13|1649692100.0122566|  95071|
|        5|therm-stick-5URzW...|      68|  72.33.109.52| 77|  36|Celius|  26|1649692100.0123167|  95208|
|        6|sensor-pad-6cexjv...|      39| 149.68.122.12| 70|  11|Celius|  15| 1649692100.012375|  95249|
|        7|meter-gauge-7q2Gq...|      76|  177.9.101.81

In [7]:
iot_data_df.count()

1389

Show sensors with a temperature below 20 degrees Celsius:

In [20]:
iot_results_df = (iot_data_df
                  .select("device_name", "temp")
                  .where(col("temp") < 20))
iot_results_df.show(5)
iot_results_df.count()

+--------------------+----+
|         device_name|temp|
+--------------------+----+
|sensor-pad-2JRBkw...|  12|
|sensor-pad-4WVBUD...|  13|
|sensor-pad-6cexjv...|  15|
|meter-gauge-7q2Gq...|  17|
|sensor-pad-1nWt8Q...|  14|
+--------------------+----+
only showing top 5 rows



554

Which distinct values of "scale" exist in the data?

In [22]:
iot_data_df.select("scale").where(col("scale").isNotNull()).distinct().show()

+------+
| scale|
+------+
|Celius|
+------+



What are the minimum and maximum humidity values across all sensors?

In [28]:
iot_data_df.select(min("humidity"), max("humidity")).show()

+-------------+-------------+
|min(humidity)|max(humidity)|
+-------------+-------------+
|           25|           99|
+-------------+-------------+

