# IoT Data

Chapter 3 of the book also contains an end-to-end example with IoT data (page 71 ff.). Apart from the code snippets printed in the book, however, the book's GitHub repo does not provide complete code for this example.

The data file `iot_devices.json` can be found in the repo in this directory:
https://github.com/databricks/LearningSparkV2/tree/master/databricks-datasets/learning-spark-v2/iot-devices

*Christoph Windheuser - April 13, 2022*

In [1]:
# Locate the Spark installation first so that pyspark can be imported
import findspark
findspark.init()

import pyspark
from pyspark.sql.types import *
# Note: this wildcard import shadows Python's built-in min, max, sum, etc.
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create the Spark context for the notebook
sc = pyspark.SparkContext(appName="LearningSpark")

# Create a SparkSession (reuses the context created above)
spark = (SparkSession
         .builder
         .appName("iot_Example_chapt_03")
         .getOrCreate())


Checking the data in the `data` directory:

In [2]:
!head ./data/iot_devices.json

{"device_id": 1, "device_name": "meter-gauge-1xbYRYcj", "ip": "68.161.225.1", "cca2": "US", "cca3": "USA", "cn": "United States", "latitude": 38.000000, "longitude": -97.000000, "scale": "Celsius", "temp": 34, "humidity": 51, "battery_level": 8, "c02_level": 868, "lcd": "green", "timestamp" :1458444054093 }
{"device_id": 2, "device_name": "sensor-pad-2n2Pea", "ip": "213.161.254.1", "cca2": "NO", "cca3": "NOR", "cn": "Norway", "latitude": 62.470000, "longitude": 6.150000, "scale": "Celsius", "temp": 11, "humidity": 70, "battery_level": 7, "c02_level": 1473, "lcd": "red", "timestamp" :1458444054119 }
{"device_id": 3, "device_name": "device-mac-36TWSKiT", "ip": "88.36.5.1", "cca2": "IT", "cca3": "ITA", "cn": "Italy", "latitude": 42.830000, "longitude": 12.830000, "scale": "Celsius", "temp": 19, "humidity": 44, "battery_level": 2, "c02_level": 1556, "lcd": "red", "timestamp" :1458444054120 }
{"device_id": 4, "device_name": "sensor-pad-4mzWkz", "ip": "66.39.173.154", "cca2": "US", "cca3"
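To look at a single record outside of Spark, one line of the file can be parsed with Python's standard `json` module. The sketch below uses the first record shown above and assumes that `timestamp` holds milliseconds since the Unix epoch, which the data itself does not state.

```python
import json
from datetime import datetime, timezone

# First record of iot_devices.json, copied from the output above
line = ('{"device_id": 1, "device_name": "meter-gauge-1xbYRYcj", '
        '"ip": "68.161.225.1", "cca2": "US", "cca3": "USA", '
        '"cn": "United States", "latitude": 38.000000, "longitude": -97.000000, '
        '"scale": "Celsius", "temp": 34, "humidity": 51, "battery_level": 8, '
        '"c02_level": 868, "lcd": "green", "timestamp" :1458444054093 }')

record = json.loads(line)

# Assumption: the timestamp is in milliseconds since the Unix epoch
reading_time = datetime.fromtimestamp(record["timestamp"] / 1000, tz=timezone.utc)

print(sorted(record))            # field names of the record
print(reading_time.isoformat())  # reading time in UTC under the assumption above
```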

Read the JSON file and let Spark infer the schema from it:

In [3]:
jsonFile = "data/iot_devices.json"
iot_data_df =  spark.read.json(jsonFile)


Show the first 5 rows of the DataFrame and the total number of rows:

In [4]:
iot_data_df.show(5)

+-------------+---------+----+----+-------------+---------+--------------------+--------+-------------+--------+------+---------+-------+----+-------------+
|battery_level|c02_level|cca2|cca3|           cn|device_id|         device_name|humidity|           ip|latitude|   lcd|longitude|  scale|temp|    timestamp|
+-------------+---------+----+----+-------------+---------+--------------------+--------+-------------+--------+------+---------+-------+----+-------------+
|            8|      868|  US| USA|United States|        1|meter-gauge-1xbYRYcj|      51| 68.161.225.1|    38.0| green|    -97.0|Celsius|  34|1458444054093|
|            7|     1473|  NO| NOR|       Norway|        2|   sensor-pad-2n2Pea|      70|213.161.254.1|   62.47|   red|     6.15|Celsius|  11|1458444054119|
|            2|     1556|  IT| ITA|        Italy|        3| device-mac-36TWSKiT|      44|    88.36.5.1|   42.83|   red|    12.83|Celsius|  19|1458444054120|
|            6|     1080|  US| USA|United States|        4

In [5]:
iot_data_df.count()

198164

Show sensors with a temperature < 30 and a humidity > 70:

In [6]:
iot_results_df = (iot_data_df
                  .where((col("temp") < 30) & (col("humidity") > 70)))
iot_results_df.show(5)
iot_results_df.count()

+-------------+---------+----+----+-------------+---------+--------------------+--------+--------------+--------+------+---------+-------+----+-------------+
|battery_level|c02_level|cca2|cca3|           cn|device_id|         device_name|humidity|            ip|latitude|   lcd|longitude|  scale|temp|    timestamp|
+-------------+---------+----+----+-------------+---------+--------------------+--------+--------------+--------+------+---------+-------+----+-------------+
|            3|      807|  JP| JPN|        Japan|        9| device-mac-9GcjZ2pw|      85| 118.23.68.227|   35.69| green|   139.69|Celsius|  13|1458444054124|
|            3|     1544|  IT| ITA|        Italy|       11|meter-gauge-11dlM...|      85| 88.213.191.34|   42.83|   red|    12.83|Celsius|  16|1458444054125|
|            0|     1260|  US| USA|United States|       12|sensor-pad-12Y2kIm0o|      92|   68.28.91.22|    38.0|yellow|    -97.0|Celsius|  12|1458444054126|
|            6|     1007|  IN| IND|        India|   

61347

Show the temperature, device name, device ID and country code of all devices with a temperature > 25:

In [7]:
iot_results_df = (iot_data_df
                  .select("temp", "device_name", "device_id", "cca3")
                  .where(col("temp") > 25))
iot_results_df.show(5)
iot_results_df.count()

+----+--------------------+---------+----+
|temp|         device_name|device_id|cca3|
+----+--------------------+---------+----+
|  34|meter-gauge-1xbYRYcj|        1| USA|
|  28|   sensor-pad-4mzWkz|        4| USA|
|  27|sensor-pad-6al7RT...|        6| USA|
|  27|sensor-pad-8xUD6p...|        8| JPN|
|  26|sensor-pad-10Bsyw...|       10| USA|
+----+--------------------+---------+----+
only showing top 5 rows



71451

Show the minimum and maximum of temperature, battery level, CO2 level (column `c02_level`) and humidity:

In [9]:
iot_data_df.select(min("temp"), max("temp")).show()

+---------+---------+
|min(temp)|max(temp)|
+---------+---------+
|       10|       34|
+---------+---------+



In [10]:
iot_data_df.select(min("battery_level"), max("battery_level")).show()

+------------------+------------------+
|min(battery_level)|max(battery_level)|
+------------------+------------------+
|                 0|                 9|
+------------------+------------------+



In [12]:
iot_data_df.select(min("c02_level"), max("c02_level")).show()

+--------------+--------------+
|min(c02_level)|max(c02_level)|
+--------------+--------------+
|           800|          1599|
+--------------+--------------+



In [13]:
iot_data_df.select(min("humidity"), max("humidity")).show()

+-------------+-------------+
|min(humidity)|max(humidity)|
+-------------+-------------+
|           25|           99|
+-------------+-------------+



Show me all sensors with a battery level below a critical threshold:

In [45]:
low_battery_threshold = 2

iot_results_df = (iot_data_df
                  .select("battery_level", "device_name", "device_id", "cca3")
                  .where(col("battery_level") < low_battery_threshold))
iot_results_df.show(5)
iot_results_df.count()

+-------------+--------------------+---------+----+
|battery_level|         device_name|device_id|cca3|
+-------------+--------------------+---------+----+
|            0|sensor-pad-8xUD6p...|        8| JPN|
|            0|sensor-pad-12Y2kIm0o|       12| USA|
|            1|sensor-pad-14QL93...|       14| NOR|
|            0|meter-gauge-17zb8...|       17| USA|
|            1|sensor-pad-36VQv8...|       36| CYP|
+-------------+--------------------+---------+----+
only showing top 5 rows



39727

Show the top 20 countries ranked by their highest measured CO2 level:

In [46]:
iot_max_df = (iot_data_df
                  .select("c02_level", "cca3")
                  .groupBy("cca3")
                  .agg(max("c02_level"))
                  .sort("max(c02_level)", ascending=False))

iot_max_df.show(20)

+----+--------------+
|cca3|max(c02_level)|
+----+--------------+
| AUS|          1599|
| HUN|          1599|
| POL|          1599|
| THA|          1599|
| BRA|          1599|
| NOR|          1599|
| BOL|          1599|
| FIN|          1599|
| UKR|          1599|
| PER|          1599|
| GBR|          1599|
| NLD|          1599|
| BMU|          1599|
| TUR|          1599|
| LVA|          1599|
| USA|          1599|
| ITA|          1599|
| VNM|          1599|
| ARE|          1599|
| KOR|          1599|
+----+--------------+
only showing top 20 rows



Show the average temperature, CO2 level and humidity per country:

In [49]:
iot_avg_df = (iot_data_df
                  .select("temp", "c02_level", "humidity", "cca3")
                  .groupBy("cca3")
                  .agg(avg("temp"), avg("c02_level"), avg("humidity")))

iot_avg_df.show()

+----+------------------+------------------+------------------+
|cca3|         avg(temp)|    avg(c02_level)|     avg(humidity)|
+----+------------------+------------------+------------------+
| HTI|25.333333333333332|1291.3333333333333| 64.58333333333333|
| POL|21.983965014577258|1193.7452623906706| 62.33163265306123|
| LVA|21.899441340782122|1189.1340782122904| 63.11173184357542|
| BRB|23.210526315789473|1257.5526315789473| 58.36842105263158|
| BRA|21.958126550868485|1208.7382133995038| 61.96867245657568|
| ARM| 21.58823529411765|1207.9117647058824| 63.23529411764706|
| MOZ| 19.59090909090909|            1264.0| 58.77272727272727|
| JOR|21.065217391304348|1222.3478260869565| 63.84782608695652|
| CUB|25.866666666666667|1222.5333333333333| 49.53333333333333|
| FRA|22.115739868049012|1200.7059377945334| 61.82054665409991|
| ABW|             20.75|          1190.125|             64.75|
| TCA|              17.0|             862.0|              38.0|
| BRN|21.894736842105264|1200.2105263157

Show the countries with the highest average temperature:

In [52]:
iot_avg_df = (iot_data_df
                  .select("temp", "cca3")
                  .groupBy("cca3")
                  .agg(avg("temp"))
                  .sort("avg(temp)", ascending=False))

iot_avg_df.show()

+----+------------------+
|cca3|         avg(temp)|
+----+------------------+
| AIA|31.142857142857142|
| GRL|              29.5|
| GAB|              28.0|
| VUT|              27.3|
| LCA|              27.0|
| TKM|26.666666666666668|
| MWI|26.666666666666668|
| IRQ|26.428571428571427|
| LAO|26.285714285714285|
| IOT|              26.0|
| CUB|25.866666666666667|
| HTI|25.333333333333332|
| FJI| 25.09090909090909|
| DMA| 24.73076923076923|
| BEN|24.666666666666668|
| SYR|              24.6|
| BWA|              24.5|
| TLS|24.333333333333332|
| MNP|24.333333333333332|
| BHS| 24.27777777777778|
+----+------------------+
only showing top 20 rows



Show the countries with the highest average CO2 level:

In [53]:
iot_avg_df = (iot_data_df
                  .select("c02_level", "cca3")
                  .groupBy("cca3")
                  .agg(avg("c02_level"))
                  .sort("avg(c02_level)", ascending=False))

iot_avg_df.show()

+----+------------------+
|cca3|    avg(c02_level)|
+----+------------------+
| GAB|            1523.0|
| FLK|            1424.0|
| MCO|            1421.5|
| SMR|1379.6666666666667|
| LBR|            1374.5|
| SYR|            1345.8|
| MRT|1344.4285714285713|
| COD|          1333.375|
| TON|            1323.0|
| TLS|            1310.0|
| GIN|            1308.0|
| BWA|1302.6666666666667|
| HTI|1291.3333333333333|
| LAO|            1291.0|
| MDV|1284.7272727272727|
| AND|            1279.0|
| LSO|            1274.6|
| MOZ|            1264.0|
| FSM|            1261.0|
| LBY|         1260.5625|
+----+------------------+
only showing top 20 rows



Show the countries with the highest average humidity:

In [54]:
iot_avg_df = (iot_data_df
                  .select("humidity", "cca3")
                  .groupBy("cca3")
                  .agg(avg("humidity"))
                  .sort("avg(humidity)", ascending=False))

iot_avg_df.show()

+----+-----------------+
|cca3|    avg(humidity)|
+----+-----------------+
| COK|85.66666666666667|
| WSM|             81.0|
| GGY|           79.625|
| MHL|             75.0|
| AND|             75.0|
| GRD|73.83333333333333|
| BWA|            73.75|
| LBR|             72.0|
| JEY|70.79411764705883|
| SWZ|70.54545454545455|
| DMA|70.46153846153847|
| MAC|70.27272727272727|
| MDV|69.72727272727273|
| TKM|             69.0|
| MCO|             69.0|
| BHS|68.61111111111111|
| ZWE| 68.4074074074074|
| FSM|68.33333333333333|
| PYF|            68.05|
| ALB|         67.21875|
+----+-----------------+
only showing top 20 rows

