# IoT Data

Chapter 3 of the book also contains an end-to-end example with IoT data (page 71 ff.). Unfortunately, apart from the code snippets printed in the book, the book's GitHub repo contains no complete code for this example.

The data file `iot_devices.json` can be found in the repo in this directory:
https://github.com/databricks/LearningSparkV2/tree/master/databricks-datasets/learning-spark-v2/iot-devices

*Christoph Windheuser - April 13, 2022*

In [1]:
# Locate the Spark installation first so that pyspark can be imported
import findspark
findspark.init()

# Import the required PySpark libraries
import pyspark

# Note: the wildcard import of pyspark.sql.functions shadows Python
# built-ins such as min, max, and sum for the rest of the notebook
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create the SparkContext for this notebook
sc = pyspark.SparkContext(appName="LearningSpark")

# Create a SparkSession (getOrCreate() reuses the SparkContext created above)
spark = (SparkSession
         .builder
         .appName("iot_Example_chapt_03")
         .getOrCreate())


Checking the data in the `data` directory:

In [2]:
!head ./data/iot_devices.json

{"device_id": 1, "device_name": "meter-gauge-1xbYRYcj", "ip": "68.161.225.1", "cca2": "US", "cca3": "USA", "cn": "United States", "latitude": 38.000000, "longitude": -97.000000, "scale": "Celsius", "temp": 34, "humidity": 51, "battery_level": 8, "c02_level": 868, "lcd": "green", "timestamp" :1458444054093 }
{"device_id": 2, "device_name": "sensor-pad-2n2Pea", "ip": "213.161.254.1", "cca2": "NO", "cca3": "NOR", "cn": "Norway", "latitude": 62.470000, "longitude": 6.150000, "scale": "Celsius", "temp": 11, "humidity": 70, "battery_level": 7, "c02_level": 1473, "lcd": "red", "timestamp" :1458444054119 }
{"device_id": 3, "device_name": "device-mac-36TWSKiT", "ip": "88.36.5.1", "cca2": "IT", "cca3": "ITA", "cn": "Italy", "latitude": 42.830000, "longitude": 12.830000, "scale": "Celsius", "temp": 19, "humidity": 44, "battery_level": 2, "c02_level": 1556, "lcd": "red", "timestamp" :1458444054120 }
{"device_id": 4, "device_name": "sensor-pad-4mzWkz", "ip": "66.39.173.154", "cca2": "US", "cca3"

Read the JSON file and let Spark infer the schema from it:

In [3]:
jsonFile = "data/iot_devices.json"
iot_data_df = spark.read.json(jsonFile)
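
Schema inference requires Spark to read the JSON data once before actually loading it. As a minimal sketch, the schema could instead be supplied up front as a DDL string. The column names and types below are assumptions read off the sample records above, and `IOT_SCHEMA_DDL` and `read_iot_data` are hypothetical names, not taken from the book:

```python
# Hypothetical explicit schema; the types are assumed from the sample records above.
IOT_SCHEMA_DDL = ", ".join([
    "battery_level BIGINT",
    "c02_level BIGINT",
    "cca2 STRING",
    "cca3 STRING",
    "cn STRING",
    "device_id BIGINT",
    "device_name STRING",
    "humidity BIGINT",
    "ip STRING",
    "latitude DOUBLE",
    "lcd STRING",
    "longitude DOUBLE",
    "scale STRING",
    "temp BIGINT",
    "timestamp BIGINT",
])


def read_iot_data(spark, path="data/iot_devices.json"):
    """Read the IoT JSON file with an explicit schema instead of inferring it."""
    # DataFrameReader.schema() accepts a DDL-formatted string as well as a StructType.
    return spark.read.schema(IOT_SCHEMA_DDL).json(path)
```

With the `spark` session from the first cell, this would be called as `iot_data_df = read_iot_data(spark)`.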


Show the first 5 rows of the DataFrame and the total number of rows:

In [4]:
iot_data_df.show(5)

+-------------+---------+----+----+-------------+---------+--------------------+--------+-------------+--------+------+---------+-------+----+-------------+
|battery_level|c02_level|cca2|cca3|           cn|device_id|         device_name|humidity|           ip|latitude|   lcd|longitude|  scale|temp|    timestamp|
+-------------+---------+----+----+-------------+---------+--------------------+--------+-------------+--------+------+---------+-------+----+-------------+
|            8|      868|  US| USA|United States|        1|meter-gauge-1xbYRYcj|      51| 68.161.225.1|    38.0| green|    -97.0|Celsius|  34|1458444054093|
|            7|     1473|  NO| NOR|       Norway|        2|   sensor-pad-2n2Pea|      70|213.161.254.1|   62.47|   red|     6.15|Celsius|  11|1458444054119|
|            2|     1556|  IT| ITA|        Italy|        3| device-mac-36TWSKiT|      44|    88.36.5.1|   42.83|   red|    12.83|Celsius|  19|1458444054120|
|            6|     1080|  US| USA|United States|        4

In [5]:
iot_data_df.count()

198164

Show sensors with a temperature < 30 and a humidity > 70:

In [11]:
iot_results_df = (iot_data_df
                  .where((col("temp") < 30) & (col("humidity") > 70)))
iot_results_df.show(5)
iot_results_df.count()

+-------------+---------+----+----+-------------+---------+--------------------+--------+--------------+--------+------+---------+-------+----+-------------+
|battery_level|c02_level|cca2|cca3|           cn|device_id|         device_name|humidity|            ip|latitude|   lcd|longitude|  scale|temp|    timestamp|
+-------------+---------+----+----+-------------+---------+--------------------+--------+--------------+--------+------+---------+-------+----+-------------+
|            3|      807|  JP| JPN|        Japan|        9| device-mac-9GcjZ2pw|      85| 118.23.68.227|   35.69| green|   139.69|Celsius|  13|1458444054124|
|            3|     1544|  IT| ITA|        Italy|       11|meter-gauge-11dlM...|      85| 88.213.191.34|   42.83|   red|    12.83|Celsius|  16|1458444054125|
|            0|     1260|  US| USA|United States|       12|sensor-pad-12Y2kIm0o|      92|   68.28.91.22|    38.0|yellow|    -97.0|Celsius|  12|1458444054126|
|            6|     1007|  IN| IND|        India|   

61347
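
The parentheses around each comparison in the `where` clause are required, because Python's `&` binds more tightly than `<` and `>`. The following plain-Python sketch illustrates the precedence with ordinary integers (the values are those of device 9 in the output above). With Spark `Column` objects, the unparenthesized form is expected to raise an error rather than silently return a wrong result.

```python
temp, humidity = 13, 85  # values of device 9 in the output above

# Intended predicate: both comparisons are evaluated first, then combined.
with_parens = (temp < 30) & (humidity > 70)

# Without parentheses, Python parses this as: temp < (30 & humidity) > 70
without_parens = temp < 30 & humidity > 70

print(with_parens)     # True
print(without_parens)  # False, because 30 & 85 == 20 and 20 > 70 is False
```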

Show the temperature, device name, device ID, and country code of all devices with a temperature > 25:

In [14]:
iot_results_df = (iot_data_df
                  .select("temp", "device_name", "device_id", "cca3")
                  .where(col("temp") > 25))
iot_results_df.show(5)
iot_results_df.count()

+----+--------------------+---------+----+
|temp|         device_name|device_id|cca3|
+----+--------------------+---------+----+
|  34|meter-gauge-1xbYRYcj|        1| USA|
|  28|   sensor-pad-4mzWkz|        4| USA|
|  27|sensor-pad-6al7RT...|        6| USA|
|  27|sensor-pad-8xUD6p...|        8| JPN|
|  26|sensor-pad-10Bsyw...|       10| USA|
+----+--------------------+---------+----+
only showing top 5 rows



71451
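
As a quick cross-check without Spark, the two filters can be applied by hand to the three complete sample records shown in the `head` output. This is only a sketch: `sample_lines` holds abbreviated copies of those records (fields not needed for the filters are dropped), and the variable names are illustrative.

```python
import json

# Abbreviated copies of the first three records from the `head` output.
sample_lines = [
    '{"device_id": 1, "device_name": "meter-gauge-1xbYRYcj", "cca3": "USA", "temp": 34, "humidity": 51}',
    '{"device_id": 2, "device_name": "sensor-pad-2n2Pea", "cca3": "NOR", "temp": 11, "humidity": 70}',
    '{"device_id": 3, "device_name": "device-mac-36TWSKiT", "cca3": "ITA", "temp": 19, "humidity": 44}',
]
records = [json.loads(line) for line in sample_lines]

# Same predicate as .where((col("temp") < 30) & (col("humidity") > 70))
cool_and_humid = [r["device_id"] for r in records
                  if r["temp"] < 30 and r["humidity"] > 70]

# Same projection and predicate as .select(...).where(col("temp") > 25)
warm = [(r["temp"], r["device_name"], r["device_id"], r["cca3"])
        for r in records if r["temp"] > 25]

print(cool_and_humid)  # []
print(warm)            # [(34, 'meter-gauge-1xbYRYcj', 1, 'USA')]
```

Both results agree with the Spark output: none of the first three devices appears in the result of the first filter, and device 1 is the first row returned by the second.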

# Old Stuff

What different values for "scale" exist in the data?

In [22]:
iot_data_df.select("scale").where(col("scale").isNotNull()).distinct().show()

+------+
| scale|
+------+
|Celius|
+------+



What are the minimum and maximum humidity values across all sensors?

In [28]:
iot_data_df.select(min("humidity"), max("humidity")).show()

+-------------+-------------+
|min(humidity)|max(humidity)|
+-------------+-------------+
|           25|           99|
+-------------+-------------+

