- Skill level: Intermediate; some knowledge of windowing and pandas will help
- Time to complete: Approx. 25 min
Introduction: In this guide, we will show you how to combine Bytewax with ydata-profiling to profile and understand the quality of your streaming data!
Python modules
bytewax==0.16.2
ydata-profiling==4.3.1
matplotlib==3.7
You'll be able to handle and structure data streams into snapshots using Bytewax, and then analyze them with ydata-profiling to create a comprehensive report of data characteristics for each device at each time interval.
- Resources
- Data Profiling
- Step 1. Environmental Sensor Telemetry Dataset
- Step 2. Inputs and parsing
- Step 3. Windowing
- Step 4. Profile report
- Summary
Instead of the usual approach, where data quality is assessed during the creation of a data warehouse or dashboard solution, it is cheaper, more effective, and ultimately more robust to monitor quality closer to the source. This is a great fit for stream processing, since most data is created in real time, and it prevents data quality issues from multiplying in downstream tables and ending up in customer-facing services.
When it comes to data profiling, ydata-profiling has consistently been a crowd favorite for both tabular and time-series data. And no wonder why: it takes one line of code to produce an extensive set of analyses and insights.
Let's see it in action!
Let's download a subset of the Environmental Sensor Telemetry Dataset (License: CC0 Public Domain), which contains several measurements of temperature, humidity, carbon monoxide, liquefied petroleum gas, smoke, light, and motion from different IoT devices.
In a production environment, these measurements would be continuously generated by each device, and the input would look like what we expect in a streaming platform such as Kafka.
To simulate a stream of data, we will use the Bytewax CSVInput connector to read the CSV file we downloaded one line at a time. In a production use case, you could easily swap this out with the KafkaInput connector.
First, let’s make some necessary imports:
profiling-time-series-data/dataflow.py, lines 1 to 5 (commit 475cc88)
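For reference, here is a hedged sketch of what those imports likely look like under the bytewax 0.16 API; the exact names in the repository's dataflow.py may differ:

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

from bytewax.dataflow import Dataflow
from bytewax.connectors.files import CSVInput
from bytewax.connectors.stdio import StdOutput
from bytewax.window import EventClockConfig, TumblingWindow
```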
Then, we define our dataflow object and add our CSV input.
profiling-time-series-data/dataflow.py, lines 7 to 8 (commit 475cc88)
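A minimal sketch of this step, assuming the downloaded subset is saved as iot_telemetry_data.csv (the step name and filename here are placeholders, not necessarily the repository's):

```python
from pathlib import Path

from bytewax.dataflow import Dataflow
from bytewax.connectors.files import CSVInput

# Build the dataflow and read the CSV one row at a time to simulate a stream
flow = Dataflow()
flow.input("simulated_stream", CSVInput(Path("iot_telemetry_data.csv")))
```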
Afterward, we will use the stateless map operator, passing in a function that converts the timestamp string to a datetime object and restructures the data into the format (device_id, data).
profiling-time-series-data/dataflow.py, lines 10 to 19 (commit 475cc88)
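A hedged sketch of such a parsing function, assuming the dataset's ts column holds epoch seconds and its device column holds the device ID (as in the Kaggle CSV header); the helper name parse_row is ours, not necessarily the repository's:

```python
from datetime import datetime, timezone

def parse_row(row):
    # Convert the epoch-seconds string in "ts" to a timezone-aware datetime
    row["ts"] = datetime.fromtimestamp(float(row["ts"]), tz=timezone.utc)
    # Re-key the record as (device_id, data) so later steps can group by device
    return row["device"], row

# Attached to the dataflow with: flow.map(parse_row)
```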
The map operator applies this change to each data point in a stateless way. We modified the shape of our data so that, in the next steps, we can easily group it and profile each device separately rather than all of the devices simultaneously.
Now we will take advantage of Bytewax's stateful capabilities to gather data for each device over a duration of time that we define. ydata-profiling expects a snapshot of the data over time, which makes the window operator the perfect tool for this.
profiling-time-series-data/dataflow.py, lines 21 to 42 (commit 475cc88)
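One way this could look with bytewax 0.16's event-time windowing. The 10-minute window length, the alignment instant, the step name, and the accumulate helper are illustrative assumptions, not necessarily the repository's values:

```python
from datetime import datetime, timedelta, timezone

from bytewax.window import EventClockConfig, TumblingWindow

# Use each event's own "ts" field as its timestamp; wait briefly for late data
clock_config = EventClockConfig(
    lambda data: data["ts"],
    wait_for_system_duration=timedelta(seconds=30),
)

# Non-overlapping 10-minute windows, aligned to an arbitrary fixed instant
window_config = TumblingWindow(
    length=timedelta(minutes=10),
    align_to=datetime(2020, 1, 1, tzinfo=timezone.utc),
)

def accumulate(acc, event):
    # Collect every event that falls inside the window into one list
    acc.append(event)
    return acc

# Attached with:
# flow.fold_window("snapshot", clock_config, window_config, list, accumulate)
```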
With ydata-profiling, we can produce summary statistics for a pandas DataFrame scoped to a particular context. In our example, that context is a snapshot of the data for each IoT device in each time frame.
After the snapshots are defined, leveraging ydata-profiling is as simple as calling ProfileReport for each of the DataFrames we would like to analyze:
profiling-time-series-data/dataflow.py, lines 46 to 64 (commit 475cc88)
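A hedged sketch of that step: build a DataFrame from a window's accumulated rows and write one HTML report per (device, window). The tsmode and sortby arguments enable ydata-profiling's time-series mode; the function name and the filename pattern are our own choices, not necessarily the repository's:

```python
import pandas as pd
from ydata_profiling import ProfileReport

def profile(device_id__window):
    # fold_window emits (key, accumulated_list) pairs
    device_id, window = device_id__window
    start_time = window[0]["ts"]
    df = pd.DataFrame(window)
    # Time-series profiling, ordered by the event timestamp
    report = ProfileReport(df, tsmode=True, sortby="ts")
    report.to_file(f"profile_{device_id}_{start_time:%Y-%m-%d_%H-%M}.html")
    return device_id, start_time

# Attached with: flow.map(profile)
```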
Once the profile is complete, the dataflow expects some output, so we can use the built-in StdOutput to print the device that was profiled and the time of the profile, as returned by the profile function in the map step:
profiling-time-series-data/dataflow.py, line 66 (commit 475cc88)
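That final step might look like the following, where StdOutput is bytewax 0.16's built-in stdout sink and flow is the Dataflow built in the steps above:

```python
from bytewax.connectors.stdio import StdOutput

# Print each (device_id, window_start) pair emitted by the profiling step
flow.output("out", StdOutput())
```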
And we are ready to run our program! You can clone this repository to your machine and run the following commands:
profiling-time-series-data/run.sh, line 3 (commit 1eb1752)
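With bytewax 0.16, that command is likely the module-runner invocation below, assuming the Dataflow object in dataflow.py is named flow:

```shell
python -m bytewax.run dataflow:flow
```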
We can now use the profiling reports to validate the data quality, check for changes in schemas or data formats, and compare the data characteristics between different devices or time windows.
Being able to process and profile incoming data appropriately opens up a plethora of use cases across different domains, from the correction of errors in data schemas and formats to the highlighting and mitigation of additional issues that derive from real-world activities, such as anomaly detection (e.g., fraud or intrusion/threats detection), equipment malfunction, and other events that deviate from the expectations (e.g., data drifts or misalignment with business rules).
This guide was written with the support of the YData team.
If you have any trouble with the process or have ideas about how to improve this document, come talk to us in the #troubleshooting Slack channel!
See our full gallery of tutorials →