# **Monitoring a data pipeline**

- Once a data pipeline is developed, it should be monitored for changes to data, and failures during execution. 
- Sometimes, source systems fail to provide data, or data types change. 
- Other times, the tools that Data Engineers rely on are deprecated, or their functionality changes. 
- Whatever the reason, monitoring a data pipeline ensures the solution is transparent, and proper alerting notifies Data Engineers of an issue before data consumers discover it themselves.

- Missing data
- Shifting data types
- Package deprecation or functionality change
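
The first two failure modes above can be checked in code. The following is a minimal sketch using only the standard library, not a definitive implementation. `EXPECTED_TYPES`, `check_batch`, the logger name, and the sample rows are all hypothetical and made up for illustration, and rows are plain dictionaries rather than a DataFrame.

```python
import logging

logger = logging.getLogger("pipeline.monitor")  # hypothetical logger name

# Hypothetical schema: column name -> expected Python type
EXPECTED_TYPES = {"ticker": str, "open": float, "close": float}


def check_batch(rows):
    """Return a list of human-readable issues found in a batch of rows."""
    issues = []
    if not rows:
        issues.append("no rows received from source")
    for i, row in enumerate(rows):
        for column, expected in EXPECTED_TYPES.items():
            value = row.get(column)
            if value is None:
                # Missing data: the source did not provide this value
                issues.append(f"row {i}: missing value for '{column}'")
            elif not isinstance(value, expected):
                # Shifting data type: the value arrived as a different type
                issues.append(
                    f"row {i}: '{column}' is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )
    # Surface every issue as a warning so it appears in the pipeline logs
    for issue in issues:
        logger.warning(issue)
    return issues


# Illustrative data only
sample = [
    {"ticker": "ABC", "open": 10.0, "close": 10.5},
    {"ticker": "XYZ", "open": "11.2", "close": None},
]
issues = check_batch(sample)
```

Returning the issues as a list, in addition to logging them, lets the caller decide whether the batch is still usable or the pipeline should stop.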

# **Logging data pipeline performance**

- In this course, we'll use logging to alert engineers to how a data pipeline is performing. 
- Logs are messages created and written during the execution of a pipeline. 
- They are configured by the developing party, and document the performance of a pipeline. 
- Logs provide a starting point when solutions fail by letting Data Engineers revisit the execution of the pipeline. 
- Logs are the foundation for all monitoring and alerting efforts, and are essential for creating transparent data pipelines. 
- The logging module in Python makes it easy to configure and create your own logs. 
- The logging module defines six levels: NOTSET, DEBUG, INFO, WARNING, ERROR, and CRITICAL. We'll explore four: debug, info, warning, and error. 
- Each has an associated function and is used to reflect events of differing severity. 
- Debug logs are typically used when building a data pipeline, and give a Data Engineer insight into things such as data dimensionality, type, and variable values. 
- The info function is used to provide basic information and checkpoints throughout the execution of a pipeline, such as notifying an engineer about operations that occur on the data.

- Document performance at execution
- Provides a starting point when a solution fails

In [None]:
import logging
logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)
# Create different types of logs
path = "raw_file.csv"  # example value, matching the output shown below
logging.debug(f"Variable has value {path}")
logging.info("Data has been transformed and will now be loaded.")

In [None]:
DEBUG: Variable has value raw_file.csv
INFO: Data has been transformed and will now be loaded.
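
The `level` setting acts as a threshold: messages below it are dropped. The sketch below illustrates this under a few assumptions. It uses a dedicated logger with a hypothetical name, and an in-memory buffer instead of the console, so the captured text can be inspected.

```python
import io
import logging

# Write to an in-memory buffer so the result can be inspected
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

logger = logging.getLogger("pipeline.levels")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False       # keep the demo independent of the root logger
logger.setLevel(logging.INFO)  # DEBUG messages fall below this threshold

path = "raw_file.csv"
logger.debug(f"Variable has value {path}")  # dropped: below INFO
logger.info("Data has been transformed and will now be loaded.")

captured = buffer.getvalue()
print(captured)
```

Only the INFO line reaches the buffer. Setting the level to `logging.DEBUG` during development and raising it later is one common way to keep production logs short.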

# **Logging warnings and errors**

- In addition to debug and info-level logs, warnings and errors should also be captured using logging. 
- Warnings are logged when something unexpected happens, but an exception has not necessarily occurred. 
- A use case for a warning log could be an unexpected number of rows, or previously unseen data types. 
- Error logs are used when an exception occurs that should halt the execution of a pipeline, such as when data has changed format, or is totally unavailable. 
- Properly created logs can save Data Engineers time when trying to discover why a pipeline failed, or why results have changed.

In [None]:
import logging
logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)

# Create different types of logs
logging.warning("Unexpected number of rows detected.")
ke = "KeyError"  # stand-in for the name of a caught exception
logging.error(f"{ke} arose in execution.")

In [None]:
WARNING: Unexpected number of rows detected.
ERROR: KeyError arose in execution.
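
To make the distinction between warnings and errors concrete, here is a hedged sketch of a row-count check. The function `check_row_count`, its thresholds, and the logger name are hypothetical. The function returns the name of the level it logged at, purely so the behavior is easy to verify.

```python
import logging

logger = logging.getLogger("pipeline.rowcount")  # hypothetical logger name


def check_row_count(n_rows, expected_min, expected_max):
    """Log at a severity matching the problem; return the level name used."""
    if n_rows == 0:
        # Data is totally unavailable: this should halt the pipeline
        logger.error("No rows available; the pipeline should halt.")
        return "ERROR"
    if not expected_min <= n_rows <= expected_max:
        # Unexpected, but not necessarily an exception
        logger.warning(
            "Unexpected number of rows detected: %d (expected %d to %d).",
            n_rows, expected_min, expected_max,
        )
        return "WARNING"
    logger.info("Row count %d is within the expected range.", n_rows)
    return "INFO"
```

For example, `check_row_count(5, 10, 20)` logs a warning, while `check_row_count(0, 10, 20)` logs an error.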

# **Handling exceptions with try-except**

- When building a data pipeline, it's common for errors to occur. 
- The best data pipelines handle common exceptions, and create logs to help debug. 
- One of the most basic ways to handle these exceptions is by using Python's built-in try-except logic. 
- This functionality allows code to be run in the "try" block; if an error occurs, the code in the "except" block is triggered rather than execution ending.

In [None]:
try:
    # Execute some code here
    ...
except Exception:
    # Logging about failures that occurred
    # Logic to execute upon exception
    ...

- Provides a way to execute code if errors occur
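
A runnable version of this pattern might look like the sketch below. `run_step` and `parse_price` are hypothetical helpers introduced only for illustration. `logger.exception` is a standard logging method that logs at ERROR level and appends the traceback.

```python
import logging

logger = logging.getLogger("pipeline.steps")  # hypothetical logger name


def run_step(step, *args):
    """Run one pipeline step; log and return None if it raises."""
    try:
        result = step(*args)
    except Exception:
        # Logs at ERROR level and includes the traceback of the failure
        logger.exception("Step %s failed", step.__name__)
        return None
    logger.info("Step %s succeeded", step.__name__)
    return result


def parse_price(text):
    """Hypothetical step: convert a raw string to a float."""
    return float(text)


ok = run_step(parse_price, "10.5")
failed = run_step(parse_price, "n/a")  # float() raises ValueError here
```

Catching `Exception` this broadly keeps the pipeline running, but it can also hide the cause of a failure, so the traceback in the log is what makes the problem recoverable later.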

# **Handling specific exceptions with try-except**

Pass the specific exception to the except clause so that only that error is handled; any other exception still propagates.

In [None]:
try:
    # Try to filter by price_change
    clean_stock_data = transform(raw_stock_data)
    logging.info("Successfully filtered DataFrame by 'price_change'")
except KeyError as ke:
    # Handle the error, create new column, transform
    logging.warning(f"{ke}: Cannot filter DataFrame by 'price_change'")
    raw_stock_data["price_change"] = raw_stock_data["close"] - raw_stock_data["open"]
    clean_stock_data = transform(raw_stock_data)
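
The cell above depends on `transform` and `raw_stock_data`, which are defined elsewhere. As a self-contained sketch of the same recovery pattern, the version below swaps the DataFrame for a list of dictionaries and uses a hypothetical `transform` that keeps rows whose price rose. The sample values are illustrative only.

```python
import logging

logger = logging.getLogger("pipeline.transform")  # hypothetical logger name


def transform(rows):
    """Hypothetical transform: keep rows whose price rose."""
    return [row for row in rows if row["price_change"] > 0]


def clean(rows):
    """Filter rows, deriving 'price_change' first if it is missing."""
    try:
        cleaned = transform(rows)
        logger.info("Successfully filtered rows by 'price_change'")
    except KeyError as ke:
        logger.warning("%s: cannot filter rows by 'price_change'", ke)
        # Derive the missing field, then retry the transform
        rows = [
            {**row, "price_change": row["close"] - row["open"]}
            for row in rows
        ]
        cleaned = transform(rows)
    return cleaned


# Illustrative data only: 'price_change' is absent, so the except path runs
raw = [
    {"open": 10.0, "close": 10.5},
    {"open": 8.0, "close": 7.5},
]
result = clean(raw)
```

Because only `KeyError` is caught, an unrelated failure such as a `TypeError` from a non-numeric price would still surface instead of being silently handled.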