# 🌊 Data Engineering Flow

This notebook demonstrates how to build a data engineering flow for the KubeSentiment project.

## 🎯 Learning Objectives

By the end of this notebook, you will:
1. Understand the importance of data engineering in MLOps.
2. Learn how to build a data engineering flow using Dask.
3. Understand how to handle errors and monitor the data engineering flow.

## 📦 Setup and Dependencies

First, let's install the required dependencies and set up our environment.

In [None]:
# Install required packages for this notebook
!pip install -r ../requirements.txt

### ✅ Version Check
Let's list the installed library versions so that the environment can be reproduced later.

In [None]:
# List installed packages and their versions for reproducibility
!pip list
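
`pip list` prints every installed package, which can be noisy. As a minimal sketch, the versions of only the packages this notebook depends on can be read with the standard library's `importlib.metadata`. The `PACKAGES` list and the `installed_versions` helper are hypothetical names introduced for illustration; the actual package list should come from `../requirements.txt`.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical package list; adjust it to match ../requirements.txt
PACKAGES = ["dask", "pandas"]


def installed_versions(packages):
    """Return a mapping of package name to installed version (None if missing)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

Recording these versions next to the notebook output makes it easier to rebuild the same environment later.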

## 🌊 What is a Data Engineering Flow?

A data engineering flow is a sequence of steps that collects, cleans, and transforms data. It is a critical part of any MLOps pipeline because data quality directly affects model performance.
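
To make the three stages concrete, here is a minimal sketch in plain Python that runs them over a tiny in-memory CSV. The sample data and the function names `collect`, `clean`, and `transform` are assumptions made for illustration and are not part of the KubeSentiment code base.

```python
import csv
import io

# Hypothetical raw input standing in for a real data source
RAW_CSV = """text,sentiment
Great product,positive
,negative
It is okay,neutral
Terrible support,negative
"""


def collect(raw):
    """Collect: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def clean(rows):
    """Clean: drop rows with an empty field, then drop neutral examples."""
    complete = [row for row in rows if all(row.values())]
    return [row for row in complete if row["sentiment"] != "neutral"]


def transform(rows):
    """Transform: encode "positive" as 1 and any other remaining label as 0."""
    return [
        {**row, "sentiment": 1 if row["sentiment"] == "positive" else 0}
        for row in rows
    ]


processed = transform(clean(collect(RAW_CSV)))
for row in processed:
    print(row)
```

Keeping each stage in its own function makes it possible to test and monitor the stages independently.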

### Building a Data Engineering Flow with Dask

Dask is a parallel computing library for Python. Its DataFrame API mirrors pandas, splits the data into partitions, and evaluates lazily, so the same flow can run on a single machine or scale out to multiple machines.

In [None]:
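import dask.dataframe as dd

# Load the data lazily; the full file is not read until a computation is triggered
df = dd.read_csv("https://storage.googleapis.com/kubesentiment-data/sentiment-data.csv")

# Clean the data: drop rows with missing values and discard neutral examples
df = df.dropna()
df = df[df["sentiment"] != "neutral"]

# Transform the data: encode "positive" as 1 and every other remaining label as 0
df["sentiment"] = (df["sentiment"] == "positive").astype("int64")

# Save the data; to_csv triggers the computation, and index=False omits the row index column
df.to_csv("sentiment-data-cleaned.csv", single_file=True, index=False)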

### Error Handling and Monitoring

Handle errors and monitor the data engineering flow so that problems such as an unreachable data source or a missing column are identified and fixed before they degrade model performance.

In [None]:
import logging

import dask.dataframe as dd

# Set up logging
logging.basicConfig(level=logging.INFO)

try:
    # Load the data
    df = dd.read_csv("https://storage.googleapis.com/kubesentiment-data/sentiment-data.csv")

    # Clean the data
    df = df.dropna()
    df = df[df["sentiment"] != "neutral"]

    # Transform the data: encode "positive" as 1 and every other remaining label as 0
    df["sentiment"] = (df["sentiment"] == "positive").astype("int64")

    # Save the data; to_csv triggers the actual computation
    df.to_csv("sentiment-data-cleaned.csv", single_file=True, index=False)

    logging.info("Data engineering flow completed successfully.")
except Exception:
    # logging.exception records the message together with the full traceback
    logging.exception("Data engineering flow failed.")
    # Re-raise so the failure is not silently swallowed
    raise
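
The single `try`/`except` above reports that the flow failed, but not which step failed or how long each step took. One possible refinement, sketched here with the standard library only, wraps each step in a helper that logs its duration and any failure. The `run_step` helper and the `data_flow` logger name are hypothetical and introduced for illustration.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_flow")  # hypothetical logger name


def run_step(name, func):
    """Run one pipeline step, logging its duration and any failure."""
    start = time.perf_counter()
    try:
        result = func()
    except Exception:
        # Record the traceback with the step name, then propagate the error
        logger.exception("Step %r failed", name)
        raise
    elapsed = time.perf_counter() - start
    logger.info("Step %r finished in %.3fs", name, elapsed)
    return result


# Toy steps standing in for the load and clean stages of the flow
rows = run_step("load", lambda: [{"sentiment": "positive"}, {"sentiment": None}])
clean = run_step("clean", lambda: [r for r in rows if r["sentiment"] is not None])
logger.info("Kept %d of %d rows", len(clean), len(rows))
```

Logging row counts before and after each step is a cheap monitoring signal: a sudden drop usually points to a change in the upstream data.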