# 1. Data Exploration & Validation Setup

**Objective:** Load the raw and reference datasets to understand their structure, distribution, and content. This notebook is also used to scaffold our initial Great Expectations (GE) suite.

In [None]:
import pandas as pd
import great_expectations as ge
from great_expectations.datasource.fluent.interfaces import GxDatasourceWarning
import warnings
import os

# Suppress datasource warnings from GE
warnings.filterwarnings("ignore", category=GxDatasourceWarning)

# Set display options for pandas
pd.set_option('display.max_colwidth', 100)

## Load Datasets

We'll load both the `raw` data (which simulates incoming data) and the `reference` data (our "golden" set for drift comparison).

In [None]:
# Paths below are relative to the project root.
# If the notebook was started from inside notebooks/, move up one level first
# so that the Great Expectations assets defined later resolve the same paths.
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
    print(f"Changed directory to: {os.getcwd()}")

RAW_DATA_PATH = "data/raw/feedback.csv"
REFERENCE_DATA_PATH = "data/reference/sentiment_reference.csv"

raw_df = pd.read_csv(RAW_DATA_PATH)
ref_df = pd.read_csv(REFERENCE_DATA_PATH)
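
Changing the working directory is fine for a notebook, but a small path helper avoids depending on where the kernel was started. The following is a minimal sketch, assuming the project root is the nearest ancestor directory that contains a `data/` folder; `find_project_root` and its `marker` parameter are hypothetical names, not part of this project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    # Check the starting directory first, then walk up through its parents
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")
```

For example, `root = find_project_root(Path.cwd())` followed by `pd.read_csv(root / "data" / "raw" / "feedback.csv")` would load the raw file regardless of the starting directory.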

### Raw Data (`feedback.csv`)

In [None]:
print(f"Raw Data Shape: {raw_df.shape}")
raw_df.head()

In [None]:
raw_df.info()

### Reference Data (`sentiment_reference.csv`)

This dataset includes a `prediction` column, simulating the output of a model that was run on this data. This is crucial for *model performance monitoring* with Evidently AI.

In [None]:
print(f"Reference Data Shape: {ref_df.shape}")
ref_df.head()

In [None]:
ref_df.info()
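
To illustrate why the `prediction` column matters, the sketch below compares predictions with labels on a tiny hand-made frame. The rows in `toy_ref` are illustrative assumptions, not values from `sentiment_reference.csv`; replacing `toy_ref` with `ref_df` should give the same summary for the real data, provided both columns are present.

```python
import pandas as pd

# Toy stand-in for ref_df (hypothetical rows, not the real reference data)
toy_ref = pd.DataFrame({
    "sentiment":  ["positive", "negative", "neutral", "positive"],
    "prediction": ["positive", "negative", "positive", "positive"],
})

# Boolean Series: True where the prediction matches the label
correct = toy_ref["prediction"] == toy_ref["sentiment"]

# Overall share of correct predictions
accuracy = correct.mean()

# Share of correct predictions for each true label
per_class = correct.groupby(toy_ref["sentiment"]).mean()

print(f"Accuracy: {accuracy:.2f}")
print(per_class)
```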

## Check Value Distributions

Let's look at the distribution of the target variable, `sentiment`.

In [None]:
print("--- Raw Data Sentiment Distribution ---")
print(raw_df['sentiment'].value_counts(normalize=True))
print("\n--- Reference Data Sentiment Distribution ---")
print(ref_df['sentiment'].value_counts(normalize=True))
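
Reading two separate printouts makes small shifts easy to miss. A side-by-side table is one possible alternative; the sketch below uses toy frames (`toy_raw` and `toy_ref` are hypothetical stand-ins for `raw_df` and `ref_df`) and adds an absolute-difference column per class.

```python
import pandas as pd

# Toy stand-ins for raw_df and ref_df (hypothetical values)
toy_raw = pd.DataFrame({"sentiment": ["positive", "positive", "negative", "neutral"]})
toy_ref = pd.DataFrame({"sentiment": ["positive", "negative", "negative", "neutral"]})

# Align the two normalized distributions on the label index;
# a label missing from one dataset gets a share of 0.0
comparison = pd.concat(
    {
        "raw": toy_raw["sentiment"].value_counts(normalize=True),
        "reference": toy_ref["sentiment"].value_counts(normalize=True),
    },
    axis=1,
).fillna(0.0)

# Absolute gap in class share between the two datasets
comparison["abs_diff"] = (comparison["raw"] - comparison["reference"]).abs()

print(comparison.sort_values("abs_diff", ascending=False))
```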

## Initializing Great Expectations

We can use this notebook to create our first "Expectation Suite". We'll base our initial suite on the reference data (`sentiment_reference.csv`), as it represents our "golden standard" for data.

**Note:** You must run `great_expectations init` in your terminal *before* running the cells below.

In [None]:
# Get the GE Data Context
context = ge.get_context()
print("Great Expectations context loaded.")

### Create a Datasource

First, we register a Pandas datasource with GE. The datasource itself does not point at a location; the file paths under `data/` are attached to the assets we define in the next step.

In [None]:
try:
    datasource = context.sources.add_pandas("pandas_data_source")
    print("Datasource 'pandas_data_source' added.")
except Exception as e:
    print(f"Datasource already exists or error: {e}")
    datasource = context.datasources["pandas_data_source"]

### Define Data Assets

Now, we define specific "assets" within that datasource. We'll create one for our `reference` data, which we will use to *create* the expectations, and one for the `raw` data, which we will *validate*.

In [None]:
try:
    ref_asset = datasource.add_csv_asset("reference_asset", filepath_or_buffer="data/reference/sentiment_reference.csv")
except Exception as e:
    # Report the actual error instead of assuming the asset already exists
    print(f"Asset 'reference_asset' already exists or could not be added: {e}")
    ref_asset = datasource.get_asset("reference_asset")

try:
    raw_asset = datasource.add_csv_asset("raw_asset", filepath_or_buffer="data/raw/feedback.csv")
except Exception as e:
    print(f"Asset 'raw_asset' already exists or could not be added: {e}")
    raw_asset = datasource.get_asset("raw_asset")

### Create an Expectation Suite

We will create a new, empty suite called `data_quality_suite`.

In [None]:
suite_name = "data_quality_suite"
try:
    context.add_expectation_suite(suite_name)
    print(f"Expectation suite '{suite_name}' created.")
except ge.exceptions.DataContextError:
    print(f"Expectation suite '{suite_name}' already exists.")

# Create a validator using our reference data
validator = context.get_validator(
    batch_request=ref_asset.build_batch_request(),
    expectation_suite_name=suite_name
)

print("Validator created using reference data.")

### Define Expectations

Here we define the "rules" for our data based on the `reference` asset. These are the core of our data validation.

In [None]:
# 1. Schema Expectations (Columns)
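# The ordered list is the *raw* data schema. The reference batch also carries a
# `prediction` column, so this check is expected to fail here; it is still kept
# because the suite is saved with discard_failed_expectations=False.
validator.expect_table_columns_to_match_ordered_list(["id", "text", "sentiment"])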
validator.expect_column_to_exist("id")
validator.expect_column_to_exist("text")
validator.expect_column_to_exist("sentiment")

# 2. ID Column Expectations
validator.expect_column_values_to_be_unique("id")
validator.expect_column_values_to_not_be_null("id")
validator.expect_column_values_to_be_of_type("id", "int64")

# 3. Text Column Expectations
validator.expect_column_values_to_not_be_null("text")
validator.expect_column_values_to_be_of_type("text", "str")
validator.expect_column_value_lengths_to_be_between("text", min_value=5, max_value=500)

# 4. Sentiment Column Expectations (Target Variable)
validator.expect_column_values_to_not_be_null("sentiment")
validator.expect_column_values_to_be_in_set("sentiment", ["positive", "negative", "neutral"])

print("Expectations added to the validator.")
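
For intuition about what these rules check, the sketch below applies roughly the same conditions with plain pandas to a toy frame that contains deliberate problems. This is not how GE evaluates expectations internally; the `toy` rows and the `checks` dictionary are illustrative assumptions, and null values are dropped before the length and set checks so that each rule reports on one problem only.

```python
import pandas as pd

# Toy frame with deliberate problems (hypothetical rows):
# a duplicate id, a too-short text, a missing text, and an unknown label
toy = pd.DataFrame({
    "id": [1, 2, 2],
    "text": ["Great product, works well", "bad", None],
    "sentiment": ["positive", "negative", "angry"],
})

allowed = {"positive", "negative", "neutral"}
lengths = toy["text"].str.len()  # NaN where text is missing

checks = {
    "columns_match": list(toy.columns) == ["id", "text", "sentiment"],
    "id_unique": toy["id"].is_unique,
    "id_not_null": toy["id"].notna().all(),
    "text_not_null": toy["text"].notna().all(),
    "text_length_5_to_500": lengths.dropna().between(5, 500).all(),
    "sentiment_in_set": toy["sentiment"].dropna().isin(allowed).all(),
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```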

### Save the Expectation Suite

Finally, we save our defined expectations to a JSON file in the `great_expectations/expectations` directory. This suite can now be loaded and run by our Prefect pipeline.

In [None]:
validator.save_expectation_suite(discard_failed_expectations=False)
print(f"Expectation suite '{suite_name}' saved!")

### Next Steps

1.  **Checkpoint:** We will create a Checkpoint (a YAML file) that bundles this suite with a data asset (like our `raw_asset`) to make validation runnable.
2.  **Data Docs:** Run `great_expectations docs build` in your terminal to generate an HTML report of these expectations.