## Exhibit 21 layout classifier
Some Exhibit 21 (EX21) filings are formatted as paragraphs of text rather than as structured tables. Because the extraction model is designed to work with a table layout, it tends to perform poorly on these filings. In this notebook we develop a classifier to detect paragraph-layout filings, so that we can filter them out and potentially build a dedicated model to handle them.

### Load labeled layouts from upstream asset

In [18]:
from mozilla_sec_eia.models.sec10k import defs

ex21_layout_labels = defs.load_asset_value("ex21_layout_labels")
ex21_layout_classifier_training_dataset = defs.load_asset_value("ex21_layout_classifier_training_dataset")

No dagster instance configuration file (dagster.yaml) found at /home/zach/catalyst/workspace. Defaulting to loading and storing all metadata with /home/zach/catalyst/workspace. If this is the desired behavior, create an empty dagster.yaml file in /home/zach/catalyst/workspace.
2024-10-08 13:55:05 -0400 - dagster - DEBUG - system - Loading file from: /home/zach/catalyst/workspace/storage/ex21_layout_labels using PickledObjectFilesystemIOManager...
2024-10-08 13:55:05 -0400 - dagster - DEBUG - system - Loading file from: /home/zach/catalyst/workspace/storage/ex21_layout_classifier_training_dataset using PickledObjectFilesystemIOManager...


### Implement method to construct feature dataset

In [43]:
import pandas as pd

from mozilla_sec_eia.models.sec10k.ex_21.data.common import BBOX_COLS_PDF


def calculate_features(record):
    """Compute features from bounding boxes in inference dataset."""
    df = pd.DataFrame(record["bboxes"], columns=BBOX_COLS_PDF)
    features = {}
    features["n_bboxes"] = len(df)

    # block density wasn't a very useful feature, maybe rework?
    # Calculate the bounding box density of the area of the page with text
    # x_width = df["bottom_right_x_pdf"].max() - df["top_left_x_pdf"].min()
    # y_height = df["bottom_right_y_pdf"].max() - df["top_left_y_pdf"].min()
    # text_area = x_width * y_height
    # features["block_density"] = features["n_bboxes"] / text_area

    # Calculate average y-distance between bounding boxes for a given document
    df = df.sort_values(by=["top_left_y_pdf", "top_left_x_pdf"])
    y_diffs = df["top_left_y_pdf"].diff().dropna()
    features["avg_y_distance"] = y_diffs.mean()
    features["std_y_distance"] = y_diffs.std()

    # Calculate x-distance to assess horizontal alignment
    x_diffs = df.groupby("top_left_y_pdf")["top_left_x_pdf"].diff().dropna()
    features["avg_x_distance"] = x_diffs.mean()
    features["std_x_distance"] = x_diffs.std()

    # Define a small threshold to group bounding boxes that are on the same line
    y_threshold = 0.1
    df.loc[:, "line_group"] = (df["top_left_y_pdf"].diff().fillna(0).abs() > y_threshold).cumsum()
    boxes_per_line = df.groupby("line_group").size()
    features["median_boxes_per_line"] = boxes_per_line.median()
    return pd.Series(features)
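
To make the line-grouping feature concrete, here is a minimal sketch that applies the same `y_threshold` logic to two synthetic pages. The column list `COLS` stands in for `BBOX_COLS_PDF` and is an assumption, as are the helper name `median_boxes_per_line` and the made-up coordinates; only the grouping logic is taken from `calculate_features`.

```python
import pandas as pd

# Assumed column names standing in for BBOX_COLS_PDF (hypothetical).
COLS = ["top_left_x_pdf", "top_left_y_pdf", "bottom_right_x_pdf", "bottom_right_y_pdf"]


def median_boxes_per_line(bboxes, y_threshold=0.1):
    """Group boxes into lines by their top y-coordinate and return the median line size."""
    df = pd.DataFrame(bboxes, columns=COLS).sort_values(
        by=["top_left_y_pdf", "top_left_x_pdf"]
    )
    # A new line starts whenever the jump in y exceeds the threshold.
    line_group = (df["top_left_y_pdf"].diff().fillna(0).abs() > y_threshold).cumsum()
    return df.groupby(line_group).size().median()


# Synthetic table-like page: three rows with two cells each.
table_like = [
    [1.0, 1.0, 3.0, 1.2], [5.0, 1.0, 6.0, 1.2],
    [1.0, 1.5, 3.0, 1.7], [5.0, 1.5, 6.0, 1.7],
    [1.0, 2.0, 3.0, 2.2], [5.0, 2.0, 6.0, 2.2],
]

# Synthetic paragraph-like page: three lines with six word boxes each.
paragraph_like = [
    [1.0 + 0.8 * i, 1.0 + 0.3 * row, 1.7 + 0.8 * i, 1.2 + 0.3 * row]
    for row in range(3)
    for i in range(6)
]

table_median = median_boxes_per_line(table_like)
paragraph_median = median_boxes_per_line(paragraph_like)
```

On these toy inputs the paragraph-like page has more boxes per line than the table-like page, which is the kind of separation the feature is meant to capture. Real filings will be noisier.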

### Create training/test sets

In [60]:
import numpy as np
from sklearn.model_selection import train_test_split

X = ex21_layout_classifier_training_dataset.sort_values(by=["id"]).apply(calculate_features, axis=1)
y = np.where(ex21_layout_labels.sort_values(by=["filename"])["layout"] == "Paragraph", 1, 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=16)
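
The cell above pairs `X` and `y` by position, which relies on both frames sorting into the same order. As a hedged alternative, the sketch below joins features and labels on an explicit key before building the arrays. It assumes that `id` in the training dataset and `filename` in the labels hold matching values, which the source does not confirm; the toy frames and the `_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the feature and label frames, deliberately in different orders.
features = pd.DataFrame({"id": ["b", "a", "c"], "n_bboxes": [40, 12, 75]})
labels = pd.DataFrame(
    {"filename": ["a", "c", "b"], "layout": ["Table", "Paragraph", "Table"]}
)

# Join on the shared key; validate raises if a key is duplicated on either side.
merged = features.merge(
    labels, left_on="id", right_on="filename", how="inner", validate="one_to_one"
)

X_demo = merged[["n_bboxes"]]
y_demo = np.where(merged["layout"] == "Paragraph", 1, 0)
```

With an inner join, any filing that is missing from either frame drops out of `merged`, so comparing `len(merged)` against the input lengths is a quick check that nothing was lost.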

### Create mlflow model to wrap classifier

In [61]:
import mlflow


class Ex21LayoutClassifier(mlflow.pyfunc.PythonModel):
    """Wrap sklearn classifier in mlflow pyfunc model."""

    def load_context(self, context):
        """Load sklearn model."""
        self.model = mlflow.sklearn.load_model(context.artifacts["layout_classifier"])

    def predict(self, context, model_input: pd.DataFrame):
        """Create feature matrix from inference dataset and use trained model for prediction."""
        features_df = model_input.apply(calculate_features, axis=1)
        return self.model.predict(features_df)

### Train and log model

In [66]:
from dotenv import load_dotenv
from mlflow.models import infer_signature
from sklearn.linear_model import LogisticRegression

from mozilla_sec_eia.library.mlflow import configure_mlflow

load_dotenv()


configure_mlflow()
mlflow.set_experiment("exhibit21_layout_classifier")

# Autolog sklearn model
mlflow.autolog()

model = LogisticRegression()
pyfunc_model = Ex21LayoutClassifier()
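with mlflow.start_run():
    model.fit(X_train, y_train)
    # With autologging enabled, this score() call is also recorded as a metric on the run
    model.score(X_test, y_test)
    # Autologging stores the fitted sklearn model under the "model" artifact path
    sklearn_model_uri = mlflow.get_artifact_uri("model")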
    mlflow.pyfunc.log_model(
        artifact_path="exhibit21_layout_classifier",
        python_model=pyfunc_model,
        artifacts={"layout_classifier": sklearn_model_uri},
        signature=infer_signature(ex21_layout_classifier_training_dataset, y),
    )

2024/10/08 16:10:39 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2024/10/08 16:10:40 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


2024/10/08 16:11:30 INFO mlflow.tracking._tracking_service.client: 🏃 View run languid-shrimp-450 at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/15/runs/08802dbf347c4cd5b66751c11328a06f.
2024/10/08 16:11:30 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/15.
2024/10/08 16:11:30 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2024/10/08 16:11:31 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
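
The solver message in the output above (`STOP: TOTAL NO. of ITERATIONS REACHED LIMIT`) suggests either raising `max_iter` or scaling the data. One possible way to do both is to wrap the classifier in a scikit-learn pipeline with a `StandardScaler`. The sketch below runs on synthetic data with features on very different scales; the data, the `max_iter=1000` value, and the `_demo` names are illustrative assumptions, not settings from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(16)
n_samples = 200

# Two synthetic features on very different scales (e.g. a count and a small ratio).
X_demo = np.column_stack(
    [rng.normal(500, 200, n_samples), rng.normal(0.5, 0.1, n_samples)]
)
# The label depends only on the small-scale feature.
y_demo = (X_demo[:, 1] > 0.5).astype(int)

# Standardize the features, then fit logistic regression with a higher iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

accuracy = scaled_model.score(X_demo, y_demo)
iterations_used = scaled_model.named_steps["logisticregression"].n_iter_[0]
```

Because the pipeline is itself an sklearn estimator, it could in principle replace `LogisticRegression()` in the training cell without changes to the pyfunc wrapper, though that has not been verified against the logged runs here.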
