## Exhibit 21 layout classifier
Some Exhibit 21 (EX21) filings are formatted as paragraphs of text rather than as a structured table. Because the extraction model is designed for and trained on table layouts, it tends to perform poorly on these filings. In this notebook we develop a classifier to detect paragraph-layout filings, so that we can filter them out and potentially develop a dedicated model to handle them.

### Load labeled layouts from upstream asset

In [1]:
from mozilla_sec_eia.models.sec10k import defs

ex21_layout_labels = defs.load_asset_value("ex21_layout_labels")
ex21_layout_classifier_training_dataset = defs.load_asset_value("ex21_layout_classifier_training_dataset")

No dagster instance configuration file (dagster.yaml) found at /home/zach/catalyst/workspace. Defaulting to loading and storing all metadata with /home/zach/catalyst/workspace. If this is the desired behavior, create an empty dagster.yaml file in /home/zach/catalyst/workspace.
2024-10-08 18:11:22 -0400 - dagster - DEBUG - system - Loading file from: /home/zach/catalyst/workspace/storage/ex21_layout_labels using PickledObjectFilesystemIOManager...
2024-10-08 18:11:22 -0400 - dagster - DEBUG - system - Loading file from: /home/zach/catalyst/workspace/storage/ex21_layout_classifier_training_dataset using PickledObjectFilesystemIOManager...


### Implement method to construct feature dataset

In [1]:
import pandas as pd

from mozilla_sec_eia.models.sec10k.ex_21.data.common import BBOX_COLS_PDF


def calculate_features(record):
    """Compute layout features from the bounding boxes of a single document.

    ``record["bboxes"]`` is expected to hold one row per bounding box, with
    columns matching ``BBOX_COLS_PDF``.
    """
    df = pd.DataFrame(record["bboxes"], columns=BBOX_COLS_PDF)
    features = {}

    # Number of bounding boxes per unit of vertical extent covered by the text
    y_height = df["bottom_right_y_pdf"].max() - df["top_left_y_pdf"].min()
    features["block_y_density"] = len(df) / y_height

    # Mean and standard deviation of the y-distance between consecutive boxes
    df = df.sort_values(by=["top_left_y_pdf", "top_left_x_pdf"])
    y_diffs = df["top_left_y_pdf"].diff().dropna()
    features["avg_y_distance"] = y_diffs.mean()
    features["std_y_distance"] = y_diffs.std()

    # Start a new line group whenever the top edge moves by more than a small threshold
    y_threshold = 0.5
    df["line_group"] = (df["top_left_y_pdf"].diff().fillna(0).abs() > y_threshold).cumsum()

    # Mean x-distance between neighboring boxes on the same line (horizontal alignment)
    x_diffs = df.groupby("line_group")["top_left_x_pdf"].apply(lambda x: x.diff().dropna())
    features["avg_x_distance"] = x_diffs.mean()

    # Median number of boxes per line
    boxes_per_line = df.groupby("line_group").size()
    features["median_boxes_per_line"] = boxes_per_line.median()
    return pd.Series(features)
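
To make the line-grouping step more concrete, the following is a minimal sketch on made-up coordinates. The helper name `boxes_per_line` and both toy data frames are hypothetical and exist only for illustration. Only the column names and the 0.5 threshold are taken from `calculate_features`.

```python
import pandas as pd


def boxes_per_line(boxes: pd.DataFrame, y_threshold: float = 0.5) -> pd.Series:
    """Count bounding boxes per visual line (hypothetical helper for illustration)."""
    boxes = boxes.sort_values(by=["top_left_y_pdf", "top_left_x_pdf"]).copy()
    # A new line starts whenever the top edge moves by more than y_threshold
    new_line = boxes["top_left_y_pdf"].diff().fillna(0).abs() > y_threshold
    boxes["line_group"] = new_line.cumsum()
    return boxes.groupby("line_group").size()


# Made-up table-like page: three rows with two cells each
table_like = pd.DataFrame(
    {
        "top_left_x_pdf": [1.0, 5.0, 1.0, 5.0, 1.0, 5.0],
        "top_left_y_pdf": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
    }
)
# Made-up paragraph-like page: two lines with six words each
paragraph_like = pd.DataFrame(
    {
        "top_left_x_pdf": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] * 2,
        "top_left_y_pdf": [1.0] * 6 + [2.0] * 6,
    }
)

table_median = boxes_per_line(table_like).median()
paragraph_median = boxes_per_line(paragraph_like).median()
print(table_median, paragraph_median)
```

In this toy case the paragraph-like page has more boxes per line than the table-like page, which is the kind of signal `median_boxes_per_line` is meant to capture. Real filings will be noisier.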

### Create training/test sets

In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

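# Both frames are sorted so that features and labels line up row by row. This
# assumes that sorting by `id` and by `filename` yields the same document order.
X = ex21_layout_classifier_training_dataset.sort_values(by=["id"]).apply(calculate_features, axis=1)
X = StandardScaler().fit_transform(X)
# Label paragraph layouts as the positive class (1); everything else is 0
y = np.where(ex21_layout_labels.sort_values(by=["filename"])["layout"] == "Paragraph", 1, 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=16)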
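
The cell above fits `StandardScaler` on the full feature matrix before splitting, so the test rows influence the scaling statistics. One possible alternative, sketched here on synthetic data, is to split first and put the scaler and classifier in a single scikit-learn pipeline so that the scaler is fitted on the training rows only. All variable names and the random data below are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(16)
# Synthetic stand-in for the feature matrix: 5 features, 2 well-separated classes
demo_features = np.vstack(
    [rng.normal(0.0, 1.0, (50, 5)), rng.normal(3.0, 1.0, (50, 5))]
)
demo_labels = np.array([0] * 50 + [1] * 50)

f_train, f_test, l_train, l_test = train_test_split(
    demo_features, demo_labels, test_size=0.3, random_state=16, stratify=demo_labels
)

# The scaler is fitted inside the pipeline, using the training rows only
demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
demo_pipeline.fit(f_train, l_train)
demo_accuracy = demo_pipeline.score(f_test, l_test)
print(demo_accuracy)
```

A fitted pipeline also keeps the scaling parameters together with the classifier, so the same transformation is reused whenever the pipeline's `predict` is called.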

### Create mlflow model to wrap classifier

In [4]:
import mlflow


class Ex21LayoutClassifier(mlflow.pyfunc.PythonModel):
    """Wrap sklearn classifier in mlflow pyfunc model."""

    def load_context(self, context):
        """Load sklearn model."""
        self.model = mlflow.sklearn.load_model(context.artifacts["layout_classifier"])

    def predict(self, context, model_input: pd.DataFrame):
        """Create feature matrix from inference dataset and use trained model for prediction."""
        features_df = model_input.apply(calculate_features, axis=1)
        scaled_features = StandardScaler().fit_transform(features_df)
        return self.model.predict(scaled_features)
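
One detail of `predict` is worth checking: calling `StandardScaler().fit_transform` at inference time refits the scaler on whatever batch is passed in, so the features are not scaled the way they were during training. The sketch below, with made-up numbers, shows the extreme case of a one-row batch and contrasts it with reusing a scaler fitted on training data. The array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training features and a single record to score
training_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
single_record = np.array([[2.5, 12.0]])

# Refitting on the inference batch: one row has zero variance,
# so every feature is centered on itself and collapses to 0
refit_scaled = StandardScaler().fit_transform(single_record)

# Reusing the scaler fitted on the training data keeps the training scale
fitted_scaler = StandardScaler().fit(training_features)
reused_scaled = fitted_scaler.transform(single_record)

print(refit_scaled)
print(reused_scaled)
```

If this matters for the intended usage, one option would be to persist the fitted scaler together with the classifier, for example as a single scikit-learn pipeline, and load it in `load_context`.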

### Train and log model

In [6]:
from dotenv import load_dotenv
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from mozilla_sec_eia.library.mlflow import configure_mlflow

load_dotenv()


configure_mlflow()
mlflow.set_experiment("exhibit21_layout_classifier")

# Autolog sklearn model
mlflow.autolog()

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=500),
    "RandomForest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel="linear")
}
pyfunc_model = Ex21LayoutClassifier()

for classifier, model in classifiers.items():
    with mlflow.start_run(run_name=classifier):
        model.fit(X_train, y_train)
        # With autologging enabled, this call is recorded as a post-training metric
        model.score(X_test, y_test)
        sklearn_model_uri = mlflow.get_artifact_uri("model")
        mlflow.pyfunc.log_model(
            artifact_path="exhibit21_layout_classifier",
            python_model=pyfunc_model,
            artifacts={"layout_classifier": sklearn_model_uri},
            signature=infer_signature(ex21_layout_classifier_training_dataset, y),
        )
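
`model.score` reports accuracy, which can look good even when the minority class is poorly detected. As a hedged sketch, the snippet below compares the same three classifier types on synthetic, imbalanced data using precision and recall for the positive class. The data, the class balance, and all variable names are assumptions made for illustration, not results from the labeled filings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic, imbalanced stand-in: 160 negative rows and 40 positive rows
demo_X = np.vstack([rng.normal(0.0, 1.0, (160, 5)), rng.normal(2.5, 1.0, (40, 5))])
demo_y = np.array([0] * 160 + [1] * 40)
tr_X, te_X, tr_y, te_y = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=16, stratify=demo_y
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=500),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="linear"),
}

results = {}
for name, candidate in candidates.items():
    candidate.fit(tr_X, tr_y)
    predicted = candidate.predict(te_X)
    # Precision and recall are computed for the positive class (label 1)
    results[name] = {
        "precision": precision_score(te_y, predicted, zero_division=0),
        "recall": recall_score(te_y, predicted, zero_division=0),
    }

for name, metrics in results.items():
    print(name, metrics)
```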

2024/10/08 18:23:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2024/10/08 18:23:24 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.

2024/10/08 18:24:13 INFO mlflow.tracking._tracking_service.client: 🏃 View run LogisticRegression at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/15/runs/5f5d526e1e16442983679d6035599df2.
2024/10/08 18:24:13 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/15.
2024/10/08 18:24:14 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2024/10/08 18:24:14 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
2024/10/08 18:24:15 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.

2024/10/08 18:25:04 INFO mlflow.tracking._tracking_service.client: 🏃 View run RandomForest at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/15/runs/84642d0599894058b3ebe85f7f43eab9.
2024/10/08 18:25:04 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/15.
2024/10/08 18:25:05 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2024/10/08 18:25:05 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
2024/10/08 18:25:06 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.

2024/10/08 18:25:56 INFO mlflow.tracking._tracking_service.client: 🏃 View run SVM at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/15/runs/cbdd906766b2427c93e9c957be6ea9c8.
2024/10/08 18:25:56 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://mlflow-ned2up6sra-uc.a.run.app/#/experiments/15.
2024/10/08 18:25:56 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2024/10/08 18:25:57 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
