BCCP: An MLOps Framework for Self-cleansing Real-Time Data Noise via Bayesian Cut-off-based Closest Pair Sampling
Early Korean Version available at: https://github.com/duneag2/capstone-mlops
This is the implementation of the approach described in the paper:
S. Lee, N. Jeong, J. Je, and S.-Y. Lee, "BCCP: An MLOps Framework for Self-cleansing Real-Time Data Noise via Bayesian Cut-off-based Closest Pair Sampling", 2024 IEEE International Conference on AI x Data & Knowledge Engineering [link]
To get started as quickly as possible, follow the instructions in this section. This will allow you to prepare the image classification dataset and train the model from scratch.
Make sure you have the following dependencies installed before proceeding:
- Python 3+ distribution
- PyTorch >= 2.1.2
Our classification experiments utilize three distinct datasets. To emphasize the practicality of MLOps, we select datasets related to factory management, waste management, and agricultural business. The datasets we used can be downloaded from the links below. You can also test our model using any image classification dataset.
- Cargo Dataset (https://www.kaggle.com/datasets/morph1max/definition-of-cargo-transportation)
- Bag Dataset (https://www.kaggle.com/datasets/vencerlanz09/plastic-paper-garbage-bag-synthetic-images)
- Sugarcane Leaf Disease Dataset (https://www.kaggle.com/datasets/nirmalsankalana/sugarcane-leaf-disease-dataset)
Once the dataset is prepared, place the image folders into the bccp_mlops/api_serving folder so that each class has its own sub-folder under dataset/. Make sure the structure looks like the one below.
api_serving
├── dataset
│ ├── class1 // contains many image files
│ ├── class2 // contains many image files
│ └── class3 // contains many image files
├── Makefile
├── app.py
├── docker-compose.yaml
├── download_model.py
└── schemas.py
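If you want to confirm the layout before moving on, a short check like the following (optional, not part of the repository; the path is an assumption based on the tree above) prints the number of images found per class:

```python
# Optional sanity check: count the images in each class folder under
# api_serving/dataset. Adjust dataset_dir to your local checkout.
from pathlib import Path

dataset_dir = Path("bccp_mlops/api_serving/dataset")
for class_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
    n_images = sum(1 for f in class_dir.iterdir()
                   if f.suffix.lower() in {".jpg", ".jpeg", ".png"})
    print(f"{class_dir.name}: {n_images} images")
```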
To prepare the Monday and Tuesday datasets and split the images into train and test sets, run the following command from the dataset_prepare/ directory. (For a description of the Monday and Tuesday datasets, please refer to our paper.)
python3 prepare_dataset.py -d dataset_name
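For example, with the Cargo dataset placed as described above, the call would be `python3 prepare_dataset.py -d cargo` (here `cargo` follows the example dataset name used in the option descriptions further below; substitute the name of the dataset you prepared).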
Once the execution is complete, a JSON file will be generated in the data_generate folder.
This step requires Docker Desktop and PostgreSQL to be installed.
Run the following command in the data_generate/ directory to create the Data Generator container.
DATASET=dataset_name TARGET_DAY=target_day docker compose up -d --build --force-recreate
If you want to use only the Monday dataset, input monday for target_day. If you want to use both the Monday and Tuesday datasets, input tuesday for target_day.
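For example, `DATASET=cargo TARGET_DAY=monday docker compose up -d --build --force-recreate` creates a Data Generator container that serves only the Monday split of the cargo dataset (again assuming `cargo` as the dataset name).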
Create a container for model training in the model_registry/ folder.
docker compose up -d --build --force-recreate
Access localhost:5001

Access localhost:9001 (username: minio, password: miniostorage).
When you first access it, there will be no buckets. Go to the Create a bucket section, set the Bucket Name to mlflow, and create the bucket. (There is no need to click on the toggles below.)
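Before moving on to training, the training code has to be able to reach both services. As a rough illustration only (the repository's save_model_to_registry.py and compose files may already wire this up differently), a script could point MLflow at the tracking server and the mlflow bucket roughly as follows; the S3 endpoint on port 9000 is an assumption, since 9001 is MinIO's web console:

```python
# Illustrative only: connect an MLflow client to the tracking server and the
# MinIO-backed artifact store created above. Endpoints and the experiment
# name are assumptions, not values taken from the repository.
import os

import mlflow

os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"   # MinIO S3 API
os.environ["AWS_ACCESS_KEY_ID"] = "minio"
os.environ["AWS_SECRET_ACCESS_KEY"] = "miniostorage"

mlflow.set_tracking_uri("http://localhost:5001")
mlflow.set_experiment("bccp-classification")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("dataset", "cargo")
    mlflow.log_metric("test_accuracy", 0.0)   # placeholder metric
    # mlflow.pytorch.log_model(model, "model") would push a trained network
    # into the mlflow bucket as a run artifact.
```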

If you want to train our model, please execute the following command:
python3 save_model_to_registry.py -d dataset_name -t target_day -l label --user_accuracy user_accuracy -b Y/N -s sampling_type --monday_num number_of_images --tuesday_num number_of_images -r ratio --model_name model_name
- `-d` or `--dataset`: Specifies the dataset to use, e.g., `cargo`.
- `-t` or `--target`: Specifies the target day (`monday` or `tuesday`). If you want to use only the Monday dataset, input `monday`; if you want to use both the Monday and Tuesday datasets, input `tuesday`. Default: `monday`.
- `-l` or `--label`: Specifies the label to use for training. The `ground_truth` option uses the correct labels for the images during training. The `user_feedback` option assumes a scenario where the labels are generated from user feedback on the images, and therefore assumes lower label accuracy. The accuracy of the `user_feedback` labels can be set with the `--user_accuracy` option. Default: `ground_truth`.
- `--user_accuracy`: Sets the label accuracy when using the `user_feedback` labels. Default: `0.7`.
- `-b` or `--bayesian_cut_off`: Determines whether to apply the Bayesian cut-off to the dataset before model training. Enter `Y` to use it or `N` to skip it. Default: `N`.
- `-s` or `--sampling_type`: Determines the method for sampling images to be re-trained from the Reuse Buffer (a conceptual sketch follows this list). If set to `none`, the Reuse Buffer is not used. If set to `random`, random sampling is applied. If set to `l1_norm`, the L1-norm-based CP sampling method is used; `l2_norm` uses the L2-norm-based CP sampling method; `cosine_similarity` uses the Cosine-Similarity-based CP sampling method. Default: `none`.
- `--monday_num`: Specifies the number of images in the Monday dataset.
- `--tuesday_num`: Specifies the number of images in the Tuesday dataset.
- `-r` or `--ratio`: When the sampling type is not `none`, sets the proportion of samples to extract from the Reuse Buffer.
- `--model_name`: Specifies the model name. Default: `cls_model`.
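For example, a full training call with the example values above might look like `python3 save_model_to_registry.py -d cargo -t tuesday -l user_feedback --user_accuracy 0.7 -b Y -s l2_norm --monday_num 1000 --tuesday_num 1000 -r 0.1 --model_name cls_model` (the image counts and ratio here are placeholders). For intuition about the `-s` option, the sketch below illustrates the general idea of distance-based closest-pair selection from a Reuse Buffer of feature vectors; it is a conceptual illustration under our own assumptions, not the sampling procedure implemented in the repository, for which please refer to the paper:

```python
# Conceptual sketch of CP-style sampling: keep the fraction `ratio` of
# buffered samples whose feature vectors lie closest to the incoming batch.
# Function and variable names are hypothetical.
import numpy as np


def closest_pair_indices(buffer_feats, new_feats, ratio, metric="l2_norm"):
    """Return indices of buffered samples closest to the new batch."""
    if metric == "l1_norm":
        dists = np.abs(buffer_feats[:, None, :] - new_feats[None, :, :]).sum(-1)
    elif metric == "l2_norm":
        dists = np.linalg.norm(buffer_feats[:, None, :] - new_feats[None, :, :], axis=-1)
    elif metric == "cosine_similarity":
        b = buffer_feats / np.linalg.norm(buffer_feats, axis=1, keepdims=True)
        n = new_feats / np.linalg.norm(new_feats, axis=1, keepdims=True)
        dists = 1.0 - b @ n.T  # cosine distance
    else:
        raise ValueError(f"unknown metric: {metric}")

    nearest = dists.min(axis=1)               # distance to the nearest new sample
    k = max(1, int(ratio * len(buffer_feats)))
    return np.argsort(nearest)[:k]            # the k closest buffered samples


# Example: 100 buffered feature vectors, 20 new ones, re-use 10% of the buffer.
rng = np.random.default_rng(0)
print(closest_pair_indices(rng.normal(size=(100, 64)),
                           rng.normal(size=(20, 64)),
                           ratio=0.1, metric="cosine_similarity"))
```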
Additionally, we conducted comparative experiments with two notable papers in data cleansing: Cleanlab and Ye et al. You can run the two comparative experiments using the following commands.
Cleanlab (C. G. Northcutt, et al., "Confident Learning: Estimating Uncertainty in Dataset Labels," Journal of Artificial Intelligence Research, pp. 1373-1411, 2021.)
python3 save_model_to_registry_cleanlab.py -d dataset_name -l label --user_accuracy user_accuracy --monday_num number_of_images --tuesday_num number_of_images --model_name model_name
ANL_CE (X. Ye, et al., "Active Negative Loss Functions for Learning with Noisy Labels," 37th International Conference on Neural Information Processing Systems, pp. 6917-6940, 2023.)
python3 save_model_to_registry_ANL_CE.py -d dataset_name -l label --user_accuracy user_accuracy --monday_num number_of_images --tuesday_num number_of_images --model_name model_name
We have documented the process of implementing a real-time visualization dashboard that uses Grafana to display the original data and the predicted values. The setup involves several key steps: creating a data subscriber that retrieves data from a Kafka topic, transmitting this data to an API server, receiving the predicted values, and sending them to the target database. Together, these steps enable real-time monitoring and visualization of both the original and predicted data in Grafana. The implementation process is described in detail in the following three documents, presented in sequential order (a minimal sketch of the subscriber loop is given after the list):
- Document 1: API Serving
- This document describes how to implement a REST API using FastAPI. The API receives input data and returns the predicted values generated by the model.
- Document 2: Kafka
- In this document, you'll learn how to build a real-time data pipeline using Kafka. This setup is essential for stream serving, enabling the real-time transmission of data.
- Document 3: Grafana Dashboard Configuration
- This final document covers how to configure a Grafana dashboard to monitor data in real-time. It includes instructions on visualizing both the original data and the model's predictions.
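As a minimal sketch of the subscriber loop described above (assumptions only; the documents above contain the actual implementation), the consumer could be wired up with kafka-python, requests, and psycopg2 roughly as follows; the topic name, API route, table schema, and connection settings are all hypothetical:

```python
# Sketch of the data subscriber: read records from a Kafka topic, ask the
# FastAPI model server for predictions, and store both in PostgreSQL so that
# Grafana can visualize them. All names and endpoints below are assumptions.
import json

import psycopg2
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "bccp-input",                                # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
db = psycopg2.connect(host="localhost", dbname="targetdb",
                      user="postgres", password="postgres")

for message in consumer:
    record = message.value

    # Forward the record to the REST API from Document 1 and read the prediction.
    resp = requests.post("http://localhost:8000/predict", json=record)
    prediction = resp.json()

    # Write the original record and its prediction to the target database.
    with db.cursor() as cur:
        cur.execute(
            "INSERT INTO predictions (payload, prediction) VALUES (%s, %s)",
            (json.dumps(record), json.dumps(prediction)),
        )
    db.commit()
```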
