This repository contains the official implementation of our paper "AdaptMol: Domain Adaptation for Molecular Image Recognition with Limited Supervision".
If you use our work in your research, please cite:
```
# Citation will be added upon publication
```

The dataset used in this work is available at: train, evaluation
Download our pretrained model from: model
```
conda env create -f environment.yml
conda activate adaptmol
```

We use Python 3.8.
```
pip install -r requirements.txt
```

Make sure Docker Desktop is installed, then:

```
docker-compose up --build -d
docker exec -it adaptmol bash
```

Run prediction on molecular images:
```
python predict.py --model_path <checkpoints_path> --image_path <image_path>
```

Training consists of four stages. Run them sequentially:
Stage 1:

```
bash scripts/stage1.sh
```

Stage 2: Generate predictions on the USPTO dataset

```
bash scripts/predict_uspto.sh
```

Stage 3:

```
bash scripts/stage2.sh
```

Stage 4:
```
bash scripts/stage3.sh
```

- `--data_path`: Base directory prefix for all data file paths
- `--train_file`: Path to the training data file (relative to `data_path`)
- `--validation_file`: Path to the validation file, evaluated after each epoch during training
- `--test_file`: Path to the test set file, used for final model evaluation after training completes
- `--valid_file`: Path to a single-file validation set for post-training evaluation
- `--save_path`: Directory to save model checkpoints, evaluation results, and training logs
- `--vocab_file`: Path to the vocabulary file mapping tokens to characters (e.g., `adaptmol/vocab/vocab_chars.json`)
- `--molblock`: When present, the model outputs both SMILES and MOL file formats
- `--coord_bins`: Number of coordinate bins for discretization (default: 64)
- `--sep_xy`: Use separate encodings for x and y coordinates
- `--input_size`: Input image resolution (e.g., 384×384)
- `--encoder_lr`: Learning rate for the encoder (e.g., 4e-6)
- `--decoder_lr`: Learning rate for the decoder (e.g., 4e-6)
- `--epochs`: Number of training epochs
- `--batch_size`: Batch size per GPU (computed automatically as `BATCH_SIZE / NUM_GPUS_PER_NODE / ACCUM_STEP`)
- `--gradient_accumulation_steps`: Number of gradient accumulation steps before updating weights
- `--warmup`: Warmup ratio for the learning rate scheduler
- `--label_smoothing`: Label smoothing factor to reduce overfitting (e.g., 0.1)
- `--augment`: Enable data augmentation
- `--do_train`: Enable training mode
- `--do_valid`: Enable validation during training
- `--do_test`: Enable testing after training
- `--fp16`: Use mixed-precision (FP16) training for efficiency
- `--use_checkpoint`: Enable gradient checkpointing to save memory
- `--mmd_file`: Path to the hand-drawn dataset CSV file
- `--load_path`: Path to a pretrained model checkpoint or a previous stage's model (e.g., `output/stage1/swin_base_transformer_best.pth`)
- `--resume`: Continue training from the loaded checkpoint (preserves training state)
- `--init_scheduler`: Reset scheduler settings for the loaded model, including:
  - Encoder and decoder learning rates
  - Training epoch counter
- `--finetune_data`: Path to predicted data suitable for fine-tuning (i.e., generated in Stage 2)
- `--finetune_label`: Predicted labels corresponding to `finetune_data`
- `--save_mode`: Checkpoint saving strategy (e.g., `all` to save all checkpoints)
- `--print_freq`: Logging frequency (print every N batches, e.g., 50)
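Two of the numeric flags above encode simple arithmetic relationships that are worth making explicit. The following is a minimal Python sketch (an illustration, not code from this repository) of the per-GPU batch size implied by `BATCH_SIZE / NUM_GPUS_PER_NODE / ACCUM_STEP`, and of how a normalized coordinate could be discretized into `--coord_bins` bins:

```python
def per_gpu_batch_size(batch_size: int, num_gpus: int, accum_steps: int) -> int:
    """Micro-batch processed per GPU per step, from the global batch size."""
    return batch_size // (num_gpus * accum_steps)

def coord_to_bin(x: float, n_bins: int = 64) -> int:
    """Discretize a normalized coordinate in [0, 1] into one of n_bins bins."""
    return min(int(x * n_bins), n_bins - 1)

def bin_to_coord(b: int, n_bins: int = 64) -> float:
    """Recover the bin-center coordinate from a bin index."""
    return (b + 0.5) / n_bins

# e.g., a global batch of 256 on 4 GPUs with 2 accumulation steps
# gives 32 samples per GPU per forward pass.
print(per_gpu_batch_size(256, 4, 2))  # 32
print(coord_to_bin(0.5))              # 32
```

Quantization error from binning is at most half a bin width, i.e., `1 / (2 * coord_bins)` in normalized units, which is why a larger `--coord_bins` gives finer atom-position resolution at the cost of a larger output vocabulary.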
To reproduce the results reported in our paper:
```
bash scripts/paper_evaluation.sh
```

This work builds upon several excellent projects:
- MolScribe: We thank the authors for their work; our code architecture is based on their implementation.
- MolDepictor: Our synthetic training data generation is based on their code with modifications.
This project is licensed under the MIT License - see the LICENSE file for details.
This project includes components from third-party sources - see THIRD_PARTY_LICENSES for details.
