November 3-5, 2025 | University of Toronto
Repository of hands-on workshop materials for the Foundation Models for Science Workshop. These tutorials cover the complete workflow of working with protein language models, from data preparation to advanced analysis techniques.
The required packages are pre-installed in a Docker container.
You will need to install and configure Docker on your local computer using the official Docker instructions.
```bash
docker pull ghcr.io/carte-toronto/utoronto-fms-workshop-pytorch:latest
```
The official repository for this Docker image is here.
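Once the image is pulled, you can start a container along the following lines. This invocation is illustrative only: the port, mount path, and entrypoint are assumptions, so defer to the image's own documentation.

```bash
# Illustrative only: mounts the current directory and exposes the
# conventional JupyterLab port. Check the image docs for the actual setup.
docker run -it --rm \
  -p 8888:8888 \
  -v "$(pwd)":/workspace \
  ghcr.io/carte-toronto/utoronto-fms-workshop-pytorch:latest
```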
**Important:** A `requirements.txt` file is provided as a courtesy only. Please use the Docker container.
```bash
git clone https://github.com/ai-for-science-org/tutorials.git
cd tutorials
jupyter lab
```

📂 Location: Tutorial_1_Data_Cleaning/Data_Extraction.ipynb
What You'll Learn:
- Download and extract datasets from ProteinGym
- Clean and standardize tabular data with pandas
- Handle missing values and duplicates
- Normalize DMS (Deep Mutational Scanning) scores for machine learning (see the sketch below)
- Visualize dataset characteristics
Key Skills: Data wrangling, pandas operations, data quality assessment
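As a taste of those steps, here is a minimal sketch of the cleaning and normalization workflow on a generic DMS table. The filename and the `mutant`/`DMS_score` column names follow the ProteinGym convention but are assumptions here; check them against the actual files in the notebook.

```python
# Minimal sketch (not the notebook's exact code): clean a DMS table and
# min-max normalize the scores. "dms_assay.csv" is a hypothetical filename.
import pandas as pd

df = pd.read_csv("dms_assay.csv")

# Drop duplicate variants and rows missing the score we want to model
df = df.drop_duplicates(subset="mutant").dropna(subset=["DMS_score"])

# Min-max normalize DMS scores to [0, 1] for downstream machine learning
lo, hi = df["DMS_score"].min(), df["DMS_score"].max()
df["DMS_score_norm"] = (df["DMS_score"] - lo) / (hi - lo)

print(df.describe())  # quick look at the dataset's characteristics
```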
📂 Location: Tutorial_2_Fine_Tuning/ (Notebook TBD)
What You'll Learn:
- Load and use pre-trained protein language models (ESM-2)
- Generate protein embeddings for downstream tasks (see the sketch below)
- Perform zero-shot similarity search
- Fine-tune models with LoRA (Low-Rank Adaptation)
- Predict DMS stability scores using adapted models
Key Skills: Transfer learning, model adaptation, embedding generation
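For a sense of the embedding step, here is a minimal sketch using a small public ESM-2 checkpoint from HuggingFace; the workshop notebook may use a larger model and different pooling.

```python
# Minimal sketch: mean-pooled ESM-2 embeddings via HuggingFace Transformers.
# facebook/esm2_t6_8M_UR50D is a small public checkpoint, chosen here only
# so the example runs quickly.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sequences = ["MKTAYIAKQRQISFVKSHFSRQ"]  # toy protein sequence
inputs = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over tokens (masking padding) to get one vector per protein;
# note this simple pooling also averages in the special CLS/EOS tokens.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # (batch, hidden_dim); 320 for this checkpoint
```

The LoRA step would then wrap this backbone with a small number of trainable low-rank adapters (e.g. via the `peft` library) before regressing DMS scores; that part is left to the notebook.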
📂 Location: Tutorial_3_Uncertainty_Quant/UQ_tutorial.ipynb
What You'll Learn:
- Assess and improve model calibration using temperature scaling
- Implement heteroscedastic models to capture prediction uncertainty
- Use MC dropout to estimate epistemic uncertainty (see the sketch below)
- Apply conformal prediction for distribution-free uncertainty intervals
- Distinguish between different types of uncertainty in your predictions
Key Skills: Confidence estimation, calibration methods, probabilistic prediction
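As one concrete example of these techniques, here is a minimal MC-dropout sketch; the tiny regressor is illustrative only, standing in for whatever model the notebook trains.

```python
# Minimal MC-dropout sketch: keep dropout active at inference time and use
# the spread of repeated stochastic forward passes as an estimate of
# epistemic uncertainty. The toy regressor below is not the notebook's model.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(320, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 1)
)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # train mode keeps Dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)  # predictive mean and spread

x = torch.randn(8, 320)  # e.g. a batch of 8 protein embeddings
mean, std = mc_dropout_predict(model, x)
print(mean.shape, std.shape)
```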
📂 Location: Tutorial_4_Latent_Space/Latent_Space_Analysis.ipynb
What You'll Learn:
- Extract and manipulate protein embeddings from pre-trained ESM-2 models
- Reduce high-dimensional embeddings to 2D for visualization using UMAP
- Quantify clustering quality using mutual information metrics (see the sketch below)
- Optimize dimensionality reduction hyperparameters automatically with Optuna
- Analyze how features change across different layers of a transformer model
- Interpret latent space structure in relation to protein function (EC classes)
Key Skills: Embedding analysis, dimensionality reduction, hyperparameter optimization, interpretability
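A minimal sketch of that pipeline, using random arrays in place of real ESM-2 embeddings and EC-class labels, might look like this; adjusted mutual information is used here as one concrete choice of mutual-information metric.

```python
# Minimal sketch: reduce embeddings with UMAP, cluster the 2D layout, and
# score agreement with known classes. Random data stands in for real
# ESM-2 embeddings and EC labels.
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 320))  # placeholder ESM-2 embeddings
labels = rng.integers(0, 6, size=500)     # placeholder EC classes

coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)

# Cluster the 2D layout and compare cluster assignments to the true labels
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(coords)
print(adjusted_mutual_info_score(labels, clusters))
```

In the notebook, Optuna can then search over UMAP hyperparameters such as `n_neighbors` and `min_dist`, using a score like this as the objective.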
- Python: 3.8 or higher
- PyTorch: For deep learning models
- Transformers: HuggingFace library for ESM models
- Pandas: Data manipulation
- NumPy: Numerical computing
- Matplotlib/Seaborn: Visualization
- UMAP: Dimensionality reduction (Tutorial 4)
- Optuna: Hyperparameter optimization (Tutorial 4)
- scikit-learn: Machine learning utilities
- tqdm: Progress bars
- GPU: Recommended for Tutorial 2 (fine-tuning) and Tutorial 4 (embeddings)
- RAM: 16GB minimum, 32GB recommended
- Storage: ~10GB for datasets and models
See the LICENSE file for details.
This is a private workshop repository. For technical support or questions about the tutorials, please contact the workshop organizers (the FOMO4Sci Workshop team).
Happy Learning! 🚀