diff --git a/packages/docs/tool.md b/packages/docs/tool.md
index 7520efa..2f42751 100644
--- a/packages/docs/tool.md
+++ b/packages/docs/tool.md
@@ -49,7 +49,9 @@ embedding-atlas huggingface_org/dataset_name
 
 ## Visualizing Embeddings
 
-The script will use [SentenceTransformers](https://sbert.net/) to compute embedding vectors for the specified column containing the text data. The script will then project the high-dimensional embedding vectors to 2D with [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html).
+The script will use [SentenceTransformers](https://sbert.net/) to compute embedding vectors for the specified column containing the text or image data. You may use the `--model` option to specify an embedding model. If not specified, a default model will be used. The current defaults are `all-MiniLM-L6-v2` for text and `google/vit-base-patch16-384` for images, but these are subject to change in future releases.
+
+After embedding vectors are computed, the script will then project the high-dimensional vectors to 2D with [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html).
 
 ::: tip
 Optionally, if you know what column your text data is in beforehand, you can specify which column to use with the `--text` flag, for example:
@@ -74,6 +76,20 @@ If this column is specified, you'll be able to see nearest neighbors for a selec
 
 Once this script completes, it will print out a URL like `http://localhost:5055/`. Open the URL in a web browser to view the embedding.
 
+## Reproducibility
+
+For reproducible embedding visualizations, we recommend pre-computing both the embedding vectors and their UMAP projections, and storing them with your dataset. This ensures consistency since the default embedding model may change over time, floating-point precision may vary across different devices, and UMAP introduces randomness through both its default random initialization and its use of parallelism (see [here](https://umap-learn.readthedocs.io/en/latest/reproducibility.html)).
+
+The `embedding_atlas` package provides utility functions to compute the embedding projections:
+
+```python
+from embedding_atlas.projection import compute_text_projection
+
+compute_text_projection(df, text="text_column",
+    x="projection_x", y="projection_y", neighbors="neighbors"
+)
+```
+
 ## Usage
 
 ```