# Getting started

Data associated with the Allen Brain Cell Atlas is hosted on Amazon Web Services (AWS) in an S3 bucket as an AWS Public Dataset.
No account or login is required. The S3 bucket is located here [arn:aws:s3:::allen-brain-cell-atlas](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html). You will need to be connected to the internet to run this notebook.

Each data release has an associated **manifest.json**, which lists the specific versions of all directories and files that are part of the release. We recommend using the [**AbcProjectCache**](https://github.com/AllenInstitute/abc_atlas_access/blob/41eb836e41e516ee528c98bf3979d0aff60c0b85/src/abc_atlas_access/abc_atlas_cache/abc_project_cache.py#L36C7-L36C22) to download the data.

Expression matrices are stored in the [anndata h5ad format](https://anndata.readthedocs.io/en/latest/) and need to be downloaded to a local file system before use.

This notebook shows how to use the **AbcProjectCache** to download the data required for the tutorials.

Below we install the Python library used throughout the tutorials into the current Python environment.

**If you haven't already, run the line below in your Anaconda terminal to install the package that provides AbcProjectCache:**

pip install git+https://github.com/alleninstitute/abc_atlas_access.git

In [1]:
from pathlib import Path
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

## Using the cache

Below we show how to set up the cache to download from S3, and how to list the available directories, their sizes, and the files in each directory.

Set up the **AbcProjectCache** object by specifying a directory and calling ``from_cache_dir`` as shown below.

Calling ``from_cache_dir`` instantiates the cache and configures it to download data from the AWS S3 bucket. We also print the version of the manifest currently loaded by the cache.

In [3]:
# download_base = Path('../../data/abc_atlas') # change this path to a directory where you want to save your data
download_base = Path('../data/abc_atlas')

abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest

'releases/20251031/manifest.json'

In [7]:
abc_cache.list_manifest_file_names

['releases/20230630/manifest.json',
 'releases/20230830/manifest.json',
 'releases/20231215/manifest.json',
 'releases/20240330/manifest.json',
 'releases/20240831/manifest.json',
 'releases/20241115/manifest.json',
 'releases/20241130/manifest.json',
 'releases/20250131/manifest.json',
 'releases/20250331/manifest.json',
 'releases/20250531/manifest.json',
 'releases/20250930/manifest.json',
 'releases/20251031/manifest.json']
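
Each manifest name in the list above embeds its release date as a YYYYMMDD folder name, so the names can be parsed and compared with the standard library. The snippet below is a minimal sketch under that assumption; ``release_date`` is a hypothetical helper and not part of abc_atlas_access, and the list is a shortened copy of the output above.

```python
from datetime import date, datetime

# A few names copied from the output above. In the notebook, this list
# would come from abc_cache.list_manifest_file_names.
manifest_names = [
    'releases/20230630/manifest.json',
    'releases/20250531/manifest.json',
    'releases/20251031/manifest.json',
    'releases/20240831/manifest.json',
]


def release_date(manifest_name: str) -> date:
    """Hypothetical helper: read the YYYYMMDD folder name as a date."""
    stamp = manifest_name.split('/')[1]
    return datetime.strptime(stamp, '%Y%m%d').date()


# Pick the most recent release by its parsed date.
latest = max(manifest_names, key=release_date)
print(latest)
```

The selected name could then be passed to ``load_manifest``, which is used in the next cell.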

In [9]:
abc_cache.load_manifest('releases/20250531/manifest.json')
print("old manifest loaded:", abc_cache.current_manifest)

# Return to the latest manifest
# abc_cache.load_latest_manifest()
# print("after latest manifest loaded:", abc_cache.current_manifest)

old manifest loaded: releases/20250531/manifest.json


In [10]:
dir(abc_cache)

['MANIFEST_COMPATIBILITY',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__firstlineno__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__static_attributes__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_default_bucket_name',
 '_get_directory_files',
 '_get_local_path',
 '_local',
 '_ui_class_name',
 '_warn_directory_size',
 'cache',
 'compare_manifests',
 'current_manifest',
 'from_cache_dir',
 'from_local_cache',
 'from_s3_cache',
 'get_data_path',
 'get_directory_expression_matrices',
 'get_directory_expression_matrix_size',
 'get_directory_image_volume_size',
 'get_directory_image_volumes',
 'get_directory_mapmycells',
 'get_directory_mapmycells_size',
 'get_directory_metadata',
 'get_directory_metadata_size',
 'get_directory_s

We can list all available directories in the release we loaded using the method below. We can then list the data and metadata files available in those directories. Note that the cache will raise an exception if the requested kind of files (data files [e.g. h5ad expression_matrices, nii.gz image_volumes] or metadata files [e.g. csv files]) is not available in the directory.

In [4]:
abc_cache.list_directories # all directories in the loaded release

['ASAP-PMDBS-10X',
 'ASAP-PMDBS-taxonomy',
 'Allen-CCF-2020',
 'Consensus-WMB-AIBS-10X',
 'Consensus-WMB-Macosko-10X',
 'Consensus-WMB-integrated-taxonomy',
 'HMBA-10xMultiome-BG',
 'HMBA-10xMultiome-BG-Aligned',
 'HMBA-BG-taxonomy-CCN20250428',
 'HMBA-MERSCOPE-H22.30.001-BG',
 'HMBA-MERSCOPE-QM23.50.001-BG',
 'HMBA-Xenium-CJ23.56.004-BG',
 'MERFISH-C57BL6J-638850',
 'MERFISH-C57BL6J-638850-CCF',
 'MERFISH-C57BL6J-638850-imputed',
 'MERFISH-C57BL6J-638850-sections',
 'SEAAD-taxonomy',
 'WHB-10Xv3',
 'WHB-taxonomy',
 'WMB-10X',
 'WMB-10XMulti',
 'WMB-10Xv2',
 'WMB-10Xv3',
 'WMB-neighborhoods',
 'WMB-taxonomy',
 'Zeng-Aging-Mouse-10Xv3',
 'Zeng-Aging-Mouse-WMB-taxonomy',
 'Zhuang-ABCA-1',
 'Zhuang-ABCA-1-CCF',
 'Zhuang-ABCA-2',
 'Zhuang-ABCA-2-CCF',
 'Zhuang-ABCA-3',
 'Zhuang-ABCA-3-CCF',
 'Zhuang-ABCA-4',
 'Zhuang-ABCA-4-CCF',
 'mmc-gene-mapper']

In [5]:
abc_cache.list_data_files('Zeng-Aging-Mouse-10Xv3') # mouse aging dataset
# https://alleninstitute.github.io/abc_atlas_access/descriptions/Zeng_Aging_Mouse_dataset.html 

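AttributeError: 'AbcProjectCache' object has no attribute 'list_data_files'

The installed version of the package has no ``list_data_files`` method. The ``dir(abc_cache)`` output above lists type-specific methods such as ``get_directory_expression_matrices`` and ``get_directory_image_volumes``, which suggests that the data-file methods are split by file type in this version.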

In [7]:
abc_cache.list_metadata_files('Zeng-Aging-Mouse-WMB-taxonomy') # aging mouse taxonomy metadata

['aging_degenes',
 'cell_cluster_mapping_annotations',
 'cell_cross_mapping_annotations',
 'cluster_mapping',
 'cluster_mapping_pivot']

In [16]:
abc_cache.list_metadata_files('WMB-10X') # WMB-10X metadata

['cell_metadata',
 'cell_metadata_with_cluster_annotation',
 'example_genes_all_cells_expression',
 'gene',
 'region_of_interest_metadata']
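
Because the cache raises an exception when a directory does not contain the requested kind of file, a loop over many directories may want to catch the error and continue. The sketch below shows one way to do that. ``list_or_empty`` is a hypothetical helper, and ``fake_list_metadata_files`` is a stand-in used only so the example runs without the package or network access; the exact exception type raised by the real cache is not stated in this notebook, so the sketch catches ``Exception`` broadly.

```python
from typing import Callable, List


def list_or_empty(list_method: Callable[[str], List[str]], directory: str) -> List[str]:
    """Hypothetical helper: call a listing method, return [] if it raises."""
    try:
        return list_method(directory)
    except Exception as err:  # broad on purpose: exact type is an unknown here
        print(f"{directory}: skipped ({err})")
        return []


def fake_list_metadata_files(directory: str) -> List[str]:
    """Stand-in for a cache listing method (illustration only)."""
    available = {'WMB-10X': ['cell_metadata', 'gene']}
    if directory not in available:
        raise KeyError(f"no metadata files in {directory}")
    return available[directory]


found = list_or_empty(fake_list_metadata_files, 'WMB-10X')
missing = list_or_empty(fake_list_metadata_files, 'Allen-CCF-2020')
print(found, missing)
```

With the real cache, the first argument would be a bound method such as ``abc_cache.list_metadata_files``.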

Before we start downloading data, we can check how much total data is in a given directory for both data files and metadata files.

In [8]:
abc_cache.get_directory_data_size('Zeng-Aging-Mouse-10Xv3')

'25.33 GB'

In [9]:
abc_cache.get_directory_metadata_size('Zeng-Aging-Mouse-WMB-taxonomy')

'287.04 MB'
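
The size methods return human-readable strings, so comparing them against free disk space takes one parsing step. The following is a minimal sketch; ``size_to_bytes`` and ``fits_on_disk`` are hypothetical helpers, and the 1024-based unit table is an assumption (it gives the larger, more cautious byte count if the package actually reports 1000-based units).

```python
import shutil

# Assumption: binary units. This overestimates slightly if the reported
# sizes are decimal, which errs on the safe side for a disk-space check.
UNITS = {'B': 1, 'KB': 1024, 'MB': 1024**2, 'GB': 1024**3, 'TB': 1024**4}


def size_to_bytes(size_text: str) -> int:
    """Hypothetical helper: convert a string like '25.33 GB' to bytes."""
    value, unit = size_text.split()
    return int(float(value) * UNITS[unit.upper()])


def fits_on_disk(size_text: str, target_dir: str = '.') -> bool:
    """Return True if the free space under target_dir exceeds the given size."""
    free_bytes = shutil.disk_usage(target_dir).free
    return size_to_bytes(size_text) < free_bytes


print(size_to_bytes('287.04 MB'))
print(fits_on_disk('25.33 GB'))
```

In the notebook, ``target_dir`` would be the ``download_base`` directory passed to ``from_cache_dir``.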

## Downloading files

The next set of examples shows how to download data to the directory you specified when setting up the cache object. There are two main ways of downloading the data: individually by file or by full directory.

### Downloading all data files or metadata files in a directory.

Here we show how to download the full set of data files or metadata files contained in a directory in the release. Use ``list_directories`` as a guide to what data is available. Below we download the data files for one directory and the metadata files for two directories. Once all files have been downloaded, a list of Paths to the downloaded files is returned.

Be aware that several directories are large (>100 GB). If a directory is over 10 GB in total, the cache warns you when you request to download its data.

In [15]:
# Download every data file in the directory; returns a list of local Paths
data_paths = abc_cache.get_directory_data('Zeng-Aging-Mouse-10Xv3')
print("Zeng-Aging-Mouse-10Xv3 data files:\n\t", data_paths)


	Total directory size = 25.33 GB


Zeng-Aging-Mouse-10Xv3-log2.h5ad: 100%|██████████████████████████████████████████| 13.3G/13.3G [35:56<00:00, 6.19MMB/s]
Zeng-Aging-Mouse-10Xv3-raw.h5ad: 100%|███████████████████████████████████████████| 13.8G/13.8G [19:27<00:00, 11.9MMB/s]

Zeng-Aging-Mouse-10Xv3 data files:
	 [WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/expression_matrices/Zeng-Aging-Mouse-10Xv3/20241130/Zeng-Aging-Mouse-10Xv3-log2.h5ad'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/expression_matrices/Zeng-Aging-Mouse-10Xv3/20241130/Zeng-Aging-Mouse-10Xv3-raw.h5ad')]





In [16]:
# Download every metadata file in the directory; returns a list of local Paths
metadata_paths = abc_cache.get_directory_metadata('Zeng-Aging-Mouse-WMB-taxonomy')
print("Zeng-Aging-Mouse-WMB-taxonomy metadata files:\n\t", metadata_paths)

aging_degenes.csv: 100%|█████████████████████████████████████████████████████████| 1.13M/1.13M [00:00<00:00, 2.01MMB/s]
cell_cluster_mapping_annotations.csv: 100%|████████████████████████████████████████| 173M/173M [02:01<00:00, 1.43MMB/s]
cell_cross_mapping_annotations.csv: 100%|██████████████████████████████████████████| 126M/126M [01:33<00:00, 1.34MMB/s]
cluster_mapping.csv: 100%|██████████████████████████████████████████████████████████| 405k/405k [00:00<00:00, 854kMB/s]
cluster_mapping_pivot.csv: 100%|████████████████████████████████████████████████████| 118k/118k [00:00<00:00, 631kMB/s]

Zeng-Aging-Mouse-WMB-taxonomy metadata files:
	 [WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/aging_degenes.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/cell_cluster_mapping_annotations.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/cell_cross_mapping_annotations.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/cluster_mapping.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/cluster_mapping_pivot.csv')]





In [18]:
# Download every metadata file in the directory; returns a list of local Paths
metadata_paths = abc_cache.get_directory_metadata('WMB-10X')
print("WMB-10X:\n\t", metadata_paths)

WMB-10X:
	 [WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/WMB-10X/20241115/cell_metadata.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/WMB-10X/20241115/views/cell_metadata_with_cluster_annotation.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/WMB-10X/20241115/views/example_genes_all_cells_expression.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/WMB-10X/20241115/gene.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/WMB-10X/20241115/region_of_interest_metadata.csv')]
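
The returned list of Paths is easier to work with when indexed by file name. The sketch below builds such a lookup with ``pathlib``. The paths are shortened, relative copies of a few entries from the output above, used only for illustration; in the notebook the list would be the value returned by ``get_directory_metadata``.

```python
from pathlib import Path

# Shortened copies of paths from the output above (illustration only).
downloaded = [
    Path('metadata/WMB-10X/20241115/cell_metadata.csv'),
    Path('metadata/WMB-10X/20241115/views/cell_metadata_with_cluster_annotation.csv'),
    Path('metadata/WMB-10X/20241115/gene.csv'),
]

# Map each file's stem (name without the .csv suffix) to its full path.
by_name = {path.stem: path for path in downloaded}

print(sorted(by_name))
print(by_name['gene'])
```

The stems match the names returned by ``list_metadata_files``, so the same strings can be used for both listing and lookup.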


Note that, after the files have been downloaded successfully, running the ``get_directory_data`` or ``get_directory_metadata`` methods again will return the list of local paths without redownloading the files.
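
The general idea of returning a cached path instead of downloading again can be sketched with the standard library. This is an illustration of the pattern only, not the package's implementation: ``fetch_once`` and ``fake_download`` are hypothetical names, and the real cache may perform additional checks (for example on file versions) that are not shown here.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def fetch_once(local_path: Path, download: Callable[[Path], None]) -> Path:
    """Hypothetical helper: download only if the local file is missing."""
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        download(local_path)
    return local_path


calls: List[Path] = []


def fake_download(dest: Path) -> None:
    """Stand-in for a real download; records the call and writes a file."""
    calls.append(dest)
    dest.write_text('placeholder contents')


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'metadata' / 'gene.csv'
    first = fetch_once(target, fake_download)   # file missing: downloads
    second = fetch_once(target, fake_download)  # file present: no download
    download_count = len(calls)

print(download_count)
```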

In [21]:
# These calls return the local paths immediately because the files already exist
data_paths = abc_cache.get_directory_data('Zeng-Aging-Mouse-10Xv3')
print("Zeng-Aging-Mouse-10Xv3 data files:\n\t", data_paths, "\n\n")
metadata_paths = abc_cache.get_directory_metadata('Zeng-Aging-Mouse-WMB-taxonomy')
print("Zeng-Aging-Mouse-WMB-taxonomy metadata files:\n\t", metadata_paths, "\n\n")
metadata_paths = abc_cache.get_directory_metadata('WMB-10X')
print("WMB-10X metadata files:\n\t", metadata_paths)

Zeng-Aging-Mouse-10Xv3 data files:
	 [WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/expression_matrices/Zeng-Aging-Mouse-10Xv3/20241130/Zeng-Aging-Mouse-10Xv3-log2.h5ad'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/expression_matrices/Zeng-Aging-Mouse-10Xv3/20241130/Zeng-Aging-Mouse-10Xv3-raw.h5ad')] 


Zeng-Aging-Mouse-WMB-taxonomy metadata files:
	 [WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/aging_degenes.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/cell_cluster_mapping_annotations.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/cell_cross_mapping_annotations.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/metadata/Zeng-Aging-Mouse-WMB-taxonomy/20241130/cluster_mapping.csv'), WindowsPath('C:/Users/guoyu/Documents/python/data/abc_atlas/m