# Records by Project and Bucket Analysis
This notebook analyzes the distribution of records across different projects and storage buckets.

We'll use the AIND Document Database to:
* Count records per project and bucket combination
* Display results in a clear tabular format

## Imports

In [2]:
from aind_data_access_api.document_db import MetadataDbClient
import pandas as pd

# Configure pandas to display all columns and all rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Connect to the metadata database

In [3]:
# Initialize the client
client = MetadataDbClient(
    host="api.allenneuraldynamics.org",
    database="metadata_index",
    collection="data_assets",
)

## Define and run a pipeline that returns each record's project name and its bucket (parsed from the location field)

In [None]:
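# Projection-only pipeline: return just the project name and the bucket for each record.
# The bucket is element 2 of the location split on '/', assuming locations look like s3://<bucket>/<prefix>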
pipeline = [
    {
        '$project': {
            'project': '$data_description.project_name',
            'bucket': {'$arrayElemAt': [{'$split': ['$location', '/']}, 2]},
            '_id': 0
        }
    }
]

# Execute the aggregation
results = client.aggregate_docdb_records(pipeline=pipeline)
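
The `$split` / `$arrayElemAt` expression in the pipeline can be hard to read at a glance. The following is a minimal plain-Python sketch of the same string logic, assuming locations have the form `s3://<bucket>/<prefix>`. The helper name `bucket_from_location` and the example location are hypothetical and are not part of the notebook or the AIND client.

```python
from typing import Optional


def bucket_from_location(location: str) -> Optional[str]:
    """Mirror the pipeline's $split / $arrayElemAt expression in plain Python."""
    # "s3://bucket/prefix".split("/") gives ["s3:", "", "bucket", "prefix"],
    # so the bucket name sits at index 2.
    parts = location.split("/")
    # Return None when the string has fewer than three pieces
    return parts[2] if len(parts) > 2 else None


# Hypothetical location string, used only for illustration
example_location = "s3://aind-open-data/example_asset_name"
print(bucket_from_location(example_location))
```

Doing the split inside the pipeline keeps the returned documents small, since only the two projected fields come back from the database.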

## Convert the results to a pandas DataFrame, then group and sort

In [14]:
# Convert to DataFrame
df = pd.DataFrame(results)

# Count records for each (bucket, project) pair
grouped_df = df.groupby(['bucket', 'project']).size().reset_index(name='count')

# Sort buckets alphabetically and, within each bucket, projects by count (descending)
grouped_df = grouped_df.sort_values(['bucket', 'count'], ascending=[True, False])

# Use (bucket, project) as a hierarchical index for display
hierarchical_df = grouped_df.set_index(['bucket', 'project'])

display(hierarchical_df)

bucket,project,count
aind-ephys-data,Dynamic Routing,117
aind-ephys-data,Discovery-Neuromodulator circuit dynamics during foraging,95
aind-ephys-data,Ephys Platform,94
aind-ephys-data,Cell Type Lookup Table,87
aind-ephys-data,Cell Type LUT,24
aind-ephys-data,MRI-Guided Electrophysiology,4
aind-ephys-data,Thalamus in the middle,3
aind-ephys-data,MRI-Guided Elecrophysiology,2
aind-open-data,Thalamus in the middle,349
aind-open-data,MSMA Platform,193
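
The grouping and sorting step can be checked on a small synthetic frame without touching the database. The sketch below uses invented records (the project and bucket names are placeholders, not real AIND data) and orders buckets and counts with a single two-key `sort_values` call.

```python
import pandas as pd

# Synthetic records, invented for illustration only
records = [
    {"project": "Project A", "bucket": "bucket-1"},
    {"project": "Project B", "bucket": "bucket-1"},
    {"project": "Project B", "bucket": "bucket-1"},
    {"project": "Project A", "bucket": "bucket-2"},
]
toy_df = pd.DataFrame(records)

# Count rows per (bucket, project) pair
toy_counts = toy_df.groupby(["bucket", "project"]).size().reset_index(name="count")

# Buckets ascending, counts descending within each bucket
toy_counts = toy_counts.sort_values(["bucket", "count"], ascending=[True, False])

# Hierarchical (bucket, project) index for display
toy_hier = toy_counts.set_index(["bucket", "project"])
print(toy_hier)
```

Note that `groupby` drops rows whose key is missing by default, so records without a project name or a parsable bucket would not appear in the counts unless `dropna=False` is passed.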


## Build a Markdown table to paste into a GitHub comment

In [13]:
# Calculate bucket totals and sort buckets by size
bucket_totals = hierarchical_df.groupby('bucket')['count'].sum().sort_values(ascending=False)

# For each bucket, create and print a separate table
for bucket in bucket_totals.index:
    # Get the data for this bucket
    bucket_data = hierarchical_df.loc[bucket].sort_values('count', ascending=False)
    
    # Reset index to make the project name a column
    bucket_data = bucket_data.reset_index()
    
    # Print bucket header with total
    print(f"## {bucket}: {bucket_totals[bucket]:,} records\n")
    
    # Convert to markdown and print (to_markdown needs the optional tabulate package)
    print(bucket_data.to_markdown(index=False))
    print("\n")  # Add extra newline between tables

## aind-private-data-prod-o5171v: 17,228 records

| project                                                                                                                                  |   count |
|:-----------------------------------------------------------------------------------------------------------------------------------------|--------:|
| Dynamic Routing                                                                                                                          |    8641 |
| Behavior Platform                                                                                                                        |    3709 |
| Cognitive flexibility in patch foraging                                                                                                  |    1330 |
| Discovery-Neuromodulator circuit dynamics during foraging                                                                                |    1227 |
| Brain Computer Interface                  