# LSD data overview

### Overview:
  * most data is stored as downloadable files/folders in an Amazon S3 bucket
  * some data is stored in a precomputed format in a Google bucket
  * the s3 bucket contains about 1.7 TB of data, so check sizes before downloading the whole bucket
  * boto3 is a Python API for accessing s3 data. You can use it to find the size of a folder before committing to downloading it ([example](https://stackoverflow.com/questions/49759940/how-to-find-size-of-a-folder-inside-an-s3-bucket)). Examples for downloading some data are linked at the bottom of this cell
  * the s3 bucket contains a mixture of zarr, n5, nml, json and bson
  * google bucket volume metadata is stored in `info` files. zarr/n5 metadata is stored in attributes files (`.zattrs` for zarr, `attributes.json` for n5). To see offsets/shapes of cropped data (e.g. hemi roi 1), check these files
  * google bucket volumes are in xyz voxel space
  * zarr volumes are in zyx world space (i.e. nanometers)
  * n5 volumes are in xyz world space
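
Since the formats above differ in both axis order and units, it can help to convert explicitly when comparing coordinates. The following is a minimal sketch, assuming no additional volume offset; the helper names `world_zyx_to_voxel_xyz` and `voxel_xyz_to_world_zyx` are hypothetical and not part of any library. The example voxel size (20 nm in z, 9 nm in y and x) is read from the dimensions in the zebrafinch neuroglancer link in this notebook.

```python
import numpy as np

def world_zyx_to_voxel_xyz(point_zyx_nm, voxel_size_zyx_nm):
    """Convert a zyx world coordinate (nm) to an xyz voxel coordinate."""
    point = np.asarray(point_zyx_nm, dtype=np.int64)
    voxel_size = np.asarray(voxel_size_zyx_nm, dtype=np.int64)
    # divide by the voxel size to leave world space, then reverse zyx -> xyz
    return (point // voxel_size)[::-1]

def voxel_xyz_to_world_zyx(voxel_xyz, voxel_size_zyx_nm):
    """Convert an xyz voxel coordinate to a zyx world coordinate (nm)."""
    voxel = np.asarray(voxel_xyz, dtype=np.int64)
    voxel_size = np.asarray(voxel_size_zyx_nm, dtype=np.int64)
    # reverse xyz -> zyx, then scale by the voxel size
    return voxel[::-1] * voxel_size

# zebrafinch example: voxel size (z, y, x) = (20, 9, 9) nm
print(world_zyx_to_voxel_xyz((6000, 18000, 9000), (20, 9, 9)))
print(voxel_xyz_to_world_zyx((1000, 2000, 300), (20, 9, 9)))
```

If a volume has a non-zero offset in its metadata, that offset would need to be subtracted (or added) as well; this sketch leaves it out.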

---

### Key:
  * s3 = amazon s3 bucket
  * gb = google bucket

---

### Storage structure:

* Zebrafinch (**s3 and gb**) [view raw data](https://neuroglancer-demo.appspot.com/#!%7B%22dimensions%22:%7B%22x%22:%5B9e-9%2C%22m%22%5D%2C%22y%22:%5B9e-9%2C%22m%22%5D%2C%22z%22:%5B2e-8%2C%22m%22%5D%7D%2C%22position%22:%5B5500.7841796875%2C5398.7626953125%2C2948.97021484375%5D%2C%22crossSectionScale%22:28.722131709470613%2C%22projectionOrientation%22:%5B-0.17645424604415894%2C-0.3241989314556122%2C-0.0025745832826942205%2C0.9293827414512634%5D%2C%22projectionScale%22:18561.198026865284%2C%22layers%22:%5B%7B%22type%22:%22image%22%2C%22source%22:%22precomputed://gs://j0126-nature-methods-data/GgwKmcKgrcoNxJccKuGIzRnQqfit9hnfK1ctZzNbnuU/rawdata_realigned%22%2C%22tab%22:%22annotations%22%2C%22annotationColor%22:%22#0091ff%22%2C%22name%22:%22rawdata_realigned%22%7D%5D%2C%22selectedLayer%22:%7B%22layer%22:%22rawdata_realigned%22%7D%2C%22layout%22:%224panel%22%7D)

  * training (**s3**)
    * 33 **zarrs** with raw and label data

  * testing
    * raw (**gb-precomputed**)
    * neuropil mask (**s3-zarr**)
    * ground truth (**s3**)
      * testing (**50 skeletons**)
          * original (two versions - **nml** & **json**)
          * consolidated (cropped, masked, relabelled connected components) (**bson**)
            * nodes (total raw roi)
            * edges (total raw roi)
            * masks for each sub roi
            * connected components for each sub roi
      * validation (**12 skeletons**)
          * original (**nml**)
          * consolidated (**bson**)
            * same structure as testing

    * segmentations
      * full ffn seg (**gb-precomputed**) (total raw roi)
      * arrays (**s3-zarr-volumes**)
        * supervoxels for each affinity-based network on the benchmark roi
        * cropped/masked/relabelled ffn segmentation for each sub roi
      * graphs (**s3-zarr-rags**)
        * rags (region adjacency graphs) for each affinity-based network on benchmark roi (**bson**)
          * nodes
          * edges

* Hemi-Brain (**s3**) [view raw data](https://neuroglancer-demo.appspot.com/#!%7B%22dimensions%22:%7B%22x%22:%5B8e-9%2C%22m%22%5D%2C%22y%22:%5B8e-9%2C%22m%22%5D%2C%22z%22:%5B8e-9%2C%22m%22%5D%7D%2C%22position%22:%5B17137.673828125%2C20718.560546875%2C19731.599609375%5D%2C%22crossSectionScale%22:109.75827448209598%2C%22projectionOrientation%22:%5B-0.1407826691865921%2C-0.3292594850063324%2C0.0014168535126373172%2C0.9336843490600586%5D%2C%22projectionScale%22:65536%2C%22layers%22:%5B%7B%22type%22:%22image%22%2C%22source%22:%22precomputed://gs://neuroglancer-janelia-flyem-hemibrain/emdata/clahe_yz/jpeg%22%2C%22tab%22:%22source%22%2C%22annotationColor%22:%22#00aaff%22%2C%22name%22:%22raw%22%7D%5D%2C%22selectedLayer%22:%7B%22layer%22:%22raw%22%7D%2C%22layout%22:%224panel%22%7D)

  * training
    * 8 **zarrs** with raw and label data

  * testing
    * ground truth (**s3-zarr-volumes**)
      * EB mask
      * three rois
        * raw
        * labels (dense, cropped to roi)
        * consolidated labels (filtered, masked to neuropil, eroded boundaries, relabelled connected components)
    * segmentations
      * arrays (**s3-zarr-volumes**)
        * supervoxels for each affinity-based network on 3 rois
        * ffn segmentation for each sub roi
        * cropped/masked/relabelled ffn segmentation for each sub roi
      * graphs (**s3-zarr-rags**)
        * rags (region adjacency graphs) for each affinity-based network on sub rois (**bson**)


* FIB-25 (**s3**) [view raw data](https://neuroglancer-demo.appspot.com/#!%7B%22dimensions%22:%7B%22x%22:%5B8e-9%2C%22m%22%5D%2C%22y%22:%5B8e-9%2C%22m%22%5D%2C%22z%22:%5B8e-9%2C%22m%22%5D%7D%2C%22position%22:%5B3326.20947265625%2C3379.239013671875%2C4060.844482421875%5D%2C%22crossSectionScale%22:19.54375528952488%2C%22projectionOrientation%22:%5B-0.21999090909957886%2C-0.4366961419582367%2C-0.12671823799610138%2C0.8630428314208984%5D%2C%22projectionScale%22:15850.514470059576%2C%22layers%22:%5B%7B%22type%22:%22image%22%2C%22source%22:%22precomputed://gs://neuroglancer-public-data/flyem_fib-25/image%22%2C%22tab%22:%22annotations%22%2C%22annotationColor%22:%22#009dff%22%2C%22name%22:%22raw%22%7D%5D%2C%22selectedLayer%22:%7B%22layer%22:%22raw%22%7D%2C%22layout%22:%224panel%22%7D)

  * training
    * 4 **zarrs** with raw and label data

  * testing
    * ground truth (**s3-n5**)
      * raw
      * neuropil mask
      * labels
      * cropped/relabelled ids for two subrois
    * segmentations
      * arrays (**s3-zarr-volumes**)
        * supervoxels for each affinity-based network
        * full ffn segmentation
        * cropped/relabelled ffn segmentation for two subrois
      * graphs (**s3-zarr-rags**)
        * rags (region adjacency graphs) for each affinity-based network on full roi (**bson**)

---

### Examples:

**make sure to install / import packages in the next two cells first**

- [fetch/view from google bucket](#fetch-gb)
- [connect to s3 bucket](#s3-connect)
- [fetch/view s3 array data](#s3-array)
- [fetch/view s3 graph data](#s3-graph)




In [None]:
!pip install boto3
!pip install cloud-volume
!pip install matplotlib

# pin the numpy version: cloud-volume otherwise seems to trigger a
# numpy C header error
!pip install numpy==1.20.0

!pip install pandas
!pip install plotly
!pip install requests
!pip install scikit-image
!pip install zarr

# colab ships its own pre-installed versions of some packages, so the
# runtime must be restarted to pick up the versions installed above.
# there will be warnings, but the runtime restart should resolve them.
# dismiss the popup that appears in the bottom left corner,
# then run the following cells
import os
os.kill(os.getpid(), 9)

In [None]:
import boto3
import bson
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import plotly.express as px
import zarr
from cloudvolume import CloudVolume

<a name="fetch-gb"></a>
# fetch/view google bucket data

In [None]:
raw_vol = CloudVolume(
            "https://storage.googleapis.com/j0126-nature-methods-data/GgwKmcKgrcoNxJccKuGIzRnQqfit9hnfK1ctZzNbnuU/rawdata_realigned",
            bounded=True,
            progress=True)

seg_vol = CloudVolume(
            "https://storage.googleapis.com/j0126-nature-methods-data/GgwKmcKgrcoNxJccKuGIzRnQqfit9hnfK1ctZzNbnuU/ffn_segmentation",
            bounded=True,
            progress=True)

In [None]:
# cloud data shape is stored as x,y,z,channel
print(raw_vol.shape, seg_vol.shape)

In [None]:
# view metadata
seg_vol.info
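
The `info` metadata is a dictionary with one entry per resolution level under `scales`. As a rough sketch of how to pull out the fields relevant for offsets and shapes, the hypothetical helper `summarize_scales` below reads the `key`, `resolution`, `size` and `voxel_offset` fields of each scale. The `example_info` dictionary is made up for illustration and is not the real metadata of either volume.

```python
def summarize_scales(info):
    """Return (key, resolution, size, voxel_offset) for each scale in an info dict."""
    summary = []
    for scale in info.get("scales", []):
        summary.append((
            scale.get("key"),
            scale.get("resolution"),
            scale.get("size"),
            # voxel_offset is optional, so fall back to the origin
            scale.get("voxel_offset", [0, 0, 0]),
        ))
    return summary

# hypothetical info dict, for illustration only
example_info = {
    "data_type": "uint64",
    "num_channels": 1,
    "type": "segmentation",
    "scales": [
        {"key": "9_9_20", "resolution": [9, 9, 20], "size": [100, 100, 50]},
    ],
}
print(summarize_scales(example_info))
```

With the volumes loaded above, the equivalent call would be `summarize_scales(seg_vol.info)`.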

In [None]:
# function to slice a cloud volume and convert it to a 2d numpy array for viewing
def cloud_to_np(vol, x0, x1, y0, y1, z0, z1):

    # ensure there is a voxel offset in the metadata so cloudvolume is happy
    for scale in vol.info['scales']:
        scale['voxel_offset'] = [0, 0, 0]

    # slice data (x,y,z,channel)
    data = vol[x0:x1, y0:y1, z0:z1]

    # drop the channel axis and transpose to (z,y,x)
    data = np.transpose(data[..., 0], [2, 1, 0])

    # remove z dim so we can view as 2d image
    return np.squeeze(data)

In [None]:
# get an arbitrary 1000x1000 voxel patch from a single z section
raw_data = cloud_to_np(raw_vol,1000,2000,2000,3000,300,301)
seg_data = cloud_to_np(seg_vol,1000,2000,2000,3000,300,301)
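
To make the axis handling in `cloud_to_np` easier to follow, here is a small sketch of the same transpose and squeeze steps applied to a plain numpy array standing in for a CloudVolume cutout. The array `cutout` and its values are made up for illustration.

```python
import numpy as np

# stand-in for a CloudVolume cutout: shape (x, y, z, channel)
cutout = np.zeros((4, 3, 1, 1), dtype=np.uint8)
cutout[2, 1, 0, 0] = 255  # mark the voxel at x=2, y=1

# drop the channel axis, reorder to (z, y, x), then drop the singleton z axis
image = np.squeeze(np.transpose(cutout[..., 0], [2, 1, 0]))

# rows are y and columns are x, which is what matplotlib expects
print(image.shape)
print(image[1, 2])
```

Note that `np.squeeze` removes every axis of length 1, so a patch that is a single voxel wide in x or y would lose that axis as well.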

In [None]:
# utility function for viewing unique labels: maps each label id
# to a random RGBA color
def create_lut(labels):

    max_label = np.max(labels)

    # one random RGB color per label id
    lut = np.random.randint(
            low=0,
            high=255,
            size=(int(max_label) + 1, 3),
            dtype=np.uint8)

    # append a fully opaque alpha channel
    lut = np.append(
            lut,
            np.zeros(
                (int(max_label) + 1, 1),
                dtype=np.uint8) + 255,
            axis=1)

    # make the background (label 0) transparent
    lut[0] = 0
    colored_labels = lut[labels]

    return colored_labels

In [None]:
# view the data
plt.imshow(raw_data, cmap='gray')
plt.imshow(create_lut(seg_data), alpha=0.5)
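
`create_lut` allocates one table row per id up to the largest label, so a patch containing very large ids needs a correspondingly large table. One possible workaround, sketched here under the assumption that only the distinction between labels matters for viewing, is to relabel the patch to small contiguous ids first. The helper name `relabel_contiguous` is hypothetical.

```python
import numpy as np

def relabel_contiguous(labels):
    """Map arbitrary label ids to small contiguous ids, keeping 0 as background."""
    unique_ids, inverse = np.unique(labels, return_inverse=True)
    # reshape explicitly, since the shape of `inverse` differs between numpy versions
    inverse = inverse.reshape(labels.shape)
    if unique_ids[0] != 0:
        # no background present: shift so that no real label lands on 0
        inverse = inverse + 1
    return inverse

# made-up patch with one very large id
labels = np.array([[0, 77322656252], [5, 77322656252]], dtype=np.uint64)
print(relabel_contiguous(labels))
```

The result could then be passed to the lookup table, for example `create_lut(relabel_contiguous(seg_data))`.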

<a name="s3-connect"></a>
# connect to s3 bucket

In [None]:
# set bucket credentials
access_key = 'AKIA4XXGEV6ZQOTMTHX6'
secret_key = '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'
bucket = 'open-neurodata'

In [None]:
# connect to client
client = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)

In [None]:
# list data (a single call returns at most 1000 keys)
client.list_objects(Bucket=bucket, Prefix="funke")
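
To check how large a folder is before downloading it, the object sizes under a prefix can be summed. This is a minimal sketch using the boto3 `list_objects_v2` paginator so that prefixes with more than 1000 objects are counted in full; the function name `prefix_size_bytes` is hypothetical, and the usage lines assume the `client` and `bucket` variables defined in the cells above.

```python
def prefix_size_bytes(client, bucket, prefix):
    """Sum the sizes of all objects under a prefix, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # pages without matching objects have no "Contents" entry
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

# usage with the client created above (requires network access):
# size = prefix_size_bytes(client, bucket, "funke/fib25/training")
# print(f"{size / 1e9:.2f} GB")
```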

In [None]:
# download directory structure file - this shows exactly how the s3 data is stored
client.download_file(
    Bucket=bucket,
    Key="funke/structure.md",
    Filename="structure.md")

In [None]:
ls

In [None]:
less structure.md

<a name="s3-array"></a>
# fetch/view s3 array data

In [None]:
# function to download all files nested in a bucket path
def downloadDirectory(
    bucket_name,
    path,
    access_key,
    secret_key):

    resource = boto3.resource(
        's3',
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key)

    bucket = resource.Bucket(bucket_name)

    for obj in bucket.objects.filter(Prefix=path):
        key = obj.key

        # skip "directory" placeholder keys, which have nothing to download
        if key.endswith('/'):
            continue

        # recreate the bucket's folder structure locally
        os.makedirs(os.path.dirname(key) or '.', exist_ok=True)

        print(f'Downloading {key}')
        bucket.download_file(key, key)

In [None]:
# download example fib25 training data
downloadDirectory(
    bucket,
    'funke/fib25/training/trvol-250-1.zarr',
    access_key,
    secret_key)

In [None]:
ls

In [None]:
ls funke/fib25/training/trvol-250-1.zarr/volumes/

In [None]:
# load zarr file
f = zarr.open('funke/fib25/training/trvol-250-1.zarr')

In [None]:
# load data into numpy arrays
raw = f['volumes/raw'][:]
labels = f['volumes/labels/neuron_ids'][:]

In [None]:
# view offset, resolution (z,y,x)
print(f['volumes/raw'].attrs['offset'])
print(f['volumes/raw'].attrs['resolution'])
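
Together with the array shape, the offset and resolution give the region a volume covers in world units. A small sketch, assuming all three are given in zyx order; the helper name `world_roi` and the example numbers are made up for illustration.

```python
import numpy as np

def world_roi(offset_nm, resolution_nm, shape_voxels):
    """Return (begin, end) of a volume in world units, all in zyx order."""
    begin = np.asarray(offset_nm, dtype=np.int64)
    resolution = np.asarray(resolution_nm, dtype=np.int64)
    shape = np.asarray(shape_voxels, dtype=np.int64)
    # the extent in nm is the voxel count times the voxel size, per axis
    return begin, begin + shape * resolution

# hypothetical values; use the attrs and shape of the real array instead
begin, end = world_roi(
    offset_nm=(800, 0, 1600),
    resolution_nm=(8, 8, 8),
    shape_voxels=(250, 250, 250))
print(begin, end)
```

For the array opened above, the inputs would be `f['volumes/raw'].attrs['offset']`, `f['volumes/raw'].attrs['resolution']` and `f['volumes/raw'].shape`.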

In [None]:
# view raw data
raw

In [None]:
# view shapes (voxels)
print(raw.shape, labels.shape)

In [None]:
# slice first section, squeeze z axis for viewing in matplotlib
raw_0 = np.squeeze(raw[0:1, :, :])
labels_0 = np.squeeze(labels[0:1, :, :])

In [None]:
# show data
plt.imshow(raw_0, cmap='gray')
plt.imshow(create_lut(labels_0), alpha=0.5)

<a name="s3-graph"></a>
# fetch/view s3 graph data

In [None]:
# download example hemi region adjacency graph
downloadDirectory(
    bucket,
    'funke/hemi/testing/segmentations/data.zarr/rags/ACLSD/hemi_affs_from_lsd_200k_roi_1',
    access_key,
    secret_key)

In [None]:
ls funke/hemi/testing/segmentations/data.zarr/rags/ACLSD/hemi_affs_from_lsd_200k_roi_1

In [None]:
# function to create pandas dataframe from bson data
def create_df(bson_file):

  with open(bson_file, 'rb') as f:
    data = bson.decode_all(f.read())

  df = pd.DataFrame(data)
  del df['_id']

  return df

In [None]:
# create nodes dataframe
nodes = create_df('funke/hemi/testing/segmentations/data.zarr/rags/ACLSD/hemi_affs_from_lsd_200k_roi_1/nodes.bson')

In [None]:
# view nodes - coordinates are in world units, divide by voxel size (8,8,8)
# to see voxel space
nodes

In [None]:
# randomly sample some points
sample = nodes.sample(10000)

In [None]:
# view nodes - color mapping allows us to see block boundaries, since
# unique node ids are incremented spatially with respect to a block.
# the non-uniform cube is due to masking an axon tract along the edge
fig = px.scatter_3d(
    sample,
    x='center_x',
    y='center_y',
    z='center_z',
    color='id',
    color_continuous_scale=px.colors.sequential.Jet)
fig.show()

In [None]:
# get the edges
edges = create_df('funke/hemi/testing/segmentations/data.zarr/rags/ACLSD/hemi_affs_from_lsd_200k_roi_1/edges_hist_quant_75.bson')

In [None]:
# the edges map between nodes (u=source, v=target) and have a merge score
# based on underlying affinity values. this merge score determines when the nodes
# would become merged (edges with lower scores are merged earlier)
edges
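
To illustrate how merge scores turn a region adjacency graph into segments, the sketch below merges every pair of nodes joined by an edge whose score is at or below a threshold, using a simple union-find. This is a simplified illustration that treats the stored scores as fixed; it is not the agglomeration code used to produce the data. The function name `merge_below_threshold` and the example edges are hypothetical.

```python
def merge_below_threshold(edges, threshold):
    """Return a node -> segment id mapping after merging edges with score <= threshold.

    `edges` is an iterable of (u, v, score) tuples.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            # path halving keeps the trees shallow
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # visit edges from lowest to highest score, so early merges come first
    for u, v, score in sorted(edges, key=lambda edge: edge[2]):
        root_u, root_v = find(u), find(v)
        if score <= threshold and root_u != root_v:
            parent[root_v] = root_u

    return {node: find(node) for node in parent}

# made-up edges: (source, target, merge score)
example_edges = [(1, 2, 0.1), (2, 3, 0.9), (4, 5, 0.2)]
segments = merge_below_threshold(example_edges, threshold=0.5)
print(segments)
```

To run this on the dataframe above, the rows would need to be passed as `(u, v, score)` tuples; the name of the score column is not shown here, so check the dataframe's columns first.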

In [None]:
# find first edge source location
nodes.loc[nodes['id'] == 77322656252]

In [None]:
# download example zebrafinch validation skeletons
downloadDirectory(
    bucket,
    'funke/zebrafinch/testing/ground_truth/validation/consolidated/zebrafinch_gt_skeletons_new_gt_9_9_20_validation',
    access_key,
    secret_key)

In [None]:
# get the validation skeleton nodes
val_nodes = create_df('funke/zebrafinch/testing/ground_truth/validation/consolidated/zebrafinch_gt_skeletons_new_gt_9_9_20_validation/zebrafinch.nodes.bson')

In [None]:
# sample the nodes
val_nodes_sample = val_nodes.sample(20000)

In [None]:
# default zoomed in camera
camera = dict(
    eye=dict(x=0.2, y=0.2, z=0.2)
)

# color by neuron id
fig = px.scatter_3d(
    val_nodes_sample,
    x='x',
    y='y',
    z='z',
    color='neuron_id',
    color_continuous_scale=px.colors.sequential.Jet)
fig.update_layout(scene_camera=camera)
fig.show()

In [None]:
# check unique neurons
print(val_nodes['neuron_id'].unique())

In [None]:
# view an example neuron
val_nodes_single = val_nodes.loc[val_nodes['neuron_id']==1]

fig = px.scatter_3d(
    val_nodes_single,
    x='x',
    y='y',
    z='z')
fig.show()