# Intake for Bluesky

## Acquire some sample data

For data acquisition (but not for data access!) we assume that we have direct access to MongoDB (or some message queue that has a sink into MongoDB).

In [1]:
from bluesky import RunEngine
from intake_bluesky import MongoInsertCallback
from bluesky.plans import scan
from bluesky.preprocessors import SupplementalData
from ophyd.sim import det, motor

RE = RunEngine({})
sd = SupplementalData(baseline=[motor])
RE.preprocessors.append(sd)

# This is just a simple callback that does MongoDB insert_one. No databroker.
uri = 'mongodb://localhost:27017/test1'
insert = MongoInsertCallback(uri)
RE.subscribe(insert)


uid, = RE(scan([det], motor, -1, 1, 20))
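To make the role of the insert callback concrete, here is a minimal sketch of a callback with the same calling convention. The RunEngine calls each subscriber as `callback(name, doc)`. The `CollectDocuments` class and the hand-written documents below are hypothetical stand-ins: this is not how `MongoInsertCallback` is implemented, and it stores documents in memory instead of MongoDB.

```python
from collections import defaultdict


class CollectDocuments:
    """Toy stand-in for an insert callback: file each document under its name."""

    def __init__(self):
        # One "collection" (here, a plain list) per document name.
        self.collections = defaultdict(list)

    def __call__(self, name, doc):
        # A database-backed callback would insert `doc` into a collection here.
        self.collections[name].append(dict(doc))


collector = CollectDocuments()

# Hand-written stand-ins for the documents a run emits (not real output).
collector("start", {"uid": "run-1", "plan_name": "scan"})
collector("event", {"seq_num": 1, "data": {"det": 0.6}})
collector("event", {"seq_num": 2, "data": {"det": 0.7}})
collector("stop", {"run_start": "run-1", "exit_status": "success"})

print({name: len(docs) for name, docs in collector.collections.items()})
```

Because it is an ordinary callable taking `(name, doc)`, an instance like this could be passed to `RE.subscribe` in the same way as `insert` above.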

## Access data using intake

We could use intake to access the data _directly_ like this, though we will probably never do so at NSLS-II.

In [2]:
from intake_bluesky import MongoMetadataStoreCatalog

mds = MongoMetadataStoreCatalog(uri)

Instead we will access data through an HTTP service. We will start an intake server like this:

```
intake-server facility_catalog.yml
```

where `facility_catalog.yml` encodes the MongoDB ``uri`` above, and potentially many such URIs.

In [3]:
import intake

facility_catalog = intake.Catalog("intake://localhost:5000", page_size=100)
facility_catalog

<Intake catalog: None>

A Catalog contains entries, which we can access by iteration:

```
for entry in catalog:
    ...
```

or individually by name:

```
entry = catalog[entry_name]
```

For small Catalogs, it is convenient to ``list`` their contents.

In [4]:
list(facility_catalog)

['xyz']

The ``facility_catalog`` contains a catalog for each beamline. Let's access the ``xyz`` entry, which is also a Catalog.

In [5]:
cat = facility_catalog['xyz']()
cat

<Intake catalog: xyz>

Each entry in this Catalog represents one scan. There are too many to list them all. (We could _try_ but it would take a long time and probably run out of memory.)
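If we only want a quick look at a few entry names, one option is to stop iterating early instead of building the full list. The sketch below uses a hypothetical generator, `entry_names`, as a stand-in for a very large catalog; it assumes that iterating the catalog yields names lazily, which is not verified here.

```python
from itertools import islice


def entry_names():
    """Hypothetical stand-in for iterating a huge catalog: yields names lazily."""
    n = 0
    while True:
        yield f"scan-{n}"
        n += 1


# Take only the first three names; the generator is never exhausted.
first_three = list(islice(entry_names(), 3))
print(first_three)
```

With a real catalog the analogous call would be `list(islice(cat, 3))`.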

We can find scans of interest in a couple of ways.

## Progressive Search

We can search ``cat`` by passing it a MongoDB query. The result is another Catalog containing the subset of entries in ``cat`` that match the query.

In [6]:
search_results = cat.search({'plan_name': 'scan'})
search_results

<Intake catalog: None>

We can search progressively: searching the results generates yet another Catalog, narrowed further.

In [7]:
import time
recent_counts = search_results.search({'time': {'$gt': time.time() - 60 * 60 * 24}})
recent_counts

<Intake catalog: None>
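The narrowing behavior can be illustrated with a toy model in plain Python. The `matches` and `search` helpers below are hypothetical and handle only equality and the `$gt` operator; they are not how intake or MongoDB evaluate queries. The sketch shows that searching the results of a search gives the same entries as one query containing both conditions.

```python
import time


def matches(doc, query):
    """Return True if `doc` satisfies a tiny subset of MongoDB query syntax."""
    for key, condition in query.items():
        value = doc.get(key)
        if isinstance(condition, dict):
            # Only the `$gt` operator is modeled in this sketch.
            for op, bound in condition.items():
                if op != "$gt":
                    raise NotImplementedError(op)
                if value is None or not value > bound:
                    return False
        elif value != condition:
            return False
    return True


def search(entries, query):
    """Return the subset of `entries` (name -> start document) matching `query`."""
    return {name: doc for name, doc in entries.items() if matches(doc, query)}


now = time.time()
day = 60 * 60 * 24
entries = {
    "a": {"plan_name": "scan", "time": now - 60},        # recent scan
    "b": {"plan_name": "scan", "time": now - 3 * day},   # old scan
    "c": {"plan_name": "count", "time": now - 60},       # recent count
}

# Two successive searches...
scans = search(entries, {"plan_name": "scan"})
recent_scans = search(scans, {"time": {"$gt": now - day}})

# ...select the same entries as one query with both conditions.
combined = search(entries, {"plan_name": "scan", "time": {"$gt": now - day}})

print(sorted(recent_scans))
print(recent_scans == combined)
```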

Having narrowed the results to a small Catalog, we can list them.

In [8]:
list(recent_counts)

['446be272-9bb3-4480-90e8-a0899587fa65',
 'bb640ef9-b3e8-443c-b4b1-6f98dc5af7a0',
 'a2c1e7c3-e194-45cc-b853-fd3089d0782c',
 '5f6a3669-8724-4673-9d45-d7be80c40522']

## Random access

We can access an entry by name. Each entry's name is the unique ID of its scan:

In [9]:
cat['bb640ef9-b3e8-443c-b4b1-6f98dc5af7a0']

<Catalog Entry: bb640ef9-b3e8-443c-b4b1-6f98dc5af7a0>

We can also access entries by *recency* with this syntactic sugar:

In [10]:
entry = recent_counts[-1]
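One way to picture integer lookup is as indexing into the entries sorted by start time, so that `-1` is the most recent scan and `-2` the one before it. The helper `by_recency` and the entry names below are hypothetical, and the sort-by-time interpretation is an assumption about what the sugar does, not a description of the library's implementation.

```python
def by_recency(entries, index):
    """Return the entry name at `index` after sorting by start time (oldest first)."""
    ordered = sorted(entries, key=lambda name: entries[name]["time"])
    return ordered[index]


# Hypothetical entries: name -> start document with a `time` field.
entries = {
    "run-a": {"time": 100.0},
    "run-b": {"time": 300.0},
    "run-c": {"time": 200.0},
}

latest = by_recency(entries, -1)
previous = by_recency(entries, -2)
print(latest, previous)
```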

## Metadata

The entry's metadata is available via ``entry.metadata``, which is a dictionary. Notice that it includes ``entry.metadata['start']`` and ``entry.metadata['stop']``, the documents generated at the beginning and end of the corresponding scan.

In [11]:
entry.metadata

{'cache': None,
 'catalog_dir': None,
 'start': {'uid': '446be272-9bb3-4480-90e8-a0899587fa65',
  'time': 1543877944.73986,
  'scan_id': 1,
  'plan_type': 'generator',
  'plan_name': 'scan',
  'detectors': ['det'],
  'motors': ['motor'],
  'num_points': 20,
  'num_intervals': 19,
  'plan_args': {'detectors': ["SynGauss(name='det', value=1.0, timestamp=1543877944.721153)"],
   'num': 20,
   'args': ["SynAxis(prefix='', name='motor', read_attrs=['readback', 'setpoint'], configuration_attrs=['velocity', 'acceleration'])",
    -1,
    1],
   'per_step': 'None'},
  'hints': {'dimensions': [[['motor'], 'primary']]},
  'plan_pattern': 'inner_product',
  'plan_pattern_module': 'bluesky.plan_patterns',
  'plan_pattern_args': {'num': 20,
   'args': ["SynAxis(prefix='', name='motor', read_attrs=['readback', 'setpoint'], configuration_attrs=['velocity', 'acceleration'])",
    -1,
    1]}},
 'stop': {'run_start': '446be272-9bb3-4480-90e8-a0899587fa65',
  'time': 1543877944.801445,
  'uid': 'eeb566f
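As a small example of working with these two documents, the sketch below links the stop document back to its start document and computes how long the scan took. The dictionary is built by hand from the `uid`, `run_start`, and `time` values shown in the output above, with all other keys omitted.

```python
from datetime import datetime, timezone

# Values copied from the metadata output above; other keys omitted.
metadata = {
    "start": {
        "uid": "446be272-9bb3-4480-90e8-a0899587fa65",
        "time": 1543877944.73986,
    },
    "stop": {
        "run_start": "446be272-9bb3-4480-90e8-a0899587fa65",
        "time": 1543877944.801445,
    },
}

start_doc, stop_doc = metadata["start"], metadata["stop"]

# The stop document refers back to its start document by uid.
same_run = stop_doc["run_start"] == start_doc["uid"]

# Times are seconds since the Unix epoch, so subtraction gives a duration.
duration = stop_doc["time"] - start_doc["time"]
started_at = datetime.fromtimestamp(start_doc["time"], tz=timezone.utc)

print(f"same run: {same_run}")
print(f"duration: {duration:.3f} s, started at {started_at.isoformat()}")
```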

## Accessing Data

We can pull the data all at once:

In [12]:
# entry.read()  # an xarray of numpy.arrays -- BROKEN

Or in chunks:

In [13]:
# entry.read_chunked()  # a generator of xarrays of numpy arrays -- BROKEN

Or lazily, using dask:

In [14]:
entry.to_dask()  # an xarray of dask.arrays

<xarray.Dataset>
Dimensions:         (time: 22)
Coordinates:
  * time            (time) float64 1.544e+09 1.544e+09 ... 1.544e+09 1.544e+09
Data variables:
    det             (time) float64 dask.array<shape=(22,), chunksize=(22,)>
    motor           (time) float64 dask.array<shape=(22,), chunksize=(22,)>
    motor_setpoint  (time) float64 dask.array<shape=(22,), chunksize=(22,)>

The result above is lazy: the dataset holds dask arrays rather than the data itself. Dask calls the server to pull the data only when it is required, for example when we convert the data to a ``pandas.DataFrame``.

In [15]:
entry.to_dask().to_dataframe()

| time | det | motor | motor_setpoint |
|---|---|---|---|
| 1543878000.0 | NaN | 0.0 | 0.0 |
| 1543878000.0 | 0.606531 | -1.0 | -1.0 |
| 1543878000.0 | 0.670134 | -0.894737 | -0.894737 |
| 1543878000.0 | 0.732249 | -0.789474 | -0.789474 |
| 1543878000.0 | 0.791305 | -0.684211 | -0.684211 |
| 1543878000.0 | 0.8457 | -0.578947 | -0.578947 |
| 1543878000.0 | 0.893876 | -0.473684 | -0.473684 |
| 1543878000.0 | 0.934385 | -0.368421 | -0.368421 |
| 1543878000.0 | 0.965967 | -0.263158 | -0.263158 |
| 1543878000.0 | 0.987612 | -0.157895 | -0.157895 |
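As a sanity check on these numbers, the scan rows can be reproduced by hand. The motor positions are 20 evenly spaced points from -1 to 1, as requested by `scan([det], motor, -1, 1, 20)`. The `det` values are consistent with a Gaussian of unit height and unit width centered at 0; that functional form is inferred from the values shown, not taken from the simulated detector's source, so treat the `gaussian` helper below as an assumption.

```python
import math

num_points = 20
first, last = -1.0, 1.0

# 20 points means 19 intervals between them.
step = (last - first) / (num_points - 1)
positions = [first + i * step for i in range(num_points)]


def gaussian(x, center=0.0, sigma=1.0, peak=1.0):
    """Assumed detector response: a Gaussian in the motor position."""
    return peak * math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


# Print the first few (motor, det) pairs for comparison with the rows above.
for x in positions[:3]:
    print(f"{x:.6f}  {gaussian(x):.6f}")
```

The row with no `det` value and a motor position of 0.0 is not part of the scan itself, which is why it does not appear in this calculation.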
