# Intake for Bluesky

Intake has a concept of a `Catalog` whose entries may be other Catalogs or a `DataSource`. A `DataSource` can be `read()` into a PyData/SciPy data structure, in whole or in chunks, or into its lazy dask-based counterpart.
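The nesting described above can be pictured with a minimal pure-Python sketch. `ToyCatalog` and `ToySource` are hypothetical stand-ins invented for illustration; they are not intake classes and only mimic the idea that catalogs hold either further catalogs or readable sources.

```python
class ToySource:
    """Hypothetical stand-in for a data source: holds rows and reads them back."""

    def __init__(self, rows):
        self._rows = list(rows)

    def read(self):
        # Return everything at once.
        return list(self._rows)


class ToyCatalog:
    """Hypothetical stand-in for a Catalog: a mapping of names to entries."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def __iter__(self):
        # Iterating a catalog yields the names of its entries.
        return iter(self._entries)

    def __getitem__(self, name):
        return self._entries[name]


# Catalogs nest: a top-level catalog holds a run, which holds a stream,
# which holds a readable field.
primary = ToyCatalog({"det": ToySource([0.61, 0.67, 0.73])})
run = ToyCatalog({"primary": primary})
top = ToyCatalog({"some-uid": run})

values = top["some-uid"]["primary"]["det"].read()
print(list(top), list(run), values)
```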

Intake includes:
* authentication
* caching
* an intake server and client
* solutions for packaging so that Catalogs can be installable and accessible via import hooks (`from intake import csx_catalog`)

The demo below employs intake plugins for access and a simple callback using pymongo directly for insert. It does not import `databroker`.

## Acquire some sample data.

In [1]:
from bluesky import RunEngine
from intake_bluesky import MongoInsertCallback
from bluesky.plans import scan
from bluesky.preprocessors import SupplementalData
from ophyd.sim import det, motor

RE = RunEngine({})
sd = SupplementalData(baseline=[motor])
RE.preprocessors.append(sd)

# This is just a simple callback that does MongoDB insert_one. No databroker.
uri = 'mongodb://localhost:27017/test1'
insert = MongoInsertCallback(uri)
RE.subscribe(insert)


uid, = RE(scan([det], motor, -1, 1, 20))
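For readers without a MongoDB at hand, the shape of such a callback can be sketched in memory. A RunEngine subscriber is called with a `(name, doc)` pair for each document; the class below, `InMemoryInsertCallback`, is a hypothetical stand-in that appends to a dict where a real callback would call `insert_one`. The documents are hand-written for illustration and are far smaller than real ones. How `MongoInsertCallback` is actually implemented is not shown here.

```python
from collections import defaultdict


class InMemoryInsertCallback:
    """Hypothetical stand-in for an insert callback.

    Stores each document in a per-name list instead of a MongoDB collection.
    """

    def __init__(self):
        # One "collection" per document name (start, event, stop, ...).
        self.collections = defaultdict(list)

    def __call__(self, name, doc):
        # Copy the document so later mutation by the caller does not leak in.
        self.collections[name].append(dict(doc))


insert = InMemoryInsertCallback()

# Hand-written documents standing in for what a RunEngine would emit.
insert("start", {"uid": "abc", "plan_name": "scan", "time": 0.0})
insert("event", {"seq_num": 1, "data": {"det": 0.61}, "time": 0.1})
insert("event", {"seq_num": 2, "data": {"det": 0.67}, "time": 0.2})
insert("stop", {"run_start": "abc", "exit_status": "success", "time": 0.3})

counts = {name: len(docs) for name, docs in insert.collections.items()}
print(counts)
```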

## Access data using intake.

Instantiate an intake Catalog aimed at our MongoDB. (This boilerplate code could be made more magical via config files and import hooks; this is the explicit way.)

In [2]:
from intake_bluesky import MongoMetadataStoreCatalog

mds = MongoMetadataStoreCatalog(uri)

In [3]:
mds

<Intake catalog: mongodb://localhost:27017/test1>

Access a Run by `uid`. Indexing returns a catalog entry; calling the entry gives the Run, which is itself a Catalog with a special `__repr__`.

In [4]:
run = mds[uid]
run

<Catalog Entry: daf14422-9b68-4027-9a1f-fefa9a3bc2d8>

In [5]:
run()

<Intake catalog: Run daf14422...>
  2018-11-26 14:43:29.319 -- 2018-11-26 14:43:29.375
  Streams:
    * baseline
    * primary

Read the data from all the streams in one structure, time-sorted. This is a convenient starting point for interpolation workflows.

In [6]:
run.read()

<xarray.Dataset>
Dimensions:         (time: 22)
Coordinates:
  * time            (time) float64 1.543e+09 1.543e+09 ... 1.543e+09 1.543e+09
Data variables:
    motor           (time) float64 0.0 -1.0 -0.8947 -0.7895 ... 0.8947 1.0 1.0
    motor_setpoint  (time) float64 0.0 -1.0 -0.8947 -0.7895 ... 0.8947 1.0 1.0
    det             (time) float64 nan 0.6065 0.6701 ... 0.6701 0.6065 nan
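The `nan` values at the ends of `det` above arise because the baseline stream records `motor` but not `det`. The sketch below illustrates that shape with a time-ordered merge of two hand-written streams. It is only an illustration of the resulting structure, not the way `run.read()` is implemented; the event tuples and values are made up.

```python
import heapq
import math

# Hypothetical events as (time, {field: value}) pairs, sorted within each stream.
baseline = [(0.0, {"motor": 0.0}), (3.0, {"motor": 1.0})]
primary = [
    (1.0, {"motor": -1.0, "det": 0.61}),
    (2.0, {"motor": 1.0, "det": 0.61}),
]

fields = ["motor", "det"]
merged = []
# heapq.merge interleaves already-sorted inputs by the given key (the timestamp).
for t, data in heapq.merge(baseline, primary, key=lambda event: event[0]):
    # Fields missing from a stream are filled with NaN.
    row = {name: data.get(name, math.nan) for name in fields}
    merged.append((t, row))

for t, row in merged:
    print(t, row)
```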

In [7]:
run.read().to_dataframe().head()

| time | motor | motor_setpoint | det |
|---|---|---|---|
| 1543261000.0 | 0.0 | 0.0 | NaN |
| 1543261000.0 | -1.0 | -1.0 | 0.606531 |
| 1543261000.0 | -0.894737 | -0.894737 | 0.670134 |
| 1543261000.0 | -0.789474 | -0.789474 | 0.732249 |
| 1543261000.0 | -0.684211 | -0.684211 | 0.791305 |


The `mds` catalog has a `search()` method. It returns... a Catalog! This Catalog holds a subset of the entries from `mds`. It has a `search()` method of its own, which can be used to refine the results into yet another Catalog, and so on.

In [10]:
results = mds.search({'plan_name': 'scan'})
len(list(results))

94

In [11]:
import time
refined_results = results.search({'time': {'$lt': time.time() - 2 * 60 * 60 * 24}})
len(list(refined_results))

72
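One way to picture this chaining is a catalog that accumulates queries and applies all of them when iterated. The sketch below is hypothetical: `ToyCatalog` and `matches` are invented names, the matcher handles only equality and `$lt`, and nothing here reflects how `MongoMetadataStoreCatalog` is implemented. The factor `2 * 60 * 60 * 24` in the cell above is two days expressed in seconds, and the sketch uses the same cutoff.

```python
import time


def matches(doc, query):
    """Tiny matcher supporting equality and the `$lt` operator only."""
    for key, cond in query.items():
        value = doc.get(key)
        if isinstance(cond, dict):
            # Only `$lt` is handled in this sketch.
            if value is None or not value < cond["$lt"]:
                return False
        elif value != cond:
            return False
    return True


class ToyCatalog:
    """Hypothetical catalog whose search() returns another ToyCatalog."""

    def __init__(self, docs, queries=()):
        self._docs = docs
        self._queries = tuple(queries)

    def search(self, query):
        # Refining means carrying the earlier queries along with the new one.
        return ToyCatalog(self._docs, self._queries + (query,))

    def __iter__(self):
        for doc in self._docs:
            if all(matches(doc, q) for q in self._queries):
                yield doc["uid"]


now = time.time()
day = 60 * 60 * 24
docs = [
    {"uid": "a", "plan_name": "scan", "time": now - 3 * day},
    {"uid": "b", "plan_name": "scan", "time": now - 1 * day},
    {"uid": "c", "plan_name": "count", "time": now - 5 * day},
]

catalog = ToyCatalog(docs)
results = catalog.search({"plan_name": "scan"})
refined_results = results.search({"time": {"$lt": now - 2 * day}})
print(list(results), list(refined_results))
```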

Whitelist fields with `include` or blacklist them with `exclude`. (You can't do both at once; that raises a `ValueError`.)

In [12]:
run.read(include=['motor']).to_dataframe().head()

| time | motor |
|---|---|
| 1543261000.0 | 0.0 |
| 1543261000.0 | -1.0 |
| 1543261000.0 | -0.894737 |
| 1543261000.0 | -0.789474 |
| 1543261000.0 | -0.684211 |


In [13]:
run.read(exclude=['motor_setpoint']).to_dataframe().head()

| time | motor | det |
|---|---|---|
| 1543261000.0 | 0.0 | NaN |
| 1543261000.0 | -1.0 | 0.606531 |
| 1543261000.0 | -0.894737 | 0.670134 |
| 1543261000.0 | -0.789474 | 0.732249 |
| 1543261000.0 | -0.684211 | 0.791305 |
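The either-or rule for `include` and `exclude` can be sketched as a small validation helper. `select_fields` is a hypothetical function written for illustration; the source only states that combining the two arguments raises a `ValueError`, so the error message and the filtering logic below are assumptions.

```python
def select_fields(available, include=None, exclude=None):
    """Return the field names to read, given at most one of include/exclude."""
    if include is not None and exclude is not None:
        raise ValueError("Pass include or exclude, not both.")
    if include is not None:
        # Keep only the requested fields, in their original order.
        return [name for name in available if name in include]
    if exclude is not None:
        # Keep everything except the listed fields.
        return [name for name in available if name not in exclude]
    return list(available)


fields = ["motor", "motor_setpoint", "det"]
only_motor = select_fields(fields, include=["motor"])
no_setpoint = select_fields(fields, exclude=["motor_setpoint"])

try:
    select_fields(fields, include=["motor"], exclude=["det"])
    raised = False
except ValueError:
    raised = True

print(only_motor, no_setpoint, raised)
```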


Remember that `run` is a `Catalog`. Its entries are the Streams. We can read them individually.

In [15]:
list(run())

['baseline', 'primary']

In [17]:
run()['primary']

<Intake catalog: Stream 'primary' from Run daf14422...>

As with pandas DataFrame columns, dot access also works, unless the stream name collides with an existing attribute. Tab completion works too.

In [18]:
run.primary

<Intake catalog: Stream 'primary' from Run daf14422...>
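The collision caveat follows from how Python resolves attributes: `__getattr__` is consulted only after normal lookup fails, so real attributes and methods take precedence over entry names. The sketch below shows the pattern with a hypothetical `ToyRun` class. It is not the intake implementation, and the stream named `read` is contrived to force a collision.

```python
class ToyRun:
    """Hypothetical container offering both item access and dot access."""

    def __init__(self, streams):
        self._streams = dict(streams)

    def __getitem__(self, name):
        return self._streams[name]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # attributes and methods always win over stream names.
        streams = self.__dict__.get("_streams", {})
        try:
            return streams[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Advertising stream names here lets tab completion find them.
        return sorted(set(super().__dir__()) | set(self._streams))

    def read(self):
        return "all streams"


run = ToyRun({"primary": "primary stream", "read": "a stream named read"})

print(run.primary)    # dot access reaches the stream
print(run["read"])    # item access always reaches the stream
print(run.read())     # the method shadows the stream of the same name
```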

We can read the data all at once:

In [19]:
run.primary.read().to_dataframe().head()

| time | det | motor | motor_setpoint |
|---|---|---|---|
| 1543261000.0 | 0.606531 | -1.0 | -1.0 |
| 1543261000.0 | 0.670134 | -0.894737 | -0.894737 |
| 1543261000.0 | 0.732249 | -0.789474 | -0.789474 |
| 1543261000.0 | 0.791305 | -0.684211 | -0.684211 |
| 1543261000.0 | 0.8457 | -0.578947 | -0.578947 |


Or access a slice (along the Event axis, potentially along other axes in the future):

In [20]:
run.primary.read_slice(slice(7, 13)).to_dataframe()

| time | det | motor | motor_setpoint |
|---|---|---|---|
| 1543261000.0 | 0.934385 | -0.368421 | -0.368421 |
| 1543261000.0 | 0.965967 | -0.263158 | -0.263158 |
| 1543261000.0 | 0.987612 | -0.157895 | -0.157895 |
| 1543261000.0 | 0.998616 | -0.052632 | -0.052632 |
| 1543261000.0 | 0.998616 | 0.052632 | 0.052632 |
| 1543261000.0 | 0.987612 | 0.157895 | 0.157895 |
1543261000.0,0.987612,0.157895,0.157895


We can also read the data as a generator of chunks. The chunk size defaults to a value provided by the `Catalog` and can be overridden by passing an argument.

In [21]:
for chunk in run.primary.read_chunked():
    print(chunk.to_dataframe())

                   det     motor  motor_setpoint
time                                            
1.543261e+09  0.606531 -1.000000       -1.000000
1.543261e+09  0.670134 -0.894737       -0.894737
1.543261e+09  0.732249 -0.789474       -0.789474
1.543261e+09  0.791305 -0.684211       -0.684211
1.543261e+09  0.845700 -0.578947       -0.578947
1.543261e+09  0.893876 -0.473684       -0.473684
1.543261e+09  0.934385 -0.368421       -0.368421
1.543261e+09  0.965967 -0.263158       -0.263158
1.543261e+09  0.987612 -0.157895       -0.157895
1.543261e+09  0.998616 -0.052632       -0.052632
                   det     motor  motor_setpoint
time                                            
1.543261e+09  0.998616  0.052632        0.052632
1.543261e+09  0.987612  0.157895        0.157895
1.543261e+09  0.965967  0.263158        0.263158
1.543261e+09  0.934385  0.368421        0.368421
1.543261e+09  0.893876  0.473684        0.473684
1.543261e+09  0.845700  0.578947        0.578947
1.543261e+09  0.7913

In [22]:
for chunk in run.primary.read_chunked(3):
    print(chunk.to_dataframe())

                   det     motor  motor_setpoint
time                                            
1.543261e+09  0.606531 -1.000000       -1.000000
1.543261e+09  0.670134 -0.894737       -0.894737
1.543261e+09  0.732249 -0.789474       -0.789474
                   det     motor  motor_setpoint
time                                            
1.543261e+09  0.791305 -0.684211       -0.684211
1.543261e+09  0.845700 -0.578947       -0.578947
1.543261e+09  0.893876 -0.473684       -0.473684
                   det     motor  motor_setpoint
time                                            
1.543261e+09  0.934385 -0.368421       -0.368421
1.543261e+09  0.965967 -0.263158       -0.263158
1.543261e+09  0.987612 -0.157895       -0.157895
                   det     motor  motor_setpoint
time                                            
1.543261e+09  0.998616 -0.052632       -0.052632
1.543261e+09  0.998616  0.052632        0.052632
1.543261e+09  0.987612  0.157895        0.157895
                   d
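The chunked-read pattern itself is a plain generator. The sketch below uses a hypothetical free function, also called `read_chunked`, over a list of row indices. The default of 10 mirrors the ten-row chunks in the first output above, but that value is an assumption rather than a documented default.

```python
from itertools import islice


def read_chunked(rows, chunk_size=10):
    """Yield successive lists of at most `chunk_size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            # The input is exhausted; end the generator.
            return
        yield chunk


rows = list(range(20))  # stand-in for 20 events

default_sizes = [len(chunk) for chunk in read_chunked(rows)]
small_sizes = [len(chunk) for chunk in read_chunked(rows, 3)]
print(default_sizes, small_sizes)
```

The last chunk is simply shorter when the number of rows is not a multiple of the chunk size.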

The stream is *also* a Catalog. Its entries are fields, a.k.a. data keys, a.k.a. columns.

In [23]:
list(run.primary)

['det', 'motor', 'motor_setpoint']

In [24]:
run.primary.det

<Intake datasource: Field 'det' of Stream 'primary' from Run daf14422...>

The same methods (`read()`, `read_slice()`, `read_chunked()`) apply. They can typically return simpler data structures because the data they represent is more homogeneous.

In [25]:
run.primary.det.read().to_dataframe().head()

| time | det |
|---|---|
| 1543261000.0 | 0.606531 |
| 1543261000.0 | 0.670134 |
| 1543261000.0 | 0.732249 |
| 1543261000.0 | 0.791305 |
| 1543261000.0 | 0.8457 |
