_Strategy:_

1. mongo dump / reload on a local machine

2. Compose ranges where start is run_start time and stop is run_stop time

3. Compose a dictionary keyed on run_start oid that holds run_start and run_stop tuples

4. iterate over the range dictionary and get distinct event_descriptor_ids for events that fall under each range and put these descriptors in a dictionary formatted as rs_desc_pairs['run_start_id'] = distinct_event_descriptors

5. By iterating over the dict's contents in step 4, find the cases where an event_descriptor is associated with more than one run_start (there are 8, caused by two consecutive run_starts created close together in time). Pick the run_start that is closest to the event_descriptor in time. This step results in the desc_rstart_pairs dict, where each event_descriptor has a corresponding run_start

6. For event_descriptors that I have in desc_rstart_pairs, get distinct data fields. This information will be stored in dt_key dict where keys are event_descriptor oids and contents are the corresponding data keys and their shapes 

7. Get all keys from the beamline and extract the source of each key

8. Using steps 5, 6, and 7 create and insert all event_descriptors



In [1]:
from pymongo import MongoClient
from bson import ObjectId
from pymongo.errors import OperationFailure
from collections import deque
from difflib import SequenceMatcher

In [2]:
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MIGRATION_DB = 'datastore2'
# 1) mongo dump / reload on a local machine
pymongo_client = MongoClient(MONGO_HOST, MONGO_PORT)
database = pymongo_client[MIGRATION_DB]

In [3]:
# distinct() already returns a list, so no pre-initialization is needed
desc_oids = database.event.distinct('descriptor_id')
print(len(desc_oids))

13801


In [4]:
# 2. Compose ranges where start is run_start time and stop is run_stop time

# 3. Compose a dictionary keyed on run_start oid that holds run_start and run_stop tuples

rstt_crsr = database.run_start.find()
pairs = dict()
for rstart in rstt_crsr:
    # find_one returns None when no matching run_stop exists
    run_stop = database.run_stop.find_one({'run_start_id': rstart['_id']})
    # Some run_starts have no run_stop. What to do with these? Ignore them for now.
    if run_stop:
        time_range = (rstart['time'], run_stop['time'])
        pairs[rstart['_id']] = time_range

In [5]:
# 4. Iterate over the range dictionary and get the distinct event_descriptor ids
# for events that fall within each range. Store these descriptors in a dictionary
# formatted as rs_desc_pairs['run_start_id'] = distinct_event_descriptors

rs_desc_pairs = {}
for k, v in pairs.items():
    # give me all the distinct event descriptors for events in the
    # time range between start and stop
    query = {'time': {'$gt': v[0], '$lt': v[1]}}
    rs_desc_pairs[k] = database.event.find(query).distinct(key='descriptor_id')
    # rs_desc_pairs holds run_start event_descriptor pairs!
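
The range query above can be illustrated without a database. The following is a minimal sketch, assuming hypothetical in-memory stand-ins (`ranges`, `events`, `descriptors_in_range`, `sketch_pairs` are illustrative names, not part of the notebook). It shows that the strict `$gt` / `$lt` bounds exclude events exactly on a boundary, and that overlapping ranges make one descriptor appear under two run_starts.

```python
# Hypothetical stand-ins for the run ranges and the event documents.
ranges = {
    'run_a': (100.0, 200.0),
    'run_b': (150.0, 300.0),  # overlaps run_a between 150 and 200
}
events = [
    {'time': 120.0, 'descriptor_id': 'desc_1'},
    {'time': 160.0, 'descriptor_id': 'desc_2'},  # falls inside both ranges
    {'time': 250.0, 'descriptor_id': 'desc_3'},
    {'time': 300.0, 'descriptor_id': 'desc_4'},  # on the boundary, excluded
]


def descriptors_in_range(events, start, stop):
    """Distinct descriptor ids of events with start < time < stop."""
    return sorted({e['descriptor_id'] for e in events
                   if start < e['time'] < stop})


sketch_pairs = {run: descriptors_in_range(events, start, stop)
                for run, (start, stop) in ranges.items()}
print(sketch_pairs)
```

Here `desc_2` is listed under both runs, which is the situation step 5 has to resolve.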

In [6]:
# 5. By iterating over the dict's contents in step 4, find
# the cases where an event_descriptor is associated with more than
# one run_start (there are 8, caused by two consecutive
# run_starts created close together in time). Pick the run_start that is closest
# to the event_descriptor in time. This step results in the desc_rstart_pairs
# dict, where each event_descriptor has a corresponding run_start


descs = deque()
desc_rstart_pairs = {}
def similar(a, b):
    # String similarity of two oids. An ObjectId starts with its creation
    # timestamp, so similarity of the oid strings is used as a proxy for
    # closeness in time.
    return SequenceMatcher(None, str(a), str(b)).ratio()
# One run_start can have multiple descriptors. rs_desc_pairs holds run_start (k) and its
# corresponding descriptors (v)
for k, v in rs_desc_pairs.items():
    for desc in v:
        # Check if this descriptor has already been seen.
        # If so, compare the two run_starts and pick the one closest
        # to the descriptor in time
        if desc not in descs:
            descs.append(desc)
            desc_rstart_pairs[desc] = k
        else:
            # For the 8 overlapping descriptors, select the most likely
            # run_start (an oid that differs by 1 is as close as it can get)
            print('run_start 1', desc_rstart_pairs[desc].generation_time,
                  'run_start 2', k.generation_time,
                  'descriptor', desc.generation_time)
            if similar(desc_rstart_pairs[desc], desc) < similar(k, desc):
                desc_rstart_pairs[desc] = k

run_start 1 2015-06-19 00:48:30+00:00 run_start 2 2015-06-19 00:52:58+00:00 descriptor 2015-06-19 00:53:01+00:00
run_start 1 2015-06-19 00:48:30+00:00 run_start 2 2015-06-19 00:50:51+00:00 descriptor 2015-06-19 00:50:54+00:00
run_start 1 2015-02-19 20:51:47+00:00 run_start 2 2015-02-19 20:51:47+00:00 descriptor 2015-02-19 20:51:47+00:00
run_start 1 2015-02-19 20:51:47+00:00 run_start 2 2015-02-19 20:51:47+00:00 descriptor 2015-02-19 20:51:47+00:00
run_start 1 2015-06-19 00:48:30+00:00 run_start 2 2015-06-19 00:53:54+00:00 descriptor 2015-06-19 00:53:57+00:00
run_start 1 2015-06-19 00:48:30+00:00 run_start 2 2015-06-19 00:53:54+00:00 descriptor 2015-06-19 00:48:39+00:00
run_start 1 2015-06-18 21:15:21+00:00 run_start 2 2015-06-18 21:16:27+00:00 descriptor 2015-06-18 21:15:25+00:00
run_start 1 2015-06-18 21:15:21+00:00 run_start 2 2015-06-18 21:16:27+00:00 descriptor 2015-06-18 21:16:30+00:00
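
The cell above ranks candidates by string similarity of the oids. As an alternative, the comparison can be made on the timestamps directly. This is a minimal sketch, not the notebook's method: `closest_run_start`, `utc`, and the ids `rs_1` / `rs_2` are hypothetical, and plain `datetime` values stand in for `ObjectId.generation_time`. The times are taken from two of the printed lines above.

```python
from datetime import datetime, timezone


def closest_run_start(desc_time, candidates):
    """Return the candidate id whose time is nearest to desc_time.

    candidates maps a run_start id to its creation time (a datetime).
    """
    return min(candidates,
               key=lambda rs: abs((desc_time - candidates[rs]).total_seconds()))


def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# Two competing run_starts, as in the output above.
candidates = {
    'rs_1': utc(2015, 6, 19, 0, 48, 30),
    'rs_2': utc(2015, 6, 19, 0, 53, 54),
}
late = closest_run_start(utc(2015, 6, 19, 0, 53, 57), candidates)
early = closest_run_start(utc(2015, 6, 19, 0, 48, 39), candidates)
print(late, early)
```

When two run_starts share the same second, as in the 2015-02-19 lines, this comparison ties and the first candidate wins, so such cases still need a manual look.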


In [7]:
# 6. For event_descriptors that I have in desc_rstart_pairs, get distinct data fields.
# This information will be stored in dt_key dict where keys are event_descriptor oids
# and contents are the corresponding data keys and their shapes
data_key_templates = {}
leftovers = []
unique_keys = []
for d in descs:
    dt_key = {}
    try:
        res = database.event.find({"descriptor_id" : ObjectId(d)}).distinct(key='data')[0]
    except OperationFailure:
        # distinct failed for this descriptor; it is handled with the leftovers
        leftovers.append(d)
        continue  # otherwise res from the previous descriptor would be reused
    for k, v in res.items():
        if k not in unique_keys:
            unique_keys.append(k)
        if (type(v[0])==int) or (type(v[0])==float):
            data_type = 'number'
        elif (type(v[0])==str):
            data_type = 'array'
        else:
            data_type = type(v[0])
        try:
            # v[1] in this range (2013 to now) is taken to be a timestamp,
            # so v is treated as a scalar reading
            if (1356998400.0 <= v[1] <= 1452612569587.0):
                shape = []
            else:
                shape = (len(v),)
        except TypeError:
            shape = []
        if data_type == 'array':
            dt_key[k] = {'shape': shape, 'dtype': data_type, 
                         'external': 'FILESTORE:'}
        else:
            dt_key[k] = {'shape': shape, 'dtype': data_type}
    data_key_templates[d] = dt_key

# The difference between the descriptor counts is
# due to run_stop-less run_starts.
# There is a 16 MB limit on distinct. Sometimes even the smallest documents
# hit that threshold and get left out. Catch those that are excluded and
# add them separately (the leftovers list).
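
The dtype and shape rules in this cell are repeated for the leftovers, so they could live in one helper. The following is a minimal sketch under that assumption: `infer_data_key` and its parameter names are hypothetical, and the thresholds are copied from the cell above. The first example value is taken from the leftovers output further down; the string value is made up for illustration.

```python
def infer_data_key(value, ts_min=1356998400.0, ts_max=1452612569587.0):
    """Build a data key entry from one stored event value.

    Mirrors the notebook's rules: numbers are 'number', strings are
    treated as references to external arrays, and a second element that
    looks like a timestamp marks the value as a scalar (empty shape).
    """
    first = value[0]
    if type(first) in (int, float):
        dtype = 'number'
    elif type(first) is str:
        dtype = 'array'
    else:
        dtype = type(first)
    try:
        if ts_min <= value[1] <= ts_max:
            shape = []
        else:
            shape = (len(value),)
    except TypeError:
        shape = []
    entry = {'shape': shape, 'dtype': dtype}
    if dtype == 'array':
        entry['external'] = 'FILESTORE:'
    return entry


flyer = infer_data_key([71457568.0, 816875253.5713149])
image = infer_data_key(['hypothetical-file-reference', 1424357850.284224])
print(flyer)
print(image)
```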

In [8]:
print(len(unique_keys), ' unique keys')

223  unique keys


In [9]:
# Add the leftovers to the dict
for l in leftovers:
    dt_key = {}
    # distinct failed for these descriptors, so read one event's data directly
    res = database.event.find({"descriptor_id" : ObjectId(l)})
    data = next(res)['data']
    for k, v in data.items():
        print(v)
        if k not in unique_keys:
            unique_keys.append(k)
        if (type(v[0])==int) or (type(v[0])==float):
            data_type = 'number'
        elif (type(v[0])==str):
            data_type = 'array'
        else:
            data_type = type(v[0])
        try:
            if (1356998400.0 <= v[1] <= 1452612569587.0): #2013 to now
                shape = []
            else:
                shape = (len(v),)
        except TypeError:
            shape = []
        if data_type == 'array':
            dt_key[k] = {'shape': shape, 'dtype': data_type, 
                         'external': 'FILESTORE:'}
        else:
            dt_key[k] = {'shape': shape, 'dtype': data_type}
    # store the template once all keys of this descriptor are collected
    data_key_templates[l] = dt_key
    print(l, dt_key)

[71457568.0, 816875253.5713149]
564f4eaf7368e37740aa5712 {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[161576467.0, 817206249.6746507]
56545ba37368e3ce421164b9 {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[161861225.0, 817064735.4397624]
565232d97368e3ce42cef358 {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[163193577.0, 817111571.0875765]
5652e9cc7368e3ce42e54249 {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[164138955.0, 817192133.8452698]
5654247f7368e3ce42091d20 {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[162215895.0, 816999978.2690133]
565135f07368e3ce42a6df0b {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[160588260.0, 817149374.8139615]
56537d787368e3ce42fe1913 {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[160462186.0, 817136734.4582783]
56534c197368e3ce42f5d18a {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[162076572.0, 817125383.6342568]
56531fc17368e3ce42ed89e5 {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
[1

In [10]:
# sanity check. Make sure events are between stop and start for "some" ObjectId
t = ObjectId('54e5f9b87368e36ad949a25b')
print('RunStart time', next(database.run_start.find({'_id': t}))['time'])
print(rs_desc_pairs[t][0])
crsr = database.event.find({'descriptor_id': rs_desc_pairs[t][0]})
for c in crsr:
    print('Event time ',c['time'], c['_id'])
print('RunStop time', next(database.run_stop.find({'run_start_id': t}))['time'])

RunStart time 1424357816.968672
54e5f9da7368e36ad949a25c
Event time  1424357850.284224 54e5f9da7368e36ad949a25d
Event time  1424357883.543736 54e5f9fb7368e36ad949a25e
Event time  1424357913.724856 54e5fa197368e36ad949a25f
Event time  1424357946.947482 54e5fa3a7368e36ad949a260
Event time  1424357973.540224 54e5fa557368e36ad949a261
Event time  1424358006.766089 54e5fa767368e36ad949a262
Event time  1424358033.844119 54e5fa917368e36ad949a263
Event time  1424358067.188064 54e5fab37368e36ad949a264
Event time  1424358093.697332 54e5facd7368e36ad949a265
Event time  1424358126.950872 54e5faee7368e36ad949a266
Event time  1424358153.969373 54e5fb0a7368e36ad949a267
RunStop time 1424358154.394573


In [11]:
# 7. Extract the source of all keys from the beamline. Get all keys from the beamline 


In [12]:
# 8. Using steps 5, 6, and 7 create and insert all event_descriptors
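
This cell is still empty in the notebook. The following is a minimal sketch of how the pieces from steps 5, 6, and 7 might be combined into one descriptor document. Everything here is an assumption: `build_descriptor`, the `sources` mapping (the expected result of step 7), the `'unknown'` fallback, and the document field names `run_start_id` and `data_keys` are hypothetical and not confirmed by the notebook.

```python
def build_descriptor(desc_id, run_start_id, data_keys, sources):
    """Assemble one event_descriptor document (hypothetical layout).

    data_keys is one entry of data_key_templates (step 6), sources maps
    each key name to its source on the beamline (step 7).
    """
    keys = {}
    for name, spec in data_keys.items():
        entry = dict(spec)  # copy so the template is not modified
        entry['source'] = sources.get(name, 'unknown')
        keys[name] = entry
    return {
        '_id': desc_id,
        'run_start_id': run_start_id,
        'data_keys': keys,
    }


# Illustrative inputs; the ids and the source string are made up.
template = {'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}
doc = build_descriptor('desc_oid', 'run_start_oid', template,
                       {'diag6_flyer1': 'PV:hypothetical'})
print(doc)
```

In the notebook this would be called once per item of desc_rstart_pairs, and each document would then be written with an insert call such as pymongo's `insert_one` on the target collection.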



In [16]:
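# str(key).split(', ') wraps each oid string in a one-element list,
# so a[i][0] is the i-th descriptor oid as a string
a = [str(key).split(', ') for key in data_key_templates.keys()]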

print(data_key_templates[ObjectId(a[1199][0])])

{'diag6_flyer1': {'dtype': 'number', 'shape': (2,)}}


In [18]:
print(data_key_templates[ObjectId(a[199][0])]['fccd_image_lightfield'])

{'dtype': 'array', 'external': 'FILESTORE:', 'shape': []}


In [None]:
print(a[1199][0])