_Strategy:_

1. mongo dump / reload on a local machine

2. Compose ranges where start is run_start time and stop is run_stop time

3. Compose a dictionary keyed on run_start oid that holds run_start and run_stop tuples

4. iterate over the range dictionary and get distinct event_descriptor_ids for events that fall under each range and put these descriptors in a dictionary formatted as rs_desc_pairs['run_start_id'] = distinct_event_descriptors

5. By iterating over the dict built in step 4, find the cases where an event_descriptor is associated with more than one run_start (there are 8, caused by two consecutive run_starts being created close together in time). Pick the run_start that is closest to the event_descriptor in time. This step results in the desc_rstart_pairs dict, where each event_descriptor has one corresponding run_start

6. For event_descriptors that I have in desc_rstart_pairs, get distinct data fields. This information will be stored in dt_key dict where keys are event_descriptor oids and contents are the corresponding data keys and their shapes 

7. Get all keys from the beamline and extract the source of each key

8. Using steps 5, 6, and 7 create and insert all event_descriptors



In [1]:
from pymongo import MongoClient
from bson import ObjectId
from pymongo.errors import OperationFailure
from collections import deque
from difflib import SequenceMatcher

In [2]:
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MIGRATION_DB = 'datastore2'
# 1) mongo dump / reload on a local machine
pymongo_client = MongoClient(MONGO_HOST, MONGO_PORT)
database = pymongo_client[MIGRATION_DB]

In [3]:
# distinct already returns a list, so no pre-initialisation is needed
desc_oids = database.event.distinct('descriptor_id')
print(len(desc_oids))

13801


In [4]:
# 2. Compose ranges where start is run_start time and stop is run_stop time

# 3. Compose a dictionary keyed on run_start oid that holds run_start and run_stop tuples

pairs = dict()
for rstart in database.run_start.find():
    # find_one returns None when no run_stop references this run_start
    run_stop = database.run_stop.find_one({'run_start_id': rstart['_id']})
    # Some run_starts never got a run_stop. What to do with these? Ignore for now.
    if run_stop is not None:
        pairs[rstart['_id']] = (rstart['time'], run_stop['time'])
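
The loop above issues one run_stop query per run_start. As a rough alternative sketch, the same pairing can be built from a single pass over each collection. The plain dicts below stand in for MongoDB documents and the function name `pair_time_ranges` is hypothetical; only the `_id`, `time` and `run_start_id` fields come from the notebook.

```python
def pair_time_ranges(run_starts, run_stops):
    """Map run_start _id -> (start_time, stop_time), skipping starts with no stop."""
    # Index stop times by the run_start they close. If several stops point at
    # the same run_start, the last one seen wins (an assumption of this sketch).
    stop_time_by_start = {rs['run_start_id']: rs['time'] for rs in run_stops}
    pairs = {}
    for start in run_starts:
        stop_time = stop_time_by_start.get(start['_id'])
        if stop_time is not None:
            pairs[start['_id']] = (start['time'], stop_time)
    return pairs


# Stand-in documents; with pymongo these would come from
# database.run_start.find() and database.run_stop.find().
starts = [{'_id': 'a', 'time': 10.0}, {'_id': 'b', 'time': 20.0}]
stops = [{'run_start_id': 'a', 'time': 15.0}]
example_pairs = pair_time_ranges(starts, stops)
```

Here run_start `'b'` has no run_stop and is left out, mirroring the "ignore for now" choice in the cell above.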

In [5]:
# 4. iterate over the range dictionary and get distinct event_descriptor_ids for events that fall under each range and put these descriptors in a dictionary formatted as rs_desc_pairs['run_start_id'] = distinct_event_descriptors

rs_desc_pairs = {}
for rstart_id, (start_time, stop_time) in pairs.items():
    # give me all the distinct event descriptors for events in the
    # time range between start and stop
    query = {'time': {'$gt': start_time, '$lt': stop_time}}
    rs_desc_pairs[rstart_id] = database.event.find(query).distinct(key='descriptor_id')
    # rs_desc_pairs holds run_start event_descriptor pairs!

In [6]:
# 5. By iterating over the dict's contents in step 4, find 
# the cases that an event_descriptor is associated with more than 
# one run_start (there are 8 and this is due to creation of two consecutive 
# run_starts not too far apart in time). Pick the run_start that is closest 
# to the event_descriptor in time. This step results in desc_rstart_pairs dict where each event_descriptor has a corresponding run_start


descs = deque()
desc_rstart_pairs = {}
def similar(a, b):
    return SequenceMatcher(None, str(a), str(b)).ratio()

for rstart_id, desc_ids in rs_desc_pairs.items():
    for desc in desc_ids:
        if desc not in descs:
            # First sighting of this descriptor: record its run_start
            descs.append(desc)
            desc_rstart_pairs[desc] = rstart_id
        else:
            # For the 8 overlapping descriptors, select the most likely
            # run_start. String similarity of the oids is used as a proxy for
            # closeness in time (an oid that is +1 is as close as it can get).
            current = desc_rstart_pairs[desc]
            print(current.generation_time, desc.generation_time, rstart_id.generation_time)
            print(current, desc, rstart_id)
            if similar(current, desc) < similar(rstart_id, desc):
                desc_rstart_pairs[desc] = rstart_id
            print(desc_rstart_pairs[desc], desc)

2015-06-18 21:16:27+00:00 2015-06-18 21:15:25+00:00 2015-06-18 21:15:21+00:00
5583352b0712a63780f7ccdb 558334ed7368e3a5578ab81e 558334e97368e3a5578ab81d
558334e97368e3a5578ab81d 558334ed7368e3a5578ab81e
2015-06-18 21:16:27+00:00 2015-06-18 21:16:30+00:00 2015-06-18 21:15:21+00:00
5583352b0712a63780f7ccdb 5583352e0712a63780f7ccdc 558334e97368e3a5578ab81d
5583352b0712a63780f7ccdb 5583352e0712a63780f7ccdc
2015-02-19 20:51:47+00:00 2015-02-19 20:51:47+00:00 2015-02-19 20:51:47+00:00
54e64ce324467976d380b00e 54e64ce324467976d380afed 54e64ce324467976d380afec
54e64ce324467976d380afec 54e64ce324467976d380afed
2015-02-19 20:51:47+00:00 2015-02-19 20:51:47+00:00 2015-02-19 20:51:47+00:00
54e64ce324467976d380b00e 54e64ce324467976d380b00f 54e64ce324467976d380afec
54e64ce324467976d380b00e 54e64ce324467976d380b00f
2015-06-19 00:53:54+00:00 2015-06-19 00:48:39+00:00 2015-06-19 00:48:30+00:00
558368220712a64e78d58c53 558366e77368e3b3b857fbf3 558366de7368e3b3b857fbf2
558366de7368e3b3b857fbf2 558366e773
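
The cell above uses string similarity between oids as a stand-in for closeness in time. Since the first 4 bytes of an ObjectId encode its creation time in seconds, the comparison can also be made on that timestamp directly. The following is a minimal sketch of that idea, not the notebook's method; `oid_time` and `closest_run_start` are hypothetical helper names, and the oids are taken from the first overlapping case printed above.

```python
from datetime import datetime, timezone


def oid_time(oid):
    """Creation time encoded in the first 4 bytes (8 hex chars) of an ObjectId."""
    return datetime.fromtimestamp(int(str(oid)[:8], 16), tz=timezone.utc)


def closest_run_start(desc_oid, candidate_oids):
    """Return the candidate whose creation time is nearest to the descriptor's."""
    desc_t = oid_time(desc_oid)
    return min(candidate_oids,
               key=lambda c: abs((oid_time(c) - desc_t).total_seconds()))


# First overlapping case from the output above: one descriptor, two run_starts.
desc = '558334ed7368e3a5578ab81e'
candidates = ['5583352b0712a63780f7ccdb', '558334e97368e3a5578ab81d']
chosen = closest_run_start(desc, candidates)
```

For this case the sketch picks `558334e97368e3a5578ab81d`, the same run_start the cell above settles on. The embedded timestamp has one-second resolution, so cases where all three oids share the same second (such as the 2015-02-19 entries) end in a tie and would need another criterion, for example the `time` fields of the documents themselves.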

In [7]:
# 6. For event_descriptors that I have in desc_rstart_pairs, get distinct data fields. 
# This information will be stored in dt_key dict where keys are event_descriptor oids and contents are the corresponding data keys and their shapes 


dt_key = {}

# TODO: Compose data keys
# Get all data keys and corresponding shapes
# Get source info
# Tada!
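for d in descs:
    try:
        res = database.event.find({"descriptor_id" : ObjectId(d)}).distinct(key='data')[0]
    except OperationFailure:
        # distinct failed for this descriptor: report it and skip it, so that
        # a stale `res` from a previous iteration is not stored under this oid
        print(d)
        continue
    dt_key[d] = res.keys()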
# The descriptor count here differs from the distinct count above
# because of run_starts that have no run_stop

# There is a 16 MB limit on distinct. Sometimes even the smallest documents
# hit that threshold and get skipped. Need to fix these

56545ba37368e3ce421164b9
565135f07368e3ce42a6df0b
565232d97368e3ce42cef358
564eed4a7368e3774099aeaa
5652e9cc7368e3ce42e54249
56518b2f7368e3ce42b4d475
56537d787368e3ce42fe1913
5651e75c7368e3ce42c563f6
5651b7867368e3ce42bd1c16
56534c197368e3ce42f5d18a
56531fc17368e3ce42ed89e5
564f4eaf7368e37740aa5712
564f19857368e37740a202dd
5654247f7368e3ce42091d20


In [8]:
# sanity check. Make sure events are between stop and start
t = ObjectId('54e5f9b87368e36ad949a25b')
print('RunStart time', next(database.run_start.find({'_id': t}))['time'])
crsr = database.event.find({'descriptor_id': rs_desc_pairs[t][0]})
for c in crsr:
    print('Event time ',c['time'])
print('RunStop time', next(database.run_stop.find({'run_start_id': t}))['time'])

RunStart time 1424357816.968672
Event time  1424357850.284224
Event time  1424357883.543736
Event time  1424357913.724856
Event time  1424357946.947482
Event time  1424357973.540224
Event time  1424358006.766089
Event time  1424358033.844119
Event time  1424358067.188064
Event time  1424358093.697332
Event time  1424358126.950872
Event time  1424358153.969373
RunStop time 1424358154.394573
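
The check above is done by eye for a single run. A small helper could apply the same test to any run and report offending events. This is a sketch under the assumption that times are plain epoch floats, as in the output above; `events_outside_range` is a hypothetical name.

```python
def events_outside_range(start_time, stop_time, event_times):
    """Return the event times that do not fall strictly between start and stop."""
    return [t for t in event_times if not (start_time < t < stop_time)]


# Start, stop, and the first and last event times copied from the output above.
start, stop = 1424357816.968672, 1424358154.394573
outliers = events_outside_range(start, stop,
                                [1424357850.284224, 1424358153.969373])
```

An empty `outliers` list means every checked event lies inside the run, which matches the strict `$gt` / `$lt` bounds used in the step 4 query.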


In [None]:
# 7. Get all keys from the beamline and extract the source of each key



In [None]:
# 8. Using steps 5, 6, and 7 create and insert all event_descriptors

