# Generating metadata

### What this notebook does
**Step 1:** Create a metadata file containing JSON-formatted trial metadata objects

**Step 2:** Insert each record from that file into a MongoDB collection

This assumes that the stimuli have been uploaded to the S3 bucket using `upload_stims_to_s3.ipynb`.

In [19]:
# Which experiment? bucket_name is the name of the experiment and of the S3 bucket.
# Together with stim_version it also names the MongoDB collection (see dataset_name below).
bucket_name = 'compositional-abstractions-prior-elicitation'
stim_version = 'example'

In [20]:
import os
import numpy as np
from PIL import Image
import pandas as pd
import json
import pymongo as pm
from glob import glob
from IPython.display import clear_output
import ast
import itertools
import random
import h5py

In [21]:
def list_files(paths, ext='png'):
    """Return (file paths, S3 filenames) for every `*.ext` file found under `paths`.

    Pass a list of folders if there are stimuli in multiple folders.
    Each S3 filename is '<containing folder>_<file name>', so make sure the
    containing folder is informative; the rest of the path is ignored in naming.
    """
    if not isinstance(paths, list):
        paths = [paths]
    results = []
    names = []
    for path in paths:
        # walk the tree once and collect matching files in every subfolder
        for dirpath, _, _ in os.walk(path):
            for y in glob(os.path.join(dirpath, '*.%s' % ext)):
                results.append(y)
                names.append(os.path.basename(os.path.dirname(y)) + '_' + os.path.basename(y))
    return results, names
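
To make the naming convention concrete, here is a minimal sketch that applies the same `<containing folder>_<file name>` rule to a temporary directory. The helper `s3_name` and the folder and file names are hypothetical, chosen only for illustration.

```python
import os
import tempfile
from glob import glob

def s3_name(path):
    """Mirror the naming rule used by list_files: '<containing folder>_<file name>'."""
    return os.path.basename(os.path.dirname(path)) + '_' + os.path.basename(path)

# Build a throwaway folder with two empty PNG files (hypothetical names).
with tempfile.TemporaryDirectory() as root:
    stim_dir = os.path.join(root, 'example-stims')
    os.makedirs(stim_dir)
    for fname in ['image_0.png', 'image_1.png']:
        open(os.path.join(stim_dir, fname), 'w').close()

    found = sorted(glob(os.path.join(stim_dir, '*.png')))
    demo_names = [s3_name(p) for p in found]

print(demo_names)
```

Because only the immediate parent folder is used, two files with the same name in identically named folders would map to the same S3 filename.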

In [22]:
## Where on disk are the stimuli stored?
data_dirs = [
    "/Users/choldawa/Documents/Projects/composition-abstractions/stimuli/prior-elicitation-example-stims/"
]

dataset_name = '{}_{}'.format(bucket_name, stim_version)
stimulus_extension = "png" #what's the file extension for the stims? Provide without dot

## get a list of paths to each one
full_stim_paths,filenames = list_files(data_dirs,stimulus_extension)
print('We have {} stimuli to evaluate.'.format(len(full_stim_paths)))

We have 2 stimuli to evaluate.


We also want a number of familiarization trials. Put a few hand-selected filenames here. The stimuli are expected in the S3 bucket under those filenames.

In [23]:
# # familiarization_stem = 'pilot_dominoes_default_boxroom'
# familiarization_stems = [
#     'pilot_towers_nb4_SJ025_mono1_boxroom',
#     'pilot_towers_nb2_fr015_SJ010_mono1_tdwroom'
# ]
# familiarization_filenames = [(familiarization_stems[0] + ('_%04d_img.mp4' % d)) 
#                              for d in range(4,10)]
# familiarization_filenames.extend([(familiarization_stems[1] + ('_%04d_img.mp4' % d)) 
#                              for d in range(3,7)])
# rng = np.random.RandomState(seed=0)
# familiarization_filenames = list(rng.permutation(familiarization_filenames))
# familiarization_filenames

In [24]:
## helper to build stim urls
def build_s3_url(filename, bucket_name):    
    return 'https://{}.s3.amazonaws.com/{}'.format(bucket_name, filename)
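
The helper above inserts the filename into the URL unchanged, which is fine for names made of letters, digits, hyphens, underscores and dots. If a filename could contain spaces or other reserved characters, a percent-encoding variant might look like the sketch below. The function name `build_s3_url_quoted` and the example bucket are assumptions, not part of the original notebook.

```python
from urllib.parse import quote

def build_s3_url_quoted(filename, bucket_name):
    """Like build_s3_url, but percent-encodes characters that are unsafe in a URL path."""
    return 'https://{}.s3.amazonaws.com/{}'.format(bucket_name, quote(filename))

# A hypothetical filename with spaces is encoded; a plain one is left unchanged.
spaced = build_s3_url_quoted('my stims_image 0.png', 'my-example-bucket')
plain = build_s3_url_quoted('prior-elicitation-example-stims_image_1.png', 'my-example-bucket')
print(spaced)
print(plain)
```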

In [25]:
## basic metadata lists
stim_urls = [build_s3_url(p,bucket_name) for p in filenames]
stim_IDs = [name.split('.')[0] for name in filenames]
# familiarization_stim_urls = [build_s3_url(p,bucket_name) for p in familiarization_filenames]
# familiarization_stim_IDs = [name.split('.')[0] for name in familiarization_filenames]
# len(familiarization_stim_urls)
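
Note that `name.split('.')[0]` keeps only the text before the first dot. That works for the filenames in this example, but a name that itself contains a dot would be truncated. A minimal sketch of an alternative using `os.path.splitext`, with a hypothetical filename and helper name:

```python
import os

def stim_id_from_filename(name):
    """Drop only the final extension, keeping any earlier dots in the name."""
    return os.path.splitext(name)[0]

example = 'towers_fr0.15_image_0.png'   # hypothetical filename containing a dot
truncated = example.split('.')[0]       # cuts at the first dot
full_id = stim_id_from_filename(example)
print(truncated)
print(full_id)
```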

In [26]:
## convert to pandas dataframe
M = pd.DataFrame([stim_IDs,stim_urls]).transpose()
M.columns = ['stim_ID', 'stim_url']

## drop the stims that have the same stem as the familiarization trials
# for fstem in familiarization_stems:
#     M = M[~M['stim_ID'].str.contains(fstem)]

# familiarization_M = pd.DataFrame([familiarization_stim_IDs,familiarization_stim_urls]).transpose()
# familiarization_M.columns = ['stim_ID', 'stim_url']
# # save the familiarization dict
# familiarization_trials = familiarization_M.transpose().to_dict()
# # needs to have strings as keys
# familiarization_trials = {str(key):value for key, value in familiarization_trials.items()}

In [27]:
# len(familiarization_M)

In [28]:
# remove some bad stimuli -- regenerate these
# bad_stimuli = [
#     "pilot_dominoes_1mid_J025R45_o1flex_tdwroom_0006_img",
#     "pilot_dominoes_1mid_J025R45_o1flex_tdwroom_0009_img"
# ]
# bad_stimuli = []

# for nm in bad_stimuli:
#     M = M[~M['stim_ID'].str.contains(nm)]

In [29]:
len(M)

2

In [30]:
M.head()

| | stim_ID | stim_url |
|---|---|---|
| 0 | prior-elicitation-example-stims_image_1 | https://compositional-abstractions-prior-elici... |
| 1 | prior-elicitation-example-stims_image_0 | https://compositional-abstractions-prior-elici... |


Add metadata to the stimuli

In [38]:
M['games'] = '[]' ## empty games list for marking records when retrieved from mongo (see store.js)
M['games'] = M['games'].apply(lambda x: ast.literal_eval(x))


assert len(M) == 2
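
One plausible reason for parsing the string `'[]'` row by row is that every stimulus then gets its own empty list. The plain-Python sketch below (no pandas, hypothetical values) contrasts that with a single list shared across rows, which would be a problem if the lists were ever modified in place.

```python
import ast

n_rows = 2

# One list object repeated: every "row" points at the same list.
shared = [[]] * n_rows
# Parsing '[]' once per row creates a separate list each time.
independent = [ast.literal_eval('[]') for _ in range(n_rows)]

shared[0].append('game-1')
independent[0].append('game-1')

print(shared)
print(independent)
```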

In [39]:
# initialize list of all version dictionaries
Meta = []
stimList = M.to_dict(orient='records')
stimDict = {}
stimDict['meta'] = stimList
stimDict['games'] = [] 
stimDict['experimentName'] = dataset_name
Meta.append(stimDict)
Meta

[{'meta': [{'stim_ID': 'prior-elicitation-example-stims_image_1',
    'stim_url': 'https://compositional-abstractions-prior-elicitation.s3.amazonaws.com/prior-elicitation-example-stims_image_1.png',
    'games': []},
   {'stim_ID': 'prior-elicitation-example-stims_image_0',
    'stim_url': 'https://compositional-abstractions-prior-elicitation.s3.amazonaws.com/prior-elicitation-example-stims_image_0.png',
    'games': []}],
  'games': [],
  'experimentName': 'compositional-abstractions-prior-elicitation_example'}]
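
Before writing the records out, it can help to check that each one has the structure shown above. The following is a minimal sketch of such a check. `check_record` is a hypothetical helper, and the specific rules (unique `stim_ID`, `https://` URLs, empty `games`) are assumptions drawn from the example output rather than requirements stated elsewhere.

```python
def check_record(record):
    """Raise AssertionError if a record deviates from the structure shown above."""
    assert set(record) == {'meta', 'games', 'experimentName'}, 'unexpected top-level keys'
    ids = [trial['stim_ID'] for trial in record['meta']]
    assert len(ids) == len(set(ids)), 'duplicate stim_ID'
    for trial in record['meta']:
        assert trial['stim_url'].startswith('https://'), 'stim_url is not an https URL'
        assert trial['games'] == [], 'games should start out empty'
    return True

# Hypothetical record with the same shape as the entries in Meta.
demo_record = {
    'meta': [
        {'stim_ID': 'demo_image_0',
         'stim_url': 'https://example-bucket.s3.amazonaws.com/demo_image_0.png',
         'games': []},
    ],
    'games': [],
    'experimentName': 'demo_example',
}
print(check_record(demo_record))
```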

In [40]:
print('Saving JSON dictionary out to file...')
with open('{}_meta.js'.format(dataset_name), 'w') as fout:
    json.dump(Meta, fout)
print('Done!')

Saving JSON dictionary out to file...
Done!
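
A quick way to confirm that the file was written correctly is to read it back and compare it with the in-memory object. The sketch below does this round trip in a temporary directory with a hypothetical record, so it does not touch the real metadata file.

```python
import json
import os
import tempfile

# Hypothetical record list with the same shape as Meta.
record = [{
    'meta': [{'stim_ID': 'demo_image_0',
              'stim_url': 'https://example-bucket.s3.amazonaws.com/demo_image_0.png',
              'games': []}],
    'games': [],
    'experimentName': 'demo_example',
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_example_meta.js')
    with open(path, 'w') as fout:
        json.dump(record, fout)
    with open(path) as fin:
        reloaded = json.load(fin)

round_trip_ok = (reloaded == record)
print(round_trip_ok)
```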


Set up an SSH tunnel to write to MongoDB. If you don't have one open yet, run `ssh -fNL 27017:127.0.0.1:27017 USERNAME@cogtoolslab.org` in your shell, replacing `USERNAME` with your username.

In [41]:
#ssh -fNL 27017:127.0.0.1:27017 choldawa@cogtoolslab.org
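
To see whether anything is listening on the forwarded port before connecting, a small TCP check can be used. This is only a sketch: `port_is_open` is a hypothetical helper, and a successful connection shows that some process accepts connections on that port, not that it is the tunnel or MongoDB specifically.

```python
import socket

def port_is_open(host='127.0.0.1', port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and unreachable hosts
        return False

print('Port 27017 reachable:', port_is_open())
```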

In [42]:
# set vars
# auth.txt contains the password for the sketchloop user;
# the path below expects it in ../analysis/ relative to this notebook
auth = pd.read_csv('../analysis/auth.txt', header = None)
pswd = auth.values[0][0]
user = 'sketchloop'
host = 'cogtoolslab.org' ## cogtoolslab host, reached through the SSH tunnel on 127.0.0.1

conn = pm.MongoClient('mongodb://' + user + ':' + pswd + '@127.0.0.1')
db = conn['stimuli']
coll = db[dataset_name]
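
If the password contains characters such as `@`, `:` or `/`, concatenating it directly into the connection string produces an invalid URI. Credentials in a MongoDB URI need to be percent-escaped, for example with `urllib.parse.quote_plus`. A minimal sketch, where `build_mongo_uri` and the example password are hypothetical:

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, password, host='127.0.0.1', port=27017):
    """Build a mongodb:// URI with percent-escaped credentials."""
    return 'mongodb://{}:{}@{}:{}'.format(quote_plus(user), quote_plus(password), host, port)

# Made-up password containing characters that must be escaped.
demo_uri = build_mongo_uri('sketchloop', 'p@ss:word/1')
print(demo_uri)
```

The resulting string can be passed to `pm.MongoClient` in place of the hand-concatenated one.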

In [43]:
with open('{}_meta.js'.format(dataset_name), mode='r') as fin:
    J = json.load(fin)
print('dataset_name: {}'.format(dataset_name))
print('Length of J is: {}'.format(len(J)))

dataset_name: compositional-abstractions-prior-elicitation_example
Length of J is: 1


In [44]:
#⚠️drop collection if necessary. 
if False: #change to run
    db.drop_collection(dataset_name) 

{'ok': 0.0,
 'errmsg': 'ns not found',
 'code': 26,
 'codeName': 'NamespaceNotFound'}

In [49]:
#get list of current collections
sorted(db.list_collection_names())

['block-construction-silhouette-exp01',
 'block-construction-silhouette-exp02',
 'causaldraw',
 'causaldraw_annotations',
 'causaldraw_annotations_patching',
 'causaldraw_identification',
 'causaldraw_intervention',
 'causaldraw_intervention_patching',
 'collabdraw_collab8_recog',
 'compositional-abstractions-prior-elicitation_example',
 'curiotower-tdw',
 'curiotower-tdw-height3Jitter3',
 'curiotower_curiodrop',
 'dominoes-pilot_example',
 'graphical_conventions_object_annotation',
 'graphical_conventions_semantic_mapping',
 'graphical_conventions_semantic_mapping_patching',
 'graphical_conventions_semantic_mapping_spline_version_old',
 'human-physics-benchmarking-dominoes-pilot_example',
 'human-physics-benchmarking-gravity-pilot_example',
 'human-physics-benchmarking-linking-pilot_example',
 'human-physics-benchmarking-towers-pilot_example',
 'iternum_classification',
 'photodraw2',
 'semantic_parts_graphical_conventions',
 'svg_annotation_sketchpad_basic_allcats',
 'tools_for_block

Let's **do it**!

In [46]:
for (i,m) in enumerate(J):
    coll.insert_one(m)
    print('{} of {}'.format(i+1, len(J)))
    clear_output(wait=True)

print('Done inserting records into mongo!')
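
Running the insert cell twice adds the same record twice. One way to make the step safe to re-run is to upsert on a key instead, using pymongo's `replace_one(..., upsert=True)`. The sketch below assumes `experimentName` uniquely identifies a record, which the notebook does not state. `upsert_records` is a hypothetical helper, and `FakeCollection` is an in-memory stand-in so the example runs without a database.

```python
def upsert_records(collection, records, key='experimentName'):
    """Insert each record, replacing any existing document with the same `key` value."""
    for record in records:
        collection.replace_one({key: record[key]}, record, upsert=True)

class FakeCollection:
    """In-memory stand-in for a pymongo collection, for illustration only."""
    def __init__(self):
        self.docs = []

    def replace_one(self, query, replacement, upsert=False):
        for i, doc in enumerate(self.docs):
            if all(doc.get(k) == v for k, v in query.items()):
                self.docs[i] = replacement
                return
        if upsert:
            self.docs.append(replacement)

fake = FakeCollection()
demo_records = [{'experimentName': 'demo_example', 'meta': [], 'games': []}]
upsert_records(fake, demo_records)
upsert_records(fake, demo_records)  # second run replaces instead of duplicating
print(len(fake.docs))
```

With a real connection, `upsert_records(coll, J)` would take the place of the `insert_one` loop.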

Done inserting records into mongo!


In [47]:
coll.estimated_document_count()

1

In [48]:
coll.find_one()

{'_id': ObjectId('6073c5823b522a4208bb416b'),
 'meta': [{'stim_ID': 'prior-elicitation-example-stims_image_1',
   'stim_url': 'https://compositional-abstractions-prior-elicitation.s3.amazonaws.com/prior-elicitation-example-stims_image_1.png',
   'games': []},
  {'stim_ID': 'prior-elicitation-example-stims_image_0',
   'stim_url': 'https://compositional-abstractions-prior-elicitation.s3.amazonaws.com/prior-elicitation-example-stims_image_0.png',
   'games': []}],
 'games': [],
 'experimentName': 'compositional-abstractions-prior-elicitation_example'}

In [None]:
list(coll.find())