Explain every entry in the metadata file. #17

Closed
moorepants opened this issue Nov 20, 2014 · 7 comments
@moorepants
Member

This needs to be either in the GATK docs or in the paper or with the data.

@moorepants
Member Author

This is in the paper, but needs to be reviewed for completeness.

@moorepants
Member Author

  • Make sure all of the files listed in the meta data are correct, especially the mapping to the compensation files.
  • Add Cortex versions to the meta data files.

@moorepants
Member Author

@spinningplates @tvdbogert

I'm about to push the data to Zenodo and just went through all of the meta data in detail. I fixed a bunch of errors, but would you all mind looking through the data too to see if you notice any oddities?

You can view the tables of data indexed by trial number here:

http://nbviewer.ipython.org/github/moorepants/walking-sys-id/blob/meta-data/notebooks/meta_data_check.ipynb

@tvdbogert
Member

I did not see any obvious errors, but have a couple of comments:

  • The table with the test conditions for each trial would benefit from
    having a subject ID number. Without that, it's quite a puzzle to
    find the three tests for each subject. You have to go back and
    forth between the tables and use age, mass, height, etc. as clues to
    the subject identity. It's all in the database, so you can write
    code to do this, but a human-readable table might be useful. Or you
    could write code to generate that table?
  • If you introduce subject ID numbers, you no longer have to duplicate
    subject characteristics across the multiple trials. There can be a
    separate (and much shorter) table with subject characteristics. And
    of course, you can keep the database as it is and write code to
    generate that table.
  • We should probably not include trials that were not part of the
    actual study with the three speeds and the perturbation protocol.
    Unless (again...) you write code to extract a list of the relevant
    trials. It is kind of neat to give everything you have, but not if
    that makes it hard to find the data that is likely to be useful.

Ton


@moorepants
Member Author

The meta data is stored in a single file per trial (e.g., https://gist.github.com/moorepants/6bbc495128b181393023) and is located in that trial's directory. I did it this way, instead of using a proper database, to simplify things because no one in the lab seemed interested in using a real database to manage this. Thus, there is redundant "study" and "subject" data in each meta data file so that all the meta data for one trial is with the data files for that trial. The function generate_meta_data_tables() simply scrapes the directory of trials for meta data files and recursively parses them to construct all of the singleton tables (ones without nested structure), which would be akin to single tables in a relational database. These tables are stored in DataFrame objects, which are designed to allow easy reduction, grouping, joining, etc. With those tables, a few lines of code are needed to form any table you like. Line 10 in the link shows an example of merging some data from two tables. If you specify what you'd like to see in a table, I can generate it for you. What you see is simply a raw parsed version so that you can visually look at all the data at one time on the screen. I will generate some simplified tables to go in the paper, and the source code will be shipped along with the paper source.
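For concreteness, here is a rough sketch of that scraping approach (not the project's actual implementation). It assumes the per-trial metadata files are YAML with a directory layout like data/T001/meta-001.yml and that the top-level sections are flat mappings; the real generate_meta_data_tables() and file naming may differ.

```python
# Rough sketch of scraping per-trial metadata files into pandas DataFrames.
# Assumptions (not taken from the repository): YAML metadata, one file per
# trial directory, top-level sections mapping names to flat key/value pairs.
import glob
import os

import pandas as pd
import yaml


def generate_meta_data_tables(data_dir):
    """Collect the flat ("singleton") sections of every trial's metadata
    file into one DataFrame per section."""
    records = {}  # section name -> list of flat dicts, one per trial
    for path in sorted(glob.glob(os.path.join(data_dir, 'T*', 'meta-*.yml'))):
        with open(path) as f:
            meta = yaml.safe_load(f)
        for section, contents in meta.items():
            # Keep only sections without nested structure.
            if isinstance(contents, dict) and not any(
                    isinstance(value, dict) for value in contents.values()):
                records.setdefault(section, []).append(contents)
    return {section: pd.DataFrame(rows) for section, rows in records.items()}


# Merging two of the resulting tables, akin to the join shown in the linked
# notebook; the key name 'subject-id' is an illustrative guess.
tables = generate_meta_data_tables('data')
merged = tables['trial'].merge(tables['subject'], on='subject-id')
```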

I'd like to include all the trials we measured because they include potentially useful data. Code already exists that allows you to query trial numbers from the data I have. I could write some code to store the data in an HDF5 or SQLite database file, and then the database could be queried with libraries that already exist instead of me writing custom bits for scraping a directory tree.
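As a hedged illustration of that option, the scraped tables could be dumped into a single SQLite file with pandas and queried with plain SQL; the table name 'trial' and the column 'nominal-speed' below are assumptions, not the data set's actual keys.

```python
# Minimal sketch: write the DataFrames from the scraping sketch above into
# an SQLite file, then query it with SQL instead of custom directory code.
import sqlite3

import pandas as pd

tables = generate_meta_data_tables('data')  # from the earlier sketch

with sqlite3.connect('meta-data.db') as con:
    for name, df in tables.items():
        df.to_sql(name, con, if_exists='replace', index=False)
    # Example query: all trials at the 0.8 m/s belt speed (column name is
    # an illustrative assumption).
    slow_trials = pd.read_sql(
        'SELECT * FROM trial WHERE "nominal-speed" = 0.8', con)
```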

@tvdbogert
Member

It's OK to have the extra trials as long as it is not a puzzle for the
reader to put the complete perturbation study together, ideally by just
extracting the right files rather than writing code to find them.

Perhaps just generate this table for the paper:

column 1: subject id number
columns 2-5: gender, age, mass, height
column 6: 0.8 m/s trial number
column 7: 1.2 m/s trial number
column 8: 1.6 m/s trial number

That presents a nice bird's-eye view of the dataset and helps people find
the right files without much trouble.

Ton


@moorepants
Member Author

Ok, I'll generate that table.
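For reference, a sketch of how that table might be assembled from the scraped metadata tables; the column names ('subject-id', 'gender', 'age', 'mass', 'height', 'nominal-speed', 'id') are illustrative guesses, not necessarily the actual metadata keys.

```python
# Build the per-subject overview table described above, reusing the tables
# dict from the scraping sketch. All column names are assumptions.
tables = generate_meta_data_tables('data')
merged = tables['trial'].merge(tables['subject'], on='subject-id')

# One row per subject, one column per nominal belt speed, each cell holding
# the trial number for that subject and speed.
speeds = merged.pivot_table(index='subject-id', columns='nominal-speed',
                            values='id', aggfunc='first')
speeds.columns = ['{} m/s trial number'.format(s) for s in speeds.columns]

# Subject characteristics appear once per subject instead of once per trial.
subjects = (merged[['subject-id', 'gender', 'age', 'mass', 'height']]
            .drop_duplicates()
            .set_index('subject-id'))

paper_table = subjects.join(speeds)
print(paper_table.to_latex())
```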
