OSF Data Download and Upload Scripts #25

Merged
merged 42 commits into from
Feb 20, 2022

Conversation

arjunsingh3600
Collaborator

@thisTyler @jedyeo @calhep
TEMUpload uploads data generated by TEMSimulator to the CryoEM-dataset page, using the structure discussed here.

The implementation for generate_tags_from_tem() and test_post_files() will be merged in an upcoming PR.

@arjunsingh3600
Collaborator Author

The unit tests for TEMUpload are currently failing because they require an authorization token from OSF.io to run.
I do have an authorization token that is scoped to the OSF page, but I wasn't sure what the best practice for sharing it is.
Is there anything in the existing git workflow to accommodate this?

Contributor

@geoffwoollard geoffwoollard left a comment

Overall looks very clean. I liked the typing.

It would be good to include some screenshots of what the code does in the PR, i.e. how things look in a browser once they are uploaded.

Also, let's try to find some fix for the private token... The issue is that we don't want just anyone to be able to interact with OSF, only those with the token. And if we put the token on GitHub, then it's compromised...
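
One common way to handle this, sketched here under assumptions and not taken from the PR, is to keep the token out of the repository and read it from an environment variable at test time. On GitHub Actions the variable can be populated from an encrypted repository secret. The variable name `OSF_TOKEN`, the helper `get_osf_token`, and the test class below are hypothetical names chosen for illustration.

```python
import os
import unittest

# Hypothetical variable name; the PR does not say how the token is supplied.
TOKEN_ENV_VAR = "OSF_TOKEN"


def get_osf_token(environ=os.environ):
    """Return the OSF token from the environment, or None if unset or blank."""
    token = environ.get(TOKEN_ENV_VAR, "").strip()
    return token or None


class TestUploadWithToken(unittest.TestCase):
    """Placeholder test case that only runs when a token is available."""

    @unittest.skipUnless(get_osf_token(), f"{TOKEN_ENV_VAR} is not set")
    def test_token_is_available(self):
        # A real test would pass the token to the upload code here.
        self.assertTrue(get_osf_token())
```

With this shape, contributors without a token see the OSF tests reported as skipped instead of failing, and the token itself never appears in the repository.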

@geoffwoollard
Contributor

There are some issues with importing. Have a look at the details of the failed Linting and Testing:

==================================== ERRORS ====================================
____________________ ERROR collecting tests/test_fourier.py ____________________
ImportError while importing test module '/home/runner/work/ioSPI/ioSPI/tests/test_fourier.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_fourier.py:5: in <module>
    from ioSPI import fourier
E   ImportError: cannot import name 'fourier' from 'ioSPI' (/home/runner/work/ioSPI/ioSPI/__init__.py)
__________________ ERROR collecting tests/test_tem_upload.py ___________________
ImportError while importing test module '/home/runner/work/ioSPI/ioSPI/tests/test_tem_upload.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_tem_upload.py:9: in <module>
    from ioSPI.ioSPI import tem_upload
ioSPI/tem_upload.py:6: in <module>
    from simSPI.simSPI import tem
E   ModuleNotFoundError: No module named 'simSPI'
__________ ERROR collecting tests/test_iotools/test_atomic_models.py ___________
ImportError while importing test module '/home/runner/work/ioSPI/ioSPI/tests/test_iotools/test_atomic_models.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_iotools/test_atomic_models.py:8: in <module>
    from ioSPI.iotools.atomic_models import read_gemmi_model, write_gemmi_model
E   ModuleNotFoundError: No module named 'ioSPI.iotools'
___________ ERROR collecting tests/test_iotools/test_cryo_dataset.py ___________
ImportError while importing test module '/home/runner/work/ioSPI/ioSPI/tests/test_iotools/test_cryo_dataset.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_iotools/test_cryo_dataset.py:6: in <module>
    from ioSPI.iotools import cryodataset as cryo_dataset
E   ModuleNotFoundError: No module named 'ioSPI.iotools'

@geoffwoollard
Contributor

You need to install simSPI from GitHub in https://github.com/compSPI/ioSPI/blob/dataset/upload/environment.yml

See how simSPI imports ioSPI from GitHub: https://github.com/compSPI/simSPI/blob/master/environment.yml#L23

@fredericpoitevin fredericpoitevin requested review from fredericpoitevin and removed request for fredericpoitevin January 13, 2022 17:54
@arjunsingh3600 arjunsingh3600 marked this pull request as draft January 27, 2022 01:21
@arjunsingh3600
Collaborator Author

Converted to draft to review and refactor

@geoffwoollard
Contributor

I'm confused. Are we going ahead with PR #40, or is a major move of tem specific code from ioSPI --> simSPI called for?

@geoffwoollard geoffwoollard mentioned this pull request Feb 15, 2022
@thisFreya
Collaborator

@geoffwoollard I'm confused. Are we going ahead with PR #40, or is a major move of tem specific code from ioSPI --> simSPI called for?

We are currently in the process of adjusting this PR to match up with master (which has had PR#46 merged in). That PR (refactoring) moved things into ioSPI that were TEM-specific - the major move has already happened. All the code we will push into ioSPI will be TEM-agnostic, whereas TEM-specific items will be pushed into their relevant places in simSPI. Unfortunately this will take us some time to get sorted which has caused some confusion - apologies!

@codecov

codecov bot commented Feb 16, 2022

Codecov Report

Merging #25 (8614ad7) into master (1f37233) will decrease coverage by 1.10%.
The diff coverage is 93.62%.

@@            Coverage Diff             @@
##           master      #25      +/-   ##
==========================================
- Coverage   97.53%   96.43%   -1.09%     
==========================================
  Files           4        5       +1     
  Lines         121      168      +47     
==========================================
+ Hits          118      162      +44     
- Misses          3        6       +3     
Impacted Files Coverage Δ
__init__.py 100.00% <ø> (ø)
ioSPI/datasets.py 93.62% <93.62%> (ø)
ioSPI/__init__.py

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@arjunsingh3600 arjunsingh3600 marked this pull request as ready for review February 16, 2022 18:26
Contributor

@ninamiolane ninamiolane left a comment

Looking good, great job on connecting with OSF through the tokens!! 🙌 not an easy task.

Important remarks:

  • As mentioned in the previous code reviews, ioSPI cannot have elements that are simulator-specific: ioSPI is simulation-agnostic. The name TEMUpload does not go in this direction and should be changed, there should be no sim_config.yml, etc.
  • This PR also does not seem to respect the design that we discussed for ioSPI:
    • The io functions go in files that are named after the biological structures they deal with. In this case, if you are downloading or uploading atomic models ("molecules"), then these functions should go in atomic_models.py (your file osf_upload.py is misnamed; we do not name files after where we are pulling the data from).
    • We also decided that we would respect a write/read convention to upload or download data. Is there any reason why you are not using it (you use "post", "get", etc.)?

Let me know?



class TEMUpload:
"""Class to upload data to OSF.io generated by simSPI TEM Simulator.
Contributor

The ioSPI module is agnostic to the TEM simulator (and to any type of simulator).

Either:

  • put only functions that know how to upload to OSF (independently of where the data initially came from)
  • if this is TEM-specific, it should go in simSPI

Contributor

Pretty much this whole file is already simSPI-agnostic. I think you can just change some docstrings and function / class names for generality. I didn't notice much code at all that was tied to the type of data that will be uploaded.

Returns
-------
dict of type str : str
Returns dictionary of node labels mapped to node GUIDs
Contributor

NIT: Missing . at the end
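
For context on the return value described in that docstring, here is a minimal sketch of building a mapping from node labels to node GUIDs. It assumes a JSON:API-style payload of the kind the OSF API returns when listing nodes, with each node's GUID under `id` and its label under `attributes.title`. The function name `read_children_labels` and the sample data are hypothetical, not taken from the PR.

```python
from typing import Dict


def read_children_labels(response_json: dict) -> Dict[str, str]:
    """Map each child node's label (title) to its GUID.

    Assumes a payload shaped like:
    {"data": [{"id": "<guid>", "attributes": {"title": "<label>"}}]}
    """
    return {
        node["attributes"]["title"]: node["id"]
        for node in response_json.get("data", [])
    }


# Made-up sample payload, for illustration only.
sample = {
    "data": [
        {"id": "abc12", "attributes": {"title": "internal"}},
        {"id": "xyz89", "attributes": {"title": "public"}},
    ]
}
labels = read_children_labels(sample)
```

If two children share a title, the later one overwrites the earlier one in the dictionary, which is worth keeping in mind when labels are used as keys.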

See Also
--------
Protein Data Bank(PDB) : https://www.rcsb.org/
EM DataR esouce(EMDB) : https://www.emdataresource.org/
Contributor

Typo

Parameters
----------
token : str
Personal token from OSF.io with access to cryoEM dataset.
Contributor

Could generalize: "Personal token from OSF.io with access to dataset (e.g. cryoEM, etc.)."

Comment on lines 1 to 5
# absolute paths
pdb_file: './test_files/4v6x.pdb'
mrc_keyword: '_randomrot'
output_dir: './test_files'
local_sim_dir: './TEM-simulator'
Contributor

@ninamiolane this seems fine to me. While the ioSPI code is agnostic to simulator, it needs to be tested on something concretely.

@@ -0,0 +1,39 @@
molecular_model:
Contributor

This is part of the "test meta data" that is uploaded. It makes sense to me to include it in the test to be uploaded.

random.choice(string.ascii_letters) for i in range(5)
)

print(f"Creating test node CryoEM Dataset -> internal -> {test_node_label} ")
Contributor

@geoffwoollard geoffwoollard Feb 16, 2022

Won't your code generalize beyond cryoEM datasets? If so, just refer to "Dataset", not "CryoEM Dataset".

Contributor

@geoffwoollard geoffwoollard left a comment

Generalize for any dataset, as @ninamiolane is emphasizing. It seems that it shouldn't be too much work.

@geoffwoollard
Contributor

geoffwoollard commented Feb 16, 2022

We are currently in the process of adjusting this PR to match up with master (which has had PR#46 merged in). That PR (refactoring) moved things into ioSPI that were TEM-specific - the major move has already happened. All the code we will push into ioSPI will be TEM-agnostic, whereas TEM-specific items will be pushed into their relevant places in simSPI. Unfortunately this will take us some time to get sorted which has caused some confusion - apologies!

Ok so don't merge any more code that isn't aligned with the vision of ioSPI.

@arjunsingh3600
Collaborator Author

@ninamiolane @geoffwoollard Thank you for the review! As pointed out, while the script was TEM-agnostic, the docstrings and method names certainly weren't. I've made a few changes based on the feedback above.

  • This module houses functions that can be used to upload datasets (micrographs + metadata) to OSF. Since it deals with a compound of two structures/datatypes, I renamed the file to "datasets.py".
  • Since the script isn't reading from or writing to the file system, but is rather a helpful wrapper that bundles API calls to OSF, my instinct was to reflect the API call in the function names. The thinking here was that a new developer might look for GET/POST methods when encountering a new API class. However, I've refactored the names into a read/write format, with references to get/post calls in the description where applicable.
  • I've replaced all instances of the word "molecule" with "structure", since it seemed more in line with the vocabulary being used in the repository.

Please let me know if these changes work.
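
The read/write naming described above, with the underlying HTTP verb noted in each docstring, could look roughly like the following. This is a sketch under assumptions and not the code merged in this PR: the function names are hypothetical, and the endpoint path, bearer-token header, and payload shape are assumptions about the public OSF API v2 that should be checked against its documentation. The functions only build requests and never send them.

```python
import json
import urllib.request

# Assumed base URL of the public OSF API v2.
OSF_API = "https://api.osf.io/v2"


def _headers(token: str) -> dict:
    """Headers assumed for an OSF personal access token and a JSON:API body."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/vnd.api+json",
    }


def read_children_request(token: str, node_guid: str) -> urllib.request.Request:
    """Build the request that reads a node's children (HTTP GET)."""
    url = f"{OSF_API}/nodes/{node_guid}/children/"
    return urllib.request.Request(url, headers=_headers(token), method="GET")


def write_child_request(
    token: str, node_guid: str, title: str
) -> urllib.request.Request:
    """Build the request that writes a new child node (HTTP POST)."""
    payload = {
        "data": {
            "type": "nodes",
            "attributes": {"title": title, "category": "data"},
        }
    }
    url = f"{OSF_API}/nodes/{node_guid}/children/"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers=_headers(token), method="POST"
    )
```

A caller would pass the returned object to `urllib.request.urlopen` to actually send it. Keeping request construction separate from sending also makes the wrapper testable without a token or network access.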

@ninamiolane ninamiolane changed the title TEM Data Upload Script OSF Data Download and Upload Scripts Feb 20, 2022
Contributor

@ninamiolane ninamiolane left a comment

Excellent, thank you very much!

@ninamiolane ninamiolane merged commit 5943c63 into master Feb 20, 2022
@fredericpoitevin fredericpoitevin deleted the dataset/upload branch March 25, 2022 01:53
6 participants