
Update ndaysnapse to be 'aware' of location for replicated data #26

Open
obenshaindw opened this issue May 17, 2019 · 3 comments

@obenshaindw

Data submitted to NDA through the standard data submission endpoint (NOT BSMN-S3) are distributed across 5 buckets: gpop, NDAR_Central_1, NDAR_Central_2, NDAR_Central_3, and NDAR_Central_4. Requests to the submission API (https://nda.nih.gov/api/submission/docs/swagger-ui.html) return these locations for any files related to a submission.

The following Python functions parse the URL returned by the submission service and return a dictionary with the bucket and key for objects in the NDAR_Central_* and nda-bsmn locations. The dictionary can be passed as arguments to boto functions for working with the S3 API.

    def ndar_central_location(self, file):
        bucket, key = (file['file_remote_path']
                       .split('//')[1]
                       .split('/', 1))
        return {'Bucket': bucket, 'Key': key}

    def nda_bsmn_location(self, file):
        original_key = (file['file_remote_path']
                        .split('//')[1]
                        .split('/', 1)[1]
                        ('ndar_data/DataSubmissions', 'submission_{}/ndar_data/DataSubmissions'.format(self.submission_id)))
        nda_bsmn_key = 'collection_{}/{}'.format(self.collection_id, original_key)
        return {'Bucket': 'nda-bsmn', 'Key': nda_bsmn_key}
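
To make the bucket/key split concrete, here is a minimal standalone sketch of the same string handling outside the class. The helper name `split_s3_url` and the example path are hypothetical, made up for illustration; real values come from the `file_remote_path` field of the submission API response.

```python
def split_s3_url(url):
    """Split an 's3://bucket/some/key' style URL into boto-style Bucket and Key."""
    # Drop the scheme ('s3:'), then split once on the first '/' so that
    # everything after the bucket name stays together as the key.
    bucket, key = url.split('//')[1].split('/', 1)
    return {'Bucket': bucket, 'Key': key}


# Hypothetical file record; only the 'file_remote_path' field is used here.
example_file = {'file_remote_path': 's3://NDAR_Central_1/submission_12345/example.csv'}
location = split_s3_url(example_file['file_remote_path'])
print(location)
```

Because the keys are named `Bucket` and `Key`, the dictionary can be unpacked into a boto3 S3 client call, for example `client.head_object(**location)`.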

These functions are included in an update to the NDASubmissionFiles class. The file argument each accepts is an item from the list returned by /api/submission/submission_id/files, and that response is used as an initialization argument to the NDASubmissionFiles class.

            files = []
            request = requests.get(
                self.submission_api + '/{}/files'.format(s),
                headers=self.headers,
                auth=self.auth
            )
            try:
                files = json.loads(request.text)
                submission_files.append({'files': NDASubmissionFiles(files, collection_id, s),
                                         'collection_id': collection_id,
                                         'submission_id': s})
            except json.decoder.JSONDecodeError:
                print('Error occurred retrieving files from submission {}'.format(s))
                print('Request returned {}'.format(request.text))
@kdaily kdaily self-assigned this May 17, 2019
@kdaily
Member

kdaily commented May 17, 2019

Thanks @obenshaindw!

@kdaily
Member

kdaily commented Sep 10, 2019

@obenshaindw I'm just getting around to replying and closing this, but wanted to confirm that you're missing a replace call in the nda_bsmn_location function. The function definition should be:

    def nda_bsmn_location(self, file):
        original_key = (file['file_remote_path']
                        .split('//')[1]
                        .split('/', 1)[1]
                        .replace('ndar_data/DataSubmissions', 'submission_{}/ndar_data/DataSubmissions'.format(self.submission_id)))
        nda_bsmn_key = 'collection_{}/{}'.format(self.collection_id, original_key)
        return {'Bucket': 'nda-bsmn', 'Key': nda_bsmn_key}

@obenshaindw
Author

@kdaily you are correct: original_key should be using a replace call. Looking at it again, though, since we are transforming original_key to get nda_bsmn_key, it would be more readable to write it as:

    def nda_bsmn_location(self, file):
        original_key = (file['file_remote_path']
                        .split('//')[1]
                        .split('/', 1)[1])
        nda_bsmn_key = 'collection_{}/{}'.format(self.collection_id,
                                                 original_key
                                                 .replace('ndar_data/DataSubmissions',
                                                          'submission_{}/ndar_data/DataSubmissions'
                                                          .format(self.submission_id)))
        return {'Bucket': 'nda-bsmn', 'Key': nda_bsmn_key}
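
As a quick sanity check of the key rewrite, the same logic can be pulled into a standalone function and run on a made-up path. This is only a sketch: the function name `bsmn_key`, the bucket name, and the collection and submission IDs below are hypothetical, not values from NDA.

```python
def bsmn_key(file_remote_path, collection_id, submission_id):
    """Build the nda-bsmn key for a file from its original remote path."""
    # Strip the scheme and the bucket name, keeping only the original key.
    original_key = file_remote_path.split('//')[1].split('/', 1)[1]
    # Insert the submission prefix in front of the DataSubmissions folder.
    rewritten = original_key.replace(
        'ndar_data/DataSubmissions',
        'submission_{}/ndar_data/DataSubmissions'.format(submission_id))
    # Prefix the whole key with the collection folder.
    return 'collection_{}/{}'.format(collection_id, rewritten)


# Hypothetical inputs for illustration only.
key = bsmn_key('s3://example-bucket/ndar_data/DataSubmissions/sample.csv', 1234, 5678)
print(key)
```

With these inputs the key comes out as `collection_1234/submission_5678/ndar_data/DataSubmissions/sample.csv`, which is the layout the replace call is meant to produce.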
