Improve metadata #12
Conversation
It probably makes sense for me to look at the
I reviewed all files in the data_models directory. Overall it is written very well. I found a few places that may need clarification and some coding style issues. One of the main reasons why I am so obsessed with docstrings in a Python class is that in the future we can compile a README file dynamically with tools such as sphinx.
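For what that could look like, here is a minimal sketch (the class body and wording are illustrative, not the actual data_refinery_models code) of the kind of docstring that Sphinx's autodoc can later pull into generated documentation:

```python
class SurveyJob:
    """Records a single run of a surveyor against an external data source.

    Even a one-line summary like the first line above is enough for
    documentation generators: Sphinx's autodoc reads the docstring
    straight off the class object.
    """

# Docstrings are plain runtime attributes, which is what autodoc relies on:
summary = SurveyJob.__doc__.splitlines()[0]
print(summary)
```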
Something that is not directly related to this PR: Since you are expanding the django models on this project, do you have a database schema online somewhere? That will make both code review and your future life easier. As an example, here is adage web server's database schema drawn by Casey, Rene, Matt and me:
https://sketchboard.me/rzMAA9ecMFEh
(It is up to you which tool you want to use to draw it. We used sketchboard because Casey bought it.)
from data_refinery_models.models.base_models import TimeTrackedModel


class SurveyJob(TimeTrackedModel):
Can you add a docstring for this class too? (If you feel it is obvious, just one line will be good enough.) Thanks.
from data_refinery_models.models.surveys import SurveyJob


class BatchStatuses(Enum):
A docstring of class description will be nice.
PROCESSED = "PROCESSED"


class Batch(TimeTrackedModel):
A docstring of class description will be nice.
survey_job = models.ForeignKey(SurveyJob, on_delete=models.PROTECT)
source_type = models.CharField(max_length=256)
size_in_bytes = models.IntegerField()
download_url = models.CharField(max_length=2048)
Just curious: Is this download URL long enough? (I know that many browsers support a URL as long as 4K.)
Hmm I suppose it'd be better to have an unnecessarily large field here than have to make a migration later just to extend this.
from data_refinery_models.models.batches import Batch


class ProcessorJob(TimeTrackedModel):
A docstring of class description will be nice.
try:
    taxonomy_id = get_taxonomy_id_scientific(name)
    is_scientific_name = True
except UnscientifcNameError:
If UnscientifcNameError is a typo, it should be corrected here too.


class SurveyJobKeyValue(TimeTrackedModel):
    """
    This table is used for tracking fields onto a SurveyJob record that
It might be better to use "This class" instead of "This table", since the table is implementation details hidden in the database (with a slightly different name), and the docstring here refers to the current class.
call(ESEARCH_URL,
     {"db": "taxonomy", "term": "HUMAN"})])

# The first call should have stored it in the database so
I suppose "it" here refers to the requested organism? Do you mind elaborating this line a little bit to make the comment more readable? Thanks.
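To make the comment's intent concrete, here is a hedged sketch of the lookup-then-cache behaviour being described (the real code stores results in the database via the Organism model; the dict cache and fake_fetch below are stand-ins):

```python
# First lookup for a name hits the remote taxonomy service; the result is
# stored locally so later lookups never call out again.
_organism_cache = {}

def get_id_for_name(name, fetch):
    name = name.upper()
    if name not in _organism_cache:
        _organism_cache[name] = fetch(name)  # remote call, first time only
    return _organism_cache[name]

calls = []

def fake_fetch(name):
    # Stand-in for the NCBI esearch call.
    calls.append(name)
    return 9606  # NCBI taxonomy id for Homo sapiens

first = get_id_for_name("Human", fake_fetch)
second = get_id_for_name("HUMAN", fake_fetch)
```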
{"db": "taxonomy", "id": "9606"}
)

# The first call should have stored it in the database so
See my previous comment on "it".
new_id = Organism.get_id_for_name("BLAH")

self.assertEqual(new_id, 0)
mock_get.assert_not_called()
Great test cases overall! 💯
@@ -1,5 +1,5 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.11 on 2017-05-02 18:50
# Generated by Django 1.10.6 on 2017-05-23 15:17
I skipped this file, is it okay?
Yep, it was automatically generated.
from django.utils import timezone


class TimeTrackedModel(models.Model):
Just realized that this class needs docstring too. Thanks.
from data_refinery_models.models.surveys import SurveyJob


class BatchStatuses(Enum):
Docstring please.
Two more places for docstrings. Nothing else. Thanks.
I approve the code that I reviewed in this PR.
Ok cool, I am reviewing the files in the
def get_name_for_id(cls, taxonomy_id: int) -> str:
    try:
        organism = (cls.objects
                    .filter(taxonomy_id=taxonomy_id)
Is there any case when the taxonomy_id would not be unique?
Yes. If the same organism was listed with two different names in two different batches but they both corresponded to the same NCBI taxonomy_id (since NCBI has a list structure for names the same ID can match multiple names), then there would be one record for each of those names, however they would share an id.
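A small illustration of that situation, with made-up records, mirroring why a .filter(...)-style lookup is used rather than .get(), which would raise MultipleObjectsReturned here:

```python
# Several name records can share one NCBI taxonomy_id, so a lookup by id
# has to tolerate multiple matches. (Records are illustrative.)
records = [
    {"name": "HOMO SAPIENS", "taxonomy_id": 9606},
    {"name": "HUMAN", "taxonomy_id": 9606},
]

matches = [r for r in records if r["taxonomy_id"] == 9606]
# Uniqueness cannot be assumed; take the first match, as .first() would.
name = matches[0]["name"]
```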
Hmmm... I think it would be interesting to see if that just corresponded to different names for the same actual organism object vs. different names for different organism objects (but then why would they have the same taxonomy ID)? Are there any examples in the stuff that was returned from the NCBI API of this?
I'm not quite sure I understand what you're asking. What do you mean by an organism object?
Added a couple more comments & questions, but will continue reviewing the other files.
@@ -16,68 +17,111 @@
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

EXPERIMENTS_URL = "https://www.ebi.ac.uk/arrayexpress/json/v3/experiments/"
Again here - would it make sense for the users to be able to override this in their settings.py file?
I don't think it would because if that URL ever changes then I think there will be additional changes that need to happen to work with that URL. We can also just change this in the code instead of in the settings.py file if we need to.

NCBI_ROOT_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
ESEARCH_URL = NCBI_ROOT_URL + "esearch.fcgi"
EFETCH_URL = NCBI_ROOT_URL + "efetch.fcgi"
I think that sounds reasonable for the organisms - if you think this is a better fit then go for it.
In terms of the URLs, I think that at some point we might want to get this information from another URL endpoint. Also, what if we have a saved XML file with the organism information we want and want to pass it to these methods and model? What parameters would we need to abstract to make this possible?
Okay, I've made it through the workers and other files. I have several questions and suggestions, but overall this looks like a major step forward. Looking forward to seeing this thing work!
@@ -1,5 +1,5 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.11 on 2017-05-02 18:50
# Generated by Django 1.10.6 on 2017-05-26 15:12
I don't usually look at migration files, but happened to notice: is this switch from Django 1.11 to 1.10.6 intentional?
I did this a while ago so I'm not 100% certain what it was but I believe I found a bug that was introduced in 1.11. I wish I had left a comment in the requirements.in file.
Sounds intentional then. 👍
run_rabbitmq.sh (Outdated)
@@ -0,0 +1,3 @@
#!/bin/bash

docker run -d --hostname my-rabbit --name some-rabbit rabbitmq:3
These names my-rabbit and some-rabbit look like defaults from a tutorial. Would it make sense to use something that ties this service in with its function in this implementation a little more? I'm assuming that these are just names here... if not, then the rabbitmq team is having way more fun than I would expect. 😄
RUN groupadd user && useradd --create-home --home-dir /home/user -g user user
WORKDIR /home/user

COPY workers/environment.yml .
Question: does this copy of the environment.yml file ever get used again somewhere else? I'm wondering why we would make a copy of the file and then run it rather than just modify the RUN command below to say

RUN conda env create -f workers/environment.yml

Maybe there's a good reason, but it seems cleaner to me not to have multiple copies lying around.
This COPY command is not just copying the file from one directory to another, but rather from the directory that the Dockerfile is being run from into the docker container. Docker containers start off completely empty, so

RUN conda env create -f workers/environment.yml

wouldn't work because the container can't access a file until it's copied into the container.
That was the piece I was missing. Thanks! Makes sense now.

ENV PATH "/opt/conda/envs/workers/bin:$PATH"

COPY data_models/dist/data-refinery-models-* data_models/
I have a similar question to the one above about why we copy all of the files and then (seemingly) only ever refer to the (ls data_models -1 | tail -1)th file... maybe I'm not quite following what happens with the pip install at line 15 below.
The answer to this question is similar to the answer to the one above it. Because the RUN command is always executed within the context of the container, I cannot run (ls data_models -1 | tail -1) in the Dockerfile and have it interact with files/directories on the host machine without first copying the files into the container. If it were possible, I certainly would agree that figuring out the correct file and THEN copying it into the container would be greatly preferable.
    raise
finally:
    zip_ref.close()
    os.remove(file_path)
Do you mean to remove the .zip file if an exception occurs during extraction? Maybe this line should go after line 55 above? I guess in that case the zip_ref.close() would also have to happen first...

Anyway, just want to be sure that deleting the .zip on an exception is desired. If not, some change to the logic here is warranted.
I think this is correct because if extraction fails then the downloader job will fail and need to be retried. When that happens the zip file will be re-downloaded so it won't be an issue that it's been deleted. Additionally the worker that picks up the new downloader job won't necessarily be the same one so the zip file could end up sitting on the original worker's disk forever.
Good plan, then. Thanks for the clear explanation!
I was thinking that it would be easier to diagnose a failure if the .zip file were left behind, but automated re-trying makes sense. If the error is repeatable it could probably be diagnosed with a newly downloaded .zip file as well.
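The cleanup strategy settled on above can be sketched like this (extract_and_discard is a hypothetical stand-in for the downloader's logic, not the project's actual function name):

```python
import os
import tempfile
import zipfile

def extract_and_discard(zip_path, dest_dir):
    try:
        with zipfile.ZipFile(zip_path) as zip_ref:
            zip_ref.extractall(dest_dir)
    finally:
        # Runs on success *and* on failure, like the os.remove in the
        # finally block under review: the archive never lingers on the
        # worker's disk, since a retried job re-downloads it anyway.
        os.remove(zip_path)

tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "batch.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("data.txt", "hello")

extract_and_discard(zip_path, tmp)
zip_removed = not os.path.exists(zip_path)
extracted = os.path.exists(os.path.join(tmp, "data.txt"))
```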
processor_job.save()

kwargs = utils.start_job({"job": processor_job})
self.assertFalse(kwargs["success"])
It's not obvious to me why this should fail. Can you make a comment about why this would be expected here?
utils.run_pipeline(job_dict, [mock_processor])
mock_processor.assert_called_once()
processor_job.refresh_from_db()
self.assertFalse(processor_job.success)
Why does this fail?
Oh, this time I get it. Just caught line 135.

def test_value_passing(self):
    """The keys added to kwargs and returned by processors will be
    passed through to other processors."""
Multi-line docstring should have ending """ on a line by itself... see above.
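The convention being requested is the standard PEP 257 one: a one-line docstring keeps both sets of quotes on one line, while a multi-line docstring puts the closing quotes on a line of their own. A small sketch:

```python
def one_liner():
    """Do one small thing."""

def multi_liner():
    """Summarize the function on the first line.

    Further detail goes in the body, and the closing quotes
    end up alone on their own line.
    """

# The closing-quotes-on-their-own-line form leaves the raw docstring
# ending in a newline plus the body's indentation:
ends_cleanly = multi_liner.__doc__.endswith("\n    ")
```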
mock_dict = {"something_to_pass_along": True,
             "job": processor_job,
             "batches": [batch]}
mock_dict["something_to_pass_along"] = True
This line is redundant with the end of line 156. Did you mean to leave it here?
I don't think so, good catch.
except Batch.DoesNotExist:
    logger.error("Cannot find batch record with ID %d.", job.batch_id)
batch_relations = ProcessorJobsToBatches.objects.filter(processor_job_id=job.id)
batches = list(map(lambda x: x.batch, batch_relations))
This line can be simpler as described above:
batches = [br.batch for br in batch_relations]
I have a few questions and suggestions, mainly about list and set comprehensions and how errors are handled. Thanks @kurtwheeler!
the line length limit for the autopep8 and flake8 linters. If you run either
of those programs from anywhere within the project's directory tree they will
enforce a limit of 100 instead of 80. This will also be true for editors which
rely on them.
Nice documentation! 👍
SurveyJobKeyValue
    .objects
    .get(survey_job_id=self.survey_job.id,
         key__exact="experiment_accession_code")
What will happen here if no SurveyJobKeyValue objects match the survey_job_id and key passed? Do we just expect django to return that error to the users in the command line?
This survey function is actually called by surveyor.run_job within a try block. If there's no match here the exception will be caught, logged, and the survey_job will be marked as a failure. Given that an experiment_accession_code is required here I think that's the appropriate behavior.
Ok, that sounds good
experiment = self.get_experiment_metadata(experiment_accession_code)

r = requests.get(SAMPLES_URL.format(experiment_accession_code))
samples = r.json()["experiment"]["sample"]
I'm also curious about what the expected behavior is here if the API is down or you get a response that cannot be serialized into json - do you let the Python exception appear to the user in the command line?
My answer here is basically the same as the one above except that I'll add that at some point surveying will be triggered by a foreman who will be responsible for starting and retrying survey_jobs. If the API is down and the job fails then once I get around to writing the foreman code it'll have mechanisms to retry the survey_jobs a few times so hopefully the API will come back up before the retry limit is reached.
👍
))

# Group batches based on their download URL and handle each group.
download_urls = set(map(lambda x: x.download_url, batches))
I think @mhuyck had a similar suggestion in his review. I think you can do a set comprehension for this line to make it simpler:
download_urls = {bt.download_url for bt in batches}
# Group batches based on their download URL and handle each group.
download_urls = set(map(lambda x: x.download_url, batches))
for url in download_urls:
    batches_with_url = list(filter(lambda x: x.download_url == url, batches))
Likewise, I think you can do:

batches_with_url = [bt for bt in batches if bt.download_url == url]
organism_id = Organism.get_id_for_name(organism_name)

for sample_file in sample["file"]:
    if sample_file["type"] != "data" and sample_file["name"] is not None:
Would there be any case where sample_file["name"] would be False or "" instead of None, but we would still not want the sample_file?
Ok this line of code confused me. I can't tell why we would ever want to continue if sample_file["name"] is None since we use that value later and call split on it. I think it's probably just a bug so I'm going to remove everything from the and onward.
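The distinction raised above can be shown with a small sketch (these sample_file dicts are made up for the example; the real data comes from the ArrayExpress API):

```python
# An `is not None` check lets the empty string "" (and False) through,
# while a plain truthiness check filters out every useless name.
sample_files = [
    {"type": "scan", "name": "sample1.cel"},
    {"type": "scan", "name": ""},
    {"type": "scan", "name": None},
    {"type": "data", "name": "ignored.txt"},
]

kept_not_none = [f for f in sample_files
                 if f["type"] != "data" and f["name"] is not None]
kept_truthy = [f for f in sample_files
               if f["type"] != "data" and f["name"]]
```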
"""
This command will create and run survey jobs for each experiment in the
experiment_list. experiment list should be a file containing one
experiment accession code per line.
👍
def handle(self, *args, **options):
    if options["experiment_list"] is None:
        logger.error("You must specify an experiment list.")
        return 0
Doesn't a 0 return value mean that a function has succeeded?
Oh yeah, good catch.
foreman/requirements.txt (Outdated)
@@ -7,11 +7,16 @@
amqp==2.1.4  # via kombu
billiard==3.5.0.2  # via celery
celery==4.0.2
django==1.10.6
django==1.11
I see you are running different versions of django in different parts of this project. I'm just curious, any reason for that?
Hmm, whoops. I don't think I have a reason to be doing that. I think I may have run into a bug that was introduced in 1.11 though, so that's why most of the project is on 1.10.6. I'll change this to match that.
👍
@@ -26,7 +26,7 @@ def test_calls_survey(self, survey_method):
job.save()
surveyor.run_job(job)

survey_method.assert_called_with(job)
survey_method.assert_called_with()
Hmmm, now that this method isn't taking this argument, I'm wondering what exactly it is doing. Should it be called assert_called_with_job or something of the sort? Also, where is it defined again?
This should just be assert_called. The survey_method Mock is passed in by the @patch decorator, which is mocking surveyor.ArrayExpressSurveyor's survey method.

The reason I am no longer passing job to this method is that the ArrayExpressSurveyor object has the job as an instance variable, so it doesn't need to have it passed in.
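The pattern under discussion can be sketched with unittest.mock (the class and functions here are simplified stand-ins for the real surveyor module, not its actual code):

```python
from unittest import mock

class ArrayExpressSurveyor:
    def __init__(self, job):
        self.job = job  # the job is held as an instance variable

    def survey(self):
        return "surveyed " + self.job

def run_job(job):
    # The surveyor already holds the job, so survey() takes no arguments.
    ArrayExpressSurveyor(job).survey()

with mock.patch.object(ArrayExpressSurveyor, "survey") as survey_method:
    run_job("job-1")
    survey_method.assert_called()        # invoked at least once
    survey_method.assert_called_with()   # and with no arguments passed in
```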
Ok, if you change it to assert_called, I think that it makes sense.
@@ -0,0 +1,3 @@
#!/bin/bash

docker run -d --hostname rabbit-queue --name message-queue rabbitmq:3
👍
target_file_path,
batch.id,
job_id)
if success:
👍
request.close()
# The files for all of the batches in the grouping are
# contained within the same zip file. Therefore only
# download the one.
👍
self.assertEqual(len(kwargs["batches"]), 2)

def test_failure(self):
    """Fails because there are no batches for the job."""
👍
utils.end_job(job, None, False)
return
batch_relations = DownloaderJobsToBatches.objects.filter(downloader_job_id=job_id)
batches = [br.batch for br in batch_relations]
👍
except Batch.DoesNotExist:
    logger.error("Cannot find batch record with ID %d.", job.batch_id)
batch_relations = ProcessorJobsToBatches.objects.filter(processor_job_id=job.id)
batches = [br.batch for br in batch_relations]
👍
Except for a handful of
OK @rzelayafavila I just pushed the last change you requested I think. Is it lookin good?
Ok, LGTM @kurtwheeler! Good stuff!
This PR got out of hand. The original scope of it was to improve the metadata we were collecting about batches of data from Array Express. However halfway through completing that it was decided that we should change the batch size for Array Express from an experiment to a single sample. Since the batches the metadata was about were changing, it didn't seem to make sense to change that metadata without also changing the batch size. However in the process of doing that I discovered that it's not possible to download a single batch worth of data (i.e. a single sample). Therefore I needed to also refactor the assumption that one downloader job == one batch. This was a large change that affected a lot of the existing code. I also added a lot of new test code to previously untested functionality, because up until these changes that functionality was changing so much that I was hesitant to write tests for it.

So that's the story of why I now have so many changes in a single PR. I'm not entirely sure how to proceed with this PR. I don't think that too much of it can be broken out into smaller PRs. The only large chunk of code that makes sense on its own is the organism model code. However breaking this out will make things somewhat more complicated because the addition of the organism model code prompted me to break the models.py file up into several files contained in a module. I then made changes to other models as I implemented the rest of the functionality in this branch.

Anyway... I'm happy to do whatever makes this PR more digestible for reviewers. If that means breaking out the organism model code into its own PR and duplicating the split of models.py in two PRs I can do that. If it's easier to leave it in this PR but have a separate reviewer look at the organism model code I can do that.