
Refactored bulk uploading of objects #382

Merged
merged 11 commits into master from issue-381-change-bulk-upload on Oct 14, 2020

Conversation

ajstewart
Contributor

  • Moved to generator and iteration uploading.
  • Removed the storage of meas_dj_obj and src_dj_obj.
  • Links made on id values instead of objects.

Fixes #381.

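The batched, id-based upload flow the description refers to can be sketched in plain Python. This is a hypothetical stand-in for illustration only: `make_models` and `upload_in_batches` mimic the pipeline's generator-plus-bulk-upload pattern, while the real code uses Django models and `bulk_create`.

```python
from itertools import islice

def make_models(rows):
    """Generator: build one model dict per input row instead of a full list."""
    for row in rows:
        yield {'name': row['name']}

def upload_in_batches(models, batch_size, store):
    """Consume the generator batch by batch, returning the new ids so that
    later steps can link on id values instead of holding whole objects."""
    ids = []
    it = iter(models)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        for m in batch:
            m['id'] = len(store) + 1  # stand-in for the database-assigned pk
            store.append(m)
            ids.append(m['id'])
    return ids

store = []
ids = upload_in_batches(
    make_models([{'name': 'a'}, {'name': 'b'}, {'name': 'c'}]),
    batch_size=2,
    store=store,
)
```

Because only one batch of model objects exists at a time, peak memory stays bounded by `batch_size` rather than by the full table size.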
@ajstewart ajstewart added the enhancement New feature or request label Oct 12, 2020
@ajstewart ajstewart self-assigned this Oct 12, 2020
@github-actions github-actions bot added this to In progress in Pipeline Backlog Oct 12, 2020
@ajstewart ajstewart mentioned this pull request Oct 12, 2020
@ajstewart ajstewart marked this pull request as ready for review October 12, 2020 23:24
@ajstewart
Contributor Author

@srggrs @marxide After a successful full test this is ready. You'll see that I put the generator functions into their own new file. However I am also happy to move these to somewhere like loading.py if you think that's better.

@ajstewart
Contributor Author

Just to record some of the information in this PR, below is a before-and-after plot of RAM usage on a full tiles run of the VAST Pilot Survey (the after run does not include the image injection stage):
[screenshot: RAM usage before the change, 2020-10-11]
[screenshot: RAM usage after the change, 2020-10-13]

The high memory usage during the two-epoch pair creation and upload highlighted the issue that led to these changes.

Contributor

@srggrs srggrs left a comment


I think we should aim to do 2 things:
1 - pass the dataframe to upload* and inside that create the id column
2 - aim at building a generator that would return the chunk directly, see example above

def source_model_f(row, pipe_run_id):
    name = (
        f"ASKAP_{deg2hms(row['wavg_ra'])}"
        f"{deg2dms(row['wavg_dec'])}".replace(":", "")
    )
    src = Source()
    src.run_id = pipe_run_id
    src.name = name
    # copy any dataframe columns that match model field names
    for fld in src._meta.get_fields():
        if getattr(fld, 'attname', None) and fld.attname in row.index:
            setattr(src, fld.attname, row[fld.attname])

    return src

def source_model_generator(chunk: pd.DataFrame, pipe_run_id):
    # pipe_run_id must be passed through to the row builder
    models = chunk.apply(lambda row: source_model_f(row, pipe_run_id), axis=1)
    return models.tolist()


## MAIN LOGIC
for idx in range(0, size, batch_size):
    models = source_model_generator(src_df.iloc[idx: idx + batch_size], pipe_run_id)
    ids = upload_source(models, return_ids=True)

Potentially we could use yield (i.e. yield the models instead of returning models.tolist()) to make these true generators, then consume them with list(mygenerator)
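The yield-based variant mentioned here could look like the following minimal sketch. Everything in it is a hypothetical stand-in: `Source` is a plain class in place of the Django model, and the name formatting is simplified because the real `deg2hms`/`deg2dms` helpers are not shown in this thread.

```python
import pandas as pd

class Source:
    """Plain stand-in for the Django Source model (illustration only)."""
    run_id = None
    name = None

def source_model_generator(chunk: pd.DataFrame, pipe_run_id):
    """Yield one model instance per dataframe row instead of building a list,
    so only the batch being uploaded is ever materialised at once."""
    for _, row in chunk.iterrows():
        src = Source()
        src.run_id = pipe_run_id
        # simplified name; the real code formats sexagesimal coordinates
        src.name = f"ASKAP_{row['wavg_ra']:.4f}{row['wavg_dec']:+.4f}"
        yield src

df = pd.DataFrame({'wavg_ra': [12.5], 'wavg_dec': [-30.1]})
# list(mygenerator) materialises the chunk only at upload time
models = list(source_model_generator(df, pipe_run_id=1))
```

The generator itself holds no per-row state; memory usage is then driven by whoever consumes it (e.g. a batched `bulk_create` call).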

Open to discussion @ajstewart @marxide

vast_pipeline/pipeline/association.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/forced_extraction.py Show resolved Hide resolved
vast_pipeline/pipeline/generators.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/generators.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/generators.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/generators.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/loading.py Show resolved Hide resolved
vast_pipeline/pipeline/loading.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/loading.py Show resolved Hide resolved
@ajstewart
Contributor Author

I think we should aim to do 2 things:

1 - pass the dataframe to upload* and inside that create the id column

2 - aim at building a generator that would return the chunk directly, see example above

def source_model_f(row, pipe_run_id):
    name = (
        f"ASKAP_{deg2hms(row['wavg_ra'])}"
        f"{deg2dms(row['wavg_dec'])}".replace(":", "")
    )
    src = Source()
    src.run_id = pipe_run_id
    src.name = name
    for fld in src._meta.get_fields():
        if getattr(fld, 'attname', None) and fld.attname in row.index:
            setattr(src, fld.attname, row[fld.attname])

    return src

def source_model_generator(chunk: pd.DataFrame, pipe_run_id):
    models = chunk.apply(lambda row: source_model_f(row, pipe_run_id), axis=1)
    return models.tolist()


## MAIN LOGIC
for idx in range(0, size, batch_size):
    models = source_model_generator(src_df.iloc[idx: idx + batch_size], pipe_run_id)
    ids = upload_source(models, return_ids=True)

Potentially we could use yield (i.e. yield the models instead of returning models.tolist()) to make these true generators, then consume them with list(mygenerator)

Open to discussion @ajstewart @marxide

So compared to how I've done it now, you'd like just the dataframe passed to the upload function, which then builds the generator and does the upload? Though it looks like you don't think yield is needed?

@ajstewart
Contributor Author

ajstewart commented Oct 13, 2020

@srggrs I've moved things around a bit to make more sense and be consistent with the measurements and images upload. All the generators are now loaded in loading.py only (apart from forced fits upload) and only the data frames are passed to the upload functions (apart from sources which also needs the Run).

I tried to use a for loop with islice but for some reason I couldn't get it to work, which confused me.
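For the record, the usual pitfall with islice in a loop is slicing the source afresh on each pass instead of repeatedly slicing one shared iterator. A generic batching pattern (a sketch, not the pipeline's actual code) looks like this:

```python
from itertools import islice

def batched(iterable, n):
    """Yield lists of up to n items from any iterable (list or generator)."""
    # iter() ensures every islice call consumes the SAME iterator.
    # With a plain list, islice(lst, n) would restart at the front on
    # every pass and the loop would never terminate.
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

batches = list(batched(range(10), 3))
```

Each batch can then be handed to the upload function, so at most `n` model objects are alive at a time.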

marxide
marxide previously approved these changes Oct 13, 2020
vast_pipeline/pipeline/loading.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/generators.py Outdated Show resolved Hide resolved
Pipeline Backlog automation moved this from In progress to Reviewer approved Oct 13, 2020
@ajstewart ajstewart requested a review from srggrs October 13, 2020 23:45
Pipeline Backlog automation moved this from Reviewer approved to Review in progress Oct 13, 2020
Contributor

@srggrs srggrs left a comment


I think we can leave the generators in that file, maybe call it model_generator.py, then move the id column creation inside the upload function and rename the functions make_upload_

vast_pipeline/pipeline/finalise.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/loading.py Show resolved Hide resolved
vast_pipeline/pipeline/forced_extraction.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/finalise.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/loading.py Outdated Show resolved Hide resolved
vast_pipeline/pipeline/main.py Outdated Show resolved Hide resolved
- Also fixed numpy reference in new_sources.py
ajstewart and others added 2 commits October 14, 2020 11:14
Co-authored-by: Serg <34258464+srggrs@users.noreply.github.com>
@ajstewart
Contributor Author

ajstewart commented Oct 14, 2020

@srggrs I think I've addressed all the comments above in a2c8224 apart from the loop (see my comment on that).

p.s. I didn't have time to do any docstrings at the moment sorry.

EDIT: Forgot to add that the latest commits also contain a fix to new_sources.py, where a reference to numpy had snuck in where it should be np.

vast_pipeline/pipeline/loading.py Show resolved Hide resolved
Pipeline Backlog automation moved this from Review in progress to Reviewer approved Oct 14, 2020
@srggrs srggrs merged commit e959fb5 into master Oct 14, 2020
Pipeline Backlog automation moved this from Reviewer approved to Done Oct 14, 2020
@srggrs srggrs deleted the issue-381-change-bulk-upload branch October 14, 2020 06:33
Labels
enhancement New feature or request
Successfully merging this pull request may close these issues.

Improve bulk uploading to lower memory usage (generators, islice)
3 participants