Skip to content
This repository has been archived by the owner on Sep 18, 2018. It is now read-only.

Initial commit for the spark job used to generate the aggregated hw j… #1

Merged
merged 1 commit into from Aug 15, 2016
Merged

Conversation

Dexterp37
Copy link
Collaborator

…son.

@almossawi @megavaughn please do not merge this PR just yet. Let's use this to start a conversation/review.

Notebook or Jar: summarize_json.ipynb
Spark Submission Args: N/A
Cluster Size: 5
Output Visibility: Public
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output visibility should be "Private", as we should be fetching/uploading to S3 manually to incrementally update the aggregates.

@SamPenrose
Copy link

Here is my review of the notebook (r+ with one small fix):

Bug:
In run_survey() if you set only one of start and end date, it will be ignored.

Request:
Per https://bugzilla.mozilla.org/show_bug.cgi?id=1262609#c32, I do believe we should count and report failures in get_newest_per_client() and get_valid_client_record(). In what detail we break them out is of less importance.

Comments:
- This work is oriented by submission date. Should that choice (vs activity date) be highlighted?
- These functions have logic that seems like a natural fit for a unit test:
get_newest_per_client()
aggregate_pings()
collapse_to_other_bucket()

Nits:
fetch_previous_state() and store_new_state() have the same docstring. I think it means "Load previously computed results from S3, if they exist."
run_survey():
If both are start and end are required (see Bug), why not pass as a single parameter?
get_valid_client_record():
whitespace in second if block
vendor_name_from_id():
if this code will live long, load from network?

@Dexterp37
Copy link
Collaborator Author

Thanks Sam, sorry for the delayed reply. I've addressed all your comments locally, and added some tests within the file. I'll update the PR soon.

@Dexterp37
Copy link
Collaborator Author

@SamPenrose , @rjweiss I've updated the notebook. Given the r+, should I go on and merge it?

Changes:

  • Added more comments.
  • Removed functions that were not used.
  • Addressed the bug Sam found and his request (records discarded due to broken data are no reported in the "discarded" bucket).
  • Fixed the nits

I left unit testing out for now. I'm considering moving some functions out to a library for easier testing.

@SamPenrose
Copy link

Works for me!

This notebook generates some statistics about the hardware
used by a representative sample of the Firefox Release population
and reports them in a JSON file.
@Dexterp37 Dexterp37 merged commit cec0cb0 into mozilla:master Aug 15, 2016
@Dexterp37 Dexterp37 deleted the spark_job branch August 16, 2016 07:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants