Initial commit for the spark job used to generate the aggregated hw j… #1

Dexterp37 · 2016-04-02T15:50:06Z

…son.

@almossawi @megavaughn please do not merge this PR just yet. Let's use this to start a conversation/review.

Dexterp37 · 2016-04-07T08:05:06Z

report/README.txt

+    Notebook or Jar:       summarize_json.ipynb
+    Spark Submission Args: N/A
+    Cluster Size:          5
+    Output Visibility:     Public


The output visibility should be "Private", as we should be fetching/uploading to S3 manually to incrementally update the aggregates.

SamPenrose · 2016-06-15T13:19:56Z

Here is my review of the notebook (r+ with one small fix):

Bug:
In run_survey(), if you set only one of the start and end dates, it is silently ignored.
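
A minimal sketch of one way to guard against this, assuming run_survey() accepts optional start and end dates. The helper name resolve_date_range and its signature are hypothetical and not taken from the notebook:

```python
from datetime import date
from typing import Optional, Tuple


def resolve_date_range(
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> Optional[Tuple[date, date]]:
    """Hypothetical helper: validate an optional (start, end) pair.

    Returns None when neither bound is given, so the caller can fall
    back to its default range. Raises instead of silently ignoring a
    lone bound.
    """
    if start is None and end is None:
        return None
    if start is None or end is None:
        raise ValueError("start and end must be provided together")
    if start > end:
        raise ValueError("start must not be after end")
    return (start, end)
```

With a check like this, passing only one bound fails loudly instead of being dropped.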

Request:
Per https://bugzilla.mozilla.org/show_bug.cgi?id=1262609#c32, I do believe we should count and report failures in get_newest_per_client() and get_valid_client_record(). In what detail we break them out is of less importance.
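
As a rough illustration of the kind of counting meant here, in plain Python rather than Spark for brevity: the validator below is a hypothetical stand-in for get_valid_client_record(), and the "client_id" and "os" fields are assumed purely for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def get_valid_record_sketch(ping: Dict) -> Optional[Dict]:
    """Hypothetical stand-in for get_valid_client_record().

    Returns the ping when the (assumed) required fields are present,
    otherwise None.
    """
    if "client_id" not in ping or "os" not in ping:
        return None
    return ping


def filter_and_count(pings: Iterable[Dict]) -> Tuple[List[Dict], Counter]:
    """Keep valid records and count how many were kept or discarded."""
    counts: Counter = Counter()
    valid: List[Dict] = []
    for ping in pings:
        record = get_valid_record_sketch(ping)
        if record is None:
            counts["discarded"] += 1
        else:
            counts["valid"] += 1
            valid.append(record)
    return valid, counts
```

The counts could then be written into the report alongside the aggregates, at whatever level of detail is useful.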

Comments:
- This work is oriented by submission date. Should that choice (vs activity date) be highlighted?
- These functions have logic that seems like a natural fit for a unit test:
get_newest_per_client()
aggregate_pings()
collapse_to_other_bucket()
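
A sketch of what such a test could look like. The behavior of collapse_to_other_bucket() is assumed here (groups whose share falls below a threshold are merged into an "Other" entry); the function body, its signature, and the threshold parameter are all hypothetical and may differ from the notebook:

```python
import unittest
from typing import Dict


def collapse_to_other_bucket_sketch(
    shares: Dict[str, float], threshold: float = 0.01
) -> Dict[str, float]:
    """Hypothetical version: merge shares below `threshold` into "Other"."""
    collapsed: Dict[str, float] = {}
    other = 0.0
    for key, share in shares.items():
        if share < threshold:
            other += share
        else:
            collapsed[key] = share
    if other > 0:
        collapsed["Other"] = other
    return collapsed


class CollapseToOtherBucketTest(unittest.TestCase):
    def test_small_groups_are_merged(self):
        result = collapse_to_other_bucket_sketch(
            {"a": 0.9, "b": 0.06, "c": 0.04}, threshold=0.05
        )
        self.assertEqual(set(result), {"a", "b", "Other"})
        self.assertAlmostEqual(result["Other"], 0.04)
        # Collapsing must not change the total share.
        self.assertAlmostEqual(sum(result.values()), 1.0)

    def test_no_other_bucket_when_nothing_is_small(self):
        result = collapse_to_other_bucket_sketch(
            {"a": 0.5, "b": 0.5}, threshold=0.05
        )
        self.assertEqual(result, {"a": 0.5, "b": 0.5})
```
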

Nits:
fetch_previous_state() and store_new_state() have the same docstring. I think it means "Load previously computed results from S3, if they exist."
run_survey():
If both start and end are required (see Bug), why not pass them as a single parameter?
get_valid_client_record():
whitespace in second if block
vendor_name_from_id():
if this code will live long, load from network?

Dexterp37 · 2016-07-12T17:20:04Z

Thanks Sam, sorry for the delayed reply. I've addressed all your comments locally, and added some tests within the file. I'll update the PR soon.

Dexterp37 · 2016-08-12T16:30:27Z

@SamPenrose , @rjweiss I've updated the notebook. Given the r+, should I go on and merge it?

Changes:

Added more comments.
Removed functions that were not used.
Addressed the bug Sam found and his request (records discarded due to broken data are now reported in the "discarded" bucket).
Fixed the nits

I left unit testing out for now. I'm considering moving some functions out to a library for easier testing.

SamPenrose · 2016-08-12T16:33:11Z

Works for me!

This notebook generates some statistics about the hardware used by a representative sample of the Firefox Release population and reports them in a JSON file.

Dexterp37 reviewed Apr 7, 2016
Generate the hardware survey report.

a7d6460

Dexterp37 merged commit cec0cb0 into mozilla:master Aug 15, 2016

Dexterp37 deleted the spark_job branch August 16, 2016 07:41
