Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Participation stats #3447

Merged
merged 23 commits into from Apr 13, 2017
Merged

Participation stats #3447

merged 23 commits into from Apr 13, 2017

Conversation

nilbus
Copy link

@nilbus nilbus commented Apr 3, 2017

This is for analyzing the results of the experiment in exercism/discussions#123.

Implements #3445; see also the preliminary discussion there.

Goal: Measure review participation quantity and quantity (length) before, during, and after gamification is applied.

This branch contains DB migrations.

WIP screenshot: screenshot

Todo:

  • Convert stats.js to Coffeescript
  • Use use Plotly only on .comments-chart and future friends
  • Determine data format needed for comment length plot, and build a prototype
  • Migrate: Create crc32 function
  • Ruby: Look up and generate stats JSON in the format needed by Plotly
  • Extract data from the JS, and feed in via data attribute
  • Add comment length stats using box & whiskers plots
  • Exclude unqualified users
    • who participated in experiment discussions
    • who joined exercism after the experiment analysis period started (can't compare withdrawal to nonexistent earlier data)
    • who opted out to view the experiment stats
  • Exclude the last day of stats, which is in progress
  • Add a simple route integration test
  • Report performance metrics
  • Optimize performance (only as needed):
    • Test with 1 million comments (currently 300k in production)
    • Query less
      • Cache results, expire daily
      • Append new day's results to existing set, rather than recalculating everything daily
    • Optimize query
      • Store crc32(username) % 100 in users table
      • [ ] Create derived column created_date, updated via trigger
      • Index by created_date to eliminate sort on query
      • Create derived column body_length, updated via trigger

@nilbus nilbus force-pushed the participation-stats branch 2 times, most recently from da0c4ba to 6f1d8df Compare April 4, 2017 02:51
This replaces the /stats redirect to the first track.

See exercism/discussions#123

Feature flag: participation_stats

This introduces a new plotting library, Plotly.js.
See discussion in exercism#3445.

The migration introduces postgres to the crc32 hashing function, so it
can determine which users are in the experiment group and not.

This branch will close exercism#3445.
List control first, so experimental has the brighter color.
@nilbus nilbus force-pushed the participation-stats branch 3 times, most recently from 2061cb0 to 2344195 Compare April 8, 2017 20:25
The new Track Stats header provides for a better empty state.

We don't particularly want people to opt out of participation tracking.
I intentionally leave unsaid that this is how you can see the
statistics, so as not to reward curiosity during the experiment.
@nilbus
Copy link
Author

nilbus commented Apr 9, 2017

Release notes:

  • Feature flag: participation_stats
  • Run migrations
  • This PR introduces application metric reporting to Hosted Graphite using the metric exercism_io.stats.experiment.query.time. Under the Heroku app's resources tab, add the Hosted Graphite add-on (under Find more add-ons), and metrics will begin reporting. Please add me as a team member on the Hosted Graphite dashboard (linked from the Heroku Resources page). In environments without HOSTEDGRAPHITE_APIKEY set, metrics reporting is a no-op.

@nilbus
Copy link
Author

nilbus commented Apr 9, 2017

Forgot to mention… Hosted Graphite's free developer tier should be sufficient for us.

class CreateFunctionCrc32 < ActiveRecord::Migration
def up
ActiveRecord::Base.connection.execute <<~SQL
CREATE FUNCTION crc32(text_string text) RETURNS bigint AS $$

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a check for NULL in this function just in case?

Copy link
Author

@nilbus nilbus Apr 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, otherwise it starts an infinite loop. Fixed in 8ee996d. Thanks!

Otherwise, crc32(NULL) starts an infinite loop.

Thanks @bernardoamc
new.increment(metric_name)
end

def initialize(api_key: ENV['HOSTEDGRAPHITE_APIKEY'], host: 'carbon.hostedgraphite.com', port: 2003)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this class simply doesn't work unless we have a valid api_key. Should we be explicit and raise when we don't have this attribute?

Copy link
Author

@nilbus nilbus Apr 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it was an intentional choice to make Metrics no-op when the API key is not present. Consider that most everyone working with this project in development will not have an API key. I don't want to force a dummy key to be provided or an account to be set up. In most cases, developers won't care to look at the metrics anyway.

Now that you mention this though, I just thought of something… instead of doing nothing, Metrics can just print to stdout. That would be even better for development. 👍

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for the context! <3

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 1e23ff2.

where('comments.created_at < ?', end_date).
group('created_date').
order('created_date')
relation = filter_experiment_group(relation, experiment_group)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

experiment_group is accessible anywhere in the class, we don't need to use it as a parameter here. :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ff9a080.

}
if gamification_markers
result.merge!(
gamification_start_date: GAMIFICATION_START_DATE,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move this to a constant and do something like: result = result.merge(GAMIFICATION_MARKERS) if gamification_markers.present?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less garbage collection for the win! 🐎

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 334c2f6.

end

def results
result = {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the order important here? date[0] refer to daily_review_count[0] which refer to daily_review_count[0] and so on? If so, we could create a struct like: DailyReview = Struct.new(:date, :count, :length) and instantiate it when iterating through the results.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order is important. The trick is that these values get converted immediately to JSON for storage in a data attribute for the Plotly, and this is the format that it expects. In other circumstances, I agree that would make more sense.

See where this is used in stats.coffee:40.


attr_reader :alice, :opted_out
def setup
super

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the super here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DBCleaner starts a transaction in the superclass. When overriding setup, not calling super means that the transaction doesn't start, and any records created get left in the database after each test run.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! Maybe it would be a good case to use prepend.. not the scope of this PR though. :)

class MetricsTest < MiniTest::Test
def test_reports_without_error
metrics = Metrics.new(api_key: 'test', host: 'localhost')
UDPSocket.any_instance.expects(:send)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use UDPSocket.any_instance.expects(:send).twice here?

metrics = Metrics.new(api_key: 'test', host: 'localhost')
UDPSocket.any_instance.expects(:send).twice
metrics.time('test.time') { 1 }
metrics.increment('test.count')

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I wasn't sure if that was a thing in Minitest. 👍

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is provided by mocha, it has a bunch of interesting expectations. :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's what I meant. 😄 My mind was just going to "not rspec".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in fc7b1bd.

@kytrinyx
Copy link
Member

This looks great @nilbus. Let me know if/when you're ready to merge (I see a few unchecked boxes).

Spread both comments and users over 4 years, with quantity increasing
exponentially toward the present.
@nilbus
Copy link
Author

nilbus commented Apr 11, 2017

Performance tests complete. With 50k users and 1M comments, the statistics query for each experiment group (2) takes under 200ms. I chose not to make a few other optimizations that brought the query time down to 100ms, because the complexity of pre-calculated columns that would need to be updated using triggers isn't worth the performance gain. I may revisit these though for long-term stats. With sub-second request times and no obvious existing cache mechanism, I'm also not concerned with application-level caching for this low-traffic page.

This is ready to go! (release notes)

@nilbus
Copy link
Author

nilbus commented Apr 11, 2017

Thanks for the thorough review @bernardoamc! 🚀

- Withdrawal date matches actual date.
- Pre gamification period matches withdrawal period duration.
@kytrinyx kytrinyx merged commit 76aebfa into exercism:master Apr 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants