Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-enable test: Facilitator View of dashboard is as expected #35775

Merged
merged 3 commits into from
Jul 13, 2020

Conversation

islemaster
Copy link
Contributor

@islemaster islemaster commented Jul 10, 2020

Enables the eyes test disabled in #35749, and fixes a performance issue that we believe caused the flakiness in the first place.

Which test are we talking about?

Here is the scenario in question:

Scenario: Facilitator View of dashboard is as expected
Given I am a facilitator with started and completed courses
And I am on "http://studio.code.org/pd/workshop_dashboard"
And I wait to see element with ID "endedWorkshopsTable"
And I wait to see element with ID "inProgressWorkshopsTable"
And I wait to see element with ID "notStartedWorkshopsTable"
And I open my eyes to test "facilitator dashboard"
And I see no difference for "viewing facilitator dashboard"
And I close my eyes

At first glance, there's not a whole lot going on here, except I suspect that first step I am a facilitator with started and completed courses is doing the heavy lifting. We'll come back to that.

How often does it fail?

Here's the last 30 days of Sauce Labs' history on this test, from our test environment. You can see that the test gradually gets more flaky over time, then suddenly stops failing two days ago.

image

The test has been passing consistently in our Drone environment. This suggested that our failures were somehow related to persistent state on the test environment, which wouldn't be a problem on Drone. Given that it recently started succeeding again, maybe we didn't need to skip it? Still, it seems like a good idea to investigate, since it got so bad that it was only passing 20% of the time.

How does it fail?

We looked at several sample failures and they are very consistent. Here's a sample cucumber log:

image

When you open the Sauce Labs test log, you are greeted with this message:

image

The last cucumber step in the SL log is I am a facilitator with started and completed courses and the last selenium command is this execute_async which successfully creates a teacher account:

image

The video record also shows that the browser stays up for 60 seconds at this point, without receiving a command. Conclusion: Our test client is doing something that takes longer than 60 seconds, so Sauce Labs gives up and closes the remote browser. Next time our test tries to send a command to Sauce Labs, it discovers that the test session has expired.

Another observation: Although this is an "eyes" test, we don't need to check Applitools logs because the test is consistently failing before it reaches any Applitools steps.

What is the test doing that takes so long?

To understand this, we need to dig down into our step definitions.

I am a facilitator with started and completed courses invokes several other steps in turn.

Given(/^I am a facilitator with started and completed courses$/) do
random_name = "TestFacilitator" + SecureRandom.hex[0..9]
steps %Q{
And I create a teacher named "#{random_name}"
And I make the teacher named "#{random_name}" a facilitator for course "CS Fundamentals"
And I create a workshop for course "CS Fundamentals" facilitated by "#{random_name}" with 5 people and start it
And I create a workshop for course "CS Fundamentals" facilitated by "#{random_name}" with 5 people and end it
And I create a workshop for course "CS Fundamentals" facilitated by "#{random_name}" with 5 people
}
end

Ben and I looked at these steps one at a time and tried to identify potentially slow code.

And I create a teacher named "#{random_name}"

And(/^I create a teacher named "([^"]*)"( and go home)?$/) do |name, home|
create_user(name, age: '21+', user_type: 'teacher', email_preference_opt_in: 'yes')
navigate_to replace_hostname('http://studio.code.org') if home
end

def create_user(name, url: '/users.json', code: 201, **user_opts)
navigate_to replace_hostname('http://studio.code.org/reset_session')
Retryable.retryable(on: RSpec::Expectations::ExpectationNotMetError, tries: 3) do
email, password = generate_user(name)
browser_request(
url: url,
method: 'POST',
body: {
user: {
user_type: 'student',
email: email,
password: password,
password_confirmation: password,
name: name,
age: '16',
terms_of_service_version: '1'
}.merge(user_opts)
},
code: code
)
end
end

This step does interact with the remote browser. It accounts for the navigation to /reset_session and for the successful POST request we observed in the Sauce Labs log. I think it's safe to say this step completes successfully.

And I make the teacher named "#{random_name}" a facilitator for course "CS Fundamentals"

And(/^I make the teacher named "([^"]*)" a facilitator for course "([^"]*)"$/) do |name, course|
require_rails_env
user = User.find_by(email: @users[name][:email])
user.permission = UserPermission::FACILITATOR
Pd::CourseFacilitator.create(facilitator_id: user.id, course: course)
end

The require_rails_env step is known to be slow, but we do use it in other places. I'd also expect these tests to be failing on Drone if that was the only issue. We'll keep it in mind.

Everything else here looks fine - we're finding a user by email, granting them a permission, and inserting a CourseFacilitator row.

Note that this step does not interact with the remote browser.

And I create a workshop for course "CS Fundamentals" facilitated by "#{random_name}" with 5 people and start it

The last three steps are actually all the same, heavily-parameterized step:

And(/^I create a workshop for course "([^"]*)" ([a-z]+) by "([^"]*)" with (\d+) (people|facilitators)(.*)$/) do |course, role, name, number, number_type, post_create_actions|
# Organizer
organizer =
if role == 'organized'
User.find_by(name: name)
else
User.find_or_create_teacher(
{name: 'Organizer', email: "organizer#{SecureRandom.hex[0..5]}@code.org"}, nil, 'workshop_organizer'
)
end
workshop = Retryable.retryable(on: [ActiveRecord::RecordNotUnique, ActiveRecord::RecordInvalid], tries: 5) do
FactoryGirl.create(:workshop, :funded,
on_map: true,
course: course,
organizer_id: organizer.id,
capacity: number.to_i,
location_name: 'Buffalo',
num_sessions: 1,
sessions_from: Date.new(2018, 4, 1),
enrolled_and_attending_users: number_type == 'people' ? number.to_i : 0
)
end
# Facilitators
if number_type == 'facilitators'
number.to_i.times do
workshop.facilitators << create_facilitator(course)
end
else
facilitator = role == 'facilitated' ? User.find_by(name: name) : create_facilitator(course)
workshop.facilitators << facilitator
end
# Attendees
if number_type == 'people'
number.to_i.times do
create_enrollment(workshop)
end
else
if role == 'attended'
create_enrollment(workshop, name)
end
end
if post_create_actions.include?('and end it')
workshop.update!(started_at: DateTime.new(2016, 3, 15))
workshop.update!(ended_at: DateTime.new(2016, 3, 15))
elsif post_create_actions.include?('and start it')
workshop.update!(started_at: DateTime.new(2016, 3, 15))
else
workshop.update!(started_at: nil, ended_at: nil)
end
end

There's a lot going on here. First, notice that this code also does not interact with the remote browser.

Second, there are two User.find_by(name: name) calls, and one of them (line 500) is invoked by all three steps we are investigating. Our users table does not contain an index on name, so this accounts for at least three table scans during this step. On Drone that's probably fine - our database will only contain users we've created for tests. In our test environment, however, we can have millions of rows in the users table.

Today there were more than three million rows in users on test, which was sufficient to make this query take several seconds. We haven't yet uncovered whether this number was higher two days ago.

Third, this method invokes create_enrollment for each of the three steps in our parent step:

def create_enrollment(workshop, name=nil)
first_name = name.nil? ? "First - #{SecureRandom.hex}" : name
last_name = name.nil? ? "Last - #{SecureRandom.hex}" : "Last"
user = Retryable.retryable(on: [ActiveRecord::RecordInvalid], tries: 5) do
FactoryGirl.create :teacher
end
enrollment = Pd::Enrollment.create!(
first_name: first_name,
last_name: last_name,
email: user.email,
school_info: SchoolInfo.find_or_create_by(
{
country: 'US',
school_type: 'other',
state: 'WA',
zip: '98101',
school_name: 'Code.org'
}
),
pd_workshop_id: workshop.id
)
PEGASUS_DB[:forms].where(kind: 'PdWorkshopSurvey', source_id: enrollment.id).delete
end

create_enrollment in turn contains a School.find_or_create_by call that doesn't use any of the schools table's indexes, resulting in another table scan. There are about 100 thousand rows in schools on test today.

Conclusion

This test will fail if the require_rails_env call and the six table scans we found take more than 60 seconds all together. It seems likely that the slow queries are the main issue here, since this has been a problem on test but not Drone.

It looks like something (we're not sure what) happened two days ago to make these queries faster on test. Despite this, we're going ahead and fixing the users table scans here, to avoid similar issues in the future.

Links

Reviewer Checklist:

  • Tests provide adequate coverage
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

And I wait to see element with ID "inProgressWorkshopsTable"
And I wait to see element with ID "notStartedWorkshopsTable"
And I open my eyes to test "facilitator dashboard"
And I see no difference for "viewing facilitator dashboard"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire test changes here because we are reversing an error in #35749 when the test was disabled. This actually puts the test back the way it originally was.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was me (bad copy/paste), thanks for fixing!

And I make the teacher named "#{random_name}" a workshop organizer
And I create a workshop for course "CS Fundamentals" organized by "#{random_name}" with 5 people and end it
}
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We noticed these two steps while looking for other code that might be affected by these table scans. It turns out, both of these are entirely unused.

@@ -497,7 +479,7 @@ def create_facilitator(course)
workshop.facilitators << create_facilitator(course)
end
else
facilitator = role == 'facilitated' ? User.find_by(name: name) : create_facilitator(course)
facilitator = role == 'facilitated' ? find_test_user_by_name(name) : create_facilitator(course)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We believe this particular slow query was the cause of this test failing.

user.permission = UserPermission::WORKSHOP_ORGANIZER
end

And(/^I make the teacher named "([^"]*)" a workshop admin$/) do |name|
require_rails_env

user = User.find_by(email: @users[name][:email])
user = find_test_user_by_name(name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these replacements are simply a refactor, since they were already looking the test user up by email.

@@ -1062,6 +1062,10 @@ def generate_user(name)
return email, password
end

def find_test_user_by_name(name)
User.find_by(email: @users[name][:email])
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've located this helper in steps.rb adjacent to the generate_user method which is responsible for initializing and populating the @users variable, and confirmed that it is accessible from pd.rb.

@islemaster islemaster marked this pull request as ready for review July 11, 2020 00:16
@islemaster islemaster requested review from bencodeorg and a team July 11, 2020 00:16
Copy link
Member

@breville breville left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting!

Would it make sense to erase and recreate the test env every so often?

@islemaster
Copy link
Contributor Author

Would it make sense to erase and recreate the test env every so often?

Yes, it probably would! It might be reasonable to reset the test database on every run. Didn't we do this at some point? If that slows down our test cycle too much, I could see us instead choosing to wipe it once a week.

I think the counterargument is that not resetting the test database makes it more productionlike, giving us the chance of catching issues with migrations and performance issues caused by having large numbers of rows. I think the counter-counterargument is that test is probably not large in a productionlike way, since it grows with tests that don't clean up after themselves, and not in a way that models production traffic.

I'm not sure this particular failure makes a good case one way or another: We caught a performance problem, but it was specific to our test code.

@islemaster islemaster merged commit 1139081 into staging Jul 13, 2020
@islemaster islemaster deleted the reenable-plc-test branch July 13, 2020 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants