Re-enable test: Facilitator View of dashboard is as expected #35775
```gherkin
And I wait to see element with ID "inProgressWorkshopsTable"
And I wait to see element with ID "notStartedWorkshopsTable"
And I open my eyes to test "facilitator dashboard"
And I see no difference for "viewing facilitator dashboard"
```
The entire test changes here because we are reversing an error introduced in #35749 when the test was disabled. This actually puts the test back the way it originally was.
That was me (bad copy/paste), thanks for fixing!
```
And I make the teacher named "#{random_name}" a workshop organizer
And I create a workshop for course "CS Fundamentals" organized by "#{random_name}" with 5 people and end it
}
end
```
We noticed these two steps while looking for other code that might be affected by these table scans. It turns out both of these are entirely unused.
```diff
@@ -497,7 +479,7 @@ def create_facilitator(course)
       workshop.facilitators << create_facilitator(course)
     end
   else
-    facilitator = role == 'facilitated' ? User.find_by(name: name) : create_facilitator(course)
+    facilitator = role == 'facilitated' ? find_test_user_by_name(name) : create_facilitator(course)
```
We believe this particular slow query was the cause of this test failing.
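For illustration, here's a database-free Ruby sketch of the difference (row count and names are hypothetical): without an index, `find_by(name: ...)` degrades to scanning every row, while the email lookup can use the unique index on `users.email`, modeled here as a hash probe.

```ruby
# Illustrative sketch (plain Ruby, no database): why a lookup on an
# unindexed column is slow while an indexed lookup is fast.
# Row count and names here are hypothetical.
rows = Array.new(100_000) { |i| { name: "user#{i}", email: "user#{i}@example.test" } }

# Without an index, find_by(name: ...) degrades to a full table scan:
scan_result = rows.find { |r| r[:name] == 'user99999' }

# With a unique index on email, the lookup is a direct probe,
# modeled here as a hash:
email_index = rows.to_h { |r| [r[:email], r] }
probe_result = email_index['user99999@example.test']

scan_result == probe_result  # => true, but the scan touched every row
```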
```diff
   user.permission = UserPermission::WORKSHOP_ORGANIZER
 end

 And(/^I make the teacher named "([^"]*)" a workshop admin$/) do |name|
   require_rails_env

-  user = User.find_by(email: @users[name][:email])
+  user = find_test_user_by_name(name)
```
Most of these replacements are simply a refactor, since they were already looking the test user up by email.
```diff
@@ -1062,6 +1062,10 @@ def generate_user(name)
   return email, password
 end

+def find_test_user_by_name(name)
+  User.find_by(email: @users[name][:email])
+end
```
We've located this helper in steps.rb adjacent to the `generate_user` method, which is responsible for initializing and populating the `@users` variable, and confirmed that it is accessible from pd.rb.
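As a minimal sketch of what the helper relies on (plain Ruby, with the ActiveRecord call stubbed out and a made-up test user): `@users` maps a generated test name to credentials, so the helper can resolve the name to an email before doing the indexed lookup.

```ruby
# Plain-Ruby sketch; in the real suite, generate_user populates @users and
# the helper calls User.find_by(email: ...), which hits the unique index
# on users.email. The name and email below are made up.
@users = {
  'Fran Facilitator' => { email: 'fran+abc123@example.test', password: 'secret' }
}

def find_test_user_by_name(name)
  email = @users[name][:email]
  email  # the real helper returns User.find_by(email: email)
end

find_test_user_by_name('Fran Facilitator')  # => "fran+abc123@example.test"
```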
Very interesting!
Would it make sense to erase and recreate the test env every so often?
Yes, it probably would! It might be reasonable to reset the test database on every run. Didn't we do this at some point? If that slows down our test cycle too much, I could see us instead choosing to wipe it once a week.

I think the counterargument is that not resetting the test database makes it more production-like, giving us a chance of catching issues with migrations and performance issues caused by having large numbers of rows. I think the counter-counterargument is that test is probably not large in a production-like way, since it grows with tests that don't clean up after themselves, not in a way that models production traffic.

I'm not sure this particular failure makes a good case one way or the other: we caught a performance problem, but it was specific to our test code.
Enables the eyes test disabled in #35749, and fixes a performance issue that we believe caused the flakiness in the first place.
## Which test are we talking about?
Here is the scenario in question:
code-dot-org/dashboard/test/ui/features/plc/pd/dashboard_view.feature
Lines 6 to 14 in 4008770
At first glance, there's not a whole lot going on here, except I suspect that first step, `I am a facilitator with started and completed courses`, is doing the heavy lifting. We'll come back to that.

## How often does it fail?
Here's the last 30 days of Sauce Labs' history on this test, from our test environment. You can see that the test gradually gets more flaky over time, then suddenly stops failing two days ago.
The test has been passing consistently in our Drone environment. This suggested that our failures were somehow related to persistent state on the test environment, which wouldn't be a problem on Drone. Given that it recently started succeeding again, maybe we didn't need to skip it? Still, it seems like a good idea to investigate, since it got so bad that it was only passing 20% of the time.
## How does it fail?
We looked at several sample failures and they are very consistent. Here's a sample cucumber log:
When you open the Sauce Labs test log, you are greeted with this message:
The last cucumber step in the SL log is `I am a facilitator with started and completed courses`, and the last selenium command is an `execute_async` which successfully creates a teacher account.

The video record also shows that the browser stays up for 60 seconds at this point, without receiving a command. Conclusion: our test client is doing something that takes longer than 60 seconds, so Sauce Labs gives up and closes the remote browser. The next time our test tries to send a command to Sauce Labs, it discovers that the test session has expired.
Another observation: Although this is an "eyes" test, we don't need to check Applitools logs because the test is consistently failing before it reaches any Applitools steps.
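The failure mode itself (client-side work outlasting a remote idle timeout) can be reproduced in miniature with plain Ruby. The 60-second limit and the stalling step are from the analysis above; the scaled-down times and method names here are illustrative only.

```ruby
require 'timeout'

# Miniature reproduction of the failure mode, with times scaled down.
# IDLE_LIMIT stands in for Sauce Labs' 60-second command-idle timeout;
# slow_local_setup stands in for our test client's slow database work.
IDLE_LIMIT = 0.1

def slow_local_setup
  sleep 0.2  # takes longer than the idle limit, like the table scans did
end

session_expired = false
begin
  Timeout.timeout(IDLE_LIMIT) { slow_local_setup }
rescue Timeout::Error
  session_expired = true  # the next browser command would now fail
end
```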
## What is the test doing that takes so long?
To understand this, we need to dig down into our step definitions. `I am a facilitator with started and completed courses` invokes several other steps in turn:

code-dot-org/dashboard/test/ui/features/step_definitions/pd.rb
Lines 73 to 82 in d671161
Ben and I looked at these steps one at a time and tried to identify potentially slow code.
**And I create a teacher named "#{random_name}"**
code-dot-org/dashboard/test/ui/features/step_definitions/steps.rb
Lines 1285 to 1288 in d671161
code-dot-org/dashboard/test/ui/features/step_definitions/steps.rb
Lines 1249 to 1270 in d671161
This step does interact with the remote browser. It accounts for the navigation to `/reset_session` and for the successful POST request we observed in the Sauce Labs log. I think it's safe to say this step completes successfully.

**And I make the teacher named "#{random_name}" a facilitator for course "CS Fundamentals"**
code-dot-org/dashboard/test/ui/features/step_definitions/pd.rb
Lines 147 to 153 in d671161
The `require_rails_env` step is known to be slow, but we do use it in other places. I'd also expect these tests to be failing on Drone if that were the only issue. We'll keep it in mind.

Everything else here looks fine: we're finding a user by email, granting them a permission, and inserting a CourseFacilitator row.
Note that this step does not interact with the remote browser.
**And I create a workshop for course "CS Fundamentals" facilitated by "#{random_name}" with 5 people and start it**
The last three steps are actually all the same, heavily-parameterized step:
code-dot-org/dashboard/test/ui/features/step_definitions/pd.rb
Lines 470 to 523 in d671161
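The "one definition, several phrasings" pattern can be sketched with a single parameterized regex. This is a hypothetical simplification of the real step's pattern, not the actual definition from pd.rb:

```ruby
# Hypothetical simplification of the heavily-parameterized step definition:
# one regex captures the course, the role, the name, the group size, and
# whether to start or end the workshop.
STEP = /^I create a workshop for course "([^"]*)" (organized|facilitated) by "([^"]*)" with (\d+) people and (start|end) it$/

m = STEP.match('I create a workshop for course "CS Fundamentals" facilitated by "Fran" with 5 people and start it')
m[2]  # => "facilitated"
m[5]  # => "start"
```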
There's a lot going on here. First, notice that this code also does not interact with the remote browser.
Second, there are two `User.find_by(name: name)` calls, and one of them (line 500) is invoked by all three steps we are investigating. Our `users` table does not contain an index on `name`, so this accounts for at least three table scans during this step. On Drone that's probably fine: our database will only contain users we've created for tests. In our test environment, however, we can have millions of rows in the users table. Today there were more than three million rows in `users` on test, which was sufficient to make this query take several seconds. We haven't yet uncovered whether this number was higher two days ago.

Third, this method invokes `create_enrollment` for each of the three steps in our parent step:

code-dot-org/dashboard/test/ui/features/step_definitions/pd.rb
Lines 436 to 459 in d671161
`create_enrollment` in turn contains a `School.find_or_create_by` call that doesn't use any of the `schools` table's indexes, resulting in another table scan. There are about 100 thousand rows in `schools` on test today.

## Conclusion
This test will fail if the `require_rails_env` call and the six table scans we found take more than 60 seconds altogether. It seems likely that the slow queries are the main issue here, since this has been a problem on test but not Drone.

It looks like something (we're not sure what) happened two days ago to make these queries faster on test. Despite this, we're going ahead and fixing the `users` table scans here, to avoid similar issues in the future.

## Links
## Reviewer Checklist