Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample data generation #23558

Merged
merged 4 commits into from
Jul 11, 2018
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
181 changes: 181 additions & 0 deletions dashboard/lib/sample_data.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,181 @@
include FactoryGirl::Syntax::Methods

class SampleData
TEST_ACCOUNT_PASSWORD = "00secret".freeze
@@prng = nil

# Returns a seeded random number generator for consistent test data
def self.prng
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the p? Persistent?
Is this a common abbreviation? I'd prefer a more clear and descriptive name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be a common abbreviation for Pseudorandom Number Generator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess I should have googled that. How about changing the comment s/random/pseudorandom ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had chosen prng also because Ruby docs use that variable name in their samples for Random: https://ruby-doc.org/core-2.2.0/Random.html

But it may be unnecessarily unconfusing. Changed to rng.

@@prng ||= Random.new(0)
@@prng
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is unnecessary. Line 9 will evaluate to this anyway:

irb(main):002:0> defined? foo
=> nil
irb(main):003:0> foo ||= 'bar'
=> "bar"
irb(main):004:0> foo
=> "bar"
irb(main):005:0> foo = 'another value'
=> "another value"
irb(main):006:0> foo ||= 'bar'
=> "another value"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, changed

end

# Creates sample data
def self.seed
raise "Should not be run outside of development" unless CDO.rack_env?(:development)

# Create a test teacher
teacher = create_teacher "testteacher@code.org", "TestTeacher Codeberg"

# Create normal-sized and large sections for each of CSF, CSD and CSP

create_section({teacher: teacher, name: "CSF 1",
login_type: Section::LOGIN_TYPE_PICTURE, grade: 2, age_min: 7,
age_max_inclusive: 9, script_name: "coursea-2018", num_students: 30,
use_imperfect_results: true}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby-style:

  1. The curly braces aren't needed. For that matter neither are the parentheses, but they're helpful for multi-line declarations like this.
  2. The first arg should be on a new line and the final end parenthesis should be unindented one level. I'm surprised Rubocop didn't complain about this.
create_section(
  teacher: teacher, name: "CSF 1",
  login_type: Section::LOGIN_TYPE_PICTURE, grade: 2, age_min: 7,
  age_max_inclusive: 9, script_name: "coursea-2018", num_students: 30,
  use_imperfect_results: true
)

Minor Ruby-style nit: prefer single quote strings when not using interpolation.

I also tend to prefer splitting these long hashes into one key per line for readability. That's not necessary but it tends to make reading, comparing, and editing easier. This would be my ideal format:

create_section(
  teacher: teacher, 
  name: 'CSF 1',
  login_type: Section::LOGIN_TYPE_PICTURE, 
  grade: 2, 
  age_min: 7,
  age_max_inclusive: 9, 
  script_name: 'coursea-2018', 
  num_students: 30,
  use_imperfect_results: true
)

Same for the other calls below

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, all curly braces removed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentations fixed. I tried one line per parameter but it explodes that section of code into multiple screenfuls. I left it with multiple parameters per line because I can see the whole sample case generation strategy on one screen.


create_section({teacher: teacher, name: "CSF 2 (Large)",
login_type: Section::LOGIN_TYPE_PICTURE, grade: 2, age_min: 7,
age_max_inclusive: 9, script_name: "coursea-2018", num_students: 150,
use_imperfect_results: true}
)

create_section({teacher: teacher, name: "CSD 1",
login_type: Section::LOGIN_TYPE_EMAIL, grade: 7, age_min: 13,
age_max_inclusive: 14, script_name: "csd1-2018", num_students: 30,
use_imperfect_results: false}
)

create_section ({teacher: teacher, name: "CSD 2 (Large)",
login_type: Section::LOGIN_TYPE_EMAIL, grade: 7, age_min: 13,
age_max_inclusive: 14, script_name: "csd1-2018", num_students: 150,
use_imperfect_results: false}
)

create_section ({teacher: teacher, name: "CSP 1",
login_type: Section::LOGIN_TYPE_EMAIL, grade: 10, age_min: 14,
age_max_inclusive: 16, script_name: "csp1-2018", num_students: 30,
use_imperfect_results: false}
)

create_section ({teacher: teacher, name: "CSP 2 (Large)",
login_type: Section::LOGIN_TYPE_EMAIL, grade: 10, age_min: 14,
age_max_inclusive: 16, script_name: "csp1-2018", num_students: 150,
use_imperfect_results: false}
)
end

# will delete the teacher and all of the teacher's sections and students
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: comment style isn't consistent with the others:
Delete the teacher and all of the teacher's sections...

# and recreate.
def self.create_teacher(email, name)
# Delete any existing test data
user = User.find_by_email_or_hashed_email(email)
unless user.nil?
user.sections.each do |section|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note:

The User.sections association is designated dependent destroy.

Same with Section.followers.

So, deleting one of these test users will (soft) delete all their sections and followers. It's worth a comment here to explain that this is to also (hard) delete the test users and to hard delete the section.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, commented

# Delete all students in each section
section.followers.each do |follower|
student_user = User.find_by_id(follower.student_user_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary lookup. We can iterate section.students directly since the follower here is only used to find the student. That will do one sql query with a join, rather than an initial query plus one each to find all the students.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, done

raise "Not a test user" unless student_user.name.include? "Codeberg"
Copy link
Contributor

@aoby aoby Jul 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems dicey. We may override the test user's name, and what's to keep a real user from being named Codeberg?

I'd prefer a more reliable way of differentiating these test users if we want to keep them apart.
What about adding a serialized field to User, only present when true, called something like test_user that we always set through the factory (can be default factory or a trait).

As a serialized property it will be stored in the existing properties JSON column and not require a data migration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly okay with this given the other defenses in place to only run this in development, and I'm not thrilled with the idea of adding a properties column for this purpose (it may not require a migration but it's still another test-specific support in production code). That said, I think the following would be an improvement:

  • Whatever identifying string we use for a test user should be in a constant in this file to minimize the chance of typos.
  • Setting a custom username may be more reliable, since display name can be changed via a normal user workflow and one can imagine changing the display name of one of these users in a test scenario and subsequently failing to delete the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're sure this will stay only in development, then it's ok but then I agree use a constant in this file as the name (or name prefix) and make it something less like a real name and different from the FactoryBot default of "User n Codeberg".

If we think this could ever run on production, then it scares me to automatically hard delete users and I do want more protection even if it means having "test-specific support" in production. We could call it seeded_user so it's less test specific.

I'd like to see a unit test for this too - make sure we delete the users we expect and only them.

UserGeo.where(user_id: student_user.id).destroy_all
student_user.really_destroy!
end
# Delete each section
section.really_destroy!
end
UserGeo.where(user_id: user.id).destroy_all
# Delete the existing test teacher
raise "Not a test user" unless user.name.include? "Codeberg"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

user.really_destroy!
end
# Create the test teacher
create :teacher, {email: email, name: name,
password: TEST_ACCOUNT_PASSWORD, terms_of_service_version: 1}
end

# Generate a random gender choice with reasonable distributions
def self.random_gender
val = prng.rand(100)
gender = nil
case val
when 0..29
gender = "m"
when 30..59
gender = "f"
when 60..89
gender = nil
when 90..94
gender = "n"
when 95..99
gender = "o"
end
gender
end

# Create a test section
# Options is a hash with expected values:
# :teacher - User object of teacher
# :name - section name
# :login_type - Section::LOGIN_TYPE_xxx value of login type for section
# :grade - grade for section
# :age_min - minimum age to generate for students
# :age_max_inclusive - max age (inclusive) to generate for students
# :script_name - name of script to assign for section
# :num_students - number of students to generate for section
# :use_imperfect_results - if true, generate some less than perfect
# results. (CSF allows imperfect results, CSD and CSP do not.)
def self.create_section(options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all these options required? If so it might be better to use keyword arguments.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about that, and yes, these are all required.

script = Script.get_from_cache(options[:script_name])
level_count = script.script_levels.count

# Create the section
section = create :section, {user: options[:teacher], name: options[:name],
login_type: options[:login_type], grade: options[:grade],
script: script}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby-style: the curly braces aren't needed

Also for convenience, you could use Rails' Hash.slice to form a subset hash:

section = create :section, script: script, **options.slice(:teacher, :name, :login_type, :grade)

Note: the double-splat (**) is needed to convert the sliced hash into keyword args so it can be combined with the previous script: keyword argument. It wouldn't be needed otherwise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done


# Create students in section
(0..options[:num_students] - 1).each do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use an inclusive range from 0 to num students - 1, especially when we're not using the iterator? Remove the -1 and either iterate inclusively from 1 to num_students, or exclusively from 0 to num_students.

[development] dashboard > num = 5
=> 5
[development] dashboard > (0..num - 1).map {|n| n} # unneessary math
=> [0, 1, 2, 3, 4]
[development] dashboard > (1..num).map {|n| n} # inclusive from 1
=> [1, 2, 3, 4, 5]
[development] dashboard > (0...num).map {|n| n} # exclusive from 0
=> [0, 1, 2, 3, 4]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, did (1..num)

# Choose random properties and create student
age_min = options[:age_min]
age_max_inclusive = options[:age_max_inclusive]

age = prng.rand(age_max_inclusive + 1 - age_min) + age_min
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rand can take a range.
How about?

age = prng.rand(age_min..age_max_inclusive)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done

gender = random_gender
student_user = create :student, {age: age, gender: gender}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby-style: unnecessary {}s

Elsewhere too. I'm going to stop commenting on each one

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done throughout


# Add student to section
create :follower, {section: section, student_user: student_user}

# Create random student progress.
pct_skipped = prng.rand(15)
pct_imperfect = options[:use_imperfect_results] ? prng.rand(40) + 25 +
pct_skipped : 0

max_level =
if prng.rand(100) < 90
(level_count.to_f * (prng.rand(0.6) + 0.2)).to_i
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth using a range for readability: prng.rand(0.2..0.8)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done

else
# To simulate real-world data, some students have no progress
0
end

# Create progress for this student on each level, up to a random
# max level
current_level = 0
script.script_levels.each do |script_level|
break if current_level == max_level

# Roll the dice to decide if progress is completed, perfect, or
# skipped for this level
rand_val = prng.rand(100)
best_result =
if rand_val < pct_skipped
nil
elsif rand_val < pct_imperfect
ActivityConstants::MINIMUM_PASS_RESULT
else
ActivityConstants::BEST_PASS_RESULT
end

# Save progress for this level if not skipping
if best_result
create :user_level, user: student_user, script_id: script.id,
level_id: script_level.level_id, attempts: 1,
best_result: best_result
end

current_level += 1
end
end
end
end
4 changes: 4 additions & 0 deletions dashboard/lib/tasks/seed.rake
Original file line number Diff line number Diff line change
Expand Up @@ -215,6 +215,10 @@ namespace :seed do
Census::StateCsOffering.seed
end

task sample_data: :environment do
SampleData.seed
end

MAX_LEVEL_SOURCES = 10_000
desc "calculate solutions (ideal_level_source) for levels based on most popular correct solutions (very slow)"
task ideal_solutions: :environment do
Expand Down