Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample data generation #23558

Merged
merged 4 commits into from Jul 11, 2018
Merged

Sample data generation #23558

merged 4 commits into from Jul 11, 2018

Conversation

jeremydstone
Copy link

This PR adds the ability for developers to easily generate some representative test data useful to develop features against.

Developers can now run the command: dashboard/bundle exec rake seed:sample_data to create sample data.

Presently, the sample data created is a single test teacher (testteacher@code.org); a normal-sized (30) and large (150) section of each of CSF, CSD and CSP; test students to fill the sections with a distribution of ages and genders; and student progress for all students with a reasonable distribution of completions, results, skips and no progress. This set of sample data is most useful for teacher feature scenarios. In the future, we could add additional sample data for other scenarios such as PLC.

The sample data will be persistent on the developer's machine unless some action is taken separately to delete it. The sample data creation process is idempotent; if you re-run the script, it will delete any existing sample data and recreate it. So as we add to and change the sample data, developers can update to the latest simply be re-running the script.

Presently this is intended to be run only in the development environment (and will raise if run outside the development environment). If we find it useful, in the future we could consider automatically populating test environments with sample data.

Copy link
Contributor

@caleybrock caleybrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me - will definitely be super useful!

)
end

# will delete the teacher and all of the teacher's sections and students
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: comment style isn't consistent with the others:
Delete the teacher and all of the teacher's sections...

Copy link
Contributor

@aoby aoby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

I have a bunch of small style and readability suggestions, and one concern about the potential to delete the wrong users (with a suggestion).

# Returns a seeded random number generator for consistent test data
def self.prng
@@prng ||= Random.new(0)
@@prng
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is unnecessary. Line 9 will evaluate to this anyway:

irb(main):002:0> defined? foo
=> nil
irb(main):003:0> foo ||= 'bar'
=> "bar"
irb(main):004:0> foo
=> "bar"
irb(main):005:0> foo = 'another value'
=> "another value"
irb(main):006:0> foo ||= 'bar'
=> "another value"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, changed

@@prng = nil

# Returns a seeded random number generator for consistent test data
def self.prng
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the p? Persistent?
Is this a common abbreviation? I'd prefer a more clear and descriptive name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be a common abbreviation for Pseudorandom Number Generator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess I should have googled that. How about changing the comment s/random/pseudorandom ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had chosen prng also because Ruby docs use that variable name in their samples for Random: https://ruby-doc.org/core-2.2.0/Random.html

But it may be unnecessarily unconfusing. Changed to rng.

login_type: Section::LOGIN_TYPE_PICTURE, grade: 2, age_min: 7,
age_max_inclusive: 9, script_name: "coursea-2018", num_students: 30,
use_imperfect_results: true}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby-style:

  1. The curly braces aren't needed. For that matter neither are the parentheses, but they're helpful for multi-line declarations like this.
  2. The first arg should be on a new line and the final end parenthesis should be unindented one level. I'm surprised Rubocop didn't complain about this.
create_section(
  teacher: teacher, name: "CSF 1",
  login_type: Section::LOGIN_TYPE_PICTURE, grade: 2, age_min: 7,
  age_max_inclusive: 9, script_name: "coursea-2018", num_students: 30,
  use_imperfect_results: true
)

Minor Ruby-style nit: prefer single quote strings when not using interpolation.

I also tend to prefer splitting these long hashes into one key per line for readability. That's not necessary but it tends to make reading, comparing, and editing easier. This would be my ideal format:

create_section(
  teacher: teacher, 
  name: 'CSF 1',
  login_type: Section::LOGIN_TYPE_PICTURE, 
  grade: 2, 
  age_min: 7,
  age_max_inclusive: 9, 
  script_name: 'coursea-2018', 
  num_students: 30,
  use_imperfect_results: true
)

Same for the other calls below

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, all curly braces removed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentations fixed. I tried one line per parameter but it explodes that section of code into multiple screenfuls. I left it with multiple parameters per line because I can see the whole sample case generation strategy on one screen.

# Delete any existing test data
user = User.find_by_email_or_hashed_email(email)
unless user.nil?
user.sections.each do |section|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note:

The User.sections association is designated dependent destroy.

Same with Section.followers.

So, deleting one of these test users will (soft) delete all their sections and followers. It's worth a comment here to explain that this is to also (hard) delete the test users and to hard delete the section.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, commented

user.sections.each do |section|
# Delete all students in each section
section.followers.each do |follower|
student_user = User.find_by_id(follower.student_user_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary lookup. We can iterate section.students directly since the follower here is only used to find the student. That will do one sql query with a join, rather than an initial query plus one each to find all the students.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, done

# Create the section
section = create :section, {user: options[:teacher], name: options[:name],
login_type: options[:login_type], grade: options[:grade],
script: script}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby-style: the curly braces aren't needed

Also for convenience, you could use Rails' Hash.slice to form a subset hash:

section = create :section, script: script, **options.slice(:teacher, :name, :login_type, :grade)

Note: the double-splat (**) is needed to convert the sliced hash into keyword args so it can be combined with the previous script: keyword argument. It wouldn't be needed otherwise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done

script: script}

# Create students in section
(0..options[:num_students] - 1).each do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use an inclusive range from 0 to num students - 1, especially when we're not using the iterator? Remove the -1 and either iterate inclusively from 1 to num_students, or exclusively from 0 to num_students.

[development] dashboard > num = 5
=> 5
[development] dashboard > (0..num - 1).map {|n| n} # unneessary math
=> [0, 1, 2, 3, 4]
[development] dashboard > (1..num).map {|n| n} # inclusive from 1
=> [1, 2, 3, 4, 5]
[development] dashboard > (0...num).map {|n| n} # exclusive from 0
=> [0, 1, 2, 3, 4]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, did (1..num)

age_min = options[:age_min]
age_max_inclusive = options[:age_max_inclusive]

age = prng.rand(age_max_inclusive + 1 - age_min) + age_min
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rand can take a range.
How about?

age = prng.rand(age_min..age_max_inclusive)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done


age = prng.rand(age_max_inclusive + 1 - age_min) + age_min
gender = random_gender
student_user = create :student, {age: age, gender: gender}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby-style: unnecessary {}s

Elsewhere too. I'm going to stop commenting on each one

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done throughout


max_level =
if prng.rand(100) < 90
(level_count.to_f * (prng.rand(0.6) + 0.2)).to_i
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth using a range for readability: prng.rand(0.2..0.8)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done

@jeremydstone
Copy link
Author

@aoby how do these changes look?

user.sections.each do |section|
# Hard-delete all students in each section.
section.students.each do |student_user|
raise "Not a sample student" unless student_user.name =~ SAMPLE_STUDENT_NAME_REGEX
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth adding some identifiable information in the error, like student id or email.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done

UserGeo.where(user_id: user.id).destroy_all
# Delete the existing test teacher
raise "Not a sample teacher" unless user.name.eql? SAMPLE_TEACHER_NAME
raise "Not a sample teacher" unless user.email.eql? SAMPLE_TEACHER_EMAIL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx done

# Choose random properties and create student
age_min = options[:age_min]
age_max_inclusive = options[:age_max_inclusive]
current_student += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this will reset the count for each section, reusing the student names between sections. Names don't have to be unique so that's ok but it might be clearer / easier to distinguish if we increase the number for all students in the seed (across all sections).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I did this deliberately because it seems less confusing to look at sections this way (with numbering within sections). If we prefer it "globally numbered" we can make that change later.

# and recreate. Sections and followers would be soft-deleted by
# dependency when we delete the teacher; but to not leave a trail of
# old test data behind, we explictly hard-delete.
def self.create_teacher(email, password, name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since deleting happens in this method that could conceivably in the future be called from anywhere, how about duplicating the

raise "Should not be run outside of development" unless CDO.rack_env?(:development)

check at the top of this method too, just to be paranoid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see you've done this down on line 82. I'd pull it up here - as currently written, you could hit the hard-delete on a section without raising if that section had no students.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Added another one at top of method. I'd like to have defense in depth as the code changes in the future, perhaps inadvertently gets copied/moved into other contexts, etc.

# Delete any existing test data
user = User.find_by_email_or_hashed_email(email)
unless user.nil?
user.sections.each do |section|
# Hard-delete all students in each section.
section.students.each do |student_user|
raise "Not a sample student" unless student_user.name =~ SAMPLE_STUDENT_NAME_REGEX
raise "Not a sample student - #{student_user.name}" unless student_user.name =~ SAMPLE_STUDENT_NAME_REGEX
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name is not unique. Why not provide something that we can identify the record with like id or hashed_email?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is not unique, but it IS identifiable. The only way we could accidentally delete, say, prod data is if someone named a user "TestStudent[#] Codeberg" (AND other code here somehow went berzerk and found it). I think is is a sufficiently robust cross-check for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to the error message, not the check. Getting an error "Not a sample student - Bobby Tables" is not that helpful if we want to find which student that is. I want the error message to have enough context to debug this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, I don't think this really satisfies my earlier request

raise "Not a sample teacher" unless user.name.eql? SAMPLE_TEACHER_NAME
raise "Not a sample teacher" unless user.email.eql? SAMPLE_TEACHER_EMAIL
unless (user.name.eql? SAMPLE_TEACHER_NAME) && (user.email.eql? SAMPLE_TEACHER_EMAIL)
raise "Not a sample teacher - #{user.name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name is not unique. Why not provide something that we can identify the record with like id or email?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do check for email here as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. I'm referring to the error message

@jeremydstone jeremydstone merged commit a5cae9b into staging Jul 11, 2018
@jeremydstone jeremydstone deleted the sample_data branch August 15, 2018 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants