Proficiencies backfill 2019 [ci skip] #32391

Merged Jan 22, 2020 (2 commits)

@uponthesun uponthesun commented Dec 12, 2019

See #26362 for more details. Basically just doing the same thing again for this year.

I'm in the process of testing on a clone; will update once I have. In the meantime, double checking for any mistakes around script IDs and start/end dates would be greatly appreciated.

UPDATE:

Successfully ran backfill (see later post for before and after numbers). Documenting the undocumented steps just in case we need this again.

  1. Create the sed files and .sql files for the date ranges that are missing data. Each sed file is applied to the template to generate the corresponding .sql file. (The .sql files are just plain text files with a SQL command in them.) A few tricky things:
  • You need one sed file and one .sql file for each month, starting from when the levels with missing data were introduced and continuing through the current date. (The script still needs data from the period after the LCD import to calculate basic_proficiency_at correctly.) In the date ranges in the user_proficiencies script, use the boolean field already_recorded to indicate whether each range had missing data.
  • Find the starting and ending IDs for each date range through trial and error (querying the db).
  • For the month where the missing tagging data is introduced, break it up into two ranges, since one is "true" and one is "false" for already_recorded.
  2. Run each of the SQL commands from the .sql files. This creates one temp table in the db for each date range. You can copy and paste into the mysql CLI, but it may be faster to have mysql read the command from the file instead: https://dev.mysql.com/doc/refman/8.0/en/mysql-batch-commands.html

  3. For each temp table, dump it to a local file using mysqldump.

  4. Convert the resulting dump files to CSV. Surprisingly, there's no official tool for this, and even more surprisingly, this script works: https://github.com/jamesmishra/mysqldump-to-csv

  5. Finally, update the hardcoded file paths (see the DATA_DIRECTORY constant) in the user_proficiencies script and run the script. Enter "Y" at both prompts.
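
For anyone repeating this, the month-by-month ranges from the first step can be enumerated with a small helper rather than by hand. This is only a rough sketch, not part of the PR: `month_ranges`, `table_name`, and `recorded_from` are hypothetical names, the end and cutoff dates in the example are made up, and it assumes that ranges starting on or after the cutoff are the ones marked already_recorded = true. The table name pattern follows the temp table name that appears later in this thread. Finding the starting and ending IDs still has to be done against the db.

```python
from datetime import date


def month_ranges(start, end, recorded_from):
    """Return (range_start, range_end, already_recorded) tuples, end-exclusive.

    Assumption (not confirmed by the PR): ranges that begin on or after
    `recorded_from` are treated as already_recorded=True, earlier ones as False.
    The month containing `recorded_from` is split into two ranges.
    """
    ranges = []
    cur = start
    while cur < end:
        # First day of the following month, capped at the overall end date.
        if cur.month == 12:
            first_of_next = date(cur.year + 1, 1, 1)
        else:
            first_of_next = date(cur.year, cur.month + 1, 1)
        nxt = min(first_of_next, end)
        if cur < recorded_from < nxt:
            # The cutoff falls inside this month: emit one range per flag value.
            ranges.append((cur, recorded_from, False))
            ranges.append((recorded_from, nxt, True))
        else:
            ranges.append((cur, nxt, cur >= recorded_from))
        cur = nxt
    return ranges


def table_name(range_start, range_end):
    """Temp table name in the user_proficiencies_YYYYMMDD_YYYYMMDD style."""
    return f"user_proficiencies_{range_start:%Y%m%d}_{range_end:%Y%m%d}"


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    for s, e, recorded in month_ranges(date(2019, 6, 1), date(2019, 9, 1), date(2019, 7, 15)):
        print(table_name(s, e), recorded)
```

Each tuple corresponds to one sed file, one .sql file, and one temp table.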

See the query in the post below for a way to check whether it (looks like it) worked. Of course, it's best if you can test + rehearse this all beforehand on a clone + adhoc.

Running the final script took about 3 hours wall clock time for me.
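
The steps that run the .sql files and dump the temp tables are repetitive enough that generating the command lines may save some copy and paste. A minimal sketch, under assumptions: the database name `dashboard` is taken from a later comment in this thread, the file names are hypothetical, and connection options (host, user, password) are left out entirely. It only builds the command strings; nothing is executed.

```python
import shlex

# Assumption: database name as mentioned later in this thread.
DB_NAME = "dashboard"


def batch_command(sql_path):
    """Shell command that feeds one .sql file to the mysql client in batch mode."""
    return f"mysql {shlex.quote(DB_NAME)} < {shlex.quote(sql_path)}"


def dump_command(table, out_path):
    """Shell command that dumps a single table to a local file with mysqldump."""
    return f"mysqldump {shlex.quote(DB_NAME)} {shlex.quote(table)} > {shlex.quote(out_path)}"


if __name__ == "__main__":
    # Hypothetical table list; in practice, one entry per date range.
    tables = ["user_proficiencies_20190601_20190701"]
    for table in tables:
        print(batch_command(f"{table}.sql"))
        print(dump_command(table, f"{table}.dump.sql"))
```

Printing the commands first and reviewing them before running anything seems safer for a production database than executing them directly.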

@bencodeorg
Contributor

bencodeorg commented Dec 14, 2019

Script IDs and dates look good to me!

There's a filename here that looks out of date, but not sure it matters:
https://github.com/code-dot-org/code-dot-org/pull/32391/files#diff-1481397638eaebaca2f724a6ea1fd83fR129

[edit: this link isn't super helpful, search for AGGREGATE_FILENAME in the script]

@uponthesun
Author

Backfill completed successfully.

Before:

mysql> select count(user_id), month(basic_proficiency_at) as month, year(basic_proficiency_at) as year from user_proficiencies group by year, month;
+----------------+-------+------+
| count(user_id) | month | year |
+----------------+-------+------+
... 
|          42389 |     6 | 2019 |
|          16998 |     7 | 2019 |
|          24065 |     8 | 2019 |
|          82779 |     9 | 2019 |
|         140403 |    10 | 2019 |
|         137871 |    11 | 2019 |
|          88147 |    12 | 2019 |
+----------------+-------+------+

After:

mysql> select count(user_id), month(basic_proficiency_at) as month, year(basic_proficiency_at) as year from user_proficiencies group by year, month;
+----------------+-------+------+
| count(user_id) | month | year |
+----------------+-------+------+
...
|          46924 |     6 | 2019 |
|          22271 |     7 | 2019 |
|          35734 |     8 | 2019 |
|          93830 |     9 | 2019 |
|         139825 |    10 | 2019 |
|         138688 |    11 | 2019 |
|          88991 |    12 | 2019 |
+----------------+-------+------+
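
To make the before/after comparison easier to read, the per-month change can be computed from the rows shown above. This is just arithmetic on the posted numbers; the elided rows ("...") are not included, so the total covers June through December 2019 only.

```python
# Counts copied from the "Before" and "After" query output above (2019, by month).
before = {6: 42389, 7: 16998, 8: 24065, 9: 82779, 10: 140403, 11: 137871, 12: 88147}
after = {6: 46924, 7: 22271, 8: 35734, 9: 93830, 10: 139825, 11: 138688, 12: 88991}

# Per-month change in the number of users with basic_proficiency_at in that month.
deltas = {month: after[month] - before[month] for month in before}
total = sum(deltas.values())

if __name__ == "__main__":
    for month, delta in deltas.items():
        print(f"2019-{month:02d}: {delta:+d}")
    print(f"total: {total:+d}")
```

Over the listed months the net change is +33,611 users, and October is the only listed month whose count went down (by 578).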

@sureshc
Contributor

sureshc commented Jan 8, 2020

Are you ready to merge this? I'm going to merge staging-next into staging early next week after implementing a couple of fixes for the staging-next Drone builds and staging-next managed server builds.

@sureshc
Contributor

sureshc commented Jan 16, 2020

Also, can you delete what appears to be a temp table in dashboard, "user_proficiencies_20190601_20190701", on the production database? Its presence causes schema.rb generated on production-daemon to have a diff against what's in the production branch, which blocks any build that contains an intentional schema change.
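
A possible way to catch leftovers like this after a future backfill is to filter the table list for names that look like backfill temp tables. This is a sketch only: the pattern is inferred from the single table name above, the helper names are hypothetical, and the table list would have to come from the database (for example from SHOW TABLES). The generated statements are meant to be reviewed by hand, not run automatically.

```python
import re

# Assumption: backfill temp tables are named user_proficiencies_YYYYMMDD_YYYYMMDD,
# inferred from the one example in this thread.
TEMP_TABLE_PATTERN = re.compile(r"user_proficiencies_\d{8}_\d{8}")


def leftover_temp_tables(table_names):
    """Return the names that match the backfill temp table pattern exactly."""
    return [name for name in table_names if TEMP_TABLE_PATTERN.fullmatch(name)]


def drop_statements(table_names):
    """Build DROP TABLE statements for manual review."""
    return [f"DROP TABLE IF EXISTS `{name}`;" for name in table_names]


if __name__ == "__main__":
    # Hypothetical table list for illustration.
    tables = ["users", "user_proficiencies", "user_proficiencies_20190601_20190701"]
    for statement in drop_statements(leftover_temp_tables(tables)):
        print(statement)
```

The real user_proficiencies table does not match the pattern, so it is never included in the output.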

@uponthesun uponthesun changed the base branch from staging-next to staging January 22, 2020 00:51
@uponthesun uponthesun merged commit 79c6db1 into staging Jan 22, 2020
@uponthesun uponthesun deleted the proficiencies-backfill-2019-clean branch January 22, 2020 00:53