Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply "quick fix" data ingestion performance improvements #187

Merged
merged 20 commits into from
Aug 25, 2020
Merged

Conversation

krivard
Copy link
Contributor

@krivard krivard commented Aug 12, 2020

Fixes cmu-delphi/covidcast-indicators#158

  • Add flag columns for wip and latest-issue status
  • Partition direction updates x24 for memory and running time
  • Partition meta cache update xN(signal, source) for memory
  • Give timestamp columns more descriptive names
  • Extend signal name from 32 to 64 chars

Results:

  • Direction update runs in 35-40 minutes each partition, compatible with an hourly ingestion cycle
  • Meta cache update runs in 90 minutes total, compatible with once/twice daily cycle

Requires the following setup in prod before merging:

  • MariaDB version update
  • Table schema edits, as above
  • One-time computation of is_wip and is_latest_issue for existing rows

melange396 and others added 20 commits July 23, 2020 15:33
…pdates.

Also fixed/improved setUp in integrations/server/test_covidcast_meta.py and added skeleton-code for new `missing_*` rows.
Changed column names back to timestamp1, timestamp2 for testing purposes.
tests changed as i pulled the serialization down a layer
* Implemented the Quick Fix: Direction should be updated to NA wherever there are no historical data to compute the stdev from.
* Modified the integration test to test for this behavior
* Updated column names back to value_updated_timestamp and  direction_updated_timestamp.
change signal type from VARCHAR(32) to VARCHAR(64) in temporary tables.
* Database schema
* Signal name length guard in CSV uploader
* Integration test verifies signal length 32<x<64 succeeds, x>64 fails
Increase maximum signal length for COVIDcast to 64 characters
@krivard krivard mentioned this pull request Aug 12, 2020
@sgratzl
Copy link
Member

sgratzl commented Aug 12, 2020

If I understand the is_latest_issue correctly -> doesn't that mean we could get rid of the sub query:

//fetch most recent issues
$sub_condition_asof = "TRUE";
if ($as_of !== null) {
$sub_condition_asof = "(`issue` <= {$as_of})";
}
$sub_fields = "max(`issue`) `max_issue`, `time_type`, `time_value`, `source`, `signal`, `geo_type`, `geo_value`";
$sub_group = "`time_type`, `time_value`, `source`, `signal`, `geo_type`, `geo_value`";
$sub_condition = "x.`max_issue` = t.`issue` AND x.`time_type` = t.`time_type` AND x.`time_value` = t.`time_value` AND x.`source` = t.`source` AND x.`signal` = t.`signal` AND x.`geo_type` = t.`geo_type` AND x.`geo_value` = t.`geo_value`";
$subquery = "JOIN (SELECT {$sub_fields} FROM {$table} WHERE ({$conditions} AND {$sub_condition_asof}) GROUP BY {$sub_group}) x ON {$sub_condition}";
$condition_version = 'TRUE';

@krivard
Copy link
Contributor Author

krivard commented Aug 12, 2020

@sgratzl Yes, though I'm going to call that out of scope for this PR, mostly because staging this batch of changes has been more work than expected. Once it's merged we can do a followup to drop the subquery from the API server code.

@krivard krivard merged commit ddbcec5 into main Aug 25, 2020
@krivard krivard deleted the staging branch November 18, 2020 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resolve performance issues related to data versioning (quick way)
4 participants