Census ap data import #19864
A couple concerns:
end

def self.seed
  if CDO.stub_school_data
Nice reuse of existing config 👍
Ok, I made several changes.
LGTM. I like the use of the etag content hash to determine whether to reseed 👍
def self.seed_from_s3
  etag = AWS::S3.create_client.head_object({bucket: CENSUS_BUCKET_NAME, key: CSV_OBJECT_KEY}).etag
  unless SeededS3Object.where(bucket: CENSUS_BUCKET_NAME, key: CSV_OBJECT_KEY, etag: etag).count > 0
Use `exists?` rather than `count > 0`.
Done.
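For readers skimming this thread, the etag-based reseed check can be sketched in plain Ruby without S3 or ActiveRecord. This is only an illustration: `SeededObjectLog` and `seed_if_changed` are hypothetical names standing in for the `SeededS3Object` model and the seeding method, the bucket and key are made up, and the etag is just an MD5 of sample content rather than a value returned by S3.

```ruby
require 'digest'
require 'set'

# Hypothetical in-memory stand-in for the seeded_s3_objects table.
class SeededObjectLog
  def initialize
    @seen = Set.new
  end

  # True if this exact (bucket, key, etag) combination was already seeded.
  def seeded?(bucket, key, etag)
    @seen.include?([bucket, key, etag])
  end

  def record(bucket, key, etag)
    @seen.add([bucket, key, etag])
  end
end

# Runs the block only when this version of the object has not been seeded.
# Returns true if the block ran, false if it was skipped.
def seed_if_changed(log, bucket, key, etag)
  return false if log.seeded?(bucket, key, etag)
  yield
  log.record(bucket, key, etag)
  true
end

log = SeededObjectLog.new
etag = Digest::MD5.hexdigest("school_code,course\n1,CSA\n") # sample content only

seed_if_changed(log, 'example-bucket', 'example.csv', etag) { } # => true (seeded)
seed_if_changed(log, 'example-bucket', 'example.csv', etag) { } # => false (skipped)
```

Because the etag is part of the lookup, a changed file under the same bucket and key produces a new etag and is seeded again.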
  t.timestamps
end

add_index :seeded_s3_objects, [:bucket, :key, :etag]
Should we also have an index on just `[:bucket, :key]` without `:etag`? I can see wanting a quick check to know whether a particular S3 file is seeded at all (regardless of contents / version).
We shouldn't need an additional index. Since bucket and key are the leading edge of the index, it should be used for queries even if they don't include an etag.
mysql> explain select * from seeded_s3_objects o where o.bucket='a bucket' and o.key='a key';
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | o | NULL | ref | index_seeded_s3_objects_on_bucket_and_key_and_etag | index_seeded_s3_objects_on_bucket_and_key_and_etag | 1536 | const,const | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)
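The leading-edge behavior in the EXPLAIN output above can be illustrated with a toy model in plain Ruby. This is a rough sketch, not how MySQL stores indexes: it treats the composite index as an array sorted by (bucket, key, etag), and `INDEX` and `lookup_by_prefix` are hypothetical names with made-up rows.

```ruby
# Rows of a hypothetical composite index, kept sorted by (bucket, key, etag).
INDEX = [
  ['a bucket', 'a key', 'etag-1'],
  ['a bucket', 'a key', 'etag-2'],
  ['a bucket', 'b key', 'etag-3'],
  ['z bucket', 'a key', 'etag-4']
].sort.freeze

# Finds all entries whose leading columns equal the given prefix.
def lookup_by_prefix(index, prefix)
  n = prefix.length
  # Binary search for the first entry whose leading columns are >= prefix.
  start = (0...index.length).bsearch { |i| (index[i].first(n) <=> prefix) >= 0 }
  return [] if start.nil?
  # Matching entries are contiguous because the array is sorted.
  index.drop(start).take_while { |row| row.first(n) == prefix }
end

lookup_by_prefix(INDEX, ['a bucket', 'a key']) # both 'a key' rows in 'a bucket'
lookup_by_prefix(INDEX, ['a bucket'])          # all three 'a bucket' rows
```

The sort order only helps when the lookup starts from the first column, so in this model a search by key alone, or by etag alone, would have to scan every row.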
object_key = "ap_cs_offerings/#{course}-#{school_year}-#{school_year + 1}.csv"
begin
  etag = AWS::S3.create_client.head_object({bucket: CENSUS_BUCKET_NAME, key: object_key}).etag
  unless SeededS3Object.where(bucket: CENSUS_BUCKET_NAME, key: object_key, etag: etag).count > 0
s/`count > 0`/`exists?`/
Done.
This is the code to import the census AP data from files in S3. For now I am not making this part of the standard seed tasks (seed:all, etc.) because we found some issues with the school code data that we need to work out.
I debated whether it made sense to move this code out of seed.rake and, if so, where to put it. It doesn't feel like it is logically part of the model; it seems more properly part of the DB seeding logic. I'm happy to move it someplace else, though.