
Census ap data import #19864

Merged
merged 10 commits into staging from census-ap-data-import on Jan 12, 2018

Conversation

drewsamnick
Contributor

This is the code to import the census AP data from files in S3. For now I am not making this part of the standard seed tasks (seed:all, etc.) because we found some issues with the school code data that we need to work out.

I debated whether it made sense to move this code out of seed.rake and, if so, where it would make sense to put it. It doesn't feel like it's logically part of the model; it seems more properly part of the DB seeding logic. I'm happy to move it someplace else though.

@drewsamnick
Contributor Author

Create table for AP data

@aoby
Contributor

aoby commented Jan 9, 2018

A couple of concerns:

  1. How large are the files / how long does the seeding take? Should we have a stubbed version to save time on dev environments, like we do for schools and districts? (See the sketch after this list.)

  2. What happens for, say, outside contributors who don't have S3 access? Again, perhaps we should have local stubs.
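
A minimal sketch of what such a stub branch could look like, assuming the same pattern used for schools and districts; the fixture path and the seed_from_csv helper are illustrative, not from this PR:

def self.seed
  if CDO.stub_school_data
    # Hypothetical local fixture shipped with the repo, so no S3 access is needed.
    seed_from_csv('config/census/ap_cs_offerings.stub.csv')
  else
    seed_from_s3
  end
end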

end

def self.seed
  if CDO.stub_school_data
Member

Nice reuse of existing config 👍

@drewsamnick
Contributor Author

Ok, I made several changes.

  1. Added stub data. Since the AP data depends on the schools data, I used the same config flag to control whether to use stub data.
  2. Moved the seed logic into the models.
  3. Added a way to track which S3 object versions have already been seeded and skip them. This will allow us to seed once but still reseed if the file is updated (see the sketch below).
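
Roughly, the version tracking in point 3 works like this; a sketch assembled from the snippets in this review, where the seed_object_once helper name and block form are illustrative, not from the diff:

def self.seed_object_once(bucket, key)
  # The etag is a content hash, so it changes whenever the file is updated.
  etag = AWS::S3.create_client.head_object(bucket: bucket, key: key).etag
  # Skip if this exact version of the object has already been seeded.
  return if SeededS3Object.exists?(bucket: bucket, key: key, etag: etag)
  yield
  # Record the version so future seed runs skip it until the file changes.
  SeededS3Object.create!(bucket: bucket, key: key, etag: etag)
end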

@aoby aoby left a comment

LGTM. I like the use of the etag content hash to determine whether to reseed 👍


def self.seed_from_s3
  etag = AWS::S3.create_client.head_object({bucket: CENSUS_BUCKET_NAME, key: CSV_OBJECT_KEY}).etag
  unless SeededS3Object.where(bucket: CENSUS_BUCKET_NAME, key: CSV_OBJECT_KEY, etag: etag).count > 0
Contributor

use exists? rather than count > 0

Contributor Author

Done.
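
Presumably the revised guard now reads along these lines (a sketch of the suggested change, not the exact diff):

unless SeededS3Object.exists?(bucket: CENSUS_BUCKET_NAME, key: CSV_OBJECT_KEY, etag: etag)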

  t.timestamps
end

add_index :seeded_s3_objects, [:bucket, :key, :etag]
Contributor

Should we also have an index on just [:bucket, :key] without :etag? I can see wanting a quick check to know whether a particular S3 file is seeded at all (regardless of contents / version).

Contributor Author

We shouldn't need an additional index. Since bucket and key are the leading edge of the existing index, it should be used for queries even if they don't include an etag.

mysql> explain select * from seeded_s3_objects o where o.bucket='a bucket' and o.key='a key';
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys                                      | key                                                | key_len | ref         | rows | filtered | Extra |
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
|  1 | SIMPLE      | o     | NULL       | ref  | index_seeded_s3_objects_on_bucket_and_key_and_etag | index_seeded_s3_objects_on_bucket_and_key_and_etag | 1536    | const,const |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)

object_key = "ap_cs_offerings/#{course}-#{school_year}-#{school_year + 1}.csv"
begin
  etag = AWS::S3.create_client.head_object({bucket: CENSUS_BUCKET_NAME, key: object_key}).etag
  unless SeededS3Object.where(bucket: CENSUS_BUCKET_NAME, key: object_key, etag: etag).count > 0
Contributor

s/count > 0/exists?

Contributor Author

Done.

@drewsamnick drewsamnick merged commit d3ecbfa into staging Jan 12, 2018
@drewsamnick drewsamnick deleted the census-ap-data-import branch January 12, 2018 00:11