
Census ap data import #19864

Merged
merged 10 commits into staging from census-ap-data-import on Jan 12, 2018

Conversation

drewsamnick
Contributor

This is the code to import the census AP data from files in S3. For now I am not making this part of the standard seed tasks (seed:all, etc.) because we found some issues with the school code data that we need to work out.

I debated whether it made sense to move this code out of seed.rake and, if so, where it would make sense to put it. It doesn't feel like it's logically part of the model; it seems more properly part of the DB seeding logic. I'm happy to move it someplace else though.

@drewsamnick
Contributor Author

Create table for AP data

@aoby
Contributor

aoby commented Jan 9, 2018

A couple of concerns:

  1. How large are the files / how long does the seeding take? Should we have a stubbed version to save time on dev environments, like we do for schools and districts? (See the sketch after this list.)

  2. What happens for, say, outside contributors who don't have S3 access? Again, perhaps we should have local stubs.
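
A minimal sketch of what such a stub branch could look like, assuming the same pattern used for schools and districts; the fixture path and the seed_from_csv helper are illustrative, not from this PR:

def self.seed
  if CDO.stub_school_data
    # Hypothetical local fixture shipped with the repo, so no S3 access is needed.
    seed_from_csv('config/census/ap_cs_offerings.stub.csv')
  else
    seed_from_s3
  end
end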

end

def self.seed
  if CDO.stub_school_data
Member

Nice reuse of existing config 👍

@drewsamnick
Contributor Author

Ok, I made several changes.

  1. Added stub data. Since the AP data depends on the schools data, I used the same config flag to control whether to use stub data.
  2. Moved the seed logic into the models.
  3. Added a way to track which S3 object versions have already been seeded and skip them. This will allow us to seed once but still reseed if the file is updated (see the sketch below).
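
Roughly, the version tracking in point 3 works like this; a sketch assembled from the snippets in this review, where the seed_object_once helper name and block form are illustrative, not from the diff:

def self.seed_object_once(bucket, key)
  # The etag is a content hash, so it changes whenever the file is updated.
  etag = AWS::S3.create_client.head_object(bucket: bucket, key: key).etag
  # Skip if this exact version of the object has already been seeded.
  return if SeededS3Object.exists?(bucket: bucket, key: key, etag: etag)
  yield
  # Record the version so future seed runs skip it until the file changes.
  SeededS3Object.create!(bucket: bucket, key: key, etag: etag)
end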

@aoby aoby left a comment

LGTM. I like the use of the etag content hash to determine whether to reseed 👍


def self.seed_from_s3
  etag = AWS::S3.create_client.head_object({bucket: CENSUS_BUCKET_NAME, key: CSV_OBJECT_KEY}).etag
  unless SeededS3Object.where(bucket: CENSUS_BUCKET_NAME, key: CSV_OBJECT_KEY, etag: etag).count > 0
Contributor

use exists? rather than count > 0

Contributor Author

Done.
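
Presumably the revised guard now reads along these lines (a sketch of the suggested change, not the exact diff):

unless SeededS3Object.exists?(bucket: CENSUS_BUCKET_NAME, key: CSV_OBJECT_KEY, etag: etag)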

  t.timestamps
end

add_index :seeded_s3_objects, [:bucket, :key, :etag]
Contributor

Should we also have an index on just [:bucket, :key] without :etag? I can see wanting a quick check to know whether a particular S3 file is seeded at all (regardless of contents / version).

Contributor Author

We shouldn't need an additional index. Since bucket and key are the leading edge of the existing index, it should be used for queries even if they don't include an etag.

mysql> explain select * from seeded_s3_objects o where o.bucket='a bucket' and o.key='a key';
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys                                      | key                                                | key_len | ref         | rows | filtered | Extra |
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
|  1 | SIMPLE      | o     | NULL       | ref  | index_seeded_s3_objects_on_bucket_and_key_and_etag | index_seeded_s3_objects_on_bucket_and_key_and_etag | 1536    | const,const |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+------+----------------------------------------------------+----------------------------------------------------+---------+-------------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)

object_key = "ap_cs_offerings/#{course}-#{school_year}-#{school_year + 1}.csv"
begin
  etag = AWS::S3.create_client.head_object({bucket: CENSUS_BUCKET_NAME, key: object_key}).etag
  unless SeededS3Object.where(bucket: CENSUS_BUCKET_NAME, key: object_key, etag: etag).count > 0
Contributor

s/count > 0/exists?

Contributor Author

Done.

@drewsamnick drewsamnick merged commit d3ecbfa into staging Jan 12, 2018
@drewsamnick drewsamnick deleted the census-ap-data-import branch January 12, 2018 00:11