Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Census: support CSV V2 format #29567

Merged
merged 6 commits into from Jul 12, 2019
Merged

Census: support CSV V2 format #29567

merged 6 commits into from Jul 12, 2019

Conversation

breville
Copy link
Member

@breville breville commented Jul 9, 2019

Going forward, all CSVs should use this one "V2" format, with no more variants per state (which we support in the existing "V1" format). For the 2018-19 school year, we already have some CSVs using the V1 format, and so we will only attempt to use the V2 format for some states. The assumption is that all states will use the V2 format from 2019-20 onwards.

The V2 format has two columns of note for identifying a school: state_school_id and nces_id. If nces_id is provided, then we use it to look up a state school ID. If nces_is is not provided or is "unspecified", then we fall back to the state_school_id column.

The course is specified by course_id and must be a valid course; we no longer will compare against a list of valid course codes for a state.

Once this code is live, any subsequent seed step will be able to pick up a newly-added V2 CSV file on S3, and should successfully ingest it. We will start this process by just doing one state - ID - and making sure that ingests successfully.

Going forward, all CSVs should use this one "V2" format, with no more variants per state which we do in the existing "V1" format.  For the 2018-19 school year, we already have some CSVs using the V1 format, and so we only attempt to use the V2 format for some states.  The assumption is that all states will use the V2 format from 2019-20 onwards.
Copy link

@agealy agealy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only addition on the description is that we may after all be uploading v2 CSVs for states with existing v1 2018-2019 forms. According to discussion so far, our expectation is that this will work so long as the v2 forms are supersets of their respective v1 forms.

@@ -85,7 +99,23 @@ def self.infer_no(state_code)
INFERRED_NO_EXCLUSION_LIST.exclude? state_code.upcase
end

def self.construct_state_school_id(state_code, row_hash)
def self.construct_state_school_id(state_code, row_hash, school_year)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth renaming this method to get_state_school_id since that’s what we are typically doing here? In some rare/legacy cases, we are constructing the state_school_id from district_id and school_id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this also because it avoids duplicating the function name in School that we call a lot in here.


# The V2 format requires either nces_id or state_school_id.
if nces_id != UNSPECIFIED_VALUE
return School.find_by(id: nces_id)&.state_school_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If find_by the NCES ID does not return a record, this will return nil. Could we try to find the record and then fallback to returning the state_school_id provided in the CSV?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

Copy link
Contributor

@bencodeorg bencodeorg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much simpler, looks great!

@breville breville requested a review from islemaster July 12, 2019 16:58
Copy link
Contributor

@islemaster islemaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@breville breville merged commit 15cd21a into staging Jul 12, 2019
@breville breville deleted the census-support-csv-v2-format branch July 12, 2019 18:54
breville added a commit that referenced this pull request Jul 15, 2019
Now that we have support for the new V2 CSV format in #29567, we have more data to ingest.

We will place these CSV files on S3 for actual ingestion, but since I wanted to test ingestion locally first as we iteratively cleaned up the data, I've just included them in this change.  They can be ingested locally so long as `CDO.stub_school_data` is set to `true`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants