New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Census: support CSV V2 format #29567
Conversation
Going forward, all CSVs should use this one "V2" format, with no more variants per state which we do in the existing "V1" format. For the 2018-19 school year, we already have some CSVs using the V1 format, and so we only attempt to use the V2 format for some states. The assumption is that all states will use the V2 format from 2019-20 onwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only addition on the description is that we may after all be uploading v2 CSVs for states with existing v1 2018-2019 forms. According to discussion so far, our expectation is that this will work so long as the v2 forms are supersets of their respective v1 forms.
@@ -85,7 +99,23 @@ def self.infer_no(state_code) | |||
INFERRED_NO_EXCLUSION_LIST.exclude? state_code.upcase | |||
end | |||
|
|||
def self.construct_state_school_id(state_code, row_hash) | |||
def self.construct_state_school_id(state_code, row_hash, school_year) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth renaming this method to get_state_school_id
since that’s what we are typically doing here? In some rare/legacy cases, we are constructing the state_school_id from district_id and school_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this also because it avoids duplicating the function name in School
that we call a lot in here.
|
||
# The V2 format requires either nces_id or state_school_id. | ||
if nces_id != UNSPECIFIED_VALUE | ||
return School.find_by(id: nces_id)&.state_school_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If find_by
the NCES ID does not return a record, this will return nil
. Could we try to find the record and then fallback to returning the state_school_id
provided in the CSV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much simpler, looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
Now that we have support for the new V2 CSV format in #29567, we have more data to ingest. We will place these CSV files on S3 for actual ingestion, but since I wanted to test ingestion locally first as we iteratively cleaned up the data, I've just included them in this change. They can be ingested locally so long as `CDO.stub_school_data` is set to `true`.
Going forward, all CSVs should use this one "V2" format, with no more variants per state (which we support in the existing "V1" format). For the 2018-19 school year, we already have some CSVs using the V1 format, and so we will only attempt to use the V2 format for some states. The assumption is that all states will use the V2 format from 2019-20 onwards.
The V2 format has two columns of note for identifying a school:
state_school_id
andnces_id
. Ifnces_id
is provided, then we use it to look up a state school ID. Ifnces_is
is not provided or is"unspecified"
, then we fall back to thestate_school_id
column.The course is specified by
course_id
and must be a valid course; we no longer will compare against a list of valid course codes for a state.Once this code is live, any subsequent seed step will be able to pick up a newly-added V2 CSV file on S3, and should successfully ingest it. We will start this process by just doing one state - ID - and making sure that ingests successfully.