Fix DB constraint violations #1212

Closed
12 tasks done
zaneselvans opened this issue Sep 9, 2021 · 0 comments
Assignees
Labels
data-cleaning Tasks related to cleaning & regularizing data during ETL. epic Any issue whose primary purpose is to organize other issues into a group. metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. sqlite Issues related to interacting with sqlite databases

Comments

@zaneselvans
Member

zaneselvans commented Sep 9, 2021

Description

In getting the SQLite/Parquet ETL (#1176) working, some previously enforced database constraints were relaxed. There are also new constraints that we want to impose on the structure of the DB to keep it tidy and to enable programmatic use of the relational structure, most immediately in the entity harvesting & resolution process (#639). The work includes modifying our data processing so that every primary key is unique and contains no null values, making sure that foreign key relationships and data types are actually checked by the database, and making sure that appropriate NA/Null values are used in the DB.
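
As a rough illustration of the kinds of checks described above, here is a minimal sketch using only Python's standard `sqlite3` module. It is not the project's actual schema or ETL code: the `plants` and `generators` tables, their columns, and the `make_demo_db` and `violates` helpers are all hypothetical names made up for this example. It shows the four constraint types in question: unique primary keys, non-null primary keys, foreign keys, and a data type check.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory DB with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled
    # on the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE plants (
            plant_id INTEGER NOT NULL,
            plant_name TEXT,
            PRIMARY KEY (plant_id)
        );
        CREATE TABLE generators (
            -- NOT NULL is spelled out because SQLite otherwise allows
            -- NULLs in the columns of a composite primary key.
            plant_id INTEGER NOT NULL,
            generator_id TEXT NOT NULL,
            -- A missing value must be a real NULL, not a string like 'n/a'.
            capacity_mw REAL
                CHECK (capacity_mw IS NULL OR typeof(capacity_mw) = 'real'),
            PRIMARY KEY (plant_id, generator_id),
            FOREIGN KEY (plant_id) REFERENCES plants (plant_id)
        );
        """
    )
    return conn


def violates(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> bool:
    """Return True if running the statement breaks a DB constraint."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    conn = make_demo_db()
    insert_gen = "INSERT INTO generators VALUES (?, ?, ?)"
    violates(conn, "INSERT INTO plants VALUES (?, ?)", (1, "Plant A"))
    violates(conn, insert_gen, (1, "G1", 12.5))
    cases = {
        "duplicate primary key": (1, "G1", 5.0),
        "null in primary key": (1, None, 5.0),
        "missing parent plant": (99, "G1", 5.0),
        "string where a number belongs": (1, "G2", "n/a"),
    }
    for label, row in cases.items():
        print(f"{label}: rejected={violates(conn, insert_gen, row)}")
```

With constraints like these declared in the schema, bad records fail at load time instead of surfacing later in an analysis. The explicit `NOT NULL` on the composite key columns matters in SQLite specifically, since it would otherwise accept NULL values there.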

This work should be billed to the Sloan Metadata/Harvest Revamp project in Harvest.

Motivation

  • Our primary data products are well-normalized relational databases. These constraint checks are part of how we know that the data is clean, and the structure allows much more automated use of different tables in conjunction with each other, both by us and others.
  • Some of these constraint violations highlight existing data normalization problems that need to be addressed in order for the data to make sense in many analyses.
  • Many (though maybe not all) of these issues need to be addressed before the new entity harvesting & resolution process (Refactor Harvesting [Sloan] #639) can be fully implemented. That refactor is blocking a lot of new data integration, so we should identify which of these issues are truly blocking it.

In Scope

Out of Scope

@zaneselvans zaneselvans added data-cleaning Tasks related to cleaning & regularizing data during ETL. sqlite Issues related to interacting with sqlite databases metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. epic Any issue whose primary purpose is to organize other issues into a group. labels Sep 9, 2021
@zaneselvans zaneselvans self-assigned this Sep 9, 2021
zaneselvans added a commit that referenced this issue Sep 18, 2021
This is the first big chunk of changes related to changing how we store our metadata, output the databases & Parquet files, and do entity resolution within the EIA datasets. We still have several database constraint violations to fix, enumerated in #1212.

Then (or in parallel with fixing them) we can make the switch over to using the new entity resolution system, which will involve some column name changes. Much of the code for that new system is already written and contained in this PR, but it's not yet being applied in the ETL process.
@zaneselvans zaneselvans pinned this issue Sep 18, 2021
@TrentonBush TrentonBush removed their assignment Sep 30, 2021
@zaneselvans zaneselvans changed the title Fix DB Constraint Violations Fix DB constraint violations Oct 2, 2021
@zaneselvans zaneselvans unpinned this issue Jan 11, 2022

2 participants