Skip to content
This repository has been archived by the owner on Feb 14, 2024. It is now read-only.

Replace id column data type #59

Merged
merged 3 commits into from
Sep 15, 2021
Merged

Replace id column data type #59

merged 3 commits into from
Sep 15, 2021

Conversation

cduhn17
Copy link
Collaborator

@cduhn17 cduhn17 commented Sep 14, 2021

This update will change the id columns data type to uuid of all tables and enables uuid in Postgres.

🗣 Description

The "id" column data type ''text" is replaced with uuid at all table schema.

💭 Motivation and context

The motivation is to future proof the table schema in the event the database is merged into a data warehouse where there may be many of the same schema getting merged together. When operations like this happen the id columns of each database to be merged , would have duplicate id integers. The id column is primary key and as a result duplicated "id" would be filtered out of the data to be merged and the data would be lost. This PR addresses issue #49

🧪 Testing

✅ Checklist

  • This PR has an informative and human-readable title.
  • Changes are limited to a single goal - eschew scope creep!
  • All future TODOs are captured in issues, which are referenced
    in code comments.
  • All relevant type-of-change labels have been added.
  • I have read the CONTRIBUTING document.
  • These code changes follow cisagov code standards.
  • All relevant repo and/or project documentation has been updated
    to reflect the changes in this PR.
  • Tests have been added and/or modified to cover the changes in this PR.
  • All new and existing tests pass.

This update will change the id columns data type to uuid of all tables and enables uuid in Postgres.
@coveralls
Copy link

coveralls commented Sep 14, 2021

Pull Request Test Coverage Report for Build 1237867968

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 76.987%

Totals Coverage Status
Change from base Build 1233850694: 0.0%
Covered Lines: 174
Relevant Lines: 217

💛 - Coveralls

Get all new changes from develop repo..
@schmelz21 schmelz21 added the improvement This issue or pull request will add or improve functionality, maintainability, or ease of use label Sep 14, 2021
@schmelz21
Copy link
Collaborator

@cduhn17 I like this idea, just a couple of quick thoughts to run by you:

  • Do we want to consider having UUID associated to a single/simple column "name" to build continuity. For example use "_id", "id", "key", "pk" or "uuid" or alternatively "THE_TABLE_NAME_uid"? The thought here as we work with various sources, we may find conflict in using something like "organization_id" as it could also be a key to one of our data sources. This "organization_id" would then be "Foreign". Whatever selected would then be "off-limits".
  • For this PR do we want to consider including 'timestamp' for a column for record creation giving us a source to retrieve data.
  • Last one for consideration, do we want to leverage PostGres "Schemas" for each of our sources to better handle long-term objectives. Also becasue there will be times where sources will come and go.... For example a data sources may construct "exposed_credentials" natively. This could then become sixgill.exposed_credentials and hibp.exposed_credentials respectively. Secondarily a scheme for P&E Assets.

@cduhn17
Copy link
Collaborator Author

cduhn17 commented Sep 14, 2021

@schmelz-ctr ,

I agree with creating a single identified "id" column. I like the simplicity and ease to write "_uid" that way when reading/writing and execution sql statements i.e. joins. As well as the reasons that you mentioned.
@schmelz-ctr @aloftus23 @DJensen94
In regards to the separate schemas per source. I think separating the schemas by source goes outside of the scope of this PR but its a real consideration and we should make that another issue. A thought on that is to keep the quantity of tables to a minimum and add a column called "source" this would add continuity to the data and reduce the complexity of sql statements(As the data sources grow, execution time may be impacted. In the case of joins).

@schmelz21
Copy link
Collaborator

schmelz21 commented Sep 14, 2021

In regards to the separate schemas per source. I think separating the schemas by source goes outside of the scope of this PR but its a real consideration and we should make that another issue. A thought on that is to keep the quantity of tables to a minimum and add a column called "source" this would add continuity to the data and reduce the complexity of sql statements(As the data sources grow, execution time may be impacted. In the case of joins).

Makes sense.. for now agree simpler the better.

@schmelz21 schmelz21 added this to PR Reviews in Sprint 4 (9/15 – 10/1) Sep 15, 2021
@cduhn17 cduhn17 marked this pull request as ready for review September 15, 2021 15:27
Copy link
Contributor

@DJensen94 DJensen94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@schmelz21 schmelz21 merged commit e20fd50 into develop Sep 15, 2021
@schmelz21 schmelz21 deleted the CD-working-dbschema branch September 15, 2021 19:29
@schmelz21 schmelz21 linked an issue Sep 15, 2021 that may be closed by this pull request
1 task
@schmelz21 schmelz21 mentioned this pull request Sep 15, 2021
1 task
@schmelz21 schmelz21 moved this from Pull Request to Done in Sprint 4 (9/15 – 10/1) Sep 15, 2021
aloftus23 pushed a commit that referenced this pull request Apr 13, 2022
Sort the Package Requirements in setup.py
aloftus23 pushed a commit that referenced this pull request Apr 13, 2022
aloftus23 pushed a commit that referenced this pull request Apr 15, 2022
Sort the Package Requirements in setup.py
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
improvement This issue or pull request will add or improve functionality, maintainability, or ease of use
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Primary Key integer or UUID
5 participants