On FT_branch_1, for SAP, normalize date format and/or deduplicate data #138

Closed
aih opened this issue Jan 28, 2021 · 7 comments

@aih
Collaborator

aih commented Jan 28, 2021

It appears that the Statements of Administrative Policy here are duplicates, but the data is formatted differently between them (we want the date format of the first one: March 5, 2019), and the PDF link only appears on the second one:

image

@aih
Collaborator Author

aih commented Jan 28, 2021

115hr2 seems to have duplicates as well. Three of these are dated May 15, 2018, and the two links go to the same document. I am curious what the first statement, dated June 26, 2018, refers to.

It is possible that this is a consequence of how the website presented the data, and that the White House site itself had duplicate entries. If that is the case, there is nothing we can do about the data. We should consider adding a note to the table that duplicates may result from poor data on the White House website.

image
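A minimal sketch of how duplicates like the ones described above could be collapsed, assuming each scraped statement is a dict with hypothetical `bill`, `date`, and `pdf` fields (these names are illustrative and not taken from the project). Records are keyed on the bill and a normalized date, and the PDF link is kept from whichever copy has one. Note the assumption that two statements for the same bill on the same day are the same statement, which may not always hold.

```python
from datetime import datetime


def _iso(raw):
    """Normalize 'May 15, 2018' or '2018-05-15' to '2018-05-15'."""
    for fmt in ("%B %d, %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def dedupe_statements(records):
    """Merge records that share a bill and date, keeping any PDF link found."""
    merged = {}
    for rec in records:
        key = (rec["bill"].strip().lower(), _iso(rec["date"]))
        kept = merged.setdefault(
            key, {"bill": key[0], "date": key[1], "pdf": ""}
        )
        # Fill in the PDF link from the first duplicate that carries one.
        if not kept["pdf"] and rec.get("pdf"):
            kept["pdf"] = rec["pdf"]
    return list(merged.values())


# Illustrative input only: two copies of one statement, formatted differently.
sample = [
    {"bill": "115hr2", "date": "May 15, 2018", "pdf": ""},
    {"bill": "115HR2", "date": "2018-05-15", "pdf": "example.pdf"},
]
unique = dedupe_statements(sample)
```

If the duplicates originate on the source website, a step like this only hides them in the table; it does not tell us which copy is authoritative.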

@ayeshamk
Collaborator

Josh suggested changing the date format to YYYY-MM-DD.

I changed it from the "Month D, YYYY" style (e.g. March 5, 2019) to YYYY-MM-DD. If we want the older format, I can change it back.
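For reference, a conversion of this kind can be sketched with the standard library alone. The function name `normalize_sap_date` is hypothetical, and this is not necessarily how the change was made on the branch. It assumes English month names (the default C/English locale) and passes through values that are already in YYYY-MM-DD.

```python
from datetime import datetime


def normalize_sap_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, accepting 'March 5, 2019' or ISO input."""
    text = raw.strip()
    for fmt in ("%B %d, %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    # Anything else is surfaced rather than guessed at.
    raise ValueError(f"Unrecognized date format: {raw!r}")


converted = normalize_sap_date("March 5, 2019")
```

Storing the ISO form also means the dates sort correctly as plain strings.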

@ayeshamk
Collaborator

Checked this too. I do not see duplicates as you can see below. You may have old data and tables. Try deleting and reloading new data.

image

@aih
Collaborator Author

aih commented Jan 29, 2021

> Josh suggested changing the date format to YYYY-MM-DD.

Thanks for explaining. We'll leave it the way Josh suggested.

> Checked this too. I do not see duplicates as you can see below. You may have old data and tables. Try deleting and reloading new data.

I suspected that might be the case. Thank you for checking. I deleted the table this morning, hoping to re-create it and test, but now I'm stuck in a loop of migration problems :-(.

@aih
Collaborator Author

aih commented Jan 29, 2021

For the two items you show for HR2, there is still some duplication: the 'Moving Forward Act' is for 116HR2, not 115HR2. Do you know why this is picked up for 115hr2?
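One possible cause, not confirmed anywhere in this thread, is that statements are matched on the bill type and number without the Congress. A minimal sketch of a Congress-qualified match follows; `parse_bill_key`, `statements_for_bill`, and the `bill` field are hypothetical names used only for illustration.

```python
import re

# Congress number, bill type letters, bill number: e.g. '116hr2'.
BILL_KEY = re.compile(r"^(\d+)([a-z]+)(\d+)$")


def parse_bill_key(key):
    """Split a key such as '116HR2' into (116, 'hr', 2)."""
    match = BILL_KEY.match(key.strip().lower())
    if match is None:
        raise ValueError(f"Not a congress-qualified bill key: {key!r}")
    return int(match.group(1)), match.group(2), int(match.group(3))


def statements_for_bill(statements, bill_key):
    """Keep only statements whose congress, type, and number all match."""
    target = parse_bill_key(bill_key)
    return [s for s in statements if parse_bill_key(s["bill"]) == target]


# Illustrative input only.
sample = [
    {"bill": "116HR2", "title": "Moving Forward Act"},
    {"bill": "115hr2", "title": "(statement for the 115th Congress bill)"},
]
matched = statements_for_bill(sample, "115hr2")
```

With a full key comparison, the 'Moving Forward Act' statement would only be returned for 116hr2.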

@ayeshamk
Collaborator

Yes, data quality for the Trump administration (from the White House website) is poor. I re-scraped, and they have removed most of it. We have many bills with no SAP PDF files. I am working on a workaround for this.

@aih
Collaborator Author

aih commented Feb 6, 2021

Closing. We're now working on updated branches.

aih closed this as completed on Feb 6, 2021