Refactor dataset to use pandas and cleaner setup. #36

Merged
merged 1 commit into from
Apr 13, 2023

Conversation

davidgasquez
Member

Heya @rufuspollock!

Since this seems to be the most starred package and also broken, I thought I would update/improve it after having spent some time with Frictionless and other data package managers.

It changes lots of things so please push back on anything that doesn't make sense.

Comment on lines +1 to +7
{
  "name": "Core Dataset",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "features": {
    "ghcr.io/stuartleeks/dev-container-features/shell-history:0": {}
  }
}
Member Author

With this, you can run the pipeline in your browser using GitHub Codespaces.

This is a simple way to make future contributors' lives easier, since a development environment is one click away.

    branches: ["master"]
  pull_request:
    branches: ["master"]
  workflow_dispatch:
Member Author

This makes it possible to trigger the workflow manually from the GitHub UI.

Comment on lines +1 to +30
name: s-and-p-500-companies
title: S&P 500 Companies with Financial Information
version: "2.0"
licenses:
  - name: ODC-PDDL-1.0
    path: http://opendatacommons.org/licenses/pddl/
    title: Open Data Commons Public Domain Dedication and License v1.0
resources:
  - name: constituents
    path: data/constituents.csv
    format: csv
    mediatype: text/csv
    schema:
      fields:
        - name: Symbol
          type: string
        - name: Security
          type: string
        - name: GICS Sector
          type: string
        - name: GICS Sub-Industry
          type: string
        - name: Headquarters Location
          type: string
        - name: Date added
          type: string
        - name: CIK
          type: string
        - name: Founded
          type: string
Member Author

Moved to YAML, as it is more readable and it is what modern Frictionless data packages around GitHub are using.
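
As a rough illustration of how the field list in the descriptor above could be put to use, here is a minimal Python sketch that checks a CSV header against the declared field names. The function name `check_header` and the in-memory sample are hypothetical; only the field names come from the schema in this diff, and the actual pipeline may validate differently (for example, through the Frictionless tooling).

```python
import csv
import io

# Field names copied from the `constituents` schema in the descriptor above.
EXPECTED_FIELDS = [
    "Symbol", "Security", "GICS Sector", "GICS Sub-Industry",
    "Headquarters Location", "Date added", "CIK", "Founded",
]


def check_header(csv_text, expected=EXPECTED_FIELDS):
    """Return (missing, unexpected) column names for a CSV given as text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    return missing, unexpected


# Hypothetical header line for illustration only, not real data.
sample = (
    "Symbol,Security,GICS Sector,GICS Sub-Industry,"
    "Headquarters Location,Date added,CIK,Founded\n"
)
print(check_header(sample))
```

Both lists come back empty when the header matches the descriptor, so a check like this could run in CI before the data is published.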

@@ -1 +1,3 @@
beautifulsoup4
pandas
Member

I get it! Though pandas is so heavy-duty (you have to install numpy, right ...). I wonder if we can get away with something more lightweight.

Let's live with it for now!

Member Author

Agreed. In this case, I think it is heavyweight for computers but easy for humans to understand.
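
For comparison, here is a dependency-free sketch of the kind of lightweight alternative raised above: counting constituents per sector with only the standard library. This is not what the PR does (it uses pandas); the function name `count_by_sector` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_sector(csv_text, column="GICS Sector"):
    """Count rows per value of `column` using only the standard library."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Made-up placeholder rows, not real constituents.
sample = (
    "Symbol,Security,GICS Sector\n"
    "AAA,Example One,Industrials\n"
    "BBB,Example Two,Health Care\n"
    "CCC,Example Three,Industrials\n"
)
counts = count_by_sector(sample)
print(counts.most_common())
```

With pandas, the same count is roughly `df["GICS Sector"].value_counts()`, which is the readability trade-off described in the comment above.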

@rufuspollock
Member

Amazing job. Merging. 👏

@rufuspollock rufuspollock changed the title from "Refactor dataset project" to "Refactor dataset to use pandas and cleaner setup." Apr 13, 2023
@rufuspollock rufuspollock merged commit 6517cdb into main Apr 13, 2023
@davidgasquez davidgasquez deleted the revisit-package branch April 13, 2023 15:23
@davidgasquez davidgasquez mentioned this pull request Apr 14, 2023
2 participants