Commit 6517cdb: Refactor dataset to use pandas and cleaner setup.

Authored by davidgasquez on Apr 13, 2023
1 parent 3fa2081, commit 6517cdb
Showing 15 changed files with 621 additions and 1,339 deletions.
7 changes: 7 additions & 0 deletions .devcontainer/devcontainer.json

```diff
@@ -0,0 +1,7 @@
+{
+  "name": "Core Dataset",
+  "image": "mcr.microsoft.com/devcontainers/python:3.11",
+  "features": {
+    "ghcr.io/stuartleeks/dev-container-features/shell-history:0": {}
+  }
+}
```
64 changes: 23 additions & 41 deletions .github/workflows/actions.yml

```diff
@@ -1,48 +1,30 @@
 on:
   push:
-    branches:
-      - master
-  # 2023-03-08 disable cron until working again
-  # schedule:
-  #   - cron: '0 1 * * *'
+    branches: ["master"]
+  pull_request:
+    branches: ["master"]
+  workflow_dispatch:
+  schedule:
+    - cron: "0 0 * * *"
 
 jobs:
   update:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@master
-      - name: Build the data and create local changes
-        uses: actions/setup-python@v2
-        with:
-          python-version: '3.x'
-          architecture: x64
-      - name: Install requirements
-        run: |
-          pip install -r scripts/requirements.txt
-      - name: Process Data
-        run: |
-          python scripts/constituents.py
-      - name: Commit files
-        run: |
-          git config --local user.email "action@github.com"
-          git config --local user.name "GitHub Action"
-          git diff --quiet && git diff --staged --quiet || git commit -a -m "Auto-update of the data packages"
-      - name: Push changes
-        uses: ad-m/github-push-action@master
-        with:
-          github_token: ${{ secrets.gh }}
-  deploy:
-    needs: update
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v1
-      - uses: actions/setup-node@v1
-        with:
-          node-version: '8.x'
-      - run: npm install -g data-cli
-      - run: data --version
-      - run: data push
-        env:
-          id: ${{secrets.dhid}}
-          username: ${{secrets.dhusername}}
-          token: ${{secrets.dhtoken}}
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+      - name: Run
+        run: make
+      - name: Commit and Push
+        run: |
+          git config --global user.name "GitHub Action"
+          git config --global user.email "actions@users.noreply.github.com"
+          git add -A
+          if git diff-index --quiet HEAD --; then
+            echo "No changes to commit"
+          else
+            git commit -m "Update data"
+            git push
+          fi
```
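The workflow's `Run` step invokes `make`, which (per the Makefile added in this commit) executes `scripts/scrape.py`. That script is not part of the visible diff, so the sketch below is a hedged guess at its shape based only on the commit message's move to pandas: fetch the Wikipedia constituents table, normalize it into a DataFrame, and write CSV. The function name `tidy` and the column set are assumptions, not the repository's actual code.

```python
import pandas as pd


def tidy(rows):
    """Normalize raw (symbol, security, sector) rows into a sorted frame.

    Hypothetical helper: the real scripts/scrape.py presumably builds its
    rows via pd.read_html on the Wikipedia constituents page instead.
    """
    df = pd.DataFrame(rows, columns=["Symbol", "Name", "Sector"])
    df["Symbol"] = df["Symbol"].str.strip()  # guard against stray whitespace
    return df.sort_values("Symbol").reset_index(drop=True)


# Offline stand-in for scraped rows; the real pipeline would fetch these.
sample = [
    ("MSFT ", "Microsoft", "Information Technology"),
    ("AAPL", "Apple Inc.", "Information Technology"),
]
constituents = tidy(sample)
# In the real script the result would be written out, e.g.
# constituents.to_csv("data/constituents.csv", index=False)
```

Writing with `index=False` keeps the CSV free of a synthetic index column, which matters for a dataset meant to be diffed and auto-committed by the workflow above.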
1 change: 1 addition & 0 deletions .gitignore

```diff
@@ -4,3 +4,4 @@ cache/*
 tmp/
 scripts/.DS_Store
 .DS_Store
+*.env
```
21 changes: 21 additions & 0 deletions Makefile

```diff
@@ -0,0 +1,21 @@
+VENV = .env
+PYTHON = $(VENV)/bin/python3
+PIP = $(VENV)/bin/pip
+
+.PHONY: data clean
+
+all: data
+
+data: $(VENV)/bin/activate
+	$(PYTHON) scripts/scrape.py
+
+$(VENV)/bin/activate: scripts/requirements.txt
+	python3 -m venv $(VENV)
+	$(PIP) install -r scripts/requirements.txt
+
+validate:
+	$(PYTHON) -m frictionless validate data/constituents.csv
+
+clean:
+	rm -rf __pycache__
+	rm -rf $(VENV)
```
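The Makefile's `validate` target delegates to `frictionless validate` for schema checking. For readers without frictionless installed, the same basic structural idea (header matches, every row has the right field count and a non-empty symbol) can be approximated with the standard library. The expected column names below are illustrative assumptions, not the dataset's confirmed schema.

```python
import csv
import io

# Assumed schema for illustration; the real dataset's columns may differ.
EXPECTED = ["Symbol", "Name", "Sector"]


def validate_csv(text, expected=EXPECTED):
    """Return a list of error strings; an empty list means the CSV looks valid."""
    errors = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["empty file"]
    if rows[0] != expected:
        errors.append(f"bad header: {rows[0]!r}")
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected):
            errors.append(f"row {i}: expected {len(expected)} fields, got {len(row)}")
        elif not row[0]:
            errors.append(f"row {i}: missing symbol")
    return errors


# A well-formed sample passes with no errors.
errs = validate_csv("Symbol,Name,Sector\nAAPL,Apple Inc.,Information Technology\n")
```

In practice frictionless checks far more (types, constraints, encodings); this sketch only mirrors the "fail the build on malformed data" intent of the target.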
56 changes: 23 additions & 33 deletions README.md

```diff
@@ -1,55 +1,45 @@
-List of companies in the S&P 500 (Standard and Poor's 500). The S&P 500 is a
-free-float, capitalization-weighted index of the top 500 publicly listed stocks
-in the US (top 500 by market cap). The dataset includes a list of all the
-stocks contained therein.
+# S&P 500 Companies Dataset
 
-## Data
+List of companies in the S&P 500 (Standard and Poor's 500). The S&P 500 is a free-float, capitalization-weighted index of the top 500 publicly listed stocks in the US (top 500 by market cap). The dataset includes a list of all the stocks contained therein.
 
-Information on S&P 500 index used to be available on the [official webpage on the Standard and Poor's website][sp-home]
-but until they publish it back, Wikipedia is the best up-to-date and open data source.
+## Data
 
-* Index listing - see <data/constituents.csv> extracted from Wikipedia's [SP500 list of companies][sp-list].
+Information on the S&P 500 index used to be available on the [official webpage on the Standard and Poor's website][sp-home], but until they publish it again, Wikipedia's [SP500 list of companies][sp-list] is the best up-to-date and open data source.
 
-### Sources
+## Sources
 
-Detailed information on the S&P 500 (primarily in XLS format) used to be obtained
-from its [official webpage on the Standard and Poor's website][sp-home] - it was
-free but registration was required.
-* Index listing - see <data/constituents.csv>
-  * used to be extracted from [source Excel file on S&P website][sp-listing-dec-2014] but this no longer contains a list of constituents. (Note this Excel was actually S&P 500 EPS estimates but on sheet 4 it used to have a list of members - [previous file][sp-listing] was just members but that 404s as of Dec 2014) (Note: <del>but note you have to register and login to access</del> - no longer true as of August 2013)
-* Historical performance ([source xls on S&P website][sp-historical])
+Detailed information on the S&P 500 (primarily in XLS format) used to be obtained from its [official webpage on the Standard and Poor's website][sp-home]; it was free but registration was required.
 
 [sp-home]: http://www.spindices.com/indices/equity/sp-500
 [sp-list]: http://en.wikipedia.org/wiki/List_of_S%26P_500_companies
 [sp-listing-dec-2014]: http://www.spindices.com/documents/additional-material/sp-500-eps-est.xlsx?force_download=true
 [sp-listing]: http://us.spindices.com/idsexport/file.xls?hostIdentifier=48190c8c-42c4-46af-8d1a-0cd5db894797&selectedModule=Constituents&selectedSubModule=ConstituentsFullList&indexId=340
 [sp-historical]: http://www.standardandpoors.com/prot/spf/docs/indices/SPUSA-500-USDUF--P-US-L--HistoricalData.xls
 
-*Note*: for aggregate information on the S&P (dividends, earnings, etc.) see
-[Standard and Poor's 500 Dataset][shiller].
+> **Note**
+> For aggregate information on the S&P (dividends, earnings, etc.) see [Standard and Poor's 500 Dataset][shiller].
 [shiller]: http://data.okfn.org/data/s-and-p-500
 
-### General Financial Notes
+## General Financial Notes
 
+Publicly listed US companies are obliged to file various reports on a regular basis with the SEC. Of these, two types are of especial interest to investors and others interested in their finances and business. These are:
+
+- 10-K = Annual Report
+- 10-Q = Quarterly report
+
-Publicly listed US companies are obliged various reports on a regular basis
-with the SEC. Of these 2 types are of especial interest to investors and others
-interested in their finances and business. These are:
+## Development
 
-* 10-K = Annual Report
-* 10-Q = Quarterly report
+The pipeline relies on Python, so you'll need to have it installed on your machine. Then:
 
-## Preparation
+1. Create a virtual environment in a directory using Python's venv module: `python3 -m venv .env`
+2. Activate the virtual environment: `source .env/bin/activate`
+3. Install the dependencies: `pip install -r scripts/requirements.txt`
+4. Run the scripts: `python scripts/scrape.py`
 
-You can run the script yourself to update the data and publish them to GitHub: see [scripts README](https://github.com/datasets/s-and-p-500-companies/blob/master/scripts/README.md).
+Alternatively, you can use the provided Makefile to run the scraping with a simple `make`. It'll create a virtual environment, install the dependencies and run the script.
 
 ## License
 
-All data is licensed under the [Open Data Commons Public Domain Dedication and
-License][pddl]. All code is licensed under the MIT/BSD license.
+All data is licensed under the [Open Data Commons Public Domain Dedication and License][pddl]. All code is licensed under the MIT/BSD license.
 
-Note that while no credit is formally required a link back or credit to [Rufus
-Pollock][rp] and the [Open Knowledge Foundation][okfn] is much appreciated.
+Note that while no credit is formally required, a link back or credit to [Rufus Pollock][rp] and the [Open Knowledge Foundation][okfn] is much appreciated.
 
 [pddl]: http://opendatacommons.org/licenses/pddl/1.0/
 [rp]: http://rufuspollock.com/
```

(Remaining lines of this file, and the other changed files in the commit, are not shown.)
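Once the pipeline has produced `data/constituents.csv`, consumers can load it directly with pandas. A hedged example follows; the column names are assumed from the Wikipedia source table and are not confirmed by this diff.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the real file; in practice you would use
# df = pd.read_csv("data/constituents.csv")
csv_text = (
    "Symbol,Name,Sector\n"
    "AAPL,Apple Inc.,Information Technology\n"
    "JPM,JPMorgan Chase,Financials\n"
    "MSFT,Microsoft,Information Technology\n"
)
df = pd.read_csv(StringIO(csv_text))

# Count constituents per sector, a typical first question for this dataset.
by_sector = df.groupby("Sector")["Symbol"].count()
```

Since the workflow rewrites the CSV daily, consumers should read it fresh rather than caching it long-term.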