Commit 6517cdb: Refactor dataset to use pandas and cleaner setup.

Authored by davidgasquez on Apr 13, 2023
1 parent 3fa2081, commit 6517cdb
Showing 15 changed files with 621 additions and 1,339 deletions.
7 changes: 7 additions & 0 deletions .devcontainer/devcontainer.json

```diff
@@ -0,0 +1,7 @@
+{
+  "name": "Core Dataset",
+  "image": "mcr.microsoft.com/devcontainers/python:3.11",
+  "features": {
+    "ghcr.io/stuartleeks/dev-container-features/shell-history:0": {}
+  }
+}
```
64 changes: 23 additions & 41 deletions .github/workflows/actions.yml

```diff
@@ -1,48 +1,30 @@
 on:
   push:
-    branches:
-      - master
-  # 2023-03-08 disable cron until working again
-  # schedule:
-  #   - cron: '0 1 * * *'
+    branches: ["master"]
+  pull_request:
+    branches: ["master"]
+  workflow_dispatch:
+  schedule:
+    - cron: "0 0 * * *"
 
 jobs:
   update:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@master
-      - name: Build the data and create local changes
-        uses: actions/setup-python@v2
-        with:
-          python-version: '3.x'
-          architecture: x64
-      - name: Install requirements
-        run: |
-          pip install -r scripts/requirements.txt
-      - name: Process Data
-        run: |
-          python scripts/constituents.py
-      - name: Commit files
-        run: |
-          git config --local user.email "action@github.com"
-          git config --local user.name "GitHub Action"
-          git diff --quiet && git diff --staged --quiet || git commit -a -m "Auto-update of the data packages"
-      - name: Push changes
-        uses: ad-m/github-push-action@master
-        with:
-          github_token: ${{ secrets.gh }}
-  deploy:
-    needs: update
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v1
-      - uses: actions/setup-node@v1
-        with:
-          node-version: '8.x'
-      - run: npm install -g data-cli
-      - run: data --version
-      - run: data push
-        env:
-          id: ${{secrets.dhid}}
-          username: ${{secrets.dhusername}}
-          token: ${{secrets.dhtoken}}
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+      - name: Run
+        run: make
+      - name: Commit and Push
+        run: |
+          git config --global user.name "GitHub Action"
+          git config --global user.email "actions@users.noreply.github.com"
+          git add -A
+          if git diff-index --quiet HEAD --; then
+            echo "No changes to commit"
+          else
+            git commit -m "Update data"
+            git push
+          fi
```
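The workflow's `Run` step invokes `make`, which (per the Makefile added in this commit) executes `scripts/scrape.py`. That script is not part of the visible diff, so the sketch below is a hedged guess at its shape based only on the commit message's move to pandas: fetch the Wikipedia constituents table, normalize it into a DataFrame, and write CSV. The function name `tidy` and the column set are assumptions, not the repository's actual code.

```python
import pandas as pd


def tidy(rows):
    """Normalize raw (symbol, security, sector) rows into a sorted frame.

    Hypothetical helper: the real scripts/scrape.py presumably builds its
    rows via pd.read_html on the Wikipedia constituents page instead.
    """
    df = pd.DataFrame(rows, columns=["Symbol", "Name", "Sector"])
    df["Symbol"] = df["Symbol"].str.strip()  # guard against stray whitespace
    return df.sort_values("Symbol").reset_index(drop=True)


# Offline stand-in for scraped rows; the real pipeline would fetch these.
sample = [
    ("MSFT ", "Microsoft", "Information Technology"),
    ("AAPL", "Apple Inc.", "Information Technology"),
]
constituents = tidy(sample)
# In the real script the result would be written out, e.g.
# constituents.to_csv("data/constituents.csv", index=False)
```

Writing with `index=False` keeps the CSV free of a synthetic index column, which matters for a dataset meant to be diffed and auto-committed by the workflow above.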
1 change: 1 addition & 0 deletions .gitignore

```diff
@@ -4,3 +4,4 @@ cache/*
 tmp/
 scripts/.DS_Store
 .DS_Store
+*.env
```
21 changes: 21 additions & 0 deletions Makefile

```diff
@@ -0,0 +1,21 @@
+VENV = .env
+PYTHON = $(VENV)/bin/python3
+PIP = $(VENV)/bin/pip
+
+.PHONY: data clean
+
+all: data
+
+data: $(VENV)/bin/activate
+	$(PYTHON) scripts/scrape.py
+
+$(VENV)/bin/activate: scripts/requirements.txt
+	python3 -m venv $(VENV)
+	$(PIP) install -r scripts/requirements.txt
+
+validate:
+	$(PYTHON) -m frictionless validate data/constituents.csv
+
+clean:
+	rm -rf __pycache__
+	rm -rf $(VENV)
```
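The Makefile's `validate` target delegates to `frictionless validate` for schema checking. For readers without frictionless installed, the same basic structural idea (header matches, every row has the right field count and a non-empty symbol) can be approximated with the standard library. The expected column names below are illustrative assumptions, not the dataset's confirmed schema.

```python
import csv
import io

# Assumed schema for illustration; the real dataset's columns may differ.
EXPECTED = ["Symbol", "Name", "Sector"]


def validate_csv(text, expected=EXPECTED):
    """Return a list of error strings; an empty list means the CSV looks valid."""
    errors = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["empty file"]
    if rows[0] != expected:
        errors.append(f"bad header: {rows[0]!r}")
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected):
            errors.append(f"row {i}: expected {len(expected)} fields, got {len(row)}")
        elif not row[0]:
            errors.append(f"row {i}: missing symbol")
    return errors


# A well-formed sample passes with no errors.
errs = validate_csv("Symbol,Name,Sector\nAAPL,Apple Inc.,Information Technology\n")
```

In practice frictionless checks far more (types, constraints, encodings); this sketch only mirrors the "fail the build on malformed data" intent of the target.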
56 changes: 23 additions & 33 deletions README.md

```diff
@@ -1,55 +1,45 @@
-List of companies in the S&P 500 (Standard and Poor's 500). The S&P 500 is a
-free-float, capitalization-weighted index of the top 500 publicly listed stocks
-in the US (top 500 by market cap). The dataset includes a list of all the
-stocks contained therein.
+# S&P 500 Companies Dataset
 
-## Data
+List of companies in the S&P 500 (Standard and Poor's 500). The S&P 500 is a free-float, capitalization-weighted index of the top 500 publicly listed stocks in the US (top 500 by market cap). The dataset includes a list of all the stocks contained therein.
 
-Information on S&P 500 index used to be available on the [official webpage on the Standard and Poor's website][sp-home]
-but until they publish it back, Wikipedia is the best up-to-date and open data source.
+## Data
 
-* Index listing - see <data/constituents.csv> extracted from Wikipedia's [SP500 list of companies][sp-list].
+Information on the S&P 500 index used to be available on the [official webpage on the Standard and Poor's website][sp-home], but until they publish it again, Wikipedia's [SP500 list of companies][sp-list] is the best up-to-date and open data source.
 
-### Sources
+## Sources
 
-Detailed information on the S&P 500 (primarily in XLS format) used to be obtained
-from its [official webpage on the Standard and Poor's website][sp-home] - it was
-free but registration was required.
-* Index listing - see <data/constituents.csv>
-  * used to be extracted from [source Excel file on S&P website][sp-listing-dec-2014] but this no longer contains a list of constituents. (Note this Excel was actually S&P 500 EPS estimates but on sheet 4 it used to have a list of members - [previous file][sp-listing] was just members but that 404s as of Dec 2014) (Note: <del>but note you have to register and login to access</del> - no longer true as of August 2013)
-* Historical performance ([source xls on S&P website][sp-historical])
+Detailed information on the S&P 500 (primarily in XLS format) used to be obtained from its [official webpage on the Standard and Poor's website][sp-home]; it was free but registration was required.
 
 [sp-home]: http://www.spindices.com/indices/equity/sp-500
 [sp-list]: http://en.wikipedia.org/wiki/List_of_S%26P_500_companies
 [sp-listing-dec-2014]: http://www.spindices.com/documents/additional-material/sp-500-eps-est.xlsx?force_download=true
 [sp-listing]: http://us.spindices.com/idsexport/file.xls?hostIdentifier=48190c8c-42c4-46af-8d1a-0cd5db894797&selectedModule=Constituents&selectedSubModule=ConstituentsFullList&indexId=340
 [sp-historical]: http://www.standardandpoors.com/prot/spf/docs/indices/SPUSA-500-USDUF--P-US-L--HistoricalData.xls
 
-*Note*: for aggregate information on the S&P (dividends, earnings, etc.) see
-[Standard and Poor's 500 Dataset][shiller].
+> **Note**
+> For aggregate information on the S&P (dividends, earnings, etc.) see [Standard and Poor's 500 Dataset][shiller].
 [shiller]: http://data.okfn.org/data/s-and-p-500
 
-### General Financial Notes
+## General Financial Notes
 
+Publicly listed US companies are obliged to file various reports on a regular basis with the SEC. Of these, two types are of especial interest to investors and others interested in their finances and business. These are:
+
+- 10-K = Annual Report
+- 10-Q = Quarterly report
+
-Publicly listed US companies are obliged various reports on a regular basis
-with the SEC. Of these 2 types are of especial interest to investors and others
-interested in their finances and business. These are:
+## Development
 
-* 10-K = Annual Report
-* 10-Q = Quarterly report
+The pipeline relies on Python, so you'll need to have it installed on your machine. Then:
 
-## Preparation
+1. Create a virtual environment in a directory using Python's venv module: `python3 -m venv .env`
+2. Activate the virtual environment: `source .env/bin/activate`
+3. Install the dependencies: `pip install -r scripts/requirements.txt`
+4. Run the scripts: `python scripts/scrape.py`
 
-You can run the script yourself to update the data and publish them to GitHub: see [scripts README](https://github.com/datasets/s-and-p-500-companies/blob/master/scripts/README.md).
+Alternatively, you can use the provided Makefile to run the scraping with a simple `make`. It'll create a virtual environment, install the dependencies and run the script.
 
 ## License
 
-All data is licensed under the [Open Data Commons Public Domain Dedication and
-License][pddl]. All code is licensed under the MIT/BSD license.
+All data is licensed under the [Open Data Commons Public Domain Dedication and License][pddl]. All code is licensed under the MIT/BSD license.
 
-Note that while no credit is formally required a link back or credit to [Rufus
-Pollock][rp] and the [Open Knowledge Foundation][okfn] is much appreciated.
+Note that while no credit is formally required, a link back or credit to [Rufus Pollock][rp] and the [Open Knowledge Foundation][okfn] is much appreciated.
 
 [pddl]: http://opendatacommons.org/licenses/pddl/1.0/
 [rp]: http://rufuspollock.com/
```

(Remaining lines of this file, and the other changed files in the commit, are not shown.)
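Once the pipeline has produced `data/constituents.csv`, consumers can load it directly with pandas. A hedged example follows; the column names are assumed from the Wikipedia source table and are not confirmed by this diff.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the real file; in practice you would use
# df = pd.read_csv("data/constituents.csv")
csv_text = (
    "Symbol,Name,Sector\n"
    "AAPL,Apple Inc.,Information Technology\n"
    "JPM,JPMorgan Chase,Financials\n"
    "MSFT,Microsoft,Information Technology\n"
)
df = pd.read_csv(StringIO(csv_text))

# Count constituents per sector, a typical first question for this dataset.
by_sector = df.groupby("Sector")["Symbol"].count()
```

Since the workflow rewrites the CSV daily, consumers should read it fresh rather than caching it long-term.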