Scraper - A VS Code Extension Scraper Service
This service was initially designed for cataloging and processing open source VS Code extensions for the Coder internal extension marketplace. Note that it does not directly serve extensions or their related metadata to VS Code clients; it's only half of the puzzle.
The Microsoft Terms of Service for Visual Studio Code prohibit using the official Visual Studio Marketplace as an API. Although the official marketplace does have an internal API, it is not publicly documented and is not intended for public use.
That said, the majority of extensions on the official marketplace are open source. This allows third parties to build services such as this one, along with their own marketplace APIs, to serve the collected extensions to users of forked/embedded VS Code clients (e.g. code-server).
- 🐋 Docker to test and compile extensions
- 🐙 GitHub API for downloading OSS extensions
- 🐬 MariaDB to store extension and source repository data
- 📡 Google Cloud Storage to store compiled extension files (`*.VSIX`) and metadata
- 🚧 The VS Code Extension Manager (`vsce`) to ensure that extensions conform to the standards of the official marketplace
- 📜 Buffered File System (BFS) module to ensure that extension dependencies are valid
You'll need Docker and a MySQL-compatible SQL server on the same machine as the service; MariaDB is recommended.
Run `yarn` to grab all the dependencies and build the Docker image.
Configure the SQL server
You can configure the following environment variables as needed:
Create the extension database:
```shell
mysql -u <user> -p mysql < extension_schema.sql
```
If you want to double check that it worked:
```shell
mysql -u <user> -p -e "USE extensions; SHOW TABLES;"
```
The above should output:
```
+----------------------+
| Tables_in_extensions |
+----------------------+
| categories           |
| extensions           |
| sources              |
| tags                 |
+----------------------+
```
Configure the GitHub API
Here's a quick guide that you can follow to generate a GitHub access token. The token only needs permission to access public repositories.
The token can then be stored in the `GITHUB_TOKEN` environment variable.
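As a rough sketch of how the token might be attached to API requests (the scraper's actual client code may differ, but the header values follow GitHub's documented token-auth scheme):

```typescript
// Build request headers for the GitHub REST API from GITHUB_TOKEN.
// The function name is illustrative; only the header format is GitHub's.
function githubHeaders(token: string | undefined): Record<string, string> {
  if (!token) {
    throw new Error("GITHUB_TOKEN is not set");
  }
  return {
    Authorization: `token ${token}`,
    Accept: "application/vnd.github+json",
  };
}
```

In practice these headers would be passed to whatever HTTP client the scraper uses, e.g. `fetch(url, { headers: githubHeaders(process.env.GITHUB_TOKEN) })`.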
Configure the Google Cloud Storage API
For the scraping process to be fully functional, the scraped extensions need to live somewhere; that's where GCP comes in. You can read the following guide on how to set up the GCP Storage CLI: How to install gsutil. This will configure the default credentials for the machine running the scraper.
You'll need a GCP Bucket with the name
It may also be worth setting up application default credentials; you can read how to do so here.
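The bucket's object layout is not documented above. Purely as an illustrative assumption (the path scheme and function name below are hypothetical, not the scraper's actual layout), compiled `*.VSIX` files could be keyed by publisher, extension name, and version:

```typescript
// Hypothetical object key for a compiled extension in the GCP bucket.
// The "extensions/<publisher>/<name>/<name>-<version>.vsix" scheme is an
// assumption for illustration only.
function vsixObjectKey(publisher: string, name: string, version: string): string {
  return `extensions/${publisher}/${name}/${name}-${version}.vsix`;
}
```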
An extension source (repository) can be added by running:
```shell
yarn start add-source https://github.com/<owner-name>/<repo-name>
```
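The `add-source` command presumably has to extract the owner and repository name from the URL before it can talk to the GitHub API. A minimal sketch of such parsing (this parser is an illustrative assumption, not the scraper's actual implementation):

```typescript
// Extract owner and repo from a GitHub repository URL, tolerating an
// optional trailing slash or ".git" suffix.
function parseRepoUrl(url: string): { owner: string; repo: string } {
  const match = /^https:\/\/github\.com\/([^/]+)\/([^/]+?)(?:\.git)?\/?$/.exec(url);
  if (!match) {
    throw new Error(`not a GitHub repository URL: ${url}`);
  }
  return { owner: match[1], repo: match[2] };
}
```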
The scraper can be started by running:
```shell
yarn start run
```
All of the available commands can be viewed by running:
```shell
yarn start --help
```
The obvious bottlenecks here are RAM (for reading modules into memory) and CPU (for running Docker instances concurrently). If the machine running the service is not particularly powerful, reducing the number of concurrent processes will make a big difference. Likewise, if you've got 32GB of RAM and 8 cores, feel free to run 20 instances concurrently.
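The idea of capping concurrent work can be sketched as a simple worker-pool limiter (this is a generic illustration, not the scraper's actual scheduling code; the function name and limit are assumptions):

```typescript
// Process items with at most `limit` tasks in flight at once.
// Results are returned in the same order as the input.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

With a pattern like this, the concurrency knob is just the `limit` argument: lower it on a weak machine, raise it on a 32GB/8-core one.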
The other restriction to consider is the GitHub API's rate limit. If the scraper is allowed to run against a large number of extensions for a few hours, it may start failing to query GitHub.
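One common way to cope is to pause until the limit resets. GitHub's REST API reports the reset time as a Unix timestamp (in seconds) via the `x-ratelimit-reset` response header; a small helper to turn that into a sleep duration might look like this (the function name and padding value are assumptions, only the header semantics are GitHub's):

```typescript
// Milliseconds to wait before retrying once the rate limit resets.
// `resetUnixSeconds` comes from the x-ratelimit-reset header; a little
// padding avoids retrying at the exact reset instant.
function msUntilReset(resetUnixSeconds: number, nowMs: number, paddingMs = 1000): number {
  return Math.max(0, resetUnixSeconds * 1000 - nowMs + paddingMs);
}
```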
You should run the tests on a non-production database and GCP bucket.