Scraper - A VS Code Extension Scraper Service

This service was initially designed to catalog and process open source VS Code extensions for the Coder internal Extension Marketplace. However, it does not directly serve extensions or their related metadata to VS Code clients. It's half of the puzzle.

The Microsoft TOS for Visual Studio Code prohibits using the official Visual Studio Extension Marketplace as an API. Although the official marketplace does have an internal API, it is not publicly documented and is not intended for public use.

That said, the majority of extensions on the official marketplace are open source. This allows third parties to create services such as this one, along with their own marketplace APIs, to serve the collected extensions to users of forked/embedded VS Code clients (e.g. code-server).

Tech Used

  • 🐋 Docker to test and compile extensions
  • 🐙 GitHub API for downloading OSS extensions
  • 🐬 MariaDB to store extension and source repository data
  • 📡 Google Cloud Storage to store compiled extension files (*.VSIX) and metadata
  • 🚧 The VS Code Extension Manager (vsce) to ensure that extensions conform to the standards of the official marketplace
  • 📜 Buffered File System (BFS) module to ensure that extension dependencies are valid

Installation

You'll need Docker and a SQL server on the same machine as the service. I'd recommend MariaDB.

Run yarn to grab all the dependencies and build the Docker image:

yarn

Configure the SQL server

You can configure the following environment variables as needed (see the example below):

  • SQL_HOST (default: "localhost")
  • SQL_PORT (default: 3306)
  • SQL_USER (default: "root")
  • SQL_PASS (default: "root")
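
For example, the variables can be exported in the shell before starting the service. This is only a sketch; the user and password values are placeholders for your own MariaDB credentials:

# Example only: point the scraper at a local MariaDB instance.
export SQL_HOST=localhost
export SQL_PORT=3306
export SQL_USER=scraper
export SQL_PASS=changeme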

Create the extension database:

mysql -u <user> -p mysql < extension_schema.sql

If you want to double check that it worked:

mysql -u <user> -p -e "USE extensions; SHOW TABLES;"

The above should output:

+----------------------+
| Tables_in_extensions |
+----------------------+
| categories           |
| extensions           |
| sources              |
| tags                 |
+----------------------+

Configure the GitHub API

GitHub's documentation has a quick guide you can follow to generate a personal access token. The token only needs permission to access public repositories.

The token can then be stored in the GITHUB_TOKEN environment variable.
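
For example, after generating a token it can be exported in the same shell that will run the scraper (the value below is a placeholder):

# Example only: expose the GitHub token to the scraper process.
export GITHUB_TOKEN=<your-token>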

Configure the Google Cloud Storage API

In order for the scraping process to be fully functional, the scraped extensions need to live somewhere. That's where GCP comes in. You can read the following guide on how to set up the GCP Storage CLI: How to install gsutil. This will configure the default credentials for the machine running the scraper.

You'll need a GCP Bucket with the name scraper-extensions.
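
If the bucket doesn't exist yet, one way to create it (assuming gsutil is installed and authenticated) is:

# Create the bucket the scraper uploads compiled *.VSIX files to.
gsutil mb gs://scraper-extensions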

It may also be worth setting up default credentials; see the Google Cloud documentation for how to do so.

Usage

An extension source (repository) can be added by running:

yarn start add-source https://github.com/<owner-name>/<repo-name>

The scraper can be started by running:

yarn start run

All of the available commands can be viewed by running:

yarn start --help

Restrictions

The obvious bottlenecks here are RAM for reading modules into memory and CPU for running Docker instances concurrently. So, if the machine running the service is not particularly powerful, reducing the number of concurrent processes will make a big difference. Likewise, if you've got 32GB of RAM and 8 cores, feel free to run 20 instances concurrently.

The other restriction to consider is that the GitHub API is rate limited. So, if the scraper is allowed to run on a sizeable number of extensions for a few hours, it may start failing to query GitHub.
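
If GitHub queries start failing, the remaining quota can be checked against GitHub's rate-limit endpoint. This is just a quick diagnostic, assuming GITHUB_TOKEN is already set:

# Check the remaining GitHub API quota for the configured token.
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit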

Testing

You should run the tests on a non-production database and GCP bucket.

Simply run:

yarn test
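
The same environment variables used to configure the SQL server can point the test run at a throwaway database. The host and credentials below are placeholders, not values the project requires:

# Example only: run the test suite against a disposable database.
SQL_HOST=<test-db-host> SQL_USER=<user> SQL_PASS=<pass> yarn test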