fpgmaas/pypi-scout

What does this do?

Finding the right Python package on PyPI can be a bit difficult, since PyPI isn't really designed for discovering packages easily. For example, you can search for the word "plot" and get a list of hundreds of packages that contain the word "plot" in seemingly random order.

Inspired by this blog post about finding arXiv articles using vector embeddings, I decided to build a small application that helps you find Python packages with a similar approach. For example, you can ask it "I want to make nice plots and visualizations", and it will provide you with a short list of packages that can help you with that.

How does this work?

The project works by collecting project summaries and descriptions for all packages on PyPI with more than 100 weekly downloads. These are then converted into vector representations using Sentence Transformers. When the user enters a query, it is converted into a vector representation in the same way, and the most similar package descriptions are fetched from the vector database. Before the results are presented to the user in a dashboard, additional weight is given to the number of weekly downloads.
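The ranking idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the `embed` function is a toy bag-of-words stand-in for a Sentence Transformers model, the package names and download counts are made up, and the way similarity and popularity are blended (the `download_weight` parameter and the log scaling) is a guess at one reasonable scheme.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_packages(query, packages, download_weight=0.1, top_k=3):
    """Rank packages by similarity to the query, nudged by popularity.

    The blend below is an assumption made for this sketch.
    """
    query_vec = embed(query)
    # Log-scale downloads so very popular packages do not dominate.
    max_log = max(math.log10(p["weekly_downloads"] + 1) for p in packages) or 1.0
    scored = []
    for p in packages:
        similarity = cosine_similarity(query_vec, embed(p["summary"]))
        popularity = math.log10(p["weekly_downloads"] + 1) / max_log
        score = (1 - download_weight) * similarity + download_weight * popularity
        scored.append((score, p["name"]))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical packages, invented for illustration only.
PACKAGES = [
    {"name": "plotlib-a", "summary": "make nice plots and visualizations", "weekly_downloads": 500_000},
    {"name": "plotlib-b", "summary": "make nice plots and visualizations", "weekly_downloads": 200},
    {"name": "http-c", "summary": "simple http client for rest apis", "weekly_downloads": 900_000},
]

print(rank_packages("I want to make nice plots and visualizations", PACKAGES))
```

With identical summaries, the more frequently downloaded package ranks first, while a popular but unrelated package still ends up at the bottom because the similarity term dominates the score.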

Stack

The project uses the following technologies:

  1. FastAPI for the API backend
  2. Next.js and Tailwind CSS for the frontend
  3. Sentence Transformers for vector embeddings

Getting Started

Build and Setup

1. (Optional) Create a .env file

By default, all data is stored on your local machine. It is also possible to store the data for the API on Azure Blob Storage and have the API read from there. To do so, create a .env file:

cp .env.template .env

and fill in the required fields.

2. Run the Setup Script

The setup script will:

  • Download and process the PyPI dataset and store the results in the data directory.
  • Create vector embeddings for the PyPI dataset.
  • If the STORAGE_BACKEND environment variable is set to BLOB: Upload the datasets to blob storage.
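The conditional upload in the last step could be decided with a check along these lines. This is a minimal sketch, assuming the variable is read from the process environment; the helper name `should_upload_to_blob` and the case-insensitive comparison are assumptions, not taken from the project.

```python
import os


def should_upload_to_blob(env=None):
    """Return True when STORAGE_BACKEND is set to BLOB.

    `env` defaults to the process environment; passing a dict makes
    the function easy to test. Case-insensitive matching is an
    assumption made for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("STORAGE_BACKEND", "").strip().upper() == "BLOB"


if should_upload_to_blob():
    print("Datasets would be uploaded to blob storage.")
else:
    print("Datasets stay in the local data directory.")
```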

There are three methods to run the setup script, depending on whether you have an NVIDIA GPU and the NVIDIA Container Toolkit installed. Please run the setup script using the method that applies to you:

Note

The dataset contains approximately 100,000 packages on PyPI with more than 100 weekly downloads. To speed up local development, you can reduce the number of packages that are processed locally by lowering the value of FRAC_DATA_TO_INCLUDE in pypi_scout/config.py.
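To give a feel for what such a fraction does, here is a small sketch of subsampling a dataset by a fixed fraction. How the project actually applies FRAC_DATA_TO_INCLUDE is not shown here, so the `subsample` helper, the seed, and the value `0.1` are all assumptions made for illustration.

```python
import random

# Hypothetical value; the real default lives in pypi_scout/config.py.
FRAC_DATA_TO_INCLUDE = 0.1


def subsample(rows, frac, seed=0):
    """Return a reproducible random subset containing roughly `frac` of the rows."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in the interval (0, 1]")
    # Keep at least one row when the input is non-empty.
    n = min(len(rows), max(1, round(len(rows) * frac)))
    return random.Random(seed).sample(rows, n)


rows = list(range(1000))  # stand-in for package records
subset = subsample(rows, FRAC_DATA_TO_INCLUDE)
print(len(subset))
```

A fixed seed keeps the subset stable between runs, which is convenient when comparing results during development.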

3. Run the Application

Start the application using Docker Compose:

docker-compose up

After a short while, your application will be live at http://localhost:3000.

Data

The dataset for this project is created using the PyPI dataset on Google BigQuery. The SQL query used can be found in pypi_bigquery.sql. The resulting dataset is available as a CSV file on Google Drive.