In this tutorial, we use Bytewax to build a pipeline that retrieves new Hacker News posts, parallelizes the parsing and embedding steps, and writes the resulting embeddings to Milvus, our vector database.
The dataflow has 5 parts:
- Input - stream stories and comments from the Hacker News API.
- Preprocess - retrieve updates and filter for stories and comments.
- Retrieve Content - download the HTML and parse it into usable text with Unstructured.io.
- Vectorize - create an embedding or a list of embeddings for the text using Hugging Face Transformers.
- Output - write the vectors to Milvus and create a new index.
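The five parts above can be sketched in plain Python to show how data moves from one stage to the next. This is a minimal, sequential illustration and not the tutorial's Bytewax code: the sample items, the `FAKE_PAGES` lookup (a stand-in for the HTTP download and Unstructured.io parsing), the hash-based `toy_embed` (a stand-in for a Hugging Face model), and the in-memory `store` (a stand-in for Milvus) are all assumptions made so the example runs without network access or extra packages.

```python
import hashlib
import math
import re
from html.parser import HTMLParser

# Hypothetical items shaped like Hacker News API responses.
SAMPLE_ITEMS = [
    {"id": 1, "type": "story", "title": "Show HN: Demo", "url": "https://example.com/demo"},
    {"id": 2, "type": "job", "title": "Hiring"},
    {"id": 3, "type": "comment", "text": "<p>Nice work!</p>"},
]

# Stand-in for downloading a story's page over HTTP.
FAKE_PAGES = {
    "https://example.com/demo": "<html><body><h1>Demo</h1><p>A streaming pipeline.</p></body></html>",
}

KEEP_TYPES = {"story", "comment"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace (stand-in for Unstructured.io)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def retrieve_content(item):
    """Return plain text for a story (via its URL) or a comment (via its body)."""
    if item["type"] == "story":
        return html_to_text(FAKE_PAGES.get(item.get("url"), ""))
    return html_to_text(item.get("text", ""))


def chunk_words(text, size=50):
    """Split long text into word chunks, so one post can yield several embeddings."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_embed(text, dim=8):
    """Hash tokens into a small unit-length vector (NOT a real language model)."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def run(items):
    """Input -> Preprocess -> Retrieve Content -> Vectorize -> Output."""
    store = []  # stand-in for a Milvus collection
    for item in items:  # Input
        if item.get("type") not in KEEP_TYPES:  # Preprocess
            continue
        text = retrieve_content(item)  # Retrieve Content
        for chunk in chunk_words(text):  # Vectorize
            store.append({"id": item["id"], "text": chunk, "vector": toy_embed(chunk)})  # Output
    return store


if __name__ == "__main__":
    for row in run(SAMPLE_ITEMS):
        print(row["id"], row["text"])
```

In the real dataflow each stage would be a separate Bytewax step, which is what allows the parsing and embedding work to be spread across workers; the loop here only mirrors the order of the stages.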
Python 3.11 is recommended. Install the dependencies, then run the dataflow:
pip install -r requirements.txt
python -m bytewax.run "pipeline:run_hn_flow()"
We'll use Milvus Lite in this tutorial. Milvus Lite is a lightweight version of Milvus that can be embedded into your Python application. It ships as a single binary that is easy to install and run on your machine. Install it together with the client (pymilvus):
python -m pip install "milvus[client]"
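To illustrate what "embedded into your Python application" can look like, here is a minimal sketch that starts Milvus Lite in-process and connects pymilvus to it. It assumes the release installed by `milvus[client]` exposes the `default_server` object; newer Milvus Lite releases may work differently, so check the version you have. The function name `start_embedded_milvus` and the `./milvus_data` directory are hypothetical choices for this example.

```python
def start_embedded_milvus(base_dir="./milvus_data"):
    """Start Milvus Lite inside this process and connect pymilvus to it.

    base_dir is a hypothetical location for Milvus Lite's data files.
    """
    # Imported lazily so this module can be loaded where Milvus is not installed.
    from milvus import default_server
    from pymilvus import connections, utility

    default_server.set_base_dir(base_dir)
    default_server.start()

    # Milvus Lite listens on a local port chosen at startup.
    connections.connect(host="127.0.0.1", port=default_server.listen_port)
    print("Connected to Milvus", utility.get_server_version())
    return default_server
```

Call `server = start_embedded_milvus()` before running code that talks to Milvus, and `server.stop()` when you are finished.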
The Milvus Command-Line Interface (CLI) supports database connection, data operations, and data import and export, so you can run queries and check that data is flowing through the pipeline. Install it with:
pip install milvus-cli