bytewax/real-time-milvus

Stream, Process, Embed, Repeat

Please see our website for the latest updates on this topic.

Your takeaway

In this tutorial, we will use Bytewax to build a pipeline that retrieves new Hacker News posts, parallelizes parsing and embedding creation, and writes the resulting embeddings to Milvus, our vector database.

Dataflow

[Figure: dataflow diagram]

The dataflow has five parts:

  • Input - stream stories and comments from the Hacker News API.
  • Preprocess - retrieve updates and filter for stories and comments.
  • Retrieve Content - download the HTML and parse it into usable text, thanks to the awesome Unstructured.io.
  • Vectorize - create an embedding or a list of embeddings for the text using Hugging Face Transformers.
  • Output - write the vectors to Milvus and create a new index.
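The five stages above can be sketched as a chain of plain Python generators. This is a minimal illustration of the data shape at each step, not the repository's Bytewax code: the sample items, the `html` field, the `toy_embed` function (a hash-based stand-in for a real Hugging Face model), and the in-memory list used in place of Milvus are all assumptions made for the example.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical sample items, loosely shaped like Hacker News API responses.
# The "html" field is an assumption standing in for downloaded page content.
SAMPLE_ITEMS = [
    {"id": 1, "type": "story", "html": "<p>Hello <b>world</b></p>"},
    {"id": 2, "type": "job", "html": "<p>Apply now</p>"},
    {"id": 3, "type": "comment", "html": "<p>Nice post</p>"},
]


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def source(items):
    """Input: emit items one at a time, as a stream would."""
    yield from items


def preprocess(stream):
    """Preprocess: keep only stories and comments."""
    for item in stream:
        if item.get("type") in {"story", "comment"}:
            yield item


def retrieve_content(stream):
    """Retrieve Content: reduce HTML to whitespace-normalized text."""
    for item in stream:
        parser = _TextExtractor()
        parser.feed(item["html"])
        text = " ".join("".join(parser.chunks).split())
        yield {**item, "text": text}


def toy_embed(text, dim=4):
    """Placeholder embedding: NOT a real model, just deterministic floats."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


def vectorize(stream):
    """Vectorize: attach an embedding to each document."""
    for item in stream:
        yield {"id": item["id"], "text": item["text"], "vector": toy_embed(item["text"])}


def sink(stream):
    """Output: collect records in a list (stand-in for a Milvus collection)."""
    return list(stream)


def run(items=SAMPLE_ITEMS):
    return sink(vectorize(retrieve_content(preprocess(source(items)))))


if __name__ == "__main__":
    for row in run():
        print(row["id"], row["text"], row["vector"])
```

In the actual pipeline, each of these functions would correspond to a step in the Bytewax dataflow, which is what allows the parsing and embedding work to be parallelized.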

Setting up your environment

Python 3.11 is recommended.

pip install -r requirements.txt

Run it

python -m bytewax.run "pipeline:run_hn_flow()"

Milvus Lite

We'll use Milvus Lite in this tutorial. Milvus Lite is a lightweight version of Milvus that can be embedded in your Python application. It ships as a single binary that is easy to install and run on your machine. Install it together with the client (pymilvus):

python -m pip install "milvus[client]"
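Once installed, a quick way to confirm that the embedded server works is to start it and ask for its version. The sketch below assumes the `milvus` package exposes `default_server` and that `pymilvus` provides `connections` and `utility`, as in the Milvus Lite releases that shipped under the `milvus[client]` name; the helper names `milvus_lite_available` and `check_milvus_lite` are hypothetical and not part of this repository.

```python
import importlib.util


def milvus_lite_available() -> bool:
    """Return True if both the embedded server and the client are importable."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("milvus", "pymilvus")
    )


def check_milvus_lite() -> str:
    """Start the embedded server, connect, and return the server version."""
    # Imported lazily so this module loads even without the packages installed.
    from milvus import default_server
    from pymilvus import connections, utility

    default_server.start()
    try:
        # The embedded server picks its own port; read it from the server object.
        connections.connect(host="127.0.0.1", port=default_server.listen_port)
        return utility.get_server_version()
    finally:
        default_server.stop()


if __name__ == "__main__":
    if milvus_lite_available():
        print("Milvus Lite version:", check_milvus_lite())
    else:
        print('Milvus Lite is not installed; run: python -m pip install "milvus[client]"')
```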

Milvus CLI

The Milvus Command-Line Interface (CLI) is a command-line tool that supports database connections, data operations, and data import and export. You can use it to run queries and check that data is making it through the pipeline. Install it with:

pip install milvus-cli
