Stream, Process, Embed, Repeat

Please see our website for the latest updates on this topic.

Your takeaway

In this tutorial, we use Bytewax to build a pipeline that retrieves new Hacker News posts, parses them and creates embeddings in parallel, and writes those embeddings to Milvus, our vector database.

Dataflow

The dataflow has five parts:

Input - stream stories and comments from the Hacker News API.
Preprocess - retrieve updates and filter for stories and comments.
Retrieve Content - download the HTML and parse it into usable text, thanks to the awesome Unstructured.io.
Vectorize - create an embedding or list of embeddings for the text using Hugging Face Transformers.
Output - write the vectors to Milvus and create a new index.
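
To make the shape of these stages concrete, here is a minimal plain-Python sketch of the same flow. It is not the code in pipeline.py and it does not use the Bytewax API: every function name is hypothetical, and the HTML fetcher, the embedding model, and the Milvus writer are passed in as callables so that stand-ins can be swapped for the real integrations. It assumes items are dictionaries shaped like Hacker News API items, with "id" and "type" fields plus "url" for stories and "text" for comments.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Item = Dict[str, object]


def preprocess(items: Iterable[Item]) -> Iterator[Item]:
    """Preprocess: keep only stories and comments, drop jobs, polls, etc."""
    for item in items:
        if item.get("type") in {"story", "comment"}:
            yield item


def retrieve_content(
    items: Iterable[Item], fetch_text: Callable[[str], str]
) -> Iterator[Item]:
    """Retrieve Content: attach plain text to each item.

    Stories with a URL are fetched and parsed by `fetch_text` (a stand-in
    for the download + Unstructured step); other items use their own text.
    """
    for item in items:
        if item["type"] == "story" and item.get("url"):
            text = fetch_text(str(item["url"]))
        else:
            text = str(item.get("text", ""))
        if text:  # skip items with nothing to embed
            yield {**item, "text": text}


def vectorize(
    items: Iterable[Item], embed: Callable[[str], List[float]]
) -> Iterator[Item]:
    """Vectorize: turn each item's text into an embedding via `embed`."""
    for item in items:
        yield {"id": item["id"], "vector": embed(str(item["text"]))}


def run_flow(
    source: Iterable[Item],
    fetch_text: Callable[[str], str],
    embed: Callable[[str], List[float]],
    sink: Callable[[Item], None],
) -> None:
    """Chain the stages; `sink` stands in for the Milvus writer (Output)."""
    staged = vectorize(retrieve_content(preprocess(source), fetch_text), embed)
    for record in staged:
        sink(record)


# Usage with toy stand-ins: a fixed page text and a one-dimensional
# "embedding" that is just the text length.
sample = [
    {"id": 1, "type": "story", "url": "https://example.com"},
    {"id": 2, "type": "job", "text": "hiring"},
    {"id": 3, "type": "comment", "text": "nice"},
]
stored: List[Item] = []
run_flow(sample, lambda url: "page text", lambda t: [float(len(t))], stored.append)
print(stored)
```

Because the stages are generators, each item moves through the whole chain one at a time, which mirrors the streaming shape of the dataflow; spreading the parsing and embedding work across workers is what Bytewax adds on top.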

Setting up your environment

Python 3.11 is recommended.

pip install -r requirements.txt

Run it

python -m bytewax.run "pipeline:run_hn_flow()"

Milvus Lite

We'll use Milvus Lite in this tutorial. Milvus Lite is a lightweight version of Milvus that can be embedded into your Python application. It is a single binary that is easy to install and run on your machine. Install it together with the client (pymilvus):

python -m pip install "milvus[client]"

Milvus CLI

The Milvus Command-Line Interface (CLI) is a command-line tool that supports database connection, data operations, and import and export of data. You can use it to run queries and check that data is making it through the pipeline. Install it with:

pip install milvus-cli

bytewax/real-time-milvus
