blog/01-announcement.md
Lines changed: 34 additions & 32 deletions
@@ -2,14 +2,14 @@

 DocArray has had a good run so far: Since being spun out of Jina ten months ago, the project has seen 141 releases, integrated six external storage backends, attracted contributors from five companies, and collected 1.4k GitHub stars.

-And yet we feel like we have to bring some big changes to the library in order to make it what we want it to be: the go-to solution for modelling, sending, and storing multi-modal data, with a particular soft spot for ML and neural search applications.
+And yet we feel like we have to bring some big changes to the library to make it what we want it to be: the go-to solution for modelling, sending, and storing multi-modal data, with a particular soft spot for ML and neural search applications.

 The purpose of this post is to outline the technical reasons for this transition, from the perspective of us, the maintainers.
 You might also be interested in a slightly different perspective, the one of Han Xiao, CEO of Jina AI and originator of DocArray. You can find his blog post [here](https://jina.ai/news/donate-docarray-lf-for-inclusive-standard-multimodal-data-model/).

-If you are interested in the progress of the rewrite itself, you can follow along on our [public roadmap](https://github.com/docarray/docarray/issues/780).
+If you're interested in the progress of the rewrite itself, you can follow along on our [public roadmap](https://github.com/docarray/docarray/issues/780).

-So, without further ado, let's delve into what the plans are for this v2 of DocArray, followed by some explanations of why we think that this is the right move.
+So, without further ado, let's delve into the plans for this v2 of DocArray, followed by some explanations of why we think that this is the right move.

 # The What
@@ -137,22 +137,22 @@ But there is a flip side to this: What does DocArray offer that differentiates i

 ## The Why: Data Modelling Perspective

-In the current DocArray, every `Document` has a fixed schema: It as a `text`, an `embedding`, a `tensor`, a `uri`, ...
-This setup is fine for simple use cases, but is not flexible enough for advanced scenarios:
+In the current DocArray, every `Document` has a fixed schema: It has a `text`, an `embedding`, a `tensor`, a `uri`, ...
+This setup is fine for simple use cases, but isn't flexible enough for advanced scenarios:

 - How can you store multiple embeddings for a **hybrid search use case**?
 - How do you model deeply nested data?
 - How do you store multiple data modalities in one object?

-All of these scenarios can only be solved by a solution that gives all the flexibility to the user, and ad dataclass-like API offers just that.
+All of these scenarios can only be solved by a solution that gives all the flexibility to the user, and a dataclass-like API offers just that.
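To make the three questions above concrete, here is a minimal sketch of what a user-defined schema could look like. It uses only the standard library's `dataclasses`, not the DocArray v2 API, and every class and field name (`TextDoc`, `ImageDoc`, `Article`, `dense_embedding`, ...) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextDoc:
    text: str
    # Two embeddings side by side, e.g. for a hybrid (sparse + dense) search.
    dense_embedding: Optional[List[float]] = None
    sparse_embedding: Optional[List[float]] = None


@dataclass
class ImageDoc:
    uri: str
    tensor: Optional[List[List[float]]] = None


@dataclass
class Article:
    # Several modalities in one object, with nested documents as fields.
    title: TextDoc
    cover: ImageDoc
    paragraphs: List[TextDoc] = field(default_factory=list)


article = Article(
    title=TextDoc(text="Hello", dense_embedding=[0.1, 0.2], sparse_embedding=[0.0, 1.0]),
    cover=ImageDoc(uri="cover.png"),
    paragraphs=[TextDoc(text="First paragraph")],
)
```

The schema is owned by the user: adding a third embedding or another level of nesting is a one-line change to the class, with no unused predefined fields left over.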
 ## The Why: ML and Training Perspective

 `DocumentArray` is fundamentally a row-based data structure: Every Document is one unit (row), that can be manipulated, shuffled around, etc. This is a great property to have in an information retrieval or neural search setting, where tasks like ranking require row-based data access.

 For other use cases like training an ML model, however, a column-based data structure is preferable: When you train your model, you want it to take in all data of a given mini-batch at once, as one big tensor; you don't want to first stack a bunch of tiny tensors before your forward pass.

-Therefore, we will be introducing a mode for selectively enabling column-based behaviour on certain fields of your data model ("stacked mode"):
+Therefore, we will introduce a mode to selectively enable column-based behaviour on certain fields of your data model ("stacked mode"):

 ```python
 from docarray import Document, DocumentArray
@@ -179,45 +179,47 @@ with da.stacked('tensor'):

 This will make DocArray much more suitable for ML training and inference, and even for use inside of ML models.

-## The Why: Document Store / Vector DB perspective
+## The Why: Document Store/Vector database perspective

 In the current DocArray, every `DocumentArray` can be mapped to a storage backend `da = DocumentArray(storage='annlite', ...)`.
-This offers a lot of convenience for simple use cases, but the conflation of the array concept and the DB concept lead to a number of problems:
+This offers a lot of convenience for simple use cases, but the conflation of the array concept and the database concept lead to a number of problems:

-- It is not always clear what data is on disk, and what data is in memory
-- Not all in-place operations on a DocumentArray are automatically reflected in the associated DB, while others are. This is due to the fact that some operations load data into memory before the manipulation happens, and means that a deep understanding of DocArray is necessary to know what is going on
-- Supporting list-like operations on a DB-like object carries overhead with little benefit
-- It is difficult to expose all the power and flexibility of various vector DBs through the `DocumentArray` API
+- It's not always clear what data is on disk, and what data is in memory.
+- Not all in-place operations on a DocumentArray are automatically reflected in the associated database, while others are. This is because some operations load data into memory before the manipulation happens, and means that a deep understanding of DocArray is necessary to know what's going on.
+- Supporting list-like operations on a database-like object carries overhead with little benefit.
+- It's difficult to expose all the power and flexibility of various vector databases through the `DocumentArray` API.

-All of the problems above currently make it difficult to use vector DBs through DocArray in production.
-Disentangling the concepts of `DocumentArray` and `DocumentStore` will give more transparency to the user, and more flexibility to the contributors, while directly solving most of the above.
+All of the problems above currently make it difficult to use vector databases through DocArray in production.
+Disentangling the concepts of `DocumentArray` and `DocumentStore` will give more transparency to the user, and more flexibility to contributors, while directly solving most of the above issues.
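The separation argued for above can be illustrated with a toy example. This is only a sketch under stated assumptions: `ToyDocumentStore`, its `index` and `get` methods, and the `Doc` class are all hypothetical names, a plain dict stands in for the database, and none of it is the planned `DocumentStore` API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Doc:
    id: str
    text: str


class ToyDocumentStore:
    """Hypothetical stand-in for a database-backed store (a dict plays the database)."""

    def __init__(self) -> None:
        self._db: Dict[str, Doc] = {}

    def index(self, docs: List[Doc]) -> None:
        # Explicit write: copies are persisted, so later in-memory edits do not leak in.
        for doc in docs:
            self._db[doc.id] = Doc(doc.id, doc.text)

    def get(self, doc_id: str) -> Doc:
        # Explicit read: returns a copy of what is stored.
        stored = self._db[doc_id]
        return Doc(stored.id, stored.text)


docs = [Doc("a", "hello"), Doc("b", "world")]  # plain in-memory list
store = ToyDocumentStore()
store.index(docs)

docs[0].text = "changed in memory"  # only the in-memory object changes
unchanged = store.get("a").text     # the store still holds the old value

store.index(docs)                   # persisting is a separate, visible step
updated = store.get("a").text       # now the store reflects the edit
```

With the array and the store as two distinct objects, the question of what is in memory and what is persisted is answered by which object you are holding, and writes to the store happen only when you call them.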
 ## The Why: Web Application Perspective

-Currently it is possible to use DocArray in combination with FastAPI and other web frameworks, as it already provides a translation to pydantic.
-However, this integration is not without friction:
+Currently it's possible to use DocArray in combination with FastAPI and other web frameworks, as it already provides a translation to pydantic.
+However, this integration is not without friction, since:

-- Since currently every Document follows the same schema, as Document payload cannot be customized
-- This means that one is forced to create payload with (potentially many) empty and unused fields
-- While at the same time, there is no natural way to add new fields
-- Sending requests from programming languages other than Python requires the user to needlessly recreate the Document's structure
+- Currently every Document follows the same schema, as Document payload cannot be customized.
+- This means that you're forced to create a payload with (potentially many) empty and unused fields.
+- At the same time, there is no natural way to add new fields.
+- Sending requests from programming languages other than Python requires you to needlessly recreate the Document's structure.

 By switching to a dataclass-first approach with pydantic as a fundamental building block, we are able to ease all of these pains:

-- Fields are completely customizable
-- Every `Document` is also a pydantic model, enabling amazing support for FastAPI and other tools
-- Creating payloads from other programming languages is as easy as creating a dictionary with the same fields as the dataclass - same workflow as with normal pydantic
+- Fields are completely customizable.
+- Every `Document` is also a pydantic model, enabling amazing support for FastAPI and other tools.
+- Creating payloads from other programming languages is as easy as creating a dictionary with the same fields as the dataclass - the same workflow as normal pydantic.
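The last point, a payload being just a dictionary with the same fields as the dataclass, can be sketched as follows. To keep the example dependency-free it uses a standard-library `dataclass` as a stand-in for a pydantic model; the `Product` class and its fields are hypothetical, and a real pydantic model would additionally validate and coerce field types.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Product:
    title: str
    price: float
    embedding: List[float]


# What a non-Python client would send: plain JSON with the same field names.
payload = '{"title": "Mug", "price": 7.5, "embedding": [0.1, 0.2, 0.3]}'

# Building the object raises TypeError if a field is missing or unknown.
product = Product(**json.loads(payload))

# Serializing back to JSON for the response uses the same field names.
round_trip = json.dumps(asdict(product))
```

The client never needs to know about a predefined `Document` layout; the field names of the schema are the whole contract.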
 ## The Why: Microservices Perspective

-In the land of cloud-nativeness and microservices, the concerns from "normal" web development also apply, but are often exacerbated due to the many network calls that occur, and other technologies such as protobuf and gRPC entering the game.
+In the land of cloud-nativeness and microservices, the concerns from "normal" web development also apply, but are often exacerbated due to the many network calls that occur, and other technologies such as Protobuf and gRPC entering the game.

-With this in mind, DocArray v2 can offer the following improvements
+With this in mind, DocArray v2 will offer the following improvements

-- Creating valid protobuf definitions from outside of Python will be as simple as doing the same for JSON: Just specify a mapping that includes the keys that you defined in the Document dataclass interface
-- It is no longer necessary to re-create the predefined Document structure in your Protobuf definitions
-- For every microservice, the Document schema can function as requirement or contract about the input and output data of that particular microservice
+- Creating valid Protobuf definitions from outside of Python will be as simple as doing the same for JSON: Just specify a mapping that includes the keys that you defined in the Document dataclass interface.
+- It's no longer necessary to re-create the predefined Document structure in your Protobuf definitions.
+- For every microservice, the Document schema can function as requirement or contract about the input and output data of that particular microservice.

-Currently, a DocArray-based microservice architecture will usually rely on `Document` being the unified input and output for all microservices. So there might be concern here: Won't this new, more flexible structure create a huge mess where microservices cannot rely on anything?
-We argue the opposite! In complex real-life settings, it is often the case that input and output Documents heavily rely on the `.chunks` field to represent nested data. Therefore, it is already unclear what exact data model can be expected.
-The shift to a dataclass-first approach allows you to make all of these (nested) data models explicit instead of implicit, leading to _more_ interoperability between microservices, not less.
+Currently, a DocArray-based microservice architecture usually relies on `Document` being the unified input and output for all microservices. So there might be concern here: Won't this new, more flexible structure create a huge mess where microservices cannot rely on anything?
+
+We argue the opposite! In complex real-life settings, input and output Documents often heavily rely on the `.chunks` field to represent nested data. Therefore, the exact data model that you can expect is already unclear.
+
+The shift to a dataclass-first approach lets you make all of these (nested) data models explicit instead of implicit, leading to _more_ interoperability between microservices, not less.