blog/01-announcement.md (9 additions, 10 deletions)
@@ -1,12 +1,11 @@
# DocArray v2: What and Why
DocArray has had a good run so far: Since being spun out of Jina ten months ago, the project has seen 141 releases, integrated six external storage backends, attracted contributors from five companies, and collected 1.4k GitHub stars.
And yet we feel like we have to bring some big changes to the library in order to make it what we want it to be: the go-to solution for modelling, sending, and storing multi-modal data, with a particular soft spot for ML and neural search applications.
The purpose of this post is to outline the technical reasons for this transition, from our perspective as the maintainers.
You might also be interested in a slightly different perspective: that of Han Xiao, CEO of Jina AI and originator of DocArray. You can find his blog post [here](https://jina.ai/news/donate-docarray-lf-for-inclusive-standard-multimodal-data-model/).
If you are interested in the progress of the rewrite itself, you can follow along on our [public roadmap](https://github.com/docarray/docarray/issues/780).
@@ -90,7 +89,7 @@ from docarray import DocumentArray
```python
da = DocumentArray([MyDoc(txt='hi there!') for _ in range(10)])
```
However, the commitment to a dataclass-like interface allows for DocumentArrays that are typed by a specific schema:
```python
da = DocumentArray[MyDoc]([MyDoc(txt='hi there!') for _ in range(10)])
```
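
For context, this excerpt never shows how `MyDoc` itself is defined. The following is only a rough sketch of the dataclass-first idea the post describes, using plain Pydantic as a stand-in rather than the actual DocArray v2 classes; the field names are invented for illustration.

```python
# Illustrative sketch: plain Pydantic standing in for the dataclass-like
# Document interface described in this post (not the confirmed v2 API).
from typing import List, Optional

from pydantic import BaseModel


class MyDoc(BaseModel):
    # declare exactly the fields your application needs
    txt: str
    embedding: Optional[List[float]] = None


docs: List[MyDoc] = [MyDoc(txt='hi there!') for _ in range(10)]
print(docs[0].txt)  # fields are explicit and validated, like a dataclass
```

A typed `DocumentArray[MyDoc]`, as shown above, can then enforce that every element conforms to this schema.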
@@ -188,24 +187,24 @@ This offers a lot of convenience for simple use cases, but the conflation of the
- It is not always clear what data is on disk and what data is in memory
- Some in-place operations on a DocumentArray are automatically reflected in the associated DB, while others are not. This is because some operations load data into memory before the manipulation happens, which means that a deep understanding of DocArray is needed to know what is going on
- Supporting list-like operations on a DB-like object carries overhead with little benefit
- It is difficult to expose all the power and flexibility of various vector DBs through the `DocumentArray` API
All of the problems above currently make it difficult to use vector DBs through DocArray in production.
Disentangling the concepts of `DocumentArray` and `DocumentStore` will give more transparency to the user, and more flexibility to the contributors, while directly solving most of the above.
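
To make that intent concrete, here is a toy sketch of what such a disentangled design could look like: an explicit, DB-like store that you index into and query, kept separate from the plain, in-memory list of documents. The `DocumentStore` class and its methods below are invented for illustration and are not the actual DocArray API.

```python
# Toy sketch of "list-like collection" vs. "explicit store" (illustrative only;
# DocumentStore below is a stand-in, not the real DocArray class).
from typing import List


class Doc:
    def __init__(self, txt: str, embedding: List[float]):
        self.txt = txt
        self.embedding = embedding


class DocumentStore:
    """DB-like object: explicit indexing and vector search, no list semantics."""

    def __init__(self) -> None:
        self._indexed: List[Doc] = []

    def index(self, docs: List[Doc]) -> None:
        # it is always clear when data enters the store: only on this call
        self._indexed.extend(docs)

    def find(self, query: List[float], limit: int = 3) -> List[Doc]:
        # naive dot-product similarity, standing in for a vector DB query
        def score(d: Doc) -> float:
            return sum(q * e for q, e in zip(query, d.embedding))

        return sorted(self._indexed, key=score, reverse=True)[:limit]


# the in-memory collection stays a plain, list-like object
docs = [Doc(txt=f"doc {i}", embedding=[float(i), 1.0]) for i in range(5)]

store = DocumentStore()
store.index(docs)                       # explicit hand-off to the "DB"
best = store.find([1.0, 0.0], limit=1)  # explicit query against the "DB"
print(best[0].txt)
```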
## The Why: Web Application Perspective
Currently it is possible to use DocArray in combination with FastAPI and other web frameworks, as it already provides a translation to Pydantic.
However, this integration is not without friction:
- Since currently every Document follows the same schema, a Document's payload cannot be customized
- This means that one is forced to create payloads with (potentially many) empty and unused fields
- While at the same time, there is no natural way to add new fields
- Sending requests from programming languages other than Python requires the user to needlessly recreate the Document's structure
By switching to a dataclass-first approach with Pydantic as a fundamental building block, we are able to ease all of these pains:
- Fields are completely customizable
- Every `Document` is also a Pydantic model, enabling amazing support for FastAPI and other tools
- Creating payloads from other programming languages is as easy as creating a dictionary with the same fields as the dataclass - same workflow as with normal Pydantic (see the sketch below)
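
As a concrete illustration of that workflow, here is a small sketch using plain Pydantic and FastAPI. The `TextDoc` schema and the `/embed` endpoint are invented for this example and are not part of DocArray.

```python
# Sketch of the intended FastAPI workflow using a plain Pydantic model
# (TextDoc and /embed are illustrative, not part of DocArray itself).
from typing import List, Optional

from fastapi import FastAPI
from pydantic import BaseModel


class TextDoc(BaseModel):
    # only the fields this service needs: no empty, unused payload fields
    txt: str
    embedding: Optional[List[float]] = None


app = FastAPI()


@app.post("/embed", response_model=TextDoc)
def embed(doc: TextDoc) -> TextDoc:
    # a client in any language just POSTs JSON with the same fields,
    # e.g. {"txt": "hi there!"} -- no Python-side Document class required
    doc.embedding = [float(len(doc.txt))]  # dummy embedding for the sketch
    return doc
```

Because the schema is an ordinary Pydantic model, FastAPI derives request validation and OpenAPI documentation from it automatically, and a non-Python client only needs to send JSON with matching field names.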
@@ -219,7 +218,7 @@ With this in mind, DocArray v2 can offer the following improvements
- You no longer need to re-create the predefined Document structure in your Protobuf definitions
- For every microservice, the Document schema can function as a requirement or contract for the input and output data of that particular microservice
Currently, a DocArray-based microservice architecture will usually rely on `Document` being the unified input and output for all microservices. So there might be a concern here: Won't this new, more flexible structure create a huge mess where microservices cannot rely on anything?
We argue the opposite! In complex real-life settings, it is often the case that input and output Documents rely heavily on the `.chunks` field to represent nested data. Therefore, it is already unclear what exact data model can be expected.
The shift to a dataclass-first approach allows you to make all of these (nested) data models explicit instead of implicit, leading to _more_ interoperability between microservices, not less.
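
To close with a sketch of what "explicit instead of implicit" nesting can look like, the example below models nested data as nested schemas rather than a generic `.chunks` field. Plain Pydantic is used as a stand-in, and all names are invented for illustration.

```python
# Nested data as explicit nested schemas instead of an implicit .chunks field.
# Pydantic stands in for the dataclass-like Document interface; the names
# below (Paragraph, PDFDoc) are invented for this example.
from typing import List

from pydantic import BaseModel


class Paragraph(BaseModel):
    text: str


class PDFDoc(BaseModel):
    title: str
    paragraphs: List[Paragraph]  # the nesting is part of the service contract


# a microservice that accepts PDFDoc states its (nested) input model up front;
# consumers no longer have to guess what lives inside .chunks
doc = PDFDoc(
    title="DocArray v2: What and Why",
    paragraphs=[Paragraph(text="The what."), Paragraph(text="The why.")],
)
print(doc.paragraphs[0].text)
```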
0 commit comments