# This week(s) in DocArray

It’s already been a month since the [last alpha release](https://github.com/docarray/docarray/releases/tag/2023.01.18.alpha) of DocArray v2. Since then a lot has happened: we’ve merged features we’re really proud of, and we keep crying tears of joy and misery while coercing Python into doing what we want. If you want to learn about interesting Python edge cases, or follow the progress of DocArray v2 development, this dev blog is the right place for you!

For those who don’t know, DocArray is a library for **representing, sending, and storing multi-modal data**, with a focus on applications in **ML** and **Neural Search**. The project just moved to LF AI & Data (part of the Linux Foundation), and to celebrate its first birthday we decided to rewrite it from scratch, mainly because of a design shift and a desire to solidify the codebase from the ground up.

## MultiModalDataset

As part of our goal to make DocArray the go-to library for representing, sending, and storing multi-modal data, we’ve added a `MultiModalDataset` class to easily convert DocumentArrays into PyTorch `Dataset`-compliant datasets that can be used with the PyTorch `DataLoader`.

All you need is a DocumentArray and a dictionary of preprocessing functions, and you’re up and running!

```python
from docarray import BaseDocument, DocumentArray
from docarray.data import MultiModalDataset
from docarray.documents import Text
from torch.utils.data import DataLoader

class Thesis(BaseDocument):
    title: Text

class Student(BaseDocument):
    thesis: Thesis

# get_students, embed_title and normalize_embedding are user-defined helpers (omitted here)
da: DocumentArray[Student] = get_students()
ds: MultiModalDataset[Student] = MultiModalDataset[Student](
    da,
    preprocessing={'thesis.title': embed_title, 'thesis': normalize_embedding},
)
loader: DataLoader = DataLoader(
    ds, batch_size=4, collate_fn=MultiModalDataset[Student].collate_fn
)

# Use your loader just like any other dataloader for awesome DL training
```

If you’re interested in using DocArray for training, check out our [example notebook](https://github.com/docarray/docarray/blob/feat-rewrite-v2/docs/tutorials/multimodal_training_and_serving.md), or take a peek at the [implementation details of MultiModalDataset](https://github.com/docarray/docarray/pull/1049).

## TensorFlow support

After recently adding PyTorch support, we’ve now gone on to add TensorFlow support to DocArray v2. As with PyTorch, our plan was to subclass the `tensorflow.Tensor` class with our own `TensorFlowTensor` class. That way DocArray can run operations on it, while we can still hand a `TensorFlowTensor` instance over to ML models or TensorFlow functions without TensorFlow getting confused by the instance’s class; it would simply recognize it as one of its own. Since we already implemented this for PyTorch, this should be easy, right?

But not so fast. At first glance, TensorFlow tensors seem to be of class `tf.Tensor`, right?

```python
import tensorflow as tf

tensor = tf.zeros((5,))
tensor
```

```python
<tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 0., 0., 0., 0.], dtype=float32)>
```

When trying to subclass `tf.Tensor` though, we notice that this does not seem to work:

```python
from typing import cast

import tensorflow as tf
from docarray.typing.tensor.abstract_tensor import AbstractTensor
from pydantic.tools import parse_obj_as

class TensorFlowTensor(AbstractTensor, tf.Tensor):
    @classmethod
    def validate(cls, value, field, config) -> 'TensorFlowTensor':
        if isinstance(value, tf.Tensor):
            value.__class__ = cls
            return cast(TensorFlowTensor, value)
        else:
            raise ValueError(f'Expected a tf.Tensor, got {type(value)}')

our_tensor = parse_obj_as(TensorFlowTensor, tf.zeros((5,)))  # will fail
```

Parsing a `tf.Tensor` as `TensorFlowTensor` will fail:

```python
pydantic.error_wrappers.ValidationError: 1 validation error for ParsingModel[TensorFlowTensor]
__root__
  __class__ assignment: 'TensorFlowTensor' object layout differs from 'tensorflow.python.framework.ops.EagerTensor' (type=type_error)
```

But wait, the error talks about an `EagerTensor`, not `tf.Tensor`. This is because TensorFlow supports eager execution as well as graph execution. It defaults to eager execution, where operations are evaluated immediately; in graph execution, a computational graph is constructed for later evaluation.
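
To see the difference between the two modes (a minimal standalone illustration, not DocArray code):

```python
import tensorflow as tf

print(tf.executing_eagerly())  # True: TF 2.x executes eagerly by default
print(type(tf.zeros((5,))).__name__)  # EagerTensor: the value is computed immediately

@tf.function  # traces the function into a computational graph
def make_zeros():
    t = tf.zeros((5,))
    print(type(t).__name__)  # during tracing, t is a symbolic graph tensor, not an EagerTensor
    return t

make_zeros()
```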

So maybe we just need to extend TensorFlow’s `EagerTensor` then!

This, however, doesn’t work either: the `EagerTensor` class is created on the fly, so trying to extend it fails with:

`TypeError: type 'tensorflow.python.framework.ops.EagerTensor' is not an acceptable base type`.
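
This failure is easy to reproduce. Since `EagerTensor` isn’t part of the public API, we have to grab the class from a concrete tensor first (a minimal sketch, just to demonstrate the error):

```python
import tensorflow as tf

# EagerTensor is not exposed publicly, so we look it up from an instance
EagerTensor = type(tf.zeros((5,)))
print(EagerTensor)  # <class 'tensorflow.python.framework.ops.EagerTensor'>

class MyEagerTensor(EagerTensor):  # TypeError: not an acceptable base type
    ...
```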

With all that being said, we’ve decided to go with the following solution for now:

Instead of extending TensorFlow’s tensor, we store a `tf.Tensor` instance as an attribute of our `TensorFlowTensor` class. Therefore, if you want to perform operations on the tensor data or hand it over to your ML model, you have to explicitly access the `.tensor` attribute:

```python
import tensorflow as tf
from docarray.typing import TensorFlowTensor

t = TensorFlowTensor(tensor=tf.zeros((224, 224)))

# TensorFlow functions work on the wrapped tf.Tensor
broadcasted = tf.broadcast_to(t.tensor, (3, 224, 224))
broadcasted = tf.broadcast_to(t.unwrap(), (3, 224, 224))
broadcasted = tf.broadcast_to(t, (3, 224, 224))  # this will fail
```

In the future we plan to take a closer look and find a solution that enables handling `TensorFlowTensor`s just like our `TorchTensor`s. In particular, we plan to investigate whether TensorFlow has an equivalent to Torch’s `__torch_function__()`, which we told you about in the [previous blog post](https://jina.ai/news/this-week-in-docarray-1). With such an equivalent, and some tricks here and there, we hope to enable smooth usage of our `TensorFlowTensor` class and make it feel like a subclass of TensorFlow’s tensor, without it actually being one.
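
For context, here is roughly what `__torch_function__` buys us on the PyTorch side (a simplified sketch with a made-up class name, not DocArray’s actual `TorchTensor` implementation): because every `torch.*` call is routed through this hook, results keep the subclass type instead of degrading to a plain `torch.Tensor`.

```python
import torch

class MyTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # called for every torch.* function that sees a MyTensor;
        # delegating to the parent implementation keeps results typed as MyTensor
        kwargs = kwargs or {}
        return super().__torch_function__(func, types, args, kwargs)

t = torch.zeros(3).as_subclass(MyTensor)
print(type(torch.add(t, 1)))  # <class '__main__.MyTensor'>
```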

## Nested classes and multiprocessing

As part of our goal to make DocArray the go-to library for representing, sending, and storing multi-modal data, it’s important that DocumentArrays support multiprocessing, that is, processing across multiple CPU cores.

In particular, we recently implemented the `MultiModalDataset` class to easily convert a DocumentArray into a dataset that can be used with the PyTorch DataLoader. The PyTorch DataLoader wraps the Python multiprocessing module to run preprocessing on multiple CPUs.
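
Concretely, it’s the `num_workers` argument that enables this (a small sketch reusing the `ds` dataset and `Student` document from the first example): with `num_workers > 0`, the DataLoader spawns worker processes that run the preprocessing in parallel.

```python
from torch.utils.data import DataLoader

# with num_workers > 0, batches are preprocessed in separate worker processes
loader = DataLoader(
    ds,
    batch_size=4,
    num_workers=2,
    collate_fn=MultiModalDataset[Student].collate_fn,
)
```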

**The problem**

One of the well-known issues with multiprocessing is that it doesn’t support classes that are declared inside a function:

```python
def get_class():
    class B:
        ...

    return B

MyClass = get_class()

def foo(*args):
    return MyClass()

import multiprocessing as mp

with mp.get_context('fork').Pool(2) as p:
    print(p.map(foo, range(2)))
```

```bash
Traceback (most recent call last):
  File "/Users/jackmin/Jina/docarray/meow.py", line 13, in <module>
    print(p.map(foo, range(2)))
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<__main__.get_class.<locals>.B object at 0x10152e950>]'. Reason: 'AttributeError("Can't pickle local object 'get_class.<locals>.B'")'
```

**Pickling**

This is because multiprocessing uses pickle to share objects with workers. Pickling saves only an object’s qualified class name, and unpickling re-imports the class by that name. For that to work, the class needs a global qualified name. Classes defined inside functions are local and thus cannot be pickled:

```python
def get_class():
    class B:
        ...

    return B

MyClass = get_class()

import pickle

pickle.dump(MyClass(), open('meow.pkl', 'wb'))
```

```bash
Traceback (most recent call last):
  File "/Users/jackmin/Jina/docarray/meow.py", line 10, in <module>
    pickle.dump(MyClass(), open("meow.pkl", "wb"))
AttributeError: Can't pickle local object 'get_class.<locals>.B'
```

In order to get around this, we need to make the declared class global:

```python
def get_class():
    global B

    class B:
        ...

    return B

MyClass = get_class()

import pickle

pickle.dump(MyClass(), open('meow.pkl', 'wb'))
```

We can now load the pickled object in a separate process, as long as that process has a declaration of our class:

```python
def get_class():
    global B

    class B:
        ...

    return B

MyClass = get_class()

import pickle

print(pickle.load(open('meow.pkl', 'rb')))
```

It doesn’t really matter how it ends up in the global scope. We can even do this:

```python
class B:
    ...

import pickle

print(pickle.load(open('meow.pkl', 'rb')))
```

**The fix?**

OK, so pickle just wants the class to be global. Simple enough, right? Let’s just plop `global` in front of our declaration and be done with it.

```python
def get_class():
    global B

    class B:
        ...

    return B

MyClass = get_class()

def foo(*args):
    return MyClass()

import multiprocessing as mp

with mp.get_context('fork').Pool(2) as p:
    print(p.map(foo, range(2)))
```

Yay, this runs fine. But what if our function returns a different class depending on its input arguments? I mean, why else would you want to return a class from a function?

```python
def get_class(version: int):
    global B

    class B:
        VERSION: int = version

    return B

C1 = get_class(1)
C2 = get_class(2)

def get_version(cls):
    print(cls)
    return cls.VERSION

import multiprocessing as mp

with mp.get_context('fork').Pool(2) as p:
    print(p.map(get_version, [C1, C2]))
```

```bash
<class '__main__.B'>
Traceback (most recent call last):
  File "/Users/jackmin/Jina/docarray/meow.py", line 19, in <module>
    print(p.map(get_version, [C1, C2]))
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py", line 540, in _handle_tasks
    put(task)
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class '__main__.B'>: it's not the same object as __main__.B
```

`Can't pickle <class '__main__.B'>: it's not the same object as __main__.B`. What does that mean?

**Double declaration**

Well, our little trick has some caveats. By performing a global declaration, we’re essentially hoisting the class declaration into the top-level scope. In other words, we’re doing this:

```python
class B:
    VERSION: int = 1

C1 = B

class B:
    VERSION: int = 2

C2 = B

def get_version(cls):
    print(cls)
    return cls.VERSION

import multiprocessing as mp

with mp.get_context('fork').Pool(2) as p:
    print(p.map(get_version, [C1, C2]))
```

If we run this code, we get the exact same error we got before:

```bash
<class '__main__.B'>
Traceback (most recent call last):
  File "/Users/jackmin/Jina/docarray/wow.py", line 15, in <module>
    print(p.map(get_version, [C1, C2]))
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py", line 540, in _handle_tasks
    put(task)
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class '__main__.B'>: it's not the same object as __main__.B
```

What happened here? By declaring the class twice, we’ve overwritten the first class `B` with the second one in the global scope. Pickle notices this when it tries to serialize `C1`: the class `B` that `C1` refers to is no longer the one living at the top level, so it raises an exception.
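
We can see the mismatch directly (a tiny illustration of the state pickle is complaining about):

```python
import pickle

class B:
    VERSION: int = 1

C1 = B

class B:  # redefining B shadows the first class under the same qualified name
    VERSION: int = 2

print(C1 is B)  # False: C1 still points to the first class
pickle.dumps(C1)  # PicklingError: it's not the same object as __main__.B
```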

**Qualified names must be unique**

The issue here is that both `B` classes have the same qualified name, so the two definitions fight over which one the global dictionary knows about.

We can resolve this conflict, and allow our two classes to live together peacefully, by moving them to different qualified names and thus different keys in the global scope:

```python
def get_class(version: int):
    global B

    class B:
        VERSION: int = version

    B.__qualname__ = B.__qualname__ + str(version)
    globals()[f'B{version}'] = B
    return B

C1 = get_class(1)
C2 = get_class(2)

def get_version(cls):
    print('Class Name:', cls.__name__)
    print('Class Qualified Name:', cls.__qualname__)
    print('Type repr', cls)
    return cls.VERSION

import multiprocessing as mp

with mp.get_context('fork').Pool(2) as p:
    print(p.map(get_version, [C1, C2]))
```

```bash
Class Name: B
Class Qualified Name: B1
Type repr <class '__main__.B1'>
Class Name: B
Class Qualified Name: B2
Type repr <class '__main__.B2'>
[1, 2]
```

Notice that although the two classes have different qualified names, they can still share the same name without issue. Printing the type does, however, show the qualified name.

**Implementation example**

If you’d like to see how we used this pattern to implement DocumentArrays that work with multiprocessing, check out [this PR](https://github.com/docarray/docarray/pull/1049).

## Support for Protobuf 3 and 4

[Protobuf](https://protobuf.dev/) introduced a [breaking change](https://github.com/tensorflow/tensorflow/issues/56077) in its 4.21 release. This has had a big impact on the Python ecosystem, and a lot of libraries have not yet been updated to work with version 4.x. Perhaps the biggest pain point for the ML ecosystem is TensorFlow’s lack of support for Protobuf 4.x, as it’s a widely used library that many packages, including DocArray, depend on.

At the same time, DocArray can be used without TensorFlow; it’s just one of several available backends. To better support all users, we’ve decided to support both versions of Protobuf.

This is actually easier than it may sound. We simply generated two Python files with `protoc`, one for each of the Protobuf versions we want to support (3.x and 4.x).

So, depending on the Protobuf version you have installed, we load either the first or the second version of the generated file. It’s as straightforward as that. [Here](https://github.com/docarray/docarray/pull/1078) is the PR for the curious.
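
The dispatch can be as simple as a version check at import time (a hedged sketch; the module paths here are illustrative, not DocArray’s actual layout):

```python
from google.protobuf import __version__ as protobuf_version

# load the bindings generated for the installed protobuf major version
if int(protobuf_version.split('.')[0]) >= 4:
    from docarray.proto.pb import docarray_pb2
else:
    from docarray.proto.pb2 import docarray_pb2
```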

## Join the conversation

Want to keep up to date or just have a chat with us? **[Join our Discord](https://discord.gg/WaMp6PVPgR)** and say hi!