
Add a new datastore format. #532

Closed

Conversation

tikurahul
Collaborator

@tikurahul commented Jan 21, 2020

  • This is an implementation of a completely new datastore for Donkey.
    Rather than have a single file for every record, this datastore
    stores all records in a seekable file with O(1) access time. Images
    are still stored separately.

  • This format uses newline-delimited JSON and stores a configurable
    number of records per file (a sketch of the record layout follows
    this list). It has high-level APIs which make it easy to consume
    records consistently.

  • I also ported over tubclean and tubplot, and provide a script,
    convert_to_tub_v2.py, to move from the old datastore to the new format.

  • I ported over the Linear and the Inferred models so far; adding support
    for other models should be straightforward.

  • I also removed a lot of dead code which should make things a lot easier to understand going forward.
    For now the old train.py is archived, as we will port over more things from the old training script.

  • The new training pipeline also supports arbitrary pre-processing steps
    via imgaug, which is now a new dependency.

  • Added benchmarks for the new datastore.

  • Added a new experimental KerasInferred model which learns steering, but
    not throttle. We then infer throttle based on an inverted exponential
    curve (sketched after this list).

  • Added lots of unit tests.
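
To make the catalog format concrete, here is a minimal sketch of what a newline-delimited JSON file might contain; the field names and catalog filename are illustrative assumptions, not the exact schema from this PR:

import json

# Each line is one self-contained JSON record, so a reader can seek to a
# byte offset recorded in the manifest and parse exactly one line.
records = [
    {'_index': 0, '_timestamp': 1579590000000,
     'cam/image_array': '0_cam.jpg', 'user/angle': 0.10, 'user/throttle': 0.3},
    {'_index': 1, '_timestamp': 1579590000050,
     'cam/image_array': '1_cam.jpg', 'user/angle': 0.12, 'user/throttle': 0.3},
]
with open('catalog_0.catalog', 'w') as f:  # filename is an assumption
    for record in records:
        f.write(json.dumps(record) + '\n')

And a hedged sketch of the inverted-exponential throttle inference mentioned in the KerasInferred bullet; the constants and the exact shape of the curve are assumptions, not the formula in this PR:

import math

def infer_throttle(steering, t_max=1.0, t_min=0.3, k=3.0):
    # Hypothetical curve: near-full throttle when steering is close to zero,
    # decaying exponentially toward t_min as |steering| grows.
    return t_min + (t_max - t_min) * math.exp(-k * abs(steering))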

@tikurahul force-pushed the autorope-dev branch 7 times, most recently from cb9c1f0 to 04e29ca on January 21, 2020 07:10
@tikurahul force-pushed the autorope-dev branch 4 times, most recently from b4dcb2b to 2c4193c on February 1, 2020 20:43
@DocGarbanzo
Contributor

Hi Rahul,
I've checked out your code and ran a training on my tubs after conversion. This is working well, and I see the current tub really is a bit of a hacky concept, with a lot of files and logic between data entries and file names.
Training seems to be running at the same speed as with the old tub - you don't expect any performance gains there, right? Performance gains mostly come from deletion operations where you simply mark entries as deleted instead of physically removing them.
The structure of the files in the new tub, with Catalog and Manifest files, doesn't become much simpler in the new design, even if it is well engineered. In the end the tub is just a database, and I was wondering if it is worth doing all of these improvements instead of just storing the data in a pandas DataFrame and saving it as HDF5, which pandas supports. Appending single rows to DataFrames doesn't seem very efficient, but appending lists of, say, 1000 dictionaries (like your catalog size) should be fast, plus a single column of 0/1 or True/False for valid/to-delete rows. All of this is probably much easier than your code change, but maybe you tried this and didn't find it working well?
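
For reference, a minimal sketch of the HDF5 alternative described above, assuming pandas with PyTables installed; the batch size mirrors the catalog size, and all names are illustrative:

import pandas as pd

BATCH = 1000  # analogous to max_catalog_len
buffer = []

def write_record(record, store_path='tub.h5'):
    # Buffer records and append them to the HDF5 store in batches,
    # since appending single rows to a DataFrame is inefficient.
    buffer.append(record)
    if len(buffer) >= BATCH:
        df = pd.DataFrame(buffer)
        df['deleted'] = False  # soft-delete flag instead of physical removal
        df.to_hdf(store_path, key='records', mode='a',
                  append=True, format='table')
        buffer.clear()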

@tikurahul
Collaborator Author

Thanks for testing it out.

The huge gains are when you use the new datastore to collect data for training. Writes are between 5 and 10x faster, which makes a huge difference if you want to push higher frame rates.

We stopped using Pandas some time ago and were using a modified pipeline. This training script has multiprocessing disabled, but you can get huge benefits by turning it on; the underlying store and sequences are multi-process safe.

Also, it's now super easy to add preprocessing steps without affecting the rest of the pipeline, which is very important for building stable models.

@tikurahul
Collaborator Author

Also Pandas is painfully slow.

We tried making those improvements before; that's why we ended up with the modified pipeline that Tawn built. So HDF5 is almost a non-starter.

@DocGarbanzo
Contributor

Cool, thanks for explaining. I should try the recording, too. To enable multiprocessing in training, do I set workers=N instead of workers=1 in train.py:193? Will give it a shot and report.

@tikurahul
Collaborator Author

Yes, that is correct.
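
For reference, a hedged sketch of what that change looks like; the surrounding names (model, training_sequence, cfg) are assumptions about train.py, not exact code from this PR:

# Keras' fit() parallelizes Sequence-based input across worker processes
# when use_multiprocessing=True (TF 2.x API; removed in Keras 3).
history = model.fit(
    training_sequence,                    # a keras.utils.Sequence over the tub
    validation_data=validation_sequence,
    epochs=cfg.MAX_EPOCHS,
    workers=4,                            # was workers=1
    use_multiprocessing=True,
)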


def benchmark():
    # Change to a non-SSD storage path
    path = Path('/media/rahulrav/Cruzer/tub')
Contributor

please try not to hardcode paths :)

Collaborator Author

Will fix.
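
One way the fix could look, sketched with a temporary directory so the benchmark runs anywhere; run_benchmark is a hypothetical stand-in for the benchmark body, not code from this PR:

import tempfile
from pathlib import Path

def benchmark():
    # Use a throwaway directory by default; pass an explicit path only
    # when deliberately benchmarking a particular device.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / 'tub'
        run_benchmark(path)  # hypothetical helper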


def benchmark():
    # Change to a non SSD storage path
    path = Path('/media/rahulrav/Cruzer/benchmark')
Contributor

as above

Collaborator Author

Will fix.

@@ -158,8 +158,7 @@ def __init__(self, num_outputs=2, input_shape=(120, 160, 3), roi_crop=(0, 0), *a
         self.compile()

     def compile(self):
-        self.model.compile(optimizer=self.optimizer,
-                           loss='mse')
+        self.model.compile(optimizer=self.optimizer, loss='mse')
Contributor

Does this need to be in this PR?
Does it affect the new datastore format directly?

I believe it would fit better in a separate PR focused on optimizing the AI/ML part, added to the dev branch, especially if it is not related to the datastore itself. What do you think?

Collaborator Author

This is useful because it shows how to add a model to the new pipeline.

Contributor

OK, this is just a newline removal.

@@ -170,6 +169,23 @@ def run(self, img_arr):

+class KerasInferred(KerasPilot):
Contributor

as above

Collaborator Author

Same as above.

@@ -317,9 +333,9 @@ def default_categorical(input_shape=(120, 160, 3), roi_crop=(0, 0)):

 def default_n_linear(num_outputs, input_shape=(120, 160, 3), roi_crop=(0, 0)):

-    drop = 0.1
+    drop = 0.2
Contributor

From my perspective it is not clear what was changed, why, and what the impact is. Maybe adding a single-line comment would be sufficient?

Collaborator Author

This just increases the dropout in the linear model.

# [TensorRT] ERROR: Internal error: could not find any implementation for node 2-layer MLP, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
# [TensorRT] ERROR: ../builder/tacticOptimizer.cpp (1230) - OutOfMemory Error in computeCosts: 0
builder.max_workspace_size = 1 << 20 #common.GiB(1)

Contributor

Is the deleted comment no longer valid?

Collaborator Author

It was never technically correct. Defining the workspace size just had the intended side effect. It is also not a very useful comment.

    A datastore to store sensor data in a key/value format.
    Accepts str, int, float, image_array, image, and array data types.
    '''
    def __init__(self, base_path, inputs=[], types=[], metadata=[], max_catalog_len=1000):
Contributor

Could you please explain more about max_catalog_len?
I get a feeling it is some kind of limitation that people should be aware of, or am I wrong?

Collaborator Author

No need to worry about this at all. For reference, this is the maximum number of records per catalog file.
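
For illustration, constructing a tub with the signature shown in the diff above; the module path and the write_record method are assumptions based on the PR description, not verified against this branch:

import numpy as np
from donkeycar.parts.tub_v2 import Tub  # module path is an assumption

tub = Tub(base_path='data/tub_v2',
          inputs=['cam/image_array', 'user/angle', 'user/throttle'],
          types=['image_array', 'float', 'float'],
          max_catalog_len=1000)  # at most 1000 records per catalog file
image = np.zeros((120, 160, 3), dtype=np.uint8)
# After 1000 writes the tub rolls over to a new catalog file automatically.
tub.write_record({'cam/image_array': image,
                  'user/angle': 0.0,
                  'user/throttle': 0.2})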

contents[key] = name

# Private properties
contents['_timestamp'] = int((time.time() / 1000) * 1000)
Contributor

Can you please explain this?
Division and then multiplication?

Collaborator Author

time.time() returns a floating point timestamp. Dividing and multiplying by 1000 gives us the timestamp in milliseconds.

Contributor

Maybe time.monotonic() to avoid issues of clock skews on some systems?
(I've seen really strange things....)

Or time.monotonic_ns() which returns int but it requires python 3.7

Contributor

This is effectively just rounding, but because divisions are expensive it is not the best way to do it; round() would be faster here.
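
For the record, a short sketch of the alternatives discussed in this thread. Note that int((time.time() / 1000) * 1000) is mathematically the same as int(time.time()), i.e. whole seconds; a true millisecond timestamp needs only a multiplication:

import time

millis_wall = int(time.time() * 1000)       # wall-clock ms since the epoch
millis_mono = int(time.monotonic() * 1000)  # monotonic ms, immune to clock skew
# Python >= 3.7: time.monotonic_ns() // 1_000_000 avoids float arithmetic entirely.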

@@ -1,4 +1,6 @@
# -*- coding: utf-8 -*- #
Contributor

Well, this change looks weird :)

Collaborator Author

This is just an archived change.

Contributor

Just delete files that are not needed; they are in git history anyway. Otherwise they clutter the space and may even confuse people, especially if they are unused.

@@ -480,6 +480,12 @@ def get_model_by_type(model_type, cfg):
     elif model_type == "fastai":
         from donkeycar.parts.fastai import FastAiPilot
         kl = FastAiPilot()
+    elif model_type == "transfer":
Contributor

This looks as if it does not belong in this PR.
Make a new one?

Collaborator Author

It does, because it's a useful illustration of how to add new models to the new training pipeline.

Contributor

@nvtkaszpir left a comment

Note to self - try this one.

@nvtkaszpir
Contributor

One more thing: is it going to support Python 3.5?

@tikurahul force-pushed the autorope-dev branch 5 times, most recently from f7124b6 to edbc3e6 on March 7, 2020 16:43
@DocGarbanzo
Contributor

@tikurahul & @tawnkramer - what are the plans to get this change merged? Is there an issue w/ pathlib and python 3.5 in the tests?

@tikurahul
Collaborator Author

tikurahul commented Mar 17, 2020

Yes. Python 3.5 is still supported.

The change itself is ready. I plan to make a few small improvements, but those can happen as separate PRs.

@DocGarbanzo
Contributor

Well, it looked like there were issues on Travis with the 3.5 build. Would be great to have this change merged.

@tikurahul
Collaborator Author

Will add support for the remaining models, and submit one last pull request.

        return augmentation

    @classmethod
    def trapezoidal_mask(cls, lower_left, lower_right, upper_left, upper_right, min_y, max_y):
Contributor

Nice! When does this get applied, in training and in the autopilot? More specifically, I want to know if this alters data on the way into the tub, or just at training and inference time.
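
For readers unfamiliar with the transform, a rough sketch of what a trapezoidal mask does (not this PR's implementation, which builds an augmentation): pixels inside the trapezoid are kept and everything else is zeroed, typically to crop sky and car body out of the camera frame.

import numpy as np
import cv2

def apply_trapezoidal_mask(img, lower_left, lower_right, upper_right, upper_left):
    # Each corner is an (x, y) pair. Fill the trapezoid with white on a
    # black single-channel mask, then keep only the pixels under it.
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    corners = np.array([lower_left, lower_right, upper_right, upper_left],
                       dtype=np.int32)
    cv2.fillPoly(mask, [corners], 255)
    return cv2.bitwise_and(img, img, mask=mask)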

@tawnkramer
Contributor

tawnkramer commented Jun 25, 2020

Great work here, Rahul!

  • Not excited to see scikit-learn added again as a dependency. I worked to remove it because occasionally you were forced to build it, and like pandas, that was super painful. Also, we used it very little, and it prevented training from running on the Nano. Not the best idea to train on an SBC, but no reason to prevent it. Can we trim this? (edit: I see we just use train_test_split. We have our own version of that. Did you see a problem with it?)

  • Are the doc changes to build_hardware.md accidental? I don't think you want any of those.

  • I would like to call the tub conversion to the new format as part of the donkey update command. We can keep it as a standalone tool too.

  • Will start testing it out now. Those were just initial impressions.

@tawnkramer
Contributor

I was hoping to convert all my tubs in one command with

python ~/projects/donkeycar/scripts/convert_to_tub_v2.py --tub=./data/* --output=./data_conv

I'll do them one at a time...
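
Until the script grows glob support, a hedged workaround is to loop over the tubs yourself; the flags mirror the invocation above, and the per-tub output naming is an assumption:

import glob
import os
import subprocess

for tub in sorted(glob.glob('./data/*')):
    out = os.path.join('./data_conv', os.path.basename(tub))
    subprocess.run(['python', 'scripts/convert_to_tub_v2.py',
                    '--tub=' + tub, '--output=' + out], check=True)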

@tawnkramer
Contributor

I found that the converted data set has the same value of _timestamp for all records; it didn't retain the original value from the "milliseconds" field. I also don't see that _timestamp is a better name than "milliseconds": it loses clarity and invites questions about the scale.

@tawnkramer
Contributor

Previously I could train with all tubs like this:

python train.py  --model=./models/test.h5 --tub=./data_conv/*

or with no tub argument. Can we support these easy cases?

@tawnkramer
Contributor

donkey tubcheck : broken
donkey tubplot : confusing args "--model MODEL name of record to create histogram", doesn't take records to plot.
donkey makemovie : broken
donkey tubhist : broken
donkey tubaugment : broken

broken models: rnn, 3d, latent, behavior, localizer, tflite_linear

@tikurahul
Collaborator Author

Thanks for taking a thorough look.

A couple of the tools don't need to exist anymore. For example, tubaugment is quite restrictive in terms of the transformations it supports; the new pipeline enables much more.

tflite_linear was something I missed adding support for, but the others are intentionally absent; I did this mostly for convenience when making such a huge change. We can add them back as we see fit, as we migrate more things to TF 2.x.

@DocGarbanzo
Contributor

The idea of donkey tubaugment was to run it before training, as I found the augmentation costs some milliseconds per record, and when you train with 100k records this overhead adds to your epoch time quite a bit (I actually train with 500k records). I agree that it is not sophisticated; it just applies the same code we have been using in-training so far.

@tikurahul
Collaborator Author

Closing this PR. Will open against the 4.x branch.

@tikurahul closed this Sep 12, 2020