
Add a new datastore format. #532

Closed

Conversation

tikurahul
Collaborator

@tikurahul commented Jan 21, 2020

  • This is an implementation of a completely new datastore for Donkey.
    Rather than have a single file for every record, this datastore
    stores all records in a seekable file with O(1) access time. Images
    are still stored separately.

  • This format uses newline-delimited JSON and stores a configurable
    number of records per file (a sketch of the record layout follows
    this list). It has high-level APIs which make it easy to consume
    records consistently.

  • I also ported over tubclean and tubplot, and provide a script,
    convert_to_tub_v2.py, to move from the old datastore to the new format.

  • I ported over the Linear and the Inferred models so far; adding support
    for other models should be straightforward.

  • I also removed a lot of dead code which should make things a lot easier to understand going forward.
    For now the old train.py is archived, as we will port over more things from the old training script.

  • The new training pipeline also supports arbitrary pre-processing steps
    via imgaug, which is now a new dependency.

  • Added benchmarks for the new datastore.

  • Added a new experimental KerasInferred model which learns steering, but
    not throttle. We then infer throttle based on an inverted exponential
    curve (sketched after this list).

  • Added lots of unit tests.
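
To make the catalog format concrete, here is a minimal sketch of what a newline-delimited JSON file might contain; the field names and catalog filename are illustrative assumptions, not the exact schema from this PR:

import json

# Each line is one self-contained JSON record, so a reader can seek to a
# byte offset recorded in the manifest and parse exactly one line.
records = [
    {'_index': 0, '_timestamp': 1579590000000,
     'cam/image_array': '0_cam.jpg', 'user/angle': 0.10, 'user/throttle': 0.3},
    {'_index': 1, '_timestamp': 1579590000050,
     'cam/image_array': '1_cam.jpg', 'user/angle': 0.12, 'user/throttle': 0.3},
]
with open('catalog_0.catalog', 'w') as f:  # filename is an assumption
    for record in records:
        f.write(json.dumps(record) + '\n')

And a hedged sketch of the inverted-exponential throttle inference mentioned in the KerasInferred bullet; the constants and the exact shape of the curve are assumptions, not the formula in this PR:

import math

def infer_throttle(steering, t_max=1.0, t_min=0.3, k=3.0):
    # Hypothetical curve: near-full throttle when steering is close to zero,
    # decaying exponentially toward t_min as |steering| grows.
    return t_min + (t_max - t_min) * math.exp(-k * abs(steering))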

@tikurahul force-pushed the autorope-dev branch 7 times, most recently from cb9c1f0 to 04e29ca on January 21, 2020 07:10
@tikurahul force-pushed the autorope-dev branch 4 times, most recently from b4dcb2b to 2c4193c on February 1, 2020 20:43
@DocGarbanzo
Contributor

Hi Rahul,
I've checked out your code and ran a training on my tubs after conversion. This is working well, and I see the current tub really is a bit of a hacky concept, with a lot of files and logic between data entries and file names.
Training seems to be running at the same speed as with the old tub - you don't expect any performance gains there, right? Performance gains mostly come from deletion operations where you simply mark entries as deleted instead of physically removing them.
The structure of the files in the new tub, with Catalog and Manifest files, doesn't become much simpler in the new design, even if it is well engineered. In the end the tub is just a database, and I was wondering if it is worth doing all of these improvements instead of just storing the data in a pandas DataFrame and saving it as HDF5, which pandas supports. Appending single rows to DataFrames doesn't seem very efficient, but appending lists of, say, 1000 dictionaries (like your catalog size) should be fast, plus a single column of 0/1 or True/False for valid/to-delete rows. All of this is probably much easier than your code change, but maybe you tried this and didn't find it working well?
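
For reference, a minimal sketch of the HDF5 alternative described above, assuming pandas with PyTables installed; the batch size mirrors the catalog size, and all names are illustrative:

import pandas as pd

BATCH = 1000  # analogous to max_catalog_len
buffer = []

def write_record(record, store_path='tub.h5'):
    # Buffer records and append them to the HDF5 store in batches,
    # since appending single rows to a DataFrame is inefficient.
    buffer.append(record)
    if len(buffer) >= BATCH:
        df = pd.DataFrame(buffer)
        df['deleted'] = False  # soft-delete flag instead of physical removal
        df.to_hdf(store_path, key='records', mode='a',
                  append=True, format='table')
        buffer.clear()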

@tikurahul
Collaborator Author

Thanks for testing it out.

The huge gains are when you use the new datastore to collect data for training. Writes are between 5 and 10x faster, which makes a huge difference if you want to push higher frame rates.

We stopped using Pandas some time ago and were using a modified pipeline. This training script has multiprocessing disabled, but you can get huge benefits by turning it on; the underlying store and sequences are multi-process safe.

Also, it's now super easy to add preprocessing steps without affecting the rest of the pipeline, which is very important for building stable models.

@tikurahul
Collaborator Author

Also Pandas is painfully slow.

We tried making those improvements before; that's why we ended up with the modified pipeline that Tawn built. So HDF5 is almost a non-starter.

@DocGarbanzo
Contributor

Cool, thanks for explaining. I should try the recording, too. To enable multiprocessing in training, do I set workers=N instead of workers=1 in train.py:193? Will give it a shot and report.

@tikurahul
Collaborator Author

Yes, that is correct.
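
For reference, a hedged sketch of what that change looks like; the surrounding names (model, training_sequence, cfg) are assumptions about train.py, not exact code from this PR:

# Keras' fit() parallelizes Sequence-based input across worker processes
# when use_multiprocessing=True (TF 2.x API; removed in Keras 3).
history = model.fit(
    training_sequence,                    # a keras.utils.Sequence over the tub
    validation_data=validation_sequence,
    epochs=cfg.MAX_EPOCHS,
    workers=4,                            # was workers=1
    use_multiprocessing=True,
)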


def benchmark():
    # Change to a non-SSD storage path
    path = Path('/media/rahulrav/Cruzer/tub')
Contributor

please try not to hardcode paths :)

Collaborator Author

Will fix.
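
One way the fix could look, sketched with a temporary directory so the benchmark runs anywhere; run_benchmark is a hypothetical stand-in for the benchmark body, not code from this PR:

import tempfile
from pathlib import Path

def benchmark():
    # Use a throwaway directory by default; pass an explicit path only
    # when deliberately benchmarking a particular device.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / 'tub'
        run_benchmark(path)  # hypothetical helper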


def benchmark():
    # Change to a non SSD storage path
    path = Path('/media/rahulrav/Cruzer/benchmark')
Contributor

as above

Collaborator Author

Will fix.

@@ -158,8 +158,7 @@ def __init__(self, num_outputs=2, input_shape=(120, 160, 3), roi_crop=(0, 0), *a
         self.compile()

     def compile(self):
-        self.model.compile(optimizer=self.optimizer,
-                           loss='mse')
+        self.model.compile(optimizer=self.optimizer, loss='mse')
Contributor

Does this need to be in this PR?
Does it affect the new datastore format directly?

I believe it would fit better in a separate PR focused on optimizing the AI/ML part, added to the dev branch, especially if it is not related to the datastore itself. What do you think?

Collaborator Author

This is useful because it shows how to add a model to the new pipeline.

Contributor

OK, this is just a newline removal.

@@ -170,6 +169,23 @@ def run(self, img_arr):

+class KerasInferred(KerasPilot):
Contributor

as above

Collaborator Author

Same as above.

@@ -317,9 +333,9 @@ def default_categorical(input_shape=(120, 160, 3), roi_crop=(0, 0)):

 def default_n_linear(num_outputs, input_shape=(120, 160, 3), roi_crop=(0, 0)):

-    drop = 0.1
+    drop = 0.2
Contributor

From my perspective it is not clear what was changed, why, and what the impact is. Maybe adding a single-line comment would be sufficient?

Collaborator Author

This just increases the dropout in the linear model.

# [TensorRT] ERROR: Internal error: could not find any implementation for node 2-layer MLP, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
# [TensorRT] ERROR: ../builder/tacticOptimizer.cpp (1230) - OutOfMemory Error in computeCosts: 0
builder.max_workspace_size = 1 << 20 #common.GiB(1)

Contributor

Is the deleted comment no longer valid?

Collaborator Author

It was never technically correct. Defining the workspace size just had the intended side effect. It is also not a very useful comment.

    A datastore to store sensor data in a key/value format.
    Accepts str, int, float, image_array, image, and array data types.
    '''
    def __init__(self, base_path, inputs=[], types=[], metadata=[], max_catalog_len=1000):
Contributor

Could you please explain more about max_catalog_len?
I get a feeling it is some kind of limitation that people should be aware of, or am I wrong?

Collaborator Author

No need to worry about this at all. For reference, this is the maximum number of records per catalog file.
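
For illustration, constructing a tub with the signature shown in the diff above; the module path and the write_record method are assumptions based on the PR description, not verified against this branch:

import numpy as np
from donkeycar.parts.tub_v2 import Tub  # module path is an assumption

tub = Tub(base_path='data/tub_v2',
          inputs=['cam/image_array', 'user/angle', 'user/throttle'],
          types=['image_array', 'float', 'float'],
          max_catalog_len=1000)  # at most 1000 records per catalog file
image = np.zeros((120, 160, 3), dtype=np.uint8)
# After 1000 writes the tub rolls over to a new catalog file automatically.
tub.write_record({'cam/image_array': image,
                  'user/angle': 0.0,
                  'user/throttle': 0.2})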

contents[key] = name

# Private properties
contents['_timestamp'] = int((time.time() / 1000) * 1000)
Contributor

Can you please explain this?
Division and then multiplication?

Collaborator Author

time.time() returns a floating point timestamp. Dividing and multiplying by 1000 gives us the timestamp in milliseconds.

Contributor

Maybe time.monotonic() to avoid issues of clock skews on some systems?
(I've seen really strange things....)

Or time.monotonic_ns() which returns int but it requires python 3.7

Contributor

This is effectively just rounding, but because divisions are expensive it is not the best way to do it; round() would be faster here.
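
For the record, a short sketch of the alternatives discussed in this thread. Note that int((time.time() / 1000) * 1000) is mathematically the same as int(time.time()), i.e. whole seconds; a true millisecond timestamp needs only a multiplication:

import time

millis_wall = int(time.time() * 1000)       # wall-clock ms since the epoch
millis_mono = int(time.monotonic() * 1000)  # monotonic ms, immune to clock skew
# Python >= 3.7: time.monotonic_ns() // 1_000_000 avoids float arithmetic entirely.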

@@ -1,4 +1,6 @@
# -*- coding: utf-8 -*- #
Contributor

Well, this change looks weird :)

Collaborator Author

This is just an archived change.

Contributor

Just delete files that are not needed; they are in git history anyway. Otherwise they clutter the space and may even confuse people, especially if they are unused.

@@ -480,6 +480,12 @@ def get_model_by_type(model_type, cfg):
     elif model_type == "fastai":
         from donkeycar.parts.fastai import FastAiPilot
         kl = FastAiPilot()
+    elif model_type == "transfer":
Contributor

This looks as if it does not belong in this PR.
Make a new one?

Collaborator Author

It does, because it's a useful illustration of how to add new models to the new training pipeline.

Contributor

@nvtkaszpir left a comment

Note to self - try this one.

@nvtkaszpir
Contributor

One more thing: is it going to support Python 3.5?

@tikurahul force-pushed the autorope-dev branch 5 times, most recently from f7124b6 to edbc3e6 on March 7, 2020 16:43
@DocGarbanzo
Contributor

@tikurahul & @tawnkramer - what are the plans to get this change merged? Is there an issue w/ pathlib and python 3.5 in the tests?

@tikurahul
Collaborator Author

tikurahul commented Mar 17, 2020

Yes. Python 3.5 is still supported.

The change itself is ready. I plan to make a few small improvements, but those can happen as separate PRs.

@DocGarbanzo
Contributor

Well, it looked like there were issues on Travis with the 3.5 build. Would be great to have this change merged.

@tikurahul
Collaborator Author

Will add support for the remaining models, and submit one last pull request.

        return augmentation

    @classmethod
    def trapezoidal_mask(cls, lower_left, lower_right, upper_left, upper_right, min_y, max_y):
Contributor

Nice! When does this get applied, in training and in the autopilot? More specifically, I want to know if this alters data on the way into the tub, or just at training and inference time.
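
For readers unfamiliar with the transform, a rough sketch of what a trapezoidal mask does (not this PR's implementation, which builds an augmentation): pixels inside the trapezoid are kept and everything else is zeroed, typically to crop sky and car body out of the camera frame.

import numpy as np
import cv2

def apply_trapezoidal_mask(img, lower_left, lower_right, upper_right, upper_left):
    # Each corner is an (x, y) pair. Fill the trapezoid with white on a
    # black single-channel mask, then keep only the pixels under it.
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    corners = np.array([lower_left, lower_right, upper_right, upper_left],
                       dtype=np.int32)
    cv2.fillPoly(mask, [corners], 255)
    return cv2.bitwise_and(img, img, mask=mask)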

@tawnkramer
Contributor

tawnkramer commented Jun 25, 2020

Great work here, Rahul!

  • Not excited to see scikit-learn added again as a dependency. I worked to remove it because occasionally you were forced to build it, and like pandas, that was super painful. Also, we used it very little, and it prevented training from running on the Nano. Not the best idea to train on an SBC, but no reason to prevent it. Can we trim this? (edit: I see we just use train_test_split. We have our own version of that. Did you see a problem with it?)

  • Are the doc changes to build_hardware.md accidental? I don't think you want any of those.

  • I would like to call the tub conversion to the new format as part of the donkey update command. We can keep it as a standalone tool too.

  • Will start testing it out now. Those were just initial impressions.

@tawnkramer
Contributor

I was hoping to convert all my tubs in one command with

python ~/projects/donkeycar/scripts/convert_to_tub_v2.py --tub=./data/* --output=./data_conv

I'll do them one at a time...
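
Until the script grows glob support, a hedged workaround is to loop over the tubs yourself; the flags mirror the invocation above, and the per-tub output naming is an assumption:

import glob
import os
import subprocess

for tub in sorted(glob.glob('./data/*')):
    out = os.path.join('./data_conv', os.path.basename(tub))
    subprocess.run(['python', 'scripts/convert_to_tub_v2.py',
                    '--tub=' + tub, '--output=' + out], check=True)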

@tawnkramer
Contributor

I found that the converted data set has the same value of _timestamp for all records; it didn't retain the original value from the "milliseconds" field. I also don't see that _timestamp is a better name than "milliseconds": it loses clarity and invites questions about the scale.

@tawnkramer
Contributor

Previously I could train with all tubs like this:

python train.py  --model=./models/test.h5 --tub=./data_conv/*

or with no tub argument. Can we support these easy cases?

@tawnkramer
Contributor

donkey tubcheck : broken
donkey tubplot : confusing args "--model MODEL name of record to create histogram", doesn't take records to plot.
donkey makemovie : broken
donkey tubhist : broken
donkey tubaugment : broken

broken models: rnn, 3d, latent, behavior, localizer, tflite_linear

@tikurahul
Collaborator Author

Thanks for taking a thorough look.

A couple of the tools don't need to exist anymore. For example, tubaugment is quite restrictive in terms of the transformations it supports; the new pipeline enables much more.

tflite_linear was something I missed adding support for, but the others are intentionally absent; I did this mostly for convenience when making such a huge change. We can add them back as we see fit, as we migrate more things to TF 2.x.

@DocGarbanzo
Contributor

The idea of donkey tubaugment was to run it before training, as I found the augmentation costs some milliseconds per record, and when you train with 100k records this overhead adds to your epoch time quite a bit (I actually train with 500k records). I agree that it is not sophisticated; it just applies the same code we have been using in-training so far.

@tikurahul
Collaborator Author

Closing this PR. Will open against the 4.x branch.

@tikurahul closed this Sep 12, 2020