Add a new datastore format. #532
Conversation
Force-pushed cb9c1f0 to 04e29ca
Force-pushed b4dcb2b to 2c4193c
Hi Rahul,

Thanks for testing it out. The huge gains are when you use the new datastore to collect data for training. Writes are between 5-10x faster, which makes a huge difference if you want to push faster frame rates. We stopped using Pandas some time ago; we were using a modified pipeline. This training script has multiprocessing disabled, but you can get huge benefits by turning it on. The underlying store and sequences are multiprocess safe. Also, it's now super easy to add preprocessing steps without affecting the rest of the pipeline, which is very important for building stable models.

Also, Pandas is painfully slow. We tried making those improvements before, but that's why we ended up with the modified pipeline that Tawn built. So

Cool, thanks for explaining. I should try the recording, too. To enable multiprocessing in training I set

Yes, that is correct.
Force-pushed 2c4193c to 20fb5a8
```python
def benchmark():
    # Change with a non SSD storage path
    path = Path('/media/rahulrav/Cruzer/tub')
```
please try not to hardcode paths :)
Will fix.
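One way to avoid the hardcoded path is to take it from the command line. A minimal sketch, assuming a `--path` flag and default location that are not part of the PR:

```python
import argparse
from pathlib import Path

def benchmark(path: Path) -> None:
    # Placeholder body: the real benchmark only needs a Path to write to.
    print(f'benchmarking writes under {path}')

def parse_path(argv=None) -> Path:
    # Hypothetical flag name and default, for illustration only.
    parser = argparse.ArgumentParser()
    parser.add_argument('--path', type=Path, default=Path('/tmp/tub'),
                        help='storage path to benchmark (ideally non-SSD)')
    return parser.parse_args(argv).path

if __name__ == '__main__':
    benchmark(parse_path())
```

This keeps the benchmark runnable on any machine without editing the source.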
```python
def benchmark():
    # Change to a non SSD storage path
    path = Path('/media/rahulrav/Cruzer/benchmark')
```
as above
Will fix.
```diff
@@ -158,8 +158,7 @@ def __init__(self, num_outputs=2, input_shape=(120, 160, 3), roi_crop=(0, 0), *a
         self.compile()

     def compile(self):
-        self.model.compile(optimizer=self.optimizer,
-                           loss='mse')
+        self.model.compile(optimizer=self.optimizer, loss='mse')
```
Does this need to be in this PR? Does it affect the new datastore format directly? I believe it fits better as a new PR focused on optimizing the AI/ML part, so maybe it is better to add it to the dev branch as a separate PR, especially if it is not related to the datastore itself. What do you think?
This is useful because it shows how to add a model to the new pipeline.
OK, this is just a newline removal.
```diff
@@ -170,6 +169,23 @@ def run(self, img_arr):


 class KerasInferred(KerasPilot):
```
as above
Same as above.
```diff
@@ -317,9 +333,9 @@ def default_categorical(input_shape=(120, 160, 3), roi_crop=(0, 0)):

 def default_n_linear(num_outputs, input_shape=(120, 160, 3), roi_crop=(0, 0)):

-    drop = 0.1
+    drop = 0.2
```
From my perspective it is not clear why this value was changed or what the impact is. Maybe adding a single-line comment would be sufficient?
This just increases the dropout in the linear model.
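For context, dropout zeroes a random fraction of activations during training, so raising `drop` from 0.1 to 0.2 roughly doubles the regularization strength. A framework-free sketch of inverted dropout (illustrative only; the PR itself uses Keras layers):

```python
import random

def inverted_dropout(values, rate, rng):
    # Each activation survives with probability (1 - rate); survivors are
    # scaled by 1 / (1 - rate) so the expected activation is unchanged.
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

With `rate=0.2`, about 20% of units are zeroed on each forward pass and the survivors are scaled by 1.25.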
```python
# [TensorRT] ERROR: Internal error: could not find any implementation for node 2-layer MLP, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
# [TensorRT] ERROR: ../builder/tacticOptimizer.cpp (1230) - OutOfMemory Error in computeCosts: 0
builder.max_workspace_size = 1 << 20  # common.GiB(1)
```
Is the deleted comment no longer valid?
It was never technically correct. Defining the workspace size just had the intended side effect. It is also not a very useful comment.
```python
'''
A datastore to store sensor data in a key, value format.
Accepts str, int, float, image_array, image, and array data types.
'''
def __init__(self, base_path, inputs=[], types=[], metadata=[], max_catalog_len=1000):
```
Could you please explain more about max_catalog_len? I get a feeling it is some kind of limitation that people should be aware of, or am I wrong?
No need to worry about this at all. For reference, this is the maximum number of records per catalog file.
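To illustrate the rollover behavior, here is a minimal sketch of a rolling newline-delimited JSON store. The class and file names are made up for illustration; this is not the actual Tub v2 API:

```python
import json
from pathlib import Path

class CatalogWriter:
    '''Sketch: append records to catalog files that roll over to a new
    file once max_catalog_len records have been written to the current one.'''

    def __init__(self, base_path, max_catalog_len=1000):
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)
        self.max_catalog_len = max_catalog_len
        self.count = 0

    def write_record(self, record):
        # Integer division picks which catalog file this record belongs to.
        catalog_index = self.count // self.max_catalog_len
        catalog = self.base_path / f'catalog_{catalog_index}.catalog'
        with catalog.open('a') as f:
            f.write(json.dumps(record) + '\n')
        self.count += 1
```

So with the default of 1000, record 0 through 999 land in the first catalog file, record 1000 starts the second, and so on.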
donkeycar/parts/tub_v2.py
Outdated
```python
contents[key] = name

# Private properties
contents['_timestamp'] = int((time.time() / 1000) * 1000)
```
Can you please explain this?
Division and then multiplication?
`time.time()` returns a floating point timestamp. Dividing and multiplying by 1000 gives us the timestamp in milliseconds.
Maybe `time.monotonic()` to avoid issues of clock skews on some systems? (I've seen really strange things....) Or `time.monotonic_ns()`, which returns an int but requires Python 3.7.
This is effectively just a rounding, but because divisions are expensive it's not the best way to do it; `round` would be faster here.
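To make the discussion above concrete (a sketch; only the expression itself comes from the diff): dividing by 1000 and then multiplying by 1000 cancel out, so the `int()` conversion truncates to whole seconds rather than producing milliseconds.

```python
import time

t = time.time()  # floating-point seconds since the epoch

# The expression from the diff: /1000 and *1000 cancel, so int() just
# truncates the float to whole seconds -- the value is not in milliseconds.
truncated = int((t / 1000) * 1000)

# A millisecond timestamp scales up *before* converting to int; round()
# also avoids off-by-one truncation from the float representation.
millis = round(t * 1000)
```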
donkeycar/tests/_test_train.py
Outdated
```diff
@@ -1,4 +1,6 @@
 # -*- coding: utf-8 -*-
```
Well, this change looks weird :)
This is just an archived change.
Just delete files that aren't needed; they are in the git history anyway. Otherwise they just clutter the space and may even confuse people, especially if they are not used.
```diff
@@ -480,6 +480,12 @@ def get_model_by_type(model_type, cfg):
     elif model_type == "fastai":
         from donkeycar.parts.fastai import FastAiPilot
         kl = FastAiPilot()
+    elif model_type == "transfer":
```
This looks as if it does not belong to this PR. Make a new one?
It does, because it's a useful illustration of how to add new models to the new training pipeline.
Note to self: try this one. One more thing, is it going to support Python 3.5?
Force-pushed 20fb5a8 to 41c3a24
Force-pushed f7124b6 to edbc3e6
@tikurahul & @tawnkramer - what are the plans to get this change merged? Is there an issue with pathlib and Python 3.5 in the tests?
Yes. Python 3.5 is still supported. The change itself is ready. I plan to make a few small improvements, but those can happen as separate PRs.
Well, it looked like there were issues on Travis with the 3.5 build. It would be great to have this change merged.
Will add support for the remaining models, and submit one last pull request.
```python
    return augmentation

@classmethod
def trapezoidal_mask(cls, lower_left, lower_right, upper_left, upper_right, min_y, max_y):
```
Nice! When does this get applied, in training and autopilot? More specifically, I want to know if this alters data on the way into the tub, or just at training and inference time.

Great work here Rahul!
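For intuition, a trapezoidal mask keeps only a road-shaped region of the image and zeroes the rest. A numpy-only sketch; the parameter semantics here are my assumptions from the signature above, not the PR's actual implementation:

```python
import numpy as np

def apply_trapezoidal_mask(img, lower_left, lower_right,
                           upper_left, upper_right, min_y, max_y):
    '''Keep pixels inside a trapezoid, zero everything else.
    Coordinates are x-columns of the four corners; min_y/max_y bound the rows.'''
    h = img.shape[0]
    masked = np.zeros_like(img)
    for y in range(max(min_y, 0), min(max_y, h)):
        # Linearly interpolate the left/right x-bounds between the top
        # (upper_*) and bottom (lower_*) edges of the trapezoid.
        t = (y - min_y) / max(max_y - min_y, 1)
        left = int(upper_left + t * (lower_left - upper_left))
        right = int(upper_right + t * (lower_right - upper_right))
        masked[y, left:right] = img[y, left:right]
    return masked
```

Whether this runs before records hit the tub or only as a training/inference augmentation is exactly the question asked above.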
Was hoping to convert all my tubs in one command with

will do them one at a time...
I found that the converted data set has the same value for _timestamp for all records, and it didn't retain the original value from the "milliseconds" field. I don't see that _timestamp is a better name than "milliseconds"; it loses clarity and invites questions about scale.
Previously I could train with all tubs like this:

or with no tub argument. Can we support these easy cases?
`donkey tubcheck`: broken. Broken models: rnn, 3d, latent, behavior, localizer, tflite_linear.
Thanks for taking a thorough look. A couple of tools don't need to exist anymore, e.g.

The idea of
* This is an implementation of a completely new datastore for Donkey. Rather than have a single file for every record, this datastore stores all records in a Seekable file, with O(1) time access. Images are still stored separately.
* This format uses newline delimited json, and stores a configurable number of records per file. It has high level APIs which make it easy to consume records consistently.
* I also ported over tubclean, tubplot and provide a script convert_to_tub_v2.py to move from the old datastore to the new datastore format.
* I ported over the Linear and the Inferred models so far, but adding support for other models should be very trivial.
* I also removed a lot of dead code which should make things a lot easier to understand going forward. For now the old train.py is archived, as we will port over more things from the old training script.
* The new training pipeline also supports arbitrary pre-processing steps with imgaug, which is now a new dependency.
* Added benchmarks for the new datastore.
* Added a new experimental KerasInferred model which learns steering, but not throttle. We then infer throttle based on an inverted exponential curve.
Force-pushed a6f6267 to 0b661ed
Closing this PR. Will open against the 4.x branch.
This is an implementation of a completely new datastore for Donkey. Rather than have a single file for every record, this datastore stores all records in a `Seekable` file, with O(1) time access. Images are still stored separately.

This format uses newline delimited json, and stores a configurable number of records per file. It has high level APIs which make it easy to consume records consistently.

I also ported over `tubclean`, `tubplot` and provide a script `convert_to_tub_v2.py` to move from the old datastore to the new datastore format.

I ported over the `Linear` and the `Inferred` models so far, but adding support for other models should be very trivial.

I also removed a lot of dead code which should make things a lot easier to understand going forward.

For now the old `train.py` is archived, as we will port over more things from the old training script.

The new training pipeline also supports arbitrary pre-processing steps with `imgaug`, which is now a new dependency.

Added benchmarks for the new datastore.

Added a new experimental `KerasInferred` model which learns steering, but not throttle. We then infer throttle based on an inverted exponential curve.

Added lots of unit tests.
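The O(1) record access described above can be sketched with a byte-offset index over a newline-delimited JSON file. This is illustrative only, assuming made-up names, and is not the PR's actual `Seekable` class:

```python
import json

class SeekableNdjson:
    '''Sketch: remember the byte offset of each appended record so any
    record can be read back with a single seek plus one line read.'''

    def __init__(self, path):
        self.path = path
        self.offsets = []  # byte offset of record i in the file

    def append(self, record):
        line = (json.dumps(record) + '\n').encode('utf-8')
        with open(self.path, 'ab') as f:
            self.offsets.append(f.tell())  # 'ab' positions at end of file
            f.write(line)

    def read(self, index):
        # O(1): seek straight to the stored offset and read one line.
        with open(self.path, 'rb') as f:
            f.seek(self.offsets[index])
            return json.loads(f.readline())
```

Because every record is a single line, sequential consumers can also stream the file line by line without touching the index.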