[REEF-1339] Adding IInputPartition.Cache() for data download and cache #968
Conversation
This addressed the issue by * Adding an unstable Cache function. * Modify existing implementations of IInputPartition to allow them to cache only on invocation of Cache instead of on initialization. JIRA: [REEF-1339](https://issues.apache.org/jira/browse/REEF-1339)
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe move this into a method of its own? e.g. Download
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per discussion below, for now, we will simply assume that Cache()
in this class assumes a disk level cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was more concerned about the relative length of this method. The if/else
would be the only thing in this method, and the actual download logic would be in the helper methods:
if (!_localFiles.IsPresent()){
Download();
}
LGTM overall, with the one comment. |
I will have a look. |
|
||
if (!copyToLocal) | ||
{ | ||
// Implies that the files are already local. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copyToLocal means download from remote to local before reading the data.
!copyToLocal means read data from file (no matter files are on local or remote). It doesn't mean the file is already in local.
In our use case, the files are always remote.
For local test, the files can be at local, then copyToLocal is redundant, but still work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, then why is the code structured like this in the previous commit?
if (_copyToLocal)
{
if (!_isInitialized)
{
Initialize();
}
return _fileSerializer.Deserialize(_localFiles);
}
return _fileSerializer.Deserialize(_filePaths);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jwang98052 The previous code seems to imply that if not copy to local, then the files are already local.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In original code, files are provided in _filePaths, usually remote files.
If the user what to download files to local first, and if the files are not downloaded yet, then download (Initialize()), then Deserialize from the local downloaded files.
Else, meaning if user doesn't choose to copyToLocal, then Deserialize from the original files directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, and where are these "original files?" None of our IFileDeSerializer
implementations provides the ability to deserialize remote files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afchung @jwang98052 my understanding is what andrew is saying. Even the code does that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have IFileDeSerializer implementation that read from remote file on the fly. Clients can choose to do it. The code in REEF doesn't show that sample, that's true.
@afchung LGTM. I am wondering if Cache() should be added in DataLoadingContext in IMRU as part of this or later... Plus it would be good to check IMRU and make sure it did not break anything. Once you are done with all reviews, ping me to check IMRU on cluster and I will. You can do it also :) if you want. |
We definitely want to test any changes in those classes in both local and Yarn. I have HadoopFileInputPartitionTest, you might want to update for Yarn testing. When test it with IMRU, we need to set copytolocal both true and false to test it. |
private readonly object _lock = new object(); | ||
private readonly ITempFileCreator _tempFileCreator; | ||
private readonly ISet<string> _localFiles = new HashSet<string>(); | ||
private readonly ISet<string> _filePaths; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to _remoteFilePaths
or such to make clear that these are remote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In localRuntime or testing, it can be local files as well. In real use cases, yes, it represents remote files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be explicit on what this is for. If this is only for remote files, we should make it so. For testing, we should fake the IFileSerializer
to fetch from local instead of using an if
branch the way we do right now that obfuscates the intention of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it would be clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, for this PR, I will assume the following behavior:
IFileDeSerializer
can be remote OR local._copyToLocal
implies caching on access and thatIFileDeSerializer
is local. If_copyToLocal
is false, we assume thatIFileDeSerializer
is remote.
We can clarify what IFileDeserializer
is and what exactly _copyToLocal
means in a separate JIRA item.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened REEF-1365 for discussion.
@jwang98052 @markusweimer @dkm2110 I've addressed your comments and opened REEF-1339 for discussion, please have another look. Thanks! |
// For now, assume IFileDeSerializer is local. | ||
return _fileSerializer.Deserialize(_localFiles.Value); | ||
} | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need else.
LGTM with one minor comment. |
@jwang98052 Addressed your comment. |
LGTM |
I'll do a pass. |
{ | ||
/// <summary> | ||
/// The id of the partition. | ||
/// </summary> | ||
string Id { get; } | ||
|
||
/// <summary> | ||
/// Caches the data locally, cached location is based on the implementation. | ||
/// </summary> | ||
[Unstable("0.14", "Contract may change.")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contract will change
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why will
? It won't necessarily change.
LGTM, I left some comments we may or may not address. Let me know whether you want to, @afchung. Otherwise, I will test and merge. |
@markusweimer addressed your comments, please have another look. Thanks! |
LGTM, will test and merge. |
This addressed the issue by
JIRA:
REEF-1339