Skip to content
This repository has been archived by the owner on Aug 4, 2022. It is now read-only.

[REEF-1339] Adding IInputPartition.Cache() for data download and cache #968

Closed
wants to merge 4 commits into from

Conversation

afchung
Copy link
Contributor

@afchung afchung commented Apr 25, 2016

This addressed the issue by

  • Adding an unstable Cache function.
  • Modify existing implementations of IInputPartition to allow them to cache only on invocation of Cache instead of on initialization.

JIRA:
REEF-1339

This addressed the issue by
  * Adding an unstable Cache function.
  * Modify existing implementations of IInputPartition to allow them to cache only on invocation of Cache instead of on initialization.

JIRA:
  [REEF-1339](https://issues.apache.org/jira/browse/REEF-1339)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this into a method of its own? e.g. Download?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per discussion below, for now, we will simply assume that Cache() in this class assumes a disk level cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more concerned about the relative length of this method. The if/else would be the only thing in this method, and the actual download logic would be in the helper methods:

 if (!_localFiles.IsPresent()){ 
  Download();
}

@markusweimer
Copy link
Contributor

LGTM overall, with the one comment.

@jwang98052
Copy link
Contributor

I will have a look.


if (!copyToLocal)
{
// Implies that the files are already local.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyToLocal means download from remote to local before reading the data.
!copyToLocal means read data from file (no matter files are on local or remote). It doesn't mean the file is already in local.

In our use case, the files are always remote.

For local test, the files can be at local, then copyToLocal is redundant, but still work.

Copy link
Contributor Author

@afchung afchung Apr 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, then why is the code structured like this in the previous commit?

if (_copyToLocal)
{       
    if (!_isInitialized)        
    {       
        Initialize();       
    }       

    return _fileSerializer.Deserialize(_localFiles);        
}       
return _fileSerializer.Deserialize(_filePaths);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwang98052 The previous code seems to imply that if not copy to local, then the files are already local.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In original code, files are provided in _filePaths, usually remote files.

If the user what to download files to local first, and if the files are not downloaded yet, then download (Initialize()), then Deserialize from the local downloaded files.

Else, meaning if user doesn't choose to copyToLocal, then Deserialize from the original files directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, and where are these "original files?" None of our IFileDeSerializer implementations provides the ability to deserialize remote files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afchung @jwang98052 my understanding is what andrew is saying. Even the code does that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have IFileDeSerializer implementation that read from remote file on the fly. Clients can choose to do it. The code in REEF doesn't show that sample, that's true.

@dkm2110
Copy link
Contributor

dkm2110 commented Apr 25, 2016

@afchung LGTM. I am wondering if Cache() should be added in DataLoadingContext in IMRU as part of this or later... Plus it would be good to check IMRU and make sure it did not break anything.

Once you are done with all reviews, ping me to check IMRU on cluster and I will. You can do it also :) if you want.

@jwang98052
Copy link
Contributor

jwang98052 commented Apr 26, 2016

We definitely want to test any changes in those classes in both local and Yarn. I have HadoopFileInputPartitionTest, you might want to update for Yarn testing.

When test it with IMRU, we need to set copytolocal both true and false to test it.

private readonly object _lock = new object();
private readonly ITempFileCreator _tempFileCreator;
private readonly ISet<string> _localFiles = new HashSet<string>();
private readonly ISet<string> _filePaths;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to _remoteFilePaths or such to make clear that these are remote.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In localRuntime or testing, it can be local files as well. In real use cases, yes, it represents remote files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be explicit on what this is for. If this is only for remote files, we should make it so. For testing, we should fake the IFileSerializer to fetch from local instead of using an if branch the way we do right now that obfuscates the intention of the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it would be clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, for this PR, I will assume the following behavior:

  1. IFileDeSerializer can be remote OR local.
  2. _copyToLocal implies caching on access and that IFileDeSerializer is local. If _copyToLocal is false, we assume that IFileDeSerializer is remote.

We can clarify what IFileDeserializer is and what exactly _copyToLocal means in a separate JIRA item.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened REEF-1365 for discussion.

@afchung
Copy link
Contributor Author

afchung commented Apr 26, 2016

@jwang98052 @markusweimer @dkm2110 I've addressed your comments and opened REEF-1339 for discussion, please have another look. Thanks!

// For now, assume IFileDeSerializer is local.
return _fileSerializer.Deserialize(_localFiles.Value);
}
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need else.

@jwang98052
Copy link
Contributor

LGTM with one minor comment.

@afchung
Copy link
Contributor Author

afchung commented Apr 27, 2016

@jwang98052 Addressed your comment.

@jwang98052
Copy link
Contributor

LGTM

@markusweimer
Copy link
Contributor

I'll do a pass.

{
/// <summary>
/// The id of the partition.
/// </summary>
string Id { get; }

/// <summary>
/// Caches the data locally, cached location is based on the implementation.
/// </summary>
[Unstable("0.14", "Contract may change.")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contract will change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why will? It won't necessarily change.

@markusweimer
Copy link
Contributor

LGTM, I left some comments we may or may not address. Let me know whether you want to, @afchung. Otherwise, I will test and merge.

@afchung
Copy link
Contributor Author

afchung commented Apr 28, 2016

@markusweimer addressed your comments, please have another look. Thanks!

@markusweimer
Copy link
Contributor

LGTM, will test and merge.

@asfgit asfgit closed this in d34441c Apr 28, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants