[REEF-2021]In AzureBatch Runtime, Enable REEF .NET Client Communication to Driver #1468
Conversation
@tcondie This is the approach using Azure Batch InboundNATPool to enable the communication. More Info in Jira. |
Rebased master branch csproj file change to resolve conflicts. |
Ping if someone can review. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I posted comments and suggestions.
using System.Linq; | ||
using Microsoft.Azure.Batch; | ||
using Microsoft.Azure.Batch.Common; | ||
#if REEF_DOTNET_BUILD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This #if can be removed now - it was used to support building both old and new projects. Ill create a JIRA to go through the code and remove as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -45,6 +47,12 @@ public sealed class AzureBatchDotNetClient : IREEFClient | |||
private readonly AzureBatchService _batchService; | |||
private readonly JobJarMaker _jobJarMaker; | |||
private readonly AzureBatchFileNames _azbatchFileNames; | |||
private readonly string _azureBatchAccountName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @homezcx
This set of data (batch account name, pool id, etc) gets repeated. This is currently defined in the AzureBatchClientConfiguration -- can that configuration be re-used at all in these cases as a structure for managing this data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely we can. I am kind not sure which one is better practice. Having a wrapper only contains getters with no logic, or just like this. Since we are using injection here, no pain to call new Instance(param1, param2, .....)
Let's wait for another vote maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. I will fix this by implementing helper methods in AzureBatchService class which holds the credentials. So we don't need pass them around.
} | ||
catch (BatchException e) | ||
{ | ||
throw new InvalidOperationException("driver http endpoint file is not ready.", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really an invalid operation or is it a timing issue? Like maybe we checked too soon for the driver file? If its a timing issue, would waiting a retry help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a timing issue and will be retried according to line 66-67. To trigger the retry policy, an exception must be thrown. Any return value including "null" will be considered as successful. Retrhow the exception for logging purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification!
else | ||
{ | ||
LOGGER.Log(Level.Warning, "unable to get driver http endpoint. The format in remote file is not correct."); | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This I feel like should throw because an empty file, which should not happen, is not recoverable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. It is not recoverable but we need this to warn people in future if they changed file DriverHttpEndpoint.txt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth adding comments that this is called from the RetryPolicy? Like a comment specifying we are returning null to exit the RetryPolicy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add it.
} | ||
|
||
//// Get port | ||
string[] driverIpAndPorts = driverHost.Split(':'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would combine this with the code above since its all related to retrieving the ip and port from the string. Then throw invalid operation if that fails to parse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is good to separate recoverable exception (by throwing InvalidOperationException) and unrecoverable exception (by returning null).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok - makes sense, I wasn't aware of having to return null when I reviewed.
get { return _driverUrl; } | ||
get | ||
{ | ||
if (_driverUrl != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit - this could be simplified to if (_driverUrl == null){} return _driverUrl;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
while (status.IsActive()) | ||
{ | ||
// Add sleep in while loop, whose value alligns with default heart beat interval. | ||
Task.Delay(TimeSpan.FromSeconds(DriverStatusIntervalInSecond)).GetAwaiter().GetResult(); | ||
LOGGER.Log(Level.Info, "DriverStatus is " + status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The task delay is an empty task right? Why not use Thread.Sleep()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had offline discussion with @singlis . The idea was that I thought using Task.Delay will hold the execution without locking the thread. But actually there is no difference with calling Thread.Sleep() and Task.Delay().GetAwaiter().GetResult().
Changed it back to Thread.Sleep()
@@ -19,6 +19,7 @@ | |||
using System.Net.Http; | |||
using System.Threading; | |||
using System.Threading.Tasks; | |||
using Org.Apache.REEF.Client.Common; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this using needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. AllErrorsTransientStrategy is moved to Common, as it was not designed for YARN.
@@ -123,6 +124,11 @@ private static IConfiguration GetRuntimeConfiguration(string name) | |||
.Set(AzureBatchRuntimeClientConfiguration.AzureStorageAccountKey, @"##########################################") | |||
.Set(AzureBatchRuntimeClientConfiguration.AzureStorageAccountName, @"############") | |||
.Set(AzureBatchRuntimeClientConfiguration.AzureStorageContainerName, @"###########") | |||
//// Extend default retry interval in Azure Batch | |||
.Set(AzureBatchRuntimeClientConfiguration.DriverHTTPConnectionRetryInterval, "20000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about driver connection attempts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default values works for our needs. So in this example we don't need to alter the value. Also, when using List, it gives a cannot convert error at compiling.
networkConfiguration = client.poolOperations().getPool(poolId).networkConfiguration(); | ||
} catch (IOException e) { | ||
LOG.log(Level.WARNING, "Unable to setup Http Server with InBoundNATPool Port", e); | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not return null here - this is requiring that the code calling this api is checking for null. Looking at the code above that references this function, it looks like it does not and would fail with a null ref. I think this code should throw and be captured appropriately so that you can identify what the real error is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same goes for other cases where null is returned in this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. I fixed this by returning empty set.
In this case, Driver Http server will fall back to use RangeTcpPortProvider.
Thanks, this looks good to me. @markusweimer or @motus, can you please take a second look? |
@markusweimer or @motus Could you help review this PR as well? |
Sure, I will look at it now. Meanwhile, @homezcx, could you please resolve the conflicts? Thank you! |
@motus Thanks! I will work on resolving the conflicts. |
…on to Driver This approach enables REEF.NET Client-Driver communication by using InBoundNATPool features provided by Azure Batch Pool. More Info: https://docs.microsoft.com/en-us/rest/api/batchservice/pool/add#inboundnatpool JIRA: [REEF-2021](https://issues.apache.org/jira/browse/REEF-2021) Pull Request: This closes #
folder. Process the bridge jar as other runtime to avoid IOException caused by duplicate copy of bridge jar to global folder.
1. Fix potential null pointer excpetion when there is no InboundNATPool set. 2. Code cleanup
AzureBatchService to perform Azure Batch operations.
Node operation is only suppoted using AzureBatch key.
@motus Rebased and tested. |
namespace Org.Apache.REEF.Client.AzureBatch.Parameters | ||
{ | ||
[NamedParameter(Documentation = "The Azure Batch Pool Driver Http Server Ports List")] | ||
public sealed class AzureBatchPoolDriverPortsList : Name<IList<string>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string [](start = 67, length = 6)
Encountered this when looking at my related changes. Why this is not IList instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the example in TestListInjection.java, to bind a List , it requires to use List
In reply to: 195608449 [](ancestors = 195608449)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, it seems binding List in ConfigurationModule is not supported.
In reply to: 195804156 [](ancestors = 195804156,195608449)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few nit picks
result.add(sequence.toString()); | ||
} | ||
return result; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use final
; preallocate the result array; the signature can be more generic. I would write something like
private static List<String> asStringList(final Collection<? extends CharSequence> list) {
final List<String> result = new ArrayList<>(list.size());
for (final CharSequence sequence : list) {
result.add(sequence.toString());
}
return result;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@Inject | ||
public ListTcpPortProvider(@Parameter(TcpPortList.class) final List<Integer> tcpPortList) { | ||
this.tcpPortList = tcpPortList; | ||
LOG.log(Level.FINE, "Instantiating " + this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use string interpolation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
protected override string GetDriverUrl(string filepath) | ||
{ | ||
var policy = new RetryPolicy<AllErrorsTransientStrategy>(_numberOfRetries, TimeSpan.FromMilliseconds(_retryInterval)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
policy
can be a member variable, initialized in constructor. then you won't need _numberOfRetries
and _retryInterval
members
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
LOGGER.Log(Level.Warning, "unable to get driver http endpoint. The format in remote file is not correct."); | ||
//// Returns null to exit retry policy since it is not recoverable. | ||
return null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, http endpoint file may not necessarily have a terminating \n
. I would rather write
string driverHost = httpEndPointFile.ReadAsString().TrimEnd('\r', '\n', ' ');
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is \n
accourding to
https://github.com/apache/reef/blob/master/lang/java/reef-bridge-java/src/main/java/org/apache/reef/javabridge/generic/JobDriver.java#L193
But it is no harm to make the change.
LOGGER.Log(Level.Warning, "unable to get driver http endpoint port. The format in remote file is not correct."); | ||
//// Returns null to exit retry policy since it is not recoverable. | ||
return null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to parse the port as int
right away, i.e. do something like
string[] driverIpAndPorts = driverHost.Split(':');
if (driverIpAndPorts.Length <= 1 || !int.TryParse(driverIpAndPorts[1], out int backendPort))
{
LOGGER.Log(Level.Warning, "Unable to get driver http endpoint port from: {0}", driverHost);
return null;
}
// ...
InboundEndpoint endpoint = inboundEndpoints.FirstOrDefault(s => s.BackendPort == backendPort);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
} | ||
|
||
return "http://" + publicIp + ':' + frontEndPort + '/'; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, who writes the content of the httpEndPointFile
? If it is us, can we write a proper URL to it instead, so that the string can be parsed with just
Uri driverUrl = new Uri(httpEndPointFile.ReadAsString());
We can also return a Uri
object instead of a string from that method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is JavaBridge writes the HttpEndPointFile. See:
https://github.com/apache/reef/blob/master/lang/java/reef-bridge-java/src/main/java/org/apache/reef/javabridge/generic/JobDriver.java#L193
The string format ip + port in this file is used by other runtime as well, YARN, local. We can change it but it seems to be out of scope of this PR. Shall we make it separate ticket?
while (status.IsActive()) | ||
{ | ||
// Add sleep in while loop, whose value alligns with default heart beat interval. | ||
Thread.Sleep(DriverStatusIntervalInMilliSecond); | ||
LOGGER.Log(Level.Info, "DriverStatus is " + status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use string interpolation
@@ -85,7 +94,15 @@ internal JobSubmissionResult(IREEFClient reefClient, string filePath, int number | |||
/// </summary> | |||
public string DriverUrl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above: maybe return Uri
instead of a string URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above: The current string format is "10.0.0.4:1234". it requires to change the string in HttpEndPointFile to a valid URI format, which could affect other runtimes. See if we want to create another ticket for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Will test locally and merge
|
||
try | ||
{ | ||
status = FetchDriverStatus(); | ||
} | ||
catch (WebException e) | ||
catch (WebException) | ||
{ | ||
// If we no longer can reach the Driver, it must have exited. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still log the exception here, say
LOGGER.Log(Level.Warning, "Driver unreacheable. Exiting now.", e);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
This approach enables REEF.NET Client-Driver communication by using InBoundNATPool features provided by Azure Batch Pool. More Info: https://docs.microsoft.com/en-us/rest/api/batchservice/pool/add#inboundnatpool
JIRA:
REEF-2021
Pull Request:
This closes #1468