Skip to content
This repository has been archived by the owner on Aug 4, 2022. It is now read-only.

[REEF-2021]In AzureBatch Runtime, Enable REEF .NET Client Communication to Driver #1468

Merged
merged 10 commits into from Jun 19, 2018
Merged

[REEF-2021]In AzureBatch Runtime, Enable REEF .NET Client Communication to Driver #1468

merged 10 commits into from Jun 19, 2018

Conversation

homezcx
Copy link
Contributor

@homezcx homezcx commented May 29, 2018

This approach enables REEF.NET Client-Driver communication by using InBoundNATPool features provided by Azure Batch Pool. More Info: https://docs.microsoft.com/en-us/rest/api/batchservice/pool/add#inboundnatpool

JIRA:
REEF-2021

Pull Request:
This closes #1468

@homezcx
Copy link
Contributor Author

homezcx commented May 29, 2018

@tcondie This is the approach using Azure Batch InboundNATPool to enable the communication. More Info in Jira.

@homezcx homezcx closed this May 30, 2018
@homezcx homezcx deleted the features/inboundNatPool branch May 30, 2018 22:29
@homezcx homezcx restored the features/inboundNatPool branch May 30, 2018 22:29
@homezcx homezcx reopened this May 30, 2018
@homezcx
Copy link
Contributor Author

homezcx commented May 30, 2018

Rebased master branch csproj file change to resolve conflicts.

@homezcx
Copy link
Contributor Author

homezcx commented Jun 4, 2018

Ping if someone can review. Thanks.

Copy link
Contributor

@singlis singlis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I posted comments and suggestions.

using System.Linq;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Common;
#if REEF_DOTNET_BUILD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This #if can be removed now - it was used to support building both old and new projects. Ill create a JIRA to go through the code and remove as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -45,6 +47,12 @@ public sealed class AzureBatchDotNetClient : IREEFClient
private readonly AzureBatchService _batchService;
private readonly JobJarMaker _jobJarMaker;
private readonly AzureBatchFileNames _azbatchFileNames;
private readonly string _azureBatchAccountName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @homezcx
This set of data (batch account name, pool id, etc) gets repeated. This is currently defined in the AzureBatchClientConfiguration -- can that configuration be re-used at all in these cases as a structure for managing this data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely we can. I am kind not sure which one is better practice. Having a wrapper only contains getters with no logic, or just like this. Since we are using injection here, no pain to call new Instance(param1, param2, .....)

Let's wait for another vote maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. I will fix this by implementing helper methods in AzureBatchService class which holds the credentials. So we don't need pass them around.

}
catch (BatchException e)
{
throw new InvalidOperationException("driver http endpoint file is not ready.", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really an invalid operation or is it a timing issue? Like maybe we checked too soon for the driver file? If its a timing issue, would waiting a retry help?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a timing issue and will be retried according to line 66-67. To trigger the retry policy, an exception must be thrown. Any return value including "null" will be considered as successful. Retrhow the exception for logging purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!

else
{
LOGGER.Log(Level.Warning, "unable to get driver http endpoint. The format in remote file is not correct.");
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I feel like should throw because an empty file, which should not happen, is not recoverable.

Copy link
Contributor Author

@homezcx homezcx Jun 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. It is not recoverable but we need this to warn people in future if they changed file DriverHttpEndpoint.txt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth adding comments that this is called from the RetryPolicy? Like a comment specifying we are returning null to exit the RetryPolicy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add it.

}

//// Get port
string[] driverIpAndPorts = driverHost.Split(':');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would combine this with the code above since its all related to retrieving the ip and port from the string. Then throw invalid operation if that fails to parse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is good to separate recoverable exception (by throwing InvalidOperationException) and unrecoverable exception (by returning null).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - makes sense, I wasn't aware of having to return null when I reviewed.

get { return _driverUrl; }
get
{
if (_driverUrl != null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit - this could be simplified to if (_driverUrl == null){} return _driverUrl;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

while (status.IsActive())
{
// Add sleep in while loop, whose value alligns with default heart beat interval.
Task.Delay(TimeSpan.FromSeconds(DriverStatusIntervalInSecond)).GetAwaiter().GetResult();
LOGGER.Log(Level.Info, "DriverStatus is " + status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task delay is an empty task right? Why not use Thread.Sleep()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had offline discussion with @singlis . The idea was that I thought using Task.Delay will hold the execution without locking the thread. But actually there is no difference with calling Thread.Sleep() and Task.Delay().GetAwaiter().GetResult().

Changed it back to Thread.Sleep()

@@ -19,6 +19,7 @@
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Org.Apache.REEF.Client.Common;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this using needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. AllErrorsTransientStrategy is moved to Common, as it was not designed for YARN.

@@ -123,6 +124,11 @@ private static IConfiguration GetRuntimeConfiguration(string name)
.Set(AzureBatchRuntimeClientConfiguration.AzureStorageAccountKey, @"##########################################")
.Set(AzureBatchRuntimeClientConfiguration.AzureStorageAccountName, @"############")
.Set(AzureBatchRuntimeClientConfiguration.AzureStorageContainerName, @"###########")
//// Extend default retry interval in Azure Batch
.Set(AzureBatchRuntimeClientConfiguration.DriverHTTPConnectionRetryInterval, "20000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about driver connection attempts?

Copy link
Contributor Author

@homezcx homezcx Jun 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default values works for our needs. So in this example we don't need to alter the value. Also, when using List, it gives a cannot convert error at compiling.

networkConfiguration = client.poolOperations().getPool(poolId).networkConfiguration();
} catch (IOException e) {
LOG.log(Level.WARNING, "Unable to setup Http Server with InBoundNATPool Port", e);
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not return null here - this is requiring that the code calling this api is checking for null. Looking at the code above that references this function, it looks like it does not and would fail with a null ref. I think this code should throw and be captured appropriately so that you can identify what the real error is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes for other cases where null is returned in this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. I fixed this by returning empty set.

In this case, Driver Http server will fall back to use RangeTcpPortProvider.

@singlis
Copy link
Contributor

singlis commented Jun 4, 2018

Thanks, this looks good to me. @markusweimer or @motus, can you please take a second look?

@homezcx
Copy link
Contributor Author

homezcx commented Jun 11, 2018

@markusweimer or @motus Could you help review this PR as well?

@motus
Copy link
Contributor

motus commented Jun 14, 2018

Sure, I will look at it now. Meanwhile, @homezcx, could you please resolve the conflicts? Thank you!

@homezcx
Copy link
Contributor Author

homezcx commented Jun 14, 2018

@motus Thanks! I will work on resolving the conflicts.

…on to Driver

This approach enables REEF.NET Client-Driver communication by using InBoundNATPool features provided by Azure Batch Pool. More Info: https://docs.microsoft.com/en-us/rest/api/batchservice/pool/add#inboundnatpool

JIRA:
  [REEF-2021](https://issues.apache.org/jira/browse/REEF-2021)

Pull Request:
  This closes #
folder. Process the bridge jar as other runtime to avoid IOException
caused by duplicate copy of bridge jar to global folder.
1. Fix potential null pointer excpetion when there is no InboundNATPool
set.
2. Code cleanup
AzureBatchService to perform Azure Batch operations.
Node operation is only suppoted using AzureBatch key.
@homezcx
Copy link
Contributor Author

homezcx commented Jun 14, 2018

@motus Rebased and tested.

namespace Org.Apache.REEF.Client.AzureBatch.Parameters
{
[NamedParameter(Documentation = "The Azure Batch Pool Driver Http Server Ports List")]
public sealed class AzureBatchPoolDriverPortsList : Name<IList<string>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string [](start = 67, length = 6)

Encountered this when looking at my related changes. Why this is not IList instead?

Copy link
Contributor Author

@homezcx homezcx Jun 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the example in TestListInjection.java, to bind a List , it requires to use List


In reply to: 195608449 [](ancestors = 195608449)

Copy link
Contributor Author

@homezcx homezcx Jun 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, it seems binding List in ConfigurationModule is not supported.


In reply to: 195804156 [](ancestors = 195804156,195608449)

Copy link
Contributor

@motus motus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few nit picks

result.add(sequence.toString());
}
return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use final; preallocate the result array; the signature can be more generic. I would write something like

private static List<String> asStringList(final Collection<? extends CharSequence> list) {
  final List<String> result = new ArrayList<>(list.size());
  for (final CharSequence sequence : list) {
    result.add(sequence.toString());
  }
  return result;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@Inject
public ListTcpPortProvider(@Parameter(TcpPortList.class) final List<Integer> tcpPortList) {
this.tcpPortList = tcpPortList;
LOG.log(Level.FINE, "Instantiating " + this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use string interpolation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


protected override string GetDriverUrl(string filepath)
{
var policy = new RetryPolicy<AllErrorsTransientStrategy>(_numberOfRetries, TimeSpan.FromMilliseconds(_retryInterval));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

policy can be a member variable, initialized in constructor. then you won't need _numberOfRetries and _retryInterval members

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

LOGGER.Log(Level.Warning, "unable to get driver http endpoint. The format in remote file is not correct.");
//// Returns null to exit retry policy since it is not recoverable.
return null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, http endpoint file may not necessarily have a terminating \n. I would rather write

string driverHost = httpEndPointFile.ReadAsString().TrimEnd('\r', '\n', ' ');

etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOGGER.Log(Level.Warning, "unable to get driver http endpoint port. The format in remote file is not correct.");
//// Returns null to exit retry policy since it is not recoverable.
return null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to parse the port as int right away, i.e. do something like

string[] driverIpAndPorts = driverHost.Split(':');
if (driverIpAndPorts.Length <= 1 || !int.TryParse(driverIpAndPorts[1], out int backendPort))
{
    LOGGER.Log(Level.Warning, "Unable to get driver http endpoint port from: {0}", driverHost);
    return null;
}
// ...
InboundEndpoint endpoint = inboundEndpoints.FirstOrDefault(s => s.BackendPort == backendPort);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}

return "http://" + publicIp + ':' + frontEndPort + '/';
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, who writes the content of the httpEndPointFile? If it is us, can we write a proper URL to it instead, so that the string can be parsed with just

Uri driverUrl = new Uri(httpEndPointFile.ReadAsString());

We can also return a Uri object instead of a string from that method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is JavaBridge writes the HttpEndPointFile. See:
https://github.com/apache/reef/blob/master/lang/java/reef-bridge-java/src/main/java/org/apache/reef/javabridge/generic/JobDriver.java#L193

The string format ip + port in this file is used by other runtime as well, YARN, local. We can change it but it seems to be out of scope of this PR. Shall we make it separate ticket?

while (status.IsActive())
{
// Add sleep in while loop, whose value alligns with default heart beat interval.
Thread.Sleep(DriverStatusIntervalInMilliSecond);
LOGGER.Log(Level.Info, "DriverStatus is " + status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use string interpolation

@@ -85,7 +94,15 @@ internal JobSubmissionResult(IREEFClient reefClient, string filePath, int number
/// </summary>
public string DriverUrl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: maybe return Uri instead of a string URL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: The current string format is "10.0.0.4:1234". it requires to change the string in HttpEndPointFile to a valid URI format, which could affect other runtimes. See if we want to create another ticket for it.

Copy link
Contributor

@motus motus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Will test locally and merge


try
{
status = FetchDriverStatus();
}
catch (WebException e)
catch (WebException)
{
// If we no longer can reach the Driver, it must have exited.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still log the exception here, say

LOGGER.Log(Level.Warning, "Driver unreacheable. Exiting now.", e);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@motus motus merged commit b96c045 into apache:master Jun 19, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
5 participants