MNIST with spark standalone cluster mode #37
Comments
Hi Manoj, the error says the parameter server port is already allocated (`error: [Errno 98] Address already in use`, raised from "/usr/lib/python2.7/socket.py", line 228, in meth). Could you check this? Joeri |
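As a quick way to confirm whether a port is actually held by another process, here is a minimal sketch (my own illustration, not part of dist-keras) that tries to bind a socket and reports whether the bind succeeds. The `port_is_free` helper name is hypothetical; the default parameter-server port in this thread is 5000.

```python
import socket

def port_is_free(port, host="0.0.0.0"):
    """Return True if we can bind (host, port), i.e. no other process holds it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except socket.error:
        return False
    finally:
        s.close()

# Occupy a port, then observe that a second bind would fail, which is
# exactly the "Address already in use" condition from the traceback.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("0.0.0.0", 0))      # let the OS pick a free port
taken_port = holder.getsockname()[1]
print(port_is_free(taken_port))  # False: the port is held by `holder`
holder.close()
print(port_is_free(taken_port))  # True once released
```

Running this (or simply `netstat -lanp --protocol=inet`, as used later in the thread) on the driver host shows whether port 5000 is already taken before the parameter server starts.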
Hi Joeri, I ran:

    ./bin/spark-submit --master spark://10.51.5.40:7077 examples/src/main/python/gtzanKeras.py gtzan.parquet

    Layer (type)               Output Shape   Param #
    dense_1 (Dense)            (None, 40)     1240
    activation_1 (Activation)  (None, 40)     0
    dropout_1 (Dropout)        (None, 40)     0
    dense_2 (Dense)            (None, 15)     615
    activation_2 (Activation)  (None, 15)     0
    dropout_2 (Dropout)        (None, 15)     0
    dense_3 (Dense)            (None, 10)     160
    activation_3 (Activation)  (None, 10)     0
    Total params: 2,015
    Number of training instances: 897
    Training time: 27.7298340797

Thanks, |
Hi Manoj, this shouldn't happen. My guess is that another process is occupying the default port (5000), or the OS has a firewall enabled. What is the output of Joeri |
Hi Joeri, ee207437@pcg-ee207437-3:~$ netstat -lanp --protocol=inet ee207437@pcg-ee207437-3:~$ sudo ufw status Thanks, |
Hmm, strange. Could you send me a sample of how you use ADAG? Because it seems that for some reason the parameter server is not allocated. Joeri |
Hi Joeri, if __name__ == "__main__":
--Manoj |
Hmm, this is really strange. But the output shows that the model is trained? Could you give the output of Joeri |
Joeri, the output of determine_host_address() is 127.0.1.1.

When I run with --master local[*] on PC-ee207437-1, this is expected, as I have num_worker = 3.

When I run with --master spark://10.51.5.40 on PC-ee207437-1, worker-2 and worker-3 are trying to connect to themselves. Is this correct?

/etc/hosts has spark/sbin/slaves

In the Spark UI I see all workers connected to the master. I tried running other examples like wordcount.py. I think I'm missing some env-var setting for dist-keras with Spark standalone. Thanks, |
Yes, that's the error. We need to force determine_host_address to not pick the local address. If we fix that it will work. I'm on my phone right now, but I can look at it in about one hour. I'll keep you posted. Joeri |
Also, did you define the local hostname as 127.0.0.1 in /etc/hosts? That would explain why determine_host_address doesn't function properly. Joeri |
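To illustrate why such an /etc/hosts entry causes the symptom above, here is a small sketch (my own illustration, not the system resolver) of the lookup that name resolution performs against a hosts file. The hosts-file contents and the `spark-master` name are assumptions for the example; `pcg-ee207437-3` and 10.51.5.40 are the hostnames and master IP from this thread.

```python
# Minimal illustration: a line like "127.0.1.1  pcg-ee207437-3" in /etc/hosts
# makes the worker's own hostname resolve to a loopback-range address, so
# workers end up dialing themselves instead of the driver.
def resolve_from_hosts(hosts_text, hostname):
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if hostname in names:
            return address
    return None

hosts = """\
127.0.0.1   localhost
127.0.1.1   pcg-ee207437-3
10.51.5.40  spark-master
"""
print(resolve_from_hosts(hosts, "pcg-ee207437-3"))  # 127.0.1.1 -> worker dials itself
print(resolve_from_hosts(hosts, "spark-master"))    # 10.51.5.40 -> reachable address
```

Commenting out the 127.0.1.1 line (as Manoj does later in the thread) makes resolution fall through to a routable address instead.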
Hi Manoj, I think I have a fix. The only downside is that this code isn't cross-platform, but I don't think a lot of people run Spark on Windows / Mac anyway.

    import os
    import fcntl
    import socket
    import struct

    def get_interface_ip(ifname):
        # SIOCGIFADDR (0x8915) yields the IPv4 address bound to the interface.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        return socket.inet_ntoa(fcntl.ioctl(
            s.fileno(), 0x8915,
            struct.pack('256s', ifname[:15]))[20:24])

    def get_default_iface():
        # The default route has destination 00000000 and the RTF_GATEWAY (0x2) flag set.
        route = "/proc/net/route"
        with open(route) as f:
            for line in f.readlines():
                try:
                    iface, dest, _, flags, _, _, _, _, _, _, _ = line.strip().split()
                    if dest != '00000000' or not int(flags, 16) & 2:
                        continue
                    return iface
                except ValueError:
                    continue

    def determine_host_address():
        # Retrieve the network interface name in ASCII encoding.
        iface = get_default_iface().encode("ascii")
        # Obtain the network address from the active NIC.
        address = get_interface_ip(iface)
        return address

Could you verify that determine_host_address() doesn't return the local address on your machines? Joeri |
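As a quick hedged check one could run on each worker to verify the property the fix is meant to guarantee, here is a sketch using the Python 3 stdlib ipaddress module (the `is_loopback` helper name is my own, not part of dist-keras):

```python
import ipaddress

def is_loopback(address):
    """True for any address in 127.0.0.0/8, e.g. 127.0.0.1 or 127.0.1.1."""
    return ipaddress.ip_address(address).is_loopback

# The fixed determine_host_address() should yield a routable NIC address,
# never one in the loopback range:
print(is_loopback("127.0.1.1"))   # True  -> workers would dial themselves
print(is_loopback("10.51.5.40"))  # False -> reachable by other hosts
```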
Hi Joeri,
After commenting out the localhost entry in /etc/hosts on the workers, I did not get the error. However, my code is still running; I shall keep you posted once it's completed.
Also, I shall try the code fix that you have suggested.
Thank you for the immediate responses.
-Manoj
|
If it's ok for you I'll close this issue now. Feel free to re-open it. Joeri |
Hi Joeri, when I run with --master spark://10.51.5.40 on PC-ee207437-1, the run completes successfully. Thanks a lot for the immediate responses, Joeri. |
@JoeriHermans
Hi Joeri,
I tried to run one of my experiments with PySpark standalone cluster mode, with 3 workers.
I'm getting a "Connection refused" error from the workers.
Is this expected?
ee207437@pcg-ee207437-1:/usr/lib/spark$ ./bin/spark-submit --master spark://10.51.5.40:7077 examples/src/main/python/gtzanKeras.py gtzan.parquet
Using TensorFlow backend.
17/10/11 14:35:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root
|-- features_normalized: vector (nullable = true)
|-- label_index: double (nullable = true)
|-- label: array (nullable = true)
| |-- element: double (containsNull = true)
Layer (type) Output Shape Param #
dense_1 (Dense) (None, 40) 1240
activation_1 (Activation) (None, 40) 0
dropout_1 (Dropout) (None, 40) 0
dense_2 (Dense) (None, 15) 615
activation_2 (Activation) (None, 15) 0
dropout_2 (Dropout) (None, 15) 0
dense_3 (Dense) (None, 10) 160
activation_3 (Activation) (None, 10) 0
Total params: 2,015
Trainable params: 2,015
Non-trainable params: 0
Number of training instances: 887
Number of testing instances: 113
2017-10-11 14:36:03.929908: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 14:36:03.929928: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 14:36:03.929934: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 14:36:03.929938: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 14:36:03.929943: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/lib/python2.7/dist-packages/distkeras/trainers.py", line 466, in service
self.parameter_server.initialize()
File "/usr/local/lib/python2.7/dist-packages/distkeras/parameter_servers.py", line 111, in initialize
file_descriptor.bind(('0.0.0.0', self.master_port))
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use
[Stage 9:> (0 + 3) / 3]17/10/11 14:36:10 WARN TaskSetManager: Lost task 1.0 in stage 9.0 (TID 656, 10.51.5.30, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/lib/python2.7/dist-packages/distkeras/workers.py", line 261, in train
self.connect()
File "/usr/local/lib/python2.7/dist-packages/distkeras/workers.py", line 197, in connect
self.socket = connect(self.master_host, self.master_port, self.disable_nagle)
File "/usr/local/lib/python2.7/dist-packages/distkeras/networking.py", line 97, in connect
fd.connect((host, port))
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Thanks,
Manoj