
(ENTIRELY AUTOMATED) Google Cloud + Microsoft Azure Tesla V100 Free Trial : video+text tutorial with managed instance group #1905

Open
wonderingabout opened this issue Oct 2, 2018 · 110 comments

Comments

@wonderingabout
Contributor

wonderingabout commented Oct 2, 2018

hi,

IMPORTANT UPDATE : 13 November 2018 !!!
TO ALL THOSE WHO USED THE GOOGLE CLOUD FREE TRIAL INSTRUCTIONS BEFORE 13 NOVEMBER 2018 :
https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit
THE OLD STARTUP-SCRIPT DOESN'T WORK ANYMORE !
YOU NEED TO DELETE YOUR INSTANCE GROUP AND INSTANCE TEMPLATE, AND CREATE A NEW TEMPLATE WITH THE UPDATED STARTUP-SCRIPT PROVIDED IN THE GOOGLE DOC (i updated it)
see page 10 of the doc; more details in this github comment : #1905 (comment)

FOR THE FULLY AUTOMATED MICROSOFT AZURE TESLA V100 FREE TRIAL INSTRUCTIONS (LOW-PRIORITY COST), SEE :
https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

MAIN MESSAGE FOR GOOGLE CLOUD :
as i said recently on discord, google colab (tesla K80) doesn't work so well : after 1-2 hours of computing you are disconnected (no gpu backend) and have to wait a while to get a gpu again... only to use it for another 1-2 hours before being blocked once more, see their faq : https://research.google.com/colaboratory/faq.html#gpu-availability

however, google cloud is different : they give you, totally free of charge, 300 dollars / 257 euros of credit that can be used with a Tesla V100 (0.75 $/hour of free credit consumed; with preemptible pricing, 300 dollars keeps you going for a very long time, for free!),
on the condition that your identity is verified with a valid credit card, to avoid abuse (but you'll never be charged anything, even when your free trial credit ends)
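(rough estimate from the figures above : 300 dollars of credit at 0.75 $/hour is about 400 GPU-hours, i.e. roughly 16-17 days of non-stop selfplay on a regular V100, and the preemptible rate stretches the credit considerably further)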

now, you don't need to manually start autogtp anymore ! see the final instructions in the google doc :
https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit

about the speed of the tesla v100 :
i experimented with it, and it's very powerful !
it was slightly slower during the first run in the tutorial, but you can expect to produce games at around this speed :

  • around 11 +/- 1 games (5% resign) in 60 minutes with the master branch
  • around 13 +/- 1 games (5% resign) in 60 minutes with the next branch,
  • and around 16 +/- 1 games (5% resign) in 60 minutes with the next branch and -g 2 (./autogtp -g 2 is faster but requires 4 vcpu / 5.75 GB ram)
    - if you get a 0% resign game with 700 moves, it will last around 1 hour with the master branch and around 40 minutes with the next branch at the time i'm writing this tutorial

a long run looks something like this :

(screenshot of autogtp output from a long run)

i myself don't have a very powerful gpu, so i'm glad i can help with the 40b net, which is resource-expensive,

from my calculations, one free trial user can produce around 250 games per day at 24/7 usage,
so with 100 people doing this free trial, we'd produce 25 000 extra games per 24 hours, and 175 000 games/week.

it doesn't consume your own resources, as the V100 runs in the cloud, so just start your instance and forget it, no need to keep the SSH window or your computer open !
~~please don't forget to stop the instance after you exit the ssh window, or the credit is wasted for no reason, as explained in the tutorial~~ no longer needed : you don't have to stop your instance anymore, as autogtp produces games automatically

last thing, these resources can also be used for other projects, like further training the 192x15, as suggested in #1889

the more people use it, the faster development will be,
please share and thank you for reading this

edit :
after a quick search, i found that microsoft azure and oracle cloud also offer free trial credit for GPU cloud computing

edit 2 : simplified the commands in the pastebin tutorial, added part 3b instructions to use the next branch (25-30% faster at producing games), and added instructions to install the latest nvidia short-lived (fresh) drivers in part 1b

edit 3 : in part 4, ./autogtp -g 2 produces games significantly faster, even with only 5%-resign games (16 vs 13 games in 60 minutes), partly because while a 0% resign game is being generated, the extra gpu power of the v100 can be used to produce another game simultaneously
run this instead (requires 4 vcpu and 5.75 GB ram) :
./autogtp -g 2
(265 games/24 hours vs 208 games/24 hours)

edit 4 :

  • greatly shortened the commands, much faster to create a new instance now !
  • added the startup-script in the instance metadata

(remove -g 2 if you chose the weak, slower hardware option)
now, as soon as your instance has started (green circle), you no longer need to open an SSH window to start producing games (and you should not, or autogtp will run twice, overloading the hardware and making it slower and unstable), and there is no need to keep your computer powered on anymore !

  • added : journal command to check the progress of the startup-script
    sudo journalctl -u google-startup-scripts.service -b -e -f
  • added : details on modifying the instance after its creation, to add the startup-script metadata once the install of all the pre-leela-zero packages (nvidia, all packages, etc.) is finished
  • will do : 2 new video tutorials (minimalist + detailed) as soon as the preemptibility issue is solved (auto restart after a preemptible stop)

edit 5 : (final version) :

edit 6 :
added microsoft azure free trial with low priority costs (80% cheaper), entirely automated
see : https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing


edit 7 :
as of 13 november 2018, the old startup-script doesn't work anymore, so i updated it !! you need to update, or your free credit will be wasted ! see : #1905 (comment)

edit 8 :
26 november 2018 : google cloud update, added quota requests for GPUs (all regions, global) and preemptible CPUs in every region,
see page 3 : https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit
and see this comment : #1905 (comment)

edit 9 :
07 december 2018 : a much shorter and more efficient script for google cloud,
see : #1905 (comment)

edit 10 :
02 february 2019
startup script updated with the new repo owner leela-zero instead of gcp,
see : #1905 (comment)

@l1t1

l1t1 commented Oct 2, 2018

interesting

@l1t1

l1t1 commented Oct 3, 2018

lc0 also began using this. http://blog.lczero.org/2018/10/contributing-to-leela-chess-zero.html?m=1#more

@wonderingabout
Contributor Author

yes, as they said, the preemptible V100 is really the most efficient use of the free credit

@wonderingabout
Contributor Author

wonderingabout commented Oct 3, 2018

after a quick search, i found that microsoft azure and oracle cloud also offer free trial credit for GPU cloud computing

edit : today's run so far :
(screenshot of today's autogtp run)

edit 2 :
old instructions (moved from the main top message) :
these came with a text and a video tutorial (a minimalist one with no explanations, and a detailed one)

SHORT VERSION :
minimalist text instructions :
https://pastebin.com/SL1iBAey

minimalist and quick tutorial (no explanations) :
(will be remade soon to add all improvements) https://www.youtube.com/watch?v=LBeh4cfVcPg

LONG VERSION :

long and detailed text instructions :
https://pastebin.com/552UN25c

detailed and long tutorial :
(will be remade soon to add all improvements) https://www.youtube.com/watch?v=64SDJuibv78&

here is what i get with the next branch (includes 3 no-resign games of 700 moves and 2200 seconds each) :

(screenshot of autogtp output on the next branch)

@seopsx

seopsx commented Oct 3, 2018

after a quick search, i found that microsoft azure and oracle cloud offer a free trial credit for GPU cloud computing :

edit : today's run so far :

@wonderingabout That seems slow. If you are using V100 you should be getting moves considerably faster.

My results:

137 game(s) (107 self played and 30 matches) played in 893 minutes = 391 seconds/game, 1904 ms/move, last game took 374 seconds.

What are your benchmark results?

My results:

3200 visits, 1104899 nodes, 3199 playouts, 384 n/s

@wonderingabout
Contributor Author

wonderingabout commented Oct 3, 2018

@seopsx

if you get one no-resign game, it can last one hour (60 minutes) even on a V100 (723 moves), so imagine the worst case scenario where you get only no-resign games and they all last one hour each : you'd produce 6 games in 6 hours (but these games are important for the learning process and shouldn't be skipped)

also, 60 games in 6 hours (so 10 games/hour), including some no-resign games, is not slow at all
in fact, it is (much) faster than what most of our contributors can do (look at the average game duration in selfplay or match settings), and it doesn't prevent you from also using your personal machine; it's all just a free extra

questions for you :

  1. what is your build/rig ? cloud ? personal machine ? V100 ? which GPU ? SLI/crossfire ?
  2. do you use master or next ? if you use the next branch, you can potentially make it (much) faster, but i chose to keep it simple and went with the master branch; a new master release should eventually come out
    also, if leela zero were optimized for the V100, it would be a monster
  3. why do you have 3200 visits ? we now use 1600 visits, and 0 playouts (not 3200)
    if you use it for the benchmark (i assume you're using LZ181), i'm running autogtp for now so i will answer you next time

@seopsx

seopsx commented Oct 3, 2018

@wonderingabout

I'm running LZ next with half precision enabled on a google cloud V100. The benchmark uses 3200 visits.

@wonderingabout
Contributor Author

wonderingabout commented Oct 3, 2018

@seopsx
i see, thanks,

for information, on average i get about the same speed, no big change (including no-resign games) :
(screenshot of autogtp output)

for the benchmark, i'd need to know the precise command you ran to start it
i assume you compiled leela zero yourself, so can you tell me the command you used ?
then, when i try the benchmark, i will tell you whether i can reproduce your results

also, you may want to write standardized steps for using autogtp with the next branch, starting from part 3 of the text instructions i sent

thanks

@seopsx

seopsx commented Oct 4, 2018

@wonderingabout
The command used for benchmark is "./leelaz --benchmark -w networks/68824bbc683a0eb482bcdc34ea7c3e4bc3e1dd152e3aa94f9a8bfc6d189f3091.gz"
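(side note : the benchmark needs an existing weight file; if autogtp hasn't downloaded one into networks/ yet, a minimal sketch to fetch the current best network first would be something like this, assuming the zero.sjeng.org best-network endpoint and an arbitrary local filename :)

curl -L -o best-network.gz http://zero.sjeng.org/best-network
./leelaz --benchmark -w best-network.gz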

There's a bug which makes LZ choose single-precision FP even when it's slower than half precision. See #1887

For the time being you can use this patch to increase the efficiency. #1888

@wonderingabout
Contributor Author

wonderingabout commented Oct 4, 2018

@seopsx

i tried using cmake on the master branch followed by your command, but nothing happens after i enter it , this is what i did :

DELETED, see my next messages for correct instructions

this is what i get (nothing happens):

(screenshot of the benchmark hanging)

did i do something wrong ?
thanks again

@seopsx

seopsx commented Oct 4, 2018

@wonderingabout

did i do something wrong ?
thanks again

Are you sure the weight file exists under that filename?

@wonderingabout
Contributor Author

not at all, i just blindly copied your command

@wonderingabout
Contributor Author

wonderingabout commented Oct 4, 2018

updated instructions 4 october 2018 :
https://pastebin.com/3RGqevbt

commands are shorter

@seopsx

seopsx commented Oct 4, 2018

@wonderingabout

The bug is now fixed on next branch. I encourage everyone to do "git checkout next" and "git pull" inside leela-zero directory and follow the README instructions to build it again.
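(for anyone following along, a minimal sketch of the full switch-and-rebuild sequence implied here, assuming the repo was cloned into the home directory and using the same clean build-directory layout as elsewhere in this thread :)

cd ~/leela-zero
git checkout next
git pull
git submodule update --init --recursive
rm -rf build && mkdir build && cd build
cmake ..
cmake --build .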

@wonderingabout
Contributor Author

wonderingabout commented Oct 5, 2018

@seopsx

then i'll create another instance and test autogtp's performance directly; i will let you know when i do and what results i get
but this is why i preferred to stay on master for the tutorial : it may be slower, but it's easier and more reliable

i hope we get a new master release soon though, because the current master is old

@wonderingabout
Contributor Author

wonderingabout commented Oct 7, 2018

@seopsx

thanks, i followed your instructions, compiled next (in a subdirectory) and ran it successfully, with these instructions :

after you have installed master and successfully run autogtp, exit the instance and restart it, then :
(installing next in a subdirectory so that you can always go back to master)

DELETED (see my next message for correct instructions) : #1905 (comment)

however, i didn't see a significant speed difference; is it because it's the first run, or are there extra options i should enable ?
i'll let autogtp next run for the day to see if it goes faster in the long run

@seopsx

seopsx commented Oct 7, 2018

@wonderingabout

You are using the wrong directory. Autogtp is inside "build" when you compile it like this:

# Use stand alone directory to keep source dir clean
  mkdir build && cd build
  cmake ..
  cmake --build .
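To then run the freshly built autogtp, a sketch consistent with the commands used later in this thread (leelaz ends up in build/, the autogtp binary in build/autogtp/, and autogtp looks for leelaz in its own working directory) :

  cd ../autogtp
  cp ../build/autogtp/autogtp .
  cp ../build/leelaz .
  ./autogtp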

@wonderingabout
Contributor Author

@seopsx

i understand, thank you
when i try it again next time, i'll let you know the results i get

@wonderingabout
Contributor Author

wonderingabout commented Oct 8, 2018

@seopsx

i successfully managed to run the next branch, and it is indeed around 30% faster than the master branch
(around 13-15 games per hour on average vs 10 games per hour for master)

i updated the pastebin tutorial to add the next branch instructions (part 3b), and also simplified the commands (one less reboot needed) : https://pastebin.com/552UN25c

after you run autogtp on the master branch, exit the instance (chrome exit button) and click SSH again to restart it, then copy-paste this whole selection :

pull next branch

cd leela-zero
git checkout next
git pull
git submodule update --init --recursive

compile leelaz and autogtp binaries

mkdir build && cd build
cmake ..
cmake --build .
./tests

then go to autogtp subdirectory

cd ..
cd ./autogtp

compile autogtp in autogtp folder

qmake -qt5
make

copy leelaz binary into autogtp folder

cp ../build/leelaz .

run autogtp

./autogtp

you can see that it's not the master version because these lines don't exist in the current master version, among others :
BLAS Core: built-in Eigen 3.3.5 library.
Half precision compute support: No.
time_settings 0 1 0
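(another quick way to confirm which branch the checkout is on, assuming the repo is in ~/leela-zero; this works even with the old git versions shipped on these images :)

cd ~/leela-zero && git rev-parse --abbrev-ref HEAD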

thanks again for your help @seopsx

one question i want to ask : doesn't the V100 support half precision in leela zero ?
this line surprised me :
Half precision compute support: No.

edit : today's run so far :

it is really much faster on the next branch (the master branch would have played around 30-32 games in the same time)

(screenshot of today's autogtp run)

includes 3 no-resign games that lasted 700 moves and 2200 seconds each (vs 3200 seconds on the master branch)

@seopsx

seopsx commented Oct 8, 2018

@wonderingabout You can remove the unnecessary step "compile autogtp binary". If you look closely, the command "cmake --build ." also builds autogtp.

@wonderingabout
Contributor Author

wonderingabout commented Oct 8, 2018

@seopsx
thank you, i did notice that autogtp was compiled during the cmake build, but the readme of autogtp on next said to compile it the way i did : https://github.com/gcp/leela-zero/blob/next/autogtp/README.md
still, i updated the pastebin tutorial based on your comment

last question : leela zero doesn't support half precision ?

@alreadydone
Contributor

alreadydone commented Oct 8, 2018

I seem to remember that someone mentioned NVIDIA cards don't support half precision compute through OpenCL, but only through cuDNN or so. (Edit: link here #1689 (comment)) However, I heard that the latest NVIDIA driver sped things up by 7% on a 1080 Ti.

@wonderingabout
Contributor Author

wonderingabout commented Oct 8, 2018

@alreadydone @seopsx

i see, i will try the beta/fresh nvidia drivers if i can figure out how to include the ppa, or i'll use a direct install
i will let you know if i see significant performance improvements

edit : currently, nvidia's latest driver version is 396.44; there is no beta driver for the tesla v100 (or the 1080 ti), i checked here :
https://www.nvidia.com/drivers/beta
only the rtx 2xxx cards have the 4xx.xx branch from september 2018

so going with the ppa (396.54, from 23rd august 2018) is just the easiest option i think, i will let you know how it goes : https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa

edit 2 : successfully installed the nvidia short-lived branch, and added instructions in part 1b of the pastebin tutorial :
here is how i did it, see part 1b of the pastebin tutorial for details : https://pastebin.com/552UN25c

sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update && apt-cache search nvidia

then find the nvidia-driver-396 line in the output

and run this command :
sudo apt-get -y install nvidia-driver-396 linux-headers-generic nvidia-opencl-dev libnvidia-compute-396 && sudo reboot
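(after the reboot, a quick sanity check that the new driver and opencl stack are actually in use; nvidia-smi comes with the driver package, and clinfo is assumed to be installed already as in the earlier steps :)

nvidia-smi
clinfo | grep -i "platform name"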

i will let you know if i see any major performance difference with the latest 396 drivers compared to the 390 long-lived branch

edit 3 : with 11 games in 60 minutes on the next branch, nvidia-396 is actually around 15% slower than nvidia-390 (long-lived branch, 13-14 games in 60 minutes), and 625 moves took 2400 seconds (vs 700 moves in 2200 seconds with the long-lived nvidia 390, so yes, the short-lived nvidia 396 with the next branch looked significantly slower)

edit 4 : actually i was using autogtp master with leelaz next; with autogtp next, nvidia-driver-396 is not slower than nvidia-390, so from my tests i'd recommend running the short-lived nvidia branch (part 1b) with the next branch (parts 3a+3b)

@alreadydone
Contributor

alreadydone commented Oct 9, 2018

Sorry, the speedup from ~850 n/s to ~910 n/s on 1080 Ti happens on Windows and the driver is probably https://www.nvidia.com/download/driverResults.aspx/138567/en-us (NVIDIA GeForce 416.16 WHQL driver for Windows 10 v1809). However, v1809 had devastating effects on some computers, and seems to be unavailable now.

@wonderingabout
Contributor Author

wonderingabout commented Oct 9, 2018

no problem,
edit : actually, nvidia 390 was not faster than 396, see my above comment

@alreadydone @roy7 @gcp now that the instructions are finalized, can we add the google cloud free trial instructions to the github "i want to help" page, just before the google colab instructions ?

https://github.com/gcp/leela-zero#i-want-to-help
https://github.com/gcp/leela-zero/tree/next#i-want-to-help

in both master and next branches
note that microsoft azure and oracle cloud also give free trial VM cloud instances

@wonderingabout
Contributor Author

wonderingabout commented Oct 9, 2018

edit 2 : i read that the tesla v100 supports multi-threading here (ctrl+f "thread") : https://images.nvidia.com/content/technologies/volta/pdf/437317-Volta-V100-DS-NV-US-WEB.pdf

so i'm trying ./autogtp --gamesNum 3 (3 games running on 3 different gpu threads at the same time with only one gpu; i will let you know if i see a significant performance boost)

edit 3 : with 14 games/hour, --gamesNum 3 is around the same speed as 1 game at a time, but i suspect that in the long run (with a 0% resign game taking all the gpu time), it should be faster
more than 3 games could not run, giving this error message :

ERROR: Could not talk to engine after launching.

however, with this 2 vcpu / 3.75 GB ram machine, the cpu was at 100%, unlike with 1 game at a time (50%), so i'll try creating instances with 4/6/8 vcpu and 8/16 GB ram to check whether a bottleneck was slowing things down; i will update this with the results i get

edit 4 : with 8 vcpu / 8 GB ram, the max setting is --gamesNum 7 (more generates the same error). i will try increasing ram to 16 GB and maybe later use an ssd instead. with 8 vcpu / 8 GB ram, 1 game at a time uses 12.5% cpu and --gamesNum 7 uses 86.5% cpu, so even if we increase ram we may only gain a few more simultaneous games at most

(screenshot of multiple simultaneous games running)

@alreadydone
Contributor

@wonderingabout You can edit (click the pen icon) https://github.com/gcp/leela-zero/blob/next/README.md and after that you'll be directed to submit a pull request.
Does increasing the number of vCPUs or RAM cost more credits? To quickly check the speed, you may count the number of moves generated per, say, 20 seconds (copy the text and count the number of parentheses, etc.). You should also be able to use nvidia-smi to check GPU usage, and if it doesn't increase, the speed also shouldn't increase.
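A concrete way to do both checks, as a sketch; it assumes autogtp's output is redirected to a file, here called autogtp.log, which is a made-up name :

timeout 20 tail -f autogtp.log | grep -o "(" | wc -l
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

The first line counts the opening parentheses (one per move in autogtp's progress output) printed during 20 seconds; the second samples GPU utilization once per second.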

@wonderingabout
Contributor Author

@alreadydone
ok, i will try then, if the result is reproductible between all 40b nets and similar enough, we can indeed skip some time like that

since i start to run out of google cloud credit, i cant experiment a lot with it though, but i'll try

@wonderingabout
Contributor Author

wonderingabout commented Nov 9, 2018

edit : outdated
see final version azure instructions with low priority cost in this google doc :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

for microsoft azure, i'll try debugging with minimal scripts and see how it goes :

try 1 :

/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' clinfo|grep "install ok installed")
echo Checking for clinfolib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No clinfolib. Setting up clinfolib and all other leela-zero packages."
  sudo -i && uname -a && sudo apt-get -y install clinfo && clinfo
else 
  sudo -i && uname -a && clinfo && pwd
fi'

then clicking on reboot button :

edit : this may be useful for later debugging :

dpkg-query: no packages found matching clinfo
mesg: ttyname failed: Inappropriate ioctl for device
debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 

after the reboot, it still doesn't detect clinfo, even though it is already installed :

Checking for clinfolib:
No clinfolib. Setting up clinfolib and all other leela-zero packages.
Linux 0d9cf60224db40fe99dec519fddb8a93000000 4.15.0-1030-azure #31-Ubuntu SMP Tue Oct 30 18:35:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Reading package lists...
Building dependency tree...
Reading state information...
clinfo is already the newest version (2.2.18.03.26-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Number of platforms                               0

and the automation script shows the same result

interesting read :

https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-linux

as it explains, a reboot will always break the startup script (which doesn't run for more than 90 minutes anyway), which moves us to try 2 :

try 2 :

using a distro that has the gpu driver already installed, thus avoiding the need to reboot, and starting with a job schedule directly (with recurrence)

/bin/bash -c 'sudo -i && uname -a && sudo apt-get update && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo && pwd && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2'

as centos uses yum, not apt :

this datascience distro is promising :

https://azuremarketplace.microsoft.com/en-US/marketplace/apps/microsoft-ads.linux-data-science-vm

try 3 for centos datascience :

/bin/bash -c 'sudo -i && uname -a && pwd && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2'

what it returns :

Package clinfo-2.1.17.02.09-1.el7.x86_64 already installed and latest version Package cmake-2.8.12.2-2.el7.x86_64 already installed and latest version Package git-1.8.3.1-14.el7_5.x86_64 already installed and latest version No package libboost-all-dev available. No package libopenblas-dev available. No package zlib1g-dev available. No package build-essential available. No package qtbase5-dev available. No package qttools5-dev available. No package qttools5-dev-tools available. No package libboost-dev available. No package libboost-program-options-dev available. Package opencl-headers-2.2-1.20180306gite986688.el7.noarch already installed and latest version No package ocl-icd-libopencl1 available. No package ocl-icd-opencl-dev available. No package qt5-default available. No package qt5-qmake available. Package curl-7.29.0-46.el7.x86_64 already installed and latest version Nothing to do

try 3.5

this vm image already includes gpu driver :

https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-azure-batch.ubuntu-server-container?tab=PlansAndPrice

/bin/bash -c 'sudo -i && uname -a && pwd && sudo apt-get update && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo rm -r leela-zero git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2'

try 4 :

using an all-in-one ubuntu install (with rm -r leela-zero included, without &&), then scheduling a reboot every 3 hours with a job schedule :

/bin/bash -c 'sudo -i && uname -a && pwd && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update && sudo apt-get -y install nvidia-driver-410 linux-headers-generic nvidia-opencl-dev && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo
rm -r leela-zero
git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2'

job schedule, every 3 hours :
/bin/bash -c 'sudo -i && uname -a && sudo reboot'

@wonderingabout
Contributor Author

wonderingabout commented Nov 11, 2018

edit : outdated
see final version azure instructions with low priority cost in this google doc :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

since i got my account flagged for excessive use, i wasn't able to continue the tests; now back to it :

https://docs.microsoft.com/en-us/azure/scheduler/scheduler-get-started-portal

https://azuremarketplace.microsoft.com/en-us/marketplace/apps?page=1&filters=linux%3Bpricing-free%3Bvirtual-machine-images

some cloud images come with the gpu driver preinstalled, thus removing the need to reboot

/bin/bash -c 'sudo -i && uname -a && pwd && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo ; rm -r leela-zero ; git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2'

test 64 :


/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo ; rm -r leela-zero ;  sudo -i && uname -a && clinfo && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && clinfo
fi'

test 65 :

/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && nvidia-smi ; rm -r leela-zero ; sudo -i && uname -a && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && clinfo
fi'

test 66 :

/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && sudo apt-get -y install linux-headers-generic nvidia-opencl-dev clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo && nvidia-smi ; rm -r leela-zero ;  sudo -i && uname -a && clinfo && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && clinfo
fi'

test 67 :

/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update && sudo apt-get -y install nvidia-opencl-dev clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo && nvidia-smi ; rm -r leela-zero ;  sudo -i && uname -a && clinfo && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && clinfo
fi'

test69 :

/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update ; sudo apt-get -y install nvidia-opencl-dev ; sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo && nvidia-smi ; rm -r leela-zero ;  sudo -i && uname -a && clinfo && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && nvidia-smi && clinfo
fi'

test 70 :

/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update ; sudo apt-get -y --force-yes remove nvidia-opencl-dev ; sudo apt-get -y --force-yes install ocl-icd-opencl-dev ; sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo && nvidia-smi ; rm -r leela-zero ;  sudo -i && uname -a && clinfo && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && nvidia-smi && clinfo
fi'

debug :
suggested packages Suggested packages: aspell-doc spellutils avahi-autoipd bumblebee colord-sensor-argyll evolution evolution-data-server-dbg gettext-doc autopoint apache2-bin libapache2-mod-dnssd hunspell openoffice.org-hunspell | openoffice.org-core ibus-clutter ibus-doc ibus-qt4 click powerd unity-system-compositor zenity unity-greeter-session-broadcast lrzip alsa-utils libgssapi-perl libcanberra-gtk0 libgles2-mesa | libgles2 libdv-bin oss-compat libenchant-voikko fcitx libfftw3-bin libfftw3-dev libgd-tools gphoto2 libvisual-0.4-plugins gstreamer1.0-tools libdata-dump-perl libusbmuxd-tools jackd2 liblcms2-utils avahi-autoipd | zeroconf opus-tools libhtml-template-perl libxml-simple-perl pcscd libqt5libqgtk2 qt5-image-formats-plugins qtwayland5 libraw1394-doc librsvg2-bin hplip libsane-extras sane-utils lm-sensors speex libwww-perl url-dispatcher libxext-doc bindfs binutils-multiarch libtext-template-perl nautilus network-manager-openconnect-gnome network-manager-openvpn-gnome network-manager-vpnc-gnome network-manager-pptp-gnome pinentry-doc pavumeter pavucontrol paman paprefs python3-genshi python3-lxml-dbg python-lxml-doc python3-smbc reiserfsprogs exfat-utils libcanberra-gtk-module x11-xserver-utils lightdm-remote-session-freerdp lightdm-remote-session-uccsconfigure remote-login-service metacity | x-window-manager graphviz upstart-monitor comgt wvdial libvdpau-va-gl1 nvidia-vdpau-driver nvidia-legacy-340xx-vdpau-driver wpagui libengine-pkcs11-openssl mesa-utils xfonts-100dpi | xfonts-75dpi xfonts-scalable gpointing-device-settings touchfreeze xinput

debug : error :

Leela Zero 0.16  Copyright (C) 2017-2018  Gian-Carlo Pascutto and contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see the COPYING file for details.

Using 2 thread(s).
RNG seed: 13804860540396540417
BLAS Core: built-in Eigen 3.3.5 library.
Detecting residual layers...v1...256 channels...40 blocks.
Initializing OpenCL (autodetecting precision).
OpenCL: clGetPlatformIDs
terminate called after throwing an instance of 'cl::Error'
  what():  clGetPlatformIDs

after the reboot, it uninstalls then reinstalls opencl ... weird :

The following packages will be REMOVED:
  ocl-icd-opencl-dev
The following NEW packages will be installed:
  nvidia-opencl-dev
0 upgraded, 1 newly installed, 1 to remove and 0 not upgraded.
Removing ocl-icd-opencl-dev:amd64 (2.2.8-1) ...
Selecting previously unselected package 
Selecting previously unselected package nvidia-opencl-dev:amd64.
Preparing to unpack .../nvidia-opencl-dev_7.5.18-0ubuntu1_amd64.deb ...
Unpacking nvidia-opencl-dev:amd64 (7.5.18-0ubuntu1) ...
Setting up nvidia-opencl-dev:amd64 (7.5.18-0ubuntu1) ...

then just after :

Reading package lists...
Building dependency tree...
Reading state information...
build-essential is already the newest version (12.1ubuntu2).
libboost-dev is already the newest version (1.58.0.1ubuntu1).
libboost-program-options-dev is already the newest version (1.58.0.1ubuntu1).
ocl-icd-libopencl1 is already the newest version (2.2.8-1).
opencl-headers is already the newest version (2.0~svn32091-2).
clinfo is already the newest version (2.1.16.01.12-1).
libboost-all-dev is already the newest version (1.58.0.1ubuntu1).
libopenblas-dev is already the newest version (0.2.18-1ubuntu1).
cmake is already the newest version (3.5.1-1ubuntu3).
curl is already the newest version (7.47.0-1ubuntu2.11).
git is already the newest version (1:2.7.4-0ubuntu1.5).
qt5-qmake is already the newest version (5.5.1+dfsg-16ubuntu7.5).
qtbase5-dev is already the newest version (5.5.1+dfsg-16ubuntu7.5).
zlib1g-dev is already the newest version (1:1.2.8.dfsg-2ubuntu4.1).
qt5-default is already the newest version (5.5.1+dfsg-16ubuntu7.5).
qttools5-dev is already the newest version (5.5.1-3ubuntu0.1).
qttools5-dev-tools is already the newest version (5.5.1-3ubuntu0.1).
Recommended packages:
  libpoclu-dev
The following packages will be REMOVED:
  nvidia-opencl-dev
The following NEW packages will be installed:
  ocl-icd-opencl-dev
0 upgraded, 1 newly installed, 1 to remove and 0 not upgraded.
The following packages will be REMOVED:
  ocl-icd-opencl-dev
The following NEW packages will be installed:
  nvidia-opencl-dev
Removing nvidia-opencl-dev:amd64 (7.5.18-0ubuntu1) ...
Selecting previously unselected package ocl-icd-opencl-dev:amd64.
Preparing to unpack .../ocl-icd-opencl-dev_2.2.8-1_amd64.deb ...
Unpacking ocl-icd-opencl-dev:amd64 (2.2.8-1) ...
Setting up ocl-icd-opencl-dev:amd64 (2.2.8-1) ...
Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 10.0.185
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix             NV
  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     Tesla K80
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  410.73
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Topology (NV)                            PCI-E, 00:00.0
  Max compute units                               13
  Max clock frequency                             823MHz
  Compute Capability (NV)                         3.7
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              11996954624 (11.17GiB)
  Error Correction support                        Yes
  Max memory allocation                           2999238656 (2.793GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        212992
  Global Memory cache line                        128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             4096x4096x4096 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max constant buffer size                        65536 (64KiB)
  Max number of constant args                     9
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 No
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  NVIDIA CUDA
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [NV]
  clCreateContext(NULL, ...) [default]            Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform
ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.8
  ICD loader Profile                              OpenCL 1.2
	NOTE:	your OpenCL library declares to support OpenCL 1.2,
		but it seems to support up to OpenCL 2.1 too.
Sun Nov 11 21:43:59 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |

success after reboot :
script test 69 works after the reboot

(screenshot: test 69 working after reboot)

conclusion : a manual reboot can't be avoided, even with vms that include the nvidia driver, due to the conflict between the opencl-dev packages
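(for reference, the conflict can be seen directly with standard apt/dpkg queries, nothing azure-specific :)

dpkg -l | grep -E "nvidia-opencl-dev|ocl-icd-opencl-dev"
apt-cache show nvidia-opencl-dev | grep -i conflicts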

last option :
test 70, trying --force-yes to avoid reboot


/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update ; sudo apt-get remove --force-yes --yes nvidia-opencl-dev ocl-icd-opencl-dev ; sudo apt-get -y install ocl-icd-opencl-dev clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo && nvidia-smi ; rm -r leela-zero ;  sudo -i && uname -a && clinfo && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && nvidia-smi && clinfo
fi'

result : doesn't work

test 71 : trying sudo apt-get autoremove with remove -y , or apt-get -f -m


/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade && sudo apt-get -y -f -m install nvidia-opencl-dev clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && clinfo && nvidia-smi ; rm -r leela-zero ; sudo -i && uname -a && clinfo && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && nvidia-smi && clinfo
fi'

debug result :

Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
 nvidia-opencl-dev : Conflicts: opencl-dev
 ocl-icd-opencl-dev : Conflicts: opencl-dev
                      Recommends: libpoclu-dev but it is not installable

workaround suggested :
first install ocl-icd
then install nvidia-opencl

this is consistent with the behaviour observed after reboot :

test 73 :

/bin/bash -c 'PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && uname -a && sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update ; sudo apt-get -y -f -m remove opencl-dev ; sudo apt-get -y -f -m install nvidia-opencl-dev clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 qt5-default qt5-qmake curl && clinfo && nvidia-smi ; rm -r leela-zero ; sudo -i && uname -a && clinfo && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2
else
  sudo -i && uname -a && nvidia-smi && clinfo
fi'

conclusion : because of the broken dependencies in the custom image provided by microsoft, and the need to reboot to fix them with -f, i prefer to go with a blank default ubuntu 18.04 lts and the test 80 script below in the job scheduler, with the following settings :

  • in pool : max number of tasks : 20, 1 or more low-priority nc6v3 nodes
  • in job schedule : run exclusive true, retry limit unlimited, kill job at completion false, recurrence schedule after every 1 hour, behaviour after task completes : no action (x2), task autouser admin,

script used :
test 80 :

then manually reboot when stdout.txt gets to the leela-zero step (opencl needs a reboot, otherwise clinfo reports 0 platforms); last question remaining before writing the instructions :

question :

  1. after preemption, is the low priority node deleted or just rebooted ?
  2. after preemption, can task scheduler restart the script automatically after some time ?

edit : outdated, see final instructions in the comment below :

@wonderingabout
Contributor Author

wonderingabout commented Nov 13, 2018

edit : outdated
see final version azure instructions with low priority cost in this google doc :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

@wonderingabout wonderingabout changed the title (ENTIRELY AUTOMATED) Google Cloud Tesla V100 Free Trial : video+text tutorial with managed instance group (ENTIRELY AUTOMATED) Google Cloud + Microsoft Azure Tesla V100 Free Trial : video+text tutorial with managed instance group Nov 13, 2018
@wonderingabout
Contributor Author

wonderingabout commented Nov 13, 2018

IMPORTANT UPDATE : 13 November 2018 !!!
TO ALL THOSE WHO USED THE GOOGLE CLOUD FREE TRIAL INSTRUCTIONS BEFORE 13 NOVEMBER 2018 :
THE OLD STARTUP-SCRIPT DOESN'T WORK ANYMORE !
YOU NEED TO DELETE YOUR INSTANCE GROUP AND INSTANCE TEMPLATE, AND CREATE A NEW TEMPLATE WITH THE UPDATED STARTUP-SCRIPT PROVIDED IN THE GOOGLE DOC (i updated it)

see google doc at page 10 :
https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit

the old script shows this error (thanks @herazul for finding it) :

(screenshot of the error reported by @herazul)

the new script is simplified, so it is unlikely to produce such issues in the future :


#!/bin/bash
# glances is installed at the very end of the first-boot branch, so its presence
# is used as a marker that the machine is already fully set up
PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  # first boot : install the nvidia driver and all build dependencies, build leela-zero
  # and autogtp, mark the setup as done by installing glances, then reboot
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update && sudo apt-get -y -f install nvidia-driver-410 nvidia-opencl-dev && sudo apt-get -y -f install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && sudo apt-get -y -f install glances zip && sudo apt-get clean && sudo reboot
else
  # every following boot (including restarts after preemption) : just start autogtp
  sudo -i && cd /leela-zero/autogtp && ./autogtp -g 2
fi
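(side note, not required by the tutorial : the same script can also be attached as startup-script metadata from the command line instead of the web console; a sketch assuming the gcloud SDK, with placeholder names, and machine/image/disk settings to be adjusted to whatever the doc specifies :)

gcloud compute instance-templates create lz-v100-template \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-v100,count=1 \
    --maintenance-policy=TERMINATE \
    --preemptible \
    --image-family=ubuntu-1604-lts --image-project=ubuntu-os-cloud \
    --boot-disk-size=50GB \
    --metadata-from-file startup-script=startup.sh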

(screenshots of the instance-template modifications and @herazul's confirmation)

@wonderingabout
Contributor Author

wonderingabout commented Nov 14, 2018

edit : outdated
see final version azure instructions with low priority cost in this google doc :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

today's microsoft azure test run with NC6v3 (low-priority cost) :

236 games in 1378 minutes :

(screenshot of the run)

edit : when the stdout.txt file gets too big (>1MB), it cannot be displayed in the portal, but you can click the download button :

(screenshot: stdout.txt exceeding the display limit)

for example :
stdout.txt

can be viewed here for test80 :

http://m.uploadedit.com/bbtc/1542202923658.txt
http://www.uploadedit.com/_upload-documents-checkstatus.htm?new_name=1542202923658.txt

can be viewed here for test81 :

http://m.uploadedit.com/bbtc/15422043546.txt
http://www.uploadedit.com/_upload-documents-checkstatus.htm?new_name=15422043546.txt

@wonderingabout
Contributor Author

wonderingabout commented Nov 14, 2018

edit : outdated
see final version azure instructions with low priority cost in this google doc :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

@alreadydone

great news !

after preemption, the low-priority node is not deleted !
it reboots, and after some time the scheduled job runs again (i set it to a 30-minute recurrence, which could be lowered to 20 minutes) !
so there is no need to reinstall all the system packages and redo everything !

this is what automatically happens at preemption :

(screenshots of the restart statistics)

microsoft azure instructions are now complete !

after preemption :

(screenshots of the node restarting after preemption)

all that is left is to write them up, but i may go ahead and record them on video first, seeing how unintuitive the process is

i went ahead and updated the main instructions for azure

@Iwtbm

Iwtbm commented Nov 14, 2018

I just set up my GCP account this morning but I get the following error message when starting the instance:
Quota 'GPUS_ALL_REGIONS' exceeded. Limit: 0.0 globally.

I have upgraded my account. Now it is bronze level. What should I do now???
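(for reference, the current value of this quota can be checked from Cloud Shell, assuming the gcloud SDK is set up for the project; sketch :)

gcloud compute project-info describe | grep -B1 -A1 GPUS_ALL_REGIONS

the limit next to the GPUS_ALL_REGIONS metric should read 1.0 or more once the upgrade or quota request has gone through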

@wonderingabout
Contributor Author

wonderingabout commented Nov 14, 2018

@Iwtbm

i updated the google doc when they last changed their policy (page 1) :

https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit

My account is old so i can't test it for you (we didn't have this issue in the past), but from what i read on the internet, you need to upgrade to a "pay as you go" option to unlock gpu access (the quota will be increased from 0 to 1)

"pay as you go" (it's the term they use in microsoft azure, not sure it's the same for google cloud) means that the free trial is still free credit, but when it ends you have to manually cancel it, or you'll be charged for any consumption that goes beyond the 300 dollars of free trial credit

@Iwtbm

Iwtbm commented Nov 14, 2018

@wonderingabout

Thanks for your reply. I read your doc before and I have upgraded. But the error is still there. I don't know what to do...

@Iwtbm

Iwtbm commented Nov 14, 2018

I contacted google support, and they allocated a gpu quota to me for that project. The problem is solved. Thanks.

@wonderingabout

@lwtbm

thanks for your feedback too
so what was the issue ?

is this procedure now needed for every new google cloud user, or was your case specific ?

i'm asking because if there is something i need to add to the google doc, i'd like you to tell me

thanks

@Iwtbm

Iwtbm commented Nov 14, 2018

I don't know... I just started this morning. Maybe all new users have the same problem.

@wonderingabout

i see

you contacted them via email right ?

i will then add a small note mentioning that if the problem persists, you should email support to increase the quota, right ?

@Iwtbm

Iwtbm commented Nov 14, 2018

Yes.

@wonderingabout

wonderingabout commented Nov 14, 2018

ok, thanks

i just updated the google cloud doc at page 1 if you want to take a look : https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit

@wonderingabout

wonderingabout commented Nov 14, 2018

edit : outdated
see final version azure instructions with low priority cost in this google doc :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

i also want to mention that i opened another topic on the lczero group when asking for help or advice :
https://groups.google.com/forum/#!topic/lczero/gH6zmsEdIFw

the datascience ubuntu batch works after preemption without needing a reboot !
an early no-resign game got in the way though :

[screenshot: autogtp output after preemption]

@wonderingabout

wonderingabout commented Nov 15, 2018

edit : outdated
see final version azure instructions with low priority cost in this google doc :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

as a reminder, this is the script that installs the very latest packages, but it needs a manual reboot at first boot, and then another manual reboot after every preemption :

test 81 :

/bin/bash -c 'sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update && sudo apt-get -y install nvidia-driver-410 linux-headers-generic nvidia-opencl-dev && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl ; rm -r leela-zero ; clinfo ; nvidia-smi ; git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && ./autogtp -g 2'

see the clinfo and tuning outputs of the datascience preinstalled packages versus the very latest ppa packages here :

@wonderingabout

wonderingabout commented Nov 15, 2018

edit : outdated
see final version azure instructions with low priority cost in this google doc :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

@wonderingabout

wonderingabout commented Nov 16, 2018

**VERSION OF 16 NOVEMBER 2018 :**

INSTRUCTIONS TO USE MICROSOFT AZURE FREE TRIAL WITH A TESLA V100 WITH LOW PRIORITY COST, FOR LEELA ZERO :

you can see the doc version here :

https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing

@alreadydone

@wonderingabout

wonderingabout commented Nov 26, 2018

update :

added google cloud free trial instructions for quota requests, with screenshots, for :

  • GPU (all regions) Global : quota increase from 0 to 1
  • Preemptible CPUs in every region : quota increase from 0 to 24

see page 3 of the google doc :
https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit

sample :

[screenshot: sample quota increase request]

@wonderingabout

wonderingabout commented Dec 7, 2018

update 07 december 2018 :
trying a much more compact script :

#!/bin/bash
# glances is used as a sentinel package : if it is missing, this is the first
# boot, so install everything, build leela-zero and reboot ; on later boots
# (including after a preemption) just start autogtp with 2 parallel games
PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
  echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
  sudo -i && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update && sudo apt-get -y -f install nvidia-driver-410 libboost-dev libboost-program-options-dev libboost-filesystem-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev clinfo cmake qt5-default qt5-qmake curl git zip glances && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cp leelaz autogtp && sudo reboot
else 
  sudo -i && cd /leela-zero/build/autogtp && ./autogtp -g 2
fi

i got inspired by the recent topics about making the instructions clearer

here : #1983
and here : #2071

RESULT : WORKS !
(i was using only 2 vcpu for this test instead of 4, so it may have been slower)
it is now the new main script in the google cloud instructions

[screenshot: the new script running]

for reference, old script :

#!/bin/bash
PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances|grep "install ok installed")
echo Checking for glanceslib: $PKG_OK
if [ "" == "$PKG_OK" ]; then
echo "No glanceslib. Setting up glanceslib and all other leela-zero packages."
sudo -i && sudo add-apt-repository -y ppa:graphics-drivers/ppa && sudo apt-get update && sudo apt-get -y -f install nvidia-driver-410 nvidia-opencl-dev && sudo apt-get -y -f install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev qt5-default qt5-qmake curl && git clone https://github.com/gcp/leela-zero && cd leela-zero && git submodule update --init --recursive && mkdir build && cd build && cmake .. && cmake --build . && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . && sudo apt-get -y -f install glances zip && sudo apt-get clean && sudo reboot
else
sudo -i && cd /leela-zero/autogtp && ./autogtp -g 2
fi

@wonderingabout

wonderingabout commented Feb 2, 2019

update 02 february 2019
the startup script has been updated with the new repo owner leela-zero instead of gcp,

the repo owner was changed in the scripts of both the google cloud tutorial (main script fixed ; optional script needs no change) and the microsoft azure tutorial :

see this discussion for more details : #2157 (comment)
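
concretely, the relevant change in the scripts is the clone url (the old gcp/leela-zero address should still redirect on github, but the new one is cleaner) :

# old :
git clone https://github.com/gcp/leela-zero
# new :
git clone https://github.com/leela-zero/leela-zero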

@ozymandias8

ozymandias8 commented Apr 11, 2019

@wonderingabout

I am unable to compile 0.17 in Microsoft Azure due to an error. The script for use in Azure needs to be updated to reflect these compiler changes.

[ 62%] Linking CXX executable tests
gtest/googlemock/gtest/libgtest.a(gtest-all.cc.o)

See more discussion here: #2303

Edit: I was able to get the new compilers installed on Azure Ubuntu 16.04 using a separate "task" but was not able to assign them as the default compilers, and therefore my leela-zero script won't work. Sorry, I'm just a Go player, not so much a programmer. If anyone could let me know how to assign the newer compiler as the default I would appreciate it.
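
for reference, a hedged sketch of one common way to install a newer g++ on ubuntu 16.04 and point the build at it (untested on the azure image) :

# hedged sketch : get gcc-7/g++-7 from the ubuntu-toolchain-r PPA on 16.04
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt-get update && sudo apt-get -y install gcc-7 g++-7
# option 1 : tell cmake which compilers to use for this build only
cmake .. -DCMAKE_C_COMPILER=gcc-7 -DCMAKE_CXX_COMPILER=g++-7
# option 2 : make gcc-7/g++-7 the system default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 60 --slave /usr/bin/g++ g++ /usr/bin/g++-7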
