(ENTIRELY AUTOMATED) Google Cloud + Microsoft Azure Tesla V100 Free Trial : video+text tutorial with managed instance group #1905
Comments
interesting
lc0 also began using this. http://blog.lczero.org/2018/10/contributing-to-leela-chess-zero.html?m=1#more
yes, as they said, the preemptible V100 is really the most efficient use of the free credit
after a quick search, i found that microsoft azure and oracle cloud offer a free trial credit for GPU cloud computing :
edit 2 :
SHORT VERSION : minimalist and quick tutorial (no explanations) :
LONG VERSION : long and detailed text instructions :
here is what i get with next branch (includes 3 no-resign games of 700 moves and 2200 seconds each) :
@wonderingabout That seems slow. If you are using V100 you should be getting moves considerably faster. My results:
What are your benchmark results? My results:
if you have one no-resign game, it can last one hour (60 minutes) even on a V100 (723 moves). so imagine the worst-case scenario where you get only no-resign games and they all last one hour each: you'd produce 6 games in 6 hours (but these games are important for the learning process and shouldn't be skipped). also, 60 games in 6 hours (10 games/hour), including some no-resign games, is not slow at all. questions for you :
I'm running LZ next with half precision enabled on google cloud V100. The benchmark uses 3200 visits.
@seopsx for information, on average i get about the same speed, no big change (includes no-resign games). for the benchmark, i'd need to know the precise command you ran to start it. also, you may want to write standardized steps for using autogtp with the next branch, starting from part 3 of the text instructions i sent. thanks
@wonderingabout There's a bug which makes LZ choose single-precision FP even when it's slower than half. See #1887. For the time being you can use this patch to increase efficiency: #1888
i tried using cmake on the master branch followed by your command, but nothing happens after i input it. this is what i did : DELETED, see my next messages for correct instructions. this is what i get (nothing happens): did i do something wrong ?
Are you sure the weight file exists under that filename? |
not at all, i just blindly copied your command |
updated instructions 4 october 2018 : commands are shorter |
The bug is now fixed on next branch. I encourage everyone to do "git checkout next" and "git pull" inside leela-zero directory and follow the README instructions to build it again. |
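The rebuild described above would look roughly like this (a sketch only; the clone location is an assumption, and the dependency list lives in the README):

```shell
# Sketch: switch an existing leela-zero checkout to the next branch and
# rebuild, following the README's cmake flow.
cd ~/leela-zero        # assumed clone location
git checkout next
git pull
mkdir -p build && cd build
cmake ..
cmake --build .        # note: this also builds autogtp, inside build/autogtp
```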
then i'll create another instance and try autogtp's performance directly. i will let you know when i do it and what results i get. i wish we'd get a new master release soon though, because the current master is old
thanks, i followed your instructions, compiled next (in a sub-directory) and ran it successfully, with these instructions : after you have installed master and successfully run autogtp, exit the instance and restart it, then : DELETED (see my next message for correct instructions) : #1905 (comment) however, i didn't see a significant speed difference. is it because it's the first run, or are there extra options i should enable ?
You are using the wrong directory. Autogtp is inside "build" when you compile it like this:
i understand, thank you |
i successfully managed to run the next branch, it is indeed around 30% faster than the master branch. i updated the pastebin tutorial to add next branch instructions (part 3b), and also simplified the commands (one less reboot needed) : https://pastebin.com/552UN25c after you run autogtp on the master branch, exit the instance (chrome exit button) and click again on SSH to restart it, then copy paste all this selection :
you can see that it's not the master version because these lines don't exist in the current master version, among others : thanks again for your help @seopsx. one question i want to ask: doesn't the V100 support half precision in leela zero ? edit : today's run so far : it is really much faster on the next branch (the master branch would have played around 30-32 games in the same time). includes 3 no-resign games that lasted 700 moves and 2200 seconds each (vs 3200 seconds on the master branch)
@wonderingabout You can remove the unnecessary step "compile autogtp binary". If you look closely, the command "cmake --build ." also builds autogtp.
@seopsx last question: doesn't leela zero support half precision ?
I seem to remember that someone mentioned NVIDIA cards don't support half-precision compute through OpenCL, but only through cuDNN or so. (Edit: link here #1689 (comment)) However, I heard that the latest NVIDIA driver sped things up by 7% on 1080 Ti.
i see, i will try with beta/fresh nvidia drivers if i find how to include the ppa, or i'll use a direct install. edit : currently, nvidia's latest driver version is 396.44, and there is no beta driver for the tesla v100 (and 1080ti), i checked here : so going with the ppa (396.54 of 23rd august 2018) is just the easiest i think, i will let you know how it goes : https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa edit 2 : successfully installed the nvidia short-lived branch, added instructions in part 1b of the pastebin tutorial :
i will let you know if i see any major performance difference with the latest 396 drivers compared to the 390 long-lived branch. edit 3 : with 11 games in 60 minutes on the next branch, nvidia-396 is actually around 15% slower than nvidia-390 (long-lived branch, 13-14 games in 60 minutes), and 625 moves took 2400 seconds (vs 700 moves in 2200 seconds with the long-lived nvidia 390, so yes, the short-lived nvidia 396 with the next branch is significantly slower). edit 4 : actually i was using autogtp master with leelaz next, but with autogtp next, nvidia-driver-396 is not slower than nvidia-390. so from my tests, i'd recommend to run the short-lived nvidia branch (part 1b) with the next branch (parts 3a+3b)
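For reference, installing a driver from that PPA goes roughly like this (a sketch; nvidia-396 is the version discussed above and is long outdated):

```shell
# Sketch: install a short-lived driver branch from the graphics-drivers PPA,
# then reboot so the new kernel module loads.
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install -y nvidia-396   # version from the comment above; outdated
sudo reboot
```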
Sorry, the speedup from ~850 n/s to ~910 n/s on 1080 Ti happens on Windows and the driver is probably https://www.nvidia.com/download/driverResults.aspx/138567/en-us (NVIDIA GeForce 416.16 WHQL driver for Windows 10 v1809). However, v1809 had devastating effects on some computers, and seems to be unavailable now. |
no problem, @alreadydone @roy7 @gcp now that the instructions are finalized, can we add the google cloud free trial instructions to the github "i want to help" page, just before the google colab instructions : https://github.com/gcp/leela-zero#i-want-to-help in both master and next branches
edit 2 : i read that the tesla v100 supports multi-threading, here (ctrl+f "thread") : https://images.nvidia.com/content/technologies/volta/pdf/437317-Volta-V100-DS-NV-US-WEB.pdf so i'm trying ./autogtp --gamesNum 3 (3 games running on 3 different gpu threads at the same time with only one gpu); i will let you know if i see a significant performance boost
edit 3 : with 14 games/hour, --gamesNum 3 is around the same speed as 1 game at a time, but i suspect that in the long run (with a 0% resign game taking all the gpu time) it should be faster
ERROR: Could not talk to engine after launching.
however, with this 2vcpu/3.75gb ram, the cpu was at 100%, unlike with 1 game at a time (50%), so i'll try creating instances with 4/6/8 vcpu and 8/16gb ram to check there was no bottleneck slowing things down; i will update this with the results i get
edit 4 : with 8vcpu 8gb ram, the max setting is --gamesNum 7 ; more generates the same error as above. i will try to increase ram to 16gb and maybe later use an ssd instead. with 8vcpu 8gb ram, 1 game at a time uses 12.5% cpu and --gamesNum 7 uses 86.5% cpu, so even if we increase ram we may only get a few more simultaneous games at most
@wonderingabout You can edit (click the pen icon) https://github.com/gcp/leela-zero/blob/next/README.md and after that you'll be directed to submit a pull request. |
@alreadydone since i'm starting to run out of google cloud credit, i can't experiment a lot with it, but i'll try
edit : outdated https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing for microsoft azure, i'll try debugging with minimalistic scripts and see how it goes : try 1 :
then clicking on reboot button : edit : this may be useful for later debugging :
after reboot, it still doesn't detect clinfo, even though it is already installed :
and the automation script too. interesting read : https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-linux as it explains that a reboot will always break the startup script (which doesn't run for more than 90 minutes anyway), which makes us move to try 2 : try 2 : using a distro that has the gpu driver already installed, thus avoiding the need to reboot, and starting with a job schedule directly (with recurrence)
as centos uses yum : this datascience distro is promising : https://azuremarketplace.microsoft.com/en-US/marketplace/apps/microsoft-ads.linux-data-science-vm try 3 for centos datascience :
what it returns :
try 3.5 this vm image already includes gpu driver :
try 4 : using an all-in-one ubuntu install (with rm -r leela-zero included, without &&), then scheduling a reboot every 3 hours with the job scheduler :
job schedule, every 3 hours :
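For reference, the same periodic reboot could be expressed with a plain cron entry instead of the Azure portal's job schedule (a hypothetical sketch; the thread itself used the portal):

```shell
# Hypothetical equivalent of the 3-hourly job schedule, using cron:
# reboot the VM every 3 hours, on the hour.
echo "0 */3 * * * root /sbin/reboot" | sudo tee /etc/cron.d/leela-reboot
```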
edit : outdated https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing since i got my account flagged due to excessive use, i wasn't able to continue the tests; now back to it : https://docs.microsoft.com/en-us/azure/scheduler/scheduler-get-started-portal some cloud images come with the gpu driver preinstalled, thus not needing a reboot
test 64 :
test 65 :
test 66 :
test 67 :
test 69 :
test 70 :
debug : debug : error :
after reboot, it uninstalls then reinstalls opencl ... weird :
then just after :
success after reboot : conclusion : a manual reboot can't be avoided, even with vms with the nvidia driver included, due to a conflict with the opencl-dev packages. last option :
result : doesn't work. test 71 : trying sudo apt-get autoremove with remove -y , or apt-get -f -m
debug result :
the suggested workaround : is consistent with the behaviour observed after reboot : test 73 :
conclusion : because of broken dependencies in the custom image provided by microsoft, and the need to reboot to fix these with -f, i prefer to go with a blank default ubuntu 18.04 lts, with the script below of test 80 in the job scheduler, with the following settings :
script used : then manually reboot when stdout.txt gets to leela-zero (opencl needs a reboot, or you get "number of platforms 0"). last question remaining before writing the instructions : question :
edit : outdated, see final instructions in the comment below :
edit : outdated https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing |
IMPORTANT UPDATE : 13 november 2018 !!! see the google doc at page 10 : the old script shows this error (thanks @herazul for finding it). the new script is simplified, so unlikely to produce such issues in the future :
edit : outdated https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing microsoft azure cloud's test run today with NC6v3 (low priority cost) : 236 games in 1378 minutes : edit : when the stdout.txt file size is too big (>1MB), it cannot be displayed, but click on the download button : for example : can be viewed here for test 80 : http://m.uploadedit.com/bbtc/1542202923658.txt and here for test 81 : http://m.uploadedit.com/bbtc/15422043546.txt
edit : outdated https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing great news ! after preemption, the low priority node is not deleted ! this is what automatically happens at preemption : the microsoft azure instructions are now complete ! after preemption : all that is left is to write them, but i may go ahead and record them on video first, seeing how unintuitive it is. i went ahead and updated the main instructions for azure
I just set up my GCP this morning but I get the following error message when starting the instance: I have upgraded my account. Now it is bronze level. What should I do now?
@lwtbm i updated the google doc the last time they included their policy change (page 1) : https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit my account is old so i can't test it for you (we didn't have this issue in the past), but from what i read on the internet you need to upgrade to a "pay as you go" option to unlock gpu access (the quota will be increased from 0 to 1). "pay as you go" (it's the wording they use in microsoft azure, not sure it's the same for google cloud) : it means that the free trial is still free credit, but when it ends you have to manually cancel it, or you'll be charged for any consumption that goes beyond the 300 dollars free trial
Thanks for your reply. I read your doc before and I have upgraded. But the error is still there. I don't know what to do... |
I contact the google support, and they reallocate me a gpu for that project. The problem is solved. Thanks. |
@lwtbm thanks for your feedback too. is this procedure now needed for every google cloud user, or was your case specific ? i'm asking because if there is something i need to add to the google doc, i'd like you to tell me. thanks
I don't know... I just started this morning. Maybe all new users have the same problem. |
i see, you contacted them via email, right ? i will add a small note then, mentioning that if the problem persists, you should email support to increase the quota, right ?
Yes. |
ok, thanks. just updated the google cloud doc (page 1) if you want to look : https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit
edit : outdated https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing i want to mention that i had another topic on lczero where i asked for help and advice : the datascience ubuntu batch works after preemption without needing a reboot !
edit : outdated https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing to remember : this is the script to have the latest packages, but it will need a manual reboot at first boot, then a manual reboot after every preemption : test 81 :
see the clinfo and tuning outputs of the datascience preinstalled or very latest ppa packages here :
edit : outdated https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing |
**VERSION OF 16 NOVEMBER 2018 :** INSTRUCTIONS TO USE THE MICROSOFT AZURE FREE TRIAL WITH A TESLA V100 AT LOW PRIORITY COST, FOR LEELA ZERO : you can see the doc version here : https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing
update : added google cloud free trial instructions for quota requests, with screenshots, for :
see page 3 of the google doc : sample :
update 07 december 2018 :
got inspired by the recent "clearer instructions" topic. RESULT : WORKS ! for reference, the old script :
update 02 february 2019 : the repo owner was changed in the scripts of both the google cloud main (fixed) + optional (no need to change) tutorial, and the microsoft azure tutorial :
see this discussion for more details : #2157 (comment)
I am unable to compile 0.17 on Microsoft Azure due to an error. The script for use on Azure needs to be updated to reflect these compiler changes. [ 62%] Linking CXX executable tests See more discussion here: #2303 Edit: I was able to get the new compilers installed on Azure Ubuntu 16.04 using a separate "task", but was not able to set them as the default compilers, and therefore my leela-zero script won't work. Sorry, I'm just a Go player, not so much a programmer. If anyone could let me know how to set the newer compiler as the default, I would appreciate it.
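For reference, one common way to make a newer compiler the default on Ubuntu is update-alternatives (a sketch, assuming gcc-7/g++-7 were already installed, e.g. from the ubuntu-toolchain-r/test PPA):

```shell
# Sketch: register gcc-7/g++-7 as alternatives and select them as default.
# Assumes /usr/bin/gcc-7 and /usr/bin/g++-7 exist (not part of this thread).
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 70 \
    --slave /usr/bin/g++ g++ /usr/bin/g++-7
sudo update-alternatives --set gcc /usr/bin/gcc-7
```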
hi,
IMPORTANT UPDATE : 13 november 2018 !!!
TO ALL THOSE WHO USED THE GOOGLE CLOUD FREE TRIAL INSTRUCTIONS BEFORE 13 NOVEMBER 2018 :
https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit
OLD STARTUP-SCRIPT DOESN'T WORK ANYMORE !
YOU NEED TO DELETE YOUR INSTANCE GROUP AND INSTANCE TEMPLATE, AND CREATE A NEW TEMPLATE WITH THE UPDATED STARTUP-SCRIPT PROVIDED IN THE GOOGLE DOC (i updated it)
see page 10 of the doc, more details on this github comment : #1905 (comment)
FOR MICROSOFT AZURE TESLA V100 FREE TRIAL WITH LOW PRIORITY COST AUTOMATED INSTRUCTIONS , SEE :
https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing
MAIN MESSAGE FOR GOOGLE CLOUD :
as i said recently on discord, google colab (tesla K80) doesn't work so well, because after 1-2 hours of computing you are disconnected (no gpu backend) and have to wait some time to get a gpu available again... only to use it for 1-2 hours before being blocked again, see their faq : https://research.google.com/colaboratory/faq.html#gpu-availability
however, google cloud is different: they give you, totally free of charge, 300 dollars/257 euros of credit that can be used with a Tesla V100 (0.75$/hour of free credit consumed with preemptible costs; with 300 dollars you are good for a very long time, for free!),
on the condition that your id is verified with a valid credit card, to avoid abuse (but you'll never be charged anything, even when your free trial credit ends)
now, you dont need to manually start autogtp anymore ! see final instructions google doc :
https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit
about speed of the tesla v100 :
i experimented with it, and its very very powerful !
it was slightly slower during the first run in the tutorial, but you can expect to produce games at around this speed :
(./autogtp -g 2 is faster but requires 4vcpu/5.75gb ram)
- if you have a 0% resign game with 700 moves, it will last around 1 hour with the master branch and around 40 minutes with the next branch at the time i'm writing this tutorial
a long run can do something like that :
i myself don't have a very powerful gpu, so i'm glad i can help with the 40b network, which is resource expensive.
from my calculations, one free trial user can produce around 250 games per day at 24/7 usage = games per free trial,
so with 100 people doing this free trial, we'd produce 25 000 extra games per 24 hours, and 175 000 games/week.
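The arithmetic above, spelled out (the numbers are this comment's estimates, not new measurements):

```shell
# Worked version of the estimate: 250 games/day per free-trial user,
# scaled to 100 users.
games_per_day=250                     # one free-trial V100 at 24/7 usage
users=100
daily=$(( games_per_day * users ))    # extra games per 24 hours
weekly=$(( daily * 7 ))
echo "$daily games/day, $weekly games/week"
```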
it doesn't consume your resources, as the V100 runs in the cloud, so just start your instance and forget it; no need to leave the SSH window or your computer open !
~~please dont forget to stop the instance after you exited the ssh window, or the credit is wasted for no reason, as explained in the tutorial~~ no need to stop your instance anymore, as autogtp produces games automatically
last thing: these resources can also be used for other projects, like further training the 192x15, as suggested in #1889
the more people use it, the faster development will be,
please share and thank you for reading this
edit :
after a quick search, i found that microsoft azure and oracle cloud offer a free trial credit for GPU cloud computing :
Azure (200 dollars) :
https://azure.microsoft.com/
Oracle (300 dollars) :
https://cloud.oracle.com/tryit
edit 2 : simplified the commands in the pastebin tutorial, added part 3b instructions to use the next branch (25-30% faster at producing games), and instructions to install the latest nvidia short-lived (fresh) drivers in part 1b
edit 3 : in part 4, run ./autogtp -g 2 instead (requires 4vcpu and 5.75 gb ram). it produces games significantly faster, even with 5% resign games only (16 vs 13 games in 60 minutes, i.e. 265 games/24 hours vs 208 games/24 hours), partly because when a 0% resign game is generated, the extra gpu power of the v100 can be used to produce another game simultaneously
edit 4 :
(remove -g 2 if you chose the weak hardware option, which is slower)
now, as soon as your instance has started (green circle), you don't need to open an SSH window to start producing games anymore (and you should not, or autogtp will run twice, causing the hardware to overload and be slower and unstable), and no need to keep your computer powered on anymore !
sudo journalctl -u google-startup-scripts.service -b -e -f
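For context, the kind of startup-script this relies on can be sketched as follows (package names are taken from the leela-zero README of the time; the paths and the driver setup are assumptions, and the real script in the google doc is longer):

```shell
#!/bin/bash
# Sketch of a GCE startup-script (runs as root on every boot), so a
# preempted instance resumes producing games when it restarts.
# NVIDIA driver/OpenCL installation is omitted here; the real script
# in the google doc handles it.
apt-get update
apt-get install -y git cmake g++ zlib1g-dev libboost-dev \
    libboost-program-options-dev libopenblas-dev opencl-headers \
    ocl-icd-opencl-dev qt5-default curl
git clone https://github.com/leela-zero/leela-zero.git /root/leela-zero || true
mkdir -p /root/leela-zero/build && cd /root/leela-zero/build
cmake .. && cmake --build .
cd autogtp && cp ../leelaz . && ./autogtp -g 2   # output lands in the journal
```

Its output is what the journalctl command above tails.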
edit 5 : (final version) :
edit 6 :
added microsoft azure free trial with low priority costs (80% cheaper), entirely automated
see : https://docs.google.com/document/d/1DMpi16Aq9yXXvGj0OOw7jbd7k2A9LHDUDxxWPNHIRPQ/edit?usp=sharing
edit 7 :
as of 13 november 2018, the old startup-script doesn't work anymore; i updated it !! an update is needed or free credit is going to be wasted ! see : #1905 (comment)
edit 8 :
26 november 2018 : google cloud update, added quota requests for "GPUs (all regions)" Global, and preemptible CPUs in every region,
see page 3 : https://docs.google.com/document/d/1P_c-RbeLKjv1umc4rMEgvIVrUUZSeY0WAtYHjaxjD64/edit
and see this comment : #1905 (comment)
edit 9 :
07 december 2018 : a much shorter and more efficient script for google cloud,
see : #1905 (comment)
edit 10 :
02 february 2019
startup script updated with the new repo owner leela-zero instead of gcp,
see : #1905 (comment)