# Using FABRIC GPUs

Your compute nodes can include GPUs. These devices are made available as FABRIC components and can be added to your nodes like any other component.

This example notebook will demonstrate how to reserve and use Nvidia GPU devices on FABRIC.


## Set Up the Experiment

#### Import FABRIC API

In [None]:
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

try:
    fablib = fablib_manager()

    fablib.show_config()
except Exception as e:
    print(f"Exception: {e}")

## Create a Node

The cell below creates a slice that contains a single node. The node includes a GPU component.

### Set the Slice Name and FABRIC Site

Use a filter function to find random sites with your desired GPUs.
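
As a rough illustration of what such a filter does, the sketch below applies the same kind of lambda to a plain list of dictionaries. The site names and counts are hypothetical, and `matching_sites` is a helper invented for this example, not part of fablib; only the `rtx6000_available` and `tesla_t4_available` keys come from this notebook.

```python
# Hypothetical site records; real records come from FABRIC and carry more fields.
sites = [
    {"name": "SITE_A", "rtx6000_available": 2, "tesla_t4_available": 0},
    {"name": "SITE_B", "rtx6000_available": 0, "tesla_t4_available": 1},
    {"name": "SITE_C", "rtx6000_available": 1, "tesla_t4_available": 3},
]


def matching_sites(records, filter_function):
    """Return the names of the records for which filter_function is true."""
    return [record["name"] for record in records if filter_function(record)]


# The same predicates used in this notebook, applied to the sample records.
rtx_candidates = matching_sites(sites, lambda x: x["rtx6000_available"] > 0)
tesla_candidates = matching_sites(sites, lambda x: x["tesla_t4_available"] > 0)

# Predicates can be combined, for example to require both GPU models at one site.
both_candidates = matching_sites(
    sites,
    lambda x: x["rtx6000_available"] > 0 and x["tesla_t4_available"] > 0,
)

print(rtx_candidates, tesla_candidates, both_candidates)
```

A random choice among the matching candidates then gives behavior similar to picking a random site that satisfies the filter.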


In [None]:
slice_name="MySlice"

rtx6000_site = fablib.get_random_site(filter_function=lambda x: x['rtx6000_available'] > 0)
tesla_site = fablib.get_random_site(filter_function=lambda x: x['tesla_t4_available'] > 0)

rtx6000_node_name='rtx1'
tesla_node_name='tesla1'


In [None]:
try:
    # Create slice
    slice = fablib.new_slice(name=slice_name)

    # Add node
    rtx_node = slice.add_node(name=rtx6000_node_name, site=rtx6000_site)
    rtx_node.add_component(model='GPU_RTX6000', name='gpu1')

    tesla_node = slice.add_node(name=tesla_node_name, site=tesla_site)
    tesla_node.add_component(model='GPU_TeslaT4', name='gpu1')


    # Submit slice request
    slice.submit()
except Exception as e:
    print(f"Exception: {e}")

## Get the Slice

Retrieve the slice by name and display its details.

In [None]:
try:
    slice = fablib.get_slice(name=slice_name)
    slice.show()
except Exception as e:
    print(f"Exception: {e}")

## Get the Nodes

Retrieve each node and its GPU component, and display their details.


In [None]:
try:
    rtx_node = slice.get_node(rtx6000_node_name) 
    rtx_node.show()
    
    rtx_gpu = rtx_node.get_component('gpu1')
    rtx_gpu.show()
    
    tesla_node = slice.get_node(tesla_node_name) 
    tesla_node.show()
    
    tesla_gpu = tesla_node.get_component('gpu1')
    tesla_gpu.show()
except Exception as e:
    print(f"Exception: {e}")

Use the RTX6000 node for the rest of the example.

In [None]:
node = rtx_node

### GPU PCI Device

Run the command `lspci` to see your GPU PCI device(s). This is the raw GPU PCI device, which is not yet configured for use. Once the drivers are installed, you can use it as you would any other GPU.
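
If you would rather filter the `lspci` output in Python than with `grep`, a minimal sketch might look like the following. The sample text is illustrative and was not captured from a real node, and `gpu_lines` is a hypothetical helper. Note that alternation is written `|` in a Python regular expression, whereas basic `grep` uses `\|`.

```python
import re

# Same alternation as the grep pattern, written as a Python regular expression.
GPU_PATTERN = re.compile(r"NVIDIA|3D controller")


def gpu_lines(lspci_output):
    """Return the lines of lspci output that look like NVIDIA GPU devices."""
    return [line for line in lspci_output.splitlines() if GPU_PATTERN.search(line)]


# Illustrative input only; real device names and addresses will differ.
sample_output = """00:03.0 Ethernet controller: Example Vendor network device
00:07.0 3D controller: NVIDIA Corporation Example GPU (rev a1)
00:08.0 Unclassified device: Example Vendor memory balloon"""

for line in gpu_lines(sample_output):
    print(line)
```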

View the node's GPU:

In [None]:
# Raw string so that the grep alternation `\|` is not treated as a Python escape sequence.
command = r"sudo dnf install -q -y pciutils && lspci | grep 'NVIDIA\|3D controller'"
try:
    stdout, stderr = node.execute(command)
except Exception as e:
    print(f"Exception: {e}")

## Install Nvidia Drivers

Now, let's run the following commands to install the latest CUDA driver and the CUDA libraries and compiler.

In [None]:
commands = [
    'sudo dnf install -q -y epel-release',
    'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
    'sudo dnf install -q -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda'
]
try:
    print("Installing CUDA...")
    for command in commands:
        stdout, stderr = node.execute(command)
    print("Done installing CUDA. Now, reboot for the changes to take effect.")
except Exception as e:
    print(f"Fail: {e}")

And once CUDA is installed, reboot the machine.
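
After a reboot the node is unreachable for a while, so the cell below waits for SSH with a timeout and a polling interval. The general pattern can be sketched independently of fablib as follows. This is not how fablib implements its wait; `wait_until` and `fake_ssh_check` are hypothetical names used only for illustration.

```python
import time


def wait_until(check, timeout=360, interval=10, sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in for a real reachability test: succeeds on the third attempt.
attempts = {"count": 0}


def fake_ssh_check():
    attempts["count"] += 1
    return attempts["count"] >= 3


reachable = wait_until(fake_ssh_check, timeout=5, interval=0)
print(f"reachable={reachable} after {attempts['count']} attempts")
```

A longer interval reduces the number of connection attempts, while the timeout bounds how long a node that never comes back can block the notebook.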

In [None]:
reboot = 'sudo reboot'
try:
    print(reboot)
    node.execute(reboot)
    
    slice.wait_ssh(timeout=360,interval=10,progress=True)

    print("Now testing SSH abilities to reconnect...",end="")
    slice.update()
    slice.test_ssh()
    print("Reconnected!")

except Exception as e:
    print(f"Fail: {e}")

## Testing the GPU and CUDA Installation

First, verify that the Nvidia drivers recognize the GPU by running `nvidia-smi`.
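
The default `nvidia-smi` table is meant for reading. For scripted checks, `nvidia-smi` also accepts `--query-gpu` with `--format=csv,noheader`, which emits one comma-separated line per GPU. The sketch below parses such output. The sample line is a made-up placeholder, not real output, and `parse_gpu_csv` is a hypothetical helper.

```python
import csv
import io

# Command you could pass to node.execute() to get machine-readable output.
QUERY = "nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader"
FIELDS = ["name", "driver_version", "memory_total"]


def parse_gpu_csv(text):
    """Turn the CSV output of the query above into a list of dictionaries."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]


# Placeholder values for illustration; a real node reports its own values.
sample_stdout = "Example GPU, 000.00.00, 24576 MiB\n"

gpus = parse_gpu_csv(sample_stdout)
print(gpus)
```

An empty list from `parse_gpu_csv` would suggest that the driver does not see any GPU.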

In [None]:
try:
    stdout, stderr = node.execute("nvidia-smi")
    print(f"stdout: {stdout}")
except Exception as e:
    print(f"Exception: {e}")

Now, let's upload the following "Hello World" CUDA program file to the node.

`hello-world.cu`

*Source: https://computer-graphics.se/multicore/pdf/hello-world.cu*

*Author: Ingemar Ragnemalm*

>This file is from *"The real "Hello World!" for CUDA, OpenCL and GLSL!"* (https://computer-graphics.se/hello-world-for-cuda.html), written by Ingemar Ragnemalm, programmer and CUDA teacher. The only change from the original file on the website is an added `#include <unistd.h>`, because `sleep()` is declared in the `unistd.h` header.

In [None]:
node.upload_file('./hello-world.cu', 'hello-world.cu')

We now compile the `.cu` file using `nvcc`, the CUDA compiler tool installed with CUDA. In this example, we create an executable called `hello_world`.

In [None]:
try:
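    # /usr/local/cuda is normally a symlink to the installed CUDA version, so no version number is hardcoded.
    stdout, stderr = node.execute("/usr/local/cuda/bin/nvcc -o hello_world hello-world.cu")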
except Exception as e:
    print(f"Exception: {e}")

Finally, run the executable:

In [None]:
try:
    stdout, stderr = node.execute("./hello_world")
    print(f"stdout: {stdout}")
except Exception as e:
    print(f"Exception: {e}")

If you see `Hello World!`, the CUDA program ran successfully. `World!` was computed on the GPU by adding an array of offsets to the string `Hello `, and the result was printed to stdout.
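
The same arithmetic can be reproduced on the CPU in a few lines of Python. This is only a sketch of the idea, not the CUDA program itself: on the GPU each thread adds one offset to one character, whereas here a loop does the additions sequentially. The offsets are chosen so that each character of `Hello ` maps to the corresponding character of `World!`.

```python
# Start string and per-character offsets ('H' + 15 == 'W', 'e' + 10 == 'o', ...).
start = "Hello "
offsets = [15, 10, 6, 0, -11, 1]

# Each GPU thread would handle one index; here a comprehension walks all of them.
result = "".join(chr(ord(ch) + off) for ch, off in zip(start, offsets))

print(start + result)
```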

### Congratulations! You have now successfully run a program on a FABRIC GPU!

## Clean Up Your Experiment

In [None]:
try:
    slice = fablib.get_slice(name=slice_name)
    slice.delete()
except Exception as e:
    print(f"Exception: {e}")