<a href="https://colab.research.google.com/github/VanadG123/DaSH-Lab-Assignment-2024/blob/main/Copy_of_SOP_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Connecting to Google Drive in Colab

To access files stored in your Google Drive within a Google Colab notebook, you need to mount your Drive to the Colab environment.  This allows the notebook to interact with your Drive files as if they were local.

**Steps:**

1. **Authorization:**  The first time you connect, Colab prompts you to authorize access to your Google Drive. Follow the prompt, sign in to your Google account, and grant permission. Older Colab versions instead display an authorization code that you paste back into the notebook to complete the connection.

2. **Mounting:** After successful authorization, your Google Drive will be mounted to a specific directory within the Colab runtime environment.  You can then use standard file system operations (like `os.listdir`, `open`, etc.) to access your files.


**Important Considerations:**

* **Security:**  `drive.mount` exposes your whole Drive to the notebook rather than individual files, so mount it only in notebooks whose code you trust.

* **Runtime:** The mounted Drive connection is specific to the current Colab runtime. If you reset your runtime or restart your notebook, you'll need to remount your Drive.

* **File Paths:** Pay careful attention to file paths when referencing your Drive files. Files appear under the mount point passed to `drive.mount`, typically `/content/drive/MyDrive/` (older runtimes use `/content/drive/My Drive/`).


**Example (Conceptual):**

After successful mounting, a file located at `My Drive/data/my_file.csv` on your Google Drive would be accessible at `/content/drive/MyDrive/data/my_file.csv` within your Colab notebook.
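
As a minimal sketch of this path mapping, the helper below turns a Drive-relative path into a path under the mount point. The name `drive_path` and its parameters are hypothetical. It checks two folder names because the name of the Drive root has differed between Colab versions, which is worth confirming on your own runtime.

```python
from pathlib import Path


def drive_path(relative, mount_point="/content/drive"):
    """Map a path relative to the Drive root onto the mounted file system.

    Hypothetical helper: tries both folder names Colab has used for the
    Drive root and falls back to 'MyDrive' if neither exists yet.
    """
    root = Path(mount_point)
    for folder in ("MyDrive", "My Drive"):
        if (root / folder).is_dir():
            return root / folder / relative
    return root / "MyDrive" / relative


csv_path = drive_path("data/my_file.csv")
print(csv_path, "exists:", csv_path.exists())
```

Checking `exists()` before opening the file gives a clearer error than a failed `open` when the Drive is not mounted.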


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Ensure GPU Access: Configuring Google Colab to Use GPU

To use the `GPUDevice` class effectively, the code requires access to an **NVIDIA GPU** with **NVML (NVIDIA Management Library) support**. If you're working in **Google Colab**, follow these steps to change the runtime to GPU and ensure the environment is ready for GPU-based operations.

#### Steps to Enable GPU Runtime in Google Colab:
1. **Open Google Colab Notebook:**
   - Navigate to [Google Colab](https://colab.research.google.com) and open a new or existing notebook.

2. **Change the Runtime to GPU:**
   - Click on the **Runtime** menu at the top of the notebook.
   - Select **Change runtime type** from the dropdown.
   - In the dialog box, under **Hardware accelerator**, choose **GPU**.
   - Click **Save**.

3. **Verify GPU Availability:**
   - Run the following code snippet in a new Colab cell to confirm that the GPU is available:
     ```python
     import tensorflow as tf
     print("GPU Available:", tf.config.list_physical_devices('GPU'))
     ```
   - The output should list one or more GPUs if the runtime is correctly set up.

4. **Check for NVIDIA NVML Support:**
   - Google Colab provides access to NVIDIA GPUs, but you need to ensure **NVML** is installed and working. Colab typically includes NVML by default, but you can verify it by running:
     ```python
     !nvidia-smi
     ```
   - This command shows the **GPU status** and verifies if NVML is accessible.

5. **Install Missing Packages (if required):**
   - If a module such as `pynvml` is missing, install it by running:
     ```python
     !pip install nvidia-ml-py3
     ```

6. **Restart the Runtime:**
   - After setting the runtime to GPU and installing required libraries, restart the Colab runtime by clicking **Runtime > Restart runtime**. This ensures the GPU environment is refreshed and ready for use.

7. **Test NVML Initialization:**
   - Before creating an instance of the `GPUDevice` class, confirm that NVML initializes correctly:
     ```python
     from pynvml import nvmlInit, nvmlShutdown

     try:
         nvmlInit()
         print("NVML initialized successfully!")
         nvmlShutdown()
     except Exception as e:
         print(f"NVML failed to initialize: {e}")
     ```

### Key Notes on GPU Access:
- **Compatibility:** Your code will only work on NVIDIA GPUs, as NVML is designed specifically for NVIDIA hardware.
- **Limited GPU Access:** On Colab, the GPU usage is time-limited, so be mindful of runtime restrictions, especially if running long experiments.
- **Runtime Environment:** Ensure that your code executes within the **GPU runtime environment** to leverage hardware acceleration properly.

By following these steps, you ensure that Colab is configured correctly for running your `GPUDevice` class, which relies on GPU metrics gathered through NVML.
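
The checks above can be combined into one function. The sketch below assumes the `pynvml` bindings installed in step 5. The name `list_nvml_gpus` is hypothetical, and the function returns an empty list instead of raising when NVML is unavailable.

```python
def list_nvml_gpus():
    """Return the names of GPUs visible through NVML, or [] if unavailable."""
    try:
        import pynvml
    except ImportError:
        return []  # bindings are not installed
    try:
        pynvml.nvmlInit()
    except Exception:
        return []  # no NVIDIA driver or no NVML support
    try:
        names = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            name = pynvml.nvmlDeviceGetName(handle)
            # Older bindings return bytes, newer ones return str.
            names.append(name.decode() if isinstance(name, bytes) else name)
        return names
    finally:
        pynvml.nvmlShutdown()


print("NVML GPUs:", list_nvml_gpus() or "none found")
```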

In [2]:
import tensorflow as tf
print("GPU Available,", tf.config.list_physical_devices('GPU'))

GPU Available, [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [3]:
!nvidia-smi

Thu Oct 31 08:52:04 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [4]:
!pip install nvidia-ml-py3

Collecting nvidia-ml-py3
  Downloading nvidia-ml-py3-7.352.0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nvidia-ml-py3
  Building wheel for nvidia-ml-py3 (setup.py) ... done
  Created wheel for nvidia-ml-py3: filename=nvidia_ml_py3-7.352.0-py3-none-any.whl size=19173 sha256=d373561b2b9021ee9ea9a8151ddb5178cafd11455efbd4679db10874e5473c07
  Stored in directory: /root/.cache/pip/wheels/5c/d8/c0/46899f8be7a75a2ffd197a23c8797700ea858b9b34819fbf9e
Successfully built nvidia-ml-py3
Installing collected packages: nvidia-ml-py3
Successfully installed nvidia-ml-py3-7.352.0


In [5]:
from pynvml import nvmlInit, nvmlShutdown

try:
    nvmlInit()
    print("NVML initialized successfully!")
    nvmlShutdown()
except Exception as e:
    print(f"NVML failed to initialize: {e}")

NVML initialized successfully!


## How to Use the `GPUDevice` Class

This guide provides instructions on using the `GPUDevice` class to monitor GPU metrics such as power consumption, temperature, and memory usage using the NVIDIA Management Library (NVML).

---

### 1. **Prerequisites**

- **Install NVML Python Bindings**: Make sure the `pynvml` library is installed. If not, install it using:
  ```python
  !pip install nvidia-ml-py3
  ```
- **Ensure GPU Access**: The code needs access to an NVIDIA GPU with NVML support.

---

### 2. **Class Overview**

The `GPUDevice` class monitors various GPU metrics. It collects data periodically and saves it in a CSV file for further analysis. Below are the key methods:

| **Method**                  | **Description**                                    |
|-----------------------------|----------------------------------------------------|
| `__init__`                  | Initializes NVML and prepares the GPU object.      |
| `get_power()`               | Returns the GPU's current power usage (Watts).     |
| `get_energy()`              | Returns the total energy consumption (Joules).     |
| `get_temp()`                | Retrieves the GPU temperature (Celsius).           |
| `get_memory_usage()`        | Reports memory usage in MB.                        |
| `get_gpu_utilization()`     | Returns the GPU utilization percentage.            |
| `get_graphics_clock()`      | Retrieves the current graphics clock (MHz).        |
| `get_memory_clock()`        | Retrieves the current memory clock (MHz).          |
| `get_pcie_throughput()`     | Reports PCIe throughput (KB/s).                    |
| `start_reading()`           | Starts collecting metrics periodically.            |
| `stop_reading()`            | Stops data collection and saves it to CSV.         |
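
NVML reports several of these quantities in raw units: power in milliwatts, total energy in millijoules, and memory in bytes. How `GPUDevice` converts them is not shown in this notebook, so the following is only a sketch of the conversions such methods would need. The function names are hypothetical, and "MB" is assumed to mean MiB, as in `nvidia-smi` output.

```python
def milliwatts_to_watts(raw_mw):
    """Convert an NVML power reading (milliwatts) to watts."""
    return raw_mw / 1000.0


def millijoules_to_joules(raw_mj):
    """Convert an NVML total-energy reading (millijoules) to joules."""
    return raw_mj / 1000.0


def bytes_to_mib(raw_bytes):
    """Convert an NVML memory reading (bytes) to MiB."""
    return raw_bytes / (1024 ** 2)


# Illustrative raw values, not measurements.
print(milliwatts_to_watts(10_000), "W")
print(millijoules_to_joules(2_500), "J")
print(bytes_to_mib(3 * 1024 ** 2), "MiB")
```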

---

### 3. **Setting Up the GPU Monitor**

```python
from gpu_device import GPUDevice

# Initialize the GPU monitor
gpu_device = GPUDevice(
    device_index=0,                # Index of the GPU (if multiple GPUs available)
    kernel_name="Example Kernel",  # Name to tag data with
    sampling_interval=0.1,         # Time interval between each sample (in seconds)
    log_file="gpu_monitor.log"     # Log file to store runtime data
)
```

---

### 4. **Starting and Stopping the Monitoring Process**

Start data collection by calling `start_reading()` and stop it using `stop_reading()`.

```python
import time

# Start monitoring GPU metrics
gpu_device.start_reading()

# Keep collecting data for 10 seconds
time.sleep(10)

# Stop monitoring and save the results to CSV
gpu_device.stop_reading()
```
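
If the monitored code raises an exception, `stop_reading()` is never reached and the CSV may not be written. One way to guard against that is a small context manager, sketched below. `monitored` is a hypothetical helper and not part of `GPUDevice`. `FakeDevice` is a stand-in so the example runs without a GPU.

```python
import time
from contextlib import contextmanager


@contextmanager
def monitored(device):
    """Run a block with monitoring on, stopping even if the block raises."""
    device.start_reading()
    try:
        yield device
    finally:
        device.stop_reading()


class FakeDevice:
    """Stand-in for GPUDevice that only records which methods were called."""

    def __init__(self):
        self.calls = []

    def start_reading(self):
        self.calls.append("start")

    def stop_reading(self):
        self.calls.append("stop")


device = FakeDevice()
with monitored(device):
    time.sleep(0.01)  # replace with the GPU workload to measure
print(device.calls)
```

Passing a real `GPUDevice` instance in place of `FakeDevice` should work unchanged, since the helper relies only on the two methods described above.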

---

### 5. **CSV Output and Logging**

- **CSV File**: After stopping the data collection, a CSV file will be generated with metrics like temperature, power, memory usage, and GPU utilization.
- **Logging**: Data is also logged into the specified log file (`gpu_monitor.log`), which can help with debugging.

---

### 6. **Sample Output Format**

The generated CSV will contain the following columns:

| **Kernel** | **Time (s)** | **Temperature (C)** | **Power (W)** | **Memory Usage (MB)** | **GPU Utilization (%)** | **Graphics Clock (MHz)** | **Memory Clock (MHz)** | **PCIe Tx Throughput (KB/s)** | **PCIe Rx Throughput (KB/s)** |
|------------|--------------|---------------------|--------------|----------------------|-------------------------|--------------------------|------------------------|------------------------------|------------------------------|
| Example    | 0.1          | 50                  | 75           | 2000                 | 80                      | 1500                     | 700                    | 50                           | 40                           |
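
A file in this format can be summarized with the standard library alone. The sketch below assumes the CSV header matches the column names in the table exactly, which should be checked against a real output file. The two rows are illustrative values, not measurements, and `summarize` is a hypothetical helper.

```python
import csv
import io
from statistics import mean

# Illustrative rows using a subset of the column names from the table.
sample_csv = """Kernel,Time (s),Temperature (C),Power (W)
Example,0.1,50,75
Example,0.2,52,77
"""


def summarize(csv_text):
    """Return sample count, mean power, and peak temperature."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "samples": len(rows),
        "mean_power_w": mean(float(r["Power (W)"]) for r in rows),
        "max_temp_c": max(float(r["Temperature (C)"]) for r in rows),
    }


print(summarize(sample_csv))
```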

---

### 7. **Troubleshooting**

- **NVML Initialization Error**: If you encounter errors related to NVML, ensure the NVIDIA drivers are properly installed and the GPU supports NVML.
- **Permission Issues**: Running the code may require administrative or root privileges if accessing GPU hardware metrics.

---

### 8. **Example Use Case**  

```python
from gpu_device import GPUDevice

# Initialize the GPUDevice instance
gpu_device = GPUDevice(
    device_index=0,                # GPU index (if multiple GPUs)
    kernel_name="Example Kernel",  # Tag for collected data
    sampling_interval=0.1,         # Interval between samples (in seconds)
    log_file="gpu_monitor.log"     # Log file for runtime data
)

gpu_device.start_reading()  # Start monitoring

# << Insert your GPU code here >>  
# Example: Simple TensorFlow GPU computation  
import tensorflow as tf  
with tf.device('/GPU:0'):
    a = tf.random.uniform((1000, 1000))
    b = tf.random.uniform((1000, 1000))
    c = tf.matmul(a, b)

gpu_device.stop_reading()  # Stop monitoring and save data
```

This monitors GPU metrics during code execution and generates a CSV file named `gpu_data_Example Kernel.csv`.

### 9. **Conclusion**

The `GPUDevice` class provides a comprehensive way to monitor GPU metrics, making it useful for tasks such as performance tuning, model optimization, or energy efficiency studies. You can extend the class by adding more NVML metrics or customizing the output format.

In [6]:
!nvidia-smi

Thu Oct 31 08:52:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

**NVIDIA System Management Interface (nvidia-smi)** is a command-line utility that provides a comprehensive interface for monitoring and managing NVIDIA GPUs. It enables users to obtain information about GPU utilization, memory usage, temperature, and power consumption.

### Key Functions of nvidia-smi:

1. **Monitoring GPU Metrics**:  
   - Displays real-time statistics about GPU performance, including GPU and memory usage, temperature, and power consumption.

2. **Process Management**:  
   - Lists processes currently using the GPU, along with their memory consumption and GPU utilization, allowing users to identify resource-intensive applications.

3. **Changing GPU Parameters**:  
   - Users can modify several GPU settings, including power limits, clock speeds, and persistence mode. This can be done with commands like:
     - **Set Power Limit**:  
       ```bash
       nvidia-smi -i <gpu_index> -pl <power_limit_in_Watts>
       ```
     - **Set Persistence Mode**:  
       ```bash
       nvidia-smi -i <gpu_index> -pm <mode>  # where mode can be 0 (disabled) or 1 (enabled)
       ```

### Additional Resources
For more detailed information and examples, you can refer to the official NVIDIA documentation: [NVIDIA-SMI Documentation](https://docs.nvidia.com/deploy/pdf/NVSMI_Manual.pdf).

In [7]:
## NVIDIA-SMI Documentation:- https://docs.nvidia.com/deploy/pdf/NVSMI_Manual.pdf

# Task: GPU Feature Exploration and Data Collection using `nvidia-smi`

This task consists of two parts: first, identifying adjustable GPU features using the `nvidia-smi` command, and second, running GPU-intensive applications to collect performance data by **modifying GPU features**. The **model you use is not important**, but the **data collected** during these runs is essential.

---

## **Task 1: Explore `nvidia-smi` and Identify Adjustable GPU Features**  

### **Instructions:**  
1. **Explore the `nvidia-smi` Command:**  
   - Use the command line to investigate the functionality of `nvidia-smi`.  
   - Document **available subcommands** for querying and setting features.
   - Example:
     ```bash
     nvidia-smi --help  # To explore available commands
     nvidia-smi -q  # Show all available information for each GPU
     nvidia-smi --help-query-gpu  # List the fields accepted by --query-gpu
     ```

2. **Identify Adjustable Features:**  
   - List **all adjustable features** such as:
     - Power Limits
     - Memory Clocks
     - Fan Speed
     - Compute Modes  
   - **Example Command:**  
     ```bash
     nvidia-smi -pl 150  # Set power limit to 150W
     ```
     
3. **Output:**  
   - **Document each feature** in Markdown with the following details:
     - **Feature Name:** e.g., Power Limit
     - **Possible Values/Settings:** e.g., 100W - 300W
     - **Command to Modify:** e.g., `nvidia-smi -pl <value>`  
     - **Effect:** e.g., Impacts power draw and temperature regulation
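
To keep the per-feature documentation consistent, the four fields can be held in a small record and rendered to Markdown. This is only a sketch: `GpuFeature` and `to_markdown` are hypothetical names, and the example values are the ones given in the instructions above.

```python
from dataclasses import dataclass


@dataclass
class GpuFeature:
    name: str
    values: str
    command: str
    effect: str

    def to_markdown(self):
        """Render the feature as a four-item Markdown list."""
        return "\n".join([
            f"- **Feature Name:** {self.name}",
            f"- **Possible Values/Settings:** {self.values}",
            f"- **Command to Modify:** `{self.command}`",
            f"- **Effect:** {self.effect}",
        ])


feature = GpuFeature(
    name="Power Limit",
    values="100W - 300W",
    command="nvidia-smi -pl <value>",
    effect="Impacts power draw and temperature regulation",
)
print(feature.to_markdown())
```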

---

## **Task 2: Run GPU-Intensive Applications and Collect Data with GPU Feature Modifications**  

While the **specific model** you use (e.g., Ultralytics or Hugging Face) is not important, the **data collected during feature changes** is critical. You need to **modify GPU features** identified in Task 1 and **measure performance metrics** during the execution of each task.

### **Instructions:**  
1. **Run GPU-Intensive Applications:**
   - Choose two different applications, such as:
     - **Ultralytics Model Training**
     - **Fine-tuning a Hugging Face LLM**

2. **Modify GPU Features:**
   - For **each identified feature** (e.g., power limit, fan speed), modify its value.
   - **Example Command:**  
     ```bash
     nvidia-smi -pl 150  # Set power limit to 150W
     ```

3. **Collect Performance Data using `GPUDevice` Class:**  
   - Capture GPU metrics (temperature, power, memory usage, GPU utilization, clocks, and PCIe throughput) using the `GPUDevice` class described above.

4. **Repeat the Process for Each Feature Configuration:**
   - For **each feature change**, re-run the selected task and **log the collected data**.
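
The modify, run, and log cycle in steps 2 to 4 can be sketched as a loop. Everything here is a placeholder: `run_sweep` is a hypothetical helper, `apply_setting` stands in for an `nvidia-smi` call, and `PrintingMonitor` stands in for `GPUDevice` so the example runs without a GPU.

```python
def run_sweep(settings, apply_setting, workload, make_monitor):
    """Run `workload` once per setting while a monitor records metrics."""
    completed = []
    for setting in settings:
        apply_setting(setting)
        monitor = make_monitor(setting)
        monitor.start_reading()
        try:
            workload()
        finally:
            monitor.stop_reading()  # always stop, even if the workload fails
        completed.append(setting)
    return completed


class PrintingMonitor:
    """Stand-in for GPUDevice that only prints start and stop events."""

    def __init__(self, tag):
        self.tag = tag

    def start_reading(self):
        print(f"[{self.tag}] start")

    def stop_reading(self):
        print(f"[{self.tag}] stop")


done = run_sweep(
    settings=[50, 60, 70],  # illustrative power limits in watts
    apply_setting=lambda watts: print(f"would run: nvidia-smi -pl {watts}"),
    workload=lambda: sum(i * i for i in range(1000)),
    make_monitor=lambda watts: PrintingMonitor(f"pl_{watts}W"),
)
```

In a real run, `make_monitor` could return a `GPUDevice` whose `kernel_name` encodes the setting. Because the guide above names the CSV after the kernel name, each configuration would then write its own file.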

---

## **Conclusion:**
Summarize key findings, such as:
- Which feature changes had the most significant impact on performance, power consumption, or temperature.
- Any patterns observed (e.g., higher memory clocks improve model training speed but increase temperature).

---

## **Deliverables:**
1. **List of adjustable GPU features** with explanations.
2. **Collected data** for each feature change with applications.

---

This task ensures that you learn how to **use `nvidia-smi` to manipulate GPU configurations** and **analyze their impact on performance**. Focus on the **data collected** and ensure all findings are well-documented for future reference.

In [10]:
!nvidia-smi --help

NVIDIA System Management Interface -- v535.104.05

NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.

Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available.  The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.

http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
    - All Tesla products, starting with the Kepler architecture
    - All Quadro products, starting with the Kepler architecture
    - All GRID products, starting with the Kepler architecture
    - GeForce Titan products, starting with the Kepler architecture
- 

#  **Documentation**
##  Querying Commands
  1. Name: The name of the GPU.
  2. UUID: The unique identifier for the GPU.
  3. GPU Bus ID: The PCI bus ID of the GPU.
  4. Persistence Mode: Indicates whether persistence mode is enabled or disabled.
  5. Temperature (C): The current temperature of the GPU.
  6. Utilization (%): The current utilization of the GPU (in percentage).
  7. Memory Usage:
     - Total Memory: Total GPU memory.
     - Used Memory: Memory currently in use.
     - Free Memory: Available memory.
  8. Compute Mode: The compute mode of the GPU.
  9. Driver Version: The version of the installed NVIDIA driver.
  10. Display Active: Indicates if the display is active on the GPU.
  11. FB Memory Usage: Framebuffer memory usage.
  12. Bar1 Memory Usage: Memory usage for BAR1.
  13. Power Draw (W): Current power consumption of the GPU.
  14. Power Limit (W): The maximum power limit set for the GPU.
  15. GPU Instance ID: For GPUs that support Multi-Instance GPU (MIG).
  16. Migration State: The migration state for the GPU.
  17. Fan Speed (%): Current speed of the GPU fan.
  18. Performance State: The current performance state of the GPU (P-State).
  19. Encoder Utilization (%): Utilization of the GPU encoder.
  20. Decoder Utilization (%): Utilization of the GPU decoder.
  21. ECC Mode: Indicates whether ECC (Error-Correcting Code) is enabled.


In [15]:
!nvidia-smi --query-gpu=name,utilization.gpu,memory.total,memory.free,memory.used --format=csv

name, utilization.gpu [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
Tesla T4, 0 %, 15360 MiB, 15099 MiB, 3 MiB
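
Output in this CSV form is straightforward to parse in Python. The sketch below works on the text copied from the cell above. `parse_query_gpu` and `leading_number` are hypothetical helper names.

```python
import csv
import io

# Output copied from the query above.
query_output = """name, utilization.gpu [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
Tesla T4, 0 %, 15360 MiB, 15099 MiB, 3 MiB
"""


def parse_query_gpu(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader if row]


def leading_number(field):
    """Extract the numeric part of a value such as '15360 MiB'."""
    return float(field.split()[0])


gpus = parse_query_gpu(query_output)
total = leading_number(gpus[0]["memory.total [MiB]"])
used = leading_number(gpus[0]["memory.used [MiB]"])
print(gpus[0]["name"], f"uses {used:.0f} of {total:.0f} MiB")
```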


## Setting Commands

### Setting Persistence Mode
Keeps the GPU initialised when it is not in use, which reduces initialisation time for subsequent applications.

- `!nvidia-smi -pm 1`: Enables persistence mode
- `!nvidia-smi -pm 0`: Disables persistence mode

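### Setting Power Limit
Caps the GPU's power draw at the given value in watts, trading performance against energy consumption. The value must lie within the range the GPU supports; the Tesla T4 above reports a 70W cap.

- `!nvidia-smi -pl 100`: Sets the power limit to 100W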

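### Setting Application Clocks
Sets the memory and graphics (core) clocks to the specified frequencies in MHz. The pair is given in the order `<memory,graphics>`.

- `!nvidia-smi -ac 2505,1500`: Sets the memory clock to 2505 MHz and the graphics clock to 1500 MHz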

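### Enabling or Disabling ECC
Toggles ECC (Error-Correcting Code) memory support. The change takes effect after the next reboot.

- `!nvidia-smi -e 1`: Enables ECC
- `!nvidia-smi -e 0`: Disables ECC

### Listing all GPUs

- `!nvidia-smi --list-gpus`: Lists the GPUs in the system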
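
When these commands are driven from Python, building the argument list separately from running it keeps the flag order easy to check. This is a sketch with hypothetical helper names. It only prints the commands and changes no GPU state.

```python
import shlex


def power_limit_cmd(watts, gpu_index=None):
    """Build the nvidia-smi command that sets a power limit in watts."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]
    return cmd + ["-pl", str(watts)]


def application_clocks_cmd(memory_mhz, graphics_mhz, gpu_index=None):
    """Build the nvidia-smi command that sets application clocks."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]
    # nvidia-smi expects the pair as <memory,graphics>.
    return cmd + ["-ac", f"{memory_mhz},{graphics_mhz}"]


print(shlex.join(power_limit_cmd(100, gpu_index=0)))
print(shlex.join(application_clocks_cmd(2505, 1500)))
```

To apply a setting, the list can be passed to `subprocess.run(cmd, check=True)`. Setting commands generally need root privileges and may be rejected on hardware that does not support them.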

