
# Q1: Identify !, % and %% in Google Colab

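!  → Runs the rest of the line as a Linux shell command on the Colab VM.
Example: !nvidia-smi

%  → Line magic: an IPython command that applies only to the line it is on.
Example: %timeit sum(range(1000))

%% → Cell magic: an IPython command that applies to the entire cell. It must be the first line of the cell.
Example: %%writefile filename.cu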
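
The three prefixes only exist inside IPython and Colab. As a rough illustration of what each one does, the sketch below shows approximate plain-Python equivalents. It is not how IPython implements them, and the helper names and the file name `example.txt` are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
import timeit
from pathlib import Path


def run_shell_like(args):
    """Rough stand-in for `!command`: run a program and return its stdout."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


def time_statement(stmt, number=1000):
    """Rough stand-in for `%timeit stmt`: average seconds per run."""
    return timeit.timeit(stmt, number=number) / number


def write_file(path, text):
    """Rough stand-in for `%%writefile path`: write the text to a file."""
    Path(path).write_text(text)
    return Path(path)


if __name__ == "__main__":
    # "!" equivalent: launch a child process (here, the Python interpreter itself).
    print(run_shell_like([sys.executable, "-c", "print('hello from a subprocess')"]), end="")

    # "%" equivalent: time a single statement.
    print("seconds per run:", time_statement("sum(range(1000))"))

    # "%%" equivalent: write a block of text to a file (hypothetical file name).
    target = write_file(Path(tempfile.gettempdir()) / "example.txt", "written from Python\n")
    print(target.read_text(), end="")
```

In a notebook the magics remain the shorter option; the helpers are only meant to make the behaviour of each prefix concrete.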


# Q2: Important nvidia-smi Commands

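1. nvidia-smi           → Show GPU status (utilization, memory use, driver and CUDA versions)
2. nvidia-smi -L        → List the GPUs in the system
3. nvidia-smi -q        → Show detailed information for each GPU
4. nvidia-smi -q -d MEMORY → Show memory details only
5. watch -n 1 nvidia-smi → Re-run nvidia-smi every 1 second
6. nvidia-smi --help    → Show help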
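
When the GPU status is needed inside a program rather than on screen, `nvidia-smi` can also print selected fields as CSV through its `--query-gpu` and `--format` options. The sketch below is one possible way to read that output from Python. The helper names are hypothetical, and it assumes the memory fields come back as plain integers in MiB, which may not hold on every driver.

```python
import shutil
import subprocess

QUERY_FIELDS = ["name", "memory.total", "memory.used"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name would not break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name,
                     "memory_total_mib": int(total),
                     "memory_used_mib": int(used)})
    return gpus


def query_gpus():
    """Return GPU info, or an empty list when nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No NVIDIA GPU visible (is a GPU runtime selected?)")
    for gpu in gpus:
        print(gpu)
```

Returning an empty list instead of raising keeps the helper usable on CPU-only runtimes.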

# Q3: Common CUDA Errors

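1. Zero Output:
   Cause: Kernel launches are asynchronous, so without cudaDeviceSynchronize() the host can exit before the kernel's printf output is flushed.
   Fix: Call cudaDeviceSynchronize(); after the kernel launch.

2. Incorrect Indexing:
   Wrong: int id = threadIdx.x;   (repeats in every block, so it is only correct for a single block)
   Correct: int id = blockIdx.x * blockDim.x + threadIdx.x;

3. PTX Errors:
   Cause: The code was compiled for a CUDA architecture that does not match the GPU.
   Fix: Pass the GPU's architecture to nvcc, e.g. nvcc -arch=sm_75 file.cu for a Tesla T4.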
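
The indexing error in item 2 is easier to see with numbers. The sketch below simulates in plain Python the IDs that each formula would produce across a grid. The launch shape of 2 blocks with 4 threads each is a hypothetical example, and the function names are illustrative.

```python
def local_ids(num_blocks, threads_per_block):
    """IDs produced by `int id = threadIdx.x;` for every thread in the grid."""
    return [thread
            for _block in range(num_blocks)
            for thread in range(threads_per_block)]


def global_ids(num_blocks, threads_per_block):
    """IDs produced by `blockIdx.x * blockDim.x + threadIdx.x`."""
    return [block * threads_per_block + thread
            for block in range(num_blocks)
            for thread in range(threads_per_block)]


if __name__ == "__main__":
    # Hypothetical launch: <<<2, 4>>> (2 blocks, 4 threads per block)
    print("threadIdx.x only:", local_ids(2, 4))   # [0, 1, 2, 3, 0, 1, 2, 3]
    print("global index:    ", global_ids(2, 4))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With `threadIdx.x` alone, both blocks touch elements 0 to 3 and elements 4 to 7 are never processed. The global index gives every thread a distinct element.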


# Q4: CUDA Program with 1 Block and 8 Threads

In [3]:
%%writefile hello.cu

#include <stdio.h>

// Device Code (GPU)
__global__ void helloKernel() {
    int global_thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU thread %d\n", global_thread_id);
}

// Host Code (CPU)
int main() {
    helloKernel<<<1, 8>>>();
    cudaDeviceSynchronize();
    return 0;
}


Writing hello.cu
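
The cell above only writes `hello.cu` to disk. The file still has to be compiled with `nvcc` and executed. The sketch below is one possible way to do both from Python. The helper names are hypothetical, the binary is assumed to be named after the source file, and the function returns `None` when `nvcc` or the source file is not available.

```python
import shutil
import subprocess
from pathlib import Path


def nvcc_command(source, arch=None):
    """Build the nvcc command line for a .cu file; the binary is named after the source."""
    binary = str(Path(source).with_suffix(""))
    command = ["nvcc", source, "-o", binary]
    if arch is not None:
        # Optional architecture flag, e.g. "sm_75".
        command.insert(1, f"-arch={arch}")
    return command, binary


def compile_and_run(source, arch=None):
    """Compile and run a CUDA program; return its stdout, or None if it cannot be built here."""
    if shutil.which("nvcc") is None or not Path(source).exists():
        return None
    command, binary = nvcc_command(source, arch)
    subprocess.run(command, check=True)
    # Use an absolute path so the binary is found without relying on PATH.
    result = subprocess.run([str(Path(binary).resolve())], capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    output = compile_and_run("hello.cu")
    print(output if output is not None else "nvcc or hello.cu not found")
```

On a GPU runtime this should print one greeting line per thread, eight in total. The same helper can be pointed at any other `.cu` file written with `%%writefile`.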


# Q5: Host and Device Memory Separation

In [4]:
%%writefile memory.cu

#include <stdio.h>

// Device Code
__global__ void printKernel(int *d_array) {
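    // Global thread index; equal to threadIdx.x here because only one block is launched
    int id = blockIdx.x * blockDim.x + threadIdx.x;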
    printf("GPU thread %d value: %d\n", id, d_array[id]);
}

// Host Code
int main() {

    int h_array[5] = {10, 20, 30, 40, 50};
    int *d_array;

    cudaMalloc((void**)&d_array, 5 * sizeof(int));
    cudaMemcpy(d_array, h_array, 5 * sizeof(int), cudaMemcpyHostToDevice);

    printKernel<<<1, 5>>>(d_array);
    cudaDeviceSynchronize();

    cudaMemcpy(h_array, d_array, 5 * sizeof(int), cudaMemcpyDeviceToHost);

    printf("Values copied back to CPU:\n");
    for(int i = 0; i < 5; i++) {
        printf("%d ", h_array[i]);
    }
    printf("\n");  // finish the output line

    cudaFree(d_array);
    return 0;
}


Writing memory.cu


# Q6: CPU Time Comparison (List vs Tuple vs NumPy)

In [5]:
import time
import numpy as np

size = 1000000

# List
start = time.time()
lst = [i for i in range(size)]
lst = [x*2 for x in lst]
print("List time:", time.time() - start)

# Tuple
start = time.time()
tpl = tuple(range(size))
tpl = tuple(x*2 for x in tpl)
print("Tuple time:", time.time() - start)

# NumPy
start = time.time()
arr = np.arange(size)
arr = arr * 2
print("NumPy time:", time.time() - start)


List time: 0.21351122856140137
Tuple time: 0.211134672164917
NumPy time: 0.005968332290649414
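
A single `time.time()` measurement like the one above is sensitive to whatever else the machine is doing. As a hedged alternative, the sketch below repeats each measurement with the standard `timeit` module and reports the best run. The function names and the smaller `SIZE` are choices made for this example, not part of the original experiment, so the numbers will differ from those printed above.

```python
import timeit

import numpy as np

SIZE = 100_000  # smaller than the 1,000,000 used above so repeated runs stay quick


def double_list(n):
    """Build a list of n doubled integers."""
    return [x * 2 for x in range(n)]


def double_tuple(n):
    """Build a tuple of n doubled integers."""
    return tuple(x * 2 for x in range(n))


def double_numpy(n):
    """Build a NumPy array of n doubled integers with one vectorized multiply."""
    return np.arange(n) * 2


def best_time(func, n, repeat=5):
    """Best wall-clock time in seconds over `repeat` single runs of func(n)."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))


if __name__ == "__main__":
    for label, func in [("List", double_list),
                        ("Tuple", double_tuple),
                        ("NumPy", double_numpy)]:
        print(f"{label} time: {best_time(func, SIZE):.6f} s")
```

Taking the minimum of several runs filters out most interference from other processes, which makes the relative ordering of the three approaches easier to trust.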
