
# Introduction to GPU Tensor Core Matrix Operations

This notebook introduces our high-performance matrix operations library, which uses NVIDIA Tensor Cores to accelerate common matrix operations. The library provides optimized implementations that rely on double-precision splitting techniques to maintain accuracy while benefiting from Tensor Core acceleration.
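
To make the splitting idea concrete, here is a minimal NumPy sketch. It assumes one common scheme, in which each float64 value is split into a float32 high part plus a float32 residual; it is not taken from the library, and the helper names `split_fp64` and `split_matmul` are hypothetical. The partial products are accumulated in float64 for clarity, whereas real Tensor Core paths use lower-precision inputs and accumulators, so the error figures are illustrative only.

```python
import numpy as np

def split_fp64(x):
    """Split a float64 array into a float32 high part and a float32 residual."""
    hi = x.astype(np.float32)
    lo = (x - hi.astype(np.float64)).astype(np.float32)
    return hi, lo

def split_matmul(a, b):
    """Approximate a float64 product from float32 pieces (the lo @ lo term is dropped)."""
    a_hi, a_lo = (part.astype(np.float64) for part in split_fp64(a))
    b_hi, b_lo = (part.astype(np.float64) for part in split_fp64(b))
    return a_hi @ b_hi + a_hi @ b_lo + a_lo @ b_hi

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = rng.random((64, 64))
reference = a @ b

# Baseline: round the inputs to float32 once and multiply.
naive_fp32 = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float64)

err_naive = np.max(np.abs(naive_fp32 - reference) / np.abs(reference))
err_split = np.max(np.abs(split_matmul(a, b) - reference) / np.abs(reference))
print(f"float32 only: {err_naive:.2e}, split: {err_split:.2e}")
```

The residual term recovers most of the precision lost when the inputs are rounded to float32, at the cost of three products instead of one.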

## Setup and Installation

First, let's import the required libraries and set up our environment:

In [82]:
import numpy as np
import cupy as cp
from tensor_matrix_ops import TensorMatrixOps

# Initialize the tensor core operations library  
# (ensure cuda_matlib.so and tensor_matrix_ops.py are in the same directory as this notebook)
tensor_ops = TensorMatrixOps()


# Optional: set random seeds for reproducibility (only needed for this notebook)
np.random.seed(42)
cp.random.seed(42)

Initializing CUDA...
CUDA initialization complete
Loaded library: ./cuda_matlib.so
Function signatures configured


## 1. Matrix-Matrix Multiplication (matmul)

Matrix multiplication is a fundamental operation in linear algebra and deep learning. Our implementation optimizes for performance while maintaining numerical precision.

In [83]:
# Define matrix dimensions
M, K, N = 1024, 1024, 1024  # A is (M x K), B is (K x N), C is (M x N)

# Create random matrices with proper scaling
a = cp.random.random((M, K), dtype=cp.float64)
b = cp.random.random((K, N), dtype=cp.float64)

# Scale inputs to prevent overflow
a /= cp.sqrt(K)
b /= cp.sqrt(K)


# Run matrix multiplication using Tensor Cores
c_tensor = tensor_ops.matmul(a, b)

# Compare with standard CuPy implementation
c_cupy = cp.matmul(a, b)

# Print sample results
print("Matrix Multiplication (A×B):")
print("Tensor Core result (first 5 elements):", c_tensor.flatten()[:5])
print("CuPy result (first 5 elements):       ", c_cupy.flatten()[:5])

Matrix Multiplication (A×B):
Tensor Core result (first 5 elements): [0.25248319 0.23801261 0.25244898 0.25019878 0.24522987]
CuPy result (first 5 elements):        [0.25248071 0.23801127 0.25244621 0.25019609 0.24522916]
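
The printed values cluster around 0.25, which follows from the scaling. Entries drawn uniformly from [0, 1) have mean 0.5, so after both inputs are divided by √K each entry of the product is a sum of K terms with mean (0.5/√K)² = 0.25/K. The NumPy sketch below checks this on the CPU with smaller matrices; the sizes and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 256, 256, 256

# Same scaling as in the cell above: uniform [0, 1) entries divided by sqrt(K).
a = rng.random((M, K)) / np.sqrt(K)
b = rng.random((K, N)) / np.sqrt(K)

c = a @ b
print("mean:", c.mean(), "min:", c.min(), "max:", c.max())
```

Keeping the product entries of order one regardless of K is what makes this scaling convenient when the same test is repeated at different sizes.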


## 2. Matrix Power (A^n)

Computing matrix powers efficiently is essential in applications such as Markov chains and graph analytics. Our implementation uses Tensor Cores to accelerate these computations.

In [84]:
# Create a square matrix
n = 512
a = cp.random.random((n, n), dtype=cp.float64)

# Scale to prevent overflow during powers
a /= (cp.sqrt(n) * 1.1)

# Define the power
power = 4

# Compute matrix power using Tensor Cores
a_power_tensor = tensor_ops.matrix_power(a, power)

# Compare with standard CuPy implementation
a_power_cupy = cp.linalg.matrix_power(a, power)

# Print sample results
print(f"Matrix Power (A^{power}):")
print("Tensor Core result (first 5 elements):", a_power_tensor.flatten()[:5])
print("CuPy result (first 5 elements):       ", a_power_cupy.flatten()[:5])

Matrix Power (A^4):
Tensor Core result (first 5 elements): [21.84820557 22.37931252 21.33176041 21.0262146  21.53623581]
CuPy result (first 5 elements):        [21.84799909 22.38001368 21.33145314 21.02625034 21.53599366]
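
One standard way to compute a matrix power with few multiplications is exponentiation by squaring, which needs a number of matrix products proportional to log₂(n) rather than n − 1. Whether `matrix_power` uses this scheme is not stated here; the function below, `power_by_squaring`, is a hypothetical NumPy illustration of the technique.

```python
import numpy as np

def power_by_squaring(a, n):
    """Return a**n (matrix power) for a square matrix a and integer n >= 0."""
    if n < 0:
        raise ValueError("negative powers are not handled in this sketch")
    result = np.eye(a.shape[0], dtype=a.dtype)
    base = a.copy()
    while n > 0:
        if n & 1:              # current binary digit of n is 1
            result = result @ base
        n >>= 1
        if n:                  # square only if more digits remain
            base = base @ base
    return result

rng = np.random.default_rng(0)
size = 64
a = rng.random((size, size)) / (np.sqrt(size) * 1.1)

p4 = power_by_squaring(a, 4)
print(np.allclose(p4, np.linalg.matrix_power(a, 4)))
```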


## 3. Batched Matrix Multiplication

Batched matrix multiplication performs the same operation on multiple matrix pairs simultaneously, which is common in deep learning when processing batches of data.

In [85]:
# Define batch size and dimensions
batch_size = 32
M, K, N = 128, 512, 512

# Create batched random matrices
a_batch = cp.random.random((batch_size, M, K), dtype=cp.float64)
b_batch = cp.random.random((batch_size, K, N), dtype=cp.float64)

# Scale inputs
a_batch /= cp.sqrt(K)
b_batch /= cp.sqrt(K)

# Compute batched matrix multiplication using Tensor Cores
c_batch_tensor = tensor_ops.batched_matmul(a_batch, b_batch)

# Compare with standard CuPy implementation
c_batch_cupy = cp.matmul(a_batch, b_batch)

# Print sample results
print("Batched Matrix Multiplication:")
print("Tensor Core result (first 5 elements):", c_batch_tensor.flatten()[:5])
print("CuPy result (first 5 elements):       ", c_batch_cupy.flatten()[:5])

Batched Matrix Multiplication:
Tensor Core result (first 5 elements): [0.24708433 0.24681014 0.24933796 0.2704584  0.24725303]
CuPy result (first 5 elements):        [0.25764366 0.23845166 0.24916795 0.24741405 0.26051707]
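
For reference, the semantics assumed for the comparison are those of `matmul` on 3D arrays: slice `i` of the result is the product of slice `i` of each input. The NumPy sketch below, with small arbitrary sizes, shows three equivalent formulations, which can help when checking one implementation against another.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, M, K, N = 4, 8, 16, 12
a_batch = rng.random((batch_size, M, K))
b_batch = rng.random((batch_size, K, N))

# 1. matmul broadcasts over the leading (batch) dimension.
c_matmul = np.matmul(a_batch, b_batch)

# 2. Explicit loop over matrix pairs.
c_loop = np.stack([a_batch[i] @ b_batch[i] for i in range(batch_size)])

# 3. einsum with the batch index carried through.
c_einsum = np.einsum("bmk,bkn->bmn", a_batch, b_batch)

print(c_matmul.shape)
print(np.allclose(c_matmul, c_loop), np.allclose(c_matmul, c_einsum))
```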


## 4. Vector-Matrix Multiplication (v×A)

Vector-matrix multiplication is a special case that computes the product of a vector and a matrix, used in many operations including neural network forward passes.

In [86]:
# Create a vector and a matrix
n = 4096
v = cp.random.random(n, dtype=cp.float64)
a = cp.random.random((n, n), dtype=cp.float64)

# Scale inputs
v /= cp.sqrt(n)
a /= cp.sqrt(n)

# Compute vector-matrix multiplication using Tensor Cores
result_tensor = tensor_ops.vector_matmul(v, a)

# Compare with standard CuPy implementation
result_cupy = v @ a

# Print sample results
print("Vector-Matrix Multiplication (v×A):")
print("Tensor Core result (first 5 elements):", result_tensor.flatten()[:5])
print("CuPy result (first 5 elements):       ", result_cupy.flatten()[:5])

Vector-Matrix Multiplication (v×A):
Tensor Core result (first 5 elements): [0.2469551  0.24752763 0.24533175 0.25095335 0.25167346]
CuPy result (first 5 elements):        [0.2469551  0.24752763 0.24533173 0.25095336 0.25167343]


## 5. Matrix-Vector Multiplication (A×v)

Matrix-vector multiplication computes the product of a matrix and a vector, common in many scientific computing and machine learning applications.

In [87]:
# Create a matrix and a vector
n = 1024
a = cp.random.random((n, n), dtype=cp.float64)
v = cp.random.random(n, dtype=cp.float64)

# Scale inputs
a /= cp.sqrt(n)
v /= cp.sqrt(n)

# Compute matrix-vector multiplication using Tensor Cores
result_tensor = tensor_ops.matmul_vector(a, v)

# Compare with standard CuPy implementation
result_cupy = a @ v

# Print sample results
print("Matrix-Vector Multiplication (A×v):")
print("Tensor Core result (first 5 elements):", result_tensor.flatten()[:5])
print("CuPy result (first 5 elements):       ", result_cupy.flatten()[:5])

Matrix-Vector Multiplication (A×v):
Tensor Core result (first 5 elements): [0.25579202 0.25863317 0.25647345 0.26250604 0.25792497]
CuPy result (first 5 elements):        [0.25579201 0.25863317 0.25647348 0.26250603 0.25792499]


## 6. Batched Vector Multiplication

This operation applies the same matrix to multiple vectors in parallel, which is useful for processing multiple inputs simultaneously.

In [88]:
# Create a matrix and a batch of vectors
n = 4096
batch_size = 32
a = cp.random.random((n, n), dtype=cp.float64)
v_batch = cp.random.random((n, batch_size), dtype=cp.float64)

# Scale inputs
a /= cp.sqrt(n)
v_batch /= cp.sqrt(n)

# Compute batched vector multiplication using Tensor Cores
result_tensor = tensor_ops.batched_vector_matmul(v_batch, a)

# Compare with standard CuPy implementation
result_cupy = cp.empty((batch_size, n), dtype=cp.float64)
for i in range(batch_size):
    result_cupy[i] = a @ v_batch[:,i]

# Print sample results
print("Batched Vector Multiplication:")
print("Tensor Core result (first 5 elements):", result_tensor.flatten()[:5])
print("CuPy result (first 5 elements):       ", result_cupy.flatten()[:5])

Batched Vector Multiplication:
Tensor Core result (first 5 elements): [0.25338036 0.25158006 0.25674474 0.25279993 0.25193566]
CuPy result (first 5 elements):        [0.24548769 0.25240293 0.25115439 0.25377019 0.24899413]
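
Two conventions are possible for a batch of vectors stored as the columns of `v_batch`: applying the matrix from the left (A×v, as in the reference loop above) or from the right (v×A, as in Section 4). Which one `batched_vector_matmul` follows is not stated here. The NumPy sketch below shows that each convention reduces to a single matrix product and that the two generally differ unless the matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 16, 4
a = rng.random((n, n))
v_batch = rng.random((n, batch_size))   # one vector per column

# Left convention: row i of the result is a @ v_batch[:, i].
left_loop = np.stack([a @ v_batch[:, i] for i in range(batch_size)])
left = (a @ v_batch).T

# Right convention: row i of the result is v_batch[:, i] @ a.
right = v_batch.T @ a

# For a symmetric matrix the two conventions coincide.
a_sym = a + a.T
left_sym = (a_sym @ v_batch).T
right_sym = v_batch.T @ a_sym

print(np.allclose(left, left_loop), np.allclose(left, right), np.allclose(left_sym, right_sym))
```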


## 7. Strided Batch Matrix Multiplication

Strided batch matrix multiplication operates on matrices stored contiguously in memory with a fixed stride between them, so the whole batch can be addressed from one base pointer and a stride rather than a separate pointer per matrix.

In [89]:
# Define dimensions
batch_size = 32
M, K, N = 128, 128, 128

# Create batched matrices
a_batch = cp.random.random((batch_size, M, K), dtype=cp.float64)
b_batch = cp.random.random((batch_size, K, N), dtype=cp.float64)

# Scale inputs
a_batch /= cp.sqrt(K)
b_batch /= cp.sqrt(K)

# Compute strided batch matrix multiplication using Tensor Cores
result_tensor = tensor_ops.strided_batch_matmul(M, K, N, batch_size, a_batch, b_batch)

# Compare with standard CuPy implementation
result_cupy = cp.matmul(a_batch, b_batch)

# Print sample results
print("Strided Batch Matrix Multiplication:")
print("Tensor Core result (first 5 elements):", result_tensor.flatten()[:5])
print("CuPy result (first 5 elements):       ", result_cupy.flatten()[:5])

Strided Batch Matrix Multiplication:
Tensor Core result (first 5 elements): [0.23895994 0.24422252 0.24787512 0.23367743 0.24973069]
CuPy result (first 5 elements):        [0.23286825 0.22940109 0.25613556 0.22756938 0.24376466]
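
In a C-contiguous array of shape (batch_size, M, K), consecutive matrices are M×K elements apart, so a base address plus that stride locates every matrix. The NumPy sketch below illustrates the layout with small arbitrary sizes; it says nothing about how `strided_batch_matmul` is implemented internally.

```python
import numpy as np

batch_size, M, K = 4, 3, 5
a_batch = np.arange(batch_size * M * K, dtype=np.float64).reshape(batch_size, M, K)

# Stride between consecutive matrices, converted from bytes to elements.
stride_elems = a_batch.strides[0] // a_batch.itemsize
print("stride in elements:", stride_elems)

# Recover matrix i from the flat buffer using only the stride.
flat = a_batch.ravel()
i = 2
matrix_i = flat[i * stride_elems:(i + 1) * stride_elems].reshape(M, K)
print(np.array_equal(matrix_i, a_batch[i]))
```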


## 8. 4D Tensor Matrix Multiplication

4D tensor multiplication applies matrix multiplication to the last two dimensions of each array and treats the two leading dimensions as batch dimensions, a pattern that is common in convolutional neural networks and other deep learning models.

In [90]:
# Define dimensions
batch1, batch2, M, N = 4, 4, 128, 128

# Create 4D tensors
a_tensor = cp.random.random((batch1, batch2, M, N), dtype=cp.float64)
b_tensor = cp.random.random((batch1, batch2, N, N), dtype=cp.float64)

# Scale inputs
a_tensor /= cp.sqrt(N)
b_tensor /= cp.sqrt(N)

# Compute 4D tensor matrix multiplication using Tensor Cores
result_tensor = tensor_ops.tensor_4d_matmul(a_tensor, b_tensor)

# Compare with standard CuPy implementation
result_cupy = cp.matmul(a_tensor, b_tensor)

# Print sample results
print("4D Tensor Matrix Multiplication:")
print("Tensor Core result (first 5 elements):", result_tensor.flatten()[:5])
print("CuPy result (first 5 elements):       ", result_cupy.flatten()[:5])

4D Tensor Matrix Multiplication:
Tensor Core result (first 5 elements): [0.268554   0.25216904 0.22979662 0.23863836 0.27084583]
CuPy result (first 5 elements):        [0.26855174 0.25216689 0.22979669 0.23863848 0.27084709]


## 9. 5D Tensor Matrix Multiplication

5D tensor multiplication handles even higher-dimensional data, useful for 3D convolutions and video processing.

In [91]:
# Define dimensions
batch, channels, depth, height, width = 2, 3, 4, 64, 64

# Create 5D tensors
a_tensor = cp.random.random((batch, channels, depth, height, width), dtype=cp.float64)
b_tensor = cp.random.random((batch, channels, depth, width, width), dtype=cp.float64)

# Scale inputs
a_tensor /= cp.sqrt(width)
b_tensor /= cp.sqrt(width)

# Compute 5D tensor matrix multiplication using Tensor Cores
result_tensor = tensor_ops.tensor_5d_matmul(a_tensor, b_tensor)

# Compare with standard CuPy implementation
result_cupy = cp.zeros((batch, channels, depth, height, width), dtype=cp.float64)
for b_idx in range(batch):
    for c in range(channels):
        for d in range(depth):
            result_cupy[b_idx,c,d] = a_tensor[b_idx,c,d] @ b_tensor[b_idx,c,d]

# Print sample results
print("5D Tensor Matrix Multiplication:")
print("Tensor Core result (first 5 elements):", result_tensor[0,0,0].flatten()[:5])
print("CuPy result (first 5 elements):       ", result_cupy[0,0,0].flatten()[:5])

5D Tensor Matrix Multiplication:
Tensor Core result (first 5 elements): [0.22260779 0.24218149 0.22158685 0.21666944 0.25543442]
CuPy result (first 5 elements):        [0.22261135 0.23759576 0.26438895 0.22105957 0.20184577]
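
Because only the last two dimensions take part in each product, a 4D or 5D multiplication can be expressed as an ordinary batched multiplication by collapsing the leading dimensions into one. The NumPy sketch below shows the idea with a hypothetical helper, `matmul_collapsed`, and checks it against an explicit loop; it is an illustration, not the library's implementation.

```python
import numpy as np

def matmul_collapsed(a, b):
    """Multiply the last two dims of a and b after flattening the leading dims."""
    lead = a.shape[:-2]
    if b.shape[:-2] != lead:
        raise ValueError("leading dimensions must match in this sketch")
    a3 = a.reshape(-1, *a.shape[-2:])   # (batch, rows, inner)
    b3 = b.reshape(-1, *b.shape[-2:])   # (batch, inner, cols)
    c3 = np.matmul(a3, b3)
    return c3.reshape(*lead, a.shape[-2], b.shape[-1])

rng = np.random.default_rng(0)
batch, channels, depth, height, width = 2, 3, 4, 8, 8
a_tensor = rng.random((batch, channels, depth, height, width))
b_tensor = rng.random((batch, channels, depth, width, width))

result = matmul_collapsed(a_tensor, b_tensor)

# Explicit loop over the three leading dimensions for comparison.
loop = np.zeros_like(result)
for b_idx in range(batch):
    for c in range(channels):
        for d in range(depth):
            loop[b_idx, c, d] = a_tensor[b_idx, c, d] @ b_tensor[b_idx, c, d]

print(result.shape, np.allclose(result, loop))
```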


## Conclusion

By following these nine examples, it should be relatively straightforward to replace CuPy with the tensor core engine in notebooks where you need a significant speedup and can trade away some accuracy. For operations that are not memory-bound, you should see significant performance improvements over standard CuPy implementations, especially for larger matrices and larger batch sizes (see the benchmark notebook for some sample runs).

Key takeaways:
1. Tensor Core operations can provide substantial speedups for large matrices
2. The double-precision splitting technique keeps results close to the CuPy reference (about five matching significant digits in the Section 1 sample)
3. The library handles inputs ranging from simple matrices to 5D tensors
4. Performance benefits increase with problem size
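
The accuracy trade-off can be quantified from the samples printed in this notebook. The sketch below computes the maximum relative difference between the Tensor Core and CuPy values shown in Sections 1 and 5. The helper name `max_rel_diff` is hypothetical, and five elements per operation give only a rough indication.

```python
import numpy as np

def max_rel_diff(result, reference):
    """Largest element-wise relative difference between result and reference."""
    result = np.asarray(result, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    return float(np.max(np.abs(result - reference) / np.abs(reference)))

# Values copied from the Section 1 output (matrix-matrix multiplication).
matmul_tensor = [0.25248319, 0.23801261, 0.25244898, 0.25019878, 0.24522987]
matmul_cupy = [0.25248071, 0.23801127, 0.25244621, 0.25019609, 0.24522916]

# Values copied from the Section 5 output (matrix-vector multiplication).
matvec_tensor = [0.25579202, 0.25863317, 0.25647345, 0.26250604, 0.25792497]
matvec_cupy = [0.25579201, 0.25863317, 0.25647348, 0.26250603, 0.25792499]

matmul_diff = max_rel_diff(matmul_tensor, matmul_cupy)
matvec_diff = max_rel_diff(matvec_tensor, matvec_cupy)
print(f"matmul: {matmul_diff:.1e}, matrix-vector: {matvec_diff:.1e}")
```

On these samples the matrix-matrix product differs from CuPy at roughly the 1e-5 level and the matrix-vector product at roughly the 1e-7 level. In a live notebook the same helper could be applied to the full result arrays after moving them to the host.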

For more details, see the library documentation and samples, or explore the Python wrapper to understand the implementation.