# How to run on GPU

## Step 1: Generate GitHub Access Token
Go to https://github.com/settings/tokens and create a personal access token with permissions to read your private repositories.

## Step 2: Set Up Colab Environment
1. Clone this Colab notebook
2. Change the runtime type to GPU:
   - Navigate to **Runtime → Change runtime type**
   - Select **GPU** (e.g., T4) as the hardware accelerator
   - Click **Save**
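
After switching the runtime, it can help to confirm that a GPU is actually visible before installing anything. The following is a minimal sketch, assuming the runtime exposes the standard `nvidia-smi` tool on its PATH; the helper name `gpu_visible` is hypothetical and not part of this notebook.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs successfully and lists a GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tooling on PATH: most likely a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_visible():
    print("GPU runtime detected")
else:
    print("No GPU detected; check Runtime → Change runtime type")
```

If this reports no GPU, repeat the runtime change above before continuing.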

## Step 3: Install Python 3.11
Install Python 3.11 to ensure compatibility with the required dependencies.

## Step 4: Clone and Install Dependencies
1. Clone your repository using the GitHub token
2. Install the package and its dependencies with pip

## Step 5: Run Training
Execute the command-line training script to begin model training on the GPU.

# Install Python 3.11
Run the cell below without editing it.

In [None]:
!sudo apt-get update -y
!sudo apt-get install -y python3.11
!curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
!python3.11 get-pip.py

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 https://cli.github.com/packages stable InRelease [3,917 B]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:
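
Once the install cell finishes, a quick check that the new interpreter is reachable can save confusion later. This is a small sketch, assuming the interpreter is installed under the name `python3.11` as in the cell above; the function name `interpreter_version` is hypothetical.

```python
import shutil
import subprocess


def interpreter_version(executable: str = "python3.11"):
    """Return the interpreter's version string, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    out = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some interpreters print the version to stderr, so check both streams.
    return (out.stdout or out.stderr).strip() or None


print(interpreter_version() or "python3.11 not found; re-run the install cell")
```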

# Clone and install your code
First, set two environment variables. Put your GitHub access token and your MLE repo username into the variables below.

In [None]:
%env TOKEN=token here
%env USER=user name here
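
As an alternative to typing the token into a cell (where it stays in the saved notebook), the values could be prompted for at run time. The sketch below is one possible approach, not part of the original notebook: `store_credentials` is a hypothetical helper that writes to `os.environ`, which is the same place `%env` reads from.

```python
import os
from getpass import getpass


def store_credentials(token: str, user: str, env=os.environ) -> None:
    """Store the token and username under the TOKEN and USER keys of env."""
    token, user = token.strip(), user.strip()
    if not token or not user:
        raise ValueError("Both the token and the username must be non-empty.")
    env["TOKEN"] = token
    env["USER"] = user


# In a notebook cell, prompt instead of hard-coding the values:
# store_credentials(getpass("GitHub token: "), input("Repo username: "))
```

`getpass` hides the token as it is typed, so it does not appear in the cell source or its output.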

Run the cell below. It should not need editing.

In [None]:
TOKEN = %env TOKEN
USER = %env USER
%env DIR=module-3-$USER
DIR = %env DIR

!echo https://github.com/Cornell-Tech-ML/$DIR

!git clone -b master --single-branch https://$TOKEN@github.com/Cornell-Tech-ML/$DIR
!cd $DIR; python3.11 -m pip install -e ".[cuda]"
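
Because the clone URL embeds the token, printing it verbatim would expose the token in the cell output. The sketch below shows how the URL is assembled and how a masked copy could be logged instead. The organization name `Cornell-Tech-ML` comes from the cell above; the helper names `clone_url` and `masked` are hypothetical.

```python
def clone_url(token: str, repo_dir: str, org: str = "Cornell-Tech-ML") -> str:
    """Build the HTTPS clone URL that embeds the access token."""
    return f"https://{token}@github.com/{org}/{repo_dir}"


def masked(url: str, token: str) -> str:
    """Return the URL with the token replaced, so it is safe to print."""
    return url.replace(token, "***") if token else url


# Placeholder values for illustration only.
url = clone_url("example-token", "module-3-username")
print(masked(url, "example-token"))
```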

If you update your code, you can re-pull the repo by running this cell.

In [None]:
!cd $DIR; git pull origin master; pip3.11 install --force-reinstall --no-cache-dir .

From https://github.com/Cornell-Tech-ML/module-3-chenyingyu-main
 * branch            master     -> FETCH_HEAD
Already up to date.
Processing /content/module-3-chenyingyu-main
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting colorama==0.4.6 (from minitorch==0.5)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting hypothesis==6.138.2 (from minitorch==0.5)
  Downloading hypothesis-6.138.2-py3-none-any.whl.metadata (5.6 kB)
Collecting numba>=0.61.2 (from minitorch==0.5)
  Downloading numba-0.62.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.8 kB)
Collecting numpy<2.0 (from minitorch==0.5)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pytest-env==1.1.5 (from minitorch==0.5)
  Downloading pytest_env-1.1.5-py3-none-any.whl.metadata (5.2 kB)
Co

# Run tests

In [None]:
!cd $DIR; python3.11 -m pytest -m task3_3 -v

platform linux -- Python 3.11.14, pytest-8.4.1, pluggy-1.6.0 -- /usr/bin/python3.11
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /content/module-3-chenyingyu-main
configfile: pyproject.toml
plugins: hypothesis-6.138.2, env-1.1.5
collected 117 items / 60 deselected / 57 selected

tests/test_tensor_general.py::test_create[cuda] PASSED                   [  1%]
tests/test_tensor_general.py::test_one_args[cuda-fn0] PASSED             [  3%]
tests/test_tensor_general.py::test_one_args[cuda-fn1] PASSED             [  5%]
tests/test_tensor_general.py::test_one_args[cuda-fn2] PASSED             [  7%]
tests/test_tensor_general.py::test_one_args[cuda-fn3] PASSED             [  8%]
tests/test_tensor_general.py::test_one_args[cuda-fn4] PASSED             [ 10%]
tests/test_tensor_general.py::test_one_args[cuda-fn5] PASSED

In [None]:
!cd $DIR; python3.11 -m pytest -m task3_4 -v

platform linux -- Python 3.11.14, pytest-8.4.1, pluggy-1.6.0 -- /usr/bin/python3.11
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /content/module-3-chenyingyu-main
configfile: pyproject.toml
plugins: hypothesis-6.138.2, env-1.1.5
collected 117 items / 110 deselected / 7 selected

tests/test_tensor_general.py::test_mul_practice1 PASSED                  [ 14%]
tests/test_tensor_general.py::test_mul_practice2 PASSED                  [ 28%]
tests/test_tensor_general.py::test_mul_practice3 PASSED                  [ 42%]
tests/test_tensor_general.py::test_mul_practice4 PASSED                  [ 57%]
tests/test_tensor_general.py::test_mul_practice5 PASSED                  [ 71%]
tests/test_tensor_general.py::test_mul_practice6 PASSED                  [ 85%]
tests/test_tensor_general.py::test_bmm[cuda] PASSED

# Run Training

In [None]:
if False:
  import time
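  # Sanity check that training runs: time the simple dataset on GPU, then CPU.
  # The per-epoch figures divide the total wall-clock time by 500 epochs.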

  # Simple Dataset - GPU
  start = time.time()
  !cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
  gpu_time = time.time() - start
  print(f"\n=== GPU Total Time: {gpu_time:.2f}s ===")
  print(f"=== Time per Epoch: {gpu_time/500:.3f}s ===")

  # Simple Dataset - CPU
  start = time.time()
  !cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05
  cpu_time = time.time() - start
  print(f"\n=== CPU Total Time: {cpu_time:.2f}s ===")
  print(f"=== Time per Epoch: {cpu_time/500:.3f}s ===")
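
The timing pattern in the cell above (record a start time, run the script, subtract) can also be written without `!` shell magics, which makes it reusable outside a notebook. This is a minimal sketch using the standard `subprocess` module; `time_command` is a hypothetical helper, and the demo times a trivial Python invocation rather than the training script.

```python
import subprocess
import sys
import time


def time_command(cmd, cwd=None, env=None):
    """Run cmd to completion and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, cwd=cwd, env=env, check=False)
    return time.perf_counter() - start, completed.returncode


# Demo with a command that does nothing; substitute the training command.
elapsed, code = time_command([sys.executable, "-c", "pass"])
print(f"exit code {code}, {elapsed:.3f}s wall-clock")
```

The measured time covers the whole process, including interpreter startup, so it is an upper bound on the time spent in the training loop itself.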

In [None]:
if True:
  # Train on all three datasets (simple, split, xor) on both GPU and CPU,
  # print the training log for each run as it finishes,
  # then print a summary table of wall-clock times and speedups.

  import time

  # Results storage
  results = {
      'simple': {},
      'split': {},
      'xor': {}
  }

  # Run all experiments
  for dataset in ['simple', 'split', 'xor']:
      print(f"\n{'#'*70}")
      print(f"# Dataset: {dataset.upper()}")
      print(f"{'#'*70}\n")

      # GPU
      print(f"Running on GPU...")
      start = time.time()
      !cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py \
          --BACKEND gpu --HIDDEN 100 --DATASET {dataset} --RATE 0.05 2>&1 | grep -E "(Epoch|correct)"
      gpu_time = time.time() - start
      results[dataset]['gpu'] = gpu_time
      print(f"GPU Time: {gpu_time:.2f}s ({gpu_time/500:.3f}s per epoch)\n")

      # CPU
      print(f"Running on CPU...")
      start = time.time()
      !cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py \
          --BACKEND cpu --HIDDEN 100 --DATASET {dataset} --RATE 0.05 2>&1 | grep -E "(Epoch|correct)"
      cpu_time = time.time() - start
      results[dataset]['cpu'] = cpu_time
      print(f"CPU Time: {cpu_time:.2f}s ({cpu_time/500:.3f}s per epoch)\n")

      speedup = cpu_time / gpu_time
      print(f"Speedup: {speedup:.2f}x\n")

  # Print summary table
  print("\n" + "="*70)
  print("SUMMARY")
  print("="*70)
  print(f"{'Dataset':<15} {'GPU Time':<15} {'CPU Time':<15} {'Speedup':<10}")
  print("-"*70)
  for dataset in ['simple', 'split', 'xor']:
      gpu = results[dataset]['gpu']
      cpu = results[dataset]['cpu']
      speedup = cpu / gpu
      print(f"{dataset:<15} {gpu:>6.2f}s ({gpu/500:.3f}s)  {cpu:>6.2f}s ({cpu/500:.3f}s)  {speedup:>6.2f}x")
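
The summary arithmetic above (per-epoch time is total time divided by the epoch count, and speedup is CPU time divided by GPU time) can be separated from the printing so it is easier to check. The sketch below assumes a `results` dictionary shaped like the one in the cell; `summary_rows` is a hypothetical helper, and the timings in the demo are made-up placeholders, not measurements.

```python
def summary_rows(results, epochs):
    """Compute per-epoch times and CPU/GPU speedup for each dataset."""
    rows = []
    for name, times in results.items():
        gpu, cpu = times["gpu"], times["cpu"]
        rows.append({
            "dataset": name,
            "gpu_per_epoch": gpu / epochs,
            "cpu_per_epoch": cpu / epochs,
            # Values above 1 mean the GPU run was faster than the CPU run.
            "speedup": cpu / gpu,
        })
    return rows


# Placeholder timings in seconds, for illustration only.
example = {"simple": {"gpu": 100.0, "cpu": 50.0}}
for row in summary_rows(example, epochs=500):
    print(f"{row['dataset']:<15} {row['speedup']:>6.2f}x")
```

A speedup below 1, as in the placeholder data, would mean the GPU run took longer than the CPU run.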


######################################################################
# Dataset: SIMPLE
######################################################################

Running on GPU...
Epoch  0  loss  49.61079539828879 correct 396
Epoch  10  loss  21.593731660819888 correct 493
Epoch  20  loss  14.373737447747168 correct 492
Epoch  30  loss  10.79728634796506 correct 500
Epoch  40  loss  7.910825895035382 correct 498
Epoch  50  loss  8.5846880041248 correct 499
Epoch  60  loss  6.71990907937991 correct 499
Epoch  70  loss  6.571671323260893 correct 499
Epoch  80  loss  5.411215937945851 correct 500
Epoch  90  loss  4.057698089507355 correct 499
GPU Time: 292.55s (0.585s per epoch)
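
Log lines in the format shown above can be turned into numbers for further analysis. The sketch below is one way to do that with a regular expression, assuming the `Epoch <n> loss <value> correct <count>` layout seen in this output; `parse_log` is a hypothetical helper, and incomplete lines are skipped.

```python
import re

# Matches lines such as: "Epoch  10  loss  21.593731660819888 correct 493"
LOG_LINE = re.compile(r"Epoch\s+(\d+)\s+loss\s+([0-9.eE+-]+)\s+correct\s+(\d+)")


def parse_log(text):
    """Return a list of (epoch, loss, correct) tuples found in text."""
    return [
        (int(epoch), float(loss), int(correct))
        for epoch, loss, correct in LOG_LINE.findall(text)
    ]


sample = """Epoch  0  loss  49.61079539828879 correct 396
Epoch  10  loss  21.593731660819888 correct 493"""
for epoch, loss, correct in parse_log(sample):
    print(epoch, round(loss, 2), correct)
```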

Running on CPU...
Epoch  0  loss  53.97566261707796 correct 442
Epoch  10  loss  23.929828202598042 correct 478
Epoch  20  loss  11.959024773847704 correct 485
Epoch  30  loss  10.62618445549168 correct 496
Epoch  40  loss  12.502628398571142 correct 498
Epoch  50  loss  10.024342754324962 correct 489
Epoch  60  