<a href="https://colab.research.google.com/github/everythingapplejj/Research-Graph-Embeddings-/blob/JJ/Copy_of_rapids_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [None]:
!nvidia-smi

Fri Jul 26 21:54:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
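The same check can be scripted instead of read off the table. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `is_supported_gpu` and `detect_gpu_names` are hypothetical and not part of any RAPIDS tooling. It only compares the reported model name against the three GPUs named above.

```python
import shutil
import subprocess

# GPU models named in the text above; extend this tuple if the guidance changes.
SUPPORTED_GPUS = ("T4", "P4", "P100")


def is_supported_gpu(name: str) -> bool:
    """Return True if a model string such as 'Tesla T4' names a listed GPU."""
    tokens = name.replace("-", " ").split()
    return any(token in SUPPORTED_GPUS for token in tokens)


def detect_gpu_names() -> list[str]:
    """Ask nvidia-smi for GPU model names; return [] if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for gpu in detect_gpu_names():
    status = "supported" if is_supported_gpu(gpu) else "not in the list above"
    print(f"{gpu}: {status}")
```

On a machine without a GPU driver the loop simply prints nothing, so the cell is safe to run before changing the runtime type.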
                                                                    

# Setup #
The setup script:
1. Updates gcc in Colab
1. Installs Conda
1. Installs the current stable version of the RAPIDS libraries, along with some external libraries, including:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuSignal
  1. BlazingSQL
  1. xgboost
1. Copies the RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration.


In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 490, done.
remote: Counting objects: 100% (221/221), done.
remote: Compressing objects: 100% (130/130), done.
remote: Total 490 (delta 149), reused 124 (delta 91), pack-reused 269
Receiving objects: 100% (490/490), 136.70 KiB | 752.00 KiB/s, done.
Resolving deltas: 100% (251/251), done.
Collecting pynvml
  Downloading pynvml-11.5.3-py3-none-any.whl.metadata (8.8 kB)
Downloading pynvml-11.5.3-py3-none-any.whl (53 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 1.1 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.3
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla T4!
We will now install RAPIDS via pip!  Please stand by, should be quick...
***********************************************************************



In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Found existing installation: cupy-cuda12x 12.2.0
Uninstalling cupy-cuda12x-12.2.0:
  Successfully uninstalled cupy-cuda12x-12.2.0
restarting Colab...


In [None]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:18
🔁 Restarting kernel...


In [None]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

Collecting pynvml
  Using cached pynvml-11.5.3-py3-none-any.whl.metadata (8.8 kB)
Using cached pynvml-11.5.3-py3-none-any.whl (53 kB)
Installing collected packages: pynvml
Successfully installed pynvml-11.5.3
Found existing installation: cffi 1.16.0
Uninstalling cffi-1.16.0:
  Successfully uninstalled cffi-1.16.0
Collecting cffi==1.15.0
  Downloading cffi-1.15.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (1.2 kB)
Downloading cffi-1.15.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (446 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 446.3/446.3 kB 10.3 MB/s eta 0:00:00
Installing collected packages: cffi
Successfully installed cffi-1.15.0
Installing RAPIDS Stable 23.12
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local

  adde

# cuDF and cuML Examples #

Now you can run code!
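Before running the examples, it can help to confirm that the libraries are actually importable in the restarted kernel. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the package list is just the subset of RAPIDS modules used later in this notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules imported by the cells that follow.
print(missing_packages(["cudf", "cuml"]))
```

An empty list means both modules were found; any name that is printed points at an install step that needs to be rerun.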

What follows are basic examples where all processing takes place on the GPU.

# [cuDF](https://github.com/rapidsai/cudf) #

Load a dataset into a GPU memory resident DataFrame and perform a basic calculation.

Everything from CSV parsing to calculating tip percentage and computing a grouped average is done on the GPU.

In [None]:
import cudf
import io, requests

# download CSV file from GitHub
url="https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

# read CSV from memory
tips_df = cudf.read_csv(io.StringIO(content))
tips_df['tip_percentage'] = tips_df['tip']/tips_df['total_bill']*100

# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())
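
To make explicit what the grouped average in the cell above computes, here is a plain-Python sketch of the same two steps: a per-row tip percentage, then a mean per party size. The three rows are made up for illustration and are not taken from `tips.csv`; `mean_tip_percentage_by_size` is a hypothetical helper, not a cuDF function.

```python
from collections import defaultdict

# Illustrative rows only: (total_bill, tip, size). Not taken from tips.csv.
rows = [(10.0, 1.0, 2), (20.0, 4.0, 2), (50.0, 5.0, 4)]


def mean_tip_percentage_by_size(records):
    """Group tip percentages by party size and average each group."""
    grouped = defaultdict(list)
    for total_bill, tip, size in records:
        grouped[size].append(tip / total_bill * 100)
    return {size: sum(vals) / len(vals) for size, vals in sorted(grouped.items())}


print(mean_tip_percentage_by_size(rows))
```

cuDF performs the equivalent column arithmetic and group-by on the GPU, which is where the benefit appears on large tables.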

# [cuML](https://github.com/rapidsai/cuml) #

This snippet loads a

As above, all calculations are performed on the GPU.

In [None]:
!pip uninstall numpy numba -y
!pip install numpy==1.24

Found existing installation: numpy 1.23.0
Uninstalling numpy-1.23.0:
  Successfully uninstalled numpy-1.23.0
Collecting numpy==1.24
  Using cached numpy-1.24.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cucim 23.12.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.12.1 requires cubinlinker, which is not installed.
cudf 23.12.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.12.1 requires numba<0.58,>=0.57, which is not installed.
cudf 23.12.1 requires ptxcompiler, which is not installed.
cugraph 23.12.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cugraph 23.12.0 requires numba>=0.57, which is not insta

In [None]:
!pip uninstall llvmlite -y

Found existing installation: llvmlite 0.43.0
Uninstalling llvmlite-0.43.0:
  Successfully uninstalled llvmlite-0.43.0


In [None]:
!pip uninstall llvmlite numba -y
!pip install llvmlite==0.43.0
!pip install numba==0.54.0

Found existing installation: llvmlite 0.43.0
Uninstalling llvmlite-0.43.0:
  Successfully uninstalled llvmlite-0.43.0
Collecting llvmlite==0.43.0
  Using cached llvmlite-0.43.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.8 kB)
Using cached llvmlite-0.43.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (43.9 MB)
Installing collected packages: llvmlite
Successfully installed llvmlite-0.43.0


ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.53.0 Requires-Python >=3.6,<3.10; 0.53.0rc1.post1 Requires-Python >=3.6,<3.10; 0.53.0rc2 Requires-Python >=3.6,<3.10; 0.53.0rc3 Requires-Python >=3.6,<3.10; 0.53.1 Requires-Python >=3.6,<3.10; 0.54.0 Requires-Python >=3.7,<3.10; 0.54.0rc2 Requires-Python >=3.7,<3.10; 0.54.0rc3 Requires-Python >=3.7,<3.10; 0.54.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement numba==0.54.0 (from versions: 0.1, 0.2, 0.3, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.7.2, 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.12.1, 0.12.2, 0.13.0, 0.13.2, 0.13.3, 0.13.4, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.18.1, 0.18.2, 0.19.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0, 0.22.1, 0.23.0, 0.23.1, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.1, 0.29.0, 0.30.0, 0.30.1, 0.31.0, 0.32.0, 0.33.0, 0.34.0, 0.35.0, 0.36.1, 0.36.2, 0.37.

In [None]:
!pip install plotly

Collecting plotly
  Downloading plotly-5.23.0-py3-none-any.whl.metadata (7.3 kB)
Collecting tenacity>=6.2.0 (from plotly)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Downloading plotly-5.23.0-py3-none-any.whl (17.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 88.8 MB/s eta 0:00:00
Downloading tenacity-8.5.0-py3-none-any.whl (28 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.23.0 tenacity-8.5.0


In [None]:
import numpy as np

measure = np.load("measure.npy")
embeddings = np.load("embeddings.npy")
labels = np.load("labels.npy")
dis_labels = np.load("dis_labels.npy")


In [None]:
print(embeddings)

[[[-1.2152314 -1.2866514 -1.86935   -1.2985009]]

 [[-1.2204597 -1.2970121 -1.835061  -1.3021207]]

 [[-1.1632395 -1.176903  -2.327377  -1.2667544]]

 ...

 [[-1.2835389 -1.4146783 -1.5118604 -1.3492229]]

 [[-1.3384823 -1.5091949 -1.3152447 -1.3932786]]

 [[-1.3143866 -1.4684981 -1.3945947 -1.3737482]]]
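
The printed array has one 4-value vector per row, wrapped in a singleton middle axis. The later cells take odd-indexed rows as reference data and even-indexed rows as embeddings, one feature at a time. The sketch below shows an equivalent single-reshape form on a synthetic array; the shape `(6, 1, 4)` and its values are stand-ins chosen for illustration, not the notebook's data.

```python
import numpy as np

# Synthetic stand-in with the same layout as the printed array: (rows, 1, 4).
demo = np.arange(24, dtype=np.float32).reshape(6, 1, 4)

reference = demo[1::2].reshape(-1, 4)  # odd-indexed rows
embedded = demo[0::2].reshape(-1, 4)   # even-indexed rows

# Each column matches the per-feature slice-and-flatten used in later cells.
assert np.array_equal(reference[:, 0], demo[1::2, :, 0].flatten())

print(reference.shape, embedded.shape)
```

Working with an `(n, 4)` matrix keeps the four dimensions together, which is the form t-SNE expects as input.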


In [None]:
# Visualize only the reference embeddings (df_reference)
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import plotly.express as px
import plotly.io as pio

# Ensure Plotly is configured for interactive display in Google Colab
pio.renderers.default = 'colab'

# Assumes embeddings was loaded in an earlier cell; df_reference is built below

# Reference data
df_reference = pd.DataFrame({
    'Dim1': embeddings[1::2, :, 0].flatten(),
    'Dim2': embeddings[1::2, :, 1].flatten(),
    'Dim3': embeddings[1::2, :, 2].flatten(),
    'Dim4': embeddings[1::2, :, 3].flatten(),
    'Label': 'Reference'
})

# Extract unique embeddings from df_reference
unique_reference = df_reference[['Dim1', 'Dim2', 'Dim3', 'Dim4', 'Label']].drop_duplicates()

print(len(unique_reference))

# Perform t-SNE to reduce to 3 dimensions
tsne_reducer = TSNE(n_components=3, perplexity=2, learning_rate=1, n_iter=250,  random_state=42)
tsne_result = tsne_reducer.fit_transform(unique_reference[['Dim1', 'Dim2', 'Dim3', 'Dim4']])

unique_reference['TSNE1'] = tsne_result[:, 0]
unique_reference['TSNE2'] = tsne_result[:, 1]
unique_reference['TSNE3'] = tsne_result[:, 2]

# Plot the unique embeddings for df_reference
fig = px.scatter_3d(
    unique_reference,
    x='TSNE1',
    y='TSNE2',
    z='TSNE3',
    color='Label',
    title='3D t-SNE Visualization of Reference Embeddings',
    labels={'TSNE1': 't-SNE Dimension 1', 'TSNE2': 't-SNE Dimension 2', 'TSNE3': 't-SNE Dimension 3'}
)

# Display the plot using Plotly's default renderer
pio.show(fig)


18



'n_iter' was renamed to 'max_iter' in version 1.5 and will be removed in 1.7.
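
The warning above means the keyword for the iteration count depends on the installed scikit-learn version. One way to stay compatible is to inspect the constructor and pick whichever name it accepts. This is a minimal sketch; `iteration_kwarg` is a hypothetical helper, and `NewStyle` and `OldStyle` are dummy classes standing in for estimators with and without `max_iter`.

```python
import inspect


def iteration_kwarg(estimator_cls, n_iterations):
    """Return {'max_iter': n} if the constructor accepts it, else {'n_iter': n}."""
    params = inspect.signature(estimator_cls.__init__).parameters
    name = "max_iter" if "max_iter" in params else "n_iter"
    return {name: n_iterations}


class NewStyle:
    """Dummy estimator mimicking a constructor that accepts both names."""

    def __init__(self, max_iter=None, n_iter="deprecated"):
        self.max_iter = max_iter


class OldStyle:
    """Dummy estimator mimicking a constructor that only accepts n_iter."""

    def __init__(self, n_iter=1000):
        self.n_iter = n_iter


print(iteration_kwarg(NewStyle, 250))
print(iteration_kwarg(OldStyle, 250))
```

With a real estimator the result can be unpacked into the constructor call, for example `TSNE(n_components=3, **iteration_kwarg(TSNE, 250))`.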



In [None]:
import pandas as pd
import numpy as np
from cuml.manifold import TSNE  # Import cuML's TSNE
import plotly.express as px
import plotly.io as pio

# Set Plotly to render within Colab
pio.renderers.default = 'colab'

# Function to sample data
def sample_data(df, sample_size):
    return df.sample(n=sample_size, random_state=42)

# Sample size
sample_size = 75000

# Reference data (odd-indexed rows of embeddings)
df_reference = pd.DataFrame({
    'Dim1': embeddings[1::2, :, 0].flatten(),
    'Dim2': embeddings[1::2, :, 1].flatten(),
    'Dim3': embeddings[1::2, :, 2].flatten(),
    'Dim4': embeddings[1::2, :, 3].flatten(),
    'Label': 'Reference'
})

# Measure data
df_measure = pd.DataFrame({
    'Dim1': measure[:, :, 0].flatten(),
    'Dim2': measure[:, :, 1].flatten(),
    'Dim3': measure[:, :, 2].flatten(),
    'Dim4': measure[:, :, 3].flatten(),
    'Label': dis_labels[:].repeat(embeddings.shape[1])
})

# Embeddings data
df_embeddings = pd.DataFrame({
    'Dim1': embeddings[::2, :, 0].flatten(),
    'Dim2': embeddings[::2, :, 1].flatten(),
    'Dim3': embeddings[::2, :, 2].flatten(),
    'Dim4': embeddings[::2, :, 3].flatten(),
    'Label': labels[::2].repeat(embeddings.shape[1])
})

# Extract unique embeddings from each DataFrame
unique_reference = df_reference[['Dim1', 'Dim2', 'Dim3', 'Dim4', 'Label']].drop_duplicates()
unique_embeddings = df_embeddings[['Dim1', 'Dim2', 'Dim3', 'Dim4', 'Label']].drop_duplicates()
print(type(unique_reference))
print(type(unique_embeddings))

# Combine unique embeddings
combined_df = pd.concat([unique_reference, unique_embeddings]).drop_duplicates()
print(len(combined_df))

# Perform t-SNE to reduce to 2 dimensions using cuML
tsne_reducer = TSNE(n_components=2, perplexity=30000, learning_rate=200, random_state=42, n_iter=1000)

print("first passing")

# Note: .values returns a host NumPy array (not a CuPy array); cuML accepts it and copies it to the GPU
combined_gpu = combined_df[['Dim1', 'Dim2', 'Dim3', 'Dim4']].values
tsne_result = tsne_reducer.fit_transform(combined_gpu)

print("passing")

combined_df['MarkerSize'] = combined_df['Label'].apply(lambda x: 100 if x == 'Reference' else 5)
combined_df['TSNE1'] = tsne_result[:, 0]
combined_df['TSNE2'] = tsne_result[:, 1]

print(tsne_result)
# Plot the combined unique embeddings
fig = px.scatter(
    combined_df,
    x='TSNE1',
    y='TSNE2',
    color='Label',
    size='MarkerSize',
    title='2D t-SNE Visualization of Unique Embeddings',
    labels={'TSNE1': 't-SNE Dimension 1', 'TSNE2': 't-SNE Dimension 2'}
)

pio.show(fig)


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
78054
first passing
[W] [23:11:05.317706] Perplexity should be within ranges (5, 50). Your results might be a bit strange...
[W] [23:11:05.318238] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...



Starting from version 22.04, the default method of TSNE is 'fft'.



passing
[[ 302.6183    123.58776 ]
 [ 125.045555   47.080257]
 [-863.8834   -339.1267  ]
 ...
 [ 260.11353    53.11237 ]
 [-971.6729   -136.40433 ]
 [-133.6322     31.231707]]
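
The two `[W]` warnings in the output above come from `perplexity=30000`: cuML reports that perplexity should lie within (5, 50) and that the number of nearest neighbors should be at least 3 times the perplexity. The sketch below derives settings that respect both messages. `suggest_tsne_settings` is a hypothetical helper; treating the range as inclusive and using 90 as the starting neighbor count are assumptions, so check the cuML documentation for the installed version.

```python
def suggest_tsne_settings(perplexity, n_neighbors=90, low=5.0, high=50.0):
    """Clamp perplexity to [low, high] and keep n_neighbors >= 3 * perplexity."""
    clamped = min(max(float(perplexity), low), high)
    neighbors = max(int(n_neighbors), int(3 * clamped))
    return {"perplexity": clamped, "n_neighbors": neighbors}


# The value used in the cell above, clamped into the range named by the warning.
print(suggest_tsne_settings(30000))
```

The returned dictionary could be unpacked into the constructor, for example `TSNE(n_components=2, **suggest_tsne_settings(30000))`, assuming the installed cuML `TSNE` accepts an `n_neighbors` argument.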


In [None]:

fig = px.scatter(
    combined_df,
    x='TSNE1',
    y='TSNE2',
    color='Label',
    title='2D t-SNE Visualization of Unique Embeddings',
    labels={'TSNE1': 't-SNE Dimension 1', 'TSNE2': 't-SNE Dimension 2'}
)

pio.show(fig)

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib