<a href="https://colab.research.google.com/github/everythingapplejj/AI-Chatbot/blob/main/Visualize_Large_scale_and_High_dimensional_Data%20-%20tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visualize Large-scale and High-dimensional Data

In this tutorial, we will learn how to use GraphVite to generate 2D / 3D visualizations of high-dimensional data. Visualization can give us insight into the data, which is useful in fields like machine learning and data science.

We will first demonstrate the visualization steps with Python code. Then we will show how to invoke fast visualization from the command line. Finally, we introduce the configuration file and hyperparameters.

---

All the code here can be run directly on Google Colab, and the results will be displayed in the browser. To run this tutorial:

1. At the top-right of the menu bar, choose *connect to hosted runtime*.
2. In the menu, choose *Runtime -> Run all*.

Since Colab provides only 2 CPU threads and a modest GPU, the code may take some time to run.

Download and install miniconda and GraphVite. This may take a while.

In [None]:
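# Download and install Miniconda into /usr/local
!wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!./Miniconda3-latest-Linux-x86_64.sh -b -p /usr/local -f

# Install GraphVite together with a matching Python and CUDA toolkit
!conda install -y -c milagraph -c conda-forge graphvite \
  python=3.6 cudatoolkit=10.0
# wurlitzer forwards C-level stdout/stderr to the notebook
!conda install -y wurlitzer ipykernel

# Make the conda-installed packages importable from the Colab kernel
import site
site.addsitedir("/usr/local/lib/python3.6/site-packages")
%reload_ext wurlitzer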

## Python Interface

First, we import some necessary packages.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# The following lines are only needed in Jupyter Notebook
from IPython.display import display, Image
%matplotlib inline

import graphvite as gv
import graphvite.application as gap

We use the MNIST dataset for illustration. MNIST is an image dataset that contains 10 categories of handwritten digits.

In [None]:
images = gv.dataset.mnist.image_data
for i in range(5):
  plt.subplot(1, 5, i+1)
  plt.xticks([])
  plt.yticks([])
  plt.imshow(images[i].reshape(28, 28))


Here we visualize the pixel space of MNIST images. That is, each image (28x28) is treated as a 784-d vector. There are 70,000 images in total.

In [None]:
print(images.shape)
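
To make the "784-d vector" idea concrete, here is a minimal sketch of how a batch of 28x28 images maps to flat vectors and back. It uses a small synthetic array as a stand-in for real images (an assumption for illustration) and does not depend on GraphVite.

```python
import numpy as np

# Synthetic stand-in for a small batch of 28x28 grayscale images.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(5, 28, 28)).astype(np.float32)

# Flatten each image row by row into a single 784-dimensional vector.
vectors = batch.reshape(len(batch), -1)
print(vectors.shape)  # (5, 784)

# Reshaping a vector back to 28x28 recovers the original image.
restored = vectors[0].reshape(28, 28)
print(np.array_equal(restored, batch[0]))  # True
```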

We create a 2D visualization application in GraphVite.

The `load()` step loads the vector data and builds a KNNGraph to represent the similarity between vectors. The `build()` step allocates all the resources for training. `train()` is then invoked to compute the visualization coordinates.

For now, we just use the default hyperparameters for all steps.

In [None]:
app = gap.VisualizationApplication(dim=2)
app.load(vectors=images)
app.build()
app.train()
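
To illustrate what a k-nearest-neighbor graph over vectors looks like, the following is a brute-force sketch in NumPy. It is not GraphVite's implementation (whose internals are not described here); the function name `knn_graph` and the synthetic data are assumptions for illustration, and the quadratic cost makes it suitable only for small inputs.

```python
import numpy as np

def knn_graph(vectors, k):
    """Return, for each row, the indices of its k nearest other rows."""
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (vectors ** 2).sum(axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * vectors @ vectors.T
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(dist, np.inf)
    # Sort each row by distance and keep the k closest indices.
    return np.argsort(dist, axis=1)[:, :k]

# Synthetic stand-in for a small set of 784-d vectors.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 784))
neighbors = knn_graph(points, k=10)
print(neighbors.shape)  # (100, 10)
```

Each row of `neighbors` lists the edges leaving one node. A layout method then tries to keep connected nodes close together in 2D or 3D.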

Now we have the visualization coordinates. We can plot them.

In most cases, this will result in 10 clusters. Sometimes there may be one or two extra clusters, depending on the random seed.

In [None]:
app.visualization()

Do the clusters correspond to the categories? We can verify this by coloring the visualization with the MNIST labels.

The clusters appear well aligned with the categories, even though the visualization process is not supervised by any labels. This indicates that MNIST is easy to separate in pixel space, which is consistent with our experience.

In [None]:
app.visualization(Y=gv.dataset.mnist.label_data)

We can obtain the coordinates with:

In [None]:
coordinates = app.solver.coordinates
print(coordinates.shape)
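
Once the coordinates are available as an array, the visual impression that clusters match the categories can also be checked with a rough number. The sketch below computes how often a point's nearest class centroid in the layout matches its own label. The helper `centroid_agreement` and the synthetic blobs are assumptions for illustration and are not part of GraphVite.

```python
import numpy as np

def centroid_agreement(coordinates, labels):
    """Fraction of points whose nearest class centroid matches their label."""
    classes = np.unique(labels)
    # Mean position of each class in the low-dimensional layout.
    centroids = np.stack([coordinates[labels == c].mean(axis=0) for c in classes])
    # Distance from every point to every centroid, shape (n_points, n_classes).
    dist = np.linalg.norm(coordinates[:, None, :] - centroids[None, :, :], axis=2)
    predicted = classes[dist.argmin(axis=1)]
    return float((predicted == labels).mean())

# Synthetic stand-in: three well-separated 2D blobs with known labels.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
coords = centers[labels] + rng.normal(scale=0.5, size=(150, 2))
print(centroid_agreement(coords, labels))
```

With real results, the same function could be applied to `app.solver.coordinates` and `gv.dataset.mnist.label_data`. A value close to 1 suggests that each category occupies its own region, although a category split into several clusters can lower the score.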

## Command Line

In many cases, we would like to use visualization as an off-the-shelf tool. Fortunately, GraphVite provides us with a convenient command line interface.

We just need to store the vectors in a NumPy dump (`*.npy`) or a text matrix (`*.txt`).

In [None]:
np.savetxt("mnist_images.txt", gv.dataset.mnist.image_data)
# an alternative format for vectors
np.save("mnist_images.npy", gv.dataset.mnist.image_data)
# labels can also be strings
np.savetxt("mnist_labels.txt", gv.dataset.mnist.label_data)

!graphvite visualize mnist_images.txt --label mnist_labels.txt --save mnist.png --3d

display(Image("mnist.png", width=400, height=400))
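
The two vector formats differ in how they are stored on disk. The following sketch round-trips a tiny array through both with plain NumPy calls so the difference is visible. The file names and the temporary directory are illustrative assumptions.

```python
import os
import tempfile
import numpy as np

vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "vectors.txt")
    npy_path = os.path.join(tmp, "vectors.npy")

    # Text matrix: one vector per line, values separated by spaces.
    np.savetxt(txt_path, vectors)
    # NumPy dump: binary file that keeps the dtype and shape.
    np.save(npy_path, vectors)

    from_txt = np.loadtxt(txt_path)
    from_npy = np.load(npy_path)

print(from_txt.shape, from_txt.dtype)  # (3, 4) float64
print(from_npy.shape, from_npy.dtype)  # (3, 4) float32
```

The text format is easy to inspect and to produce from other tools, while the binary format is more compact and preserves the original dtype.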

## Configuration File
<a id="configuration_file"></a>

GraphVite supports a configuration file interface. This is very useful if we want to customize hyperparameters for the visualization process, or generate multiple plots.

The following command creates a configuration scaffold for visualization.

In [None]:
!graphvite new visualization --file my_config.yaml

We can find `my_config.yaml` in the *Files* tab of the left pane. Since Colab does not support editing files very well, we can edit the configuration in the following cell and then run the cell.

In [None]:
%%writefile my_config.yaml
application:
  visualization

resource:
  # List of GPU ids. Multiple GPUs will cause unstable results.
  gpus: [0]
  # Memory limit for each GPU in bytes. Default is all available memory.
  gpu_memory_limit: auto
  # Number of CPU thread per GPU. Default is all CPUs.
  cpu_per_gpu: auto
  # Dimension of the embeddings.
  dim: 2

format:
  # String of delimiter characters. Change it if your node name contains blank character.
  delimiters: " \t\r\n"
  # Prefix of comment strings. Change it if you use comment style other than Python.
  comment: "#"

graph:
  # Path to vector file. Each line should be one of the following
  # [value] [delimiter] [value] [delimiter]... [comment]...
  # [comment]...
  # For standard datasets, you can specify them by <[dataset].[split]>.
  vector_file:
  # Number of neighbors for each node. Default is usually reasonable.
  num_neighbor: 200
  # Perplexity for the neighborhood of each node.
  # Typical values are between 5 and 50. Need to be tuned for best results.
  # Larger value focuses on global difference and results in larger clusters.
  perplexity: 30
  # Normalize the input vectors or not. True is recommended.
  vector_normalization: true

build:
  optimizer:
    # Optimizer.
    type: Adam
    # Learning rate. Default is usually reasonable.
    lr: 0.5
    # Weight decay. Default is usually reasonable.
    weight_decay: 1.0e-5
    # Learning rate schedule, can be "linear" or "constant". Linear is recommended.
    schedule: linear
  # Number of partitions. Auto is recommended.
  num_partition: auto
  # Number of negative samples per positive sample.
  # Larger value results in slower training.
  # The performance may be influenced by num_negative * negative_weight.
  num_negative: 5
  # Batch size of samples in CPU-GPU transfer. Default is recommended.
  batch_size: 100000
  # Number of batches in a partition block.
  # Default is recommended.
  episode_size: auto

# Comment out this section if not needed.
load:
  # Path to model file, can be "*.pkl".
  file_name: visualization.pkl

train:
  # Model, can be LargeVis.
  model: LargeVis
  # Number of epochs. Default is recommended.
  num_epoch: 50
  # Resume training from a loaded model.
  resume: false
  # Weight of negative samples. Values larger than 10 may cause unstable training.
  negative_weight: 3
  # Exponent of degrees in negative sampling. Default is recommended.
  negative_sample_exponent: 0.75
  # Batch size of samples in samplers. Default is recommended.
  sample_batch_size: 2000
  # Log every n batches.
  log_frequency: 1000

# Comment out this section if not needed.
evaluate:
  # Comment out any task if not needed.
  - task: visualization
    # Path to label file. Each line should be one of the following
    # [label] [comment]...
    # [comment]...
    # The file is assumed to have the same order as input vectors.
    file_name:
    # Path to save file, can be either "*.png" or "*.pdf".
    # If not provided, show the figure in window.
    save_file:
    # Size of the figure.
    figure_size: 10
    # Size of points. Recommend to use figure_size / 5.
    scale: 2

  # This task only works for dim = 3.
  - task: animation
    # Path to label file. Each line should be one of the following
    # [label] [comment]...
    # [comment]...
    file_name:
    # Path to save file, can be "*.gif".
    save_file:
    # Size of the figure.
    figure_size: 5
    # Size of points. Recommend to use figure_size / 5.
    scale: 1
    # Elevation angle. Default is recommended.
    elevation: 30
    # Number of frames. Default is recommended.
    num_frame: 700

  - task: hierarchy
    # Path to hierarchical label file. Each line should be one of the following
    # [label] [delimiter] [label] [delimiter]... [comment]...
    # [comment]...
    # Labels should be ordered in ascending depth, i.e. the first label corresponds to the root in the hierarchy.
    # The file is assumed to have the same order as input vectors.
    file_name:
    # Target class to be visualized.
    target:
    # Path to save file, can be "*.gif".
    save_file:
    # Size of the figure.
    figure_size: 10
    # Size of points. Recommend to use figure_size / 5.
    scale: 2
    # Duration of each frame in seconds. Default is recommended.
    duration: 3

# Comment out this section if not needed.
save:
  # Path to save file, can be "*.pkl".
  file_name: visualization.pkl
  # Save hyperparameters or not.
  save_hyperparameter: false

Once we are done, we can run our configuration with:

In [None]:
# !graphvite run my_config.yaml --cpu 2 --gpu 1
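
When several plots with different hyperparameters are wanted, one option is to derive variants of a configuration from a single template. The sketch below does this with a plain-text substitution from the standard library, which keeps the comments in the file intact. The helper `set_scalar`, the shortened template, and the output file names are hypothetical and are not part of GraphVite.

```python
import re

def set_scalar(config_text, key, value):
    """Replace the value of the first `key: value` line, keeping indentation."""
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}:)[ \t]*.*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)} {value}", config_text, count=1
    )
    if count == 0:
        raise KeyError(f"{key!r} not found in configuration")
    return new_text

# Shortened stand-in for the contents of my_config.yaml.
template = """\
graph:
  # Perplexity for the neighborhood of each node.
  perplexity: 30
build:
  optimizer:
    weight_decay: 1.0e-5
"""

# One configuration text per perplexity value (hypothetical file names).
variants = {
    f"config_perplexity_{p}.yaml": set_scalar(template, "perplexity", p)
    for p in (10, 30, 50)
}
print(variants["config_perplexity_50.yaml"])
```

Each variant can then be written to disk and passed to `graphvite run`. Note that only the first matching key is replaced, so keys that occur in several sections, such as `file_name`, need a more careful approach.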

## Hyperparameters

As shown in the Configuration File section, there are many hyperparameters we can customize in GraphVite. Here are the few that matter most for visualization quality, ranked by decreasing importance.

1. `perplexity`. It controls the number of nearest neighbors to preserve for each sample in the visualization. In other words, larger perplexity yields larger clusters. Common values are 10, 30 and 50. Note that `perplexity` should always be smaller than `num_neighbor`.

2. `weight_decay`. It controls the distance between clusters, which is useful when we want to tune the spacing of the layout. Common values are 1.0e-4, 1.0e-5 and 1.0e-6.
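
To give some intuition for `perplexity`, the sketch below assumes the usual t-SNE-style definition, where the perplexity of a neighbor distribution P is 2 raised to its entropy in bits. Whether GraphVite computes it in exactly this way is an assumption here. Under this definition, perplexity acts as an effective number of neighbors, and a distribution over `num_neighbor` candidates can never exceed `num_neighbor`, which is consistent with the constraint noted above.

```python
import numpy as np

def perplexity(probabilities):
    """Perplexity 2**H(P) of a discrete distribution P, with H in bits."""
    p = np.asarray(probabilities, dtype=np.float64)
    p = p[p > 0]  # 0 * log(0) is treated as 0
    entropy = -(p * np.log2(p)).sum()
    return float(2.0 ** entropy)

# A uniform distribution over 30 neighbors has perplexity 30.
uniform = np.full(30, 1 / 30)
print(round(perplexity(uniform), 6))  # 30.0

# Concentrating the mass on one neighbor lowers the effective count.
peaked = np.array([0.9] + [0.1 / 29] * 29)
print(perplexity(peaked) < 30)  # True
```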