# Text to 3D

In this notebook we use *word2vec* and *t-SNE* (t-distributed stochastic neighbor embedding) to visualize a text in 3D.

With *word2vec* we convert the words of the text to vectors, which lets us place the words in a multidimensional space and find relationships between them.
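As a rough illustration of what "relationships between vectors" means, the sketch below compares words by cosine similarity. The three-dimensional vectors are hypothetical, hand-picked values, not output from word2vec, and the helper names `cosine_similarity` and `most_similar` are made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, chosen by hand for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def most_similar(word, vectors):
    """Return the other word whose vector points in the most similar direction."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )

print(most_similar("cat", toy_vectors))
```

Real word2vec vectors usually have a hundred or more dimensions, but the comparison works the same way.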
 
*t-SNE* is an algorithm that reduces dimensionality while preserving the relationships between nearby points. With this algorithm we map our text to 3D space and then export it as a point cloud file.
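To give a feel for the perplexity setting used later in the notebook, here is a minimal sketch of the input side of t-SNE: distances from one point to its neighbors are turned into probabilities with a Gaussian kernel, and the perplexity of that distribution acts as an effective neighbor count. The distances and the two `sigma` values are hypothetical. The real algorithm also searches for a separate `sigma` per point to hit the requested perplexity and then optimizes the low-dimensional layout, neither of which is shown here.

```python
import math

def neighbor_probabilities(sq_distances, sigma):
    """Gaussian-weighted probabilities of picking each neighbor of one point."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probabilities):
    """2 ** (Shannon entropy in bits): the effective number of neighbors."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy

# Hypothetical squared distances from one point to five others:
# three close neighbors and two far ones.
sq_distances = [1.0, 1.5, 2.0, 25.0, 30.0]

narrow = perplexity(neighbor_probabilities(sq_distances, sigma=0.5))
wide = perplexity(neighbor_probabilities(sq_distances, sigma=5.0))
print(narrow, wide)
```

A small `sigma` concentrates the probability on the closest points, so the perplexity is low. A large `sigma` spreads it over all five points, so the perplexity approaches 5.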


In [None]:
#@title Imports
# !pip install transformers
import struct
from google.colab import files
import multiprocessing
import gensim
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
# from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
#@title Load text file

#@markdown First we need to upload our text file to the notebook.

#@markdown You can do this either by manually dragging and dropping the file into the Files sidebar, or by running this cell and choosing the file to upload.

uploaded = files.upload()

In [None]:
#@title Word embeddings
#@markdown This cell processes the text and converts words to vectors (or embeddings).

#@markdown This allows us to apply some maths to the text data. Later we will use the vectors to calculate distances between words, to understand their relationships, and to map them to 3D space.

#@markdown Paste the path to the file you want to use, then run the cell
text_file = '' #@param {type:"string"}
with open(text_file, 'r') as f:
  txt = f.read()

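# Assumption: the word2vec model is trained on the uploaded text itself, one sentence per line
sentences = [gensim.utils.simple_preprocess(line) for line in txt.splitlines()]
sentences = [s for s in sentences if s]  # drop empty lines
model = Word2Vec(sentences, min_count=1, workers=multiprocessing.cpu_count())

words = []
embeddings = []
# index_to_key is the gensim 4 vocabulary list (gensim 3 used model.wv.vocab)
for word in model.wv.index_to_key:
    embeddings.append(model.wv[word])
    words.append(word)
embeddings = np.array(embeddings)  # shape: (number of words, vector size)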

# model = GPT2LMHeadModel.from_pretrained('gpt2')
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# text_index = tokenizer.encode(txt,add_prefix_space=True)
# vector = model.transformer.wte.weight[text_index,:].detach().numpy()
print(len(embeddings))

In [None]:
#@title t-SNE Parameters

#@markdown Set the parameters and run t-SNE to produce the point cloud

#@markdown Metric used to calculate the distance between points
metrics = "euclidean" #@param ["braycurtis", "canberra", "chebyshev", "cityblock", "correlation", "cosine", "dice", "euclidean", "hamming", "jaccard", "jensenshannon", "kulsinski", "mahalanobis", "matching", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]
#@markdown Maximum number of iterations for the optimization. Should be at least 250.
n_iter=1000 #@param {type:"integer"}
#@markdown The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Different values can result in significantly different results.
perplexity=10 #@param {type:"slider", min:5, max:50, step:1}
#@markdown Determines the random number generator. Setting this can allow future runs of TSNE to look mostly the same.
random_state=12 #@param {type:"integer"}

# Pass the selected distance metric; scikit-learn 1.5+ renames n_iter to max_iter
tsne_3d = TSNE(perplexity=perplexity, n_components=3, init='pca', metric=metrics, n_iter=n_iter, random_state=random_state, verbose=1)
embeddings_3d = tsne_3d.fit_transform(embeddings)

In [None]:
#@title Visualisation 

#@markdown Run this cell to visualise t-SNE results
fig = px.scatter_3d(embeddings_3d, x=0, y=1, z=2)

fig.show()

In [None]:


def write_pointcloud(filename, xyz_points, rgb_points=None):
    """Write an Nx3 array of XYZ points, with optional per-point RGB colors, to a binary .ply file.
    """

    assert xyz_points.shape[1] == 3, 'Input XYZ points should be Nx3 float array'
    if rgb_points is None:
        rgb_points = np.ones(xyz_points.shape).astype(np.uint8) * 255
    assert xyz_points.shape == rgb_points.shape, 'Input RGB colors should be an Nx3 uint8 array with the same shape as the input XYZ points'

    # Write header of .ply file
    with open(filename, 'wb') as fid:
        fid.write(bytes('ply\n', 'utf-8'))
        fid.write(bytes('format binary_little_endian 1.0\n', 'utf-8'))
        fid.write(bytes(f'element vertex {xyz_points.shape[0]}\n', 'utf-8'))
        fid.write(bytes('property float x\n', 'utf-8'))
        fid.write(bytes('property float y\n', 'utf-8'))
        fid.write(bytes('property float z\n', 'utf-8'))
        fid.write(bytes('property uchar red\n', 'utf-8'))
        fid.write(bytes('property uchar green\n', 'utf-8'))
        fid.write(bytes('property uchar blue\n', 'utf-8'))
        fid.write(bytes('end_header\n', 'utf-8'))

        # Write 3D points to .ply file
        for i in range(xyz_points.shape[0]):
            # '<' forces little-endian packing without padding, matching the binary_little_endian header
            fid.write(struct.pack("<fffBBB", xyz_points[i, 0], xyz_points[i, 1], xyz_points[i, 2],
                                  int(rgb_points[i, 0]), int(rgb_points[i, 1]), int(rgb_points[i, 2])))

#@title Save point cloud
#@markdown File name (don't forget the **.ply** extension)
pointcloud_file = 'tsne-test.ply' #@param {type:"string"}
write_pointcloud(pointcloud_file, embeddings_3d)
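
One way to check an exported file is to read it back. The sketch below assumes the same layout that `write_pointcloud` declares in its header: binary little-endian, three `float` coordinates followed by three `uchar` colors per vertex. The `read_pointcloud` helper and the two demo vertices are hypothetical additions for illustration, and the demo writes its own small file so it does not depend on the cells above.

```python
import os
import struct
import tempfile

# One vertex record: x, y, z as little-endian float32, then r, g, b as uint8.
VERTEX = struct.Struct("<fffBBB")

def read_pointcloud(filename):
    """Read a binary little-endian PLY with float x, y, z and uchar r, g, b."""
    with open(filename, "rb") as fid:
        count = 0
        # The header is ASCII text, one statement per line, ending with 'end_header'.
        while True:
            raw = fid.readline()
            if not raw:
                raise ValueError("end_header not found")
            line = raw.decode("ascii").strip()
            if line.startswith("element vertex"):
                count = int(line.split()[-1])
            if line == "end_header":
                break
        # The rest of the file is `count` fixed-size vertex records.
        return [VERTEX.unpack(fid.read(VERTEX.size)) for _ in range(count)]

# Write a tiny two-vertex demo file with the same header fields.
header = (
    "ply\nformat binary_little_endian 1.0\nelement vertex 2\n"
    "property float x\nproperty float y\nproperty float z\n"
    "property uchar red\nproperty uchar green\nproperty uchar blue\n"
    "end_header\n"
)
path = os.path.join(tempfile.mkdtemp(), "demo.ply")
with open(path, "wb") as fid:
    fid.write(header.encode("ascii"))
    fid.write(VERTEX.pack(0.5, -1.0, 2.0, 255, 255, 255))
    fid.write(VERTEX.pack(1.5, 0.25, -3.0, 10, 20, 30))

points = read_pointcloud(path)
print(points)
```

Each record is 15 bytes (12 for the coordinates, 3 for the color), so the data section of a valid file should be exactly 15 times the vertex count.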