# Applying dimensionality reduction techniques to data
This notebook guides you through applying t-SNE and UMAP to a dataset of your choice, ideally formatted as described in the README (columns are observables, rows are sample points).


## Requirements
Because nothing comes without effort...

#### General
pandas >= 0.24.0; with an older version, you might not be able to open DataFrames pickled by someone using a more recent pandas.
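
A quick way to check the installed version before trying to unpickle anything is sketched below. The helper names `version_tuple` and `meets_minimum` are illustrative and not part of this project, and the comparison deliberately ignores pre-release suffixes.

```python
def version_tuple(version):
    """Convert a version string such as '1.5.3' or '2.0.0rc1' to a tuple of 3 ints.

    Only the leading digits of the first three dot-separated fields are kept,
    so pre-release suffixes ('rc1', 'dev0', ...) are ignored.
    """
    fields = []
    for field in version.split(".")[:3]:
        digits = ""
        for char in field:
            if not char.isdigit():
                break
            digits += char
        fields.append(int(digits) if digits else 0)
    # Pad short versions such as '0.24' so that tuples compare field by field
    while len(fields) < 3:
        fields.append(0)
    return tuple(fields)


def meets_minimum(installed, minimum="0.24.0"):
    """Return True if the `installed` version is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    import pandas as pd
    print("pandas", pd.__version__, "is recent enough:", meets_minimum(pd.__version__))
except ImportError:
    print("pandas is not installed in this environment")
```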

If you run into dependency problems, it may help to:

1. Create and activate a virtual environment, so that any updates you need will not interfere with your research project's code.

    1.1 If you don't know how to do this, see these pages (for conda): https://uoa-eresearch.github.io/eresearch-cookbook/recipe/2014/11/20/conda/ and https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
    
    1.2 You will also need to configure the environment as a kernel for jupyter notebooks: https://anbasile.github.io/programming/2017/06/25/jupyter-venv/
    
    1.3 To install a package into the virtual environment without activating it first, name the environment with the `-n` option: `conda install -n yourenvname [package]`. Without `-n`, `conda install` acts on the currently active environment.
    
    1.4 Likewise, the `conda update` command updates packages in the currently active environment, so there is no need to specify it once the environment is activated.
    

2. Run `conda update --all` to update all packages in the virtual environment, which reduces the risk of dependency conflicts.

#### t-SNE
scikit-learn has a user-friendly but slow implementation of t-SNE. I preferred to use Multicore-TSNE, seemingly the fastest alternative available:

https://github.com/DmitryUlyanov/Multicore-TSNE

You will need to install it with `pip` (it is not available with `conda`):

`pip install Multicore-TSNE`

The installation will very likely fail with a message saying that the package `cmake` was not found. In that case, install it with `pip`:

`pip install cmake` 

or with `conda`, if you prefer:

`conda install cmake` 

You should then be able to install Multicore-TSNE; if not, work through the error messages you receive.

#### UMAP

I think that the only trustworthy implementation is the original one. According to the documentation on GitHub, you can install it via `conda`:

`conda install -c conda-forge umap-learn`

or `pip`:

`pip install umap-learn` 

I used the pip install, since the conda-forge one wanted to update some packages to versions still in development (later versions than the regular conda ones).

There are some dependencies that you may need to install or update:

`conda update numpy scipy`

`conda update scikit-learn`

`conda update numba`

## References for theory

#### t-SNE
t-SNE's original paper explains the algorithm in a fairly straightforward way (sections 2-3):

van der Maaten L., and Hinton G. “Visualizing Data Using t-SNE.” *Journal of Machine Learning Research*, 9 (2008), 2579–2605.
https://lvdmaaten.github.io/tsne/

Fast implementations use the Barnes-Hut version of t-SNE. Barnes-Hut is essentially an algorithm to compute the gradient approximately and speed up the optimization procedure. 
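
For reference, the quantity being approximated is the gradient of the t-SNE cost $C$ with respect to each low-dimensional point $y_i$ (Section 3 of the paper above), where $p_{ij}$ and $q_{ij}$ are the pairwise similarities in the original and embedded spaces:

$$
\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right) \left(y_i - y_j\right) \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
$$

Evaluated exactly, the sum over $j$ for every $i$ costs $O(N^2)$ per iteration for $N$ points. As a rough summary (not taken from the paper above), the Barnes-Hut variant keeps the attractive $p_{ij}$ terms only for nearest neighbours and approximates the repulsive $q_{ij}$ terms by grouping distant points with a space-partitioning tree, which brings the cost down to roughly $O(N \log N)$.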

#### UMAP
The original paper is much more mathematically demanding than the t-SNE one.

McInnes, Leland, John Healy, and James Melville. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” *ArXiv*:1802.03426 [Cs, Stat], February 9, 2018. http://arxiv.org/abs/1802.03426.
https://github.com/lmcinnes/umap

Section 2 presents the foundations of the technique in topology and category theory. Section 3 is more practical. 


<!--the central idea is to assume that all points are uniformly distributed on some manifold, 
% then find the Riemannian metric on that manifold which enforces that uniformity property. (not sure about that part yet)
-->


In [1]:
# Initial imports
import numpy as np
import pandas as pd
import scipy as sp
import os

# Things you will probably need to install
import MulticoreTSNE as tsne
import umap

# Modules from this project
from format_tools import load_object, save_object
from analyze_tools import list_available, load_chosen

## Import your data
Load any pickled DataFrame (or another data format if you feel daring, but there are no guarantees then). 

In [2]:
# Select the folder containing your data files
folder = "data"

# Filter the files in the folder by applying the following function to each file name found.
# Here, we only keep pickle files
condition = lambda x: x.endswith(".pkl")

# Get all files in the chosen folder; store them in a dictionary for easy access
available_files = list_available(folder, condition)

The current working directory is  /Users/francoisb/code_repos/tsne_umap_day
There are 2 available .pkl files in data: {
	0:"gas_example_blocks_formatted.pkl"
	1:"gas_example_ndarray_formatted.pkl"
}
Now, select your file in the cell below
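
The helper `list_available` comes from this project's `analyze_tools` module, whose source is not shown here. As a rough idea of what such a helper could do, here is a minimal stand-in using only the standard library. The name `list_matching_files` and its exact behaviour are assumptions, not the project's implementation.

```python
import os
import tempfile


def list_matching_files(folder, condition):
    """Return {index: file_name} for the files in `folder` accepted by `condition`.

    Hypothetical stand-in for the project's `list_available`; file names are
    sorted so that the indices are reproducible between runs.
    """
    names = sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name)) and condition(name)
    )
    return dict(enumerate(names))


# Small demonstration in a temporary folder, so that no real data is touched
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("b_run.pkl", "a_run.pkl", "notes.txt"):
        open(os.path.join(demo_folder, name), "w").close()
    demo_files = list_matching_files(demo_folder, lambda x: x.endswith(".pkl"))

print(demo_files)  # {0: 'a_run.pkl', 1: 'b_run.pkl'}
```

Returning a dictionary keyed by integers is what makes it possible to pick a file by index in the next cell.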


In [3]:
# Choose the file in the dictionary above
file_chosen_index = 0

# Load the object, which should be a DataFrame in this context
df = load_chosen(file_chosen_index, folder, available_files)

Will try to import:
data/gas_example_blocks_formatted.pkl

Succesfully loaded the following object: 

Dimension                           X                   Y                   Z  \
Observables                        vx         x        vy         y        vz   
Temperature Pressure Sample                                                     
10 C        1 atm    0       0.713163  0.540529  0.814140  0.894961  0.773218   
                     1       0.443060  0.483871  0.811989  0.551495  0.947800   
                     2       0.780947  0.441669  0.135596  0.442373  0.900519   
                     3       0.049410  0.607650  0.521471  0.313742  0.270749   
                     4       0.971075  0.804892  0.750665  0.115178  0.480289   
20 C        2 atm    0       0.022336  0.529736  0.775362  0.751712  0.506447   
                     1       0.877458  0.960558  0.656000  0.157513  0.263914   
                     2       0.735113  0.312976  0.171312  0.116324  0.695630   
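
Implementations of t-SNE and UMAP generally expect a plain two-dimensional array of shape (samples, features). A sketch of how a DataFrame laid out like the one above could be converted is given below. The small frame built here is made-up stand-in data with the same index and column level names, not the example file itself.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with the same layout as the DataFrame printed above
rng = np.random.default_rng(0)
columns = pd.MultiIndex.from_tuples(
    [("X", "vx"), ("X", "x"), ("Y", "vy"), ("Y", "y")],
    names=["Dimension", "Observables"],
)
index = pd.MultiIndex.from_tuples(
    [("10 C", "1 atm", s) for s in range(5)]
    + [("20 C", "2 atm", s) for s in range(3)],
    names=["Temperature", "Pressure", "Sample"],
)
demo_df = pd.DataFrame(
    rng.random((len(index), len(columns))), index=index, columns=columns
)

# Rows are sample points and columns are observables: this is already the
# (n_samples, n_features) layout, so only the labels need to be set aside
points = demo_df.to_numpy(dtype=float)
temperatures = demo_df.index.get_level_values("Temperature").to_numpy()

print(points.shape)  # (8, 4)
print(temperatures)
```

Keeping the index labels in a separate array is convenient later on, for instance to colour the embedded points by temperature.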
       

## Apply t-SNE on the loaded data
Skip this section if you prefer UMAP. 
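
As a starting point, here is a minimal sketch of a t-SNE run. It uses made-up random data rather than the loaded DataFrame, and the parameter values (`perplexity=30`, `n_jobs=2`, `random_state=0`) are illustrative choices, not recommendations from this project. If Multicore-TSNE is not installed, the sketch falls back on the slower scikit-learn implementation, which accepts the same arguments used here.

```python
import numpy as np

try:
    # Package used in this notebook
    from MulticoreTSNE import MulticoreTSNE as TSNE
except ImportError:
    # Slower fallback with a similar interface
    from sklearn.manifold import TSNE

# Made-up data: 200 sample points with 6 observables each
rng = np.random.default_rng(0)
points = rng.random((200, 6))

# The perplexity must stay well below the number of sample points
reducer = TSNE(n_components=2, perplexity=30, random_state=0, n_jobs=2)
embedding = reducer.fit_transform(points)

print(embedding.shape)  # (200, 2)
```

With the imports at the top of this notebook, the same class is available as `tsne.MulticoreTSNE`, and `points` could be replaced by `df.to_numpy()` provided the DataFrame contains enough sample points for the chosen perplexity.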

## Apply UMAP on the loaded data