# dimenSNEon Tests
## Read me!

This notebook tests dimenSNEon on sample scRNA-seq data by running its t-SNE implementation alongside scanpy's built-in one.

**Test data is not included in the repository (as it is too large), so the notebook will automatically download it. This may take a while.**

You can configure some options below:

In [None]:
# This notebook runs dimenSNEon's t-SNE and scanpy's built-in t-SNE on a sample scRNA-seq dataset (see DATASET below).
# You can tweak parameters below to alter the test.

# Truncate the data to this many datapoints in the interest of speed.
NUM_DATAPOINTS = 1000

# How many iterations of dimenSNEon to run.
NUM_ITERATIONS = 1000

# Perplexity to target.
PERPLEXITY = 30

# Whether to generate an animation of dimenSNEon's t-SNE process.
MAKE_ANIMATION = True

# The example dataset to download. There are two valid values:
# "10xgen" - The 10x genomics example scRNA-seq dataset
# "pancreatic" - The pancreatic cell sample from lab 6. (Specifically, the M3 E7 post-implantation cells)
DATASET = "pancreatic"

## Downloading Test Data
The following cell downloads the data. **This may take a while.**
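
For large archives, reading the entire response into memory before writing it out can be wasteful. A minimal sketch of a chunked alternative follows. The helper name `save_stream` is hypothetical (it is not part of this notebook or of dimenSNEon), and it accepts any binary file-like object, such as the one returned by `urllib.request.urlopen`.

```python
import shutil


def save_stream(source, destination_path, chunk_size=1024 * 1024):
    """Copy a binary file-like object to disk in fixed-size chunks.

    `save_stream` is a hypothetical helper, shown for illustration only.
    """
    with open(destination_path, "wb") as out_file:
        # copyfileobj reads at most chunk_size bytes at a time, so the
        # whole download never has to sit in memory at once.
        shutil.copyfileobj(source, out_file, chunk_size)
    return destination_path
```

Under these assumptions, it could be used as `with urllib.request.urlopen(request) as response: save_stream(response, FILENAME)`.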

In [None]:
# First, ensure test data is downloaded
import os, urllib.request, tarfile, sys

DATA_URL, FILENAME, DATADIR, EXTRACTDIR, DATAPREFIX = ("","","","","")

if DATASET == "10xgen":
    DATA_URL = "https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_10k_v3/pbmc_10k_v3_filtered_feature_bc_matrix.tar.gz"
    FILENAME = "data/data.tar.gz"
    DATADIR = "data/filtered_feature_bc_matrix"
    EXTRACTDIR = "data"
    DATAPREFIX = None
elif DATASET == "pancreatic":
    DATA_URL = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE167880&format=file"
    FILENAME = "data/data.tar"
    DATADIR = "data/pancreatic"
    EXTRACTDIR = "data/pancreatic"
    DATAPREFIX = "GSM5114474_M3_E7_"

else:
    raise ValueError("DATASET must be one of '10xgen' or 'pancreatic'")

if not os.path.isdir(DATADIR) or not os.path.isdir(EXTRACTDIR):
    # exist_ok avoids a FileExistsError when EXTRACTDIR exists but DATADIR does not
    os.makedirs(EXTRACTDIR, exist_ok=True)
    print(f"Hold on, downloading sample dataset '{DATASET}'...")

    # We have to fake the user agent to make 10x happy so we don't get a 403...
    request = urllib.request.Request(
        DATA_URL,
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    # Both context managers close their resources automatically
    with urllib.request.urlopen(request) as response, open(FILENAME, "wb") as file:
        file.write(response.read())

    print("Downloaded! Uncompressing...")
    mode = "r:gz" if FILENAME.endswith("gz") else "r"
    with tarfile.open(FILENAME, mode) as tar:
        tar.extractall(EXTRACTDIR)

    # If we're using the lab 6 pancreatic cell data, we also need to download the features separately
    if DATASET == "pancreatic":
        print("Downloading features matrix...")
        urllib.request.urlretrieve("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE167880&format=file&file=GSE167880%5Ffeatures%2Etsv%2Egz", f"{DATADIR}/{DATAPREFIX}features.tsv.gz")

    print("Done!")

## Loading Libraries and Data
The following cell loads needed python libraries and loads the scRNA-seq data.

In [None]:
# Import libraries and load data

# Ensure dimensneon is in the path (so we don't need to install it to run this)
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import scanpy as sc
import dimensneon.tsne as dtsne
import time

data = sc.read_10x_mtx(DATADIR, cache=True, prefix=DATAPREFIX)
print("Loaded data!")

## Data Preprocessing and PCA
In order to run t-SNE, we first need to normalize cell counts, convert them to log scale, and then find our highly variable genes.

Then, we arbitrarily limit the number of datapoints to `NUM_DATAPOINTS` (in the interest of time).

Finally, we create two copies of the data - one for use with scanpy's t-SNE implementation, and one for dimenSNEon's implementation.
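
To make the first two steps concrete, here is a minimal pure-Python sketch of per-cell normalization followed by a log transform on a toy count matrix. The function `normalize_and_log` is hypothetical and only illustrates the idea; scanpy's own functions work on sparse matrices and handle more cases than this.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    result = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts are left as zeros to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        result.append([math.log1p(value * scale) for value in cell])
    return result


# Two toy cells with different sequencing depth but the same relative expression.
toy_counts = [
    [1, 3, 0],
    [10, 30, 0],
]
normalized = normalize_and_log(toy_counts)
```

After normalization the two toy cells have identical profiles, which is the point of the step: differences in sequencing depth no longer dominate the distances that t-SNE works with.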

In [None]:
# Normalize counts and get highly variable genes
sc.pp.normalize_per_cell(data, counts_per_cell_after=1e4)
sc.pp.log1p(data)
sc.pp.highly_variable_genes(data, n_top_genes=100)

# Arbitrarily limit to the first NUM_DATAPOINTS cells in the interest of speed,
# keeping only the highly variable genes. Copy so we work on real data, not a view.
data_var = data[:NUM_DATAPOINTS, data.var['highly_variable']].copy()
sc.pp.neighbors(data_var)  # Computes the neighborhood graph; needed for clustering.
sc.tl.leiden(data_var)  # Clusters cells by expression profile; needed to color cells by cluster.

# Create two copies of the data: one for scanpy's built-in t-SNE, one for dimenSNEon.
data_builtin = data_var.copy()
data_dsne = data_var.copy()

## Scanpy t-SNE

The following cell runs and times the scanpy implementation of t-SNE.
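
Since the same timing pattern is needed for both implementations, it could be wrapped in a small context manager. The sketch below is one possible approach; `stopwatch` and `timings` are hypothetical names that the notebook itself does not use.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Record the wall-clock duration of the enclosed block in `results[label]`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the elapsed time even if the block raises.
        results[label] = time.perf_counter() - start


timings = {}
with stopwatch("example", timings):
    sum(range(100_000))  # stand-in for a call such as sc.tl.tsne(...)

print(f"example took {timings['example']:.4f}s")
```

`time.perf_counter` is used here because it is a monotonic, high-resolution clock intended for measuring intervals.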

In [None]:
# Run the builtin scanpy tSNE and time it
scanpy_t0 = time.perf_counter()
sc.tl.tsne(data_builtin, perplexity=PERPLEXITY)
scanpy_t1 = time.perf_counter()

# Graph it
title = f"Scanpy t-SNE ({scanpy_t1 - scanpy_t0:.2f}s)"
sc.pl.tsne(data_builtin, color=['leiden'], legend_loc='on data', legend_fontsize=10, alpha=0.8, size=20, title=title)

## dimenSNEon t-SNE

The following cell runs and times the dimenSNEon implementation of t-SNE.
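
Both implementations take a `PERPLEXITY` setting. In the standard t-SNE formulation, the perplexity of a point's conditional distribution is 2 raised to its Shannon entropy (in bits), and the Gaussian bandwidth for each point is tuned by binary search until that perplexity matches the target. The sketch below illustrates this textbook procedure for a single point. It is not taken from dimenSNEon's source, and all function names are hypothetical.

```python
import math


def conditional_probabilities(sq_distances, beta):
    """Gaussian affinities for one point, given squared distances to the others.

    beta is the precision 1 / (2 * sigma^2).
    """
    # Subtract the smallest distance before exponentiating for numerical
    # stability; the shift cancels out in the normalization.
    d_min = min(sq_distances)
    weights = [math.exp(-beta * (d - d_min)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probabilities):
    """Perplexity 2^H(P), with the Shannon entropy H measured in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy


def find_beta(sq_distances, target_perplexity, tolerance=1e-5, max_steps=100):
    """Binary search for the precision whose perplexity matches the target."""
    low, high = 0.0, math.inf
    beta = 1.0
    for _ in range(max_steps):
        current = perplexity(conditional_probabilities(sq_distances, beta))
        if abs(current - target_perplexity) < tolerance:
            break
        if current > target_perplexity:
            # Distribution too flat: increase precision (narrower Gaussian).
            low = beta
            beta = beta * 2.0 if math.isinf(high) else (low + high) / 2.0
        else:
            # Distribution too peaked: decrease precision (wider Gaussian).
            high = beta
            beta = (low + high) / 2.0
    return beta


# Squared distances from one point to five hypothetical neighbours.
example_distances = [1.0, 2.0, 3.0, 4.0, 5.0]
beta = find_beta(example_distances, target_perplexity=3.0)
achieved = perplexity(conditional_probabilities(example_distances, beta))
```

A larger perplexity makes each point consider more neighbours, so it should generally be kept well below the number of datapoints.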

In [None]:
# Uncomment these two lines to reload dimenSNEon when making changes
# import importlib
# importlib.reload(dtsne)

# Run dimenSNEon and time it
dimensneon_t0 = time.perf_counter()
dsne_result = dtsne.tsne(data_dsne, iterations=NUM_ITERATIONS, perplexity=PERPLEXITY, animate=MAKE_ANIMATION)
dimensneon_t1 = time.perf_counter()

# Graph it
title = f"dimenSNEon t-SNE ({dimensneon_t1 - dimensneon_t0:.2f}s)"
sc.pl.tsne(data_dsne, color=['leiden'], legend_loc='on data', legend_fontsize=10, alpha=0.8, size=20, title=title)

## Animation

For my presentation, I included an animation of the t-SNE simulation. The following code will generate the frame images if `MAKE_ANIMATION` is `True`.
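
When the frames are later assembled into a video or GIF, they need to be ordered by iteration rather than alphabetically. The sketch below shows one way to do that. It assumes scanpy writes each frame under its default figure directory with a `tsne` prefix (for example `figures/tsneframe_12.png`), which depends on your scanpy settings; `frame_index` and `order_frames` are hypothetical helpers.

```python
import re


def frame_index(filename):
    """Extract the integer index from a name such as 'tsneframe_12.png'."""
    match = re.search(r"frame_(\d+)\.png$", filename)
    if match is None:
        raise ValueError(f"not a frame file: {filename}")
    return int(match.group(1))


def order_frames(filenames):
    """Sort frame files numerically so that frame_10 comes after frame_2."""
    return sorted(filenames, key=frame_index)


# Plain string sorting would place 'tsneframe_10.png' before 'tsneframe_2.png'.
names = ["tsneframe_10.png", "tsneframe_2.png", "tsneframe_0.png"]
ordered = order_frames(names)
```

The ordered list can then be handed to whichever tool you use to stitch the images together.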

In [None]:
# If we wanted to make an animation of the t-SNE simulation, now we do so
if MAKE_ANIMATION:
    for idx, frame in enumerate(dsne_result):
        data_dsne.obsm['X_tsne'] = frame
        sc.pl.tsne(data_dsne, color=['leiden'], legend_loc='on data', legend_fontsize=10, alpha=0.8, size=20, save=f"frame_{idx}.png", title=f"Iteration {idx * 10}")