Keeper tutorial

The primary data structure used in `netflow` is called a `Keeper`. It is used to load, store, manipulate and save data for a set of observations. In particular, there are several specific types of keepers:

- `DataKeeper` : handles feature data
- `DistanceKeeper` : handles pairwise-observation distances (also used to handle pairwise observation similarities)
- `GraphKeeper` : handles graphs (networks)

Interacting with `netflow` primarily means working with the main `Keeper` class, which implicitly uses the specialized keeper classes listed above. This tutorial therefore focuses on the `Keeper` class; see the documentation for more detail on the other keeper classes.

Data is organized in the `Keeper` class via the following attributes:

- `self.outdir` : (directory path) Path to the directory where results will be saved.
    - If not provided, no results can be saved.
- `self.observation_labels` : (`list`) Observation labels are kept consistent across all feature data, distances and similarities.
- `self.data` : (`DataKeeper`) Used to handle all feature data.
- `self.distances` : (`DistanceKeeper`) Used to handle all observation-pairwise distances.
- `self.similarities` : (`DistanceKeeper`) Used to handle all observation-pairwise similarities.
- `self.graphs` : (`GraphKeeper`) Used to handle all graphs.
- `self.misc` : (`dict`) Used to handle any miscellaneous data. 
    - Caution should be taken as observation labels and/or ordering of data stored in `self.misc` may not be consistent with the observations as tracked by the `Keeper`.
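
The caution about `self.misc` can be made concrete with a small, library-independent sketch. The names `observation_labels`, `misc_labels`, `misc_values` and `align_to_observations` below are hypothetical stand-ins for illustration and are not part of the `netflow` API. The idea is simply to reorder externally produced values so they follow the tracked observation order before storing them as miscellaneous data.

```python
# Hypothetical stand-ins: illustration only, not netflow API.
observation_labels = ["obs_a", "obs_b", "obs_c"]  # order tracked by the Keeper
misc_labels = ["obs_c", "obs_a", "obs_b"]         # order of some external result
misc_values = [0.3, 0.1, 0.2]


def align_to_observations(labels, values, reference):
    """Reorder `values` (labelled by `labels`) to follow the order of `reference`."""
    if len(labels) != len(values):
        raise ValueError("labels and values must have the same length")
    lookup = dict(zip(labels, values))
    missing = [label for label in reference if label not in lookup]
    if missing:
        raise KeyError(f"labels missing from misc data: {missing}")
    return [lookup[label] for label in reference]


aligned = align_to_observations(misc_labels, misc_values, observation_labels)
print(aligned)  # [0.1, 0.2, 0.3]
```

Raising on missing labels, rather than silently dropping them, makes a mismatch visible at the point where the data is stored.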

We will now walk through some use cases for the `Keeper` class.

First, import the necessary packages:

# Load libraries 

In [1]:
import pathlib
import sys

from collections import defaultdict as ddict
import itertools
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse as sc_sparse
from tqdm import tqdm

If ``netflow`` has not been installed, add the path to the library:

In [3]:
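# Prepend an ancestor of the working directory to the import path.
# parents[3] is four levels up; adjust the index to where the netflow source lives.
sys.path.insert(0, pathlib.Path('.').absolute().parents[3].resolve().as_posix())
# sys.path.insert(0, pathlib.Path('.').absolute().parents[0].resolve().as_posix())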
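
As an optional variation, the path can be added only when the package is not already importable. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `ensure_importable` and the fallback directory are assumptions for illustration, not something `netflow` provides.

```python
import importlib.util
import pathlib
import sys


def ensure_importable(package_name, fallback_dir):
    """Add `fallback_dir` to sys.path only if `package_name` cannot be found.

    Returns True if the fallback path was needed, False otherwise.
    """
    if importlib.util.find_spec(package_name) is not None:
        return False  # already importable; leave sys.path untouched
    fallback = pathlib.Path(fallback_dir).resolve().as_posix()
    if fallback not in sys.path:
        sys.path.insert(0, fallback)
    return True


# Hypothetical location: point this at the directory containing the netflow source.
added = ensure_importable("netflow", pathlib.Path(".").absolute())
```

This keeps an installed copy of the package from being shadowed by a local checkout.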

From the ``netflow`` package, we use the following classes:
 - The ``InfoNet`` class is used to compute 1-hop neighborhood distances
 - The ``Keeper`` class is used to store and manipulate data/results

In [4]:
import netflow as nf

# from netflow.keepers import keeper 

# Set up directories

In [5]:
MAIN_DIR = pathlib.Path('.').absolute()

Paths to where data is stored:

In [6]:
DATA_DIR = MAIN_DIR / 'example_data' / 'breast_tcga'

RNA_FNAME = DATA_DIR / 'rna_606.txt'
E_RNA_FNAME = DATA_DIR / 'edgelist_hprd_rna_606.txt'

CNA_FNAME = DATA_DIR / 'cna_606.txt'
E_CNA_FNAME = DATA_DIR / 'edgelist_hprd_cna_606.txt'

METH_FNAME = DATA_DIR / 'methylation_606.txt'
E_METH_FNAME = DATA_DIR / 'edgelist_hprd_methylation_606.txt'

CLIN_FNAME = DATA_DIR / 'clin_606.txt'
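
Before loading anything, it can help to confirm that the expected input files are present. The sketch below rebuilds the same paths with `pathlib` so that it runs on its own; the file names are those listed above, while the `expected` and `missing` variables are illustrative names.

```python
import pathlib

# Same layout as above, rebuilt here so the sketch is self-contained.
DATA_DIR = pathlib.Path(".").absolute() / "example_data" / "breast_tcga"

expected = [
    DATA_DIR / name
    for name in (
        "rna_606.txt",
        "edgelist_hprd_rna_606.txt",
        "cna_606.txt",
        "edgelist_hprd_cna_606.txt",
        "methylation_606.txt",
        "edgelist_hprd_methylation_606.txt",
        "clin_606.txt",
    )
]

# Collect every path that does not point to an existing regular file.
missing = [path for path in expected if not path.is_file()]
if missing:
    print("Missing input files:")
    for path in missing:
        print(f"  {path}")
```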

Directory where output should be saved:

In [7]:
OUT_DIR = MAIN_DIR / 'example_data' / 'results_netflow_breast_tcga'
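
This section does not say whether the `Keeper` creates the output directory itself, so creating it up front is a harmless precaution. A minimal sketch using `pathlib.Path.mkdir`:

```python
import pathlib

OUT_DIR = pathlib.Path(".").absolute() / "example_data" / "results_netflow_breast_tcga"

# parents=True creates intermediate directories; exist_ok=True makes reruns safe.
OUT_DIR.mkdir(parents=True, exist_ok=True)
```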

# Load data