# Datasets

Datasets consist of complete combinatorial landscapes to visualize and work with, together with the data from which they were inferred. Both the inference of the complete landscape and the calculation of the visualization coordinates are precomputed, so the different layers of interest can be accessed quickly.

In [1]:
# Import required libraries
import numpy as np

from gpmap.datasets import DataSet, list_available_datasets
from gpmap.inference import VCregression

## How to load a built-in dataset

The library ships with a series of datasets that are used throughout the documentation to demonstrate the different applications. They are directly accessible to any user after installation.
The built-in datasets can be listed as follows:

In [2]:
list_available_datasets()

['5ss', 'f1u', 'test', 'dmsc', 'gb1', 'smn1', 'serine', 'trna', 'pard']

### How to access combinatorial landscape values

Any of these datasets can be loaded by name, as illustrated in previous tutorials. Every dataset should contain at least a `landscape` attribute with the phenotype associated with each possible genotype:

In [3]:
gb1 = DataSet('gb1')
gb1.landscape

seq,y
AAAA,0.296301
AAAC,-2.713474
AAAD,-2.912992
AAAE,-4.548719
AAAF,-3.276738
...,...
YYYS,-4.662925
YYYT,-3.223102
YYYV,-3.001718
YYYW,-4.723318


### How to access the processed data in experimental datasets

If the landscape was obtained from experimental data, the dataset also has a `data` attribute that includes the measurement `y` and, if available, its uncertainty `y_var`.
The data do not necessarily include measurements for every possible sequence. In this case, roughly 10,000 sequences were not experimentally measured:

In [4]:
gb1.data

sequence,y,y_var
AAAA,0.460831,0.046009
AAAG,-2.192261,0.255906
AAAH,-4.728306,2.064530
AAAI,-4.338842,2.095252
AAAL,-2.326240,0.087518
...,...,...
YYYS,-5.269987,0.291090
YYYT,-3.821426,0.074489
YYYV,-3.143536,0.074682
YYYW,-4.306581,0.699467
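
To see which genotypes lack a measurement, one option is to compare the indexes of the two tables. The sketch below uses small stand-in frames with the same layout as `gb1.landscape` and `gb1.data` (values copied from the tables above, variable names illustrative). With the real dataset the same `Index.difference` call should apply, assuming both attributes are indexed by sequence.

```python
import pandas as pd

# Stand-in for gb1.landscape: one inferred phenotype per genotype
landscape = pd.DataFrame(
    {"y": [0.296301, -2.713474, -2.912992, -4.548719]},
    index=pd.Index(["AAAA", "AAAC", "AAAD", "AAAE"], name="seq"),
)

# Stand-in for gb1.data: only measured genotypes are present
data = pd.DataFrame(
    {"y": [0.460831], "y_var": [0.046009]},
    index=pd.Index(["AAAA"], name="sequence"),
)

# Genotypes in the complete landscape without an experimental measurement
unmeasured = landscape.index.difference(data.index)
print(len(unmeasured), list(unmeasured))
```

For these genotypes, the value in `landscape` is purely an inferred phenotype rather than a smoothed measurement.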


### How to access a dataset visualization

For built-in datasets, we also provide the precalculated visualization coordinates, the `DataFrame` of edges connecting sequences separated by single point mutations, and the relaxation times associated with each of the diffusion axes. These are available in the attributes `nodes`, `edges` and `relaxation_times`, respectively.

In [5]:
gb1.nodes

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,function,stationary_freq
AAAA,-0.270938,-0.944304,-0.227171,0.744803,0.059077,-0.077512,-0.477853,0.174491,0.015944,0.052664,0.296301,1.067767e-04
AAAC,0.033789,-0.232603,-0.271458,0.576487,0.035619,0.087608,0.590118,-0.249005,-0.087750,-0.110291,-2.713474,4.954648e-06
AAAD,-0.020398,-0.127749,-0.174455,0.347843,0.142684,0.208679,0.590025,0.160819,0.354397,0.676487,-2.912992,4.042194e-06
AAAE,-0.001018,-0.138712,-0.183161,0.340728,0.121067,0.157871,0.436407,0.195630,0.211100,0.364298,-4.548719,7.619345e-07
AAAF,0.149717,-0.156524,-0.239304,0.386243,0.103285,0.107756,0.302406,0.051575,0.171278,0.226772,-3.276738,2.789084e-06
...,...,...,...,...,...,...,...,...,...,...,...,...
YYYS,0.073880,0.038075,-0.097751,0.156184,0.056463,0.074291,0.262512,0.019037,0.144439,0.172686,-4.662925,6.781399e-07
YYYT,-0.091125,0.213370,0.256403,0.246274,-0.086279,0.026923,0.217086,0.102111,0.593257,0.003682,-3.223102,2.945947e-06
YYYV,0.016488,0.195242,0.216320,0.035269,0.306726,0.025334,0.217759,-0.028542,-0.038378,0.148356,-3.001718,3.692393e-06
YYYW,0.134072,0.114107,-0.043856,0.011092,0.076565,0.109907,0.261274,0.108909,0.180371,0.365348,-4.723318,6.376209e-07


In [6]:
gb1.edges

Unnamed: 0,i,j
0,0,1
1,0,2
2,0,3
3,0,4
4,0,5
...,...,...
6079995,159996,159998
6079996,159996,159999
6079997,159997,159998
6079998,159997,159999
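
The size of the edge table follows from the structure of a complete landscape. The sketch below enumerates single-mutation neighbors for a toy alphabet and compares the count with a closed form. It assumes, as the output above suggests but the library does not state here, that `i` and `j` are row positions in `nodes` and that each pair is stored once with `i < j`. The function names are illustrative and not part of the library.

```python
from itertools import product


def single_mutation_edges(alphabet, seq_length):
    """Return genotypes and the (i, j) pairs, i < j, that differ at one position."""
    genotypes = ["".join(p) for p in product(alphabet, repeat=seq_length)]
    index = {seq: i for i, seq in enumerate(genotypes)}
    edges = []
    for seq, i in index.items():
        for pos in range(seq_length):
            for allele in alphabet:
                if allele == seq[pos]:
                    continue
                j = index[seq[:pos] + allele + seq[pos + 1:]]
                if i < j:  # keep each undirected edge once
                    edges.append((i, j))
    return genotypes, edges


def n_edges(n_alleles, seq_length):
    """Closed form: each genotype has seq_length * (n_alleles - 1) neighbors."""
    return n_alleles ** seq_length * seq_length * (n_alleles - 1) // 2


genotypes, edges = single_mutation_edges("ACGT", 2)
print(len(genotypes), len(edges), n_edges(4, 2))
print(n_edges(20, 4))
```

For 4 positions and 20 amino acids the closed form gives 6,080,000 edges, which is in line with the row labels displayed for `gb1.edges`.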


In [7]:
gb1.relaxation_times

Unnamed: 0,k,decay_rates,relaxation_time
0,1,2.554843,0.391413
1,2,3.566862,0.280359
2,3,4.926568,0.202981
3,4,5.023657,0.199058
4,5,5.303026,0.188572
5,6,5.635594,0.177444
6,7,6.294868,0.15886
7,8,6.543588,0.152821
8,9,6.741685,0.148331
9,10,7.000798,0.142841
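
The two value columns of this table are related: each relaxation time shown is the reciprocal of the corresponding decay rate. A quick check using the first three rows copied from the output above:

```python
import numpy as np

# First three rows of gb1.relaxation_times, copied from the table above
decay_rates = np.array([2.554843, 3.566862, 4.926568])
relaxation_times = np.array([0.391413, 0.280359, 0.202981])

# Reciprocal of the decay rate, compared up to the rounding of the display
matches = np.allclose(1.0 / decay_rates, relaxation_times, atol=1e-5)
print(matches)
```

Axes with smaller decay rates therefore have longer relaxation times, so the first diffusion axes capture the slowest-relaxing components.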


## How to build new datasets

We also provide functionality to create new datasets and store them in the local copy of your library for easier access. Let's build a new dataset from simulated data:

In [8]:
np.random.seed(0)
lambdas = np.array([10, 2, 0.5, 0.1, 0.02, 0])
model = VCregression(seq_length=5, alphabet_type='dna', lambdas=lambdas)
data = model.simulate(p_missing=0.2, sigma=0.1).drop('y_true', axis=1).dropna()
data

Unnamed: 0,y,y_var
AAAAA,0.039540,0.01
AAAAG,-0.117862,0.01
AAAAT,0.303257,0.01
AAACA,0.230550,0.01
AAACC,-0.000383,0.01
...,...,...
TTTCT,-0.010906,0.01
TTTGG,-0.398118,0.01
TTTTC,-0.320779,0.01
TTTTG,-0.266456,0.01
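
Experimental measurements can be arranged in the same layout as the simulated table above: sequences as the index, with columns `y` and `y_var`. The following is a minimal sketch with hypothetical replicate measurements. The column names `seq` and `value` are illustrative, and using the variance of the replicate mean as `y_var` is an assumption about what suits a given experiment rather than a requirement stated by the library.

```python
import pandas as pd

# Hypothetical raw measurements: two replicates per sequence
raw = pd.DataFrame(
    {
        "seq": ["AAAAA", "AAAAA", "AAAAG", "AAAAG"],
        "value": [0.02, 0.06, -0.10, -0.14],
    }
)

grouped = raw.groupby("seq")["value"]

# One row per sequence: mean phenotype and variance of that mean
data = pd.DataFrame(
    {
        "y": grouped.mean(),
        "y_var": grouped.var(ddof=1) / grouped.count(),
    }
)
print(data)
```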


The `build` method runs Variance Component regression and computes the visualization coordinates automatically using default values, which may not be the best choice for every dataset.

In [9]:
test = DataSet('test', data=data)
test.build()

100%|██████████| 100/100 [00:02<00:00, 41.79it/s]


We can now re-load the dataset from disk and verify that it contains the visualization attributes

> Note that reinstalling the library will erase the newly created `DataSet`s

In [10]:
test = DataSet('test')
test.nodes

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,13,14,15,16,17,18,19,20,function,stationary_freq
AAAAA,2.922532,1.612996,1.922566,1.928836,0.651238,0.724700,-0.626236,0.457580,-0.160287,0.765350,...,2.476916,3.135844,0.294568,1.121940,-0.204409,-1.407941,0.419596,-1.271204,-0.019814,0.000053
AAAAC,2.681952,1.434138,2.081914,0.677991,0.593148,1.109453,-0.824506,1.029326,-0.209692,0.677500,...,1.768603,2.792991,0.789508,1.052783,0.499965,-1.928383,0.194958,-0.919782,-0.038036,0.000044
AAAAG,2.604061,1.908399,2.159282,0.320091,0.376623,0.861715,-0.115110,0.965854,0.537670,0.772480,...,2.137501,2.527706,0.950794,1.133436,0.317646,-1.593851,0.628486,-1.566774,-0.120535,0.000020
AAAAT,1.960418,2.015446,2.802114,-0.144481,1.141420,2.506787,-1.358454,1.476034,-0.333192,0.398977,...,1.890287,2.614852,2.739016,-0.219729,-0.199154,-1.568183,0.575084,-1.885819,0.220953,0.000543
AAACA,3.814800,1.261699,0.811080,4.448437,0.032213,-0.514836,-0.255077,-0.670186,-0.133688,0.566255,...,1.577136,1.222345,-0.527213,0.709598,-0.291410,-1.466313,-0.167420,-1.182996,0.201557,0.000450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TTTGT,2.140434,-0.394854,0.736442,-0.155903,-0.180521,1.415161,0.583885,0.179762,-0.547157,0.584356,...,0.441516,0.353044,-0.189312,-0.864722,0.205254,0.092788,-0.209753,0.350808,-0.114952,0.000021
TTTTA,2.902003,0.551990,-0.275240,1.688366,-0.765170,0.558554,0.652595,0.057239,-0.154063,-0.473734,...,0.337019,-0.614797,-0.057125,-0.069688,1.167145,0.828056,-0.541204,1.131136,-0.213529,0.000008
TTTTC,2.601067,0.128646,-0.267689,0.344941,-0.793282,0.625832,0.177349,0.472832,-0.466700,0.047076,...,-0.532616,-0.050942,-0.458822,0.365146,0.399918,0.206811,-0.098076,0.757820,-0.344745,0.000002
TTTTG,2.157413,0.586338,-0.397141,0.210420,-0.676836,0.744688,0.895972,0.421951,0.458642,-0.347510,...,-0.238223,0.093321,0.244368,0.207288,0.372603,0.086408,0.194293,0.455535,-0.283153,0.000004
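
Since `nodes` is a regular `DataFrame`, standard pandas operations apply to it. The sketch below ranks genotypes by stationary frequency on a stand-in frame built from a few of the rows above. Whether the diffusion-axis columns are labeled with strings or integers in the real table is an assumption to check against `test.nodes.columns`.

```python
import pandas as pd

# Stand-in for a few rows of test.nodes (values copied from the output above)
nodes = pd.DataFrame(
    {
        "1": [2.922532, 2.681952, 1.960418, 3.814800],
        "2": [1.612996, 1.434138, 2.015446, 1.261699],
        "function": [-0.019814, -0.038036, 0.220953, 0.201557],
        "stationary_freq": [0.000053, 0.000044, 0.000543, 0.000450],
    },
    index=["AAAAA", "AAAAC", "AAAAT", "AAACA"],
)

# Genotypes where the evolving population spends most of its time
top = nodes.nlargest(2, "stationary_freq")
print(top[["1", "2", "function", "stationary_freq"]])
```

In these rows, the genotypes with the highest stationary frequency are also the ones with the highest `function` values.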
