# AnswerALS Cloud: Technical Requirements

#### Alex LeNail, alex@lenail.org

This notebook describes requirements for a system to "Query by Data" and "Compute on Data" against the AnswerALS datastore in Azure. 

[Background on the problem and a proposed architecture can be found here](https://docs.google.com/document/d/1nPZIVqYmNdhf-OSJI7oaNlwPq334o-h0EWiNTRKwjzs/edit?usp=sharing). One section from that document I'd like to highlight here is the conclusion, which sets out the constraints for the system:

> We should evaluate competing options on the basis of: 
> - **Query speed**: How long would it take to run a typical “data query” via each of these technologies? 
> - **Flexibility & Completeness**: What operations are supported by each of these technologies? For Compute on Data specifically: Could I run python, R, arbitrary binaries / bash? What would the technology allow me? 
> - **Familiarity & ease of use**: Are the end users already familiar with the proposed technology? If not, what is the learning curve for an average bioinformatician? Once they’ve learned, how easy is it for an experienced user to run queries? 
> - **Development Cost**: Whatever we develop should rely heavily on Azure technologies, and avoid reinventing the wheel. 
> - **Maintenance Cost**: Whatever we develop should be sufficiently robust and minimal as to require minimal maintenance, and be possible for e.g. a grad student to debug, should they need to. 


This notebook spells out a set of representative operations we want to execute via this system.



In [10]:
import numpy as np
import pandas as pd
# import qbd

In [8]:
files = pd.read_csv('./data/Files.csv')

files 

| Unnamed: 0 | omic | NeuroGUID | CGND_ID | iPSC_Line | line | differentiation | data_level | experiment | path |
|---|---|---|---|---|---|---|---|---|---|
| 0 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 1 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 2 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 3 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 4 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 5 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 6 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 7 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 8 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
| 9 | genomics | CS-NL-017 | CGND-HDA-00216 | | | | 1 | WGS | genomics/1_fastq/CGND_11598/Sample_CGND-HDA-00... |
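
To make "Query by Data" concrete, here is a minimal sketch of what a query against a file manifest like the one above might look like in plain pandas. It is not the `qbd` API, which this notebook does not define. The helper `query_files` is a hypothetical name, and the small in-memory manifest reuses only the column names and a few values shown above. The `transcriptomics` row and every path in it are invented placeholders so the example runs without `./data/Files.csv`.

```python
import pandas as pd

# Hypothetical stand-in for ./data/Files.csv. Column names match the manifest,
# but the rows and all paths are invented placeholders for illustration.
files = pd.DataFrame(
    {
        "omic": ["genomics", "genomics", "transcriptomics"],
        "NeuroGUID": ["CS-NL-017", "CS-NL-017", "EXAMPLE-GUID"],
        "CGND_ID": ["CGND-HDA-00216", "CGND-HDA-00216", "EXAMPLE-ID"],
        "data_level": [1, 1, 2],
        "experiment": ["WGS", "WGS", "RNA-seq"],
        "path": ["example/a.fastq.gz", "example/b.fastq.gz", "example/c.tsv"],
    }
)


def query_files(manifest, **criteria):
    """Return the rows of `manifest` whose columns equal every given criterion.

    Hypothetical helper: each keyword is a column name, each value is the
    exact value that column must hold.
    """
    mask = pd.Series(True, index=manifest.index)
    for column, value in criteria.items():
        mask &= manifest[column] == value
    return manifest[mask]


# Select the level-1 whole-genome sequencing files and list their paths.
wgs_fastq = query_files(files, omic="genomics", data_level=1, experiment="WGS")
print(wgs_fastq["path"].tolist())
```

A query of this shape returns file paths rather than file contents, so the result could be handed to a separate "Compute on Data" step. Whether the real system should filter a manifest this way is one of the open design questions, not something this sketch settles.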
