# AnVIL Phenotype Working Group

![image](https://user-images.githubusercontent.com/47808/109844406-f0661c00-7c00-11eb-903b-4ba0f39ea798.png)



# Discussion


## Assumptions
![image](https://user-images.githubusercontent.com/47808/109850568-81d88c80-7c07-11eb-92b9-6269089c9481.png)

### Do we as the AnVIL team (data submitters, ingesters and tool builders) want to have uniform metadata for a consortium?
### Can we as the AnVIL team (data submitters, ingesters and tool builders) achieve consensus on a uniform set of entities, attributes and value sets?
### As an Ingestion team, do we want to provide a tool for data submitters to test their workspaces for compliance?
### Does each workspace have a submitter with time and budget to validate and correct metadata?


## For submitters:
![image](https://user-images.githubusercontent.com/47808/109833584-3c5f9380-7bf6-11eb-9a77-8f5caff8b619.png)

### Create a lightweight notebook for installation in the data submitters' workspaces
### Use consensus schemas as a reference
### Create lightweight validation within the notebook to assist the submitter

* Be quick and easy to use, without requiring learning new tech
* Have clear documentation
* Make it fast and easy to validate data
* Have straightforward query and download via the command line.
* Give us control over when we publish our data 
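
A minimal sketch of what lightweight in-notebook validation could look like, assuming the consensus schema is available as a plain mapping from entity name to attribute names. `CONSENSUS_SCHEMA`, `validate_table` and `is_compliant` are hypothetical names, and the schema below is abbreviated to a few attributes for illustration only.

```python
from typing import Dict, Iterable, List

# Hypothetical, abbreviated consensus schema: entity name -> expected attributes.
CONSENSUS_SCHEMA: Dict[str, List[str]] = {
    "subject": ["01-subject_id", "09-sex", "16-affected_status"],
    "sample": ["01-subject_id", "02-sample_id", "06-data_type"],
}


def validate_table(entity: str, columns: Iterable[str],
                   schema: Dict[str, List[str]] = CONSENSUS_SCHEMA) -> Dict[str, List[str]]:
    """Compare a table's columns with the reference schema for its entity."""
    if entity not in schema:
        return {"missing": [], "unexpected": [], "errors": [f"unknown entity: {entity}"]}
    expected = set(schema[entity])
    actual = set(columns)
    return {
        "missing": sorted(expected - actual),      # required but absent
        "unexpected": sorted(actual - expected),   # present but not in the schema
        "errors": [],
    }


def is_compliant(report: Dict[str, List[str]]) -> bool:
    """A table is compliant when the report lists no differences and no errors."""
    return not (report["missing"] or report["unexpected"] or report["errors"])


report = validate_table("sample", ["01-subject_id", "02-sample_id", "extra"])
print(report, is_compliant(report))
```

Returning a report rather than raising keeps the notebook friendly for submitters: they see every difference at once and can correct the table before re-running the check.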


## For ingestion team:
![image](https://user-images.githubusercontent.com/47808/109833487-281b9680-7bf6-11eb-9340-7124b4422aed.png)

### Provide a mechanism to harvest and report on validation status of submitters' notebooks

* Have an easy way to contact and work with the Submitters/Developers.
* Clearly separate the technology from the biology.
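
One way the harvested results could be rolled up, sketched under the assumption that each submitter notebook emits a small record with a workspace name and a status. The record layout, the status values (`valid`, `invalid`, `not_run`) and the function name `summarize_status` are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def summarize_status(reports: List[Dict[str, str]]) -> Dict:
    """Roll up per-workspace validation records into one summary."""
    counts = Counter(r["status"] for r in reports)
    # Anything that is not explicitly valid needs a conversation with the submitter.
    needs_followup = sorted(r["workspace"] for r in reports if r["status"] != "valid")
    return {
        "total": len(reports),
        "by_status": dict(counts),
        "needs_followup": needs_followup,
    }


# Hypothetical records, as they might be harvested from three workspaces.
reports = [
    {"workspace": "ws_a", "status": "valid"},
    {"workspace": "ws_c", "status": "invalid"},
    {"workspace": "ws_b", "status": "not_run"},
]
print(summarize_status(reports))
```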





## Submitter User Journey
![image](https://user-images.githubusercontent.com/47808/109833584-3c5f9380-7bf6-11eb-9a77-8f5caff8b619.png)

### Using established Terra tools, I upload my data to my workspace.
### I open the notebook and validate the tables
> Should validation happen before upload?
### If successful, I can tag the data as ready for submittal


## Ingestion Team User Journey
![image](https://user-images.githubusercontent.com/47808/109833487-281b9680-7bf6-11eb-9340-7124b4422aed.png)

### Using Terra tables, I can create and edit lists of entities and attributes to be used for validating submissions.
### I can track the status of submitter workspaces


![image](https://user-images.githubusercontent.com/47808/109830442-5b105b00-7bf3-11eb-9d23-156953843e03.png)



# Context
## De-facto consensus schemas & Current Schema Compliance


In [1]:
from ingest import schema_problems, get_dashboard_data, show_project_schema_compliance

dashboard_data = schema_problems(get_dashboard_data(), show_compliant=True)
show_project_schema_compliance(dashboard_data, 'CMG')
show_project_schema_compliance(dashboard_data, 'CCDG')

entity,columns
discovery,"01-subject_id,02-sample_id,03-Gene-1,04-Gene_Class-1,05-inheritance_description-1,06-Zygosity-1,07-Chrom-1,08-Pos-1,09-Ref-1,10-Alt-1,11-hgvsc-1,12-hgvsp-1,13-Transcript-1,14-sv_name-1,15-sv_type-1,16-significance-1,17-Gene-2,18-Gene_Class-2,19-inheritance_description-2,20-Zygosity-2,21-Chrom-2,22-Pos-2,23-Ref-2,24-Alt-2,25-hgvsc-2,26-hgvsp-2,27-Transcript-2,28-sv_name-2,29-sv_type-2,30-significance-2,31-Gene-3,32-Gene_Class-3,33-inheritance_description-3,34-Zygosity-3,35-Chrom-3,36-Pos-3,37-Ref-3,38-Alt-3,39-hgvsc-3,40-hgvsp-3,41-Transcript-3,42-sv_name-3,43-sv_type-3,44-significance-3"
family,"01-subject_id,02-family_id,03-paternal_id,04-maternal_id,05-twin_id,06-family_relationship,07-consanguinity,08-consanguinity_detail,09-pedigree_image,10-pedigree_detail,11-family_history,12-family_onset"
sample,"01-subject_id,02-sample_id,03-dbgap_sample_id,04-sample_source,05-sample_provider,06-data_type,07-date_data_generation"
sample_set,"release_date,samples"
sequencing,"10x_rate,20x_rate,adapter_rate,chimera_rate,collaborator_participant_id,collaborator_sample_id,contamination_rate,crai_path,cram_path,data_type,exc_baseq_rate,exc_dupe_rate,exc_mapq_rate,exc_overlap_rate,exc_total_rate,exc_unpaired_rate,genome_territory,library-1_estimated_library_size,library-1_mean_insert_size,library-1_name,library-1_pair_orientation,library-1_pct_exc_dupe,library-1_percent_duplication,library-1_read_pairs,md5_path,mean_coverage,mean_read_length,pdo,pf_aligned_bases,pf_hq_aligned_bases,pf_hq_aligned_q20_bases,pf_hq_aligned_reads,pf_mismatch_rate,pf_noise_reads,pf_reads,pf_reads_aligned,pf_reads_aligned_rate,pf_reads_rate,project,reads_aligned_in_pairs,reads_aligned_in_pairs_rate,reference_sequence_name,release_date,root_sample_id,sample,total_reads,version"
subject,"01-subject_id,02-prior_testing,03-project_id,04-pmid_id,05-dbgap_submission,06-dbgap_study_id,07-dbgap_subject_id,08-multiple_datasets,09-sex,10-ancestry,11-ancestry_detail,12-age_at_last_observation,13-phenotype_group,14-disease_id,15-disease_description,16-affected_status,17-onset_category,18-age_of_onset,19-hpo_present,20-hpo_absent,21-phenotype_description,22-solve_state"


| consortium | name | discovery | family | sample | sample_set | sequencing | subject |
|---|---|---|---|---|---|---|---|
| CMG | AnVIL_CMG_Broad_Heart_Ware_WES | True | True | True | True | False | True |
| CMG | AnVIL_CMG_Broad_Blood_Sankaran_WES | True | True | True | True | False | True |
| CMG | AnVIL_CMG_Broad_Muscle_Kang_WES | True | True | True | True | False | True |
| CMG | AnVIL_CMG_Broad_Kidney_Hildebrandt_WES | True | True | True | True | False | True |
| CMG | AnVIL_CMG_Broad_Muscle_OGrady_WES | True | True | True | True | False | True |
| CMG | AnVIL_CMG_Broad_Brain_Gleeson_WES | True | True | True | True | False | True |
| CMG | AnVIL_CMG_Broad_Orphan_Manton_WES | True | True | True | True | False | True |
| CMG | AnVIL_CMG_Broad_Muscle_KNC_WES | True | True | True | True | False | True |
| CMG | AnVIL_CMG_Broad_Orphan_VCGS-White_WGS | True | True | True | True | False | True |
| CMG | ANVIL_CMG_Broad_Muscle_Laing_WES | True | True | True | False | False | True |


entity,columns
participant,"collaborator_participant_id,gender"
sample,"bait_set,bait_territory,collaborator_participant_id,collaborator_sample_id,crai_path,cram_path,data_type,fold_80_base_penalty,fold_enrichment,het_snp_q,het_snp_sensitivity,library-1_estimated_library_size,library-1_mean_insert_size,library-1_name,library-1_pair_orientation,library-1_pct_exc_dupe,library-1_percent_duplication,library-1_read_pairs,library-2_estimated_library_size,library-2_mean_insert_size,library-2_name,library-2_pair_orientation,library-2_pct_exc_dupe,library-2_percent_duplication,library-2_read_pairs,md5_path,mean_bait_coverage,mean_target_coverage,median_target_coverage,near_bait_bases,off_bait_bases,on_bait_bases,on_bait_vs_selected,on_target_bases,participant,pct_chimeras,pct_exc_baseq,pct_exc_dupe,pct_exc_mapq,pct_exc_off_target,pct_exc_overlap,pct_off_bait,pct_selected_bases,pct_target_bases_100x,pct_target_bases_10x,pct_target_bases_20x,pct_target_bases_2x,pct_target_bases_30x,pct_target_bases_50x,pct_usable_bases_on_bait,pct_usable_bases_on_target,project,sample,target_territory,version,zero_cvg_targets_pct"
sample_set,"DBSNP_INS_DEL_RATIO,DBSNP_TITV,FILTERED_INDELS,FILTERED_SNPS,NOVEL_INDELS,NOVEL_INS_DEL_RATIO,NOVEL_SNPS,NOVEL_TITV,NUM_IN_DB_SNP,NUM_IN_DB_SNP_COMPLEX_INDELS,NUM_IN_DB_SNP_INDELS,NUM_IN_DB_SNP_MULTIALLELIC,NUM_SINGLETONS,PCT_DBSNP,PCT_DBSNP_INDELS,release_date,samples,SNP_REFERENCE_BIAS,TOTAL_COMPLEX_INDELS,TOTAL_INDELS,TOTAL_MULTIALLELIC_SNPS,TOTAL_SNPS,vcf,versions_cloud_path,workflow_wdl"


| consortium | name | participant | sample | sample_set |
|---|---|---|---|---|
| CCDG | anvil_ccdg_broad_ai_ibd_niddk_daly_silverberg_wes | True | True | True |
| CCDG | AnVIL_CCDG_Broad_NP_Epilepsy_DEUUPM_HMB_MDS_WES | True | True | True |
| CCDG | AnVIL_ccdg_asc_ndd_daly_talkowski_puura_asd_exome | True | True | True |
| CCDG | anvil_ccdg_broad_ai_ibd_daly_kupcinskas_wes | True | True | True |
| CCDG | AnVIL_CCDG_Broad_NP_Epilepsy_CZEMTH_GRU_WES | True | True | True |
| CCDG | AnVIL_ccdg_asc_ndd_daly_talkowski_barbosa_asd_exome | True | True | True |
| CCDG | anvil_ccdg_broad_ai_ibd_niddk_daly_duerr_wes | True | True | True |
| CCDG | AnVIL_ccdg_asc_ndd_daly_talkowski_herman_asd_exome | True | True | True |
| CCDG | AnVIL_ccdg_asc_ndd_daly_talkowski_palotie_asd_exome | True | True | True |
| CCDG | AnVIL_ccdg_asc_ndd_daly_talkowski_pericak-vance_asd_exome_ | True | True | True |
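
The `ingest` module itself is not shown here, so the following is only a sketch of how a "de-facto consensus" could be derived: take the column set used by the most workspaces as the reference and mark every workspace as compliant or not against it. `consensus_columns` and `compliance` are hypothetical names, and the workspace names and columns in the example are made up.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def consensus_columns(workspace_columns: Dict[str, Iterable[str]]) -> FrozenSet[str]:
    """Return the column set shared by the largest number of workspaces."""
    if not workspace_columns:
        raise ValueError("need at least one workspace to derive a consensus")
    votes = Counter(frozenset(cols) for cols in workspace_columns.values())
    # On a tie, Counter.most_common keeps the set that was encountered first.
    winner, _ = votes.most_common(1)[0]
    return winner


def compliance(workspace_columns: Dict[str, Iterable[str]]) -> Dict[str, bool]:
    """Mark each workspace True when its columns equal the consensus set."""
    reference = consensus_columns(workspace_columns)
    return {name: frozenset(cols) == reference
            for name, cols in workspace_columns.items()}


# Hypothetical sample_set tables for three workspaces.
tables = {
    "ws_a": ["release_date", "samples"],
    "ws_b": ["samples", "release_date"],   # same columns, different order
    "ws_c": ["samples"],                   # deviates from the majority
}
print(compliance(tables))
```

Comparing sets rather than lists means column order does not affect compliance, which may or may not match what the dashboard does.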


# Browse workspaces:

* Select a problem to see all workspaces that exhibit it
* Then select a workspace to see its details

> Summarize data wrangling exceptions

* `inconsistent_entityName`
    The majority of workspaces in this consortium use a {'entityName': ...} structure. This workspace does not.

* `missing_subjects`
    This workspace has no subjects.

* `missing_sequence`
    The majority of workspaces in this consortium use a sequence entity to link samples to blobs (files). This workspace does not.

* `dbgap_sample_count_mismatch`
    The count of samples in all workspaces tagged with this accession does not match the length of DbGap/Study/SampleList provided by dbGaP's API.

* `missing_accession`
    This workspace does not have an accession.

* `inconsistent_subject`
    The majority of workspaces in this consortium use a consistent key in the sample to refer to the subject. This workspace does not.
    * For CCDG this is a misspelling of 'participant'.
    * For CMG this refers to inconsistent use of '01-subject_id'.

* `missing_samples`
    At least one subject in this workspace is missing a sample.

* `missing_schema`
    The Terra API [list_entity_types](https://github.com/broadinstitute/fiss/blob/0440e4822a49c393e65964d9dedaa6d4828587bd/firecloud/api.py#L180) returned null.

* `missing_blobs`
    At least one sample in this workspace is missing a blob.

* `schema_conflict_subject`
    The schema for subjects in this workspace does not match the schema for others in the consortium.

* `schema_conflict_sample`
    The schema for samples in this workspace does not match the schema for others in the consortium.
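
A few of these checks can be sketched against an in-memory summary of a workspace. The dictionary layout below (`accession`, `subjects`, `samples` with `subject_id` and `blobs`) is an assumption made for illustration, not the structure the dashboard actually uses, and `find_problems` is a hypothetical name.

```python
from typing import Dict, List


def find_problems(workspace: Dict) -> List[str]:
    """Return the problem codes that apply to one workspace summary."""
    problems = []
    if not workspace.get("accession"):
        problems.append("missing_accession")

    subjects = workspace.get("subjects", [])
    samples = workspace.get("samples", [])
    if not subjects:
        problems.append("missing_subjects")

    # A subject is missing a sample when no sample refers back to it.
    subjects_with_samples = {s["subject_id"] for s in samples}
    if any(subject_id not in subjects_with_samples for subject_id in subjects):
        problems.append("missing_samples")

    # A sample is missing a blob when its list of files is empty or absent.
    if any(not s.get("blobs") for s in samples):
        problems.append("missing_blobs")
    return problems


# Hypothetical workspace: no accession, subject s2 has no sample, s1's sample has no files.
example = {
    "accession": None,
    "subjects": ["s1", "s2"],
    "samples": [{"subject_id": "s1", "blobs": []}],
}
print(find_problems(example))
```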


In [2]:

from ingest import show_schema_upset

show_schema_upset('CMG')


[Interactive UpSet plot for `show_schema_upset('CMG')`; widget not rendered in this export]

In [3]:
show_schema_upset('CMG', True)


[Interactive UpSet plot for `show_schema_upset('CMG', True)`; widget not rendered in this export]

In [4]:
show_schema_upset('CCDG')

[Interactive UpSet plot for `show_schema_upset('CCDG')`; widget not rendered in this export]

In [5]:
show_schema_upset('CCDG', True)


[Interactive UpSet plot for `show_schema_upset('CCDG', True)`; widget not rendered in this export]

In [6]:

from ingest import show_problems_upset

show_problems_upset()

[Interactive UpSet plot for `show_problems_upset()`; widget not rendered in this export]
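
An UpSet plot shows how many items fall into each combination of sets. The input that `show_problems_upset` builds is not shown here, so this is only a sketch of the underlying counting, assuming a mapping from workspace name to its problem codes; `problem_combinations` and the workspace names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def problem_combinations(
    workspace_problems: Dict[str, Iterable[str]]
) -> List[Tuple[Tuple[str, ...], int]]:
    """Count workspaces per exact combination of problems, largest group first."""
    # Sorting and de-duplicating makes the combination independent of input order.
    counts = Counter(tuple(sorted(set(p))) for p in workspace_problems.values())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical problem lists for four workspaces.
data = {
    "ws_a": ["missing_sequence", "missing_accession"],
    "ws_b": ["missing_accession", "missing_sequence"],
    "ws_c": ["missing_blobs"],
    "ws_d": [],   # no problems: counted under the empty combination
}
for combination, size in problem_combinations(data):
    print(size, combination)
```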

In [None]:
# notes ...
import os
import subprocess

from firecloud import api as FAPI
from google.cloud import storage

def bucket_details(bucket_name):
    """Return a bucket handle, with requests billed to GOOGLE_PROJECT."""
    storage_client = storage.Client(project=os.environ['GOOGLE_PROJECT'])
    return storage_client.bucket(bucket_name, user_project=os.environ['GOOGLE_PROJECT'])

def bucket_summary(bucket_url):
    """Return (object count, total bytes) from the last line of `gsutil ls -lR`."""
    # The last line looks like: TOTAL: 1672 objects, 11972138647253 bytes (10.89 TiB)
    cmd = f"gsutil ls -lR {bucket_url} | tail -1"
    parts = subprocess.check_output(cmd, shell=True).decode("utf-8").split()
    total_objects = int(parts[1])
    size = int(parts[3])
    return (total_objects, size)

# Look up the workspace by name; [0] raises IndexError if nothing matches.
workspace = [w for w in FAPI.list_workspaces().json() if w['workspace']['name'] == 'AnVIL_CMG_BaylorHopkins_HMB-NPU_WES'][0]

bucketName = f"gs://{workspace['workspace']['bucketName']}"

bucket_summary(bucketName)

In [None]:
workspace['vertex'].diseaseOntologyId

In [None]:
b'TOTAL: 1672 objects, 11972138647253 bytes (10.89 TiB)\n'.decode("utf-8") 
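
Splitting the summary line on whitespace works as long as the format never changes. A slightly more defensive sketch uses a regular expression that fails loudly on unexpected input; `parse_gsutil_total` is a hypothetical helper name, and the pattern is based only on the single sample line above.

```python
import re
from typing import Tuple

# Matches lines such as: TOTAL: 1672 objects, 11972138647253 bytes (10.89 TiB)
TOTAL_LINE = re.compile(r"TOTAL:\s+(\d+)\s+objects,\s+(\d+)\s+bytes")


def parse_gsutil_total(line: str) -> Tuple[int, int]:
    """Return (object count, size in bytes) from a gsutil TOTAL line."""
    match = TOTAL_LINE.search(line)
    if match is None:
        raise ValueError(f"not a gsutil TOTAL line: {line!r}")
    return int(match.group(1)), int(match.group(2))


raw = b'TOTAL: 1672 objects, 11972138647253 bytes (10.89 TiB)\n'
objects, size = parse_gsutil_total(raw.decode("utf-8"))
print(objects, size, round(size / 2**40, 2))  # size also shown in TiB
```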