RFC79: Incremental Upload of Data Entries #48

Open · wants to merge 128 commits into base: main

128 commits
5dfe298
Add clinical_attribute_meta records to the seed mini
forus Mar 21, 2024
531b10a
Implement sample attribute rewriting flag
forus Mar 21, 2024
248a08c
Add --overwrite-existing for the rest of test cases
forus Mar 21, 2024
2bc7271
Test that mutations stay after updating the sample attributes
forus Mar 21, 2024
31e3194
Add overwrite-existing support for mutations data
forus Mar 21, 2024
bd023a9
Fix --overwrite-existing flag description for importer of profile data
forus Mar 22, 2024
c49bbf3
Add loader command to update case list with sample ids
forus Mar 28, 2024
1f5695d
Add option to remove sample ids from the remaining case lists
forus Mar 28, 2024
77cd6a8
Make removing sample ids from not mentioned case lists a default beha…
forus Mar 29, 2024
bd8c4b2
Make update case list command to read case lists files
forus Mar 29, 2024
5fc633b
Fix test clinical data headers
forus Mar 29, 2024
f7132c9
Test incremental patient upload
forus Apr 1, 2024
f45e1e8
Add flag to reload patient clinical attributes
forus Apr 2, 2024
8cc95a0
Add TODO comment to remove MIXED_ATTRIBUTES data type
forus Apr 3, 2024
fa32b7f
WIP adopt py script to incremental upload
forus Apr 3, 2024
f044c3b
Fix java.sql.SQLException: Generated keys not requested
forus Apr 4, 2024
48fca03
Clean alteration_driver_annotation during mutations inc. upload
forus Apr 5, 2024
1302a8e
Fix validator and importer py scripts for inc. upload
forus Apr 5, 2024
659f352
Add test/demo data for incremental loading of study_es_0 study
forus Apr 5, 2024
b5952e3
Rename and move incremental tests to incementalTest folder
forus Apr 8, 2024
753119b
Update TODO comment on how to deal with multiple sample files
forus Apr 9, 2024
5725d42
Move study_es_0_inc to the new test data folder
forus Apr 9, 2024
299466a
Fix removing patient attributes on samples inc. upload
forus Apr 9, 2024
c0c28e2
Change study_es_0_inc to contain more diverse data
forus Apr 11, 2024
c6eddbb
Specify that data_directory is for incremental data
forus Apr 11, 2024
595d24f
Disambiguate clinical data constants names
forus Apr 11, 2024
c8b4c73
Remove unnecessary TODO comments
forus Apr 11, 2024
efd34d8
Remove MSK copyright mistakenly copy-pasted
forus Apr 11, 2024
3b39e0d
Fix comment of UpdateCaseListsSampleIds.run() method
forus Apr 11, 2024
fc785f6
Make --overwrite-existing flag description more generic
forus Apr 11, 2024
e782951
Add TODO comments for possible reuse of the code
forus Apr 11, 2024
b53c8c4
Update case lists for multiple clinical sample files
forus Apr 11, 2024
99550b5
Extract and reuse common logic to read and validate case lists
forus Apr 11, 2024
1829842
Fix TestIntegrationTest
forus Apr 30, 2024
e785a53
Revert RESOURCE_DEFINITION_DICTIONARY initialisation to empty set
forus Apr 30, 2024
e09e1e2
Minor improvements. Apply PR feedback
forus Apr 30, 2024
7b527b6
Make tests fail the build. Conduct exit status of tests correctly
forus May 1, 2024
f5e8217
Write Validation complete only in case of successful validation
forus May 1, 2024
8d3aaed
Add python tests for incremental/full data import
forus May 1, 2024
1b6ba41
Add unit test for incremental data validation
forus May 1, 2024
d252001
Test rough order of importer commands. Remove sorting in the script t…
forus May 3, 2024
c27b8f1
Extract smaller functions from the big one in py script
forus May 3, 2024
2e80b73
Merge pull request #32 from se4bio/inc-data-upload-poc
forus May 15, 2024
b2c1c21
Refactor tab delim. data importer
forus May 2, 2024
a7aab3a
Implement incremental upload of mRNA data
forus May 3, 2024
bd2d8c1
Add RPPA test
forus May 7, 2024
8b68331
Add normal sample to test data to test skipping
forus May 7, 2024
b18aab1
Add rows with more columns than in header to skip
forus May 7, 2024
ea688c3
Skip rows that don't have enough sample columns
forus May 7, 2024
cdae501
Test for invalid entrez id
forus May 7, 2024
cf458a4
Extract common code from inc. tab. delim. tests
forus May 7, 2024
9ea1ada
Implement incremental upload of cna data via tab. delim. loader
forus May 8, 2024
03f9660
Blank out values for genes not mentioned in the file
forus May 8, 2024
93cc6ff
Remove unused code
forus May 8, 2024
842bcd3
Throw unsupported operation exception for GENESET_SCORE incremental u…
forus May 8, 2024
22b688a
Add generic assay data incremental upload test
forus May 8, 2024
d11a353
Fix integration tests
forus May 8, 2024
7dfb1bd
Make tab. delimiter data uploader transactional
forus May 9, 2024
71cdf70
Check for illegal state in tab delim. data update
forus May 9, 2024
2d31dac
Wire incremental tab delim. data upload to cli commands
forus May 9, 2024
4997542
Expand README with section on how to run incremental upload
forus May 10, 2024
911ae28
Address TODOs in tab delim. importer
forus May 10, 2024
c7343f9
Add more data types to incremental data upload folder
forus May 10, 2024
2ed0bd8
Remove obsolete TODO comment
forus May 15, 2024
76b52a9
Reuse genetic_profile record if it exists in db already
forus May 16, 2024
fa16076
Test incremental upload of tab delim. data types from umbrella script
forus May 16, 2024
e5ccc3e
Move counting lines of file inside generic assay patient level data u…
forus May 16, 2024
472f47e
Give error that generic assay patient level data is not supported
forus May 17, 2024
c54e303
Clean sample_cna_event despite whether it has alteration_driver_annot…
forus May 17, 2024
18dbdd3
Fix cbioportalImport script execution
forus May 28, 2024
c702a8b
Remove unneeded spring context initialisation
forus May 28, 2024
0ff7031
Make error message more informative when gene panel is not found
forus May 28, 2024
54cc04e
Add more genes to the mini seed to load study_es_0
forus May 28, 2024
a022aab
Make study_es_0_inc data pass validation
forus May 28, 2024
90cc928
Document in README how to load study_es_0 study
forus May 28, 2024
fb75d7c
Implement incremental upload for timeline data
forus May 17, 2024
3331223
Implement incremental upload of CNA DISCRETE long data
forus May 22, 2024
d7e1918
Add data type sanity check for tsv upload
forus May 22, 2024
ee183e6
Move storing/dedup logic of genetic alteration values to importer
forus May 22, 2024
697631f
Move all inc. upload logic for tab delim. data types to GeneticAltera…
forus May 22, 2024
65c8b11
Add CNA DISCRETE LONG to study_es0_inc test dataset
forus May 24, 2024
0bf6bf2
Remove unused code
forus May 24, 2024
cc80e56
Make validation to pass for CNA long and study_es_0_inc data
forus May 28, 2024
4070e68
Implement incremental upload for gene panel matrix
forus May 24, 2024
e8bbb34
Make validation of study_es_0_inc data to pass
forus May 28, 2024
feed06c
Implement incremental upload of structural variants data
forus May 24, 2024
bea4987
Implement incremental upload of CNA segmented data
forus May 25, 2024
0cdda9d
Make it explicit that timeline uploader supports bulk mode only
forus May 28, 2024
d7e8ff3
Fix number of columns in SV tsv data file
forus May 28, 2024
ec849e2
Update paragraph on inc. upload in README
forus Jun 11, 2024
deb65cb
Rename validation method to better describe its purpose
forus Jun 11, 2024
8692ead
Fix cleaning alteration_driver_annotation table for specific sample
forus Jun 11, 2024
be9082c
DRY tab separated value string parsing
forus Jun 11, 2024
4e8a7c2
Reuse FileUtil.isInfoLine(String line) throughout the code
forus Jun 11, 2024
b93e741
Extract ensuring header and row match to tsv utility class
forus Jun 11, 2024
9089e77
Simplify delete sql. Rely on cascade delete instead.
forus Jun 11, 2024
16f6295
Generalise overwrite-existing flag description to make it more accurate
forus Jun 11, 2024
79c4041
Rename updateMode to isIncrementalUpdateMode flag
forus Jun 11, 2024
111f58e
Improve description of overwrite-existing flag for gene panel profile…
forus Jun 11, 2024
c4d5ecc
Implement more optimal way to update sample profile
forus Jun 11, 2024
13eb147
Optimize code by always using batch upsert for sample profile
forus Jun 11, 2024
95c32f8
Recognise that SEG importer always uses bulkLoad
forus Jun 11, 2024
e3ec5d6
Organise bulk mode flushing for SEG importer
forus Jun 11, 2024
fc84a41
Ignore case for bulkLoad load mode option as everywhere in the code
forus Jun 11, 2024
4eac259
add comma to README
pieterlukasse Jun 13, 2024
d0428f8
improve order comments for INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES
pieterlukasse Jun 13, 2024
bf2d539
Add join by GENETIC_PROFILE_ID column for sample_cna_event and altera…
forus Jun 13, 2024
37dcc20
Check for inconsistency in sample ids and values while reading geneti…
forus Jun 14, 2024
6562716
Make method name to initialise transaction clearer
forus Jun 14, 2024
b0a448e
Remove TODOs that were done
forus Jun 14, 2024
f3d76c7
Rename isInfoLine util. method to isDataLine
forus Jun 14, 2024
f544847
Simplify code by using inheritance instead of composition
forus Jun 14, 2024
ab51a4b
Optimize removing genetic alterations
forus Jun 19, 2024
96acec5
Access inherited variables with this. instead of super.
forus Jun 19, 2024
52714d6
Merge pull request #45 from cBioPortal/inc-tab-delimited-uploader
forus Jun 19, 2024
074372f
Merge pull request #44 from cBioPortal/make_study_es_0_to_load
forus Jun 19, 2024
a5ac232
Merge pull request #43 from cBioPortal/inc-timeline-uploader
forus Jun 19, 2024
a47eb49
Merge pull request #42 from cBioPortal/inc-cna-discrete-long
forus Jun 19, 2024
d28d04d
Merge pull request #41 from cBioPortal/inc-gene-panel-matrix
forus Jun 19, 2024
8c74dbb
Merge pull request #40 from cBioPortal/inc-sv
forus Jun 19, 2024
d15c579
Merge pull request #39 from cBioPortal/inc-seg
forus Jun 19, 2024
d081f8f
Merge pull request #47 from cBioPortal/rfc79-feedback
forus Jun 19, 2024
e795449
Remove unused code from DaoSampleList.addSampleList()
forus Jun 21, 2024
df8f7af
Remove extra semicolons at the end of java statements
forus Jun 21, 2024
f120f5d
Rename upsertSampleProfiles to upsertSampleToProfileMapping
forus Jun 21, 2024
602cc24
Use java 8 way to convert typed list to array in GeneticAlterationInc…
forus Jun 21, 2024
28dfa05
Improve doc comments for TsvUtil.isDataLine(String line)
forus Jun 21, 2024
2dd1e62
Rename and better document the updateCaseLists method
forus Jun 21, 2024
4 changes: 2 additions & 2 deletions .github/workflows/validate-python.yml
@@ -14,7 +14,7 @@ jobs:
- name: 'Validate tests'
working-directory: ./cbioportal-core
run: |
docker run -v ${PWD}:/cbioportal-core python:3.6 /bin/bash -c '
docker run -v ${PWD}:/cbioportal-core python:3.6 /bin/sh -c '
cd cbioportal-core &&
pip install -r requirements.txt &&
source test_scripts.sh'
./test_scripts.sh'
Contributor:

This may not matter, but running test_scripts.sh this way invokes it in a separate shell process rather than in the current one.
If the script set environment variables that needed to be exported or read afterwards, they would be lost this way.

Contributor (Author):

Indeed. If we lose any relevant environment variable, the tests will fail.
The `source` command is bash-specific, by the way.
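The distinction the reviewer raises can be sketched like this (hypothetical `FOO` variable; `.` is the POSIX-portable spelling of bash's `source`):

```shell
#!/bin/sh
# Create a small script that sets a variable.
cat > /tmp/setvar.sh <<'EOF'
FOO=from_script
EOF

# Run it as a child process (like ./test_scripts.sh):
# the assignment happens in the child and is lost here.
sh /tmp/setvar.sh
echo "after child process: FOO='${FOO:-}'"

# Source it into the current shell: the assignment persists.
. /tmp/setvar.sh
echo "after sourcing: FOO='${FOO:-}'"
```

Since the workflow was switched from `/bin/bash` to `/bin/sh`, dropping `source` in favour of direct execution (or `.`) keeps the script POSIX-compatible.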

67 changes: 54 additions & 13 deletions README.md
@@ -9,6 +9,59 @@ This repo contains:
## Inclusion in main codebase
The `cbioportal-core` code is currently included in the final Docker image during the Docker build process: https://github.com/cBioPortal/cbioportal/blob/master/docker/web-and-data/Dockerfile#L48

## Running in docker

Build docker image with:
```bash
docker build -t cbioportal-core .
```

### Example of how to load `study_es_0` study

Import gene panels

```bash
docker run -it -v $(pwd)/tests/test_data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
perl importGenePanel.pl --data /data/study_es_0/data_gene_panel_testpanel1.txt
docker run -it -v $(pwd)/tests/test_data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
perl importGenePanel.pl --data /data/study_es_0/data_gene_panel_testpanel2.txt
```

Import gene sets and supplementary data

```bash
docker run -it -v $(pwd)/src/test/resources/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
perl importGenesetData.pl --data /data/genesets/study_es_0_genesets.gmt --new-version msigdb_7.5.1 --supp /data/genesets/study_es_0_supp-genesets.txt
```

Import gene set hierarchy data

```bash
docker run -it -v $(pwd)/src/test/resources/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
perl importGenesetHierarchy.pl --data /data/genesets/study_es_0_tree.yaml
```

Import study

```bash
docker run -it -v $(pwd)/tests/test_data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core \
python importer/metaImport.py -s /data/study_es_0 -p /data/api_json_system_tests -o
```

### Incremental upload of data

To add or update specific patient, sample, or molecular data in an already loaded study, you can perform an incremental upload. This process is quicker than reloading the entire study.

To execute an incremental upload, use the -d (or --data_directory) option instead of -s (or --study_directory). Here is an example command:
```bash
docker run -it -v $(pwd)/data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core python importer/metaImport.py -d /data/study_es_0_inc -p /data/api_json -o
```
**Note:**
While the directory should adhere to the standard cBioPortal file formats and study structure, incremental uploads are not supported for all data types.
For instance, uploading study metadata, resources, or GSVA data incrementally is currently unsupported.

This method ensures efficient updates without the need for complete study reuploads, saving time and computational resources.
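For illustration, such an incremental data directory typically contains only the meta/data file pairs being added or updated; the layout below is hypothetical, following the standard cBioPortal naming convention:

```
study_es_0_inc/
├── meta_clinical_samples.txt
├── data_clinical_samples.txt      # only new or changed samples
├── meta_mutations.txt
├── data_mutations.txt             # only records for those samples
└── case_lists/                    # optional: case lists to update
    └── cases_sequenced.txt
```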

## How to run integration tests

This section guides you through the process of running integration tests by setting up a cBioPortal MySQL database environment using Docker. Please follow these steps carefully to ensure your testing environment is configured correctly.
@@ -78,7 +131,7 @@ After you are done with the setup, you can build and test the project.

1. Execute tests through the provided script:
```bash
source test_scripts.sh
./test_scripts.sh
```

2. Build the loader jar using Maven (includes testing):
@@ -119,15 +172,3 @@ The script will search for `core-*.jar` in the root of the project:
python scripts/importer/metaImport.py -s tests/test_data/study_es_0 -p tests/test_data/api_json_unit_tests -o
```

## Running in docker

Build docker image with:
```bash
docker build -t cbioportal-core .
```

Example of how to start the loading:
```bash
docker run -it -v $(pwd)/data/:/data/ -v $(pwd)/application.properties:/application.properties cbioportal-core python importer/metaImport.py -s /data/study_es_0 -p /data/api_json -o
```

3 changes: 3 additions & 0 deletions pom.xml
@@ -252,6 +252,9 @@
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.21.0</version>
<configuration>
<trimStackTrace>false</trimStackTrace>
</configuration>
<executions>
<execution>
<id>default-test</id>
155 changes: 122 additions & 33 deletions scripts/importer/cbioportalImporter.py
@@ -12,6 +12,7 @@
import logging
import re
from pathlib import Path
from typing import Dict, Tuple

# configure relative imports if running as a script; see PEP 366
# it might be passed as an empty string by certain tooling to mark a top-level module
@@ -39,6 +40,8 @@
from .cbioportal_common import ADD_CASE_LIST_CLASS
from .cbioportal_common import VERSION_UTIL_CLASS
from .cbioportal_common import run_java
from .cbioportal_common import UPDATE_CASE_LIST_CLASS
from .cbioportal_common import INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES


# ------------------------------------------------------------------------------
@@ -101,8 +104,17 @@ def remove_study_id(jvm_args, study_id):
args.append("--noprogress") # don't report memory usage and % progress
run_java(*args)

def update_case_lists(jvm_args, meta_filename, case_lists_file_or_dir = None):
args = jvm_args.split(' ')
args.append(UPDATE_CASE_LIST_CLASS)
args.append("--meta")
args.append(meta_filename)
if case_lists_file_or_dir:
args.append("--case-lists")
args.append(case_lists_file_or_dir)
run_java(*args)

def import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity = None, meta_file_dictionary = None):
def import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity = None, meta_file_dictionary = None, incremental = False):
args = jvm_args.split(' ')

# In case the meta file is already parsed in a previous function, it is not
@@ -133,6 +145,10 @@ def import_study_data(jvm_args, meta_filename, data_filename, update_generic_ass
importer = IMPORTER_CLASSNAME_BY_META_TYPE[meta_file_type]

args.append(importer)
if incremental:
if meta_file_type not in INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES:
raise NotImplementedError("This type does not support incremental upload: {}".format(meta_file_type))
args.append("--overwrite-existing")
if IMPORTER_REQUIRES_METADATA[importer]:
args.append("--meta")
args.append(meta_filename)
@@ -212,11 +228,20 @@ def process_command(jvm_args, command, meta_filename, data_filename, study_ids,
else:
raise RuntimeError('Your command uses both -id and -meta. Please, use only one of the two parameters.')
elif command == IMPORT_STUDY_DATA:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity)
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity)
elif command == IMPORT_CASE_LIST:
import_case_list(jvm_args, meta_filename)

def process_directory(jvm_args, study_directory, update_generic_assay_entity = None):
def get_meta_filenames(data_directory):
meta_filenames = [
os.path.join(data_directory, meta_filename) for
meta_filename in os.listdir(data_directory) if
re.search(r'(\b|_)meta(\b|[_0-9])', meta_filename,
flags=re.IGNORECASE) and
not (meta_filename.startswith('.') or meta_filename.endswith('~'))]
return meta_filenames
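For illustration, the filename filter above can be exercised on its own; this sketch re-states the same regex in a standalone helper (the function name `looks_like_meta_file` is hypothetical):

```python
import re

def looks_like_meta_file(name: str) -> bool:
    # Same pattern as get_meta_filenames: 'meta' delimited by a word
    # boundary or underscore on the left, followed by a boundary,
    # underscore, or digit; hidden files and editor backups are excluded.
    return bool(re.search(r'(\b|_)meta(\b|[_0-9])', name, flags=re.IGNORECASE)) \
        and not (name.startswith('.') or name.endswith('~'))

names = [
    "meta_study.txt",        # matches
    "meta_CNA.txt",          # matches
    "metadata.txt",          # no: 'meta' runs into 'data' without a boundary
    ".meta_hidden.txt",      # no: hidden file
    "meta_mutations.txt~",   # no: editor backup
    "data_clinical.txt",     # no: no 'meta' token
]
print([n for n in names if looks_like_meta_file(n)])  # → ['meta_study.txt', 'meta_CNA.txt']
```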

def process_study_directory(jvm_args, study_directory, update_generic_assay_entity = None):
"""
Import an entire study directory based on meta files found.

@@ -241,12 +266,7 @@ def process_directory(jvm_args, study_directory, update_generic_assay_entity = N
cna_long_filepair = None

# Determine meta filenames in study directory
meta_filenames = (
os.path.join(study_directory, meta_filename) for
meta_filename in os.listdir(study_directory) if
re.search(r'(\b|_)meta(\b|[_0-9])', meta_filename,
flags=re.IGNORECASE) and
not (meta_filename.startswith('.') or meta_filename.endswith('~')))
meta_filenames = get_meta_filenames(study_directory)

# Read all meta files (excluding case lists) to determine what to import
for meta_filename in meta_filenames:
@@ -353,53 +373,53 @@ def process_directory(jvm_args, study_directory, update_generic_assay_entity = N
raise RuntimeError('No sample attribute file found')
else:
meta_filename, data_filename = sample_attr_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, we need to import resource definitions for resource data
if resource_definition_filepair is not None:
meta_filename, data_filename = resource_definition_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, we need to import sample definitions for resource data
if sample_resource_filepair is not None:
meta_filename, data_filename = sample_resource_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Next, import everything else except gene panel, structural variant data, GSVA and
# z-score expression. If in the future more types refer to each other, (like
# in a tree structure) this could be programmed in a recursive fashion.
for meta_filename, data_filename in regular_filepairs:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import structural variant data
if structural_variant_filepair is not None:
meta_filename, data_filename = structural_variant_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import cna data
if cna_long_filepair is not None:
meta_filename, data_filename = cna_long_filepair
import_study_data(jvm_args=jvm_args, meta_filename=meta_filename, data_filename=data_filename,
meta_file_dictionary=study_meta_dictionary[meta_filename])
import_data(jvm_args=jvm_args, meta_filename=meta_filename, data_filename=data_filename,
meta_file_dictionary=study_meta_dictionary[meta_filename])

# Import expression z-score (after expression)
for meta_filename, data_filename in zscore_filepairs:
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import GSVA genetic profiles (after expression and z-scores)
if gsva_score_filepair is not None:

# First import the GSVA score data
meta_filename, data_filename = gsva_score_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Second import the GSVA p-value data
meta_filename, data_filename = gsva_pvalue_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

if gene_panel_matrix_filepair is not None:
meta_filename, data_filename = gene_panel_matrix_filepair
import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])

# Import the case lists
case_list_dirname = os.path.join(study_directory, 'case_lists')
@@ -412,6 +432,72 @@ def process_directory(jvm_args, study_directory, update_generic_assay_entity = N
# enable study
update_study_status(jvm_args, study_id)

def get_meta_filenames_by_type(data_directory) -> Dict[str, Tuple[str, Dict]]:
"""
Read all meta files in the data directory and return meta information (filename, content) grouped by type.
"""
meta_file_type_to_meta_files = {}

# Determine meta filenames in study directory
meta_filenames = get_meta_filenames(data_directory)

# Read all meta files (excluding case lists) to determine what to import
for meta_filename in meta_filenames:

# Parse meta file
meta_dictionary = cbioportal_common.parse_metadata_file(
meta_filename, logger=LOGGER)

# Retrieve meta file type
meta_file_type = meta_dictionary['meta_file_type']
if meta_file_type is None:
# invalid meta file, let's die
raise RuntimeError('Invalid meta file: ' + meta_filename)
if meta_file_type not in meta_file_type_to_meta_files:
meta_file_type_to_meta_files[meta_file_type] = []

meta_file_type_to_meta_files[meta_file_type].append((meta_filename, meta_dictionary))
return meta_file_type_to_meta_files

def import_incremental_data(jvm_args, data_directory, update_generic_assay_entity, meta_file_type_to_meta_files):
"""
Load all data types that are available and support incremental upload
"""
for meta_file_type in INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES:
if meta_file_type not in meta_file_type_to_meta_files:
continue
meta_pairs = meta_file_type_to_meta_files[meta_file_type]
for meta_pair in meta_pairs:
meta_filename, meta_dictionary = meta_pair
data_filename = os.path.join(data_directory, meta_dictionary['data_filename'])
import_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, meta_dictionary, incremental=True)

def update_case_lists_from_folder(jvm_args, data_directory, meta_file_type_to_meta_files):
"""
Update case lists if clinical sample data is provided.
The command takes the case_lists/ folder as an optional argument.
If that folder exists, the case lists will be updated accordingly.
"""
if MetaFileTypes.SAMPLE_ATTRIBUTES in meta_file_type_to_meta_files:
case_list_dirname = os.path.join(data_directory, 'case_lists')
sample_attributes_metas = meta_file_type_to_meta_files[MetaFileTypes.SAMPLE_ATTRIBUTES]
for meta_pair in sample_attributes_metas:
meta_filename, meta_dictionary = meta_pair
LOGGER.info('Updating case lists with sample ids', extra={'filename_': meta_filename})
update_case_lists(jvm_args, meta_filename, case_lists_file_or_dir=case_list_dirname if os.path.isdir(case_list_dirname) else None)

def process_data_directory(jvm_args, data_directory, update_generic_assay_entity = None):
"""
Incremental import of data directory based on meta files found.
"""

meta_file_type_to_meta_files = get_meta_filenames_by_type(data_directory)

not_supported_meta_types = meta_file_type_to_meta_files.keys() - INCREMENTAL_UPLOAD_SUPPORTED_META_TYPES
if not_supported_meta_types:
raise NotImplementedError("These types do not support incremental upload: {}".format(", ".join(not_supported_meta_types)))
import_incremental_data(jvm_args, data_directory, update_generic_assay_entity, meta_file_type_to_meta_files)
update_case_lists_from_folder(jvm_args, data_directory, meta_file_type_to_meta_files)

def usage():
# TODO : replace this by usage string from interface()
@@ -435,26 +521,27 @@ def check_files(meta_filename, data_filename):
print('data-file cannot be found:' + data_filename, file=ERROR_FILE)
sys.exit(2)

def check_dir(study_directory):
def check_dir(data_directory):
# check existence of directory
if not os.path.exists(study_directory) and study_directory != '':
print('Study cannot be found: ' + study_directory, file=ERROR_FILE)
if not os.path.exists(data_directory) and data_directory != '':
print('Directory cannot be found: ' + data_directory, file=ERROR_FILE)
sys.exit(2)

def add_parser_args(parser):
parser.add_argument('-s', '--study_directory', type=str, required=False,
help='Path to Study Directory')
data_source_group = parser.add_mutually_exclusive_group()
data_source_group.add_argument('-s', '--study_directory', type=str, help='Path to Study Directory')
data_source_group.add_argument('-d', '--data_directory', type=str, help='Path to Data Directory')
parser.add_argument('-jvo', '--java_opts', type=str, default=os.environ.get('JAVA_OPTS'),
help='Path to specify JAVA_OPTS for the importer. \
(default: gets the JAVA_OPTS from the environment)')
(default: gets the JAVA_OPTS from the environment)')
parser.add_argument('-jar', '--jar_path', type=str, required=False,
help='Path to scripts JAR file')
help='Path to scripts JAR file')
parser.add_argument('-meta', '--meta_filename', type=str, required=False,
help='Path to meta file')
parser.add_argument('-data', '--data_filename', type=str, required=False,
help='Path to Data file')

def interface():
def interface(args=None):
parent_parser = argparse.ArgumentParser(description='cBioPortal meta Importer')
add_parser_args(parent_parser)
parser = argparse.ArgumentParser()
@@ -484,7 +571,7 @@ def interface():
# TODO - add same argument to metaimporter
# TODO - harmonize on - and _

parser = parser.parse_args()
parser = parser.parse_args(args)
if parser.command is not None and parser.subcommand is not None:
print('Cannot call multiple commands')
sys.exit(2)
@@ -547,14 +634,16 @@ def main(args):

# process the options
jvm_args = "-Dspring.profiles.active=dbcp " + args.java_opts
study_directory = args.study_directory

# check if DB version and application version are in sync
check_version(jvm_args)

if study_directory != None:
check_dir(study_directory)
process_directory(jvm_args, study_directory, args.update_generic_assay_entity)
if args.data_directory is not None:
check_dir(args.data_directory)
process_data_directory(jvm_args, args.data_directory, args.update_generic_assay_entity)
elif args.study_directory is not None:
check_dir(args.study_directory)
process_study_directory(jvm_args, args.study_directory, args.update_generic_assay_entity)
else:
check_args(args.command)
check_files(args.meta_filename, args.data_filename)