Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial refactoring of tensor_from_file #224

Closed
wants to merge 32 commits into from
Closed

Conversation

mklarqvist
Copy link
Contributor

This is an incomplete pull request to start a conversation regarding the refactoring of the TMAP dictionary into separate submodules (issue #143 ). For example, TMAPS['ecg_rest_raw'] is now defined as ml4cvd.tensormap.ukbb.ecg.ecg.ecg_rest_raw. This introduces more meaningful semantics and imports libraries only required for the target application.
In this initial commit, only TMAP definitions and their associated functions in tensor_from_file are covered.

The current proposed structure looks like this:

.
├── partners
└── ukbb
    ├── __init__.py
    ├── accelerometer
    ├── demographics
    │   └── demographics.py
    ├── ecg
    │   └── ecg.py
    ├── general.py
    ├── genetics
    ├── mri
    │   ├── mri.py
    │   └── vtk.py
    └── survival.py

@ndiamant
Copy link
Contributor

@mklarqvist I like this format! Seems very clean. How will the TensorMaps be referenced from the command line now that they are not in a dictionary?

@paolodi paolodi linked an issue Apr 23, 2020 that may be closed by this pull request
@erikr
Copy link

erikr commented Apr 23, 2020

Would we structure partners tmaps as follows?

.
├── partners
|    ├── __init__.py
|    ├── ecg
|    ├── outcomes
|    |   ├── cardiac_surgery
|    |   └── community
|    ├── echo
|    ├── medications
|    ├── laboratory_tests
|    ├── vitals
|    └── bedmaster_waveforms
└── ukbb
    ├── __init__.py
    ├── accelerometer
    ├── demographics
    │   └── demographics.py
    ├── ecg
    │   └── ecg.py
    ├── general.py
    ├── genetics
    ├── mri
    │   ├── mri.py
    │   └── vtk.py
    └── survival.py

@mklarqvist
Copy link
Contributor Author

Would we structure partners tmaps as follows?

.
├── partners
|    ├── __init__.py
|    ├── ecg
|    ├── outcomes
|    |   ├── cardiac_surgery
|    |   └── community
|    ├── echo
|    ├── medications
|    ├── laboratory_tests
|    ├── vitals
|    └── bedmaster_waveforms
└── ukbb
    ├── __init__.py
    ├── accelerometer
    ├── demographics
    │   └── demographics.py
    ├── ecg
    │   └── ecg.py
    ├── general.py
    ├── genetics
    ├── mri
    │   ├── mri.py
    │   └── vtk.py
    └── survival.py

Looks good. I don't think we need to formally impose any structure. As long as there is some structure there.

@mklarqvist
Copy link
Contributor Author

mklarqvist commented Apr 24, 2020

@mklarqvist I like this format! Seems very clean. How will the TensorMaps be referenced from the command line now that they are not in a dictionary?

I'm in favor of a dirty solution for now: invoking the string literal using getattr. For example, tm = getattr(ml4cvd.tensormap.ukbb.ecg.ecg, 'ecg_rest_raw'). We can check the prefix such that it matches ml4cvd.tensormap.* to prevent the user from accidentally calling unintended functions.

This seems to work fine:

def test_tm_resolve(test: str):
    if isinstance(test,str) == False:
        raise TypeError(f"Input name must be a string. Given: {type(test)}")
    if len(test) == 0:
        raise ValueError(f"Input name cannot be empty.")
    if '.'.join(test.split('.')[0:2]) != 'ml4cvd.tensormap':
        raise ValueError(f"TensorMaps must reside in the path 'ml4cvd.tensormap.*'. Given: {test}")
    import importlib
    try:
        i = importlib.import_module('.'.join(test.split('.')[:-1]))
    except ModuleNotFoundError:
        raise ModuleNotFoundError(f"Could not resolve library {'.'.join(test.split('.')[:-1])} for target tensormap {test}")
    try:
        tm = getattr(i,test.split('.')[-1])
    except AttributeError:
        raise AttributeError(f"Module {'.'.join(test.split('.')[:-1])} has no TensorMap called {test.split('.')[-1]}")
    return tm

Example 1

> test_tm_resolve('ml4cvd.tensormap.ukbb.mri.mri.lax_3ch_lv_cavity_bbox')
2020-04-24 08:43:20.657534: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-24 08:43:20.681661: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7febdfd67ba0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-24 08:43:20.681679: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
<ml4cvd.TensorMap.TensorMap object at 0x14e777438>

Example 2

> test_tm_resolve('ml4cvd.tensormap.ukbb.mri.mri.fake_tensormap')
Traceback (most recent call last):
  File "<stdin>", line 8, in test_tm_resolve
AttributeError: module 'ml4cvd.tensormap.ukbb.mri.mri' has no attribute 'fake_tensormap'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in test_tm_resolve
AttributeError: Module ml4cvd.tensormap.ukbb.mri.mri has no TensorMap called fake_tensormap

Example 3

> test_tm_resolve('ml4cvd.tensormap.ukbb.mri.fake_module.lax_3ch_lv_cavity_bbox')
Traceback (most recent call last):
  File "<stdin>", line 4, in test_tm_resolve
  File "/opt/miniconda3/envs/ml4cvd/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 948, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'ml4cvd.tensormap.ukbb.mri.fake_module'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in test_tm_resolve
ModuleNotFoundError: Could not resolve library ml4cvd.tensormap.ukbb.mri.fake_module for target tensormap ml4cvd.tensormap.ukbb.mri.fake_module.lax_3ch_lv_cavity_bbox

What do you think?

Copy link
Collaborator

@lucidtronix lucidtronix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this big refactor. I think we can flatten the hierarchy a bit especially if the whole path is going to need to be specified on the command line. Before merging the python introspection to import from strings should be integrated into arguments.py _get_tmap function.

ml4cvd/tensormap/ukbb/demographics/demographics.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/demographics/demographics.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/demographics/demographics.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/ecg/ecg.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/ecg/ecg.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/general.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/general.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/mri/mri.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/mri/vtk.py Outdated Show resolved Hide resolved
ml4cvd/tensormap/ukbb/general.py Outdated Show resolved Hide resolved
@lucidtronix
Copy link
Collaborator

let's just prepend 'ml4cvd.tensormap' rather than requiring it in:
if '.'.join(test.split('.')[0:2]) != 'ml4cvd.tensormap':

@lucidtronix
Copy link
Collaborator

Also should add a type check ensuring the import is a TensorMap

@ndiamant
Copy link
Contributor

@mklarqvist I like this format! Seems very clean. How will the TensorMaps be referenced from the command line now that they are not in a dictionary?

I'm in favor of a dirty solution for now: invoking the string literal using getattr. For example, tm = getattr(ml4cvd.tensormap.ukbb.ecg.ecg, 'ecg_rest_raw'). We can check the prefix such that it matches ml4cvd.tensormap.* to prevent the user from accidentally calling unintended functions.

This seems to work fine:

def test_tm_resolve(test: str):
    if isinstance(test,str) == False:
        raise TypeError(f"Input name must be a string. Given: {type(test)}")
    if len(test) == 0:
        raise ValueError(f"Input name cannot be empty.")
    if '.'.join(test.split('.')[0:2]) != 'ml4cvd.tensormap':
        raise ValueError(f"TensorMaps must reside in the path 'ml4cvd.tensormap.*'. Given: {test}")
    import importlib
    try:
        i = importlib.import_module('.'.join(test.split('.')[:-1]))
    except ModuleNotFoundError:
        raise ModuleNotFoundError(f"Could not resolve library {'.'.join(test.split('.')[:-1])} for target tensormap {test}")
    try:
        tm = getattr(i,test.split('.')[-1])
    except AttributeError:
        raise AttributeError(f"Module {'.'.join(test.split('.')[:-1])} has no TensorMap called {test.split('.')[-1]}")
    return tm

Example 1

> test_tm_resolve('ml4cvd.tensormap.ukbb.mri.mri.lax_3ch_lv_cavity_bbox')
2020-04-24 08:43:20.657534: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-24 08:43:20.681661: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7febdfd67ba0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-24 08:43:20.681679: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
<ml4cvd.TensorMap.TensorMap object at 0x14e777438>

Example 2

> test_tm_resolve('ml4cvd.tensormap.ukbb.mri.mri.fake_tensormap')
Traceback (most recent call last):
  File "<stdin>", line 8, in test_tm_resolve
AttributeError: module 'ml4cvd.tensormap.ukbb.mri.mri' has no attribute 'fake_tensormap'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in test_tm_resolve
AttributeError: Module ml4cvd.tensormap.ukbb.mri.mri has no TensorMap called fake_tensormap

Example 3

> test_tm_resolve('ml4cvd.tensormap.ukbb.mri.fake_module.lax_3ch_lv_cavity_bbox')
Traceback (most recent call last):
  File "<stdin>", line 4, in test_tm_resolve
  File "/opt/miniconda3/envs/ml4cvd/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 948, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'ml4cvd.tensormap.ukbb.mri.fake_module'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in test_tm_resolve
ModuleNotFoundError: Could not resolve library ml4cvd.tensormap.ukbb.mri.fake_module for target tensormap ml4cvd.tensormap.ukbb.mri.fake_module.lax_3ch_lv_cavity_bbox

What do you think?

This makes sense to me. I agree with Sam that we should add a type check ensuring the import is a TensorMap. We should also add test_tm_resolve to tests/test_arguments.py.

Copy link
Contributor

@ndiamant ndiamant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of the messy TMAPS dictionary is a great idea, but a big change! Before merging, make sure that all the tests pass and add new ones for the new command line TensorMap interface. Here is some very brief pytest unit testing info https://github.com/broadinstitute/ml/blob/d41308c01649a964627b7fd2037e6981b43f2786/ml4cvd/DATA_MODELING_TESTS.md#unit-tests

@mklarqvist
Copy link
Contributor Author

let's just prepend 'ml4cvd.tensormap' rather than requiring it in:
if '.'.join(test.split('.')[0:2]) != 'ml4cvd.tensormap':

I prefer this way

We could remedy this temporarily until fixed properly by adding a --ukbb-prefix flag that prepends ml4cvd/tensormap/ukb/ to passed TensorMap names

because then we don't force TensorMaps to reside in this path or in the codebase at all.

@mklarqvist
Copy link
Contributor Author

mklarqvist commented Apr 24, 2020

Everyone happy with this?

def tensormap_lookup(module_string: str, prefix: str = None):
    if isinstance(module_string, str) == False:
        raise TypeError(f"Input name must be a string. Given: {type(module_string)}")
    if len(module_string) == 0:
        raise ValueError(f"Input name cannot be empty.")
    path_string = module_string
    if prefix:
        if isinstance(prefix, str) == False:
            raise TypeError(f"Prefix must be a string. Given: {type(prefix)}")
        if len(prefix) == 0:
            raise ValueError(f"Prefix cannot be set to an emtpy string.")
        path_string = '.'.join([prefix,module_string])
    else:
        if '.'.join(path_string.split('.')[0:2]) != 'ml4cvd.tensormap':
            raise ValueError(f"TensorMaps must reside in the path 'ml4cvd.tensormap.*'. Given: {module_string}")
    import importlib
    try:
        i = importlib.import_module('.'.join(path_string.split('.')[:-1]))
    except ModuleNotFoundError:
        raise ModuleNotFoundError(f"Could not resolve library {'.'.join(path_string.split('.')[:-1])} for target tensormap {module_string}")
    try:
        tm = getattr(i,path_string.split('.')[-1])
    except AttributeError:
        raise AttributeError(f"Module {'.'.join(path_string.split('.')[:-1])} has no TensorMap called {path_string.split('.')[-1]}")
    from ml4cvd.TensorMap import TensorMap
    if isinstance(tm, TensorMap) == False:
        raise TypeError(f"Target value is not a TensorMap object. Returned: {type(tm)}")
    return tm

@ndiamant
Copy link
Contributor

ndiamant commented May 8, 2020

Everyone happy with this?

def tensormap_lookup(module_string: str, prefix: str = None):
    if isinstance(module_string, str) == False:
        raise TypeError(f"Input name must be a string. Given: {type(module_string)}")
    if len(module_string) == 0:
        raise ValueError(f"Input name cannot be empty.")
    path_string = module_string
    if prefix:
        if isinstance(prefix, str) == False:
            raise TypeError(f"Prefix must be a string. Given: {type(prefix)}")
        if len(prefix) == 0:
            raise ValueError(f"Prefix cannot be set to an emtpy string.")
        path_string = '.'.join([prefix,module_string])
    else:
        if '.'.join(path_string.split('.')[0:2]) != 'ml4cvd.tensormap':
            raise ValueError(f"TensorMaps must reside in the path 'ml4cvd.tensormap.*'. Given: {module_string}")
    import importlib
    try:
        i = importlib.import_module('.'.join(path_string.split('.')[:-1]))
    except ModuleNotFoundError:
        raise ModuleNotFoundError(f"Could not resolve library {'.'.join(path_string.split('.')[:-1])} for target tensormap {module_string}")
    try:
        tm = getattr(i,path_string.split('.')[-1])
    except AttributeError:
        raise AttributeError(f"Module {'.'.join(path_string.split('.')[:-1])} has no TensorMap called {path_string.split('.')[-1]}")
    from ml4cvd.TensorMap import TensorMap
    if isinstance(tm, TensorMap) == False:
        raise TypeError(f"Target value is not a TensorMap object. Returned: {type(tm)}")
    return tm

I like this except I don't see the reason to put the import of TensorMap and importlib in the function.

mklarqvist and others added 9 commits June 2, 2020 13:11
* revise tmaps to get all partners ecgs

* remove duplicate import

* do not propagate error to all ecgs in hd5

* update cardiac surgery tmaps

* _md tmaps

* tweaks to partners tensorize and ingest scripts (#210)

* dynamic tmap shape in tmap constructor

* modify label tmaps

* explore with tmap shape None

* not empty string validator

* defer str dtype

* better tensor dtype str

* slightly optimize by assuming dtype str

* parallel explore with multiprocess

* fix counter using unsightly global variable

* fix datetime, fix prt axes

* move explore to explorations

* dont fail missing on more tmaps

* fix voltage reshample shape

* move date constants and clean up imports

* dont normalize every loop

* add 'newest' tmaps for single ecgs

* tmap shape when time_series_limit is set

* infer tmap shape dynamic if time_series_limit not None

* comments explaining dynamic shaping
* quick fix for _newest

* clearer ternary

* fix typo
…ect TMaps (#223)

* calls build_cardiac_surgery_tensor_maps - closes #222

* adds path to CARDIAC_SURGERY_OUTCOMES_CSV

* address Paolos input on PR #223

* backup file in case I mess up the merge conflicts

* syncs defines with master

* syncs recipes with master -- moves explore to explorations

* syncs with master

* syncs with master

* syncs with master

* syncs with master

* resolves merge conflicts, imports decompress

* restores explorations from master prior to black formatting

* orders imports by length in arguments.py
* tensor maps for partners with dynamic ECGs
Implements a new command line argument --tsv_style genetics (not active by default) that makes the TSVs output byinfer, infer_hidden and explore compatible with genetic association software (e.g., Plink and BOLT).
* clinical and full view ECG plotting

* add msttcorefonts to docker image

* tmaps

* import order, top panel func name, remove constrained layout warning

* Pre-requirements not to have errors

* Better comment for pre_requirements

* period

Co-authored-by: Paolo Di Achille <paolo.diachille@gmail.com>
lucidtronix and others added 20 commits June 2, 2020 13:27
* xref base

* rename clean cols, escape col names

* source -> tensors, add help to args

* plot changes, require time range with time

* add fpath to xref output, if found

* cleanup logic, remove pandasql

* advanced time windowing

* cleanup
* rewrite get_test_train_valid_paths

* new train valid test paths with balance csv

* bug squashing and tests

* overlapping csv tests

* rename tot->ratio_sum/use_ratios

* cleanup comments

* fix test case where train/valid/test < sample, fix ratio rescaling

* more explicit param docs

* todo for balance csv tests

* goodbye test_modulo, require at least 2 args

* space between functions

* field for partners ecg tensorizer

* refactor arg handling

* simplify to override ratios and document

* and -> or
* sts tmaps exact length

* fix time selection and add new sts tmaps

* use normalizer
* categorical voltage tmaps

* time_series_set

* revert time_series_set

* None check

* _tensors_to_df overhaul

* comment

* revisions and readability
1. The function _cardiac_surgery_str2date no longer hard-codes the date
formatting. Rather, it is set in and imported from defines.py, and used as a
default. It can be easily overriden when calling this function.

2. all str2date functions are now grouped together to keep the script
organized.
…#298)

* implements embed_visualization option. Ref #268

* implements embed_visualization option. Ref #268

* merge PR #288. Remove redundant default.
* Infer MRN column from sample_csv. Ref #287.

* if cannot find MRN col name, uses col 0. Ref #287

* sync w/ master

* sync w/ master

* sync w/ master

* sync w/ master

* check for sample_id as MRN col
CARCIAC --> CARDIAC
)

* removes sort_csv & error cols from explore. Ref #236. Ref #284

* adds explore_export_errors to args. Ref #284.

* sync w/ master

* sync w/ master

* adds embed_visualization to args from master

* typo CARCIAC -> CARDIAC
* wip rehaul tensorizer

* wip tensorizer rehaul

* rehaul tensor writer, get metadata in tensorize

* bug fixes

* bug fixes and correctness regarding missingness

* new voltage qc tmaps

* default sampling frequency tmaps calculate based on length

* multiprocess -> multiprocessing
Also
* Remove obsolete instructions now that ml4cvd has the needed packages and permissions.
* Use newer hd5 for ECGs.
* Switch to local paths when on a ML4CVD VM.
@mklarqvist
Copy link
Contributor Author

rebased

@mklarqvist
Copy link
Contributor Author

File changes look a bit nasty because of rebasing

@lucidtronix
Copy link
Collaborator

Can you try git merge master instead of git rebase?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Organize tensor_from_file
7 participants