Issue/237 #241
Conversation
FEAT: Implemented RF class method for fitting the model
FEAT: Implemented RF class method for obtaining importance analysis from a fitted RF
FEAT: Implemented RF class method for returning OOB error
FEAT: Implemented RF class method for obtaining FDR from a fitted model
FEAT: Implemented RF class method for exporting the forest to JSON
REFACTOR: Make RF model available at package level
CHORE: Added type checking to all methods
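As a rough illustration of the FDR idea behind the new class method: VariantSpark computes a *local* FDR from importance scores, but the simpler global Benjamini-Hochberg procedure shows the same p-value-to-FDR adjustment in a few lines of standard-library Python. This sketch is not the wrapper's implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values, preserving input order.

    Illustrative only: VariantSpark's RF method computes a *local* FDR
    from importance scores; this is the simpler global BH adjustment.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvalues[i] * n / rank)
        qvalues[i] = q
        prev = q
    return qvalues

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```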
REFACTOR: Removed FeatureSource and ImportanceAnalysis classes from core
REFACTOR: Added FeatureSource import so features can be returned as a class instantiation
REFACTOR: Removed importance analysis and model training
FEAT: Added conversion from feature to RDD (python)
FEAT: Added conversion from feature to RDD (scala)
CHORE: Added type checking
due to import order warning (#237)
separate wrapper file (#237)
REFACTOR: Updated important_variables and variable_importance methods to convert to pandas DataFrames
REFACTOR: Removed model training from object instantiation and updated class to accept a model as a parameter
REFACTOR: Added normalisation as an optional parameter for variable importance methods
FEAT: Updated variableImportance method to include splitCount in return, as it is required for local FDR analysis
and passes back to python context (#237)
from importAnalysis method of AnalyticsFunctions (#237)
FIX: Update export function to process trees in batches, instead of collecting the whole forest as a map as this led to OOM errors on large forests
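The OOM fix above replaces a collect-the-whole-forest-as-a-map step with batched processing. The real change is in the Scala export code; the sketch below shows the batching idea in standard-library Python under the assumption that each tree serialises to a JSON-compatible dict.

```python
import json

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def export_forest(trees, writer, batch_size=100):
    """Serialise trees to `writer` one batch at a time (one JSON object
    per line), so the full forest is never held in memory at once.

    Hypothetical helper for illustration; not VariantSpark's exporter.
    """
    for batch in batched(trees, batch_size):
        writer.write("\n".join(json.dumps(t) for t in batch) + "\n")
```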
REFACTOR: Refactor to mirror changes to python wrapper
FEAT: Include FDR calculation in unit test
FEAT: Implement function for Manhattan plotting of negative log p-values
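The y-axis transform behind that plotting function is simple enough to sketch: a Manhattan plot shows -log10 of each p-value, with a floor to avoid log10(0). This is a stdlib illustration of the transform only, not the wrapper's plotting code; the `floor` parameter is an assumption.

```python
import math

def neg_log10_pvalues(pvalues, floor=1e-300):
    """Convert p-values to the -log10 scale used on a Manhattan plot's
    y-axis; `floor` guards against taking log10 of zero."""
    return [-math.log10(max(p, floor)) for p in pvalues]

print(neg_log10_pvalues([1.0, 0.05, 1e-8]))
```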
STYLE: Format with black
FEAT: Add wrapper class for importing covariates
FEAT: Add wrapper class for unioning features and covariates
REFACTOR: Include covariate filtering in manhattan plot function
STYLE: Format with black (#237)
FEAT: Add functions for importing std and transposed CSVs
FEAT: Add function for unioning features and covariates
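A "transposed" CSV here lays samples out as columns and features as rows, the opposite of the standard orientation. The actual importer is `CsvStdFeatureSource.scala`; the sketch below just shows the flip itself with the standard library.

```python
import csv
import io

def read_transposed_csv(text):
    """Read a CSV laid out with samples as columns and features as rows,
    returning it flipped to one row per sample.

    Illustrative stdlib sketch; the real importer is Scala/Spark code.
    """
    rows = list(csv.reader(io.StringIO(text)))
    # zip(*rows) swaps rows and columns.
    return [list(col) for col in zip(*rows)]
```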
* Deleted hail tests and associated .vds files * Scripts not consumed by any tests or code, so safe to remove
* Remove hail workaround; pypandoc can now be installed normally via dev-requirements.txt
* No longer needed because Hail is not included in the build
* core.py: added default `sparkPar` parameter when importing VCFs to avoid Py4J overload resolution issues * test_rfmodel: updated to pass VarsparkContext (`self.vc`) instead of SparkSession (`self.spark`), ensuring consistency with the Python-Spark integration layer
* Avoid storing VCF on driver memory * Implement hadoop-bam's BGZFCodec with Hadoop File API * Standardise VCFSource instantiations to consistently use the Spark context + file path factory method * Delegate BGZ-aware text file loading to BGZLoader in SparkArgs
* Only covariates listed in the argument are included. * If no argument is provided, all covariates are cast to default types.
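The two bullets above describe opt-in filtering with a cast-everything default. A minimal sketch of that behaviour, assuming a hypothetical `load_covariates` helper, dict-shaped records, and float as the default cast (the actual default types are not stated here):

```python
def load_covariates(records, types=None):
    """Keep only the covariates named in `types`, casting each to the
    requested type; when `types` is None, keep every covariate and fall
    back to a default cast (float, as an assumption for this sketch)."""
    if types is None:
        types = {name: float for name in records[0]}
    return [{name: cast(row[name]) for name, cast in types.items()}
            for row in records]
```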
* Deleted hail installation script * Removed hail versioning from jupyter scripts
* Renamed VariantSpark_Hail_EMR_Notebook.yaml -> VariantSpark_EMR_Notebook.yaml * Renamed VariantSpark_Hail_EMR_Step.yaml -> VariantSpark_EMR_Step.yaml and removed hail installation * Deprecated VariantSpark hail example for future recreation * Removed hail configuration from spot-cluster.yaml * Updated README.md to reflect new filenames
* Updated example code to align with non-hail API * Corrected inaccurate requirement versions
Labels
dependencies, enhancement, java, python
Major issues and features addressed in this update
VariantSpark's Python wrapper has been refactored so that Random Forest models are created from a standalone class
* `python/varspark/rfmodel.py`, `python/varspark/core.py`, `python/varspark/__init__.py`, `src/main/scala/au/csiro/variantspark/api/GetRFModel.scala`

A non-hail export model function was created
* `src/main/scala/au/csiro/variantspark/api/ExportModel.scala`

The `FeatureSource` class, which provides wrapper functionality for initialising genotype data for model training, has been moved to a standalone class
* `head(nrows, ncols)` allows the first n rows and columns to be viewed as a pandas DataFrame
* `python/varspark/featuresource.py`, `python/varspark/core.py`, `src/main/scala/au/csiro/variantspark/input/FeatureSource.scala`

Covariate support was extended
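The `head(nrows, ncols)` preview mentioned above can be sketched in plain Python as corner-slicing a 2-D structure (the real method returns a pandas DataFrame; this toy version uses nested lists):

```python
def head(matrix, nrows=5, ncols=5):
    """Return the first `nrows` x `ncols` corner of a 2-D list.

    Toy stand-in for the wrapper's head(nrows, ncols) preview, which
    returns a pandas DataFrame in the real API.
    """
    return [row[:ncols] for row in matrix[:nrows]]
```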
* Covariates are imported through the `FeatureSource` wrapper class and are also of type `RDD[Feature]`; they also support `head()`
* `src/main/scala/au/csiro/variantspark/api/VSContext.scala`, `src/main/scala/au/csiro/variantspark/input/CsvStdFeatureSource.scala`, `src/main/scala/au/csiro/variantspark/input/UnionedFeatureSource.scala`, `python/varspark/lfdrvsnohail.py`

Importance analyses were moved to a standalone Python wrapper class
* `important_variables()` and `variable_importance()` now return pandas DataFrames
* `variable_importance()` now includes `splitCount` in its return (required for Local FDR calculations)
* `precision` supports rounding for `variable_importance()`
* `normalized` indicates whether to normalise importances for both functions
* `python/varspark/importanceanalysis.py`, `python/varspark/core.py`, `src/main/scala/au/csiro/variantspark/api/ImportanceAnalysis.scala`, `src/main/scala/au/csiro/variantspark/api/AnalyticsFunctions.scala`

Moved the lfdr file to the non-hail Python directory
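The `normalized` and `precision` options described above amount to scaling importances to sum to 1 and then rounding. A minimal dict-based sketch of that behaviour (the real method operates on Spark results and returns a pandas DataFrame):

```python
def variable_importance(importances, normalized=True, precision=None):
    """Sketch of the wrapper's normalise/round options: scale raw
    importance scores so they sum to 1, then optionally round them.

    Toy version over a plain dict; not VariantSpark's implementation.
    """
    total = sum(importances.values()) if normalized else 1.0
    out = {k: v / total for k, v in importances.items()}
    if precision is not None:
        out = {k: round(v, precision) for k, v in out.items()}
    return out
```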
* `python/varspark/hail/lfdrvs.py` -> `python/varspark/lfdrvs.py`

Updated all test cases according to the above changes
* `src/test/scala/au/csiro/variantspark/api/CommonPairwiseOperationTest.scala`, `ImportanceApiTest.scala`
* `src/test/scala/au/csiro/variantspark/misc/ReproducibilityTest.scala`, `CovariateReproducibilityTest.scala`
* `src/test/scala/au/csiro/variantspark/test/TestSparkContext.scala`
* `python/varspark/test/test_core.py`, `test_hail.py`, `test_pvalues_calculation.py`
* `src/test/scala/au/csiro/variantspark/work/hail/HailApiApp.scala`

Removed all files used exclusively in the hail version
* `python/varspark/hail/__init__.py`, `context.py`, `hail.py`, `methods.py`, `plot.py`
* `src/main/scala/au/csiro/variantspark/hail/methods/RFModel.scala`

Removed hail installation from `pom.xml`