Issue/237 #241
Conversation
FEAT: Implemented RF class method for fitting the model
FEAT: Implemented RF class method for obtaining importance analysis from a fitted RF
FEAT: Implemented RF class method for returning OOB error
FEAT: Implemented RF class method for obtaining FDR from a fitted model
FEAT: Implemented RF class method for exporting the forest to JSON
REFACTOR: Make RF model available at package level
CHORE: Added type checking to all methods
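As a rough illustration of the FDR idea behind the new class method: VariantSpark computes a *local* FDR from importance scores, but the simpler global Benjamini-Hochberg procedure shows the same p-value-to-FDR adjustment in a few lines of standard-library Python. This sketch is not the wrapper's implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values, preserving input order.

    Illustrative only: VariantSpark's RF method computes a *local* FDR
    from importance scores; this is the simpler global BH adjustment.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvalues[i] * n / rank)
        qvalues[i] = q
        prev = q
    return qvalues

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```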
REFACTOR: Removed FeatureSource and ImportanceAnalysis classes from core
REFACTOR: Added FeatureSource import so features can be returned as a class instantiation
REFACTOR: Removed importance analysis and model training
FEAT: Added conversion from feature to RDD (python)
FEAT: Added conversion from feature to RDD (scala)
CHORE: Added type checking
due to import order warning (#237)
separate wrapper file (#237)
REFACTOR: Updated important_variables and variable_importance methods to convert to pandas DataFrames
REFACTOR: Removed model training from object instantiation and updated class to accept a model as a parameter
REFACTOR: Added normalisation as an optional parameter for variable importance methods
FEAT: Updated variableImportance method to include splitCount in return, as it is required for local FDR analysis
and passes back to python context (#237)
from importAnalysis method of AnalyticsFunctions (#237)
FIX: Update export function to process trees in batches, instead of collecting the whole forest as a map as this led to OOM errors on large forests
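The OOM fix above replaces a collect-the-whole-forest-as-a-map step with batched processing. The real change is in the Scala export code; the sketch below shows the batching idea in standard-library Python under the assumption that each tree serialises to a JSON-compatible dict.

```python
import json

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def export_forest(trees, writer, batch_size=100):
    """Serialise trees to `writer` one batch at a time (one JSON object
    per line), so the full forest is never held in memory at once.

    Hypothetical helper for illustration; not VariantSpark's exporter.
    """
    for batch in batched(trees, batch_size):
        writer.write("\n".join(json.dumps(t) for t in batch) + "\n")
```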
REFACTOR: Refactor to mirror changes to python wrapper
FEAT: Include FDR calculation in unit test
FEAT: Implement function for Manhattan plotting of negative log p-values
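The y-axis transform behind that plotting function is simple enough to sketch: a Manhattan plot shows -log10 of each p-value, with a floor to avoid log10(0). This is a stdlib illustration of the transform only, not the wrapper's plotting code; the `floor` parameter is an assumption.

```python
import math

def neg_log10_pvalues(pvalues, floor=1e-300):
    """Convert p-values to the -log10 scale used on a Manhattan plot's
    y-axis; `floor` guards against taking log10 of zero."""
    return [-math.log10(max(p, floor)) for p in pvalues]

print(neg_log10_pvalues([1.0, 0.05, 1e-8]))
```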
STYLE: Format with black
FEAT: Add wrapper class for importing covariates
FEAT: Add wrapper class for unioning features and covariates
REFACTOR: Include covariate filtering in manhattan plot function
STYLE: Format with black (#237)
FEAT: Add functions for importing std and transposed CSVs
FEAT: Add function for unioning features and covariates
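A "transposed" CSV here lays samples out as columns and features as rows, the opposite of the standard orientation. The actual importer is `CsvStdFeatureSource.scala`; the sketch below just shows the flip itself with the standard library.

```python
import csv
import io

def read_transposed_csv(text):
    """Read a CSV laid out with samples as columns and features as rows,
    returning it flipped to one row per sample.

    Illustrative stdlib sketch; the real importer is Scala/Spark code.
    """
    rows = list(csv.reader(io.StringIO(text)))
    # zip(*rows) swaps rows and columns.
    return [list(col) for col in zip(*rows)]
```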
* Deleted hail tests and associated .vds files * Scripts not consumed by any tests or code, so safe to remove
* Remove hail workaround; pypandoc can now be installed normally via dev-requirements.txt
* No longer needed because Hail is not included in the build
* core.py: added default `sparkPar` parameter when importing VCFs to avoid Py4J overload resolution issues * test_rfmodel: updated to pass VarsparkContext (`self.vc`) instead of SparkSession (`self.spark`), ensuring consistency with the Python-Spark integration layer
* Avoid storing VCF on driver memory * Implement hadoop-bam's BGZFCodec with Hadoop File API * Standardise VCFSource instantiations to consistently use the Spark context + file path factory method * Delegate BGZ-aware text file loading to BGZLoader in SparkArgs
* Only covariates listed in the argument are included. * If no argument is provided, all covariates are cast to default types.
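The two bullets above describe opt-in filtering with a cast-everything default. A minimal sketch of that behaviour, assuming a hypothetical `load_covariates` helper, dict-shaped records, and float as the default cast (the actual default types are not stated here):

```python
def load_covariates(records, types=None):
    """Keep only the covariates named in `types`, casting each to the
    requested type; when `types` is None, keep every covariate and fall
    back to a default cast (float, as an assumption for this sketch)."""
    if types is None:
        types = {name: float for name in records[0]}
    return [{name: cast(row[name]) for name, cast in types.items()}
            for row in records]
```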
* Deleted hail installation script * Removed hail versioning from jupyter scripts
* Renamed VariantSpark_Hail_EMR_Notebook.yaml -> VariantSpark_EMR_Notebook.yaml * Renamed VariantSpark_Hail_EMR_Step.yaml -> VariantSpark_EMR_Step.yaml and removed hail installation * Deprecated VariantSpark hail example for future recreation * Removed hail configuration from spot-cluster.yaml * Updated README.md to reflect new filenames
* Updated example code to align with non-hail API * Corrected inaccurate requirement versions
Labels
dependencies, enhancement, java, python
Major issues and features addressed in this update
VariantSpark's Python wrapper has been refactored so that Random Forest models are created from a standalone class
* `python/varspark/rfmodel.py`, `python/varspark/core.py`, `python/varspark/__init__.py`, `src/main/scala/au/csiro/variantspark/api/GetRFModel.scala`

A non-hail export model function was created
* `src/main/scala/au/csiro/variantspark/api/ExportModel.scala`

The `FeatureSource` class, which provides wrapper functionality for initialising genotype data for model training, has been moved to a standalone class
* `head(nrows, ncols)` allows the first n rows and columns to be viewed as a pandas DataFrame
* `python/varspark/featuresource.py`, `python/varspark/core.py`, `src/main/scala/au/csiro/variantspark/input/FeatureSource.scala`

Covariate support was extended
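The `head(nrows, ncols)` preview mentioned above can be sketched in plain Python as corner-slicing a 2-D structure (the real method returns a pandas DataFrame; this toy version uses nested lists):

```python
def head(matrix, nrows=5, ncols=5):
    """Return the first `nrows` x `ncols` corner of a 2-D list.

    Toy stand-in for the wrapper's head(nrows, ncols) preview, which
    returns a pandas DataFrame in the real API.
    """
    return [row[:ncols] for row in matrix[:nrows]]
```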
* Covariates are imported through the `FeatureSource` wrapper class and are also of type `RDD[Feature]`; they also support `head()`
* `src/main/scala/au/csiro/variantspark/api/VSContext.scala`, `src/main/scala/au/csiro/variantspark/input/CsvStdFeatureSource.scala`, `src/main/scala/au/csiro/variantspark/input/UnionedFeatureSource.scala`, `python/varspark/lfdrvsnohail.py`

Importance analyses were moved to a standalone Python wrapper class
* `important_variables()` and `variable_importance()` now return pandas DataFrames
* `variable_importance()` now includes `splitCount` in its return (required for Local FDR calculations)
* `precision` supports rounding for `variable_importance()`
* `normalized` indicates whether to normalise importances for both functions
* `python/varspark/importanceanalysis.py`, `python/varspark/core.py`, `src/main/scala/au/csiro/variantspark/api/ImportanceAnalysis.scala`, `src/main/scala/au/csiro/variantspark/api/AnalyticsFunctions.scala`

Moved the lfdr file to the non-hail Python directory
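The `normalized` and `precision` options described above amount to scaling importances to sum to 1 and then rounding. A minimal dict-based sketch of that behaviour (the real method operates on Spark results and returns a pandas DataFrame):

```python
def variable_importance(importances, normalized=True, precision=None):
    """Sketch of the wrapper's normalise/round options: scale raw
    importance scores so they sum to 1, then optionally round them.

    Toy version over a plain dict; not VariantSpark's implementation.
    """
    total = sum(importances.values()) if normalized else 1.0
    out = {k: v / total for k, v in importances.items()}
    if precision is not None:
        out = {k: round(v, precision) for k, v in out.items()}
    return out
```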
* `python/varspark/hail/lfdrvs.py` -> `python/varspark/lfdrvs.py`

Updated all test cases according to the above changes
* `src/test/scala/au/csiro/variantspark/api/CommonPairwiseOperationTest.scala`, `ImportanceApiTest.scala`
* `src/test/scala/au/csiro/variantspark/misc/ReproducibilityTest.scala`, `CovariateReproducibilityTest.scala`
* `src/test/scala/au/csiro/variantspark/test/TestSparkContext.scala`
* `python/varspark/test/test_core.py`, `test_hail.py`, `test_pvalues_calculation.py`
* `src/test/scala/au/csiro/variantspark/work/hail/HailApiApp.scala`

Removed all files used exclusively in the hail version
* `python/varspark/hail/__init__.py`, `context.py`, `hail.py`, `methods.py`, `plot.py`
* `src/main/scala/au/csiro/variantspark/hail/methods/RFModel.scala`

Removed hail installation from `pom.xml`