Aaron Masino edited this page Jan 7, 2014 · 4 revisions

There are a number of ETL requirements to support other sub-projects within Phenomantics. They are described here.

HPO Ingest

Last executed

  1. hpo_687_130220.obo
  2. genes_to_phenotype_26_1302201.txt

Obtaining Source Files

hpo_latest.obo

genes_to_phenotypes.txt

For more information, see HPO Downloads.

Description

The etl/hpo_ingest/hpo_file_converters.scala file is an ETL script that takes two inputs: the HPO ontology file, hpo_version_date.obo, and the annotation file that maps HPO terms to Entrez genes, genes_to_phenotype_version_date.txt. It converts them into properly formatted resource files that can be used by the phenomantics API application.

The output files are ENTREZ.txt, HPO_ALT_IDS.txt, HPO_TERMS.txt, ENTREZ_HPO_ANNOTATIONS.txt, and HPO_ANCESTORS.txt.
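To make the conversion more concrete, here is a minimal Python sketch of the kind of processing an OBO file needs before files such as HPO_TERMS.txt, HPO_ALT_IDS.txt, and HPO_ANCESTORS.txt can be written. It is not the project's Scala code, and the actual output formats are not documented on this page. The function names and the toy `EX:` identifiers are invented for illustration; only the `[Term]` stanza layout and the `id`, `name`, `alt_id`, and `is_a` tags come from the OBO format itself.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas from OBO-formatted text.

    Returns a dict: term id -> {"name": str, "alt_ids": [str], "parents": [str]}.
    """
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            if line == "[Term]":
                current = {"name": "", "alt_ids": [], "parents": []}
            else:
                current = None
            continue
        if current is None or ": " not in line:
            continue  # header lines, blank lines, or non-Term stanzas
        tag, value = line.split(": ", 1)
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "alt_id":
            current["alt_ids"].append(value)
        elif tag == "is_a":
            # Drop the trailing "! parent name" comment.
            current["parents"].append(value.split(" ! ", 1)[0].strip())
    return terms


def ancestors(terms, term_id):
    """Return every term reachable from term_id via is_a links (excluding itself)."""
    seen = set()
    stack = list(terms[term_id]["parents"])
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        stack.extend(terms.get(parent, {"parents": []})["parents"])
    return seen


# Toy ontology with invented identifiers (not real HPO terms).
SAMPLE = """format-version: 1.2

[Term]
id: EX:0000001
name: root

[Term]
id: EX:0000002
name: middle
alt_id: EX:0000009
is_a: EX:0000001 ! root

[Term]
id: EX:0000003
name: leaf
is_a: EX:0000002 ! middle
"""

if __name__ == "__main__":
    terms = parse_obo_terms(SAMPLE)
    print(sorted(ancestors(terms, "EX:0000003")))  # ['EX:0000001', 'EX:0000002']
```

The ancestor set is computed by walking `is_a` links transitively, which is presumably the information a file named HPO_ANCESTORS.txt would hold.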

The output files should be moved to apps/api/src/main/resources.
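The move can be scripted. The following Python sketch is one way to do it, assuming the five output files listed above sit together in a single directory. Where the converter actually writes them is not stated on this page, so the source directory is a parameter, and `move_outputs` is a hypothetical helper rather than part of the project.

```python
import shutil
from pathlib import Path

# File names taken from the list of converter outputs on this page.
OUTPUT_FILES = [
    "ENTREZ.txt",
    "HPO_ALT_IDS.txt",
    "HPO_TERMS.txt",
    "ENTREZ_HPO_ANNOTATIONS.txt",
    "HPO_ANCESTORS.txt",
]


def move_outputs(src_dir, dest_dir):
    """Move every expected output file from src_dir to dest_dir.

    Raises FileNotFoundError before moving anything if a file is missing,
    so the resources directory is never left half-updated.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    missing = [name for name in OUTPUT_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError("missing output files: " + ", ".join(missing))
    dest.mkdir(parents=True, exist_ok=True)
    for name in OUTPUT_FILES:
        shutil.move(str(src / name), str(dest / name))
    return [dest / name for name in OUTPUT_FILES]


# Example call, assuming it is run from the repository root and that the
# converter wrote its output into etl/hpo_ingest (an assumption):
#   move_outputs("etl/hpo_ingest", "apps/api/src/main/resources")
```

Checking for all five files up front means a partial converter run is caught before any resource file is replaced.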

Dependencies

  1. DataExpress
  2. hpo_BUILD_DATE.obo
  3. genes_to_phenotype_BUILD_DATE.txt

Genesis Similarity Scores to CDF Ingest

The genesis application generates sample similarity scores for all genes for a randomly selected set of phenotype queries of length k. This data is written to value-delimited files. The etl/gp_dist_ingest/genesis.scala script imports and processes the raw similarity scores to create CDF values, then stores them in a database with the proper configuration for use with the phenomantics API application. To run the ingest script as is:

  1. Move the genesis output data files to etl/gp_dist_ingest/data/SIMFUNC_k_##, where SIMFUNC is one of ASymSim1, SymSim1, or SymSim2 and ## is the number of query terms used in generating the data (add a leading 0 if k<10, e.g. 01, 02)

  2. Create the etl/gp_dist_ingest/conf/postgres.properties file with the following content (values in CAPS must be replaced with appropriate values)

    driverClassName=org.postgresql.Driver
    jdbcUri=jdbc:postgresql://IP:PORT/DBNAME
    user=USERNAME
    password=PASSWORD

  3. Ensure the PostgreSQL service is running and that the user specified in the postgres.properties file above has appropriate access privileges

  4. Add dataexpress.jar to etl/gp_dist_ingest/lib

  5. Run the ingest from within the etl/gp_dist_ingest directory with

    scala -cp "lib/*" genesis.scala
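
For readers unfamiliar with the CDF step mentioned at the start of this section, the following Python sketch shows the textbook empirical CDF computed from a list of raw similarity scores. How genesis.scala actually computes, bins, or stores its CDF values is not described on this page, so treat this as an illustration of the concept only. The function names `empirical_cdf` and `cdf_at` and the sample scores are invented.

```python
from bisect import bisect_right


def empirical_cdf(scores):
    """Return (values, cdf) where cdf[i] is the fraction of scores <= values[i]."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one score is required")
    values, cdf = [], []
    for rank, score in enumerate(ordered, start=1):
        if values and values[-1] == score:
            cdf[-1] = rank / n  # tie: keep the highest rank for this value
        else:
            values.append(score)
            cdf.append(rank / n)
    return values, cdf


def cdf_at(values, cdf, x):
    """Look up the empirical CDF at x (0.0 below the smallest observed score)."""
    idx = bisect_right(values, x)
    return cdf[idx - 1] if idx else 0.0


if __name__ == "__main__":
    # Invented sample scores, not real genesis output.
    values, cdf = empirical_cdf([0.5, 0.2, 0.9, 0.5])
    print(values, cdf)               # [0.2, 0.5, 0.9] [0.25, 0.75, 1.0]
    print(cdf_at(values, cdf, 0.6))  # 0.75
```

Storing one (value, CDF) pair per distinct score keeps lookups to a single binary search, which is one plausible reason to precompute the CDF rather than keep the raw scores.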