Database Management for the pipeline. #95

grabear · 2017-10-06T23:19:50Z

Database Management will first be developed in Manager/database_management.py
Another module for creating template BioSQL databases will be developed in Manager/BioSQL/biosql.py
It will help keep the following databases updated:
- ETE3's NCBI-taxonomy database
- Local NCBI databases
  - blast database /blast/db NEW (May 2019)
  - GenBank flat files from NCBI's RefSeq release (BioSQL) /refseq/release/<collection_subset>
  - ~~[ ] gi lists OR Should we convert this to accession.version via this or this~~
    - ~~vertebrate_mammalian~~

sdhutchins · 2017-11-09T20:27:15Z

@grabear status? Closeable? lol

grabear · 2017-11-09T20:27:47Z

Not yet lol. @sdhutchins

grabear · 2018-01-18T00:51:42Z

Update on this issue.

Scope

Manager/BioSQL/biosql.py
Manager/database_management.py
Manager/database_dispatcher.py
Manager/utils.py
Manager/config/yml/database_config.yml
Manager/db_mana_test.py

Tested Functionality

The following checked items have been tested by changing the parameters in the config file.

Logging is looking amazing.
YAML config file format (database_config.yml)

Configuration

Archiving

Bugs need to be fixed with the file movement and deletion after archiving.

Archiving

Deletion

Not Tested

Config File Explanation and Preview

The config file is loaded into Python as a nested dictionary. The top-level key-value pairs, such as:

email:  "rgilmore@umc.edu"
driver: "sqlite3"

are used for changing the parameters in the BaseDatabaseManagement class.

The strategies for dispatching tasks are dictionary keys in the config file and include the following:

['Full', 'Projects', 'NCBI', 'NCBI_blast', 'NCBI_blast_db', 'NCBI_blast_windowmasker_files', 'NCBI_pub_taxonomy', 'NCBI_refseq_release', 'ITIS', 'ITIS_taxonomy']

Some keys are nested in the config file. The concept to note here is that top-level keys (or strategies) have flags that control any sub-level strategies. So if the configure_flag for 'Full' is True, then the configure_flag for 'Projects', 'NCBI', 'NCBI_blast', 'NCBI_blast_db', 'NCBI_blast_windowmasker_files', 'NCBI_pub_taxonomy', 'NCBI_refseq_release', 'ITIS', and 'ITIS_taxonomy' will also be interpreted as True when the database functions are dispatched.

Below I've added a preview of the entire database_config.yml file for consideration of the above statements.

Database_config:
  email:  "rgilmore@umc.edu"
  driver: "sqlite3"
  Full:
    configure_flag: False
    archive_flag: False
    delete_flag: False
    project_flag: False
    _path: "!!python/object/apply:pathlib.Path ['']"
    Projects:
      Project_Name_1:
        configure_flag: True
        archive_flag: False
        delete_flag: False
        _path: "!!python/object/apply:pathlib.Path ['Project_Name_1']"
      Project_Name_2:
        configure_flag: True
        archive_flag: False
        delete_flag: False
        _path: "!!python/object/apply:pathlib.Path ['Project_Name_2']"
      Project_Name_3:
        configure_flag: True
        archive_flag: False
        delete_flag: False
        _path: "!!python/object/apply:pathlib.Path ['Project_Name_3']"
    NCBI:
      configure_flag: False
      archive_flag: False
      delete_flag: False
      _path: "!!python/object/apply:pathlib.Path ['NCBI']"
      NCBI_blast:
        configure_flag: False
        archive_flag: False
        delete_flag: False
        _path: "!!python/object/apply:pathlib.Path ['NCBI', 'blast']"
        NCBI_blast_db:
          configure_flag: False
          archive_flag:  False
          delete_flag: False
          _path: "!!python/object/apply:pathlib.Path ['NCBI', 'blast', 'db']"
        NCBI_blast_windowmasker_files:
          configure_flag: False
          archive_flag: False
          delete_flag: False
          _path: "!!python/object/apply:pathlib.Path ['NCBI', 'blast', 'windowmasker_files']"
          taxonomy_ids: ""
      NCBI_pub_taxonomy:
        configure_flag: True
        archive_flag: False
        delete_flag: False
        _path: "!!python/object/apply:pathlib.Path ['NCBI', 'pub', 'taxonomy']"
      NCBI_refseq_release:
        seqtype: "rna" # Other seqtypes are protein and genomic
        seqformat: "gbff"
        collection_subset: "vertebrate_mammalian"
        configure_flag: False
        archive_flag: False
        delete_flag: False
        upload_flag: False
        _path: "!!python/object/apply:pathlib.Path ['NCBI', 'refseq', 'release']"
        upload_list: [1,2,3,4,5,6,7,8,9,10]
    ITIS:
      configure_flag: True
      archive_flag: False
      delete_flag: False
      _path: "!!python/object/apply:pathlib.Path ['ITIS']"
      ITIS_taxonomy:
        configure_flag: True
        archive_flag: False
        delete_flag: False
        _path: "!!python/object/apply:pathlib.Path ['ITIS', 'taxonomy']"
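
The flag inheritance described above can be sketched roughly as follows. This is only an illustration of the idea, not the dispatcher's actual code: the helper name `effective_flags`, the `FLAGS` tuple, and the trimmed-down `config` dictionary are assumptions made for the example, and it assumes a parent flag set to True simply overrides whatever the nested strategy says.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

# Flags that (by assumption) propagate from a strategy to its sub-strategies.
FLAGS = ("configure_flag", "archive_flag", "delete_flag")


def effective_flags(
    name: str,
    node: Dict[str, Any],
    inherited: Optional[Dict[str, bool]] = None,
) -> Iterator[Tuple[str, Dict[str, bool]]]:
    """Yield (strategy name, resolved flags) for a strategy and everything nested in it."""
    inherited = inherited or dict.fromkeys(FLAGS, False)
    # A flag counts as True if this strategy sets it or any parent strategy does.
    current = {flag: bool(node.get(flag, False)) or inherited[flag] for flag in FLAGS}
    yield name, current
    for key, child in node.items():
        if isinstance(child, dict):
            yield from effective_flags(key, child, current)


# Hypothetical, trimmed-down version of the nested config shown above.
config = {
    "Full": {
        "configure_flag": True,
        "archive_flag": False,
        "delete_flag": False,
        "NCBI": {
            "configure_flag": False,
            "archive_flag": False,
            "delete_flag": False,
            "NCBI_blast": {
                "configure_flag": False,
                "archive_flag": True,
                "delete_flag": False,
                "NCBI_blast_db": {
                    "configure_flag": False,
                    "archive_flag": False,
                    "delete_flag": False,
                },
            },
        },
        "ITIS": {
            "configure_flag": False,
            "archive_flag": False,
            "delete_flag": False,
        },
    }
}

resolved = dict(effective_flags("Full", config["Full"]))
for strategy, flags in resolved.items():
    print(strategy, flags)
```

With this reading, setting configure_flag on 'Full' turns configuration on for every nested strategy, while archive_flag set on 'NCBI_blast' only reaches 'NCBI_blast_db'.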

grabear · 2018-09-10T03:59:24Z

Current ToDo List:

- Resolve dependency issues (sqlalchemy, tqdm, luigi, sciluigi; others?)
- Make sure ete3 taxdump files go to the proper database folder

      comparative_genetics - line 240 ish
      # Load taxon ids from a local NCBI taxon database via ete3
      ncbi = NCBITaxa()

- Create a class for managing the sqlite3 database, including the duplicates and missing values
- Resolve None/NaN values in accession data
- Make sure the database_dispatcher is creating the proper subdirectory for NCBI_refseq_release (e.g. vertebrate_mammalian)

grabear · 2018-10-05T21:24:09Z

Fix the NCBI_refseq_release database_management functionality:

OrthoEvolution/OrthoEvol/Manager/database_management.py

Lines 505 to 543 in a487971

    
    # Create a list of lists with an index corresponding to the upload number
    if file_list is None:
        db_path = self.database_path / Path('NCBI') / Path('refseq') / Path('release') / Path(collection_subset)
        file_list = os.listdir(str(db_path))
        file_list = [x for x in file_list if x.endswith(str(seqformat))]
    sub_upload_size = len(file_list) // upload_number
    sub_upload_lists = [file_list[x:x + 100] for x in range(0, len(file_list), sub_upload_size)]
    if (len(file_list) % upload_number) != 0:
        upload_number = upload_number + 1
    add_to_default = 0
    for sub_list in sub_upload_lists:
        add_to_default += 1
        nrr_dispatcher["NCBI_refseq_release"]["upload"].append(refseq_jobber)
        code_dict_string = str({
            "collection_subset": collection_subset,
            "seqtype": seqtype,
            "seqformat": seqformat,
            "upload_list": sub_list,
            "add_to_default": add_to_default
        })
        # Create a Python script for this in the package
        sge_code_string = \
        "from OrthoEvol.Manager.management import ProjectManagement\n" \
        "from OrthoEvol.Manager.database_dispatcher import DatabaseDispatcher\n" \
        "from OrthoEvol.Manager.config import yml\n" \
        "from pkg_resources import resource_filename\n" \
        "import yaml\n" \
        "pm_config_file = resource_filename(yml.__name__, \"config_template_existing.yml\")\n" \
        "with open(pm_config_file, \'r\') as f:\n" \
        "   pm_config = yaml.safe_load(f)\n" \
        "pm = ProjectManagement(**pm_config[\"Management_config\"])\n" \
        "code_dict_string = %s\n" \
        "R_R = DatabaseDispatcher(config_file=\"%s\", proj_mana=pm, upload_refseq_release=True, **code_dict_string)\n" % \
            (code_dict_string, self.config_file)
        nrr_config["NCBI_refseq_release"]["upload"].append({
            "code": sge_code_string,
            "base_jobname": "upload_rr_%s",
            "email_address": self.email,
            "id": add_to_default})
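
One thing that stands out in the snippet above is that the slice width is hard-coded to 100 while the step is sub_upload_size, so the sub lists can overlap or skip files whenever the two differ, and a file list shorter than upload_number gives a step of 0. A minimal sketch of one way to split the list into contiguous, non-overlapping chunks is below. The helper name `split_into_uploads` and the example file names are made up for illustration, and it uses ceiling division instead of the floor-division-plus-one-extra-upload approach in the quoted code, so it is a possible direction rather than the actual fix.

```python
from typing import List, Sequence


def split_into_uploads(file_list: Sequence[str], upload_number: int) -> List[List[str]]:
    """Split file_list into at most upload_number contiguous, non-overlapping chunks."""
    if upload_number < 1:
        raise ValueError("upload_number must be at least 1")
    if not file_list:
        return []
    # Ceiling division, so no more than upload_number chunks are ever produced.
    chunk_size = -(-len(file_list) // upload_number)
    # Slice width and step are the same value, so chunks neither overlap nor leave gaps.
    return [list(file_list[i:i + chunk_size]) for i in range(0, len(file_list), chunk_size)]


# Hypothetical file names, only to exercise the helper.
files = [f"vertebrate_mammalian.{n}.rna.gbff" for n in range(1, 24)]
chunks = split_into_uploads(files, upload_number=10)
for upload_id, sub_list in enumerate(chunks, start=1):
    print(upload_id, len(sub_list))
```

Each chunk could then be used as the upload_list for one job, with the enumerate index playing the role of add_to_default.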

sdhutchins · 2019-05-15T20:48:26Z

One question as I test this out, @grabear:

How would I use an existing database (refseq)? Or do we need to create an if statement that compares the size of the current refseq path (if it exists) to the FTP file path?

grabear · 2019-05-20T17:32:22Z

@sdhutchins
Sorry I missed this...

I don't quite understand your question though. Are you asking how do we know if our data is up to date?

Do you still need help with this?

grabear · 2019-05-20T17:44:24Z

Things to do:

- New blast database
- Make sure that ete3's NCBITaxa() call is using the file we manually download via DatabaseManagement

      NCBITaxa(taxdump_file="out_path/taxdump.tar.gz")

- Move the Template-BioSQL-SQLite.db to the top level of repositories
- Consider moving refseq_release databases as well. If implemented, we could copy them for sqlite and then delete them after pipeline use.
- Try to work on MySQL or PG
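
For the NCBITaxa item above, a rough sketch of pointing ete3 at files inside the managed database folder might look like this. The helper names (`taxonomy_paths`, `load_ncbi_taxa`), the `taxa.sqlite` file name, and the choice of the NCBI/pub/taxonomy folder (taken from the NCBI_pub_taxonomy _path in the config) are assumptions for illustration. It relies on NCBITaxa accepting dbfile and taxdump_file arguments, as in ete3 3.x.

```python
from pathlib import Path
from typing import Tuple, Union


def taxonomy_paths(database_path: Union[str, Path]) -> Tuple[Path, Path]:
    """Return (sqlite file, taxdump file) under <database_path>/NCBI/pub/taxonomy."""
    taxonomy_dir = Path(database_path) / "NCBI" / "pub" / "taxonomy"
    # "taxa.sqlite" is an assumed name for the SQLite file ete3 builds from the dump.
    return taxonomy_dir / "taxa.sqlite", taxonomy_dir / "taxdump.tar.gz"


def load_ncbi_taxa(database_path: Union[str, Path]):
    """Open ete3's taxonomy database from the managed folder, building it if needed."""
    # Imported lazily so the path helper stays usable without ete3 installed.
    from ete3 import NCBITaxa

    dbfile, taxdump_file = taxonomy_paths(database_path)
    if dbfile.exists():
        # Reuse the SQLite database that was already built.
        return NCBITaxa(dbfile=str(dbfile))
    # Build the SQLite database from the manually downloaded taxdump archive.
    return NCBITaxa(dbfile=str(dbfile), taxdump_file=str(taxdump_file))


dbfile, taxdump_file = taxonomy_paths("databases")
print(dbfile)
print(taxdump_file)
```

Passing dbfile explicitly should keep ete3 from falling back to its default location in the user's home directory, which is the behavior the to-do item is trying to avoid.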

sdhutchins added the Feature label Oct 10, 2017

sdhutchins added this to Pipeline Integration in To-Do Lists Oct 10, 2017

sdhutchins added this to the Official project release milestone Oct 13, 2017

sdhutchins assigned grabear and sdhutchins Oct 13, 2017

sdhutchins added the High Priority label Nov 1, 2017

sdhutchins added the ROB-ASSIGNED label Nov 13, 2017

sdhutchins added this to In Progress in Official Release Nov 14, 2017

grabear mentioned this issue Jan 18, 2018

Using YAML tags and Python types with PyYAML. #118

Closed

grabear added Type: Enhancement ❤️ Priority: Critical 🔥🔥🔥🔥 and removed Feature labels Aug 28, 2018

sdhutchins moved this from In Progress to To Do in Official Release Aug 28, 2018

sdhutchins moved this from To Do to In Progress in Official Release Jun 12, 2019

sdhutchins linked a pull request Jul 15, 2020 that will close this issue

Major updates to database management, ftp, pbs, etc. #155

Merged

grabear mentioned this issue Jul 24, 2020

Major updates to database management, ftp, pbs, etc. #155

Merged

sdhutchins closed this as completed in #155 Aug 21, 2020

Official Release automation moved this from In Progress to Done Aug 21, 2020
