Download and process Entrez Gene #1
The method and code for creating
I reviewed the code. I did not run the code. The code LGTM 👍
('#tax_id', 'tax_id'),
])

gene_df = (pandas.read_table(path, compression='gzip', na_values='-')
Whoa - love this way of reading these files. We should consider this for django-genes (cc @rzelayafavila)
(specifically the pandas approach here - cleaner than CSV parsing by field index)
> cleaner than CSV parsing by field index
Importantly, reference by column name is more resilient than reference by column position.
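To illustrate the point about resilience, here is a minimal sketch. It assumes a toy tab-separated table whose columns only loosely resemble a gene info file; the `source` column and the helper names `symbols_by_name` and `symbols_by_position` are hypothetical and not part of this pull request.

```python
import io

import pandas

# Toy stand-in for a gene-info-style table (assumed columns, not real data).
tsv_v1 = "#tax_id\tGeneID\tSymbol\n9606\t1\tA1BG\n9606\t2\tA2M\n"
# Same records after a hypothetical upstream change that inserts a column.
tsv_v2 = "#tax_id\tsource\tGeneID\tSymbol\n9606\tncbi\t1\tA1BG\n9606\tncbi\t2\tA2M\n"


def symbols_by_name(text):
    """Look the column up by its header, wherever it sits in the file."""
    df = pandas.read_csv(io.StringIO(text), sep="\t", na_values="-")
    return df["Symbol"].tolist()


def symbols_by_position(text, index=2):
    """Look the column up by a hard-coded field index."""
    return [line.split("\t")[index] for line in text.splitlines()[1:]]


# Name-based access gives the same symbols for both layouts.
print(symbols_by_name(tsv_v1), symbols_by_name(tsv_v2))
# Position-based access silently returns GeneIDs for the second layout.
print(symbols_by_position(tsv_v1), symbols_by_position(tsv_v2))
```

The positional version does not fail loudly when the layout shifts; it returns the wrong column, which is the harder kind of bug to notice.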
Tagging @rzelayafavila, our lab's expert in Entrez Gene, for pull request review.
This repository creates the set of genes to be used in Project Cognoma. The human subset of Entrez Gene is the basis of Cognoma genes. All genes in Cognoma should be converted to Entrez GeneIDs (using a preferred variable name of `entrez_gene_id`).

When encountering genes in Project Cognoma, identify which of the following approaches should be applied:
@rzelayafavila, can you look at these mapping strategies and comment on whether they'll be compatible with django-genes? If we populate django-genes from the files in the `download` directory but use these strategies when creating our cancer-data, will everything align in great harmony?
Genes with multiple chromosomes now receive one row per chromosome while also retaining the row with the multi-chromosome value (see cognoma#2 (comment)). Genes with a missing value for chromosome are removed.
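A minimal sketch of that chromosome handling, not the notebook's actual code. It assumes multi-chromosome values are pipe-delimited (such as `X|Y`); the column names, gene symbols, and the `chromosome_df` variable are illustrative assumptions.

```python
import pandas

# Toy gene table; symbols and IDs are hypothetical placeholders.
gene_df = pandas.DataFrame({
    "entrez_gene_id": [1, 2, 3],
    "symbol": ["GENE_A", "GENE_B", "GENE_C"],
    "chromosome": ["19", "X|Y", None],  # assumed "|"-delimited multi value
})

# Remove genes with a missing chromosome value.
gene_df = gene_df.dropna(subset=["chromosome"])

# Select genes whose chromosome field lists more than one chromosome.
is_multi = gene_df["chromosome"].str.contains("|", regex=False)
multi_df = gene_df[is_multi]

# One row per individual chromosome for those genes.
split_df = multi_df.assign(
    chromosome=multi_df["chromosome"].str.split("|")
).explode("chromosome")

# Keep the original rows (including the multi-chromosome value)
# and append the per-chromosome rows.
chromosome_df = (
    pandas.concat([gene_df, split_df])
    .sort_values(["entrez_gene_id", "chromosome"])
    .reset_index(drop=True)
)
print(chromosome_df)
```

Keeping the combined value alongside the split rows means a lookup by either `X`, `Y`, or `X|Y` finds the gene.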
`0.genes-download.ipynb` is a notebook to download datasets from `cognoma/genes`. Update `2.TCGA-process.ipynb` to use the gene mapping guidelines in cognoma/genes#1. Remove `mapping/PANCAN-mutation/` since this mapping is now done in `2.TCGA-process.ipynb`. Closes cognoma#23. Closes cognoma#30 by exporting gene info files in `2.TCGA-process.ipynb`
@rzelayafavila, I merged this pull request in the interest of time. Feedback on compatibility with django-genes would still be appreciated and can be left here or in a new issue.
* Outsource Entrez Gene logic to cognoma/genes

  `0.genes-download.ipynb` is a notebook to download datasets from `cognoma/genes`. Update `2.TCGA-process.ipynb` to use the gene mapping guidelines in cognoma/genes#1. Remove `mapping/PANCAN-mutation/` since this mapping is now done in `2.TCGA-process.ipynb`. Closes #23. Closes #30 by exporting gene info files in `2.TCGA-process.ipynb`
* Average expression values for the same gene
* Update cognoma/genes download location
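The commit message does not say how expression values for the same gene are averaged. One way it could be done, sketched here under assumptions: a samples-by-genes matrix whose columns have already been mapped to `entrez_gene_id`, so that two source identifiers landing on the same gene produce duplicate column labels. The sample names, gene IDs, and values are made up.

```python
import pandas

# Toy expression matrix: rows are samples, columns are Entrez GeneIDs.
# GeneID 100 appears twice, as if two source identifiers mapped to it.
expr_df = pandas.DataFrame(
    [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]],
    index=["sample_1", "sample_2"],
    columns=[100, 100, 200],
)
expr_df.columns.name = "entrez_gene_id"

# Transpose so genes are rows, average rows sharing a GeneID,
# then transpose back to samples-by-genes.
averaged_df = expr_df.T.groupby(level=0).mean().T
print(averaged_df)
```

After this step each `entrez_gene_id` appears in exactly one column.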
Download and process Entrez Gene. Create gene identification guidelines for Project Cognoma. Closes cognoma#2.
The purpose of this repository is to create a standard gene terminology for all of Cognoma. Repositories will no longer have to download gene data directly from the Entrez Gene FTP site (ftp://ftp.ncbi.nih.gov/gene/). The main benefit is that it will now be possible to use versioned gene data and have a consistent terminology. This pull request also establishes guidelines for mapping genes.

Large downloaded files are tracked. I considered git-lfs but since we use a forking model for contributions, git-lfs was not an option.