Download and process Entrez Gene #1

Merged: 9 commits, Oct 9, 2016

@dhimmel dhimmel commented Oct 7, 2016

The purpose of this repository is to create a standard gene terminology for all of Cognoma. Repositories will no longer have to download gene data directly from the Entrez Gene FTP site (ftp://ftp.ncbi.nih.gov/gene/). The main benefit is that it will now be possible to use versioned gene data and have a consistent terminology. This pull request also establishes guidelines for mapping genes.

Large downloaded files are tracked. I considered git-lfs but since we use a forking model for contributions, git-lfs was not an option.


dhimmel commented Oct 7, 2016

The method and code for creating data/chromosome-symbol-mapper.tsv was based on cognoma/cancer-data#12 by @clairemcleod.

@cgreene cgreene left a comment

I reviewed the code. I did not run the code. The code LGTM 👍

('#tax_id', 'tax_id'),
])

gene_df = (pandas.read_table(path, compression='gzip', na_values='-')

Whoa - love this way of reading these files. We should consider this for django-genes (cc @rzelayafavila)

(specifically the pandas approach here - cleaner than CSV parsing by field index)

@dhimmel dhimmel Oct 7, 2016

> cleaner than CSV parsing by field index

Importantly, reference by column name is more resilient than reference by column position.
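
To illustrate the point about column names, here is a minimal sketch, not the code from this pull request. The two-row table is made up, the target names `entrez_gene_id` and `symbol` are assumptions (only the `#tax_id` to `tax_id` rename appears in the reviewed snippet), and `read_csv` with `sep="\t"` stands in for `read_table`.

```python
import io

import pandas

# Hypothetical excerpt in a gene_info-like, tab-separated layout.
text = (
    "#tax_id\tGeneID\tSymbol\tchromosome\n"
    "9606\t1\tA1BG\t19\n"
    "9606\t2\tA2M\t-\n"
)

# Treat '-' as a missing value, then rename columns by name.
# Only '#tax_id' -> 'tax_id' comes from the reviewed code; the rest is assumed.
gene_df = (
    pandas.read_csv(io.StringIO(text), sep="\t", na_values="-")
    .rename(columns={
        "#tax_id": "tax_id",
        "GeneID": "entrez_gene_id",
        "Symbol": "symbol",
    })
)

# Selecting by name keeps working if the upstream file reorders or adds columns.
symbols = gene_df[["entrez_gene_id", "symbol"]]
```

A positional lookup such as "field 2 is the symbol" would silently return the wrong data if a column were inserted upstream, whereas the name-based selection would either still work or fail loudly with a `KeyError`.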

dhimmel commented Oct 7, 2016

Tagging @rzelayafavila, our lab's expert in Entrez Gene, for pull request review.


This repository creates the set of genes to be used in Project Cognoma. The human subset of Entrez Gene is the basis of Cognoma genes. All genes in Cognoma should be converted to Entrez GeneIDs (using a preferred variable name of `entrez_gene_id`).

When encountering genes in Project Cognoma, identify which of the following approaches should be applied:

@dhimmel dhimmel Oct 7, 2016

@rzelayafavila, can you look at these mapping strategies and comment on whether they'll be compatible with django-genes? If we populate django-genes from the files in the download directory but use these strategies when creating our cancer-data, will everything align in great harmony?

Genes on multiple chromosomes now receive one row per chromosome, in addition
to a row that retains the multi-chromosome value.
cognoma#2 (comment)

Genes with a missing value for chromosome are removed.
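
The behavior described in this commit message could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the input frame is hypothetical, and `|` as the separator between chromosomes is an assumption about the Entrez Gene format.

```python
import pandas

# Hypothetical input; '|' as the multi-chromosome separator is an assumption.
gene_df = pandas.DataFrame({
    "entrez_gene_id": [1, 2, 3],
    "chromosome": ["19", "X|Y", None],
})

# Remove genes whose chromosome is missing.
gene_df = gene_df.dropna(subset=["chromosome"])

# Build one row per individual chromosome for multi-chromosome genes.
multi_df = gene_df[gene_df["chromosome"].str.contains("|", regex=False)]
split_df = (
    multi_df
    .assign(chromosome=multi_df["chromosome"].str.split("|"))
    .explode("chromosome")
)

# Keep the original rows (including the combined 'X|Y' value) plus the split rows.
result_df = pandas.concat([gene_df, split_df], ignore_index=True)
```

With this layout, a lookup by chromosome finds the gene whether the query uses `X`, `Y`, or the combined `X|Y` value.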
dhimmel added a commit to dhimmel/cancer-data that referenced this pull request Oct 7, 2016
`0.genes-download.ipynb` is a notebook to download datasets from
`cognoma/genes`. Update `2.TCGA-process.ipynb` to use the gene mapping
guidelines in cognoma/genes#1. Remove `mapping/PANCAN-mutation/` since this
mapping is now done in `2.TCGA-process.ipynb`.

Closes cognoma#23. Closes cognoma#30 by exporting gene info files in `2.TCGA-process.ipynb`
@dhimmel dhimmel merged commit 7212040 into cognoma:master Oct 9, 2016
@dhimmel dhimmel deleted the entrez branch October 9, 2016 17:39

dhimmel commented Oct 9, 2016

@rzelayafavila, I merged this pull request in the interest of time. Feedback on compatibility with django-genes would still be appreciated and can be left here or in a new issue.

dhimmel added a commit to cognoma/cancer-data that referenced this pull request Oct 10, 2016
* Outsource Entrez Gene logic to cognoma/genes

`0.genes-download.ipynb` is a notebook to download datasets from
`cognoma/genes`. Update `2.TCGA-process.ipynb` to use the gene mapping
guidelines in cognoma/genes#1. Remove `mapping/PANCAN-mutation/` since this
mapping is now done in `2.TCGA-process.ipynb`.

Closes #23. Closes #30 by exporting gene info files in `2.TCGA-process.ipynb`

* Average expression values for the same gene

* Update cognoma/genes download location
dhimmel added a commit to dhimmel/genes that referenced this pull request Apr 7, 2018
Download and process Entrez Gene.
Create gene identification guidelines for Project Cognoma.
Closes cognoma#2.