Download and process Entrez Gene #1

dhimmel · 2016-10-07T01:34:57Z

The purpose of this repository is to create a standard gene terminology for all of Cognoma. Repositories will no longer have to download gene data directly from the Entrez Gene FTP site (ftp://ftp.ncbi.nih.gov/gene/). The main benefit is that it will now be possible to use versioned gene data and have a consistent terminology. This pull request also establishes guidelines for mapping genes.

Large downloaded files are tracked directly in git. I considered git-lfs, but since we use a forking model for contributions, it was not an option.

dhimmel · 2016-10-07T01:37:53Z

The method and code for creating data/chromosome-symbol-mapper.tsv was based on cognoma/cancer-data#12 by @clairemcleod.

cgreene

I reviewed the code. I did not run the code. The code LGTM 👍

cgreene · 2016-10-07T12:40:43Z

2.process.py

+        ('#tax_id', 'tax_id'),
+    ])
+
+    gene_df = (pandas.read_table(path, compression='gzip', na_values='-')


Whoa - love this way of reading these files. We should consider this for django-genes (cc @rzelayafavila)

(specifically the pandas approach here - cleaner than CSV parsing by field index)

> cleaner than CSV parsing by field index

Importantly, reference by column name is more resilient than reference by column position.
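To illustrate the point about column names, here is a minimal sketch, not the code from `2.process.py`. It assumes a toy gzip-compressed TSV with invented rows, and every target column name other than `tax_id` and `entrez_gene_id` is a guess made for illustration. It uses `pandas.read_csv` with `sep='\t'`, which behaves like the `read_table` call in the diff.

```python
import collections
import gzip
import os
import tempfile

import pandas

# Write a toy gzip-compressed TSV standing in for an Entrez Gene file.
# The rows are invented; '-' marks a missing value, matching na_values in the diff.
# The columns are deliberately stored in a different order than we select them.
toy_tsv = (
    "Symbol\t#tax_id\tGeneID\tchromosome\n"
    "SYM_A\t9606\t101\t7\n"
    "SYM_B\t9606\t202\t-\n"
)
path = os.path.join(tempfile.mkdtemp(), 'toy_gene_info.gz')
with gzip.open(path, 'wt') as handle:
    handle.write(toy_tsv)

# Original column name -> name used downstream (assumed for this sketch).
renamer = collections.OrderedDict([
    ('#tax_id', 'tax_id'),
    ('GeneID', 'entrez_gene_id'),
    ('Symbol', 'symbol'),
    ('chromosome', 'chromosome'),
])

gene_df = (
    pandas.read_csv(path, sep='\t', compression='gzip', na_values='-')
    [list(renamer)]           # select by name, whatever the column order on disk
    .rename(columns=renamer)  # then apply the downstream names
)
print(gene_df)
```

Because the selection is keyed on names, a file whose columns are reordered or extended still yields the same frame, whereas index-based parsing would silently read the wrong field.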

dhimmel · 2016-10-07T14:16:47Z

Tagging @rzelayafavila, our lab's expert in Entrez Gene, for pull request review.

dhimmel · 2016-10-07T14:18:34Z

README.md

+
+This repository creates the set of genes to be used in Project Cognoma. The human subset of Entrez Gene is the basis of Cognoma genes. All genes in Cognoma should be converted to Entrez GeneIDs (using a preferred variable name of `entrez_gene_id`).
+
+When encountering genes in Project Cognoma, identify which of the following approaches should be applied:


@rzelayafavila, can you look at these mapping strategies and comment on whether they'll be compatible with django-genes? If we populate django-genes from the files in the download directory but use these strategies when creating our cancer-data, will everything align in great harmony?


dhimmel · 2016-10-09T17:44:27Z

@rzelayafavila, I merged this pull request in the interest of time. Feedback on compatibility with django-genes would still be appreciated and can be left here or in a new issue.

* Outsource Entrez Gene logic to cognoma/genes. `0.genes-download.ipynb` is a notebook to download datasets from `cognoma/genes`. Update `2.TCGA-process.ipynb` to use the gene mapping guidelines in cognoma/genes#1. Remove `mapping/PANCAN-mutation/` since this mapping is now done in `2.TCGA-process.ipynb`. Closes #23. Closes #30 by exporting gene info files in `2.TCGA-process.ipynb`.
* Average expression values for the same gene
* Update cognoma/genes download location
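The "average expression values for the same gene" step can be sketched as follows. This is only an illustration under stated assumptions, not the code from `2.TCGA-process.ipynb`: the expression matrix, the symbols, and the symbol-to-GeneID mapping are all invented, and the matrix is assumed to have samples as rows and genes as columns.

```python
import pandas

# Toy expression matrix (invented values): rows are samples, columns are source symbols.
expr_df = pandas.DataFrame(
    {'SYM_A': [1.0, 3.0], 'SYM_A_OLD': [3.0, 5.0], 'SYM_B': [10.0, 20.0]},
    index=['sample_1', 'sample_2'],
)

# Hypothetical mapping in which two source symbols resolve to the same Entrez GeneID.
symbol_to_entrez = {'SYM_A': 101, 'SYM_A_OLD': 101, 'SYM_B': 202}

# Relabel columns with GeneIDs, then average the columns that share a GeneID.
mapped_df = expr_df.rename(columns=symbol_to_entrez)
averaged_df = mapped_df.T.groupby(level=0).mean().T
averaged_df.columns.name = 'entrez_gene_id'
print(averaged_df)
```

Transposing before grouping avoids column-wise `groupby`, which newer pandas releases deprecate.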


dhimmel added 8 commits October 6, 2016 19:42

* Download gene_history and gene_info from Entrez (8e17d25)
* Process history and gene info (7fba58c)
* Improve documentation (b307d2b)
* Number scripts (e2643cf)
* Specify environment (738441a)
* Create README (1b92584)
* Track downloads (6e5945a)
* Track data (0262a99)

dhimmel mentioned this pull request Oct 7, 2016

Extract gene information cognoma/cancer-data#30

Closed

cgreene approved these changes Oct 7, 2016

dhimmel commented Oct 7, 2016

Improve chromosome-symbol mapper (b64fcb4)

Genes with multiple chromosomes now receive one row per chromosome, in addition to a row that retains the multi-chromosome value (see cognoma#2 (comment)). Genes with a missing value for chromosome are removed.
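A minimal sketch of that behavior, assuming multi-chromosome values are pipe-delimited (for example `X|Y`) and using an invented gene table with hypothetical column names. The actual code that builds `data/chromosome-symbol-mapper.tsv` may differ.

```python
import pandas

# Toy gene table (invented rows). One gene spans two chromosomes, one has none.
gene_df = pandas.DataFrame({
    'entrez_gene_id': [101, 202, 303],
    'symbol': ['SYM_A', 'SYM_B', 'SYM_C'],
    'chromosome': ['7', 'X|Y', None],
})

# Genes with a missing chromosome are removed.
kept_df = gene_df.dropna(subset=['chromosome'])

# One row per individual chromosome.
split_df = (
    kept_df
    .assign(chromosome=kept_df['chromosome'].str.split('|'))
    .explode('chromosome')
)

# Retain the multi-chromosome rows as well; single-chromosome genes appear
# identically in both frames, so drop the exact duplicates.
mapper_df = (
    pandas.concat([kept_df, split_df])
    .drop_duplicates()
    .sort_values(['entrez_gene_id', 'chromosome'])
    .reset_index(drop=True)
)
print(mapper_df)
```

With this layout, a lookup by chromosome and symbol succeeds whether the source dataset reports the gene on `X`, on `Y`, or as `X|Y`.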

dhimmel mentioned this pull request Oct 7, 2016

Outsource Entrez Gene logic to cognoma/genes cognoma/cancer-data#32

Merged

dhimmel merged commit 7212040 into cognoma:master Oct 9, 2016

dhimmel deleted the entrez branch October 9, 2016 17:39

dhimmel added a commit to dhimmel/genes that referenced this pull request Apr 7, 2018

Merge pull request cognoma#1 from dhimmel/entrez (ffc3ddb)

Download and process Entrez Gene. Create gene identification guidelines for Project Cognoma. Closes cognoma#2.
