Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map mutation gene symbols to Entrez IDs #12

Merged
merged 8 commits into from Aug 10, 2016

Conversation

@clairemcleod
Copy link
Member

commented Jul 27, 2016

This pull requests addresses Issues #4 and #6. It adds to 2.TCGA-process.ipynb and includes a mapping of mutation gene symbol to Entrez ID as part of the processing workflow.

The mapping is conducted in two stages. First, gene symbols are mapped based on the combination of chromosome # and gene symbol of record. This maps ~95% of observed mutations. Next, yet-unmapped gene symbols are mapped based on the combination of chromosome # and alternate gene symbols. Following the second mapping, ~98% of observations are mapped. The remaining ~2% were either ambiguous mappings or un-mappable; this 2% is currently discarded before writing the data out.

Claire McLeod and others added 2 commits Jul 27, 2016
@dhimmel

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

Can you export the notebooks to scripts using:

jupyter nbconvert --to=script --FilesWriter.build_directory=scripts *.ipynb

This will make it easier to review the changes.

@dhimmel

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

Note that the email configuration in your git, doesn't match you GitHub account email. This makes it so your commits aren't attributed to your profile. See more info here.

@clairemcleod clairemcleod force-pushed the clairemcleod:master branch from 4dc8cac to 939ec6e Jul 27, 2016

@clairemcleod

This comment has been minimized.

Copy link
Member Author

commented Jul 27, 2016

@dhimmel Here is the exported script. Good catch with the email - and thanks for point it out. I think it is rectified now?

@dhimmel

This comment has been minimized.

Copy link
Member

commented Jul 28, 2016

Yep, your new commits are associated with your GitHub account.

@@ -4256,7 +4256,6 @@ TCGA-E8-A3X7-01 0 8.05 8.19 9.49 5.06 10.9 8.27 10.8 7.01 9.11 3.23 6.07 0 5.62
TCGA-E8-A413-01 0 8.15 8.44 9.42 6 10.6 9.58 10.3 7.56 8.67 2.42 5.49 0 6.32 13.6
TCGA-E8-A414-01 0 7.97 8.4 9.7 5.71 10.5 9.63 10.8 7.28 8.57 2.99 5.66 0 6.97 12.8
TCGA-E8-A415-01 0 7.94 7.22 9.44 5.22 10.4 9.28 10.6 7.2 8.99 2 5.65 0 6.91 13.2
TCGA-E8-A416-01 0 7.81 6.63 9.39 4.26 10.5 9.77 8.77 7.11 9.2 2.25 6.27 0 7 13.2

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 28, 2016

Member

Interesting -- one sample was removed, meaning they must have had few mutations, all of which didn't map.

path = os.path.join('download', map_name)
urlretrieve(map_url, path)

path = os.path.join('download', 'Homo_sapiens.gene_info.gz')

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 28, 2016

Member

use map_name variable here instead of 'Homo_sapiens.gene_info.gz' (DRY)

map_names = ['tax_id', 'GeneID', 'Symbol', 'LocusTag', 'Synonyms', 'dbXrefs',
'chromosome', 'map_location', 'description', 'type_of_gene',
'Symbol_from_nomenclature_authority', 'Full_name_from_nomenclature_authority',
'Nomenclature_status', 'Other_designations', 'Modification_date']

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 28, 2016

Member

Can you use the column names from the file and then just remove the # from #tax_id? Explicit naming could break if NCBI updates their file.

# In[8]:

#create chr-gene designation and check for duplicates
snp_mutation_df['chr_gene'] = (snp_mutation_df['chr'] + '-' + snp_mutation_df['gene'])

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 28, 2016

Member

I think some gene symbols have - in them. _ is probably safer and | even safer.

map_df['chr_gene'] = 'chr' + map_df['chr'] + '-' + map_df['Symbol']

'{0} of the {1} possible chr-gene combinations in map_df are unique.'.format(
map_df['chr_gene'].nunique(),

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 28, 2016

Member

This counts the number of distinct chr-gene combinations rather than the number of chr-gene combinations that occur only once.

# In[9]:

#remove all duplicated chr-gene combinations to avoid ambiguous mapping
map_df_nodups = map_df.drop_duplicates(subset='chr_gene', keep=False)

This comment has been minimized.

Copy link
@dhimmel

dhimmel Jul 28, 2016

Member

Didn't know about keep=False in drop_duplicates. Soo convenient -- nice find!

Also happy to see that duplicated() also has this option.

@dhimmel

This comment has been minimized.

Copy link
Member

commented Jul 28, 2016

General comments

Great work with this pull request!

I think you should separate the entrez gene processing to it's own notebook. For example, 2-entrez-gene-extract.ipynb. This notebook should export one file for now (we will probably have it export more in the future) named entrez-gene-symbol-map.tsv or similar. It should have three columns: entrez_gene_id, symbol, chromosome. There should only be rows for unambigious mappings. For example, run drop_duplicates with keep=False.

In 3.TCGA-process.ipynb, we could then use the merge command with how='inner (as you're doing, but no need to combine symbol and chromosome to a single column.

I also think we may want to consider the following approach:

  1. Construct entrez_gene_id, symbol, chromosome dataframe from only primary symbols.
  2. Construct entrez_gene_id, symbol, chromosome dataframe from only synonyms and run drop_duplicates with keep=False.
  3. Concatenate the dataframes from 1 and 2 and drop_duplicates with keep='first'.

This approach gives primacy to official symbols (i.e. we don't blacklist official symbols because there's a colliding synonym on the same chromosome), but we still obliterate colliding synonyms. Does that make sense?

@dhimmel

This comment has been minimized.

Copy link
Member

commented Jul 28, 2016

Make sure to subset for tax_id = 9606 (Homo sapiens) from the get go. It's a real gotcha with the Homo_sapiens.gene_info.gz file.

@clairemcleod

This comment has been minimized.

Copy link
Member Author

commented Jul 29, 2016

@dhimmel These are all great points - thanks for the feedback. Would it be best to cancel/close this pull request and resubmit once the changes are made, or to keep the pull request open while I make the changes?

@dhimmel

This comment has been minimized.

Copy link
Member

commented Jul 29, 2016

Would it be best to cancel/close this pull request and resubmit once the changes are made, or to keep the pull request open while I make the changes?

I suggest keeping the pull request open. Any commits you make to your master branch will get added to this pull request.

@clairemcleod

This comment has been minimized.

Copy link
Member Author

commented Aug 9, 2016

@dhimmel Sorry for the delay - I think I've addressed all of these points but let me know if I missed any or new ones have popped up.

edit: also tagging @Inquisitive-Geek

@dhimmel

This comment has been minimized.

Copy link
Member

commented Aug 9, 2016

@clairemcleod awesome.

@gwaygenomics would you like to spend ~15 minutes with the cancer-data group tonight reviewing this pull request?

"metadata": {},
"source": [
"### Convert mutation gene symbol labels to Entrez IDs \n",
"Goal: Relabel the mutation data frame with Entrez IDs instead of gene names, by mapping a combination of chromosome and gene symbol to Entrez ID. The NCBI file downloaded and read in the next cell contains the Entrez ID - gene symbol pairs we will use to do so."

This comment has been minimized.

Copy link
@gwaygenomics

gwaygenomics Aug 9, 2016

Member

Can you clarify what you mean in the last sentence?

"base_url = 'ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/'\n",
"map_name = 'Homo_sapiens.gene_info.gz'\n",
"map_url = base_url + map_name\n",
"path = os.path.join('download', map_name)\n",

This comment has been minimized.

Copy link
@gwaygenomics

gwaygenomics Aug 9, 2016

Member

Can you use os to check if the file exists? if it doesn't, you should download. If not, file should not be downloaded

},
"outputs": [],
"source": [
"import os\n",

This comment has been minimized.

Copy link
@gwaygenomics

gwaygenomics Aug 9, 2016

Member

Consider adding urllib and os to conda environment.yml file

This comment has been minimized.

Copy link
@dhimmel

dhimmel Aug 10, 2016

Member

urllib and os are builtin packages.

},
"outputs": [],
"source": [
"alternates = map_df['Synonyms'].str.split('|').apply(pandas.Series, 1).stack()\n",

This comment has been minimized.

Copy link
@gwaygenomics

This comment has been minimized.

Copy link
@gwaygenomics

gwaygenomics Aug 9, 2016

Member

Maybe add a comment above this line about what this function is doing? We had some difficulty figuring it out

"metadata": {
"collapsed": true
},
"outputs": [],

This comment has been minimized.

Copy link
@gwaygenomics

gwaygenomics Aug 9, 2016

Member

You can delete this empty line here

# In[7]:

# Check that chr/symbol are all unique
assert not (maps_combined.chr + '|' + maps_combined.symbol).duplicated().any()

This comment has been minimized.

Copy link
@dhimmel

dhimmel Aug 10, 2016

Member

I think you can avoid the string concatenation with

assert not maps_combined.duplicated(['chr', 'symbol']).any()

# In[9]:

failed_mappings = (set(mutation_df.chr + '|' + mutation_df.gene) -

This comment has been minimized.

Copy link
@dhimmel

dhimmel Aug 10, 2016

Member

Can avoid the string concatenation using tuples.

set(zip(mutation_df.chr, mutation_df.gene))

@dhimmel

This comment has been minimized.

Copy link
Member

commented Aug 10, 2016

Looks like there were only a few small comments and then this will be ready to merge.

I may be AFK, so @gwaygenomics you can do the merge when ready. I recommend a squash commit here.

@gwaygenomics

This comment has been minimized.

Copy link
Member

commented Aug 10, 2016

@clairemcleod @dhimmel - Looks great to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.