# Programmatic Update of SardiNIA publication list

Strategy:
* Use a text scraper to pull any PMIDs found on the original page
* If a publication has no PMID on the page, take its title, search PubMed for that title, and get the PMID from the result
* Search PubMed by PI name combined with a mention of "sardinia" in the title or abstract
* Finally, add by hand the PMIDs that came from Nagaraj's list

Additions
* Embolden the names of authors in the HTML who are in our group.

In [1]:
from bs4 import BeautifulSoup

In [2]:
html_doc = open( 'publications.html.bak' )

In [3]:
doc = BeautifulSoup(html_doc, 'html.parser')

# Get a list of all our authors

In [4]:
import regex

In [5]:
s = regex.compile( r'[,]')

In [6]:
authors = doc.find_all( 'b')[1:]

In [7]:
author_list = []
for line in authors:
    text = line.text
    if ':' in text:
        text = text[ text.index(':')+1: ]
        
    for temp in text.split(','):
        # Strip surrounding whitespace, then delete newlines (code point 10)
        temp = temp.strip().translate( {10: None} )
        author_list.append( temp )    

In [8]:
len(author_list)

704

In [9]:
author_list = set( author_list)

In [10]:
len(author_list)

152

In [11]:
author_list

{'',
 'A Maschio',
 'A Terracciano',
 'Abecasis G',
 'Abecasis GR',
 'Alessia Loi',
 'Andrea Angius',
 'Andrea Maschio',
 'Angelo Scuteri',
 'AngeloScuteri',
 'Anne U Jackson',
 'Anne U. Jackson',
 'Antonella M',
 'Antonella Mulas',
 'Antonio Cao',
 'Antonio Terracciano',
 'B Deiana',
 'Bragg-Gresham JL',
 'Busonero F',
 'Cao A',
 'Carla Sollaino',
 'Carlo Sidore',
 'Costa PT',
 'Crisponi L',
 'Cristen J Willer',
 'Cristen J. Willer',
 'Cucca F',
 'D Schlessinger',
 'Dan L. Longo',
 'David Schlessinger',
 'DavidSchlessinger',
 'Dena G. Hernandez',
 'Dena Hernandez',
 'Dennis D. Taub',
 'Ding J',
 'Edward G Lakatta',
 'Edward G. Lakatta',
 'Edward Lakatta',
 'Eleonora Porcu',
 'F Busonero',
 'Fabio Busonero',
 'Ferreli L',
 'Ferrucci L',
 'Francesca Virdis',
 'Francesco C',
 'Francesco Cucca',
 'Fuchsberger C',
 'G R Abecasis',
 'G Usala',
 'Garcia ME',
 'Gianluca Usala',
 'Gianmauro Cuccuru',
 'Giuseppe Albai',
 'Goncalo Abecasis',
 'Goncalo R Abecasis',
 'Goncalo R. Abecasis',
 'Harri
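
The set above holds several spellings of the same person (for example 'A Maschio' and 'Andrea Maschio', or 'G R Abecasis', 'Goncalo R. Abecasis' and 'Abecasis GR'). Only the 'Surname Initials' form can match the `AU` field that PubMed returns, so a normalizing pass might help the bolding step. The sketch below is one possible approach: `to_pubmed_style` is a hypothetical helper that is not part of the original notebook, and its heuristic (a short all-caps last token means the name is already in PubMed form) is an assumption that will misread ambiguous entries such as 'Antonella M'.

```python
def to_pubmed_style(name):
    """Heuristically convert 'Given [Middle] Surname' to 'Surname GM'."""
    tokens = name.replace('.', ' ').split()
    if len(tokens) < 2:
        # Empty strings and fused names such as 'AngeloScuteri' are left alone
        return name.strip()
    last = tokens[-1]
    # Assumption: a short all-caps last token is a block of initials,
    # so the name is already in 'Surname Initials' order
    if last.isupper() and len(last) <= 3 and len(tokens[0]) > 1:
        return ' '.join(tokens)
    initials = ''.join(token[0].upper() for token in tokens[:-1])
    return '{} {}'.format(last, initials)


examples = ['Goncalo R. Abecasis', 'G R Abecasis', 'Abecasis GR', 'A Maschio']
normalized = {to_pubmed_style(name) for name in examples}
print(sorted(normalized))
```

Applied to the whole set, this would collapse many of the 152 entries, but the result should be reviewed by eye before it is used for bolding.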

# Get URLs/titles for all publications listed

In [12]:
s = regex.compile( r'\d{8}')

In [13]:
pmids = set()
titles = set()
for i, list_item in enumerate( doc.find_all( 'li') ):
    
    for link_item in list_item.find_all( 'a' ):
        # Default to '' so an anchor without an href does not pass None to the regex
        url = link_item.get( 'href', '' )
        m = s.search( url )
        if m:
            pmids.add( m.captures()[0] )
        else:
            titles.add( link_item.text.strip() )
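
The extraction step can also be written as a small standalone function with the standard-library `re` module. This is only a sketch: `pmid_from_href` is a hypothetical name, and it assumes a PMID always appears in the link as a run of exactly eight digits. Unlike the plain `\d{8}` pattern, the lookarounds here reject longer digit runs.

```python
import re

# Assumption: a PMID shows up in the URL as exactly eight consecutive digits
PMID_RE = re.compile(r'(?<!\d)\d{8}(?!\d)')


def pmid_from_href(href):
    """Return the eight-digit PMID embedded in href, or None if there is none."""
    if not href:
        return None
    match = PMID_RE.search(href)
    return match.group() if match else None


print(pmid_from_href('http://www.ncbi.nlm.nih.gov/pubmed/28177087'))
```

Links for which the function returns `None` would fall back to the title lookup, as in the loop above.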

In [14]:
len(pmids)

138

In [15]:
len(titles)

13

# Manually remove two entries

In [16]:
titles.remove( 'Download PDF File')

In [17]:
len(titles)

12

In [18]:
titles.remove( 'Genotype Imputation')

In [19]:
len(titles)

11

# Get PMIDs for the titles that had no PMID in the original list

In [20]:
from Bio import Entrez
from Bio import Medline

In [21]:
Entrez.email = 'christopher.coletta@nih.gov'

In [22]:
search_kwargs = { 'db':'pubmed', 'retmax':50 }
fetch_kwargs = {'db':'pubmed', 'rettype':'medline', "retmode":'text'}

title_pmids = set()
for i, title in enumerate( titles ):
    search_kwargs[ 'term' ] = title
    search_result = Entrez.read( Entrez.esearch( **search_kwargs ) )
    fetch_kwargs['id'] = search_result['IdList']
    for j, pub in enumerate( Medline.parse( Entrez.efetch( **fetch_kwargs ) ) ):
        #print( '=================\ntitle', i, 'search index', j, '\n', title, '\n', pub['TI'] )
        # Accept a hit only if its first 10 characters match the title, ignoring case
        if pub['TI'].lower()[:10] != title.lower()[:10]:
            continue
        print( '=================\ntitle', i, 'search index', j, '\n', title, '\n', pub['TI'] )
        title_pmids.add( pub['PMID'] )
        break

title 0 search index 0 
 Meta-Analysis of 28,141 Individuals Identifies Common Variants within Five New Loci That Influence Uric Acid Concentrations 
 Meta-analysis of 28,141 individuals identifies common variants within five new loci that influence uric acid concentrations.
title 1 search index 0 
 Discovery and refinement of loci associated with lipid levels 
 Discovery and refinement of loci associated with lipid levels.
title 2 search index 3 
 Common variants associated with plasma triglycerides and risk for coronary artery disease 
 Common variants associated with plasma triglycerides and risk for coronary artery disease.
title 3 search index 0 
 Heritability of Cardiovascular and Personality Traits in 6,148 Sardinians 
 Heritability of cardiovascular and personality traits in 6,148 Sardinians.
title 4 search index 0 
 FTO genotype is associated with phenotypic variability of body mass index 
 FTO genotype is associated with phenotypic variability of body mass index.
title 5 sear

In [25]:
len( title_pmids )

10
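
The search loop accepts a hit when the first ten characters agree, ignoring case, which matched ten of the eleven titles. A stricter alternative is to compare the full titles after normalization. The sketch below assumes that the page titles and the PubMed titles differ only in case, punctuation and spacing; `normalize_title` and `same_title` are hypothetical helpers, not part of the original notebook.

```python
import string

# Translation table that deletes all ASCII punctuation
_PUNCT = str.maketrans('', '', string.punctuation)


def normalize_title(title):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    return ' '.join(title.lower().translate(_PUNCT).split())


def same_title(a, b):
    """True when two titles are equal after normalization."""
    return normalize_title(a) == normalize_title(b)


page_title = 'Heritability of Cardiovascular and Personality Traits in 6,148 Sardinians'
pubmed_title = 'Heritability of cardiovascular and personality traits in 6,148 Sardinians.'
print(same_title(page_title, pubmed_title))
```

A full comparison avoids false matches between titles that merely share an opening phrase, at the cost of missing titles that were reworded between the page and PubMed.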

In [26]:
pmids |= title_pmids 

In [27]:
len(pmids)

146

# NEW items not on original page

## Search for David's new publications

In [28]:
david_pmids = set()
search_kwargs = { 'db':'pubmed', 'retmax':100 }
fetch_kwargs = {'db':'pubmed', 'rettype':'medline', "retmode":'text'}

search_kwargs[ 'term' ] = "Schlessinger D[Author] AND sardinia[Title/Abstract]"
search_result = Entrez.read( Entrez.esearch( **search_kwargs ) )
fetch_kwargs['id'] = search_result['IdList']
davids = Medline.parse( Entrez.efetch( **fetch_kwargs ) )
for i, pub in enumerate( davids ):
    if pub['PMID'] not in pmids:
        print( '=================\n', i, '\n', pub['TI'] )
        david_pmids.add( pub['PMID'] )

 0 
 Overexpression of the Cytokine BAFF and Autoimmunity Risk.
 1 
 Population- and individual-specific regulatory variation in Sardinia.
 2 
 Mitogenome Diversity in Sardinians: A Genetic Window onto an Island's Past.
 3 
 Menopause modulates the association between thyrotropin levels and lipid parameters: The SardiNIA study.
 4 
 Gender specific profiles of white coat and masked hypertension impacts on arterial structure and function in the SardiNIA study.
 5 
 Depressive symptoms, thyroid hormone and autoimmunity in a population-based cohort from Sardinia.
 6 
 No evidence of association between subclinical thyroid disorders and common carotid intima medial thickness or atherosclerotic plaque.
 18 
 A genome-wide association scan on the levels of markers of inflammation in Sardinians reveals associations that underpin its complex regulation.


## Search for Ed's new publications

In [29]:
lakatta_pmids = set()
search_kwargs = { 'db':'pubmed', 'retmax':100 }
fetch_kwargs = {'db':'pubmed', 'rettype':'medline', "retmode":'text'}

search_kwargs[ 'term' ] = "Lakatta EG[Author] AND sardinia[Title/Abstract]"
search_result = Entrez.read( Entrez.esearch( **search_kwargs ) )
fetch_kwargs['id'] = search_result['IdList']
lakattas = Medline.parse( Entrez.efetch( **fetch_kwargs ) )
for i, pub in enumerate( lakattas ):
    if pub['PMID'] not in pmids:
        print( '=================\n', i, '\n', pub['TI'] )
        lakatta_pmids.add( pub['PMID'] )

 0 
 Gender specific profiles of white coat and masked hypertension impacts on arterial structure and function in the SardiNIA study.
 1 
 No evidence of association between subclinical thyroid disorders and common carotid intima medial thickness or atherosclerotic plaque.
 2 
 Longitudinal perspective on the conundrum of central arterial stiffness, blood pressure, and aging.
 3 
 Serum free thyroxine levels are positively associated with arterial stiffness in the SardiNIA study.


## Search for Goncalo's new publications

In [30]:
goncalo_pmids = set()
search_kwargs = { 'db':'pubmed', 'retmax':100 }
fetch_kwargs = {'db':'pubmed', 'rettype':'medline', "retmode":'text'}

search_kwargs[ 'term' ] = "Abecasis GR[Author] AND sardinia[Title/Abstract]"
search_result = Entrez.read( Entrez.esearch( **search_kwargs ) )
fetch_kwargs['id'] = search_result['IdList']
goncalo = Medline.parse( Entrez.efetch( **fetch_kwargs ) )
for i, pub in enumerate( goncalo ):
    if pub['PMID'] not in pmids:
        print( '=================\n', i, '\n', pub['TI'] )
        goncalo_pmids.add( pub['PMID'] )

 0 
 Overexpression of the Cytokine BAFF and Autoimmunity Risk.
 1 
 Population- and individual-specific regulatory variation in Sardinia.
 2 
 Mitogenome Diversity in Sardinians: A Genetic Window onto an Island's Past.
 7 
 Methods for association analysis and meta-analysis of rare variants in families.
 10 
 Genotype calling and haplotyping in parent-offspring trios.
 11 
 A genome-wide association scan on the levels of markers of inflammation in Sardinians reveals associations that underpin its complex regulation.


## Search for Francesco's new publications

We won't include these, since we can't tell which of them are part of the SardiNIA project, but it is good to know how many there are.

In [31]:
francesco_pmids = set()
search_kwargs = { 'db':'pubmed', 'retmax':100 }
fetch_kwargs = {'db':'pubmed', 'rettype':'medline', "retmode":'text'}

search_kwargs[ 'term' ] = "Cucca F[Author] AND sardinia[Title/Abstract]"
search_result = Entrez.read( Entrez.esearch( **search_kwargs ) )
fetch_kwargs['id'] = search_result['IdList']
francesco = Medline.parse( Entrez.efetch( **fetch_kwargs ) )
for i, pub in enumerate( francesco ):
    if pub['PMID'] not in pmids:
        print( '=================\n', i, '\n', pub['TI'] )
        francesco_pmids.add( pub['PMID'] )

 0 
 Overexpression of the Cytokine BAFF and Autoimmunity Risk.
 1 
 Population- and individual-specific regulatory variation in Sardinia.
 2 
 Mitogenome Diversity in Sardinians: A Genetic Window onto an Island's Past.
 3 
 Menopause modulates the association between thyrotropin levels and lipid parameters: The SardiNIA study.
 4 
 Gender specific profiles of white coat and masked hypertension impacts on arterial structure and function in the SardiNIA study.
 5 
 Depressive symptoms, thyroid hormone and autoimmunity in a population-based cohort from Sardinia.
 6 
 No evidence of association between subclinical thyroid disorders and common carotid intima medial thickness or atherosclerotic plaque.
 7 
 The burden of multiple sclerosis variants in continental Italians and Sardinians.
 12 
 Detection of phylogenetically informative polymorphisms in the entire euchromatic portion of human Y chromosome from a Sardinian sample.
 13 
 Methods for association analysis and meta-analysis of rar
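
The four PI searches above share the same shape: build a query term, fetch the records, and keep the PMIDs that are not already known. A minimal sketch of that shared logic follows. `author_term` and `collect_new` are hypothetical helpers, and the records are assumed to be dictionaries with `PMID` and `TI` keys, as `Medline.parse` yields.

```python
def author_term(author, keyword='sardinia'):
    """Build the PubMed query used for each PI search."""
    return '{}[Author] AND {}[Title/Abstract]'.format(author, keyword)


def collect_new(records, known_pmids):
    """Map PMID -> title for every record whose PMID is not already known."""
    new = {}
    for record in records:
        pmid = record['PMID']
        if pmid not in known_pmids:
            new[pmid] = record.get('TI', '')
    return new


# Usage with the notebook's own objects (not run here, needs network access):
#   search_kwargs['term'] = author_term('Cucca F')
#   ids = Entrez.read(Entrez.esearch(**search_kwargs))['IdList']
#   fetch_kwargs['id'] = ids
#   francesco_new = collect_new(Medline.parse(Entrez.efetch(**fetch_kwargs)), pmids)
print(author_term('Schlessinger D'))
```

Returning a dictionary rather than a set keeps the titles available for review without a second fetch.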

## Search by SardiNIA project grant numbers

In [32]:
grant_pmids = set()
search_kwargs = { 'db':'pubmed', 'retmax':100 }
fetch_kwargs = {'db':'pubmed', 'rettype':'medline', "retmode":'text'}

search_kwargs[ 'term' ] = "N01AG12109[Grant Number] OR 263-MA-410953[Grant Number]"
search_result = Entrez.read( Entrez.esearch( **search_kwargs ) )
fetch_kwargs['id'] = search_result['IdList']
grant = Medline.parse( Entrez.efetch( **fetch_kwargs ) )
for i, pub in enumerate( grant ):
    if pub['PMID'] not in pmids:
        print( '=================\n', i, '\n', pub['TI'] )
        grant_pmids.add( pub['PMID'] )

 4 
 A unified method for detecting secondary trait associations with rare variants: application to sequence data.


## Add hand curated list

In [33]:
hand_list = ['28360221', '28172616', '28107422', '27876822', '27799538', '27798627', 
            '27659466', '27571263', '27548312', '27355579', '27225129', '27089181',
            '27251161', '27659466', '28443625' ]

In [34]:
len(hand_list)

15

In [35]:
len( set( hand_list))

14
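
The two counts differ by one, so one PMID was entered twice. A short sketch with `collections.Counter` shows which one; the list is copied from the cell above, and `duplicates` is a name introduced here for illustration.

```python
from collections import Counter

hand_list = ['28360221', '28172616', '28107422', '27876822', '27799538', '27798627',
             '27659466', '27571263', '27548312', '27355579', '27225129', '27089181',
             '27251161', '27659466', '28443625']

# Keep every PMID that occurs more than once
duplicates = [pmid for pmid, count in Counter(hand_list).items() if count > 1]
print(duplicates)
```

The repeat is harmless here because the results are collected into a set, but it does cost one extra PubMed request.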

In [36]:
hand_pmids = set()
search_kwargs = { 'db':'pubmed', 'retmax':100 }
fetch_kwargs = {'db':'pubmed', 'rettype':'medline', "retmode":'text'}

for hand_uid in hand_list:
    search_kwargs[ 'term' ] = hand_uid + "[uid]"
    search_result = Entrez.read( Entrez.esearch( **search_kwargs ) )
    fetch_kwargs['id'] = search_result['IdList']
    handsearch = Medline.parse( Entrez.efetch( **fetch_kwargs ) )
    for i, pub in enumerate( handsearch ):
        if pub['PMID'] not in pmids:
            print( '=================\n', i, '\n', pub['TI'] )
            hand_pmids.add( pub['PMID'] )

 0 
 NFAT5 and SLC4A10 Loci Associate with Plasma Osmolality.
 0 
 fastMitoCalc: an ultra-fast program to estimate mitochondrial DNA copy number from whole-genome sequences.
 0 
 Comparison of HapMap and 1000 Genomes Reference Panels in a Large-Scale Genome-Wide Association Study.
 0 
 A principal component meta-analysis on multiple anthropometric traits identifies novel loci for body shape.
 0 
 Genetic variants linked to education predict longevity.
 0 
 Genome-wide analysis identifies 12 loci influencing human reproductive behavior.
 0 
 52 Genetic Loci Influencing Myocardial Mass.
 0 
 Next-generation genotype imputation service and methods.
 0 
 A reference panel of 64,976 haplotypes for genotype imputation.
 0 
 Correction: The Influence of Age and Sex on Genetic Associations with Adult Body Size and Shape: A Large-Scale Genome-Wide Interaction Study.
 0 
 Genome-wide association study identifies 74 loci associated with educational attainment.
 0 
 Genetic variants associated wit

## Combine

In [37]:
pmids |= goncalo_pmids | david_pmids | lakatta_pmids | grant_pmids | hand_pmids

In [38]:
len( pmids)

173

# Pull all publications from combined list

In [39]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

In [40]:
h = Entrez.efetch(db='pubmed', id=list(pmids), rettype='medline', retmode='text')

In [41]:
records = list(Medline.parse(h))
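
All of the PMIDs are fetched in a single `efetch` call here, which worked for this list. For much longer lists it may be safer to request records in batches. The helper below is a sketch of that idea; `batches` is a hypothetical name and the batch size of 100 is an arbitrary assumption.

```python
def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Usage with the notebook's own objects (not run here, needs network access):
#   records = []
#   for batch in batches(sorted(pmids)):
#       handle = Entrez.efetch(db='pubmed', id=batch, rettype='medline', retmode='text')
#       records.extend(Medline.parse(handle))
print([len(batch) for batch in batches(range(250))])
```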

## Get most common grant numbers

In [42]:
from collections import Counter

In [43]:
all_grant_numbers = []
for i, record in enumerate( records ):
    try:
        all_grant_numbers.extend( record['GR'] )
    except KeyError:
        # This record has no grant (GR) field; print its index
        print( i )

42
50
54
59
143


In [44]:
c = Counter( all_grant_numbers )

## Search via grant number

* N01AG12109-14-0-1 
* N01AG12109-11-0-0 
* N01AG12109-12-0-1 
* N01AG12109-14-0-1 

* European Research Council (ERC) grants

[NIH grant search](https://projectreporter.nih.gov/project_info_description.cfm?aid=8328319&icde=34538005&ddparam=&ddvalue=&ddsub=&cr=1&csb=default&cs=ASC&pball=)

In [45]:
c.most_common(5)

[('N01-AG-1-2109/AG/NIA NIH HHS/United States', 52),
 ('Intramural NIH HHS/United States', 40),
 ('G0401527/Medical Research Council/United Kingdom', 39),
 ('090532/Wellcome Trust/United Kingdom', 38),
 ('MC_U106179471/Medical Research Council/United Kingdom', 37)]
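
The top entry is written 'N01-AG-1-2109', while the grant search earlier used 'N01AG12109'. To count spelling variants of one grant together, the number can be reduced to its alphanumeric characters first. This is a sketch that assumes each `GR` entry starts with the grant number followed by a slash; `grant_id` is a hypothetical helper, and entries with no number (such as 'Intramural NIH HHS/United States') simply collapse to the agency text.

```python
import re
from collections import Counter


def grant_id(gr_entry):
    """Return the first slash-delimited field, alphanumerics only, uppercased."""
    number = gr_entry.split('/')[0]
    return re.sub(r'[^0-9A-Za-z]', '', number).upper()


sample = ['N01-AG-1-2109/AG/NIA NIH HHS/United States',
          'N01AG12109/AG/NIA NIH HHS/United States',
          '090532/Wellcome Trust/United Kingdom']
normalized_counts = Counter(grant_id(entry) for entry in sample)
print(normalized_counts.most_common(2))
```

The second entry in `sample` is made up to illustrate the merge; it does not come from the fetched records.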

## Get all titles

In [46]:
titles = sorted( [ _['TI'][:70] for _ in records] )

In [47]:
titles

['52 Genetic Loci Influencing Myocardial Mass.',
 'A GWAS sequence variant for platelet volume marks an alternative DNM3 ',
 'A genome-wide association scan on the levels of markers of inflammatio',
 'A genome-wide association search for type 2 diabetes genes in African ',
 'A likelihood-based framework for variant calling and de novo mutation ',
 'A meta-analysis of thyroid-related traits reveals novel loci and gende',
 'A principal component meta-analysis on multiple anthropometric traits ',
 'A reference panel of 64,976 haplotypes for genotype imputation.',
 'A unified method for detecting secondary trait associations with rare ',
 'Age- and gender-specific awareness, treatment, and control of cardiova',
 'Amelioration of Sardinian beta0 thalassemia by genetic modifiers.',
 'An alternative to the search for single polymorphisms: toward molecula',
 'Are personality traits associated with white-coat and masked hypertens',
 'Arterial stiffness and influences of the metabolic syndrome: 

# Format each citation and write the HTML

## Write citations in order of date published, and make a per-year pub count

In [48]:
records.sort( key=lambda x: x['DP'], reverse=True)
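
Sorting on the raw `DP` string orders records correctly by year, but within a year the month names compare alphabetically (so 'Jan' sorts above 'Feb' in a descending sort). If true date order matters, a numeric sort key is one option. The sketch below assumes `DP` values look like '2017 May 1', '2016 Dec' or '2015'; `dp_sort_key` is a hypothetical helper, and unrecognized months or seasons fall back to 0.

```python
MONTHS = {name: number for number, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}


def dp_sort_key(dp):
    """Turn a Medline DP string into a (year, month, day) tuple."""
    parts = dp.split()
    year = int(parts[0][:4])
    # 'Jan-Feb' uses its first month; anything unrecognized becomes 0
    month = MONTHS.get(parts[1][:3], 0) if len(parts) > 1 else 0
    day = int(parts[2]) if len(parts) > 2 and parts[2].isdigit() else 0
    return (year, month, day)


dates = ['2017 Jan 5', '2017 Feb 12', '2016 Dec', '2017']
print(sorted(dates, key=dp_sort_key, reverse=True))
```

With the notebook's records this would be used as `records.sort( key=lambda r: dp_sort_key( r['DP'] ), reverse=True )`.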

In [49]:
link_frmt = '<a href="http://www.ncbi.nlm.nih.gov/pubmed/{}" target="_blank">{}</a> '
def create_citation( record ):
    text = '<li>\n'
    mod_au_list = [ '<strong>'+au+'</strong>' if au in author_list else au for au in record['AU']]
    text += ', '.join( mod_au_list ) + '. '
    text += link_frmt.format( record['PMID'], record['TI'] ) 
    text += record['SO']
    text += '\n</li>\n<br>\n'
    return text
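
`create_citation` writes the title, author names and source string into the HTML as they come from PubMed. If any of them contained `<` or `&`, the page markup could break. A hedged variant that escapes those fields with the standard-library `html.escape` is sketched below; `create_citation_escaped` and its `group_authors` parameter are names introduced here, not part of the original notebook.

```python
from html import escape

LINK = '<a href="http://www.ncbi.nlm.nih.gov/pubmed/{}" target="_blank">{}</a> '


def create_citation_escaped(record, group_authors):
    """Format one Medline record as an <li>, bolding authors in group_authors."""
    authors = [
        '<strong>' + escape(au, quote=False) + '</strong>'
        if au in group_authors else escape(au, quote=False)
        for au in record['AU']
    ]
    text = '<li>\n'
    text += ', '.join(authors) + '. '
    text += LINK.format(record['PMID'], escape(record['TI'], quote=False))
    text += escape(record['SO'], quote=False)
    text += '\n</li>\n<br>\n'
    return text


demo = {'AU': ['Sidore C', 'Pala M'], 'PMID': '28177087',
        'TI': 'Variants with p < 0.05 & more.', 'SO': 'Example J. 2017'}
print(create_citation_escaped(demo, {'Sidore C'}))
```

The `demo` record is invented for illustration. Passing `quote=False` leaves apostrophes in titles untouched, which keeps the generated source readable.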

In [50]:
years = sorted(list( set( [ _['DP'][:4] for _ in records ] ) ), reverse=True)

In [51]:
years

['2017',
 '2016',
 '2015',
 '2014',
 '2013',
 '2012',
 '2011',
 '2010',
 '2009',
 '2008',
 '2007',
 '2006']

In [52]:
with open( 'publications_mid.html', 'w') as p:
    p.write( "<p>{} total publications.</p>\n".format( len(records)))
    for year in years:
        year_records = [ _ for _ in records if _['DP'][:4] == year ]
        
        p.write( "<p><h3>{}</h3></p>\n".format( year ) )
        p.write( "<p>{} publications.</p>\n".format( len(year_records) ) )
        p.write( '<ol>\n')
        for i, record in enumerate( year_records ):
            p.write( create_citation( record ) )
        p.write( '</ol>\n')

# Combine preamble, new citation list, and postamble

In [53]:
!head publications_mid.html

<p>173 total publications.</p>
<p><h3>2017</h3></p>
<p>7 publications.</p>
<ol>
<li>
Olivieri A, <strong>Sidore C</strong>, Achilli A, Angius A, Posth C, Furtwangler A, Brandini S, Capodiferro MR, Gandini F, Zoledziewska M, Pitzalis M, Maschio A, <strong>Busonero F</strong>, Lai L, Skeates R, Gradoli MG, Beckett J, <strong>Marongiu M</strong>, Mazzarello V, Marongiu P, Rubino S, Rito T, Macaulay V, Semino O, Pala M, <strong>Abecasis GR</strong>, <strong>Schlessinger D</strong>, Conde-Sousa E, Soares P, Richards MB, <strong>Cucca F</strong>, Torroni A. <a href="http://www.ncbi.nlm.nih.gov/pubmed/28177087" target="_blank">Mitogenome Diversity in Sardinians: A Genetic Window onto an Island's Past.</a> Mol Biol Evol. 2017 May 1;34(5):1230-1239. doi: 10.1093/molbev/msx082.
</li>
<br>
<li>
Pala M, Zappala Z, <strong>Marongiu M</strong>, Li X, Davis JR, Cusano R, Crobu F, Kukurba KR, Gloudemans MJ, Reinier F, Berutti R, Piras MG, <strong>Mulas A</strong>, Zoledziewska M, <strong>Maro

In [54]:
!rm publications.html

In [55]:
!cat publications_pre.html > publications.html

In [56]:
!cat publications_mid.html >> publications.html

In [57]:
!cat publications_post.html >> publications.html
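
The same assembly can be done in Python, which avoids depending on shell commands. This is a minimal sketch: `concatenate` is a hypothetical helper, and it assumes the three HTML fragments are UTF-8 text.

```python
from pathlib import Path


def concatenate(parts, destination):
    """Write the contents of each file in `parts`, in order, to `destination`."""
    # Opening with 'w' truncates any existing file, so no separate delete is needed
    with open(destination, 'w', encoding='utf-8') as out:
        for part in parts:
            out.write(Path(part).read_text(encoding='utf-8'))


# Usage with the notebook's file names:
#   concatenate(['publications_pre.html', 'publications_mid.html',
#                'publications_post.html'], 'publications.html')
```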