# The Unconquerables of Open Access

## Import of MEDLINE journals

Project for the EAHIL conference 2023 : https://eahil2023.org/
Authors : **Floriane Muller & Pablo Iriarte**, University of Geneva  
Last update : 12.04.2023

The purpose of this notebook is to import the PubMed journal list "List of Serials Indexed for Online Users", available here:

https://www.nlm.nih.gov/tsd/serials/lsiou.html

"The 2022 edition of the LSIOU contains 15,366 serial titles, including 5,279 titles currently indexed for MEDLINE as well as titles indexed over time which have ceased or changed titles, listed alphabetically by the journal title abbreviation."

List of Serials Indexed for Online Users, 2022

    XML format - 42 MB
    Instructions for FTP of List of Serials Indexed for Online Users file:
    FTP to NLM's anonymous ftp server: ftp://ftp.nlm.nih.gov/online/journals/
    (login as an anonymous user; use your e-mail address as password).
    Choose the "online" directory.
    Choose the "journals" directory.
    The file name is "lsi2022.xml".
    Copy or save the file to your desktop or local drive.
    Open with your XML viewer of choice.
    DTD NLM Serials DTD (Document Type Definition) for use with the XML data
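
The manual FTP steps above could also be scripted. The following is a minimal sketch, not the method used for this notebook: it assumes the server still accepts anonymous FTP and that the file name follows the same `lsi<year>.xml` pattern for other editions. The helper names `lsiou_filename` and `download_lsiou` are hypothetical, and the download itself is left commented out because it needs network access.

```python
from ftplib import FTP
from pathlib import Path

FTP_HOST = 'ftp.nlm.nih.gov'   # host named in the instructions above
FTP_DIR = 'online/journals'    # "online" directory, then "journals"


def lsiou_filename(year):
    """Build the file name, assuming the 'lsi<year>.xml' pattern holds for every edition."""
    return f'lsi{year}.xml'


def download_lsiou(year, dest_dir='data/sources/nlm'):
    """Fetch the LSIOU XML file over anonymous FTP and return the local path."""
    target = Path(dest_dir) / lsiou_filename(year)
    target.parent.mkdir(parents=True, exist_ok=True)
    with FTP(FTP_HOST) as ftp:
        ftp.login()            # no arguments: anonymous login
        ftp.cwd(FTP_DIR)
        with open(target, 'wb') as fh:
            ftp.retrbinary('RETR ' + target.name, fh.write)
    return target


# Not executed here (network access required):
# download_lsiou(2023)
print(lsiou_filename(2022))
```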
    
-----------------

Update 2023 :

"The 2023 edition of the LSIOU contains 15,430 serial titles, including 5,280 titles currently indexed for MEDLINE as well as titles indexed over time which have ceased or changed titles, listed alphabetically by the journal title abbreviation."



## Extract data from XML file

In [1]:
import json
import requests
import codecs
import pandas as pd
import time
import os
import re
import collections
from lxml import etree
from collections import Counter
from random import randint
# import xml.etree.ElementTree as ET



In [2]:
# File to parse
myfilein = 'data/sources/nlm/lsi2023.xml'
myfilein_tree = 'data/sources/nlm/lsi2023_tree.txt'

# extract XML tree
tree = etree.parse(myfilein)
root = tree.getroot()
# for child in root.iter('*'):
#     print(child.tag, child.attrib)

# print the number of journals
# select the Serial nodes
serials = root.xpath('/SerialsSet/Serial')
print('Serials: ' + str(len(serials)))
print('----------------')

Serials: 15430
----------------
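
The cell above loads the whole document into memory before counting the `Serial` nodes. As an alternative, the count can be obtained by streaming through the file. The sketch below swaps in the standard library `xml.etree.ElementTree.iterparse` instead of `lxml`, and runs on a tiny invented sample (the IDs and titles are placeholders, not NLM data).

```python
import io
import xml.etree.ElementTree as ET

# Invented two-record sample with the same root/record layout as the LSIOU file
sample = b"""<SerialsSet>
  <Serial><NlmUniqueID>0000001</NlmUniqueID><Title>Journal A</Title></Serial>
  <Serial><NlmUniqueID>0000002</NlmUniqueID><Title>Journal B</Title></Serial>
</SerialsSet>"""


def count_serials(source):
    """Count <Serial> elements while streaming through the source."""
    count = 0
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'Serial':
            count += 1
            elem.clear()  # release the children of the record just counted
    return count


n_serials = count_serials(io.BytesIO(sample))
print('Serials: ' + str(n_serials))
```

For the real file, `count_serials('data/sources/nlm/lsi2023.xml')` should work the same way, since `iterparse` accepts a file name as well as a file object.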


In [3]:
# print the tree for a random item
myserial = randint(1, 100)
print('Tree structure for serial ' + str(myserial))
print('----------------')
# create text file for tree
file = codecs.open(myfilein_tree, 'w', 'utf-8')
file.write('Tree structure for serial ' + str(myserial) + '\n')
file.write('----------------\n')

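# walk all elements in document order and print only those of the chosen record
myiter = 0
for tag in root.iter():
    # e.g. /SerialsSet/Serial[3]/PublicationInfo/Country
    path = tree.getpath(tag)
    # one 4-space indent per path level
    path = path.replace('/', '    ')
    spaces = Counter(path)
    # last path step without its positional index, e.g. 'Serial[3]' -> 'Serial'
    tag_name = path.split()[-1].split('[')[0]
    # drop one indent level so that the root element starts at column 0
    tag_name = ' ' * (spaces[' '] - 4) + tag_name
    # a new Serial record starts: advance the record counter
    if tag_name == '    Serial':
        myiter = myiter + 1
    if myiter == myserial:
        print(tag_name)
        file.write(tag_name + '\n')
    # stop as soon as the record after the chosen one begins
    if myiter == myserial + 1:
        break
file.close()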

Tree structure for serial 3
----------------
    Serial
        NlmWorkID
        NlmUniqueID
        Title
        MedlineTA
        PublicationInfo
            Country
            Place
            Publisher
            PublicationFirstYear
            PublicationEndYear
            Frequency
        ISSN
        ISSNLinking
        Language
        TitleContinuationYN
        GeneralNote
        TitleRelated
            Title
            RecordID
            RecordID
            RecordID
            ISSN
        MinorTitleChangeYN
        ProcessingCode
        IndexingHistoryList
            IndexingHistory
                DateOfAction
                    Year
                    Month
                    Day
            IndexingHistory
                DateOfAction
                    Year
                    Month
                    Day
                Coverage
        CurrentlyIndexedYN
        IndexOnlineYN
        IndexingSubset
        IndexingSelectedURL
        ReportedMedl
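
The indentation in the listing above is derived from the depth of each element's path. The same kind of outline can be produced by a short recursive walk. This is a sketch only, using the standard library `xml.etree.ElementTree` and an invented record; the function name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented record with a few of the element names listed above
sample = """<Serial>
  <NlmUniqueID>0000001</NlmUniqueID>
  <PublicationInfo><Country>Switzerland</Country><Place>Geneva</Place></PublicationInfo>
  <ISSN IssnType="Print">0000-0000</ISSN>
</Serial>"""


def outline(elem, depth=1, indent='    '):
    """Return the tag names of elem and its descendants, one indented line each."""
    lines = [indent * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1, indent))
    return lines


tree_lines = outline(ET.fromstring(sample))
print('\n'.join(tree_lines))
```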

In [4]:
# Output file for the parsed data
myfileout = 'data/sources/nlm/lsi2023.tsv'

# create the tab-separated output file
file = codecs.open(myfileout, 'w', 'utf-8')

# fields to export
myfields = [
    'NlmUniqueID',
    'Title',
    'MedlineTA',
    'PublicationInfo/Country',
    'PublicationInfo/Place',
    'PublicationInfo/Publisher',
    'PublicationInfo/PublicationFirstYear',
    'PublicationInfo/PublicationEndYear',
    'PublicationInfo/Frequency',
    'ISSN[@IssnType="Electronic"]',
    'ISSN[@IssnType="Print"]',
    'ISSNLinking',
    'Language',
    'TitleContinuationYN',
    'IndexingStartDate',
    'CurrentlyIndexedYN',
    'IndexOnlineYN',
    'IndexingSubset',
    'IndexingSelectedURL',
    'ReportedMedlineYN',
]

# write first line
for field in myfields:
    file.write(field + '\t')
file.write('\n')

def first_text(node, path):
    """Return the text of the first element matching path, or '' if there is none."""
    found = node.xpath(path)
    return (found[0].text or '') if found else ''

for i, serial in enumerate(serials):
    # the entries of myfields are the XPath expressions to evaluate on each record
    for field in myfields:
        # every value is followed by a tab, as in the header line, so each row
        # ends with a trailing tab (read back later as an empty last column)
        file.write(first_text(serial, field) + '\t')
    file.write('\n')
    # progress indicator every 100 records
    if i % 100 == 0:
        print(i)
file.close()

0
100
200
...
15300
15400
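
The export above writes each value by hand, followed by a tab. A possible alternative, shown here only as a sketch, is the standard library `csv` module, which quotes values that themselves contain a tab and does not add a trailing separator after the last column. The records and the reduced field list are invented for illustration.

```python
import csv
import io

fields = ['NlmUniqueID', 'Title', 'ISSNLinking']
rows = [
    # invented records; the first title contains a tab on purpose
    {'NlmUniqueID': '0000001', 'Title': 'Journal\tA', 'ISSNLinking': '0000-0000'},
    {'NlmUniqueID': '0000002', 'Title': 'Journal B'},   # ISSNLinking is missing
]

buffer = io.StringIO()   # stands in for the output file
writer = csv.DictWriter(buffer, fieldnames=fields, delimiter='\t',
                        restval='', lineterminator='\n')
writer.writeheader()
writer.writerows(rows)   # missing keys are filled with restval
tsv_text = buffer.getvalue()
print(tsv_text)

# reading the text back restores the original values, including the embedded tab
read_back = list(csv.DictReader(io.StringIO(tsv_text), delimiter='\t'))
```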


In [5]:
# Output file for the parsed data: BroadJournalHeading
myfileout = 'data/sources/nlm/lsi2023_BroadJournalHeadings.tsv'

# create the tab-separated output file
file = codecs.open(myfileout, 'w', 'utf-8')

# fields to export
myfields = [
    'NlmUniqueID',
    'BroadJournalHeading'
]

# write first line
for field in myfields:
    file.write(field + '\t')
file.write('\n')

for i, serial in enumerate(serials):
    NlmUniqueID = serial.xpath('NlmUniqueID')[0].text
    # a journal can carry several broad headings: write one row per heading
    for heading in serial.xpath('BroadJournalHeadingList/BroadJournalHeading'):
        file.write(NlmUniqueID + '\t')
        file.write(heading.text + '\t')
        file.write('\n')
    # progress indicator every 100 records
    if i % 100 == 0:
        print(i)
file.close()

0
100
200
...
15300
15400
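
The cell above produces a "long" table: one row per (journal, heading) pair, and no row at all for journals without a heading. The sketch below isolates that pattern in a small generator that works for any repeated child element. It uses the standard library `xml.etree.ElementTree` rather than `lxml`, the sample records are invented, and the name `iter_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample: one journal with two headings, one journal with none
sample = """<SerialsSet>
  <Serial>
    <NlmUniqueID>0000001</NlmUniqueID>
    <BroadJournalHeadingList>
      <BroadJournalHeading>Medicine</BroadJournalHeading>
      <BroadJournalHeading>Nursing</BroadJournalHeading>
    </BroadJournalHeadingList>
  </Serial>
  <Serial>
    <NlmUniqueID>0000002</NlmUniqueID>
  </Serial>
</SerialsSet>"""


def iter_pairs(root, path):
    """Yield (NlmUniqueID, text) for every element matching path in each Serial."""
    for serial in root.findall('Serial'):
        uid = serial.findtext('NlmUniqueID')
        for node in serial.findall(path):
            yield uid, node.text


pairs = list(iter_pairs(ET.fromstring(sample),
                        'BroadJournalHeadingList/BroadJournalHeading'))
print(pairs)
```

Calling the same generator with `'MeshHeadingList/MeshHeading/DescriptorName'` would give the pairs for the MeSH export.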


In [6]:
# Output file for the parsed data: MeshHeadingList/MeshHeading/DescriptorName
myfileout = 'data/sources/nlm/lsi2023_MeshHeadings.tsv'

# create the tab-separated output file
file = codecs.open(myfileout, 'w', 'utf-8')

# fields to export
myfields = [
    'NlmUniqueID',
    'MeshHeading'
]

# write first line
for field in myfields:
    file.write(field + '\t')
file.write('\n')

for i, serial in enumerate(serials):
    NlmUniqueID = serial.xpath('NlmUniqueID')[0].text
    # a journal can carry several MeSH headings: write one row per descriptor
    for heading in serial.xpath('MeshHeadingList/MeshHeading/DescriptorName'):
        file.write(NlmUniqueID + '\t')
        file.write(heading.text + '\t')
        file.write('\n')
    # progress indicator every 100 records
    if i % 100 == 0:
        print(i)
file.close()

0
100
200
...
15300
15400
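
Once the headings are stored one per row, they are easy to tally. The following sketch counts how often each heading occurs, using only the standard library. The sample text is invented but mimics the layout written above, including the trailing tab at the end of every line.

```python
import csv
import io
from collections import Counter

# Invented sample in the same layout as lsi2023_MeshHeadings.tsv
sample_tsv = (
    'NlmUniqueID\tMeshHeading\t\n'
    '0000001\tNursing\t\n'
    '0000001\tCritical Care\t\n'
    '0000002\tNursing\t\n'
)

# the trailing tab yields an extra column with an empty name, which is simply ignored
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter='\t')
heading_counts = Counter(row['MeshHeading'] for row in reader)
print(heading_counts.most_common(2))
```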


## Import to dataframe

In [7]:
# Open extracted data
pubmed = pd.read_csv('data/sources/nlm/lsi2023.tsv', delimiter='\t', header=0)
pubmed

Unnamed: 0,NlmUniqueID,Title,MedlineTA,PublicationInfo/Country,PublicationInfo/Place,PublicationInfo/Publisher,PublicationInfo/PublicationFirstYear,PublicationInfo/PublicationEndYear,PublicationInfo/Frequency,"ISSN[@IssnType=""Electronic""]",...,ISSNLinking,Language,TitleContinuationYN,IndexingStartDate,CurrentlyIndexedYN,IndexOnlineYN,IndexingSubset,IndexingSelectedURL,ReportedMedlineYN,Unnamed: 20
0,9875136,1199 news. National Union of Hospital and Heal...,1199 News,United States,New York,National Union of Hospital and Health Care Emp...,19uu,,"8 issues a year,",,...,0012-6535,eng,N,,N,N,H,,Y,
1,9015384,20 century British history,20 Century Br Hist,England,"Eynsham, Oxford",Oxford University Press,1990,,"4 no. a year,",1477-4674,...,0955-2359,eng,N,1990.0,Y,N,QIS,,Y,
2,101637720,A & A case reports,A A Case Rep,United States,"[New York, NY]",Wolters Kluwer Health / OvidSP,2013,2017,Biweekly,2325-7237,...,2325-7237,eng,N,,N,Y,IM,https://ovidsp.ovid.com/ovidweb.cgi?T=JS&MODE=...,Y,
3,101714112,A&A practice,A A Pract,United States,"[Philadelphia, PA]","Wolters Kluwer Health, Inc.",2018,,Biweekly,2575-3126,...,2575-3126,eng,Y,2018.0,Y,Y,IM,https://ovidsp.ovid.com/ovidweb.cgi?T=JS&MODE=...,Y,
4,101269322,AACN advanced critical care,AACN Adv Crit Care,United States,"Aliso Viejo, CA",American Association of Critical-Care Nurses (...,2006,,Quarterly,1559-7776,...,1559-7768,eng,Y,2006.0,Y,Y,N,https://aacnjournals.org/aacnacconline,Y,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15425,0056272,Zuchthygiene,Zuchthygiene,Germany,Berlin,Verlag Paul Parey,1966,1989,Six no. a year,,...,0044-5371,ger,N,,N,N,J,,Y,
15426,21830020R,Zürcher medizingeschichtliche Abhandlungen,Zur Medizingesch Abh,Switzerland,Zurich,Juris Verlag,1924,,Irregular,,...,0514-4264,ger,N,,N,N,QIS,,Y,
15427,21830080R,Zvestí c̆erveného kríz̆a,Zvesti Cerv Kriza,Slovakia,[V Bratislava,,1940,19uu,,,...,,slo,N,,N,N,OM,,Y,
15428,0233767,ZWR,ZWR,Germany,Stuttgart,Thieme,1970,,Monthly,1439-9148,...,0044-166X,ger,N,,N,N,D,https://www.thieme-connect.de/products/ejourna...,Y,


In [8]:
pubmed.dtypes

NlmUniqueID                              object
Title                                    object
MedlineTA                                object
PublicationInfo/Country                  object
PublicationInfo/Place                    object
PublicationInfo/Publisher                object
PublicationInfo/PublicationFirstYear     object
PublicationInfo/PublicationEndYear       object
PublicationInfo/Frequency                object
ISSN[@IssnType="Electronic"]             object
ISSN[@IssnType="Print"]                  object
ISSNLinking                              object
Language                                 object
TitleContinuationYN                      object
IndexingStartDate                       float64
CurrentlyIndexedYN                       object
IndexOnlineYN                            object
IndexingSubset                           object
IndexingSelectedURL                      object
ReportedMedlineYN                        object
Unnamed: 20                             float64
dtype: object
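
Two details of the listing above follow from how the file was written and read: `IndexingStartDate` becomes `float64` because the column mixes years with missing values, and `Unnamed: 20` is the empty column created by the trailing tab on every line. The sketch below shows one possible way to handle both on re-import. The sample rows are a reduced, partly invented excerpt, and the choice of the nullable `Int64` type is an assumption about what is wanted downstream.

```python
import io
import pandas as pd

# Reduced sample in the same layout (note the trailing tab on every line)
sample = (
    'NlmUniqueID\tTitle\tIndexingStartDate\t\n'
    '0056272\tZuchthygiene\t\t\n'
    '9015384\t20 century British history\t1990\t\n'
)

# dtype=str reads every column as text, so identifiers keep their leading zeros
df = pd.read_csv(io.StringIO(sample), sep='\t', dtype=str)

# drop the empty column produced by the trailing tab
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]

# convert the year to a nullable integer so that 1990 does not become 1990.0
df['IndexingStartDate'] = pd.to_numeric(df['IndexingStartDate']).astype('Int64')
print(df.dtypes)
```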

In [9]:
pubmed['CurrentlyIndexedYN'].value_counts()

N    10150
Y     5280
Name: CurrentlyIndexedYN, dtype: int64
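
As a quick sanity check, the two counts above can be compared with the figures quoted by NLM for the 2023 edition (15,430 titles, of which 5,280 are currently indexed). The arithmetic, with the numbers copied from the output above:

```python
# counts copied from the value_counts() output above
counts = {'N': 10150, 'Y': 5280}

total = sum(counts.values())
share_indexed = counts['Y'] / total

print(total)                     # 15430, the number of serials in the file
print(f'{share_indexed:.1%}')    # 34.2%
```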

In [10]:
pubmed['PublicationInfo/Country'].value_counts()

United States    5815
England          2205
Germany           993
France            599
Italy             579
                 ... 
Korea (North)       1
Barbados            1
Turkmenistan        1
Moldova             1
Oman                1
Name: PublicationInfo/Country, Length: 130, dtype: int64

In [11]:
pubmed_medline = pubmed.loc[pubmed['CurrentlyIndexedYN'] == 'Y']
pubmed_medline['PublicationInfo/Country'].value_counts()

United States    1972
England          1382
Netherlands       352
Germany           297
Switzerland       157
                 ... 
Puerto Rico         1
Ghana               1
Kuwait              1
Finland             1
Oman                1
Name: PublicationInfo/Country, Length: 76, dtype: int64

In [12]:
pubmed_old = pubmed.loc[pubmed['CurrentlyIndexedYN'] == 'N']
pubmed_old['PublicationInfo/Country'].value_counts()

United States    3843
England           823
Germany           696
France            512
Italy             508
                 ... 
Tunisia             1
Martinique          1
Sri Lanka           1
Barbados            1
Moldova             1
Name: PublicationInfo/Country, Length: 123, dtype: int64
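
Since `pubmed_medline` and `pubmed_old` split the full list in two, their per-country counts should add up to the counts of the full list. A small sketch of that check, limited to the three countries for which all three listings above show a figure:

```python
from collections import Counter

# figures copied from the three value_counts() outputs above
all_titles = Counter({'United States': 5815, 'England': 2205, 'Germany': 993})
indexed = Counter({'United States': 1972, 'England': 1382, 'Germany': 297})
not_indexed = Counter({'United States': 3843, 'England': 823, 'Germany': 696})

# adding two Counters sums the counts key by key
combined = indexed + not_indexed
mismatches = {country: (all_titles[country], combined[country])
              for country in all_titles
              if all_titles[country] != combined[country]}
print(mismatches)   # {} : the split is consistent for these countries
```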

In [13]:
# Export to Excel
pubmed.to_excel('data/sources/nlm/lsi2023.xlsx', index=False)
pubmed_medline.to_excel('data/sources/nlm/lsi2023_medline.xlsx', index=False)
pubmed_old.to_excel('data/sources/nlm/lsi2023_not_medline.xlsx', index=False)