# Dissecting PubMed
## Which content is covered by the Library, and which is Open Access?
#### Floriane Muller & Pablo Iriarte, Geneva University Library, Switzerland


# Merge PubMed with Unpaywall data

Source: http://unpaywall.org/products/snapshot

We downloaded the dump file "oa_status_by_doi.csv.gz" (8 GB gzipped CSV file) on 25.01.2018.

 - lines in the CSV file (named 'all_dois_20180122T165326.csv' after unzipping):
     ```
     wc -l all_dois_20180122T165326.csv
     ```
     result: 93722976 lines
     
     
 - columns in CSV (described here http://unpaywall.org/data-format):
 	- doi
 	- is_oa
 	- data_standard
 	- best_oa_location_url
 	- best_oa_location_url_for_landing_page
 	- best_oa_location_url_for_pdf
 	- best_oa_location_host_type
 	- best_oa_location_version
 	- best_oa_location_license
 	- best_oa_location_evidence
 	- title
 	- journal_issns
 	- journal_name
 	- journal_is_oa
 	- publisher
 	- year
 	- genre
 	- updated
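The line count and the column list above can also be checked from Python without unzipping the dump first. The following is only a minimal sketch: the helper name `inspect_gzipped_csv` and the tiny demo file are hypothetical, and the function counts CSV records, which can differ from `wc -l` when a quoted field (for example a title) contains a line break.

```python
import csv
import gzip
import os
import tempfile


def inspect_gzipped_csv(path):
    """Return (column names, number of data records) for a gzipped CSV file."""
    with gzip.open(path, mode='rt', encoding='utf-8', newline='') as handle:
        reader = csv.reader(handle)
        header = next(reader)             # first record holds the column names
        n_rows = sum(1 for _ in reader)   # stream the rest without loading it
    return header, n_rows


# Tiny stand-in file so the sketch runs anywhere (hypothetical data).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv.gz')
with gzip.open(demo_path, mode='wt', encoding='utf-8', newline='') as handle:
    handle.write('doi,is_oa,journal_is_oa\n10.1000/a,t,f\n10.1000/b,f,f\n')

columns, n_rows = inspect_gzipped_csv(demo_path)
print(columns, n_rows)
```

Pointing the same function at the real dump would stream the whole 8 GB file once, so it is slow but needs very little memory.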
 
For the purposes of our study we need only these columns:

 - **doi**: (string) The DOI of this resource. This is always lowercase.
 - **is_oa**: (bool) True if there is an OA copy of this resource. Convenience attribute; returns true when best_oa_location is not null.
 - **journal_is_oa**: (bool) Is this resource published in a completely OA journal. Useful for most definitions of Gold OA. Currently this is based entirely on inclusion in the DOAJ, but eventually may use additional ways of identifying all-OA journals.
 - **best_oa_location_host_type**: (string)
     - **publisher** means that this location is served by the article’s publisher (in practice, this means it is hosted on the same domain the DOI resolves to)
     - **repository** means this location is served by an Open Access repository.
 - **best_oa_location_version**: (string) The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard to define versions of a given article; see those docs for complete definitions of terms. Here's the basic idea, though, for the three version types we support:
	- **submittedVersion** is not yet peer-reviewed.
	- **acceptedVersion** is peer-reviewed, but lacks publisher-specific formatting.
	- **publishedVersion** is the version of record. 
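In the dump used here the boolean columns arrive as the strings `t` and `f` (as the outputs further down show), so they are read as `object`. A minimal sketch of turning them into real booleans and checking the documented rule that `is_oa` is true exactly when a best OA location exists; the sample rows and the `_bool` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the shape of the five columns kept for the study.
sample = pd.DataFrame({
    'doi': ['10.1000/a', '10.1000/b', '10.1000/c'],
    'is_oa': ['t', 'f', 't'],
    'journal_is_oa': ['f', 'f', 't'],
    'best_oa_location_host_type': ['repository', None, 'publisher'],
})

# Map the 't'/'f' flags to booleans in new columns, leaving the originals intact.
flag_map = {'t': True, 'f': False}
for column in ['is_oa', 'journal_is_oa']:
    sample[column + '_bool'] = sample[column].map(flag_map).astype(bool)

# Consistency check: a row is OA exactly when it has a host type.
has_location = sample['best_oa_location_host_type'].notna()
flags_consistent = (sample['is_oa_bool'] == has_location).all()
print(flags_consistent)
```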

In [1]:
# open the GZ file in chunks and write each chunk to its own gzipped TSV file
import pandas as pd
# raw string, so the backslashes of the Windows path are not read as escape sequences
myfilein = r'F:\data_sources\oadoi\oadoi_20180125\oa_status_by_doi.csv.gz'
chunksize = 10 ** 5
i = 0
# on_bad_lines='skip' drops malformed lines (pandas >= 1.3; older versions used error_bad_lines=False)
for chunk in pd.read_csv(myfilein, chunksize=chunksize, sep=',', header=0, on_bad_lines='skip',
                         dtype={'doi': 'str', 'is_oa': 'object', 'journal_is_oa': 'object',
                                'best_oa_location_host_type': 'object','best_oa_location_version': 'object'},
                         usecols=['doi', 'is_oa', 'journal_is_oa', 'best_oa_location_host_type', 'best_oa_location_version'],
                         encoding='utf-8', compression='gzip'):
    i = i + 1
    print(i)
    # normalise the DOIs so they can be used as merge keys
    chunk['doi'] = chunk['doi'].str.lower()
    chunk['doi'] = chunk['doi'].str.strip()
    chunk.to_csv('data/temp/unpaywall/file1_' + str(i) + '.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')


1
2
3
...
277
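The chunked read above can be tried on a few lines of in-memory data. This is a minimal sketch with invented DOIs and a chunk size of 2; it only illustrates how `chunksize` turns `pd.read_csv` into an iterator of DataFrames and how the DOI column is normalised in each chunk.

```python
import io

import pandas as pd

# Invented sample standing in for the Unpaywall dump (note the stray spaces and capitals).
raw = io.StringIO(
    'doi,is_oa,title\n'
    ' 10.1000/ABC ,t,First\n'
    '10.1000/def,f,Second\n'
    '10.1000/GHI,t,Third\n'
)

cleaned_chunks = []
for chunk in pd.read_csv(raw, chunksize=2, usecols=['doi', 'is_oa'],
                         dtype={'doi': 'str', 'is_oa': 'object'}):
    # same normalisation as in the notebook: trim whitespace and lowercase
    chunk['doi'] = chunk['doi'].str.strip().str.lower()
    cleaned_chunks.append(chunk)

cleaned = pd.concat(cleaned_chunks, ignore_index=True)
print(len(cleaned_chunks), cleaned['doi'].tolist())
```

In the notebook each chunk is written to disk instead of being kept in a list, which is what keeps memory use flat for the full dump.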


In [None]:
# beforehand, move the unpaywall chunks into 3 folders (files 1-300, 301-600 and 601-999)
# merge with unpaywall chunks (folder 1)
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
myfilesin_unpaywall = 'data/temp/unpaywall/1/*.gz'
unpaywall = dd.read_csv(myfilesin_unpaywall, sep='\t', header=0, 
                        dtype={'doi': 'str', 'is_oa': 'object', 'journal_is_oa': 'object', 'best_oa_location_host_type': 'object','best_oa_location_version': 'object'},
                        encoding='utf-8', compression='gzip', blocksize=None)
# Convert Dask dataframe to pandas
with ProgressBar():
    unpaywall_pandas = unpaywall.compute()
# add column to track merged rows
unpaywall_pandas['oadoi'] = 1
# import PubMed DOIs (All, PubMed, EUPMC and APD) in chunks
filename ='data/results/pmid_dois_pubmed_eupmc_apd_merged.csv.gz'
chunksize = 10 ** 6
i = 0
for df_pubmed in pd.read_csv(filename, chunksize=chunksize, sep='\t', header=0, encoding='utf-8', compression='gzip'):
    i = i + 1
    print (i)
    # merge the chunk with all the unpaywall data of this folder
    df_pubmed = df_pubmed.merge(unpaywall_pandas, on='doi', how='left')
    df_pubmed_merged = df_pubmed.loc[df_pubmed['oadoi'].notnull()]
    df_pubmed_merged.to_csv('data/results/pubmed_unpaywall/chunks/pubmed_unpaywall_merged_1_' + str(i) + '.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')
    # clear the df* variables to avoid running out of memory
    %reset_selective -f df.*
%reset_selective -f unpaywall

[########################################] | 100% Completed | 29.6s
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
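The merge cell above relies on a small trick: a constant column (`oadoi = 1`) is added to the Unpaywall side, so that after a left merge the rows without a match can be recognised by a null in that column. A minimal sketch on invented data, assuming the same column names as the notebook:

```python
import pandas as pd

# Invented PubMed chunk and Unpaywall part.
pubmed_chunk = pd.DataFrame({'pmid': [1, 2, 3],
                             'doi': ['10.1000/a', '10.1000/b', '10.1000/x']})
unpaywall_part = pd.DataFrame({'doi': ['10.1000/a', '10.1000/b', '10.1000/c'],
                               'is_oa': ['t', 'f', 't']})
unpaywall_part['oadoi'] = 1  # marker column: present only on rows coming from Unpaywall

# Left merge keeps every PubMed row; unmatched rows get NaN in 'oadoi'.
merged = pubmed_chunk.merge(unpaywall_part, on='doi', how='left')
matched = merged.loc[merged['oadoi'].notnull()]

# An inner merge selects the same rows in one step.
matched_inner = pubmed_chunk.merge(unpaywall_part, on='doi', how='inner')
print(matched['pmid'].tolist(), matched_inner['pmid'].tolist())
```

The left merge plus marker is a little more verbose than the inner merge, but it keeps the unmatched rows available in `merged` if they are needed before filtering.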


In [1]:
# restart kernel
# merge with unpaywall chunks (folder 2)
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
myfilesin_unpaywall = 'data/temp/unpaywall/2/*.gz'
unpaywall = dd.read_csv(myfilesin_unpaywall, sep='\t', header=0, 
                        dtype={'doi': 'str', 'is_oa': 'object', 'journal_is_oa': 'object', 'best_oa_location_host_type': 'object','best_oa_location_version': 'object'},
                        encoding='utf-8', compression='gzip', blocksize=None)
# Convert Dask dataframe to pandas
with ProgressBar():
    unpaywall_pandas = unpaywall.compute()
# add column to track merged rows
unpaywall_pandas['oadoi'] = 1
# import PubMed DOIs (All, PubMed, EUPMC and APD) in chunks
filename ='data/results/pmid_dois_pubmed_eupmc_apd_merged.csv.gz'
chunksize = 10 ** 6
i = 0
for df_pubmed in pd.read_csv(filename, chunksize=chunksize, sep='\t', header=0, encoding='utf-8', compression='gzip'):
    i = i + 1
    print (i)
    # merge the chunk with all the unpaywall data of this folder
    df_pubmed = df_pubmed.merge(unpaywall_pandas, on='doi', how='left')
    df_pubmed_merged = df_pubmed.loc[df_pubmed['oadoi'].notnull()]
    df_pubmed_merged.to_csv('data/results/pubmed_unpaywall/chunks/pubmed_unpaywall_merged_2_' + str(i) + '.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')
    # clear the df* variables to avoid running out of memory
    %reset_selective -f df.*
%reset_selective -f unpaywall

[########################################] | 100% Completed | 30.8s
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20


In [1]:
# restart kernel
# merge with unpaywall chunks (folder 3)
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
myfilesin_unpaywall = 'data/temp/unpaywall/3/*.gz'
unpaywall = dd.read_csv(myfilesin_unpaywall, sep='\t', header=0, 
                        dtype={'doi': 'str', 'is_oa': 'object', 'journal_is_oa': 'object', 'best_oa_location_host_type': 'object','best_oa_location_version': 'object'},
                        encoding='utf-8', compression='gzip', blocksize=None)
# Convert Dask dataframe to pandas
with ProgressBar():
    unpaywall_pandas = unpaywall.compute()
# add column to track merged rows
unpaywall_pandas['oadoi'] = 1
# import PubMed DOIs (All, PubMed, EUPMC and APD) in chunks
filename ='data/results/pmid_dois_pubmed_eupmc_apd_merged.csv.gz'
chunksize = 10 ** 6
i = 0
for df_pubmed in pd.read_csv(filename, chunksize=chunksize, sep='\t', header=0, encoding='utf-8', compression='gzip'):
    i = i + 1
    print (i)
    # merge the chunk with all the unpaywall data of this folder
    df_pubmed = df_pubmed.merge(unpaywall_pandas, on='doi', how='left')
    df_pubmed_merged = df_pubmed.loc[df_pubmed['oadoi'].notnull()]
    df_pubmed_merged.to_csv('data/results/pubmed_unpaywall/chunks/pubmed_unpaywall_merged_3_' + str(i) + '.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')
    # clear the df* variables to avoid running out of memory
    %reset_selective -f df.*
%reset_selective -f unpaywall

[########################################] | 100% Completed | 33.6s
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
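The three cells above differ only in the folder number. As a possible alternative to three kernel restarts, the per-folder work could be wrapped in a loop. The sketch below is an assumption-laden illustration, not the procedure actually used: `load_unpaywall_folder` is a hypothetical helper, plain pandas with `glob` stands in for dask, the folder layout is recreated in a temporary directory with invented data, and `del` may free memory less reliably than a kernel restart.

```python
import glob
import os
import tempfile

import pandas as pd


def load_unpaywall_folder(folder):
    """Concatenate every gzipped, tab-separated chunk found in one folder."""
    paths = sorted(glob.glob(os.path.join(folder, '*.gz')))
    parts = [pd.read_csv(p, sep='\t', dtype='object', compression='gzip') for p in paths]
    frame = pd.concat(parts, ignore_index=True)
    frame['oadoi'] = 1  # marker column to track merged rows
    return frame


# Hypothetical mini layout: two folders holding one chunk each.
base = tempfile.mkdtemp()
pubmed = pd.DataFrame({'pmid': [1, 2, 3],
                       'doi': ['10.1000/a', '10.1000/b', '10.1000/x']})
toy = {'1': ['10.1000/a'], '2': ['10.1000/b']}
for name, dois in toy.items():
    os.makedirs(os.path.join(base, name))
    pd.DataFrame({'doi': dois, 'is_oa': ['t'] * len(dois)}).to_csv(
        os.path.join(base, name, 'file1_1.csv.gz'),
        sep='\t', index=False, compression='gzip')

results = []
for name in sorted(toy):
    unpaywall_part = load_unpaywall_folder(os.path.join(base, name))
    merged = pubmed.merge(unpaywall_part, on='doi', how='left')
    results.append(merged.loc[merged['oadoi'].notnull()])
    del unpaywall_part  # release the folder's data before loading the next one

combined = pd.concat(results, ignore_index=True)
print(combined)
```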


In [1]:
# final file: combine the 60 parts of PubMed DOIs merged with unpaywall data
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import pandas as pd

myfilesin = 'data/results/pubmed_unpaywall/chunks/pubmed*.gz'
pubmed_dask = dd.read_csv(myfilesin, sep='\t', header=0, dtype={'pmid': 'int', 'doi':'object'},
                          compression='gzip', blocksize=None)
pubmed_dask

Dask DataFrame Structure:
npartitions=60
column,dtype
pmid,int32
doi,object
doi_eupmc,float64
doi_pubmed,float64
doi_apd,float64
is_oa,object
best_oa_location_host_type,object
best_oa_location_version,float64
journal_is_oa,object
oadoi,float64


In [2]:
pubmed_dask.head()

Unnamed: 0,pmid,doi,doi_eupmc,doi_pubmed,doi_apd,is_oa,best_oa_location_host_type,best_oa_location_version,journal_is_oa,oadoi
0,1,10.1016/0006-2944(75)90147-7,1.0,,1.0,f,,,f,1.0
1,3,10.1016/0006-291x(75)90498-2,1.0,,1.0,f,,,f,1.0
2,8,10.1016/0006-2952(75)90029-5,1.0,,1.0,f,,,f,1.0
3,11,10.1016/0006-2952(75)90001-5,1.0,,1.0,f,,,f,1.0
4,13,10.1016/0006-2952(75)90009-x,1.0,,1.0,f,,,f,1.0


In [3]:
pubmed_dask.tail()

Unnamed: 0,pmid,doi,doi_eupmc,doi_pubmed,doi_apd,is_oa,best_oa_location_host_type,best_oa_location_version,journal_is_oa,oadoi
373519,16492823,10.1213/01.ane.0000197695.24281.df,1.0,1.0,1.0,t,publisher,,f,1.0
373520,16492824,10.1213/01.ane.0000197611.89464.98,1.0,1.0,1.0,t,publisher,,f,1.0
373521,16492826,10.1213/01.ane.0000195421.46107.d0,1.0,1.0,1.0,t,publisher,,f,1.0
373522,16492827,10.1213/01.ane.0000196536.60320.f9,1.0,1.0,1.0,t,publisher,,f,1.0
373523,16492830,10.1213/01.ane.0000195341.65260.87,1.0,1.0,1.0,t,publisher,,f,1.0


In [4]:
# load all merged parts into pandas before dropping duplicates by pmid and doi (original file has 19441925 rows)
with ProgressBar():
    pubmed = pubmed_dask.compute()
pubmed

[########################################] | 100% Completed | 28.8s


Unnamed: 0,pmid,doi,doi_eupmc,doi_pubmed,doi_apd,is_oa,best_oa_location_host_type,best_oa_location_version,journal_is_oa,oadoi
0,1,10.1016/0006-2944(75)90147-7,1.0,,1.0,f,,,f,1.0
1,3,10.1016/0006-291x(75)90498-2,1.0,,1.0,f,,,f,1.0
2,8,10.1016/0006-2952(75)90029-5,1.0,,1.0,f,,,f,1.0
3,11,10.1016/0006-2952(75)90001-5,1.0,,1.0,f,,,f,1.0
4,13,10.1016/0006-2952(75)90009-x,1.0,,1.0,f,,,f,1.0
5,14,10.1016/0006-2952(75)90011-8,1.0,,1.0,f,,,f,1.0
6,19,10.1016/0006-2952(75)90412-8,1.0,,1.0,f,,,f,1.0
7,20,10.1016/0006-2952(75)90415-3,1.0,,1.0,f,,,f,1.0
8,37,10.1021/bi00694a002,1.0,,1.0,f,,,f,1.0
9,38,10.1021/bi00694a009,1.0,,1.0,f,,,f,1.0


In [9]:
# drop and check duplicates by pmid and doi
pubmed_dedup = pubmed.drop_duplicates(subset=['pmid', 'doi'])
pubmed_dups = pubmed.loc[pubmed.duplicated(subset=['pmid', 'doi'], keep=False)]

In [11]:
pubmed_dups

Unnamed: 0,pmid,doi,doi_eupmc,doi_pubmed,doi_apd,is_oa,best_oa_location_host_type,best_oa_location_version,journal_is_oa,oadoi


In [12]:
pubmed_dedup.shape

(19146997, 10)
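The two calls used for the duplicate check behave differently, which is easy to miss: `drop_duplicates` keeps the first copy of each (pmid, doi) pair, while `duplicated(..., keep=False)` flags every copy. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows: the pair (10, '10.1000/a') appears twice.
rows = pd.DataFrame({
    'pmid': [10, 10, 10, 11],
    'doi': ['10.1000/a', '10.1000/a', '10.1000/b', '10.1000/c'],
})

# Keeps one row per (pmid, doi) pair.
deduplicated = rows.drop_duplicates(subset=['pmid', 'doi'])

# keep=False marks all copies of a repeated pair, which is handy for inspection.
all_copies = rows.loc[rows.duplicated(subset=['pmid', 'doi'], keep=False)]

print(len(deduplicated), len(all_copies))
```

An empty `all_copies` frame, as in the notebook, means that no (pmid, doi) pair occurs more than once.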

In [10]:
# select OA rows
pubmed_oa = pubmed.loc[pubmed['is_oa'] == 't']

In [11]:
pubmed_oa

Unnamed: 0,pmid,doi,doi_eupmc,doi_pubmed,doi_apd,is_oa,best_oa_location_host_type,best_oa_location_version,journal_is_oa,oadoi
29,128,10.1016/0006-8993(75)90486-2,1.0,,1.0,t,repository,,f,1.0
50,238,10.1111/j.1432-1033.1975.tb02183.x,1.0,,1.0,t,publisher,,f,1.0
51,239,10.1111/j.1432-1033.1975.tb02184.x,1.0,,1.0,t,publisher,,f,1.0
58,261,10.1042/bst0030630,1.0,,1.0,t,publisher,,f,1.0
59,264,10.1042/bst0030656,1.0,,1.0,t,publisher,,f,1.0
61,274,10.1136/gut.16.9.719,1.0,,1.0,t,publisher,,f,1.0
69,406,10.1083/jcb.67.2.281,1.0,,1.0,t,publisher,,f,1.0
70,407,10.1083/jcb.67.3.566,1.0,,1.0,t,publisher,,f,1.0
78,440,10.1177/23.12.440,1.0,1.0,1.0,t,publisher,,f,1.0
80,463,10.1099/00221287-90-2-260,1.0,1.0,1.0,t,publisher,,f,1.0


In [12]:
# export OA rows to CSV
pubmed_oa.sort_values(by='pmid').to_csv('data/results/pubmed_unpaywall/pmid_dois_oa_in_unpaywall.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')


In [13]:
# OA counts
pubmed_oa.count()

pmid                          6188265
doi                           6188265
doi_eupmc                     5927954
doi_pubmed                    4569126
doi_apd                       5381571
is_oa                         6188265
best_oa_location_host_type    6188265
best_oa_location_version            0
journal_is_oa                 6188265
oadoi                         6188265
dtype: int64

In [14]:
# OA type counts
pubmed_oa['best_oa_location_host_type'].value_counts()

publisher     4872391
repository    1315874
Name: best_oa_location_host_type, dtype: int64
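The counts above are easier to read as shares. The short calculation below only reuses numbers printed in this notebook (the OA total, the two host-type counts and the 19146997 matched rows); the variable names are arbitrary.

```python
# Figures copied from the outputs above.
oa_total = 6188265
by_host = {'publisher': 4872391, 'repository': 1315874}
matched_total = 19146997

# The two host types account for every OA row.
assert sum(by_host.values()) == oa_total

# Share of each host type among OA rows, and OA share among matched rows (in percent).
shares = {host: round(100 * n / oa_total, 1) for host, n in by_host.items()}
oa_share = round(100 * oa_total / matched_total, 1)
print(shares, oa_share)
```

This gives roughly 78.7 % publisher-hosted and 21.3 % repository-hosted among the OA rows, and about 32.3 % OA among the rows matched in Unpaywall.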

In [15]:
# OA version counts
pubmed_oa['best_oa_location_version'].value_counts()

Series([], Name: best_oa_location_version, dtype: int64)

In [18]:
# separate OA gold and green: gold = best OA location hosted by the publisher
pubmed_oa_gold = pubmed_oa.loc[pubmed_oa['best_oa_location_host_type'] == 'publisher']
pubmed_oa_gold

Unnamed: 0,pmid,doi,doi_eupmc,doi_pubmed,doi_apd,is_oa,best_oa_location_host_type,best_oa_location_version,journal_is_oa,oadoi
50,238,10.1111/j.1432-1033.1975.tb02183.x,1.0,,1.0,t,publisher,,f,1.0
51,239,10.1111/j.1432-1033.1975.tb02184.x,1.0,,1.0,t,publisher,,f,1.0
58,261,10.1042/bst0030630,1.0,,1.0,t,publisher,,f,1.0
59,264,10.1042/bst0030656,1.0,,1.0,t,publisher,,f,1.0
61,274,10.1136/gut.16.9.719,1.0,,1.0,t,publisher,,f,1.0
69,406,10.1083/jcb.67.2.281,1.0,,1.0,t,publisher,,f,1.0
70,407,10.1083/jcb.67.3.566,1.0,,1.0,t,publisher,,f,1.0
78,440,10.1177/23.12.440,1.0,1.0,1.0,t,publisher,,f,1.0
80,463,10.1099/00221287-90-2-260,1.0,1.0,1.0,t,publisher,,f,1.0
89,491,10.1113/jphysiol.1975.sp011117,1.0,,1.0,t,publisher,,f,1.0


In [19]:
# separate OA gold and green: green = best OA location hosted in a repository
pubmed_oa_green = pubmed_oa.loc[pubmed_oa['best_oa_location_host_type'] == 'repository']
pubmed_oa_green

Unnamed: 0,pmid,doi,doi_eupmc,doi_pubmed,doi_apd,is_oa,best_oa_location_host_type,best_oa_location_version,journal_is_oa,oadoi
29,128,10.1016/0006-8993(75)90486-2,1.0,,1.0,t,repository,,f,1.0
85,480,10.1515/jpme.1975.3.1.53,1.0,,1.0,t,repository,,f,1.0
671,4186,10.1111/j.1476-5381.1976.tb07462.x,1.0,,1.0,t,repository,,f,1.0
993,6490,10.1172/jci108447,1.0,1.0,1.0,t,repository,,f,1.0
998,6591,10.1017/s0022172400055261,1.0,,1.0,t,repository,,f,1.0
1108,7238,10.1042/bj1550001,1.0,,1.0,t,repository,,f,1.0
1135,7341,10.1136/bmj.2.6026.47,1.0,,1.0,t,repository,,f,1.0
1219,8016,10.1136/adc.51.6.403,1.0,,,t,repository,,f,1.0
1395,9072,10.1042/bj1570415,1.0,,1.0,t,repository,,f,1.0
1476,9431,10.1177/00220345760550052901,1.0,1.0,1.0,t,repository,,f,1.0
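Instead of two separate DataFrames, the same split can be expressed as one label column. This is a sketch under the notebook's own convention (publisher-hosted is called gold, repository-hosted is called green), which is broader than a gold definition based on `journal_is_oa`; the `oa_type` column and the `closed` label are hypothetical names, and the rows are a small invented sample.

```python
import pandas as pd

sample = pd.DataFrame({
    'pmid': [128, 238, 1],
    'is_oa': ['t', 't', 'f'],
    'best_oa_location_host_type': ['repository', 'publisher', None],
})

# Map the host type to a label; rows without a host type are not OA.
host_to_label = {'publisher': 'gold', 'repository': 'green'}
sample['oa_type'] = (sample['best_oa_location_host_type']
                     .map(host_to_label)
                     .fillna('closed'))

counts = sample['oa_type'].value_counts()
print(sample[['pmid', 'oa_type']])
```

A single label column makes it simple to group or count by OA type later, while the separate frames used in the notebook are convenient for exporting one file per type.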


In [20]:
# export OA gold and green to separate files
pubmed_oa_gold.sort_values(by='pmid').to_csv('data/results/pubmed_unpaywall/pmid_dois_oa_in_unpaywall_gold.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')
pubmed_oa_green.sort_values(by='pmid').to_csv('data/results/pubmed_unpaywall/pmid_dois_oa_in_unpaywall_green.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')


In [22]:
# export OA gold and green to separate files, keeping only the PMIDs
# assign() returns a new DataFrame, so the flag is not written into a slice of pubmed_oa
pubmed_oa_gold = pubmed_oa_gold.assign(unpaywall_oa_gold=1)
pubmed_oa_green = pubmed_oa_green.assign(unpaywall_oa_green=1)
pubmed_oa_gold.sort_values(by='pmid')[['pmid', 'unpaywall_oa_gold']].to_csv('data/results/pubmed_unpaywall/pmids_unpaywall_oa_gold.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')
pubmed_oa_green.sort_values(by='pmid')[['pmid', 'unpaywall_oa_green']].to_csv('data/results/pubmed_unpaywall/pmids_unpaywall_oa_green.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')


In [23]:
# export non-OA rows (only the PMIDs)
# copy() makes the subset independent of pubmed before the flag column is added
pubmed_non_oa = pubmed.loc[pubmed['is_oa'] == 'f'].copy()
pubmed_non_oa['unpaywall_oa_false'] = 1
pubmed_non_oa.sort_values(by='pmid')[['pmid', 'unpaywall_oa_false']].to_csv('data/results/pubmed_unpaywall/pmids_unpaywall_oa_false.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')
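Adding a flag column to a filtered subset, as the export cells here do, is a common trigger for pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the subset is a view of the original frame. A minimal sketch of two ways to make the intent explicit, on invented data:

```python
import pandas as pd

frame = pd.DataFrame({'pmid': [1, 2, 3], 'is_oa': ['t', 'f', 'f']})

# Option 1: take an explicit copy of the subset, then add the column.
non_oa = frame.loc[frame['is_oa'] == 'f'].copy()
non_oa['unpaywall_oa_false'] = 1

# Option 2: assign() returns a new DataFrame with the extra column.
non_oa_assigned = frame.loc[frame['is_oa'] == 'f'].assign(unpaywall_oa_false=1)

print(non_oa)
```

Both options leave the original frame untouched and produce the same result.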


In [5]:
# Try to identify the rows that were not merged
# export PMID, DOI and oaDOI to CSV
pubmed.sort_values(by='pmid')[['pmid', 'doi', 'oadoi']].to_csv('data/temp/pmid_dois_unpaywall_merged.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')

In [1]:
# Restart kernel and open original merged file
import pandas as pd
myfilein = 'data/results/pmid_dois_pubmed_eupmc_apd_merged.csv.gz'
pubmed_merged_orig = pd.read_csv(myfilein, sep='\t', header=0, encoding='utf-8', compression='gzip')
pubmed_merged_orig

Unnamed: 0,pmid,doi,doi_eupmc,doi_pubmed,doi_apd
0,1,10.1016/0006-2944(75)90147-7,1.0,,1.0
1,2,10.1016/0006-291x(75)90482-9,1.0,,1.0
2,3,10.1016/0006-291x(75)90498-2,1.0,,1.0
3,4,10.1016/0006-291x(75)90506-9,1.0,,1.0
4,5,10.1016/0006-291x(75)90508-2,1.0,,1.0
5,6,10.1016/0006-291x(75)90518-5,1.0,,1.0
6,7,10.1016/0006-2952(75)90020-9,1.0,,1.0
7,8,10.1016/0006-2952(75)90029-5,1.0,,1.0
8,9,10.1016/0006-2952(75)90080-5,1.0,,1.0
9,10,10.1002/mrd.21098,1.0,,


In [2]:
pubmed_merged_orig.shape

(19441925, 5)

In [3]:
# rows without match in unpaywall = 19441925 - 19146997
19441925 - 19146997 

294928

In [4]:
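# open the unpaywall merged file, stack it under the original one and keep only the
# (pmid, doi) pairs that occur once, i.e. the rows without a match in unpaywall
myfilein = 'data/temp/pmid_dois_unpaywall_merged.csv.gz'
pubmed_oadoi = pd.read_csv(myfilein, sep='\t', header=0, encoding='utf-8', compression='gzip')
# pd.concat replaces DataFrame.append (removed in pandas 2.0); sort=True keeps the columns in alphabetical order
pubmed_merged = pd.concat([pubmed_merged_orig, pubmed_oadoi], ignore_index=True, sort=True)
pubmed_not_merged = pubmed_merged.drop_duplicates(subset=['pmid', 'doi'], keep=False)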
pubmed_not_merged

Unnamed: 0,doi,doi_apd,doi_eupmc,doi_pubmed,oadoi,pmid
307,10.1016/s0022-5347(17)67182-9,1.0,1.0,,,518
473,10.1176/ajp.132.12.aj132121332,,,1.0,,916
475,10.1176/ajp.132.12.aj132121333,,,1.0,,917
490,10.1164/arrd.1975.112.6.817,,,1.0,,935
491,10.1164/arrd.1975.112.6.879,,,1.0,,936
527,10.1002/ardp.19753081002,1.0,1.0,,,980
1076,10.1097/00000658-197602000-00008,1.0,1.0,,,2111
1832,10.1203/00006450-198601000-000243,,,1.0,,3661
2002,10.1002/ardp.19753081202,1.0,1.0,,,4047
2004,10.1002/ardp.19753081213,1.0,1.0,,,4049
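The unmatched rows are found here by stacking the original and the matched tables and keeping the (pmid, doi) pairs that occur only once. The sketch below reproduces that idea on invented data and compares it with an anti-join built on `merge(..., indicator=True)`, which is offered only as a possible alternative.

```python
import pandas as pd

# Invented stand-ins for the original PubMed table and the rows matched in Unpaywall.
original = pd.DataFrame({'pmid': [1, 2, 3],
                         'doi': ['10.1000/a', '10.1000/b', '10.1000/x']})
matched = pd.DataFrame({'pmid': [1, 2],
                        'doi': ['10.1000/a', '10.1000/b'],
                        'oadoi': [1.0, 1.0]})

# Stack both tables and keep the pairs that appear exactly once.
stacked = pd.concat([original, matched], ignore_index=True, sort=True)
not_matched_stack = stacked.drop_duplicates(subset=['pmid', 'doi'], keep=False)

# Anti-join: rows of the original table with no partner in the matched table.
flagged = original.merge(matched[['pmid', 'doi']], on=['pmid', 'doi'],
                         how='left', indicator=True)
not_matched_join = flagged.loc[flagged['_merge'] == 'left_only', ['pmid', 'doi']]

print(not_matched_stack['pmid'].tolist(), not_matched_join['pmid'].tolist())
```

The stacking approach assumes that neither table contains repeated (pmid, doi) pairs of its own; a pair repeated inside the original table would be dropped even without a match, whereas the anti-join would keep it.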


In [5]:
# export non matched rows to CSV
pubmed_not_merged.sort_values(by='pmid').to_csv('data/results/pubmed_unpaywall/pmid_dois_not_matched_in_unpaywall.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')

In [25]:
# export only PMIDs of non matched rows
myfilein = 'data/results/pubmed_unpaywall/pmid_dois_not_matched_in_unpaywall.csv.gz'
pubmed_not_merged = pd.read_csv(myfilein, sep='\t', header=0, encoding='utf-8', compression='gzip')
pubmed_not_merged

Unnamed: 0,doi,doi_apd,doi_eupmc,doi_pubmed,oadoi,pmid
0,10.1016/s0022-5347(17)67182-9,1.0,1.0,,,518
1,10.1176/ajp.132.12.aj132121332,,,1.0,,916
2,10.1176/ajp.132.12.aj132121333,,,1.0,,917
3,10.1164/arrd.1975.112.6.817,,,1.0,,935
4,10.1164/arrd.1975.112.6.879,,,1.0,,936
5,10.1002/ardp.19753081002,1.0,1.0,,,980
6,10.1097/00000658-197602000-00008,1.0,1.0,,,2111
7,10.1203/00006450-198601000-000243,,,1.0,,3661
8,10.1002/ardp.19753081202,1.0,1.0,,,4047
9,10.1002/ardp.19753081213,1.0,1.0,,,4049


In [26]:
pubmed_not_merged['unpaywall_not_matched'] = 1
pubmed_not_merged.sort_values(by='pmid')[['pmid', 'unpaywall_not_matched']].to_csv('data/results/pubmed_unpaywall/pmids_unpaywall_not_matched.csv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')


In [1]:
# OA gold and green by source
# OA gold: data/results/pubmed_unpaywall/pmid_dois_oa_in_unpaywall_gold.csv.gz
# OA green: data/results/pubmed_unpaywall/pmid_dois_oa_in_unpaywall_green.csv.gz
#  	pmid 	doi 	doi_eupmc 	doi_pubmed 	doi_apd 	is_oa 	best_oa_location_host_type 	best_oa_location_version 	journal_is_oa 	oadoi
import pandas as pd
pubmed_oagold = pd.read_csv('data/results/pubmed_unpaywall/pmid_dois_oa_in_unpaywall_gold.csv.gz', delimiter='\t', 
                         dtype={'pmid': 'int'},
                         usecols=('pmid', 'doi_eupmc', 'doi_pubmed', 'doi_apd', 'journal_is_oa'), header=0)
pubmed_oagold

Unnamed: 0,pmid,doi_eupmc,doi_pubmed,doi_apd,journal_is_oa
0,28,1.0,,1.0,f
1,124,1.0,,1.0,f
2,125,1.0,,1.0,f
3,126,1.0,,1.0,f
4,132,1.0,,1.0,f
5,170,1.0,,1.0,f
6,171,1.0,,1.0,f
7,235,1.0,,1.0,f
8,236,1.0,,1.0,f
9,237,1.0,,1.0,f


In [None]:
# OA gold by source (flag columns initialised to 0; this cell was not executed)
pubmed_oagold['oagold_by_doi_pubmed'] = 0
pubmed_oagold['oagold_by_pmc_doi'] = 0

In [2]:
# OA green
pubmed_oagreen = pd.read_csv('data/results/pubmed_unpaywall/pmid_dois_oa_in_unpaywall_green.csv.gz', delimiter='\t', 
                         dtype={'pmid': 'int'},
                         usecols=('pmid', 'doi_eupmc', 'doi_pubmed', 'doi_apd', 'journal_is_oa'), header=0)
pubmed_oagreen

Unnamed: 0,pmid,doi_eupmc,doi_pubmed,doi_apd,journal_is_oa
0,10,1.0,,,f
1,10,1.0,,1.0,f
2,128,1.0,,1.0,f
3,206,1.0,,1.0,f
4,480,1.0,,1.0,f
5,1136,1.0,,1.0,f
6,1203,,,1.0,f
7,1425,1.0,,,f
8,1525,1.0,,1.0,f
9,1814,1.0,,1.0,f
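One way to break the OA rows down by source is to count, for each of the three flag columns, how many rows and how many distinct PMIDs carry the flag. This is only a sketch: it assumes that `doi_eupmc`, `doi_pubmed` and `doi_apd` mark which source supplied the DOI (inferred from the column names), and it uses a few invented rows shaped like the table above.

```python
import pandas as pd

oa_rows = pd.DataFrame({
    'pmid': [10, 10, 128, 1203],
    'doi_eupmc': [1.0, 1.0, 1.0, None],
    'doi_pubmed': [None, None, None, None],
    'doi_apd': [None, 1.0, 1.0, 1.0],
})
source_columns = ['doi_eupmc', 'doi_pubmed', 'doi_apd']

# Rows carrying each source flag.
rows_per_source = oa_rows[source_columns].notna().sum()

# Distinct PMIDs per source, since one PMID can appear on several rows.
pmids_per_source = {
    column: oa_rows.loc[oa_rows[column].notna(), 'pmid'].nunique()
    for column in source_columns
}
print(rows_per_source.to_dict(), pmids_per_source)
```

Counting distinct PMIDs matters here because the green table lists PMID 10 twice, so row counts and article counts are not the same thing.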
