This script requires installation of [`woslite_py_client`](https://github.com/Clarivate-SAR/woslite_py_client), a tool for querying the [Web of Science](https://clarivate.com/webofsciencegroup/solutions/web-of-science/). As noted below, use of this API requires authorization, which may not be free (I obtained it through my institution).

In [1]:
from __future__ import print_function
import time
import woslite_client
import json
import pandas as pd
import numpy as np
from woslite_client.rest import ApiException
from progressbar import progressbar
#from pprint import pprint

Read in the API key from an external file named `wos-api-key.csv` that has a single column and two rows (a header and the key), organized as follows:

| X-APIKey |
|----------|
| [key]    |

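To make the expected file layout concrete, here is a minimal sketch that reads the key from a file of that shape. It uses the standard-library `csv` module rather than pandas so that it runs on its own; the helper name `read_api_key` and the placeholder key are assumptions for illustration, not part of `woslite_client`.

```python
import csv
import io

def read_api_key(handle, column="X-APIKey"):
    """Return the key stored in the first data row of a one-column CSV."""
    reader = csv.DictReader(handle)
    first_row = next(reader, None)
    if first_row is None or not first_row.get(column):
        raise ValueError("expected a '%s' header followed by one row holding the key" % column)
    return first_row[column].strip()

# Stand-in for the contents of wos-api-key.csv; the key value is a placeholder.
sample_file = io.StringIO("X-APIKey\nmy-secret-key\n")
api_key_value = read_api_key(sample_file)
print(api_key_value)  # my-secret-key
```

With a real file, the same helper could be called as `read_api_key(open("wos-api-key.csv", newline=""))`. Failing early on a missing header or an empty key is a design choice here, so that a malformed file is caught before any request is sent.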
In [2]:
# Configure API key authorization: key
configuration = woslite_client.Configuration()

api_key = pd.read_csv("wos-api-key.csv")
configuration.api_key['X-ApiKey'] = api_key['X-APIKey'][0]
# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['X-ApiKey'] = 'Bearer'


In [3]:
# Create an instance of the API class
api_instance = woslite_client.SearchApi(woslite_client.ApiClient(configuration))
database_id = 'WOS' # str | Database to search. Must be a valid database ID, one of the following: BCI/BIOABS/BIOSIS/CCC/DCI/DIIDW/MEDLINE/WOK/WOS/ZOOREC. WOK represents all databases.


In [15]:
years = ['2000','2005','2010','2015'] # years to search
subjects = ['physics','psychology','biology','philosophy','economics'] # subjects to search

count = 100 # Number of records to return, must be 0-100.

# lang = 'lang_example' # str | Language of search. This element can take only one value: en for English. If no language is specified, English is passed by default. (optional)
# edition = 'edition_example' # str | Edition(s) to be searched. If null, user permissions will be substituted. Must include the name of the collection and edition name separated by '+', ex: WOS+SCI. Multiple editions are separated by ','. Editions available for collection(WOS) - AHCI,CCR,IC,ISSHP,ISTP,SCI,SSCI,BHCI,BSCI and ESCI. (optional)
# publish_time_span = 'publish_time_span_example' # str | This element specifies a range of publication dates. If publishTimeSpan is used, the loadTimeSpan parameter must be omitted. If publishTimeSpan and loadTimeSpan are both omitted, then the maximum time span will be inferred from the editions data. Beginning and end dates should be specified in the yyyy-mm-dd format separated by +, ex: 1993-01-01+2009-12-31. (optional)
# load_time_span = 'load_time_span_example' # str | Load time span (otherwise described as symbolic time span) defines a range of load dates. The load date is the date a record was added to the database. If load date is specified, the publishTimeSpan parameter must be omitted. If both publishTimeSpan and loadTimeSpan are omitted, the maximum publication date will be inferred from the editions data. Any of D/W/M/Y prefixed with a number where D-Day, M-Month, W-Week, Y-Year allowed. Acceptable value range for Day(0-6), Week(1-52), Month(1-12) and Year(0-10), ex: 5D,30W,10M,8Y. (optional)
# sort_field = 'sort_field_example' # str | Order by field(s). Field name and order by clause separated by '+', use A for ASC and D for DESC, ex: PY+D. Multiple values are separated by comma. (optional)


records = list() # Store all returned records in a list

for subject in progressbar(subjects):
    print(subject)
    
    for year in progressbar(years):
        print(year)
        
        first_record = 1 # First record to return, must be between 1-100000
        count = 100 # Number of records to return, must be 0-100.
        total_records = np.inf # placeholder until the first response reports the real total (np.Inf was removed in NumPy 2.0)
        
        while (first_record <= total_records and first_record <= 100000):
        
            query = '(WC=(' + subject + ')) AND (PY==(\"' + year + '\") AND DT==(\"ARTICLE\"))'
            try:
                # Submits a user query and returns results
                api_response = api_instance.root_get(database_id, query, count, first_record)
            except ApiException as e:
                print("Exception when calling SearchApi->root_get: %s\n" % e)
                print(total_records)
                print(first_record)
                print(count)
                break # stop paging this subject/year: api_response is unset or stale after a failed call
            
            # Get the total number of records
            total_records = api_response.query_result.records_found
            
            # Iterate over the records actually returned, which may be fewer than count
            for record_data in (api_response.data or []):
                doi = record_data.other.identifier_doi
                published_month = record_data.source.published_biblio_date
                published_year = record_data.source.published_biblio_year
                ut = record_data.ut

                paper_df = pd.DataFrame(
                    {
                        "DOI": doi,
                        "UT": ut,
                        "Published_Month": published_month,
                        "Published_Year": published_year,
                        "Subject": subject
                    }
                )
                records.append(paper_df)
        
            first_record += count
            
            # if we've reached the end of the list, shorten the count
            if (first_record + count - 1 > total_records):
                count = total_records - (first_record - 1)
        
# Glue all the records together into a big data frame
papers = pd.concat(records)

N/A% (0 of 4) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--

physics
2000


 25% (1 of 4) |######                    | Elapsed Time: 0:36:00 ETA:   1:48:01

2005


 50% (2 of 4) |#############             | Elapsed Time: 1:15:03 ETA:   1:18:05

2010


 75% (3 of 4) |###################       | Elapsed Time: 2:14:16 ETA:   0:59:13

2015


100% (4 of 4) |##########################| Elapsed Time: 2:39:33 Time:  2:39:33
N/A% (0 of 4) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--

psychology
2000


 25% (1 of 4) |######                    | Elapsed Time: 0:01:05 ETA:   0:03:17

2005


 50% (2 of 4) |#############             | Elapsed Time: 0:02:12 ETA:   0:02:13

2010


 75% (3 of 4) |###################       | Elapsed Time: 0:04:05 ETA:   0:01:53

2015


100% (4 of 4) |##########################| Elapsed Time: 0:06:59 Time:  0:06:59
N/A% (0 of 4) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--

biology
2000


 25% (1 of 4) |######                    | Elapsed Time: 0:04:17 ETA:   0:12:51

2005


 50% (2 of 4) |#############             | Elapsed Time: 0:09:08 ETA:   0:09:43

2010


 75% (3 of 4) |###################       | Elapsed Time: 0:16:08 ETA:   0:07:00

2015


100% (4 of 4) |##########################| Elapsed Time: 0:24:37 Time:  0:24:37
N/A% (0 of 4) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--

philosophy
2000


 25% (1 of 4) |######                    | Elapsed Time: 0:00:11 ETA:   0:00:34

2005


 50% (2 of 4) |#############             | Elapsed Time: 0:00:22 ETA:   0:00:21

2010


 75% (3 of 4) |###################       | Elapsed Time: 0:00:42 ETA:   0:00:20

2015


100% (4 of 4) |##########################| Elapsed Time: 0:01:21 Time:  0:01:21
N/A% (0 of 4) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--

economics
2000


 25% (1 of 4) |######                    | Elapsed Time: 0:00:27 ETA:   0:01:23

2005


 50% (2 of 4) |#############             | Elapsed Time: 0:00:57 ETA:   0:00:59

2010


 75% (3 of 4) |###################       | Elapsed Time: 0:01:58 ETA:   0:01:01

2015


100% (4 of 4) |##########################| Elapsed Time: 0:03:27 Time:  0:03:27
100% (5 of 5) |##########################| Elapsed Time: 3:16:00 Time:  3:16:00

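The paging logic in the cell above can be easier to follow in isolation. The sketch below reproduces the query string and the sequence of `(first_record, count)` pairs without calling the API. The names `build_query` and `page_windows` are hypothetical helpers, not part of `woslite_client`, and the sketch assumes the total number of records is already known, whereas the real loop only learns it from the first response.

```python
def build_query(subject, year, doc_type="ARTICLE"):
    """Assemble the same query string the loop above builds by concatenation."""
    return '(WC=({0})) AND (PY==("{1}") AND DT==("{2}"))'.format(subject, year, doc_type)

def page_windows(total_records, page_size=100, max_first_record=100000):
    """Yield (first_record, count) pairs that cover records 1..total_records."""
    first_record = 1
    while first_record <= total_records and first_record <= max_first_record:
        # Shorten the last page so it does not run past the end of the results
        count = min(page_size, total_records - first_record + 1)
        yield first_record, count
        first_record += count

print(build_query("physics", "2000"))
# (WC=(physics)) AND (PY==("2000") AND DT==("ARTICLE"))
print(list(page_windows(250)))
# [(1, 100), (101, 100), (201, 50)]
```

Keeping the window arithmetic in a generator makes the edge cases easy to check: zero results yield no pages, and the last page is shortened rather than requested at full size.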

In [32]:
# Add a unique index to each row and write to a csv file

papers.reset_index(drop=True, inplace=True)
papers.to_csv("wos-records.csv")