# Defining WoS download strategy

There are two main approaches to obtain the full records for the publications of interest from the Web of Science online search tool, accessed through the [Portal Periódicos CAPES](https://www-periodicos-capes-gov-br.ez29.periodicos.capes.gov.br/index.php?):

- Exporting the dataset with all custom fields selected (up to 1000 records/download)
- Exporting the dataset with the full record + references option (up to 500 records/download)

We will test whether a sample dataset ([UFRJ publications in Nature](https://www.webofscience.com/wos/woscc/summary/73efcaff-4020-4f4e-aeb7-e2deb927394a-6b87d5d4/relevance/1)) exported with each approach contains the same fields and information. If the two datasets do not differ, we will use the custom fields approach to download all of our data, since it allows more records per download.

In [1]:
import pandas as pd

In [2]:
# Load the export made with all custom fields selected (tab-delimited)
custom_fields = pd.read_csv('data/wos_ufrj_nature_custom_selection_all_fields.txt', delimiter='\t')

# Load the export made with the full record + references option (tab-delimited)
full_record = pd.read_csv('data/wos_ufrj_nature_full_records_and_references.txt', delimiter='\t')

In [3]:
# Check whether the two DataFrames have the same shape, columns, and values
custom_fields.equals(full_record)

True
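
`DataFrame.equals` only returns a single boolean, so if the check had come back `False` it would not say where the exports differ. The sketch below shows one way to summarise such differences. The helper `describe_differences` and the toy frames are hypothetical and stand in for the two WoS exports; they are not part of pandas or of the original analysis.

```python
import pandas as pd


def describe_differences(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Summarise how two exports differ (hypothetical helper, not a pandas API)."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        # Columns present in one export but missing from the other
        "only_in_left": sorted(left_cols - right_cols),
        "only_in_right": sorted(right_cols - left_cols),
        # True only when both row and column counts match
        "same_shape": left.shape == right.shape,
    }


# Toy frames standing in for the two exports (assumed data, for illustration only)
a = pd.DataFrame({"AU": ["Silva, A"], "TI": ["Title"], "CR": ["ref"]})
b = pd.DataFrame({"AU": ["Silva, A"], "TI": ["Title"]})

report = describe_differences(a, b)
print(report)
```

With the real data, the same call would be `describe_differences(custom_fields, full_record)`.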

# Downloading WoS records

We have selected all publications of interest from the Web of Science online search tool. [The query](https://www.webofscience.com/wos/woscc/summary/351b40d9-0687-497b-ba2d-e9fad2d57254-6b6cf9a6/relevance/1) can be described as follows:

```
Universidade Federal do Rio de Janeiro (Affiliation) and
NOT Publication Years: 2023
```

This query returned 77654 results (30/01/2023), which were sorted by date (oldest first).

Since this amount of data far exceeds the limit of records recoverable per download (1000 records/download), we recovered the records by splitting them into downloadable parts of <1k records each.
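
The record ranges for such a split can be worked out ahead of time. The following is a minimal sketch, assuming 1-based, inclusive record positions and parts that use the full 1000-record limit; the function name `batch_ranges` is hypothetical, and a smaller `batch_size` can be passed if each part has to stay under the limit.

```python
def batch_ranges(total: int, batch_size: int = 1000) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, end) ranges that cover `total` records."""
    return [
        (start, min(start + batch_size - 1, total))
        for start in range(1, total + 1, batch_size)
    ]


# 77654 is the result count reported for the query above
ranges = batch_ranges(77654)
print(len(ranges), ranges[0], ranges[-1])
```

Under these assumptions the 77654 records split into 78 parts, the last of which holds the remaining 654 records.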

The records were downloaded between 27-01-2023 and 30-01-2023 in both .csv and .bib file formats.
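
Once the parts are on disk, they can be stacked back into a single table. This is only a sketch, assuming that every .csv part has the same columns; the helper `combine_parts` is hypothetical, and the in-memory buffers below stand in for the downloaded files, whose names are not given here.

```python
import io

import pandas as pd


def combine_parts(parts) -> pd.DataFrame:
    """Read each part (path or file-like object) and stack them into one DataFrame."""
    frames = [pd.read_csv(part) for part in parts]
    combined = pd.concat(frames, ignore_index=True)
    # Drop exact duplicate rows in case two parts overlap at their boundaries
    return combined.drop_duplicates(ignore_index=True)


# Toy parts standing in for the downloaded files (assumed data, for illustration only)
part_1 = io.StringIO("title,year\nPaper A,1990\nPaper B,1991\n")
part_2 = io.StringIO("title,year\nPaper B,1991\nPaper C,1992\n")

combined = combine_parts([part_1, part_2])
print(combined)
```

With real files, `parts` would be the list of paths to the downloaded .csv parts, in download order.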