# Downloading Scopus records (CSV)

First, we selected all publications of interest using the Scopus online search tool, accessed through our institution. The query was:

```
AF-ID ( "Universidade Federal do Rio de Janeiro"   60000036 )  AND  NOT PUBYEAR  >  2022
```

This query returned 94809 results (10/02/2023). Since this amount of data exceeds the limit of records that can be exported per download (2000 records/download), we retrieved the records in downloadable parts, using the 'Subject Area' and 'Year' filters.


To do that, we'll combine Scopus filters to obtain slices of the data that each fall under the 2k records limit.

The records will be split by Subject Area. When an area has more than 2k records, we'll split it by year to obtain downloadable sets of data.
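As a rough sketch of how year slices could be planned for one subject area, the function below greedily groups consecutive years so that each group stays within the download limit. The function name `batch_years` and the per-year counts are hypothetical and only illustrate the idea; this is not necessarily how the slices were chosen in practice.

```python
def batch_years(counts_by_year, limit=2000):
    """Greedily group consecutive years so each batch stays within `limit`."""
    batches, current, total = [], [], 0
    for year in sorted(counts_by_year, reverse=True):
        n = counts_by_year[year]
        if n > limit:
            # A single year is already too large; another filter would be needed
            raise ValueError(f"{year} alone has {n} records")
        if current and total + n > limit:
            batches.append(current)
            current, total = [], 0
        current.append(year)
        total += n
    if current:
        batches.append(current)
    return batches


# Hypothetical per-year counts for a single subject area
example_counts = {2022: 1200, 2021: 900, 2020: 700, 2019: 600, 2018: 500}
year_batches = batch_years(example_counts)
print(year_batches)
```

Each resulting batch corresponds to one download. A year that exceeds the limit on its own raises an error, since it would need an additional filter.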

To better organize this download process, we'll use data from Scopus' 'Analyze results' option to plan and check our data retrieval approach.

In [3]:
import pandas as pd

In [4]:
# Confirming the total number of records
scopus_year = pd.read_csv('data/Scopus-94809-Analyze-Year.csv',
                  sep=',',
                  names=['year','n_records'],
                  header=0,
                  skiprows=7) # The first lines of the file contain metadata

In [5]:
display(scopus_year)
display(f'Total records (analysis by year): {scopus_year.n_records.sum()}')

| | year | n_records |
|---|---|---|
| 0 | 2022 | 5245 |
| 1 | 2021 | 5758 |
| 2 | 2020 | 5723 |
| 3 | 2019 | 5337 |
| 4 | 2018 | 5315 |
| ... | ... | ... |
| 71 | 1937 | 2 |
| 72 | 1935 | 1 |
| 73 | 1932 | 1 |
| 74 | 1928 | 1 |


'Total records (analysis by year): 94809'

In [8]:
# Importing the Scopus analysis data for each subject area
scopus_subjects = pd.read_csv('data/Scopus-94809-Analyze-Subject.csv',
                  sep=',',
                  names=['scopus_subject_area','n_records'],
                  header=0,
                  skiprows=7) # The first lines of the file contain metadata

In [9]:
# Displaying the dataframe
scopus_subjects.index = range(1,len(scopus_subjects)+1) # Start the index at 1
display(scopus_subjects)
display(f'Total records (analysis by subject): {scopus_subjects.n_records.sum()}')

| | scopus_subject_area | n_records |
|---|---|---|
| 1 | Medicine | 22243 |
| 2 | Physics and Astronomy | 13070 |
| 3 | Biochemistry, Genetics and Molecular Biology | 12662 |
| 4 | Engineering | 12439 |
| 5 | Chemistry | 11344 |
| 6 | Agricultural and Biological Sciences | 10973 |
| 7 | Computer Science | 7227 |
| 8 | Materials Science | 7126 |
| 9 | Environmental Science | 6226 |
| 10 | Social Sciences | 6146 |


'Total records (analysis by subject): 157208'

Note that the total number of records when analysing by subject area (`157208`) far surpasses that of the analysis by year (`94809`), which is the actual number of results for our query. 

This happens because a single document can be categorized into several subject areas, and is then counted once per area, whereas it has only one publication year.
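A tiny illustration of this double counting, using made-up documents (the identifiers and area assignments below are hypothetical):

```python
from collections import Counter

# Hypothetical documents, each tagged with one or more subject areas
docs = {
    "doc-1": ["MEDI"],
    "doc-2": ["MEDI", "BIOC"],
    "doc-3": ["PHYS", "BIOC", "MEDI"],
}

# Count documents per subject area, as an analysis by subject would
per_area = Counter(area for areas in docs.values() for area in areas)

total_by_subject = sum(per_area.values())  # each document counted once per area
unique_documents = len(docs)               # each document counted once

print(per_area)
print(total_by_subject, unique_documents)
```

Here three documents produce six per-area counts, which is the same effect that inflates the subject-area total above.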

To avoid downloading such a large number of duplicates, we used the following approach when retrieving the records:

1. Ordered the subject areas by number of records (the `scopus_subjects` dataframe is already ordered following this logic)

2. Downloaded the subject areas in that order. When downloading records from an area, all previously downloaded areas were excluded from the search, which avoids downloading duplicates.


Let's illustrate this process with some example queries.

For the first area (Medicine), the query used was:

```
AF-ID ( "Universidade Federal do Rio de Janeiro"   60000036 )  AND NOT  PUBYEAR  >  2022  AND  ( LIMIT-TO ( SUBJAREA ,  "MEDI" ) )
```

Note that nothing was excluded from the search. Any document with "Medicine" among its subject areas was downloaded. That takes us to the second query, regarding the `Physics and Astronomy` subject area:

```
AF-ID ( "Universidade Federal do Rio de Janeiro"   60000036 )  AND NOT  PUBYEAR  >  2022  AND  ( LIMIT-TO ( SUBJAREA ,  "PHYS" ) )  AND  ( EXCLUDE ( SUBJAREA ,  "MEDI" ) )  
```

In this case, the documents that have been categorized in both "MEDI" and "PHYS" areas have been filtered out, since they have already been downloaded through the previous query. In a similar fashion, the third query was:

```
AF-ID ( "Universidade Federal do Rio de Janeiro"   60000036 )  AND NOT  PUBYEAR  >  2022  AND  ( LIMIT-TO ( SUBJAREA ,  "BIOC" ) )  AND  ( EXCLUDE ( SUBJAREA ,  "MEDI" )  OR  EXCLUDE ( SUBJAREA ,  "PHYS" ) )  
```

...and so on.
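The pattern in the queries above can be sketched in code. This is only an illustration of how the query strings grow: the helper `build_queries` is hypothetical, spacing is simplified to single spaces, and only the three area codes shown above are used.

```python
BASE_QUERY = (
    'AF-ID ( "Universidade Federal do Rio de Janeiro" 60000036 ) '
    "AND NOT PUBYEAR > 2022"
)


def build_queries(area_codes):
    """Build one query per area, excluding every area that came before it."""
    queries = []
    for i, code in enumerate(area_codes):
        query = f'{BASE_QUERY} AND ( LIMIT-TO ( SUBJAREA , "{code}" ) )'
        previous = area_codes[:i]
        if previous:
            # Areas already downloaded are chained with OR, as in the queries above
            excluded = " OR ".join(
                f'EXCLUDE ( SUBJAREA , "{p}" )' for p in previous
            )
            query += f" AND ( {excluded} )"
        queries.append(query)
    return queries


# Area codes in descending order of record count (first three areas only)
queries = build_queries(["MEDI", "PHYS", "BIOC"])
for q in queries:
    print(q)
```

The n-th query carries n-1 exclusions, so the order of the areas determines which download each multi-area document ends up in.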

Note: As mentioned before, the results of a query were split by year whenever it returned more than 2000 records.
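The year-restricted queries are not shown here. As a sketch, and assuming the year filter takes the same `LIMIT-TO` form as the subject area filter (with `PUBYEAR` as the field), a year restriction could be appended like this. The helper `add_year_filter` is hypothetical.

```python
def add_year_filter(query, years):
    """Append a publication-year restriction to an existing query string."""
    # Assumed filter form, mirroring the SUBJAREA filters in the queries above
    clause = " OR ".join(f"LIMIT-TO ( PUBYEAR , {year} )" for year in years)
    return f"{query} AND ( {clause} )"


medicine_query = (
    'AF-ID ( "Universidade Federal do Rio de Janeiro" 60000036 ) '
    'AND NOT PUBYEAR > 2022 AND ( LIMIT-TO ( SUBJAREA , "MEDI" ) )'
)
sliced_query = add_year_filter(medicine_query, [2022, 2021])
print(sliced_query)
```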

Through this approach, all 94809 records were downloaded on 10/02/2023, in both CSV and BibTeX file formats.
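One way to double-check the result is to count unique record identifiers across all downloaded CSV files and compare that count with the expected total. The sketch below assumes each export contains an `EID` column, and uses small in-memory files with made-up values in place of the real downloads.

```python
import csv
import io


def unique_eids(csv_files):
    """Collect the distinct EID values found across several CSV exports."""
    seen = set()
    for handle in csv_files:
        for row in csv.DictReader(handle):
            seen.add(row["EID"])  # assumes the export includes an 'EID' column
    return seen


# Hypothetical exports standing in for two downloaded parts
part_a = io.StringIO("EID,Title\n2-s2.0-1,First\n2-s2.0-2,Second\n")
part_b = io.StringIO("EID,Title\n2-s2.0-2,Second\n2-s2.0-3,Third\n")

eids = unique_eids([part_a, part_b])
print(len(eids))
```

With the real files, the number of unique identifiers should match the 94809 results reported for the query, and any record appearing in more than one file would indicate an overlap between slices.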