# Scraping Metadata and Getting Fulltexts

This tutorial notebook uses the `CorpusGenerator` class to show how to retrieve abstracts and fulltexts from Scopus.

The first part of the notebook pulls article metadata through Scopus' literature search. The search can technically retrieve abstracts from anywhere in the Scopus database, but we have limited it to Elsevier journals because Elsevier is the only publisher whose fulltexts we can access. In particular, this part sets up a way to collect PIIs (Publisher Item Identifiers) automatically.

To manually test queries, go to https://www.scopus.com/search/form.uri?display=advanced
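
When testing by hand, it can help to assemble the query string in Python and paste it into the advanced search form. The sketch below is illustrative only: `build_query` is a hypothetical helper, not part of `CorpusGenerator`, and it assumes the standard Scopus advanced-search field codes `TITLE-ABS-KEY`, `PUBYEAR` and `ISSN`. The queries that `CorpusGenerator` actually sends may be built differently, and the ISSN in the example is a placeholder, not a real journal.

```python
def build_query(term, year, issn=None):
    """Return a Scopus advanced-search string for one keyword and one year.

    Hypothetical helper for manual testing; not part of CorpusGenerator.
    """
    # The term is double-quoted so that multi-word terms are searched as a phrase.
    parts = [f'TITLE-ABS-KEY("{term}")', f'PUBYEAR = {year}']
    if issn is not None:
        # Optionally restrict the search to a single journal by its ISSN.
        parts.append(f'ISSN({issn})')
    return ' AND '.join(parts)


print(build_query('mild steel', 2010))
print(build_query('coke', 1995, issn='0000-0000'))  # placeholder ISSN
```

Running one keyword and one year at a time keeps each result set small enough to check by eye.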

Elsevier maintains a list of all its active journals in a single Excel spreadsheet: https://www.elsevier.com/__data/promis_misc/sd-content/journals/jnlactivesubject.xls

The second part of the notebook uses the metadata generated in the first part to retrieve the fulltexts.

In [2]:
import sys
# Adjust this path to the location of the BETO_NLP modules directory on your machine
sys.path.append('/Users/nisarg/Desktop/summer research/BETO_NLP/modules')
import corpus_generation
from corpus_generation import CorpusGenerator

To get the articles, you first need to obtain an API key from Scopus and add it to your local config file. You can get an API key from https://dev.elsevier.com/documentation/SCOPUSSearchAPI.wadl with a quick registration.

Once you have your API key, register it on your computer by running the following commands:

`import pybliometrics`

`pybliometrics.scopus.utils.create_config()`

This will prompt you to enter the API key you obtained from the Scopus website. Once that is done, you can download articles using the functions below.

**Note**: While downloading articles from Scopus, make sure you are connected to the UW VPN (All Internet Traffic) using the BIG-IP Edge Client. Without it, you may get a Scopus authorization error.

Your `scopus_path` is the `.scopus` directory in your home directory:

`scopus_path = '/Users/nisarg/.scopus/'`

The config path for `pybliometrics` is `/Users/nisarg/.scopus/config.ini` (this will vary with your local path).
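
Rather than typing the path by hand, you can derive it from your home directory. This is a minimal sketch that assumes the `.scopus` layout described above; other `pybliometrics` versions may store the config file elsewhere, so treat the result as a starting point and check that the file exists.

```python
from pathlib import Path

# Assumption: the config lives in ~/.scopus/config.ini, as described above.
scopus_dir = Path.home() / '.scopus'
config_file = scopus_dir / 'config.ini'

print('Scopus directory:', scopus_dir)
print('Config file exists:', config_file.exists())

# CorpusGenerator is given a string path in this notebook, so convert if needed.
scopus_path = str(scopus_dir)
```

If the second line prints `False`, the config step has probably not been run yet, or the file is stored in a different location on your machine.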

## Walking through the algorithm

The `CorpusGenerator` class takes the API key(s) and a cache path (`scopus_path` in the cells below). We will also define the other parameters required by its methods and show how to use the class to generate the corpus, the metadata, and the fulltexts.

`apikey`: one or more API keys generated from Scopus.
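
The cell below passes the keys as a list. If you keep a single key as a plain string, one way to get the same shape is a small wrapper like the one sketched here. `as_key_list` is a hypothetical helper, not part of `CorpusGenerator`, and whether the class also accepts a bare string is not something this notebook shows.

```python
def as_key_list(apikey):
    """Return the API keys as a list, whether one key or several were given.

    Hypothetical helper; not part of CorpusGenerator.
    """
    if isinstance(apikey, str):
        # A single key given as a string becomes a one-element list.
        return [apikey]
    # Any other iterable of keys (list, tuple, ...) is copied into a list.
    return list(apikey)


print(as_key_list('a'))
print(as_key_list(('a', 'b', 'c')))
```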

In [3]:
# Enter your API keys in the list below
apikey = ['a', 'b', 'c']

In [4]:
scopus_path = '/Users/nisarg/.scopus/'

In [5]:
c_gen = CorpusGenerator(apikey, scopus_path)

`term_list` is the list of keywords used to search for articles when generating the corpus.

### Example for getting the corpus and the metadata from the journals

After defining `term_list` and `save_dir` (the directory where the generated corpus is stored), we call the `get_corpus` function. The corpus and the fulltexts are both saved in `save_dir`, as separate `.json` files. To obtain the fulltexts along with the corpus, set `fulltexts=True` as shown in the example below.

You can also generate the fulltexts later with the `get_fulltexts` function.

In [6]:
term_list = ['deposition', 'corrosion', 'inhibit', 'corrosive', 'resistance', 'protect', 'acid', 'base', 'coke', 'coking', 'anti',
             'layer', 'steel', 'mild steel', 'coating', 'degradation', 'oxidation',
             'film', 'photo-corrosion', 'hydrolysis', 'Schiff']

In [6]:
save_dir = '/Users/nisarg/Desktop/summer research/Ci_pii'

In [None]:
# range(1995, 2021) yields the years 1995 through 2020 (the end value is excluded)
c_gen.get_corpus(term_list, range(1995, 2021), save_dir, fulltexts=True)

After the PIIs are obtained, the metadata is written to `save_dir`. It is then used to obtain the fulltexts and to build the dataframe of all abstracts.
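
Before moving on, it can be useful to check what was actually written to `save_dir`. The sketch below makes no assumption about the internal structure of the files beyond their being valid JSON stored directly in the directory. `summarize_json_files` is a hypothetical helper, not part of `CorpusGenerator`.

```python
import json
from pathlib import Path


def summarize_json_files(directory):
    """Map each .json file name in `directory` to its number of top-level entries.

    Hypothetical helper; not part of CorpusGenerator.
    """
    summary = {}
    for path in sorted(Path(directory).glob('*.json')):
        with open(path, encoding='utf-8') as f:
            data = json.load(f)
        # Dicts and lists report their length; any other JSON value counts as one entry.
        summary[path.name] = len(data) if isinstance(data, (dict, list)) else 1
    return summary
```

Calling `summarize_json_files(save_dir)` once `get_corpus` has finished gives a quick view of which files exist and roughly how much each one holds.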

In [None]:
# If the fulltexts were not obtained by get_corpus, they can be fetched separately with the function below
c_gen.get_fulltexts(save_dir)

In [None]:
dataframe_path = '/Users/nisarg/Desktop'

In [None]:
c_gen.make_dataframe(dataframe_path, save_dir)