# Purpose

This notebook is set up to query the [OSTI.gov](https://www.OSTI.gov) API for project records. The goals for the code located herein are:

1. Determine what fields are available for different records in OSTI
2. Design a DOE Solar Energy Technologies Office (SETO) query that only pulls that technology office's data
    * **Note:** this required `{'sponsor_org': '"EE-4S"'}` (with the inner double quotes) in order to return the same results as the browser-based search query. The syntax of the API and that of the browser-based search do not appear to be fully harmonized right now, although I'm told they will be in the future.
3. Build the query to work using an arbitrarily-large list of formatted project IDs, assuming the Solar Information Management System (SIMS) project code syntax as the input.
    * **Note:** SIMS is an internal DOE system


In [1]:
#Query the API, mimicking the pre-made SS search URL as closely as possible
import requests

URL = "https://www.osti.gov/api/v1/records"

#sort by publication date, with the most current dates first (these can be future values)
#and only return records for things sponsored by the solar office, EE-4S
params = {'sort': 'publication_date desc', 'sponsor_org': '"EE-4S"'}

r = requests.get(URL, params=params)

query_date = r.headers["Date"]
results_count = r.headers['X-Total-Count']

print(f"Query was successful: {r.status_code == requests.codes.ok}")
print(f"Query made on {query_date} returned {results_count} hits")
print(f"URL used was {r.url}")


Query was successful: True
Query made on Wed, 27 Feb 2019 21:50:06 GMT returned 2402 hits
URL used was https://www.osti.gov/api/v1/records?sort=publication_date+desc&sponsor_org=%22EE-4S%22
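
The query above reports 2402 hits, but a single request returns only one page of records (the `df.info()` output further down shows 20 rows). A minimal sketch of how the remaining pages might be requested follows. It assumes the API accepts `rows` and `page` query parameters and that pages are numbered from 1; neither is confirmed in this notebook, so both should be checked against the OSTI API documentation. The helper name `page_params` is hypothetical.

```python
import math


def page_params(base_params, total_count, rows=100):
    """Yield one params dict per page needed to cover total_count records.

    ASSUMPTION: the API understands 'rows' (page size) and 'page'
    (1-based page number). Verify against the OSTI API documentation.
    """
    n_pages = math.ceil(total_count / rows)
    for page in range(1, n_pages + 1):
        # Copy the base params so the caller's dict is never mutated
        yield {**base_params, 'rows': rows, 'page': page}


base = {'sort': 'publication_date desc', 'sponsor_org': '"EE-4S"'}

# X-Total-Count arrives as a string header, so convert it with int() first
pages = list(page_params(base, total_count=int('2402'), rows=100))
print(len(pages))  # 25
```

Each dict in `pages` could then be passed as `params` to `requests.get(URL, params=...)`, and the resulting JSON lists concatenated before building the DataFrame.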


In [2]:
#Import the JSON query response into a DataFrame for cleaning
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict(r.json())
df

Unnamed: 0,article_type,authors,availability,contributing_org,country_publication,description,doe_contract_number,doi,entry_date,format,...,links,osti_id,product_type,publication_date,publisher,report_number,research_orgs,sponsor_orgs,subjects,title
0,,"[Dong, Changgui, Sigrin, Benjamin]",,,United States,"Distributed energy resources, such as rooftop ...",AC36-08GO28308,10.1016/j.enpol.2019.02.017,2019-02-25T05:00:00Z,Medium: X; Size: p. 100-110,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1494980,Journal Article,2019-06-01T04:00:00Z,Elsevier,NREL/JA-6A20-66020,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 29 ENERGY PLANNING, POLICY, ...",Using Willingness to Pay to Forecast the Adopt...
1,,"[Sulas, Dana B., Johnston, Steve (ORCID:000000...",,,United States,We investigate the implications of using parti...,AC36-08GO28308,10.1016/j.solmat.2018.12.022,2019-01-23T05:00:00Z,Medium: X; Size: p. 81-87,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1491141,Journal Article,2019-04-01T04:00:00Z,Elsevier,NREL/JA-5K00-71930,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 36 MATERIALS SCIENCE, silico...",Comparison of Photovoltaic Module Luminescence...
2,,"[Cai, Can, Miller, David C., Tappan, Ian A., D...",,,United States,We developed a framework to predict and model ...,AC36-08GO28308,10.1016/j.solmat.2018.11.024,2019-01-08T05:00:00Z,Medium: X; Size: p. 486-492,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1489188,Journal Article,2019-03-01T05:00:00Z,Elsevier,NREL/JA-5K00-73005,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 36 MATERIALS SCIENCE, accele...",Framework for Predicting the Photodegradation ...
3,,"[Neises, Ty, Turchi, Craig]",,,United States,"This analysis investigates the design, cost, a...",AC36-08GO28308,10.1016/j.solener.2019.01.078,2019-02-25T05:00:00Z,Medium: X; Size: p. 27-36,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1494976,Journal Article,2019-03-01T05:00:00Z,Elsevier,NREL/JA-5500-72674,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 47 OTHER INSTRUMENTATION, co...",Supercritical Carbon Dioxide Power Cycle Desig...
4,,"[Jain, Himanshu [National Renewable Energy Lab...",,,United States,This paper explores the advantages and challen...,AC36-08GO28308,,2019-02-27T05:00:00Z,Medium: ED; Size: 1.1 MB,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1496050,Conference,2019-02-19T05:00:00Z,,NREL/CP-5D00-70197,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[24 POWER TRANSMISSION AND DISTRIBUTION, flexi...",Evaluating the Impact of Price-Responsive Load...
5,,"[Jain, Akshay Kumar [National Renewable Energy...",,,United States,Distributed photovoltaic systems (DPV) can cau...,AC36-08GO28308,,2019-02-26T05:00:00Z,Medium: ED; Size: 1.4 MB,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495718,Conference,2019-02-15T05:00:00Z,,NREL/CP-5D00-72284,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 24 POWER TRANSMISSION AND DI...",Quasi-Static Times Series PV Hosting Capacity ...
6,,"[Woodhouse, Michael A [National Renewable Ener...",,,United States,In this paper we provide an overview of the ac...,AC36-08GO28308,10.2172/1495719,2019-02-27T05:00:00Z,Medium: ED; Size: 4.5 MB,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495719,Technical Report,2019-02-15T05:00:00Z,,NREL/TP-6A20-72134,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 29 ENERGY PLANNING, POLICY, ...",Crystalline Silicon Photovoltaic Module Manufa...
7,,"[Engel-Cox, Jill [National Renewable Energy La...",,,United States,Energy systems across the world are undergoing...,AC36-08GO28308,,2019-02-25T05:00:00Z,Medium: ED; Size: 4.0 MB,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495383,Conference,2019-02-14T05:00:00Z,,NREL/PR-6A50-73118,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[29 ENERGY PLANNING, POLICY, AND ECONOMY, 54 E...",Clean Energy Technologies for Economic Transit...
8,,"[Engel-Cox, Jill [National Renewable Energy La...",,,United States,Energy systems in communities across the world...,AC36-08GO28308,,2019-02-25T05:00:00Z,Medium: ED; Size: 3.7 MB,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495384,Conference,2019-02-14T05:00:00Z,,NREL/PR-6A50-73119,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[29 ENERGY PLANNING, POLICY, AND ECONOMY, 54 E...",Clean Energy Technologies for Economic Develop...
9,,"[OShaughnessy, Eric J [National Renewable Ener...",,,United States,The New York City Mayor's Office of Sustainabi...,AC36-08GO28308,10.2172/1495387,2019-02-26T05:00:00Z,Medium: ED; Size: 558 KB,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495387,Technical Report,2019-02-14T05:00:00Z,,NREL/TP-6A20-72186,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 29 ENERGY PLANNING, POLICY, ...",Expanding Community Shared Solar in New York C...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 24 columns):
article_type           0 non-null object
authors                20 non-null object
availability           20 non-null object
contributing_org       20 non-null object
country_publication    20 non-null object
description            20 non-null object
doe_contract_number    20 non-null object
doi                    20 non-null object
entry_date             20 non-null object
format                 20 non-null object
journal_issue          20 non-null object
journal_name           20 non-null object
journal_volume         20 non-null object
language               20 non-null object
links                  20 non-null object
osti_id                20 non-null object
product_type           20 non-null object
publication_date       20 non-null object
publisher              20 non-null object
report_number          20 non-null object
research_orgs          20 non-null object
sponsor_orgs    

In [4]:
#Provide some basic info about missing values
missing = pd.DataFrame(df.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(df),2)
missing.sort_values('total missing', ascending = False)

Unnamed: 0,total missing,percent missing
article_type,20,1.0
authors,0,0.0
subjects,0,0.0
sponsor_orgs,0,0.0
research_orgs,0,0.0
report_number,0,0.0
publisher,0,0.0
publication_date,0,0.0
product_type,0,0.0
osti_id,0,0.0
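
The counts above only catch true nulls. Fields such as `doi` appear as empty values in several rows of the table yet are reported as fully non-null, which suggests the API returns empty strings (and possibly empty lists) rather than nulls. A sketch of a stricter count, under that assumption, is shown below; `is_blank`, `blank_summary`, and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd


def is_blank(value):
    """Treat None, NaN, and empty strings/lists/dicts as blank."""
    if value is None:
        return True
    if isinstance(value, float) and np.isnan(value):
        return True
    if isinstance(value, (str, list, dict)) and len(value) == 0:
        return True
    return False


def blank_summary(frame):
    """Count blank cells per column, most-blank columns first."""
    # Apply is_blank cell by cell, one column at a time
    blank = frame.apply(lambda col: col.map(is_blank))
    summary = pd.DataFrame({'total blank': blank.sum()})
    summary['percent blank'] = (summary['total blank'] / len(frame)).round(2)
    return summary.sort_values('total blank', ascending=False)


# Made-up sample data mirroring the shape of the API records
sample = pd.DataFrame({
    'doi': ['10.0000/example', '', ''],
    'authors': [['A'], [], ['B']],
    'article_type': [None, None, None],
})
summary = blank_summary(sample)
print(summary)
```

On the real data this would be called as `blank_summary(df)`.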


# Cleaning and Exploring the Data

## Checking Consistency of Office Specificity

As multiple DOE program offices can be associated with a project output in the OSTI database, I want to check to make sure that the filter we're applying to get only solar-related projects is working as expected.

In [25]:
#Convert the sponsor_orgs field to a str instead of a list and split on the comma delimiter to make sure
#EE-4S is everywhere
df['sponsor_orgs'].astype('str').str.split(", ", expand = True)[1].value_counts()

Solar Energy Technologies Office (EE-4S)']    18
Solar Energy Technologies Office (EE-4S)       1
Solar Energy Technologies Office (EE-4S)'      1
Name: 1, dtype: int64
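
The three distinct values above differ only in leftover quote and bracket characters from the list-to-string conversion. An alternative sketch that avoids the string conversion checks each list element directly. The helper name `mentions_office` and the sample strings are illustrative, not taken from the API response.

```python
import pandas as pd


def mentions_office(orgs, code='EE-4S'):
    """Return True if any sponsor string contains the office code in parentheses."""
    return any(f'({code})' in org for org in orgs)


# Made-up sample rows; each cell is a list of sponsor strings
sample = pd.DataFrame({'sponsor_orgs': [
    ['Office A, Solar Energy Technologies Office (EE-4S)'],
    ['Office B', 'Office A, Solar Energy Technologies Office (EE-4S)'],
    ['Office B'],
]})

has_seto = sample['sponsor_orgs'].apply(mentions_office)
print(has_seto.tolist())  # [True, True, False]
```

On the real data, `df['sponsor_orgs'].apply(mentions_office).all()` would be `True` only if every record mentions the solar office.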

## Determining What Links We Get and Where They Go

The `links` field seems to provide some URLs for us to use. Let's do some spot checks and see if they go to project landing pages or straight to the full text itself (the former is preferred over the latter).

**It looks like `'rel': 'citation'` is the link type we want in order to go straight to the landing page.** The `'href'` key of that dict holds the URL we're looking for.

In [52]:
def citation_URL(dict_list):
    '''
    Takes a list of dicts in which at least one dict is
    {'rel': 'citation', 'href': URL} and returns the URL.
    Intended to be used in pd.Series.apply(). If none of the dicts
    in the list is of the form {'rel': 'citation', 'href': URL},
    returns None.
    
    Parameters
    ----------
    dict_list: list of dicts
    
    Outputs
    -------
    url: citation URL as a str
    '''
    
    #return the href of the first dict with {'rel': 'citation'}, if any
    for e in dict_list:
        if e['rel'] == 'citation':
            return e['href']
    #no citation link was found anywhere in the list
    return None

In [58]:
df['citation_link'] = df['links'].apply(citation_URL)
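
A more defensive variant of the same lookup is sketched below. It scans the whole list and tolerates dicts that lack a `'rel'` or `'href'` key. The name `first_link`, the `'fulltext'` value, and the example URLs are hypothetical.

```python
def first_link(links, rel='citation'):
    """Return the href of the first link dict whose 'rel' matches, else None."""
    # dict.get() avoids a KeyError when a link dict is missing a key
    return next((d.get('href') for d in links if d.get('rel') == rel), None)


# Made-up example; the 'fulltext' rel value is an assumption
example_links = [
    {'rel': 'fulltext', 'href': 'https://example.org/fulltext'},
    {'rel': 'citation', 'href': 'https://example.org/citation'},
]

print(first_link(example_links))  # https://example.org/citation
print(first_link([]))             # None
```

It can be passed to `df['links'].apply(first_link)` in the same way as `citation_URL`.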

# To Do

1. Look at how embargoed publications appear in the API, if at all
    * Look for award EE0007326 and OSTI report ID 1490198
    * Look for embargoed Halo Industries Incubator 10 FTR (EE0007192)