# OpenAlex Data Pipeline

Author: Alex Davis

Date: 07/05/2024

The purpose of this script is to ingest OSINT data from the OpenAlex API (https://docs.openalex.org/), preprocess the data, and
prepare it for modeling.

In [19]:
#import packages
import pandas as pd
import requests
import re
import string
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
import pickle 

[nltk_data] Downloading package stopwords to /home/studio-lab-
[nltk_data]     user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/studio-lab-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/studio-lab-
[nltk_data]     user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Import Data

The function below connects to the OpenAlex API and conducts searches based on the parameters (year, keywords, etc.). The function returns a dataframe as a result of the search.

In [2]:
def import_data(pages, start_year, end_year, search_terms):
    
    """
    This function queries the OpenAlex API, conducts a search on works, and returns a dataframe with the associated works.
    
    Inputs: 
        - pages: int, exclusive upper bound of the page loop (pages 1 through pages - 1 are requested)
        - start_year and end_year: int, years to set as a range for filtering works
        - search_terms: str, keywords to search for (must be formatted according to OpenAlex standards)
    """
    
    #create an empty dataframe
    search_results = pd.DataFrame()
    
    for page in range(1, pages):
        
        #use parameters to conduct request and format the results as a dataframe
        response = requests.get(f'https://api.openalex.org/works?page={page}&per-page=200&filter=publication_year:{start_year}-{end_year},type:article&search={search_terms}')
        data = pd.DataFrame(response.json()['results'])
        
        #append to empty dataframe
        search_results = pd.concat([search_results, data])
    
    #subset to relevant features
    search_results = search_results[["id", "title", "display_name", "publication_year", "publication_date",
                                        "type", "countries_distinct_count","institutions_distinct_count",
                                        "has_fulltext", "cited_by_count", "keywords", "referenced_works_count", "abstract_inverted_index"]]
    
    return(search_results)
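
The query string in `import_data` is assembled by hand inside an f-string. As a minimal sketch, the same parameters could be collected in a dictionary and URL-encoded with the standard library, which makes the handling of spaces and quotes in the search terms explicit. The helper name `build_works_url` and its `per_page` default are assumptions for illustration and are not part of the original function.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.openalex.org/works"

def build_works_url(page, start_year, end_year, search_terms, per_page=200):
    """Hypothetical helper: build a query equivalent to the one in import_data."""
    params = {
        "page": page,
        "per-page": per_page,
        "filter": f"publication_year:{start_year}-{end_year},type:article",
        "search": search_terms,
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'
    return f"{BASE_URL}?{urlencode(params)}"

url = build_works_url(1, 2016, 2024, "satellite OR gps")

# Decode the query string again to confirm the parameters survive the round trip
decoded = parse_qs(urlsplit(url).query)
print(decoded["filter"][0])
print(decoded["search"][0])
```

The resulting URL could then be passed to `requests.get` in place of the f-string.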

Below, we conduct multiple searches to cover different technology areas inspired by the DoD's critical technology areas (https://www.cto.mil/usdre-strat-vision-critical-tech-areas/). We do this to ensure variety within our data and to have a general idea of the topics included.

We set the years from 2016 to 2024 to filter to relevant data. We pass pages=35 to each search, which requests pages 1 through 34 at 200 works per page, to keep the size of each search reasonable. Lastly, we pick a handful of key terms for each search.
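
As a rough check on the size of each search: the calls below pass `pages=35`, and `import_data` iterates over `range(1, pages)` with 200 results per page. The sketch works out the resulting upper bound, assuming every requested page comes back full; the variable names are illustrative.

```python
pages = 35          # value passed to import_data in the searches below
per_page = 200      # per-page value hard-coded in the request URL
n_searches = 5      # one search per technology area

# range(1, pages) stops before `pages`, so pages 1..34 are requested
pages_requested = len(range(1, pages))
max_per_search = pages_requested * per_page
max_total = max_per_search * n_searches

print(pages_requested)  # 34
print(max_per_search)   # 6800
print(max_total)        # 34000
```

The final count reported after dropping duplicates and missing abstracts (33,148) sits below this bound.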

In [3]:
#search for Trusted AI and Autonomy
ai_search = import_data(35, 2016, 2024, "'artificial intelligence' OR 'deep learn' OR 'neural net' OR 'autonomous' OR drone")

In [4]:
#search for Biotechnology
biotech_search = import_data(35, 2016, 2024, "biotech OR dna OR genome OR crispr OR rna")

In [5]:
#search for Advanced Materials
materials_search = import_data(35, 2016, 2024, "biomaterial OR 'smart material' OR nanotech OR 'carbon fiber' OR superalloy")

In [6]:
#search for Space Technology
space_search = import_data(35, 2016, 2024, "satellite OR gps OR 'space navigation' OR 'space communications'")

In [7]:
#search for Human Machine Interfaces
interfaces_search = import_data(35, 2016, 2024, "'augmented reality' OR 'virtual reality' OR 'human-machine interfac' OR 'brain-mach'")

In [8]:
#concatenate into a master dataframe and drop duplicates/null abstracts
master_search = pd.concat([ai_search, biotech_search, materials_search, space_search, interfaces_search])
master_search = master_search.drop_duplicates(subset = 'id')
master_search = master_search[master_search['abstract_inverted_index'].notna()]

print(f"Final Number of Works: {len(master_search)}")

Final Number of Works: 33148


## Preprocess Data

The abstracts associated with the works we pulled are returned as an inverted index for legal reasons. This inverted index can be used to reconstruct the original text. Then, the text must be cleaned to prepare it for embeddings.

### Inverted Indices

In [9]:
def undo_inverted_index(inverted_index):
    
    """
    The purpose of the function is to 'undo' an inverted index. It takes an inverted index
    (a dictionary mapping each word to the list of positions where it occurs) and returns the original string.
    """

    #create empty lists to store uninverted index
    word_index = []
    words_unindexed = []
    
    #loop through the index and collect a [word, position] pair for every occurrence
    for k, v in inverted_index.items():
        for index in v:
            word_index.append([k, index])

    #sort by the index
    word_index = sorted(word_index, key = lambda x : x[1])
    
    #join only the values and flatten
    for pair in word_index:
        words_unindexed.append(pair[0])
    words_unindexed = ' '.join(words_unindexed)
    
    return(words_unindexed)
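
To make the format concrete, here is a small self-contained sketch of an inverted index and its reconstruction. The toy index and the name `rebuild_text` are made up for illustration; the logic mirrors `undo_inverted_index` above, placing each word at every position listed for it.

```python
def rebuild_text(inverted_index):
    """Rebuild a string from a {word: [positions]} mapping (illustrative helper)."""
    # Flatten to (position, word) pairs so that sorting orders words by position
    pairs = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
    return " ".join(word for _, word in sorted(pairs))

# Toy example: "the" appears at positions 0 and 3
toy_index = {"the": [0, 3], "cat": [1], "chased": [2], "mouse": [4]}
print(rebuild_text(toy_index))  # the cat chased the mouse
```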

In [10]:
#create 'original_abstract' feature
master_search['original_abstract'] = list(map(undo_inverted_index, master_search['abstract_inverted_index']))

In [11]:
master_search.head(3)

Unnamed: 0,id,title,display_name,publication_year,publication_date,type,countries_distinct_count,institutions_distinct_count,has_fulltext,cited_by_count,keywords,referenced_works_count,abstract_inverted_index,original_abstract
0,https://openalex.org/W2664267452,"Artificial intelligence in healthcare: past, p...","Artificial intelligence in healthcare: past, p...",2017,2017-06-21,article,2,6,True,2328,[{'id': 'https://openalex.org/keywords/artific...,57,"{'Artificial': [0], 'intelligence': [1], '(AI)...",Artificial intelligence (AI) aims to mimic hum...
1,https://openalex.org/W2981731882,Explainable Artificial Intelligence (XAI): Con...,Explainable Artificial Intelligence (XAI): Con...,2020,2020-06-01,article,2,5,False,3870,[{'id': 'https://openalex.org/keywords/xai-con...,229,"{'In': [0], 'the': [1, 19, 28, 38, 45, 53, 70,...","In the last few years, Artificial Intelligence..."
2,https://openalex.org/W2766447205,Mastering the game of Go without human knowledge,Mastering the game of Go without human knowledge,2017,2017-10-01,article,1,1,True,6893,[{'id': 'https://openalex.org/keywords/compute...,33,"{'A': [0], 'long-standing': [1], 'goal': [2], ...",A long-standing goal of artificial intelligenc...


### Combine and Clean Text

In [12]:
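#combine all text into one column for analysis and embedding
#note: the columns are joined without a separator, so the last word of one field fuses with the first word of the next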
master_search["all_text"] = master_search["title"] + master_search["display_name"] + master_search["original_abstract"]
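
String addition with `+` places the fields back to back. If word boundaries between fields should be preserved, one option is to join them with a space. The sketch below is an assumption about how that could look, not the notebook's method; `combine_fields` is a hypothetical helper that also skips missing values.

```python
def combine_fields(*fields):
    """Join text fields with single spaces, skipping missing or empty values."""
    return " ".join(field for field in fields if isinstance(field, str) and field)

title = "Mastering the game of Go without human knowledge"
abstract = "A long-standing goal of artificial intelligence"

print(combine_fields(title, abstract))
print(combine_fields(title, None))  # a missing abstract leaves only the title
```

Applied to the dataframe, this could be mapped over the columns in the same way as the other helpers, for example `list(map(combine_fields, master_search["title"], master_search["original_abstract"]))`.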

In [21]:
def preprocess(text):
    
    """
    This function takes in a string, converts it to lowercase, cleans
    it (removes digits and special characters), and tokenizes it.
    """
    
    #convert to lowercase
    text = text.lower()
    
    #remove digits and special characters
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    
    #tokenize
    tokens = nltk.word_tokenize(text)
    
    return(tokens)
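
The effect of the two substitutions is easiest to see on a short string. The sketch below repeats the lowercase and regex steps from `preprocess` on a made-up sentence; it uses `str.split` in place of `nltk.word_tokenize` so that it runs without the NLTK data, which is a simplification.

```python
import re

sample = "Deep-Learning in 2024: 3 New Results!"

text = sample.lower()
text = re.sub(r'\d+', '', text)       # drop runs of digits
text = re.sub(r'[^\w\s]', '', text)   # drop anything that is not a word character or whitespace
tokens = text.split()                 # whitespace split as a stand-in for word_tokenize

print(tokens)  # ['deeplearning', 'in', 'new', 'results']
```

Note that hyphenated terms are fused into one token, because the hyphen is removed rather than replaced with a space.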

In [22]:
def remove_stopwords(tokens):
    
    """
    This function takes in a list of tokens (from the 'preprocess' function) and 
    removes a list of stopwords. Custom stopwords can be added to the 'custom_stopwords' list.
    """
    
    #set default and custom stopwords (stored as a set for fast membership checks)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    custom_stopwords = []
    stop_words.update(custom_stopwords)
    
    #filter out stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    return(filtered_tokens)
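
A short illustration of the filtering step, using a tiny hand-written stopword set instead of the NLTK list so that it runs on its own. The words in `demo_stop_words` and `demo_custom` are placeholders, not the actual NLTK English list.

```python
demo_stop_words = {"the", "is", "on", "of"}   # placeholder for the NLTK English list
demo_custom = {"study"}                        # example of a domain-specific addition
stop_words = demo_stop_words | demo_custom

tokens = ["the", "study", "model", "is", "trained", "on", "abstracts"]
filtered = [word for word in tokens if word not in stop_words]

print(filtered)  # ['model', 'trained', 'abstracts']
```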

In [23]:
def lemmatize(tokens):
    
    """
    This function conducts lemmatization on a list of tokens (from the 'remove_stopwords' function).
    This reduces each word to its dictionary form (lemma) to improve modeling results. Because no
    part of speech is passed, the WordNet lemmatizer treats every token as a noun.
    """
    
    #initialize lemmatizer and lemmatize
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return(lemmatized_tokens)

In [25]:
def clean_text(text):
    
    """
    This function uses the previously defined functions to take a string and
    run it through the entire data preprocessing process.
    """
    
    #clean, tokenize, remove stopwords, and lemmatize a string
    tokens = preprocess(text)
    filtered_tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize(filtered_tokens)
    cleaned_text = ' '.join(lemmatized_tokens)
    
    return(cleaned_text)

In [17]:
#use functions above to create a column with preprocessed text
master_search['clean_text'] = list(map(clean_text, master_search['all_text']))

In [18]:
master_search.head(3)

Unnamed: 0,id,title,display_name,publication_year,publication_date,type,countries_distinct_count,institutions_distinct_count,has_fulltext,cited_by_count,keywords,referenced_works_count,abstract_inverted_index,original_abstract,all_text,clean_text
0,https://openalex.org/W2664267452,"Artificial intelligence in healthcare: past, p...","Artificial intelligence in healthcare: past, p...",2017,2017-06-21,article,2,6,True,2328,[{'id': 'https://openalex.org/keywords/artific...,57,"{'Artificial': [0], 'intelligence': [1], '(AI)...",Artificial intelligence (AI) aims to mimic hum...,"Artificial intelligence in healthcare: past, p...",artificial intelligence healthcare past presen...
1,https://openalex.org/W2981731882,Explainable Artificial Intelligence (XAI): Con...,Explainable Artificial Intelligence (XAI): Con...,2020,2020-06-01,article,2,5,False,3870,[{'id': 'https://openalex.org/keywords/xai-con...,229,"{'In': [0], 'the': [1, 19, 28, 38, 45, 53, 70,...","In the last few years, Artificial Intelligence...",Explainable Artificial Intelligence (XAI): Con...,explainable artificial intelligence xai concep...
2,https://openalex.org/W2766447205,Mastering the game of Go without human knowledge,Mastering the game of Go without human knowledge,2017,2017-10-01,article,1,1,True,6893,[{'id': 'https://openalex.org/keywords/compute...,33,"{'A': [0], 'long-standing': [1], 'goal': [2], ...",A long-standing goal of artificial intelligenc...,Mastering the game of Go without human knowled...,mastering game go without human knowledgemaste...


## Save Data

Here, we save the data as a .pkl file. This is a lightweight solution: the file is written to the Data folder and can be loaded in the modeling and analysis notebooks.

In [20]:
#save file as .pkl file
with open('Data/preprocessed_data.pkl', 'wb') as file:
    #a new file will be created (an existing file is overwritten)
    pickle.dump(master_search, file)
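
A downstream notebook can read the file back with `pickle.load`. The sketch below shows the round trip on a small stand-in object written to a temporary directory; the records and the file name are placeholders rather than the real dataframe.

```python
import os
import pickle
import tempfile

# Stand-in for master_search; any picklable object follows the same pattern
records = [{"id": "W1", "clean_text": "artificial intelligence healthcare"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessed_demo.pkl")

    # write in binary mode, as in the cell above
    with open(path, "wb") as file:
        pickle.dump(records, file)

    # read in binary mode to restore the object
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == records)  # True
```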