# A Geospatial-Natural Language Processing Analysis of the Snow Albedo Literature

Eric A. Sproles<sup>1</sup>, Yun Li<sup>2</sup>, Forest Edwards<sup>1</sup>, Chris Crawford<sup>3</sup>

<sup>1</sup> Department of Earth Sciences; Geospatial Snow, Water, & Ice Research Lab; Montana State University

<sup>2</sup> Department of Geography and GeoInformation Science; George Mason University

<sup>3</sup> Earth Resources Observation and Science (EROS) Center,
Integrated Science and Applications Branch, U.S. Geological Survey


## Scope and intent
Snow albedo has long been recognized as a key variable for snow hydrology, climate modeling, and energy balance calculations. The Snow Albedo Working Group (SAWG) is composed of ~20 domain science professionals from across government agencies and academia in the United States and Europe. The primary goal of the SAWG is to complete a scoping document and accompanying journal article that review previous snow albedo studies and help define and leverage future directions of research focused on the radiative forcing of snow.

The rationale behind this scoping study rests on two primary factors. First, the recent expansion of ground-based, airborne, and spaceborne remote sensing measurements creates new opportunities for scientific discovery. Second, while these opportunities are promising, a comprehensive examination of the scientific publications generated by the snow albedo community does not exist. Thus, as a group of domain scientists, the SAWG is not able to collectively identify scientific and geographic research gaps. Put simply, the scope of this challenge is too broad and no foundational information exists to build on. The SAWG identified the need for a comprehensive review of the snow albedo literature as the first step in identifying scientific and geographic knowledge gaps to develop a path forward for transformative research.

The intent of this ESIP project is to test methods that optimize the arduous process of literature reviews by: 1) completing a Natural Language Processing (NLP) analysis of the published literature on snow albedo; 2) transitioning the NLP results into a scalable mapping analysis; and 3) producing this summary report of the results. 

This summary report supports subsequent efforts of the SAWG. The group has received funding from the NASA Terrestrial Hydrology Program to continue a snow albedo scoping document and literature review starting in May, culminating in a community workshop by November 2022.



## Technical approach
The primary objective of the SAWG scoping document is to provide a comprehensive review of published literature on the topic of snow albedo indexed in digital libraries (e.g. SCOPUS, IEEE Xplore, Web of Science, JSTOR, PubMed, and Springer), synthesize and map the results, and identify key knowledge gaps. A discipline-focused analysis of this kind requires researchers to organize, read, and provide meta-analysis on the body of literature, a process that requires a considerable investment of human capital [1]. Machine learning, specifically NLP, expedites the process, organizes the results, and provides a replicable framework for similar or continuing efforts. An NLP analysis can also extract the geographic location of a study area or field region, which is particularly applicable to Earth science research. This facilitates the identification of regions and snow types where the state of knowledge in snow albedo science is lacking (e.g. the Andes, the Atlas Mountains, and low-latitude regions).

An NLP approach automates the literature review process programmatically by having a machine interpret the meaning of text. Incorporating NLP into literature reviews is well documented in biomedical research [2,3,4], but to date it has been underutilized in the Earth Sciences.

## NLP Analysis of the published literature on Snow Albedo
Our selected corpus of literature was based upon a comprehensive web search using keywords developed by the SAWG (https://github.com/ESIPFed/Snow-Albedo-Mapping). The search was limited to articles published in English from 1991 to 2020 that were available as PDF files.

The NLP analysis focused on the abstract and conclusion of these articles using the Stanford Named Entity Recognition Tagger (NERT) [5]. The open-source package Apache Tika was accessed through Python to detect and extract structured text content from the PDF files. The Stanford NLP tools were then used to recognize named entities, such as locations and time expressions, in the extracted content. Additionally, domain-specific terms can serve as training data for the NLP to extract entities and their relationships, key findings, and future research directions from the abstract and conclusion.
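As a minimal sketch of the section-isolation step, the snippet below pulls the abstract and conclusion out of plain text that a PDF parser has already extracted. The heading patterns and the `extract_sections` helper are assumptions made for illustration; they are not the project's code, and real articles label their sections far less consistently.

```python
import re

# Hypothetical heading patterns; real PDFs vary widely in how sections are labelled.
SECTION_HEADINGS = {
    "abstract": re.compile(r"^\s*abstract\s*$", re.IGNORECASE),
    "conclusion": re.compile(r"^\s*(\d+\.?\s*)?conclusions?\s*$", re.IGNORECASE),
}

# Any line that looks like another common heading ends the current section.
NEXT_HEADING = re.compile(
    r"^\s*(\d+\.?\s*)?"
    r"(introduction|methods?|results|discussion|acknowledge?ments?|references)\s*$",
    re.IGNORECASE,
)


def extract_sections(text):
    """Return {'abstract': str, 'conclusion': str}; a value is '' when not found."""
    sections = {name: [] for name in SECTION_HEADINGS}
    current = None
    for line in text.splitlines():
        matched = next(
            (name for name, pattern in SECTION_HEADINGS.items() if pattern.match(line)),
            None,
        )
        if matched:
            current = matched      # start collecting this section
            continue
        if NEXT_HEADING.match(line):
            current = None         # a different section begins; stop collecting
            continue
        if current:
            sections[current].append(line.strip())
    return {
        name: " ".join(part for part in lines if part)
        for name, lines in sections.items()
    }
```

An empty string plays the role of a blank abstract or conclusion field: either the article has no such section or its heading could not be told apart from the body text.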

The NLP analysis can also automatically detect geographic information in input text, supplementing the output of the named entity recognition with geospatial context. The geographic locations of these articles were mined using the Stanford NERT and geograpy3, a Python library that extracts place names from a URL or text and adds context to those names, for example distinguishing between a country, region, or city (https://pypi.org/project/geograpy3/).

## Mapping of the NLP Analysis
Geospatial indexing provided a means to organize, aggregate, and analyze the NLP results at multiple scales. We used three distinct hierarchical scales that represent differing levels of geospatial detail extracted from the NLP analysis. We applied the open-source H3 framework to transform latitude and longitude into a unique index that can be redrawn at 16 spatial resolutions (https://h3geo.org/docs/). Originally developed by Uber, H3 is hierarchical, which allows a single map to be drawn at multiple resolutions. The results of the NLP were mapped at three H3 resolutions (2, 3, and 4) based upon the geographic detail provided by the NLP results (Table 1). For example, in the case of snow albedo, the Longyearbyen glacier would be rendered at a higher level of spatial detail (small hexagons), and its parent, Svalbard, would be rendered at a coarser resolution. Each hexagon is color coded on the map to show the frequency of each location. Each hexagon can be clicked inside Jupyter Notebook, and a popup will appear showing exact counts.

The mapping framework is designed to automatically add a location's count to its parent hexagon (if a hexagon with a matching name already exists). For example, if there were 3 occurrences of "Colorado" in the NLP results, and the assigned parent location is "United States", then a value of 3 would be added to the hexagon labeled "United States".
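The roll-up rule can be sketched independently of H3 and the map. The example below is a simplified illustration, not the notebook's grid-based implementation: `roll_up_counts` and its dictionary inputs are hypothetical names, and the function walks each location's chain of parents, adding its count only to parents that already have their own entry.

```python
from collections import Counter


def roll_up_counts(mentions, parent_of):
    """Add each location's count to every ancestor that already has its own entry.

    mentions  -- list of location names, one per occurrence in the NLP output
    parent_of -- dict mapping a location name to its parent's name
    """
    own = Counter(mentions)   # counts at each location's own level
    totals = Counter(own)     # start from own counts, then add descendants
    for name, count in own.items():
        seen = {name}         # guard against accidental cycles in parent_of
        parent = parent_of.get(name)
        while parent is not None and parent not in seen:
            if parent in own:  # only existing hexagons receive counts
                totals[parent] += count
            seen.add(parent)
            parent = parent_of.get(parent)
    return dict(totals)


# Usage with the example from the text: three "Colorado" mentions are added
# to an existing "United States" entry.
counts = roll_up_counts(
    ["Colorado", "Colorado", "Colorado", "United States", "United States"],
    {"Colorado": "United States"},
)
```

A parent that never appears in the NLP output is left out rather than created, which mirrors the "matching name" condition above.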


Table 1 - Explanation of H3 Hierarchies

| NLP Geographic Descriptor | Resolution | Parent     |
| :---                      |    :----   | :---       |
| Antarctica                | 2          | None       |
| Caucasus Mountains        | 3          | Russia     |
| Longyearbyen              | 4          | Svalbard   |

## Results
The NLP analysis identified 1,255 published articles for this testbed analysis that matched the list of keywords developed by the SAWG. While the analysis focused on each article's abstract and conclusion, not all publications were readable by the NLP algorithms. Of these articles, 553 abstracts (42%) and 419 conclusions (33%) were readable. A blank abstract or conclusion indicates that the paper did not have such a section or that the section could not be distinguished from the body text.
The geographic location of a study was identified and extracted for 395 (31%) of the articles using the Stanford NERT and geograpy3 algorithms. Null or blank results represent articles that did not have a location in either the abstract or conclusion, or had a location that the NLP analysis did not recognize. Mapping these results required manual enhancement of the outputs: looking through the publications and adding locational data to the geographic extraction. This increased the number of mappable locations to 480.
 
### Qualitative examination of the results
The 1,255 articles returned by the keyword search and processed by the NLP and geographic analysis varied in their relevance to snow albedo. For example, an article such as:

> *Black Carbon Aerosols in Arctic Snow and Implications for Albedo Change*

is highly related to the goals of the SAWG, and this article returned both an abstract and conclusion. Geographically, Svalbard (Arctic) and Mt Zeppelin (Antarctic) were identified as study locations.
However, some articles were less relevant, such as:

> *Ultrafine grained bulk steel produced by accumulative roll bonding ARB process*

This article is relevant to materials science, but not to snow albedo.
Similarly, some of the extracted study areas contain terms that are not locations. For example, the Stanford NLP package and geograpy3 both identified El Niño as a location, and geograpy3 identified MODIS as a location.

A comprehensive evaluation of the relevance of the 1,255 results was not completed as part of this testbed study. Even so, these examples show that NLP analyses should be augmented by human review when working in the cryospheric sciences domain.
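One lightweight form of human augmentation is a curated stoplist that flags known false positives before mapping. The sketch below is illustrative only: the `NON_LOCATION_TERMS` set holds just the two terms reported above, and `filter_locations` is a hypothetical helper rather than part of the Stanford tools or geograpy3.

```python
import unicodedata

# Hypothetical stoplist seeded with the two false positives reported above.
NON_LOCATION_TERMS = {"el niño", "modis"}


def normalize(name):
    """Normalize Unicode form, whitespace, and case so comparisons are stable."""
    return " ".join(unicodedata.normalize("NFC", name).split()).casefold()


def filter_locations(candidates, stoplist=NON_LOCATION_TERMS):
    """Split candidate place names into (kept, flagged) lists."""
    blocked = {normalize(term) for term in stoplist}
    kept, flagged = [], []
    for name in candidates:
        (flagged if normalize(name) in blocked else kept).append(name)
    return kept, flagged


kept, flagged = filter_locations(["Svalbard", "El Niño", "MODIS", "Antarctica"])
```

Flagged terms are returned rather than discarded so that a domain scientist can review them and extend the stoplist over time.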

These results were organized in a CSV file and published on GitHub (https://github.com/ESIPFed/Snow-Albedo-Mapping). An interactive map of the NLP results and resultant cluster analysis is provided below, and will be hosted and maintained by Dr. Eric Sproles at Montana State University.


## Code to create map

In [1]:
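# IMPORTS
import pandas as pd
import folium
import h3
import jenkspy
import branca
from IPython.display import display # Available implicitly in Jupyter; imported so the code also runs as a script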

In [2]:
class Hexagon:
    def __init__(self, name, hex_coordinate, weight, parent_loc):
        self.name = name
        self.hex_coordinate = hex_coordinate
        self.weight = weight
        self.parent_loc = parent_loc

class HexagonGrid: # Class that contains all hexagons in h3 format at some specified resolution
    def __init__(self, name, geo_locs):
        self.name = name
        self.geo_locs = geo_locs
        self.resolution = int(geo_locs['resolution'].iloc[0]) # Cast to a plain int; pandas may parse the column as float
        self.parent_grid = None
        self.sub_grid = None
        self.hexagons = self.generate_grid()
        
    def generate_grid(self): # Returns a list of Hexagon objects
        hex_locations = [] # Keeps track of hex locations already counted
        hex_grid = [] # list of Hexagon objects to be returned
        for i in range(len(self.geo_locs)):
            # h3-py 3.x API; in h3-py 4.x this call is named h3.latlng_to_cell
            location = h3.geo_to_h3(self.geo_locs.iloc[i]['latitude'],
                                    self.geo_locs.iloc[i]['longitude'],
                                    self.resolution)
            if location in hex_locations: # Add to weight of existing hexagon
                for item in hex_grid:
                    if location == item.hex_coordinate:
                        item.weight = item.weight + 1
            else: # Create new hexagon
                hex_locations.append(location)
                hexagon = Hexagon(self.geo_locs.iloc[i]['name'], location, 1, self.geo_locs.iloc[i]['parent'])
                hex_grid.append(hexagon)
        return hex_grid
    
    def set_sub_grid(self, sub_grid):
        # Increase weights of this grid by amount in sub_grid
        self.sub_grid = sub_grid
        while sub_grid is not None:
            for i in sub_grid.hexagons:
                for j in self.hexagons:
                    if i.parent_loc == j.name:
                        j.weight = j.weight + i.weight
                        break
            sub_grid = sub_grid.sub_grid

In [3]:
def read_csv(filename, datafields): # Parameters: name of file, fields to be read
    data = pd.read_csv(filename, usecols = datafields)
    # Some values may be null or zero; drop every row that has one in a requested field.
    invalid = data[datafields].isnull() | (data[datafields] == 0)
    return data[~invalid.any(axis = 1)]

def build_color_classification(hexgrid, color_scheme):
    # Uses natural breaks classification to generate color breaks
    weights = []
    for hexagon in hexgrid.hexagons:
        weights.append(hexagon.weight)
    color_breaks = jenkspy.jenks_breaks(weights, nb_class = len(color_scheme) - 1)
    return color_breaks
    
def plot_map(grid, color_scheme):
    m = folium.Map(location = [0, 0],
                     tiles = 'http://tile.stamen.com/terrain/{z}/{x}/{y}.jpg', 
                     attr = 'Map tiles by <a href="http://stamen.com">Stamen Design</a>, under CC BY 3.0. Data by <a href="http://openstreetmap.org">OpenStreetMap</a>, under ODbL.', 
                     zoom_start = 1, 
                     overlay = True)
    while grid is not None:
        feature_group = folium.FeatureGroup(name = grid.name, show = False)
        color_class = build_color_classification(grid, color_scheme)
        color_map = branca.colormap.LinearColormap(colors = color_scheme, index = color_class, vmin = 1)
        for hexagon in grid.hexagons:
            hex_color = color_map.rgb_hex_str(hexagon.weight)
            hex_vertices = h3.h3_to_geo_boundary(hexagon.hex_coordinate)
            folium.Polygon(locations = hex_vertices, 
                           popup = "<b>" + hexagon.name + "</b>" + ": " + str(hexagon.weight) + "<br><b>Parent: </b>" + hexagon.parent_loc, 
                           fill_color = hex_color, 
                           fill_opacity = .5,
                           weight = 1,
                           color = '#000000'
                          ).add_to(feature_group)
        feature_group.add_to(m)
        grid = grid.sub_grid
    folium.LayerControl(collapsed = False).add_to(m) # Allows toggling of resolutions while running
    return m


def generate_h3_map(filename, color_scheme):
    datafields = ['latitude', 'longitude', 'name', 'parent', 'resolution']
    nlp_geolocations = read_csv(filename, datafields)
    hex_grids_list = []
    
    # Split geolocation data into unique resolutions (16 possible resolutions): 
    for i in range(16):
        split_df = nlp_geolocations[nlp_geolocations['resolution'] == i]
        if not split_df.empty:
            hex_grid = HexagonGrid('Resolution ' + str(i), split_df)
            hex_grids_list.append(hex_grid)
    
    # Set sub grids:
    for i in range(len(hex_grids_list) - 2, -1, -1):
        hex_grids_list[i].set_sub_grid(hex_grids_list[i + 1])
        
        
    # Plot map:
    m = plot_map(hex_grids_list[0], color_scheme) # Color scheme: [Green, Yellow, Red]
    display(m)

## Map of results

In [4]:
generate_h3_map('nlp_geolocations.csv', ['#008000', '#FFFF00', '#FF0000'])

## Lessons learned, future directions and recommendations

The project provided a first pass at testing the efficacy of NLP for generating a meta-analysis of the snow albedo literature. While informative, the results do not provide a definitive analysis of the corpus of snow albedo research. Similarly, the geolocational analysis was informative but could not definitively identify all of the geographic locations in the articles. These testbed results provide a foundation upon which the SAWG will build moving forward. A series of ideas and suggestions follows.

### Refine methods
One way to potentially improve the NLP analysis with regard to snow albedo is to refine the set of keywords more methodically. The set of words applied in this study was more of a “group think” exercise than a list that was vetted against publications (e.g. The Cryosphere, Water Resources Research). An enhanced keyword list could potentially select a more representative body of snow albedo literature, and thus increase the number of publications selected.

The Stanford NERT code applied in this project was not customized for this testbed project. Augmenting the code to specifically target snow science research provides potential opportunities for more robust results in identifying published articles and extracting relevant information. These potential improvements would require a programmer skilled in NLP and additional funding support.

Geographically, neither the Stanford nor the geograpy3 geolocational package provided robust results. One of the primary reasons is that both packages are developed to look for more common place names, such as a country, region, or city. These packages are not set up for locations that are more scientifically esoteric, such as Niwot Ridge, HJ Andrews Experimental Forest, or Central Sierra Snow Lab. Thus, even though these locations are embedded in the abstract or conclusion, they are not recognized by the geographic packages. A potential means to address this deficiency would be to create a Look Up Table (LUT) of potential sites for a geolocational package to reference. One of the inherent downsides of developing such a LUT is that it would be biased toward established research locations, and potentially would not identify newer sites.
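A LUT of this kind could be as simple as a dictionary of site names that is matched against the extracted text. The sketch below assumes a hand-built table: `SITE_LUT` and `find_known_sites` are hypothetical, and the coordinates are approximate values included only for illustration, so they should be verified before any real use.

```python
import re

# Hypothetical look-up table of research sites named in the text above.
# Coordinates (latitude, longitude) are approximate and for illustration only.
SITE_LUT = {
    "Niwot Ridge": (40.05, -105.59),
    "HJ Andrews Experimental Forest": (44.21, -122.26),
    "Central Sierra Snow Lab": (39.33, -120.37),
}


def find_known_sites(text, lut=SITE_LUT):
    """Return {site: (lat, lon)} for every LUT entry mentioned in the text."""
    hits = {}
    for site, coords in lut.items():
        # Allow any whitespace (including line breaks) between the words of a name.
        pattern = r"\b" + r"\s+".join(re.escape(word) for word in site.split()) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits[site] = coords
    return hits


abstract = (
    "Measurements were collected at Niwot Ridge and the central sierra\n"
    "snow lab during the melt season."
)
sites = find_known_sites(abstract)
```

Exact-name matching keeps the example short, but it also makes the bias noted above concrete: a site that is absent from the table, or spelled differently, is never found.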




### Language bias
The researchers on this project fully recognize that the literature search was performed only on articles published in English. As a result there is an inherent bias towards publications in English, predominantly from North America and western Europe. There is a large body of snow albedo research from eastern Europe, Russia, Japan, and China. Extending the search to other languages has the potential to provide depth and breadth to the SAWG scoping study. This additional analysis would come at a considerable computational and programming cost, and would most likely take months for the processing to complete. A multilingual meta-analysis would be an impressive final objective; however, doing so would require a multi-year budget, specialists in each of the languages, and considerable computing power.

### Mapping of results
The H3 framework provides an effective means of mapping the nested geolocational information associated with a meta-analysis of field-based science. This approach could be augmented in the future with more detailed attribute information associated with each mapped location. This would allow results to be filtered by topic (e.g. dust on snow) or location (e.g. Spain).


## Foundation for a SAWG white paper
This testbed study identified the strengths and weaknesses of using NLP for a meta-analysis of the snow albedo literature. The ability of the NLP to process the 1,255 published articles in a fraction of the time that it would take a human, and to streamline the task of reviewing thousands of journal articles, is a strength. However, the inability of the NLP analysis to filter only for relevant publications is a weakness. Geolocational extraction of each publication was incomplete and required manual augmentation of locational names and additional processing.

Based upon our results and experiences, the application of an NLP analysis is a positive first phase for a machine learning meta-analysis. Even so, it is critical that the pre-processing and NLP results are evaluated by experienced domain scientists. A considerable investment in post-processing to interpret the results should also be expected.
Additionally, we did not conduct a precision or accuracy assessment, in which we would quantify the number of True Positives, False Positives, False Negatives, and True Negatives. Doing so in subsequent studies would better frame how complete (or incomplete) the results of an NLP analysis are.
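For reference, the standard metrics that such an assessment would report can be computed directly from the four counts. The function below is a generic sketch; the name `classification_metrics` is hypothetical and the counts in the usage line are made-up placeholders, not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1, and accuracy from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Undefined ratios (zero denominator) are reported as NaN rather than raising.
        return numerator / denominator if denominator else float("nan")

    return {
        "precision": ratio(tp, tp + fp),           # share of retrieved items that are relevant
        "recall": ratio(tp, tp + fn),              # share of relevant items that were retrieved
        "f1": ratio(2 * tp, 2 * tp + fp + fn),     # harmonic mean of precision and recall
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Placeholder counts for illustration only.
metrics = classification_metrics(tp=3, fp=1, fn=3, tn=0)
```

Precision speaks to the relevance problem (irrelevant articles and non-locations being returned), while recall speaks to completeness (relevant articles or study sites being missed).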


## Relevance to the ESIP Community
The integration of Data and Earth Science is the foundation of ESIP, and this project integrates open-source approaches to NLP and spatial analysis. We present modern workflows that, while not perfect, do advance domain and data science. While the scope of this research focuses on snow albedo, a targeted Designated Observable, the workflow is readily adaptable to assess other themes in the scientific literature.

## Works cited
[1] Zdravevski E. et al. (2019) Automation in Systematic, Scoping and Rapid Reviews by an NLP Toolkit: A Case Study in Enhanced Living Environments. In: Ganchev I., Garcia N., Dobre C., Mavromoustakis C., Goleva R. (eds) Enhanced Living Environments. Lecture Notes in Computer Science, vol 11369. Springer.

[2] Pons, Ewoud, et al. "Natural language processing in radiology: a systematic review." Radiology 279.2 (2016): 329-343.

[3] Kreimeyer, Kory, et al. "Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review." Journal of biomedical informatics 73 (2017): 14-29.

[4] Wang, Yanshan, et al. "A comparison of word embeddings for the biomedical natural language processing." Journal of biomedical informatics 87 (2018): 12-20.

[5] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
