# Geo-spatial analysis

This section extracts geographical information about the papers.
The first attempt was to use the country of each author's affiliation, but two problems emerged:
- Affiliation information is available for only a small number of authors.
- Even when available, only one affiliation is reported per author, so the author's affiliation history is ignored, which can lead to errors in the geographical classification of the paper.

The abstracts were therefore searched for geographical references to the subject of the papers.

In [None]:
#importing the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import spacy
import numpy as np
import warnings
from geopy.geocoders import Nominatim
from geopy.exc import GeopyError
import time
import string
import pycountry_convert as pc
warnings.filterwarnings('ignore')

In [None]:
#csv loading
df = pd.read_csv('Dataset_API/papers_shortlisted_final2.csv', index_col=0)
df

### Authors' affiliation

Checking how much information the dataset contains about the authors' affiliations.

In [None]:
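import ast

#Flag the papers for which at least one author has affiliation information
df['Affiliaton_info'] = 0
authors_info = []
for i in range(len(df)):
    #The authors are stored as a string, so they are parsed back into a list of dictionaries
    a = ast.literal_eval(df['authors'][i])
    for author in a:
        authors_info.append(author)
        if author['affiliatons'] is not None:
            df.loc[i, 'Affiliaton_info'] = 1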

In [None]:
print('Number of papers with at least one affiliation: ', len(df[df['Affiliaton_info'] == 1]))
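
The `authors` column stores each author list as a string, which has to be parsed back into Python objects before it can be inspected. A minimal sketch of that check using `ast.literal_eval`, which accepts only literals and therefore cannot execute arbitrary code, is given here. The sample rows and the helper name `has_affiliation` are hypothetical, and the dictionary key is passed as a parameter because the exact key depends on the source data.

```python
import ast

def has_affiliation(raw_authors, key):
    """Return True if at least one author record has a non-empty value under `key`."""
    authors = ast.literal_eval(raw_authors)  # parses literals only, never runs code
    return any(author.get(key) for author in authors)

# Hypothetical rows, shaped like a stringified list of author dictionaries
rows = [
    "[{'name': 'A. Author', 'affiliations': ['Some University']}]",
    "[{'name': 'B. Author', 'affiliations': None}]",
]
flags = [has_affiliation(row, 'affiliations') for row in rows]
print(flags)
```

Note that this sketch treats an empty list as missing information, whereas a plain comparison with `None` would count it as present.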

## Geographical information search in abstracts

Since the information on affiliations is not sufficient, the abstracts are searched for geographical information.
The first step is to run NLP on the abstracts with the spaCy library, extracting the locations mentioned in the texts.

In [None]:
#Loading the small English spaCy pipeline
nlp = spacy.load("en_core_web_sm")

The spaCy library tags each word with its part of speech (noun, verb, adjective, etc.) and labels the groups of words that form named entities.
The label 'GPE' identifies countries, cities and states, while the label 'LOC' identifies non-GPE locations such as mountain ranges and bodies of water.
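
The selection by label can be sketched on plain `(text, label)` pairs, so that it runs without loading a spaCy model. This is only an illustration: the sample pairs and the helper name `keep_locations` are hypothetical stand-ins for the entities that spaCy exposes through `doc.ents`.

```python
import string

GEO_LABELS = {'GPE', 'LOC'}
# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def keep_locations(entities):
    """Keep the text of geographic entities, with punctuation removed."""
    return [text.translate(_STRIP_PUNCT) for text, label in entities if label in GEO_LABELS]

# Hypothetical entities, written as (text, label) pairs
entities = [('Antarctica', 'LOC'), ('2019', 'DATE'), ('U.S.', 'GPE'), ('NASA', 'ORG')]
print(keep_locations(entities))
```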

In [None]:
#Storing the list of geographic locations found with spaCy
df['Geo'] = None
#Translation table that removes punctuation from the entity text
no_punct = str.maketrans('', '', string.punctuation)
for i in tqdm(range(len(df))):
    doc = nlp(df['abstract'][i])
    local = [w.text for w in doc.ents if w.label_ in ('LOC', 'GPE')]
    if local:
        #.at avoids chained assignment and can store a list in a single cell
        df.at[i, 'Geo'] = [j.translate(no_punct) for j in local]

In [None]:
print('Percentage of papers with geospatial information in the abstract: ',round((len(df)-df['Geo'].isnull().sum())/len(df)*100,2),'%')

In [None]:
#Checkpoint
df.to_csv('df_geo.csv')

In [None]:
#Defining a helper, used later with geopy, that keeps only the country of each location (the last comma-separated element of the address)
def get_last_word(phrase):
    last_comma_index = phrase.rfind(',')
    return phrase[last_comma_index + 1:].strip()
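
Geocoder addresses are comma-separated strings that typically end with the country, which is what `get_last_word` relies on. A short check of its behaviour on hypothetical address strings follows; the function is repeated only so that the snippet runs on its own.

```python
def get_last_word(phrase):
    last_comma_index = phrase.rfind(',')
    return phrase[last_comma_index + 1:].strip()

# Hypothetical address strings in the comma-separated style returned by a geocoder
with_comma = get_last_word('Lake Vanda, Antarctica')
without_comma = get_last_word('Antarctica')
print(with_comma, without_comma)
```

When the string contains no comma, `rfind` returns -1 and the whole string is returned, so single-name results pass through unchanged.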

In [None]:
#Loading the open access Nominatim as geolocator
geolocator = Nominatim(user_agent='User', timeout=1.1)

The function defined below returns the Nominatim result for each location found with spaCy; if Nominatim finds no match, the result is `None`.
Since Nominatim enforces a rate limit, the function pauses for 1.1 seconds when it encounters an exception and then tries again.
If the request still fails after the maximum number of attempts, the function returns 'Not Available' together with a message naming the location that failed.

In [None]:
def do_geocode(address, attempt=1, max_attempts=5):
    try:
        location = geolocator.geocode(address, language='en')
        stringa = None
        return location, stringa
    except GeopyError:
        if attempt <= max_attempts:
            time.sleep(1.1)
            #Pass max_attempts along so that a custom limit is kept across retries
            return do_geocode(address, attempt=attempt+1, max_attempts=max_attempts)
        else:
            location = 'Not Available'
            stringa = f'request failed for {address}'
            return location, stringa
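
The recursive retry above can also be written as a loop, which makes the attempt limit explicit. This is a minimal sketch rather than the notebook's implementation: `call_with_retry` and `flaky_lookup` are hypothetical names, and in the notebook's setting `func` would wrap `geolocator.geocode` with `retry_on=(GeopyError,)`.

```python
import time

def call_with_retry(func, arg, max_attempts=5, pause=1.1, retry_on=(Exception,)):
    """Call func(arg), retrying on the given exceptions up to max_attempts calls.

    Returns (result, None) on success or ('Not Available', message) on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(arg), None
        except retry_on:
            if attempt < max_attempts:
                time.sleep(pause)  # wait before sending the next request
    return 'Not Available', f'request failed for {arg}'

# Hypothetical flaky lookup: fails twice, then succeeds
calls = {'n': 0}

def flaky_lookup(name):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return name.upper()

result, message = call_with_retry(flaky_lookup, 'oslo', pause=0)
print(result, message, calls['n'])
```

The return shape mirrors `do_geocode`, so a caller can keep checking the second element to decide whether to print a failure message.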

In the next section, the geopy library is used to find the country of each location identified earlier. The geocoder is Nominatim, a free service based on OpenStreetMap, which returns the coordinates together with the names. Because it tolerates continuous, repeated requests poorly, the information for each location is saved the first time it is retrieved; if the same location appears again, the saved data is reused instead of sending a new request to the API.
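
The caching idea can be sketched in isolation with a dictionary keyed by location name. The names `make_cached` and `fake_geocode` are hypothetical, and the fake lookup only records the requests it receives instead of contacting any service.

```python
def make_cached(lookup):
    """Wrap lookup so that each distinct name is requested only once."""
    cache = {}

    def cached_lookup(name):
        if name not in cache:
            cache[name] = lookup(name)  # only reached on the first occurrence
        return cache[name]

    return cached_lookup

# Hypothetical lookup that records every request it receives
requests = []

def fake_geocode(name):
    requests.append(name)
    return {'loc': name.title()}

geocode = make_cached(fake_geocode)
results = [geocode(n) for n in ['italy', 'chile', 'italy', 'italy']]
print(len(results), requests)
```

Four lookups trigger only two requests here. For hashable arguments, the standard library's `functools.lru_cache` provides the same behaviour as a decorator.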

In [None]:
Country = {}
Continent = {}
Ocean = {}
Sea = {}

In [None]:
loc_column = []
lat_column = []
lon_column = []
#Address types returned by Nominatim (must exist before the loop records them)
addresstype = {}

for i in tqdm(df['Geo']):
     if i != None:
          loc_row = []
          lat_row = []
          lon_row = []
          for j in np.unique(i):
               if j not in Country and j not in Continent and j not in Ocean and j not in Sea:
                    stringa= None
                    location, stringa = do_geocode(j, attempt=1,max_attempts=5)
                    if stringa:
                         print(stringa)
                    if location != None and location != 'Not Available':
                         addresstype[location.raw['addresstype']] = 0
                         loc_row.append(get_last_word(location.address))
                         lat_row.append(location.latitude)
                         lon_row.append(location.longitude)
                         if location.raw['addresstype'] == 'continent':
                              Continent[j] = {'loc' : get_last_word(location.address),
                                              'lat' : location.latitude,
                                              'lon' : location.longitude}
                         elif location.raw['addresstype'] == 'ocean':
                              Ocean[j] = {'loc' : get_last_word(location.address),
                                              'lat' : location.latitude,
                                              'lon' : location.longitude}
                         elif location.raw['addresstype'] == 'sea':
                              Sea[j] = {'loc' : get_last_word(location.address),
                                              'lat' : location.latitude,
                                              'lon' : location.longitude}
                         else:
                              Country[j] ={'loc' : get_last_word(location.address),
                                              'lat' : location.latitude,
                                              'lon' : location.longitude}
                    else:
                         loc_row.append(None)
                         lat_row.append(None)
                         lon_row.append(None)
               elif j in Country:
                    loc_row.append(Country[j]['loc'])
                    lat_row.append(Country[j]['lat'])
                    lon_row.append(Country[j]['lon'])
               elif j in Continent:
                    loc_row.append(Continent[j]['loc'])
                    lat_row.append(Continent[j]['lat'])
                    lon_row.append(Continent[j]['lon'])
               elif j in Ocean:
                    loc_row.append(Ocean[j]['loc'])
                    lat_row.append(Ocean[j]['lat'])
                    lon_row.append(Ocean[j]['lon'])
               elif j in Sea:
                    loc_row.append(Sea[j]['loc'])
                    lat_row.append(Sea[j]['lat'])
                    lon_row.append(Sea[j]['lon'])
          loc_column.append(loc_row)
          lat_column.append(lat_row)
          lon_column.append(lon_row)
     else:
          loc_column.append(None)
          lat_column.append(None)
          lon_column.append(None)

In [None]:
#Attaching the information to the dataset
df['location'] = loc_column
df['latitude'] = lat_column
df['longitude'] = lon_column

In [None]:
#checkpoint
df.to_csv('df_geo_loc.csv')

In [None]:
df = pd.read_csv('df_geo_loc.csv', index_col=0)
df

In [None]:
#Cleaning the location column by removing missing and duplicated values for each entry
loc_unique = []
for i in tqdm(df['location']):
    if pd.notna(i):
        #The column was reloaded from CSV, so each list is stored as a string
        parsed = eval(i)
        filtered_list = [item for item in parsed if item is not None]
        #Entries with no valid location are stored as None
        loc_unique.append(list(np.unique(filtered_list)) if filtered_list else None)
    else:
        loc_unique.append(None)
df['country'] = loc_unique
#Check the number of unique values
unique_values_list = list(set(x for sublist in df['country'] if sublist is not None for x in sublist))
len(unique_values_list)

In [None]:
#Check the address type of each location
locations = {}
add_type = []
for i in tqdm(unique_values_list):
    location, stringa = do_geocode(i, attempt=1,max_attempts=5)
    if stringa:
        print(stringa)
    locations[i] = location.raw['addresstype']
    add_type.append(location.raw['addresstype'])
np.unique(add_type)

### Cleaning the results

In the next section, the results are analysed and the places misclassified for the purpose of this search are remapped manually, which is feasible because the number of unique entries is limited. Specifically, some marine places can be associated with a specific country, and many places in Antarctica are classified by geopy individually rather than at an aggregate level.

In [None]:
repl_dict = {'Goguryeo Hill': 'Japan',
    'Mount Gauss': 'Antarctica',
    'Larsemann Hills': 'Antarctica',
    'Mount Hancox': 'Antarctica',
    'Mount Boreas': 'Antarctica',
    'Brama': 'Antarctica',
    'Grootes Peak': 'Antarctica',
    'Dome Fuji': 'Antarctica',
    'Utsteinen Nunatak': 'Antarctica',
    'Waitt Peaks': 'Antarctica',
    'Mount Palsson': 'Antarctica',
    'Usnea Plug': 'Antarctica',
    'Mayeda Peak': 'Antarctica',
    'Northern Foothills': 'Antarctica',
    'Anderson Nunataks': 'Antarctica',
    'Allan Hills': 'Antarctica',
    'The Gambia':'Gambia',
    'Potter Peninsula': 'Antarctica',
    'Byers Peninsula': 'Antarctica',
    'Antarctic Peninsula': 'Antarctica',
    'Sobral Peninsula': 'Antarctica',
    'McMurdo Station': 'Antarctica',
    'Aue': 'Germany',
    'Congo-Brazzaville':'Congo',
    'Villa Las Estrellas': 'Antarctica',
    'Fountain Creek': 'United States',
    'Aire': 'France',
    'Mount Erebus': 'Antarctica',
    'Cape Ross' : 'Philippines',
    'North Foreland' : 'United Kingdom',
    'Elk River' : 'Poland',
    'Rybnitsa' : 'Moldova',
    'Campbell Creek' : 'United States',
    'Vechtaer Moorbach' : 'Germany',
    'Poland contiguous zone': 'Poland',
    'France (contiguous area in the Gulf of Biscay and west of English Channel)': 'France',
    'France (contiguous area in the Mediterranean Sea)': 'France',
    'South Pole' : 'Antarctica',
    'Denmark Strait' : 'Denmark',
    'Marsyangdi' : 'Nepal',
    'East River' : 'United States',
    'Natural Marine Park of the Gulf of Lion': 'France',
    'Southeast Atlantic Seamounts Marine Protected Area': 'Atlantic Ocean',
    'Área Marinha Protegida do MARNA': 'Atlantic Ocean',
    'Adélie Land': 'Antarctica',
    'West Antarctica': 'Antarctica',
    'East Antarctica': 'Antarctica',
    'Victoria Land': 'Antarctica',
    'McMurdo Dry Valleys': 'Antarctica',
    'The Fleet': 'United Kingdom',
    'Bruchwetter': 'Germany',
    "Pugsley's Creek": 'United States',
    'Wrobel': 'Antarctica',
    'Porter Brook': 'United Kingdom',
    'Wissahickon Creek': 'United States',
    'Lake Bonney': 'Antarctica',
    'Lane Cove River': 'Australia',
    'Patuxent River': 'United States',
    'Lake Vanda': 'Antarctica',
    'Castenholz Pond': 'Antarctica',
    'Concordia Station': 'Antarctica',
    'Rothera Research Station': 'Antarctica',
    'Neumayer-Station III': 'Antarctica',
    'Dome Fuji Station': 'Antarctica',
    'West Antarctic Ice Sheet Divide': 'Antarctica',
    'Transantarctic Mountains': 'Antarctica',
    'Larsen C Ice Shelf' : 'Antarctica'
     }

to_drop = ['35000',
 '1086',
 '3962',
 '7262',
 'Tar',
 'Siple Dome',
 'Pisonia'
 ]

In [None]:
#Remapping the misclassified elements and dropping the invalid ones
def replace_items(lst, mapping, values_to_drop):
    if lst is not None:
        updated_list = [mapping.get(item, item) for item in lst]
        filtered_list = [item for item in updated_list if item not in values_to_drop]
        return filtered_list
    else:
        return None

df['country_adj'] = df['country'].apply(lambda x: replace_items(x, repl_dict, to_drop))
unique_values_list = list(set(x for sublist in df['country_adj'] if sublist is not None for x in sublist))
len(unique_values_list)

In [None]:
#check again the address type
locations = {}
add_type = []
for i in tqdm(unique_values_list):
    location, stringa = do_geocode(i, attempt=1,max_attempts=5)
    if stringa:
        print(stringa)
    locations[i] = location.raw['addresstype']
    add_type.append(location.raw['addresstype'])
np.unique(add_type)

In [None]:
filtered_dict = {key: value for key, value in locations.items() if value == 'city'}
filtered_dict

In the next section, the address types are remapped so that each location is labelled only as a country, a continent, a gulf, a river, a sea or an ocean. The water-related labels are kept only when the location cannot be associated with a single nation.

In [None]:
mapping_addtype = {
    'archipelago' : 'island',
    'canal' : 'river',
    'claimed_administrative' : 'country',
    'islet' : 'island',
    'land_area' : 'country',
    'strait' : 'sea',
    'water' : 'river',
    'waterway' : 'river',
    'locality' : 'continent',
    'city' : 'country'
}

mapping_remaining = {
    'Nicobar' : 'island',
    'Rio Grande' : 'river',
    'Andaman' : 'island'
}
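
A minimal sketch of how the two dictionaries above could be applied to a single location: name-specific overrides take precedence, then the address-type mapping, and otherwise the raw type is kept. The helper name `resolve_address_type` is hypothetical, the dictionaries are short excerpts, and the raw types passed in the calls are made-up inputs.

```python
def resolve_address_type(name, raw_type, type_mapping, name_overrides):
    """Name-specific overrides win, then the type mapping, else the raw type."""
    if name in name_overrides:
        return name_overrides[name]
    return type_mapping.get(raw_type, raw_type)

# Short excerpts of the two dictionaries
type_mapping = {'archipelago': 'island', 'strait': 'sea', 'city': 'country'}
name_overrides = {'Rio Grande': 'river'}

by_name = resolve_address_type('Rio Grande', 'city', type_mapping, name_overrides)
by_type = resolve_address_type('Svalbard', 'archipelago', type_mapping, name_overrides)
unchanged = resolve_address_type('Denmark', 'country', type_mapping, name_overrides)
print(by_name, by_type, unchanged)
```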

In [None]:
#Saving latitude and longitude of locations
lat = {}
lon = {}
for i in tqdm(unique_values_list):
    location, stringa = do_geocode(i, attempt=1,max_attempts=5)
    if stringa:
        print(stringa)
    lat[i] = location.latitude
    lon[i] = location.longitude

In [None]:
#Saving address type and coordinates for each location
coord_col = []
long_col = []
add_type_col = []
for i in tqdm(df['country_adj']):
    if i != None:
        coord = []
        add_type_list = []
        for j in i:
            lat_values = lat[j] if isinstance(lat[j], (list, tuple)) else [lat[j]]
            lon_values = lon[j] if isinstance(lon[j], (list, tuple)) else [lon[j]]
            coord.extend(zip(lat_values,lon_values))
            if j in mapping_remaining:
                add_type_list.append(mapping_remaining[j])
            elif locations[j] in mapping_addtype:
                add_type_list.append(mapping_addtype[locations[j]])
            else:
                add_type_list.append(locations[j])
        coord_col.append(coord)
        add_type_col.append(add_type_list)
    else:
        coord_col.append(None)
        add_type_col.append(None)

df['coord_adj'] = coord_col
df['add_type'] = add_type_col

In [None]:
unique_values_list = list(set(x for sublist in df['add_type'] if sublist is not None for x in sublist))
unique_values_list

In [None]:
add_type = []
for i in df['add_type']:
    if i != None:
        for j in i:
            add_type.append(j)
keys_location, counts_locations = np.unique(add_type, return_counts=True)
loc_counts = pd.DataFrame({'keys':keys_location,'counts':counts_locations})
loc_counts = loc_counts.sort_values(by='counts', ascending=False)
sns.barplot(data=loc_counts,x='counts',y='keys')

In [None]:
country = []
for i,k in zip(df['add_type'],df['country_adj']):
    if i != None:
        for j,z in zip(i,k):
            if j == 'country':
                country.append(z)

keys_countries, counts_countries = np.unique(country, return_counts=True)
country_counts = pd.DataFrame({'keys':keys_countries,'counts':counts_countries})
country_counts = country_counts.sort_values(by='counts', ascending=False)
sns.barplot(data=country_counts[country_counts['counts'] > 1000],x='counts',y='keys')

In [None]:
continent = []
for i,k in zip(df['add_type'],df['country_adj']):
    if i != None:
        for j,z in zip(i,k):
            if j == 'continent':
                continent.append(z)

keys_continent, counts_continent = np.unique(continent, return_counts=True)
continent_counts = pd.DataFrame({'keys':keys_continent,'counts':counts_continent})
continent_counts = continent_counts.sort_values(by='counts', ascending=False)
sns.barplot(data=continent_counts,x='counts',y='keys')

### Dataset finalisation

In the next section, the continent of each country is retrieved with geopy and pycountry_convert. The information is then saved in the dataset as follows:
- One column contains only the countries.
- One column contains the continents, either found directly in the abstract or derived from the country.
- Waters, i.e. rivers and streams.
- Oceans, seas and gulfs.

The relevant coordinates are also saved. 
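
The split into groups can be sketched for a single paper as follows. This is a simplified illustration under stated assumptions: the helper name `split_locations` and the inputs are hypothetical, and the continent derived from each country's code is left out.

```python
def split_locations(names, address_types, country_codes):
    """Split one paper's locations into countries, continents and waters."""
    countries, continents, waters = [], [], []
    for name, address_type in zip(names, address_types):
        if name in country_codes:
            countries.append(name)       # a country code is known for this name
        elif address_type == 'continent':
            continents.append(name)
        else:
            waters.append(name)          # everything else: rivers, seas, oceans, ...
    # Duplicates are removed and the result is sorted for a stable order
    return sorted(set(countries)), sorted(set(continents)), sorted(set(waters))

# Hypothetical inputs for a single paper
names = ['Italy', 'Europe', 'Adriatic Sea', 'Italy']
address_types = ['country', 'continent', 'sea', 'country']
country_codes = {'Italy': 'IT'}

groups = split_locations(names, address_types, country_codes)
print(groups)
```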

In [None]:
#Define a function to get country code with geopy from coordinates, and then the continent of each country thanks to pycountry
def get_country_code_from_coordinates(coordinates,country, attempt=1, max_attempts=5):
    latitude, longitude = coordinates
    try:
        location = geolocator.reverse((latitude, longitude), language="en")
    except GeopyError:
        if attempt <= max_attempts:
            time.sleep(1.1)
            return get_country_code_from_coordinates(coordinates,country, attempt=attempt+1, max_attempts=max_attempts)
        #All retries failed: treat the location as not found
        location = None
    # Extract country code (ISO 3166-1 alpha-2 code) from the location address
    if location == None:
        missing_class = country
        country_code = None
    else:
        country_code = (location.raw.get('address', {}).get('country_code', None))
        if country_code == None:
            missing_class = country
        else:
            country_code = country_code.upper()
            missing_class = None
    
    return country_code, missing_class
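
The extraction of the country code can be checked without any network request. In this sketch the helper name `extract_country_code` and the two payloads are hypothetical; the payloads only imitate the nested shape that the function above reads through `location.raw`.

```python
def extract_country_code(raw):
    """Return the upper-case country code from a reverse-geocoding result, or None."""
    code = raw.get('address', {}).get('country_code')
    return code.upper() if code else None

# Hypothetical payloads: one on land, one with no country information
on_land = {'address': {'country': 'Chile', 'country_code': 'cl'}}
at_sea = {'address': {}}

codes = (extract_country_code(on_land), extract_country_code(at_sea))
print(codes)
```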


In [None]:
countries_code = {}

In [None]:
# Manual mapping of countries for which pycountry mapping fails
countries_code['Palestinian Territories'] = 'PS'
countries_code['Kosovo'] = 'KS' #Kosovo has no official ISO 3166-1 code; 'KS' is used here as a placeholder
countries_code['Andaman'] = 'IN'
countries_code['Nicobar'] = 'IN'
countries_code['Signy Island'] = 'AQ'
countries_code['South Orkney Islands'] = 'AQ'
countries_code['Penguin Island'] = 'AU'
countries_code['Ascension and Tristan da Cunha'] = 'GB'
countries_code['Torgersen Island'] = 'AQ'
countries_code['Sahrawi Arab Democratic Republic'] = 'EH'
countries_code['Horseshoe Island'] = 'AQ'
countries_code['Scholander Island'] = 'AQ'
countries_code['Ross Island'] = 'AQ'
countries_code['Caroline Islands'] = 'FM'
countries_code['Alectoria Island'] = 'AQ'
countries_code['Weertman Island'] = 'AQ'
countries_code['Smith Island'] = 'US'
countries_code['Marguerite Bay'] = 'AQ'
countries_code['Shelikhov Gulf'] = 'RU'
countries_code['Petermann Island'] = 'AQ'

In [None]:
#Getting the continent for each country
missing = {}
for country_l, coord_l, add_type_l in tqdm(zip(df['country_adj'], df['coord_adj'], df['add_type'])):
    if country_l != None:
        for country, coord, add_type in zip(country_l,coord_l,add_type_l):
            if country not in countries_code and country not in missing:
                if add_type == 'country':
                    try:
                        countries_code[country] = pc.country_name_to_country_alpha2(country)
                    except KeyError:
                        missing[country] = [coord,add_type]
                elif add_type in ['continent','ocean','sea','bay']:
                    #Skip this location only; break would also skip the remaining locations of the paper
                    continue
                elif country == 'Antarctica':
                    countries_code[country] = "AQ"
                else:
                    code, missing_class = get_country_code_from_coordinates(coord,country)
                    if missing_class:
                        missing[missing_class] = [coord,add_type]
                    else:
                        countries_code[country] = code

In [None]:
missing

In [None]:
#Dividing the locations into countries, continents, rivers and streams, oceans and seas
country_only = []
continent_only = []
ocean_sea_bay = []
coord_waters = []
for country_l,coord_l, add_type_l in tqdm(zip(df['country_adj'],df['coord_adj'], df['add_type'])):
    if country_l != None:
        countries = []
        continents = []
        waters = []
        coord_w = []
        for country, coord, add_type in zip(country_l,coord_l,add_type_l):
            if country in countries_code:
                if countries_code[country] == 'AQ':
                    countries.append('Antarctica')
                    continents.append('Antarctica')
                elif countries_code[country] == 'KS':
                    countries.append('Kosovo')
                    continents.append('Europe')
                elif countries_code[country] == 'TL':
                    countries.append('Timor-Leste')
                    continents.append('Asia')
                elif countries_code[country] == 'EH':
                    countries.append('Western Sahara')
                    continents.append('Africa')
                else:
                    countries.append(pc.country_alpha2_to_country_name(countries_code[country]))
                    continents.append(pc.convert_continent_code_to_continent_name(pc.country_alpha2_to_continent_code(countries_code[country])))
            elif add_type == 'continent':
                continents.append(country)
            else:
                waters.append(country)
                coord_w.append(coord)
        if countries == []:
            country_only.append(None)
        else:
            country_only.append(list(np.unique(countries)))
        if continents == []:
            continent_only.append(None)
        else:
            continent_only.append(list(np.unique(continents)))
        if waters == []:
            ocean_sea_bay.append(None)
            coord_waters.append(None)
        else:
            ocean_sea_bay.append(list(np.unique(waters)))
            coord_waters.append(coord_w)
    else:
        country_only.append(None)
        continent_only.append(None)
        ocean_sea_bay.append(None)
        coord_waters.append(None)

In [None]:
ocean_sea_bay_un = []
for i in ocean_sea_bay:
    if i != None:
        for j in i:
            ocean_sea_bay_un.append(j)
np.unique(ocean_sea_bay_un)

In [None]:
continent_only_un = []
for i in continent_only:
    if i != None:
        for j in i:
            continent_only_un.append(j)
np.unique(continent_only_un)

In [None]:
country_only_un = []
for i in country_only:
    if i!=None:
        for j in i:
            country_only_un.append(j)
np.unique(country_only_un)

In [None]:
#Define a function to get coordinates from locations
def geolocate_coord(country, attempt=0, max_attempt=5):
    try:
        location = geolocator.geocode(country, language='en')
    except GeopyError:
        time.sleep(1.1)
        if attempt < max_attempt:
            location = geolocate_coord(country, attempt=attempt+1)
        elif attempt == max_attempt:
            time.sleep(10)
            location = geolocate_coord(country)
    return location

In [None]:
#Getting coordinates from countries
coord_countries = {}
for i in tqdm(list(np.unique(country_only_un))):
    #The ISO name of Taiwan is replaced by its short name in the geocoding query
    query = 'Taiwan' if i == 'Taiwan, Province of China' else i
    location = geolocate_coord(query)
    coord_countries[i] = (location.latitude, location.longitude)
coord_countries

In [None]:
#Getting coordinates for continents
coord_continents = {}
for i in tqdm(list(np.unique(continent_only_un))):
    location = geolocate_coord(i)
    coord_continents[i] = (location.latitude, location.longitude) 
coord_continents

In [None]:
#Attaching information to the dataset
df['country_only'] = country_only
df['continent'] = continent_only
df['Ocean_Sea_Bay'] = ocean_sea_bay
df['waters_coord'] = coord_waters
df

In [None]:
#Storing coordinates for countries to attach them to the dataset
coord_countr_col = []
for i in df['country_only']:
    if i != None:
        coord = []
        for j in i:
            coord.append(coord_countries[j])
        coord_countr_col.append(coord)
    else:
        coord_countr_col.append(None)

In [None]:
#Storing coordinates for continents to attach them to the dataset
coord_cont_col = []
for i in df['continent']:
    if i != None:
        coord = []
        for j in i:
            coord.append(coord_continents[j])
        coord_cont_col.append(coord)
    else:
        coord_cont_col.append(None)

In [None]:
df['coord_countr'] = coord_countr_col
df['coord_cont'] = coord_cont_col

In [None]:
df.to_csv('df_geo_final.csv')

In [None]:
#Load the dataset with keywords 
df_kw = pd.read_csv('df_fe.csv')

In [None]:
#Adding the keywords to the dataset with geographical information (rows are aligned on the index)
df['keywords'] = df_kw['key_words']

In [None]:
#saving the final dataset
df.to_csv('final_geo_kw.csv')