# Description

This Jupyter notebook takes a dataframe containing the coordinates of each author affiliation. It uses geopandas to find the country that contains each coordinate pair, then uses geopandas again to find the latitude and longitude of the "center" (centroid) of that country. The three new values (country name, country latitude, and country longitude) are added to the dataframe as new columns.

# Necessary Installations and Imports

In [None]:
#pip install geopandas

In [64]:
import pandas as pd # standard import
import geopandas as gpd #contains data about country boundaries and centers
from shapely.geometry import Point

# Load And Modify Dataframe

In [60]:
df = pd.read_hdf("author_references_nov22nd_v3.h5")
display(df.head())

Unnamed: 0,@path,title,abstract,author,aff,block,latitude,longitude
0,/0000-0003-1178-1001,Class of ghost-free non-Abelian gauge theories,We discuss a class of non-Abelian gauge theori...,"Frenkel, Josif","Instituto de Fisica, Universidade de São Paulo...",j.frenkel,-23.559998,-46.735252
2,/0000-0001-5974-7043,Weak nonleptonic decays of charmed hadrons in ...,We analyze the two-body weak nonleptonic decay...,"Branco, G.","Department of Physics, The City College of the...",g.branco,40.820047,-73.949272
3,/0000-0003-2257-3080,Target asymmetry in inclusive photoproduction ...,We study the target asymmetry in inclusive pio...,"Craigie, N. S.","CERN, Geneva, Switzerland",n.craigie,46.204391,6.143158
4,/0000-0003-2257-3080,A space-time description of quarks and hadrons,A more concrete formulation of the previously ...,"Craigie, N. S.","CERN, Geneva, Switzerland",n.craigie,46.204391,6.143158
5,/0000-0001-9638-3082,Observation of spatial and temporal variations...,Observations of X-ray bright points (XBP) over...,"Golub, L.","American Science and Engineering, Inc., Cambri...",l.golub,42.524182,-71.25494


We check how many NA values are in the latitude and longitude columns.

In [61]:
df.isna().sum()

@path             0
title             0
abstract          0
author            0
aff               0
block             0
latitude     398711
longitude    398711
dtype: int64

We drop all rows with NA values and check that none remain.

In [62]:
df2 = df.dropna()
df2.isna().sum()

@path        0
title        0
abstract     0
author       0
aff          0
block        0
latitude     0
longitude    0
dtype: int64
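
Calling `dropna()` with no arguments drops a row if any column is missing. Here only `latitude` and `longitude` contain NA values, so the result is the same, but restricting the check to the coordinate columns makes the intent explicit, and taking a `.copy()` avoids the pandas SettingWithCopyWarning when columns are added later. A minimal sketch on a toy dataframe (`toy` and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration
toy = pd.DataFrame({
    "aff": ["A", "B", "C"],
    "latitude": [46.2, np.nan, 40.8],
    "longitude": [6.1, np.nan, -73.9],
})

# Drop only the rows whose coordinates are missing, and take an explicit copy
toy_clean = toy.dropna(subset=["latitude", "longitude"]).copy()

# Adding a column to the explicit copy is safe
toy_clean["country"] = None
print(toy_clean.shape)  # (2, 4)
```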

### Finding Country Name from Coordinates

We first create functions to find the country name.

In [69]:
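# Load the Natural Earth low-resolution countries dataset bundled with geopandas
# (gpd.datasets is deprecated since geopandas 0.14 and was removed in 1.0)
gdf_countries = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))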

def get_country_name(latitude, longitude):
    point = Point(longitude, latitude)
    for index, country in gdf_countries.iterrows():
        if point.within(country['geometry']):
            return country['name']
    return None

# This function will get the country name for a specific row in df2 (the dataframe with na values dropped) 
def get_country_for_row(row):
    return get_country_name(row['latitude'], row['longitude'])
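
The loop above tests a point against every country polygon in turn, which is why applying it row by row is slow. As an alternative sketch (not what this notebook runs), the geopandas `sjoin` function can match all points to polygons in one call. The polygons and country names below are invented toy data so the example does not depend on the Natural Earth file; with the real data, `gdf_countries` would take the place of `toy_countries`.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy stand-ins for country polygons (hypothetical names, not real borders)
toy_countries = gpd.GeoDataFrame(
    {"name": ["Westland", "Eastland"]},
    geometry=[box(-10, -10, 0, 10), box(0, -10, 10, 10)],
    crs="EPSG:4326",
)

# Toy coordinates; the last point lies outside both polygons
toy_df = pd.DataFrame({
    "latitude": [5.0, -3.0, 50.0],
    "longitude": [-5.0, 4.0, 50.0],
})

# Build point geometries; note the (x, y) = (longitude, latitude) order
points = gpd.GeoDataFrame(
    toy_df,
    geometry=gpd.points_from_xy(toy_df["longitude"], toy_df["latitude"]),
    crs="EPSG:4326",
)

# A left join keeps every point; unmatched points get NaN for the country name
joined = gpd.sjoin(
    points, toy_countries[["name", "geometry"]], how="left", predicate="within"
)
toy_df["country"] = joined["name"]
print(toy_df)
```

Because the join is a left join, points that fall in no polygon are kept with a missing country, which matches the `None` returned by `get_country_name`.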



We try the new function on coordinates in San Francisco.

In [70]:
# Example coordinates for San Francisco
latitude = 37.7749
longitude = -122.4194

country_name = get_country_name(latitude, longitude)

print(f"The coordinates {latitude}, {longitude} correspond to {country_name}")

The coordinates 37.7749, -122.4194 correspond to United States of America


We will apply the function to the entire dataframe (df2). This cell takes several minutes to run.

In [None]:
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df2 = df2.copy()

# Apply get_country_for_row to each row in the DataFrame
df2['country'] = df2.apply(get_country_for_row, axis=1)

# Display the DataFrame with country names
display(df2[['latitude', 'longitude', 'country']])
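
Many rows share the same coordinates (the two CERN rows shown earlier, for example), so the row-by-row lookup repeats work. One way to cut the run time, sketched here under that assumption, is to look up each distinct coordinate pair once and merge the result back. `toy_lookup` is a hypothetical stand-in for `get_country_name`, and the data is invented.

```python
import pandas as pd

def toy_lookup(latitude, longitude):
    """Hypothetical stand-in for get_country_name; counts how often it is called."""
    toy_lookup.calls += 1
    return "North" if latitude >= 0 else "South"

toy_lookup.calls = 0

# Toy data with repeated coordinate pairs
toy = pd.DataFrame({
    "latitude": [46.2, 46.2, -23.6, 46.2],
    "longitude": [6.1, 6.1, -46.7, 6.1],
})

# Look up each distinct coordinate pair only once
unique_coords = toy[["latitude", "longitude"]].drop_duplicates().copy()
unique_coords["country"] = [
    toy_lookup(lat, lon)
    for lat, lon in zip(unique_coords["latitude"], unique_coords["longitude"])
]

# Broadcast the results back to every row
toy = toy.merge(unique_coords, on=["latitude", "longitude"], how="left")
print(toy_lookup.calls)  # 2 lookups for 4 rows
```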


We count how many coordinate pairs do not map to a country: 3216 of the 156251 entries (about 2.1%).

In [32]:
print(df2[df2['country'].isna()].shape)
print(df2.shape)

(3216, 9)
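
To see which affiliations account for the unmatched rows, the unmatched subset can be grouped by affiliation. This is a minimal sketch on invented data; on the real dataframe, `df2` would take the place of `toy`. A plausible cause, stated here only as an assumption, is that coastal or island locations fall just outside the low-resolution country polygons.

```python
import pandas as pd

# Toy stand-in for df2 after the country lookup; values are invented
toy = pd.DataFrame({
    "aff": ["Institute A", "Institute B", "Institute B", "Institute C"],
    "country": ["Switzerland", None, None, "Brazil"],
})

# Rows whose coordinates did not fall inside any country polygon
unmatched = toy[toy["country"].isna()]
share = len(unmatched) / len(toy)
print(f"{len(unmatched)} of {len(toy)} rows unmatched ({share:.1%})")

# Most frequent affiliations among the unmatched rows
print(unmatched["aff"].value_counts().head())
```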

### Finding Representative Country Coordinates

We first define a function that returns the center (centroid) of a country's geometry.

In [48]:
# Create a new DataFrame to store representative coordinates for each country
df_representatives = pd.DataFrame(columns=['country', 'country_latitude', 'country_longitude'])

# Function to get representative coordinates for each country
def get_representative_coordinates(row):
    country_name = row['name']
    centroid = row['geometry'].centroid
    representative_latitude, representative_longitude = centroid.y, centroid.x
    return pd.Series([country_name, representative_latitude, representative_longitude],
                     index=['country', 'country_latitude', 'country_longitude'])


                      country  country_latitude  country_longitude
0                        Fiji        -17.316309         163.853165
1                    Tanzania         -6.257732          34.752990
2                   W. Sahara         24.291173         -12.137831
3                      Canada         61.469076         -98.142381
4    United States of America         45.705628        -112.599436
..                        ...               ...                ...
172                    Serbia         44.233037          20.819652
173                Montenegro         42.789040          19.286182
174                    Kosovo         42.579367          20.895356
175       Trinidad and Tobago         10.428237         -61.330367
176                  S. Sudan          7.292890          30.198618

[177 rows x 3 columns]
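
A centroid is the area-weighted mean position of a shape, so for concave or multi-part countries it can land outside the territory; computing it directly on latitude and longitude also treats degrees as planar units, which is only approximate. The Fiji row above (longitude 163.85) may be an instance of this, since Fiji's islands lie near the 180° meridian. As a hedged alternative, shapely's `representative_point()` returns a point guaranteed to lie within the geometry. The L-shaped polygon below is invented for illustration.

```python
from shapely.geometry import Polygon

# Toy L-shaped "country"; the coordinates are invented for illustration
l_shape = Polygon([(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)])

centroid = l_shape.centroid
inside_point = l_shape.representative_point()

print(l_shape.contains(centroid))      # False: the centroid falls in the notch of the L
print(l_shape.contains(inside_point))  # True: the point lies within the shape
```

Which of the two is the better "center" depends on the use: the centroid is the usual choice for a visual midpoint, while a representative point is safer when the point must lie inside the country.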




We apply the function to each row of the countries GeoDataFrame to build the dataframe of representative coordinates.

In [None]:
# Apply the function to each row in the GeoDataFrame
df_representatives = gdf_countries.apply(get_representative_coordinates, axis=1)

# Display the DataFrame with representative coordinates for each country
display(df_representatives.head())

We merge the newly created dataframe with country names and coordinate representations with the old dataframe.

In [49]:
df_merged = pd.merge(df2, df_representatives, on='country', how='left')

                       @path  \
0       /0000-0003-1178-1001   
1       /0000-0001-5974-7043   
2       /0000-0003-2257-3080   
3       /0000-0003-2257-3080   
4       /0000-0001-9638-3082   
...                      ...   
156246  /0000-0002-0275-0927   
156247  /0000-0002-0275-0927   
156248  /0000-0002-0720-1927   
156249  /0000-0003-2882-0927   
156250  /0000-0003-2882-0927   

                                                    title  \
0          Class of ghost-free non-Abelian gauge theories   
1       Weak nonleptonic decays of charmed hadrons in ...   
2       Target asymmetry in inclusive photoproduction ...   
3          A space-time description of quarks and hadrons   
4       Observation of spatial and temporal variations...   
...                                                   ...   
156246  Defect trajectories and domain-wall loop dynam...   
156247  Direct observation of magnetic monopole defect...   
156248  Shear veins observed within anisotropic fabric...   
15624
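
A left merge keeps every row of the left dataframe, including rows with no country. Two optional `pd.merge` arguments can guard this step: `validate="many_to_one"` raises an error if a country name appears more than once on the right side (which would duplicate rows), and `indicator=True` records which rows found a match. A minimal sketch on toy data, with center values rounded from the output above:

```python
import pandas as pd

# Toy stand-ins for df2 and df_representatives
toy_rows = pd.DataFrame({
    "aff": ["A", "B", "C"],
    "country": ["Fiji", "Canada", None],
})
toy_centers = pd.DataFrame({
    "country": ["Fiji", "Canada"],
    "country_latitude": [-17.3, 61.5],
    "country_longitude": [163.9, -98.1],
})

merged = pd.merge(
    toy_rows, toy_centers,
    on="country", how="left",
    validate="many_to_one",  # error if a country is duplicated on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# The row count is unchanged, and unmatched rows are flagged as 'left_only'
print(len(merged) == len(toy_rows))
print(merged["_merge"].value_counts())
```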

We remove the rows that still have NA values, i.e. those whose coordinates did not map to a country.

In [55]:
df_final = df_merged[df_merged["country_latitude"].notna()]

We check that the modification to the dataframe was successful by seeing whether each country latitude and longitude is reasonably close to the original latitude and longitude.

In [None]:
# Display the first 10 entries
display(df_final.head(10))
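
Beyond inspecting the rows by eye, the check can be made quantitative by computing the great-circle distance between each affiliation and its country center. The sketch below uses the haversine formula and assumes a spherical Earth of radius 6371 km; `haversine_km` is a helper introduced here for illustration, and the toy coordinates are invented. For large countries, distances of a few thousand kilometres are expected.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres, assuming a spherical Earth (radius 6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Toy rows; the coordinates are invented for illustration
toy = pd.DataFrame({
    "latitude": [0.0, 10.0],
    "longitude": [0.0, 20.0],
    "country_latitude": [0.0, 10.0],
    "country_longitude": [1.0, 20.0],
})

toy["distance_km"] = haversine_km(
    toy["latitude"], toy["longitude"],
    toy["country_latitude"], toy["country_longitude"],
)
print(toy["distance_km"].round(1).tolist())  # [111.2, 0.0]
```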

# Convert Modified Dataframe to HDF

We save the final dataframe, with NA values dropped, to an HDF5 file.

In [58]:
output_filename = 'migration_dataset_with_countries_dropNA.h5'
df_final.to_hdf(output_filename, key='data', mode='w')

### If we want to keep the NA values, we can run this code instead.

In [None]:
output_filename = 'migration_dataset_with_countries.h5'
df_merged.to_hdf(output_filename, key='data', mode='w')