# Description

This notebook takes a dataframe of paths, countries, country latitudes, and country longitudes and produces a dataframe with one row per unique `@path`. Each row holds the lists of countries, country latitudes, and country longitudes recorded for that path, plus deduplicated versions of those lists (`unique_countries`, `unique_lats`, `unique_longs`) in order of first appearance.

# Necessary Imports

In [2]:
import pandas as pd

# Loading in the Data

In [4]:
df = pd.read_hdf("migration_dataset_with_countries_dropNA.h5")
display(df.head())

Unnamed: 0,@path,title,abstract,author,aff,block,latitude,longitude,country,country_latitude,country_longitude
0,/0000-0003-1178-1001,Class of ghost-free non-Abelian gauge theories,We discuss a class of non-Abelian gauge theori...,"Frenkel, Josif","Instituto de Fisica, Universidade de São Paulo...",j.frenkel,-23.559998,-46.735252,Brazil,-10.806774,-53.05434
1,/0000-0001-5974-7043,Weak nonleptonic decays of charmed hadrons in ...,We analyze the two-body weak nonleptonic decay...,"Branco, G.","Department of Physics, The City College of the...",g.branco,40.820047,-73.949272,United States of America,45.705628,-112.599436
2,/0000-0003-2257-3080,Target asymmetry in inclusive photoproduction ...,We study the target asymmetry in inclusive pio...,"Craigie, N. S.","CERN, Geneva, Switzerland",n.craigie,46.204391,6.143158,France,42.460704,-2.876697
3,/0000-0003-2257-3080,A space-time description of quarks and hadrons,A more concrete formulation of the previously ...,"Craigie, N. S.","CERN, Geneva, Switzerland",n.craigie,46.204391,6.143158,France,42.460704,-2.876697
4,/0000-0001-9638-3082,Observation of spatial and temporal variations...,Observations of X-ray bright points (XBP) over...,"Golub, L.","American Science and Engineering, Inc., Cambri...",l.golub,42.524182,-71.25494,United States of America,45.705628,-112.599436


# Create new Grouped Dataframe

We will check how many unique paths there are.

In [16]:
total_paths = df["@path"].size
unique_paths = df["@path"].unique().size
print(f"Total Paths: {total_paths}")
print(f"Unique Paths: {unique_paths}")

Total Paths: 153035
Unique Paths: 28152


Next, check how many unique authors there are.

In [22]:
total_auths = df["author"].size
unique_auths = df["author"].unique().size
print(f"Total Authors: {total_auths}")
print(f"Unique Authors: {unique_auths}")

Total Authors: 153035
Unique Authors: 40123


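Unsurprisingly, there are many more "unique" authors than unique paths: the `author` column holds free-text name strings, so one path can presumably appear under several spellings of the same name.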
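One way to look into the gap between the two counts is to count distinct author strings per path. The sketch below is only illustrative: `toy_df` and the names in it are made up, and only the column names `@path` and `author` come from the dataset. Running the same `groupby` on `df` should show which paths carry more than one spelling.

```python
import pandas as pd

# Hypothetical toy data standing in for df; only the column names match the real dataset.
toy_df = pd.DataFrame({
    "@path": ["/a", "/a", "/a", "/b", "/b"],
    "author": ["Doe, J.", "Doe, Jane", "Doe, J.", "Roe, R.", "Roe, R."],
})

# Number of distinct author strings attached to each path.
names_per_path = toy_df.groupby("@path")["author"].nunique()

# Paths whose author name is written in more than one way.
multi_name_paths = names_per_path[names_per_path > 1]
print(multi_name_paths)
```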

### Now let's use the groupby function and print a sample of the output.

In [33]:
grouped_data = df.groupby('@path')
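flag = 0    # number of groups seen so far
start = 10  # skip the first `start` groups
end = 15    # stop after the group numbered `end` (0-based) has been printed
# Display groups start..end (inclusive)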
for group_name, group_df in grouped_data:
    if flag < start:
        flag += 1
        continue
    print(f"\nGroup '@path'={group_name}:\n")
    print(group_df["country"])
    flag += 1
    if flag > end:
        break


Group '@path'=/0000-0001-5008-8619:

29268     Italy
29269     Italy
29270     Italy
43147     Italy
52301     Italy
75814     Italy
107199    Italy
107200    Italy
126589    Italy
126590    Italy
Name: country, dtype: object

Group '@path'=/0000-0001-5009-0727:

24647    Denmark
29864    Denmark
Name: country, dtype: object

Group '@path'=/0000-0001-5009-2271:

119522    Italy
Name: country, dtype: object

Group '@path'=/0000-0001-5009-3960:

45958    Taiwan
Name: country, dtype: object

Group '@path'=/0000-0001-5010-0112:

32261    Japan
68986    Japan
68987    Japan
Name: country, dtype: object

Group '@path'=/0000-0001-5010-4148:

57858    France
69517    France
83617    France
99432    France
Name: country, dtype: object


At first glance, country migration seems to be relatively rare but definitely happens often enough to be interesting. 
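The impression above can be checked by counting how many paths list more than one distinct country. This is a minimal sketch on a hypothetical `toy_df` with made-up paths; only the column names `@path` and `country` are taken from the dataset, and the same two lines of `groupby` logic could be run on `df` itself.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the real dataset.
toy_df = pd.DataFrame({
    "@path": ["/a", "/a", "/b", "/b", "/c"],
    "country": ["Italy", "Italy", "Japan", "France", "Denmark"],
})

# Distinct countries recorded for each path.
countries_per_path = toy_df.groupby("@path")["country"].nunique()

# A path counts as "migrated" here if it lists more than one country.
migrated = countries_per_path > 1
n_migrated = int(migrated.sum())
share_migrated = migrated.mean()
print(f"{n_migrated} of {migrated.size} paths list more than one country ({share_migrated:.0%})")
```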

### Now let's flatten all of the dataframes into lists of countries, latitudes, and longitudes.

#### Check out what the grouped data looks like

In [None]:
flag = 0  # reset the counter left over from the previous loop
for group_name, group_df in grouped_data:
    if flag < start:
        flag += 1
        continue
    print(f"\nGroup '@path'={group_name}:\n")
    print(group_df["country"])
    flag += 1
    if flag > end:
        break

#### Aggregate the groups

In [40]:
agg_spec = {'country': list, 'country_latitude': list, 'country_longitude': list,
            'longitude': 'count', 'latitude': lambda x: x.index.tolist()}
result_df = grouped_data.agg(agg_spec).reset_index()
# 'count' = rows per path; 'idxs' = original row indices in df (name matches the output below)
result_df = result_df.rename(columns={'longitude': 'count', 'latitude': 'idxs'})
display(result_df.head())

Unnamed: 0,@path,country,country_latitude,country_longitude,count,idxs
0,/0000-0001-5000-0736,[Portugal],[39.63404977497817],[-8.055765588295687],1,[109265]
1,/0000-0001-5000-5991,[United States of America],[45.70562800215178],[-112.5994359115045],1,[81173]
2,/0000-0001-5002-2708,"[United Kingdom, United Kingdom]","[53.91477348053706, 53.91477348053706]","[-2.8531353951805545, -2.8531353951805545]",2,"[108768, 128393]"
3,/0000-0001-5002-5685,"[France, France, France, France]","[42.46070432663372, 42.46070432663372, 42.4607...","[-2.8766966992706267, -2.8766966992706267, -2....",4,"[108433, 108434, 127968, 150519]"
4,/0000-0001-5002-827X,"[South Africa, South Africa]","[-28.947033259979115, -28.947033259979115]","[25.048013879861678, 25.048013879861678]",2,"[71238, 119696]"


### Define some functions to eliminate duplicates in country, country_latitude, and country_longitude lists

In [72]:
def eliminate_duplicates(list_item):
    """Return the items of list_item without repeats, in order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(list_item))

In [74]:
def elim_dupes_in_row(row):
    countries = eliminate_duplicates(row["country"])
    country_lats = eliminate_duplicates(row["country_latitude"])
    country_longs = eliminate_duplicates(row["country_longitude"])
    return (countries,country_lats,country_longs)
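The two functions above deduplicate each list independently, so the three results stay aligned only if no two countries share a latitude or a longitude value. A possible alternative, sketched below under that concern, deduplicates on the country name and carries the coordinates along. The helper name `unique_country_triples` and the example values are hypothetical and not part of the notebook.

```python
def unique_country_triples(countries, lats, longs):
    """Drop repeated countries while keeping each country's coordinates paired with it."""
    seen = set()
    kept = []
    for country, lat, lon in zip(countries, lats, longs):
        if country not in seen:
            seen.add(country)
            kept.append((country, lat, lon))
    if not kept:
        return [], [], []
    # Transpose the kept triples back into three parallel lists.
    uniq_countries, uniq_lats, uniq_longs = (list(col) for col in zip(*kept))
    return uniq_countries, uniq_lats, uniq_longs

# Hypothetical example: two countries that happen to share a latitude.
c, la, lo = unique_country_triples(["A", "B", "A"], [10.0, 10.0, 10.0], [1.0, 2.0, 1.0])
print(c, la, lo)
```

Like `eliminate_duplicates`, this keeps only the first appearance of each country, so a journey that returns to an earlier country is collapsed.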

### Apply the function to the resultant grouped dataframe

In [77]:
result_df[['unique_countries','unique_lats', 'unique_longs']] = result_df.apply(elim_dupes_in_row, axis=1, result_type='expand')
display(result_df.head())

# Save the new dataframe to HDF5

In [84]:
output_filename = 'author_journeys.h5'
result_df.to_hdf(output_filename, key='data', mode='w')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['@path', 'country', 'country_latitude', 'country_longitude', 'idxs',
       'unique_countries', 'unique_lats', 'unique_longs'],
      dtype='object')]

  result_df.to_hdf(output_filename, key='data', mode='w')
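The warning above is raised because the listed columns have `object` dtype and hold Python lists, which PyTables pickles. One possible workaround, shown here only as a sketch, is to encode the list columns as JSON strings before saving and decode them after loading; this may avoid the warning, though that has not been checked against this dataset. `toy_result` and `list_cols` are hypothetical stand-ins for `result_df` and its list-valued columns, and the sketch leaves out the `to_hdf` call itself.

```python
import json
import pandas as pd

# Hypothetical miniature of result_df with one list-valued column.
toy_result = pd.DataFrame({
    "@path": ["/a", "/b"],
    "unique_countries": [["Italy"], ["Japan", "France"]],
})

list_cols = ["unique_countries"]

# Encode each list as a JSON string so the column holds plain strings.
encoded = toy_result.copy()
for col in list_cols:
    encoded[col] = encoded[col].apply(json.dumps)

# After reading the file back, decode the strings into lists again.
decoded = encoded.copy()
for col in list_cols:
    decoded[col] = decoded[col].apply(json.loads)

print(decoded)
```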
