[NPR: "Where the World's Refugees Are"](http://www.npr.org/sections/goatsandsoda/2017/03/27/518217052/chart-where-the-worlds-refugees-are)

I was able to find a neat dataset at [UNHCR PopStats](http://popstats.unhcr.org/en/overview) and so I decided to play around to see if I can understand how refugee migration may have changed from my parents' time, and today.

First, since I know I'll be working with distances, I'll write a function to compute distances from latitudes and longitudes using the [haversine formula](https://en.wikipedia.org/wiki/Haversine_formula). 

In [1]:
def haversine(lon1, lat1, lon2, lat2):
    #if abs(lon1) < 180 or abs(lon2) < 180 or abs(lat1) < 90 or abs(lat2) < 90:
    #    return
    #else:
        """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees)
        """
        from math import radians, cos, sin, asin, sqrt
        # convert decimal degrees to radians 
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a)) 
        km = 6367. * c
        # convert from km to miles
        dist = km/0.621371
        return dist

To first inspect the data, I considered that there is likely a distribution of distances travelled by refugees I first considered that I figured that the seaborn's violinplot might be a good way to start visualizing the distribution of distances refugees . 

For the inputs, 

In [2]:
def vis_refugeedist(orig, orig_latlon, dest, dest_latlon):
    # first make sure that everything is of the right length!
    
    
    import seaborn as sns
    
    sns.violinplot(x=df_temp2['Origin'].astype(str), y=pd.to_numeric(df_temp2['distance']), data=df_temp2, palette="Set3", bw=.2, cut=1, linewidth=1)

    # no result for vietnam and palestine....seems to be bc string mismatch when performing merge


Now that I've written something to take in 

Let's quickly load the data using pandas, and reconcile some inconsistencies between the country names

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv('C:/Users/Dat Tien Hoang/Documents/GitHub/unhcr_distances/unhcr_popstats_export_asylum_all_data.csv', skiprows=3)
#df = pd.read_csv('C:/Users/Dat Tien Hoang/Documents/GitHub/unhcr_distances/unhcr_popstats_export_resettlement_all_data.csv', skiprows=3)
df = df.rename(index=str, columns={'Country / territory of asylum/residence':'Destination'})
# make all the words uppercase for ease
df['Destination'] = map(lambda x: x.upper(), df['Destination'])
df['Origin']      = map(lambda x: x.upper(), df['Origin'])

countries = pd.read_csv('C:\Users\Dat Tien Hoang\Documents\GitHub\unhcr_distances\countries.csv')
countries['name'] = map(lambda x: x.upper(), countries['name'])
countries.ix[countries['name'] == 'VIETNAM', 'name'] = 'VIET NAM'
countries.ix[countries['name'] == 'PALESTINIAN TERRITORIES', 'name'] = 'PALESTINIAN'

Next we'll have to join these dataframes...

In [4]:
df = df.merge(countries, how='left', left_on='Destination', right_on='name', suffixes=['_UNHCR_d', '_countries_d'])
df = df.rename(index=str, columns={'latitude':'latitude_d', 'longitude':'longitude_d'})
df = df.merge(countries, how='left', left_on='Origin', right_on='name', suffixes=['_UNHCR_o', '_countries_o'])
df = df.rename(index=str, columns={'latitude':'latitude_o', 'longitude':'longitude_o'})

In [5]:
df.head()

Unnamed: 0,Year,Destination,Origin,Population type,Value,country_UNHCR_o,latitude_d,longitude_d,name_UNHCR_o,country_countries_o,latitude_o,longitude_o,name_countries_o
0,1979,IRAN (ISLAMIC REP. OF),AFGHANISTAN,Refugees (incl. refugee-like situations),100000,,,,,AF,33.93911,67.709953,AFGHANISTAN
1,1979,PAKISTAN,AFGHANISTAN,Refugees (incl. refugee-like situations),400000,PK,30.375321,69.345116,PAKISTAN,AF,33.93911,67.709953,AFGHANISTAN
2,1980,UNITED ARAB EMIRATES,AFGHANISTAN,Refugees (incl. refugee-like situations),1500,AE,23.424076,53.847818,UNITED ARAB EMIRATES,AF,33.93911,67.709953,AFGHANISTAN
3,1980,IRAN (ISLAMIC REP. OF),AFGHANISTAN,Refugees (incl. refugee-like situations),300000,,,,,AF,33.93911,67.709953,AFGHANISTAN
4,1980,ITALY,AFGHANISTAN,Refugees (incl. refugee-like situations),191,IT,41.87194,12.56738,ITALY,AF,33.93911,67.709953,AFGHANISTAN


In [6]:
df['distance'] = [haversine(df['longitude_d'][i], df['latitude_d'][i], df['longitude_o'][i], df['latitude_o'][i]) for i in range(len(df))]
print 'done distances!'


df.ix[df['Value'] == '*', 'Value'] = '1'
df.ix[df['distance'] == np.nan, 'Value'] = '0'

done distances!


In [7]:
df.head()

Unnamed: 0,Year,Destination,Origin,Population type,Value,country_UNHCR_o,latitude_d,longitude_d,name_UNHCR_o,country_countries_o,latitude_o,longitude_o,name_countries_o,distance
0,1979,IRAN (ISLAMIC REP. OF),AFGHANISTAN,Refugees (incl. refugee-like situations),100000,,,,,AF,33.93911,67.709953,AFGHANISTAN,
1,1979,PAKISTAN,AFGHANISTAN,Refugees (incl. refugee-like situations),400000,PK,30.375321,69.345116,PAKISTAN,AF,33.93911,67.709953,AFGHANISTAN,683.704626
2,1980,UNITED ARAB EMIRATES,AFGHANISTAN,Refugees (incl. refugee-like situations),1500,AE,23.424076,53.847818,UNITED ARAB EMIRATES,AF,33.93911,67.709953,AFGHANISTAN,2869.818066
3,1980,IRAN (ISLAMIC REP. OF),AFGHANISTAN,Refugees (incl. refugee-like situations),300000,,,,,AF,33.93911,67.709953,AFGHANISTAN,
4,1980,ITALY,AFGHANISTAN,Refugees (incl. refugee-like situations),191,IT,41.87194,12.56738,ITALY,AF,33.93911,67.709953,AFGHANISTAN,7774.091293


In [None]:
# choose several origins
df_temp = df.loc[(df['Origin'] == 'VIET NAM') ]#| 
#        (df['Origin'] == 'PALESTINIAN') | 
#        (df['Origin'] == 'AFGHANISTAN') |
#        (df['Origin'] == 'UGANDA') |
#        (df['Origin'] == 'SUDAN')]
df_temp.ix[df_temp['Value'] == '*', 'Value'] = '1'
# reformatting and cleanup

df_temp['Value'] = pd.to_numeric(df_temp['Value'], errors='coerce')
#df_temp['Value'] = df_temp['Value'].fillna(1)

print 'doing df2'
#http://stackoverflow.com/questions/26777832/replicating-rows-in-a-pandas-data-frame-by-a-column-value
df_temp2 = df_temp.loc[np.repeat(df_temp.index.values, pd.to_numeric(df_temp['Value']))]
#df_temp2 = df_temp
#for i in range(len(df_temp)):
#    df_temp2=df_temp2.append(df_temp.iloc[i,:]*pd.to_numeric(df_temp[i,:]))
print 'done df2'


doing df2


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


done df2


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
#plt.figure()
sns.violinplot(x=df_temp2['Origin'].astype(str), y=pd.to_numeric(df_temp2['distance']), data=df_temp2, palette="Set3", bw=.2, cut=1, linewidth=1)
#sns.plt.show()
#plt.show()

In [None]:
df_temp2.head()
df_temp2.shape

Upon inspecting some of these intial results, I realized that my approach overlooked some complicated aspects of human migration and geography. This made it difficult to assessing refugee preference for countries nearby was difficult for two reasons. 

(1) It is difficult to determine which country is a combination of being "safest" and "easiest to get to". For example, with a water-locked country, the easiest country to get to may already be quite far. Unfavorable political situations in neighboring countries also may reduce some options for migration. In the end I figured it would be too difficult to normalize these factors. 

(2) Though the UNHCR PopStats dataset provides resettlement information, it only describes the country of origin and resettlement country while omitting the intermediate country where asylum was declared.

The violinplot itself also obscures directionality, and the many possible paths refugee migration can take. The UNHCR summarizes the refugee resettlement process in a concise way: 
![alt text](http://rsq.unhcr.org/media/hxs/query/splash-desktop.svg "Basic Schema for Refugee Resettlement") 


["Network Analysis of the Contemporary 'International Refugee System': Is There Any Structure?"](http://iussp2009.princeton.edu/papers/90854)

...think it can be illustrated better by a graph, like money flow!




[Resettlement Data Finder](http://rsq.unhcr.org/) mention with graphs, can make points relative to each other...and then do flow analysis! do people take small steps before a large one typically? ie, do we see clusters of nodes that are closer, and a cluster of nodes that are further?

[reference](https://stackoverflow.com/questions/32488772/drawing-nodes-with-coordinates-in-correct-position-using-networkx-matplotlib)


I found the NetworkX package to be very useful for graph analysis.

In [None]:
#def vis_refugeegraf(orig, orig_latlon, dest, dest_latlon):
import networkx as nx
import matplotlib.pyplot as plt

In [None]:
MDG=nx.MultiDiGraph()

MDG.add_node('Hamburg', pos=(53.5672, 10.0285))
MDG.add_node('Berlin', pos=(52.51704, 13.38792))
MDG.add_node('1',pos=(52.5,12.5))

#DG.add_weighted_edges_from([(1,2,0.5), (3,1,0.75)])
#DG.out_degree(1,weight='weight')
#DG.degree(1,weight='weight')

#DG.successors(1)
#DG.neighbors(1)

MDG.add_edge('Berlin', '1', weight=50)
# equiv to above...
#MDG.add_weighted_edges_from([('Berlin', 'Hamburg', 50)], weight=53)
MDG.add_edge('Berlin', 'Hamburg', 1, weight=5)

#https://stackoverflow.com/questions/25639169/networkx-change-color-width-according-to-edge-attributes-inconsistent-result
pos=nx.get_node_attributes(MDG,'pos')
edges = MDG.edges()
weights = [MDG[u][v][0]['weight'] for u,v in edges]
nx.draw(MDG, pos, width=weights, with_labels=True)# edge_color=colors
plt.show()

In [None]:
print MDG['Berlin']['1'][0]['weight']
#print MDG.edges()
#weights

The above graphs are pretty rudimentary, since NetworkX was designed more for graph analysis rather than graph visualization. Right now, they're looking quite yucky, but in the near future, I hope to learn to use graphviz or pygraphviz to make a prettier version.

Unfortunately, the Resettlement Data Finder only contains relatively recent data for a selection of source countries only. So for now, this also means I'm be unable to make the comparison between Vietnamese refugees and modern refugee cases as I originally intended...but at least I got to learn a lot about . idea compare shortest_path_length with path of maximum flow!

In [None]:
#from networkx.drawing.nx_agraph import graphviz_layout
#A = nx.to_agraph(MDG)
#A.layout(prog='dot')
#A.draw('test.png')