# Dynamic Refugee Matching
# Simulations

This notebook replicates the simulations performed by Andersson, Ehlers and Martinello (2018). All the necessary documentation, requirements and dependencies should be documented in the package. If you have any comments, spot a bug, or find that some documentation is missing, please let us know.

We proceed in three steps. First, we assess the performance of ``aem`` under misclassification error in the locality preference partitions. Second, we examine the performance of the algorithm when yearly quotas are split into monthly, quarterly, or half-yearly subperiods. Finally, we provide some supporting evidence for our conjecture relative to Theorem 2 in that paper: at any matching $x(k)$ selected by our proposed mechanism, envy is always bounded by a single acceptable asylum seeker *and* a single unacceptable asylum seeker.


In [5]:
import numpy as np
import scipy as sp
import pandas as pd 
pd.options.mode.chained_assignment = None
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

import time

%load_ext autoreload
%autoreload 2

## 1. Sensitivity to misclassification error in the locality partitions

In this section we focus on highlighting the dynamic properties of the algorithm while allowing for imperfect inputs to the model. In the paper, we prove that our proposed mechanism guarantees an efficient and fair allocation at every processed asylum seeker $k$, given that the ``scores`` matrix (the locality-specific partitions of acceptable and unacceptable asylum seekers) is observed.

In practice, the ex-post match quality (and thus the **scoring matrix**) needs to be estimated. As any estimation necessarily involves some estimation error, this section shows that even with mismeasured locality-specific partitions, our algorithm substantially outperforms naive, uninformed allocation mechanisms.

Our dynamic measures of **fairness** are:

1) The proportion of localities envying at least one other locality in the sample by 1 refugee
2) The proportion of localities envying at least one other locality in the sample by more than a fourth of the average number of assigned refugees per locality

Theorems 1 and 2 show that if we observed the true scoring matrix, both measures would be equal to zero under our allocation mechanism.
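As a minimal sketch of how these two proportions could be computed, assume a hypothetical square matrix `envy[i, j]` holding the number of refugees by which locality `i` envies locality `j`. The function name `envy_shares`, the matrix layout, and the strict inequality used for the second threshold are illustrative assumptions, not the package's API.

```python
import numpy as np

def envy_shares(envy, n_assigned, limit_fraction=0.25):
    """Return the two fairness measures for a hypothetical envy matrix.

    envy[i, j]  : refugees by which locality i envies locality j (assumed layout)
    n_assigned  : number of refugees assigned so far
    """
    envy = np.asarray(envy, dtype=float)
    n_localities = envy.shape[0]
    # Largest envy each locality holds towards any other locality
    max_envy = envy.max(axis=1)
    # Second threshold: a fraction of the average number assigned per locality
    envy_limit = limit_fraction * n_assigned / n_localities
    share_any = np.mean(max_envy >= 1)             # envy by at least 1 refugee
    share_large = np.mean(max_envy > envy_limit)   # envy above the threshold
    return share_any, share_large

# Three localities, 30 refugees assigned: the threshold is 0.25 * 30 / 3 = 2.5
envy = np.array([[0, 1, 0],
                 [0, 0, 0],
                 [4, 0, 0]])
share_any, share_large = envy_shares(envy, n_assigned=30)
```

In this toy case two of the three localities envy another one by at least one refugee, and only the third exceeds the larger threshold.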

Our dynamic measure of **efficiency** is the proportion of *demanded* refugees that are assigned to a municipality that considers them non-demanded. That is, the proportion of potentially good matches between a refugee and a locality that are instead realized as bad matches due to an imperfect allocation. Reallocating (ex-post) such a refugee to a demanding municipality would be a Pareto-improvement. As such, the higher this measure, the more inefficient the allocation can be considered. With perfect knowledge about the scoring matrix, our mechanism ensures that this measure is always equal to zero.
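The efficiency measure can be sketched on two binary matrices, assuming `demand[k, l] = 1` when locality `l` finds refugee `k` acceptable and `assignment[k, l] = 1` when refugee `k` is placed in locality `l`. The function name `misallocated_share` and the toy data are hypothetical and only illustrate the definition above.

```python
import numpy as np

def misallocated_share(demand, assignment):
    """Share of demanded refugees placed in a locality that does not demand them."""
    demand = np.asarray(demand)
    assignment = np.asarray(assignment)
    # A refugee is demanded if at least one locality finds them acceptable
    demanded = demand.sum(axis=1) > 0
    # A match is good if the receiving locality finds the refugee acceptable
    well_matched = (demand * assignment).sum(axis=1) > 0
    n_demanded = demanded.sum()
    if n_demanded == 0:
        return 0.0
    return float((demanded & ~well_matched).sum() / n_demanded)

demand = np.array([[1, 0],    # demanded by locality 0 only
                   [0, 0],    # non-demanded
                   [0, 1],    # demanded by locality 1 only
                   [1, 1]])   # demanded by both
assignment = np.array([[0, 1],   # refugee 0 goes to locality 1: a bad match
                       [1, 0],
                       [0, 1],
                       [1, 0]])
share = misallocated_share(demand, assignment)
```

Here three refugees are demanded and one of them is badly matched, so the measure equals one third.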

In order to assess the gains in fairness and efficiency due to the algorithm, we simulate random refugee flows calibrated to match the US and the Swedish situation. We then scramble the real scoring matrix with an increasing amount of misclassification error, and compare the resulting assignment with that of a naive sequential algorithm. 
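One simple way to picture the scrambling step is to flip each entry of a binary scoring matrix independently with a given probability. This is only an assumed error model for illustration; the function name `flip_entries` is hypothetical and the package may implement misclassification differently.

```python
import numpy as np

def flip_entries(scores, error_pct, rng):
    """Flip each binary entry of `scores` independently with probability error_pct / 100."""
    scores = np.asarray(scores)
    flips = rng.random(scores.shape) < error_pct / 100
    return np.where(flips, 1 - scores, scores)

rng = np.random.default_rng(0)
# Made-up true scoring matrix: 1000 refugees, 21 localities
true_scores = rng.integers(0, 2, size=(1000, 21))
noisy_scores = flip_entries(true_scores, 25, rng)
# Realized share of misclassified entries, close to 0.25 by construction
observed_error = np.mean(noisy_scores != true_scores)
```

The allocation mechanism would then be run on `noisy_scores`, while envy and efficiency are evaluated against `true_scores`.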

Note that empirically the distinction between demanded and non-demanded asylum seekers creates three types of refugees:
- **Refugees $\underline{D}$:** These refugees are *non-demanded*, meaning that no locality finds them acceptable
- **Refugees $\overline{D}$:** These refugees are a special case of *over-demanded* refugees, such that **all** localities find them acceptable
- **Refugees $D$:** These refugees can be either *demanded* or *over-demanded*. At least one locality finds them acceptable, and at least one locality finds them *non-acceptable*
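The three types can be read directly off the rows of a binary acceptance matrix. The sketch below assumes `demand[k, l] = 1` when locality `l` finds refugee `k` acceptable; the function name `classify_refugees` and the string labels are illustrative choices.

```python
import numpy as np

def classify_refugees(demand):
    """Label each refugee by how many localities find them acceptable."""
    demand = np.asarray(demand)
    n_accepting = demand.sum(axis=1)
    n_localities = demand.shape[1]
    return np.where(
        n_accepting == 0, "acceptable to none",
        np.where(n_accepting == n_localities, "acceptable to all", "mixed"),
    )

demand = np.array([[0, 0, 0],
                   [1, 1, 1],
                   [1, 0, 0]])
types = classify_refugees(demand)
```

The first row corresponds to type $\underline{D}$, the second to type $\overline{D}$, and the third to type $D$.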

As we argue in the paper, the algorithm works best when the proportion of refugees of type $D$, who exhibit synergies across localities (i.e., those that integrate in some localities but not others), is highest. When calibrating the asylum seeker flows we put ourselves in a worst-case scenario by minimizing the number of refugees of this type. That is, the proportion of refugees of type $\overline{D}$ in the refugee flow is given by the lowest share of refugees finding employment within 3 months (US data) and 3 years (Swedish data) across localities. The proportion of refugees of type $\underline{D}$ is given by the highest share of refugees not finding employment across localities in the same time period.
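As a quick arithmetic check of what this calibration implies, the share of type $D$ refugees is whatever remains after the other two types. The sketch below uses the AA and NA values from the parameter cell of this notebook and assumes that AA is the share of type $\overline{D}$ and NA the share of type $\underline{D}$; the names `country_shares` and `mixed_share` are hypothetical.

```python
# (AA, NA) pairs copied from the notebook's country-specific parameters
country_shares = {"us": (0.34, 0.39), "swe": (0.28, 0.45)}

def mixed_share(p_all, p_none):
    """Residual share of refugees acceptable to some but not all localities."""
    return 1 - p_all - p_none

implied = {country: round(mixed_share(*shares), 2)
           for country, shares in country_shares.items()}
print(implied)  # {'us': 0.27, 'swe': 0.27}
```

Under this reading, only about 27% of simulated refugees in either country are of the type on which the mechanism can improve.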

We begin by setting the simulation parameters:

In [24]:
np.random.seed(0)

# set general parameters
n_simulations = 1000
n_refugees = 1000

# set country-specific parameters: n_localities, [AA, NA, autocorrelation]
country_properties = {}
country_properties['us'] =  50, [0.34, 0.39, 0.5] ## AA: Science paper
country_properties['swe'] = 21, [0.28, 0.45, 0.5] ## AA: 3-years employment rate of 2013 refugee wave

# (alternative) threshold for second envy measure 
envy_limit_fraction = 0.25 

# list of misclassification errors on which to run AEM
errorlist = [0, 10, 25, 40]

country_properties.items()

dict_items([('us', (50, [0.34, 0.39, 0.5])), ('swe', (21, [0.28, 0.45, 0.5]))])

In [29]:
np.array(errorlist).shape[0]

4

In [19]:
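import datetime

from dynamic_refugee_matching.assignment import assign
import dynamic_refugee_matching.flowgen as flg
# NOTE: the loop below calls its helpers through the alias `af`, which these
# imports do not bind; `af` must refer to the module that provides
# simulate_matrix, assign_seq, assign and add_error.

def f_writetime(seconds):
    # Stand-in for the timing helper used below, which is not defined in this
    # notebook: formats elapsed seconds as H:MM:SS
    return str(datetime.timedelta(seconds=round(seconds)))

# Assignment types compared below: the naive sequential benchmark plus
# one AEM run per misclassification level in errorlist
sim_types = ['sequential'] + ['err_{0}'.format(error) for error in errorlist]

start_time = time.time()
# Running averages across simulations: three measures (envy0, envy1, effic)
# per assignment type, one row per processed refugee
output = np.zeros((n_refugees, len(sim_types)*3))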

# Dictionaries initialization
assignments = {}
mis_dem = {}

for country in ['swe','us']:
    # Unpack the country-specific parameters set above:
    # n_localities, [AA, NA, autocorrelation]
    n_localities, (p_aa, p_na, autocorrelation) = country_properties[country]
    simint = round(n_simulations/10)
    for sim in np.arange(n_simulations):
        if sim%simint==0:
            print("Country:", country, "; Simulation", sim, "of", n_simulations, 
                  ". Elapsed time:", f_writetime(time.time() - start_time))

        # simulate demand/refugee flow
        # (assumes AA is the share acceptable to all localities and
        #  NA the share acceptable to none)
        demand_matrix = af.simulate_matrix(n_refugees, n_localities, 
                                           p_nond = p_na, 
                                           p_over = p_aa, 
                                           autocorrelation = autocorrelation
                                          )

        n_demanded_refugees = np.cumsum(np.amax(demand_matrix, axis=1))

        assignments['sequential'] = af.assign_seq(demand_matrix)
        # Initialize misallocated count
        mis_dem['sequential'] = 0
        for error in errorlist:
            assignments['err_{0}'.format(error)] = af.assign(af.add_error(demand_matrix, error))
            # Initialize misallocated count
            mis_dem['err_{0}'.format(error)] = 0

        # foreach refugee
        for k in np.arange(n_refugees):
            for val, atype in enumerate(sim_types):
                # update misallocated count
                if (np.sum(demand_matrix[k])>0) and (np.sum(assignments[atype].assignment[k][:]*demand_matrix[k][:])==0):
                    mis_dem[atype] += 1
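                # calculate measures
                max_envy = np.amax(assignments[atype].get_envy(refugee=k,real_acceptance=demand_matrix), axis=1)
                # Second envy threshold. Assumption: a fraction of the average number
                # of refugees assigned per locality once refugee k has been processed
                envy_limit = envy_limit_fraction * (k + 1) / country_properties[country][0]
                envy0 = np.mean((max_envy>0))
                envy1 = np.mean((max_envy>=envy_limit))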
                if n_demanded_refugees[k]>0:
                    effic = mis_dem[atype]/n_demanded_refugees[k]
                else:
                    effic = 0

                # Update averages
                output[k,val*3 + 0] = (output[k,val*3 + 0]*(sim) + envy0)/(sim+1)
                output[k,val*3 + 1] = (output[k,val*3 + 1]*(sim) + envy1)/(sim+1)
                output[k,val*3 + 2] = (output[k,val*3 + 2]*(sim) + effic)/(sim+1)

    nameslist = []
    for atype in sim_types:
        nameslist.append('envy0_'+atype)
        nameslist.append('envy1_'+atype)
        nameslist.append('effic_'+atype)
    if country == 'swe':
        simerror_swe = pd.DataFrame(output, columns=nameslist)
        simerror_swe.to_pickle("data/simerror_swe")
    if country == 'us':
        simerror_us = pd.DataFrame(output, columns=nameslist)
        simerror_us.to_pickle("data/simerror_us")

elapsed_time = time.time() - start_time
print('Total running time: ', f_writetime(elapsed_time))
# Save running time (the file is closed automatically when the block exits)
with open('timers/misclassification.txt','a') as f:
    f.write(
        'Date: ' + str(datetime.date.today()) + '\n' +
        'Parameters: ' + '\n' + 
        '  - # simulations: ' + str(n_simulations) + '\n' +
        '  - # refugees   : ' + str(n_refugees) + '\n' +
        '  - # localities : ' + str({c: p[0] for c, p in country_properties.items()}) + '\n' +
        'Total running time: ' + f_writetime(elapsed_time) + '\n\n'
    )

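The simulation loop above stores averages across simulations with an incremental update, `(old * sim + new) / (sim + 1)`, rather than keeping every draw in memory. A small self-contained check on made-up random data that this update reproduces the ordinary mean (the array shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 hypothetical simulations, each producing 5 tracked values
draws = rng.random((200, 5))

running = np.zeros(5)
for sim, value in enumerate(draws):
    # Same update rule as in the simulation loop: at sim == 0 the previous
    # content is multiplied by zero, so the average restarts from the new value
    running = (running * sim + value) / (sim + 1)

batch_mean = draws.mean(axis=0)
```

Because the first update discards whatever the array held before, reusing the same `output` array for a second country restarts the averages rather than mixing the two sets of simulations.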