# Simulated Annealing for Microsimulation

We apply the simulated annealing algorithm to a **microsimulation** task in this short example. Microsimulation can be used to combine individual-level and aggregate datasets that are available at different spatial resolutions in order to produce a synthetic population where every individual within has a detailed set of characteristics.

This may occur in a scenario where individual-level data is available over a large area (e.g. samples from the census), but the data is needed on a smaller spatial scale. If the numbers of people with certain properties are known at a finer spatial resolution from another data source, it is possible to generate a synthetic population at a smaller spatial scale where the properties of the people within are consistent with the second data source.

The synthetic population generated by microsimulation techniques can be used in further simulations, such as agent-based models, where the characteristics of each simulated individual in the population must be known.

In [20]:
import numpy as np
import os
import pandas as pd
from MicrosimulationOptimiser import MicrosimulationOptimiser

## SimpleWorld

We'll work with a very simple dataset, "SimpleWorld", for the first part of this demonstration. The example data comes from *Spatial Microsimulation with R*, by Robin Lovelace and Morgane Dumont.

### Load the data

We'll first load two files that contain counts of people with certain attributes who reside in three separate regions. These attributes - here, age and sex - form the **constraints** of our problem.

In the **age** dataset, we have counts of people aged below 50, and 50 and above. The three rows of the table correspond to the three different regions. We can see that younger people outnumber older people by roughly two to one in regions 0 and 2, while region 1 has many more older people.

In [21]:
age = pd.read_csv(os.path.join(os.getcwd(), "..", "..", "datasets", "SimpleWorld", "age.csv"))
age

Unnamed: 0,a0.49,a.50+
0,8,4
1,2,8
2,7,4


The **sex** dataset is similarly formatted. The numbers of males and females are equal or close to equal in regions 0 and 1, while there are far more females than males in region 2.

We would expect our synthetic dataset to reflect these characteristics, giving a younger population with more females in region 2, and so on for the other regions.

In [22]:
sex = pd.read_csv(os.path.join(os.getcwd(), "..", "..", "datasets", "SimpleWorld", "sex.csv"))
sex

Unnamed: 0,m,f
0,6,6
1,4,6
2,3,8


We also have some individual-level data. Each individual has properties which match those in the constraints, along with other properties which are of interest to us. In this example, we would like to generate a synthetic population for each of the three regions where the individuals within have an exact age and an income.

In [23]:
ind = pd.read_csv(os.path.join(os.getcwd(), "..", "..", "datasets", "SimpleWorld", "ind-full.csv"))
ind

Unnamed: 0,id,age,sex,income
0,1,59,m,2868
1,2,54,m,2474
2,3,35,m,2231
3,4,73,f,3152
4,5,49,f,2473


### Prepare the optimisation problem

In this example, we use an implementation of a simulated annealing algorithm from the `simanneal` package. We use this implementation because the one in the widely used `scipy` package is deprecated, and neither it nor the similar algorithms that followed it in `scipy` can accept discrete choices for the inputs.

We initialise the class with the constraints and a list of the individuals and their attributes. We also define two functions in the class:
- `move()`: selects one person in the synthetic population and swaps their ID for an alternative from the population of individuals.
- `energy()`: computes the total absolute error, which we want to minimise during the optimisation process. The error is defined as
$$ \sum_c\sum_o |S_{c,o} - E_{c,o}| $$
where $c$ is the constraint type, $o$ is the option for that constraint, and $S$ and $E$ are the counts of these combinations of $c$ and $o$ in the synthetic and expected (constraint) populations respectively.
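
The error measure above can be sketched in a few lines of NumPy. This is only an illustration of the formula, not the code inside `MicrosimulationOptimiser`: the function name `total_absolute_error` and its arguments are hypothetical, and the seed array and counts are copied from the SimpleWorld example (region 0) so the arithmetic can be followed by hand.

```python
import numpy as np

def total_absolute_error(state, seed, constraints):
    """Sum over constraints c and options o of |S_{c,o} - E_{c,o}|."""
    error = 0
    for c, expected in enumerate(constraints):
        # Option index (for constraint c) of each person in the synthetic population
        options = seed[state, c]
        # S_{c,o}: number of synthetic individuals holding each option
        synthetic = np.bincount(options, minlength=len(expected))
        error += int(np.abs(synthetic - expected).sum())
    return error

# Seed population: one row per individual, columns are (age band, sex) option indices
seed = np.array([[1, 0], [1, 0], [0, 0], [1, 1], [0, 1]])
# Expected counts for region 0: age (<50, >=50) and sex (m, f)
constraints = [np.array([8, 4]), np.array([6, 6])]
# A candidate synthetic population: 12 row indices into the seed population
state = np.array([3, 4, 0, 1, 3, 0, 0, 1, 4, 4, 1, 2])

print(total_absolute_error(state, seed, constraints))  # prints 10
```

For this state the age counts are 4 and 8 against an expected 8 and 4 (error 8), and the sex counts are 7 and 5 against an expected 6 and 6 (error 2), giving a total of 10.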

Before we run the optimisation process, we need to reformat the table of individuals. Our seed population should have one row per person and one column for each of the constraints. The values in each column indicate which of the options for that constraint each person has.

In [24]:
age_conditions = [ind["age"] < 50, ind["age"] >= 50]
sex_conditions = [ind["sex"] == "m", ind["sex"] == "f"]

# Now create an array that holds the individuals and how their properties correspond to the constraints
ind_array = np.array([np.select(age_conditions, range(len(age_conditions)), default=None),
                      np.select(sex_conditions, range(len(sex_conditions)), default=None)]).transpose()

print("Seed population:\n {}".format(ind_array))

Seed population:
 [[1 0]
 [1 0]
 [0 0]
 [1 1]
 [0 1]]


The first column of the array indicates the age band of each person in the seed population: <50 (0) or >=50 (1). The second column indicates the sex: male (0) or female (1). The indices in this array correspond to the column used for that characteristic in each of the constraint arrays.

Before we start, we'll set a random seed for reproducibility and choose the region for which we want to generate a population.

In [25]:
np.random.seed(1)
region = 0

Now, we can prepare the optimiser. We provide the following arguments:
- An array of individuals as shown above.
- Arrays containing the constraints (also converted to numpy arrays). The order of the constraints corresponds to the column ordering in the first argument.

When we view the initial state, we can see that we have an array of individuals (by default, the same number of people as given in the constraints) with a random set of IDs, which correspond to individuals from the seed population. The optimiser will change the IDs in this array until it generates a population which is consistent with (or as close as possible to) the constraints we specified earlier.
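
As a rough illustration of what such an initial state is, the sketch below draws one seed-population index per synthetic individual, with replacement. How `MicrosimulationOptimiser` actually builds its initial state is not shown here, so treat this as an assumption; the variable names are hypothetical, and the values drawn will not match the output printed in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded so the draw is repeatable

n_seed = 5                     # individuals in the seed population
age_counts = np.array([8, 4])  # age constraint for region 0
population_size = int(age_counts.sum())  # 12 people live in region 0

# One randomly chosen seed index per synthetic individual (with replacement)
initial_state = rng.integers(0, n_seed, size=population_size)
print(initial_state)
```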

In [26]:
opt = MicrosimulationOptimiser(ind_array, age.to_numpy()[region], sex.to_numpy()[region])
print("Initial state:", opt.state)

Initial state: [3 4 0 1 3 0 0 1 4 4 1 2]


Now, let's run the simulated annealing algorithm. We change `Tmax` and `Tmin` from their defaults of 25,000 and 2.5, as the documentation for the `simanneal` package recommends ~98% acceptance of moves at `Tmax` and close to 0% improvement at `Tmin`. These values were chosen manually, and will need revising for other datasets. (In this case, the defaults still give relatively good results; comment out the next two lines to try them.)
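
To see why the temperature range matters, the sketch below evaluates the standard Metropolis acceptance probability, $\exp(-\Delta E / T)$, for a move that makes the error worse. It assumes the annealer uses this rule, and it uses an illustrative increase in error of 2 (a single swap in a two-option constraint changes that constraint's error by 0 or 2). The function name is hypothetical.

```python
import math

def acceptance_probability(delta_e, temperature):
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)

# Compare the default temperatures (25000, 2.5) with those used here (100, 0.1)
for temperature in (25000, 100, 2.5, 0.1):
    p = acceptance_probability(2, temperature)
    print(f"T = {temperature:>7}: P(accept dE = +2) = {p:.4f}")
```

Under these assumptions, a worsening move of size 2 is accepted about 98% of the time at a temperature of 100 and almost never at 0.1, whereas at the default minimum of 2.5 it would still be accepted about 45% of the time.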

In [27]:
opt.Tmax = 100
opt.Tmin = 0.1
population_ids, error = opt.anneal()

 Temperature        Energy    Accept   Improve     Elapsed   Remaining
     0.10000          0.00     1.80%     0.00%     0:00:09     0:00:00

The outputs from the annealing algorithm are a list of IDs, corresponding to individuals from our sample population who are included in our synthetic population, and the total absolute error in the population. If it is zero, our population should exactly fulfil the constraints we specified earlier.
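
To make the steps behind an annealing run concrete, here is a minimal, self-contained loop in plain NumPy. It is a sketch under stated assumptions, not the `simanneal` implementation or the `MicrosimulationOptimiser` class: the names `energy` and `anneal`, the exponential cooling schedule, and the step count are all chosen for illustration.

```python
import math
import numpy as np

def energy(state, seed, constraints):
    # Total absolute error between synthetic and expected counts
    return sum(
        int(np.abs(np.bincount(seed[state, c], minlength=len(e)) - e).sum())
        for c, e in enumerate(constraints)
    )

def anneal(seed, constraints, t_max=100.0, t_min=0.1, steps=5000, rng_seed=1):
    rng = np.random.default_rng(rng_seed)
    size = int(constraints[0].sum())
    state = rng.integers(0, len(seed), size=size)  # random initial population
    current = energy(state, seed, constraints)
    best_state, best = state.copy(), current
    for step in range(steps):
        # Exponential cooling from t_max down to t_min
        t = t_max * (t_min / t_max) ** (step / (steps - 1))
        # Move: replace one person with a randomly chosen seed individual
        pos = rng.integers(size)
        old = state[pos]
        state[pos] = rng.integers(len(seed))
        candidate = energy(state, seed, constraints)
        delta = candidate - current
        if delta > 0 and math.exp(-delta / t) < rng.random():
            state[pos] = old  # reject the worsening move
        else:
            current = candidate
            if current < best:
                best_state, best = state.copy(), current
    return best_state, best

seed = np.array([[1, 0], [1, 0], [0, 0], [1, 1], [0, 1]])
constraints = [np.array([8, 4]), np.array([6, 6])]  # region 0
best_state, best = anneal(seed, constraints)
print("IDs:", best_state, "error:", best)
```

As in the notebook, the function returns a list of seed indices and the total absolute error of the best population it found.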

In [28]:
synthetic_population = ind.iloc[population_ids]
synthetic_population

Unnamed: 0,id,age,sex,income
0,1,59,m,2868
4,5,49,f,2473
4,5,49,f,2473
4,5,49,f,2473
4,5,49,f,2473
2,3,35,m,2231
0,1,59,m,2868
0,1,59,m,2868
3,4,73,f,3152
2,3,35,m,2231


Finally, let's compare the properties of our synthetic population with the constraints for the chosen region. The properties that we used for constraints are age (under 50, or 50 and over) and sex.

In [29]:
print("Region {}".format(region))

print("\nAge")
print("Under 50: expected {}, got {}".format(age["a0.49"].iloc[region], 
                                             len(synthetic_population[synthetic_population["age"] < 50])))
print("50 and over: expected {}, got {}".format(age["a.50+"].iloc[region], 
                                                len(synthetic_population[synthetic_population["age"] >= 50])))

print("\nSex")
print("Males: expected {}, got {}".format(sex["m"].iloc[region], 
                                          len(synthetic_population[synthetic_population["sex"] == "m"])))
print("Females: expected {}, got {}".format(sex["f"].iloc[region], 
                                            len(synthetic_population[synthetic_population["sex"] == "f"])))

Region 0

Age
Under 50: expected 8, got 8
50 and over: expected 4, got 4

Sex
Males: expected 6, got 6
Females: expected 6, got 6
