# Segregation dynamics with reinforcement learning and agent based modeling

**Sert, Egemen, Yaneer Bar-Yam, and Alfredo J. Morales.  *Scientific reports* 10.1 (2020): 1-12.**

## PRELIMINARY CONCEPTS
- **Agent Based Modeling (ABM)**
    - a generative approach to study natural phenomena based on the interaction of individuals in social, physical and biological systems
- **Reinforcement Learning** 
    - a simulation method where agents become intelligent and create new, optimal behaviors based on a previously defined structure of rewards and the state of their environment
- **Multiagent Reinforcement Learning**
    - employs multiple agents
- **Schelling's Segregation Model**
    - Demonstrates that individual preferences to live away from those that are different may sort social systems in the large scale and generate patterns of social segregation without the need of centralized enforcement
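
As a reminder of the classic rule that the paper adapts, here is a minimal sketch of Schelling's satisfaction check. The 8-cell neighbourhood, the `threshold` parameter and the function name are illustrative assumptions, not taken from the paper.

```python
def is_satisfied(grid, row, col, threshold=0.5):
    """Return True if the agent at (row, col) has enough like neighbours.

    grid is a square list of lists holding "A", "B" or None (empty).
    threshold is a hypothetical minimum share of same-kind neighbours.
    """
    size = len(grid)
    me = grid[row][col]
    same = other = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the agent's own cell
            neighbour = grid[(row + dr) % size][(col + dc) % size]
            if neighbour is None:
                continue  # empty cells do not count either way
            if neighbour == me:
                same += 1
            else:
                other += 1
    occupied = same + other
    # An agent with no neighbours at all is treated as satisfied.
    return occupied == 0 or same / occupied >= threshold
```

In the original Schelling model an unsatisfied agent relocates to any free cell; segregation emerges from repeating that check and move.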

## RESEARCH GAP
- Previous studies using ABM tackled segregation and cases of integration ...
- But they were unable to explore a wide space of possible behaviours based on different types of rewards
    - Rewards are key to understand people's choices and decisions. Thus, it seems logical to incorporate them into ABMs

## CONTRIBUTIONS
- Adapt Schelling's model to RL
- Combine RL with ABM to explore the self-organizing dynamics of social segregation and the space of possible behaviours under different types of rewards

## METHOD 

#### Part 1: RL Elements

### General Idea
- Explore varying levels of segregation and integration rewards and observe the effects on the dynamics of the system

### Grid Environment
- 50 x 50 with periodic boundary conditions (wrapped environment)

<img src='grid.png' style="width: 250px"></img>
<justify><small>*Grid world of experiments. The grid size is 50×50 locations. Red and blue squares denote the two types of agents respectively. White cells represent empty regions.*</small></justify>
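
A minimal sketch of how such a wrapped grid could be set up. The 5% density per agent type comes from the critique notes further down; the cell encoding (`"A"`, `"B"`, `None`), the seed and the function names are assumptions made for illustration.

```python
import random

GRID_SIZE = 50  # 50 x 50 grid, as in the paper


def make_grid(size=GRID_SIZE, density=0.05, seed=0):
    """Place round(density * size^2) agents of each type on random cells."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(size) for c in range(size)]
    rng.shuffle(cells)
    per_type = round(density * size * size)
    grid = [[None] * size for _ in range(size)]
    for r, c in cells[:per_type]:
        grid[r][c] = "A"
    for r, c in cells[per_type:2 * per_type]:
        grid[r][c] = "B"
    return grid


def wrap(position, size=GRID_SIZE):
    """Apply periodic boundary conditions to a (row, col) position."""
    row, col = position
    return (row % size, col % size)
```

With periodic boundaries, stepping off one edge re-enters from the opposite edge, so no agent ever sits in a corner or on a border.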

### Agents
- 2 types: A & B   
- Each agent has an $n \times n$ observation window
    - $n = 2r + 1$
    - $n=11$; $r=5$

### State
- Consists of Spatial observation and Age observation
- Spatial observation ($o_{spatial}$)
    - Values in the agent's observation window (agent at the center)
        - Possible values are {1,0,-1} corresponding to (self/friend, empty, foe)
- Age observation ($o_{age}$)
    - Agent's remaining normalized life time
- State space is $O(M3^{n^2})$, if there are $M$ age values  
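
The state described above can be sketched as follows. The grid encoding (`"A"`, `"B"`, `None`) and the function names are assumptions; the window size, the {1, 0, -1} coding and the normalized age come from the notes.

```python
def spatial_observation(grid, row, col, radius=5):
    """Return the (2r+1) x (2r+1) window centred on the agent at (row, col).

    Cells are coded relative to the observing agent:
    1 = self/friend, 0 = empty, -1 = foe. The grid wraps around.
    """
    size = len(grid)
    own = grid[row][col]
    window = []
    for dr in range(-radius, radius + 1):
        line = []
        for dc in range(-radius, radius + 1):
            cell = grid[(row + dr) % size][(col + dc) % size]
            if cell is None:
                line.append(0)
            elif cell == own:
                line.append(1)
            else:
                line.append(-1)
        window.append(line)
    return window


def age_observation(remaining_age, max_age=100):
    """Remaining lifetime normalized by the maximum age."""
    return remaining_age / max_age


def state_space_size(n=11, age_values=100):
    """Size of the state space, M * 3^(n^2), with M age values."""
    return age_values * 3 ** (n * n)
```

For $n=11$ the exponent alone is 121, which is why a tabular method is impractical here.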

### Actions
- 5 possible actions
    - stay still, left, right, up or down (vs. Schelling where you can move anywhere)
- Agents take 1 action per iteration
    - The order in which agents act is drawn at random
- Agent lives for a maximum of 100 iterations
- When an agent dies, a new agent is born in a random location
- Agents extend their lifespan by interacting with agents of the opposite kind
    - Interaction happens when an agent moves to a location of another agent of different kind
    - The agent that moved receives a reward plus a lifespan extension; the other agent (the loser) dies
        - Possible interpretation: emigration of the losing agent out of the neighborhood    
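
The movement rules above can be sketched as below. The action names, the outcome labels and the helper names are illustrative assumptions; the size of the lifespan extension is not given in these notes, so it is left out.

```python
# Five actions, expressed as (row offset, column offset).
ACTIONS = {
    "stay": (0, 0),
    "left": (0, -1),
    "right": (0, 1),
    "up": (-1, 0),
    "down": (1, 0),
}


def next_position(position, action, size=50):
    """Move one cell in the chosen direction on the wrapped grid."""
    d_row, d_col = ACTIONS[action]
    return ((position[0] + d_row) % size, (position[1] + d_col) % size)


def resolve_move(mover_kind, target_kind):
    """Classify what happens when an agent moves into a different cell.

    target_kind is None for an empty cell. Not meant for the "stay" action,
    where the target cell is the agent's own.
    """
    if target_kind is None:
        return "move"          # plain relocation
    if target_kind != mover_kind:
        return "interaction"   # mover is rewarded, the other agent dies
    return "occlusion"         # same-kind cell, penalized by the reward
```

A simulation step would shuffle the agent order, pick one action per agent, and apply `resolve_move` to decide rewards and deaths.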

### Rewards
- $R = SR + IR + VR + DR + OR + TR$
- Segregation Reward (SR)
    - $SR = s - \alpha d$
    - Promotes segregation 
    - $s$: number of agents of the same kind; $d$: number of agents of a different kind; $\alpha \in [0,1]$
    - $\alpha$ is intolerance to agents of different kind (higher $\alpha$, more intolerant)
- Interdependence Reward (IR)
    - Takes values in $[0,100]$
- Vigilance Reward (VR)
    - 0.1 for each time step an agent remains alive
    - Encourages staying alive 
- Death Reward (DR)
    - -1 if the agent dies; 0 if it remains alive
- Occlusion Reward (OR)
    - -1 if the agent moves onto a cell occupied by an agent of the same kind; else 0
- Stillness Reward (TR)
    - -1 if the chosen action is to stay still; else 0
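
A sketch of how the six terms might be combined for one agent and one step. It assumes that $s$ and $d$ are counted inside the observation window and that $IR$ is paid only on the step in which an interaction happens; both are readings of these notes rather than confirmed details, and all parameter names are hypothetical.

```python
def total_reward(same_count, diff_count, alpha, interacted,
                 interdependence_reward, died, occluded, stayed_still):
    """R = SR + IR + VR + DR + OR + TR for a single agent and time step."""
    sr = same_count - alpha * diff_count                   # segregation reward
    ir = interdependence_reward if interacted else 0.0     # interdependence reward
    vr = 0.0 if died else 0.1                              # vigilance: stayed alive
    dr = -1.0 if died else 0.0                             # death penalty
    occlusion = -1.0 if occluded else 0.0                  # moved onto same kind
    tr = -1.0 if stayed_still else 0.0                     # stillness penalty
    return sr + ir + vr + dr + occlusion + tr
```

For example, with 3 friends, 2 foes and $\alpha = 0.5$, an agent that simply moves to an empty cell gets $3 - 0.5 \cdot 2 + 0.1 = 2.1$.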

## METHOD 

#### Part 2: Network

### Network Architecture
- Two neural networks, one each for agent types A and B (Fig 1 bottom)
    - for a competitive multi-agent RL environment
<img src='network_architecture.png' style="width: 750px"></img>
<justify><small>*Each type of agent has its own Deep Q-Network. Every agent has a field of view of 11×11 locations. Green border denotes the field of view of the agent illustrated in green. Agents can move across empty spaces. Two models are created for $φ_A$ and $φ_B$ respectively. Each network receives an input of 11×11 locations, runs it through five convolution steps and concatenates the resulting activations with the agent’s remaining age normalized by the maximum initial age. The feature vector is mapped over the action space using a fully connected layer. The action with the maximum Q-value is taken for the agent.*</small></justify>

### Reinforcement Learning
- Let $φ_A$ and $φ_B$ denote the **Deep Q-Networks** of type A and B agents. The goal of these networks is to satisfy the following:
<img src='network_objective.png' style="width: 300px"></img>
- $N_A$ and $N_B$ are the number of type A and B agents
- $\gamma$ is the discount factor
- $r_t$ is the reward at time $t$
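
The exact objective is in the figure above and is not reproduced here. As a hedged illustration of the quantities it involves, the sketch below computes a discounted return from a reward sequence and the standard one-step DQN target; whether the authors use exactly this target is an assumption.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a sequence of rewards."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def td_target(reward, next_q_values, gamma, done):
    """Standard one-step Q-learning target: r + gamma * max_a Q(s', a).

    If the agent died (done), there is no future value to bootstrap from.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

In a two-network setup, each of $φ_A$ and $φ_B$ would be trained on targets built only from the experience of its own agent type.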

### Experiment Parameters
- Exploration via $\epsilon$-greedy strategy
- Optimizer: Adam
- Uses experience replay to reduce temporal correlation among the network's inputs
- Runs 1 episode per experiment 
- Each experiment has 5000 iterations
- Each experiment repeated 10 times

<img src='parameters.png' style="width: 300px"></img>
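
Two of the ingredients listed above, $\epsilon$-greedy exploration and experience replay, can be sketched with the standard library alone. The class and function names, the buffer capacity and the transition layout are illustrative assumptions; the actual hyperparameters are in the table above.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks up the time ordering of transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling minibatches from the buffer, instead of training on consecutive steps, is what weakens the correlation between successive inputs.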

## RESULTS

### Segregation
<img src='segregation.png' style="width: 850px"></img>
***Notes*** 
- averaged over 1000 iterations
- $\alpha$ is intolerance to agents of different kind (higher $\alpha$, more intolerant)

#### Observations on Iteration lengths (Panel A)
- Legend
    - red - dominated by A agents
    - blue - dominated by B agents
    - white - average pattern or mixed population for small $\alpha$
        - empty for higher $\alpha$
- Observation
    - Lower $\alpha$: mixed population; as $\alpha$ increases, segregation happens 
        - happens even in first 1K iterations
    - Similar to Schelling, segregation still emerges in the long run for smaller $\alpha$ (e.g. 0.5)
    

<img src='segregation.png' style="width: 850px"></img>
<justify><small>*Agents collective behavior for multiple values of segregation reward $\alpha$ (rows) at multiple times (columns).*</small></justify>

***Notes*** 
- averaged over 1000 iterations
- $\alpha$ is intolerance to agents of different kind (higher $\alpha$, more intolerant)

#### Observations on Agent Age (Panel B)
- Legend
    - red - older average age
    - blue - younger average age
- Observation
    - Low $\alpha$: low mixing of ages; higher $\alpha$: higher mixing of ages
    - White intercluster regions have low average age
    - Segregated clusters: older agents inside, younger agents on the periphery
       - Meaning, there's little interaction across all agents (same or different type)

#### Multiscale entropy
- Higher values of $\alpha$ reach higher levels of segregation faster
- Lower values of $\alpha$ do not reach equilibrium, unlike the Schelling model, because agents are always seeking reward

<img src='entropy.png' style="width: 550px"></img>
<justify><small>*Segregation dynamics for multiple values of segregation reward (α). The curves are
obtained by averaging 50 iterations over 10 experiment realizations. Shades denote the standard deviation across experiments.*</small></justify>
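
These notes do not record the paper's exact definition of multiscale entropy. The sketch below is one plausible reading, offered as an assumption: the Shannon entropy of the A/B mix inside square blocks, averaged over several block sizes. The block sizes, the grid encoding and the function names are hypothetical.

```python
import math


def block_entropy(grid, block):
    """Mean Shannon entropy (bits) of the A/B mix over non-overlapping blocks.

    Blocks without any agent are skipped. 0 means every block is pure,
    1 means every block holds both types in equal numbers.
    """
    size = len(grid)
    entropies = []
    for r0 in range(0, size, block):
        for c0 in range(0, size, block):
            cells = [grid[r][c]
                     for r in range(r0, min(r0 + block, size))
                     for c in range(c0, min(c0 + block, size))]
            counts = (cells.count("A"), cells.count("B"))
            total = sum(counts)
            if total == 0:
                continue
            entropy = 0.0
            for count in counts:
                if count:
                    p = count / total
                    entropy -= p * math.log2(p)
            entropies.append(entropy)
    return sum(entropies) / len(entropies) if entropies else 0.0


def multiscale_entropy(grid, blocks=(2, 5, 10)):
    """Average the block entropy over several block sizes (scales)."""
    return sum(block_entropy(grid, b) for b in blocks) / len(blocks)
```

Under this reading, lower values indicate stronger segregation, since most blocks contain a single agent type.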

### Interdependencies

<img src='interdependence.png' style="width: 850px"></img>
<justify><small>*Agents collective behavior for multiple values of interdependence reward (IR) at multiple times (columns) for maximum segregation parameter ($\alpha$ = 1).*</small></justify>

***Notes*** 
- averaged over 1000 iterations
- $IR$ promotes interactions and creates interdependencies among populations

#### Observations on Iteration lengths (Panel A)
- Observation
    - With $IR = 0$, segregation results immediately
    - As $IR$ increases, areas become more uniform (mixed)

### Segregation and Interdependence

<img src='segregation_interdependence.png' style="width: 300px"></img>

- Red means higher segregation; blue means lower segregation
- Segregation is high when it is rewarded (high $\alpha$) and $IR$ is low
- As $IR$ increases, agents mix even for high values of $\alpha$
- Conclusion: High values of $IR$ counter the rewards for segregation

### Age Dynamics

<img src='age_grid.png' style="width: 800px"></img>

- Red: higher probability of finding an age group at a given level of segregation
- Older agents have significantly more segregated observation windows than younger agents (more red)
    - more pronounced for lower values of $IR$

### Biases of Actions

<img src='action_grid.png' style="width: 800px"></img>

- Certain movements are biased towards certain age groups
    - Older agents tend to stay more still  
    - Younger agents seem to explore the space further
- Reason: for older agents, rewards for other social interactions are lower than staying safe. 
- The authors checked this pattern against human behavior using Census data across the US.

## CRITIQUE

### Decisions and Tricks
- Low density of agents (greater density may result in less learning)
    - Each agent type occupies 5% of the grid cells
- maximum age = 100
- $IR$ seems to be big vs. $SR$ (factors of 25 for $IR$ vs max 12 for $SR$)
- Use DQN instead of tabular Q-learning because of the size of the state space
- 2 networks for a competitive multi-agent RL environment

### Unsupported Claims
- Weak claim: "We believe  older agents become more segregated because the expected rewards for other social interactions are lower than staying safe."

### Fit with other Papers
- Original Schelling
- Previous ABM studies tackled segregation and cases of integration, but without exploring rewards

### Ideas I disagree with 
- Agent who moved is given reward (lifespan extension); loser dies
    - Can the interaction be non-hostile? (E.g. interaction is when you move beside an agent of a different kind)
- Why allow 2 agents to co-locate?
- Why not combine the Vigilance and Death rewards?

### Points for clarification
- In order to homogenize the networks’ inputs, we normalize the observation windows by the agents’ own kind, such that positive and negative values respectively represent equal and opposite kind for each agent.
- Older agents become more segregated because the expected rewards for other social interactions are lower than staying safe. 


### Possible Experiments
- What 'breaks' am I looking for?
    - if segregation does not happen when...
        - the interaction rewarded by $IR$ is not hostile
- ABM side
    - Tweak interdependency definition 
        - No death by interaction
        - Don't allow occlusion/co-location
    - Tweak number of agents
    - Agents' age
    - White is empty or mixed?  New plot or coloring
        - **Differentiate**
- RL side
    - Play with other rewards
        - Occlusion
        - Stillness
    - Age as part of RL?
- Segregation Dynamics (Fig 3)
    - Experiment on default: 50 iterations over 10 experimental realizations

#### Experiment Setup

- Step 1: reproduce results and graphs
    - Each experiment has 5000 iterations
    - Each experiment repeated 10 times
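
A possible harness for the reproduction step, assuming a hypothetical `run_once(seed, iterations)` callable that runs one experiment and returns one metric value per iteration (for example a segregation measure). It aggregates the mean and standard deviation across realizations, matching the shaded curves in the entropy figure.

```python
import statistics


def run_experiments(run_once, realizations=10, iterations=5000):
    """Repeat an experiment and aggregate its per-iteration metric.

    run_once is a hypothetical callable: run_once(seed, iterations) -> list
    of floats, one per iteration. Returns (mean_curve, std_curve), where the
    standard deviation is taken across realizations.
    """
    curves = [run_once(seed, iterations) for seed in range(realizations)]
    mean_curve = [statistics.fmean(values) for values in zip(*curves)]
    std_curve = [statistics.pstdev(values) for values in zip(*curves)]
    return mean_curve, std_curve
```

Using the realization index as the seed keeps each of the 10 runs reproducible while still varying the initial placement.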

## QUESTIONS?