# AI for science and government (ASG) community workshop
### 6, 7 July 2022 - Birmingham
#### Demonstration session

Led by: Fernando Benitez-Paez and Hadrien Salat

**Resources and useful links:**
- **SPC Repo:** https://github.com/alan-turing-institute/uatk-spc
- **SPC website:** https://alan-turing-institute.github.io/uatk-spc/
- **Urban Analytics website:** https://www.turing.ac.uk/research/research-programmes/urban-analytics
- **Protocol buffers:**  https://developers.google.com/protocol-buffers/docs/overview


This notebook is a simple guide to using and exploring the attributes/variables included in the .pb file created by the SPC tool.


### 1. From .pb to pandas dataframe
An initial exploration of the dimensions included in the output files is always a good starting point for working with the synthetic population. If you are not a Notebook person, there is already a script that provides the same functionality from the shell (https://github.com/alan-turing-institute/uatk-spc/blob/main/python/protobuf_to_csv.py).

In [49]:
import pandas as pd
import sys
import os
sys.path.append('../')
import synthpop_pb2

In [50]:
os.getcwd()

'/Users/fbenitez/Documents/ResearchATI/ASG-SPC/uatk-spc/python/demos'

In [56]:
def convert_to_csv(input_path):
    """Export some per-person attributes to CSV."""
    # Parse the .pb file
    print(f"Reading {input_path}")
    pop = synthpop_pb2.Population()
    # The context manager closes the file even if parsing fails
    with open(input_path, "rb") as f:
        pop.ParseFromString(f.read())

    # Based on the per-person information you're interested in, you can extract
    # and fill out different columns
    people = []
    for person in pop.people:
        # The Person message doesn't directly store MSOA. Look up from their household.
        msoa11cd = pop.households[person.household].msoa11cd

        record = {
            "person_id": person.id,
            "household_id": person.household,
            "msoa11cd": msoa11cd,
            "age_years": person.demographics.age_years,
            # Protobuf enum types show up as numbers; this converts to a string
            "pwkstat": synthpop_pb2.PwkStat.Name(person.employment.pwkstat),
            "diabetes": person.health.has_diabetes,
            "employment": person.employment.sic1d07,
        }

        # Add a column for the duration the person spends doing each activity
        for pair in person.activity_durations:
            key = synthpop_pb2.Activity.Name(pair.activity) + "_duration"
            record[key] = pair.duration

        people.append(record)

    df = pd.DataFrame.from_records(people)
    return df

In [57]:
input_path = '../../data/output/west_midlands.pb'
if __name__ == "__main__":
    df = convert_to_csv(input_path)

Reading ../../data/output/west_midlands.pb
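
The function above returns a DataFrame rather than writing a file, so saving the result is a separate step. The following is a minimal sketch, assuming a small stand-in DataFrame with made-up rows and a hypothetical output file name `people.csv`. Passing `index=False` keeps the row index out of the file; otherwise it comes back as an `Unnamed: 0` column when the CSV is read again.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the DataFrame returned by convert_to_csv (illustrative values only)
sample = pd.DataFrame(
    {
        "person_id": [0, 1],
        "msoa11cd": ["E02001827", "E02001827"],
        "age_years": [76, 72],
    }
)

# Hypothetical output location; a temporary directory keeps the sketch self-contained
out_path = Path(tempfile.mkdtemp()) / "people.csv"

# index=False avoids writing the row index as an extra unnamed column
sample.to_csv(out_path, index=False)

# Read the file back to check that only the intended columns were written
round_trip = pd.read_csv(out_path)
print(round_trip.columns.tolist())
```

With the real data, `sample` would be replaced by the `df` built in the previous cell.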


In [58]:
df

Unnamed: 0,person_id,household_id,msoa11cd,age_years,pwkstat,diabetes,employment,RETAIL_duration,PRIMARY_SCHOOL_duration,SECONDARY_SCHOOL_duration,HOME_duration,WORK_duration
0,0,0,E02001827,76,RETIRED,False,3,0.000000,0.0,0.000000,0.083333,0.000000
1,1,0,E02001827,72,RETIRED,False,16,0.000000,0.0,0.000000,1.000000,0.000000
2,2,1,E02001827,52,EMPLOYEE_FT,False,17,0.000000,0.0,0.000000,0.062500,0.416667
3,3,1,E02001827,52,SELF_EMPLOYED,False,6,0.000000,0.0,0.000000,0.069444,0.354167
4,4,2,E02001827,41,HOMEMAKER,False,16,0.000000,0.0,0.000000,0.034722,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
2475913,2475913,523263,E02006901,40,HOMEMAKER,False,13,0.000000,0.0,0.000000,0.041667,0.000000
2475914,2475914,523263,E02006901,32,EMPLOYEE_FT,False,3,0.000000,0.0,0.000000,0.062500,0.506944
2475915,2475915,523263,E02006901,12,,False,0,0.000000,0.0,0.000000,0.097222,0.000000
2475916,2475916,523263,E02006901,46,EMPLOYEE_FT,False,14,0.020833,0.0,0.000000,0.159722,0.263889
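
Once the per-person table exists, a natural next step is to summarise it by area. The sketch below is one possible way to do that with plain pandas. It uses a tiny hand-made stand-in for `df` (the rows are illustrative, not taken from the real output) and the column names shown in the table above; the summary column names `people`, `mean_age` and `mean_work_duration` are assumptions chosen for this example.

```python
import pandas as pd

# Tiny stand-in for the `df` built above (values are illustrative only)
sample = pd.DataFrame(
    {
        "person_id": [0, 1, 2, 3, 4],
        "msoa11cd": ["E02001827", "E02001827", "E02001827", "E02006901", "E02006901"],
        "age_years": [76, 72, 52, 40, 32],
        "pwkstat": ["RETIRED", "RETIRED", "EMPLOYEE_FT", "HOMEMAKER", "EMPLOYEE_FT"],
        "WORK_duration": [0.0, 0.0, 0.416667, 0.0, 0.506944],
    }
)

# One row per MSOA: head count, mean age and mean WORK_duration
per_msoa = (
    sample.groupby("msoa11cd")
    .agg(
        people=("person_id", "count"),
        mean_age=("age_years", "mean"),
        mean_work_duration=("WORK_duration", "mean"),
    )
    .reset_index()
)

# Cross-tabulate employment status by MSOA
status_counts = pd.crosstab(sample["msoa11cd"], sample["pwkstat"])

print(per_msoa)
print(status_counts)
```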


### 2. Map the Venues

A second exercise a researcher might do is to map the venues or the individuals. Because SPC produces a spatially enriched synthetic population, we can also plot the synthetic venues (randomly located) and the set of synthetic individuals stored in the .pb file. If you are not a Notebook person, there is already a script that provides the same functionality from the shell (https://github.com/alan-turing-institute/uatk-spc/blob/main/python/protobuf_to_csv.py).

In [39]:
import plotly.express as px

def draw_venues(input_path):
    """Draw a dot per venue, colored by activity."""
    print(f"Reading {input_path}")
    pop = synthpop_pb2.Population()
    # The context manager closes the file even if parsing fails
    with open(input_path, "rb") as f:
        pop.ParseFromString(f.read())

    dots = []
    for activity in pop.venues_per_activity.keys():
        for venue in pop.venues_per_activity[activity].venues:
            dots.append(
                (
                    venue.location.latitude,
                    venue.location.longitude,
                    synthpop_pb2.Activity.Name(activity),
                )
            )

    # This is some public Mapbox token I copied from somewhere. It works for me
    # now, but it might eventually expire
    px.set_mapbox_access_token(
        "pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpejY4NXVycTA2emYycXBndHRqcmZ3N3gifQ.rJcFIG214AriISLbB6B5aw"
    )
    df = pd.DataFrame(dots, columns=["latitude", "longitude", "activity"])
    fig = px.scatter_mapbox(df, lat="latitude", lon="longitude", color="activity")
    fig.show()


if __name__ == "__main__":
    draw_venues(input_path)

Reading ../../data/output/west_midlands.pb
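
Before or after plotting, it can help to check the venue table itself. The sketch below builds a DataFrame in the same shape as the one created inside `draw_venues` and counts venues per activity. The coordinates are made up for illustration, and the range check is only an assumed sanity test, not something the SPC scripts do.

```python
import pandas as pd

# Illustrative (latitude, longitude, activity) tuples in the same shape as `dots`
dots = [
    (52.48, -1.90, "RETAIL"),
    (52.49, -1.88, "RETAIL"),
    (52.45, -1.93, "PRIMARY_SCHOOL"),
    (52.51, -1.85, "WORK"),
]
venues = pd.DataFrame(dots, columns=["latitude", "longitude", "activity"])

# Number of venues per activity, most frequent first
venue_counts = venues["activity"].value_counts()

# Keep only rows whose coordinates fall in the valid latitude/longitude ranges
valid = venues[
    venues["latitude"].between(-90, 90) & venues["longitude"].between(-180, 180)
]

print(venue_counts)
print(f"{len(valid)} of {len(venues)} venues have valid coordinates")
```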


In [40]:
os.getcwd()

'/Users/fbenitez/Documents/ResearchATI/ASG-SPC/uatk-spc/python/demos'

In [41]:
import geopandas as gpd
gdf = gpd.read_file('../../data/raw_data/nationaldata/MSOAS_shp')

In [59]:
gdf

Unnamed: 0,msoa11cd,objectid,MSOA11NM,MSOA11NMW,st_area,st_length,district,pop_densit,connectivi,qimd,risk,pop,lng,lat,geometry
0,E02000001,1,City of London 001,City of London 001,2.905399e+06,9024.059703,City of London,3004.309368,8.504667e+07,20984.032258,Medium,8730.0,-0.092128,51.514822,"MULTIPOLYGON (((532135.145 182198.119, 532158...."
1,E02000002,2,Barking and Dagenham 001,Barking and Dagenham 001,2.165634e+06,8152.697593,Barking and Dagenham,3571.486507,5.563570e+06,6799.045455,Medium,7730.0,0.139476,51.588273,"POLYGON ((548881.563 190845.265, 548881.125 19..."
2,E02000003,3,Barking and Dagenham 002,Barking and Dagenham 002,2.143565e+06,9118.449453,Barking and Dagenham,5160.362933,8.226083e+06,10862.466667,Medium,11060.0,0.140898,51.574927,"POLYGON ((549102.438 189324.625, 548954.500 18..."
3,E02000004,4,Barking and Dagenham 003,Barking and Dagenham 003,2.490215e+06,8207.610394,Barking and Dagenham,2641.852129,5.097344e+06,12579.210526,Medium,6580.0,0.176828,51.555477,"POLYGON ((551549.998 187364.637, 551478.000 18..."
4,E02000005,5,Barking and Dagenham 004,Barking and Dagenham 004,1.186180e+06,6964.961665,Barking and Dagenham,8610.609767,7.726827e+06,8153.269231,Medium,10210.0,0.143324,51.561421,"POLYGON ((549099.634 187656.076, 549161.375 18..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7196,W02000419,7197,Denbighshire 017,Sir Ddinbych 017,2.707320e+06,9278.657236,Denbighshire,,,,,,-3.465492,53.315860,"POLYGON ((302972.647 381406.150, 303008.064 38..."
7197,W02000420,7198,Wrexham 020,Wrecsam 020,7.918522e+07,68543.495017,Wrexham,,,,,,-2.936473,53.078813,"POLYGON ((335857.094 359909.500, 335987.000 35..."
7198,W02000421,7199,Ceredigion 011,Ceredigion 011,7.694548e+08,186394.088753,Ceredigion,,,,,,-3.881899,52.301207,"POLYGON ((281652.906 291392.187, 281771.937 29..."
7199,W02000422,7200,Cardiff 048,Caerdydd 048,5.011074e+06,23067.155678,Cardiff,,,,,,-3.154343,51.464327,"MULTIPOLYGON (((319551.523 175706.412, 319771...."


In [45]:
gdf.rename(columns = {'MSOA11CD':'msoa11cd'}, inplace = True)

In [60]:
df_inner = pd.merge(df, gdf, on='msoa11cd', how='inner')

In [63]:
df_inner

Unnamed: 0,person_id,household_id,msoa11cd,age_years,pwkstat,diabetes,employment,RETAIL_duration,PRIMARY_SCHOOL_duration,SECONDARY_SCHOOL_duration,...,st_length,district,pop_densit,connectivi,qimd,risk,pop,lng,lat,geometry
0,0,0,E02001827,76,RETIRED,False,3,0.000000,0.0,0.000000,...,9627.724265,Birmingham,2688.248422,1.410654e+06,20670.000000,Medium,6240.0,-1.842717,52.599474,"POLYGON ((411756.668 301222.002, 411677.496 30..."
1,1,0,E02001827,72,RETIRED,False,16,0.000000,0.0,0.000000,...,9627.724265,Birmingham,2688.248422,1.410654e+06,20670.000000,Medium,6240.0,-1.842717,52.599474,"POLYGON ((411756.668 301222.002, 411677.496 30..."
2,2,1,E02001827,52,EMPLOYEE_FT,False,17,0.000000,0.0,0.000000,...,9627.724265,Birmingham,2688.248422,1.410654e+06,20670.000000,Medium,6240.0,-1.842717,52.599474,"POLYGON ((411756.668 301222.002, 411677.496 30..."
3,3,1,E02001827,52,SELF_EMPLOYED,False,6,0.000000,0.0,0.000000,...,9627.724265,Birmingham,2688.248422,1.410654e+06,20670.000000,Medium,6240.0,-1.842717,52.599474,"POLYGON ((411756.668 301222.002, 411677.496 30..."
4,4,2,E02001827,41,HOMEMAKER,False,16,0.000000,0.0,0.000000,...,9627.724265,Birmingham,2688.248422,1.410654e+06,20670.000000,Medium,6240.0,-1.842717,52.599474,"POLYGON ((411756.668 301222.002, 411677.496 30..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2475913,2475913,523263,E02006901,40,HOMEMAKER,False,13,0.000000,0.0,0.000000,...,11009.945799,Birmingham,6585.877529,5.999762e+05,3767.235294,Medium,15670.0,-1.855830,52.452400,"POLYGON ((409652.121 285157.000, 409778.716 28..."
2475914,2475914,523263,E02006901,32,EMPLOYEE_FT,False,3,0.000000,0.0,0.000000,...,11009.945799,Birmingham,6585.877529,5.999762e+05,3767.235294,Medium,15670.0,-1.855830,52.452400,"POLYGON ((409652.121 285157.000, 409778.716 28..."
2475915,2475915,523263,E02006901,12,,False,0,0.000000,0.0,0.000000,...,11009.945799,Birmingham,6585.877529,5.999762e+05,3767.235294,Medium,15670.0,-1.855830,52.452400,"POLYGON ((409652.121 285157.000, 409778.716 28..."
2475916,2475916,523263,E02006901,46,EMPLOYEE_FT,False,14,0.020833,0.0,0.000000,...,11009.945799,Birmingham,6585.877529,5.999762e+05,3767.235294,Medium,15670.0,-1.855830,52.452400,"POLYGON ((409652.121 285157.000, 409778.716 28..."
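
The person-level merge above repeats each MSOA's attributes and geometry once per resident. When only area-level figures are needed, one alternative is to aggregate people per MSOA first and then join one row per area. The sketch below illustrates this with plain pandas and small stand-in tables; the values and the summary column names are assumptions for this example. With the real data, using the GeoDataFrame as the left table (`gdf.merge(...)`) should keep the result a GeoDataFrame, whereas `pd.merge(df, gdf)` returns a plain DataFrame.

```python
import pandas as pd

# Stand-in for the per-person table (illustrative values only)
people = pd.DataFrame(
    {
        "person_id": [0, 1, 2, 3],
        "msoa11cd": ["E02001827", "E02001827", "E02006901", "E02006901"],
        "age_years": [76, 72, 40, 32],
    }
)

# Stand-in for the MSOA attribute table (the real `gdf` also carries geometry)
msoas = pd.DataFrame(
    {
        "msoa11cd": ["E02001827", "E02006901", "E02000001"],
        "district": ["Birmingham", "Birmingham", "City of London"],
    }
)

# Collapse to one row per MSOA before joining
per_msoa = people.groupby("msoa11cd", as_index=False).agg(
    people=("person_id", "count"),
    mean_age=("age_years", "mean"),
)

# validate="one_to_one" raises an error if either side has duplicate MSOA codes
merged = msoas.merge(per_msoa, on="msoa11cd", how="inner", validate="one_to_one")

print(merged)
```

The inner join drops MSOAs with no synthetic residents, which matches the `how='inner'` choice used above.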
