# Identify employment centers

This notebook covers the identification of employment centers presented in Section 4.2 of the published paper.

## Dependencies and Data

In [None]:
%matplotlib inline

import tools, time, os, sys, traceback
import multiprocessing as mp
import pandas as pd
import numpy as np
import geopandas as gpd
from sqlalchemy import create_engine

from sklearn.cluster import DBSCAN
from tools import ADBSCAN

In [None]:
linux = '/media/dani/baul/'
macos = '/Users/dani/'
server = '/home/jovyan/host/'
db_path = server + 'Dropbox/Cadastre/01 Catastro maps/sqlite_db/cadastro.db'
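# `db_path` is absolute (starts with '/'), so three slashes here yield
# the four-slash form SQLAlchemy expects for absolute SQLite paths
engine = create_engine('sqlite:///' + db_path)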

* Load point data

In [None]:
varlist=['X', 'Y', 'geometry', 'localId', 'numberOfBuildingUnits']
%time db = pd.read_sql(("SELECT X, Y, localId, numberOfBuildingUnits, numberOfDwellings "\
                        "FROM cadastro"), engine)

* Load solution data

In [None]:
%time solu = pd.read_parquet(('../output/revision/solution_'\
                              'rep1000_eps2000_mp2000_thr90.parquet'))\
               .set_index('id')

To stay consistent with the structure of the older code, convert `pct` from percentages to proportions:

In [None]:
solu['pct'] = solu['pct'] / 100

* Join

In [None]:
%%time
one = db.join(solu[['lbls']])

* Non-residential points

In [None]:
emp = db['numberOfBuildingUnits'] - db['numberOfDwellings']
emp = pd.DataFrame({'emp': emp[emp > 0]})\
        .join(db)\
        .join(solu[['lbls']])
emp.info()

In [None]:
emp.reset_index()\
   .to_parquet('emp.parquet')

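With the above, `emp.parquet` contains one row per building that has at least one non-residential unit, with its total number of building units, its number of dwellings (residential units), and its number of non-residential units (`emp`).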
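As a minimal sketch of the filtering logic above, the following toy example reproduces the same steps on three made-up buildings. The column names match the notebook; the index labels and values are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for the cadastre table (values are made up)
db = pd.DataFrame(
    {
        "numberOfBuildingUnits": [10, 4, 6],
        "numberOfDwellings": [7, 4, 0],
    },
    index=["a", "b", "c"],
)

# Non-residential units = all building units minus dwellings
non_res = db["numberOfBuildingUnits"] - db["numberOfDwellings"]

# Keep only buildings with at least one non-residential unit, then
# attach the original columns by index
emp = pd.DataFrame({"emp": non_res[non_res > 0]}).join(db)
print(emp)
```

Building `b` is dropped because all of its units are dwellings, so only `a` and `c` remain in `emp`.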

## Employment center identification

Since identification in each city is independent of the others, this is an ["embarrassingly parallel"](https://en.wikipedia.org/wiki/Embarrassingly_parallel) task. To process several cities at the same time, we first split the points into one table per city and then distribute those tables across a pool of worker processes (one fewer than the number of cores on the machine).
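The pattern can be sketched on toy data as follows. This is an illustration under stated assumptions, not the notebook's implementation: `split_by_label` and `summarise` are hypothetical helpers, the data are made up, and `ThreadPool` stands in for `mp.Pool` (it exposes the same `map` interface) so the snippet runs in any context without pickling concerns. Label `-1` is treated as unassigned, mirroring the filter the notebook applies to `lbls`.

```python
from multiprocessing.pool import ThreadPool


def split_by_label(rows, noise=-1):
    """Group (label, value) pairs by label, skipping the noise label."""
    groups = {}
    for label, value in rows:
        if label != noise:
            groups.setdefault(label, []).append(value)
    return groups


def summarise(item):
    """Hypothetical per-city task: it only ever sees one city's data."""
    label, values = item
    return label, sum(values)


# Made-up (city label, value) pairs; -1 marks points outside any city
rows = [(0, 3), (1, 5), (0, 2), (-1, 9), (1, 1)]
groups = split_by_label(rows)

# Each city is handled independently, so the work maps cleanly onto a pool.
# map() returns results in the same order as its input.
with ThreadPool(2) as pool:
    results = pool.map(summarise, list(groups.items()))
print(results)
```

Because no task depends on another, the only coordination needed is collecting the results at the end.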

* Set up parameters

In [None]:
dens, eps, min_pts = tools.dbs_params(eps=250, dens=0.00025)
print("Dens: %.4f | eps: %.4f | min_pts: %i"\
      %(dens, eps, min_pts))

* Chunk points by city (in parallel)

In [None]:
def picker(i):
    return emp.loc[emp['lbls']==i, :]
pool = mp.Pool(mp.cpu_count() - 1)

city_ids = [i for i in emp['lbls'].unique() \
            if i!=-1]
%time cities = pool.map(picker, city_ids)
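An alternative way to build the per-city tables, sketched here on made-up data, is a single `groupby` pass instead of one boolean filter per city. This is not what the notebook does; it is shown only as a possible simplification, and it assumes the same `lbls` column with `-1` marking points outside any city.

```python
import pandas as pd

# Toy stand-in for `emp` (values are made up)
emp = pd.DataFrame({"lbls": [0, 1, 0, -1, 1], "emp": [3, 5, 2, 9, 1]})

# Drop unassigned points, then split once by city label
labelled = emp[emp["lbls"] != -1]
city_ids, cities = zip(*[(i, tab) for i, tab in labelled.groupby("lbls")])

print(city_ids)
print(cities[0])
```

The resulting `cities` and `city_ids` are aligned tuples, so they can be zipped together in the same way as the lists used in the next step.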

* Identify centers (in parallel)

In [None]:
reps = 1000

def identifier(tab_id):
    # Unpack the (city table, city label) pair produced by `zip` below
    tab, c_id = tab_id
    cnts = tools.identify_city_centres(tab,
                                       log_file=f'log_centre_identification_rep{reps}.txt',
                                       dens=dens,
                                       eps=eps,
                                       min_pts=min_pts,
                                       reps=reps)
    # Tag every centre with the city it belongs to
    cnts['city_id'] = c_id
    return cnts

pool = mp.Pool(mp.cpu_count() - 1)

%time cnts_all = pool.map(identifier, \
                          zip(cities, city_ids))

pd.concat(cnts_all)\
  .to_file(f'rep{reps}_mp2000_eps2000_centres_adbscan.gpkg',
           driver='GPKG')

The output of these computations is a single GeoPackage file (`rep1000_mp2000_eps2000_centres_adbscan.gpkg`) with the polygon boundaries for all the employment centers in our cities.