## Urban Africa

This notebook uses the Africapolis 2015 data to add a new column to a CSV dataset of longitudes and latitudes, specifying whether each longitude/latitude pair lies in an urban area. It does this by checking whether each point lies inside any of the polygons that define the borders of African cities in the Africapolis dataset (https://africapolis.org/data). The dataset is downloaded automatically if it is not already present in the specified directory.
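
To make the containment check concrete, here is a minimal pure-Python sketch of the same idea. It is an illustration only: the notebook itself relies on geopandas' `contains`, whereas this sketch swaps in a plain ray-casting test, and the function names `point_in_polygon` and `is_urban` as well as the two square "cities" are made up for the example.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test. `polygon` is a list of (x, y) vertices (hypothetical format)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x-coordinate at which this edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def is_urban(lon, lat, polygons):
    """True if the point lies inside at least one polygon."""
    return any(point_in_polygon(lon, lat, p) for p in polygons)


# Two made-up square "cities", not real Africapolis shapes
cities = [
    [(0, 0), (2, 0), (2, 2), (0, 2)],
    [(10, 10), (12, 10), (12, 12), (10, 12)],
]
print(is_urban(1.0, 1.0, cities))  # True: inside the first square
print(is_urban(5.0, 5.0, cities))  # False: between the two squares
```

Every point is tested against every polygon, which is why the runtime grows with both the number of rows in the dataset and the number of city polygons.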

Depending on the size of the dataset and your processing power, this can take some time. If you are using Google Colab, multiprocessing may not be worth enabling, as by default you will only have 2 CPUs, although for a very large dataset two CPUs may still beat one. On my own laptop (12 cores), multiprocessing with 6 processes cut the runtime for a relatively small dataset (~5000 entries) from 2m 48s to 1m 30s, a speed-up of roughly 1.9×, so it is likely worth using for large datasets. Using a few more or fewer processes makes little difference, so the default is half of the available processors.
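
The "half of the available processors" default and the idea of splitting the rows between workers can be sketched as below. This is a simplified illustration that assumes the rows sit in a plain list; the notebook leaves the actual partitioning to dask, and the helper names `default_workers` and `split_into_chunks` are hypothetical.

```python
import os


def default_workers():
    """Half of the available CPUs, but never fewer than one."""
    # os.cpu_count() may return None, so fall back to 2 in that case
    return max(1, (os.cpu_count() or 2) // 2)


def split_into_chunks(rows, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(rows)))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


print(default_workers())
print(split_into_chunks(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Integer division with a floor of one avoids a pitfall of rounding: `np.round(0.5)` is `0.0`, so halving and rounding the CPU count would yield zero workers on a single-CPU machine.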

In [None]:
## If using Google Colab, run this block to access data in your Google Drive.
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
## If using Google Colab, these packages will need to be installed
!pip install geopandas
!pip install pdl
## If not, install whichever third-party packages you are missing by uncommenting the relevant lines
#!pip install pandas
#!pip install tqdm
#!pip install "dask[dataframe]"
#!pip install numpy
## multiprocessing, os, sys, time and threading are part of the Python standard library and need no installation

In [1]:
import pandas as pd
import geopandas as gpd
import tqdm
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import multiprocessing
import numpy as np
from pdl import pdl
import os
import sys
import time
import threading

### Running the application

This block runs the application. You will be prompted for the file paths and other necessary information. The data will then be processed and saved as `data_isurban.csv` in the directory you provided.

In [2]:
class SpinnerThread(threading.Thread):

    def __init__(self):
        super().__init__(target=self._spin)
        self._stopevent = threading.Event()

    def stop(self):
        self._stopevent.set()

    def _spin(self):

        while not self._stopevent.is_set():
            for t in '|/-\\':
                sys.stdout.write(t)
                sys.stdout.flush()
                time.sleep(0.5)
                sys.stdout.write('\b')


def containment_tests(data, shapes, long_name='longitude', lat_name='latitude'):
  """Return a boolean Series: True where a row's point lies inside any polygon in `shapes`."""
  spinner_thread = SpinnerThread()
  spinner_thread.start()
  data = pd.DataFrame(data)
  # Build one Point per row from the longitude/latitude columns
  points = gpd.GeoDataFrame(
      data.loc[:, [long_name, lat_name]],
      geometry=gpd.points_from_xy(data.loc[:, long_name], data.loc[:, lat_name]))
  polys = shapes.geometry  # GeoSeries of city polygons
  # buffer(0) is a common trick for repairing invalid polygons before testing containment
  containment_checker = polys.geometry.buffer(0).contains
  tqdm.tqdm.pandas(position=0, leave=True)
  spinner_thread.stop()
  # One row per point, one column per polygon
  r = points.geometry.progress_apply(containment_checker)
  # A point is urban if it falls inside at least one polygon
  return r.any(axis=1)

def multi_process_containment_tests(data, shapes, long_name='longitude', lat_name='latitude',
                                    cores=max(1, multiprocessing.cpu_count() // 2)):
  """Same test as containment_tests, with the points split into `cores` dask partitions."""
  # Default is half the CPUs with a floor of 1; np.round(0.5) is 0, which would give zero partitions on one CPU
  spinner_thread = SpinnerThread()
  spinner_thread.start()
  data = pd.DataFrame(data)

  # Build one Point per row from the longitude/latitude columns
  points = gpd.GeoDataFrame(
      data.loc[:, [long_name, lat_name]],
      geometry=gpd.points_from_xy(data.loc[:, long_name], data.loc[:, lat_name]))
  polys = shapes.geometry  # GeoSeries of city polygons
  containment_checker = polys.geometry.buffer(0).contains
  spinner_thread.stop()
  with ProgressBar():
    # Each partition is checked in its own process; a point is urban if any polygon contains it
    r = dd.from_pandas(points.geometry, npartitions=cores).map_partitions(
        lambda dframe: pd.Series(np.any(dframe.apply(containment_checker), axis=1)),
        meta=pd.Series(dtype=bool)).compute(scheduler='processes')
  return r



if __name__ == '__main__':
  def yes_no(question):
    yes = {'yes', 'y'}
    no = {'no', 'n'}

    while True:
      choice = input(question).lower()
      if choice in yes:
        return True
      elif choice in no:
        return False
      else:
        print('Please respond with y/n.')
  
  def get_directory(question, error_message, prev_path=''):
    dir_ = input(question)
    dir_ = os.path.join(prev_path, dir_)
    if os.path.exists(dir_):
      return dir_
    else:
      print(error_message)
      # Pass prev_path along so a retry is still resolved relative to the same directory
      return get_directory(question, error_message, prev_path)
          
  data_dir = get_directory('Please enter the directory containing your dataset, \neg. "C://Users/Username/Desktop/Data/:" on local machine, or "/content/gdrive/My Drive/Project/Data" on Google Colab.\n',
                           "That path doesn't exist, please enter the correct path.")

  data_filename = get_directory('Please enter the name of your data file (csv), \neg. "data.csv":\n',
                               "The file you have given does not exist in the specified directory. Please check the details given and start again if necessary.",
                               data_dir)
  multiprocess = yes_no('To speed up computation for large datasets\nmultiprocessing can be utilised. Should this be done? (y/n)\n')
  
  if multiprocess:
    cores = input('How many cores should be used? If left blank half will be used.\n')
    if cores == '':
      # Integer division with a floor of 1, so a single-CPU machine still gets one worker
      cores = max(1, multiprocessing.cpu_count() // 2)
    else:
      cores = int(cores)
      
  def load_data():
    global africapolis
    try: 
      africapolis = gpd.read_file(os.path.join(data_dir, 'africapolis.shp'))
    except Exception:
      africapolis_url = 'http://www.africapolis.org/download/Africapolis_2015_shp.zip'
      pdl.download(africapolis_url, data_dir=data_dir, keep_download=False, overwrite_download=True, verbose=True)
      africapolis = gpd.read_file(os.path.join(data_dir,'africapolis.shp'))

  print('Loading Data ')
  task = threading.Thread(target=load_data)
  task.start()

  spinner_thread = SpinnerThread()
  spinner_thread.start()

  task.join()
  # data_filename already holds the path returned by get_directory, so it is not joined with data_dir again
  data = pd.read_csv(data_filename)
  spinner_thread.stop()

  def check_col(col_name, kind):
    # Keep asking until the user names a column that exists in the dataset
    while col_name not in data.columns:
      col_name = input(f"Dataset doesn't contain column named {col_name}. Please enter the name of {kind} column.\n")
    return col_name

  long_name = check_col('longitude', 'longitude')
  lat_name = check_col('latitude', 'latitude')
    
  print('\bStarting processing...\n')
  if multiprocess:
    print('The progress bar only updates as tasks are completed, so it may stay at 0% for a long time.')
    isurban = multi_process_containment_tests(data=data, 
                                              shapes=africapolis,
                                              long_name=long_name,
                                              lat_name=lat_name,
                                              cores=cores)
  else:
    isurban = containment_tests(data=data, 
                                shapes=africapolis,
                                long_name=long_name,
                                lat_name=lat_name)
  
  data['is_urban'] = isurban
  print('Saving Data...')
  # index=False avoids writing the row index as an extra unnamed column
  data.to_csv(os.path.join(data_dir, 'data_isurban.csv'), index=False)
  
  print('Done!')
  


Please enter the directory containing your dataset, 
eg. "C://Users/Username/Desktop/Data/:" on local machine, or "/content/gdrive/My Drive/Project/Data" on Google Colab.
C://Users/ewand/Downloads
Please enter the name of your data file (csv), 
eg. "data.csv":
data_isurban.csv
To speed up computation for large datasets
multiprocessing can be utilised. Should this be done? (y/n)
y
How many cores should be used? If left blank half will be used.
4
Loading Data 
Dataset doesn't contain column named longitude. Please enter the name of longitude column.
ubefi
Starting processing...
[########################################] | 100% Completed |  1min 34.7s
Saving Data...
Done!
