## Data Summary

The data is stored in the `UNHCR_data` subfolder. The borehole data is a single `csv` file; the tent locations are stored in several `geojson` files.

Note that the locations with borehole data and the locations with geolocation data do not fully overlap. Below I give a quick overview of what's in the data and show how I personally like to import it.

I will also demonstrate a pass at filtering the data geographically.

### Borehole Data

There are two ways of importing the data: one with native Python, one with pandas.

In [16]:
# option 1, using native python. 

import csv

def get_borehole_data():
    with open('UNHCR_data/boreholes_wash.csv','r') as csvfile:
        reader = csv.reader(csvfile, delimiter = ',',quotechar = '"')
        for row in reader:
            yield row
            
borehole_data = get_borehole_data()

columns = next(borehole_data)

print('Length of Borehole Dataset:\n\n%s entries.\n\n\nBorehole Data Columns:\n\n%s' 
      % (str(len(list(borehole_data))),'\n'.join(columns)))

Length of Borehole Dataset:

11268 entries.


Borehole Data Columns:

﻿objectid
Iso3
Country
Site ID
Site name
Borehole name
Borehole ID
Last update
Latitude
Longitude
Elevation in m
Status
Drilling date
Depth in m
Static water level in m
Dynamic water level in m
Type of pump
Pump brand and model
Pump depth in m
Pump motor power in kWt
Energy source 
Generator brand and model
2nd generator brand and model
Generator capacity in KVA
2nd generator capacity in KVA
Casing diameter in inch
Casing material
1st screened area - from in m
1st screened area - to in m
2nd screen area - from in m
2nd screened area - to in m
3rd screen area - from: in m
3rd screened area - to in m
Aquifer type
Safe yield in m3/h
Daily pumping time in h
Date of last water quality control
Conductivity in µS
pH
Turbidity in NTU
Ammonia concentration in mg/L
Arsenic concentration in µg/L
Free chlorine concentration in mg/L
Total chlorine concentration in mg/L
Fluoride concentration in mg/L
Nitrate concentration in mg/L
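
If the CSV was saved with a byte-order mark, the first column name read by option 1 carries an invisible `\ufeff` prefix, so a lookup such as `row['objectid']` would fail. The sketch below is one way around that, not the loader used in this notebook: it opens the file with `encoding='utf-8-sig'` and uses `csv.DictReader` so that rows can be addressed by column name. The name `read_csv_rows` and the two-line stand-in file are assumptions made for illustration.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Yield each CSV row as a dict keyed by column name.

    encoding='utf-8-sig' drops a leading byte-order mark if there is one and
    reads plain UTF-8 files unchanged.
    """
    with open(path, 'r', encoding='utf-8-sig', newline='') as csvfile:
        yield from csv.DictReader(csvfile)


# Two-line stand-in file (not the real dataset), deliberately written with a BOM
sample = 'objectid,Country,Latitude,Longitude\n17985,Sudan,14.35944,35.88561\n'
with tempfile.NamedTemporaryFile('w', encoding='utf-8-sig', suffix='.csv',
                                 delete=False) as tmp:
    tmp.write(sample)

rows = list(read_csv_rows(tmp.name))
os.remove(tmp.name)

print(list(rows[0]))          # column names, without the BOM prefix
print(rows[0]['Latitude'])    # values are still strings at this point
```

For the real file, `read_csv_rows('UNHCR_data/boreholes_wash.csv')` would be the equivalent call.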


In [12]:
# option 2, using pandas. 
# Pandas is convenient unless you need to have lots of exact control over what happens.

import pandas as pd

borehole_data = pd.read_csv('UNHCR_data/boreholes_wash.csv')

borehole_data.tail()

| | objectid | Iso3 | Country | Site ID | Site name | Borehole name | Borehole ID | Last update | Latitude | Longitude | ... | Fluoride concentration in mg/L | Nitrate concentration in mg/L | Nitrite concentration in mg/L | Attachment 1 | Attachment 2 | Attachment 3 | Attachment 4 | Comments | globalid | Unnamed: 53 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11263 | 17984 | SDN | Sudan | SDNs001510 | Wad Sharife | Western station B | SDN-SDNs001510-000004 | 2018-6-29 | 15.3604 | 36.44341 | ... | | | | | | | | | {BD885ED3-CB9A-4E71-9969-717E44805789} | |
| 11264 | 17985 | SDN | Sudan | SDNs001478 | Abuda | Northern station | SDN-SDNs001478-000001 | 2018-6-29 | 14.35944 | 35.88561 | ... | 1.7 | 1.76 | 0.04 | | | | | Screen type: Johnson continuous slot | {C56D5E17-9E67-45AD-8A7F-6117939DEC02} | |
| 11265 | 17986 | SDN | Sudan | SDNs001478 | Abuda | Southern station | SDN-SDNs001478-000002 | 2018-6-29 | 14.34753 | 35.89094 | ... | 2.05 | 1.76 | 0.04 | | | | | Screen type: Johnson continuous slot | {2351F077-3296-4B96-AC0A-CDE646D70E90} | |
| 11266 | 17987 | SDN | Sudan | SDNs001478 | Abuda | New station | SDN-SDNs001478-000003 | 2018-6-29 | 14.36172 | 35.88845 | ... | 0.4 | 0.01 | 0.1 | | | | | Screen type: Johnson continuous slot | {2855BF8A-1085-4138-8749-21014994F7BA} | |
| 11267 | 1574 | JOR | Jordan | JORs004465 | Zaatari | BH2 | JOR-JORs004465-003 | 2018-8-2 | 32.280991 | 36.337528 | ... | | | | Zaatari-BH2-Z13-Well-completion_report-2012.pdf | | | | | {624C0832-CC2C-48D0-9BA7-5A90148E1203} | |


### Geolocation Data

There are several files, each containing one large JSON object. The files are easiest to import as dictionaries, where the field `features` holds the actual entries (between roughly 2,900 and 95,000 per file).

In [25]:
import os

geolocation_datasets = [path for path in os.listdir('UNHCR_data/') if path.endswith('.geojson')]
print("The following geolocation datasets exist:\n\n%s" % '\n'.join(geolocation_datasets))

The following geolocation datasets exist:
bangladesh.geojson
africa1.geojson
africa2.geojson
western_asia.geojson


In [44]:
from json import loads

def get_json_data(filename):
    with open('UNHCR_data/'+filename,'r') as file:
        return loads(file.read())
    
for filename in geolocation_datasets:
    data = get_json_data(filename)
    print("Filename: %s\n" % filename)
    print("Dataset is imported as %s" % type(data))
    for field in data:
        print("field: %s" % field)
        print("type: %s" % type(data[field]))
        print("length: %s\n" % len(data[field]))
    print('\n'+10*'#'+'\n')
    
print("SAMPLE DATA:\n")
for item in data['features'][:5]:
    print(item,end='\n\n')

Filename: bangladesh.geojson

Dataset is imported as <class 'dict'>
field: type
type: <class 'str'>
length: 17

field: crs
type: <class 'dict'>
length: 2

field: features
type: <class 'list'>
length: 2861


##########

Filename: africa1.geojson

Dataset is imported as <class 'dict'>
field: type
type: <class 'str'>
length: 17

field: crs
type: <class 'dict'>
length: 2

field: features
type: <class 'list'>
length: 95219


##########

Filename: africa2.geojson

Dataset is imported as <class 'dict'>
field: type
type: <class 'str'>
length: 17

field: crs
type: <class 'dict'>
length: 2

field: features
type: <class 'list'>
length: 60670


##########

Filename: western_asia.geojson

Dataset is imported as <class 'dict'>
field: type
type: <class 'str'>
length: 17

field: crs
type: <class 'dict'>
length: 2

field: features
type: <class 'list'>
length: 17785


##########

SAMPLE DATA:

{'type': 'Feature', 'geometry': {'type': 'Point', 'coordinates': [43.887513, 36.472326]}, 'properties': {'id': 
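
The sample record shows the shape of a single entry: a `geometry` of type `Point` whose `coordinates` are ordered `[longitude, latitude]` (the GeoJSON convention), plus a `properties` dictionary. As a minimal sketch, assuming every entry of interest is a `Point`, the coordinates can be pulled out as below. The miniature document is hypothetical: its top-level fields mirror the listing above (the 17-character `type` is presumably `FeatureCollection`), while the `crs` and `id` values are placeholders, and `point_coordinates` is a name introduced here for illustration.

```python
import json

# Hypothetical miniature file content; the crs and id values are placeholders.
sample_text = '''
{
  "type": "FeatureCollection",
  "crs": {"type": "name", "properties": {"name": "placeholder-crs"}},
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [43.887513, 36.472326]},
     "properties": {"id": "placeholder-id"}}
  ]
}
'''


def point_coordinates(collection):
    """Return (longitude, latitude) tuples for all Point features."""
    coords = []
    for feature in collection['features']:
        geometry = feature.get('geometry') or {}
        if geometry.get('type') == 'Point':
            lon, lat = geometry['coordinates'][:2]
            coords.append((lon, lat))
    return coords


collection = json.loads(sample_text)
coords = point_coordinates(collection)
print(collection['type'], len(collection['type']))
print(coords)
```

Features with a missing or non-`Point` geometry are skipped rather than raising an error, which seemed the safer default for a first look at the data.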

## Geographically filtering the tent locations

The coordinates of both boreholes and tents are provided in terms of longitude and latitude. We can create a dataset that only includes tent data for tents that are within 50 km of a borehole.
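
For reference, a common way to compute the distance between two longitude/latitude points is the haversine formula. Writing $\varphi$ for latitude and $\lambda$ for longitude, both converted from degrees to radians, and $R$ for the Earth's radius (treating the Earth as a sphere, which is an approximation):

$$
a = \sin^2\!\left(\frac{\varphi_2-\varphi_1}{2}\right) + \cos\varphi_1\,\cos\varphi_2\,\sin^2\!\left(\frac{\lambda_2-\lambda_1}{2}\right),
\qquad
d = 2R\,\operatorname{atan2}\!\left(\sqrt{a},\,\sqrt{1-a}\right)
$$

With $R \approx 6373$ km, one degree of latitude corresponds to $R\pi/180 \approx 111$ km, so a 50 km radius spans a little under half a degree of latitude.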

In [97]:
from math import sin, cos, sqrt, atan2, radians

def distance(coord1, coord2):
    # function to calculate distance in km of two points given as (longitude, latitude) in degrees
    # lifted from: https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude
    
    # approximate radius of earth in km
    R = 6373.0
    
    # sin and cos expect radians, so convert from degrees first
    lon1, lat1 = map(radians, coord1)
    lon2, lat2 = map(radians, coord2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c


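def to_floats(x):
    # returns a (float, float) tuple, or None if a value is missing or not numeric
    try:
        return (float(x[0]), float(x[1]))
    except (ValueError, IndexError):
        return None


# gets all borehole coordinates as (longitude,latitude) tuples, sorted by longitude
# the header row and any rows without usable coordinates come back as None and are dropped
borehole_coordinates = sorted(
    (coord for coord in (to_floats(item[9:7:-1]) for item in get_borehole_data())
     if coord is not None),
    key=lambda x: x[0])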

def nearest_borehole_distance(tent):
    # calculates the distance in km to the nearest borehole
    # (returns infinity if there are no borehole coordinates at all)
    coord1 = tuple(tent['geometry']['coordinates'])
    return min((distance(coord1, coord2) for coord2 in borehole_coordinates),
               default=float('inf'))
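
The filter described at the start of this section (keep only tents within 50 km of a borehole) could be sketched as follows. Comparing all of the roughly 176,000 tent features against all 11,268 boreholes means about two billion distance calls, so this sketch first narrows the candidates with a longitude window found by `bisect` on the longitude-sorted list; that is one possible use of the sort, not necessarily the one intended here. The block is self-contained and its names (`haversine_km`, `has_borehole_within`, `tents_near_boreholes`) are hypothetical. It assumes `Point` geometries, a search radius far below a quarter of the Earth's circumference, and no data near the ±180° meridian.

```python
from bisect import bisect_left, bisect_right
from math import asin, atan2, cos, degrees, radians, sin, sqrt

EARTH_RADIUS_KM = 6373.0  # approximate, same value as used in this section


def haversine_km(coord1, coord2):
    """Great-circle distance in km between two (longitude, latitude) pairs in degrees."""
    lon1, lat1 = (radians(v) for v in coord1)
    lon2, lat2 = (radians(v) for v in coord2)
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * atan2(sqrt(a), sqrt(1 - a))


def has_borehole_within(coord, boreholes_by_lon, max_km=50.0):
    """True if any borehole lies within max_km of coord.

    boreholes_by_lon is a list of (longitude, latitude) tuples sorted by
    longitude. Wrap-around at +/-180 degrees longitude is ignored.
    """
    lon, lat = coord
    # largest longitude difference a point within max_km can have at this latitude
    ratio = sin(max_km / EARTH_RADIUS_KM) / max(cos(radians(lat)), 1e-12)
    if ratio >= 1:
        candidates = boreholes_by_lon  # close to a pole: no useful bound
    else:
        window = degrees(asin(ratio))
        lo = bisect_left(boreholes_by_lon, (lon - window, float('-inf')))
        hi = bisect_right(boreholes_by_lon, (lon + window, float('inf')))
        candidates = boreholes_by_lon[lo:hi]
    return any(haversine_km(coord, b) <= max_km for b in candidates)


def tents_near_boreholes(features, boreholes_by_lon, max_km=50.0):
    """Keep the GeoJSON features whose point lies within max_km of a borehole."""
    return [f for f in features
            if has_borehole_within(tuple(f['geometry']['coordinates'][:2]),
                                   boreholes_by_lon, max_km)]


# Two borehole coordinates from the table above, as (longitude, latitude)
boreholes = sorted([(36.337528, 32.280991), (35.88561, 14.35944)])

# Made-up tent features; the 'id' values are placeholders
tents = [
    {'geometry': {'type': 'Point', 'coordinates': [35.89094, 14.34753]},
     'properties': {'id': 'tent-a'}},
    {'geometry': {'type': 'Point', 'coordinates': [43.887513, 36.472326]},
     'properties': {'id': 'tent-b'}},
]

nearby = tents_near_boreholes(tents, boreholes)
print([t['properties']['id'] for t in nearby])
```

With the real data, `boreholes` would be the sorted list of borehole coordinates and `tents` the `features` list of one of the geojson files.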

