rm gsapi
phil8192 committed Jan 7, 2019
1 parent 06f80ce commit 331bf66
Showing 14 changed files with 67 additions and 201 deletions.
81 changes: 12 additions & 69 deletions README.md
@@ -1,6 +1,4 @@
-# GONDOLA WISH
-
-> Distributed street view image processing
+# Distributed street level image processing

[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](http://www.repostatus.org/badges/latest/wip.svg)](http://www.repostatus.org/#wip)
[![LICENSE.](https://img.shields.io/badge/license-OGL--3-blue.svg?style=flat)](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
@@ -54,15 +52,15 @@ Where:
|geom |Coordinates of this sample, stored as a [WKT](https://en.wikipedia.org/wiki/Well-known_text) POINT(lng lat) |
|sample\_order |Order in which this sample should be taken |
|sample\_priority |Sampling priority for this point (lowest = highest priority) |
-|ts |Timestamp for when this point was sampled or interpoalted |
+|ts |Timestamp for when this point was sampled or interpolated |
|cam\_dir |Sample direction - currently looking **left** or **right** w.r.t **bearing** |
|predicted |True if this point has been interpolated/predicted, False if an actual reading has been made |
|value |The **green** percentage value for this sample point |

## Initial data load

The code in [infrastructure/mysql](infrastructure/mysql) will import the initial
-(unsampled) dataset into mysql. It expects as input the **csv** files for each
+(un-sampled) dataset into mysql. It expects as input the **csv** files for each
city generated by the
[trees-open-street-map](https://github.com/datasciencecampus/trees-open-street-map)
module. During import, sample points will be assigned a **sampling priority** as
@@ -74,10 +72,7 @@ There are 418,941 unique
[osm way](http://wiki.openstreetmap.org/wiki/Elements#Way) ids/roads for the 112
major towns and cities in England and Wales. If sampling at 10 metre intervals,
this will result in 8,315,247 coordinates, or 16,630,494 sample points if
-obtaining a **left** and **right** image at each point. With an image download
-quota of 25k images per day, this equates to ~2 years of overall processing
-time. (halved for each API key or 6 months with 100k images per day premium
-plan).
+obtaining a **left** and **right** image at each point.

**A key requirement of this project is a complete dataset**

@@ -154,7 +149,7 @@ for each city (option 2).

### Option 1: Quota based data load

-25,000 pending sample points are selected from the image processing API ordered
+N pending sample points are selected from the image processing API ordered
by their **sampling priority**.
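
For illustration, a quota-based load might simply ask the REST API for the top
N pending jobs (a sketch using the `/api/all_jobs/<limit>` route from the
`api/server.py` diff further down; the `localhost:5000` Flask default is an
assumption):

```python
import requests

# fetch the 500 highest-priority pending sample points as JSON
jobs = requests.get("http://localhost:5000/api/all_jobs/500").json()
```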

### Option 2: City queue data loader
@@ -214,35 +209,17 @@ Note that this layer can be skipped if the

### Image downloader

-This layer consists of 1 or more processes dedicated to downloading images from
-the [Google street view image API](https://developers.google.com/maps/documentation/streetview/).
-As a prerequisite, a Google developer key must first be obtained which will be
-associated with a quota of 25,000 daily image downloads.
-
-25,000 × 40k images = ~1G per day, and since we must request each image
-individually, this can take some time. As such, besides separating image
-acquisition and processing logic, the main purposes of this layer are:
-
-* **Parallelisation** of image downloads. Can speed up the download process and also
-distribute the effort amongst different network nodes which may be making use of
-different API keys, thus increasing the daily download quota. Disclaimer: this
-would most likely be in breach of the API terms-and-conditions. Note that each
-download process also makes use of persistent HTTP connections.
-
-* **Control**. It is possible to start, stop and control the download throughput
-by adding/removing processes to/from the pool.
-
The process blocks on the **image\_download\_jobs** queue until a new job
arrives. Note that this behaviour can follow a scheme - see
[image\_download/download.sh](download.sh) for details.

Upon receiving a new image download job, the process will make a request for a
-street-view image ata specific location and direction according to the
+street-level image at a specific location and direction according to the
parameters in the download job. In addition, a second request is made to the
street-view API for associated image meta data which includes the month/year in
which the image was obtained.

-Each downloaded image is stored locally (although chould be pushed to an
+Each downloaded image is stored locally (although could be pushed to an
external store) and then an **image processing** job is generated and pushed to
the **image\_processing\_jobs** queue where it will live until consumed by
workers in layer 4.
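
A minimal sketch of such a worker, assuming beanstalkd job queues, the
third-party `greenstalk` client and illustrative job fields (`id`, `url`) --
the repo's `downloader.py`/`download.sh` hold the real logic:

```python
import json

import greenstalk  # assumed beanstalkd client; any equivalent works
import requests


def download_worker(host="127.0.0.1", port=11300):
    queue = greenstalk.Client((host, port),
                              watch="image_download_jobs",
                              use="image_processing_jobs")
    session = requests.Session()  # persistent HTTP connection
    while True:
        job = queue.reserve()            # blocks until a job arrives
        spec = json.loads(job.body)      # hypothetical fields: id, url, ...
        resp = session.get(spec["url"], timeout=60)
        if resp.ok:
            path = "store/{}.jpg".format(spec["id"])
            with open(path, "wb") as fh:
                fh.write(resp.content)
            spec["path"] = path
            queue.put(json.dumps(spec))  # hand over to the image processor
            queue.delete(job)
        else:
            queue.release(job)           # back on the queue for a retry
```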
@@ -258,7 +235,7 @@ resolved the issue.

### Image processor

-This layer consists of 1 or more processes dedictated to image processing. At an
+This layer consists of 1 or more processes dedicated to image processing. At an
abstract level, an image processing process is both a consumer and a producer.
Image processing jobs are **consumed** from the **image\_processing\_jobs**
queue, processed and then the result is **pushed** to the
@@ -289,11 +266,11 @@ The next step is to quantify the amount of vegetation present in the image.

Currently, the image is first converted to the L\*a\*b\* colour space. Lab is
better suited to image processing tasks since it is much more intuitive than
-RGB. In Lab, the lightness of a pixel (L value) is seperated from the colour
+RGB. In Lab, the lightness of a pixel (L value) is separated from the colour
(A and B values). A negative A value represents degrees of green, positive A,
degrees of red. Negative B represents blue, while positive B represents yellow.
A colour can never be red *and* green or yellow *and* blue at the same time.
-Therefore the Lab colour space provides a more intuitive seperability than RGB
+Therefore the Lab colour space provides a more intuitive separability than RGB
(where all values must be adjusted to encode a colour.) Furthermore, since
lightness value (L) is represented independently from colour, a 'green' value
will be robust to varying lighting conditions.
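
To make the idea concrete, a green percentage can be estimated by counting
pixels whose a\* value is negative -- a sketch (not the repo's exact code)
using OpenCV, whose 8-bit Lab conversion stores a\* and b\* offset by +128:

```python
import cv2
import numpy as np


def green_percentage(image_path):
    img = cv2.imread(image_path)                # BGR, uint8
    if img is None:
        raise IOError("could not read " + image_path)
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)  # a*, b* stored offset by +128
    a = lab[:, :, 1].astype(np.int16) - 128     # recentre: negative = green
    return 100.0 * np.count_nonzero(a < 0) / a.size
```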
@@ -331,16 +308,13 @@ python3 ./image_processor.py WORKER_NAME SCHEME SRC_QUEUE SRC_STORE
```

Where `WORKER_NAME` = name of process, `SCHEME` = -1, 0, or N according to the
-consumption scheme described above, `SRC_QUEUE` = name of the incomming job
+consumption scheme described above, `SRC_QUEUE` = name of the incoming job
queue and `SRC_STORE` = (local) directory where downloaded images have been
stored.
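
For instance (hypothetical worker name, scheme value and store path, consuming
the `image_processing_jobs` queue):

```
python3 ./image_processor.py worker-1 0 image_processing_jobs /data/images
```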


# Sample point interpolation

-Since we are restricted to downloading 25,000 images per day, we can predict
-missing values for non-sampled points.

For **green** percentage, it has been assumed that *there exists a spatial
relationship between each point*: points closer together are likely to be
similar. Furthermore, it has been assumed that this relationship
@@ -357,37 +331,6 @@ of sampled values of the nearest *left* and *right* points:
`left` and `right` sample points. -- See code+unit tests in
[interpolator/](interpolator/) for details.
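
A minimal sketch of one such weighting (inverse-distance; the actual scheme
lives in [interpolator/](interpolator/)):

```python
def interpolate_green(d_left, v_left, d_right, v_right):
    """d_* = distance to a neighbouring sampled point, v_* = its green %."""
    if d_left + d_right == 0:
        return (v_left + v_right) / 2.0
    w_left = d_right / (d_left + d_right)  # the nearer point gets more weight
    return w_left * v_left + (1.0 - w_left) * v_right


# 10m from a 40% reading, 30m from an 80% reading -> closer to 40%
assert interpolate_green(10, 40.0, 30, 80.0) == 50.0
```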

-After all 25,000 images for a day have been downloaded and processed, the
-[interpolator](interpolator) will predict the missing points for the entire
-dataset. As such:
-
-* The "resolution" of the dataset will **improve** over time as more samples are
-collected.
-* It is expected that the **prediction error** will reduce over time as sampled
-points become closer according to the sampling prioritisation scheme. It is
-likely that sampling at lower distances may not reduce error at all.
-* The dataset will always be **complete** regardless of the number of remaining
-samples.
-
-# Running
-
-The end-to-end process can be invoked as a one-shot daily task. The
-[daily\_run.sh](daily_run.sh) script can be invoked with no arguments and is a
-good starting point to understand the overall flow of the system.
-
-To invoke once per day using **Layer 1, Option 1** the following cron-job could
-be created: `crontab -e`
-
-```
-30 21 * * * daily_run.sh 1 >> run.log 2>> run.err
-```
-
-If using **Option 2** (individual queues per city), the
-[shovel/fair\_scheduler.sh](shovel/fair_scheduler.sh) script could be invoked in
-the same way.
-
-Note that option (1) is most suitable for very large datasets, since beanstalkd
-is an **in-memory** store.

# Monitoring

@@ -414,4 +357,4 @@ pip3 install -r requirements.txt
```

# Licence
Open Government license v3 [OGL v3](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
4 changes: 0 additions & 4 deletions api/README.md
@@ -7,9 +7,5 @@ Street view image processing REST API.
```bash
sudo apt-get install -y libmysqlclient-dev
sudo apt-get install libgeos-dev
-sudo pip3 install 'flask >= 0.12.2'
-sudo pip3 install flask_mysqldb
-sudo pip3 install shapely
-sudo pip3 install flask-cli
```

1 change: 1 addition & 0 deletions api/api.py
@@ -2,6 +2,7 @@
import logging
import requests


class API():

    def __init__(self):
2 changes: 2 additions & 0 deletions api/server.py
@@ -12,6 +12,7 @@
mysql = MySQL()
mysql.init_app(app)


def sanitise(db_job):
    """convert WKT point to shapely geom."""
    wkt_point = db_job.pop('geom')
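
For reference, the WKT-to-geometry step typically looks like this with shapely
(a sketch, not necessarily the collapsed repo code; the coordinates are
hypothetical):

```python
from shapely import wkt

geom = wkt.loads("POINT (-3.1791 51.4816)")  # hypothetical lng/lat
lng, lat = geom.x, geom.y
```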
@@ -65,6 +66,7 @@ def jobs(city, sample_order):

    return jsonify(jobs)


@app.route('/api/all_jobs', defaults={'limit': 25000})
@app.route('/api/all_jobs/<limit>')
def all_jobs(limit):
12 changes: 6 additions & 6 deletions generic/sequence.py
@@ -1,5 +1,6 @@
from collections import deque


def fold(x):
    """fold a list x into equal parts.
@@ -48,9 +49,8 @@


def pp(x):
-    """just vis."""
-    ind, depth = schedule(x)
-    md = max(depth) + 1
-    for ind, depth in zip(ind, depth):
-        print("{:003d} {}".format(ind, "*" * ((md - depth) **2)))
-
+    """just vis."""
+    ind, depth = schedule(x)
+    md = max(depth) + 1
+    for ind, depth in zip(ind, depth):
+        print("{:003d} {}".format(ind, "*" * ((md - depth) ** 2)))
4 changes: 2 additions & 2 deletions generic/test_sequence.py
@@ -1,5 +1,6 @@
from sequence import fold, schedule


def test_fold():
    assert(list(fold(range(1, 8))) == [(4, 1), (2, 2), (6, 2), (1, 3), (3, 3), (5, 3), (7, 3)])

@@ -9,5 +10,4 @@ def test_schedule():
    order, depth = schedule(x)
    print(order)
    assert order == [7, 3, 8, 1, 9, 4, 10, 0, 11, 5, 12, 2, 13, 6, 14]
    assert depth == [4, 3, 4, 2, 4, 3, 4, 1, 4, 3, 4, 2, 4, 3, 4]
20 changes: 9 additions & 11 deletions image_download/README.md
@@ -1,16 +1,14 @@
-# Street view image downloader
+# Image downloader

-This **consumes** from the `image_download_jobs` queue. downloads images from street
-view, saves the image to disk and **pushes** jobs to the `image_processing_jobs`
-queue.
+This **consumes** from the `image_download_jobs` queue.
+
+`downloader.py` is a template: Modify accordingly to download images from a
+remote source.
+
+For each downloaded image, the image is saved to disk and then **pushed** to
+the `image_processing_jobs` queue.

This stage can be:
* **long lived** - Blocking on `image_download_jobs` forever.
* **concurrent** - Image download workers can consume/produce in parallel.
* **distributed** - Image download workers can access job queues over a network.

-## Dependencies
-
-```bash
-sudo pip3 install requests
-```
