rm gsapi
phil8192 committed Jan 7, 2019
1 parent 06f80ce commit 331bf66
Showing 14 changed files with 67 additions and 201 deletions.
81 changes: 12 additions & 69 deletions README.md
@@ -1,6 +1,4 @@
-# GONDOLA WISH
-
-> Distributed street view image processing
+# Distributed street level image processing

[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](http://www.repostatus.org/badges/latest/wip.svg)](http://www.repostatus.org/#wip)
[![LICENSE.](https://img.shields.io/badge/license-OGL--3-blue.svg?style=flat)](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
@@ -54,15 +52,15 @@ Where:
|geom |Coordinates of this sample, stored as a [WKT](https://en.wikipedia.org/wiki/Well-known_text) POINT(lng lat) |
|sample\_order |Order in which this sample should be taken |
|sample\_priority |Sampling priority for this point (lowest = highest priority) |
-|ts |Timestamp for when this point was sampled or interpoalted |
+|ts |Timestamp for when this point was sampled or interpolated |
|cam\_dir |Sample direction - currently looking **left** or **right** w.r.t **bearing** |
|predicted |True if this point has been interpolated/predicted, False if an actual reading has been made |
|value |The **green** percentage value for this sample point |

## Initial data load

The code in [infrastructure/mysql](infrastructure/mysql) will import the initial
-(unsampled) dataset into mysql. It expects as input the **csv** files for each
+(un-sampled) dataset into mysql. It expects as input the **csv** files for each
city generated by the
[trees-open-street-map](https://github.com/datasciencecampus/trees-open-street-map)
module. During import, sample points will be assigned a **sampling priority** as
@@ -74,10 +72,7 @@ There are 418,941 unique
[osm way](http://wiki.openstreetmap.org/wiki/Elements#Way) ids/roads for the 112
major towns and cities in England and Wales. If sampling at 10 metre intervals,
this will result in 8,315,247 coordinates, or 16,630,494 sample points if
-obtaining a **left** and **right** image at each point. With an image download
-quota of 25k images per day, this equates to ~2 years of overall processing
-time. (halved for each API key or 6 months with 100k images per day premium
-plan).
+obtaining a **left** and **right** image at each point.

**A key requirement of this project is a complete dataset**

@@ -154,7 +149,7 @@ for each city (option 2).

### Option 1: Quota based data load

-25,000 pending sample points are selected from the image processing API ordered
+N pending sample points are selected from the image processing API ordered
by their **sampling priority**.
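
For illustration, a quota-based load might simply ask the REST API for the top
N pending jobs (a sketch using the `/api/all_jobs/<limit>` route from the
`api/server.py` diff further down; the `localhost:5000` Flask default is an
assumption):

```python
import requests

# fetch the 500 highest-priority pending sample points as JSON
jobs = requests.get("http://localhost:5000/api/all_jobs/500").json()
```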

### Option 2: City queue data loader
@@ -214,35 +209,17 @@ Note that this layer can be skipped if the

### Image downloader

-This layer consists of 1 or more processes dedicated to downloading images from
-the [Google street view image API](https://developers.google.com/maps/documentation/streetview/).
-As a prerequisite, a Google developer key must first be obtained which will be
-associated with a quota of 25,000 daily image downloads.
-
-25,000 × 40k images = ~1G per day, and since we must request each image
-individually, this can take some time. As such, besides separating image
-acquisition and processing logic, the main purposes of this layer are:
-
-* **Parallelisation** of image downloads. Can speed up the download process and also
-distribute the effort amongst different network nodes which may be making use of
-different API keys, thus increasing the daily download quota. Disclaimer: this
-would most likely be in breach of the API terms-and-conditions. Note that each
-download process also makes use of persistent HTTP connections.
-
-* **Control**. It is possible to start, stop and control the download throughput
-by adding/removing processes to/from the pool.
-
The process blocks on the **image\_download\_jobs** queue until a new job
arrives. Note that this behaviour can follow a scheme - see
[image\_download/download.sh](download.sh) for details.

Upon receiving a new image download job, the process will make a request for a
-street-view image ata specific location and direction according to the
+street-level image at a specific location and direction according to the
parameters in the download job. In addition, a second request is made to the
street-view API for associated image meta data which includes the month/year in
which the image was obtained.

-Each downloaded image is stored locally (although chould be pushed to an
+Each downloaded image is stored locally (although could be pushed to an
external store) and then an **image processing** job is generated and pushed to
the **image\_processing\_jobs** queue where it will live until consumed by
workers in layer 4.
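
A minimal sketch of such a worker, assuming beanstalkd job queues, the
third-party `greenstalk` client and illustrative job fields (`id`, `url`) --
the repo's `downloader.py`/`download.sh` hold the real logic:

```python
import json

import greenstalk  # assumed beanstalkd client; any equivalent works
import requests


def download_worker(host="127.0.0.1", port=11300):
    queue = greenstalk.Client((host, port),
                              watch="image_download_jobs",
                              use="image_processing_jobs")
    session = requests.Session()  # persistent HTTP connection
    while True:
        job = queue.reserve()            # blocks until a job arrives
        spec = json.loads(job.body)      # hypothetical fields: id, url, ...
        resp = session.get(spec["url"], timeout=60)
        if resp.ok:
            path = "store/{}.jpg".format(spec["id"])
            with open(path, "wb") as fh:
                fh.write(resp.content)
            spec["path"] = path
            queue.put(json.dumps(spec))  # hand over to the image processor
            queue.delete(job)
        else:
            queue.release(job)           # back on the queue for a retry
```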
@@ -258,7 +235,7 @@ resolved the issue.

### Image processor

-This layer consists of 1 or more processes dedictated to image processing. At an
+This layer consists of 1 or more processes dedicated to image processing. At an
abstract level, an image processing process is both a consumer and a producer.
Image processing jobs are **consumed** from the **image\_processing\_jobs**
queue, processed and then the result is **pushed** to the
@@ -289,11 +266,11 @@ The next step is to quantify the amount of vegetation present in the image.

Currently, the image is first converted to the L\*a\*b\* colour space. Lab is
better suited to image processing tasks since it is much more intuitive than
-RGB. In Lab, the lightness of a pixel (L value) is seperated from the colour
+RGB. In Lab, the lightness of a pixel (L value) is separated from the colour
(A and B values). A negative A value represents degrees of green, positive A,
degrees of red. Negative B represents blue, while positive B represents yellow.
A colour can never be red *and* green or yellow *and* blue at the same time.
-Therefore the Lab colour space provides a more intuitive seperability than RGB
+Therefore the Lab colour space provides a more intuitive separability than RGB
(where all values must be adjusted to encode a colour.) Furthermore, since
lightness value (L) is represented independently from colour, a 'green' value
will be robust to varying lighting conditions.
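
To make the idea concrete, a green percentage can be estimated by counting
pixels whose a\* value is negative -- a sketch (not the repo's exact code)
using OpenCV, whose 8-bit Lab conversion stores a\* and b\* offset by +128:

```python
import cv2
import numpy as np


def green_percentage(image_path):
    img = cv2.imread(image_path)                # BGR, uint8
    if img is None:
        raise IOError("could not read " + image_path)
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)  # a*, b* stored offset by +128
    a = lab[:, :, 1].astype(np.int16) - 128     # recentre: negative = green
    return 100.0 * np.count_nonzero(a < 0) / a.size
```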
@@ -331,16 +308,13 @@ python3 ./image_processor.py WORKER_NAME SCHEME SRC_QUEUE SRC_STORE
```

Where `WORKER_NAME` = name of process, `SCHEME` = -1, 0, or N according to the
-consumption scheme described above, `SRC_QUEUE` = name of the incomming job
+consumption scheme described above, `SRC_QUEUE` = name of the incoming job
queue and `SRC_STORE` = (local) directory where downloaded images have been
stored.
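
For instance (hypothetical worker name, scheme value and store path, consuming
the `image_processing_jobs` queue):

```
python3 ./image_processor.py worker-1 0 image_processing_jobs /data/images
```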


# Sample point interpolation

-Since we are restricted to downloading 25,000 images per day, we can predict
-missing values for non-sampled points.

For **green** percentage, it has been assumed that *there exists a spatial
relationship between each point*: points closer together are likely to be
similar. Furthermore, it has been assumed that this relationship
@@ -357,37 +331,6 @@ of sampled values of the nearest *left* and *right* points:
`left` and `right` sample points. -- See code+unit tests in
[interpolator/](interpolator/) for details.
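
A minimal sketch of one such weighting (inverse-distance; the actual scheme
lives in [interpolator/](interpolator/)):

```python
def interpolate_green(d_left, v_left, d_right, v_right):
    """d_* = distance to a neighbouring sampled point, v_* = its green %."""
    if d_left + d_right == 0:
        return (v_left + v_right) / 2.0
    w_left = d_right / (d_left + d_right)  # the nearer point gets more weight
    return w_left * v_left + (1.0 - w_left) * v_right


# 10m from a 40% reading, 30m from an 80% reading -> closer to 40%
assert interpolate_green(10, 40.0, 30, 80.0) == 50.0
```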

-After all 25,000 images for a day have been downloaded and processed, the
-[interpolator](interpolator) will predict the missing points for the entire
-dataset. As such:
-
-* The "resolution" of the dataset will **improve** over time as more samples are
-collected.
-* It is expected that the **prediction error** will reduce over time as sampled
-points become closer according to the sampling prioritisation scheme. It is
-likely that sampling at lower distances may not reduce error at all.
-* The dataset will always be **complete** regardless of the number of remaining
-samples.
-
-# Running
-
-The end-to-end process can be invoked as a one-shot daily task. The
-[daily\_run.sh](daily_run.sh) script can be invoked with no arguments and is a
-good starting point to understand the overall flow of the system.
-
-To invoke once per day using **Layer 1, Option 1** the following cron-job could
-be created: `crontab -e`
-
-```
-30 21 * * * daily_run.sh 1 >> run.log 2>> run.err
-```
-
-If using **Option 2** (individual queues per city), the
-[shovel/fair\_scheduler.sh](shovel/fair_scheduler.sh) script could be invoked in
-the same way.
-
-Note that option (1) is most suitable for very large datasets, since beanstalkd
-is an **in-memory** store.

# Monitoring

@@ -414,4 +357,4 @@ pip3 install -r requirements.txt
```

# Licence
Open Government license v3 [OGL v3](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
4 changes: 0 additions & 4 deletions api/README.md
@@ -7,9 +7,5 @@ Street view image processing REST API.
```bash
sudo apt-get install -y libmysqlclient-dev
sudo apt-get install libgeos-dev
-sudo pip3 install 'flask >= 0.12.2'
-sudo pip3 install flask_mysqldb
-sudo pip3 install shapely
-sudo pip3 install flask-cli
```

1 change: 1 addition & 0 deletions api/api.py
@@ -2,6 +2,7 @@
import logging
import requests


class API():

    def __init__(self):
2 changes: 2 additions & 0 deletions api/server.py
@@ -12,6 +12,7 @@
mysql = MySQL()
mysql.init_app(app)


def sanitise(db_job):
    """convert WKT point to shapely geom."""
    wkt_point = db_job.pop('geom')
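
For reference, the WKT-to-geometry step typically looks like this with shapely
(a sketch, not necessarily the collapsed repo code; the coordinates are
hypothetical):

```python
from shapely import wkt

geom = wkt.loads("POINT (-3.1791 51.4816)")  # hypothetical lng/lat
lng, lat = geom.x, geom.y
```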
@@ -65,6 +66,7 @@ def jobs(city, sample_order):

    return jsonify(jobs)


@app.route('/api/all_jobs', defaults={'limit': 25000})
@app.route('/api/all_jobs/<limit>')
def all_jobs(limit):
12 changes: 6 additions & 6 deletions generic/sequence.py
@@ -1,5 +1,6 @@
from collections import deque


def fold(x):
    """fold a list x into equal parts.
@@ -48,9 +49,8 @@


def pp(x):
-    """just vis."""
-    ind, depth = schedule(x)
-    md = max(depth) + 1
-    for ind, depth in zip(ind, depth):
-        print("{:003d} {}".format(ind, "*" * ((md - depth) **2)))
-
+    """just vis."""
+    ind, depth = schedule(x)
+    md = max(depth) + 1
+    for ind, depth in zip(ind, depth):
+        print("{:003d} {}".format(ind, "*" * ((md - depth) ** 2)))
4 changes: 2 additions & 2 deletions generic/test_sequence.py
@@ -1,5 +1,6 @@
from sequence import fold, schedule


def test_fold():
    assert(list(fold(range(1, 8))) == [(4, 1), (2, 2), (6, 2), (1, 3), (3, 3), (5, 3), (7, 3)])

@@ -9,5 +10,4 @@ def test_schedule():
    order, depth = schedule(x)
    print(order)
    assert order == [7, 3, 8, 1, 9, 4, 10, 0, 11, 5, 12, 2, 13, 6, 14]
    assert depth == [4, 3, 4, 2, 4, 3, 4, 1, 4, 3, 4, 2, 4, 3, 4]
20 changes: 9 additions & 11 deletions image_download/README.md
@@ -1,16 +1,14 @@
-# Street view image downloader
+# Image downloader

-This **consumes** from the `image_download_jobs` queue. downloads images from street
-view, saves the image to disk and **pushes** jobs to the `image_processing_jobs`
-queue.
+This **consumes** from the `image_download_jobs` queue.
+
+`downloader.py` is a template: Modify accordingly to download images from a
+remote source.
+
+For each downloaded image, the image is saved to disk and then **pushed** to
+the `image_processing_jobs` queue.

This stage can be:
* **long lived** - Blocking on `image_download_jobs` forever.
* **concurrent** - Image download workers can consume/produce in parallel.
* **distributed** - Image download workers can access job queues over a network.

-## Dependencies
-
-```bash
-sudo pip3 install requests
-```
