# 2. Turn the algorithm into a service

## Service high level view

This service would consist of 3 components as shown in the diagram below.

- a really simple frontend component
- a compute component, a cluster of virtual machines running an improved, hypothetical version of the "working demo" from Part 1
- a storage component powered by an object store (S3) and a NoSQL database (DynamoDB) with index-based search functionality 

![image.png](./api_design.png)

### Assumptions

- assume the service is to be deployed to a public, aws-like cloud platform (rather than in-house data center)
- assume the users interact with the system anonymously, meaning no login is required. However, please see the "System Hardening Ideas" section at the end for further thoughts.
- assume the networks are defined in the same format as in Part 1, i.e. each network is a list of triples `[start_location, end_location, edge_flow]`
- assume the network definitions are stored in `.json` files to be uploaded by the client users
- assume the users do not need to see the actual networks (in a graphical interface or so) but only the evaluated max-flow value.
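
As an illustration of the assumed file format, here is a minimal validation sketch. The function name `parse_network` is hypothetical, and the rules that locations are strings and flows are non-negative numbers are my assumptions; the format itself only specifies a list of triples.

```python
import json


def parse_network(text):
    """Parse a network definition into a list of (start, end, flow) triples.

    Raises ValueError when the payload is not a list of well-formed triples.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, list) or not data:
        raise ValueError("network must be a non-empty list of triples")
    edges = []
    for i, item in enumerate(data):
        if not isinstance(item, list) or len(item) != 3:
            raise ValueError(f"edge {i}: expected [start, end, flow]")
        start, end, flow = item
        # Assumption: locations are plain strings.
        if not isinstance(start, str) or not isinstance(end, str):
            raise ValueError(f"edge {i}: locations must be strings")
        # Assumption: flows are non-negative numbers. bool is a subclass of
        # int in Python, so it is rejected explicitly.
        if isinstance(flow, bool) or not isinstance(flow, (int, float)) or flow < 0:
            raise ValueError(f"edge {i}: flow must be a non-negative number")
        edges.append((start, end, flow))
    return edges
```

For example, `parse_network('[["Mascot", "Redfern", 3000]]')` yields a single triple, while a malformed payload raises `ValueError`.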

### User interaction and request life cycle

The client users would interact with the system in the following ways:


#### Option 1. HTTP-POST to upload a network definition

The user can send an http-post request with a network `.json` file and a network description in plain text.

- if the network file is in a valid format and the description is not empty, upload the file and return http 201
- if otherwise, return http 400 with error description
- if a network file exists server-side with the same description, return http 403 with "resource exists" error
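
The three upload outcomes above can be sketched as a small decision function. The name `upload_status` and the order of the checks (validation first, then the duplicate check) are assumptions made for illustration.

```python
def upload_status(description, network_is_valid, existing_descriptions):
    """Return (http_status, message) for an upload request.

    `existing_descriptions` stands in for the set of descriptions that
    already have a network file stored server-side.
    """
    if not description or not description.strip():
        return 400, "network description must not be empty"
    if not network_is_valid:
        return 400, "network file is not in a valid format"
    if description in existing_descriptions:
        return 403, "resource exists"
    return 201, "created"
```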


#### Option 2. HTTP-GET to query the maximum flow

The user can send an http-get request with (start, end) locations and two types of parameters:
- type 1: with specific network description, the system then returns the max-flow accordingly
- type 2: without specific network description, the system then returns all the known "versions" (all the proposals).

For example, if 3 engineers have submitted their network json files that design a particular route connecting Mascot to North Sydney:

```
Network Design by A
Network Design by B
Network Design by C
``` 

If I issue a query command to the API, asking for the "max flow capacity between Mascot and North Sydney with Network Design by A", the system would return me a single record: 

```
Network Design by A, from <Mascot> to <North Sydney>, max flow: 4000/h
```

Whereas if I issue a query command to the API, asking for "max flow capacity between Mascot and North Sydney", the system would give me three records:

```
Network Design by C, from <Mascot> to <North Sydney>, max flow: 1000/h
Network Design by B, from <Mascot> to <North Sydney>, max flow: 2000/h
Network Design by A, from <Mascot> to <North Sydney>, max flow: 4000/h
```

(assume that the system has already computed all 3 max-flow values and persisted the results)

However, there is one **exception** with the second type of parameter.

If the system cannot find any record designated by (start, end), it returns http 404 with an error description, because it does not know which network the pair of locations (start, end) refers to.

## Frontend

- Could either be an embedded element on the caller's site (such as Transport NSW), or
- Dynamically served by a lambda function, or
- A dedicated site served on our VPC

It would ensure the user submits sufficient information and perform basic validations, for example:

- ensure the network description contains only valid alphanumeric characters
- ensure the json file exists and is in the valid format


### Handle Upload

The frontend would parse the client's json file and send its content in the request body.

The size of the payload would be acceptable: even in extreme cases, where a network json file contains 40-80k roads, the file size would only be a few MB (with whitespace stripped).

## Compute component

![image.png](./compute_component.png)

The requests from the frontend would firstly land on the API Gateway, where they are dispatched to the backend cluster managed by an auto-scaling group. 

The design intention is:

- API Gateway acts as the warden facing the hostile outside world (DDoS attacks etc.)
- the cluster runs in a DMZ (demilitarized zone), a separate internal subnet that operates in a safe environment
- the auto-scaling mechanism would use CPU utilization to scale up or down (for example: if utilization > 50%, scale up) to ensure the system remains responsive and not overly idle
- the compute instances are provisioned from a machine image (the green box above) with the latest security patches. See the deployment section for more details.
- both API Gateway and the cluster have their own monitoring services, typically provided by the platform (such as AWS CloudWatch) or via trusted third-parties such as Dynatrace.

### API Gateway

The API Gateway specifies the endpoints (for query and upload) and hardens security.

Here is a brief comparison between provider-managed gateway (such as AWS API Gateway) and dedicated gateway instances. The pros and cons are:

- a managed gateway simplifies maintenance but is likely more expensive, and it does not expose every detail for fine-tuning.
- gateway instances are just normal virtual machine instances with strict security measures; they can be inexpensive and allow for easy customization (such as streaming and jump-host), but require dedicated maintenance efforts.

Here I assume we have chosen to use a provider-managed API Gateway service.


### Handle Upload

The max-flow service, which is an improved version of the working demo in Part 1, would take the network definition wrapped in the json payload along with the network description.

It would then send a PUT request via the S3 api to upload the network file to a bucket. This creates a blob object named after the description. 

Note that the PUT request would fail if the blob already exists (even with different content). **This restriction is configured via bucket policy**. See the System Hardening Ideas section for details. In this case the backend will return http 403 "resource exists" error.

### Handle Query

In the case of user querying the max-flow, the backend would perform a few operations in the following order:

Firstly, 

- if the user specifies the (start, end) locations and the network description, the system combines these two pieces of information to form a key and tries to find a record in DynamoDB.

- if the user does not specify the network description, the system uses the (start, end) locations as the search index and scans for all the matching keys (recall that in the above example, the system would find three records, with A, B, C's network designs respectively).

If either situation is successful, the backend crafts a response with a single record (case 1) or a list of records (case 2) and returns it.

If not successful, meaning that no records exist for the query, it would proceed to the second operation:

It firstly sends a GET request via the S3 api to download the network file (the blob object). If not found it would result in http 404 without further processing.

Then it uses the user-specified (start, end) locations to compute the max-flow. 

Note that if the user does not provide the network description at all, the system will consider this a user error (the system doesn't know which network the (start, end) pair refers to). An http 400 is returned with an error description.

The result of the computation, the max-flow, will be sent to DynamoDB (PUT) with a new key: (start, end, network-description). This record is available for further queries.
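
The query flow described in this section can be sketched as follows. This is only a sketch under stated assumptions: plain dicts stand in for DynamoDB (`results`) and S3 (`networks`), the names `handle_query` and `max_flow` are hypothetical, and the max-flow routine is a plain Edmonds-Karp implementation picked for the example, not necessarily the algorithm used in Part 1.

```python
from collections import defaultdict, deque


def max_flow(edges, source, sink):
    """Edmonds-Karp max flow over (start, end, capacity) triples."""
    if source == sink:
        return 0
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:
        residual[u][v] += cap
        residual[v][u] += 0  # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Walk back from the sink, find the bottleneck, then push flow.
        path, node = [], sink
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        bottleneck = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= bottleneck
            residual[b][a] += bottleneck
        total += bottleneck


def handle_query(start, end, description, results, networks):
    """Return (http_status, body) for a max-flow query."""
    if description is None:
        # Index search on (start, end): return every stored proposal.
        hits = [(d, f) for (s, e, d), f in results.items() if (s, e) == (start, end)]
        if hits:
            return 200, sorted(hits)
        return 400, "no stored result; a network description is required"
    key = (start, end, description)
    if key in results:
        return 200, [(description, results[key])]
    if description not in networks:
        return 404, "network not found"
    # Cache miss: compute the max flow and persist it for later queries.
    flow = max_flow(networks[description], start, end)
    results[key] = flow
    return 200, [(description, flow)]
```

The first query for a given (start, end, description) pays for the computation; later queries, with or without the description, are served from the stored record.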


## The Storage Component

![image](./storage_component.png)

The storage component has the following invariants:

- it only accepts traffic from the backend subnet (an internal subnet)
- both the S3 bucket and DynamoDB enable multi-AZ (availability zone) backup to prevent data loss.
- the bucket disables delete-object and disallows overwriting existing blobs
- enable indexing in DynamoDB, use the (start, end) locations, a tuple, as the index key (such that the compute component can efficiently search for all the records associated with this pair of locations)
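
To illustrate the indexing invariant, here is a small in-memory model of the results table. `FlowResultStore` and its methods are hypothetical names and this is not the DynamoDB API; it only shows how a full key (start, end, description) and a (start, end) index can serve the two query types.

```python
from collections import defaultdict


class FlowResultStore:
    """In-memory stand-in for the results table and its (start, end) index."""

    def __init__(self):
        self._by_key = {}                     # (start, end, description) -> flow
        self._by_route = defaultdict(set)     # (start, end) -> descriptions

    def put(self, start, end, description, flow):
        self._by_key[(start, end, description)] = flow
        self._by_route[(start, end)].add(description)

    def get(self, start, end, description):
        """Exact lookup by full key; returns None when no record exists."""
        return self._by_key.get((start, end, description))

    def find_by_route(self, start, end):
        """Index lookup: all proposals recorded for this pair of locations."""
        descriptions = self._by_route.get((start, end), ())
        return {d: self._by_key[(start, end, d)] for d in sorted(descriptions)}
```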

### Handle Upload and Query

As explained in the compute component section.

## Deployment

I would choose the following deployment strategies:

- use an infra-as-code system to provision all the above components; recommending HashiCorp's Terraform
- separate **test**, **staging**, **production** VPCs; allow devs to test deployment in the **test** VPC anytime; only deploy to **staging** after merging and successful unit testing; only deploy to **production** once all the functional tests have passed in staging (see the testing section below)
- use rolling deployment, taking advantage of auto-scaling (see the details below)

### Infra as Code

Basically, in the service repository, a subdirectory named "infrastructure" contains
all the provisioning logic required to spin up and tear down the corresponding cloud components.

A team or person working on the service source code would also update the infrastructure source code and make sure both parts always stay in sync.

Two separate "build" processes exist: one for the service source code and one for the infrastructure. Typically the service needs certain provisioning changes done before the developers can deploy the artifacts.

### Rolling Deployment

#### Deployment Target

Each successful merge would lead to the creation of a new machine image (AMI) which contains the host OS, the artifacts (the binary + dependencies) and all the security patches (it should pass all the security compliance checks).

This machine image is the deployment target. It is labelled with the commit, the development ticket number and other descriptive information.

In the staging environment, a small cluster (auto-scaling group) is created with this machine image so that we can carry out functional + manual testing.

#### Rolling 

Once a deployment target is thoroughly tested in the staging environment, it is relabeled the "chosen machine image" (which is the green box in the compute component diagram). 

The production auto-scaling group sets this chosen machine image as its provision template.

Then an automated process periodically increments the scaling factor to spawn more instances with the new machine image. 

The cluster's monitoring system would raise an alarm if any new instances throw errors. In such cases, the automated process would reset the machine image to the previous one (rollback) and kill off the newly spawned instances. 
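
The rolling process with rollback can be sketched as below. All names (`rolling_deploy`, `is_healthy`, the instance naming) are hypothetical, and the sketch leaves out the timing between increments; in practice this logic would sit in the deployment automation rather than in application code.

```python
def rolling_deploy(desired_count, step, new_image, previous_image, is_healthy):
    """Spawn instances from new_image in batches; roll back on the first failure.

    Returns (active_image, instances_running_active_rollout).
    `is_healthy` is a callable standing in for the cluster's monitoring alarm.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    launched = []
    while len(launched) < desired_count:
        batch = min(step, desired_count - len(launched))
        for _ in range(batch):
            instance = f"{new_image}-{len(launched)}"
            launched.append(instance)
            if not is_healthy(instance):
                # Rollback: restore the previous image as the template and
                # kill off every instance spawned from the new image.
                return previous_image, []
    return new_image, launched
```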

## Monitoring

I would use two types of monitoring strategies for different resources.

- security-oriented monitoring methods for critical resources, such as the API Gateway
- performance-oriented methods for data intensive resources, such as the instances

### Security-oriented monitoring objectives

To keep an eye on:

- DDoS attacks
- port brute-forcing (which can be less of an issue for a managed API Gateway)
- malicious staff

The corresponding data sources to keep track of are:

- Gateway traffic dashboard, to look for highly repetitive requests from a small set of IP addresses
- Gateway error graph (to look for high volumes of port rejection) 
- Alarms sent from the login trail (such as AWS CloudTrail) that indicate possible permission escalation

All the security-oriented warnings should be treated as alarms and dealt with immediately.

### Performance-oriented monitoring objectives

I would apply the [USE](http://www.brendangregg.com/usemethod.html) method, which recommends keeping track of the following metrics:

- utilization (are the resources on average busy doing things or idle?)
- saturation (are the resources too busy doing things and hence unable to handle any request?)
- errors (is the service functioning or failing?)

In the case of this network max-flow service, I would let the system generate low-priority warnings for the following events:

- temporary cluster saturation, CPU spikes (should trigger auto-scaling)
- temporary errors (caused by provider instability etc.)
- low cluster utilization (again, should trigger automated down-scaling)

On the other hand, these events should trigger high-priority warnings and alarms:

- long cluster saturation (> 30 seconds), should look for misconfiguration and security incidents
- long stream of errors (> 100 continuous error entries), should identify the cause and, if it is source-related, roll back immediately
- infrastructure provider warnings (scheduled maintenance or downtime)
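
The two priority levels above can be sketched as a simple classification rule. The function name `classify` and its parameters are hypothetical; the thresholds (30 seconds, 100 continuous errors) are the ones listed above.

```python
def classify(saturation_seconds=0.0, consecutive_errors=0,
             provider_warning=False, low_utilization=False):
    """Map monitoring observations to an alert priority: high, low or none."""
    if provider_warning or saturation_seconds > 30 or consecutive_errors > 100:
        return "high"
    if saturation_seconds > 0 or consecutive_errors > 0 or low_utilization:
        return "low"
    return "none"
```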

## Testing

### Unit Tests

We should maintain over 90% unit test coverage. It should be a hard metric set in the build pipeline: no unit test coverage, no deploy.

### Performance Tests

Also run performance tests on every merge. The timing records should be uploaded to a vault (like AWS Parameter Store) so that we can make sure performance does not degrade over releases.

For developers, performance-test programs should be easy to run and easy to understand just like unit tests.

### Functional Tests

Test the API end points via functional testing. 

There is space for creativity: 
- typically one could write func-test in cucumber or python/ruby with custom executor
- but there also exist more automated and integrated approaches, such as using the function docstring to perform certain testing (doctest)
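
As a small illustration of the doctest approach, the examples in the docstring below double as tests. The helper `query_path` and the `/max-flow` endpoint path are hypothetical and only serve the example.

```python
import doctest
from urllib.parse import urlencode


def query_path(start, end, description=None):
    """Build the request path for a max-flow query.

    >>> query_path("Mascot", "North Sydney")
    '/max-flow?start=Mascot&end=North+Sydney'
    >>> query_path("Mascot", "North Sydney", "Network Design by A")
    '/max-flow?start=Mascot&end=North+Sydney&description=Network+Design+by+A'
    """
    params = {"start": start, "end": end}
    if description is not None:
        params["description"] = description
    return "/max-flow?" + urlencode(params)


if __name__ == "__main__":
    # Runs every example found in the docstrings of this module.
    doctest.testmod()
```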

I'm personally not a big fan of manual functional testing, but under tight resource constraints it could serve as a temporary strategy while the team progressively automates the functional testing process.

## System Hardening Ideas

These are not addressed in the requirements but I thought I would share these ideas.


### Lowering Risks

The current design assumes any anonymous user (any "engineer") can interact with the system, which is highly risky.

#### The short term hardening options



- limit the number of API actions per IP per minute (configurable via API Gateway)
- completely disable SSH ports on the compute instances and use managed console access (such as AWS Systems Manager Session Manager) if needed. This would keep track of every login in CloudTrail and trigger a security alarm if escalation happens.
- set the max instance lifetime in the auto-scaling group to a relatively short duration (6 hours to 1 day). Instances that live longer than this limit are automatically torn down and replaced by newly spawned instances. This limits how long an attacker can persist on a compromised instance.
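
The per-IP limit would normally be configured on the API Gateway itself, but the idea can be sketched as a sliding-window counter. The class name `RateLimiter` and the choice of a sliding window are assumptions for illustration only.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `max_requests` per IP within a sliding time window."""

    def __init__(self, max_requests, window_seconds=60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

The caller passes the current time in seconds as `now`, which keeps the sketch deterministic and easy to test.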


#### The longer term hardening options

- disable anonymous access
- require login from, for example, a Google account; use Identity Federation to manage the user accounts and control their access

### Bucket Policy

The S3 bucket policy should follow the least-privilege principle and disable overwrites.

- encrypt at rest
- block public access
- enable versioning
- disable `Delete*`, `Update*` on blobs; only allow `Get*`, `Describe*`, `Put*` on blobs
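
As a partial illustration, the delete restriction could be expressed as a policy statement like the one below, built as a Python dict and rendered to JSON. The bucket name `example-network-bucket` is hypothetical. The sketch covers only the delete denial; encryption at rest, blocking public access and versioning are bucket-level settings configured separately, and preventing overwrites is not shown here.

```python
import json

# Assumption: hypothetical bucket name, used for illustration only.
BUCKET_ARN = "arn:aws:s3:::example-network-bucket"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyObjectDeletion",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": f"{BUCKET_ARN}/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
```

Because the statement is an explicit deny for every principal, it would also stop administrators from deleting blobs, which is the intent here.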