# Implementation

## Contents

* [System Architecture](#architecture)
* [Dependencies and Libraries](#libraries)
* [Technical Difficulties](#difficulties)
* [git, gh-pages, Process Book](#git)


<a id="architecture"></a>
## System Architecture

The following diagram provides a high level overview of the system architecture:

![System Architecture](SystemArchitecture.png)

The majority of the processing is performed in client-side JavaScript. The prediction calculations are currently performed in a Python-based web service hosted on Heroku, as the implementation uses scikit-learn's [RandomForest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) library. Initially, there was significant latency when using the web service, with delays of up to 20 seconds for a single prediction. This did not meet our goal of a highly interactive, game-like experience when performing predictions.

We unsuccessfully attempted to replace the webservice with a client-based Random Forest JavaScript implementation, such as [Karpathy's](https://github.com/karpathy/forestjs).

Update April 18, 2016. We did some further investigation and found that Karpathy's implementation does not appear to provide the prediction probabilities in a useful (or perhaps even accurate) fashion. The next library we will try is from [JFrazelle](https://github.com/jfrazelle/random-forest-classifier).

Update April 20, 2016. Today, we made major progress in solving the web service performance problem. It turns out that, when hosted on Heroku, the web service was retraining the entire Random Forest for every single prediction request. We made three changes to the web service that **dropped the prediction time from at least 20 seconds to 2 seconds**:

* We trained the Random Forest on our development machines and persisted the result in a file (using Python pickling). We tried the [joblib](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#model-persistence) approach recommended by scikit-learn, but that generated 5000 persistence files totaling 2 GB!
  * We also looked into using Redis, but that had other complications. It wasn't clear if we could store a 300 MB string in Redis. It would have involved other dependencies in the webservice, and the performance improvement may not have been great. Given we were reading a static Random Forest object, there was likely little advantage to using Redis.
* We reduced the number of estimators / decision trees from 1000 to 150 to save time and pickle file size. The resulting pickle size was *only* 271 MB. We do not believe there is a significant difference in prediction accuracy.
* We reduced the number of gunicorn threads from unlimited to 2, based on a [Stack Overflow suggestion](http://stackoverflow.com/questions/12079582/error-r14-memory-quota-exceeded-not-visible-in-new-relic) that worked out really well. It had the added advantage of keeping the classifier in memory more of the time.
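
The train-once, load-once idea behind these changes can be sketched roughly as follows. This is a minimal illustration, not the actual service code: the file name, the `save_model` and `get_model` helpers, and the dictionary standing in for the trained classifier are all hypothetical, chosen so that the sketch runs with only the standard library. A fitted scikit-learn `RandomForestClassifier` can be pickled the same way.

```python
import os
import pickle
import tempfile

# Hypothetical location for the persisted model file.
MODEL_PATH = os.path.join(tempfile.gettempdir(), "forest_sketch.pkl")

# Module-level cache so the model is read from disk at most once per process.
_model = None


def save_model(model, path=MODEL_PATH):
    """Persist an already-trained model (run once, on a development machine)."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)


def get_model(path=MODEL_PATH):
    """Return the cached model, unpickling it only on the first call."""
    global _model
    if _model is None:
        with open(path, "rb") as handle:
            _model = pickle.load(handle)
    return _model


# Stand-in for the trained forest; the real object would be the classifier.
save_model({"n_estimators": 150, "kind": "stand-in"})

first = get_model()   # reads the pickle from disk
second = get_model()  # served from memory, no retraining or reloading
```

Each request handler would call `get_model()` instead of training, so the cost of loading the pickle is paid once per worker process rather than once per prediction.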

We believe that the performance is adequate for now. We will turn our attention to the innovative visualization.

<a id="libraries"></a>
## Dependent Libraries and Tools

The Prediction and Drill-down visualizations were implemented in [D3](https://d3js.org/). The Innovative (3D) visualization was implemented in [three.js](http://threejs.org/).

In addition to the aforementioned scikit-learn Random Forest library, we also used a Python script to convert the process book from IPython / [Jupyter](http://jupyter.org/) notebooks to static HTML. Each chapter is a separate notebook, which is then loaded into a single HTML page via [jQuery](https://jquery.com/).

The public website and process book were hosted via [GitHub pages](https://help.github.com/articles/creating-project-pages-manually/). As mentioned previously, the web service was hosted on [Heroku](https://www.heroku.com/).

For the grid layout, we re-used a custom CSS grid layout from a prior class. We did not use Bootstrap, Foundation or any other CSS framework for multiple reasons: 1) we did not want a lot of extra bloat for the layouts; 2) we did not want the CSS framework possibly interfering with our visualization areas; 3) we wanted finer control over media queries and other responsive techniques in order to support mobile devices in the future. We did use [Sass](http://sass-lang.com/) at compile time to standardize the majority of our CSS.





<a id="difficulties"></a>
## Technical Difficulties Encountered

In the course of implementing a large project with multiple authors and dependencies, we encountered a number of challenges.

### 3D Visualization

The 3D visualization was the most technically challenging portion of the project. The three.js library is not as well documented and does not have as many sample files as D3. Learning about cameras, Phong meshes, light sources, sphere turbidity and other details while imagining a new visualization in 3D was *interesting*. There are probably two labs' worth of material just on building 3D visualizations, perhaps even enough material for an entire course if virtual and augmented reality are included.

### Web Service Challenges

Implementation of a custom webservice was challenging. We encountered numerous problems: 

1. **Heroku out-of-memory errors** Heroku would run out of memory on a regular basis even when the web service ran correctly locally. We discovered that Heroku was spawning too many threads to service requests. Lowering the number of threads reduced memory usage while improving reliability and response time. This was counter-intuitive.
1. **Cross-domain web service requests** In Lab 8, we learned how to consume a web service through a proxy. In our case, we had to add the capability of accessing our own web service from any domain by implementing the `Access-Control-Allow-Origin` header.
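
A minimal sketch of the cross-domain fix, assuming a plain WSGI application (the interface that gunicorn serves). The function names and the JSON payload are placeholders rather than the actual endpoint, and the `*` value allows requests from any origin, as described above.

```python
import json
from wsgiref.util import setup_testing_defaults


def predict_app(environ, start_response):
    """Hypothetical WSGI endpoint that returns JSON readable from any origin."""
    body = json.dumps({"probability": 0.5}).encode("utf-8")  # placeholder payload
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # Without this header, browsers block scripts on other domains
        # from reading the response.
        ("Access-Control-Allow-Origin", "*"),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app):
    """Invoke a WSGI app in-process and return (status, headers, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal, valid WSGI environ
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call_app(predict_app)
```

Calling the app in-process like this is a convenient way to confirm the header is present without deploying anything.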

### JavaScript Challenges

1. **Calling the web service** Prior to Lab 8 and the web service changes, we implemented a provisional solution using JSONP. After the web service improvement, we did not use D3 to access the web service, instead relying on the lower-level `XMLHttpRequest` API in JavaScript in order to have finer-grained control.
1. **`<object> is not defined`** As the code base grew, especially with the incorporation of three.js, we started encountering load errors where D3 would load the data files before the browser had finished loading all the prerequisite JavaScript files, leading to sporadic error messages and failed visualizations. We solved this by wrapping the D3 `queue()` call inside `window.addEventListener("load", ...)`.
1. **Sharing Data Between Pages** We used a combination of cookies and `localStorage` to share data between the multiple pages. Performance was sufficient so we did not optimize further. There is ample room for further optimization such as caching the CSV data files in localStorage.

### Other

1. **Publishing to PDF** Jupyter notebooks offer the possibility of publishing to PDF. We split the process book into multiple chapters to allow modular updates (since Jupyter is not a git-friendly format). Unfortunately, recombining multiple PDFs together led to low quality results. Exporting to HTML, with some manipulation, proved more robust.
1. As is all too common, a **bad commit to git** led to a dysfunctional git repository. We lost a day rebuilding the HEAD and manually merging the changes.

<a id="git"></a>
## git Usage

Since we were working concurrently on multiple parts of this
project, `git` was a very useful tool for ensuring that we could
work in parallel, even on the same code, without interference.

Here are our internal notes on using git, and publishing the site and process book to gh-pages.

### Copy this repository to your machine

`git clone git@github.com:wihl/cs171-project.git`

### Creating a branch

This will create a new branch in the code. You can do anything you
want in this branch and it will not affect anything in master,
until a merge is done (see below).

```
cd cs171-project
git checkout -b newbranch
git push origin newbranch
git push --set-upstream origin newbranch
```
The first `git push origin newbranch` sends the branch to GitHub so
everyone else can see it. The `--set-upstream` push then links your
local branch to the remote one, so later you can simply run `git push`.

From this point do whatever edits you want. Create a subdirectory,
add files, make changes.

### Commit early and regularly

Do not leave your code without a backup on your machine. I commit every
time I add something new and useful that works. It could be as little
as 2-3 lines of code.

This is the same as in class, although now you will be committing just
to your branch.

Assuming you are in the cs171-project directory:

```
git add .
git commit -m "<one line summary of what changed>"
git push
```

### See what others have done

To grab a fresh local copy of everything in the GitHub repository, use
`git fetch`. Any new branches anyone else has created will be
downloaded.

To switch to their branch while leaving yours untouched, use
`git checkout otherbranch`. To get back to your branch, use `git checkout mybranch`.

### Merging - Step 1

The first step in merging is to merge the latest master branch into
your branch:

```
git fetch
git merge origin/master
```
If there are any conflicts, such as two people having written to the same
file in the same place, git will let you know. It is pretty smart
about merging different pieces of code or files that do not impact each
other. This is one of git's most amazing features.

After you have reconciled any merge conflicts, you can commit
the merged master into your branch using the same commands as
before:

```
git add .
git commit -m "I merged in master"
git push
```
Now your branch has all of master and all of your new stuff.

### Merging - Step 2

Now that you are fully up to date in your branch, you can update master:

```
git checkout master
git merge newbranch
git status
git push
```
Now your code has been integrated into master. Your work is done.

Create a new branch to start on another unit of work (or keep
working on the same branch if it isn't done yet).

It is a good idea to not wait too long to merge into master. Once it is
in a stable state, ready to use by others if not feature complete, do
the merge into master.

### That's It

GitHub offers further collaboration features, such as pull requests and
code reviews, but I don't think we need them for this project.

## Using GitHub Pages

The [gh-pages](https://help.github.com/articles/creating-project-pages-manually/) branch is where the public-facing information
lives. Per the GitHub docs, this branch is an orphan, so you
might as well think of it as two repositories in one.

Since it is effectively two repositories, I keep two
copies of the repository on my local hard drive:

Primary copy:
```
cd <wherever you put projects>
git clone git@github.com:wihl/cs171-project.git
```
This creates a `cs171-project` directory in your current location.

Then I make a secondary copy in a parallel directory **not under cs171-project**.

So while still in the current directory, not in the cs171-project directory:
```
mkdir projectwebsite
cd projectwebsite
git clone git@github.com:wihl/cs171-project.git
cd cs171-project
git checkout gh-pages
```

So now I have
```
./cs171-project
./projectwebsite/cs171-project # gh-pages branch
```


### Contents of `gh-pages`

There are two subdirectories: `./client` and `./processbook`

We also use a CNAME to have a well defined public facing URL (vis.chanceme.info) instead of the default `wihl.github.io/cs171-project`.

When the user navigates to [vis.chanceme.info](http://vis.chanceme.info), a small `index.html` file redirects them to the client code.

If the user explicitly navigates to [vis.chanceme.info/processbook](http://vis.chanceme.info/processbook), they skip over the small `index.html` and land on the process book.

### How to Release a New Public Version

Given the file structure above, releasing a new public version
of the client consists of copying the client code from the master
repository into the repository having the gh-pages branch.

```
cd <wherever you put projects>
cd cs171-project
cp -r client ../projectwebsite/cs171-project/
cd ../projectwebsite/cs171-project
```

Now the files are staged. Check them on your local machine with a webserver, such as:

```
python -m SimpleHTTPServer 8888 &
open http://localhost:8888
```
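
Note that `SimpleHTTPServer` is the Python 2 module name; in Python 3 the module is `http.server`, so the command-line equivalent is `python3 -m http.server 8888`. As a hedged sketch, the same local check can also be scripted. The `make_preview_server` helper is hypothetical and assumes Python 3.7 or later for the `directory` argument:

```python
import functools
import http.server


def make_preview_server(directory=".", port=8888):
    """Build (but do not start) a static file server rooted at `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.HTTPServer(("localhost", port), handler)
```

Calling `make_preview_server().serve_forever()` from the staged directory serves the site until interrupted with Ctrl+C.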

If you are happy with results, publish it to the public:

```
git add .
git commit -m "awesome new features"
git push
```

About a minute after the push is done, it will be live.

### Updating the Process Book(s)

The process books are stored in IPython/Jupyter notebooks.

```
cd <wherever you put projects>/cs171-project/processbook
jupyter notebook
```

and then open the notebook you want to modify.

When you are done and ready to publish a new version of the
process book, there is a convenient script in the processbook directory called `publish.py`.

This script converts the Jupyter notebooks to HTML, stitches them together, copies the results to the project website and updates the processbook
index page.
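
For orientation, the conversion and index steps might look like the sketch below. It is not the contents of `publish.py`: the chapter file names, the helper names, the output directory, and the index markup are all assumptions for illustration. The only external piece is the `jupyter nbconvert --to html` command, which the sketch builds but does not run.

```python
import html
import os


def nbconvert_command(notebook, output_dir):
    """Return the argument list for converting one notebook to HTML."""
    return ["jupyter", "nbconvert", "--to", "html",
            "--output-dir", output_dir, notebook]


def build_index(notebooks):
    """Return a simple HTML index linking to each converted chapter."""
    items = []
    for notebook in notebooks:
        stem = os.path.splitext(os.path.basename(notebook))[0]
        items.append('<li><a href="%s.html">%s</a></li>'
                     % (html.escape(stem, quote=True), html.escape(stem)))
    return "<ul>\n%s\n</ul>" % "\n".join(items)


# Hypothetical chapter names; the real notebooks may be named differently.
chapters = ["01-overview.ipynb", "02-implementation.ipynb"]
commands = [nbconvert_command(nb, "../build") for nb in chapters]
index_html = build_index(chapters)
```

Each command list could then be passed to `subprocess.run(cmd, check=True)` before writing `index_html` to disk.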

Once it has done its thing, check it and then publish it publicly.

```
python publish.py
cd ../../projectwebsite/cs171-project
python -m SimpleHTTPServer 8888 &
open http://localhost:8888/processbook
```

If it all looks good, then publish it to the public:
```
git add .
git commit -m "updates to processbook"
git push
```

And then check it: [vis.chanceme.info/processbook](http://vis.chanceme.info/processbook)