Finish overview
JiaweiZhuang committed Jan 26, 2018
1 parent 832a966 commit dd95d82
Showing 7 changed files with 248 additions and 63 deletions.
17 changes: 8 additions & 9 deletions doc/additional.rst
@@ -1,12 +1,11 @@
Additional resources
====================
Learn more about cloud computing
================================

Most cloud computing textbooks and AWS documentation are written
for web developers and system architects, NOT for domain scientists
who just want to do scientific computing and data analysis on the cloud.
As a domain scientist with limited IT background,
it is crucial to pick the right tutorial on cloud computing.
Here are my recommendations:
who just want to do scientific computing and data analysis.
As a researcher with limited IT background, it is crucial to pick the
right tutorial when learning cloud computing. Here are my recommendations:

[1] **University of Washington** has a very nice
`high-level overview <https://itconnect.uw.edu/research/
@@ -15,13 +14,13 @@ and
`technical documentation <https://cloudmaven.github.io/documentation/>`_
about cloud computing for scientific research.

[2] **Cloud Computing for Science and Engineering (Foster and Gannon 2017)**
[2] **Cloud Computing for Science and Engineering** (Foster and Gannon 2017)
is the first textbook I am aware of that provides hands-on tutorials for domain scientists.
The book is `freely available online <https://cloud4scieng.org/chapters/>`_.

[3] **Cloud Computing in Ocean and Atmospheric Sciences (Vance et al. 2016)**
[3] **Cloud Computing in Ocean and Atmospheric Sciences** (Vance et al. 2016)
gives a nice overview of various cloud computing applications in our field.
It doesn't tell you how to actually do cloud computing, though.
It doesn't tell you how to actually use the cloud, though.

[4] **Researcher’s Handbook by AWS** is the most useful AWS material for you
(as a scientist, not an IT person). You will need to sign-up the
40 changes: 0 additions & 40 deletions doc/advanced.rst

This file was deleted.

33 changes: 22 additions & 11 deletions doc/index.rst
@@ -7,33 +7,44 @@ GEOS-Chem on cloud computing platforms
======================================

`GEOSChem-on-cloud <http://acmg.seas.harvard.edu/research.html#cloud>`_
project aims to build a cloud computing capability for GEOS-Chem_ that is
accessible by researchers worldwide, addressing many computational challenges
for the next generation of atmospheric chemistry modeling.
project aims to build a cloud computing capability for GEOS-Chem_ that can be easily
accessed by researchers worldwide.

See :ref:`motivation-label` for why to move to the cloud.
See :ref:`tutorial-label` to start your first GEOS-Chem simulation on the
`Amazon Web Service <https://aws.amazon.com>`_ (AWS) cloud within 10 minutes
(and within seconds for the second time).
See :ref:`motivation-label` and :ref:`new-opportunity-label` for why to move to the cloud.
See :ref:`tutorial-label` to start your first GEOS-Chem simulation on the
`Amazon Web Services (AWS) <https://aws.amazon.com/>`_ cloud within 10 minutes
(and within seconds the next time).

This project is supported by the AWS Public Data Set Program and
the NASA Atmospheric Composition Modeling and Analysis Program (ACMAP).

.. warning::
   This project is at an early stage of development and things are moving very fast.
   *GEOSChem-on-cloud* is at an early stage of development and things are moving very fast.
   Please use the `GitHub issue tracker <https://github.com/JiaweiZhuang/cloud_GC/issues>`_
   to request new functionality, report bugs, or just discuss general issues.

Contents
Contents
--------


.. toctree::
   :maxdepth: 1
   :caption: Why move to the cloud

   motivation
   new_opportunity

.. toctree::
   :maxdepth: 1
   :caption: Tutorials

   tutorial
   advanced
   additional

.. toctree::
   :maxdepth: 1
   :caption: Additional resources

   additional
   status_of_cloud

.. _GEOS-Chem: http://acmg.seas.harvard.edu/geos/
75 changes: 73 additions & 2 deletions doc/motivation.rst
@@ -1,4 +1,75 @@
.. _motivation-label:

Motivation
==========
Remove technical barriers
=========================

Atmospheric scientists often need to waste time on non-science tasks:
installing software libraries, making models compile and run without bugs,
preparing model input data, or even setting up a Linux server.

Those technical tasks are getting more and more challenging --
as atmospheric models evolve to incorporate more scientific understanding
and better computing technologies, they also need more complicated software,
more computing power, and much more data.

Cloud computing can largely alleviate those problems. **The goal of this project is
to allow researchers to fully focus on scientific analysis, not
fighting with software and hardware problems.**

Software
--------

On the cloud, you can launch a server with everything configured correctly.
Once I have built the model and saved it as an `Amazon Machine Image (AMI)
<https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html>`_,
anyone can replicate exactly the same software environment and start using
the model immediately (see :ref:`tutorial-label`).
You will never see compile errors anymore.
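
As a rough illustration of the paragraph above, launching a server from a shared AMI can be scripted with ``boto3``, the AWS SDK for Python. This is only a sketch under stated assumptions: the AMI ID ``ami-xxxxxxxx``, the key name, the instance type, and the region are placeholders rather than values from this project, and the helper names are hypothetical. The launch function also needs AWS credentials configured locally.

```python
def build_launch_request(ami_id, instance_type="c4.large", key_name=None):
    """Assemble keyword arguments for boto3's EC2 ``run_instances`` call.

    The default instance type is an arbitrary choice for illustration.
    """
    request = {
        "ImageId": ami_id,            # the shared AMI holding the pre-built model
        "InstanceType": instance_type,
        "MinCount": 1,                # launch exactly one server
        "MaxCount": 1,
    }
    if key_name is not None:
        request["KeyName"] = key_name  # SSH key pair used to log in
    return request


def launch_from_ami(ami_id, region="us-east-1", **kwargs):
    """Launch one instance from the AMI and return its instance ID.

    Requires boto3 and valid AWS credentials; the region is an assumption.
    """
    import boto3  # imported lazily so build_launch_request works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**build_launch_request(ami_id, **kwargs))
    return response["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    # "ami-xxxxxxxx" and "my-key" are placeholders, not real identifiers.
    print(build_launch_request("ami-xxxxxxxx", key_name="my-key"))
```

Everyone who passes the same AMI ID gets the same software environment, which is the point made above. The tutorial itself uses the AWS web console rather than a script.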

This has more implications in the age of High-Performance Computing (HPC).
Modern atmospheric models are often built with complicated software frameworks,
notably the `Earth System Modeling Framework (ESMF)
<https://www.earthsystemcog.org/projects/esmf/>`_.
Those frameworks allow model developers to utilize HPC technologies without
writing tons of boilerplate `MPI <https://computing.llnl.gov/tutorials/mpi/>`_ code,
but they add extra burdens on model users --
installing and configuring those frameworks is daunting, if not impossible,
for a typical graduate student without a CS background. Fortunately,
no matter how difficult it is to install those libraries, only one person needs to
build them once on the cloud. Then, no one needs to repeat this work.

.. note::
   This software dependency hell can also be solved by containers such as
   `Docker <https://www.docker.com>`_ and `Singularity <http://singularity.lbl.gov>`_
   (e.g. `Docker-WRF <https://ral.ucar.edu/projects/ncar-docker-wrf>`_).
   But the cloud also solves compute and data problems, as discussed below.
   You can combine containers and cloud to have a consistent environment
   across local machines and cloud platforms.

Compute
-------

Local machines need up-front investment and have fixed capability.
Right before AGU, everyone is running models and jobs are pending forever in the queue.
During Christmas, no one is working and machines are just idle but still incur maintenance cost.

Clouds are elastic. You can request an `HPC cluster <https://aws.amazon.com/hpc/>`_
with 1000 cores for just 1 hour, and only pay for exactly that hour.
If you have powerful local machines, you can still use the cloud
to boost computing power temporarily.
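
The elasticity argument can be made concrete with a little arithmetic. The sketch below assumes simple pay-per-use billing (cost = cores x hours x price) and perfect parallel scaling. The price per core-hour is a made-up number for illustration only, not an actual AWS price.

```python
def on_demand_cost(cores, hours, price_per_core_hour):
    """Cost of renting `cores` for `hours` under simple pay-per-use billing."""
    return cores * hours * price_per_core_hour


PRICE = 0.05  # hypothetical USD per core-hour, NOT a real AWS price

# The same 1000 core-hours of work, run in two different ways.
burst = on_demand_cost(cores=1000, hours=1, price_per_core_hour=PRICE)
steady = on_demand_cost(cores=10, hours=100, price_per_core_hour=PRICE)

print(f"1000 cores for   1 hour : ${burst:.2f}")
print(f"  10 cores for 100 hours: ${steady:.2f}")
```

Under these assumptions both runs cost the same, but the burst finishes 100 times sooner. A fixed-size local cluster cannot make that trade.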

Data
----

GEOS-Chem currently has 30 TB of GEOS-FP/MERRA2 meteorological input data.
With a bandwidth of 1 MB/s, it takes nearly two weeks to download a 1-TB subset
and about a year to download the full 30 TB. To set up a high-resolution
nested simulation, one often needs to spend a long time getting the
corresponding meteorological fields. `GCHP <http://wiki.seas.harvard.edu/geos-chem/index.php/GEOS-Chem_HP>`_
can ingest global high-resolution data and will push the data size even higher.
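
The download times quoted above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^6 MB) and a sustained, uninterrupted transfer. The function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def download_days(size_tb, bandwidth_mb_per_s):
    """Days needed to transfer `size_tb` terabytes at a constant bandwidth."""
    size_mb = size_tb * 1e6  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / bandwidth_mb_per_s / SECONDS_PER_DAY


for size_tb in (1, 30):
    days = download_days(size_tb, bandwidth_mb_per_s=1)
    print(f"{size_tb:>2} TB at 1 MB/s: {days:.1f} days")
```

At 1 MB/s this gives a little under 12 days for 1 TB and roughly 347 days for 30 TB, consistent with the estimates in the text.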

The new paradigm to solve this big data challenge is to "move compute to data"
(also see :ref:`earth-data-label`).
AWS has agreed to host all GEOS-Chem input data for free under the Public Data Set Program.
By having all the data already available in the cloud environment,
you can perform simulations over any periods with any configurations.
81 changes: 81 additions & 0 deletions doc/new_opportunity.rst
@@ -0,0 +1,81 @@
.. _new-opportunity-label:

Open new research opportunities
===============================

The cloud not only makes model simulations much easier, but also opens many new
research opportunities in Earth science.

.. _earth-data-label:

Massive Earth observation data
------------------------------

Massive amounts of satellite and other Earth science data are being moved to the cloud.
One success story is the migration of NOAA's NEXRAD data to AWS
(`Ansari et al., 2017, BAMS <https://journals.ametsoc.org/doi/abs/10.1175/BAMS-D-16-0021.1>`_) --
it is reported that "data access that previously took 3+ years to complete now requires only a few days"
(`NAS, 2018 <https://www.nap.edu/catalog/24938/thriving-on-our-changing-planet-a-decadal-strategy-for-earth>`_,
Chapter "Data and Computation in the Cloud").
By learning cloud computing you can get access to massive
`Earth science datasets on AWS <https://aws.amazon.com/earth/>`_,
without having to spend a long time downloading them to local machines.

The most exciting project is perhaps
`the cloud migration of NASA’s Earth Observing System Data and Information System (EOSDIS)
<https://earthdata.nasa.gov/about/eosdis-cloud-evolution>`_.
It will open new opportunities such as ultra-high-resolution inversion of satellite data,
leveraging massive data and computing power available on the cloud.
This kind of analysis would be hard to imagine on traditional platforms.

.. _deep-learning-label:

Deep learning and AI
--------------------

There is a growing interest in applying machine learning in Earth science,
as illustrated clearly by the AGU 2017 Fall meeting
(`H082 <https://agu.confex.com/agu/fm17/preliminaryview.cgi/Session22660>`_,
`A028 <https://agu.confex.com/agu/fm17/preliminaryview.cgi/Session26710>`_)
and AMS 2018 meeting
(`AMS-AI <https://ams.confex.com/ams/98Annual/webprogram/17AI.html>`_).

Cloud platforms are the go-to choice for training machine learning models, especially
deep neural networks. There are massive numbers of GPUs on the cloud,
which can offer `~50x the performance <https://github.com/jcjohnson/cnn-benchmarks>`_
of CPUs for training neural nets. Pre-configured environments on the cloud
(e.g. `AWS Deep Learning AMI <https://aws.amazon.com/machine-learning/amis/>`_)
allow users to run programs immediately without wasting time configuring
GPU libraries (mostly `cuDNN <https://developer.nvidia.com/cudnn>`_).

Instructions on using the cloud are often included in deep learning textbooks and course materials:

- `Stanford CS231n: Convolutional Neural Networks for Visual Recognition
  <http://cs231n.github.io/>`_.
  See `Google Cloud Tutorial <http://cs231n.github.io/gce-tutorial/>`_ and
  `AWS Tutorial <http://cs231n.github.io/aws-tutorial/>`_.
  (CS231n is probably one of the most popular deep learning courses,
  with all videos and materials freely available online.)

- `Deep learning with Python <https://www.manning.com/books/deep-learning-with-python>`_
  by François Chollet, the author of Keras. See Appendix B. Running Jupyter notebooks on AWS GPU.
  (This book `has a full 5-star rating on Amazon
  <https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438>`_.)

- `Deep Learning - The Straight Dope <http://gluon.mxnet.io/index.html>`_.
  It is a very nice interactive textbook on deep learning.
  Its `official Chinese version <https://zh.gluon.ai/>`_ has
  `instructions on using AWS <https://zh.gluon.ai/chapter_preface/aws.html>`_.
  See `AWS official docs <https://docs.aws.amazon.com/mxnet/latest/dg/gs.html>`_
  for the equivalent English version.

... and in the official documentation of ML/DL frameworks:

- `Keras on AWS GPU <https://blog.keras.io/running-jupyter-notebooks-on-gpu-on-aws-a-starter-guide.html>`_.
  Keras is the most popular high-level deep learning library, built on top of TensorFlow.

- `XGBoost on AWS cluster <https://xgboost.readthedocs.io/en/latest/tutorials/aws_yarn.html>`_.
  XGBoost is the most popular library for
  `gradient boosting <https://xgboost.readthedocs.io/en/latest/model.html>`_,
  and is also `the most widely used tool in Kaggle
  <http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/>`_.
52 changes: 52 additions & 0 deletions doc/status_of_cloud.rst
@@ -0,0 +1,52 @@
Status of cloud for scientific computing
========================================

The cloud was originally invented for web applications, not for scientific computing.
But the interest in using cloud platforms for science is growing rapidly,
especially in the last 2-3 years. Technical tests of whether the cloud is suitable
for science have been conducted for about 10 years, tracing back to `Evangelinos and Hill (2008)
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.296.3779>`_
who tested `MITgcm <http://mitgcm.org>`_ on AWS.
Now we are starting to see mature applications for daily research work --
the democratization of cloud computing!

Atmospheric models
------------------

`The Modeling Research in the Cloud Workshop
<https://www.unidata.ucar.edu/events/2017CloudModelingWorkshop/>`_
was hosted by NCAR in 2017. Most presentation slides
are `available online
<https://www.unidata.ucar.edu/events/2017CloudModelingWorkshop/#schedule>`_.

Relevant applications
---------------------

- `OpenFOAM on cloud <https://cfd.direct/cloud/>`_.
  OpenFOAM is a popular library for computational fluid dynamics (CFD).
  The team `started cloud migration in 2015 <https://cfd.direct/cloud/year-1-cloud/>`_
  and the project is now very mature. CFD simulations share a lot of similarities with
  atmospheric simulations. They both solve variants of the Navier–Stokes equations
  using MPI-based domain decomposition (several studies
  `couple OpenFOAM with WRF <https://scholar.google.com/scholar?q=OpenFOAM+WRF>`_).


Classes
-------

Scientific computing classes are starting to teach and use cloud platforms,
mostly AWS:

- `Harvard CS205 <http://iacs-courses.seas.harvard.edu/courses/cs205/index.html>`_,
  2018 Spring, Computing Foundations for Computational Science

- `MIT 18.337/6.338 <http://courses.csail.mit.edu/18.337/2017/index.html>`_,
  2017 Fall, Modern Numerical Computing in Julia

- `Duke STA663 <http://people.duke.edu/~ccc14/sta-663-2017/>`_,
  2017, Computational Statistics in Python

- Also see :ref:`deep-learning-label`

(If you know more about them please `post an issue
<https://github.com/JiaweiZhuang/cloud_GC/issues>`_!)
13 changes: 12 additions & 1 deletion doc/tutorial.rst
@@ -1,8 +1,19 @@
.. _tutorial-label:

Beginner Tutorial
Beginner tutorial
=================

.. warning::
   This tutorial shows how to perform proof-of-concept GEOS-Chem simulations.
   AWS has agreed to host all GEOS-Chem input data for free. After the data
   transfer is done, I will add tutorials for real research workflows
   (`issue#4 <https://github.com/JiaweiZhuang/cloud_GC/issues/4>`_).

   The current version is the same as what was presented at
   `IGC8 <http://acmg.seas.harvard.edu/geos/meetings/2017/index.html>`_ in May 2017.
   I will provide a more up-to-date version soon
   (`issue#1 <https://github.com/JiaweiZhuang/cloud_GC/issues/1>`_).

Step 1: Sign up for an Amazon Web Services (AWS) account
--------------------------------------------------------
Go to http://aws.amazon.com,
