re:Invent 2018 - ARC329 - Massively Parallel Data Processing at Scale
In this session we use a project called PyWren to run Python code in parallel at massive scale across AWS Lambda functions. We will use Landsat 8 satellite imagery to calculate a Normalized Difference Vegetation Index (NDVI) across multiple points of interest around the world, reading GeoTIFF data across multiple light-spectrum bands and applying an embarrassingly parallel NDVI calculation function written in Python.
This session will use a Jupyter Notebook, an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. This session also assumes the participant has a functional AWS account with at least one IAM (AWS Identity and Access Management) user configured for CLI (AWS Command Line Interface) access. Please follow the setup instructions below before proceeding with this session.
Virtual Environment Config
Create a virtual environment for this session. If you have a favorite environment manager installed, feel free to use that; otherwise, install virtualenv:
sudo pip install virtualenv
Next, create an environment for this workshop and activate it.
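The commands themselves are not shown above; a typical virtualenv invocation looks like the following (the environment name `arc329` is an arbitrary choice, not from the workshop):

```shell
# Create a Python 2.7 virtual environment (the name "arc329" is arbitrary)
virtualenv -p python2.7 arc329
# Activate it for the current shell session
source arc329/bin/activate
```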
We will need Jupyter Notebook running locally in your environment for this session, along with all the Python libraries necessary to visualize outputs.
1. Install Jupyter Notebook on your laptop. (Note: you can safely skip this step if you already have Jupyter Notebook with a Python 2.7 kernel installed on your machine.) To do so, follow the instructions at http://jupyter.org/install.html. If you have pip installed, the easiest way to install Jupyter is with pip. Please make sure to use Python 2.7 and not Python 3+ for this workshop.
sudo python -m pip install --upgrade pip
sudo python -m pip install jupyter
2. In addition, we need to install a variety of Python libraries. These are used for visualization, for processing datasets, and to talk to AWS directly from Python using the AWS SDK for Python (boto3). To do so, please execute the following command:
sudo python -m pip install boto3 warcio matplotlib numpy wordcloud nltk rasterio scipy seaborn awscli tldextract bs4 rio_toa folium pandas
3. In the last step, let's install the dependencies for PyWren and set it up locally on our machine.
sudo python -m pip install pywren==0.3.0
4. Let's use the PyWren interactive setup process to set up PyWren on your local system. When asked to configure advanced properties, select Y. Take note of the AWS Region (us-west-2 recommended), S3 bucket name, and function name, as we will need these values to configure our notebook parameters later.
$ pywren-setup
This is the PyWren interactive setup script
Your AWS configuration appears to be set up, and your account ID is 1234567890
This interactive script will set up your initial PyWren configuration.
If this is the first time you are using PyWren then accepting the defaults should be fine.
What is your default aws region? [us-west-2]:
Location for config file: [/Users/pauvince/.pywren_config]:
/Users/pauvince/.pywren_config already exists, would you like to overwrite? [y/N]: y
PyWren requires an s3 bucket to store intermediate data. What s3 bucket would you like to use? [pauvince-pywren-973]:
Bucket does not currently exist, would you like to create it? [Y/n]: y
PyWren prefixes every object it puts in S3 with a particular prefix.
PyWren s3 prefix: [pywren.jobs]:
Would you like to configure advanced PyWren properties? [y/N]: y
Each lambda function runs as a particular IAM role. What is the name of the role you would like created for your lambda [pywren_exec_role_1]: pywren_exec_role_973
Each lambda function has a particular function name. What is your function name? [pywren_1]: pywren_973
PyWren standalone mode uses dedicated AWS instances to run PyWren tasks. This is more flexible, but more expensive with fewer simultaneous workers.
Would you like to enable PyWren standalone mode? [y/N]: n
Creating config /Users/pauvince/.pywren_config
new default file created in /Users/pauvince/.pywren_config
lambda role is pywren_exec_role_973
Creating bucket pauvince-pywren-973.
Creating role.
Deploying lambda.
Pausing for 5 seconds for changes to propagate.
Pausing for 5 seconds for changes to propagate.
Successfully created function.
Pausing for 10 sec for changes to propagate.
function returned: Hello world
5. PyWren uses the default Python logger to communicate progress back. Let's raise our log level in the current terminal session to INFO before running the test function, so we can see the AWS Lambda activity:
6. Time to test our PyWren function and see whether our laptop can communicate with our AWS environment and invoke PyWren in an AWS Lambda function:
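The test command itself is not shown above; PyWren's CLI ships a built-in hello-world smoke test, which, assuming it is on your PATH after `pywren-setup`, is invoked like this:

```shell
# Invoke PyWren's built-in smoke test against the deployed Lambda function
pywren test_function
```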
You should now see an output similar to the following one:
2018-10-26 09:30:35,072 [INFO] pywren.executor: using serializer with meta-supplied preinstalls
2018-10-26 09:30:40,403 [INFO] pywren.executor: map 780d4829-7ff5-42b9-80f3-73e020aca329 00000 apply async
2018-10-26 09:30:40,405 [INFO] pywren.executor: call_async 780d4829-7ff5-42b9-80f3-73e020aca329 00000 lambda invoke
2018-10-26 09:30:40,880 [INFO] pywren.executor: call_async 780d4829-7ff5-42b9-80f3-73e020aca329 00000 lambda invoke complete
2018-10-26 09:30:40,910 [INFO] pywren.executor: map invoked 780d4829-7ff5-42b9-80f3-73e020aca329 00000 pool join
2018-10-26 09:30:49,215 [INFO] pywren.future: ResponseFuture.result() 780d4829-7ff5-42b9-80f3-73e020aca329 00000 call_success True
function returned: Hello world
Set the logging level to WARNING for the remainder of the session:
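The command itself is not shown above; from Python, lowering the root logger back to WARNING is a one-liner (a sketch, mirroring the INFO step earlier):

```python
import logging

# Drop back to WARNING so routine PyWren INFO messages no longer clutter the output.
logging.getLogger().setLevel(logging.WARNING)
```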
7. In the last step, we need to copy the function file we will be using in Lambda to the S3 bucket created by the PyWren setup. At the command line, type the following command, remembering to use your bucket name from the PyWren setup.
aws s3 cp s3://aws-samples-reinvent-arc329/lambda_function.zip s3://<your bucket name>/lambda_function.zip
Run the session
We will use Jupyter Notebook to run our session examples locally with PyWren and have the workload executed remotely. To do so, we first need to clone this repository to our local machine:
git clone https://github.com/aws-samples/reinvent2018-arc329-builders-workshop
Now enter the newly created reinvent2018-arc329-builders-workshop folder and start a Jupyter Notebook instance by typing the following command:
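The command itself is not shown above; with Jupyter installed as in step 1, the standard way to start a local notebook server from inside the folder is:

```shell
cd reinvent2018-arc329-builders-workshop
# Start a local notebook server; this opens your default web browser
jupyter notebook
```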
This will launch a Jupyter Notebook instance and open your web browser. Please click into the folder and launch the notebook by clicking on the file named ReInvent_ARC_329.ipynb. The notebook will open and provide you with the instructions to complete the session.
NDVI Calculation on Satellite Imagery
We will query various locations from the Landsat 8 satellite imagery and analyze them. Example scene:
We will then use the GeoTIFF imagery and its different bands to calculate the NDVI and analyze how much cloud coverage versus NDVI we had on certain days:
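The core of the embarrassingly parallel step is the NDVI formula itself, NDVI = (NIR - Red) / (NIR + Red); for Landsat 8 the near-infrared and red channels are bands 5 and 4. A minimal NumPy sketch (the band arrays and values are illustrative, not from the workshop data):

```python
import numpy as np

def ndvi(nir, red):
    """Compute NDVI = (NIR - Red) / (NIR + Red) per pixel."""
    nir = nir.astype("float64")
    red = red.astype("float64")
    denom = nir + red
    # Guard against division by zero where both bands read 0.
    return np.where(denom == 0, 0.0, (nir - red) / np.where(denom == 0, 1.0, denom))

# Illustrative 2x2 reflectance values: vegetation reflects strongly in NIR,
# so healthy vegetation yields NDVI close to +1.
nir_band = np.array([[0.50, 0.40], [0.30, 0.00]])
red_band = np.array([[0.10, 0.20], [0.30, 0.00]])
print(ndvi(nir_band, red_band))
```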
Lastly, we will plot NDVI changes over time for certain areas of interest around the world: