Automated and Serverless API Scraping with Python and AWS: Get Data Sets on Aussie Residential Listings
I created a Lambda function in Amazon Web Services (AWS) that runs on a schedule to collect information on Aussie residential listings from the Domain Application Programming Interface (API) and store it as `csv` data sets in a Simple Storage Service (S3) bucket. I wrote the Lambda function in Python and deployed it to the AWS cloud with specific permissions and a schedule of execution (an IAM role and a CloudWatch event) through the AWS Command Line Interface (CLI). The Lambda function has access to my Domain API credentials stored in the AWS Systems Manager (SSM) Parameter Store. This project follows the architectural diagram shown below. The data sets are in long format and suitable for data analytics and visualization.
These instructions will get you a copy of the project up and running on your AWS account for development and testing purposes. See deployment for notes on how to deploy the project on the AWS cloud.
The following are needed to deploy the project on the AWS cloud:
- Create a developer account at Domain and start a new project. Take note of your OAuth credentials (`Client Id` and `Client Secret`).
- Sign up for an AWS account. It is free and will give you access to the AWS Free Tier services.
- (Optional) Install the AWS CLI. Aside from the AWS CLI, you can create and deploy AWS resources using the AWS Management Console.
- Store your OAuth credentials (`Client Id` and `Client Secret`) as string parameters in the cloud using the AWS SSM Parameter Store.
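The parameters can be stored with `aws ssm put-parameter` or through the console. At run time, the Lambda function reads them back with `boto3`. A minimal sketch, assuming the (hypothetical) parameter names `domain-client-id` and `domain-client-secret` — substitute whatever names you chose:

```python
def get_domain_credentials(ssm, id_name="domain-client-id",
                           secret_name="domain-client-secret"):
    """Read the Domain OAuth credentials back from the SSM Parameter Store.

    `ssm` is a boto3 SSM client. The default parameter names are
    placeholders, not the ones used in this project.
    """
    client_id = ssm.get_parameter(Name=id_name)["Parameter"]["Value"]
    client_secret = ssm.get_parameter(Name=secret_name)["Parameter"]["Value"]
    return client_id, client_secret
```

In the Lambda handler you would call this as `get_domain_credentials(boto3.client("ssm"))`; if you stored the secret as a `SecureString`, add `WithDecryption=True` to the `get_parameter` calls.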
- Prepare your deployment package as a zip file (`lambda.zip`). The deployment package contains the following:
  - The Python script of the Lambda function (`lambda_function.py`). In the script, you will need to edit the following: the names of your OAuth parameters stored in the AWS SSM Parameter Store; the `suburb` list and the contents of the `payload` function if you are interested in residential listings for sale in other Aussie states; the name of the S3 bucket where you want to save your data sets; and the folder and file names of your data set files. Check `lambda_function.ipynb` for a detailed discussion of the code.
  - Dependencies for the `pandas`, `numpy`, and `requests` Python modules. I built the dependencies using an Amazon Linux 2 AMI (Amazon Machine Image) on an AWS Elastic Compute Cloud (EC2) instance. I deployed the EC2 instance using the AWS CLI and used PuTTY for the connection. I then installed Python 3.8.1 and pip. I used FileZilla to transfer `lambda_function.py` to the EC2 instance. With `lambda_function.py` inside the new directory `lambda`, I pip-installed the modules, zipped the contents, and transferred the created `lambda.zip` back to my local Windows machine.
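As a rough illustration of the kind of edits involved, here is what a `payload` helper might look like. The field names follow the Domain residential listings search endpoint (`POST /v1/listings/residential/_search`) as I understand it; treat both the field names and the suburb list below as assumptions to check against the current API documentation and the actual `lambda_function.py`:

```python
def payload(suburb, state="NSW", page_number=1):
    """Build the JSON body for one page of a residential listings search.

    The keys mirror Domain's listings search request schema; verify them
    against the current API docs before relying on this sketch.
    """
    return {
        "listingType": "Sale",          # residential listings for sale
        "pageSize": 100,
        "pageNumber": page_number,
        "locations": [{
            "state": state,             # change for other Aussie states
            "suburb": suburb,
            "includeSurroundingSuburbs": False,
        }],
    }


# hypothetical suburb list -- edit to the suburbs you want to track
suburbs = ["Parramatta", "Epping", "Chatswood"]
bodies = [payload(s) for s in suburbs]
```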
- Create two S3 buckets in the AWS cloud: one for receiving your deployment package (`lambda.zip`) and another for storing your `csv` data sets.
- Copy `lambda.zip` into your S3 bucket. Take note of the name of the S3 bucket and the S3 key (the filename of the deployment package) that you provided in the copy operation.
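If you prefer to script this step rather than use the console, the bucket creation and copy can be sketched with `boto3`. The bucket names below are placeholders (bucket names must be globally unique):

```python
import os


def stage_deployment_package(s3, deploy_bucket, data_bucket,
                             zip_path="lambda.zip"):
    """Create the two buckets and upload the deployment package.

    `s3` is a boto3 S3 client. Returns the (bucket, key) pair that the
    CloudFormation template will need to locate the code.
    """
    for bucket in (deploy_bucket, data_bucket):
        # outside us-east-1, boto3 also needs a CreateBucketConfiguration
        # with a LocationConstraint for your region
        s3.create_bucket(Bucket=bucket)
    key = os.path.basename(zip_path)
    s3.upload_file(zip_path, deploy_bucket, key)
    return deploy_bucket, key
```

Calling `stage_deployment_package(boto3.client("s3"), "my-deploy-bucket", "my-data-bucket")` would stage the package and hand back the bucket/key pair to write into the template.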
To deploy the Lambda function in the AWS cloud:
- Prepare the CloudFormation template (see `my1studacityproj.yml`). In the template, I have defined a Lambda function (`AWS::Lambda::Function`) called `PullListings`, specified the handler (`Python_filename.function_name`), indicated the source of the code (the S3 bucket name and S3 key), and stated the runtime (`python3.8`) and timeout (`300` seconds). I have also attached the execution role `PullListingsRole` in the deployment of the Lambda function. `PullListingsRole` is also defined in the template (`AWS::IAM::Role`) and provides specific permissions to the Lambda function. The entity `lambda.amazonaws.com` is allowed to access AWS services and resources using temporary security credentials (`sts:AssumeRole`). Policies are also attached to this role: `AWSLambdaBasicExecutionRole` to create log groups and streams and put events into log streams (CloudWatch Logs permissions); `AmazonS3FullAccess` to read and write to the S3 bucket assigned to store the `csv` data sets; and `ssm_read` to access the variables stored in the Parameter Store. The template also has an event rule (`AWS::Events::Rule`) which executes the Lambda function every 7 days. This event rule needs permission to invoke the Lambda function (`AWS::Lambda::Permission`), which is also written in the CloudFormation template.
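For orientation, the core of such a template might look like the following fragment. The resource names match those described above, while the bucket name, key, and handler are placeholders to adapt; the `PullListingsRole` IAM resource is omitted for brevity:

```yaml
Resources:
  PullListings:
    Type: AWS::Lambda::Function
    Properties:
      Handler: lambda_function.lambda_handler   # Python_filename.function_name
      Runtime: python3.8
      Timeout: 300
      Role: !GetAtt PullListingsRole.Arn
      Code:
        S3Bucket: my-deploy-bucket              # placeholder bucket name
        S3Key: lambda.zip

  ScheduleRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: rate(7 days)          # run every 7 days
      Targets:
        - Arn: !GetAtt PullListings.Arn
          Id: PullListingsTarget

  InvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref PullListings
      Principal: events.amazonaws.com
      SourceArn: !GetAtt ScheduleRule.Arn
```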
- Create the CloudFormation stack. I used script files (`create.sh`, `update.sh`) to create and update the stack in the bash terminal. Check the S3, CloudFormation, and CloudWatch management consoles in your AWS account to confirm the successful deployment of the Lambda function. You can also invoke the function manually using the AWS CLI. For a sample data set, see `2020-02-20 06_13_45.160412.csv`.
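A script like `create.sh` can be as small as a single AWS CLI call; this is a sketch with placeholder stack and function names. The `--capabilities` flag is required because the template creates an IAM role:

```shell
#!/bin/bash
# create the stack from the template; the stack name is a placeholder
aws cloudformation create-stack \
    --stack-name pull-listings \
    --template-body file://my1studacityproj.yml \
    --capabilities CAPABILITY_NAMED_IAM

# once deployed, invoke the function manually to test it
# (substitute the function's actual deployed name)
aws lambda invoke --function-name PullListings response.json
```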
How to get Aussie property price guides using Python & the Domain API
Make Data Acquisition Easy with AWS & Lambda (Python) in 12 Steps
Create and deploy an AWS Lambda function with AWS CloudFormation