# Automated and Serverless API Scraping with Python and AWS: Get Data Sets on Aussie Residential Listings

My goal in this project is to get data sets on Aussie residential listings by creating a Lambda function in Amazon Web Services (AWS) that runs on a schedule, collects data from the [Domain Application Programming Interface (API)](https://developer.domain.com.au/), and stores it in a Simple Storage Service (S3) bucket. I intend to use the data sets for data analysis and visualization, e.g. using [Tableau](https://www.tableau.com/en-au). 

You can find the Lambda script `lambda_function.py` in the file `lambda.zip`. Here, I explain the Python code used to create the Lambda function.

## `lambda_function.py`

To begin with, I imported the following modules:

* `numpy` and `pandas` - for handling and storing data pulled from the Domain API as a `DataFrame` object
* `requests` - for sending `POST` requests to the Domain API to receive an access token and data on property listings
* `boto3` - for managing the S3 bucket assigned to store the data sets
* `io.StringIO` - for handling the pulled data as an in-memory file-like object
* `datetime.datetime` - for obtaining the current date and time 

In [None]:
import numpy as np
import pandas as pd
import requests
import boto3
from io import StringIO
from datetime import datetime

I created a Python function `lambda_handler` with two arguments, `event` and `context`, following the official [AWS documentation](https://docs.aws.amazon.com/lambda/latest/dg/python-programming-model-handler-types.html). Next, to get an access token from the Domain API, I created an AWS Systems Manager (SSM) low-level client to retrieve the `DomainClientID` and `DomainSecret` parameters stored in the Parameter Store. These values were provided by my Domain developer account.

In [None]:
def lambda_handler(event, context):

    ssm = boto3.client('ssm')
    client_dict = ssm.get_parameter(Name='DomainClientID')
    secret_dict = ssm.get_parameter(Name='DomainSecret')
    
    myclient_id = client_dict['Parameter']['Value']
    mysecret = secret_dict['Parameter']['Value']
    url = "https://auth.domain.com.au/v1/connect/token"
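If `DomainSecret` is stored as a `SecureString`, `get_parameter` returns the encrypted value unless `WithDecryption=True` is passed. The sketch below assumes that parameter type, which the original does not state. `read_parameter` and `FakeSSM` are hypothetical names; the client is passed in as an argument so the helper can be tried without AWS access.

```python
def read_parameter(ssm_client, name):
    """Return the plain-text value of a Parameter Store entry.

    `ssm_client` is expected to behave like boto3.client('ssm').
    WithDecryption=True decrypts SecureString parameters and is
    ignored for plain String parameters.
    """
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response['Parameter']['Value']


class FakeSSM:
    """Stand-in for the real client, for local testing only (hypothetical)."""

    def __init__(self, values):
        self.values = values

    def get_parameter(self, Name, WithDecryption=False):
        # Mimics only the part of the response shape used above
        return {'Parameter': {'Name': Name, 'Value': self.values[Name]}}


fake = FakeSSM({'DomainClientID': 'client_abc', 'DomainSecret': 'secret_xyz'})
myclient_id = read_parameter(fake, 'DomainClientID')
mysecret = read_parameter(fake, 'DomainSecret')
```

Inside the Lambda function, the same helper would be called with `boto3.client('ssm')` in place of the fake client.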

Then, I made a POST request to `url` to receive my access token.

In [None]:
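    # Form fields for the client-credentials token request
    data = {
        "client_id" : myclient_id,
        "client_secret" : mysecret,
        "grant_type" : "client_credentials",
        "scope" : "api_listings_read",
        "Content-Type" : "text/json"
    }

    response = requests.post(url=url, data=data)
    response.raise_for_status()  # stop early if the token request failed
    token = response.json()
    access_token = token["access_token"]
    # Header sent with every subsequent API request
    auth = {"Authorization": "Bearer " + access_token}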
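If the credentials are rejected, the parsed response has no `access_token` key and the lookup fails with a bare `KeyError`. One possible way to surface a clearer error is sketched here. `build_auth_header` is a hypothetical helper, and the sample response bodies are made up for illustration, not real Domain API output.

```python
def build_auth_header(token):
    """Turn a token response (already parsed from JSON) into a request header.

    Raises ValueError that includes the response body when no token
    was issued, instead of a bare KeyError.
    """
    access_token = token.get("access_token")
    if not access_token:
        raise ValueError("No access token in response: {}".format(token))
    return {"Authorization": "Bearer " + access_token}


# Made-up response bodies, for illustration only
ok = build_auth_header({"access_token": "abc123", "token_type": "Bearer"})

try:
    build_auth_header({"error": "invalid_client"})
    failed = False
except ValueError:
    failed = True
```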

I am interested in residential listings for sale in the Australian Capital Territory (ACT). Since API search results are [limited to the first 1000 results](https://developer.domain.com.au/docs/apis/pkg_agents_listings/references/listings_detailedresidentialsearch), I made a [list of ACT suburbs](https://www.yellowpages.com.au/act/localities) and looped over it, making one `POST` request to the API endpoint per suburb. I assumed that a search restricted to a single suburb is unlikely to return more than 1000 results.

In [None]:
    url2 = "https://api.domain.com.au/v1/listings/residential/_search"

    suburbs = [
        'Acton', 'Ainslie', 'Amaroo', 'Aranda',
        'Banks', 'Barton', 'Beard', 'Belconnen', 'Bonner', 'Bonython', 'Braddon', 'Bruce',
        'Calwell', 'Campbell', 'Canberra', 'Canberra Airport', 'Capital Hill', 'Casey', 'Chapman', 'Charnwood', 'Chifley',
        'Chisholm', 'Conder', 'Cook', 'Coombs', 'Coree', 'Cotter River',
        'Deakin', 'Dickson', 'Downer', 'Duffy', 'Dunlop',
        'Evatt',
        'Fadden', 'Farrer', 'Fisher', 'Florey', 'Flynn', 'Forde', 'Forrest', 'Franklin', 'Fraser', 'Fyshwick',
        'Garran', 'Gilmore', 'Giralang', 'Gordon', 'Gowrie', 'Greenway', 'Griffith', 'Gungahlin',
        'Hackett', 'Hall', 'Harrison', 'Hawker', 'Higgins', 'Holder', 'Holt', 'Hughes', 'Hume',
        'Isaacs', 'Isabella Plains',
        'Jacka', 'Jervis Bay',
        'Kaleen', 'Kambah', 'Kenny', 'Kingston', 'Kowen',
        'Latham', 'Lawson', 'Lyneham', 'Lyons',
        'Macarthur', 'Macgregor', 'Macquarie', 'Majura', 'Mawson', 'Mckellar', 'Melba', 'Mitchell', 'Monash', 'Moncrieff',
        'Narrabundah', 'Ngunnawal', 'Nicholls',
        "O'connor", "O'malley", 'Oaks Estate', 'Oconnor', 'Omalley', 'Oxley',
        'Paddys River', 'Page', 'Palmerston', 'Parkes', 'Pearce', 'Phillip', 'Pialligo',
        'Red Hill', 'Reid', 'Richardson', 'Rivett', 'Russell',
        'Scullin', 'Spence', 'Stirling', 'Stromlo', 'Symonston',
        'Taylor', 'Tharwa', 'Theodore', 'Throsby', 'Torrens', 'Tuggeranong', 'Turner',
        'Uriarra Village',
        'Wanniassa', 'Waramanga', 'Watson', 'Weetangera', 'Weston', 'Weston Creek', 'Wright',
        'Yarralumla'
    ]

The function `payload` returns the request body for the `POST` request for a given ACT suburb.

In [None]:
    def payload(suburb):

            params = {
            #     "pageNumber": 0,
                "listingType": "Sale",
            #     "propertyTypes": [
            #         "House",
            #         "NewApartments"
            #     ],
            #     "propertyFeatures": [
            #         "AirConditioning"
            #     ],
            #     "listingAttributes": [
            #         "HasPhotos"
            #     ],
            #     "propertyEstablishedType": "Any",
            #     "minBedrooms": 0,
            #     "maxBedrooms": 0,
            #     "minBathrooms": 0,
            #     "maxBathrooms": 0,
            #     "minCarspaces": 0,
            #     "maxCarspaces": 0,
            #     "minPrice": 0,
            #     "maxPrice": 0,
            #     "minLandArea": 0,
            #     "maxLandArea": 0,
            #     "advertiserIds": [
            #         0
            #     ],
            #     "adIds": [
            #         0
            #     ],
            #     "excludeAdIds": [
            #         0
            #     ],
                "locations": [
                {
                    "state": "ACT",
                    "region": "",
                    "area": "",
                    "suburb": suburb,
                    "postCode": "",
                    "includeSurroundingSuburbs": False
                }
                ]
            #     "locationTerms": "string",
            #     "keywords": [
            #         "string"
            #     ],    
            #     "newDevOnly": true,
            #     "inspectionFrom": "2020-02-18T00:15:14.184Z",
            #     "inspectionTo": "2020-02-18T00:15:14.184Z",
            #     "auctionFrom": "2020-02-18T00:15:14.184Z",
            #     "auctionTo": "2020-02-18T00:15:14.184Z",
            #     "ruralOnly": true,
            #     "excludePriceWithheld": true,
            #     "sort": {
            #         "sortKey": "Default",
            #         "direction": "Ascending",
            #         "proximityTo": {
            #             "lat": 0,
            #             "lon": 0
            #         }
            #     },
            #     "pageSize": 0,
            #     "geoWindow": {
            #         "box": {
            #             "topLeft": {
            #                 "lat": 0,
            #                 "lon": 0
            #             },
            #             "bottomRight": {
            #                 "lat": 0,
            #                 "lon": 0
            #             }
            #         },
            #         "circle" : {
            #             "center": {
            #                 "lat": 0,
            #                 "lon": 0
            #             },
            #             "radiusInMeters": 0
            #         },
            #         "polygon": {
            #             "points": [
            #                 {
            #                     "lat": 0,
            #                     "lon": 0
            #                 }
            #             ]
            #         }
            #     },
            #     "updatedSince": "2020-02-18T00:15:14.184Z"
            }

            return params
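The request body above leaves `pageNumber` and `pageSize` commented out, so each suburb query returns a single page of results. If a suburb ever has more listings than fit on one page, the remaining pages could be collected with a loop like the one sketched below. `fetch_all_pages` and `fetch_page` are hypothetical names, and the 1-based page numbering and page size of 100 are assumptions that should be checked against the Domain API reference.

```python
def fetch_all_pages(fetch_page, page_size=100, max_results=1000, first_page=1):
    """Collect results page by page until a short page or the cap is reached.

    `fetch_page(page_number, page_size)` is a hypothetical callable that
    returns the list of results for one page.
    """
    results = []
    page = first_page
    while len(results) < max_results:
        batch = fetch_page(page, page_size)
        results.extend(batch)
        if len(batch) < page_size:
            break  # a short (or empty) page means there is nothing left
        page += 1
    return results[:max_results]


# Stand-in for the API: 250 fake results served in pages
fake_results = list(range(250))


def fake_fetch(page_number, page_size):
    start = (page_number - 1) * page_size
    return fake_results[start:start + page_size]


all_results = fetch_all_pages(fake_fetch, page_size=100)
capped = fetch_all_pages(fake_fetch, page_size=100, max_results=150)
```

In the Lambda function, `fetch_page` would send the `POST` request with `pageNumber` and `pageSize` added to the dictionary returned by `payload(suburb)`.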

I then made a `POST` request to `url2` for each ACT suburb and retrieved the residential listings. A sample response to the `POST` request is shown [here](https://developer.domain.com.au/docs/apis/pkg_agents_listings/references/listings_detailedresidentialsearch). For each listing, I retrieved the following information:

* listing type and id
* advertiser type, id, and name
* display price of the property
* property features, type, number of bathrooms, bedrooms, and car spaces
* property address such as unit number, street number and name, area, region, suburb, postcode, and GPS coordinates (latitude and longitude)

I stored the pulled data as a list of lists.

In [None]:
    listings = []
    for suburb in suburbs:

        content = requests.post(url=url2, headers=auth, json=payload(suburb)).json()

        # An empty response means there are no listings for this suburb
        if content:
            for item in content:
                # Relies on the key order of each result: the first key holds
                # the result type, the second holds the listing(s)
                ptype = list(item.keys())[0]
                listing = list(item.keys())[1]
                dict1 = item[listing]

                # A single listing is a dict; grouped listings are a list of dicts
                if isinstance(dict1, dict):
                    rows = [dict1]
                elif isinstance(dict1, list):
                    rows = dict1
                else:
                    rows = []

                for row in rows:
                    dict2 = row['advertiser']
                    dict3 = row['priceDetails']
                    dict4 = row['propertyDetails']

                    listings.append(
                        [item[ptype], row.get('id'),
                        dict2.get('type'), dict2.get('id'), dict2.get('name'),
                        dict3.get('displayPrice'),
                        dict4.get('features'), dict4.get('propertyType'),
                        dict4.get('bathrooms', 0), dict4.get('bedrooms', 0), dict4.get('carspaces', 0),
                        dict4.get('unitNumber'), dict4.get('streetNumber'), dict4.get('street'),
                        dict4.get('area'), dict4.get('region'), dict4.get('suburb'),
                        dict4.get('postcode'), dict4.get('latitude'), dict4.get('longitude')]
                    )
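The extraction above raises a `KeyError` if a listing lacks an `advertiser`, `priceDetails`, or `propertyDetails` section. A more tolerant variant is sketched below as an alternative, not as the code used in the Lambda function. `listing_to_row` is a hypothetical helper and the sample listing is made up.

```python
def listing_to_row(result_type, listing):
    """Flatten one listing dict into the 20-value row used for the data set.

    Missing sections ('advertiser', 'priceDetails', 'propertyDetails')
    are treated as empty dicts, so their fields become None or 0.
    """
    advertiser = listing.get('advertiser') or {}
    price = listing.get('priceDetails') or {}
    details = listing.get('propertyDetails') or {}
    return [
        result_type, listing.get('id'),
        advertiser.get('type'), advertiser.get('id'), advertiser.get('name'),
        price.get('displayPrice'),
        details.get('features'), details.get('propertyType'),
        details.get('bathrooms', 0), details.get('bedrooms', 0), details.get('carspaces', 0),
        details.get('unitNumber'), details.get('streetNumber'), details.get('street'),
        details.get('area'), details.get('region'), details.get('suburb'),
        details.get('postcode'), details.get('latitude'), details.get('longitude'),
    ]


# Made-up listing with no 'priceDetails' section
sample = {
    'id': 1,
    'advertiser': {'type': 'agency', 'id': 7, 'name': 'Example Realty'},
    'propertyDetails': {'propertyType': 'House', 'bedrooms': 3, 'suburb': 'AINSLIE'},
}
row = listing_to_row('PropertyListing', sample)
```

Each row keeps the same field order as the list of lists described above, so it can be appended to `listings` unchanged.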

I converted the list into a `DataFrame` with appropriate column names.

In [None]:
    # Passing the column names to the constructor also works when `listings` is empty
    dataset = pd.DataFrame(listings, columns=[
        'type', 'id',
        'advertiser_type', 'advertiser_id', 'advertiser_name',
        'displayPrice',
        'propertyFeatures', 'propertyType',
        'bathrooms', 'bedrooms', 'carspaces',
        'unitNumber', 'streetNumber', 'street',
        'area', 'region', 'suburb',
        'postcode', 'latitude', 'longitude'
    ])

I wrote the `DataFrame` directly to the `dataset` folder in my S3 bucket as a CSV file. The filename records the date and time the data was pulled from the Domain API.

In [None]:
    # Timestamp used as the file name
    now = str(datetime.today())

    bucket = 'myactlistings'
    # Write the CSV to an in-memory buffer, then upload it to S3
    csv_buffer = StringIO()
    dataset.to_csv(csv_buffer)
    s3_resource = boto3.resource('s3')
    s3_resource.Object(bucket, 'dataset/{}.csv'.format(now)).put(Body=csv_buffer.getvalue())
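`str(datetime.today())` produces a name containing a space, colons, and microseconds, for example `2020-02-18 00:15:14.184000`. Such keys are valid in S3 but awkward to use in URLs and shell commands. A possible alternative is sketched below. `dataset_key` is a hypothetical helper, and the timestamp format is my own choice rather than anything S3 requires.

```python
from datetime import datetime, timezone


def dataset_key(moment, prefix='dataset'):
    """Build an S3 object key such as 'dataset/2020-02-18T00-15-14Z.csv'."""
    # Normalise to UTC and avoid spaces and colons in the file name
    stamp = moment.astimezone(timezone.utc).strftime('%Y-%m-%dT%H-%M-%SZ')
    return '{}/{}.csv'.format(prefix, stamp)


key = dataset_key(datetime(2020, 2, 18, 0, 15, 14, tzinfo=timezone.utc))
```

In the Lambda function, the key would be built with `dataset_key(datetime.now(timezone.utc))` and passed to `s3_resource.Object(bucket, key)`.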