Import Google Spreadsheets Data into Amazon S3 Files

Motivation

This repository provides a tutorial on how to use AWS Glue for Python Shell to access your data in Google Spreadsheets and import it into your Amazon S3 bucket.

Architecture

Project steps

Follow these steps to copy Google Spreadsheet files from your Google Drive to your Amazon S3 bucket with AWS Glue for Python Shell.

Create Google Service Account

  1. Create a Google Cloud project and an associated Service Account.

Go to cloud.google.com and create a new Google Cloud project. Provide a Project Name and a unique Project ID.

  2. Enable the APIs for this project.

Choose APIs & Services from the left menu and click Enable APIs and Services to enable the Google APIs.

APIs:

  • Google Drive API
  • Google Sheets API

  3. Create a Service Account for this project.

In the APIs & Services section, click Credentials, then Create credentials, and choose Service account.

Provide the following information:

  • Assign a name
  • Modify the ID accordingly
  • Insert a description

  4. Once the Service Account is created, create a private key file for it.

In the Cloud Console, go to IAM and Admin > Service accounts.

  • Click the email address of the service account.
  • Click the Keys tab.
  • Click the Add key drop-down menu.
  • Select Create new key.
  • Select JSON as the Key type.
  • Click Create.

A JSON key file is downloaded to your computer.

  5. Finally, navigate to your Google Drive and share the folder or files you would like to import with the email address of the Service Account.
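
Before moving to AWS, you can optionally confirm that the sharing worked with a short local check against the Drive API. This is a minimal sketch, not part of the original sample; the key file name service_account.json is an assumption, and it only lists the files visible to the service account.

# Optional local check (sketch): verify the service account can see the shared files.
# Assumes the downloaded key was saved as service_account.json in the current folder.
from google.oauth2 import service_account
from googleapiclient.discovery import build

credentials = service_account.Credentials.from_service_account_file(
    'service_account.json',
    scopes=['https://www.googleapis.com/auth/drive.readonly'])
drive = build('drive', 'v3', credentials=credentials)
response = drive.files().list(pageSize=10, fields='files(id, name)').execute()
for f in response.get('files', []):
    print(f['id'], f['name'])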

Create an Amazon S3 bucket

Create an Amazon S3 bucket to save the data you are consuming from Google Drive.

Navigate to Amazon S3 Console and create a bucket.
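
If you prefer to create the bucket programmatically, the following boto3 sketch is one option; the bucket name and region are placeholders, not values defined by this tutorial.

# Sketch: create the destination bucket with boto3 (bucket name and region are placeholders).
# Note: for us-east-1 the CreateBucketConfiguration argument must be omitted.
import boto3

s3 = boto3.client('s3', region_name='eu-west-1')
s3.create_bucket(
    Bucket='your-data-bucket',
    CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'}
)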

Store Credentials in AWS Secrets Manager

Store the credentials JSON file in AWS Secrets Manager as a key/value object.

  1. Open the AWS Secrets Manager console and choose Store a new secret.

  2. Choose the option Other type of secret, paste your JSON object as plain text, and assign a name.

Copy the Python snippet to retrieve the secret within a Python routine.
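
As a reference, a minimal retrieval sketch with boto3 looks like the following; the secret name and region are placeholders here, and the Glue job later receives them as parameters. This mirrors what the job script in this tutorial does.

# Sketch: retrieve the stored Google credentials from AWS Secrets Manager.
# Secret name and region below are placeholders.
import json
import boto3

client = boto3.client('secretsmanager', region_name='your-google-secret-region')
response = client.get_secret_value(SecretId='your-google-secret-name')
secret = json.loads(response['SecretString'])  # dict with the service account key fields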

Setup AWS IAM Role for AWS Glue, AWS Secrets Manager, and Amazon S3

Create an IAM role to execute the necessary operations:

  1. Navigate to IAM Console and click Roles.

  2. Click Create role and choose Glue under the use cases for other AWS services.

  3. Create a new custom policy for this role. Choose the JSON tab to edit your policy and use this policy file to attach the minimum necessary permissions. Remember to modify the policy with your resource values.

These are the necessary permissions to run the job:

  • Amazon S3: Read and Write
  • Amazon CloudWatch: List and Write
  • AWS Secrets Manager: Read

It is recommended to attach the minimum set of permissions per role.

  4. Make sure that the trust policy allows AWS Glue to assume this role (see the sketch after this list). Click Next.

  5. Choose a name for your role. Use gsheet-glue-role as a suggestion; it is referenced in the rest of this tutorial.
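
If you script the role creation instead of using the console, the trust policy mentioned in step 4 is the part that lets AWS Glue assume the role. A minimal boto3 sketch follows; the permission policy itself is not shown and should be the least-privilege policy described above.

# Sketch: create the Glue execution role with a trust policy for glue.amazonaws.com.
import json
import boto3

iam = boto3.client('iam')
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}
iam.create_role(
    RoleName='gsheet-glue-role',
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)
# Attach your custom least-privilege policy (Amazon S3 read/write, Amazon CloudWatch
# list/write, AWS Secrets Manager read) with iam.put_role_policy or iam.attach_role_policy.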

Create Glue Python Shell Job (Glue Interface)

AWS Glue for Python Shell makes it easy to use Python packages and environment variables. It supports Python 3.9 with a flexible pre-loaded environment and the ability to import additional libraries; see this AWS Blog Post for more information about this feature. Additionally, the job can be set up either programmatically with the AWS CLI or through the user interface.

  1. Navigate to AWS Glue console.
  2. Choose AWS Glue Studio - Jobs.
  3. Select Python Shell script Editor.
  4. Create a new job.

  5. Modify the job title and copy the script below.
"""
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
"""
import sys
import logging
import pandas as pd
import json
from datetime import datetime
from os.path import join
import awswrangler as wr
import boto3
from awsglue.utils import getResolvedOptions
from googleapiclient.discovery import build
from google.oauth2 import service_account

# start your logging session
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Collect your script variables
args = getResolvedOptions(sys.argv,
                          ['google_secret_name',
                           'google_secret_region',
                           'google_spreadsheet_id',
                           'google_spreadsheet_tab',
                           'bucket',
                           'folder',
                           'filename',
                           'output_format',
                           'compression']
                          )


# convert param to boolean
def str_to_bool(val):
    return val.lower() in ['true', '1', 't', 'y', 'yes']


# Method to save data in Amazon S3 Bucket
def save_file(df, bucket, folder='data',
              filename='filename', output_format='csv',
              compression='false'):
    """
    Save your pandas dataframe in your Amazon S3 bucket.   
    This method requires df and bucket as mandatory variables.
    Provide the rest of variables as job parameters or use the default values.
    
    Modify the script to handle more text file separators.
    Extend the script to include more compression algorithms.
    
    Include the subfolder structure directly in your job variable folder.
    Use the following pattern: folder/subfolder1/subfolder2/...
    """
    # Create the file path with bucket, folder and filename
    subpath_args = [bucket, folder, filename]
    sub_path = '/'.join(subpath_args)
    # Add the creation time in your filename
    path_date_args = [sub_path, datetime.now().strftime('%Y_%m_%d_%H_%M_%S')]
    path_date = '_'.join(path_date_args)
    # Add the S3 url into your path
    path = 's3://'+path_date
    if output_format == 'csv':
        """
        By default this routine uses tab separator.
        Define a new job variable for different separator.
        Insert here your code to handle different types of separators
        """
        separator = '\t'
        # Add the csv file extension
        path_ext = path+'.csv'
        if (str_to_bool(compression)):
            # Include your logic to handle more compression algorithms
            """
            Insert here your code to apply more textfile compression methods.
            """
            comp_opts = 'zip'
            # Add the compression to your filename
            path_ext = path_ext+'.'+comp_opts
        else:
            comp_opts = None
        # Save your file with awswrangler package
        wr.s3.to_csv(
            df=df,
            path=path_ext,
            compression=comp_opts,
            index=False
        )
    else:
        # Add the parquet file extension
        path_ext = path+'.parquet'
        if (str_to_bool(compression)):
            """
            Insert here your code to apply more Parquet compression methods.
            """
            comp_opts = 'snappy'
            # Add the compression to your filename
            path_ext = path_ext+'.'+comp_opts
        else:
            comp_opts = None
        # Save your file with awswrangler package
        wr.s3.to_parquet(
            df=df,
            path=path_ext,
            compression=comp_opts,
            index=False
        )
    return


# Retrieve credentials from AWS Secrets Manager
# using the secret name and region passed as job parameters
secret_name = args['google_secret_name']
region_name = args['google_secret_region']
session = boto3.session.Session()
client = session.client(
    service_name='secretsmanager',
    region_name=region_name
)

get_secret_value_response = client.get_secret_value(
    SecretId=secret_name
)

secret = get_secret_value_response['SecretString']
secret = json.loads(secret)

# Create credentials object with google.oauth2.
# Use these scopes to connect to Google Drive and Google Sheets.
# More information:
# https://google-auth.readthedocs.io/en/master/reference/google.oauth2.service_account.html
credentials = service_account.Credentials.from_service_account_info(
    secret,
    scopes=['https://www.googleapis.com/auth/drive', 'https://www.googleapis.com/auth/spreadsheets'])

# Open connection with googleapiclient and create object connection.
# More information: https://developers.google.com/sheets/api/quickstart/python
service = build('sheets', 'v4', credentials=credentials)
sheet = service.spreadsheets()

# Read google object
# More information:
# https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get
result = sheet.values().get(
                       spreadsheetId=args['google_spreadsheet_id'],
                       range=args['google_spreadsheet_tab']).execute()
data = result.get('values')

# save data into a dataframe
data = pd.DataFrame(data[1:], columns=data[0])
bucket = args['bucket']
folder = args['folder']
filename = args['filename']
output_format = args['output_format']
compression = args['compression']
save_file(data, bucket, folder, filename, output_format, compression)

This code imports the native Google client libraries (googleapiclient and google.oauth2).

You can refer to their documentation to extend the proposed code sample.
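
For example, to iterate over several tabs instead of passing a single --google_spreadsheet_tab, you could first list the tab names. This is a sketch only, reusing the service and args objects created in the script above.

# Sketch: list the tab (sheet) titles of the spreadsheet, reusing `service` and `args`
# from the script above.
metadata = service.spreadsheets().get(
    spreadsheetId=args['google_spreadsheet_id']).execute()
tab_names = [s['properties']['title'] for s in metadata.get('sheets', [])]
print(tab_names)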

  6. Set up the job parameters.

Follow the Glue interface and provide the following information:

  • IAM Role: Provide the Glue role created in the previous section (gsheet-glue-role).
  • Python version: Python 3.9
  • Data processing units: 1/16 DPU
  • Number of retries: 1
  • Job timeout: 2880 min

You can keep the advanced properties at their defaults.

The code uses a set of script variables to integrate easily with your pipeline. In Job parameters, add the additional Python modules to import and the script variables. To do that, include the following job parameters in key-value format:

  • --additional-python-modules: google-api-python-client, google-auth-oauthlib, google-auth-httplib2, fastparquet
  • --google_secret_name: your-google-secret-name
  • --google_secret_region: your-google-secret-region
  • --google_spreadsheet_id: your-google-spreadsheet-id (More information here)
  • --google_spreadsheet_tab: your-google-spreadsheet-tab
  • --bucket: your-data-bucket
  • --folder: your-data-bucket-folder (default = data)
  • --filename: your-data-filename (default = filename)
  • --output_format: your-filename-format (supported file formats: parquet and tab-separated csv)
  • --compression: your-filename-compression (supported compression algorithms: snappy for parquet and zip for csv)

Create Glue Python Shell Job (AWS CLI)

If you prefer to use the AWS CLI to create your Glue job, follow these steps:

  1. Save the code sample file in an S3 bucket (reuse the bucket you created before or create a new one).
  2. Open your preferred shell terminal with configured AWS CLI access to your account.
  3. Execute the following bash command:
# Choose the profile that enables access to your account and a unique job name.
# Use the role created in the previous sections.
# Point ScriptLocation to your python script in S3 (create a new bucket or reuse the
# existing one) and choose your own strategy for the TempDir location in --default-arguments.
aws glue create-job --profile account-session \
    --name google_spreadsheet_data \
    --role gsheet-glue-role \
    --command Name=pythonshell,PythonVersion=3.9,ScriptLocation=s3://bucket/script_folder/script.py \
    --default-arguments '{
        "--TempDir": "s3://${temp_bucket}/${temp_folder}",
        "--job-language": "python",
        "library-set": "analytics",
        "--additional-python-modules": "google-api-python-client, google-auth-oauthlib, google-auth-httplib2, fastparquet",
        "--google_secret_name": "${your-google-secret-name}",
        "--google_secret_region": "${your-google-secret-region}",
        "--google_spreadsheet_id": "${your-google-spreadsheet-id}",
        "--google_spreadsheet_tab": "${your-google-spreadsheet-tab}",
        "--bucket": "${your-data-bucket}",
        "--folder": "${your-data-bucket-folder}",
        "--filename": "${your-data-filename}",
        "--output_format": "${your-filename-format}",
        "--compression": "${your-filename-compression}"
    }' \
    --timeout 30 \
    --max-capacity 0.0625 \
    --max-retries 1

Run the Job

Once you have finished the setup, you can run the Glue job either from the AWS Glue interface or with the AWS CLI.

  • Interface:

  • AWS CLI:

aws glue start-job-run --job-name google_spreadsheet_data
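
If you trigger the job from another Python process or a scheduler instead, a minimal boto3 sketch (using the same job name as above) is:

# Sketch: start the Glue job from Python and check the run status.
# Region and credentials come from your AWS configuration.
import boto3

glue = boto3.client('glue')
run = glue.start_job_run(JobName='google_spreadsheet_data')
status = glue.get_job_run(JobName='google_spreadsheet_data', RunId=run['JobRunId'])
print(status['JobRun']['JobRunState'])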

Clean up

AWS Glue is a fully managed service and you will not incur extra charges if you do not run your jobs. However, to avoid any additional charges linked to other services, delete the resources created in this tutorial: the Glue job, the secret in AWS Secrets Manager, the Amazon S3 bucket and its contents, and the IAM role.

(Optional) For local testing

If you want to try the code locally or add new capabilities to this routine, we recommend testing it locally first. Use your preferred code IDE with the same JSON credentials file you created previously and make any enhancements that apply to your use case. Once you are satisfied with your code, follow the steps in this tutorial with your new routine.

Run the following terminal command to clone the repo:

git clone http://...
cd <folder>

Activate your Python environment and run the following command to install the requirements.

pip install -r requirements.txt
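
For local runs you can bypass AWS Secrets Manager and load the key file directly. This is a sketch only; the file name service_account.json is an assumption, and the rest of the job script (building the Sheets service, reading the values, saving to Amazon S3) stays the same.

# Local-only sketch: replace the Secrets Manager block of the job script with a direct
# load of the downloaded JSON key (assumed file name: service_account.json).
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'service_account.json',
    scopes=['https://www.googleapis.com/auth/drive',
            'https://www.googleapis.com/auth/spreadsheets'])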

Contributing

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.