This repository provides a tutorial on how to use AWS Glue for Python Shell to access data in Google Sheets and import it into your Amazon S3 bucket.
Follow the steps below to copy Google Sheets files from your Google Drive to your Amazon S3 bucket with AWS Glue for Python Shell.
- Create a Google Cloud project in the GCP console and an associated Service Account.
Go to cloud.google.com and create a new Google Cloud project. Provide a Project Name and a unique Project ID.
- Enable the APIs for this project.
Choose APIs & Services from the left menu and click Enable APIs and Services to enable the following Google APIs (equivalent gcloud commands for this and the next steps are sketched after this list):
  - Google Drive API
  - Google Sheets API
- Create a Service Account for this project.
In the APIs & Services section, click Credentials, then Create credentials and choose Service Account.
Provide the following information:
  - Assign a name
  - Modify the ID if needed
  - Insert a description
- Once the Service Account is created, create a private key file associated with this Service Account.
In the Cloud Console, go to IAM & Admin > Service accounts.
  - Click the email address of the service account.
  - Click the Keys tab.
  - Click the Add key drop-down menu.
  - Select Create new key.
  - Select JSON as the key type and click Create.
A JSON key file is downloaded to your computer.
- Finally, navigate to your Google Drive folder and share the folder or files you would like to explore with the email address of the Service Account.
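If you prefer the gcloud CLI over the console, the setup steps above can be sketched roughly as follows, assuming the CLI is installed and authenticated against your project; the service account name, key file name, and project ID are placeholders:
# Enable the Google Drive and Google Sheets APIs
gcloud services enable drive.googleapis.com sheets.googleapis.com
# Create the Service Account (placeholder name and display name)
gcloud iam service-accounts create gsheet-glue-sa --display-name="Glue Google Sheets reader"
# Generate and download a JSON key for the Service Account (placeholder file name and project ID)
gcloud iam service-accounts keys create credentials.json \
    --iam-account=gsheet-glue-sa@your-project-id.iam.gserviceaccount.com
Sharing the Drive folder or files with the Service Account email address is still done from the Google Drive interface, as described in the last step.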
Create an Amazon S3 bucket to store the data you are consuming from Google Drive.
Navigate to the Amazon S3 console and create a bucket.
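If you prefer the AWS CLI, a bucket can be created with a command similar to the following; the bucket name and region are placeholders:
aws s3 mb s3://your-data-bucket --region your-aws-region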
Copy the credentials JSON file into AWS Secrets Manager as a key/value object.
- Open the Secrets Manager console and choose Store a new secret.
- Choose the option Other type of secret, paste your JSON object as plaintext, and assign a name.
- Copy the Python snippet below to retrieve the secret within a Python routine.
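A minimal sketch of such a snippet, using boto3 (the secret name and region below are placeholders; the full job script later in this tutorial contains the same logic):
import json
import boto3

# Placeholders: replace with your secret name and region
secret_name = 'your-google-secret-name'
region_name = 'your-google-secret-region'

# Open a Secrets Manager client and fetch the secret string
client = boto3.session.Session().client(
    service_name='secretsmanager',
    region_name=region_name
)
response = client.get_secret_value(SecretId=secret_name)
google_credentials = json.loads(response['SecretString'])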
Create an IAM role to execute the necessary operations:
- Navigate to the IAM console and click Roles.
- Choose Create role and select Glue under the use cases for other AWS services.
- Create a new custom policy for this role. Choose the JSON tab to edit your policy and use this policy file to attach the minimum necessary permissions (an example policy is sketched after this list). Remember to replace the resource values in the policy with your own.
These are the necessary permissions to run the job:
  - Amazon S3: Read and Write
  - Amazon CloudWatch: List and Write
  - AWS Secrets Manager: Read
It is recommended to attach the minimum set of permissions per role.
- Make sure that the Trust Policy allows AWS Glue to assume this role. Click Next and assign a name to the role.
- Choose a name for your role. gsheet-glue-role is used as a suggestion throughout this tutorial.
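For orientation, a minimal policy along these lines could cover the permissions listed above; this is a sketch only, and the bucket name, region, account ID, and secret name are placeholders to replace with your own values:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadWrite",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::your-data-bucket", "arn:aws:s3:::your-data-bucket/*"]
    },
    {
      "Sid": "CloudWatchLogs",
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:your-region:your-account-id:*"
    },
    {
      "Sid": "SecretsManagerRead",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:your-region:your-account-id:secret:your-google-secret-name-*"
    }
  ]
}
The policy file referenced in the step above remains the reference for the exact permissions.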
AWS Glue for Python Shell makes it very easy to use Python packages and environment variables. It supports Python 3.9 with a flexible pre-loaded environment and support for importing additional libraries; see this AWS Blog Post for more information about this feature. Additionally, it can be set up either programmatically with the AWS CLI or through its user interface.
- Navigate to the AWS Glue console.
- Choose AWS Glue Studio - Jobs.
- Select Python Shell script editor.
- Create a new job.
- Modify the job title and copy the script below.
"""
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
"""
import sys
import logging
import pandas as pd
import json
from datetime import datetime
from os.path import join
import awswrangler as wr
import boto3
from awsglue.utils import getResolvedOptions
from googleapiclient.discovery import build
from google.oauth2 import service_account
# start your logging session
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Collect your script variables
args = getResolvedOptions(sys.argv,
                          ['google_secret_name',
                           'google_secret_region',
                           'google_spreadsheet_id',
                           'google_spreadsheet_tab',
                           'bucket',
                           'folder',
                           'filename',
                           'output_format',
                           'compression']
                          )
# convert param to boolean
def str_to_bool(val):
    return val.lower() in ['true', '1', 't', 'y', 'yes']
# Method to save data in Amazon S3 Bucket
def save_file(df, bucket, folder='data',
              filename='filename', output_format='csv',
              compression=False):
    """
    Save your pandas dataframe in your Amazon S3 bucket.
    This method requires df and bucket as mandatory variables.
    Provide the rest of the variables as job parameters or use the default values.
    Modify the script to handle more text file separators.
    Extend the script to include more compression algorithms.
    Include the subfolder structure directly in your job variable folder.
    Use the following pattern: folder/subfolder1/subfolder2/...
    """
    # Create the file path with bucket, folder and filename
    subpath_args = [bucket, folder, filename]
    sub_path = '/'.join(subpath_args)
    # Add the creation time to your filename
    path_date_args = [sub_path, datetime.now().strftime('%Y_%m_%d_%H_%M_%S')]
    path_date = '_'.join(path_date_args)
    # Add the S3 url to your path
    path = 's3://' + path_date
    if output_format == 'csv':
        """
        By default this routine uses a tab separator.
        Define a new job variable for a different separator.
        Insert here your code to handle different types of separators.
        """
        separator = '\t'
        # Add the csv file extension
        path_ext = path + '.csv'
        if str_to_bool(compression):
            # Include your logic to handle more compression algorithms
            """
            Insert here your code to apply more textfile compression methods.
            """
            comp_opts = 'zip'
            # Add the compression to your filename
            path_ext = path_ext + '.' + comp_opts
        else:
            comp_opts = None
        # Save your file with the awswrangler package, using the tab separator
        wr.s3.to_csv(
            df=df,
            path=path_ext,
            sep=separator,
            compression=comp_opts,
            index=False
        )
    else:
        # Add the parquet file extension
        path_ext = path + '.parquet'
        if str_to_bool(compression):
            """
            Insert here your code to apply more Parquet compression methods.
            """
            comp_opts = 'snappy'
            # Add the compression to your filename
            path_ext = path_ext + '.' + comp_opts
        else:
            comp_opts = None
        # Save your file with the awswrangler package
        wr.s3.to_parquet(
            df=df,
            path=path_ext,
            compression=comp_opts,
            index=False
        )
    return
# Retrieve credentials from AWS Secrets Manager
# The secret name and region are provided as job parameters
secret_name = args['google_secret_name']
region_name = args['google_secret_region']
session = boto3.session.Session()
client = session.client(
    service_name='secretsmanager',
    region_name=region_name
)
get_secret_value_response = client.get_secret_value(
    SecretId=secret_name
)
secret = get_secret_value_response['SecretString']
secret = json.loads(secret)
# Create credentials object with google.oauth2.
# Use these scopes to connect to Google Drive and Google Sheets.
# More information:
# https://google-auth.readthedocs.io/en/master/reference/google.oauth2.service_account.html
credentials = service_account.Credentials.from_service_account_info(
secret,
scopes=['https://www.googleapis.com/auth/drive', 'https://www.googleapis.com/auth/spreadsheets'])
# Open connection with googleapiclient and create object connection.
# More information: https://developers.google.com/sheets/api/quickstart/python
service = build('sheets', 'v4', credentials=credentials)
sheet = service.spreadsheets()
# Read google object
# More information:
# https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get
result = sheet.values().get(
spreadsheetId=args['google_spreadsheet_id'],
range=args['google_spreadsheet_tab']).execute()
data = result.get('values')
# save data into a dataframe
data = pd.DataFrame(data[1:], columns=data[0])
bucket = args['bucket']
folder = args['folder']
filename = args['filename']
output_format = args['output_format']
compression = args['compression']
save_file(data, bucket, folder, filename, output_format, compression)
This code imports the following additional Python packages:
- google-api-python-client → native Google API client package for Python
- google-auth-oauthlib → native Google authentication package
- google-auth-httplib2 → provides an httplib2 transport for google-auth
- fastparquet → Parquet support for pandas
You can refer to their documentation to extend the proposed code sample.
- Set up the job parameters.
Follow the Glue interface and provide the following information:
  - IAM Role: the Glue role created earlier (gsheet-glue-role)
  - Python version: Python 3.9
  - Data processing units: 1/16 DPU
  - Number of retries: 1
  - Job timeout: 2880 min
You can keep the advanced properties at their default values.
The code uses a set of script variables to easily integrate with your pipeline. In Job parameters, add all the additional Python modules to import and the environment variables. To do that, include the following job parameters in key/value format:
- --additional-python-modules : google-api-python-client, google-auth-oauthlib, google-auth-httplib2, fastparquet
- --google_secret_name: your-google-secret-name
- --google_secret_region: your-google-secret-region
- --google_spreadsheet_id: your-google-spreadsheet-id (More information here)
- --google_spreadsheet_tab: your-google-spreadsheet-tab
- --bucket: your-data-bucket
- --folder: your-data-bucket-folder (default = data)
- --filename: your-data-filename (default = filename)
- --output_format: your-filename-format (supported file formats: parquet and tab-separated csv)
- --compression: your-filename-compression (supported compression algorithms: snappy for parquet and zip for csv)
In case you would prefer to use the AWS CLI to create your Glue job, follow these steps:
- Save the code sample file in an S3 bucket (reuse the bucket you created before or create a new one).
- Open your preferred shell terminal with configured AWS CLI access to your account.
- Execute the following bash command:
# Choose the profile that enables access to your account, a unique name for this job,
# and the role created in the previous sections.
# ScriptLocation must point to your Python script in S3 (create a new bucket or reuse the existing one).
# Choose your own strategy for the TempDir location in --default-arguments.
aws glue create-job --profile account-session \
    --name google_spreadsheet_data \
    --role gsheet-glue-role \
    --command Name=pythonshell,PythonVersion=3.9,ScriptLocation=s3://bucket/script_folder/script.py \
    --default-arguments '{
        "--TempDir":"s3://${temp_bucket}/${temp_folder}",
        "--job-language":"python",
        "library-set":"analytics",
        "--additional-python-modules":"google-api-python-client, google-auth-oauthlib, google-auth-httplib2, fastparquet",
        "--google_secret_name":"${your-google-secret-name}",
        "--google_secret_region":"${your-google-secret-region}",
        "--google_spreadsheet_id":"${your-google-spreadsheet-id}",
        "--google_spreadsheet_tab":"${your-google-spreadsheet-tab}",
        "--bucket":"${your-data-bucket}",
        "--folder":"${your-data-bucket-folder}",
        "--filename":"${your-data-filename}",
        "--output_format":"${your-filename-format}",
        "--compression":"${your-filename-compression}"
    }' \
    --timeout 30 \
    --max-capacity 0.0625 \
    --max-retries 1
Once you have finished the setup, you can run the Glue job either from the AWS Glue interface or with the AWS CLI.
aws glue start-job-run --job-name google_spreadsheet_data
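If needed, individual job parameters can be overridden at run time with the --arguments flag; the values below are placeholders:
aws glue start-job-run --job-name google_spreadsheet_data \
    --arguments '{"--filename":"another_filename","--output_format":"parquet"}'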
AWS Glue is a fully managed service and you will not incur extra charges if you do not run your jobs. However, to avoid any additional charges linked to other services, follow these steps:
- Empty and delete the S3 bucket.
- Delete the Google Service Account credentials in AWS Secrets Manager.
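If you prefer the AWS CLI, the same cleanup can be done with commands along these lines; the bucket and secret names are placeholders:
# Empty and delete the bucket, then schedule the secret for deletion
aws s3 rb s3://your-data-bucket --force
aws secretsmanager delete-secret --secret-id your-google-secret-name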
In case you want to try the code or add new capabilities to this routine, we recommend testing it locally first (a minimal local-testing sketch follows the install commands below). Use your preferred code IDE with the same JSON credentials file you created previously and make any enhancements that apply to your use case. Once you are satisfied with your code, follow the steps in this tutorial with your new routine.
Run the following terminal commands to clone the repo:
git clone http://...
cd <folder>
Navigate to your Python environment and run the following command to install the requirements.
pip install -r requirements.txt
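As a starting point for local testing, a minimal sketch that authenticates with the downloaded key file and reads a sheet into pandas could look like this; the key file path, spreadsheet ID, and tab name are placeholders:
import pandas as pd
from googleapiclient.discovery import build
from google.oauth2 import service_account

# Placeholders: path to the downloaded JSON key, spreadsheet ID and tab name
credentials = service_account.Credentials.from_service_account_file(
    'credentials.json',
    scopes=['https://www.googleapis.com/auth/drive',
            'https://www.googleapis.com/auth/spreadsheets'])

# Build the Sheets client and read the values of the selected tab
sheet = build('sheets', 'v4', credentials=credentials).spreadsheets()
result = sheet.values().get(spreadsheetId='your-google-spreadsheet-id',
                            range='your-google-spreadsheet-tab').execute()
data = result.get('values')

# Load into a pandas dataframe using the first row as header
df = pd.DataFrame(data[1:], columns=data[0])
print(df.head())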
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.