
PandasGlue

Code style: black

AWS Glue is a simple, flexible, and cost-effective ETL service from AWS, and Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools.

The goal of this package is to help data engineers use cost-efficient serverless compute services (Lambda, Glue, Athena) by making it easy to integrate Pandas with AWS Glue: the write function loads the contents of a DataFrame directly into a Glue Data Catalog table in Parquet format, and the read function runs Athena queries and returns the results as a Pandas DataFrame.

Use cases

This package is recommended for ETL jobs that load and transform small to medium-sized datasets without creating Spark jobs, which helps reduce infrastructure costs.

It can be used within Lambda functions, Glue scripts, EC2 instances, or any other infrastructure resource.

Prerequisites

pip install pandas
pip install boto3
pip install pyarrow 

Installing the package

pip install pandasglue

Usage

Read method:

read_glue()

Retrieves the result of an Athena query as a Pandas DataFrame.

Quick example:

import pandas as pd
import pandasglue as pg

#Parameters
sql_query = "SELECT * FROM table_name LIMIT 20"
db_name = "DB_NAME"
s3_output_bucket = "s3://bucket-url/"  # S3 location where Athena stages the query results

df = pg.read_glue(sql_query,db_name,s3_output_bucket)

print(df)

Write method:

write_glue()

Writes a given Pandas DataFrame to a Glue Data Catalog table stored in Parquet format.

Quick example:

import pandas as pd
import pandasglue as pg

#Parameters
database = "DB_NAME"
table_name = "TB_NAME"
s3_path = "s3://bucket-url/"

#Sample DF
source_data = {'name': ['Sarah', 'Renata', 'Erika', 'Fernanda', 'Diana'],
               'city': ['Seattle', 'Sao Paulo', 'Seattle', 'Santiago', 'Lima'],
               'test_score': [82, 52, 56, 234, 254]}

df = pd.DataFrame(source_data, columns=['name', 'city', 'test_score'])


pg.write_glue(df, database, table_name, s3_path, partition_cols=['city'])
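
Continuing the example above, the newly written table can be queried back with read_glue() as a quick check (a small sketch; it assumes the same bucket can also serve as the Athena output location):

# Reuses `database` and `s3_path` from the snippet above.
sql_check = "SELECT name, city, test_score FROM TB_NAME WHERE city = 'Seattle'"
df_check = pg.read_glue(sql_check, database, s3_path)

print(df_check)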

Parameters list (as used in the write_glue() example above):

  • df: the Pandas DataFrame to be written.
  • database: name of the target database in the Glue Data Catalog.
  • table_name: name of the table to be created or loaded in the Glue Data Catalog.
  • s3_path: S3 location (bucket and prefix) where the Parquet files are stored.
  • partition_cols: optional list of columns used to partition the table (e.g. ['city']).

Built With

  • Boto3 - the AWS SDK for Python, which allows Python developers to write software that makes use of Amazon services like S3 and EC2.
  • PyArrow - Python bindings for Apache Arrow, used here, among other functions, to convert data to Parquet files (see the sketch below).
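
For illustration only, the Parquet conversion and upload that these libraries make possible looks roughly like this (a minimal sketch; the file name, bucket, and key are placeholders, not the package's internal code):

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Convert a Pandas DataFrame to an Arrow Table, then serialize it as a Parquet file.
df = pd.DataFrame({'name': ['Sarah', 'Renata'], 'city': ['Seattle', 'Sao Paulo']})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'data.parquet')

# Upload the Parquet file to S3 with Boto3 (bucket and key are placeholders).
s3 = boto3.client('s3')
s3.upload_file('data.parquet', 'my-bucket', 'my-table/data.parquet')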

Examples on AWS services:

As noted in the use cases above, pandasglue can run inside an AWS Lambda function.
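
The handler below is a hypothetical sketch of that pattern (the event payload shape, database, table, and bucket names are all assumptions):

import pandas as pd
import pandasglue as pg

def lambda_handler(event, context):
    # Build a small DataFrame from the incoming event payload (shape assumed for this sketch).
    df = pd.DataFrame(event['records'])

    # Load it into a Glue Data Catalog table in Parquet format, partitioned by city.
    pg.write_glue(df, 'DB_NAME', 'TB_NAME', 's3://bucket-url/', partition_cols=['city'])

    return {'status': 'ok', 'rows': len(df)}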

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgments

  • Hat tip to anyone whose code was used
  • Inspiration
  • etc
