Meta-library to upload Pandas data frames as CSV/Parquet files to AWS S3 and create a Hive external table over that S3 bucket

At my work I do this a lot:
  1. Read data into a Pandas data frame
  2. Save the data to an AWS S3 bucket in CSV or Parquet format
  3. Create an external Hive table that reads from those files in S3

To make this job easier, I've created a small meta-library containing the basic methods I use to implement this pipeline.

Issues this lib addresses:
  - A lot of manual work preparing column types for the Pandas data frame
  - Automating the process of saving the data frame to a local file, uploading the file to S3, and then deleting it locally
  - Automating the scripting of the CREATE statement for the external Hive table, with column types derived from the data frame's basic types (see the sketch below)
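
The library derives Hive column types from the data frame's dtypes. Below is a minimal sketch of what such a mapping can look like, assuming only basic dtypes and using the types that appear in the example CREATE statement further down (BIGINT for integers, VARCHAR(512) for strings); the helper name and the exact mapping are illustrative, not the library's actual code:

```python
import pandas as pd

# Hypothetical helper: map basic pandas dtypes to Hive column types.
# The concrete choices (e.g. VARCHAR(512) for strings) follow the example
# CREATE statement below and are assumptions, not the library's internals.
def pandas_dtype_to_hive(dtype) -> str:
    if pd.api.types.is_integer_dtype(dtype):
        return "BIGINT"
    if pd.api.types.is_float_dtype(dtype):
        return "DOUBLE"
    if pd.api.types.is_bool_dtype(dtype):
        return "BOOLEAN"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "TIMESTAMP"
    return "VARCHAR(512)"  # fall back to a string column

df = pd.DataFrame([{"col1": 11, "col2": "test"}])
print({col: pandas_dtype_to_hive(df[col].dtype) for col in df.columns})
# -> {'col1': 'BIGINT', 'col2': 'VARCHAR(512)'}
```
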
Usage:

A. The data should be in a Pandas data frame.

B. Init the lib as

upl = Upload_S3_HIVE(df, export_type='csv')

where

- `df` – a pandas data frame
- `export_type` – the format of the file saved to S3; either 'csv' or 'parquet' (Parquet files are written via the Arrow engine)
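
For example, with some made-up data (the import path is assumed from the module file name upload_s3_hive.py):

```python
import pandas as pd
from upload_s3_hive import Upload_S3_HIVE  # import path assumed from the repo layout

df = pd.DataFrame([{"col1": 11, "col2": "test"},
                   {"col1": 21, "col2": "test2"}])

upl_csv = Upload_S3_HIVE(df, export_type='csv')           # CSV export
upl_parquet = Upload_S3_HIVE(df, export_type='parquet')   # Parquet export via Arrow
```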

C. To script a CREATE statement for an external Hive table, use the method

upl.script_hive_ext_table_create_statement(s3_bucket, dbtablename)

where

- `s3_bucket` – the full path to the bucket where the file will be saved
- `dbtablename` – the name of the external table that will use this S3 bucket as its data source

The generated CREATE statements differ between the 'csv' and 'parquet' formats.

For example, the pandas data frame:

pd.DataFrame([{"col1": 11, "col2": 'test'}, {"col1": 21, "col2": 'test2'}, {"col1": 31, "col2": 'test3'}, ])

with s3_bucket = "s3://export_test/test" and dbtablename = 'test_table'

will be scripted as:

``CREATE EXTERNAL TABLE ext.test_table ('col1' BIGINT, 'col2' VARCHAR(512)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' STORED AS TEXTFILE LOCATION 's3://export_test/test'``
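
For the 'parquet' export type, a statement of roughly this shape would be expected, assuming the usual Hive DDL for Parquet-backed external tables (an illustration, not the library's verified output):

``CREATE EXTERNAL TABLE ext.test_table ('col1' BIGINT, 'col2' VARCHAR(512)) STORED AS PARQUET LOCATION 's3://export_test/test'``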

D. To copy a data frame into an AWS S3 bucket in a particular format, use this method:

upl.upload_to_s3(s3bucket)

where

- `s3_bucket` – the full path to the bucket where the file will be saved

The method fixes the types of the data frame's columns, exports the data from the pandas data frame to a local file (based on the base file name), copies this file to the given S3 bucket, and deletes the local file afterwards.
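
In other words, it is a save/upload/delete cycle. A minimal sketch of that cycle, assuming boto3 credentials are already configured; the function name, parameter names, and the CSV separator are illustrative, not the library's internals:

```python
import os
import pandas as pd
import boto3

def export_and_upload(df: pd.DataFrame, s3_bucket: str, key: str,
                      export_type: str = "csv") -> None:
    """Sketch of the save -> upload -> delete cycle (not the library's actual code)."""
    local_path = os.path.basename(key)

    # 1. Export the data frame to a local file.
    if export_type == "csv":
        df.to_csv(local_path, sep=";", index=False)   # ';' matches the example DDL above
    else:
        df.to_parquet(local_path, engine="pyarrow", index=False)

    # 2. Copy the file to the S3 bucket.
    boto3.client("s3").upload_file(local_path, s3_bucket, key)

    # 3. Delete the local copy.
    os.remove(local_path)
```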
