Parsing Large CSV from Google Cloud Storage
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
GCSIterator.py
README.md
main.py
requirements.txt
utils.py

README.md

GCSIterator (Python CSV iterator for Google Cloud Storage)

GCSIterator is a chucksize base Python CSV iterator for Google Cloud Storage

Getting Started

# Get gcloud
$ curl https://sdk.cloud.google.com | bash

# Get App Engine component
$ gcloud components update app
$ gcloud components update gae-python

# Clone repo from github
$ git clone https://github.com/cage1016/GCSIterator

# Install pip packages
$ sudo pip install -r requirements.txt -t lib

Replace your bucket-name and object-name. You may also modify chunksize at line 24 in main.py file.

# main.py
...

def get_authenticated_service():
  credentials = GoogleCredentials.get_application_default()
  http = credentials.authorize(httplib2.Http())
  return discovery_build('storage', 'v1', http=http)

gcs_service = get_authenticated_service()
bucket_name = '<your-bucket-name>' # waldo-gcp-file
object_name = '<your-object-name>' # kaichu_1016_00000100.csv

request = gcs_service.objects().get_media(bucket=bucket_name, object=object_name.encode('utf8'))
iterator = GCSIterator(request, chunksize=512)

reader = csv.DictReader(iterator, skipinitialspace=True, delimiter=',')
for row in reader:
  print row

Execute main.py

# sample output
$ python main.py
read bytes=0-512/*
{'email': 'kaichu_1016+00000000@gmail.com', 'name': 'cage00000000'}
{'email': 'kaichu_1016+00000001@gmail.com', 'name': 'cage00000001'}
{'email': 'kaichu_1016+00000002@gmail.com', 'name': 'cage00000002'}
{'email': 'kaichu_1016+00000003@gmail.com', 'name': 'cage00000003'}
{'email': 'kaichu_1016+00000004@gmail.com', 'name': 'cage00000004'}
{'email': 'kaichu_1016+00000005@gmail.com', 'name': 'cage00000005'}
{'email': 'kaichu_1016+00000006@gmail.com', 'name': 'cage00000006'}
{'email': 'kaichu_1016+00000007@gmail.com', 'name': 'cage00000007'}
{'email': 'kaichu_1016+00000008@gmail.com', 'name': 'cage00000008'}
{'email': 'kaichu_1016+00000009@gmail.com', 'name': 'cage00000009'}
{'email': 'kaichu_1016+00000010@gmail.com', 'name': 'cage00000010'}
read bytes=513-1025/4411
{'email': 'kaichu_1016+00000011@gmail.com', 'name': 'cage00000011'}
{'email': 'kaichu_1016+00000012@gmail.com', 'name': 'cage00000012'}
{'email': 'kaichu_1016+00000013@gmail.com', 'name': 'cage00000013'}
{'email': 'kaichu_1016+00000014@gmail.com', 'name': 'cage00000014'}
{'email': 'kaichu_1016+00000015@gmail.com', 'name': 'cage00000015'}
{'email': 'kaichu_1016+00000016@gmail.com', 'name': 'cage00000016'}
{'email': 'kaichu_1016+00000017@gmail.com', 'name': 'cage00000017'}
{'email': 'kaichu_1016+00000018@gmail.com', 'name': 'cage00000018'}
{'email': 'kaichu_1016+00000019@gmail.com', 'name': 'cage00000019'}
{'email': 'kaichu_1016+00000020@gmail.com', 'name': 'cage00000020'}
{'email': 'kaichu_1016+00000021@gmail.com', 'name': 'cage00000021'}
{'email': 'kaichu_1016+00000022@gmail.com', 'name': 'cage00000022'}
read bytes=1026-1538/4411
{'email': 'kaichu_1016+00000023@gmail.com', 'name': 'cage00000023'}
{'email': 'kaichu_1016+00000024@gmail.com', 'name': 'cage00000024'}
...

Reference