Parsing Large CSV from Google Cloud Storage

GCSIterator (Python CSV iterator for Google Cloud Storage)

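GCSIterator is a chunksize-based Python CSV iterator for Google Cloud Storage. It reads an object in fixed-size byte ranges, so a large CSV file can be parsed row by row instead of being downloaded in a single request.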

Getting Started

# Get gcloud
$ curl | bash

# Get App Engine component
$ gcloud components update app
$ gcloud components update gae-python

# Clone the repo from GitHub
$ git clone

# Install pip packages
$ sudo pip install -r requirements.txt -t lib

Replace <your-bucket-name> and <your-object-name> with your own bucket and object names. You may also change the chunksize (the number of bytes requested per read) at line 24 in file.


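# Imports inferred from the names used below; the discovery_build alias is an assumption.
# GCSIterator is the class provided by this repository.
import csv
import httplib2
from googleapiclient.discovery import build as discovery_build
from oauth2client.client import GoogleCredentials


def get_authenticated_service():
  # Build a Cloud Storage JSON API client from the application default credentials
  credentials = GoogleCredentials.get_application_default()
  http = credentials.authorize(httplib2.Http())
  return discovery_build('storage', 'v1', http=http)

gcs_service = get_authenticated_service()
bucket_name = '<your-bucket-name>' # waldo-gcp-file
object_name = '<your-object-name>' # kaichu_1016_00000100.csv

# get_media requests the object's content rather than its metadata
request = gcs_service.objects().get_media(bucket=bucket_name, object=object_name.encode('utf8'))
iterator = GCSIterator(request, chunksize=512)

# csv.DictReader pulls lines from the iterator, so rows are parsed as chunks arrive
reader = csv.DictReader(iterator, skipinitialspace=True, delimiter=',')
for row in reader:
  print(row)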
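
The internals of GCSIterator are not shown in this README. As a rough illustration of the general technique only, the sketch below reads a byte source in fixed-size ranges, buffers the bytes, and hands complete lines to csv.DictReader. The names ChunkedLineIterator and make_memory_fetcher are hypothetical, and an in-memory byte string stands in for the ranged GCS requests; it is not the repository's implementation.

```python
import csv


class ChunkedLineIterator(object):
    """Yield text lines from a byte source that is read in fixed-size ranges.

    Hypothetical sketch: fetch_range(start, end) must return the bytes from
    offset start to offset end (inclusive), or fewer once the source runs out.
    """

    def __init__(self, fetch_range, chunksize=512):
        self._fetch_range = fetch_range
        self._chunksize = chunksize
        self._offset = 0       # next byte offset to request
        self._buffer = b''     # bytes fetched but not yet returned as lines
        self._done = False     # True once the source is exhausted

    def __iter__(self):
        return self

    def __next__(self):
        # Fetch more ranges until the buffer holds a full line or the source ends.
        while b'\n' not in self._buffer and not self._done:
            start = self._offset
            end = start + self._chunksize  # inclusive end, e.g. bytes=0-512
            chunk = self._fetch_range(start, end)
            self._offset = end + 1
            self._buffer += chunk
            if len(chunk) < end - start + 1:
                self._done = True  # a short read means there is nothing left
        if b'\n' in self._buffer:
            line, self._buffer = self._buffer.split(b'\n', 1)
            return line.decode('utf-8') + '\n'
        if self._buffer:
            # Last line without a trailing newline
            line, self._buffer = self._buffer, b''
            return line.decode('utf-8')
        raise StopIteration

    next = __next__  # Python 2 iterator protocol


def make_memory_fetcher(data):
    """Stand-in for a ranged download: serve byte ranges from memory."""
    def fetch_range(start, end):
        return data[start:end + 1]
    return fetch_range


data = b'name,email\ncage00000000,\ncage00000001,\ncage00000002,\n'
iterator = ChunkedLineIterator(make_memory_fetcher(data), chunksize=16)
rows = list(csv.DictReader(iterator, skipinitialspace=True, delimiter=','))
for row in rows:
    print(row)
```

Splitting only at newline bytes keeps multi-byte UTF-8 characters intact even when a range boundary falls in the middle of one, because the line is decoded only after it is complete. A smaller chunksize means more requests but less buffered data; a larger one means the opposite.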


# sample output
$ python
read bytes=0-512/*
{'email': '', 'name': 'cage00000000'}
{'email': '', 'name': 'cage00000001'}
{'email': '', 'name': 'cage00000002'}
{'email': '', 'name': 'cage00000003'}
{'email': '', 'name': 'cage00000004'}
{'email': '', 'name': 'cage00000005'}
{'email': '', 'name': 'cage00000006'}
{'email': '', 'name': 'cage00000007'}
{'email': '', 'name': 'cage00000008'}
{'email': '', 'name': 'cage00000009'}
{'email': '', 'name': 'cage00000010'}
read bytes=513-1025/4411
{'email': '', 'name': 'cage00000011'}
{'email': '', 'name': 'cage00000012'}
{'email': '', 'name': 'cage00000013'}
{'email': '', 'name': 'cage00000014'}
{'email': '', 'name': 'cage00000015'}
{'email': '', 'name': 'cage00000016'}
{'email': '', 'name': 'cage00000017'}
{'email': '', 'name': 'cage00000018'}
{'email': '', 'name': 'cage00000019'}
{'email': '', 'name': 'cage00000020'}
{'email': '', 'name': 'cage00000021'}
{'email': '', 'name': 'cage00000022'}
read bytes=1026-1538/4411
{'email': '', 'name': 'cage00000023'}
{'email': '', 'name': 'cage00000024'}