Parsing Large CSV from Google Cloud Storage
GCSIterator (Python CSV iterator for Google Cloud Storage)

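GCSIterator is a chunk-size-based Python CSV iterator for Google Cloud Storage. It reads an object in fixed-size chunks, so a large CSV file can be parsed row by row instead of being downloaded in one piece.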
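
To illustrate the idea, here is a minimal sketch of a chunk-based line iterator. It is not the GCSIterator implementation from this repo: `ChunkedLineIterator` and the `fetch(start, end)` callable are hypothetical names, and an in-memory byte string stands in for a GCS object.

```python
import csv


class ChunkedLineIterator(object):
    """Hypothetical sketch: yields text lines from a source read in fixed-size chunks."""

    def __init__(self, fetch, total_size, chunksize=512):
        self._fetch = fetch          # callable(start, end) -> bytes, end is inclusive
        self._total = total_size     # size of the source in bytes
        self._chunksize = chunksize  # number of bytes requested per fetch
        self._pos = 0                # offset of the next byte to fetch
        self._buffer = b''           # bytes fetched but not yet returned

    def __iter__(self):
        return self

    def __next__(self):
        # Keep fetching chunks until the buffer holds at least one complete line.
        while b'\n' not in self._buffer and self._pos < self._total:
            end = min(self._pos + self._chunksize, self._total) - 1
            self._buffer += self._fetch(self._pos, end)
            self._pos = end + 1
        if not self._buffer:
            raise StopIteration
        # Hand back one line and keep the remainder for the next call.
        line, sep, rest = self._buffer.partition(b'\n')
        self._buffer = rest
        return (line + sep).decode('utf-8')

    next = __next__  # Python 2 name for the same method


# Assumed sample data; a real fetch would issue a ranged request instead.
data = b'name,email\ncage00000000,a@example.com\ncage00000001,b@example.com\n'
fetch = lambda start, end: data[start:end + 1]
rows = list(csv.DictReader(ChunkedLineIterator(fetch, len(data), chunksize=16)))
```

Because lines are only split at newline bytes, a row that straddles two chunks is reassembled in the buffer before it reaches the CSV reader.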

Getting Started

# Get gcloud
$ curl | bash

# Get App Engine component
$ gcloud components update app
$ gcloud components update gae-python

# Clone repo from github
$ git clone

# Install pip packages
$ sudo pip install -r requirements.txt -t lib

Replace <your-bucket-name> and <your-object-name> with your own bucket and object names. You may also change chunksize at line 24 in the file.


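import csv

import httplib2
from oauth2client.client import GoogleCredentials
# Assumption: discovery_build is the API client's build() imported under an alias.
from googleapiclient.discovery import build as discovery_build
# GCSIterator is provided by this repo; import it from the module that defines it.


def get_authenticated_service():
  # Use Application Default Credentials to build an authorized Storage client.
  credentials = GoogleCredentials.get_application_default()
  http = credentials.authorize(httplib2.Http())
  return discovery_build('storage', 'v1', http=http)

gcs_service = get_authenticated_service()
bucket_name = '<your-bucket-name>' # waldo-gcp-file
object_name = '<your-object-name>' # kaichu_1016_00000100.csv

# Build the media download request; GCSIterator reads it chunksize bytes at a time.
request = gcs_service.objects().get_media(bucket=bucket_name, object=object_name.encode('utf8'))
iterator = GCSIterator(request, chunksize=512)

# DictReader consumes the iterator line by line and yields one dict per row.
reader = csv.DictReader(iterator, skipinitialspace=True, delimiter=',')
for row in reader:
  print(row)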
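
csv.DictReader accepts any iterator that yields one line per step, which is why GCSIterator can be passed to it directly. A small sketch with a plain list standing in for the iterator (the sample lines are made up) shows what skipinitialspace=True does to values that follow a comma and a space:

```python
import csv

# Assumed sample lines; note the space after each comma.
lines = ['name, email', 'cage00000000, a@example.com', 'cage00000001, b@example.com']

# skipinitialspace=True drops the whitespace that directly follows a delimiter.
reader = csv.DictReader(iter(lines), skipinitialspace=True, delimiter=',')
records = [dict(row) for row in reader]
```

Without skipinitialspace, the second column name would be ' email' with a leading space.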


# sample output
$ python
read bytes=0-512/*
{'email': '', 'name': 'cage00000000'}
{'email': '', 'name': 'cage00000001'}
{'email': '', 'name': 'cage00000002'}
{'email': '', 'name': 'cage00000003'}
{'email': '', 'name': 'cage00000004'}
{'email': '', 'name': 'cage00000005'}
{'email': '', 'name': 'cage00000006'}
{'email': '', 'name': 'cage00000007'}
{'email': '', 'name': 'cage00000008'}
{'email': '', 'name': 'cage00000009'}
{'email': '', 'name': 'cage00000010'}
read bytes=513-1025/4411
{'email': '', 'name': 'cage00000011'}
{'email': '', 'name': 'cage00000012'}
{'email': '', 'name': 'cage00000013'}
{'email': '', 'name': 'cage00000014'}
{'email': '', 'name': 'cage00000015'}
{'email': '', 'name': 'cage00000016'}
{'email': '', 'name': 'cage00000017'}
{'email': '', 'name': 'cage00000018'}
{'email': '', 'name': 'cage00000019'}
{'email': '', 'name': 'cage00000020'}
{'email': '', 'name': 'cage00000021'}
{'email': '', 'name': 'cage00000022'}
read bytes=1026-1538/4411
{'email': '', 'name': 'cage00000023'}
{'email': '', 'name': 'cage00000024'}
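
The "read bytes=..." lines in the output above suggest that each request covers an inclusive byte range of chunksize + 1 bytes. The following sketch reproduces that pattern; `byte_ranges` is a hypothetical helper, and clamping the final range to the object size is an assumption rather than documented GCSIterator behavior.

```python
def byte_ranges(total_size, chunksize):
    """Yield inclusive (start, end) offsets, each spanning chunksize + 1 bytes."""
    start = 0
    while start < total_size:
        # Clamp the last range so it never runs past the final byte.
        end = min(start + chunksize, total_size - 1)
        yield start, end
        start = end + 1


# 4411 is the object size reported in the sample output; 512 is the chunksize.
headers = ['bytes=%d-%d' % r for r in byte_ranges(4411, 512)]
```

With these inputs the first three entries are bytes=0-512, bytes=513-1025 and bytes=1026-1538, matching the log.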