Export

We describe here how to use the Parquet exporter with Scrapy.

Parquet is a columnar file format commonly used on Big Data platforms for structured data analysis.

Example

You can find an example of how to use it in a Scrapy project here.

General Guidelines

Please look carefully at the options below.

You need at least the library fastparquet to enable the Parquet export. You may need additional libraries for special types of compression (see below).

Configuration

In the settings.py of your Scrapy project, register the following exporter:

FEED_EXPORTERS={'parquet': 'zuinnote.scrapy.contrib.bigexporters.ParquetItemExporter'} # register additional format

Then you need to configure FEEDS in settings.py to define output format and file name.

Example local file, e.g. data-quotes-2020-01-01T10-00-00.parquet:

FEEDS = {
    'data-%(name)s-%(time)s.parquet': {
        'format': 'parquet',
        'encoding': 'utf8',
        'store_empty': False,
        'item_export_kwargs': {
            'compression': 'GZIP',
            'times': 'int64',
            'hasnulls': True,
            'convertallstrings': False,
            'writeindex': False,
            'objectencoding': 'infer',
            'rowgroupoffset': 50000000,
            'items_rowgroup': 10000
        },
    }
}
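
The placeholders in the file name are filled in with printf-style string formatting. The following is a minimal sketch of that expansion, not Scrapy's own code; the spider name `quotes` and the timestamp are hypothetical values chosen to reproduce the example file name above.

```python
from datetime import datetime

# The feed URI template from the FEEDS example.
uri_template = "data-%(name)s-%(time)s.parquet"

# Hypothetical values: at runtime Scrapy supplies the spider name and a
# timestamp in which ':' is replaced by '-'.
params = {
    "name": "quotes",
    "time": datetime(2020, 1, 1, 10, 0, 0).isoformat().replace(":", "-"),
}

file_name = uri_template % params
print(file_name)
```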

Example s3 file, e.g. s3://mybucket/data-quotes-2020-01-01T10-00-00.parquet:

FEEDS = {
    's3://aws_key:aws_secret@mybucket/data-%(name)s-%(time)s.parquet': {
        'format': 'parquet',
        'encoding': 'utf8',
        'store_empty': False,
        'item_export_kwargs': {
            'compression': 'GZIP',
            'times': 'int64',
            'hasnulls': True,
            'convertallstrings': False,
            'writeindex': False,
            'objectencoding': 'infer',
            'rowgroupoffset': 50000000,
            'items_rowgroup': 10000
        },
    }
}
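
If you prefer not to embed credentials in the URI, Scrapy also reads them from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings. The sketch below assumes the credentials are provided through environment variables of the same names; the bucket name is the one from the example above and the export options are shortened for brevity.

```python
import os

# Assumption: the credentials are exported as environment variables
# before the crawl starts. Empty strings are used if they are missing.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

# The URI no longer carries the key and secret.
FEEDS = {
    "s3://mybucket/data-%(name)s-%(time)s.parquet": {
        "format": "parquet",
        "encoding": "utf8",
        "store_empty": False,
        "item_export_kwargs": {
            "compression": "GZIP",
        },
    }
}
```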

There are more storage backends, e.g. Google Cloud. See the documentation linked above.

Finally, you can define various options in 'item_export_kwargs' within the FEEDS setting.

Options for Parquet export

Option	Default	Description
'compression'	'compression' : 'GZIP'	Compression to be used in Parquet: 'UNCOMPRESSED', 'GZIP', 'SNAPPY', 'LZO', 'BROTLI', 'LZ4', 'ZSTD'. Instead of a string, you can also specify a dict containing compression options (see here).
'times'	'times' : 'int64'	Type used for times: 'int64' or 'int96'. Older Spark versions support int96 only.
'convertallstrings'	'convertallstrings' : False	Convert all values to strings. Recommended for compatibility reasons; conversion to native types is then best done as part of the ingestion in the processing platform.
'hasnulls'	'hasnulls' : True	Whether the data can contain nulls.
'writeindex'	'writeindex' : False	Write the index as an extra column.
'objectencoding'	'objectencoding' : 'infer'	As of version 0.4.0. Data type of columns. 'infer' is a special type and means that fastparquet tries to detect the type automatically. Can be a str or a dictionary in the format {col: type}, where type can be infer|bytes|utf8|json|bson|bool|int|int32|float|decimal; bytes is assumed if not specified (i.e., no conversion) (see here).
'rowgroupoffset'	'rowgroupoffset' : 50000000	Offset of the row groups.
'items_rowgroup'	'items_rowgroup' : 10000	Number of items per row group. This should be several thousand, e.g. between 5,000 and 30,000. The more rows, the higher the memory consumption and the better the compression of the final Parquet file.
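
As an illustration of the dict forms of 'compression' and 'objectencoding', here is a sketch of an 'item_export_kwargs' block with per-column settings. The column names `text`, `author` and `tags` are hypothetical. The nested compression layout (`type`, `args`, `_default`) follows the format documented by fastparquet and is assumed here to be passed through unchanged by the exporter.

```python
# Hypothetical item fields: 'text', 'author', 'tags'.
item_export_kwargs = {
    # Per-column compression; '_default' applies to all other columns.
    "compression": {
        "text": {"type": "GZIP", "args": None},
        "_default": {"type": "SNAPPY", "args": None},
    },
    # Per-column data types in the format {col: type}.
    "objectencoding": {
        "text": "utf8",
        "author": "utf8",
        "tags": "json",
    },
    "times": "int64",
    "hasnulls": True,
    "items_rowgroup": 10000,
}

FEEDS = {
    "data-%(name)s-%(time)s.parquet": {
        "format": "parquet",
        "item_export_kwargs": item_export_kwargs,
    }
}
```

Note that 'SNAPPY' in this sketch requires the additional library listed in the codec table.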

Additional libraries

If you want to use special types of compression, additional libraries may be needed:

Compression Codecs and required libraries

Compression Codec	Description	Additional library
'UNCOMPRESSED'	No compression	built-in
'GZIP'	Gzip compression	built-in
'SNAPPY'	Snappy compression	python-snappy
'LZO'	LZO compression	python-lzo
'BROTLI'	Brotli compression (note: Scrapy requires brotlipy, whereas fastparquet requires brotli)	brotli
'ZSTD'	Zstandard compression	zstandard
'LZ4'	LZ4 compression	lz4
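
Before starting a crawl, you can check whether the library for a codec is installed. The helper below is a hypothetical sketch and not part of the exporter; the import names (`snappy`, `lzo`, `brotli`, `zstandard`, `lz4`) are assumed to be the module names provided by the packages in the table above.

```python
from importlib.util import find_spec

# Codec name -> importable module name (None means built-in).
# The module names are assumptions derived from the table above.
CODEC_MODULES = {
    "UNCOMPRESSED": None,
    "GZIP": None,
    "SNAPPY": "snappy",
    "LZO": "lzo",
    "BROTLI": "brotli",
    "ZSTD": "zstandard",
    "LZ4": "lz4",
}


def codec_available(codec: str) -> bool:
    """Return True if the library assumed for the codec can be imported."""
    module = CODEC_MODULES[codec.upper()]
    return module is None or find_spec(module) is not None


if __name__ == "__main__":
    for name in CODEC_MODULES:
        status = "available" if codec_available(name) else "missing library"
        print(f"{name}: {status}")
```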
