Export

We describe here how to use the Avro exporter with Scrapy.

Avro is a file format commonly used on Big Data platforms for exchanging data in a structured way.

Example

You can find an example of how to use it in a Scrapy project here.

General Guidelines

You need at least the fastavro library to enable the Avro export. You may need additional libraries for special types of compression (see below).
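fastavro is available on PyPI and can be installed, for example, with pip install fastavro.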

Please look carefully at the options below.

In any case, you need to define an Avro schema. Field names in your Avro schema should match the field names defined in your Scrapy scraper project. Check the types carefully and make sure that your scraper always provides the correct type, especially when the data cannot be found on the web page or is not in the expected format (e.g. numbers containing text on the website).
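For illustration, a minimal sketch of a Scrapy item matching the quotes schema used in the examples below (the class name QuoteItem is an assumption; only the field names text, author and tags have to match the schema):

import scrapy

class QuoteItem(scrapy.Item):
    # Field names must match the 'name' entries of the Avro schema
    text = scrapy.Field()    # schema type: string
    author = scrapy.Field()  # schema type: array of strings
    tags = scrapy.Field()    # schema type: array of strings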

Configuration

In your Scrapy project, you need to configure the following exporter in settings.py:

FEED_EXPORTERS={'avro': 'zuinnote.scrapy.contrib.bigexporters.AvroItemExporter'} # register additional format

Then you need to configure FEEDS in settings.py to define the output format and file name.

Local file (e.g. "data-quotes-2020-01-01T10-00-00.avro") with a schema containing the fields "text, author, tags":

FEEDS = {
    'data-%(name)s-%(time)s.avro': {
        'format': 'avro',
        'encoding': 'utf8',
        'store_empty': False,
        'item_export_kwargs': {
            'compression': 'deflate',
            'compressionlevel': None,
            'metadata': None,
            'syncinterval': 16000,
            'recordcache': 10000,
            'syncmarker': None,
            'convertallstrings': False,
            'validator': None,
            'avroschema': {
                'doc': 'Some quotes',
                'name': 'quotes',
                'type': 'record',
                'fields': [
                    {'name': 'text', 'type': 'string'},
                    {'name': 'author', 'type': {
                        'type': 'array',
                        'items': 'string',
                        'default': []
                        }
                    },
                    {'name': 'tags', 'type': {
                        'type': 'array',
                        'items': 'string',
                        'default': []
                        }
                    },
                ]
            }
        }
    }
}
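With this configuration in place, running your spider as usual (e.g. scrapy crawl <spidername>) writes all scraped items to the Avro file defined above.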

S3 file (e.g. "s3://mybucket/data-quotes-2020-01-01T10-00-00.avro") with a schema containing the fields "text, author, tags":

FEEDS = {
    's3://aws_key:aws_secret@mybucket/data-%(name)s-%(time)s.avro': {
        'format': 'avro',
        'encoding': 'utf8',
        'store_empty': False,
        'item_export_kwargs': {
            'compression': 'deflate',
            'compressionlevel': None,
            'metadata': None,
            'syncinterval': 16000,
            'recordcache': 10000,
            'syncmarker': None,
            'convertallstrings': False,
            'validator': None,
            'avroschema': {
                'doc': 'Some quotes',
                'name': 'quotes',
                'type': 'record',
                'fields': [
                    {'name': 'text', 'type': 'string'},
                    {'name': 'author', 'type': {
                        'type': 'array',
                        'items': 'string',
                        'default': []
                        }
                    },
                    {'name': 'tags', 'type': {
                        'type': 'array',
                        'items': 'string',
                        'default': []
                        }
                    },
                ]
            }
        }
    }
}

There are more storage backends, e.g. Google Cloud Storage. See the documentation linked above.
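For example, a minimal sketch of a Google Cloud Storage feed, assuming Google Cloud Storage support is available in your Scrapy version and the google-cloud-storage library is installed (the bucket name mybucket and project id myproject are placeholders):

# settings.py -- sketch only; bucket and project id are placeholders
GCS_PROJECT_ID = 'myproject'

FEEDS = {
    'gs://mybucket/data-%(name)s-%(time)s.avro': {
        'format': 'avro',
        'encoding': 'utf8',
        'store_empty': False,
        'item_export_kwargs': {
            'compression': 'deflate',
            # define 'avroschema' (and any other options) as in the examples above
            'avroschema': {
                'doc': 'Some quotes',
                'name': 'quotes',
                'type': 'record',
                'fields': [
                    {'name': 'text', 'type': 'string'},
                ]
            }
        }
    }
}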

Finally, you can define various options in 'item_export_kwargs' in the FEEDS setting (at a minimum you need to define 'avroschema'):

Options for Avro export

'compression' (default: 'deflate'): Compression codec to be used in Avro: 'null', 'deflate', 'bzip2', 'snappy', 'zstandard', 'lz4', 'xz'
'compressionlevel' (default: None): Compression level to be used in Avro; can be an integer if supported by the codec
'metadata' (default: None): Avro metadata (dict)
'syncinterval' (default: 16000): Sync interval, i.e. how many bytes are written per block; should be several thousand. The higher the value, the better the compression, but seek time may increase
'recordcache' (default: 10000): How many records are written at once. The higher the value, the better the compression, but the more memory is needed
'syncmarker' (default: None): Bytes; if None, a random byte string is used
'convertallstrings' (default: False): Convert all values to strings. Recommended for compatibility reasons; conversion to native types is suggested as part of the ingestion in the processing platform
'avroschema' (default: None): Mandatory; specifies the schema. Please name your fields exactly as you name them in your items. Please make sure that the item always has its values filled, otherwise you may see errors during scraping. See also fastavro write
'validator' (default: None): Use the fastavro validator when writing; can be None, True (fastavro.validation.validate is used) or a custom function
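If you are unsure whether your items match the schema, you can check a sample record directly with fastavro before wiring everything into the exporter. A minimal sketch (the sample record is an assumption matching the quotes schema used above):

from fastavro.validation import validate

avroschema = {
    'doc': 'Some quotes',
    'name': 'quotes',
    'type': 'record',
    'fields': [
        {'name': 'text', 'type': 'string'},
        {'name': 'author', 'type': {'type': 'array', 'items': 'string', 'default': []}},
        {'name': 'tags', 'type': {'type': 'array', 'items': 'string', 'default': []}},
    ]
}

# Raises a ValidationError if the record does not match the schema
record = {'text': 'To be, or not to be', 'author': ['William Shakespeare'], 'tags': ['life']}
validate(record, avroschema)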

Additional libraries

If you want to use special types of compression, additional libraries may be needed:

Compression codecs and required libraries

'null': no compression (built-in)
'deflate': Gzip compression (built-in)
'bzip2': Bzip2 compression (built-in)
'snappy': Snappy compression (requires python-snappy)
'zstandard': Zstandard compression (requires zstandard)
'lz4': LZ4 compression (requires lz4)
'xz': XZ compression (requires backports.lzma)
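For example, to switch the feed to Snappy compression (a sketch; it assumes python-snappy has been installed, e.g. with pip install python-snappy):

# settings.py -- sketch: Snappy-compressed Avro feed
FEEDS = {
    'data-%(name)s-%(time)s.avro': {
        'format': 'avro',
        'encoding': 'utf8',
        'store_empty': False,
        'item_export_kwargs': {
            'compression': 'snappy',  # requires the python-snappy library
            # define 'avroschema' (and any other options) as in the examples above
            'avroschema': {
                'doc': 'Some quotes',
                'name': 'quotes',
                'type': 'record',
                'fields': [
                    {'name': 'text', 'type': 'string'},
                ]
            }
        }
    }
}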