
# Custom inputs - FileBasedSource

Apache Beam makes it easy to read from text files using `beam.io.ReadFromText`, which works in most common cases. However, it assumes that each element occupies exactly one line.

Let's see how we can use an Apache Beam custom source to read from both a compacted JSON file, where the entire content is in a single line, and a multiline JSON file, where every element spans multiple lines.
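To see why line-based reading falls short for both layouts, here is a minimal sketch that uses only the standard `json` module rather than Beam. The helper `parse_lines` is hypothetical and only imitates a reader that treats every line as one element.

```python
import json

# A few sample records, serialized in both layouts.
records = [{'value': i, 'value_squared': i * i} for i in range(1, 4)]
compacted = json.dumps(records, separators=(',', ':'))
multiline = json.dumps(records, indent=2)

def parse_lines(text):
  """Tries to parse every line as its own JSON document."""
  parsed, failed = [], 0
  for line in text.splitlines():
    try:
      parsed.append(json.loads(line))
    except ValueError:
      # The line is only a fragment of a JSON document.
      failed += 1
  return parsed, failed

compacted_parsed, compacted_failed = parse_lines(compacted)
multiline_parsed, multiline_failed = parse_lines(multiline)

# Compacted: the single line parses, but as one element holding every record.
print(len(compacted_parsed), compacted_failed)
# Multiline: no individual line is a valid JSON document.
print(len(multiline_parsed), multiline_failed)
```

In the compacted case the whole array arrives as one element, and in the multiline case no line can be parsed at all, which is why a custom source is useful here.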

For a related, more general approach to building IO connectors, Splittable DoFn, see [Powerful and modular IO connectors with Splittable DoFn in Apache Beam](https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html).

# Setup
We use the `ijson` module to read the JSON file as a generator, rather than loading the entire file into memory. This way, we only hold a single element in memory at a time.

In [0]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

# Install apache-beam and ijson.
run('pip install --quiet apache-beam ijson')

# Create the required directories.
run('mkdir -p data')

>> pip install --quiet apache-beam ijson

>> mkdir -p data



# Example: JSON Source

To start, let's generate some test data and write it to two files: a compacted JSON file and a multiline JSON file. For simplicity, we'll use standard Python modules to create them.

In [0]:
import json

def create_test_data(start, stop):
  for i in range(start, stop):
    yield {
      'value': i,
      'value_squared': i*i,
    }

for obj in create_test_data(1, 5):
  print(obj)

{'value_squared': 1, 'value': 1}
{'value_squared': 4, 'value': 2}
{'value_squared': 9, 'value': 3}
{'value_squared': 16, 'value': 4}


In [0]:
# Create the compacted JSON file.
with open('data/compacted.json', 'w') as out_file:
  json.dump(
      [obj for obj in create_test_data(5, 10)],
      out_file,
      separators=(',', ':'),
  )

# Check the generated file.
!ls -lh data/compacted.json
!cat data/compacted.json

-rw-r--r-- 1 root root 156 Jan 30 01:40 data/compacted.json
[{"value_squared":25,"value":5},{"value_squared":36,"value":6},{"value_squared":49,"value":7},{"value_squared":64,"value":8},{"value_squared":81,"value":9}]

In [0]:
# Create the multiline JSON file.
with open('data/multiline.json', 'w') as out_file:
  json.dump(
      [obj for obj in create_test_data(10, 15)],
      out_file,
      separators=(',', ': '),
      indent=2,
  )

# Check the generated file.
!ls -lh data/multiline.json
!cat data/multiline.json

-rw-r--r-- 1 root root 257 Jan 30 01:40 data/multiline.json
[
  {
    "value_squared": 100,
    "value": 10
  },
  {
    "value_squared": 121,
    "value": 11
  },
  {
    "value_squared": 144,
    "value": 12
  },
  {
    "value_squared": 169,
    "value": 13
  },
  {
    "value_squared": 196,
    "value": 14
  }
]

We'll start by testing how to use the `ijson` library by itself to create a generator of JSON objects.

In [0]:
import ijson

# Print every object in each file.
# ijson works best with files opened in binary mode.
print('Compacted')
with open('data/compacted.json', 'rb') as f:
  for obj in ijson.items(f, 'item'):
    print(obj)
print('')

print('Multiline')
with open('data/multiline.json', 'rb') as f:
  for obj in ijson.items(f, 'item'):
    print(obj)

Compacted
{u'value_squared': 25, u'value': 5}
{u'value_squared': 36, u'value': 6}
{u'value_squared': 49, u'value': 7}
{u'value_squared': 64, u'value': 8}
{u'value_squared': 81, u'value': 9}

Multiline
{u'value_squared': 100, u'value': 10}
{u'value_squared': 121, u'value': 11}
{u'value_squared': 144, u'value': 12}
{u'value_squared': 169, u'value': 13}
{u'value_squared': 196, u'value': 14}
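
For intuition about what a streaming parser does, here is a rough sketch built only on the standard `json` module. It is not how `ijson` is implemented. The function `iter_json_array` and its `chunk_size` parameter are hypothetical, and the sketch assumes the file holds a top-level array of JSON objects. It reads small chunks of text and yields each object as soon as it is complete, so the layout of the file (one line or many) makes no difference.

```python
import io
import json

def iter_json_array(file_object, chunk_size=32):
  """Yields each object of a top-level JSON array of objects, one at a time."""
  decoder = json.JSONDecoder()
  buffer = ''
  inside_array = False
  while True:
    chunk = file_object.read(chunk_size)
    buffer += chunk
    while True:
      # Skip whitespace and the commas that separate array items.
      buffer = buffer.lstrip(' \t\r\n,')
      if not inside_array:
        if not buffer.startswith('['):
          break  # The opening bracket has not been read yet.
        buffer = buffer[1:]
        inside_array = True
        continue
      if buffer.startswith(']'):
        return  # End of the array.
      try:
        obj, end = decoder.raw_decode(buffer)
      except ValueError:
        break  # The object is incomplete: read another chunk.
      yield obj
      buffer = buffer[end:]
    if not chunk:
      return  # End of the file.

records = [{'value': i, 'value_squared': i * i} for i in range(10, 13)]
multiline_text = json.dumps(records, indent=2)

streamed = []
for obj in iter_json_array(io.StringIO(multiline_text)):
  print(obj)
  streamed.append(obj)
```

A real streaming parser such as `ijson` also handles scalars, nested prefixes, and malformed input, which this sketch leaves out.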


We can now wrap `ijson.items` in a generator function and use it inside a custom source.

In [0]:
import apache_beam as beam
import logging
import ijson

# Create a generator of objects from a given JSON file.
# The 'item' prefix selects each element of the top-level JSON array.
def read_json_objects(file_object):
  for item in ijson.items(file_object, 'item'):
    yield item

class JsonSource(beam.io.filebasedsource.FileBasedSource):
  def __init__(self, file_pattern):
    # A JSON array can't be parsed starting from an arbitrary byte offset,
    # so each file is read as a whole (splittable=False).
    super(JsonSource, self).__init__(file_pattern, splittable=False)

  def read_records(self, filename, range_tracker):
    # The source is unsplittable, so the range covers the entire file.
    # Claim its start position once, then stream every object in the file.
    if not range_tracker.try_claim(range_tracker.start_position()):
      return
    with self.open_file(filename) as f:
      for obj in read_json_objects(f):
        yield obj

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Read JSON objects' >> beam.io.Read(JsonSource('data/*'))
      | 'Inspect elements' >> beam.Map(logging.warning)
  )
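
The source above asks its range tracker, through `try_claim`, whether a position may be read. As a rough mental model, here is a simplified sketch of that idea in plain Python. `SimpleOffsetRangeTracker` and `read_in_range` are hypothetical names and are not Beam's implementation, which also handles locking, validation, and dynamic splitting.

```python
class SimpleOffsetRangeTracker(object):
  """Hypothetical, simplified stand-in for an offset-based range tracker."""

  def __init__(self, start, stop):
    self.start = start
    self.stop = stop
    self.last_claimed = None

  def try_claim(self, position):
    # A position at or past the end of the range belongs to another reader.
    if position >= self.stop:
      return False
    self.last_claimed = position
    return True

def read_in_range(records, tracker):
  """Yields records whose starting offset the tracker lets us claim."""
  for offset, record in records:
    if not tracker.try_claim(offset):
      return
    yield record

# (offset, record) pairs, as if each record started at that byte offset.
records = [(0, 'a'), (10, 'b'), (20, 'c'), (30, 'd')]

# A reader responsible only for offsets [0, 20).
first_half = list(read_in_range(records, SimpleOffsetRangeTracker(0, 20)))
# A reader whose range has no upper bound, like an unsplittable file.
whole_file = list(
    read_in_range(records, SimpleOffsetRangeTracker(0, float('inf'))))

print(first_half)
print(whole_file)
```

With an unbounded range every claim succeeds, which matches the unsplittable case: one reader ends up processing the whole file.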

