
# Beam Katas
## CombineFn

Combine is a Beam transform for combining collections of elements or values in your data. When you apply a Combine transform, you must provide the function that contains the logic for combining the elements or values. The combining function should be commutative and associative, as the function is not necessarily invoked exactly once on all values with a given key. Because the input data (including the value collection) may be distributed across multiple workers, the combining function might be called multiple times to perform partial combining on subsets of the value collection.
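The associativity requirement can be illustrated without Beam at all. The sketch below is plain Python and assumes a hypothetical split of the input into two shards, standing in for two workers. It shows that averaging the per-shard averages depends on how the data was split, while merging (sum, count) pairs does not.

```python
# Two hypothetical "workers", each holding one shard of the same input.
shards = [[10, 20, 50], [70, 90]]
all_values = [value for shard in shards for value in shard]


def mean(values):
    return sum(values) / len(values)


# Naive approach: average each shard, then average the averages.
# Taking a mean of means is not associative, so the result depends on the split.
naive = mean([mean(shard) for shard in shards])

# Mergeable approach: each shard produces a (sum, count) pair. Pairs are merged
# by element-wise addition, which is both commutative and associative.
partials = [(sum(shard), len(shard)) for shard in shards]
total = sum(partial_sum for partial_sum, _ in partials)
count = sum(partial_count for _, partial_count in partials)
merged = total / count

true_mean = mean(all_values)
print(naive, merged, true_mean)
```

Only the merged result matches the mean of the full input, which is why an averaging combiner carries a (sum, count) pair rather than a running average.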

Complex combination operations might require you to create a subclass of CombineFn that has an accumulation type distinct from the input/output type. You should use CombineFn if the combine function requires a more sophisticated accumulator, must perform additional pre- or post-processing, might change the output type, or takes the key into account.

Kata: Implement a CombineFn that computes the average of a collection of numbers.


 Hint 1 
Extend the CombineFn class to compute the average of the numbers.

 Hint 2 
Refer to the Beam Programming Guide "Advanced combinations using CombineFn" section for more information.

In [0]:
!pip install apache-beam -qqq

import apache_beam as beam
from apache_beam.runners.interactive import interactive_runner

## Python Collection

In [2]:
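class AverageFn(beam.CombineFn):
    """Computes the mean of the input using a (total, count) accumulator."""

    def create_accumulator(self):
        # Empty accumulator: running total and number of elements seen.
        return 0.0, 0

    def add_input(self, accumulator, element):
        # Fold one input element into an accumulator.
        total, count = accumulator
        return total + element, count + 1

    def merge_accumulators(self, accumulators):
        # Combine partial accumulators, e.g. from different workers.
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        # Convert the final accumulator into the output value.
        total, count = accumulator
        return total / count if count else float('NaN')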

[10, 20, 50, 70, 90] | beam.CombineGlobally(AverageFn())

[48.0]
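To see how the four methods fit together, it can help to call them by hand. The following is a minimal sketch, not how any particular runner is implemented: `AverageAccumulatorSketch` is a plain-Python stand-in with the same four methods as the AverageFn above (it deliberately does not subclass beam.CombineFn, so it runs without Beam), and `combine_in_bundles` is a hypothetical helper that mimics partial combining over bundles of input.

```python
class AverageAccumulatorSketch:
    """Plain-Python stand-in exposing the same four methods as AverageFn."""

    def create_accumulator(self):
        return 0.0, 0

    def add_input(self, accumulator, element):
        total, count = accumulator
        return total + element, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('NaN')


def combine_in_bundles(fn, bundles):
    """Hypothetical driver: one accumulator per bundle, then merge and extract."""
    partials = []
    for bundle in bundles:
        accumulator = fn.create_accumulator()
        for element in bundle:
            accumulator = fn.add_input(accumulator, element)
        partials.append(accumulator)
    return fn.extract_output(fn.merge_accumulators(partials))


fn = AverageAccumulatorSketch()
one_bundle = combine_in_bundles(fn, [[10, 20, 50, 70, 90]])
two_bundles = combine_in_bundles(fn, [[10, 20], [50, 70, 90]])
empty = combine_in_bundles(fn, [[]])
print(one_bundle, two_bundles, empty)
```

Both ways of bundling the input give the same average, and an empty input falls through to the NaN branch of extract_output.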

## Beam PCollection

In [3]:
p = beam.Pipeline(interactive_runner.InteractiveRunner())

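class AverageFn(beam.CombineFn):
    """Computes the mean of the input using a (total, count) accumulator."""

    def create_accumulator(self):
        # Empty accumulator: running total and number of elements seen.
        return 0.0, 0

    def add_input(self, accumulator, element):
        # Fold one input element into an accumulator.
        total, count = accumulator
        return total + element, count + 1

    def merge_accumulators(self, accumulators):
        # Combine partial accumulators, e.g. from different workers.
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        # Convert the final accumulator into the output value.
        total, count = accumulator
        return total / count if count else float('NaN')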

(p | beam.Create([10, 20, 50, 70, 90])
   | beam.CombineGlobally(AverageFn()))

p.run()

Running...

Using 0 cached PCollections
Executing 2 of 2 transforms.

Create produced {90, 70, 20, 50, 10}

CombineGlobally(AverageFn) produced {48.0}

<apache_beam.runners.interactive.interactive_runner.PipelineResult at 0x7f3939e713d0>