In [None]:
#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Beam DataFrames

<button>
  <a href="https://beam.apache.org/documentation/dsls/dataframes/overview/">
    <img src="https://beam.apache.org/images/favicon.ico" alt="Open the docs" height="16"/>
    Beam DataFrames overview
  </a>
</button>

Beam DataFrames provide a pandas-like [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
API to declare Beam pipelines.

> ℹ️ To learn more about Beam DataFrames, take a look at the
[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.

First, we need to install Apache Beam with the `interactive` extra for the Interactive runner. We also need a version of `pandas` supported by the DataFrame API, which the `dataframe` extra provides in Beam 2.34.0 and newer.

In [None]:
%pip install --quiet apache-beam[interactive,dataframe]

Let's create a small data file of
[Comma-Separated Values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values).
It simply includes the dates of the
[equinoxes](https://en.wikipedia.org/wiki/Equinox) and
[solstices](https://en.wikipedia.org/wiki/Solstice)
of the year 2021.

In [None]:
%%writefile solar_events.csv
timestamp,event
2021-03-20 09:37:00,March Equinox
2021-06-21 03:32:00,June Solstice
2021-09-22 19:21:00,September Equinox
2021-12-21 15:59:00,December Solstice
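
The `%%writefile` cell magic only works inside IPython notebooks. As a minimal sketch for running these examples as a plain Python script, the same file could be written with the standard library. The variable names below are illustrative and not part of Beam.

```python
import csv
from pathlib import Path

# Same contents as the `%%writefile` cell above.
csv_text = """\
timestamp,event
2021-03-20 09:37:00,March Equinox
2021-06-21 03:32:00,June Solstice
2021-09-22 19:21:00,September Equinox
2021-12-21 15:59:00,December Solstice
"""

path = Path('solar_events.csv')
path.write_text(csv_text)

# Read the file back to check that the header and the four rows are intact.
with path.open(newline='') as f:
    rows = list(csv.DictReader(f))

print(len(rows))
print(rows[0]['event'])
```

Both columns are read back as plain strings, which matches how the later cells treat the `timestamp` values.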

# Interactive Beam

Pandas has the
[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
function to easily read CSV files into DataFrames.
Beam has the
[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)
function that emulates `pandas.read_csv`, but returns a deferred Beam DataFrame.

If you’re using
[Interactive Beam](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.interactive.interactive_beam.html),
you can use `collect` to bring a Beam DataFrame into local memory as a Pandas DataFrame.

In [3]:
import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

pipeline = beam.Pipeline(InteractiveRunner())

# Create a deferred Beam DataFrame with the contents of our CSV file.
beam_df = pipeline | 'Read CSV' >> beam.dataframe.io.read_csv('solar_events.csv')

# We can use `ib.collect` to view the contents of a Beam DataFrame.
ib.collect(beam_df)



Unnamed: 0,timestamp,event
solar_events.csv:0,2021-03-20 09:37:00,March Equinox
solar_events.csv:1,2021-06-21 03:32:00,June Solstice
solar_events.csv:2,2021-09-22 19:21:00,September Equinox
solar_events.csv:3,2021-12-21 15:59:00,December Solstice


Collecting a Beam DataFrame into a Pandas DataFrame is useful for performing
[operations not supported by Beam DataFrames](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas#classes-of-unsupported-operations).

For example, let's say we want to take only the first two events in chronological order.
Since a deferred Beam DataFrame does not have any ordering guarantees,
first we need to sort the values.
In Pandas, we could first
[`df.sort_values(by='timestamp')`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) and then
[`df.head(2)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) to achieve this.

However, these are
[order-sensitive operations](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas#order-sensitive-operations)
so using them in a Beam DataFrame raises a
[`WontImplementError`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
We can work around this by using `collect` to convert the Beam DataFrame into a Pandas DataFrame.

In [4]:
import apache_beam.runners.interactive.interactive_beam as ib

# Collect the Beam DataFrame into a Pandas DataFrame.
df = ib.collect(beam_df)

# We can now use any Pandas transforms with our data.
df.sort_values(by='timestamp').head(2)

Unnamed: 0,timestamp,event
solar_events.csv:0,2021-03-20 09:37:00,March Equinox
solar_events.csv:1,2021-06-21 03:32:00,June Solstice


> ℹ️ Note that `collect` is _only_ accessible if you’re using
[Interactive Beam](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.interactive.interactive_beam.html).

# Beam DataFrames to PCollections

If you have your data as a Beam DataFrame, you can convert it into a regular PCollection with
[`to_pcollection`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection).

Converting a Beam DataFrame in this way yields a PCollection with a [schema](https://beam.apache.org/documentation/programming-guide/#what-is-a-schema).
This allows us to easily access each property by attribute, for example `element.event` and `element.timestamp`.

Each element is a named tuple. Sometimes it's more convenient to convert these named tuples to Python dictionaries.
We can do that with the
[`_asdict`](https://docs.python.org/3/library/collections.html#collections.somenamedtuple._asdict)
method.
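
To make the named-tuple behaviour concrete without running a pipeline, here is a small sketch that uses only the standard library. `SolarEvent` is a hypothetical stand-in for the elements that `to_pcollection` produces; in a real pipeline, Beam derives the element type from the DataFrame's columns.

```python
from typing import NamedTuple


# Hypothetical stand-in for one element of the schema'd PCollection.
class SolarEvent(NamedTuple):
    timestamp: str
    event: str


element = SolarEvent(timestamp='2021-03-20 09:37:00', event='March Equinox')

# Fields are available as attributes, as with schema'd elements.
print(element.event)

# `_asdict` returns a mapping of field names to values; wrapping it in
# `dict` gives a plain dictionary regardless of the Python version.
as_dict = dict(element._asdict())
print(as_dict)
```

The same `dict(x._asdict())` expression appears in the pipeline cells that follow.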

In [5]:
import apache_beam as beam
from apache_beam.dataframe import convert

with beam.Pipeline() as pipeline:
  beam_df = pipeline | 'Read CSV' >> beam.dataframe.io.read_csv('solar_events.csv')

  (
      # Convert the Beam DataFrame to a PCollection.
      convert.to_pcollection(beam_df)

      # We get named tuples, which we can convert to dictionaries like this.
      | 'To dictionaries' >> beam.Map(lambda x: dict(x._asdict()))

      # Print the elements in the PCollection.
      | 'Print' >> beam.Map(print)
  )



{'timestamp': '2021-03-20 09:37:00', 'event': 'March Equinox'}
{'timestamp': '2021-06-21 03:32:00', 'event': 'June Solstice'}
{'timestamp': '2021-09-22 19:21:00', 'event': 'September Equinox'}
{'timestamp': '2021-12-21 15:59:00', 'event': 'December Solstice'}


# Pandas DataFrames to PCollections

If you have your data as a Pandas DataFrame, you can convert it into a regular PCollection with
[`to_pcollection`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection).

Since Pandas DataFrames are not part of any Beam pipeline, we must provide the `pipeline` explicitly.

In [6]:
import pandas as pd
import apache_beam as beam
from apache_beam.dataframe import convert

with beam.Pipeline() as pipeline:
  df = pd.read_csv('solar_events.csv')

  (
      # Convert the Pandas DataFrame to a PCollection.
      convert.to_pcollection(df, pipeline=pipeline)

      # We get named tuples, which we can convert to dictionaries like this.
      | 'To dictionaries' >> beam.Map(lambda x: dict(x._asdict()))

      # Print the elements in the PCollection.
      | 'Print' >> beam.Map(print)
  )



{'timestamp': '2021-03-20 09:37:00', 'event': 'March Equinox'}
{'timestamp': '2021-06-21 03:32:00', 'event': 'June Solstice'}
{'timestamp': '2021-09-22 19:21:00', 'event': 'September Equinox'}
{'timestamp': '2021-12-21 15:59:00', 'event': 'December Solstice'}


If you have your data as a PCollection of Pandas DataFrames, you can flatten them into a PCollection of individual rows with
[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap).

> ℹ️ If the DataFrames can vary widely in size (for example, some contain thousands of rows while others contain only a handful), consider adding a
> [`Reshuffle`](https://beam.apache.org/documentation/transforms/python/other/reshuffle).
> This redistributes the elements in the PCollection so that work is spread more evenly across workers.

In [7]:
import pandas as pd
import apache_beam as beam

with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Filename' >> beam.Create(['solar_events.csv'])

      # Read each file into a Pandas DataFrame, so we can do any Pandas operation on it.
      | 'Read CSV' >> beam.Map(pd.read_csv)

      # Yield each row of every DataFrame as a dictionary.
      | 'To dictionaries' >> beam.FlatMap(lambda df: df.to_dict('records'))

      # Reshuffle to make sure parallelization is balanced.
      | 'Reshuffle' >> beam.Reshuffle()

      # Print the elements in the PCollection.
      | 'Print' >> beam.Map(print)
  )



{'timestamp': '2021-03-20 09:37:00', 'event': 'March Equinox'}
{'timestamp': '2021-06-21 03:32:00', 'event': 'June Solstice'}
{'timestamp': '2021-09-22 19:21:00', 'event': 'September Equinox'}
{'timestamp': '2021-12-21 15:59:00', 'event': 'December Solstice'}


# PCollections to Beam DataFrames

If you have your data as a PCollection, you can convert it into a deferred Beam DataFrame with
[`to_dataframe`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe).

> ℹ️ To convert a PCollection to a Beam DataFrame, each element _must_ have a
[schema](https://beam.apache.org/documentation/programming-guide/#what-is-a-schema).

In [8]:
import csv
import apache_beam as beam
from apache_beam.dataframe import convert

with open('solar_events.csv') as f:
  solar_events = [dict(row) for row in csv.DictReader(f)]

with beam.Pipeline() as pipeline:
  pcoll = pipeline | 'Create data' >> beam.Create(solar_events)

  # Convert the PCollection into a Beam DataFrame
  beam_df = convert.to_dataframe(pcoll | 'To Rows' >> beam.Map(
      lambda x: beam.Row(
          timestamp=x['timestamp'],
          event=x['event'],
      )
  ))

  # Print the elements in the Beam DataFrame.
  (
      convert.to_pcollection(beam_df)
      | 'To dictionaries' >> beam.Map(lambda x: dict(x._asdict()))
      | 'Print' >> beam.Map(print)
  )



{'timestamp': '2021-03-20 09:37:00', 'event': 'March Equinox'}
{'timestamp': '2021-06-21 03:32:00', 'event': 'June Solstice'}
{'timestamp': '2021-09-22 19:21:00', 'event': 'September Equinox'}
{'timestamp': '2021-12-21 15:59:00', 'event': 'December Solstice'}


# PCollections to Pandas DataFrames

If you have your data as a PCollection, you can convert it into an in-memory Pandas DataFrame via a
[side input](https://beam.apache.org/documentation/programming-guide#side-inputs).

> ℹ️ It's recommended to **only** do this if you need to use a Pandas operation that is
> [not supported in Beam DataFrames](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/#classes-of-unsupported-operations).
> Converting a PCollection into a Pandas DataFrame consolidates elements from potentially multiple workers into a single worker, which could create a performance bottleneck.

> ⚠️ Pandas DataFrames are in-memory data structures, so make sure all the elements in the PCollection fit into memory.
> If they don't fit into memory, consider yielding multiple DataFrame elements via
> [`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap).
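
One way to follow that advice is to split the elements into fixed-size batches and build one DataFrame per batch. The helper below is a minimal, standard-library-only sketch of the batching step; `in_batches` and `batch_size` are hypothetical names, not Beam APIs. Inside a pipeline, a function like this could be called from `beam.FlatMap`, wrapping each yielded batch in `pd.DataFrame`.

```python
from itertools import islice


def in_batches(elements, batch_size):
    """Yield lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    iterator = iter(elements)
    while True:
        # Pull the next `batch_size` items; an empty list means we are done.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


solar_events = [
    {'timestamp': '2021-03-20 09:37:00', 'event': 'March Equinox'},
    {'timestamp': '2021-06-21 03:32:00', 'event': 'June Solstice'},
    {'timestamp': '2021-09-22 19:21:00', 'event': 'September Equinox'},
    {'timestamp': '2021-12-21 15:59:00', 'event': 'December Solstice'},
]

# Four rows with a batch size of three give one batch of 3 and one of 1.
batches = list(in_batches(solar_events, batch_size=3))
print([len(batch) for batch in batches])
```

Each batch is a list of dictionaries, so `pd.DataFrame(batch)` would produce a DataFrame holding at most `batch_size` rows. The right batch size depends on how much memory each worker has, which this sketch does not try to estimate.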

In [9]:
import csv
import pandas as pd
import apache_beam as beam

with open('solar_events.csv') as f:
  solar_events = [dict(row) for row in csv.DictReader(f)]

with beam.Pipeline() as pipeline:
  pcoll = pipeline | 'Create data' >> beam.Create(solar_events)

  (
      pipeline

      # Create a single element containing the entire PCollection. 
      | 'Singleton' >> beam.Create([None])
      | 'As Pandas' >> beam.Map(
          lambda _, dict_iter: pd.DataFrame(dict_iter),
          dict_iter=beam.pvalue.AsIter(pcoll),
      )

      # Print the Pandas DataFrame.
      | 'Print' >> beam.Map(print)
  )



             timestamp              event
0  2021-03-20 09:37:00      March Equinox
1  2021-06-21 03:32:00      June Solstice
2  2021-09-22 19:21:00  September Equinox
3  2021-12-21 15:59:00  December Solstice


# What's next?

* [Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) -- an overview of the Beam DataFrames API.
* [Differences from pandas](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas) -- goes through some of the differences between Beam DataFrames and Pandas DataFrames, as well as some of the workarounds for unsupported operations.
* [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) -- a quickstart guide to Pandas DataFrames.
* [Pandas DataFrame API](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) -- the API reference for Pandas DataFrames.