##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# Example 1: Word Count

This example demonstrates how to set up an Apache Beam pipeline that reads a text file
from [Google Cloud Storage](https://cloud.google.com/storage) containing Shakespeare's *King Lear*,
splits each line into individual words, and counts how often each word occurs.

An Apache Beam pipeline reads input data, transforms that data, and writes
output data. It consists of `PTransform`s and `PCollection`s.
A `PCollection` represents a distributed data set that your Beam pipeline operates on.
A `PTransform` represents a data processing operation, or step, in your pipeline.
It takes one or more `PCollection`s as input, applies a processing function
that you provide to their elements, and produces zero or more output
`PCollection`s.

For details about Apache Beam pipelines, including `PTransform`s and
`PCollection`s, see the [Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/).

You'll be able to use this notebook to explore the data in each `PCollection`.

Start with the necessary imports:

In [None]:
# Python's regular expression library
import re

# Beam and interactive Beam imports
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

The following defines a `PTransform` named `ReadWordsFromText` that reads lines from a text file and extracts the words from each line.

In [None]:
class ReadWordsFromText(beam.PTransform):
    
    def __init__(self, file_pattern):
        self._file_pattern = file_pattern
    
    def expand(self, pcoll):
        return (pcoll.pipeline
                | beam.io.ReadFromText(self._file_pattern)
                | beam.FlatMap(lambda line: re.findall(r'[\w\']+', line.strip(), re.UNICODE)))
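
To see what the regular expression in `ReadWordsFromText` extracts, you can try it on a single line outside of Beam. This is a minimal sketch using only Python's `re` module; the sample line is made up for illustration and is not read from the *King Lear* file.

```python
import re

# Hypothetical sample line, used only to illustrate the pattern.
line = "  KING LEAR: Attend the lords of France and Burgundy, Gloucester's son.  "

# Same pattern as in ReadWordsFromText: runs of word characters or apostrophes.
# Punctuation such as ':' and ',' separates words and is dropped.
words = re.findall(r'[\w\']+', line.strip(), re.UNICODE)
print(words)
```

Because the apostrophe is part of the character class, a possessive such as `Gloucester's` stays a single word. In the pipeline, `beam.FlatMap` flattens the list returned for each line into individual word elements.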

A runner is an execution engine for Apache Beam pipelines.
The following sets up an Apache Beam pipeline with the *Interactive Runner*,
which is the runner suited to running in notebooks.

In [None]:
p = beam.Pipeline(InteractiveRunner())

The following sets up a `PTransform` that extracts words from a Google Cloud Storage file that contains the text of Shakespeare's work *King Lear*.

`|` is an overloaded operator that applies a `PTransform` to a `PCollection` to produce a new `PCollection`.
Together with `|`, `>>` allows you to optionally name a `PTransform`.

Usage: `<PCollection> | <PTransform>` or `<PCollection> | <name> >> <PTransform>`
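
In Python, `>>` binds more tightly than `|`, so `<PCollection> | <name> >> <PTransform>` is evaluated as `<PCollection> | (<name> >> <PTransform>)`. The following toy sketch shows how this kind of operator overloading can work in plain Python. `ToyTransform` and `ToyCollection` are hypothetical classes written for this illustration; they are not Beam's classes and do not reflect how Beam is implemented.

```python
class ToyTransform:
    """Hypothetical stand-in for a transform; not Beam's PTransform."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label

    def __rrshift__(self, label):
        # Called for `'some name' >> transform`, because str does not
        # implement `>>` for a ToyTransform operand.
        return ToyTransform(self.fn, label)


class ToyCollection:
    """Hypothetical stand-in for a collection; not Beam's PCollection."""

    def __init__(self, items):
        self.items = list(items)
        self.applied = []  # labels of the transforms applied so far

    def __or__(self, transform):
        # Called for `collection | transform`.
        result = ToyCollection(transform.fn(self.items))
        result.applied = self.applied + [transform.label]
        return result


upper = ToyTransform(lambda items: [s.upper() for s in items])

# Parsed as: ToyCollection([...]) | ("to_upper" >> upper)
out = ToyCollection(["king", "lear"]) | "to_upper" >> upper
print(out.items, out.applied)
```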


In [None]:
words = p | 'read' >> ReadWordsFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')

The following sets up a `PTransform` to count the words. `counts` is a `PCollection` that will contain the count data.


In [None]:
counts = (words 
          | 'count' >> beam.combiners.Count.PerElement())
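
`Count.PerElement()` produces one `(element, count)` pair for each distinct element. As a rough single-machine analogy, not a replacement for the Beam transform, the same result can be sketched with `collections.Counter`; the word list here is made up for illustration.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the `words` PCollection.
sample_words = ["King", "Lear", "king", "King"]

# One (word, count) pair per distinct word, sorted for stable display.
sample_counts = sorted(Counter(sample_words).items())
print(sample_counts)
```

Note that `King` and `king` are counted separately here, because the comparison is case-sensitive.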

The following implicitly runs the pipeline and shows the elements in the `PCollection` `counts`.

In [None]:
ib.show(counts)

The following sets up `PTransform`s that will convert the words to lowercase and then count them.

In [None]:
lower_counts = (words
                | "lower" >> beam.Map(lambda word: word.lower())
                | "lower_count" >> beam.combiners.Count.PerElement())

The following shows the counts for the same words as before, but converted to lowercase.
Because all words are converted to lowercase before being counted, some words
will have a higher count than before
(for example, `KING: 2, King: 4, king: 3` becomes `king: 9`).

Note the parameter `visualize_data=True`. This optional parameter gives you a visualization of the data (see [FAQ #3. How do I read the visualization](../../faq.md#q3)).

In [None]:
ib.show(lower_counts, visualize_data=True)
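
The merging of case variants described above can be checked with plain Python. This sketch reuses the illustrative counts from the text (`KING: 2, King: 4, king: 3`) and folds them together by lowercase key. It starts from finished counts for brevity, whereas the pipeline lowercases each word before counting; the totals are the same either way.

```python
from collections import Counter

# Illustrative case-sensitive counts, taken from the example in the text.
case_sensitive = Counter({"KING": 2, "King": 4, "king": 3})

# Add each count to the entry for the lowercased word.
lowered = Counter()
for word, n in case_sensitive.items():
    lowered[word.lower()] += n

print(dict(lowered))
```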

The following gives you a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that represents the `PCollection` `lower_counts`.

In [None]:
ib.collect(lower_counts)

To see the job graph for the pipeline, run:

In [None]:
ib.show_graph(p)

This example is designed to run easily on a single machine. If you have many such files, add an output sink to the resulting `PCollection`:
```
lower_counts | beam.io.WriteToText(<file>)
```
If you have millions of such files with billions of words, you need a cluster of machines with enough processing power and memory to process them in a reasonable amount of time.
[Google Cloud Dataflow](https://cloud.google.com/dataflow) manages such a cluster for you: it parallelizes and reliably runs your Apache Beam pipelines, and it autoscales so that you pay only for the resources your pipelines need.

Refer to [this walkthrough](Dataflow_Word_Count.ipynb) to learn how to run a Dataflow job using the example code in this notebook.

If you have any feedback on this notebook, drop us a line at beam-notebooks-feedback@google.com.