## **PCollection Introduction in Apache Beam:**

### **PCollection:**

#### A PCollection is a data set or data stream. The data that a pipeline processes is part of a PCollection.

It is an abstraction that represents a potentially distributed, multi-element data set that our Beam pipeline operates on.

o	**Immutability:** PCollections are immutable. Applying a transform to a PCollection does not modify it; it creates a new PCollection.

o	**Element type:** The elements in a PCollection may be of any type, but all must be of the same type.

o	**Element schema:** In many cases, the element type in a **PCollection** has a structure that can be introspected. Examples are JSON, Protocol Buffer, Avro, and database records. Schemas provide a way to express types as a set of named fields, allowing for more expressive aggregations.

o	**Operation type:** A PCollection does not support fine-grained operations. A transform is applied to the whole collection; it cannot be applied to specific, individually selected elements.

o	**Timestamps:** Each element in a PCollection has an associated timestamp.

o	**Creating a PCollection:** Create a PCollection either by reading data from an external source using Beam’s Source API, or by building it from data stored in an in-memory collection class in your driver program.

o	**Unbounded PCollections:** An unbounded PCollection represents a data set of unlimited size. Example: streaming data from Pub/Sub. The source assigns each element's timestamp, often corresponding to when the element was read or added.

o	**Bounded PCollections:** A bounded PCollection represents a data set of a known, fixed size. Example: batch data read from a file. A bounded source typically assigns the same timestamp to every element.

o	**No random access:** Elements cannot be looked up by index, and a specific element cannot be fetched directly. There is also no upper limit on how many elements a PCollection can contain.

o	**PTransform:** A PTransform represents a data processing operation, or a step, in our pipeline. Examples: Map, GroupBy, FlatMap, ParDo, Filter, Flatten, Combine.

•	**PCollection characteristics:**
o	A PCollection is owned by the specific Pipeline object for which it is created; multiple pipelines cannot share a PCollection.

### •	***Resources:***

o	https://beam.apache.org/documentation/programming-guide/#pcollections

o	https://beam.apache.org/releases/pydoc/2.36.0/apache_beam.io.textio.html?highlight=readfromtext#apache_beam.io.textio.ReadFromText


In [None]:
#installation of apache beam   https://github.com/vigneshSs-07/Cloud-AI-Analytics/tree/main/Apache%20Beam%20-Python
!pip3 install apache_beam

In [None]:
#importing library
import apache_beam as beam

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd drive/My\ Drive/Colab\ Notebooks/Cloud-AI-Analytics/Apache\ Beam\ -Python/data

In [None]:
!ls

In [None]:
!cat cloud_export_100.txt

In [None]:
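# Read from an external source (a text file) into a PCollection
p1 = beam.Pipeline()

cloud_cdr = (p1
           # skip_header_lines=1 drops the header row
           | "Read from Text" >> beam.io.ReadFromText("cloud_export_100.txt", skip_header_lines=1)
           # each line becomes a list of fields
           | "split the record" >> beam.Map(lambda record: record.split(','))
           # keep only rows whose third field is 'Compute Engine'
           | 'Filter compute engine' >> beam.Filter(lambda record: record[2] == 'Compute Engine')
           | 'Write to text'>> beam.io.WriteToText('result/compute_engine_filter'))
# To print the records instead of writing them, replace the last step with:
# | beam.Map(print))

# Run the pipeline and block until it has finished
p1.run().wait_until_finish()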

In [None]:
!ls ./result

In [None]:
!cat result/compute_engine_filter-00000-of-00001

In [None]:
# Create a PCollection from an in-memory list
with beam.Pipeline() as pipeline:
  lines = (
      pipeline
      | beam.Create([
          'To be, or not to be: that is the question: ',
          "Whether 'tis nobler in the mind to suffer ",
          'The slings and arrows of outrageous fortune, ',
          'Or to take arms against a sea of troubles, ',
      ]))

In [None]:
lines