## **PCollection Introduction in Apache Beam:**

### **PCollection:**

#### A PCollection is a data set or data stream. The data that a pipeline processes is part of a PCollection.

It is an abstraction that represents a potentially distributed, multi-element data set: the data that our Beam pipeline operates on.

o	**Immutability:** PCollections are immutable. Applying a transform to a PCollection does not modify it; it creates a new PCollection.

o	**Element type:** The elements of a PCollection may be of any type, but all of them must be of the same type.

o **Element Schema:** The element type in a **PCollection** often has a structure that can be introspected. Examples are JSON, Protocol Buffer, Avro, and database records. Schemas provide a way to express types as a set of named fields, allowing for more expressive aggregations.

o	**Operation type:** A PCollection does not support fine-grained operations. A transform is applied to the PCollection as a whole; we cannot apply it to specific elements only.

o	**Timestamps:** Each element in a PCollection has an associated timestamp.

o **Creating a PCollection:** Create a PCollection either by reading data from an external source using Beam’s Source API, or from data stored in an in-memory collection class in your driver program.

o	**Unbounded PCollections:** An unbounded PCollection represents a data set of unlimited size. Example: streaming data from Pub/Sub. The source assigns the timestamps.

o	**Bounded PCollections:** A bounded PCollection represents a data set of a known, fixed size. Example: batch data. Every element is typically set to the same timestamp.

o	**No random access:** We cannot access an element by index or look up one specific element. A PCollection has no size restriction.

o	**PTransform:** A PTransform represents a data processing operation, or a step, in our pipeline. Examples: Map, GroupBy, FlatMap, ParDo, Filter, Flatten, Combine, etc.

•	**PCollection characteristics:**
o	A PCollection is owned by the specific Pipeline object for which it is created; multiple pipelines cannot share a PCollection.

### •	***Resources:***

o	https://beam.apache.org/documentation/programming-guide/#pcollections

o	https://beam.apache.org/releases/pydoc/2.36.0/apache_beam.io.textio.html?highlight=readfromtext#apache_beam.io.textio.ReadFromText


In [1]:
#installation of apache beam   https://github.com/vigneshSs-07/Cloud-AI-Analytics/tree/main/Apache%20Beam%20-Python
!pip3 install apache_beam

Collecting apache_beam
  Downloading apache_beam-2.60.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.6 kB)
Collecting crcmod<2.0,>=1.7 (from apache_beam)
  Downloading crcmod-1.7.tar.gz (89 kB)
  Preparing metadata (setup.py) ... done
Collecting orjson<4,>=3.9.7 (from apache_beam)
  Downloading orjson-3.10.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)
Collecting dill<0.3.2,>=0.3.1.1 (from apache_beam)
  Downloading dill-0.3.1.1.tar.gz (151 kB)
  Preparing metadata (setup.py) ... done
Collecting cloudpickle~=2.2.1 (from apache_beam)

In [2]:
#importing library
import apache_beam as beam

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%cd drive/My\ Drive/Colab\ Notebooks/Cloud-AI-Analytics/Apache\ Beam\ -Python/data

/content/drive/My Drive/Colab Notebooks/Cloud-AI-Analytics/Apache Beam -Python/data


In [6]:
!ls

dept_data.txt  regular_filter.txt-00000-of-00001  students_exclude.txt
grocery.txt    Students_age.txt			  students.txt


In [7]:
!cat grocery.txt

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Yea,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
FDA15,9.3,Low Fat,0.016047301,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
DRC01,5.92,Regular,0.019278216,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
FDN15,17.5,Low Fat,0.016760075,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
FDX07,19.2,Regular,0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
NCD19,8.93,Low Fat,0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
FDP36,10.395,Regular,0,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
FDO10,13.65,Regular,0.012741089,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
FDP10,,Low Fat,0.127469857,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
FDH17,16.2,Regular,0.016687114,F

In [8]:
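# Build a PCollection from an external source (a text file)
p1 = beam.Pipeline()

grocery = (p1
           # Read one element per line, skipping the CSV header
           | "Read from Text" >> beam.io.ReadFromText("grocery.txt", skip_header_lines=1)
           # Turn each line into a list of fields
           | "split the record" >> beam.Map(lambda record: record.split(','))
           # Item_Fat_Content is the third field (index 2)
           | 'Filter regular' >> beam.Filter(lambda record: record[2] == 'Regular')
           | 'Write to text'>> beam.io.WriteToText('regular_filter.txt'))  #| beam.Map(print))

# Nothing is executed until the pipeline is run
p1.run()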



<apache_beam.runners.portability.fn_api_runner.fn_runner.RunnerResult at 0x7e85f5d56aa0>

In [9]:
!ls

dept_data.txt  regular_filter.txt-00000-of-00001  students_exclude.txt
grocery.txt    Students_age.txt			  students.txt


In [10]:
!cat regular_filter.txt-00000-of-00001

['DRC01', '5.92', 'Regular', '0.019278216', 'Soft Drinks', '48.2692', 'OUT018', '2009', 'Medium', 'Tier 3', 'Supermarket Type2', '443.4228']
['FDX07', '19.2', 'Regular', '0', 'Fruits and Vegetables', '182.095', 'OUT010', '1998', '', 'Tier 3', 'Grocery Store', '732.38']
['FDP36', '10.395', 'Regular', '0', 'Baking Goods', '51.4008', 'OUT018', '2009', 'Medium', 'Tier 3', 'Supermarket Type2', '556.6088']
['FDO10', '13.65', 'Regular', '0.012741089', 'Snack Foods', '57.6588', 'OUT013', '1987', 'High', 'Tier 3', 'Supermarket Type1', '343.5528']
['FDH17', '16.2', 'Regular', '0.016687114', 'Frozen Foods', '96.9726', 'OUT045', '2002', '', 'Tier 2', 'Supermarket Type1', '1076.5986']
['FDU28', '19.2', 'Regular', '0.09444959', 'Frozen Foods', '187.8214', 'OUT017', '2007', '', 'Tier 2', 'Supermarket Type1', '4710.535']
['FDA03', '18.5', 'Regular', '0.045463773', 'Dairy', '144.1102', 'OUT046', '1997', 'Small', 'Tier 1', 'Supermarket Type1', '2187.153']
['FDX32', '15.1', 'Regular', '0.1000135', 'Fruit

In [11]:
# Build a PCollection from an in-memory collection
with beam.Pipeline() as pipeline:
  lines = (
      pipeline
      | beam.Create([
          'To be, or not to be: that is the question: ',
          "Whether 'tis nobler in the mind to suffer ",
          'The slings and arrows of outrageous fortune, ',
          'Or to take arms against a sea of troubles, ',
      ]))

In [12]:
lines

<PCollection[[11]: Create/Map(decode).None] at 0x7e85f47709a0>