<a href="https://colab.research.google.com/github/binhvd/Data-Management-2/blob/main/PETL/1-Basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**petl** is a general-purpose Python package for extracting, transforming, and loading tables of data.

![Architecture](https://petl.readthedocs.io/en/stable/_images/petl-architecture.png)


In [29]:
"""
Created on Wed Aug  7 11:29:12 2019
@author: ashish
"""

# petl is a framework for building ETL (extract, transform, load) jobs.
# Documentation: https://petl.readthedocs.io/en/stable/intro.html#

# Demo job that walks through an ETL workflow with petl.
# Install the package first (outside a notebook: pip install petl).
!pip install petl

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting petl
  Downloading petl-1.7.11.tar.gz (408 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing wheel metadata ... done
Building wheels for collected packages: petl
  Building wheel for petl (PEP 517) ... done
  Created wheel for petl: filename=petl-1.7.11-py3-none-any.whl size=226448 sha256=17be3b25b5df67f9664199e382f05513db4dbd190d0c4ddc28ae143d7919303d
  Stored in directory: /root/.cache/pip/wheels/bc/0f/ae/4f496e580063d9929bd46b9f4d97e8884ece77dc80cd0ccb79
Successfully built petl
Installing collected packages: petl
Successfully installed petl-1.7.11


In [30]:
# Prepare an example
example_data = """foo,bar,baz
a,1,3.4
b,2,7.4
c,6,2.2
d,9,8.1
"""

with open('example.csv', 'w') as f:  # Write the records to a CSV file
    f.write(example_data)

# Extract table from file

In [31]:
# Import the petl library to build an ETL pipeline
import petl as etl

# fromcsv() reads the records of the CSV file into a table container
table1 = etl.fromcsv('example.csv')
# Check the type of the returned object
type(table1)

petl.io.csv_py3.CSVView

In [32]:
# convert() transforms the values of a column using the given conversion.
# Signature: convert(table, *args, **kwargs)
# A string such as 'upper' names a method to call on each value.
table2 = etl.convert(table1, 'foo', 'upper')

# look() shows only the first 5 rows by default
etl.look(table2)

+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'A' | '1' | '3.4' |
+-----+-----+-------+
| 'B' | '2' | '7.4' |
+-----+-----+-------+
| 'C' | '6' | '2.2' |
+-----+-----+-------+
| 'D' | '9' | '8.1' |
+-----+-----+-------+

In [36]:
# Convert the 'bar' column from strings to integers
table3 = etl.convert(table2, 'bar', int)
etl.look(table3)

+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'A' |   1 | '3.4' |
+-----+-----+-------+
| 'B' |   2 | '7.4' |
+-----+-----+-------+
| 'C' |   6 | '2.2' |
+-----+-----+-------+
| 'D' |   9 | '8.1' |
+-----+-----+-------+

In [37]:
# Convert the 'baz' column from strings to floats
table4 = etl.convert(table3, 'baz', float)
etl.look(table4)

+-----+-----+-----+
| foo | bar | baz |
+=====+=====+=====+
| 'A' |   1 | 3.4 |
+-----+-----+-----+
| 'B' |   2 | 7.4 |
+-----+-----+-----+
| 'C' |   6 | 2.2 |
+-----+-----+-----+
| 'D' |   9 | 8.1 |
+-----+-----+-----+

In [38]:
# Add a new column computed from existing columns
table5 = etl.addfield(table4, 'quux', lambda row: row.bar * row.baz)
etl.look(table5)

+-----+-----+-----+--------------------+
| foo | bar | baz | quux               |
+=====+=====+=====+====================+
| 'A' |   1 | 3.4 |                3.4 |
+-----+-----+-----+--------------------+
| 'B' |   2 | 7.4 |               14.8 |
+-----+-----+-----+--------------------+
| 'C' |   6 | 2.2 | 13.200000000000001 |
+-----+-----+-----+--------------------+
| 'D' |   9 | 8.1 |  72.89999999999999 |
+-----+-----+-----+--------------------+
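The long decimals in `quux` (for example `13.200000000000001`) come from ordinary binary floating-point arithmetic rather than from petl. The following is a minimal sketch in plain Python, without petl, that reproduces the effect and shows two possible ways to tidy it: rounding, or exact arithmetic with the standard `decimal` module. Which one suits a real pipeline is a judgment call.

```python
from decimal import Decimal

# The same multiplication that produced the third row of quux.
product = 6 * 2.2
print(product)            # 13.200000000000001

# Option 1: round the float to a fixed number of decimal places.
print(round(product, 2))  # 13.2

# Option 2: parse the text as a Decimal and multiply exactly.
exact = Decimal('2.2') * 6
print(exact)              # 13.2
```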

The complete pipeline above can also be written in petl's object-oriented style, chaining the calls on the table object:

In [39]:
table = (
    etl
    .fromcsv('example.csv')
    .convert('foo', 'upper')
    .convert('bar', int)
    .convert('baz', float)
    .addfield('quux', lambda row: row.bar * row.baz)
)

table.look()

+-----+-----+-----+--------------------+
| foo | bar | baz | quux               |
+=====+=====+=====+====================+
| 'A' |   1 | 3.4 |                3.4 |
+-----+-----+-----+--------------------+
| 'B' |   2 | 7.4 |               14.8 |
+-----+-----+-----+--------------------+
| 'C' |   6 | 2.2 | 13.200000000000001 |
+-----+-----+-----+--------------------+
| 'D' |   9 | 8.1 |  72.89999999999999 |
+-----+-----+-----+--------------------+

# Extract table from list

In [41]:
l = [['foo', 'bar'], ['a', 1], ['b', 2], ['c', 3]]
print(l)  # l is a list of rows (header first), which petl accepts as a table container

[['foo', 'bar'], ['a', 1], ['b', 2], ['c', 3]]


In [42]:
# wrap() wraps a table container so that the object-oriented style can be used on it
table = etl.wrap(l)
table.look()

+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   3 |
+-----+-----+

In [43]:
# print() renders the table without quoting string values
print(table) 

+-----+-----+
| foo | bar |
+=====+=====+
| a   |   1 |
+-----+-----+
| b   |   2 |
+-----+-----+
| c   |   3 |
+-----+-----+



# Use of NumPy arrays with petl

In [47]:
import numpy as np

a = np.array([('apples', 1, 2.5),
              ('oranges', 3, 4.4),
              ('pears', 7, 0.1)],
             dtype='U8, i4, f4')  # U8: Unicode string of up to 8 characters, i4: 32-bit integer, f4: 32-bit float

t1 = etl.fromarray(a)
t1.look()

+-----------+----+-----+
| f0        | f1 | f2  |
+===========+====+=====+
| 'apples'  | 1  | 2.5 |
+-----------+----+-----+
| 'oranges' | 3  | 4.4 |
+-----------+----+-----+
| 'pears'   | 7  | 0.1 |
+-----------+----+-----+

In [48]:
# cut() keeps only the specified columns
# convert() transforms the values of a column
# addfield() adds a new column
t2 = t1.cut('f0', 'f2').convert('f0', 'upper').addfield('f3', lambda row: row.f2 * 2)
print(t2)

+---------+-----+---------------------+
| f0      | f2  | f3                  |
+=========+=====+=====================+
| APPLES  | 2.5 |                 5.0 |
+---------+-----+---------------------+
| ORANGES | 4.4 |   8.800000190734863 |
+---------+-----+---------------------+
| PEARS   | 0.1 | 0.20000000298023224 |
+---------+-----+---------------------+
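The long decimals in `f3` are what one would expect when a value stored as a 32-bit float (`f4`) is widened to a 64-bit float before being printed. The following is a minimal sketch using only the standard `struct` module to emulate 32-bit storage; the helper name `as_float32` is hypothetical and is not part of petl or NumPy:

```python
import struct


def as_float32(value):
    """Round a Python float to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack('f', struct.pack('f', value))[0]


stored = as_float32(4.4)  # what a 32-bit field can actually hold for 4.4
doubled = stored * 2      # the same arithmetic as the f3 column

print(stored)   # 4.400000095367432
print(doubled)  # 8.800000190734863
```

Declaring the column as `f8` instead of `f4` in the array's dtype would store 64-bit floats, and rounding the result is another option if only a few decimal places matter.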

