## Row level transformations

Let us understand how to perform row level transformations using the orders data set. Here are the details about orders.
* The data is in text file format.
* Each line in the file contains one record.
* Each record contains 4 attributes, separated by `,`:
  * order_id
  * order_date
  * order_customer_id
  * order_status

In [None]:
%%sh

ls -ltr /data/retail_db/orders/part-00000

In [None]:
%%sh

tail /data/retail_db/orders/part-00000

In [1]:
path = 'D:\\BIGDATA-LEARN\\data-engineering-spark-main\\data\\retail_db\\orders\\part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)

In [2]:
type(orders_file)

_io.TextIOWrapper

In [3]:
orders_raw = orders_file.read()

In [4]:
type(orders_raw)

str
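
The cells above leave `orders_file` open after the read. A minimal sketch of the same read using a `with` block, which closes the file automatically. The temporary file and its three sample records are stand-ins (assumptions) so that the snippet runs on any machine; they are not the real data set.

```python
import os
import tempfile

# Stand-in for part-00000: three sample records in a temporary file (assumption)
sample = (
    '1,2013-07-25 00:00:00.0,11599,CLOSED\n'
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n'
    '3,2013-07-25 00:00:00.0,12111,COMPLETE\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# The with block closes the file as soon as the read finishes
with open(sample_path) as sample_file:
    sample_raw = sample_file.read()

# Same step as in the notebook: one list element per record
sample_orders = sample_raw.splitlines()

# Clean up the temporary file
os.remove(sample_path)
```

With the real file, only the `open(path)` call inside the `with` block would be needed; the rest is scaffolding for the sample.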

In [5]:
orders_raw.splitlines?

Signature: orders_raw.splitlines(keepends=False)
Docstring:
Return a list of the lines in the string, breaking at line boundaries.

Line breaks are not included in the resulting list unless keepends is given and
true.
Type:      builtin_function_or_method


In [6]:
orders = orders_raw.splitlines()

In [7]:
type(orders)

list

In [8]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [9]:
type(orders[0])

str

In [10]:
len(orders)

68883

In [11]:
%%sh

wc -l /data/retail_db/orders/part-00000

Couldn't find program: 'sh'
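
The `%%sh` cell fails on a machine without a shell, as the output shows. A portable way to cross-check the record count is to count lines in Python itself. This is a sketch: `count_lines` is a hypothetical helper, and the temporary file stands in for `part-00000`.

```python
import os
import tempfile

def count_lines(file_path):
    """Count lines by iterating over the file, without loading it all into memory."""
    with open(file_path) as f:
        return sum(1 for _ in f)

# Hypothetical two-record sample file standing in for part-00000
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1,2013-07-25 00:00:00.0,11599,CLOSED\n')
    tmp.write('2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n')
    sample_path = tmp.name

line_count = count_lines(sample_path)
os.remove(sample_path)
```

Note that `wc -l` counts newline characters, so the two counts can differ by one when the last line has no trailing newline.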


### Task 1

Get all order ids and their statuses. Each record in the output should be a comma-separated string.

In [12]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED' # -> '1,CLOSED'

In [13]:
# We invoke join on the delimiter string

In [14]:
str.join?

Signature: str.join(self, iterable, /)
Docstring:
Concatenate any number of strings.

The string whose method is called is inserted in between each given string.
The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
Type:      method_descriptor


In [15]:
':'.join(['1', '2', '3', '4'])

'1:2:3:4'

In [16]:
order.split(',')[0]

'1'

In [17]:
order.split(',')[3]

'CLOSED'

In [18]:
[order.split(',')[0], order.split(',')[3]]

['1', 'CLOSED']

In [19]:
','.join([order.split(',')[0], order.split(',')[3]])

'1,CLOSED'

In [20]:
l = [1]

In [21]:
l.append(2)

In [22]:
l

[1, 2]

In [23]:
order_statuses = []
for order in orders:
    order_statuses.append(','.join([order.split(',')[0], order.split(',')[3]]))

In [24]:
order_statuses[:10]

['1,CLOSED',
 '2,PENDING_PAYMENT',
 '3,COMPLETE',
 '4,CLOSED',
 '5,COMPLETE',
 '6,COMPLETE',
 '7,COMPLETE',
 '8,PROCESSING',
 '9,PENDING_PAYMENT',
 '10,PENDING_PAYMENT']

In [25]:
len(order_statuses)

68883

In [26]:
order_statuses = [','.join([order.split(',')[0], order.split(',')[3]]) for order in orders] # alternative solution

In [27]:
order_statuses[:10]

['1,CLOSED',
 '2,PENDING_PAYMENT',
 '3,COMPLETE',
 '4,CLOSED',
 '5,COMPLETE',
 '6,COMPLETE',
 '7,COMPLETE',
 '8,PROCESSING',
 '9,PENDING_PAYMENT',
 '10,PENDING_PAYMENT']

In [28]:
len(order_statuses)

68883
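
Both solutions above call `split(',')` twice per record. One possible variation is to move the logic into a small function that splits once and reuses the pieces. `get_order_status` is a hypothetical name, and the two sample records are copied from the output of `orders[:10]`.

```python
def get_order_status(order):
    """Return 'order_id,order_status' from a comma separated order record."""
    order_values = order.split(',')  # split once, reuse the pieces
    return ','.join([order_values[0], order_values[3]])

# Two sample records copied from the data shown earlier
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]

# Apply the function to every record, as in the list comprehension above
sample_statuses = [get_order_status(order) for order in sample_orders]
```

A named function also keeps the list comprehension short and can be tested on a single record before it is applied to the whole list.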

### Task 2

Get all order ids, the dates on which the orders were placed, and the order statuses. Each record in the output should be a dict with the following column names as keys.
* order_id
* order_date
* order_status

In [29]:
{'order_id': 1, 'order_date': '2020-12-22', 'order_status': 'COMPLETE'}

{'order_id': 1, 'order_date': '2020-12-22', 'order_status': 'COMPLETE'}

In [30]:
def get_order_details(order):
    """Extract the order id, date and status from a record and return them as a dict."""
    order_values = order.split(',')
    return {
        'order_id': int(order_values[0]),
        'order_date': order_values[1],
        'order_status': order_values[3]
    }

In [31]:
get_order_details('1,2013-07-25 00:00:00.0,11599,CLOSED')

{'order_id': 1,
 'order_date': '2013-07-25 00:00:00.0',
 'order_status': 'CLOSED'}

In [32]:
order_details = []
for order in orders:
    order_details.append(get_order_details(order))

In [33]:
order_details[:10]

[{'order_id': 1,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 2,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 3,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 4,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 5,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 6,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 7,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 8,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PROCESSING'},
 {'order_id': 9,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 10,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'}]

In [34]:
len(order_details)

68883

In [35]:
order_details = [get_order_details(order) for order in orders]

In [36]:
order_details[:10]

[{'order_id': 1,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 2,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 3,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 4,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 5,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 6,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 7,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 8,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PROCESSING'},
 {'order_id': 9,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 10,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'}]

In [37]:
len(order_details)

68883
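
The sample dict at the start of this task shows `order_date` as a plain date such as `'2020-12-22'`, while the results above keep the full timestamp string. If a date-only value is wanted, one option is to parse the timestamp with `datetime.strptime` and format it back. This is a sketch: `get_order_details_with_date` is a hypothetical name, and the format string assumes every record follows the pattern seen in the first ten records.

```python
from datetime import datetime

def get_order_details_with_date(order):
    """Like get_order_details, but reduce order_date to 'YYYY-MM-DD'."""
    order_values = order.split(',')
    # Assumed timestamp layout, e.g. '2013-07-25 00:00:00.0'
    order_date = datetime.strptime(order_values[1], '%Y-%m-%d %H:%M:%S.%f')
    return {
        'order_id': int(order_values[0]),
        'order_date': order_date.strftime('%Y-%m-%d'),
        'order_status': order_values[3]
    }

sample_detail = get_order_details_with_date('1,2013-07-25 00:00:00.0,11599,CLOSED')
```

`strptime` raises `ValueError` for a record whose timestamp does not match the format, so a malformed line fails loudly instead of slipping through.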