## Read Delimited data using CSV

Let us understand how to read data from delimited files using Python's file I/O functions together with the `csv` module. `csv` can read an iterable of delimited strings into an iterable of lists (which we will convert to tuples) or dicts.
* We will go through the steps to read the contents of a file into a list of tuples. We will also see how to apply transformations such as changing the data types of the elements in each tuple.
* We will also go through the steps to read the contents of the file into a list of dicts.

In [1]:
!ls -lhtr /data/retail_db/orders

total 2.9M
-rw-r--r-- 1 root root 2.9M Jan 21  2021 part-00000


In [2]:
!head -5 /data/retail_db/orders/part-00000

1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE


In [3]:
import csv

In [4]:
orders_file = open('/data/retail_db/orders/part-00000')

In [5]:
orders_list = orders_file.read().splitlines()

In [6]:
type(orders_list)

list

In [7]:
orders_list[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
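`csv.reader` accepts any iterable of lines, so an open file object can also be passed to it directly instead of calling `read().splitlines()` first. The following is a minimal sketch of that variation. The temporary file and the names `sample_lines`, `sample_path` and `rows_from_file` are assumptions made so the example runs anywhere; they stand in for `/data/retail_db/orders/part-00000`.

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for /data/retail_db/orders/part-00000
sample_lines = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('\n'.join(sample_lines) + '\n')
    sample_path = tmp.name

# `with` closes the file automatically; the csv documentation
# recommends opening files with newline=''
with open(sample_path, newline='') as orders_file:
    rows_from_file = list(csv.reader(orders_file))

os.remove(sample_path)  # clean up the temporary file
rows_from_file
```

Using `with` also avoids leaving the file handle open, which happens when `open` is called without a matching `close`.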

In [8]:
csv.reader?

Docstring:
csv_reader = reader(iterable [, dialect='excel']
                        [optional keyword args])
    for row in csv_reader:
        process(row)

The "iterable" argument can be any object that returns a line
of input for each iteration, such as a file object or a list.  The
optional "dialect" parameter is discussed below.  The function
also accepts optional keyword arguments which override settings
provided by the dialect.

The returned object is an iterator.  Each iteration returns a row
of the CSV file (which can span multiple input lines).
Type:      builtin_function_or_method


In [9]:
orders = csv.reader(orders_list, delimiter=',') 
# The default delimiter is ','
# The line above is equivalent to csv.reader(orders_list)

In [10]:
type(orders)

_csv.reader
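A `csv.reader` object is an iterator, so it can be consumed only once. This is why `orders` is recreated with `csv.reader(orders_list)` before each use in the cells that follow. Here is a small sketch of that behavior; `sample_lines`, `first_pass` and `second_pass` are illustrative names, not part of the original notebook.

```python
import csv

# Two sample lines in the same layout as the orders file
sample_lines = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]

reader = csv.reader(sample_lines)
first_pass = list(reader)   # consumes every row from the iterator
second_pass = list(reader)  # the iterator is exhausted, so this is empty

len(first_pass), len(second_pass)
```

If the rows are needed more than once, store `list(reader)` in a variable or create a new reader.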

In [13]:
orders = csv.reader(orders_list)
list(orders)[:10]

[['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED'],
 ['2', '2013-07-25 00:00:00.0', '256', 'PENDING_PAYMENT'],
 ['3', '2013-07-25 00:00:00.0', '12111', 'COMPLETE'],
 ['4', '2013-07-25 00:00:00.0', '8827', 'CLOSED'],
 ['5', '2013-07-25 00:00:00.0', '11318', 'COMPLETE'],
 ['6', '2013-07-25 00:00:00.0', '7130', 'COMPLETE'],
 ['7', '2013-07-25 00:00:00.0', '4530', 'COMPLETE'],
 ['8', '2013-07-25 00:00:00.0', '2911', 'PROCESSING'],
 ['9', '2013-07-25 00:00:00.0', '5657', 'PENDING_PAYMENT'],
 ['10', '2013-07-25 00:00:00.0', '5648', 'PENDING_PAYMENT']]
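For the orders data, `csv.reader` gives the same result as splitting each line on `,`. The difference shows up when a field contains the delimiter inside quotes, or when the file uses a different delimiter. The sketch below illustrates both cases; the rows in `quoted_lines` and `piped_lines` are made-up sample data.

```python
import csv

# Hypothetical row: the second field contains a comma inside double quotes
quoted_lines = ['1,"Tiger, Scott",CLOSED']

naive = quoted_lines[0].split(',')        # splits inside the quoted field
parsed = next(csv.reader(quoted_lines))   # keeps the quoted field together

# Hypothetical pipe-delimited rows, parsed by overriding the delimiter
piped_lines = ['1|Scott|Tiger', '2|Donald|Duck']
piped = list(csv.reader(piped_lines, delimiter='|'))

print(naive)
print(parsed)
print(piped)
```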

In [15]:
orders = csv.reader(orders_list)
order = list(orders)[0]

In [18]:
order

['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED']

In [19]:
order[0]

'1'

In [20]:
int(order[0]) if order[0].isdigit() else order[0]

1

In [21]:
list(map(lambda item: int(item) if item.isdigit() else item, order))

[1, '2013-07-25 00:00:00.0', 11599, 'CLOSED']

In [22]:
tuple(map(lambda item: int(item) if item.isdigit() else item, order))

(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
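The `isdigit` check works for this data set, but it returns `False` for strings such as `'-5'`, so negative integers would stay as strings. One possible alternative, sketched below, is a small helper that tries `int` and falls back to the original string. The name `to_int_if_possible` is hypothetical, and note that `int` is more permissive than `isdigit` (for example, it tolerates surrounding whitespace).

```python
def to_int_if_possible(item):
    """Return int(item) if the string parses as an integer, else the string unchanged."""
    try:
        return int(item)
    except ValueError:
        return item


order = ['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED']
converted = tuple(map(to_int_if_possible, order))
negative = to_int_if_possible('-5')  # '-5'.isdigit() is False, but int('-5') works

print(converted)
print(negative)
```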

In [23]:
orders = csv.reader(orders_list)

# Convert each row into a tuple, casting all-digit fields to int
list(
    map(
        lambda order: tuple(map(lambda item: int(item) if item.isdigit() else item, order)),
        orders
    )
)[:10]

[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'),
 (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'),
 (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'),
 (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'),
 (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'),
 (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'),
 (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'),
 (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'),
 (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'),
 (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]

* Here is the complete code to convert the data in a CSV file into a list of tuples using `csv`.

In [24]:
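import csv

# `with` closes the file once its contents have been read
with open('/data/retail_db/orders/part-00000') as orders_file:
    orders_list = orders_file.read().splitlines()

orders = csv.reader(orders_list)

# Convert each row into a tuple, casting all-digit fields to int
list(
    map(
        lambda order: tuple(map(lambda item: int(item) if item.isdigit() else item, order)),
        orders
    )
)[:10]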

[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'),
 (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'),
 (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'),
 (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'),
 (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'),
 (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'),
 (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'),
 (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'),
 (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'),
 (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]
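The same conversion can also be written with a named function and a generator expression, which some readers find easier to follow than nested lambdas. The sketch below is one such variation, not the notebook's original approach. It uses `itertools.islice` to take the first few rows without building the whole list; `to_typed_tuple` and `sample_lines` are illustrative names.

```python
import csv
from itertools import islice


def to_typed_tuple(row):
    """Convert one parsed row into a tuple, casting all-digit fields to int."""
    return tuple(int(item) if item.isdigit() else item for item in row)


# Sample lines in the same layout as the orders file
sample_lines = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
]

# Lazy pipeline: rows are parsed and converted only when requested
typed_rows = (to_typed_tuple(row) for row in csv.reader(sample_lines))
first_two = list(islice(typed_rows, 2))

print(first_two)
```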

* Let us see how we can read an iterable of CSV strings into a list of dict-like objects using `csv.DictReader`.

In [25]:
csv.DictReader?

Init signature:
csv.DictReader(
    f,
    fieldnames=None,
    restkey=None,
    restval=None,
    dialect='excel',
    *args,
    **kwds,
)
Docstring:      <no docstring>
File:           /opt/anaconda3/envs/beakerx/lib/python3.6/csv.py
Type:           type
Subclasses:     


In [27]:
users = [
    '1,Scott,Tiger',
    '2,Donald,Duck'
]

In [28]:
csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name'])

<csv.DictReader at 0x7fb1039788d0>

In [29]:
list(csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name']))

[OrderedDict([('user_id', '1'),
              ('user_first_name', 'Scott'),
              ('user_last_name', 'Tiger')]),
 OrderedDict([('user_id', '2'),
              ('user_first_name', 'Donald'),
              ('user_last_name', 'Duck')])]

In [32]:
users_dicts = list(csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name']))

In [33]:
users_dicts

[OrderedDict([('user_id', '1'),
              ('user_first_name', 'Scott'),
              ('user_last_name', 'Tiger')]),
 OrderedDict([('user_id', '2'),
              ('user_first_name', 'Donald'),
              ('user_last_name', 'Duck')])]
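The `restkey` and `restval` parameters in the signature control what happens when a row has more or fewer fields than `fieldnames`. Extra values are collected in a list under `restkey`, and missing fields are filled with `restval`. The sketch below shows this with made-up rows; `ragged_users` and the key name `extra_fields` are assumptions for illustration.

```python
import csv

fieldnames = ['user_id', 'user_first_name', 'user_last_name']

# Hypothetical rows that do not match the number of field names
ragged_users = [
    '1,Scott,Tiger,extra_value',  # one field too many
    '2,Donald',                   # one field too few
]

records = list(
    csv.DictReader(
        ragged_users,
        fieldnames=fieldnames,
        restkey='extra_fields',  # extra values are stored as a list under this key
        restval='',              # missing fields are filled with this value
    )
)

print(records[0]['extra_fields'])
print(repr(records[1]['user_last_name']))
```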

In [41]:
users_dicts[0]

OrderedDict([('user_id', '1'),
             ('user_first_name', 'Scott'),
             ('user_last_name', 'Tiger')])

In [42]:
users_dicts[0]['user_first_name']

'Scott'
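When the data has a header line, `fieldnames` can be omitted and `csv.DictReader` uses the values in the first row as the keys. The sketch below assumes a made-up list, `users_with_header`, whose first element is the header.

```python
import csv

# Hypothetical data whose first line holds the column names
users_with_header = [
    'user_id,user_first_name,user_last_name',
    '1,Scott,Tiger',
    '2,Donald,Duck',
]

reader = csv.DictReader(users_with_header)  # no fieldnames: first row is the header
header_records = list(reader)

print(reader.fieldnames)
print(header_records[1]['user_last_name'])
```

The orders file used in this section has no header line, which is why `fieldnames` is passed explicitly for it.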

In [56]:
orders_file = open('/data/retail_db/orders/part-00000')

In [57]:
orders_list = orders_file.read().splitlines()

In [58]:
orders_list[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [59]:
list(csv.DictReader(orders_list, fieldnames=['order_id', 'order_date', 'order_customer_id', 'order_status']))[:10]

[OrderedDict([('order_id', '1'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '11599'),
              ('order_status', 'CLOSED')]),
 OrderedDict([('order_id', '2'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '256'),
              ('order_status', 'PENDING_PAYMENT')]),
 OrderedDict([('order_id', '3'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '12111'),
              ('order_status', 'COMPLETE')]),
 OrderedDict([('order_id', '4'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '8827'),
              ('order_status', 'CLOSED')]),
 OrderedDict([('order_id', '5'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '11318'),
              ('order_status', 'COMPLETE')]),
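 OrderedDict([('order_id', '6'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '7130'),
              ('order_status', 'COMPLETE')]),
 OrderedDict([('order_id', '7'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '4530'),
              ('order_status', 'COMPLETE')]),
 OrderedDict([('order_id', '8'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '2911'),
              ('order_status', 'PROCESSING')]),
 OrderedDict([('order_id', '9'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '5657'),
              ('order_status', 'PENDING_PAYMENT')]),
 OrderedDict([('order_id', '10'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '5648'),
              ('order_status', 'PENDING_PAYMENT')])]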

In [60]:
orders = list(csv.DictReader(orders_list, fieldnames=['order_id', 'order_date', 'order_customer_id', 'order_status']))

In [62]:
orders[:3]

[OrderedDict([('order_id', '1'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '11599'),
              ('order_status', 'CLOSED')]),
 OrderedDict([('order_id', '2'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '256'),
              ('order_status', 'PENDING_PAYMENT')]),
 OrderedDict([('order_id', '3'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '12111'),
              ('order_status', 'COMPLETE')])]

In [63]:
order = orders[0]

In [64]:
type(order)

collections.OrderedDict
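The `OrderedDict` type reflects the Python 3.6 environment used for this notebook. From Python 3.8 onwards, `csv.DictReader` returns plain `dict` rows instead. Code that only relies on dict behavior works in both cases, because `OrderedDict` is a subclass of `dict`. A short sketch, with `row` and `plain` as illustrative names:

```python
import csv

users = ['1,Scott,Tiger']
fieldnames = ['user_id', 'user_first_name', 'user_last_name']

row = next(csv.DictReader(users, fieldnames=fieldnames))

is_dict = isinstance(row, dict)  # True for both dict and OrderedDict rows
plain = dict(row)                # explicit conversion to a plain dict

print(is_dict)
print(plain)
```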

In [65]:
order

OrderedDict([('order_id', '1'),
             ('order_date', '2013-07-25 00:00:00.0'),
             ('order_customer_id', '11599'),
             ('order_status', 'CLOSED')])

In [66]:
order['order_date']

'2013-07-25 00:00:00.0'
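All values returned by `csv.DictReader` are strings, just as with `csv.reader`. Because each field has a name, the data types can be changed field by field. The sketch below is one way to do it, assuming the date format seen in the sample data (`%Y-%m-%d %H:%M:%S.%f`); the helper `convert_order` is a hypothetical name.

```python
import csv
from datetime import datetime

fieldnames = ['order_id', 'order_date', 'order_customer_id', 'order_status']

# Sample line in the same layout as the orders file
sample_lines = ['1,2013-07-25 00:00:00.0,11599,CLOSED']


def convert_order(order):
    """Return a new dict with ids as int and the order date as datetime."""
    return {
        'order_id': int(order['order_id']),
        # Assumed format, based on values such as '2013-07-25 00:00:00.0'
        'order_date': datetime.strptime(order['order_date'], '%Y-%m-%d %H:%M:%S.%f'),
        'order_customer_id': int(order['order_customer_id']),
        'order_status': order['order_status'],
    }


typed_orders = [
    convert_order(order)
    for order in csv.DictReader(sample_lines, fieldnames=fieldnames)
]

print(typed_orders[0])
```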