# Overview of Collections and Tuples – Python 3
Let us go over the basics of collections such as list, set and dict, as well as tuples. A tuple is an unnamed object with multiple attributes, while a collection is often a group of tuples.

* Read data from files into a collection
* Collections – list, set and dict – group of homogeneous elements
* Basic Operations on Collections
* Tuples – group of heterogeneous elements
* Develop data processing applications (using loops over collections)

## File I/O
To understand collections in detail, it is better to read real-world data than to use hypothetical examples. Let us assume the data is available in files and see how to create collections from it.
* Python has simple yet rich APIs to perform file I/O
* We can create a file object with open in different modes (read-only text mode by default)
* To read the contents of the file into memory, the file object provides APIs such as read
* read returns the entire contents of the file as one large string
* If the data has multiple records delimited by the newline character, we can apply splitlines to the output of read
* splitlines converts the string into a list with one element per line, dropping the newline characters

Here is sample code which reads data from a file into a collection (a list in this case).

In [6]:
ordersPath = "/data/retail_db/orders/part-00000"
# with closes the file automatically once the block ends
with open(ordersPath) as ordersFile:
    ordersData = ordersFile.read()
orders = ordersData.splitlines()
# Preview the first 10 records
for i in orders[:10]:
    print(i)

1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT
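
The read and splitlines pattern can also be tried without the retail_db files. The following is a minimal sketch, assuming a throwaway temporary file that holds three records copied from the output above; readLines and the other names are hypothetical and chosen only for illustration.

```python
import os
import tempfile

def readLines(path):
    """Read a whole text file and return its records as a list of strings."""
    # The with block closes the file even if read() raises an error
    with open(path) as dataFile:
        return dataFile.read().splitlines()

# Three sample records, one per line, as they would appear in the file
sample = (
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
    "3,2013-07-25 00:00:00.0,12111,COMPLETE\n"
)

# Write the sample to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    samplePath = tmp.name

sampleOrders = readLines(samplePath)
os.remove(samplePath)  # clean up the temporary file

print(len(sampleOrders))
print(sampleOrders[0])
```

Because splitlines drops the newline characters, each element of the list is a clean record that can be split further on commas.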


## Collections
* list
    * Ordered group of elements with index and length
    * Elements can be appended at the end or inserted at a particular position
    * We can access elements in a list by using the index in []
    * There can be duplicates in a list
    * APIs are available to add elements to the list, delete elements from the list and sort the list
    * We will see some basic list operations by using simple examples
        * Adding elements to a list (append, insert)
        * Deleting elements from a list (pop, remove, clear)
        * Checking how many times an element is repeated in a list (count)
        * Getting the position of an element (index)
        * Sorting elements in the list (sort for an in-place sort, sorted for creating a new sorted list)
    
* set
    * Group of unique elements with no index (len still returns the number of elements)
    * Elements can be added, but not at a particular position
    * We can check whether an element exists using the in operator
    * There can be no duplicates in a set
    * APIs are available to add elements to the set, delete elements from the set and perform set operations such as union, intersection etc
    * There is no API available in set to sort it. To sort the data, use the sorted function, which returns a new sorted list.
    * We will see some basic set operations by using simple examples
        * Adding elements to a set (add)
        * Deleting elements from a set (pop/remove, clear)
        * Checking whether an element is present in a set (in)
        * Set operations (union, intersection, difference etc)

* dict
    * Group of key value pairs
    * Keys are unique
    * Values need not be unique
    * We can access values using keys
    * APIs are available to add new key value pairs to a dict, update values based on keys, extract keys and values (in Python 3, keys and values return view objects that can be converted with set or list), check whether a key exists in the dict etc
    * We will see some basic dict operations by using simple examples
        * Adding elements to dict
        * Removing elements from dict (clear, pop, popitem)
        * Get all keys (keys)
        * Get all key value pairs (items)
        * Get only values (values)
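
The list, set and dict operations listed above can be sketched as follows. This is a minimal illustration, assuming made-up sample values based on the order statuses; the variable names are hypothetical.

```python
# --- list: ordered, indexed, duplicates allowed ---
statuses = ["CLOSED", "COMPLETE"]
statuses.append("CLOSED")                  # add at the end
statuses.insert(1, "PENDING")              # add at position 1
closedCount = statuses.count("CLOSED")     # how many times CLOSED occurs
completePos = statuses.index("COMPLETE")   # position of first COMPLETE
lastStatus = statuses.pop()                # remove and return the last element
sortedCopy = sorted(statuses)              # new sorted list, original untouched
statuses.sort()                            # sort the list in place

# --- set: unique elements, no index ---
uniqueStatuses = set(["CLOSED", "COMPLETE", "CLOSED"])  # duplicate is dropped
uniqueStatuses.add("PENDING")
hasClosed = "CLOSED" in uniqueStatuses     # membership check with in
uniqueStatuses.remove("PENDING")
openStatuses = {"PENDING", "PROCESSING"}
allStatuses = uniqueStatuses.union(openStatuses)
common = uniqueStatuses.intersection(openStatuses)
onlyFinished = uniqueStatuses.difference(openStatuses)
sortedStatuses = sorted(allStatuses)       # sorted returns a list

# --- dict: unique keys mapped to values ---
orderStatus = {1: "CLOSED", 2: "PENDING_PAYMENT"}
orderStatus[3] = "COMPLETE"                # add a new key value pair
orderStatus[2] = "COMPLETE"                # update the value of an existing key
hasOrder2 = 2 in orderStatus               # in checks the keys
orderIds = list(orderStatus.keys())
statusValues = list(orderStatus.values())
pairs = list(orderStatus.items())          # list of (key, value) tuples
removedStatus = orderStatus.pop(1)         # remove key 1 and return its value

print(statuses, sortedStatuses, orderStatus)
```

Note that items returns key value pairs as tuples, which is one common way a collection ends up being a group of tuples.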
    
## Tuple
Now let us understand the definition and characteristics of a tuple.
* A tuple is like an object with unnamed attributes
* Values of attributes can be accessed only by position
* It represents an individual row in a table or spreadsheet with multiple attributes
* We use () to represent tuples
* Tuples are immutable
* Very limited operations are available, e.g. count and index
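
The characteristics above can be sketched with one order record turned into a tuple. This is a minimal example; the field meanings (order id, order date, customer id, status) are assumed from the sample output and the variable names are hypothetical.

```python
# Build a tuple from one comma-separated order record
record = "1,2013-07-25 00:00:00.0,11599,CLOSED"
fields = record.split(",")
order = (int(fields[0]), fields[1], int(fields[2]), fields[3])

# Attributes are unnamed, so they are accessed by position
orderId = order[0]
orderStatus = order[3]

# Unpacking assigns every position to a variable in one step
oId, oDate, oCustomerId, oStatus = order

# The only methods available on a tuple are count and index
closedCount = order.count("CLOSED")
customerPos = order.index(11599)

# Tuples are immutable: assigning to a position raises TypeError
try:
    order[3] = "COMPLETE"
    modified = True
except TypeError:
    modified = False

print(order, modified)
```

To change a value, a new tuple has to be created; the existing one cannot be updated in place.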

## Develop applications (using loops)
Now that we understand how to read data from files and manipulate collections, we will see how to process the data read from files using traditional loops together with collection and tuple data structures.

Here are sample programs that develop applications using loops.

* Task 1: Get all order statuses from orders data (loop through list, get order status and add it to the set)

In [7]:
ordersPath = "/data/retail_db/orders/part-00000"
# with closes the file automatically once the block ends
with open(ordersPath) as ordersFile:
  ordersData = ordersFile.read()
orders = ordersData.splitlines()
# An empty set is created with set(); {} would create an empty dict
orderStatuses = set()
for order in orders:
  # Order status is the 4th field of each record
  orderStatuses.add(order.split(",")[3])

for i in orderStatuses:
  print(i)

PAYMENT_REVIEW
CANCELED
ON_HOLD
COMPLETE
CLOSED
PENDING_PAYMENT
PENDING
SUSPECTED_FRAUD
PROCESSING


* Task 2: Create a function to get revenue for a given order_id (from order_items)

In [8]:
def readData(dataPath):
  # with closes the file automatically once the block ends
  with open(dataPath) as dataFile:
    dataStr = dataFile.read()
  dataList = dataStr.splitlines()
  return dataList

def getOrderRevenue(orderItems, orderId):
  revenue = 0.0
  for orderItem in orderItems:
    # Split each record once: order id is at index 1, subtotal at index 4
    fields = orderItem.split(",")
    if int(fields[1]) == orderId:
      revenue += float(fields[4])
  return revenue

orderItemsPath = "/data/retail_db/order_items/part-00000"
orderItems = readData(orderItemsPath)
orderRevenue = getOrderRevenue(orderItems, 2)
print(orderRevenue)

579.98


* Task 3: Create a function to get revenue for each order_id (from order_items)


In [9]:
def readData(dataPath):
  # with closes the file automatically once the block ends
  with open(dataPath) as dataFile:
    dataStr = dataFile.read()
  dataList = dataStr.splitlines()
  return dataList

def getRevenuePerOrder(orderItems):
  revenuePerOrder = {}
  for orderItem in orderItems:
    fields = orderItem.split(",")
    # Tuple of (order id, order item subtotal)
    orderIdAndRevenue = (int(fields[1]), float(fields[4]))
    # Add to the running total if the order id is already a key
    if orderIdAndRevenue[0] in revenuePerOrder:
      revenuePerOrder[orderIdAndRevenue[0]] += orderIdAndRevenue[1]
    else:
      revenuePerOrder[orderIdAndRevenue[0]] = orderIdAndRevenue[1]
  return revenuePerOrder

orderItemsPath = "/data/retail_db/order_items/part-00000"
orderItems = readData(orderItemsPath)
revenuePerOrder = getRevenuePerOrder(orderItems)
revenuePerOrder[2]
len(revenuePerOrder)

57431
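
The accumulate-per-key step in Task 3 can also be written with collections.defaultdict from the standard library, which supplies a starting value for keys seen for the first time. This is an alternative sketch, not the approach used above; getRevenuePerOrderDefault and the sample rows are hypothetical, assuming the same field positions (order id at index 1, subtotal at index 4).

```python
from collections import defaultdict

def getRevenuePerOrderDefault(orderItems):
    """Sum the subtotal (index 4) per order id (index 1)."""
    # defaultdict(float) starts every missing key at 0.0
    revenuePerOrder = defaultdict(float)
    for orderItem in orderItems:
        fields = orderItem.split(",")
        revenuePerOrder[int(fields[1])] += float(fields[4])
    # Convert back to a plain dict for the caller
    return dict(revenuePerOrder)

# Hypothetical order item rows, made up for illustration only
sampleOrderItems = [
    "1,1,957,1,300.0,300.0",
    "2,2,1073,1,200.0,200.0",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,130.0,130.0",
]

sampleRevenue = getRevenuePerOrderDefault(sampleOrderItems)
print(sampleRevenue)
```

With defaultdict the if/else check disappears, at the cost of one extra import.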

* Task 4: Create a function to get daily revenue using orders which are completed or closed and order items

In [9]:
def readData(dataPath):
    # with closes the file automatically once the block ends
    with open(dataPath) as dataFile:
        dataStr = dataFile.read()
    dataList = dataStr.splitlines()
    return dataList

def getCompletedOrders(orders):
    # Keep only orders whose status (4th field) is COMPLETE or CLOSED
    ordersFiltered = []
    for order in orders:
        if order.split(',')[3] in ('COMPLETE', 'CLOSED'):
            ordersFiltered.append(order)
    return ordersFiltered

def getOrderIdAndDateDict(orders):
    # Build a lookup from order id to order date
    orderIdAndDateDict = {}
    for order in orders:
        fields = order.split(",")
        orderIdAndDateDict[int(fields[0])] = fields[1]
    return orderIdAndDateDict

def getDailyRevenue(orderIdAndDateDict, orderItems):
    dailyRevenue = {}
    for orderItem in orderItems:
        fields = orderItem.split(",")
        orderIdAndRevenue = (int(fields[1]), float(fields[4]))
        # Skip order items whose order is not COMPLETE or CLOSED
        if orderIdAndRevenue[0] in orderIdAndDateDict:
            orderDate = orderIdAndDateDict[orderIdAndRevenue[0]]
            if orderDate in dailyRevenue:
                dailyRevenue[orderDate] += orderIdAndRevenue[1]
            else:
                dailyRevenue[orderDate] = orderIdAndRevenue[1]
    return dailyRevenue

orderItemsPath = "/data/retail_db/order_items/part-00000"
ordersPath = "/data/retail_db/orders/part-00000"

orderItems = readData(orderItemsPath)
orders = readData(ordersPath)
ordersFiltered = getCompletedOrders(orders)
orderIdAndDateDict = getOrderIdAndDateDict(ordersFiltered)
dailyRevenue = getDailyRevenue(orderIdAndDateDict, orderItems)

for k in dailyRevenue:
    print((k, dailyRevenue[k]))


('2013-07-25 00:00:00.0', 31547.230000000014)
('2013-07-26 00:00:00.0', 54713.23000000002)
('2013-07-27 00:00:00.0', 48411.48000000003)
('2013-07-28 00:00:00.0', 35672.03000000004)
('2013-07-29 00:00:00.0', 54579.699999999946)
('2013-07-30 00:00:00.0', 49329.29000000002)
('2013-07-31 00:00:00.0', 59212.490000000056)
('2013-08-01 00:00:00.0', 49160.080000000045)
('2013-08-02 00:00:00.0', 50688.58000000002)
('2013-08-03 00:00:00.0', 43416.74000000001)
('2013-08-04 00:00:00.0', 35093.01000000003)
('2013-08-05 00:00:00.0', 34025.270000000026)
('2013-08-06 00:00:00.0', 57843.89000000003)
('2013-08-07 00:00:00.0', 45525.59000000006)
('2013-08-08 00:00:00.0', 33549.47000000002)
('2013-08-09 00:00:00.0', 29225.160000000018)
('2013-08-10 00:00:00.0', 46435.04000000003)
('2013-08-11 00:00:00.0', 31155.5)
('2013-08-12 00:00:00.0', 59014.74000000002)
('2013-08-13 00:00:00.0', 17956.879999999994)
('2013-08-14 00:00:00.0', 42043.45000000002)
('2013-08-15 00:00:00.0', 49566.68000000003)
('2013-08-16 

## Exercises
* Get the number of records by status (using orders data)
* Get the number of orders per month (using orders data)
* Get those order items where the order item subtotal is not equal to the order item quantity multiplied by the order item product price
* Get all those order details from orders where there are no corresponding order items
* Get all those products whose daily revenue is more than $1000 – we need order_date, product_id and product_revenue
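
As a possible starting point for the first exercise, here is a minimal sketch that counts records by status. It assumes the ten sample records shown in the File I/O output rather than the full orders file, and getCountByStatus is a hypothetical name.

```python
# First ten records from the orders data, as shown in the File I/O section
sampleOrders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
    "4,2013-07-25 00:00:00.0,8827,CLOSED",
    "5,2013-07-25 00:00:00.0,11318,COMPLETE",
    "6,2013-07-25 00:00:00.0,7130,COMPLETE",
    "7,2013-07-25 00:00:00.0,4530,COMPLETE",
    "8,2013-07-25 00:00:00.0,2911,PROCESSING",
    "9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT",
    "10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT",
]

def getCountByStatus(orders):
    """Return a dict mapping each order status to its number of records."""
    countByStatus = {}
    for order in orders:
        status = order.split(",")[3]
        # get returns 0 for a status that has not been seen yet
        countByStatus[status] = countByStatus.get(status, 0) + 1
    return countByStatus

countByStatus = getCountByStatus(sampleOrders)
for status in sorted(countByStatus):
    print((status, countByStatus[status]))
```

The same function can be applied to the list returned by reading the full orders file.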