# Quick revision of Python 3


As part of this topic, let us quickly review the basic concepts of Python before jumping into Spark APIs. Spark provides APIs for Python as well as for Scala, Java and other languages, so we need to be comfortable with at least one of these programming languages to build applications using Spark.

Let us revise the concepts below before jumping into PySpark (Spark with Python).

* Basics of Programming (help, type, indentation etc)
* Overview of Functions
* Lambda Functions
* Basic file I/O
* Collections and Map Reduce APIs
* Overview of Pandas Data Frames
    
We can use the Jupyter Notebook in the lab to revise Python concepts.

# Basics of Programming

Let us talk about some of the basics of programming using Python 3.

* We can launch the Python CLI or use Jupyter Notebook to develop Python code.
* <b>type</b> can be used to get the data type of a Python variable or object.
* <b>help</b> can be used as an interactive utility (by calling it without arguments) or as a function on a class, an object or a function.
* We need to indent properly to define the scope of blocks while programming in Python.
* As Python is a dynamically typed programming language, we do not declare data types while creating variables or objects. The type is determined by the value assigned to the variable.
* It has all the basic constructs such as if, while, for, the conditional expression (Python's ternary operator) etc.
* <b>Python</b> supports all basic data types as well as collections such as list, set, dict etc.
* As part of the demo, we will see the usage of type and help, and a basic program using the ternary operator as well as looping through a list (get even numbers from a list of elements).
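
The points above can be tried out with a short sketch like the one below. The variable names and the sample list are made up for illustration and are not part of the lab material.

```python
# Sample values are illustrative only
order_count = 10
print(type(order_count))   # the name currently refers to an int
order_count = "ten"        # the same name can be rebound to a value of another type
print(type(order_count))   # now it refers to a str

# help(str.upper) prints the documentation of a function;
# help() without arguments starts the interactive help utility
# help(str.upper)

# Conditional expression (Python's form of the ternary operator)
n = 7
parity = "even" if n % 2 == 0 else "odd"
print(parity)

# Indentation defines the body of the for loop and of the if statement
numbers = [1, 2, 3, 4, 5, 6]
evens = []
for number in numbers:
    if number % 2 == 0:
        evens.append(number)
print(evens)
```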

# Overview of Functions
We need to revise the following related to functions.

* Pre-Defined Functions
 * Performing File I/O
 * String Manipulation Functions (we will see a few examples)
 * Date Manipulation Functions
 * Manipulating Collections
 * and more
* User Defined Functions
 * At times we need to develop new functions which are not available as part of Core Python or 3rd party Python modules.
 * Here are a few things we should recollect with respect to user-defined functions.
  * Function Specification (Function Name, Arguments, and Return type)
  * We can have a fixed number of arguments, varying number of arguments as well as keyword arguments for Functions in Python.
  * Function Definition or Logic
  * Return Statement
* Functions can be passed as arguments to other functions.
* We will also go through lambda functions in a separate topic.
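
As a rough illustration of the points above, the sketch below uses a few pre-defined functions and defines two user-defined functions. The names `order_summary` and `apply_to_all` and the sample values are hypothetical and chosen only for this example.

```python
from datetime import datetime

def order_summary(order_id, *items, status="PENDING", **extra):
    """Return a one-line description of an order.

    order_id is a fixed positional argument, *items collects a varying
    number of positional arguments, status is a keyword argument with a
    default and **extra collects any other keyword arguments.
    """
    summary = "order {} has {} item(s) and is {}".format(order_id, len(items), status)
    if extra:
        summary += " " + str(sorted(extra.items()))
    return summary

def apply_to_all(func, values):
    """Call func on each value and return the results as a list."""
    results = []
    for value in values:
        results.append(func(value))
    return results

print(order_summary(1))
print(order_summary(2, "shoes", "socks", status="CLOSED"))

# Pre-defined string and date manipulation functions
print("complete".upper())
order_date = datetime.strptime("2013-07-25 00:00:00.0", "%Y-%m-%d %H:%M:%S.%f")
print(order_date.year)

# A function (here the built-in len) passed as an argument to another function
status_lengths = apply_to_all(len, ["CLOSED", "COMPLETE"])
print(status_lengths)
```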

# Lambda Functions

Let us revise the details related to Lambda Functions.

* At times we might have to develop simple functions, especially to pass as arguments to higher-order functions.
* In that case, we can use lambda functions.
* Lambda Functions are extensively used as part of modern programming languages.

In [1]:
# Closed-form way of getting the sum of integers between lb and ub (inclusive)
# / is true division in Python 3, so the result is a float
def sumOfIntegers(lb, ub):
    l = lb - 1
    return ((ub * (ub + 1)) / 2) - ((l * (l + 1)) / 2)

print(sumOfIntegers(2, 5))

# To demonstrate lambda functions we will loop through the range
# With the conventional approach, we need to write different functions for
# sum of range of numbers
# sum of squares in range of numbers
# and more
def sumOfRange(lb, ub):
    total = 0
    for i in range(lb, ub + 1):
        total += i
    return total
print("sum of integers using conventional approach " + str(sumOfRange(3, 5)))

def sumOfSquares(lb, ub):
    total = 0
    for i in range(lb, ub + 1):
        total += (i * i)
    return total
print("sum of squares using conventional approach " + str(sumOfSquares(3, 5)))

# With lambda functions, we can get more concise and reusable code
# The function is not named sum so that the built-in sum stays available
def sumUsing(f, lb, ub):
    total = 0
    for i in range(lb, ub + 1):
        total += f(i)
    return total
print("sum of integers using lambda functions " + str(sumUsing(lambda i: i, 3, 5)))
print("sum of squares using lambda functions " + str(sumUsing(lambda i: i * i, 3, 5)))

# We can also pass a named function as an argument
def cube(i): return i * i * i
print("sum of cubes using lambda functions " + str(sumUsing(cube, 3, 5)))

14.0
sum of integers using conventional approach 12
sum of squares using conventional approach 50
sum of integers using lambda functions 12
sum of squares using lambda functions 50
sum of cubes using lambda functions 216
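
Lambda functions are also commonly passed to built-in higher-order functions. The following is a minimal sketch using `sorted` and `max` with a `key` argument; the list of (order_id, order_status) tuples is illustrative sample data.

```python
# Illustrative (order_id, order_status) tuples
orders = [(3, "COMPLETE"), (1, "CLOSED"), (2, "PENDING_PAYMENT")]

# key= expects a function of one element; a lambda avoids a separate def
by_id = sorted(orders, key=lambda o: o[0])
longest_status = max(orders, key=lambda o: len(o[1]))

print(by_id)
print(longest_status)
```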


# Basic File I/O

Let us see how we can read data using Python file I/O APIs. We will limit the scope to reading the data from a file into a collection.

* <b>open</b> is the built-in function which creates a file object.
* We can call <b>read()</b> to read the data from a file into memory. When we call read on a file opened in text mode, the data is loaded into memory as a string.
* We can load the data at once or in iterations of multiple batches or buffers.
* To convert the string into a collection we can use either <b>split</b> or <b>splitlines</b>.
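
The steps above can be sketched as follows. To keep the example independent of the lab data, it first writes a small temporary file; the file name `orders_sample.txt` is an assumption made for this sketch, and the two records follow the format of the orders data.

```python
import os
import tempfile

# Write a small sample file so the sketch does not depend on the lab data
sample_path = os.path.join(tempfile.mkdtemp(), "orders_sample.txt")
with open(sample_path, "w") as sample_file:
    sample_file.write("1,2013-07-25 00:00:00.0,11599,CLOSED\n")
    sample_file.write("2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n")

# Read everything at once: read() returns one string, splitlines() a list
with open(sample_path) as sample_file:
    data = sample_file.read()
lines = data.splitlines()
print(type(data), len(lines))

# Read in iterations: a file object yields one line at a time
with open(sample_path) as sample_file:
    statuses = [line.rstrip("\n").split(",")[3] for line in sample_file]
print(statuses)
```

The `with` block closes the file automatically, which is why it is generally preferred over calling `open` without a matching `close`.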

# Collections and Map Reduce APIs
Now let us recollect details about collections and basic map reduce APIs.

* Python supports 3 main types of collections
 * list – **[1, 2, 1, 5, 3]**
 * set – **{1, 2, 5, 3}**
 * dict – **{ 'order_id': 1, 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': 1000, 'order_status': 'COMPLETE' }**
 * a list is an ordered sequence of items which can contain duplicates, while a set is a group of unique items
 * dict is similar to a hash map where keys are unique and each key has a corresponding value.
* We also have another data structure called tuple. Tuples are immutable sequences whose values are not named and are retrieved using positional notation
 * tuple – <b>(1, '2013-07-25 00:00:00.0', 1000, 'COMPLETE')</b>
* Quite often we will create a list or set of tuples
* Let us see some simple examples
 * Creating a list using orders data from a file
 * Convert one element from the list into a tuple and perform tuple operations.
 * Extract order_dates from a list and get unique dates using set.
 * Extract order_id and order_date as dict.

In [2]:
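# Read the orders file into a list of lines; the with block closes the file
with open('/data/retail_db/orders/part-00000') as ordersFile:
    orders = ordersFile.read().splitlines()

# for order in orders[:10]: print(order)

# Collect the order_date (second field) of every order
orderDatesList = []

for order in orders:
    orderDatesList.append(order.split(',')[1])

# A set keeps only the unique dates
orderDates = set(orderDatesList)

# for order in list(orderDates)[:10]: print(order)

# Convert the first record into a tuple
# (order_id, order_date, order_customer_id, order_status)
orderRecordElements = orders[0].split(',')

orderTuple = (int(orderRecordElements[0]), orderRecordElements[1], int(orderRecordElements[2]), orderRecordElements[3])
# print(orderTuple[1])

# Build a dict which maps order_id to order_date
orderDict = {}

for order in orders:
    orderElements = order.split(',')
    orderDict[int(orderElements[0])] = orderElements[1]

print(orderDict[1])
len(orderDict)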

2013-07-25 00:00:00.0


68883
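
The same operations can also be written more compactly with comprehensions and tuple unpacking. This is a sketch on three records from the orders data held in memory, so it runs without the lab files; the variable names are chosen for this example only.

```python
# Three records in the format of the orders data
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
]

# Set comprehension: unique order dates
order_dates = {order.split(",")[1] for order in orders}

# Dict comprehension: order_id mapped to order_date
order_dict = {int(o.split(",")[0]): o.split(",")[1] for o in orders}

# Tuple unpacking assigns each field of the first record to a name
order_id, order_date, customer_id, status = orders[0].split(",")
order_tuple = (int(order_id), order_date, int(customer_id), status)

print(order_dates)
print(order_dict[1])
print(order_tuple[3])
```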

# Map Reduce APIs

Let us get into the details related to Map Reduce APIs to manipulate collections.

* We can process data in collections using different approaches – conventional loops, map-reduce etc.
* Map Reduce APIs such as filter, map, reduce etc take care of initializing the aggregator, looping through elements as well as returning the aggregator for us. We just need to focus on business logic.
* In Python 3, map and filter return lazy iterators, so we convert the result to a list when we want to preview or reuse it.
* If we have to sort the collection, we can use sorted, which accepts any collection and returns a list.
* If we have to eliminate duplicates then we need to convert the collection to a set.
* Let us see how we can create a collection from a file and then apply map reduce APIs to compute revenue for a given order_item_order_id.

In [3]:
ordersPath = "/data/retail_db/orders/part-00000"
# The with block closes the file once the data has been read
with open(ordersPath) as ordersFile:
    ordersData = ordersFile.read()
orders = ordersData.splitlines()
for i in orders[:10]:
    print(i)

1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT


In [4]:
ordersMap = map(lambda o: (o.split(",")[0], o.split(",")[3]), orders)
for i in list(ordersMap)[:10]:
    print(i)

('1', 'CLOSED')
('2', 'PENDING_PAYMENT')
('3', 'COMPLETE')
('4', 'CLOSED')
('5', 'COMPLETE')
('6', 'COMPLETE')
('7', 'COMPLETE')
('8', 'PROCESSING')
('9', 'PENDING_PAYMENT')
('10', 'PENDING_PAYMENT')


In [5]:
orderItemsPath = "/data/retail_db/order_items/part-00000"
# The with block closes the file once the data has been read
with open(orderItemsPath) as orderItemsFile:
    orderItemsData = orderItemsFile.read()
orderItems = orderItemsData.splitlines()
for i in orderItems[:10]:
    print(i)

1,1,957,1,299.98,299.98
2,2,1073,1,199.99,199.99
3,2,502,5,250.0,50.0
4,2,403,1,129.99,129.99
5,4,897,2,49.98,24.99
6,4,365,5,299.95,59.99
7,4,502,3,150.0,50.0
8,4,1014,4,199.92,49.98
9,5,957,1,299.98,299.98
10,5,365,5,299.95,59.99


In [12]:
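import functools as ft

# Keep only the items of order 2, then extract order_item_subtotal (fifth field)
orderItemsFiltered = filter(lambda oi: int(oi.split(",")[1]) == 2, orderItems)
orderItemsMap = map(lambda oi: float(oi.split(",")[4]), orderItemsFiltered)
# The built-in sum(orderItemsMap) gives the same total, provided sum has not been redefined in the session
ft.reduce(lambda x, y: x + y, orderItemsMap)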

579.98
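
To compare the approaches side by side, here is a minimal sketch which computes the revenue of order 2 with a conventional loop and with filter, map and reduce. It uses five records from the order_items output above, held in memory; the variable names are chosen for this example only.

```python
from functools import reduce

# Five records from the order_items data
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
    "5,4,897,2,49.98,24.99",
]

# Conventional loop: we initialize, update and return the aggregator ourselves
revenue_loop = 0.0
for oi in order_items:
    fields = oi.split(",")
    if int(fields[1]) == 2:
        revenue_loop += float(fields[4])

# filter and map return lazy iterators; reduce consumes them
subtotals = map(lambda oi: float(oi.split(",")[4]),
                filter(lambda oi: int(oi.split(",")[1]) == 2, order_items))
revenue_reduce = reduce(lambda x, y: x + y, subtotals, 0.0)

# The iterator is exhausted after one pass, so nothing is left to read
remaining = list(subtotals)

# set removes duplicate order ids, sorted returns them as a list
order_ids = [int(oi.split(",")[1]) for oi in order_items]
unique_ids_desc = sorted(set(order_ids), reverse=True)

print(round(revenue_loop, 2), round(revenue_reduce, 2))
print(remaining, unique_ids_desc)
```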

# Overview of Pandas Data Frames

Collections are typically groups of objects, tuples or simple strings, and we need to parse each record ourselves before we can process the data further. With Data Frames we can define the structure and reference the values in each record using column names. Pandas also provides rich and simple APIs to load CSV files into Data Frames and process them in a developer-friendly way.

* Using read_csv with names we can create a Data Frame out of comma-separated data and assign the field names
* We can fetch data from specific columns using their names
* We can filter data using query
* We can perform by-key aggregations using groupby followed by aggregate functions
* We can also join data using merge (or join)

Here are some examples of the usage of Pandas Data Frames.

In [None]:
import pandas as pd
orderItemsPath = "/data/retail_db/order_items/part-00000"
# The file has no header row, so the column names are passed using names
orderItems = pd.read_csv(orderItemsPath, names=[
    "order_item_id", "order_item_order_id", "order_item_product_id",
    "order_item_quantity", "order_item_subtotal", "order_item_product_price"
])
# Fetch specific columns by name
orderItems[['order_item_id', 'order_item_subtotal']]
# Filter the items of order 2 and compute the revenue of that order
orderItems.query('order_item_order_id == 2')
orderItems.query('order_item_order_id == 2')['order_item_subtotal'].sum()
# Revenue for each order
orderItems.groupby(['order_item_order_id'])['order_item_subtotal'].sum()
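
Joining is not covered by the cell above, so here is a small sketch of it, assuming pandas is installed. It loads a few orders and order_items records from in-memory strings through `io.StringIO` instead of the lab files and joins them with `merge`. The orders column names follow the dict example in the collections section; the variable names are chosen for this example only.

```python
import io
import pandas as pd

# A few records in the format of the orders and order_items data
orders_csv = """1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
4,2013-07-25 00:00:00.0,8827,CLOSED
"""
order_items_csv = """1,1,957,1,299.98,299.98
2,2,1073,1,199.99,199.99
3,2,502,5,250.0,50.0
4,2,403,1,129.99,129.99
5,4,897,2,49.98,24.99
"""

orders = pd.read_csv(io.StringIO(orders_csv), names=[
    "order_id", "order_date", "order_customer_id", "order_status"])
order_items = pd.read_csv(io.StringIO(order_items_csv), names=[
    "order_item_id", "order_item_order_id", "order_item_product_id",
    "order_item_quantity", "order_item_subtotal", "order_item_product_price"])

# Inner join of orders with their items on the order id
joined = orders.merge(order_items, left_on="order_id", right_on="order_item_order_id")

# Revenue per order, and total revenue of closed orders
revenue_per_order = joined.groupby("order_id")["order_item_subtotal"].sum()
closed_revenue = joined.query("order_status == 'CLOSED'")["order_item_subtotal"].sum()

print(revenue_per_order)
print(round(closed_revenue, 2))
```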