### Section 9: Apache Spark 2.x - Data Frames and Pre-Defined Functions

#### Data Frames - Overview

In [1]:
#read a file into a dataframe
sc

In [2]:
spark

In [3]:
ordersDF = spark.read.csv("/user/pi/retail_db/orders")

In [4]:
type(ordersDF)

pyspark.sql.dataframe.DataFrame

In [5]:
ordersDF.first()

Row(_c0='1', _c1='2013-07-25 00:00:00.0', _c2='11599', _c3='CLOSED')

In [6]:
ordersDF.select('_c0','_c1').show()

+---+--------------------+
|_c0|                 _c1|
+---+--------------------+
|  1|2013-07-25 00:00:...|
|  2|2013-07-25 00:00:...|
|  3|2013-07-25 00:00:...|
|  4|2013-07-25 00:00:...|
|  5|2013-07-25 00:00:...|
|  6|2013-07-25 00:00:...|
|  7|2013-07-25 00:00:...|
|  8|2013-07-25 00:00:...|
|  9|2013-07-25 00:00:...|
| 10|2013-07-25 00:00:...|
| 11|2013-07-25 00:00:...|
| 12|2013-07-25 00:00:...|
| 13|2013-07-25 00:00:...|
| 14|2013-07-25 00:00:...|
| 15|2013-07-25 00:00:...|
| 16|2013-07-25 00:00:...|
| 17|2013-07-25 00:00:...|
| 18|2013-07-25 00:00:...|
| 19|2013-07-25 00:00:...|
| 20|2013-07-25 00:00:...|
+---+--------------------+
only showing top 20 rows



<p>A Data Frame is a distributed collection of rows with a named, typed schema; informally, an RDD with structure.</p>
<ul>
<li>A Data Frame can be created on any data set that has a structure associated with it.</li>
<li>Attributes/columns in a data frame can be referred to by name.</li>
<li>A data frame can be created from files, Hive tables, or relational tables over JDBC.</li>
<li>Common functions on Data Frames
<ul>
<li>printSchema – to print the column names and data types of data frame</li>
<li>show – to preview data (default 20 records)</li>
<li>describe – to understand characteristics of data</li>
<li>count – to get number of records</li>
<li>collect – to convert data frame into a Python list of Row objects</li>
</ul>
</li>
<li>Once data frame is created, we can process data using 2 approaches.
<ul>
<li>Native Data Frame APIs</li>
<li>Register as temp table and run queries using spark.sql</li>
</ul>
</li>
<li>To work with Data Frames as well as Spark SQL, we need to create an object of type SparkSession</li>
</ul>

***

from pyspark.sql import SparkSession

spark = SparkSession. \
    builder. \
    master('local'). \
    appName('Create Dataframe over JDBC'). \
    getOrCreate()
    
***

In [7]:
#to see what the Dataframe looks like - structure
ordersDF.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)



In [8]:
#to preview data
ordersDF.show()

+---+--------------------+-----+---------------+
|_c0|                 _c1|  _c2|            _c3|
+---+--------------------+-----+---------------+
|  1|2013-07-25 00:00:...|11599|         CLOSED|
|  2|2013-07-25 00:00:...|  256|PENDING_PAYMENT|
|  3|2013-07-25 00:00:...|12111|       COMPLETE|
|  4|2013-07-25 00:00:...| 8827|         CLOSED|
|  5|2013-07-25 00:00:...|11318|       COMPLETE|
|  6|2013-07-25 00:00:...| 7130|       COMPLETE|
|  7|2013-07-25 00:00:...| 4530|       COMPLETE|
|  8|2013-07-25 00:00:...| 2911|     PROCESSING|
|  9|2013-07-25 00:00:...| 5657|PENDING_PAYMENT|
| 10|2013-07-25 00:00:...| 5648|PENDING_PAYMENT|
| 11|2013-07-25 00:00:...|  918| PAYMENT_REVIEW|
| 12|2013-07-25 00:00:...| 1837|         CLOSED|
| 13|2013-07-25 00:00:...| 9149|PENDING_PAYMENT|
| 14|2013-07-25 00:00:...| 9842|     PROCESSING|
| 15|2013-07-25 00:00:...| 2568|       COMPLETE|
| 16|2013-07-25 00:00:...| 7276|PENDING_PAYMENT|
| 17|2013-07-25 00:00:...| 2667|       COMPLETE|
| 18|2013-07-25 00:0

In [10]:
#to preview data - without data truncated
ordersDF.show(20,False)

+---+---------------------+-----+---------------+
|_c0|_c1                  |_c2  |_c3            |
+---+---------------------+-----+---------------+
|1  |2013-07-25 00:00:00.0|11599|CLOSED         |
|2  |2013-07-25 00:00:00.0|256  |PENDING_PAYMENT|
|3  |2013-07-25 00:00:00.0|12111|COMPLETE       |
|4  |2013-07-25 00:00:00.0|8827 |CLOSED         |
|5  |2013-07-25 00:00:00.0|11318|COMPLETE       |
|6  |2013-07-25 00:00:00.0|7130 |COMPLETE       |
|7  |2013-07-25 00:00:00.0|4530 |COMPLETE       |
|8  |2013-07-25 00:00:00.0|2911 |PROCESSING     |
|9  |2013-07-25 00:00:00.0|5657 |PENDING_PAYMENT|
|10 |2013-07-25 00:00:00.0|5648 |PENDING_PAYMENT|
|11 |2013-07-25 00:00:00.0|918  |PAYMENT_REVIEW |
|12 |2013-07-25 00:00:00.0|1837 |CLOSED         |
|13 |2013-07-25 00:00:00.0|9149 |PENDING_PAYMENT|
|14 |2013-07-25 00:00:00.0|9842 |PROCESSING     |
|15 |2013-07-25 00:00:00.0|2568 |COMPLETE       |
|16 |2013-07-25 00:00:00.0|7276 |PENDING_PAYMENT|
|17 |2013-07-25 00:00:00.0|2667 |COMPLETE       |


In [11]:
#to check the characteristics of the data
ordersDF.describe().show()

+-------+------------------+--------------------+-----------------+---------------+
|summary|               _c0|                 _c1|              _c2|            _c3|
+-------+------------------+--------------------+-----------------+---------------+
|  count|             68883|               68883|            68883|          68883|
|   mean|           34442.0|                null|6216.571098819738|           null|
| stddev|19884.953633337947|                null|3586.205241263963|           null|
|    min|                 1|2013-07-25 00:00:...|                1|       CANCELED|
|    max|              9999|2014-07-24 00:00:...|             9999|SUSPECTED_FRAUD|
+-------+------------------+--------------------+-----------------+---------------+
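
One detail in the output above is worth a note: `max` for `_c0` is 9999 even though `count` is 68883. Every column was read as a string, so `min` and `max` compare values character by character rather than numerically. The plain-Python sketch below reproduces the effect on a small, hypothetical sample of ids; it is an analogy for string ordering, not Spark code.

```python
# Hypothetical sample of order ids, held as strings the way spark.read.csv loads them
ids_as_strings = ["1", "256", "9999", "10000", "68883"]

# String comparison goes character by character, so "9999" sorts after "68883"
max_as_string = max(ids_as_strings)

# Converting to int first gives the numeric maximum
max_as_int = max(int(v) for v in ids_as_strings)

print(max_as_string)  # 9999
print(max_as_int)     # 68883
```

Casting the column to an integer type before calling describe, as done later in this section, avoids the surprise.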



In [12]:
#get no of records in DF
ordersDF.count()

68883

In [13]:
#convert dataframe to python collection
ordersLst=ordersDF.collect()
type(ordersLst)

list

In [17]:
#read a json file
ordersDF = spark.read.json("/user/pi/retail_db_json/orders")

In [19]:
ordersDF.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)



In [21]:
ordersDF.select("_c0","_c1").show()

+---+--------------------+
|_c0|                 _c1|
+---+--------------------+
|  1|2013-07-25 00:00:...|
|  2|2013-07-25 00:00:...|
|  3|2013-07-25 00:00:...|
|  4|2013-07-25 00:00:...|
|  5|2013-07-25 00:00:...|
|  6|2013-07-25 00:00:...|
|  7|2013-07-25 00:00:...|
|  8|2013-07-25 00:00:...|
|  9|2013-07-25 00:00:...|
| 10|2013-07-25 00:00:...|
| 11|2013-07-25 00:00:...|
| 12|2013-07-25 00:00:...|
| 13|2013-07-25 00:00:...|
| 14|2013-07-25 00:00:...|
| 15|2013-07-25 00:00:...|
| 16|2013-07-25 00:00:...|
| 17|2013-07-25 00:00:...|
| 18|2013-07-25 00:00:...|
| 19|2013-07-25 00:00:...|
| 20|2013-07-25 00:00:...|
+---+--------------------+
only showing top 20 rows



#### Process data in form of SQL

In [22]:
ordersDF.createTempView("orders")

In [24]:
spark.sql("select * from orders").show()

+---+--------------------+-----+---------------+
|_c0|                 _c1|  _c2|            _c3|
+---+--------------------+-----+---------------+
|  1|2013-07-25 00:00:...|11599|         CLOSED|
|  2|2013-07-25 00:00:...|  256|PENDING_PAYMENT|
|  3|2013-07-25 00:00:...|12111|       COMPLETE|
|  4|2013-07-25 00:00:...| 8827|         CLOSED|
|  5|2013-07-25 00:00:...|11318|       COMPLETE|
|  6|2013-07-25 00:00:...| 7130|       COMPLETE|
|  7|2013-07-25 00:00:...| 4530|       COMPLETE|
|  8|2013-07-25 00:00:...| 2911|     PROCESSING|
|  9|2013-07-25 00:00:...| 5657|PENDING_PAYMENT|
| 10|2013-07-25 00:00:...| 5648|PENDING_PAYMENT|
| 11|2013-07-25 00:00:...|  918| PAYMENT_REVIEW|
| 12|2013-07-25 00:00:...| 1837|         CLOSED|
| 13|2013-07-25 00:00:...| 9149|PENDING_PAYMENT|
| 14|2013-07-25 00:00:...| 9842|     PROCESSING|
| 15|2013-07-25 00:00:...| 2568|       COMPLETE|
| 16|2013-07-25 00:00:...| 7276|PENDING_PAYMENT|
| 17|2013-07-25 00:00:...| 2667|       COMPLETE|
| 18|2013-07-25 00:0

### Create Data Frames from Text Files

<p>Let us see how we can read text data from files into a data frame. spark.read also has APIs for other file formats, but we will get into those details later.</p>
<ul>
<li>We can use spark.read.csv or spark.read.text to read text data.</li>
<li>spark.read.csv can be used for delimited data such as comma-separated files. Default field names are of the form _c0, _c1, etc.</li>
<li>spark.read.text can be used to read fixed-length data where there is no delimiter. Each line is loaded into a single field whose default name is value.</li>
<li>We can define attribute names using the toDF function.</li>
<li>In either case, every field is read as a string unless a schema is supplied.</li>
<li>We can convert data types using the cast function –  <code>df.select(df.field.cast(IntegerType()))</code>
</li>
<li>We will see the other functions soon; for now, let us read the data into a data frame and represent each field in its original data type.</li>
</ul>
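
To make the string-versus-typed distinction concrete, here is a plain-Python analogy (not Spark code) of what happens to one line of the orders file: splitting yields strings only, and a cast step turns selected fields into integers. The helper name `cast_int` is made up for this sketch; returning `None` for a value that cannot be parsed mirrors how a Spark 2.x `cast` yields null instead of raising an error.

```python
def cast_int(value):
    """Return value as an int, or None when it is not a valid integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

# One record from the orders data shown in this section
line = "1,2013-07-25 00:00:00.0,11599,CLOSED"

# Splitting on the delimiter yields strings only, like the _c0.._c3 columns
order_id, order_date, customer_id, order_status = line.split(",")

# Cast only the fields that are really numeric; leave the others as strings
typed_record = (cast_int(order_id), order_date, cast_int(customer_id), order_status)

print(typed_record)        # (1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
print(cast_int("CLOSED"))  # None
```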

In [26]:
spark

In [29]:
spark.stop()

In [31]:
from pyspark.sql import SparkSession

In [32]:
spark = SparkSession.\
        builder.\
        appName("Reading Data from Text Files").\
        master("spark://192.168.1.109:7077").\
        getOrCreate()

In [33]:
spark

In [34]:
#read data from text file using spark.read.csv
orderDF = spark.read.csv("/user/pi/retail_db/orders")

In [35]:
#preview first record
orderDF.first()

Row(_c0='1', _c1='2013-07-25 00:00:00.0', _c2='11599', _c3='CLOSED')

In [36]:
#preview 20 records
orderDF.show(truncate=False)

+---+---------------------+-----+---------------+
|_c0|_c1                  |_c2  |_c3            |
+---+---------------------+-----+---------------+
|1  |2013-07-25 00:00:00.0|11599|CLOSED         |
|2  |2013-07-25 00:00:00.0|256  |PENDING_PAYMENT|
|3  |2013-07-25 00:00:00.0|12111|COMPLETE       |
|4  |2013-07-25 00:00:00.0|8827 |CLOSED         |
|5  |2013-07-25 00:00:00.0|11318|COMPLETE       |
|6  |2013-07-25 00:00:00.0|7130 |COMPLETE       |
|7  |2013-07-25 00:00:00.0|4530 |COMPLETE       |
|8  |2013-07-25 00:00:00.0|2911 |PROCESSING     |
|9  |2013-07-25 00:00:00.0|5657 |PENDING_PAYMENT|
|10 |2013-07-25 00:00:00.0|5648 |PENDING_PAYMENT|
|11 |2013-07-25 00:00:00.0|918  |PAYMENT_REVIEW |
|12 |2013-07-25 00:00:00.0|1837 |CLOSED         |
|13 |2013-07-25 00:00:00.0|9149 |PENDING_PAYMENT|
|14 |2013-07-25 00:00:00.0|9842 |PROCESSING     |
|15 |2013-07-25 00:00:00.0|2568 |COMPLETE       |
|16 |2013-07-25 00:00:00.0|7276 |PENDING_PAYMENT|
|17 |2013-07-25 00:00:00.0|2667 |COMPLETE       |


In [37]:
#preview Schema
orderDF.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)



In [46]:
#to set column name without type cast
orderDF = spark.read.csv("/user/pi/retail_db/orders"). \
        toDF('order_id','order_date','customer_id','order_status')

In [47]:
orderDF.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)



In [43]:
#to set column name with type cast
orderDF = spark.read.csv("/user/pi/retail_db/orders",sep=',', \
                         schema='order_id int,order_Date string,customer_id int,order_status string')

In [44]:
orderDF.printSchema()                         

root
 |-- order_id: integer (nullable = true)
 |-- order_Date: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [49]:
orderDF.show(10,truncate=False)

+--------+---------------------+-----------+---------------+
|order_id|order_date           |customer_id|order_status   |
+--------+---------------------+-----------+---------------+
|1       |2013-07-25 00:00:00.0|11599      |CLOSED         |
|2       |2013-07-25 00:00:00.0|256        |PENDING_PAYMENT|
|3       |2013-07-25 00:00:00.0|12111      |COMPLETE       |
|4       |2013-07-25 00:00:00.0|8827       |CLOSED         |
|5       |2013-07-25 00:00:00.0|11318      |COMPLETE       |
|6       |2013-07-25 00:00:00.0|7130       |COMPLETE       |
|7       |2013-07-25 00:00:00.0|4530       |COMPLETE       |
|8       |2013-07-25 00:00:00.0|2911       |PROCESSING     |
|9       |2013-07-25 00:00:00.0|5657       |PENDING_PAYMENT|
|10      |2013-07-25 00:00:00.0|5648       |PENDING_PAYMENT|
+--------+---------------------+-----------+---------------+
only showing top 10 rows



In [51]:
#read text file using spark.read.format
orderDF = spark. \
            read. \
            format('csv'). \
            option('sep',','). \
            schema('order_id int,order_Date string,customer_id int,order_status string'). \
            load("/user/pi/retail_db/orders")

In [52]:
orderDF.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_Date: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [53]:
orderDF.show(10)

+--------+--------------------+-----------+---------------+
|order_id|          order_Date|customer_id|   order_status|
+--------+--------------------+-----------+---------------+
|       1|2013-07-25 00:00:...|      11599|         CLOSED|
|       2|2013-07-25 00:00:...|        256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|      12111|       COMPLETE|
|       4|2013-07-25 00:00:...|       8827|         CLOSED|
|       5|2013-07-25 00:00:...|      11318|       COMPLETE|
|       6|2013-07-25 00:00:...|       7130|       COMPLETE|
|       7|2013-07-25 00:00:...|       4530|       COMPLETE|
|       8|2013-07-25 00:00:...|       2911|     PROCESSING|
|       9|2013-07-25 00:00:...|       5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|       5648|PENDING_PAYMENT|
+--------+--------------------+-----------+---------------+
only showing top 10 rows



In [54]:
#Different methods of type cast
orderDF = spark.read.csv("/user/pi/retail_db/orders").toDF('order_id','order_date','customer_id','order_status')
orderDF.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)



In [61]:
from pyspark.sql.types import IntegerType
orderDF_1=orderDF.select(orderDF.order_id.cast("int"),orderDF.order_date,orderDF.customer_id.cast(IntegerType()),orderDF.order_status)

In [62]:
orderDF.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)



In [63]:
orderDF_1.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [56]:
#using with column - when we have to typecast specific columns only
orderDF = spark.read.csv("/user/pi/retail_db/orders").toDF('order_id','order_date','customer_id','order_status')
orderDF.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)



In [68]:
orderDF_2=orderDF.withColumn('order_id',orderDF.order_id.cast(IntegerType())). \
       withColumn('customer_id',orderDF.customer_id.cast(IntegerType()))

In [69]:
orderDF_2.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



#### Read fixed length data - spark.read.text

In [70]:
orderDF = spark.read.text("/user/pi/retail_db/orders")

In [71]:
orderDF.show(truncate=False)

+---------------------------------------------+
|value                                        |
+---------------------------------------------+
|1,2013-07-25 00:00:00.0,11599,CLOSED         |
|2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT  |
|3,2013-07-25 00:00:00.0,12111,COMPLETE       |
|4,2013-07-25 00:00:00.0,8827,CLOSED          |
|5,2013-07-25 00:00:00.0,11318,COMPLETE       |
|6,2013-07-25 00:00:00.0,7130,COMPLETE        |
|7,2013-07-25 00:00:00.0,4530,COMPLETE        |
|8,2013-07-25 00:00:00.0,2911,PROCESSING      |
|9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT |
|10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT|
|11,2013-07-25 00:00:00.0,918,PAYMENT_REVIEW  |
|12,2013-07-25 00:00:00.0,1837,CLOSED         |
|13,2013-07-25 00:00:00.0,9149,PENDING_PAYMENT|
|14,2013-07-25 00:00:00.0,9842,PROCESSING     |
|15,2013-07-25 00:00:00.0,2568,COMPLETE       |
|16,2013-07-25 00:00:00.0,7276,PENDING_PAYMENT|
|17,2013-07-25 00:00:00.0,2667,COMPLETE       |
|18,2013-07-25 00:00:00.0,1205,CLOSED   

In [73]:
orderDF.printSchema()

root
 |-- value: string (nullable = true)
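
With spark.read.text each line arrives as one string in the `value` column, so individual fields have to be cut out by position. The sketch below shows the idea in plain Python on a made-up fixed-length record; the layout (field names, start positions, lengths) is hypothetical and not taken from the orders data. Positions are 1-based to match the convention of `substring` in pyspark.sql.functions.

```python
# Hypothetical fixed-length layout: (field name, 1-based start position, length)
LAYOUT = [
    ("order_id", 1, 5),
    ("order_date", 6, 10),
    ("customer_id", 16, 5),
    ("order_status", 21, 8),
]

def parse_fixed_length(value, layout=LAYOUT):
    """Cut a fixed-length line into named fields, stripping the padding."""
    record = {}
    for name, pos, length in layout:
        # Convert the 1-based position to Python's 0-based slice
        record[name] = value[pos - 1:pos - 1 + length].strip()
    return record

# Made-up record padded to the widths above
line = "000012013-07-2511599CLOSED  "
print(parse_fixed_length(line))
```

In Spark, the same result would likely come from applying `substring` to `orderDF.value` once per field inside a select, followed by casts where needed.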



### Create Data Frames from Hive Tables

In [3]:
sc

In [4]:
spark

<p>If Hive and Spark are integrated, we can create data frames from data in Hive tables or run Spark SQL queries against it.</p>
<ul>
<li>We can use spark.read.table to read data from Hive tables into Data Frame</li>
<li>We can prefix database name to table name while reading Hive tables into Data Frame</li>
<li>We can also run Hive queries directly using spark.sql</li>
<li>Both spark.read.table and spark.sql return a Data Frame</li>
</ul>

In [8]:
spark.sql("show databases").show()

+-------------------+
|       databaseName|
+-------------------+
|ameen_daily_revenue|
|            default|
|      retail_db_txt|
+-------------------+



In [9]:
spark.sql("use retail_db_txt")
spark.sql("show tables").show()

+-------------+---------------+-----------+
|     database|      tableName|isTemporary|
+-------------+---------------+-----------+
|retail_db_txt|      customers|      false|
|retail_db_txt|daily_revenue_2|      false|
|retail_db_txt|    order_items|      false|
|retail_db_txt|         orders|      false|
+-------------+---------------+-----------+



In [10]:
#read data from Hive table
orderDF=spark.read.table("retail_db_txt.orders")
orderDF.show()

+--------+--------------------+-----------+---------------+
|order_id|          order_date|customer_id|         status|
+--------+--------------------+-----------+---------------+
|       1|2013-07-25 00:00:...|      11599|         CLOSED|
|       2|2013-07-25 00:00:...|        256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|      12111|       COMPLETE|
|       4|2013-07-25 00:00:...|       8827|         CLOSED|
|       5|2013-07-25 00:00:...|      11318|       COMPLETE|
|       6|2013-07-25 00:00:...|       7130|       COMPLETE|
|       7|2013-07-25 00:00:...|       4530|       COMPLETE|
|       8|2013-07-25 00:00:...|       2911|     PROCESSING|
|       9|2013-07-25 00:00:...|       5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|       5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|        918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|       1837|         CLOSED|
|      13|2013-07-25 00:00:...|       9149|PENDING_PAYMENT|
|      14|2013-07-25 00:00:...|       98

In [11]:
orderDF.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- status: string (nullable = true)



#using spark sql
orderDF = spark.sql("select * from retail_db_txt.orders")
orderDF.printSchema()

### Create Data Frames using JDBC

<p>Spark also lets us read data from relational databases over JDBC.</p>
<ul>
<li>We need to make sure the JDBC driver jar is registered using  <code>--packages</code>  or  <code>--jars</code>  and  <code>--driver-class-path</code>  while launching pyspark</li>
<li>In PyCharm, we need to copy the relevant JDBC jar file to SPARK_HOME/jars</li>
<li>We can either use spark.read.format('jdbc') with options, or spark.read.jdbc with the JDBC URL, table name and other properties passed as a dict, to read data from remote relational databases.</li>
<li>We can pass a table name or a query to read data over JDBC into a Data Frame</li>
<li>While reading data, we can define the number of partitions (using numPartitions) and the criteria to divide data into partitions (partitionColumn, lowerBound, upperBound)</li>
<li>Partitioning can be done only on numeric fields (Spark 2.4 and later also accept date or timestamp columns)</li>
<li>If lowerBound and upperBound are specified, Spark generates strides based on the number of partitions and still reads the entire table; the bounds decide where the partitions are split, they do not filter rows. Here is an example
<ul>
<li>We are trying to read order_items data with 4 as numPartitions</li>
<li>partitionColumn – order_item_order_id</li>
<li>lowerBound – 10000</li>
<li>upperBound – 20000</li>
<li>order_item_order_id is in the range of 1 and 68883</li>
<li>But as we define lowerBound as 10000 and upperBound as 20000, the strides will be: 1 to 12499, 12500 to 14999, 15000 to 17499, and 17500 up to the maximum order_item_order_id</li>
<li>You can verify this by writing the data frame out and checking the size of each part file in the output path</li>
</ul>
</li>
</ul>
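
The stride arithmetic above can be sketched in plain Python. The function below is modelled on how Spark 2.x appears to derive one WHERE predicate per partition; it is an approximation written for illustration, not Spark's own code, and the function name is made up. It assumes numPartitions is at least 2.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one WHERE predicate per partition from the bounds."""
    # Stride is computed with integer division on each bound
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            predicates.append(f"{upper} or {column} is null")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates

for predicate in jdbc_partition_predicates("order_item_order_id", 10000, 20000, 4):
    print(predicate)
```

The first and last predicates are open-ended, which is why rows below lowerBound and above upperBound are still read.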

In [None]:
#create a DB user in mysql with privileges
$sudo mysql -u root -p
:root

CREATE USER 'mysql'@'localhost' IDENTIFIED BY 'mysql';
GRANT ALL PRIVILEGES ON *.* TO 'mysql'@'localhost' WITH GRANT OPTION;
CREATE USER 'mysql'@'%' IDENTIFIED BY 'mysql';
GRANT ALL PRIVILEGES ON *.* TO 'mysql'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;

***

pyspark --master yarn --conf spark.ui.port=12121 --jars /home/pi/jars/mysql-connector-java.jar \\
    --driver-class-path /home/pi/jars/mysql-connector-java.jar

***

In [1]:
#to connect to jdbc
help(spark.read.jdbc)

Help on method jdbc in module pyspark.sql.readwriter:

jdbc(url, table, column=None, lowerBound=None, upperBound=None, numPartitions=None, predicates=None, properties=None) method of pyspark.sql.readwriter.DataFrameReader instance
    Construct a :class:`DataFrame` representing the database table named ``table``
    accessible via JDBC URL ``url`` and connection ``properties``.
    
    Partitions of the table will be retrieved in parallel if either ``column`` or
    ``predicates`` is specified. ``lowerBound`, ``upperBound`` and ``numPartitions``
    is needed when ``column`` is specified.
    
    If both ``column`` and ``predicates`` are specified, ``column`` will be used.
    
    .. note:: Don't create too many partitions in parallel on a large cluster;
        otherwise Spark might crash your external database systems.
    
    :param url: a JDBC URL of the form ``jdbc:subprotocol:subname``
    :param table: the name of the table
    :param column: the name of a column of numeric,

In [2]:
#use spark.read.format
order_items = spark.read.\
                format('jdbc'). \
                option('url','jdbc:mysql://raspberrypi1:3306'). \
                option('dbtable','retail_db.order_items'). \
                option('user','mysql'). \
                option('password','mysql'). \
                load()

In [3]:
order_items.printSchema()

root
 |-- order_item_id: integer (nullable = true)
 |-- order_item_order_id: integer (nullable = true)
 |-- order_item_product_id: integer (nullable = true)
 |-- order_item_quantity: integer (nullable = true)
 |-- order_item_subtotal: double (nullable = true)
 |-- order_item_product_price: double (nullable = true)



In [4]:
order_items.show()

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

In [5]:
#spark.read.jdbc
order_items_jdbc = spark.read.jdbc('jdbc:mysql://raspberrypi1:3306','retail_db.order_items', \
                                  properties={'user':'mysql','password':'mysql'})

In [6]:
order_items_jdbc.show()

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

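*Options that control how many partitions (parallel JDBC connections) are used for the read:*

- numPartitions: number of partitions to read in parallel
- partitionColumn: column used to split the data; it must be numeric (Spark 2.4 and later also accept date or timestamp columns)
- lowerBound: lower end of the range used to compute the strides
- upperBound: upper end of the range used to compute the strides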

In [7]:
#use spark.read.format
order_items = spark.read.\
                format('jdbc'). \
                option('url','jdbc:mysql://raspberrypi1:3306'). \
                option('dbtable','retail_db.order_items'). \
                option('user','mysql'). \
                option('password','mysql'). \
                option('partitionColumn','order_item_order_id'). \
                option('lowerBound',10000). \
                option('upperBound',20000). \
                option('numPartitions',4). \
                load()

In [8]:
order_items.printSchema()

root
 |-- order_item_id: integer (nullable = true)
 |-- order_item_order_id: integer (nullable = true)
 |-- order_item_product_id: integer (nullable = true)
 |-- order_item_quantity: integer (nullable = true)
 |-- order_item_subtotal: double (nullable = true)
 |-- order_item_product_price: double (nullable = true)



In [9]:
order_items.show()

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

In [14]:
#write to csv file
order_items.write.csv("/user/pi/retail_db_csv/order_items")

In [15]:
! hdfs dfs -ls /user/pi/retail_db_csv/order_items/*

2020-05-31 19:48:05,098 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   2 pi supergroup          0 2020-05-31 19:47 /user/pi/retail_db_csv/order_items/_SUCCESS
-rw-r--r--   2 pi supergroup     936524 2020-05-31 19:47 /user/pi/retail_db_csv/order_items/part-00000-904cca22-2ea2-4103-903e-f7cdcdbdc0b4-c000.csv
-rw-r--r--   2 pi supergroup     196316 2020-05-31 19:47 /user/pi/retail_db_csv/order_items/part-00001-904cca22-2ea2-4103-903e-f7cdcdbdc0b4-c000.csv
-rw-r--r--   2 pi supergroup     192627 2020-05-31 19:47 /user/pi/retail_db_csv/order_items/part-00002-904cca22-2ea2-4103-903e-f7cdcdbdc0b4-c000.csv
-rw-r--r--   2 pi supergroup    4083413 2020-05-31 19:47 /user/pi/retail_db_csv/order_items/part-00003-904cca22-2ea2-4103-903e-f7cdcdbdc0b4-c000.csv


*Spark takes the difference between upperBound (20000) and lowerBound (10000) and divides it by numPartitions (4), so each stride covers 2500 values. The bounds do not filter rows: the first partition reads everything below 12500 (10000 + 2500), not just values from 10000 onwards, and the last partition reads everything from 17500 up to the maximum order id. This is why the first and last files are much larger than the two in the middle.*
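
A rough way to see the skew is to count how many order ids fall into each stride. The plain-Python sketch below does this for the split points 12500, 15000 and 17500, assuming one id per integer from 1 to 68883 as a proxy for row counts (the real number of order_items rows per order varies). The second call uses a hypothetical alternative, lowerBound=1 and upperBound=68883, whose split points were worked out by hand with the same stride rule.

```python
from bisect import bisect_right

def rows_per_partition(boundaries, min_id, max_id):
    """Count how many ids in [min_id, max_id] land in each partition."""
    counts = [0] * (len(boundaries) + 1)
    for order_id in range(min_id, max_id + 1):
        # bisect_right gives the index of the partition this id falls into
        counts[bisect_right(boundaries, order_id)] += 1
    return counts

# Split points produced by lowerBound=10000, upperBound=20000, numPartitions=4
skewed = rows_per_partition([12500, 15000, 17500], 1, 68883)

# Hypothetical alternative: bounds that match the real id range 1 to 68883
balanced = rows_per_partition([17221, 34441, 51661], 1, 68883)

print(skewed)    # [12499, 2500, 2500, 51384]
print(balanced)  # [17220, 17220, 17220, 17223]
```

The skewed counts follow the same pattern as the part file sizes listed above, which suggests that bounds close to the real minimum and maximum of the partition column give more even partitions.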

### Data Frame Operations - Overview

<p>Let us get an overview of Data Frame operations. This is one of the 2 ways we can process Data Frames.</p>
<ul>
<li>Selection or Projection – select</li>
<li>Filtering data – filter or where</li>
<li>Joins – join (supports outer join as well)</li>
<li>Aggregations – groupBy and agg with support of functions such as sum, avg, min, max etc</li>
<li>Sorting – sort or orderBy</li>
<li>Analytics Functions – aggregations, ranking and windowing functions</li>
</ul>

### Spark SQL - Overview

<p>We can also use Spark SQL to process data in data frames.</p>
<ul>
<li>We can get the list of tables by using  <code>spark.sql('show tables')</code>
</li>
<li>We can register a data frame as a temporary view using  <code>df.createTempView("view_name")</code>
</li>
<li>The output of show tables includes temporary views as well</li>
<li>Once a temp view is created, we can use SQL-style syntax and run queries against the tables/views</li>
<li>Most Hive queries will work out of the box</li>
</ul>

### Overview of Functions to manipulate data in Data Frame fields or columns

<p>Let us quickly look into some of the functions available in Data Frames.</p>
<ul>
<li>The main module for functions is pyspark.sql.functions</li>
<li>We can import it by saying  <code>from pyspark.sql import functions as fn</code>
</li>
<li>You will see many functions which are similar to the functions in traditional databases.</li>
<li>These can be categorized into
<ul>
<li>String manipulation</li>
<li>Date manipulation</li>
<li>Type casting</li>
<li>Expressions such as case when</li>
</ul>
</li>
<li>We will see some of the functions in action
<ul>
<li>substring</li>
<li>lower, upper</li>
<li>trim</li>
<li>date_format</li>
<li>trunc</li>
<li>Type Casting</li>
<li>case when</li>
</ul>
</li>
</ul>
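
To give a feel for what these functions compute, the sketch below applies plain-Python equivalents to one order record. It is an analogy only: each comment names the Spark function being imitated, while the code itself uses the Python standard library. The padded status value and the DONE/OPEN labels are made up for illustration.

```python
from datetime import datetime

# One record from the orders data shown earlier in this section
order_date = "2013-07-25 00:00:00.0"
order_status = " PENDING_PAYMENT "  # padded here to show the effect of trim

# substring(order_date, 1, 4): 1-based start position, length 4
year_part = order_date[0:4]

# trim, then lower
status_clean = order_status.strip()
status_lower = status_clean.lower()

# date_format(order_date, 'yyyyMM')
parsed = datetime.strptime(order_date, "%Y-%m-%d %H:%M:%S.%f")
year_month = parsed.strftime("%Y%m")

# trunc(order_date, 'MM'): first day of the month
month_start = parsed.date().replace(day=1)

# case when order_status in ('CLOSED', 'COMPLETE') then 'DONE' else 'OPEN' end
bucket = "DONE" if status_clean in ("CLOSED", "COMPLETE") else "OPEN"

# cast(year_part as int)
year_as_int = int(year_part)

print(year_part, status_lower, year_month, month_start, bucket, year_as_int)
```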

In [10]:
from pyspark.sql import functions as fn

In [7]:
spark.sql('show functions').show(1000)

+--------------------+
|            function|
+--------------------+
|                   !|
|                   %|
|                   &|
|                   *|
|                   +|
|                   -|
|                   /|
|                   <|
|                  <=|
|                 <=>|
|                   =|
|                  ==|
|                   >|
|                  >=|
|                   ^|
|                 abs|
|                acos|
|          add_months|
|           aggregate|
|                 and|
|approx_count_dist...|
|   approx_percentile|
|               array|
|      array_contains|
|      array_distinct|
|        array_except|
|     array_intersect|
|          array_join|
|           array_max|
|           array_min|
|      array_position|
|        array_remove|
|        array_repeat|
|          array_sort|
|         array_union|
|      arrays_overlap|
|          arrays_zip|
|               ascii|
|                asin|
|         assert_true|
|          

#### Available functions for row level manipulation
|1|2|3|4|5|6|
|-----|-----|-----|-----|-----|-----|
|abs|element_at|posexplode_outer|ceil|least|sqrt|
|acos|encode|pow|coalesce|length|stddev|
|add_months|exp|quarter|col|levenshtein|stddev_pop|
|approxCountDistinct|explode|radians|collect_list|lit|stddev_samp|
|approx_count_distinct|explode_outer|rand|collect_set|locate|struct|
|array|expm1|randn|column|log|substring|
|array_contains|expr|rank|concat|log10|substring_index|
|array_distinct|factorial|regexp_extract|concat_ws|log1p|sum|
|array_except|first|regexp_replace|conv|log2|sumDistinct|
|array_intersect|flatten|repeat|corr|lower|tan|
|array_join|floor|reverse|cos|lpad|tanh|
|array_max|format_number|rint|cosh|ltrim|toDegrees|
|array_min|format_string|round|count|map|toRadians|
|array_position|from_json|row_number|countDistinct|map_concat|to_date|
|array_remove|from_unixtime|rpad|covar_pop|map_from_arrays|to_json|
|array_repeat|from_utc_timestamp|rtrim|covar_samp|map_from_entries|to_timestamp|
|array_sort|get_json_object|schema_of_json|crc32|map_keys|to_utc_timestamp|
|array_union|greatest|second|cume_dist|map_values|translate|
|arrays_overlap|grouping|sequence|currentRow|max|trim|
|arrays_zip|grouping_id|sha1|current_date|md5|trunc|
|asc|hash|sha2|current_timestamp|mean|typedLit|
|asc_nulls_first|hex|shiftLeft|date_add|min|udf|
|asc_nulls_last|hour|shiftRight|date_format|minute|unbase64|
|ascii|hypot|shiftRightUnsigned|date_sub|monotonicallyIncreasingId|unboundedFollowing|
|asin|initcap|shuffle|date_trunc|monotonically_increasing_id|unboundedPreceding|
|atan|input_file_name|signum|datediff|month|unhex|
|atan2|instr|sin|dayofmonth|months_between|unix_timestamp|
|avg|isnan|sinh|dayofweek|nanvl|upper|
|base64|isnull|size|dayofyear|negate|var_pop|
|bin|json_tuple|skewness|decode|next_day|var_samp|
|bitwiseNOT|kurtosis|slice|degrees|not|variance|
|broadcast|lag|sort_array|dense_rank|ntile|weekofyear|
|bround|last|soundex|desc|percent_rank|when|
|callUDF|last_day|spark_partition_id|desc_nulls_first|pmod|window|
|cbrt|lead|split|desc_nulls_last|posexplode|year|