# Dataframe Deep Dive (Part 2)

In [1]:
spark

In [2]:
sc

## Prerequisites

In [3]:
# List the available databases
spark.sql('SHOW databases').show()

23/06/06 17:00:39 INFO HiveConf: Found configuration file file:/etc/spark/conf.dist/hive-site.xml
23/06/06 17:00:39 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
23/06/06 17:00:40 INFO AWSGlueClientFactory: Using region from ec2 metadata : us-east-2


+--------------------+
|           namespace|
+--------------------+
|db_youtube_analytics|
| db_youtube_cleansed|
|      db_youtube_raw|
|             default|
|        dev_feedback|
|         my_db_spark|
+--------------------+



In [4]:
spark.sql('USE my_db_spark') 

DataFrame[]

In [5]:
spark.sql("SELECT current_database()").show()    # Check which database is currently selected

+------------------+
|current_database()|
+------------------+
|       my_db_spark|
+------------------+



                                                                                

In [6]:
spark.sql('SHOW tables').show()  # Show all the tables within the database 

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



In [7]:
# Create a table 
spark.sql('CREATE TABLE my_db_spark.orders \
               (order_id integer, \
                order_date string, \
                customer_id integer, \
                order_status string)')

spark.sql('SELECT * FROM orders').show()

23/06/06 17:00:44 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
23/06/06 17:00:44 INFO SQLStdHiveAccessController: Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=01e4a922-159f-4e4a-b91e-f1160a2ecd0b, clientType=HIVECLI]
23/06/06 17:00:44 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
23/06/06 17:00:44 INFO AWSCatalogMetastoreClient: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
23/06/06 17:00:44 INFO AWSGlueClientFactory: Using region from ec2 metadata : us-east-2
23/06/06 17:00:45 INFO AWSGlue

+--------+----------+-----------+------------+
|order_id|order_date|customer_id|order_status|
+--------+----------+-----------+------------+
+--------+----------+-----------+------------+



In [8]:
# Insert data into the table

# Step 1: Load the CSV file into a DataFrame
data_set = 's3://fcc-spark-example/dataset/2023/orders.csv'
df = spark.read.csv(data_set, header=True, inferSchema=True)

# Step 2: Create a temp view (temp views are session-scoped, so the name takes no database prefix)
df.createOrReplaceTempView('temp_table')

# Step 3: Insert the data into the table from the temp view (columns are matched by position)
spark.sql("INSERT INTO orders \
            SELECT * \
            FROM temp_table")

# Step 4: Drop the temp view
spark.sql("DROP VIEW temp_table")

23/06/06 17:00:56 INFO log: Updating table stats fast for orders                
23/06/06 17:00:56 INFO log: Updated size of table orders to 2862089


DataFrame[]

In [9]:
spark.sql('SHOW tables').show()  # Show all the tables within the database 

+-----------+---------+-----------+
|  namespace|tableName|isTemporary|
+-----------+---------+-----------+
|my_db_spark|   orders|      false|
+-----------+---------+-----------+



We now have a table that we can use to create a Dataframe.

## Creating Dataframe 

We can create a Dataframe in several ways:
    
    - spark.read
    - spark.sql()
    - spark.table()
    - spark.createDataFrame() -> mostly for testing
    - spark.range()           -> mostly for testing
    - from an RDD, using rdd.toDF() or spark.createDataFrame()

### 1. Using `spark.read`

Reading from a file/folder

In [10]:
data_set = 's3://fcc-spark-example/dataset/2023/orders.csv'

df = (spark.read
           .format('csv')
           .option('header', 'true')
           .option('inferSchema', 'true')
           .load(data_set)
     )

                                                                                

In [11]:
df.show(5)

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
+--------+-------------------+-----------------+---------------+
only showing top 5 rows



### 2. Using `spark.sql()`

From a Table

In [12]:
spark.sql('SHOW databases').show()

+--------------------+
|           namespace|
+--------------------+
|db_youtube_analytics|
| db_youtube_cleansed|
|      db_youtube_raw|
|             default|
|        dev_feedback|
|         my_db_spark|
+--------------------+



In [13]:
spark.sql('USE my_db_spark') 

DataFrame[]

In [14]:
spark.sql('SHOW tables').show()

+-----------+---------+-----------+
|  namespace|tableName|isTemporary|
+-----------+---------+-----------+
|my_db_spark|   orders|      false|
+-----------+---------+-----------+



In [15]:
# Let's create a DF from this table

df = spark.sql('SELECT * \
              FROM orders')

In [16]:
df.show(5)

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|   order_status|
+--------+-------------------+-----------+---------------+
|       1|2013-07-25 00:00:00|      11599|         CLOSED|
|       2|2013-07-25 00:00:00|        256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|      12111|       COMPLETE|
|       4|2013-07-25 00:00:00|       8827|         CLOSED|
|       5|2013-07-25 00:00:00|      11318|       COMPLETE|
+--------+-------------------+-----------+---------------+
only showing top 5 rows



### 3. Using `spark.table()`

From a Table

In [17]:
df = spark.table('orders')

In [18]:
df.show(5)

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|   order_status|
+--------+-------------------+-----------+---------------+
|       1|2013-07-25 00:00:00|      11599|         CLOSED|
|       2|2013-07-25 00:00:00|        256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|      12111|       COMPLETE|
|       4|2013-07-25 00:00:00|       8827|         CLOSED|
|       5|2013-07-25 00:00:00|      11318|       COMPLETE|
+--------+-------------------+-----------+---------------+
only showing top 5 rows



### 4. Using `spark.range()`

Mostly for testing purposes

In [19]:
df = spark.range(10)
df.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+



In [20]:
df = spark.range(1, 10, 2)
df.show()

+---+
| id|
+---+
|  1|
|  3|
|  5|
|  7|
|  9|
+---+



### 5. Using `spark.createDataFrame()`

Again, mostly for testing/development purposes

In [21]:
my_data = [ (1, "2013-07-25 00:00:00",2561,"PENDING_PAYMENT"),
            (2, "2013-07-25 00:00:00",1211,"COMPLETE"),
            (3, "2013-07-25 00:00:00",8827,"CLOSED"),
            (4, "2013-07-25 00:00:00",1131,"COMPLETE"),
            (5, "2013-07-25 00:00:00",1000,"COMPLETE") ]

df = spark.createDataFrame(my_data)

In [22]:
df.show()

+---+-------------------+----+---------------+
| _1|                 _2|  _3|             _4|
+---+-------------------+----+---------------+
|  1|2013-07-25 00:00:00|2561|PENDING_PAYMENT|
|  2|2013-07-25 00:00:00|1211|       COMPLETE|
|  3|2013-07-25 00:00:00|8827|         CLOSED|
|  4|2013-07-25 00:00:00|1131|       COMPLETE|
|  5|2013-07-25 00:00:00|1000|       COMPLETE|
+---+-------------------+----+---------------+



In [23]:
df.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)
 |-- _4: string (nullable = true)



Let's fix the column names and enforce the data types

#### 1. Fixing `column names`

In [24]:
my_data = [ (1, "2013-07-25 00:00:00",2561,"PENDING_PAYMENT"),
            (2, "2013-07-25 00:00:00",1211,"COMPLETE"),
            (3, "2013-07-25 00:00:00",8827,"CLOSED"),
            (4, "2013-07-25 00:00:00",1131,"COMPLETE"),
            (5, "2013-07-25 00:00:00",1000,"COMPLETE") ]

df = spark.createDataFrame(my_data).toDF('order_id', 'order_date', 'customer_id', 'status')

In [25]:
df.show()

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|         status|
+--------+-------------------+-----------+---------------+
|       1|2013-07-25 00:00:00|       2561|PENDING_PAYMENT|
|       2|2013-07-25 00:00:00|       1211|       COMPLETE|
|       3|2013-07-25 00:00:00|       8827|         CLOSED|
|       4|2013-07-25 00:00:00|       1131|       COMPLETE|
|       5|2013-07-25 00:00:00|       1000|       COMPLETE|
+--------+-------------------+-----------+---------------+



In [26]:
df = df.toDF('A', 'B', 'C', 'D')
df.show()

+---+-------------------+----+---------------+
|  A|                  B|   C|              D|
+---+-------------------+----+---------------+
|  1|2013-07-25 00:00:00|2561|PENDING_PAYMENT|
|  2|2013-07-25 00:00:00|1211|       COMPLETE|
|  3|2013-07-25 00:00:00|8827|         CLOSED|
|  4|2013-07-25 00:00:00|1131|       COMPLETE|
|  5|2013-07-25 00:00:00|1000|       COMPLETE|
+---+-------------------+----+---------------+



In [27]:
schema = ['order_id', 'order_date', 'customer_id', 'status']
df = spark.createDataFrame(my_data, schema)
df.show()

+--------+-------------------+-----------+---------------+
|order_id|         order_date|customer_id|         status|
+--------+-------------------+-----------+---------------+
|       1|2013-07-25 00:00:00|       2561|PENDING_PAYMENT|
|       2|2013-07-25 00:00:00|       1211|       COMPLETE|
|       3|2013-07-25 00:00:00|       8827|         CLOSED|
|       4|2013-07-25 00:00:00|       1131|       COMPLETE|
|       5|2013-07-25 00:00:00|       1000|       COMPLETE|
+--------+-------------------+-----------+---------------+



In [28]:
df.printSchema()

root
 |-- order_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- status: string (nullable = true)



#### 2. Fixing `schema`

In [29]:
from pyspark.sql.types import IntegerType, LongType, StringType, StructField, StructType

orders_schema = (StructType()
                 .add(StructField('order_id', LongType()))
                 .add(StructField('order_date', StringType()))
                 .add(StructField('order_customer_id', IntegerType()))
                 .add(StructField('order_status', StringType()))
                )

df = spark.createDataFrame(my_data, orders_schema)

In [30]:
df.show()

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|             2561|PENDING_PAYMENT|
|       2|2013-07-25 00:00:00|             1211|       COMPLETE|
|       3|2013-07-25 00:00:00|             8827|         CLOSED|
|       4|2013-07-25 00:00:00|             1131|       COMPLETE|
|       5|2013-07-25 00:00:00|             1000|       COMPLETE|
+--------+-------------------+-----------------+---------------+



In [31]:
df.printSchema()

root
 |-- order_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [32]:
import pyspark.sql.functions as F

df_new = df.withColumn('order_date', F.to_timestamp(F.col('order_date')))
df_new.show(5)

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|             2561|PENDING_PAYMENT|
|       2|2013-07-25 00:00:00|             1211|       COMPLETE|
|       3|2013-07-25 00:00:00|             8827|         CLOSED|
|       4|2013-07-25 00:00:00|             1131|       COMPLETE|
|       5|2013-07-25 00:00:00|             1000|       COMPLETE|
+--------+-------------------+-----------------+---------------+



In [33]:
df_new.printSchema()

root
 |-- order_id: long (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



### Clean-up (Drop the `table`)

In [34]:
spark.sql('DROP table orders')

23/06/06 17:01:07 INFO GlueMetastoreClientDelegate: Initiating drop table partitions


DataFrame[]

In [35]:
spark.sql('SHOW tables').show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



### Creating a Dataframe from an RDD

In [36]:
data_set = 's3://fcc-spark-example/dataset/2023/orders/orders_2.csv'

rdd = sc.textFile(data_set)

In [37]:
rdd.take(5)

                                                                                

['1,07-25-2013,11599,CLOSED',
 '2,07-25-2013,256,PENDING_PAYMENT',
 '3,07-25-2013,12111,COMPLETE',
 '4,07-25-2013,8827,CLOSED',
 '5,07-25-2013,11318,COMPLETE']

In [40]:
def parse_order(line):
    # Split each CSV line once, then cast the two numeric fields
    fields = line.split(',')
    return (int(fields[0]), fields[1], int(fields[2]), fields[3])

rdd2 = rdd.map(parse_order)

In [41]:
rdd2.take(5)

                                                                                

[(1, '07-25-2013', 11599, 'CLOSED'),
 (2, '07-25-2013', 256, 'PENDING_PAYMENT'),
 (3, '07-25-2013', 12111, 'COMPLETE'),
 (4, '07-25-2013', 8827, 'CLOSED'),
 (5, '07-25-2013', 11318, 'COMPLETE')]

In [44]:
orders_schema = (StructType()
                 .add(StructField('order_id', LongType()))
                 .add(StructField('order_date', StringType()))
                 .add(StructField('order_customer_id', IntegerType()))
                 .add(StructField('order_status', StringType()))
                )

# One way
df = rdd2.toDF(orders_schema)

In [46]:
df.show(5)

+--------+----------+-----------------+---------------+
|order_id|order_date|order_customer_id|   order_status|
+--------+----------+-----------------+---------------+
|       1|07-25-2013|            11599|         CLOSED|
|       2|07-25-2013|              256|PENDING_PAYMENT|
|       3|07-25-2013|            12111|       COMPLETE|
|       4|07-25-2013|             8827|         CLOSED|
|       5|07-25-2013|            11318|       COMPLETE|
+--------+----------+-----------------+---------------+
only showing top 5 rows



                                                                                

In [47]:
df.printSchema()

root
 |-- order_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [48]:
# Another way
df = spark.createDataFrame(rdd2, orders_schema)
df.show(5)

+--------+----------+-----------------+---------------+
|order_id|order_date|order_customer_id|   order_status|
+--------+----------+-----------------+---------------+
|       1|07-25-2013|            11599|         CLOSED|
|       2|07-25-2013|              256|PENDING_PAYMENT|
|       3|07-25-2013|            12111|       COMPLETE|
|       4|07-25-2013|             8827|         CLOSED|
|       5|07-25-2013|            11318|       COMPLETE|
+--------+----------+-----------------+---------------+
only showing top 5 rows



                                                                                