<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# DataFrame & Column
1. Construct columns
1. Subset columns
1. Add or replace columns
1. Subset rows
1. Sort rows

##### Methods
- DataFrame (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">Scala</a>): `select`, `selectExpr`, `drop`, `withColumn`, `withColumnRenamed`, `filter`, `distinct`, `limit`, `sort`
- Column (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=column#pyspark.sql.Column" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html" target="_blank">Scala</a>): `alias`, `isin`, `cast`, `isNotNull`, `desc`, operators

In [0]:
%run ./Includes/Classroom-Setup

Let's use the BedBricks events dataset.

In [0]:
eventsDF = spark.read.parquet(eventsPath)
display(eventsDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Construct columns

A **column** is a logical construction that is computed from the data in a DataFrame using an expression.

Construct a new column based on the input columns that exist in a DataFrame.

In [0]:
from pyspark.sql.functions import col

col("device")
eventsDF.device
eventsDF["device"]

Use column objects to form complex expressions

In [0]:
col("ecommerce.purchase_revenue_in_usd") + col("ecommerce.total_item_quantity")
col("event_timestamp").desc()
(col("ecommerce.purchase_revenue_in_usd") * 100).cast("int")

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Subset columns
Use DataFrame transformations to subset columns

#### `select()`
Selects a set of columns or column-based expressions

In [0]:
devicesDF = eventsDF.select("user_id", "device")
display(devicesDF)

user_id,device
UA000000107379500,macOS
UA000000107359357,Windows
UA000000107375547,macOS
UA000000107370581,iOS
UA000000107377108,Windows
UA000000107377161,Windows
UA000000107370851,iOS
UA000000107360961,macOS
UA000000107376205,Android
UA000000107359805,Windows


In [0]:
from pyspark.sql.functions import col

locationsDF = eventsDF.select("user_id", 
  col("geo.city").alias("city"),
  col("geo.state").alias("state"))

display(locationsDF)

user_id,city,state
UA000000107379500,Montrose,MI
UA000000107359357,Northampton,MA
UA000000107375547,Salinas,CA
UA000000107370581,Everett,MA
UA000000107377108,Cottage Grove,MN
UA000000107377161,Medina,MN
UA000000107370851,Mount Pleasant,UT
UA000000107360961,Piedmont,AL
UA000000107376205,Rancho Santa Margarita,CA
UA000000107359805,Elyria,OH


#### `selectExpr()`
Selects a set of SQL expressions

In [0]:
appleDF = eventsDF.selectExpr("user_id", "device in ('macOS', 'iOS') as apple_user")
display(appleDF)

user_id,apple_user
UA000000107379500,True
UA000000107359357,False
UA000000107375547,True
UA000000107370581,True
UA000000107377108,False
UA000000107377161,False
UA000000107370851,True
UA000000107360961,True
UA000000107376205,False
UA000000107359805,False


#### `drop()`
Returns a new DataFrame without the given column, specified as a string or a column object.

To drop multiple columns, pass their names as strings.

In [0]:
anonymousDF = eventsDF.drop("user_id", "geo", "device")
display(anonymousDF)

ecommerce,event_name,event_previous_timestamp,event_timestamp,items,traffic_source,user_first_touch_timestamp
"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,List(),google,1593878899217692
"List(null, null, null)",press,1593876662175340.0,1593877011756535,List(),google,1593876662175340
"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030
"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,List(),facebook,1593877903116176
"List(null, null, null)",mattresses,,1593878628143633,List(),google,1593878628143633
"List(null, null, null)",main,,1593878634344194,List(),youtube,1593878634344194
"List(null, null, null)",main,,1593877936171803,List(),direct,1593877936171803
"List(null, null, null)",main,,1593876843215329,List(),instagram,1593876843215329
"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,List(),instagram,1593878529774474
"List(null, null, null)",main,,1593876713246514,List(),facebook,1593876713246514


In [0]:
noSalesDF = eventsDF.drop(col("ecommerce"))
display(noSalesDF)

device,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
macOS,warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Add or replace columns
Use DataFrame transformations to add or replace columns

#### `withColumn()`
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

In [0]:
mobileDF = eventsDF.withColumn("mobile", col("device").isin("iOS", "Android"))
display(mobileDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,mobile
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500,False
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357,False
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547,False
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581,True
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108,False
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161,False
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851,True
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961,False
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205,True
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805,False


In [0]:
purchaseQuantityDF = eventsDF.withColumn("purchase_quantity", col("ecommerce.total_item_quantity").cast("int"))
purchaseQuantityDF.printSchema()

#### `withColumnRenamed()`
Returns a new DataFrame with a column renamed.

In [0]:
locationDF = eventsDF.withColumnRenamed("geo", "location")
display(locationDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,location,items,traffic_source,user_first_touch_timestamp,user_id
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Subset rows
Use DataFrame transformations to subset rows

#### `filter()`
Filters rows using the given SQL expression or column-based condition.

In [0]:
purchasesDF = eventsDF.filter("ecommerce.total_item_quantity > 0")
display(purchasesDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
Linux,"List(1195.0, 1, 1)",finalize,1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263
iOS,"List(1045.0, 1, 1)",finalize,1593878485345763,1593878487460247,"List(Detroit, MI)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",facebook,1593877230282722,UA000000107364432
Android,"List(595.0, 1, 1)",finalize,1593877930076602,1593878966392505,"List(East Chicago, IN)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876889575474,UA000000107361347
iOS,"List(2290.0, 2, 2)",finalize,1593877650094042,1593877652106953,"List(Warwick, RI)","List(List(null, M_PREM_F, Premium Full Mattress, 1695.0, 1695.0, 1), List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876687337581,UA000000107359573
macOS,"List(945.0, 1, 1)",finalize,1593879151529456,1593879197837168,"List(Boonville, MO)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",facebook,1593878603312910,UA000000107376872
Windows,"List(595.0, 1, 1)",finalize,1593877908876473,1593878020119079,"List(Hampton, VA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593877033894464,UA000000107362622
Android,"List(945.0, 1, 1)",finalize,1593878355764861,1593878641498265,"List(White Bear Lake, MN)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",direct,1593877080764516,UA000000107363039
Chrome OS,"List(1095.0, 1, 1)",finalize,1593879073813036,1593879191730221,"List(San Antonio, TX)","List(List(null, M_PREM_T, Premium Twin Mattress, 1095.0, 1095.0, 1))",instagram,1593877153633764,UA000000107363715
macOS,"List(1045.0, 1, 1)",finalize,1593877425584678,1593877429621158,"List(Searcy, AR)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",direct,1593876851338276,UA000000107361027
iOS,"List(1045.0, 1, 1)",finalize,1593878984623390,1593879046209960,"List(Southport, IN)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",instagram,1593876574686487,UA000000107358614


In [0]:
revenueDF = eventsDF.filter(col("ecommerce.purchase_revenue_in_usd").isNotNull())
display(revenueDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
Linux,"List(1195.0, 1, 1)",finalize,1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263
iOS,"List(1045.0, 1, 1)",finalize,1593878485345763,1593878487460247,"List(Detroit, MI)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",facebook,1593877230282722,UA000000107364432
Android,"List(595.0, 1, 1)",finalize,1593877930076602,1593878966392505,"List(East Chicago, IN)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876889575474,UA000000107361347
iOS,"List(2290.0, 2, 2)",finalize,1593877650094042,1593877652106953,"List(Warwick, RI)","List(List(null, M_PREM_F, Premium Full Mattress, 1695.0, 1695.0, 1), List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876687337581,UA000000107359573
macOS,"List(945.0, 1, 1)",finalize,1593879151529456,1593879197837168,"List(Boonville, MO)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",facebook,1593878603312910,UA000000107376872
Windows,"List(595.0, 1, 1)",finalize,1593877908876473,1593878020119079,"List(Hampton, VA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593877033894464,UA000000107362622
Android,"List(945.0, 1, 1)",finalize,1593878355764861,1593878641498265,"List(White Bear Lake, MN)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",direct,1593877080764516,UA000000107363039
Chrome OS,"List(1095.0, 1, 1)",finalize,1593879073813036,1593879191730221,"List(San Antonio, TX)","List(List(null, M_PREM_T, Premium Twin Mattress, 1095.0, 1095.0, 1))",instagram,1593877153633764,UA000000107363715
macOS,"List(1045.0, 1, 1)",finalize,1593877425584678,1593877429621158,"List(Searcy, AR)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",direct,1593876851338276,UA000000107361027
iOS,"List(1045.0, 1, 1)",finalize,1593878984623390,1593879046209960,"List(Southport, IN)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",instagram,1593876574686487,UA000000107358614


In [0]:
androidDF = eventsDF.filter((col("traffic_source") != "direct") & (col("device") == "Android"))
display(androidDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Android,"List(null, null, null)",down,1593879057792999.0,1593879125815755,"List(Jackson, MO)",List(),facebook,1593879057792999,UA000000107380961
Android,"List(null, null, null)",cart,1593878887634182.0,1593878899159806,"List(Fayetteville, AR)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",google,1593877484784180,UA000000107366779
Android,"List(null, null, null)",warranty,1593877962951723.0,1593878620744974,"List(North Canton, OH)",List(),instagram,1593877962951723,UA000000107371070
Android,"List(null, null, null)",checkout,1593879068779767.0,1593879230950221,"List(Philadelphia, PA)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",facebook,1593878282510044,UA000000107374020
Android,"List(null, null, null)",mattresses,,1593876443386829,"List(Richmond, KY)",List(),google,1593876443386829,UA000000107357480
Android,"List(null, null, null)",mattresses,1593878139892679.0,1593879240812006,"List(Portland, OR)",List(),facebook,1593878139892679,UA000000107372672
Android,"List(null, null, null)",warranty,1593876568777492.0,1593878762928254,"List(San Diego, CA)",List(),email,1593876568777492,UA000000107358569
Android,"List(null, null, null)",original,1593878473561670.0,1593878535553523,"List(Chicago, IL)",List(),google,1593877936278608,UA000000107370853
Android,"List(null, null, null)",mattresses,1593877715559380.0,1593877756506929,"List(Benicia, CA)",List(),google,1593877707976765,UA000000107368830


#### `dropDuplicates()`
Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns.

##### Alias: `distinct` (when no column subset is given)

In [0]:
eventsDF.distinct()

In [0]:
distinctUsersDF = eventsDF.dropDuplicates(["user_id"])
display(distinctUsersDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
iOS,"List(null, null, null)",checkout,1592547736518007,1592548321455992,"List(San Bruno, CA)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592196947865522,UA000000102357807
Android,"List(null, null, null)",add_item,1592661874471858,1592662003391884,"List(Covington, LA)","List(List(NEWBED10, M_PREM_Q, Premium Queen Mattress, 1615.5, 1795.0, 1))",email,1592197275580686,UA000000102357841
Android,"List(null, null, null)",add_item,1592573713168269,1592574347642610,"List(Mobile, AL)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592198812458125,UA000000102358054
macOS,"List(null, null, null)",shipping_info,1592545562314108,1592545941007576,"List(Largo, FL)","List(List(NEWBED10, P_FOAM_S, Standard Foam Pillow, 53.1, 59.0, 1))",email,1592199427202331,UA000000102358165
Android,"List(null, null, null)",cc_info,1592540809346064,1592540992403614,"List(Mandan, ND)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1))",email,1592201255976506,UA000000102358562
Android,"List(1525.5, 1, 1)",finalize,1592611182030868,1592611189419947,"List(Phoenix, AZ)","List(List(NEWBED10, M_PREM_F, Premium Full Mattress, 1525.5, 1695.0, 1))",email,1592201848205824,UA000000102358714
iOS,"List(null, null, null)",register,1592553755082252,1592558423806848,"List(Mounds View, MN)","List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1))",email,1592205037961396,UA000000102359895
iOS,"List(null, null, null)",add_item,1592583618219827,1592584060201694,"List(Gibraltar, MI)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592205125802184,UA000000102359929
macOS,"List(null, null, null)",checkout,1592545840938731,1592546186611855,"List(Escondido, CA)","List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1))",email,1592205287357945,UA000000102360011
Windows,"List(null, null, null)",mattresses,1592206579403592,1592563056918897,"List(New York, NY)",List(),email,1592205427673498,UA000000102360074


#### `limit()`
Returns a new DataFrame by taking the first n rows.

In [0]:
limitDF = eventsDF.limit(100)
display(limitDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Sort rows
Use DataFrame transformations to sort rows

#### `sort()`
Returns a new DataFrame sorted by the given columns or expressions.

##### Alias: `orderBy`

In [0]:
increaseTimestampsDF = eventsDF.sort("event_timestamp")
display(increaseTimestampsDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
iOS,"List(null, null, null)",add_item,1592538844640966.0,1592539200194694,"List(New York, NY)","List(List(NEWBED10, M_STAN_K, Standard King Mattress, 1075.5, 1195.0, 1))",email,1592417776356879,UA000000102987319
iOS,"List(null, null, null)",main,,1592539202466157,"List(Fort Worth, TX)",List(),google,1592539202466157,UA000000103314642
iOS,"List(null, null, null)",email_coupon,1592538695373138.0,1592539202702440,"List(Eau Claire, WI)",List(),google,1592538326799214,UA000000103314437
iOS,"List(850.5, 1, 1)",finalize,1592539096721313.0,1592539205571717,"List(Denver, CO)","List(List(NEWBED10, M_STAN_F, Standard Full Mattress, 850.5, 945.0, 1))",email,1592312151735336,UA000000102640893
Chrome OS,"List(null, null, null)",email_coupon,1592539060624768.0,1592539211071433,"List(South Bend, IN)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",email,1592537824068348,UA000000103314282
Android,"List(null, null, null)",cart,1592539157333154.0,1592539212858607,"List(Miami Beach, FL)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592294681843997,UA000000102604475
Windows,"List(null, null, null)",main,,1592539216257977,"List(Farmington, MN)",List(),google,1592539216257977,UA000000103314643
macOS,"List(null, null, null)",mattresses,,1592539216262230,"List(Waterbury, CT)",List(),direct,1592539216262230,UA000000103314644
Windows,"List(null, null, null)",checkout,1592538308929545.0,1592539217303275,"List(New Orleans, LA)","List(List(NEWBED10, M_STAN_F, Standard Full Mattress, 850.5, 945.0, 1))",email,1592317381269186,UA000000102665598
Android,"List(null, null, null)",mattresses,1592456296935239.0,1592539217839800,"List(Los Angeles, CA)",List(),email,1592456243365765,UA000000103072869


In [0]:
decreaseTimestampsDF = eventsDF.sort(col("event_timestamp").desc())
display(decreaseTimestampsDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
iOS,"List(null, null, null)",press,1593879156340319.0,1593879299923863,"List(Cleveland, OH)",List(),email,1593879156340319,UA000000107381879
Windows,"List(null, null, null)",original,1593877169961586.0,1593879299762861,"List(Columbus, GA)",List(),facebook,1593877169961586,UA000000107363851
macOS,"List(null, null, null)",main,,1593879299756928,"List(Waco, TX)",List(),direct,1593879299756928,UA000000107383227
Windows,"List(null, null, null)",delivery,1593878867523509.0,1593879299750326,"List(Fort Worth, TX)",List(),google,1593878184214633,UA000000107373069
Windows,"List(null, null, null)",main,,1593879299746987,"List(Kansas City, MO)",List(),google,1593879299746987,UA000000107383226
macOS,"List(null, null, null)",pillows,,1593879299724595,"List(Chicago, IL)",List(),youtube,1593879299724595,UA000000107383225
macOS,"List(null, null, null)",main,,1593879299695205,"List(Lynn, MA)",List(),google,1593879299695205,UA000000107383224
Windows,"List(null, null, null)",email_coupon,1593879278078062.0,1593879299560513,"List(Long Beach, MS)",List(),google,1593879278078062,UA000000107383024
iOS,"List(null, null, null)",email_coupon,1593879171423489.0,1593879299402051,"List(New York, NY)",List(),facebook,1593875279959590,UA000000107347409
Chrome OS,"List(null, null, null)",mattresses,,1593879299380376,"List(College Station, TX)",List(),google,1593879299380376,UA000000107383223


In [0]:
increaseSessionsDF = eventsDF.orderBy(["user_first_touch_timestamp", "event_timestamp"])
display(increaseSessionsDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
iOS,"List(null, null, null)",mattresses,1592197539430780,1592547470518302,"List(San Bruno, CA)",List(),email,1592196947865522,UA000000102357807
iOS,"List(null, null, null)",add_item,1592547470518302,1592547472563625,"List(San Bruno, CA)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592196947865522,UA000000102357807
iOS,"List(null, null, null)",cart,1592547472563625,1592547736518007,"List(San Bruno, CA)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592196947865522,UA000000102357807
iOS,"List(null, null, null)",checkout,1592547736518007,1592548321455992,"List(San Bruno, CA)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592196947865522,UA000000102357807
iOS,"List(null, null, null)",register,1592548321455992,1592548833097155,"List(San Bruno, CA)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592196947865522,UA000000102357807
iOS,"List(null, null, null)",shipping_info,1592548833097155,1592548958091573,"List(San Bruno, CA)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592196947865522,UA000000102357807
iOS,"List(null, null, null)",cc_info,1592548958091573,1592549109730675,"List(San Bruno, CA)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592196947865522,UA000000102357807
iOS,"List(535.5, 1, 1)",finalize,1592549109730675,1592549474562691,"List(San Bruno, CA)","List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1592196947865522,UA000000102357807
Android,"List(null, null, null)",mattresses,1592200038926862,1592661874471858,"List(Covington, LA)",List(),email,1592197275580686,UA000000102357841
Android,"List(null, null, null)",add_item,1592661874471858,1592662003391884,"List(Covington, LA)","List(List(NEWBED10, M_PREM_Q, Premium Queen Mattress, 1615.5, 1795.0, 1))",email,1592197275580686,UA000000102357841


In [0]:
decreaseSessionsDF = eventsDF.sort(col("user_first_touch_timestamp").desc(), col("event_timestamp"))
display(decreaseSessionsDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
Chrome OS,"List(null, null, null)",mattresses,1593892853177619.0,1593866462911113,"List(Greenville, SC)",List(),email,1593892583883212,UA000000107499832
Chrome OS,"List(null, null, null)",add_item,1593866462911113.0,1593867027725324,"List(Greenville, SC)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1))",email,1593892583883212,UA000000107499832
Chrome OS,"List(null, null, null)",add_item,1593867027725324.0,1593867139101782,"List(Greenville, SC)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1), List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1593892583883212,UA000000107499832
Chrome OS,"List(null, null, null)",cart,1593867139101782.0,1593867711704705,"List(Greenville, SC)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1), List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1593892583883212,UA000000107499832
Chrome OS,"List(null, null, null)",checkout,1593867711704705.0,1593868024181767,"List(Greenville, SC)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1), List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1593892583883212,UA000000107499832
Chrome OS,"List(null, null, null)",guest,1593868024181767.0,1593868036756815,"List(Greenville, SC)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1), List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1593892583883212,UA000000107499832
Chrome OS,"List(null, null, null)",shipping_info,1593868036756815.0,1593868103134431,"List(Greenville, SC)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1), List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1593892583883212,UA000000107499832
Chrome OS,"List(null, null, null)",cc_info,1593868103134431.0,1593868131013119,"List(Greenville, SC)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1), List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1593892583883212,UA000000107499832
Chrome OS,"List(1521.0, 2, 2)",finalize,1593868131013119.0,1593868183366932,"List(Greenville, SC)","List(List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1), List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",email,1593892583883212,UA000000107499832
Windows,"List(null, null, null)",mattresses,1593884557091316.0,1593851143005145,"List(Birmingham, AL)",List(),email,1593883964023919,UA000000107426715


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Purchase Revenues Lab

Prepare dataset of events with purchase revenue.
1. Extract purchase revenue for each event
2. Filter events where revenue is not null
3. Check what types of events have revenue
4. Drop unneeded column

##### Methods
- DataFrame: `select`, `drop`, `withColumn`, `filter`, `dropDuplicates`
- Column: `isNotNull`

In [0]:
eventsDF = spark.read.parquet(eventsPath)
display(eventsDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### 1. Extract purchase revenue for each event
Add new column **`revenue`** by extracting **`ecommerce.purchase_revenue_in_usd`**

In [0]:
# TODO
from pyspark.sql.functions import col

revenueDF = eventsDF.withColumn("revenue", col("ecommerce.purchase_revenue_in_usd"))
display(revenueDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500,
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357,
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547,
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581,
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108,
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161,
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851,
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961,
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205,
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805,


##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
expected1 = [5830.0, 5485.0, 5289.0, 5219.1, 5180.0, 5175.0, 5125.0, 5030.0, 4985.0, 4985.0]
result1 = [row.revenue for row in revenueDF.sort(col("revenue").desc_nulls_last()).limit(10).collect()]

assert expected1 == result1

### 2. Filter events where revenue is not null
Filter for records where **`revenue`** is not **`null`**

In [0]:
# TODO
purchasesDF = revenueDF.filter(col("revenue").isNotNull())
display(purchasesDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Linux,"List(1195.0, 1, 1)",finalize,1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263,1195.0
iOS,"List(1045.0, 1, 1)",finalize,1593878485345763,1593878487460247,"List(Detroit, MI)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",facebook,1593877230282722,UA000000107364432,1045.0
Android,"List(595.0, 1, 1)",finalize,1593877930076602,1593878966392505,"List(East Chicago, IN)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876889575474,UA000000107361347,595.0
iOS,"List(2290.0, 2, 2)",finalize,1593877650094042,1593877652106953,"List(Warwick, RI)","List(List(null, M_PREM_F, Premium Full Mattress, 1695.0, 1695.0, 1), List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876687337581,UA000000107359573,2290.0
macOS,"List(945.0, 1, 1)",finalize,1593879151529456,1593879197837168,"List(Boonville, MO)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",facebook,1593878603312910,UA000000107376872,945.0
Windows,"List(595.0, 1, 1)",finalize,1593877908876473,1593878020119079,"List(Hampton, VA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593877033894464,UA000000107362622,595.0
Android,"List(945.0, 1, 1)",finalize,1593878355764861,1593878641498265,"List(White Bear Lake, MN)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",direct,1593877080764516,UA000000107363039,945.0
Chrome OS,"List(1095.0, 1, 1)",finalize,1593879073813036,1593879191730221,"List(San Antonio, TX)","List(List(null, M_PREM_T, Premium Twin Mattress, 1095.0, 1095.0, 1))",instagram,1593877153633764,UA000000107363715,1095.0
macOS,"List(1045.0, 1, 1)",finalize,1593877425584678,1593877429621158,"List(Searcy, AR)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",direct,1593876851338276,UA000000107361027,1045.0
iOS,"List(1045.0, 1, 1)",finalize,1593878984623390,1593879046209960,"List(Southport, IN)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",instagram,1593876574686487,UA000000107358614,1045.0


### 3. Check what types of events have revenue
Find unique **`event_name`** values in **`purchasesDF`** in one of two ways:
- Select **`event_name`** and get distinct records
- Drop duplicate records based on **`event_name`** only

Hint: Only one event type is associated with revenue.

In [0]:
# TODO
distinctDF = purchasesDF.dropDuplicates(["event_name"])
display(distinctDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Linux,"List(1195.0, 1, 1)",finalize,1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263,1195.0


### 4. Drop unneeded column
Since there's only one event type, drop **`event_name`** from **`purchasesDF`**.

In [0]:
# TODO
finalDF = purchasesDF.drop("event_name")
display(finalDF)

device,ecommerce,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Linux,"List(1195.0, 1, 1)",1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263,1195.0


### 5. Chain all the steps above, excluding step 3

In [0]:
# TODO
finalDF = (eventsDF
            .withColumn("revenue", col("ecommerce.purchase_revenue_in_usd"))
            .filter(col("revenue").isNotNull())
            .drop("event_name")
          )

display(finalDF)

device,ecommerce,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Linux,"List(1195.0, 1, 1)",1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263,1195.0
iOS,"List(1045.0, 1, 1)",1593878485345763,1593878487460247,"List(Detroit, MI)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",facebook,1593877230282722,UA000000107364432,1045.0
Android,"List(595.0, 1, 1)",1593877930076602,1593878966392505,"List(East Chicago, IN)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876889575474,UA000000107361347,595.0
iOS,"List(2290.0, 2, 2)",1593877650094042,1593877652106953,"List(Warwick, RI)","List(List(null, M_PREM_F, Premium Full Mattress, 1695.0, 1695.0, 1), List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876687337581,UA000000107359573,2290.0
macOS,"List(945.0, 1, 1)",1593879151529456,1593879197837168,"List(Boonville, MO)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",facebook,1593878603312910,UA000000107376872,945.0
Windows,"List(595.0, 1, 1)",1593877908876473,1593878020119079,"List(Hampton, VA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593877033894464,UA000000107362622,595.0
Android,"List(945.0, 1, 1)",1593878355764861,1593878641498265,"List(White Bear Lake, MN)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",direct,1593877080764516,UA000000107363039,945.0
Chrome OS,"List(1095.0, 1, 1)",1593879073813036,1593879191730221,"List(San Antonio, TX)","List(List(null, M_PREM_T, Premium Twin Mattress, 1095.0, 1095.0, 1))",instagram,1593877153633764,UA000000107363715,1095.0
macOS,"List(1045.0, 1, 1)",1593877425584678,1593877429621158,"List(Searcy, AR)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",direct,1593876851338276,UA000000107361027,1045.0
iOS,"List(1045.0, 1, 1)",1593878984623390,1593879046209960,"List(Southport, IN)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",instagram,1593876574686487,UA000000107358614,1045.0


##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert finalDF.count() == 180678

In [0]:
expected_columns = {'device', 'ecommerce', 'event_previous_timestamp', 'event_timestamp', 
                    'geo', 'items', 'revenue', 'traffic_source', 
                    'user_first_touch_timestamp', 'user_id'}
assert set(finalDF.columns) == expected_columns

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup
