# Common Data Engineering Tasks in SQL, Pandas and PySpark

## Data Engineering Tasks:

- Select columns
- Create a new column
- Filter rows
- Join / Merge data
- Union / Append data
- Aggregate Data
- Rank
- Filter based on another dataset
- Temporary storage

## Data Model Overview

We will be using a simple data model with 2 tables:

<p align="center">
    <img src="DataModel.png"> 
</p>

### Select columns

The following code shows how to select columns in SQL, Pandas and PySpark.

#### Select columns in SQL
``` sql
SELECT CustomerID
      ,FirstName
FROM SalesLT.Customer;
```

#### Select columns in Pandas

``` python
import pandas as pd

df_customer_pandas = pd.read_parquet('./datasets/customer.parquet')

df_customer_pandas[["CustomerID", "FirstName"]]
```

|  | CustomerID | FirstName |
| --- | --- | --- |
| 0 | 1 | Orlando |
| 1 | 2 | Keith |
| 2 | 3 | Donna |
| 3 | 4 | Janet |
| 4 | 5 | Lucy |
| ... | ... | ... |
| 842 | 30113 | Raja |
| 843 | 30115 | Dora |
| 844 | 30116 | Wanda |
| 845 | 30117 | Robert |
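
One detail worth knowing when selecting columns in Pandas is that single brackets return a `Series`, while a list of column names returns a `DataFrame`. The sketch below illustrates this on a small in-memory frame; the frame `df` is a hypothetical stand-in for `customer.parquet`, built from three rows of the output above.

```python
import pandas as pd

# Hypothetical in-memory sample standing in for customer.parquet
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "FirstName": ["Orlando", "Keith", "Donna"],
    "LastName": ["Gee", "Harris", "Carreras"],
})

one_col = df["FirstName"]                    # single brackets -> Series
sub_frame = df[["CustomerID", "FirstName"]]  # list of names -> DataFrame

print(type(one_col).__name__)    # Series
print(type(sub_frame).__name__)  # DataFrame
print(list(sub_frame.columns))   # ['CustomerID', 'FirstName']
```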


#### Select columns in PySpark

``` python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark Demo").getOrCreate()

df_customer_ps = spark.read.parquet("./datasets/customer.parquet")

df_customer_ps.select("CustomerID", "FirstName").show()
```

``` text
+----------+-----------+
|CustomerID|  FirstName|
+----------+-----------+
|         1|    Orlando|
|         2|      Keith|
|         3|      Donna|
|         4|      Janet|
|         5|       Lucy|
|         6|   Rosmarie|
|         7|    Dominic|
|        10|   Kathleen|
|        11|  Katherine|
|        12|     Johnny|
|        16|Christopher|
|        18|      David|
|        19|       John|
|        20|       Jean|
|        21|    Jinghao|
|        22|      Linda|
|        23|      Kerim|
|        24|      Kevin|
|        25|     Donald|
|        28|     Jackie|
+----------+-----------+
only showing top 20 rows
```



### Add a new column

The following code shows how to add a column in SQL, Pandas and PySpark.

#### Add a column in SQL
``` sql
SELECT CustomerID
      ,FirstName
      ,FirstName + ' ' + LastName as FullName
FROM SalesLT.Customer;
```
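
The `+` string concatenation used above is SQL Server (T-SQL) style syntax. Other engines, such as SQLite and PostgreSQL, use the `||` operator instead. The following is a minimal sketch using Python's built-in `sqlite3` module; the in-memory `Customer` table and its two rows are assumptions made for illustration, not the real `SalesLT.Customer` table.

```python
import sqlite3

# Hypothetical in-memory table standing in for SalesLT.Customer
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer (CustomerID INTEGER, FirstName TEXT, LastName TEXT)"
)
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [(1, "Orlando", "Gee"), (2, "Keith", "Harris")],
)

# SQLite concatenates strings with ||, not +
rows = conn.execute(
    """
    SELECT CustomerID
          ,FirstName
          ,FirstName || ' ' || LastName AS FullName
    FROM Customer
    ORDER BY CustomerID
    """
).fetchall()
conn.close()

print(rows)  # [(1, 'Orlando', 'Orlando Gee'), (2, 'Keith', 'Keith Harris')]
```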

#### Add a column in Pandas

``` python
df_customer_pandas["FullName"] = df_customer_pandas["FirstName"] + " " + df_customer_pandas["LastName"]

df_customer_pandas[["CustomerID", "FirstName", "FullName"]]
```

|  | CustomerID | FirstName | FullName |
| --- | --- | --- | --- |
| 0 | 1 | Orlando | Orlando Gee |
| 1 | 2 | Keith | Keith Harris |
| 2 | 3 | Donna | Donna Carreras |
| 3 | 4 | Janet | Janet Gates |
| 4 | 5 | Lucy | Lucy Harrington |
| ... | ... | ... | ... |
| 842 | 30113 | Raja | Raja Venugopal |
| 843 | 30115 | Dora | Dora Verdad |
| 844 | 30116 | Wanda | Wanda Vernon |
| 845 | 30117 | Robert | Robert Vessa |


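``` python
# Another option is to use apply with axis=1. It calls the lambda once per row,
# so it is usually slower than the column-wise concatenation shown before it.

df_customer_pandas["FullName"] = df_customer_pandas.apply(lambda row: row["FirstName"] + " " + row["LastName"], axis=1)

df_customer_pandas[["CustomerID", "FirstName", "FullName"]]
```

|  | CustomerID | FirstName | FullName |
| --- | --- | --- | --- |
| 0 | 1 | Orlando | Orlando Gee |
| 1 | 2 | Keith | Keith Harris |
| 2 | 3 | Donna | Donna Carreras |
| 3 | 4 | Janet | Janet Gates |
| 4 | 5 | Lucy | Lucy Harrington |
| ... | ... | ... | ... |
| 842 | 30113 | Raja | Raja Venugopal |
| 843 | 30115 | Dora | Dora Verdad |
| 844 | 30116 | Wanda | Wanda Vernon |
| 845 | 30117 | Robert | Robert Vessa |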
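
Both Pandas options above modify `df_customer_pandas` in place. If the original frame should stay untouched, `DataFrame.assign` returns a new frame with the extra column. This is a sketch on a hypothetical three-row sample (`df`), not the full customer dataset.

```python
import pandas as pd

# Hypothetical in-memory sample standing in for customer.parquet
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "FirstName": ["Orlando", "Keith", "Donna"],
    "LastName": ["Gee", "Harris", "Carreras"],
})

# assign returns a copy with the new column; df itself is not changed
with_full_name = df.assign(FullName=df["FirstName"] + " " + df["LastName"])

print(with_full_name["FullName"].tolist())
# ['Orlando Gee', 'Keith Harris', 'Donna Carreras']
print("FullName" in df.columns)  # False
```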


#### Add a column in PySpark

``` python
from pyspark.sql import functions as F

# In PySpark, `+` between columns is arithmetic addition, so applying it to
# string columns yields null. Use concat_ws (or concat) to join strings.
df_customer_ps.withColumn("FullName", F.concat_ws(" ", "FirstName", "LastName")) \
  .select("CustomerID", "FirstName", "FullName") \
  .show(5)
```

``` text
+----------+---------+---------------+
|CustomerID|FirstName|       FullName|
+----------+---------+---------------+
|         1|  Orlando|    Orlando Gee|
|         2|    Keith|   Keith Harris|
|         3|    Donna| Donna Carreras|
|         4|    Janet|    Janet Gates|
|         5|     Lucy|Lucy Harrington|
+----------+---------+---------------+
only showing top 5 rows
```



### Filter Rows

#### Filter Rows in SQL
``` sql
SELECT CustomerID
      ,FirstName
      ,FirstName + ' ' + LastName as FullName
FROM SalesLT.Customer
WHERE FirstName = 'Johnny';
```

To filter rows where `FirstName` contains a substring, use `LIKE`:

``` sql
SELECT CustomerID
      ,FirstName
      ,FirstName + ' ' + LastName as FullName
FROM SalesLT.Customer
WHERE FirstName LIKE '%Joh%';
```

#### Filter rows in Pandas

``` python
# FirstName equals

df_customer_pandas[df_customer_pandas["FirstName"] == "Johnny"][["CustomerID", "FirstName", "FullName"]]
```

|  | CustomerID | FirstName | FullName |
| --- | --- | --- | --- |
| 9 | 12 | Johnny | Johnny Caprio |
| 527 | 29627 | Johnny | Johnny Caprio |


``` python
# FirstName contains

df_customer_pandas[df_customer_pandas["FirstName"].str.contains("Joh")][["CustomerID", "FirstName", "FullName"]]
```

|  | CustomerID | FirstName | FullName |
| --- | --- | --- | --- |
| 9 | 12 | Johnny | Johnny Caprio |
| 12 | 19 | John | John Beaver |
| 72 | 114 | John | John Colon |
| 174 | 276 | Michael John | Michael John Troyer |
| 196 | 309 | John | John Arthur |
| 214 | 335 | John | John Berger |
| 247 | 385 | John | John Kelly |
| 287 | 451 | John | John Emory |
| 301 | 471 | John | John Ford |
| 305 | 475 | John | John Evans |
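
When the column can contain missing values, `str.contains` returns a missing value for those rows, and such a mask cannot be used directly for boolean indexing. Passing `na=False` treats missing names as non-matches. The sketch below assumes a hypothetical sample frame; the row with `CustomerID` 999 and no first name is invented to show the behaviour.

```python
import pandas as pd

# Hypothetical sample; the last row (999, None) is invented for illustration
df = pd.DataFrame({
    "CustomerID": [12, 19, 276, 999],
    "FirstName": ["Johnny", "John", "Michael John", None],
})

# na=False: rows with a missing FirstName count as "no match"
# regex=False: treat "Joh" as a literal substring, like SQL LIKE '%Joh%'
mask = df["FirstName"].str.contains("Joh", na=False, regex=False)
matches = df.loc[mask, ["CustomerID", "FirstName"]]

print(matches["CustomerID"].tolist())  # [12, 19, 276]
```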


#### Filter rows in PySpark

``` python
from pyspark.sql import functions as F

# FirstName equals
# FullName is built with concat_ws because `+` on string columns yields null in PySpark

df_customer_ps.where(df_customer_ps["FirstName"] == "Johnny") \
  .withColumn("FullName", F.concat_ws(" ", "FirstName", "LastName")) \
  .select("CustomerID", "FirstName", "FullName") \
  .show()
```

``` text
+----------+---------+-------------+
|CustomerID|FirstName|     FullName|
+----------+---------+-------------+
|        12|   Johnny|Johnny Caprio|
|     29627|   Johnny|Johnny Caprio|
+----------+---------+-------------+
```



``` python
from pyspark.sql import functions as F

# FirstName contains
# FullName is built with concat_ws because `+` on string columns yields null in PySpark

df_customer_ps.where(df_customer_ps["FirstName"].like("%Joh%")) \
  .withColumn("FullName", F.concat_ws(" ", "FirstName", "LastName")) \
  .select("CustomerID", "FirstName", "FullName") \
  .show(10)
```

``` text
+----------+------------+-------------------+
|CustomerID|   FirstName|           FullName|
+----------+------------+-------------------+
|        12|      Johnny|      Johnny Caprio|
|        19|        John|        John Beaver|
|       114|        John|         John Colon|
|       276|Michael John|Michael John Troyer|
|       309|        John|        John Arthur|
|       335|        John|        John Berger|
|       385|        John|         John Kelly|
|       451|        John|         John Emory|
|       471|        John|          John Ford|
|       475|        John|         John Evans|
+----------+------------+-------------------+
only showing top 10 rows
```



### Join / Merge Data

#### Join / Merge Data in SQL
``` sql
SELECT a.CustomerID
      ,a.FirstName
      ,b.SalesOrderID
      ,b.TotalDue
FROM SalesLT.Customer a
      JOIN SalesLT.SalesOrderHeader b ON (a.CustomerID = b.CustomerID);
```


#### Join / Merge Data in Pandas

``` python
df_sales_pandas = pd.read_parquet('./datasets/sales_order_header.parquet')

# Both frames share the key name, so `on` is enough; merge defaults to an inner join
df_customer_sales_pandas = df_customer_pandas.merge(df_sales_pandas, on="CustomerID")

df_customer_sales_pandas[["CustomerID", "FirstName", "SalesOrderID", "TotalDue"]]
```

|  | CustomerID | FirstName | SalesOrderID | TotalDue |
| --- | --- | --- | --- | --- |
| 0 | 29485 | Catherine | 71782 | 43962.7901 |
| 1 | 29531 | Cory | 71935 | 7330.8972 |
| 2 | 29546 | Christopher | 71938 | 98138.2131 |
| 3 | 29568 | Donald | 71899 | 2669.3183 |
| 4 | 29584 | Walter | 71895 | 272.6468 |
| 5 | 29612 | Richard | 71885 | 608.1766 |
| 6 | 29638 | Rosmarie | 71915 | 2361.6403 |
| 7 | 29644 | Brigid | 71867 | 1170.5376 |
| 8 | 29653 | Pei | 71858 | 15275.1977 |
| 9 | 29660 | Anthony | 71796 | 63686.2708 |
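
By default `merge` performs an inner join, so customers without orders are dropped. The `how` parameter selects the join type, and `indicator=True` adds a `_merge` column showing where each row came from. The sketch below uses small hypothetical frames (`customers`, `orders`); the assumption that customer 1 has no order is made up for illustration.

```python
import pandas as pd

# Hypothetical samples; customer 1 is assumed to have no orders
customers = pd.DataFrame({
    "CustomerID": [1, 29485, 29531],
    "FirstName": ["Orlando", "Catherine", "Cory"],
})
orders = pd.DataFrame({
    "SalesOrderID": [71782, 71935],
    "CustomerID": [29485, 29531],
    "TotalDue": [43962.7901, 7330.8972],
})

# Inner join: only customers that have at least one order
inner = customers.merge(orders, on="CustomerID", how="inner")

# Left join: keep every customer; _merge marks unmatched rows as 'left_only'
left = customers.merge(orders, on="CustomerID", how="left", indicator=True)

print(len(inner))                # 2
print(left["_merge"].tolist())   # ['left_only', 'both', 'both']
```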


#### Join / Merge Data in PySpark

``` python
# Rename the key on the sales side so the joined frame has no duplicate CustomerID column
df_sales_ps = spark.read.parquet("./datasets/sales_order_header.parquet").withColumnRenamed("CustomerID", "CustomerIDSales")

df_customer_sales_ps = df_customer_ps.join(df_sales_ps, df_customer_ps["CustomerID"] == df_sales_ps["CustomerIDSales"])

df_customer_sales_ps.select("CustomerID", "FirstName", "SalesOrderID", "TotalDue").show()
```

``` text
+----------+-----------+------------+----------+
|CustomerID|  FirstName|SalesOrderID|  TotalDue|
+----------+-----------+------------+----------+
|     29485|  Catherine|       71782|43962.7901|
|     29531|       Cory|       71935| 7330.8972|
|     29546|Christopher|       71938|98138.2131|
|     29568|     Donald|       71899| 2669.3183|
|     29584|     Walter|       71895|  272.6468|
|     29612|    Richard|       71885|  608.1766|
|     29638|   Rosmarie|       71915| 2361.6403|
|     29644|     Brigid|       71867| 1170.5376|
|     29653|        Pei|       71858|15275.1977|
|     29660|    Anthony|       71796|63686.2708|
|     29736|      Terry|       71784|119960.824|
|     29741|     Janeth|       71946|   43.0437|
|     29781|        Guy|       71923|  117.7276|
|     29796|        Jon|       71797|86222.8072|
|     29847|      David|       71774|   972.785|
|     29877|      Joyce|       71897|14017.9083|
|     29922|     Pamala|       71832|39531.6085|
|     29929|    Jeff
```

### Union / Append Data

#### Union / Append Data in SQL
``` sql
SELECT a.CustomerID
      ,a.FirstName
FROM SalesLT.Customer a
UNION ALL
SELECT a.CustomerID
      ,a.FirstName
FROM SalesLT.Customer a;
```


#### Union / Append Data in Pandas

``` python
pd.concat([df_customer_pandas[["CustomerID", "FirstName"]], df_customer_pandas[["CustomerID", "FirstName"]]])
```

|  | CustomerID | FirstName |
| --- | --- | --- |
| 0 | 1 | Orlando |
| 1 | 2 | Keith |
| 2 | 3 | Donna |
| 3 | 4 | Janet |
| 4 | 5 | Lucy |
| ... | ... | ... |
| 842 | 30113 | Raja |
| 843 | 30115 | Dora |
| 844 | 30116 | Wanda |
| 845 | 30117 | Robert |
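
Note that `pd.concat` keeps the original index labels, so the appended frame repeats them (the output above ends at index 845 again). Passing `ignore_index=True` renumbers the rows, and `drop_duplicates` gives the equivalent of SQL `UNION` rather than `UNION ALL`. A minimal sketch on a hypothetical two-row sample:

```python
import pandas as pd

# Hypothetical in-memory sample standing in for customer.parquet
df = pd.DataFrame({
    "CustomerID": [1, 2],
    "FirstName": ["Orlando", "Keith"],
})

# UNION ALL: keep duplicates, renumber the index from 0
union_all = pd.concat([df, df], ignore_index=True)

# UNION: remove duplicate rows after appending
union_distinct = union_all.drop_duplicates(ignore_index=True)

print(union_all.index.tolist())  # [0, 1, 2, 3]
print(len(union_distinct))       # 2
```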


#### Union / Append Data in PySpark

``` python
# unionAll is an alias of union; both keep duplicate rows, like SQL UNION ALL
df_customer_ps.select("CustomerID", "FirstName") \
  .unionAll(df_customer_ps.select("CustomerID", "FirstName")) \
  .show()
```

``` text
+----------+-----------+
|CustomerID|  FirstName|
+----------+-----------+
|         1|    Orlando|
|         2|      Keith|
|         3|      Donna|
|         4|      Janet|
|         5|       Lucy|
|         6|   Rosmarie|
|         7|    Dominic|
|        10|   Kathleen|
|        11|  Katherine|
|        12|     Johnny|
|        16|Christopher|
|        18|      David|
|        19|       John|
|        20|       Jean|
|        21|    Jinghao|
|        22|      Linda|
|        23|      Kerim|
|        24|      Kevin|
|        25|     Donald|
|        28|     Jackie|
+----------+-----------+
only showing top 20 rows
```

