# Advanced PySpark on Databricks - Part 2
* Notebook by Adam Lang
* Date: 12/24/2024


# Overview
This notebook continues a detailed walkthrough of advanced data wrangling and data transformations using PySpark on Databricks.

# Reading in CSV Dataset
* Basic uploading and reading of a CSV file in Databricks.

In [0]:
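## import libraries first
## note: these wildcard imports shadow Python built-ins such as sum, min, max and round
## with Spark column functions for the rest of the notebook
from pyspark.sql.types import * ## import all types
from pyspark.sql.functions import * ## import all functions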

In [0]:
## how to find filestore link
dbutils.fs.ls('/FileStore/tables/')

In [0]:
## read in CSV dataset into df
df = (spark.read.format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load('/FileStore/tables/BigMart_Sales.csv'))

In [0]:
## display
df.display()

# EXPLODE function
* `explode` takes an array column and returns one new row per element, copying the other columns onto each new row. This is similar to `DataFrame.explode` in pandas.

Note: `explode` on its own only produces new ROWS. Splitting an array into separate columns is covered in the next section.

In [0]:
## create a new df to do this
df_exp = df.withColumn('Outlet_Type', split('Outlet_Type',' '))
df_exp.display()

Now we EXPLODE the Outlet_Type array column so that each value in the array gets its own row.

In [0]:
## explode 
df_exp.withColumn('Outlet_Type', explode('Outlet_Type')).display()

Summary

* Now we can see that each element of the Outlet_Type array has been exploded into its own row.
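
To make the row-multiplying behaviour concrete without a cluster, here is a minimal plain-Python sketch of what `split` followed by `explode` does. It is a model of the semantics, not Spark code: the helper name `split_then_explode` and the two sample rows are hypothetical and chosen only for illustration.

```python
# Plain-Python model of split() + explode(); this is NOT Spark code.
# Hypothetical sample rows, for illustration only.
rows = [
    {"Item_Identifier": "A1", "Outlet_Type": "Supermarket Type1"},
    {"Item_Identifier": "B2", "Outlet_Type": "Grocery Store"},
]


def split_then_explode(rows, column, sep=" "):
    """Return one output row per element of row[column].split(sep)."""
    exploded = []
    for row in rows:
        for part in row[column].split(sep):
            new_row = dict(row)      # copy every other column unchanged
            new_row[column] = part   # keep a single array element
            exploded.append(new_row)
    return exploded


exploded_rows = split_then_explode(rows, "Outlet_Type")
for r in exploded_rows:
    print(r)
```

Two input rows become four output rows, and `Item_Identifier` is repeated on each. One difference to keep in mind: Spark's `explode` drops rows whose array is null or empty, while `explode_outer` keeps them with a null value.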

## EXPLODE arrays into new columns
* This is where we have to use the `getItem()` function, which picks one element of the array by its index.

In [0]:
## first let's re-create df_exp with the array column, since it is our starting point
df_exp = df.withColumn('Outlet_Type', split('Outlet_Type',' '))
df_exp.display()

Now let's place each array index into its own new column.

In [0]:
## create new columns for each element of the array
## first we give the new col name
## then we select the array col and use getItem() with the index we want
df_exp = df_exp.withColumn("Outlet_Name", col("Outlet_Type").getItem(0))
df_exp = df_exp.withColumn("Outlet_Type_Class", col("Outlet_Type").getItem(1))
df_exp.display()

Summary

* Great! Each element of the array now sits in its own column (index 0 in `Outlet_Name`, index 1 in `Outlet_Type_Class`). We could now drop the original array column, but we will skip that step for now.
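
As a rough illustration of the index lookup, here is a plain-Python sketch of picking array elements by position. The helper `get_item` is hypothetical. Returning `None` for an index that does not exist is an assumption that mirrors how Spark returns null with ANSI mode off; with ANSI mode on, an out-of-range index may raise an error instead, so check your runtime settings.

```python
# Plain-Python model of split() + getItem(index); this is NOT Spark code.
def get_item(values, index):
    """Return values[index], or None when the index is out of range.

    Assumption: None stands in for the null Spark returns when ANSI mode is off.
    """
    return values[index] if 0 <= index < len(values) else None


outlet_type = "Supermarket Type1".split(" ")   # ['Supermarket', 'Type1']

outlet_name = get_item(outlet_type, 0)         # first element
outlet_type_class = get_item(outlet_type, 1)   # second element
missing = get_item(outlet_type, 2)             # no third element

print(outlet_name, outlet_type_class, missing)
```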

# ARRAY_CONTAINS
* We continue with array functions: after EXPLODE, we now look at ARRAY_CONTAINS, which checks whether an array holds a given value.


* Question to answer: Is `Type1` present in the array column `Outlet_Type`?

In [0]:
## let's review df_exp
df_exp.display()

In [0]:
## create a new boolean flag column
## new col name --> array_contains(array column to search, value to look for)
df_exp.withColumn('Type1_flag', array_contains('Outlet_Type','Type1')).display()
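
The check is an exact match against whole array elements, not a substring search. The plain-Python sketch below models that behaviour; the helper name `array_contains_py` is hypothetical, and `None` stands in for a Spark null.

```python
# Plain-Python model of array_contains(); this is NOT Spark code.
def array_contains_py(values, target):
    """True if target equals one element of values; None if values is None."""
    if values is None:
        return None          # Spark also returns null for a null array
    return target in values  # list membership compares whole elements


flag_type1 = array_contains_py(["Supermarket", "Type1"], "Type1")
flag_partial = array_contains_py(["Supermarket", "Type1"], "Type")
flag_grocery = array_contains_py(["Grocery", "Store"], "Type1")
flag_null = array_contains_py(None, "Type1")

print(flag_type1, flag_partial, flag_grocery, flag_null)
```

Note that `"Type"` does not match `"Type1"`, so a partial string never sets the flag.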

# GROUP_BY
* Similar to group by in pandas.

## Scenario 1

In [0]:
## df display
df.display()

In [0]:
## find sum of MRP for item_type
df.groupBy('Item_Type').agg(sum('Item_MRP').alias('Sum_Item_MRP')).display()

## Scenario 2
* Finding the average.

In [0]:
## average groupby
df.groupBy('Item_Type').agg(avg('Item_MRP').alias('Average_Item_MRP')).display()

## Scenario 3 - Group By on 2 columns
* Sum of Item_MRP
* Grouped by both Item_Type and Outlet_Size

In [0]:
df.groupBy('Item_Type','Outlet_Size').agg(sum('Item_MRP').alias('Total_MRP')).display()

## Scenario 4
* This is a more advanced transformation: two aggregations (sum and average) in a single `agg` call, grouped by two columns.

In [0]:
df.groupBy('Item_Type', 'Outlet_Size').agg(sum('Item_MRP').alias('Sum_Item_MRP'),avg('Item_MRP').alias('Avg_Item_MRP')).display()
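
To show what the grouped aggregation computes, here is a minimal plain-Python sketch that groups on two keys and derives a sum and an average per group. The sample rows are hypothetical. `math.fsum` is used on purpose instead of the built-in `sum`, because the wildcard import from `pyspark.sql.functions` earlier in this notebook replaces `sum` with the Spark column function.

```python
# Plain-Python model of groupBy(...).agg(sum(...), avg(...)); NOT Spark code.
from collections import defaultdict
from math import fsum

# Hypothetical sample rows: (Item_Type, Outlet_Size, Item_MRP)
sales = [
    ("Dairy", "Medium", 100.0),
    ("Dairy", "Medium", 200.0),
    ("Dairy", "Small", 50.0),
    ("Meat", "Medium", 150.0),
]

# Bucket the Item_MRP values by the (Item_Type, Outlet_Size) pair
groups = defaultdict(list)
for item_type, outlet_size, mrp in sales:
    groups[(item_type, outlet_size)].append(mrp)

# One output row per distinct key pair, with both aggregates
summary = {
    key: {"Sum_Item_MRP": fsum(values), "Avg_Item_MRP": fsum(values) / len(values)}
    for key, values in groups.items()
}

for key, aggregates in summary.items():
    print(key, aggregates)
```

Each distinct (Item_Type, Outlet_Size) pair yields exactly one output row, which is why grouping on more columns produces more, smaller groups.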

# Collect_List
* `collect_list` is an aggregate function: after a group by, it gathers all values of another column for each group into a single array, so each group ends up as one row.
* Let's use an example with some dummy data below.

In [0]:
## create dummy df
data = [('user1', 'book1'),
        ('user1','book2'),
        ('user2', 'book2'),
        ('user2', 'book4'),
        ('user3', 'book1')]

schema = 'user string, book string'


## create df
df_book = spark.createDataFrame(data, schema)

## display
df_book.display()

In [0]:
df_book.groupBy('user').agg(collect_list('book')).display()

Summary

* We can see the power of `collect_list`: each user now appears once, with all of their books gathered into one array.
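
For comparison, the same grouping can be sketched in plain Python using the dummy data from above. This is a model of the semantics rather than Spark code, and the variable name `books_by_user` is just for illustration.

```python
# Plain-Python model of groupBy('user').agg(collect_list('book')); NOT Spark code.
from collections import defaultdict

data = [('user1', 'book1'),
        ('user1', 'book2'),
        ('user2', 'book2'),
        ('user2', 'book4'),
        ('user3', 'book1')]

# Append every book to the list that belongs to its user
books_by_user = defaultdict(list)
for user, book in data:
    books_by_user[user].append(book)

for user, books in books_by_user.items():
    print(user, books)
```

Two caveats when moving back to Spark: `collect_list` keeps duplicates (use `collect_set` to drop them), and the order of elements inside each array is not guaranteed, unlike the insertion order in this sketch.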

# PIVOT
* This is similar to a pivot table in Excel or pandas: the distinct values of one column become new columns, and each cell holds an aggregate.

In [0]:
## pivot function in pyspark
df.groupBy('Item_Type').pivot('Outlet_Size').agg(avg('Item_MRP')).display()
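
The reshaping can be sketched in plain Python as follows. The sample rows are hypothetical, and `math.fsum` is used instead of the built-in `sum` because the wildcard import at the top of this notebook shadows it. Row keys come from `Item_Type`, column keys from `Outlet_Size`, and each cell is the average `Item_MRP`.

```python
# Plain-Python model of groupBy(...).pivot(...).agg(avg(...)); NOT Spark code.
from collections import defaultdict
from math import fsum

# Hypothetical sample rows: (Item_Type, Outlet_Size, Item_MRP)
sales = [
    ("Dairy", "Medium", 100.0),
    ("Dairy", "Medium", 200.0),
    ("Dairy", "Small", 50.0),
    ("Meat", "Medium", 150.0),
]

# Collect the Item_MRP values for every (row key, column key) cell
cells = defaultdict(list)
for item_type, outlet_size, mrp in sales:
    cells[(item_type, outlet_size)].append(mrp)

item_types = sorted({item_type for item_type, _, _ in sales})
outlet_sizes = sorted({outlet_size for _, outlet_size, _ in sales})

# Build one output row per Item_Type and one column per Outlet_Size
pivot_table = {}
for item_type in item_types:
    pivot_row = {}
    for outlet_size in outlet_sizes:
        values = cells.get((item_type, outlet_size))
        # A combination with no rows becomes None, like a null cell in Spark
        pivot_row[outlet_size] = fsum(values) / len(values) if values else None
    pivot_table[item_type] = pivot_row

for item_type, pivot_row in pivot_table.items():
    print(item_type, pivot_row)
```

In Spark, `pivot` also accepts an explicit list of values as a second argument, which saves it from computing the distinct values of the pivot column first.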

In [0]:
df.display()

# WHEN-OTHERWISE
* Logical statement for if-then operations similar to CASE-WHEN in SQL.

## Scenario 1

In [0]:
## when otherwise
df = df.withColumn('veg_flag', when(col('Item_Type')=='Meat', 'Non Veg').otherwise('Veg'))

In [0]:
df.display()

## Scenario 2
* Chaining two `when` conditions, each of which combines two tests with `&`.

In [0]:
## new flag column
## note: >= 100 so that a Veg item priced at exactly 100 is not labelled Non_Veg
df.withColumn(
    'veg_exp_flag',
    when((col('veg_flag') == 'Veg') & (col('Item_MRP') < 100), 'Veg_Inexpensive')
    .when((col('veg_flag') == 'Veg') & (col('Item_MRP') >= 100), 'Veg_Expensive')
    .otherwise('Non_Veg')
).display()
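
A chained `when` is evaluated top to bottom: the first condition that holds decides the value, and `otherwise` covers every row that matched nothing. The plain-Python sketch below models that order. The function name `veg_exp_flag` is hypothetical, and the `>= 100` boundary is this sketch's own choice so that every Veg row lands in one of the two Veg buckets.

```python
# Plain-Python model of when(...).when(...).otherwise(...); NOT Spark code.
def veg_exp_flag(veg_flag, item_mrp):
    """Return the first matching label, falling back to 'Non_Veg'."""
    if veg_flag == "Veg" and item_mrp < 100:     # first when()
        return "Veg_Inexpensive"
    if veg_flag == "Veg" and item_mrp >= 100:    # second when()
        return "Veg_Expensive"
    return "Non_Veg"                             # otherwise()


cheap_veg = veg_exp_flag("Veg", 50.0)
boundary_veg = veg_exp_flag("Veg", 100.0)
non_veg = veg_exp_flag("Non Veg", 50.0)

print(cheap_veg, boundary_veg, non_veg)
```

If the two conditions left a gap (for example `< 100` and `> 100`), a Veg item priced at exactly 100 would fall through to the fallback label, so it is worth checking that the branches cover every case you care about.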