In [2]:
'''
You are given a dataset that contains daily closing stock prices for different companies. Each row represents the stock symbol, date, and closing price for that day.

Your task is to calculate the daily price change for each stock by comparing the current day's price with the previous trading day for the same stock.

If there is no previous trading day (first date for that stock), the change should be null.

Input Schema & Example
Column Name	Data Type
stock_symbol	String
trade_date	String
closing_price	Double
Example Input Table
stock_symbol	trade_date	closing_price
AAPL	2023-10-26	150.0
AAPL	2023-10-27	152.5
AAPL	2023-10-30	151.0
GOOG	2023-10-26	2800.0
GOOG	2023-10-27	2810.0
Output Schema
Column Name	Data Type
stock_symbol	String
trade_date	String
closing_price	Double
daily_price_change	Double
Example Output Table
stock_symbol	trade_date	closing_price	daily_price_change
AAPL	2023-10-26	150.0	null
AAPL	2023-10-27	152.5	2.5
AAPL	2023-10-30	151.0	-1.5
GOOG	2023-10-26	2800.0	null
GOOG	2023-10-27	2810.0	10.0
💡 Explanation
The price change for 2023-10-27 (AAPL) is 152.5 - 150.0 = 2.5.
The price change for 2023-10-30 (AAPL) is 151.0 - 152.5 = -1.5.
The first record for each stock symbol has no previous day to compare to, so daily_price_change is null.
Starter Code
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    # AAPL
    ("AAPL", "2023-10-23", 148.0),
    ("AAPL", "2023-10-24", 149.5),
    ("AAPL", "2023-10-25", 150.0),
    ("AAPL", "2023-10-26", 150.0),
    ("AAPL", "2023-10-27", 152.5),
    ("AAPL", "2023-10-30", 151.0),

    # GOOG
    ("GOOG", "2023-10-24", 2790.0),
    ("GOOG", "2023-10-25", 2795.0),
    ("GOOG", "2023-10-26", 2800.0),
    ("GOOG", "2023-10-27", 2810.0),
    ("GOOG", "2023-10-30", 2798.5),

    # MSFT
    ("MSFT", "2023-10-23", 330.0),
    ("MSFT", "2023-10-24", 332.0),
    ("MSFT", "2023-10-25", 335.0),
    ("MSFT", "2023-10-26", 334.5),
    ("MSFT", "2023-10-27", 336.0),
]

columns = ["stock_symbol", "trade_date", "closing_price"]

df = spark.createDataFrame(data, columns)

# Your logic goes here to create df_result

display(df_result)
'''

'''
Is it necessary to convert trade_date with to_date?

No, it's not strictly necessary to convert trade_date to DateType if you only need the daily price change and the dates are already strings in yyyy-MM-dd format.

Here's why:
String ordering works:
Zero-padded date strings in yyyy-MM-dd format sort lexicographically in the same order as they sort chronologically, so ORDER BY on the string column gives the right sequence.
LAG() or any other window function ordered by that column therefore still works as intended.

When to_date is useful:
If you want to perform date arithmetic, e.g., calculating the number of days between two dates.
If your dates are not in yyyy-MM-dd format (for example MM/dd/yyyy, where string order and date order differ).
If you need date functions like month(), year(), dayofweek(), etc.
'''
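
'''
The string-ordering point above can be checked without Spark. This is a minimal plain-Python sketch, not part of the original solution. The helper name chronological and the sample values in us_dates are assumptions made up for illustration, and all strings are assumed to be zero-padded. It compares lexicographic order with true chronological order for yyyy-MM-dd and for MM/dd/yyyy.
'''

```python
from datetime import datetime


def chronological(dates, fmt):
    """Sort date strings by their parsed datetime value (hypothetical helper)."""
    return sorted(dates, key=lambda s: datetime.strptime(s, fmt))


iso_dates = ["2023-10-30", "2023-10-26", "2023-10-27"]
us_dates = ["01/05/2024", "12/31/2023", "02/01/2023"]  # made-up sample values

# Plain string sort vs. sort on the parsed dates
iso_matches = sorted(iso_dates) == chronological(iso_dates, "%Y-%m-%d")
us_matches = sorted(us_dates) == chronological(us_dates, "%m/%d/%Y")

print("yyyy-MM-dd string order is chronological:", iso_matches)
print("MM/dd/yyyy string order is chronological:", us_matches)
```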

# Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName('Spark Playground').getOrCreate()

data = [
    # AAPL
    ("AAPL", "2023-10-23", 148.0),
    ("AAPL", "2023-10-24", 149.5),
    ("AAPL", "2023-10-25", 150.0),
    ("AAPL", "2023-10-26", 150.0),
    ("AAPL", "2023-10-27", 152.5),
    ("AAPL", "2023-10-30", 151.0),

    # GOOG
    ("GOOG", "2023-10-24", 2790.0),
    ("GOOG", "2023-10-25", 2795.0),
    ("GOOG", "2023-10-26", 2800.0),
    ("GOOG", "2023-10-27", 2810.0),
    ("GOOG", "2023-10-30", 2798.5),

    # MSFT
    ("MSFT", "2023-10-23", 330.0),
    ("MSFT", "2023-10-24", 332.0),
    ("MSFT", "2023-10-25", 335.0),
    ("MSFT", "2023-10-26", 334.5),
    ("MSFT", "2023-10-27", 336.0),
]

columns = ["stock_symbol", "trade_date", "closing_price"]

df = spark.createDataFrame(data, columns)

# Convert trade_date to DateType (optional here: yyyy-MM-dd strings already sort chronologically)
df = df.withColumn("trade_date", F.to_date("trade_date", "yyyy-MM-dd"))

# Define window partitioned by stock_symbol and ordered by trade_date
window_spec = Window.partitionBy("stock_symbol").orderBy("trade_date")

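# Daily price change: current close minus the previous trading day's close.
# lag() returns null on each symbol's first row, so the change is null there.
df_result = df.withColumn(
    "daily_price_change",
    F.col("closing_price") - F.lag("closing_price").over(window_spec),
)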
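
'''
To sanity-check the window logic, the same "previous row within the same symbol" idea can be sketched in plain Python. This is only an illustrative cross-check under the assumption that dates are yyyy-MM-dd strings; the function name daily_price_changes is hypothetical and is not a Spark API. The sample rows are the ones from the example input table in the problem statement.
'''

```python
from collections import defaultdict


def daily_price_changes(rows):
    """Return (symbol, date, price, change) tuples.

    change is None for the first date of each symbol, mirroring what lag()
    yields on the first row of a partition.
    """
    by_symbol = defaultdict(list)
    for symbol, trade_date, price in rows:
        by_symbol[symbol].append((trade_date, price))

    result = []
    for symbol in sorted(by_symbol):
        previous_price = None
        # Sorting the (date, price) tuples orders them by the ISO date string
        for trade_date, price in sorted(by_symbol[symbol]):
            change = None if previous_price is None else price - previous_price
            result.append((symbol, trade_date, price, change))
            previous_price = price
    return result


sample = [
    ("AAPL", "2023-10-27", 152.5),
    ("AAPL", "2023-10-26", 150.0),
    ("AAPL", "2023-10-30", 151.0),
    ("GOOG", "2023-10-26", 2800.0),
    ("GOOG", "2023-10-27", 2810.0),
]

changes = daily_price_changes(sample)
for row in changes:
    print(row)
```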

'''
Bonus Challenge: Can you solve this using Spark SQL and temporary views?

# Create a temporary view
df.createOrReplaceTempView("stocks")

# Use Spark SQL with LAG() window function
df_result = spark.sql("""
  SELECT stock_symbol,
         trade_date,
         closing_price,
         closing_price - LAG(closing_price) OVER (
           PARTITION BY stock_symbol
           ORDER BY trade_date
         ) AS daily_price_change
  FROM stocks
""")
'''
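
'''
The LAG() query in the bonus challenge is standard SQL, so its shape can be tried outside Spark as well. The sketch below swaps Spark SQL for Python's built-in sqlite3 module purely for illustration; it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions), and adds an outer ORDER BY so the row order is deterministic. The names rows, conn, query, and sql_result are made up for this example.
'''

```python
import sqlite3

rows = [
    ("AAPL", "2023-10-26", 150.0),
    ("AAPL", "2023-10-27", 152.5),
    ("AAPL", "2023-10-30", 151.0),
    ("GOOG", "2023-10-26", 2800.0),
    ("GOOG", "2023-10-27", 2810.0),
]

# In-memory database standing in for the "stocks" temporary view
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stocks (stock_symbol TEXT, trade_date TEXT, closing_price REAL)"
)
conn.executemany("INSERT INTO stocks VALUES (?, ?, ?)", rows)

query = """
    SELECT stock_symbol,
           trade_date,
           closing_price,
           closing_price - LAG(closing_price) OVER (
               PARTITION BY stock_symbol
               ORDER BY trade_date
           ) AS daily_price_change
    FROM stocks
    ORDER BY stock_symbol, trade_date
"""
sql_result = conn.execute(query).fetchall()
conn.close()

for row in sql_result:
    print(row)
```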

# Display result
df_result.show()

+------------+----------+-------------+------------------+
|stock_symbol|trade_date|closing_price|daily_price_change|
+------------+----------+-------------+------------------+
|        AAPL|2023-10-23|        148.0|              NULL|
|        AAPL|2023-10-24|        149.5|               1.5|
|        AAPL|2023-10-25|        150.0|               0.5|
|        AAPL|2023-10-26|        150.0|               0.0|
|        AAPL|2023-10-27|        152.5|               2.5|
|        AAPL|2023-10-30|        151.0|              -1.5|
|        GOOG|2023-10-24|       2790.0|              NULL|
|        GOOG|2023-10-25|       2795.0|               5.0|
|        GOOG|2023-10-26|       2800.0|               5.0|
|        GOOG|2023-10-27|       2810.0|              10.0|
|        GOOG|2023-10-30|       2798.5|             -11.5|
|        MSFT|2023-10-23|        330.0|              NULL|
|        MSFT|2023-10-24|        332.0|               2.0|
|        MSFT|2023-10-25|        335.0|               3.