# Part II: Data Prep

In this notebook we will look at market data using the yfinance Python package.

In [0]:
from pyspark.sql import SparkSession

# install the yfinance package to get data from Yahoo Finance
#!pip install yfinance
import yfinance as yf
import pandas as pd

In [0]:
# build spark session
spark = SparkSession.builder.appName("Analyze Market Data").getOrCreate()

Let's get some data!

Using yfinance we will get some data for a few different assets. 

*Note: we are defining an asset as an investment asset (i.e., a stock, bond, or ETF)*

In [0]:
# define assets in portfolio
assetList = ["ORCL", "BLK", "UNH"]

# get asset overview data
asset_data = [yf.Ticker(asset).info for asset in assetList]

# get price data for assets
price_data = pd.DataFrame()
for asset in assetList:
    data = yf.download(asset, start="2017-01-01", end="2023-04-23")
    data["Symbol"] = asset
    price_data = pd.concat([price_data, data])

price_data.reset_index(inplace=True)

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed


In [0]:
# create spark dataframes
asset_df = spark.createDataFrame(asset_data)
price_df = spark.createDataFrame(price_data)

In [0]:
# let's list the columns in our asset overview dataframe (this data has a lot of columns)
asset_df.columns

# we only care about a few of these so let's drop the ones we don't care about
asset_info_df = asset_df.select(
    "symbol",
    "beta",
    "totalRevenue",
    "totalDebt",
    "overallRisk",
    "auditRisk",
    "exDividendDate",
    "dividendYield",
)
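The `info` dictionaries returned by yfinance are large, and a given key may not be present for every ticker. One way to guard against that, sketched below, is to trim each dictionary to the fields of interest before building the Spark DataFrame, filling absent keys with `None`. The helper name `pick_fields` and the sample dictionaries are hypothetical and only illustrate the idea.

```python
# Fields kept in the notebook's select(...) call.
FIELDS = [
    "symbol", "beta", "totalRevenue", "totalDebt",
    "overallRisk", "auditRisk", "exDividendDate", "dividendYield",
]


def pick_fields(info, fields=FIELDS):
    """Return a dict with exactly `fields` as keys; missing keys map to None."""
    return {name: info.get(name) for name in fields}


# Hypothetical stand-ins for yf.Ticker(...).info (real dicts have many more keys).
sample_infos = [
    {"symbol": "ORCL", "beta": 0.995847, "overallRisk": 10, "sector": "Technology"},
    {"symbol": "BLK", "beta": 1.289273},
]

trimmed = [pick_fields(info) for info in sample_infos]
print(trimmed[1])
```

The trimmed list could then be passed to `spark.createDataFrame` in place of `asset_data`. When `None` values are present, supplying an explicit schema is safer than relying on type inference.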

In [0]:
asset_info_df.show()
print(asset_info_df.schema)

+------+--------+------------+-----------+-----------+---------+--------------+-------------+
|symbol|    beta|totalRevenue|  totalDebt|overallRisk|auditRisk|exDividendDate|dividendYield|
+------+--------+------------+-----------+-----------+---------+--------------+-------------+
|  ORCL|0.995847| 47957999616|91810996224|         10|        8|    1681084800|       0.0168|
|   BLK|1.289273| 17417000960| 8488999936|          3|        4|    1678060800|       0.0297|
|   UNH| 0.67993|335943991296|70587998208|          3|       10|    1678406400|       0.0136|
+------+--------+------------+-----------+-----------+---------+--------------+-------------+

StructType([StructField('symbol', StringType(), True), StructField('beta', DoubleType(), True), StructField('totalRevenue', LongType(), True), StructField('totalDebt', LongType(), True), StructField('overallRisk', LongType(), True), StructField('auditRisk', LongType(), True), StructField('exDividendDate', LongType(), True), StructField('di
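The schema shows `exDividendDate` as a `LongType`. The values look like Unix timestamps in seconds (an assumption based on their magnitude, not something stated in the data), and under that assumption they can be turned into calendar dates. A minimal sketch using only the standard library, with `epoch_to_date` as a hypothetical helper name:

```python
from datetime import datetime, timezone


def epoch_to_date(seconds):
    """Interpret `seconds` as a Unix timestamp (UTC) and return an ISO date string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()


# exDividendDate values copied from the table above.
ex_dividend = {"ORCL": 1681084800, "BLK": 1678060800, "UNH": 1678406400}

for symbol, seconds in ex_dividend.items():
    print(symbol, epoch_to_date(seconds))
```

In Spark the same conversion could be done on the column itself, for example with `pyspark.sql.functions.from_unixtime`.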

In [0]:
price_df.show(5)

+-------------------+------------------+------------------+------------------+------------------+------------------+--------+------+
|               Date|              Open|              High|               Low|             Close|         Adj Close|  Volume|Symbol|
+-------------------+------------------+------------------+------------------+------------------+------------------+--------+------+
|2017-01-03 00:00:00| 38.45000076293945|38.689998626708984| 38.29999923706055| 38.54999923706055|34.786827087402344|11051300|  ORCL|
|2017-01-04 00:00:00| 38.54999923706055| 38.91999816894531| 38.54999923706055|  38.7400016784668| 34.95830154418945| 9545500|  ORCL|
|2017-01-05 00:00:00| 38.66999816894531| 38.95000076293945| 38.40999984741211| 38.63999938964844| 34.86805725097656|12064700|  ORCL|
|2017-01-06 00:00:00|             38.75|             38.75|38.380001068115234| 38.45000076293945|34.696598052978516|14829700|  ORCL|
|2017-01-09 00:00:00|38.529998779296875| 39.45000076293945|38.4700012

We can also filter our data. Let's say we only want the price data for April 2023, the most recent month in our data.

In [0]:
price_df.filter((price_df.Date >= "2023-04-01") & (price_df.Date < "2023-05-01")).show(
    5
)

+-------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-------+------+
|               Date|             Open|             High|              Low|            Close|        Adj Close| Volume|Symbol|
+-------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-------+------+
|2023-04-03 00:00:00|92.37999725341797|             94.0|92.08999633789062|93.91999816894531| 93.5283432006836|8410900|  ORCL|
|2023-04-04 00:00:00| 93.8499984741211| 94.0199966430664|92.93000030517578|             94.0| 93.6080093383789|6651500|  ORCL|
|2023-04-05 00:00:00|93.62000274658203|95.11000061035156| 93.4800033569336|94.88999938964844|94.49429321289062|7478800|  ORCL|
|2023-04-06 00:00:00|94.33000183105469|96.08000183105469|93.98999786376953|95.91999816894531| 95.5199966430664|9146200|  ORCL|
|2023-04-10 00:00:00|94.68000030517578|95.11000061035156|93.55000305175781|93.76000213623047|93.76000213623047|
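The filter above hard-codes the month boundaries. To keep the cell correct when the notebook is rerun later, the bounds can be derived from a date instead. The sketch below uses only the standard library; `month_bounds` is a hypothetical helper name.

```python
from datetime import date


def month_bounds(day):
    """Return (first day of day's month, first day of the following month)."""
    start = day.replace(day=1)
    # Roll over to January of the next year when the month is December.
    if start.month == 12:
        end = start.replace(year=start.year + 1, month=1)
    else:
        end = start.replace(month=start.month + 1)
    return start, end


start, end = month_bounds(date(2023, 4, 23))
print(start.isoformat(), end.isoformat())
```

Passing `date.today()` gives the bounds for the current month. The ISO strings can be used in the same half-open filter as before: `price_df.filter((price_df.Date >= start.isoformat()) & (price_df.Date < end.isoformat()))`.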

In [0]:
# now let's create a temp view
price_df.createOrReplaceTempView("prices")

# now we can execute our SQL query
prices = spark.sql("SELECT * FROM prices")
prices.columns

Out[10]: ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Symbol']

Now let's create a view of our asset details table.

In [0]:
asset_info_df.createOrReplaceTempView("details")

# now we can execute our SQL query
details = spark.sql("SELECT * FROM details")
details.show()

+------+--------+------------+-----------+-----------+---------+--------------+-------------+
|symbol|    beta|totalRevenue|  totalDebt|overallRisk|auditRisk|exDividendDate|dividendYield|
+------+--------+------------+-----------+-----------+---------+--------------+-------------+
|  ORCL|0.995847| 47957999616|91810996224|         10|        8|    1681084800|       0.0168|
|   BLK|1.289273| 17417000960| 8488999936|          3|        4|    1678060800|       0.0297|
|   UNH| 0.67993|335943991296|70587998208|          3|       10|    1678406400|       0.0136|
+------+--------+------------+-----------+-----------+---------+--------------+-------------+



In [0]:
print(
    f"Total rows in each view: prices = {prices.count()}, details = {details.count()}"
)

Total rows in each view: prices = 4758, details = 3


## Write Data to Delta Tables

Because notebooks tend to become unreadable as you do more and more work, we want to write our prepared data to a central location. This way we can access it from any other notebook or script without having to run all the code in this notebook first. 

This makes our projects more manageable, as we don't have one monolithic notebook that has to be run every time. You should think about how the actions the notebook performs are tied to different roles. This notebook, for example, may never be used by a data analyst or scientist, but a data engineer would find the steps here very useful.

First we need to update a column name in our prices dataframe. Spark works with spaces in column names in memory, as the cells above show, but writing to a Delta table fails by default when a column name contains a space. You can set a table property that allows this, but for simplicity we will just rename the column. Renaming columns is also a very common task in data preparation, so it's good practice.

In [0]:
# rename the adjusted closing price column to remove the space
price_df = price_df.withColumnRenamed('Adj Close', 'adjClose')
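When more than one column needs attention, the rename can be generalized. The sketch below converts any name containing characters that Delta rejects by default (space, comma, semicolon, braces, parentheses, newline, tab, equals sign) into camelCase and leaves other names unchanged. The helper name `to_safe_name` is hypothetical, and the camelCase convention simply follows the `adjClose` rename above.

```python
import re

# Characters that Delta rejects in column names by default.
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]+")


def to_safe_name(name):
    """Return `name` unchanged if it is already valid, else a camelCase version."""
    if not INVALID_CHARS.search(name):
        return name
    # Split on runs of invalid characters and drop empty fragments.
    parts = [p for p in INVALID_CHARS.split(name) if p]
    if not parts:
        raise ValueError(f"no usable characters in column name {name!r}")
    head, *tail = parts
    return head[0].lower() + head[1:] + "".join(p[0].upper() + p[1:] for p in tail)


columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume", "Symbol"]
print([to_safe_name(c) for c in columns])
```

Applied to a Spark DataFrame, `price_df.toDF(*[to_safe_name(c) for c in price_df.columns])` would rename every offending column in one step.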

In [0]:
# write our spark dataframes to delta tables
asset_info_df.write.saveAsTable("details")
price_df.write.saveAsTable("prices")
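As written, `saveAsTable` uses the default save mode, which raises an error if the table already exists, so rerunning the cell fails. A hedged sketch of a rerunnable variant follows. `write_delta_table` is a hypothetical helper, and choosing `overwrite` assumes the table should be fully replaced on each run.

```python
def write_delta_table(df, table_name, mode="overwrite"):
    """Save a Spark DataFrame as a Delta table.

    `df` is expected to be a pyspark.sql.DataFrame. The format is set
    explicitly so the call does not depend on the environment's default.
    """
    df.write.format("delta").mode(mode).saveAsTable(table_name)
```

Usage would be `write_delta_table(asset_info_df, "details")` followed by `write_delta_table(price_df, "prices")`. Passing `mode="append"` instead would add rows to an existing table rather than replace it.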

Great! Now we are ready to start executing queries on the data from any notebook in our spark environment.

### Appendix

#### What are Delta Tables?

Delta tables are Databricks' default format for storing data in your Databricks environment. More information is available in the documentation linked below.

ref: [docs.databricks/delta](https://docs.databricks.com/delta/index.html)