# Apple Stock

### Introduction:

This exercise works with Apple's daily stock prices from 1980 to 2014, using PySpark.


### Step 1. Import the necessary libraries

In [8]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as func

In [2]:
spark = SparkSession.builder.appName("exercise9-timeseries").getOrCreate()



### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/09_Time_Series/Apple_Stock/appl_1980_2014.csv)

### Step 3. Assign it to a variable apple

In [3]:
apple_df = spark.read.option("header", True).option("inferSchema", True).csv("appl_1980_2014.csv")

In [4]:
apple_df.show()

+-------------------+-----+-----+-----+-----+---------+---------+
|               Date| Open| High|  Low|Close|   Volume|Adj Close|
+-------------------+-----+-----+-----+-----+---------+---------+
|2014-07-08 00:00:00|96.27| 96.8|93.92|95.35| 65130000|    95.35|
|2014-07-07 00:00:00|94.14|95.99| 94.1|95.97| 56305400|    95.97|
|2014-07-03 00:00:00|93.67| 94.1| 93.2|94.03| 22891800|    94.03|
|2014-07-02 00:00:00|93.87|94.06|93.09|93.48| 28420900|    93.48|
|2014-07-01 00:00:00|93.52|94.07|93.13|93.52| 38170200|    93.52|
|2014-06-30 00:00:00| 92.1|93.73|92.09|92.93| 49482300|    92.93|
|2014-06-27 00:00:00|90.82| 92.0|90.77|91.98| 64006800|    91.98|
|2014-06-26 00:00:00|90.37|91.05| 89.8| 90.9| 32595800|     90.9|
|2014-06-25 00:00:00|90.21| 90.7|89.65|90.36| 36852200|    90.36|
|2014-06-24 00:00:00|90.75|91.74|90.19|90.28| 38988300|    90.28|
|2014-06-23 00:00:00|91.32|91.62| 90.6|90.83| 43618200|    90.83|
|2014-06-20 00:00:00|91.85|92.55| 90.9|90.91|100813200|    90.91|
|2014-06-1

### Step 4.  Check the types of the columns

In [6]:
apple_df.printSchema()

root
 |-- Date: timestamp (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)



### Step 5. Transform the Date column to a date type

In [10]:
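# inferSchema already parsed Date as a timestamp; to_date truncates it to a date.
# No format pattern is needed: "MM-dd-yyyy" does not describe these values, which are year-first.
apple_df = apple_df.withColumn("as_date", func.to_date(func.col("Date")))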
apple_df.printSchema()

root
 |-- Date: timestamp (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)
 |-- as_date: date (nullable = true)



### Step 6.  Set the date as the index

### Step 7.  Are there any duplicate dates?

In [14]:
apple_df.select("as_date").distinct().count()

8465

In [15]:
apple_df.select("as_date").count()

8465

In [16]:
from pyspark.sql import DataFrame

def has_duplicates(df: DataFrame, column_name: str) -> bool:
    unique_values = df.select(column_name).distinct().count()
    all_values = df.select(column_name).count()
    return all_values > unique_values

In [17]:
has_duplicates(apple_df, "as_date")

False
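
The count comparison above only says whether duplicates exist. As a plain-Python sketch of the same idea, a `Counter` can also report which values repeat. The helper name `duplicated_values` and the sample dates are made up for illustration and are not taken from the dataset.

```python
from collections import Counter
from datetime import date


def duplicated_values(values):
    """Return the values that occur more than once, in sorted order."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)


# Hypothetical sample with one repeated date (not from the Apple dataset).
sample_dates = [
    date(2014, 7, 1),
    date(2014, 7, 2),
    date(2014, 7, 2),
    date(2014, 7, 3),
]

# Same test as has_duplicates: more rows than distinct values means duplicates.
has_dupes = len(sample_dates) > len(set(sample_dates))
repeated = duplicated_values(sample_dates)

# A possible Spark counterpart (not run here):
# df.groupBy(column_name).count().where(func.col("count") > 1)
```

Listing the repeated values is mainly useful when the check returns `True` and the offending rows need to be inspected.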

### Step 8.  Oops... it seems the index starts from the most recent date. Make the first entry the oldest date.

In [19]:
apple_df.show(n=5)

+-------------------+-----+-----+-----+-----+--------+---------+----------+
|               Date| Open| High|  Low|Close|  Volume|Adj Close|   as_date|
+-------------------+-----+-----+-----+-----+--------+---------+----------+
|2014-07-08 00:00:00|96.27| 96.8|93.92|95.35|65130000|    95.35|2014-07-08|
|2014-07-07 00:00:00|94.14|95.99| 94.1|95.97|56305400|    95.97|2014-07-07|
|2014-07-03 00:00:00|93.67| 94.1| 93.2|94.03|22891800|    94.03|2014-07-03|
|2014-07-02 00:00:00|93.87|94.06|93.09|93.48|28420900|    93.48|2014-07-02|
|2014-07-01 00:00:00|93.52|94.07|93.13|93.52|38170200|    93.52|2014-07-01|
+-------------------+-----+-----+-----+-----+--------+---------+----------+
only showing top 5 rows



In [20]:
apple_df = apple_df.sort("as_date", ascending=True)

In [22]:
apple_df.show(n=5)

+-------------------+-----+-----+-----+-----+---------+---------+----------+
|               Date| Open| High|  Low|Close|   Volume|Adj Close|   as_date|
+-------------------+-----+-----+-----+-----+---------+---------+----------+
|1980-12-12 00:00:00|28.75|28.87|28.75|28.75|117258400|     0.45|1980-12-12|
|1980-12-15 00:00:00|27.38|27.38|27.25|27.25| 43971200|     0.42|1980-12-15|
|1980-12-16 00:00:00|25.37|25.37|25.25|25.25| 26432000|     0.39|1980-12-16|
|1980-12-17 00:00:00|25.87| 26.0|25.87|25.87| 21610400|      0.4|1980-12-17|
|1980-12-18 00:00:00|26.63|26.75|26.63|26.63| 18362400|     0.41|1980-12-18|
+-------------------+-----+-----+-----+-----+---------+---------+----------+
only showing top 5 rows



### Step 9. Get the last business day of each month

In [27]:
apple_df = apple_df.withColumn("day", func.dayofmonth("as_date"))
apple_df = apple_df.withColumn("month", func.month("as_date"))
apple_df = apple_df.withColumn("year", func.year("as_date"))

In [28]:
apple_df.show(n=5)

+-------------------+-----+-----+-----+-----+---------+---------+----------+---+-----+----+
|               Date| Open| High|  Low|Close|   Volume|Adj Close|   as_date|day|month|year|
+-------------------+-----+-----+-----+-----+---------+---------+----------+---+-----+----+
|1980-12-12 00:00:00|28.75|28.87|28.75|28.75|117258400|     0.45|1980-12-12| 12|   12|1980|
|1980-12-15 00:00:00|27.38|27.38|27.25|27.25| 43971200|     0.42|1980-12-15| 15|   12|1980|
|1980-12-16 00:00:00|25.37|25.37|25.25|25.25| 26432000|     0.39|1980-12-16| 16|   12|1980|
|1980-12-17 00:00:00|25.87| 26.0|25.87|25.87| 21610400|      0.4|1980-12-17| 17|   12|1980|
|1980-12-18 00:00:00|26.63|26.75|26.63|26.63| 18362400|     0.41|1980-12-18| 18|   12|1980|
+-------------------+-----+-----+-----+-----+---------+---------+----------+---+-----+----+
only showing top 5 rows



In [40]:
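# Caveat: grouping by month alone gives the largest day number seen for that month in any year,
# so the join keeps only rows that fall on that day and skips months that ended earlier.
# Grouping and joining on year as well would give the last trading day of each month.
max_days = apple_df.groupBy("month").agg(func.max("day").alias("day"))
max_apple_df = apple_df.join(max_days, on=["month", "day"])
max_apple_df.show()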

+-----+---+-------------------+------+------+------+------+---------+---------+----------+----+
|month|day|               Date|  Open|  High|   Low| Close|   Volume|Adj Close|   as_date|year|
+-----+---+-------------------+------+------+------+------+---------+---------+----------+----+
|    6| 30|2014-06-30 00:00:00|  92.1| 93.73| 92.09| 92.93| 49482300|    92.93|2014-06-30|2014|
|    4| 30|2014-04-30 00:00:00|592.64|599.43| 589.8|590.09|114160200|    83.83|2014-04-30|2014|
|    3| 31|2014-03-31 00:00:00|539.23|540.81|535.93|536.74| 42167300|    76.25|2014-03-31|2014|
|    1| 31|2014-01-31 00:00:00|495.18|501.53|493.55| 500.6|116199300|    70.69|2014-01-31|2014|
|   12| 31|2013-12-31 00:00:00|554.17|561.28| 554.0|561.02| 55771100|    79.23|2013-12-31|2013|
|   10| 31|2013-10-31 00:00:00| 525.0|527.49|521.27| 522.7| 68924100|    73.39|2013-10-31|2013|
|    9| 30|2013-09-30 00:00:00|477.25|481.66|474.41|476.75| 65039100|    66.94|2013-09-30|2013|
|    7| 31|2013-07-31 00:00:00|454.99|45
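
The last trading day has to be found per year and month, not per month across all years. Below is a minimal plain-Python sketch of that logic, keyed on `(year, month)`. The helper name `last_day_per_month` is hypothetical. The first three dates appear in the tables above, while 2014-05-30 and 2014-05-29 are assumed trading days added for illustration.

```python
from datetime import date


def last_day_per_month(dates):
    """Map each (year, month) to the latest date observed in that month."""
    latest = {}
    for d in dates:
        key = (d.year, d.month)
        if key not in latest or d > latest[key]:
            latest[key] = d
    return latest


dates = [
    date(2014, 6, 30), date(2014, 6, 27), date(2014, 4, 30),  # shown in the tables above
    date(2014, 5, 30), date(2014, 5, 29),                     # assumed for illustration
]

result = last_day_per_month(dates)

# Keying on month only would compare May 2014 against the largest day
# number May reached in any year, so a month ending on a weekend is missed.
# A possible Spark counterpart (not run here):
# apple_df.groupBy("year", "month").agg(func.max("as_date").alias("last_day"))
```

In this sample, May 2014 maps to 2014-05-30 because no later date exists for that month, even though the calendar month has 31 days.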

### Step 10.  What is the difference in days between the most recent day and the oldest?

In [47]:
maxmin_apple_df = apple_df.agg(func.max("as_date").alias("max_date"), func.min("as_date").alias("min_date"))
maxmin_apple_df.show()

+----------+----------+
|  max_date|  min_date|
+----------+----------+
|2014-07-08|1980-12-12|
+----------+----------+



In [49]:
diff_apple_df = maxmin_apple_df.withColumn("diff", maxmin_apple_df.max_date - maxmin_apple_df.min_date)
diff_apple_df.show()

+----------+----------+--------------------+
|  max_date|  min_date|                diff|
+----------+----------+--------------------+
|2014-07-08|1980-12-12|INTERVAL '12261' DAY|
+----------+----------+--------------------+



In [50]:
diff_apple_df.printSchema()

root
 |-- max_date: date (nullable = true)
 |-- min_date: date (nullable = true)
 |-- diff: interval day (nullable = true)



In [53]:
# Alternative: datediff returns the difference as an integer number of days
diff_apple_df_2 = maxmin_apple_df.select(func.datediff(maxmin_apple_df.max_date, maxmin_apple_df.min_date).alias('diff'))
diff_apple_df_2.show()

+-----+
| diff|
+-----+
|12261|
+-----+



In [54]:
diff_apple_df_2.printSchema()

root
 |-- diff: integer (nullable = true)
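
As a quick cross-check outside Spark, the same difference can be computed with Python's standard `datetime` module. This is only a sanity check on the two dates shown in the table above, not part of the Spark pipeline.

```python
from datetime import date

max_date = date(2014, 7, 8)    # max_date from the table above
min_date = date(1980, 12, 12)  # min_date from the table above

# Subtracting two dates yields a timedelta; .days is the whole-day difference.
diff_days = (max_date - min_date).days
print(diff_days)  # 12261, matching both Spark results
```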



### Step 11.  How many months of data do we have?

In [62]:
apple_df.show()

+-------------------+-----+-----+-----+-----+---------+---------+----------+---+-----+----+
|               Date| Open| High|  Low|Close|   Volume|Adj Close|   as_date|day|month|year|
+-------------------+-----+-----+-----+-----+---------+---------+----------+---+-----+----+
|1980-12-12 00:00:00|28.75|28.87|28.75|28.75|117258400|     0.45|1980-12-12| 12|   12|1980|
|1980-12-15 00:00:00|27.38|27.38|27.25|27.25| 43971200|     0.42|1980-12-15| 15|   12|1980|
|1980-12-16 00:00:00|25.37|25.37|25.25|25.25| 26432000|     0.39|1980-12-16| 16|   12|1980|
|1980-12-17 00:00:00|25.87| 26.0|25.87|25.87| 21610400|      0.4|1980-12-17| 17|   12|1980|
|1980-12-18 00:00:00|26.63|26.75|26.63|26.63| 18362400|     0.41|1980-12-18| 18|   12|1980|
|1980-12-19 00:00:00|28.25|28.38|28.25|28.25| 12157600|     0.44|1980-12-19| 19|   12|1980|
|1980-12-22 00:00:00|29.63|29.75|29.63|29.63|  9340800|     0.46|1980-12-22| 22|   12|1980|
|1980-12-23 00:00:00|30.88| 31.0|30.88|30.88| 11737600|     0.48|1980-12-23| 23|

In [63]:
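# Caveat: this counts distinct (month, day) pairs, not months; distinct (year, month) pairs would count months.
apple_df.agg(func.count_distinct(apple_df.month, apple_df.day).alias('count')).show()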

+-----+
|count|
+-----+
|  363|
+-----+
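
A count of distinct month and day combinations can never exceed 366, so 363 cannot be read as a number of months. A rough plain-Python sketch of the month count follows. The helper names are hypothetical, and the span calculation assumes every calendar month between the first and last date contains at least one trading day, which is not verified here.

```python
from datetime import date


def count_months(dates):
    """Count distinct (year, month) pairs among the given dates."""
    return len({(d.year, d.month) for d in dates})


def months_spanned(first, last):
    """Calendar months from first to last, inclusive of both end months."""
    return (last.year - first.year) * 12 + (last.month - first.month) + 1


# Three dates from the tables above: two in December 1980, one in July 2014.
sample = [date(1980, 12, 12), date(1980, 12, 15), date(2014, 7, 8)]
months_in_sample = count_months(sample)

# Upper bound for the full dataset, using min_date and max_date from Step 10.
upper_bound = months_spanned(date(1980, 12, 12), date(2014, 7, 8))

# A possible Spark counterpart (not run here):
# apple_df.agg(func.count_distinct("year", "month").alias("count"))
```

Under the stated assumption, the span from 1980-12-12 to 2014-07-08 covers 404 calendar months; the Spark count over `(year, month)` would confirm or lower that figure.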



### Step 12. Plot the 'Adj Close' value. Set the size of the figure to 13.5 x 9 inches

### BONUS: Create your own question and answer it.