# Spark DataFrame - Operations
Now that you know the basics, let's get into operations. 

Objective: This exercise is similar to the Basics exercise, but uses DataFrame methods instead of SQL. We'll also work through some more complex operations with a more realistic dataset.

In [None]:
# Must be included at the beginning of each new notebook. Remember to change the app name.
import findspark
findspark.init('/home/ubuntu/spark-3.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('operations').getOrCreate()

In [None]:
# inferSchema=True makes Spark scan the CSV and infer each column's type. Without it, every column is read as a string.
df = spark.read.csv('Datasets/apple_stock_data.csv', inferSchema=True, header=True)
df.printSchema()

In [None]:
# Let's get a better look at the data.
# We know that we can show a DataFrame, but the output here is wide and hard to read.
df.show()

In [None]:
# Instead, let's just grab the first row. Much neater! 
df.head(1)

## DataFrame Methods

In [None]:
# Even though we know SQL is available, let's try out some of the DataFrame methods.
# For this example, let's look at the opening and closing values on days where Close is less than 500.
df.filter("Close < 500").select('Open','Close').show()

In [None]:
# We can also pass a Python column expression to the filter method instead of a SQL string.
df.filter(df['Close'] < 500).select('Open','Close').show()

In [None]:
# And we can combine multiple conditions with &, wrapping each one in parentheses.
# Here we're looking for days with a significant increase between open and close.
df.filter( (df['Close'] > 500) & (df['Open'] < 495) ).select('Open','Close').show()
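
The parentheses around each comparison are required, not stylistic. In Python, `&` binds more tightly than `<` and `>`, so the expression is parsed differently without them. The sketch below uses plain integers (hypothetical values, no Spark needed) to show the difference; with Spark columns the unparenthesised form typically raises an error rather than filtering.

```python
# Hypothetical values chosen so that the two parses disagree.
close_price, open_price = 510, 600

# Intended meaning: both conditions must hold.
with_parens = (close_price > 500) & (open_price < 495)

# Without parentheses, & is evaluated first (500 & 600 == 80),
# and the rest becomes the chained comparison 510 > 80 < 495.
without_parens = close_price > 500 & open_price < 495

print(with_parens, without_parens)
```

The same rule applies to `|` (or) and `~` (not) when building filter conditions.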

## Using Collect
You may have noticed that showing a DataFrame can be quite messy. Instead, let's try the collect method, which returns the rows as a Python list of Row objects. It's not necessarily better, just a different way of inspecting the data.

In [None]:
# Let's pick a row of data with a low of $197.16 and collect it.  
employeeResult = df.filter(df['Low'] == 197.16).collect()
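
A note on filtering by exact equality with a decimal value: it works when the stored value and the literal parse to the same floating-point number, but it is fragile once a value has been produced by arithmetic. A plain-Python sketch of the pitfall and a tolerance-based comparison (hypothetical numbers, no Spark needed):

```python
import math

computed_low = 0.1 + 0.2   # stands in for a value produced by arithmetic
target = 0.3

# Exact comparison fails because 0.1 + 0.2 is not exactly 0.3 in binary floating point.
exact_match = computed_low == target

# Comparing within a small tolerance is more robust.
close_match = math.isclose(computed_low, target, abs_tol=1e-9)

print(exact_match, close_match)
```

In Spark, one way to apply the same idea would be to filter on the absolute difference being below a small tolerance, for example with `pyspark.sql.functions.abs`.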

In [None]:
# The collected result is a Python list of Row objects, which explains the format below.
employeeResult

In [None]:
# We can select the first row of data to shed the outer brackets.
employeeRow = employeeResult[0]

employeeRow

In [None]:
# And then view it as a dictionary.
employeeRow.asDict()

In [None]:
# Why convert it into a dictionary? Dictionaries are familiar and have a lot of methods available.
# For example, we can look up Volume by key. (A Row also supports employeeRow['Volume'] directly.)
employeeRow.asDict()['Volume']

## Aggregation and Dates
Let's shift gears a bit and focus on something different. Instead of simply exploring the data, let's try to find the average stock closing price per year. To do this, we'll first have to manipulate the Date column. Let's begin!

In [None]:
# Let's import the relevant functions.
from pyspark.sql.functions import dayofmonth,month,hour,year,format_number

In [None]:
# And create a new column by applying the year function to the Date column.
df_with_year = df.withColumn("Year",year(df["Date"]))

df_with_year.head(1)

In [None]:
# Now let's summarise the data by year: take the mean of each numeric column per year, then select the two columns we'd like to see.
df_summary = df_with_year.groupBy("Year").mean().select(['Year','avg(Close)'])
df_summary.show()
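
To make the two steps concrete, here is a plain-Python sketch of what extracting the year and then grouping and averaging amounts to. The rows are hypothetical, not taken from the dataset, and the code only illustrates the logic; it is not how Spark runs it.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical (date, close) rows; not taken from the real dataset.
rows = [
    ("2010-01-04", 100.0),
    ("2010-06-01", 110.0),
    ("2011-01-03", 200.0),
    ("2011-12-30", 220.0),
    ("2011-07-01", 210.0),
]

# Step 1: derive the year from each date (the withColumn/year step).
closes_by_year = defaultdict(list)
for day, close in rows:
    closes_by_year[date.fromisoformat(day).year].append(close)

# Step 2: average the closing prices within each year (the groupBy/mean step).
avg_close = {yr: mean(vals) for yr, vals in closes_by_year.items()}
print(avg_close)
```

Note that Spark does not guarantee the order of grouped rows, so the years in the real output may not appear in sequence.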

While the data may be accurate, the raw output isn't ready to present: the averages are unrounded and the column name is cryptic. Let's make a few adjustments to make it more readable.

In [None]:
# To make it more visually appealing, let's format the mean to two decimal places.
df_formatted = df_summary.select(['Year', format_number("avg(Close)",2)])
df_formatted.show()
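
One thing to keep in mind: `format_number` returns a string, not a number, so the formatted column is for display only. The plain-Python sketch below (hypothetical values, using Python's own `format` as a stand-in with a similar look) shows why sorting formatted values gives a different order from sorting the numbers.

```python
# Hypothetical yearly averages; not taken from the real dataset.
averages = [98.104, 1024.5, 250.0]

# Python's ',.2f' format gives a similar look: thousands separators, two decimals.
formatted = [format(value, ",.2f") for value in averages]

# Strings sort character by character, so "1,024.50" comes before "98.10".
print(sorted(formatted))
print(sorted(averages))
```

For that reason, it is usually safer to do any sorting or arithmetic on the numeric column first and apply the formatting as the last step.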

In [None]:
# Let's change the name of the column to something that makes sense.
df_renamed = df_formatted.withColumnRenamed("format_number(avg(Close), 2)","Average Closing Price")
df_renamed.show()

In [None]:
# And finally order it by year.
df_renamed.orderBy('Year').show()

Great job! At this stage, it's a good idea to continue exploring the basics of DataFrames. Try different methods or read the documentation.

When you feel comfortable, move on to the DataFrame Data Cleaning Exercise. 

If you would like a simpler aggregation example, try the DataFrame Aggregation Exercise. 