### Apache Spark

Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.

However, with greater computing power comes greater complexity.

Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:

 - Is my data too big to work with on a single machine?
 - Can my calculations be easily parallelized?
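
As a rough illustration of the split-and-combine idea described above, the sketch below divides a list into chunks, lets each worker process one chunk, and then combines the partial results. It uses only the Python standard library, with threads standing in for cluster nodes; the function names `split_into_chunks` and `parallel_sum_of_squares` are hypothetical helpers, and this is not how Spark itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into n_chunks contiguous pieces of roughly equal size."""
    size, remainder = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(data, n_workers=4):
    """Each worker handles one chunk; partial results are combined at the end."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(lambda chunk: sum(x * x for x in chunk), chunks))
    return sum(partial_sums)


numbers = list(range(1, 101))
total = parallel_sum_of_squares(numbers)
print(total)
```

A sum is easy to parallelize because the partial results can be combined in any order. Calculations where each step depends on the previous one are much harder to split up this way.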
 
 --------------

#### The first step in using Spark is connecting to a cluster.

In practice, the cluster will be hosted on a remote machine that's connected to all other nodes. There will be one computer, called the master, that manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.

When you're just getting started with Spark, it's simpler to run a cluster locally.

Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

An object holding all these attributes can be created with the SparkConf() constructor. Take a look at the documentation for all the details! https://spark.apache.org/docs/2.1.0/api/python/pyspark.html

**Instantiate the SparkContext**

from pyspark.context import SparkContext<br>
sc = SparkContext('local', 'test')

**Verify SparkContext**

print(sc)
> SparkContext master=local[*] appName=pyspark-shell

**Print Spark version**

print(sc.version)
> 3.2.0

--------------

#### Using DataFrames

Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so generally you'll opt for the Spark DataFrame abstraction which is built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). 

DataFrame benefits:
 - They are easier to understand than RDDs.
 - They are also more optimized for complicated operations than RDDs.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others.

When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!

To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext. You can think of the SparkContext as your connection to the cluster and the SparkSession as your interface with that connection.

Creating multiple SparkSessions and SparkContexts can cause issues, so **it's best practice to use the SparkSession.builder.getOrCreate() method**. This returns an existing SparkSession if there's already one in the environment, or creates a new one if necessary.

**Import SparkSession from pyspark.sql**

from pyspark.sql import SparkSession

**Create/print my_spark**

my_spark = SparkSession.builder.getOrCreate()

print(my_spark)
> <pyspark.sql.session.SparkSession object at 0x7f8f7de4dac0>

--------------
SparkSession has an attribute called catalog which lists all the data inside the cluster. This attribute has a few methods for extracting different pieces of information.

One of the most useful is the .listTables() method, which returns the names of all the tables in your cluster as a list.

**Print the tables in the catalog (spark is the SparkSession, such as my_spark above)**

print(spark.catalog.listTables())
> [Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

--------------

#### SQL on DataFrames
One of the advantages of the DataFrame interface is that you can run SQL queries on the tables in your Spark cluster.

Running a query on this table is as easy as using the .sql() method on your SparkSession. This method takes a string containing the query and returns a DataFrame with the results.

**Run SQL on a Spark DataFrame**

query = "FROM flights SELECT * LIMIT 10"<br>
results = spark.sql(query)<br>
results.show()


| year | month | day | dep_time |
|:-----|:------|:----|:---------|
| 2014 | 12    | 8   | 658      |
| 2014 | 1     | 22  | 1040     |
| 2014 | 3     | 9   | 1443     |

--------------

#### Pandafy a Spark DataFrame

Suppose you've run a query on your huge dataset and aggregated it down to something a little more manageable.

Sometimes it makes sense to then take that table and work with it locally using a tool like pandas. Spark DataFrames make that easy with the .toPandas() method. Calling this method on a Spark DataFrame returns the corresponding pandas DataFrame.

**Run SQL on a Spark DataFrame and convert to pandas**

query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"<br>
flight_counts = spark.sql(query)<br>
pd_counts = flight_counts.toPandas()<br>
print(pd_counts.head())


|   | origin | dest | N   |
|:--|:-------|:-----|:----|
| 0 | SEA    | RNO  | 8   |
| 1 | SEA    | DTW  | 98  |
| 2 | SEA    | CLE  | 2   |
| 3 | SEA    | LAX  | 450 |
| 4 | PDX    | SEA  | 144 |

--------------

#### From Pandas to Spark DataFrame
However, maybe you want to go the other direction, and put a pandas DataFrame into a Spark cluster! The SparkSession class has a method for this as well.

The .createDataFrame() method takes a pandas DataFrame and returns a Spark DataFrame.

The output of this method is stored locally, not in the SparkSession catalog. This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts.

For example, a SQL query (using the .sql() method) that references your DataFrame will throw an error. To access the data in this way, you have to save it as a temporary table.

You can do this using the .createTempView() Spark DataFrame method, which takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific SparkSession used to create the Spark DataFrame.

There is also the method .createOrReplaceTempView(). This safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. You'll use this method to avoid running into problems with duplicate tables.

Here are all the different ways your Spark data structures interact with each other:

<img src="assets/spark/spark_figure.png" style="width: 600px;"/>

**Example**

 - ***The code to create a pandas DataFrame of random numbers***<br>

import numpy as np<br>
import pandas as pd<br>
pd_temp = pd.DataFrame(np.random.random(10))

 - ***Create a Spark DataFrame called spark_temp by calling the Spark method .createDataFrame() with pd_temp as the argument***<br>

spark_temp = spark.createDataFrame(pd_temp)

- ***Examine the list of tables in your Spark cluster and verify that the new DataFrame is not present. Remember you can use spark.catalog.listTables() to do so***<br>

print(spark.catalog.listTables())
> []

 - ***Register the spark_temp DataFrame you just created as a temporary table using the .createOrReplaceTempView() method. The temporary table should be named "temp". Remember that the table name is set by passing it as the only argument to the method***<br>

spark_temp.createOrReplaceTempView('temp')

 - ***Examine the tables in the catalog again***<br>

print(spark.catalog.listTables())
> [Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

--------------

#### Read from file

SparkSession has a .read attribute which has several methods for reading different data sources into Spark DataFrames. Using these, you can create a DataFrame from a .csv file just like with regular pandas DataFrames.


 - ***Use the .read.csv() method to create a Spark DataFrame called airports***<br>
 
file_path = "/usr/local/share/datasets/airports.csv"


 - ***Read in the airports data and show. The first argument is file_path. Pass the argument header=True so that Spark knows to take the column names from the first line of the file.***<br>
 
airports = spark.read.csv(file_path, header=True)<br>
airports.show()

> table

--------------
#### Adding columns

Column-wise operations are performed with the .withColumn() method, which takes two arguments: first, a string with the name of your new column, and second, the new column itself.

The new column must be an object of class Column. Creating one of these is as easy as extracting a column from your DataFrame using df.colName.

Updating a Spark DataFrame is somewhat different than working in pandas because the Spark DataFrame is immutable. This means that it can't be changed, and so columns can't be updated in place.

Thus, all these methods return a new DataFrame. To overwrite the original DataFrame you must reassign the returned DataFrame using the method like so:

df = df.withColumn("newCol", df.oldCol + 1)

The above code creates a DataFrame with the same columns as df plus a new column, newCol, where every entry is equal to the corresponding entry from oldCol, plus one.

To overwrite an existing column, just pass the name of the column as the first argument.


 - ***Create the DataFrame flights. Use the spark.table() method with the argument "flights" to create a DataFrame containing the values of the flights table in the .catalog. Save it as flights. (Here, spark is the SparkSession object.)***<br>

flights = spark.table('flights')<br>
print(spark.catalog.listTables())

- ***Show the head of flights using flights.show(). Check the output: the column air_time contains the duration of the flight in minutes.***<br>

flights.show()

 - ***Add duration_hrs. Update flights to include a new column called duration_hrs, that contains the duration of each flight in hours (you'll need to divide air_time by the number of minutes in an hour).***<br>

flights = flights.withColumn('duration_hrs',flights.air_time/60)

--------------

#### Filtering Data

The .filter() method is the Spark counterpart of SQL's WHERE clause. It takes either an expression that would follow the WHERE clause of a SQL expression as a string, or a Spark Column of boolean (True/False) values.

For example, the following two expressions will produce the same output:

> flights.filter("air_time > 120").show()<br>
> flights.filter(flights.air_time > 120).show()

Notice that in the first case, we pass a string to .filter(). In SQL, we would write this filtering task as SELECT * FROM flights WHERE air_time > 120. Spark's .filter() can accept any expression that could go in the WHERE clause of a SQL query (in this case, "air_time > 120"), as long as it is passed as a string. Notice that in this case, we do not reference the name of the table in the string, just as we wouldn't in the SQL query.

In the second case, we actually pass a column of boolean values to .filter(). Remember that flights.air_time > 120 returns a column of boolean values that has True in place of those records in flights.air_time that are over 120, and False otherwise.

**Use the .filter() method to find all the flights that flew over 1000 miles two ways:**

 - ***First, pass a SQL string to .filter() that checks whether the distance is greater than 1000. Save this as long_flights1.***
 
long_flights1 = flights.filter("distance > 1000")<br>
long_flights1.show()

 - ***Then pass a column of boolean values to .filter() that checks the same thing. Save this as long_flights2.***
 
long_flights2 = flights.filter(flights.distance > 1000)<br>
long_flights2.show()

--------------

#### Selecting Data

The Spark variant of SQL's SELECT is the .select() method. This method takes multiple arguments - one for each column you want to select. These arguments can either be the column name as a string (one for each column) or a column object (using the df.colName syntax). When you pass a column object, you can perform operations like addition or subtraction on the column to change the data contained in it, much like inside .withColumn().

The difference between .select() and .withColumn() methods is that .select() returns only the columns you specify, while .withColumn() returns all the columns of the DataFrame in addition to the one you defined. It's often a good idea to drop columns you don't need at the beginning of an operation so that you're not dragging around extra data as you're wrangling. In this case, you would use .select() and not .withColumn().

 - ***Select the columns "tailnum", "origin", and "dest" from flights by passing the column names as strings. Save this as selected1.***

selected1 = flights.select( "tailnum", "origin", "dest")

 - ***Select the columns "origin", "dest", and "carrier" using the df.colName syntax and then filter the result using both of the filters already defined for you (filterA and filterB) to only keep flights from SEA to PDX. Save this as selected2.***

temp = flights.select(flights.origin, flights.dest, flights.carrier)<br>
filterA = flights.origin == "SEA"<br>
filterB = flights.dest == "PDX"<br>
selected2 = temp.filter(filterA).filter(filterB)

--------------

#### Selecting Data II

Similar to SQL, you can also use the .select() method to perform column-wise operations. When you're selecting a column using the df.colName notation, you can perform any column operation and the .select() method will return the transformed column. For example,

>flights.select(flights.air_time/60)

returns a column of flight durations in hours instead of minutes. You can also use the .alias() method to rename a column you're selecting. So if you wanted to .select() the column duration_hrs (which isn't in your DataFrame) you could do

>flights.select((flights.air_time/60).alias("duration_hrs"))

The equivalent Spark DataFrame method .selectExpr() takes SQL expressions as a string:

>flights.selectExpr("air_time/60 as duration_hrs")

with the SQL as keyword being equivalent to the .alias() method. To select multiple columns, you can pass multiple strings.


**Create a table of the average speed of each flight both ways.**

 - ***Calculate average speed by dividing the distance by the air_time (converted to hours). Use the .alias() method to name this column "avg_speed". Save the output as the variable avg_speed.***

avg_speed = (flights.distance/(flights.air_time/60)).alias("avg_speed")


 - ***Select the columns "origin", "dest", "tailnum", and avg_speed (without quotes!). Save this as speed1. Create the same table using .selectExpr() and a string containing a SQL expression. Save this as speed2.***

speed1 = flights.select("origin", "dest", "tailnum", avg_speed)<br>
speed2 = flights.selectExpr("origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed")

--------------

#### Aggregating Data

PySpark has a whole class devoted to grouped data frames: pyspark.sql.GroupedData.

All of the common aggregation methods, like .min(), .max(), and .count() are GroupedData methods. GroupedData objects are created by calling the .groupBy() DataFrame method. For example, to find the minimum value of a column, col, in a DataFrame, df, you could do

>df.groupBy().min("col").show()

This creates a GroupedData object (so you can use the .min() method), then finds the minimum value in col, and returns it as a DataFrame.

 - ***Find the length of the shortest (in terms of distance) flight that left PDX by first .filter()ing and using the .min() method. Perform the filtering by referencing the column directly, not passing a SQL string.***

flights.filter(flights.origin == 'PDX').groupBy().min('distance').show()

 - ***Find the length of the longest (in terms of time) flight that left SEA by filter()ing and using the .max() method. Perform the filtering by referencing the column directly, not passing a SQL string.***

flights.filter(flights.origin == 'SEA').groupBy().max('air_time').show()


 - ***Use the .avg() method to get the average air time of Delta Airlines flights (where the carrier column has the value "DL") that left SEA. The place of departure is stored in the column origin. show() the result.***

flights.filter(flights.carrier == "DL").filter(flights.origin == "SEA").groupBy().avg("air_time").show()
 
 - ***Use the .sum() method to get the total number of hours all planes in this dataset spent in the air by creating a column called duration_hrs from the column air_time. show() the result.***

flights.withColumn("duration_hrs", flights.air_time/60).groupBy().sum("duration_hrs").show()


--------------

You've learned how to create a grouped DataFrame by calling the .groupBy() method on a DataFrame with no arguments.

Now you'll see that when you pass the name of one or more columns in your DataFrame to the .groupBy() method, the aggregation methods behave like when you use a GROUP BY statement in a SQL query.
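
What grouping by a key means can be sketched in plain Python, without Spark. The rows below are made up, and `group_by` is a hypothetical helper written for this illustration; it mimics the result of grouping and aggregating, not how Spark computes it.

```python
from collections import defaultdict

# Illustrative rows standing in for a few records of the flights table.
rows = [
    {"tailnum": "N101", "origin": "SEA", "air_time": 100},
    {"tailnum": "N101", "origin": "PDX", "air_time": 140},
    {"tailnum": "N202", "origin": "SEA", "air_time": 200},
    {"tailnum": "N101", "origin": "SEA", "air_time": 60},
]


def group_by(records, key):
    """Collect records into buckets that share the same value for `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Like groupBy("tailnum").count(): one count per distinct tail number.
flights_per_plane = {k: len(v) for k, v in group_by(rows, "tailnum").items()}

# Like groupBy("origin").avg("air_time"): one average per distinct origin.
avg_air_time = {
    k: sum(r["air_time"] for r in v) / len(v)
    for k, v in group_by(rows, "origin").items()
}

print(flights_per_plane)
print(avg_air_time)
```

Each distinct key value yields exactly one output row, which is the same shape a SQL GROUP BY produces.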

 - ***Create a DataFrame called by_plane that is grouped by the column tailnum.***

by_plane = flights.groupBy("tailnum")

 - ***Use the .count() method with no arguments to count the number of flights each plane made.***

by_plane.count().show()

 - ***Create a DataFrame called by_origin that is grouped by the column origin.***
 
by_origin = flights.groupBy("origin")

 - ***Find the .avg() of the air_time column to find average duration of flights from PDX and SEA.***
 
by_origin.avg("air_time").show()


--------------

In addition to the GroupedData methods you've already seen, there is also the .agg() method. This method lets you pass an aggregate column expression that uses any of the aggregate functions from the pyspark.sql.functions submodule.

This submodule contains many useful functions for computing things like standard deviations. All the aggregation functions in this submodule take the name of a column in a GroupedData table.

 -  ***Import the submodule pyspark.sql.functions as F.***

import pyspark.sql.functions as F

 - ***Create a GroupedData table called by_month_dest that's grouped by both the month and dest columns. Refer to the two columns by passing both strings as separate arguments.***

by_month_dest = flights.groupBy('month','dest')

 - ***Use the .avg() method on the by_month_dest GroupedData object to get the average dep_delay in each month for each destination.***

by_month_dest.avg('dep_delay').show()

 - ***Find the standard deviation of dep_delay by using the .agg() method with the function F.stddev().***

by_month_dest.agg(F.stddev('dep_delay')).show()

--------------
#### Joining

Another very common data operation is the join. Joins are a whole topic unto themselves.

A join will combine two different tables along a column that they share. This column is called the key. Examples of keys here include the tailnum and carrier columns from the flights table.

For example, suppose that you want to know more information about the plane that flew a flight than just the tail number. This information isn't in the flights table because the same plane flies many different flights over the course of two years, so including this information in every row would result in a lot of duplication. To avoid this, you'd have a second table that has only one row for each plane and whose columns list all the information about the plane, including its tail number. You could call this table planes.

When you join the flights table to this table of airplane information, you're adding all the columns from the planes table to the flights table. To fill these columns with information, you'll look at the tail number from the flights table and find the matching one in the planes table, and then use that row to fill out all the new columns.

In PySpark, joins are performed using the DataFrame method .join(). This method takes three arguments. The first is the second DataFrame that you want to join with the first one. The second argument, on, is the name of the key column(s) as a string. The names of the key column(s) must be the same in each table. The third argument, how, specifies the kind of join to perform. In this course we'll always use the value how="leftouter".



 - ***Examine the airports DataFrame by calling .show(). Note which key column will let you join airports to the flights table.***

airports.show()

 - ***Rename the faa column in airports to dest by re-assigning the result of airports.withColumnRenamed("faa", "dest") to airports.***

airports = airports.withColumnRenamed("faa","dest")

 - ***Join the flights with the airports DataFrame on the dest column by calling the .join() method on flights. Save the result as flights_with_airports.***
     - The first argument should be the other DataFrame, airports.
     - The argument on should be the key column.
     - The argument how should be "leftouter".

flights_with_airports = flights.join(airports, on='dest', how='leftouter')


 - ***Call .show() on flights_with_airports to examine the data again. Note the new information that has been added.***
 
flights_with_airports.show()