# Getting to know PySpark
  
In this course, you'll learn how to use Spark from Python! Spark is a tool for doing parallel computation with large datasets and it integrates well with Python. PySpark is the Python package that makes the magic happen. You'll use this package to work with data about flights from Portland and Seattle. You'll learn to wrangle this data and build a whole machine learning pipeline to predict whether or not flights will be delayed. Get ready to put some Spark in your Python code and dive into the world of high-performance machine learning!
  
In this chapter, you'll learn how Spark manages data and how you can read and write tables from Python.
  
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   
      /_/
```

## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
[Apache Spark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Operator</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>pyspark.SparkContext()</td>
    <td>Creates a SparkContext, the entry point to using Spark functionality.</td>
  </tr>
  <tr>
    <td>2</td>
    <td>pyspark.SparkContext().version</td>
    <td>Returns the version of Spark on which the application is running.</td>
  </tr>
  <tr>
    <td>3</td>
    <td>pyspark.SparkContext().stop()</td>
    <td>Stops the SparkContext, terminating the connection to the Spark cluster.</td>
  </tr>
  <tr>
    <td>4</td>
    <td>pyspark.sql.SparkSession</td>
    <td>Creates a SparkSession, the entry point to using Spark SQL functionality.</td>
  </tr>
  <tr>
    <td>5</td>
    <td>pyspark.sql.SparkSession.builder.getOrCreate()</td>
    <td>Returns an existing SparkSession or creates a new one.</td>
  </tr>
  <tr>
    <td>6</td>
    <td>pyspark.sql.SparkSession.builder.appName</td>
    <td>Sets the name of the application for the SparkSession.</td>
  </tr>
  <tr>
    <td>7</td>
    <td>pyspark.sql.SparkSession.builder.getOrCreate</td>
    <td>Returns an existing SparkSession or creates a new one.</td>
  </tr>
  <tr>
    <td>8</td>
    <td>pyspark.sql.SparkSession.read</td>
    <td>Creates a DataFrameReader for reading data into a DataFrame.</td>
  </tr>
  <tr>
    <td>9</td>
    <td>pyspark.sql.SparkSession.read.format</td>
    <td>Sets the input format for reading data into a DataFrame.</td>
  </tr>
  <tr>
    <td>10</td>
    <td>pyspark.sql.SparkSession.read.format.option('inferSchema', 'True')</td>
    <td>Sets the option to infer the schema of the DataFrame from the data.</td>
  </tr>
  <tr>
    <td>11</td>
    <td>pyspark.sql.SparkSession.read.format.option('header', 'True')</td>
    <td>Sets the option to treat the first row as the header in the DataFrame.</td>
  </tr>
  <tr>
    <td>12</td>
    <td>pyspark.sql.SparkSession.read.format.load()</td>
    <td>Loads data into a DataFrame using the specified format.</td>
  </tr>
  <tr>
    <td>13</td>
    <td>pyspark.sql.DataFrame.createOrReplaceTempView</td>
    <td>Creates or replaces a temporary view of a DataFrame.</td>
  </tr>
  <tr>
    <td>14</td>
    <td>pyspark.sql.SparkSession.catalog.listTables()</td>
    <td>Returns a list of tables in the catalog.</td>
  </tr>
  <tr>
    <td>15</td>
    <td>pyspark.sql.SparkSession.sql()</td>
    <td>Executes a SQL query and returns the result as a DataFrame.</td>
  </tr>
  <tr>
    <td>16</td>
    <td>pyspark.sql.SparkSession.sql().toPandas()</td>
    <td>Converts a DataFrame to a Pandas DataFrame.</td>
  </tr>
  <tr>
    <td>17</td>
    <td>pyspark.sql.SparkSession.sql().toPandas().head()</td>
    <td>Returns the first n rows of a Pandas DataFrame.</td>
  </tr>
  <tr>
    <td>18</td>
    <td>pyspark.sql.SparkSession.createDataFrame()</td>
    <td>Creates a DataFrame from a Pandas DataFrame or an RDD.</td>
  </tr>
  <tr>
    <td>19</td>
    <td>pyspark.sql.SparkSession.read.csv()</td>
    <td>Reads a CSV file and returns a DataFrame.</td>
  </tr>
</table>

  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: pyspark  
Version: 3.4.1  
Summary: Apache Spark Python API  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Starts an interactive IPython shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline in the background and writes its output to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines of the user-defined function `test` (requires `import inspect`).  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
```python
import matplotlib.pyplot as plt
import numpy as np

# Sample curve drawn once per style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
for i, style in enumerate(available):
    # Apply each style only while its own subplot is created and drawn
    with plt.style.context(style):
        ax = fig.add_subplot(10, 3, i + 1)
        ax.plot(x, y)
        ax.set_title(style)
```
  

In [1]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import pyspark                      # Apache Spark:             Cluster Computing

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)

### What is Spark, anyway?
  
Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.
  
As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.
  
However, with greater computing power comes greater complexity.
  
Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:
  
- Is my data too big to work with on a single machine?
- Can my calculations be easily parallelized?
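
To make "easily parallelized" a bit more concrete, here is a minimal sketch in plain Python (no Spark involved). It splits a list into chunks, lets each "worker" compute a partial result on its own chunk, and then combines the partial results. The helper names `split` and `partial_sum_of_squares` are made up for this illustration, and the thread pool only stands in for the nodes of a cluster; it is not how Spark is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Split a list into n_parts roughly equal chunks (one per 'worker')."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """Work done independently on one chunk of the data."""
    return sum(x * x for x in chunk)


data = list(range(1_000))
chunks = split(data, n_parts=4)

# Each chunk is processed independently; only the small partial results
# travel back to be combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum_of_squares, chunks))

total = sum(partial_results)
print(total)
```

A sum works well here because the partial results can be combined in any order. Calculations where every step depends on the previous one are much harder to spread over nodes.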

### Using Spark in Python
  
The first step in using Spark is connecting to a cluster.
  
In practice, the cluster will be hosted on a remote machine that's connected to all other nodes. There will be one computer, called the master, that manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.
  
When you're just getting started with Spark it's simpler to just run a cluster locally. Thus, for this course, instead of connecting to another computer, all computations will be run on DataCamp's servers in a simulated cluster.
  
Creating the connection is as simple as creating an instance of the `SparkContext` class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.
  
An object holding all these attributes can be created with the `SparkConf()` constructor. Take a look at the [documentation](http://spark.apache.org/docs/3.0.0/api/python/pyspark.html) for all the details!
  

### Examining The SparkContext
  
In this exercise you'll get familiar with the `SparkContext`.
  
You'll probably notice that code takes longer to run than you might expect. This is because Spark is some serious software. It takes more time to start up than you might be used to. You may also find that running simpler computations might take longer than expected. That's because all the optimizations that Spark has under its hood are designed for complicated operations with big data sets. That means that for simple or small problems Spark may actually perform worse than some other solutions!
  
Get to know the `SparkContext`.
  
- Call `print()` on `sc`, the `pyspark.SparkContext` instance, to verify there's a `SparkContext` in your environment.
- `print()` `sc.version` to see what version of Spark is running on your cluster.

In [2]:
sc = pyspark.SparkContext()

# Verify SparkContext
print(sc)

# Print Spark version
print(sc.version)

sc.stop()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/24 12:33:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


<SparkContext master=local[*] appName=pyspark-shell>
3.4.1


Awesome! You're up and running with Spark.

### Using DataFrames
  
Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so in this course you'll be using the Spark DataFrame abstraction built on top of RDDs.
  
The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are they easier to understand, DataFrames are also more optimized for complicated operations than RDDs.
  
When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!
  
To start working with Spark DataFrames, you first have to create a `SparkSession` object from your `SparkContext`. You can think of the `SparkContext` as your connection to the cluster and the `SparkSession` as your interface with that connection.
  
Remember, for the rest of this course you'll have a `SparkSession` called `spark` available in your workspace!
  
---
  
Which of the following is an advantage of Spark DataFrames over RDDs?
  
Possible Answers
  
- [x] Operations using DataFrames are automatically optimized.
- [ ] They are smaller.
- [ ] They can perform more kinds of operations.
- [ ] They can hold more kinds of data.
  
Exactly! This is another way DataFrames are like SQL tables.

### Creating a SparkSession
  
We've already created a `SparkSession` for you called `spark`, but what if you're not sure there already is one? Creating multiple `SparkSession`s and `SparkContext`s can cause issues, so it's best practice to use the `SparkSession.builder.getOrCreate()` method. This returns an existing `SparkSession` if there's already one in the environment, or creates a new one if necessary!
  
1. Import `SparkSession` from `pyspark.sql`.
2. Make a new `SparkSession` called `my_spark` using `SparkSession.builder.getOrCreate()`.
3. Print `my_spark` to the console to verify it's a `SparkSession`.

In [3]:
from pyspark.sql import SparkSession

# Create my_spark
my_spark = SparkSession.builder.getOrCreate()

# Print my_spark
print(my_spark)

<pyspark.sql.session.SparkSession object at 0x123ceda10>


Great work! You did that like a PySpark Pro!

### Viewing tables
  
Once you've created a `SparkSession`, you can start poking around to see what data is in your cluster!
  
Your `SparkSession` has an attribute called `catalog` which lists all the data inside the cluster. This attribute has a few methods for extracting different pieces of information.
  
One of the most useful is the `.listTables()` method, which returns the names of all the tables in your cluster as a list.
  
1. See what tables are in your cluster by calling `spark.catalog.listTables()` and printing the result!


In [4]:
spark = (
    SparkSession.builder.appName('flights').getOrCreate()
)

# Path to dataset
csv_file = '../_datasets/flights_small.csv'

# Read and create a temporary view
# Infer schema (note that for larger files you may want to specify the schema)
flights = (
    spark.read.format('csv').option('inferSchema', 'True').option('header', 'True').load(csv_file)
)

# Temporary view
flights.createOrReplaceTempView('flights')

23/08/24 12:33:52 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
                                                                                

In [5]:
# Print the tables in the catalog
print(spark.catalog.listTables())

[Table(name='flights', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]


Fantastic! What kind of data do you think is in that table?

### Are you query-ious?
  
One of the advantages of the DataFrame interface is that you can run SQL queries on the tables in your Spark cluster. If you don't have any experience with SQL, don't worry, we'll provide you with queries! 
  
As you saw in the last exercise, one of the tables in your cluster is the `flights` table. This table contains a row for every flight that left Portland International Airport (PDX) or Seattle-Tacoma International Airport (SEA) in 2014 and 2015.
  
Running a query on this table is as easy as using the `.sql()` method on your `SparkSession`. This method takes a string containing the query and returns a DataFrame with the results!
  
If you look closely, you'll notice that the table `flights` is only mentioned in the query, not as an argument to any of the methods. This is because there isn't a local object in your environment that holds that data, so it wouldn't make sense to pass the table as an argument.
  
Remember, we've already created a `SparkSession` called `spark` in your workspace. (It's no longer called `my_spark` because we created it for you!)
  
1. Use the `.sql()` method to get the first 10 rows of the `flights` table and save the result to `flights10`. The variable `query` contains the appropriate SQL query.
2. Use the DataFrame method `.show()` to print `flights10`.

In [6]:
query = 'FROM flights SELECT * LIMIT 10'

# Get the first 10 rows of flights
flights10 = spark.sql(query)

# Show the results
flights10.show()

                                                                                

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

Awesome work! You've got queries down!

### Pandafy a Spark DataFrame
  
Suppose you've run a query on your huge dataset and aggregated it down to something a little more manageable.
  
Sometimes it makes sense to then take that table and work with it locally using a tool like `pandas`. Spark DataFrames make that easy with the `.toPandas()` method. Calling this method on a Spark DataFrame returns the corresponding pandas DataFrame. It's as simple as that!
  
This time the query counts the number of flights to each airport from SEA and PDX.
  
Remember, there's already a `SparkSession` called `spark` in your workspace!
  
1. Run the query using the `.sql()` method. Save the result in `flight_counts`.
2. Use the `.toPandas()` method on `flight_counts` to create a `pandas` DataFrame called `pd_counts`.
3. Print the `.head()` of `pd_counts` to the console.

In [7]:
query = '''
SELECT origin, dest, COUNT(*) AS N 
FROM flights 
GROUP BY origin, dest'''

# Run the query
flight_counts = spark.sql(query)

# Convert the results to a pandas DataFrame
pd_counts = flight_counts.toPandas()

# Print the head of pd_counts
pd_counts.head()

                                                                                

  origin dest    N
0    SEA  RNO    8
1    SEA  DTW   98
2    SEA  CLE    2
3    SEA  LAX  450
4    PDX  SEA  144


Great job! You did it!

### Put some Spark in your data
  
In the last exercise, you saw how to move data from Spark to `pandas`. However, maybe you want to go the other direction, and put a `pandas` DataFrame into a Spark cluster! The `SparkSession` class has a method for this as well.
  
The `.createDataFrame()` method takes a `pandas` DataFrame and returns a Spark DataFrame.
  
The output of this method is stored locally, not in the `SparkSession` catalog. This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts.
  
For example, a SQL query (using the `.sql()` method) that references your DataFrame will throw an error. To access the data in this way, you have to save it as a temporary table.
  
You can do this using the `.createTempView()` Spark DataFrame method, which takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific `SparkSession` used to create the Spark DataFrame.
  
There is also the method `.createOrReplaceTempView()`. This safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. You'll use this method to avoid running into problems with duplicate tables.
  
Check out the diagram to see all the different ways your Spark data structures interact with each other.
  
<center><img src='../_images/spark_figure.png' alt='img'></center>
  
There's already a `SparkSession` called `spark` in your workspace, `numpy` has been imported as `np`, and `pandas` as `pd`.
  
1. The code to create a `pandas` DataFrame of random numbers has already been provided and saved under `pd_temp`.
2. Create a Spark DataFrame called `spark_temp` by calling the Spark method `.createDataFrame()` with `pd_temp` as the argument.
3. Examine the list of tables in your Spark cluster and verify that the new DataFrame is not present. Remember you can use `spark.catalog.listTables()` to do so.
4. Register the `spark_temp` DataFrame you just created as a temporary table using the `.createOrReplaceTempView()` method. The temporary table should be named "temp". Remember that the table name is set by passing it as the only argument to the method!
5. Examine the list of tables again.

In [8]:
# Create pd_temp
pd_temp = pd.DataFrame(np.random.random(10))

# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(pd_temp)

# Examine the tables in the catalog
print(spark.catalog.listTables())

# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView('temp')

# Examine the tables in the catalog again
print(spark.catalog.listTables())

[Table(name='flights', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]
[Table(name='flights', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True), Table(name='temp', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]


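<span style='color:#7393B3'>NOTE:</span>  The exercise output below comes from the course environment, which uses version 3.2.0 rather than the 3.4.1 used in this notebook, so the `Table` fields differ from the output above: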
  
```python
<script.py> output:
    []
    [Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
```

Awesome! Now you can get your data in and out of Spark.

### Dropping the middle man
  
Now you know how to put data into Spark via pandas, but you're probably wondering why deal with pandas at all? Wouldn't it be easier to just read a text file straight into Spark? Of course it would!
  
Luckily, your `SparkSession` has a `.read` attribute which has several methods for reading different data sources into Spark DataFrames. Using these you can create a DataFrame from a .csv file just like with regular pandas DataFrames!
  
The variable `file_path` is a string with the path to the file `airports.csv`. This file contains information about different airports all over the world.
  
A `SparkSession` named `spark` is available in your workspace.
  
1. Use the `.read.csv()` method to create a Spark DataFrame called `airports`
2. The first argument is `file_path`
3. Pass the argument `header=True` so that Spark knows to take the column names from the first line of the file.
4. Print out this DataFrame by calling `.show()`.

In [9]:
file_path = '../_datasets/airports.csv'

# Read in the airports data
airports = spark.read.csv(file_path, header=True)

# Show the data
airports.show()

+---+--------------------+----------------+-----------------+----+---+---+
|faa|                name|             lat|              lon| alt| tz|dst|
+---+--------------------+----------------+-----------------+----+---+---+
|04G|   Lansdowne Airport|      41.1304722|      -80.6195833|1044| -5|  A|
|06A|Moton Field Munic...|      32.4605722|      -85.6800278| 264| -5|  A|
|06C| Schaumburg Regional|      41.9893408|      -88.1012428| 801| -6|  A|
|06N|     Randall Airport|       41.431912|      -74.3915611| 523| -5|  A|
|09J|Jekyll Island Air...|      31.0744722|      -81.4277778|  11| -4|  A|
|0A9|Elizabethton Muni...|      36.3712222|      -82.1734167|1593| -4|  A|
|0G6|Williams County A...|      41.4673056|      -84.5067778| 730| -5|  A|
|0G7|Finger Lakes Regi...|      42.8835647|      -76.7812318| 492| -5|  A|
|0P2|Shoestring Aviati...|      39.7948244|      -76.6471914|1000| -5|  U|
|0S9|Jefferson County ...

Awesome job! You've got the basics of Spark under your belt!

In [10]:
# Stop the SparkSession, which also stops its underlying SparkContext
spark.stop()