## Basics of PySpark from DataCamp

This notebook follows DataCamp's Introduction to PySpark course:

- https://campus.datacamp.com/courses/introduction-to-pyspark/getting

In [8]:
import pyspark

## Spark context

- **You can connect to a Spark cluster using the `SparkContext` class.**


- Creating the connection to a cluster is as simple as creating an instance of the `SparkContext` class.


- The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.




In [7]:
help(pyspark.SparkContext)

Help on class SparkContext in module pyspark.context:

class SparkContext(builtins.object)
 |  Main entry point for Spark functionality. A SparkContext represents the
 |  connection to a Spark cluster, and can be used to create L{RDD} and
 |  broadcast variables on that cluster.
 |  
 |  Methods defined here:
 |  
 |  __enter__(self)
 |      Enable 'with SparkContext(...) as sc: app(sc)' syntax.
 |  
 |  __exit__(self, type, value, trace)
 |      Enable 'with SparkContext(...) as sc: app' syntax.
 |      
 |      Specifically stop the context on exit of the with block.
 |  
 |  __getnewargs__(self)
 |  
 |  __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>)
 |      Create a new SparkContext. At least the master and app name should be set,
 |      either through the named parameters here or through C{conf}.
 |      


In [2]:
sc = pyspark.SparkContext()

In [9]:
sc

## Spark Configuration

## Creating a SparkSession

- Creating multiple `SparkSessions` and `SparkContexts` can cause issues, so it's best practice to use the `SparkSession.builder.getOrCreate()` method. 


- This returns an existing `SparkSession` if there's already one in the environment, or creates a new one if necessary.


In [17]:
from pyspark.sql import SparkSession

In [18]:
spark = SparkSession.builder.getOrCreate()

In [19]:
spark

## Viewing tables

- Once you've created a **`SparkSession`**, you can start poking around to see what data is in your cluster!


- Your **`SparkSession`** has an attribute called `catalog`, which lists all the data inside the cluster. This attribute has a few methods for extracting different pieces of information.


- One of the most useful is the **`.listTables()`** method, which returns a list with one entry per table in your cluster; each entry includes the table's name.




In [20]:
print(spark.catalog.listTables())

[]


## Queries 

- One of the advantages of the DataFrame interface is that you can run SQL queries on the tables in your Spark cluster.


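- In the DataCamp workspace, one of the tables in the cluster is the `flights` table. This table contains a row for every flight that left Portland International Airport (PDX) or Seattle-Tacoma International Airport (SEA) in 2014 and 2015. In this local notebook the catalog is empty (see the `listTables()` output above), so the query cell further down is commented out.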

- Running a query on this table is as easy as using the **`.sql()`** method on your `SparkSession` that we called `spark`. This method takes a string containing the query and returns a DataFrame with the results!

- If you look closely, you'll notice that the table `flights` is only mentioned in the query, not as an argument to any of the methods. This is because there isn't a local object in your environment that holds that data, so it wouldn't make sense to pass the table as an argument.

- Remember, we've already created a `SparkSession` called `spark` in your workspace.

```python
#Don't change this query
query = "FROM flights SELECT * LIMIT 10"

#Get the first 10 rows of flights
flights10 = spark.sql(query)

#Show the results
flights10.show()
```


In [23]:
# Don't change this query
query = "FROM flights SELECT * LIMIT 10"

# Get the first 10 rows of flights
#flights10 = spark.sql(query)

# Show the results
#flights10.show()

## `.toPandas()`: Spark DataFrame to pandas

- Suppose you've run a query on your huge dataset and aggregated it down to something a little more manageable.


- Sometimes it makes sense to then take that table and work with it locally using a tool like `pandas`. Spark DataFrames make that easy with the **`.toPandas()`** method. Calling this method on a Spark DataFrame returns the corresponding `pandas` DataFrame. It's as simple as that!


## pandas DataFrame to Spark cluster

- The `SparkSession` class has a **`.createDataFrame()`** method that takes a pandas DataFrame and returns a Spark DataFrame.


- The output of this method is stored locally, not in the `SparkSession` catalog. This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts.


- For example, a SQL query (using the `.sql()` method) that references your DataFrame will throw an error. To access the data in this way, you have to save it as a temporary table.


- You can do this using the **`.createTempView()`** Spark DataFrame method, which takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific `SparkSession` used to create the Spark DataFrame.


- There is also the method **`.createOrReplaceTempView()`**. This safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. You'll use this method to avoid running into problems with duplicate tables.


![SparkDiagram](./images/spark_figure.png)

In [26]:
import pandas as pd

In [27]:
# NumPy is needed to generate the random sample data
import numpy as np

# Create pd_temp
pd_temp = pd.DataFrame(np.random.random(10))

# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(pd_temp)

# Examine the tables in the catalog
print(spark.catalog.listTables())

# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView("temp")

# Examine the tables in the catalog again
print(spark.catalog.listTables())