<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Getting-to-know-PySpark" data-toc-modified-id="Getting-to-know-PySpark-1">Getting to know PySpark</a></span><ul class="toc-item"><li><span><a href="#Creating-a-SparkSession" data-toc-modified-id="Creating-a-SparkSession-1.1">Creating a SparkSession</a></span></li><li><span><a href="#Viewing-Tables" data-toc-modified-id="Viewing-Tables-1.2">Viewing Tables</a></span></li><li><span><a href="#Put-some-Spark-in-your-data" data-toc-modified-id="Put-some-Spark-in-your-data-1.3">Put some Spark in your data</a></span></li><li><span><a href="#Read-csv-into-a-SparkDataFrame" data-toc-modified-id="Read-csv-into-a-SparkDataFrame-1.4">Read csv into a SparkDataFrame</a></span></li></ul></li><li><span><a href="#Manipulating-Data" data-toc-modified-id="Manipulating-Data-2">Manipulating Data</a></span></li></ul></div>

### Getting to know PySpark
#### Creating a SparkSession
**`SparkSession.builder.getOrCreate()`** method returns the existing `SparkSession` if one is already active in the environment, or creates a new one if necessary.

In [9]:
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np

# Create my_spark
my_spark = SparkSession.builder.getOrCreate()

# Print my_spark
print(my_spark)

<pyspark.sql.session.SparkSession object at 0x110a66550>


#### Viewing Tables
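**`.listTables()`** method, called on `SparkSession.catalog`, returns a list of the tables and temporary views registered in the current database. The list is empty until something has been registered.

**`.toPandas()`** method converts a Spark DataFrame into a pandas DataFrame. It collects every row onto the driver, so it is best suited to small results.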

In [5]:
# Print the tables in the catalog
print(my_spark.catalog.listTables())

[]


#### Put some Spark in your data
**`.createDataFrame()`** method takes a pandas DataFrame and returns a Spark DataFrame.

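The output of this method is stored locally, not in the `SparkSession` catalog. This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts; for example, a SQL query run with `.sql()` cannot refer to it by name until it is registered as a table.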

**`.createTempView()`** method takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific SparkSession used to create the Spark DataFrame.

**`.createOrReplaceTempView()`** method safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. You'll use this method to avoid running into problems with duplicate tables.

In [12]:
# Create pd_temp
pd_temp = pd.DataFrame(np.random.random(10))

# Create spark_temp from pd_temp
spark_temp = my_spark.createDataFrame(pd_temp)

# Examine the tables in the catalog
print(my_spark.catalog.listTables())

# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView("temp")

# Examine the tables in the catalog again
print(my_spark.catalog.listTables())

[]
[Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


#### Read csv into a SparkDataFrame
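**`.read.csv()`** method reads a CSV file from the given path and returns a Spark DataFrame. Pass `header=True` to take the column names from the first row and `inferSchema=True` to have Spark infer the column types.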

### Manipulating Data