# Introduction to pyspark

Source: DataCamp

Spark is a platform for big data and cluster computing. Spark lets you distribute computations over clusters with multiple nodes (computing instances). Each node works on a different subset of the data and carries out part of the total calculation, so that both data processing and computation are performed in parallel across the nodes in the cluster.
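
The idea of splitting the data, computing partial results in parallel, and combining them can be sketched in plain Python. This is only an illustration of the principle, not how Spark is implemented: the helper names `split` and `partial_sum` are hypothetical, and threads on one machine stand in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Split a non-empty list into roughly equal chunks (the "partitions")."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(chunk):
    """The part of the total calculation carried out by one "node"."""
    return sum(chunk)


data = list(range(1, 101))
partitions = split(data, 4)

# Each worker thread plays the role of one node working on its own subset.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum, partitions))

# Combine the partial results into the final answer.
total = sum(partial_results)
print(partial_results, total)
```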

Parallelization, however, also introduces greater complexity. To decide whether `pyspark` is right for you, answer the following questions:

- Is my data too big to work with on a single machine?
- Can my calculations be easily parallelized?

# pyspark: Using Spark from Python

## Step 1: Connect to a Cluster

The first step in using Spark is connecting to a cluster.

In practice, the cluster will be hosted on a remote machine that's connected to all other nodes. There will be one computer, called **the master**, that manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called **slaves** (more recent Spark documentation calls them **workers**). The master sends the slaves data and calculations to run, and they send their results back to the master.

We are going to run our cluster locally. Instead of connecting to another computer, the master and the slaves will all run on your own computer.

### SparkContext sc

Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

An object holding all these attributes can be created with the SparkConf() constructor. Take a look at the documentation for all the details!

When you start the `pyspark` shell, a SparkContext called `sc` is already available in your workspace.

In [1]:
sc

In [2]:
print(sc)

<SparkContext master=local[*] appName=PySparkShell>


In [3]:
# print version of Spark running
print(sc.version)

2.3.0


# Using DataFrames
Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so in this course you'll be using the Spark DataFrame abstraction built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are they easier to understand, DataFrames are also more optimized for complicated operations than RDDs.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!

To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext. You can think of the **SparkContext as your connection to the cluster** and the **SparkSession as your interface with that connection**.

Remember, for the rest of this course you'll have a SparkSession called spark available in your workspace!

# Creating a SparkSession
We've already created a SparkSession for you called spark, but what if you're not sure there already is one? Creating multiple SparkSessions and SparkContexts can cause issues, so it's best practice to use the SparkSession.builder.getOrCreate() method. This returns an existing SparkSession if there's already one in the environment, or creates a new one if necessary!

*Spark session automatically created as* `spark`

In [4]:
spark

In [5]:
# Retrieve the existing session (or create one if none exists)
from pyspark.sql import SparkSession
spark2 = SparkSession.builder.getOrCreate()

In [6]:
# It's the same Session
spark == spark2

True

- Import SparkSession from pyspark.sql.
- Make a new SparkSession called my_spark using SparkSession.builder.getOrCreate().
- Print my_spark to the console to verify it's a SparkSession.

In [7]:
from pyspark.sql import SparkSession

my_spark = SparkSession.builder.getOrCreate()

print(my_spark)

<pyspark.sql.session.SparkSession object at 0x7f2ae2249128>


# Viewing tables
Once you've created a SparkSession, you can start poking around to see what data is in your cluster!

Your SparkSession has an **attribute called catalog** which lists all the data inside the cluster. This attribute has a few methods for extracting different pieces of information.

One of the most useful is the .listTables() method, which returns a list describing all the tables in your cluster. Each entry carries the table's name along with some metadata.

- See what tables are in your cluster by calling spark.catalog.listTables() and printing the result!

In [8]:
my_spark.catalog.listTables()

[]

In [1]:
# Simple test with an RDD
# Distribute the numbers 0..999999 across the cluster
res = sc.parallelize(range(1000000))
# Add 273.15 to every element, in parallel
res = res.map(lambda x: x + 273.15)

In [4]:
res.take(2)

[273.15, 274.15]