# PySpark SQL

### The pyspark.sql module

Important classes of Spark SQL and DataFrames:

* `pyspark.sql.SparkSession` Main entry point for DataFrame and SQL functionality.

* `pyspark.sql.DataFrame` A distributed collection of data grouped into named columns.

* `pyspark.sql.Column` A column expression in a DataFrame.

* `pyspark.sql.Row` A row of data in a DataFrame.

* `pyspark.sql.GroupedData` Aggregation methods, returned by DataFrame.groupBy().

* `pyspark.sql.DataFrameNaFunctions` Methods for handling missing data (null values).

* `pyspark.sql.DataFrameStatFunctions` Methods for statistics functionality.

* `pyspark.sql.functions` List of built-in functions available for DataFrame.

* `pyspark.sql.types` List of data types available.

* `pyspark.sql.Window` For working with window functions.

http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html

https://spark.apache.org/docs/2.2.0/sql-programming-guide.html

## 1. SparkSession

The traditional entry point to Spark is the `SparkContext`. In these notebooks it is provided by the PySpark driver.

Since Spark 2.0, a single `SparkSession` can be used in place of `SparkConf`, `SparkContext` and `SQLContext`.

In [1]:
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
# Create a Session
session = SparkSession.builder.getOrCreate()
# Inspect the session object
print(session)

<pyspark.sql.session.SparkSession object at 0x7fcda9edfdd8>


In [2]:
session

#### Passing other options to the Spark session:

In [3]:
session = SparkSession.builder.config('someoption.key','somevalue').getOrCreate()

We can check option values in the resulting session like this:

In [4]:
session.sparkContext.getConf().getAll()

[('someoption.key', 'somevalue'),
 ('spark.rdd.compress', 'True'),
 ('spark.app.id', 'local-1558105286031'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.driver.port', '35805'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.host', '10.0.2.15'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.app.name', 'pyspark-shell')]

## 2. Creating DataFrames

`SparkSession.createDataFrame` builds a DataFrame from an RDD, a list or a `pandas.DataFrame`.

In [5]:
# Create the rows for the DataFrame
import random
random.seed(42)
ids = range(5)
positions = [random.choice(['mechanic', 'sales', 'manager']) for _ in ids]
print(positions)
# Pair each id with its position using zip
rows = zip(ids, positions)
print(rows)

['manager', 'mechanic', 'mechanic', 'manager', 'sales']
<zip object at 0x7fcda9c72448>


In [6]:
# Create the DataFrame; without a schema the columns are named _1 and _2
df = session.createDataFrame(rows)
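One detail worth keeping in mind: in Python 3, `zip` returns a lazy, one-shot iterator, which is why printing `rows` shows a `zip object` rather than the pairs. Once something has iterated over it, such as building a DataFrame from it, nothing is left to read. The plain-Python sketch below shows the effect with illustrative values.

```python
ids = range(3)
positions = ["manager", "mechanic", "sales"]  # illustrative values

rows = zip(ids, positions)
first_pass = list(rows)    # consumes the iterator
second_pass = list(rows)   # the iterator is already exhausted
print(first_pass)
print(second_pass)

# Materializing the pairs once gives a list that can be reused.
reusable_rows = list(zip(ids, positions))
print(len(reusable_rows))
```

Wrapping the `zip` call in `list(...)` is a simple way to keep the rows available for printing and for repeated use.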

In [None]:
# Print the first rows as a table (20 by default), similar to .head() in pandas
df.show()

In [18]:
# collect() returns every row to the driver as a list of Row objects (use it only on small DataFrames)
df.collect()

[Row(_1=0, _2='manager'),
 Row(_1=1, _2='mechanic'),
 Row(_1=2, _2='mechanic'),
 Row(_1=3, _2='manager'),
 Row(_1=4, _2='sales')]