Spark Libraries
====

A tour of the Spark SQL library, the `spark-csv` package, and Spark DataFrames.

You need to define these environment variables before starting the notebook.

```bash
export SPARK_HOME=~/spark
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PACKAGES="com.databricks:spark-csv_2.11:1.4.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
```

On Unix or macOS, these can go in `.bashrc` or `.bash_profile`.

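If editing a shell profile is inconvenient, some of these variables can also be set from Python at the top of the notebook. The following is a minimal sketch, not part of the original setup: it assumes `SPARK_HOME` and `PYTHONPATH` are already configured so that `pyspark` is importable, and it only covers `PYSPARK_PYTHON` and `PYSPARK_SUBMIT_ARGS`. It must run before the first `SparkContext` is created, because the submit arguments are read when the JVM gateway is launched.

```python
import os

# Same package coordinates as in the shell snippet above.
packages = "com.databricks:spark-csv_2.11:1.4.0"

# Assumption: no SparkContext exists yet in this Python process;
# changing these values afterwards has no effect on a running context.
os.environ["PYSPARK_PYTHON"] = "python3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {} pyspark-shell".format(packages)

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```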
In [1]:
from pyspark import SparkContext, SparkConf

conf = (SparkConf()
        .setAppName('SparkSQL')
        .setMaster('local[*]'))

In [2]:
sc = SparkContext(conf=conf)

In [3]:
from pyspark.sql import SQLContext
sqlc = SQLContext(sc)

## Working with CSV files

In [27]:
df = (sqlc.read.format('com.databricks.spark.csv')
      .options(header='true', inferschema='true')
      .load('data/cars.csv'))

### Using the dataframe

In [28]:
df.printSchema()

root
 |-- year: integer (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)

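The `year` column is reported as an integer because the file was loaded with `inferschema='true'`; without that option, `spark-csv` reads every column as a string. As a rough illustration of the idea only (this is plain Python, not how `spark-csv` is implemented), the sketch below infers a type per column from a hypothetical three-column excerpt of the data. The helper `infer_column_type` and the inline sample are assumptions made for this example.

```python
import csv
import io

# Hypothetical excerpt: three of the columns from the rows displayed in this notebook.
raw = "year,make,model\n2012,Tesla,S\n1997,Ford,E350\n2015,Chevy,Volt\n"

def infer_column_type(values):
    """Return 'integer' if every value parses as an int, otherwise 'string'."""
    try:
        for value in values:
            int(value)
        return "integer"
    except ValueError:
        return "string"

# The header row supplies the column names, like header='true' above.
rows = list(csv.DictReader(io.StringIO(raw)))

# Scan each column once and pick the narrowest type that fits all of its values.
schema = {name: infer_column_type([row[name] for row in rows]) for name in rows[0]}
print(schema)
```

Inference of this kind needs an extra pass over the data, which is one reason it is opt-in rather than the default.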


In [29]:
df.show()

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+



In [30]:
df.select('model').show()

+-----+
|model|
+-----+
|    S|
| E350|
| Volt|
+-----+



### To run SQL queries, we need to register the dataframe as a table

In [31]:
df.registerTempTable('cars')

In [32]:
q = sqlc.sql('select year, make from cars where year > 2000')
q.show()

+----+-----+
|year| make|
+----+-----+
|2012|Tesla|
|2015|Chevy|
+----+-----+



### Spark dataframes can be converted to Pandas ones

Typically, we would only convert small dataframes, such as the results of SQL queries, because `toPandas()` collects every row into the driver's memory. If we could load the original dataset in memory as a `pandas` dataframe, why would we be using Spark?

In [33]:
q_df = q.toPandas()
q_df

   year   make
0  2012  Tesla
1  2015  Chevy
