# Introduction to Spark SQL
See more at https://spark.apache.org/docs/latest/sql-programming-guide.html. What follows is an adaptation of that guide to the IPython Notebook (IPyNB) format and to this course.

## Overview
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

### DataFrames (see the next notebook for more)
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

The DataFrame API is available in Scala, Java, and Python.

*Initiate the Spark context. Do this ONLY once per notebook. If you have problems with the cell below, see [Help](/notebooks/spark_course/1-Course-Information-and-Links/If-you-get-problems-initiating-spark-context.ipynb)*

In [1]:
import os
from pyspark import SparkContext
sc = SparkContext(appName="search", master=os.environ['MASTER'])

In [2]:
from pyspark.sql import SQLContext, Row
sqlCtx = SQLContext(sc)

This is a toy problem, just to show the functionality.
The people we are "analysing" are:

Michael, 29

Andy, 30

Justin, 19

Can we find the teenager?

In [3]:
# Load a text file and convert each line to a Row with named fields
peopleFile = "/uuData/people.txt"
lines = sc.textFile(peopleFile)
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
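To see what the two `map` steps above compute, here is a minimal plain-Python sketch that needs no Spark. It assumes each line of the file has the form `name, age`, matching the three people listed earlier. `Person` is a hypothetical stand-in for `Row`, and `sample_lines`, `parts_local` and `people_local` are names introduced only for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, used only for illustration
Person = namedtuple("Person", ["name", "age"])

# Assumed file contents: one "name, age" pair per line
sample_lines = ["Michael, 29", "Andy, 30", "Justin, 19"]

# Same logic as lines.map(lambda l: l.split(","))
parts_local = [line.split(",") for line in sample_lines]

# Same logic as parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# int() tolerates the leading space left over from the split
people_local = [Person(name=p[0], age=int(p[1])) for p in parts_local]

for person in people_local:
    print(person.name, person.age)
```

The difference in Spark is that `map` is lazy and distributed: nothing is read or parsed until an action such as `collect()` is called on the RDD.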

In [4]:
# Infer the schema, and register the SchemaRDD as a table.
# In future versions of PySpark we would like to add support
# for registering RDDs with other datatypes as tables
peopleTable = sqlCtx.inferSchema(people)
peopleTable.registerAsTable("people")

In [5]:
peopleTable.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
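
Note that the schema lists `age` before `name`, although the rows were built as `Row(name=..., age=...)`. In the Spark 1.x releases that provide `inferSchema`, a `Row` created from keyword arguments stores its fields sorted by name. The sketch below mimics only that ordering in plain Python; `row_field_order` is a hypothetical helper, not part of PySpark.

```python
def row_field_order(**fields):
    """Hypothetical helper: return keyword field names in sorted order,
    mimicking how a keyword-built Row orders its fields in Spark 1.x."""
    return sorted(fields)

# Fields are passed as name, age but come back as age, name
field_order = row_field_order(name="Justin", age=19)
print(field_order)
```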



Exercise: complete the code below to find the teenager(s)!

Link to [Solution pages](/notebooks/spark_course/7-Solutions-to-exercises/)

In [5]:
# SQL can be run over SchemaRDDs that have been registered as a table
teenagers = sqlCtx.sql("SELECT ...")

In [None]:
teenNames = teenagers.map(lambda ...)
teenNames....