### Spark DataFrame Basics
Spark DataFrames are the workhorse and the main way of working with Spark from Python since Spark 2.0. DataFrames act as powerful versions of tables, with rows and columns, and handle large datasets easily.

We use the findspark package to locate the Spark installation so that pyspark can be imported.

In [2]:
import findspark

In [3]:
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')

First we import SparkSession.

In [4]:
from pyspark.sql import SparkSession 

Now let's create the Spark session. We assign it to the variable spark and use the app name 'Basics', since this tutorial is about learning the basics.

In [5]:
spark = SparkSession.builder.appName('Basics').getOrCreate()

Read the Dataset

In [6]:
df = spark.read.json('people.json')
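
By default, `spark.read.json` expects one self-contained JSON object per line rather than a single top-level array. The sketch below writes a file in that layout using plain Python. The record contents are an assumption inferred from the output shown in this tutorial, and the file name `people_example.json` is hypothetical, chosen so the real people.json is not overwritten.

```python
import json

# Assumed contents: three records, one JSON object per line.
# Michael has no "age" key, which Spark would surface as null.
records = [
    {"name": "Michael"},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# Hypothetical file name, so the real people.json stays untouched.
path = "people_example.json"
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line to confirm each line parses on its own.
with open(path) as f:
    parsed = [json.loads(line) for line in f]

print(parsed)
```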

Let's show the loaded data. Spark represents missing values as null, as in Michael's age below.

In [9]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



Let's look at the schema, that is, the column names and data types of the dataset.

In [11]:
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [12]:
df.columns

['age', 'name']

To get a statistical summary of the DataFrame df, use describe(). On its own, describe() only returns another DataFrame:

In [16]:
df.describe()

DataFrame[summary: string, age: string, name: string]

To see the actual summary values, call show() on the result:

In [17]:
df.describe().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
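
The summary can be cross-checked by hand. The `count` for age is 2 because the null age is skipped, and the `stddev` row matches the sample standard deviation (dividing by n - 1) of the two remaining ages. A minimal sketch in plain Python, with the ages copied from the output above:

```python
import statistics

# Non-null ages from df.show(); Michael's null age is excluded.
ages = [30, 19]

count = len(ages)
mean = statistics.mean(ages)
sample_stddev = statistics.stdev(ages)        # divides by n - 1
population_stddev = statistics.pstdev(ages)   # divides by n, shown for contrast

print(count, mean, sample_stddev, population_stddev)
```

The sample value agrees with the `stddev` row in the table, while the population value does not.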



Some file formats make it easier to infer the schema (for example tabular formats such as CSV, which we will show later).

However, you often have to set the schema yourself when the .read method you are using does not offer an inferSchema option.

Spark has all the tools you need for this; it just requires a very specific structure:

In [18]:
from pyspark.sql.types import (StructField,StringType,
                               IntegerType,StructType)

Next we create the list of StructField objects. Each StructField takes three arguments:

* name: string, name of the field.
* dataType: the DataType of the field.
* nullable: boolean, whether the field can be null (None) or not.

In [36]:
data_schema = [StructField('age', IntegerType(),True),
               StructField('name',StringType(),True)]

In [37]:
final_struct = StructType(fields=data_schema)

In [38]:
df = spark.read.json('people.json',schema=final_struct)

In [39]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



#### Grab Data
Let's see how to select columns and rows.

In [41]:
type(df['age'])

pyspark.sql.column.Column

In [42]:
df.select('age')

DataFrame[age: int]

In [44]:
df.select('age').show()

+----+
| age|
+----+
|null|
|  30|
|  19|
+----+



In [47]:
type(df.select('age'))  # select() returns a DataFrame

pyspark.sql.dataframe.DataFrame

In [48]:
# Returns the first 2 rows as a list of Row objects
df.head(2)

[Row(age=None, name='Michael'), Row(age=30, name='Andy')]

In [50]:
df.head(2)[1]  # second row (index 1)

Row(age=30, name='Andy')

In [52]:
type(df.head(2)[1])

pyspark.sql.types.Row

In [54]:
df.select(['age','name']).show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



### Creating new columns

In [59]:
df.withColumn('doubled_age',df['age']*2).show()  # withColumn returns a new DataFrame; df itself is unchanged because we do not assign the result

+----+-------+-----------+
| age|   name|doubled_age|
+----+-------+-----------+
|null|Michael|       null|
|  30|   Andy|         60|
|  19| Justin|         38|
+----+-------+-----------+



In [60]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [57]:
# Renaming a column; this also returns a new DataFrame and leaves df unchanged
df.withColumnRenamed('age','my_new_age').show()

+----------+-------+
|my_new_age|   name|
+----------+-------+
|      null|Michael|
|        30|   Andy|
|        19| Justin|
+----------+-------+



In [61]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



### Using SQL

To use SQL queries directly with the DataFrame, you first need to register it as a temporary view.
For this people.json data, let's create a view named 'people':

In [64]:
df.createOrReplaceTempView('people')

In [65]:
results = spark.sql("SELECT * FROM people")

In [66]:
results.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [67]:
new_results = spark.sql("SELECT * FROM people WHERE age=30")

In [68]:
new_results.show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

