# Five-minute `DataFrame` demo

### 1. Initialize the `SparkSession`, which gives us access to the RDD and DataFrame APIs

First, we need a running Spark application. We use the sparkmagic client to connect remotely to the Spark cluster; running the first cell starts the application.

In [1]:
spark

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 76 | application_1583239045420_3621 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.


<pyspark.sql.session.SparkSession object at 0x7fa286e33450>

Now let's get some information about the current Livy endpoint.

In [2]:
%%info

| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 76 | application_1583239045420_3621 | pyspark | idle | Link | Link | ✔ |


### 2. Read in some data and turn it into an RDD of tuples

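First, we initialize an `SQLContext`, the entry point to Spark SQL. It connects the engine to different data sources and creates `DataFrame`s.
Since Spark 2.0 the `SparkSession` offers the same functionality (for example `spark.createDataFrame`); `SQLContext` is kept for backward compatibility.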

In [3]:
from pyspark.sql import Row
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

Our demo uses a small text file, so we can read it into a variable in a `%%local` cell and send that variable to the remote Spark driver with the `%%send_to_spark` magic.

In [4]:
%%local
# Read the whole file into a single string on the local (notebook) side
with open('../data/people.txt', 'r') as f:
    people_str = f.read()


In [5]:
%%send_to_spark -i people_str -t str -n people_str

Successfully passed 'people_str' as 'people_str' to Spark kernel

The file contents arrive on the driver as a single string, so we first split it into lines.

In [6]:
people_str = people_str.split('\n')

In [7]:
print(people_str)

['Michael, 29', 'Andy, 30', 'Justin, 19']
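
One detail worth checking when splitting file contents: if the text ends with a newline, `split('\n')` leaves an empty string at the end of the list, whereas `splitlines()` does not. The plain-Python sketch below illustrates the difference. The sample string is an assumption that mirrors the output above with a trailing newline added; whether the original file ends with one is not known.

```python
# Assumed sample text: the three records shown above, plus a trailing newline.
text_with_trailing_newline = "Michael, 29\nAndy, 30\nJustin, 19\n"

# split('\n') keeps an empty string after the final newline.
by_split = text_with_trailing_newline.split('\n')

# splitlines() drops the line terminators without producing an empty last item.
by_splitlines = text_with_trailing_newline.splitlines()

print(by_split)
print(by_splitlines)
```

An empty trailing entry would fail later, when each line is split on the comma and its second field is converted to an integer, so `splitlines()` may be the safer choice for files of unknown origin.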

Now we create an `RDD` from the list of lines.

In [8]:
people_rdd = sc.parallelize(people_str)

In [9]:
people_rdd.first()

'Michael, 29'

We split each line of the `RDD` into its fields (name and age).

In [10]:
rdd_formatted = people_rdd.map(lambda line: line.split(','))
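
To see what this `map` does to a single record, here is a plain-Python sketch of the per-line parsing, runnable without a cluster. The helper name `parse_person` is hypothetical and not part of the original notebook. Note that splitting on `','` alone leaves a leading space in the age field; `int()` tolerates that whitespace, but the sketch strips it explicitly so the fields are clean.

```python
def parse_person(line):
    """Hypothetical helper: turn a 'name, age' line into a (name, age) tuple."""
    name, age = line.split(',')
    # strip() removes the space that follows the comma in the input.
    return name.strip(), int(age.strip())

# The same three lines the notebook prints after splitting the file contents.
lines = ['Michael, 29', 'Andy, 30', 'Justin, 19']

# A bare split keeps the leading space in the second field.
raw_fields = lines[0].split(',')

# Applying the helper to every line mimics what an RDD map would do per record.
people = [parse_person(line) for line in lines]

print(raw_fields)
print(people)
```

A function like this could presumably be passed to `people_rdd.map` in place of the lambda, since `map` accepts any function of one record.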

Convert each record to a `Row` and create a `DataFrame` using the `sqlContext`.

In [11]:
ppl = rdd_formatted.map(lambda x: Row(name=x[0], age=int(x[1])))

df = sqlContext.createDataFrame(ppl)

When the `DataFrame` is constructed, the data type for each column is inferred:

In [12]:
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

In [13]:
df.first()

Row(age=29, name=u'Michael')

There are some convenient methods for pretty-printing the contents:

In [14]:
df.show()

+---+-------+
|age|   name|
+---+-------+
| 29|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+

Let's compare the `RDD` and `DataFrame` approaches. We want to get all the people older than 20:

In [15]:
# using the usual RDD methods
rdd_formatted.filter(lambda x: int(x[1])>20).collect()

[['Michael', ' 29'], ['Andy', ' 30']]

In [16]:
# using the DataFrame
df.filter(df.age > 20).take(20)

[Row(age=29, name=u'Michael'), Row(age=30, name=u'Andy')]

There is no need to write `map`s when the operation can be expressed with the built-in functions. You refer to columns through the `DataFrame` object:

In [17]:
# this is a column that you can use in arithmetic expressions
df.age

Column<age>

In [18]:
df.select(df.age, (df.age*2).alias('times two')).show()

+---+---------+
|age|times two|
+---+---------+
| 29|       58|
| 30|       60|
| 19|       38|
+---+---------+

In [19]:
# equivalent RDD method
rdd_formatted.map(lambda x: int(x[1])*2).collect()

[58, 60, 38]