Basic notebook usage. Cells can contain Markdown or code. Press "Shift" + "Enter" to execute a cell and advance to the next one. Press "Ctrl" + "Enter" to execute a cell and remain in it.    
Here we write some simple Python code and run it.

In [2]:
# Let's run some Python code.
a = 10
b = -20
if a < b:
  print('b is greater')
else:
  print('a is greater')

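To use Spark and its DataFrame API we need a SQLContext.
A Spark application starts by creating a SparkContext, and the SQLContext is then created from that SparkContext. The cells below use `sc` and `sqlContext` without creating them, which suggests the notebook environment already provides both.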
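
For readers running outside a notebook that pre-creates these objects, the sequence described above could look roughly like the sketch below. The helper name `make_contexts`, the application name and the `local[2]` master are assumptions chosen for illustration and are not part of this notebook. On Spark 2.x and later, `SparkSession` is the usual entry point instead.

```python
def make_contexts(app_name="notebook-demo", master="local[2]"):
    """Hypothetical helper: start a Spark application and derive a SQLContext from it."""
    # Imported inside the function so that defining the helper does not require Spark.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName(app_name).setMaster(master)
    sc = SparkContext.getOrCreate(conf)  # reuses a running context if there is one
    sqlContext = SQLContext(sc)          # the DataFrame entry point is built from sc
    return sc, sqlContext
```

Calling `make_contexts()` requires a working PySpark installation. Inside a notebook that already exposes `sc` and `sqlContext`, it is not needed.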

In [4]:
# Check which version of Spark we are running.
sc.version

In [5]:
# display the type of the Spark sqlContext
type(sqlContext)

Before we create a DataFrame, let's use a Python package called fake-factory to create fake person records.    
For our project, we are going to create 10,000 records with the columns last_name, first_name, occupation, company, and age.

In [7]:
# Set up a seeded fake-data generator; we will use it to build a Python collection of 10,000 people.
from faker import Factory
fake = Factory.create()
fake.seed(1234)
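
Seeding is what makes the fake records repeatable from run to run. The sketch below shows the same idea with the standard library's `random` module rather than the faker package. The function `sample_ages` and its value range are made up for illustration.

```python
import random


def sample_ages(seed, n=5):
    """Hypothetical helper: draw n pseudo-random ages from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    return [rng.randint(1, 90) for _ in range(n)]


run_a = sample_ages(1234)
run_b = sample_ages(1234)  # same seed, so the same sequence as run_a
```

Two runs with the same seed return identical lists, which is why re-running the notebook with the same seed can reproduce the same fake people.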

In [8]:
from pyspark.sql import Row
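def fake_data():
  # Build one record: (last_name, first_name, occupation, company, age).
  # Assumes the generated name is "First Last"; a name with a prefix such as "Dr." would shift the fields.
  name = fake.name().split()
  # Age is derived from a random year relative to 2016, so it is always at least 1.
  return (name[1], name[0], fake.job(), fake.company(), abs(2016 - fake.date_time().year) + 1)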


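# repeat_times is a generator: func is called once per item, only when the next item is requested.
# xrange (Python 2) is evaluated lazily as well, while range creates the whole list in memory.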
def repeat_times(times, func, *args, **kwargs):
  for _ in xrange(times):
    yield func(*args, **kwargs)
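
The lazy behaviour of `repeat_times` can be checked without Spark or faker. In the sketch below, `record_call` is a hypothetical stand-in for `fake_data`, and `range` replaces `xrange` so the code also runs on Python 3.

```python
from itertools import islice


def repeat_times(times, func, *args, **kwargs):
    # Same shape as the helper above; range is lazy on Python 3.
    for _ in range(times):
        yield func(*args, **kwargs)


calls = []


def record_call():
    # Hypothetical stand-in for fake_data(): remembers each invocation.
    calls.append(1)
    return len(calls)


gen = repeat_times(10000, record_call)
calls_before = len(calls)            # creating the generator has not called func yet
first_three = list(islice(gen, 3))   # pull exactly three items
calls_after = len(calls)             # func has now run three times, not 10,000
```

Wrapping the generator in `list(...)`, as the next cell does, is what forces all 10,000 calls.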

In [9]:
data = list(repeat_times(10000, fake_data))
data[0]

In [10]:
len(data)

In [11]:
# create DataFrame
dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'occupation', 'company', 'age'))
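
createDataFrame receives the records as plain tuples plus a tuple of column names, so values are matched to columns by position. A minimal plain-Python sketch of that pairing, using a made-up record (this is not how Spark stores the data):

```python
columns = ('last_name', 'first_name', 'occupation', 'company', 'age')

# Hypothetical record in the same field order that fake_data() returns.
record = ('Doe', 'Jane', 'Engineer', 'Example Corp', 33)

# Pair each value with the column name at the same position.
as_dict = dict(zip(columns, record))
```

If the order of the names did not match the order of the tuple fields, the columns would simply be mislabeled, so the two need to be kept in sync.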

In [12]:
# register the dataframe as a named table called person.
sqlContext.registerDataFrameAsTable(dataDF, 'person')

In [13]:
dataDF.printSchema()

In [14]:
dataDF.show()

In [15]:
dataDF.rdd.getNumPartitions()

In [16]:
subDF = dataDF.select('last_name', 'first_name')
subDF.show()

In [17]:
# Collect to view the results. Not a good idea if the dataset is large, because collect() brings every row back to the driver.
results = subDF.collect()
print(results)

In [18]:
dataDF.show(n=30, truncate=False)

In [19]:
# A better way to visualize data. display() comes from the notebook environment rather than from PySpark.
display(dataDF)

In [20]:
print(dataDF.count())

In [21]:
# Let's apply another transformation: filter.
filteredDF = dataDF.filter(dataDF.age < 13)
filteredDF.show(truncate=False)
filteredDF.count()

In [22]:
# Python lambda functions and UDFs
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
less_13 = udf(lambda s: s < 13, BooleanType())
lambdaDF = dataDF.filter(less_13(dataDF.age))
lambdaDF.show()
lambdaDF.count()

In [23]:
# Let's keep only the even ages among those less than 13.
even_less13 = udf(lambda s: s%2 == 0, BooleanType())
evenDF = lambdaDF.filter(even_less13(lambdaDF.age))
evenDF.show()
evenDF.count()
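
Because a UDF wraps an ordinary Python function, the predicates used in the two cells above can be sanity-checked on plain values before they are handed to Spark. The sample ages below are made up, and `is_even` is an illustrative name for the evenness check.

```python
# The same predicates that the udf() calls above wrap.
less_13 = lambda age: age < 13
is_even = lambda age: age % 2 == 0

ages = [5, 12, 13, 40, 8]  # hypothetical sample values

under_13 = [a for a in ages if less_13(a)]           # first filter
even_under_13 = [a for a in under_13 if is_even(a)]  # second filter applied to the result
```

Chaining the two filters keeps only the values that pass both tests, which mirrors filtering `lambdaDF` a second time to obtain `evenDF`.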

In [24]:
display(evenDF.take(5))

In [25]:
display(dataDF.orderBy(dataDF.age.desc()).take(5))

In [26]:
display(dataDF.orderBy('age').take(5))
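
As a rough plain-Python analogy for the two ordering cells above (descending, then ascending, each followed by taking the first rows), here is the same idea on a small made-up list of records:

```python
# Hypothetical (first_name, age) records.
people = [('Ann', 34), ('Bob', 7), ('Cy', 51), ('Di', 19), ('Ed', 42)]

by_age = lambda person: person[1]

oldest_two = sorted(people, key=by_age, reverse=True)[:2]  # like orderBy(age.desc()).take(2)
youngest_two = sorted(people, key=by_age)[:2]              # like orderBy('age').take(2)
```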

In [27]:
# We can use groupBy (similar to pandas).
dataDF.groupBy('occupation').count().show(truncate=False)

In [28]:
dataDF.groupBy().avg('age').show(truncate=False)

In [29]:
dataDF.groupBy().max('age').show(truncate=False)
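
To make clear what the three aggregations above compute, the sketch below reproduces them in plain Python on a tiny hand-made sample. The records are invented, and Spark performs the same kind of computation in a distributed way.

```python
from collections import Counter

# Hypothetical records: (last_name, first_name, occupation, company, age).
people = [
    ('Smith', 'Ann', 'Engineer', 'Example Corp', 30),
    ('Jones', 'Bob', 'Teacher', 'Sample LLC', 40),
    ('Lee', 'Cy', 'Engineer', 'Demo Inc', 50),
]

occupation_counts = Counter(p[2] for p in people)  # like groupBy('occupation').count()
ages = [p[4] for p in people]
avg_age = sum(ages) / float(len(ages))             # like groupBy().avg('age')
max_age = max(ages)                                # like groupBy().max('age')
```

Calling `groupBy()` with no columns treats the whole DataFrame as a single group, which is why the last two cells return one row each.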

We can use the cache() operation to keep the DataFrame in memory.   
cache() is lazy: it only marks the DataFrame for caching, and the data is actually stored in memory the first time an action (such as count() or show()) runs on it.

In [31]:
dataDF.cache()
print(dataDF.is_cached)

In [32]:
dataDF.unpersist()
print(dataDF.is_cached)
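
The difference between marking something as cached and actually materializing it can be illustrated with a toy class. This is a plain-Python analogy under simplifying assumptions, not Spark's implementation. The class `LazyCached` and its counter are hypothetical.

```python
class LazyCached(object):
    """Toy stand-in for a cacheable dataset (not Spark's implementation)."""

    def __init__(self, compute):
        self._compute = compute
        self._data = None
        self.is_cached = False
        self.compute_calls = 0

    def cache(self):
        self.is_cached = True  # only sets a flag; nothing is computed here
        return self

    def count(self):
        # An "action": reuse stored rows if present, otherwise compute them.
        if self.is_cached and self._data is not None:
            return len(self._data)
        self.compute_calls += 1
        rows = self._compute()
        if self.is_cached:
            self._data = rows  # materialized on the first action after cache()
        return len(rows)

    def unpersist(self):
        self.is_cached = False
        self._data = None
        return self


toy = LazyCached(lambda: list(range(1000))).cache()
calls_after_cache = toy.compute_calls       # cache() alone did no work
first_count = toy.count()                   # first action computes and stores
second_count = toy.count()                  # served from the stored rows
calls_after_two_counts = toy.compute_calls
```

In the toy, the flag flips as soon as `cache()` is called, which is consistent with `is_cached` being printed right after `dataDF.cache()` above, before any action has run.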

In [33]:
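# Readability and code style: chain the calls, one per line.
# Import only the names we use; a wildcard import would shadow Python builtins such as abs, max and sum.
from pyspark.sql.functions import concat, lit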
(dataDF
 .filter(dataDF.age > 25)
 .select(concat(dataDF.first_name, lit(' '), dataDF.last_name).alias('name'), dataDF.occupation, dataDF.age)
 .show(truncate=False)
)
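
For comparison, the same filter-then-project step can be written in plain Python over a few made-up records. This only illustrates what the chained call above selects. It does not involve Spark.

```python
# Hypothetical records: (last_name, first_name, occupation, company, age).
people = [
    ('Doe', 'Jane', 'Engineer', 'Example Corp', 33),
    ('Roe', 'Rick', 'Student', 'Sample LLC', 19),
    ('Poe', 'Pat', 'Chef', 'Demo Inc', 47),
]

# Keep people older than 25 and build a single 'name' field from first and last name.
over_25 = [
    (first + ' ' + last, occupation, age)
    for (last, first, occupation, company, age) in people
    if age > 25
]
```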