# Spark DataFrame - Basics

Let's start with the fundamentals of Spark DataFrames.

Objective: In this exercise, you'll find out how to start a Spark session, read in data, explore the data and manipulate the data (using DataFrame syntax as well as SQL syntax). Let's get started!

In [None]:
# Must be included at the beginning of each new notebook. Remember to change the app name.
import findspark
findspark.init('/home/ubuntu/spark-3.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('basics').getOrCreate()

In [None]:
# Let's read in the data. Note that it's in JSON format.
df = spark.read.json('Datasets/people.json')

## Data Exploration

In [None]:
# The show method allows you to visualise DataFrames. We can see that there are two columns.
df.show()

# You could also try this. 
df.columns

In [None]:
# We can use the describe method to get some general statistics on our data too. Remember to show the DataFrame!
# But what about data type?
df.describe().show()

In [None]:
# For data types, we can use the printSchema method.
# But wait! What if you want to change the format of the data? Maybe change age to an integer instead of long?
df.printSchema()

## Data Manipulation

In [None]:
# Let's import the relevant types.
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

In [None]:
# Then define the schema: one StructField per column, giving its name, type and whether it can be null.
data_schema = [StructField('age',IntegerType(),True),
              StructField('name',StringType(),True)]

final_struct = StructType(fields=data_schema)

In [None]:
# And now we can read in the data using that schema. If we print the schema, we can see that age is now an integer. 
df = spark.read.json('Datasets/people.json', schema=final_struct)

df.printSchema()

In [None]:
# We can also select various columns from a DataFrame. 
df.select('age').show()

# We could split up these steps, first assigning the output to a variable, then showing that variable. As you can see, the output is the same.
ageColumn = df.select('age')

ageColumn.show()

In [None]:
# We can also add columns, manipulating the DataFrame.

df.withColumn('double_age',df['age']*2).show()

# But note that this doesn't alter the original DataFrame: withColumn returns a new DataFrame. To keep the new column, assign the output to a variable.
df.show()

In [None]:
# We can rename columns too! 
df.withColumnRenamed('age', 'my_new_age').show()

## Introducing SQL
We can query a DataFrame as if it were a table! Let's see a few examples of that below:

In [None]:
# First, we have to register the DataFrame as a SQL temporary view.
df.createOrReplaceTempView('people')

# After that, we can query the view with SQL. The query returns a new DataFrame, so remember to show it.
results = spark.sql("SELECT * FROM people")
results.show()

In [None]:
# Here's another example:
results = spark.sql("SELECT age FROM people WHERE age >= 19")
results.show()

Now that we're done with this tutorial, let's move on to Spark DataFrame Operations!