<a href="https://colab.research.google.com/github/XinchengLi0306/aws--instance/blob/main/Spark%20DataFrame%20-%20Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark DataFrame - Basics

Let's start off with the fundamentals of Spark DataFrame.

Objective: In this exercise, you'll learn how to start a Spark session, read in data, explore the data, and manipulate it (using DataFrame syntax as well as SQL syntax). Let's get started!

In [1]:
!apt-get install openjdk-11-jdk-headless -qq
!pip install -q pyspark findspark

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
# No need to download Spark separately: the pip installation of pyspark includes the Spark JARs.

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ColabSpark").getOrCreate()
spark

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
# Let's read in the data. Note that it's in JSON format.
# Change the path below to wherever your copy of people.json is stored.
df = spark.read.json('/content/drive/MyDrive/aws/people.json')
print(type(df))

<class 'pyspark.sql.dataframe.DataFrame'>
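
By default, `spark.read.json` expects JSON Lines input: one complete JSON object per line rather than a single top-level array. The sketch below uses only the Python standard library to write and parse such a file. The three records are an assumption chosen to mirror the table shown in the next section, not a copy of the file on Drive.

```python
import json
import os
import tempfile

# Hypothetical records mirroring the people.json used in this notebook.
records = [
    {"name": "Michael"},            # no "age" key: Spark would read this as NULL
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# JSON Lines: one JSON object per line, with no enclosing list.
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Parse the file back line by line, as a line-oriented reader would.
with open(path) as f:
    parsed = [json.loads(line) for line in f]

ages = [row.get("age") for row in parsed]
print(ages)  # [None, 30, 19]
```

A file that holds one pretty-printed JSON document spread over several lines would instead need `spark.read.option("multiLine", True).json(path)`.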


## Data Exploration

In [17]:
# The show method lets you display the contents of a DataFrame. We can see that there are two columns.
df.show()

# The columns attribute lists the column names.
df.columns

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



['age', 'name']

In [18]:
# We can use the describe method to get some general statistics on our data too. Remember to show the DataFrame!
# But what about the data types?
df.describe().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   NULL|
| stddev|7.7781745930520225|   NULL|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
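
The `count` of 2 for `age` shows that `describe` skips null values, and the `stddev` row is the sample standard deviation (it divides by n - 1). As a quick sanity check in plain Python, using the two non-null ages from the table above:

```python
import statistics

ages = [30, 19]  # the non-null values of the age column

mean_age = statistics.mean(ages)     # arithmetic mean of the two values
stddev_age = statistics.stdev(ages)  # sample standard deviation (n - 1)

print(mean_age, stddev_age)
```

Both values agree with the `mean` and `stddev` rows reported by Spark, up to floating-point printing.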



In [None]:
# For data types, we can use printSchema.
# But wait! What if you want to change the type of a column? Maybe change age to an integer instead of a long?
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



## Data Manipulation

In [None]:
# Let's import the relevant types.
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

In [None]:
# Then create a variable with the correct structure.
data_schema = [StructField('age',IntegerType(),True),
              StructField('name',StringType(),True)]

final_struct = StructType(fields=data_schema)

In [None]:
# And now we can read in the data using that schema (same file as before, so change the path here too). If we print the schema, we can see that age is now an integer.
df = spark.read.json('/content/drive/MyDrive/aws/people.json', schema=final_struct)

df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



In [None]:
# We can also select various columns from a DataFrame. Note that select returns a new DataFrame, not a single column.
df.select('age').show()

# We could split up these steps, first assigning the output to a variable, then showing that variable. As you can see, the output is the same.
age_df = df.select('age')

age_df.show()

+----+
| age|
+----+
|NULL|
|  30|
|  19|
+----+

+----+
| age|
+----+
|NULL|
|  30|
|  19|
+----+



In [None]:
# We can also add columns, manipulating the DataFrame.

df.withColumn('double_age',df['age']*2).show()

# But note that this doesn't alter the original DataFrame. To keep the new column, you need to assign the output to a variable.
df.show()

+----+-------+----------+
| age|   name|double_age|
+----+-------+----------+
|NULL|Michael|      NULL|
|  30|   Andy|        60|
|  19| Justin|        38|
+----+-------+----------+

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [None]:
# We can rename columns too!
df.withColumnRenamed('age', 'my_new_age').show()

+----------+-------+
|my_new_age|   name|
+----------+-------+
|      NULL|Michael|
|        30|   Andy|
|        19| Justin|
+----------+-------+



## Introducing SQL
We can query a DataFrame as if it were a table! Let's see a few examples of that below:

In [None]:
# First, we have to register the DataFrame as a SQL temporary view.
df.createOrReplaceTempView('people')

# After that, we can query the view using SQL.
results = spark.sql("SELECT * FROM people")

In [None]:
# Here's another example. Rows where age is NULL are left out, because the comparison is not true for them.
results = spark.sql("SELECT age FROM people WHERE age >= 19")
results.show()

+---+
|age|
+---+
| 30|
| 19|
+---+



Now that we're done with this tutorial, let's move on to Spark DataFrame Operations!