# Spark basics

If you have not installed Spark yet, install PySpark in a virtual environment with `pip install pyspark`.

## Spark architecture

1. In the Spark or PySpark root folder, look at the directory structure. Do you recognize some modules?

2. Now go to the `bin` folder. Do you recognize the different APIs for each language?

3. Now execute `spark-shell` and afterwards `pyspark`. Where is Spark running, and where is the UI running?

4. Go to the Spark UI and browse through it.

## Basic Spark query commands

In [2]:
# Import the SparkSession class
from pyspark.sql import SparkSession

In [3]:
# Create the SparkSession, read the iris data set and show the DataFrame
spark = SparkSession.builder.appName("IrisQuery").getOrCreate()
df = spark.read.csv("iris.csv")
df.show()

23/10/11 20:40:22 WARN Utils: Your hostname, stephan-Precision-5520 resolves to a loopback address: 127.0.1.1; using 192.168.1.130 instead (on interface wlp2s0)
23/10/11 20:40:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/11 20:40:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


+---+---+---+---+-----------+
|_c0|_c1|_c2|_c3|        _c4|
+---+---+---+---+-----------+
|5.1|3.5|1.4|0.2|Iris-setosa|
|4.9|3.0|1.4|0.2|Iris-setosa|
|4.7|3.2|1.3|0.2|Iris-setosa|
|4.6|3.1|1.5|0.2|Iris-setosa|
|5.0|3.6|1.4|0.2|Iris-setosa|
|5.4|3.9|1.7|0.4|Iris-setosa|
|4.6|3.4|1.4|0.3|Iris-setosa|
|5.0|3.4|1.5|0.2|Iris-setosa|
|4.4|2.9|1.4|0.2|Iris-setosa|
|4.9|3.1|1.5|0.1|Iris-setosa|
|5.4|3.7|1.5|0.2|Iris-setosa|
|4.8|3.4|1.6|0.2|Iris-setosa|
|4.8|3.0|1.4|0.1|Iris-setosa|
|4.3|3.0|1.1|0.1|Iris-setosa|
|5.8|4.0|1.2|0.2|Iris-setosa|
|5.7|4.4|1.5|0.4|Iris-setosa|
|5.4|3.9|1.3|0.4|Iris-setosa|
|5.1|3.5|1.4|0.3|Iris-setosa|
|5.7|3.8|1.7|0.3|Iris-setosa|
|5.1|3.8|1.5|0.3|Iris-setosa|
+---+---+---+---+-----------+
only showing top 20 rows



In [3]:
# Read the CSV file with schema inference and print the schema
df = spark.read.option("inferSchema", "true").csv("iris.csv")
df.printSchema()

root
 |-- _c0: double (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: string (nullable = true)



In [4]:
# Rename the columns to sepal_length, sepal_width, petal_length, petal_width and iris_type
df = (
    df.withColumnRenamed("_c0", "sepal_length")
    .withColumnRenamed("_c1", "sepal_width")
    .withColumnRenamed("_c2", "petal_length")
    .withColumnRenamed("_c3", "petal_width")
    .withColumnRenamed("_c4", "iris_type")
)
df.show()

+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|  iris_type|
+------------+-----------+------------+-----------+-----------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|
|         5.4|        3.9|         1.7|        0.4|Iris-setosa|
|         4.6|        3.4|         1.4|        0.3|Iris-setosa|
|         5.0|        3.4|         1.5|        0.2|Iris-setosa|
|         4.4|        2.9|         1.4|        0.2|Iris-setosa|
|         4.9|        3.1|         1.5|        0.1|Iris-setosa|
|         5.4|        3.7|         1.5|        0.2|Iris-setosa|
|         4.8|        3.4|         1.6|        0.2|Iris-setosa|
|         4.8|        3.0|         1.4| 

In [5]:
# Read the CSV file and apply an explicit schema to it
df = spark.read.schema("sepal_length double, sepal_width double, petal_length double, petal_width double, iris_type string").csv("iris.csv")

df.printSchema()

root
 |-- sepal_length: double (nullable = true)
 |-- sepal_width: double (nullable = true)
 |-- petal_length: double (nullable = true)
 |-- petal_width: double (nullable = true)
 |-- iris_type: string (nullable = true)



In [7]:
# Get the row count and the maximum of each measurement by iris type
df.groupBy("iris_type").count().show()
df.groupBy("iris_type").max().show()

+---------------+-----+
|      iris_type|count|
+---------------+-----+
| Iris-virginica|   50|
|    Iris-setosa|   50|
|Iris-versicolor|   50|
+---------------+-----+

+---------------+-----------------+----------------+-----------------+----------------+
|      iris_type|max(sepal_length)|max(sepal_width)|max(petal_length)|max(petal_width)|
+---------------+-----------------+----------------+-----------------+----------------+
| Iris-virginica|              7.9|             3.8|              6.9|             2.5|
|    Iris-setosa|              5.8|             4.4|              1.9|             0.6|
|Iris-versicolor|              7.0|             3.4|              5.1|             1.8|
+---------------+-----------------+----------------+-----------------+----------------+



In [9]:
# Use SQL syntax to query the data and show the table
df.createOrReplaceTempView("iris_table")

spark.sql("SELECT * FROM iris_table").show()

+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|  iris_type|
+------------+-----------+------------+-----------+-----------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|
|         5.4|        3.9|         1.7|        0.4|Iris-setosa|
|         4.6|        3.4|         1.4|        0.3|Iris-setosa|
|         5.0|        3.4|         1.5|        0.2|Iris-setosa|
|         4.4|        2.9|         1.4|        0.2|Iris-setosa|
|         4.9|        3.1|         1.5|        0.1|Iris-setosa|
|         5.4|        3.7|         1.5|        0.2|Iris-setosa|
|         4.8|        3.4|         1.6|        0.2|Iris-setosa|
|         4.8|        3.0|         1.4| 

In [10]:
# Show the maximum sepal length for Setosa with SQL syntax

query = "SELECT max(sepal_length) FROM iris_table WHERE iris_type = 'Iris-setosa'"

spark.sql(query).show()

+-----------------+
|max(sepal_length)|
+-----------------+
|              5.8|
+-----------------+



Now go to the Spark UI and analyze the jobs, DAGs, etc.

In [11]:
# Stop the Spark session

spark.stop()