In [1]:
import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext("local", "pyspark-shell")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# DataFrame details

## Intro to data cleaning with Apache Spark

Data cleaning means preparing raw data for use in data processing pipelines. Spark is scalable, so your data processing capacity can grow along with your data.

The primary limit to Spark's abilities is the **amount of RAM** available in the Spark cluster.

A schema defines and validates the number and types of columns for a given DataFrame. Defining a schema up front also improves read performance, since Spark can skip inferring the column types from the data.

### Defining a schema

In [2]:
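# Import only the types used below instead of a wildcard import
from pyspark.sql.types import StructType, StructField, StringType, IntegerType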

people_schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), False),
    StructField("city", StringType(), False)
])

## Immutability and lazy processing

Spark DataFrames are defined once and are not modifiable after initialization.

Lazy processing lets Spark work out the most efficient set of operations to get the desired result. Spark does not perform any transformations until an action is requested.

### Using lazy processing

In [11]:
import pyspark.sql.functions as F
aa_dfw_df = spark.read.csv("AA_DFW_2017_Departures_Short.csv", header=True)
aa_dfw_df = aa_dfw_df.withColumn("airport", F.lower(aa_dfw_df["Destination Airport"]))
aa_dfw_df = aa_dfw_df.drop(aa_dfw_df["Destination Airport"])
aa_dfw_df.show()

+-----------------+-------------+-----------------------------+-------+
|Date (MM/DD/YYYY)|Flight Number|Actual elapsed time (Minutes)|airport|
+-----------------+-------------+-----------------------------+-------+
|       01/01/2017|         0005|                          537|    hnl|
|       01/01/2017|         0007|                          498|    ogg|
|       01/01/2017|         0037|                          241|    sfo|
|       01/01/2017|         0043|                          134|    dtw|
|       01/01/2017|         0051|                           88|    stl|
|       01/01/2017|         0060|                          149|    mia|
|       01/01/2017|         0071|                          203|    lax|
|       01/01/2017|         0074|                           76|    mem|
|       01/01/2017|         0081|                          123|    den|
|       01/01/2017|         0089|                          161|    slc|
|       01/01/2017|         0096|                           84| 

## Understanding Parquet

Some common issues with CSV files are that the schema is not defined and that no data types or column names are included (beyond a header row). In addition to these, Spark has some specific problems processing CSV data: CSV files are quite slow to import and parse.

Parquet is a compressed columnar data format developed for use in any Hadoop-based system. The format stores data in chunks, allowing efficient read and write operations without processing the entire file. Parquet files provide a significant performance improvement, and they automatically include schema information and handle data encoding.

Parquet is a binary file format, so Parquet files can only be used with the proper tools.

To write a Parquet file, use `df.write.parquet("filename.parquet")`.

### Saving a DataFrame in Parquet format

In [14]:
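# No header option is given, so Spark names the columns _c0, _c1, ...
df1 = spark.read.csv("AA_DFW_2017_Departures_Short.csv")
df2 = spark.read.csv("AA_DFW_2016_Departures_Short.csv")

print("df1 Count: %d" % df1.count())
print("df2 Count: %d" % df2.count())

# Combine both years, then write the result through pandas and pyarrow
df3 = df1.union(df2)
df3 = df3.toPandas()
df3.to_parquet("AA_DFW_ALL.parquet", engine='pyarrow', compression='gzip', index=False)
# Spark-native alternative (call it on the Spark DataFrame, before toPandas()):
# df3.write.parquet("AA_DFW_ALL.parquet", mode="overwrite")
print(spark.read.parquet('AA_DFW_ALL.parquet').count())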

df1 Count: 139359
df2 Count: 140605
279964


### SQL and Parquet

In [45]:
flights_df = spark.read.parquet('AA_DFW_ALL.parquet')
flights_df.createOrReplaceTempView("flights")

# collect() returns a list of Row objects; take the single value from the first Row
avg_duration = spark.sql("SELECT avg(_c3) FROM flights").collect()[0][0]
print('The average flight time is: %d' % avg_duration)

The average flight time is: 151
