### What is data cleaning?

 - Preparing raw data for use in data processing pipelines
 - Tasks include reformatting or replacing text, performing calculations, and removing garbage or incomplete data
 - With billions of records, performance becomes an issue, hence the use of Spark, which can handle big data. The primary limit on what Spark can do is the amount of RAM in the Spark cluster
 - Spark schemas: define the data type of each column, can filter garbage data during import, and improve read performance
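
To make the idea of filtering garbage rows against a schema concrete, here is a minimal plain-Python sketch using only the standard library. It is an analogy, not Spark code: the `SCHEMA` list, the `parse_rows` helper, and the sample rows are all hypothetical. It mimics dropping rows that do not fit the declared types, which is roughly what Spark's CSV reader does when the `mode` option is set to `DROPMALFORMED`.

```python
import csv
import io

# Hypothetical schema: (column name, Python type) pairs mirroring name/age/city.
SCHEMA = [("name", str), ("age", int), ("city", str)]


def parse_rows(text, schema):
    """Return the rows that fit the schema and skip the ones that do not."""
    clean = []
    for raw in csv.reader(io.StringIO(text)):
        # A row with the wrong number of fields is treated as garbage.
        if len(raw) != len(schema):
            continue
        try:
            row = {}
            for (col, typ), value in zip(schema, raw):
                # Empty fields count as missing data in this sketch.
                if value == "":
                    raise ValueError("missing value")
                # Cast each field to its declared type.
                row[col] = typ(value)
        except ValueError:
            continue
        clean.append(row)
    return clean


# Hypothetical raw input: two good rows, one bad age, one malformed line.
sample = "Alice,34,Dallas\nBob,not_a_number,Austin\nbroken line\nCara,29,Houston\n"
people = parse_rows(sample, SCHEMA)
print(people)
```

Only the rows for Alice and Cara survive; the other two fail the field-count or type check and are skipped.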

In [None]:
# Import the pyspark.sql.types library
from pyspark.sql.types import *

# Define a new schema using the StructType method
people_schema = StructType([
  # Define a StructField for each field
  StructField('name', StringType(), False),
  StructField('age', IntegerType(), False),
  StructField('city', StringType(), False)
])

# Read the file and enforce the schema
people_df = spark.read.format('csv').load(path='rawdata.csv', schema=people_schema)

#### Immutability and lazy processing

 - Normally in Python, variables are mutable
 - This adds flexibility but presents a problem when multiple concurrent components try to modify the same data
 - Spark DataFrames are designed to be immutable (defined once and not modifiable after initialization)
 - A variable that is reassigned refers to a newly created object
 - This allows Spark to share data efficiently without worrying about concurrent modification
 - When you make changes to a Spark DataFrame, a new DataFrame is created; if you assign it to the same variable, it takes over the original's name and the original object is no longer referenced
 - That doesn't mean that the original data (e.g. the file that was read to create the first DataFrame) is changed
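
The reassignment behaviour described above can be sketched in plain Python. This is only an analogy, assuming a hypothetical `Table` class with a `with_column` method; it is not how Spark implements DataFrames. The point is that a change builds a new object while the original stays untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical immutable table: column names plus a tuple of rows."""
    columns: tuple
    rows: tuple

    def with_column(self, name, func):
        # Build and return a new Table; self is never modified.
        new_rows = tuple(
            row + (func(dict(zip(self.columns, row))),) for row in self.rows
        )
        return Table(self.columns + (name,), new_rows)


original = Table(("airport",), (("DFW",), ("ORD",)))
df = original

# Reassigning the name binds it to a new object, like df = df.withColumn(...).
df = df.with_column("airport_lower", lambda r: r["airport"].lower())

print(original.columns)
print(df.columns)
```

After the reassignment, `df` points at the new two-column table while `original` still holds the one-column table.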


Lazy processing: the idea that very little actually happens until an action is performed
 - Functionality is broken down into transformations and actions
 - Transformations are like instructions of what we want to accomplish
 - Actions are like "triggers" that begin the process based on the instructions provided
 - Lazy processing operations will usually return in about the same amount of time regardless of the actual quantity of data. Remember that this is due to Spark not performing any transformations until an action is requested.
 - Note the amount of time required for the transformations to complete when defined vs when the data is actually queried. These differences may be short, but they will be noticeable. When working with a full Spark cluster with larger quantities of data the difference will be more apparent.
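
The split between transformations and actions can be illustrated with a small plain-Python sketch. The `LazyPipeline` class and its `map`, `filter`, and `collect` methods are hypothetical stand-ins, not Spark's API: transformations only record a step, and the action at the end runs every recorded step.

```python
class LazyPipeline:
    """Hypothetical stand-in for a DataFrame: records steps, runs them on demand."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps

    def map(self, func):
        # Transformation: only record the step and return a new pipeline.
        return LazyPipeline(self._data, self._steps + (("map", func),))

    def filter(self, pred):
        # Transformation: again, nothing is computed here.
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: apply every recorded step to the data now.
        rows = iter(self._data)
        for kind, func in self._steps:
            rows = map(func, rows) if kind == "map" else filter(func, rows)
        return list(rows)


calls = []


def lower(value):
    calls.append(value)  # track when the work really happens
    return value.lower()


pipeline = LazyPipeline(["DFW", "ORD", "LAX"]).map(lower).filter(lambda v: v != "ord")
calls_before_action = len(calls)  # no work has been done yet, only the plan exists
result = pipeline.collect()       # the action triggers the work
print(calls_before_action, result, len(calls))
```

Defining the pipeline costs about the same no matter how large the input is, because `lower` is not called until `collect()` runs.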

In [None]:
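# Import the functions module used below as F
import pyspark.sql.functions as F

# Load the CSV file, treating the first line as the header
aa_dfw_df = spark.read.format('csv').options(header=True).load('AA_DFW_2018.csv.gz')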

# Add the airport column using the F.lower() method
aa_dfw_df = aa_dfw_df.withColumn('airport', F.lower(aa_dfw_df['Destination Airport']))

# Drop the Destination Airport column
aa_dfw_df = aa_dfw_df.drop(aa_dfw_df['Destination Airport'])

# Show the DataFrame
aa_dfw_df.show()

### CSVs vs Parquet file formats

CSVs
 - CSVs are slow to import and parse. The files cannot be shared among Spark workers during import, and if no schema is defined, all data must be read before a schema can be inferred
 - CSV data cannot be filtered via "predicate pushdown", the optimization of ordering tasks to do the least amount of work. Filtering data before processing it drastically reduces the amount of information that must be processed in large data sets
 - Spark processes are often multi-step and may use an intermediate file representation, which allows data to be reused later without regenerating it from the source. Using CSV for this would require significant extra work defining schemas, encoding formats, etc.

Parquet files
 - Parquet is a compressed columnar data format developed for use in any Hadoop-based system (Apache Spark, Hadoop, Impala)
 - The format is structured with data accessible in chunks allowing efficient read-write operations without processing the entire file
 - It is a columnar data store that supports predicate pushdown: Spark processes only the data needed to complete the operations you define instead of reading the entire dataset. This gives Spark more flexibility in accessing the data and often drastically improves performance on large datasets
 - Automatically includes schema information and handles data encoding
 - Parquet is a binary file format and can only be used with the proper tools, in contrast to CSV files, which can be edited with any text editor
 - Parquet files are perfect as a backing data store for SQL queries in Spark. While it is possible to run the same queries directly via Spark's Python functions, sometimes it's easier to run SQL queries alongside the Python options.
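
One mechanism behind predicate pushdown is that Parquet stores statistics such as the minimum and maximum value for each chunk of a column, so a reader can skip chunks that cannot contain matching rows. The plain-Python sketch below imitates that idea; the `row_groups` data and the `scan_less_than` helper are hypothetical and greatly simplified compared with the real file format.

```python
# Hypothetical chunks of a flightduration column, each with min/max statistics.
row_groups = [
    {"min": 30, "max": 95, "values": [30, 60, 95]},
    {"min": 120, "max": 300, "values": [120, 180, 300]},
    {"min": 80, "max": 150, "values": [80, 99, 150]},
]


def scan_less_than(groups, limit):
    """Return the values below limit and the number of chunks actually read."""
    matches = []
    chunks_read = 0
    for group in groups:
        # Pushdown step: skip the chunk if even its minimum fails the filter.
        if group["min"] >= limit:
            continue
        chunks_read += 1
        matches.extend(v for v in group["values"] if v < limit)
    return matches, chunks_read


short_flights, chunks_read = scan_less_than(row_groups, 100)
print(short_flights, chunks_read)
```

Here the second chunk is never read because its minimum of 120 already rules out any value under 100. A CSV file carries no such statistics, so every line has to be parsed before it can be filtered.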

In [None]:
# Two methods to read/write (used interchangeably)

# Reading parquet files
df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')

# Writing parquet files
df.write.format('parquet').save('filename.parquet')
df.write.parquet('filename.parquet')

In [None]:
# To run SQL queries use createOrReplaceTempView
# after reading the parquet file

flight_df = spark.read.parquet('flights.parquet')

flight_df.createOrReplaceTempView('flights')

short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 100')
  


### Manipulating DataFrames

DataFrames
 - Made up of rows and columns and generally analogous to a database table
 - Are immutable as any change to the structure or content creates a new DataFrame
 - Are modified through the use of transformations

#### Examples

In [None]:
# Return rows where name starts with "M"
voter_df.filter(voter_df.name.like('M%'))
# Return name and position only
voters = voter_df.select('name', 'position')

In [None]:
# F.year is used below to extract the year (assumes date is a date or timestamp column)
import pyspark.sql.functions as F

# Filter
voter_df.filter(voter_df.date > '1/1/2019') # or voter_df.where(...)
voter_df.filter(voter_df['name'].isNotNull()) # remove nulls
voter_df.filter(F.year(voter_df.date) > 1800) # remove old entries

# Where
voter_df.where(voter_df['_c0'].contains('VOTE')) # split data from combined sources
voter_df.where(~ voter_df._c1.isNull()) # negate with ~

# Select
voter_df.select(voter_df.name)

# withColumn
voter_df.withColumn('year', F.year(voter_df.date))

# drop
voter_df.drop('unused_column')

In [None]:
# Column string transformations
# Contained in pyspark.sql.functions
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# Apply a transformation to a column
voter_df.withColumn('upper', F.upper('name'))

# Create intermediary columns
voter_df.withColumn('splits', F.split('name', ' '))

# Cast to other types
voter_df.withColumn('year', voter_df['_c4'].cast(IntegerType()))
    

#### ArrayType column functions
 - F.size(column) returns the length of an ArrayType column
 - column.getItem(index) retrieves the item at the given index of an array column
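
As a rough per-row picture of what splitting, `size`, and `getItem` do together, here is a plain-Python equivalent applied to a single string. The `first_and_last` helper and the sample name are hypothetical; in Spark the same steps run over a whole column at once.

```python
import re


def first_and_last(name):
    """Split a name on whitespace and return (first item, last item)."""
    splits = re.split(r"\s+", name)     # like F.split(column, r'\s+')
    size = len(splits)                  # like F.size('splits')
    return splits[0], splits[size - 1]  # like getItem(0) and getItem(size - 1)


print(first_and_last("Ada M. Lovelace"))
```

Using `size - 1` as the index picks the last element regardless of how many middle names or initials a value contains.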

#### More examples

In [None]:
# Show the distinct VOTER_NAME entries
voter_df.select('VOTER_NAME').distinct().show(40, truncate=False)

# Filter voter_df where the VOTER_NAME is 1-19 characters in length (length > 0 and < 20)
voter_df = voter_df.filter('length(VOTER_NAME) > 0 and length(VOTER_NAME) < 20')

# Filter out voter_df where the VOTER_NAME contains an underscore
voter_df = voter_df.filter(~ F.col('VOTER_NAME').contains('_'))

# Show the distinct VOTER_NAME entries again
voter_df.select('VOTER_NAME').distinct().show(40, truncate=False)

# Add a new column called splits separated on whitespace (raw string keeps the regex escape intact)
voter_df = voter_df.withColumn('splits', F.split(voter_df.VOTER_NAME, r'\s+'))

# Create a new column called first_name based on the first item in splits
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))

# Get the last entry of the splits list and create a column called last_name
voter_df = voter_df.withColumn('last_name', voter_df.splits.getItem(F.size('splits') - 1))

# Drop the splits column
voter_df = voter_df.drop('splits')

# Show the voter_df DataFrame
voter_df.show()