### Big data

 - Data sets that are too large or complex for traditional computing resources to handle
 - Data sets that may need a distributed cluster of computers to analyse
 - The 3 Vs:
     - Volume, the size of the data
     - Variety, the various sources and formats
     - Velocity, the frequency (speed) at which the data arrive or get updated
 
 - Clustered computing: pooling the resources of multiple machines
 - Parallel computing: simultaneous computation
 - Distributed computing: collection of nodes (networked computers) that run in parallel
 - Batch processing: Breaking the job into smaller pieces and running them on individual machines
 - Real-time processing: immediate processing of data
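
The idea behind batch and parallel processing (break a job into pieces, process the pieces at the same time, combine the partial results) can be sketched in plain Python. This is only an illustration, not how Spark is implemented: threads stand in for the machines of a cluster, and the helper names `split_into_chunks` and `process_chunk` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks each get one extra element
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """The work done on one piece of the job: sum of squares."""
    return sum(x * x for x in chunk)


numbers = list(range(1, 11))
chunks = split_into_chunks(numbers, 3)

# Each worker thread plays the role of one machine in the cluster
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_chunk, chunks))

# Combine the partial results into the final answer
total = sum(partial_sums)
print(chunks)
print(partial_sums, total)
```

The final result does not depend on how the list is split, which is what makes this kind of job easy to distribute.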

--------------
**Apache Spark**
 - General-purpose, lightning-fast cluster computing system that is open source and can handle both batch and real-time data processing (by comparison, Apache Hadoop/MapReduce only handles batch processing)
 - Spark distributes data and computation across multiple computers in order to execute complex (multi-stage) applications such as machine learning.
 - Spark runs most computations in-memory for better performance
 - Written in Scala but supports multiple languages
 - At the center of Spark's ecosystem is Spark Core (RDD API) which contains the basic functionality of Spark
 - On top of Spark Core there are libraries:
     - Spark SQL: for processing structured and semi-structured data
     - MLlib: includes common machine learning algorithms
     - GraphX: collection of algorithms and tools for manipulating graphs and performing parallel graph computations
     - Spark Streaming: a scalable, high-throughput processing library for real-time data

--------------
**Spark modes**

 - Local mode: runs on a single machine (like a laptop) / convenient for testing, debugging and demonstration purposes
 - Cluster mode: runs on a cluster of computers / used for production

**Spark shells**

 - Spark comes with interactive shells that enable ad-hoc analysis
 - Spark shell is an interactive environment through which one can access Spark's functionality quickly and conveniently
 - Spark's shells allow interacting with data that is on disk or in memory across many machines. Spark takes care of automatically distributing this processing
 - Three different Spark shells: spark-shell for Scala, sparkR for R and pyspark for Python

--------------
**Spark in Python**

 - PySpark is a library created for running Spark in Python
 - PySpark shell is the Python-based command line tool for developing Spark applications in Python
 - To use the Spark shell you need an entry point, i.e. the place where control is transferred from the operating system to the provided program. In Spark this is the SparkContext, which is available in the PySpark shell as a variable named sc.
 - SparkContext attributes
     - Use sc.version to retrieve the version of Spark that the SparkContext runs on
     - Use sc.pythonVer to retrieve the Python version used by the SparkContext
     - Use sc.master to retrieve the URL of the cluster, or a string starting with "local" (e.g. local[*]) if Spark runs in local mode
 - You can load raw data into PySpark using SparkContext's:
     - sc.parallelize method
     > rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
     - sc.textFile() method
     > rdd = sc.textFile("test.txt")
 - A SparkContext represents the entry point to Spark functionality, much like a key to your car. When you run any Spark application, a driver program starts; it holds the main function, and the SparkContext is initiated there. In the PySpark shell, a SparkContext is created for you automatically (so you don't have to create it yourself) and is exposed via the variable sc.

--------------
**RDD**
 
 - An immutable distributed collection of objects
 - Stands for Resilient Distributed Dataset
     - Resilient: ability to withstand failures and recompute missing or damaged partitions
     - Distributed: spanning the jobs across multiple nodes in the cluster for efficient computation
     - Datasets: collection of partitioned data (e.g. arrays, tables, tuples or other objects)
 - Backbone data type in PySpark
 - When Spark starts processing data, it divides it into partitions and distributes it across cluster nodes, with each node containing a slice of data
 - 3 different methods for creating RDDs
     - via sc.parallelize([list])
     - from external datasets using sc.textFile('file.type')
     - from existing RDDs
 - A partition is a logical division of a dataset, with the parts distributed across the nodes of the cluster. By default, Spark partitions the data when an RDD is created, based on factors such as the available resources and the external dataset. You can control this by passing the numSlices argument to sc.parallelize() or the minPartitions argument to sc.textFile(), which sets the minimum number of partitions to create.
 - To check the number of partitions of an RDD use
 > rdd_name.getNumPartitions()

--------------
**RDD operations in PySpark**
 
 - Two types of Spark operations are supported:
     - Transformations
     - Actions
 - Transformations are operations on RDDs that return a new RDD
 - Actions run a computation on the RDD and return the result to the driver or write it to storage (rather than returning a new RDD)
 - The most important feature of Spark for RDD fault tolerance and efficient resource use is lazy evaluation
 - Transformations follow lazy evaluation
 - Lazy evaluation:
     - Spark builds a graph from all the operations you perform on an RDD; execution of the graph starts only when an action is performed on the RDD
     
<img src="assets/spark/lazy_eval.png" style="width: 600px;"/>

 - Basic RDD transformations
 > map(), filter(), flatMap(), union()
 - Basic RDD actions
 > collect(), take(N), first(), count()
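
Lazy evaluation can be illustrated without Spark, since Python's built-in `map()` and `filter()` are lazy too. In the sketch below they play the role of transformations and `list()` plays the role of an action such as collect(). It is an analogy only: Spark builds a distributed execution graph, whereas this is a single-process iterator chain. The `log` list and `traced_square` helper are introduced here just to make the evaluation order visible.

```python
log = []  # records every element that actually gets processed


def traced_square(x):
    """Square x and note that the computation really happened."""
    log.append(x)
    return x * x


numbers = [1, 2, 3, 4, 5]

# "Transformations": these only describe the pipeline, nothing runs yet
squared = map(traced_square, numbers)
even_squares = filter(lambda x: x % 2 == 0, squared)
calls_before_action = len(log)

# "Action": asking for the results triggers the whole pipeline
result = list(even_squares)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```

No element is squared until the results are requested, which mirrors how Spark defers work until an action is called.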
 
**Examples**
#### Create map() transformation to cube numbers
cubedRDD = numbRDD.map(lambda x: x**3)

#### Collect the results
numbers_all = cubedRDD.collect()

#### Print the numbers from numbers_all
for numb in numbers_all:
	print(numb)

#### Filter the fileRDD to select lines with Spark keyword
fileRDD_filter = fileRDD.filter(lambda line: 'Spark' in line)

#### How many lines contain the keyword Spark?
print("The total number of lines with the keyword Spark is", fileRDD_filter.count())

#### Print the first four lines of fileRDD_filter
for line in fileRDD_filter.take(4): 
  print(line)

--------------
**Pair RDDs**
 
 - A special data structure to work with datasets that are key/value pairs
 - Each row is a key and maps to one or more values
 - The key refers to the identifier whereas value refers to the data
 - Two ways to create pair RDDs:
     - From a list of key/value tuples
     > my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]<br>
     > pairRDD_tuple = sc.parallelize(my_tuple)
     - From a regular RDD
     > my_list = ['Sam 23', 'Mary 34', 'Peter 25']<br>
     > regularRDD = sc.parallelize(my_list)<br>
     > pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))
 - Operations available on RDDs are still available on pair RDDs but there are some special operations too
 - Since pair RDDs contain tuples, the functions we pass to transformations have to operate on key/value pairs
     - reduceByKey(func): combine values with the same key by running parallel operations for each key in the dataset
     > regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)])<br>
     > pairRDD_reducebykey = regularRDD.reduceByKey(lambda x,y : x + y)<br>
     > pairRDD_reducebykey.collect()<br>
     > [('Neymar', 22), ('Ronaldo', 34), ('Messi', 47)]
     - groupByKey(): group values with the same key
     > airports = [("US", "JFK"),("UK", "LHR"),("FR", "CDG"),("US", "SFO")]<br>
     > regularRDD = sc.parallelize(airports)<br>
     > pairRDD_group = regularRDD.groupByKey().collect()<br>
     > for cont, air in pairRDD_group: print(cont, list(air))<br>
     > FR ['CDG']<br>
     > US ['JFK', 'SFO']<br>
     > UK ['LHR']
 
     - sortByKey(): return an RDD sorted by the key
     > pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))<br>
     > pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()<br>
     > [(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]
 
     - join(): join two pair RDDs based on their key
     > RDD1 = sc.parallelize([("Messi", 34),("Ronaldo", 32),("Neymar", 24)])<br>
     > RDD2 = sc.parallelize([("Ronaldo", 80),("Neymar", 120),("Messi", 100)])<br>
     > RDD1.join(RDD2).collect()<br>
     > [('Neymar', (24, 120)), ('Ronaldo', (32, 80)), ('Messi', (34, 100))]
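
To make the semantics of reduceByKey() and groupByKey() concrete, here is a minimal single-machine sketch in plain Python. The functions `reduce_by_key` and `group_by_key` are hypothetical stand-ins written for this example; they are not part of PySpark and ignore partitioning and shuffling entirely.

```python
from collections import defaultdict


def reduce_by_key(pairs, func):
    """Combine the values of each key with func, like reduceByKey()."""
    result = {}
    for key, value in pairs:
        # First value for a key is stored as-is, later ones are merged in
        result[key] = func(result[key], value) if key in result else value
    return result


def group_by_key(pairs):
    """Collect all values of each key into a list, like groupByKey()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


goals = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)]
totals = reduce_by_key(goals, lambda x, y: x + y)

airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]
by_country = group_by_key(airports)

print(totals)
print(by_country)
```

The difference between the two is visible in the results: reduceByKey() collapses each key to a single combined value, while groupByKey() keeps every value.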
 
 
 --------------
**Advanced Actions on RDDs**

 - reduce(func): action for aggregating a regular RDD's elements
     - the function should be commutative (i.e. changing the order of the operands does not change the result) and associative (i.e. changing the grouping of the operands does not change the result)
 - saveAsTextFile("tempFile"): action for saving RDDs into a text file inside a directory with each partition as a separate file
     - coalesce(), a method to use for saving an RDD as a single text file
     - RDD.coalesce(1).saveAsTextFile("tempFile")
 - Action operations on pair RDDs
     - countByKey(): counts the number of elements for each key
     > rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])<br>
     > for kee, val in rdd.countByKey().items(): print(kee, val)<br>
     > a 2<br>
     > b 1
   
     - collectAsMap(): returns the key/value pairs in the RDD as a dict
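
Why the function passed to reduce() should be commutative and associative: each partition is reduced on its own and the partial results are then combined, so the answer must not depend on how the data happens to be split. The plain-Python sketch below imitates that two-step process. `reduce_partitions` is a helper made up for this example, and the partition layouts are chosen by hand.

```python
from functools import reduce


def reduce_partitions(partitions, func):
    """Reduce each partition separately, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)


# The same six numbers, split in two different ways
two_parts = [[1, 2, 3], [4, 5, 6]]
three_parts = [[1, 2], [3, 4], [5, 6]]


def add(x, y):
    return x + y  # commutative and associative


def sub(x, y):
    return x - y  # neither commutative nor associative


sum_two = reduce_partitions(two_parts, add)
sum_three = reduce_partitions(three_parts, add)
diff_two = reduce_partitions(two_parts, sub)
diff_three = reduce_partitions(three_parts, sub)

print(sum_two, sum_three)    # same result for both layouts
print(diff_two, diff_three)  # result changes with the layout
```

Addition gives the same total whatever the layout, while subtraction gives a different answer for each split, which is why it is not a safe function for reduce().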

**Examples**

**Count the unique keys**<br>
total = Rdd.countByKey()

**What is the type of total?**<br>
print("The type of total is", type(total))

**Iterate over the total and print the output**<br>
for k, v in total.items(): <br>
  print("key", k, "has", v, "counts")

**Create a baseRDD from the file path**<br>
baseRDD = sc.textFile(file_path)

**Split the lines of baseRDD into words**<br>
splitRDD = baseRDD.flatMap(lambda x: x.split())

**Count the total number of words**<br>
print("Total number of words in splitRDD:", splitRDD.count())

**Convert the words to lower case and remove stop words using the curated stop_words list**<br>
splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_words)

**Create a tuple of the word and 1**<br>
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

**Count the number of occurrences of each word**<br>
resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)


**Display the first 10 words and their frequencies from the input RDD**<br>
for word in resultRDD.take(10):<br>
	print(word)

**Swap the keys and values from the input RDD**<br>
resultRDD_swap = resultRDD.map(lambda x: (x[1], x[0]))

**Sort the keys in descending order**<br>
resultRDD_swap_sort = resultRDD_swap.sortByKey(ascending=False)

**Show the top 10 most frequent words and their frequencies from the sorted RDD**<br>
for word in resultRDD_swap_sort.take(10):<br>
	print("{},{}".format(word[1], word[0]))

--------------
**PySpark DataFrames**
 
 - PySpark SQL is Spark's high level API for working with structured data
 - Provides a programming abstraction called DataFrames
 - DFs are immutable distributed collections of data with named columns
 - Are designed for processing both structured (e.g. relational databases) and semi-structured (e.g. JSON) data
 - Support both SQL statements and direct expressions
 - The SparkSession provides a single entry point for interacting with Spark DataFrames (similar to what the SparkContext does for RDDs)
 - It can create DFs, register DFs and execute SQL queries
 - Available through the PySpark SQL
 - Two methods of creating DataFrames in PySpark
     - From existing RDDs using SparkSession's createDataFrame() method
     - From various data sources using SparkSession's read attribute (e.g. spark.read.csv(), spark.read.json())
 - The DF schema provides information about each column: its name, its data type, whether it may contain null values etc. If the schema is given as a list of column names, the data type of each column is inferred from the RDD's data. Use printSchema() to print the DF's schema

**Create an RDD from the list**<br>
rdd = sc.parallelize(sample_list)

**Create a PySpark DataFrame**<br>
names_df = spark.createDataFrame(rdd, schema=['Name', 'Age'])

**Check the type of names_df**<br>
print("The type of names_df is", type(names_df))

**Create a DataFrame from file_path**<br>
people_df = spark.read.csv(file_path, header=True, inferSchema=True)

**Check the type of people_df**<br>
print("The type of people_df is", type(people_df))


--------------
**Common DF transformations**
 - select(), filter(), groupBy(), orderBy(), dropDuplicates(), withColumnRenamed()

**Common DF actions**
 - head(), show(), count(), describe(), and the columns attribute

--------------
**SQL queries executions**

 - The SparkSession's sql() method can be used to execute SQL statements (returns a DF) in Spark
 - SQL queries cannot be run directly against a DataFrame. First register the DataFrame as a temporary view with df.createOrReplaceTempView("table1"); SQL queries can then be run against that view
 > df.createOrReplaceTempView("table1")<br>
 > df2 = spark.sql("SELECT field1, field2 FROM table1")<br>
 > df2.collect()

**Create a temporary table "people"**<br>
people_df.createOrReplaceTempView("people")

**Construct a query to select the names of the people from the temporary table "people"**<br>
query = '''SELECT name FROM people'''

**Assign the result of Spark's query to people_df_names**<br>
people_df_names = spark.sql(query)

**Print the top 10 names of the people**<br>
people_df_names.show(10)

**Filter the people table to select female sex**<br>
people_female_df = spark.sql('SELECT * FROM people WHERE sex=="female"')

**Filter the people table DataFrame to select male sex**<br>
people_male_df = spark.sql('SELECT * FROM people WHERE sex=="male"')

**Count the number of rows in both DataFrames**<br>
print("There are {} rows in the people_female_df and {} rows in the people_male_df DataFrames".format(people_female_df.count(), people_male_df.count()))

--------------
**Visualisations**

Graphical representation of data is essential for understanding and interpreting it. Convert names_df to a Pandas DataFrame and plot its contents as a horizontal bar plot, with the names of the people on the y-axis and their ages on the x-axis.

**Check the column names of names_df**<br>
print("The column names of names_df are", names_df.columns)

**Convert to Pandas DataFrame**<br>
df_pandas = names_df.toPandas()

**Create a horizontal bar plot**<br>
import matplotlib.pyplot as plt<br>
df_pandas.plot(kind='barh', x='Name', y='Age', colormap='winter_r')<br>
plt.show()

--------------
**Load the DataFrame**<br>
fifa_df = spark.read.csv(file_path, header=True, inferSchema=True)

**Check the schema of columns**<br>
fifa_df.printSchema()

**Show the first 10 observations**<br>
fifa_df.show(10)

**Print the total number of rows**<br>
print("There are {} rows in the fifa_df DataFrame".format(fifa_df.count()))

**Create a temporary view of fifa_df**<br>
fifa_df.createOrReplaceTempView('fifa_df_table')

**Construct the "query"**<br>
query = '''SELECT Age FROM fifa_df_table WHERE Nationality == "Germany"'''

**Apply the SQL "query"**<br>
fifa_df_germany_age = spark.sql(query)

**Generate basic statistics**<br>
fifa_df_germany_age.describe().show()

**Convert fifa_df_germany_age to a Pandas DataFrame**<br>
fifa_df_germany_age_pandas = fifa_df_germany_age.toPandas()

**Plot the 'Age' density of German players**<br>
fifa_df_germany_age_pandas.plot(kind='density')<br>
plt.show()

## Building data pipelines with Spark

 - Use Spark for data processing at scale
 - Interactive analytics
 - Machine learning
 - Do not use it when you have only a small amount of data

#### Spark dataframe

In [None]:
# Start the Spark analytics engine
from pyspark.sql import SparkSession
from pprint import pprint

spark = SparkSession.builder.getOrCreate()

In [None]:
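# Import the types used in the schema
from pyspark.sql.types import StructType, StructField, StringType, ByteType

# Define the schema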
schema = StructType([
  StructField("brand", StringType(), nullable=False),
  StructField("model", StringType(), nullable=False),
  StructField("absorption_rate", ByteType(), nullable=True),
  StructField("comfort", ByteType(), nullable=True)
])

better_df = (spark
             .read
             .options(header="true")
             # Pass the predefined schema to the Reader
             .schema(schema)
             .csv("/home/repl/workspace/mnt/data_lake/landing/ratings.csv"))

pprint(better_df.dtypes)

# Output:
# [('brand', 'string'),
#  ('model', 'string'),
#  ('absorption_rate', 'tinyint'),
#  ('comfort', 'tinyint')]

#### Dropping invalid rows

In [None]:
prices = (spark
      .read
      .options(header="true", mode="DROPMALFORMED")
      .csv('landing/prices.csv'))

#### Cleaning data

Select the column to transform with the .withColumn() method. Inside it, use the pyspark.sql.functions.when function to replace values that meet a given condition, and the .otherwise() method to set the value for the rows that don't (for example, to leave them unaltered).

In [None]:
# Replace nulls with arbitrary value on column subset
ratings = ratings.fillna(4, subset=["comfort"])

from pyspark.sql.functions import col, when

# Add/relabel the column
categorized_ratings = ratings.withColumn(
    "comfort",
    # Express the condition in terms of column operations
    when(col("comfort") > 3, "sufficient").otherwise("insufficient"))

categorized_ratings.show()

### Running a PySpark program locally

> python my_pyspark_data_pipeline.py  # script starts at least a SparkSession
 
Conditions:
 - local installation of Spark
 - access to referenced resources
 - the classpath is properly configured: the classpath tells the Java Virtual Machine, which Spark runs on, where to look for the classes that are imported
 
 
In daily operations you'll be using the spark-submit script.

 - It comes with any Spark installation
 - The script sets up a launch environment for use with the chosen cluster manager and deploy mode
 - The deploy mode tells Spark where to run the driver of the Spark application: either on the machine that submits the job (client mode, the default) or on one of the cluster's worker nodes (cluster mode)
 - After setting up the launch environment, spark-submit also invokes the main class or method

 > spark-submit \        -> On your path, if Spark is installed <br>
 > --master "local[*]" \ -> URL of the cluster manager <br>
 > --py-files PY_FILES \ -> Comma-separated list of .zip, .egg or .py files <br>
 > MAIN_PYTHON_FILE \    -> Path to the module to be run <br>
 > app_arguments         -> Optional arguments parsed by the main script


<img src="assets/pipelines/spark_submit.png" style="width: 600px;"/>


To run a PySpark program locally, first zip your code. This packaging step becomes more important when your code consists of many modules. To package the code in zip format, navigate to the root folder of your pipeline with the cd command and run:

 > zip --recurse-paths zip_file.zip pipeline_folder

<img src="assets/pipelines/spark_pipeline.png" style="height: 400px;"/>

#### Submitting your Spark job

 - With the dependencies of a job ready to be distributed across a cluster’s nodes, you can submit a job to a cluster easily. To run a PySpark application locally, you need to call:

 > spark-submit --py-files PY_FILES MAIN_PYTHON_FILE
 
with PY_FILES being either a zipped archive, a Python egg or separate Python files that will be placed on the PYTHONPATH environment variable of your cluster's nodes. The MAIN_PYTHON_FILE should be the entry point of your application.

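Example: if the path of the zipped archive is spark_pipelines/pydiaper/pydiaper.zip and the path to your application entry point is spark_pipelines/pydiaper/pydiaper/cleaning/clean_ratings.py, the call is:

 > spark-submit --py-files spark_pipelines/pydiaper/pydiaper.zip spark_pipelines/pydiaper/pydiaper/cleaning/clean_ratings.py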

#### Creating in-memory DataFrames

Creating small datasets for unit tests is an important skill. It improves readability and understanding, because any developer looking at your code can immediately see the inputs to a function and how they relate to the output. Additionally, you can illustrate how the function behaves with normal data and with exceptional data, like missing or incorrect fields.


In [None]:
from datetime import date
from pyspark.sql import Row

Record = Row("country", "utm_campaign", "airtime_in_minutes", "start_date", "end_date")

# Create a tuple of records
data = (
  Record("USA", "DiapersFirst", 28, date(2017, 1, 20), date(2017, 1, 27)),
  Record("Germany", "WindelKind", 31, date(2017, 1, 25), None),
  Record("India", "CloseToCloth", 32, date(2017, 1, 25), date(2017, 2, 2))
)

# Create a DataFrame from these records
frame = spark.createDataFrame(data)
frame.show()

In [None]:
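# Assumption: assertDataFrameEqual is taken from pyspark.testing (Spark 3.5+);
# calculate_unit_price_in_euro is the function under test, defined elsewhere.
from pyspark.testing import assertDataFrameEqual

def test_calculate_unit_price_in_euro():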
    record = dict(price=10, quantity=5, exchange_rate_to_euro=2.)
    df = spark.createDataFrame([Row(**record)])
    result = calculate_unit_price_in_euro(df)
    expected_record = Row(**record, unit_price_in_euro=4.)
    expected = spark.createDataFrame([expected_record])
    assertDataFrameEqual(result, expected)
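
The same testing pattern (small, visible inputs for both normal and exceptional data) can be tried out without a Spark session. The sketch below is a plain-Python stand-in that works on a single dict instead of a DataFrame. The function name `unit_price_in_euro_for_record`, the formula price / quantity * exchange_rate_to_euro and the choice to return None for missing inputs are all assumptions made for this illustration; the notes above only give the input values and the expected result of 4.

```python
def unit_price_in_euro_for_record(record):
    """Return a copy of record with a unit_price_in_euro field added.

    Assumed formula: price / quantity * exchange_rate_to_euro.
    If an input is missing (None) or quantity is 0, the result is None.
    """
    price = record.get("price")
    quantity = record.get("quantity")
    rate = record.get("exchange_rate_to_euro")
    if price is None or quantity is None or rate is None or quantity == 0:
        unit_price = None
    else:
        unit_price = price / quantity * rate
    return {**record, "unit_price_in_euro": unit_price}


# Normal data: the inputs and the expected output are visible side by side
normal = dict(price=10, quantity=5, exchange_rate_to_euro=2.)
normal_result = unit_price_in_euro_for_record(normal)

# Exceptional data: a missing quantity should not crash the function
missing = dict(price=10, quantity=None, exchange_rate_to_euro=2.)
missing_result = unit_price_in_euro_for_record(missing)

print(normal_result)
print(missing_result)
```

Keeping the records this small lets a reader check the expected values by hand.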