So far we've used toy examples to introduce the RDD API along with a few of its Transformations and Actions. Now let's look at a more real-life example: let's wrangle a fairly big "semi-structured" file and turn it into something a Data Scientist would be ready to work with. In fact, let's ask a few Data Science-y questions of this data and use Spark itself to answer them while we are at it!

To continue with this lesson, first download the required data files and put them in the data folder inside your working directory.

This example file is a standard Apache web server log. More specifically, it holds a month's worth of requests to NASA's website from the distant year of 1995, combined into one fairly big file.

This log contains the following information:

* The IP Address or the DNS name performing a request
* A time stamp of the form: "dd/Mon/YYYY:hh:mm:ss Timezone"
* The request type (HTTP verb), the resource being requested and the Protocol used
* The code returned by the server (200 OK, 404 Not Found etc...)
* The Size of the resource being requested

We will use the textFile method to read in this file. This, like the parallelize method, turns the data inside this file into an RDD. There are two important things you need to know about this method: 

* In a real-life Spark Cluster, the location of the file (the argument you will pass to textFile) must be visible/accessible to all nodes of the Cluster. In practice, a lot of the time this location will be a path on a Hadoop Distributed File System (HDFS), but this can be any Network File System, or a location mounted on all nodes, or Amazon S3... as long as it's visible and accessible on all nodes! 

* This method turns each line of the input file into an element in a Partition. So, no matter what the format of the file is, when it gets turned into an RDD, each line (as delimited by a newline a.k.a. "\n") becomes an element.
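To make the "one line, one element" rule concrete, here is a minimal plain-Python sketch of the idea. The two log lines are made up for illustration (they only mimic the layout of the real file), and the `sc` object and file path in the closing comment are assumptions, not names defined by this lesson.

```python
# Plain-Python sketch: how the lines of a text file map to RDD elements.
# The two log lines below are invented for illustration only.
raw_text = (
    'host1.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1024\n'
    'host2.example.com - - [01/Jul/1995:00:00:09 -0400] "GET /images/logo.gif HTTP/1.0" 200 786\n'
)

# Every newline-delimited line becomes one element, whatever the file format is.
elements = raw_text.splitlines()

print(len(elements))  # number of elements equals number of lines
print(elements[0])    # an element is the raw, unparsed line

# In PySpark the read itself would look roughly like this
# (sc is an existing SparkContext, the path is hypothetical):
#   nasa_logs = sc.textFile("data/nasa_logs.txt")
```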


In [None]:
#let's import pyspark, initialize a Spark connection


#Let's read NASA logs as a textfile into a variable and find out the type of variable we defined here


The first step in any data problem is to look at the data to get a sense of what we are dealing with. A good practice is to start by finding out how many elements we have. The RDD API has the count method for that:

In [None]:
# use count() method to see how many elements (lines) are in the NASA logs



The RDD API has the take Action, which brings a number of elements (remember, an element here is a line of the original file) back to the Driver so we can see them. The important thing here is to be careful not to bring too many elements back to the Driver and exceed its memory capacity!

In [None]:
# Use take() action to bring a number of elements from Cluster back to the Driver!


Now that we can see what the data looks like, a reasonable first step seems to be to split the data on the " " (space) character:

In [None]:
# Split the data on space characters
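As a rough sketch of what this step does to a single element, the snippet below splits one made-up log line in plain Python. The sample line is an assumption that mimics the log layout; the Spark call in the comment reuses the nasa_logs name from this lesson.

```python
# One invented log line in the same layout as the NASA file.
line = 'host1.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1024'

# Splitting on a single space turns the line into a list of fields.
fields = line.split(" ")

print(len(fields))
print(fields)

# On the RDD, the same function would be applied to every element with map:
#   split_logs = nasa_logs.map(lambda line: line.split(" "))
```

Note that the brackets and double quotes stay attached to their fields at this point; they are dealt with in a later cleaning step.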


Next, for the sake of this example, let's say we are not interested in lines where there is data missing. In other words, we are only interested in lines that have all 10 elements. We will use the filter method to filter any lines that don't have all 10 elements out of our RDD:

In [None]:
# Filter the data and count the lines that have exactly 10 elements
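The filtering logic can be sketched in plain Python as below. The helper name `has_all_fields` and both sample lines are hypothetical; the second line deliberately lacks the protocol so that it splits into only 9 fields.

```python
def has_all_fields(fields, expected=10):
    """Return True when a split log line has exactly the expected number of fields."""
    return len(fields) == expected

# Two invented lines, already split on spaces; the second one is missing the protocol.
split_lines = [
    'host1.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1024'.split(" "),
    'host2.example.com - - [01/Jul/1995:00:00:09 -0400] "GET /index.html" 200 1024'.split(" "),
]

# Keep only the complete lines, then count them.
complete = [fields for fields in split_lines if has_all_fields(fields)]
print(len(complete))

# A possible RDD version of the same idea:
#   nasa_logs.map(lambda line: line.split(" ")).filter(lambda f: len(f) == 10).count()
```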




The code for this step performs a series of operations to count the number of lines in the nasa_logs RDD that have exactly 10 elements separated by spaces.

You might be asking yourself whether using the take method all the time to check if we are doing things right is the best practice... and the answer is no. Every time you call it, you trigger a new computation and thus have the Spark Cluster do work for you. In real life you will rarely have a Cluster all to yourself, so you should expect your computations to get queued and to compete for resources with other users. In this scenario, minimizing the number of times you move things back and forth between the Driver and the Executors is a good idea.

So in practice, one approach would be to use the RDD API method sample to extract a sample of your data to examine in the driver and figure out what you need to do before farming out computations to the cluster. The take method also works here, but getting a random sample (using sample() method) instead of the first N elements of your RDD is almost always a better plan.


In [None]:
# Make sure you know how much data 0.01% of your dataset is! 
#It might look like a small fraction, but in the Big Data world 
#even that might be too much for your local computer!
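Before sampling, it helps to do the arithmetic on how many elements a given fraction is expected to return. The sketch below is plain Python with hypothetical helper names; the element count of one million is only an example figure, not the size of the NASA log. The second helper imitates sampling without replacement by keeping each element with probability `fraction`, which is why the returned size is only approximately the expected one.

```python
import random

def expected_sample_size(n_elements, fraction):
    """Expected number of elements returned when sampling a given fraction."""
    return round(n_elements * fraction)

def bernoulli_sample(elements, fraction, seed):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [e for e in elements if rng.random() < fraction]

# 0.01% corresponds to the fraction 0.0001 (not 0.01, which would be 1%).
print(expected_sample_size(1_000_000, 0.0001))

# The actual sample size varies around the expected value.
sample = bernoulli_sample(list(range(10_000)), 0.01, seed=42)
print(len(sample))

# On an RDD the call takes the same three ingredients:
#   nasa_logs.sample(False, 0.0001, 42)   # withReplacement, fraction, seed
```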





Web server logs like this are called 'semi-structured' for a reason: we can be pretty sure that every line will be formatted the same way. This means every element in each of our Partitions looks pretty much the same after our first step. We can be confident that the same unwanted characters ended up inside the elements of all partitions of our RDD. So our next step takes care of removing them:

In [None]:
# Data cleaning: remove the unwanted characters
# The dictionary maps three unwanted characters ([, ], and ") to empty strings ('')



In summary, this code cleaned the data by removing square brackets and double quotes from each element in lines that have exactly 10 elements.

* The translate() method returns a string where some specified characters are replaced with the character described in a dictionary, or in a mapping table.
* The str.maketrans(replacement_dict) part creates a translation table that maps the keys (characters) to their corresponding values (empty strings)
* The element.translate(...) call uses this translation table to remove the characters specified in the replacement_dict keys from the string element.
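The three points above can be sketched in plain Python as follows. The `clean` helper and the sample fields are hypothetical; `replacement_dict` and `element` follow the names used in the bullets.

```python
# Map the three unwanted characters to empty strings.
replacement_dict = {"[": "", "]": "", '"': ""}

# Build the translation table once and reuse it for every element.
table = str.maketrans(replacement_dict)

def clean(fields):
    """Strip square brackets and double quotes from every field of one split line."""
    return [element.translate(table) for element in fields]

# One invented, already-split log line.
fields = ['host1.example.com', '-', '-', '[01/Jul/1995:00:00:01', '-0400]',
          '"GET', '/index.html', 'HTTP/1.0"', '200', '1024']

print(clean(fields))

# On the RDD of split lines (name assumed), this would be applied with map:
#   cleaned_logs = split_logs.map(clean)
```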

Ok, so now our RDD has the following elements: 

* IP/NAME_OF_ORIGIN
* DATE/TIME
* TIMEZONE
* REQUEST_METHOD
* RESOURCE_REQUESTED
* PROTOCOL
* STATUS_CODE
* SIZE_OF_RESOURCE

That looks pretty much like a CSV (or a DataFrame) a Data Scientist could work with! Let's take advantage of our now-structured dataset and see if we can do a bit of Data Science using the RDD API directly. Let's find out where most requests to the NASA web server came from in our dataset. To do this, let's do a little bit of Map-Reduce.


In [None]:
# Take each line of our structured log and return a Key-Value Pair
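A minimal sketch of the "map" half of this Map-Reduce: each cleaned line is turned into a (key, 1) pair whose key is the origin of the request. The helper name `to_pair` is hypothetical, and the sketch assumes the origin sits at index 0 of each split line.

```python
def to_pair(fields):
    """Turn one cleaned, split log line into a (host, 1) Key-Value Pair."""
    # Index 0 holds the IP address or DNS name that made the request.
    return (fields[0], 1)

# One invented, cleaned log line.
fields = ['host1.example.com', '-', '-', '01/Jul/1995:00:00:01', '-0400',
          'GET', '/index.html', 'HTTP/1.0', '200', '1024']

print(to_pair(fields))

# On the cleaned RDD (name assumed) this would be:
#   pairs = cleaned_logs.map(lambda fields: (fields[0], 1))
```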




In [None]:
# Identify and return the five most frequent request origins (hosts).
# As in a word-count program, create a tuple containing the host and a count of 1
# representing its initial occurrence, then compute the total count per host.



# The second map() transforms the RDD to have the count as the first element
# for sorting purposes and the host as the second element.
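The "reduce" half can be imitated in plain Python to see what each stage produces. The pairs below are invented; the dictionary loop stands in for reduceByKey, the list comprehension for the second map(), and the sort for ordering by count. The Spark chain in the closing comment is one possible way to express the same steps, not necessarily the one used in this lesson.

```python
from collections import defaultdict

# Invented (host, 1) pairs, as produced by the previous step.
pairs = [
    ("a.example.com", 1), ("b.example.com", 1), ("a.example.com", 1),
    ("c.example.com", 1), ("a.example.com", 1), ("b.example.com", 1),
]

# Stand-in for reduceByKey(lambda a, b: a + b): add up the values of each key.
counts = defaultdict(int)
for host, value in pairs:
    counts[host] += value

# Stand-in for the second map(): put the count first so sorting orders by count.
swapped = [(count, host) for host, count in counts.items()]

# Sort in descending order of count and keep at most five entries.
top_five = sorted(swapped, reverse=True)[:5]
print(top_five)

# A possible RDD version of the same chain:
#   pairs.reduceByKey(lambda a, b: a + b) \
#        .map(lambda kv: (kv[1], kv[0])) \
#        .sortByKey(ascending=False) \
#        .take(5)
```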


Exercise 3.1 - Word count in NASA log

If we take the element containing NASA's website resource names and we replace the "/"s and "."s by " "s, we sort of get words. Write a word count program to find the top 5 most frequent words.


In [None]:
# The first step should split the resource name on
# / and . characters, treating them as word separators.
# The result should be a new RDD containing these "cleaned-up" website resource names.



In [None]:
# This step should create a new RDD
# where each element is a single word extracted from the website resource names.
# Use the flatMap() method to flatten the lists of words into a single level.
# Then use a map() transformation to create your count tuple.
# Finally use the reduceByKey() transformation to calculate the total count for each word.
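If the difference between map() and flatMap() is unclear, this plain-Python sketch may help. The resource names are invented and the helper `to_words` is hypothetical; the counting part of the exercise is deliberately left out.

```python
import re

# Invented resource names in the style of the log's "resource requested" field.
resources = ["/history/apollo/apollo-13.html", "/images/logo.gif"]

def to_words(resource):
    """Replace every / and . by a space, then split on whitespace."""
    return re.sub(r"[/.]", " ", resource).split()

# map(): one output element per input element, so we get a list of lists.
mapped = [to_words(r) for r in resources]

# flatMap(): the inner lists are flattened, so each word is its own element.
flat = [word for r in resources for word in to_words(r)]

print(mapped)
print(flat)
```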




Reading a CSV file with Core Spark API (RDD API):

The RDD API is very powerful, but on its own it has some serious limitations. Ironically, one of its biggest limitations is its usefulness on structured data... like CSV files.

We caught a glimpse of that in the NASA website example, but now let's look at a real-life CSV to illustrate this and introduce the Pandas on Spark API - a powerful API for which the RDD API can work as a beautiful complement.

The file below contains data about all pieces owned/maintained by the Metropolitan Museum of Art in New York City. As we've seen before, the RDD API only allows us to load it as a plain text file:


Spark Pandas API:

The Spark Core RDD API is a powerful tool for operating on very large data. However, the RDD API and its Functional Programming flavor are not for everyone. Most people dealing with heavy-duty data analytics problems are used to far more structured data types. Whether they're R users or Python users, data people love data that is in a tabular format - a Table in a database or a DataFrame in R or Pandas. Pandas is one of the most powerful Python packages for data analysis, so in most data analysis situations it is valuable to be able to mimic its functionality and design choices.

To cater to this particular user base, Spark maintainers introduced a new API in Spark 3.2: Pandas on Spark. As the name suggests, the idea behind this API is to reproduce the user experience of the Pandas package, with as many of its methods and operators as possible, but on very large, distributed DataFrames. Note that as this API is actively being developed, you might encounter errors with some functions. Downgrading the PySpark or Pandas version can often fix those issues.

To get started, first let's import the module pyspark.pandas:

In [None]:
#import PySpark Pandas API
#ignore the warning!




A handy way of using Pandas on Spark is by converting an actual Pandas DataFrame into a Pandas on Spark DataFrame. In this scenario, you would have a regular Pandas DataFrame, created without any calls to Spark that you wish to perform work on in a parallelized or even distributed fashion.


In [None]:
#let's create a Pandas dataframe from a CSV file and then convert it to Pandas on Spark DataFrame



In [None]:
#Let's look at the data:


In [None]:
# Now, let's create a Pandas on Spark DataFrame from a Pandas DataFrame!



Now we have a parallelized or distributed DataFrame that looks and behaves just like a regular Pandas DataFrame.

In [None]:
# Let's look at this DataFrame using some familiar methods: first head()



Next, we will go through a few examples of how to use a Pandas on Spark DataFrame. Accessing columns and rows, as well as slicing a DataFrame, work just like in Pandas.

In [None]:



# Access columns by name with two different syntaxes:





In [None]:
# Use the .iloc[] indexer to access a row by index


In [None]:
# Use conditionals to find subsets of a DataFrame that match a condition
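Since the lesson states that these access patterns work just like in Pandas, the sketch below shows them on a regular Pandas DataFrame so it runs without a Spark installation. The column names come from the exercise later in this lesson, but the values are made up. With PySpark available, the same lines are expected to work after converting with `ps.from_pandas(pdf)` (where `ps` is `pyspark.pandas`); that conversion is not run here.

```python
import pandas as pd

# Tiny invented DataFrame; only the column names come from the lesson.
pdf = pd.DataFrame({
    "weight": [40.0, 55.0, 62.0],
    "hindfoot_length": [35.0, 53.0, 50.0],
})

# Access a column by name with two different syntaxes.
by_key = pdf["weight"]
by_attr = pdf.weight

# Access a row by position with the .iloc[] indexer.
first_row = pdf.iloc[0]

# Use a condition to select the subset of rows that match it.
heavy = pdf[pdf["weight"] > 50]

print(by_key.equals(by_attr))
print(first_row)
print(heavy)
```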


In [None]:
# summary statistics



In [None]:
# location measures




In [None]:
# dispersion measures:
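The summary, location and dispersion measures from the last three cells can be sketched on a regular Pandas DataFrame, again with made-up values so the results are easy to check by hand. The same method names exist in the Pandas on Spark API, although its documentation describes `median()` as an approximate computation, so results on large distributed data may differ slightly from Pandas.

```python
import pandas as pd

# Invented values chosen so the statistics are easy to verify by hand.
pdf = pd.DataFrame({"weight": [10.0, 20.0, 30.0]})

# Summary statistics: count, mean, std, min, quartiles and max in one table.
summary = pdf.describe()

# Location measures.
mean_weight = pdf["weight"].mean()
median_weight = pdf["weight"].median()

# Dispersion measures (sample standard deviation and variance).
std_weight = pdf["weight"].std()
var_weight = pdf["weight"].var()

print(summary)
print(mean_weight, median_weight, std_weight, var_weight)
```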


Notes: 

* Parallelizing a DataFrame does not necessarily mean any arbitrary operation will run faster. In general, you can expect Pandas on Spark to outperform Pandas as the size of a DataFrame grows, even if you are running PySpark on a single node. That being said, you should always reason about scalability before choosing to parallelize work over multiple cores, or multiple nodes. See this article for more about scalability: https://docs.alliancecan.ca/wiki/Scalability

* Pandas on Spark is not a 100% perfect clone of Pandas - some Pandas functionalities have not yet been implemented, some probably never will be, and Pandas on Spark has a few features that do not exist in Pandas. See the complete API reference for more details: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/index.html

Exercise 4.1 

Use pandas on Spark API:

* Create a parallel query that finds all rows with a weight value greater than 50 and hindfoot_length larger than 52, and then calculate the summary statistics of these rows.

* Hint: You can use the where() method to apply two different conditions in your search and the dropna() method to remove rows with missing values in weight or hindfoot_length.


In [None]:
#Exercise 4.1 Solution

