# PySpark Tutorial - Dataframes
<div>
 <h2> CSCI 4283 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

Up to now, we've seen the RDD interface to PySpark. The RDD is a building block for more capable data structures such as the **dataframe** and **Dataset**. These data structures are part of the [PySpark SQL library](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html) which, as the name implies, is influenced by standard SQL practices and queries.

The PySpark library has the **dataframe API**, but it does not support the **Dataset API** -- that's only accessible via the Scala and Java libraries and through SQL queries.

The **Dataset** is effectively an SQL relation -- i.e. rows and columns with a specific schema. The **dataframe** takes this a little further and constructs a labeled table similar to the [Python Pandas](https://pandas.pydata.org/) interface or the [R dplyr](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) interface.

In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import numpy as np
import pandas as pd
import operator

We're going to use an airline information database as the example. You can download extended versions of the database [at this Dept. of Transportation website](https://www.transtats.bts.gov/DL_SelectFields.asp), but the data we're using is distributed with the course notes.

As with the RDD interface, we need a "context" to a remote machine. The [Spark SQL tutorial](https://spark.apache.org/docs/latest/sql-getting-started.html) has some information on this, but for complete information you need to look at the [Spark API documentation.](https://spark.apache.org/docs/latest/api/python/)

In this example, we're creating a local session (i.e. using the CPUs on JupyterHub).

In [None]:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .master("local[*]")\
    .getOrCreate()

There are many ways to load data, including HDFS, a format called [Parquet](http://parquet.apache.org/), CSV files and so on. We'll use a compressed CSV file of the airline data.

In [None]:
flights = spark.read.load('airline-ontime-reporting.csv.gz',
            format="csv", sep=",", header=True,
            compression="gzip",
            inferSchema="true")

The dataframe has a **schema**, i.e. a type for each column. All entries in a column must have the same type or we'll see operations fail. In this example, we have asked that the schema be inferred -- this usually works, but if it doesn't we may need to take some extra steps (see below).

In [None]:
flights.printSchema()

In [None]:
flights.columns

In [None]:
print("There are", len(flights.columns), "columns and ", flights.count(), "rows")

In [None]:
flights.dtypes

The schema is inferred, but it can also be defined explicitly.
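
As a rough illustration of what an explicit schema buys us, here is a plain-Python sketch (not the Spark API; in PySpark the reader accepts a `schema` argument, typically built from `pyspark.sql.types.StructType`). It declares one type per column and applies it while parsing. The sample CSV text, the `DEP_DELAY` column, and the names `SCHEMA` and `parse_with_schema` are made up for illustration.

```python
import csv
import io

# Hypothetical three-column extract; the real file has many more columns.
SAMPLE_CSV = """YEAR,ORIGIN,DEP_DELAY
2019,DEN,-3.0
2019,ORD,
2019,SFO,12.0
"""

# An explicit schema: one (name, converter) pair per column, in file order.
SCHEMA = [("YEAR", int), ("ORIGIN", str), ("DEP_DELAY", float)]


def parse_with_schema(text, schema):
    """Parse CSV text, casting each field with its declared converter.

    Empty fields become None, so a missing value never changes a column's type.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        row = {}
        for name, convert in schema:
            raw = record[name]
            row[name] = convert(raw) if raw != "" else None
        rows.append(row)
    return rows


rows = parse_with_schema(SAMPLE_CSV, SCHEMA)
for row in rows:
    print(row)
```

Because the types are stated up front, nothing has to be guessed from the data, and a column with missing values keeps its declared type.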

Note that one column is labeled `_c23`, which is showing up as "null". Perhaps this is bad data import?

Let's look at some of the values.

In [None]:
flights.show(5, truncate=False)

Let's pull out the values in one column -- the `select` method can be used to produce a new dataframe with just that column as an entry.

In [None]:
flights.select('_c23').show(5)

And we can slice out multiple columns, similar to Pandas. Again, this produces a new dataframe.

In [None]:
flights.select(['year', '_c23']).show(5)

Alternatively, we can produce a [Column object which has its own methods](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=column#pyspark.sql.Column). These are typically used in **column expressions** that produce conditions that can be used when selecting or filtering data.

For example, let's find all the rows where the mystery `_c23` column is not null.

In [None]:
flights.filter( flights._c23.isNotNull()).show(5)

Hmm... It looks like all the values are null. We could confirm this by selecting the column and looking at the distinct elements.

In [None]:
flights.select('_c23').distinct().show()

Since this column is entirely null, let's just drop it.

In [None]:
newFlights = flights.drop('_c23')

In [None]:
newFlights.show(5)

We often work with multiple columns of data in a dataframe. Some methods just use column names (`corr`, `cov`, `crosstab`, `describe`) and others can use column references, such as `newFlights.ORIGIN`.

There are also a number of methods that work on columns or column expressions -- we've been using `select` already.

* `cube(*cols)`: column names (string) or column expressions or **both**.
* `drop(*cols)`: ***a list of column names OR a single column expression.***
* `groupBy(*cols)`: column name (string) or column expression or **both**.
* `rollup(*cols)`: column name (string) or column expression or **both**.
* `select(*cols)`: column name (string) or column expression or **both**.
* `sort(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `sortWithinPartitions(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `orderBy(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `sampleBy(col, fractions, seed=None)`: a column name.
* `toDF(*cols)`: **a list of column names (string).**
* `withColumn(colName, col)`: `colName` refers to column name; `col` refers to a column expression.
* `withColumnRenamed(existing, new)`: takes column names as arguments.
* `filter(condition)`: **condition** refers to a column expression that returns `types.BooleanType` values.
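
The difference between a column name and a column expression can be sketched in plain Python. The `Col` class below is a hypothetical stand-in, not `pyspark.sql.Column`: the point is only that comparing a column to a value does not produce `True` or `False` immediately, but a deferred per-row condition that `filter` evaluates later. The sample rows are made up.

```python
class Col:
    """A deferred reference to a column (illustrative, not the Spark class)."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Comparing returns a per-row predicate rather than a boolean.
        return lambda row: row[self.name] == value

    def isNotNull(self):
        return lambda row: row[self.name] is not None


def select(rows, *cols):
    """Accept column names (str), Col objects, or a mix of both."""
    names = [c.name if isinstance(c, Col) else c for c in cols]
    return [{n: row[n] for n in names} for row in rows]


def filter_rows(rows, condition):
    """Keep the rows for which the column expression evaluates to True."""
    return [row for row in rows if condition(row)]


flights = [
    {"ORIGIN": "DEN", "DEST": "ORD", "_c23": None},
    {"ORIGIN": "SFO", "DEST": "DEN", "_c23": None},
    {"ORIGIN": "DEN", "DEST": "SFO", "_c23": None},
]

from_denver = filter_rows(flights, Col("ORIGIN") == "DEN")
print(select(from_denver, "ORIGIN", Col("DEST")))
print(filter_rows(flights, Col("_c23").isNotNull()))
```

A method that only accepts names has nothing to evaluate per row, which is why methods such as `filter` need an expression while `withColumnRenamed` takes plain strings.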

In [None]:
newFlights.groupBy(newFlights.ORIGIN).count().collect()

In [None]:
newFlights.filter(newFlights.ORIGIN == 'DEN' ).show(5)

## Doing Joins

Again, everything boils down to a join in "big data". We can do joins between two dataframes much as in Pandas. Let's load a second dataframe that contains airline identifiers.

In [None]:
airlines = spark.read.load('unique-carriers.csv.gz',
            format="csv", sep=",", header=True,
            compression="gzip",
            inferSchema="true")

In [None]:
airlines.show(5)

Our flights data also has carrier information in the `OP_UNIQUE_CARRIER` column. Let's list the distinct values by selecting that column, reducing it to its distinct values and then showing them.

In [None]:
flights.select('OP_UNIQUE_CARRIER').distinct().show()

Now, let's join the airlines `Code`  with the flights `OP_UNIQUE_CARRIER`. This will result in data like the `flights` data but with two additional columns, `Code` (the join key) and `Description` (the full airline name).

In [None]:
flights.join(airlines, airlines.Code == flights.OP_UNIQUE_CARRIER).show(5)

From here, you could *e.g.* pull out all of the Denver to Chicago flights and list them by the airline name, *etc, etc*.
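
To make concrete what the join followed by a filter and a sort computes, here is a plain-Python sketch on a handful of made-up rows. It uses a simple hash join, which is one way to compute an inner join; Spark's query planner may well choose a different strategy. The column names follow the two files above, while the row values, the airline descriptions, and the choice of `ORD` to stand for Chicago are assumptions.

```python
# Made-up sample rows; column names follow the tutorial's two files.
flights = [
    {"OP_UNIQUE_CARRIER": "UA", "ORIGIN": "DEN", "DEST": "ORD"},
    {"OP_UNIQUE_CARRIER": "WN", "ORIGIN": "DEN", "DEST": "MDW"},
    {"OP_UNIQUE_CARRIER": "AA", "ORIGIN": "DEN", "DEST": "ORD"},
    {"OP_UNIQUE_CARRIER": "UA", "ORIGIN": "SFO", "DEST": "DEN"},
    {"OP_UNIQUE_CARRIER": "ZZ", "ORIGIN": "DEN", "DEST": "ORD"},  # no matching airline
]
airlines = [
    {"Code": "AA", "Description": "American Airlines Inc."},
    {"Code": "UA", "Description": "United Air Lines Inc."},
    {"Code": "WN", "Description": "Southwest Airlines Co."},
]


def inner_join(left, right, left_key, right_key):
    """Hash join: index the right side by key, then probe with each left row."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[right_key], []).append(right_row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[left_key], []):
            joined.append({**left_row, **right_row})  # columns of both sides
    return joined


joined = inner_join(flights, airlines, "OP_UNIQUE_CARRIER", "Code")

# Assumption: "Chicago" means O'Hare (ORD) only; Midway (MDW) is excluded here.
den_to_ord = [f for f in joined if f["ORIGIN"] == "DEN" and f["DEST"] == "ORD"]
by_airline = sorted(den_to_ord, key=lambda f: f["Description"])

for f in by_airline:
    print(f["Description"], f["ORIGIN"], f["DEST"])
```

Note that the flight with the unknown carrier code `ZZ` disappears from the result: an inner join keeps only rows with a match on both sides.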

## Escape back into the world of RDDs

A dataframe (or Dataset) is just a collection of `Row` objects. You can pull out the rows as an RDD and then operate on them, much as we did before.

In [None]:
flights.rdd.filter(lambda x: x['DEST'] == 'DEN').take(5)

Spark will attempt to infer the types of the data, but it's not always successful. By default, it looks only at the first rows of the RDD (up to 100) to determine the types. This may fail, as shown below:

In [None]:
onlyDen = spark.createDataFrame(flights.rdd.filter(lambda x: x['DEST'] == 'DEN'))

The solution is to sample the data randomly -- here we're going to sample 50% of the data to determine the types:

In [None]:
onlyDen = spark.createDataFrame(flights.rdd.filter(lambda x: x['DEST'] == 'DEN'), 
                                samplingRatio=0.5)

And again, the resulting data is a `Row` type:

In [None]:
onlyDen.take(3)
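
Why sampling can succeed where looking at the leading rows fails can be sketched in plain Python. This is a simplified model and not Spark's actual inference code: a field's type is taken from the first non-null value seen, and inference fails if a field is null in every row examined. The `CANCELLATION_CODE` column and all values are made up for illustration.

```python
import random


def infer_types(rows):
    """Return {field: type name}, using the first non-None value per field."""
    types = {}
    fields = []
    for row in rows:
        for field, value in row.items():
            if field not in fields:
                fields.append(field)
            if value is not None and field not in types:
                types[field] = type(value).__name__
    missing = [f for f in fields if f not in types]
    if missing:
        raise ValueError(f"cannot determine type of {missing}")
    return types


def infer_from_head(rows, n=100):
    """Look only at the first n rows."""
    return infer_types(rows[:n])


def infer_from_sample(rows, ratio, seed):
    """Look at a random fraction of all rows (seeded for repeatability)."""
    rng = random.Random(seed)
    sample = [row for row in rows if rng.random() < ratio]
    return infer_types(sample)


# Made-up data: the second column is null for the first 150 rows.
rows = [{"DEST": "DEN", "CANCELLATION_CODE": None} for _ in range(150)]
rows += [{"DEST": "DEN", "CANCELLATION_CODE": "B"} for _ in range(150)]

try:
    infer_from_head(rows)
except ValueError as err:
    print("head failed:", err)

print(infer_from_sample(rows, ratio=0.5, seed=0))
```

Sampling only raises the odds of seeing a non-null value; it does not guarantee one. Passing an explicit schema, for example reusing `flights.schema` in the call to `spark.createDataFrame`, should avoid inference altogether.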

## Using SQL

It's clear that the Dataframe methods provide operations similar to those of SQL but in a more procedural or imperative form.

PySpark also has an SQL wrapper that lets us convert a `DataFrame` into an SQL relational table.

In [None]:
from pyspark.sql import SQLContext

sqlContext = SQLContext( spark.sparkContext )

In [None]:
sqlContext.registerDataFrameAsTable(onlyDen, "onlyDen")

In [None]:
sqlContext.registerDataFrameAsTable(flights, "flights")

From there, we can do SQL queries and a query planner will construct the series of operations needed.

In [None]:
sqlContext.sql("SELECT COUNT(*) from onlyDen").show(5)

In [None]:
sqlContext.sql("SELECT COUNT(*) from flights WHERE DEST='DEN'").show(5)
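
The equivalence between the declarative SQL query and the filter-then-count form can be checked on a toy table. The sketch below uses Python's built-in `sqlite3` module rather than Spark, purely so it runs anywhere; the rows are made up and only the `DEST` column name comes from the data above.

```python
import sqlite3

# Made-up rows; only the DEST column name comes from the tutorial.
rows = [("DEN",), ("ORD",), ("DEN",), ("SFO",), ("DEN",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (DEST TEXT)")
conn.executemany("INSERT INTO flights VALUES (?)", rows)

# Declarative: state what we want and let the engine plan the work.
(sql_count,) = conn.execute(
    "SELECT COUNT(*) FROM flights WHERE DEST = 'DEN'"
).fetchone()

# Imperative: filter, then count, as with the dataframe methods.
filtered_count = len([dest for (dest,) in rows if dest == "DEN"])

print(sql_count, filtered_count)
conn.close()
```

In Spark itself, the corresponding pair would be `flights.filter(flights.DEST == 'DEN').count()` and the SQL query above. Newer Spark releases also expose table registration and querying through `DataFrame.createOrReplaceTempView` and `SparkSession.sql`.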