<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Reading Data - JDBC Connections

**Technical Accomplishments:**
- Read Data from Relational Database

## Getting Started

Let's start by importing libraries and creating some useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.2.10,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate())
qcutils.init_spark_session(spark)

## Reading from JDBC

Working with a JDBC data source differs significantly from working with the other data sources:
* Configuration settings can be a lot more complex.
* We are often required to "register" the JDBC driver for the target database.
* We have to juggle the number of DB connections.
* We have to instruct Spark how to partition the data.

**NOTE:** The database is read-only for security reasons, so this notebook does not demonstrate writing to a JDBC database.

* For examples of writing via JDBC, see:
  * <a href="https://docs.databricks.com/spark/latest/data-sources/sql-databases.html" target="_blank">Connecting to SQL Databases using JDBC</a>
  * <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases" target="_blank">JDBC To Other Databases</a>

In [None]:
tableName = "training.people"

jdbcURL = "jdbc:postgresql://54.195.117.194/training"

connProperties = {
    "driver": "org.postgresql.Driver",
    "user": "qcro",
    "password": "qc-readonly"
}

spark.conf.set("spark.sql.shuffle.partitions", "8")

In [None]:
exampleOneDF = spark.read.jdbc(
    url=jdbcURL,
    table=tableName,
    properties=connProperties)
exampleOneDF.printSchema()

In [None]:
exampleOneDF

**Question:** Compared to CSV and even Parquet, what is missing here?

**Question:** Based on the answer to the previous question, what are the ramifications of the missing...?

**Question:** Before you run the next cell, what's your best guess as to the number of partitions?

In [None]:
print("Partitions: " + str(exampleOneDF.rdd.getNumPartitions()) )

## That's Not Parallelized

Let's try this again, and this time we are going to increase the number of connections to the database.

**Note:** *If any one of these properties is specified, they must all be specified:*
* `partitionColumn` - the name of a column of an integral type that will be used for partitioning (passed as `column` in PySpark's `spark.read.jdbc`).
* `lowerBound` - the minimum value of `partitionColumn`, used to decide the partition stride.
* `upperBound` - the maximum value of `partitionColumn`, used to decide the partition stride.
* `numPartitions` - the number of partitions/connections.

To quote the <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases" target="_blank">Spark SQL, DataFrames and Datasets Guide</a>:
> These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. `partitionColumn` must be a numeric column from the table in question. Notice that `lowerBound` and `upperBound` are just used to decide the partition stride, not for filtering the rows in a table. So all rows in the table will be partitioned and returned. This option applies only to reading.
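
As a rough illustration of how the stride turns into per-partition queries, the sketch below mimics the way Spark derives one `WHERE` clause per partition from `lowerBound`, `upperBound`, and `numPartitions`. It is a simplified model rather than Spark's actual code: the helper name `partition_predicates` is hypothetical, it assumes non-negative integer bounds and at least two partitions, and the exact arithmetic may differ between Spark versions.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one WHERE clause per partition (simplified, hypothetical helper)."""
    # Assumptions: non-negative integer bounds, num_partitions >= 2, stride >= 1.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            # NULL values are collected by the first partition.
            predicates.append(f"{upper} or {column} is null")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for predicate in partition_predicates("salary", 1, 200000, 8):
    print(predicate)
```

Because the first clause is open at the bottom and the last clause is open at the top, rows outside the bounds are still read; they simply end up in the first or last partition.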

In [None]:
exampleTwoDF = spark.read.jdbc(
  url=jdbcURL,                  # the JDBC URL
  table=tableName,              # the name of the table
  column="salary",              # the partition column (a column of an integral type)
  lowerBound=1,                 # the minimum value of the partition column, used to decide the partition stride
  upperBound=200000,            # the maximum value of the partition column, used to decide the partition stride
  numPartitions=8,              # the number of partitions/connections
  properties=connProperties)    # the connection properties

In [None]:
exampleTwoDF.count()

Let's start by checking how many partitions we have (it should be 8):

In [None]:
print("Partitions: " + str(exampleTwoDF.rdd.getNumPartitions()) )

That might be a problem... notice how many records are in the last partition?
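
To make the skew concrete, here is a small pure-Python sketch that assigns values to partitions using a simplified model of stride-based partitioning. The helper `partition_index` and the sample salaries are made up for illustration; they are not taken from the `training.people` table, and the model assumes non-negative integer bounds with a stride of at least 1.

```python
from collections import Counter


def partition_index(value, lower_bound, upper_bound, num_partitions):
    """Return the partition a value would land in (simplified, hypothetical helper)."""
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    if value is None:
        return 0  # NULLs are read by the first partition
    index = (value - lower_bound) // stride
    # Values below lowerBound go to the first partition,
    # values above upperBound go to the last one.
    return max(0, min(num_partitions - 1, index))


# Hypothetical salaries; several exceed the upperBound of 200000.
sample_salaries = [30000, 45000, 180000, 250000, 310000, 420000, None]

counts = Counter(partition_index(s, 1, 200000, 8) for s in sample_salaries)
for index in range(8):
    print(f"partition {index}: {counts[index]} rows")
```

In this made-up sample, every salary above the upper bound falls into partition 7, so that partition carries more rows than any other while most of the rest stay empty.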

**Question:** What are the performance ramifications of leaving our partitions like this?

## That's Not [Well] Distributed

This is one of the little gotchas when working with JDBC: to properly specify the stride, we need to know the minimum and maximum values of the partition column (here, `salary`).

In [None]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in min/max

# Compute both bounds of the partition column (salary) in a single query
bounds = exampleTwoDF.select(F.min("salary"), F.max("salary")).first()

# The variable names are kept as-is because the next cell refers to them
minimumID = bounds[0]  # the minimum salary
maximumID = bounds[1]  # the maximum salary

print("Minimum salary: " + str(minimumID))
print("Maximum salary: " + str(maximumID))
print("-"*80)

Now, let's try this one more time... this time with the proper stride:

In [None]:
exampleThree = spark.read.jdbc(
  url=jdbcURL,                  # the JDBC URL
  table=tableName,              # the name of the table
  column="salary",              # the partition column (a column of an integral type)
  lowerBound=minimumID,         # the minimum value of the partition column, used to decide the partition stride
  upperBound=maximumID,         # the maximum value of the partition column, used to decide the partition stride
  numPartitions=8,              # the number of partitions/connections
  properties=connProperties)    # the connection properties


And of course we can trigger the read by counting the records:

In [None]:
exampleThree.count()

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) Quantia Consulting, srl. All rights reserved.