In [4]:
%run "./Includes/Classroom-Setup"

## Java Database Connectivity

Java Database Connectivity (JDBC) is an application programming interface (API) that defines how Java applications connect to databases.  Spark is written in Scala, which runs on the Java Virtual Machine (JVM), so JDBC is the preferred way to connect Spark to a database whenever a JDBC driver is available.  Hadoop and Hive run on the JVM, and relational databases such as MySQL and PostgreSQL provide JDBC drivers, so all of them interface easily with Spark clusters.

Databases are mature technologies that benefit from decades of research and development. To leverage the inherent efficiencies of database engines, Spark uses an optimization called predicate pushdown.  **Predicate pushdown uses the database itself to handle certain parts of a query (the predicates).**  In mathematics and functional programming, a predicate is a function that returns a Boolean.  In SQL terms, this most often means the conditions in a `WHERE` clause.  Because the database filters the data before it reaches the Spark cluster, less data crosses the network and Spark has fewer records to process.  Spark's Catalyst Optimizer pushes eligible predicates down to the database through the JDBC data source, which makes JDBC an ideal data source for Spark workloads.

In the road map for ETL, this lesson covers the **Extract and Validate** step.
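
The effect of pushing a predicate down can be sketched without a cluster. The following minimal example uses Python's built-in `sqlite3` module as a stand-in for the remote database; the `people` table, its synthetic salaries, and the `120000` threshold are illustrative assumptions, not part of the course data. It compares pulling every row and filtering on the client against letting the database evaluate the `WHERE` clause.

```python
import sqlite3

# In-memory SQLite database standing in for a remote database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, salary INTEGER)")
conn.executemany(
    "INSERT INTO people (id, salary) VALUES (?, ?)",
    [(i, 30000 + 1000 * i) for i in range(1, 101)],  # 100 synthetic rows
)

# No pushdown: every row is transferred, then the client applies the filter.
all_rows = conn.execute("SELECT id, salary FROM people ORDER BY id").fetchall()
client_filtered = [row for row in all_rows if row[1] > 120000]

# Pushdown: the database evaluates the predicate and returns only matching rows.
pushed_down = conn.execute(
    "SELECT id, salary FROM people WHERE salary > ? ORDER BY id", (120000,)
).fetchall()

print("Rows transferred without pushdown:", len(all_rows))
print("Rows transferred with pushdown:", len(pushed_down))
print("Same result either way:", client_filtered == pushed_down)
conn.close()
```

Both approaches return the same rows, but the second transfers far fewer of them. In Spark, calling `.explain()` on a filtered JDBC DataFrame is one way to check which filters were handed to the source.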

### Recalling the Design Pattern

Recall the design pattern for connecting to data from the previous lesson:  
<br>
1. Define the connection point.
2. Define connection parameters such as access credentials.
3. Add necessary options. 

With these in place, read data using `spark.read.option(<option key>, <option value>).<connection_type>(<endpoint>)`.  The JDBC connection uses this same formula, with some added complexity over what was covered in the previous lesson.
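
As a minimal sketch of this formula applied to JDBC, the helper below chains `.option()` calls on the generic reader. The function name `read_jdbc_table` is hypothetical, and it assumes an active `spark` session is passed in; `url`, `dbtable`, `user`, and `password` are option keys of Spark's JDBC data source.

```python
def read_jdbc_table(spark, url, table, user, password):
    """Read one table over JDBC using the generic option-based reader.

    `spark` is assumed to be an active SparkSession. This helper is
    illustrative only and is not part of the course code.
    """
    return (
        spark.read.format("jdbc")      # connection type
        .option("url", url)            # connection point
        .option("dbtable", table)      # table to read
        .option("user", user)          # connection parameters (credentials)
        .option("password", password)
        .load()
    )
```

Given a JDBC URL, a table name, and credentials, this is intended to return the same DataFrame as the `spark.read.jdbc()` shortcut used in the rest of this lesson.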

Each notebook has a default language, which appears in the upper corner of the screen next to the notebook name, and you can easily switch between languages within a notebook. To change languages, start your cell with `%python`, `%scala`, `%sql`, or `%r`.

In [10]:
%scala
// Run this cell regardless of the notebook's default language; it loads the PostgreSQL JDBC driver class
Class.forName("org.postgresql.Driver")

Define your database connection criteria. In this case, you need the hostname, port, and database name. 

Access the database `training` via port `5432` of a Postgres server sitting at the endpoint `server1.databricks.training`.

Combine the connection criteria into a URL.

In [12]:
jdbcHostname = "server1.databricks.training"
jdbcPort = 5432
jdbcDatabase = "training"

jdbcUrl = f"jdbc:postgresql://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}"

Create a connection properties object with the username and password for the database.

In [14]:
connectionProps = {
  "user": "readonly",
  "password": "readonly"
}

Read from the database by passing the URL, table name, and connection properties into `spark.read.jdbc()`.

In [16]:
tableName = "training.people_1m"

peopleDF = spark.read.jdbc(url=jdbcUrl, table=tableName, properties=connectionProps)

display(peopleDF)

id,firstName,middleName,lastName,gender,birthDate,ssn,salary
430610,Serita,Tiesha,Fitzhenry,F,1972-11-10T00:00:00.000+0000,968-52-6254,42593
430611,Youlanda,Cathryn,Pebworth,F,1966-09-21T00:00:00.000+0000,998-42-7003,97924
430612,Corrine,Shiela,Toderbrugge,F,1979-12-31T00:00:00.000+0000,941-54-1955,91701
430613,Ardelia,Claudie,Sprowles,F,1971-09-04T00:00:00.000+0000,905-42-3995,94550
430614,Ofelia,Paulette,Van Hesteren,F,1986-02-03T00:00:00.000+0000,975-38-5761,35691
430615,Virgina,Troy,Bartosiak,F,1988-01-15T00:00:00.000+0000,974-76-1639,77640
430616,Celesta,Mitsuko,Crackel,F,1988-10-16T00:00:00.000+0000,956-75-3484,85105
430617,Calandra,Lila,Ickovicz,F,1956-09-02T00:00:00.000+0000,916-67-9285,83124
430618,Janette,Ardath,Coxon,F,1980-03-21T00:00:00.000+0000,925-32-3624,92907
430619,Donna,Jasmine,Sisley,F,1981-08-01T00:00:00.000+0000,916-67-4629,87639


## Exercise 1: Parallelizing JDBC Connections

The command above was executed as a serial read through a single connection to the database. This works well for small data sets; at scale, parallel reads are necessary for optimal performance.

See the [Managing Parallelism](https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#managing-parallelism) section of the Databricks documentation.

### Step 1: Find the Range of Values in the Data

A parallel JDBC read assigns each partition a range of values to read. The first step of this divide-and-conquer approach is to find the bounds of the data.
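
To make the divide-and-conquer idea concrete, the sketch below approximates how a numeric range can be split into one `WHERE` clause per partition. It is a simplified illustration and not Spark's actual implementation: the function name `partition_predicates` is hypothetical, the bounds are assumed to be non-negative integers, and the exact stride arithmetic in Spark differs between versions.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one SQL predicate per partition for a numeric column.

    Simplified sketch: assumes non-negative integer bounds. The bounds
    only set the stride; the first and last partitions are open-ended,
    so no rows are filtered out of the table.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower and upper:
            predicates.append(f"{lower} AND {upper}")
        elif upper:
            # The first partition also picks up NULLs and values below the lower bound.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif lower:
            # The last partition also picks up values above the upper bound.
            predicates.append(lower)
        else:
            predicates.append("TRUE")  # single partition: read everything
    return predicates


# Hypothetical bounds chosen for readability, not taken from the course data.
for predicate in partition_predicates("id", 1, 1000, 4):
    print(predicate)
```

Each predicate becomes a separate query over its own connection. Because the outer partitions are open-ended, bounds that are too narrow do not drop rows; they only make the first and last partitions larger than the rest.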

Calculate the range of values in the `id` column of `peopleDF`. Save the minimum to `dfMin` and the maximum to `dfMax`.  **This should be the number itself rather than a DataFrame that contains the number.**  Use `.first()` to get a Scala or Python object.

In [19]:
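from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in min and max

# .first() returns a Row; index 0 extracts the plain number
dfMin = peopleDF.select(F.min("id")).first()[0]
dfMax = peopleDF.select(F.max("id")).first()[0]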

In [20]:
# TEST - Run this cell to test your solution

dbTest("ET1-P-04-01-01", 1, dfMin)
dbTest("ET1-P-04-01-02", 1000000, dfMax)

print("Tests passed!")

### Step 2: Define the Connection Parameters

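Read the same table again, this time partitioning the read on the `id` column between `dfMin` and `dfMax`. Use 8 partitions.

Assign the results to `peopleDFParallel`.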

In [22]:
peopleDFParallel = spark.read.jdbc(
  url=jdbcUrl,                    # the JDBC URL
  table="training.people_1m",     # the name of the table
  column="id",                    # an integral column used to split the read into partitions
  lowerBound=dfMin,               # lower end of the range; sets the partition stride and does not filter rows
  upperBound=dfMax,               # upper end of the range; sets the partition stride and does not filter rows
  numPartitions=8,                # the number of partitions, and so the maximum number of concurrent connections
  properties=connectionProps      # the connection properties
)

display(peopleDFParallel)

id,firstName,middleName,lastName,gender,birthDate,ssn,salary
1,Lydia,Ula,Rubinowicz,F,1997-02-02T00:00:00.000+0000,927-54-8759,70110
2,Diamond,Carletta,Melesk,F,1984-10-21T00:00:00.000+0000,939-18-5247,74024
3,Yen,Julienne,Recher,F,1988-11-24T00:00:00.000+0000,929-26-8667,83619
4,Mallie,Albertina,Icom,F,1997-03-17T00:00:00.000+0000,921-87-2459,84369
5,Neda,Adele,Sansam,F,1997-05-25T00:00:00.000+0000,948-60-9586,63300
6,Brittaney,Marisela,Ingerfield,F,1966-06-20T00:00:00.000+0000,921-43-9011,84172
7,Annetta,Jenny,Ghiroldi,F,1987-01-22T00:00:00.000+0000,997-84-2238,79905
8,Jinny,Ethel,Tunno,F,1970-04-23T00:00:00.000+0000,909-25-2848,113081
9,Sherise,Lorita,McArte,F,1956-02-27T00:00:00.000+0000,914-17-3474,107826
10,Hilaria,Samira,Dana,F,1993-08-08T00:00:00.000+0000,940-79-2466,72104


In [23]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-04-02-01", 8, peopleDFParallel.rdd.getNumPartitions())

print("Tests passed!")

### Step 3: Compare the Serial and Parallel Reads

Compare the two reads with the `%timeit` magic command.

Display the number of partitions in each DataFrame by running the following:

In [26]:
print("Serial read partitions:", peopleDF.rdd.getNumPartitions())
print("Parallel read partitions:", peopleDFParallel.rdd.getNumPartitions())

Run `%timeit` on a call to `.describe()`, which computes summary statistics, for both `peopleDF` and `peopleDFParallel`.

In [28]:
%timeit peopleDF.describe()
%timeit peopleDFParallel.describe()

What is the difference between serial and parallel reads?  Note that your results can vary drastically depending on the cluster and the number of partitions you use.

## Review

**Question:** What is JDBC?  
**Answer:** JDBC stands for Java Database Connectivity. It is a Java API for connecting to databases and other data stores, such as MySQL and Hive.

**Question:** How does Spark read from a JDBC connection by default?  
**Answer:** With a serial read through a single connection.  Given a partition column, lower and upper bounds, and a number of partitions, Spark conducts a parallel read instead.  Parallel reads take full advantage of Spark's distributed architecture.

**Question:** What is the general design pattern for connecting to your data?  
**Answer:** The general design pattern is as follows:
1. Define the connection point
2. Define connection parameters such as access credentials
3. Add necessary options, such as those for headers or parallelization

In [32]:
%run "./Includes/Classroom-Cleanup"