# Connecting to JDBC

Apache Spark&trade; and Azure Databricks&reg; allow you to connect to a number of data stores using JDBC.

## Java Database Connectivity

Java Database Connectivity (JDBC) is an application programming interface (API) that defines database connections in Java environments. Spark is written in Scala, which runs on the Java Virtual Machine (JVM). This makes JDBC the preferred method for connecting to data whenever possible. Hadoop and Hive run on the JVM, and databases such as MySQL provide JDBC drivers, so all of them interface easily with Spark clusters.

In the roadmap for ETL, this is the **Extract and Validate** step:

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-Process-1.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### Recalling the Design Pattern

Recall the design pattern for connecting to data from the previous lesson:  
<br>
1. Define the connection point.
2. Define connection parameters such as access credentials.
3. Add necessary options. 

After following these steps, read data using `spark.read.option(<option key>, <option value>).<connection_type>(<endpoint>)`. The JDBC connection uses this same formula, with some added complexity over what was covered in the previous lesson.
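
As a rough sketch of how the three steps map onto a JDBC read, the helper below collects them into a plain dictionary of options. The option names `url`, `dbtable`, `user`, `password`, and `fetchsize` are options of Spark's generic JDBC data source; the function name `jdbc_read_options` and all sample values are hypothetical and used for illustration only.

```python
def jdbc_read_options(url, table, user, password, **extra):
    """Collect the three steps of the design pattern into one options dict."""
    # 1. Define the connection point.
    options = {"url": url, "dbtable": table}
    # 2. Define connection parameters such as access credentials.
    options.update({"user": user, "password": password})
    # 3. Add any further options; Spark expects option values as strings.
    options.update({key: str(value) for key, value in extra.items()})
    return options


# Hypothetical values, for illustration only.
options = jdbc_read_options(
    "jdbc:postgresql://example-host:5432/exampledb",
    "ExampleTable",
    user="example_user",
    password="example_password",
    fetchsize=1000,
)
print(options)

# With an active SparkSession, the dict could then be passed along as:
# df = spark.read.format("jdbc").options(**options).load()
```

Keeping the options in one dictionary makes it easy to inspect exactly what is sent to the data source before any connection is opened.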

Run the cell below to set up your environment.

In [5]:
%run "./Includes/Classroom-Setup"

Run the cell below to confirm you are using the right driver.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Each notebook has a default language that appears in the upper corner of the screen next to the notebook name, and you can easily switch between languages in a notebook. To change languages, start your cell with `%python`, `%scala`, `%sql`, or `%r`.

In [7]:
%scala
// run this regardless of language type
Class.forName("org.postgresql.Driver")

Define your database connection criteria. In this case, you need the hostname, port, and database name. 

Access the database `training` via port `5432` of a Postgres server sitting at the endpoint `server1.databricks.training`.

Combine the connection criteria into a URL.

In [9]:
jdbcHostname = "server1.databricks.training"
jdbcPort = 5432
jdbcDatabase = "training"

jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
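
A quick way to sanity-check a URL of this shape is to split it back into its parts. The sketch below does so with Python's standard `urllib.parse`; the helper name `parse_jdbc_url` is hypothetical, and the approach assumes the `jdbc:<subprotocol>://host:port/database` layout used in this lesson (other drivers may use different URL layouts).

```python
from urllib.parse import urlparse


def parse_jdbc_url(jdbc_url):
    """Split a jdbc:<subprotocol>://host:port/database URL into its parts."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    # Drop the "jdbc:" prefix so the remainder parses like an ordinary URL.
    parsed = urlparse(jdbc_url[len(prefix):])
    return {
        "subprotocol": parsed.scheme,
        "hostname": parsed.hostname,
        "port": parsed.port,
        "database": parsed.path.lstrip("/"),
    }


parts = parse_jdbc_url("jdbc:postgresql://server1.databricks.training:5432/training")
print(parts)
```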

Create a connection properties object with the username and password for the database.

In [11]:
connectionProps = {
  "user": "readonly",
  "password": "readonly"
}
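
The credentials above belong to a read-only training account. As a hedged alternative to writing credentials into a notebook, the sketch below builds the same properties dictionary from environment variables. The variable names `JDBC_USER` and `JDBC_PASSWORD` and the helper `connection_props_from_env` are assumptions made for this example, not part of the course setup.

```python
import os


def connection_props_from_env(user_var="JDBC_USER", password_var="JDBC_PASSWORD"):
    """Build JDBC connection properties from environment variables."""
    missing = [name for name in (user_var, password_var) if name not in os.environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {"user": os.environ[user_var], "password": os.environ[password_var]}


# Demo only: set the variables in-process so the sketch runs on its own.
os.environ.setdefault("JDBC_USER", "readonly")
os.environ.setdefault("JDBC_PASSWORD", "readonly")

envProps = connection_props_from_env()
print(sorted(envProps))  # print the keys only, never the secret values
```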

Read from the database by passing the URL, table name, and connection properties into `spark.read.jdbc()`.

In [13]:
accountDF = spark.read.jdbc(url=jdbcUrl, table="Account", properties=connectionProps)
display(accountDF)

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
711302488427778049,ILL_04A,I touched MYNAME 06/09/16,266,367,Doing Cube's job for free. «Once 4Minute is always 4Minute» Ilhoon's Lady 🐝🐝🐝 NO사랑🌸,94489285357,3153
4325772802,YoussefJooo8,"أسوان, مصر",48,17,كاائن اليوسف ✋ تعريفه: هو كائن لطيف اوووزعۃ youssef ahmed Age:19 from:red sea live:in loux 2 univiristy AAst القبطاان ☺ f.b: jooo Uwk #Zamalek Smile ツ,94489285358,3153
4805856135,enda__34,,59,106,,94489285359,3153
859844340025262081,_beci,,41,51,Life goes on 🎈,94489285360,3153
327888219,freakjawndotcom,"Philadelphia, PA",2690,3756,"FreakJawn Inc is about empowering female sex workers, educating cis men, and educating Blue/White collar women who are interested in exploring the sex culture",94489285361,3153
3436712944,Glynnchen,Deutschland,293,4746,"Untugenden werden toleriert, solange es im Verborgenen verbleibt! Selbstreflektion hat noch keinem geschadet! Versucht es mal selbst!",94489285362,3153
315250717,DGAFDanielle,,2236,11900,21. 011718👶🏻💙 Snapchat: DGAFDaniellee K.LuxeExtensions💋 DiscountCode:LuxeDollDani for 10% off your purchase🗣,94489285363,3153
879857101513695233,bbalmalnq8,,1068,1293,من لا يهتم لامرك لا تهتم لامره الحب جميل والكرامه أجمل,94489285364,3153
230866480,njwalker_,,334,570,It's pretty simple: I luv love,94489285365,3153
929946336992079872,Sanda05865047,king Williams town,287,19,,94489285366,3153


## Exercise 1: Parallelizing JDBC Connections

The command above was executed as a serial read through a single connection to the database. This works well for small data sets; at scale, parallel reads are necessary for optimal performance.

See the [Managing Parallelism](https://docs.azuredatabricks.net/spark/latest/data-sources/sql-databases.html#manage-parallelism) section of the Databricks documentation.

### Step 1: Find the Range of Values in the Data

Parallel JDBC reads entail assigning a range of values for a given partition to read from. The first step of this divide-and-conquer approach is to find the bounds of the data.
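
To make the divide-and-conquer idea concrete, the sketch below models how a lower bound, an upper bound, and a partition count can be turned into one `WHERE` predicate per partition. It is a simplified approximation of Spark's behavior, in which the bounds only set the partition stride and do not filter rows; the exact arithmetic may differ between Spark versions. The function name `partition_predicates` and the sample column `id` are hypothetical, and the sketch assumes non-negative integer bounds.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause predicate per partition (simplified model)."""
    if num_partitions < 2:
        raise ValueError("a single partition needs no predicate")
    # Width of the value range each partition is responsible for.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            # Last partition: open-ended above, so no row is dropped.
            predicates.append(lower)
        elif lower is None:
            # First partition: open-ended below and also collects NULLs.
            predicates.append(f"{upper} or {column} is null")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for predicate in partition_predicates("id", 0, 1200, 4):
    print(predicate)
```

Because the first and last predicates are open-ended, rows outside the bounds are still read; poorly chosen bounds therefore lead to skewed partitions rather than missing data.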

Calculate the range of values in the `insertID` column of `accountDF`. Save the minimum to `dfMin` and the maximum to `dfMax`.  **This should be the number itself rather than a DataFrame that contains the number.**  Use `.first()` to get a Scala or Python object.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** See the `min()` and `max()` functions in Python `pyspark.sql.functions` or Scala `org.apache.spark.sql.functions`.

In [16]:
# TODO
from pyspark.sql.functions import max, min

dfMin = accountDF.select(min("insertID")).first()[0]
dfMax = accountDF.select(max("insertID")).first()[0]

In [17]:
dfMin

In [18]:
# TEST - Run this cell to test your solution

dbTest("ET1-P-04-01-01", 0, dfMin)
dbTest("ET1-P-04-01-02", 214748365087, dfMax)

print("Tests passed!")

### Step 2: Define the Connection Parameters

<a href="https://docs.azuredatabricks.net/spark/latest/data-sources/sql-databases.html#manage-parallelism" target="_blank">Referencing the documentation,</a> define the connection parameters for this read.  Use 12 partitions.

Save the results to `accountDFParallel`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Setting the column for your parallel read introduces unexpected behavior due to a bug in Spark. To make sure Spark uses the capitalization of your column, use `'"insertID"'` for your column. <a href="https://github.com/apache/spark/pull/20370#issuecomment-359958843" target="_blank"> Monitor the issue here.</a>

In [20]:
# TODO
accountDFParallel = spark.read.jdbc(
  url = jdbcUrl,
  table = "Account",
  column = '"insertID"',
  lowerBound = dfMin,
  upperBound = dfMax,
  numPartitions = 12,
  properties = connectionProps
)

display(accountDFParallel)

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
899279937558818817,pui_kookie,ประเทศไทย,38,0,💕บุคคลผู้เเต่งฟิคบังทัน💞💞💞,17179873995,3153
947960237226467328,VictorM53727258,,121,5,,17179873996,3153
718773313468768258,ntarasia06,"Bhubaneshwar, India",19,45,,17179873997,3153
4334300787,dehavenick5,,561,620,fly with me ✈️Maddie ❤️,17179873998,3153
772105712268902404,cabellyouth,Cabello stan,1253,2180,to her. • Camila follow's,17179873999,3153
920579990743166976,I_lover_ss,샌즈 방 침대 구석!,67,63,언더테일 / 드림 / 그림 / 마음요정 / 말랑한 사람 / 미성년자 / 언더테일 스포 ⭕ / 헤더는 료시님께서 그려주신 너무 너무 예쁜 샌즈아이~~~~!!💝🌹✨/ #샌즈아이 / ❄✨내사랑 샌즈✨❄ ☞@sans_for_i 세상에서 당신만큼 어여쁜 이는 보지못했어요.,17179874000,3153
446815193,MayLoves5H,México,299,1480,I love Fifth Harmony with all my heart💞 #PSATourMonterrey 11/10/2017 ❤❤❤❤ Dinah follows me 14/11/2017,17179874001,3153
2998459529,giimarote_,"São Paulo, Brasil",180,1323,Vivo pelo campeão dos campeões - NÃO AO FUTEBOL MODERNO ATÉ APÓS A MORTE!,17179874002,3153
741465963351158784,barrildemil,"Salvador, Brasil",989,1787,"Um paradoxo do pretérito imperfeito, complexo com a teoria da relatividade. Fluente em errar na vida. Flamenguista.",17179874003,3153
947939851537670144,TeBancoMuchoFV,Argentina,509,123,"A la gente buena les pasa cosas buenas, @flor_vigna ❤ #PaulaAmoedoAlBailando2018",17179874004,3153


In [21]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-04-02-01", 12, accountDFParallel.rdd.getNumPartitions())

print("Tests passed!")

### Step 3: Compare the Serial and Parallel Reads

Compare the two reads with the `%timeit` magic command.

Display the number of partitions in each DataFrame by running the following:

In [24]:
print(accountDF.rdd.getNumPartitions())
print(accountDFParallel.rdd.getNumPartitions())

Invoke `%timeit` followed by a call to `.describe()`, which computes summary statistics, on both `accountDF` and `accountDFParallel`.

In [26]:
%timeit accountDF.describe()

In [27]:
%timeit accountDFParallel.describe()

What is the difference between serial and parallel reads?  Note that your results can vary drastically depending on the cluster and the number of partitions you use.
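
Outside a notebook, where the `%timeit` magic is not available, a comparable measurement can be sketched with the standard library. The helper `best_time` below is a hypothetical name; it simply runs a callable a few times and keeps the fastest wall-clock time. The stand-in workload is there only so the sketch runs without a cluster.

```python
import time


def best_time(action, repeats=3):
    """Run `action` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload so the sketch runs without a cluster.
elapsed = best_time(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.6f} s")

# On a cluster, the two reads could be compared along these lines:
# serial = best_time(lambda: accountDF.describe())
# parallel = best_time(lambda: accountDFParallel.describe())
```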

## Review

**Question:** What is JDBC?  
**Answer:** JDBC stands for Java Database Connectivity, and is a Java API for connecting to databases such as MySQL, Hive, and other data stores.

**Question:** How does Spark read from a JDBC connection by default?  
**Answer:** With a serial read.  With additional specifications, Spark conducts a faster, parallel read.  Parallel reads take full advantage of Spark's distributed architecture.

**Question:** What is the general design pattern for connecting to your data?  
**Answer:** The general design pattern is as follows:
1. Define the connection point
2. Define connection parameters such as access credentials
3. Add necessary options such as for headers or parallelization

## Next Steps

Start the next lesson, [Applying Schemas to JSON Data]($./05-Applying-Schemas-to-JSON-Data ).

## Additional Topics & Resources

**Q:** My tool can't connect via JDBC.  Can I connect via <a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity" target="_blank">ODBC instead</a>?  
**A:** Yes.  The best practice is generally to use JDBC connections wherever possible since Spark runs on the JVM.  In cases where JDBC is either not supported or is less performant, use the Simba ODBC driver instead.  See <a href="https://docs.azuredatabricks.net/user-guide/bi/jdbc-odbc-bi.html#step-1-download-and-install-a-jdbc-odbc-driver" target="_blank">the Databricks documentation on connecting BI tools</a> for more details.