
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# 3.3 Accessing Data

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:<br>

* Read data from a BLOB store
* Read data in serial from JDBC
* Read data in parallel from JDBC

In [0]:
%run ../Includes/Classroom-Setup

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) DBFS Mounts and S3

Amazon Simple Storage Service (S3) is the backbone of Databricks workflows.  S3 offers data storage that easily scales to the demands of most data applications and, by colocating data with Spark clusters, Databricks quickly reads from and writes to S3 in a distributed manner.

The Databricks File System, or DBFS, is a layer over S3 that allows you to [mount S3 buckets](https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3), making them available to other users in your workspace and persisting the data after a cluster is shut down.

In Azure Databricks, DBFS is backed by Azure Blob Storage. More documentation can be found [here](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage).

In the first lesson, you uploaded data using the Databricks user interface.  This can be done by clicking Data on the left-hand side of the screen.  Here we'll be mounting data.

Define your AWS credentials.  The cell below defines read-only keys, the name of an AWS bucket, and the mount name we will refer to in DBFS.

To get AWS keys, take a look at <a href="https://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html" target="_blank">the AWS documentation</a>.

In [0]:
%python
ACCESS_KEY = "AKIAJBRYNXGHORDHZB4A"
# Encode the Secret Key to remove any "/" characters
SECRET_KEY = "a0BzE1bSegfydr3%2FGE3LSPM6uIV5A4hOUfpH8aFF".replace("/", "%2F")
AWS_BUCKET_NAME = "davis-dsv1071/data/"
MOUNT_NAME = "/mnt/davis-tmp"
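The `.replace("/", "%2F")` call above handles only the `/` character. As a more general sketch, Python's standard `urllib.parse.quote` percent-encodes every reserved character when `safe=""` is passed. The key below is a made-up placeholder, not a real credential.

```python
from urllib.parse import quote

# Hypothetical secret key, used only for illustration
raw_secret_key = "abc/def+ghi"

# safe="" tells quote() to encode "/" as well, which it leaves alone by default
encoded_secret_key = quote(raw_secret_key, safe="")

print(encoded_secret_key)
```

The encoded value can then be placed in the `s3a://` URI in the same position as `SECRET_KEY`.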

Now mount the bucket [using the template provided in the docs.](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html#mounting-an-s3-bucket)

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The code below includes error-handling logic for the case where the bucket is already mounted.

In [0]:
%python
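try:
  dbutils.fs.mount("s3a://{}:{}@{}".format(ACCESS_KEY, SECRET_KEY, AWS_BUCKET_NAME), MOUNT_NAME)
except Exception as e:
  # The mount fails if the mount point is already in use; print the underlying error in case the cause is something else
  print("""Could not mount {}. If it is already mounted, run `dbutils.fs.unmount("{}")` first.\n{}""".format(MOUNT_NAME, MOUNT_NAME, e))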
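Rather than relying on the exception, one alternative is to check the existing mounts first. The sketch below assumes the entries returned by `dbutils.fs.mounts()` expose a `mountPoint` attribute, as the `%fs mounts` output later in this notebook suggests. The `MountInfo` tuple and the `is_mounted` helper are hypothetical stand-ins so the logic can run outside Databricks.

```python
from collections import namedtuple

# Stand-in for the objects returned by dbutils.fs.mounts() (assumed shape)
MountInfo = namedtuple("MountInfo", ["mountPoint", "source", "encryptionType"])

def is_mounted(mount_point, mounts):
    """Return True if mount_point already appears in the list of mounts."""
    target = mount_point.rstrip("/")
    return any(m.mountPoint.rstrip("/") == target for m in mounts)

# Sample data mirroring mounts that appear in this notebook
sample_mounts = [
    MountInfo("/mnt/davis", "s3a://davis-dsv1071/data", ""),
    MountInfo("/databricks-datasets", "databricks-datasets", "sse-s3"),
]

print(is_mounted("/mnt/davis", sample_mounts))
print(is_mounted("/mnt/davis-tmp", sample_mounts))
```

In a notebook, the call would look like `is_mounted(MOUNT_NAME, dbutils.fs.mounts())`, with the mount performed only when it returns `False`.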

Next, explore the mount using `%fs ls` and the name of the mount.

In [0]:
%fs ls /mnt/davis-tmp

path,name,size
dbfs:/mnt/davis-tmp/dataframes/,dataframes/,0
dbfs:/mnt/davis-tmp/fire-calls/,fire-calls/,0
dbfs:/mnt/davis-tmp/fire-incidents/,fire-incidents/,0


In practice, always secure your AWS credentials.  Do this either by maintaining a single notebook with restricted permissions that holds the AWS keys, or by deleting the cells or notebooks that expose the keys. After the cell that mounts a bucket has run, you can access the data at this mount point from any notebook on any cluster in Databricks, and share the mount with colleagues.

In [0]:
%fs mounts

mountPoint,source,encryptionType
/mnt/davis,s3a://davis-dsv1071/data,
/databricks-datasets,databricks-datasets,sse-s3
/databricks/mlflow-tracking,databricks/mlflow-tracking,sse-s3
/databricks-results,databricks-results,sse-s3
/mnt/davis-tmp,s3a://davis-dsv1071/data/,
/databricks/mlflow-registry,databricks/mlflow-registry,sse-s3
/,DatabricksRoot,sse-s3


You now have access to a virtually unlimited store of data.  This particular S3 bucket is read-only.  If you create and mount your own, you can write to it as well.

Now unmount the data.

In [0]:
%python
dbutils.fs.unmount("/mnt/davis-tmp")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Serial JDBC Reads

Java Database Connectivity (JDBC) is an application programming interface (API) that defines how clients connect to databases from Java environments.  Spark is written in Scala, which runs on the Java Virtual Machine (JVM), so JDBC is the preferred method for connecting to databases whenever possible. Hadoop and Hive run on the JVM, and databases such as MySQL and PostgreSQL provide JDBC drivers, so all of them interface easily with Spark clusters.

Databases are advanced technologies that benefit from decades of research and development. To leverage the inherent efficiencies of database engines, Spark uses an optimization called predicate push down.  **Predicate push down uses the database itself to handle certain parts of a query (the predicates).**  In mathematics and functional programming, a predicate is anything that returns a Boolean.  In SQL terms, this often refers to the `WHERE` clause.  Since the database is filtering data before it arrives on the Spark cluster, there's less data transfer across the network and fewer records for Spark to process.  Spark's Catalyst Optimizer includes predicate push down communicated through the JDBC API, making JDBC an ideal data source for Spark workloads.

Run the cell below to confirm that the PostgreSQL JDBC driver is available on the cluster.

In [0]:
%scala
Class.forName("org.postgresql.Driver")

First, create a table backed by the JDBC connection.  The `url` option holds the JDBC connection string.

In [0]:
%sql
DROP TABLE IF EXISTS twitterJDBC;

CREATE TABLE IF NOT EXISTS twitterJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "training.Account"
)

We now have a Twitter table.

In [0]:
%sql
SELECT * FROM twitterJDBC LIMIT 10

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
829933491664015360,zawakota0723,福岡市内,87,79,和白丘→水産 高３本田翼のファン 本田翼一推し歴4年優しく接してやってください｡ 気軽にフォローどうぞフォロバしますhttp://Instagram.com/fukuzawakota,343597397447,3153
3465776834,bcksojin86,,52,10,,343597397448,3153
450774034,REGULIA,,811,79,20↑日本のマンガ文化・ゲームが大好きな中国人です－！簡単な日本語なら喋れる😃「ヘキカ_エン」の名義で本国のアニメ制作進行を務めています😝好きな物と仕事関連なことをつぶやく垢です❤よろしくお願いします😳,343597397449,3153
937766057472942080,forYou_Talal79,,0,0,صدقه جاريه لك بالدنياء والأخره💙💙,343597397450,3153
933378720,hope_sherman,,385,345,,343597397451,3153
900846248030347265,mspyourselfrule,no where you need to know,274,104,[]Hello I'm on US server My user name is Yourselfrules[] (I also follow msp people back)[]Team Rose[],343597397452,3153
913934030390796288,lhenzkie27,"Tarlac City, Central Luzon",143,30,julianella fangirl...,343597397453,3153
804602734590595073,daddyjuice21,bompton,1209,615,you won’t ever come before brayson.,343597397454,3153
3240173659,YeemaxC,THAILAND,192,199,Support : NICHKHUN : 2PM : FC BAYERN : TWICE ❤,343597397455,3153
959211991649931264,drgddrf,,1,1,,343597397456,3153


Add a subquery to `dbtable`.  This pushes the predicate to the database, which applies the filter before transferring the data to your Spark cluster.

In [0]:
%sql
DROP TABLE IF EXISTS twitterPhilippinesJDBC;

CREATE TABLE twitterPhilippinesJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "(SELECT * FROM training.Account WHERE location = 'Philippines') as subq"
)
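The same pushdown can be expressed through the DataFrame reader. The sketch below only builds the options. `pushdown_dbtable` and `jdbc_options` are hypothetical helpers, and the final `spark.read` call is shown as a comment because it needs a live Spark session and database connection.

```python
def pushdown_dbtable(table, predicate, alias="subq"):
    """Wrap a table and a WHERE predicate in a subquery usable as the dbtable option."""
    return "(SELECT * FROM {} WHERE {}) as {}".format(table, predicate, alias)

def jdbc_options(dbtable):
    """Collect the connection options used throughout this notebook."""
    return {
        "driver": "org.postgresql.Driver",
        "url": "jdbc:postgresql://server1.databricks.training:5432/training",
        "user": "readonly",
        "password": "readonly",
        "dbtable": dbtable,
    }

options = jdbc_options(pushdown_dbtable("training.Account", "location = 'Philippines'"))
print(options["dbtable"])

# In a Databricks notebook (assumes a SparkSession named `spark`):
# philippinesDF = spark.read.format("jdbc").options(**options).load()
```

The predicate is inserted into the query verbatim, so this pattern is only appropriate for trusted, hard-coded strings.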

In [0]:
%sql
SELECT * FROM twitterPhilippinesJDBC LIMIT 10

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
2414716389,miggyolilaa,Philippines,130,219,We are just souls lost between reality and dream 💫,343597397582,3153
829929175565819905,lesboooooy,Philippines,302,176,VIII•V•XV 💑 I•XXX•XVIII 💍 // @McllGrcHp 🔥,369367202135,3153
444578604,KarrenTheGreat,Philippines,208,307,Wonder woman of my own version 💁,369367202827,3153
55088681,nikkiplanets,Philippines,438,158,,369367202962,3153
955625563430969344,shokoy_,Philippines,232,48,*second account*,343597398952,3153
1479342410,jhaninacajefe,Philippines,236,279,,369367204446,3153
144070866,bucaladenver,Philippines,76,92,Go suck my middle finger and I'll tell u wut!,352187335485,3153
40234887,tweetjoefred,Philippines,247,254,NVWS🖕🏼,352187335998,3153
376865035,lavpurple28,Philippines,275,948,NO ONE IS PERFECT!I'm a SOLID #KATHNIEL fan!💙 Don't follow me if you're a 🐸s or yung tga ibang LTs😝Di ka welcome d2👅#SongSongCouple❤️ #ParkBoGum😍 #YooSeungHo😍,369367204761,3153
1467205963,budithaeunicarn,Philippines,107,158,,369367204831,3153


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Parallel JDBC Reads

In [0]:
%sql
SELECT min(userID) as minID, max(userID) as maxID from twitterJDBC

minID,maxID
509,959519629566672896


In [0]:
%sql
DROP TABLE IF EXISTS twitterParallelJDBC;

CREATE TABLE IF NOT EXISTS twitterParallelJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "training.Account",
  partitionColumn '"userID"',
  lowerBound 2591,
  upperBound 951253910555168768,
  numPartitions 25
)
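With these options, Spark splits the read into `numPartitions` queries, each with its own range filter on `partitionColumn`. Note that `lowerBound` and `upperBound` only set the partition stride and do not filter rows: rows outside the bounds still land in the first or last partition. The function below is a rough approximation of how those filters are derived. `jdbc_partition_predicates` is a hypothetical helper, and the exact clauses Spark emits may differ between versions.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read.

    Assumes num_partitions >= 2 and lower_bound < upper_bound.
    """
    if num_partitions < 2 or lower_bound >= upper_bound:
        raise ValueError("need num_partitions >= 2 and lower_bound < upper_bound")

    # Width of each range, using integer division on each bound
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = "{} >= {}".format(column, current) if i != 0 else None
        current += stride
        upper = "{} < {}".format(column, current) if i != num_partitions - 1 else None
        if lower is None:
            # First partition is open below and also collects NULL values
            predicates.append("{} or {} is null".format(upper, column))
        elif upper is None:
            # Last partition is open above
            predicates.append(lower)
        else:
            predicates.append("{} AND {}".format(lower, upper))
    return predicates

# Values taken from the table definition above
for predicate in jdbc_partition_predicates('"userID"', 2591, 951253910555168768, 25)[:3]:
    print(predicate)
```

Because `userID` values are unevenly distributed, equal-width ranges like these do not guarantee equal-sized partitions.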

In [0]:
%sql
SELECT * from twitterParallelJDBC

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
3465776834,bcksojin86,,52,10,,343597397448,3153
450774034,REGULIA,,811,79,20↑日本のマンガ文化・ゲームが大好きな中国人です－！簡単な日本語なら喋れる😃「ヘキカ_エン」の名義で本国のアニメ制作進行を務めています😝好きな物と仕事関連なことをつぶやく垢です❤よろしくお願いします😳,343597397449,3153
933378720,hope_sherman,,385,345,,343597397451,3153
3240173659,YeemaxC,THAILAND,192,199,Support : NICHKHUN : 2PM : FC BAYERN : TWICE ❤,343597397455,3153
3123776030,Basrah__,,398,633,الحشد الشعبي المقدس - الثورة الإسلامية في ايران - حزب الله - أنصار الله✌️,343597397458,3153
3219625352,DreamlandSheep,Great Chasm,6,6,🔥,343597397459,3153
3555009314,luchrin,대한민국,127,91,잡덕입니다/알계 및 비공개는 정기적으로 블락합니다/ 안될테니사고를 쳐봤는데 다들 사고치는데 동의해주셨다....냥/멍멍이 사진 많이 올라와요/기본잡담이라 반말/대화는 실친제외 모두 존댓말 프리파라계정:@luchrin_pri 수공예계정: @craft_luchrin,343597397460,3153
1408137253,udealu,"Washington, DC",7536,7859,UDealu http://Udealu.com searches the deal websites and displays them! We search slickdeals and all the top deal sites!!!,343597397461,3153
1555321812,bowy46,,367,127,-HW46✌🏻HUSO👉🏻 PC21Ψ BUU63 👉🏻👉🏻#ออฟกัน #ป่าปี๊ไม่ได้หล่อขนาดนั้น #กันอรรถพันธ์น่ารักกว่าที่คิด ❤️,343597397463,3153
3403370834,etsu2057,日本 / Japan,7676,7163,LINEのようなアプリ！1.完全無料 2.8ティアまで報酬が入る 3.広告収入の70%をユーザーに分配。無料通話やメールが出来るアプリを広めていくビジネスです！PC・スマホ、タブレットで登録可！無料登録は、下記URLからJoin Free ↓↓ #autofollow #refollow #相互フォロー #wowapp,343597397464,3153


In [0]:
%python
%timeit sql("SELECT * from twitterJDBC").describe()

In [0]:
%python
%timeit sql("SELECT * from twitterParallelJDBC").describe()

For additional options [see the Spark docs.](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html)

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>