# Tutorial - How to access dashDB data with SparkR
Welcome to Cognitive Class Labs. This notebook is designed to teach you how to access data on dashDB using SparkR.

This notebook shows how to access a dashDB Data Warehouse (or a DB2 database) using R by following the steps below:
1. Load the `RJDBC` R library
1. Identify and enter the database connection credentials
1. Load the data from dashDB into a SparkR dataframe
1. Query the data

## What is dashDB?

**dashDB** is a fully managed cloud data warehouse, purpose-built for analytics. It offers massively parallel processing (MPP) scale and compatibility with a wide range of business intelligence (BI) tools.  


__Notice:__ Get your own dashDB free of charge: 

<h3 align = "center">
[Launch a dashDB service through Bluemix](https://console.ng.bluemix.net/?direct=classic/&amp;cm_mc_uid=&amp;cm_mc_sid_50200000=1453781614#/store/cloudOEPaneId=store&amp;serviceOfferingGuid=7c87c148-e1a4-4cb8-81f8-c5e74be7684b&CampID=DSWB)
</h3>

<a class="ibm-tooltip" href="https://console.ng.bluemix.net/?direct=classic/&amp;cm_mc_uid=&amp;cm_mc_sid_50200000=1453781614#/store/cloudOEPaneId=store&amp;serviceOfferingGuid=7c87c148-e1a4-4cb8-81f8-c5e74be7684b&CampID=DSWB" target="_blank" title="" id="ibm-tooltip-0">
<img alt="IBM Bluemix.Get started now" height="193" width="153" src="https://ibm.box.com/shared/static/42yt39czuksqdi278xpy96txtlw3lfmb.png" >
</a> 

## Load the `RJDBC` library
RJDBC is an implementation of R's DBI interface using JDBC as a back-end. This allows R to connect to any DBMS that has a JDBC driver.



In [None]:
library(RJDBC)
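
As a rough sketch of what the DBI-over-JDBC interface looks like when `RJDBC` is used directly (without Spark), the helper below opens a connection, runs one query, and closes the connection again. The function name `query_dashdb` and the jar path `/path/to/db2jcc4.jar` are assumptions made for illustration; the jar location depends on where the DB2 JDBC driver is installed on your system.

```r
# Sketch only: query a DB2/dashDB database directly through RJDBC.
# `query_dashdb` is a hypothetical helper; `driver_jar` is a placeholder path
# that must point at your local copy of the DB2 JDBC driver jar.
query_dashdb <- function(url, username, password, sql,
                         driver_jar = "/path/to/db2jcc4.jar") {
  # Create a JDBC driver object for the IBM DB2 driver class
  drv <- RJDBC::JDBC("com.ibm.db2.jcc.DB2Driver", driver_jar)
  # Open the connection and make sure it is closed when the function exits
  conn <- DBI::dbConnect(drv, url, username, password)
  on.exit(DBI::dbDisconnect(conn), add = TRUE)
  # Run the query and return the result as a local R data.frame
  DBI::dbGetQuery(conn, sql)
}

# Example call (requires a reachable database and valid credentials):
# query_dashdb("jdbc:db2://<your hostname>:50000/BLUDB",
#              "<your username>", "<your password>",
#              "SELECT 1 FROM SYSIBM.SYSDUMMY1")
```

Unlike the SparkR approach used in the rest of this notebook, this returns the whole result set as an in-memory R `data.frame`, so it is only suitable for results that fit on a single machine.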

## Identify the database connection credentials

Connecting to a dashDB or DB2 database requires the following information:
* Database name 
* Host DNS name or IP address 
* Host port
* User ID
* User Password

All of this information must be captured in a connection string in a subsequent step.

__Notice:__ To obtain credentials, follow this [user guide](http://support.datascientistworkbench.com/knowledgebase/articles/826020-getting-credentials-to-access-a-dashdb-data-wareho)

In [None]:
dsn_username <- "<your username>"     # e.g. dash104434
dsn_password <- "<your password>"     # e.g. xxxx
dsn_hostname <- "<your hostname>"     # e.g. awh-yp-small03.services.dal.bluemix.net
dsn_port     <- "<your port>"         # e.g. "50000"
dsn_database <- "<default database>"  # e.g. BLUDB
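
If you would rather not type credentials into a notebook cell, one option is to read them from environment variables. The following is a minimal sketch of that idea; the helper `read_credential` and the variable names `DASHDB_USERNAME` and `DASHDB_PORT` are hypothetical and not something dashDB or the notebook defines.

```r
# Sketch only: read connection settings from environment variables.
# `read_credential` and the DASHDB_* names are hypothetical.
read_credential <- function(name, default = NA_character_) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) {
    # Fall back to the default, or fail loudly if there is none
    if (is.na(default)) stop("Environment variable not set: ", name)
    return(default)
  }
  value
}

# For demonstration only: normally the variable is set outside of R
Sys.setenv(DASHDB_USERNAME = "dash104434")

dsn_username <- read_credential("DASHDB_USERNAME")
dsn_port     <- read_credential("DASHDB_PORT", default = "50000")
```

A missing variable without a default stops with an error, which tends to be easier to diagnose than a failed connection later on.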

## Create the database connection with SparkR

The following code snippets load the data from dashDB directly into a SparkR DataFrame.

### Why SparkR?
Even if your data is stored on dashDB in a relational database, you can now leverage SparkR to access, manipulate and analyze that data. SparkR allows you to work with extremely large datasets. Using SparkR, you can manipulate your data via SQL queries, or via SparkR's native commands.

In [None]:
# Assumes the notebook environment already provides the SparkContext `sc`
sqlContext <- sparkRSQL.init(sc)

In [None]:
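# DB2 JDBC URL format: jdbc:db2://<host>:<port>/<database>:user=<user>;password=<password>;
myurl <- paste0("jdbc:db2://", dsn_hostname, ":", dsn_port,
                "/", dsn_database,
                ":user=", dsn_username,
                ";password=", dsn_password,
                ";")
print(myurl)  # Caution: this prints the password in plain text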
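
Because the connection string embeds the password, printing it in a shared notebook exposes the credential. A small sketch of one way to handle this is shown here: the URL is built by a function and masked before display. Both `build_db2_url` and `mask_password` are hypothetical helpers, and the host and credentials are made-up example values.

```r
# Sketch only: build the DB2 JDBC URL and hide the password when displaying it.
# `build_db2_url` and `mask_password` are hypothetical helpers.
build_db2_url <- function(hostname, port, database, username, password) {
  paste0("jdbc:db2://", hostname, ":", port,
         "/", database,
         ":user=", username,
         ";password=", password,
         ";")
}

mask_password <- function(url) {
  # Replace everything between "password=" and the next ";" with asterisks
  sub("password=[^;]*", "password=****", url)
}

# Made-up example values
example_url <- build_db2_url("example.host", "50000", "BLUDB",
                             "dash104434", "secret")
print(mask_password(example_url))
```

The unmasked `example_url` is still what would be passed to the JDBC reader; only the printed copy is masked.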

In [None]:
df <- read.df(sqlContext, source = "jdbc",
              url = myurl,
              dbtable = "GOSALESDW.EMP_EXPENSE_FACT")

class(df)  # Confirm that df is a Spark DataFrame

### Check the Schema of the Spark Dataframe

In [None]:
printSchema(df)

### Query the Data
You can now use either SQL or native Spark DataFrame functions to query the data:

#### SQL

In [None]:
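# Register the DataFrame as a temporary table so it can be queried with SQL
registerTempTable(df, "tempdf")

results <- sql(sqlContext, "SELECT * FROM tempdf LIMIT 10")

# results is itself a Spark DataFrame; head() returns its first rows as a local R data.frame
head(results)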

#### Spark Dataframe functions

In [None]:
SparkR::head(df)
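
Beyond `head`, SparkR offers functions such as `columns`, `filter`, `select`, `limit`, `count`, and `collect`. The sketch below combines them in a hypothetical helper, `preview_first_column`, which is not part of SparkR. It assumes the same SparkR 1.x API used elsewhere in this notebook and works on whatever the first column of the DataFrame happens to be, since the table's column names are not listed here.

```r
# Sketch only: inspect a SparkR DataFrame with native functions instead of SQL.
# `preview_first_column` is a hypothetical helper, not part of SparkR.
preview_first_column <- function(df, n = 5) {
  # Name of the first column in the DataFrame
  first_col <- SparkR::columns(df)[1]
  # filter() accepts a SQL-style condition string
  non_null <- SparkR::filter(df, paste(first_col, "IS NOT NULL"))
  # Keep only the column of interest
  projected <- SparkR::select(non_null, first_col)
  list(
    column        = first_col,
    non_null_rows = SparkR::count(non_null),
    # collect() brings the (small) limited result back as a local data.frame
    sample        = SparkR::collect(SparkR::limit(projected, n))
  )
}

# Example call (requires the Spark DataFrame `df` loaded earlier):
# preview_first_column(df)
```

Only the limited sample is collected to the R session; the filtering and counting run in Spark.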

## Reference
- [Spark course available free on Big Data University](http://bigdatauniversity.com/courses/spark-fundamentals/?utm_source=Data%20Scientist%20Workbench&utm_medium=Notebook&utm_campaign=Tutorial%20-%20Access%20dashDB%20data%20with%20SparkR)
- [SparkR Programming Guide](https://spark.apache.org/docs/latest/sparkr.html)

### Need a refresher on Apache Spark? 
Free 3 hr course for beginners on Big Data University:  

<a href=http://bigdatauniversity.com/courses/spark-fundamentals/?utm_source=sparkfundamentalsI&utm_medium=dswb&utm_campaign=bdu><img src=https://ibm.box.com/shared/static/r3pj5oo2ivnzqar0poj2eexiqrnvq6vy.png></a>


<h3>Authors:</h3>
<article class="teacher">
<div class="teacher-image" style="    float: left;
    width: 115px;
    height: 115px;
    margin-right: 10px;
    margin-bottom: 10px;
    border: 1px solid #CCC;
    padding: 3px;
    border-radius: 3px;
    text-align: center;"><img class="alignnone wp-image-2258 " src="https://ibm.box.com/shared/static/tyd41rlrnmfrrk78jx521eb73fljwvv0.jpg" alt="Saeed Aghabozorgi" width="178" height="178" /></div>
<h4>Saeed Aghabozorgi</h4>
<p><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods, such as machine learning and statistical modelling, on large datasets.</p>
</article>
<article class="teacher">
<div class="teacher-image" style="    float: left;
    width: 115px;
    height: 115px;
    margin-right: 10px;
    margin-bottom: 10px;
    border: 1px solid #CCC;
    padding: 3px;
    border-radius: 3px;
    text-align: center;"><img class="alignnone size-medium wp-image-2177" src="https://ibm.box.com/shared/static/2ygdi03ahcr97df2ofrr6cf8knq4kodd.jpg" alt="Polong Lin" width="300" height="300" /></div>
<h4>Polong Lin</h4>
<p>
<a href="https://ca.linkedin.com/in/polonglin">Polong Lin</a> is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker at conferences and meetups and holds an M.Sc. in Cognitive Psychology.</p>
</article>

<hr>
Copyright &copy; 2016 [Big Data University](https://bigdatauniversity.com/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).