# Tutorial: Working with Hadoop data using SQL (with Scala)

In this tutorial you will learn to work with data in Hadoop using SQL.

If you specialize in relational database management technology and need to work with big data, Apache Hadoop is a good place to store and manipulate that data. To query data stored in Hadoop, this tutorial uses [__Big SQL__](http://www-01.ibm.com/software/data/infosphere/hadoop/big-sql.html). It lets you query Hadoop data using industry-standard SQL syntax, and it is designed to give SQL developers an easy on-ramp to data managed by Hadoop.

### What is Big SQL?
[IBM Big SQL](http://www-01.ibm.com/software/data/infosphere/hadoop/big-sql.html) provides standards-compliant SQL access to data in Hadoop. Developers familiar with SQL can access data in Hadoop without having to learn new languages or skills.

You can work with Hadoop data using SQL in a Jupyter Notebook. The required libraries are pre-installed in your Data Scientist Workbench, so you can establish a connection to a remote Hadoop cluster with Big SQL and then run SQL queries over data in Hadoop.

First, we load the packages we need for this tutorial.

In [None]:
import java.sql.{Connection, DriverManager, ResultSet};

To connect to the database for this tutorial, you need to get your own set of credentials.


### Credentials
Follow these steps to get your username and password:

1. Sign up for a free account on  [IBM Analytics Demo Cloud](https://my.imdemocloud.com/users/sign_up).

2. An activation email will be sent to you. Follow the instructions in it to set up your account. Note: your username is different from your email address. For example, the username for `jane.doe@example.com` might be `janedoe`. You can see your username in the top-right corner of Demo Cloud when you're logged in.

3. Log in to [IBM Analytics Demo Cloud](https://my.imdemocloud.com/users/sign_up) and click __Big SQL Technology Sandbox project__ to join it. You will be automatically approved to join.

4. Type your **username** and **password** between the quotation marks in the code cell below.

Click inside the cells below and run the cells (_Ctrl+Enter_) to set up a connection to the Big SQL Technology Preview.

In [None]:
val username = "";
val password = ""

Scala code gets compiled into byte code that runs on a Java™ virtual machine (JVM), which allows Scala applications to directly call Java libraries. Therefore, accessing Big SQL from a Scala application is simply a matter of using the existing JDBC driver for DB2: the IBM Data Server Driver for JDBC and SQLJ.

In [None]:
java.sql.DriverManager.registerDriver(new com.ibm.db2.jcc.DB2Driver)
val connection = java.sql.DriverManager.getConnection("jdbc:db2://iop-bi-master.imdemocloud.com:32051/bigsql", username, password)
connection

**Great! Now you're connected!**  
_If you saw an error, check that you filled in your username and password correctly._
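If you would rather handle a failed login in code than read a stack trace, one option is to wrap the connection attempt in `scala.util.Try`. The sketch below is not part of the original notebook: the helper names `tryConnect` and `describe` are hypothetical, and it assumes the DB2 driver has already been registered as in the cell above.

```scala
import java.sql.{Connection, DriverManager}
import scala.util.{Failure, Success, Try}

// Hypothetical helper: returns Success(connection) or Failure(exception)
// instead of throwing when the URL, username, or password is wrong.
def tryConnect(url: String, user: String, pass: String): Try[Connection] =
  Try(DriverManager.getConnection(url, user, pass))

// Hypothetical helper: turns the outcome into a short status message.
def describe(result: Try[Connection]): String = result match {
  case Success(_)  => "Connected"
  case Failure(ex) => "Connection failed: " + ex.getMessage
}
```

For example, `describe(tryConnect("jdbc:db2://iop-bi-master.imdemocloud.com:32051/bigsql", username, password))` yields a one-line status that you can print or log.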

### Let's try using SQL queries on a sample table

In this section, we will create a sample table, named __testTable__, load some data into it, and execute a query. Before we do this, we first want to check to see if the table already exists, and if it does, we remove it so we can start from scratch.

To prepare and execute a single SQL statement, call __connection.createStatement.executeQuery()__ for statements that return rows (such as SELECT), or __connection.createStatement.executeUpdate()__ for statements that do not (such as CREATE, DROP, or INSERT). Both methods take the following argument:
* __statement__  
  * A string that contains the SQL statement. This string can include an XQuery expression that is wrapped by an XMLQUERY clause.
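The difference between the two JDBC methods can be sketched as follows. `executeUpdate` returns a row count, while `executeQuery` returns a `ResultSet` that you iterate over. The helper names `runUpdate` and `queryFirstColumn` are illustrative only; they are not part of JDBC or of the original notebook.

```scala
import java.sql.Connection
import scala.collection.mutable.ListBuffer

// Hypothetical helper for statements that return no rows (DDL, INSERT, ...).
// Returns the row count reported by the driver.
def runUpdate(conn: Connection, sql: String): Int = {
  val stmt = conn.createStatement()
  try stmt.executeUpdate(sql) finally stmt.close()
}

// Hypothetical helper for SELECT statements.
// Collects the first column of every returned row as a string.
def queryFirstColumn(conn: Connection, sql: String): List[String] = {
  val stmt = conn.createStatement()
  try {
    val rs = stmt.executeQuery(sql)
    val values = ListBuffer.empty[String]
    while (rs.next()) values += rs.getString(1) // JDBC columns are 1-indexed
    values.toList
  } finally stmt.close() // closing the statement also closes its result set
}
```

With an open connection, `runUpdate(connection, "DROP TABLE IF EXISTS testTable")` would cover the DDL case and `queryFirstColumn(connection, "SELECT * FROM testTable")` the query case.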
  
__Notice:__ In Big SQL, there is only one database, __bigsql__, and you cannot create a new database. However, you CAN have your own schema (which defaults to your user connection name). So, if you connect to the database using your name and execute "CREATE HADOOP TABLE testTable", it creates a table called _YOUR_USER_NAME.testTable_ under your schema. 

To make sure we are using our own schema, we first execute a __USE__ statement.

In [None]:
val query = "USE "+username;
connection.createStatement.executeUpdate(query)

Next we drop the table if it already exists (so we can create a new one):

In [None]:
val query="DROP TABLE IF EXISTS testTable"
connection.createStatement.executeUpdate(query)

Now create a new table, __testTable__, with two columns named __column1__ and __column2__. To create the table in your schema, run the cell below:

In [None]:
val query="CREATE HADOOP TABLE testTable (column1 INT, column2 STRING)"
connection.createStatement.executeUpdate(query)

Let's insert some data into our _testTable_:

In [None]:
val query="INSERT INTO testTable VALUES (1,'Text1') "
connection.createStatement.executeUpdate(query)

Now we can retrieve the data as shown below:

In [None]:
val query = "SELECT * FROM testTable"
val resultSet = connection.createStatement.executeQuery(query)
// next() moves the cursor to the first row and returns false if there are no rows
if (resultSet.next()) {
    println("The COLUMN1 value is : " + resultSet.getString("COLUMN1"))
    println("The COLUMN2 value is : " + resultSet.getString("COLUMN2"))
}

### A Big Data Sample
In this section we use sample data that is provided in Big SQL by default. We will use it to run queries and create reports about the fictional __Sample Outdoor Company__.

The schema used in this tutorial is GOSALESDW. It contains fact tables for the following areas:

* Distribution
* Finance
* Geography
* Marketing
* Organization
* Personnel
* Products
* Retailers
* Sales
* Time

In [None]:
val query = "select * from gosalesdw.emp_employee_dim LIMIT 10";
val resultSet = connection.createStatement.executeQuery(query)
while ( resultSet.next() ) {
    val name = resultSet.getString("EMPLOYEE_NAME")
    val key = resultSet.getString("EMPLOYEE_KEY")
    println("Employee key, name = " + key + ", " + name)
}

You can improve a _SELECT *_ statement by adding a _predicate_ so that it returns fewer rows. A predicate is a condition on a query that narrows the focus of the result. On a query with a multi-way join, a predicate can also improve performance.

In [None]:
val query = "SELECT * FROM gosalesdw.go_region_dim WHERE region_en LIKE 'Amer%'";
val resultSet = connection.createStatement.executeQuery(query)
while ( resultSet.next() ) {
    val REGION_CODE = resultSet.getString("REGION_CODE")
    println("REGION_CODE = " + REGION_CODE )
}
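When the predicate value changes between runs or comes from user input, a `PreparedStatement` with a `?` placeholder avoids building the SQL string by concatenation. The following is a minimal sketch under that assumption, reusing the table and column names from the cell above; the function name `regionCodesLike` is hypothetical.

```scala
import java.sql.Connection
import scala.collection.mutable.ListBuffer

// Hypothetical helper: returns the REGION_CODE of every region whose
// English name matches the given LIKE pattern (for example "Amer%").
def regionCodesLike(conn: Connection, pattern: String): List[String] = {
  val sql = "SELECT region_code FROM gosalesdw.go_region_dim WHERE region_en LIKE ?"
  val stmt = conn.prepareStatement(sql)
  try {
    stmt.setString(1, pattern) // bind the first (and only) placeholder
    val rs = stmt.executeQuery()
    val codes = ListBuffer.empty[String]
    while (rs.next()) codes += rs.getString("REGION_CODE")
    codes.toList
  } finally stmt.close() // closing the statement also closes its result set
}
```

Calling `regionCodesLike(connection, "Amer%")` should select the same rows as the query above, with the pattern passed as data rather than embedded in the SQL text.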

You can also run a query that returns the number of rows in a table. 

In [None]:
val query = "SELECT COUNT(*) as con FROM gosalesdw.go_region_dim";
val resultSet = connection.createStatement.executeQuery(query)
resultSet.next() 
println ("The count is : "+ resultSet.getString("con"))

To learn which products were ordered from the fictional Sample Outdoor Company, and by which method, you must join several tables in the __gosalesdw__ schema, because the data is relational and no single table holds all of this information.

In [None]:
val query ="""
SELECT 
  pnumb.product_name, 
  sales.quantity, 
  meth.order_method_en 
FROM gosalesdw.sls_sales_fact sales
INNER JOIN gosalesdw.sls_product_dim prod
  ON sales.product_key=prod.product_key 
INNER JOIN gosalesdw.sls_product_lookup pnumb
  ON prod.product_number=pnumb.product_number 
INNER JOIN gosalesdw.sls_order_method_dim meth 
  ON meth.order_method_key=sales.order_method_key
WHERE pnumb.product_language='EN'
FETCH FIRST 10 ROWS ONLY
"""
val resultSet = connection.createStatement.executeQuery(query)
while ( resultSet.next() ) {
    val PRODUCT_NAME = resultSet.getString("PRODUCT_NAME")
    println("PRODUCT_NAME = " + PRODUCT_NAME )
}
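The loop above prints only one of the three selected columns. To keep all three together, each row can be mapped to a small case class. This is a sketch, not the notebook's original code: the `OrderLine` class and the `readOrderLines` function are hypothetical names, and reading `QUANTITY` with `getLong` assumes that column holds an integer type.

```scala
import java.sql.ResultSet
import scala.collection.mutable.ListBuffer

// Hypothetical row type for the three columns selected by the join.
case class OrderLine(productName: String, quantity: Long, orderMethod: String)

// Hypothetical helper: walks the result set and builds one OrderLine per row.
def readOrderLines(rs: ResultSet): List[OrderLine] = {
  val rows = ListBuffer.empty[OrderLine]
  while (rs.next()) {
    rows += OrderLine(
      rs.getString("PRODUCT_NAME"),
      rs.getLong("QUANTITY"), // assumption: QUANTITY is an integer column
      rs.getString("ORDER_METHOD_EN"))
  }
  rows.toList
}
```

A possible use is `readOrderLines(connection.createStatement.executeQuery(query)).foreach(println)`, which prints each joined row as one value.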

Finally, as a best practice we should close the database connection once we're done with it.

In [None]:
connection.close()
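If a query throws an exception before `connection.close()` is reached, the connection stays open. One common guard is a small try/finally helper, sometimes called the loan pattern. The sketch below is an illustration rather than part of the original tutorial, and `withResource` is a hypothetical name.

```scala
// Hypothetical helper: runs `body` with the resource and closes the resource
// afterwards, even if `body` throws an exception.
def withResource[R <: AutoCloseable, A](resource: R)(body: R => A): A =
  try body(resource) finally resource.close()
```

Because `java.sql.Connection` is an `AutoCloseable`, the whole session could be wrapped as `withResource(DriverManager.getConnection(url, username, password)) { conn => ... }`, where `url` stands for the JDBC URL used earlier. On Scala 2.13 or later, `scala.util.Using` in the standard library serves the same purpose.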

### Reference

For more information on Big SQL, please visit the [Big SQL reference page](https://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.2/com.ibm.swg.im.infosphere.biginsights.bigsql.doc/doc/bsql_reference.html).

## Want to learn more?

<a href="http://bigdatauniversity.com/courses/sql-access-on-hadoop-big-sql-v4/?utm_source=tutorial-bigsql-scala&utm_medium=dswb&utm_campaign=bdu"><img src = "https://ibm.box.com/shared/static/s5ensv6192ntt3cnwqsmytvxsaunrlmv.png"> </a>


<h3>Authors:</h3>
<article class="teacher">
<div class="teacher-image" style="    float: left;
    width: 115px;
    height: 115px;
    margin-right: 10px;
    margin-bottom: 10px;
    border: 1px solid #CCC;
    padding: 3px;
    border-radius: 3px;
    text-align: center;"><img class="alignnone wp-image-2258 " src="https://ibm.box.com/shared/static/tyd41rlrnmfrrk78jx521eb73fljwvv0.jpg" alt="Saeed Aghabozorgi" width="178" height="178" /></div>
<h4>Saeed Aghabozorgi</h4>
<p><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods, such as machine learning and statistical modelling, on large datasets.</p>
</article>
<article class="teacher">
<div class="teacher-image" style="    float: left;
    width: 115px;
    height: 115px;
    margin-right: 10px;
    margin-bottom: 10px;
    border: 1px solid #CCC;
    padding: 3px;
    border-radius: 3px;
    text-align: center;"><img class="alignnone size-medium wp-image-2177" src="https://ibm.box.com/shared/static/2ygdi03ahcr97df2ofrr6cf8knq4kodd.jpg" alt="Polong Lin" width="300" height="300" /></div>
<h4>Polong Lin</h4>
<p>
<a href="https://ca.linkedin.com/in/polonglin">Polong Lin</a> is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker at conferences and meetups, and holds an M.Sc. in Cognitive Psychology.</p>
</article>

<hr>
Copyright &copy; 2016 [Big Data University](https://bigdatauniversity.com/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).