# Overview of NoSQL Technologies and HBase

In this topic we go through some key concepts of NoSQL and get started with HBase. We start with use cases for NoSQL databases, and then move on to HBase to understand more about both NoSQL and HBase.

* Overview of NoSQL Databases
* NoSQL Concepts/Features
* Setup HBase Locally
* Understanding HBase
* Setup Project
* Put and Get examples using Scala
* Develop GettingStarted using Scala
* Develop NYSELoad using Scala

### Overview of NoSQL Databases
Let us go through the rationale behind NoSQL databases. One of the key milestones in the evolution of databases is the RDBMS. Relational databases are used extensively for applications where transactions are involved.

Over time, other types of systems have emerged where transactions are not that important, e.g. recommendation engines, endorsements, messengers etc. For these types of applications, using an RDBMS can be counterproductive. NoSQL databases provide alternatives that are not only cheaper but also scalable.

***Use Cases***

Let us see some use cases.

* Facebook Messenger
* Majority of LinkedIn Website components such as job recommendations, endorsements etc.
* Streaming Executive Dashboards which are time series in nature.

***List of NoSQL Databases***

Let us go through some of the popular NoSQL databases available to us. Even though the concepts are the same, the syntax and semantics differ.

* HBase – comes as part of the Hadoop ecosystem with most of the distributions such as Cloudera, Hortonworks, MapR etc.
* Cassandra – a very popular NoSQL database which can work with Big Data technologies such as Hadoop, Spark etc.
* MongoDB – a simple, easy to use NoSQL database with rich querying capabilities that works seamlessly with Big Data technologies such as Hadoop, Spark etc.
* DynamoDB – a popular NoSQL database within the Amazon ecosystem
* MapR-DB – MapR's version of HBase.

***Job Roles***

Let us see the job roles that use NoSQL databases.

* Data Engineers – NoSQL is a popular choice for building streaming pipelines.
* Application Developers – building applications that need only minimal transaction features.

***Learning Process***

Let us see the typical learning process for exploring NoSQL Databases.

* Setting up a development environment with the programming language or framework of your choice.
* Understand CLI based tools or IDEs to interact with them.
* Basic CRUD Operations
* Querying Features
* Developing Applications based on requirements.

The ability to build applications using NoSQL is key.

### NoSQL Concepts/Features
Most NoSQL databases come with these capabilities:

* Key and Value – each row has a key and a value. The value typically contains attributes and their respective values. It is very close to XML or JSON.
* Indexed – Data is typically sorted and indexed based on the row key.
* Partitioned – Data is partitioned by the row key. This is also known as sharding.
* Replication – There will be multiple copies of data
* Commit log – For restore and recovery of the data
* CAP theorem – Consistency, Availability and Partition Tolerance
* Minor and major compaction – periodic merging of files within each partition so that we do not end up with too many small files
* Tombstones – soft delete of data
* Vacuum Cleaning – hard delete of data
* Consistency Level – the commit point. As we have multiple copies of the data, the consistency level determines whether data is considered committed when one copy is updated or only when all the copies are updated. It is chosen depending upon the criticality of the data and the desired performance of the application.

<q>*While indexing and partitioning serve the purpose of scalability in terms of performance, replication serves the purpose of reliability of the database. Even though we do not cover these terms extensively, it is good to understand all of them in detail.*</q>
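
To make a few of these terms concrete, here is a toy, in-memory sketch of a single partition in Scala. It is only an illustration and not how HBase or any other database is actually implemented: `ToyStore` and its methods are hypothetical names, and replication, consistency levels and on-disk files are deliberately left out. It shows data sorted by row key, a commit log that records every write, a delete that leaves a tombstone, and a compaction that removes tombstoned keys for good.

```scala
import scala.collection.immutable.TreeMap

// Toy model of one partition. A value of None marks a tombstone (soft delete).
final case class ToyStore(
    commitLog: Vector[(String, Option[String])] = Vector.empty,
    cells: TreeMap[String, Option[String]] = TreeMap.empty[String, Option[String]]
) {
  // Every write is appended to the commit log and applied to the sorted map.
  private def write(key: String, value: Option[String]): ToyStore =
    ToyStore(commitLog :+ (key -> value), cells + (key -> value))

  def put(key: String, value: String): ToyStore = write(key, Some(value))

  // A delete only writes a tombstone; the key is still physically present.
  def delete(key: String): ToyStore = write(key, None)

  // Reads treat a tombstone the same way as a missing key.
  def get(key: String): Option[String] = cells.get(key).flatten

  // Compaction drops tombstoned keys (hard delete).
  def compact: ToyStore = copy(cells = cells.filter { case (_, v) => v.isDefined })
}
```

For example, after `ToyStore().put("row2", "b").put("row1", "a").delete("row2")` the keys are held in sorted order (`row1`, `row2`), `get("row2")` returns `None`, and only `compact` actually removes `row2` from the map.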

### Setup HBase Locally
Let us see the instructions to set up HBase locally.
* HBase is dependent on ZooKeeper
* ZooKeeper comes bundled with HBase itself
* If you already have ZooKeeper running, make sure the port and the ZooKeeper data directory are changed to avoid conflicts.
* Download the tarball, then untar and uncompress it (using tar xzf)
* Copy the extracted directory under /opt
* Create a soft link /opt/hbase to manage future upgrades
* Update /opt/hbase/conf/hbase-site.xml

* Update PATH to reflect /opt/hbase/bin
* Start HBase, it will take care of starting Zookeeper as well (start-hbase.sh)
* Validate by running some commands – list tables, create table, insert record and scan table.

We will be using the lab for the demonstrations.

### Understanding HBase
Let us go through the basics of HBase.

* Review Multi-Node Cluster
* Understanding HBase Shell
* CRUD Operations
* Schema in HBase

***Review Multi-Node Cluster***

Let us review details with respect to the Multi-Node Cluster.
* Zookeeper – 3 Nodes
* Masters – 3 Nodes
* Region Servers – On all worker nodes
* Data is stored permanently in HDFS. However, writes first go to memory in the region servers and are flushed to HDFS at regular intervals.

***Understanding HBase Shell***

Let us understand more about hbase shell by going through some of the commands.
* On the gateway node of the HBase cluster, run <mark>hbase shell</mark>
* <mark>help</mark> – provides a list of commands grouped into different categories
* Namespace – a group of tables (similar to schema or database)
    * create – <mark>create_namespace 'training'</mark>
    * list – <mark>list_namespace 'training'</mark>
    * list tables – <mark>list_namespace_tables 'training'</mark>
* Table – a group of rows which have keys and values
    * While creating the table we need to specify a table name and at least one column family
    * A column family has cells. A cell is nothing but a name and value pair
    * e.g.: <mark>create 'training:hbasedemo', 'cf1'</mark>
    * list – <mark>list 'training:.*'</mark>
    * describe – <mark>describe 'training:hbasedemo'</mark>
    * truncate – <mark>truncate 'training:hbasedemo'</mark>
    * Dropping is a two-step process – disable and drop
    * disable – <mark>disable 'training:hbasedemo'</mark>
    * drop – <mark>drop 'training:hbasedemo'</mark>
* Inserting/updating data

### CRUD Operations
Let us explore details about CRUD.
As part of this course, we will be covering CRUD operations on HBase or MapR-DB database.

* CRUD stands for
    * Create (insert)
    * Read
    * Update
    * Delete
* Here are some of the points to remember with respect to CRUD operations
    * All of these databases support all four operations. But when it comes to DML (create, update, delete), performance varies depending upon the consistency level.
    * All databases support basic operations
        * Selecting all the data
        * Selecting a range of columns
        * Retrieve row value by passing key
        * Apply filter on row values
    * Databases such as MongoDB also have a rich aggregation framework

***Schema in HBase/MapR-DB***

Let us go through a quick overview of HBase/MapR-DB schema.

* Recap of Database Operations
    * A table contains one or more column families (a column family is a group of columns)
    * We do not specify columns while creating the table
    * Data is inserted/updated using put
    * Data can be read using the scan or get. We need to pass the row key to get.
    * Data can be deleted using delete
    * With each put, we insert/update only one row; every column in the put belongs to a column family
    * The combination of a row key, one column name and its value is also known as a cell
    * Data is automatically sorted and partitioned on the row key
* We need to design our row key based on the way data is stored internally.
* There are 2 types of schemas in HBase – Thick Schema and Thin Schema
* In an RDBMS we typically have a normalized data model where relationships are established and enforced. In NoSQL, we typically do not enforce relationships at the database layer and schemas need not be normalized.
* Within each row key, all the cells (column name and value) are sorted based on the key of a cell (column name)
* HBase supports several filters (partial scans, and filters on top of cells as part of get)
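
Because rows are sorted on the row key, the way the key is built decides which rows end up next to each other. The following is a small sketch in plain Scala (no HBase needed) of two common consequences. The object name `RowKeyDemo`, the padding width and the `ticker,tradeDate` key layout are assumptions made for illustration, not the layout used by the course tables. HBase compares keys byte by byte; for the ASCII strings used here, Scala's default string ordering gives the same result.

```scala
object RowKeyDemo {
  // Numeric ids stored as plain strings sort lexicographically, not numerically.
  val rawKeys: List[String] = List("2", "10", "1").sorted

  // Zero-padding to a fixed width (5 is an arbitrary choice) restores numeric order.
  val paddedKeys: List[String] = List(2, 10, 1).map(id => f"$id%05d").sorted

  // Hypothetical composite key: ticker first, then trade date as yyyyMMdd,
  // so all rows of one ticker are adjacent and ordered by date.
  def stockRowKey(ticker: String, tradeDate: String): String = s"$ticker,$tradeDate"

  val stockKeys: List[String] = List(
    stockRowKey("B", "20170101"),
    stockRowKey("A", "20170102"),
    stockRowKey("A", "20170101")
  ).sorted
}
```

With such a composite key, a partial scan over the prefix `A,` would read one ticker's rows in date order without touching the others.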

### Setup Project
Here are the steps involved to set up the project

* Make sure the necessary tables are created (training:hbasedemo, nyse:stock_data, nyse:stock_data_wide)
* Create new project HBaseDemo using IntelliJ
    * Choose scala 2.11
    * Choose sbt 0.13.x
* Make sure JDK is chosen
* Update build.sbt. See below
* Define application properties

***Dependencies (build.sbt)***

HBase applications are dependent upon Hadoop and hence we need to add dependencies related to Hadoop as well as HBase.

* Add the Typesafe Config dependency so that we can externalize properties
* Add hadoop dependencies
* Add hbase dependencies
* Add assembly plugin, so that we can build a fat jar
    * Create **assembly.sbt** file in the **project** directory.
    * Add this line of code – <mark>addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")</mark>
* Define a merge strategy. It is required to build fat jar so that we can deploy and run on other environments.
* Replace build.sbt with below lines of code
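
A build.sbt along these lines could look as follows. This is a sketch under assumptions: the project name and version are inferred from the jar name used later (HBaseDemo-assembly-0.1.jar), and the Scala patch version and all library versions are placeholders that should be matched to the Hadoop and HBase versions of your cluster. The merge strategy shown is a deliberately blunt one (discard everything under META-INF, otherwise keep the first copy); a real project may need finer rules.

```scala
name := "HBaseDemo"

version := "0.1"

scalaVersion := "2.11.12"

// Typesafe Config, used to externalize properties (version is an assumption)
libraryDependencies += "com.typesafe" % "config" % "1.3.2"

// Hadoop and HBase client libraries (versions are assumptions)
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"

libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.1.2"

libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.1.2"

// Resolve duplicate files that several dependency jars bring into the fat jar
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```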

***Externalize Properties***

We need to make sure that application can be run in different environments. It is very important to understand how to externalize properties and pass the information at run time.

* Make sure build.sbt has the dependency for Typesafe Config
* Create new directory under src/main by name resources
* Add file called application.properties and add below entries
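
As a sketch of how these properties might be laid out and read, assume one set of entries per environment, prefixed with dev or prod. The key names follow the zookeeper.quorum and zookeeper.port properties used later in this topic; the host names are placeholders and `PropsDemo` is a hypothetical object name.

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Assumed contents of src/main/resources/application.properties
// (host names are placeholders):
//   dev.zookeeper.quorum = localhost
//   dev.zookeeper.port = 2181
//   prod.zookeeper.quorum = zk01.example.com,zk02.example.com,zk03.example.com
//   prod.zookeeper.port = 2181

object PropsDemo {
  def main(args: Array[String]): Unit = {
    // The first argument selects the environment; default to dev
    val env = if (args.nonEmpty) args(0) else "dev"

    // ConfigFactory.load() picks up application.properties from the classpath;
    // getConfig(env) narrows it down to the entries under that prefix
    val props: Config = ConfigFactory.load().getConfig(env)

    println(props.getString("zookeeper.quorum"))
    println(props.getString("zookeeper.port"))
  }
}
```

Keeping the environment as a prefix means the same jar can be run against a local HBase or a cluster just by changing one argument.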

### Put and Get Examples (using sbt console)
As we have added the necessary dependencies, we can use sbt console to launch the Scala REPL with all of them on the classpath and try out examples interactively.

* Launch Scala REPL using sbt console
* Import all the necessary classes or objects or functions
* Create HBase connection object using zookeeper quorum and port
* Create table object by using appropriate table name (make sure the table is pre-created from hbase shell using <mark>create 'training:hbasedemo', 'cf1'</mark>)
* To insert a new cell
    * Create put object
    * Add necessary columns
    * Add or update record using put function on table object
    * Validate by running <mark>scan 'training:hbasedemo'</mark>
* To get one row by using key
    * Create get object
    * Get the row by passing the get object to table.get
    * Read an individual cell and pass it to functions such as Bytes.toString to convert the bytes back to the original type

### Develop GettingStarted Program
Now let us develop a program called GettingStarted, validate it using the IDE, then build it and run it on the cluster.

***Create GettingStarted using IDE***

We will create an object file using the IDE to develop the logic.

* Create a Scala program by choosing Scala Class and then selecting the kind Object
* Make sure the program is named GettingStarted
* First we need to import necessary APIs
* Develop necessary logic
    * Get the properties from application.properties
    * Load zookeeper.quorum and zookeeper.port and create HBase connection
    * Perform necessary operations to demonstrate

* The program takes 2 arguments: the environment, used to load the respective properties, and the HBase table name
* We can go to Run -> Edit Configurations and pass the arguments
* If dev is passed, it will try to connect to the HBase installed locally; otherwise it will connect to the cluster specified in prod.zookeeper.quorum
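
A minimal sketch of such a program is shown below. It assumes application.properties holds zookeeper.quorum and zookeeper.port under a dev or prod prefix, and, as the operation to demonstrate, it simply scans the given table and prints the row keys. That choice is ours for illustration; the original program may do something different.

```scala
import com.typesafe.config.ConfigFactory
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object GettingStarted {
  def main(args: Array[String]): Unit = {
    val env = args(0)        // dev or prod
    val tableName = args(1)  // e.g. training:hbasedemo

    // Load only the properties of the chosen environment
    val props = ConfigFactory.load().getConfig(env)

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", props.getString("zookeeper.quorum"))
    conf.set("hbase.zookeeper.property.clientPort", props.getString("zookeeper.port"))

    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf(tableName))
    try {
      // Full table scan; fine for a small demo table
      val scanner = table.getScanner(new Scan())
      try scanner.asScala.foreach(result => println(Bytes.toString(result.getRow)))
      finally scanner.close()
    } finally {
      table.close()
      connection.close()
    }
  }
}
```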

***Build, Deploy and Run***

Now that development and validation are done, let us see how we can build and deploy on the cluster.

* Right click on the project and copy path
* Go to terminal and run cd command with the path copied
* Make sure assembly plugin is added
* Run <mark>sbt assembly</mark>
* It will generate a fat jar. A fat jar is nothing but our application bundled together with all of its dependency jars
* Copy to the server where you want to deploy
* Run using java -jar command – <mark>java -jar HBaseDemo-assembly-0.1.jar prod training:hbasedemo</mark>

### Develop NYSELoad using Scala
As part of this program we will see how to read data from a file and load it into nyse:stock_data, using Scala as the programming language along with the HBase APIs.

* Read data from file (we will only process one file at a time)
* Create HBase Connection
* Create table object for nyse:stock_data
* For each record, build a put object and load it into the HBase table using the table object (for performance reasons we can add multiple rows together)
* We will also see how to add the main class as part of the assembly, rebuild the fat jar and run it on the cluster (use sbt assembly)
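
The loading logic could be sketched as below. Several details are assumptions rather than facts from this topic: the record layout (ticker, trade date, open, high, low, close, volume separated by commas), the column family `sd`, the column names, the row key made of ticker and trade date, the batch size of 1000, and passing the input file and ZooKeeper quorum as arguments.

```scala
import java.util.{ArrayList => JArrayList}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.io.Source

object NYSELoad {
  // Assumed column family and column names
  private val cf = Bytes.toBytes("sd")
  private val qualifiers = Seq("op", "hp", "lp", "cp", "vol")

  // Assumed record layout: ticker,tradeDate,open,high,low,close,volume
  def buildPut(line: String): Put = {
    val fields = line.split(",")
    // Hypothetical row key: ticker and trade date
    val put = new Put(Bytes.toBytes(s"${fields(0)},${fields(1)}"))
    qualifiers.zip(fields.drop(2)).foreach { case (q, v) =>
      put.addColumn(cf, Bytes.toBytes(q), Bytes.toBytes(v))
    }
    put
  }

  def main(args: Array[String]): Unit = {
    val inputPath = args(0) // one file at a time
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", args(1))

    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("nyse:stock_data"))
    val source = Source.fromFile(inputPath)
    try {
      // Send rows in batches instead of one put call per record
      source.getLines().grouped(1000).foreach { batch =>
        val puts = new JArrayList[Put]()
        batch.foreach(line => puts.add(buildPut(line)))
        table.put(puts)
      }
    } finally {
      source.close()
      table.close()
      connection.close()
    }
  }
}
```

With sbt-assembly, the main class of the fat jar can be set in build.sbt with a line such as `mainClass in assembly := Some("NYSELoad")`, after which `sbt assembly` rebuilds the jar.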