# Apache Sqoop – Getting Started and Import

Let us get started with Sqoop to import data from relational databases such as MySQL and PostgreSQL into HDFS.
* Introduction to Sqoop
* Preview of MySQL in our labs
* Connectivity to Database using Sqoop
* Sqoop List Commands
* Running queries in MySQL using Sqoop eval
* Simple Sqoop Import
* Sqoop Execution Life Cycle
* Managing Directories

### Introduction to Sqoop
Let us get an overview of Sqoop.
* It is a command-line tool to copy data between relational databases and HDFS, in either direction
* It is developed in Java
* We will be using Sqoop 1.4.x as part of the course.
* You can access Sqoop Documentation here – [Sqoop User Guide (v1.4.6)](https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html)
* Important Sqoop topics for the certification are sqoop-import, sqoop-export, sqoop-eval, sqoop-list-databases, sqoop-list-tables and sqoop-help
* To check the version of Sqoop, use the command <mark>sqoop version</mark>
* For import, the source is typically a relational database and the target is HDFS
* For export, the source is HDFS and the target is the relational database
* Data is copied using MapReduce jobs
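
As a quick check before touching any database, the sketch below wraps the version and help tools in a shell function. The function name `show_sqoop_info` is only illustrative; `sqoop version`, `sqoop help` and `sqoop help import` are the commands themselves.

```shell
# Illustrative wrapper; nothing runs until the function is called.
show_sqoop_info() {
  # Print the installed Sqoop version
  sqoop version
  # List the available tools (import, export, eval, list-databases, ...)
  sqoop help
  # Show the options of one tool, here import
  sqoop help import
}
```

Call `show_sqoop_info` on the gateway node to run all three commands.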

### Preview of MySQL in our labs
In our environment, we will be practicing Sqoop by copying data between MySQL and HDFS. Hence it is important to understand details about MySQL.
* A demo is given using one of our Big Data clusters, but you can use any other cluster you have access to.
* You typically connect to a gateway node in the cluster; from that node you should be able to reach the relational database to perform a Sqoop import or export.
* Preview of MySQL
    * Hostname: ms.itversity.com
    * Several Databases for different data sets
        * retail_db
        * hr_db
        * nyse_db
    * Users
        * retail_user
        * hr_user
        * nyse_user
    * Password: itversity
    
***Login to MySQL Database***

* To connect to the MySQL database as retail_user, run <mark>mysql -u retail_user -h ms.itversity.com -p</mark>
* <mark>-p</mark> prompts for the password, which is **itversity**
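
The login step can be sketched as two small shell functions, using the host and user from the preview above. The function names are illustrative, and the second function assumes you want to run a single statement (`SHOW TABLES`) against `retail_db` rather than open an interactive session.

```shell
# Connection details from the lab preview
DB_HOST="ms.itversity.com"
DB_USER="retail_user"

# Interactive session; -p makes the client prompt for the password
mysql_login() {
  mysql -u "${DB_USER}" -h "${DB_HOST}" -p
}

# Non-interactive variant: select retail_db, run one statement with -e, exit
mysql_show_tables() {
  mysql -u "${DB_USER}" -h "${DB_HOST}" -p retail_db -e "SHOW TABLES"
}
```

Because `-p` is followed by a space, the client still prompts for the password and treats `retail_db` as the database name.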

### Connectivity to Database using Sqoop

Almost all Sqoop commands take a JDBC connect string, passed using <mark>--connect</mark>. It requires the following information:

* Database type (MySQL, Oracle etc)
* Hostname
* Port number
* Database Name (in case of MySQL)
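
To make the four parts concrete, here is a minimal sketch that assembles a MySQL connect string from them. The port 3306 is an assumption (it is the MySQL default; the lab notes do not state the port), and the variable names are illustrative.

```shell
DB_TYPE="mysql"              # database type
DB_HOST="ms.itversity.com"   # hostname from the lab preview
DB_PORT="3306"               # assumption: MySQL default port
DB_NAME="retail_db"          # database name

# JDBC URL layout: jdbc:<type>://<host>:<port>/<database>
CONNECT_STRING="jdbc:${DB_TYPE}://${DB_HOST}:${DB_PORT}/${DB_NAME}"
echo "${CONNECT_STRING}"
# prints jdbc:mysql://ms.itversity.com:3306/retail_db
```

The resulting value is what gets passed to <mark>--connect</mark>.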

### Sqoop List Commands
The Sqoop list commands let us see the databases on a server or the tables in a database.
* list-databases is used to list databases. In databases like Oracle 11g, it will list schemas.
* list-tables is used to list tables from a given database (or schema in case of Oracle)
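
A minimal sketch of both list commands against the lab MySQL server follows. It assumes port 3306 and passes the password with `--password` for brevity; `-P` prompts for it instead. The function names are illustrative.

```shell
DB_SERVER="jdbc:mysql://ms.itversity.com:3306"   # assumption: default MySQL port
DB_USER="retail_user"
DB_PASS="itversity"

# List the databases visible to retail_user (no database name in the URL)
list_databases() {
  sqoop list-databases \
    --connect "${DB_SERVER}" \
    --username "${DB_USER}" \
    --password "${DB_PASS}"
}

# List the tables of one database; the database name is part of the URL
list_tables() {
  sqoop list-tables \
    --connect "${DB_SERVER}/retail_db" \
    --username "${DB_USER}" \
    --password "${DB_PASS}"
}
```

Run `list_databases` first, then `list_tables` for the database you want to inspect.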

### Running queries in MySQL using Sqoop eval
Let us understand how we can run queries or commands in a remote database. We will be using MySQL for the demo.
* Running Queries – We can use eval to run queries on remote databases
* Any valid SQL query or command can be run using sqoop eval. It is primarily used to load or query log tables while running Sqoop jobs at regular intervals.
* We can issue commands such as select, insert, update, delete etc.
* The user used to connect to the database must have the required privileges on the underlying tables
* We can even invoke stored procedures.
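
As a sketch, `sqoop eval` can be wrapped so that any SQL statement is passed in as an argument. The table name `orders` in the usage line is an assumption about what `retail_db` contains, port 3306 is assumed, and the function name `run_query` is illustrative.

```shell
DB_CONNECT="jdbc:mysql://ms.itversity.com:3306/retail_db"   # assumption: default MySQL port
DB_USER="retail_user"
DB_PASS="itversity"

# Run one SQL statement on the remote database; $1 is the statement text
run_query() {
  sqoop eval \
    --connect "${DB_CONNECT}" \
    --username "${DB_USER}" \
    --password "${DB_PASS}" \
    --query "$1"
}
```

For example, `run_query "SELECT * FROM orders LIMIT 10"` prints the first ten rows of the (hypothetical) `orders` table to the console.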

### Simple Sqoop Import
Let us copy data from one table in MySQL to HDFS and understand how it works.
* sqoop import is the main command
* We need <mark>--connect</mark> to pass the connect string
* We need to specify username and password to authenticate to the database
* We need to specify a table name that needs to be copied
* We also need to specify target location in HDFS (either by using <mark>--target-dir</mark> or <mark>--warehouse-dir</mark>)
* <mark>--target-dir</mark> copies the data directly into the path specified, whereas <mark>--warehouse-dir</mark> creates a subdirectory named after the table under the path specified and copies the data into it.
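
The difference between the two directory options can be sketched as follows. The table name `orders`, the HDFS base path and port 3306 are assumptions made for illustration; substitute values from your own environment.

```shell
DB_CONNECT="jdbc:mysql://ms.itversity.com:3306/retail_db"   # assumption: default MySQL port
DB_USER="retail_user"
DB_PASS="itversity"
TABLE="orders"                                      # hypothetical table name
HDFS_BASE="/user/training/sqoop_import/retail_db"   # hypothetical HDFS path

# --target-dir: data files land directly in the given path
import_with_target_dir() {
  sqoop import \
    --connect "${DB_CONNECT}" \
    --username "${DB_USER}" \
    --password "${DB_PASS}" \
    --table "${TABLE}" \
    --target-dir "${HDFS_BASE}/${TABLE}"
}

# --warehouse-dir: Sqoop creates a subdirectory named after the table
import_with_warehouse_dir() {
  sqoop import \
    --connect "${DB_CONNECT}" \
    --username "${DB_USER}" \
    --password "${DB_PASS}" \
    --table "${TABLE}" \
    --warehouse-dir "${HDFS_BASE}"
}
```

With these values both functions write to the same location, `${HDFS_BASE}/orders`; only the way the path is specified differs.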

### Sqoop Execution Life Cycle
Here is the execution lifecycle of Sqoop.
* Connect to the source database and get metadata
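* Generate a Java file from the metadata and compile it into a jar file
* Run the **BoundingValsQuery** (a query for the minimum and maximum of the split column) to compute split boundaries; the number of splits defaults to 4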
* Use split boundaries to issue queries against the source database
* Each mapper uses its own database connection to issue its query
* Each mapper gets a mutually exclusive subset of the data
* Each mapper writes its data to a separate file in HDFS
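
The split logic can be illustrated with plain shell arithmetic. This is a simplified sketch of even range splitting, not Sqoop's own splitter code; the column name `order_id` and the minimum and maximum values are hypothetical stand-ins for what the boundary query would return.

```shell
# Hypothetical result of: SELECT MIN(order_id), MAX(order_id) FROM orders
MIN_ID=1
MAX_ID=1000
NUM_MAPPERS=4   # default number of splits

# Print one non-overlapping range per mapper
print_splits() {
  min=$1; max=$2; n=$3
  size=$(( (max - min + 1) / n ))
  lo=$min
  i=0
  while [ "$i" -lt "$n" ]; do
    if [ "$i" -eq $(( n - 1 )) ]; then
      hi=$max                    # last range absorbs any remainder
    else
      hi=$(( lo + size - 1 ))
    fi
    echo "mapper ${i}: WHERE order_id >= ${lo} AND order_id <= ${hi}"
    lo=$(( hi + 1 ))
    i=$(( i + 1 ))
  done
}

print_splits "${MIN_ID}" "${MAX_ID}" "${NUM_MAPPERS}"
# mapper 0: WHERE order_id >= 1 AND order_id <= 250
# mapper 1: WHERE order_id >= 251 AND order_id <= 500
# mapper 2: WHERE order_id >= 501 AND order_id <= 750
# mapper 3: WHERE order_id >= 751 AND order_id <= 1000
```

In a real import, the number of ranges follows `--num-mappers` and the column used for splitting can be chosen with `--split-by`.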

### Managing Directories
When we run Sqoop import we need to deal with different scenarios with respect to the target directory.
* If the directory already exists, we might want the import to fail, to overwrite the directory, or to append new data to it. Each of the three behaviors is available through a different option.
* By default, sqoop import fails if the target directory already exists
* A directory can be overwritten by using <mark>--delete-target-dir</mark>
* Data can be appended to an existing directory by using <mark>--append</mark>
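
The overwrite and append scenarios can be sketched as two variants of the same import. The table name `orders`, the HDFS path and port 3306 are assumptions for illustration, and the function names are hypothetical.

```shell
DB_CONNECT="jdbc:mysql://ms.itversity.com:3306/retail_db"     # assumption: default MySQL port
DB_USER="retail_user"
DB_PASS="itversity"
TARGET="/user/training/sqoop_import/retail_db/orders"         # hypothetical HDFS path

# Overwrite: delete the target directory if it exists, then import
import_overwrite() {
  sqoop import \
    --connect "${DB_CONNECT}" \
    --username "${DB_USER}" \
    --password "${DB_PASS}" \
    --table orders \
    --target-dir "${TARGET}" \
    --delete-target-dir
}

# Append: keep the existing files and add new ones next to them
import_append() {
  sqoop import \
    --connect "${DB_CONNECT}" \
    --username "${DB_USER}" \
    --password "${DB_PASS}" \
    --table orders \
    --target-dir "${TARGET}" \
    --append
}
```

Running the same import without either option against an existing directory reproduces the default failure.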