# Apache Sqoop – Import into Hive Tables and Export

As part of this session, let us understand the following topics related to Sqoop Hive import, import-all-tables, and Sqoop export.
* Create Database
* Simple Hive import
* Managing Hive tables
* Import all tables
* Data Engineering – Typical life cycle
* Sqoop Simple Export
* Sqoop Export – Column Mapping
* Sqoop Export – Upsert or Merge
* Sqoop Export – Staging Tables

### Create Database
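Before getting into Hive import, it is better to create a Hive database in which to explore the features of Sqoop import into Hive.
* Let us name the database bootcampdemo. Some of the examples below use dgadiraju_sqoop_import instead; substitute the name of your own database.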
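
A minimal sketch of this step, assuming the `hive` command-line client is installed on the node where you run Sqoop. The guard around the call is an addition for illustration, so that the snippet only prints the statement on a machine without Hive.

```shell
# Name of the Hive database used for the examples
HIVE_DB=bootcampdemo

# Create the database only if it is missing, so the command is safe to re-run
CREATE_DB_SQL="create database if not exists ${HIVE_DB}"

if command -v hive >/dev/null 2>&1; then
  hive -e "${CREATE_DB_SQL}"
else
  echo "hive client not found; run this in a Hive session: ${CREATE_DB_SQL}"
fi
```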

### Simple Hive Import
Let us see a simple Hive import.
* <mark>--hive-import</mark> enables Hive import. It creates the table if it does not already exist
* <mark>--hive-database</mark> can be used to specify the database
* Instead of <mark>--hive-database</mark>, we can prefix the table name with the database name in <mark>--hive-table</mark>

```shell
sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --hive-import \
  --hive-database dgadiraju_sqoop_import \
  --hive-table order_items \
  --num-mappers 2
```

### Managing Tables
* Default Hive import behavior
    * The table is created if it does not exist
    * If the table already exists, data is appended to it
* <mark>--create-hive-table</mark> makes the Hive import fail if the table already exists
* <mark>--hive-overwrite</mark> replaces the existing data with the new set of data
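
As a sketch of the overwrite behavior, the simple Hive import shown earlier can be re-run with <mark>--hive-overwrite</mark> so that it replaces the contents of order_items instead of appending to them. The connection details and database name are copied from that example; the function name and the guard that skips the run when the `sqoop` client is missing are additions for illustration only.

```shell
# Same import as the simple Hive import, plus --hive-overwrite so that
# existing rows in the Hive table are replaced rather than appended to.
run_overwrite_import() {
  sqoop import \
    --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
    --username retail_user \
    --password itversity \
    --table order_items \
    --hive-import \
    --hive-database dgadiraju_sqoop_import \
    --hive-table order_items \
    --hive-overwrite \
    --num-mappers 2
}

# Run the import only where the sqoop client is available
if command -v sqoop >/dev/null 2>&1; then
  run_overwrite_import
else
  echo "sqoop client not found; skipping import"
fi
```

Swapping <mark>--hive-overwrite</mark> for <mark>--create-hive-table</mark> in the same command should instead make the job fail when order_items already exists.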

### Import all tables
Sqoop provides the capability to import all the tables using import-all-tables.
* All the tables from a schema/database can be imported
* <mark>--exclude-tables</mark> takes a comma-separated list of tables that need not be imported
* <mark>--autoreset-to-one-mapper</mark> lets import-all-tables fall back to one mapper for any table that does not have a primary key
* Most of the features, such as <mark>--query</mark>, <mark>--boundary-query</mark>, <mark>--where</mark> etc., are not available with import-all-tables.

```shell
sqoop import-all-tables \
  --connect "jdbc:mysql://ms.itversity.com:3306/retail_db" \
  --username=retail_user \
  --password=itversity \
  --as-avrodatafile \
  --autoreset-to-one-mapper \
  --warehouse-dir=/user/itversity/sqoop_import
```
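
The exclusion option can be sketched in the same way. The variant below skips orders and order_items and imports every other table of retail_db. The warehouse directory `/user/itversity/sqoop_import_excl` is a hypothetical path, chosen so that it does not clash with directories written by an earlier run; the function name and the guard are also illustrative additions.

```shell
# Import every table of retail_db except the two listed in --exclude-tables.
# The warehouse directory is a hypothetical, previously unused path.
run_import_all_except_orders() {
  sqoop import-all-tables \
    --connect "jdbc:mysql://ms.itversity.com:3306/retail_db" \
    --username=retail_user \
    --password=itversity \
    --exclude-tables orders,order_items \
    --autoreset-to-one-mapper \
    --warehouse-dir=/user/itversity/sqoop_import_excl
}

# Run the import only where the sqoop client is available
if command -v sqoop >/dev/null 2>&1; then
  run_import_all_except_orders
else
  echo "sqoop client not found; skipping import"
fi
```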

### Data Engineering – Typical Life Cycle
* Ingest data from relational databases into HDFS using Sqoop import
* Process the data using MapReduce or Spark
* Export the processed data back to the databases that support the reporting layer

### Sqoop Simple Export
As part of this topic we will run a simple export with delimiters.
* Simple export: the following are the arguments we need to pass
    * <mark>--connect</mark> with the JDBC connect string. It should include the target database
    * <mark>--username</mark> and <mark>--password</mark>. The user should have the right permissions on the table into which data is being exported
    * <mark>--table</mark>, the target table in a relational database such as MySQL into which data needs to be copied
    * <mark>--export-dir</mark>, the HDFS directory from which data needs to be copied
* Delimiters
    * Sqoop by default expects “,” to be the field delimiter
    * But the Hive default delimiter is ASCII 1 (\001)
    * <mark>--input-fields-terminated-by</mark> can be used to pass a delimiting character other than “,”
* Number of mappers: we can increase or decrease the number of threads by using <mark>--num-mappers</mark> or <mark>-m</mark>

To demonstrate Sqoop simple export, we will perform the following steps.
* Create a table in MySQL
* Create Hive Table by joining orders and order_items
* Run Sqoop Export Command

***Create a table in MySQL***

Let us create the MySQL table into which data can be exported.
* Use the database retail_export if you want to create the tables and export the data

```sql
create table daily_revenue (
  order_date varchar(30),
  revenue float
);
```

***Create Hive Table***

We can export data from any HDFS directory. Even when we export data from a Hive table, we have to pass the underlying HDFS directory that the table points to.
* Join orders and order_items
* Compute revenue for each date
* Use CTAS to create a new table from the join results
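
The three steps above can be sketched as a single CTAS statement. This is only an illustration: it assumes that both orders and order_items have already been imported into the Hive database dgadiraju_sqoop_import (the database whose warehouse directory is used by the export command), and the script name `daily_revenue_ctas.hql` is hypothetical. The join and aggregation mirror the queries used later in this document.

```shell
# Write the HiveQL to a script file so that it can be reviewed and re-run
cat > daily_revenue_ctas.hql <<'EOF'
use dgadiraju_sqoop_import;

-- Join orders with order_items and total the revenue for each order date
create table daily_revenue as
select order_date, sum(order_item_subtotal) revenue
from orders join order_items on
order_id = order_item_order_id
group by order_date;
EOF

# Run the script only where the hive client is available
if command -v hive >/dev/null 2>&1; then
  hive -f daily_revenue_ctas.hql
else
  echo "hive client not found; run daily_revenue_ctas.hql in a Hive session"
fi
```

A table created this way with default settings is stored as text with the Hive default delimiter, which is why the export passes <mark>--input-fields-terminated-by</mark> with "\001".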

***Sqoop Export Command***

Let us perform a simple Sqoop export and understand its execution life cycle.
* Reads data from the export directory
* By default, Sqoop export uses 4 parallel threads (mappers) to read the data, using MapReduce split logic (based upon HDFS block size)
* Each thread establishes a database connection using the JDBC URL, username and password
* Generates insert statements to load data into the target table
* Issues the insert statements against the target table using the connection established per thread (or mapper)

```shell
sqoop export \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_export \
  --username retail_user \
  --password itversity \
  --export-dir /apps/hive/warehouse/dgadiraju_sqoop_import.db/daily_revenue \
  --table daily_revenue \
  --input-fields-terminated-by "\001"
```

### Sqoop Export – Column Mapping
Let us see the rationale behind column mapping while exporting the data.
* Sometimes the structure of data in HDFS and the structure of the MySQL table into which data needs to be exported do not match exactly
* We cannot change the order of columns in our input data, and we have to consume every column
* However, Sqoop export gives us the flexibility to map the columns to target table columns in the order of the data in HDFS. For example:
    * HDFS data structure: order_date and revenue
    * MySQL target table: revenue, order_date and description
    * There is no description in HDFS, and hence description in the target table should be nullable
    * <mark>--columns order_date,revenue</mark> will make sure the first field is populated into order_date and the second into revenue in the target table.

```sql
create table daily_revenue_demo (
  revenue float,
  order_date varchar(30),
  description varchar(200)
);
```

```shell
sqoop export \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_export \
  --username retail_user \
  --password itversity \
  --export-dir /apps/hive/warehouse/dgadiraju_sqoop_import.db/daily_revenue \
  --table daily_revenue_demo \
  --columns order_date,revenue \
  --input-fields-terminated-by "\001" \
  --num-mappers 1
```

### Sqoop Export – Upsert/Merge
As part of this topic we will see how we can upsert/merge data from HDFS into MySQL tables.
* First create a table in Hive and load the daily revenue data for 2013-07
* Create a table in MySQL
* Run the export to load the 2013-07 data
* Update the MySQL table, setting revenue to 0
* Run the export with <mark>--update-key</mark>, which updates the rows that match on the key
* Load the 2013-08 data into the Hive table
* Run the export with <mark>--update-key</mark> and <mark>--update-mode</mark> as <mark>allowinsert</mark>, which also inserts the rows that do not exist yet

```sql
-- Hive: table that holds the daily revenue
create table daily_revenue (
  order_date string,
  revenue float
);
```

```sql
-- Hive: load the daily revenue for 2013-07
insert into table daily_revenue
  select order_date, sum(order_item_subtotal) daily_revenue
  from orders join order_items on
  order_id = order_item_order_id
  where order_date like '2013-07%'
  group by order_date;
```

```sql
-- MySQL: target table, with order_date as the key used for updates
create table daily_revenue (
  order_date varchar(30) primary key,
  revenue float
);
```

```shell
sqoop export \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_export \
  --username retail_user \
  --password itversity \
  --export-dir /apps/hive/warehouse/bootcampdemo.db/daily_revenue \
  --table daily_revenue \
  --input-fields-terminated-by "\001" \
  --num-mappers 1
```

```sql
-- MySQL: reset the revenue so that the effect of the update is visible
update daily_revenue set revenue = 0;
```

```shell
sqoop export \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_export \
  --username retail_user \
  --password itversity \
  --export-dir /apps/hive/warehouse/bootcampdemo.db/daily_revenue \
  --table daily_revenue \
  --update-key order_date \
  --input-fields-terminated-by "\001" \
  --num-mappers 1
```

```sql
-- Hive: add the daily revenue for 2013-08
insert into table daily_revenue
  select order_date, sum(order_item_subtotal) daily_revenue
  from orders join order_items on
  order_id = order_item_order_id
  where order_date like '2013-08%'
  group by order_date;
```

```shell
sqoop export \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_export \
  --username retail_user \
  --password itversity \
  --export-dir /apps/hive/warehouse/bootcampdemo.db/daily_revenue \
  --table daily_revenue \
  --update-key order_date \
  --update-mode allowinsert \
  --input-fields-terminated-by "\001" \
  --num-mappers 1
```

### Sqoop Export – Stage tables
Let us understand the relevance of stage tables as part of Sqoop export.
* Data is read from HDFS, and insert or update statements are generated to load the data into MySQL tables
* If there are any issues with data that violate constraints defined on the table, Sqoop will retry up to 4 times before it gives up.
* Because of the retries, and because each mapper commits its data separately, the target table can be left in an inconsistent state, and quite often it is tedious to clean up the target table.
* To address this issue, instead of loading directly into the target table we can use a stage table
    * Data is first loaded into the stage table by Sqoop export
    * After the stage table is populated without any issues, the data is moved from the stage table to the final table in a single transaction.
    * The stage table must be empty before the export runs; <mark>--clear-staging-table</mark> tells Sqoop to delete any data already present in it

```sql
-- Hive: add more daily revenue data
insert into table daily_revenue
  select order_date, sum(order_item_subtotal) daily_revenue
  from orders join order_items on
  order_id = order_item_order_id
  where order_date > '2013-08'
  group by order_date;
```

```shell
sqoop export \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_export \
  --username retail_user \
  --password itversity \
  --export-dir /apps/hive/warehouse/bootcampdemo.db/daily_revenue \
  --table daily_revenue \
  --staging-table daily_revenue_stage \
  --input-fields-terminated-by "\001"
```
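
The export above assumes that daily_revenue_stage already exists in MySQL with the same structure as daily_revenue. One possible way to prepare it, and to run the export with <mark>--clear-staging-table</mark> so that leftovers from a failed run are removed first, is sketched below. The script name `daily_revenue_stage.sql`, the function name, and the guard are illustrative additions; the connection details are the ones used throughout this document.

```shell
# MySQL DDL for a staging table with the same structure as the target table
cat > daily_revenue_stage.sql <<'EOF'
create table if not exists daily_revenue_stage like daily_revenue;
EOF

# Export through the staging table, emptying it before the load starts
run_staged_export() {
  sqoop export \
    --connect jdbc:mysql://ms.itversity.com:3306/retail_export \
    --username retail_user \
    --password itversity \
    --export-dir /apps/hive/warehouse/bootcampdemo.db/daily_revenue \
    --table daily_revenue \
    --staging-table daily_revenue_stage \
    --clear-staging-table \
    --input-fields-terminated-by "\001"
}

# Create the staging table and run the export only where both clients exist
if command -v mysql >/dev/null 2>&1 && command -v sqoop >/dev/null 2>&1; then
  mysql -h ms.itversity.com -u retail_user --password=itversity retail_export \
    < daily_revenue_stage.sql && run_staged_export
else
  echo "mysql or sqoop client not found; skipping the staged export"
fi
```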