# Apache Sqoop – Importing data to HDFS – Contd…

Now that we understand the basics of Sqoop and Sqoop import, let us dive deeper into other common features of Sqoop import.
* Customizing Split logic
* Auto reset to one mapper
* File Formats and Compression
* Filtering of Data
* Delimiters and Handling Nulls
* Incremental Loads

### Customizing Split logic
By default, Sqoop import uses 4 mappers and data will be divided into mutually exclusive subsets using primary key fields (as demonstrated earlier). Now let us see how to handle scenarios where we want to use non-primary key fields to split the data.
* The default number of mappers is 4; it can be changed with <mark>--num-mappers</mark>
* Split logic is applied on the primary key if one exists
* If there is no primary key and the number of mappers is more than 1, sqoop import will fail
* In that case we can use <mark>--split-by</mark> to split on a non-key column, explicitly set <mark>--num-mappers</mark> to 1, or use <mark>--autoreset-to-one-mapper</mark>
* If the primary key column or the column specified with <mark>--split-by</mark> is of a non-numeric type, we need to pass the additional argument <mark>-Dorg.apache.sqoop.splitter.allow_text_splitter=true</mark> right after sqoop import, before the other arguments
* It is quite common for large tables to have no primary key or unique key, and importing such a table with a single mapper might not be feasible.
* In that case we can specify a column using <mark>--split-by</mark>
    * It is a good idea to use an indexed column with <mark>--split-by</mark>; otherwise, each mapper might end up doing a full table scan.
    * If there are null values in the column, corresponding records from the table will be ignored
    * Data in the split-by column need not be unique, but if there are duplicates then there can be a skew in the data while importing (which means some files might be relatively bigger compared to other files)


sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items_nopk \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
  --split-by order_item_order_id
  
#Splitting on text field
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table orders \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
  --split-by order_status
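
Before choosing a <mark>--split-by</mark> column, it can help to check how many nulls and distinct values it holds, since rows with nulls are skipped and duplicates can cause skew. The following is a minimal sketch and not part of Sqoop: `split_check_sql` is a hypothetical helper that only prints a profiling query for a given table and column.

```shell
# Hypothetical helper: print a query that profiles a candidate --split-by column.
# Table and column names are inserted as-is, so pass trusted identifiers only.
split_check_sql() (
  table=$1
  column=$2
  printf 'SELECT COUNT(*) AS total_rows, COUNT(*) - COUNT(%s) AS null_rows, COUNT(DISTINCT %s) AS distinct_values, MIN(%s) AS min_value, MAX(%s) AS max_value FROM %s\n' \
    "$column" "$column" "$column" "$column" "$table"
)

# Print the profiling query for the table used in the first example above.
split_check_sql order_items_nopk order_item_order_id
```

The printed statement could then be run against the source database, for example with `sqoop eval` and its `--query` argument, using the same connection arguments as the imports above. A large `null_rows` count or a small `distinct_values` count relative to `total_rows` suggests the column is a poor choice for splitting.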

### Auto Reset to One Mapper
Now let us understand how to handle tables that do not have a primary key, where the entire data has to be loaded using one mapper.
* Some tables might not have a primary key
* If we are not sure whether a table has a primary key and we request more than 1 mapper, sqoop import will fail for those tables that have no primary key
* One way to address this issue is to use <mark>--autoreset-to-one-mapper</mark>. If the table has a primary key, sqoop import uses it to split the data across the requested number of mappers; otherwise it falls back to one mapper.
* This comes in handy in automation, where the same command is run against many tables.

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items_nopk \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
  --autoreset-to-one-mapper
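
As an illustration of the automation case, the sketch below loops over a list of tables and prints one import command per table, each with <mark>--autoreset-to-one-mapper</mark>. It is a dry run under assumptions: `print_import_commands` is a hypothetical helper, and the table list is simply the retail_db tables used in this article.

```shell
# Tables from retail_db used in this article; order_items_nopk is the one
# imported above as an example of a table without a primary key.
TABLES="orders order_items order_items_nopk"

# Hypothetical helper: print one import command per table (dry run).
# Remove the leading "echo" to actually run the imports.
print_import_commands() (
  for table in $TABLES; do
    echo sqoop import \
      --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
      --username retail_user \
      --password itversity \
      --table "$table" \
      --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
      --num-mappers 4 \
      --autoreset-to-one-mapper
  done
)

print_import_commands
```

With this flag the same command line can be reused for every table: tables with a primary key are split across 4 mappers, and the others are imported with one. Note that, per the Sqoop user guide, this flag cannot be combined with <mark>--split-by</mark>.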

### File Formats and Compression
Let us understand more about file formats and compression as part of Sqoop import.
* Supported File Formats – Text file, Sequence file, Avro, Parquet etc
    * Text file (default) <mark>--as-textfile</mark>
    * Sequence file <mark>--as-sequencefile</mark>
    * Avro <mark>--as-avrodatafile</mark>
    * Parquet <mark>--as-parquetfile</mark>

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
  --num-mappers 2 \
  --as-sequencefile

* Supported compression algorithms – snappy, gzip etc
    * Go to **/etc/hadoop/conf** and check **core-site.xml** for supported compression codecs
    * Use <mark>--compress</mark> to enable compression
    * If compression codec is not specified, it will use gzip by default
    * Compression algorithm can be specified using <mark>--compression-codec</mark>

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
  --num-mappers 2 \
  --as-textfile \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.GzipCodec

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
  --num-mappers 2 \
  --as-textfile \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec
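
To see which codecs a cluster lists in **core-site.xml**, something like the following sketch can be used. It assumes the codecs are declared under the `io.compression.codecs` property; `list_codecs` is a hypothetical helper, and it prints nothing when that property is not set in the file.

```shell
# Hypothetical helper: print the codec classes listed under
# io.compression.codecs in a core-site.xml file, one per line.
list_codecs() (
  conf=${1:-/etc/hadoop/conf/core-site.xml}
  # Join the file into one line, extract the <value> that follows the
  # property name, then split the comma-separated list and trim blanks.
  tr -d '\n' < "$conf" |
    sed -n 's|.*<name>io\.compression\.codecs</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p' |
    tr ',' '\n' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
)

# Only run against the default location when the file is readable.
if [ -r /etc/hadoop/conf/core-site.xml ]; then
  list_codecs
fi
```

Any class name printed here can be passed to <mark>--compression-codec</mark>, as in the Gzip and Snappy examples above.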

### Filtering of Data
Now let us see how we can filter the data using Sqoop Import.
* Using boundary-query
    * Let us recap the typical import life cycle
        * Using the primary key field, get the min and max values (boundary query)
        * Compute ranges based on the number of mappers (splits)
        * Establish a database connection for each split and issue a query to read the data
        * Get the data and write it to HDFS
    * Using --boundary-query
        * We can avoid issuing a query to get min and max values by hard-coding them (if we know the values up front)
        * We can address the issue of outliers by narrowing down using where clause in --boundary-query

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
  --boundary-query 'select 100000, 172198'
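
To make the effect of the boundary values concrete, here is a simplified sketch of how a min/max pair can be divided into one range per mapper. It is an approximation for illustration only, not Sqoop's exact splitting algorithm; `split_ranges` is a hypothetical helper, and it assumes integer keys and a range at least as large as the number of mappers.

```shell
# Hypothetical helper: divide [min, max] into one contiguous range per mapper.
# Usage: split_ranges COLUMN MIN MAX NUM_MAPPERS
split_ranges() (
  column=$1
  min=$2
  max=$3
  mappers=$4
  total=$(( $max - $min + 1 ))      # number of key values to cover
  size=$(( $total / $mappers ))     # base width of each range
  extra=$(( $total % $mappers ))    # leftover values, given to the first ranges
  lo=$min
  i=0
  while [ "$i" -lt "$mappers" ]; do
    hi=$(( $lo + $size - 1 ))
    if [ "$i" -lt "$extra" ]; then
      hi=$(( $hi + 1 ))
    fi
    echo "$column >= $lo AND $column <= $hi"
    lo=$(( $hi + 1 ))
    i=$(( $i + 1 ))
  done
)

# Ranges for the hard-coded boundaries used in the command above.
split_ranges order_item_id 100000 172198 4
```

Each printed condition corresponds to the filter one mapper would apply. Because the ranges start at the lower boundary, rows with keys outside the boundary values are not read at all, which is why <mark>--boundary-query</mark> also acts as a filter.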

* If we want to select columns from the table while import we can use **--columns**

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --columns order_item_order_id,order_item_id,order_item_subtotal \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db \
  --num-mappers 2

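* We can pass a custom query instead of a table using **--query**
* When **--query** is used we need to specify **--split-by** or set **--num-mappers** to 1
* The query must contain the **$CONDITIONS** placeholder, which Sqoop replaces with the range condition for each split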

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --target-dir /user/dgadiraju/sqoop_import/retail_db/orders_with_revenue \
  --num-mappers 2 \
  --query "select o.*, sum(oi.order_item_subtotal) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id and \$CONDITIONS group by o.order_id, o.order_date, o.order_customer_id, o.order_status" \
  --split-by order_id

### Delimiters and Handling Nulls
Let us see how to handle nulls and delimiters while saving data to HDFS.
* In a traditional RDBMS, nulls and delimiters are managed internally, so the complexity is usually hidden from us.
* When we save data in HDFS, it is stored as regular files, so it is our responsibility to deal with null values and delimiters.
* By default, for both string and non-string columns, Sqoop writes the text null as the placeholder for null values in the database.
* We can provide custom values by using <mark>--null-string</mark> and <mark>--null-non-string</mark>
* <mark>--null-non-string</mark> is typically used for numeric fields
* Sqoop uses comma as default delimiter when data is written to HDFS.
* There are several control arguments to deal with delimiters
    * <mark>--fields-terminated-by</mark> – to specify a custom field delimiter
    * <mark>--lines-terminated-by</mark> – to specify a custom line delimiter
    * <mark>--enclosed-by</mark> – to specify an enclosing character
    * <mark>--escaped-by</mark> – to specify an escape character
    * <mark>--mysql-delimiters</mark> – to use default MySQL delimiters
    * <mark>--optionally-enclosed-by</mark> – to enclose only those fields that contain delimiter characters

#Default behavior
sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/hr_db \
  --username hr_user \
  --password itversity \
  --table employees \
  --warehouse-dir /user/dgadiraju/sqoop_import/hr_db
  
#Changing default delimiters and nulls
sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/hr_db \
  --username hr_user \
  --password itversity \
  --table employees \
  --warehouse-dir /user/dgadiraju/sqoop_import/hr_db \
  --null-non-string -1 \
  --fields-terminated-by "\000" \
  --lines-terminated-by ":"
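
Once the data is in plain files, the placeholders are just text, so it is worth checking how they show up. The sketch below counts how many records carry a given placeholder in a given field. `count_placeholder` is a hypothetical helper and the sample records are made up for illustration; they are not taken from hr_db.

```shell
# Hypothetical helper: count records whose field equals the placeholder.
# Usage: count_placeholder FIELD_NUMBER PLACEHOLDER DELIMITER  (reads stdin)
count_placeholder() (
  awk -F "$3" -v col="$1" -v ph="$2" '$col == ph { n++ } END { print n + 0 }'
)

# Made-up sample in the default layout: comma-delimited, nulls written as "null".
# Fields: id, name, commission
printf '%s\n' '100,Steven,null' '101,Neena,0.25' '102,null,null' |
  count_placeholder 3 null ,
```

In the made-up sample, the third record has null in the name field, and from the file alone there is no way to tell a missing name from a person whose name is the text null. That ambiguity is the reason to choose placeholders with <mark>--null-string</mark> and <mark>--null-non-string</mark> that cannot occur in the real data.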

### Incremental Loads
We can perform incremental imports using different approaches.
* <mark>--query</mark> – we can pass a query with the complete logic, as we have seen before. We have to specify <mark>--split-by</mark> along with the query to import in parallel.
* <mark>--where</mark> – we can filter data based on a date field or the primary key field to get an incremental load
* Sqoop incremental load arguments
    * <mark>--check-column</mark> – to specify the column based on which we want to perform the incremental load
    * <mark>--incremental</mark> – to specify the mode, either append or lastmodified (typically used to handle updates in a table)
    * <mark>--last-value</mark> – to get rows with values greater than this for the column specified in <mark>--check-column</mark>

#Baseline import
sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --target-dir /user/dgadiraju/sqoop_import/retail_db/orders \
  --num-mappers 2 \
  --query "select * from orders where \$CONDITIONS and order_date like '2013-%'" \
  --split-by order_id

#Query can be used to load data based on condition
sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --target-dir /user/dgadiraju/sqoop_import/retail_db/orders \
  --num-mappers 2 \
  --query "select * from orders where \$CONDITIONS and order_date like '2014-01%'" \
  --split-by order_id \
  --append
  
#where in conjunction with table can be used to get data based upon a condition
sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --target-dir /user/dgadiraju/sqoop_import/retail_db/orders \
  --num-mappers 2 \
  --table orders \
  --where "order_date like '2014-02%'" \
  --append

#Incremental load using arguments specific to incremental load
sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --target-dir /user/dgadiraju/sqoop_import/retail_db/orders \
  --num-mappers 2 \
  --table orders \
  --check-column order_date \
  --incremental append \
  --last-value '2014-02-28'
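
For repeated runs, the value passed to <mark>--last-value</mark> has to be remembered somewhere between imports. One possible approach, sketched below under assumptions, is to keep it in a small state file. The file name and the `next_incremental_command` helper are hypothetical, and the helper only prints the command (dry run); writing the new value to the file after a successful import is left to the reader.

```shell
# Hypothetical state file that stores the highest value imported so far.
STATE_FILE=${STATE_FILE:-./orders.last_value}

# Hypothetical helper: print the import command for the next incremental run.
# $1 = value to use when the state file is missing or empty.
next_incremental_command() (
  if [ -s "$STATE_FILE" ]; then
    last_value=$(cat "$STATE_FILE")
  else
    last_value=$1
  fi
  echo "sqoop import --connect jdbc:mysql://ms.itversity.com:3306/retail_db" \
    "--username retail_user --password itversity" \
    "--target-dir /user/dgadiraju/sqoop_import/retail_db/orders" \
    "--num-mappers 2 --table orders" \
    "--check-column order_date --incremental append" \
    "--last-value '$last_value'"
)

# First run: no state file yet, so the starting value from the example is used.
next_incremental_command 2014-02-28
```

On later runs the same call picks up whatever value was saved in the state file, so the command line itself does not need to be edited each time.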