## Copying files from HDFS to HDFS

Let us understand how to copy files within HDFS (from one HDFS location to another).

* We can use the `hdfs dfs -cp` command to copy files within HDFS.
* One needs at least read permission on the source folders or files and write permission on the target folder for the `cp` command to work as expected.
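
The permission requirement can be checked before attempting a copy. The following is a minimal sketch, assuming a Hadoop release in which `hdfs dfs -test` accepts the `-r` and `-w` options; the function name `can_copy` is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: succeeds only when the current user can read the
# source and write to the target folder.
# Assumption: this Hadoop release supports `hdfs dfs -test -r` and `-w`.
can_copy() {
  src="$1"
  dst_dir="$2"
  hdfs dfs -test -r "$src" || { echo "no read access: $src"; return 1; }
  hdfs dfs -test -w "$dst_dir" || { echo "no write access: $dst_dir"; return 1; }
  echo "ok: $src -> $dst_dir"
}

# Run the check only where the hdfs client is installed.
if command -v hdfs >/dev/null 2>&1; then
  can_copy /public/retail_db "/user/${USER}"
fi
```

The check is read-only, so it can be run safely before any of the `cp` commands in this section.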

In [1]:
%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db

Deleted /user/itv002461/retail_db


In [2]:
%%sh

hdfs dfs -ls /public/retail_db

Found 7 items
drwxr-xr-x   - hdfs supergroup          0 2021-01-28 08:49 /public/retail_db/categories
drwxr-xr-x   - hdfs supergroup          0 2021-01-28 08:59 /public/retail_db/customers
drwxr-xr-x   - hdfs supergroup          0 2021-01-28 09:44 /public/retail_db/departments
drwxr-xr-x   - hdfs supergroup          0 2021-01-28 09:01 /public/retail_db/order_items
drwxr-xr-x   - hdfs supergroup          0 2021-01-28 09:27 /public/retail_db/orders
drwxr-xr-x   - hdfs supergroup          0 2021-01-28 08:54 /public/retail_db/products
-rw-r--r--   3 hdfs supergroup       4965 2021-08-21 03:48 /public/retail_db/wordcount.rtf


In [3]:
%%sh

hdfs dfs -ls /user/${USER}

Found 3 items
drwx------   - itv002461 supergroup          0 2022-05-25 04:05 /user/itv002461/.Trash
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 05:30 /user/itv002461/.sparkStaging
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 04:11 /user/itv002461/warehouse


In [4]:
%%sh

hdfs dfs -help cp

-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst> :
  Copy files that match the file pattern <src> to a destination.  When copying
  multiple files, the destination must be a directory. Passing -p preserves status
  [topax] (timestamps, ownership, permission, ACLs, XAttr). If -p is specified
  with no <arg>, then preserves timestamps, ownership, permission. If -pa is
  specified, then preserves permission also because ACL is a super-set of
  permission. Passing -f overwrites the destination if it already exists. raw
  namespace extended attributes are preserved if (1) they are supported (HDFS
  only) and, (2) all of the source and target pathnames are in the /.reserved/raw
  hierarchy. raw namespace xattr preservation is determined solely by the presence
  (or absence) of the /.reserved/raw prefix and not by the -p option. Passing -d
  will skip creation of temporary file(<dst>._COPYING_).
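
To make the flags in the help text concrete, here is a small sketch that combines `-f` (overwrite the destination if it exists) with `-pt` (preserve timestamps only). The function name `copy_overwrite` is hypothetical, and preserving only timestamps is an illustrative choice, not something the cluster requires.

```shell
# Hypothetical helper: a re-runnable copy.
#   -f   overwrites the destination if it already exists
#   -pt  preserves only the timestamps of the source
# Both flags come from the `hdfs dfs -help cp` text; the paths passed in
# are up to the caller.
copy_overwrite() {
  hdfs dfs -cp -f -pt "$1" "$2"
}

# Example call (commented out so that defining the helper has no side effects):
# copy_overwrite /public/retail_db/wordcount.rtf "/user/${USER}"
```

Because of `-f`, running the same call twice should not fail on the second attempt with a "file exists" error.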


* Let us create a directory under the user space to store the copied folders and files. You can review the permissions on `retail_db`; the user has write permission on the target folder.

In [5]:
%%sh

hdfs dfs -mkdir /user/${USER}/retail_db

In [6]:
%%sh

hdfs dfs -ls /user/${USER}

Found 4 items
drwx------   - itv002461 supergroup          0 2022-05-25 04:05 /user/itv002461/.Trash
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 05:30 /user/itv002461/.sparkStaging
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:21 /user/itv002461/retail_db
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 04:11 /user/itv002461/warehouse


In [None]:
%%sh

hdfs dfs -cp /public/retail_db/* /user/${USER}/retail_db

In [None]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

```{note}
The following command will fail because the `retail_db` folder already exists under `/user/${USER}`.
```

In [None]:
%%sh

hdfs dfs -cp /public/retail_db /user/${USER}

```{note}
Alternative approach: remove the existing target folder, then copy the source folder along with its contents in a single command.
```
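
The remove-then-copy sequence can also be wrapped in one step. This is a sketch under stated assumptions: `fresh_copy` is a hypothetical helper name, and it relies on `hdfs dfs -test -d` to detect an existing target folder before copying.

```shell
# Hypothetical helper: copy a folder into a parent folder, first removing any
# existing folder of the same name so that the copy does not fail.
# WARNING: -skipTrash deletes the existing target permanently.
fresh_copy() {
  src="$1"
  parent="$2"
  target="${parent}/$(basename "$src")"
  if hdfs dfs -test -d "$target"; then
    hdfs dfs -rm -R -skipTrash "$target"
  fi
  hdfs dfs -cp "$src" "$parent"
}

# Run only where the hdfs client is installed.
if command -v hdfs >/dev/null 2>&1; then
  fresh_copy /public/retail_db "/user/${USER}"
fi
```

Deleting before copying keeps the result identical to the source, at the cost of losing anything that existed only in the old target folder.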

In [7]:
%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db

Deleted /user/itv002461/retail_db


In [8]:
%%sh

hdfs dfs -ls /user/${USER}

Found 3 items
drwx------   - itv002461 supergroup          0 2022-05-25 04:05 /user/itv002461/.Trash
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 05:30 /user/itv002461/.sparkStaging
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 04:11 /user/itv002461/warehouse


In [9]:
%%sh

hdfs dfs -cp /public/retail_db /user/${USER}

In [10]:
%%sh

hdfs dfs -ls -R /user/${USER}/retail_db

drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:29 /user/itv002461/retail_db/categories
-rw-r--r--   3 itv002461 supergroup       1029 2022-05-25 10:29 /user/itv002461/retail_db/categories/part-00000
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:29 /user/itv002461/retail_db/customers
-rw-r--r--   3 itv002461 supergroup     953719 2022-05-25 10:29 /user/itv002461/retail_db/customers/part-00000
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:29 /user/itv002461/retail_db/departments
-rw-r--r--   3 itv002461 supergroup         60 2022-05-25 10:29 /user/itv002461/retail_db/departments/part-00000
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:29 /user/itv002461/retail_db/order_items
-rw-r--r--   3 itv002461 supergroup    5408880 2022-05-25 10:29 /user/itv002461/retail_db/order_items/part-00000
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:29 /user/itv002461/retail_db/orders
-rw-r--r--   3 itv002461 supergroup    2999944 20

* We can also use glob patterns with the `cp` command to copy files within HDFS, and we can pass multiple source files or folders to a single `cp` command.
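
Before copying with a pattern, it can help to preview what the pattern matches. The sketch below assumes `hdfs dfs -ls -d` lists the matching paths themselves rather than their contents; `copy_matching` is a hypothetical helper name.

```shell
# Hypothetical helper: list what a pattern matches, then copy the matches.
# The pattern is passed in quotes so that HDFS, not the local shell, expands it.
# If the listing fails (for example, nothing matches), the copy is skipped.
copy_matching() {
  pattern="$1"
  target="$2"
  hdfs dfs -ls -d "$pattern" && hdfs dfs -cp "$pattern" "$target"
}

# Example call (commented out; the target folder must already exist):
# copy_matching '/public/retail_db/order*' "/user/${USER}/retail_db"
```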

In [11]:
%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db

Deleted /user/itv002461/retail_db


In [12]:
%%sh

hdfs dfs -ls /user/${USER}

Found 3 items
drwx------   - itv002461 supergroup          0 2022-05-25 04:05 /user/itv002461/.Trash
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 05:30 /user/itv002461/.sparkStaging
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 04:11 /user/itv002461/warehouse


In [13]:
%%sh

hdfs dfs -mkdir /user/${USER}/retail_db

In [14]:
%%sh

hdfs dfs -cp /public/retail_db/order* /user/${USER}/retail_db

In [15]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 2 items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/order_items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/orders


In [16]:
%%sh

hdfs dfs -cp /public/retail_db/departments /public/retail_db/products /user/${USER}/retail_db

In [17]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 4 items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/departments
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/order_items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/orders
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/products


In [18]:
%%sh

hdfs dfs -cp /public/retail_db/categories /public/retail_db/customers /user/${USER}/retail_db

In [19]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 6 items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/categories
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/customers
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/departments
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/order_items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/orders
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:30 /user/itv002461/retail_db/products
