## Copying files from local to HDFS

We can copy files from the local file system to HDFS using either the `copyFromLocal` or the `put` command.

* `hdfs dfs -copyFromLocal` or `hdfs dfs -put` copies files or directories from the local file system into HDFS. We can also use `hadoop fs` in place of `hdfs dfs`.
* Files in HDFS cannot be edited in place. If we have to fix any data, we have to copy the file to the local file system, fix the data there, and then copy the file back to HDFS.
* Each file is split into blocks based on the block size. The blocks are stored on DataNodes in a distributed fashion, and each block is replicated according to the replication factor. We will get into the details later.
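
The fix-and-copy-back workflow from the second bullet can be sketched as a small shell function. This is a minimal sketch, not a hardened tool: `fix_hdfs_file` is a hypothetical helper name, the `sed` expression stands in for whatever fix is needed, and the path in the example is a placeholder. It relies only on `hdfs dfs -get` and `hdfs dfs -put -f`.

```shell
# Apply a fix to a file stored in HDFS: download it, edit the local copy,
# then upload the result with overwrite enabled.
# $1 is the HDFS file path; $2 is a sed expression describing the fix (illustrative).
fix_hdfs_file() {
  hdfs_path=$1
  sed_expr=$2
  workdir=$(mktemp -d)
  local_copy="$workdir/$(basename "$hdfs_path")"

  hdfs dfs -get "$hdfs_path" "$local_copy" &&
    sed "$sed_expr" "$local_copy" > "$local_copy.fixed" &&
    hdfs dfs -put -f "$local_copy.fixed" "$hdfs_path"
  status=$?

  # Clean up the scratch directory whether or not the upload succeeded.
  rm -r "$workdir"
  return "$status"
}

# Example (not run here; the path and expression are placeholders):
#   fix_hdfs_file "/user/${USER}/retail_db/README.md" 's/teh/the/g'
```

The `-f` flag on `put` is what allows the corrected copy to replace the file that already exists in HDFS.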

![Anatomy of a file write in HDFS](https://s3.amazonaws.com/kaizen.itversity.com/hadoop-overview/04HDFSAnatomyOfFileWrite.png)
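
To make the block and replication idea concrete, the sketch below estimates how many blocks a file occupies and how many raw bytes its replicas consume. The 128 MiB block size is an assumption (a common default for `dfs.blocksize`; a cluster may be configured differently). The replication factor of 3 is taken from the second column of the `hdfs dfs -ls` listings in this section. `block_count` and `raw_bytes` are hypothetical helper names.

```shell
# Assumed block size: 128 MiB (common default for dfs.blocksize; check your cluster).
BLOCK_SIZE=$((128 * 1024 * 1024))
# Replication factor as shown in the listings of this section.
REPLICATION=3

# Number of blocks needed for a file of the given size in bytes (ceiling division).
block_count() {
  echo $(( ($1 + $BLOCK_SIZE - 1) / $BLOCK_SIZE ))
}

# Raw bytes stored across the cluster once every block is replicated.
raw_bytes() {
  echo $(( $1 * $REPLICATION ))
}

# create_db.sql under /data/retail_db is 10303297 bytes.
echo "blocks: $(block_count 10303297)"
echo "raw bytes: $(raw_bytes 10303297)"
```

Under these assumptions the script prints `blocks: 1` and `raw bytes: 30909891`: the file fits in a single block, and that block is stored three times.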

In [1]:
%%sh

hdfs dfs -ls /user/${USER}

Found 3 items
drwx------   - itv002461 supergroup          0 2022-05-25 04:05 /user/itv002461/.Trash
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 05:30 /user/itv002461/.sparkStaging
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 04:11 /user/itv002461/warehouse


In [2]:
%%sh

hdfs dfs -mkdir /user/${USER}/retail_db

In [3]:
%%sh

hdfs dfs -ls /user/${USER}

Found 4 items
drwx------   - itv002461 supergroup          0 2022-05-25 04:05 /user/itv002461/.Trash
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 05:30 /user/itv002461/.sparkStaging
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:22 /user/itv002461/retail_db
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 04:11 /user/itv002461/warehouse


In [4]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

In [5]:
%%sh

hdfs dfs -help put

-put [-f] [-p] [-l] [-d] <localsrc> ... <dst> :
  Copy files from the local file system into fs. Copying fails if the file already
  exists, unless the -f flag is given.
  Flags:
                                                                       
  -p  Preserves access and modification times, ownership and the mode. 
  -f  Overwrites the destination if it already exists.                 
  -l  Allow DataNode to lazily persist the file to disk. Forces        
         replication factor of 1. This flag will result in reduced
         durability. Use with care.
                                                        
  -d  Skip creation of temporary file(<dst>._COPYING_). 


In [6]:
%%sh

hdfs dfs -help copyFromLocal

-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst> :
  Copy files from the local file system into fs. Copying fails if the file already
  exists, unless the -f flag is given.
  Flags:
                                                                                 
  -p                 Preserves access and modification times, ownership and the  
                     mode.                                                       
  -f                 Overwrites the destination if it already exists.            
  -t <thread count>  Number of threads to be used, default is 1.                 
  -l                 Allow DataNode to lazily persist the file to disk. Forces   
                     replication factor of 1. This flag will result in reduced   
                     durability. Use with care.                                  
  -d                 Skip creation of temporary file(<dst>._COPYING_).           


```{warning}
The `put` command below copies the entire folder into `/user/${USER}/retail_db`, so the files end up under `/user/${USER}/retail_db/retail_db`. The commands that follow it show how to get the files into the expected location.
```
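
The nesting described in the warning follows from how `put` resolves its destination: when the destination already exists as a directory, the source is placed inside it under its own name; when it does not exist, the destination path itself becomes the copy. The sketch below models that rule locally without touching HDFS. `resolve_put_target` is a hypothetical helper, and its third argument stands in for the answer `hdfs dfs -test -d` would give.

```shell
# Predict where `hdfs dfs -put <src> <dst>` places a single source directory.
# $3 is "yes" when <dst> already exists as a directory, "no" otherwise.
resolve_put_target() {
  src=${1%/}
  dst=${2%/}
  if [ "$3" = "yes" ]; then
    # Existing directory: the source is copied inside it, keeping its name.
    echo "$dst/$(basename "$src")"
  else
    # Missing destination: the copy is created at the destination path itself.
    echo "$dst"
  fi
}

resolve_put_target /data/retail_db /user/itv002461/retail_db yes
resolve_put_target /data/retail_db /user/itv002461/retail_db no
```

The first call prints `/user/itv002461/retail_db/retail_db` and the second prints `/user/itv002461/retail_db`, which is why the pre-created target folder leads to the extra level.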

In [7]:
%%sh

ls -ltr /data/retail_db

total 20156
-rw-r--r-- 1 root root      806 Jan 21  2021 README.md
drwxr-xr-x 2 root root     4096 Jan 21  2021 products
drwxr-xr-x 2 root root     4096 Jan 21  2021 orders
drwxr-xr-x 2 root root     4096 Jan 21  2021 order_items
-rw-r--r-- 1 root root 10297372 Jan 21  2021 load_db_tables_pg.sql
drwxr-xr-x 2 root root     4096 Jan 21  2021 departments
drwxr-xr-x 2 root root     4096 Jan 21  2021 customers
-rw-r--r-- 1 root root     1748 Jan 21  2021 create_db_tables_pg.sql
-rw-r--r-- 1 root root 10303297 Jan 21  2021 create_db.sql
drwxr-xr-x 2 root root     4096 Jan 21  2021 categories


In [9]:
%%sh

hdfs dfs -put /data/retail_db /user/${USER}/retail_db

put: `/user/itv002461/retail_db/retail_db/departments/part-00000': File exists
put: `/user/itv002461/retail_db/retail_db/products/part-00000': File exists
put: `/user/itv002461/retail_db/retail_db/README.md': File exists
put: `/user/itv002461/retail_db/retail_db/create_db_tables_pg.sql': File exists
put: `/user/itv002461/retail_db/retail_db/load_db_tables_pg.sql': File exists
put: `/user/itv002461/retail_db/retail_db/categories/part-00000': File exists
put: `/user/itv002461/retail_db/retail_db/create_db.sql': File exists
put: `/user/itv002461/retail_db/retail_db/orders/part-00000': File exists
put: `/user/itv002461/retail_db/retail_db/customers/part-00000': File exists
put: `/user/itv002461/retail_db/retail_db/order_items/part-00000': File exists
put: `/user/itv002461/retail_db/retail_db/.git/packed-refs': File exists
put: `/user/itv002461/retail_db/retail_db/.git/HEAD': File exists
put: `/user/itv002461/retail_db/retail_db/.git/description': File exists
put: `/user/itv002461/retail_db

CalledProcessError: Command 'b'\nhdfs dfs -put /data/retail_db /user/${USER}/retail_db\n'' returned non-zero exit status 1.

In [10]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 1 items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:25 /user/itv002461/retail_db/retail_db


In [11]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/retail_db

Found 11 items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:25 /user/itv002461/retail_db/retail_db/.git
-rw-r--r--   3 itv002461 supergroup        806 2022-05-25 08:25 /user/itv002461/retail_db/retail_db/README.md
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:25 /user/itv002461/retail_db/retail_db/categories
-rw-r--r--   3 itv002461 supergroup   10303297 2022-05-25 08:25 /user/itv002461/retail_db/retail_db/create_db.sql
-rw-r--r--   3 itv002461 supergroup       1748 2022-05-25 08:25 /user/itv002461/retail_db/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:25 /user/itv002461/retail_db/retail_db/customers
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:25 /user/itv002461/retail_db/retail_db/departments
-rw-r--r--   3 itv002461 supergroup   10297372 2022-05-25 08:25 /user/itv002461/retail_db/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:25 /user/itv0024

```{note}
Let us delete the nested folder and copy the files again so that they land where expected. As the target folder `/user/${USER}/retail_db` already exists, we can use patterns to copy the files and subfolders into it.
```
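
Patterns such as `/data/retail_db/order*` are expanded by the local shell before `hdfs` runs, so `put` simply receives a list of paths. The sketch below reproduces that expansion in a scratch directory; the layout is a stand-in that mirrors a few of the names under `/data/retail_db`, and nothing is copied to HDFS. It also shows that a bare `*` skips hidden entries such as `.git`, which is consistent with the listings in this section showing 10 items after a `/*` copy and 11 items after a whole-folder copy.

```shell
# Build a scratch directory that mirrors some entry names under /data/retail_db.
scratch=$(mktemp -d)
mkdir "$scratch/.git" "$scratch/orders" "$scratch/order_items" "$scratch/products"
touch "$scratch/README.md"

# The local shell expands the pattern; the command only sees the resulting paths.
matches_order=$(cd "$scratch" && echo order*)
matches_all=$(cd "$scratch" && echo *)

echo "order* -> $matches_order"
echo "*      -> $matches_all"

rm -r "$scratch"
```

Here `order*` expands to `order_items orders`, and `.git` is absent from the expansion of `*`.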

In [12]:
%%sh

hdfs dfs -help rm

-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ... :
  Delete all files that match the specified file pattern. Equivalent to the Unix
  command "rm <src>"
                                                                                 
  -f          If the file does not exist, do not display a diagnostic message or 
              modify the exit status to reflect an error.                        
  -[rR]       Recursively deletes directories.                                   
  -skipTrash  option bypasses trash, if enabled, and immediately deletes <src>.  
  -safely     option requires safety confirmation, if enabled, requires          
              confirmation before deleting large directory with more than        
              <hadoop.shell.delete.limit.num.files> files. Delay is expected when
              walking over large directory recursively to count the number of    
              files to be deleted before the confirmation.                       


In [13]:
%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db/retail_db

Deleted /user/itv002461/retail_db/retail_db


In [14]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/

In [15]:
%%sh

hdfs dfs -put /data/retail_db/order* /user/${USER}/retail_db

In [16]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/

Found 2 items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:29 /user/itv002461/retail_db/order_items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:29 /user/itv002461/retail_db/orders


In [18]:
%%sh

hdfs dfs -put -f /data/retail_db/* /user/${USER}/retail_db

In [19]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/

Found 10 items
-rw-r--r--   3 itv002461 supergroup        806 2022-05-25 08:30 /user/itv002461/retail_db/README.md
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:30 /user/itv002461/retail_db/categories
-rw-r--r--   3 itv002461 supergroup   10303297 2022-05-25 08:30 /user/itv002461/retail_db/create_db.sql
-rw-r--r--   3 itv002461 supergroup       1748 2022-05-25 08:30 /user/itv002461/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:30 /user/itv002461/retail_db/customers
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:30 /user/itv002461/retail_db/departments
-rw-r--r--   3 itv002461 supergroup   10297372 2022-05-25 08:30 /user/itv002461/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:30 /user/itv002461/retail_db/order_items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 08:30 /user/itv002461/retail_db/orders
drwxr-xr-x   - itv002461 supergroup          0 2022-0

In [None]:
%%sh

hdfs dfs -ls -R /user/${USER}/retail_db/

```{note}
Alternatively, you can use `copyFromLocal` in place of `put`.
```
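
In the help output shown earlier in this section, `copyFromLocal` lists a `-t <thread count>` option that `put` does not. The sketch below wraps a multi-threaded copy in a function. `copy_dir_contents` is a hypothetical helper, and the default of 4 threads is an arbitrary illustrative value rather than a recommendation.

```shell
# Copy every non-hidden entry of a local directory into an existing HDFS
# directory with copyFromLocal, using several threads.
# $1 = local directory, $2 = HDFS directory, $3 = thread count (illustrative default: 4).
copy_dir_contents() {
  src_dir=${1%/}
  dst_dir=$2
  threads=${3:-4}
  hdfs dfs -copyFromLocal -t "$threads" "$src_dir"/* "$dst_dir"
}

# Example (not run here):
#   copy_dir_contents /data/retail_db "/user/${USER}/retail_db" 4
```

Whether extra threads help depends on the number and size of the files, so it is worth measuring on the cluster in question.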

In [20]:
%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db

Deleted /user/itv002461/retail_db


In [23]:
%%sh

hdfs dfs -mkdir /user/${USER}/retail_db

mkdir: `/user/itv002461/retail_db': File exists


CalledProcessError: Command 'b'\nhdfs dfs -mkdir /user/${USER}/retail_db\n'' returned non-zero exit status 1.

In [24]:
%%sh

hdfs dfs -ls /user/itversity/retail_db/

Found 7 items
drwxr-xr-x   - itversity itversity          0 2022-04-07 13:03 /user/itversity/retail_db/categories
drwxr-xr-x   - itversity itversity          0 2022-04-07 13:03 /user/itversity/retail_db/customers
drwxr-xr-x   - itversity itversity          0 2022-04-07 13:03 /user/itversity/retail_db/departments
drwxr-xr-x   - itversity itversity          0 2022-04-07 13:03 /user/itversity/retail_db/order_items
drwxr-xr-x   - itversity itversity          0 2022-04-07 13:03 /user/itversity/retail_db/orders
drwxr-xr-x   - itversity itversity          0 2022-04-07 13:03 /user/itversity/retail_db/products
-rw-r--r--   3 itversity itversity       4965 2022-04-07 13:03 /user/itversity/retail_db/wordcount.rtf


In [25]:
%%sh

hdfs dfs -copyFromLocal /data/retail_db/* /user/${USER}/retail_db

In [26]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 10 items
-rw-r--r--   3 itv002461 supergroup        806 2022-05-25 10:03 /user/itv002461/retail_db/README.md
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/categories
-rw-r--r--   3 itv002461 supergroup   10303297 2022-05-25 10:03 /user/itv002461/retail_db/create_db.sql
-rw-r--r--   3 itv002461 supergroup       1748 2022-05-25 10:03 /user/itv002461/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/customers
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/departments
-rw-r--r--   3 itv002461 supergroup   10297372 2022-05-25 10:03 /user/itv002461/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/order_items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/orders
drwxr-xr-x   - itv002461 supergroup          0 2022-0

```{note}
Alternatively, we can copy the folder `/data/retail_db` directly to `/user/${USER}/retail_db`. Let us first delete `/user/${USER}/retail_db`, using `-skipTrash` to bypass the trash.
```

In [27]:
%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db

Deleted /user/itv002461/retail_db


```{note}
We can specify the target location as `/user/${USER}`. As `retail_db` no longer exists there, `put` creates `/user/${USER}/retail_db` along with its contents.
```

In [28]:
%%sh

hdfs dfs -put /data/retail_db /user/${USER}

In [29]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 11 items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/.git
-rw-r--r--   3 itv002461 supergroup        806 2022-05-25 10:03 /user/itv002461/retail_db/README.md
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/categories
-rw-r--r--   3 itv002461 supergroup   10303297 2022-05-25 10:03 /user/itv002461/retail_db/create_db.sql
-rw-r--r--   3 itv002461 supergroup       1748 2022-05-25 10:03 /user/itv002461/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/customers
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/departments
-rw-r--r--   3 itv002461 supergroup   10297372 2022-05-25 10:03 /user/itv002461/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:03 /user/itv002461/retail_db/order_items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-

* If we run `hdfs dfs -put /data/retail_db /user/${USER}` again, it fails because the files already exist in the target folder.

In [30]:
%%sh

hdfs dfs -put /data/retail_db /user/${USER}

put: `/user/itv002461/retail_db/departments/part-00000': File exists
put: `/user/itv002461/retail_db/products/part-00000': File exists
put: `/user/itv002461/retail_db/README.md': File exists
put: `/user/itv002461/retail_db/create_db_tables_pg.sql': File exists
put: `/user/itv002461/retail_db/load_db_tables_pg.sql': File exists
put: `/user/itv002461/retail_db/categories/part-00000': File exists
put: `/user/itv002461/retail_db/create_db.sql': File exists
put: `/user/itv002461/retail_db/orders/part-00000': File exists
put: `/user/itv002461/retail_db/customers/part-00000': File exists
put: `/user/itv002461/retail_db/order_items/part-00000': File exists
put: `/user/itv002461/retail_db/.git/packed-refs': File exists
put: `/user/itv002461/retail_db/.git/HEAD': File exists
put: `/user/itv002461/retail_db/.git/description': File exists
put: `/user/itv002461/retail_db/.git/objects/1f/1bd72ebcfcf65c212e84044d239aa3fe653fb6': File exists
put: `/user/itv002461/retail_db/.git/objects/41/1d0ce6478c6e

CalledProcessError: Command 'b'\nhdfs dfs -put /data/retail_db /user/${USER}\n'' returned non-zero exit status 1.
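
A script can avoid this failure by checking for the target before copying. The sketch below is one way to do that, assuming the goal is to skip the copy when the target is already present. `put_if_absent` is a hypothetical helper; it relies on `hdfs dfs -test -e`, which exits with status 0 when the path exists.

```shell
# Copy a local directory under an HDFS parent directory only if the target is absent.
# $1 = local directory, $2 = HDFS parent directory.
put_if_absent() {
  src=${1%/}
  parent=${2%/}
  target="$parent/$(basename "$src")"
  if hdfs dfs -test -e "$target"; then
    echo "skipped: $target already exists"
  else
    hdfs dfs -put "$src" "$parent" && echo "copied: $target"
  fi
}

# Example (not run here):
#   put_if_absent /data/retail_db "/user/${USER}"
```

Skipping is the conservative choice; it leaves whatever is already in HDFS untouched.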

* We can pass `-f` to `put` or `copyFromLocal` to overwrite files that already exist at the destination.

In [31]:
%%sh

hdfs dfs -put -f /data/retail_db /user/${USER}

In [32]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Found 11 items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/.git
-rw-r--r--   3 itv002461 supergroup        806 2022-05-25 10:04 /user/itv002461/retail_db/README.md
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/categories
-rw-r--r--   3 itv002461 supergroup   10303297 2022-05-25 10:04 /user/itv002461/retail_db/create_db.sql
-rw-r--r--   3 itv002461 supergroup       1748 2022-05-25 10:04 /user/itv002461/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/customers
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/departments
-rw-r--r--   3 itv002461 supergroup   10297372 2022-05-25 10:04 /user/itv002461/retail_db/load_db_tables_pg.sql
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/order_items
drwxr-xr-x   - itv002461 supergroup          0 2022-05-

In [33]:
%%sh

hdfs dfs -ls -R /user/${USER}/retail_db

drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/.git
-rw-r--r--   3 itv002461 supergroup         23 2022-05-25 10:04 /user/itv002461/retail_db/.git/HEAD
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/.git/branches
-rw-r--r--   3 itv002461 supergroup        267 2022-05-25 10:04 /user/itv002461/retail_db/.git/config
-rw-r--r--   3 itv002461 supergroup         73 2022-05-25 10:04 /user/itv002461/retail_db/.git/description
drwxr-xr-x   - itv002461 supergroup          0 2022-05-25 10:04 /user/itv002461/retail_db/.git/hooks
-rw-r--r--   3 itv002461 supergroup        452 2022-05-25 10:04 /user/itv002461/retail_db/.git/hooks/applypatch-msg.sample
-rw-r--r--   3 itv002461 supergroup        896 2022-05-25 10:04 /user/itv002461/retail_db/.git/hooks/commit-msg.sample
-rw-r--r--   3 itv002461 supergroup        189 2022-05-25 10:04 /user/itv002461/retail_db/.git/hooks/post-update.sample
-rw-r--r--   3 itv002461 supe