# Automated uploading into a PostgreSQL database
This script automatically detects new CSV files placed in a specific directory and uploads them to a PostgreSQL server. It is a shell script that uses a loop to continuously monitor the directory for new files.

# How to use

1. Open a text editor and create a new file.
  
2. Copy and paste the script from the end of this document into the file.
  
3. Replace the placeholder values with your actual database connection details, directory paths, and any other necessary configurations. Make sure to update the following variables:
  
   - `DB_HOST`: Replace with the hostname or IP address of your PostgreSQL database server.
   - `DB_PORT`: Replace with the port number of your PostgreSQL database server.
   - `DB_NAME`: Replace with the name of your PostgreSQL database.
   - `DB_USER`: Replace with the username to connect to your PostgreSQL database.
   - `DB_PASSWORD`: Replace with the password to connect to your PostgreSQL database.
   - `DIRECTORY`: Replace with the path to the directory where your CSV files are located.
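   - `PROCESSED_DIRECTORY`: Replace with the path to the directory where CSV files are moved after they have been loaded.

   There is no table name to configure: each table is named after the CSV file it is loaded from.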
  
4. Save the file with a `.sh` extension, for example, `csv_loader.sh`.
  
5. Open a terminal or command prompt.
  
6. Navigate to the directory where the script file is located using the `cd` command.
  
7. Make the script file executable by running the following command:  
   `chmod +x csv_loader.sh`
  
8. Run the script by executing the following command:  
   `./csv_loader.sh`
  
9. The script will start monitoring the specified directory for new CSV files. Whenever a new CSV file is detected, it is loaded into the PostgreSQL database as a table named after the file, the file is moved to the processed directory, and the file information is logged in the `processed_files.txt` file.
  
10. The script keeps running in that terminal, checking for new files every 5 seconds, until you stop it (for example with `Ctrl+C`). You can leave it running, and it will automatically process any new CSV files placed in the directory.
  
Make sure the PostgreSQL server is reachable and that Bash and the `psql` command-line client are installed on the machine that runs the script.
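
The prerequisites can also be checked by the script itself before it starts monitoring. The following is a minimal sketch, not part of the original script; the helper names `require_command` and `require_directory` are hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; these names do not appear in the original script.

# Succeeds if the given command can be found on PATH.
require_command() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "Missing required command: $1" >&2
    return 1
  fi
}

# Succeeds if the given path is an existing, writable directory.
require_directory() {
  if [ ! -d "$1" ] || [ ! -w "$1" ]; then
    echo "Directory missing or not writable: $1" >&2
    return 1
  fi
}

# Possible usage near the top of the loader script (commented out so that
# this snippet has no side effects when run on its own):
# require_command psql || exit 1
# require_directory "$DIRECTORY" || exit 1
# require_directory "$PROCESSED_DIRECTORY" || exit 1
```

Failing early with a clear message is usually easier to diagnose than a `psql: command not found` error repeated every 5 seconds.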
  
NOTE: If you encounter permission issues or errors, make sure you have the necessary permissions to access the files and directories, and check that the database connection details are correct.  
The remaining sections refer to the script as `script.sh`; substitute the name of your own script file, for example `csv_loader.sh`. As in steps 7 and 8, the script must be executable (`chmod +x script.sh`) before you can start it with `./script.sh`.

## To run the script continuously in the background, you have several options:
  
1. **Using nohup: You can run the script with `nohup` to detach it from the terminal and keep it running in the background. The script will continue running even after you close the terminal session.**  

- Here's an example: `nohup ./script.sh &`  
- The `&` at the end runs the script in the background. By default, the script's output is written to a file named `nohup.out` in the current directory.  
  
<br><br>
2. **Using screen: Another option is to use the screen utility, which allows you to create a virtual terminal session that continues running even after you disconnect from it. Here's how you can use it**:  
  
- Start a new screen session: `screen -S mysession`
- Run the script in the screen session: `./script.sh`
- Detach from the screen session: Press `Ctrl+A` followed by `d`
  
The script will keep running within the screen session even if you close the terminal or disconnect from the SSH session. You can later reattach to the screen session using `screen -r mysession`.  
  
<br><br>
3. **Using crontab: The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs, also known as cron jobs, to run periodically at fixed times, dates, or intervals.**  
  
- Open the terminal and run the following command to open the crontab editor: `crontab -e`
- If prompted to select an editor, choose your preferred text editor (e.g., nano, vim).
- In the crontab file, each line represents a scheduled task. Each task is defined by a cron expression followed by the command to be executed.
- Add a new line to the crontab file with the schedule and command to run your script. For example, to run the script every 5 minutes, you can use the following line: `*/5 * * * * /path/to/your/script.sh`  
  - The `*/5` in the minute field means every 5 minutes.
  - The `*` in the other fields means any value.
- Save the crontab file and exit the editor.  
  
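The script will now be executed automatically based on the specified schedule. Make sure the script has executable permissions via `chmod +x script.sh`. Note that the script as written loops forever, so every scheduled run starts an additional copy; when using cron, remove the `while true` loop and the `sleep` so that each run processes the pending files once and exits.  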

You can check the list of scheduled tasks by running `crontab -l` in the terminal.  
  
<br><br>
NOTE: When using cron, make sure the environment variables and paths required by your script are properly set, as cron runs the tasks in a different environment than your interactive shell.  
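
Because cron starts a new run on every schedule tick, a run that is still busy can overlap with the next one. The following is a minimal sketch of one common way to guard against that, using an atomic `mkdir` as a lock; the names `LOCK_DIR`, `acquire_lock`, and `release_lock` are hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical lock location; any path that all runs can write to will do.
LOCK_DIR="${TMPDIR:-/tmp}/csv_loader.lock"

# mkdir is atomic: it fails if the directory already exists, so only one
# run at a time can acquire the lock.
acquire_lock() {
  mkdir "$LOCK_DIR" 2>/dev/null
}

# Remove the lock so that the next run can proceed.
release_lock() {
  rmdir "$LOCK_DIR" 2>/dev/null
}

# Possible usage at the top of a run-once variant of the script:
# PATH=/usr/local/bin:/usr/bin:/bin   # cron starts with a minimal PATH
# acquire_lock || exit 0              # another run is still active
# trap release_lock EXIT
# ... process the pending CSV files once, then exit ...
```

If a run is killed before it can release the lock, the lock directory has to be removed by hand.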
  
NOTE: Both `nohup` and `screen` options allow you to keep the script running in the background. You don't need to use `crontab` in this case unless you specifically want to schedule the script to run at specific intervals.
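
For the `nohup` option, it can help to send the output to a named log file and to record the process ID so that the script can be stopped later. This is a minimal sketch under those assumptions; the function name `start_in_background` and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# start_in_background LOG_FILE PID_FILE COMMAND [ARGS...]
# Starts COMMAND detached from the terminal, writes its output to LOG_FILE,
# and stores its process ID in PID_FILE.
start_in_background() {
  log_file=$1
  pid_file=$2
  shift 2
  nohup "$@" >"$log_file" 2>&1 &
  echo $! >"$pid_file"
}

# Possible usage:
# start_in_background loader.log loader.pid ./script.sh
# kill "$(cat loader.pid)"   # stop the loader later
```

Redirecting the output explicitly means `nohup.out` is not created, and the log ends up in a location you choose.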

---
## What if I am running it on my personal computer?
**nohup**: When you shut down your computer, the nohup-launched script is terminated; `nohup` only protects it from the terminal being closed. To have the script start again after a reboot, use one of the startup options described below.  
  
**screen**: A screen session survives closing the terminal or disconnecting from SSH, and you can reattach to it with `screen -r`, which makes it suitable for long-running tasks. However, the session only lives as long as the system keeps running: if the computer is shut down or rebooted, the screen session and the script inside it are terminated and are not restored at the next boot.  
  
**crontab**: If you shut down your computer, the scheduled tasks in cron will not run until your computer is powered on again. cron relies on the operating system being up and running to execute the scheduled tasks.  
  
<br><br>
To automatically run a script when you power on your computer, you can add a startup script or configure a system service depending on your operating system.  
1. Linux (`systemd`): If you are using a Linux distribution with systemd as the init system (e.g., Ubuntu, CentOS), you can create a systemd service to run your script at startup.  
Here's a general outline of the steps:  
- Create a service unit file for your script. Open a text editor and create a file, for example, myscript.service, in the `/etc/systemd/system/` directory with the following content:
```
[Unit]
Description=My Script
After=network.target

[Service]
ExecStart=/path/to/your/script.sh

[Install]
WantedBy=default.target
```
- Save the file and exit the text editor.
- Enable and start the service using the following commands:
```
sudo systemctl enable myscript.service
sudo systemctl start myscript.service
```
- Now, the script will be executed automatically each time the system starts up.  
  

2. macOS (`LaunchAgents`): On macOS, you can use LaunchAgents to run a script at login. Here's a general outline of the steps:  
- Create a property list (plist) file for the script. Open a text editor and create a file, for example, `com.example.myscript.plist`, in the `~/Library/LaunchAgents/` directory with the following content:  
```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.myscript</string>
  <key>ProgramArguments</key>
  <array>
    <string>/path/to/your/script.sh</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
```
- Save the file and exit the text editor.
- Load the LaunchAgent using the following command:  
`launchctl load ~/Library/LaunchAgents/com.example.myscript.plist`
- Now, your script will be executed automatically each time you log in to your macOS system.
  
NOTE: The exact steps may vary slightly depending on your operating system and its version.
  
---
## Where should I place the script?
The placement of the script depends on your specific requirements and system configuration. 
Here are a few options to consider:

**User's home directory**: You can place the script in your user's home directory. This allows you to easily access and modify the script. You can create a new directory, such as scripts, within your home directory and place the script file there.  
Example: `/home/your_username/scripts/script.sh`
  
**System-wide location**: If you want the script to be available for all users on the system, you can place it in a system-wide location, such as /usr/local/bin or /opt. This requires administrative privileges.  
Example: `/usr/local/bin/script.sh`
  
**Custom directory**: You can create a custom directory specifically for your scripts and place the script file there. This can be useful if you have multiple scripts and want to keep them organized in a single location.  
Example: `/path/to/your/scripts/script.sh`
  
<br><br>
Once you have decided on the location, make sure the script file has executable permissions so that it can be run.  
You can set the permissions using the chmod command: `chmod +x /path/to/your/scripts/script.sh`
  
After placing the script in the desired location, you can run it using the command: `./script.sh` (assuming you are in the same directory) or by specifying the full path to the script file.
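
The script writes `processed_files.txt` using a relative path, so the log ends up in whatever directory the script was started from, which differs between an interactive shell, cron, and a startup service. A minimal sketch of one way to make this independent of the working directory is shown below; `SCRIPT_DIR` and `LOG_FILE` are hypothetical variable names that the original script does not define.

```shell
#!/bin/sh
# Directory that contains the running script, resolved to an absolute path.
SCRIPT_DIR=$(cd "$(dirname -- "$0")" && pwd)

# Hypothetical variable: keep the log next to the script, whatever the
# current working directory is.
LOG_FILE="$SCRIPT_DIR/processed_files.txt"

echo "Log file: $LOG_FILE"
```

With this in place, every occurrence of `processed_files.txt` in the script would be replaced by `"$LOG_FILE"`.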

## Explanation of each part
The script will continue running indefinitely, periodically checking the specified directory for new CSV files, processing them, and marking them as processed to avoid duplicate processing. 
  
Here is a breakdown:
  
1. `#!/bin/bash`: This is called the shebang and specifies the interpreter to be used, in this case, /bin/bash. It indicates that the script should be executed using the Bash shell.
  
2. `DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD`: These variables store the connection details for the PostgreSQL database. You need to replace the placeholder values with your actual database connection information.
  
3. `DIRECTORY` and `PROCESSED_DIRECTORY`: `DIRECTORY` holds the path of the directory that is monitored for CSV files, and `PROCESSED_DIRECTORY` holds the path of the directory to which files are moved after they have been loaded. Replace the placeholder values with your actual paths.
  
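4. `process_csv_file()`: This function takes a CSV file path as an argument. It derives the table name from the CSV file name by stripping the directory and the file extension and replacing spaces with hyphens. Note that PostgreSQL does not accept a hyphen in an unquoted identifier, so a file name that contains spaces or hyphens produces a table name that the unquoted `CREATE TABLE` statement rejects; file names that start with a letter and contain only letters, digits, and underscores are the safe choice.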
  
5. `header=$(head -n 1 "$csv_file")`: This retrieves the header of the CSV file using the head command, which contains the column names.
  
6. `IFS=, read -ra column_names <<< "$header"`: This splits the header on commas and stores the column names in the Bash array `column_names`.
  
7. `create_table_sql="CREATE TABLE IF NOT EXISTS $table_name (id SERIAL PRIMARY KEY"`  
`psql -h $DB_HOST -p $DB_PORT -d $DB_NAME -U $DB_USER -c "$create_table_sql"` : The script builds a SQL statement that creates the table if it does not already exist, with an auto-incrementing `id` primary key followed by one `TEXT` column per header field, which a `for` loop appends to the statement. The psql command then executes the statement using the provided database connection details.
  
8. `tail -n +2 "$csv_file" | while IFS=, read -ra values; do`  
`psql -h $DB_HOST -p $DB_PORT -d $DB_NAME -U $DB_USER -c "$insert_sql"` : The loop skips the header line, splits each remaining line on commas, builds an `INSERT INTO $table_name (...) VALUES (...)` statement with every value wrapped in single quotes, and runs it with psql, one row at a time. Because lines are split on every comma, quoted fields that contain commas are not supported, and values that contain single quotes break the statement.  
  
9. `mv "$csv_file" "$PROCESSED_DIRECTORY"`: This moves the processed CSV file to the specified processed directory using the mv command.
  
10. `processed_date=$(date +"%Y-%m-%d %H:%M:%S")`  
`file_size=$(du -h "$PROCESSED_DIRECTORY/${csv_file##*/}" | awk '{print $1}')`  
`echo "$processed_date | File: ${csv_file##*/} | Size: $file_size" >> processed_files.txt`: These lines obtain the current date and time using the date command and store it in the processed_date variable. The file size is obtained using the du command, and awk is used to extract the file size from the du output. The processed file information, including the date, file name, and file size, is then appended to the processed_files.txt log file.
  
11. `while true; do`: This starts an infinite loop using while true. The script will keep running until manually interrupted.
  
12. `csv_files=$(find "$DIRECTORY" -maxdepth 1 -type f -name "*.csv")`: This line uses the find command to search for CSV files (*.csv) in the specified directory ($DIRECTORY). The found file paths are stored in the csv_files variable.
  
13. `if [ -n "$csv_files" ]; then`: This checks if there are any CSV files found in the directory. The condition [ -n "$csv_files" ] checks if the csv_files variable is not empty.
  
14. `for csv_file in $csv_files; do`: This starts a loop to iterate over each CSV file found in the directory.

15. `if ! grep -qF "$csv_file" processed_files.txt; then`: This condition checks if the current CSV file is not already present in the processed_files.txt log file. It uses the grep command with the -qF options to search for the file path ($csv_file) as a fixed, literal string (-F) in the log file without printing any output (-q).
  
16. `echo "Processing file: $csv_file"`  
`process_csv_file "$csv_file"`  
`echo "$csv_file" >> processed_files.txt`  
`fi`: If the CSV file is not found in the log file, it is new and needs to be processed. The file path is echoed to the console, the process_csv_file function is called to process the file, and the file path is appended to the processed_files.txt log file. Otherwise, the `else` branch prints a message saying that the file is skipped.
  
17. `sleep 5`: After processing the CSV files in the current iteration, the script waits for 5 seconds using the sleep command before starting the next iteration of the loop.
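
The script shown in this document inserts rows one at a time. psql's `\copy` meta-command can instead load a whole file in a single step and leaves the CSV parsing, including quoted fields, to PostgreSQL. The following is a minimal sketch of how such a command could be built for a table created as described above; the function name `build_copy_command` is hypothetical, and the sketch assumes that the file path contains no single quotes.

```shell
#!/bin/sh
# build_copy_command TABLE CSV_FILE
# Prints a psql \copy meta-command that loads CSV_FILE into TABLE. The
# columns are listed explicitly (taken from the header line) so that the
# serial "id" column is filled in automatically.
build_copy_command() {
  copy_table=$1
  copy_file=$2
  # Turn the header a,b,c into the quoted column list "a","b","c".
  copy_columns=$(head -n 1 "$copy_file" | sed 's/,/","/g; s/^/"/; s/$/"/')
  printf '\\copy %s (%s) FROM '\''%s'\'' WITH (FORMAT csv, HEADER true)\n' \
    "$copy_table" "$copy_columns" "$copy_file"
}

# Possible usage inside process_csv_file, in place of the INSERT loop:
# psql -h "$DB_HOST" -p "$DB_PORT" -d "$DB_NAME" -U "$DB_USER" \
#   -c "$(build_copy_command "$table_name" "$csv_file")"
```

Since PostgreSQL parses the file itself, values that contain commas or single quotes need no special handling in the shell.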


In [None]:
%%script bash
#!/bin/bash

# Database connection details
DB_HOST="localhost"
DB_PORT="5432"
DB_NAME="analysis"
DB_USER="postgres"
DB_PASSWORD="admin"
export PGPASSWORD="$DB_PASSWORD"  # psql reads the password from PGPASSWORD instead of prompting

# Directories
DIRECTORY="/Users/alexandergursky/Local_Repository/projects/CSV2PostgreSQL/deposit"
PROCESSED_DIRECTORY="/Users/alexandergursky/Local_Repository/projects/CSV2PostgreSQL/processed"

# Function to process CSV file
process_csv_file() {
  local csv_file="$1"
  local table_name="${csv_file##*/}"
  table_name="${table_name%.*}"
  table_name="${table_name// /-}"  # Replace spaces with hyphens

  # Read the header of the CSV file to get the column names
  header=$(head -n 1 "$csv_file")
  IFS=, read -ra column_names <<< "$header"

  # Create a new table based on the CSV file name with dynamic column names
  create_table_sql="CREATE TABLE IF NOT EXISTS $table_name (id SERIAL PRIMARY KEY"
  for column_name in "${column_names[@]}"; do
    create_table_sql+=", \"$column_name\" TEXT"
  done
  create_table_sql+=");"
  psql -h $DB_HOST -p $DB_PORT -d $DB_NAME -U $DB_USER -c "$create_table_sql"

  # Insert data row by row
  tail -n +2 "$csv_file" | while IFS=, read -ra values; do
    # Prepare column names and values for the INSERT statement
    column_list=$(printf '"%s",' "${column_names[@]}")
    column_list="${column_list%,}"  # Remove trailing comma
    value_list=$(printf "'%s'," "${values[@]}")
    value_list="${value_list%,}"  # Remove trailing comma

    insert_sql="INSERT INTO $table_name ($column_list) VALUES ($value_list);"
    psql -h $DB_HOST -p $DB_PORT -d $DB_NAME -U $DB_USER -c "$insert_sql"
  done

  # Move the processed file to the processed directory
  mv "$csv_file" "$PROCESSED_DIRECTORY"

  # Log the processed file information
  processed_date=$(date +"%Y-%m-%d %H:%M:%S")
  file_size=$(du -h "$PROCESSED_DIRECTORY/${csv_file##*/}" | awk '{print $1}')
  echo "$processed_date | File: ${csv_file##*/} | Size: $file_size" >> processed_files.txt
}



# Loop to continuously monitor the directory for new CSV files
while true; do
  # Check if there are any CSV files in the directory
  csv_files=$(find "$DIRECTORY" -maxdepth 1 -type f -name "*.csv")

  if [ -n "$csv_files" ]; then
    for csv_file in $csv_files; do
      # Process the CSV file if it has not been processed before
      if ! grep -qF "$csv_file" processed_files.txt; then
        echo "Processing file: $csv_file"
        process_csv_file "$csv_file"
        echo "$csv_file" >> processed_files.txt
      else
        echo "Skipping processed file: $csv_file"
      fi
    done
  fi

  # Wait for a specified time before checking again
  sleep 5
done


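NOTE: Currently the script does not load the last line of a file, but it works overall. The most likely cause is a CSV file whose last line does not end with a newline: `read` then returns a non-zero status for that line, so the `while` loop ends before the row is inserted.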
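
A skipped last line is typical of `while read` loops when the input does not end with a newline. The following is a minimal sketch of the usual workaround, which adds a second test to the loop condition; the function `count_data_rows` is hypothetical and only counts rows to demonstrate the effect.

```shell
#!/bin/sh
# count_data_rows CSV_FILE
# Prints the number of data rows (all lines after the header), including a
# final line that has no trailing newline.
count_data_rows() {
  tail -n +2 "$1" | {
    rows=0
    # read returns non-zero at end of input even when it has just read a
    # partial last line; the extra test keeps the loop running for that line.
    while IFS= read -r line || [ -n "$line" ]; do
      rows=$((rows + 1))
    done
    echo "$rows"
  }
}

# Possible usage:
# count_data_rows "$DIRECTORY/example.csv"
```

In the script's own loop, the same idea would presumably become `while IFS=, read -ra values || [ "${#values[@]}" -gt 0 ]; do`, which has not been tested here.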