### If you want to try installing GNU parallel
<br>On a Mac with Anaconda, or on a PC running WSL with Anaconda, run the following line in your command line terminal:
<br>`conda install -c conda-forge parallel`
<br><br>Mac/Linux users can also install it with Homebrew (`brew install parallel`), if you are familiar with it.
<br><br>If you are using Git Bash on a PC, you'll have to run the code found in this answer: https://stackoverflow.com/questions/52393850/how-to-install-gnu-parallel-on-windows-10-using-git-bash
 
<br><br>You can test your installation of GNU parallel by running this line of code:
<br>`parallel echo {} ::: 4`
<br>This should return “4” and not an error.
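<br><br>If you would rather run the same check from Python, here is a minimal sketch. The function name `parallel_is_working` is made up for this example; it only looks for a `parallel` program on your PATH and runs the test command above.

```python
import shutil
import subprocess


def parallel_is_working():
    """Return True if `parallel echo {} ::: 4` runs and prints 4."""
    # shutil.which returns None when no program named parallel is on the PATH.
    if shutil.which("parallel") is None:
        return False
    result = subprocess.run(
        ["parallel", "echo", "{}", ":::", "4"],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "4"


if parallel_is_working():
    print("GNU parallel is installed and working")
else:
    print("GNU parallel was not found or did not return 4")
```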

<br><br>To run on **Quest**, log into Quest as you usually would.

Once you are running on Quest, run the following commands to load the required packages and files:
<br>`module load python/anaconda3`
<br>`module load parallel`
<br>`wget https://raw.githubusercontent.com/aGitHasNoName/PythonForAutomation/main/questFiles.txt`
<br>`parallel wget {} :::: questFiles.txt`
<br>*That's right, we're using GNU parallel to help us automate loading files to Quest!*

# <br><br>GNU parallel for automating Python code

We're going to see how GNU parallel combined with what we learned earlier about sys.argv can save us time.

## <br><br>Parallel computing (very briefly)

**Node:** A single computer
<br><br>**Core:** A single processing unit on a computer
<br><br>**Task or process:** A single thing that you're asking the computer to do at one time
<br><br>There are different ways to parallelize your computing needs. There are two main distinctions.
<br><br>First, do you have multiple unique tasks that don't need to talk to each other (that is, get information or data from one another) at all? Or do you have a set of more complicated tasks that rely on information from other tasks?
<br><br>If you have multiple unique tasks that don't need to communicate, then you need what is sometimes referred to as ***embarrassingly parallel*** computing (as in embarrassingly easy).
<br><br>If your tasks need to communicate, there is a second distinction: Can the total job be performed on one node (with multiple cores)? Or is the job too big and complicated for one node?
<br><br>If you can use one node, it is easier to do, as cores on the same node can talk to each other easily (*shared memory*). If you need multiple nodes (*distributed memory*), you usually need a message-passing library (an implementation of the MPI standard, such as Open MPI) to give the computers instructions on how and when to pass messages.


<br><br>**GNU parallel** is going to allow us to do embarrassingly parallel tasks.
<br><br>We start with a Python script that performs a task on one piece of data, then we use `parallel` to instruct the computer to run the script over many pieces of data (many tasks). 
<br><br>**This does not work if order matters. The tasks do not talk to each other.**
<br><br>By default, GNU parallel runs one task (job) per CPU core, so it runs as many tasks at one time as your computer has cores. Alternatively, you can give it a maximum number of jobs to run at one time with the `-j` option (for example, `-j 4`).
<br><br>This will usually go much quicker than if you simply set up a `for` loop in Python to loop through all of the pieces of data, as a `for` loop runs only a single task at a time.
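<br><br>To get a feel for the difference, here is a rough sketch in plain Python (it does not use GNU parallel itself). The stand-in task and the two function names are invented for this example. Each task sleeps for a moment to imitate real work. Unlike `parallel`, the second function starts every task at once rather than limiting them to the number of cores.

```python
import subprocess
import sys
import time

# A stand-in task (an assumption for this sketch): wait briefly,
# then print the command line argument plus 100.
TASK = "import sys, time; time.sleep(0.3); print(int(sys.argv[1]) + 100)"


def run_one_at_a_time(numbers):
    """Like a for loop: each task starts only after the previous one ends."""
    start = time.perf_counter()
    for n in numbers:
        subprocess.run([sys.executable, "-c", TASK, str(n)], check=True)
    return time.perf_counter() - start


def run_all_at_once(numbers):
    """Start every task as its own process, then wait for all of them."""
    start = time.perf_counter()
    processes = [
        subprocess.Popen([sys.executable, "-c", TASK, str(n)]) for n in numbers
    ]
    for process in processes:
        process.wait()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("one at a time:", round(run_one_at_a_time([1, 2, 3]), 2), "seconds")
    print("all at once:  ", round(run_all_at_once([1, 2, 3]), 2), "seconds")
```

The exact timings depend on your computer, so none are shown here.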
<br><br>When I say that we write a script that performs a task on one piece of data, the one piece of data can actually combine multiple inputs (for example, both an input filename and an output filename).
<br><br>There are lots of things you can do with GNU parallel, but we are going to go over the most common tasks. To learn more, check out the official tutorial: https://www.gnu.org/software/parallel/parallel_tutorial.html
<br><br>Let's just see how it works!

### <br><br>Simplest `parallel` example
Open up the `add100.py` script. You can see that it takes one argument from the command line (a number) and then prints the sum of that number and 100. Run the script on the command line with the command line argument `1`.
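<br><br>If you don't have the file in front of you, a script matching that description could look roughly like this. This is a guess at its shape, not the actual workshop file; the function name `add100` and the usage message are assumptions.

```python
import sys


def add100(text):
    """Convert the command line string to an integer and add 100."""
    return int(text) + 100


if __name__ == "__main__":
    # sys.argv[0] is the script name; sys.argv[1] is the first argument.
    if len(sys.argv) == 2:
        print(add100(sys.argv[1]))
    else:
        print("usage: python add100.py NUMBER")
```

Saved as `add100.py`, running `python add100.py 1` with this sketch prints `101`.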

<br>What if we wanted to run the script on three numbers: `1`, `2`, and `3`?
<br><br>With GNU parallel, we can type:
<br>`parallel python add100.py {} ::: 1 2 3`
<br><br>Try it! 
<br><br>Your computer might return the numbers in a different order than how you passed them on the command line. This is because each process is completely independent and runs at the same time as the others, and `parallel` prints each result as soon as that process finishes. If you need the output in the same order as the input, add the `-k` (`--keep-order`) option.

#### <br><br>`parallel` syntax
`parallel python add100.py {} ::: 1 2 3`
<br><br>It starts with the command `parallel`, followed by our regular Python command.
<br><br>Just like how we used `sys.argv[1]` in our Python script to bring in data from the command line, we use `{}` to hold the place in our Python command of where our argument will go.
<br><br>We use a series of three colons `:::` as a divider between the Python command and our data that we want to pull into the `{}` spot.
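<br><br>One way to picture what `parallel` does with this line is as a fill-in-the-blank: it builds one command per argument by dropping the argument into the `{}` spot. The toy function below (the name `expand` is made up, and it only handles this simplest form with a single `:::`) prints the commands that would be run.

```python
def expand(parallel_command):
    """Return the commands a simple `parallel ... {} ::: args` line stands for."""
    # Split at the ::: divider: command template on the left, data on the right.
    before, after = parallel_command.split(" ::: ")
    template = before[len("parallel "):]  # drop the leading word "parallel"
    arguments = after.split()
    return [template.replace("{}", argument) for argument in arguments]


for command in expand("parallel python add100.py {} ::: 1 2 3"):
    print(command)
# python add100.py 1
# python add100.py 2
# python add100.py 3
```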

### <br><br>Exercise 1

*If you are not able to work with GNU parallel today, that's ok! Instead of running the code, just write it out on a piece of paper, in a text document, or even here in the notebook. Then you can check it against my code when I go over the answer.*

Open up and review the `argv2.py` script that we worked with earlier. It takes a string and a number from the command line.
1. Run the argv2.py script on the command line (without using parallel) with the string "pineapple" and the number 10.
2. Run the argv2.py script on the command line using parallel. Still use the string "pineapple", but run it with the numbers 1, 2, and 3.
3. Run the argv2.py script on the command line using parallel. Use the strings "pineapple", "lemon", and "tangerine", and run it with the number 10.

### <br><br>`parallel` with multiple arguments

We can also replace multiple command line arguments. The `argv2.py` script that we just used takes two arguments. Let's say we want to run the script with "pineapple", "lemon", and "tangerine", and run it with the numbers 1, 2, and 3.
<br><br>This could mean two things:
1. We could want **all possible combinations** of words and numbers: pineapple, pineapplepineapple, pineapplepineapplepineapple, lemon, lemonlemon, lemonlemonlemon, tangerine, tangerinetangerine, tangerinetangerinetangerine
2. We could want to **link** each word with the corresponding number: pineapple, lemonlemon, tangerinetangerinetangerine

#### <br>All possible combinations
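To get all combinations, we give each argument its own `:::` list. With more than one list, the numbered placeholders `{1}` and `{2}` stand for the argument from the first and second list (a bare `{}` would insert the whole pair each time):
<br>`parallel python argv2.py {1} {2} ::: pineapple lemon tangerine ::: 1 2 3`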

### <br><br>Exercise 2

Run the argv2.py script in parallel with every combination of the words "robin" and "sparrow" and the numbers 1, 2, 3, 4, and 5.

#### <br><br>Linking the arguments
***To link an argument to the one before it***, we add a `+` sign after the three colons. 
<br>`parallel python argv2.py {1} {2} ::: pineapple lemon tangerine :::+ 1 2 3`
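<br><br>If it helps to think in Python terms, the two ways of pairing arguments behave a lot like `itertools.product` (all combinations) and `zip` (linked). In this sketch, `repeat_word` is a hypothetical stand-in for what `argv2.py` prints, based on the outputs listed earlier. Note that `zip` stops at the shorter list, so the sketch only covers lists of equal length.

```python
from itertools import product

words = ["pineapple", "lemon", "tangerine"]
numbers = [1, 2, 3]


def repeat_word(word, count):
    """Hypothetical stand-in for argv2.py: the word repeated `count` times."""
    return word * count


# Every word paired with every number: 3 x 3 = 9 tasks.
all_combinations = [repeat_word(w, n) for w, n in product(words, numbers)]

# First word with first number, second with second, and so on: 3 tasks.
linked = [repeat_word(w, n) for w, n in zip(words, numbers)]

print(len(all_combinations))  # 9
print(linked)  # ['pineapple', 'lemonlemon', 'tangerinetangerinetangerine']
```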

### <br><br>Exercise 3

Run the argv2.py script in parallel to print out:
<br>one
<br>twotwo
<br>threethreethree

### <br><br>`parallel` with data from a file
Those examples work when you only need to parallelize a couple of options for each argument. When you want to run a script over tens, hundreds, or thousands of data points, **you can put the arguments in a text document (.txt) with each argument on its own line.**

<br>Open up and look at the two text documents `words.txt` and `numbers.txt`. Let's run our script on every combination of these arguments.
<br><br>**To use parallel with a .txt file of arguments, we make one change to our code - we use four colons `::::` instead of three:**
<br>`parallel python argv2.py {1} {2} :::: words.txt :::: numbers.txt`
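<br><br>As a rough picture of what happens with files, each line becomes one argument, and the two lists are then combined just as before. The sketch below writes two small temporary files with made-up contents (not the workshop's `words.txt` and `numbers.txt`) and builds every pairing. The helper `read_arguments` is invented for this example, and skipping blank lines is a simplification.

```python
import tempfile
from itertools import product
from pathlib import Path


def read_arguments(path):
    """Return one argument per line of the file, skipping blank lines."""
    return [line for line in Path(path).read_text().splitlines() if line]


with tempfile.TemporaryDirectory() as folder:
    words_file = Path(folder) / "words.txt"
    numbers_file = Path(folder) / "numbers.txt"
    # Made-up contents, one argument per line.
    words_file.write_text("kiwi\nmango\n")
    numbers_file.write_text("1\n2\n")
    pairs = list(product(read_arguments(words_file), read_arguments(numbers_file)))

print(pairs)
# [('kiwi', '1'), ('kiwi', '2'), ('mango', '1'), ('mango', '2')]
```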

### <br><br>Exercise 4
**Try and guess the output of each of these examples before running them:**
<br><br>We can combine the four colon and three colon commands:
<br>`parallel python argv2.py {1} {2} :::: words.txt ::: 1 2`
<br><br>And we can also link arguments from files:
<br>`parallel python argv2.py {1} {2} :::: words.txt ::::+ numbers.txt`

### <br><br>A more practical example

Let's return to the email address sorting script. Open up `sortEmailsArguments.py`. This version of our script takes the input file and output file names on the command line. 
<br><br>So far, we've only run the script on one file at a time. Let's run it on three files in parallel. I've saved a list of all the files in a .txt document called `email_files.txt`.
<br><br>For the output filenames, I've saved a second list called `email_output_files.txt`.

### <br><br>Exercise 5

Run the `sortEmailsArguments.py` script on every input file listed in the `email_files.txt` document. Link the output to the filenames in the `email_output_files.txt` document.

#### <br><br>The next notebook we're going to work in is Pipe.ipynb.