# Getting Started with GSBatch

The `Gsbatch` object creates batches from a list of files. Once the files are batched, you can get them from or put them to a Google bucket one batch at a time and apply analysis in between.

## 1. Passing or creating file lists

Import the Gsbatch object:

In [46]:
from gsbatch import Gsbatch, gsbatch_apply

When we create a `Gsbatch` object, we want it to have a list of files over which to create batches. There are a number of ways of doing this with `Gsbatch`:

1. Pass a list of files directly
2. Pass a path containing a wildcard / list with a mix of filenames and wildcards
3. Use the Gsbatch filename constructor

In all three cases, file lists can be local or remote (on a Google bucket). The object will work out how to batch the files. Here are examples of all three methods:

* Note: `file_dir` is an optional argument. You can also include directory structure directly in the entries of `file_list` or `file_components`. If `file_dir` is provided, Gsbatch places it at the beginning of every path.

__1. Pass a list of files directly__

In [12]:
gsb = Gsbatch(file_dir = 'gs://example_bucket',
              file_list = ['file1', 'file2', 'file3', 'file4', 'file5'], 
              batch_size = 2)
gsb.summary()

   Number of batches:     3
   Number of files:       5

   Current batch number:  1


__2. Pass a list of paths, optionally containing wildcards__

You can include wildcards in a list of files. Each wildcard is expanded in place, so the matching files appear at the wildcard's position in the list. You may also pass a single wildcard path as a string, and it is expanded in the same way.

Note: if you use wildcards, you should specify whether Gsbatch searches locally (`source='local'`) or remotely (`source='remote'`).

In [18]:
gsb = Gsbatch(file_dir = '<home_dir>/testdata/',
              file_list = ['file11.txt', 'test*'], 
              source = 'local',
              batch_size = 2)
gsb.summary()

   Number of batches:     3
   Number of files:       6

   Current batch number:  1
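
The in-place expansion described above can be pictured with a minimal local-only sketch. This is not Gsbatch's own code: `expand_local` is a hypothetical helper built on the standard `glob` module, the sorted order of matches is an assumption, and remote (`source='remote'`) searching is not covered.

```python
import glob
import os


def expand_local(file_dir, file_list):
    """Hypothetical helper: expand wildcard entries in place, keep plain names."""
    if isinstance(file_list, str):
        # A single string is treated as a one-item list.
        file_list = [file_list]
    expanded = []
    for entry in file_list:
        path = os.path.join(file_dir, entry)
        if any(ch in entry for ch in "*?["):
            # Matches replace the wildcard at its own list position.
            # Sorting is an assumption made here for reproducible output.
            expanded.extend(sorted(glob.glob(path)))
        else:
            expanded.append(path)
    return expanded
```

With a directory holding `file11.txt` and some files starting with `test`, calling `expand_local(directory, ['file11.txt', 'test*'])` would return `file11.txt` first, followed by every match for `test*`.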


__3. Use the component file constructor__

If you have a set of files whose names follow a regular structure, you can use the Gsbatch file constructor to create the list of files. For example, you may have files that are all of the form:

```
<variable_name>_<model_name>_<time_period>.<file_ext>
```

We could use the file constructor to build a list of file names of this form as follows:

In [29]:
var_list = ['var1_','var2_','var3_']
model_list = ['model1_','model2_']
time_list = ['1980-2000', '2020-2040']
ext = ['.nc']

gsb = Gsbatch(file_dir = 'gs://example_bucket',
              file_components = [var_list, model_list, time_list, ext],
              batch_size=5)
gsb.summary()

   Number of batches:     3
   Number of files:       12

   Current batch number:  1


We can view the constructed file names by asking for the `files` attribute:

In [30]:
gsb.files

['gs://example_bucket/var1_model1_1980-2000.nc',
 'gs://example_bucket/var1_model1_2020-2040.nc',
 'gs://example_bucket/var1_model2_1980-2000.nc',
 'gs://example_bucket/var1_model2_2020-2040.nc',
 'gs://example_bucket/var2_model1_1980-2000.nc',
 'gs://example_bucket/var2_model1_2020-2040.nc',
 'gs://example_bucket/var2_model2_1980-2000.nc',
 'gs://example_bucket/var2_model2_2020-2040.nc',
 'gs://example_bucket/var3_model1_1980-2000.nc',
 'gs://example_bucket/var3_model1_2020-2040.nc',
 'gs://example_bucket/var3_model2_1980-2000.nc',
 'gs://example_bucket/var3_model2_2020-2040.nc']
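
The listing above is what you would get from a Cartesian product of the component lists, with the last component varying fastest. The sketch below reproduces it with `itertools.product`. It is an illustration only: `build_file_list` is a hypothetical helper, and the way `file_dir` is joined (strip a trailing slash, then add one) is an assumption rather than Gsbatch's documented behaviour.

```python
from itertools import product


def build_file_list(file_dir, file_components):
    """Hypothetical helper: join one item from each component list, for every combination."""
    # Assumption: exactly one "/" separates file_dir from the file name.
    prefix = file_dir.rstrip("/") + "/"
    return [prefix + "".join(parts) for parts in product(*file_components)]


files = build_file_list(
    "gs://example_bucket",
    [
        ["var1_", "var2_", "var3_"],
        ["model1_", "model2_"],
        ["1980-2000", "2020-2040"],
        [".nc"],
    ],
)
print(len(files))  # 12
print(files[0])    # gs://example_bucket/var1_model1_1980-2000.nc
```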

## 2. Cycling batches

Once you have created your `Gsbatch` object and defined its file list, you can start cycling through the batches. Calling `Gsbatch.summary()` shows the current batch number, which is initialised to 1 in every example above. You can also access the batch number directly through `Gsbatch.current_batch`. Note that this returns a Python index (starting at 0), whereas `.summary()` prints the human-readable number (`current_batch + 1`).

You can get the files in the current batch by calling `Gsbatch.files_batch`:

In [33]:
gsb.files_batch

['gs://example_bucket/var1_model1_1980-2000.nc',
 'gs://example_bucket/var1_model1_2020-2040.nc',
 'gs://example_bucket/var1_model2_1980-2000.nc',
 'gs://example_bucket/var1_model2_2020-2040.nc',
 'gs://example_bucket/var2_model1_1980-2000.nc']

You can move onto the next batch by calling `next_batch()`:

In [41]:
gsb.next_batch()
gsb.summary()

   Number of batches:     3
   Number of files:       12

   Current batch number:  2


In [37]:
gsb.files_batch

['gs://example_bucket/var2_model1_2020-2040.nc',
 'gs://example_bucket/var2_model2_1980-2000.nc',
 'gs://example_bucket/var2_model2_2020-2040.nc',
 'gs://example_bucket/var3_model1_1980-2000.nc',
 'gs://example_bucket/var3_model1_2020-2040.nc']

Similarly, you can go to the previous batch by calling `gsb.prev_batch()` and reset back to the first batch using `gsb.reset_batch()`.
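
The batch counts and batch contents shown so far are consistent with plain ceiling division and list slicing on a 0-based batch index. The sketch below illustrates that arithmetic; `n_batches` and `files_in_batch` are hypothetical helpers, not part of Gsbatch, and the internals of the real object may differ.

```python
import math


def n_batches(n_files, batch_size):
    """Number of batches needed; the last batch may be smaller."""
    return math.ceil(n_files / batch_size)


def files_in_batch(files, batch_size, current_batch):
    """Files belonging to a 0-based batch index."""
    start = current_batch * batch_size
    return files[start:start + batch_size]


files = [f"file{i}" for i in range(1, 13)]  # 12 placeholder names
print(n_batches(len(files), 5))             # 3
print(files_in_batch(files, 5, 1))          # file6 to file10
print(files_in_batch(files, 5, 2))          # ['file11', 'file12']
```

Index 1 here corresponds to "Current batch number: 2" in the summary output, which is the offset between `current_batch` and `.summary()`.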

## 3. Getting / Putting Files

If your file list points to files in a Google bucket, you can easily download the files in your current batch to a directory of your choice. You provide this directory when you create your `Gsbatch` object by passing a path to `get_dir` and/or `put_dir`. Then, to download the current batch, you can call:

In [None]:
gsb.get_batch()

Similarly, if you are dealing with local files that you would like to push to a Google bucket directory, you can call:

In [None]:
gsb.put_batch()

Files downloaded using `get_batch()` are considered temporary files by Gsbatch (otherwise why are you batching?). You might typically want to download a batch of files, apply a function, upload the resulting files and delete the batch of downloaded files. The Gsbatch object keeps track of the files downloaded using `get_batch()` in the `tmp_files` attribute. There is a convenience method to delete these files:

In [None]:
gsb.delete_tmp_files()
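
Putting the pieces together, the download, apply, upload and delete cycle could be sketched as a loop like the one below. It only uses method names mentioned in this guide (`reset_batch`, `get_batch`, `files_batch`, `put_batch`, `delete_tmp_files`, `next_batch`). The function `process_all_batches`, its `n_batches` argument (taken from the `summary()` output) and the `analyse` callback are hypothetical. How `put_batch()` decides which files to upload is not covered here, so treat this as an outline rather than a tested recipe.

```python
def process_all_batches(gsb, n_batches, analyse):
    """Hypothetical driver: get -> analyse -> put -> clean up, one batch at a time."""
    gsb.reset_batch()                 # start from the first batch
    for i in range(n_batches):
        gsb.get_batch()               # download the current batch
        analyse(gsb.files_batch)      # user-supplied function; receives the batch's file list
        gsb.put_batch()               # upload; assumes results are where put_batch expects them
        gsb.delete_tmp_files()        # remove the files fetched by get_batch
        if i < n_batches - 1:
            gsb.next_batch()          # advance unless this was the last batch
```

Passing the number of batches explicitly avoids relying on an attribute name that this guide does not show.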