# Notebook 01: Get Data

### Primary Goal: Get the dataset needed for this tutorial 

#### Background

As in Part 1 (if you haven't looked over Part 1, please do so before jumping into these notebooks), we will continue to use the [Storm EVent ImagRy (SEVIR) dataset](https://proceedings.neurips.cc/paper/2020/file/fa78a16157fed00d7a80515818432169-Paper.pdf). Unfortunately, the original SEVIR dataset is about 1 TB in size, which is a challenge because most people will not have 1 TB free to play around with. To make the dataset more accessible, we made ```sub-sevir```.

```sub-sevir``` is a sub-sampled version of SEVIR. Specifically, we resample all the images to 48 x 48 pixels, which equates to about 8 km spatial resolution, and keep only 1 hour of time (the original has 4 hours). In total, each *scene* has the shape (12,48,48,4).
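
As a quick sanity check on those numbers, the arithmetic can be sketched as below. It assumes (from the SEVIR paper, not stated in this notebook) that each original SEVIR patch spans 384 km on a side and that frames are 5 minutes apart. The variable names are illustrative only.

```python
# Assumptions taken from the SEVIR paper, not from this notebook:
patch_width_km = 384      # each SEVIR event covers a 384 km x 384 km patch
frame_interval_min = 5    # consecutive frames are 5 minutes apart

# Values stated in this notebook:
pixels_per_side = 48      # sub-sevir images are 48 x 48 pixels
minutes_kept = 60         # sub-sevir keeps 1 hour per scene

km_per_pixel = patch_width_km / pixels_per_side
frames_per_scene = minutes_kept // frame_interval_min

print(km_per_pixel)       # 8.0
print(frames_per_scene)   # 12
```

Under those assumptions, the 12 in the scene shape matches the number of time steps in one hour, and 48 pixels per side gives the 8 km grid spacing.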

To see the differences, here are two YouTube videos:


1. [Original resolution](https://youtu.be/ntjNB0SAz1Y)
2. [sub-sevir](https://youtu.be/UAEfD1p5uW8)

While there are considerable differences in resolution, there is still plenty of information content to do machine learning with.

After the sub-sampling of SEVIR, the data size is now only 2 GB. Thus, **please make sure you have at least 2 GB of storage space available** before continuing here.
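
If you are unsure how much space is free, a minimal check using Python's standard library might look like the following. The helper name `has_free_space` and the default path are hypothetical; point `path` at the drive where you plan to store the data.

```python
import shutil

def has_free_space(path=".", required_gb=2.0):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

# Check the current directory's drive against the 2 GB quoted above.
print(has_free_space(".", required_gb=2.0))
```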

#### Get Data

The data is hosted on Zenodo: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7011372.svg)](https://doi.org/10.5281/zenodo.7011372). 

Zenodo is a free place to store files under 50 GB. It also lets us point a tool like wget directly at the file URL. 

If you don't want to use Python to download the file, you can click the DOI badge above and manually download the file.

In [None]:
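import os
import wget

url = 'https://zenodo.org/record/7011372/files/sub-sevir.tar.gz?download=1'

# change this path to where you want to put the file
out_path = '../datasets/sub-sevir.tar.gz'
# make sure the target directory exists before downloading
os.makedirs(os.path.dirname(out_path), exist_ok=True)

filename = wget.download(url, out=out_path)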
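
If the third-party `wget` package is not installed, a rough alternative is to use `urllib.request` from the standard library. This is only a sketch: the helper name `download_file` is hypothetical, and the example call is commented out so that nothing is downloaded by accident.

```python
import os
import urllib.request

def download_file(url, out_path):
    """Download `url` to `out_path` using only the standard library."""
    out_dir = os.path.dirname(out_path)
    if out_dir:
        # create the target directory if it does not exist yet
        os.makedirs(out_dir, exist_ok=True)
    saved_path, _headers = urllib.request.urlretrieve(url, out_path)
    return saved_path

# Example call (same URL and output path as the cell above):
# download_file('https://zenodo.org/record/7011372/files/sub-sevir.tar.gz?download=1',
#               '../datasets/sub-sevir.tar.gz')
```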

#### Open Archive

The file is a tarball. This means that, to make the data easy to share, we bundled it into a single compressed archive. To undo the compression:

1. Open a terminal 
    a. on a Windows computer, open PowerShell (open the search bar and type powershell) 
    b. on a Mac/Linux computer, just use the terminal
    
 
2. Navigate to where the file is  
    

    ```cd path_to_dir``` 

3. Untar the file 

    ```tar -xvzf sub-sevir.tar.gz```
    
This should have extracted the outer folder.
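
If you would rather stay inside the notebook than open a terminal, the same extraction can be sketched with Python's built-in `tarfile` module. The helper name `extract_archive` and the paths in the example call are assumptions; adjust them to wherever you saved the download.

```python
import tarfile

def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into `dest_dir` and return the member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names

# Example call (paths are assumptions; adjust to your setup):
# extract_archive('../datasets/sub-sevir.tar.gz', '../datasets/')
```

As with the `tar` command, only extract archives from sources you trust.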

The contents should look like the following: 

- README.md 
- sub-sevir-train.tar.gz
- sub-sevir-val.tar.gz
- sub-sevir-test.tar.gz 

I encourage you to go ahead and look over the README.md. It contains metadata on how the data were created and some explanations. It is just a text file, so go ahead and open it with your favorite text editor (e.g., Notepad++).

You probably noticed more pesky .tar.gz files. We will need to decompress these too. Like before: 


1. Navigate into the sub-sevir dir 
    

    ```cd sub-sevir``` 

2. Untar the training file 

    ```tar -xvzf sub-sevir-train.tar.gz``` 
    
3. Untar the validation file 

    ```tar -xvzf sub-sevir-val.tar.gz``` 
    
4. Untar the test file 

    ```tar -xvzf sub-sevir-test.tar.gz``` 
    
 
Congrats! You have successfully set yourself up with the dataset to play with some neural networks and machine learning. If you want to save disk space, you can go ahead and delete the extra .tar.gz files. 
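
For those who prefer Python over the terminal, the three inner archives can also be unpacked (and optionally removed) in one loop. This is a minimal sketch: the helper name `extract_splits` is hypothetical, and it assumes the archives follow the `sub-sevir-*.tar.gz` naming shown in the listing above.

```python
import glob
import os
import tarfile

def extract_splits(data_dir, remove_archives=False):
    """Extract every sub-sevir-*.tar.gz in `data_dir`; optionally delete each archive afterwards."""
    archives = sorted(glob.glob(os.path.join(data_dir, "sub-sevir-*.tar.gz")))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=data_dir)
        if remove_archives:
            # free disk space once the contents are extracted
            os.remove(archive)
    return archives

# Example call (path is an assumption; adjust to your setup):
# extract_splits('../datasets/sub-sevir', remove_archives=True)
```

Leaving `remove_archives=False` keeps the .tar.gz files around in case you want to re-extract later.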

The next notebook will help you visualize and play with some of the data before we jump into the machine learning.
