# Hurricane Harvey: Explore the Data

In this notebook, you will explore the Hurricane Harvey image dataset. The dataset you will be looking at came from [Detecting Damaged Buildings on Post-Hurricane Satellite Imagery Based on Customized Convolutional Neural Networks](https://dx.doi.org/10.21227/sdad-1e56) and was prepared by Quoc Dung Cao and Youngjun Choe on the 13th of December, 2018. [This paper](https://arxiv.org/pdf/1807.01688.pdf) describes the authors' work in more detail and more [information about the original satellite image data source can be found here](https://www.satimagingcorp.com/satellite-sensors/geoeye-1/). Individual images were labeled by volunteers on the [crowdsourcing platform Tomnod, now Maxar Geohive](https://blog.maxar.com/leading-the-industry/2019/in-the-blink-of-an-eye-looking-back-on-nine-years-with-tomnod). This dataset contains aerial-view windows of the affected area classified as damaged or not, along with building coordinates. 

In this lab you will apply the following steps:
 1. Import Python packages
 2. Explore the dataset
 3. Prepare the dataset
 4. Describe the data
 5. Compare pre-disaster and post-disaster images
 6. Visualize classified locations with Leaflet

## 1. Import Python packages

Run the next cell to import the Python packages you'll be using in this lab exercise. If everything goes well you should see a message when the cell has finished running that says "All packages imported successfully!".

Note the `import utils` line. This line imports the functions that were specifically written for this lab. If you want to look at what these functions are, go to `File -> Open...` and open the `utils.py` file to have a look.

In [1]:
import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact

import utils

print('All packages imported successfully!')

All packages imported successfully!


## 2. Explore the dataset

Run the cell below to display any image and its location.

In [2]:
components = utils.images_on_server(display)

display(components['fileChooser'])
display(components['output'])


## 3. Prepare the dataset

Run the next cell to represent the image dataset in a dataframe using the `pandas` package. A dataframe is just a convenient format you can use for accessing and manipulating data. We will only use this dataframe to understand the path of each image and highlight key features in the naming conventions of the folders and locations.

This dataset has already been split for you into `train`, `validation`, and `test` subsets. The path of each image gives us important information.

```
|- data
  |- train
    |- visible_damage
    |- no_damage
  |- validation
    |- visible_damage
    |- no_damage
  |- test
    |- visible_damage
    |- no_damage
```

* `data`: root folder containing all images
  * `train`: set of data used to train the model
  * `validation`: set of data used to tune the model during training
  * `test`: set of data used to test the performance of the model

Each of the three subsets contains the same two label folders:
* `visible_damage`: set of images that display evidence of damage
* `no_damage`: set of images that do not display evidence of damage
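The folder layout above can be turned into one record per image by walking the tree. The following is a minimal sketch only, not the implementation of `utils.get_dataframe_from_file_structure`: the helper name `list_images`, the throwaway directory, and the file under `test/no_damage` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def list_images(root):
    """Return one record per image found under root/<subset>/<label>/.

    Hypothetical helper for illustration; the lab's own utils may differ.
    """
    records = []
    for path in sorted(Path(root).glob('*/*/*.jpeg')):
        # The last three path components are subset / label / filename
        subset, label = path.parts[-3], path.parts[-2]
        records.append({'subset': subset, 'label': label, 'filename': path.name})
    return records


# Build a tiny throwaway tree that mimics the layout described above
with tempfile.TemporaryDirectory() as tmp:
    examples = [
        ('train', 'visible_damage', '-95.638338_29.771164000000002.jpeg'),
        ('test', 'no_damage', '-95.0_29.0.jpeg'),  # made-up file name
    ]
    for subset, label, name in examples:
        folder = Path(tmp) / subset / label
        folder.mkdir(parents=True, exist_ok=True)
        (folder / name).touch()

    records = list_images(tmp)

for record in records:
    print(record)
```

A list of records like this can be passed to `pd.DataFrame(records)` to obtain a dataframe with one row per image.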

Each image within each folder has a unique name containing two numbers. The first number is the **longitude** coordinate of the satellite image and the second number is the **latitude** coordinate.

**For example:** 

`/train/visible_damage/-95.638338_29.771164000000002.jpeg`
* `train`: This image is a part of the training set
* `visible_damage`: This image displays evidence of damage
* `-95.638338`: Longitude coordinate of the image
* `29.771164000000002`: Latitude coordinate of the image
* `.jpeg`: Type of image file
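The naming convention above can be decoded with a few lines of standard-library Python. This is a sketch under the assumption that every path ends in `<subset>/<label>/<longitude>_<latitude>.jpeg`; the function name `parse_image_path` is hypothetical and is not part of `utils.py`.

```python
from pathlib import PurePosixPath


def parse_image_path(path):
    """Split an image path into subset, label, longitude, and latitude."""
    p = PurePosixPath(path)
    # The two folders directly above the file are the subset and the label
    subset, label = p.parts[-3], p.parts[-2]
    # The stem (file name without ".jpeg") is "<longitude>_<latitude>"
    lon_text, lat_text = p.stem.split('_')
    return {
        'subset': subset,
        'label': label,
        'lon': float(lon_text),
        'lat': float(lat_text),
    }


info = parse_image_path('/train/visible_damage/-95.638338_29.771164000000002.jpeg')
print(info)
```

Note that the file name stores longitude first, which is the reverse of the more common latitude-then-longitude order, so it is worth naming the two values explicitly when parsing.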

In [3]:
metadata = utils.get_dataframe_from_file_structure()

# Display the first five rows of the dataset
metadata.head()

| | subset | label | lat | lon | path | filename |
|---|---|---|---|---|---|---|
| 0 | train | visible_damage | 29.771164 | -95.638338 | train/visible_damage/-95.638338_29.77116400000... | -95.638338_29.771164000000002.jpeg |
| 1 | train | visible_damage | 29.824802 | -95.089349 | train/visible_damage/-95.089349_29.82480200000... | -95.089349_29.824802000000002.jpeg |
| 2 | train | visible_damage | 29.981403000000004 | -95.119694 | train/visible_damage/-95.119694_29.98140300000... | -95.119694_29.981403000000004.jpeg |
| 3 | train | visible_damage | 29.75738 | -95.59024 | train/visible_damage/-95.59024000000001_29.757... | -95.59024000000001_29.757379999999998.jpeg |
| 4 | train | visible_damage | 28.579998 | -96.994213 | train/visible_damage/-96.994213_28.579998.jpeg | -96.994213_28.579998.jpeg |


## 4. Describe the data

Let's look at how many images we have in each of the categories (damage and no damage) for each of the three splits (train, validation, and test).

### 4.1 Understand image distribution across dataset split

In [4]:
# Create dataframe that summarizes image classification in each subset
df = pd.pivot_table(metadata, index='subset', columns='label', values='filename', aggfunc='count')

# Add new column with total number of images in each subset
df['total'] = df['visible_damage'] + df['no_damage']

# Show dataframe
df

| subset | no_damage | visible_damage | total |
|---|---|---|---|
| test | 1000 | 1000 | 2000 |
| train | 5000 | 5000 | 10000 |
| validation | 1000 | 1000 | 2000 |
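The counts above show the same number of images in both classes within every split. One way to check class balance directly is to compute the share of damaged images per subset. The sketch below uses a small made-up dataframe named `toy` in place of `metadata`, and `pd.crosstab` as an alternative to the pivot table used in the cell above.

```python
import pandas as pd

# Toy stand-in for the metadata dataframe (values are made up)
toy = pd.DataFrame({
    'subset': ['train', 'train', 'train', 'train', 'test', 'test'],
    'label': ['visible_damage', 'no_damage', 'visible_damage',
              'no_damage', 'visible_damage', 'no_damage'],
})

# Count images per (subset, label) pair
counts = pd.crosstab(toy['subset'], toy['label'])

# Fraction of images in each subset that show visible damage
damage_share = counts['visible_damage'] / counts.sum(axis=1)

print(counts)
print(damage_share)
```

A share close to 0.5 in every subset indicates balanced classes, which makes plain accuracy a reasonable first metric for a classifier trained on the data.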


## 5. Compare pre-disaster and post-disaster images

You can plot images taken before and after the disaster and check how the affected buildings look. Use the widget to look at different pairs of images. The `Image Num` parameter is the index of an image pair in the list of training images that appear in both the damage and no damage directories.

**Note:** In the labs that follow, you'll build a classifier that assesses images directly for the presence of visible damage. In other words, you won't be training an algorithm to recognize differences between images taken before and after a disaster. However, a before-and-after comparison is one approach to damage assessment, and a number of teams around the world have built such systems. Throughout these labs, you'll have several opportunities to look at before-and-after images as a way of visualizing how damage appears, and it's worth thinking about how you might design a different type of system that recognizes damage from before-and-after pairs of images. In many cases, you may not have access to images taken before the disaster. Even if you do, you may need to deal with pixel-by-pixel alignment between images taken at different times, as well as color variations due to different lighting and so on.

In [5]:
dataset = 'train'
# Match images in the no damage and damage dataset based on the location coordinates in the file name
matches = list(set(metadata.query(f'subset == "{dataset}" & label == "visible_damage"')['filename'])
               .intersection(metadata.query(f'subset == "{dataset}" & label == "no_damage"')['filename']))

# Load index slider to navigate between the paired images
file_index_widget = widgets.IntSlider(min=0, max=len(matches)-1, value=10, description='Image Num')

# Load visualizer to match paired images
interact(utils.interactive_plot_pair(f'./data/{dataset}/', matches), file_index=file_index_widget);

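The pairing step in the cell above relies on the fact that a file name encodes a location, so a name that appears in both label folders identifies one place photographed twice. A minimal sketch of that idea, using made-up file names and plain Python sets:

```python
# Made-up file names standing in for the two training folders
damaged = {'-95.1_29.8.jpeg', '-95.2_29.9.jpeg', '-95.3_30.0.jpeg'}
undamaged = {'-95.2_29.9.jpeg', '-95.3_30.0.jpeg', '-95.4_30.1.jpeg'}

# Names present in both sets mark locations that have an image in each class;
# sorting gives a stable order, since set iteration order is not guaranteed
matches = sorted(damaged & undamaged)

print(matches)
```

Because sets of strings have no guaranteed order, building the list with `list(...)` can map a given slider index to a different pair from one session to the next. Sorting, as in this sketch, is one way to make the index reproducible.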

As you toggle through the images, reflect on the following questions:
* What types of damage can you identify in the pairs of images?
* What indicators (color, brightness, sharpness, etc.) identify the existence of damage in the image?
* What are some restrictions 

## 6. Visualize classified locations with Leaflet

Plot the training data on an interactive map using Leaflet and OpenStreetMap services. Pins marking images that show damage are colored red, and pins marking images without damage are colored green. Clicking a pin also shows the image from that location (it may take a few seconds to appear). The map can take about 3 minutes to render.

In [6]:
# Load Leaflet visualization with coordinates and labels
utils.leaflet_plot()