# Introduction to the Data Sets

> Written by Dr Daniel Buscombe, Northern Arizona University

> Part of a series of notebooks for image recognition and classification using deep convolutional neural networks

The data sets are hosted in a public Amazon Web Services (AWS) S3 bucket, which we can browse anonymously with s3fs:

In [1]:
import s3fs
fs = s3fs.S3FileSystem(anon=True)
fs.ls('esipfed/cdi-workshop')

['esipfed/cdi-workshop/fully_conv_semseg',
 'esipfed/cdi-workshop/imrecog_data',
 'esipfed/cdi-workshop/semseg_data']

We're going to use this root path a lot, so let's define a variable that we can reuse:

In [2]:
root = 'esipfed/cdi-workshop'

## Looking at the file structure

There are three major subdirectories:
* 'imrecog_data': contains example data sets for image recognition
* 'semseg_data': contains example data sets for semantic segmentation
* 'fully_conv_semseg': contains example data sets for fully convolutional semantic segmentation

In [3]:
fs.ls(root+'/imrecog_data/EuroSAT')

['esipfed/cdi-workshop/imrecog_data/EuroSAT/AnnualCrop',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/Forest',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/HerbaceousVegetation',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/Highway',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/Industrial',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/Pasture',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/PermanentCrop',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/Residential',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/River',
 'esipfed/cdi-workshop/imrecog_data/EuroSAT/SeaLake']
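
Each EuroSAT subdirectory in the listing above appears to correspond to one class, so the class names can be recovered from the last component of each path. The following is a minimal sketch, not part of the original notebook: `class_names` is a hypothetical helper, and the list is a hard-coded subset of the output above so that the example runs without network access.

```python
import posixpath

def class_names(paths):
    """Return the final path component of each entry (hypothetical helper)."""
    return [posixpath.basename(p.rstrip('/')) for p in paths]

# A few entries copied from the listing above; in the notebook this would be
# the result of fs.ls(root + '/imrecog_data/EuroSAT')
listing = [
    'esipfed/cdi-workshop/imrecog_data/EuroSAT/AnnualCrop',
    'esipfed/cdi-workshop/imrecog_data/EuroSAT/Forest',
    'esipfed/cdi-workshop/imrecog_data/EuroSAT/SeaLake',
]
print(class_names(listing))
```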

In [4]:
len(fs.ls(root+'/semseg_data/gc/train'))

16
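
The count above includes every entry in the folder, whatever its type. If a folder mixes file types, it can help to break the count down by extension. This is only a sketch under assumptions: `count_by_extension` is a hypothetical helper, and the file names are invented stand-ins, since the contents of the training folder are not shown here.

```python
import posixpath
from collections import Counter

def count_by_extension(paths):
    """Count entries per lower-cased file extension (hypothetical helper)."""
    return Counter(posixpath.splitext(p)[1].lower() for p in paths)

# Invented stand-in for fs.ls(root + '/semseg_data/gc/train'); the real
# folder contents are not shown in this notebook
listing = ['gc/train/a.jpg', 'gc/train/b.JPG', 'gc/train/a.mat']
counts = count_by_extension(listing)
print(counts['.jpg'], counts['.mat'])
```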

In [5]:
fs.ls(root+'/imrecog_data/NWPU-RESISC45/test/')

['esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/airplane',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/airport',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/baseball_diamond',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/basketball_court',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/beach',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/bridge',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/chaparral',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/church',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/circular_farmland',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/cloud',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/commercial_area',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/dense_residential',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/desert',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45/test/forest',
 'esipfed/cdi-workshop/imrecog_data/NWPU-RESISC45

## Reading and displaying imagery

In [None]:
from imageio import imread
import matplotlib.pyplot as plt

In [None]:
with fs.open(root+'/imrecog_data/NWPU-RESISC45/test/airplane/airplane_700.jpg', 'rb') as f:
    image = imread(f, 'jpg')
    plt.figure(0, figsize=(10,10))
    plt.imshow(image);

In [None]:
fs.ls(root+'/imrecog_data/NWPU-RESISC45/test')

In [None]:
names = [f for f in fs.ls(root+'/imrecog_data/NWPU-RESISC45/test/baseball_diamond') if f.endswith('.jpg')]
names = names[:9] # keep the first nine images, enough to fill a 3 x 3 grid

In [None]:
fig, ax = plt.subplots(3, 3)
fig.set_figheight(15)
fig.set_figwidth(15)
for i, axi in enumerate(ax.flat):
    with fs.open(names[i], 'rb') as f:
        image = imread(f, 'jpg')
    axi.imshow(image)
    axi.set(xticks=[], yticks=[])

## Read Labels

Label files are plain-text files that contain class labels, one label per row.

In [None]:
fs.ls(root+'/semseg_data/gc/labels')

In [None]:
with fs.open(root+'/semseg_data/gc/labels/labels.txt', 'rb') as f:
    labels = f.readlines()

labels = [x.strip() for x in labels] # strip newlines and surrounding whitespace
print(labels)
print(labels[0].decode()) # the file was opened in binary mode ('rb'), so each label is bytes and must be decoded to a string
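
Because the file is read in binary mode, every label is a bytes object, not just the first one. A minimal sketch of decoding the whole list in one pass follows; `decode_labels` is a hypothetical helper, and the label names are invented stand-ins for the real file contents.

```python
def decode_labels(raw_lines, encoding='utf-8'):
    """Strip whitespace, decode bytes to str, and skip blank lines."""
    decoded = []
    for line in raw_lines:
        text = line.strip()
        if isinstance(text, bytes):
            text = text.decode(encoding)
        if text:
            decoded.append(text)
    return decoded

# Invented stand-in for f.readlines() on a file opened with 'rb'
raw = [b'water\n', b'sand \n', b'\n']
print(decode_labels(raw))
```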

## Read binary data files

We're using MATLAB's .mat format to store data from semantic segmentations. That may sound a bit odd, but it is a fairly portable and space-efficient format, and it helps MATLAB users.

In [None]:
from scipy.io import loadmat # reads MATLAB .mat files (Dan's choice of binary format)

In [None]:
with fs.open(root+'/semseg_data/ontario/test/A2014862_geotag_mres.mat') as f:
    dat = loadmat(f)

In [None]:
dat.keys()
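
The dictionary returned by loadmat holds the stored variables alongside metadata entries whose names start with a double underscore (`__header__`, `__version__`, `__globals__`). A small sketch of separating the two is shown here; `data_keys` is a hypothetical helper, and the dictionary is an invented stand-in, since the variable names inside the workshop's .mat files are not shown.

```python
def data_keys(mat_dict):
    """Return the variable names, skipping loadmat's '__...' metadata keys."""
    return [k for k in mat_dict if not k.startswith('__')]

# Invented stand-in for the dict returned by loadmat; 'var_a' and 'var_b'
# are placeholder names, not the real variables in the file
example = {
    '__header__': b'',
    '__version__': '1.0',
    '__globals__': [],
    'var_a': [[0, 1]],
    'var_b': [[1, 0]],
}
print(data_keys(example))
```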