## Exploratory Analysis

The first part of wrangling our data is an exploratory analysis. In this section we'll explore the raw data to understand how it's formatted and which fields are available. Before loading anything into the notebook, we'll examine our folder structure and take inventory of the files available to us. After that we can start exploring their contents and loading the data with Pandas and Python.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from skimage.io import imread
from joblib import Parallel, delayed
from file_helpers import print_dir, print_file
import os

First, let's explore the folders we stored the data in a little bit. There are some files in the folders that will help us better understand and navigate the data. Let's take a look and see what we have.

In [2]:
print_dir('../data')

        Name     Full Path
hirise-map-proj-v3_2  ../data/hirise-map-proj-v3_2
processed_data  ../data/processed_data
    __MACOSX  ../data/__MACOSX
Showing 3 of 3 items
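
`print_dir` comes from the local `file_helpers` module, whose source isn't shown in this notebook. For readers following along without it, here is a minimal stand-in, assuming it simply lists a directory's entries with their full paths and caps how many are shown. The name `list_dir_preview` and the `limit` parameter are hypothetical, not the author's implementation.

```python
import os

def list_dir_preview(directory, limit=10):
    """Hypothetical stand-in for print_dir: print up to `limit` entries.

    Returns the (name, full_path) pairs that were printed.
    """
    names = sorted(os.listdir(directory))
    shown = [(name, os.path.join(directory, name)) for name in names[:limit]]
    print('Name\tFull Path')
    for name, full_path in shown:
        print(f'{name}\t{full_path}')
    print(f'Showing {len(shown)} of {len(names)} items')
    return shown
```

Calling `list_dir_preview('../data')` would then print a listing similar in spirit to the one above, sorted by name.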


We can likely ignore `__MACOSX`, as that looks like a metadata folder that macOS adds when it creates zip archives. One thing that stands out to me is the `hirise-map-proj-v3_2` folder. This seems to match up with our zip file from the original download. If this is the case, I suspect we'll find the files from that download inside it. Let's take a look.

In [3]:
print_dir('../data/hirise-map-proj-v3_2')

        Name     Full Path
map-proj-v3_2  ../data/hirise-map-proj-v3_2/map-proj-v3_2
labels-map-proj_v3_2.txt  ../data/hirise-map-proj-v3_2/labels-map-proj_v3_2.txt
labels-map-proj_v3_2_train_val_test.txt  ../data/hirise-map-proj-v3_2/labels-map-proj_v3_2_train_val_test.txt
landmarks_map-proj-v3_2_classmap.csv  ../data/hirise-map-proj-v3_2/landmarks_map-proj-v3_2_classmap.csv
Showing 4 of 4 items


This looks like the root directory that matches the download previewer on [Zenodo](https://zenodo.org/records/4002935). If we read the description there, we can find information on all these files.

>Contents:
>- map-proj-v3_2/: Directory containing individual cropped landmark images
>- labels-map-proj-v3_2.txt: Class labels (ids) for each landmark image. File includes two columns separated by a space: filename, class_id
>
>- labels-map-proj-v3_2_train_val_test.txt: Includes train/test/val labels and upsampling used for trained model. File includes three columns separated by a space: filename, class_id, set
>- landmarks_map-proj-v3_2_classmap.csv: Dictionary that maps class ids to semantic names

Let's dive into the `map-proj-v3_2` folder to verify that the images are there, as expected.

In [4]:
print_dir('../data/hirise-map-proj-v3_2/map-proj-v3_2')

        Name     Full Path
ESP_012810_0925_RED-0115-brt.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/ESP_012810_0925_RED-0115-brt.jpg
ESP_024646_2570_RED-0016-r270.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/ESP_024646_2570_RED-0016-r270.jpg
PSP_010087_1555_RED-0181-r90.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/PSP_010087_1555_RED-0181-r90.jpg
ESP_025151_1570_RED-0151-r90.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/ESP_025151_1570_RED-0151-r90.jpg
ESP_012494_2050_RED-0044-r90.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/ESP_012494_2050_RED-0044-r90.jpg
ESP_018321_2565_RED-0069-r270.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/ESP_018321_2565_RED-0069-r270.jpg
ESP_016631_2535_RED-0004-r180.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/ESP_016631_2535_RED-0004-r180.jpg
ESP_012637_0935_RED-0236.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/ESP_012637_0935_RED-0236.jpg
PSP_010087_1555_RED-0306.jpg  ../data/hirise-map-proj-v3_2/map-proj-v3_2/PSP_010087_1555_RED-03

It does look like image data, and as expected we have 64,947 images. Let's take a look at one just to get a feel for what these images look like.

![Martian Image](../data/hirise-map-proj-v3_2/map-proj-v3_2/ESP_011283_2265_RED-0013-brt.jpg)

In [5]:
print_file('../data/hirise-map-proj-v3_2/labels-map-proj_v3_2.txt')

ESP_013049_0950_RED-0067.jpg 7
ESP_013049_0950_RED-0067-fv.jpg 7
ESP_013049_0950_RED-0067-brt.jpg 7
ESP_013049_0950_RED-0067-r90.jpg 7
ESP_013049_0950_RED-0067-r180.jpg 7
ESP_013049_0950_RED-0067-r270.jpg 7
ESP_013049_0950_RED-0067-fh.jpg 7
ESP_019697_2020_RED-0024.jpg 1
ESP_019697_2020_RED-0024-fv.jpg 1
ESP_019697_2020_RED-0024-brt.jpg 1
ESP_019697_2020_RED-0024-r90.jpg 1
ESP_019697_2020_RED-0024-r180.jpg 1
ESP_019697_2020_RED-0024-r270.jpg 1
ESP_019697_2020_RED-0024-fh.jpg 1
ESP_015962_1695_RED-0016.jpg 1
ESP_015962_1695_RED-0016-fv.jpg 1
ESP_015962_1695_RED-0016-brt.jpg 1
ESP_015962_1695_RED-0016-r90.jpg 1
ESP_015962_1695_RED-0016-r180.jpg 1
ESP_015962_1695_RED-0016-r270.jpg 1
ESP_015962_1695_RED-0016-fh.jpg 1
ESP_013049_0950_RED-0118.jpg 7
ESP_013049_0950_RED-0118-fv.jpg 7
ESP_013049_0950_RED-0118-brt.jpg 7
ESP_013049_0950_RED-0118-r90.jpg 7
ESP_013049_0950_RED-0118-r180.jpg 7
ESP_013049_0950_RED-0118-r270.jpg 7
ESP_013049_0950_RED-0118-fh.jpg 7
ESP_015962_1695_RED-0017.jpg 1
ESP_0
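
The listing shows file names ending in suffixes such as `-fv`, `-brt`, `-r90`, `-r180`, `-r270` and `-fh`, each sharing a label with an unsuffixed name. These look like variants of one base image, although that reading is an assumption drawn from the names alone. A small sketch for separating the base name from the suffix (`split_variant` is a hypothetical helper):

```python
import re

# Suffixes observed in the listing above; treating them as variant markers is an assumption
VARIANT_PATTERN = re.compile(
    r'^(?P<base>.+?)(?:-(?P<variant>fv|fh|brt|r90|r180|r270))?\.jpg$'
)

def split_variant(file_name):
    """Return (base_name, variant); variant is None for an unsuffixed name."""
    match = VARIANT_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f'Unexpected file name: {file_name}')
    return match.group('base'), match.group('variant')

print(split_variant('ESP_013049_0950_RED-0067-r90.jpg'))
print(split_variant('ESP_013049_0950_RED-0067.jpg'))
```

Grouping rows by the base name would make it easy to see which files belong together.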

In [6]:
print_file('../data/hirise-map-proj-v3_2/labels-map-proj_v3_2_train_val_test.txt')

ESP_013049_0950_RED-0067.jpg 7 train
ESP_013049_0950_RED-0067-fv.jpg 7 train
ESP_013049_0950_RED-0067-brt.jpg 7 train
ESP_013049_0950_RED-0067-r90.jpg 7 train
ESP_013049_0950_RED-0067-r180.jpg 7 train
ESP_013049_0950_RED-0067-r270.jpg 7 train
ESP_013049_0950_RED-0067-fh.jpg 7 train
ESP_019697_2020_RED-0024.jpg 1 train
ESP_019697_2020_RED-0024-fv.jpg 1 train
ESP_019697_2020_RED-0024-brt.jpg 1 train
ESP_019697_2020_RED-0024-r90.jpg 1 train
ESP_019697_2020_RED-0024-r180.jpg 1 train
ESP_019697_2020_RED-0024-r270.jpg 1 train
ESP_019697_2020_RED-0024-fh.jpg 1 train
ESP_015962_1695_RED-0016.jpg 1 train
ESP_015962_1695_RED-0016-fv.jpg 1 train
ESP_015962_1695_RED-0016-brt.jpg 1 train
ESP_015962_1695_RED-0016-r90.jpg 1 train
ESP_015962_1695_RED-0016-r180.jpg 1 train
ESP_015962_1695_RED-0016-r270.jpg 1 train
ESP_015962_1695_RED-0016-fh.jpg 1 train
ESP_013049_0950_RED-0118.jpg 7 train
ESP_013049_0950_RED-0118-fv.jpg 7 train
ESP_013049_0950_RED-0118-brt.jpg 7 train
ESP_013049_0950_RED-0118-r90.jpg 
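
The Zenodo description quoted earlier says this file has three space-separated columns: filename, class_id and set. A sketch of reading it with `read_csv`, using a few rows from the listing above inlined so that it runs on its own. The column names are assumptions, since the file has no header row.

```python
import io
import pandas as pd

# First rows of the listing above, inlined so the sketch is self-contained
sample = io.StringIO(
    'ESP_013049_0950_RED-0067.jpg 7 train\n'
    'ESP_013049_0950_RED-0067-fv.jpg 7 train\n'
    'ESP_019697_2020_RED-0024.jpg 1 train\n'
)

# Column names are assumptions; the file itself has no header row
df_split = pd.read_csv(sample, sep=' ', header=None,
                       names=['file_name', 'label', 'set'])

# Number of rows assigned to each predefined split
print(df_split['set'].value_counts())
```

On the real file, the same `value_counts` call would show how the dataset authors divided the images between their splits.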

In [7]:
print_file('../data/hirise-map-proj-v3_2/landmarks_map-proj-v3_2_classmap.csv')

0,other
1,crater
2,dark dune
3,slope streak
4,bright dune
5,impact ejecta
6,swiss cheese
7,spider
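
The class map printed above is a headerless two-column file, so one way to use it later is to load it into a Python `dict`. This is a sketch under that assumption; the file contents are inlined so it runs on its own, and `load_class_map` is a hypothetical helper name.

```python
import io
import pandas as pd

def load_class_map(source):
    """Read a headerless `id,name` CSV into a {class_id: name} dict."""
    df = pd.read_csv(source, header=None, names=['label', 'class_name'])
    return {int(label): name for label, name in zip(df['label'], df['class_name'])}

# Inline copy of the file contents printed above
sample = io.StringIO(
    '0,other\n1,crater\n2,dark dune\n3,slope streak\n'
    '4,bright dune\n5,impact ejecta\n6,swiss cheese\n7,spider\n'
)
class_names = load_class_map(sample)
print(class_names[7])
```

Passing the real file path instead of `sample` should work the same way, since `read_csv` accepts either.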



In [8]:
#Use space as the delimiter, and it looks like the first column is the image name and the second the label
df_labels = pd.read_csv('../data/hirise-map-proj-v3_2/labels-map-proj_v3_2.txt', sep = ' ', header = None, names = ['image', 'label'])
df_labels.head()

,image,label
0,ESP_013049_0950_RED-0067.jpg,7
1,ESP_013049_0950_RED-0067-fv.jpg,7
2,ESP_013049_0950_RED-0067-brt.jpg,7
3,ESP_013049_0950_RED-0067-r90.jpg,7
4,ESP_013049_0950_RED-0067-r180.jpg,7


In [9]:
df_labels.shape

(64947, 2)

In [10]:
#Now let's see if we can find the label for the image we looked at earlier
df_labels.query("image == 'ESP_011283_2265_RED-0013-brt.jpg'")

,image,label
33746,ESP_011283_2265_RED-0013-brt.jpg,0


It looks like our image from earlier has a label of "0". Let's look at the data dictionary to see what that may mean.

In [11]:
print_file('../data/hirise-map-proj-v3_2/landmarks_map-proj-v3_2_classmap.csv')

0,other
1,crater
2,dark dune
3,slope streak
4,bright dune
5,impact ejecta
6,swiss cheese
7,spider



So it looks like our data dictionary has entries for 8 classes: 7 named landmark types plus 0, the "other" category. This "other" category seems to be a catch-all for images that don't match any of the named landmarks. Let's get a good idea of how many we have of each in the data we've loaded.

In [12]:
df_labels.groupby('label').count()

label,image
0,52722
1,5024
2,766
3,1575
4,1654
5,476
6,1834
7,896
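
To make these counts easier to read, they can be paired with the class names and expressed as shares of the whole. A sketch using the counts from the output above and the class map printed earlier, with both rebuilt inline so the block runs on its own:

```python
import pandas as pd

class_names = {0: 'other', 1: 'crater', 2: 'dark dune', 3: 'slope streak',
               4: 'bright dune', 5: 'impact ejecta', 6: 'swiss cheese', 7: 'spider'}

# Counts copied from the groupby output above
counts = pd.Series({0: 52722, 1: 5024, 2: 766, 3: 1575,
                    4: 1654, 5: 476, 6: 1834, 7: 896}, name='count')

summary = counts.to_frame()
summary['class_name'] = summary.index.map(class_names)
summary['share'] = (summary['count'] / summary['count'].sum()).round(4)
print(summary)
```

In the notebook itself, `df_labels['label'].value_counts()` could supply the counts instead of the hard-coded Series.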


It looks like the 0 label makes up the vast majority of our data. I can see this becoming a problem, as most of our data is part of the catch-all category. This category may be difficult for a model to identify, as the only thing these records have in common is that they don't fit any other category. We might have to consider dropping these, but in doing so we'll lose the majority of the data. This may be worth doing if this data causes us problems, but it isn't a decision we should take lightly, as we will then be working with a much smaller data set.
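
To put numbers on that trade-off, the sketch below uses the counts above to show how many rows would remain without the "other" class. It also computes inverse-frequency class weights as one alternative to dropping data. The weighting formula, `n_samples / (n_classes * count)`, is a common heuristic introduced here for illustration; the notebook itself doesn't use it.

```python
# Counts copied from the groupby output above
counts = {0: 52722, 1: 5024, 2: 766, 3: 1575, 4: 1654, 5: 476, 6: 1834, 7: 896}

total = sum(counts.values())
remaining = total - counts[0]
print('Rows left without class 0:', remaining)
print('Share of the data kept:', round(remaining / total, 4))

# Inverse-frequency weights (an assumed heuristic): rare classes get larger weights
n_classes = len(counts)
class_weights = {label: total / (n_classes * count) for label, count in counts.items()}
for label, weight in class_weights.items():
    print(label, round(weight, 3))
```

Weights like these can be passed to many classifiers so that mistakes on rare classes cost more, which keeps all the data in play.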

Let's attempt to load the images. There's an interesting tutorial [here](https://kapernikov.com/tutorial-image-classification-with-scikit-learn/) that I referenced while trying to figure out how to read the images.

In [13]:
directory = '../data/hirise-map-proj-v3_2/map-proj-v3_2'
directory_contents = os.listdir(directory)
#Adjust this according to your hardware setup. I'm leaving 2 logical
#CPUs free for other tasks, but always using at least 1 worker
num_of_workers = max(1, (os.cpu_count() or 1) - 2)

#Read images in parallel across workers
images = Parallel(n_jobs = num_of_workers)(delayed(imread)(os.path.join(directory, file)) for file in directory_contents)

In [14]:
#Let's take a look to see what the data returned looks like
print(type(images))
print(len(images))

<class 'list'>
64947


In [15]:
#Let's look at what one item (image) looks like
images[0]

array([[167, 167, 167, ..., 175, 175, 175],
       [168, 168, 169, ..., 177, 177, 177],
       [168, 168, 170, ..., 177, 177, 177],
       ...,
       [189, 187, 186, ..., 158, 160, 159],
       [194, 191, 190, ..., 156, 159, 158],
       [199, 197, 195, ..., 158, 161, 162]], dtype=uint8)

In [16]:
#Let's also check the shape of the elements
images[0].shape

(227, 227)

It looks like we have the same number of images as labels (64,947) in this one folder, as expected. We can also see that each image is stored as a two-dimensional array of `uint8` integers with the shape (227, 227). This matches what we'd expect from `imread`: according to the [documentation](https://scikit-image.org/docs/stable/api/skimage.io.html#skimage.io.imread), it returns an MxN array for a greyscale image, with one value per pixel. NASA's data catalog states the images are 227 by 227 pixels, so this all matches our expectations.
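
Checking `images[0]` only tells us about one image. A quick way to confirm the whole list is consistent is to collect the distinct shapes and dtypes. The sketch below uses small synthetic arrays in place of the real images so it runs on its own; `summarize_images` is a hypothetical helper.

```python
import numpy as np

def summarize_images(images):
    """Return the sets of distinct shapes and dtypes found in a list of arrays."""
    shapes = {image.shape for image in images}
    dtypes = {str(image.dtype) for image in images}
    return shapes, dtypes

# Synthetic stand-ins for the real 227 x 227 greyscale images
fake_images = [np.zeros((227, 227), dtype=np.uint8) for _ in range(3)]
shapes, dtypes = summarize_images(fake_images)
print(shapes, dtypes)
```

If `shapes` contains a single entry, the list can also be stacked with `np.stack(images)` into one array of shape `(n_images, 227, 227)`.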

From here we'll read the images again, this time into a DataFrame, which should be easier to work with than a Python list when we join the images to their labels. Next, we'll match each image to its label, split the data into training, cross validation, and test groups, and finally write the cleaned DataFrame to files for use in the later analysis steps. We'll perform these actions in the subsequent sections.

## Load Image Data



In [17]:
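def read_images(directory, number_of_parallel_workers):
    '''Read every image in a directory into a DataFrame of (file_name, image) rows.'''
    directory_contents = os.listdir(directory)

    def image_reading_helper(file_path):
        #Keep the file name with the pixel array so we can join to the labels later
        image = imread(file_path)
        file_name = os.path.basename(file_path)
        return (file_name, image)

    #Read images in parallel across workers
    images = Parallel(n_jobs = number_of_parallel_workers)(delayed(image_reading_helper)(os.path.join(directory, file)) for file in directory_contents)
    df_images = pd.DataFrame(data = images, columns = ['file_name', 'image'])
    return df_images

#Leave 2 logical CPUs free, but always use at least 1 worker
num_of_workers = max(1, (os.cpu_count() or 1) - 2)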
df_images = read_images('../data/hirise-map-proj-v3_2/map-proj-v3_2', num_of_workers)

In [18]:
df_images.head()

,file_name,image
0,ESP_012810_0925_RED-0115-brt.jpg,"[[167, 167, 167, 168, 168, 168, 166, 164, 162,..."
1,ESP_024646_2570_RED-0016-r270.jpg,"[[183, 180, 175, 167, 165, 164, 161, 163, 165,..."
2,PSP_010087_1555_RED-0181-r90.jpg,"[[110, 108, 106, 106, 107, 105, 115, 115, 100,..."
3,ESP_025151_1570_RED-0151-r90.jpg,"[[47, 45, 44, 45, 46, 46, 45, 44, 42, 44, 44, ..."
4,ESP_012494_2050_RED-0044-r90.jpg,"[[114, 111, 115, 120, 120, 120, 120, 117, 118,..."


## Transform and Clean

In [19]:
#Rename the 'image' column so both data frames share the 'file_name' key
df_labels = df_labels.rename(columns = {'image': 'file_name'})

In [20]:
df_labeled_images = df_images.merge(df_labels, on = 'file_name', how = 'inner')
df_labeled_images.head()

,file_name,image,label
0,ESP_012810_0925_RED-0115-brt.jpg,"[[167, 167, 167, 168, 168, 168, 166, 164, 162,...",6
1,ESP_024646_2570_RED-0016-r270.jpg,"[[183, 180, 175, 167, 165, 164, 161, 163, 165,...",4
2,PSP_010087_1555_RED-0181-r90.jpg,"[[110, 108, 106, 106, 107, 105, 115, 115, 100,...",1
3,ESP_025151_1570_RED-0151-r90.jpg,"[[47, 45, 44, 45, 46, 46, 45, 44, 42, 44, 44, ...",0
4,ESP_012494_2050_RED-0044-r90.jpg,"[[114, 111, 115, 120, 120, 120, 120, 117, 118,...",0


In [21]:
df_labeled_images.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64947 entries, 0 to 64946
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   file_name  64947 non-null  object
 1   image      64947 non-null  object
 2   label      64947 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.5+ MB


In [22]:
df_labeled_images['image'][0].shape

(227, 227)
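
An inner merge silently drops rows that don't match, so it's worth checking that nothing was lost. Here the merged frame has 64,947 rows, the same as both inputs, so nothing appears to have been dropped. For a more explicit check, pandas' `merge` accepts `validate` and `indicator` arguments. The sketch uses toy frames, with strings standing in for image arrays, and a deliberately unmatched row on each side:

```python
import pandas as pd

df_images_toy = pd.DataFrame({'file_name': ['a.jpg', 'b.jpg', 'c.jpg'],
                              'image': ['img_a', 'img_b', 'img_c']})
df_labels_toy = pd.DataFrame({'file_name': ['a.jpg', 'b.jpg', 'd.jpg'],
                              'label': [7, 1, 0]})

# validate raises MergeError if file_name is duplicated on either side;
# indicator adds a _merge column showing which side each row came from
merged = df_images_toy.merge(df_labels_toy, on='file_name', how='outer',
                             validate='one_to_one', indicator=True)

# Rows present on only one side would be lost by an inner merge
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['file_name', '_merge']])
```

An empty `unmatched` frame on the real data would confirm that every image has exactly one label and every label has an image.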

## Split Data Set

In [23]:
from sklearn.model_selection import train_test_split

#We want 70% of the data for training
X_train, X_test, y_train, y_test  = train_test_split(df_labeled_images['image'], df_labeled_images['label'], train_size = 0.7, random_state = 13)
#We'll split the remaining 30% between cross validation set and final test data
X_cross, X_test, y_cross, y_test = train_test_split(X_test, y_test, train_size = 0.5, random_state = 13)

In [24]:
#Let's make sure we got the right sample sizes for each group
full_data_size = df_labeled_images.shape[0]

#Train data set
print('X_train size: ', X_train.size, '\tExpected size: ', full_data_size * 0.7)
print('y_train size: ', y_train.size, '\tExpected size: ', full_data_size * 0.7)

#Cross validation data set
print('X_cross size: ', X_cross.size, '\tExpected size: ', full_data_size * 0.15)
print('y_cross size: ', y_cross.size, '\tExpected size: ', full_data_size * 0.15)

#Test data set
print('X_test size: ', X_test.size, '\tExpected size: ', full_data_size * 0.15)
print('y_test size: ', y_test.size, '\tExpected size: ', full_data_size * 0.15)

X_train size:  45462 	Expected size:  45462.899999999994
y_train size:  45462 	Expected size:  45462.899999999994
X_cross size:  9742 	Expected size:  9742.05
y_cross size:  9742 	Expected size:  9742.05
X_test size:  9743 	Expected size:  9742.05
y_test size:  9743 	Expected size:  9742.05


In [25]:
#Check data distribution

def print_group_distribution (labels, group_name):
    print('Label distribution of ', group_name, ':')
    print(labels.groupby(labels).count() / labels.shape[0])
    #Add blank line at the end
    print()

group_labels = [(df_labeled_images['label'], 'Full Data Set'), (y_train, 'Training Set'), (y_cross, 'Cross Validation'), (y_test, 'Test Set')]
for label_set, data_set_name in group_labels:
    print_group_distribution(label_set, data_set_name)

Label distribution of  Full Data Set :
label
0    0.811770
1    0.077355
2    0.011794
3    0.024251
4    0.025467
5    0.007329
6    0.028238
7    0.013796
Name: label, dtype: float64

Label distribution of  Training Set :
label
0    0.813823
1    0.076152
2    0.011570
3    0.023998
4    0.025296
5    0.007215
6    0.028573
7    0.013374
Name: label, dtype: float64

Label distribution of  Cross Validation :
label
0    0.807432
1    0.078321
2    0.013242
3    0.024122
4    0.025559
5    0.007288
6    0.029973
7    0.014063
Name: label, dtype: float64

Label distribution of  Test Set :
label
0    0.806528
1    0.082008
2    0.011393
3    0.025557
4    0.026173
5    0.007903
6    0.024941
7    0.015498
Name: label, dtype: float64
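
The proportions above are close across the splits but not identical, which matters most for small classes such as label 5. If tighter agreement is wanted, `train_test_split` accepts a `stratify` argument that preserves label proportions in each split. A sketch on synthetic labels, assuming scikit-learn is installed (as it is for the split cell above); the `_demo` names are hypothetical and kept separate from the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80% class 0, 20% class 1
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# 70% for training, stratified on the labels
X_train_demo, X_rest_demo, y_train_demo, y_rest_demo = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=13, stratify=y_demo)

# Split the remaining 30% evenly, stratifying again on what is left
X_cross_demo, X_test_demo, y_cross_demo, y_test_demo = train_test_split(
    X_rest_demo, y_rest_demo, train_size=0.5, random_state=13, stratify=y_rest_demo)

for name, labels in [('train', y_train_demo), ('cross', y_cross_demo), ('test', y_test_demo)]:
    print(name, len(labels), round(labels.mean(), 3))
```

Because the labels are 0 or 1 here, the mean of each split is the share of class 1, which should stay near 0.2 in all three.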



## Write Data to Files

In [26]:
os.makedirs('../data/processed_data', exist_ok = True)

In [27]:
#Write labeled image DF
df_labeled_images.to_csv('../data/processed_data/labeled_images.csv')

In [28]:
#Write training data sets
X_train.to_csv('../data/processed_data/x_train.csv')
y_train.to_csv('../data/processed_data/y_train.csv')

In [29]:
#Write cross validation data sets
X_cross.to_csv('../data/processed_data/x_cross_validation.csv')
y_cross.to_csv('../data/processed_data/y_cross_validation.csv')

In [30]:
#Write test data sets
X_test.to_csv('../data/processed_data/x_test.csv')
y_test.to_csv('../data/processed_data/y_test.csv')

In [31]:
#Check the data directory to verify the files are there
print_dir('../data/processed_data')

        Name     Full Path
y_cross_validation.csv  ../data/processed_data/y_cross_validation.csv
  y_test.csv  ../data/processed_data/y_test.csv
 x_train.csv  ../data/processed_data/x_train.csv
labeled_images.csv  ../data/processed_data/labeled_images.csv
 y_train.csv  ../data/processed_data/y_train.csv
x_cross_validation.csv  ../data/processed_data/x_cross_validation.csv
  x_test.csv  ../data/processed_data/x_test.csv
Showing 7 of 7 items
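
One caveat with writing the `image` column to CSV: `to_csv` stores each array as its string form, and NumPy abbreviates large arrays with `...` when converting them to strings, so the pixel values may not survive a round trip. It's worth reading one file back to check. If that turns out to be a problem, a binary format is one option. The sketch below uses `np.savez_compressed` with synthetic data and a temporary directory; the file name and array keys are hypothetical.

```python
import os
import tempfile
import numpy as np

# Synthetic stand-ins for the real images and labels
images = np.random.default_rng(13).integers(0, 256, size=(4, 227, 227), dtype=np.uint8)
labels = np.array([0, 1, 0, 7])

# The string form of a large array is abbreviated; this is what a CSV cell would hold
print('...' in str(images[0]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'labeled_images.npz')
    # Save both arrays in one compressed file, then load them back
    np.savez_compressed(path, images=images, labels=labels)
    with np.load(path) as data:
        restored_images = data['images']
        restored_labels = data['labels']

# True when every pixel value survived the round trip
print(np.array_equal(images, restored_images))
```

In the notebook, `np.stack(df_labeled_images['image'])` would produce the image array to save, with the file names stored alongside if they are needed later.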
