## Exploratory Analysis

The first part of wrangling our data is an exploratory analysis. In this section we'll explore the raw data to understand how it's formatted and which fields are available. Before loading anything into the notebook, we'll examine the folder structure and take inventory of the files available to us. After that we can start exploring their contents and loading the data for exploratory analysis with Pandas and Python.

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from skimage.io import imread
from joblib import Parallel, delayed
import os

First, let's explore the folders we stored the data in a little bit. There are some files in the folders that will help us better understand and navigate the data. Let's take a look and see what we have.

In [2]:
!ls ../data

hirise-map-proj-v3_2	landmarks_map-proj-v3_classmap.csv  map-proj-v3
labels-map-proj-v3.txt	__MACOSX			    README.txt


One of the things that stands out to me is the `hirise-map-proj-v3_2` folder. This seems to match up with our second zip file from the original download. If this is the case, I suspect we'll find files inside this folder similar to the ones we see above. Let's take a look.

In [3]:
!ls ../data/hirise-map-proj-v3_2

labels-map-proj_v3_2_train_val_test.txt  landmarks_map-proj-v3_2_classmap.csv
labels-map-proj_v3_2.txt		 map-proj-v3_2
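
The `!ls` calls above rely on a Unix shell. As a portable alternative, the sketch below takes a similar inventory from Python with `pathlib`, counting files per extension in each folder. The helper name `inventory` is made up for illustration, and the demo runs on a throwaway temporary directory rather than on `../data`.

```python
import tempfile
from collections import Counter
from pathlib import Path


def inventory(root):
    """Map each folder under root (as a relative path) to a Counter of file extensions."""
    root = Path(root)
    summary = {}
    for path in sorted(root.rglob('*')):
        if path.is_file():
            folder = str(path.parent.relative_to(root))
            summary.setdefault(folder, Counter())[path.suffix or '(none)'] += 1
    return summary


# Demo on a temporary directory that loosely mimics the layout seen above
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / 'map-proj-v3').mkdir()
    (base / 'README.txt').write_text('readme')
    (base / 'map-proj-v3' / 'a.jpg').write_bytes(b'')
    (base / 'map-proj-v3' / 'b.jpg').write_bytes(b'')
    demo_summary = inventory(base)

for folder, counts in demo_summary.items():
    print(folder, dict(counts))
```

Files that sit directly in the root are reported under the folder `.`. In the notebook, `inventory('../data')` would be the equivalent call.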


Most of these look similar, with a few differences. I would like to explore some of these further, but first let's check the `README.txt` one level up to see if there's a better explanation of the files we're working with.

In [4]:
!cat ../data/README.txt

Mars orbital image (HiRISE) labeled data set version 3
--------------------------------------------
Authors: Kiri L. Wagstaff, Steven Lu, Gary Doran, Lukas Mandrake
Contact: you.lu@jpl.nasa.gov

This data set contains a total of 73,031 landmarks. 10,433 landmarks were detected and extracted from 180 HiRISE browse images, and 62,598 landmarks were augmented from 10,433 original landmarks. For each original landmark, we cropped a square bounding box that includes the full extent of the landmark plus a 30-pixel margin to left, right, top and bottom. Each cropped landmark was resized to 227x227 pixels, and then was augmented to generate 6 additional landmarks using the following methods:

1. 90 degrees clockwise rotation
2. 180 degrees clockwise rotation
3. 270 degrees clockwise rotation
4. Horizontal flip
5. Vertical flip
6. Random brightness adjustment


Contents:
- map-proj-v3/: Directory containing individual cropped landmark images
- labels-map-proj-v3.txt: Class labels (ids) for each

Based on the readme, it sounds like both `map-proj-v3` and `map-proj-v3_2` contain the images, while `labels-map-proj-v3.txt` and `labels-map-proj_v3_2.txt` hold the labels for those images. `landmarks_map-proj-v3_classmap.csv` is a data dictionary, presumably for the labels (landmarks) assigned to each image. The readme doesn't explain what `labels-map-proj_v3_2_train_val_test.txt` is, which leaves me a bit curious about this file.
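
One low-cost way to inspect an unfamiliar file such as `labels-map-proj_v3_2_train_val_test.txt` is to read only its first few lines. The sketch below does that with a hypothetical helper named `peek`; the demo file and its contents are made up and say nothing about what the real file holds.

```python
import tempfile
from itertools import islice
from pathlib import Path


def peek(path, n=5):
    """Return the first n lines of a text file without reading the whole file."""
    with open(path, encoding='utf-8') as handle:
        return [line.rstrip('\n') for line in islice(handle, n)]


# Demo on a temporary file with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.txt'
    sample.write_text('first\nsecond\nthird\n', encoding='utf-8')
    first_two = peek(sample, n=2)

print(first_two)
```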

We'll want to explore all of these folders and some of the files in more depth. From there we can get a better idea of how we can wrangle this data for use in our analysis. Let's take a quick look at the contents of `map-proj-v3` to find an image to take a look at.

In [5]:
!ls ../data/map-proj-v3

ESP_011283_2265_RED-0013-brt.jpg   ESP_025050_1680_RED-0222-r180.jpg
ESP_011283_2265_RED-0013-fh.jpg    ESP_025050_1680_RED-0222-r270.jpg
ESP_011283_2265_RED-0013-fv.jpg    ESP_025050_1680_RED-0222-r90.jpg
ESP_011283_2265_RED-0013.jpg	   ESP_025050_1680_RED-0223-brt.jpg
ESP_011283_2265_RED-0013-r180.jpg  ESP_025050_1680_RED-0223-fh.jpg
ESP_011283_2265_RED-0013-r270.jpg  ESP_025050_1680_RED-0223-fv.jpg
ESP_011283_2265_RED-0013-r90.jpg   ESP_025050_1680_RED-0223.jpg
ESP_011283_2265_RED-0017-brt.jpg   ESP_025050_1680_RED-0223-r180.jpg
ESP_011283_2265_RED-0017-fh.jpg    ESP_025050_1680_RED-0223-r270.jpg
ESP_011283_2265_RED-0017-fv.jpg    ESP_025050_1680_RED-0223-r90.jpg
ESP_011283_2265_RED-0017.jpg	   ESP_025050_1680_RED-0225-brt.jpg
ESP_011283_2265_RED-0017-r180.jpg  ESP_025050_1680_RED-0225-fh.jpg
ESP_011283_2265_RED-0017-r270.jpg  ESP_025050_1680_RED-0225-fv.jpg
ESP_011283_2265_RED-0017-r90.jpg   ESP_025050_1680_RED-0225.jpg
ESP_011283_2265_RED-0030-brt.jpg   ESP_025050_1680_RED-0225-r1

This is what the `ESP_011283_2265_RED-0013-brt.jpg` image looks like.

![image](../data/map-proj-v3/ESP_011283_2265_RED-0013-brt.jpg)
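
The file names in the listing appear to follow a pattern: a source image id, a crop number, and an optional suffix (`r90`, `r180`, `r270`, `fh`, `fv`, `brt`). The sketch below splits a name into those parts. The mapping of suffixes to the six augmentations in the readme is my assumption based on the abbreviations, not something the readme states, and the names `parse_name` and `NAME_PATTERN` are made up for illustration.

```python
import re

# Assumed layout: <source id>-<crop number>[-<augmentation suffix>].jpg
NAME_PATTERN = re.compile(
    r'^(?P<source>.+)-(?P<crop>\d+)(?:-(?P<aug>r90|r180|r270|fh|fv|brt))?\.jpg$'
)


def parse_name(file_name):
    """Split a landmark file name into (source id, crop number, augmentation)."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f'Unexpected file name: {file_name}')
    # Names without a suffix are treated as the original, unaugmented crop
    return match['source'], match['crop'], match['aug'] or 'original'


print(parse_name('ESP_011283_2265_RED-0013-brt.jpg'))
print(parse_name('ESP_011283_2265_RED-0013.jpg'))
```

Grouping on the first two parts would keep an original crop together with its augmented copies, which may matter later if the data is split for training and testing.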

As `labels-map-proj-v3.txt` is supposed to contain the labels for the images, we should be able to find the label for this image. If we can find it for this one, we should be able to match up the labels for all of them. Let's try to load the `labels-map-proj-v3.txt` file and see if we can find a match for this image as a first step.

In [6]:
#First, let's take a look at the raw file to see if there are headers and what delimiter to use
!cat '../data/labels-map-proj-v3.txt'

ESP_011623_2100_RED-0069.jpg 0
ESP_011623_2100_RED-0069-r90.jpg 0
ESP_011623_2100_RED-0069-r180.jpg 0
ESP_011623_2100_RED-0069-r270.jpg 0
ESP_011623_2100_RED-0069-fh.jpg 0
ESP_011623_2100_RED-0069-fv.jpg 0
ESP_011623_2100_RED-0069-brt.jpg 0
ESP_014156_1865_RED-0062.jpg 0
ESP_014156_1865_RED-0062-r90.jpg 0
ESP_014156_1865_RED-0062-r180.jpg 0
ESP_014156_1865_RED-0062-r270.jpg 0
ESP_014156_1865_RED-0062-fh.jpg 0
ESP_014156_1865_RED-0062-fv.jpg 0
ESP_014156_1865_RED-0062-brt.jpg 0
ESP_018321_2565_RED-0025.jpg 0
ESP_018321_2565_RED-0025-r90.jpg 0
ESP_018321_2565_RED-0025-r180.jpg 0
ESP_018321_2565_RED-0025-r270.jpg 0
ESP_018321_2565_RED-0025-fh.jpg 0
ESP_018321_2565_RED-0025-fv.jpg 0
ESP_018321_2565_RED-0025-brt.jpg 0
ESP_027802_1685_RED-0117.jpg 0
ESP_027802_1685_RED-0117-r90.jpg 0
ESP_027802_1685_RED-0117-r180.jpg 0
ESP_027802_1685_RED-0117-r270.jpg 0
ESP_027802_1685_RED-0117-fh.jpg 0
ESP_027802_1685_RED-0117-fv.jpg 0
ESP_027802_1685_RED-0117-brt.jpg 0
ESP_028733_1370_RED-0403.jpg 0
ESP_0

In [7]:
#Use space as the delimiter, and it looks like the first column is the image name and the second the label
df_labels_1 = pd.read_csv('../data/labels-map-proj-v3.txt', sep = ' ', header = None, names = ['image', 'label'])
df_labels_1.head()

                               image  label
0       ESP_011623_2100_RED-0069.jpg      0
1   ESP_011623_2100_RED-0069-r90.jpg      0
2  ESP_011623_2100_RED-0069-r180.jpg      0
3  ESP_011623_2100_RED-0069-r270.jpg      0
4    ESP_011623_2100_RED-0069-fh.jpg      0


In [40]:
df_labels_1.shape

(73031, 2)

In [8]:
#Now let's see if we can find the label for the image we looked at earlier
df_labels_1.query("image == 'ESP_011283_2265_RED-0013-brt.jpg'")

                                  image  label
42790  ESP_011283_2265_RED-0013-brt.jpg      0


It looks like our image from earlier has a label of "0". Let's look at the data dictionary to see what that may mean.

In [9]:
!cat ../data/landmarks_map-proj-v3_classmap.csv

0,other
1,crater
2,dark dune
3,slope streak
4,bright dune
5,impact ejecta
6,swiss cheese
7,spider


So it looks like our data dictionary has entries for eight classes: seven named landmark types plus 0, the "other" category. This "other" category seems to be a catch-all for images that don't match any of the named landmarks. Let's get a good idea of how many we have of each in the data we've loaded.

In [10]:
df_labels_1.groupby('label').count()

       image
label
0      61054
1       4900
2       1141
3       2331
4       1750
5        231
6       1148
7        476
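
The counts are easier to read with class names and percentages attached. The sketch below joins the class map to the counts for the first label file and adds each class's share. To stay self-contained it rebuilds both tables inline from the outputs shown above; the column names `name`, `count`, and `share_pct` are my own choices.

```python
import io

import pandas as pd

# Class map, copied from landmarks_map-proj-v3_classmap.csv as printed above
classmap_csv = """0,other
1,crater
2,dark dune
3,slope streak
4,bright dune
5,impact ejecta
6,swiss cheese
7,spider
"""
class_map = pd.read_csv(io.StringIO(classmap_csv), header=None, names=['label', 'name'])

# Per-label counts, copied from the groupby output for labels-map-proj-v3.txt
counts = pd.DataFrame({
    'label': list(range(8)),
    'count': [61054, 4900, 1141, 2331, 1750, 231, 1148, 476],
})

# Attach names to counts, then express each count as a percentage of the total
summary = class_map.merge(counts, on='label')
summary['share_pct'] = (summary['count'] / summary['count'].sum() * 100).round(1)
print(summary.to_string(index=False))
```

In the notebook, `df_labels_1['label'].value_counts()` could take the place of the hand-copied `counts` table.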


Let's check out the counts for the other set of images.

In [11]:
df_labels_2 = pd.read_csv('../data/hirise-map-proj-v3_2/labels-map-proj_v3_2.txt', sep = ' ', header = None, names = ['image', 'label'])
df_labels_2.groupby('label').count()

       image
label
0      52722
1       5024
2        766
3       1575
4       1654
5        476
6       1834
7        896


It looks like the 0 label makes up the vast majority of our data: about 84% of the first set and 81% of the second. I can see this becoming a problem, as most of our data belongs to the catch-all category. This category may be difficult for a model to identify, as the only thing these records have in common is that they don't fit any other category. We might have to consider dropping these, but in doing so we'll lose the majority of the data. This may be worth doing if this data causes us problems, but it isn't a decision we should take lightly, as we will then be working with a much smaller data set.
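
To put a number on that trade-off, the sketch below works out how many rows would remain in each set if every row with label 0 were dropped. The counts are copied from the two groupby outputs above, and the helper name `remaining_after_drop` is made up for illustration.

```python
# Per-label counts copied from the two groupby outputs above
counts_v3 = {0: 61054, 1: 4900, 2: 1141, 3: 2331, 4: 1750, 5: 231, 6: 1148, 7: 476}
counts_v3_2 = {0: 52722, 1: 5024, 2: 766, 3: 1575, 4: 1654, 5: 476, 6: 1834, 7: 896}


def remaining_after_drop(counts, label=0):
    """Return (rows kept, fraction kept) if every row with `label` is dropped."""
    total = sum(counts.values())
    kept = total - counts[label]
    return kept, kept / total


for name, counts in [('v3', counts_v3), ('v3_2', counts_v3_2)]:
    kept, fraction = remaining_after_drop(counts)
    print(f'{name}: {kept} rows kept ({fraction:.1%})')
```

Dropping the "other" class would leave 11,977 rows in the first set and 12,225 in the second.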

Let's attempt to load the images. There's an interesting tutorial [here](https://kapernikov.com/tutorial-image-classification-with-scikit-learn/) that I referenced while trying to figure out how to read the images.

In [36]:
directory = '../data/map-proj-v3'
directory_contents = os.listdir(directory)
#Adjust this according to your hardware setup. I'm leaving 2 logical
#CPUs free for other tasks; max() keeps at least 1 worker on small machines
num_of_workers = max(1, (os.cpu_count() or 1) - 2)

#Read images in parallel across workers
images_1 = Parallel(n_jobs = num_of_workers)(delayed(imread)(os.path.join(directory, file)) for file in directory_contents)

In [35]:
#Let's take a look to see what the data returned looks like
print(type(images_1))
print(len(images_1))

<class 'list'>
73031


In [37]:
#Let's look at what one item (image) looks like
images_1[0]

array([[167, 167, 167, ..., 175, 175, 175],
       [168, 168, 169, ..., 177, 177, 177],
       [168, 168, 170, ..., 177, 177, 177],
       ...,
       [189, 187, 186, ..., 158, 160, 159],
       [194, 191, 190, ..., 156, 159, 158],
       [199, 197, 195, ..., 158, 161, 162]], dtype=uint8)

In [39]:
#Let's also check the shape of the elements
images_1[0].shape

(227, 227)

We only read one of the two data folders, but the 73,031 images we loaded match the 73,031 rows in `labels-map-proj-v3.txt`, as expected. We can also see that each image is stored as a 2-D array of unsigned 8-bit integers with shape (227, 227). This matches what we'd expect from `imread`. As stated by the [documentation](https://scikit-image.org/docs/stable/api/skimage.io.html#skimage.io.imread), the function returns an MxN array for a grey image, with one value per pixel. NASA's data catalog states the images are 227 by 227 pixels, so this all matches our expectations.
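
Equal counts suggest, but do not prove, that every image has a label. A stricter check compares the file names themselves. The sketch below does this with sets; the names are made up, and in the notebook the inputs would be `directory_contents` and `df_labels_1['image']`. The helper name `compare_names` is hypothetical.

```python
def compare_names(image_files, labelled_files):
    """Return (files without a label, labels without a file), each sorted."""
    image_set, label_set = set(image_files), set(labelled_files)
    return sorted(image_set - label_set), sorted(label_set - image_set)


# Made-up names for illustration
on_disk = ['a.jpg', 'b.jpg', 'c.jpg']
in_labels = ['a.jpg', 'b.jpg', 'd.jpg']

unlabelled, missing = compare_names(on_disk, in_labels)
print('Images without a label:', unlabelled)
print('Labels without an image:', missing)
```

Two empty lists would confirm a one-to-one match between the folder and the label file.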

From here we'll want to convert the data to a numpy ndarray, as this will both perform better and be easier to work with than a Python list. We'll then want to merge the images from both folders into one collection. Next, we'll want to match each image up to its label, and finally we'll want to write the cleaned data to a file for use in the later analysis steps. We'll perform these actions in the subsequent data cleaning section.
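
As a preview, here is a minimal sketch of the first and third steps, using tiny synthetic arrays in place of the real images. One detail it illustrates: `os.listdir` returns names in arbitrary order, so labels are best aligned to images by file name rather than by position. The names `file_names`, `labels_by_name`, and `image_stack` are hypothetical, and the label values are made up.

```python
import numpy as np

# Hypothetical stand-ins: file names in the arbitrary order os.listdir might return,
# with one synthetic 227x227 greyscale image per name
file_names = ['b.jpg', 'a.jpg', 'c.jpg']
images = [np.full((227, 227), i, dtype=np.uint8) for i in range(len(file_names))]

# Made-up labels keyed by file name, as they could be built from the labels file
labels_by_name = {'a.jpg': 0, 'b.jpg': 1, 'c.jpg': 3}

# Stack the list of 2-D arrays into one 3-D array: (n_images, height, width)
image_stack = np.stack(images)

# Look up each label by file name so labels[i] belongs to image_stack[i]
labels = np.array([labels_by_name[name] for name in file_names])

print(image_stack.shape, image_stack.dtype)
print(labels)
```

`np.stack` only works if every image has the same shape, so it doubles as a check that all of the crops really are 227 by 227.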