### Data Science Case Study Options 

Please select and complete one of the following case studies. We are looking for you to show off your machine learning and coding skills using Python or R. You are not required to use AWS in your solution, but you are welcome to spin up an EC2 instance if you would like to. If you use any AWS services, please remember to terminate them after you complete the exercise.

Please note:
- Do not post your solution on GitHub or any other public site.
- Submit a work sample that is comprehensive with respect to your thought process, code, findings, and recommendations (i.e., a notebook with annotations). You may submit other documents if you wish.

Please send the completed work sample as a PDF displaying your code, annotations, and thought process to kdalenbe@amazon.com, and CC your recruiter, at least 24 hours prior to the virtual interview.
On the day of the interview you will need access to a laptop or desktop, as you will be sharing your screen and going through the work sample with the interviewer.

## Option 2: Geological Image Similarity 

BACKGROUND 
A geology research company wants to create a tool for identifying interesting patterns in its imagery data. This tool will have a search capability: an analyst provides an image of interest and is presented with other images that are similar to it.

GOAL 
Your task is to create the machine learning component for this image similarity application. Given a single input image, the machine learning model should return the top K images that are most similar to it.
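To make the "top K most similar" requirement concrete, here is a minimal sketch of the retrieval step. It assumes each image has already been turned into a numeric feature vector, and it uses cosine similarity as the similarity measure. Both are illustrative assumptions, not requirements of the case study. The function name `top_k_similar` and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_similar(query_vec, gallery_vecs, k=5):
    """Return indices and cosine similarities of the k gallery rows closest to query_vec."""
    # L2-normalise so that a dot product equals cosine similarity.
    # The small epsilon guards against division by zero for all-zero vectors.
    eps = 1e-12
    q = query_vec / (np.linalg.norm(query_vec) + eps)
    g = gallery_vecs / (np.linalg.norm(gallery_vecs, axis=1, keepdims=True) + eps)
    sims = g @ q
    # Never ask for more results than there are gallery images.
    k = min(k, len(sims))
    # Sort by descending similarity; a stable sort keeps the lower index first on ties.
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]

# Toy example with made-up 3-dimensional "feature vectors" (not real image data).
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

idx, scores = top_k_similar(query, gallery, k=2)
print(idx)     # indices of the two most similar gallery rows
print(scores)  # their cosine similarities, highest first
```

A full sort is fine at the scale of a few tens of thousands of images. For much larger collections, a partial sort or an approximate nearest-neighbour index would be the usual next step.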

In [1]:
import os 
import shutil
from PIL import Image
import pandas as pd
import pathlib
import numpy as np

In [2]:
cwd_path = os.getcwd()
print('Project Folder:')
print(cwd_path)

Project Folder:
/home/david/Documents/projects/aws_geo_sim


In [3]:
data_path = os.path.join(cwd_path,'data','geological_similarity')
# Confirm that the extracted data folder exists before going further
if os.path.isdir(data_path):
    print('(Success) Data Path:')
    print(data_path)
else:
    print('(ERROR) Extracted data folder is not found at:')
    print(data_path)

(Success) Data Path:
/home/david/Documents/projects/aws_geo_sim/data/geological_similarity


In [4]:
# Remove macOS archive residue so that it is not counted as image data
# (shutil is already imported in the first cell)
silly_file = os.path.join(data_path, '.DS_Store')
if os.path.exists(silly_file):
    os.remove(silly_file)
    print('(Success) Removing silly Mac files.')
else:
    print('(Success) Silly Mac files already removed.')

silly_path = os.path.join(cwd_path, 'data', '__MACOSX')
if os.path.exists(silly_path):
    shutil.rmtree(silly_path)
    print('(Success) Removing silly Mac files.')
else:
    print('(Success) Silly Mac files already removed.')

(Success) Silly Mac files already removed.
(Success) Silly Mac files already removed.


In [5]:
all_file_paths = []

print('Data Folders:')
for root, dirs, files in os.walk(data_path):
    for name in dirs:
        print(os.path.join(root, name))
    for name in files:
        all_file_paths.append(os.path.join(root, name))

print('Number of Image Files:')
print(len(all_file_paths))

Data Folders:
/home/david/Documents/projects/aws_geo_sim/data/geological_similarity/schist
/home/david/Documents/projects/aws_geo_sim/data/geological_similarity/quartzite
/home/david/Documents/projects/aws_geo_sim/data/geological_similarity/andesite
/home/david/Documents/projects/aws_geo_sim/data/geological_similarity/rhyolite
/home/david/Documents/projects/aws_geo_sim/data/geological_similarity/gneiss
/home/david/Documents/projects/aws_geo_sim/data/geological_similarity/marble
Number of Image Files:
29998
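The cleanup and listing cells above delete macOS residue and then collect every remaining file. A non-destructive alternative is to filter while listing, so that hidden files in subfolders are skipped as well. The sketch below assumes the images use common extensions. The `IMAGE_EXTENSIONS` set and the `list_image_paths` helper are hypothetical and should be adjusted to the actual dataset.

```python
from pathlib import Path

# Assumption: the dataset's real file extensions are not stated, so adjust this set.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def list_image_paths(root, extensions=IMAGE_EXTENSIONS):
    """Recursively collect image files under root, skipping hidden files and __MACOSX."""
    root = Path(root)
    paths = []
    for p in root.rglob("*"):
        if not p.is_file():
            continue
        # Only inspect the part of the path below root, so the root's own name is ignored.
        rel_parts = p.relative_to(root).parts
        if p.name.startswith(".") or "__MACOSX" in rel_parts:
            continue
        if p.suffix.lower() in extensions:
            paths.append(str(p))
    # Sort for a reproducible ordering across runs and platforms.
    return sorted(paths)
```

In this notebook it could be called as `all_file_paths = list_image_paths(data_path)`. The result should match the `os.walk` count only if every file in the data folder is an image with one of the assumed extensions.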


In [6]:
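# Using PIL rather than OpenCV for speed: Image.open reads the header lazily,
# so size and mode are available without decoding the pixel data

def get_metadata_single(img_path):
    """Return path, class label, size, and color metadata for one image."""
    # The class label is the name of the folder that contains the image
    class_label = pathlib.PurePath(img_path).parent.name
    # Open in a context manager so the file handle is closed after reading
    with Image.open(img_path) as img:
        width, height = img.size
        color_space = img.mode
        colors = len(img.getbands())

    dic = {'image_path': img_path,
           'class': class_label,
           'width': width,
           'height': height,
           'color': color_space,
           'channels': colors}

    return dic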

In [7]:
# Collect the metadata of every image into one DataFrame
list_of_dics = [get_metadata_single(file_path) for file_path in all_file_paths]

imgs_df = pd.DataFrame(list_of_dics)
imgs_df

,image_path,class,width,height,color,channels
0,/home/david/Documents/projects/aws_geo_sim/dat...,schist,28,28,RGB,3
1,/home/david/Documents/projects/aws_geo_sim/dat...,schist,28,28,RGB,3
2,/home/david/Documents/projects/aws_geo_sim/dat...,schist,28,28,RGB,3
3,/home/david/Documents/projects/aws_geo_sim/dat...,schist,28,28,RGB,3
4,/home/david/Documents/projects/aws_geo_sim/dat...,schist,28,28,RGB,3
...,...,...,...,...,...,...
29993,/home/david/Documents/projects/aws_geo_sim/dat...,marble,28,28,RGB,3
29994,/home/david/Documents/projects/aws_geo_sim/dat...,marble,28,28,RGB,3
29995,/home/david/Documents/projects/aws_geo_sim/dat...,marble,28,28,RGB,3
29996,/home/david/Documents/projects/aws_geo_sim/dat...,marble,28,28,RGB,3


In [8]:
print(imgs_df.width.unique())
print(imgs_df.height.unique())
print(imgs_df.color.unique())
print(imgs_df.channels.unique())

# All image data is consistently formatted: every image is 28x28 pixels, RGB, 3 channels
# 28x28 is the same size as MNIST images, although MNIST is grayscale

[28]
[28]
['RGB']
[3]
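Because every image is 28x28 RGB, each one can be flattened into a fixed-length vector of 28 * 28 * 3 = 2352 values, which is one simple starting representation for a similarity model. The sketch below illustrates the conversion on a synthetic image. The helper name `image_to_vector` and the scaling to [0, 1] are assumptions, not steps taken from the notebook.

```python
import numpy as np
from PIL import Image

def image_to_vector(img):
    """Convert a PIL image to a flat float32 vector scaled to [0, 1]."""
    # convert("RGB") is a no-op for RGB input and guards against other modes.
    arr = np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0
    return arr.reshape(-1)

# Synthetic stand-in for one dataset image (random pixels, not real data).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(28, 28, 3), dtype=np.uint8)
fake_img = Image.fromarray(pixels)

vec = image_to_vector(fake_img)
print(vec.shape)  # (2352,) because 28 * 28 * 3 = 2352
```

Stacking one such vector per image would give a matrix with one row per image, which a nearest-neighbour search or a learned embedding model could take as input.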


In [9]:
type_counts = imgs_df['class'].value_counts()
type_counts = dict(type_counts)
print(type_counts)

imgs_df['class'].value_counts()
# Classes are almost perfectly balanced: 5000 images each, except marble with 4998

{'schist': 5000, 'quartzite': 5000, 'andesite': 5000, 'rhyolite': 5000, 'gneiss': 5000, 'marble': 4998}


schist       5000
quartzite    5000
andesite     5000
rhyolite     5000
gneiss       5000
marble       4998
Name: class, dtype: int64
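The balance can be quantified directly from the counts printed above. This small sketch recomputes the total, the per-class shares, and the ratio between the largest and smallest class. The variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series({
    "schist": 5000,
    "quartzite": 5000,
    "andesite": 5000,
    "rhyolite": 5000,
    "gneiss": 5000,
    "marble": 4998,
})

total = counts.sum()
shares = counts / total                          # fraction of the dataset per class
imbalance_ratio = counts.max() / counts.min()    # 1.0 would mean perfectly balanced

print(total)  # 29998, matching the number of image files found earlier
print(shares.round(4))
print(imbalance_ratio)
```

A ratio this close to 1 suggests that class weighting or resampling is unlikely to be needed if the class labels are later used to train or evaluate a model.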