# Input Data Check

In this notebook we check the input data, display image samples, and calculate statistics. This notebook and all following ones expect the input to be in the `gitignored/` folder. The end goal is that the scientists only need to fill this folder and execute the notebooks or, in a later phase, the main program.

Members of the freiheit data science faction can get the data by running the `getData.sh` script. It will download and extract the example data to the `gitignored/` folder.

**Important**:

We do not want to relabel images within our notebooks; instead, we keep the same label mapping everywhere. When we train the neural network, it should predict one of the six classes, and the classes must be indexed from 0. So if we have six classes, they must be labeled `[0, 1, 2, 3, 4, 5]`. We therefore use this definition throughout, which means our `0` corresponds to a `1` in [the paper](https://pubs.acs.org/doi/abs/10.1021/acs.nanolett.7b02419).
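
To make the offset between the two label conventions concrete, here is a minimal sketch. The helper names `paper_to_notebook_label` and `notebook_to_paper_label` are hypothetical and do not exist in the notebooks; the sketch only assumes that the paper numbers its classes from 1 to 6.

```python
# Hypothetical helpers for illustration; not part of the original notebooks.
N_CLASSES = 6  # six classes, as described above


def paper_to_notebook_label(paper_label):
    """Map a paper label (1..6) to the zero-based label (0..5) used here."""
    if not 1 <= paper_label <= N_CLASSES:
        raise ValueError("paper label out of range: {}".format(paper_label))
    return paper_label - 1


def notebook_to_paper_label(notebook_label):
    """Map a zero-based notebook label (0..5) back to the paper label (1..6)."""
    if not 0 <= notebook_label < N_CLASSES:
        raise ValueError("notebook label out of range: {}".format(notebook_label))
    return notebook_label + 1


# All paper labels converted to the convention used in these notebooks.
notebook_labels = [paper_to_notebook_label(p) for p in range(1, N_CLASSES + 1)]
print(notebook_labels)
```

The range checks are a design choice for the sketch: a label outside the expected range most likely means the two conventions were mixed up, so failing loudly seems preferable to silently shifting it.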

## Imports

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import os
import imageio

## Constants

This is the required folder structure. Every notebook expects it.

In [None]:
input_folder = "gitignored/"
input_images_folder = input_folder + "input_images/"
classification_file = "classification.csv"

## Check Input Files and Folders

We check whether the folder structure meets the expectations and whether all required files are present. In the past, some images were missing.

In [None]:
content_of_inputfolder = os.listdir(input_folder)

# check for classification file
assert classification_file in content_of_inputfolder

# check for input images folder
assert "input_images" in content_of_inputfolder

df = pd.read_csv(input_folder + classification_file, delimiter="\t")
content_of_input_images_folder = os.listdir(input_images_folder)

# check for missing files (a set makes each membership test cheap)
present_files = set(content_of_input_images_folder)
missing_files = [f for f in df["file_name"] if f not in present_files]

assert len(missing_files) == 0, "There are missing files: {}".format(missing_files)
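
The missing-file check can also be written as a small reusable function, which may be convenient if other notebooks need the same check. This is only a sketch: `find_missing_files` is a hypothetical name, and the file names in the example are made up for illustration.

```python
def find_missing_files(expected_names, present_names):
    """Return the expected names that are absent from present_names.

    The order of expected_names is preserved, so the result can be
    compared directly with the rows of the classification file.
    """
    present = set(present_names)
    return [name for name in expected_names if name not in present]


# Made-up example data; in the notebook these would be df["file_name"]
# and os.listdir(input_images_folder).
expected = ["a.png", "b.png", "c.png"]
present = ["a.png", "c.png", "notes.txt"]

missing = find_missing_files(expected, present)
print(missing)
```

Extra files in the folder (such as `notes.txt` in the example) are ignored on purpose, since only the files listed in the classification file are required.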

## Class Distribution

Get the class distribution to show how unbalanced the data set is.

In [None]:
hist_data = []
plot_bins = range(6)

for i in plot_bins:
    hist_data.append(len(df[df["label"] == i]))

_ = plt.bar(plot_bins, hist_data)
plt.title("class distribution in data set")
plt.xlabel("classes")
plt.ylabel("number of images")
plt.grid()

for label, count in zip(plot_bins, hist_data):
    print("label: {}; number of images {}".format(label, count))
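
To put a number on "how unbalanced", one option is the ratio between the largest and the smallest non-empty class. The sketch below uses only the standard library; `class_counts` and `imbalance_ratio` are hypothetical helper names, and the example labels are invented rather than taken from the data set.

```python
from collections import Counter


def class_counts(labels, n_classes=6):
    """Count how often each label in 0..n_classes-1 occurs (0 if absent)."""
    counted = Counter(labels)
    return [counted.get(i, 0) for i in range(n_classes)]


def imbalance_ratio(counts):
    """Ratio of the largest to the smallest non-empty class."""
    nonzero = [c for c in counts if c > 0]
    if not nonzero:
        raise ValueError("no labels to compare")
    return max(nonzero) / min(nonzero)


# Invented labels for illustration only.
example_labels = [0, 0, 0, 1, 2, 2, 5]
counts = class_counts(example_labels)
ratio = imbalance_ratio(counts)
print(counts, ratio)
```

In the notebook, `class_counts(df["label"])` should yield the same numbers as `hist_data`, assuming the labels are integers in the range 0 to 5. Empty classes are excluded from the ratio to avoid a division by zero; they are still visible as zeros in the counts.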

## Examples for Image Classes

Plot a few examples for each molecule state class. Each row contains images of molecules of the same state.

**Remember**: Our labels are in the range [0, 5] and not [1, 6] as in the paper.

In [None]:
labels = [0, 1, 2, 3, 4, 5]
n_cols = 6
n_rows = len(labels)
idx = 1

fig = plt.figure(figsize=(n_cols * 4, n_rows * 3))

for label in labels:
    # select the rows once per label instead of once per subplot
    rows_with_label = df[df['label'] == label]

    for col in range(n_cols):
        # draw one random example of this class
        example = rows_with_label.sample(1)
        file_name = example['file_name'].iloc[0]
        image = imageio.imread(input_images_folder + file_name)

        ax = fig.add_subplot(n_rows, n_cols, idx)
        ax.set_title("label: {}".format(label))
        ax.set_xticks([])
        ax.set_yticks([])
        ax.imshow(image)

        idx += 1

plt.tight_layout()