# Semantic Learning Machine preparation

Author: Dennis Croon
Date:   5-dec-2019
Source: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia#person100_bacteria_475.jpeg

This report prepares the image dataset for the Semantic Learning Machine (SLM), which means converting the x-ray pixels into feature arrays. The dataset is a public Kaggle dataset of chest x-rays with and without pneumonia:

In [None]:
from IPython.display import Image
Image("/Users/denniscroon/Desktop/Thesis/vgg_feature_extraction-master/data/train/PNEUMONIA/person8_bacteria_37.jpeg",width =300, height=300)

The x-ray above is an example of pneumonia. All images share this point of view, with slight variations around the lungs. Unfortunately, the images are not consistent in size: height and width differ from photo to photo. A third label is possible, because the pneumonia cases can be either bacterial or viral, but for now the problem is kept binary.
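The bacterial/viral distinction mentioned above could later be recovered from the file names. The sketch below assumes that pneumonia files follow the pattern `person<id>_<bacteria|virus>_<n>.jpeg`, as in `person8_bacteria_37.jpeg`; the `virus` variant and both function names are assumptions made for illustration, not something this report has verified.

```python
import re
from typing import Optional

# Assumed naming scheme: "person<id>_<bacteria|virus>_<n>.jpeg".
_SUBTYPE_RE = re.compile(r"_(bacteria|virus)_\d+\.jpe?g$", re.IGNORECASE)


def pneumonia_subtype(file_name: str) -> Optional[str]:
    """Return 'bacteria' or 'virus' if the file name encodes it, else None."""
    match = _SUBTYPE_RE.search(file_name)
    return match.group(1).lower() if match else None


def binary_label(folder_label: str) -> int:
    """Map the folder name to the binary target: 1 = PNEUMONIA, 0 = NORMAL."""
    return 1 if folder_label.upper() == "PNEUMONIA" else 0


if __name__ == "__main__":
    for name in ["person8_bacteria_37.jpeg", "NORMAL2-IM-0867-0001.jpeg"]:
        print(name, pneumonia_subtype(name))
```

Keeping the subtype in a separate column would leave the binary target untouched while making a three-class experiment possible later on.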

After exploring the data, it is time to extract the features from each image. Instead of an array of roughly 2000x2000 pixels (the mean image size), with each cell holding an intensity value between 0 and 255, we would rather have an array that captures the specific properties of each image. To make this faster and smarter, we reuse the knowledge of the pre-trained VGG16 model (extension to and comparison with other models is possible). The feature extraction is combined with the conversion to jsonl format.

In [None]:
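from pathlib import Path

import jsonlines
import numpy as np
# Keras import paths are an assumption; with TensorFlow 2 use tensorflow.keras instead.
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from tqdm import tqdm


def main():
    # Convolutional part of VGG16 only (no classifier head), with ImageNet weights.
    model = VGG16(weights='imagenet', include_top=False)
    image_paths = [x for x in Path("data/").glob('**/*') if x.is_file() and ".jpeg" in x.name]
    output = []
    for fpath in tqdm(image_paths, desc="Extracting Features", total=len(image_paths)):
        # Expected layout: data/<dataset>/<label>/<file_name>
        # (Path.parts avoids depending on "/" as the path separator).
        split_path = fpath.parts
        # Resize every image to VGG16's 224x224 input size.
        img = image.load_img(fpath, target_size=(224, 224))
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)  # add the batch dimension
        x = preprocess_input(x)
        features = model.predict(x)
        output.append({
            "file_path": str(fpath),
            "dataset": split_path[1],
            "label": split_path[2],
            "file_name": split_path[3],
            "features": features.tolist()
        })
    # One JSON object per image, one image per line.
    with jsonlines.open("features.jsonl", "w") as outfile:
        outfile.write_all(output)

if __name__ == "__main__":
    main()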
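The `features` value stored for each image is a nested list. With `include_top=False`, a 224x224 input and the channels-last layout (an assumption about the Keras configuration), VGG16 should return an array of shape (1, 7, 7, 512), which is 25088 values per image. As a minimal sketch, the two hypothetical helpers below turn that nested list into a flat vector, either by keeping every value or by averaging over the 7x7 spatial grid:

```python
import numpy as np


def flatten_features(features):
    """Flatten a nested feature list into a 1-D float32 vector."""
    arr = np.asarray(features, dtype=np.float32)
    return arr.reshape(-1)


def pooled_features(features):
    """Average over the spatial axes (1 and 2), keeping one value per channel.

    Assumes a channels-last array of shape (1, height, width, channels).
    """
    arr = np.asarray(features, dtype=np.float32)
    return arr.mean(axis=(1, 2)).reshape(-1)


if __name__ == "__main__":
    # Dummy stand-in for one VGG16 output; real values come from model.predict.
    dummy = np.zeros((1, 7, 7, 512), dtype=np.float32).tolist()
    print(flatten_features(dummy).shape)
    print(pooled_features(dummy).shape)
```

Pooling gives a much smaller input per image at the cost of discarding spatial information; which of the two suits the SLM better is an open choice here.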

Now that the features.jsonl file has been created for the train and test datasets, a load method is needed for later use. Here we convert the jsonl format to a pandas DataFrame.

In [None]:
import jsonlines
import numpy as np
import pandas as pd
from tqdm import tqdm


def main():
    # Count the lines first so tqdm can show progress; close the file afterwards.
    with open('features.jsonl') as fh:
        num_lines = sum(1 for _ in fh)
    with jsonlines.open("features.jsonl") as jsonl_fh:
        train, test = [], []

        for obj in tqdm(jsonl_fh, desc="Loading Features", total=num_lines):
            if obj["dataset"] == "train":
                train.append(obj)
            elif obj["dataset"] == "test":
                test.append(obj)

    train_df = pd.DataFrame.from_records(train)
    test_df = pd.DataFrame.from_records(test)

    print(train_df.head())
    print(test_df.head())

if __name__ == "__main__":
    main()

dataset                                           features  \
0   train  [[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....   
1   train  [[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....   
2   train  [[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....   
3   train  [[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....   
4   train  [[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....   

                   file_name                                    file_path  \
0  NORMAL2-IM-0867-0001.jpeg  data/train/NORMAL/NORMAL2-IM-0867-0001.jpeg   
1  NORMAL2-IM-0903-0001.jpeg  data/train/NORMAL/NORMAL2-IM-0903-0001.jpeg   
2          IM-0691-0001.jpeg          data/train/NORMAL/IM-0691-0001.jpeg   
3  NORMAL2-IM-0395-0001.jpeg  data/train/NORMAL/NORMAL2-IM-0395-0001.jpeg   
4     IM-0650-0001-0001.jpeg     data/train/NORMAL/IM-0650-0001-0001.jpeg   

    label  
0  NORMAL  
1  NORMAL  
2  NORMAL  
3  NORMAL  
4  NORMAL  
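In the output above, each `features` cell still holds a nested list and each `label` a string. A learning algorithm will usually expect a numeric matrix and a numeric target instead. The sketch below shows one possible conversion; the function name `to_arrays` and the encoding 1 = PNEUMONIA, 0 = NORMAL are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def to_arrays(df):
    """Convert a features DataFrame into (X, y) NumPy arrays.

    X has one flattened feature vector per row; y is 1 for PNEUMONIA, else 0.
    """
    X = np.stack([
        np.asarray(f, dtype=np.float32).reshape(-1) for f in df["features"]
    ])
    y = (df["label"] == "PNEUMONIA").to_numpy(dtype=np.int64)
    return X, y


if __name__ == "__main__":
    # Tiny synthetic frame standing in for train_df; real features are much larger.
    demo_df = pd.DataFrame.from_records([
        {"label": "NORMAL", "features": np.zeros((1, 2, 2, 3)).tolist()},
        {"label": "PNEUMONIA", "features": np.ones((1, 2, 2, 3)).tolist()},
    ])
    X, y = to_arrays(demo_df)
    print(X.shape, y)
```

The same function can be applied to both `train_df` and `test_df`, so that the two sets are encoded identically.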