# Data Pre-processing
The purpose of this notebook is to prepare image data for use in Keras. The data used for this project is the Food-11 image dataset developed by the Multimedia Signal Processing Group (MMSPG) at the Swiss Federal Institute of Technology in Lausanne (EPFL). Additional information about the dataset is available on the [MMSPG's homepage](https://www.epfl.ch/labs/mmspg/downloads/food-image-datasets/). The dataset can be downloaded from [Kaggle](https://www.kaggle.com/vermaavi/food11).

The dataset consists of 16,643 food images grouped into 11 food categories: Bread, Dairy product, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, and Vegetable/Fruit. The dataset comes divided into training, validation, and evaluation subsets. The files are named with the following convention:<br>`{ClassID}_{ImageID}.jpg`
- ClassID: an integer from 0 to 10 that identifies the food category
- ImageID: the ID of the image within its class
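
As a minimal sketch of this naming convention, a filename can be split back into its two parts. The helper name `parse_food11_name` is hypothetical and not part of the original notebook.

```python
import os

def parse_food11_name(filename):
    """Split a '{ClassID}_{ImageID}.jpg' filename into two integer IDs."""
    # Drop any leading folder and the file extension, keeping e.g. '3_462'
    stem, _ext = os.path.splitext(os.path.basename(filename))
    class_id, image_id = stem.split('_', 1)
    return int(class_id), int(image_id)

print(parse_food11_name('training/3_462.jpg'))  # (3, 462)
```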


## Table of Contents:
1. Import Packages
2. Load Dataset
3. Add Targets
4. Save Data to File

## 1. Import Packages

In [40]:
import pandas as pd
import numpy as np
import os
import pickle

## 2. Load Dataset
In the modeling notebook, we will use Keras' `flow_from_dataframe()` method to load images. First, we need to save the image filenames in a dataframe.

In [41]:
# Save filenames and paths of all images into a dataframe
filepaths = []

for path, subdirs, files in os.walk('/home/andy/metis_work/project_05/data/food-11'):
    # Name of the immediate parent folder (training, validation, or evaluation)
    folder = os.path.basename(path)
    for name in files:
        filepaths.append([os.path.join(folder, name), folder, name])
        
df = pd.DataFrame(filepaths, columns=['path','folder','name'])
df

| | path | folder | name |
|---|---|---|---|
| 0 | training/3_462.jpg | training | 3_462.jpg |
| 1 | training/4_69.jpg | training | 4_69.jpg |
| 2 | training/5_925.jpg | training | 5_925.jpg |
| 3 | training/1_202.jpg | training | 1_202.jpg |
| 4 | training/5_241.jpg | training | 5_241.jpg |
| ... | ... | ... | ... |
| 16638 | evaluation/5_165.jpg | evaluation | 5_165.jpg |
| 16639 | evaluation/5_393.jpg | evaluation | 5_393.jpg |
| 16640 | evaluation/8_157.jpg | evaluation | 8_157.jpg |
| 16641 | evaluation/4_86.jpg | evaluation | 4_86.jpg |
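
Because `os.walk` returns files in arbitrary order, the row order shown here is not guaranteed to be the same between runs or machines. The following is a sketch of a deterministic variant that also skips non-image files; the function name `collect_image_rows` and the `.jpg` filter are assumptions, not part of the original notebook.

```python
import os

def collect_image_rows(root, ext='.jpg'):
    """Return sorted [path, folder, name] rows for every image under root."""
    rows = []
    for path, _subdirs, files in os.walk(root):
        folder = os.path.basename(path)
        for name in files:
            # Assumed filter: keep only files with the expected extension
            if name.lower().endswith(ext):
                rows.append([os.path.join(folder, name), folder, name])
    # Sorting makes the row order reproducible
    return sorted(rows)
```

The returned rows can be passed to `pd.DataFrame(rows, columns=['path','folder','name'])` in the same way as the `filepaths` list.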


## 3. Add Targets
Next, we add columns that represent the classification targets. One benefit of using `flow_from_dataframe()` is that we can change the number of classes without restructuring the dataset on disk. Here we create three classification schemes: the 11 original categories, 8 consolidated categories, and a binary scheme (meat or not_meat). The 8-category scheme was created for two reasons: to reduce class imbalance and to better reflect the food groupings of interest when analyzing dietary patterns.

In [42]:
# Dictionaries mapping the class IDs in the filenames to category names

# 11 original categories
cat_names = {'0':'bread',
            '1':'dairy',
            '2':'dessert',
            '3':'egg',
            '4':'fried',
            '5':'meat',
            '6':'noodles',
            '7':'rice',
            '8':'seafood',
            '9':'soup',
            '10':'fruit-veg'}

# 8 consolidated categories: bread, noodles, and rice are merged into 'grains'; dairy and egg are merged into 'dairy-egg'.
cat2 = {'0':'grains',
        '1':'dairy-egg',
        '2':'dessert',
        '3':'dairy-egg',
        '4':'fried',
        '5':'meat',
        '6':'grains',
        '7':'grains',
        '8':'seafood',
        '9':'soup',
        '10':'fruit-veg'}
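
As a quick sanity check, the consolidated mapping should cover the same class IDs as the original one and yield exactly 8 distinct labels. The sketch below restates the consolidation as a hypothetical `merged_into` dictionary keyed by original category name (this restructuring is an illustration, not the notebook's code) and verifies both properties.

```python
cat_names = {'0': 'bread', '1': 'dairy', '2': 'dessert', '3': 'egg',
             '4': 'fried', '5': 'meat', '6': 'noodles', '7': 'rice',
             '8': 'seafood', '9': 'soup', '10': 'fruit-veg'}

# Hypothetical helper: only the categories that get merged are listed
merged_into = {'bread': 'grains', 'noodles': 'grains', 'rice': 'grains',
               'dairy': 'dairy-egg', 'egg': 'dairy-egg'}

# Categories not listed in merged_into keep their original name
cat2 = {class_id: merged_into.get(name, name)
        for class_id, name in cat_names.items()}

assert set(cat2) == set(cat_names)     # every class ID is mapped
assert len(set(cat2.values())) == 8    # 8 consolidated categories
```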

In [43]:
# Add columns for target classifications
df['cat'] = df['name'].str.split('_').str[0]
df['cat_name'] = df['cat'].apply(lambda x: cat_names[x])
df['cat2_name'] = df['cat'].apply(lambda x: cat2[x])
df['is_meat'] = df['cat'].apply(lambda x: 'meat' if (x == '5') else 'not_meat')

In [44]:
df

| | path | folder | name | cat | cat_name | cat2_name | is_meat |
|---|---|---|---|---|---|---|---|
| 0 | training/3_462.jpg | training | 3_462.jpg | 3 | egg | dairy-egg | not_meat |
| 1 | training/4_69.jpg | training | 4_69.jpg | 4 | fried | fried | not_meat |
| 2 | training/5_925.jpg | training | 5_925.jpg | 5 | meat | meat | meat |
| 3 | training/1_202.jpg | training | 1_202.jpg | 1 | dairy | dairy-egg | not_meat |
| 4 | training/5_241.jpg | training | 5_241.jpg | 5 | meat | meat | meat |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 16638 | evaluation/5_165.jpg | evaluation | 5_165.jpg | 5 | meat | meat | meat |
| 16639 | evaluation/5_393.jpg | evaluation | 5_393.jpg | 5 | meat | meat | meat |
| 16640 | evaluation/8_157.jpg | evaluation | 8_157.jpg | 8 | seafood | seafood | not_meat |
| 16641 | evaluation/4_86.jpg | evaluation | 4_86.jpg | 4 | fried | fried | not_meat |
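
Because each scheme lives in its own column, switching schemes in the modeling notebook should only require pointing `y_col` at a different column. The sketch below builds the keyword arguments for Keras' `ImageDataGenerator.flow_from_dataframe()`. The helper `flow_kwargs`, the scheme labels, the image directory, and the values of `target_size` and `batch_size` are all assumptions for illustration, and the actual Keras call is left as a comment so the snippet runs without Keras installed.

```python
# Hypothetical mapping from a scheme label to (target column, Keras class_mode)
SCHEMES = {
    '11-class': ('cat_name', 'categorical'),
    '8-class': ('cat2_name', 'categorical'),
    'binary': ('is_meat', 'binary'),
}

def flow_kwargs(scheme, image_dir='data/food-11'):
    """Build keyword arguments for flow_from_dataframe (image_dir is assumed)."""
    y_col, class_mode = SCHEMES[scheme]
    return {
        'directory': image_dir,     # 'path' values are relative to this folder
        'x_col': 'path',
        'y_col': y_col,
        'class_mode': class_mode,
        'target_size': (224, 224),  # assumed input size
        'batch_size': 32,           # assumed batch size
    }

# In the modeling notebook this might be used as:
#   datagen = ImageDataGenerator(rescale=1/255)
#   train_gen = datagen.flow_from_dataframe(train_df, **flow_kwargs('8-class'))
print(flow_kwargs('binary')['y_col'])  # is_meat
```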


In [45]:
df.groupby('folder').is_meat.value_counts()

folder      is_meat 
evaluation  not_meat    2915
            meat         432
training    not_meat    8541
            meat        1325
validation  not_meat    2981
            meat         449
Name: is_meat, dtype: int64

In [46]:
df.groupby('folder').cat_name.value_counts()

folder      cat_name 
evaluation  dessert       500
            soup          500
            meat          432
            bread         368
            egg           335
            seafood       303
            fried         287
            fruit-veg     231
            dairy         148
            noodles       147
            rice           96
training    dessert      1500
            soup         1500
            meat         1325
            bread         994
            egg           986
            seafood       855
            fried         848
            fruit-veg     709
            noodles       440
            dairy         429
            rice          280
validation  dessert       500
            soup          500
            meat          449
            bread         362
            seafood       347
            egg           327
            fried         326
            fruit-veg     232
            noodles       147
            dairy         144
            rice           96
Name: cat_name, dtype: int64

In [47]:
df.groupby('folder').cat2_name.value_counts()

folder      cat2_name
evaluation  grains        611
            dessert       500
            soup          500
            dairy-egg     483
            meat          432
            seafood       303
            fried         287
            fruit-veg     231
training    grains       1714
            dessert      1500
            soup         1500
            dairy-egg    1415
            meat         1325
            seafood       855
            fried         848
            fruit-veg     709
validation  grains        605
            dessert       500
            soup          500
            dairy-egg     471
            meat          449
            seafood       347
            fried         326
            fruit-veg     232
Name: cat2_name, dtype: int64

Consolidating the categories reduces the class imbalance but does not eliminate it. We will address the remaining imbalance during modeling.
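
To put a number on this, the sketch below computes the ratio between the largest and smallest training class under both schemes, using the training counts from the `value_counts()` outputs. It also derives inverse-frequency class weights, which is one common way to handle imbalance in Keras (via the `class_weight` argument of `fit`). Whether the modeling notebook uses this approach is not stated here, so treat it as an assumption.

```python
# Training counts copied from the value_counts() outputs
train_11 = {'dessert': 1500, 'soup': 1500, 'meat': 1325, 'bread': 994,
            'egg': 986, 'seafood': 855, 'fried': 848, 'fruit-veg': 709,
            'noodles': 440, 'dairy': 429, 'rice': 280}
train_8 = {'grains': 1714, 'dessert': 1500, 'soup': 1500, 'dairy-egg': 1415,
           'meat': 1325, 'seafood': 855, 'fried': 848, 'fruit-veg': 709}

def imbalance_ratio(counts):
    """Largest class size divided by smallest class size."""
    return max(counts.values()) / min(counts.values())

def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    return {name: total / (len(counts) * n) for name, n in counts.items()}

print(round(imbalance_ratio(train_11), 2))  # 5.36
print(round(imbalance_ratio(train_8), 2))   # 2.42
weights = inverse_frequency_weights(train_8)
```

Keras expects the keys of `class_weight` to be integer class indices, so the category names would need to be translated through the generator's `class_indices` mapping before use.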

## 4. Save Data to File
The dataframe is split by subset and saved as separate training, validation, and evaluation dataframes for use with Keras in the modeling notebook.

Additionally, we save a dataframe that combines the training and validation data. Once the final model is chosen, this can be used to retrain it before applying it to real-world images (images outside of the dataset).

In [48]:
# Saving separate pkl files for training, validation, and evaluation subsets.
df[df.folder == 'training'].to_pickle('/home/andy/sf20_ds18/curriculum/project-05/andy_project_05/data/food11_train.pkl')
df[df.folder == 'validation'].to_pickle('/home/andy/sf20_ds18/curriculum/project-05/andy_project_05/data/food11_val.pkl')
df[df.folder == 'evaluation'].to_pickle('/home/andy/sf20_ds18/curriculum/project-05/andy_project_05/data/food11_test.pkl')

In [49]:
# Saving a dataframe with the validation data combined with the training data.
df[df.folder.isin(['training', 'validation'])].to_pickle('/home/andy/sf20_ds18/curriculum/project-05/andy_project_05/data/food11_all.pkl')
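
The saved files can be loaded back with `pd.read_pickle()`. The sketch below repeats the split-and-save pattern on a small stand-in dataframe in a temporary directory and checks that the combined file holds exactly the training and validation rows. The toy rows, the `splits` dictionary, and the temporary output location are assumptions made so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the full dataframe (rows are illustrative only)
df = pd.DataFrame({
    'path': ['training/3_462.jpg', 'validation/5_1.jpg', 'evaluation/8_157.jpg'],
    'folder': ['training', 'validation', 'evaluation'],
})

# Hypothetical mapping from output file suffix to the folders it contains
splits = {'train': ['training'], 'val': ['validation'],
          'test': ['evaluation'], 'all': ['training', 'validation']}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    for suffix, folders in splits.items():
        df[df.folder.isin(folders)].to_pickle(out_dir / f'food11_{suffix}.pkl')

    # Load the files back while the temporary directory still exists
    loaded = {s: pd.read_pickle(out_dir / f'food11_{s}.pkl') for s in splits}

# The combined file should contain the training and validation rows only
assert len(loaded['all']) == len(loaded['train']) + len(loaded['val'])
```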