# Preparation of the Ingredient Recognition Dataset

This notebook prepares the dataset for the ingredient detection task. The images come from [Food-101](https://www.kaggle.com/datasets/dansbecker/food-101) and the ingredient annotations from [Ingredients-101](http://www.ub.edu/cvub/ingredients101/). Food-101 contains 101 food categories with 1000 images each.
Ingredients-101 extends Food-101 with a list of ingredients for each food category. This notebook uses the simplified version of Ingredients-101, which contains 227 different ingredients. The goal is to create training, validation, and test datasets that pair each image with its corresponding ingredients.

## Import libraries

In [41]:
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")
import os
import json
import jsonlines
import datasets
import random
from PIL import Image


## Data Loading

In [42]:
annotations_path = Path("../data/ingredients-101/Annotations/")
images_path = Path("../data/food-101/images/")
ingredients_path = 'ingredients_simplified.txt'
simplifications_path = Path("../data/ingredients-101/ingredients_simplification/")

In [43]:
def read_data(path, file):
    with open(path / file, 'r') as f:
        data = f.read().strip().split('\n')
    return data

In [44]:
train_images = read_data(annotations_path, 'train_images.txt')
val_images = read_data(annotations_path, 'val_images.txt')
test_images = read_data(annotations_path, 'test_images.txt')

train_labels = read_data(annotations_path, 'train_labels.txt')
val_labels = read_data(annotations_path, 'val_labels.txt')
test_labels = read_data(annotations_path, 'test_labels.txt')

base_ingredients = read_data(simplifications_path, 'baseIngredients.txt')
ingredients_simplified = read_data(annotations_path, ingredients_path)

## Data Exploration

In [45]:
# get every 1000th image in the training set
print(train_images[::1000])

['apple_pie/1005649', 'baby_back_ribs/2609854', 'baklava/824035', 'beef_tartare/2728369', 'beet_salad/651251', 'bibimbap/247378', 'bread_pudding/577928', 'bruschetta/2381015', 'caesar_salad/388850', 'caprese_salad/2287006', 'carrot_cake/3920883', 'cheesecake/2094088', 'cheese_plate/3803340', 'chicken_quesadilla/1952791', 'chicken_wings/3536403', 'chocolate_mousse/1872460', 'churros/3384419', 'club_sandwich/1569490', 'crab_cakes/327268', 'croque_madame/1488004', 'cup_cakes/323974', 'donuts/1396491', 'dumplings/3061004', 'eggs_benedict/1242995', 'escargots/2995220', 'filet_mignon/1125845', 'fish_and_chips/288992', 'french_fries/100148', 'french_onion_soup/2636895', 'french_toast/861171', 'fried_rice/2552407', 'frozen_yogurt/730932', 'gnocchi/2595566', 'greek_salad/624309', 'grilled_salmon/2315101', 'guacamole/598769', 'hamburger/2217236', 'hot_and_sour_soup/457763', 'huevos_rancheros/198909', 'hummus/3770127', 'lasagna/198015', 'lobster_bisque/3510579', 'macaroni_and_cheese/1788759', 'ma

In [46]:
# get every 1000th label in the training set
print(train_labels[::1000])

['0', '1', '2', '4', '5', '7', '8', '10', '11', '13', '14', '16', '17', '19', '20', '22', '23', '25', '26', '28', '29', '31', '32', '34', '35', '37', '38', '40', '41', '42', '44', '45', '47', '48', '50', '51', '53', '54', '56', '57', '59', '60', '62', '63', '65', '66', '68', '69', '71', '72', '74', '75', '77', '78', '80', '81', '82', '84', '85', '87', '88', '90', '91', '93', '94', '96', '97', '99', '100']


In [47]:
ingredients_simplified

['butter,flour,sugar,brown sugar,apple,cinnamon,nut',
 'baby back ribs,apple,salt,mustard,brown sugar,worcestershire,gin,chili',
 'nut,cinnamon,bread,butter,phyllo dough,sugar,honey,lemon,baklava',
 'beef,lemon,gin,salt,pepper,baby arugula,asiago',
 'fat,steak,gin,shallot,parsley,capers,worcestershire,egg,black pepper,crostini',
 'beets,spinach,gorgonzola,nut,red wine,dijon mustard,gin,salt,black pepper,herbs',
 'water,sugar,yeast,egg,salt,milk,butter,flour,sugar',
 'grain,steak,soybean sprouts,spinach,cucumber,zucchini,carrot,garlic,scallions,soy,oil,seeds,salt,pepper,oil,egg,pepper,sugar,water',
 'bread,milk,sugar,butter,salt,egg,vanilla',
 'rolls,bacon,egg,brie,onion,cheddar,flour,salsa',
 'plum,garlic,gin,balsamic vinegar,basil,salt,black pepper,baguette',
 'garlic,plain greek yogurt,cheese,worcestershire,dijon mustard,lemon,anchovy,salt,pepper,lettuce,croutons',
 'sugar,cheese,almond,chocolate,liqueur,cannoli shells,cocoa,cocktail',
 'balsamic vinegar,tomato,cheese,basil,oil,black

``train_images`` contains the image filenames of the training set, and ``train_labels`` contains the corresponding labels. These labels are the food categories, not the ingredients. The ingredients needed for training are stored in ``ingredients_simplified``, with one comma-separated list per food category.
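
To illustrate how the three lists relate, here is a minimal sketch with toy stand-ins. The `toy_` names, the shortened ingredient lists, and the helper `ingredients_for` are made up for illustration and are not part of the dataset files. Each label is a string that indexes into the per-category ingredient list.

```python
# Toy stand-ins for the lists loaded above (values are illustrative only)
toy_images = ['apple_pie/1005649', 'baklava/824035']
toy_labels = ['0', '2']
toy_ingredients = [
    'butter,flour,sugar',   # category 0
    'ribs,apple,salt',      # category 1
    'nut,cinnamon,bread',   # category 2
]


def ingredients_for(label, ingredients_per_class):
    """Return the ingredient names of the category a label string points to."""
    return ingredients_per_class[int(label)].split(',')


for image, label in zip(toy_images, toy_labels):
    print(image, '->', ingredients_for(label, toy_ingredients))
```

The labels are stored as strings, so they have to be converted with `int` before they can be used as list indices.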

## Data Preprocessing

### Fix unclear labels

In [51]:
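# change the ambiguous ingredient 'baking' to 'baking powder'
# only whole ingredients are matched, so entries that merely contain 'baking' stay untouched
ingredients_simplified = [','.join('baking powder' if ingredient.strip() == 'baking' else ingredient for ingredient in line.split(',')) for line in ingredients_simplified]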

### Convert Ingredients List to Multi-label Format

In [52]:
ingredients_simplified[:3]

['butter,flour,sugar,brown sugar,apple,cinnamon,nut',
 'baby back ribs,apple,salt,mustard,brown sugar,worcestershire,gin,chili',
 'nut,cinnamon,bread,butter,phyllo dough,sugar,honey,lemon,baklava']

In [53]:
corrected_ingredients_simplified = [ingredient.strip() for sublist in ingredients_simplified for ingredient in sublist.split(',')]
corrected_ingredients_simplified[:10]

['butter',
 'flour',
 'sugar',
 'brown sugar',
 'apple',
 'cinnamon',
 'nut',
 'baby back ribs',
 'apple',
 'salt']

In [54]:
# get only the unique ingredients
unique_ingredients = list(set(corrected_ingredients_simplified))
num_unique_ingredients = len(unique_ingredients)
num_unique_ingredients

227

In [61]:
# sort the ingredients alphabetically
unique_ingredients.sort()

# create a dictionary with the ingredients as keys and the index as values
ingredient_to_idx = {ingredient: idx for idx, ingredient in enumerate(unique_ingredients)}
idx_to_ingredient = {idx: ingredient for ingredient, idx in ingredient_to_idx.items()}

# create the directory if it does not exist
os.makedirs('../data/food-ingredients', exist_ok=True)

# save to a file
with open('../data/food-ingredients/ingredient_to_idx.txt', 'w') as f:
    for key, value in ingredient_to_idx.items():
        f.write(f'{key}, {value}\n')

# save to a file
with open('../data/food-ingredients/idx_to_ingredient.txt', 'w') as f:
    for key, value in idx_to_ingredient.items():
        f.write(f'{key}, {value}\n')
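
The mapping files use one `key, value` pair per line. As a rough sketch of how `ingredient_to_idx.txt` could be read back later, the hypothetical helper `load_ingredient_to_idx` below splits each line on the last `', '`, so that multi-word ingredient names stay intact. The toy mapping and the temporary file are assumptions used only to make the example self-contained.

```python
import os
import tempfile


def load_ingredient_to_idx(file_path):
    """Parse lines of the form '<ingredient>, <index>' into a dict."""
    mapping = {}
    with open(file_path, 'r') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            # split on the last ', ' so names containing spaces stay intact
            name, idx = line.rsplit(', ', 1)
            mapping[name] = int(idx)
    return mapping


# round trip with a toy mapping written in the same format as above
toy_mapping = {'apple': 0, 'brown sugar': 1, 'salt': 2}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'ingredient_to_idx.txt')
    with open(toy_path, 'w') as f:
        for key, value in toy_mapping.items():
            f.write(f'{key}, {value}\n')
    restored = load_ingredient_to_idx(toy_path)

print(restored == toy_mapping)
```

Storing the mapping as JSON would avoid the custom parsing, but the sketch sticks to the text format written by this notebook.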

In [56]:
numerical_ingredients_simplified = [[ingredient_to_idx[ingredient.strip()] for ingredient in sublist.split(',')] for sublist in ingredients_simplified]

numerical_ingredients_simplified[:5]

[[36, 100, 206, 34, 5, 63, 143],
 [10, 5, 183, 141, 34, 224, 105, 54],
 [143, 63, 29, 36, 160, 206, 116, 124, 15],
 [21, 124, 105, 183, 159, 9, 7],
 [94, 204, 105, 190, 156, 42, 224, 91, 27, 78]]

In [57]:
# manually check the correctness of the conversion
print(idx_to_ingredient[183])

print([idx_to_ingredient[idx] for idx in numerical_ingredients_simplified[1]]) 

salt
['baby back ribs', 'apple', 'salt', 'mustard', 'brown sugar', 'worcestershire', 'gin', 'chili']
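
The index lists above have a different length for every category. Many multi-label training setups expect a fixed-length 0/1 vector instead. The following is a minimal sketch of that conversion. It is an assumption about downstream use, not a step of this notebook, and the helper name `to_multi_hot` is hypothetical.

```python
def to_multi_hot(indices, num_classes):
    """Turn a list of ingredient indices into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for idx in indices:
        vector[idx] = 1
    return vector


# toy example with 6 ingredients instead of the 227 used in this notebook
toy_indices = [1, 4, 4]  # a repeated index still sets a single position
toy_vector = to_multi_hot(toy_indices, num_classes=6)
print(toy_vector)
```

With the real data, `num_classes` would be `num_unique_ingredients`, and each entry of `numerical_ingredients_simplified` would yield one vector.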


In [58]:
# Create a dictionary to map each class to its ingredients
class_to_ingredients = {i: ingredients for i, ingredients in enumerate(ingredients_simplified)}
class_to_ingredients

# save to a file
with open('../data/food-ingredients/class_to_ingredients.jsonl', 'w') as f:
    for key, value in class_to_ingredients.items():
        f.write(json.dumps({key: value}) + '\n')
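
One detail matters when this file is read back: JSON object keys are always strings, so the integer class indices are written as `"0"`, `"1"`, and so on. The sketch below shows one way to restore integer keys. The helper name `load_class_to_ingredients` and the toy lines are hypothetical.

```python
import json


def load_class_to_ingredients(lines):
    """Merge one-entry JSON objects (one per line) into a dict with int keys."""
    mapping = {}
    for line in lines:
        for key, value in json.loads(line).items():
            mapping[int(key)] = value  # JSON object keys are always strings
    return mapping


# toy lines in the same shape as the file written above
toy_lines = [json.dumps({0: 'butter,flour'}), json.dumps({1: 'apple,salt'})]
toy_restored = load_class_to_ingredients(toy_lines)
print(toy_lines[0])
print(toy_restored)
```

For the real file, the lines could be passed in directly from an open file handle.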

### Create Train, Valid and Test Datasets

In [59]:
def create_jsonl_file(file_path, images, labels, class_to_ingredients, ingredient_to_idx, images_path):
    with jsonlines.open(file_path, 'w') as writer:
        for image, label in zip(images, labels):
            # strip whitespace so the lookup matches the keys of ingredient_to_idx
            ingredient_names = [ingredient.strip() for ingredient in class_to_ingredients[int(label)].split(',')]
            ingredients_numeric = [ingredient_to_idx[name] for name in ingredient_names]
            writer.write({'image': str(images_path) + '/' + image + '.jpg', 'ingredients': ingredients_numeric, 'ingredients_names': ingredient_names, 'class': int(label), 'class_name': image.split('/')[0]})

create_jsonl_file('../data/food-ingredients/train.jsonl', train_images, train_labels, class_to_ingredients, ingredient_to_idx, images_path)
create_jsonl_file('../data/food-ingredients/val.jsonl', val_images, val_labels, class_to_ingredients, ingredient_to_idx, images_path)
create_jsonl_file('../data/food-ingredients/test.jsonl', test_images, test_labels, class_to_ingredients, ingredient_to_idx, images_path)

In [60]:
# load dataset
data_files = {"train": "../data/food-ingredients/train.jsonl", "val": "../data/food-ingredients/val.jsonl", "test": "../data/food-ingredients/test.jsonl"}

dataset = datasets.load_dataset('json', data_files=data_files)

Generating train split: 68175 examples [00:00, 868391.86 examples/s]
Generating val split: 7575 examples [00:00, 582905.60 examples/s]
Generating test split: 25250 examples [00:00, 841694.23 examples/s]
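
The reported split sizes can be cross-checked against the 101 categories with 1000 images each mentioned in the introduction. The short sketch below only does the arithmetic on the numbers printed above. It yields averages per category and does not verify that every category contributes the same number of images to each split.

```python
# split sizes reported by load_dataset above
split_sizes = {'train': 68175, 'val': 7575, 'test': 25250}
num_classes = 101

total = sum(split_sizes.values())
train_val = split_sizes['train'] + split_sizes['val']

# total number of images and the average per category
print(total, total / num_classes)
# average images per category in train+val and in test
print(train_val / num_classes, split_sizes['test'] / num_classes)
```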


### Create T_500 Dataset

In some cases, a smaller dataset is needed to evaluate a model's performance at lower cost. This smaller dataset is called T_500 and consists of 500 images selected from the test set. The subset contains at least 4 images from each of the 101 food categories, which keeps the categories balanced while reducing the computational effort.
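
The "at least 4 images" figure follows from dividing 500 by 101. A short sketch of that arithmetic, assuming every category has enough valid images so that the leftover slots are filled with one extra image per category:

```python
n_total = 500
num_classes = 101

# base number of images per category and the number of leftover slots
base, remainder = divmod(n_total, num_classes)
print(base, remainder)

# assumption: each leftover slot goes to a different category
classes_with_extra = remainder
classes_with_base = num_classes - remainder
print(classes_with_extra * (base + 1) + classes_with_base * base)
```

Under that assumption, 96 categories end up with 5 images and the remaining 5 categories with 4 images.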

In [None]:
# set seed for reproducibility
random.seed(42)

In [None]:
# function to check if an image is valid
def is_valid_image(image_path):
    try:
        with Image.open(image_path) as img:
            if img.format.lower() not in ['png', 'jpeg', 'jpg', 'gif', 'webp']:
                return False
            if os.path.getsize(image_path) > 20 * 1024 * 1024:
                return False
        return True
    except Exception:
        return False

# function to perform stratified sampling
def stratified_sample(data, n):
    # group the items by their value in the 'class' field
    classes = {}
    for item in data:
        classes.setdefault(item['class'], []).append(item)

    # calculate the number of samples per class
    num_classes = len(classes)
    print(num_classes)
    samples_per_class = n // num_classes

    # validate each image once and draw the base sample for every class
    valid_by_class = {}
    sampled = []
    for class_name, items in classes.items():
        valid_items = [item for item in items if is_valid_image(item['image'])]
        valid_by_class[class_name] = valid_items
        if len(valid_items) >= samples_per_class:
            sampled.extend(random.sample(valid_items, samples_per_class))
        else:
            sampled.extend(valid_items)

    # if there aren't enough samples, add more from classes with extra valid samples
    while len(sampled) < n:
        added = False
        for class_name, valid_items in valid_by_class.items():
            if len(sampled) >= n:
                break
            remaining = [item for item in valid_items if item not in sampled]
            if remaining:
                sampled.append(random.choice(remaining))
                added = True
        # stop if no class has unused valid images left (avoids an infinite loop)
        if not added:
            break

    return sampled[:n]

In [None]:
def load_data(file_path):
    with open(file_path, 'r') as f:
        return [json.loads(line) for line in f]
    
test_data = load_data('../data/food-ingredients/test.jsonl')
test_data_500 = stratified_sample(test_data, 500)

In [None]:
# check how many images per class are in the test_data_500
classes = {}
for item in test_data_500:
    class_name = item['class']
    if class_name not in classes:
        classes[class_name] = 0
    classes[class_name] += 1

print(classes)
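
The same check can be written more compactly with `collections.Counter`, and extended to flag categories that fall below the expected minimum. This is a minimal sketch on toy items. The helper name `check_class_balance` is hypothetical, and the toy records contain only the `class` field.

```python
from collections import Counter


def check_class_balance(items, min_per_class):
    """Count items per class and list the classes below min_per_class."""
    counts = Counter(item['class'] for item in items)
    too_small = sorted(c for c, k in counts.items() if k < min_per_class)
    return counts, too_small


toy_items = [{'class': 0}, {'class': 0}, {'class': 1}]
counts, too_small = check_class_balance(toy_items, min_per_class=2)
print(dict(counts), too_small)
```

Categories with no sampled image at all do not appear in the counter, so comparing `len(counts)` with the expected number of categories is a useful additional check.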

In [None]:
# save stratified sample to file
def save_data(data, file_path):
    with open(file_path, 'w') as f:
        for item in data:
            f.write(json.dumps(item) + '\n')

save_data(test_data_500, '../data/food-ingredients/test_500.jsonl')