# Overview


* This notebook was originally written to run in Colab ([link here](https://colab.research.google.com/drive/1miy7mCbC_ZxoyrcgqJj9mc0x4fJdXkWn?usp=sharing)).
* For versioning across milestones, see our [GitHub repository](https://github.com/dyeramosu/AC215_snapnutrition).
* FooDD is a large food dataset used in several papers that map food images to calorie/nutrition information. Those papers differ from Nutrition5k in that they attempt to identify the food or foods in each image and then map that identification to a known nutrition label for that food type. Nutrition5k does not focus on identifying food types, but it does include ingredient lists in its metadata.
* This experimental notebook contains basic EDA, preprocessing ideas, and ideation on how to annotate the data for versioning.
* This notebook also contains cells that link Colab to our GCP bucket; that approach was then successfully implemented in Nutrition5k_EDA_Base_Model.ipynb.



## Mount Drive

Choose either Google Drive or our team's GCS bucket.

In [None]:
# For Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# For GCP Bucket

# Authenticate
from google.colab import auth
auth.authenticate_user()

# Install Cloud Storage FUSE.
!echo "deb https://packages.cloud.google.com/apt gcsfuse-`lsb_release -c -s` main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
!apt -qq update && apt -qq install gcsfuse

# Mount a Cloud Storage bucket or location, without the gs:// prefix.
mount_path = "snapnutrition_data_bucket"  # or a location like "my-bucket/path/to/mount"
local_path = f"/mnt/gs/{mount_path}"

!mkdir -p {local_path}
!gcsfuse --implicit-dirs {mount_path} {local_path}
print('\n==== GCS Bucket Successfully Mounted ====\n')
!ls -lh {local_path}

deb https://packages.cloud.google.com/apt gcsfuse-jammy main
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2659  100  2659    0     0  17281      0 --:--:-- --:--:-- --:--:-- 17379
OK
19 packages can be upgraded. Run 'apt list --upgradable' to see them.
[1;33mW: [0mhttps://packages.cloud.google.com/apt/dists/gcsfuse-jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.[0m
gcsfuse is already the newest version (1.1.0).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.
I0925 21:56:52.148918 2023/09/25 21:56:52.148871 Start gcsfuse/1.1.0 (Go version go1.20.5) for app "" using mount point: /mnt/gs/snapnutrition_data_bucket

====GCS Bucket Successfully Mounted====

total 0
drwxr-xr-x 1 root root 0 Sep 25 21:56 processed_data
drwxr-xr-x 1 root root 0 Sep 25 21:56 raw_data


## EDA

Import Libraries

In [None]:
import os
import json
import cv2
import glob
import pandas as pd
import spacy


First we'll need to create a label for each image. The default directory structure is:

```
<parent>/
|----FooDD/
     |----<food label>/
          |----<camera & lighting>/
               |----<image number>.jpg
```

For now, we won't worry about camera and lighting information. Instead, we'll create a JSON file that annotates each food label with the paths to all corresponding images. It will be structured as follows:

```
{
    "<food label 1>":[
        <image path 1>,
        <image path 2>,
        <image path N>
    ],
    "<food label 2>":[
        <image path 1>,
        <image path 2>,
        <image path N>
    ]
}
```
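Because later steps rely on this exact shape, a quick structural check can catch mistakes early. The following is only a minimal sketch: `validate_annotations` is a hypothetical helper that is not part of the notebook, and the example path is made up for illustration.

```python
def validate_annotations(annotations):
    """Return a list of problems; an empty list means the structure matches."""
    if not isinstance(annotations, dict):
        return ["top level must be a dict"]

    problems = []
    for label, paths in annotations.items():
        # Every key should be a non-empty food label
        if not isinstance(label, str) or not label:
            problems.append(f"invalid label: {label!r}")
        # Every value should be a list of image path strings
        if not isinstance(paths, list):
            problems.append(f"{label!r}: value must be a list of paths")
            continue
        for path in paths:
            if not isinstance(path, str):
                problems.append(f"{label!r}: non-string path {path!r}")
    return problems


# Hypothetical example data, not taken from the dataset
example = {"apple": ["raw_data/FooDD/Apple/1.jpg"], "kiwi": []}
print(validate_annotations(example))
```

Returning a list of problems rather than raising on the first one makes it easier to see everything that needs fixing in a single pass.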

Run whichever cell below matches the drive you mounted.

In [None]:
# For Google Drive
root = '/content/drive/MyDrive/AC215'

# Set FooDD folder
FooDD = 'data/FooDD'

In [None]:
# For GCP Bucket
root = local_path

# Set FooDD folder
FooDD = 'raw_data/FooDD'

Now we'll create an annotations dictionary and see what food categories are in the dataset.

In [None]:
# Create an empty annotations dictionary
annotations = dict()

# Iterate through directory
for food in os.listdir(os.path.join(root, FooDD)):
    food_path = os.path.join(FooDD, food)

    # Note: the creators of this dataset included images they found from the
    # web. We'll set them aside for now
    if (food == "Net images" or not os.path.isdir(os.path.join(root, food_path))):
        continue

    food = food.lower().replace(' ','_')
    print(food)

    image_paths = glob.glob(
        "**/*.[Jj][Pp][Gg]",
        root_dir=os.path.join(root, food_path),
        recursive=True
    )

    annotations[food] = []
    for image_path in image_paths:
        annotations[food].append(os.path.join(food_path, image_path))


apple
banana
bean
bread
carrot
cheese
cucumber
egg
grape
grape_&_apple
mixed
onion
orange
pasta
pepper
qiwi
tomato
watermelon
sauce


We need to clean up the labels a bit before proceeding. `grape_&_apple` is technically `mixed`, so we'll merge it into that label. We'll also rename `qiwi` to the more common `kiwi`.

In [None]:
# Merge grape_&_apple into mixed
annotations['mixed'].extend(annotations['grape_&_apple'])
del annotations['grape_&_apple']

# Rename qiwi to kiwi
annotations['kiwi'] = annotations.pop('qiwi')

In [None]:
# Print results
for key in annotations.keys():
    print(key)

apple
banana
bean
bread
carrot
cheese
cucumber
egg
grape
mixed
onion
orange
pasta
pepper
tomato
watermelon
sauce
kiwi


Now let's take a look at the "Net images" folder to see what types of food are there. We'll need to get the name of each file and remove any capitalization, numbers, special characters, and plural forms to standardize the labels.

In [None]:
# Use spaCy to process file names
nlp = spacy.load('en_core_web_sm')

# Create a consolidated dictionary with food type as the key
net_images = dict()

for file_name in os.listdir(os.path.join(root, FooDD, "Net images")):

    # Remove extension, spaces, numbers, and special characters
    food = nlp(os.path.splitext(file_name)[0])
    food = '_'.join([token.lemma_.lower() for token in food if token.is_alpha])

    path = os.path.join(FooDD, "Net images", file_name)
    if food in net_images:
        net_images[food].append(path)
    else:
        net_images[food] = [path]

In [None]:
# Print the resulting foods
print(f'Additional foods from "Net images" folder: {len(net_images)}')
for food in net_images.keys():
    print(food)

Additional foods from "Net images" folder: 60
cucumber
friedchicken

apple
apricot
aubergine
avocado
beetroot
bread
cabbage
carrot
cauliflower
cherry
chili
coconut
corn
date
egg
fig
garlic
ginger
grapefruit
grape
green_onion
green_pepper
guava
imagescavutofk
kiwi
lemon
lemone
lentil
lettuce
mandarin
mango
melon
mushroom
okra
olive
onion
orang
orange
papaya
peach
pear
pineapple
pomegranate
potato
radish
raspberry
red_pepper
red_radish
rice
spinach
strawberry
sweet_potato
tomato
untitled
watermelon
white_radish
zucchini


We need to clean a few things up with the labels in this folder:
- Merge `untitled` and `imagescavutofk` into `grape`
- Rename `<blank>` (the empty label produced by file names with no alphabetic tokens) to `mixed`
- Merge `lemone` into `lemon`
- Rename `aubergine` to `eggplant`
- Rename `beetroot` to `beet`
- Rename `friedchicken` to `fried_chicken`
- Merge `orang` into `orange`
- Move `imagesCAKOFJ21.jpg` from `mixed` to `watermelon`

In [None]:
# Merge untitled into grape
net_images['grape'].extend(net_images['untitled'])
del net_images['untitled']

# Merge imagescavutofk into grape
net_images['grape'].extend(net_images['imagescavutofk'])
del net_images['imagescavutofk']

# Rename <blank> to mixed
net_images['mixed'] = net_images.pop('')

# Merge lemone into lemon
net_images['lemon'].extend(net_images['lemone'])
del net_images['lemone']

# Rename aubergine to eggplant
net_images['eggplant'] = net_images.pop('aubergine')

# Rename beetroot to beet
net_images['beet'] = net_images.pop('beetroot')

# Rename friedchicken to fried_chicken
net_images['fried_chicken'] = net_images.pop('friedchicken')

# Merge orang into orange
net_images['orange'].extend(net_images['orang'])
del net_images['orang']

# Move imagesCAKOFJ21.jpg to watermelon
net_images['watermelon'].append(os.path.join(FooDD, "Net images", "imagesCAKOFJ21.jpg"))
net_images['mixed'].remove(os.path.join(FooDD, "Net images", "imagesCAKOFJ21.jpg"))
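
The merges and renames in the cell above all follow one pattern: move the paths from one label to another, then drop the old label. A minimal sketch of that pattern as a single helper is shown here; `merge_label` is a hypothetical function that the notebook does not define, and the toy dictionary stands in for `net_images` with made-up file names.

```python
def merge_label(labels, source, target):
    """Move every path under `source` into `target`, then drop `source`.

    Acts as a rename when `target` does not exist yet.
    """
    if source == target:
        return
    paths = labels.pop(source)  # raises KeyError if `source` is missing
    labels.setdefault(target, []).extend(paths)


# Toy data standing in for net_images; the file names are made up
toy = {"lemon": ["a.jpg"], "lemone": ["b.jpg"], "aubergine": ["c.jpg"]}

# One (source, target) pair per cleanup rule
for source, target in [("lemone", "lemon"), ("aubergine", "eggplant")]:
    merge_label(toy, source, target)

print(toy)
```

Keeping the rules in a list of pairs makes them easy to review and extend when new datasets bring new misspellings.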

Next we'll merge `net_images` with `annotations` in preparation for creating our JSON file.

In [None]:
# Merge net_images with annotations
for food in net_images.keys():
    if food in annotations:
        annotations[food].extend(net_images[food])
    else:
        annotations[food] = net_images[food]


Run whichever cell below matches the drive you mounted. It will create an `annotations.json` file in the format discussed earlier.

In [None]:
# For Google Drive

# Specify the name of the JSON file
file_name = 'annotations.json'

# Open the file in write mode and save the dictionary as JSON
with open(os.path.join(root, FooDD, file_name), 'w') as json_file:
    json.dump(annotations, json_file)

In [None]:
# For GCP Bucket
from google.cloud import storage

# Specify the name of the JSON file
file_name = 'annotations.json'


storage_client = storage.Client()
bucket = storage_client.bucket(mount_path)
blob = bucket.blob(os.path.join(FooDD, file_name))


# Open the file in write mode and save the dictionary as JSON
with blob.open('w') as json_file:
    json.dump(annotations, json_file)

Start the notebook here after `annotations.json` is created. Choose the cell based upon which drive is mounted.

In [None]:
# For Google Drive
with open(os.path.join(root, FooDD, 'annotations.json'), 'r') as json_file:
    # Load the JSON data into a Python dictionary
    annotations = json.load(json_file)

In [1]:
# For GCP Bucket
# from google.cloud import storage

# storage_client = storage.Client()
# bucket = storage_client.bucket(mount_path)
# blob = bucket.blob(os.path.join(FooDD, 'annotations.json'))

# with blob.open("r") as json_file:
#     # Load the JSON data into a Python dictionary
#     annotations = json.load(json_file)


Let's have a look at some of the contents in this dataset.

In [None]:
print(f'Number of food classes: {len(annotations)}')
print(f'Number of images: {sum(len(value) for value in annotations.values())}')

Number of food classes: 62
Number of images: 3886


In [None]:
labels = []
for food, paths in annotations.items():
    labels.extend([food]*len(paths))
df = pd.DataFrame({'label':labels})
print('Number of images for each food class:\n')
food_counts = df['label'].value_counts()
pd.set_option('display.max_rows', len(food_counts))
print(food_counts)
pd.reset_option('display.max_rows')


Number of images for each food class:

apple            453
onion            339
bean             279
tomato           277
bread            270
egg              269
cheese           259
orange           241
sauce            210
pasta            207
grape            167
mixed            144
cucumber         123
banana           119
carrot            98
pepper            94
watermelon        78
kiwi              68
grapefruit        15
lemon             12
pomegranate       11
cabbage           11
papaya             9
apricot            8
avocado            8
zucchini           7
eggplant           6
melon              6
coconut            5
olive              5
chili              5
strawberry         5
garlic             5
sweet_potato       5
pineapple          5
pear               4
mango              3
guava              3
date               3
lettuce            3
cherry             3
raspberry          3
peach              3
red_radish         3
corn               3
mushroom        

## Potentially useful methods:

In [None]:
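import tensorflow as tf

# step 1: collect the image file names and labels
# (`df_all_meta` is assumed to be a DataFrame with `image_id` and `label`
# columns; it is not defined in this notebook)
filenames = tf.constant(list(df_all_meta['image_id']))
labels = tf.constant(list(df_all_meta['label']))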

# step 2: create a dataset returning slices of `filenames`
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

# step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
    image_string = tf.io.read_file('drive_path_to_image' + filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    image = tf.cast(image_decoded, tf.float32)/255
    return image, label

dataset = dataset.map(_parse_function)

In [None]:
# Method that takes images and converts them to a uniform size.

def resize_images(images, target_size):
    resized_images = tf.image.resize(images, target_size)
    return resized_images

# In the above code, `images` is the input tensor containing a batch of images,
# and `target_size` is the desired size for the images, specified as a tuple `(height, width)`.
# The `resize_images` function uses `tf.image.resize` to resize each image in the batch to the target size.

## Archived Cells

In [None]:
# # Create an empty annotations dictionary
# annotations = dict()

# # Iterate through directory
# for i, food in enumerate(os.listdir(path)):
#     food_path = os.path.join(path, food)

#     # Note: the creators of this dataset included images they found from the
#     # web. We'll set them aside for now
#     if (food == "Net images" or not os.path.isdir(os.path.join(path, food))):
#         continue

#     food = food.lower().replace(' ','_')
#     print(food)
#     images = glob.glob("**/*.[Jj][Pp][Gg]", root_dir=food_path, recursive=True)

#     for j, image_path in enumerate(images):
#         image_id = str(i).zfill(3) + str(j).zfill(5)
#         image_path = os.path.join(food_path, image_path)
#         image= cv2.imread(image_path)
#         height, width = image.shape[:2]
#         annotations[image_id] = {
#             'path': image_path,
#             'label': food,
#             'width': width,
#             'height': height
#         }

# # Could also make the annotations file simply keyed off the label and a list of
# #  values that are just the path to the image. No unique ID needed

apple
cucumber
bread
carrot
bean
banana
cheese
mixed
onion
orange
grape
egg
grape_&_apple
tomato
pepper
qiwi
pasta
sauce
watermelon


In [None]:
# labels, heights, widths = [], [], []

# for k, v in annotations.items():
#     labels.append(v['label'])
#     heights.append(v['height'])
#     widths.append(v['width'])

In [None]:
# print(f'Number of foods: {len(set(labels))}')
# print(f'Number of unique image sizes: {len(set(zip(widths, heights)))}')

Number of foods: 19
Number of unique image sizes: 16


In [None]:
import time

digits = 7
print(f'{time.time():.{digits}f}')
print(time.time())
print(time.time())
print(time.time())
print(time.time())
print(time.time())

1695590215.7572193
1695590215.759003
1695590215.7595582
1695590215.7600505
1695590215.760544


In [None]:
str(time.time()).replace('.','')

'1695590901707672'

In [None]:
import uuid

print(uuid.uuid4())
print(uuid.uuid4())
print(uuid.uuid4())
print(uuid.uuid4())

955feba2-bf66-46e4-95b6-789ba18223d9
6a180f83-c1e0-462c-92ee-c661d3c23118
126e3de0-fd17-4c2d-91e3-94ff36248488
bf0ff9eb-71d6-46a3-92b1-205c59be2895


In [None]:
print(uuid.uuid1())
print(uuid.uuid1())
print(uuid.uuid1())
print(uuid.uuid1())
print(uuid.uuid1())
print(uuid.uuid1())
print(uuid.uuid1())
print(uuid.uuid1())

2ea01ac0-5b1f-11ee-a5c4-0242ac1c000c
2ea063a4-5b1f-11ee-a5c4-0242ac1c000c
2ea07ca4-5b1f-11ee-a5c4-0242ac1c000c
2ea091c6-5b1f-11ee-a5c4-0242ac1c000c
2ea0a68e-5b1f-11ee-a5c4-0242ac1c000c
2ea0b3ae-5b1f-11ee-a5c4-0242ac1c000c
2ea0b7be-5b1f-11ee-a5c4-0242ac1c000c
2ea0bb6a-5b1f-11ee-a5c4-0242ac1c000c


In [None]:
str(uuid.uuid4())

'36a9f877-99ea-4fbe-94a0-cdd09aa66194'

Instead of trying to version the raw data GCP bucket, we'll version only the processed data bucket. For each dataset we add to the raw bucket, we'll create an annotations.json file that always has the same format. This will be done once in a Colab notebook and then uploaded to the GCP bucket. The annotations file will have the food category as the key, and the value will be a list of strings giving the file path of each image. It might also include a citation for the source of the data.

The preprocessing container will look for this annotations.json to perform all of the pipeline transformations, and then store the processed images in another GCP bucket. Each processed image will need a unique ID, so we'll use Python's `uuid` module to assign one. It might also be smart to create another JSON file in the processed data bucket that maps each UUID to the original file path, label, and source of the data.
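
The UUID-to-source mapping described above could look roughly like the sketch below. This is only an illustration under assumptions: `build_manifest`, the field names `path`, `label`, and `source`, and the toy file paths are all hypothetical and not part of the existing pipeline.

```python
import json
import uuid


def build_manifest(annotations, source):
    """Assign a random UUID to every image path in an annotations dict.

    Returns a dict keyed by UUID string; each value records where the
    image came from so the processed file can be traced back.
    """
    manifest = {}
    for label, paths in annotations.items():
        for path in paths:
            image_id = str(uuid.uuid4())
            manifest[image_id] = {
                "path": path,      # original location in the raw bucket
                "label": label,    # food category from annotations.json
                "source": source,  # dataset name or citation
            }
    return manifest


# Toy annotations with made-up paths
toy_annotations = {
    "apple": ["raw_data/FooDD/apple/1.jpg"],
    "kiwi": ["raw_data/FooDD/kiwi/1.jpg", "raw_data/FooDD/kiwi/2.jpg"],
}
manifest = build_manifest(toy_annotations, source="FooDD")
print(json.dumps(manifest, indent=2))
```

`uuid4` is used here because it is random and does not embed a timestamp or host address the way `uuid1` does, as the `uuid1` outputs earlier in this section show with their shared suffix.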

The version container will keep track of only the processed data bucket.