# Exploratory data analysis of food images 

This notebook performs an exploratory data analysis (EDA) of the food images.

### EDA checklist

* What question(s) are you trying to solve (or prove wrong)?
* What kind of data do you have and how do you treat different types?
* What’s missing from the data and how do you deal with it?
* Where are the outliers and why should you care about them?
* How can you add, change or remove features to get more out of your data?

## Random walk across datasets

The bigger goal is to <span style="color:blue">recognize food from images</span>. We start by analyzing how much data is available for this project and what kind of data it is.

The project data consists of 4 different datasets:
* UECFOOD100
* UECFOOD256
* food-101
* google-images

<b>UECFOOD100</b>
This dataset is under the dataset100 directory and contains images of 100 different Japanese food items. The images of each food item are stored in a directory named with a number. The parent directory also contains a text file that maps the food directories to their corresponding labels.

In [None]:
# print the beginning of the file that contains the food labels
with open("../../data/raw/dataset100/UECFOOD100/category.txt", "r") as uecFood100DataLabels:
    # read(98) reads 98 characters (not words), roughly the first 7 lines here
    print(uecFood100DataLabels.read(98))
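
Since the image directories are named by number, it can help to load the whole mapping into a dictionary rather than reading a fixed number of characters. The following is a minimal sketch, assuming each data row of category.txt holds a directory number followed by whitespace and then the label; the helper name `load_category_labels` is hypothetical and not part of the dataset tooling.

```python
def load_category_labels(path):
    """Return {directory_number: label} parsed from a category.txt-style file.

    Assumption: each data row is "<number><whitespace><label>". Rows that do
    not start with a number (for example a header row) are skipped.
    """
    labels = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split(None, 1)  # split once, so labels may contain spaces
            if len(parts) != 2 or not parts[0].isdigit():
                continue  # blank line, header row, or malformed row
            labels[int(parts[0])] = parts[1].strip()
    return labels
```

With the mapping loaded, `load_category_labels("../../data/raw/dataset100/UECFOOD100/category.txt")[2]` would look up the label for directory 2.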

The images within each directory are named with seemingly arbitrary numbers. Here are the names of the last 10 files in directory number 2, i.e. <i>eels on rice</i>.

In [None]:
import os

path = '../../data/raw/dataset100/UECFOOD100/2'

filesEelsOnRice = []

# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        filesEelsOnRice.append(os.path.join(r, file))
        
filesEelsOnRice[-10:]

So, the images are not named in any systematic manner. Before we rename them, let's check the quality of these images by plotting three of them (<i>randomly selected: 15650, 10768, 112</i>) from this directory. Each directory also contains a bb_info.txt file, which we can look into as well.

In [None]:
from IPython.display import Image as Images, display
import random

listOfImageNames = ['15650.jpg', '10768.jpg', '112.jpg'] 

# Alternative: pick 3 images at random; this will not produce the same output every time.
# for i in range(3):
#     listOfImageNames.append(random.choice(os.listdir("../../data/raw/dataset100/UECFOOD100/2/")))
    
for imageName in listOfImageNames:
    display(Images(filename="../../data/raw/dataset100/UECFOOD100/2/" + imageName))
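
The commented-out random selection above gives different images on every run. One way to keep the selection random but repeatable is a seeded generator. This is a sketch only: the function name `sample_image_names` and the seed value are illustrative assumptions, and the file names are sorted first because `os.listdir` returns them in arbitrary order.

```python
import os
import random


def sample_image_names(directory, k=3, seed=42):
    """Pick up to k .jpg file names from directory, reproducibly for a given seed."""
    # Sort so the sample does not depend on the file system's listing order.
    names = sorted(n for n in os.listdir(directory) if n.lower().endswith(".jpg"))
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.sample(names, min(k, len(names)))
```

The returned names can be passed to the same `display(Images(filename=...))` loop used in the cell above.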
                

Two clear findings:
<b>
* images are not of fixed size
* images can contain multiple food items (other than the labeled item)
</b>
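
The first finding comes from looking at three images only. To check it across a whole directory, the image dimensions can be tallied. The sketch below assumes Pillow is installed (the notebook already uses `PIL`); the helper names `read_image_sizes` and `tally_sizes` are hypothetical.

```python
import os
from collections import Counter


def read_image_sizes(directory):
    """Return the (width, height) of every .jpg file in directory. Requires Pillow."""
    from PIL import Image  # imported here so tally_sizes stays usable without Pillow

    sizes = []
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(".jpg"):
            with Image.open(os.path.join(directory, name)) as im:
                sizes.append(im.size)  # Pillow reports size as (width, height)
    return sizes


def tally_sizes(sizes):
    """Count how often each (width, height) pair occurs, most common first."""
    return Counter(sizes).most_common()
```

Calling `tally_sizes(read_image_sizes("../../data/raw/dataset100/UECFOOD100/2"))` would list each distinct size with its count; more than one entry confirms that the images are not of a fixed size.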

Before we go further, let's see what is inside the <i>bb_info.txt</i> file.

In [None]:
with open("../../data/raw/dataset100/UECFOOD100/2/bb_info.txt", "r") as bbInfoMysteryContent:
    print(bbInfoMysteryContent.read(100))  # 100 is the number of characters (not words) to read

This file contains the pixel coordinates of the bounding box around the labeled food in each image. We can draw a rectangle using these coordinates on the image <i>15650.jpg</i>, which has the coordinates 27 152 258 312 (x1 y1 x2 y2).

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import numpy as np

%matplotlib inline

imageName = "../../data/raw/dataset100/UECFOOD100/2/15650.jpg"
im = np.array(Image.open(imageName), dtype=np.uint8)

# Create figure and axes
fig,ax = plt.subplots(1, figsize=(12,8))

# Display the image
ax.imshow(im)

# Create a Rectangle patch
# Rectangle expects ((x1, y1), width, height), so the corners are converted:
# width = x2 - x1 = 258 - 27 = 231, height = y2 - y1 = 312 - 152 = 160
rect = patches.Rectangle((27, 152), 231, 160, linewidth=5, edgecolor='b', facecolor='none')

# Add the patch to the Axes
ax.add_patch(rect)

plt.show()
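
The rectangle above was typed in by hand for one image. A sketch of reading the boxes from bb_info.txt and converting them to the form `patches.Rectangle` expects could look as follows. It assumes every data row has five whitespace-separated integers (image number, x1, y1, x2, y2), which matches the values quoted for 15650.jpg; `parse_bb_info` and `to_xywh` are hypothetical helper names.

```python
def parse_bb_info(lines):
    """Map image number -> list of (x1, y1, x2, y2) boxes from bb_info.txt-style lines.

    Assumption: data rows are "img x1 y1 x2 y2" as integers; any other row
    (such as a header) is skipped.
    """
    boxes = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 5 or not all(f.isdigit() for f in fields):
            continue  # header row or malformed row
        image_id, x1, y1, x2, y2 = map(int, fields)
        # An image may carry more than one box, so collect them in a list.
        boxes.setdefault(image_id, []).append((x1, y1, x2, y2))
    return boxes


def to_xywh(box):
    """Convert corner form (x1, y1, x2, y2) to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return x1, y1, x2 - x1, y2 - y1
```

An open file handle can be passed directly as `lines`, and the tuple from `to_xywh` unpacks into `patches.Rectangle((x, y), width, height, ...)`.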

The rectangle marks the labeled food item within the whole image. This can help later, either to locate labeled food items in an image or to crop them out to prepare a better training set.

The next thing to check is whether the dataset contains <b><i>duplicate images</i></b>. We scan directory 3, i.e. <i>pilaf</i>.

In [None]:
path = '../../data/raw/dataset100/UECFOOD100/3'

files = []

# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        files.append(os.path.join(r, file))
        
# Read each file once (and close it) instead of reopening it for every comparison
contents = []
for name in files:
    with open(name, "rb") as fh:
        contents.append(fh.read())

for i in range(len(files)):
    for j in range(i + 1, len(files)):
        if contents[i] == contents[j]:  # byte-for-byte comparison of the two files
            display(Images(filename=files[i]))
            display(Images(filename=files[j]))
            print("Duplicate images are: " + files[i] + " & " + files[j])  # names of the duplicate pair

Hmm, this is interesting! We have 5 duplicate pilaf images. We can remove these duplicates later while preparing the data for training.
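
The pairwise comparison above grows quadratically with the number of files. For larger directories, one alternative is to group files by a hash of their contents. This is a sketch of that swapped-in technique, not the method used in the cell above; the function name `find_duplicate_groups` is hypothetical. Like the byte comparison, it only finds exact copies, not resized or re-encoded versions of the same picture.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_groups(directory, suffix=".jpg"):
    """Group files in directory whose bytes are identical, using SHA-256 digests."""
    by_digest = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(suffix):
            continue  # ignore non-image files such as bb_info.txt
        path = os.path.join(directory, name)
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        by_digest[digest].append(path)
    # Keep only the digests shared by two or more files.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists the paths of files with identical content, so keeping the first path of every group and dropping the rest would remove the exact duplicates.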