# Classification Dataset
BS"D

In this notebook, I will put together a dataset of food ingredients labeled according to dietary restrictions. It will be used to train one classifier per dietary restriction.

The diets I will be working on are:
- Vegetarian
- Vegan
- Gluten Free
- Dairy Free

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Hand-Labeling Data
The following set of ingredients will be hand-labeled according to the dietary restrictions.

In [1]:
def load_common_ingredients_from_source_json(file_name="source_data/common_ingredients.json"):
    '''
    Load the list of common ingredients from the source data

    Parameters
    ----------
    file_name : str
        The name of the file to load the data from (default is "source_data/common_ingredients.json")

    Returns
    -------
    pd.DataFrame
        A dataframe containing the common ingredients and the frequency of their occurrence

    '''

    # Load the data
    data = pd.read_json(file_name)

    # Each entry in the "data" column is a dictionary, so pull its
    # "index" and "quantity" fields out into separate columns
    data["ingredient"] = data["data"].apply(lambda x: x["index"])
    data["quantity"] = data["data"].apply(lambda x: x["quantity"])

    # Drop the data column
    data = data.drop("data", axis=1)

    return data
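
The reshaping step in the loader can also be written with `pd.json_normalize`. The following is only a sketch on made-up toy records, assuming each entry in the `data` column is a dictionary with `index` and `quantity` keys, as the loader above implies. The `toy` frame and its quantities are hypothetical and are not taken from the source file.

```python
import pandas as pd

# Hypothetical toy records mimicking the assumed layout of the "data" column:
# each entry is a dict with an "index" (ingredient name) and a "quantity".
toy = pd.DataFrame({
    "data": [
        {"index": "salt", "quantity": 3},
        {"index": "water", "quantity": 2},
    ]
})

# json_normalize expands a list of dicts into one column per key.
expanded = pd.json_normalize(toy["data"].tolist())

# Rename "index" to match the column name used by the loader above.
expanded = expanded.rename(columns={"index": "ingredient"})

print(expanded)
```

This avoids one `apply` call per field, which may be convenient if the dictionaries ever gain more keys.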

In [None]:
common_ingredients = load_common_ingredients_from_source_json()

common_ingredients

Unnamed: 0,ingredient,quantity
0,salt,18049
1,olive oil,7972
2,onions,7972
3,water,7457
4,garlic,7380
...,...,...
495,boneless chicken breast,146
496,crème fraîche,145
497,cooked white rice,145
498,pecans,144


In [None]:
# Save the data as an excel file
# with pd.ExcelWriter("common_ingredients.xlsx") as writer:
#     common_ingredients.to_excel(writer, sheet_name="common_ingredients", index=False)

### Loading the Hand-Labeled Data

In [17]:
small_dataset = pd.read_csv("common_ingredients_initial.csv")

small_dataset


Unnamed: 0,ingredient,vegetarian,vegan,dairy_free,gluten_free
0,salt,Yes,Yes,Yes,Yes
1,olive oil,Yes,Yes,Yes,Yes
2,onions,Yes,Yes,Yes,Yes
3,water,Yes,Yes,Yes,Yes
4,garlic,Yes,Yes,Yes,Yes
...,...,...,...,...,...
494,boneless chicken breast,No,no,yes,Yes
495,crème fraîche,Yes,no,no,Yes
496,cooked white rice,Yes,yes,yes,Yes
497,pecans,Yes,yes,yes,Yes


In [12]:
columns = small_dataset.columns

def print_unique_values_in_columns(columns, dataset):
    '''
    Print the unique values in each column of a dataset

    Parameters
    ----------
    columns : list
        A list of column names
    dataset : pd.DataFrame
        The dataset to print the unique values from

    Returns
    -------
    None

    '''

    for column in columns:
        if column == "ingredient":
            continue

        # Print the unique values in the column
        print(f"Column: {column}")
        print(dataset[column].unique())
        print("\n")

print_unique_values_in_columns(columns, small_dataset)

Column: vegetarian
['Yes' 'No']


Column: vegan
['Yes' 'No' 'yes' 'no' '?']


Column: dairy_free
['Yes' 'No' 'no' 'yes' '?']


Column: gluten_free
['Yes' 'No' 'Sometimes']
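
Besides listing the unique values, it can help to see how often each variant occurs. The following is a minimal sketch using `value_counts` on a hypothetical toy column. The `toy` frame is made up for illustration and its counts say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy column with inconsistent capitalization and a "?" entry.
toy = pd.DataFrame({"vegan": ["Yes", "yes", "No", "?", "Yes"]})

# value_counts returns a Series mapping each distinct value to its frequency.
counts = toy["vegan"].value_counts()

print(counts)
```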




I will now convert every value in the dataset to lowercase so that variants such as "Yes" and "yes" count as the same label.

In [13]:
# Make everything lowercase
columns = small_dataset.columns
for column in columns:
    small_dataset[column] = small_dataset[column].str.lower()

# Print the unique values again to confirm the labels are now consistent
print_unique_values_in_columns(columns, small_dataset)

Column: vegetarian
['yes' 'no']


Column: vegan
['yes' 'no' '?']


Column: dairy_free
['yes' 'no' '?']


Column: gluten_free
['yes' 'no' 'sometimes']
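
A slightly more defensive variant of the lowercasing step is sketched below. It assumes the label columns are every column except `ingredient`, and it also strips stray whitespace, which hand-typed labels sometimes contain. The `toy` frame and the `label_columns` name are hypothetical. Unlike the cell above, this sketch leaves the `ingredient` column untouched.

```python
import pandas as pd

# Hypothetical toy frame with mixed case and stray whitespace in the labels.
toy = pd.DataFrame({
    "ingredient": ["Salt", "bread"],
    "vegan": ["Yes", " no "],
})

# Assumption: every column other than "ingredient" holds labels.
label_columns = [c for c in toy.columns if c != "ingredient"]

for column in label_columns:
    # Strip surrounding whitespace first, then lowercase.
    toy[column] = toy[column].str.strip().str.lower()

print(toy)
```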




#### Collect the values that aren't yes or no

In [14]:
# Find the rows that have values that are not "yes" or "no"
for column in columns:
    if column == "ingredient":
        continue

    print(f"Column: {column}")
    print(small_dataset[~small_dataset[column].isin(["yes", "no"])])
    print("\n")



Column: vegetarian
Empty DataFrame
Columns: [ingredient, vegetarian, vegan, dairy_free, gluten_free]
Index: []


Column: vegan
           ingredient vegetarian vegan dairy_free gluten_free
408             chili        yes     ?        yes         yes
428             bread        yes     ?        yes          no
466  asian fish sauce         no     ?        yes         yes


Column: dairy_free
              ingredient vegetarian vegan dairy_free gluten_free
453  semisweet chocolate        yes    no          ?         yes


Column: gluten_free
    ingredient vegetarian vegan dairy_free gluten_free
347    noodles        yes   yes        yes   sometimes
352  tortillas        yes   yes        yes   sometimes




The "?" in the vegan column for asian fish sauce is a labeling mistake. Fish sauce is made from fish, so it is not vegan. I will change the value to "no".

In [15]:
# Show the row of asian fish sauce
small_dataset[small_dataset["ingredient"] == "asian fish sauce"]

Unnamed: 0,ingredient,vegetarian,vegan,dairy_free,gluten_free
466,asian fish sauce,no,?,yes,yes


In [16]:
# Change the vegan column to "no" for asian fish sauce
small_dataset.loc[small_dataset["ingredient"] == "asian fish sauce", "vegan"] = "no"

# Show the row of asian fish sauce
small_dataset[small_dataset["ingredient"] == "asian fish sauce"]

Unnamed: 0,ingredient,vegetarian,vegan,dairy_free,gluten_free
466,asian fish sauce,no,no,yes,yes
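
If more corrections of this kind are needed, the update can be wrapped in a small helper that checks it touches exactly one row. The following is only a sketch: `set_label` is a hypothetical helper, not part of pandas or of this notebook, and the `toy` frame stands in for `small_dataset`.

```python
import pandas as pd

def set_label(dataset, ingredient, column, value):
    """Set one label in place, failing loudly if the ingredient is missing or duplicated."""
    mask = dataset["ingredient"] == ingredient
    matches = int(mask.sum())
    if matches != 1:
        raise ValueError(f"expected exactly one row for {ingredient!r}, found {matches}")
    dataset.loc[mask, column] = value

# Toy frame standing in for small_dataset.
toy = pd.DataFrame({
    "ingredient": ["asian fish sauce", "salt"],
    "vegan": ["?", "yes"],
})

set_label(toy, "asian fish sauce", "vegan", "no")

print(toy)
```

The guard catches a misspelled ingredient name, which a plain `.loc` assignment would silently ignore.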
