# What's in an Avocado Toast: A Supply Chain Analysis

You're in London, making an avocado toast, a quick-to-make dish that has soared in popularity on breakfast menus since the 2010s. A simple smashed avocado toast can be made with five ingredients: one ripe avocado, half a lemon, a big pinch of salt flakes, two slices of sourdough bread and a good drizzle of extra virgin olive oil. It's no small feat that most of these ingredients are readily available in grocery stores. 

In this project, you'll conduct a supply chain analysis of three of these ingredients (avocado, olive oil, and sourdough) using the Open Food Facts database. This database contains extensive, openly sourced information on various foods, including their origins. Through this analysis, you will gain an in-depth understanding of the complex supply chain involved in producing a single dish.

Three pairs of files are provided in the data folder:
- A CSV file for each ingredient, such as `avocado.csv`, with data about each food item and countries of origin
- A TXT file for each ingredient, such as `relevant_avocado_categories.txt`, containing only the category tags of interest for that food.

Here are some other key points about these files:
- Some rows in each of the three CSV files are not relevant to your investigation. In each dataset, you will need to filter out these rows based on the values in the `categories_tags` column. Examples of categories are fruits, vegetables, and fruit-based oils. Filter the DataFrame to include only rows where `categories_tags` contains at least one of the tags in the relevant categories for that ingredient.
- Each row usually has multiple category tags in the `categories_tags` column.
- There is a column in each CSV file called `origins_tags` with strings for country of origin of that item.
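
The filtering rule described in these points can be sketched on a toy DataFrame. This is only an illustration: the rows and the tag values (such as `en:avocados`) are made up to mimic the layout described above and are not taken from the project files.

```python
import pandas as pd

# Made-up rows that mimic the described layout; the tag values are
# illustrative assumptions, not data from the project files.
toy = pd.DataFrame({
    'categories_tags': [
        'en:plant-based-foods,en:fruits,en:avocados',
        'en:snacks,en:crisps',
        'en:fruits,en:avocados,en:hass-avocados',
    ],
    'origins_tags': ['en:peru', 'en:united-kingdom', 'en:chile,en:peru'],
})
relevant = {'en:avocados', 'en:hass-avocados'}  # hypothetical relevant tags

# Keep a row when at least one of its comma-separated tags is relevant
keep = toy['categories_tags'].str.split(',').apply(
    lambda tags: any(tag in relevant for tag in tags)
)
filtered = toy[keep]
print(filtered.index.tolist())  # [0, 2]
```

The second row is dropped because none of its tags appear in the relevant set.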

After completing this project, you'll be armed with a list of ingredients and their countries of origin, and be well-positioned to launch into other analyses that explore how long, on average, these ingredients spend at sea.

![](avocado_wallpaper.jpeg)

In [219]:
import pandas as pd
import numpy as np

In [220]:
# Set up necessary info for reading files
csv_delimiter = '\t'
avocado_categories_file = 'data/relevant_avocado_categories.txt' 
avocado_csv = 'data/avocado.csv'
olive_oil_categories_file = 'data/relevant_olive_oil_categories.txt' 
olive_oil_csv = 'data/olive_oil.csv'
sourdough_categories_file = 'data/relevant_sourdough_categories.txt' 
sourdough_csv = 'data/sourdough.csv'

In [221]:
def get_categories(filename):
    with open(filename) as f:
        categories = [line.rstrip() for line in f]

    return categories

In [222]:
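def get_ingredient_df(filename):
    # The files are tab-separated; load only the columns used in the analysis
    df = pd.read_csv(filename, delimiter='\t', usecols=['categories_tags', 'origins_tags', 'countries'])
    # Drop rows that are missing a value in any of the three columns
    df = df.dropna()
    # Keep rows whose 'countries' value is exactly 'United Kingdom';
    # copy so that later column assignments do not act on a slice
    df = df[df['countries'] == 'United Kingdom'].copy()
    return df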

In [234]:
# Initial attempt using regex on string in column
import re

def get_filtered_df_1(df, categories):
    # Collect the rows matched by each category tag
    matches = []

    # Iterate over each search string and apply the search individually
    for search_string in categories:
        # Match the tag only as a whole comma-separated entry: it must be preceded by
        # the start of the string or a comma, and followed by a comma or the end
        pattern = r'(?:^|,){}(?:,|$)'.format(re.escape(search_string))
        # Case sensitivity can be controlled using the 'case' parameter
        matches.append(df[df['categories_tags'].str.contains(pattern, case=False)])

    # Combine the matches (DataFrame.append was removed in pandas 2.0)
    # and drop rows that were matched by more than one tag
    filtered_df = pd.concat(matches).drop_duplicates() if matches else df.iloc[0:0].copy()

    # Split the strings in the 'origins_tags' column by commas
    filtered_df['origins_tags'] = filtered_df['origins_tags'].str.split(',')

    # Explode the lists in the 'origins_tags' column into separate rows
    filtered_df = filtered_df.explode('origins_tags')

    return filtered_df
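
When searching a comma-separated string with a regular expression, it matters whether a tag is matched as a whole entry or as a substring of a longer tag. The sketch below shows the difference on two made-up tag strings; `en:avocados` and `en:avocados-spreads` are hypothetical values chosen for the illustration.

```python
import re
import pandas as pd

# Made-up tag strings; the values are assumptions for illustration only
tags = pd.Series([
    'en:fruits,en:avocados',
    'en:spreads,en:avocados-spreads',
])
target = 'en:avocados'  # hypothetical tag to search for

# A plain substring search also hits 'en:avocados-spreads'
loose = tags.str.contains(target, regex=False)

# Anchoring on commas (or the string boundaries) matches whole tags only
pattern = r'(?:^|,)' + re.escape(target) + r'(?:,|$)'
strict = tags.str.contains(pattern)

print(loose.tolist())   # [True, True]
print(strict.tolist())  # [True, False]
```

The non-capturing groups `(?:...)` are used so that pandas does not warn about match groups in the pattern.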

In [233]:
# Second attempt splitting string on delimiter into new column
def get_filtered_df_2(df, categories):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df['categories_list'] = df['categories_tags'].str.split(',')
    # Keep rows where at least one tag is in the relevant categories
    is_relevant = df['categories_list'].apply(
        lambda tags: any(tag in categories for tag in tags))
    df = df[is_relevant].copy()
    df['origins_tags'] = df['origins_tags'].str.split(',')

    # Explode the lists in the 'origins_tags' column into separate rows
    df = df.explode('origins_tags')

    return df
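
To see why the origins are split and exploded before counting, here is a small sketch on made-up origin strings (the values are assumptions, not project data). A row listing two origins contributes one count to each of them.

```python
import pandas as pd

# Made-up origin strings; the second row lists two countries
toy = pd.DataFrame({'origins_tags': ['en:peru', 'en:chile,en:peru', 'en:spain']})

# Turn each comma-separated string into a list, then give every
# list element its own row
toy['origins_tags'] = toy['origins_tags'].str.split(',')
exploded = toy.explode('origins_tags')

# Count how often each origin appears and pick the most frequent one
counts = exploded['origins_tags'].value_counts()
print(len(exploded))    # 4
print(counts.idxmax())  # en:peru
```

Without the explode step, `'en:chile,en:peru'` would be counted as its own distinct value rather than as one occurrence each of Chile and Peru.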

In [235]:
def get_top_origin_country(categories_file, csv_file):
    categories = get_categories(categories_file)
    df = get_ingredient_df(csv_file)
    # df = get_filtered_df_1(df, categories)
    df = get_filtered_df_2(df, categories)

    # Get the country that appears most often as an origin in the ingredient's file
    max_country = df['origins_tags'].value_counts().idxmax()
    top_origin = max_country.split(':')[1].replace('-', ' ').upper()

    return top_origin
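
The last step above turns a tag into a readable country name. The code implies that origin tags carry a prefix followed by a colon and a hyphenated name, so the sketch below assumes that shape. `format_origin_tag` is a hypothetical helper, not part of the notebook, and unlike `split(':')[1]` it tolerates a tag with no prefix.

```python
def format_origin_tag(tag):
    """Hypothetical helper: 'en:united-kingdom' -> 'UNITED KINGDOM'."""
    # partition returns an empty separator when the tag has no ':' prefix
    _, sep, name = tag.partition(':')
    if not sep:
        name = tag
    return name.replace('-', ' ').upper()


print(format_origin_tag('en:united-kingdom'))  # UNITED KINGDOM
print(format_origin_tag('peru'))               # PERU
```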

In [236]:
# Avocados
top_avocado_origin = get_top_origin_country(avocado_categories_file, avocado_csv)
print(top_avocado_origin)

PERU


In [237]:
# Olive Oil
top_olive_oil_origin = get_top_origin_country(olive_oil_categories_file, olive_oil_csv)
print(top_olive_oil_origin)

GREECE


In [238]:
# Sourdough
top_sourdough_origin = get_top_origin_country(sourdough_categories_file, sourdough_csv)
print(top_sourdough_origin)

UNITED KINGDOM
