## Imports

In [3]:
import pandas as pd
import numpy as np

## Data Input
Source : https://www.foodstandards.gov.au/science-data/monitoringnutrients/afcd/australian-food-composition-database-download-excel-files

In [4]:
data_path = "data//source//"

## Data Loading from Excel Files
In this section, we load two datasets from the Australian Food Composition Database with `pd.read_excel`.
Both Excel files are read from the directory given by the variable `data_path`.

1. **Food Dataset**: 
   - This dataset contains background information relating to each food.

2. **Nutrient Dataset**: 
   - This dataset contains the nutrient data available for each food, provided on two tabs. Tab 1 (per 100 g) reports all foods and beverages per 100 g edible portion. Tab 2 (per 100 mL) reports beverages and other liquid foods only, per 100 mL edible portion. We load the per 100 g tab.

In [5]:
food = pd.read_excel(data_path + 'food_file.xlsx')
nutrient = pd.read_excel(data_path + 'nutrient_file.xlsx', sheet_name = 'All solids & liquids per 100g')

## Data Preprocessing 
We check for missing values across all columns in the `food` and then the `nutrient` dataset. 

### Output Interpretation:
From the output, we observe that most columns do not have missing values (`False`). However, columns like `Classification` and `Classification Name` do contain missing values (`True`). This insight will guide us in handling these missing values, which could involve imputing them, dropping them, or analyzing them further to understand the reasons behind their absence.

In [6]:
food.isna().any()

Public Food Key        False
Food Profile ID        False
Derivation             False
Food Name              False
Food Description       False
Sampling Details       False
Nitrogen Factor        False
Fat Factor             False
Specific Gravity       False
Analysed Portion       False
Unanalysed Portion     False
Classification          True
Classification Name     True
dtype: bool

## Quantifying Missing Values in `Classification` and `Classification Name` columns
This step helps in understanding the extent of missing data and informs subsequent data cleaning or imputation strategies.
The output is printed to give a clear count of how many missing values are present in each of these columns.

### Output Interpretation:
- The printed output indicates that there is only 1 missing value in both the `Classification` and `Classification Name` columns.

- **Matching Number of Missing Values**: The fact that both columns have exactly 1 missing value each suggests a possibility that the missing value in both columns comes from the same row. This could occur, for example, if a particular food item was not properly classified, resulting in missing entries in both related columns.

In [7]:
missing_values_sum_classification = food['Classification'].isna().sum()
missing_values_sum_classification_name = food['Classification Name'].isna().sum()

print("Sum of missing values in 'classification' column:", missing_values_sum_classification)
print("Sum of missing values in 'classification name' column:", missing_values_sum_classification_name)

Sum of missing values in 'classification' column: 1
Sum of missing values in 'classification name' column: 1
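
The matching counts alone do not prove that both gaps sit in the same row. A minimal sketch on toy data (the `toy` frame and its values are made up for illustration) shows one way to count missing values for every column at once and to check the co-occurrence directly:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `food` DataFrame (values are illustrative only)
toy = pd.DataFrame({
    "Food Name": ["Apple", "Bread", "Cheese"],
    "Classification": [11101.0, np.nan, 19201.0],
    "Classification Name": ["Apples", np.nan, "Cheese, processed"],
})

# Missing-value count for every column in one call
missing_counts = toy.isna().sum()
print(missing_counts)

# True only if both columns are missing in exactly the same rows
same_rows = toy["Classification"].isna().equals(toy["Classification Name"].isna())
print("Missing values fall in the same rows:", same_rows)
```

If `same_rows` is `True`, the two gaps belong to the same records and can be handled together.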


## Identifying Rows with Missing Values in the Food Dataset
In this section, we focus on pinpointing the exact rows in the `food` dataset that contain missing values. 

### Output Interpretation:
- The output shows `Index([1616], dtype='int64')`, indicating that row index 1616 in the pandas DataFrame contains missing values.
- Note that pandas index 1616 corresponds to row 1618 in the Excel file: pandas indexing starts at 0 whereas Excel's row numbering starts at 1, and Excel row 1 holds the header, which `pd.read_excel` uses as column names rather than as a data row.
- The row at index 1616, identified as having missing values, is actually just a repetition of the column headers, and can be safely removed from the dataset without losing any valuable data.

In [8]:
nan_row_indices = food[food.isna().any(axis=1)].index
print(nan_row_indices)

Index([1616], dtype='int64')
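
The offset between the pandas index and the Excel row number can be written down as a small helper. This is only a sketch: `excel_row_from_index` is a hypothetical name, and it assumes the sheet was read with the default integer index and a single header row.

```python
def excel_row_from_index(idx: int, header_rows: int = 1) -> int:
    """Map a 0-based pandas row label to a 1-based Excel row number.

    Assumes a default RangeIndex and that the first `header_rows`
    rows of the sheet hold the column headers.
    """
    # +header_rows skips the header, +1 converts from 0-based to 1-based
    return idx + header_rows + 1


print(excel_row_from_index(1616))  # 1616 + 1 header row + 1 = 1618
```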


## Removing Duplicate Header Row from the DataFrame
- We drop the row to ensure the integrity and cleanliness of our dataset.

In [9]:
food = food.drop(1616)
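
Dropping by the hard-coded label 1616 works for this file, but the label may change if the source file is updated. As a possible alternative, the sketch below (toy data, hypothetical values) drops rows by condition instead of by position:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `food`; the middle row has no classification
toy_food = pd.DataFrame({
    "Food Name": ["Apple", "Stray row", "Bread"],
    "Classification": [11101.0, np.nan, 12201.0],
})

# Drop every row that lacks a 'Classification', whatever its index label
cleaned = toy_food.dropna(subset=["Classification"])
print(cleaned)
```

The remaining rows keep their original index labels (here 0 and 2), which matches the behaviour of `drop`.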

## Removing unnecessary columns for Macro Food Calculator Database
We retain only the most relevant fields, such as the unique code Public Food Key, food name, and detailed description.

In [10]:
food.drop(['Derivation', 'Food Profile ID', 'Sampling Details', 'Specific Gravity', 'Analysed Portion', 'Unanalysed Portion', 'Nitrogen Factor', 'Fat Factor'], axis=1, inplace=True, errors = 'ignore')

## Identifying Missing Values in the `nutrient` Dataset
### Output Interpretation:
- Columns such as `Classification`, `Serine (mg)`, `Threonine (mg)`, `Tyrosine (mg)`, and `Valine (mg)` show `True`, which tells us that these columns have missing data points for some food items.

In [11]:
nutrient.isna().any()

Public Food Key                                  False
Classification                                    True
Food Name                                        False
Energy with dietary fibre, equated \n(kJ)        False
Energy, without dietary fibre, equated \n(kJ)    False
                                                 ...  
Serine \n(mg)                                     True
Threonine \n(mg)                                  True
Tyrosine \n(mg)                                   True
Tryptophan \n(mg)                                False
Valine \n(mg)                                     True
Length: 293, dtype: bool

## Locating Rows with Missing 'Classification' in the Nutrient Dataset
- The code below selects the rows where the `Classification` column is `NaN` and retrieves their indices.

### Output Interpretation:
- These indices correspond to rows 126 and 132 in the pandas DataFrame. 
- As we investigate the rows with missing data, we notice that 

In [12]:
nan_indices = nutrient[nutrient['Classification'].isna()].index
print(nan_indices) 

Index([126, 132], dtype='int64')


## Filling Missing 'Classification' Values in the Nutrient Dataset Using Data from the Food Dataset
- First, we create `classification_map`, a Series from the `food` DataFrame. In this Series, 'Public Food Key' is set as the index, and 'Classification' is the value. This mapping provides a reference to look up the 'Classification' of each food item based on its unique key.
- Next, we use this mapping to fill in missing 'Classification' values in the `nutrient` DataFrame. 

In [13]:
# Mapping Series from the 'food' DataFrame where 'Public Food Key' is the index and 'Classification' is the value
classification_map = food.set_index('Public Food Key')['Classification']

# Use the map to fill NaN values in the 'Classification' column of 'nutrient' DataFrame
nutrient['Classification'] = nutrient['Classification'].fillna(nutrient['Public Food Key'].map(classification_map))
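
The same lookup-and-fill pattern can be tried on toy data. The frames and values below are invented for illustration; the sketch also assumes that each key appears only once in the lookup Series, which `map` needs in order to align values unambiguously.

```python
import numpy as np
import pandas as pd

toy_food = pd.DataFrame({
    "Public Food Key": ["F1", "F2", "F3"],
    "Classification": [100.0, 200.0, 300.0],
})
toy_nutrient = pd.DataFrame({
    "Public Food Key": ["F1", "F2", "F3"],
    "Classification": [100.0, np.nan, 300.0],
})

# Lookup Series: index = key, value = classification
lookup = toy_food.set_index("Public Food Key")["Classification"]

# map() finds the lookup value for each key; fillna() uses it only where a value is missing
toy_nutrient["Classification"] = toy_nutrient["Classification"].fillna(
    toy_nutrient["Public Food Key"].map(lookup)
)
print(toy_nutrient)
```

Existing values are left untouched; only the `NaN` for key `F2` is replaced.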

## Removing unnecessary columns for Macro Food Calculator Database

In [14]:
# Drop unnecessary columns from the 'nutrient' df by listing the ones to keep
columns_to_keep = [
    'Public Food Key', 'Classification', 'Food Name', 
    'Energy with dietary fibre, equated \n(kJ)',
    'Protein \n(g)', 'Fat, total \n(g)',
    'Total dietary fibre \n(g)', 'Total sugars (g)', 
    'Added sugars (g)', 'Free sugars \n(g)', 'Starch \n(g)',
    'Available carbohydrate, with sugar alcohols \n(g)',
    'Caffeine \n(mg)'
]

# Index the DataFrame with this list to keep only these columns
nutrient = nutrient[columns_to_keep]

## Dataframe Merging 
**Setting Index**:
- We prepare `nutrient` and `food` DataFrames for merging by setting their 'Public Food Key' columns as indexes.

In [15]:
# Setting 'Public Food Key' as index for original dfs
nutrient.set_index('Public Food Key', inplace=True)
food.set_index('Public Food Key', inplace=True)

## Checking for Column Consistency Across `nutrient` and `food` DataFrames

**Identifying Common Keys**:
  - Finding common indices from `Public Food Key` in both DataFrames using `intersection`. 

### Output Interpretation:
- The output `False` for the 'Classification' column indicates discrepancies between the two datasets. This suggests that for some common food items, the classification differs between the `nutrient` and `food` DataFrames.


In [16]:
# Find the common keys in both dataframes
common_keys = nutrient.index.intersection(food.index)

# Iterate through each column to compare
for column in ['Food Name', 'Classification']:
    # Compare the columns for common keys and check if all matching
    matching_columns = nutrient.loc[common_keys, column].equals(food.loc[common_keys, column])
    print(f"{column} columns identical for common primary keys in both dataframes?", matching_columns)

Food Name columns identical for common primary keys in both dataframes? True
Classification columns identical for common primary keys in both dataframes? False
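
One caveat when reading this result: `Series.equals` compares data types as well as values, so `False` can also come from a dtype difference rather than from differing values. A small sketch with made-up Series:

```python
import pandas as pd

as_float = pd.Series([28202.0, 28301.0], index=["A", "B"])
as_int = pd.Series([28202, 28301], index=["A", "B"])

# equals() requires identical values AND identical dtypes (float64 vs int64 here)
strict_equal = as_float.equals(as_int)
print(strict_equal)

# Elementwise comparison only looks at the values
values_equal = bool((as_float == as_int).all())
print(values_equal)
```

Here the strict check fails while the elementwise check passes, so an elementwise comparison is the better tool for locating which keys actually differ.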


## Investigating Discrepancies in 'Classification' Column Across DataFrames
Since the previous code identified a discrepancy in the 'Classification' column between the `nutrient` and `food` DataFrames, we further investigate to pinpoint the exact mismatches.

### Output Interpretation:
- We find a mismatch for the key `F009802`, indicating that the classification for this particular food item is different in the `nutrient` DataFrame compared to the `food` DataFrame.

In [17]:
# Find mismatches in the 'Classification' column
mismatch_mask = nutrient.loc[common_keys, 'Classification'] != food.loc[common_keys, 'Classification']
mismatched_keys = mismatch_mask[mismatch_mask].index.tolist()

# Print the keys where mismatches were found
if mismatched_keys:  # If the list is not empty
    print("Mismatches in 'Classification' for keys:", mismatched_keys)
else:
    print("No mismatches found in 'Classification' column.")

Mismatches in 'Classification' for keys: ['F009802']
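
To see the differing values next to each other rather than only the keys, the mask can be reused to build a small comparison table. The sketch below uses toy Series with hypothetical keys and values:

```python
import pandas as pd

nutrient_col = pd.Series([10.0, 20.0, 30.0], index=["K1", "K2", "K3"])
food_col = pd.Series([10.0, 25.0, 30.0], index=["K1", "K2", "K3"])

# Boolean mask of keys whose values differ
mask = nutrient_col != food_col

# Both versions side by side, for the mismatched keys only
side_by_side = pd.DataFrame({
    "nutrient": nutrient_col[mask],
    "food": food_col[mask],
})
print(side_by_side)
```

Keep in mind that `NaN != NaN` evaluates to `True`, so keys that are missing in both columns would also show up in such a table.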


## Resolving the Classification Discrepancy for Key 'F009802'

**Accessing Classification Data**:
  - We access the 'Classification' data for the key 'F009802' in both `food` and `nutrient` DataFrames using the `loc` method to directly compare the classification values assigned in each dataset.
 
### Output Interpretation:
- The output reveals that the classification codes for this particular food item are different in the two datasets, with `food` having a classification of `28202.0` and `nutrient` having `28301.0`.

In [18]:
# Accessing data by the index 'F009802'
food_classification = food.loc['F009802', 'Classification']
print(food_classification)

nutrient_classification = nutrient.loc['F009802', 'Classification']
print(nutrient_classification)

28202.0
28301.0


## Updating Classification for Key 'F009802'
We have chosen to update the classification in the `food` DataFrame to match that in the `nutrient` DataFrame.

**Rationale for Choosing '28301.0'**:
  - The key 'F009802' corresponds to a 'Bar, snack, with >35% nuts, added protein'. In the `nutrient` DataFrame, this item is classified as '28301.0', while in the `food` DataFrame, it is classified as '28202.0'.
  - Upon closer examination, we found that another item, 'F009801' (Bar, snack, with >55% nuts), is consistently classified as '28202' in both dataframes. This consistency suggests that '28202' might represent a general category for nut-based snack bars.
  - However, since 'F009802' includes added protein, which could differentiate it nutritionally and categorically, the classification '28301.0' in the `nutrient` DataFrame may more accurately reflect its unique characteristics. Therefore, we choose to align the `food` DataFrame with this classification.


In [19]:
# Updating the 'Classification' value for the row with 'Public Food Key' of 'F009802'
food.loc['F009802', 'Classification'] = '28301.0'



## Verifying the Alignment of 'Classification' Columns Across Both DataFrames
We perform a final check to ensure alignment and consistency across the `food` and `nutrient` DataFrames.

### Output Interpretation:
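- On the first run, this check still reported mismatches, because the two 'Classification' columns did not yet share a data type. The `True` shown below comes from re-running the cell after the string conversion in the next section, as the execution counts (`In [21]` before `In [22]`) indicate.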

In [22]:
# Compare 'Classification' columns for common keys
classification_match = (food.loc[common_keys, 'Classification'] == nutrient.loc[common_keys, 'Classification']).all()
print("Do 'Classification' columns match for all common 'Public Food Key'?", classification_match)

Do 'Classification' columns match for all common 'Public Food Key'? True


## Converting 'Classification' Column to String Type for Consistency
- We convert both columns to strings, eliminating data type mismatches.
- Re-running the check above after this conversion resolves the mismatch in the `Classification` column.

In [21]:
nutrient['Classification'] = nutrient['Classification'].astype(str)
food['Classification'] = food['Classification'].astype(str)
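
Why the conversion helps can be seen on a two-row toy example. The keys and codes are taken from the cells above, but the Series themselves are constructed here purely for illustration:

```python
import pandas as pd

# One value stored as text, one as a float (mixed object column)
food_col = pd.Series(["28301.0", 28202.0], index=["F009802", "F009801"], dtype=object)
nutrient_col = pd.Series([28301.0, 28202.0], index=["F009802", "F009801"])

# A string never equals a float, even if both print the same
before = (food_col == nutrient_col).tolist()
print(before)

# After casting both sides to str, the values are compared as text
after = (food_col.astype(str) == nutrient_col.astype(str)).tolist()
print(after)
```

The comparison flips from a mismatch on `F009802` to a full match once both sides hold strings.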

## Creating the Final DataFrame

**Merging DataFrames with Suffixes**:
  - To avoid column name conflicts, we specify suffixes on columns with the same name. 
  - The merge is performed on the indices, as both DataFrames share 'Public Food Key' as their index. 

**Column Selection**:
  - From the merged DataFrame, we select specific columns that are most relevant for our macro food calculator. 

**Resetting the Index**:
  - To make 'Public Food Key' a column again (instead of an index), we reset the index of our final DataFrame.
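
The three steps above can be sketched on toy data first. The frames, keys and the `Protein (g)` column are placeholders, not the real dataset. Note that only column names present in both frames receive a suffix, and that taking a `.copy()` of the selection is one way to avoid pandas' `SettingWithCopyWarning` on later in-place edits.

```python
import pandas as pd

left = pd.DataFrame(
    {"Food Name": ["Apple", "Bread"], "Classification": ["1", "2"]},
    index=pd.Index(["F1", "F2"], name="Public Food Key"),
)
right = pd.DataFrame(
    {"Food Name": ["Apple", "Bread"], "Protein (g)": [0.3, 9.0]},
    index=pd.Index(["F1", "F2"], name="Public Food Key"),
)

# Merge on the shared index; overlapping column names get the suffixes
merged = left.merge(right, left_index=True, right_index=True,
                    suffixes=("_food", "_nutrient"))
print(merged.columns.tolist())

# Select columns and take an independent copy of the selection
subset = merged[["Food Name_food", "Protein (g)"]].copy()

# The index is named, so reset_index() restores it as a column of that name
subset = subset.reset_index()
print(subset.columns.tolist())
```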

In [24]:
# Merging the dataframes with specified suffixes to avoid errors with columns of the same name
merged_df = food.merge(nutrient, left_index=True, right_index=True, suffixes=('_food', '_nutrient'))

# Selecting specific columns to keep; .copy() avoids SettingWithCopyWarning on later edits
final_df = merged_df[[
    'Food Name_food', 'Classification_food', 'Classification Name', 
    'Protein \n(g)', 'Fat, total \n(g)', 
    'Available carbohydrate, with sugar alcohols \n(g)', 
    'Total dietary fibre \n(g)', 'Total sugars (g)', 
    'Added sugars (g)', 'Free sugars \n(g)', 'Starch \n(g)', 
    'Caffeine \n(mg)', 'Energy with dietary fibre, equated \n(kJ)'
]].copy()

# Resetting the index to turn 'Public Food Key' back to a column
final_df = final_df.reset_index()
# Fallback rename, needed only if the index came through the merge unnamed
final_df = final_df.rename(columns={'index': 'Public Food Key'})


## Adding Converted Energy Values (kJ to kCal) column to the Final DataFrame
- We provide energy content both in kilojoules (kJ) and kilocalories (kCal), catering to diverse user preferences and standards.
- This column is calculated by converting the energy values from kilojoules (kJ) to kilocalories (kCal). The conversion formula used is based on the principle that 1 kCal is approximately equal to 4.184 kJ.
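
As a quick sanity check of the factor used in the next cell, the conversion can be written as a small helper. `kj_to_kcal` and `KJ_PER_KCAL` are hypothetical names introduced only for this sketch; the constant assumes the thermochemical calorie, for which 1 kcal = 4.184 kJ.

```python
KJ_PER_KCAL = 4.184  # thermochemical calorie: 1 kcal = 4.184 kJ


def kj_to_kcal(kj: float, ndigits: int = 1) -> float:
    """Convert kilojoules to kilocalories, rounded to `ndigits` decimals."""
    return round(kj / KJ_PER_KCAL, ndigits)


print(kj_to_kcal(418.4))  # 418.4 kJ is 100 kcal
print(1 / KJ_PER_KCAL)    # roughly 0.239006, the multiplier used in the notebook
```

Dividing by 4.184 and multiplying by 0.239005736 are equivalent up to rounding.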

In [25]:
# Adding a column 'Energy with dietary fibre, equated \n(kCal)' to the DataFrame
final_df['Energy with dietary fibre, equated \n(kCal)'] = (final_df['Energy with dietary fibre, equated \n(kJ)'] * 0.239005736).round(1)



## Renaming Columns for Clarity and Consistency in the Final DataFrame
- When merging the `food` and `nutrient` DataFrames, we added suffixes to columns that shared a name in both. We now rename these columns in `final_df` back to their plain names for clarity.

In [26]:
final_df.rename(columns={'Classification_food': 'Classification', 'Food Name_food': 'Food Name'}, inplace=True)



## Categorizing Food Items in the Final DataFrame
To enhance the usability of the final dataframe, we add another layer of categorization by mapping each specific 'Classification Name' to a broader category, simplifying and grouping the food items. 

- We investigate each unique 'Classification Name' from the dataframe. 

In [27]:
# unique values in Classification Name
final_df['Classification Name'].unique()

array(['Herbs(dried) and spices', 'Salt', 'Stock cubes and seasonings',
       'Chemical raising agents and cooking ingredients', 'Gelatine',
       'Cereal flours and starches', 'Potato products', 'Essences',
       'Yeast', 'Beers, > 3.5% alcohol',
       'Beers, 1.15- 3.5% alcohol, reduced alcohol / light', 'Spirits',
       'Cider',
       'Wines, red (including sparkling varieties and rose styles)',
       'Wines, white (including sparkling varieties)', 'Fortified wines',
       'Unfortified dry beverage flavourings',
       'Fortified dry beverage flavourings',
       'Fortified beverage flavourings prepared with water or milk',
       'Unfortified beverage flavourings prepared with water or milk',
       'Dry coffee powder, caffeinated or decaffeinated',
       'Coffee beverage, prepared with water',
       'Coffee beverage, decaffeinated, prepared with water',
       'Coffee beverage, prepared with milk or milk substitute',
       'Cordial concentrate', 'Cordials, made from con

**Creating a Category Mapping**:
  - A dictionary named `category_mapping` is created, where each key represents a specific 'Classification Name' from our dataset, and each value corresponds to a broader category.
  - Note that this categorization into broader categories is based on manual curation and domain knowledge.
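
A dictionary of this kind is typically applied with `Series.map`. The sketch below uses a two-entry toy mapping and a hypothetical `Category` column name. It also shows one way to list names that the dictionary does not cover yet, which is useful because the mapping is curated by hand.

```python
import pandas as pd

toy_mapping = {
    "Salt": "Spices and Cooking Ingredients",
    "Cider": "Beverages",
}
toy = pd.DataFrame({"Classification Name": ["Salt", "Cider", "Unknown item"]})

# map() returns NaN for any name that is not a key of the dictionary
toy["Category"] = toy["Classification Name"].map(toy_mapping)

# List the names that still need a category
unmapped = toy.loc[toy["Category"].isna(), "Classification Name"].unique().tolist()
print(unmapped)
```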

In [28]:
# Creating a dictionary where keys are the specific 'Classification Names' and values are the broad categories
category_mapping = {
    # Spices and Cooking Ingredients
    'Herbs(dried) and spices': 'Spices and Cooking Ingredients',
    'Salt': 'Spices and Cooking Ingredients',
    'Stock cubes and seasonings': 'Spices and Cooking Ingredients',
    'Chemical raising agents and cooking ingredients': 'Spices and Cooking Ingredients',
    'Gelatine': 'Spices and Cooking Ingredients',
    'Cereal flours and starches': 'Spices and Cooking Ingredients',
    'Cereal flours and starches, fortified': 'Spices and Cooking Ingredients',
    'Essences': 'Spices and Cooking Ingredients',
    'Yeast': 'Spices and Cooking Ingredients',
    'Sugar': 'Spices and Cooking Ingredients',
    'Herbs, fresh': 'Spices and Cooking Ingredients',

    # Cakes and Cake Mixes
    'Cakes and cake mixes, other types': 'Cakes and Cake Mixes',
    'Slices, biscuit and cake-type': 'Cakes and Cake Mixes',
    'Cakes and cake mixes, chocolate': 'Cakes and Cake Mixes',
    'Cakes and cake mixes, sponge': 'Cakes and Cake Mixes',
    'Muffins, cake type, and muffin mixes': 'Cakes and Cake Mixes',
    'Pancakes, crepes and dishes': 'Cakes and Cake Mixes',
    'Drop scones, pikelets': 'Cakes and Cake Mixes',
    'Cake-type desserts': 'Cakes and Cake Mixes',
    'Scones and rock cakes, plain or with added fruit or vegetables only': 'Cakes and Cake Mixes',

    # Beverages
    'Beers, > 3.5% alcohol': 'Beverages',
    'Beers, 1.15- 3.5% alcohol, reduced alcohol / light': 'Beverages',
    'Spirits': 'Beverages',
    'Cider': 'Beverages',
    'Wines, red (including sparkling varieties and rose styles)': 'Beverages',
    'Wines, white (including sparkling varieties)': 'Beverages',
    'Fortified wines': 'Beverages',
    'Unfortified dry beverage flavourings': 'Beverages',
    'Fortified dry beverage flavourings': 'Beverages',
    'Fortified beverage flavourings prepared with water or milk': 'Beverages',
    'Unfortified beverage flavourings prepared with water or milk': 'Beverages',
    'Coffee beverage, prepared with water': 'Beverages',
    'Coffee beverage, decaffeinated, prepared with water': 'Beverages',
    'Coffee beverage, prepared with milk or milk substitute': 'Beverages',
    'Cordial concentrate': 'Beverages',
    'Cordials, made from concentrate': 'Beverages',
    'Fruit drinks (ready to drink or made from concentrate)': 'Beverages',
    'Fruit juices, commercially prepared': 'Beverages',
    'Purchased packaged water including mineral water': 'Beverages',
    'Flavoured mineral waters': 'Beverages',
    'Soft drinks, cola': 'Beverages',
    'Soft drinks, cola, intense sweetened': 'Beverages',
    'Energy drinks': 'Beverages',
    'Soft drinks, non-cola': 'Beverages',
    'Soft drinks, non-cola, intense sweetened': 'Beverages',
    'Tea, regular, caffeinated, prepared with water': 'Beverages',
    'Tea, regular, decaffeinated, prepared with water or milk': 'Beverages',
    'Domestic water (including tap, tank/rain water)': 'Beverages',
    'Coffee-based mixes, beverage': 'Beverages',
    'Other alcoholic beverages': 'Beverages',
    'Milk, coffee/chocolate flavoured and milk-based drinks, reduced fat': 'Beverages',
    'Sport and protein prepared beverages': 'Beverages',

    # Savoury Biscuits and Crackers 
    'Savoury biscuits, rice based (includes rice cakes)': 'Savoury Biscuits and Crackers',
    'Savoury biscuits, wheat based, plain, energy >1800 kJ per 100 g': 'Savoury Biscuits and Crackers',
    'Savoury biscuits, wheat based, plain, energy <=1800 kJ per 100 g': 'Savoury Biscuits and Crackers',
    'Savoury biscuits, corn based': 'Savoury Biscuits and Crackers',
    'Savoury biscuits, rye based': 'Savoury Biscuits and Crackers',
    'Savoury biscuits': 'Savoury Biscuits and Crackers',

    # Sweet Biscuits and Cookies
    'Sweet biscuits, plain or flavoured including short bread varieties': 'Sweet Biscuits and Cookies',
    'Sweet biscuits, chocolate-coated, chocolate or cream filled': 'Sweet Biscuits and Cookies',
    'Sweet biscuits, cream-filled': 'Sweet Biscuits and Cookies',
    'Sweet biscuits, plain with fruit or nuts': 'Sweet Biscuits and Cookies',
    'Sweet biscuits, chocolate-coated, chocolate chip': 'Sweet Biscuits and Cookies',

    # Bread
    'Breads, and bread rolls, white, mandatorily fortified': 'Bread',
    'Breads, and bread rolls, white, not stated as to fortification': 'Bread',
    'Flat breads (e.g. Pita bread), wheat based': 'Bread',
    'Breads, and bread rolls, white, additional voluntary fortification': 'Bread',
    'Savoury filled or topped breads and bread rolls': 'Bread',
    'Breads, and bread rolls, wholemeal and brown, mandatorily fortified': 'Bread',
    'Breads, and bread rolls, mixed grain, mandatorily fortified': 'Bread',
    'Sweet breads, buns and scrolls, uniced, unfilled': 'Bread',
    'Fried bread products and garlic breads': 'Bread',
    'Breads, and bread rolls, gluten free': 'Bread',
    'Breads, and bread rolls, rye, mandatorily fortified': 'Bread',
    'Sweet breads, buns and scrolls, iced and/or filled': 'Bread',
    'Crumpets': 'Bread',
    'English-style muffins': 'Bread',

    # Quick Breakfasts
    'Muesli and cereal style bars, added coatings or confectionery': 'Quick Breakfasts',
    'Muesli bar, with fruit or fruit paste filling': 'Quick Breakfasts',
    'Breakfast cereal beverages': 'Quick Breakfasts',
    'Breakfast cereal, corn based': 'Quick Breakfasts',
    'Breakfast cereal, corn based, fortified': 'Quick Breakfasts',
    'Breakfast cereal, mixed grain, fortified, sugars <=20 g/100g': 'Quick Breakfasts',
    'Breakfast cereal, mixed grain, with fruit and/or nuts, fortified': 'Quick Breakfasts',
    'Breakfast cereal, mixed grain, fortified, sugars >20 g/100g': 'Quick Breakfasts',
    'Breakfast cereal, mixed grain': 'Quick Breakfasts',
    'Breakfast cereal, rice based': 'Quick Breakfasts',
    'Breakfast cereal, rice based, fortified': 'Quick Breakfasts',
    'Breakfast cereal, wheat based, with fruit and/or nuts, fortified, sugars <=25 g/100g': 'Quick Breakfasts',
    'Breakfast cereal, wheat based': 'Quick Breakfasts',
    'Breakfast cereal, wheat based, fortified, sugars <=20 g/100g': 'Quick Breakfasts',
    'Breakfast cereal, mixed grain, with fruit and/or nuts': 'Quick Breakfasts',
    'Muesli and cereal style bars, with fruit and/or nuts': 'Quick Breakfasts',
    'Muesli and cereal style bars, no fruit': 'Quick Breakfasts',

    # Pastries
    'Pastry, plain/unfilled, all types': 'Pastries',
    'Savoury pastry products, pies, rolls and envelopes': 'Pastries',
    'Savoury pastry products, pies, rolls and envelopes, fried': 'Pastries',
    'Sweet pastry products, fruit and/or nut fillings': 'Pastries',
    'Sweet pastry products, egg or dairy based fillings': 'Pastries',

    # Rice, Grains, Nuts and Seeds
    'Grains (other than rice) and grain fractions': 'Rice, Grains, Nuts and Seeds',
    'Porridge style, oat based': 'Rice, Grains, Nuts and Seeds', 
    'Seed products': 'Rice, Grains, Nuts and Seeds',
    'Rice and rice grain fractions': 'Rice, Grains, Nuts and Seeds',
    'Other nuts and nut products and dishes': 'Rice, Grains, Nuts and Seeds',
    'Peanuts': 'Rice, Grains, Nuts and Seeds',
    'Seeds': 'Rice, Grains, Nuts and Seeds',

    # Pasta and Noodles
    'Pasta and noodles, not wheat based': 'Pasta and Noodles',
    'Pasta and noodles, wheat based, other than instant noodles': 'Pasta and Noodles',
    'Instant noodles and noodle products, wheat based': 'Pasta and Noodles',
    'Savoury pasta/noodle and sauce dishes, saturated fat <=5 g/100 g': 'Pasta and Noodles',
    'Beef dishes with gravy, sauce or vegetables': 'Pasta and Noodles',

    # Sauce, Dressing and Condiments
    'Italian and French-style dressings, full fat': 'Sauce, Dressing and Condiments',
    'Mayonnaise and cream-style dressings, full fat': 'Sauce, Dressing and Condiments',
    'Dry gravy mixes': 'Sauce, Dressing and Condiments',
    'Gravies (prepared)': 'Sauce, Dressing and Condiments',
    'Mayonnaise and cream-style dressings, reduced or non-fat': 'Sauce, Dressing and Condiments',
    'Savoury sauces, not tomato based, commercial': 'Sauce, Dressing and Condiments',
    'Savoury pastes': 'Sauce, Dressing and Condiments',
    'Savoury sauces, commercial, simmer style': 'Sauce, Dressing and Condiments',
    'Savoury sauces, tomato based, commercial': 'Sauce, Dressing and Condiments',
    'Savoury sauces, dairy based, homemade': 'Sauce, Dressing and Condiments',
    'Vinegar': 'Sauce, Dressing and Condiments',
    'Fruit-based pickles, chutneys and relishes': 'Sauce, Dressing and Condiments',
    'Vegetable-based pickles, chutneys and relishes': 'Sauce, Dressing and Condiments',

    # Legumes and Pulses
    'Legume and pulse products': 'Legumes and Pulses',
    'Beans': 'Legumes and Pulses',
    'Mature legumes and pulses': 'Legumes and Pulses',
    'Mature legumes and pulses, commercially sterile': 'Legumes and Pulses',
    'Peas and edible-podded peas': 'Legumes and Pulses',

    # Fish and Seafood
    'Fish and seafood products': 'Fish and Seafood',
    'Fin fish, fresh, frozen': 'Fish and Seafood', 
    'Fin fish, battered or crumbed': 'Fish and Seafood', 
    'Eel': 'Fish and Seafood',
    'Molluscs, fresh, frozen': 'Fish and Seafood', 
    'Crustacea, fresh, frozen': 'Fish and Seafood',
    'Packed fin fish': 'Fish and Seafood',
    'Smoked fish': 'Fish and Seafood',

    # Meat and Meat Products
    'Meat substitutes': 'Meat and Meat Products', 
    'Beef': 'Meat and Meat Products', 
    'Veal': 'Meat and Meat Products', 
    'Lamb and mutton': 'Meat and Meat Products',
    'Other mammalian game': 'Meat and Meat Products', 
    'Pork': 'Meat and Meat Products',
    'Kangaroo': 'Meat and Meat Products', 
    'Liver': 'Meat and Meat Products', 
    'Bacon': 'Meat and Meat Products',
    'Processed meat, commercially sterile (includes canned meats)': 'Meat and Meat Products',
    'Frankfurts and saveloys, saturated fat content >5 g/100g': 'Meat and Meat Products', 
    'Ham': 'Meat and Meat Products',
    'Sausage, saturated fat content >5 g/100g': 'Meat and Meat Products',
    'Sausages, frankfurts and saveloys, saturated fat content <=5 g/100g': 'Meat and Meat Products',
    'Fermented, comminuted meats (e.g. Salami):': 'Meat and Meat Products',
    'Reptiles': 'Meat and Meat Products',
    'Wild harvested mammalian meat': 'Meat and Meat Products',
    'Burgers, saturated fat >5 g/100 g': 'Meat and Meat Products',
    'Burgers, saturated fat <=5 g/100 g': 'Meat and Meat Products',
    'Fermented, comminuted meats (e.g. Salami)': 'Meat and Meat Products',

    # Poultry and Poultry Products
    'Chicken': 'Poultry and Poultry Products',
    'Poultry crumbed, battered, meatloaf or patty type with cereal and/or vegetables': 'Poultry and Poultry Products',
    'Other poultry': 'Poultry and Poultry Products',
    'Feathered game': 'Poultry and Poultry Products',
    'Processed delicatessen meat, poultry': 'Poultry and Poultry Products',
    'Poultry and feathered game': 'Poultry and Poultry Products',

    # Spreads and Dips
    'Peanut products': 'Spreads and Dips', 
    'Yeast extracts': 'Spreads and Dips',
    'Polyunsaturated margarine spreads, fat content >= 65g/100g': 'Spreads and Dips',
    'Margarine spreads with added phytosterols': 'Spreads and Dips',
    'Monounsaturated margarine spreads, fat content >= 65 g/100g': 'Spreads and Dips',
    'Monounsaturated margarine spreads,fat content <65 g/100g': 'Spreads and Dips',
    'Jams and conserves, sugar sweetened': 'Spreads and Dips',
    'Sweet spreads or sauces, chocolate/coffee flavoured': 'Spreads and Dips',
    'Honey and sugar syrups': 'Spreads and Dips',
    'Legume based dips': 'Spreads and Dips',

    # Dairy
    'Cheese, hard cheese ripened styles': 'Dairy',
    'Cheese, camembert, brie and other surface ripened cheeses': 'Dairy',
    'Cheese, hard cheese ripened styles, reduced fat': 'Dairy',
    'Cheese, processed': 'Dairy', 
    'Cheese, processed, reduced fat': 'Dairy',
    'Cheese, unripened styles, including cream and cottage cheese, regular fat': 'Dairy',
    'Cream, regular and increased fat': 'Dairy',
    'Cream substitute, artificial cream': 'Dairy', 
    'Cream, sour': 'Dairy',
    'Cream, sour, reduced fat': 'Dairy',
    'Ice cream, tub varieties, fat content >10 g/100 g': 'Dairy',
    'Ice cream, individual bar, stick and cone varieties, fat content 4 - 10 g/100 g': 'Dairy',
    'Water ice confection, gelato, sorbet': 'Dairy',
    'Milk, cow, fluid, regular whole, full fat': 'Dairy',
    'Milk, cow, fluid, regular whole, full fat, fortified': 'Dairy',
    'Milk, cow, fluid, reduced fat, <2 g/100g': 'Dairy',
    'Milk, cow, fluid, reduced fat, <2 g/100g, fortified': 'Dairy',
    'Milk, cow, fluid, skim, non-fat': 'Dairy',
    'Milk, evaporated or condensed, undiluted': 'Dairy',
    'Milk, powder, cow, dry': 'Dairy', 
    'Human breast milk': 'Dairy',
    'Custard, fat content <4 g/100 g': 'Dairy',
    'Dairy desserts, smooth or gelatin-based dairy desserts': 'Dairy',
    'Yoghurt, natural, regular fat and high fat (>4 g/100g fat)': 'Dairy',
    'Yoghurt, flavoured or added fruit, full fat': 'Dairy',
    'Yoghurt, flavoured or added fruit, low fat or skim, sugar sweetened': 'Dairy',
    'Butter': 'Dairy', 
    'Butter products': 'Dairy',
    'Dairy blend, regular, fat content >= 65g/100g': 'Dairy',
    'Yoghurt, flavoured or added fruit and/or cereal, high fat (>4 g/100g fat)': 'Dairy',
    'Yoghurt, flavoured or added fruit, low fat or skim, intense sweetened': 'Dairy',
    'Yoghurt, flavoured or added fruit, reduced fat': 'Dairy',

    # Dairy Alternatives
    'Cereal- or nut-based milk substitute': 'Dairy Alternatives',
    'Soy-based beverage, plain': 'Dairy Alternatives',
    'Soy-based beverage, plain, fortified': 'Dairy Alternatives',
    'Soy-based beverage, plain, reduced fat, fortified': 'Dairy Alternatives',
    'Soy-based yoghurts, regular fat': 'Dairy Alternatives',
    'Soy-based yoghurts, reduced fat': 'Dairy Alternatives',
    'Cheese substitute': 'Dairy Alternatives',
    'Dairy & meat substitutes': 'Dairy Alternatives',

    # Fats and Oils
    'Monounsaturated oils': 'Fats and Oils', 
    'Polyunsaturated oils': 'Fats and Oils',
    'Vegetable-based solid fats': 'Fats and Oils', 
    'Animal-based solid fats': 'Fats and Oils',

    # Egg Products 
    'Eggs, chicken': 'Egg Products', 
    'Eggs, chicken, modified (e.g. Omega-3, folate)': 'Egg Products',
    'Egg dishes, savoury': 'Egg Products',

    # Fruits and Vegetables
    'Apples': 'Fruits and Vegetables',
    'Other dried fruit including mixed dried fruit': 'Fruits and Vegetables',
    'Other stone fruit, commercially sterile': 'Fruits and Vegetables', 
    'Bananas': 'Fruits and Vegetables',
    'Berry fruit': 'Fruits and Vegetables', 
    'Other stone fruit': 'Fruits and Vegetables',
    'Mixtures of two or more groups of fruit, commercially sterile': 'Fruits and Vegetables',
    'Other citrus fruit': 'Fruits and Vegetables', 
    'Dried vine fruit': 'Fruits and Vegetables',
    'Other tropical and subtropical fruit, inedible peel': 'Fruits and Vegetables',
    'Other tropical and subtropical fruit, edible peel': 'Fruits and Vegetables', 
    'Other fruit': 'Fruits and Vegetables',
    'Lemons and limes': 'Fruits and Vegetables', 
    'Preserved fruit': 'Fruits and Vegetables', 
    'Other pome fruit': 'Fruits and Vegetables',
    'Citrus fruit, commercially sterile': 'Fruits and Vegetables',
    'Tropical and subtropical fruit, commercially sterile': 'Fruits and Vegetables', 
    'Oranges': 'Fruits and Vegetables',
    'Peaches and nectarines': 'Fruits and Vegetables',
    'Peaches and nectarines, commercially sterile': 'Fruits and Vegetables', 
    'Pears': 'Fruits and Vegetables',
    'Pears, commercially sterile': 'Fruits and Vegetables', 
    'Pineapples': 'Fruits and Vegetables',
    'Berry fruit, commercially sterile': 'Fruits and Vegetables',
    'Coconut and coconut products': 'Fruits and Vegetables',
    'Stalk vegetables': 'Fruits and Vegetables',
    'Other root vegetables': 'Fruits and Vegetables', 
    'Other fruiting vegetables': 'Fruits and Vegetables',
    'Cabbage and similar brassica vegetables': 'Fruits and Vegetables',
    'Broccoli, broccolini and cauliflower': 'Fruits and Vegetables', 
    'Carrots': 'Fruits and Vegetables', 
    'Sweetcorn': 'Fruits and Vegetables',
    'Leaf vegetables': 'Fruits and Vegetables', 
    'Other vegetables': 'Fruits and Vegetables',
    'Mixtures of two or more vegetables': 'Fruits and Vegetables',
    'Mushrooms': 'Fruits and Vegetables',
    'Potatoes': 'Fruits and Vegetables',
    'Potato mixed dishes': 'Fruits and Vegetables',
    'Pumpkin': 'Fruits and Vegetables',
    'Seaweeds': 'Fruits and Vegetables',
    'Sprouts': 'Fruits and Vegetables',
    'Squash and zucchini': 'Fruits and Vegetables',
    'Tomato': 'Fruits and Vegetables',
    'Tomato products': 'Fruits and Vegetables',
    'Wild harvested vegetables': 'Fruits and Vegetables',
    'Onion, leek and garlic': 'Fruits and Vegetables',

    # Snacks
    'Corn chips': 'Snacks',
    'Extruded snacks': 'Snacks', 
    'Popcorn': 'Snacks', 
    'Potato crisps': 'Snacks',
    'Potato products': 'Snacks',
    'Other snacks': 'Snacks',

    # Confectionery
    'Chocolate-based confectionery with nut fillings or additions': 'Confectionery',
    'Chocolate-based confectionery with other fillings or additions': 'Confectionery',
    'Chocolate (plain, unfilled varieties)': 'Confectionery',
    'Lollies and other confectionery, sugar sweetened': 'Confectionery',
    'Sugar-based desserts': 'Confectionery',
    'Fruit bar and fruit-based confectionery': 'Confectionery',
    'Nut and seed based confectionery': 'Confectionery',
    'Chocolate and chocolate-based confectionery': 'Confectionery',
    'Doughnuts': 'Confectionery',

    # Instant Food
    'Dry soup mix containing meat, poultry or seafood': 'Instant Food',
    'Soup containing meat, poultry or seafood': 'Instant Food',
    'Dry soup mix, vegetable only': 'Instant Food',

    # Ready Meals
    'Savoury rice-based dishes, saturated fat <=5 g/100 g': 'Ready Meals',

    # Sandwiches and Wraps
    'Sandwiches and filled rolls, saturated fat <=5 g/100 g': 'Sandwiches and Wraps',

    # Fast Food and Takeaway
    'Pizza, saturated fat <=5 g/100 g': 'Fast Food and Takeaway',

    # Flours and Starches
    'Cereal flours and starches': 'Flours and Starches',
    'Cereal flours and starches, fortified': 'Flours and Starches',

    # Powdered Drinks
    'Dry coffee powder, caffeinated or decaffeinated': 'Powdered Drinks',
    'Dry or concentrate coffee-based mixes': 'Powdered Drinks',
    'Sport and protein, dry powders': 'Powdered Drinks',

}

In [29]:
# Create a new column in final_df using mapping 
final_df['Category'] = final_df['Classification Name'].map(category_mapping)
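
`Series.map` with a dict returns `NaN` for any value that has no key in the dict, so a classification name that is absent from `category_mapping` (or spelled slightly differently) ends up without a category. A minimal sketch of a check for that, using a small stand-in frame and mapping rather than the real `final_df`; the names `toy_df` and `toy_mapping` and their contents are illustrative only:

```python
import pandas as pd

# Stand-in data: the notebook itself would use final_df and category_mapping
toy_mapping = {'Butter': 'Dairy', 'Popcorn': 'Snacks'}
toy_df = pd.DataFrame({'Classification Name': ['Butter', 'Popcorn', 'Kale chips']})

toy_df['Category'] = toy_df['Classification Name'].map(toy_mapping)

# Names with no entry in the mapping come back as NaN
unmapped = sorted(
    toy_df.loc[toy_df['Category'].isna(), 'Classification Name'].unique()
)
print(unmapped)  # ['Kale chips']
```

Running the same filter on `final_df` would list any classification names that still need an entry in the mapping.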

## Adding 'Standard Serve' Column to the Data in a New Excel Sheet
- We add information about the standard serving size of each food item.
Source: https://nutritionaustralia.org/app/uploads/2022/03/Australian-Dietary-Guidelines-Standard-serves.pdf

**Creating a Copy of the Final DataFrame**:
- We start by creating a copy of `final_df` and adding a `Serving Size (g)` column, which will hold the standard serving size of each food item.
- We will then recalculate the nutritional values in this new DataFrame to reflect the serving sizes.

In [30]:
final_df2 = final_df.copy()

# Add the new `Serving Size (g)` column, initially missing in every row
final_df2['Serving Size (g)'] = pd.NA

column_order = [
    'Public Food Key', 'Food Name', 'Classification', 
    'Classification Name', 'Serving Size (g)', 
    'Protein \n(g)', 'Fat, total \n(g)', 
    'Available carbohydrate, with sugar alcohols \n(g)', 
    'Total dietary fibre \n(g)', 'Total sugars (g)', 
    'Added sugars (g)', 'Free sugars \n(g)', 
    'Starch \n(g)', 'Caffeine \n(mg)', 
    'Energy with dietary fibre, equated \n(kJ)', 
    'Energy with dietary fibre, equated \n(kCal)', 
    'Category'
]

# Reorder the columns
final_df2 = final_df2[column_order]
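
Several of these column names contain an embedded newline, which is easy to mistype, and selecting with a list of labels raises a `KeyError` if any label is absent. A small sketch of a pre-check, run here on a made-up frame (`toy_df` and `wanted` are hypothetical stand-ins for `final_df2` and `column_order`):

```python
import pandas as pd

# Stand-in frame with only two of the expected columns
toy_df = pd.DataFrame({'Food Name': ['Butter'], 'Protein \n(g)': [0.8]})
wanted = ['Food Name', 'Protein \n(g)', 'Starch \n(g)']

# Labels requested for the reorder that the frame does not have
missing = [col for col in wanted if col not in toy_df.columns]
print(missing)
```

An empty list means the reorder will succeed.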

## Adjusting Nutritional Values Based on Serving Size
- We create two primary functions, `target_energy` and `adjust_serving_size`, which will be used to adjust the nutritional values in the `final_df2` DataFrame. The adjustments are based on the serving sizes calculated for different food categories.
- Nutritional values in each row are modified to reflect the appropriate serving size for the food item's category.

### `target_energy` function:
- Calculates the serving size based on a target energy value (in kJ) and adjusts nutritional values accordingly.
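
As a worked example of that calculation, the sketch below reproduces the arithmetic for a hypothetical food with 1500 kJ and 12 g of protein per 100 g and a 500 kJ target. The helper name `serving_size_for_energy` and the numbers are made up for illustration; foods with zero energy need separate handling because the division is undefined.

```python
def serving_size_for_energy(energy_per_100g, target_energy_kj):
    """Grams of food that supply target_energy_kj, given kJ per 100 g."""
    return target_energy_kj / energy_per_100g * 100


serving_size = serving_size_for_energy(1500, 500)  # about 33.33 g
scale_factor = serving_size / 100                  # about 0.33
protein_per_serve = 12 * scale_factor              # about 4 g

print(round(serving_size, 2), round(protein_per_serve, 2))  # 33.33 4.0
```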

### `adjust_serving_size` function:
- **Predefined Serving Sizes**: Applies specific serving sizes for categories like 'Spices and Cooking Ingredients', 'Fats and Oils', 'Sauce, Dressing and Condiments', and 'Spreads and Dips'.
- **Category-Specific Adjustments**:
  - For categories in `adjust_500`, `adjust_600`, and `adjust_350`, it calls `target_energy` with target energies of 500 kJ, 600 kJ, and 350 kJ respectively.
  - Sets the serving size to 125 g for items in the 'Beverages' category whose food name contains 'juice'.
  - Defaults to `target_energy` with a target of 600 kJ for all other categories.

**Note on Standard Serving Sizes**: The standard serving sizes in this analysis are primarily based on the minimum energy value recommendations of the Australian Dietary Guidelines, with some exceptions:
- For `Fruits and Vegetables`, the serving size is set according to the maximum target energy value recommended by the guidelines.
- For `Spices and Cooking Ingredients`, a fixed serving size of 2 grams per serve is used.
- Categories not explicitly mentioned in the guidelines are treated as 'Discretionary Choices', for which the Australian Dietary Guidelines suggest a standard serve of 600 kJ.


In [31]:
def target_energy(row, target_energy_kj):
    energy_per_100g = row['Energy with dietary fibre, equated \n(kJ)']

    # Guard against division by zero for foods that report no energy
    if energy_per_100g == 0:
        if row['Category'] == 'Spices and Cooking Ingredients':
            serving_size = 2  
        else:
            # For other categories, set 100 as the serving size
            serving_size = 100
    else:
        serving_size = (target_energy_kj / energy_per_100g) * 100

    scale_factor = serving_size / 100

    # Adjusting nutritional values
    row['Protein \n(g)'] *= scale_factor
    row['Fat, total \n(g)'] *= scale_factor
    row['Available carbohydrate, with sugar alcohols \n(g)'] *= scale_factor
    row['Total dietary fibre \n(g)'] *= scale_factor
    row['Total sugars (g)'] *= scale_factor
    row['Added sugars (g)'] *= scale_factor
    row['Free sugars \n(g)'] *= scale_factor
    row['Starch \n(g)'] *= scale_factor
    row['Caffeine \n(mg)'] *= scale_factor
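    # Energy per serve equals the target, except for zero-energy foods, which stay at 0 kJ
    row['Energy with dietary fibre, equated \n(kJ)'] = target_energy_kj if energy_per_100g != 0 else 0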
    row['Energy with dietary fibre, equated \n(kCal)'] *= scale_factor
    row['Serving Size (g)'] = serving_size

    return row


def adjust_serving_size(row):
    # Predefined serving sizes for certain categories
    predefined_serving_sizes = {
        'Spices and Cooking Ingredients': 2,
        'Fats and Oils': 13,
        'Sauce, Dressing and Condiments': 20,
        'Spreads and Dips': 20
    }

    # Categories with target energy adjustments
    adjust_500 = [
        'Cakes and Cake Mixes', 'Bread', 'Confectionery', 'Dairy', 'Dairy Alternatives', 
        'Egg Products', 'Fish and Seafood', 'Legumes and Pulses', 'Meat and Meat Products', 
        'Pasta and Noodles', 'Poultry and Poultry Products', 'Quick Breakfasts', 'Rice, Grains, Nuts and Seeds'
    ]
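    # Note: 'Confectionery' is also in adjust_500, which is checked first,
    # so its entry here never takes effect
    adjust_600 = [
        'Fast Food and Takeaway', 'Confectionery', 'Instant Food', 'Pastries', 'Ready Meals', 
        'Sandwiches and Wraps', 'Savoury Biscuits and Crackers', 'Sweet Biscuits and Cookies', 'Snacks'
    ]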
    adjust_350 = ['Fruits and Vegetables']

    # Apply predefined serving sizes
    if row['Category'] in predefined_serving_sizes:
        serving_size = predefined_serving_sizes[row['Category']]
    elif row['Category'] in adjust_500:
        return target_energy(row, 500)
    elif row['Category'] in adjust_600:
        return target_energy(row, 600)
    elif row['Category'] in adjust_350:
        return target_energy(row, 350)
    elif row['Category'] == 'Beverages' and 'juice' in row['Food Name'].lower():
        serving_size = 125
    else:
        return target_energy(row, 600)

    # Calculate scale factor for nutritional adjustments
    scale_factor = serving_size / 100

    # Adjust nutritional values for predefined serving sizes
    row['Protein \n(g)'] *= scale_factor
    row['Fat, total \n(g)'] *= scale_factor
    row['Available carbohydrate, with sugar alcohols \n(g)'] *= scale_factor
    row['Total dietary fibre \n(g)'] *= scale_factor
    row['Total sugars (g)'] *= scale_factor
    row['Added sugars (g)'] *= scale_factor
    row['Free sugars \n(g)'] *= scale_factor
    row['Starch \n(g)'] *= scale_factor
    row['Caffeine \n(mg)'] *= scale_factor
    row['Energy with dietary fibre, equated \n(kJ)'] *= scale_factor
    row['Energy with dietary fibre, equated \n(kCal)'] *= scale_factor
    row['Serving Size (g)'] = serving_size

    return row

      
# Apply the function to the DataFrame
final_df2 = final_df2.apply(adjust_serving_size, axis=1)
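
Because `adjust_serving_size` tests the category lists in a fixed order, a category that appears in more than one list is always handled by the first matching branch. A quick set intersection, sketched here on copies of the two lists from the cell above, shows whether any category is listed twice:

```python
# Copies of the lists defined inside adjust_serving_size
adjust_500 = [
    'Cakes and Cake Mixes', 'Bread', 'Confectionery', 'Dairy', 'Dairy Alternatives',
    'Egg Products', 'Fish and Seafood', 'Legumes and Pulses', 'Meat and Meat Products',
    'Pasta and Noodles', 'Poultry and Poultry Products', 'Quick Breakfasts',
    'Rice, Grains, Nuts and Seeds'
]
adjust_600 = [
    'Fast Food and Takeaway', 'Confectionery', 'Instant Food', 'Pastries', 'Ready Meals',
    'Sandwiches and Wraps', 'Savoury Biscuits and Crackers', 'Sweet Biscuits and Cookies',
    'Snacks'
]

# Categories present in both lists; the 500 kJ branch wins for these
overlap = set(adjust_500) & set(adjust_600)
print(overlap)  # {'Confectionery'}
```

With the lists as written, 'Confectionery' is scaled to 500 kJ per serve; whether 500 kJ or 600 kJ is the intended target is a decision for the analysis, not something this check can settle.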


## Applying Rounding to Specified Columns

In [32]:
# Round the 'Serving Size (g)' column to the nearest whole number
final_df2['Serving Size (g)'] = final_df2['Serving Size (g)'].round(0).astype(int)
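
`astype(int)` raises an error if the column still contains a missing value, which could happen if a food had no energy value and its computed serving size came out as `NaN`. Whether that occurs in this dataset is not shown here. As a hedge, the sketch below uses pandas' nullable `Int64` dtype on made-up values, which keeps missing entries as `<NA>` instead of failing:

```python
import pandas as pd

# Made-up serving sizes, including one missing value
sizes = pd.Series([33.33, 124.6, None])

# Round first, then cast to the nullable integer dtype
rounded = sizes.round(0).astype('Int64')
print(rounded.tolist())
```

Note that `round` here follows NumPy's round-half-to-even rule, so a value of exactly 12.5 becomes 12 rather than 13.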

In [33]:
# List of columns to be rounded
columns_to_round = [
    "Protein \n(g)", "Fat, total \n(g)", 
    "Available carbohydrate, with sugar alcohols \n(g)", 
    "Total dietary fibre \n(g)", "Total sugars (g)", 
    "Added sugars (g)", "Free sugars \n(g)", 
    "Starch \n(g)", "Caffeine \n(mg)", 
    "Energy with dietary fibre, equated \n(kJ)", 
    "Energy with dietary fibre, equated \n(kCal)"
]

# Applying rounding 
final_df2[columns_to_round] = final_df2[columns_to_round].round(2)
final_df[columns_to_round] = final_df[columns_to_round].round(2)


## Exporting Data to Excel
- We export the processed data to a single Excel file with two sheets:

- **DataFrames Exported**:
  - `final_df` is saved in the sheet named 'Nutritional Values per 100g'. This sheet contains nutritional values standardized per 100 grams of each food item.
  - `final_df2` is saved in the sheet named 'Nutritional Values per Serving'. This sheet includes nutritional values based on the calculated serving sizes as per the Australian Dietary Guidelines.

In [34]:
with pd.ExcelWriter('data//database//NutritionDataAU_022224.xlsx') as writer:
    final_df.to_excel(writer, sheet_name='Nutritional Values per 100g')
    final_df2.to_excel(writer, sheet_name='Nutritional Values per Serving')