<a href="https://colab.research.google.com/github/ethansong206/Climate-Plus-Project/blob/main/DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import the packages used throughout the notebook.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import re

Load the Duke Dining CSV file and make the data easier to handle. This includes, but is not limited to:

1) Renaming locations

2) Changing the date from a string to a date object
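
As a rough illustration of the two steps listed above, the sketch below applies them to single values rather than to the full table. The helper name `short_label`, the `LABELS` mapping, and the sample strings are assumptions made for this example; only the substring matching and the `%m/%d/%Y` date format come from the notebook.

```python
from datetime import datetime

# Substring -> short label; only two of the notebook's locations are shown here
LABELS = {'Marketplace': 'Marketplace', 'Marine Lab': 'DuML'}

def short_label(name):
    """Return the short label whose key appears in `name`, or None if nothing matches."""
    for key, label in LABELS.items():
        if key in name:
            return label
    return None

# Sample strings below are invented for illustration
label = short_label('Marketplace Special Event')
purchase_date = datetime.strptime('01/15/2019', '%m/%d/%Y').date()
```

Locations that match no key come back as `None`, which is also how the notebook's own renaming function treats unknown locations.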



In [2]:
DiningDataFull = pd.read_csv('Climate+ Data 2019 thru 2023.csv', 
                             dtype = {'Priority 1': str, 'Priority 2': str,
                                      'Priority 3': str})
#print(DiningDataFull['Unit Name'].value_counts()) # Show how many entries there are of each location

#below code to make 'Unit Name' column easier to handle
#can add more lines given more locations
def location_rename(location):
    if("Marketplace" in location): #Combine data for Marketplace Kitchen and Marketplace Special Event
        return "Marketplace"
    if("Marine Lab" in location):
        return "DuML"
    if("Trinity" in location):
        return "Trinity"
    if("Freeman" in location):
        return "Freeman"
    return None

#map the raw location names onto the short labels (unmatched locations become None)
DiningDataFull['Unit Name'] = DiningDataFull['Unit Name'].map(location_rename)

#parse the 'MM/DD/YYYY' strings into date objects
DiningDataFull['Purchase Date'] = pd.to_datetime(DiningDataFull['Purchase Date'], format = '%m/%d/%Y').dt.date

print("Rows in DiningDataFull: ", DiningDataFull.shape[0])

Rows in DiningDataFull:  90534


Extract the unit names for each item into a new column called `Unit`, then rename each unit to a simpler label (e.g. LB CS to LB). Convert cans and bottle cases into the equivalent value in OZ.

Extract the simplified item name from `Vendor Item Description` into a new column called `Item Name`.

---
Note on Exclusion: Some units are left out of this first calculation of emissions because they are either not directly food (e.g. gloves) or too difficult to go through individually to find an unambiguous measurement. The total number of entries left out is 12171. **The number of non-food items in this amount can be calculated later.**

---
Note on Conversions: The column `Vendor Item Purchase Unit` is in the format `x/y n`, which should be read as x bags, each containing y units of measure n (so the total is x times y, in unit n).
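
To make the `x/y n` reading concrete, here is a minimal sketch of a parser for that format. The helper `parse_purchase_unit`, its regular expression, and the sample strings are assumptions made for this example. The notebook itself uses different expressions and also handles ranges, which this sketch does not.

```python
import re

# Hypothetical helper (not part of the notebook): split 'x/y n' into its parts.
PATTERN = re.compile(r'^\s*(?:(\d+)\s*/\s*)?(\d+(?:\.\d+)?)\s+([A-Za-z ]+?)\s*$')

def parse_purchase_unit(text):
    """Return (packages, amount_per_package, unit), or None if the text does not fit."""
    match = PATTERN.match(text)
    if match is None:
        return None
    # A missing 'x/' prefix is treated as a single package
    packages = int(match.group(1)) if match.group(1) else 1
    return packages, float(match.group(2)), match.group(3)

# Invented sample value: 6 bags of 5 LB each
packages, amount, unit = parse_purchase_unit('6/5 LB')
total = packages * amount
```

Strings with no leading number, such as a bare `EA`, return `None`, which corresponds to the entries excluded from this first pass.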

---

In [3]:
#this code extracts just the unit information
DiningDataFull['Unit'] = DiningDataFull['Vendor Item Purchase Unit'].str.extract(r" ?([A-Za-zÀ-ÿ ]*)$").astype(str)
#below code to simplify redundant labels (i.e. LB CS to LB)
def unit_rename(unit):
    if ('LB' in unit) or ('lb' in unit) or ('Lb' in unit) or ('Pound' in unit):
        return 'LB'
    if ('OZ' in unit) or ('oz' in unit) or ('Oz' in unit) or (' Z' in unit):
        return 'OZ'
    if ('GA' in unit) or ('Gal' in unit):
        return 'GA'
    if 'QT' in unit:
        return 'QT'
    if ('PT' in unit) or ('Pint' in unit) or ('PINT' in unit):
        return 'PT'
    if 'LT' in unit:
        return 'LT'
    if ('BU' in unit) or ('Bushel' in unit): #bushels
        return 'BU'
    if 'KG' in unit:
        return 'KG'
    if 'GR' in unit: #grams
        return 'GR'
    if ('ML' in unit) or ('ml' in unit) or ('Ml' in unit): #milliliters
        return 'ML'
    if ('CN' in unit) or ('Can' in unit):
        return 'CN'
    if 'Bottle Case' in unit:
        return 'Bottle Case'
    return None
#Notes: Bottle Case is 64 oz each, find # of can and translate to oz, anything with EA is not included for now (~9000)
DiningDataFull['Unit'] = DiningDataFull['Unit'].map(unit_rename)

#convert cans to OZ
#using estimates for weight of cans through https://food.unl.edu/article/how-interpret-can-size-numbers
DiningDataFull = DiningDataFull.replace({'#10 CN' : '110.5 OZ'}, regex = True)
DiningDataFull = DiningDataFull.replace({'#10 Can' : '110.5 OZ'}, regex = True)
DiningDataFull = DiningDataFull.replace({'#300 CN' : '15 OZ'}, regex = True)
DiningDataFull = DiningDataFull.replace({'CN' : 'OZ'})
#convert bottle case to OZ
DiningDataFull = DiningDataFull.replace({' Bottle Case' : '/64 OZ'}, regex = True)
DiningDataFull = DiningDataFull.replace({'Bottle Case' : 'OZ'})

#WIP
#extract item names

Make a new column called `Total Amount` holding the total amount of food in its current unit of measurement, before converting to grams. Make a new column called `Total Amount(g)` that converts every unit to its equivalent value in grams, then extract the main descriptor word(s) in `Vendor Item Description` into a new column called `Food Name`.

---

Multiply the column of gram values by `Receive Quantity`, when provided, and store the result in a new column called `Total Grams`. If `Receive Quantity` has no value, assume it is 1.

---
Note on Conversions: Most of these conversions are estimated to the nearest tenth. **More accurate calculations can be found later.**
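
As a worked example of the conversion described above, the sketch below turns an amount in its original unit into grams. The names `GRAMS_PER_UNIT` and `to_grams` and the sample order are assumptions made for this example. The factors are the ones used in the notebook, with volume units treated as having the density of water.

```python
# Gram equivalents per unit, using the notebook's factors.
# Volume units (GA, QT, PT, LT) assume the density of water, which
# understates the mass of denser liquids.
GRAMS_PER_UNIT = {
    'LB': 453.6,
    'OZ': 28.35,
    'GA': 3785.4,
    'QT': 3785.4 / 4,
    'PT': 3785.4 / 8,
    'LT': 1000.0,
    'KG': 1000.0,
}

def to_grams(amount, unit, receive_quantity=None):
    """Convert `amount` of `unit` to grams; a missing quantity counts as 1."""
    quantity = 1 if receive_quantity is None else receive_quantity
    return amount * quantity * GRAMS_PER_UNIT[unit]

# Invented order: 2 cases, each 6 bags of 5 LB, so 60 LB or about 27,216 g
grams = to_grams(6 * 5, 'LB', receive_quantity=2)
```

Keeping the factors in one dictionary makes it straightforward to swap in more precise values later without touching the conversion logic.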

In [4]:
#make a new dataset for just entries with known units
DDReduced = DiningDataFull[DiningDataFull['Unit'].notna()]
#print(DDReduced.head())
print("Excluded rows: ", DiningDataFull.shape[0] - DDReduced.shape[0])

#find total amount of food before converting to grams
DDReduced = DDReduced.copy()
DDReduced['Total Amount'] = DDReduced['Vendor Item Purchase Unit'].str.extract(r'^[a-zA-Z]* ?-?/? ?([0-9]*/?[0-9.]*-?[0-9.]*)')
DDReduced['Range'] = DDReduced['Total Amount'].str.extract(r'([0-9.]*-[0-9.]*)')
DDReduced['Range'] = DDReduced['Range'].astype(str)
DDReduced['Range'] = DDReduced['Range'].replace({'nan' : '0'})
DDReduced['Range Average'] = DDReduced['Range'].replace({'-' : '+'}, regex = True)
DDReduced['Range Average'] = DDReduced.apply(lambda d: eval(d['Range Average']), axis = 1)
DDReduced['Range Average'] /= 2

#amounts without a slash describe a single package, so prefix them with '1/'
no_slash = ~(DDReduced['Total Amount'].str.contains('/')) & (DDReduced['Total Amount'].str.len() > 0)
DDReduced.loc[no_slash, 'Total Amount'] = (
    '1/' + DDReduced.loc[no_slash, 'Total Amount']
)

has_range = DDReduced['Total Amount'].str.contains('-')
DDReduced.loc[has_range, 'Total Amount'] = (
    DDReduced.loc[has_range, 'Total Amount'].str.split('/').str[0]
    + '/'
    + DDReduced.loc[has_range, 'Range Average'].astype(str)
)

DDReduced = DDReduced.drop(['Range', 'Range Average'], axis = 1)

DDReduced['Total Amount'] = DDReduced['Total Amount'].replace({'' : '0'})
DDReduced['Receive Quantity'] = DDReduced['Receive Quantity'].fillna(1) #a missing quantity is assumed to be 1
DDReduced['Total Amount'] = DDReduced['Total Amount'].replace({'/' : '*'}, regex = True)
DDReduced['Total Amount'] = DDReduced.apply(lambda d: eval(d['Total Amount']), axis = 1)
DDReduced['Total Amount'] = DDReduced['Total Amount'].astype(float) * DDReduced['Receive Quantity'].astype(float)

#convert units to grams
def convert_units(row):
    if row['Unit'] == 'LB':
        return row['Total Amount'] * 453.6
    if row['Unit'] == 'OZ':
        return row['Total Amount'] * 28.35
    if row['Unit'] == 'GA':
        return row['Total Amount'] * 3785.4 #assuming density of water, most drinks are MORE dense so number is underestimate
    if row['Unit'] == 'QT':
        return row['Total Amount'] * 3785.4 / 4
    if row['Unit'] == 'PT':
        return row['Total Amount'] * 3785.4 / 8
    if (row['Unit'] == 'LT') | (row['Unit'] == 'KG'):
        return row['Total Amount'] * 1000 #also assuming density of water, most drinks MORE dense
    if row['Unit'] == 'BU':
        return row['Total Amount'] * 32.5 * 453.6 #average of 40lbs per bu apples, squash, etc. and 25lbs per bu peppers, etc.
    if (row['Unit'] == 'ML') | (row['Unit'] == 'GR'): #unit_rename labels grams as 'GR'
        return row['Total Amount'] * 1 #grams stay as-is; 1 ml weighs about 1 g assuming density of water
    return 0
DDReduced['Total Amount(g)'] = DDReduced.apply(convert_units, axis = 1)

#print(DDReduced.head())

Excluded rows:  12126


---
Load the CSV of CO2 values and look up the figures for carbon emissions.

In [5]:
CO2 = pd.read_csv('CO2 values for FACCWTHA v1.1 - Foods.csv')
print(CO2.head())

                                            FoodName FoodNumber  \
0  A unique name. Based on FoodName from the NDNS...  From NDNS   
1                                             Totals        NaN   
2               BEEF RUMP STEAK GRILLED LEAN AND FAT        952   
3       BEEF RUMP STEAK GRILLED LEAN AND FAT, global     300001   
4   BEEF RUMP STEAK GRILLED LEAN AND FAT, UK20220104     300002   

                                     FoodDisplayName  CO2eSLBBook  \
0  Graph/table -friendly food names. Used in matl...          NaN   
1                                                NaN          NaN   
2                                              Steak       4620.0   
3                                        Global beef       9950.0   
4                                            UK beef       4000.0   

                                      SLBOrder  CO2eRatioSLBBook  \
0  This is the order I like to see the rows in               NaN   
1                                          NaN

Make one dataset for each location and one for each year.
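
One possible alternative to a separate variable per subset is to build the splits with `groupby` and keep them in dictionaries. The sketch below uses a small made-up table; the names `toy`, `by_location`, and `by_year` and all of the values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    'Unit Name': ['Marketplace', 'Trinity', 'Marketplace', 'DuML'],
    'Purchase Date': pd.to_datetime(['2019-03-01', '2019-06-15', '2020-01-10', '2023-09-30']),
    'Total Amount(g)': [1000.0, 250.0, 500.0, 750.0],
})

# One DataFrame per location and one per calendar year, keyed in dictionaries
by_location = {name: group for name, group in toy.groupby('Unit Name')}
by_year = {year: group for year, group in toy.groupby(toy['Purchase Date'].dt.year)}
```

With this layout, `by_location['Marketplace']` and `by_year[2019]` play the role of the per-location and per-year variables, and new locations or years are picked up without adding lines of code.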

In [6]:
location_col = DDReduced['Unit Name']
DiningData_Marketplace = DDReduced[location_col == "Marketplace"]
DiningData_DuML = DDReduced[location_col == "DuML"]
DiningData_Trinity = DDReduced[location_col == "Trinity"]
DiningData_Freeman = DDReduced[location_col == "Freeman"]

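#convert to Timestamps so the dates can be compared against the pd.Timestamp bounds below
date_col = pd.to_datetime(DDReduced['Purchase Date'])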
DiningData_2019 = DDReduced[(date_col >= pd.Timestamp(2019, 1, 1)) & (date_col < pd.Timestamp(2020, 1, 1))]
DiningData_2020 = DDReduced[(date_col >= pd.Timestamp(2020, 1, 1)) & (date_col < pd.Timestamp(2021, 1, 1))]
DiningData_2021 = DDReduced[(date_col >= pd.Timestamp(2021, 1, 1)) & (date_col < pd.Timestamp(2022, 1, 1))]
DiningData_2022 = DDReduced[(date_col >= pd.Timestamp(2022, 1, 1)) & (date_col < pd.Timestamp(2023, 1, 1))]
DiningData_2023 = DDReduced[(date_col >= pd.Timestamp(2023, 1, 1)) & (date_col < pd.Timestamp(2024, 1, 1))]

