<a href="https://colab.research.google.com/github/ethansong206/Climate-Plus-Project/blob/main/DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import the packages used in this notebook.

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime

Load the Duke Dining CSV file and make the data easier to handle. This includes, but is not limited to:

1) Renaming locations

2) Changing the date from a string to a date object



In [71]:
DiningDataFull = pd.read_csv('Climate+ Data 2019 thru 2023.csv', 
                             dtype = {'Priority 1': str, 'Priority 2': str,
                                      'Priority 3': str})
#print(DiningDataFull['Unit Name'].value_counts()) # Show how many entries there are of each location

# Simplify the 'Unit Name' column by collapsing each location's variants into one label.
# Add more branches here if new locations appear in the data.
def location_rename(location):
    if "Marketplace" in location:  # combines Marketplace Kitchen and Marketplace Special Event
        return "Marketplace"
    if "Marine Lab" in location:
        return "DuML"
    if "Trinity" in location:
        return "Trinity"
    if "Freeman" in location:
        return "Freeman"
    return None  # unrecognized locations become missing values

DiningDataFull['Unit Name'] = DiningDataFull['Unit Name'].map(location_rename)

# Parse 'Purchase Date' strings (MM/DD/YYYY) into datetime.date objects.
DiningDataFull['Purchase Date'] = pd.to_datetime(DiningDataFull['Purchase Date'], format='%m/%d/%Y').dt.date

print("Rows in DiningDataFull: ", DiningDataFull.shape[0])

Rows in DiningDataFull:  90534


# **\*\*\*WIP\*\*\***

Extract the unit for each item into a new column called `Unit`, then rename each unit to a simpler label (e.g., LB CS to LB). Convert some of the cans into the equivalent value in OZ. Make a new column called `Unit(g)` that converts all units to the equivalent value in grams, then extract the main descriptor word(s) from `Vendor Item Description` into a new column called `Food Name`.

---
Note on Exclusion: Some units are left out of this first calculation of emissions because they are either not directly food (e.g., gloves) or too ambiguous to assign a measurement to without going through the items individually. The total number of entries left out is 12171. **The number of non-food items in this amount can be calculated later.**
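
A minimal sketch of how that later count might start: tally the raw purchase-unit strings among the excluded rows, so that non-food units can be spotted by eye. The toy rows below are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Toy data (hypothetical values); 'Unit' is missing wherever no known unit was recognized.
toy = pd.DataFrame({
    "Vendor Item Purchase Unit": ["6/5 LB", "1 EA", "10/100 EA", "1 EA", "4/1 GA"],
    "Unit": ["LB", None, None, None, "GA"],
})

# Keep only the excluded rows, then count how often each raw unit string appears.
excluded = toy[toy["Unit"].isna()]
counts = excluded["Vendor Item Purchase Unit"].value_counts()
print(counts)
```

On the real data, the same two lines applied to `DiningDataFull` would list the most common excluded units first, which may help decide which ones to handle manually.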

---
Note on Conversions: The column `Vendor Item Purchase Unit` is in the format `x/y n`, which should be read as x bags, each holding y of unit n (for example, `6/5 LB` would mean 6 bags of 5 lb each).
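
As a rough illustration of reading that format, the sketch below splits a purchase-unit string into its three parts. The helper name `parse_purchase_unit` and the regular expression are assumptions made for this example; real entries may not all follow the pattern, and anything that does not match is returned as `None`.

```python
import re

# Hypothetical pattern for strings shaped like "x/y n", e.g. "6/5 LB".
# It is an assumption about the data, not a complete grammar for the column.
PURCHASE_UNIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s+(.+?)\s*$")

def parse_purchase_unit(text):
    """Return (bags, amount_per_bag, unit), or None if the text does not match."""
    match = PURCHASE_UNIT.match(text)
    if match is None:
        return None
    bags, amount, unit = match.groups()
    return float(bags), float(amount), unit

print(parse_purchase_unit("6/5 LB"))  # (6.0, 5.0, 'LB')
print(parse_purchase_unit("1 EA"))    # None
```

Multiplying the first two numbers gives the amount per purchased item in the stated unit (30 lb for `6/5 LB`).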

---

Multiply the column with units in grams by `Receive Quantity`, if provided, and store the result in a new column called `Total Grams`. If `Receive Quantity` has no value, assume it is 1.
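
The calculation described above might look like the following sketch on toy data. The `Amount` column is hypothetical (it stands for the numeric part of the purchase unit), and the conversion table covers only mass units, with OZ assumed to be an ounce by weight. Volume units such as GA or QT would need a per-food density and are not handled here.

```python
import pandas as pd

# Assumed gram equivalents for mass units only.
GRAMS_PER_UNIT = {"LB": 453.592, "OZ": 28.3495, "KG": 1000.0, "GR": 1.0}

# Toy rows with made-up values; 'Amount' is a hypothetical helper column.
toy = pd.DataFrame({
    "Unit": ["LB", "OZ", "KG"],
    "Amount": [30.0, 192.0, 2.0],
    "Receive Quantity": [2.0, None, 1.0],
})

# Grams per purchased item.
toy["Unit(g)"] = toy["Amount"] * toy["Unit"].map(GRAMS_PER_UNIT)
# A missing Receive Quantity is treated as 1.
toy["Total Grams"] = toy["Unit(g)"] * toy["Receive Quantity"].fillna(1)
print(toy)
```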


In [72]:
# Extract just the unit text: the trailing run of letters and spaces in the purchase unit.
DiningDataFull['Unit'] = DiningDataFull['Vendor Item Purchase Unit'].str.extract(r" ([A-Za-zÀ-ÿ ]*)$", expand=False).astype(str)

# Simplify redundant labels (e.g., 'LB CS' becomes 'LB').
# The checks run in order, so an earlier rule wins when several substrings match.
def unit_rename(unit):
    if 'LB' in unit or 'lb' in unit or 'Lb' in unit or 'Pound' in unit:
        return 'LB'
    # Note: the bare 'Z' check matches any unit containing an uppercase Z.
    if 'OZ' in unit or 'oz' in unit or 'Oz' in unit or 'Z' in unit:
        return 'OZ'
    if 'GA' in unit or 'Gal' in unit:
        return 'GA'
    if 'QT' in unit:
        return 'QT'
    if 'PT' in unit or 'Pint' in unit or 'PINT' in unit:
        return 'PT'
    if 'LT' in unit:
        return 'LT'
    if 'BU' in unit or 'Bushel' in unit:  # bushels
        return 'BU'
    if 'KG' in unit:
        return 'KG'
    if 'GR' in unit:  # grams
        return 'GR'
    if 'ML' in unit or 'ml' in unit or 'Ml' in unit:  # milliliters
        return 'ML'
    if 'CN' in unit or 'Can' in unit:
        return 'CN'
    if 'Bottle Case' in unit:
        return 'Bottle Case'
    return None  # unknown units are excluded below

# Notes: Bottle Case is 64 oz each; find the number of cans and translate to oz;
# anything with EA is not included for now (~9000).
DiningDataFull['Unit'] = DiningDataFull['Unit'].map(unit_rename)

#make a new dataset for just entries with known units
DiningDataReduced = DiningDataFull[DiningDataFull['Unit'].notna()]
#print(DiningDataReduced.head())
print("Excluded rows: ", DiningDataFull.shape[0] - DiningDataReduced.shape[0])

#CONVERT UNITS
#first convert cans


#use this to find the totals of each unit, and to find units that need to be manually defined
#print(DiningDataFull['Unit'].value_counts())

Excluded rows:  12171


Make one dataset for each location and one for each year.
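
As an alternative to one variable per subset, the same split could be sketched with `groupby`, which collects the subsets into dictionaries keyed by location or year. The toy rows are made up, and the sketch assumes `Purchase Date` is stored as a pandas datetime column.

```python
import pandas as pd

# Toy data with hypothetical values.
toy = pd.DataFrame({
    "Unit Name": ["Marketplace", "DuML", "Marketplace", "Trinity"],
    "Purchase Date": pd.to_datetime(["2019-03-01", "2020-07-15", "2021-01-10", "2019-11-30"]),
})

# One sub-DataFrame per location, keyed by the location label.
by_location = {name: group for name, group in toy.groupby("Unit Name")}
# One sub-DataFrame per calendar year, keyed by the year.
by_year = {year: group for year, group in toy.groupby(toy["Purchase Date"].dt.year)}

print(sorted(by_location))
print(sorted(by_year))
```

With this layout, adding a new location or year to the data needs no new code; a subset is looked up as `by_location["Marketplace"]` or `by_year[2019]`.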

In [4]:
location_col = DiningDataFull['Unit Name']
DiningData_Marketplace = DiningDataFull[location_col == "Marketplace"]
DiningData_DuML = DiningDataFull[location_col == "DuML"]
DiningData_Trinity = DiningDataFull[location_col == "Trinity"]
DiningData_Freeman = DiningDataFull[location_col == "Freeman"]

# 'Purchase Date' holds datetime.date objects. Comparing those with pd.Timestamp is
# deprecated (and an error in newer pandas), so convert to datetime64 and select by year.
date_col = pd.to_datetime(DiningDataFull['Purchase Date'])
DiningData_2019 = DiningDataFull[date_col.dt.year == 2019]
DiningData_2020 = DiningDataFull[date_col.dt.year == 2020]
DiningData_2021 = DiningDataFull[date_col.dt.year == 2021]
DiningData_2022 = DiningDataFull[date_col.dt.year == 2022]
DiningData_2023 = DiningDataFull[date_col.dt.year == 2023]

