<a href="https://colab.research.google.com/github/brendanpshea/data_clean_nypl/blob/main/New_York_Public_Library_Menus_Clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# New York Public Library-Menus - Data Cleaning



Brendan's Notes:

The original data files are available here:

https://uofi.app.box.com/s/zh2hxfkq0cc6vyftw91nqa4smdpq7ybk

I downloaded all the files as a single zip archive and put it on Dropbox (see below). From there, I ran the standard Pandas commands to get an overview of the data.


In [88]:
import os

url = "https://www.dropbox.com/scl/fi/l8b5np5xoes57nr1hqhae/NYPL-menus.zip?rlkey=pak0ox3wae0x0yd09d23eoqma&st=mfq9dka9&dl=1"
output_file = "NYPL-menus.zip"

# Check if the file already exists
if not os.path.exists(output_file):
    !wget -q "$url" -O "$output_file"
    !unzip "NYPL-menus.zip"
else:
    print(f"{output_file} already exists. No download needed.")

!ls

NYPL-menus.zip already exists. No download needed.
dirty_menus.db	menu_clean.csv	    menupage_clean.csv	nypl_menus.db	sample_data
dish_clean.csv	menuitem_clean.csv  NYPL-menus		NYPL-menus.zip


## Entity Relationship Diagram (ERD)
The diagram below shows the four tables in the data and how they relate to each other.

In [89]:
import base64
from IPython.display import Image, display, HTML

def mm(graph):
    graphbytes = graph.encode("utf8")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))


mm("""

erDiagram
    DISH ||--o{ MENUITEM : "is included in"
    MENU ||--o{ MENUPAGE : "contains"
    MENUPAGE ||--o{ MENUITEM : "includes"

    DISH {
        int id PK
        string name
        float description
        int menus_appeared
        int times_appeared
        int first_appeared
        int last_appeared
        float lowest_price
        float highest_price
    }

    MENU {
        int id PK
        string name
        string sponsor
        string event
        string venue
        string place
        string physical_description
        string occasion
        string notes
        string call_number
        float keywords
        float language
        string date
        string location
        float location_type
        string currency
        string currency_symbol
        string status
        int page_count
        int dish_count
    }

    MENUITEM {
        int id PK
        int menu_page_id FK
        float price
        float high_price
        float dish_id FK
        string created_at
        string updated_at
        float xpos
        float ypos
    }

    MENUPAGE {
        int id PK
        int menu_id FK
        float page_number
        string image_id
        float full_height
        float full_width
        string uuid
    }
""")

### Load Data Using Pandas


In [90]:
import pandas as pd
import numpy as np

dish_df = pd.read_csv('NYPL-menus/Dish.csv')
menu_df = pd.read_csv('NYPL-menus/Menu.csv')
menuitem_df = pd.read_csv('NYPL-menus/MenuItem.csv')
menupage_df = pd.read_csv('NYPL-menus/MenuPage.csv')


## Main Use Case and Dirty Data Queries

We would like to see how the **popularity and price of different dishes have changed over the years**. I'll use "spaghetti" as an example here.

Let's run some queries for this using the (not yet cleaned!) data.

In [91]:
import sqlite3
import pandas as pd

# Create a connection to the SQLite database
conn = sqlite3.connect('dirty_menus.db')

dish_df.to_sql('dishes', conn, if_exists='replace', index=False)
print("Dishes data loaded successfully.")

menu_df.to_sql('menus', conn, if_exists='replace', index=False)
print("Menus data loaded successfully.")

menuitem_df.to_sql('menuitems', conn, if_exists='replace', index=False)
print("Menu items data loaded successfully.")

menupage_df.to_sql('menupages', conn, if_exists='replace', index=False)
print("Menu pages data loaded successfully.")

# Commit the changes and close the connection
conn.commit()
conn.close()

print("Database created and data loaded successfully.")

# Reopen the connection to verify the data
conn = sqlite3.connect('dirty_menus.db')
cursor = conn.cursor()

# Check the number of rows in each table
tables = ['dishes', 'menus', 'menuitems', 'menupages']
for table in tables:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"Number of rows in {table} table: {count}")

# print table schema for all tables
for table in tables:
    cursor.execute(f"PRAGMA table_info({table})")
    schema = cursor.fetchall()
    print(f"\n\nTable Schema for {table}:")
    for column in schema:
        print(column)
conn.close()


Dishes data loaded successfully.
Menus data loaded successfully.
Menu items data loaded successfully.
Menu pages data loaded successfully.
Database created and data loaded successfully.
Number of rows in dishes table: 423397
Number of rows in menus table: 17545
Number of rows in menuitems table: 1332726
Number of rows in menupages table: 66937


Table Schema for dishes:
(0, 'id', 'INTEGER', 0, None, 0)
(1, 'name', 'TEXT', 0, None, 0)
(2, 'description', 'REAL', 0, None, 0)
(3, 'menus_appeared', 'INTEGER', 0, None, 0)
(4, 'times_appeared', 'INTEGER', 0, None, 0)
(5, 'first_appeared', 'INTEGER', 0, None, 0)
(6, 'last_appeared', 'INTEGER', 0, None, 0)
(7, 'lowest_price', 'REAL', 0, None, 0)
(8, 'highest_price', 'REAL', 0, None, 0)


Table Schema for menus:
(0, 'id', 'INTEGER', 0, None, 0)
(1, 'name', 'TEXT', 0, None, 0)
(2, 'sponsor', 'TEXT', 0, None, 0)
(3, 'event', 'TEXT', 0, None, 0)
(4, 'venue', 'TEXT', 0, None, 0)
(5, 'place', 'TEXT', 0, None, 0)
(6, 'physical_description', 'TEXT', 0,

In [92]:
%reload_ext sql
%sql sqlite:///dirty_menus.db

In [93]:
%%sql
SELECT COUNT(*) AS total_spaghetti_dishes
FROM dishes
WHERE name LIKE "%spaghetti%"

 * sqlite:///dirty_menus.db
   sqlite:///nypl_menus.db
Done.


total_spaghetti_dishes
1832


In [94]:
%%sql
SELECT name, COUNT(*) AS dish_count
FROM dishes
JOIN menuitems ON dishes.id = menuitems.dish_id
WHERE name LIKE "%spaghetti%"
GROUP BY menuitems.dish_id
ORDER BY dish_count DESC
LIMIT 15

 * sqlite:///dirty_menus.db
   sqlite:///nypl_menus.db
Done.


name,dish_count
Spaghetti,299
Spaghetti au Gratin,231
Spaghetti Italienne,180
Special Spaghetti with Fresh Mushrooms,119
Spaghetti a l'Italienne,113
"Spaghetti, Italienne",108
Spaghetti au gratin,90
Spaghetti Au Gratin,88
Spaghetti Milanaise,84
Spaghetti Bolognaise,55
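
The result above counts "Spaghetti au Gratin", "Spaghetti au gratin", and "Spaghetti Au Gratin" as three different dishes. Here is a minimal sketch of how grouping on a normalized name would merge such variants. It uses a toy in-memory database with invented rows (not the NYPL data), and `norm_name` is just an illustrative alias.

```python
import sqlite3

# Toy stand-in for the dishes/menuitems tables; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dishes (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE menuitems (id INTEGER PRIMARY KEY, dish_id INTEGER)")
conn.executemany("INSERT INTO dishes VALUES (?, ?)", [
    (1, "Spaghetti au Gratin"),
    (2, "Spaghetti au gratin"),
    (3, "Spaghetti Au Gratin"),
    (4, "Spaghetti Milanaise"),
])
conn.executemany("INSERT INTO menuitems (dish_id) VALUES (?)",
                 [(1,), (1,), (2,), (3,), (4,)])

# Group on the lowercased, trimmed name instead of on dish_id.
merged = conn.execute("""
    SELECT LOWER(TRIM(name)) AS norm_name, COUNT(*) AS dish_count
    FROM dishes
    JOIN menuitems ON dishes.id = menuitems.dish_id
    GROUP BY norm_name
    ORDER BY dish_count DESC
""").fetchall()
conn.close()

print(merged)
```

Lowercasing only handles casing. Variants such as "Spaghetti Italienne" versus "Spaghetti a l'Italienne" would still need the text cleaning done later in this notebook, or manual mapping.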


In [95]:
%%sql
-- Get high-priced spaghetti dishes
SELECT name, AVG(price) AS avg_price, COUNT(*) AS dish_count
FROM dishes
JOIN menuitems ON dishes.id = menuitems.dish_id
WHERE name LIKE "%spaghetti%"
GROUP BY menuitems.dish_id
ORDER BY avg_price DESC
LIMIT 15

 * sqlite:///dirty_menus.db
   sqlite:///nypl_menus.db
Done.


name,avg_price,dish_count
"Spaghettini alla ""Bassanese""",6000.0,1
Spaghettini al Pomodoro Fresco,4000.0,1
Spaghetti all'amatriciana,1000.0,1
Spaghetti alla tonnata,1000.0,1
Spaghetti alle vongole bianche,650.0,1
Les Spaghetti a l'Italienne,550.0,1
Spaghettie a la Bolognaise,550.0,1
Spaghettie with Meat Sauce,550.0,1
"Spaghetti ""Maitre d'Hotel""",500.0,1
Spaghetti all amatriciana,450.0,1
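
Averages of 4000 or 6000 are hard to read as dollar prices. One possible explanation (an assumption; the output above does not confirm it) is that these menus were priced in another currency. The `menus` table has a `currency` column, and according to the ERD a menu item reaches its menu through `menupages`. The sketch below follows that join path on a toy database. The rows and the currency labels `'Dollars'` and `'Lire'` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menus (id INTEGER PRIMARY KEY, currency TEXT);
    CREATE TABLE menupages (id INTEGER PRIMARY KEY, menu_id INTEGER);
    CREATE TABLE dishes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE menuitems (id INTEGER PRIMARY KEY, menu_page_id INTEGER,
                            dish_id INTEGER, price REAL);
    -- Invented rows: one menu per currency
    INSERT INTO menus VALUES (1, 'Dollars'), (2, 'Lire');
    INSERT INTO menupages VALUES (10, 1), (20, 2);
    INSERT INTO dishes VALUES (100, 'Spaghetti'), (200, 'Spaghettini al Pomodoro');
    INSERT INTO menuitems VALUES (1, 10, 100, 0.40), (2, 10, 100, 0.60),
                                 (3, 20, 200, 4000.0);
""")

# menuitems -> menupages -> menus, so each price is reported with its currency
rows = conn.execute("""
    SELECT menus.currency, dishes.name, AVG(menuitems.price) AS avg_price
    FROM menuitems
    JOIN dishes    ON dishes.id = menuitems.dish_id
    JOIN menupages ON menupages.id = menuitems.menu_page_id
    JOIN menus     ON menus.id = menupages.menu_id
    WHERE dishes.name LIKE '%spaghetti%'
    GROUP BY menus.currency, dishes.name
    ORDER BY avg_price DESC
""").fetchall()
conn.close()

for currency, name, avg_price in rows:
    print(currency, name, avg_price)
```

Grouping by currency keeps prices in different units from being averaged together or ranked against each other.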


In [96]:
%%sql
-- Get earliest years for common spaghetti dishes
SELECT name, MIN(first_appeared) AS earliest_year
FROM dishes
JOIN menuitems ON dishes.id = menuitems.dish_id
WHERE name LIKE "%spaghetti%"
GROUP BY menuitems.dish_id
ORDER BY earliest_year ASC
LIMIT 10

 * sqlite:///dirty_menus.db
   sqlite:///nypl_menus.db
Done.


name,earliest_year
Spaghetti a la Bontout,0
Spaghetti alla Checca,0
Spaghettis Napolitaine,0
Spaghetti tomate et basilic,0
Spaghetti alla Certosina,0
Spaghetti with seafood sauce,0
Spaghetti con pancetta,0
Spaghetti a la (panse),0
Spaghetti with bacon,0
Spaghetti alle vongole bianche,0
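
A year of 0 cannot be a real first appearance, so these rows crowd out the genuinely early dishes. A minimal sketch, on a toy table with invented rows, of filtering to a plausible year range inside the query itself. The 1850 to 2015 window is the same one the cleaning step later in this notebook uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dishes (id INTEGER PRIMARY KEY, name TEXT, first_appeared INTEGER)"
)
conn.executemany("INSERT INTO dishes VALUES (?, ?, ?)", [
    (1, "Spaghetti", 1893),
    (2, "Spaghetti alla Checca", 0),     # placeholder year, not a real date
    (3, "Spaghetti with bacon", 1941),
])

# Only rows whose year falls inside the plausible window are ranked.
rows = conn.execute("""
    SELECT name, first_appeared AS earliest_year
    FROM dishes
    WHERE name LIKE '%spaghetti%'
      AND first_appeared BETWEEN 1850 AND 2015
    ORDER BY earliest_year ASC
""").fetchall()
conn.close()

print(rows)
```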


In [97]:
%%sql
-- get latest years for common spaghetti dishes
SELECT name, MAX(last_appeared) AS latest_year
FROM dishes
JOIN menuitems ON dishes.id = menuitems.dish_id
WHERE name LIKE "%spaghetti%"
GROUP BY menuitems.dish_id
ORDER BY latest_year DESC
LIMIT 10

 * sqlite:///dirty_menus.db
   sqlite:///nypl_menus.db
Done.


name,latest_year
Special Spaghetti with Fresh Mushrooms,2928
Spaghetti,2928
Special home made spaghetti with veal ragu',2012
SPAGHETTI ALLA CHITARRA CON RAGU' DI CARNE DI VITELLO,2012
Spaghetti with sun-dried tomatoes,2012
SPAGHETTI CON POMODORI ESSICCATI AL SOLE,2012
Spaghetti Carbonara with pancetta and Parmesan,2006
Spaghetti Bolognese,2002
"MINI VEGI-LOAF Tofu, Chestnut and Cilantro Croquettes, Served over Spaghetti in a Tomato Sauce (Taro Spring Roll, Pickled Cabbage)",1999
SPAGHETTI MEAT BALLS,1999
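
The queries above rely on the precomputed `first_appeared` and `last_appeared` columns. For the stated use case (popularity and price over time), another option is to date each menu item through its menu: `menuitems` to `menupages` to `menus`, as in the ERD. The sketch below does this by decade on a toy database with invented rows. It assumes `menus.date` is stored as `YYYY-MM-DD` text and ignores currency differences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menus (id INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE menupages (id INTEGER PRIMARY KEY, menu_id INTEGER);
    CREATE TABLE dishes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE menuitems (id INTEGER PRIMARY KEY, menu_page_id INTEGER,
                            dish_id INTEGER, price REAL);
    -- Invented rows: three menus, one dish
    INSERT INTO menus VALUES (1, '1900-04-15'), (2, '1907-11-02'), (3, '1933-06-01');
    INSERT INTO menupages VALUES (10, 1), (20, 2), (30, 3);
    INSERT INTO dishes VALUES (100, 'Spaghetti');
    INSERT INTO menuitems VALUES (1, 10, 100, 0.25), (2, 20, 100, 0.75),
                                 (3, 30, 100, 0.50);
""")

# Integer division by 10 truncates the year to its decade (1907 -> 1900).
trend = conn.execute("""
    SELECT (CAST(substr(menus.date, 1, 4) AS INTEGER) / 10) * 10 AS decade,
           COUNT(*) AS times_listed,
           AVG(menuitems.price) AS avg_price
    FROM menuitems
    JOIN dishes    ON dishes.id = menuitems.dish_id
    JOIN menupages ON menupages.id = menuitems.menu_page_id
    JOIN menus     ON menus.id = menupages.menu_id
    WHERE dishes.name LIKE '%spaghetti%'
      AND menus.date IS NOT NULL
    GROUP BY decade
    ORDER BY decade
""").fetchall()
conn.close()

for decade, times_listed, avg_price in trend:
    print(decade, times_listed, avg_price)
```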


## Data Cleaning
Now, let's clean the data. For each table, we will:

1. Get a profile of what it looks like before cleaning.
2. Clean it (to a new dataframe/csv).
3. Produce a profile of what it looks like after cleaning.

### Clean `dish_df`

In [98]:
dish_df.head()

Unnamed: 0,id,name,description,menus_appeared,times_appeared,first_appeared,last_appeared,lowest_price,highest_price
0,1,Consomme printaniere royal,,8,8,1897,1927,0.2,0.4
1,2,Chicken gumbo,,111,117,1895,1960,0.1,0.8
2,3,Tomato aux croutons,,13,13,1893,1917,0.25,0.4
3,4,Onion au gratin,,41,41,1900,1971,0.25,1.0
4,5,St. Emilion,,66,68,1881,1981,0.0,18.0


In [99]:
dish_df.shape

(423397, 9)

In [100]:
dish_df.describe().round(2)

Unnamed: 0,id,description,menus_appeared,times_appeared,first_appeared,last_appeared,lowest_price,highest_price
count,423397.0,0.0,423397.0,423397.0,423397.0,423397.0,394297.0,394297.0
mean,264456.59,,3.06,3.15,1675.51,1679.3,0.97,1.6
std,150489.07,,27.82,29.96,651.32,651.93,6.71,12.7
min,1.0,,0.0,-6.0,0.0,0.0,0.0,0.0
25%,132374.0,,1.0,1.0,1900.0,1900.0,0.0,0.0
50%,269636.0,,1.0,1.0,1914.0,1917.0,0.0,0.0
75%,397135.0,,1.0,1.0,1949.0,1955.0,0.4,0.6
max,515677.0,,7740.0,8484.0,2928.0,2928.0,1035.0,3050.0


In [101]:
dish_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423397 entries, 0 to 423396
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              423397 non-null  int64  
 1   name            423397 non-null  object 
 2   description     0 non-null       float64
 3   menus_appeared  423397 non-null  int64  
 4   times_appeared  423397 non-null  int64  
 5   first_appeared  423397 non-null  int64  
 6   last_appeared   423397 non-null  int64  
 7   lowest_price    394297 non-null  float64
 8   highest_price   394297 non-null  float64
dtypes: float64(3), int64(5), object(1)
memory usage: 29.1+ MB


In [102]:
dish_df.isnull().sum()

id                     0
name                   0
description       423397
menus_appeared         0
times_appeared         0
first_appeared         0
last_appeared          0
lowest_price       29100
highest_price      29100
dtype: int64

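Main problems in `dish_df` include:

- The 'description' column is entirely null (423,397 null values).
- 'lowest_price' and 'highest_price' columns have 29,100 null values each.
- 'times_appeared' contains negative values (the minimum is -6).
- 'first_appeared' and 'last_appeared' contain impossible years (a minimum of 0 and a maximum of 2928).
- Inconsistent casing, whitespace, and punctuation in 'name' (for example, "Spaghetti au Gratin", "Spaghetti au gratin", and "Spaghetti Au Gratin" were counted as separate dishes in the queries above).
- There are some extremely high prices (up to 3,050 in 'highest_price').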

Let's fix these using Pandas. First, let's introduce a function to deal with text issues (which come up repeatedly):

In [103]:
import re
def clean_text(text):
    if pd.isna(text) or not isinstance(text, str):
        return text

    # Convert to lowercase
    text = text.lower()


    # Replace contractions
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)

    # Remove special characters, keeping only letters, numbers, and basic punctuation
    # (this also strips apostrophes and square brackets)
    text = re.sub(r'[^a-zA-Z0-9\s.,!?()-]', '', text)

    # Collapse repeated periods before the spacing step, which would otherwise split them apart
    text = re.sub(r'\.{2,}', '.', text)

    # Standardize spacing around punctuation
    text = re.sub(r'\s*([.,!?()])\s*', r'\1 ', text)
    text = re.sub(r'\s+', ' ', text)

    # Remove spaces just inside parentheses
    text = re.sub(r'\(\s+', '(', text)
    text = re.sub(r'\s+\)', ')', text)

    # Remove leading/trailing whitespace
    text = text.strip()

    # Capitalize first letter of each word (title case)
    text = text.title()

    return text
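
One caveat: the contraction rules were written for running prose, and dish names use apostrophes differently. The short sketch below isolates two steps copied from `clean_text` (the `'s` rule and the special-character strip) to show what they do to a possessive and to a French elision. The helper name `expand_and_strip` and the first sample name are made up for this illustration.

```python
import re

def expand_and_strip(text):
    """Apply two steps from clean_text: the 's expansion and the special-character strip."""
    text = text.lower()
    text = re.sub(r"\'s", " is", text)                   # contraction rule from clean_text
    text = re.sub(r'[^a-zA-Z0-9\s.,!?()-]', '', text)    # special-character strip from clean_text
    return text

possessive = expand_and_strip("Chef's Salad")
elision = expand_and_strip("Spaghetti a l'Italienne")

print(possessive)   # chef is salad
print(elision)      # spaghetti a litalienne
```

If possessives and elisions matter for matching dish names, one option would be to skip the contraction rules for the 'name' column and replace apostrophes with a space instead of deleting them. That is a design choice, not something the data dictates.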

Now, we can apply this to clean our data.

In [104]:
# Create a copy of the original dataframe
dish_clean_df = dish_df.copy()

# 1. Set negative times_appeared to 0
dish_clean_df['times_appeared'] = dish_clean_df['times_appeared'].clip(lower=0)

# 2. Clean first_appeared and last_appeared
dish_clean_df.loc[(dish_clean_df['first_appeared'] < 1850) | (dish_clean_df['first_appeared'] > 2015), 'first_appeared'] = np.nan
dish_clean_df.loc[(dish_clean_df['last_appeared'] < 1850) | (dish_clean_df['last_appeared'] > 2015), 'last_appeared'] = np.nan

# 3. Replace 0 with null in lowest_price and highest_price
dish_clean_df['lowest_price'] = dish_clean_df['lowest_price'].replace(0, np.nan)
dish_clean_df['highest_price'] = dish_clean_df['highest_price'].replace(0, np.nan)

# 4. Handle extreme high prices
for col in ['lowest_price', 'highest_price']:
    mean = dish_clean_df[col].mean()
    std = dish_clean_df[col].std()
    dish_clean_df.loc[dish_clean_df[col] > mean + 3*std, col] = np.nan

# Clean text in 'name' column using the clean_text function
dish_clean_df['name'] = dish_clean_df['name'].apply(clean_text)

# Remove the 'description' column as it's entirely null
dish_clean_df = dish_clean_df.drop('description', axis=1)

print("Null values after cleaning:")
print(dish_clean_df.isnull().sum())

print("\nShape of cleaned dataframe:", dish_clean_df.shape)

print("\nSample of cleaned data:")
print(dish_clean_df.head())

print("\nSummary statistics of price columns:")
print(dish_clean_df[['lowest_price', 'highest_price']].describe())

# save to csv
dish_clean_df.to_csv('dish_clean.csv', index=False)


Null values after cleaning:
id                     0
name                   0
menus_appeared         0
times_appeared         0
first_appeared     55503
last_appeared      55500
lowest_price      253237
highest_price     248722
dtype: int64

Shape of cleaned dataframe: (423397, 8)

Sample of cleaned data:
   id                        name  menus_appeared  times_appeared  \
0   1  Consomme Printaniere Royal               8               8   
1   2               Chicken Gumbo             111             117   
2   3         Tomato Aux Croutons              13              13   
3   4             Onion Au Gratin              41              41   
4   5                 St. Emilion              66              68   

   first_appeared  last_appeared  lowest_price  highest_price  
0          1897.0         1927.0          0.20            0.4  
1          1895.0         1960.0          0.10            0.8  
2          1893.0         1917.0          0.25            0.4  
3          1900.0     
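
A note on step 4: the `mean + 3*std` cutoff is computed from data that still contains the outliers, so a very large value raises the cutoff and can shield smaller outliers. The sketch below shows this with invented prices and the standard library's `statistics` module (`statistics.stdev` is the sample standard deviation, which is also the pandas default). The helper name `three_sigma_cutoff` is hypothetical.

```python
import statistics

def three_sigma_cutoff(values):
    """Upper bound used in the cleaning step: mean + 3 * sample standard deviation."""
    return statistics.mean(values) + 3 * statistics.stdev(values)

# Invented prices: twenty ordinary values and two suspicious ones.
prices = [0.5] * 20 + [50.0, 3050.0]

# First pass: the 3050 entry inflates the cutoff, so 50 survives.
first_cutoff = three_sigma_cutoff(prices)
kept_once = [p for p in prices if p <= first_cutoff]

# Second pass on the already-filtered values: now 50 is flagged too.
second_cutoff = three_sigma_cutoff(kept_once)
kept_twice = [p for p in kept_once if p <= second_cutoff]

print(round(first_cutoff, 1), max(kept_once))
print(round(second_cutoff, 1), max(kept_twice))
```

A single pass, as used above, is a reasonable first cut. Repeating the filter or using a percentile-based bound are alternatives if large values remain after cleaning.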

### Clean `menu_df`

First, let's provide an overview of the data.

In [105]:
menu_df.head()

Unnamed: 0,id,name,sponsor,event,venue,place,physical_description,occasion,notes,call_number,keywords,language,date,location,location_type,currency,currency_symbol,status,page_count,dish_count
0,12463,,HOTEL EASTMAN,BREAKFAST,COMMERCIAL,"HOT SPRINGS, AR",CARD; 4.75X7.5;,EASTER;,,1900-2822,,,1900-04-15,Hotel Eastman,,,,complete,2,67
1,12464,,REPUBLICAN HOUSE,[DINNER],COMMERCIAL,"MILWAUKEE, [WI];",CARD; ILLUS; COL; 7.0X9.0;,EASTER;,WEDGEWOOD BLUE CARD; WHITE EMBOSSED GREEK KEY ...,1900-2825,,,1900-04-15,Republican House,,,,under review,2,34
2,12465,,NORDDEUTSCHER LLOYD BREMEN,FRUHSTUCK/BREAKFAST;,COMMERCIAL,DAMPFER KAISER WILHELM DER GROSSE;,CARD; ILLU; COL; 5.5X8.0;,,"MENU IN GERMAN AND ENGLISH; ILLUS, STEAMSHIP A...",1900-2827,,,1900-04-16,Norddeutscher Lloyd Bremen,,,,complete,2,84
3,12466,,NORDDEUTSCHER LLOYD BREMEN,LUNCH;,COMMERCIAL,DAMPFER KAISER WILHELM DER GROSSE;,CARD; ILLU; COL; 5.5X8.0;,,"MENU IN GERMAN AND ENGLISH; ILLUS, HARBOR SCEN...",1900-2828,,,1900-04-16,Norddeutscher Lloyd Bremen,,,,complete,2,63
4,12467,,NORDDEUTSCHER LLOYD BREMEN,DINNER;,COMMERCIAL,DAMPFER KAISER WILHELM DER GROSSE;,FOLDER; ILLU; COL; 5.5X7.5;,,"MENU IN GERMAN AND ENGLISH; ILLUS, HARBOR SCEN...",1900-2829,,,1900-04-16,Norddeutscher Lloyd Bremen,,,,complete,4,33


In [106]:
menu_df.shape

(17545, 20)

In [107]:
menu_df.describe().round(2)

Unnamed: 0,id,keywords,language,location_type,page_count,dish_count
count,17545.0,0.0,0.0,0.0,17545.0,17545.0
mean,25325.95,,,,3.48,75.62
std,6431.55,,,,3.3,98.44
min,12463.0,,,,1.0,0.0
25%,20742.0,,,,2.0,20.0
50%,26165.0,,,,2.0,35.0
75%,30707.0,,,,4.0,93.0
max,35526.0,,,,74.0,4053.0


In [108]:
menu_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17545 entries, 0 to 17544
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    17545 non-null  int64  
 1   name                  3197 non-null   object 
 2   sponsor               15984 non-null  object 
 3   event                 8154 non-null   object 
 4   venue                 8119 non-null   object 
 5   place                 8123 non-null   object 
 6   physical_description  14763 non-null  object 
 7   occasion              3791 non-null   object 
 8   notes                 10613 non-null  object 
 9   call_number           15983 non-null  object 
 10  keywords              0 non-null      float64
 11  language              0 non-null      float64
 12  date                  16959 non-null  object 
 13  location              17545 non-null  object 
 14  location_type         0 non-null      float64
 15  currency           

In [109]:
menu_df.nunique()

id                      17545
name                      797
sponsor                  6370
event                    1770
venue                     233
place                    3714
physical_description     6268
occasion                  423
notes                    6969
call_number             15936
keywords                    0
language                    0
date                     6599
location                 6283
location_type               0
currency                   42
currency_symbol            34
status                      2
page_count                 46
dish_count                555
dtype: int64

Some major issues here include:

-   Missing values in several columns, including 'name', 'sponsor', 'event', 'venue', 'place', 'physical_description', 'occasion', 'notes', 'call_number', 'date', 'currency', and 'currency_symbol'.
-   Completely empty columns include 'keywords', 'language', and 'location_type' (all null values).
-   Potential inconsistencies in casing, whitespace, and punctuation (not directly visible, but common issues).
-   Possible outliers in numerical columns like 'page_count' and 'dish_count'.

Let's fix some of these using Pandas.

In [110]:
def standardize_event(event):
    if pd.isna(event) or not isinstance(event, str):
        return event
    event = event.lower()
    if 'breakfast' in event or 'morning' in event:
        return 'Breakfast'
    elif 'lunch' in event or 'noon' in event:
        return 'Lunch'
    elif 'dinner' in event or 'supper' in event or 'evening' in event:
        return 'Dinner'
    elif 'banquet' in event:
        return 'Banquet'
    elif 'wedding' in event:
        return 'Wedding'
    elif 'brunch' in event:
        return 'Brunch'
    elif 'birthday' in event:
        return 'Birthday'
    elif 'party' in event:
        return 'Party'
    elif 'anniversary' in event:
        return 'Anniversary'
    else:
        return event.title()


# Create a copy of the original dataframe
menu_clean_df = menu_df.copy()

# Clean text columns
text_columns = ['name', 'sponsor', 'venue', 'place', 'physical_description', 'occasion', 'notes', 'call_number', 'location']
for col in text_columns:
    menu_clean_df[col] = menu_clean_df[col].apply(clean_text)

# Standardize 'event' column
menu_clean_df['event'] = menu_clean_df['event'].apply(standardize_event)

# Convert 'date' column to datetime
menu_clean_df['date'] = pd.to_datetime(menu_clean_df['date'], errors='coerce')

# Set dates outside a reasonable range (e.g., 1800-2023) to NaT
menu_clean_df.loc[(menu_clean_df['date'].dt.year < 1800) | (menu_clean_df['date'].dt.year > 2023), 'date'] = pd.NaT

# Standardize 'currency' column and set unknown to 'USD'
menu_clean_df['currency'] = menu_clean_df['currency'].str.upper()
menu_clean_df['currency'] = menu_clean_df['currency'].fillna('USD')

# Drop the 'currency_symbol' column
menu_clean_df = menu_clean_df.drop('currency_symbol', axis=1)

# Ensure 'page_count' and 'dish_count' are non-negative
menu_clean_df['page_count'] = menu_clean_df['page_count'].clip(lower=0)
menu_clean_df['dish_count'] = menu_clean_df['dish_count'].clip(lower=0)



print("Null values after cleaning:")
print(menu_clean_df.isnull().sum())

print("\nShape of cleaned dataframe:", menu_clean_df.shape)

print("\nSample of cleaned data:")
print(menu_clean_df.head())

print("\nSummary statistics of numerical columns:")
print(menu_clean_df[['page_count', 'dish_count']].describe())

print("\nUnique values in categorical columns:")
print(menu_clean_df[['name', 'sponsor', 'event', 'venue', 'place', 'physical_description', 'occasion', 'notes', 'call_number', 'location']].nunique())

print("\nEvent counts after standardization:")
print(menu_clean_df['event'].value_counts())

# save to csv
menu_clean_df.to_csv('menu_clean.csv', index=False)


Null values after cleaning:
id                          0
name                    14348
sponsor                  1561
event                    9391
venue                    9426
place                    9422
physical_description     2782
occasion                13754
notes                    6932
call_number              1562
keywords                17545
language                17545
date                      591
location                    0
location_type           17545
currency                    0
status                      0
page_count                  0
dish_count                  0
dtype: int64

Shape of cleaned dataframe: (17545, 19)

Sample of cleaned data:
      id name                     sponsor      event       venue  \
0  12463  NaN               Hotel Eastman  Breakfast  Commercial   
1  12464  NaN            Republican House     Dinner  Commercial   
2  12465  NaN  Norddeutscher Lloyd Bremen  Breakfast  Commercial   
3  12466  NaN  Norddeutscher Lloyd Bremen      Lunc
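
One thing to watch in `standardize_event`: it uses substring tests, so a keyword can match inside a longer word. The sketch below contrasts the substring test with a whole-word test. The event value `"AFTERNOON TEA"` and the helper `has_word` are made up for illustration.

```python
import re

def has_word(keyword, text):
    """True when keyword appears as a whole word (case-insensitive)."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

event = "AFTERNOON TEA"

substring_hit = "noon" in event.lower()    # the test standardize_event uses
whole_word_hit = has_word("noon", event)   # stricter alternative
lunch_hit = has_word("lunch", "LUNCH;")    # trailing punctuation still matches

print(substring_hit)    # True
print(whole_word_hit)   # False
print(lunch_hit)        # True
```

The whole-word test is stricter in both directions: it would also stop matching "luncheon" for the keyword "lunch", so which test fits depends on the event values actually present.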

### Clean `menuitem_df`

In [111]:
menuitem_df.head()

Unnamed: 0,id,menu_page_id,price,high_price,dish_id,created_at,updated_at,xpos,ypos
0,1,1389,0.4,,1.0,2011-03-28 15:00:44 UTC,2011-04-19 04:33:15 UTC,0.111429,0.254735
1,2,1389,0.6,,2.0,2011-03-28 15:01:13 UTC,2011-04-19 15:00:54 UTC,0.438571,0.254735
2,3,1389,0.4,,3.0,2011-03-28 15:01:40 UTC,2011-04-19 19:10:05 UTC,0.14,0.261922
3,4,1389,0.5,,4.0,2011-03-28 15:01:51 UTC,2011-04-19 19:07:01 UTC,0.377143,0.26272
4,5,3079,0.5,1.0,5.0,2011-03-28 15:21:26 UTC,2011-04-13 15:25:27 UTC,0.105714,0.313178


In [112]:
menuitem_df.shape

(1332726, 9)

In [113]:
menuitem_df.describe().round(2)

Unnamed: 0,id,menu_page_id,price,high_price,dish_id,xpos,ypos
count,1332726.0,1332726.0,886810.0,91905.0,1332485.0,1332726.0,1332726.0
mean,697898.38,47594.87,12.84,8.11,158011.04,0.39,0.55
std,399980.67,22039.21,499.55,90.1,167762.04,0.22,0.22
min,1.0,130.0,0.0,0.0,1.0,0.0,0.0
25%,350251.25,32049.0,0.25,0.5,5089.0,0.18,0.37
50%,702410.5,53371.0,0.4,1.25,80700.0,0.38,0.57
75%,1045548.75,66823.0,1.0,3.0,332524.0,0.57,0.74
max,1385906.0,77425.0,180000.0,7800.0,515677.0,0.99,1.0


In [114]:
menuitem_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1332726 entries, 0 to 1332725
Data columns (total 9 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   id            1332726 non-null  int64  
 1   menu_page_id  1332726 non-null  int64  
 2   price         886810 non-null   float64
 3   high_price    91905 non-null    float64
 4   dish_id       1332485 non-null  float64
 5   created_at    1332726 non-null  object 
 6   updated_at    1332726 non-null  object 
 7   xpos          1332726 non-null  float64
 8   ypos          1332726 non-null  float64
dtypes: float64(5), int64(2), object(2)
memory usage: 91.5+ MB


In [115]:
menuitem_df.nunique()

id              1332726
menu_page_id      26590
price              1336
high_price          671
dish_id          414138
created_at      1291090
updated_at      1295796
xpos               1323
ypos             616305
dtype: int64

Main problems here include:

1. Missing values in 'price', 'high_price', and 'dish_id' columns.
2. 'created_at' and 'updated_at' are stored as object types instead of datetime.
3. Potential outliers or inconsistencies in 'price' and 'high_price' columns (the maximum 'price' is 180,000).
4. Possible duplicate entries. Every 'id' is unique, so duplicates would only show up as rows that repeat the other columns.
5. 'xpos' and 'ypos' already fall between 0 and 1, so they mainly need a range check.
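
To check for duplicate entries when every `id` is unique, one approach is to group on the content columns and look for groups with more than one row. A minimal sketch on a toy table with invented rows; which columns define "the same item" is an assumption made here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE menuitems (id INTEGER PRIMARY KEY, menu_page_id INTEGER,
                            dish_id INTEGER, price REAL, xpos REAL, ypos REAL)
""")
conn.executemany("INSERT INTO menuitems VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1389, 1, 0.4, 0.11, 0.25),
    (2, 1389, 1, 0.4, 0.11, 0.25),   # same content as id 1 under a different id
    (3, 1389, 2, 0.6, 0.44, 0.25),
])

# Rows that agree on every content column are candidate duplicates.
dupes = conn.execute("""
    SELECT menu_page_id, dish_id, price, xpos, ypos, COUNT(*) AS copies
    FROM menuitems
    GROUP BY menu_page_id, dish_id, price, xpos, ypos
    HAVING COUNT(*) > 1
""").fetchall()
conn.close()

print(dupes)
```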

In [116]:
# Create a copy of the original dataframe
menuitem_clean_df = menuitem_df.copy()

# Convert 'created_at' and 'updated_at' to datetime
menuitem_clean_df['created_at'] = pd.to_datetime(menuitem_clean_df['created_at'], errors='coerce')
menuitem_clean_df['updated_at'] = pd.to_datetime(menuitem_clean_df['updated_at'], errors='coerce')

# Handle extreme price values
def clean_price(col):
    # Calculate mean and standard deviation for the column
    mean = menuitem_clean_df[col].mean()
    std = menuitem_clean_df[col].std()

    # Define upper bound (we don't use lower bound as prices can't be negative)
    upper_bound = mean + 3 * std

    # Replace values above the upper bound with NaN
    menuitem_clean_df.loc[menuitem_clean_df[col] > upper_bound, col] = np.nan

# Apply price cleaning to 'price' and 'high_price' columns
clean_price('price')
clean_price('high_price')

# Ensure 'high_price' is always greater than or equal to 'price'
mask = (menuitem_clean_df['high_price'] < menuitem_clean_df['price']) & menuitem_clean_df['high_price'].notna()
menuitem_clean_df.loc[mask, 'high_price'] = menuitem_clean_df.loc[mask, 'price']

# Handle negative prices
menuitem_clean_df.loc[menuitem_clean_df['price'] < 0, 'price'] = np.nan
menuitem_clean_df.loc[menuitem_clean_df['high_price'] < 0, 'high_price'] = np.nan

# Ensure dish_id is an integer
menuitem_clean_df['dish_id'] = menuitem_clean_df['dish_id'].fillna(-1).astype(int)

# Handle xpos and ypos
menuitem_clean_df['xpos'] = menuitem_clean_df['xpos'].clip(lower=0)
menuitem_clean_df['ypos'] = menuitem_clean_df['ypos'].clip(lower=0)

# Drop items with dish_id of -1
menuitem_clean_df = menuitem_clean_df[menuitem_clean_df['dish_id'] != -1]

print("Null values after cleaning:")
print(menuitem_clean_df.isnull().sum())

print("\nShape of cleaned dataframe:", menuitem_clean_df.shape)

print("\nSample of cleaned data:")
print(menuitem_clean_df.head())

print("\nSummary statistics of numerical columns:")
print(menuitem_clean_df.describe())

print("\nData types of columns:")
print(menuitem_clean_df.dtypes)

print("\nNumber of unique values in each column:")
print(menuitem_clean_df.nunique())

# save to csv
menuitem_clean_df.to_csv('menuitem_clean.csv', index=False)


Null values after cleaning:
id                    0
menu_page_id          0
price            446563
high_price      1241026
dish_id               0
created_at            0
updated_at            0
xpos                  0
ypos                  0
dtype: int64

Shape of cleaned dataframe: (1332485, 9)

Sample of cleaned data:
   id  menu_page_id  price  high_price  dish_id                created_at  \
0   1          1389    0.4         NaN        1 2011-03-28 15:00:44+00:00   
1   2          1389    0.6         NaN        2 2011-03-28 15:01:13+00:00   
2   3          1389    0.4         NaN        3 2011-03-28 15:01:40+00:00   
3   4          1389    0.5         NaN        4 2011-03-28 15:01:51+00:00   
4   5          3079    0.5         1.0        5 2011-03-28 15:21:26+00:00   

                 updated_at      xpos      ypos  
0 2011-04-19 04:33:15+00:00  0.111429  0.254735  
1 2011-04-19 15:00:54+00:00  0.438571  0.254735  
2 2011-04-19 19:10:05+00:00  0.140000  0.261922  
3 2011-04-19 
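
The cell above fills missing `dish_id` values with -1, casts to int, and then drops the -1 rows. The row counts imply this removes 241 rows (1,332,726 before, 1,332,485 after). The sketch below compares that route with two alternatives on a toy frame with invented values: dropping the missing rows directly, and keeping them with pandas' nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3], "dish_id": [1.0, np.nan, 3.0]})

# Route used above: sentinel value, cast, then drop the sentinel rows.
sentinel = toy.copy()
sentinel["dish_id"] = sentinel["dish_id"].fillna(-1).astype(int)
sentinel = sentinel[sentinel["dish_id"] != -1]

# Alternative 1: drop the missing rows directly, then cast.
direct = toy.dropna(subset=["dish_id"]).astype({"dish_id": int})

# Alternative 2: keep every row and store the gap as <NA> in an integer column.
nullable = toy.copy()
nullable["dish_id"] = nullable["dish_id"].astype("Int64")

print(sentinel["dish_id"].tolist())
print(direct["dish_id"].tolist())
print(nullable["dish_id"].tolist())
```

The first two routes give the same result. The third keeps rows that have a price but no dish, which may or may not be wanted downstream.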

### Clean `menupage_df`

In [117]:
menupage_df.head()

Unnamed: 0,id,menu_id,page_number,image_id,full_height,full_width,uuid
0,119,12460,1.0,1603595,7230.0,5428.0,510d47e4-2955-a3d9-e040-e00a18064a99
1,120,12460,2.0,1603596,5428.0,7230.0,510d47e4-2956-a3d9-e040-e00a18064a99
2,121,12460,3.0,1603597,7230.0,5428.0,510d47e4-2957-a3d9-e040-e00a18064a99
3,122,12460,4.0,1603598,7230.0,5428.0,510d47e4-2958-a3d9-e040-e00a18064a99
4,123,12461,1.0,1603591,7230.0,5428.0,510d47e4-2959-a3d9-e040-e00a18064a99


In [118]:
menupage_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66937 entries, 0 to 66936
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           66937 non-null  int64  
 1   menu_id      66937 non-null  int64  
 2   page_number  65735 non-null  float64
 3   image_id     66937 non-null  object 
 4   full_height  66608 non-null  float64
 5   full_width   66608 non-null  float64
 6   uuid         66937 non-null  object 
dtypes: float64(3), int64(2), object(2)
memory usage: 3.6+ MB


In [119]:
menupage_df.shape

(66937, 7)

In [120]:
menupage_df.describe().round(2)

Unnamed: 0,id,menu_id,page_number,full_height,full_width
count,66937.0,66937.0,65735.0,66608.0,66608.0
mean,42719.76,25653.58,3.76,3859.1,2778.59
std,21274.0,6158.83,4.91,1156.01,970.29
min,119.0,12460.0,1.0,616.0,558.0
25%,27108.0,21743.0,1.0,2988.0,2120.0
50%,43894.0,26202.0,2.0,3630.0,2527.0
75%,60696.0,30531.0,4.0,4617.25,3295.25
max,77431.0,35526.0,74.0,12044.0,9175.0


In [121]:
menupage_df.nunique()

id             66937
menu_id        19816
page_number       74
image_id       63244
full_height     5612
full_width      5041
uuid           63041
dtype: int64

This is pretty clean. However:

- We have a few missing values in columns (which we can pretty easily impute).
- Some data types (page number) look inappropriate (they should be ints).
- We can check for standard things like duplicates.
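
One more thing worth checking: `menu_id` has 19,816 unique values here, while `menu_df` has only 17,545 rows, so at least 2,271 menu ids in this table have no matching menu. (The first rows above already point to menu 12460, below the smallest `id` in `menu_df`, 12463.) A minimal sketch of finding such orphaned pages with a `LEFT JOIN`, on a toy database with a handful of rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menus (id INTEGER PRIMARY KEY);
    CREATE TABLE menupages (id INTEGER PRIMARY KEY, menu_id INTEGER);
    -- Toy rows: menu 12460 has pages but no entry in menus
    INSERT INTO menus VALUES (12463), (12464);
    INSERT INTO menupages VALUES (119, 12460), (120, 12460),
                                 (130, 12463), (131, 12464);
""")

# Pages whose menu_id finds no partner in menus come back with menus.id NULL.
orphans = conn.execute("""
    SELECT menupages.menu_id, COUNT(*) AS pages
    FROM menupages
    LEFT JOIN menus ON menus.id = menupages.menu_id
    WHERE menus.id IS NULL
    GROUP BY menupages.menu_id
""").fetchall()
conn.close()

print(orphans)
```

Whether to drop orphaned pages or keep them depends on the query: an inner join through `menus` silently excludes them either way.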

In [122]:
menupage_clean_df = menupage_df.copy()

# Fill missing page_number with the median and convert to int
median_page = menupage_clean_df['page_number'].median()
menupage_clean_df['page_number'] = menupage_clean_df['page_number'].fillna(median_page).astype(int)

# Fill missing full_height and full_width with the median of the respective column (both stay float)
for col in ['full_height', 'full_width']:
    menupage_clean_df[col] = menupage_clean_df[col].fillna(menupage_clean_df[col].median())

# Drop exact duplicate rows
menupage_clean_df = menupage_clean_df.drop_duplicates()

print("Null values after cleaning:")
print(menupage_clean_df.isnull().sum())

print("\nShape of cleaned dataframe:", menupage_clean_df.shape)

print("\nSample of cleaned data:")
print(menupage_clean_df.head())

print("\nSummary statistics of numerical columns:")
print(menupage_clean_df.describe())

print("\nData types of columns:")
print(menupage_clean_df.dtypes)

print("\nNumber of unique values in each column:")
print(menupage_clean_df.nunique())

# save to csv
menupage_clean_df.to_csv('menupage_clean.csv', index=False)

Null values after cleaning:
id             0
menu_id        0
page_number    0
image_id       0
full_height    0
full_width     0
uuid           0
dtype: int64

Shape of cleaned dataframe: (66937, 7)

Sample of cleaned data:
    id  menu_id  page_number image_id  full_height  full_width  \
0  119    12460            1  1603595       7230.0      5428.0   
1  120    12460            2  1603596       5428.0      7230.0   
2  121    12460            3  1603597       7230.0      5428.0   
3  122    12460            4  1603598       7230.0      5428.0   
4  123    12461            1  1603591       7230.0      5428.0   

                                   uuid  
0  510d47e4-2955-a3d9-e040-e00a18064a99  
1  510d47e4-2956-a3d9-e040-e00a18064a99  
2  510d47e4-2957-a3d9-e040-e00a18064a99  
3  510d47e4-2958-a3d9-e040-e00a18064a99  
4  510d47e4-2959-a3d9-e040-e00a18064a99  

Summary statistics of numerical columns:
                 id       menu_id   page_number   full_height    full_width
count  6

## Load Data Into SQL Database

In [123]:
import sqlite3
import pandas as pd

# Create a connection to the SQLite database
conn = sqlite3.connect('nypl_menus.db')

dish_clean_df.to_sql('dishes', conn, if_exists='replace', index=False)
print("Dishes data loaded successfully.")

menu_clean_df.to_sql('menus', conn, if_exists='replace', index=False)
print("Menus data loaded successfully.")

menuitem_clean_df.to_sql('menuitems', conn, if_exists='replace', index=False)
print("Menu items data loaded successfully.")

menupage_clean_df.to_sql('menupages', conn, if_exists='replace', index=False)
print("Menu pages data loaded successfully.")

# Commit the changes and close the connection
conn.commit()
conn.close()

print("Database created and data loaded successfully.")

# Reopen the connection to verify the data
conn = sqlite3.connect('nypl_menus.db')
cursor = conn.cursor()

# Check the number of rows in each table
tables = ['dishes', 'menus', 'menuitems', 'menupages']
for table in tables:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"Number of rows in {table} table: {count}")

conn.close()


Dishes data loaded successfully.
Menus data loaded successfully.
Menu items data loaded successfully.
Menu pages data loaded successfully.
Database created and data loaded successfully.
Number of rows in dishes table: 423397
Number of rows in menus table: 17545
Number of rows in menuitems table: 1332485
Number of rows in menupages table: 66937
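
The row counts above can also be compared against the DataFrames programmatically rather than by eye. This is a minimal sketch using an in-memory database and made-up toy frames (`toy_frames` is a hypothetical stand-in for the four cleaned DataFrames):

```python
import sqlite3
import pandas as pd

# Hypothetical toy frames; in the notebook these would be the *_clean_df objects.
toy_frames = {
    "dishes": pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]}),
    "menus": pd.DataFrame({"id": [10, 11], "sponsor": ["X", "Y"]}),
}

conn = sqlite3.connect(":memory:")
row_counts = {}
for table, frame in toy_frames.items():
    frame.to_sql(table, conn, if_exists="replace", index=False)
    # Table names come from the fixed dict above, so the f-string is safe here.
    row_counts[table] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
conn.close()

# Collect any table whose row count differs from its source DataFrame.
mismatches = {
    table: (len(toy_frames[table]), count)
    for table, count in row_counts.items()
    if len(toy_frames[table]) != count
}
print("Row counts:", row_counts)
print("Mismatches:", mismatches)
```

An empty `mismatches` dict means every table holds exactly as many rows as the DataFrame it was loaded from.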


Now, let's use SQL magic to inspect the database:

In [124]:
%reload_ext sql
%sql sqlite:///nypl_menus.db

In [125]:
%%sql
SELECT * FROM dishes LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


id,name,menus_appeared,times_appeared,first_appeared,last_appeared,lowest_price,highest_price
1,Consomme Printaniere Royal,8,8,1897.0,1927.0,0.2,0.4
2,Chicken Gumbo,111,117,1895.0,1960.0,0.1,0.8
3,Tomato Aux Croutons,13,13,1893.0,1917.0,0.25,0.4
4,Onion Au Gratin,41,41,1900.0,1971.0,0.25,1.0
5,St. Emilion,66,68,1881.0,1981.0,,18.0


In [126]:
%%sql
SELECT * FROM menus LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


id,name,sponsor,event,venue,place,physical_description,occasion,notes,call_number,keywords,language,date,location,location_type,currency,status,page_count,dish_count
12463,,Hotel Eastman,Breakfast,Commercial,"Hot Springs, Ar",Card 4. 75X7. 5,Easter,,1900-2822,,,1900-04-15 00:00:00,Hotel Eastman,,USD,complete,2,67
12464,,Republican House,Dinner,Commercial,"Milwaukee, Wi",Card Illus Col 7. 0X9. 0,Easter,Wedgewood Blue Card White Embossed Greek Key Border Easter Sunday Embossed In White Violet Colored Spray Of Flowers In Upper Left Corner,1900-2825,,,1900-04-15 00:00:00,Republican House,,USD,under review,2,34
12465,,Norddeutscher Lloyd Bremen,Breakfast,Commercial,Dampfer Kaiser Wilhelm Der Grosse,Card Illu Col 5. 5X8. 0,,"Menu In German And English Illus, Steamship And Sailing Vessel",1900-2827,,,1900-04-16 00:00:00,Norddeutscher Lloyd Bremen,,USD,complete,2,84
12466,,Norddeutscher Lloyd Bremen,Lunch,Commercial,Dampfer Kaiser Wilhelm Der Grosse,Card Illu Col 5. 5X8. 0,,"Menu In German And English Illus, Harbor Scene With Sailing Vessel",1900-2828,,,1900-04-16 00:00:00,Norddeutscher Lloyd Bremen,,USD,complete,2,63
12467,,Norddeutscher Lloyd Bremen,Dinner,Commercial,Dampfer Kaiser Wilhelm Der Grosse,Folder Illu Col 5. 5X7. 5,,"Menu In German And English Illus, Harbor Scene With Rocks And Lighthouse Steamship And Sailing Vessels Concert Program Dates On German Side Of Menu Montag, Den 16 April 1900 On English Side Of Menu Monday, April 15Th, 1900",1900-2829,,,1900-04-16 00:00:00,Norddeutscher Lloyd Bremen,,USD,complete,4,33


In [127]:
%%sql
SELECT * FROM menuitems LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


id,menu_page_id,price,high_price,dish_id,created_at,updated_at,xpos,ypos
1,1389,0.4,,1,2011-03-28 15:00:44+00:00,2011-04-19 04:33:15+00:00,0.111429,0.254735
2,1389,0.6,,2,2011-03-28 15:01:13+00:00,2011-04-19 15:00:54+00:00,0.438571,0.254735
3,1389,0.4,,3,2011-03-28 15:01:40+00:00,2011-04-19 19:10:05+00:00,0.14,0.261922
4,1389,0.5,,4,2011-03-28 15:01:51+00:00,2011-04-19 19:07:01+00:00,0.377143,0.26272
5,3079,0.5,1.0,5,2011-03-28 15:21:26+00:00,2011-04-13 15:25:27+00:00,0.105714,0.313178


In [128]:
%%sql
SELECT * FROM menupages LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


id,menu_id,page_number,image_id,full_height,full_width,uuid
119,12460,1,1603595,7230.0,5428.0,510d47e4-2955-a3d9-e040-e00a18064a99
120,12460,2,1603596,5428.0,7230.0,510d47e4-2956-a3d9-e040-e00a18064a99
121,12460,3,1603597,7230.0,5428.0,510d47e4-2957-a3d9-e040-e00a18064a99
122,12460,4,1603598,7230.0,5428.0,510d47e4-2958-a3d9-e040-e00a18064a99
123,12461,1,1603591,7230.0,5428.0,510d47e4-2959-a3d9-e040-e00a18064a99


## Additional Data Cleaning in SQL

Now, let's use SQL to find some remaining problems in our data: rows with no match in a related table, and dish prices that disagree with the menu items.

In [129]:
%%sql
-- Find dishes without menuitems
SELECT dishes.*
FROM dishes
LEFT JOIN menuitems ON dishes.id = menuitems.dish_id
WHERE menuitems.id IS NULL
LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


id,name,menus_appeared,times_appeared,first_appeared,last_appeared,lowest_price,highest_price
825,"Rice, Semolina",1,0,1900.0,1900.0,,
2799,Half Pint,1,0,1901.0,1901.0,0.1,0.1
3031,Saute With Mushrooms,1,0,1900.0,1900.0,,
3032,A La Lyonnaise,1,0,1900.0,1900.0,0.35,0.35
3033,En Brochette,1,0,1900.0,1900.0,0.5,0.5


In [130]:
%%sql
-- Find menuitems without dishes
SELECT menuitems.*
FROM menuitems
LEFT JOIN dishes ON menuitems.dish_id = dishes.id
WHERE dishes.id IS NULL
LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


id,menu_page_id,price,high_price,dish_id,created_at,updated_at,xpos,ypos
619133,51020,,,220797,2011-10-30 14:27:33+00:00,2011-10-30 14:27:33+00:00,0.605714,0.215599
837354,60235,0.2,,329183,2012-03-08 23:28:32+00:00,2012-03-08 23:28:32+00:00,0.664286,0.173588
1047160,69117,0.45,,395403,2012-08-14 09:52:01+00:00,2012-08-14 09:52:01+00:00,0.377333,0.469635
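
The same check can be done in Pandas before loading into SQLite. This is a minimal sketch of a left anti-join using `merge` with `indicator=True`, on toy frames that reuse a few ids from the outputs above (the frames themselves are hypothetical stand-ins for `dish_clean_df` and `menuitem_clean_df`):

```python
import pandas as pd

# Toy stand-ins; the ids are borrowed from the query outputs above.
dishes = pd.DataFrame({
    "id": [1, 2],
    "name": ["Consomme Printaniere Royal", "Chicken Gumbo"],
})
menuitems = pd.DataFrame({
    "id": [1, 2, 619133],
    "dish_id": [1, 2, 220797],
})

# Left-join menu items to dish ids; indicator=True adds a "_merge" column
# that says whether each row matched ("both") or not ("left_only").
merged = menuitems.merge(
    dishes[["id"]].rename(columns={"id": "dish_id"}),
    on="dish_id",
    how="left",
    indicator=True,
)
orphan_items = merged.loc[merged["_merge"] == "left_only", ["id", "dish_id"]]
print(orphan_items)
```

Here only menu item 619133 is reported, because dish 220797 is not in the toy `dishes` frame.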


In [131]:
%%sql
-- Find menus without menu pages
SELECT menus.*
FROM menus
LEFT JOIN menupages ON menus.id = menupages.menu_id
WHERE menupages.id IS NULL
LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


id,name,sponsor,event,venue,place,physical_description,occasion,notes,call_number,keywords,language,date,location,location_type,currency,status,page_count,dish_count


In [132]:
%%sql
-- Find menu pages without menus
SELECT menupages.*
FROM menupages
LEFT JOIN menus ON menupages.menu_id = menus.id
WHERE menus.id IS NULL
LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


id,menu_id,page_number,image_id,full_height,full_width,uuid
119,12460,1,1603595,7230.0,5428.0,510d47e4-2955-a3d9-e040-e00a18064a99
120,12460,2,1603596,5428.0,7230.0,510d47e4-2956-a3d9-e040-e00a18064a99
121,12460,3,1603597,7230.0,5428.0,510d47e4-2957-a3d9-e040-e00a18064a99
122,12460,4,1603598,7230.0,5428.0,510d47e4-2958-a3d9-e040-e00a18064a99
123,12461,1,1603591,7230.0,5428.0,510d47e4-2959-a3d9-e040-e00a18064a99
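
This query does return rows: pages with menu_id 12460 and 12461 have no matching row in menus. That is consistent with the earlier counts, where menupages references 19816 distinct menu_id values while menus holds 17545 rows. The sketch below shows one possible way to count such orphans and, if that is the desired policy, delete them. It runs against an in-memory toy database (page ids 130 and 131 are made up), and deletion is an assumption about how to handle them, not something the notebook does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menus (id INTEGER PRIMARY KEY);
    CREATE TABLE menupages (id INTEGER PRIMARY KEY, menu_id INTEGER);
    INSERT INTO menus (id) VALUES (12463), (12464);
    INSERT INTO menupages (id, menu_id) VALUES
        (119, 12460), (120, 12460), (130, 12463), (131, 12464);
""")

# Count pages whose menu_id has no matching menu.
orphan_sql = """
    SELECT COUNT(*) FROM menupages
    WHERE NOT EXISTS (SELECT 1 FROM menus WHERE menus.id = menupages.menu_id)
"""
orphans_before = conn.execute(orphan_sql).fetchone()[0]

# One possible correction (an assumption): drop the orphaned pages.
conn.execute("""
    DELETE FROM menupages
    WHERE NOT EXISTS (SELECT 1 FROM menus WHERE menus.id = menupages.menu_id)
""")
conn.commit()

orphans_after = conn.execute(orphan_sql).fetchone()[0]
remaining_pages = conn.execute("SELECT COUNT(*) FROM menupages").fetchone()[0]
conn.close()

print(f"Orphans before: {orphans_before}, after: {orphans_after}")
print(f"Pages remaining: {remaining_pages}")
```

Whether to delete orphaned pages or keep them and flag them depends on how the data will be used, so counting first is the safer default.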


In [133]:
%%sql
-- Find places where dish highest price is lower than a corresponding menuitem price
SELECT dishes.name, dishes.highest_price, menuitems.price
FROM dishes
JOIN menuitems ON dishes.id = menuitems.dish_id
WHERE dishes.highest_price < menuitems.price
LIMIT 5;

   sqlite:///dirty_menus.db
 * sqlite:///nypl_menus.db
Done.


name,highest_price,price
Sardines,50.0,75.0
Oysters,35.0,100.0
"Roast Sirloin Of Beef, Yorkshire Pudding",0.75,1.9
Ananas,0.95,1.0
Ananas,0.95,1.2
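
To see how widespread this disagreement is, one option is to recompute each dish's maximum price from the menu items and compare it with the stored highest_price. This is a minimal sketch in Pandas on toy frames: most values are taken from the outputs above, but the 40.0 price for Sardines is made up, and the `stale` column name is hypothetical.

```python
import pandas as pd

dishes = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Sardines", "Ananas", "Chicken Gumbo"],
    "highest_price": [50.0, 0.95, 0.8],
})
menuitems = pd.DataFrame({
    "dish_id": [1, 1, 2, 2, 3],
    "price": [40.0, 75.0, 1.0, 1.2, 0.6],
})

# Highest price actually observed for each dish across its menu items.
observed_max = menuitems.groupby("dish_id")["price"].max().rename("observed_max")

# Attach the observed maximum to each dish and flag disagreements.
check = dishes.merge(observed_max, left_on="id", right_index=True, how="left")
check["stale"] = check["highest_price"] < check["observed_max"]
print(check[["name", "highest_price", "observed_max", "stale"]])
```

Rows with `stale` set to True are dishes whose stored highest_price is lower than at least one of their menu item prices.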


## Workflow Graphic (in Mermaid)

In [134]:
mm("""
flowchart TD

    subgraph "dish_df"
        A1[Initial dish.csv]
        A2[Load into Pandas]
        A3[Clean text in 'name' column]
        A4[Handle extreme prices]
        A5[Remove description column]
        A6[Export to dish_clean.csv]
        A7[Load into SQLite]

        A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
    end

    subgraph "menu_df"
        B1[Initial menu.csv]
        B2[Load into Pandas]
        B3[Clean text columns]
        B4[Standardize 'event' column]
        B5[Convert 'date' to datetime]
        B6[Handle currency]
        B7[Export to menu_clean.csv]
        B8[Load into SQLite]

        B1 --> B2 --> B3 --> B4 --> B5 --> B6 --> B7 --> B8
    end

    subgraph "menuitem_df"
        C1[Initial menuitem.csv]
        C2[Load into Pandas]
        C3[Convert timestamps to datetime]
        C4[Handle extreme price values]
        C5[Ensure dish_id is integer. Drop -1 values]
        C6[Handle xpos and ypos]
        C7[Export to menuitem_clean.csv]
        C8[Load into SQLite]

        C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> C8
    end

    subgraph "menupage_df"
        D1[Initial menupage.csv]
        D2[Load into Pandas]
        D3[Handle missing values]
        D4[Convert page_number to integer]
        D5[Ensure correct data types]
        D6[Remove duplicates]
        D7[Export to menupage_clean.csv]
        D8[Load into SQLite]

        D1 --> D2 --> D3 --> D4 --> D5 --> D6 --> D7 --> D8
    end
  subgraph "nypl_menus.db"
    E[SQLite Database]
    E1[Some SQL Operations]
    A7 & B8 & C8 & D8 --> E
    E --> E1

  end
"""
)

### Summary: Dirty vs Clean Data

In [135]:
import pandas as pd
import numpy as np

def is_numeric_dtype(dtype):
    if pd.api.types.is_numeric_dtype(dtype):
        return True
    if hasattr(dtype, 'name') and 'Int' in dtype.name:  # This catches pandas integer extension types
        return True
    return False

def summarize_differences(dirty_df, clean_df, name):
    print(f"\nSummary for {name}:")

    # Compare shapes
    print(f"Shape: {dirty_df.shape} -> {clean_df.shape}")

    # Compare column names
    removed_columns = list(set(dirty_df.columns) - set(clean_df.columns))
    added_columns = list(set(clean_df.columns) - set(dirty_df.columns))
    print(f"Removed columns: {removed_columns}")
    print(f"Added columns: {added_columns}")

    # Compare data types
    common_columns = list(set(dirty_df.columns) & set(clean_df.columns))
    for col in common_columns:
        if dirty_df[col].dtype != clean_df[col].dtype:
            print(f"Column '{col}' dtype changed: {dirty_df[col].dtype} -> {clean_df[col].dtype}")

    # Compare null values
    dirty_nulls = dirty_df[common_columns].isnull().sum()
    clean_nulls = clean_df[common_columns].isnull().sum()
    null_diff = clean_nulls - dirty_nulls
    print("Null value changes:")
    print(null_diff[null_diff != 0])

    # Compare unique values
    dirty_uniques = dirty_df[common_columns].nunique()
    clean_uniques = clean_df[common_columns].nunique()
    unique_diff = clean_uniques - dirty_uniques
    print("Unique value changes:")
    print(unique_diff[unique_diff != 0])

    # Compare numeric columns statistics
    numeric_columns = [col for col in common_columns if is_numeric_dtype(dirty_df[col].dtype) and is_numeric_dtype(clean_df[col].dtype)]
    for col in numeric_columns:
        dirty_stats = dirty_df[col].describe()
        clean_stats = clean_df[col].describe()
        if not dirty_stats.equals(clean_stats):
            print(f"\nStatistics changed for column '{col}':")
            print(pd.concat([dirty_stats, clean_stats], axis=1, keys=['Dirty', 'Clean']))

# Usage
summarize_differences(dish_df, dish_clean_df, "Dish DataFrame")
summarize_differences(menu_df, menu_clean_df, "Menu DataFrame")
summarize_differences(menuitem_df, menuitem_clean_df, "MenuItem DataFrame")
summarize_differences(menupage_df, menupage_clean_df, "MenuPage DataFrame")


Summary for Dish DataFrame:
Shape: (423397, 9) -> (423397, 8)
Removed columns: ['description']
Added columns: []
Column 'first_appeared' dtype changed: int64 -> float64
Column 'last_appeared' dtype changed: int64 -> float64
Null value changes:
highest_price     219622
first_appeared     55503
lowest_price      224137
last_appeared      55500
dtype: int64
Unique value changes:
times_appeared       -4
name             -39698
highest_price      -118
first_appeared       -3
lowest_price       -144
last_appeared        -3
dtype: int64

Statistics changed for column 'times_appeared':
               Dirty          Clean
count  423397.000000  423397.000000
mean        3.146794       3.146872
std        29.962122      29.962110
min        -6.000000       0.000000
25%         1.000000       1.000000
50%         1.000000       1.000000
75%         1.000000       1.000000
max      8484.000000    8484.000000

Statistics changed for column 'highest_price':
               Dirty          Clean
count 