# CS 513 Final Project Phase II Report (Team 210 Wizards of Illinois Place)

**Team Members**

- Blake McBride (blakepm2@illinois.edu)
- Abdelrahman Hamdan (ah57@illinois.edu)
- Anshul Gonswami (ashlug3@illinois.edu)

## I. Description of Data Cleaning Performed

Below we outline the actual steps of data cleaning that were performed for our use case of standardizing dish names.

1) Convert to title case. Many of the fields listed below were stored in all capitals, which made the values harder to read and compare.
- Name, Sponsor, Event, Venue, Place, Physical Description, Easter, etc.

2) Convert notes to lowercase. The notes column in menu.csv contains long free-text descriptions, so we lowercased them for consistency.

3) Cluster Events. The same event often appears under many different spellings (Dinner, for example, has 10 different variations), so merging them makes the column much clearer.

4) Convert date strings to the date type so that analysis software correctly recognizes the columns as dates.

5) Most values in Dish.csv are numeric, so we converted the appropriate columns to a number type.

6) Repeated step 5 for the other sheets.

7) Removed menu items priced above $100,000, since a price that high is not reasonable for a menu item.

8) Used a text facet and clustering on Venue, for the same reason as step 3: we do not want several spellings representing the same venue.

9) Used a date facet on First_Appeared_Year and Last_Appeared in Dish.csv to remove years greater than 2025, since a menu item cannot have appeared in a year later than 2025 (as of the current date).
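
The steps above were carried out in OpenRefine. As a rough cross-check outside of that tool, the following pandas sketch approximates several of them. It is not the workflow we ran: the function names, the column names (`event`, `notes`, `date`, `price`, `first_appeared`, `last_appeared`), and the simplified `fingerprint` key (a loose stand-in for OpenRefine's key-collision clustering) are all assumptions made for illustration.

```python
import string

import pandas as pd

MAX_PRICE = 100_000  # step 7 cutoff, in dollars
MAX_YEAR = 2025      # step 9 cutoff

_PUNCT = str.maketrans("", "", string.punctuation)


def fingerprint(value: str) -> str:
    """Simplified clustering key: lowercase, drop punctuation,
    then sort and de-duplicate the whitespace-separated tokens."""
    tokens = value.strip().lower().translate(_PUNCT).split()
    return " ".join(sorted(set(tokens)))


def cluster_column(series: pd.Series) -> pd.Series:
    """Replace each value with the most frequent spelling that shares its key."""
    keys = series.map(fingerprint, na_action="ignore")
    canonical = series.groupby(keys).agg(lambda s: s.value_counts().idxmax())
    return keys.map(canonical)


def clean_menu(menu: pd.DataFrame) -> pd.DataFrame:
    out = menu.copy()
    out["event"] = cluster_column(out["event"].str.title())      # steps 1 and 3
    out["notes"] = out["notes"].str.lower()                       # step 2
    out["date"] = pd.to_datetime(out["date"], errors="coerce")    # step 4
    return out


def clean_menu_item(items: pd.DataFrame) -> pd.DataFrame:
    out = items.copy()
    out["price"] = pd.to_numeric(out["price"], errors="coerce")   # step 6
    # Step 7: drop implausible prices; rows with a missing price are kept (assumption).
    return out[~(out["price"] > MAX_PRICE)]


def clean_dish(dish: pd.DataFrame) -> pd.DataFrame:
    out = dish.copy()
    for col in ["first_appeared", "last_appeared", "lowest_price", "highest_price"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")       # step 5
    # Step 9: drop rows whose first or last year lies in the future.
    too_late = (out["first_appeared"] > MAX_YEAR) | (out["last_appeared"] > MAX_YEAR)
    return out[~too_late]
```

Unparseable dates and numbers become missing values here (`errors="coerce"`) rather than raising, which roughly mirrors how cells that fail a type conversion are left for manual review.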

## II. Document Data Quality Changes

In [1]:
import pandas as pd
import pandasql as psql
from helpers import read_in_data

In [2]:
dish_df, menu_df, menu_item_df, menu_page_df = read_in_data()

In [3]:
dish_df.head()

| | id | name | description | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Consomme printaniere royal | | 8 | 8 | 1897 | 1927 | 0.2 | 0.4 |
| 1 | 2 | Chicken gumbo | | 111 | 117 | 1895 | 1960 | 0.1 | 0.8 |
| 2 | 3 | Tomato aux croutons | | 13 | 13 | 1893 | 1917 | 0.25 | 0.4 |
| 3 | 4 | Onion au gratin | | 41 | 41 | 1900 | 1971 | 0.25 | 1.0 |
| 4 | 5 | St. Emilion | | 66 | 68 | 1881 | 1981 | 0.0 | 18.0 |


In [4]:
dish_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423397 entries, 0 to 423396
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              423397 non-null  int64  
 1   name            423397 non-null  object 
 2   description     0 non-null       float64
 3   menus_appeared  423397 non-null  int64  
 4   times_appeared  423397 non-null  int64  
 5   first_appeared  423397 non-null  int64  
 6   last_appeared   423397 non-null  int64  
 7   lowest_price    394297 non-null  float64
 8   highest_price   394297 non-null  float64
dtypes: float64(3), int64(5), object(1)
memory usage: 29.1+ MB


## Queries 

### Query 1: Identifying the count of distinct dish names before/after standardization

In [None]:
query = """
select distinct name
  from dish_df
order by 1 asc
"""

result = psql.sqldf(query, globals())

In [None]:
result
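
Query 1 as written returns the distinct names for a single DataFrame. One way to get the before/after comparison named in the heading is to count distinct names in both frames side by side. The sketch below uses plain pandas rather than pandasql, and the function name and toy frames are illustrative assumptions; in practice the two arguments would be `dish_df` and a DataFrame holding the OpenRefine-cleaned export, which is not defined in the cells above.

```python
import pandas as pd


def distinct_name_summary(before: pd.DataFrame, after: pd.DataFrame,
                          column: str = "name") -> dict:
    """Count distinct values of `column` before and after cleaning."""
    n_before = before[column].nunique()
    n_after = after[column].nunique()
    merged = n_before - n_after
    return {
        "before": n_before,
        "after": n_after,
        "merged": merged,
        "pct_reduction": 100 * merged / n_before if n_before else 0.0,
    }


# Toy frames standing in for the original and cleaned Dish data.
toy_before = pd.DataFrame(
    {"name": ["Chicken gumbo", "Chicken Gumbo", "CHICKEN GUMBO", "Onion au gratin"]}
)
toy_after = pd.DataFrame(
    {"name": ["Chicken Gumbo", "Chicken Gumbo", "Chicken Gumbo", "Onion Au Gratin"]}
)
summary = distinct_name_summary(toy_before, toy_after)
print(summary)
```

The same figure could be obtained in SQL with `select count(distinct name)` against each frame.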

### Query 2: Analyzing the average price of dishes in the catalog before/after standardization

In [None]:
query = """
-- average of the highest recorded price for each dish name
select name
    , avg(highest_price) as avg_price
  from dish_df
group by name
order by avg_price desc
"""

result = psql.sqldf(query, globals())

In [None]:
result.head()
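
To compare average prices before and after standardization, the same aggregation can be run on both the original and the cleaned frame. The following is a minimal pandas sketch of that idea; the helper name and the toy data are assumptions for illustration, and the cleaned frame is hypothetical since it is not loaded in the cells above.

```python
import pandas as pd


def avg_price_by_name(df: pd.DataFrame, price_col: str = "highest_price") -> pd.DataFrame:
    """Average `price_col` per dish name, highest average first.
    Missing prices are skipped, as SQL's avg() skips NULLs."""
    return (
        df.groupby("name", as_index=False)[price_col]
        .mean()
        .rename(columns={price_col: "avg_price"})
        .sort_values("avg_price", ascending=False, ignore_index=True)
    )


# Toy data: two spellings of one dish are merged in the "after" frame.
toy_before = pd.DataFrame({
    "name": ["Chicken gumbo", "CHICKEN GUMBO", "Onion au gratin"],
    "highest_price": [0.8, 1.2, 0.5],
})
toy_after = toy_before.assign(
    name=["Chicken Gumbo", "Chicken Gumbo", "Onion Au Gratin"]
)

before_avg = avg_price_by_name(toy_before)
after_avg = avg_price_by_name(toy_after)
```

In the toy data, the two gumbo rows are averaged separately before cleaning and together afterwards, which is the effect the before/after comparison is meant to surface.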

## III. Create a Workflow Model

<p align="center">
    <img src="OpenRefineFlow-1.png">
</p>

## IV. Conclusions & Summary

To summarize, in this project we set out to address the data quality problem of repeated dish names in the Dish dataset by standardizing dish names. Without standard naming conventions in place, it is impossible to accurately analyze the popularity and pricing trends of dishes. Dishes whose names are misspelled, or that use different terminology to refer to the same thing, create erroneous data with separate line items for products that should be examined together. To promote standardization, we leverage OpenRefine to merge similarly named dishes under the assumption that differences between certain dish names reflect true data quality issues. To quantify our results and determine whether our approach did improve data quality, we use Python: pandas reads our original and cleaned datasets into DataFrames, and pandasql executes the queries that measure data quality. We find that...

For this project, we divided the work evenly among our group members based on our respective domain expertise. Anshul, having the most experience and comfort with OpenRefine, handled the merging of dish names and our data cleaning steps within that platform. Blake and Abdelrahman, being more comfortable with analysis in Python, handled the queries and the measurement framework used to examine and quantify the results.