
# COGS 108 - Data Checkpoint

# Names

- Dominic Duong
- Jacob Doan
- Gavin Nguyen
- Brandon Nghiem
- Luis Millan

# Research Question


**Is there a correlation between the demographic distribution of ethnic populations in San Diego neighborhoods and the types of ethnic cuisines offered by local restaurants within those neighborhoods?**



## Background and Prior Work


San Diego, like most major cities, has immense variety in its offering of restaurants throughout its neighborhoods. With specific areas like Convoy, Mira Mesa, and Rancho Peñasquitos being known for their diverse populations and restaurants, it is worth investigating whether there exists any kind of correlation between a neighborhood’s ethnic makeup and the cultures represented in local cuisine.

Our own observations as San Diego natives also sparked our curiosity. Convoy is well known locally for its dense concentration of Korean BBQ and other popular Asian restaurant types, such as hot pot, ramen, and dim sum. Even at a glance, the mix of restaurant types in that neighborhood is clearly skewed toward Asian cuisines. That imbalance led us to wonder whether Convoy’s population is itself predominantly Korean, Chinese, or Asian in general. We then began to question whether the cuisine types in other San Diego neighborhoods similarly reflect the ethnic makeup of their residents.

To aid us in our research, we looked over some of the COGS108 project repos to uncover any work that might be tackling a question similar to ours. The closest one we found was a project investigating racial and identity profiling in traffic stops. Although it did not make any mention of restaurants, it did use San Diego census data to draw a relationship involving ethnic composition. From the data, they concluded that Black and Native American drivers were stopped at a higher proportion than other races, showing that there did exist a relationship between ethnicity and traffic stops[<sup>1</sup>](https://github.com/COGS108/FinalProjects-Sp22/blob/main/FinalProject_group022.ipynb). The findings of their research, and usage of a census to gather information about ethnic populations, gave us an idea on how to approach our project. In our project, we will also investigate a correlation involving ethnicity, but instead try to connect it to local cuisine.

From our initial research, we’ve come to understand that there does exist a relationship between ethnic makeup and local cuisine. In a 2020 study by Jennifer Lee Boch, Tomás Jiménez, and J. Michael Roesler titled "Immigrant Exclusion and Ingroup Preferences: How Ethnic Demographic Change and Cultural Preferences Shape the Local Food Environment"[<sup>2</sup>](https://sociology.stanford.edu/sites/sociology/files/boch_jimenez_roesler_2020.pdf), the assimilation of Mexican cuisine into mainstream cuisine is studied in relation to the rising Mexican population. Although the authors primarily focus on the correlation between Mexican population and perceived importance of authenticity in Yelp reviews, their discussion of cuisine is strongly relevant. Specifically, they point out that in areas with a stronger Mexican community, Mexican food is more integrated into local mainstream cuisine. For our project, we would like to expand upon this by looking at various neighborhoods in San Diego and seeing if this pattern continues for other cultures.

### References

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Jiang, M., Bodhisartha, A., Tiu, J., Xu, J., & Shetty, D. (2022). COGS108 final project: Racial and ethnic disparities in traffic stops. *GitHub*. https://github.com/COGS108/FinalProjects-Sp22/blob/main/FinalProject_group022.ipynb
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Boch, J. L., Jiménez, T., & Roesler, J. M. (2020). Immigrant exclusion and ingroup preferences: How ethnic demographic change and cultural preferences shape the local food environment. *Stanford University*. https://sociology.stanford.edu/sites/sociology/files/boch_jimenez_roesler_2020.pdf



# Hypothesis



We believe the share of each ethnic population in a San Diego neighborhood is **strongly correlated** with the types of cuisines offered by local restaurants in that neighborhood. In other words, the more prevalent an ethnic population is in a particular neighborhood, the more local restaurants of the corresponding cuisine type we would expect to see.
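
To make "strongly correlated" concrete, the sketch below shows one way the relationship could be measured once both shares exist per zipcode. The numbers are made up purely for illustration and are not San Diego data; the column names `population_share` and `restaurant_share` are placeholders we introduce here, and Pearson and Spearman are only two of several measures we might end up using.

```python
import pandas as pd

# Made-up illustrative values, NOT real San Diego data.
toy = pd.DataFrame({
    "zipcode": ["A", "B", "C", "D"],
    # Share of residents belonging to one ethnic group (hypothetical).
    "population_share": [0.10, 0.20, 0.30, 0.40],
    # Share of restaurants serving the matching cuisine (hypothetical).
    "restaurant_share": [0.05, 0.25, 0.20, 0.50],
})

# Pearson correlation measures the linear relationship between the two shares.
pearson = toy["population_share"].corr(toy["restaurant_share"])

# Spearman correlation is the Pearson correlation of the ranks,
# so it only asks whether the two shares rise together.
spearman = toy["population_share"].rank().corr(toy["restaurant_share"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

A value near 1 for a given group would be consistent with our hypothesis, while a value near 0 would not.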



# Data

## Data overview

- Dataset #1
  - Dataset Name: San Diego Food Database: COVID-19 Edition
  - Link to the dataset: https://docs.google.com/spreadsheets/d/1omC5ZEUo1KHQHthAvcO4kDOidThTqMQiJd9ewAM45s0/htmlview#
  - Number of observations: 438
  - Number of variables: 9


Dataset #1 is the main source of restaurants for our analysis. It was compiled by Vietca, the author of a blog post titled "Cravings While Quarantined: Ultimate Food Guide". Vietca notes that the list draws on highly rated restaurant spots inspired by San Diego Magazine and adds more mom-and-pop shops, which should make it more representative of each neighborhood. The key variables are the restaurant name, neighborhood, and cuisine. Each row contains the following 9 variables:

- Entity: name of the restaurant (hyperlinked to its exact location)
- Neighborhood: San Diego sub-region
- Cuisine: type of food (culture)
- Cuisine.1: a second cuisine variable, used when the restaurant mixes multiple cuisines (not common in this dataset)
- Take-Out: indicates whether the restaurant offers takeout
- Delivery: indicates whether the restaurant offers delivery
- Hat Tip: the person who suggested the restaurant to the author (irrelevant for our analysis)
- Last Confirmed: the date the restaurant's status was last updated
- Notes: free-text notes (empty in the rows previewed in this notebook)
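
Because a restaurant can carry two cuisine labels, counting cuisines is easier if each label gets its own row. The following is a minimal sketch of that reshaping, using two rows taken from the dataset; `Cuisine.1` is the name pandas gives the second cuisine column when the sheet is loaded, and `cuisine_label` is a column name we made up for this example.

```python
import pandas as pd

# Two rows taken from the restaurant sheet.
sample = pd.DataFrame({
    "Entity": ["101 Bagels & Subs", "Ajisen Ramen San Diego"],
    "Neighborhood": ["Oceanside", "Kearny Mesa"],
    "Cuisine": ["Bagels", "Japanese"],
    "Cuisine.1": ["Cafe", None],
})

# Stack both cuisine columns into one, then drop rows with no second cuisine.
long_cuisines = (
    sample.melt(
        id_vars=["Entity", "Neighborhood"],
        value_vars=["Cuisine", "Cuisine.1"],
        value_name="cuisine_label",
    )
    .dropna(subset=["cuisine_label"])
    .drop(columns="variable")
    .sort_values(["Entity", "cuisine_label"])
    .reset_index(drop=True)
)

print(long_cuisines)
```

With this layout, a restaurant with two cuisines contributes one row per cuisine, which we would need to keep in mind when computing proportions.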

We plan to wrangle this dataset by adding a zipcode column that corresponds to each neighborhood, so that we can join it with our SANDAG population dataset (which is standardized by zipcode).

We plan to combine the datasets by comparing the proportion of restaurants serving each cuisine in a region with the proportion of each ethnic population in that region from the SANDAG dataset.
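
The comparison described above might look roughly like the sketch below. Everything about the population table is an assumption: we have not loaded the SANDAG data yet, so the `group` and `residents` columns and their values are placeholders, and the `cuisine_to_group` mapping is a hypothetical, deliberately incomplete example of how cuisines could be linked to population groups.

```python
import pandas as pd

# Restaurant rows with a zipcode already attached.
restaurants = pd.DataFrame({
    "Zipcode": ["92123", "92123", "92123", "92054"],
    "Cuisine": ["American", "Japanese", "Mediterranean", "Bagels"],
})

# Hypothetical stand-in for the SANDAG table: column names and counts are placeholders.
population = pd.DataFrame({
    "Zipcode": ["92123", "92123", "92054"],
    "group": ["Asian", "Hispanic", "Asian"],
    "residents": [300, 700, 100],
})

# Share of each cuisine among the restaurants in a zipcode.
cuisine_share = (
    restaurants.groupby("Zipcode")["Cuisine"]
    .value_counts(normalize=True)
    .rename("restaurant_share")
    .reset_index()
)

# Share of each group among the residents in a zipcode.
population["population_share"] = (
    population["residents"]
    / population.groupby("Zipcode")["residents"].transform("sum")
)

# Hypothetical, incomplete mapping from cuisine label to population group.
cuisine_to_group = {"Japanese": "Asian"}
cuisine_share["group"] = cuisine_share["Cuisine"].map(cuisine_to_group)
group_share = (
    cuisine_share.dropna(subset=["group"])
    .groupby(["Zipcode", "group"], as_index=False)["restaurant_share"]
    .sum()
)

# One row per (zipcode, group) with both shares side by side.
comparison = population.merge(group_share, on=["Zipcode", "group"], how="left")
comparison["restaurant_share"] = comparison["restaurant_share"].fillna(0.0)

print(comparison)
```

The left merge keeps every population group, so groups with no matching restaurants show a restaurant share of zero instead of disappearing from the table.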

## San Diego Food Database: COVID-19 Edition

In [5]:
import pandas as pd
import seaborn as sns
import requests

In [6]:
url = "https://docs.google.com/spreadsheets/d/1NW2RJU9d5Sphr5GCJuzuRKxlkCU4Od5oSd-UcgyA9gU/htmlview".replace('/htmlview', '/export?format=csv')
df = pd.read_csv(url)

df.head(20)

| Unnamed: 0 | Entity | Neighborhood | Cuisine | Cuisine.1 | Take-Out | Delivery | Hat Tip | Last Confirmed | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 Zone Billiard Cafe | Kearny Mesa | American | | Yes | Yes | | 4/6/20 | |
| 1 | 101 Bagels & Subs | Oceanside | Bagels | Cafe | Yes | Yes | | 3/20/20 | |
| 2 | 101 Diner Inc., The | Encinitas | American | | Yes | No | | 3/20/20 | |
| 3 | 3 Punk Ales Brewing Co. | Chula Vista | Brewery | | Yes | No | | 3/20/20 | |
| 4 | 619 Spirits Distillery & Tasting Room | North Park | Distillery | | Yes | Yes | | 3/20/20 | |
| 5 | Achilles Coffee Roasters | Downtown SD | Coffee | Cafe | Yes | Yes | | 3/20/20 | |
| 6 | Ajisen Ramen San Diego | Kearny Mesa | Japanese | | Yes | Yes | | 4/6/20 | |
| 7 | Al's Cafe in the Village | Carlsbad | Cafe | | Yes | Yes | | 3/20/20 | |
| 8 | Aladdin Mediterranean Restaurant | Kearny Mesa | Mediterranean | | Yes | Yes | | 3/20/20 | |
| 9 | AleSmith Brewing Co. | Miramar | Brewery | | Yes | No | | 3/20/20 | |
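
Before cleaning, a few quick checks can confirm that the loaded sheet matches what we reported in the data overview. The sketch below rebuilds three rows from the preview above so it runs without network access; on the real data the same calls would be applied to `df`. Parsing `Last Confirmed` as a date is our own suggestion and not something the notebook does yet.

```python
import pandas as pd

# Rows copied from the preview above so the sketch runs offline.
preview = pd.DataFrame({
    "Entity": ["0 Zone Billiard Cafe", "101 Bagels & Subs", "3 Punk Ales Brewing Co."],
    "Neighborhood": ["Kearny Mesa", "Oceanside", "Chula Vista"],
    "Cuisine": ["American", "Bagels", "Brewery"],
    "Cuisine.1": [None, "Cafe", None],
    "Last Confirmed": ["4/6/20", "3/20/20", "3/20/20"],
})

# Number of observations and variables.
n_rows, n_cols = preview.shape

# Missing values per column (Cuisine.1 is expected to be mostly empty).
missing_per_column = preview.isna().sum()

# Convert the month/day/two-digit-year strings into real dates.
preview["Last Confirmed"] = pd.to_datetime(preview["Last Confirmed"], format="%m/%d/%y")

print(n_rows, n_cols)
print(missing_per_column)
print(preview.dtypes)
```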


In [8]:
# Cleaning approach:
#   1. List the unique neighborhoods.
#   2. Build a dictionary that maps each neighborhood name to a zip code.
#   3. Replace each neighborhood name with its zip code in a new 'Zipcode' column.

df['Zipcode'] = df['Neighborhood']

# List the unique neighborhoods and print them as empty dictionary entries
# that can be pasted into the dictionary below and filled in by hand.
# (Asked ChatGPT how to print them cleanly with quotation marks.)
unique_neighborhoods = df['Neighborhood'].unique()
for neighborhood in unique_neighborhoods:
    print(f"'{neighborhood}': '',")

# For neighborhoods with multiple zip codes, assign the zip code most commonly
# listed for the restaurants in that neighborhood.
# If there is a tie, count only the restaurants that are not closed.

# Known limitation: some zip codes will have few restaurants to represent them.
# 3/4 of the Chula Vista restaurants are closed.

# Notes:
# Mira Mesa is 92126; Sorrento Valley (mainly businesses) is 92121.
# Convoy is Kearny Mesa + Clairemont (92111, 92123).
# San Marcos is split evenly between 92078 and 92069.
# We may want to drop neighborhoods that only have one restaurant.
neighborhood_to_zip = {
    'Kearny Mesa': '92123',
    'Oceanside': '92054',
    'Encinitas': '92024',
    'Chula Vista': '91910', #consider discarding, the majority of the Chula locations are closed
    'North Park': '92104',
    'Downtown SD': '92101',
    'Carlsbad': '92008', #92008 chosen for restaurant location, but residential area is 92009
    'Miramar': '92126',
    'College Area': '92115',
    'La Mesa': '91942',
    'Hillcrest': '92103',
    'Barrio Logan': '92113',
    'East Village': '92101',
    'Bay Park': '92117',
    'Crown Point': '92109',
    'Fallbrook': '92028',
    'Little Italy': '92101',
    'Multiple Locations': 'N/A',
    'Vista': '92083',
    'Del Mar': '92014',
    'Mission Valley': '92108',
    'South Park': '92102',
    'Carmel Valley': '92130',
    'Old Town': '92110',
    'Sorrento Valley': '92121',
    'La Jolla': '92037', #UCSD is 92093
    'Mission Hills': '92103',
    'Point Loma': '92106',
    'Solana Beach': '92075',
    'Kensington': '92116',
    'Rancho Bernardo': '92128',
    'El Cajon': '92020',
    'University Heights': '92116',
    'Normal Heights': '92116',
    'University City/UTC': '92122',
    'Banker\'s Hill': '92103',
    'San Marcos': '92078',
    'Temecula': '92592',
    'Scripps Ranch': '92131',
    'Santa Ysabel': '92070',
    'Lemon Grove': '91945',
    'Golden Hill': '92102',
    'Clairemont': '92111',
    'Ocean Beach': '92107',
    'Bonita': '91902',
    'Julian': '92036',
    'Poway': '92064',
    'Chollas View': '92102',
    'Leucadia': '92024',
    'Mission Beach': '92109',
    'National City': '91950',
    'Pacific Beach': '92109',
    'Tijuana, Baja': 'N/A',
    'City Heights': '92105',
    'Linda Vista': '92111',
    'Mira Mesa': '92126',
    'Spring Valley': '91977',
    'Escondido': '92025',
    'San Ysidro': '92173',
    'Ramona': '92065',
}



'Kearny Mesa': '',
'Oceanside': '',
'Encinitas': '',
'Chula Vista': '',
'North Park': '',
'Downtown SD': '',
'Carlsbad': '',
'Miramar': '',
'College Area': '',
'La Mesa': '',
'Hillcrest': '',
'Barrio Logan': '',
'East Village': '',
'Bay Park': '',
'Crown Point': '',
'Fallbrook': '',
'Little Italy': '',
'Multiple Locations': '',
'Vista': '',
'Del Mar': '',
'Mission Valley': '',
'South Park': '',
'Carmel Valley': '',
'Old Town': '',
'Sorrento Valley': '',
'La Jolla': '',
'Mission Hills': '',
'Point Loma': '',
'Solana Beach': '',
'Kensington': '',
'Rancho Bernardo': '',
'El Cajon': '',
'University Heights': '',
'Normal Heights': '',
'University City/UTC': '',
'Banker's Hill': '',
'San Marcos': '',
'Temecula': '',
'Scripps Ranch': '',
'Santa Ysabel': '',
'Lemon Grove': '',
'Golden Hill': '',
'Clairemont': '',
'Ocean Beach': '',
'Bonita': '',
'Julian': '',
'Poway': '',
'Chollas View': '',
'Leucadia': '',
'Mission Beach': '',
'National City': '',
'Pacific Beach': '',
'Tijuana, Baja': '',
'Ci
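
The tie-breaking rule described in the cell comments (use the most common zip code, and on a tie prefer the one with more open restaurants) was applied by hand. A minimal sketch of how it could be automated follows. It assumes two columns that do not exist in the sheet as loaded: `restaurant_zip`, a per-restaurant zip code, and `is_open`, an open/closed flag. The rows are made up to reproduce the San Marcos and Carlsbad situations noted in the comments.

```python
import pandas as pd

# Hypothetical per-restaurant zip codes and open/closed flags.
restaurants = pd.DataFrame({
    "Neighborhood": ["San Marcos"] * 4 + ["Carlsbad"] * 3,
    "restaurant_zip": ["92078", "92078", "92069", "92069", "92008", "92008", "92009"],
    "is_open": [True, True, True, False, True, True, True],
})


def pick_zip(group: pd.DataFrame) -> str:
    """Return the most common zip code, breaking ties by open restaurants."""
    counts = group["restaurant_zip"].value_counts()
    tied = counts[counts == counts.max()].index
    if len(tied) == 1:
        return tied[0]
    # Tie: count only the restaurants that are still open.
    still_open = group[group["is_open"] & group["restaurant_zip"].isin(tied)]
    if still_open.empty:
        return sorted(tied)[0]  # arbitrary but deterministic fallback
    return still_open["restaurant_zip"].value_counts().idxmax()


neighborhood_zip = {
    name: pick_zip(group) for name, group in restaurants.groupby("Neighborhood")
}
print(neighborhood_zip)
```

If the open-restaurant counts are also tied, this sketch simply takes whichever zip code `idxmax` returns first, so a real implementation would need an explicit rule for that case.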

In [9]:
df['Zipcode'] = df['Zipcode'].replace(neighborhood_to_zip)
df.head(20)

| Unnamed: 0 | Entity | Neighborhood | Cuisine | Cuisine.1 | Take-Out | Delivery | Hat Tip | Last Confirmed | Notes | Zipcode |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 Zone Billiard Cafe | Kearny Mesa | American | | Yes | Yes | | 4/6/20 | | 92123 |
| 1 | 101 Bagels & Subs | Oceanside | Bagels | Cafe | Yes | Yes | | 3/20/20 | | 92054 |
| 2 | 101 Diner Inc., The | Encinitas | American | | Yes | No | | 3/20/20 | | 92024 |
| 3 | 3 Punk Ales Brewing Co. | Chula Vista | Brewery | | Yes | No | | 3/20/20 | | 91910 |
| 4 | 619 Spirits Distillery & Tasting Room | North Park | Distillery | | Yes | Yes | | 3/20/20 | | 92104 |
| 5 | Achilles Coffee Roasters | Downtown SD | Coffee | Cafe | Yes | Yes | | 3/20/20 | | 92101 |
| 6 | Ajisen Ramen San Diego | Kearny Mesa | Japanese | | Yes | Yes | | 4/6/20 | | 92123 |
| 7 | Al's Cafe in the Village | Carlsbad | Cafe | | Yes | Yes | | 3/20/20 | | 92008 |
| 8 | Aladdin Mediterranean Restaurant | Kearny Mesa | Mediterranean | | Yes | Yes | | 3/20/20 | | 92123 |
| 9 | AleSmith Brewing Co. | Miramar | Brewery | | Yes | No | | 3/20/20 | | 92126 |
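
One thing to watch with `replace` is that any neighborhood missing from the dictionary keeps its name in the `Zipcode` column without raising an error. The sketch below shows a possible check using `map`, which turns unmapped names into missing values instead. The dictionary is a three-entry excerpt of ours, and "Unlisted Neighborhood" is a made-up name standing in for a neighborhood we forgot to add.

```python
import pandas as pd

# Excerpt of the neighborhood dictionary; the full version is defined earlier.
neighborhood_to_zip = {
    "Kearny Mesa": "92123",
    "Oceanside": "92054",
    "Multiple Locations": "N/A",
}

sample = pd.DataFrame({
    "Neighborhood": [
        "Kearny Mesa",
        "Oceanside",
        "Multiple Locations",
        "Unlisted Neighborhood",  # hypothetical name that is not in the dictionary
    ]
})

# Treat 'N/A' entries as having no zip code at all.
zip_lookup = {name: z for name, z in neighborhood_to_zip.items() if z != "N/A"}

# map() returns NaN for any neighborhood that is not in the lookup.
sample["Zipcode"] = sample["Neighborhood"].map(zip_lookup)

# Neighborhoods that still need a decision: add a zip code or drop the rows.
unmapped = sample.loc[sample["Zipcode"].isna(), "Neighborhood"].unique().tolist()
print(unmapped)
```

Rows without a zip code cannot be joined to zipcode-level population data, so listing them explicitly makes it clear what would be excluded from the analysis.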


In [10]:
df.to_csv('SDFoodDB_zipcode.csv', index=False)
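
The cleaning notes raise the concern that some zip codes are represented by very few restaurants. One possible way to handle this, sketched below, is to drop zip codes under a minimum count before analysis. The threshold `MIN_RESTAURANTS` is a hypothetical value we have not settled on, and the rows are a small sample, not the full dataset.

```python
import pandas as pd

# Small sample of restaurant rows with zip codes attached.
sample = pd.DataFrame({
    "Entity": [
        "0 Zone Billiard Cafe",
        "Ajisen Ramen San Diego",
        "Aladdin Mediterranean Restaurant",
        "101 Bagels & Subs",
        "3 Punk Ales Brewing Co.",
    ],
    "Zipcode": ["92123", "92123", "92123", "92054", "91910"],
})

MIN_RESTAURANTS = 2  # hypothetical threshold, to be decided as a group

# Count restaurants per zip code and keep only the well-represented ones.
counts = sample["Zipcode"].value_counts()
keep = counts[counts >= MIN_RESTAURANTS].index
filtered = sample[sample["Zipcode"].isin(keep)].reset_index(drop=True)

print(counts)
print(filtered)
```

Filtering like this trades coverage for reliability, so we would report how many zip codes and restaurants were removed.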


# Ethics & Privacy

When looking at the correlation between neighborhoods' ethnic compositions and the types of cuisines offered, several data biases have to be considered. Some types of restaurants may be less likely to appear on platforms like Yelp or Google Maps due to the limited online presence of specific customer demographics. For example, establishments that cater to local communities or those with little online presence may be underrepresented. This could affect cuisines traditionally served by smaller, family-owned establishments that might lack the resources or access needed for online marketing.

Cuisine classifications on platforms like Yelp or Google Maps often oversimplify, grouping diverse cuisines under general categories (e.g., "Asian," "American," or "Latin American"). This can mask the true diversity of offerings, particularly when a neighborhood offers a variety of cuisines that do not fit these categories. Our dataset was also collected during COVID-19, which upended the restaurant industry, so it may reflect a temporary or atypical snapshot of restaurant availability. This can include closures, shifting business models (such as delivery-only), or menu changes made in response to supply chain issues. Restaurants with more economic leverage may have fared better during this time, so small, independently owned restaurants that lacked the financial leeway to stay open may be underrepresented in the data.

Platforms like Yelp and Google Maps rely on user-generated content, meaning specific neighborhoods with more tech-savvy or affluent populations might have more extensive or detailed listings. This can lead to more data for some areas while underrepresenting others, especially low-income neighborhoods where restaurants have fewer reviews and less consistent online coverage.

In addition to these biases, ethical considerations must be addressed. Ensuring data privacy is crucial; while aggregated regional data mitigates some privacy concerns, it is essential to avoid inadvertently revealing any personally identifiable information. Complying with terms of use agreements for datasets from government or law enforcement agencies is also necessary to avoid legal issues. Transparency in data reporting is vital to prevent selective reporting or data manipulation. Acknowledging potential biases in the dataset and striving to mitigate them is essential. Lastly, considering fairness and equity, we should prevent our analysis from perpetuating stereotypes or justifying discriminatory practices, ensuring that our findings do not reinforce existing inequalities or biases.

# Team Expectations


### 1) **Communication**
  - Be clear and punctual with our availability and weekly effort through our messaging group chat so we can communicate what needs to be done
  - Answer each other within 24 hours
  - Weekly meeting time: 6-7 PM (used to check in with each other and work together if needed)
### 2) **Tone**
  - Blunt but polite, not hostile, but should be able to critique each other’s work and offer improvements
### 3) **Decision-Making**
  - Ideally unanimous, but if need be, majority decision
  - All decisions should be made as a group
### 4) **Tasks**
  - Delegated evenly amongst each other
  - Refer to our timeline for deadlines and use iMessage to decide who completes what tasks for the week
### 5) **Conflicts**
  - Bring it up at weekly meetings or over text
  - Resolve rather than let it linger


# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting | Success |
|---|---|---|---|---|
| 10/30  |  8 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions, do background research on topic, Edit, finalize, and submit proposal | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research, Discuss ideal dataset(s) and ethics | DONE |
| 11/7  |  6-7 PM |  Search for datasets (Everyone) | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part. Assign who cleans/wrangles each data set | DONE |
| 11/14  | 6-7 PM  | Clean and Wrangle Data Sets | Share our cleaned data, potential edits and what we found | |
| 11/21  | 6-7 PM  | Wrangling/Data (Finalize wrangling/EDA) | Review/Edit wrangling/EDA; Discuss Analysis Plan | |
| 11/28  | 6-7 PM  | Begin Analysis | Discuss/edit Analysis; Complete project check-in | |
|  12/6 | 6-7 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project| |
| 12/11  | Before 11:59 PM  |NA| Turn in Final Project & Group Project Surveys | |
