# Kitchen Dreams 
An Examination of the Tri-State Area Cuisine in New York Using Census and Yelp Data. 

![Rooftop Restaurant NYC](https://www.therooftopguide.com/rooftop-news/Bilder/rooftop-restaurant-nyc-230-1.jpg)

## Table of Contents
1. [Abstract](#abstract)
2. [Introduction](#introduction)
3. [Research Approach](#research-approach)
4. [Data Preparation](#data-preparation)
5. [Exploratory Data Analysis (EDA)](#eda)
6. [Prepped Data Review](#prepped-data-review)
7. [Investigative Analysis & Results](#investigative-analysis-results)
8. [Conclusions](#conclusions)

## Abstract <a id="abstract"></a>

This research investigates the complex interactions of economic, demographic, and culinary variables in the New York Tri-State region. We use Yelp data and Census information to uncover the stories behind the area's restaurants because we are curious about how different cultures and economic levels affect dining preferences. Our research questions cover five topics: how geography affects Yelp reviews; how economic status and restaurant types are correlated; how demographic factors affect restaurant success; where the best places for new businesses are; and how popularity, pricing, and ratings relate to each other.

Data collection draws on the Yelp API for restaurant details and on Census.gov for demographic and economic data. We carry out univariate, bivariate, and multivariate analyses using Python libraries such as Beautiful Soup, Pandas, Matplotlib, and Seaborn. The investigative analysis aims to identify the economic factors that influence culinary preferences, provide recommendations for aspiring restaurateurs based on current trends, and shed light on the relationship between restaurant ratings and demographic composition.

We put a high priority on data integrity throughout the project, carry out in-depth exploratory data analysis (EDA), and apply data preparation strategies. The results include a thorough examination of how socioeconomic variables influence food choices in the New York Tri-State region, a sophisticated grasp of the culinary environment, and practical advice for restaurant owners. This project exemplifies how to combine extensive data sources with Python's analytical powers to unravel the complexity of the regional food scene.

## Introduction <a id="introduction"></a>

This project explores the thriving culinary scene in the Tri-State area of New York. It asks how the area's diverse range of cultures and economic circumstances affects where and what people eat. By examining demographic data and Yelp ratings, we aim to learn the narratives behind the area's restaurants. Consider it an exploration of how the local food scene is shaped by the people and their backgrounds across the neighborhoods of New York and New Jersey. The project is organized around five research questions:

1. **Geographical Impact on Yelp Ratings:**
   - How does the specific location within the Tri-State area influence a restaurant's Yelp rating, especially concerning particular cuisines?
  
2. **Economic Status and Restaurant Types:**
   - Is there a discernible correlation between the economic status of a neighborhood and the types of restaurants it hosts?

3. **Demographic Influences on Restaurant Success:**
   - To what extent do demographic factors, such as ethnic composition, contribute to the success of various restaurant types?

4. **Optimal Locations for New Restaurants:**
   - What are the current trends, and where are the optimal locations for opening new restaurants based on these trends?

5. **Cuisine Pricing, Ratings, and Popularity:**
   - How is the pricing of different cuisines related to their Yelp ratings and overall popularity in the Tri-State area?

In exploring these questions, we aim to uncover the unique narratives behind the diverse restaurants that populate the Tri-State area, offering insights into how the local food scene is intricately woven into the fabric of its people and their backgrounds.

## Expected Results

- Information on how restaurant ratings and demographic composition relate to one another.
- Determining the economic variables affecting taste in food.
- Suggestions for potential restaurateurs regarding the locations of new businesses.

## Timetable

**Week 1:** Gathering data and integrating APIs. 

**Week 2:** Cleaning data and conducting preliminary analysis.

**Week 3:** Comprehensive data visualization and analysis.

**Week 4:** Report assembly and delivery.

## Importance & Inspiration for Research

Restaurant owners, food fans, and culinary specialists will all gain from this study's insightful analysis of the sociodemographically driven gastronomic tastes of the New York Tri-State region.

## Research Approach <a id="research-approach"></a>

#### Data Collection:

- **Yelp API:** To gather data on restaurant ratings, cuisine types, pricing, and customer reviews from the New York Tri-State area. [Yelp API Documentation](https://docs.developer.yelp.com/docs/fusion-intro)
  - Restaurants will span different cuisine categories, have mixed ratings on Yelp, and carry a '$$' price tag.
  
- **Census.gov (CSV):** To obtain demographic and economic data for the same region, enabling a comprehensive analysis of how these factors relate to food preferences and restaurant success. [Census Data for New York Tri-State Area](https://www.census.gov/quickfacts/fact/table/bergencountynewjersey,richmondcountynewyork,bronxcountynewyork,queenscountynewyork,kingscountynewyork,newyorkcountynewyork/PST045222)

  - **Data Cleaning File:**
    - Created an Excel file for data cleaning, containing essential variables such as rating, price, zip_code, categories, cuisine, and borough. This file facilitated the initial stages of data preparation and ensured consistency across relevant attributes.
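
The collection step above can be sketched as follows. This is a minimal illustration, not the project's actual collection code: it assumes the Yelp Fusion business search endpoint (`/v3/businesses/search`) with a bearer-token header, and the helper names `build_search_request` and `flatten_businesses`, the `"Not Available"` default, and the sample payload are all hypothetical, chosen to mirror the columns of the cleaning file.

```python
from urllib.parse import urlencode

YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_request(api_key, location, cuisine, price="2", limit=50):
    """Return (url, headers) for one business search (hypothetical helper)."""
    params = {
        "term": "restaurants",
        "location": location,
        "categories": cuisine,  # Yelp category alias, e.g. "italian"
        "price": price,         # "2" corresponds to the '$$' tier
        "limit": limit,
    }
    headers = {"Authorization": f"Bearer {api_key}"}
    return f"{YELP_SEARCH_URL}?{urlencode(params)}", headers


def flatten_businesses(payload, cuisine, borough):
    """Turn a search response into flat rows matching the cleaning-file columns."""
    rows = []
    for biz in payload.get("businesses", []):
        rows.append({
            "name": biz.get("name"),
            "rating": biz.get("rating"),
            # Assumption: a missing price is recorded as "Not Available".
            "price": biz.get("price", "Not Available"),
            "zip_code": biz.get("location", {}).get("zip_code"),
            "categories": ", ".join(c["title"] for c in biz.get("categories", [])),
            "cuisine": cuisine,
            "borough": borough,
        })
    return rows


# Illustrative payload only; a real response would come from an HTTP GET
# on the URL and headers built above.
sample_payload = {
    "businesses": [
        {"name": "Cent'Anni", "rating": 4.0, "price": "$$",
         "location": {"zip_code": "11238"},
         "categories": [{"alias": "italian", "title": "Italian"}]},
        {"name": "Example Without Price", "rating": 4.5,
         "location": {"zip_code": "11201"},
         "categories": [{"alias": "italian", "title": "Italian"},
                        {"alias": "pizza", "title": "Pizza"}]},
    ]
}

url, headers = build_search_request("YOUR_API_KEY", "Brooklyn, NY", "italian")
rows = flatten_businesses(sample_payload, "Italian", "Brooklyn, NY")
```

Keeping the request construction separate from the flattening step means the parsing logic can be checked against a saved response without spending API calls.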

#### Approach to Data Management:

- **Version Control:** Git version control was used to keep track of changes, facilitate productive collaboration, and preserve an unambiguous project history.
- **Data Storage:** Organized directories to facilitate easy access and replication for both raw and processed data.
- **Documentation:** Transparency and reproducibility are ensured by recording data sources, preparation steps, and analytic procedures in Markdown cells.
- **Collaboration:** Promoted teamwork by utilizing technologies that made it easy to share and update analysis and code.

#### Investigative Data Examination:
- Used Python libraries (Pandas, Matplotlib, Seaborn) to conduct thorough EDA.
- Looked for trends and connections in the distribution of restaurant ratings, cuisine categories, and prices.
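
As a rough sketch of this kind of EDA, the snippet below summarizes ratings by price tier. It uses only the five sample rows shown in the restaurant table later in this notebook, so the numbers are illustrative rather than results; the `price_tier` column and the `tier_map` dictionary are assumptions introduced here.

```python
import pandas as pd

# Five sample rows copied from the restaurants table shown later in the notebook.
sample = pd.DataFrame({
    "name": ["Piccoli Trattoria", "Osteria Brooklyn", "Barbalu - Brooklyn",
             "Cent'Anni", "Ammazzacaffè"],
    "rating": [4.5, 4.5, 4.5, 4.0, 4.5],
    "price": ["$$", "$$$", "Not Available", "$$", "$$"],
    "cuisine": ["Italian"] * 5,
})

# Assumption: encode the price tag as a number; unmapped values such as
# "Not Available" become NaN.
tier_map = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}
sample["price_tier"] = sample["price"].map(tier_map)

# Count and mean rating per price tier (rows with a missing tier are dropped).
summary = sample.groupby("price_tier")["rating"].agg(["count", "mean"])
print(summary)
```

The same `groupby` pattern extends to `cuisine` or borough once the full dataset is loaded.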

## Data Preparation <a id="data-preparation"></a>

- Addressed data integrity and usability issues identified during EDA.
- Employed Pandas for data cleaning: handling missing values and outliers, and ensuring consistency.

In [None]:
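# Absolute import: a relative import ("from .restaurant_analysis ...") fails in a notebook cell.
from src.restaurant_analysis.rest_analysis import Scrape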
#a = Scrape('https://www.census.gov/quickfacts/fact/table/richmondcountynewyork,bronxcountynewyork,queenscountynewyork,kingscountynewyork,newyorkcountynewyork/RHI825222#RHI825222')
b = Scrape('https://www.census.gov/quickfacts/fact/table/richmondcountynewyork,bronxcountynewyork,queenscountynewyork,kingscountynewyork,newyorkcountynewyork/RHI825222#RHI825222')

**Data Overview**

The `dfboroughs` dataset gives a broad view of housing, economic, social, and demographic indicators for counties in the New York Tri-State area. Each unlabeled column (`Unnamed: 1` through `Unnamed: 5`) corresponds to a New York City (NYC) borough. The dataset covers population estimates, age and sex distribution, race and Hispanic origin, housing, economic indicators, transportation patterns, and income and poverty statistics, among other topics.

**Estimated Population (2022):** To facilitate a comparison of population changes across time, the dataset contains population estimates for July 1, 2022, as well as base estimates from April 1, 2020.

**Age and Sex Distribution:** Data on the proportion of people under five, under eighteen, and over sixty-five years old, offering insights into the age distribution of the populace.

**Hispanic Origin and Race:** Detailed racial population breakdowns, including percentages for Asian, White, Black or African American, and Hispanic or Latino populations.

**Housing Details:** Information providing a thorough understanding of housing features, including the number of housing units, the percentage of owner-occupied housing units, the median value of owner-occupied housing units, and more.

**Economic Indicators:** Data on retail sales, per capita income, health care and social assistance revenues, lodging and food service sales, and the labor force participation rate.

**Income & Poverty:** Median household income and per capita income, both adjusted for inflation, along with the percentage of persons in poverty.

### Projections

With this dataset we can look for trends and correlations related to restaurant success and patron preferences in the New York City area. This could include examining how housing characteristics, economic conditions, and demographic variables affect the restaurant sector.

In [None]:
#create pandas dataframe object from html reader
#race = a.reader()
dfboroughs = b.reader()
dfboroughs

| Unnamed: 0 | Population | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 |
|---|---|---|---|---|---|---|
| 0 | Population Estimates, July 1, 2022, (V2022) | 491,133 | 1,379,946 | 2,278,029 | 2,590,516 | 1,596,273 |
| 1 | Population estimates base, April 1, 2020, (V2022) | 495,749 | 1,472,656 | 2,405,464 | 2,736,075 | 1,694,250 |
| 2 | Population, percent change - April 1, 2020 (es... | -0.9% | -6.3% | -5.3% | -5.3% | -5.8% |
| 3 | Population, Census, April 1, 2020 | 495747 | 1472654 | 2405464 | 2736074 | 1694251 |
| 4 | Population, Census, April 1, 2010 | 468730 | 1385108 | 2230722 | 2504700 | 1585873 |
| 5 | Age and Sex | | | | | |
| 6 | Persons under 5 years, percent | 5.3% | 6.6% | 5.4% | 6.4% | 4.2% |
| 7 | Persons under 18 years, percent | 21.3% | 24.3% | 19.4% | 22.1% | 14.0% |
| 8 | Persons 65 years and over, percent | 17.4% | 14.4% | 18.0% | 15.6% | 18.4% |
| 9 | Female persons, percent | 50.9% | 52.7% | 51.1% | 52.3% | 52.2% |


In [None]:

from src.restaurant_analysis.rest_analysis import Scrape
#a = Scrape('https://www.census.gov/quickfacts/fact/table/richmondcountynewyork,bronxcountynewyork,queenscountynewyork,kingscountynewyork,newyorkcountynewyork/RHI825222#RHI825222')
b = Scrape('https://www.census.gov/quickfacts/fact/table/richmondcountynewyork,bronxcountynewyork,queenscountynewyork,kingscountynewyork,newyorkcountynewyork/RHI825222#RHI825222')

In [None]:
import pandas as pd
dfboroughs = pd.read_csv('boroughs_data.csv')
dfboroughs.head()

| Unnamed: 0.1 | Unnamed: 0 | Population | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Population Estimates, July 1, 2022, (V2022) | 491,133 | 1,379,946 | 2,278,029 | 2,590,516 | 1,596,273 |
| 1 | 1 | Population estimates base, April 1, 2020, (V2022) | 495,749 | 1,472,656 | 2,405,464 | 2,736,075 | 1,694,250 |
| 2 | 2 | Population, percent change - April 1, 2020 (es... | -0.9% | -6.3% | -5.3% | -5.3% | -5.8% |
| 3 | 3 | Population, Census, April 1, 2020 | 495747 | 1472654 | 2405464 | 2736074 | 1694251 |
| 4 | 4 | Population, Census, April 1, 2010 | 468730 | 1385108 | 2230722 | 2504700 | 1585873 |
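
The extra `Unnamed: 0` column in the reloaded table is consistent with a CSV that was written with its row index, which is the `to_csv` default. The sketch below reproduces the effect on a one-row toy frame (the `value` column name is an assumption for illustration) and shows one way to avoid it, either by writing with `index=False` or by reading with `index_col=0`.

```python
import io

import pandas as pd

# Toy frame standing in for the scraped table.
toy = pd.DataFrame({
    "Population": ["Population Estimates, July 1, 2022, (V2022)"],
    "value": [" 491,133"],
})

# Default behaviour: the index is written as an unnamed first column,
# which read_csv then labels 'Unnamed: 0'.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# Writing without the index keeps the round trip clean.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)

# Alternatively, tell read_csv that the first column is the index.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)

print(list(with_index.columns))
print(list(clean.columns))
```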


### Overall Purpose

The code below reduces the scraped table to a small set of economic and demographic metrics. The resulting dataset is easier to explore and analyze, and keeps the focus on the selected indicators for the New York Tri-State area.

In [None]:
predictors = ['Population Estimates, July 1, 2022, (V2022)', 'White alone, not Hispanic or Latino, percent', 'Foreign born persons, percent, 2018-2022', 'Median household income (in 2022 dollars), 2018-2022']
# Create a DataFrame by querying the scraped data for the rows whose label
# (stored in the 'Population' column) is one of the selected predictors;
# this is the table a location variable is later joined with.

dfboroughs.query('Population in @predictors').copy().reset_index()

| Unnamed: 0.1 | index | Unnamed: 0 | Population | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Population Estimates, July 1, 2022, (V2022) | 491,133 | 1,379,946 | 2,278,029 | 2,590,516 | 1,596,273 |
| 1 | 18 | 18 | White alone, not Hispanic or Latino, percent | 56.6% | 8.7% | 23.9% | 36.7% | 45.5% |
| 2 | 21 | 21 | Foreign born persons, percent, 2018-2022 | 24.8% | 33.9% | 47.1% | 35.3% | 28.1% |
| 3 | 55 | 55 | Median household income (in 2022 dollars), 201... | $96,185 | $47,036 | $82,431 | $74,692 | $99,880 |
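
The values in the filtered table above are still strings (`" 491,133"`, `" 56.6%"`, `"$96,185"`), so they need converting before any numeric analysis. A minimal sketch of one way to do that is shown here; `parse_census_value` is a hypothetical helper, not part of the project code, and it keeps percentages as percentage points rather than fractions.

```python
def parse_census_value(raw):
    """Convert a QuickFacts-style string to a float, or None if it is not numeric."""
    # Drop surrounding whitespace, thousands separators, and currency/percent signs.
    text = str(raw).strip().replace(",", "").replace("$", "").rstrip("%")
    try:
        return float(text)
    except ValueError:
        return None


# Values taken from the first data column of the filtered table.
examples = [" 491,133", " 56.6%", "24.8%", "$96,185"]
parsed = [parse_census_value(value) for value in examples]
print(parsed)

# With pandas, the same helper could be applied column by column, e.g.
# dfboroughs["Unnamed: 1"].map(parse_census_value)
```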


## Exploratory Data Analysis (EDA) <a id="eda"></a>

## Prepped Data Review <a id="prepped-data-review"></a>

## Investigative Analysis & Results <a id="investigative-analysis-results"></a>

## Conclusions <a id="conclusions"></a>

Only a few predictors are included in the analysis.

In [None]:
predictors =  ['Population Estimates, July 1, 2022, (V2022)', 'White alone, not Hispanic or Latino, percent', 'Foreign born persons, percent, 2018-2022', 'Median household income (in 2022 dollars), 2018-2022']


KeyError: "None of [Index(['Population Estimates, July 1, 2022, (V2022)',\n       'White alone, not Hispanic or Latino, percent',\n       'Foreign born persons, percent, 2018-2022',\n       'Median household income (in 2022 dollars), 2018-2022'],\n      dtype='object')] are in the [columns]"
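
A `KeyError` with this message is what pandas raises when a list of labels is used to select columns and none of them exist. In the scraped table the predictor names are values in the `Population` column, not column names, so they have to be selected as rows. The sketch below illustrates this on a toy frame with the same layout (values copied from the first data column of the tables above); the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the same layout as the scraped table:
# the metric names are row values, not column names.
toy = pd.DataFrame({
    "Population": [
        "Population Estimates, July 1, 2022, (V2022)",
        "Persons under 5 years, percent",
        "Foreign born persons, percent, 2018-2022",
    ],
    "Unnamed: 1": [" 491,133", " 5.3%", "24.8%"],
})
predictors = [
    "Population Estimates, July 1, 2022, (V2022)",
    "Foreign born persons, percent, 2018-2022",
]

# Column selection fails because no column carries these names.
column_lookup_failed = False
try:
    toy[predictors]
except KeyError:
    column_lookup_failed = True

# Row selection on the label column works.
selected = toy[toy["Population"].isin(predictors)].reset_index(drop=True)

# Optional: transpose so each predictor becomes a column,
# after which toy-style column selection by predictor name is valid.
wide = selected.set_index("Population").T
print(wide[predictors])
```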

In [None]:
dfboroughs.to_csv('boroughs_data.csv')

In [None]:
import pandas as pd
rests = pd.read_csv('restaurants_data.csv')
rests.head()

| Unnamed: 0 | name | rating | price | zip_code | categories | cuisine | borough |
|---|---|---|---|---|---|---|---|
| 0 | Piccoli Trattoria | 4.5 | $$ | 11215 | Italian, Pasta Shops, Wine Bars | Italian | Brooklyn, NY |
| 1 | Osteria Brooklyn | 4.5 | $$$ | 11205 | Italian | Italian | Brooklyn, NY |
| 2 | Barbalu - Brooklyn | 4.5 | Not Available | 11201 | Italian, Pizza, Bars | Italian | Brooklyn, NY |
| 3 | Cent'Anni | 4.0 | $$ | 11238 | Italian | Italian | Brooklyn, NY |
| 4 | Ammazzacaffè | 4.5 | $$ | 11211 | Cocktail Bars, Italian, Wine Bars | Italian | Brooklyn, NY |


In [None]:
manhattan_zip_codes = ['10001', '10002', '10003', '10004', '10005',
    '10006', '10007', '10009', '10010', '10011',
    '10012', '10013', '10014', '10016', '10017',
    '10018', '10019', '10020', '10021', '10022',
    '10023', '10024', '10025', '10026', '10027',
    '10028', '10029', '10030', '10031', '10032',
    '10033', '10034', '10035', '10036', '10037',
    '10038', '10039', '10040', '10128', '10280']

brooklyn_zip_codes = ['11201', '11203', '11204', '11205', '11206',
    '11207', '11208', '11209', '11210', '11211',
    '11212', '11213', '11214', '11215', '11216',
    '11217', '11218', '11219', '11220', '11221',
    '11222', '11223', '11224', '11225', '11226',
    '11228', '11229', '11230', '11231', '11232',
    '11233', '11234', '11235', '11236', '11237',
    '11238', '11239', '11249']

queens_zip_codes = ['11001', '11004', '11101', '11102', '11103',
    '11104', '11105', '11106', '11354', '11355',
    '11356', '11357', '11358', '11359', '11360',
    '11361', '11362', '11363', '11364', '11365',
    '11366', '11367', '11368', '11369', '11370',
    '11371', '11372', '11373', '11374', '11375',
    '11377', '11378', '11379', '11385', '11411',
    '11412', '11413', '11414', '11415', '11416',
    '11417', '11418', '11419', '11420', '11421',
    '11422', '11423', '11426', '11427', '11428',
    '11429', '11430', '11432', '11433', '11434',
    '11435', '11436', '11691', '11692', '11693',
    '11694', '11697']

staten_island_zip_codes = ['10301', '10302', '10303', '10304', '10305',
    '10306', '10307', '10308', '10309', '10310',
    '10311', '10312', '10314']

bronx_zip_codes = ['10451', '10452', '10453', '10454', '10455',
    '10456', '10457', '10458', '10459', '10460',
    '10461', '10462', '10463', '10464', '10465',
    '10466', '10467', '10468', '10469', '10470',
    '10471', '10472', '10473', '10474', '10475']



In [None]:
rests['where'] = rests.apply(lambda row: 'Manhattan' if str(row['zip_code']).strip() in manhattan_zip_codes
else ('Brooklyn' if str(row['zip_code']).strip() in brooklyn_zip_codes
else ('Queens' if str(row['zip_code']).strip() in queens_zip_codes
else ('The Bronx' if str(row['zip_code']).strip() in bronx_zip_codes
else ('Staten Island' if str(row['zip_code']).strip() in staten_island_zip_codes
else 'other')))), axis=1)
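
The nested conditional above can also be expressed as a dictionary lookup, which is easier to extend and avoids scanning each list for every row. The following is a sketch of that alternative, not the code that produced the results in this notebook; it uses shortened ZIP lists for brevity, and `borough_zips`, `zip_to_borough`, and `borough_for` are names introduced here.

```python
# Shortened lists for illustration; in the notebook these would be the
# full manhattan_zip_codes, brooklyn_zip_codes, ... lists defined earlier.
borough_zips = {
    "Manhattan": ["10001", "10002"],
    "Brooklyn": ["11201", "11215"],
    "Queens": ["11101", "11354"],
    "The Bronx": ["10451", "10452"],
    "Staten Island": ["10301", "10302"],
}

# Invert to a single ZIP -> borough lookup table.
zip_to_borough = {
    zip_code: borough
    for borough, zips in borough_zips.items()
    for zip_code in zips
}


def borough_for(zip_code):
    """Return the borough for a ZIP code, or 'other' if it is not listed."""
    return zip_to_borough.get(str(zip_code).strip(), "other")


print(borough_for(11215))
print(borough_for("07030"))

# Applied to the DataFrame, this would read:
# rests['where'] = rests['zip_code'].map(borough_for)
```

Because the borough lists do not overlap, the lookup gives the same label as the chained conditionals for every listed ZIP code.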

In [None]:
rests = rests[rests['where'] != 'other'].reset_index(drop=True)

(116, 8)

In [None]:
rests = rests[rests['price'] != 'Not Available'].reset_index(drop=True)

In [None]:
rests[rests['where'] == 'other']

Unnamed: 0,name,rating,price,zip_code,categories,cuisine,borough,where


In [None]:
rests.shape

(650, 8)

In [None]:
rests.to_csv('new_rests_data.csv')