---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

### Introduction 

The first step in any project is to collect data. This project grew out of everyday questions about restaurants: where to eat, which food is good, and which neighborhood has the best food. Many people turn to Yelp to answer them. Here, we collect data from Yelp, the same information you would see in the app on your phone. These features will hopefully help us dig deeper into our story.



The first step is to create and load a Yelp API key. For security reasons, this key is stored in a separate file and is not included here. To create your own API key, go to Yelp's developer portal to get started.

In [1]:
# load the Yelp API key used to query the API
# import the json package
import json
# open the key file and load its contents
with open('/Users/rachnarawalpally/project-rachnarawalpally/technical-details/data-collection/api-key.json') as f:
    keys = json.load(f)
# store the Yelp key so it can be used in later requests
API_KEY = keys['yelp']
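
Hard-coding an absolute path ties the notebook to one machine. A minimal sketch of a more portable loader might look like the following, assuming the key could also be supplied through an environment variable. The function name `load_yelp_key`, the variable name `YELP_API_KEY`, and the relative file name `api-key.json` are illustrative assumptions, not part of the original project.

```python
import json
import os
from pathlib import Path


def load_yelp_key(key_file="api-key.json", env_var="YELP_API_KEY"):
    """Return the Yelp API key from an environment variable or a JSON file.

    Both default names are hypothetical choices made for this sketch.
    """
    # Prefer the environment variable so the key never has to live in the repo.
    key = os.environ.get(env_var)
    if key:
        return key

    # Fall back to a JSON file shaped like {"yelp": "<your key>"}.
    path = Path(key_file)
    if not path.exists():
        raise FileNotFoundError(
            f"Set {env_var} or create {path} containing a 'yelp' entry."
        )
    with path.open() as f:
        return json.load(f)["yelp"]
```

With this helper, `API_KEY = load_yelp_key()` could stand in for the file-loading lines in the cell above.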

### Method: Obtaining Data 

The following code blocks outline the process of fetching data from the Yelp API and cleaning it into a readable CSV file.

The first block sets up the call to the Yelp API and retrieves the data. Some of this code is sourced directly from Yelp's developer website; see the getting-started guide at https://docs.developer.yelp.com/docs/fusion-intro.

The second block processes the received data by extracting only the relevant information needed for the CSV file. This includes gathering key details that are readily available on Yelp, such as restaurant name, rating, and location. The extracted data is then stored in a Pandas DataFrame.

The final block converts the Pandas DataFrame into a CSV file. Since each API response contains information for up to 50 restaurants, each CSV file includes data for a maximum of 50 Yelp-rated restaurants.



In [14]:
# import the requests package
import requests
# set up the authorization header for the API request
headers = {'Authorization': f'Bearer {API_KEY}'}
# this function fetches restaurant data from Yelp's API
# location = which area to search
# term = type of business to search for: restaurants, coffee shops, or bars (to get as much information about Yelp reviews in DC as possible)
# limit and offset = Yelp's paging parameters: limit returns up to 50 businesses per request,
# and offset is the number of results to skip; 0 starts at the first result, and raising it by 50 moves to the next page
def fetch_restaurant_data(location, term="restaurants", limit=50, offset=0):
    url = 'https://api.yelp.com/v3/businesses/search'
    params = {
        'term': term,
        'location': location,
        'limit': limit,
        'offset': offset,  # Include offset in the params
    }
    response = requests.get(url, headers=headers, params=params)
    data = response.json()
    return data

offset = 0  # Start at the first result; use 50 for the next set of results
limit = 50   # Get 50 results at a time

# Fetch data for restaurants in Washington D.C.
restaurants = fetch_restaurant_data('Washington D.C.', offset=offset, limit=limit)
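
The cell above fetches a single page of results. A minimal sketch of how the offset could be advanced automatically is shown here. The helper name `fetch_all_pages` and the `max_results` cap are assumptions for illustration; Yelp limits how many results one search can return, so check the API documentation for the current ceiling before raising the cap. The page-fetching function is passed in as an argument, which lets the loop be tried out without calling the API.

```python
def fetch_all_pages(fetch_page, limit=50, max_results=200):
    """Collect businesses page by page until a page comes back short.

    fetch_page(offset, limit) must return a dict with a 'businesses' list,
    like the JSON returned by fetch_restaurant_data.
    max_results is a hypothetical safety cap chosen for this sketch.
    """
    businesses = []
    offset = 0
    while offset < max_results:
        # Never ask for more rows than the cap allows.
        page_size = min(limit, max_results - offset)
        # A response without a 'businesses' key (e.g. an error) counts as empty.
        page = fetch_page(offset, page_size).get("businesses", [])
        businesses.extend(page)
        if len(page) < page_size:
            break  # a short page means there are no more results
        offset += page_size
    return businesses
```

Under these assumptions, it could be wired to the function above with `fetch_all_pages(lambda offset, limit: fetch_restaurant_data('Washington D.C.', offset=offset, limit=limit))`.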

In [15]:
# import the pandas package
import pandas as pd
# this function turns the JSON response from above into a clean pandas DataFrame
def process_data(restaurants_data):
    # create an empty list to store the data in
    restaurant_list = []
    for business in restaurants_data['businesses']:
        # collect the fields we need for each business into one record
        restaurant_list.append({
            'name': business['name'],
            'cuisine': business['categories'][0]['title'] if business['categories'] else 'Unknown',
            'price_range': business.get('price', 'N/A'),
            'rating': business.get('rating', 'N/A'),
            'review_count': business.get('review_count','N/A'),
            'neighborhoods': business.get('neighborhoods', 'N/A'),
            'latitude': business['coordinates']['latitude'],
            'longitude': business['coordinates']['longitude'],
            'zip_code': business['location']['zip_code'],
        })
    # save the list as a DataFrame
    df = pd.DataFrame(restaurant_list)
    return df
# use the function above to create a DataFrame from the JSON response
df_restaurants = process_data(restaurants)
# print the first few results to ensure everything looks good
print(df_restaurants.head())


                   name             cuisine price_range  rating  review_count  \
0  Unconventional Diner        New American          $$     4.4          2946   
1             L'Ardente             Italian         $$$     4.5          1242   
2          Grazie Nonna             Italian          $$     4.1           536   
3      Old Ebbitt Grill                Bars          $$     4.2         11086   
4      Gypsy Kitchen DC  Tapas/Small Plates          $$     4.3           919   

  neighborhoods   latitude  longitude zip_code  
0           N/A  38.906139 -77.023800    20001  
1           N/A  38.898919 -77.014074    20001  
2           N/A  38.904010 -77.035000    20005  
3           N/A  38.897967 -77.033342    20005  
4           N/A  38.914880 -77.031550    20009  
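
`process_data` indexes `business['coordinates']` and `business['location']` directly, so a single record missing either field would raise an error and stop the whole batch. Whether Yelp ever omits these fields is not confirmed here; the sketch below simply assumes it might. The helper name `dig` and the sample record are hypothetical.

```python
def dig(record, *keys, default="N/A"):
    """Walk nested dicts and return default if any key is missing or None."""
    current = record
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
        if current is None:
            return default
    return current


# Hypothetical business record with null coordinates and no location at all.
business = {"name": "Example Cafe", "coordinates": None}

latitude = dig(business, "coordinates", "latitude")   # falls back to "N/A"
zip_code = dig(business, "location", "zip_code")      # falls back to "N/A"
name = dig(business, "name")                          # found, returned as-is
```

Inside `process_data`, a line such as `'latitude': dig(business, 'coordinates', 'latitude')` would then match how the other optional fields already fall back to `'N/A'`.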


In [230]:
# print the results again to ensure everything is correct
#print(df_restaurants)
# save the DataFrame to a CSV file in the raw-data folder
df_restaurants.to_csv('../../data/raw-data/df_coffee5.csv')
# absolute path of the folder: /Users/rachnarawalpally/project-rachnarawalpally/data/raw-data


                                   name                cuisine price_range  \
0               Pitango Gelato & Coffee           Coffee & Tea          $$   
1                      Capital One Café           Coffee & Tea         N/A   
2                    Mah-Ze-Dahr Bakery               Bakeries          $$   
3        Junction Bistro Bar and Bakery               Bakeries          $$   
4                          Coffee Alley           Coffee & Tea         N/A   
5                       Gregorys Coffee               Bakeries          $$   
6                           Atrium Cafe                  Cafes           $   
7                         Mitsitam Cafe               American          $$   
8                        Cafe Levantine               Lebanese         N/A   
9                      Capital One Café  Banks & Credit Unions         N/A   
10                               Zeleno             Sandwiches          $$   
11                        Le Caprice DC                  Cafes  
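
Because each request is saved to its own CSV file, later analysis needs the files stitched back together. The following is a minimal sketch of one way to do that with pandas. The function name `combine_csvs`, the `df_*.csv` pattern, and the choice of columns used to spot duplicates are all assumptions, not the project's actual cleaning step.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(folder, pattern="df_*.csv"):
    """Read every matching CSV in folder and return one de-duplicated DataFrame.

    The pattern and the de-duplication columns are assumptions for this sketch.
    """
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern} in {folder}")

    # index_col=0 reads back the unnamed index column that to_csv wrote;
    # zip codes are kept as text so they are not converted to integers.
    frames = [pd.read_csv(f, index_col=0, dtype={"zip_code": str}) for f in files]
    combined = pd.concat(frames, ignore_index=True)

    # The same business can appear under several search terms or pages.
    deduped = combined.drop_duplicates(subset=["name", "latitude", "longitude"])
    return deduped.reset_index(drop=True)
```

For the layout used here, a call like `combine_csvs('../../data/raw-data')` would return every saved business in a single DataFrame.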

### Final Thoughts 

This code demonstrates how to retrieve data from Yelp's API to gather information on businesses such as restaurants, coffee shops, and bars (as used in this project). The script can be easily modified to query any other type of business available on Yelp. It provides a simple and efficient way to:

- Fetch Yelp data based on specific parameters.
- Convert the raw JSON response into a pandas DataFrame.
- Save the DataFrame as a CSV file for later analysis.

By altering the search term (e.g., "restaurants", "coffee shops", or "bars") and adjusting other parameters like limit and offset, you can customize the data retrieval to suit your needs. This method simplifies the process of gathering Yelp data for analysis and ensures easy access to relevant business information.
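
To make the idea of varying the search term and offset concrete, here is a minimal sketch that lists the parameter combinations to request. The function name `build_search_plan`, the `pages` argument, and the CSV naming scheme are hypothetical; the sketch only builds the plan and does not call the API.

```python
from itertools import product


def build_search_plan(terms, pages=2, limit=50):
    """Return one parameter dict per (term, page) combination.

    'pages' and the 'csv_name' naming scheme are hypothetical choices.
    """
    plan = []
    for term, page in product(terms, range(pages)):
        plan.append({
            "term": term,
            "limit": limit,
            "offset": page * limit,  # page 0 starts at 0, page 1 at 50, ...
            # e.g. "df_coffee_shops_1.csv" for the first page of "coffee shops"
            "csv_name": f"df_{term.replace(' ', '_')}_{page + 1}.csv",
        })
    return plan


plan = build_search_plan(["restaurants", "coffee shops", "bars"])
```

Each entry's `term`, `limit`, and `offset` could then be passed to `fetch_restaurant_data`, with the result saved under `csv_name`.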