# 1 Data Collection - Yelp
The Yelp API can search for businesses and filter them by location and category. The "Business Search" endpoint is used to collect the data (https://www.yelp.com/developers/documentation/v3/business_search).  
  
First, a random list of zipcodes is generated for the study area. Then the Yelp API is queried twice for each zipcode: once for businesses that match both "gluten free" and "restaurant", and once for businesses that match only "restaurant".

The Yelp business data is then summarized for each zipcode. The end result is two .CSV files:
  
  * yelpData.csv: contains the DataFrame of all data returned from the Yelp API calls
  
  * summarizedYelpData.csv: contains the DataFrame that summarizes the Yelp data
 
## Dependency
##### 1 barnum Library
The barnum Python library is used to generate a random list of zip codes.  It is installed with the following:
  'pip install barnum'
  
##### 2 secrets.py
The collection of restaurant data uses the Yelp API that requires a key. This key is stored within the secrets.py that contains a single variable, "yelpKey".
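
A minimal sketch of what secrets.py might contain, assuming the key is stored as a plain string. The value shown is a placeholder, not a real key. Note that Python's standard library also ships a module named `secrets`, so the local file is only picked up when its folder comes ahead of the standard library on the import path.

```python
# secrets.py (kept out of version control through .gitignore)
# Placeholder value; replace it with the key issued for your Yelp app.
yelpKey = "YOUR_YELP_API_KEY"
```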

In [7]:
#-- Import Libraries
import pandas as pd
import os
import math
import barnum
import requests


# Yelp API key in secrets.py; .gitignore prevents the secrets.py from being pushed to GitHub
from secrets import yelpKey


#-- Configuration Settings

#- Common Settings
# Name of the column that contains the zip code information
zipcodeColumnName = "Zipcode"

# Folder that is to contain output of different processing
outputDirectory = "AnalysisData"


#- Random Zipcodes
# Number of zip codes to gather data for; use 100 for analysis and 3 for testing
numRandomZipcodes = 50

# Name of the file that contains the DataFrame of the random zipcodes
randomZipcodesFileName = "randomZipcode.csv"


#- Collect Yelp Datasets
# Yelp search radius; used with API call
yelpSearchRadius = 3000

# TRUE- use the file of the random zipcodes for yelp dataset FALSE- use the DataFrame in memory
useFileForYelpSearch = False

# Name of the file that contains DataFrame of all data returned from the Yelp API calls
yelpDataFileName = "yelpData.csv"

# Name of the file that contains DataFrame of the zipcode and counts returned from Yelp API calls
yelpDataSummaryFileName = "YelpSummaryData.csv"

# Name of the file that contains DataFrame for just one zip code
yelpDataZipcodeFileName = "YelpDataZip_"


#- Summarize Yelp Data
# TRUE- use the file of the DataFrame that contains the yelp businesses FALSE- use the DataFrame in memory
useFileForYelpSummarize = True

# Name of the file that contains the Dataframe that summarizes the Yelp data
summarizedYelpFileName = "summarizedYelpData.csv"


# 1 Create Random Zip Codes   
This step creates a list of the random zip codes within the study area of Southern California. The barnum library is used to generate the random list. The random zip codes are verified to ensure that they are within the study area and then stored to disk.
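
The selection logic can be sketched without barnum as follows. This is an illustration under assumptions, not the notebook's implementation: `random_candidates` is a hypothetical stand-in that yields arbitrary five-digit strings (which need not be real zip codes), and `sample_zipcodes` additionally skips duplicates, which the cell below does not do. The bounds are the ones used in the range check of this notebook.

```python
import random

# Bounds of the study area, taken from the range check used in this notebook
MIN_ZIPCODE = 90001
MAX_ZIPCODE = 93005


def sample_zipcodes(candidates, count):
    """Collect `count` distinct zip codes from an iterable of candidate strings.

    Candidates that are non-numeric, outside the study area, or already
    selected are skipped.
    """
    selected = []
    seen = set()
    for candidate in candidates:
        if not candidate.isdigit():
            continue
        if MIN_ZIPCODE <= int(candidate) <= MAX_ZIPCODE and candidate not in seen:
            seen.add(candidate)
            selected.append(candidate)
        if len(selected) == count:
            break
    return selected


def random_candidates(seed=0):
    """Hypothetical stand-in for barnum: yields random five-digit strings forever."""
    rng = random.Random(seed)
    while True:
        yield f"{rng.randint(0, 99999):05d}"


zipcodes = sample_zipcodes(random_candidates(), 5)
```

Skipping duplicates keeps the same zip code from being queried twice, at the cost of a few extra draws.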

In [72]:
#-- Create Random Zipcodes

#- Create List
hasAllZipcodes = False
randomZipcodes = []

while hasAllZipcodes == False:
    theZipCode = barnum.create_city_state_zip()[0]
    numZipCode = int(theZipCode)
    
    if (numZipCode >= 90001) and (numZipCode <= 93005):
        randomZipcodes.append(theZipCode)
        
    if (numRandomZipcodes == len(randomZipcodes)):
        hasAllZipcodes = True 

        
#- Create DataFrame
randomZipcodes_df = pd.DataFrame(randomZipcodes)

randomZipcodes_df.columns = [zipcodeColumnName]


#- Save to Disk
randomZipcodesPath = os.path.join(".", outputDirectory, randomZipcodesFileName)

randomZipcodes_df.to_csv(randomZipcodesPath)


#- Preview Random Zipcodes
randomZipcodes_df.head()

Unnamed: 0,Zipcode
0,90746
1,92210
2,90210
3,92605
4,90028


# 2 Collect Yelp Data
For each of the random zipcodes, the Yelp API is used to collect a single dataset using the Business Search API (https://www.yelp.com/developers/documentation/v3/business_search). The Yelp API is queried two times:

* Yelp filter of "restaurant" and "gluten_free".
* Yelp filter of only "restaurant"
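
As a sketch of how the two queries differ, the snippet below builds the query parameters for both searches and prints the resulting URLs without sending any request. The parameter names (`location`, `term`, `limit`, `offset`) and the search terms mirror those used by getDataForZipcode in this notebook; the helper name `build_search_params` and the example zip code are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_params(zipcode, gluten_free, limit=50, offset=0):
    """Return the query parameters for one page of a Business Search request."""
    term = "gluten_free,restaurant" if gluten_free else "restaurant"
    return {"location": zipcode, "term": term, "limit": limit, "offset": offset}


# One URL per search type for the same zip code; nothing is sent over the network
for gluten_free in (True, False):
    params = build_search_params("90210", gluten_free)
    print(f"{BASE_URL}?{urlencode(params)}")
```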

There is validation to ensure that each business returned from the Yelp API is located within the zip code provided. Additionally, some businesses were found to lack a "price" attribute in the JSON, so there is validation to prevent this from stopping the processing.  
  
The yelp data for all the zipcodes is stored within a DataFrame, yelpData_df, and is also exported to a CSV file in: {outputDirectory}/{yelpDataFileName}
  
Another DataFrame, yelpDataSummary_df, contains each zipcode with the count of records found by the two searches.  It is exported to a CSV file in: {outputDirectory}/{yelpDataSummaryFileName}  
    
The Yelp data for each individual zipcode is stored within a DataFrame, yelpDataForZipcode_df, and is also exported to a CSV file in: {outputDirectory}/{yelpDataZipcodeFileName}{zipcode}.csv
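
Because each request returns at most 50 records, collecting a zipcode means paging with an offset. The run output in this notebook shows HTTP 400 responses once the offset reaches 1000, which is consistent with the API limiting how many results a single search can page through. The sketch below treats a cap of 1000 as an assumption and stops before it. `fetch_page` and `fake_fetch` are hypothetical stand-ins so that the example runs offline.

```python
def page_offsets(total, limit=50, max_results=1000):
    """Return the offsets needed to page through `total` results.

    `max_results` is an assumed cap on how many results one search can return.
    """
    reachable = min(total, max_results)
    return list(range(0, reachable, limit))


def collect_all(fetch_page, limit=50, max_results=1000):
    """Page through results using `fetch_page(offset, limit)`.

    `fetch_page` is a hypothetical callable returning a dict with the keys
    'businesses' and 'total', the two keys this notebook reads from the response.
    """
    first = fetch_page(0, limit)
    businesses = list(first["businesses"])
    # The first page is already fetched, so continue from the second offset
    for offset in page_offsets(first["total"], limit, max_results)[1:]:
        businesses.extend(fetch_page(offset, limit)["businesses"])
    return businesses


def fake_fetch(offset, limit, total=1200):
    """Offline stand-in for the API that rejects pages beyond the assumed cap."""
    if offset + limit > 1000:
        raise ValueError("offset beyond assumed cap")
    ids = range(offset, min(offset + limit, total))
    return {"businesses": [{"id": i} for i in ids], "total": total}


results = collect_all(fake_fetch)
```

With this approach a zipcode that reports more results than the cap still yields the reachable pages instead of being discarded as an API failure.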
    
    

In [3]:
def getDataForZipcode(isGlutenFreeSearch, searchZipCode):
    ''' Searches the Yelp API to get the businesses that satisfy the filter. Validates that the 
    response from the Yelp API is successful; a failed request is attempted up to four times before giving up.
    
    Accepts : isGlutenFreeSearch (bool) TRUE- search for gluten free term FALSE- search just for restaurant
                searchZipCode (str) zip code to search for records within
    
    Returns : (dictionary) contains information of business for the zip code
                ID: (str) Unique Yelp ID for the business
                Name: (str) Name of the business
                Zipcode: (str) Location of the business
                Latitude: (num) coordinate of the business location
                Longitude: (num) coordinate of the business location
                Price: (str) Price level of the business. Value is one of $, $$, $$$, $$$$ and NA
                Rating: (num) Rating for this business (value ranges from 1, 1.5, ... 4.5, 5)
                IsGlutenFree: (num) 1 - used with the gluten free filter 0 not used with gluten free search
                
                Or None when errors encountered with requests
    '''
    
    #- Prepare Search
    # Source Url
    baseYelpUrl = "https://api.yelp.com/v3/businesses/search"

    # API Key passed through header
    headers = {
            'Authorization': 'Bearer %s' % yelpKey,
    }
    
    # Search Term
    searchTerm = 'restaurant'
    
    if (isGlutenFreeSearch == True):
        searchTerm = 'gluten_free,restaurant'
    
    
    # Dictionary stores data
    yelpData = {
        'ID': [],
        'Name': [],
        'Zipcode': [],
        'Latitude': [],
        'Longitude': [],
        'Price' : [],
        'Rating' : [],
        'IsGlutenFree': [],
    }
    
    
    #- Search
    #  API limits 50 records being returned at once; must loop and request offset of results to get all records
    recordLimit = 50
    currentOffset = 0
    hasMoreData = True
    retryCount = 0
    totalCount = 0
    
    while hasMoreData == True:
        
        #- Prepare Parameters
        parameters = {
            'location': searchZipCode,
            'term': searchTerm,
            'limit': recordLimit,
            'offset': currentOffset,
            }
        
        #- Request
        print(f"  Requesting data. Offset: {currentOffset}  Known Total: {totalCount}")
        
        response = requests.request('GET', baseYelpUrl, headers=headers, params=parameters)
        
        
        #- Check Response
        if (response.status_code == requests.codes.ok):
            
            # Reset Retry
            retryCount = 0
            
            
            # Get Json from Response
            responseJson = response.json()
            
            
            # Search Businesses
            for business in responseJson['businesses']:
                
                # Determine Use Business
                useBusiness = checkBusinessForUsage(business, searchZipCode)
                
                if (useBusiness == True):
                    
                    # Populate Dictionary with Business Information
                    yelpData['ID'].append(business['id'])
                    yelpData['Name'].append(business['name'])
                    yelpData['Zipcode'].append(business['location']['zip_code'])
                    
                    yelpData['Latitude'].append(business['coordinates']['latitude'])
                    yelpData['Longitude'].append(business['coordinates']['longitude'])
                    
                    yelpData['Price'].append(getPriceForBusiness(business))
                    yelpData['Rating'].append(business['rating'])
                    
                    # Update search type
                    if (isGlutenFreeSearch == True):
                        yelpData['IsGlutenFree'].append(1)
                    else:
                        yelpData['IsGlutenFree'].append(0)
         
            #- Prepare for Next search
            # API only supports 50 records at a time; must query with offset
            currentOffset = (currentOffset + recordLimit)
            
            totalCount = responseJson['total']
            
            if (currentOffset > responseJson['total']):
                print(f"Collected all data. Current Offset: {currentOffset}  Total: {responseJson['total']}")
                hasMoreData = False
        
        else:
            #- Error with request
            retryCount += 1
            print(f"Response Error for data: {response.status_code}  Retry Count: {retryCount}")     
            
            #- Attempt to retry request
            if (retryCount == 4):
                print("Error getting data for zipcode")
                return None
    
                  
    #- Metadata on Data
    print(f"Total businesses found: {len(yelpData['ID'])}  Search Term: {searchTerm}")
               
          
    #- Return data from function
    return yelpData
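
The function above retries a failed request immediately. A common alternative is to wait between attempts, doubling the delay each time. The sketch below illustrates that idea under assumptions: `call_with_retries` and `request_fn` are hypothetical names, and `request_fn` is assumed to return None on failure. Waiting helps with transient failures; a client error such as the 400 responses seen in the output below is likely to fail on every attempt.

```python
import time


def call_with_retries(request_fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `request_fn` until it returns a result or the attempts run out.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...).
    Returns None when every attempt fails.
    """
    for attempt in range(max_attempts):
        result = request_fn()
        if result is not None:
            return result
        # No need to wait after the final attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None


# Demonstration with a stand-in request that succeeds on the third call;
# the delays are recorded in a list instead of actually sleeping.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    return "ok" if attempts["count"] == 3 else None


delays = []
outcome = call_with_retries(flaky, sleep=delays.append)
```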

In [4]:
def checkBusinessForUsage(businessInfo, searchZipcode):
    ''' Determines if the business can be used in the Analysis
    
    Accepts : businessInfo (dictionary) contains the metadata for individual business 
                searchZipcode (str) zip code searching for data within
    
    Return : bool TRUE- business meets criteria, able to use FALSE- unable to use business
    '''
    
    #- Check Within Search Zipcode
    businessZipCode = businessInfo['location']['zip_code']
    
    if (businessZipCode != searchZipcode):
        return False
    
    
    return True

In [5]:
def getPriceForBusiness(businessInfo):
    ''' Gets the price for a business; not all businesses contain this property within the JSON;
    when not found just uses NA.
    
    Accepts : businessInfo (dictionary) metadata on an individual business
    
    Returns : (str) value from price tag, or 'NA' when the business has no price
    '''
    try:
        
        return businessInfo['price']
    
    except KeyError:
        # Business has no 'price' key in the JSON
        return 'NA'

In [73]:
#-- Collect Yelp Datasets

#- Get Random Zipcodes
if (useFileForYelpSearch == True):
    randomZipcodesPath = os.path.join(".", outputDirectory, randomZipcodesFileName)
    
    randomZipcodes_df = pd.read_csv(randomZipcodesPath)

else:
    if (randomZipcodes_df is None):
        raise Exception("Unable to collect Yelp dataset; missing reference to randomZipcodes_df")

        
#- Prepare Variables
zipcodeSummary = {
    zipcodeColumnName: [],
    'HasApiFailure': [],
    'Count_GlutenFree': [],
    'Count_Restaurant': []
}

yelpData_df = None
hasFirstYelpData = True
counter = 0


#- Collect Data
for index, row in randomZipcodes_df.iterrows():
    
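    #- Get Zipcode
    # Select by column name; a positional lookup picks up the saved index column when loaded from file
    searchZipcode = str(row[zipcodeColumnName])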
    
    
    #- Message
    counter += 1
    print(f"-> Search -> {searchZipcode}  -> {counter} of {randomZipcodes_df.shape[0]}")
    
    
    #- Get Data from Yelp: Gluten Free Search
    yelpDataForZipcode = getDataForZipcode(True, searchZipcode)
    
    
    #- Get Data for Yelp: Restaurant
    yelpDataRestaurantForZipcode = getDataForZipcode(False, searchZipcode)
    
    
    #- Create DataFrames; check success getting data from endpoint
    if not (yelpDataForZipcode is None) and not (yelpDataRestaurantForZipcode is None):
        
        #- Gluten Free
        # Create DataFrame
        yelpDataGlutenFreeForZipcode_df = pd.DataFrame(yelpDataForZipcode)
    
        # Determine number of records
        countYelpDataForZipcode = yelpDataGlutenFreeForZipcode_df.shape[0]
    
 
        #- All Restaurants
        # Create DataFrame
        yelpDataAllForZipcode_df = pd.DataFrame(yelpDataRestaurantForZipcode)
        
        # Determine number of records
        countYelpDataSummaryForZipcode = yelpDataAllForZipcode_df.shape[0]
        
        
        #- Combine and Save Data for Zipcode
        # Merge Gluten Free & Restaurants for Zipcode
        yelpDataForZipcode_df = pd.concat([yelpDataGlutenFreeForZipcode_df, yelpDataAllForZipcode_df])
        
        # Save to Disk
        yelpDataZipcodePath = os.path.join(".", outputDirectory, f"{yelpDataZipcodeFileName}{searchZipcode}.csv")
        
        yelpDataForZipcode_df.to_csv(yelpDataZipcodePath)                                  
        
        print(" Exported data for zipcode to disk..")
        
                                           
        #- Create Single DataFrame
        if (hasFirstYelpData == True):
            hasFirstYelpData = False
            yelpData_df = yelpDataForZipcode_df
    
        else:
            yelpData_df = pd.concat([yelpData_df, yelpDataForZipcode_df])
        
        
        #- Update Summary
        zipcodeSummary[zipcodeColumnName].append(searchZipcode)
        zipcodeSummary['HasApiFailure'].append(0)
        
        zipcodeSummary['Count_GlutenFree'].append(countYelpDataForZipcode)
        zipcodeSummary['Count_Restaurant'].append(countYelpDataSummaryForZipcode)  
        
    else:
        
        #- Error with at least one search filter; do not use zipcode
        zipcodeSummary[zipcodeColumnName].append(searchZipcode)
        zipcodeSummary['HasApiFailure'].append(1)
        
        zipcodeSummary['Count_GlutenFree'].append(0)
        zipcodeSummary['Count_Restaurant'].append(0) 
    

#- Message
print("<--<")
print("Completed getting data from Yelp API")
    
    
#- Export: Yelp Data
yelpDataFilePath = os.path.join('.', outputDirectory, yelpDataFileName)

yelpData_df.to_csv(yelpDataFilePath)


#- Export: Yelp Summary Data
yelpDataSummaryFilePath = os.path.join('.', outputDirectory, yelpDataSummaryFileName)

yelpDataSummary_df = pd.DataFrame(zipcodeSummary)

# Write to the full path in the output directory; the summarize step reads the file from there
yelpDataSummary_df.to_csv(yelpDataSummaryFilePath)


#- Completed Message
print("Completed export of data")

-> Search -> 90746  -> 1 of 50
  Requesting data. Offset: 0  Known Total: 0
  Requesting data. Offset: 50  Known Total: 358
  Requesting data. Offset: 100  Known Total: 358
  Requesting data. Offset: 150  Known Total: 358
  Requesting data. Offset: 200  Known Total: 358
  Requesting data. Offset: 250  Known Total: 358
  Requesting data. Offset: 300  Known Total: 358
  Requesting data. Offset: 350  Known Total: 358
Collected all data. Current Offset: 400  Total: 358
Total businesses found: 11  Search Term: gluten_free,restaurant
  Requesting data. Offset: 0  Known Total: 0
  Requesting data. Offset: 50  Known Total: 2600
  Requesting data. Offset: 100  Known Total: 2600
  Requesting data. Offset: 150  Known Total: 2600
  Requesting data. Offset: 200  Known Total: 2600
  Requesting data. Offset: 250  Known Total: 2600
  Requesting data. Offset: 300  Known Total: 2600
  Requesting data. Offset: 350  Known Total: 2600
  Requesting data. Offset: 400  Known Total: 2600
  Requesting data. Off

  Requesting data. Offset: 200  Known Total: 347
  Requesting data. Offset: 250  Known Total: 347
  Requesting data. Offset: 300  Known Total: 347
Collected all data. Current Offset: 350  Total: 347
Total businesses found: 0  Search Term: gluten_free,restaurant
  Requesting data. Offset: 0  Known Total: 0
Collected all data. Current Offset: 50  Total: 6
Total businesses found: 0  Search Term: restaurant
 Exported data for zipcode to disk..
-> Search -> 90264  -> 7 of 50
  Requesting data. Offset: 0  Known Total: 0
Collected all data. Current Offset: 50  Total: 16
Total businesses found: 0  Search Term: gluten_free,restaurant
  Requesting data. Offset: 0  Known Total: 0
Collected all data. Current Offset: 50  Total: 17
Total businesses found: 0  Search Term: restaurant
 Exported data for zipcode to disk..
-> Search -> 92401  -> 8 of 50
  Requesting data. Offset: 0  Known Total: 0
Collected all data. Current Offset: 50  Total: 46
Total businesses found: 1  Search Term: gluten_free,restau

  Requesting data. Offset: 300  Known Total: 1200
  Requesting data. Offset: 350  Known Total: 1200
  Requesting data. Offset: 400  Known Total: 1200
  Requesting data. Offset: 450  Known Total: 1200
  Requesting data. Offset: 500  Known Total: 1200
  Requesting data. Offset: 550  Known Total: 1200
  Requesting data. Offset: 600  Known Total: 1200
  Requesting data. Offset: 650  Known Total: 1200
  Requesting data. Offset: 700  Known Total: 1200
  Requesting data. Offset: 750  Known Total: 1200
  Requesting data. Offset: 800  Known Total: 1200
  Requesting data. Offset: 850  Known Total: 1200
  Requesting data. Offset: 900  Known Total: 1200
  Requesting data. Offset: 950  Known Total: 1200
  Requesting data. Offset: 1000  Known Total: 1200
Response Error for data: 400  Retry Count: 1
  Requesting data. Offset: 1000  Known Total: 1200
Response Error for data: 400  Retry Count: 2
  Requesting data. Offset: 1000  Known Total: 1200
Response Error for data: 400  Retry Count: 3
  Requesting

Collected all data. Current Offset: 50  Total: 12
Total businesses found: 0  Search Term: gluten_free,restaurant
  Requesting data. Offset: 0  Known Total: 0
Collected all data. Current Offset: 50  Total: 13
Total businesses found: 0  Search Term: restaurant
 Exported data for zipcode to disk..
-> Search -> 91001  -> 19 of 50
  Requesting data. Offset: 0  Known Total: 0
  Requesting data. Offset: 50  Known Total: 549
  Requesting data. Offset: 100  Known Total: 549
  Requesting data. Offset: 150  Known Total: 549
  Requesting data. Offset: 200  Known Total: 549
  Requesting data. Offset: 250  Known Total: 549
  Requesting data. Offset: 300  Known Total: 549
  Requesting data. Offset: 350  Known Total: 549
  Requesting data. Offset: 400  Known Total: 549
  Requesting data. Offset: 450  Known Total: 549
  Requesting data. Offset: 500  Known Total: 549
Collected all data. Current Offset: 550  Total: 549
Total businesses found: 7  Search Term: gluten_free,restaurant
  Requesting data. Offs

  Requesting data. Offset: 700  Known Total: 898
  Requesting data. Offset: 750  Known Total: 898
  Requesting data. Offset: 800  Known Total: 898
  Requesting data. Offset: 850  Known Total: 898
Collected all data. Current Offset: 900  Total: 898
Total businesses found: 59  Search Term: gluten_free,restaurant
  Requesting data. Offset: 0  Known Total: 0
  Requesting data. Offset: 50  Known Total: 1400
  Requesting data. Offset: 100  Known Total: 1400
  Requesting data. Offset: 150  Known Total: 1400
  Requesting data. Offset: 200  Known Total: 1400
  Requesting data. Offset: 250  Known Total: 1400
  Requesting data. Offset: 300  Known Total: 1400
  Requesting data. Offset: 350  Known Total: 1400
  Requesting data. Offset: 400  Known Total: 1400
  Requesting data. Offset: 450  Known Total: 1400
  Requesting data. Offset: 500  Known Total: 1400
  Requesting data. Offset: 550  Known Total: 1400
  Requesting data. Offset: 600  Known Total: 1400
  Requesting data. Offset: 650  Known Total:

Response Error for data: 400  Retry Count: 4
Error getting data for zipcode
  Requesting data. Offset: 0  Known Total: 0
Collected all data. Current Offset: 50  Total: 41
Total businesses found: 7  Search Term: restaurant
-> Search -> 92105  -> 29 of 50
  Requesting data. Offset: 0  Known Total: 0
  Requesting data. Offset: 50  Known Total: 1000
  Requesting data. Offset: 100  Known Total: 1000
  Requesting data. Offset: 150  Known Total: 1000
  Requesting data. Offset: 200  Known Total: 1000
  Requesting data. Offset: 250  Known Total: 1000
  Requesting data. Offset: 300  Known Total: 1000
  Requesting data. Offset: 350  Known Total: 1000
  Requesting data. Offset: 400  Known Total: 1000
  Requesting data. Offset: 450  Known Total: 1000
  Requesting data. Offset: 500  Known Total: 1000
  Requesting data. Offset: 550  Known Total: 1000
  Requesting data. Offset: 600  Known Total: 1000
  Requesting data. Offset: 650  Known Total: 1000
  Requesting data. Offset: 700  Known Total: 1000
  

  Requesting data. Offset: 200  Known Total: 203
Collected all data. Current Offset: 250  Total: 203
Total businesses found: 8  Search Term: gluten_free,restaurant
  Requesting data. Offset: 0  Known Total: 0
  Requesting data. Offset: 50  Known Total: 904
  Requesting data. Offset: 100  Known Total: 904
  Requesting data. Offset: 150  Known Total: 904
  Requesting data. Offset: 200  Known Total: 904
  Requesting data. Offset: 250  Known Total: 904
  Requesting data. Offset: 300  Known Total: 904
  Requesting data. Offset: 350  Known Total: 904
  Requesting data. Offset: 400  Known Total: 904
  Requesting data. Offset: 450  Known Total: 904
  Requesting data. Offset: 500  Known Total: 904
  Requesting data. Offset: 550  Known Total: 904
  Requesting data. Offset: 600  Known Total: 904
  Requesting data. Offset: 650  Known Total: 904
  Requesting data. Offset: 700  Known Total: 904
  Requesting data. Offset: 750  Known Total: 904
  Requesting data. Offset: 800  Known Total: 904
  Reques

  Requesting data. Offset: 300  Known Total: 529
  Requesting data. Offset: 350  Known Total: 529
  Requesting data. Offset: 400  Known Total: 529
  Requesting data. Offset: 450  Known Total: 529
  Requesting data. Offset: 500  Known Total: 529
Collected all data. Current Offset: 550  Total: 529
Total businesses found: 75  Search Term: gluten_free,restaurant
  Requesting data. Offset: 0  Known Total: 0
  Requesting data. Offset: 50  Known Total: 1800
  Requesting data. Offset: 100  Known Total: 1800
  Requesting data. Offset: 150  Known Total: 1800
  Requesting data. Offset: 200  Known Total: 1800
  Requesting data. Offset: 250  Known Total: 1800
  Requesting data. Offset: 300  Known Total: 1800
  Requesting data. Offset: 350  Known Total: 1800
  Requesting data. Offset: 400  Known Total: 1800
  Requesting data. Offset: 450  Known Total: 1800
  Requesting data. Offset: 500  Known Total: 1800
  Requesting data. Offset: 550  Known Total: 1800
  Requesting data. Offset: 600  Known Total: 

In [74]:
#-- Preview Yelp Data
print(yelpData_df.shape)

yelpData_df.head()

(1492, 8)


Unnamed: 0,ID,Name,Zipcode,Latitude,Longitude,Price,Rating,IsGlutenFree
0,Pw1-_aGws_Hc6-_u8clsHQ,The Nest,92210,33.72136,-116.35884,$$,4.0,1.0
1,A-M0cPJnjjzzeGBecScPKw,Eureka!,92210,33.72092,-116.356821,$$,4.0,1.0
2,1ZKbcWcvThf2E-Fb8pqEvw,Vue Grille & Bar,92210,33.724492,-116.329666,$$,4.5,1.0
3,3RbkcqU38_WvWDUVvKjwPQ,Hyatt Regency Indian Wells Resort & Spa,92210,33.724518,-116.331203,$$$,4.0,1.0
4,PNgbVkNt6JEUh5IZLDWUUA,Cafe Italia,92210,33.720386,-116.358901,$$,4.0,1.0


In [75]:
#-- Preview Yelp Summary Data
print(yelpDataSummary_df.shape)

yelpDataSummary_df.head()

(50, 4)


Unnamed: 0,Zipcode,HasApiFailure,Count_GlutenFree,Count_Restaurant
0,90746,1,0,0
1,92210,0,9,20
2,90210,1,0,0
3,92605,0,0,0
4,90028,0,135,287


# 3 Summarize Yelp Data
For each zipcode found within the sample dataset, summarize the number of businesses found and the mean and standard deviation of the price and rating.  This summarization is done both for the businesses that satisfied the "Gluten Free/Restaurant" filter and for those that satisfied the "Restaurant" filter.  
  
A new DataFrame is created, summarizedYelpData_df, and it is also saved to disk. 
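
To make the arithmetic concrete, here is a small standard-library sketch of the statistics computed for one zipcode and one filter. It is an illustration, not the notebook's code: the `summarize` helper and the sample records are hypothetical. The standard deviation is the sample standard deviation, which is also the default of pandas' `Series.std()`.

```python
import math
import statistics

# Numeric value for each Yelp price level; anything else counts as missing
PRICE_TO_NUM = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}


def summarize(businesses):
    """Summarize a list of dicts with 'Price' and 'Rating' keys for one zip code."""
    prices = [PRICE_TO_NUM[b["Price"]] for b in businesses if b["Price"] in PRICE_TO_NUM]
    ratings = [b["Rating"] for b in businesses]

    def mean(values):
        return statistics.mean(values) if values else math.nan

    def std(values):
        # The sample standard deviation needs at least two values
        return statistics.stdev(values) if len(values) > 1 else math.nan

    return {
        "Total": len(businesses),
        "Price_0": len(businesses) - len(prices),
        "Price_Mean": mean(prices),
        "Price_Std": std(prices),
        "Rating_Mean": mean(ratings),
        "Rating_Std": std(ratings),
    }


sample = [
    {"Price": "$", "Rating": 4.0},
    {"Price": "$$$", "Rating": 5.0},
    {"Price": "NA", "Rating": 3.0},
]
summary = summarize(sample)
```

For the sample, the business without a price is counted in `Price_0` and left out of the price mean and standard deviation, while all three ratings contribute to the rating statistics.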

In [12]:
def summarizeYelpDataForZipcode(subset_df, searchPrefix, summarizeResult):
    ''' Summarizes the Yelp data based on the DataFrame provided
    
    Accepts : subset_df (DataFrame) records for the zipcode and filter
                searchPrefix (str) prefix used with column names; "GF_" or "ALL_"
                summarizeResult (dictionary) contains summarized data
                
    Returns : summarizeResult (dictionary) with the following information appended
                'Zipcode' (str) name of the zipcode
                'GF_Total' (num) total number of businesses
                'GF_Price_1' (num) total with price of '$'
                'GF_Price_2' (num) total with price of '$$'
                'GF_Price_3' (num) total with price of '$$$'
                'GF_Price_4' (num) total with price of '$$$$'
                'GF_Price_0' (num) total with price of 'NA'
                'GF_Rating_10' (num) total with rating of 1.0
                'GF_Rating_15' (num) total with rating of 1.5
                'GF_Rating_20' (num) total with rating of 2.0
                'GF_Rating_25' (num) total with rating of 2.5
                'GF_Rating_30' (num) total with rating of 3.0
                'GF_Rating_35' (num) total with rating of 3.5
                'GF_Rating_40' (num) total with rating of 4.0
                'GF_Rating_45' (num) total with rating of 4.5
                'GF_Rating_50' (num) total with rating of 5.0
                'GF_Price_Mean' (num) average for price
                'GF_Price_Std' (num) standard deviation for price
                'GF_Rating_Mean' (num) average for rating
                'GF_Rating_Std' (num) standard deviation for rating
                'ALL_Total' (num) total number of businesses
                'ALL_Price_1' (num) total with price of '$'
                'ALL_Price_2' (num) total with price of '$$'
                'ALL_Price_3' (num) total with price of '$$$'
                'ALL_Price_4' (num) total with price of '$$$$'
                'ALL_Price_0' (num) total with price of 'NA'
                'ALL_Rating_10' (num) total with rating of 1.0
                'ALL_Rating_15' (num) total with rating of 1.5
                'ALL_Rating_20' (num) total with rating of 2.0
                'ALL_Rating_25' (num) total with rating of 2.5
                'ALL_Rating_30' (num) total with rating of 3.0
                'ALL_Rating_35' (num) total with rating of 3.5
                'ALL_Rating_40' (num) total with rating of 4.0
                'ALL_Rating_45' (num) total with rating of 4.5
                'ALL_Rating_50' (num) total with rating of 5.0
                'ALL_Price_Mean' (num) average for price
                'ALL_Price_Std' (num) standard deviation for price
                'ALL_Rating_Mean' (num) average for rating
                'ALL_Rating_Std' (num) standard deviation for rating
    '''
    
    #- Total
    summarizeResult[f'{searchPrefix}Total'].append(subset_df.shape[0])
    

    #- Price
    summarizeResult[f'{searchPrefix}Price_1'].append(subset_df[subset_df['Price'] == '$'].shape[0])
    summarizeResult[f'{searchPrefix}Price_2'].append(subset_df[subset_df['Price'] == '$$'].shape[0])
    summarizeResult[f'{searchPrefix}Price_3'].append(subset_df[subset_df['Price'] == '$$$'].shape[0])
    summarizeResult[f'{searchPrefix}Price_4'].append(subset_df[subset_df['Price'] == '$$$$'].shape[0])
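    # Missing prices are 'NA' in memory, but pandas reads 'NA' back from CSV as NaN; count both
    hasNoPrice = subset_df['Price'].isna() | (subset_df['Price'] == 'NA')
    summarizeResult[f'{searchPrefix}Price_0'].append(int(hasNoPrice.sum()))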


    #- Rating
    summarizeResult[f'{searchPrefix}Rating_10'].append(subset_df[subset_df['Rating'] == 1.0].shape[0])
    summarizeResult[f'{searchPrefix}Rating_15'].append(subset_df[subset_df['Rating'] == 1.5].shape[0])
    summarizeResult[f'{searchPrefix}Rating_20'].append(subset_df[subset_df['Rating'] == 2.0].shape[0])
    summarizeResult[f'{searchPrefix}Rating_25'].append(subset_df[subset_df['Rating'] == 2.5].shape[0])
    summarizeResult[f'{searchPrefix}Rating_30'].append(subset_df[subset_df['Rating'] == 3.0].shape[0])
    summarizeResult[f'{searchPrefix}Rating_35'].append(subset_df[subset_df['Rating'] == 3.5].shape[0])
    summarizeResult[f'{searchPrefix}Rating_40'].append(subset_df[subset_df['Rating'] == 4.0].shape[0])
    summarizeResult[f'{searchPrefix}Rating_45'].append(subset_df[subset_df['Rating'] == 4.5].shape[0])
    summarizeResult[f'{searchPrefix}Rating_50'].append(subset_df[subset_df['Rating'] == 5.0].shape[0])


    #- Averages
    summarizeResult[f'{searchPrefix}Price_Mean'].append(subset_df['PriceNum'].mean())
    summarizeResult[f'{searchPrefix}Rating_Mean'].append(subset_df['Rating'].mean())


    #- Standard Deviation
    summarizeResult[f'{searchPrefix}Price_Std'].append(subset_df['PriceNum'].std())
    summarizeResult[f'{searchPrefix}Rating_Std'].append(subset_df['Rating'].std())
    
    
    
    return summarizeResult
    

In [13]:
def calculatePrice(row):
    ''' Converts the Yelp text of dollar signs into numeric value
    
    Accepts : row individual row from DataFrame; has "Price" column 
    
    Returns : (num) numeric value that is converted from text value
    '''
    
    value = 0.0
    
    if (row['Price'] == '$'):
        value = 1
        
    elif (row['Price'] == '$$'):
        value = 2
        
    elif (row['Price'] == '$$$'):
        value = 3
        
    elif (row['Price'] == '$$$$'):
        value = 4
        
    else:
        value = math.nan
        
    
    return value

In [38]:
def getZipcodeYelpDataFromFiles():
    ''' Gets the Yelp data that is stored on disk; one file that contains a DataFrame for each
    zipcode
    
    Returns : (DataFrame) contains Yelp data from all of the files found in the output directory
    '''
    
    #- Message
    print("Getting Yelp Data from output folder...")
    
    
    #-- Files in Output Folder
    outputPath = os.path.join(".", outputDirectory)

    files = os.listdir(outputPath)
    
    
    #-- Load Files
    counter = 0
    yelpData_df = None
    hasFirstYelpData = False
    
    for file in files:
        
        #- Check File Name
        if (file.startswith(yelpDataZipcodeFileName) == True):
            
            counter += 1
            
            
            #- Get DataFrame
            filePath = os.path.join('.', outputDirectory, file)
            
            zipCodeYelpData_df = pd.read_csv(filePath)
            
            
            
            #- Merge
            if (hasFirstYelpData == False):
                # First DataFrame, just assign variable
                hasFirstYelpData = True
                
                yelpData_df = zipCodeYelpData_df
                
            else:
                # Merge DataFrames
                yelpData_df = pd.concat([zipCodeYelpData_df, yelpData_df])
                
        
            #- Message
            print(f"Yelp Data... Records: {zipCodeYelpData_df.shape[0]} Counter: {counter}  {file}")
    
    
    return yelpData_df
    

In [76]:
#-- Summarize Data Based on Zipcode

#- Get Yelp Data
if (useFileForYelpSummarize == True):
    
    print("Using files on disk to summarize")
    
    # Yelp Business DataFrame
    yelpData_df = getZipcodeYelpDataFromFiles()
    
    
    # Yelp Zipcode Summary DataFrame
    yelpDataSummaryFilePath = os.path.join('.', outputDirectory, yelpDataSummaryFileName)
    
    yelpDataSummary_df = pd.read_csv(yelpDataSummaryFilePath)
    
else:
    if (yelpData_df is None) or (yelpDataSummary_df is None):
        raise Exception("Unable to summarize Yelp data; missing reference to yelpDataSummary_df or yelpData_df")

  
#- Prepare DataFrame: Numeric Price
# Converts text of different number of $ to numeric value to allow calculations
yelpData_df['PriceNum'] = yelpData_df.apply(lambda row: calculatePrice(row), axis=1)


#- Create Results Container
# Container is to be converted into DataFrame
summarizedResults = {
    zipcodeColumnName: [],
    'GF_Total' : [],
    'GF_Price_1': [],
    'GF_Price_2': [],
    'GF_Price_3': [],
    'GF_Price_4': [],
    'GF_Price_0': [],
    'GF_Rating_10': [],
    'GF_Rating_15': [],
    'GF_Rating_20': [],
    'GF_Rating_25': [],
    'GF_Rating_30': [],
    'GF_Rating_35': [],
    'GF_Rating_40': [],
    'GF_Rating_45': [],
    'GF_Rating_50': [],
    'GF_Price_Mean': [],
    'GF_Rating_Mean': [],
    'GF_Price_Std': [],
    'GF_Rating_Std': [],
    'ALL_Total' : [],
    'ALL_Price_1': [],
    'ALL_Price_2': [],
    'ALL_Price_3': [],
    'ALL_Price_4': [],
    'ALL_Price_0': [],
    'ALL_Rating_10': [],
    'ALL_Rating_15': [],
    'ALL_Rating_20': [],
    'ALL_Rating_25': [],
    'ALL_Rating_30': [],
    'ALL_Rating_35': [],
    'ALL_Rating_40': [],
    'ALL_Rating_45': [],
    'ALL_Rating_50': [],
    'ALL_Price_Mean': [],
    'ALL_Rating_Mean': [],
    'ALL_Price_Std': [],
    'ALL_Rating_Std': [],
    }


#- Message
print(" ")
print("Start summary of Yelp data...")


#- Group by Zipcode
zipcodeYelpData_GroupBy = yelpData_df.groupby(zipcodeColumnName)


#- Summarize for each Zipcode
for groupName, groupedYelpData_df in zipcodeYelpData_GroupBy:
    
    # Zipcode
    summarizedResults[zipcodeColumnName].append(groupName)
    
    
    # Filter: Gluten Free
    subset_df = groupedYelpData_df.loc[(groupedYelpData_df['IsGlutenFree'] == 1)]
    
    summarizedResults = summarizeYelpDataForZipcode(subset_df, "GF_", summarizedResults)
    
    
    # Filter: Restaurants not flagged as gluten free (IsGlutenFree == 0); summarized under the "ALL_" prefix
    subset_df = groupedYelpData_df.loc[(groupedYelpData_df['IsGlutenFree'] == 0)]
    
    summarizedResults = summarizeYelpDataForZipcode(subset_df, "ALL_", summarizedResults)
    

    
#- Create DataFrame
summarizedYelpData_df = pd.DataFrame(summarizedResults)


#- Save to Disk
summarizedYelpPath = os.path.join(".", outputDirectory, summarizedYelpFileName)

summarizedYelpData_df.to_csv(summarizedYelpPath)


print(f"Completed summarizing the data  {summarizedYelpPath}")

Using files on disk to summarize
Getting Yelp Data from output folder...
Yelp Data... Records: 0 Counter: 1  YelpDataZip_92393.csv
Yelp Data... Records: 96 Counter: 2  YelpDataZip_91103.csv
Yelp Data... Records: 2 Counter: 3  YelpDataZip_92350.csv
Yelp Data... Records: 2 Counter: 4  YelpDataZip_91329.csv
Yelp Data... Records: 0 Counter: 5  YelpDataZip_91507.csv
Yelp Data... Records: 94 Counter: 6  YelpDataZip_90744.csv
Yelp Data... Records: 0 Counter: 7  YelpDataZip_92153.csv
Yelp Data... Records: 0 Counter: 8  YelpDataZip_92609.csv
Yelp Data... Records: 19 Counter: 9  YelpDataZip_90222.csv
Yelp Data... Records: 0 Counter: 10  YelpDataZip_92186.csv
Yelp Data... Records: 90 Counter: 11  YelpDataZip_91316.csv
Yelp Data... Records: 2 Counter: 12  YelpDataZip_92233.csv
Yelp Data... Records: 0 Counter: 13  YelpDataZip_92232.csv
Yelp Data... Records: 5 Counter: 14  YelpDataZip_91934.csv
Yelp Data... Records: 1 Counter: 15  YelpDataZip_90633.csv
Yelp Data... Records: 150 Counter: 16  YelpData

Yelp Data... Records: 0 Counter: 152  YelpDataZip_91526.csv
Yelp Data... Records: 29 Counter: 153  YelpDataZip_92210.csv
Yelp Data... Records: 0 Counter: 154  YelpDataZip_92589.csv
Yelp Data... Records: 5 Counter: 155  YelpDataZip_90639.csv
Yelp Data... Records: 1 Counter: 156  YelpDataZip_90822.csv
Yelp Data... Records: 0 Counter: 157  YelpDataZip_92172.csv
Yelp Data... Records: 0 Counter: 158  YelpDataZip_92166.csv
Yelp Data... Records: 1 Counter: 159  YelpDataZip_92364.csv
Yelp Data... Records: 0 Counter: 160  YelpDataZip_91109.csv
Yelp Data... Records: 0 Counter: 161  YelpDataZip_92158.csv
Yelp Data... Records: 0 Counter: 162  YelpDataZip_92616.csv
Yelp Data... Records: 2 Counter: 163  YelpDataZip_92548.csv
Yelp Data... Records: 0 Counter: 164  YelpDataZip_90410.csv
Yelp Data... Records: 36 Counter: 165  YelpDataZip_92549.csv
Yelp Data... Records: 75 Counter: 166  YelpDataZip_91915.csv
Yelp Data... Records: 2 Counter: 167  YelpDataZip_92039.csv
Yelp Data... Records: 0 Counter: 168 
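
The `PriceNum` step in the cell above relies on `calculatePrice`, which is defined elsewhere in the notebook. As an illustration of the conversion its comment describes, here is a minimal sketch that counts the "$" characters. The helper name `price_to_number` is hypothetical, and mapping a missing price to NaN is an assumption rather than the notebook's confirmed behavior.

```python
import math


def price_to_number(price):
    """Map a Yelp price string such as "$$" to its dollar-sign count; NaN if missing."""
    # Accept only non-empty strings made up entirely of "$"
    if isinstance(price, str) and price and set(price) == {"$"}:
        return float(len(price))
    # Assumption: anything else (None, NaN, empty text) counts as "no price"
    return math.nan


print(price_to_number("$$"))    # 2.0
print(price_to_number(None))    # nan
```

Returning NaN rather than 0 keeps unpriced businesses out of averages, since pandas skips NaN values in `mean` and `std` by default.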

In [77]:
#- Preview Summarized Data
print(summarizedYelpData_df.shape)

pd.set_option('display.max_columns', 500)
summarizedYelpData_df.head()

(107, 39)


,Zipcode,GF_Total,GF_Price_1,GF_Price_2,GF_Price_3,GF_Price_4,GF_Price_0,GF_Rating_10,GF_Rating_15,GF_Rating_20,GF_Rating_25,GF_Rating_30,GF_Rating_35,GF_Rating_40,GF_Rating_45,GF_Rating_50,GF_Price_Mean,GF_Rating_Mean,GF_Price_Std,GF_Rating_Std,ALL_Total,ALL_Price_1,ALL_Price_2,ALL_Price_3,ALL_Price_4,ALL_Price_0,ALL_Rating_10,ALL_Rating_15,ALL_Rating_20,ALL_Rating_25,ALL_Rating_30,ALL_Rating_35,ALL_Rating_40,ALL_Rating_45,ALL_Rating_50,ALL_Price_Mean,ALL_Rating_Mean,ALL_Price_Std,ALL_Rating_Std
0,90006,26,7,15,2,0,0,0,1,1,0,0,6,13,4,1,1.791667,3.826923,0.58823,0.72031,217,98,76,4,0,0,4,7,8,12,20,47,68,34,17,1.47191,3.663594,0.54389,0.890907
1,90028,135,37,77,16,1,0,0,0,0,4,9,38,55,27,2,1.854962,3.862963,0.645959,0.501601,287,111,130,20,2,0,2,4,16,20,34,73,83,46,9,1.669202,3.594077,0.648605,0.787711
2,90032,7,3,4,0,0,0,0,1,0,0,0,1,2,2,1,1.571429,3.857143,0.534522,1.144344,67,49,10,0,0,0,2,3,3,3,3,10,16,17,10,1.169492,3.798507,0.378406,1.048302
3,90033,7,2,5,0,0,0,0,0,0,0,0,0,3,3,1,1.714286,4.357143,0.48795,0.377964,140,86,26,1,0,0,1,1,9,7,15,17,32,37,21,1.247788,3.871429,0.453773,0.906355
4,90035,43,5,35,1,0,0,0,0,2,0,2,12,14,11,2,1.902439,3.895349,0.374492,0.64141,107,28,64,3,1,0,0,4,3,7,10,23,33,19,8,1.760417,3.714953,0.55715,0.82734
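
As a sanity check on the summary columns, the mean and standard deviation can be rebuilt from the bucket counts. The sketch below uses the gluten-free price counts from the 90006 row of the preview above and only the standard library. It assumes the `_Std` columns hold the sample standard deviation (n - 1), which is the pandas default.

```python
import statistics

# Bucket counts copied from the 90006 row: GF_Price_1 .. GF_Price_4
gf_price_counts = {1: 7, 2: 15, 3: 2, 4: 0}

# Expand the counts back into one price level per business
prices = [level for level, count in gf_price_counts.items() for _ in range(count)]

price_mean = statistics.mean(prices)
price_std = statistics.stdev(prices)  # sample standard deviation (n - 1)

print(len(prices), round(price_mean, 6), round(price_std, 5))
# 24 1.791667 0.58823
```

Both values match `GF_Price_Mean` and `GF_Price_Std` for that row. The four buckets sum to 24 while `GF_Total` is 26, which suggests that two businesses had no price and were left out of the price statistics.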


In [78]:
#- Preview Yelp Data
print(yelpData_df.shape)

yelpData_df.head()

(6618, 10)


,Unnamed: 0,ID,Name,Zipcode,Latitude,Longitude,Price,Rating,IsGlutenFree,PriceNum
0,0,r5_pdXxKe6XEh3m15nSpxg,Nonna's Italian Restaurant,92407,34.175171,-117.330432,$,4.5,1,1.0
1,1,oagIREcZKztHsyWgKAmiDg,Yang Noodle House,92407,34.188532,-117.356811,$$,4.5,1,2.0
2,2,TjUj2qJectQo6FjOEU9h4w,Rock N Roll Sushi,92407,34.177175,-117.330335,$$,3.5,1,2.0
3,3,LI2r9n8vqCWQE6lx2TwUKw,Baguette Express,92407,34.175739,-117.330994,$,4.0,1,1.0
4,4,WUChRoWrYmLvapthpRBVAg,Don Roman,92407,34.162861,-117.304994,$,4.5,1,1.0
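
The `Unnamed: 0` column in the preview above is most likely the row index of an earlier DataFrame that was written to CSV and then read back as an ordinary column. How the per-zipcode files were written is not shown in this part of the notebook, so treat that as an assumption. The sketch below reproduces the effect with made-up data and shows that `index=False` avoids it.

```python
import io

import pandas as pd

# Made-up sample data for illustration
sample_df = pd.DataFrame({"ID": ["a", "b"], "Rating": [4.5, 3.5]})

# Default to_csv writes the index as a first column with an empty header
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)
print(columns_with_index)      # ['Unnamed: 0', 'ID', 'Rating']

# index=False leaves the index out of the file
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)
print(columns_without_index)   # ['ID', 'Rating']
```

For files that already contain the extra column, `pd.read_csv(path, index_col=0)` reads it back as the index instead of as data.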
