# YELP API Data on businesses: City of Los Angeles CA
--------------------------------------------

## ETL PROCES REPORT
--------------------------------------------
### E: PULLING DATA FROM YELP'S API:

* Step 1: Get an API KEY  from yelp
* Investigate what kind of data to retrieve. In our case: General business profiles in the City of LA California
* The API returns only 50 results max per call
* Perform a test pull to look at the result and it's json structure to design the data frame

### T: PREPROCESSING THE JSON RESPONSE FROM THE API:
--------------------------------------------

* Declare/Initialize the variables/columns that will hold the downloaded data
* We build a dataset of 1000 resutls by making multiple calls dynamically and saving data into a dataframe on the fly
* Data processing consisted in extracting data from the json response and formatting it to create tabular data for the df
* The Transforming process included string manipulation regex and python scripting.

### L: LOADING TO SQL AND MONGODB:
--------------------------------------------
* This was the most straightforward step accomplished with no issues
* We simplify this proces by using sqlite instead of mysql or postgress
* We also saved csv files of the data before we loaded it into the database systems.


In [7]:
import requests
import json
import pandas as pd
from credentials import APIKey
import time
from datetime import date, datetime
from time import gmtime, strftime

# strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime())
# strftime("%a, %d %b %Y %H:%M:%S +0000")
strftime("%a, %b %d %Y %H:%M:%S")

'Tue, May 21 2019 11:36:23'

In [8]:
MY_API_KEY = APIKey
HEADERS = {'Authorization': 'bearer %s' % MY_API_KEY}


In [9]:
base_url = "https://api.yelp.com/v3/businesses/search"

In [10]:
search_location = 'los angeles ca united states'

In [11]:
PARAMETERS = {'location': search_location,
                  'limit': 50,
                  'radius': 40000,
                 }

In [27]:
response = requests.get(url = base_url,
                        params = PARAMETERS,
                        headers = HEADERS)

print(f"Yelp API hit at: {strftime('%a, %d %b %Y %H:%M:%S')}")

Yelp API hit at: Mon, 20 May 2019 15:52:07


In [319]:
### Check if response was succesful: response code 200 🕺🏽
print(response.status_code)
if response.status_code == 200:
    print(f"Response succesfull 🥳")

200
Response succesfull 🥳


In [None]:
response.text

In [41]:
# Save a pickle to local machine of the first 50 results
with open('Pickles/yelp_first50_text_pickle', 'wb') as datafile :
    pickle.dump(response.text, datafile )


In [None]:
#Inspect the pickle file
with open('Pickles/yelp_first50_text_pickle','rb') as pickleText:
    yelp_pickle_50 = pickle.load(pickleText)
    

In [None]:
# yelp_pickle_50
yelp_pickle_50

In [322]:
responseDict = {}

In [None]:
response.text

In [324]:
response_url = response.url
print(f"{response_url}")
print(f"type(response_url): {type(response_url)}")

https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000
type(response_url): <class 'str'>


In [325]:
json_response = response.json()

In [326]:
# inspect the json file
json_response['businesses'][0]

{'id': 'TkFEKhsCixPWlShULKvMdQ',
 'alias': 'bottega-louie-los-angeles',
 'name': 'Bottega Louie',
 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/rAImnKvUNcNY8i6qEDWrZA/o.jpg',
 'is_closed': False,
 'url': 'https://www.yelp.com/biz/bottega-louie-los-angeles?adjust_creative=rCpoZ5I2XhS-AiKMOroCkg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=rCpoZ5I2XhS-AiKMOroCkg',
 'review_count': 16248,
 'categories': [{'alias': 'italian', 'title': 'Italian'},
  {'alias': 'bakeries', 'title': 'Bakeries'},
  {'alias': 'breakfast_brunch', 'title': 'Breakfast & Brunch'}],
 'rating': 4.0,
 'coordinates': {'latitude': 34.0469300995766, 'longitude': -118.256601457672},
 'transactions': [],
 'price': '$$',
 'location': {'address1': '700 S Grand Ave',
  'address2': None,
  'address3': '',
  'city': 'Los Angeles',
  'zip_code': '90017',
  'country': 'US',
  'state': 'CA',
  'display_address': ['700 S Grand Ave', 'Los Angeles, CA 90017']},
 'phone': '+12138021470',
 'display_phon

In [330]:
# Save the body of the first API call(FIRST 50 RESULTS)
responseDict[response_url] = response.text

In [331]:
print(f"First API call results:")
print(f"for the city of {search_location}")
print(f"-"*45)
print(f"Yelp found {json_response['total']} businesses in the area")
print(f"Resutls returned in the response object: {len(json_response['businesses'])} ")
print(f"Keys in the response object: {json_response.keys()}")


First API call results:
for the city of los angeles ca united states
---------------------------------------------
Yelp found 11000 businesses in the area
Resutls returned in the response object: 50 
Keys in the response object: dict_keys(['businesses', 'total', 'region'])


In [332]:
biz_count = json_response['total']
biz_count

11000

In [333]:
# test joining business categories  from the response object into one string
', '.join([val['title'] for val in json_response['businesses'][0]['categories']])

'Italian, Bakeries, Breakfast & Brunch'

In [334]:
# The API will only privide 1000 results max.
# if there are more than 1000 results:
# Set the variable to a max of 1000 to build the urls accordingly
if biz_count > 1000:
    biz_count = 1000

In [335]:
# variables to use in second loop. bins to dynamically divide the calls into chunks
print(biz_count)
bins = [chunk for chunk in range(50, biz_count, 50)]
temp_urls = []
response_url

1000


'https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000'

In [336]:
# loop to build the amount of urls according to the amount of results from the response object
for i in range(len(bins)):
    temp_urls.append(response_url + "&offset=" + f"{str(bins[i])}")
    

In [337]:
# inspect the first and last url to test the chunks
print(temp_urls[0],'\n', temp_urls[-1])

https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=50 
 https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=950


In [338]:
# Initialize an empty dataframe
biz_df = pd.DataFrame()

In [346]:
# Initialize columns for the dataframe
biz_df['id'] = ''
biz_df['alias'] = ''
biz_df['name'] = ''
biz_df['categories'] = ''
biz_df['review_count'] = ''
biz_df['rating'] = ''
biz_df['lat'] = ''
biz_df['long'] = ''
# biz_df['transactions'] = ''
biz_df['address'] = ''
biz_df['city'] = ''
biz_df['zip_code'] = ''
biz_df['country'] = ''
biz_df['state'] = ''
biz_df['phone'] = ''
biz_df['pulled_on'] = ''

# Inspect columns
biz_df

Unnamed: 0,id,alias,name,categories,review_count,rating,lat,long,address,city,zip_code,country,state,phone,timestamp,pulled_on


In [347]:
# Inspect the last result
json_response['businesses'][-1]

{'id': 'sYn3SNQP-j2t2XSwjlCbRg',
 'alias': 'montys-good-burger-los-angeles',
 'name': "Monty's Good Burger",
 'image_url': 'https://s3-media2.fl.yelpcdn.com/bphoto/qoVy9kU9SLr8dsJNuIweWA/o.jpg',
 'is_closed': False,
 'url': 'https://www.yelp.com/biz/montys-good-burger-los-angeles?adjust_creative=rCpoZ5I2XhS-AiKMOroCkg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=rCpoZ5I2XhS-AiKMOroCkg',
 'review_count': 1210,
 'categories': [{'alias': 'burgers', 'title': 'Burgers'},
  {'alias': 'vegan', 'title': 'Vegan'}],
 'rating': 4.5,
 'coordinates': {'latitude': 34.06469, 'longitude': -118.30876},
 'transactions': ['pickup', 'delivery'],
 'price': '$$',
 'location': {'address1': '516 S Western Ave',
  'address2': '',
  'address3': None,
  'city': 'Los Angeles',
  'zip_code': '90020',
  'country': 'US',
  'state': 'CA',
  'display_address': ['516 S Western Ave', 'Los Angeles, CA 90020']},
 'phone': '+12139150257',
 'display_phone': '(213) 915-0257',
 'distance': 1206.434807

In [348]:
# Inspect the urls to be used in dynamic api calls to get results from 51 to 1000. Each one will return 50 results
print(f"The second loop will make {len(temp_urls)} calls to the following {len(temp_urls)} different urls\n")
for i in range(len(temp_urls)):
    print(temp_urls[i])

The second loop will make 19 calls to the following 19 different urls

https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=50
https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=100
https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=150
https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=200
https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=250
https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=300
https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=350
https://api.yelp.com/v3/businesses/search?location=los+angeles+ca+united+states&limit=50&radius=40000&offset=400
https://api.yelp.com/v3/bu

In [349]:
#Outside variable that will count results and serve as an index as df is built
# THIS WILL keep track of the results in both loops: first 50 results and the rest:
# in this case it is 1000
index_ = 0


## First Loop: Inject data from first API CALL performed earlier to dataframe


In [None]:

try:
    for j in range(len(json_response['businesses'])):
        print("=======================================================")
        print(f"Injecting DataFrame with first set of 50 results")
        print(f"biz index: {index_}: {json_response['businesses'][j]['name']}")
        print("=======================================================")
        
#         detail = temp_response_json['businesses'][j]
        
        biz_df.set_value(index_, "id", json_response['businesses'][j]['id'])
        biz_df.set_value(index_, "alias", json_response['businesses'][j]['alias'])

        biz_df.set_value(index_, "name", json_response['businesses'][j]['name'])
        biz_df.set_value(index_, "categories", ', '.join([val['title'] for val in json_response['businesses'][j]['categories']]) )        

        biz_df.set_value(index_, "review_count", json_response['businesses'][j]['review_count'])
        biz_df.set_value(index_, "rating", json_response['businesses'][j]['rating'])

        biz_df.set_value(index_, "lat", json_response['businesses'][j]['coordinates']['latitude'])
        biz_df.set_value(index_, "long", json_response['businesses'][j]['coordinates']['longitude'])

        biz_df.set_value(index_, "address", ' '.join(json_response['businesses'][j]['location']['display_address']))
        biz_df.set_value(index_, "city", json_response['businesses'][j]['location']['city'])

        biz_df.set_value(index_, "zip_code", json_response['businesses'][j]['location']['zip_code'])
        biz_df.set_value(index_, "country", json_response['businesses'][j]['location']['country'])

        biz_df.set_value(index_, "state", json_response['businesses'][j]['location']['state'])
        biz_df.set_value(index_, "phone", json_response['businesses'][j]['phone'])
        biz_df.set_value(index_, "pulled_on", datetime.now())

        index_ += 1
except:
    print("Something went WRONG 😫😓")            
              

## Second loop: making multiple calls simultanuslly to Yelp's API

In [None]:
i = 0
j = 0

try:
    for i in range(len(temp_urls)):
        temp_response = requests.get(url = temp_urls[i], headers = HEADERS)
        
        print("=======================================================")
        print(f"temp_response code: {temp_response}")
        if temp_response.status_code == 200:
            print(f"Response succesfull 🥳")
        else:
            print(f"Something went wrong 😰")
        
        temp_response_json = temp_response.json()
        
        temp_response_string = temp_response.text
        
        responseDict[temp_response.url] = temp_response_string
        
        print("=======================================================")
        print(f"Downloading results from: URL: {temp_response.url}" )

        for j in range(len(temp_response_json['businesses'])):
                  
            biz_df.set_value(index_, "id", json_response['businesses'][j]['id'])
            biz_df.set_value(index_, "alias", json_response['businesses'][j]['alias'])

            biz_df.set_value(index_, "name", json_response['businesses'][j]['name'])
            biz_df.set_value(index_, "categories", ', '.join([val['title'] for val in json_response['businesses'][j]['categories']]) )        

            biz_df.set_value(index_, "review_count", json_response['businesses'][j]['review_count'])
            biz_df.set_value(index_, "rating", json_response['businesses'][j]['rating'])

            biz_df.set_value(index_, "lat", json_response['businesses'][j]['coordinates']['latitude'])
            biz_df.set_value(index_, "long", json_response['businesses'][j]['coordinates']['longitude'])

            biz_df.set_value(index_, "address", ' '.join(json_response['businesses'][j]['location']['display_address']))
            biz_df.set_value(index_, "city", json_response['businesses'][j]['location']['city'])

            biz_df.set_value(index_, "zip_code", json_response['businesses'][j]['location']['zip_code'])
            biz_df.set_value(index_, "country", json_response['businesses'][j]['location']['country'])

            biz_df.set_value(index_, "state", json_response['businesses'][j]['location']['state'])
            biz_df.set_value(index_, "phone", json_response['businesses'][j]['phone'])
            biz_df.set_value(index_, "pulled_on", datetime.now())
            

            index_ += 1
              
        time.sleep(3)

except:
    print("Something went WRONG 😰")

print(f"\n\nDatabase ready: records entered:{index_}")

In [426]:
biz_df.tail(2)

Unnamed: 0,id,alias,name,categories,review_count,rating,lat,long,address,city,zip_code,country,state,phone,pulled_on
998,YA3bV7kd3RpWPvrarIgpWQ,milk-jar-cookies-los-angeles,Milk Jar Cookies,"Desserts, Bakeries, Coffee & Tea",1445,4.5,34.0621,-118.348,"5466 Wilshire Blvd Los Angeles, CA 90036",Los Angeles,90036,US,CA,13236349800,2019-05-20 22:18:50.470209
999,sYn3SNQP-j2t2XSwjlCbRg,montys-good-burger-los-angeles,Monty's Good Burger,"Burgers, Vegan",1210,4.5,34.0647,-118.309,"516 S Western Ave Los Angeles, CA 90020",Los Angeles,90020,US,CA,12139150257,2019-05-20 22:18:50.471351


In [164]:
# responseDict

In [369]:
# Save a pickle to local machine of the first and second loop data combined 
#Dictionary of 1000 businesses
tstamp = time.time()
path = f"Pickles/responseDict_pickle_{tstamp}"

with open(path, 'wb') as datafile :
    pickle.dump(responseDict, datafile)


In [370]:
#Inspect the pickle file d
with open(path,'rb') as datafile:
    responseDictUnpickled = pickle.load(datafile)
    

In [371]:
print(type(responseDictUnpickled))

<class 'dict'>


# L:  load
-----------------------------------------------
* Save df to csv format

In [373]:
biz_df.to_csv(f"Resources/la_yelp_{biz_df['id'].count()}.csv", index=False)

In [388]:
biz_df.drop(columns=['timestamp'], inplace=True)

In [390]:
# mongodb didn't like my timestamp object
biz_df['pulled_on'] = biz_df['pulled_on'].astype(str)

In [401]:
type(biz_df['pulled_on'][0])

str

* load dataframe into mongo

In [402]:
import pymongo

In [403]:
# Setup connection to mongodb
conn = "mongodb://localhost:27017"
client = pymongo.MongoClient(conn)

In [405]:
# Select database and collection to use
db = client.yelp
bizProfile = db.bizProfile
census_zips = db.census_zips

In [406]:
yelpData = biz_df.to_dict(orient='records')

In [407]:
yelpData[0]


{'id': 'TkFEKhsCixPWlShULKvMdQ',
 'alias': 'bottega-louie-los-angeles',
 'name': 'Bottega Louie',
 'categories': 'Italian, Bakeries, Breakfast & Brunch',
 'review_count': 16248,
 'rating': 4.0,
 'lat': 34.0469300995766,
 'long': -118.256601457672,
 'address': '700 S Grand Ave Los Angeles, CA 90017',
 'city': 'Los Angeles',
 'zip_code': '90017',
 'country': 'US',
 'state': 'CA',
 'phone': '+12138021470',
 'pulled_on': '2019-05-20 22:15:30.692854'}

In [408]:
bizProfile.insert_many(yelpData)

<pymongo.results.InsertManyResult at 0x11d64c108>

## test query database

In [409]:
results = bizProfile.find()

In [410]:
type(results)

pymongo.cursor.Cursor

In [411]:
for result in results[0:2]:
    print()
    print(result)



{'_id': ObjectId('5ce3660abe1a7794394fce81'), 'id': 'TkFEKhsCixPWlShULKvMdQ', 'alias': 'bottega-louie-los-angeles', 'name': 'Bottega Louie', 'categories': 'Italian, Bakeries, Breakfast & Brunch', 'review_count': 16248, 'rating': 4.0, 'lat': 34.0469300995766, 'long': -118.256601457672, 'address': '700 S Grand Ave Los Angeles, CA 90017', 'city': 'Los Angeles', 'zip_code': '90017', 'country': 'US', 'state': 'CA', 'phone': '+12138021470', 'pulled_on': '2019-05-20 22:15:30.692854'}

{'_id': ObjectId('5ce3660abe1a7794394fce82'), 'id': '7O1ORGY36A-2aIENyaJWPg', 'alias': 'howlin-rays-los-angeles-3', 'name': "Howlin' Ray's", 'categories': 'Southern, Chicken Shop, American (Traditional)', 'review_count': 5060, 'rating': 4.5, 'lat': 34.0614861063899, 'long': -118.239554800093, 'address': '727 N Broadway Ste 128 Los Angeles, CA 90012', 'city': 'Los Angeles', 'zip_code': '90012', 'country': 'US', 'state': 'CA', 'phone': '+12139358399', 'pulled_on': '2019-05-20 22:15:30.694905'}


* First document in the collection

In [427]:
# Interesting: result object came back with duplicate businesses
for result in bizProfile.find({'id': 'TkFEKhsCixPWlShULKvMdQ'}):
    print(result['address'], '| Review Count: ', result['review_count'])


700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Angeles, CA 90017 | Review Count:  16248
700 S Grand Ave Los Ange

# Create SQL database from  Dataframe

In [12]:
from sqlalchemy import create_engine

In [13]:
database_path = "Resources/la_biz.sqlite"

In [14]:
engine = create_engine(f"sqlite:///{database_path}")

In [420]:
engine.table_names()

[]

In [421]:
biz_df.to_sql(name='bizProfile', con=engine, if_exists='append', index=False)

In [16]:
engine.table_names()

['bizProfile']

### Test Query the biz_profile table


In [17]:
# pd.read_sql_query('select * from customer_name', con=engine).head()
pd.read_sql_query('select * from bizProfile', con=engine).head()

Unnamed: 0,id,alias,name,categories,review_count,rating,lat,long,address,city,zip_code,country,state,phone,pulled_on
0,TkFEKhsCixPWlShULKvMdQ,bottega-louie-los-angeles,Bottega Louie,"Italian, Bakeries, Breakfast & Brunch",16248,4.0,34.04693,-118.256601,"700 S Grand Ave Los Angeles, CA 90017",Los Angeles,90017,US,CA,12138021470,2019-05-20 22:15:30.692854
1,7O1ORGY36A-2aIENyaJWPg,howlin-rays-los-angeles-3,Howlin' Ray's,"Southern, Chicken Shop, American (Traditional)",5060,4.5,34.061486,-118.239555,"727 N Broadway Ste 128 Los Angeles, CA 90012",Los Angeles,90012,US,CA,12139358399,2019-05-20 22:15:30.694905
2,KQBGm5G8IDkE8LeNY45mbA,wurstküche-los-angeles-2,Wurstküche,"Hot Dogs, German, Gastropubs",8058,4.0,34.045605,-118.236061,"800 E 3rd St Los Angeles, CA 90013",Los Angeles,90013,US,CA,12136874444,2019-05-20 22:15:30.696984
3,iSZpZgVnASwEmlq0DORY2A,daikokuya-little-tokyo-los-angeles,Daikokuya Little Tokyo,"Ramen, Noodles",8126,4.0,34.050081,-118.24018,"327 E 1st St Los Angeles, CA 90012",Los Angeles,90012,US,CA,12136261680,2019-05-20 22:15:30.703246
4,DJek3FUewBzMc0gS-Gms9w,the-morrison-los-angeles,The Morrison,"Gastropubs, Burgers, Bars",4074,4.5,34.12384,-118.26868,"3179 Los Feliz Blvd Los Angeles, CA 90039",Los Angeles,90039,US,CA,13236671839,2019-05-20 22:15:30.705182
