# Notebook 1: Querying the Yelp API

In this notebook, we query Yelp's API for marijuana dispensaries in Los Angeles. We plan to collect all of the dispensaries and look at crimes within a certain radius of each one over four years: the two years leading up to the vote to legalize marijuana and the two years after it.

In [3]:
import pandas as pd
import numpy as np
import time
import re
import json
import pickle

In [2]:
pwd

'/Users/adamburpee/dsi/project_4/adam_burpee_la7/code'

In [43]:
pip install yelp

Note: you may need to restart the kernel to use updated packages.


In [44]:
pip install yelpapi

Collecting yelpapi
  Downloading https://files.pythonhosted.org/packages/bb/07/f01be72829a3ce2da71bfde33d4bfe9ce5d8173a5a0470420fcb4dbacdd9/yelpapi-2.3.0-py2.py3-none-any.whl
Installing collected packages: yelpapi
Successfully installed yelpapi-2.3.0
Note: you may need to restart the kernel to use updated packages.


Importing the modules needed to query Yelp's API.

In [8]:
from yelpapi import YelpAPI

In [9]:
from yelp.client import Client

MY_API_KEY = "YOUR_API_KEY"  # Replace this with your real API key

client = Client(MY_API_KEY)
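
A key written directly into a notebook is easy to leak when the notebook is shared. One common alternative, sketched here, is to read the key from an environment variable. The variable name `YELP_API_KEY` and the helper `load_api_key` are assumptions made for illustration; they are not part of the original notebook.

```python
import os

def load_api_key(var_name='YELP_API_KEY', environ=None):
    """Return the API key stored under `var_name` (a hypothetical variable name)."""
    # Use the real process environment unless a mapping is supplied.
    environ = os.environ if environ is None else environ
    key = environ.get(var_name)
    if not key:
        raise RuntimeError(f'Set the {var_name} environment variable first.')
    return key

# Demonstration with an explicit mapping so the sketch runs without a real key.
demo_key = load_api_key(environ={'YELP_API_KEY': 'example-key'})
print(demo_key)
```

In the notebook itself, the result of `load_api_key()` could be assigned to `MY_API_KEY` in place of the string literal.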

Defining a helper that builds the parameter dictionary for a Yelp search query.

In [17]:
def get_search_parameters(term, location, limit):
    #See the Yelp API for more details
    params = {}
    params["term"] = term
    params['location'] = location
    #params["ll"] = "{},{}".format(str(lat),str(long))
    #params["radius_filter"] = "2000"
    params["limit"] = limit

    return params

This function prefixes a file name with a Unix timestamp so that each pull gets a unique name, and appends the new name and a description to a log file.

In [18]:
def filename_format_log(file_path, 
                        logfile = './data/file_log.txt',  
                        file_description = None): 
    """Prefix the file name with a Unix timestamp and record it in the log file."""
    # Require a file extension: look for a '.' that is not the first
    # character and is not adjacent to another '.'.
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.') 
    
    # Find where the base name starts; it must look like 'word_word.ext'.
    stamp = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path).start()
    formatted_name = f'{file_path[:stamp]}{round(time.time())}_{file_path[stamp:]}'  
    if not file_description:
        # Default description: the time of the pull in UTC.
        file_description = f'Pull: {time.asctime(time.gmtime(round(time.time())))}'
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, file_description
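
To make the naming rule concrete, here is a minimal sketch that applies the same regular expression with a fixed timestamp instead of the current time, and without writing to the log. The helper name `stamp_name` is hypothetical.

```python
import re

def stamp_name(file_path, timestamp):
    """Insert `timestamp` in front of a base name shaped like 'word_word.ext'."""
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError("Base name must look like 'word_word.ext'.")
    start = match.start()
    return f'{file_path[:start]}{timestamp}_{file_path[start:]}'

stamped = stamp_name('./data/raw_cannabisdispensaries.json', 1555437201)
print(stamped)  # ./data/1555437201_raw_cannabisdispensaries.json
```

Note that a base name without an underscore, such as `raw.json`, is not matched by this pattern.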

This function collects dispensaries from Yelp and saves the raw results of the pull as a JSON file. After each request it pauses for three seconds, then continues until it has made enough requests to cover the number of samples I set.

In [44]:
def yelp_query(category, location, offset_number=0, n_samples=1000):
    yelp_api = YelpAPI(MY_API_KEY)  # key defined in the cell above
    last_result = round(time.time())
    yelp_results = []
    size = 50
    loops = 0
    run = 1
    offset_count = offset_number
   
    
    while loops < n_samples:
        
        print(f'Starting query {run}')
        posts = yelp_api.search_query(categories= category, location = location, offset=offset_count, limit=size) 
            
        yelp_results.extend(posts['businesses'])
        loops += size
        offset_count += 50
        time.sleep(3) 
        run += 1
       
    
    formatted_name, file_description = filename_format_log(file_path =f'./data/raw_{category}.json')
    with open(formatted_name, 'w+') as f:
        json.dump(yelp_results, f)
    
    print(f'Saved and completed query and returned {len(yelp_results)} {category}s.')
    print('Yelp text is ready for processing.')
    print(f'Last timestamp was {round(time.time())}.')

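Querying dispensaries in Los Angeles, asking for up to 1,000 samples. The loop issues all 20 requests regardless of how many dispensaries exist; once Yelp has no more matches, the remaining requests add nothing to the results.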

In [94]:
yelp_query('cannabisdispensaries', 'los angeles', n_samples = 1000)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Starting query 11
Starting query 12
Starting query 13
Starting query 14
Starting query 15
Starting query 16
Starting query 17
Starting query 18
Starting query 19
Starting query 20
Saved and completed query and returned 162 cannabisdispensariess.
Yelp text is ready for processing.
Last timestamp was 1555437201.
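
The run above issued all 20 requests even though only 162 businesses came back. A possible refinement, sketched here under assumptions, is to stop as soon as a page comes back shorter than the page size. `fetch_page` is a hypothetical stand-in for a call such as `yelp_api.search_query(...)['businesses']`, and the fake data exists only so that the sketch runs without network access.

```python
def paginate(fetch_page, page_size=50, max_results=1000):
    """Collect pages until one comes back short or max_results is reached."""
    results = []
    offset = 0
    while offset < max_results:
        page = fetch_page(offset=offset, limit=page_size)
        results.extend(page)
        if len(page) < page_size:
            # A short (or empty) page is taken to mean there is nothing left.
            break
        offset += page_size
    return results

# Fake source with 162 listings, mirroring the count returned above.
fake_listings = [{'id': i} for i in range(162)]
calls = []

def fake_fetch(offset, limit):
    calls.append(offset)  # record each request for inspection
    return fake_listings[offset:offset + limit]

collected = paginate(fake_fetch)
print(len(collected), len(calls))  # 162 4
```

With this stop rule the 162 listings are gathered in 4 requests instead of 20.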


In [7]:
with open(f'../data/1555437201_raw_cannabisdispensaries.json', 'r') as f:
    weed_list = json.load(f)

Creating latitude and longitude lists so that we can store those values in two separate columns instead of keeping a dictionary of both values in a single column.

In [8]:
lat_list = [business['coordinates']['latitude'] for business in weed_list]

In [9]:
long_list = [business['coordinates']['longitude'] for business in weed_list]
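
Both lists can also be built in a single pass. The sketch below additionally tolerates a record with no `coordinates` entry; the third sample record is hypothetical and is included only to show that case, and `extract_coordinates` is an illustrative name.

```python
def extract_coordinates(businesses):
    """Return parallel latitude and longitude lists, using None where missing."""
    latitudes, longitudes = [], []
    for business in businesses:
        coords = business.get('coordinates') or {}
        latitudes.append(coords.get('latitude'))
        longitudes.append(coords.get('longitude'))
    return latitudes, longitudes

sample = [
    {'name': 'Extra Special Delivery',
     'coordinates': {'latitude': 34.1667203, 'longitude': -118.3751849}},
    {'name': 'Ganjarunner',
     'coordinates': {'latitude': 34.10161, 'longitude': -118.30206}},
    {'name': 'Hypothetical Shop'},  # made-up record without coordinates
]

lats, longs = extract_coordinates(sample)
print(lats)
print(longs)
```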

In [10]:
weed_list[0]

{'id': 'NcgdskgqNMrpM4K2nkWTIw',
 'alias': 'extra-special-delivery-north-hollywood-16',
 'name': 'Extra Special Delivery',
 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/GPZ7g9247Ckdvmr7Atki1g/o.jpg',
 'is_closed': False,
 'url': 'https://www.yelp.com/biz/extra-special-delivery-north-hollywood-16?adjust_creative=sZ3yryoyE3Dcs3GsAvlcFA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=sZ3yryoyE3Dcs3GsAvlcFA',
 'review_count': 68,
 'categories': [{'alias': 'cannabisdispensaries',
   'title': 'Cannabis Dispensaries'}],
 'rating': 5.0,
 'coordinates': {'latitude': 34.1667203, 'longitude': -118.3751849},
 'transactions': [],
 'price': '$',
 'location': {'address1': '5250 Lankershim',
  'address2': None,
  'address3': '',
  'city': 'North Hollywood',
  'zip_code': '91601',
  'country': 'US',
  'state': 'CA',
  'display_address': ['5250 Lankershim', 'North Hollywood, CA 91601']},
 'phone': '+18886549872',
 'display_phone': '(888) 654-9872',
 'distance': 12696.08876}

Parsing the data into a dataframe. 

In [22]:
def yelp_parse(sample, df=None):
    # `df` is not used; it is kept only so that existing calls still work.
    
    # Columns of interest, in the order they should appear.
    col_order = ['name',
                 'is_closed',
                 'url',
                 'rating',
                 'coordinates',
                 'location',
                 'price',
                 'review_count']
    
    df = pd.DataFrame(sample)

    # Return a copy so that adding columns later does not modify a slice.
    return df[col_order].copy()

In [23]:
df_weed = yelp_parse(weed_list, df = 'df_weed')

In [24]:
df_weed.head()

Unnamed: 0,name,is_closed,url,rating,coordinates,location,price,review_count
0,Extra Special Delivery,False,https://www.yelp.com/biz/extra-special-deliver...,5.0,"{'latitude': 34.1667203, 'longitude': -118.375...","{'address1': '5250 Lankershim', 'address2': No...",$,68
1,Ganjarunner,False,https://www.yelp.com/biz/ganjarunner-los-angel...,5.0,"{'latitude': 34.10161, 'longitude': -118.30206}","{'address1': '', 'address2': '', 'address3': '...",,51
2,The Higher Path,False,https://www.yelp.com/biz/the-higher-path-sherm...,4.5,"{'latitude': 34.1493390052598, 'longitude': -1...","{'address1': '14080 Ventura Blvd', 'address2':...",$$,149
3,Kushfly,False,https://www.yelp.com/biz/kushfly-los-angeles-2...,4.0,"{'latitude': 34.1276, 'longitude': -118.34669}","{'address1': '3151 Cahuenga Blvd W', 'address2...",$$,147
4,MedMen West Hollywood,False,https://www.yelp.com/biz/medmen-west-hollywood...,4.0,"{'latitude': 34.0905911417357, 'longitude': -1...","{'address1': '8208 Santa Monica Blvd', 'addres...",$$,323


Adding latitude and longitude columns from the lists built above.

In [25]:
df_weed['latitude'] = lat_list

In [26]:
df_weed['longitude'] = long_list

In [27]:
# long_list_2 = [df_weed.loc[i]['longitude'] for i in range(len(df_weed['longitude']))]

In [29]:
df_weed.head()

Unnamed: 0,name,is_closed,url,rating,coordinates,location,price,review_count,latitude,longitude
0,Extra Special Delivery,False,https://www.yelp.com/biz/extra-special-deliver...,5.0,"{'latitude': 34.1667203, 'longitude': -118.375...","{'address1': '5250 Lankershim', 'address2': No...",$,68,34.16672,-118.375185
1,Ganjarunner,False,https://www.yelp.com/biz/ganjarunner-los-angel...,5.0,"{'latitude': 34.10161, 'longitude': -118.30206}","{'address1': '', 'address2': '', 'address3': '...",,51,34.10161,-118.30206
2,The Higher Path,False,https://www.yelp.com/biz/the-higher-path-sherm...,4.5,"{'latitude': 34.1493390052598, 'longitude': -1...","{'address1': '14080 Ventura Blvd', 'address2':...",$$,149,34.149339,-118.439875
3,Kushfly,False,https://www.yelp.com/biz/kushfly-los-angeles-2...,4.0,"{'latitude': 34.1276, 'longitude': -118.34669}","{'address1': '3151 Cahuenga Blvd W', 'address2...",$$,147,34.1276,-118.34669
4,MedMen West Hollywood,False,https://www.yelp.com/biz/medmen-west-hollywood...,4.0,"{'latitude': 34.0905911417357, 'longitude': -1...","{'address1': '8208 Santa Monica Blvd', 'addres...",$$,323,34.090591,-118.36729


Dropping the location column.

In [30]:
df_weed.drop('location', axis = 1, inplace = True)

Adding a location column that holds latitude and longitude as a tuple, in case our distance function needs a tuple as input.

In [31]:
df_weed['location'] = list(zip(df_weed['latitude'], df_weed['longitude']))
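
The distance function itself is not defined in this notebook. As one possibility, here is a minimal sketch of a haversine (great-circle) distance that accepts `(latitude, longitude)` tuples like the ones stored in `location`. The choice of haversine and the name `haversine_km` are assumptions, not the project's confirmed method.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(point_a, point_b):
    """Great-circle distance in kilometers between two (lat, lon) tuples in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# Locations of the first two dispensaries shown in the table above.
extra_special = (34.1667203, -118.3751849)
ganjarunner = (34.10161, -118.30206)
distance = haversine_km(extra_special, ganjarunner)
print(round(distance, 2))
```

A crime would then count as "near" a dispensary when this distance falls below whatever radius is chosen.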

In [32]:
df_weed.head()

Unnamed: 0,name,is_closed,url,rating,coordinates,price,review_count,latitude,longitude,location
0,Extra Special Delivery,False,https://www.yelp.com/biz/extra-special-deliver...,5.0,"{'latitude': 34.1667203, 'longitude': -118.375...",$,68,34.16672,-118.375185,"(34.1667203, -118.3751849)"
1,Ganjarunner,False,https://www.yelp.com/biz/ganjarunner-los-angel...,5.0,"{'latitude': 34.10161, 'longitude': -118.30206}",,51,34.10161,-118.30206,"(34.10161, -118.30206)"
2,The Higher Path,False,https://www.yelp.com/biz/the-higher-path-sherm...,4.5,"{'latitude': 34.1493390052598, 'longitude': -1...",$$,149,34.149339,-118.439875,"(34.1493390052598, -118.439874686508)"
3,Kushfly,False,https://www.yelp.com/biz/kushfly-los-angeles-2...,4.0,"{'latitude': 34.1276, 'longitude': -118.34669}",$$,147,34.1276,-118.34669,"(34.1276, -118.34669)"
4,MedMen West Hollywood,False,https://www.yelp.com/biz/medmen-west-hollywood...,4.0,"{'latitude': 34.0905911417357, 'longitude': -1...",$$,323,34.090591,-118.36729,"(34.0905911417357, -118.367290442404)"


Now that we have latitude, longitude, and location, we can drop coordinates.

In [33]:
df_weed.drop(labels = ['coordinates'], axis = 1, inplace = True)

In [34]:
weed_list[154]['coordinates']['longitude']

-118.18426

In [35]:
df_weed.shape

(162, 9)

Looking for dispensaries whose location duplicates that of an earlier row.

In [36]:
df_weed['name'][df_weed.duplicated('location')]

109          Metro Bloomin
114    The Cannabis Method
Name: name, dtype: object
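
By default, `duplicated` flags only the later occurrences of a repeated value, so the first dispensary at each shared location is not listed. The plain-Python sketch below mimics that rule; the shop names and coordinates are made up for illustration, and `later_duplicates` is a hypothetical helper.

```python
def later_duplicates(rows, key):
    """Return rows whose `key` value was already seen in an earlier row."""
    seen = set()
    flagged = []
    for row in rows:
        value = row[key]
        if value in seen:
            flagged.append(row)
        else:
            seen.add(value)
    return flagged

rows = [
    {'name': 'Shop A', 'location': (34.05, -118.25)},
    {'name': 'Shop B', 'location': (34.10, -118.30)},
    {'name': 'Shop C', 'location': (34.05, -118.25)},  # same spot as Shop A
]

dupes = [row['name'] for row in later_duplicates(rows, 'location')]
print(dupes)  # ['Shop C']
```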

In [37]:
pwd

'/Users/adamburpee/dsi/project_4/adam_burpee_la7/code'

Saving the dispensary dataframe to a CSV file.

In [38]:
df_weed.to_csv('../data/df_weed.csv')

Checking for null values.

In [39]:
df_weed.isnull().sum()

name             0
is_closed        0
url              0
rating           0
price           84
review_count     0
latitude         0
longitude        0
location         0
dtype: int64

We will decide later how to handle the 84 null values in the price column.

Importing the crime dataset. We downloaded this data from the Los Angeles city database.

In [109]:
crime = pd.read_csv('./data/Crime_Data_from_2010_to_Present.csv')

Looking at null values. 

In [111]:
crime.isnull().sum()

DR Number                       0
Date Reported                   0
Date Occurred                   0
Time Occurred                   0
Area ID                         0
Area Name                       0
Reporting District              0
Crime Code                      0
Crime Code Description          0
MO Codes                   211791
Victim Age                      0
Victim Sex                 182963
Victim Descent             183006
Premise Code                   50
Premise Description           119
Weapon Used Code          1297375
Weapon Description        1297376
Status Code                     3
Status Description              0
Crime Code 1                    9
Crime Code 2              1822075
Crime Code 3              1945645
Crime Code 4              1948656
Address                         0
Cross Street              1623108
Location                        0
dtype: int64

In [113]:
crime.shape

(1948750, 26)

Replacing all column name spaces with underscores and lowercasing all characters. 

In [114]:
crime.columns = [column.lower().replace(' ', '_') for column in crime.columns]
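
In the outputs that follow, the last column is named `location_`, which suggests that the original header ends with a space. That is an inference from the output, not something checked here. The sketch below shows the same renaming rule on a few headers, plus a variant that strips whitespace first; `normalize` is an illustrative name.

```python
def normalize(columns, strip=False):
    """Lowercase column names and replace spaces with underscores."""
    cleaned = []
    for column in columns:
        if strip:
            column = column.strip()  # drop leading and trailing whitespace
        cleaned.append(column.lower().replace(' ', '_'))
    return cleaned

# 'Location ' with a trailing space is an assumed header, inferred from the output.
raw_columns = ['DR Number', 'Crime Code Description', 'Location ']

as_in_notebook = normalize(raw_columns)
stripped_first = normalize(raw_columns, strip=True)
print(as_in_notebook)  # ['dr_number', 'crime_code_description', 'location_']
print(stripped_first)  # ['dr_number', 'crime_code_description', 'location']
```

Switching to the stripped variant would rename the column to `location`, so any later code that refers to `location_` would need to change with it.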

In [119]:
crime.head()

Unnamed: 0,dr_number,date_reported,date_occurred,time_occurred,area_id,area_name,reporting_district,crime_code,crime_code_description,mo_codes,...,weapon_description,status_code,status_description,crime_code_1,crime_code_2,crime_code_3,crime_code_4,address,cross_street,location_
0,110215293,08/08/2011,08/08/2011,1700,2,Rampart,256,930,CRIMINAL THREATS - NO WEAPON DISPLAYED,0421 0432 0444 0601 0913,...,VERBAL THREAT,AO,Adult Other,930.0,,,,1800 WILSHIRE BL,,"(34.0564, -118.2725)"
1,110215294,08/08/2011,07/30/2011,5,2,Rampart,231,956,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",1820,...,,IC,Invest Cont,956.0,,,,100 N OCCIDENTAL BL,,"(34.0707, -118.2795)"
2,110215295,08/08/2011,08/02/2011,1430,2,Rampart,231,956,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",1814 2000,...,,AO,Adult Other,956.0,,,,100 S DILLON ST,,"(34.0723, -118.2831)"
3,110215296,08/08/2011,08/08/2011,1810,2,Rampart,257,901,VIOLATION OF RESTRAINING ORDER,1814 2000,...,,AO,Adult Other,901.0,,,,1200 WILSHIRE BL,,"(34.053, -118.2648)"
4,110215299,08/09/2011,08/03/2011,1800,2,Rampart,299,440,THEFT PLAIN - PETTY ($950 & UNDER),0344 1223 1251 1259,...,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",IC,Invest Cont,440.0,,,,PICO,UNION,"(34.0446, -118.2769)"


In [116]:
crime.area_name.value_counts()

77th Street    134448
Southwest      125420
N Hollywood    105469
Pacific        103227
Southeast      102569
Mission         96565
Northeast       92934
Van Nuys        92390
Newton          91884
Hollywood       90491
Topanga         90235
Devonshire      89537
Central         87727
Olympic         87612
Harbor          85226
West Valley     82805
Rampart         82365
West LA         81866
Wilshire        80350
Foothill        73882
Hollenbeck      71748
Name: area_name, dtype: int64

In [122]:
crime.isnull().sum()

dr_number                       0
date_reported                   0
date_occurred                   0
time_occurred                   0
area_id                         0
area_name                       0
reporting_district              0
crime_code                      0
crime_code_description          0
mo_codes                   211791
victim_age                      0
victim_sex                 182963
victim_descent             183006
premise_code                   50
premise_description           119
weapon_used_code          1297375
weapon_description        1297376
status_code                     3
status_description              0
crime_code_1                    9
crime_code_2              1822075
crime_code_3              1945645
crime_code_4              1948656
address                         0
cross_street              1623108
location_                       0
dtype: int64

Dropping unnecessary columns. 

In [123]:
crime.drop(labels = ['mo_codes', 'cross_street'], axis = 1, inplace = True)

In [125]:
crime.drop(labels = 'weapon_used_code', axis = 1, inplace = True)

In [126]:
crime['weapon_description'] = crime['weapon_description'].fillna('unknown')

In [127]:
crime.isnull().sum()

dr_number                       0
date_reported                   0
date_occurred                   0
time_occurred                   0
area_id                         0
area_name                       0
reporting_district              0
crime_code                      0
crime_code_description          0
victim_age                      0
victim_sex                 182963
victim_descent             183006
premise_code                   50
premise_description           119
weapon_description              0
status_code                     3
status_description              0
crime_code_1                    9
crime_code_2              1822075
crime_code_3              1945645
crime_code_4              1948656
address                         0
location_                       0
dtype: int64

In [129]:
crime.drop(labels = ['crime_code_1', 'crime_code_2', 'crime_code_3', 'crime_code_4'], axis = 1, inplace = True)

In [159]:
crime.isnull().sum()

dr_number                      0
date_reported                  0
date_occurred                  0
time_occurred                  0
area_id                        0
area_name                      0
reporting_district             0
crime_code                     0
crime_code_description         0
victim_age                     0
victim_sex                182963
victim_descent            183006
premise_code                  50
premise_description          119
weapon_description             0
status_code                    3
status_description             0
address                        0
location_                      0
dtype: int64

In [170]:
crime.drop(labels = ['premise_description', 'status_code', 'victim_descent', 'victim_sex', 'premise_code'], axis = 1, inplace = True)

In [172]:
crime.head()

Unnamed: 0,dr_number,date_reported,date_occurred,time_occurred,area_id,area_name,reporting_district,crime_code,crime_code_description,victim_age,weapon_description,status_description,address,location_
0,110215293,08/08/2011,08/08/2011,1700,2,Rampart,256,930,CRIMINAL THREATS - NO WEAPON DISPLAYED,57,VERBAL THREAT,Adult Other,1800 WILSHIRE BL,"(34.0564, -118.2725)"
1,110215294,08/08/2011,07/30/2011,5,2,Rampart,231,956,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",20,unknown,Invest Cont,100 N OCCIDENTAL BL,"(34.0707, -118.2795)"
2,110215295,08/08/2011,08/02/2011,1430,2,Rampart,231,956,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",43,unknown,Adult Other,100 S DILLON ST,"(34.0723, -118.2831)"
3,110215296,08/08/2011,08/08/2011,1810,2,Rampart,257,901,VIOLATION OF RESTRAINING ORDER,22,unknown,Adult Other,1200 WILSHIRE BL,"(34.053, -118.2648)"
4,110215299,08/09/2011,08/03/2011,1800,2,Rampart,299,440,THEFT PLAIN - PETTY ($950 & UNDER),0,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",Invest Cont,PICO,"(34.0446, -118.2769)"


In [193]:
crime.drop(['dr_number', 'date_reported', 'area_id', 'reporting_district', 'crime_code', 'address'], axis = 1, inplace = True)

In [197]:
crime.drop(['status_description', 'time_occurred', 'victim_age'], axis = 1, inplace = True)

In [200]:
crime.head()

Unnamed: 0,date_occurred,area_name,crime_code_description,weapon_description,location_
0,08/08/2011,Rampart,CRIMINAL THREATS - NO WEAPON DISPLAYED,VERBAL THREAT,"(34.0564, -118.2725)"
1,07/30/2011,Rampart,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",unknown,"(34.0707, -118.2795)"
2,08/02/2011,Rampart,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",unknown,"(34.0723, -118.2831)"
3,08/08/2011,Rampart,VIOLATION OF RESTRAINING ORDER,unknown,"(34.053, -118.2648)"
4,08/03/2011,Rampart,THEFT PLAIN - PETTY ($950 & UNDER),"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)","(34.0446, -118.2769)"
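
The `location_` values are displayed as text such as `(34.0564, -118.2725)`. If they were read from the CSV as strings, which is likely but not confirmed here, they would need to be converted to numeric tuples before any distance calculation. A minimal sketch, with `parse_location` as a hypothetical helper:

```python
import ast

def parse_location(text):
    """Convert a string like '(34.0564, -118.2725)' into a (lat, lon) tuple of floats."""
    latitude, longitude = ast.literal_eval(text)
    return float(latitude), float(longitude)

parsed = parse_location('(34.0564, -118.2725)')
print(parsed)  # (34.0564, -118.2725)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that comes from a data file.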


In [204]:
pwd

'/Users/adamburpee/dsi/project_4'

In [205]:
cd adam_burpee_la7/

/Users/adamburpee/dsi/project_4/adam_burpee_la7


In [207]:
cd code/

/Users/adamburpee/dsi/project_4/adam_burpee_la7/code


In [None]:
df_weed.to_csv('../data/df_weed.csv')

In [None]:
crime.to_csv('../data/crime.csv')

In [208]:
with open('../assets/crime_df.pkl', 'wb+') as f:
    pickle.dump(crime, f)

In [209]:
with open('../assets/df_weed.pkl', 'wb+') as f:
    pickle.dump(df_weed, f)
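
The next notebook can restore these objects with `pickle.load`. The round trip is sketched below on a small list in a temporary directory, so that it runs without the project's files; the record shown is only a sample.

```python
import os
import pickle
import tempfile

records = [{'name': 'Extra Special Delivery', 'rating': 5.0}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    # Write the object in binary mode...
    with open(path, 'wb') as f:
        pickle.dump(records, f)
    # ...and read it back in binary mode.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == records)  # True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.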

# Conclusion and Next Steps

Now that we've queried the Yelp API and loaded our crime dataset, we can move on to the next notebook and look into crime rates and trends and how they are affected by marijuana legalization. We will examine crime rates in general and then focus on crime within a certain radius of dispensaries.