# Data Comprehensiveness

Problem: We need a comprehensive set of queries for the Google Maps scraper to extract an exhaustive list of activities from Berkeley

Get data scraper running locally:
https://github.com/gosom/google-maps-scraper

Cities | Categories
--- | ---
cities.csv | categories.csv

Desired categories need to be specific (i.e. no atms, religious institutions) and exhaustive (i.e. all restaurant subtypes).
Queries will be generated through cross-matching '[category] near [city], CA'.

In [17]:
# Imports
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
from itertools import product

InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_rows', 4000)

In [18]:
categories = pd.read_table('categories.csv', header=None)
categories = categories.rename(columns={0:'Category'})
categories.insert(1, "Primary", 0, True)
categories.insert(2, "Secondary", 0, True)
categories.insert(3, "Remove", 0, True)
categories.insert(4, "Unclassified", 1, True)
categories

Unnamed: 0,Category,Primary,Secondary,Remove,Unclassified
0,3d printing service,0,0,0,1
1,abarth dealer,0,0,0,1
2,abbey,0,0,0,1
3,aboriginal and torres strait islander organisa...,0,0,0,1
4,aboriginal art gallery,0,0,0,1
...,...,...,...,...,...
5497,youth organization,0,0,0,1
5498,youth social services organization,0,0,0,1
5499,yucatan restaurant,0,0,0,1
5500,zhejiang restaurant,0,0,0,1


In [19]:
samples_count = {'Restaurant': len(categories[categories["Category"].str.contains("restaurant")]),
                'Stop': len(categories[categories["Category"].str.contains("stop")]),
                'Company': len(categories[categories["Category"].str.contains("company")]),
                'Store': len(categories[categories["Category"].str.contains("store")]),
                'Shop':len(categories[categories["Category"].str.contains("store")]),
                'Station': len(categories[categories["Category"].str.contains("station")]),
                'Service': len(categories[categories["Category"].str.contains("service")]),
                'Agency': len(categories[categories["Category"].str.contains("agency")]),}
samples_count

{'Restaurant': 367,
 'Stop': 8,
 'Company': 99,
 'Store': 379,
 'Shop': 379,
 'Station': 45,
 'Service': 470,
 'Agency': 90}

In [23]:
# Basics: remove all obvious stupid options: parkING (issue), therapist
# Note: car rental agencies removed
# School a bit of a grey area (i.e. ski school) - removing for now
# SERVICE, CENTER

remove_terms = ["company", "parking", "car ", "agency", "church", "firm", "dealer", "auction", "manufactur", 
                "school", "bank", "atm", "facility", "hostel", "clinic", "supplier", "religious", "cleaning",
               "group", "ist", "academy", "tant", "department", "building", "information", "bureau", "office",
               "police", "station", "air", "service", "ag", "broker", "factory", "centre", "public"]
secondary_terms = ["shop", "store", "boutique", "ceramics", "charcuterie", "butcher", "clothing", "cosmetics", "couture",
                  "delicatessen", "fashion", "flea market", "market", "nail salon", "patisserie", "textiles"]
primary_terms = ["restaurant", "club", "museum", "class", " bar", " park", "adventure", "studio", "rental",
                "tour", "theater", "hall", "arena", "studio", "venue", "bakery", " field", "badminton",
                "basketball", "baseball", "squash", "tennis", "cafe", "coffee shop", "cinema", "diner", "race"]
terms_group = {"Remove": remove_terms, "Secondary": secondary_terms, "Primary": primary_terms}

for col, content in terms_group.items():
    for term in content: # get each word
        categories.loc[categories["Category"].str.contains(term), col] = 1 # set to 1 in proper category
        categories.loc[categories["Category"].str.contains(term), "Unclassified"] = 0 # set to 0 in unclassified
        for other_col in terms_group.keys():
            if other_col != col:
                categories.loc[categories["Category"].str.contains(term), other_col] = 0 # 0 for rest

manual_terms_raw = """spa, spa and health club, wellness hotel, rock climbing, sauna, gay sauna, ski, water ski, stadium, swimming pool,
                     aquarium, boxing ring, boxing gym, boxing club, botanical garden, campground, castle, culture, dancing, exhibit, facial spa, 
                     football pitch, fraternal organization, gambling house, garden, golf course, golf driving range, horse riding, ice skating rink, 
                     irish pub, mountain bike, nightlife, pier, planetarium, pub, sauna, ski, ski jumping hill, surf spot, town, wellness, winery,
                     wine cellar, zoo, bar, park, escape room center, amusement center, garden center, historic city center, indoor snowcenter, 
                     laser tag center, meditation center, recreation center, skydiving center, wilderness center, art center, aerial sports center, 
                     amphitheater, magician, massage, shooting range, rodeo, shooting range, ballroom, bbq area, beach volleyball court, beer garden, 
                     biking trail, bmx track, bocce ball court, casino, festival"""

manual_terms = manual_terms_raw.replace("\n                     ", "").split(", ")
for term in manual_terms: # get each word
    categories.loc[categories["Category"] == term, "Primary"] = 1 # set to 1 in proper category
    categories.loc[categories["Category"] == term, "Unclassified"] = 0 # set to 0 in unclassified
    for other_col in terms_group.keys():
        if other_col != "Primary":
            categories.loc[categories["Category"] == term, other_col] = 0 # 0 for rest

manual_remove_terms = ["abarth dealer", "aerospace company", "agriculture", "agriculture cooperative", "agriculture machines supplier", 
                        "aircraft rental service", "airport car park", "airport parking lot", "animation studio", "apartment rental agency",
                        "appliance rental service", "aquaculture farm", "army barracks", "bar restaurant furniture shop", 
                        "bar restaurant furniture store", "bar stool supplier", "bar tabac", "barbecue shop", "barber school", "barber supply shop",
                        "barber supply store", "bariatric surgeon", "barn", "barrel supplier", "barrister", "bartending school", "baseball goods shop",
                        "baseball goods store", "basketball court contractor", "bicycle club", "body shaping class", "book publisher", "bouncy castle hire",
                        "bus tour agency", "cabin rental agency", "cinema equipment supplier", "city department of public safety", "city hall", 
                        "city or town hall", "classified ads newspaper publisher", "co-working space", "college of agriculture", "computer club", 
                        "computer rental agency", "condominium rental agency", "construction machine rental service", "copier repair service",
                        "cottage rental", "coworking space", "crane rental agency", "department of public safety", "desktop publishing service",
                        "dress and tuxedo rental service", "evening dress rental service", "exhibition planner", "forklift rental service", 
                        "full dress rental service", "furniture rental service", "gardener", "garden allotment", "golf course builder", 
                        "heavy machinery rental service", "holiday apartment rental", "home theater store", "hospitality and tourism school",
                        "hyperbaric medicine physician", "jehovah’s witness kingdom hall", "lava field", "lawn equipment rental service", 
                       "mailbox rental service", "masonic hall", "military barracks", "military town", "multimedia and electronic book publisher",
                       "newspaper publisher", "office equipment rental company", "oil field equipment supplier", "party equipment rental service", 
                       "printed music publisher", "property rental", "property rental agency", "public amenity house", "bath", "public bath",
                       "public bathroom", "public baths", "public educational institution", "public female bathroom", "public mailbox",
                       "public housing", "postbox rental service", "public parking lot", "public parking space", "race car dealer", "real estate rental",
                       "real estate rental agency", "restaurant supply store", "retail space rental agency", "rsl club", "scaffolding rental service",
                       "short term apartment rental agency", "stained glass studio", "student halls", "studio", "studio apartment", "surf lifesaving club", 
                       "table tennis supply shop", "table tennis supply store", "tennis court construction company", "tennis shop", "tennis store", 
                       "theater supply store", "tourist information center", "tourist information centre", "trailer rental service", 
                       "vacation home rental agency", "valet parking service", "vauxhall/opel dealer", "village hall", "virtual office rental", 
                       "warehouse club", "wedding dress rental service", "wedding venue", "wheelchair rental service"]

for term in manual_remove_terms: # get each word
    categories.loc[categories["Category"] == term, "Remove"] = 1 # set to 1 in proper category
    categories.loc[categories["Category"] == term, "Primary"] = 0 # set to 0 in unclassified
    for other_col in terms_group.keys():
        if other_col != "Remove":
            categories.loc[categories["Category"] == term, other_col] = 0 # 0 for rest

In [24]:
CHECK_WORD = ""
categories[categories["Category"].str.contains(CHECK_WORD)]
len(categories[categories["Category"].str.contains(CHECK_WORD)])

Unnamed: 0,Category,Primary,Secondary,Remove,Unclassified
0,3d printing service,0,0,1,0
1,abarth dealer,0,0,1,0
2,abbey,0,0,0,1
3,aboriginal and torres strait islander organisa...,0,0,0,1
4,aboriginal art gallery,0,0,0,1
...,...,...,...,...,...
5497,youth organization,0,0,0,1
5498,youth social services organization,0,0,1,0
5499,yucatan restaurant,1,0,0,0
5500,zhejiang restaurant,1,0,0,0


5502

In [25]:
len(categories.loc[categories["Primary"] == 1])
select_categories = categories.loc[categories["Primary"] == 1]
select_categories

782

Unnamed: 0,Category,Primary,Secondary,Remove,Unclassified
9,acaraje restaurant,1,0,0,0
34,adult entertainment club,1,0,0,0
37,adventure sports,1,0,0,0
38,adventure sports center,1,0,0,0
39,adventure sports centre,1,0,0,0
45,aerial sports center,1,0,0,0
46,aero dance class,1,0,0,0
48,aeroclub,1,0,0,0
53,afghani restaurant,1,0,0,0
56,african restaurant,1,0,0,0


In [32]:
cities = pd.read_table('cities.csv', header=None)
cities = cities.rename(columns={0:'City'})

combinations = product(select_categories['Category'], cities.loc[cities['City']=="Berkeley", "City"])
df_combinations = pd.DataFrame([f'{activity} in {city.lower()}, ca' for activity, city in combinations], columns=['Activity in City'])
df_combinations

Unnamed: 0,Activity in City
0,"acaraje restaurant in berkeley, ca"
1,"adult entertainment club in berkeley, ca"
2,"adventure sports in berkeley, ca"
3,"adventure sports center in berkeley, ca"
4,"adventure sports centre in berkeley, ca"
5,"aerial sports center in berkeley, ca"
6,"aero dance class in berkeley, ca"
7,"aeroclub in berkeley, ca"
8,"afghani restaurant in berkeley, ca"
9,"african restaurant in berkeley, ca"


Side note: this might be stupid ah - testing on first 20 prompts

In [33]:
df_combinations.to_csv('clean_activity_berk.csv', index=False)

In [46]:
print(df_combinations.head(20).to_string(index=False).replace("  ", ""))

Activity in City
acaraje restaurant in berkeley, ca
adult entertainment club in berkeley, ca
adventure sports in berkeley, ca
 adventure sports center in berkeley, ca
 adventure sports centre in berkeley, ca
aerial sports center in berkeley, ca
aero dance class in berkeley, ca
aeroclub in berkeley, ca
afghani restaurant in berkeley, ca
african restaurant in berkeley, ca
 aikido club in berkeley, ca
 alsace restaurant in berkeley, ca
 amateur theater in berkeley, ca
 american football field in berkeley, ca
 american restaurant in berkeley, ca
amphitheater in berkeley, ca
amusement center in berkeley, ca
amusement park in berkeley, ca
 amusement park ride in berkeley, ca
anago restaurant in berkeley, ca


Scraper speed

Inactivity - 0.4m
depth: 1, prompt (general): 1 (restaurants), inactivity: 3m -- 4:08 / 17 results
depth: 1, prompt (general): 1 (restaurants), inactivity: 0.5m -- 1:08 / 17 results
depth: 1, prompt (general): 1 (restaurants), inactivity: 0.2m -- 0:02 / 0 results - failed
depth: 1, prompt (general): 1 (restaurants), inactivity: 0.4m -- 0:56 / 17 results
depth: 1, prompt (general): 2 (restaurants & tennis), inactivity: 0.3m -- 1:12 / 37 results

