# Python Web Scraper For WIC Grocery Stores

## Part 1: Scraping store names and addresses from the WIC Grocery Stores website

In [1]:
# Import modules
import requests  # used to get url
import json  # used to read json files on the webpage
import pandas as pd  # used to integrate data

The starting URL is the request URL that the site issues when you turn the page. It can be retrieved from Developer Tools -> Network -> Headers, but only after turning the page.

The data on this website are served as 84 pages of JSON. Each page lists up to 10 stores and their addresses. The total number of pages may change, so please update the variable 'page_count' to reflect the current total.

In [2]:
# There are 84 pages on the WIC website
page_count = 84

# Begin with page 1
page_number = 1

# start_url begins with page 1 as well
start_url = 'https://wicgrocerystores.web.health.state.mn.us/search?criteria=&size=10&page=' + str(page_number)

In [3]:
# Get response from start_url
resp = requests.get(start_url)

# Load the json file in start_url (only page 1 is there)
data = json.loads(resp.text)
print(data)

{'totalElements': 838, 'data': [{'id': '8245', 'name': '1st Quality Market', 'address': {'street': '2655 Nicollet Ave', 'city': 'MINNEAPOLIS', 'zipCode': 55408, 'county': 'HENNEPIN'}}, {'id': '2150', 'name': '33RD MEAT & GROCERY', 'address': {'street': '710 33RD AVE N', 'city': 'SAINT CLOUD', 'zipCode': 56303, 'county': 'STEARNS'}}, {'id': '1548', 'name': '52 MARKET  AND TRADING', 'address': {'street': '990 ARCADE ST', 'city': 'SAINT PAUL', 'zipCode': 55106, 'county': 'RAMSEY'}}, {'id': '8782', 'name': '75 Market and Deli', 'address': {'street': '1187 Minnehaha Ave E ', 'city': 'SAINT PAUL', 'zipCode': 55106, 'county': 'RAMSEY'}}, {'id': '8685', 'name': '7th Grocery', 'address': {'street': '43 7th St W', 'city': 'SAINT PAUL', 'zipCode': 55102, 'county': 'RAMSEY'}}, {'id': '8875', 'name': 'A A Market', 'address': {'street': '191 Western Ave N', 'city': 'SAINT PAUL', 'zipCode': 55102, 'county': 'RAMSEY'}}, {'id': '8552', 'name': 'Aaran Halal Market', 'address': {'street': '8904 Old Cedar
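The response printed above carries a `totalElements` field (838 here), so the page count could be derived from it rather than typed in by hand. The following is a minimal sketch, assuming the field keeps that name and that the page size stays at 10; `pages_needed` is a hypothetical helper, not part of the original notebook.

```python
import math

def pages_needed(total_elements, page_size=10):
    """Return how many pages are needed to hold total_elements records."""
    return math.ceil(total_elements / page_size)

# Sample value copied from the response printed above
sample = {'totalElements': 838}
page_count_estimate = pages_needed(sample['totalElements'])
print(page_count_estimate)  # 84
```

With 838 records and 10 per page, the last page holds only 8 stores.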

In [4]:
# Normalize the JSON records into a pandas data frame called "df".
# pd.json_normalize is the public name (pandas 1.0+) for pd.io.json.json_normalize.
df = pd.json_normalize(data['data'])
df

Unnamed: 0,id,name,address.street,address.city,address.zipCode,address.county
0,8245,1st Quality Market,2655 Nicollet Ave,MINNEAPOLIS,55408,HENNEPIN
1,2150,33RD MEAT & GROCERY,710 33RD AVE N,SAINT CLOUD,56303,STEARNS
2,1548,52 MARKET AND TRADING,990 ARCADE ST,SAINT PAUL,55106,RAMSEY
3,8782,75 Market and Deli,1187 Minnehaha Ave E,SAINT PAUL,55106,RAMSEY
4,8685,7th Grocery,43 7th St W,SAINT PAUL,55102,RAMSEY
5,8875,A A Market,191 Western Ave N,SAINT PAUL,55102,RAMSEY
6,8552,Aaran Halal Market,8904 Old Cedar Ave S,BLOOMINGTON,55425,HENNEPIN
7,8379,Africa International Market,7617 Welcome Ave N,BROOKLYN PARK,55443,HENNEPIN
8,8783,African Halal & Deli,405 E Lake St,MINNEAPOLIS,55408,HENNEPIN
9,8624,African Plaza,555 Snelling Ave N,SAINT PAUL,55104,RAMSEY


Note that "df" only has the information for the first 10 stores. We still have to load all 838 stores.

In [5]:
# Store all 838 grocery stores in a pandas data frame called "df2".
frames = []

# Loop through all pages, from page 1 to page_count
for page_number in range(1, page_count + 1):
    start_url = 'https://wicgrocerystores.web.health.state.mn.us/search?criteria=&size=10&page=' + str(page_number)
    resp_next = requests.get(start_url)
    data = json.loads(resp_next.text)
    df = pd.json_normalize(data['data'])
    frames.append(df)

# DataFrame.append was removed in pandas 2.0; pd.concat keeps each page's own 0-9 index
df2 = pd.concat(frames)

In [6]:
# Check if the loop runs to the end
print(page_number)
print(start_url)

84
https://wicgrocerystores.web.health.state.mn.us/search?criteria=&size=10&page=84


In [7]:
# Get the total number of store records
len(df2)

838

In [8]:
# Check the first 20 rows of df2
# Note that in this data frame, the indices repeat every ten rows
df2.head(20)

Unnamed: 0,id,name,address.street,address.city,address.zipCode,address.county
0,8245,1st Quality Market,2655 Nicollet Ave,MINNEAPOLIS,55408,HENNEPIN
1,2150,33RD MEAT & GROCERY,710 33RD AVE N,SAINT CLOUD,56303,STEARNS
2,1548,52 MARKET AND TRADING,990 ARCADE ST,SAINT PAUL,55106,RAMSEY
3,8782,75 Market and Deli,1187 Minnehaha Ave E,SAINT PAUL,55106,RAMSEY
4,8685,7th Grocery,43 7th St W,SAINT PAUL,55102,RAMSEY
5,8875,A A Market,191 Western Ave N,SAINT PAUL,55102,RAMSEY
6,8552,Aaran Halal Market,8904 Old Cedar Ave S,BLOOMINGTON,55425,HENNEPIN
7,8379,Africa International Market,7617 Welcome Ave N,BROOKLYN PARK,55443,HENNEPIN
8,8783,African Halal & Deli,405 E Lake St,MINNEAPOLIS,55408,HENNEPIN
9,8624,African Plaza,555 Snelling Ave N,SAINT PAUL,55104,RAMSEY
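The repeating index noted above appears because each page's data frame keeps its own 0 to 9 labels when the pages are stacked. The sketch below uses two toy frames (hypothetical data) to show the difference between stacking as-is and stacking with `ignore_index=True`. Part 3 slices ten rows at a time and looks rows up by the labels 0 to 9, so it depends on the repeating index; this is only an illustration of the alternative.

```python
import pandas as pd

# Two toy "pages" standing in for the per-page frames built in the loop
page_a = pd.DataFrame({'name': ['Store A', 'Store B']})
page_b = pd.DataFrame({'name': ['Store C', 'Store D']})

# Stacking as-is keeps each page's own labels
stacked = pd.concat([page_a, page_b])

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([page_a, page_b], ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```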


## Part 2: Get started with the Google Places API

Before using the Google Places API, you must enable it from your Google account. To do this, please refer to https://developers.google.com/places/web-service/intro. If you're a new user, you must sign up and create a billing account before you can start using the Google Maps Platform APIs and SDKs. To learn more, see https://developers.google.com/maps/gmp-get-started.

After you have enabled the API, you can get your API key for the next steps.

In [9]:
# Import modules 
import googlemaps
import urllib
import urllib.request

my_api_key = "YOUR_API_KEY"  # replace with your own API key; never publish a real key

In [10]:
# Create a Google Maps client with my Google Maps Platform API key
gmaps_key = googlemaps.Client(key = my_api_key) 

In [11]:
# Here are the two basic url forms of Google Maps

# A Text Search request is an HTTP URL of the following form:
gmaps_text_url = "https://maps.googleapis.com/maps/api/place/textsearch/json?"

# A Place Details request is an HTTP URL of the following form:
gmaps_details_url = "https://maps.googleapis.com/maps/api/place/details/json?"

Next, we will use the second record, "33RD MEAT & GROCERY", to create our first query in Google Maps using the Places API. We use the second record because the address of the first record, "1st Quality Market", doesn't match on the map.

The following code blocks have two goals. First, we will search for the grocery store "33RD MEAT & GROCERY" on Google Maps using the Text Search function to get its place id. Second, we will use its place id to access the place details and get the opening hours.

In [None]:
# Try the first query
first_ten_data = df2[0:10]
# Row labels run from 0 to 9; label 1 is the second record
row = first_ten_data.loc[1]
first_query = row['address.street'] + ", " + row['address.city'] + ", " + str(row['address.zipCode']) + ", " + row['name']
first_query

In [None]:
# Get the response of the first query place
# Passing params lets requests URL-encode the query, including the "&" in the store name
resp2 = requests.get(gmaps_text_url, params={'query': first_query, 'key': my_api_key})
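The store name in this query contains an `&`, which is also the character that separates parameters in a URL query string. The sketch below compares plain string concatenation with `urllib.parse.urlencode` and parses both results back with Python's own `parse_qs`. It runs offline, uses a placeholder key, and assumes the server splits parameters the way `parse_qs` does.

```python
from urllib.parse import urlencode, parse_qs

base_url = "https://maps.googleapis.com/maps/api/place/textsearch/json?"
query = "710 33RD AVE N, SAINT CLOUD, 56303, 33RD MEAT & GROCERY"

# Plain concatenation leaves the "&" in the store name unescaped
naive_url = base_url + 'query=' + query + '&key=' + 'YOUR_API_KEY'

# urlencode escapes spaces, commas, and "&" inside each value
encoded_url = base_url + urlencode({'query': query, 'key': 'YOUR_API_KEY'})

# Parse both query strings back to see what each 'query' parameter contains
naive_query = parse_qs(naive_url.split('?', 1)[1])['query'][0]
encoded_query = parse_qs(encoded_url.split('?', 1)[1])['query'][0]

print(repr(naive_query))
print(repr(encoded_query))
```

In the concatenated URL the query is cut off at the `&`, so the word "GROCERY" never reaches the `query` parameter; the encoded URL round-trips the full string.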

In [None]:
# Display the information of the first query place
gmaps_data_1 = resp2.json()
gmaps_data_1

In this case, we use the Google Maps Text Search function to search for the store. However, Text Search does not always turn up the right address. Hence, to control data quality, we have to set two quality-check fields that record the matching address and the matching name. If both the matching address and the matching name are identical to the address and name on the WIC website, that piece of data is valid. Otherwise, it is invalid.

In [None]:
# Get the matching address and the matching name for this query
mat_addr_1 = gmaps_data_1['results'][0]['formatted_address']
mat_name_1 = gmaps_data_1['results'][0]['name']
print(mat_addr_1)
print(mat_name_1)
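The validity rule described above could be automated along the following lines. This is a sketch only: the text asks for identical values, while the helpers here relax that to a comparison that ignores case and punctuation, because Google returns a full formatted address and mixed-case names. The function names `normalize`, `same_name`, and `street_matches` are hypothetical.

```python
def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = ''.join(ch if ch.isalnum() or ch.isspace() else ' ' for ch in text.lower())
    return ' '.join(kept.split())

def same_name(wic_name, gmaps_name):
    """True when the two names are equal after normalization."""
    return normalize(wic_name) == normalize(gmaps_name)

def street_matches(wic_street, gmaps_address):
    """True when the Google formatted address starts with the WIC street."""
    return normalize(gmaps_address).startswith(normalize(wic_street))

# Values taken from the tables in this notebook
print(same_name('Africa International Market', 'AFRICA INTERNATIONAL MARKET'))  # True
print(same_name('African Plaza', 'Star Food Market'))                          # False
print(street_matches('710 33RD AVE N',
                     '710 33rd Ave N, St Cloud, MN 56303, United States'))     # True
```

A stricter or looser rule may suit the data better; for example, this version treats "52 MARKET AND TRADING" and "52 Market & Trading" as different names.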

In [None]:
# Get the place id for the first query place
gmaps_place_id_1 = gmaps_data_1['results'][0]['place_id']
gmaps_place_id_1

In [None]:
# Get the response of the place details search
start_url3 = gmaps_details_url+ 'place_id=' + gmaps_place_id_1 + "&key=" + my_api_key
resp3 = requests.get(start_url3)
gmaps_data_details_1 = resp3.json()
gmaps_data_details_1

In [None]:
gmaps_data_details_1['result']['opening_hours']['periods']

In [None]:
gmaps_data_details_1['result']['opening_hours']['weekday_text']

Comparing the outputs of the two code blocks above, we can see that in the Google Maps opening hours data, days 0, 1, ..., 6 represent Sunday, Monday, ..., Saturday, respectively.

In [None]:
# Get the open and close 'time' values of the second grocery store and store them in a list
vals_1 = []
for i in range(7):
    data_1_close = gmaps_data_details_1['result']['opening_hours']['periods'][i]['close']['time']
    data_1_open = gmaps_data_details_1['result']['opening_hours']['periods'][i]['open']['time']
    vals_1.append(data_1_open)
    vals_1.append(data_1_close)
print(vals_1)
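The loop above reads the periods by position, which works when a store has exactly one period for each of the seven days. A store that is closed on some day has fewer periods, and position `i` then no longer corresponds to day `i`. The sketch below places each period by its own `open.day` number instead. It assumes each period has the shape `{'open': {'day', 'time'}, 'close': {'day', 'time'}}` with `close` possibly absent; `hours_by_day` and the sample data are hypothetical.

```python
def hours_by_day(periods):
    """Return 14 values: open and close times for day 0 (Sunday) through day 6."""
    slots = {day: ['NaN', 'NaN'] for day in range(7)}
    for period in periods:
        day = period['open']['day']
        slots[day][0] = period['open']['time']
        # 'close' may be absent, so fall back to 'NaN'
        slots[day][1] = period.get('close', {}).get('time', 'NaN')
    return [value for day in range(7) for value in slots[day]]

# Hypothetical store that is open on Monday and Tuesday only
sample_periods = [
    {'open': {'day': 1, 'time': '0800'}, 'close': {'day': 1, 'time': '2000'}},
    {'open': {'day': 2, 'time': '0800'}, 'close': {'day': 2, 'time': '2000'}},
]
result = hours_by_day(sample_periods)
print(result)
```

If a store lists more than one period for the same day, this version keeps only the last one.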

## Part 3: Traverse the data list to get a week's opening hours for each store

We successfully retrieved the opening hours of "33RD MEAT & GROCERY" in Part 2. In this part, we will create a nested "for" loop to access the opening hours of every grocery store.

In [22]:
# Each page contains 10 records
x = 0
page_count = 2  # sample run on 2 pages; set to 84 to process every page
record_count = 10  # records per page (with 838 records, the last page holds only 8)
index = 0
new_vals = []
mat_addr_list = []
mat_name_list = []

for page in range(page_count):
    data_series = df2[x:x+10]
    for index in range(record_count):
        query = data_series.loc[index, 'address.street']+", "+data_series.loc[index, 'address.city']+", "+str(data_series.loc[index, 'address.zipCode']) + ", " + data_series.loc[index, 'name']
        # params lets requests URL-encode the query (store names may contain "&")
        gmaps_search_resp = requests.get(gmaps_text_url, params={'query': query, 'key': my_api_key})
        gmaps_place_data = gmaps_search_resp.json()
        if gmaps_place_data['status'] == 'ZERO_RESULTS':
            new_vals.extend(['NaN'] * 14)  # 14 placeholders: open and close for 7 days
            mat_addr_list.append('NaN')
            mat_name_list.append('NaN')
            continue
        else:
            gmaps_place_id = gmaps_place_data['results'][0]['place_id']
            matching_addr = gmaps_place_data['results'][0]['formatted_address']
            mat_addr_list.append(matching_addr)
            matching_name = gmaps_place_data['results'][0]['name']
            mat_name_list.append(matching_name) 
            gmaps_placedetails_url = gmaps_details_url+ 'place_id=' + gmaps_place_id + "&key=" + my_api_key
            gmaps_placedetails_resp = requests.get(gmaps_placedetails_url)
            gmaps_placedetails = gmaps_placedetails_resp.json()
            for i in range(7):
                try:
                    data_open = gmaps_placedetails['result']['opening_hours']['periods'][i]['open']['time']
                    data_close = gmaps_placedetails['result']['opening_hours']['periods'][i]['close']['time']
                    new_vals.append(data_open)
                    new_vals.append(data_close)
                except (KeyError, IndexError):  # missing hours, fewer than 7 periods, or no 'close'
                    new_vals.extend(['NaN', 'NaN'])
    x = x+10
print(new_vals)
print(mat_addr_list)
print(mat_name_list)

['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', '0600', '2100', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0830', '1800', '0830', '1900', '0830', '1900', '0830', '1900', '0830', '1900', '0830', '1900', '0830', '1900', '0700', '2200', '0700', '2200', '0700', '2200', '0700', '2200', '0700', '2200', '0700', '2200', '0700', '2200', '0800', '2300', '0800', '2300', '0800', '2300', '0800', '2300', '0800', '2300', '0800', '2300', '0800', '2300', '0600', '2200', '0600', '2200', '0600', '2200', '0600', '2200', '0600', '2200', '0600', '2200', '0600', '2200', '0900', '2030', '0900', '2030', '0900', '2030', '0900', '2030', '0900', '2030', '0900', '2030', '0900', '2030', '1000', '1600', '0900', '1900', '0900', '1900', '0900', '1900', '0900', '1900', '0900', '1900', '0900', '1800', '0930

In [23]:
len(new_vals)

280

In [24]:
# Slice the data list into chunks of 14 time records (one chunk per store)
chunks = [new_vals[x:x+14] for x in range(0, len(new_vals), 14)]
print(chunks)

[['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN'], ['0600', '2100', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000'], ['0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000', '0800', '2000'], ['0830', '1800', '0830', '1900', '0830', '1900', '0830', '1900', '0830', '1900', '0830', '1900', '0830', '1900'], ['0700', '2200', '0700', '2200', '0700', '2200', '0700', '2200', '0700', '2200', '0700', '2200', '0700', '2200'], ['0800', '2300', '0800', '2300', '0800', '2300', '0800', '2300', '0800', '2300', '0800', '2300', '0800', '2300'], ['0600', '2200', '0600', '2200', '0600', '2200', '0600', '2200', '0600', '2200', '0600', '2200', '0600', '2200'], ['0900', '2030', '0900', '2030', '0900', '2030', '0900', '2030', '0900', '2030', '0900', '2030', '0900', '2030'], ['1000', '1600', '0900', '1900', '0900', '1900', '0900', '1900', '0900', '1900', '0900', '1900', '090

In [25]:
# Copy the dataframe "df2" into dataframe "df3" and reset its row index
# This step prepares for concatenating the dataframes along the columns
df3 = df2.reset_index(drop=True)

In [26]:
# Create 14 new columns to store open and close time from Sunday to Saturday
new_cols = ['DAY0_OPEN', 'DAY0_CLOSE', 
            'DAY1_OPEN', 'DAY1_CLOSE', 
            'DAY2_OPEN', 'DAY2_CLOSE',
            'DAY3_OPEN', 'DAY3_CLOSE',
            'DAY4_OPEN', 'DAY4_CLOSE',
            'DAY5_OPEN', 'DAY5_CLOSE',
            'DAY6_OPEN', 'DAY6_CLOSE']

In [27]:
# Construct a dataframe using the columns created before
# Store every 14 records into a row
# DataFrame.append was removed in pandas 2.0, so the one-row frames are joined with pd.concat
df5 = pd.concat([pd.DataFrame([chunk], columns=new_cols) for chunk in chunks])
print(df5)

  DAY0_OPEN DAY0_CLOSE DAY1_OPEN DAY1_CLOSE DAY2_OPEN DAY2_CLOSE DAY3_OPEN  \
0       NaN        NaN       NaN        NaN       NaN        NaN       NaN   
0      0600       2100      0800       2000      0800       2000      0800   
0      0800       2000      0800       2000      0800       2000      0800   
0      0830       1800      0830       1900      0830       1900      0830   
0      0700       2200      0700       2200      0700       2200      0700   
0      0800       2300      0800       2300      0800       2300      0800   
0      0600       2200      0600       2200      0600       2200      0600   
0      0900       2030      0900       2030      0900       2030      0900   
0      1000       1600      0900       1900      0900       1900      0900   
0      0930       1900      0930       1900      0930       1900      0930   
0       NaN        NaN       NaN        NaN       NaN        NaN       NaN   
0       NaN        NaN       NaN        NaN       NaN        NaN

In [28]:
# Copy the dataframe "df5" into dataframe "df6" and reset its row index
# This step also prepares for concatenating the dataframes along the columns
df6 = df5.reset_index(drop=True)

In [29]:
# Create 2 new columns to store the matching address and the matching name
match_cols = ['match_address', 'match_name']

In [30]:
# Create a dataframe "df7" to store the matching addresses and names
mat_data_dict = {'match_address': mat_addr_list, 'match_name': mat_name_list}
df7 = pd.DataFrame(mat_data_dict, columns=match_cols)
df7

Unnamed: 0,match_address,match_name
0,,
1,"710 33rd Ave N, St Cloud, MN 56303, United States",Thirtythird Meat & Grocery
2,"990 Arcade St, St Paul, MN 55106, United States",52 Market & Trading
3,"1187 Minnehaha Ave E, St Paul, MN 55106, Unite...",75 Market
4,"43 7th St W, St Paul, MN 55102, United States",7th Grocery
5,"191 Western Ave N # 1, St Paul, MN 55102, Unit...",AA Market
6,"8904 Old Cedar Ave S, Bloomington, MN 55425, U...",Aaran Halal Market
7,"7617 Welcome Ave N, Brooklyn Park, MN 55443, U...",AFRICA INTERNATIONAL MARKET
8,"405 E Lake St, Minneapolis, MN 55408, United S...",African Safari Meat Market
9,"555 Snelling Ave N, St Paul, MN 55104, United ...",Star Food Market


Now we have three dataframes, "df3", "df6", and "df7", that share the same row indices. The first stores the grocery stores and their addresses, the second stores their opening hours, and the third stores the matching information.

In [31]:
# Concatenate the three dataframes along the columns
df8 = pd.concat([df3, df6, df7], axis=1)
df8.head(20)

Unnamed: 0,id,name,address.street,address.city,address.zipCode,address.county,DAY0_OPEN,DAY0_CLOSE,DAY1_OPEN,DAY1_CLOSE,...,DAY3_OPEN,DAY3_CLOSE,DAY4_OPEN,DAY4_CLOSE,DAY5_OPEN,DAY5_CLOSE,DAY6_OPEN,DAY6_CLOSE,match_address,match_name
0,8245,1st Quality Market,2655 Nicollet Ave,MINNEAPOLIS,55408,HENNEPIN,,,,,...,,,,,,,,,,
1,2150,33RD MEAT & GROCERY,710 33RD AVE N,SAINT CLOUD,56303,STEARNS,600.0,2100.0,800.0,2000.0,...,800.0,2000.0,800.0,2000.0,800.0,2000.0,800.0,2000.0,"710 33rd Ave N, St Cloud, MN 56303, United States",Thirtythird Meat & Grocery
2,1548,52 MARKET AND TRADING,990 ARCADE ST,SAINT PAUL,55106,RAMSEY,800.0,2000.0,800.0,2000.0,...,800.0,2000.0,800.0,2000.0,800.0,2000.0,800.0,2000.0,"990 Arcade St, St Paul, MN 55106, United States",52 Market & Trading
3,8782,75 Market and Deli,1187 Minnehaha Ave E,SAINT PAUL,55106,RAMSEY,830.0,1800.0,830.0,1900.0,...,830.0,1900.0,830.0,1900.0,830.0,1900.0,830.0,1900.0,"1187 Minnehaha Ave E, St Paul, MN 55106, Unite...",75 Market
4,8685,7th Grocery,43 7th St W,SAINT PAUL,55102,RAMSEY,700.0,2200.0,700.0,2200.0,...,700.0,2200.0,700.0,2200.0,700.0,2200.0,700.0,2200.0,"43 7th St W, St Paul, MN 55102, United States",7th Grocery
5,8875,A A Market,191 Western Ave N,SAINT PAUL,55102,RAMSEY,800.0,2300.0,800.0,2300.0,...,800.0,2300.0,800.0,2300.0,800.0,2300.0,800.0,2300.0,"191 Western Ave N # 1, St Paul, MN 55102, Unit...",AA Market
6,8552,Aaran Halal Market,8904 Old Cedar Ave S,BLOOMINGTON,55425,HENNEPIN,600.0,2200.0,600.0,2200.0,...,600.0,2200.0,600.0,2200.0,600.0,2200.0,600.0,2200.0,"8904 Old Cedar Ave S, Bloomington, MN 55425, U...",Aaran Halal Market
7,8379,Africa International Market,7617 Welcome Ave N,BROOKLYN PARK,55443,HENNEPIN,900.0,2030.0,900.0,2030.0,...,900.0,2030.0,900.0,2030.0,900.0,2030.0,900.0,2030.0,"7617 Welcome Ave N, Brooklyn Park, MN 55443, U...",AFRICA INTERNATIONAL MARKET
8,8783,African Halal & Deli,405 E Lake St,MINNEAPOLIS,55408,HENNEPIN,1000.0,1600.0,900.0,1900.0,...,900.0,1900.0,900.0,1900.0,900.0,1900.0,900.0,1800.0,"405 E Lake St, Minneapolis, MN 55408, United S...",African Safari Meat Market
9,8624,African Plaza,555 Snelling Ave N,SAINT PAUL,55104,RAMSEY,930.0,1900.0,930.0,1900.0,...,930.0,1900.0,930.0,1900.0,930.0,1900.0,930.0,1900.0,"555 Snelling Ave N, St Paul, MN 55104, United ...",Star Food Market


In [32]:
df8.head(20).to_csv("data_samples2.csv")
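When the CSV saved above is read back later, opening hours such as '0600' can lose their leading zero if pandas parses the column as numbers. The sketch below shows this with a small in-memory CSV (hypothetical sample rows) and how passing `dtype=str` to `pd.read_csv` keeps the original four-character strings.

```python
import io
import pandas as pd

hours = pd.DataFrame({'DAY0_OPEN': ['0600', '0930'],
                      'DAY0_CLOSE': ['2100', '1900']})
csv_text = hours.to_csv(index=False)

# Default parsing infers integers, so '0600' becomes 600
as_numbers = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every value exactly as written in the file
as_strings = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(as_numbers['DAY0_OPEN'].tolist())  # [600, 930]
print(as_strings['DAY0_OPEN'].tolist())  # ['0600', '0930']
```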