# Web Scraping Prototype
## Chris Kimber
## Insight Project

### Background

To create a table (or potentially a database) of grocery products to match with ingredients, I am going to attempt to scrape the Metro online grocery webshop. The webshop is rendered with JavaScript, so I will scrape it using a Selenium web driver.

In [2]:
from bs4 import BeautifulSoup
from selenium import webdriver
import numpy as np
import pandas as pd

In [3]:
webshop_domain = 'https://metro.ca/en/'
path = 'online-grocery/search'
url = (webshop_domain + path)

In [5]:
driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
soup = BeautifulSoup(page, features = 'html.parser')
driver.close()
print(soup.prettify())

<html class="vitrine page--error--bypass svg video videoloop videopreload hashchange csscalc cssgradients opacity pointerevents svgasimg cssanimations flexbox bgsizecover csstransforms csstransitions backgroundblendmode desktop mac landscape os macos10 macos10_10 32bit chrome chrome83 chrome83_0 webkit en-us no-videoautoplay desktop-size anonymous dishide-overlay-is_active dishide-instance-popover-wrapper-ipdetection-is_active" lang="en-CA" xml:lang="en-CA" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#">
 <head>
  <!-- 626F6E6A6F757261746F7573 -->
  <script async="" src="https://us-sonar.sociomantic.com/js/2010-07-01/adpan/metro-ca" type="text/javascript">
  </script>
  <script async="" id="www-widgetapi-script" src="https://s.ytimg.com/yts/jsbin/www-widgetapi-vflIVmiP2/www-widgetapi.js" type="text/javascript">
  </script>
  <script async="" src="https://www.google-analytics.com/plugins/ua/ec.js" type="text/javascript">
  </script>
  <script async="" src="https://www

The class structure for grocery items on this page is pretty ugly, with multiple classes and horrific nested names. Here I'll pick one class string and test whether the matching elements can be found in the soup.

In [10]:
item_list_test = soup.find_all('div', class_ = "products-tile-list__tile products-tile-list__tile--third")

In [11]:
item_list_test

[<div class="products-tile-list__tile products-tile-list__tile--third">
 <div class="tile-product item-addToCart tile-product--effective-date" data-category-url="/aisles/fruits-vegetables/fruits/berries-cherries" data-is="container-tabs" data-is-inactive="false" data-max-qty="6" data-min-qty="1" data-product-category="Fruits &amp; Vegetables" data-product-code="715756100019" data-product-deletable="true" data-product-name="Raspberries" data-substitution-permission="NO" data-unit-increment="1">
 <div class="tile-product__top-section">
 <div class="tile-product__top-section__visuals">
 <a class="product-details-link" href="/en/online-grocery/aisles/fruits-vegetables/fruits/berries-cherries/raspberries/p/715756100019">
 <picture class="tile-product__top-section__visuals__img-product defaultable-picture">
 <source media="(min-width: 730px)" srcset="https://product-images.metro.ca/images/hab/hf2/9335888052254.jpg, https://product-images.metro.ca/images/hd6/h7a/9335889002526.jpg 2x"/>
 <sour

In [13]:
len(item_list_test)

4

In addition to the extremely ugly naming structure, copying the class names out of the browser inspector introduces inconsistent trailing or internal spaces. These cause the find within the soup to fail because the spacing doesn't match that in the soup. The list of classes below began as copied out of the inspector; the loop tests whether each class string in the list finds any products, and I manually cleaned the ones that returned 0 by adjusting internal and trailing spaces.
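The manual cleanup described above could probably be automated. A minimal sketch, assuming the only problem is stray whitespace; `normalize_class_string` is a hypothetical helper name, not something used elsewhere in this notebook:

```python
def normalize_class_string(raw):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    # str.split() with no arguments splits on any whitespace run
    # and discards empty strings, so joining with one space normalizes it.
    return " ".join(raw.split())


# Illustrative input with the kind of spacing problems seen when copying from the inspector
messy = "  products-tile-list__tile   products-tile-list__tile--third "
cleaned = normalize_class_string(messy)
print(cleaned)
```

The cleaned string can then be passed to `soup.find_all('div', class_ = cleaned)` as in the cells that follow.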

In [33]:
item_classes = [
    "products-tile-list__tile",
    "products-tile-list__tile products-tile-list__tile--second",
    "products-tile-list__tile products-tile-list__tile--third",
    "products-tile-list__tile products-tile-list__tile--fourth products-tile-list__tile--second",
    "products-tile-list__tile products-tile-list__tile--third products-tile-list__tile--second",
    "products-tile-list__tile products-tile-list__tile--fourth products-tile-list__tile--third products-tile-list__tile--second",
]

In [34]:
item_number = []
for cl in item_classes:
    number = (soup.find_all('div', class_ = cl))
    item_number.append(len(number))

In [35]:
item_number

[24, 4, 4, 4, 2, 2]

An interesting observation from the result above is that 24 items are found under the first class, and a webshop page displays exactly 24 items. The name of the first class appears in all the others, which suggests that searching for the first class also returns the elements carrying the other classes. In hindsight this is expected: an HTML `class` attribute is a space-separated list of class names, and BeautifulSoup matches a single class name against any element that has it among its classes.
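The matching behaviour can be checked on a toy document. This is a small sketch with made-up class names (`tile`, `tile--second`, `tile--third` are stand-ins, not the real Metro classes). It compares a single class name, an exact multi-class string, and a CSS selector:

```python
from bs4 import BeautifulSoup

# Toy HTML with hypothetical class names standing in for the real ones
toy_html = (
    '<div class="tile"></div>'
    '<div class="tile tile--second"></div>'
    '<div class="tile tile--third tile--second"></div>'
)
toy_soup = BeautifulSoup(toy_html, features='html.parser')

# A single class name matches every element that has it among its classes
by_token = toy_soup.find_all('div', class_='tile')

# A string with several classes is compared against the exact attribute value
by_exact = toy_soup.find_all('div', class_='tile tile--second')

# A CSS selector matches elements that have both classes, in any order
by_selector = toy_soup.select('div.tile.tile--second')

print(len(by_token), len(by_exact), len(by_selector))
```

The exact-string comparison in the second call is also why stray spaces copied from the inspector made some searches return nothing.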

To verify that all items on the page are in the first class I will extract all the item titles and verify all items are present by comparing with the page.

In [46]:
item_list = soup.find_all('div', class_ = item_classes[0])

In [47]:
first_item = item_list[0]
name = first_item.find('div', class_ = 'pt-title').text
name

'Banana'

In [48]:
names = []
for item in item_list:
    names.append(item.find('div', class_ = 'pt-title').text)

In [49]:
names

['Banana',
 'English cucumber',
 'Raspberries',
 'Lean Ground Beef, Value Pack',
 'White mushrooms',
 'Boneless Trimmed Chicken Breasts, Value Pack',
 'Hothouse red pepper',
 'Large Eggs',
 'Organic banana',
 'Lean Ground Beef',
 'Salted Butter',
 'Large eggs',
 'Seedless mini cucumbers',
 'Sweet potato',
 'Large Omega-3 Eggs, Life Smart',
 'Iceberg Lettuce',
 'Baby-Cut Carrots',
 'Lime',
 'Extra Lean Ground Beef',
 'Red Seedless Grapes',
 'Italian tomato',
 'Broccoli',
 'Green Onions',
 'Red cluster tomatoes']
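The tile output shown earlier also carries `data-product-name`, `data-product-code` and `data-product-category` attributes on the inner `tile-product` div. As a hedged sketch of how those might feed a product table, the snippet below parses a trimmed copy of that markup. `sample_html` and `extract_product` are illustrative names, and whether every tile on the site carries these attributes is an assumption:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the tile markup printed earlier in this notebook
sample_html = '''
<div class="products-tile-list__tile products-tile-list__tile--third">
 <div class="tile-product item-addToCart"
      data-product-category="Fruits &amp; Vegetables"
      data-product-code="715756100019"
      data-product-name="Raspberries"></div>
</div>
'''


def extract_product(tile):
    """Return name, code and category from one tile, or None if the inner div is missing."""
    product = tile.find('div', class_='tile-product')
    if product is None:
        return None
    return {
        'name': product.get('data-product-name'),
        'code': product.get('data-product-code'),
        'category': product.get('data-product-category'),
    }


demo_soup = BeautifulSoup(sample_html, features='html.parser')
tiles = demo_soup.find_all('div', class_='products-tile-list__tile')
records = [extract_product(tile) for tile in tiles]
print(records)
```

A list of such dictionaries can be handed directly to `pd.DataFrame(records)` to get one column per field.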

All items on the page are found under the first class, so going forward the item list for a page can be pulled as follows if preferred:

In [None]:
item_list = soup.find_all('div', class_ = "products-tile-list__tile")

Now that the item names can be grabbed from a single page, the next step is to iterate over all pages in the shop. The first step is to determine dynamically how many pages the webshop currently has, so that the number can be used in the scraping script.

In [51]:
counter_test = soup.find('span', class_ = 'ppn--short').text
counter_test

'1/683'

In [52]:
def get_max_pagenum(soup):
    """Return the total page count from the 'current/total' page counter."""
    pc = soup.find('span', class_ = 'ppn--short').text
    split_pc = pc.split("/")
    return int(split_pc[1])

In [54]:
get_max_pagenum(soup)

683
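If the counter text ever deviates from the `'1/683'` shape seen above, `split("/")` would fail with a less helpful error. A slightly stricter variant, offered only as a sketch (`parse_page_counter` is a hypothetical name), validates the text with a regular expression and returns both numbers:

```python
import re


def parse_page_counter(text):
    """Parse a 'current/total' counter string into a (current, total) tuple of ints."""
    match = re.fullmatch(r"\s*(\d+)\s*/\s*(\d+)\s*", text)
    if match is None:
        raise ValueError(f"unexpected page counter text: {text!r}")
    return int(match.group(1)), int(match.group(2))


current_page, total_pages = parse_page_counter("1/683")
print(current_page, total_pages)
```

Returning the current page as well makes it possible to compare the page the browser is actually on with the loop counter.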

With the function to determine how many pages there are, the script below opens the first page with a counter set to 0, appends the item names to a list, clicks the button to move to the next page, and repeats the process. The loop ends once the page counter reaches the maximum page count. A 1-second delay is used because otherwise the scraper crashed after a small, variable number of pages (probably because load time varied and the scraper sometimes tried to click too soon). **NOTE: a popup appears on page 1; currently you have to quickly click it away manually for the scraper to start.**
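A fixed delay is either too short on a slow load or wasteful on a fast one. An alternative is to poll for a condition with a timeout, which is the idea behind Selenium's `WebDriverWait`. The sketch below is a library-independent version of that idea and was not used for the scrape in this notebook. `wait_until` and the demo condition are hypothetical names:

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)


# Demo condition that only becomes true on the third check
attempts = {"count": 0}


def ready_on_third_try():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(ready_on_third_try, timeout=2.0, interval=0.01)
print(attempts["count"])
```

In the scraper, the condition could for example check that the page counter text has changed since the previous page before the names are parsed.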

In [71]:
import time
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(url)

names = []

pageCounter = 0

# Read the total page count from the first page
soup = BeautifulSoup(driver.page_source, features = 'html.parser')
maxPageCount = get_max_pagenum(soup)

while pageCounter < maxPageCount:
    time.sleep(1)  # give the page time to load before parsing and clicking
    soup = BeautifulSoup(driver.page_source, features = 'html.parser')
    item_containers = soup.find_all('div', class_ = "products-tile-list__tile")
    for item in item_containers:
        names.append(item.find('div', class_ = 'pt-title').text)
    # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
    # which was removed in newer Selenium releases
    driver.find_element(By.CSS_SELECTOR, "a[aria-label='Next']").click()
    pageCounter += 1
    print(pageCounter)
    
driver.close()

1
2
3
...
277

Save the names as a CSV and a pickle, and check the overall length, because the Chrome window stopped on page 667 but the page counter kept going to 682, the last page at the time of scraping.

In [72]:
names_df = pd.DataFrame(names)
names_df.to_csv('/Users/chrki23/Documents/Insight_Project/data/cleaned/grocery_names.csv', index = False)

In [73]:
names_df.tail(n=10)

| | 0 |
|---|---|
| 16351 | Maui mango scented mist refill, Beach Escapes |
| 16352 | Rejuvenate water bottle |
| 16353 | Cool antiperspirant and deodorant spray, Active |
| 16354 | Strawberry jam |
| 16355 | Aloe body wash |
| 16356 | Smartfoam™ Effervescent Mint Whitening Toothpa... |
| 16357 | Ground Espelette pepper |
| 16358 | Gluten free organic chewy candies |
| 16359 | Horseradish mustard |
| 16360 | Soya and lavender scented candle, Loft |


In [76]:
import pickle
# Use a context manager so the file is flushed and closed after writing
with open('/Users/chrki23/Documents/Insight_Project/data/cleaned/grocery_names.data', 'wb') as filehandler:
    pickle.dump(names, filehandler)
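A quick way to confirm that a pickle can be read back is a round trip. The sketch below does that in a temporary directory with a short stand-in list. `save_and_reload` and the demo file name are hypothetical, and the real output path above is left untouched:

```python
import pickle
import tempfile
from pathlib import Path


def save_and_reload(items, path):
    """Pickle items to path, then load them back and return the loaded copy."""
    path = Path(path)
    with path.open("wb") as fh:
        pickle.dump(items, fh)
    with path.open("rb") as fh:
        return pickle.load(fh)


# Stand-in data; the real list is `names` from the scrape
demo_names = ["Banana", "English cucumber", "Raspberries"]

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(demo_names, Path(tmp_dir) / "grocery_names_demo.data")

print(reloaded == demo_names)
```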

In [75]:
names[-1]

'Soya and lavender scented candle, Loft'
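Because the browser window stopped on page 667 while the counter kept counting, the tail of `names` may contain the same page scraped more than once. One rough check, sketched below, splits the list into page-sized chunks and flags any chunk identical to the one before it. It assumes every page holds exactly `page_size` items (24 was observed on the first page); a single short page would shift all later chunks, so this is a heuristic only. `find_repeated_pages` is a hypothetical helper:

```python
def find_repeated_pages(items, page_size=24):
    """Return indices of page-sized chunks that are identical to the preceding chunk."""
    chunks = [items[i:i + page_size] for i in range(0, len(items), page_size)]
    return [i for i in range(1, len(chunks)) if chunks[i] == chunks[i - 1]]


# Toy example with a page size of 2: the second page was scraped twice
demo_items = ["a", "b", "c", "d", "c", "d", "e", "f"]
repeated = find_repeated_pages(demo_items, page_size=2)
print(repeated)
```

Collecting names per page during the scrape, rather than in one flat list, would make this check exact instead of approximate.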