## Tesco Store finder

The aim of this notebook is to collect various pieces of information about every Tesco store in the UK.


### Approach

If you use the Tesco store locator website ( https://www.tesco.com/store-locator/uk/ ) you can find a list of the Tesco stores in a locality that you specify. Each store has its own web page, which includes detailed information about the store.

When you go to one of the store web pages you will notice that the URL is something like this:


https://www.tesco.com/store-locator/uk/?bid=4634


Internally Tesco are using four-digit numbers (the `bid` value in the URL) to identify their stores.

These are real stores:  
https://www.tesco.com/store-locator/uk/?bid=4634

https://www.tesco.com/store-locator/uk/?bid=6367


This one isn't:


https://www.tesco.com/store-locator/uk/?bid=9999


There are roughly 2500 Tesco stores in the UK, so you can see that much of the number range 1000 - 9999 will not actually represent a real store.
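
To get a feel for these numbers, a quick back-of-the-envelope calculation helps. The store count is the approximate figure quoted above, and the 5 second pause is the courtesy delay used in the download loop later in this notebook.

```python
# Rough estimate of how sparse the store-id range is.
candidate_ids = len(range(1000, 10000))   # every four-digit number: 1000..9999
approx_stores = 2500                      # approximate figure quoted in the text

hit_rate = approx_stores / candidate_ids
print(f"{candidate_ids} candidate ids, roughly {hit_rate:.0%} of them are real stores")

# Time spent waiting if every candidate is requested with a 5 second pause
pause_seconds = 5
hours = candidate_ids * pause_seconds / 3600
print(f"about {hours:.1f} hours of waiting to try them all")
```

In other words, most requests will come back without a store, which is why the notebook only tries a slice of the range at a time.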




### Using the Browser Inspector

All modern browsers allow you to access the underlying HTML code which makes up a web page.

It is the job of the browser to interpret the HTML and present the information it represents on the screen in a user-friendly manner.

In order to web scrape, you do need some understanding of HTML, but not a great deal. Like most coding languages it is easier to read than to write, and we only need to be able to read it a little, e.g. recognise the different components or tags and know a bit about the syntax of tags.

A more important requirement is to be able to match what we see on the screen with the underlying HTML. A thorough understanding of the HTML and CSS code will allow you to do this, but there is a far easier way.

This involves using the developer tools found in all modern browsers, and in particular the 'element inspector'. This allows you to select an element on the web page (a table, part of a table, a link, almost anything) and have the corresponding HTML code highlighted.

### Information that we might want to scrape


* Store Name
* Store Address
* Store Geo-location 
* Store Type
* Store Post Code


## The packages we need

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs      # pip install beautifulsoup4 but import from bs4
import time
import folium                            # !pip install folium - not included with the Anaconda python, so you may need to install
from folium.plugins import MarkerCluster

The `get` method from requests only needs to be given a string representing a URL.

Quite often, if you need to provide multiple parameters, you would build the URL string up first and then call `get` with it.
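
As a small sketch of building the URL string up, the example below uses `urlencode` from Python's standard library to attach the `bid` parameter to the store locator address. The helper name `build_store_url` is made up for this illustration and is not used elsewhere in the notebook.

```python
from urllib.parse import urlencode

base_url = 'https://www.tesco.com/store-locator/uk/'

def build_store_url(store_id):
    # urlencode turns {'bid': 4634} into 'bid=4634' and escapes any awkward characters
    return base_url + '?' + urlencode({'bid': store_id})

store_url = build_store_url(4634)
print(store_url)
```

`requests.get` also accepts a `params` dictionary, so `requests.get(base_url, params={'bid': 4634})` should request the same address without any manual string building.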

## Quick example to show how BeautifulSoup works

In [3]:
r = requests.get('https://www.bbc.co.uk')
# print(r.text)


## We can make the output look a bit better

In [None]:
soup = bs(r.text, 'html.parser')     # naming the parser avoids a warning and gives the same result on every machine
prettyHTML = soup.prettify()
#print(prettyHTML)

## We can find all of the images within the Web page

In [None]:
for imagelink in soup.find_all('img'):
    url = imagelink.get('src')
    print(url)

## We can find all of the links within the Web page

In [None]:
urls = soup.find_all('a')            # find_all is the current name; findAll is the older alias
for url in urls:
    print(url.get('href'))

## `find` and `find_all` allow you to search for tags

## `get` allows you to read the value of a tag's parameters (attributes), and `.text` gives you the text inside a tag

## Typically we will be finding tags and then extracting values from them.

## What we need to ensure when doing this is that we have selected the correct tags. In any given webpage some of the common tags will occur many times as we have just seen with the 'a' tag.

## We can do this either by using a chain of tags which is unique and ends in the tag we want, or by making use of the parameters and values within a tag to find a unique combination which identifies the specific tag we want.

## This is why we need to inspect the HTML in order to identify these unique combinations.

## In HTML, tags are written in a specific way: `<name parameter="value">content</name>`
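
To make the parts of a tag concrete, the sketch below pulls a made-up fragment apart into its tag name, its parameters and its text. It deliberately uses Python's built-in `html.parser` rather than BeautifulSoup so that it runs without any extra packages; the fragment and the class name `TagPrinter` are invented for this illustration.

```python
from html.parser import HTMLParser

# A made-up fragment, not copied from the Tesco site
sample_html = '<h1 title="6367">Example Store Extra</h1>'

class TagPrinter(HTMLParser):
    """Record the pieces of each tag as the parser meets them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(('text', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

parser = TagPrinter()
parser.feed(sample_html)
parser.close()
for event in parser.events:
    print(event)
```

With BeautifulSoup the same three pieces are reached through `tag.name`, `tag.get('title')` and `tag.text`.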



## Start with a single file

In [None]:
r = requests.get('https://www.tesco.com/store-locator/uk/?bid=6367')
soup = bs(r.text, 'html.parser')

## Get the Title

In [None]:
for h1 in soup.find_all('h1'):
    print(h1.text)                   # the text inside the h1 tag is the store name

## Get the Store Id

In [None]:
for h1 in soup.find_all('h1'):
    store_id = h1.get('title')       # the store id is held in the 'title' parameter of the h1 tag
    print(store_id)

## Get the address

In [None]:
for h2 in soup.find_all('h2'):
    if h2.text == 'Address':
        print(h2.find_next('span').text)     # the first span after the 'Address' heading

## or

In [None]:
# Store Address
address = soup.find('div', {'class': 'address'}).find('span', {'itemprop': 'streetAddress'} ).text
print(address)

## or

In [None]:
# If a parameter value (here itemprop) makes the element unique, it can be used on its own.
address = soup.find('span', {'itemprop': 'streetAddress'} )
print(address.text)

## As long as you uniquely identify the tag you want, how you get there generally doesn't matter

## Get the Longitude and Latitude

## This is a bit more involved and includes some simple Python code to extract the actual values

In [None]:
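# Latitude and Longitude
# The src of the first image on the page is a URL that contains "lat,lng" as one of its parts
imagelink = soup.find('img')
url = imagelink.get('src')

item_list = url.split('/')                   # break the URL up at every '/'
lat, lng = item_list[8].split(',')[:2]       # the ninth part holds "lat,lng"

print('lat =', lat, 'lng =', lng)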
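
Picking the ninth part by position works as long as the image URL keeps the same shape. A slightly more defensive sketch is to search for the part that looks like `lat,lng`. The URL below is invented purely to exercise the code, and the helper name `extract_lat_lng` is hypothetical; the real image URL may well look different.

```python
import re

# Matches a whole URL part such as "51.5074,-0.1278":
# two signed decimal numbers separated by a comma
LAT_LNG = re.compile(r'(-?\d+\.\d+),(-?\d+\.\d+)')

def extract_lat_lng(url):
    """Return (lat, lng) as floats, or None if no coordinate pair is found."""
    for part in url.split('/'):
        match = LAT_LNG.fullmatch(part)
        if match:
            return float(match.group(1)), float(match.group(2))
    return None

# Invented URL, used only to exercise the function
example_url = 'https://maps.example.com/a/b/c/d/e/51.5074,-0.1278/12'
print(extract_lat_lng(example_url))
```

Returning `None` instead of raising an error makes it easy to spot pages where the coordinates could not be found.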

## That is all of the bits of information we wanted to collect - that is the Web scraping done

## So now we will put it all together and add a few Python bits to accumulate the data from many stores in a single saved dataset.


## We need to :

1. Make use of the URL naming convention to loop through a large number of possible stores, accepting that some won't exist.
2. Although wasteful of space, we will save all of the 'requested' files separately and then process them with BeautifulSoup.
3. Create a CSV file of all of the data we extract from the files using BeautifulSoup.


In [None]:
# 1  --- DO NOT RUN

stem = 'https://www.tesco.com/store-locator/uk/?bid='
filename_prefix = './stores/'              # this folder must already exist
filename_suffix = '.html'
for x in range(3000, 4000):
    r = requests.get(stem + str(x))
    filename = filename_prefix + str(x) + filename_suffix
    with open(filename, "w") as f:         # 'with' closes the file even if the write fails
        f.write(r.text)
    time.sleep(5)                          # the wait is added as a courtesy so as not to overload the server
                                           # 1000 stores with a 5 sec wait = 5000+ secs to run
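
A download that runs for over an hour is likely to be interrupted at some point, so it can help to skip pages that are already on disk when the loop is restarted. The sketch below is one way to do that. It assumes the page text comes from a `fetch` function supplied by the caller; the names `download_pages` and `fetch` are illustrative and not part of the notebook.

```python
import time
from pathlib import Path

def download_pages(store_ids, fetch, out_dir, delay=5):
    """Save one HTML file per id, skipping ids that were already saved.

    fetch is any function that takes a URL and returns the page text.
    Returns the list of ids that were actually requested.
    """
    stem = 'https://www.tesco.com/store-locator/uk/?bid='
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)    # create the folder if needed
    requested = []
    for store_id in store_ids:
        target = out_dir / f'{store_id}.html'
        if target.exists():
            continue                              # already downloaded on an earlier run
        # files are written as UTF-8, so read them back with encoding='utf-8'
        target.write_text(fetch(stem + str(store_id)), encoding='utf-8')
        requested.append(store_id)
        time.sleep(delay)                         # courtesy pause between requests
    return requested
```

With requests this could be called as `download_pages(range(3000, 4000), lambda url: requests.get(url).text, './stores')`. Passing the fetch function in also makes the loop easy to try out with a stand-in function before touching the real site.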

## Given that we expect some of the values we used not to be actual stores we need to know how to identify a 'missing' store

### In fact all calls return an HTML file, so we check to see which include 'Error' in the title.

### In other scenarios the requests call could return a status_code of 404 (page not found), which you could check for using the `status_code` value which is included as part of the response.
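
The two checks can be combined in one small function. This is only a sketch: the function name `is_missing_store` and the example titles are invented, and it assumes that a missing store is signalled either by a status code other than 200 or by a title starting with 'Error'.

```python
def is_missing_store(status_code, title_text):
    """Return True when a response should be skipped.

    status_code is the integer from response.status_code;
    title_text is the text of the page's title tag (or None if there is no title).
    """
    if status_code != 200:                 # e.g. 404 when the server reports "not found"
        return True
    if title_text is None:                 # no title at all: treat the page as unusable
        return True
    return title_text.strip().startswith('Error')

print(is_missing_store(404, 'Anything'))
print(is_missing_store(200, 'Error - page not found'))
print(is_missing_store(200, 'Example Store Extra'))
```

The first two calls report a page to skip and the third a page to process.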

In [None]:
rm = requests.get('https://www.tesco.com/store-locator/uk/?bid=9999')
soup = bs(rm.text, 'html.parser')
#prettyHTML = soup.prettify()
#print(prettyHTML)
title = soup.find('title')
print(title.text)
if title.text.startswith('Error'):
    print("Ignore me")
else:
    print("process me")

## Now that we have all of the files and can identify the 'duds'

## We are ready to do the scraping

## You could combine this step with the previous and do the scraping as you request the files

In [None]:
# 2

# collect one dictionary per store, then build the dataframe in one go
rows = []

# read the data from the saved files
folder = './stores/'
for i in range(3000, 4000):
    print(i)
    filename = folder + str(i) + '.html'
    with open(filename, 'r') as file_handle:
        content = file_handle.read()
    soup = bs(content, 'html.parser')
    title = soup.find('title')
    if not title.text.startswith('Error'):         # skip the pages for stores that don't exist
        h1 = soup.find('h1')
        store_id = h1.get('title')
        store_name = h1.text
        address = soup.find('div', {'class': 'address'}).find('span', {'itemprop': 'streetAddress'}).text
        imagelink = soup.find('img')
        url = imagelink.get('src')
        item_list = url.split('/')
        lat, lng = item_list[8].split(',')[:2]
        #print(store_id + " ++ " + store_name + " ++ " + address + " ++ " + lat + " ++ " + lng)
        rows.append({'Id': store_id, 'Name': store_name, 'Address': address, 'lat': lat, 'lng': lng})

# DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list first
store_info = pd.DataFrame(rows, columns=['Id', 'Name', 'Address', 'lat', 'lng'])
print(store_info.shape)
print(store_info.head())

## The store type is part of the store name and the post code is included in the address, so extracting them is just Python code

In [None]:
# 3

# Adding the store_type and post_code columns

def last_element(s, split_on):
    # return the text after the last occurrence of split_on, without surrounding spaces
    return s.split(split_on)[-1].strip()

store_info['store_type'] = store_info['Name'].apply(lambda name: last_element(name, ' '))
store_info['post_code'] = store_info['Address'].apply(lambda address: last_element(address, ','))
store_info.to_csv('xstore_info.csv', index=False)
store_info.head()
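
To see what the helper does to a single value, here it is applied to one name and one address. Both strings are invented in the shape the notebook expects (store type as the last word of the name, post code after the last comma of the address); real pages may not always follow that pattern.

```python
def last_element(s, split_on):
    # text after the final occurrence of split_on, with surrounding spaces removed
    return s.split(split_on)[-1].strip()

# Invented values, used only to show the splitting
name = 'Example Town Extra'
address = '1 Example Road, Example Town, AB1 2CD'

store_type = last_element(name, ' ')
post_code = last_element(address, ',')
print(store_type)
print(post_code)
```

Stripping the result matters for the address, because the text after the last comma starts with a space.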

## Now we have our file of data, let's put it on a map

In [None]:
data = pd.read_csv('store_info.csv')     # note: the previous cell writes 'xstore_info.csv'; change the name to map your own scrape
uk = folium.Map(location=[53, -1], control_scale=True, zoom_start=7)


# adding the markers and pop ups
for i in range(len(data)):
    row = data.iloc[i]
    popup_data = row['post_code'] + '<br>' + row['store_type']    # popups are shown as HTML, so <br> gives the line break
    folium.Marker([row['lat'], row['lng']], popup=popup_data).add_to(uk)
uk
#uk.save('Tesco_stores.html')