# Target Store Web Scraper

### Introduction

This code is designed to scrape all Target stores in the USA. If you go through the Target store directory and click on a state to view the cities with Target stores, you will notice that clicking on some cities goes directly to a store page, while clicking on others opens an overlay with a list of stores. For example, if you visit the directory for Minnesota, https://www.target.com/store-locator/store-directory/minnesota, clicking on Bemidji or Duluth takes you directly to the store page, but clicking on Saint Cloud brings up a sidebar showing two stores, each of which links to its own store page. While you can use the requests library to get the urls for Bemidji and Duluth from the html, the sidebar for Saint Cloud is not part of the html returned for the base url; it is loaded dynamically. For this we use webdriver from the selenium library.

Simply put, webdriver drives a web browser to navigate web pages and can be used to input data such as usernames and passwords or to simulate mouse input. In our case we just need to load web pages and read the rendered html through webdriver, which we can then parse and scrape with our standard libraries.
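Whichever way the html is obtained, the two kinds of city entries can be told apart by whether the entry contains a link. The sketch below illustrates that idea on a tiny hand-written html sample. It deliberately swaps in Python's built-in `html.parser` so that it runs without extra installs; the notebook itself uses BeautifulSoup. The sample markup, the `CityEntryParser` class and the `/sl/bemidji/0000` href are placeholders made up for illustration, not copied from Target's site.

```python
from html.parser import HTMLParser


class CityEntryParser(HTMLParser):
    """Collect (city, href) pairs from divs with the 'h-margin-v-tiny' class.

    href is None when the city entry has no <a> tag (the overlay case).
    Assumes the matching divs are not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.entries = []
        self._inside = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "h-margin-v-tiny" in classes:
            # Start of a new city entry: reset the per-entry state.
            self._inside, self._href, self._text = True, None, []
        elif tag == "a" and self._inside:
            self._href = attrs.get("href")

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._inside:
            self.entries.append(("".join(self._text).strip(), self._href))
            self._inside = False


# Hypothetical markup: one directly linked city and one overlay city.
SAMPLE_HTML = """
<div class="h-margin-v-tiny"><a href="/sl/bemidji/0000">Bemidji</a></div>
<div class="h-margin-v-tiny">Saint Cloud</div>
"""

parser = CityEntryParser()
parser.feed(SAMPLE_HTML)
for city, href in parser.entries:
    kind = "direct store link" if href else "overlay (needs webdriver)"
    print(city, "->", kind)
```
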

For this program, I use Google Chrome and Chromedriver. Chromedriver can be downloaded from https://chromedriver.chromium.org/home and saved to a directory of your choice. 
#### Note: Please make sure you have the version of Chromedriver that matches your version of Google Chrome. Make a note of the directory where you saved it; you will need it in a later step. Also make sure you have installed all the required libraries listed in the Imports section.

Included is commented-out code that can be uncommented to save some of the data objects for ready use in future code. For example, the first commented-out cell uses the pickle library to save the list of store page links to a file. You can then unpickle the file, which gives you back your list of links.
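As a minimal sketch of that save-and-restore round trip, the snippet below pickles a small list and loads it back. The two links and the temporary file name `links.pkl` are sample values chosen for illustration only.

```python
import os
import pickle
import tempfile

# Sample data standing in for a scraped list of links.
links = [
    "https://www.target.com/store-locator/store-directory/minnesota",
    "https://www.target.com/store-locator/store-directory/iowa",
]

# Write into a temporary directory so the example leaves your project untouched.
path = os.path.join(tempfile.mkdtemp(), "links.pkl")

# Pickle data is binary, so the file is opened with 'wb' and 'rb'.
with open(path, "wb") as f:
    pickle.dump(links, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == links)
```

Pickle files are binary regardless of their extension, and they should only be loaded when they come from a source you trust.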


#### Note: The code must run sequentially from imports down for everything to work.

#### Note: There are several places in the code where a sleep function has been added. This is so that we don't get blacklisted from Target's website or put any unnecessary strain on the site. As is, this code will take a few hours to complete. Use your judgement and discretion when changing the sleep times to speed up or slow down the code.
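One way to keep the pauses in a single place is a small generator that sleeps between items. This is only a sketch: `throttled`, its `delay` and `jitter` parameters, and the page names are hypothetical and are not part of the notebook, which calls `time.sleep` directly.

```python
import random
import time


def throttled(items, delay=2.0, jitter=0.5):
    """Yield items one at a time, pausing between consecutive items.

    The pause is delay seconds plus a random extra of up to jitter seconds.
    No pause is made before the first item.
    """
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(delay + random.uniform(0, jitter))
        yield item


# Tiny delays here so the demonstration finishes quickly.
for page in throttled(["page-1", "page-2", "page-3"], delay=0.01, jitter=0.01):
    print("fetching", page)
```
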

### Contents

* [Imports](#imports)

* [Support Functions](#sf)

* [State Directory](#std)

* [City Directory](#cd)

* [Store Directory](#sd)

* [Storing Object Data](#sod)

* [Scraping Each Store Page](#sesp)

* [Saving the Data](#stda)


#### Imports <a class="anchor" id="imports"></a>

In [None]:


#Required imports
import requests as re
from bs4 import BeautifulSoup as bs
import time
import csv
from selenium import webdriver
import pickle

In [None]:
#we will need the state abbreviations for later
state_abbrev = {'alabama':'AL','alaska':'AK','arizona':'AZ','arkansas':'AR','california':'CA',
               'colorado':'CO','connecticut':'CT','delaware':'DE','florida':'FL','georgia':'GA',
               'hawaii':'HI','idaho':'ID','illinois':'IL','indiana':'IN','iowa':'IA','kansas':'KS',
               'kentucky':'KY','louisiana':'LA','maine':'ME','maryland':'MD','massachusetts':'MA',
               'michigan':'MI','minnesota':'MN','mississippi':'MS','missouri':'MO','montana':'MT',
               'nebraska':'NE','nevada':'NV','new hampshire':'NH','new jersey':'NJ','new mexico':'NM',
                'new york':'NY','north carolina':'NC','north dakota':'ND','ohio':'OH','oklahoma':'OK',
                'oregon':'OR','pennsylvania':'PA','rhode island':'RI','south carolina':'SC','south dakota':'SD',
                'tennessee':'TN','texas':'TX','utah':'UT','vermont':'VT','virginia':'VA','washington':'WA',
                'west virginia':'WV','wisconsin':'WI','wyoming':'WY'}

#### Support Functions <a class="anchor" id="sf"></a>

In [None]:
#Support functions
def states_urls(url):
    '''
    Gathers the url for each state in the Target
    store directory. 
    Creates a dictionary where the keys are the 
    lowercase state names, e.g. 'new york', and the values
    are the urls, e.g. '.../store-directory/new-york'.
    '''
    
    data = re.get(url).text 
    soup = bs(data,"html.parser")
    # By examining the urls for particular states we need to
    # create a list of states in lowercase to append to original url
    # and replace spaces between states with two words with '-' so
    # the urls will work.
    state_dict=dict()
    state_divs = soup.find_all('div', class_='h-margin-v-tiny')
    for i in range(len(state_divs)):
        state = state_divs[i].a.text.lower()
        state_url = url + str('/') + state.replace(" ","-")
        state_dict[state]=state_url
        
    return state_dict


def city_scrape(url):
    '''
    Input: url to a state page in the Target store directory
    For a given state url, returns a list of names
    of cities in that state that have one or more Target stores
    '''
    state = re.get(url).text
    soup = bs(state,'html.parser')
    
    cities = []
    store_divs = soup.find_all('div', class_="h-margin-v-tiny")
    for div in store_divs:
        if div.a is None:
            # Cities that have multiple stores don't link to another page
            # but instead, pull a dynamically loaded overlay with the list of stores
            # and therefore are not enclosed in an <a> tag
            cities.append(div.text) 
        else:
            cities.append(div.a.text) #Those cities that have an <a> tag with a hyperlink 
    return cities
    
def store_scrape(url):
    ''' 
    Given a url to a store page, this function gets the
    store name, address, and services, and returns a dictionary:
    {'store_name': "Name", 'address': "address", 'services': [list of services]}
    - requires import requests as re and from bs4 import BeautifulSoup as bs
    '''
    html = re.get(url).text
    soup = bs(html,'html.parser')
    store_name = soup.find('h1', class_='Heading__StyledHeading-sc-1mp23s9-0 styles__StoreNameHeading-sc-pxu7eq-0 czSHDm kqGfqM h-margin-b-tiny')
    address = soup.find('a', rel='noopener').text
    service_tags = soup.find_all('div', class_='styles__CollapsibleContainer-sc-983hjk-2 jmXqrN')
    service_tags2 = soup.find_all('li', class_='styles__OtherServicesListItem-sc-7syzbd-1 hBVqPX')
    services = []
    for i in range(len(service_tags)):
        services.append(service_tags[i].h3.text)
    for i in range(len(service_tags2)):
        services.append(service_tags2[i].text)
    store_dict = {}
    store_dict['store_name']=store_name.text
    store_dict['address']=address
    store_dict['services']=services
    
    
    return store_dict  

def store_search(state,cities): #state= 2 letter abbreviation,e.g. FL, cities = list of cities in state
    '''
    This function uses the Target store finder tool to generate a list of 
    Target store urls based on a state and a list of cities in the state.
    Because searching for different cities in the store finder may return overlapping
    stores, we include a check that excludes duplicates.
    
    Note: This is the step that requires an open webdriver. Please make sure
    you have a webdriver named driver open,
    e.g. 
    driver = webdriver.Chrome('YOUR FILE DIRECTORY FOR CHROMEDRIVER HERE')
    '''
    storeurls = []
    for i in range(len(cities)):
        url = 'https://www.target.com/store-locator/find-stores/'+cities[i]+','+state
        #search = re.get(url).text
        driver.get(url)
        time.sleep(5)
        html = driver.page_source
        soup=bs(html,'html5lib')
        divs=soup.find_all('div', class_='Card__CardButtons-sc-6da7hu-1 cjxhUz')
        for div in divs:
            storeurl = 'https://www.target.com' + div.a['href']
            #Check if storeurl is already in storeurls list
            if storeurl not in storeurls:
                storeurls.append(storeurl)
                
    return storeurls
        

In [None]:
#state_directory.keys() #run only after state_directory has been created in the State Directory section below

#### State Directory <a class="anchor" id="std"></a>

Create a dictionary where the state name is the key and the value is the url of the page listing the cities in that state that have Target stores.


i.e. state_directory = {'alabama':url, 'alaska':url, ... ,'wyoming':url}
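The url for each state is the base directory url plus the state name in lowercase, with spaces replaced by hyphens. A minimal sketch of that rule follows. The helper name `state_url` and the variable `state_directory_example` are made up for illustration; the notebook does this inside `states_urls`.

```python
BASE_URL = "https://www.target.com/store-locator/store-directory"


def state_url(state_name, base_url=BASE_URL):
    """Build the directory url for a state: lowercase, spaces become hyphens."""
    slug = state_name.strip().lower().replace(" ", "-")
    return base_url + "/" + slug


# Keys keep the spaces (matching state_abbrev); only the url uses hyphens.
state_directory_example = {
    name.lower(): state_url(name) for name in ["Minnesota", "New York"]
}

for state, url in state_directory_example.items():
    print(state, "->", url)
```
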

In [None]:
#create directory of states and their urls
url = 'https://www.target.com/store-locator/store-directory' #url to initial target store directory containing list of states
state_directory = states_urls(url) #Gets dictionary of the form {'state':'url',...}

#### City Directory <a class="anchor" id="cd"></a>

Create a dictionary where the keys are the abbreviations of the states and the values are lists of city names with Target stores.

i.e. state_cities = {'AL':['city1','city2',...], 'AK':['city1','city2',...], ... ,'WY':['city1','city2',...]}

We abbreviate the state because this will be necessary in a later web scraping step.

Since we are visiting a series of urls for this web scraping step, we include a sleep function to wait a few seconds between each web scrape. This is so we don't unnecessarily stress the website or get blacklisted. Feel free to change the sleep time.
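The abbreviation lookup raises a KeyError if the directory ever lists a name that is missing from state_abbrev. The sketch below shows one defensive variant that skips and records unknown names instead. The helper `abbreviate`, the sample tables and the entry 'atlantis' are all hypothetical; whether the real directory contains such entries is not something this notebook establishes.

```python
def abbreviate(state_name, abbrev_table):
    """Return the two-letter code for state_name, or None if it is unknown."""
    return abbrev_table.get(state_name.strip().lower())


# Small stand-ins for state_abbrev and for the scraped city lists.
sample_abbrev = {"minnesota": "MN", "new york": "NY"}
sample_cities = {
    "minnesota": ["Bemidji", "Duluth", "Saint Cloud"],
    "new york": ["Example City"],
    "atlantis": ["Nowhere"],  # deliberately not in the abbreviation table
}

state_cities_example = {}
skipped = []
for name, cities in sample_cities.items():
    code = abbreviate(name, sample_abbrev)
    if code is None:
        skipped.append(name)  # keep a record instead of crashing
        continue
    state_cities_example[code] = cities

print(state_cities_example)
print("skipped:", skipped)
```
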

In [None]:

#For each state page get list of city names
state_cities = dict()
for key in state_directory:
    time.sleep(2)
    state_cities[state_abbrev[key]]=city_scrape(state_directory[key])


#### Store Directory <a class="anchor" id="sd"></a>

Create a list of urls for all target stores in the USA. 

We use the list of cities for each state in the state_cities dictionary and pass its keys and values to the store_search function.

We include a check to prevent duplicates, since the store_search function may pull stores from different states for one input.
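The duplicate check in this notebook tests membership against a list, which gets slower as the list grows. A common alternative, sketched here under the assumption that the order of first appearance should be kept, tracks seen urls in a set. The function name `unique_in_order` and the sample urls are made up for illustration.

```python
def unique_in_order(urls):
    """Return urls with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:  # set membership is a constant-time check on average
            seen.add(url)
            result.append(url)
    return result


sample = [
    "https://www.target.com/sl/store-a/1",
    "https://www.target.com/sl/store-b/2",
    "https://www.target.com/sl/store-a/1",
]
print(unique_in_order(sample))
```

For hashable items, `list(dict.fromkeys(urls))` gives the same result in one line.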

Note: The store search function includes a sleep function of 5 seconds. This is quite long but is to prevent us from overloading Target's website or getting blacklisted since we are scraping thousands of pages. Feel free to change at your discretion. See support functions above.
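To get a feel for how the sleep time adds up, the time spent sleeping is simply the number of pages times the delay. The page count below is an assumed figure for illustration, not a measured number of searches, and page load time comes on top of it.

```python
from datetime import timedelta


def estimated_sleep_time(page_count, delay_seconds):
    """Total time spent in time.sleep for page_count pages (excludes load time)."""
    return timedelta(seconds=page_count * delay_seconds)


# 1500 is a hypothetical number of city searches; 5 is the delay used in store_search.
print(estimated_sleep_time(1500, 5))
```
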

Note: This is the step that requires the webdriver to be open. Make sure you have chromedriver installed, and copy its file path between the quotation marks in the webdriver call below.

In [None]:
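#Selenium 3 style call; recent Selenium 4 releases take the path via service=selenium.webdriver.chrome.service.Service(...) instead
driver = webdriver.Chrome('YOUR_FILE_DIRECTORY/chromedriver')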

In [None]:
store_hrefs = []#Store all the storepage links to all target stores
for key in state_cities:
    store_list = store_search(key,state_cities[key])
    for store in store_list:
        if store not in store_hrefs:
            store_hrefs.append(store)
    
print("store page link scrape finished")

In [None]:
# Close the driver. It is not needed for further steps
driver.quit() #quit() ends the whole browser session; close() only closes the current window

#### Storing Object Data <a class="anchor" id="sod"></a>

You might want to save the store_hrefs list for future use or in case something goes wrong in a later step. Uncomment the cell below if you wish to save store_hrefs to a file.

If you wish to reload the store_hrefs list from that file, uncomment the second cell below.

In [None]:
# uncomment the below code if you wish to store the store_href list into a file

#with open('target_store_hrefs.txt','wb') as file:
    #pickle.dump(store_hrefs,file)

In [None]:
#Uncomment the below code if you wish to recall the store_hrefs list from the pickle file

#with open('target_store_hrefs.txt','rb') as file:
#    store_hrefs = pickle.load(file)

#### Checking to make sure our list of store links is unique

We can check how many store links we have scraped and whether they all point to different stores.

The len function tells you how many elements are in your list, and the set function keeps only one copy of each element, so duplicates are not counted.
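If the two counts ever differ, it helps to see which links are repeated. A short sketch using `collections.Counter` follows. The sample links are made up and stand in for store_hrefs.

```python
from collections import Counter

# Made-up links; the first one appears twice on purpose.
sample_links = ["/sl/store-a/1", "/sl/store-b/2", "/sl/store-a/1"]

total = len(sample_links)
unique = len(set(sample_links))
print("total:", total, "unique:", unique)

# Counter maps each link to the number of times it occurs.
duplicates = [link for link, count in Counter(sample_links).items() if count > 1]
print("duplicated links:", duplicates)
```
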

In [None]:
print("Number of links in store_hrefs list: ", len(store_hrefs))
print("Number of unique links in store_hrefs list: ", len(set(store_hrefs)))

#### Scraping Each Store Page <a class="anchor" id="sesp"></a>

We use the store_scrape support function to generate a list of dictionaries, with each dictionary corresponding to a target store and its data.

Each store dictionary has the form {'store_name':'...', 'address':'...', 'services':[...]}, where services is a list of service names.

We again include a sleep function to wait between scrapes. Change the time at your discretion.
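Because this loop runs for a long time, a single page that fails to parse (for example, if an expected tag is missing and a lookup returns None) would stop the whole run. The sketch below shows one way to record failures and carry on. `scrape_all` and `fake_scrape` are hypothetical names for illustration; in the notebook the scraping function would be store_scrape.

```python
import time


def scrape_all(links, scrape_fn, delay=3.0):
    """Apply scrape_fn to every link, collecting results and failures separately."""
    results, failures = [], []
    for link in links:
        try:
            results.append(scrape_fn(link))
        except Exception as exc:  # keep going; note which link failed and why
            failures.append((link, repr(exc)))
        time.sleep(delay)
    return results, failures


def fake_scrape(link):
    """Stand-in for store_scrape that fails on links containing 'bad'."""
    if "bad" in link:
        raise AttributeError("store name not found")
    return {"store_name": link, "address": "n/a", "services": []}


results, failures = scrape_all(["store-1", "bad-store", "store-2"], fake_scrape, delay=0)
print(len(results), "scraped,", len(failures), "failed")
```

The failed links can then be retried later without repeating the successful ones.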

In [None]:
target_data = []
for link in store_hrefs:
    store_data=store_scrape(link)
    target_data.append(store_data)
    time.sleep(3)
print("STORE PAGE SCRAPE COMPLETED!!!")

#### Storing the dictionary list

Uncomment the below code if you wish to store the list of dictionaries as is to a file. 

This step is optional, since we save the data to a csv file in the next step.

In [None]:
# Uncomment the below code if you wish to pickle data in file


#with open('target_store_data.txt','wb') as file:
#    pickle.dump(target_data,file)

#### Saving the Data!! <a class="anchor" id="stda"></a>

Final step! We save the data to a csv file.
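Since services is a list, csv.DictWriter writes it as the text of a Python list, e.g. "['Pharmacy', 'Optical']". If a plain delimited string is preferred, each row can be flattened first. This is a sketch of that optional variant. The `flatten` helper, the "; " separator and the sample row are assumptions for illustration, and it writes to an in-memory buffer rather than target_stores.csv.

```python
import csv
import io

# Made-up row in the same shape as the dictionaries in target_data.
rows = [
    {
        "store_name": "Example Store",
        "address": "1 Example St",
        "services": ["Pharmacy", "Optical"],
    }
]


def flatten(row, sep="; "):
    """Return a copy of row with the services list joined into one string."""
    flat = dict(row)
    flat["services"] = sep.join(row["services"])
    return flat


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(flatten(row) for row in rows)
print(buffer.getvalue())
```
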

In [None]:
#write data to csv file

header =list(target_data[0].keys())

with open('target_stores.csv','w',newline='') as csv_file:
    w = csv.DictWriter(csv_file, fieldnames=header)
    w.writeheader()
    w.writerows(target_data)


