# <span style="font-family:Georgia; font-size:1em;">Yelp Webscraping Project</span>


## <span style="font-family:Georgia; font-size:1em;text-align: justify;">Short Description: This project scrapes the business name, phone number, address, and number of reviews for each vape shop listed on Yelp in the United States and writes them to a CSV file in my home directory. The vape shop search term is interchangeable and can be swapped for any other search term, for example restaurants.</span>

### <span style="font-family:Georgia; font-size:1em;text-align: justify; text-justify: inter-word;">This project is the first version of many to come. It is definitely not as fast as it could be; it scrapes around 10 zipcodes every ten minutes. On my Intel Core i5 processor with 9.7 GB of RAM, I can run three terminals and scrape approximately 30 zipcodes per minute.</span>
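The three-terminal setup described above amounts to splitting the zipcode list into parts and running one copy of the script on each part. Here is a minimal sketch of that split, written for Python 3. The names `split_zipcodes`, `write_chunks`, and the `zipcodes_part` file prefix are hypothetical and not part of the original script.

```python
def split_zipcodes(zipcodes, parts):
    """Split a list of zipcodes into `parts` round-robin chunks."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    # Chunk i takes every `parts`-th zipcode, starting at index i.
    return [zipcodes[i::parts] for i in range(parts)]


def write_chunks(zipcodes, parts, prefix="zipcodes_part"):
    """Write each chunk to its own text file, one zipcode per line."""
    filenames = []
    for index, chunk in enumerate(split_zipcodes(zipcodes, parts), start=1):
        filename = "{0}{1}.txt".format(prefix, index)
        with open(filename, "w") as f:
            f.write("\n".join(chunk) + "\n")
        filenames.append(filename)
    return filenames
```

Each terminal could then point its copy of the script at one of the generated files instead of the full zipcode list.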

### <span style="font-family:Georgia; font-size:1em;text-align: justify;text-justify: inter-word;">Essentially, there are many other options that I can use to optimize this program, for example adding multithreading or allowing a user to enter a specific zipcode they want scraped. The main thing I need to work on is logging. At this point, the program only prints the URL being scraped. However, I should be logging at key points within the program, for example when it opens a webpage, how many businesses were scraped per zipcode, and which functions are working correctly. For fun, I will be using Apache Spark to help analyze the data in real time for the last 10% of the zipcodes in the United States.</span>
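Two of the improvements mentioned above, logging at key points and letting the user pass specific zipcodes, can be sketched with the standard library alone. This is a Python 3 sketch under stated assumptions. The logger name `yelp_scraper`, the `--logfile` option, and the helper names are all hypothetical and do not appear in the original script.

```python
import argparse
import logging

logger = logging.getLogger("yelp_scraper")  # hypothetical logger name


def configure_logging(logfile=None):
    """Send log records to a file if given, otherwise to the terminal."""
    logging.basicConfig(
        filename=logfile,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )


def parse_args(argv=None):
    """Read optional zipcodes and a log file path from the command line."""
    parser = argparse.ArgumentParser(
        description="Scrape Yelp search results by zipcode.")
    parser.add_argument("zipcodes", nargs="*",
                        help="zipcodes to scrape; empty means use the zipcode file")
    parser.add_argument("--logfile", default=None,
                        help="write the log to this file instead of the terminal")
    return parser.parse_args(argv)


def log_zipcode_summary(zipcode, url, business_count):
    """Log the key facts for one zipcode and return the logged message."""
    message = "zipcode=%s url=%s businesses=%d" % (zipcode, url, business_count)
    logger.info(message)
    return message
```

A crawl loop could call `log_zipcode_summary` once per zipcode after scraping, which would cover the "how many businesses were scraped per zipcode" case.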

### <span style="font-family:Georgia; font-size:.9em;text-align: justify;text-justify: inter-word;">Things I plan to do in the near future:</span>

<span style="font-family:Georgia; font-size:.9em;text-align: justify;text-justify: inter-word;">

1. Geocode the addresses + begin cleaning the data, specifically removing duplicates 

2. Use Tableau to display all of the locations of the Vape Shops

3. Get the raw counts of vapor stores by state 

4. Adjust the counts according to population in these areas 

5. Evaluate the relationships (if there are any) between smoking death rates, tax standards for tobacco products, and newly imposed legislation patterns. 

6. In places with high concentrations of vape shops (the threshold is to be determined in the near future), I will also look at the socioeconomic tendencies around these high-concentration areas</span>
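Steps 1, 3, and 4 of the list above (removing duplicates, counting stores by state, and adjusting for population) could look roughly like the following Python 3 sketch. It assumes the scraped address ends in a form like `City, ST 12345`, which is a guess about the data and not something verified here. The function names are hypothetical, and population figures must be supplied by the caller from a real source.

```python
import csv
import re
from collections import Counter

# Assumption: addresses end like "..., City, ST 12345"; rows that do not
# match this pattern are counted under "unknown".
STATE_PATTERN = re.compile(r"\b([A-Z]{2})\s+\d{5}(?:-\d{4})?\s*$")


def load_rows(path):
    """Read the scraper's CSV output into a list of dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def remove_duplicates(rows):
    """Keep the first row for each (name, address) pair, ignoring case."""
    seen = set()
    unique = []
    for row in rows:
        key = ((row.get("bizname") or "").strip().lower(),
               (row.get("addr") or "").strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def count_by_state(rows):
    """Count rows per two-letter state code parsed from the address."""
    counts = Counter()
    for row in rows:
        match = STATE_PATTERN.search(row.get("addr") or "")
        counts[match.group(1) if match else "unknown"] += 1
    return counts


def per_100k(counts, populations):
    """Convert raw counts to shops per 100,000 residents.

    States without a population entry are skipped.
    """
    return {
        state: 100000.0 * count / populations[state]
        for state, count in counts.items()
        if populations.get(state)
    }
```

Matching on name and address together is a deliberately conservative choice: two shops with the same name at different addresses stay separate.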

#### <span style="font-family:Georgia; font-size:.85em;text-align: justify; text-justify: inter-word;">Disclosure: There are better scraping solutions out there; however, this was my first attempt at scraping. In general, scraping Yelp is looked down upon. Please take a look at their robots.txt. </span>
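One way to act on the robots.txt advice above is to check a URL against a site's rules before fetching it. The Python 3 sketch below uses the standard library's `urllib.robotparser`. The helper names are hypothetical, and the rules in `EXAMPLE_RULES` are made up for illustration and are not Yelp's actual rules.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only; these are NOT Yelp's actual rules.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /search",
]

# For a live site, the rules can be downloaded instead of passed in:
#     parser = RobotFileParser()
#     parser.set_url("http://www.yelp.com/robots.txt")
#     parser.read()
```

With a check like this, a crawl loop could skip any URL for which `is_allowed` returns `False`.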

# Here is the code: 

__author__ = 'Zachary Diemer'
__date__ = 'April 19th, 2016'

from bs4 import BeautifulSoup
from selenium import webdriver
import csv
#from urlparse import urljoin

#Global Declarations
ZIP_URL = "zipcodes.txt"
path_to_chromedriver ='/home/zackymo/Desktop/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

#This function reads the zipcodes from a text file and returns them as a list.

def get_zips():
    # Open read-only; the with block closes the file even if reading fails
    with open(ZIP_URL, 'r') as f:
        zips = [zcs.strip() for zcs in f.read().split('\n') if zcs.strip()]
    return zips

#This function builds the Yelp search url for a given zipcode

def get_yelp_page(zipcode):
    return 'http://www.yelp.com/search?find_desc=vape+shops&find_loc={0}'.format(zipcode)

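#This synchronizes BeautifulSoup and Selenium
#The first command grabs the html of the page currently loaded in the browser
#and the second parses it with BeautifulSoup
#Note: the url argument is not used; the page must already be loaded with browser.get()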
def make_soup(url):
    html = browser.page_source
    return BeautifulSoup(html, "lxml")


#Create and open the new csv file, define the header names, and set up the file for data entry
csvFile = open('vapeshops.csv', 'w')
fieldnames = ['bizname', 'addr', 'bizphone', 'numrevs']  # not yet collected: 'cheapornah', 'rating', 'wifiavail'
writer = csv.DictWriter(csvFile, fieldnames=fieldnames, delimiter=',', lineterminator='\n')
writer.writeheader()


#This function scrapes the url for the specified information

def gather_info(url):
    soup = make_soup(url)
    vapeshops = soup.find_all("div", {"class": "search-result"})
    for v in vapeshops:
        desc = {}
        # Each field is extracted on its own so one missing element does not lose the row.
        # The print calls now read from desc; the original names (bizname, addr, ...) were undefined.
        try:
            desc['bizname'] = v.contents[1].find_all("a", {"class": "biz-name"})[0].text.encode("utf-8")
            print desc['bizname']
        except Exception:
            pass  # business name extract failed
        try:
            desc['addr'] = v.contents[1].find_all("address")[0].getText(separator=u', ').encode("utf-8") #works fully with commas
            print desc['addr']
        except Exception:
            pass  # business address extract failed
        try:
            desc['bizphone'] = v.contents[1].find_all("span", {"class": "biz-phone"})[0].text.encode("utf-8") #works
            print desc['bizphone']
        except Exception:
            pass  # phone number extract failed
        #print item.contents[1].find_all("div", {"class": "rating-large"}) #sort of works
        try:
            desc['numrevs'] = v.contents[1].find_all("span", {"class": "review-count"})[0].text.encode("utf-8") #works
            print desc['numrevs']
        except Exception:
            pass  # number of reviews extract failed
        """try:
            desc['cheapornah'] = v.contents[1].find_all("span", {"class": "price-range"}).encode("utf-8")
            print cheapornah
        except: #Exception, e:
            pass#if verbose: print ' extract fail', str(e)
        try:
            desc['rating'] = v.contents[1].find_all("div", {"class": "rating-large"}).encode("utf-8")
            print rating
        except: #Exception, e:
            pass#if verbose: print ' extract fail', str(e)
        try:
            desc['wifiavail'] = v.find('dd', {'class':'attr-WiFi'}).getText().encode("utf-8")
            print wifiavail
        except: #Exception, e:
            pass#if verbose: print 'Wifi availability extract fail', str(e)"""
        writer.writerow(desc)

        

#This is essentially the main function. It executes the various functions and outputs the url of the zipcode
#that is being scraped to the terminal

def crawl():
    # CSS selector for the pagination element that is clicked to reach the next page
    next_page = 'span[class="pagination-label u-align-middle responsive-hidden-small pagination-links_anchor"]'
    zipcodes = get_zips()
    for z in zipcodes:
        #Add the zipcode to the Base URL
        initial_url = get_yelp_page(z)
        #Log that info somehow
        print initial_url
        #Use Selenium to load the URL
        browser.get(initial_url)
        #Gather the specific information you are looking for
        gather_info(initial_url)
        #Attempt to go to the next page
        try:
            browser.find_element_by_css_selector(next_page).click()
        except Exception:
            pass
        #Create a new variable to update url over time
        prev_url = initial_url
        #Get the new page's url
        new_url = browser.current_url
        #counter = 0 - If I would like, I can output the number of businesses extracted in each zipcode
        #Continue until the url stops changing, which means there is no further page;
        #the commented-out counter could be used to cap the number of pages
        while prev_url != new_url: #and counter <= 10:
            prev_url = new_url
            gather_info(new_url)
            browser.get(new_url)
            try:
                browser.find_element_by_css_selector(next_page).click()
            except Exception:
                pass
            new_url = browser.current_url
            #counter += 1
        #Print counter or add to a text file with zipcode name here



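#Note: this run failed with "ImportError: No module named bs4". The script needs the
#beautifulsoup4, lxml, and selenium packages installed before it will run.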