# Web Scraping Tutorial

I've received interest from some people about web scraping, so I decided to make a tutorial to help you guys and gals get started!

Installation
1. XPath Helper - a browser add-on that helps you find the absolute/relative path of text on a webpage faster. I don't use Firefox, so I'm not sure exactly how its add-on works, but it should be very similar.
    - Chrome https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl
    - Firefox https://addons.mozilla.org/en-US/firefox/addon/xpath-checker/
2. conda install bs4 or pip install bs4
3. conda install lxml or pip install lxml
4. pip install scrapy

bs4 is BeautifulSoup4, which is a package that helps deal with HTML tags

lxml processes XML and HTML in Python

Scrapy gives us the Selector class, which lets us query HTML with XPath

If you find yourself having problems scraping websites that rely on a lot of injected JavaScript or jQuery, you'll need to download a web driver:
    - Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
    - Firefox: https://github.com/mozilla/geckodriver/releases
    - Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
    - IE: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

You'll also need to install Selenium:
pip install selenium

Web drivers and Selenium allow you to access the content that JavaScript and jQuery inject into the page.

### *** FAIR WARNING ***
### Web scraping is a legal gray area, and many sites restrict it in their terms of service. If you are accessing/scraping hundreds of pages within seconds, the site will notice and you can get yourself banned! Please use with caution!
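
One way to stay on the safe side of the warning above is to pause for a random amount of time between requests. Below is a minimal sketch of that idea using only the standard library. The names `polite_pause` and `fetch_all`, and the 1 to 5 second range, are my own assumptions for illustration, not part of any package.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def fetch_all(urls, fetch, pause=polite_pause):
    """Call fetch(url) for every URL, pausing between consecutive requests.

    `fetch` is any function that takes a URL and returns the page,
    for example: lambda url: requests.get(url).text
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # No need to wait before the very first request.
            pause()
        pages.append(fetch(url))
    return pages
```

Passing the sleep and fetch functions in as arguments makes it easy to try the loop without touching a real website. A random pause makes your traffic less bursty, but it is not a guarantee that a site won't block you.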

# Let's start scraping!

In [7]:
# First let's import everything we just installed!

import numpy as np
import pandas as pd
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import requests
import time
from bs4 import BeautifulSoup

# If you need Selenium, import it and point to where you saved the downloaded webdriver
from selenium import webdriver
chromedriver = "/Users/edwardlee/Desktop/chromedriver"

In [2]:
# Let's scrape some Craigslist
# Every .get() call sends a request to the website, so be sure you aren't doing it constantly.
# If you hit the site many times in a row, they'll catch on!
# Sleeping for a random amount of time adds some randomness, but it isn't foolproof.

response = requests.get('https://sfbay.craigslist.org/search/sfc/cta')
time.sleep(5*np.random.rand())
HTML = response.text
HTML[0:500]

u'\ufeff<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>SF bay area cars &amp; trucks  - craigslist</title>\n\n    <meta name="description" content="SF bay area cars &amp; trucks  - craigslist">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>\n    <link rel="canonical" href="https://sfbay.craigslist.org/search/sfc/cta">\n    <link rel="alternate" type="application/rss+xml" href="https://sfbay.craigslist.org/search/sfc/cta?format=rss" title="RSS feed for craigslist | SF bay area cars &amp;'

Above is the start of the HTML for the cars and trucks listings in the SF Bay Area. We want to find the titles and prices! To do that, we need the relative XPath for each tag. Using your XPath add-on, hover over the item you want and it should show you the absolute path. By using // you can skip the intermediate levels and shorten the XPath to just the item you want. An example is below:

**Absolute Path**: ```/html[@class='js canvas draggable fileAPI geolocation hashChange matchMedia picture pushState placeholder no-touchCapable transitions localStorage']/body[@class='search has-map en desktop w1024 grid']/section[@id='page-top']/form[@id='searchform']/div[@id='sortable-results']/ul[@class='rows']/li[@class='result-row'][1]/p[@class='result-info']/span[@class='result-meta']/span[@class='result-price']```

**Relative Path**: ```//li[@class='result-row']//span[@class='result-price']```

The relative path gives me ALL the prices on the first page. We can save that into a variable.
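
To see the difference between the two kinds of path on something small, here is a rough sketch that runs both against a made-up snippet. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy's Selector, so it only supports a subset of XPath and every path has to start with `.`. The `SAMPLE` markup is invented for illustration and is much simpler than the real Craigslist page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a results page (not real Craigslist markup).
SAMPLE = """
<html><body>
  <ul class="rows">
    <li class="result-row">
      <p class="result-info">
        <span class="result-meta"><span class="result-price">$13500</span></span>
      </p>
    </li>
    <li class="result-row">
      <p class="result-info">
        <span class="result-meta"><span class="result-price">$20500</span></span>
      </p>
    </li>
  </ul>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Absolute-style path: every step down from the root is spelled out.
absolute = root.findall("./body/ul/li/p/span/span")

# Relative path: '//' skips the levels in between.
relative = root.findall(".//li[@class='result-row']//span[@class='result-price']")

absolute_prices = [node.text for node in absolute]
relative_prices = [node.text for node in relative]
print(relative_prices)
```

Both paths land on the same two price tags here, but the relative one keeps working if the page adds or renames a wrapper element somewhere above the rows.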

In [12]:
prices = Selector(text=HTML).xpath("//li[@class='result-row']/p[@class='result-info']/span[@class='result-meta']/span[@class='result-price']/text()").extract()
prices

[u'$13500',
 u'$20500',
 u'$15995',
 u'$16400',
 u'$3500',
 u'$5400',
 u'$1800',
 u'$10500',
 u'$5200',
 u'$5000',
 u'$15800',
 u'$25000',
 u'$12950',
 u'$3800',
 u'$7498',
 u'$1700',
 u'$19500',
 u'$4700',
 u'$17900',
 u'$3850',
 u'$12500',
 u'$26750',
 u'$25750',
 u'$17500',
 u'$35995',
 u'$4499',
 u'$1099',
 u'$13500',
 u'$7995',
 u'$19995',
 u'$2047',
 u'$13995',
 u'$2999',
 u'$15995',
 u'$4500',
 u'$8950',
 u'$6000',
 u'$17995',
 u'$31995',
 u'$15600',
 u'$2900',
 u'$19995',
 u'$7500',
 u'$13900',
 u'$1000',
 u'$2600',
 u'$7800',
 u'$9800',
 u'$13800',
 u'$7000',
 u'$5500',
 u'$1543',
 u'$18995',
 u'$1534',
 u'$14500',
 u'$7900',
 u'$45000',
 u'$12000',
 u'$8000',
 u'$16495',
 u'$4000',
 u'$37497',
 u'$6999',
 u'$3000',
 u'$7500',
 u'$10000',
 u'$4999',
 u'$16499',
 u'$1823',
 u'$3200',
 u'$2500',
 u'$7000',
 u'$3700',
 u'$21000',
 u'$11250',
 u'$8900',
 u'$11477',
 u'$2795',
 u'$17988',
 u'$25000',
 u'$2500',
 u'$2500',
 u'$2500',
 u'$4300']

In [15]:
# Now let's do the same thing with titles
titles = Selector(text=HTML).xpath("//li[@class='result-row']/p[@class='result-info']/a[@class='result-title hdrlnk']/text()").extract()
titles

u'2014 *Lexus IS 250* 4dr Sport Sdn Auto RWD - Lexus Gray'

In [13]:
print(len(prices))
print(len(titles))

84
100


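The counts don't match: not every post lists a price, but every post has a title! If we simply paired the two lists up, prices would end up next to the wrong titles. We need to go through the rows one by one and look up each row's title and price together.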

In [29]:
title_xpath = "//li[@class='result-row'][%s]/p[@class='result-info']/a[@class='result-title hdrlnk']/text()"
price_xpath = "//li[@class='result-row'][%s]/p[@class='result-info']/span[@class='result-meta']/span[@class='result-price']/text()"

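# The [%s] placeholder is important! It lets us fill in the row number we need.
# XPath positions start at 1, which is why the loop runs from 1 to 100.

# Parse the HTML once and reuse the selector for every row
selector = Selector(text=HTML)

titles_all = []
price_all = []
for i in range(1, 101):
    title_placeholder = title_xpath % i
    price_placeholder = price_xpath % i

    # .extract() returns a list, so a row without a price gives an empty list
    titles_all.append(selector.xpath(title_placeholder).extract())
    price_all.append(selector.xpath(price_placeholder).extract())

df = pd.DataFrame({'titles': titles_all, 'price': price_all})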
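
If the `%s` trick is new to you, this tiny sketch shows what the filled-in paths look like. `row_xpath` is a shortened, hypothetical template I made up for the example. The real templates work the same way.

```python
# Hypothetical, shortened template for illustration only.
row_xpath = "//li[@class='result-row'][%s]//span[@class='result-price']/text()"

# XPath positions are 1-based, so the first row is [1], not [0].
first_three = [row_xpath % i for i in range(1, 4)]
for path in first_three:
    print(path)
```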

In [32]:
# Completed!
df.head(10)

| | price | titles |
|---|---|---|
| 0 | `[]` | `[2014 *Lexus IS 250* 4dr Sport Sdn Auto RWD - ...` |
| 1 | `[$13500]` | `[Hummer H3 2006]` |
| 2 | `[$20500]` | `[2011 ML 350 4WD]` |
| 3 | `[$15995]` | `[2010 *Chevrolet* *Camaro* 🚗 As Low As $800 D...` |
| 4 | `[$16400]` | `[2013 Toyota Prius Plug-in / 25.9k miles]` |
| 5 | `[$3500]` | `[2011 Chevy cruze]` |
| 6 | `[$5400]` | `[08 With Current Smog Pontiac Grand Prix Low M...` |
| 7 | `[$1800]` | `[1996 ::: BMW Z3 ::: 1.9 2 dr Convertible]` |
| 8 | `[$10500]` | `[1973 VW Thing - all original]` |
| 9 | `[$5200]` | `[2007 Ford F-150 SuperCab]` |
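
Another way to keep titles and prices lined up is to grab each row first and then search inside that row, so a missing price simply comes back as nothing for that row. Here is a rough sketch of the idea on made-up markup, again using the standard library's `xml.etree.ElementTree` as a stand-in. `SAMPLE_ROWS` and `rows_to_records` are names I invented for the example. With Scrapy you could do something similar by looping over the row selectors and calling `.xpath()` on each one with a path that starts with `.`.

```python
import xml.etree.ElementTree as ET

# Made-up rows for illustration: the first listing has no price tag.
SAMPLE_ROWS = """
<ul class="rows">
  <li class="result-row">
    <p class="result-info">
      <a class="result-title hdrlnk">Listing without a price</a>
      <span class="result-meta"></span>
    </p>
  </li>
  <li class="result-row">
    <p class="result-info">
      <a class="result-title hdrlnk">Listing with a price</a>
      <span class="result-meta"><span class="result-price">$13500</span></span>
    </p>
  </li>
</ul>
"""


def rows_to_records(markup):
    """Return one {'title', 'price'} dict per result row, None where missing."""
    root = ET.fromstring(markup)
    records = []
    for row in root.findall(".//li[@class='result-row']"):
        # Search only inside this row, so values can't drift between rows.
        title = row.find(".//a[@class='result-title hdrlnk']")
        price = row.find(".//span[@class='result-price']")
        records.append({
            "title": title.text if title is not None else None,
            "price": price.text if price is not None else None,
        })
    return records


records = rows_to_records(SAMPLE_ROWS)
for record in records:
    print(record)
```

A list of dicts like this can be passed straight to `pd.DataFrame(records)`, which gives plain values in each cell instead of lists.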


BeautifulSoup is a great package to help you find information! Below is an example using the same HTML we downloaded above.

In [58]:
soup = BeautifulSoup(HTML, "lxml")

In [59]:
soup.html.title

<title>SF bay area cars &amp; trucks  - craigslist</title>

In [60]:
soup.html.title.text

u'SF bay area cars & trucks  - craigslist'

In [61]:
element = soup.find_all('span', {'class': 'result-price'})

In [62]:
# You can now loop through this and put it into a list to put into a dataframe!
element[0].text

u'$13500'

In [63]:
# Shows you an organized version of the entire HTML page, but I prefer to use XPATH
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <title>
   SF bay area cars &amp; trucks  - craigslist
  </title>
  <meta content="SF bay area cars &amp; trucks  - craigslist" name="description"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <link href="https://sfbay.craigslist.org/search/sfc/cta" rel="canonical"/>
  <link href="https://sfbay.craigslist.org/search/sfc/cta?format=rss" rel="alternate" title="RSS feed for craigslist | SF bay area cars &amp; trucks  - craigslist " type="application/rss+xml"/>
  <link href="https://sfbay.craigslist.org/search/sfc/cta?s=100" rel="next"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <link href="//www.craigslist.org/styles/cl.css?v=253cd59916a1b826b5363d00ff7b898d" media="all" rel="stylesheet" type="text/css"/>
  <link href="//www.craigslist.org/styles/search.css?v=49b6a11a0fd611ea50b0009c03baeaaf" media="all" rel="stylesheet" type="text/css"/>
  <link href="//www.craigslist.org/styles/jquery-ui-clcusto

BeautifulSoup4 is a very powerful tool for cleaning up HTML and finding the specific things you need. I won't go over everything, but their documentation is really good: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Here's a quick tutorial on how to use webdriver. The code below is what I used to scrape a dropdown box that is injected with jQuery. A plain requests call couldn't see it, because the dropdown only exists after the page's scripts run, so I used Selenium's webdriver to get it.

In [64]:
browser = webdriver.Chrome(executable_path = chromedriver)

# Get all seasons from 1996-97 to 2016-17
browser.get('http://stats.nba.com/league/player/')
HTML = browser.page_source
year_data = Selector(text=HTML).xpath("//select[@name='Season']/option/text()").extract()

# Make sure to quit the webdriver browser, or you'll end up with a bunch of open browser windows
browser.quit()

In [65]:
year_data

[u'2016-17',
 u'2015-16',
 u'2014-15',
 u'2013-14',
 u'2012-13',
 u'2011-12',
 u'2010-11',
 u'2009-10',
 u'2008-09',
 u'2007-08',
 u'2006-07',
 u'2005-06',
 u'2004-05',
 u'2003-04',
 u'2002-03',
 u'2001-02',
 u'2000-01',
 u'1999-00',
 u'1998-99',
 u'1997-98',
 u'1996-97']