# Web Scraping in Python


### Some things you should consider before web scraping a website:
1.) Check a site's terms and conditions before you scrape it.

2.) Space out your requests so you don't overload the site's server; sending too many requests too quickly could get you blocked.

3.) Scrapers break over time. Web pages change their layout often, so you'll more than likely have to rewrite your code.

4.) Web pages are usually inconsistent, so you'll more than likely have to clean up the data after scraping it.

5.) Every web page and situation is different, so you'll have to spend time configuring your scraper.
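
As a rough illustration of point 2, the sketch below pauses between consecutive requests. The names fetch_all, fetch and delay_seconds, and the one-second default, are assumptions made for this example and are not part of any library.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    fetch_all, fetch and delay_seconds are hypothetical names for this sketch;
    fetch can be any callable that takes a URL, for example requests.get.
    """
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # wait before every request except the first
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With requests installed, a call might look like fetch_all(list_of_urls, requests.get, delay_seconds=2). A suitable delay depends on the site, so treat the number as a starting point.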


In addition to Python, we'll need three modules:
1.) BeautifulSoup, which you can install by typing: pip install beautifulsoup4 or conda install beautifulsoup4 (for the Anaconda distribution of Python) at your command prompt.

2.) lxml, which you can install by typing: pip install lxml or conda install lxml (for the Anaconda distribution of Python) at your command prompt.

3.) requests, which you can install by typing: pip install requests or conda install requests (for the Anaconda distribution of Python) at your command prompt.
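
If you want to confirm the installs worked before going further, one possible check is sketched below. It only uses the standard library; missing_modules and REQUIRED are names invented for this example.

```python
import importlib.util

# bs4 is the import name of the beautifulsoup4 package
REQUIRED = ["bs4", "lxml", "requests"]


def missing_modules(names):
    """Return the module names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# an empty list means all three modules are installed
print(missing_modules(REQUIRED))
```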

We'll start with our imports:

In [5]:
from bs4 import BeautifulSoup
import requests

import pandas as pd
from pandas import Series, DataFrame

In [10]:
# download the University of California legislative reports page
url = 'http://www.ucop.edu/operating-budget/budgets-and-reports/legislative-reports/2013-14-legislative-session.html'
webdata = requests.get(url)
data = webdata.content
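
The cell above assumes the download succeeds. A slightly more defensive variant, sketched here under the assumption that a ten-second timeout is acceptable, stops early on a failed request instead of parsing an error page. fetch_html and its parameters are hypothetical names for this example; timeout and raise_for_status() are part of requests.

```python
import requests


def fetch_html(url, timeout_seconds=10, get=requests.get):
    """Return the raw bytes of a page, raising an error on a failed request.

    fetch_html, timeout_seconds and get are hypothetical names for this sketch;
    get defaults to requests.get and can be swapped out for testing.
    """
    response = get(url, timeout=timeout_seconds)
    # raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

Calling data = fetch_html(url) would then stand in for the two lines that build webdata and data.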

In [11]:
# define the soup object for the webdata
# name the parser explicitly: without it, Beautiful Soup warns that
# BeautifulSoup(YOUR_MARKUP) should be changed to BeautifulSoup(YOUR_MARKUP, "lxml")
soup = BeautifulSoup(data, "lxml")


In [28]:
# navigate to the section of interest
summary = soup.find("div", {'class':'list-land', 'id':'content'})

# find the tables in the html (used in the cells below)
tables = summary.find_all('table')

# display the section we found
summary

<div class="list-land" id="content">
<!-- Main hero unit for a primary marketing message or call to action -->
<div class="row">
<div class="span12">
<h1 class="page-header">Budget Analysis and Planning</h1>
</div>
<div class="span12">
<ul class="nav nav-tabs sub-nav tab4">
<li class=""><a class="" href="../../index.html">Overview</a></li>
<li class=""><a class="" href="../../staff/index.html">Staff</a></li>
<li class="active"><a class="" href="../index.html">Budgets &amp; Reports</a></li>
<li class=""><a class="" href="../../fees-and-enrollments/index.html">Fees &amp; Enrollments</a></li>
</ul>
</div>
</div>
<!-- Example row of columns -->
<div class="row">
<div class="span8 dotted-top" role="main">
<h2>Legislative reports</h2>
<h3 class="subhead">2013-2014</h3>
<table cellpadding="5" cellspacing="0" class="table-striped" id="report" summary="2009-10 Legislative Reports in a table with one level of column and row headers" width="100%">
<tbody>
<tr>
<th scope="col"></th><th scope="col"

Now we need to use Beautiful Soup to find the table entries. A 'td' tag defines a standard cell in an HTML table. The 'tr' tag defines a row in an HTML table.
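
To make the relationship between 'tr' and 'td' concrete, here is a small sketch on a made-up HTML snippet. The table contents and the names demo_soup and demo_rows are invented for illustration, and the built-in "html.parser" is used so the example runs even without lxml.

```python
from bs4 import BeautifulSoup

# a made-up table, only for illustration
html = """
<table>
  <tr><th>No.</th><th>Due</th><th>Report</th></tr>
  <tr><td>1</td><td>08/01/13</td><td>Example report A (pdf)</td></tr>
  <tr><td>2</td><td>09/01/13</td><td>Example report B (pdf)</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# one list of cell texts per <tr>; the header row has no <td>, so it yields []
demo_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in demo_soup.find_all("tr")
]
print(demo_rows)
```

Keeping the cells grouped by row like this is one option; the walkthrough below collects them into a single flat list instead.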

We'll loop through the rows of our tables object and find each cell using the findAll('td') method.

In [41]:
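# collect every row ('tr') of the first table
rows = tables[0].findAll('tr')

# set up empty list to hold the cell text
data = []

for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        # grab the first text node in the cell
        text = td.find(text=True)
        # data.append() returns None, so each printed line ends with 'None'
        print (text, data.append(text))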

1 None
08/01/13 None
2013-14 (EDU 92495) Proposed Capital Outlay Projects (2013-14 only) (pdf) None
2 None
09/01/13 None
2014-15  (EDU 92495) Proposed Capital Outlay Projects (pdf) None
3 None
11/01/13 None
Utilization of Classroom and Teaching Laboratories (pdf) None
4 None
11/01/13 None
Instruction and Research Space Summary & Analysis (pdf) None
5 None
11/15/13 None
Statewide Energy Partnership Program (pdf) None
6 None
11/30/13 None
2013-23 Capital Financial Plan (pdf) None
7 None
11/30/13 None
Projects Savings Funded from Capital Outlay Bond Funds (pdf) None
8 None
12/01/13 None
Streamlined Capital Projects Funded from Capital (pdf) None
9 None
01/01/14 None
Annual General Obligation Bonds Accountability (pdf) None
10 None
01/01/14 None
Small Business Utilization (pdf) None
11 None
01/01/14 None
Institutional Financial Aid Programs - Preliminary report (pdf) None
12 None
01/10/14 None
Summer Enrollment (pdf) None
13 None
01/15/14 None
Contracting Out for Services at Newly Develope

In [42]:
data

['1',
 '08/01/13',
 '2013-14 (EDU 92495) Proposed Capital Outlay Projects (2013-14 only) (pdf)',
 '2',
 '09/01/13',
 '2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)',
 '3',
 '11/01/13',
 'Utilization of Classroom and Teaching Laboratories (pdf)',
 '4',
 '11/01/13',
 'Instruction and Research Space Summary & Analysis (pdf)',
 '5',
 '11/15/13',
 'Statewide Energy Partnership Program (pdf)',
 '6',
 '11/30/13',
 '2013-23 Capital Financial Plan (pdf)',
 '7',
 '11/30/13',
 'Projects Savings Funded from Capital Outlay Bond Funds (pdf)',
 '8',
 '12/01/13',
 'Streamlined Capital Projects Funded from Capital (pdf)',
 '9',
 '01/01/14',
 'Annual General Obligation Bonds Accountability (pdf)',
 '10',
 '01/01/14',
 'Small Business Utilization (pdf)',
 '11',
 '01/01/14',
 'Institutional Financial Aid Programs - Preliminary report (pdf)',
 '12',
 '01/10/14',
 'Summer Enrollment (pdf)',
 '13',
 '01/15/14',
 'Contracting Out for Services at Newly Developed Facilities (pdf)',
 '14',
 '03/

In [78]:
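# keep only the entries that name a pdf report, plus the date before each one
reports = []
date = []
for index, item in enumerate(data):
    if 'pdf' in item:
        # replace the non-breaking space (\xa0) with a regular space
        reports.append(item.replace(u'\xa0', u' '))
        # the date is the entry just before the report title
        date.append(data[index - 1])

#reports
date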

['08/01/13',
 '09/01/13',
 '11/01/13',
 '11/01/13',
 '11/15/13',
 '11/30/13',
 '11/30/13',
 '12/01/13',
 '01/01/14',
 '01/01/14',
 '01/01/14',
 '01/10/14',
 '01/15/14',
 '03/01/14',
 '03/01/14',
 '03/31/14',
 '04/01/14',
 '04/01/14',
 '05/15/14',
 '07/01/14']
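
The same pairing logic can be packaged as a function that returns each date together with its report. This is only a sketch: pair_dates_and_reports is a hypothetical name, the sample list reuses the first six values shown in the output earlier, and it assumes (as the loop above does) that the date is always the entry immediately before the report title.

```python
def pair_dates_and_reports(cells):
    """Return (date, report) tuples from a flat list of cell texts."""
    pairs = []
    for index, item in enumerate(cells):
        # skip empty cells (None), non-pdf entries, and a pdf entry with nothing before it
        if item and 'pdf' in item and index > 0:
            # replace the non-breaking space (\xa0) with a regular space
            report = item.replace('\xa0', ' ')
            pairs.append((cells[index - 1], report))
    return pairs


sample = ['1', '08/01/13',
          '2013-14 (EDU 92495) Proposed Capital Outlay Projects (2013-14 only) (pdf)',
          '2', '09/01/13',
          '2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)']
pairs = pair_dates_and_reports(sample)
```

Unlike the cell above, this version also skips cells whose text is None, which would otherwise raise a TypeError on the 'pdf' in item test.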

In [79]:
# set up dates and reports as Series
date = Series(date)
reports = Series(reports)

In [87]:
# concatenate into dataframe
legislative_df = pd.concat([date, reports], axis=1)

# set up the column names
legislative_df.columns = ['Date', 'Reports']

legislative_df.head()

| | Date | Reports |
|---|---|---|
| 0 | 08/01/13 | 2013-14 (EDU 92495) Proposed Capital Outlay Pr... |
| 1 | 09/01/13 | 2014-15 (EDU 92495) Proposed Capital Outlay P... |
| 2 | 11/01/13 | Utilization of Classroom and Teaching Laborato... |
| 3 | 11/01/13 | Instruction and Research Space Summary & Analy... |
| 4 | 11/15/13 | Statewide Energy Partnership Program (pdf) |
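
The Date column holds plain strings at this point. If you want to sort or filter by date, one option is to convert it with pd.to_datetime. The sketch below uses a two-row stand-in called demo_df (a hypothetical name, with placeholder report titles) and assumes every date follows the month/day/two-digit-year pattern seen in the output.

```python
import pandas as pd

# a two-row stand-in for legislative_df; the report titles are placeholders
demo_df = pd.DataFrame({
    'Date': ['08/01/13', '09/01/13'],
    'Reports': ['Report A (pdf)', 'Report B (pdf)'],
})

# parse month/day/two-digit-year strings into datetime values
demo_df['Date'] = pd.to_datetime(demo_df['Date'], format='%m/%d/%y')
print(demo_df.dtypes)
```

Passing an explicit format avoids any ambiguity between month-first and day-first dates.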
