In [1]:
# Parse the HTML into a more useful structure, then export it to CSV
import requests
from bs4 import BeautifulSoup 

## Download the HTML

The `requests` library lets us send HTTP requests. We'll use `requests.get()` to retrieve a response object from the server.

In [3]:
# scraping the field experiments library

url = 'http://www.fieldexperiments.com/papers/'
page = requests.get(url)

In [4]:
page

<Response [200]>
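
A status of 200 means the request succeeded, but the code above would carry on just the same after a 404 or 500. As a minimal sketch, the download step could be wrapped so that a failed request raises an error instead. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` with the `url` defined above should give the same bytes as `page.content`.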

In [6]:
page.content # content isn't very useful as it is right now, although data we may want could be in there

b'\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <meta name="description" content="A browseable library of economics field experiment papers.">\n    <meta name="author" content="Joe Seidel">\n\t<meta name="google-site-verification" content="ebk_6cw-mE_GTogBbNfQCTU5S9wuh374hN36137ArRc" />\n    <link rel="icon" href="http://www.uchicago.edu/favicon.ico">\n\n    <title>Field Experiments</title>\n\n\n    <!-- Bootstrap core CSS -->\n    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css" integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7" crossorigin="anonymous">\n\n    <!-- Custom styles for this template -->\n    \n<link href="/static/library/css/sticky-footer-navbar.css" rel="stylesheet">\n\n\n    <!-- HTML5 shim and Respond.js IE8 support of H

## Parse the response content

In [8]:
soup = BeautifulSoup(page.content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="A browseable library of economics field experiment papers." name="description"/>
  <meta content="Joe Seidel" name="author"/>
  <meta content="ebk_6cw-mE_GTogBbNfQCTU5S9wuh374hN36137ArRc" name="google-site-verification">
   <link href="http://www.uchicago.edu/favicon.ico" rel="icon"/>
   <title>
    Field Experiments
   </title>
   <!-- Bootstrap core CSS -->
   <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css" integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7" rel="stylesheet"/>
   <!-- Custom styles for this template -->
   <link href="/static/library/css/sticky-footer-navbar.css" rel="stylesheet"/>
   <!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media queries -->
  

### Navigating the HTML

In [10]:
soup.title

<title>Field Experiments</title>

In [11]:
soup.title.string

'Field Experiments'

In [12]:
soup.title.text

'Field Experiments'
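
For `<title>`, `.string` and `.text` agree because the tag holds a single piece of text. They differ once a tag has child tags, which is the shape of the paper links further down (an `<a>` with a nested `<span>`). A small sketch using made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML: a link whose text is followed by a nested <span>
html = '<a href="/paper/1/">A Paper Title<span class="icon"></span></a>'
link = BeautifulSoup(html, "html.parser").a

print(link.string)  # None, because <a> has two children (text and <span>)
print(link.text)    # A Paper Title
```

This is why `.text` is the safer choice when pulling titles out of links like these.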

We can create a list of all tags of a given type (e.g., 'p', 'a', 'div') using `find_all`.
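
As a small illustration on made-up HTML (the name `soup_demo` is an assumption chosen to avoid overwriting `soup`), `find_all` returns tag objects, so attributes such as `href` can be read from each result:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/about/">About</a></li>
  <li><a href="/faq/">Faq</a></li>
  <li><a>No link target</a></li>
</ul>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute
links = [(a.text, a["href"]) for a in soup_demo.find_all("a", href=True)]
print(links)  # [('About', '/about/'), ('Faq', '/faq/')]
```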

In [13]:
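soup.find_all('a')    # every link; a notebook cell displays only its last expression
soup.find_all('div')  # every <div>; this is the result shown below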

[<div class="navbar navbar-default navbar-fixed-top" role="navigation">
 <div class="container">
 <div class="navbar-header">
 <button class="navbar-toggle collapsed" data-target=".navbar-collapse" data-toggle="collapse" type="button">
 <span class="sr-only">Toggle navigation</span>
 <span class="icon-bar"></span>
 <span class="icon-bar"></span>
 <span class="icon-bar"></span>
 </button>
 <a class="navbar-brand" href="/">Field Experiments</a>
 </div>
 <div class="navbar-collapse collapse">
 <ul class="nav navbar-nav">
 <li><a href="/about/">About</a></li>
 <li class="dropdown">
 <a class="dropdown-toggle" data-toggle="dropdown" href="#">Browse By <span class="caret"></span></a>
 <ul class="dropdown-menu" role="menu">
 <li><a href="/papers/">All</a></li>
 <li><a href="/authors/">Authors</a></li>
 <li><a href="/types/">Type</a></li>
 <li><a href="/search/">Search</a></li>
 </ul>
 </li>
 <li><a href="/faq/">Faq</a></li>
 <li><a href="/contact/">Contact</a></li>
 <li><form action="/search/

### Find element by 'id'

In [16]:
container = soup.find(id='accordion')
container

<div class="panel-group" id="accordion">
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">
<a href="/paper/2775/">2020: A Summary Of Artefactual Field Experiments On Fieldexperiments.Com: The Who's, What's, Where's, And When's<span class="glyphicon glyphicon-link"></span></a>
<meta content="2020: A Summary of Artefactual Field Experiments on fieldexperiments.com: The Who's, What's, Where's, and When's" name="citation_title"/>
<meta content="List John A" name="citation_author"/>
<meta content="2020" name="citation_publication_date"/>
<meta content="Working Papers" name="citation_journal_title"/>
<meta content="http://s3.amazonaws.com/fieldexperiments-papers2/papers/00721.pdf" name="citation_pdf_url"/>
</h3>
</div>
<div class="panel-body">
<div class="row">
<div class="col-xs-8">
							
								John A List
							
							</div>
<div class="col-xs-4">
								Cited by*:  Downloads*:  <a href="http://ideas.repec.org/p/feb/artefa/00721.html"><img src="/

Let's get a list of all the elements with CSS class name 'panel'.

In [18]:
paperList = container.find_all('div', class_= 'panel')
paperList
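
Note that `class_='panel'` matches any element whose class list contains the exact token `panel`. It does not match classes that merely start with that word, such as `panel-heading` or `panel-body`. A sketch on made-up HTML with the same class names:

```python
from bs4 import BeautifulSoup

html = """
<div class="panel panel-default">
  <div class="panel-heading">heading</div>
  <div class="panel-body">body</div>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

matches = demo.find_all("div", class_="panel")
print(len(matches))         # 1
print(matches[0]["class"])  # ['panel', 'panel-default']
```

So each item in `paperList` is one whole paper panel, not one of its inner sections.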

In [20]:
first = paperList[0]
first

<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">
<a href="/paper/2775/">2020: A Summary Of Artefactual Field Experiments On Fieldexperiments.Com: The Who's, What's, Where's, And When's<span class="glyphicon glyphicon-link"></span></a>
<meta content="2020: A Summary of Artefactual Field Experiments on fieldexperiments.com: The Who's, What's, Where's, and When's" name="citation_title"/>
<meta content="List John A" name="citation_author"/>
<meta content="2020" name="citation_publication_date"/>
<meta content="Working Papers" name="citation_journal_title"/>
<meta content="http://s3.amazonaws.com/fieldexperiments-papers2/papers/00721.pdf" name="citation_pdf_url"/>
</h3>
</div>
<div class="panel-body">
<div class="row">
<div class="col-xs-8">
							
								John A List
							
							</div>
<div class="col-xs-4">
								Cited by*:  Downloads*:  <a href="http://ideas.repec.org/p/feb/artefa/00721.html"><img src="/static/library/gif/pdficon_small.png"/></

In [21]:
title = first.find('a').text
title

"2020: A Summary Of Artefactual Field Experiments On Fieldexperiments.Com: The Who's, What's, Where's, And When's"

In [26]:
authorsList = first.find_all(attrs={'name': 'citation_author'})
authorsList[0]['content']

'List John A'

In [29]:
yearMeta = first.find(attrs={'name': 'citation_publication_date'})
year = yearMeta['content']
year

'2020'

In [30]:
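# Collect one record (dict) per paper panel
d = []
for paper in paperList:
    # The first link in the panel holds the paper title
    title = paper.find('a').text

    # find_all returns every citation_author <meta> tag; keep only the first
    authorsList = paper.find_all(attrs={'name': 'citation_author'})
    first_author = authorsList[0]['content']

    # The publication year is stored in a <meta> tag's content attribute
    yearMeta = paper.find(attrs={'name': 'citation_publication_date'})
    year = yearMeta['content']

    tempDict = dict(
        title=title,
        first_author=first_author,
        year=year
    )

    d.append(tempDict)

d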

[{'title': "2020: A Summary Of Artefactual Field Experiments On Fieldexperiments.Com: The Who's, What's, Where's, And When's",
  'first_author': 'List John A',
  'year': '2020'},
 {'title': "2020: A Summary Of Framed Field Experiments On Fieldexperiments.Com: The Who's, What's Where's, And When's",
  'first_author': 'List John A',
  'year': '2020'},
 {'title': '2020 Summary Data Of Natural Field Experiments Published On Fieldexperiments.Com',
  'first_author': 'List John A',
  'year': '2020'},
 {'title': '2021 Summary Data Of Artefactual Field Experiments Published On Fieldexperiments.Com',
  'first_author': 'List John A',
  'year': '2022'},
 {'title': '2021 Summary Data Of Natural Field Experiments Published On Fieldexperiments.Com',
  'first_author': 'List John A',
  'year': ''},
 {'title': 'Academic Economists Behaving Badly? A Survey On Three Areas Of Unethical Behavior',
  'first_author': 'Bailey Charles ',
  'year': '2001'},
 {'title': 'Achievement Awards For High School Matricul
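
The output above shows two rough edges: an empty `year` and an author name with trailing whitespace. The loop would also raise an error if a panel had no author tag at all. Below is a sketch of a more defensive version, not the notebook's own method. The helper name `parse_paper`, the `authors` field, and the sample HTML are all made up for illustration.

```python
from bs4 import BeautifulSoup


def parse_paper(panel):
    """Turn one paper panel into a dict (hypothetical helper)."""
    link = panel.find("a")
    authors = [m.get("content", "").strip()
               for m in panel.find_all(attrs={"name": "citation_author"})]
    year_meta = panel.find(attrs={"name": "citation_publication_date"})
    return {
        "title": link.get_text(strip=True) if link else "",
        # Join every author instead of keeping only the first
        "authors": "; ".join(authors),
        # A missing tag and an empty content attribute both become ''
        "year": year_meta.get("content", "") if year_meta else "",
    }


# Made-up panel with two authors and no publication date
html = """
<div class="panel">
  <a href="/paper/1/">Example Paper<span></span></a>
  <meta content="Doe Jane " name="citation_author"/>
  <meta content="Roe Richard" name="citation_author"/>
</div>
"""
panel = BeautifulSoup(html, "html.parser").find("div", class_="panel")
print(parse_paper(panel))
# {'title': 'Example Paper', 'authors': 'Doe Jane; Roe Richard', 'year': ''}
```

With this helper, the loop body would reduce to `d.append(parse_paper(paper))`.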

## Export to CSV

In [32]:
import pandas as pd

df = pd.DataFrame(d)

In [34]:
import os

csvFilePath = os.path.join(os.getcwd(), 'fe_scrape.csv')
df.to_csv(csvFilePath, index=False)
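
pandas is convenient here but not required. As an alternative sketch, Python's built-in `csv` module can write the same kind of list of dicts. The sample records and the file name `fe_scrape_demo.csv` are made up, and the file is written to the system temp directory rather than the working directory.

```python
import csv
import os
import tempfile

# Made-up records with the same keys as the dicts in d
records = [
    {"title": "Example Paper, With A Comma", "first_author": "Doe Jane", "year": "2020"},
    {"title": "Another Paper", "first_author": "Roe Richard", "year": ""},
]

csv_path = os.path.join(tempfile.gettempdir(), "fe_scrape_demo.csv")

# newline="" lets the csv module control line endings itself
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "first_author", "year"])
    writer.writeheader()
    writer.writerows(records)

# Read the file back to confirm the round trip
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows == records)  # True
```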

In [36]:
# To scrape every page rather than just the first, build each page's URL in a loop
for i in range(1, 71):
    url = f'http://www.fieldexperiments.com/papers/?page={i}'
    print(url)

http://www.fieldexperiments.com/papers/?page=1
http://www.fieldexperiments.com/papers/?page=2
http://www.fieldexperiments.com/papers/?page=3
http://www.fieldexperiments.com/papers/?page=4
http://www.fieldexperiments.com/papers/?page=5
http://www.fieldexperiments.com/papers/?page=6
http://www.fieldexperiments.com/papers/?page=7
http://www.fieldexperiments.com/papers/?page=8
http://www.fieldexperiments.com/papers/?page=9
http://www.fieldexperiments.com/papers/?page=10
http://www.fieldexperiments.com/papers/?page=11
http://www.fieldexperiments.com/papers/?page=12
http://www.fieldexperiments.com/papers/?page=13
http://www.fieldexperiments.com/papers/?page=14
http://www.fieldexperiments.com/papers/?page=15
http://www.fieldexperiments.com/papers/?page=16
http://www.fieldexperiments.com/papers/?page=17
http://www.fieldexperiments.com/papers/?page=18
http://www.fieldexperiments.com/papers/?page=19
http://www.fieldexperiments.com/papers/?page=20
http://www.fieldexperiments.com/papers/?page=21
h

When web scraping, your code can send requests far faster than a human could. Be careful not to crash a website or put abusive load on it; for example, add pauses between requests.
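
One way to add such pauses is sketched below. The helper name `scrape_pages`, the `fetch` parameter, and the one-second default delay are assumptions for illustration; choose a delay that suits the site you are scraping.

```python
import time


def scrape_pages(page_numbers, fetch, delay_seconds=1.0):
    """Fetch each page in turn, pausing between requests (hypothetical helper)."""
    results = []
    for n, page_number in enumerate(page_numbers):
        if n > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        url = f"http://www.fieldexperiments.com/papers/?page={page_number}"
        results.append(fetch(url))
    return results
```

Here `fetch` is any function that takes a URL and returns the page, for example `lambda u: requests.get(u, timeout=10).content`. Passing it in keeps the pacing logic separate from the download code.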

Some websites do not want to be scraped. They may use anti-bot measures to deter you, or they may simply prohibit scraping in their terms of service.
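
A related but separate convention is a site's `robots.txt` file, which lists paths that automated clients are asked to avoid. It does not replace reading the terms of service. As a sketch, Python's standard library can interpret these rules; the rules and URLs below are made up and parsed from text so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, parsed from text so the example runs offline
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "http://example.com/papers/"))    # True
print(parser.can_fetch("*", "http://example.com/private/x"))  # False
```

Against a live site, you would instead call `parser.set_url(...)` with the address of its `robots.txt` and then `parser.read()`.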

Some sites offer an API that provides this data directly, so you don't need to scrape their pages and congest their servers.

### Dynamic Site

What if you wanted to scrape a dynamic website such as Zillow? The server returns JavaScript code that tells your browser how to build the page. With the code we used today, you would see only that JavaScript, not the data you are after.
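
To make the problem concrete, here is a sketch with a made-up response of the kind a dynamic site might send: an empty container plus a script that fills it in. This is not Zillow's actual markup, and the class name `listing` is invented.

```python
from bs4 import BeautifulSoup

# Made-up response from a dynamic site: an empty container plus a script
html = """
<html><body>
  <div id="root"></div>
  <script>
    document.getElementById("root").innerHTML =
      '<div class="listing">3 bed, 2 bath</div>';
  </script>
</body></html>
"""
dynamic_soup = BeautifulSoup(html, "html.parser")

# The listing only exists after a browser runs the script, so nothing is found
print(dynamic_soup.find_all("div", class_="listing"))  # []
print(dynamic_soup.find(id="root").text == "")          # True
```

Beautiful Soup parses the HTML it is given but never executes scripts, so the data stays hidden inside the JavaScript text.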

If you want to do this, look at Python packages like `selenium`. It's a lot more complicated than what we've done, but it's worth looking into if you are working with dynamic websites.