In [1]:
import requests
from bs4 import BeautifulSoup

# Normal scraping

By now we all know how to scrape normal sites (kind of, mostly, somewhat).

In [None]:
# Grab the NYT's homepage
# http://docs.python-requests.org/en/master/
# vs urllib

 



In [7]:
response = requests.get("http://nytimes.com")
doc = BeautifulSoup(response.text, "html.parser")

# Snag all of the headlines (h3 tags with 'story-heading' class)
headlines = doc.find_all('h3', {'class': 'story-heading'})
# headlines = doc.select("h3.story-heading")


# Getting the headline text out using list comprehensions
# is a lot more fun but I guess you just learned those
# like a day ago, so we'll go ahead and use a for loop.
# But for the curious:
#   [headline.text.strip() for headline in headlines]

# Print the text of the headlines
for headline in headlines:
    print(headline.string)


        Iraqi Forces Enter Falluja, Encountering Little Fight From ISIS        

        An Expensive Law Degree, and No Place to Use It        

        Op-Ed Contributor: Why We Should Politicize the Orlando Massacre        

        Orlando Survivors Recall Night of Terror: ‘Then He Shoots Me Again’        

        The First Big Company to Say It’s Serving the Legal Marijuana Trade? Microsoft.        

        Christo’s Newest Project: Walking on Water        

        Spurred by Orlando Shooting, Senator Offers a Gun Control Compromise        

        Modern Love: A Path to Fatherhood, With (Shared) Morning Sickness        

        Review: In ‘Finding Dory,’ a Forgetful Fish and a Warm Celebration of Differences        

        Brooklyn’s Private Jewish Patrols Wield Power. Some Call Them Bullies.        

        Cavaliers 115, Warriors 101 | Series is tied, 3-3: LeBron James and Cavaliers Rout Warriors, Forcing Game 7        

        Is ‘Shrew’ Worth Taming? Female Director
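
The headlines above come out padded with whitespace. Here is a minimal sketch of the list-comprehension version mentioned in the comments, run against a small hand-written stand-in for the homepage so it works offline. The `html` string is made up for illustration and is not the real NYT markup.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real homepage (hypothetical markup)
html = """
<h3 class="story-heading">
        An Expensive Law Degree, and No Place to Use It
</h3>
<h3 class="story-heading"><a href="#">
        Christo's Newest Project: Walking on Water        </a></h3>
<h3 class="other">Not a headline</h3>
"""

sample_doc = BeautifulSoup(html, "html.parser")
sample_headlines = sample_doc.find_all("h3", {"class": "story-heading"})

# .text gathers the text inside each tag; .strip() trims the padding
titles = [headline.text.strip() for headline in sample_headlines]
print(titles)
```

Using `.text` instead of `.string` also avoids getting `None` back when a headline tag contains more than one child.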





# But... forms!

So the issue is that sometimes you need to submit *forms* on a web site. Why? Well, let's look at an example.

This example is going to come from [Dan Nguyen](https://twitter.com/dancow)'s incredible [Search, Script, Scrape](https://github.com/compjour/search-script-scrape), 101 scraping exercises.

> The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient
>
> Related URL: [http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm](http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm)

When you visit that URL, you're going to type in "Fentanyl," and select "Disc (Discontinued Drug Products)." Then you'll hit **search**.

Hooray, results! Now **look at the URL.**

> http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm

Does anything about that URL say "Fentanyl" or "Discontinued Drug Products"? Nope! And if you [straight up visit it](http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm) (might need to open an Incognito window) you'll end up being redirected back to a different page. 

This means **`requests.get` just isn't going to cut it.** If you tell `requests` to download that page it's going to get a whooole lot of uselessness.

Be my guest if you want to try it!
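
One way to see the difference without touching the network is to have `requests` build (but not send) the same search as a GET and as a POST. This is only a sketch: the field names `Generic_Name` and `table1` are the ones used later in this notebook, and whether the FDA form still accepts them is an assumption.

```python
import requests

url = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm'
fields = {'Generic_Name': 'Fentanyl', 'table1': 'OB_Disc'}

# Build the requests without sending them
as_get = requests.Request('GET', url, params=fields).prepare()
as_post = requests.Request('POST', url, data=fields).prepare()

# GET: the search terms are part of the URL
print(as_get.url)

# POST: the URL stays clean and the search terms travel in the body
print(as_post.url)
print(as_post.body)
```

That is why the results URL says nothing about Fentanyl: a form submitted with POST keeps its fields out of the address bar.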

In [None]:
# Post and get

In [None]:
url = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm'
parameters = {'q': 'cats with guitars'}
response = requests.post(url, data=parameters)

In [None]:
url = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm'
parameters = {'Generic_Name': 'Fentanyl', 'table1': 'OB_Disc'}
response = requests.post(url, data=parameters)
# Inspect the form with Developer Tools
# Spy on the request we sent to the server:
# Dev Tools, Network tab, run the search.
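
Besides Developer Tools, the names a form will send can be read straight out of its HTML: every `input` and `select` with a `name` attribute becomes a key in the POST data. A rough sketch against a made-up form follows. The markup, including the `OB_Rx` option, is hypothetical and much simpler than the real FDA page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified form markup for illustration
form_html = """
<form action="tempai.cfm" method="post">
  <input type="text" name="Generic_Name">
  <select name="table1">
    <option value="OB_Rx">Rx</option>
    <option value="OB_Disc">Disc</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

form = BeautifulSoup(form_html, 'html.parser').find('form')

# Collect every named field the browser would submit
field_names = [
    tag['name']
    for tag in form.find_all(['input', 'select'])
    if tag.has_attr('name')
]

print(form['method'], form['action'])
print(field_names)
```

The `method` tells you whether to use `requests.get` or `requests.post`, and the `action` tells you which URL receives the form.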

In [None]:
doc = BeautifulSoup(response.text, 'html.parser')

# links


In [None]:
# Just in case you didn't run it up there, I'll import again
import requests



In [None]:
# Using .select instead of .find is a little more
# readable to people from the web dev world, maybe?


It's magic, I swear!

# But then...

Sometimes `requests.get` just isn't enough. Why? It mostly has to do with JavaScript or complicated forms - when a site reacts and changes without loading a new page, you can't use `requests` for that (think "Load more" buttons on Instagram).

For those sites you need **Selenium!** Selenium = you put your browser on autopilot. As in, literally, it takes control over your browser. There are "headless" versions that use invisible browsers but if you don't like to install a bunch of stuff, the normal version is usually fine.
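
To make the problem concrete: `requests` hands back the HTML exactly as the server sent it and never runs the page's JavaScript, so anything a script would add later is simply missing. A small sketch with a made-up page (the markup and the `results` id are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page: the results list is filled in by JavaScript
page = """
<html><body>
  <ul id="results"></ul>
  <script>
    document.getElementById('results').innerHTML = '<li>Loaded later</li>';
  </script>
</body></html>
"""

js_doc = BeautifulSoup(page, 'html.parser')

# Beautiful Soup only parses; it does not execute the script,
# so the list is still empty
items = js_doc.find('ul', id='results').find_all('li')
print(len(items))
```

A real browser would run the script and show the list item, which is the gap Selenium fills.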

## Installing Selenium


Selenium isn't just a Python package, but you'll need to install the **Python bindings** so that Python can talk to Selenium.

````
pip install selenium
````

You'll also need the [Firefox browser](https://www.mozilla.org/en-US/firefox), since that's the browser we're going to be controlling.

Selenium is built on **WebDrivers**, which are libraries that let you... drive a browser. I believe it comes with a Firefox WebDriver, whereas Safari/Chrome/etc take a little more effort to set up.

In [9]:
!pip install selenium

Collecting selenium
  Downloading selenium-2.53.5-py2.py3-none-any.whl (884kB)
    100% |████████████████████████████████| 890kB 815kB/s
Installing collected packages: selenium
Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/usr/local/lib/python3.5/site-packages/pip/commands/install.py", line 317, in run
    prefix=options.prefix_path,
  File "/usr/local/lib/python3.5/site-packages/pip/req/req_set.py", line 742, in install
    **kwargs
  File "/usr/local/lib/python3.5/site-packages/pip/req/req_install.py", line 831, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/usr/local/lib/python3.5/site-packages/pip/req/req_install.py", line 1032, in move_wheel_files
    isolated=self.isolated,
  File "/usr/local/lib/python3.5/site-packages/pip/wheel.py", line 378, in move_wheel_files
    clobber(source, dest, Fals

## Using Selenium

In [18]:
# Imports, of course
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait


In [19]:
# Initialize a Firefox webdriver
driver = webdriver.Firefox()

# If this isn't working, download the Selenium Standalone Server
# (version 2.53.0) from http://www.seleniumhq.org/download/
# The Selenium Server is needed in order to run either Selenium RC style
# scripts or Remote Selenium WebDriver ones. The 2.x server is a drop-in
# replacement for the old Selenium RC server and is designed to be
# backwards compatible with your existing infrastructure.
# To use the Selenium Server in a Grid configuration, see the wiki page.

In [20]:
# Grab the web page
search_url = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm'
driver.get(search_url)

In [21]:
# Find the text input called Generic_Name,
# then fake typing "Fentanyl" into it.
# We use .find_element_by_name here because we know the name
name_input = driver.find_element_by_name('Generic_Name')
name_input.send_keys('Fentanyl')

In [34]:
search_url = 'https://app.hpla.doh.dc.gov/Weblookup/'
driver.get(search_url)

In [35]:
name_input = driver.find_element_by_name('t_web_lookup__first_name')
name_input.send_keys('Katherine')

In [36]:
# We used .find_element_by_name up there because we know the name

# Then we faked typing into it with .send_keys

In [37]:
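# Use selenium.webdriver.support.ui.Select, which we imported above,
# to grab the select element called t_web_lookup__license_type_name,
# then select Acupuncturists
license_type = driver.find_element_by_name("t_web_lookup__license_type_name")
Select(license_type).select_by_value("ACUPUNCTURIST")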


In [38]:
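# Now we can grab the search button and click it
# We use .find_element_by_id here because we know the id
search_button = driver.find_element_by_id("sch_button")
search_button.click()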

In [None]:
# Instead of using requests.get, we just look at .page_source of the driver


In [41]:
# We can feed that into Beautiful Soup
doc = BeautifulSoup(driver.page_source, 'html.parser')

In [42]:
# It's a tricky table, but this grabs the rows that don't have a class
#rows = doc.select("#datagrid_results tr")
rows = doc.find('table', id='datagrid_results').find_all('tr', attrs={'class': None})

doctors = []
for row in rows:
    #print(row.attrs)
    # Rows with 'style' as an attribute are header or footer rows, so skip them
    if 'style' in row.attrs:
        continue
    cells = row.find_all("td")
    doctor = {
        'name': cells[0].text,
        'number': cells[1].text,
        'profession': cells[2].text,
        'type': cells[3].text,
        'status': cells[4].text,
        'city': cells[5].text,
        'state': cells[6].text
    }
    doctors.append(doctor)
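
The same row-skipping idea can be tried offline. This sketch uses a cut-down, made-up version of the results table (only three columns, with invented `style` values) and pairs column names with cells using `zip` instead of indexing each cell by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down version of the results table
table_html = """
<table id="datagrid_results">
  <tr style="font-weight:bold"><td>Name</td><td>Number</td><td>Status</td></tr>
  <tr><td>KATHERINE A. SALE</td><td>AC30086</td><td>Expired</td></tr>
  <tr><td>KATHERINE F SEARS</td><td>AC30023</td><td>Active</td></tr>
  <tr style="text-align:right"><td colspan="3">Page 1</td></tr>
</table>
"""

sample_doc = BeautifulSoup(table_html, 'html.parser')
columns = ['name', 'number', 'status']

sample_doctors = []
for row in sample_doc.select('#datagrid_results tr'):
    # Header and footer rows carry a style attribute, so skip them
    if 'style' in row.attrs:
        continue
    values = [cell.text.strip() for cell in row.find_all('td')]
    # Pair each column name with the cell in the same position
    sample_doctors.append(dict(zip(columns, values)))

print(sample_doctors)
```

Keeping the column names in one list means adding or reordering a column is a one-line change.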


In [43]:
print(doctors)

[{'state': 'MD', 'name': 'KATHERINE A. SALE', 'status': 'Expired', 'type': 'ACUPUNCTURIST', 'profession': 'MEDICINE', 'city': 'ARNOLD', 'number': 'AC30086'}, {'state': 'NA', 'name': 'KATHERINE F SEARS', 'status': 'Active', 'type': 'ACUPUNCTURIST', 'profession': 'MEDICINE', 'city': 'Unknown', 'number': 'AC30023'}, {'state': 'NA', 'name': 'KATHERINE J KAPUSNIK', 'status': 'Expired', 'type': 'ACUPUNCTURIST', 'profession': 'MEDICINE', 'city': 'Unknown', 'number': 'AC500105'}, {'state': 'DC', 'name': 'KATHERINE S. YONKERS', 'status': 'Active', 'type': 'ACUPUNCTURIST', 'profession': 'MEDICINE', 'city': 'WASHINGTON', 'number': 'AC30057'}]


## Closing the webdriver

Once we have all the data we want, we can close our webdriver.

In [52]:
# Close the webdriver
driver.close()
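
`driver.close()` closes the current browser window, while `driver.quit()` shuts down the whole browser session. If a scrape raises an error halfway through, neither line runs and the browser is left open, so it can help to wrap the work in `try`/`finally`. A sketch, with `scrape_page` as a hypothetical helper name:

```python
def scrape_page(driver, url):
    """Load url and return its HTML, always shutting the browser down."""
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs whether or not the lines above raised an error
        driver.quit()
```

With a live driver this would be called as `scrape_page(webdriver.Firefox(), search_url)`.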

# Saving our data

Now what are we going to do with our list of dictionaries? We *could* use a `csv.DictWriter` like in [this post](http://stackoverflow.com/questions/3086973/how-do-i-convert-this-list-of-dictionaries-to-a-csv-file-python), but it's actually quicker to do it with `pandas`.
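
For comparison, here is roughly what the `csv.DictWriter` route looks like. It is a sketch that uses two of the scraped rows with only three of their columns, and it writes to an in-memory buffer rather than a real file so nothing lands on disk.

```python
import csv
import io

sample_doctors = [
    {'name': 'KATHERINE A. SALE', 'number': 'AC30086', 'state': 'MD'},
    {'name': 'KATHERINE F SEARS', 'number': 'AC30023', 'state': 'NA'},
]

# With a real file you would use
# open('scraped-doctor.csv', 'w', newline='') instead of StringIO
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'number', 'state'])
writer.writeheader()
writer.writerows(sample_doctors)

print(buffer.getvalue())
```

It works, but you have to list the field names and manage the file yourself, which is the bookkeeping `pandas` takes care of.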

### Step One: import pandas

In [44]:
import pandas as pd

### Step Two: Turn list into a DataFrame

In [46]:
doctor_df = pd.DataFrame(doctors)
doctor_df

         city                  name    number  profession  state   status           type
0      ARNOLD     KATHERINE A. SALE   AC30086    MEDICINE     MD  Expired  ACUPUNCTURIST
1     Unknown     KATHERINE F SEARS   AC30023    MEDICINE     NA   Active  ACUPUNCTURIST
2     Unknown  KATHERINE J KAPUSNIK  AC500105    MEDICINE     NA  Expired  ACUPUNCTURIST
3  WASHINGTON  KATHERINE S. YONKERS   AC30057    MEDICINE     DC   Active  ACUPUNCTURIST


### Step Three: Save it to a CSV

While you're saving it, set `index=False`, or else the file will include `0`, `1`, `2`, etc. from the leftmost column (the index, of course).

In [49]:
doctor_df.to_csv('scraped-doctor.csv', index=False)
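
To see what `index=False` changes, here is a small sketch on a two-row frame built from the scraped names. Calling `to_csv` without a filename returns the CSV text instead of writing a file, which makes the two versions easy to compare.

```python
import pandas as pd

sample_df = pd.DataFrame(
    [
        {'name': 'KATHERINE A. SALE', 'state': 'MD'},
        {'name': 'KATHERINE S. YONKERS', 'state': 'DC'},
    ],
    columns=['name', 'state'],
)

# With no filename, to_csv returns the CSV text
with_index = sample_df.to_csv()
without_index = sample_df.to_csv(index=False)

print(with_index)
print(without_index)
```

The first version starts every row with the index number and leaves the first header cell empty; the second contains only your columns.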

In [50]:
!cat scraped-doctor.csv

city,name,number,profession,state,status,type
ARNOLD,KATHERINE A. SALE,AC30086,MEDICINE,MD,Expired,ACUPUNCTURIST
Unknown,KATHERINE F SEARS,AC30023,MEDICINE,NA,Active,ACUPUNCTURIST
Unknown,KATHERINE J KAPUSNIK,AC500105,MEDICINE,NA,Expired,ACUPUNCTURIST
WASHINGTON,KATHERINE S. YONKERS,AC30057,MEDICINE,DC,Active,ACUPUNCTURIST
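
One thing to watch when reading this file back in: by default `pd.read_csv` treats the text `NA` as a missing value, so the `NA` states above would come back as `NaN`. Passing `keep_default_na=False` keeps them as plain text. A small sketch using an in-memory copy of two of the rows (trimmed to two columns):

```python
import io

import pandas as pd

csv_text = (
    "name,state\n"
    "KATHERINE A. SALE,MD\n"
    "KATHERINE F SEARS,NA\n"
)

# Default behavior: 'NA' is read as a missing value
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False leaves 'NA' as the literal string
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read['state'].isna().tolist())
print(literal_read['state'].tolist())
```

Whether `NA` here really means "not available" or is a code the site uses is something to check against the source before cleaning it up.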


In [51]:
pd.DataFrame(doctors).to_csv('one-line-doctors.csv', index=False)

### Step Four: Party down

I don't have directions for this one

In [None]:
#Logging in:

In [None]:
#xkcd automation