# Scraping one page per row

Let's say we're interested in our members of Congress, because who isn't? Read in `congress.csv`.

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import numpy as np
import lxml
from bs4 import BeautifulSoup
import requests
import re

In [4]:
df = pd.read_csv("congress.csv")
df.head()

Unnamed: 0,name,slug
0,"Senator Abdnor, James",james-abdnor/A000009
1,"Representative Abercrombie, Neil",neil-abercrombie/A000014
2,"Senator Abourezk, James",james-abourezk/A000017
3,"Representative Abraham, Ralph Lee",ralph-abraham/A000374
4,"Senator Abraham, Spencer",spencer-abraham/A000355


# Let's scrape one

The `slug` is the part of the URL that's particular to that member of Congress. So `james-abdnor/A000009` really means `https://www.congress.gov/member/james-abdnor/A000009`.
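Here is a minimal sketch of that slug-to-URL step. The helper name `member_url` and the `BASE_URL` constant are hypothetical names introduced for illustration; only the base URL itself comes from the text above.

```python
# Base URL taken from the example above; the constant name is illustrative.
BASE_URL = "https://www.congress.gov/member/"

def member_url(slug):
    """Build a member page URL from a slug such as 'james-abdnor/A000009'."""
    # Strip stray slashes so '/james-abdnor/A000009' and
    # 'james-abdnor/A000009' both produce the same URL.
    return BASE_URL + slug.strip("/")

print(member_url("james-abdnor/A000009"))
```

Stripping the slashes is a defensive choice, not something the data requires: the slugs in `congress.csv` shown above have no leading slash.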

Scrape his name, birthday, party, whether he's currently in Congress, and his bill count (don't worry if the bill count is dirty, you can clean it up later).
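Several of those fields tend to arrive in one block of header text, so it can help to try the parsing on a plain string before involving the browser. The sketch below assumes the header looks like `Title Name (birth - death)` followed by an `In Congress start - end` line. The sample text and the function name `parse_header` are made up for illustration.

```python
import re

# Hypothetical header text; the person and the years are invented.
sample = "Senator Jane Doe (1950 - )\nIn Congress 2001 - Present"

def parse_header(text):
    """Pull name, birth year and current status out of one header string."""
    name = re.search(r"^(.*?) \(", text)        # everything before " ("
    birth = re.search(r"\((\d{4}) -", text)     # four digits right after "("
    current = re.search(r"In Congress \d{4} - Present", text)
    return {
        "name": name.group(1) if name else None,
        "birth_year": birth.group(1) if birth else None,
        "currently_in_congress": "Yes" if current else "No",
    }

print(parse_header(sample))
```

Each lookup falls back to `None` (or `"No"`) when the pattern is missing, so one oddly formatted page does not raise an error.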

In [14]:
driver = webdriver.Chrome()

In [100]:
# for row in df.head(2).iterrows():
#     driver.get(f"https://www.congress.gov/member/{row[1]['slug']}")
#     print(driver.find_element_by_class_name('legDetail').text.strip())
#     print(driver.find_element_by_tag_name('tr').text.strip())
#     print(driver.find_element_by_class_name('results-number').text.strip())
    
    
    

# Build a function

Write a function called `scrape_page` that makes a URL out of the `slug`, written so that it can be passed to `.apply`.

In [99]:
# def scrape_page(row):
#     driver.get(f"https://www.congress.gov/member/{row['slug']}")
#     data = {}
#     data['name'] = driver.find_element_by_tag_name("h1").text.strip()
#     data['party'] = driver.find_element_by_tag_name("tr").text.strip()
#     data['bills'] = driver.find_element_by_class_name("results-number").text.strip()
#     print(data)


In [44]:
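from selenium.webdriver.common.by import By

def scrape_page(row):
    # Create the dict first so a failed page still returns an (empty) row
    data = {}
    try:
        # The slug already contains the member's name, e.g. james-abdnor/A000009
        driver.get(f"https://www.congress.gov/member/{row['slug']}")
        data['name'] = driver.find_element(By.TAG_NAME, "h1").text.strip()
        data['party'] = driver.find_element(By.TAG_NAME, "tr").text.strip()
        data['bills'] = driver.find_element(By.CLASS_NAME, "results-number").text.strip()
    except Exception:
        # Leave the remaining fields empty if the page or an element is missing
        pass
    return pd.Series(data)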

In [46]:
scraped_df = df.apply(scrape_page, axis=1)



In [47]:
scraped_df.head()

Unnamed: 0,bills,name,party
0,"1-100 of 1,949",Senator James Abdnor (1923 - 2012)\nIn Congres...,Party Republican
1,"1-100 of 4,472",Representative Neil Abercrombie (1938 - )\nIn ...,Party Democratic
2,1-100 of 875,Senator James Abourezk (1931 - )\nIn Congress ...,Party Democratic
3,1-100 of 736,Representative Ralph Lee Abraham (1954 - )\nIn...,Website https://abraham.house.gov/
4,"1-100 of 1,227",Senator Spencer Abraham (1952 - )\nIn Congress...,Party Republican


In [56]:
scraped_df.bills = scraped_df.bills.str.replace(",", "")

In [59]:
scraped_df.bills = scraped_df.bills.str.extract(r"1-100 of (\d.+)", expand=False)

In [62]:
scraped_df['name_new'] = scraped_df.name.str.extract(r"(\w.*) \(", expand=False)

In [64]:
scraped_df['birth_year'] = scraped_df.name.str.extract(r"\w.* \((\d.+?) ", expand=False)

In [67]:
scraped_df['tenure'] = scraped_df.name.str.extract(r"(In Congress \w.*)", expand=False)

In [73]:
# .copy() makes this an independent dataframe, so adding columns later is safe
scraped_df = scraped_df.dropna(subset=['tenure']).copy()

In [83]:
scraped_df['currently_in_congress'] = scraped_df.name.str.extract(r"In Congress \d.+? - (Present)", expand=False)



In [86]:
scraped_df['currently_in_congress'] = scraped_df['currently_in_congress'].fillna("No")



In [88]:
scraped_df['currently_in_congress'] = scraped_df['currently_in_congress'].str.replace("Present", "Yes")



In [91]:
scraped_df['party'] = scraped_df['party'].str.extract(r"Party (\w.+)", expand=False)



In [97]:
df_new = df.merge(scraped_df, how='outer', left_index=True, right_index=True, suffixes=('', '_scraped'))

In [98]:
df_new.to_csv("congress-plus-scraped.csv", index=False)

# Do the scraping

Rewrite `scrape_page` to actually scrape the URL. You can use your scraping code from up above. Start by testing with just one row (I put a sample call below) and then expand to your whole dataframe.

Save the results as `scraped_df`.

* **Hint:** Be sure to use `return`!
* **Hint:** Make sure you return a `pd.Series`

In [None]:
# Test with this
scrape_page({'slug': 'neil-abercrombie/A000014'})
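The second hint matters because `.apply(..., axis=1)` treats the two return types differently. A function that returns a `pd.Series` has its keys expanded into columns, while one that returns a plain dict gives you a single Series of dicts. The sketch below shows this without touching the network. The slugs and both `fake_scrape_*` functions are made up for illustration.

```python
import pandas as pd

# Made-up slugs standing in for the real dataframe.
df = pd.DataFrame({"slug": ["a-person/A000001", "b-person/B000002"]})

def fake_scrape_series(row):
    # Returning a Series: each key becomes its own column.
    return pd.Series({"name": row["slug"].upper(), "bills": "0"})

def fake_scrape_dict(row):
    # Returning a dict: the whole dict is stored as a single value.
    return {"name": row["slug"].upper(), "bills": "0"}

as_columns = df.apply(fake_scrape_series, axis=1)
as_dicts = df.apply(fake_scrape_dict, axis=1)

print(type(as_columns).__name__)  # DataFrame
print(type(as_dicts).__name__)    # Series
```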

## Join with your original dataframe

Join your new data with your original data, adding the `_scraped` suffix on the new columns. You can use either `.join` or `.merge`, but be sure to read the docs to know the difference!
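As a rough illustration of the two options, here is a sketch using two tiny made-up dataframes that share a `name` column. Note that in both methods the suffix is applied only to column names that appear in both dataframes. Columns unique to one side keep their names.

```python
import pandas as pd

# Made-up stand-ins for the original and scraped dataframes.
original = pd.DataFrame({"name": ["A", "B"], "slug": ["a/A1", "b/B2"]})
scraped = pd.DataFrame({"name": ["Sen. A", "Rep. B"], "party": ["X", "Y"]})

# .join lines rows up by index; rsuffix renames clashing right-hand columns.
joined = original.join(scraped, rsuffix="_scraped")

# .merge needs to be told to use the index; suffixes covers (left, right).
merged = original.merge(
    scraped, left_index=True, right_index=True, suffixes=("", "_scraped")
)

print(list(joined.columns))
print(list(merged.columns))
```

Here both calls produce the same columns. `.join` is the shorter spelling when you are matching on the index, while `.merge` is the more general tool when you need to match on one or more columns instead.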

## Save it

Save your combined results to `congress-plus-scraped.csv`.