#### Scraping presidents

My objective is to create a dataframe with information about the presidents of the United States. To do this, I will go through these steps:

1. Scrape this [list of presidents of the United States](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States).

In [1]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd


# 2. find url and store it in a variable
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

# 3. download html with a get request
response = requests.get(url)
response.status_code # 200 status code means OK!

# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of presidents of the United States - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"7bcac551-2d57-45ef-96ec-0be7e6416a3e","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_presidents_of_the_United_States","wgTitle":"List of presidents of the United States","wgCurRevisionId":1056755129,"wgRevisionId":1056755129,"wgArticleId":19908980,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia semi-protected pages","Articles with short description","Short 

In [2]:
# This solution is not very elegant, but it works.
# The CSS selector we copied contains an "nth-child" index that we can iterate
# over to visit each table row. Rows without a president return an empty list,
# so we extend the list with each result instead of appending it:

presidents = []

# nth-child indices start at 1
for i in range(1, 95):
    presidents.extend(soup.select(f"tbody > tr:nth-child({i}) > td:nth-child(4) > b > a"))

# check the output:
presidents

[<a href="/wiki/George_Washington" title="George Washington">George Washington</a>,
 <a href="/wiki/John_Adams" title="John Adams">John Adams</a>,
 <a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>,
 <a href="/wiki/James_Madison" title="James Madison">James Madison</a>,
 <a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>,
 <a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>,
 <a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>,
 <a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>,
 <a href="/wiki/William_Henry_Harrison" title="William Henry Harrison">William Henry Harrison</a>,
 <a href="/wiki/John_Tyler" title="John Tyler">John Tyler</a>,
 <a href="/wiki/James_K._Polk" title="James K. Polk">James K. Polk</a>,
 <a href="/wiki/Zachary_Taylor" title="Zachary Taylor">Zachary Taylor</a>,
 <a href="/wiki/Millard_Fillmore" title="Millard Fillmore">Millard Fillmore</a>,
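
As an alternative to iterating over `nth-child` indices, a single selector that leaves the row index out should match the same links in one call. The sketch below runs on a tiny hand-written table (the rows, names and links are made up for illustration, not taken from the Wikipedia page), so the selector would still need checking against the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table: two rows where the 4th cell holds
# a bold link, and one note row without any president link.
sample_html = """
<table>
  <tbody>
    <tr><th>1</th><td>portrait</td><td>term</td><td><b><a href="/wiki/Person_A">Person A</a></b></td></tr>
    <tr><td colspan="4">a row with no president link</td></tr>
    <tr><th>2</th><td>portrait</td><td>term</td><td><b><a href="/wiki/Person_B">Person B</a></b></td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Dropping ":nth-child(i)" from the row part matches every row at once;
# rows without a matching cell simply contribute nothing.
sample_links = sample_soup.select("tbody > tr > td:nth-child(4) > b > a")
sample_hrefs = [a["href"] for a in sample_links]
print(sample_hrefs)
```

Because `select` already returns a flat list, no concatenation step is needed with this variant.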
 

2. Collect all the links to the Wikipedia page of each president.

In [3]:
# access the link through the "href" attribute
# of each element
presidents[0]["href"]

'/wiki/George_Washington'

In [4]:
url = "https://en.wikipedia.org" + presidents[0]["href"]
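
When a base URL ending in "/" is concatenated with an href starting with "/", the result contains a double slash in the path. A small sketch of one way to avoid thinking about this, using `urljoin` from the standard library (the variable names are only illustrative):

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/"
relative_link = "/wiki/George_Washington"  # the href value shown in the output above

# urljoin resolves the relative link against the base, so a trailing slash
# on the base and a leading slash on the link do not add up to "//".
full_url = urljoin(base_url, relative_link)
print(full_url)
```

The same call gives the same result whether or not the base ends with a slash, which makes it a little more forgiving than `+`.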

In [5]:
# build the full URL and send a new request
url = "https://en.wikipedia.org" + presidents[0]["href"]
response = requests.get(url)
response.status_code

# parse & store html
soup = BeautifulSoup(response.content, "html.parser")
soup.find("table", {"class":"infobox vcard"})

<table class="infobox vcard"><tbody><tr><th class="infobox-above" colspan="2" style="font-size: 100%;"><div class="fn" style="font-size:125%;">George Washington</div></th></tr><tr><td class="infobox-image" colspan="2"><a class="image" href="/wiki/File:Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg"><img alt="Head and shoulders portrait of George Washington" data-file-height="5615" data-file-width="4626" decoding="async" height="267" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/220px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/330px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/440px-Gilbert_Stuart_Williamstown_Portrait_of_Geor

At this step, we could store the whole Wikipedia page for each president, or only the small, final pieces of information we need. Storing the infoboxes is a middle ground: it avoids most of the noise while keeping the flexibility to decide later which specific elements to extract.

When sending multiple requests, it is important to be respectful by spacing the requests a few seconds from each other. 

I also print the status code of each response to check that everything is going well:

In [6]:
# To make it more "human", we can randomize the waiting time:
from time import sleep
from random import randint

for i in range(5):
    print(i)
    wait_time = randint(1,4)
    print("I will sleep for " + str(wait_time) + " seconds.")
    sleep(wait_time)

0
I will sleep for 3 seconds.
1
I will sleep for 1 seconds.
2
I will sleep for 4 seconds.
3
I will sleep for 4 seconds.
4
I will sleep for 4 seconds.
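
The same idea can be wrapped in a small helper so that every request loop pauses in the same way. This is only a sketch: `polite_pause` and its parameters are names made up here, and `random.uniform` is used instead of `randint` to allow fractional waits.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=4.0, sleep_fn=time.sleep):
    """Sleep for a random duration between min_seconds and max_seconds and return it."""
    wait_time = random.uniform(min_seconds, max_seconds)
    sleep_fn(wait_time)
    return wait_time


# Usage: pass a no-op sleep function to try the helper without actually waiting.
waits = [polite_pause(1, 4, sleep_fn=lambda seconds: None) for _ in range(5)]
print([round(w, 2) for w in waits])
```

Inside a scraping loop, calling `polite_pause()` with the default `sleep_fn` would replace the separate `randint` and `sleep` lines.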


In [7]:
# request each president's page and store its infobox

presi_soups = []

for presi in presidents:
    # send request
    url = "https://en.wikipedia.org" + presi["href"]
    response = requests.get(url)
    print(presi.get_text(), response.status_code)
    
    # parse & store html
    soup = BeautifulSoup(response.content, "html.parser")
    presi_soups.append(soup.find("table", {"class":"infobox vcard"}))
    
    # respectful nap:
    wait_time = randint(1,2)
    print("I will sleep for " + str(wait_time) + " second/s.")
    sleep(wait_time)

George Washington 200
I will sleep for 2 second/s.
John Adams 200
I will sleep for 1 second/s.
Thomas Jefferson 200
I will sleep for 2 second/s.
James Madison 200
I will sleep for 2 second/s.
James Monroe 200
I will sleep for 1 second/s.
John Quincy Adams 200
I will sleep for 2 second/s.
Andrew Jackson 200
I will sleep for 1 second/s.
Martin Van Buren 200
I will sleep for 1 second/s.
William Henry Harrison 200
I will sleep for 1 second/s.
John Tyler 200
I will sleep for 1 second/s.
James K. Polk 200
I will sleep for 1 second/s.
Zachary Taylor 200
I will sleep for 2 second/s.
Millard Fillmore 200
I will sleep for 2 second/s.
Franklin Pierce 200
I will sleep for 1 second/s.
James Buchanan 200
I will sleep for 2 second/s.
Abraham Lincoln 200
I will sleep for 1 second/s.
Andrew Johnson 200
I will sleep for 1 second/s.
Ulysses S. Grant 200
I will sleep for 2 second/s.
Rutherford B. Hayes 200
I will sleep for 1 second/s.
James A. Garfield 200
I will sleep for 2 second/s.
Chester A. Arthur 20

KeyboardInterrupt: 

4. Find and store information about each president.

With the infoboxes extracted, it's time to extract specific pieces of information from them. Let's test what we can get from a single president and then assemble a loop for all of them, as usual.

Here, I will use [the string argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) of the find function, since Wikipedia's tags and classes are not always helpful for locating elements. The string argument lets me locate elements by their actual content.

In [None]:
# Birthday
presi_soups[-1].find("span", {"class":"bday"}).get_text()

# Political party
presi_soups[-1].find("th", string="Political party").parent.find("a").get_text()

# Number of children (one <li> per child)
len(presi_soups[-1].find("th", string="Children").parent.find_all("li"))
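
Before looping over all presidents, the three lookups can be wrapped in a function that tolerates missing fields, since `find` returns `None` when nothing matches. The sketch below is one possible shape for it: `extract_president_info` is a hypothetical helper, and the infobox it runs on is a hand-written stand-in with invented values, not real Wikipedia data.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an infobox; the values are invented for illustration.
sample_infobox_html = """
<table class="infobox vcard">
  <tr><th>Born</th><td><span class="bday">1900-01-31</span></td></tr>
  <tr><th>Political party</th><td><a href="/wiki/Example_Party">Example Party</a></td></tr>
</table>
"""
sample_infobox = BeautifulSoup(sample_infobox_html, "html.parser").find("table")


def extract_president_info(infobox):
    """Return a dict with the three fields; fields that are not found stay None."""
    info = {"birthday": None, "party": None, "n_children": None}

    bday = infobox.find("span", {"class": "bday"})
    if bday is not None:
        info["birthday"] = bday.get_text()

    # string= matches a header by its text content
    party_header = infobox.find("th", string="Political party")
    if party_header is not None:
        party_link = party_header.parent.find("a")
        if party_link is not None:
            info["party"] = party_link.get_text()

    children_header = infobox.find("th", string="Children")
    if children_header is not None:
        info["n_children"] = len(children_header.parent.find_all("li"))

    return info


sample_record = extract_president_info(sample_infobox)
print(sample_record)
```

A list of such dictionaries, one per infobox, could then be passed to `pd.DataFrame` to build the final table. Note that the `string` argument only matches when the header text is exactly the given string, so infoboxes with different wording or extra markup in the header would come back with `None` for that field.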