# Data Engineering with Beautiful Soup

(created along with Nelson Santos for cs109)

Data Engineering, the process of gathering and preparing data for analysis, is a very big part of Data Science.

Datasets might not be formatted in the way you need (e.g. you have categorical features but your algorithm requires numerical features); or you might need to cross-reference some dataset to another that has a different format; or you might be dealing with a dataset that contains missing or invalid data.

These are just a few examples of why data retrieval and cleaning are so important.

## Retrieving data from the web

### requests

You might need to retrieve some data from the Internet. Python's standard library has long shipped modules for exactly that (`urllib` and, in Python 2, `urllib2`), and the third-party `urllib3` package covers similar ground.

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.

Luckily, as with most tasks in Python, someone has developed a library that simplifies them. Get acquainted with `requests` as soon as possible, since you will probably need it in the future.
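To make the contrast concrete, here is a minimal sketch of what a POST request with HTTP Basic authentication involves when only the standard library is used. The endpoint, form field, and credentials are hypothetical placeholders, and nothing is actually sent over the network.

```python
import base64
import urllib.parse
import urllib.request

# Hypothetical endpoint and credentials, used only for illustration.
url = "https://example.com/login"
username, password = "user", "secret"

# 1. Form fields must be URL-encoded and converted to bytes by hand.
body = urllib.parse.urlencode({"q": "harvard"}).encode("utf-8")

# 2. HTTP Basic authentication: the header value is built manually.
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")

# 3. Assemble the request object (nothing is sent yet).
request = urllib.request.Request(
    url,
    data=body,
    headers={"Authorization": f"Basic {token}"},
    method="POST",
)

print(request.get_method())  # POST
print(request.data)

# Sending it would be: urllib.request.urlopen(request)
# With requests, the same call fits on a single line:
# requests.post(url, data={"q": "harvard"}, auth=(username, password))
```

The encoding and header steps are exactly the kind of bookkeeping that `requests` handles for you.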

In [1]:
import requests

Now that the `requests` library has been imported into our namespace, we can use the functions it offers.

In this case we'll use the appropriately named `get` function to issue a *GET* request. This is equivalent to typing a URL into your browser and hitting enter.

In [2]:
# Get the Harvard University Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

Python is an object-oriented language, and everything in it is an object. Even built-in functions such as `len` are objects themselves, and a call like `len(x)` is essentially shorthand for calling the `__len__()` method of `x`.
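As a small illustration of a built-in delegating to a method on the object, the sketch below defines a made-up `Playlist` class (not part of any library) and compares `len(...)` with a direct `__len__()` call.

```python
class Playlist:
    """A tiny, hypothetical container that defines its own length."""

    def __init__(self, songs):
        self.songs = list(songs)

    def __len__(self):
        # len(playlist) ends up calling this method.
        return len(self.songs)


playlist = Playlist(["a", "b", "c"])

builtin_result = len(playlist)        # the usual spelling
method_result = playlist.__len__()    # what happens underneath

print(builtin_result, method_result)

# Functions are objects too: they have a type and attributes.
print(type(len), len.__name__)
```

Both calls return the same number, because `len` simply asks the object for its length.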

We will not dwell too long on OO concepts, but some of Python's idiosyncrasies will be easier to understand if we spend a few minutes on this subject.

When you evaluate an object by itself in a notebook cell, such as the `req` object we created above, the notebook displays the result of the object's `__repr__()` method (`print` uses `__str__()` instead, which falls back to `__repr__()` when it is not defined). The default implementations are usually very plain. The `req` object, however, has a custom implementation that shows the object type (i.e. `Response`) and the HTTP status code (200 means the request was successful).

In [3]:
req

<Response [200]>
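The same mechanism can be sketched with a made-up class. `Reply` below is a hypothetical stand-in (it is not the `requests` class) that customizes `__repr__()`, while `Plain` keeps Python's default.

```python
class Reply:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, status_code):
        self.status_code = status_code

    def __repr__(self):
        # Custom representation: type name plus status code.
        return f"<Reply [{self.status_code}]>"


class Plain:
    """No custom __repr__, so Python falls back to the default one."""


reply_text = repr(Reply(200))
plain_text = repr(Plain())

print(reply_text)   # <Reply [200]>
print(plain_text)   # default form: class name and a memory address

# str() falls back to __repr__ when __str__ is not defined.
print(str(Reply(404)))
```

The default representation only tells you the class and where the object lives in memory, which is why libraries often override it.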

Just to confirm, we will call the `type` function on the object to make sure it agrees with the value above.

In [4]:
type(req)

requests.models.Response

Right now `req` holds a reference to a *Response* object, but we are interested in the text of the web page, not the object itself.

So the next step is to assign the value of the `text` property of this `Response` object to a variable.

In [5]:
page = req.text
page[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Harvard University - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":799964943,"wgRevisionId":799964943,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from September 2014","All articles containing potentially dated statements","All articles with dead external links","Articles with dead external links from July 2017","Articles with permanently dead external links","CS1 maint: Extra text: editors list",

In [6]:
from IPython.display import HTML
# Render the downloaded HTML inline in the notebook
HTML(page)

| Field | Value |
| --- | --- |
| Latin | Universitas Harvardiana |
| Former names | Harvard College |
| Motto | Veritas[1] |
| Motto in English | Truth |
| Type | Private research |
| Established | 1636[2] |
| Endowment | $34.541 billion (2016)[3] |
| President | Drew Gilpin Faust |
| Academic staff | 4,671[4] |

| College/school | Year founded |
| --- | --- |
| Harvard College | 1636 |
| Medicine | 1782 |
| Divinity | 1816 |
| Law | 1817 |
| Dental Medicine | 1867 |
| Arts and Sciences | 1872 |
| Business | 1908 |
| Extension | 1910 |
| Design | 1914 |

University rankings

| Scope | Ranking | Position |
| --- | --- | --- |
| National | ARWU[111] | 1 |
| National | Forbes[112] | 4 |
| National | U.S. News & World Report[113] | 2 |
| National | Washington Monthly[114] | 2 |
| Global | ARWU[115] | 1 |
| Global | QS[116] | 3 |
| Global | Times[117] | 6 |
| Global | U.S. News & World Report[118] | 1 |

National Program Rankings[119]

| Program | Ranking |
| --- | --- |
| Biological Sciences | 1 |
| Business | 1 |
| Chemistry | 4 |
| Clinical Psychology | 16 |
| Computer Science | 18 |
| Earth Sciences | 8 |
| Economics | 1 |
| Education | 1 |
| Engineering | 23 |
| English | 8 |
| History | 4 |
| Law | 3 |
| Mathematics | 3 |
| Medicine: Primary Care | 16 |
| Medicine: Research | 1 |
| Physics | 2 |
| Political Science | 1 |
| Psychology | 3 |
| Public Affairs | 3 |
| Public Health | 2 |
| Sociology | 1 |
| Statistics | 4 |

Global Program Rankings[120]

| Program | Ranking |
| --- | --- |
| Agricultural Sciences | 13 |
| Arts & Humanities | 2 |
| Biology & Biochemistry | 1 |
| Chemistry | 9 |
| Clinical Medicine | 1 |
| Computer Science | 6 |
| Economics & Business | 1 |
| Engineering | 34 |
| Environment/Ecology | 2 |
| Geosciences | 7 |
| Immunology | 1 |
| Materials Science | 5 |
| Mathematics | 10 |
| Microbiology | 1 |
| Molecular Biology & Genetics | 1 |
| Neuroscience & Behavior | 1 |
| Pharmacology & Toxicology | 1 |
| Physics | 3 |
| Plant & Animal Science | 5 |
| Psychiatry/Psychology | 1 |
| Social Sciences & Public Health | 1 |
| Space Science | 2 |

| | Undergraduate | Graduate and professional | U.S. census |
| --- | --- | --- | --- |
| Asian/Pacific Islander | 17% | 11% | 5% |
| Black/non-Hispanic | 6% | 4% | 12% |
| Hispanics of any race | 9% | 5% | 16% |
| White/non-Hispanic | 46% | 43% | 64% |
| Mixed race/other | 10% | 8% | 9% |
| International students | 11% | 27% | |


v t e Eastern Intercollegiate Volleyball Association,v t e Eastern Intercollegiate Volleyball Association.1
Current members,Charleston Golden Eagles George Mason Patriots Harvard Crimson NJIT Highlanders Penn State Nittany Lions Princeton Tigers Sacred Heart Pioneers Saint Francis Red Flash
Former members,Concordia College East Stroudsburg University Juniata College New York University University of New Haven Queens College Rutgers–Newark Springfield College Vassar College SUNY New Paltz

v t e  Sports teams based in Massachusetts,v t e  Sports teams based in Massachusetts.1
Australian rules football,USAFL Boston Demons
Baseball,MLB Boston Red Sox NYPL Lowell Spinners CCBL Bourne Braves Brewster Whitecaps Chatham Anglers Cotuit Kettleers Falmouth Commodores Harwich Mariners Hyannis Harbor Hawks Orleans Firebirds Wareham Gatemen Yarmouth–Dennis Red Sox FCBL Brockton Rox Martha's Vineyard Sharks North Shore Navigators Pittsfield Suns Wachusett Dirt Dawgs Worcester Bravehearts NECBL New Bedford Bay Sox North Adams SteepleCats Plymouth Pilgrims Valley Blue Sox
Basketball,NBA Boston Celtics
Football,NFL New England Patriots WFA Boston Renegades
Hockey,NHL Boston Bruins AHL Springfield Thunderbirds ECHL Worcester Railers NWHL Boston Pride CWHL Boston Blades
Lacrosse,MLL Boston Cannons UWLX Boston Storm
Roller derby,WFTDA Bay State Brawlers Roller Derby Boston Roller Derby MRDA Pioneer Valley Roller Derby
Rugby league,USARL Boston Thirteens Oneida FC
Rugby union,RSL Boston RFC NERFU Boston Irish Wolfhounds Mystic River South Shore Anchors
Soccer,MLS New England Revolution NWSL Boston Breakers PDL FC Boston Western Mass Pioneers NPSL Boston City FC Greater Lowell NPSL FC Champions Soccer League USA Greater Lowell United FC UWS New England Mutiny WPSL Boston Breakers Academy Boston Breakers Reserves Boston Breakers U23 FC Stars FC Stars U23

0,1
Authority control,WorldCat Identities VIAF: 128987800 LCCN: n78096930 ISNI: 0000 0001 2109 5844 GND: 2012974-9 SUDOC: 026453169 BNF: cb118698578 (data) ULAN: 500312819 NLA: 35176433 NKC: kn20020322375


Great! Now we have the text of the HU Wikipedia page. But this mess of HTML tags would be a pain to parse manually, which is why we will use another very cool Python library called BeautifulSoup.

### BeautifulSoup

Parsing data would be a breeze if we could always use well-formatted data sources, such as CSV, JSON, or XML; but some formats, such as HTML, are at the same time very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as unclosed tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.
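
To make the "forgiving parser" idea concrete, here is a minimal sketch. The snippet `broken_html` is a made-up example (it is not taken from the Wikipedia page) in which neither tag is ever closed; BeautifulSoup still builds a complete tree from it.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed input: <p> and <b> are opened but never closed.
broken_html = "<p>Hello <b>world"

repaired = BeautifulSoup(broken_html, "html.parser")

# The text inside the unclosed <b> tag is still reachable.
print(repaired.b.get_text())  # world

# Serializing the tree shows that the missing closing tags were added.
print(str(repaired))
```

The exact repair strategy can differ between parsers (`html.parser`, `lxml`, `html5lib`), so treat the serialized output as parser-dependent.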

You'll notice that the `import` statement below is different from what we used for `requests`. The _from library import thing_ pattern is useful when you don't want to reference a function by its full name (like we did with `requests.get`), but you also don't want to import every single thing in that library into your namespace.

In [7]:
from bs4 import BeautifulSoup

BeautifulSoup can deal with HTML or XML data, so the next line parses the contents of the `page` variable using its HTML parser and assigns the result to the `soup` variable.

In [8]:
soup = BeautifulSoup(page, 'html.parser')

In [9]:
type(soup)

bs4.BeautifulSoup

Doesn't look much different from the `page` object representation. Let's make sure the two are different types.

In [10]:
type(page)

str

Looks like they are indeed different.

`BeautifulSoup` objects have a cool little method that allows you to see the HTML content in a nice, indented way.

In [11]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Harvard University - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":799964943,"wgRevisionId":799964943,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from September 2014","All articles containing potentially dated statements","All articles with dead external links","Articles with dead external links from July 2017","Articles with permanently dead external links","CS1 maint: Extra te

Looks like it's our page!

We can now reference elements of the HTML document in different ways. One very convenient way is by using the dot notation, which allows us to access the elements as if they were properties of the object.

In [12]:
soup.title

<title>Harvard University - Wikipedia</title>

This is nice for HTML elements that only appear once per page, such as the `title` tag. But what about elements that can appear multiple times?

In [13]:
# Be careful with elements that show up multiple times.
soup.p

<p><b>Harvard University</b> is a private <a href="/wiki/Ivy_League" title="Ivy League">Ivy League</a> <a href="/wiki/Research_university" title="Research university">research university</a> in <a href="/wiki/Cambridge,_Massachusetts" title="Cambridge, Massachusetts">Cambridge, Massachusetts</a>, established in 1636, whose history, influence, and wealth have made it one of the world's most prestigious universities.<sup class="reference" id="cite_ref-7"><a href="#cite_note-7">[7]</a></sup></p>

Uh oh. `soup.p` returned only the first `p` element, even though the page contains many of them. The attribute syntax in BeautifulSoup is syntactic sugar for finding the first matching tag, so it silently ignores the rest. That's why it is safer to use the explicit methods behind that syntactic sugar: `BeautifulSoup.find` for getting a single element, and `BeautifulSoup.find_all` for retrieving multiple elements.

In [14]:
len(soup.find_all("p"))

75
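
The relationship between the dot notation, `find`, and `find_all` can be checked on a tiny document. The HTML string below is a hypothetical stand-in for the real page, chosen so that the results are easy to read.

```python
from bs4 import BeautifulSoup

# Hypothetical document with three paragraphs.
html = "<div><p>first</p><p>second</p><p>third</p></div>"
doc = BeautifulSoup(html, "html.parser")

first_by_dot = doc.p            # attribute (dot) syntax
first_by_find = doc.find("p")   # explicit single-element search
all_paragraphs = doc.find_all("p")

# Both single-element forms return the very same tag object.
print(first_by_dot is first_by_find)            # True
print([p.get_text() for p in all_paragraphs])   # ['first', 'second', 'third']

# A tag that does not exist yields None rather than an error.
print(doc.find("table"))                        # None
```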

---

If you look at the Wikipedia page in your browser, you'll notice that it has a couple of tables in it. We will be working with the "Demographics" table, but first we need to find it.

One of the HTML attributes that will be very useful to us is the "class" attribute.

Getting the class of a single element is easy...

In [15]:
soup.table["class"]

['infobox', 'vcard']

Next we will use a list comprehension to see all the tables that have a "class" attribute. 

In [16]:
# The classes of all tables that have a class attribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]

[['infobox', 'vcard'],
 ['toccolours'],
 ['plainlinks', 'metadata', 'ambox', 'mbox-small-left', 'ambox-content'],
 ['multicol'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed'],
 ['wikitable'],
 ['nowraplinks', 'collapsible', 'collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'collapsible', 'collapsed', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'hlist', 'collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'collapsible', 'autocollapse', 'navbox-inner'],
 [
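
The `if t.get("class")` guard in the comprehension matters because indexing a tag with a missing attribute raises an error. A small sketch, using a made-up two-table snippet rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one table with classes, one without.
html = '<table class="infobox vcard"></table><table></table>'
doc = BeautifulSoup(html, "html.parser")
with_class, without_class = doc.find_all("table")

# "class" is multi-valued, so it comes back as a list of names.
print(with_class["class"])           # ['infobox', 'vcard']

# .get() returns None for a missing attribute...
print(without_class.get("class"))    # None

# ...whereas square-bracket access raises KeyError.
try:
    without_class["class"]
    raised = False
except KeyError:
    raised = True
print(raised)                        # True
```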

As mentioned, we will be using the Demographics table. Three tables on the page carry the class `wikitable`, which is why `find_all` below returns 3 results. The Demographics table is the only one whose class list is exactly `wikitable`; the other two have additional classes on them.

In [17]:
tables_wikitable = soup.find_all("table", "wikitable")

In [18]:
len(tables_wikitable)

3

Below we use a **matching** lambda function to find the table with just the class `wikitable`. Note that we compare against a list containing only `wikitable`, which ensures it's the only class.

In [19]:
dfinder = lambda tag: tag.name=='table' and tag.get('class') == ['wikitable']
table_demographics = soup.find_all(dfinder)
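
The difference between "has the class" and "has only that class" is easy to see on a reduced example. The four tables below are hypothetical and only mimic the class combinations found on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical tables mimicking the class combinations on the page.
html = """
<table class="wikitable sortable"></table>
<table class="wikitable"></table>
<table class="navbox wikitable"></table>
<table class="infobox"></table>
"""
doc = BeautifulSoup(html, "html.parser")

# Passing a class name matches any table whose class list contains it.
contains_class = doc.find_all("table", "wikitable")

# A matching function can demand that the class list is exactly ['wikitable'].
def only_wikitable(tag):
    return tag.name == "table" and tag.get("class") == ["wikitable"]

exact_class = doc.find_all(only_wikitable)

print(len(contains_class), len(exact_class))  # 3 1
```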

By contrast, a simple `find` would give us just the first match. The call below would be a great way to do things if we were guaranteed uniqueness; since we are not, we use the full power of passing in a matching function.

In [20]:
soup.find("table", "wikitable")

<table class="wikitable sortable collapsible collapsed" style="float:right">
<tr>
<th colspan="4" style="background-color:#A51C30;color:white;-moz-box-shadow: inset 2px 2px 0 #1E1E1E, inset -2px -2px 0 #1E1E1E; -webkit-box-shadow: inset 2px 2px 0 #1E1E1E, inset -2px -2px 0 #1E1E1E; box-shadow: inset 2px 2px 0 #1E1E1E, inset -2px -2px 0 #1E1E1E;">National Program Rankings<sup class="reference" id="cite_ref-USNWR_Grad_School_Rankings_119-0"><a href="#cite_note-USNWR_Grad_School_Rankings-119">[119]</a></sup></th>
</tr>
<tr>
<th>Program</th>
<th>Ranking</th>
</tr>
<tr>
<td>Biological Sciences</td>
<td>1</td>
</tr>
<tr>
<td>Business</td>
<td>1</td>
</tr>
<tr>
<td>Chemistry</td>
<td>4</td>
</tr>
<tr>
<td>Clinical Psychology</td>
<td>16</td>
</tr>
<tr>
<td>Computer Science</td>
<td>18</td>
</tr>
<tr>
<td>Earth Sciences</td>
<td>8</td>
</tr>
<tr>
<td>Economics</td>
<td>1</td>
</tr>
<tr>
<td>Education</td>
<td>1</td>
</tr>
<tr>
<td>Engineering</td>
<td>23</td>
</tr>
<tr>
<td>English</td>
<td>8<

Since we used `find_all` we get back a list:

In [21]:
HTML(str(table_demographics[0]))

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17%,11%,5%
Black/non-Hispanic,6%,4%,12%
Hispanics of any race,9%,5%,16%
White/non-Hispanic,46%,43%,64%
Mixed race/other,10%,8%,9%
International students,11%,27%,


First we'll use a list comprehension to extract the row (*tr*) elements.

In [22]:
rows = [row for row in table_demographics[0].find_all("tr")]
rows

[<tr>
 <th></th>
 <th>Undergraduate</th>
 <th>Graduate<br/>
 and professional</th>
 <th>U.S. census</th>
 </tr>, <tr>
 <th>Asian/Pacific Islander</th>
 <td>17%</td>
 <td>11%</td>
 <td>5%</td>
 </tr>, <tr>
 <th>Black/non-Hispanic</th>
 <td>6%</td>
 <td>4%</td>
 <td>12%</td>
 </tr>, <tr>
 <th>Hispanics of any race</th>
 <td>9%</td>
 <td>5%</td>
 <td>16%</td>
 </tr>, <tr>
 <th>White/non-Hispanic</th>
 <td>46%</td>
 <td>43%</td>
 <td>64%</td>
 </tr>, <tr>
 <th>Mixed race/other</th>
 <td>10%</td>
 <td>8%</td>
 <td>9%</td>
 </tr>, <tr>
 <th>International students</th>
 <td>11%</td>
 <td>27%</td>
 <td>N/A</td>
 </tr>]

In [23]:
header_row = rows[0]
header_row

<tr>
<th></th>
<th>Undergraduate</th>
<th>Graduate<br/>
and professional</th>
<th>U.S. census</th>
</tr>

### Splitting the data

Next we extract the text value of the columns. If you look at the table above, you'll see that we have three columns and six rows.

Here we're taking the first element (Python indexes start at zero), iterating over the *th* elements inside it, and taking the text value of those elements. We should end up with a list of column names.

But there is one little caveat: the first column of the table is actually an empty string (look at the cell right above the row names). We could add it to our list and then remove it afterwards; but instead we will use the `if` statement inside the list comprehension to filter that out.

Here the `get_text` will return an empty string for the first cell of the table, which means that the test will fail and the value will not be added to the list.

In [24]:
#the if col.get_text() takes care of no-text in the upper left
columns = [col.get_text() for col in header_row.find_all("th") if col.get_text()]
columns

['Undergraduate', 'Graduate\nand professional', 'U.S. census']

In [25]:
# Lambda expressions return the value of the expression inside it.
# In this case, it will return a string with new line characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")

In [26]:
columns = [rem_nl(c) for c in columns]
columns

['Undergraduate', 'Graduate and professional', 'U.S. census']
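
As an alternative sketch (not the approach used in this notebook), `get_text` accepts a separator and a `strip` flag, which can handle the line break inside the header cell in one step. The cell below is rebuilt by hand from the header row shown earlier.

```python
from bs4 import BeautifulSoup

# Header cell copied from the table above, including the <br/> and newline.
cell = BeautifulSoup("<th>Graduate<br/>\nand professional</th>", "html.parser").th

default_text = cell.get_text()
joined_text = cell.get_text(" ", strip=True)

print(repr(default_text))  # 'Graduate\nand professional'
print(repr(joined_text))   # 'Graduate and professional'
```

With `strip=True`, each text fragment is stripped of surrounding whitespace before the fragments are joined with the separator.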

Now let's do the same for the rows. Notice that since we have already parsed the header row, we will continue from the second row. The `[1:]` is a slice notation and in this case it means we want all values starting from the second position.

In [27]:
indexes = [row.find("th").get_text() for row in rows[1:]]
indexes

['Asian/Pacific Islander',
 'Black/non-Hispanic',
 'Hispanics of any race',
 'White/non-Hispanic',
 'Mixed race/other',
 'International students']

We need to convert the strings in the "data" cells to integers. We start by checking whether the last character of the string (Python allows negative indexes) is a percent sign. If it is, we convert the characters before the sign to an integer. Otherwise, we return `None`.

In [28]:
def to_num(s):
    if s[-1] == "%":
        return int(s[:-1])
    else:
        return None

In [29]:
values = []
for row in rows[1:]:
    for value in row.find_all("td"):
        values.append(to_num(value.get_text()))
values

[17, 11, 5, 6, 4, 12, 9, 5, 16, 46, 43, 64, 10, 8, 9, 11, 27, None]
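
One caveat: indexing with `s[-1]` raises an `IndexError` on an empty string. The table here has no empty cells, so the function above is fine for this data. For messier tables, a more defensive variant could look like the sketch below; `to_num_safe` is a hypothetical name introduced only for illustration.

```python
def to_num_safe(s):
    """Hypothetical variant of to_num that tolerates empty or padded strings."""
    s = s.strip()
    # endswith() is safe on empty strings, unlike s[-1].
    if s.endswith("%") and s[:-1].isdigit():
        return int(s[:-1])
    return None

samples = ["17%", " 5% ", "N/A", "", "%"]
converted = [to_num_safe(s) for s in samples]
print(converted)  # [17, 5, None, None, None]
```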

The problem with the list above is that the values lost their grouping.

The `zip` function is used to combine two sequences element-wise. It returns an iterator, so `list(zip([1,2,3], [4,5,6]))` would return `[(1, 4), (2, 5), (3, 6)]`.

Here we create 3 lists corresponding to the 3 columns by taking every third value, starting at offsets 0, 1, and 2.

In [30]:
stacked_values_lists = [values[i::3] for i in range(len(columns))]
stacked_values_lists

[[17, 6, 9, 46, 10, 11], [11, 4, 5, 43, 8, 27], [5, 12, 16, 64, 9, None]]
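
The extended slice `values[i::3]` reads as "start at position `i` and take every third element". A small sketch with labeled placeholders (the names `flat` and `n_cols` are illustrative) makes the regrouping visible:

```python
# Flattened row-major data: a1, b1, c1 are the cells of row 1, and so on.
flat = ["a1", "b1", "c1", "a2", "b2", "c2"]
n_cols = 3

# Offset i selects column i; the step n_cols jumps to the same column in the next row.
by_column = [flat[i::n_cols] for i in range(n_cols)]
print(by_column)  # [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2']]
```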

We then use `zip`. Notice the use of the `*` in front: it unpacks the list of lists into separate arguments to `zip`. The next two cells show the same unpacking with a simple function.

In [31]:
def print_them(a, b, c):
    print("a", a, "b", b, "c", c)
print_them(1, 2, 3)

a 1 b 2 c 3


In [32]:
print_them(*[1, 2, 3])

a 1 b 2 c 3


In [33]:
stacked_values=zip(*stacked_values_lists)
list(stacked_values)

[(17, 11, 5), (6, 4, 12), (9, 5, 16), (46, 43, 64), (10, 8, 9), (11, 27, None)]
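
Slicing by column and then calling `zip(*...)` effectively transposes the data back into rows. As a sanity check, the sketch below compares that result with directly chunking the flat list into groups of three. It uses the first nine values from the list above, and `n_cols` is an illustrative name.

```python
# First three rows of the flattened values from the table.
values = [17, 11, 5, 6, 4, 12, 9, 5, 16]
n_cols = 3

# Route taken in the notebook: group by column, then transpose with zip(*...).
by_column = [values[i::n_cols] for i in range(n_cols)]
rows_via_zip = list(zip(*by_column))

# Alternative route: cut the flat list into consecutive chunks of n_cols.
rows_via_chunks = [tuple(values[i:i + n_cols]) for i in range(0, len(values), n_cols)]

print(rows_via_zip)                     # [(17, 11, 5), (6, 4, 12), (9, 5, 16)]
print(rows_via_zip == rows_via_chunks)  # True
```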

In [34]:
# Here's the original HTML table for visual understanding.
# find_all returned a list, so we render its first (and only) element.
HTML(str(table_demographics[0]))

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17%,11%,5%
Black/non-Hispanic,6%,4%,12%
Hispanics of any race,9%,5%,16%
White/non-Hispanic,46%,43%,64%
Mixed race/other,10%,8%,9%
International students,11%,27%,


---

##  Putting things into Pandas

### Dataframes

To recap, we now have three data structures holding our column names, our row (index) names, and our values grouped by index.

We will now load this data into a Pandas Dataframe. The loading process is pretty straightforward, and all we need to do is tell Pandas which container goes where.


In [35]:
import pandas as pd

In [36]:
list(stacked_values)

[]

Wait! What happened?

Remember that `stacked_values` was a zip object. We ran `list(stacked_values)` to print it, but this had an unfortunate side effect: it **exhausted the iterator** by iterating over the zip, so nothing was left. We need to redefine the zip first, and we'll give it a better name.

In [37]:
stacked_values_iterator = zip(*stacked_values_lists)

Labeling variables like this follows the philosophy of [Hungarian Notation](https://en.wikipedia.org/wiki/Hungarian_notation). Use it sparingly, when it's critical to the understanding of your code, as it is here.

In [38]:
df = pd.DataFrame(list(stacked_values_iterator), columns=columns, index=indexes)
df

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17,11,5.0
Black/non-Hispanic,6,4,12.0
Hispanics of any race,9,5,16.0
White/non-Hispanic,46,43,64.0
Mixed race/other,10,8,9.0
International students,11,27,
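
The iterator exhaustion encountered before building the DataFrame is general Python behavior, not something specific to this data. A minimal sketch with throwaway values:

```python
# zip returns a one-shot iterator, not a list.
pairs = zip([1, 2, 3], "abc")

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing is left to consume

print(first_pass)   # [(1, 'a'), (2, 'b'), (3, 'c')]
print(second_pass)  # []

# Materializing the result once gives a list that can be reused freely.
kept = list(zip([1, 2, 3], "abc"))
print(kept == list(kept))  # True
```

If the values are needed more than once, storing the list instead of the zip object avoids the problem.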


---

#### Other ways to create the Dataframe

That was one of many ways to construct a dataframe. Here is another that uses a list of dictionaries:

First we combine the list and dictionary comprehensions to get a list of dictionaries representing each row in the data.

In [40]:
stacked_values_iterator = zip(*stacked_values_lists)
data_dicts = [{col: val for col, val in zip(columns, col_values)} for col_values in stacked_values_iterator]
data_dicts

[{'Graduate and professional': 11, 'U.S. census': 5, 'Undergraduate': 17},
 {'Graduate and professional': 4, 'U.S. census': 12, 'Undergraduate': 6},
 {'Graduate and professional': 5, 'U.S. census': 16, 'Undergraduate': 9},
 {'Graduate and professional': 43, 'U.S. census': 64, 'Undergraduate': 46},
 {'Graduate and professional': 8, 'U.S. census': 9, 'Undergraduate': 10},
 {'Graduate and professional': 27, 'U.S. census': None, 'Undergraduate': 11}]

In [41]:
pd.DataFrame(data_dicts, index=indexes)

Unnamed: 0,Graduate and professional,U.S. census,Undergraduate
Asian/Pacific Islander,11,5.0,17
Black/non-Hispanic,4,12.0,6
Hispanics of any race,5,16.0,9
White/non-Hispanic,43,64.0,46
Mixed race/other,8,9.0,10
International students,27,,11


And yet another that uses a dictionary of lists:

To achieve this we group the values columnwise...

In [42]:
stacked_by_col = [values[i::3] for i in range(len(columns))]
stacked_by_col

[[17, 6, 9, 46, 10, 11], [11, 4, 5, 43, 8, 27], [5, 12, 16, 64, 9, None]]

and then adapt the pattern we used to create the list of dictionaries.

In [43]:
data_lists = {col: val for col, val in zip(columns, stacked_by_col)}
data_lists

{'Graduate and professional': [11, 4, 5, 43, 8, 27],
 'U.S. census': [5, 12, 16, 64, 9, None],
 'Undergraduate': [17, 6, 9, 46, 10, 11]}
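
When the comprehension only pairs keys with values, as it does here, the built-in `dict` constructor applied to `zip` gives the same result. The sketch below uses a shortened, illustrative copy of the data.

```python
# Shortened copies of the column names and column-wise values.
columns = ["Undergraduate", "Graduate and professional"]
by_col = [[17, 6], [11, 4]]

via_comprehension = {col: vals for col, vals in zip(columns, by_col)}
via_dict = dict(zip(columns, by_col))

print(via_comprehension == via_dict)  # True
```

The comprehension form becomes the better choice as soon as the keys or values need to be transformed along the way.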

In [44]:
pd.DataFrame(data_lists, index=indexes)

Unnamed: 0,Graduate and professional,U.S. census,Undergraduate
Asian/Pacific Islander,11,5.0,17
Black/non-Hispanic,4,12.0,6
Hispanics of any race,5,16.0,9
White/non-Hispanic,43,64.0,46
Mixed race/other,8,9.0,10
International students,27,,11


---

### DataFrame cleanup

Our DataFrame looks nice; but does it have the right data types?

In [45]:
df.dtypes

Undergraduate                  int64
Graduate and professional      int64
U.S. census                  float64
dtype: object

The `U.S. census` column looks a little strange. It should have been evaluated as an integer, but instead it came in as a float. The cause is the `NaN` value: `NaN` is a floating-point value, so Pandas stores the whole column as floats.

In fact, missing values can mess up a lot of our calculations, and some functions don't work at all when `NaN` values are present. So we should probably clean this up.
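
The effect of a single missing value on a column's data type can be reproduced in isolation. The sketch below also shows the nullable `"Int64"` extension dtype as one possible alternative; it is not used in this notebook and is only available in more recent Pandas versions.

```python
import pandas as pd

# A missing entry forces the default integer column to become float64.
with_missing = pd.Series([5, 12, None])
print(with_missing.dtype)  # float64

# Assumption: a Pandas version that supports the nullable Int64 dtype.
nullable = pd.Series([5, 12, None], dtype="Int64")
print(nullable.dtype)      # Int64
print(nullable.isna().sum())  # 1
```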

One way to do that is by dropping the rows that have missing values:

In [46]:
df.dropna()

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17,11,5.0
Black/non-Hispanic,6,4,12.0
Hispanics of any race,9,5,16.0
White/non-Hispanic,46,43,64.0
Mixed race/other,10,8,9.0


Or the columns that have missing values:

In [47]:
df.dropna(axis=1)

Unnamed: 0,Undergraduate,Graduate and professional
Asian/Pacific Islander,17,11
Black/non-Hispanic,6,4
Hispanics of any race,9,5
White/non-Hispanic,46,43
Mixed race/other,10,8
International students,11,27


But we will take a less radical approach and replace the missing value with a zero. In this case the solution makes sense, since a 0% value is meaningful in this context. We will also transform all the values to integers at the same time.

In [48]:
df_clean = df.fillna(0).astype(int)
df_clean

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
Asian/Pacific Islander,17,11,5
Black/non-Hispanic,6,4,12
Hispanics of any race,9,5,16
White/non-Hispanic,46,43,64
Mixed race/other,10,8,9
International students,11,27,0


In [49]:
df_clean.dtypes

Undergraduate                int64
Graduate and professional    int64
U.S. census                  int64
dtype: object

Now our table looks good!

Let's see some basic statistics about it.

In [50]:
df_clean.describe()

Unnamed: 0,Undergraduate,Graduate and professional,U.S. census
count,6.0,6.0,6.0
mean,16.5,16.333333,17.666667
std,14.896308,15.513435,23.36379
min,6.0,4.0,0.0
25%,9.25,5.75,6.0
50%,10.5,9.5,10.5
75%,15.5,23.0,15.0
max,46.0,43.0,64.0
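
Keep in mind that the filled-in zero takes part in these statistics. As a quick check, the sketch below recomputes the mean of the `U.S. census` column with and without the fill, using the values from the table above.

```python
import pandas as pd

# U.S. census column as parsed, with the missing last entry.
census = pd.Series([5, 12, 16, 64, 9, None], name="U.S. census")

# By default, mean() skips missing values.
mean_skipping_missing = census.mean()

# After fillna(0), the zero counts as a sixth observation.
mean_zero_filled = census.fillna(0).mean()

print(round(mean_skipping_missing, 2))  # 21.2
print(round(mean_zero_filled, 2))       # 17.67
```

The second figure matches the mean reported by `describe()` above, so the choice of fill value is worth remembering when interpreting the summary.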
