# Learning Objectives
When we're done today, you will approach messy real-world data with confidence that you can get it into a format that you can manipulate.

Specifically, our learning objectives are:

a. Understand the tree-like structure of an HTML document and use that structure to extract desired information

b. Use Python data structures such as lists, dictionaries, and Pandas DataFrames to store and manipulate information

c. Practice using Python packages such as BeautifulSoup and Pandas, including how to navigate their documentation to find functionality.

d. Identify some other (semi-)structured formats commonly used for storing and transferring data, such as JSON and CSV

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup
import requests


import json
from IPython.display import HTML

In [2]:
# Setting up 'requests' to make HTTPS requests properly takes some extra steps... we'll skip them for now.
%matplotlib inline 

requests.packages.urllib3.disable_warnings()

import warnings
warnings.filterwarnings("ignore")

# Data Analysis Question

Is science becoming more collaborative over time? How about literature? Are there a few "geniuses" or lots of hard workers? One way we might answer those questions is by looking at Nobel Prizes. 

We could ask questions like:

1) Has anyone won a prize more than once?
2) How has the total number of recipients changed over time?
3) How has the number of recipients per award changed over time?

To answer these questions, we'll need data: who received what award and when.

What are the different approaches we could take to acquiring Nobel Prize data?

When possible, find a structured dataset (.csv, .json, .xls).

After a Google search, we stumble upon this dataset on GitHub.

It is also in the section folder, named github-nobel-prize-winners.csv.

We use pandas to read it:

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/OpenRefine/OpenRefine/master/main/tests/data/nobel-prize-winners.csv")
df.head() #pandas is a very useful package

Unnamed: 0,year,discipline,winner,desc
0,1901,chemistry,Jacobus H. van 't Hoff,in recognition of the extraordinary services h...
1,1901,literature,Sully Prudhomme,in special recognition of his poetic compositi...
2,1901,medicine,Emil von Behring,"for his work on serum therapy, especially its ..."
3,1901,peace,Henry Dunant,
4,1901,peace,Fr&eacute;d&eacute;ric Passy,
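
The last row above still carries HTML entities in the winner column (`Fr&eacute;d&eacute;ric`). A minimal sketch of one way to clean such values with the standard library's `html.unescape`; the two-element list `raw_names` is a stand-in for the full column, and this is not necessarily how the dataset should be cleaned in general:

```python
import html

# Stand-in for df.winner: two names exactly as they appear in the output above
raw_names = ["Fr&eacute;d&eacute;ric Passy", "Henry Dunant"]

# html.unescape converts entities such as &eacute; into the characters they encode;
# strings without entities pass through unchanged
clean_names = [html.unescape(name) for name in raw_names]
print(clean_names)
```

On the real DataFrame the same function could be applied with `df.winner.map(html.unescape)`, provided the column has no missing values (a missing value is a float, which `html.unescape` cannot handle).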


In [5]:
type(df.winner)

pandas.core.series.Series

In [13]:
df.describe()

Unnamed: 0,year
count,853.0
mean,1961.791325
std,30.856644
min,1901.0
25%,1936.0
50%,1967.0
75%,1988.0
max,2007.0


# Question 1: Did anyone receive the Nobel Prize more than once?
How would you check if anyone received more than one Nobel Prize?

In [25]:
# initialize the list storing all the names seen so far
name_winners = []

for i, name in enumerate(df.winner):
    # Check if we already encountered this name. The word-count filter
    # only reports names of at most two words.
    if name in name_winners and len(name.split()) <= 2:
        # if so, print the year, discipline and name of the repeat award
        print(df.iloc[i].year, "\t", df.iloc[i].discipline, "\t \t", name)

    # in every case, record the name
    name_winners.append(name)

1911 	 chemistry 	 	 Marie Curie
1962 	 peace 	 	 Linus Pauling
1972 	 physics 	 	 John Bardeen
1980 	 chemistry 	 	 Frederick Sanger


# Part 2: WEB SCRAPING
The first step in web scraping is to look for structure in the HTML.

A description of html tags and attributes is summarized here:
https://www.cs.princeton.edu/courses/archive/fall11/cos109/labs/html/tags.html

We must understand the structure of the webpage so that we can scrape it effectively.

Let's look at a real website:
The official Nobel website (https://www.nobelprize.org/prizes/lists/all-nobel-prizes/) has the data we want, but in 2018 and 2019 the physics prize was awarded to multiple groups, so we will use an archived version of the web page for an easier introduction to web scraping.

The Internet Archive periodically crawls most of the Internet and saves what it finds. (That's a lot of data!) So let's grab the data from the Archive's "Wayback Machine" (great name!). We've just given you the direct URL, but at the very end you'll see how we can get it out of a JSON response from the Wayback Machine API.

Let's take a look at the 2018 version of the Nobel website and at the HTML under the hood: right-click on the page and choose Inspect. Try to find structure in the tree-structured HTML.
(http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/)
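
To give a feel for pulling a URL out of a JSON response, here is a minimal sketch. The `response_text` string is a hypothetical response body, written to resemble the shape the Wayback Machine availability API is understood to return (`archived_snapshots` containing a `closest` entry); treat the field names as an assumption and check the API's own documentation before relying on them.

```python
import json

# Hypothetical response body; the field names are an assumption about the API's shape
response_text = """
{
  "url": "https://www.nobelprize.org/prizes/lists/all-nobel-prizes/",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20180820111639/https://www.nobelprize.org/prizes/lists/all-nobel-prizes/",
      "timestamp": "20180820111639"
    }
  }
}
"""

# json.loads turns the JSON text into nested Python dictionaries
payload = json.loads(response_text)

# Walk the nested dictionaries, tolerating a missing "closest" entry
closest = payload["archived_snapshots"].get("closest")
snapshot_url = closest["url"] if closest and closest.get("available") else None
print(snapshot_url)
```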


In [131]:
# Nobel Prize winners. Note: this URL is the live page, not the archived 2018 snapshot linked above.
url = "https://www.nobelprize.org/prizes/lists/all-nobel-prizes/"
response = requests.get(url) # you can use any URL that you wish

The status code tells us the status of our request.
For more details: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status


In [41]:

response.status_code

200
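
As a small illustration of reading status codes, the sketch below maps a code to its standard phrase and broad class using the standard library's `http.HTTPStatus`. The helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"

print(describe_status(200))
print(describe_status(404))
```

With `requests`, calling `response.raise_for_status()` is another option: it raises an exception for 4xx and 5xx responses instead of letting a failed download pass silently.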

In [28]:
response.text

'\t<!DOCTYPE html>\n\n\t<html lang="en-US" class="no-js">\n\n\t<head>\n\n\t\t<meta charset="UTF-8"><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"06ce85f426",applicationID:"933345079"};window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var i=t[n]={exports:{}};e[n][0].call(i.exports,function(t){var i=e[n][1][t];return r(i||t)},i,i.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(e,t,n){function r(){}function i(e,t,n){return function(){return o(e,[u.now()].concat(c(arguments)),t?null:this,n),t?void 0:this}}var o=e("handle"),a=e(6),c=e(7),f=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var d=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],p="api-",l=p+"ixn-";a(d,function(e,t){s[t]=i(p+t,!0,"api")}),s.addPageAction=i(p+"addPageA

In [29]:
soup = BeautifulSoup(response.content, "html.parser")
soup

 <!DOCTYPE html>

<html class="no-js" lang="en-US">
<head>
<meta charset="utf-8"/><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"06ce85f426",applicationID:"933345079"};window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var i=t[n]={exports:{}};e[n][0].call(i.exports,function(t){var i=e[n][1][t];return r(i||t)},i,i.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(e,t,n){function r(){}function i(e,t,n){return function(){return o(e,[u.now()].concat(c(arguments)),t?null:this,n),t?void 0:this}}var o=e("handle"),a=e(6),c=e(7),f=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var d=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],p="api-",l=p+"ixn-";a(d,function(e,t){s[t]=i(p+t,!0,"api")}),s.addPageAction=i(p+"addPageAction",!0),s.setC

In [30]:
soup.get_text()

' \n\n\n(window.NREUM||(NREUM={})).loader_config={licenseKey:"06ce85f426",applicationID:"933345079"};window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var i=t[n]={exports:{}};e[n][0].call(i.exports,function(t){var i=e[n][1][t];return r(i||t)},i,i.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(e,t,n){function r(){}function i(e,t,n){return function(){return o(e,[u.now()].concat(c(arguments)),t?null:this,n),t?void 0:this}}var o=e("handle"),a=e(6),c=e(7),f=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var d=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],p="api-",l=p+"ixn-";a(d,function(e,t){s[t]=i(p+t,!0,"api")}),s.addPageAction=i(p+"addPageAction",!0),s.setCurrentRouteName=i(p+"routeName",!0),t.exports=newrelic,s.interaction=function(){return(new r).get()};var 

In [31]:
soup.head # fetches the head tag, which encompasses the title tag

<head>
<meta charset="utf-8"/><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"06ce85f426",applicationID:"933345079"};window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var i=t[n]={exports:{}};e[n][0].call(i.exports,function(t){var i=e[n][1][t];return r(i||t)},i,i.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(e,t,n){function r(){}function i(e,t,n){return function(){return o(e,[u.now()].concat(c(arguments)),t?null:this,n),t?void 0:this}}var o=e("handle"),a=e(6),c=e(7),f=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var d=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],p="api-",l=p+"ixn-";a(d,function(e,t){s[t]=i(p+t,!0,"api")}),s.addPageAction=i(p+"addPageAction",!0),s.setCurrentRouteName=i(p+"routeName",!0),t.exports=newrel

In [32]:
#Usually head tags are small and only contain the most important contents; however, here, there's some Javascript code. 
#The title tag resides within the head tag.

soup.title # we can specifically call for the title tag

<title>All Nobel Prizes</title>

In [33]:
soup.title.string

'All Nobel Prizes'

Always remember “not to be evil” when scraping with requests! If downloading multiple pages (like you will be on HW1), always put a delay between requests (e.g., time.sleep(1), with the time library) so you don’t unwittingly hammer someone’s web server and/or get blocked.
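
A minimal sketch of such a delay. The helper `fetch_all` and its `fetch` parameter are hypothetical names for this example; passing the download function in as an argument keeps the sketch runnable without touching the network.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Demonstration with a stand-in fetch function instead of a real download
pages = fetch_all(["page-1", "page-2"], fetch=str.upper, delay_seconds=0.1)
print(pages)
```

With `requests` imported as in the first cell, a real call might look like `fetch_all(urls, lambda u: requests.get(u).text)`.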

# Parse the Page
In HTML code, paragraphs are often denoted with a `<p>` tag. In addition to 'paragraph' (aka p) tags, link tags are also very common and are denoted by `<a>` tags.

In [34]:
soup.p

<p> </p>

In [35]:
soup.a

<a class="skip-link screen-reader-text" href="#content">
			Skip to content		</a>

In [36]:
soup.find_all('title')

[<title>All Nobel Prizes</title>,
 <title id="search-mobile-title">Close the search form</title>,
 <title id="logo-title">The Nobel Prize</title>,
 <title id="desktop-search-title">Close the search form</title>,
 <title id="submit-search-title-header">Submit a search term</title>,
 <title id="aside-facebook-title">Share on Facebook: All Nobel Prizes</title>,
 <title id="aside-twitter-title">Tweet: All Nobel Prizes</title>,
 <title id="aside-linkedin-title">Share on LinkedIn: All Nobel Prizes</title>,
 <title id="aside-email-title">Share via Email: All Nobel Prizes</title>,
 <title id="twitter-social-icon-title">Twitter Icon</title>,
 <title id="instagram-social-icon-title">Instagram Icon</title>,
 <title id="youtube-social-icon-title">Youtube Icon</title>,
 <title id="linkedin-social-icon-title">LinkedIn Icon</title>]

In [37]:
soup.find_all('a')

[<a class="skip-link screen-reader-text" href="#content">
 			Skip to content		</a>,
 <a href="#" id="search-mobile-trigger-js">
 <svg aria-labelledby="search-mobile-title search-mobile-description" height="20px" role="img" version="1.1" viewbox="0 0 20 20" width="20px" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
 <title id="search-mobile-title">Close the search form</title>
 <desc id="search-mobile-description">A button that allows you to close the search form if needed</desc>
 <g fill="none" fill-rule="evenodd" stroke="none" stroke-width="1">
 <g transform="translate(-1132.000000, -18.000000)">
 <g id="mobile-search-icon" transform="translate(1132.000000, 18.000000)">
 <rect fill="#2A2A2A" height="8" transform="translate(16.000000, 16.000000) rotate(-45.000000) translate(-16.000000, -16.000000) " width="2" x="15" y="12"></rect>
 <circle cx="8" cy="8" r="7" stroke="#2A2A2A" stroke-width="2"></circle>
 </g>
 </g>
 </g>
 </svg>
 </a>,
 <a class="toggle

In [38]:
for link in soup.find_all('a'): # we could optionally pass the href=True flag .find_all('a', href=True)
    print(link.get('href'))

#content
#
#
https://www.nobelprize.org
#main-navigation-js
/prizes/
/prizes/physics/
/prizes/chemistry/
/prizes/medicine/
/prizes/literature/
/prizes/peace/
/prizes/economics/
https://www.nobelprize.org/prizes/facts/nobel-prize-facts/
/nomination/
https://www.nobelprize.org/nomination/nomination-and-selection-of-physics-laureates/
https://www.nobelprize.org/nomination/nomination-and-selection-of-chemistry-laureates/
https://www.nobelprize.org/nomination/nomination-and-selection-of-medicine-laureates/
https://www.nobelprize.org/nomination/nomination-and-selection-of-literature-laureates/
https://www.nobelprize.org/nomination/nomination-and-selection-of-peace-prize-laureates/
https://www.nobelprize.org/nomination/nomination-and-selection-of-laureates-in-economic-sciences/
https://www.nobelprize.org/nomination/archive/
/alfred-nobel/
https://www.nobelprize.org/alfred-nobel/biographical-information/
https://www.nobelprize.org/alfred-nobel/alfred-nobels-will/
/news-and-insights/
https://ww
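
Many of the links printed above are relative (for example `/prizes/physics/`) or are just fragments (`#content`). A minimal sketch of turning them into absolute URLs with the standard library's `urllib.parse.urljoin`, assuming the page URL used earlier as the base:

```python
from urllib.parse import urljoin

base_url = "https://www.nobelprize.org/prizes/lists/all-nobel-prizes/"

# A few href values copied from the output above
hrefs = ["#content", "/prizes/physics/", "https://www.nobelprize.org/nomination/archive/"]

# urljoin resolves each href against the base; absolute URLs are returned unchanged
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```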

In [39]:
paragraphs = soup.find_all('p')
paragraphs

[<p> </p>,
 <p class="ingress">Between 1901 and 2020, the Nobel Prizes and the Prize in Economic Sciences were awarded 603 times to 962 people and organizations. With some receiving the Nobel Prize more than once, this makes a total of 930 individuals and 25 organizations. Below, you can view the full list of Nobel Prizes and Nobel Laureates.</p>,
 <p><a href="https://www.nobelprize.org/prizes/physics/2020/penrose/facts/">Roger Penrose</a> “for the discovery that black hole formation is a robust prediction of the general theory of relativity”</p>,
 <p><a href="https://www.nobelprize.org/prizes/physics/2020/genzel/facts/">Reinhard Genzel</a> and <a href="https://www.nobelprize.org/prizes/physics/2020/ghez/facts/">Andrea Ghez</a> “for the discovery of a supermassive compact object at the centre of our galaxy</p>,
 <p><a href="https://www.nobelprize.org/prizes/chemistry/2020/charpentier/facts/">Emmanuelle Charpentier</a> and <a href="https://www.nobelprize.org/prizes/chemistry/2020/doudna

# Regular expressions
You can find specific patterns or strings in text by using regular expressions, a pattern-matching mechanism used throughout computer science and programming (it is not specific to Python). Some great resources that we recommend if you are interested (they could be very useful for a homework problem):

https://docs.python.org/3.3/library/re.html

https://regexone.com

https://docs.python.org/3/howto/regex.html.

Specify a specific sequence with the help of regex special characters. Some examples:

`\S` : Matches any character which is not a Unicode whitespace character

`\d` : Matches any Unicode decimal digit

`*` : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.

Let's find all the occurrences of 'Marie' in our raw_html:

# BeautifulSoup 

Key functions we’ll be using in this section:

tag.prettify(): Returns cleaned-up version of raw HTML, useful for printing

tag.select(selector): Return a list of nodes matching a CSS selector

tag.select_one(selector): Return the first node matching a CSS selector

tag.text/soup.get_text(): Returns visible text of a node (e.g.,"<p>Some text</p>" -> "Some text")

tag.contents: A list of the immediate children of this node

You can also use these functions to find nodes.

tag.find_all(tag_name, attrs=attributes_dict): Returns a list of matching nodes

tag.find(tag_name, attrs=attributes_dict): Returns first matching node

BeautifulSoup is a very powerful library -- much more info here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
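
To see the functions listed above side by side, here is a minimal sketch on a small hand-written HTML snippet. The snippet `html_doc` and the variable `demo` are made up for this example and only imitate the structure of the real page.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating the structure of the Nobel page
html_doc = """
<html><head><title>All Nobel Prizes</title></head>
<body>
<p class="ingress">Below, you can view the full list of Nobel Prizes and Nobel Laureates.</p>
<h3><a href="https://www.nobelprize.org/prizes/physics/2020/summary">The Nobel Prize in Physics 2020</a></h3>
<h3><a href="https://www.nobelprize.org/prizes/chemistry/2020/summary">The Nobel Prize in Chemistry 2020</a></h3>
</body></html>
"""
demo = BeautifulSoup(html_doc, "html.parser")

links = demo.select("h3 a")           # CSS selector: every <a> inside an <h3>
first_link = demo.select_one("h3 a")  # only the first match
intro = demo.find("p", attrs={"class": "ingress"})  # first <p> with this class
headings = demo.find_all("h3")        # list of all <h3> nodes

print(len(links))
print(first_link.text)          # visible text of the node
print(first_link.get("href"))   # value of an attribute
print(headings[0].contents)     # immediate children of the first <h3>
print(intro.text)
```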

In [46]:
pretty_soup = soup.prettify()
print(pretty_soup[0:200]) #what about negative indices?

<!DOCTYPE html>
<html class="no-js" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   (window.NREUM||(NREUM={})).loader_config={licenseKey:"06ce85f426",applicationID


## Find the first “title” object

In [98]:
# Your code here
soup.select("h3 a")

[<a href="https://www.nobelprize.org/prizes/physics/2020/summary">The Nobel Prize in Physics 2020</a>,
 <a href="https://www.nobelprize.org/prizes/chemistry/2020/summary">The Nobel Prize in Chemistry 2020</a>,
 <a href="https://www.nobelprize.org/prizes/medicine/2020/summary">The Nobel Prize in Physiology or Medicine 2020</a>,
 <a href="https://www.nobelprize.org/prizes/literature/2020/summary">The Nobel Prize in Literature 2020</a>,
 <a href="https://www.nobelprize.org/prizes/peace/2020/summary">The Nobel Peace Prize 2020</a>,
 <a href="https://www.nobelprize.org/prizes/economic-sciences/2020/summary">The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel 2020</a>,
 <a href="https://www.nobelprize.org/prizes/physics/2019/summary">The Nobel Prize in Physics 2019</a>,
 <a href="https://www.nobelprize.org/prizes/chemistry/2019/summary">The Nobel Prize in Chemistry 2019</a>,
 <a href="https://www.nobelprize.org/prizes/medicine/2019/summary">The Nobel Prize in Physiolog

## Extract the text of the first “title” object

In [99]:
t=soup.select("h3 a")[0]
t.text

'The Nobel Prize in Physics 2020'

## Extracting award data


In [132]:
#award_nodes = soup.select('.by_year') #<div class ="by year"
award_nodes = soup.select('h3 a') # every award title is a link inside an <h3> heading
len(award_nodes)

652

In [154]:
award_node = award_nodes[2] # index 2 is the Physiology or Medicine award shown below

In [155]:
HTML(award_node.prettify())

In [138]:
award_node.text

'The Nobel Prize in Physiology or Medicine 2020'

### Separating the award title from the year it was awarded

In [139]:
award_node.text[:-4]

'The Nobel Prize in Physiology or Medicine '

In [140]:
award_node.text[-4:].strip()

'2020'
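
The slicing above relies on the year always being the last four characters and leaves a trailing space on the title. As an alternative, a minimal sketch using a regular expression; the function name `split_award_text` is made up for this example and it works on plain strings rather than on BeautifulSoup nodes:

```python
import re

def split_award_text(text):
    """Split 'Some Award Title 2020' into ('Some Award Title', 2020)."""
    # Group 1: everything up to the last non-space character before the year
    # Group 2: exactly four digits at the very end of the string
    match = re.search(r"^(.*\S)\s+(\d{4})$", text.strip())
    if match is None:
        raise ValueError(f"no trailing four-digit year in {text!r}")
    return match.group(1), int(match.group(2))

title, year = split_award_text("The Nobel Prize in Physiology or Medicine 2020")
print(title)
print(year)
```

Raising an error on unexpected text makes it easier to notice when the page layout does not match the assumption.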

### Putting these into functions so that it can be reused

In [141]:
def get_award_title(award_node):
    return award_node.text[:-4]

def get_award_year(award_node):
    return int(award_node.text[-4:])

In [143]:
list_awards = []
for award_node in award_nodes:
    list_awards.append(get_award_title(award_node))
list_awards

['The Nobel Prize in Physics ',
 'The Nobel Prize in Chemistry ',
 'The Nobel Prize in Physiology or Medicine ',
 'The Nobel Prize in Literature ',
 'The Nobel Peace Prize ',
 'The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel ',
 'The Nobel Prize in Physics ',
 'The Nobel Prize in Chemistry ',
 'The Nobel Prize in Physiology or Medicine ',
 'The Nobel Prize in Literature ',
 'The Nobel Peace Prize ',
 'The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel ',
 'The Nobel Prize in Physics ',
 'The Nobel Prize in Chemistry ',
 'The Nobel Prize in Physiology or Medicine ',
 'The Nobel Prize in Literature ',
 'The Nobel Peace Prize ',
 'The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel ',
 'The Nobel Prize in Physics ',
 'The Nobel Prize in Chemistry ',
 'The Nobel Prize in Physiology or Medicine ',
 'The Nobel Prize in Literature ',
 'The Nobel Peace Prize ',
 'The Sveriges Riksbank Prize in Economic Sciences in Memory

In [144]:
list_award_year = [get_award_year(award_node) for award_node in award_nodes]

In [146]:
len(list_award_year)

652
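
With titles and years held in two parallel lists, one possible next step is to combine them into a DataFrame and count awards per year. A minimal sketch, assuming three-element stand-in lists `titles` and `years` in place of the full `list_awards` and `list_award_year`:

```python
import pandas as pd

# Stand-ins for list_awards and list_award_year, taken from the output above
titles = ["The Nobel Prize in Physics", "The Nobel Prize in Chemistry",
          "The Nobel Prize in Physics"]
years = [2020, 2020, 2019]

# One row per award, one column per extracted field
awards_df = pd.DataFrame({"title": titles, "year": years})

# Number of awards listed for each year
awards_per_year = awards_df.groupby("year").size()
print(awards_per_year)
```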

In [150]:
soup.select("h6 a")

[]

In [None]:
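# NOTE: `table` is not defined anywhere above. The get_attribute('innerHTML') call
# suggests a Selenium WebElement located elsewhere; that is an assumption.
# The 'lxml' parser also requires the lxml package to be installed.
bs_table = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')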

rows = bs_table.find_all('tr')
for row in rows:
    cells = row.find_all(['td', 'th'])
    for cell in cells:
        print(cell.name, cell.attrs)
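
The same row-and-cell loop can be tried without a browser session. A minimal sketch, assuming a small hand-written table `table_html` (its rows are copied from the CSV output near the top) and the built-in `html.parser` instead of `lxml`:

```python
from bs4 import BeautifulSoup

# Hand-written table standing in for the innerHTML of a real table element
table_html = """
<table>
  <tr><th>year</th><th>discipline</th><th>winner</th></tr>
  <tr><td>1901</td><td>chemistry</td><td>Jacobus H. van 't Hoff</td></tr>
  <tr><td>1901</td><td>literature</td><td>Sully Prudhomme</td></tr>
</table>
"""
bs_table = BeautifulSoup(table_html, "html.parser")

# Collect the text of every cell, one list per table row
rows = []
for row in bs_table.find_all("tr"):
    cells = row.find_all(["td", "th"])
    rows.append([cell.get_text(strip=True) for cell in cells])

# The first row holds the column names, the rest hold the records
header, records = rows[0], rows[1:]
print(header)
print(records)
```

From here, `pd.DataFrame(records, columns=header)` would give a DataFrame; note that every value is still a string at this point.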
        
