# Web Scraping

The problems in this notebook cover the material from `Lectures/Data Collection/Web Scraping with BeautifulSoup`.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from time import sleep

##### 1. Asthma

Scrape the percent of each age group that has asthma according to these CDC statistics, <a href="https://www.cdc.gov/asthma/most_recent_national_asthma_data.htm">https://www.cdc.gov/asthma/most_recent_national_asthma_data.htm</a>. Turn the data into a `DataFrame`. Which group has the highest rate? Which has the lowest? (Answer these questions using `pandas`.)

##### Sample Solution

In [2]:
url = "https://www.cdc.gov/asthma/most_recent_national_asthma_data.htm"
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')

In [3]:
table = soup.find('table').tbody

In [4]:
ages = [td.text for td in table.find_all('td', {'headers':"characteristic all_age"})]
percs_point = [float(td.text.split(" ")[0]) for td in table.find_all('td', {'headers':"percent all_age"})]
percs_se = [float(td.text.split(" ")[1].replace("(","").replace(")","")) for td in table.find_all('td', {'headers':"percent all_age"})]
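
The two comprehensions above rely on each cell reading like `8.3 (0.64)`: a point estimate, a space, then a standard error in parentheses. As a minimal sketch under that same assumption, a single regular expression can pull out both numbers in one pass and return `None` for a cell that does not fit, instead of raising. The helper name `parse_estimate` and the sample strings are illustrative and were not taken from the CDC page.

```python
import re

# Assumed cell format: "<point estimate> (<standard error>)", e.g. "8.3 (0.64)".
ESTIMATE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?)\)\s*$")


def parse_estimate(cell_text):
    """Return (point, standard_error) as floats, or None if the text does not match."""
    match = ESTIMATE_PATTERN.match(cell_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_estimate("8.3 (0.64)"))
print(parse_estimate("n/a"))
```

Cells that come back as `None` can then be inspected by hand before the `DataFrame` is built.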

In [5]:
asthma = pd.DataFrame({'age_group':ages,
                         'percs_point':percs_point,
                         'percs_se':percs_se})

asthma

| | age_group | percs_point | percs_se |
|---|---|---|---|
| 0 | 0–4 | 2.6 | 0.4 |
| 1 | 5–11 | 8.3 | 0.64 |
| 2 | 5–14 | 9.1 | 0.53 |
| 3 | 5-17 (School Age) | 8.6 | 0.44 |
| 4 | 12-14 (Young Teens) | 10.8 | 0.89 |
| 5 | 12-17 | 8.9 | 0.57 |
| 6 | 15-17 (Teenagers) | 7.0 | 0.66 |
| 7 | 15–19 | 7.4 | 0.69 |
| 8 | 11-21 (Adolescents) | 8.7 | 0.51 |
| 9 | 20–24 | 9.9 | 0.91 |


In [6]:
asthma.loc[asthma.percs_point==asthma.percs_point.max()]

| | age_group | percs_point | percs_se |
|---|---|---|---|
| 4 | 12-14 (Young Teens) | 10.8 | 0.89 |
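
The exercise also asks for the lowest rate. One possible way to get both answers, sketched here on the values printed earlier rather than on a fresh scrape, is `idxmax` and `idxmin`, which return the index label of the largest and smallest entry of a column.

```python
import pandas as pd

# Values copied from the printed asthma DataFrame; no new request is made.
asthma = pd.DataFrame({
    "age_group": ["0–4", "5–11", "5–14", "5-17 (School Age)", "12-14 (Young Teens)",
                  "12-17", "15-17 (Teenagers)", "15–19", "11-21 (Adolescents)", "20–24"],
    "percs_point": [2.6, 8.3, 9.1, 8.6, 10.8, 8.9, 7.0, 7.4, 8.7, 9.9],
})

# idxmax/idxmin give the row label; .loc then returns the whole row.
highest = asthma.loc[asthma.percs_point.idxmax()]
lowest = asthma.loc[asthma.percs_point.idxmin()]

print("Highest:", highest.age_group, highest.percs_point)
print("Lowest:", lowest.age_group, lowest.percs_point)
```

Note that `idxmax` and `idxmin` return only the first matching row, so the boolean filter used in the sample solution is the safer choice when ties are possible.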


##### 2. Let's go Cavs!

Write a script to get the scores from all of the Cleveland Cavaliers games listed on this site, <a href="https://www.basketball-reference.com/teams/CLE/2022_games.html">https://www.basketball-reference.com/teams/CLE/2022_games.html</a>. The scores are stored in the `Tm` and `Opp` columns.

In [7]:
url = "https://www.basketball-reference.com/teams/CLE/2022_games.html"
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')

In [8]:
table = soup.find('table', {'id':'games'}).tbody

In [9]:
pts_scored = [td.text for td in table.find_all('td', {'data-stat':"pts"})]
opp_pts = [td.text for td in table.find_all('td', {'data-stat':"opp_pts"})]

pd.DataFrame({'point_scored':pts_scored,
                 'opp_scored':opp_pts})

| | point_scored | opp_scored |
|---|---|---|
| 0 | 121 | 132 |
| 1 | 112 | 123 |
| 2 | 101 | 95 |
| 3 | 99 | 87 |
| 4 | 92 | 79 |
| ... | ... | ... |
| 79 | 115 | 120 |
| 80 | 107 | 118 |
| 81 | 133 | 115 |
| 82 | 108 | 115 |
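
The scores above are still strings, because `td.text` always returns text. The sketch below repeats the same `data-stat` lookup offline and converts the values to integers so that arithmetic works. The HTML fragment is a hand-written stand-in (an assumption, not the site's real markup) that reuses the first two score lines from the output above, and the `margin` column is added only for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for two rows of the games table (not the real page source).
sample_html = """
<table id="games"><tbody>
  <tr><td data-stat="pts">121</td><td data-stat="opp_pts">132</td></tr>
  <tr><td data-stat="pts">112</td><td data-stat="opp_pts">123</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"id": "games"}).tbody

# Convert the cell text to integers so the columns support arithmetic.
scores = pd.DataFrame({
    "point_scored": [int(td.text) for td in table.find_all("td", {"data-stat": "pts"})],
    "opp_scored": [int(td.text) for td in table.find_all("td", {"data-stat": "opp_pts"})],
})

# Illustrative derived column: positive means a Cavaliers win.
scores["margin"] = scores.point_scored - scores.opp_scored
print(scores)
```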


##### 3. More Scores

Repeat what you did in problem 2, but this time for all seasons from the 2000-01 to the 2021-22 seasons.

Also record the opponent team name, the season and the scores in a single `DataFrame`. In which season did the Cleveland Cavaliers score the highest average points per game?

In [10]:
## sleep was already imported at the top of the notebook
import numpy as np

In [11]:
years = range(2001, 2023)

base_url = "https://www.basketball-reference.com/teams/CLE/"
end_of_url = "_games.html"

opponents = []
seasons = []
pts_scored = []
opp_pts = []

for year in years:
    print("Working on", year)
    html = requests.get(base_url + str(year) + end_of_url)
    soup = BeautifulSoup(html.content, 'html.parser')
    
    table = soup.find('table', {'id':'games'}).tbody
    
    season_opp_names = [td.text for td in table.find_all('td', {'data-stat':"opp_name"})]
    season_pts_scored = [int(td.text) for td in table.find_all('td', {'data-stat':"pts"})]
    season_opp_pts = [int(td.text) for td in table.find_all('td', {'data-stat':"opp_pts"})]
    season_seasons = np.repeat(str(year-1) + "-" + str(year)[-2:], len(season_opp_names))

    opponents.extend(season_opp_names)
    seasons.extend(season_seasons)
    pts_scored.extend(season_pts_scored)
    opp_pts.extend(season_opp_pts)
    
    sleep(3)

Working on 2001
Working on 2002
Working on 2003
Working on 2004
Working on 2005
Working on 2006
Working on 2007
Working on 2008
Working on 2009
Working on 2010
Working on 2011
Working on 2012
Working on 2013
Working on 2014
Working on 2015
Working on 2016
Working on 2017
Working on 2018
Working on 2019
Working on 2020
Working on 2021
Working on 2022
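
The loop builds two strings per season: the page URL and a label such as `2000-01`. A minimal sketch of that string logic on its own, which can be checked without sending any requests, is given below. The helper names `season_label` and `season_url` are hypothetical and do not appear in the sample solution.

```python
BASE_URL = "https://www.basketball-reference.com/teams/CLE/"
END_OF_URL = "_games.html"


def season_label(end_year):
    """Label a season by its start year and the last two digits of its end year."""
    return f"{end_year - 1}-{str(end_year)[-2:]}"


def season_url(end_year):
    """Build the schedule URL for the season that ends in end_year."""
    return f"{BASE_URL}{end_year}{END_OF_URL}"


# Spot-check the first and last seasons in the requested range.
for year in (2001, 2022):
    print(season_label(year), season_url(year))
```

With these helpers, `[season_label(year)] * len(season_opp_names)` would give a plain list of labels in place of `np.repeat`.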


In [12]:
cavs_scores = pd.DataFrame({'season':seasons,
                                'opponent':opponents,
                                'pts_scored':pts_scored,
                                'opp_pts':opp_pts})

In [13]:
cavs_scores.groupby('season').pts_scored.mean().sort_values(ascending=False)

season
2017-18    110.865854
2016-17    110.341463
2021-22    107.714286
2019-20    106.892308
2018-19    104.475610
2015-16    104.329268
2020-21    103.833333
2014-15    103.134146
2009-10    102.109756
2008-09    100.280488
2013-14     98.219512
2005-06     97.585366
2006-07     96.756098
2004-05     96.512195
2012-13     96.500000
2007-08     96.378049
2010-11     95.451220
2001-02     95.268293
2011-12     93.030303
2003-04     92.914634
2000-01     92.207317
2002-03     91.402439
Name: pts_scored, dtype: float64

##### 4. Scraping Scientific Articles

You are working on a research project that involves analyzing how links to journal articles are posted. You have been tasked with scraping the title, authors, and DOI of each article from a list of URLs. The data loaded below contains three unique domains.

In [14]:
articles = pd.read_csv("journal_article_urls.csv")

In [15]:
articles.head()

| | domain | url |
|---|---|---|
| 0 | www.science.org | https://www.science.org/doi/10.1126/sciimmunol... |
| 1 | www.science.org | https://www.science.org/doi/10.1126/scisignal.... |
| 2 | www.science.org | https://www.science.org/doi/10.1126/sciimmunol... |
| 3 | www.science.org | https://www.science.org/doi/10.1126/scitranslm... |
| 4 | www.science.org | https://www.science.org/doi/10.1126/science.ab... |


In [16]:
articles.domain.value_counts()

www.science.org      10
www.nature.com       10
www.thelancet.com     8
Name: domain, dtype: int64
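
Because each domain lays out its pages differently, one possible design is a dictionary that maps a domain to its scraping function. The sketch below shows only that dispatch step. The `*_stub` functions are hypothetical placeholders that return fixed tuples, and the domain is read from the URL with the standard library's `urlparse` rather than from the `domain` column.

```python
from urllib.parse import urlparse


# Hypothetical placeholders standing in for real per-domain scrapers.
def nature_stub(url):
    return ("nature title", "nature authors", "nature doi")


def lancet_stub(url):
    return ("lancet title", "lancet authors", "lancet doi")


SCRAPERS = {
    "www.nature.com": nature_stub,
    "www.thelancet.com": lancet_stub,
}


def scrape_article(url):
    """Pick a scraper from the URL's domain; return "NA" fields for unknown domains."""
    domain = urlparse(url).netloc
    scraper = SCRAPERS.get(domain)
    if scraper is None:
        return ("NA", "NA", "NA")
    return scraper(url)


print(scrape_article("https://www.nature.com/articles/s41586-022-04629-w"))
print(scrape_article("https://www.science.org/"))
```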

##### Nature

First, write a function that scrapes the title, authors, and DOI from the `www.nature.com` URLs.

##### Sample Solution

In [17]:
def nature(url):
    html = requests.get(url)
    ## name the parser explicitly, as in the earlier problems
    soup = BeautifulSoup(html.content, 'html.parser')
    
    ## Title
    if soup.title:
        if soup.title.text:
            title = soup.title.text.split("|")[0].strip()
        else:
            title = "NA"
    else:
        title = "NA"

    ## Authors
    if soup.find_all('a', {'data-test':"author-name"}):
        authors = ", ".join([a.text for a in soup.find_all('a', {'data-test':"author-name"})])
    else:
        authors = "NA"
        
    ## doi
    if soup.find('li', {'class':"c-bibliographic-information__list-item c-bibliographic-information__list-item--doi"}):
        li = soup.find('li', {'class':"c-bibliographic-information__list-item c-bibliographic-information__list-item--doi"})
        if li.find('span', {'class':"c-bibliographic-information__value"}):
            doi = li.find('span', {'class':"c-bibliographic-information__value"}).text
        else:
            doi = "NA"
    else:
        doi = "NA"
        
    return title, authors, doi
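
The nested `if`/`else` checks in `nature` all follow one pattern: use the tag's text if the tag exists, otherwise fall back to `"NA"`. As a sketch of one way to condense that pattern, the hypothetical helper `text_or_na` below wraps the check. It is run against a hand-written fragment (an assumption for illustration, not Nature's real markup) that deliberately has no DOI element.

```python
from bs4 import BeautifulSoup


def text_or_na(tag):
    """Return the stripped text of a tag, or "NA" if the tag is missing or empty."""
    if tag is None:
        return "NA"
    text = tag.get_text(strip=True)
    return text if text else "NA"


# Hand-written fragment for illustration; it has no DOI element on purpose.
sample_html = """
<html><head><title>Example article | Example Journal</title></head>
<body>
  <a data-test="author-name">A. Author</a>
  <a data-test="author-name">B. Author</a>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

title = text_or_na(soup.title).split("|")[0].strip()
# An empty join result is falsy, so `or` supplies the fallback.
authors = ", ".join(a.text for a in soup.find_all("a", {"data-test": "author-name"})) or "NA"
doi = text_or_na(soup.find("span", {"class": "c-bibliographic-information__value"}))

print(title, authors, doi, sep="\n")
```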

In [18]:
for url in articles.loc[articles.domain=='www.nature.com'].url.values:
    print(url)
    title, authors, doi = nature(url)
    print(title)
    print(authors)
    print(doi)
    print()
    sleep(3)

https://www.nature.com/articles/s41586-022-04629-w
Projected environmental benefits of replacing beef with microbial protein
Florian Humpenöder, Benjamin Leon Bodirsky, Isabelle Weindl, Hermann Lotze-Campen, Tomas Linder, Alexander Popp
https://doi.org/10.1038/s41586-022-04629-w

https://www.nature.com/articles/s41586-022-04617-0
Protected areas have a mixed impact on waterbirds, but management helps
Hannah S. Wauchope, Julia P. G. Jones, Jonas Geldmann, Benno I. Simmons, Tatsuya Amano, Daniel E. Blanco, Richard A. Fuller, Alison Johnston, Tom Langendoen, Taej Mundkur, Szabolcs Nagy, William J. Sutherland
https://doi.org/10.1038/s41586-022-04617-0

https://www.nature.com/articles/s41586-022-04666-5
Nonlinear mechanics of human mitotic chromosomes
Anna E. C. Meijering, Kata Sarlós, Christian F. Nielsen, Hannes Witt, Janni Harju, Emma Kerklingh, Guus H. Haasnoot, Anna H. Bizard, Iddo Heller, Chase P. Broedersz, Ying Liu, Erwin J. G. Peterman, Ian D. Hickson, Gijs J. L. Wuite
https://doi.

##### The Lancet

Next, write a function that scrapes the title, authors, and DOI from the `www.thelancet.com` URLs.

In [19]:
print(url)

https://www.nature.com/articles/s41586-022-04573-9


In [20]:
def lancet(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser')
    
    ## title
    ## check that the title tag exists before reading its text
    if soup.title and soup.title.text:
        title = soup.title.text.split("- The Lancet")[0].strip()
    else:
        title = "NA"
        
    ## authors
    if soup.find_all('li', {'class':"loa__item author"}):
        authors = ", ".join([li.a.text.strip() for li in soup.find_all('li', {'class':"loa__item author"})])
    else:
        authors = "NA"
        
    ## doi
    if soup.find('a', {'class':"article-header__doi__value"}):
        doi = soup.find('a', {'class':"article-header__doi__value"}).text
    else:
        doi = "NA"
        
    return title, authors, doi

In [21]:
for url in articles.loc[articles.domain=='www.thelancet.com'].url.values:
    print(url)
    title,authors,doi = lancet(url)
    print(title)
    print(authors)
    print(doi)
    print()
    sleep(3)

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(22)00122-2/fulltext
Long-term secondary prevention of cardiovascular disease with a Mediterranean diet and a low-fat diet (CORDIOPREV): a randomised controlled trial
Javier Delgado-Lista, MD, Juan F Alcala-Diaz, MD, Jose D Torres-Peña, MD, Gracia M Quintana-Navarro, PhD, Francisco Fuentes, MD, Antonio Garcia-Rios, MD, Ana M Ortiz-Morales, MD, Ana I Gonzalez-Requero, MD, Ana I Perez-Caballero, MD, Elena M Yubero-Serrano, PhD, Oriol A Rangel-Zuñiga, PhD, Antonio Camargo, PhD, Fernando Rodriguez-Cantalejo, MD, Fernando Lopez-Segura, MD, Prof Lina Badimon, PhD, Prof Jose M Ordovas, PhD, Prof Francisco Perez-Jimenez, MD, Prof Pablo Perez-Martinez, MD, Prof Jose Lopez-Miranda, MD
https://doi.org/10.1016/S0140-6736(22)00122-2

https://www.thelancet.com/journals/lanmic/article/PIIS2666-5247(22)00033-7/fulltext
Accuracy and efficacy of pre-dengue vaccination screening for previous dengue infection with a new dengue rapid diagnostic

##### Science

Try to request the HTML for the first URL from the `www.science.org` domain.

What is the status response code?

##### Sample Solution

In [22]:
url = articles.loc[articles.domain=='www.science.org'].url.values[0]

requests.get(url)

<Response [503]>

When I wrote this notebook, I received a 503 response. A 503 (Service Unavailable) is a server-side status code, so it points to an issue on the website's side and not to a bug in your code. A 503 can have several causes, but I believe this one occurs because `www.science.org`'s servers are configured to block automated scraping of the kind we are attempting.
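
As a small sketch of how a status code can be read, the standard library's `http.HTTPStatus` gives the reason phrase for a code, and the hundreds digit tells you which side the code points to. The helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase and which side the status class points to."""
    phrase = HTTPStatus(code).phrase
    if 500 <= code < 600:
        side = "server"
    elif 400 <= code < 500:
        side = "client"
    else:
        side = "neither"
    return phrase, side


for code in (200, 404, 503):
    print(code, describe_status(code))
```

With `requests`, the same number is available as `response.status_code`, and `response.raise_for_status()` raises an `HTTPError` for 4xx and 5xx responses, which is one way to stop before parsing an error page.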

Luckily, we have another way to get the data we want for these urls, which we will touch on in the `Python and APIs` `Practice Problems` notebook.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).