# STA 141B Lecture 10

February 14, 2023

### Announcements

* HW3 posted, due next Friday

### Topics

* Web Scraping
* Beautiful Soup

### Datasets

* [wiki](https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area)
* [Worldometers](https://www.worldometers.info/coronavirus/)
* [Craigslist Apartments](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa)

### References

+ Web Scraping
    * [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
    * [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
    * [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
+ Natural Language Processing
    * [Natural Language Processing with Python][nlpp], chapters 1-3. Beware: the print version is for Python 2.
    * [Applied Text Analysis with Python][atap], chapters 1, 3.

[nlpp]: https://www.nltk.org/book/
[atap]: https://search.library.ucdavis.edu/primo-explore/fulldisplay?docid=01UCD_ALMA51320822340003126&context=L&vid=01UCD_V1&search_scope=everything_scope&tab=default_tab&lang=en_US

## Web Scraping

### Example 1: Getting tables from Wikipedia

For data in a `table` element, we can use __Pandas__ instead of writing a scraper.

Wikipedia provides lots of useful information in tables. Let's get the Wikipedia list of [US cities by area][wiki].

[wiki]: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area

In [2]:
import pandas as pd

In [None]:
tabs = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area")

In [None]:
def strip_footnote(x):
    """This function removes bracketed footnotes, such as '[1]'."""
    if pd.isna(x):
        return x
    
    return x.partition("[")[0]

In [None]:
# combine table headers into a row and remove footnote

In [None]:
# Apply a function to a DataFrame elementwise (.applymap)

tbl = tbl.applymap(strip_footnote)

# This does not work. Why?
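
One plausible reason the call above fails, assuming the table mixes text and numeric columns: `strip_footnote` calls `.partition` on every cell, but only strings have that method, so a float or integer cell raises `AttributeError`. (If `tbl` was never assigned from `tabs`, a `NameError` comes first.) A minimal sketch of a more defensive helper follows; the name `strip_footnote_safe` and the sample values are hypothetical and not part of the original notes.

```python
def strip_footnote_safe(x):
    """Remove a bracketed footnote such as '[1]' from a string cell.

    Non-string cells (numbers, NaN, None) are returned unchanged.
    """
    if not isinstance(x, str):
        return x
    return x.partition("[")[0]


# Illustrative values, not taken from the real table
print(strip_footnote_safe("Sitka[2]"))  # Sitka
print(strip_footnote_safe(2870.3))      # 2870.3
```

The helper can be passed to `tbl.applymap(...)` in the same way. In pandas 2.1 and later, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`.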

### Exercise: Get the table from https://en.wikipedia.org/wiki/List_of_cities_by_GDP

In [19]:
cities_GDP = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")

In [4]:
len(cities_GDP)

4

In [20]:
cities_GDP

[    Rank (PPP)  Rank (nominal)     Metropolitan area  Country/region  \
 0          1.0             1.0                 Tokyo           Japan   
 1          2.0             2.0              New York   United States   
 2          3.0             3.0           Los Angeles   United States   
 3          4.0             6.0                 Seoul     South Korea   
 4          5.0             5.0                 Paris          France   
 5          6.0             4.0                London  United Kingdom   
 6          7.0            10.0              Shanghai           China   
 7          8.0            16.0                Moscow          Russia   
 8          9.0            12.0               Beijing           China   
 9         10.0             8.0            Osaka–Kobe           Japan   
 10        11.0            50.0              Istanbul          Turkey   
 11        12.0            34.0               Jakarta       Indonesia   
 12        13.0             7.0               Chica

In [None]:
tbl_GDP = cities_GDP[1]

In [None]:
tbl_GDP.head()
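
Picking a table by position, as in `cities_GDP[1]`, can break if the page layout changes. A minimal sketch of an alternative uses the `match` argument of `pd.read_html`, which keeps only tables whose text matches a string or regex. The inline HTML is made-up illustration data, not the Wikipedia page, and the sketch assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Made-up two-table document standing in for a real page
html = """
<table>
  <tr><th>City</th><th>Land area</th></tr>
  <tr><td>Sitka</td><td>100</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Metropolitan area</th></tr>
  <tr><td>1</td><td>Tokyo</td></tr>
</table>
"""

# Without match, every table on the page is returned
all_tables = pd.read_html(StringIO(html))
print(len(all_tables))

# With match, only tables containing the given text are returned
matched = pd.read_html(StringIO(html), match="Metropolitan area")
print(matched[0])
```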

### Example 2: Worldometers

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a popular Python library for pulling data out of HTML and XML files.

In [None]:
# html

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
html_doc

Running the document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

In [None]:
soup.title

In [None]:
soup.title.name  # the tag's name ('title'), not the text inside it

In [None]:
soup.title.string

In [None]:
soup.title.parent.name

In [None]:
soup.p

In [None]:
soup.p['class']

In [None]:
soup.find_all('a')

In [None]:
soup.find(id="link2")
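
Once tags are located, their attributes and text are usually what we want. The sketch below assumes a shortened copy of the Dormouse document above. It collects each link's text and `href` with `find_all`, then selects the same tags with the CSS selector `a.sister` via `select`. The names `snippet` and `demo_soup` are hypothetical, chosen so the notebook's `soup` is not overwritten.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document above
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# Tag attributes are read like dictionary keys; .get_text() returns the text
links = {a.get_text(): a["href"] for a in demo_soup.find_all("a")}
print(links)

# The same tags selected with a CSS selector (tag name plus class)
ids = [a["id"] for a in demo_soup.select("a.sister")]
print(ids)
```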

In [None]:
import requests

# URL of the page to scrape
url = 'https://www.worldometers.info/coronavirus/'
# Download the page; `page` is a requests Response object
page = requests.get(url)
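
A request can fail or return an error page, so it helps to check the response before parsing it. A minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is from the original notes):

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text.

    Raises requests.HTTPError for 4xx/5xx responses instead of
    silently returning an error page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html_text = fetch_html("https://www.worldometers.info/coronavirus/")
```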

In [None]:
# Parse the page's HTML with the lxml parser into a navigable BeautifulSoup object
soup = BeautifulSoup(page.text, 'lxml')
soup

In [None]:
# Find the <table> tag with the given id
table1 = soup.find('table', id='main_table_countries_today')
table1

In [None]:
# Collect the column titles from the <th> tags
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

In [None]:
headers

In [None]:
# Replace the line-wrapped header text in column 13 with a one-line label
headers[13] = 'Tests/1M pop'

In [None]:
# Create a dataframe
mydata = pd.DataFrame(columns = headers)

In [None]:
# Fill mydata one row at a time, skipping the header row
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
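
The loop above appends one row at a time with `.loc`, which gets slow for large tables. A minimal sketch of an alternative collects all rows in a list first and builds the DataFrame once. The small table is made-up illustration data, and the names `demo_table`, `demo_headers`, and `demo_df` are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table standing in for the real page
table_html = """
<table id="demo">
  <tr><th>#</th><th>Country</th><th>Total Cases</th></tr>
  <tr><td>1</td><td>A</td><td>1,000</td></tr>
  <tr><td>2</td><td>B</td><td>250</td></tr>
</table>
"""
demo_table = BeautifulSoup(table_html, "html.parser").find("table", id="demo")

# Column names come from the <th> cells
demo_headers = [th.get_text(strip=True) for th in demo_table.find_all("th")]

# One list per data row; skip the header row, which has no <td> cells
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in demo_table.find_all("tr")[1:]
]

demo_df = pd.DataFrame(rows, columns=demo_headers)
print(demo_df)
```

Every cell is still a string here, as in the loop above; converting columns to numbers would be a separate step.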

In [None]:
mydata_final = mydata.loc[range(7, 238), :]

In [None]:
del mydata_final["#"]

In [None]:
mydata_final = mydata_final.reset_index(drop = True)

In [None]:
mydata_final.head()

In [None]:
# mydata_final.set_index(['Country,Other'])

### Example 3: Craigslist Apartments

[Craigslist](https://www.craigslist.org/) is a popular website where people can post advertisements for free. We can use data from Craigslist to analyze the local rental market for apartments.

Craigslist doesn't provide an API, so we have to scrape the data ourselves. Scraping Craigslist is the biggest challenge we've faced yet, since each ad is on a separate page.

We can start by scraping the front page of the [apartments section](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa) for links to individual ads.

In [None]:
import requests

In [None]:
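url = 'https://sacramento.craigslist.org/search/apa'
session = requests.Session()  # reuse one connection across requests
my_headers = {"User-Agent": "Mozilla/5.0"}  # placeholder value (an assumption), not from the original notes
response = session.get(url, headers=my_headers)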

In [None]:
import lxml.html as lx

# craigslist_url = "https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa"
craigslist_url = "https://sacramento.craigslist.org/search/apa"

response = requests.get(craigslist_url)
response.raise_for_status()  # stop here if the request failed
html = lx.fromstring(response.text)
html.make_links_absolute(craigslist_url)

`make_links_absolute(base_href)` makes every link in the document absolute, treating `base_href` as the URL of the document. For example, with `base_href="http://localhost/foo/bar.html"`, a link to `baz.html` is rewritten as `http://localhost/foo/baz.html`.

More explanation: [here](https://linuxtut.com/en/e03431c718b94d6304ff/)
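
The rewriting rule can be checked on a single link with the standard library's `urljoin`, which applies the standard relative-URL resolution rule. This is a sketch for intuition, not a replacement for `make_links_absolute`; the `/about.html` and `https://example.com/x` links are made-up extra cases.

```python
from urllib.parse import urljoin

base_href = "http://localhost/foo/bar.html"

# A relative link is resolved against the directory of the base document
print(urljoin(base_href, "baz.html"))     # http://localhost/foo/baz.html

# A root-relative link keeps only the scheme and host of the base
print(urljoin(base_href, "/about.html"))  # http://localhost/about.html

# An absolute link is left unchanged
print(urljoin(base_href, "https://example.com/x"))
```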

In [None]:
html.text_content()