<div align=right>
Winter 2025<br>
Nardin<br>
Lecture 2
</div>


<h1 align=center>Getting Data from the Web I</h1>

<font color='darkblue'> <h2>Learning Objectives</h2> </font>

* Define web scraping
* Recognize the differences between scraping with and without public APIs
* Use the library `requests` to interact with websites
* Use the library `BeautifulSoup` to parse and extract data from websites
* Read and understand the HTML language
* Select HTML tags for scraping

To achieve these goals, we use two examples: the Chicago Wikipedia page (to scrape a table) and the University of Arizona's website (to scrape faculty members' information and deal with missing data).

<font color='darkblue'> <h2>Table of Contents</h2> </font>

1. Definition of Web Scraping and Two Ways to Scrape
1. Scraping in Python: Overview of `requests` and `beautifulsoup`
1. Example 1: Chicago Wikipedia
    * Using `requests`
    * Website Structure
    * Using `beautifulsoup`
    * Exercise 1
1. Example 2: Extract Faculty Information
    * Putting it all together
    * Exercise 2


<font color='darkblue'> <h2>1. Definition of Web Scraping and Two Ways to Scrape</h2> </font>
 
 
Web scraping is <b>the process of gathering or "scraping" information from a website.</b>

If you have ever copied and pasted information from the Internet, you have performed the same task as any web scraper, just on a small scale. Web scraping automates this process so you can collect hundreds, thousands, or even millions of pieces of information (e.g., companies' names, emails, phones, newspaper articles, reviews, prices, etc.).

Broadly speaking, <b>there are two main ways to get data from a website</b> and we will explore both, with a greater emphasis on the second option:

<font color='darkblue'> <h3> Option 1: Using the API made accessible by the website </h3> </font>

Several websites, especially those of large corporations or those managing extensive datasets, offer mechanisms that allow you to gather data by submitting queries and receiving CSV or JSON files in return. A website that offers this service is said to "provide an API." **An API (Application Programming Interface) is an interface provided by the website that helps users to collect data from that specific website.** 

How do you know if a website has an API? Most major websites (e.g. Facebook, Google, Amazon, The New York Times, etc.) have one or more (and want you to use them!). You can use Google to check if an API is available.

Example: Google Books API: https://developers.google.com/books/docs/v1/using#WorkingVolumes

**To gather data from a website using an API, you need to:**

* Learn how the API works (every API is different: sometimes using an API is smooth, sometimes it is a frustrating experience) and often register an account and get a password or API key
* Interact with the API directly or use a Python wrapper written by someone else to simplify the interaction; for examples see [here](https://github.com/realpython/list-of-python-api-wrappers), but use it with caution because the list has not been recently updated
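
As a rough illustration of these two steps, the sketch below builds a query URL for the Google Books API and parses a JSON reply. No request is actually sent: `sample_reply` is a made-up payload that only imitates the general shape of a reply (an `items` list whose entries hold a `volumeInfo` dictionary), and the `q` and `maxResults` parameters are taken on trust from the API documentation, so check that documentation before relying on any name used here.

```python
import json
from urllib.parse import urlencode

# Base endpoint of the Google Books "volumes" resource
base_url = "https://www.googleapis.com/books/v1/volumes"

# Query parameters are appended to the URL after a "?"
params = {"q": "web scraping", "maxResults": 2}
query_url = base_url + "?" + urlencode(params)
print(query_url)

# Made-up reply for illustration only (NOT real API output)
sample_reply = """
{"items": [
    {"volumeInfo": {"title": "Example Book A", "authors": ["A. Author"]}},
    {"volumeInfo": {"title": "Example Book B"}}
]}
"""

# JSON text becomes nested Python dictionaries and lists
data = json.loads(sample_reply)

book_titles = []
for item in data.get("items", []):
    info = item.get("volumeInfo", {})
    # .get() avoids a KeyError when a field is missing
    book_titles.append(info.get("title", "NA"))

print(book_titles)
```

With a real API you would pass `query_url` to `requests.get()` and parse the reply in the same way.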

<font color='darkblue'> <h3> Option 2: Directly accessing the website's HTML </h3> </font>

Many other websites do not provide an API. To gather data from these sites, you need to scrape them directly. Note that you can always **attempt** to scrape a website directly, even if it has an API. Sometimes this is easier or faster than learning how to interact with the API, but in most cases it is better to use the API if one is available (weigh the pros and cons).

Example of a website without an API: https://sociology.arizona.edu/faculty

Example of a website with an API that is easy and OK to scrape directly: Wikipedia

**To scrape a website directly, you need to:**

* Learn how to read and understand the website's raw code (every website is made up of a mix of HTML, CSS, and Javascript -- HTML is the most important for scraping)
* Use scraping libraries: we learn `BeautifulSoup` and `requests` (the latter is used for both types of websites, with and without an API)


<font color='darkblue'> <h2> 2. Scraping in Python: Overview of `requests` and `beautifulsoup` </h2> </font>

We are going to use `requests` and `BeautifulSoup`. If you haven't installed them on your machine, please do so before loading them (see Canvas for details; I assume you use Python with Anaconda in this course).

* <b>requests</b>: library to interact with web pages and get data from them. It sends HTTP requests to web servers and allows us (humans) to access the response. For more info see the official documentation: https://requests.readthedocs.io/en/master/

* <b>beautiful soup (bs4)</b>: library to parse and extract the data. It allows us to navigate and extract data (i.e. the desired tags) from HTML and other markup languages. See https://www.crummy.com/software/BeautifulSoup/ You need this library only for scraping a website directly.


In [None]:
import requests                       # to interact with websites and request/get data from them 
from bs4 import BeautifulSoup as bs   # to parse and extract data from websites 
import pandas as pd

**The library <code>requests</code> helps us make requests and get data back ([documentation](https://requests.readthedocs.io/en/latest/)):**

To understand what `requests` does, and web scraping more generally, we need to start with our daily use of the internet: when we open a website in a browser (e.g. Google) our machine sends a request to the website's server, asking for the information on that page. The server responds by sending to us the requested data in raw format (mainly written in HTML), which is then nicely displayed in our browser.

Under the hood, the process looks something [like this](https://www.linkedin.com/pulse/what-happens-when-you-enter-url-browser-he-asked-victor-ohachor):

* Computers talk to each other on the web by sending <b>data requests</b> (for example with the HTTP methods GET, to ask for data, and POST, to submit data) and receiving <b>data responses</b>: some computers make requests, some receive and answer them, some do both.

* Every computer has an address that other computers can use or refer to. When you click on a page, the <b>web browser</b> of your computer (e.g., Chrome, Safari, etc.) makes a data request to the <b>web server</b> of that page (think of it as a database where all the info about that page is stored) and gets back a response object.

**For example:**

If you type https://macss.uchicago.edu/current-student-resources into your web browser and hit enter, these steps occur under the hood:
* your web browser translates what you typed into an <b>HTTP request</b> to tell the macss web server that you would like to access the info stored at <code>/current-student-resources</code> using the <code>https</code> protocol
* the web server that hosts macss receives your request and sends back to your web browser an <b>HTTP response</b> code and response content (a bunch of files written in HTML)
* your browser receives and <b>transforms this response content into a nice visual display</b> that might include texts, graphics, hyperlinks, etc.
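
To make the pieces of that URL concrete, here is a small sketch that splits it into the parts mentioned in the steps above. It uses `urlsplit` from Python's standard library (not one of the two libraries covered in this lecture) and does not contact the server; the variable names are chosen only for this illustration.

```python
from urllib.parse import urlsplit

example_url = "https://macss.uchicago.edu/current-student-resources"

# Split the URL into its components without sending any request
parts = urlsplit(example_url)

print("protocol (scheme):", parts.scheme)
print("web server (host):", parts.netloc)
print("resource (path):  ", parts.path)
```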

**For web scraping, we do not want to display the data, we want to collect them:** 

* So we use the library `requests` to send data requests and get back a response; then we use the library `beautifulsoup` to parse and extract these data
* There are other Python libraries similar to `requests` (e.g. `urllib` and `urllib3`; `urllib2` exists only in Python 2); however, `requests` is the most widely used. The same applies to `beautifulsoup`: not the most powerful scraping library, but the most common and the first tool new scrapers use before moving on to more advanced libraries (e.g., `scrapy` or tools for dynamic scraping).

Let's see how to use these libraries with our first example!

<font color='darkblue'> <h2> 3. Example 1: Chicago Wikipedia </h2> </font>

Our first task is to scrape a table from the Chicago Wikipedia page: https://en.wikipedia.org/wiki/Chicago
* We first use <code>requests</code> to interact with this URL and store the data response we get back.
* We then rely on <code>BeautifulSoup</code> to parse the response and extract the data. 


<font color='darkblue'> <h3> Requests </h3> </font>

Let's start with the library `requests`. To start the process, we send a <code>get()</code> request to the URL and store the response:

In [None]:
chicago_wiki = 'https://en.wikipedia.org/wiki/Chicago'
response = requests.get(chicago_wiki)
print(type(response))

That's the minimum necessary code to use the library, but we can do more: set up a User-Agent, check our response status code, and verify the encoding.

<h4> User-Agent </h4>

You can specify your User-Agent by passing it in a dictionary to the `headers` parameter of `requests.get`. More info [here](https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers). 

This is not required in general, and not needed for PA1 (do not add it there), but it is helpful for your own scraping project if you want to make your intentions clear.

For example, we can specify the goal of our scraper and provide a way to be contacted by the website:

In [None]:
header = { "User-Agent" : "demo scraper for teaching purposes yourname@uchicago.edu" } 
response = requests.get(chicago_wiki, headers = header)

print("Our response code is:", response.status_code)

Remember computers make and send requests (see above)? 

OK, a User-Agent is a text string that your computer's web browser sends every time you make a request to a website's web server. It communicates info about your device type, operating system, etc. This info is useful for the web server that receives it. Sometimes, setting a custom User-Agent might prevent you from getting blocked. To find out your User-Agent, type "what is my user agent" in your Google search bar. A Chrome User-Agent on Windows looks similar to this:

    <code>user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"</code>
    

<h4> Response Status Code </h4>

Another useful task when you use the `requests` library is checking the response status code: 2xx codes bring good news, 4xx or 5xx codes mean errors. There are [several status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes); these are the most common: 
* "200 OK": standard response for successful HTTP requests
* "404 Not Found": the page is not available
* "403 Forbidden": the server rejected the request
* "503 Service Unavailable": usually a temporary issue, e.g. server overloaded or under maintenance
* "504 Gateway Timeout": indicates a time issue (a response was not sent within the expected time frame)

In our case, the request will most likely succeed, so we should get a 200 code, which means that the server has accepted the request and returned the page. If you get an error (4xx or 5xx codes), it might mean several things. One of the most common is a 404 (page not found): when the error comes from the website, there is little you can do on your side to "fix it" beyond checking the URL for typos. You might try again later, or handle the error so your code does not break when it runs into a bad response code (more on this in the next lectures).
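
As a small preview, the sketch below turns a numeric status code into a readable summary. The helper name `describe_status` is made up for this example; it relies on `HTTPStatus` from Python's standard library for the official phrase of each code and needs no internet connection.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase   # e.g. 404 -> "Not Found"
    except ValueError:
        phrase = "Unknown"                 # code not in the standard list

    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"           # 4xx family
    elif 500 <= code < 600:
        outcome = "server error"           # 5xx family
    else:
        outcome = "other"
    return f"{code} {phrase} ({outcome})"

for code in [200, 403, 404, 503, 504]:
    print(describe_status(code))
```

With a real response you could call `describe_status(response.status_code)`. The `requests` library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.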

In [None]:
print("Our response code is:", response.status_code)

<h4> Encoding </h4>

The response we get back arrives as bytes: the process called decoding translates bytes into letters/characters that humans can read (encoding is the reverse step, from characters to bytes). When the library `requests` gets the data back from the webpage, it decodes them automatically by making an educated guess, based on the HTTP headers, as to which encoding scheme applies.

The most common encoding on the web is <code>UTF-8</code>, which works well for English and most languages, but not all. 
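
To see why the choice of encoding matters, here is a minimal sketch with a toy string (no web request involved): the same bytes are decoded once with the correct scheme and once with a wrong one.

```python
toy_text = "café"

# Encoding: characters -> bytes (the "é" takes two bytes in UTF-8)
raw_bytes = toy_text.encode("utf-8")
print(raw_bytes)

# Decoding: bytes -> characters
decoded_right = raw_bytes.decode("utf-8")    # correct scheme
decoded_wrong = raw_bytes.decode("latin-1")  # wrong scheme, garbled output

print(decoded_right)
print(decoded_wrong)
```

Garbled characters like the ones in the last line are the typical symptom of a wrong encoding guess.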

You may run into encoding issues with web scraping. This is because the representation of characters is not always consistent across computers! In practice, this means you should check the default encoding; you can leave it unchanged, but if you notice issues along the way, go back and change it:

In [None]:
print("Response default encoding is:", response.encoding)

# change encoding
#response.encoding = 'latin-1'

<h4> Turn response into text </h4>

OK, at this point we got our response object from the Chicago Wikipedia server and we stored it in a `response` variable. 

Our `response` is a `Response` object, not plain text; we want to be able to read its content, so we turn it into a string with `.text` (if you have non-text content like images, use `.content` instead, which returns raw bytes):

In [None]:
response_txt = response.text
print(type(response_txt))

<font color='darkblue'> <h3>  Website Structure </h3> </font>

Now we have the HTML of the Chicago Wikipedia page stored as text in our <code>response_txt</code> object. We are going to use BeautifulSoup (or "bs" for short) to parse it, but, remember, bs won't do the hard job of understanding a webpage structure and telling us what to extract!  

It can only help us parse a webpage and extract the data *once we know where to find it (e.g. which tag to grab)*. Therefore, to use bs for scraping tasks we need to learn the basic elements of a webpage.

A website is made of the following elements:

* <b>HTML</b> which means HyperText Markup Language and is the core element of a website. HTML uses a set of tags to organize the webpage (i.e., makes the text bold, creates body text, paragraphs, inserts hyperlinks, etc.), but when the page is displayed the markup language is hidden

* <b>CSS</b> which means Cascading Style Sheets; it adds styling to make the page look nicer 

* <b>JS</b> Javascript code is used to add interactivity to the page, and you need "dynamic web scraping" techniques to interact with it

* <b>Other stuff</b> for example images (jpg and png allow webpages to show pictures), hyperlinks, videos or multimedia


#### Basic structure of HTML

HTML is the most important language we need to learn for web scraping: it makes up the "skeleton" or structure of a website.

It is messy to read, but it follows a [hierarchical-tree-like structure](https://www.researchgate.net/figure/HTML-source-code-represented-as-tree-structure_fig10_266611108) since it embeds tags within tags (everything enclosed in <> is a tag).

Standard HTML syntax (simple example): ``<tagname> contents </tagname>`` 

```html 
   <html>
     <head>
        <title>general info about the page</title>
     </head>
     <body>
       <p>a paragraph that holds some text about the page</p>
       <p>another paragraph which might contain <strong>additional</strong> markup</p>
       <p>...</p>
     </body>
   </html>
```

#### Tags

In web scraping, tags are fundamental because we collect information from webpages using them:

* tags are organized in a tree-like structure and are nested within each other
* tags go in pairs: one on each end of the content that they include, there is a "start tag" and an "end tag" (with a slash), for example `<p>hello</p>` (a few tags, such as `<img>`, have no content and therefore no end tag)
* tags can have attributes (id and class attributes are the most useful for scraping)

There are <b>several tags</b>, for example:

* section headings: `<h1>...</h1>` to `<h6>...</h6>`
represent six levels of section headings, `<h1>` is the highest section level and `<h6>` is the lowest


* body:  ``<body>...</body>``
contains the text and markup that is to be displayed, things like: text formatting (e.g. bold, italic, etc.), tables, paragraphs, lists, hyperlinks. Each will have its own tag inside the main body tag


* links: 
 - ``<a href="http://college.uchicago.edu">The College</a>``
 -  Note that `a` is the tag and `href` is the tag attribute and `"http://..."` is the attribute's value. Links are web page requests embedded in another web page and are fundamental to the whole "browsing" experience. The idea is that, as you are reading some page, you can click on a hyperlink and be taken to other pages


* images: ``<img style="height: 120px;" alt="" src="images/freelunch.png">``


* paragraph:
  -  ``<p> … some text … </p>``
  -  ``<p class="courseblocktitle"> course title… </p>``


* comment: ``<!-- ... -->``


* division or section:
  - ``<div> … </div>``
  - ``<div class="courseblock main">…stuff...</div>``


* table:
```html
     <table>
       <tr>
          <th>...</th>
             ...
          <th>...</th>
       </tr>
       <tr>
          <td>...</td>
             ...
          <td>...</td>
       </tr>
     </table>
```

A few more pieces of information to remember about tags:
    
1. Tags have commonly used names that depend on their position in relation to other tags: 
    * <b>child</b>: the tag inside another tag (e.g. the `p` tag is usually a child of the `body` tag)
    * <b>parent</b>: the tag that contains another tag
    * <b>sibling</b>: two tags are siblings if they are nested inside the same parent


2. <b>Class</b> and <b>id</b> are special attributes that specify more information about a given tag, usually a certain CSS style:
* not all tags have a class or id attribute
* the same class can be shared among many elements (and one element can have several classes), but an id should be unique within a page and each element can have only one id
* in web scraping, class and id attributes are important because they offer details about a tag and so they help us locate it

3. See [this list](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a complete list of tags and their meaning for your scraping projects!
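
The sketch below shows how these ideas (attributes, parent, child, sibling) look once a page is parsed with `BeautifulSoup`. It runs on a tiny HTML snippet written for this illustration, not on a real website; the id `block-1` and the variable names are made up.

```python
from bs4 import BeautifulSoup

toy_html = """
<body>
  <div class="courseblock main" id="block-1">
    <p class="courseblocktitle">course title</p>
    <p>some text with a <a href="http://college.uchicago.edu">link</a></p>
  </div>
</body>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# Attributes: class can hold several values (a list), id holds one
toy_div = toy_soup.find("div")
print(toy_div["class"])
print(toy_div["id"])

# Child and parent: the first <p> is a child of the <div>
first_p = toy_div.find("p")
print(first_p.parent.name)

# Sibling: the second <p> is nested inside the same parent
second_p = first_p.find_next_sibling("p")

# The <a> tag is a child of the second <p>; href is its attribute
toy_link = second_p.find("a")
print(toy_link["href"], toy_link.get_text())
```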

<b>Back to our example to observe these things in practice:</b>

Safari:
* ensure the developer menu is enabled: open Safari > Settings > Advanced > Check the "Show Develop menu in menu bar" checkbox
* go to https://en.wikipedia.org/wiki/Chicago, right click on it and select "Inspect Element"
* on the search bar, there should be a small target icon that you can use to select tags

Chrome: 
* go to https://en.wikipedia.org/wiki/Chicago, right click on it and select "Inspect" (you can also click on "View page source") 
* there should be a small box with an arrow icon that you can use to select tags 

Browse the website and notice the elements we learned (e.g., tree-like structure; tags are nested and go in pairs, etc.)

<b>Important: each webpage is different!</b> Meaning its HTML structure and tags cannot be determined in advance. This requires some general knowledge of HTML, and time and patience to identify which tags to use to scrape the data we want. In an ideal world, webpages are well made (in that they rely on well-designed and clear HTML structure) but in reality, this is often not the case!


<font color='darkblue'> <h3> Beautiful Soup </h3> </font>

<h4> Store and parse the HTML </h4>

Now we use Beautiful Soup to get this webpage into Python and start scraping!

In [None]:
# use bs to parse the content of our response_txt variable 

soup = bs(response_txt, 'html.parser')
print(soup.prettify())

We used the `html.parser` (other common parsers are `html5lib` and `lxml`). See [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) for more.


<h4> Common Beautiful Soup Methods </h4>

Now that we have saved the entire HTML or "skeleton" of the webpage into an object (that we called "soup"), we want to extract information from our "soup". The most important BeautifulSoup methods to do so are `find()` and `find_all()`

In [None]:
# find() returns the FIRST instance of a tag (examples of tags: p, table, td, img, title, etc.)

print(soup.find("title"))

In [None]:
# combine find() with get_text() to extract text from a specific tag

print(soup.find('title').get_text())

In [None]:
# same thing using the attribute notation: to access tags as if they were attributes of the object
print(soup.title)
print(soup.title.text)

In [None]:
print(soup.p)

NB: the attribute or dot notation is nice, but it is more limited than the explicit notation. It is OK to use for elements that only appear ONCE per page, because it only gives back the first element! It is always safer to be explicit: use `.find()` for retrieving single elements and `.find_all()` for multiple elements. 

For instance, to find all paragraph tags, we can use `find_all()`:

In [None]:
# find_all() returns ALL instances of a tag on a given web page

print(len(soup.find_all("p")))  # wrap in print(): only the last expression of a cell is displayed automatically
print(soup.find_all("p"))

# compare with soup.p

In [None]:
# to extract the text from each of these <p> tags at once, this won't work... why?
#soup.find_all('p').get_text()

In [None]:
# we need to loop over each tag

paragraphs = soup.find_all('p')

for par in paragraphs:
    print(par.get_text())

In [None]:
# you can save all <p> tags into a list 

paragraphs = soup.find_all('p')

par_list = []
for par in paragraphs:
    par_txt = par.get_text()
    par_list.append(par_txt)
print(par_list)

In [None]:
# or you save all <p> tags into a dictionary
# keys are the consecutive numbers of <p>, and values are the <p> text

paragraphs = soup.find_all('p')

par_dict = {}
for index, par in enumerate(paragraphs):
    par_txt = par.get_text()
    par_dict[index] = par_txt   
print(par_dict)

In [None]:
# access value using keys (e.g. access paragraph number 10)
par_dict[10]

# return all keys and all values each as list, or all pairs as list
#list(par_dict.keys())
#list(par_dict.values())
#list(par_dict.items())

Other than scraping all tags of one type (like all `<p>` tags) we can also scrape all instances of an EXACT MATCH (like a specific `p` tag). For example, if we had a tag that looks like this:

`<span class="mw-headline" id="Etymology_and_nicknames">Etymology and nicknames</span>`

We could write code like this to extract it:

In [None]:
# syntax 1
#print(soup.find_all("span", {"class": "mw-headline"}))

# syntax 2
#print(soup.find_all("span", class_= "mw-headline"))

<h4> Scraping Tables </h4>

The [Chicago wiki page](https://en.wikipedia.org/wiki/Chicago) has a few tables in it. Let's say that we want to scrape the "Major league professional teams in Chicago (ranked by attendance)" table in the page and turn it into a DataFrame object. 

First, take a look at the page by "inspecting" it (see above for instructions). We want to look for the tag and its class attribute and put them into our code:

In [None]:
# this does not work although it is the tag we see on the website by inspecting it
#soup.find_all("table", class_= "wikitable sortable jquery-tablesorter")

# this works
soup.find_all("table", class_= "wikitable sortable")

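The first line doesn't work because the `find_all()` method searches the content stored in the `soup` variable, that is, the raw HTML sent by the server, and not the page we see when we "inspect" elements in the browser. The browser runs the page's JavaScript, which can modify the HTML after it loads; here the `jquery-tablesorter` class appears to be added that way, so it is absent from our `soup`. To confirm, try typing `print(soup)` or `print(soup.prettify())` and then search for the table tag within the output.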

Issues like this are common, and you will often need to adjust your code. Here are some debugging tips:

* Check the content of your soup variable to ensure it includes the elements you are trying to scrape.
* Experiment with different parsers. For example, we used `html.parser` here, but other options are available (a list of common parsers is linked in earlier sections of this notebook).
* Check if the website has JavaScript-rendered components (the most common scenario): BeautifulSoup cannot run JavaScript, so it may scrape such content only partially or not at all
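
A related detail is how BeautifulSoup matches the `class` attribute. The sketch below uses a toy snippet (two made-up tables, not the real Wikipedia HTML) to compare three ways of searching by class. It assumes a reasonably recent `bs4` installation, where `select()` accepts CSS selectors.

```python
from bs4 import BeautifulSoup

toy_html = """
<table class="wikitable sortable jquery-tablesorter"><tr><td>A</td></tr></table>
<table class="wikitable sortable"><tr><td>B</td></tr></table>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# A multi-word class_ value must equal the full class string exactly
exact_match = toy_soup.find_all("table", class_="wikitable sortable")

# A single class name matches any tag that has that class among others
single_class = toy_soup.find_all("table", class_="wikitable")

# A CSS selector matches tags that have ALL listed classes, in any order
css_match = toy_soup.select("table.wikitable.sortable")

print(len(exact_match), len(single_class), len(css_match))
```

Searching by a single class or with a CSS selector is therefore more forgiving when a tag carries extra classes.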

We can grab the first matching table and store it in a variable:

In [None]:
sports_table = soup.find("table", class_= "wikitable sortable")

We can also render this table in our Jupyter notebook, so that we can better see what we are working with:

In [None]:
from IPython.display import HTML
HTML(str(sports_table))

Notice that we have not turned the table into a dataframe yet; we are just displaying it inside Jupyter to help us inspect it and grab the tags we need (you can also do this directly from the page).

It looks like each row of data is between `<tr>` tags, for "table row." We can pull out the raw text within each one of these rows using the `text` attribute:

In [None]:
# with a loop
rows = []
for i in (sports_table.find_all("tr")):
    rows.append(i.text)
print(rows)
len(rows)

In [None]:
# with a list comprehension = [expression for item in iterable if condition]
rows = [i.text for i in sports_table.find_all("tr")]
print(rows)

This still leaves us with a bunch of new line characters `\n` (some data cleaning is very common in scraping!). We can deal with them using Python's built-in string methods like `strip()` and `split()`. 
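
To see what these two methods do, here is a toy string shaped like the text of a single table row (the content is made up for illustration):

```python
toy_row = "\nClub\n\nLeague\n\nSport\n"

# strip("\n") removes new line characters from both ends only
stripped = toy_row.strip("\n")
print(repr(stripped))

# split("\n\n") cuts the string wherever two new lines appear in a row
cells = stripped.split("\n\n")
print(cells)
```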

Note that each column entry is delineated by two new line characters and each row starts with one new line character:

In [None]:
# strip new line characters from start and end, split on double new line
rows_clean = [i.text.strip('\n').split('\n\n') for i in sports_table.find_all("tr")]
rows_clean

A "list of lists" is something that we can work with and easily bring into a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):

In [None]:
pd.DataFrame(data = rows_clean[1:], columns = rows_clean[0])

There are still a few data cleaning tasks that we could complete if we wanted to analyze this data (e.g. change the Attendance column to an integer data type; we will talk more about cleaning tasks in the weeks ahead), but we have a scraped table as a df!
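
As a preview of that kind of cleaning, the sketch below converts a text column with thousands separators into integers. The data frame is a made-up stand-in (the team names and numbers are not the scraped values), and the scraped column may need extra steps, such as removing footnote markers.

```python
import pandas as pd

# Toy data for illustration only (NOT the scraped values)
toy_df = pd.DataFrame({
    "Club": ["Team A", "Team B"],
    "Attendance": ["40,000", "21,500"],
})

# Remove the commas, then convert the strings to integers
toy_df["Attendance"] = (
    toy_df["Attendance"].str.replace(",", "", regex=False).astype(int)
)

print(toy_df)
print(toy_df["Attendance"].sum())
```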

<font color='darkblue'> <h3> Exercise 1 </h3> </font>

**Your task: use the code above as an example to scrape the "Racial Composition" table from the [Chicago wiki page](https://en.wikipedia.org/wiki/Chicago). Then use regular expressions to clean the scraped data and turn it into a pandas df. Do not use AI.**

In [None]:
# your code to request data
chicago_wiki = 'https://en.wikipedia.org/wiki/Chicago'
response = requests.get(chicago_wiki)
response_txt = response.text

In [None]:
# your code to parse (e.g., "soup") and scrape raw data
soup = bs(response_txt, "html.parser") # or bs(response.text, "html.parser") if you skip line above
race_table = soup.find("table", class_= "wikitable sortable collapsible")
rows = [i.text for i in race_table.find_all("tr")]
rows

In [None]:
# your code to clean data using regular expressions 

import re

cleaned_rows = []
for row in rows:
    row = row.strip('\n')
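    row = re.sub(r'\n\n', '\n', row)    # collapse double new lines into one
    row = re.sub(r'\[\w+\]', '', row)   # drop footnote markers like [a] or [12]; raw string avoids invalid-escape warnings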
    row = row.split('\n')
    cleaned_rows.append(row)
cleaned_rows

In [None]:
# your code to turn the collected data into a pandas df
df = pd.DataFrame(data = cleaned_rows[1:], columns = cleaned_rows[0])
display(df)

<font color='darkblue'> <h2> 4. Example 2: Extract Faculty information </h2> </font>

We now use another website to practice what we learned so far and illustrate another common scraping task: <b>collecting contact information</b>. You can apply the same logic to get contact information from any website (companies, NGOs, congress members, other universities, etc.)

Our tasks:
1. Make a request to https://sociology.arizona.edu/faculty and save it into a response object
1. Turn the response object into text and parse it
1. Identify which HTML tags to use to scrape the following info from each faculty member: names, emails, titles
1. Put it all together

In [None]:
# TASK 1: make a request with the requests library

url = "https://sociology.arizona.edu/faculty"
response = requests.get(url)
print("Our response code is:", response.status_code)

In [None]:
# TASK 2: turn the response into text and parse it with the bs library

soup = bs(response.text, "html.parser")

Tips to extract names, emails, etc.:

* Go to the url you want to scrape (https://sociology.arizona.edu/faculty)
* Use the "Inspect" option in the development tools to inspect the HTML (see above for more)
* You should discover that all data (name, email, phone, etc.) for each faculty member are nested under a tag called <code>div class="az-person-row"</code>: <code>div</code> is the actual tag (a division or section tag), the rest is its specific class attribute
* Spend some time exploring all tags nested under this <code>div</code> tag to find the info you want

Below is one way to set up your scraper (e.g., using the higher-level tag and attribute `div class="az-person-row"`). However, there are other functional approaches to scrape data from this website. Select the approach that best fits your specific needs and the structure of the website's HTML.

**Beyond this specific example, your scraping code should always include a way to handle potential missing data for each element you scrape (see the example provided in TASK 4 below)**
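
One possible way to package that check is a small helper function that returns the text of a tag, or `"NA"` when the tag is missing. This is only a sketch: the helper name `text_or_na`, the toy HTML, and its class names (`person-row`, `name`, `email`) are invented for illustration and do not come from the Arizona website.

```python
from bs4 import BeautifulSoup

def text_or_na(parent, tag_name, class_name):
    """Return the stripped text of the first matching tag, or "NA" if absent."""
    tag = parent.find(tag_name, class_=class_name)
    return tag.get_text(strip=True) if tag else "NA"

# Toy HTML: the second person has no email
toy_html = """
<div class="person-row">
  <span class="name">Person One</span>
  <div class="email">one@example.edu</div>
</div>
<div class="person-row">
  <span class="name">Person Two</span>
</div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

toy_contacts = []
for person in toy_soup.find_all("div", class_="person-row"):
    toy_contacts.append([
        text_or_na(person, "span", "name"),
        text_or_na(person, "div", "email"),
    ])

print(toy_contacts)
```

Keeping the check in one function means every field is handled the same way, and a missing tag never stops the loop.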

In [None]:
# TASK 3.1: NAMES are stored under a <span> tag 

names = []

for row in soup.find_all('div',  attrs = {'class': 'az-person-row'}): 
    name = row.find('span', attrs = {'class': 'field field--name-title field--type-string field--label-hidden'}).text
    names.append(name)

print("\n", "Names of faculty members:", "\n", names)

In [None]:
# TASK 3.2: EMAILS are stored as text under <a href>, nested under <div> tag...

emails = []

for row in soup.find_all('div',  attrs = {'class': 'az-person-row'}): 
    div_tag = row.find('div', attrs = {'class': 'field field--name-field-az-email field--type-email field--label-hidden text-truncate d-block'})
    #print(div_tag)
    if div_tag is not None:
        email = div_tag.find("a", href = True).text  # only "a" also works here
    else:
        email = "NA"  # without this, the previous person's email would be appended again
    #print(email)
    emails.append(email) 

print(emails)

# note: see TASK 4 below for a better way of setting this code up!

In [None]:
# TASK 3.3: TITLES are stored as text under two nested div tags

titles = []
  
for row in soup.find_all('div',  attrs = {'class': 'az-person-row'}): 
    title = row.find('div', attrs = {'class': 'field field--name-field-az-titles field--type-string field--label-hidden field__items'}).text
    title = title.replace("\n", " ").strip()
    titles.append(title) 

print(titles)

In [None]:
# TASK 4: Combine everything in one piece of code
# Collect names, emails, and titles in one loop while checking for missing data

contacts = []

for row in soup.find_all('div',  attrs = {'class': 'az-person-row'}): 
    
    # scrape names
    name_tag = row.find('span', attrs = {'class': 'field field--name-title field--type-string field--label-hidden'})
    if name_tag:
        name = name_tag.text
    else:
        name = "NA"
    #name = name_tag.text if name_tag else "NA"
      
    # scrape emails
    email_tag = row.find('div', attrs = {'class': 'field field--name-field-az-email field--type-email ' \
                                               'field--label-hidden text-truncate d-block'})
    if email_tag:
        email = email_tag.text
    else:
        email = "NA"
    #email = email_tag.text.strip() if email_tag else "NA"
        
    # scrape titles
    title_tag = row.find('div', attrs = {'class': 'field field--name-field-az-titles field--type-string ' \
                                               'field--label-hidden field__items'})
    if title_tag: 
        title = title_tag.text.replace("\n", " ").strip()
    else: 
        title = "NA"
    #title = title_tag.text.replace("\n", " ").strip() if title_tag else "NA"
    
    contacts.append([name, email, title])  

# sort results and print them line by line
for row in sorted(contacts):
    print(row)

In [None]:
# save results in a pandas dataframe 

df = pd.DataFrame(contacts)
display(df) # same as print(df) but nicely formatted

In [None]:
# rename df and its columns

faculty = df.rename(columns = {0: 'name', 1: 'email', 2: 'title'})
display(faculty[0:3])

In [None]:
# export the collected data as csv using the "DataFrame.to_csv" function
# change the path to store the results on your machine

faculty.to_csv (r'\Users\Sabrina Nardin\Desktop\faculty.csv', encoding = 'utf-8', index = False)
print('Data exported!')

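Note: the path above is a Windows-style path. The `r` prefix makes Python treat backslashes as literal characters instead of the start of escape sequences, which Windows paths need. On a Mac, paths use `/` as the separator, so the same path would not work there and the `r` prefix is not needed.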
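
If you want code that works on both systems, one option is to build the path with `pathlib` from Python's standard library. The sketch below is a suggestion, not part of the original workflow: it assumes a `Desktop` folder exists in your home directory, and the export line is left commented out because it needs the `faculty` data frame from the cells above.

```python
from pathlib import Path

# Path.home() is the current user's home folder on Windows, Mac, and Linux;
# the "/" operator joins path pieces with the right separator for the system
out_path = Path.home() / "Desktop" / "faculty.csv"
print(out_path)

# pandas accepts a Path object wherever it accepts a path string:
# faculty.to_csv(out_path, encoding = 'utf-8', index = False)
```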

<font color='darkblue'> <h3> Exercise 2 </h3> </font>

**Your task: Scrape all faculty names and all faculty phone numbers from https://sociology.arizona.edu/faculty. Note that a few faculty members do not have a phone number; your scraper must account for the missing phone numbers. Write the code twice: once using a list to store names and phones, and once using a dictionary. Do not use AI.**

In [None]:
# code to scrape names and phone using a list


In [None]:
# code to scrape names and phones using a dictionary
