# Introduction to Web Scraping in Python

DS2F "Digital Scholarship & Data Science" Fellowship, University of Arizona Libraries

Instructor: Sabrina Nardin, Sociology Ph.D. Candidate, snardin@email.arizona.edu

## Getting Started

To run this workshop, we are using a Jupyter notebook and a virtual environment provided by mybinder.org. This means that to follow along and run the code, you do not need Python installed on your computer.

Jupyter notebooks allow mixing "markdown blocks," like the text you are reading now, and "code blocks," like the block below in grey. You can execute the code blocks of this tutorial by pressing Ctrl+Enter or by clicking the "Run" button in the toolbar at the top of the window.

You can modify, delete, or add new cells in any Jupyter notebook, including this notebook:
* To modify a code block, just click on it and change the code
* To insert a new cell below this one, click on the "Insert" tab > "Insert Cell Below" 
* To delete a cell, click on it, then go to "Edit" > "Delete Cells"
* To clear any pre-existing output, go to "Cell" > "All Output" > "Clear"

In [None]:
# print out a welcome message: to execute it press Ctrl+Enter or "Run" at the top of the screen

print("Welcome to Introduction to Web Scraping in Python!")

<b>Important:</b> 
* Please note that when you load this Jupyter notebook in the virtual environment (the link provided in the README file), the changes you make to the notebook will NOT be saved once you close the virtual environment. To keep the changes, download the notebook ("File" > "Download as" > select your preferred format).
* As we go through the workshop, you might get disconnected from the virtual environment ("Connection failed" message). To reconnect, open the link provided in the README again. Once the environment is rebuilt, make sure to re-run the code you had run before the disconnection.

## Learning Objectives

By the end of this workshop, you will be able to...
* Read an HTML page and evaluate it (by identifying and deciding which tags to use)
* Use the library “requests” to interact with websites
* Use the library “BeautifulSoup” to parse and get data from websites
* Understand some of the key tasks in static web scraping: handling missing data, handling request errors, and turning pages
* Conceptualize web scraping as a process that goes from the website to the cleaned data

In the first part of the workshop, we will learn the most important commands of the libraries requests and BeautifulSoup. Then, we will apply these tools by scraping two websites: a University of Arizona website and the IMDb movie website.

## What is Web Scraping?

* Web scraping is the process of gathering or "scraping" information from a website. If you have ever copied and pasted information from the Internet, you have performed the same task as any web scraper, just on a small scale. Web scraping allows automating this process to collect hundreds, thousands, or even millions of pieces of information
* It pertains to the data collection phase of a project and usually targets structured data (e.g. companies' names, emails, phones, newspaper articles, reviews, prices, etc.)
* There are mainly two ways to get data from a website: (1) Using the API provided by the website. Facebook, Twitter, YouTube, etc. all have their own APIs; (2) Directly accessing the HTML of the website, which is what we focus on in this workshop

<img src="attachment:webscraping.png" width="500">

## Ethics of Scraping...

The ethics of web scraping often get glossed over in tutorials, but they are an important topic. Let's take a couple of minutes and run these lines of code together:

In [None]:
ethical_scraper = "Just because I can, does not mean I should."
check_website_rules = "I swear on my honor that I will check the website rules before scraping it :)"

print(ethical_scraper, check_website_rules)

Joking aside, I would like to bring to your attention that there are guidelines for performing web scraping ethically:

1. Private data (not OK!) VS. Public data (OK!):
   If there is a password or other barriers put in place by the host site, it is likely private data. For example, it is not OK to scrape data from an online community where only logged-in users can access posts. The exception is if you use the API provided by the website and follow its rules.


2. Check the "robots.txt" file before you scrape a website by adding <code>/robots.txt</code> at the end of the website's root URL. Here is an example of a robots.txt file from the NYT: https://www.nytimes.com/robots.txt (the star after "User-agent" means "the following is valid for all robots"; the paths you are not allowed to scrape are listed after "Disallow"). Further info: https://www.robotstxt.org/robotstxt.html
    
                                                         
3. Read the website’s "Terms of Service" (ToS): these are the legal rules you agree to observe in order to use a service. Some people follow them, others do not. Violating the ToS exposes you to the risk of violating the CFAA ("Computer Fraud & Abuse Act"), a U.S. federal law.


4. If the website has an Application Programming Interface (API), use it:
    - Example on how to use the NYT API to scrape news: https://martinheinz.dev/blog/31
    - Further info on news media APIs: https://en.wikipedia.org/wiki/List_of_news_media_APIs
                                                                   

5. Take a look at the "hiQ Labs v. LinkedIn" lawsuit case:
https://www.abajournal.com/lawscribbler/article/scraping-a-public-website-isnt-a-crime


6. Some practical things to keep in mind while scraping to avoid being blocked:
    - be slow: web scraping consumes server resources of the website you are scraping, so make sure to use a conservative rate when making requests to a server (e.g. one request every 5 seconds, or a random interval)
    - save and store the content of what you scrape, so as to avoid scraping it again if you need it in the future
    - identify your scraper with a "User-Agent" string that tells the website you are scraping who is making the request and why. For example, you can add something like this in your code:
    - <code>user = { "User-Agent" : "News extractor for research project, email: youremail@gmail.com" }
    response = requests.get("put here the URL you want to scrape", headers = user)</code>
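
The robots.txt check from point 2 can also be done in code. Below is a minimal sketch using Python's standard library module `urllib.robotparser`. The robots.txt content, the scraper name, and the example.com URLs are made up for illustration; for a real website you would load its actual robots.txt file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written here only for illustration
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())  # parse the rules line by line

# can_fetch(user_agent, url) answers: may this robot request this URL?
allowed_news = parser.can_fetch("my-research-scraper", "https://example.com/news/today.html")
allowed_private = parser.can_fetch("my-research-scraper", "https://example.com/private/data.html")

# crawl_delay() returns the requested pause (in seconds) between requests, if any
delay = parser.crawl_delay("my-research-scraper")

print(allowed_news, allowed_private, delay)
```

Keep in mind that robots.txt is only one of the checks listed above: reading the Terms of Service is still up to you.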


## Load Packages

Run this block of code to import the libraries we need for the workshop. The main two are:

* <b>Requests</b> to interact with web pages and get data from them. It sends HTTP requests to web servers and allows us (humans) to access the response. See also https://requests.readthedocs.io/en/master/

* <b>Beautiful Soup (bs4)</b> to parse and extract the data. It allows us to navigate and extract data (i.e. the desired tags) from the HTML and other markup languages. See https://www.crummy.com/software/BeautifulSoup/ 

In [None]:
# web scraping packages
import requests                # to interact with websites and get data from them 
from bs4 import BeautifulSoup  # to navigate and extract data from websites
import requests.exceptions     # to handle requests errors

# standard packages you might use for other projects as well
#from requests.exceptions import HTTPError
import time
import random
import re
import pandas as pd
import numpy as np

print("All packages available and imported!")

## Requests and BeautifulSoup

We first use "requests" to interact with a URL, here https://linguistics.arizona.edu/peo-faculty, and store the response we get back. We then rely on "beautiful soup" to parse the response and extract the data. For more information on this process, check the end of this page under "Additional information."

In [None]:
# make a request to this webpage and store the data into a "response" variable

response = requests.get("https://linguistics.arizona.edu/peo-faculty")
print(type(response))

You can specify your "User-Agent" as you send your request.
Setting a custom User-Agent is not usually seen in tutorials, but, as ethical scrapers, we use it to make our intentions clear. Here I specified the goal of our web scraper and provided a way (e-mail) to be contacted by the website. Setting a custom User-Agent is also a good practice to avoid getting blocked.

In [None]:
# set up a User-Agent with some info (you might want to replace my email with yours)

headers = { "User-Agent" : "web scraper for teaching purpose (type your e-mail address here)" } 
response = requests.get("https://linguistics.arizona.edu/peo-faculty", headers = headers)

print("User Agent set up and request sent")
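
A detail that is easy to get wrong: the dictionary has to be passed with the `headers` keyword. Passed as `params` (or as the second positional argument of `requests.get`), it ends up in the URL's query string instead. The sketch below builds two requests without sending them, so we can inspect what would go out; the example.com URL is only a placeholder.

```python
import requests

url = "https://example.com/page"  # placeholder URL, nothing is sent to it
headers = {"User-Agent": "web scraper for teaching purpose (your e-mail here)"}

# Build the requests without sending them, so we can inspect them
as_headers = requests.Request("GET", url, headers=headers).prepare()
as_params = requests.Request("GET", url, params=headers).prepare()

print(as_headers.headers.get("User-Agent"))  # the string travels as an HTTP header
print(as_headers.url)                        # the URL is unchanged
print(as_params.headers.get("User-Agent"))   # None: no header was set
print(as_params.url)                         # the string ended up in the query string
```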

Another good practice is to check the response status code: 2xx codes bring good news, 4xx or 5xx codes mean errors. There are many status codes, see https://en.wikipedia.org/wiki/List_of_HTTP_status_codes. Here, it is likely that our request will be successful, so we should get a 200 code, which means that the server has accepted the request and returned the page. If you get an error, it might mean several things. One of the most common is that the webpage is not available: you might try later or skip that page. See Part B of Example 2 in this tutorial for more info on handling HTTP response errors.

In [None]:
# check our response status code

print("Our response code is:", response.status_code)

The response we get back is in bytes: decoding translates bytes into text we can read.
When the library "requests" gets the data, it decodes them for you by making an educated guess based on the response headers. The most common encoding on the web is UTF-8, which works well for English and most languages, but not all. You should always check the guessed encoding and, if needed, change it. Encoding issues are common problems you might encounter in web scraping.

In [None]:
# check our response default encoding

print("Response default encoding is:", response.encoding)

# you can also change the encoding
#response.encoding = 'latin-1'
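
To see why the encoding matters, here is a small illustration that needs no website: the same bytes are decoded once with the matching encoding and once with a mismatched one. The word "café" is just an example string.

```python
raw_bytes = "café".encode("utf-8")  # what travels over the network: bytes

decoded_right = raw_bytes.decode("utf-8")    # decoding with the matching encoding
decoded_wrong = raw_bytes.decode("latin-1")  # decoding with a mismatched encoding

print(raw_bytes)
print(decoded_right)
print(decoded_wrong)
```

If the text you scrape shows odd character pairs where an accented letter should be, a mismatched encoding is a likely cause.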

OK, at this point we got our response object from the Arizona server, and we stored it into a "response" variable. However, to access this data we first need to convert it into text (a string):

In [None]:
# turn our response object into text

response_txt = response.text
print(type(response_txt))

Now, we are ready to parse the string with BeautifulSoup:

In [None]:
# use BeautifulSoup to create a variable (that I called "soup") and parse it

soup = BeautifulSoup(response_txt, "html.parser")
print(type(soup))

Notice that these steps are always executed in the same order: find one (or more) websites, send a request, save the response you get back, and create a BeautifulSoup object to work with. "requests" comes first, because our code first needs to send a request to the web server of the page we want to collect data from. It is like knocking at the door of someone you want something from: "requests" knocks at the door of the web page and asks for its data, and the web page can say yes or no. If it says yes, then “BeautifulSoup” comes in, helping us navigate and extract data from the jungle of tags that our request brings back to our computer. And this is our next step...

Let's check what happens if we print our "soup" variable:

In [None]:
# use BeautifulSoup prettify() method to print the data (uncomment the line below to print)

#print(soup.prettify())

What we just printed is the HTML of our webpage, kind of its "skeleton." Raw HTML is quite hard to read. To navigate it, I'd suggest going back to the webpage and inspecting it using the web browser development tools. This way, we can see the exact same skeleton in a much easier-to-read display:

Step 1. Go to the website: https://linguistics.arizona.edu/peo-faculty. Then right click on it and select "inspect"

<img src="attachment:step1.png" width="800">


Step 2. If you are using Chrome as web browser, you will find a small box with an arrow icon (circled in blue in the image below). Use the arrow to navigate and select the tags. In Safari, there is a button on the search bar which looks like a small target.
![step2_safari.jpg](attachment:step2_safari.jpg)

Take some time to browse the website. As you do it, please notice that: (1) tags follow a tree-like structure and are nested within each other; (2) tags go in pairs: one on each end of the content that they include, there is a start tag (e.g. "title") and an end tag with a slash (e.g. "/title").

See the "Additional Information" section at the end of the page for more info on tags, classes, and ids.


## Common BeautifulSoup Methods

In the previous two steps, we have saved the entire HTML structure or "skeleton" of the webpage into a variable that we called "soup." In this section, we learn some of the key BeautifulSoup commands to extract information from our "soup"

In [None]:
# get_text() extracts all text (remember "soup" is a variable that we created)

print(soup.get_text())

In [None]:
# find() returns the FIRST instance of a tag (examples of tags: p, table, td, img, etc.)
# here we ask for the first paragraph <p> tag

print(soup.find('p'))

In [None]:
# get_text() extracts the text from a tag

first_p = soup.find('p').get_text() 
print(first_p)	

In [None]:
# note that using .text does the same thing: .text is a property calling the function get_text()

print(soup.find('p').text) 

In [None]:
# find_all() returns ALL instances of a tag on a given web page

print(soup.find_all('p'))

In [None]:
# find_all() by default gives us ALL tags of the kind we ask for
# but we can use it to get only the FIRST tag, or the SECOND tag, etc.

first_paragraph = soup.find_all('p')[0].get_text()  # in Python, index 0 is the 1st element
print(first_paragraph)

In [None]:
# to extract the text from each of these <p> tags at once, we need to loop over each tag

paragraphs = soup.find_all('p')

for paragraph in paragraphs:
    print(paragraph.get_text(), "\n")

In [None]:
# we can also save all <p> tags into a list

paragraphs = soup.find_all('p')

paragraphs_txt = []

for paragraph in paragraphs:
    paragraph_txt = paragraph.get_text()
    paragraph_txt = paragraph_txt.replace('\xa0',' ')
    paragraphs_txt.append(paragraph_txt)
    
print(paragraphs_txt)

In [None]:
# it is common to use find_all() to ask for all instances of an EXACT MATCH

# here we find all instances of this tag from our website: <td class="views-field views-field-realname">
# the <td> tag refers to a cell of a table, the rest are its class attributes

print ( soup.find_all('td', {'class': "views-field views-field-realname"} ))

In [None]:
# another syntax style that does the same thing

print (soup.find_all('td', class_="views-field views-field-realname"))
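
If the website is unreachable, the same methods can be tried on a small HTML string. The sketch below uses a made-up table that only mimics the structure of the faculty page (the names and rooms are invented), and stores the result in `sample_soup` so that our "soup" variable is left untouched.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML table that mimics the structure of the faculty page
sample_html = """
<table>
  <tr><td class="views-field views-field-realname"><a href="/user/a">Ada</a></td>
      <td class="views-field views-field-office">Room 1</td></tr>
  <tr><td class="views-field views-field-realname"><a href="/user/b">Bob</a></td>
      <td class="views-field views-field-office">Room 2</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_cell = sample_soup.find("td")     # a single tag: the first <td>
all_cells = sample_soup.find_all("td")  # a list-like collection of every <td>
name_cells = sample_soup.find_all("td", class_="views-field views-field-realname")

print(first_cell.get_text())
print(len(all_cells), len(name_cells))
print([cell.get_text() for cell in name_cells])
```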

## Example 1: Extract Contact Information From One Page

We are ready to tackle our first web scraping task! 

For this first example, we use the same website we have been using so far: https://linguistics.arizona.edu/peo-faculty. 
We take this website as a "toy" example to illustrate a common web scraping task: collecting contact information. You can apply the same logic to get contact information from any website (companies, NGOs, congress members, etc.)

We have already made a request with "requests" and parsed it with "BeautifulSoup," so we are ready to collect data. We begin by collecting the names of the faculty. Go to the webpage and use the development tools and the "inspect" option to check the tag that holds the names. You should discover that each faculty name is under an <code>a href</code> tag. This tag is nested under another tag, called <code>td class="views-field views-field-realname"</code>, which contains all contact information about each faculty such as name, title, email, telephone, office, etc. Notice that <code>td</code> is the actual tag (td indicates a cell of a table), whereas the other info is its specific class attributes. We want to include this information in our code to make sure we get that specific tag (one for each faculty member) and leave out other <code>td</code> tags.

The red box in the screenshot below shows the parent tag <code>td class="views-field views-field-realname"</code> and its child  <code>a href</code> tag for the first two faculty members:

<img src="attachment:names_tag.png" width="800">

In the code below, "soup" is the variable in which we have previously saved the entire "skeleton" of this webpage. To find the names, we need to examine the content nested under <code>td class="views-field views-field-realname"</code>

In [None]:
# this for loop prints all the given <td> tags and the tags nested under them

for row in soup.find_all('td',  attrs = {'class': 'views-field views-field-realname'}): 
    print(row.prettify())

Scroll down to the end of what we just printed. You will notice that our loop printed everything from the start of the <code>td</code> tag to its end <code>/td</code> and it did so for each faculty member (there is a blank line that separates each faculty member's block of information). Notice also that, since tags are nested, this <code>td</code> tag contains other tags like <code>a</code>, <code>br</code>, <code>strong</code> etc.

To get the <b>NAMES</b>, we need to recognize two things:
1. Names are stored under an <code>a</code> tag, followed by a bunch of specific information which, in this case, varies from person to person. For instance, the <code>a</code> tag for the first person is <code>a href="/user/diana-archangeli"</code>, for the second person is <code>a href="/user/andy-barss"</code> etc. 
2. For each faculty, there is more than one <code>a</code> tag under each <code>td</code> tag. The first <code>a</code> tag stores the names, the second stores the emails. To get the names, we want to take only the first <code>a</code> tag for each faculty and extract only the textual information (not the tag itself).

In [None]:
# get all names which are stored under the first <a> tag 

# set up an empty list
names = []

# loop over to get each name and append each of them to the list 
for row in soup.find_all('td',  attrs = {'class': 'views-field views-field-realname'}): 
    name = row.find_all('a')[0].text
    names.append(name)
    
# check results
print("\n", "List of all names of linguistics faculty members:", "\n", names)

<b>Exercise</b>: can you collect all the <b>EMAILS?</b> (hint: emails are stored under the 2nd 'a' tag)

In [None]:
# copy and paste here the code for the names, and modify it accordingly



In [None]:
# exercise solution: 

emails = []

for row in soup.find_all('td', {'class': 'views-field views-field-realname'}):         
    email = row.find_all('a')[1].text  # emails are under the 2nd <a> tag, so we change 0 to 1
    print(email)
    emails.append(email) 

print("\n", "List of all emails of linguistics faculty members:", "\n", emails)

Let's extract the <b>PHONE NUMBERS:</b> 

Start by inspecting the webpage to find where the phone numbers are stored. Compared to extracting names and emails, this task has two new challenges: 
1. The phone number is sometimes missing ("NA"), that is, not all faculty members on the website have a phone number. We could deal with missing values in many ways: here I use a *try/except* statement, and in the second example (below) we will use *if/else*.
2. The other thing to notice is that the tag is <code>&lt;strong&gt;TEL:&lt;/strong&gt;</code> but the actual phone number is outside the tag. This is an example of not so good HTML formatting. In theory, this "should not" happen, but in practice, you will find many examples of "bad" HTML. It just requires a bit of extra effort to find a way to extract the data. We deal with this using a new command <code>.next_sibling</code> which gets the next element (a tag or a piece of text) that comes right after the tag, under the same parent tag.

In [None]:
# get all phone numbers

phones = []

for row in soup.find_all('td', attrs = {'class': 'views-field views-field-realname'}):  
    try:
        phone = row.find('strong', text='TEL:').next_sibling  # new command .next_sibling
    except AttributeError:  # find() returned None: there is no "TEL:" tag for this person
        phone = "NA"
    phones.append(phone)
    
print(phones)
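
To see <code>.next_sibling</code> at work in isolation, here is a sketch on a made-up HTML snippet (names and phone number are invented): the first cell has a phone number, the second does not. It checks for the missing tag explicitly rather than with *try/except*, and it uses the `string` argument, which is the newer name of the `text` argument used above.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the first cell has a phone number, the second does not
sample_html = """
<td class="person"><a>Ada</a><br/><strong>TEL:</strong> 520-000-0000<br/></td>
<td class="person"><a>Bob</a><br/></td>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_phones = []
for cell in sample_soup.find_all("td", class_="person"):
    label = cell.find("strong", string="TEL:")  # None when there is no "TEL:" label
    if label is not None and label.next_sibling is not None:
        # the sibling is the text right after </strong>; strip the surrounding spaces
        sample_phones.append(str(label.next_sibling).strip())
    else:
        sample_phones.append("NA")

print(sample_phones)
```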

Let's put it all together and collect names, emails, and phones at once!

In [None]:
# collect names, emails, phones at once

contacts = []

for row in soup.find_all('td', {'class': 'views-field views-field-realname'}):
    
    # collect name
    name = row.find_all('a')[0].text
    
    # collect email
    email = row.find_all('a')[1].text
    
    # collect phone
    try:
        phone = row.find('strong', text='TEL:').next_sibling
        phone = re.sub('^ *', '', phone)  # remove the leading spaces
    except (AttributeError, TypeError):  # no "TEL:" tag, or nothing after it
        phone = "NA"
    
    contacts.append([name, email, phone])  

# print results line by line and alphabetically
for row in sorted(contacts):
    print(row)

To process the data and get them ready to be exported, we use the library "pandas" which turns the data into a dataframe and makes them easier to manipulate.

In [None]:
# save the results in a pandas dataframe 

df = pd.DataFrame(contacts)
display (df) # same as print(df) but nicely formatted

In [None]:
# rename our df (here I called it "ling_faculty") and rename its columns

ling_faculty = df.rename (columns = {0: 'name', 1: 'email', 2: 'phone'})
display (ling_faculty[0:3])

In [None]:
# export the collected data as csv ("ling_faculty.csv") using the "DataFrame.to_csv" function
# change the path to store the results on your computer

ling_faculty.to_csv (r'\Users\Sabrina\Desktop\ling_faculty.csv', encoding = 'utf-8', index = False)
print('Data exported!')

For further practice:
+ try to extend this code to collect other information available on the website (e.g. title, office, etc.)
+ another Python example for extracting contact information with requests and beautiful soup from congress.gov https://medium.com/@lobodemonte/congress-gov-web-scraping-with-beautifulsoup-37af19f2e1f4

## Example 2: Extract Movies Information

For this second example, we scrape movie information from this IMDb website: https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv (robots.txt page: https://www.imdb.com/robots.txt). This example is divided into two parts:
* PART A: we review what we have learned so far, by collecting movie data from one single page
* PART B: we learn how to collect movie data from multiple pages and how to handle HTTP errors 

This example has been adapted from https://medium.com/better-programming/how-to-scrape-multiple-pages-of-a-website-using-a-python-web-scraper-4e2c641cff8


## Part A: Extract Movies Information From One Page

##### STEP 1: make a request to the website

In [None]:
# set up the url and headers
movie_url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
headers = { "User-Agent" : "web scraper for classroom purposes (type your e-mail address here)" } 

# make a request
results = requests.get (movie_url, headers = headers) 
print(type(results))

##### STEP 2: "soup" our results

Now we use BeautifulSoup to store the web page's "skeleton" in our "soup" variable. We parse the data with an HTML parser (other parsers are available) and we can then print it. Remember to use <code>.text</code> to access the response content as a string (you can also transform the response into text directly in the previous step: it does not matter as long as you do it).

In [None]:
# soup the data

soup = BeautifulSoup(results.text, "html.parser")
#print(soup.prettify())

##### STEP 3: extract data from one page

There are a total of 50 movies for each page. For each movie, we extract the following five pieces of information: title, year, IMDb’s rating of the movie, director, genre. 

The first step is inspecting the webpage with the dev tools, as we did for Example 1:
* All information we want to extract for each movie is displayed under a specific section marked by the following "div" tag (see red box in the image below): <code>div class="lister-item-content"</code>. A "div" tag is a generic container and defines a section in a page. This specific "div" tag has class attribute: <code>"lister-item-content"</code>. We want to use this information to set up our loop! 
* Note that we could also take the above tag (shown in the first line of code in the image below): <code>div class = "lister-item mode-advanced"</code>. There is not a strict rule on this, but generally, I take the smallest possible section that still includes all the information I want to extract.

<img src="attachment:movie_inspect.png" width="900">

In [None]:
# this is the code to collect the data: the tags are different but the logic is the same as Example 1

# initialize empty lists, one for each piece of information we want to collect
titles = []
years = []
imdb_ratings = []
directors = []
genres = []

# use find_all() method to loop over the specific tag in our "soup", and grab the data 
for element in soup.find_all('div', class_='lister-item-content'):
   
        # title is under <a> tag which is nested within <h3> tag
        title = element.h3.find('a').text
        titles.append(title)
        
        # year is under <span> tag which is nested within <h3> tag
        year = element.h3.find('span', class_='lister-item-year').text
        years.append(year)

        # IMDb rating is under the only <strong> tag of each movie
        imdb = element.find('strong').text  
        imdb_ratings.append(imdb)
        
        # director is under the first <a> tag which is nested within the third <p> tag
        # we first grab the third <p> tag, and then the first <a> tag
        third_ptag = element.find_all('p')[2]
        director = third_ptag.find('a').text    
        directors.append(director)
        
        # genre is under <span> tag which is nested within <p> tag
        # we add a condition: if there is a genre grab it, otherwise add 'NA'
        # (we test the tag itself: calling .text on a missing tag would raise an error)
        genre_tag = element.p.find('span', class_='genre')
        genre = genre_tag.text if genre_tag else 'NA'
        genres.append(genre)

print("Data collected!")

# uncomment these lines to print the results
#print(titles)
#print(years)
#print(imdb_ratings)
#print(directors)
#print(genres)

Accounting for potential <b>missing data</b> is good practice. The above code showed how to add a condition using an <em>if/else</em> statement (in the first example, we did the same using <em>try/except</em>). Here we did it only for the genre variable, as an illustrative example. However, it is good practice to account for missing data for every element, especially when it is impossible to visually inspect the webpage.
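
One way to apply the same check to every element is a small helper function. The sketch below is only a suggestion: `text_or_na` is a hypothetical helper name, and the HTML is a made-up snippet in which the second movie has no genre.

```python
from bs4 import BeautifulSoup

def text_or_na(tag, default="NA"):
    """Return the stripped text of a tag, or a default when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default

# Made-up HTML: the second movie has no genre <span>
sample_html = """
<div class="lister-item-content"><h3><a>Movie A</a></h3><p><span class="genre">Drama</span></p></div>
<div class="lister-item-content"><h3><a>Movie B</a></h3><p></p></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_genres = []
for element in sample_soup.find_all("div", class_="lister-item-content"):
    # find() returns None when the tag is missing; the helper turns that into "NA"
    sample_genres.append(text_or_na(element.find("span", class_="genre")))

print(sample_genres)
```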

<b>Exercise</b>: expand the above code to collect more information from the website (e.g. runtime, score, gross income, etc.) and to account for missing data for every piece of information.

In [None]:
# copy/paste here the code at step 3 and modify it accordingly





##### STEP 4: put data into a dataframe and clean them using the library "pandas"

In [None]:
# create a pandas df and rename the columns

movies = pd.DataFrame({
'movie': titles,
'year': years,
'imdb': imdb_ratings,
'director': directors,
'genre': genres,
})

display(movies[0:10])

Clean our data of leftover tags and unwanted characters:

In [None]:
# remove \n (new line)
movies = movies.replace("\n","", regex=True)

# remove the parentheses from the year column (raw strings avoid invalid-escape warnings)
movies['year'] = movies['year'].replace(r"\(", "", regex=True)
movies['year'] = movies['year'].replace(r"\)", "", regex=True)

# print
display(movies[0:10])

## Part B: Extract Movie Information from Multiple Pages and Handle Request Errors

In this second part of the example, we learn how to get data for multiple pages iteratively, which is another common task in web scraping. This task adds some complexity in the preparation phase, in the "requests" phase (adding "time", "sleep", dealing with HTTP errors), and in the data extraction phase. 

##### STEP 1: generate a list of URLS to scrape

Since we want to collect data from multiple pages, rather than starting right away with requesting data, our first step becomes generating a list containing all the web pages we intend to scrape. To do so, we first explore how this specific webpage is structured (every webpage is a little different and we need to understand its structure to properly set up our scraper):

* Go to our home page: https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv
* Click on 'next page' to go to page 2: https://www.imdb.com/search/title/?groups=top_1000&start=51&ref_=adv_nxt
* Click on 'next page' to go to page 3: https://www.imdb.com/search/title/?groups=top_1000&start=101&ref_=adv_nxt
* Click on 'next page' to go to page 4: https://www.imdb.com/search/title/?groups=top_1000&start=151&ref_=adv_nxt

There are 50 movies on each page: page 1 goes from 1-50, page 2 from 51-100, page 3 from 101-150, etc. We use this info to generate a list of web pages to scrape: we first generate a list of numbers, and we then plug them into the address of each web page. 

To generate the numbers in steps of 50, we use the NumPy "np.arange" function with three arguments: start, stop, step <code>pages = np.arange(1, 1001, 50)</code> which says: start at 1, stop at 1001, take steps of size 50. Note that there are many other ways to generate numbers, for example using the "range" function. Why 1001? We have 1000 movies to scrape. The last page is at number 951 and displays movies from 951 to 1000. If we use 951 as a stop number, we exclude this last page.

In [None]:
# generate an array (using np.arange) of page numbers and convert it to list

pages_numbers = np.arange(1, 1001, 50).tolist()

print(pages_numbers)
print(type(pages_numbers))    

We now use the numbers that we have stored in the variable <code>pages_numbers</code> to generate the full URLs. We take the part of each URL that remains the same for all URLs (<code>base_url</code>), and use a loop to add to it a different number each time:

In [None]:
# initialize an empty list to store the urls
full_pages = []

# generate a list of urls
for page_number in pages_numbers: 
    base_url = "https://www.imdb.com/search/title/?groups=top_1000&start="
    page = (base_url + str(page_number) + "&ref_=adv_nxt") 
    print(page)
    full_pages.append(page)
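
As mentioned above, NumPy is not the only way to generate the numbers. A possible alternative, sketched here with Python's built-in `range` and a list comprehension, produces the same 20 URLs; the variable names `start_numbers` and `urls` are chosen only for this sketch.

```python
base_url = "https://www.imdb.com/search/title/?groups=top_1000&start="

# range(start, stop, step): the stop value (1001) is excluded, as in np.arange
start_numbers = list(range(1, 1001, 50))

# build all URLs in one list comprehension
urls = [f"{base_url}{start}&ref_=adv_nxt" for start in start_numbers]

print(len(urls))
print(urls[0])
print(urls[-1])
```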

In [None]:
# we take only the first five websites for this workshop (scraping takes time...)

for element in full_pages[0:5]:
    print(element)

##### STEP 2 (basic option): make a request to each URL

This part shows how to generalize the code to send multiple requests:

* The code takes each page from the list of pages generated in step 1 (<code>full_pages</code>) and loops over it to request data for each individual page. Here we only do the first five pages, but if you remove the <code>[0:5]</code> at the beginning of the loop, the code will run for all pages 
* For each page we print the <code>status_code</code> (remember 200 is good, others not so much) and the default <code>encoding</code> (usually utf-8)
* The <code>.raise_for_status()</code> line raises an <code>HTTPError</code> exception if the response has an error status code (4xx or 5xx); otherwise it returns <code>None</code>
* Since we send multiple requests, it is good practice to add a <code>time.sleep()</code> line which spaces out each request by suspending the execution for the given number of seconds (here we use a random number from 5 to 8 seconds)

In [None]:
# set up headers 
headers = { "User-Agent" : "web scraper for classroom purpose (type your e-mail address here)" } 

# set up an empty list to store all responses
requested_pages = []

# send a request to each page generated in step 1 ("full_pages")
for index, full_page in enumerate(full_pages[0:5]):            
    requested_page = requests.get(full_page, headers=headers)  # send a request
    print("Request", index, full_page)                         # check the requested page
    print("Status:", requested_page.status_code)               # check its status code
    print("Encoding:", requested_page.encoding)                # check its default encoding
    print("Errors:", requested_page.raise_for_status(), "\n")  # check for errors, "None" means no errors
    requested_page_txt = requested_page.text                   # convert to a text
    time.sleep(random.randint(5, 8))                           # space out each request
    requested_pages.append(requested_page_txt)                 

# check the total pages stored in our list
print("Total pages requested:", len(requested_pages))

# note: this code will take at least 30 seconds to execute

##### STEP 2 (advanced option): make a request to each URL + catch request errors

This step shows how to deal with request errors. It is optional: you can skip it and go directly to STEP 3.

We now redo STEP 2 and add code that catches request errors. Imagine that one of the five pages we just requested returns an error. The <code>.raise_for_status()</code> line from the previous code will detect that error, but it will also stop the code. For example, if we make five requests to five different pages and the error occurs in the third request, the fourth and fifth requests will not be executed. The code below extends the previous code to catch and handle some of the most common errors with a *try/except* statement, so that the loop keeps going after a failed request.

Let's create a fictitious list of pages, one of them containing an error:

In [None]:
# add a wrong url to our list of full_pages

error_page = ['https://www.something/that/does/not/exist.com']
full_pages_error = full_pages[0:2] + error_page + full_pages[2:5]

print("Total number of pages in this new dataset containing a fake url:", len(full_pages_error))
for element in full_pages_error:
    print(element)

Most common errors: 
* "404 not found": the page is not available
* "403 forbidden": the server rejected the request
* "500 internal server error": generic error that covers all unexpected conditions that prevent the request from going through
* "503 service unavailable": usually a temporary issue, e.g. server overloaded or under maintenance
* "504 gateway timeout": indicates a time issue (a response was not sent within the expected time frame)
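
To look up what a status code means, Python's standard library already knows the official names. The block below is a small sketch: the helper `describe` and the list `common_codes` are hypothetical names introduced only for illustration.

```python
from http import HTTPStatus

common_codes = [200, 403, 404, 500, 503, 504]

def describe(code):
    """Return the code, its official phrase, and a rough category."""
    status = HTTPStatus(code)
    if status < 400:
        kind = "ok"
    elif status < 500:
        kind = "client error"
    else:
        kind = "server error"
    return f"{code} {status.phrase} ({kind})"

for code in common_codes:
    print(describe(code))
```

Codes in the 400s usually point to a problem with the request itself, while codes in the 500s point to a problem on the server side.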

The code below illustrates some of these errors. It is good practice to go from specific to general and leave the most generic exception at the end. If the response is successful, no exception is raised and only the *try* block is executed. 

It is up to you to decide how many specific exceptions to catch. For example, you could have a single *except* clause for HTTPError and put everything else together in another category. Alternatively, you might want to catch even more specific exceptions than those illustrated in the code below! Here is the full list of "requests.exceptions": https://requests.readthedocs.io/en/master/_modules/requests/exceptions/ 
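
Why does the order of the *except* clauses matter? Python runs the first clause that matches, so a generic clause placed first would hide the specific ones. The sketch below uses Python's built-in exceptions as stand-ins (the function `classify` is a hypothetical name). The same logic applies to the requests library, where HTTPError, ConnectionError, and Timeout all inherit from RequestException.

```python
def classify(error):
    """Return which except clause handles `error`, checking specific types first."""
    try:
        raise error
    except FileNotFoundError:   # most specific
        return "file not found"
    except OSError:             # parent class of FileNotFoundError
        return "other OS error"
    except Exception:           # most generic, kept last
        return "unknown error"

print(classify(FileNotFoundError("missing.html")))
print(classify(PermissionError("no access")))
print(classify(ValueError("bad value")))
```

If the `except Exception` clause came first, all three calls would return "unknown error".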

In [None]:
# set up headers 
headers = { "User-Agent" : "web scraper for classroom purpose (type your e-mail address here)" } 

# set up an empty list to store all responses
requested_pages = []

# send a request to each page of which the third contains an error ("full_pages_error")
for index,full_page in enumerate(full_pages_error):             
    
    # try/except to catch requests errors that might occur
    try:
        requested_page = requests.get(full_page, headers = headers)
        print("Request", index, full_page)  
        print("Status:", requested_page.status_code)
        print("Encoding:", requested_page.encoding,  "\n")
        requested_page.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
        requested_page_txt = requested_page.text  
        time.sleep(random.randint(5, 8)) 
      
    # invalid HTTP response
    except requests.exceptions.HTTPError as http_err:
        print("HTTP error:", http_err, "\n")
        requested_page_txt = str(http_err)  # str(http_err.args) default format of .args is tuple

    # failed connection
    except requests.exceptions.ConnectionError as connection_err:
        print ("Connection Error:", connection_err, "\n")
        requested_page_txt = str(connection_err)  
        
    # request timed out
    except requests.exceptions.Timeout as timeout_err:
        print ("Timeout Error:", timeout_err, "\n")
        requested_page_txt = str(timeout_err)  
       
    # generic exception for other errors
    except requests.exceptions.RequestException as other_err:
        print ("Unknown Error:", other_err, "\n")
        requested_page_txt = str(other_err)  
        
    # append our results
    requested_pages.append(requested_page_txt) 
    
# check the total pages stored in our list
print("Total pages requested:", len(requested_pages))

# note: this code will take at least 30 seconds to execute

##### STEP 3: "soup" all results

Now we use BeautifulSoup to parse each web page and store its "skeleton" in a list called "souped_pages". We then extract the data as we did in PART A of this example. Note that the code below does not use <code>.text</code> to access the content of each "requested_page". This is because we already converted each response to text when we sent the requests (see above, line <code>requested_page_txt = requested_page.text</code>).

In [None]:
# after having requested the html pages, use BeautifulSoup to parse them

souped_pages = [] 

for requested_page in requested_pages[0:5]:    
    souped_page = BeautifulSoup(requested_page, 'html.parser') # no need for .text here because we already did it in STEP 2
    souped_pages.append(souped_page)
    
print("All pages souped!")

##### STEP 4: extract data from all URLs

Finally, we extract our data from all pages. The code below is the same code for Part A of this example. The only difference is that there is an additional loop because we want to run it for all pages instead of just for one page. 

In [None]:
# initialize empty lists, one for each piece of information we want to collect
titles = []
years = []
imdb_ratings = []
directors = []
genres = []

# new loop to loop over all the souped pages (here only the first 5)
for souped_page in souped_pages[0:5]:
    
    # same loop as part A of this example to loop over each <div> tag for each page
    for element in souped_page.find_all('div', class_='lister-item-content'):
   
        # title is under <a> tag which is nested within <h3> tag
        title = element.h3.find('a').text
        titles.append(title)
        
        # year is under <span> tag which is nested within <h3> tag
        year = element.h3.find('span', class_='lister-item-year').text
        years.append(year)

        #IMDb rating is under <strong> tag which is the only one
        imdb = element.find('strong').text  
        imdb_ratings.append(imdb)
        
        # director is under first <a> tag which is nested within the third <p> tag
        # we first grab the third <p> tag, then  we grab the first <a> tag
        third_ptag = element.find_all('p')[2]
        director = third_ptag.find('a').text    
        directors.append(director)
        
        # genre is under <span> tag which is nested within <p> tag
        # we add a condition: if there is a genre grab it, otherwise add 'N/A'
        genre_tag = element.p.find('span', class_='genre')
        genre = genre_tag.text if genre_tag is not None else 'N/A'
        genres.append(genre)
            
print("Done collecting data!")

# uncomment these lines to print the results
#print(titles)
#print(years)
#print(imdb_ratings)
#print(directors)
#print(genres)

##### STEP 5: put data into a dataframe, clean, and export them

In [None]:
# create a pandas df and rename the columns
movies = pd.DataFrame({
'movie': titles,
'year': years,
'imdb': imdb_ratings,
'director': directors,
'genre': genres,
})

# note that we now have 250 movies instead of 50 because we scraped 5 pages instead of one!
display(movies)

In [None]:
# clean data

# remove \n (new line)
movies = movies.replace("\n","", regex=True)

# remove the parentheses from the year column (raw strings avoid invalid escape warnings)
movies['year'] = movies['year'].replace(r"\(", "", regex=True)
movies['year'] = movies['year'].replace(r"\)", "", regex=True)

# print
display(movies[0:10])
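
Removing the parentheses still leaves the year as text. If you need it as a number, one option is to search for the four digits directly. The block below is a minimal sketch with Python's built-in `re` module: the function `extract_year` and the sample values are hypothetical, and it assumes that every year is written with four digits.

```python
import re

def extract_year(raw):
    """Return the first four-digit year found in `raw` as an int, or None."""
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None

# hypothetical raw values, including one with extra text and one empty
samples = ["(1994)", "(I) (2019)", ""]
print([extract_year(s) for s in samples])
```

Returning `None` for values without a year keeps missing data visible instead of raising an error halfway through the cleaning.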

In [None]:
# export the collected data as csv ("movies.csv")
# change the path to store them on your computer

movies.to_csv(r'\Users\Sabrina\Desktop\movies.csv', encoding='utf-8', index=False)

print("Data saved in the given directory: well done!")
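
The path above only works on one specific computer. A more portable option is to build the path with `pathlib`. The block below is a sketch that uses only the standard library (the `csv` module instead of pandas) and a temporary folder, so it can run anywhere; the row of data is made up. With pandas, a path built this way can be passed to <code>to_csv</code> in the same way as a string.

```python
import csv
import tempfile
from pathlib import Path

# hypothetical row standing in for the scraped data
rows = [{"movie": "Example Movie", "year": "1994"}]

# a temporary folder keeps the sketch self-contained;
# for a real export you might use Path.home() / "Desktop" instead
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "movies.csv"

with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["movie", "year"])
    writer.writeheader()
    writer.writerows(rows)

print("Saved to:", out_path)
```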

## To sum things up

In this workshop we learned to...

* inspect the HTML structure of a website with your browser's developer tools
* read the HTML language to select the tags to extract the data 
* use the library Requests to make a request to the server and get the page
* use the library BeautifulSoup to parse the downloaded html page and to extract information
* turn pages, account for missing data, handle request errors
* put the collected data into a pandas dataframe, clean them, and export them

## Common Web Scraping Challenges (and Solutions)

* <b>Variety</b>: every website is different. Even if there are general recurrent structures, pretty much every website requires a new project. Unfortunately, not every website has been built with logical formatting, which makes some of them more challenging to scrape
* <b>Change</b>: the same website might change over time, so you might find that your script from a few months ago does not work anymore. The good news is that, usually, it takes only a few changes to make it run again!
* <b>Limits</b>: some websites set a maximum amount of data you can scrape at once, for example 50 pages or 2500 articles. The solution is to break your requests into "chunks"
* <b>Messy</b>: the scraped data are usually a bit messy and need to be cleaned
* <b>Dynamic Scraping</b>: this is not really a challenge but something to keep in mind: many websites incorporate dynamic parts written in JavaScript. BeautifulSoup is not that good for dynamic scraping, but Scrapy and/or Selenium can help
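
Breaking requests into "chunks", as mentioned under Limits, can be sketched as follows. The function `chunks`, the example URLs, and the limit of 50 pages per batch are all hypothetical and only serve as an illustration.

```python
def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# hypothetical list of 120 page URLs and a hypothetical limit of 50 per batch
urls = [f"https://example.com/page/{n}" for n in range(1, 121)]
batches = list(chunks(urls, 50))
print([len(batch) for batch in batches])
```

Each batch can then be requested in a separate run, with a pause between runs.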

## Additional information

Here is some additional information about the topics we covered during the workshop

#### Making requests to web servers

* Computers talk to each other by making and receiving <b>data requests</b>: when you open a page, your web browser makes a request (what we did in our code with "requests.get") to the web server of that page and gets back a response. For example, if you type https://sociology.arizona.edu/news into your <b>web browser</b>, you are telling the Arizona <b>web server</b> that you would like to view the information stored at /news. The web server receives your request and sends back a response, i.e. a bunch of files that your browser transforms into a nice visual display that might include text, graphics, hyperlinks, etc.


* A <b>User-Agent</b> is a text string that your web browser sends every time you make a request to a website. It communicates info about your device type, operating system, and browser, which the server can use to prepare its response accordingly. Setting it matters for scraping because many websites block requests or do not show their content if the User-Agent is not set (for more info see https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01). To find out what your User-Agent is, type "what is my user agent" in the Google search bar. A Chrome User-Agent on Windows looks similar to this:

    <code>user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"</code>
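
To see how a User-Agent travels with a request, the sketch below builds a request object without sending it. It uses `urllib` from the standard library rather than the Requests library used in the workshop, and the name `example_headers` is hypothetical. With Requests, the equivalent is passing <code>headers=headers</code> to <code>requests.get</code>.

```python
from urllib.request import Request

example_headers = {"User-Agent": "web scraper for classroom purpose (type your e-mail address here)"}

# build the request object only; nothing is sent until it is opened
request = Request("https://sociology.arizona.edu/news", headers=example_headers)

# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
```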
    


#### The HTML "skeleton" of a webpage

A website is made of the following elements:

* <b>HTML</b>: the core element of a website, basically one or more text files written in HTML (Hypertext Markup Language). HTML is not a programming language like Python but a so-called "markup language": it uses a set of tags to organize the webpage (e.g. to make text bold, create paragraphs, or insert hyperlinks), and the markup itself is hidden when the page is displayed. This is basically the only thing that matters for "static web scraping" tasks, and for this tutorial
* <b>CSS</b>: Cascading Style Sheets, which add styling to make the page look nicer
* <b>JS</b>: JavaScript code, which adds interactivity to the page; you need "dynamic web scraping" techniques to interact with it
* <b>Images</b>: formats such as jpg and png allow webpages to show pictures; other common elements are videos and other multimedia

#### Tags
In web scraping, tags are fundamental because we use them to collect information from webpages. Some webpages use "direct formatting", others use "logical formatting", which is easier to scrape. Unfortunately (for us), not all webpages are "made well"; in fact, most of them are not. For example, they do not follow a logical structure or they lack descriptive elements. This means that we often have to get creative to scrape data from them. 

* tags follow a tree-like structure and are nested within each other
* most tags go in pairs, one on each end of the content they enclose: a start tag (e.g. "title") and an end tag with a slash (e.g. "/title"); a few tags, such as "img", have no end tag
* there are several tags, here is a list you could use as reference for your scraping projects: https://developer.mozilla.org/en-US/docs/Web/HTML/Element

<img src="attachment:html_tree.png" width="700">

Tags have commonly used names that depend on their position in relation to other tags:
* child: the tag inside another tag (e.g. the p tag is usually a child of the body tag)
* parent: the tag that contains another tag
* sibling: two tags are siblings if they are nested inside the same parent
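
The parent/child relations can be made visible with a few lines of code. The sketch below uses `html.parser` from the standard library instead of BeautifulSoup; the class `TreePrinter` and the tiny HTML page are hypothetical, and the sketch assumes that every tag in the page is properly closed.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Record each start tag together with its parent to expose the nesting."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.relations = []  # (child, parent) pairs

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.relations.append((tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

html = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
parser = TreePrinter()
parser.feed(html)
for child, parent in parser.relations:
    print(child, "is a child of", parent)
```

Here the h1 tag and the two p tags are siblings because they share the same parent, the body tag.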

#### Class and id attributes

Class and id are special attributes that specify more information about a given tag, usually a certain style. They are optional: they can be used on all tags, but not all tags have them. The same class can be shared by many elements, whereas an id is meant to identify a single element on the page.

In web scraping, class and id attributes are important because we can extract elements using them. Since they offer quite detailed information, they help to find the specific element we want to scrape. For instance, in our first example we have seen that all information about each faculty member was stored under the specific <code>td</code> tag that has class <code>"views-field views-field-realname"</code>. 

This screenshot, taken from the linguistics website we used as an example, shows a bunch of tags with their specific class information.

<img src="attachment:class.png" width="700">
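
To illustrate how a class narrows down the search, the sketch below collects the text of every <code>td</code> tag that carries a given class. It uses `html.parser` from the standard library so that it runs without extra installs; the class `ClassCollector` and the table row (names included) are made up. With BeautifulSoup, the equivalent would be <code>find_all('td', class_='views-field-realname')</code>.

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Collect the text of every <td> whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.inside = False  # True while we are inside a matching <td>
        self.found = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.wanted in classes:
            self.inside = True
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.found[-1] += data.strip()

# hypothetical table row; the name and e-mail are made up
html = ('<tr><td class="views-field views-field-realname">Jane Doe</td>'
        '<td class="views-field views-field-email">jane@example.edu</td></tr>')
collector = ClassCollector("views-field-realname")
collector.feed(html)
print(collector.found)
```

Both cells share the class "views-field", but only the first one also has "views-field-realname", so only its text is collected.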

## Credits

What is webscraping and ethics:
* webscraping image: https://blog.apify.com/what-is-web-scraping-1b548f8d6ac1
* ethics: https://www.empiricaldata.org/dataladyblog/a-guide-to-ethical-web-scraping

Requests and Beautiful Soup (suggested tutorials):
* Real Python BeautifulSoup tutorial: https://realpython.com/beautiful-soup-web-scraper-python/
* Real Python Requests tutorial: https://realpython.com/python-requests/
* DATAQUEST tutorial: https://www.dataquest.io/blog/web-scraping-tutorial-python/
* opensource.com tutorial: https://opensource.com/article/20/5/web-scraping-python
* Most common requests errors: https://justaskthales.com/us/5-common-http-error-codes-explained/

HTML language:
* Mike Hammond's lectures LING 508 Computational Linguistic Fall 2019 
* https://www.dataquest.io/blog/web-scraping-tutorial-python/
* https://en.wikipedia.org/wiki/Web_page#:~:text=The%20core%20element%20of%20a,CSS)%20code%20for%20presentation%20semantics.
* html tree structure image: https://www.researchgate.net/figure/HTML-source-code-represented-as-tree-structure_fig10_266611108

More on tags and CSS selectors (cheat sheets for webscraping tasks): 
* https://developer.mozilla.org/en-US/docs/Web/HTML/Element
* https://www.w3schools.com/html/html_elements.asp
* https://www.w3schools.com/css/default.asp

Examples:
* https://linguistics.arizona.edu/peo-faculty
* https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv
* https://medium.com/better-programming/how-to-scrape-multiple-pages-of-a-website-using-a-python-web-scraper-4e2c641cff8

This workshop is licensed under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) by Sabrina Nardin