# Intro to Web Scraping with Python

Welcome to Intro to Web Scraping with Python! This Tech Lab will cover the basics of web scraping, including getting the HTML for a page, parsing the HTML, and storing the results in a structured format. 

In this Tech Lab, you will scrape the locations of all of CRA's offices from its website. CRA's office locations can be found at [https://www.crai.com/locations/](https://www.crai.com/locations/).

To start, we'll import three packages. Those are:
 - requests, which is used for making HTTP requests and pulling the HTML from a website
 - Beautiful Soup (bs4), which is used to parse HTML and make it searchable and navigable
 - pandas, for storing the results of the web scrape in a tabular format

In [None]:
import requests
import bs4
import pandas as pd

### The Requests Package
The [Requests](https://docs.python-requests.org/en/latest/) package in Python provides Python users with a way to make HTTP requests. Using the [`get`](https://docs.python-requests.org/en/latest/api/#requests.get) function, users can make HTTP GET requests. The function returns a [`Response`](https://docs.python-requests.org/en/latest/api/#requests.Response) object, which contains the HTTP response.

For more information, see the [quickstart](https://docs.python-requests.org/en/latest/user/quickstart/) in the Requests documentation.

In [None]:
url = "https://www.crai.com/locations/"
r = requests.get(url)

The code above sends a GET request to `https://www.crai.com/locations/` and stores the response in the variable `r`.
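In practice, a request can hang or come back with an error page, so it is often worth wrapping the call. The sketch below is one possible approach, not part of the original lab: the helper name `fetch_html` and the 10-second default timeout are assumptions chosen for illustration. It relies only on the `timeout` argument of `requests.get` and the `raise_for_status` method of the `Response` object.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body text of `url`, raising an exception if the request fails.

    `fetch_html` and the default timeout are illustrative choices.
    """
    # Give up if the server does not respond within `timeout` seconds.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text


# Example call (needs network access, so it is left commented out here):
# html = fetch_html("https://www.crai.com/locations/")
```

All exceptions raised by Requests derive from `requests.exceptions.RequestException`, so a single `except` clause on that class is enough to catch connection errors, timeouts, and bad status codes.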

`Response` objects have a `.text` attribute that can be used to access the text of a response. In this case, since a website was requested, the text is presumably the HTML of the website that was requested. See below for a snippet of `r.text`.

In [None]:
print(r.text[:100])

Text returned as part of an HTTP response doesn't necessarily need to be HTML. For example, many web APIs return JSON (another text-based format) via HTTP. However, we can verify that the text is HTML based on the `<!doctype html>` declaration at the top of the text.
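That check can also be done programmatically. The following is a rough heuristic under stated assumptions, not a definitive test: the helper name `looks_like_html` is hypothetical, and it simply looks at the `Content-Type` header value and the first characters of the body.

```python
def looks_like_html(content_type, text):
    """Heuristically decide whether a response body is HTML.

    `content_type` is the value of the Content-Type header (or None),
    and `text` is the response body as a string.
    """
    header_says_html = "text/html" in (content_type or "").lower()
    # Only the first few characters are needed to spot a doctype or <html> tag.
    start = text.lstrip()[:20].lower()
    body_says_html = start.startswith(("<!doctype html", "<html"))
    return header_says_html or body_says_html


print(looks_like_html("text/html; charset=utf-8", "<!doctype html><html></html>"))  # True
print(looks_like_html("application/json", '{"offices": []}'))  # False
```

With a real response, the call would be `looks_like_html(r.headers.get("Content-Type"), r.text)`.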

### The Beautiful Soup Package

With the HTML pulled into Python, that HTML can now be scraped! The Beautiful Soup package facilitates the scraping of HTML by parsing the text and making it searchable and navigable from within Python.

To start using Beautiful Soup, simply convert the HTML into "soup" by using `bs4.BeautifulSoup`.

*Note: The Beautiful Soup package is published on PyPI as `beautifulsoup4` (Beautiful Soup 4) and is imported in Python as `bs4`. The associated pip install command is therefore `pip install beautifulsoup4`.*

In [None]:
soup = bs4.BeautifulSoup(r.text, "html.parser") #Name the parser explicitly to avoid a warning and get consistent results

We now have soup! 

Based on a brief analysis of CRA's "Locations" webpage, it looks like the name, address, country, and phone number of each CRA office are stored within a tag with class "LocationCardListing__content". We can use the [`find_all`](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all) function in Beautiful Soup to grab all such cards on the page. The `find_all` function looks through a tag's descendants (nested tags) and retrieves all descendants that match the given filters.

Below we use the `find_all` function to search for tags with the class value "LocationCardListing__content". To do that, we pass the string "LocationCardListing__content" to the `class_` parameter of the `find_all` function. 

In [None]:
cards = soup.find_all(class_="LocationCardListing__content")

Let's take a look at what `find_all` returns.

In [None]:
cards

`find_all` returns a list of all of the HTML tags that have class "LocationCardListing__content". While the result above just looks like HTML, we can inspect the returned list to get a little more information on the returned values.

In [None]:
print("There are {} location cards on the page.".format(len(cards)))
print()
print("Each value is {}.".format(type(cards[0])))

The `find_all` function found 22 tags with class "LocationCardListing__content". As we would expect from inspecting the website, each tag corresponds to a "location card" that contains the information for one location.
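The same behavior can be reproduced offline on a small, made-up HTML snippet. Everything in the sketch below other than the class names is an assumption for illustration: the `div` and `h3` tags, the "Example City" entry, and the variable names are invented and do not reflect CRA's actual markup.

```python
import bs4

# Hypothetical markup that reuses the class names discussed above.
sample_html = """
<div class="LocationCardListing__content">
  <h3 class="LocationCardListing__title">Boston</h3>
</div>
<div class="LocationCardListing__content">
  <h3 class="LocationCardListing__title">Example City</h3>
</div>
<div class="Footer">Not a location card</div>
"""

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")
sample_cards = sample_soup.find_all(class_="LocationCardListing__content")

print(len(sample_cards))               # 2 (the Footer div is not matched)
print(isinstance(sample_cards, list))  # True (the result behaves like a list)
print(isinstance(sample_cards[0], bs4.element.Tag))  # True
```

Because each element is a `Tag`, it can be searched again with the same methods that are available on the soup itself.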

*Note: Different webpages are structured differently, so it's not always the case that the HTML of interest will have a class that uniquely identifies it. In those situations, filtering by tag name, id, or another HTML attribute may be appropriate. For more information, see "Searching the Tree" in the Beautiful Soup documentation.*
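As a rough illustration of those alternative filters, the sketch below searches a made-up snippet by tag name, by id, and by an arbitrary attribute. The markup, the `data-region` attribute, and the variable names are all hypothetical; only the `find` and `find_all` calls come from Beautiful Soup.

```python
import bs4

# Hypothetical markup with no identifying classes.
sample_html = """
<ul id="office-list">
  <li data-region="americas"><a href="/boston">Boston</a></li>
  <li data-region="europe"><a href="/example-city">Example City</a></li>
</ul>
"""

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

links = sample_soup.find_all("a")                # filter by tag name
office_list = sample_soup.find(id="office-list")  # filter by id
europe = sample_soup.find_all(attrs={"data-region": "europe"})  # filter by any attribute

print([a["href"] for a in links])  # ['/boston', '/example-city']
print(office_list.name)            # ul
print(len(europe))                 # 1
```

The `attrs` dictionary is useful for attribute names such as `data-region` that are not valid Python keyword arguments.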

The descendants (nested tags) of the HTML of each "card" can be accessed by using the `find` function. The `find` function can be called directly from a `Tag` element, and returns only the first descendant that matches the passed search filter.

Below, the `find` function is used to pull the tag with class "LocationCardListing__title" from the first location card. Then, the `text` attribute of that tag is accessed to get just the associated text.

In [None]:
cards[0].find(class_="LocationCardListing__title").text

The result above matches the CRA website, where "Boston" is the first office listed on the Locations page.
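One detail worth keeping in mind is that `find` returns `None` when nothing matches, so accessing `.text` on the result of a failed search raises an `AttributeError`. The sketch below shows one way to guard against that on a made-up card; the helper name `text_or_none` and the sample markup are assumptions for illustration.

```python
import bs4

# Hypothetical card that has a title but no phone number.
card_html = """
<div class="LocationCardListing__content">
  <h3 class="LocationCardListing__title">Boston</h3>
</div>
"""
sample_card = bs4.BeautifulSoup(card_html, "html.parser").find(
    class_="LocationCardListing__content"
)


def text_or_none(parent, class_name):
    """Return the stripped text of the first matching descendant, or None."""
    tag = parent.find(class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


print(text_or_none(sample_card, "LocationCardListing__title"))  # Boston
print(text_or_none(sample_card, "LocationCardListing__phone"))  # None
```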

We can now generalize the code above to loop through all location cards and, within each card, through the four data points of interest: title, country, address, and phone number. Based on an inspection of the website, each of the four data points is uniquely identified by its class, so we can continue to use classes to pull the information we want.

Below, the text of each data point is stored in a dictionary for each location, and the location-specific dictionaries are then appended to a list.

In [None]:
classes = ["LocationCardListing__country", "LocationCardListing__title", 
            "LocationCardListing__address", "LocationCardListing__phone"]
data = []
for card in cards: #Loop through all location cards
    card_data = {}
    for c in classes: #For each class of interest (defined above)
        tag = card.find(class_=c) #Find the tag with the associated class
        if tag is not None: #If such a tag exists
            cname = c.replace("LocationCardListing__", "").title() 
            card_data[cname] = tag.text #Store the value/text associated with the tag
    data.append(card_data) #Append all of the location card data to the 'data' variable
    

With the scraped data stored in a list of dictionaries, that data can be passed directly to pandas to be visualized as a DataFrame.
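Since the loop above skips any data point whose tag is missing, some dictionaries may have fewer keys than others. The small offline sketch below, which uses made-up placeholder records rather than scraped values, shows how pandas handles that case: the columns are the union of all keys, and missing entries become `NaN`.

```python
import pandas as pd

# Placeholder records for illustration only (not real scraped data).
sample_data = [
    {"Title": "Boston", "Country": "United States", "Phone": "000-000-0000"},
    {"Title": "Example City", "Country": "Exampleland"},  # no "Phone" key
]

sample_df = pd.DataFrame(sample_data)

print(sample_df.columns.tolist())          # ['Title', 'Country', 'Phone']
print(sample_df["Phone"].isna().tolist())  # [False, True]
```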

In [None]:
df = pd.DataFrame(data)

In [None]:
df

### Now you try! 
Use what you've learned above to scrape employee details from CRA's "Our People" page (just for employees whose last name begins with the letter "A"). For each employee on the page, scrape their name, title, office, phone number, and e-mail.

As a reminder, a typical scraping structure would be:
 - Inspect the HTML of the webpage to identify how the webpage is structured.
 - Use requests to get the HTML of the page in Python.
 - Parse the HTML using Beautiful Soup.
 - Turn the data into tabular form using pandas.

In [None]:
people_url = "https://www.crai.com/our-people/?page=1&sort=role"

### (In your browser) inspect the HTML - No coding needed.


### Request the HTML


### Parse the data using Beautiful Soup





### Convert data into DataFrame
