# Webscraping without an API

## Getting Started

Encouraging story about not knowing how to do anything on the computer, to learning how to collect loads of Airbnb data; still relearning things all the time (had to learn this last week when I realized that I forgot how to scrape a table). Every website is different, so you have to figure out how to uniquely tailor your code to collect data from a website.

Before getting started, check the robots.txt file of the website you want to scrape and make sure there are no restrictions on the page you want to scrape (type the site's base url and then add /robots.txt). You can also check the terms and conditions. The robots.txt file lists any restricted pages as well as the crawl rate; if there is a crawl rate, you can respect it with the time.sleep() function in Python's "time" module.
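As a rough illustration of that check, here is a minimal sketch using Python's built-in urllib.robotparser module. The robots.txt text, the example.com urls, and the helper names `read_rules` and `pause_between_requests` are all made up for illustration; for a real site you would read the site's own robots.txt instead.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 20
Disallow: /private/
"""


def read_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser object."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def pause_between_requests(rules, agent="*", default_delay=1.0):
    """Sleep for the listed crawl delay, or a default if none is listed."""
    delay = rules.crawl_delay(agent)
    time.sleep(default_delay if delay is None else delay)


rules = read_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch("*", "https://www.example.com/science/")
blocked = rules.can_fetch("*", "https://www.example.com/private/page.html")
delay = rules.crawl_delay("*")
print(allowed, blocked, delay)
```

You would call `pause_between_requests(rules)` between page downloads; it is not called here so that the sketch does not actually wait 20 seconds.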

## Inspecting and Identifying Code

Go to the website: https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=01&p=.htm.

There are actually a number of tables hidden in this page, so we want to make sure we identify the table that we want. Right click and choose Inspect Element (on a Mac, Control-click). We can roll over the code to find our table. We can see that our table is under `<table border="0" cellspacing="1" cellpadding="5">`. You can also see that there are row tags "tr" and cell tags "td" under this table.
    
However, we need to find where in the code this table is so we can tell Python where to find it. We search for `<table` to see where it is located in the code. You can ignore green text, as these are just comments left by the web designer. We then see that there are four table tags before the table tag that we need, which makes ours the fifth table. Because Python's indices start at 0, it will be at index 4 in our code.

You can then use Beautiful Soup to search through tags to find content that you need. You can search by a range of tags with Beautiful Soup. See the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for more.
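To make the idea of searching by tag a little more concrete, here is a minimal sketch on a tiny made-up page (not the Box Office Mojo html). It uses Python's built-in "html.parser" so that it runs without the lxml package; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# A made-up page with two tables, standing in for a real web page.
SAMPLE_HTML = """
<html><body>
<table id="nav"><tr><td>Home</td></tr></table>
<table border="0" cellspacing="1" cellpadding="5">
  <tr><td>TW</td><td>Title</td></tr>
  <tr><td>1</td><td>Movie A</td></tr>
</table>
</body></html>
"""

soup_demo = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Option 1: collect every table tag and pick one by position (0-based).
tables = soup_demo.find_all("table")
by_position = tables[1]

# Option 2: search by an attribute that only our table has.
by_attrs = soup_demo.find("table", attrs={"cellpadding": "5"})

# Pull the text out of the cells in the first row of that table.
first_row_cells = [td.get_text() for td in by_position.find("tr").find_all("td")]
print(first_row_cells)
```

Searching by attribute can be less fragile than counting tables, since the position of a table changes whenever the site adds or removes another table above it.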

## Scraping Tables

These are the packages we will need to begin:

In [33]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

We want to scrape the table from this website of box office data, so to begin, we assign the url to a variable called url.

In [34]:
url='https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=01&p=.htm'

The next set of commands tells Python to open the web page and store the information.

In [35]:
webpage = requests.get(url) ##opening the page and storing it to a variable called webpage
page_content=webpage.content ##storing the content of the page
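If you want the download step to fail loudly when the site returns an error page, one possible sketch is below. The function name `fetch_html` and its `get` parameter are hypothetical additions for illustration (the `get` parameter only exists so the function can be tried out without a network connection); `timeout` and `raise_for_status()` are part of the requests package.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx codes
    return response.content


# Typical use (needs an internet connection, so it is left commented out):
# page_content = fetch_html('https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=01&p=.htm')
```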

If you like, you can store the raw data of the page onto your computer, with the below commands. If you are downloading a lot of content, this will probably take up a lot of space on your computer, but it can be useful if you think you will want to go back and retrieve other content later.


In [36]:
with open('/Users/yotalao/Box Sync/2018-2019/CompSoc_Bootcamp/boxoffice.html', 'wb') as f: ##opening a file to store the data; use the 'wb' option because page_content is bytes rather than text
    f.write(page_content) ##writing data to the saved html page; the file is closed automatically when the block ends

Saving the raw webpage, however, is not necessary; we can retrieve and clean the data all in the steps below. First, we want to "beautify" the html code on the page, so it is easily searchable with our package BeautifulSoup. This reveals the inherent structure of the page.

In [37]:
soup = BeautifulSoup(page_content, "lxml") ##beautifying it

This is what the html looks like before beautifying:

This is what the html looks like after:
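As a small stand-in for that comparison, the sketch below prints a tiny made-up snippet of html before and after parsing it. It assumes Beautiful Soup's prettify() method is an acceptable way to display the parsed structure, and it uses the built-in "html.parser" instead of "lxml" so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# A made-up one-line snippet, standing in for the raw page content.
raw_html = "<table><tr><td>1</td><td>Jumanji</td></tr></table>"

soup_demo = BeautifulSoup(raw_html, "html.parser")
pretty_html = soup_demo.prettify()  # one tag per line, indented by nesting level

print(raw_html)     # before: everything on a single line
print(pretty_html)  # after: the nested structure is visible
```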

Next, using Chrome's Inspect tool by right clicking on the table we want, we can figure out how to tell Python to retrieve that content. With the inspect tool, we can see that our table is marked with a table tag. However, there are multiple table tags, so we need to refer to the right table. With BeautifulSoup, we can tell Python to find the fifth table in the html code.

In [38]:
table=soup.findAll('table')[4] ##telling Python to find the 5th table (4, since Python indexes starting at 0)

So we've downsized the content to the table we need, but we still need to clean it up. First, we'll extract all the row data. We know from looking at the table that there are 92 rows. So ignoring the column header (we'll deal with that later), we use Beautiful Soup to tell Python to extract all rows.

In [39]:
rows = table.findAll('tr')[1:93] ##extracting all the rows only, will deal with the column headers later
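If you would rather not count the rows by hand, a possible alternative is to slice from index 1 to the end. The sketch below shows this on a small made-up table; on a real page that ends with summary or footer rows you may still need an explicit end index like the one used above.

```python
from bs4 import BeautifulSoup

# A made-up table: one header row followed by two data rows.
SAMPLE_TABLE = """
<table>
  <tr><td>TW</td><td>Title</td></tr>
  <tr><td>1</td><td>Movie A</td></tr>
  <tr><td>2</td><td>Movie B</td></tr>
</table>
"""

table_demo = BeautifulSoup(SAMPLE_TABLE, "html.parser").find("table")
all_rows = table_demo.find_all("tr")

header_row = all_rows[0]  # the first row holds the column names
data_rows = all_rows[1:]  # everything after it, however many rows there are

print(len(data_rows))
```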

However, this is still messy. We now need to extract the data in each cell. We do this using for loop statements.

In [40]:
movie_data =[] ##creating a list first, where all the clean row data will be stored
for row in rows: ##for each row in the list of rows (the trs that we originally extracted)
    row_clean=[] ##creating a list called row_clean, where all the clean cells will be stored
    row_cells=row.findAll('td') ##finding all the cells in each row first
    for row_cell in row_cells: ##then for each cell
        row_cell_clean=row_cell.getText() ##get the text only from each cell
        row_clean.append(row_cell_clean) ##append the row to include the clean cell
    movie_data.append(row_clean) ##append the movie data to include the new clean row
print(movie_data) ## can see now that the data is a list of lists

[['1', '1', 'Jumanji: Welcome to the Jungle', 'Sony', '$47,763,243', '-46.6%', '3,801', '+36', '$12,566', '$256,135,909', '$90', '3'], ['2', 'N', 'Insidious: The Last Key', 'Uni.', '$36,241,140', '-', '3,116', '-', '$11,631', '$36,241,140', '$10', '1'], ['3', '2', 'Star Wars: The Last Jedi', 'BV', '$31,311,982', '-62.8%', '4,232', '-', '$7,399', '$580,274,584', '-', '4'], ['4', '3', 'The Greatest Showman', 'Fox', '$19,649,496', '-33.6%', '3,342', '+26', '$5,880', '$82,753,868', '$84', '3'], ['5', '4', 'Pitch Perfect 3', 'Uni.', '$13,242,135', '-54.8%', '3,458', '-10', '$3,829', '$88,999,225', '$45', '3'], ['6', '14', "Molly's Game", 'STX', '$9,612,537', '+129.5%', '1,608', '+1,337', '$5,978', '$16,829,097', '-', '2'], ['7', '5', 'Ferdinand', 'Fox', '$9,258,976', '-55.1%', '3,156', '-181', '$2,934', '$72,028,094', '$111', '4'], ['8', '8', 'Darkest Hour', 'Focus', '$9,179,445', '-2.2%', '1,733', '+790', '$5,297', '$31,215,552', '-', '7'], ['9', '6', 'Coco', 'BV', '$7,000,479', '-50.6%', 

This is a shorthand way to do the same thing:

In [41]:
movie_data=[[cell.getText() for cell in row.findAll('td')]
            for row in rows]

Now, we work on extracting the column names. We see that this is in the first row tag of the table, so we tell Python to retrieve that row with the column header.

In [42]:
##now extracting column names
cols=table.findAll('tr')[0] #column names are in the first row of the table
col_cells=cols.findAll('td') ##extract the cell data
cols_clean=[] ##creating a list to store clean column names
for col_cell in col_cells: ##for each column in the header
    col_cell_clean=col_cell.getText()
    cols_clean.append(col_cell_clean)
print(cols_clean) ##can now see column names

['TW', 'LW', 'Title (Click to view)', 'Studio', 'Weekly Gross', '% Change', 'Theater Count / Change', 'Average', 'Total Gross', 'Budget*', 'Week #']


This is a shorthand way to do the same thing:

In [43]:
cols = [cell.getText() for cell in cols.findAll('td')] ##getting text for each cell in cells extracted from the row
print(cols)

['TW', 'LW', 'Title (Click to view)', 'Studio', 'Weekly Gross', '% Change', 'Theater Count / Change', 'Average', 'Total Gross', 'Budget*', 'Week #']


Next, we want to put it into a data frame to analyze later on.

In [44]:
df = pd.DataFrame(movie_data, columns=cols) ##putting this into a pandas data frame, specifying what the column headers are
##but uh oh, there's an error
##this is because, if we look closely at the webpage, the column headers make up 11 cells whereas there are 12 cells in each row of data
##this is because the single header 'Theater Count / Change' sits over two columns of data, so we'll fix this manually



AssertionError: 11 columns passed, passed data had 12 columns

It doesn't work, though. How do we fix it?

In [45]:
##just copy from printed output above and change
cols= ['TW', 'LW', 'Title (Click to view)', 'Studio', 'Weekly Gross', '% Change', 'Theater Count', 'Change', 'Average', 'Total Gross', 'Budget*', 'Week #']
df = pd.DataFrame(movie_data, columns=cols) ##trying again, it works!
df

| | TW | LW | Title (Click to view) | Studio | Weekly Gross | % Change | Theater Count | Change | Average | Total Gross | Budget* | Week # |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Jumanji: Welcome to the Jungle | Sony | $47,763,243 | -46.6% | 3801 | +36 | $12,566 | $256,135,909 | $90 | 3 |
| 1 | 2 | N | Insidious: The Last Key | Uni. | $36,241,140 | - | 3116 | - | $11,631 | $36,241,140 | $10 | 1 |
| 2 | 3 | 2 | Star Wars: The Last Jedi | BV | $31,311,982 | -62.8% | 4232 | - | $7,399 | $580,274,584 | - | 4 |
| 3 | 4 | 3 | The Greatest Showman | Fox | $19,649,496 | -33.6% | 3342 | +26 | $5,880 | $82,753,868 | $84 | 3 |
| 4 | 5 | 4 | Pitch Perfect 3 | Uni. | $13,242,135 | -54.8% | 3458 | -10 | $3,829 | $88,999,225 | $45 | 3 |
| 5 | 6 | 14 | Molly's Game | STX | $9,612,537 | +129.5% | 1608 | +1337 | $5,978 | $16,829,097 | - | 2 |
| 6 | 7 | 5 | Ferdinand | Fox | $9,258,976 | -55.1% | 3156 | -181 | $2,934 | $72,028,094 | $111 | 4 |
| 7 | 8 | 8 | Darkest Hour | Focus | $9,179,445 | -2.2% | 1733 | +790 | $5,297 | $31,215,552 | - | 7 |
| 8 | 9 | 6 | Coco | BV | $7,000,479 | -50.6% | 1894 | -210 | $3,696 | $193,543,440 | - | 7 |
| 9 | 10 | 7 | All the Money in the World | TriS | $5,244,151 | -44.5% | 2123 | +49 | $2,470 | $21,826,060 | - | 2 |
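Note that every value in this data frame is still text, including the dollar amounts. If you want to do arithmetic on a column later, one possible approach is sketched below on a two-row copy of the data. The new column name "Weekly Gross (USD)" is made up for illustration, and the sketch only handles cells that contain a dollar amount (not cells holding "-").

```python
import pandas as pd

# Two rows copied from the scraped table, kept as text like the original.
demo = pd.DataFrame({
    "Title": ["Jumanji: Welcome to the Jungle", "Insidious: The Last Key"],
    "Weekly Gross": ["$47,763,243", "$36,241,140"],
})

# Remove the dollar signs and commas, then convert the text to integers.
demo["Weekly Gross (USD)"] = (
    demo["Weekly Gross"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(int)
)

print(demo["Weekly Gross (USD)"].sum())
```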


## Scraping other page content


We're going to use the same packages as before:

In [46]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

We're going to scrape news story headings from this website's science section, repeating many of the same steps as before.

In [47]:
url='https://www.treehugger.com/science/'

In [48]:
webpage = requests.get(url) ##opening the page and storing it to a name
page_content=webpage.content ##keep in mind you would want to put a delay here if you were scraping more than one page 
##robots.txt says the crawl delay should be 20, so you would use a time.sleep(20) function
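A minimal sketch of what that delay could look like when scraping several pages is below. The function name `fetch_politely` and its `fetch` argument are hypothetical; `fetch` stands for whatever function you use to download one page (for example, one built on requests.get).

```python
import time


def fetch_politely(urls, fetch, delay_seconds=20):
    """Download each url in turn, pausing between requests."""
    pages = []
    for position, page_url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        pages.append(fetch(page_url))
    return pages


# Trying it out with a stand-in fetch function and no delay, so it runs instantly.
demo_pages = fetch_politely(
    ["https://www.example.com/page1", "https://www.example.com/page2"],
    fetch=lambda page_url: "content of " + page_url,
    delay_seconds=0,
)
print(demo_pages)
```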

In [49]:
soup = BeautifulSoup(page_content, "lxml") ##beautifying it

Now we want to get the article content in the "latest stories" section. Using Inspect Element in Chrome, we see this content is contained under the `<section class="c-block c-block--cards">` tag. Each story is in an `<article class="c-article c-article--card">` tag, so we find these using Beautiful Soup.

In [50]:
latest=soup.findAll("article", class_="c-article c-article--card") ##getting all the latest stories content


Now we're going to clean this up into a list of dictionaries that can easily be converted into a data frame.

In [51]:
latest_clean=[] ##this is going to become a list of dictionaries that can be easily converted into a pandas df
for i in latest:
    story={} ##creating a dictionary
    story['headline']=i.find(class_="c-article__headline").getText(strip=True) ##include strip=True because otherwise the text 
    ##will be returned with surrounding spaces and \n characters
    story['category']=i.find(class_="c-article__category").getText(strip=True)
    story['author']=i.find(rel="author").getText(strip=True)
    story['pub_date']=i.find(class_="c-article__published").getText(strip=True)
    latest_clean.append(story)
print(latest_clean)

[{'headline': 'Wolf howls, Mother Nature claps (video)', 'category': 'Endangered Species', 'author': 'Melissa Breyer', 'pub_date': 'September 18, 2018'}, {'headline': '12 things to know about the September equinox', 'category': 'Natural Sciences', 'author': 'Melissa Breyer', 'pub_date': 'September 18, 2018'}, {'headline': 'High yield farming may be better for biodiversity', 'category': 'Sustainable Agriculture', 'author': 'Christine Lepisto', 'pub_date': 'September 18, 2018'}, {'headline': 'How Seabins use marinas to collect trash and clean the ocean', 'category': 'Plastic', 'author': 'Sami Grover', 'pub_date': 'September 18, 2018'}, {'headline': 'Photo: Pretty plover plops into her nest', 'category': "Reader's Photos", 'author': 'Melissa Breyer', 'pub_date': 'September 18, 2018'}, {'headline': 'From bottles to bike lanes: the first PlasticRoad opens in the Netherlands', 'category': 'Plastic', 'author': 'Lloyd Alter', 'pub_date': 'September 17, 2018'}, {'headline': 'If BPA is so terrib
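One thing to watch for with this kind of loop: find() returns None when a story is missing one of the elements, and calling getText() on None raises an AttributeError. A possible way to guard against that is sketched below, using a made-up story card with no author. The helper name `text_or_none` is hypothetical.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, **find_kwargs):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.find(**find_kwargs)
    return element.get_text(strip=True) if element is not None else None


# A made-up story card that has a headline but no author link.
SAMPLE_CARD = """
<article class="c-article c-article--card">
  <h3 class="c-article__headline"> A made-up headline </h3>
</article>
"""

card = BeautifulSoup(SAMPLE_CARD, "html.parser").find("article")
story_demo = {
    "headline": text_or_none(card, class_="c-article__headline"),
    "author": text_or_none(card, rel="author"),  # missing, so this becomes None
}
print(story_demo)
```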

We convert it then into a data frame:

In [52]:
df = pd.DataFrame(latest_clean)  ##converting to pandas dataframe
df

| | author | category | headline | pub_date |
|---|---|---|---|---|
| 0 | Melissa Breyer | Endangered Species | Wolf howls, Mother Nature claps (video) | September 18, 2018 |
| 1 | Melissa Breyer | Natural Sciences | 12 things to know about the September equinox | September 18, 2018 |
| 2 | Christine Lepisto | Sustainable Agriculture | High yield farming may be better for biodiversity | September 18, 2018 |
| 3 | Sami Grover | Plastic | How Seabins use marinas to collect trash and c... | September 18, 2018 |
| 4 | Melissa Breyer | Reader's Photos | Photo: Pretty plover plops into her nest | September 18, 2018 |
| 5 | Lloyd Alter | Plastic | From bottles to bike lanes: the first PlasticR... | September 17, 2018 |
| 6 | Lloyd Alter | Plastic | If BPA is so terrible, why is everybody still ... | September 17, 2018 |
| 7 | Melissa Breyer | Reader's Photos | Photo: Tule elk in silhouette at sunset | September 17, 2018 |
| 8 | Katherine Martinko | Plastic | Sea turtles can die from eating just one piece... | September 14, 2018 |
| 9 | Melissa Breyer | Reader's Photos | Photo: Booted racket-tail hummingbird is the p... | September 14, 2018 |
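The pub_date column is stored as text. If you later want to sort or filter by date, one option is to convert it with pandas, as in the sketch below on a two-row copy of the data. The format string assumes every date looks like "September 18, 2018", which may not hold for every story on the site.

```python
import pandas as pd

# Two rows copied from the scraped stories, with dates still stored as text.
stories = pd.DataFrame({
    "headline": [
        "Photo: Tule elk in silhouette at sunset",
        "Wolf howls, Mother Nature claps (video)",
    ],
    "pub_date": ["September 17, 2018", "September 18, 2018"],
})

# Convert the text to real dates; %B is the full month name.
stories["pub_date"] = pd.to_datetime(stories["pub_date"], format="%B %d, %Y")

newest_first = stories.sort_values("pub_date", ascending=False)
print(newest_first)
```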


## A Brief Overview of Regular Expressions

When all else fails and you're having difficulty using Beautiful Soup to get the content you need, regular expressions can be useful in finding data within text. Regular expressions are a language in their own right, so we don't have time to go over everything, but generally, they are expressions that find patterns in your text. 

In the above example, we could find the variables in the html code by using regular expressions. This method is generally less structured than finding the data with Beautiful Soup, so there is more room for error, but can be a useful strategy when used carefully.

So, for example, in searching for the titles of each article in the Treehugger html code, we see that they are contained in a tag like this:

We can write a regular expression that directs Python to retrieve anything that is in between the text `html">` and `</a>`.

First, we need to load the regular expression package.

In [53]:
import re

Then we use the re.findall function to find the titles. We will first test this on the small subset of data above, to make sure the regular expression is working properly. So first, we treat the small subset as a string and assign it to the variable "test."

In [54]:
test="""<a class="gtm-track-click" data-click-action="Promo Title Click" data-click-category="Streams (Index Page)" href="/plastic/if-bpa-so-terrible-why-everybody-still-drinking-beer-and-pop-out-bpa-lined-cans.html"> If BPA is so terrible, why is everybody still drinking beer and pop out of BPA lined cans?</a>"""
test
##we use three quotes at the beginning and end of the string so the double quotes inside the html do not end the string early

'<a class="gtm-track-click" data-click-action="Promo Title Click" data-click-category="Streams (Index Page)" href="/plastic/if-bpa-so-terrible-why-everybody-still-drinking-beer-and-pop-out-bpa-lined-cans.html"> If BPA is so terrible, why is everybody still drinking beer and pop out of BPA lined cans?</a>'

Now we are going to test out regular expressions in searching for text. `.*?` is a regular expression that uses a lazy quantifier. It matches as few characters as possible, so it stops the first time it encounters the text that comes after it (as opposed to "greedy" quantifiers like `.*`, which match as many characters as possible and so run on to the last occurrence of that text). We can use parentheses to tell Python to return only the text that is matched inside the parentheses. `\s` matches a single whitespace character, such as a space; we do not want Python to return the space, so we include it outside of the parentheses. 

In [55]:
titles = re.findall(r'html">\s(.*?)</a>', test) ##this tells it to return everything in between
##html"> and </a>; the r prefix makes this a raw string so Python leaves the backslash in \s alone
print(titles)

['If BPA is so terrible, why is everybody still drinking beer and pop out of BPA lined cans?']


It worked, so now we will use it on the full html code stored in "soup" to get all the titles on the html page.

In [56]:
soup_string=str(soup)
titles=re.findall(r'html">\s(.*?)</a>', soup_string)
print(titles)

['Wolf howls, Mother Nature claps (video)', '12 things to know about the September equinox', 'High yield farming may be better for biodiversity', 'How Seabins use marinas to collect trash and clean the ocean', 'From bottles to bike lanes: the first PlasticRoad opens in the Netherlands', 'If BPA is so terrible, why is everybody still drinking beer and pop out of BPA lined cans?', 'Sea turtles can die from eating just one piece of plastic', 'Wayward narwhal adopted by a pod of belugas (video)', 'The view of Florence from space is a sobering thing ', 'Half-marathon in UK bans plastic water bottles', 'Toxic weed thriving on drought in Germany', 'The one thing missing from beach cleanups']
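To see why the choice between a lazy and a greedy quantifier matters, here is a minimal sketch on a made-up line of html that contains two links. The snippet and variable names are for illustration only.

```python
import re

# A made-up line of html with two links on the same line.
snippet = '<a href="a.html"> First</a><a href="b.html"> Second</a>'

# Lazy: stop at the first </a> after each match, so each title comes back separately.
lazy = re.findall(r'html">\s(.*?)</a>', snippet)

# Greedy: run on to the last </a> on the line, swallowing the html in between.
greedy = re.findall(r'html">\s(.*)</a>', snippet)

print(lazy)    # ['First', 'Second']
print(greedy)  # ['First</a><a href="b.html"> Second']
```

When each link sits on its own line the two patterns return the same thing, because `.` does not match a newline by default, but the lazy version is the safer habit.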


## Conclusion

What is the point if it takes so long to figure out how to scrape a page? You could probably do it by hand in the time it takes to figure out one page. However, the beauty of scraping is that once you figure out how to scrape one page, you can reuse that code for multiple pages with for loops, defining your own functions, etc. For example, on the Treehugger site, you could loop through all 503 pages of science articles, and move on to other sections, scraping all the stories there as well.
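As a rough sketch of that idea, the code below wraps the parsing step in a function and then loops over several pages. The function names and the two tiny html strings are made up for illustration; in practice each string would be the downloaded content of one page, with a pause between downloads.

```python
from bs4 import BeautifulSoup


def parse_headlines(html):
    """Return the headline text of every story found in one page of html."""
    page_soup = BeautifulSoup(html, "html.parser")
    found = page_soup.find_all(class_="c-article__headline")
    return [tag.get_text(strip=True) for tag in found]


def scrape_many(pages_html):
    """Apply parse_headlines to each page and combine the results into one list."""
    all_headlines = []
    for html in pages_html:
        all_headlines.extend(parse_headlines(html))
    return all_headlines


# Two made-up pages standing in for downloaded content.
PAGE_ONE = '<h3 class="c-article__headline"> Story one </h3>'
PAGE_TWO = (
    '<h3 class="c-article__headline"> Story two </h3>'
    '<h3 class="c-article__headline"> Story three </h3>'
)

headlines = scrape_many([PAGE_ONE, PAGE_TWO])
print(headlines)
```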

## Exercise on your own

Try scraping the UCLA Sociology faculty webpage: https://soc.ucla.edu/faculty. Scrape each faculty member's name and at least three other variables (such as title, email, subfields, etc.) and clean the data into a Pandas data frame.