# Webscraping without an API

## Scraping Tables



These are the packages we will need to begin:

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

We want to scrape the table from this website of box office data. To begin, we assign the URL to a variable called url.

In [None]:
url='https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=01&p=.htm'

The next set of commands tells Python to open the web page and store the information.

In [None]:
webpage = requests.get(url) ##opening the page and storing it to a variable called webpage
page_content=webpage.content ##storing the content of the page

If you like, you can save the raw page to your computer with the commands below. If you are downloading a lot of content, this will probably take up a lot of disk space, but it can be useful if you think you will want to go back and retrieve other content later.


In [None]:
with open('/Users/yotalao/Box Sync/2018-2019/CompSoc_Bootcamp/boxoffice.html', 'wb') as f: ##opening a file in binary mode ('wb'), because page_content is bytes rather than text
    f.write(page_content) ##writing the data to the saved html page; the with block closes the file for us
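
As a minimal sketch of why a saved copy can pay off later, the snippet below writes some bytes to disk and reads them back without making a new request. The sample bytes and the temporary directory are stand-ins chosen for illustration; in practice you would write page_content to a path of your choosing.

```python
import tempfile
from pathlib import Path

# Stand-in for webpage.content; a real page would be much longer.
sample_content = b"<html><body><table><tr><td>Movie</td></tr></table></body></html>"

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = Path(tmp_dir) / "boxoffice.html"
    saved_path.write_bytes(sample_content)  # save the raw bytes
    reloaded = saved_path.read_bytes()      # read them back later, no new request needed

print(reloaded == sample_content)
```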

Saving the raw webpage, however, is not necessary; we can retrieve and clean the data all in the steps below. First, we want to "beautify" the html code on the page, that is, parse it with our package BeautifulSoup so that it is easily searchable. This reveals the inherent structure of the page.

In [None]:
soup = BeautifulSoup(page_content, "lxml") ##beautifying it

This is what the html looks like before beautifying:

In [None]:
print(page_content) ##ugly

This is what the html looks like after:

In [None]:
print(soup) ##pretty
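
To make the idea of "structure" concrete, here is a small sketch on a made-up HTML snippet (not the box office page). It uses Python's built-in 'html.parser' instead of "lxml" so that it runs without extra installs; the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up page content, stored as bytes just like webpage.content.
sample_bytes = b"<html><body><h1>Box office</h1><p>Weekly <b>chart</b></p></body></html>"

sample_soup = BeautifulSoup(sample_bytes, 'html.parser')

# prettify() returns the parsed page with one tag per line, indented by depth.
print(sample_soup.prettify())

# Once parsed, tags can be looked up by name.
print(sample_soup.h1.get_text())
print(sample_soup.find('b').get_text())
```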

Next, we right-click the table we want and open Chrome's Inspect tool to figure out how to tell Python to retrieve that content. With the Inspect tool, we can see that our table is marked with a table tag. However, there are multiple table tags on the page, so we need to refer to the right one. Here it is the fifth table, which has index 4 because Python starts counting at 0.

In [None]:
table=soup.findAll('table')[4] 
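
Picking a table by position is fragile, because the index shifts whenever the site adds or removes a table. One alternative, sketched here on made-up HTML rather than the real page, is to pick the table whose first row contains a header you expect. The helper name find_table_with_header and the 'Title' header are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables; only the second holds the data we want.
sample_html = """
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td>Rank</td><td>Title</td><td>Gross</td></tr>
  <tr><td>1</td><td>Movie A</td><td>$100</td></tr>
</table>
"""

def find_table_with_header(soup, header_text):
    """Return the first table whose first row has a cell equal to header_text, else None."""
    for candidate in soup.find_all('table'):
        first_row = candidate.find('tr')
        if first_row is None:
            continue
        headers = [cell.get_text(strip=True) for cell in first_row.find_all('td')]
        if header_text in headers:
            return candidate
    return None

sample_soup = BeautifulSoup(sample_html, 'html.parser')  # built-in parser
sample_table = find_table_with_header(sample_soup, 'Title')
print(len(sample_table.find_all('tr')))
```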

So we've downsized the content to the table we need, but we still need to clean it up. First, we'll extract all the row data, ignoring the column header (we'll deal with that later).

In [None]:
rows = table.findAll('tr')[1:93] 

However, this is still messy. We now need to extract the data in each cell. We do this with two nested for loops: the outer loop goes through the rows, and the inner loop goes through the cells in each row.

In [None]:
movie_data =[] 
for row in rows: 
    row_clean=[] 
    row_cells=row.findAll('td') 
    for row_cell in row_cells: 
        row_cell_clean=row_cell.getText() 
        row_clean.append(row_cell_clean) 
    movie_data.append(row_clean) 
print(movie_data) 

This is a shorthand way to do the same thing:

In [None]:
movie_data=[[cell.getText() for cell in row.findAll('td')]
            for row in rows] ##getting text for each cell in cells extracted from each of the 92 rows

Now, we work on extracting the column names. We see that these are in the first row (tr) tag of the table, so we tell Python to retrieve that row, which holds the column header.

In [None]:
cols=table.findAll('tr')[0] 
col_cells=cols.findAll('td')
cols_clean=[] 
for col_cell in col_cells: 
    col_cell_clean=col_cell.getText()
    cols_clean.append(col_cell_clean)
print(cols_clean) 

This is a shorthand way to do the same thing:

In [None]:
cols = [cell.getText() for cell in table.findAll('tr')[0].findAll('td')] ##getting text for each cell in the header row; reading the row from table lets this cell be re-run safely
print(cols)

Next, we want to put it into a data frame to analyze later on.

In [None]:
df = pd.DataFrame(movie_data, columns=cols) ##putting this into a pandas data frame, specifying what the column headers are
##but uh oh, there's an error




It doesn't work, though. How do we fix it?
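
The error message is not shown here, so what follows is only one plausible line of diagnosis. A common cause is that the number of column names differs from the number of cells in the rows. The sketch below uses made-up stand-ins for cols and movie_data to compare the two counts before building the data frame.

```python
from collections import Counter

# Made-up stand-ins for cols and movie_data; the real scraped values may differ.
sample_cols = ['Rank', 'Title', 'Gross']
sample_rows = [
    ['1', 'Movie A', '$100', ''],  # an extra, empty cell
    ['2', 'Movie B', '$90', ''],
    ['Totals'],                    # a stray summary row
]

# Tally how many rows have each number of cells.
length_counts = Counter(len(row) for row in sample_rows)
widest = max(length_counts)

print('column names:', len(sample_cols))
print('cells per row:', dict(length_counts))
print('mismatch:', widest != len(sample_cols))
```

If the counts disagree, printing one of the offending rows next to the column names usually shows which cells are extra or missing.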

## Scraping other page content


We're going to use the same packages as before:

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

We're going to scrape news story headings from this website's science section, repeating many of the same steps as before.

In [None]:
url='https://www.treehugger.com/science/'

In [None]:
webpage = requests.get(url) 
page_content=webpage.content 

In [None]:
soup = BeautifulSoup(page_content, "lxml") ##beautifying it

Now we want to get the article content in the "latest stories" section. Using Inspect Element in Chrome, we see that this content sits under a `<section class="c-block c-block--cards">` tag. Each story is in an `<article class="c-article c-article--card">` tag, so we find these using Beautiful Soup.

In [None]:
latest=soup.findAll("article", class_="c-article c-article--card") 
print(latest)

Now we're going to clean this up into a list of dictionaries, one per story, that can easily be converted into a data frame.

In [None]:
latest_clean=[] ##this is going to become a list of dictionaries that can be easily converted into a pandas df
for i in latest:
    story={} ##creating a dictionary
    story['headline']=i.find(class_="c-article__headline").getText(strip=True) 
    story['category']=i.find(class_="c-article__category").getText(strip=True)
    story['author']=i.find(rel="author").getText(strip=True)
    story['pub_date']=i.find(class_="c-article__published").getText(strip=True)
    latest_clean.append(story)
print(latest_clean)
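
Note that find returns None when nothing matches, so calling getText on the result raises an AttributeError if a story lacks, say, an author. A more defensive variant is sketched below on made-up HTML; the helper name text_or_none is hypothetical, and the class names simply mirror the ones used above.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second story has no author link.
sample_html = """
<article class="c-article c-article--card">
  <h3 class="c-article__headline"> Story one </h3>
  <a rel="author">Jane Doe</a>
</article>
<article class="c-article c-article--card">
  <h3 class="c-article__headline">Story two</h3>
</article>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.getText(strip=True) if tag is not None else None

sample_soup = BeautifulSoup(sample_html, 'html.parser')  # built-in parser
sample_stories = []
for article in sample_soup.find_all('article', class_='c-article c-article--card'):
    sample_stories.append({
        'headline': text_or_none(article.find(class_='c-article__headline')),
        'author': text_or_none(article.find(rel='author')),
    })
print(sample_stories)
```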

We then convert it into a data frame:

In [None]:
df = pd.DataFrame(latest_clean)  ##converting to pandas dataframe
df
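
Each dictionary becomes one row, and each key becomes a column. As a small illustration on made-up records (not the scraped stories), pandas fills in a missing value when a dictionary lacks a key, which is worth checking for after scraping.

```python
import pandas as pd

# Made-up records; the second one has no 'author' key.
sample_records = [
    {'headline': 'Story one', 'author': 'Jane Doe'},
    {'headline': 'Story two'},
]

sample_df = pd.DataFrame(sample_records)
print(sample_df)

# Count how many rows are missing an author.
print(sample_df['author'].isna().sum())
```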

## A Brief Overview of Regular Expressions

When all else fails and you're having difficulty using Beautiful Soup to extract the content you need, regular expressions can be useful in finding data within text. Regular expressions are a language in their own right, so we don't have time to go over everything, but generally, they are expressions that find patterns in your text.

In the above example, we could find the variables in the html code by using regular expressions. This method is generally less structured than finding the data with Beautiful Soup, so there is more room for error, but it can be a useful strategy when used carefully.

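So, for example, in searching for the titles of each article in the Treehugger html code, we see that they are contained in a tag like this:

`<a class="gtm-track-click" data-click-action="Promo Title Click" data-click-category="Streams (Index Page)" href="/plastic/if-bpa-so-terrible-why-everybody-still-drinking-beer-and-pop-out-bpa-lined-cans.html"> If BPA is so terrible, why is everybody still drinking beer and pop out of BPA lined cans?</a>`

We can write a regular expression that directs Python to retrieve anything that is in between the text `html">` and `</a>`.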

First, we need to load the regular expression package.

In [None]:
import re

Then we use the re.findall function to find the titles. We will first test this on the small subset of data above, to make sure the regular expression is working properly. So first, we treat the small subset as a string and assign it to the variable "test." You can also test regular expressions in text editing programs like TextWrangler, by using the find function and clicking "grep."

In [None]:
test="""<a class="gtm-track-click" data-click-action="Promo Title Click" data-click-category="Streams (Index Page)" href="/plastic/if-bpa-so-terrible-why-everybody-still-drinking-beer-and-pop-out-bpa-lined-cans.html"> If BPA is so terrible, why is everybody still drinking beer and pop out of BPA lined cans?</a>"""
test
##we use three quotes at the beginning and end of the string so that the double quotes inside it do not end the string early

Now we are going to test out regular expressions in searching for text. ".*?" combines "." (any character except a newline), "*" (zero or more times), and "?" (which makes the quantifier lazy). A lazy quantifier matches as little text as possible, so it stops at the first "</a>" it encounters. By default, quantifiers are "greedy": they match as much text as possible and would run on to the last "</a>" on the line. We can use parentheses to tell Python to return only the text matched inside the parentheses. "\s" matches a single whitespace character; we do not want Python to return the space, so we place it outside of the parentheses.

In [None]:
titles = re.findall(r'html">\s(.*?)</a>', test) ##this tells it to return everything in between
##html"> and </a>; the r prefix makes this a raw string, so Python passes \s to the regex engine unchanged
print(titles)
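
To see why the lazy "?" matters, the sketch below runs the lazy and the greedy version of the pattern on a made-up line that contains two links. The sample string is an assumption for illustration, not taken from the Treehugger page.

```python
import re

# Made-up line with two links, each followed by a space before the title.
two_links = '<a href="/a.html"> First title</a> <a href="/b.html"> Second title</a>'

# Lazy: stops at the first </a> after each match start, so we get one title per link.
lazy = re.findall(r'html">\s(.*?)</a>', two_links)

# Greedy: runs on to the last </a> in the line, swallowing the markup in between.
greedy = re.findall(r'html">\s(.*)</a>', two_links)

print(lazy)    # ['First title', 'Second title']
print(greedy)  # ['First title</a> <a href="/b.html"> Second title']
```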

It worked, so now we will use it on the whole page. We convert our soup object to a string and search it to get all the titles on the html page.

In [None]:
soup_string=str(soup) ##converting the parsed page into one long string
titles=re.findall(r'html">\s(.*?)</a>', soup_string) ##same lazy pattern we tested above
print(titles)

## Exercise on your own

Try scraping the UCLA Faculty Webpage: https://soc.ucla.edu/faculty. Scrape each faculty member's name and at least three other variables (such as title, email, subfields, etc.) and clean the data into a pandas data frame.