## Text Scraping & Visualization Example

In this notebook we will go through an example of how to download data from a website and visualize it. As a toy example, we will use <i>the publications page from the Crockett Lab website</i>.

First, we import the libraries we'll need:

In [None]:
# A module to open URLs
from urllib.request import urlopen

# A module to extract data from html files
from bs4 import BeautifulSoup

# A module to download files
import requests

# A module for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# A module for generating word clouds from text
import wordcloud

Next, we retrieve the HTML of the page and parse it:

In [None]:
# Get the html of the page
url = "http://www.crockettlab.org/publications"
html = urlopen(url)
type(html)

In [None]:
# Create a Beautiful Soup object from the html
soup = BeautifulSoup(html, 'lxml')
type(soup)

From this, we can get a number of attributes:

In [None]:
# Get the title
title = soup.title
print(title)

In [None]:
# Get the text
text = soup.get_text()
print(text)

We'd like to get the title of every paper listed on the website. All paper titles on the page are bolded, so let's find the elements in the 'soup' that are bolded (i.e. that use the strong tag):

In [None]:
# The paper titles
all_titles = soup.find_all('strong')

print(all_titles[:5])

In [None]:
# These don't look like strings (which usually have quotes)...
# What format is it?

type(all_titles[0])

In [None]:
# Let's loop through and convert each one to a string

titles = [] # Initialize an empty list
for title in soup.find_all('strong'): # For each title from our 'soup'...
    titles.append(str(title)) # convert to string and append to the empty list

# Did it work?
print(type(titles[0]))
print(titles[0])

In [None]:
print(titles[:10])

In [None]:
# The tags are left over! Let's remove them:

titles_clean = []
for title in titles:
    title = str(title).replace('<strong>', '')
    title = title.replace('</strong>', '')
    title = title.replace('<br/>', '')
    titles_clean.append(title)
    
print(titles_clean[:10])
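
As an aside, removing tags with `replace()` only works for the tags we thought to list. A possible alternative is to ask Beautiful Soup for the text inside each tag with `get_text()`. The sketch below uses a small made-up HTML snippet (`sample_html` is hypothetical, not the real page) so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real publications page
sample_html = """
<p><strong>First paper title<br/></strong> Author A (2019)</p>
<p><strong>Second paper title</strong> Author B (2020)</p>
"""

# Python's built-in parser is enough for this small example
sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text() keeps only the text inside each tag;
# strip=True trims surrounding whitespace
sample_titles = [tag.get_text(strip=True) for tag in sample_soup.find_all("strong")]
print(sample_titles)
```

On the real page, the same list comprehension could be applied to `soup.find_all('strong')`, assuming every bolded element is a paper title.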

Great! Now we have a list of titles of papers listed on the website. Let's visualize it using a word cloud!

In [None]:
# For the wordcloud function, we need to combine our list of titles into one long string.

all_words = '' # Initialize empty string
for title in titles_clean: 
    tokens = title.split(' ') # split each title into its words
    tokens = " ".join(tokens) # join the list of words into one long string
    all_words += tokens+" " # append the words from this title to the string of all titles

print(all_words[:50])
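
The loop above splits each title and immediately joins it again, so it can probably be shortened to a single `join` over the list. A minimal sketch, using two made-up titles (`example_titles` is a hypothetical stand-in for `titles_clean`):

```python
# Hypothetical titles standing in for titles_clean
example_titles = ["Harm to others", "Moral decision making"]

# Join all titles into one string, separated by single spaces
example_words = " ".join(example_titles)
print(example_words)
```

The only difference from the loop is that there is no trailing space at the end, which should not matter for the word cloud.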

In [None]:
# Create the word cloud from the string containing all titles
wcloud = wordcloud.WordCloud(width = 800, height = 800, 
                background_color ='white', 
                min_font_size = 10).generate(all_words) 

# plot the WordCloud image                        
plt.figure(figsize = (8, 8))
plt.imshow(wcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 

### Let's try to download these papers!

In [None]:
# We will again use the find_all() method of soup to extract useful html tags
# Some examples:
# < a > for hyperlinks
# < table > for tables
# < tr > for table rows
# < th > for table headers
# < td > for table cells

soup.find_all('a')

In [None]:
# Let's get the links only:

all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))
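
Some of these links are relative (they start with `/`) and some tags have no `href` at all, in which case `get("href")` returns `None`. One way to turn everything into full URLs is `urljoin` from the standard library. This is only a sketch: the `href` values below are made up for illustration.

```python
from urllib.parse import urljoin

# The page we scraped the links from
base_url = "http://www.crockettlab.org/publications"

# Hypothetical href values of the kind link.get("href") can return
example_hrefs = ["/s/paper.pdf", "https://www.dropbox.com/s/abc/paper.pdf", None]

# Skip missing hrefs; urljoin leaves absolute URLs unchanged
# and resolves relative ones against base_url
absolute_links = [urljoin(base_url, href) for href in example_hrefs if href]
print(absolute_links)
```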

We cannot download data from Dropbox without credentials, so we'll download only the papers that are hosted on the website itself (links starting with /s/):

In [None]:
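import os

# Make sure the output directory exists before writing to it
os.makedirs('crockettpubs', exist_ok=True)

i = 0
for link in all_links:
    file = link.get("href") # Get the link target
    if str(file).startswith('/s/'):
        print(i, file)
        # Request the file
        r = requests.get('http://www.crockettlab.org'+file, allow_redirects=True)
        # Save it to a local directory, closing the file when done
        with open('crockettpubs/'+str(i)+'.pdf', 'wb') as f:
            f.write(r.content)
        i += 1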
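
Before downloading anything, it can help to check which files would be fetched and where they would be saved. The sketch below separates that planning step from the network requests. `plan_downloads` is a hypothetical helper written for this example, and the `href` values are made up.

```python
from pathlib import Path

def plan_downloads(hrefs, base="http://www.crockettlab.org", out_dir="crockettpubs"):
    """Pair each site-hosted href with its full URL and a local file path."""
    # Keep only links hosted on the website itself (they start with /s/)
    hosted = [h for h in hrefs if h and h.startswith("/s/")]
    # Number the files in order, as in the download loop
    return [(base + h, Path(out_dir) / f"{i}.pdf") for i, h in enumerate(hosted)]

# Hypothetical hrefs: two hosted files, one missing href, one external link
plan = plan_downloads(["/s/a.pdf", None, "https://www.dropbox.com/x", "/s/b.pdf"])
for url, path in plan:
    print(url, "->", path)
```

Each `(url, path)` pair could then be passed to `requests.get` and written to disk, as in the loop above.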