# Getting data from the web: Scraping

Our trusty friend Pandas can read data directly from a web link.

We read the dataset into a dataframe without actually having the file in our folder!

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://corgis-edu.github.io/corgis/datasets/csv/classics/classics.csv')

In [None]:
df.sample()

In [None]:
# Plot the 20 most-downloaded titles as a horizontal bar chart
df.sort_values(by='metadata.downloads', ascending=True)[-20:].plot.barh(x='bibliography.title', y='metadata.downloads')

---

## requests

The above code is very handy, but what if we simply want to read content that is on the page rather than in a readily available file?

Go to the [Classics CSV File](https://corgis-edu.github.io/corgis/csv/classics/) webpage and use your browser's inspector to look at the HTML for the page. This is the HTML we discussed very briefly during Week 1.

We are going to get the entire web page using "requests" ([documentation](https://docs.python-requests.org/en/latest/)).  
* "Requests is an elegant and simple HTTP library for Python, built for human beings."

In [None]:
import requests

In [None]:
response = requests.get('https://corgis-edu.github.io/corgis/csv/classics/')

In [None]:
response

The number shown inside the response, as in `<Response [200]>`, is an HTTP status code: a numerical code that indicates whether a specific HTTP request has been successfully completed (see the [HTTP code list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)).

You may have run into a couple of these on other sites, or even while trying to log in to this JupyterHub!
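
To make the status codes a little more concrete, here is a minimal sketch that uses Python's standard-library `http.HTTPStatus` to turn a code into its standard phrase. The helpers `describe_status` and `is_success` are hypothetical names made up for this illustration. With a real `requests` response you would pass in `response.status_code`, which holds the integer code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code followed by its standard phrase, e.g. '404 Not Found'."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    return f"{code} {status.phrase}"


def is_success(code: int) -> bool:
    """Codes in the 2xx range mean the request succeeded."""
    return 200 <= code < 300


# A few codes you are likely to meet while scraping
for code in (200, 404, 500):
    print(describe_status(code), "| success:", is_success(code))
```

Checking the code before parsing `response.text` helps avoid accidentally scraping an error page.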

In [None]:
# Note that this won't actually get the csv file

response = requests.get('https://corgis-edu.github.io/corgis/csv/classics/classics.csv')

In [None]:
response

Try finding the above URL in your browser:  https://corgis-edu.github.io/corgis/csv/classics/classics.csv

In [None]:
response = requests.get('https://corgis-edu.github.io/corgis/csv/classics/')
print(response)

In [None]:
# The html of our desired corgis page:
response.text

In [None]:
print(response.text)

You can search through the HTML tags to find the content you are looking for:

In [None]:
# BeautifulSoup must be imported before it can be used
from bs4 import BeautifulSoup

# Save all the html in a string variable
html_string = response.text

# Use BeautifulSoup to create a new object that will allow you to search for HTML tags
document = BeautifulSoup(html_string, "html.parser")

# This "document" variable is an object that has a "find" method
document.find('a')

What is `<a href="...">`?

-> This is an HTML tag. So what are HTML "tags"?

HTML: Hyper-Text Markup Language

HTML uses "tags" to classify different elements, for example:
* `<h1>...</h1>`: a large header
* `<img src="...">`: an image
* `<a href="...">Deep Space Nine</a>`: a link

Let's look at a simpler website:
http://static.decontextualize.com/kittens.html

In [None]:
response = requests.get('http://static.decontextualize.com/kittens.html')

In [None]:
print(response.text)

Here the tag examples are:
* `<h1>Kittens and the TV Shows They Love</h1>`: a large header
* `<img src="http://placekitten.com/120/120">`: an image
* `<a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>`: a link

And you'll see additional tags:
* `<ul>`: unordered list
* `<li>`: list item
* `<head>` and `<body>`: the header information (metadata such as the page title) and the visible body of the document
* `<div>`: section of the document

There's a lot to learn about HTML, but this is mainly to show you examples of tags.  BeautifulSoup will let you parse HTML documents based on these tags.
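
As a minimal sketch of that tag-based parsing, the snippet below builds a tiny hand-written page (hypothetical, not one of the sites used here) and looks up tags with BeautifulSoup's `find` and `find_all`. It runs without a network connection, so it is a safe place to experiment.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical, for illustration only)
html_string = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Favorite shows</h1>
    <ul>
      <li><a href="https://example.com/one">Show One</a></li>
      <li><a href="https://example.com/two">Show Two</a></li>
    </ul>
  </body>
</html>
"""

document = BeautifulSoup(html_string, "html.parser")

first_link = document.find("a")      # the first matching tag, or None if there is none
all_links = document.find_all("a")   # a list of every matching tag

print(document.find("h1").text)             # text inside the header tag
print(first_link["href"])                   # value of the link's href attribute
print([link.text for link in all_links])    # text of every link
```

Note that `find` returns `None` when no tag matches, so it is worth checking the result before asking for `.text` or an attribute.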

---

Fun aside:  you can use the IPython library to visualize HTML right inside the Jupyter notebook.

In [None]:
from IPython.display import HTML
HTML('<img src="http://placekitten.com/110/110">')

In [None]:
%%HTML
<img src="http://placekitten.com/110/110">

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('Awf45u6zrP0')

---

Ok, enough cat silliness...

---

Back to our literary classics.

In [None]:
response = requests.get('https://corgis-edu.github.io/corgis/csv/classics/')
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [None]:
# We can look for the first link:

document.find('a')

In [None]:
# We can search for all the links on the page with:

document.find_all('a')

This now lets us find the download link for the CSV file.
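
Links on a page are often written relative to the page's own address, so picking out the CSV link usually takes two steps: filter the `href` values, then resolve them against the page URL. Here is a minimal sketch using the standard-library `urljoin`. The `hrefs` list is hypothetical and stands in for values you might collect from `document.find_all('a')`; the real page's links may look different.

```python
from urllib.parse import urljoin

page_url = "https://corgis-edu.github.io/corgis/csv/classics/"

# Hypothetical href values, standing in for what find_all('a') might give us
hrefs = [
    "../../index.html",
    "../../datasets/csv/classics/classics.csv",
    "#overview",
]

# Keep only the links that point at a .csv file and make them absolute
csv_links = [
    urljoin(page_url, href)
    for href in hrefs
    if href.lower().endswith(".csv")
]

print(csv_links)
```

An absolute URL built this way can be handed straight to `pd.read_csv`.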

# Two examples for practice

### Getting the script of Coco from IMSDB

In [None]:
import requests

In [None]:
response = requests.get('https://imsdb.com/scripts/Coco.html')

In [None]:
response

In [None]:
response.text

Woah, too much!

We can break it down with BeautifulSoup.

In [None]:
from bs4 import BeautifulSoup

In [None]:
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [None]:
document

`document` still displays the original HTML, but it is now a BeautifulSoup object that we can search rather than a plain string.

In [None]:
type(document)

In [None]:
document.find('pre')

There is still a lot of formatting to work through, but we have now found the screenplay text.
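
One way to start working through that formatting is to strip the indentation from each line and drop the empty lines. The sketch below does this on made-up placeholder text, not on the actual screenplay. `clean_lines` is a hypothetical helper name; with the real page you could pass it the text of the `pre` tag.

```python
def clean_lines(raw_text: str) -> list[str]:
    """Strip surrounding whitespace from each line and drop empty lines."""
    stripped = (line.strip() for line in raw_text.splitlines())
    return [line for line in stripped if line]


# Made-up placeholder text, not taken from any real script
sample = """
                    CHARACTER A
          First line of dialogue.

     A scene description with uneven indentation.
"""

for line in clean_lines(sample):
    print(line)
```

This keeps the words intact and only removes layout whitespace, which is usually enough for counting words or searching for a character's lines.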

In [None]:
print(document.find('pre').text)

### Grabbing data from GitHub

Let's look at the list of repositories that belong to the pandas organization (`pandas-dev`) on GitHub.

In [None]:
import requests

In [None]:
response = requests.get('https://github.com/orgs/pandas-dev/repositories')

In [None]:
response

In [None]:
html_string = response.text

In [None]:
html_string

In [None]:
from bs4 import BeautifulSoup

In [None]:
document = BeautifulSoup(html_string, "html.parser")

In [None]:
document

In [None]:
document.find("a")

In [None]:
document.find("a").attrs

In [None]:
document.find("a", attrs={'itemprop':'name codeRepository'})

In [None]:
document.find_all("a", attrs={'itemprop':'name codeRepository'})

In [None]:
for i in document.find_all("a", attrs={'itemprop':'name codeRepository'}):
    print(i.text.strip())

In [None]:
for i in document.find_all("a", attrs={'itemprop':'name codeRepository'}):
    print(i.attrs)

In [None]:
for i in document.find_all("a", attrs={'itemprop':'name codeRepository'}):
    print(i.text.strip() + ' : accessible at https://github.com' + i.attrs['href'])

### Ok, maybe three examples
The NYTimes has an organizational account on GitHub too:

In [None]:
response = requests.get('https://github.com/orgs/nytimes/repositories')
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")
for i in document.find_all("a", attrs={'itemprop':'name codeRepository'}):
    print(i.text.strip() + ' : accessible at https://github.com' + i.attrs['href'])

Suppose we are interested in their covid-19-data repo. Visit that repo page by clicking on the link above.

We can try to read their us-states.csv file directly into a dataframe.
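
A file's page on GitHub lives at a `/blob/` URL, which serves an HTML page rather than the file itself. The file contents are served from the matching `/raw/` URL. Here is a minimal sketch of that swap; `github_blob_to_raw` is a hypothetical helper name, not part of pandas or requests.

```python
def github_blob_to_raw(url: str) -> str:
    """Swap the first '/blob/' path segment of a GitHub file URL for '/raw/'."""
    if "/blob/" not in url:
        raise ValueError(f"not a GitHub blob URL: {url}")
    # Replace only the first occurrence, in case 'blob' appears later in the path
    return url.replace("/blob/", "/raw/", 1)


blob_url = "https://github.com/nytimes/covid-19-data/blob/master/rolling-averages/us-states.csv"
print(github_blob_to_raw(blob_url))
```

The resulting URL is the kind of link that `pd.read_csv` can read directly.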

In [None]:
# This will fail:
# df = pd.read_csv('https://github.com/nytimes/covid-19-data/blob/master/rolling-averages/us-states.csv')

# must replace "blob" with "raw" -> note the download link on the github page
df = pd.read_csv('https://github.com/nytimes/covid-19-data/raw/master/rolling-averages/us-states.csv')

In [None]:
df.head()

In [None]:
df.groupby('state')['cases'].sum()

In [None]:
df.groupby('state')['cases'].sum().sort_values().plot.barh(figsize=(8,10))