# Analyze and fetch - knowing what you want from data and asking for it 

On this page I go through how to use Python to scrape the web with ``beautifulsoup`` and to retrieve data through APIs. Some of my examples are strongly influenced by Joel Grus's book "Data Science from Scratch", published by O'Reilly.

For the API retrievals, I will try to include examples from sites that return both ``JSON`` and ``XML``.

These walk-throughs do not aim to teach you the theory behind the examples. If you need more information to understand the code, I recommend that you seek out documentation or literature. The examples assume some knowledge of programming and of Python.

### Packages

If you do not already have ``beautifulsoup4``, ``requests`` and ``html5lib``, I strongly recommend that you start by taking the time to install them. It is very simple. From your terminal, for each package:
> ``pip install <packagename>``

Then we should be ready!
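
If you want to confirm that the installation worked before moving on, a small check like the one below may help. It is only a sketch: the ``packages`` mapping is my own illustration, and it relies on the fact that ``beautifulsoup4`` is imported under the name ``bs4``.

```python
import importlib.util

# pip name -> import name (beautifulsoup4 is imported as bs4)
packages = {
    "beautifulsoup4": "bs4",
    "requests": "requests",
    "html5lib": "html5lib",
}

# find_spec returns None when a top-level module cannot be found
missing = [pip_name for pip_name, import_name in packages.items()
           if importlib.util.find_spec(import_name) is None]

if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All packages are available.")
```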

In [None]:
# Imports
from bs4 import BeautifulSoup
import requests

In [None]:
# Getting data from webpage
url = '' # URL for the webpage you want content from
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')
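
The cell above assumes that the request succeeds. As a hedged sketch, the same steps can be wrapped in a small helper that fails loudly on a bad response. The name ``fetch_soup`` and the default ``timeout`` of 10 seconds are my own choices for illustration, not something the libraries prescribe.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                    # raise on 4xx/5xx responses
    return BeautifulSoup(response.text, 'html5lib')


# Usage, once you have filled in a real address:
# soup = fetch_soup(url)
```

Note that an empty string is not a valid URL, so ``requests`` raises an exception if the placeholder is left unfilled.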

### I have the content, now what?

I assume you looked into these examples because you need to do something with specific data from some webpage. Maybe you already know exactly what you need, but are unsure how to get it. Here are some tips and tricks for retrieving information from the ``Tag`` objects in your source HTML:

In [None]:
# Get first paragraph (<p> element)
first_p = soup.find('p')

# To work with the content as text, use the text property
first_p_text = soup.p.text
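
To try this without fetching a real page, here is a minimal sketch on a made-up HTML snippet. The ``sample_html`` string is invented for illustration, and I use Python's built-in ``html.parser`` so that the example runs even without ``html5lib``.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
sample_html = """
<html><body>
  <p id="intro">First <b>paragraph</b>.</p>
  <p>Second paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

first_p = soup.find("p")        # same element as soup.p
first_p_text = first_p.text     # text of the tag, including nested tags
first_p_id = first_p.get("id")  # attribute lookup, None if the attribute is missing

print(first_p_text)
print(first_p_id)

# find returns None when nothing matches, so check before using the result
heading = soup.find("h1")
if heading is None:
    print("No <h1> on this page")
```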

In [None]:
# Get all paragraphs
all_p = soup.find_all('p')
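
As a small sketch on made-up HTML (again parsed with the built-in ``html.parser``), this shows what ``find_all`` returns and that calling the soup object directly is shorthand for the same search:

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
sample_html = "<div><p>One</p><p>Two</p></div><p>Three</p>"
soup = BeautifulSoup(sample_html, "html.parser")

all_p = soup.find_all("p")         # list-like result with every <p> on the page
texts = [p.text for p in all_p]    # pull the text out of each tag
shorthand = soup("p")              # calling soup(...) is the same as find_all(...)

print(len(all_p))
print(texts)
```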

HTML often has tags nested inside one another. These can also be extracted with a little more work.

In [None]:
# Paragraphs with a specific class
important_p = soup('p', {'class' : 'important'})
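
The sketch below, on made-up HTML, compares the attribute-dictionary form with the ``class_`` keyword argument. Both match a tag when ``important`` is any one of its classes. The variable names are mine.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
sample_html = (
    '<p class="important">A</p>'
    '<p class="important note">B</p>'
    '<p class="note">C</p>'
    '<p>D</p>'
)
soup = BeautifulSoup(sample_html, "html.parser")

by_dict = soup("p", {"class": "important"})           # attribute dictionary
by_keyword = soup.find_all("p", class_="important")   # class_ avoids the reserved word

print([p.text for p in by_dict])
print([p.text for p in by_keyword])
```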

In [None]:
# Spans inside divs
span_in_div = [span
               for div in soup('div')    # for each div on the page
               for span in div('span')]  # find each span inside that div
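
Here is a hedged sketch of the same idea on made-up HTML, searching inside each ``div`` rather than the whole page. It also shows ``select`` with a CSS descendant selector as one possible alternative.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
sample_html = """
<div><span>inside 1</span><p><span>inside 2</span></p></div>
<span>outside</span>
<div><span>inside 3</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Search within each div, so spans outside any div are skipped
span_in_div = [span
               for div in soup("div")
               for span in div("span")]

# CSS selector: every span that is a descendant of a div
span_via_select = soup.select("div span")

print([span.text for span in span_in_div])
print([span.text for span in span_via_select])
```

If the page has divs nested inside other divs, the list comprehension can return the same span more than once, while ``select`` returns each matching span once.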