## Getting started with BeautifulSoup and Web Scraping

This notebook demonstrates scraping data from a website using Python and BeautifulSoup.  It is for demonstration purposes only: if you want to actually scrape data from a site, check the site's robots.txt file first to make sure scraping is allowed and you aren't breaking any rules.
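As a rough illustration of that robots.txt check, here is a minimal sketch using the standard library's `urllib.robotparser` on Python 3. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the helper name `is_allowed` are all made up for this example; a real check would load the target site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, url, user_agent='*'):
    '''Return True if the given robots.txt text lets user_agent fetch url.'''
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, 'http://example.com/stats/passing'))
print(is_allowed(SAMPLE_ROBOTS_TXT, 'http://example.com/private/data'))
```

For a live site you would normally call `set_url()` and `read()` on the parser rather than passing text in by hand. Passing robots.txt does not cover everything, so a site's terms of use are worth reading too.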

BeautifulSoup makes it souper easy to scrape websites (smh).  If you have not used this package before, you will probably need to install bs4.  To install it, open the command line on Windows or a shell on other systems and type the following (don't type the $):

>$ pip install bs4

Once that is installed and ready to go, start a new Python script and import the following packages.

In [1]:
from bs4 import BeautifulSoup, SoupStrainer
from urllib.request import Request, urlopen  # Python 3; on Python 2 this module was urllib2

In [2]:
def prep_soup(url):
    '''Open url and return a BeautifulSoup object ready for parsing.  Do any cleanup of the raw HTML here.'''
    data = urlopen(url).read()  # raw page content as bytes
    data = data.replace(b'&nbsp;', b' ')  # swap non-breaking-space entities for plain spaces
    # Translation table: keep ASCII bytes (0-127), turn every other byte (128-255) into a space
    trans_table = bytes(range(128)) + b' ' * 128
    data = data.translate(trans_table)
    unicode_data = data.decode('utf-8')  # safe to decode: only ASCII bytes remain
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(unicode_data, 'html.parser')
    for elem in soup.find_all(['script', 'style']):
        elem.extract()  # remove all JS and CSS so only the page content is left
    return soup
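The byte-cleaning step inside `prep_soup` is the least obvious part, so here it is on its own as a small sketch for Python 3 bytes. The helper name `ascii_only` and the sample string are invented for illustration. Note that this approach throws away every non-ASCII character, which is a blunt choice rather than a general recommendation.

```python
def ascii_only(raw):
    '''Return raw bytes with every non-ASCII byte replaced by a space.'''
    # 256-entry table: bytes 0-127 map to themselves, bytes 128-255 map to a space
    table = bytes(range(128)) + b' ' * 128
    return raw.translate(table)

# Made-up sample containing an HTML entity and two accented characters
sample = 'caf\u00e9 &nbsp; men\u00fc'.encode('utf-8')

cleaned = ascii_only(sample.replace(b'&nbsp;', b' '))
print(repr(cleaned.decode('ascii')))
```

Each accented character takes two bytes in UTF-8, so each one becomes two spaces in the cleaned text.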

In [3]:
u = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=2015&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'
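The URL above packs several settings into its query string. As a quick, optional sketch, the standard library can split it apart so you can see what is being requested (Python 3 shown). The code only lists the parameters and does not assume what the `d-447263-*` ones mean.

```python
from urllib.parse import urlsplit, parse_qs

u = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=2015&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'

# parse_qs returns a dict that maps each parameter name to a list of values
params = parse_qs(urlsplit(u).query)
for name, values in sorted(params.items()):
    print(name, '=', values[0])
```

Changing a value such as `season` and rebuilding the URL is one way to request a different page, assuming the site accepts that value.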

In [4]:
print(prep_soup(u).prettify())

<!DOCTYPE html>
<!--[if lt IE 7 ]> <html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" class="ie ie6"> <![endif]-->
<!--[if IE 7 ]>    <html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" class="ie ie7"> <![endif]-->
<!--[if IE 8 ]>    <html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" class="ie ie8"> <![endif]-->
<!--[if IE 9 ]>    <html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" class="ie ie9"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!-->
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <!--<![endif]-->
 <head>
  <!-- nfl_combo_enabled: true -->
  <title>
   NFL Stats: by Player Category
  </title>
  <!-- BEG