The web is a rich source of data from which you can extract various types of insights and findings. In this chapter, you will learn how to get data from the web, whether it be stored in files or in HTML. You'll also learn the basics of scraping and parsing web data.

# Importing flat files from the web


## You’re already great at importing!
- Flat files such as .txt and .csv
- Pickled files, Excel spreadsheets, and many others!
- Data from relational databases
- You can do all these locally
- What if your data is online?

## You’ll learn how to…
- Import and locally save datasets from the web
- Load datasets into pandas DataFrames
- Make HTTP requests (GET requests)
- Scrape web data such as HTML
- Parse HTML into useful data (BeautifulSoup)
- Use the urllib and requests packages

## The urllib package
- Provides an interface for fetching data across the web
- urlopen() - accepts URLs instead of file names
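To make the "URLs instead of file names" point concrete, here is a minimal sketch that reads the same bytes once with `open()` and once with `urlopen()`. So that it runs without a network connection, it assumes a throwaway local file addressed through a `file://` URL; the file name and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file standing in for a resource on the web.
    path = Path(tmp) / "sample.txt"
    path.write_text("hello from a file\n", encoding="utf-8")

    # Classic approach: pass a file name to open().
    with open(path, "rb") as f:
        from_open = f.read()

    # Same data, but addressed by URL (file://...) and read through urlopen().
    with urlopen(path.as_uri()) as response:
        from_urlopen = response.read()

print(from_open == from_urlopen)
```

Swapping the `file://` URL for an `https://` one is the only change needed to read a remote resource the same way.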

## How to automate file download in Python

In [2]:
from urllib.request import urlretrieve

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'

urlretrieve(url, 'winequality-white.csv')

('winequality-white.csv', <http.client.HTTPMessage at 0x7fa54d713c18>)
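The output above shows that `urlretrieve()` returns a tuple of the saved file name and the response headers. The sketch below unpacks that tuple and then reads the saved file. It makes two assumptions: a local `file://` URL stands in for the remote one so the example runs offline, and the file is semicolon-delimited (as the wine-quality CSVs are). The sample rows and file names are invented, and the standard-library `csv` module is used here in place of pandas only to keep the sketch dependency-free.

```python
import csv
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a remote, semicolon-delimited file.
    source = Path(tmp) / "remote_sample.csv"
    source.write_text("name;score\nada;6\nbob;5\n", encoding="utf-8")

    # urlretrieve() copies the resource to the given path and returns
    # (local file name, headers).
    destination = str(Path(tmp) / "local_copy.csv")
    saved_name, headers = urlretrieve(source.as_uri(), destination)

    # Read the local copy, telling the parser that ';' separates the fields.
    with open(saved_name, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter=";"))

print(rows)
```

With pandas installed, the equivalent of the last step would be `pd.read_csv(saved_name, sep=';')`.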

---
# Let’s practice!

# HTTP requests to import files from the web

## URL
- Uniform/Universal Resource Locator
- References to web resources
- Focus: web addresses
- Ingredients:
    - Protocol identifier - http:
    - Resource name - datacamp.com
- These specify web addresses uniquely
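The two ingredients listed above can be pulled apart with `urllib.parse.urlparse` from the standard library. This is a small sketch; the variable names are only illustrative.

```python
from urllib.parse import urlparse

parts = urlparse("http://datacamp.com")

protocol = parts.scheme   # the protocol identifier, without the trailing ':'
resource = parts.netloc   # the resource name (host)

print(protocol, resource)
```

For a longer address such as `https://www.wikipedia.org/`, the same call also exposes the path (`/`) as `parts.path`.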

## HTTP
- HyperText Transfer Protocol
- Foundation of data communication for the web
- HTTPS - more secure form of HTTP
- Going to a website = sending HTTP request
    - GET request
- `urlretrieve()` performs a GET request
- HTML - HyperText Markup Language
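To see a GET request and its response without touching the public internet, the following sketch starts a tiny throwaway web server on the local machine and requests a page from it. Everything about the server (the `HelloHandler` class, the page content, the use of port 0) is an assumption made for the demonstration, not part of the course material.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

PAGE = b"<html><body><p>Hello</p></body></html>"  # made-up page content


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET request with the same small page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # silence the per-request log line


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    # An empty ProxyHandler skips any proxy set in the environment (the server is local).
    opener = build_opener(ProxyHandler({}))
    url = f"http://127.0.0.1:{server.server_port}/"
    with opener.open(url, timeout=5) as response:
        status = response.status                           # HTTP status code
        content_type = response.headers.get_content_type()
        body = response.read()                             # the HTML, as bytes
finally:
    server.shutdown()
    server.server_close()

print(status, content_type, body)
```

The response carries a status code and headers in addition to the HTML body, which is what a browser receives when you visit a page.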

## GET requests using urllib

In [3]:
from urllib.request import urlopen, Request

url = "https://www.wikipedia.org/"

request = Request(url)

response = urlopen(request)

html = response.read()

response.close()

In [4]:
html

b'<!DOCTYPE html>\n<html lang="mul" class="no-js">\n<head>\n<meta charset="utf-8">\n<title>Wikipedia</title>\n<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">\n<![if gt IE 7]>\n<script>\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)no-js(\\s|$)/, "$1js-enabled$2" );\n</script>\n<![endif]>\n<!--[if lt IE 7]><meta http-equiv="imagetoolbar" content="no"><![endif]-->\n<meta name="viewport" content="initial-scale=1,user-scalable=yes">\n<link rel="apple-touch-icon" href="/static/apple-touch/wikipedia.png">\n<link rel="shortcut icon" href="/static/favicon/wikipedia.ico">\n<link rel="license" href="//creativecommons.org/licenses/by-sa/3.0/">\n<style>\n.sprite{background-image:url(portal/wikipedia.org/assets/img/sprite-6e35f464.png);background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-6e3
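The leading `b` in the output above means `response.read()` returned bytes, not a string. A minimal sketch of turning such bytes into text follows, assuming the page is UTF-8 encoded (the `<meta charset="utf-8">` tag in the output suggests it is). The `raw` value is a shortened copy of the start of that output.

```python
# Shortened copy of the bytes returned by response.read() above.
raw = b'<!DOCTYPE html>\n<html lang="mul" class="no-js">\n<head>\n<meta charset="utf-8">\n<title>Wikipedia</title>'

# Decode the bytes into a regular Python string.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)
print(text.splitlines()[0])
```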

## GET requests using requests

Used by “Her Majesty's Government, Amazon, Google, Twilio, NPR, Obama for America, Twitter, Sony, and Federal U.S. Institutions that prefer to be unnamed”

In [12]:
#!pip install requests

In [10]:
import requests
url = "https://www.wikipedia.org/"
r = requests.get(url)
text = r.text

In [11]:
text

'<!DOCTYPE html>\n<html lang="mul" class="no-js">\n<head>\n<meta charset="utf-8">\n<title>Wikipedia</title>\n<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">\n<![if gt IE 7]>\n<script>\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)no-js(\\s|$)/, "$1js-enabled$2" );\n</script>\n<![endif]>\n<!--[if lt IE 7]><meta http-equiv="imagetoolbar" content="no"><![endif]-->\n<meta name="viewport" content="initial-scale=1,user-scalable=yes">\n<link rel="apple-touch-icon" href="/static/apple-touch/wikipedia.png">\n<link rel="shortcut icon" href="/static/favicon/wikipedia.ico">\n<link rel="license" href="//creativecommons.org/licenses/by-sa/3.0/">\n<style>\n.sprite{background-image:url(portal/wikipedia.org/assets/img/sprite-6e35f464.png);background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-6e35


# Scraping the web in Python

## HTML
- Mix of unstructured and structured data
- Structured data:
    - Has pre-defined data model, or
    - Organized in a defined manner
- Unstructured data: neither of these properties

## BeautifulSoup
- Parse and extract structured data from HTML
- Make tag soup beautiful and extract information

In [13]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz
Building wheels for collected packages: bs4
  Running setup.py bdist_wheel for bs4 ... done
  Stored in directory: /home/nbuser/.cache/pip/wheels/84/67/d4/9e09d9d5adede2ee1c7b7e8775ba3fbb04d07c4f946f0e4f11
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
You are using pip version 9.0.1, however version 9.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [14]:
from bs4 import BeautifulSoup
import requests
url = 'https://www.crummy.com/software/BeautifulSoup/'
r = requests.get(url)
html_doc = r.text
# Name the parser explicitly; without it BeautifulSoup picks one and prints a warning.
# 'lxml' requires the lxml package; 'html.parser' is the built-in alternative.
soup = BeautifulSoup(html_doc, 'lxml')

In [15]:
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Beautiful Soup: We called him Tortoise because he taught us.
  </title>
  <link href="mailto:leonardr@segfault.org" rev="made"/>
  <link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
  <meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
  <meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
  <meta content="Leonard Richardson" name="author"/>
 </head>
 <body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
  <img align="right" src="10.1.jpg" width="250"/>
  <br/>
  <p>
   You didn't write that awful page. You're just trying to get some
data out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-tur

## Exploring BeautifulSoup
- Many methods and attributes, such as `title` and `get_text()`:

In [16]:
print(soup.title)

<title>Beautiful Soup: We called him Tortoise because he taught us.</title>


In [17]:
print(soup.get_text())




Beautiful Soup: We called him Tortoise because he taught us.








You didn't write that awful page. You're just trying to get some
data out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround
screen scraping projects.

Beautiful Soup
"A tremendous boon." -- Python411 Podcast
[ Download | Documentation | Hall of Fame | Source | Discussion group  | Zine ]
If Beautiful Soup has saved you a lot of time and money, one way to pay me back is to read Tool Safety, a short zine I wrote about what I learned about software development from working on Beautiful Soup. Thanks! 
If you have questions, send them to the discussion
group. If you find a bug, file it.
Beautiful Soup is a Python library designed for quick turnaround
projects like screen-scraping. Three features make it powerful:


Beautiful Soup provides a few simple methods and Pythonic idioms
for navigating, searching, and modifying a parse tree: a toolkit for
dis
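The text returned by `get_text()` keeps the blank lines left behind by the removed tags, as the output above shows. One possible cleanup, sketched here with plain string methods, drops empty lines and trims surrounding whitespace. The `raw_text` value is a shortened copy of that output, and the variable names are only illustrative.

```python
# Shortened copy of the get_text() output above, blank lines included.
raw_text = (
    "\n\n\n\nBeautiful Soup: We called him Tortoise because he taught us.\n"
    "\n\n\n\n"
    "You didn't write that awful page. You're just trying to get some\n"
    "data out of it.\n"
)

# Trim each line, then keep only the lines that still contain text.
lines = [line.strip() for line in raw_text.splitlines()]
clean_text = "\n".join(line for line in lines if line)

print(clean_text)
```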

## Exploring BeautifulSoup
- `find_all()`

In [18]:
for link in soup.find_all('a'):
    print(link.get('href'))

bs4/download/
#Download
bs4/doc/
#HallOfFame
https://code.launchpad.net/beautifulsoup
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
zine/
zine/
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://bugs.launchpad.net/beautifulsoup/
http://lxml.de/
http://code.google.com/p/html5lib/
bs4/doc/
None
bs4/download/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
download/3.x/BeautifulSoup-3.2.1.tar.gz
None
http://www.nytimes.com/2007/10/25/arts/design/25vide.html
https://github.com/reddit/reddit/blob/85f9cff3e2ab9bb8f19b96acd8da4ebacc079f04/r2/r2/lib/media.py
http://www.harrowell.org.uk/viktormap.html
http://svn.python.org/view/tracker/importer/
http://www2.ljworld.com/
http://www.b-list.org/weblog/2010/nov/02/news-done-broke/
http://esrl.noaa.gov/gsd/fab/
http://laps.noaa.gov/topograbber/
http://groups.google.com/group/beautifulsoup/
https://launchpad.net/beautifulsoup
https://code.launchpad.net/beautifulsoup/
https://bugs.launchpad.
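The list above mixes absolute URLs, relative links such as `bs4/download/`, duplicates, and `None` for `<a>` tags without an `href`. The sketch below shows one way to tidy such a list with `urllib.parse.urljoin`, assuming the page URL used earlier as the base. The `hrefs` list is a hand-copied sample of the values above, not live scraper output.

```python
from urllib.parse import urljoin

base_url = "https://www.crummy.com/software/BeautifulSoup/"

# Sample of the values printed above (what link.get('href') returned).
hrefs = ["bs4/download/", "#Download", None, "http://lxml.de/", "zine/", "zine/"]

absolute_links = []
for href in hrefs:
    if href is None:
        continue  # <a> tag without an href attribute
    link = urljoin(base_url, href)  # relative links are resolved against the page URL
    if link not in absolute_links:  # skip duplicates, keep first-seen order
        absolute_links.append(link)

for link in absolute_links:
    print(link)
```

Links that are already absolute, such as `http://lxml.de/`, pass through `urljoin` unchanged.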


