# Web scraping

Ways to get data from the internet: <br>
bulk download <br>
file sharing <br>
scraping <br>
APIs/feeds

Scraping steps: <br>
(1) request the HTML file from the server - using the requests library <br>
(2) (optionally store it in a database - using MongoDB) <br>
(3) parse the file - using the BeautifulSoup library <br>
(4) find the data we need - using BeautifulSoup or regex
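Step (4) mentions regex as an alternative to BeautifulSoup. A minimal sketch of that route, assuming a made-up HTML snippet in place of a real downloaded page. Regex is brittle against real-world HTML, so this only suits small, predictable fragments:

```python
import re

# Made-up HTML snippet standing in for response.text (an assumption, not a real page)
html = '<div><span class="price"> $19.99 </span><span class="price">$5.00</span></div>'

# Capture the number inside every <span class="price">...</span>
price_pattern = re.compile(r'<span class="price">\s*\$([0-9]+(?:\.[0-9]+)?)\s*</span>')

# findall returns the captured group for each match
prices = [float(match) for match in price_pattern.findall(html)]

print(prices)
```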

## requests library

In [2]:
import requests
response = requests.get('http://www.amazon.com')
# the response object now holds the HTML file in response.text

In [None]:
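# NASA image request, the long way: query string written out in the URL
images = requests.get('https://mars.nasa.gov/msl/multimedia/raw/?s=2032&camera=FHAZ_')

# NASA image request with params: requests builds the query string from the dict
images = requests.get('https://mars.nasa.gov/msl/multimedia/raw',
                      params={'s': '2032', 'camera': 'FHAZ_'})

images.url      # final URL, including the query string
images.headers  # response headers
images.text     # response body as text (the HTML)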

more info: https://html.python-requests.org/
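To see what the `params` dict turns into, here is a rough standard-library sketch of the same query-string encoding. requests does this internally; `urlencode` is used here only as an illustration, and no request is sent:

```python
from urllib.parse import urlencode

base_url = 'https://mars.nasa.gov/msl/multimedia/raw/'
params = {'s': '2032', 'camera': 'FHAZ_'}

# Encode the dict as key=value pairs joined by '&'
query = urlencode(params)
full_url = base_url + '?' + query

print(full_url)
```

The result matches the hand-written URL in the "long way" request above.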

## BeautifulSoup

In [8]:
from bs4 import BeautifulSoup

In [None]:
bs_obj = BeautifulSoup(response.text, 'html.parser')
# find all elements of type "span" with class "price", get the first one and get its text, which is the price
# (raises IndexError if the page has no such element)
price = float(bs_obj.find_all('span', {'class': 'price'})[0].get_text().strip().strip('$'))

more info: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
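The one-line price lookup fails if the element is missing or the text has a thousands separator. A slightly more defensive sketch, assuming a made-up HTML snippet rather than a real product page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded product page (an assumption)
html = '<html><body><span class="price"> $1,299.50 </span></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match, or None if nothing matches
tag = soup.find('span', class_='price')
if tag is None:
    price = None
else:
    # Drop whitespace, the dollar sign, and thousands separators before converting
    price = float(tag.get_text().strip().lstrip('$').replace(',', ''))

print(price)
```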

In [11]:
htmlpage = images.text

# construct soup object
soup = BeautifulSoup(htmlpage, 'lxml') # specify parser

In [None]:
print(soup.prettify()) # using prettify method

In [None]:
soup.head # shows entire head - tag object tagged 'head'

In [13]:
soup.title # soup tag object tagged 'title'
soup.title.name

'title'

In [14]:
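soup.title.contents  # list of the tag's children
soup.title.text      # all text inside the tag, concatenated
soup.title.string    # the text, but only if the tag has a single child; otherwise None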

'Raw Images - Mars Science Laboratory'
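The difference between `.string` and `.text` only shows up when a tag has more than one child. A small sketch on made-up HTML (the `id` values are hypothetical):

```python
from bs4 import BeautifulSoup

# Two made-up paragraphs: one with plain text, one with nested markup
html = '<p id="plain">Raw Images</p><p id="mixed">Raw <b>Images</b></p>'
soup = BeautifulSoup(html, 'html.parser')

plain = soup.find('p', id='plain')
mixed = soup.find('p', id='mixed')

print(plain.string)    # single text child, so .string returns it
print(mixed.string)    # two children (text and <b>), so .string is None
print(mixed.text)      # .text concatenates the text of all descendants
print(mixed.contents)  # the list of direct children
```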

In [15]:
soup.img # soup tag object tagged 'img'
soup.img.attrs # attributes

{'align': 'left',
 'alt': 'Follow this link to skip to the main content',
 'border': '0',
 'height': '1',
 'hspace': '0',
 'src': 'https://mars-jpl-nasa-gov-images.s3.amazonaws.com/spacer.gif',
 'vspace': '0',
 'width': '1'}

In [16]:
soup.img['src']

'https://mars-jpl-nasa-gov-images.s3.amazonaws.com/spacer.gif'

In [17]:
soup.img['width']

'1'

In [None]:
# find all images on web page
soup.find_all('img')

In [19]:
# want only these:
# <img 
# alt="Image taken by Front Hazcam: Left B" 
# hspace="0" 
# src="/msl-raw-images/proj/msl/redops/ods/surface/sol/01460/opgs/edr/fcam/FLB_527107895EDR_F0572798FHAZ00337M_-thm.jpg" 
# style="border: 3px solid #FFFFFF;" 
# vspace="0" 
# width="160"/>,

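# use .get() so that <img> tags without an alt attribute are skipped instead of raising KeyError
imgs = [img['src'] for img in soup.find_all('img') if 'Image' in img.get('alt', '')]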
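The `src` values in the wanted tags are site-relative paths, so they need a base URL before they can be downloaded. A sketch using `urljoin`, assuming the images are served from the same host as the page:

```python
from urllib.parse import urljoin

# URL of the page the soup was built from
page_url = 'https://mars.nasa.gov/msl/multimedia/raw/'

# One src value of the kind shown in the comment above
relative_src = ('/msl-raw-images/proj/msl/redops/ods/surface/sol/01460/opgs/edr/fcam/'
                'FLB_527107895EDR_F0572798FHAZ00337M_-thm.jpg')

# A leading '/' replaces the whole path of the base URL and keeps its scheme and host
absolute_src = urljoin(page_url, relative_src)

print(absolute_src)
```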

## API

In [None]:
# A User-Agent header is required for the Wikipedia API.
headers = {'User-Agent': 'Web_Scraping/1.1 (darren.reger@galvanize.com; dsi example exercise)'}

# Experiment with fetching one or two pages and examining the result (fill in URL and payload)
url = 'https://en.wikipedia.org/w/api.php'

# parameters for the API request
payload = {'action': 'parse', 'format': 'json', 'page': 'Kevin Bacon'}

# make the request
r = requests.post(url, data=payload, headers=headers)

# print out the result of the request as JSON
print(r.json()['parse'])
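Indexing `['parse']` directly raises `KeyError` when the API answers with an error object instead. A hedged sketch of a check; `extract_parse` is a hypothetical helper and both payloads are made up to show the shape of a decoded response:

```python
def extract_parse(payload):
    """Return the 'parse' section of a decoded API response, or raise on an error reply."""
    if 'error' in payload:
        err = payload['error']
        raise ValueError('API error {}: {}'.format(err.get('code'), err.get('info')))
    return payload['parse']

# Made-up payloads, shaped like r.json() for a successful and a failed request
ok = {'parse': {'title': 'Kevin Bacon'}}
bad = {'error': {'code': 'missingtitle', 'info': "The page you specified doesn't exist."}}

print(extract_parse(ok)['title'])
try:
    extract_parse(bad)
except ValueError as exc:
    print(exc)
```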

In [None]:
# API authentication: HTTP Basic auth, passed as auth=(username, password); the username is empty here
import requests
z = requests.get('http://galvanizesf.roomzilla.net', auth=('', 'gVIP543'))
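The `auth=(username, password)` tuple is sent as an HTTP Basic `Authorization` header. A sketch of roughly what that header looks like, built with the standard library; `basic_auth_header` is a hypothetical helper and the credentials are placeholders:

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode('{}:{}'.format(username, password).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')

# Placeholder credentials, not real ones
header = basic_auth_header('user', 'secret')
print(header)
```

Base64 is an encoding, not encryption, so Basic auth should only be sent over HTTPS.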

In [None]:
# using Mongo to store data we scrape from web
from pymongo import MongoClient

# Define the MongoDB database and collection (Mongo's equivalent of a table)
client = MongoClient()
db = client.uk_police
collection = db.all_crime
# dictionary-style access also works, e.g. table = db['meta']

# create request - this grabs one date (year, month)
request = requests.get('https://data.police.uk/api/crimes-no-location?category=all-crime&force=warwickshire&date=2013-09')
# insert it as JSON (plays well with Mongo)
collection.insert_many(request.json())

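# Possible way to grab data for a range of months and years
for year in range(2001, 2016):
    for month in range(1, 13):
        # zero-pad the month so the date matches the YYYY-MM form used above (e.g. 2013-09)
        r = requests.get('https://data.police.uk/api/crimes-no-location?category=all-crime&force=warwickshire&date={}-{:02d}'.format(year, month))
        # skip failed requests and empty results (insert_many raises on an empty list)
        if r.ok and r.json():
            collection.insert_many(r.json())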

# print the stored documents in one category
import pprint as pp
for item in collection.find({'category': 'public-order'}):
    pp.pprint(item)
    
# Remember to close the connection
client.close()
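The `date` parameter in the URLs above takes the form `YYYY-MM` (as in `2013-09`), so single-digit months need zero-padding. A small sketch that generates those strings for a range of years; `month_strings` is a hypothetical helper name:

```python
def month_strings(start_year, end_year):
    """Yield 'YYYY-MM' strings for every month from start_year up to, but excluding, end_year."""
    for year in range(start_year, end_year):
        for month in range(1, 13):
            # {:02d} pads single-digit months with a leading zero
            yield '{}-{:02d}'.format(year, month)

dates = list(month_strings(2001, 2016))
print(dates[0], dates[-1], len(dates))
```

Each string can then be substituted into the `date=` part of the request URL.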