# Day 2 [NOTE WORK IN PROGRESS]

A vast amount of data exists on the web and is now publicly available. In this section, we give an overview of popular ways to retrieve data from the web, and walk through some important concerns and considerations. 

## Background
** 1) How does the web work? **  
** - a) Examining an HTTP request through your browser (Chrome/Firefox) **  
** - b) Examining an HTTP request through your console **  

** 2) Web terminology: some important distinctions **  
** - a) Web scraping vs APIs - what's the difference? **      
** - b) Web scrapers vs crawlers & spiders - what's the difference? **     

** 3) Building friendly bots: robots.txt and legality ** 


## Tutorial
** 1) Creating a friendly bot on Wikipedia **  
** 2) Spotify API **  
  


# Background:

## 1) How does the web work? 
An extremely simplified model of the web is as follows. The World Wide Web follows a client-server architecture, where clients (e.g. the web browser on your computer) send <b><i>requests</i></b> to servers, and servers respond with resources. When you enter a URL (Uniform Resource Locator) into your browser, your browser sends an HTTP request to a remote server with information about the resource you are looking for, and the server returns that resource if it is available. 

<img src="images/Client-server-model.svg.png">

A server can be understood as a computer that has various files (resources) stored in its system, and that returns those files if it receives requests in a format it understands. 
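
The client-server exchange described above can be tried out entirely on your own machine. The following is a minimal sketch, assuming Python 3 and only the standard library; the `RESOURCES` dictionary, the `ToyHandler` class and the `fetch` helper are names made up for this illustration, and a real web server is of course far more elaborate.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# A toy "resource" store: path -> file contents (made up for this sketch).
RESOURCES = {"/hello.txt": b"Hello from the server!"}


class ToyHandler(BaseHTTPRequestHandler):
    """Return the stored resource if we have it, otherwise a 404 error."""

    def do_GET(self):
        body = RESOURCES.get(self.path)
        if body is None:
            self.send_error(404, "Resource not found")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(port, path):
    """Play the client: send one GET request and return (status, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        reply = conn.getresponse()
        return reply.status, reply.read()
    finally:
        conn.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), ToyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

found = fetch(port, "/hello.txt")   # a resource the server has
missing = fetch(port, "/nope.txt")  # a resource it does not have
server.shutdown()
server.server_close()

print(found)
print(missing[0])
```

The first request returns the stored bytes with status 200; the second returns status 404, which is how a server signals that the requested resource is not available.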

## 1a). Examining a request through your browser (Chrome/Firefox)

You can view the request sent by your browser by:

1) Opening a new tab in your browser   
2) Enabling developer tools (__View -> Developer -> Developer Tools in Chrome__ and __Tools -> Web Developer -> Toggle Tools in Firefox__)  
3) Loading or reloading a web page (e.g. www.google.com)  
4) Navigating to the Network tab in the panel that appears at the bottom of the page.   

### Chrome Examine Request Example
<img src="images/chrome_request.png">

### Firefox Examine Request Example
<img src="images/firefox_request.png">

The requests you send follow HTTP (Hypertext Transfer Protocol), part of which defines the information (and its format) that the server needs to receive in order to return the right resources. Your HTTP request contains __headers__, which carry that information to the server. 

## 1b). Examining an HTTP request through the console

Let's now try accessing the same server using the requests module. Now, instead of sending the server a request through your browser, you are sending the server a request programmatically, through your console.  The server returns some output to you, which the requests module parses as a Python object.  

In [45]:
import requests
import pprint 

response = requests.get("http://www.google.com")

This response object contains various information about the request you sent to the server, the resources returned, and information about the response the server returned to you, among other information. These are accessible through the <i>__request__</i> attribute, the <i>__content__</i> attribute and the <i>__headers__</i> attribute respectively, which we'll each examine below.
<hr style="border-color:gray;opacity:0.5">

### Examining the response object
The type() and dir() functions are useful for determining what kind of object you are dealing with, and the methods and attributes available to it, especially when first starting to work with a new Python module. 

__type()__ returns the type of the object, which can be one of Python's built-in types such as int, list or dict, or a custom type defined by the module you are importing. Either way, the type gives you important clues on how to interact with the object, especially if it includes familiar types in its name (such as list or dict). 

__dir()__ lists all the methods and attributes which the object has. A method is simply a callable attribute, and can be distinguished using the __callable__ function. 

These should be supplemented with the official documentation for the module (http://docs.python-requests.org/en/master/), as well as googling for key terms and specific error codes. 
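
To make the difference between methods and plain attributes concrete, here is a small sketch that does not need a network connection. The `Page` class is a hypothetical stand-in for a response-like object, invented for this example; the same three calls (type, dir, callable) work on any Python object.

```python
class Page(object):
    """Hypothetical stand-in for a response-like object."""

    def __init__(self, url, text):
        self.url = url
        self.text = text

    def word_count(self):
        return len(self.text.split())


page = Page("http://www.example.com", "hello web scraping world")

# Names starting with an underscore are internal, so we skip them.
public_names = [name for name in dir(page) if not name.startswith("_")]

# A method is a callable attribute; everything else is plain data.
methods = [name for name in public_names if callable(getattr(page, name))]
attributes = [name for name in public_names if not callable(getattr(page, name))]

print(type(page))
print(methods)     # ['word_count']
print(attributes)  # ['text', 'url']
```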

In [54]:
print(type(response))
print("")
print(dir(response))

<class 'requests.models.Response'>

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


The type returned - "requests.models.Response" - is not too informative, but let's try examining the type of each attribute we are interested in. 

In [55]:
print(type(response.request))
print(type(response.content))
print(type(response.headers))

<class 'requests.models.PreparedRequest'>
<type 'str'>
<class 'requests.structures.CaseInsensitiveDict'>


Here, we can see that __request__ is an object with a custom type, __content__ is a str value (under Python 3 it is a bytes value, and the decoded string is available as response.text) and __headers__ is an object with "dict" in its name, suggesting we can interact with it like we would with a dictionary.

If we recall our simple model of the web, we sent an HTTP request through our console to a remote server, which returned a response. Both the request and the response contain information that first allows the server to determine the right resource to return, and then, typically, allows our browser to interpret the returned object. 

The content is the actual resource returned to us - let's take a look at the content first before examining the request and response objects more carefully. (We select the first 1000 characters because of the display limits of the Jupyter/Python notebook.)

## response.content

In [26]:
print(response.content[0:1000])

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/logos/doodles/2016/valentines-day-2016-5699846440747008-5129251808346112-ror.gif" itemprop="image"><meta content="Happy Valentine's Day! #GoogleDoodle" property="og:description"><meta content="http://www.google.com/logos/doodles/2016/valentines-day-2016-5699846440747008.3-thp.png" property="og:image"><meta content="518" property="og:image:width"><meta content="139" property="og:image:height"><title>Google</title><script>(function(){window.google={kEI:'2B_BVvfkE4LmjwPRw6vQBg',kEXPI:'1350255,3700263,4028790,4029815,4031109,4032677,4033307,4036509,4036527,4038012,4039268,4042785,4042793,4043492,

### HTML: language for computers

The content returned is written in __HTML (HyperText Markup Language)__, which is the default format in which web pages are returned. The content looks like gibberish at first, with little to no spacing. The reason for this is that this output is not designed for us to read, but for the browser to parse and present in a visual interface. 

The raw HTML document contains both the text in the web page, such as "Google Research" or "I'm Feeling Lucky", and tags and information about how the text is to be formatted and presented, including positioning, font size and the layout of the site. When we begin writing our web scraper for Wikipedia, we'll go into more detail on how to navigate and parse the HTML structure to locate and extract the data you need.


If you save a web page as a ".html" file, and open the file in a text editor like Notepad++ or Sublime Text, this is the same format you'll see. Opening the file in a browser (i.e. by double-clicking it) gives you the Google home page you are familiar with. 


## response.request

Next, let's take a look at the request attribute. Notice that the request attribute is attached to the response object returned from requests.get, i.e. the HTTP request has already been sent, and the request attribute is provided for convenience so you can see what request headers you sent, after the fact. 

As before, we use type() and dir() to learn more about the object stored in the request attribute. 

In [58]:
print(type(response.request))
print(dir(response.request))

<class 'requests.models.PreparedRequest'>
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cookies', '_encode_files', '_encode_params', 'body', 'copy', 'deregister_hook', 'headers', 'hooks', 'method', 'path_url', 'prepare', 'prepare_auth', 'prepare_body', 'prepare_content_length', 'prepare_cookies', 'prepare_headers', 'prepare_hooks', 'prepare_method', 'prepare_url', 'register_hook', 'url']


Let's print out the headers associated with our request. The __url__ and __method__ attributes contain other key information associated with the request. We can see the __headers__, __url__ and __method__ attributes in the dir() output; you can also use the __getattr()__ function, or just check whether a name is in the list returned by dir() (useful if the list is too long to read).

### Checking if the headers attribute is available

In [79]:
print(getattr(response.request, "headers"))
print("headers" in dir(response.request))

{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.7.0 CPython/2.7.10 Darwin/14.5.0'}
True


### Examining the request headers object
We can use __dir()__ and __type()__ again on the object stored in response.request.headers. We can see that req_headers is of the type CaseInsensitiveDict, which suggests we can interact with it as we would with a typical Python dictionary, e.g. it has a keys method that returns all keys in the dictionary. 

In [86]:
req_headers = response.request.headers
print(type(req_headers))
print(dir(req_headers))
print("keys" in dir(req_headers))

<class 'requests.structures.CaseInsensitiveDict'>
['_MutableMapping__marker', '__abstractmethods__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__metaclass__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_store', 'clear', 'copy', 'get', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'lower_items', 'pop', 'popitem', 'setdefault', 'update', 'values']
True


### Printing information associated with request 

In [87]:
print("url: " + response.request.url)
print("method: " + response.request.method)

for i in response.request.headers.keys():
    print(i + ": " + response.request.headers[i])


url: http://www.google.com/
method: GET
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.7.0 CPython/2.7.10 Darwin/14.5.0


The method associated with the request (GET here) is one of a number of methods defined in the HTTP protocol, including GET, POST, PUT and DELETE. 

Of these, the most common are GET and POST, with the GET method typically used for data retrieval and the POST method typically used to send data to the server (for instance, to make changes in the server's database). GET is usually the only method used for web scraping, and we shall return to it in our Wikipedia web scraping tutorial. 

We won't go too much into what the other header fields mean; you should be able to find references for them easily online (e.g. https://en.wikipedia.org/wiki/List_of_HTTP_header_fields). 

Nonetheless, when troubleshooting your code for extracting data from the web, you'll often find yourself examining the header fields for both the request and response messages. 
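
If you want to inspect a request's method and headers without sending anything over the network, one option is the standard library's urllib.request.Request object. This is only a sketch, assuming Python 3: it uses urllib rather than the requests module used above, and the example.com URLs and the "friendly-bot" User-Agent string are made up for illustration.

```python
from urllib.request import Request

# No data attached: urllib treats this as a GET (data retrieval).
search = Request("http://www.example.com/search?q=web+scraping")
search.add_header("User-Agent", "friendly-bot/0.1 (hypothetical)")

# Data attached: urllib treats this as a POST (sending data to the server).
signup = Request("http://www.example.com/signup", data=b"name=alice")

print(search.get_method())    # GET
print(signup.get_method())    # POST
print(search.header_items())  # the headers that would be sent with the request
```

Nothing is transmitted until the request object is passed to urllib.request.urlopen, so this is a safe way to check what you are about to send.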

## response.headers

To round out this section, let's briefly examine the headers associated with the response (rather than the request), using the techniques we've learned. These headers are directly available on the main response object we have been working with. 




In [91]:
print(type(response.headers))
print(dir(response.headers))
print("")

for i in response.headers.keys():
    print(i + ": " + response.headers[i])

<class 'requests.structures.CaseInsensitiveDict'>
['_MutableMapping__marker', '__abstractmethods__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__metaclass__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_store', 'clear', 'copy', 'get', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'lower_items', 'pop', 'popitem', 'setdefault', 'update', 'values']

content-length: 41117
x-xss-protection: 1; mode=block
content-encoding: gzip
set-cookie: NID=76=a6RJKuk3r3-JdP2Tp-3UcdubV5_mYD4WAn5MB1_RXyqUd52gu7DmFf3wSBBxu0tEBNncYk-SHOTGf7-mRHc-vQliWN1h5gPexuXiJdxnB8VDWFkkUHcDouHVW20SACiS5Z9oe-ftcyO3Aw; expires=Tue, 16-Aug-2016 23:00:53 GMT; path

### End Note: Browser vs. Console

From the server's perspective, the request it receives from your browser is not so different from the request received from your console (though some servers use a range of methods to determine if the request comes from a "valid" person using a browser, versus an automated program.) 

The server relies on the request header fields to determine what to return, and includes a number of header fields in its response, in addition to the content. 

The main difference is that in the browser, you interact with the server via a graphical user interface (GUI), so that much of the header specification, both in the request and in the response, remains invisible to you. In your console, you often have to specify or parse this content manually - while this involves more work, it also allows you a great deal more flexibility, and the ability to automate certain tasks. 



## 2) Web terminology: some important distinctions


## 2a) Web scraping vs APIs  - what's the difference?
Now that we've covered a simple model of how you might interact with the World Wide Web, let's go through the two main ways you may extract data from the web for research or analysis. 

As a quick recap - when you access data through your browser, you download a resource. Because our interaction is primarily visual, information returned to browsers is in HTML, a markup language that delivers both content and rules about how the content is to be presented (fonts, text size, bold, arrangement). By contrast, APIs are typically built to return only data. For this reason, the data is typically returned in XML or JSON formats. We've already seen an example of an HTML file, and here is an example of a .json file from the Spotify API. 


The structure is similar to how we navigate nested Python objects (such as a list of lists), and we will see how you can navigate JSON objects using the Python json library later in the tutorial. Notice that the data is highly structured, with no lines devoted to markup or to how a page is to be displayed, unlike HTML meant for the browser. 
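
As a rough preview of what navigating JSON looks like, here is a minimal sketch using the standard json module. The reply below is invented for illustration and is not real Spotify data; the field names ("artists", "items", "popularity") are assumptions chosen only to show nested dictionaries and lists.

```python
import json

# Hypothetical API reply, invented for illustration (not real Spotify data).
raw_reply = """
{
  "artists": {
    "total": 2,
    "items": [
      {"name": "Artist One", "popularity": 71, "genres": ["pop"]},
      {"name": "Artist Two", "popularity": 64, "genres": ["folk", "indie"]}
    ]
  }
}
"""

data = json.loads(raw_reply)      # str -> nested dicts and lists
items = data["artists"]["items"]  # a list of dicts, one per artist

for artist in items:
    print(artist["name"], artist["popularity"], ", ".join(artist["genres"]))

# Pick the entry with the highest popularity score.
most_popular = max(items, key=lambda artist: artist["popularity"])
print(most_popular["name"])
```

Once parsed, the data is ordinary Python: dictionaries are indexed by key and lists by position, so no HTML parsing step is needed.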

In summary:

__Web scraping__ typically involves scraping pages meant for human consumption. Hence you are more likely to work with __.html__ files. 

__Web APIs__ are a broad category, but in the context of data extraction for research, you are likely to work with __XML__ or __JSON__ formats, or whatever format the company or agency chooses to make the data available in. There are typically fewer steps between extracting the data and parsing it into a form ready for analysis, as APIs are built to return data directly.

### Note on Robots:
We also hear a lot about robots. A robot (or bot) is a program designed to accomplish any kind of automated task. Thus, you can write a bot to download data from an API, or to scrape pages. Sending requests manually on the console does not qualify as a bot - the key is that the task must be automated. For instance, a bot can click through every post on a forum, downloading pages that match a specific keyword or phrase. 

In the next section, we will discuss a text file called "robots.txt" (which applies to scraping only), typically found in the root folder of a site, that contains instructions to bots on what can or cannot be scraped or crawled on the site. 

## 2b) Menagerie of tools: crawlers, spiders, scrapers - what's the difference? 
Web crawlers or spiders are used by search engines to index the web. The metaphor is that of an automated bot with long, spindly legs, traversing from hyperlink to hyperlink. Search engines use these crawlers to continually traverse the web and index new or changed content, so that our search queries reflect the most recent and up-to-date content. 

Web scraping is a little different. While many of the tools used may be identical or similar, web scraping "focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet." (https://en.wikipedia.org/wiki/Web_scraping) In other words, web scraping focuses on translating data into a form ready for storage and analysis (versus just indexing). 

In many cases, to the server, these processes look nearly identical: resources are sent in response to requests. It is what is done to those resources after they are sent, and the overall goal, that differentiates web crawling from scraping. 

Most websites want crawlers to find them so their pages appear on popular search engines, but see no clear-cut benefit when their content is parsed and converted into usable data. Beyond research, many companies also use web scraping (in a legal grey area or illegally) to repurpose content, e.g. a real estate website scraping data from Craigslist to re-post as listings on its own website. 

## 3) Considerate robots and legality 

__Typically, in starting a new web scraping project, you'll want to follow these steps:__  
1) Find the website's robots.txt file and do not access the pages it disallows through your bot  
2) Make sure your bot does not make too many requests in a specific period (e.g. by using Python's time.sleep function)   
3) Look up the website's terms of use or terms of service. 

We'll discuss each of these briefly.

### What data owners care about

__Data owners are concerned with:__  
1) Keeping their website up  
2) Protecting the commercial value of their data   

Their policies and responses differ with respect to these two areas. You'll need to do some research to determine what is appropriate for your own project. 

#### 1) Keeping their website up
Most commercial websites have strategies to throttle or block IPs that make too many requests within a fixed amount of time. Because a bot can make a large number of requests in a small amount of time (e.g. entering 100 different terms into Google in one second), servers can use request rates (among many other methods) to determine whether traffic is coming from a bot or a person. For companies that rely on advertising, like Google or Twitter, these requests do not represent "human eyeballs" and need to be filtered out from their bill to advertisers. 

In order to keep their site up and running, companies may block your IP temporarily or permanently if they detect too many requests coming from your IP, or other signs that requests are being made by a bot instead of a person. If you systematically take down a site (such as by sending millions of requests to an official government site), there is a small chance your actions may be interpreted as malicious (and regarded as hacking), with a risk of prosecution. 
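
One simple way to avoid flooding a server is to pause between requests. The sketch below is one possible approach, not a standard recipe: `polite_fetch_all` and `fake_fetch` are hypothetical helpers written for this example, and `fake_fetch` stands in for a real download so that the code runs offline. A suitable delay depends on the site; the 0.2 seconds here only keeps the demo fast.

```python
import time


def polite_fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real download so the sketch runs offline."""
    return "contents of " + url


urls = ["http://www.example.com/page1", "http://www.example.com/page2"]

start = time.monotonic()
pages = polite_fetch_all(urls, fake_fetch, delay_seconds=0.2)
elapsed = time.monotonic() - start

print(pages)
print("took about %.1f seconds" % elapsed)
```

In a real scraper you would pass a function that actually downloads the page (for example one wrapping requests.get) in place of `fake_fetch`.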

#### 2) Protecting the commercial value of their data
Companies are also typically very protective of their data, especially data that ties directly into how they make money. A listings site (like Craigslist), for instance, would lose traffic if listings on its site were poached and transferred to a competitor, or if a rival company used scraping tools to derive lists of users to contact. For this reason, companies' terms of use agreements are typically very restrictive about what you can do with their data. 

Different companies may have a range of responses to your scraping, depending on what you do with the data. Typically, repurposing the data for a rival application or business will trigger a strong response from the company (i.e. legal attention). Publishing any analysis or results, either in a formal academic journal or on a blog or webpage, may be of less concern, though legal attention is still possible. 

### Where APIs fit
Companies typically provide APIs to deal with 1) - to direct bots and scrapers away from their main site, as well as for commercial purposes (such as the Google Maps API, which is used by many companies on a pay-as-you-go basis). 

Because APIs usually require registration and set a fixed (though often very large) number of requests, they are easier to manage and don't require companies to figure out whether requests are being made by their primary customers, versus scrapers and crawlers.

#### __In general, using APIs vs. web scraping offers more protections because:__  
1) Because of the way APIs are designed, you are unlikely to affect the running of the main site and   
2) API data is data companies have explicitly chosen to make available (though terms of service still apply). By contrast, you may be scraping information companies want to protect if you do it through web scraping.

### Risks in brief
- In general, most often you'll simply find your IP being temporarily blocked if you are careless with the number of requests you make.   
- More serious consequences would include being put on a permanent blacklist or contacted for a cease-and-desist or legal action by the company (e.g. if you create a new service using their data). Some of this falls in a legal grey area.  
- Finally, if you scale your requests and manage to send them in a sophisticated enough manner to crash the site, this may qualify as digital crime - similar to Distributed Denial-of-Service (DDoS) attacks. We probably don't have to worry about this at this stage.

### robots.txt: internet convention

The robots.txt file is typically located in the root folder of the site, with instructions to various services (User-agents) on what they are not allowed to scrape. 

Typically, the robots.txt file is geared towards search engines (and their crawlers) more than anything else. 

However, companies and agencies typically will not want you to scrape any pages that they disallow search engines from accessing. Scraping these pages makes it more likely for your IP to be detected and blocked (along with other possible actions.) 

Below is an example of reddit's robots.txt file: 
https://www.reddit.com/robots.txt

User blahblahblah provides a concise description of how to read the robots.txt file:
https://www.reddit.com/r/learnprogramming/comments/3l1lcq/how_do_you_find_out_if_a_website_is_scrapable/

In general, your bot will fall into the * wildcard category, which lists what the site generally does not want bots to access. You should make sure your scraper does not access any of those pages, e.g. www.reddit.com/login. 
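
Python's standard library can read robots.txt rules for you via urllib.robotparser. The sketch below assumes Python 3 and parses a small set of rules that were made up for this example (they are not Reddit's actual file); the bot names and example.com URLs are likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, made up for this sketch (not Reddit's real file).
rules = """
User-agent: *
Disallow: /login
Disallow: /search

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A bot that is not named explicitly falls under the * wildcard rules.
login_allowed = parser.can_fetch("my-research-bot", "http://www.example.com/login")
forum_allowed = parser.can_fetch("my-research-bot", "http://www.example.com/r/learnprogramming")

# A bot that is named explicitly gets its own rules; here it is banned everywhere.
badbot_allowed = parser.can_fetch("BadBot", "http://www.example.com/r/learnprogramming")

print(login_allowed)   # False
print(forum_allowed)   # True
print(badbot_allowed)  # False
```

To check a live site instead, you can call `parser.set_url(...)` with the address of its robots.txt file followed by `parser.read()`, and then use `can_fetch` in the same way before each request.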


# Let's get started!



Now that we've gone through major concepts and tried out a few code snippets, let's hone our Python skills and build two basic bots, one on Wikipedia, and one using Spotify's API. 

## Tutorial 1: Creating a friendly bot on Wikipedia

Our first use case involves scraping some basic information about technology companies from Wikipedia. Say you are the chief innovation officer of a small city in the San Francisco Bay Area. A number of large-scale local investments in office space have taken place, with space opening up over the next few years. You wish to be part of the trend of technology companies moving out of San Francisco and Silicon Valley. You have been networking and talking to companies at events and conferences, but would like a more systematic way of identifying companies to focus on. 

You notice a list of 141 technology companies based in the San Francisco area on Wikipedia:
https://en.wikipedia.org/wiki/Category:Technology_companies_based_in_the_San_Francisco_Bay_Area

Your goal is to scrape basic useful information about each company into a list, on which you can then compute some summary statistics to identify the companies or even industries you are interested in focusing on. 

** In particular, you want to know: **  
1) what industry they are in  
2) where the company is currently headquartered  
3) the number of employees   
4) website address of the company  

This will allow you to identify the current and budding tech hubs in the Bay Area, and to get a better sense of your competition and of the number of jobs you can attract to your city. For convenience, you also collate the website addresses of the companies to pull into your list. 

## Step 1: Examining the webpage structure

The first step is to examine the webpage in your browser, using developer tools (Firefox/Chrome). First, identify the element that you want to pull data from. In this case, that is a series of links. Forum traversal. 

For this case, we concentrate on the box that appears at the side of each company's page. 

While we've identified visually where we want to pull the element from, this may or may not translate into code. In our case, thankfully, the pages have similar enough structure. HTML elements have an optional attribute called "class", which, among other uses, allows the website to specify how an element should be formatted (using what is called CSS). For our purposes, we can use the "infobox vcard" class to tell the program which box we want to pull out and use.

[click inspect element] 




## Step 2: Interacting with the webpage through the console 

After examining the webpage structure through your browser, it's now time to interact with the underlying HTML code (what you see in the inspect element panel) directly in your console. Both processes are useful for coming up with a strategy for how (and whether) data from the website can be scraped. 

First, import requests and BeautifulSoup. Downloading an HTML copy of the page is as simple as: 

In [39]:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://en.wikipedia.org/wiki/Category:Technology_companies_based_in_the_San_Francisco_Bay_Area')

html = response.content
print(html[0:300])
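soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
#lists = soup.find_all("div", { "class" : "mw-category-group" })
company_section = soup.find_all("div", {"id": "mw-pages"})  # section listing the company pages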
print(type(company_section))


<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" />
<title>Category:Technology companies based in the San Francisco Bay Area - Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( 
<class 'bs4.element.ResultSet'>


In [44]:
print(len(company_section))
each_alphabet = company_section[0].find_all("div", {"class":"mw-category-group"})
print(len(each_alphabet))


1
26


In [49]:
alphabet_a = each_alphabet[0]
print(alphabet_a)
company_list = alphabet_a.find_all("li")

<div class="mw-category-group"><h3>3</h3>
<ul><li><a href="/wiki/HP_3PAR" title="HP 3PAR">HP 3PAR</a></li></ul></div>


In [56]:
link_list = []
print(company_list)
for i in company_list:
    new_list = [i.text, i.a['href']]
    link_list.append(new_list)

[<li><a href="/wiki/HP_3PAR" title="HP 3PAR">HP 3PAR</a></li>]



### Now let's write the loop over all sections

In [57]:
link_list = []
for each_section in company_section:
    company_list = each_section.find_all("li")
    for i in company_list:
        new_list = [i.text, i.a['href']]
        link_list.append(new_list)
print(len(link_list))

159


### Now using the list, let's load the first page and locate the text elements we want 


In [62]:
example_site = link_list[0]
print(example_site)

company_page = requests.get("https://en.wikipedia.org" + example_site[1])

[u'HP 3PAR', '/wiki/HP_3PAR']
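
The links we collected are relative (they start with /wiki/), so they have to be combined with the site's address before they can be requested. String concatenation works, as in the cell above; a slightly more robust alternative, sketched here under the assumption of Python 3, is urljoin from the standard library.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org"
relative_link = "/wiki/HP_3PAR"  # the relative link printed in the output above

full_url = urljoin(base_url, relative_link)
print(full_url)
# https://en.wikipedia.org/wiki/HP_3PAR
```

urljoin also copes with a base URL that has a trailing slash or a longer path, which is where plain concatenation tends to go wrong.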


In [69]:
print(company_page.content[0:200])

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" />
<title>HP 3PAR - Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className = do


In [74]:
soup = BeautifulSoup(company_page.content, "html.parser")
info_box = soup.find("table", {"class": "infobox vcard"})
print(info_box)

<table class="infobox vcard" style="width:22em">
<caption class="fn org">HP 3PAR</caption>
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Types_of_business_entity" title="Types of business entity">Type</a></div>
</th>
<td class="category" style="line-height:1.35em;">Subsidiary</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Industry</th>
<td class="category" style="line-height:1.35em;"><a href="/wiki/Data_storage_device" title="Data storage device">Data Storage</a></td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Founded</th>
<td style="line-height:1.35em;">1999</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Founder</th>
<td class="agent" style="line-height:1.35em;"><a class="new" href="/w/index.php?title=Jeffrey_Price_(3PAR)&amp;action=edit&amp;redlink=1" title="Jeffrey Price (3PAR) (page does not exist)">Jeffrey Price</a><br/>
<a class="new" href="/w/index.php?title=Ashok_Singhal
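
The infobox is a table whose rows pair a header cell (th) with a data cell (td), so it maps naturally onto a dictionary. The sketch below illustrates the idea on a shortened copy of the markup printed above. Note that it deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs (Python 3 assumed); `InfoboxParser` is a hypothetical helper written for this example. With BeautifulSoup, the equivalent would be a loop over `info_box.find_all("tr")` that reads the th and td of each row.

```python
from html.parser import HTMLParser

# A shortened copy of the infobox markup printed above.
SNIPPET = """
<table class="infobox vcard">
<tr><th scope="row"><div><a href="/wiki/Types_of_business_entity">Type</a></div></th>
<td class="category">Subsidiary</td></tr>
<tr><th scope="row">Industry</th>
<td class="category"><a href="/wiki/Data_storage_device">Data Storage</a></td></tr>
<tr><th scope="row">Founded</th><td>1999</td></tr>
</table>
"""


class InfoboxParser(HTMLParser):
    """Collect {header text: cell text} from rows that have a <th> and a <td>."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._cell = None  # "th", "td" or None: the kind of cell we are inside
        self._label = ""
        self._value = ""

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._label, self._value = "", ""  # a new row starts
        elif tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None
        elif tag == "tr" and self._label.strip():
            self.fields[self._label.strip()] = self._value.strip()

    def handle_data(self, data):
        # Text inside nested tags (such as <a>) is still attributed to the cell.
        if self._cell == "th":
            self._label += data
        elif self._cell == "td":
            self._value += data


infobox_parser = InfoboxParser()
infobox_parser.feed(SNIPPET)
fields = infobox_parser.fields

print(fields)
print(fields.get("Industry"))
```

Using `fields.get(...)` rather than `fields[...]` is a deliberate choice: not every company page has every row, and `get` returns None instead of raising an error when a row is missing.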

### And now let's export to csv

### a) Pulling a list of links

In [None]:
# list_of_lists = 

### b) Manipulating the element containing data 

### c) Building a loop

### d) Exporting the data to a csv file 
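
As a rough sketch of this last step, the standard csv module can write a list of rows to a file. Everything below is an assumption made for illustration: the column names follow the four fields listed at the start of the tutorial, the two rows are placeholder values rather than scraped data, and `write_rows` is a hypothetical helper. Python 3 is assumed.

```python
import csv
import io

# Hypothetical rows in the shape we are aiming for; the values are placeholders.
header = ["company", "industry", "headquarters", "employees", "website"]
rows = [
    ["Example Co", "Data Storage", "Fremont, California", "650", "http://www.example.com"],
    ["Sample Inc", "Software", "Oakland, California", "", "http://www.example.org"],
]


def write_rows(handle, header, rows):
    """Write a header line followed by one line per company."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


# Write to an in-memory buffer here; for a real file, use
# open("companies.csv", "w", newline="") in place of io.StringIO().
buffer = io.StringIO()
write_rows(buffer, header, rows)
print(buffer.getvalue())
```

The csv module takes care of quoting, so a value that itself contains a comma (such as a city and state) still ends up in a single column.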

### Parsing HTML

Web scraping is flexible, but particularly useful for semi-structured, repetitive data. You start by browsing the individual Wikipedia pages for each of the links. In particular, you notice the box that appears regularly at the side, which contains much of the information you need. HTML elements have an optional attribute called "class", which, among other uses, allows the website to specify how an element should be formatted (using what is called CSS). For our purposes, we can use the "infobox vcard" class to tell the program which box we want to pull out and use. 

[Click inspect element]

In [14]:
# Run the actual code

from bs4 import BeautifulSoup

In [2]:
# [pull code and output above elements, save to csv]

## Tutorial 2: Using Spotify's API 

Following our discussion of APIs, let's start interacting with some web services! We shall use Spotify's public API as an example. First, take a look at. 

Most APIs will employ such a format. Basically, you enter. So, from the console, all your program needs to do is to query that URL with specific terms, and be able to process the data that is returned, typically in JSON.

We shall try to find the top 5 most popular artists, as ranked by Spotify's algorithm.

The first part of the URL is called the "endpoint"; it can be viewed as the root. For instance, Facebook's root api is ..., and Twitter's is .... The next part, after the question mark, is called the query string. Its contents depend on how the service has designed its API, but a few elements are consistent throughout. Note that query strings are used throughout the web and are by no means specific to APIs, which, as we've seen, have a quite general definition. 

## URL Encoding


Replacing spaces and special characters in a URL with safe equivalents is called URL encoding. It is needed when your query itself has special characters, such as Aphex Twin's minipops 67 [120.2][source field mix]. Some browsers will do the conversion automatically. 

https://www.google.com/#q=minipops+67+%5B120.2%5D%5Bsource+field+mix%5D

One easy way to do the conversion automatically is to type the query into Google, and then copy and paste the URL from the browser. 
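
The conversion can also be done in code. As a small sketch, the standard library's `urllib.parse.quote_plus` encodes the track title from the example above, turning spaces into `+` and square brackets into percent escapes.

```python
from urllib.parse import quote_plus

query = "minipops 67 [120.2][source field mix]"

# Spaces become "+", and "[" / "]" become "%5B" / "%5D".
encoded = quote_plus(query)

print(encoded)  # minipops+67+%5B120.2%5D%5Bsource+field+mix%5D
```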

https://api.spotify.com/v1/search?q=justin&type=artist

In [8]:
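import requests

# Query the search endpoint for artists matching "justin"
response = requests.get("https://api.spotify.com/v1/search?q=justin&type=artist")
# .content is an attribute holding the raw response body as bytes, not a method
json_object = response.content

print(len(json_object))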


16914


As you progress on your API journey, 

# END OF TUTORIAL

# Rough notes

# robots.txt

The following is an example of Reddit's robots.txt file.

The file disallows specific pages from being scraped. The bot that calls itself Bender, and Gort, 

Web scraping falls into a legal grey area. Abuse usually means that your IP will be blocked. Websites can also often determine whether they are being accessed programmatically by a bot rather than through a browser, and may block the former; a website that depends on advertising is one example. Many websites offer APIs partly as an attempt to avoid web scraping. Established websites will have a way of buffering or blocking excessive requests from a single IP or source. In general, you should at minimum space out your requests, follow the website's robots.txt, and follow the website's terms of service. 
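
One simple way to space out requests is to pause between them. The sketch below is only an illustration: `fetch_politely` and `fake_fetch` are hypothetical helpers, the URLs are placeholders, and the right delay depends on the website.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    # Stand-in for a real HTTP call such as requests.get(url); no network access here.
    return "fetched " + url


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

Passing the fetch function in as an argument keeps the pacing logic separate from the HTTP library, so the same loop can be tried out without touching a real website.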

Websites have two major concerns - one is protecting the copyright of the content on their site, the other is. Most cases that have been brought to court. For instance, Twitter. 

Terms of Service.

https://www.reddit.com/r/learnprogramming/comments/3l1lcq/how_do_you_find_out_if_a_website_is_scrapable/
    

http://datajournalismhandbook.org/1.0/en/getting_data_3.html

Difference between bots and accessing something through the console. A bit subtle.

We are going to scrape a specific bit of information from Wikipedia's site, on counties. 

## 5) Data Retrieval on the Web: Key concepts [may remove this section or embed content into other sections.]

We've already mentioned HTML and JSON. Here we elaborate on them.

1) HTML
HTML documents imply a structure of nested HTML elements.

2) JSON

3) HTTP 

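# Inspect the headers that were sent with the request
req_headers = response.request.headers
print(type(req_headers))
print(dir(req_headers))
for i in response.request.headers:
    print(i)

# Inspect the type of the response headers and the size of the response body
print(type(response.headers))
print(len(response.content))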


Similarly, saving a web page or going to Source in Developer Tools allows you to view the HTML code associated with each.


A web API. The following are examples of each: [examples here] 

The result may look like gibberish, with little to no spacing. It is not designed for us to read, but for a program (in this case the browser) to parse and present in a visual interface for us.


The robots.txt file is usually geared more towards search engines than anything else.
The bot that calls itself 008 (apparently from 80legs) isn't allowed to access anything.
bender is not allowed to visit my_shiny_metal_ass (it's a Futurama joke; the page doesn't actually exist).
Gort isn't allowed to visit Earth (another joke, from The Day the Earth Stood Still).
Other scrapers should avoid the API methods, "compose message", "search", and the "over 18?" page (because those aren't something you really want showing up in Google), but they're allowed to visit anything else. 
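
Rules like these can be checked in code with the standard library's `urllib.robotparser`. The robots.txt text below is made up to mirror the notes above and is not a copy of any site's real file; the paths, the host `www.example.com`, and the `friendly-bot` user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules modelled on the notes above; not a real site's robots.txt.
ROBOTS_TXT = """\
User-agent: 008
Disallow: /

User-agent: bender
Disallow: /my_shiny_metal_ass

User-agent: Gort
Disallow: /earth

User-agent: *
Disallow: /search
Disallow: /message/compose
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://www.example.com"  # placeholder host

# 008 is blocked from everything.
blocked_everywhere = parser.can_fetch("008", base + "/r/python")
# bender is blocked only from the joke page.
bender_joke = parser.can_fetch("bender", base + "/my_shiny_metal_ass")
bender_other = parser.can_fetch("bender", base + "/r/python")
# Any other bot falls under the "*" rules.
other_search = parser.can_fetch("friendly-bot", base + "/search")
other_page = parser.can_fetch("friendly-bot", base + "/r/python")

print(blocked_everywhere, bender_joke, bender_other, other_search, other_page)
```

For a live site, `set_url` followed by `read` can load the file over the network instead of parsing a string.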


In [None]:
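# Take the first element from the earlier results and inspect it
a = lists[0]
dir(a)
print(a)
a.ul
print("")
# Walk the direct children of the first <ul> inside the element
for i in a.ul.children:
    print(i)
print("")
# Collect every <li> nested anywhere inside the element
all_lists = a.find_all("li")
print("")
print(all_lists)

for i in all_lists:
    print(i)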