# Day 3 [NOTE WORK IN PROGRESS]

A vast amount of data exists on the web and is now publicly available. In this section, we give an overview of popular ways to retrieve data from the web, and walk through some important considerations and concerns. 

## Background
**1) How does the web work?**  
**2) Web scraping vs APIs - what's the difference?**  
**3) Menagerie of tools: crawlers, spiders, scrapers - what's the difference?**  
**4) Building friendly bots: robots.txt and legality**


## Tutorial
**1) Creating a friendly bot on Wikipedia**  
**2) Spotify API**  
  


# Background:

## 1) How does the web work? 
An extremely simplified model of the web consists of servers and clients: clients send requests to servers, and servers respond with resources. The World Wide Web is "an open source information space where documents and other web resources are identified by URLs, interlinked by hypertext links, and can be accessed via the Internet." More concisely, it is a way of accessing information over the Internet that relies on the HTTP protocol. 
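
The request and response exchange described above can be sketched with plain strings. This is only an illustration, assuming a made-up and heavily shortened server response; real responses carry many more headers, and a library normally does this parsing for you. The helper name `parse_response` is hypothetical.

```python
# What a client sends: a request line, headers, then a blank line.
request = (
    "GET /wiki/Web_scraping HTTP/1.1\r\n"
    "Host: en.wikipedia.org\r\n"
    "\r\n"
)

# What a server might send back (shortened and made up for illustration).
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
    "<!DOCTYPE html><html>...</html>"
)


def parse_response(raw):
    """Split a raw HTTP response into status code, reason, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The first line is the status line, e.g. "HTTP/1.1 200 OK"
    version, code, reason = lines[0].split(" ", 2)
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return int(code), reason, headers, body


status, reason, headers, body = parse_response(response)
print(status, reason)
print(headers["Content-Type"])
```

The status line is where codes such as 200 or 404 appear; everything after the blank line is the resource itself.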

<img src="images/Client-server-model.svg.png">

The codes that you sometimes see (such as 404 Not Found) are standard response codes defined in the HTTP protocol. The intricacies of how the web works are beyond the scope of this session. But two key concepts - the server-client model, and .., will return again. If you are curious, you can go to: 



## 2) Web scraping vs APIs  - what's the difference?
When you access data on the web, you typically download a resource. This can happen in a browser or in your Python console. Because our interaction with a browser is primarily visual, the information returned to it is in HTML, a markup language that delivers both content and rules about how the content is to be presented (fonts, text size, bold, arrangement). By contrast, APIs are typically built to return only data, which usually arrives in XML or JSON format. The following are examples of each: [examples here] 

Saving a web page, or opening Source in your browser's Developer Tools, lets you view the HTML code behind a page. 
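
As a rough illustration of the difference, the sketch below holds the same made-up record twice: once as HTML, where the data is mixed with presentation, and once as JSON, where it is only data. Both documents are invented for this example, and the `TextCollector` class is a hypothetical helper built on Python 3's standard-library parser.

```python
import json
from html.parser import HTMLParser

# The same (made-up) record, once as HTML and once as JSON.
html_doc = (
    '<html><body><h1 style="font-size: 20px">Aphex Twin</h1>'
    "<p><b>Genre:</b> electronic</p></body></html>"
)
json_doc = '{"name": "Aphex Twin", "genre": "electronic"}'

# JSON maps directly onto Python dictionaries and lists.
record = json.loads(json_doc)


class TextCollector(HTMLParser):
    """Collect the visible text of an HTML document, ignoring the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


collector = TextCollector()
collector.feed(html_doc)
collector.close()

print(record["name"], "/", record["genre"])
print(collector.chunks)
```

With JSON the fields are already labelled; with HTML you get text fragments and have to work out yourself which fragment is the name and which is the genre.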


### Note on Robots:
We also hear a lot about robots. A robot is ... to accomplish any kind of automated task. However, if you... Bots can be built to extract data from APIs or through web scraping. Note that accessing an API through the console does not necessarily make your program a bot: if you manually send requests from the console to download specific resources, that is not a bot, because a bot must be automated. In the next section, we discuss a text file called "robots.txt", typically found in a website's root folder, that contains instructions for bots. 

For instance, a bot can click through every post on a forum, downloading files for each looking for a specific word or text. In our example, we show how to do this for Wikipedia. 

## 3) Menagerie of tools: crawlers, spiders, scrapers - what's the difference? 
The name web crawler, or spider, evokes an image of long, spindly legs traversing from hyperlink to hyperlink. These automated crawlers continually traverse the web and index new or changed content, which search engines use to update and present the most relevant results for your search requests. 
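
To make the idea of traversing from hyperlink to hyperlink concrete, here is a minimal sketch of a crawler. It assumes a tiny, made-up "web" stored in a dictionary (`PAGES`) rather than real servers, so it runs offline; the names `LinkExtractor` and `crawl` are hypothetical, and a real crawler would also need politeness rules and error handling.

```python
from collections import deque
from html.parser import HTMLParser

# A made-up, in-memory "web": each URL maps to that page's HTML.
PAGES = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/c">C</a>',
    "/c": '<a href="/a">A</a> <a href="/d">D</a>',
    "/d": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Visit pages breadth-first, following each link only once."""
    seen = [start]            # pages discovered so far, in order
    queue = deque([start])    # pages still waiting to be read
    while queue:
        url = queue.popleft()
        extractor = LinkExtractor()
        extractor.feed(PAGES.get(url, ""))
        for link in extractor.links:
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


visited = crawl("/a")
print(visited)
```

Keeping track of pages already seen is what stops the crawler from looping forever when pages link back to each other.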

Web scraping is a little different. While many of the tools used may be identical or similar, web scraping "focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet." (https://en.wikipedia.org/wiki/Web_scraping) 

In many cases, these processes look nearly identical to the server: resources are sent in response to requests. What differentiates web crawling from scraping is what is done with those resources after they are sent, and the overall goal. Most websites want crawlers to find them so their pages appear on popular search engines, but see no clear-cut benefit when their content is parsed and converted into usable data. Beyond research, many companies also use web scraping (in a legal grey area or illegally) to repurpose content, for example a real estate website scraping data from Craigslist to re-post as listings on its own site.

## 4) Considerate robots and legality 
The use of a robots.txt file is a convention. 

The file identifies the links that bots are asked not to visit, and can single out specific crawlers. Let's take a look at reddit's [insert reddit's privacy policy]

Twitter's privacy policy. [insert Twitter privacy policy]  

Blogs and many forum sites, for instance, do not have APIs available. 

The most frequent consequence of inconsiderate scraping is getting your IP address blocked, temporarily or permanently. In addition, if you plan to publish your results for research, contacting the agency is probably a good idea. 

In summary:   
1) Find the website's robots.txt and do not access the pages it disallows through your bot  
2) Make sure your bot does not make too many requests in a specific period (e.g. by pausing between requests with Python's time.sleep function)   
3) Look up the website's terms of use or terms of service.   
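
The first two points can be sketched with Python 3's standard library. The robots.txt content, the bot name, and the example.org URLs below are all made up for illustration; a real bot would download the site's own robots.txt and would actually fetch each allowed page where the sketch only prints.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real bot would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "friendly-tutorial-bot"  # hypothetical bot name
DELAY_SECONDS = 0.1  # kept tiny so the sketch runs quickly; a real bot should wait longer

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

candidate_urls = [
    "https://example.org/public/page1",
    "https://example.org/private/page2",
    "https://example.org/public/page3",
]

visited = []
for url in candidate_urls:
    # Point 1: skip anything robots.txt asks bots not to access
    if not rules.can_fetch(USER_AGENT, url):
        print("skipping (disallowed):", url)
        continue
    print("would fetch:", url)  # the actual download would go here
    visited.append(url)
    # Point 2: pause so requests are spaced out
    time.sleep(DELAY_SECONDS)
```

Point 3 cannot be automated: reading the terms of service is still up to you.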


## 5) Data Retrieval on the Web: Key concepts [may remove this section or embed content into other sections.]

We've already mentioned HTML, JSON. Here we elaborate on them more.

1) HTML
HTML documents imply a structure of nested HTML elements.

2) JSON

3) http 
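
To illustrate what "nested HTML elements" means in practice, the sketch below prints the element structure of a small, made-up document as an indented outline. The `OutlinePrinter` class is a hypothetical helper written for this example, using Python 3's standard-library parser.

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# A small, made-up document: a table nested inside body, nested inside html.
doc = (
    "<html><body><table>"
    "<tr><td>Industry</td><td>Software</td></tr>"
    "</table></body></html>"
)

printer = OutlinePrinter()
printer.feed(doc)
print("\n".join(printer.outline))
```

Scraping largely comes down to describing a path through this nesting to the element that holds the data you want.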



# Let's get started!



Now that we've gone through major concepts and tried out a few code snippets, let's hone our Python skills and build two basic bots, one on Wikipedia, and one using Spotify's API. 

## 6) Tutorial 1: Creating a friendly bot on Wikipedia

Our first use case involves scraping some basic information about technology companies from Wikipedia. Say you are the chief innovation officer of a small city in the San Francisco Bay Area. A number of large-scale local investments in office space have taken place, with space opening up over the next few years. You wish to be part of the trend of technology companies moving out of San Francisco and Silicon Valley. You have been networking and talking to companies at events and conferences, but would like a more systematic way of identifying companies to focus on. 

You notice a list of 141 technology companies based in the San Francisco area on Wikipedia:
https://en.wikipedia.org/wiki/Category:Technology_companies_based_in_the_San_Francisco_Bay_Area

Your goal is to scrape basic, useful information about each company into a list, on which you can then run some summary statistics to identify companies, or even industries, you are interested in focusing on. 

**In particular, you want to know:**  
1) what industry they are in  
2) where the company is currently headquartered  
3) the number of employees   
4) website address of the company  

This will allow you to know the current and budding tech hubs in the Bay Area, get a better sense of your competition, and estimate the number of jobs you can attract to your city. For convenience, you also collate the website addresses of the companies to pull into your list. 

## Step 1: Examining the webpage structure

The first step is to examine the webpage in your browser, using developer tools (Firefox/Chrome). First, identify the element that you want to pull data from. In this case, it is a series of links that the bot will traverse, much like traversing the posts of a forum. 

For this case, we concentrate on the box that appears at the side of each company's page. 

While we've identified visually where we want to pull the element from, this may or may not translate into code. In our case, thankfully, the pages have similar enough structure. HTML elements have an optional attribute called "class", which, among other uses, allows the website to specify how an element should be formatted (using what is called CSS). For our purposes, we can use the "infobox vcard" class to tell the program which box we want to pull out and use.

[click inspect element] 




## Step 2: Interacting with the webpage through the console 

After examining the webpage structure through your browser, it's time to interact with the underlying HTML code (what you see in the inspect element panel) directly in your console. Both processes are useful for coming up with a strategy of how (and whether) data from the website can be scraped. 

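First, import `urlopen` (from `urllib.request` in Python 3; the Python 2 equivalent is `urllib2`) and BeautifulSoup. Downloading an HTML copy of the page is as simple as: 

In [23]:
from urllib.request import urlopen  # Python 3; in Python 2 this lived in urllib2
from bs4 import BeautifulSoup

# Request the category page; the server responds with the page's HTML
response = urlopen('http://en.wikipedia.org/wiki/Category:Technology_companies_based_in_the_San_Francisco_Bay_Area')

# read() returns bytes, so decode them into text before printing
html = response.read().decode('utf-8')
print(html[0:300])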

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" />
<title>Category:Technology companies based in the San Francisco Bay Area - Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( 


This may look like gibberish, with little to no spacing. The code is not designed for us to read, but for a program (in this case the browser) to parse and present to us in a visual interface.

### a) Pulling a list of links

In [None]:
# list_of_lists = 
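
One possible way to pull the list of company links is sketched below. So that it runs offline, it parses a shortened, made-up stand-in for the category page (`SAMPLE_HTML`) with Python 3's standard-library parser; in the tutorial the input would be the downloaded `html`, and BeautifulSoup could do the same job. The `mw-pages` container, the sample company names, and the rule "article links start with /wiki/ and contain no colon" are assumptions to check against the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = (
    "https://en.wikipedia.org/wiki/"
    "Category:Technology_companies_based_in_the_San_Francisco_Bay_Area"
)

# Shortened, made-up stand-in for the category page's HTML.
SAMPLE_HTML = """
<div id="mw-pages">
  <ul>
    <li><a href="/wiki/Example_Corp" title="Example Corp">Example Corp</a></li>
    <li><a href="/wiki/Sample_Systems" title="Sample Systems">Sample Systems</a></li>
    <li><a href="/wiki/Help:Categories">Help</a></li>
  </ul>
</div>
"""


class ArticleLinkParser(HTMLParser):
    """Collect absolute URLs of links that look like ordinary articles."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: article links start with /wiki/ and have no
        # namespace prefix such as "Help:" or "Category:".
        if href.startswith("/wiki/") and ":" not in href:
            self.links.append(urljoin(self.base_url, href))


parser = ArticleLinkParser(BASE_URL)
parser.feed(SAMPLE_HTML)
company_links = parser.links
print(company_links)
```

Turning relative links into absolute URLs with `urljoin` means each entry in the list can be requested directly later on.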

### b) Manipulating the element containing data 
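
A minimal sketch of this step, assuming the infobox is a table whose rows pair a header cell (`th`) with a data cell (`td`). The page below is made up, and the standard-library parser is used so the sketch runs without extra installs; with BeautifulSoup, `soup.find('table', class_='infobox vcard')` would locate the same table more compactly. The `InfoboxParser` class is hypothetical and does not handle tables nested inside the infobox.

```python
from html.parser import HTMLParser

# Made-up company page: one infobox and one unrelated table.
SAMPLE_PAGE = """
<table class="infobox vcard">
  <tr><th>Industry</th><td>Software</td></tr>
  <tr><th>Headquarters</th><td>Oakland, California</td></tr>
  <tr><th>Number of employees</th><td>250</td></tr>
  <tr><th>Website</th><td><a href="http://www.example.com">example.com</a></td></tr>
</table>
<table class="other"><tr><th>Industry</th><td>Ignored</td></tr></table>
"""


class InfoboxParser(HTMLParser):
    """Turn the rows of the "infobox vcard" table into a dictionary."""

    def __init__(self):
        super().__init__()
        self.in_infobox = False
        self.cell = None  # "th" or "td" while inside a cell
        self.label = ""
        self.value = ""
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "infobox" in classes and "vcard" in classes:
            self.in_infobox = True
        elif self.in_infobox and tag == "tr":
            self.label, self.value = "", ""  # start a fresh row
        elif self.in_infobox and tag in ("th", "td"):
            self.cell = tag

    def handle_endtag(self, tag):
        if not self.in_infobox:
            return
        if tag in ("th", "td"):
            self.cell = None
        elif tag == "tr" and self.label:
            self.fields[self.label.strip()] = self.value.strip()
        elif tag == "table":
            self.in_infobox = False

    def handle_data(self, data):
        # Text only counts when we are inside a header or data cell
        if self.cell == "th":
            self.label += data
        elif self.cell == "td":
            self.value += data


infobox = InfoboxParser()
infobox.feed(SAMPLE_PAGE)
print(infobox.fields)
```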

### c) Building a loop
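
The loop itself might look like the sketch below. To keep it runnable offline, `fetch_and_parse` is a hypothetical stand-in that looks pages up in a made-up dictionary; in the real bot it would download the page and extract the infobox fields. The delay is kept tiny here only so the example finishes quickly.

```python
import time

# Made-up results standing in for what each company page would yield.
FAKE_PAGES = {
    "https://en.wikipedia.org/wiki/Example_Corp": {
        "Industry": "Software",
        "Headquarters": "Oakland, California",
    },
    "https://en.wikipedia.org/wiki/Sample_Systems": {
        "Industry": "Hardware",
        "Headquarters": "San Jose, California",
    },
}


def fetch_and_parse(url):
    """Hypothetical stand-in: download one page and return its infobox fields."""
    return FAKE_PAGES.get(url)


def scrape_all(links, delay_seconds=1.0):
    """Visit each link in turn, pausing between requests."""
    rows = []
    for url in links:
        try:
            fields = fetch_and_parse(url)
        except OSError as error:  # network problems should not stop the run
            print("failed:", url, error)
            continue
        if fields is None:
            print("no infobox found:", url)
            continue
        row = {"url": url}
        row.update(fields)
        rows.append(row)
        time.sleep(delay_seconds)  # be considerate: space out requests
    return rows


links = list(FAKE_PAGES) + ["https://en.wikipedia.org/wiki/Missing_Page"]
rows = scrape_all(links, delay_seconds=0.1)
print(len(rows), "companies scraped")
```

Skipping pages that fail or lack an infobox, rather than stopping, matters once the list grows to over a hundred links.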

### d) Exporting the data to a csv file 
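
Exporting can be done with Python's built-in `csv` module. The rows, column names, and file name below are made up for illustration (Python 3); the second row deliberately lacks some fields, since not every infobox will contain all four pieces of information.

```python
import csv
import os
import tempfile

# Made-up rows; the second company is missing two fields on purpose.
rows = [
    {
        "name": "Example Corp",
        "industry": "Software",
        "headquarters": "Oakland, California",
        "employees": "250",
        "website": "example.com",
    },
    {
        "name": "Sample Systems",
        "industry": "Hardware",
        "headquarters": "San Jose, California",
    },
]

FIELDNAMES = ["name", "industry", "headquarters", "employees", "website"]

# Written to the temporary directory here; choose your own path in practice.
output_path = os.path.join(tempfile.gettempdir(), "bay_area_companies.csv")

with open(output_path, "w", newline="", encoding="utf-8") as handle:
    # restval fills in any column a row does not have
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to check what was saved
with open(output_path, newline="", encoding="utf-8") as handle:
    for record in csv.DictReader(handle):
        print(record["name"], "|", record["industry"])
```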

### Parsing HTML

Web scraping is flexible, but particularly useful for semi-structured, repetitive data. You start by browsing the individual Wikipedia pages for each of the links. In particular, you notice the box that appears regularly at the side of each page, which contains much of the information you need. As before, the "infobox vcard" class tells the program which box to pull out and use. 

[Click inspect element]

In [14]:
# Run the actual code

from bs4 import BeautifulSoup

In [2]:
# [pull code and output above elements, save to csv]

## Tutorial 2: Using Spotify's API 

Following our discussion of APIs, let's start interacting with some web services! We shall use Spotify's public API as an example. First, take a look at. 

Most APIs will employ such a format. Basically, you enter. So, from the console, all your program needs to do is query that URL with specific terms and process the data that returns, typically in JSON.

We shall try to find the top 5 most popular artists, as ranked by Spotify's algorithm.

The first part of the URL is called the "endpoint"; it can be viewed as the root. For instance, Facebook's root api is ..., and Twitter's is .... The next part, after the question mark, is called the query string. Its contents depend on how the service has designed its API, but a few elements are consistent throughout. Note that query strings are used throughout the web and are by no means specific to APIs, which, as we've seen, have a quite general definition. 
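
As a small illustration, the Spotify search URL used later in this tutorial can be split into its endpoint and query string with Python 3's standard library. The variable names are only for this sketch.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://api.spotify.com/v1/search?q=justin&type=artist"

parts = urlsplit(url)

# Everything before the question mark: the endpoint
endpoint = parts.scheme + "://" + parts.netloc + parts.path

# Everything after the question mark: the query string, as key/value pairs
params = parse_qs(parts.query)

print(endpoint)
print(params)
```

Each key in the query string maps to a list because a key may legally appear more than once in a URL.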

## URL Encoding


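Characters that are not allowed in a URL, such as spaces and square brackets, have to be replaced by encoded equivalents. This is called URL encoding. Some browsers will convert these characters automatically, which matters when your query itself has special characters, such as Aphex Twin's minipops 67 [120.2][source field mix]: 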

https://www.google.com/#q=minipops+67+%5B120.2%5D%5Bsource+field+mix%5D

One easy way to do the conversion automatically is to type the query into Google, then copy and paste the URL from the browser. 
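
The conversion can also be done in code. The sketch below uses Python 3's `urllib.parse` on the track title from the example above; the `type=track` parameter in the second half is an assumption added only to show how several parameters are combined.

```python
from urllib.parse import quote_plus, unquote_plus, urlencode

query = "minipops 67 [120.2][source field mix]"

# Spaces become "+", and the square brackets become %5B and %5D
encoded = quote_plus(query)
print(encoded)

# Decoding gives back the original text
decoded = unquote_plus(encoded)

# urlencode builds a whole query string from key/value pairs
query_string = urlencode([("q", query), ("type", "track")])
print(query_string)
```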

https://api.spotify.com/v1/search?q=justin&type=artist

In [8]:
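from urllib.request import urlopen  # Python 3; in Python 2 this lived in urllib2

# Query the search endpoint and read the raw JSON text of the response
response = urlopen("https://api.spotify.com/v1/search?q=justin&type=artist")
json_object = response.read()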

print(len(json_object))


16914
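
Once the response text is in hand, finding the most popular artists is a matter of parsing the JSON and sorting. The sketch below works on a small, made-up response rather than a live one; it assumes the search result nests artists under `artists` and `items`, each with a `name` and a `popularity` score, so check the field names against an actual response.

```python
import json

# Made-up response, shaped the way the search result is assumed to be.
sample_response = """
{"artists": {"items": [
    {"name": "Artist A", "popularity": 72},
    {"name": "Artist B", "popularity": 95},
    {"name": "Artist C", "popularity": 40},
    {"name": "Artist D", "popularity": 88},
    {"name": "Artist E", "popularity": 61},
    {"name": "Artist F", "popularity": 79}
]}}
"""

data = json.loads(sample_response)
artists = data["artists"]["items"]

# Sort from most to least popular and keep the first five
top_five = sorted(artists, key=lambda artist: artist["popularity"], reverse=True)[:5]

for artist in top_five:
    print(artist["name"], artist["popularity"])
```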


As you progress on your API journey, 

# END OF TUTORIAL

# Rough notes

# robots.txt

The following is an example of Reddit's robots.txt file.

Disallows specific pages to be scrapable. The bot that calls itself Bender, and Fort, 

Web scraping falls into a legal grey area. Abuse usually means that your IP will be blocked. Websites can also often determine whether they are being accessed through a browser or programmatically by a bot, and may block the latter. For instance, a website that depends on advertising may do this. Many websites offer APIs partly as an attempt to avoid web scraping. Established websites will have a way of buffering or blocking excessive requests from a single IP or source. In general, you should at minimum space out your requests, follow the website's robots.txt, and follow its terms of service. 

Websites have two major concerns - one is protecting the copyright of the content on their site, the other is. Most cases that have been brought to court. For instance, Twitter. 

Terms of Service.

https://www.reddit.com/r/learnprogramming/comments/3l1lcq/how_do_you_find_out_if_a_website_is_scrapable/
    

http://datajournalismhandbook.org/1.0/en/getting_data_3.html

Difference between bots and accessing something through the console. A bit subtle.

We are going to scrape a specific bit of information from Wikipedia's site. On counties. 