# Using APIs

- Like many programmers who have worked on large projects, I have my share of horror stories when it comes to working with other people's code. From namespace issues to type issues to misunderstandings of function output, simply trying to get information from point A to method B can be a nightmare.

- This is where application programming interfaces come in handy: they provide nice, convenient interfaces between multiple disparate applications. It doesn't matter if the applications are written by different programmers, with different architectures, or even in different languages; APIs are designed to serve as a lingua franca between different pieces of software that need to share information with each other.

- Although various APIs exist for a variety of different software applications, in recent times "API" has been commonly understood as meaning "web application API." Typically, a programmer will make a request to an API via HTTP for some type of data, and the API will return this data in the form of XML or JSON. Although most APIs still support XML, JSON is quickly becoming the encoding protocol of choice.

- If taking advantage of a ready-to-use program to get information prepackaged in a useful format seems like a bit of a departure from the rest of this book, well, it is and it isn't. Although using APIs isn't generally considered web scraping by most people, both practices use many of the same techniques (sending HTTP requests) and produce similar results (getting information); they can often be very complementary to each other.

- For instance, you might want to combine information gleaned from a web scraper with information from a published API in order to make the information more useful to you. In an example later in this chapter, we will look at combining Wikipedia edit histories (which contain IP addresses) with an IP address resolver API in order to get the geographic location of Wikipedia edits around the world.

- In this chapter, we'll offer a general overview of APIs and how they work, look at a few popular APIs available today, and look at how you might use an API in your own web scrapers.

## How APIs Work

- Although APIs are not nearly as ubiquitous as they should be (a large motivation for writing this book, because if you can't find an API, you can still get the data through scraping), you can find APIs for many types of information. Interested in music? There are a few different APIs that give you songs, artists, albums, and even information about musical styles and related artists. Need sports data? ESPN provides APIs for athlete information, game scores, and more. Google has dozens of APIs in its Developer section for language translations, analytics, geolocation, and more.

## Common Conventions 

- Unlike the subjects of most web scraping, APIs follow an extremely standardized set of rules to produce information, and they produce that information in an extremely standardized way as well. Because of this, it is easy to learn a few simple ground rules that will help you to quickly get up and running with any given API, as long as it's fairly well written.

## Methods

- There are four ways to request information from a web server via HTTP:

- GET
- POST
- PUT
- DELETE

- GET is what you use when you visit a website through the address bar in your browser. GET is the method you are using when you make a call to a URL. You can think of GET as saying, "Hey, web server, please get me this information."

- POST is what you use when you fill out a form, or submit information, presumably to a backend script on the server. Every time you log into a website, you are making a POST request with your username and (hopefully) encrypted password. If you are making a POST request with an API, you are saying, "Please store this information in your database."

- PUT is less commonly used when interacting with websites, but is used from time to time in APIs. A PUT request is used to update an object or information. An API might require a POST request to create a new user, for example, but it might need a PUT request if you want to update that user's email address.

- DELETE is straightforward; it is used to delete an object. For instance, if I send a DELETE request to http://myapi.com/user/23, it will delete the user with the ID 23. DELETE methods are not often encountered in public APIs, which are primarily created to disseminate information rather than allow random users to remove that information from their databases. However, like the PUT method, it's a good one to know about.

- Although a handful of other HTTP methods are defined under the specifications for HTTP, these four constitute the entirety of what is used in just about any API you will ever encounter.
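
- As a minimal sketch of how the four methods look from Python, the snippet below builds (but never sends) one request per method with `urllib.request.Request`. The host `myapi.com` is the placeholder from the DELETE example above; the `/user` path for creating a user and the form-encoded email bodies are assumptions made purely for illustration.

```python
from urllib.request import Request

# Placeholder URLs: myapi.com comes from the text above, the paths are assumed.
user_url = "http://myapi.com/user/23"
users_url = "http://myapi.com/user"

# With no body, urllib defaults to GET.
get_req = Request(user_url)
# Supplying a body makes urllib default to POST (create a new user).
post_req = Request(users_url, data=b"email=ryan%40example.com")
# PUT and DELETE must be named explicitly through the method argument.
put_req = Request(user_url, data=b"email=new%40example.com", method="PUT")
delete_req = Request(user_url, method="DELETE")

# Nothing is sent over the network; we only inspect what would be sent.
for req in (get_req, post_req, put_req, delete_req):
    print(req.get_method(), req.full_url)
```

- Passing any of these objects to `urlopen` would actually send the request, so only do that against an API you are allowed to modify.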

## Authentication

- Although some APIs do not use any authentication to operate (meaning anyone can make an API call for free, without registering with the application first), many modern APIs require some type of authentication before they can be used.

- Some APIs require authentication in order to charge money per API call, or they might offer their service on some sort of a monthly subscription basis. Others authenticate in order to "rate limit" users (restrict them to a certain number of calls per second, hour, or day), or to restrict the access of certain kinds of information or types of API calls for some users. Other APIs might not place restrictions, but they might want to keep track of which users are making which calls for marketing purposes.

- All methods of API authentication generally revolve around the use of a token of some sort, which is passed to the web server with each API call made. This token is either provided to the user when the user registers and is a permanent fixture of the user's calls (generally in lower-security applications), or it can frequently change, and is retrieved from the server using a username and password combination. 

- For example, to make a call to the Echo Nest API in order to retrieve a list of songs by the band Guns N' Roses, we would use:

- This provides the server with the api_key value that was given to me on registration, allowing the server to identify the requester as Ryan Mitchell and provide the requester with the JSON data.
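
- To illustrate the general idea of carrying a token in the URL, here is a hedged sketch that appends an `api_key` parameter to a query string. It is not the Echo Nest call itself: the endpoint `http://myapi.com/songs`, the helper name `build_api_url`, and the `artist` parameter are all hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_api_url(base_url, api_key, **params):
    """Return base_url with the token and any extra parameters URL-encoded."""
    query = urlencode({"api_key": api_key, **params})
    return base_url + "?" + query

# Hypothetical endpoint and parameter names, used only for illustration.
url = build_api_url("http://myapi.com/songs", "<your api key>",
                    artist="Guns N' Roses")
print(url)

# Decoding the query string recovers exactly what was passed in.
print(parse_qs(urlsplit(url).query))
```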

- In addition to passing tokens in the URL of the request itself, tokens might also be passed to the server via a cookie in the request header. 

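```python
import urllib.request
from urllib.request import urlopen

token = "<your api key>"
# Send the token in a request header instead of in the URL
webRequest = urllib.request.Request("http://myapi.com", headers={"token":token})
html = urlopen(webRequest)
```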

## Responses

- As you saw in the FreeGeoIP example at the beginning of the chapter, an important feature of APIs is that they have well-formatted responses. The most common types of response formatting are eXtensible Markup Language (XML) and JavaScript Object Notation (JSON).

- In recent years, JSON has become vastly more popular than XML for a couple of major reasons. First, JSON files are generally smaller than well-designed XML files. Compare, for example, the XML data:

```xml
<user><firstname>Ryan</firstname><lastname>Mitchell</lastname><username>Kludgist</username></user>
```

which clocks in at 98 characters, and the same data in JSON:

```json
{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}
```


which is only 73 characters, or roughly 26% smaller than the equivalent XML.

Of course, one could argue that the XML could be formatted like this:

```XML
<user firstname="ryan" lastname="mitchell" username="Kludgist"></user>
```


but this is considered bad practice because it doesn't support deep nesting of data. Regardless, it still requires 70 characters, about the same length as the equivalent JSON.
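
The size comparison above is easy to check for yourself. The following sketch simply measures the three snippets from this section with `len()`; the dictionary labels are my own and carry no special meaning.

```python
# The three encodings of the same user record, copied from the text above.
xml_nested = ("<user><firstname>Ryan</firstname><lastname>Mitchell</lastname>"
              "<username>Kludgist</username></user>")
json_doc = '{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}'
xml_attrs = '<user firstname="ryan" lastname="mitchell" username="Kludgist"></user>'

sizes = {
    "nested XML": len(xml_nested),
    "JSON": len(json_doc),
    "attribute XML": len(xml_attrs),
}
for label, size in sizes.items():
    print(f"{label}: {size} characters")

# Relative saving of JSON over the nested XML form.
saving = 1 - sizes["JSON"] / sizes["nested XML"]
print(f"JSON is {saving:.0%} smaller than the nested XML")
```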

Another reason JSON is quickly becoming more popular than XML is simply due to a shift in web technologies. In the past, it was more common for a server-side script such as PHP or .NET to be on the receiving end of an API. Nowadays, it is likely that a framework, such as Angular or Backbone, will be sending and receiving API calls. Server-side technologies are somewhat agnostic as to the form in which their data comes. But JavaScript libraries like Backbone find JSON easier to handle.

## Echo Nest

The Echo Nest is a fantastic example of a company that is built on web scrapers. Although some music-based companies, such as Pandora, depend on human intervention to categorize and annotate music, The Echo Nest relies on automated intelligence and information scraped from blogs and news articles in order to categorize musical artists, songs, and albums.

## A Few Examples

The Echo Nest API is built around several basic content types: artists, songs, tracks, and genres. Except for genres, these content types all have unique IDs, which are used to retrieve information about them in various forms, through API calls. For example, if I wanted to retrieve a list of songs performed by Monty Python, I would make the following call to retrieve their ID:

## Twitter

## Google API

## Parsing JSON
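
As a minimal sketch of what parsing an API response involves, the example below feeds the JSON snippet from the Responses section to Python's standard `json` module. The variable names are assumptions, and a real response would come from `urlopen(...).read().decode('utf-8')` rather than from a literal string.

```python
import json

# Stand-in for a response body; copied from the Responses section.
response_text = '{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}'

# json.loads turns JSON objects into dicts and JSON strings into str.
data = json.loads(response_text)
user = data["user"]
print(user["firstname"], user["lastname"])

# dict.get returns None for a missing key instead of raising KeyError,
# which is handy when a field is optional in the response.
print(data.get("country_code"))
```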

## Bringing It All Back Home

Although the raison d'être of many modern web applications is to take existing data and format it in a more appealing way, I would argue that this isn't a very interesting thing to do in most instances. If you're using an API as your only data source, the best you can do is merely copy someone else's database that already exists, and which is, essentially, already published. What can be far more interesting is to take two or more data sources and combine them in a novel way, or use an API as a tool to look at scraped data from a new perspective.

If you've spent much time on Wikipedia, you've likely come across an article's revision history page, which displays a list of recent edits. 

According to the freegeoip.net API, as of this writing, the IP address outlined on the history page is from Quezon, Philippines.

In [None]:
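from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with a number: since Python 3.11, random.seed() no longer accepts a datetime object
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # Keep only article links: they start with /wiki/ and contain no colon
    return bsObj.find("div", {"id":"bodyContent"}).find_all("a",
                                    href=re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    # Format of revision history pages is:
    # http://en.wikipedia.org/w/index.php?title=<page title>&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = ("http://en.wikipedia.org/w/index.php?title="
                  +pageUrl+"&action=history")
    print("history url is : "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # Select the links with the anonymous-user class, whose text is an IP address
    ipAddresses = bsObj.find_all("a", {"class":"mw-anonuserlink"})
    # A set keeps each IP address only once
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList

links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0):
    for link in links:
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            print(historyIP)

    # Continue the crawl from a randomly chosen linked article
    newLink = links[random.randint(0, len(links)-1)].attrs["href"]
    links = getLinks(newLink)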

Now that we have code that retrieves IP addresses as a string, we can combine this with the getCountry function from the previous section in order to resolve these IP addresses to countries. I modified getCountry slightly, in order to account for invalid or malformed IP addresses that will result in a "404 Not Found" error:

In [None]:
import json
from urllib.error import HTTPError
from urllib.request import urlopen

# Reuses getLinks, getHistoryIPs, and random from the previous cell
def getCountry(ipAddress):
    try:
        response = urlopen("http://freegeoip.net/json/"
                           +ipAddress).read().decode('utf-8')
    except HTTPError:
        # Invalid or malformed IP addresses result in a "404 Not Found" error
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")

links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0):
    for link in links:
        print("--------------")
        historyIPs = getHistoryIPs(link.attrs['href'])
        for historyIP in historyIPs:
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP + " is from " + country)

    newLink = links[random.randint(0, len(links)-1)].attrs['href']
    links = getLinks(newLink)

## More About APIs

In this chapter, we've looked at a few ways that modern APIs are commonly used to access data on the Web, in particular uses of APIs that you might find useful in web scraping.

- RESTful Web APIs
- Designing APIs for the Web

# Ch5 : Storing Data

- Although printing out to the terminal is a lot of fun, it's not incredibly useful when it comes to data aggregation and analysis. In order to make the majority of web scrapers remotely useful, you need to be able to save the information that they scrape.

- In this chapter, we will look at three main methods of data management that are sufficient for almost any imaginable application. Do you need to power the backend of a website or create your own API? You'll probably want your scrapers to write to a database. Need a fast and easy way to collect some documents off the Internet and put them on your hard drive? You'll probably want to create a file stream for that. Need occasional alerts, or aggregated data once a day? Send yourself an email!

- Above and beyond web scraping, the ability to store and interact with large amounts of data is incredibly important for just about any modern programming application. In fact, the information in this chapter is necessary for implementing many of the examples in later sections of the book. I highly recommend that you at least skim it if you're unfamiliar with automated data storage.

## Media Files

- There are two main ways to store media files: by reference, and by downloading the file itself. You can store a file by reference simply by storing the URL that the file is located at. This has several advantages:

- Scrapers run much faster, and require much less bandwidth, when they don't have to download files.
- You save space on your own machine by storing only the URLs.
- It is easier to write code that only stores URLs and doesn't need to deal with additional file downloads.
- You can lessen the load on the host server by avoiding large file downloads.

> Disadvantages

- Embedding these URLs in your own website or application is known as hotlinking and doing it is a very quick way to get you in hot water on the Internet.
- You do not want to use someone else's server cycles to host media for your own applications.
- The file hosted at any particular URL is subject to change. This might lead to embarrassing effects if, say, you're embedding a hotlinked image on a public blog. If you're storing the URLs with the intent to store the file later, for further research, it might eventually go missing or be changed to something completely irrelevant at a later date.
- Real web browsers do not just request a page's HTML and move on; they download all of the assets required by the page as well. Downloading files can help make your scraper look like an actual human is browsing the site, which can be an advantage.

**urllib.request.urlretrieve** can be used to download files from any remote URL:

In [1]:
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, 'html.parser')
imageLocation = bsObj.find("a", {"id":"logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")

('logo.jpg', <http.client.HTTPMessage at 0x113d48908>)

The following will download all internal files, linked to by any tag's src attribute, from the home page of the site:

In [None]:
import os 
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    # Normalize each src value to an absolute URL without the "www." prefix
    if source.startswith("http://www."):
        url = "http://"+source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://"+source[4:]
    else:
        url = baseUrl+"/"+source
    # Skip external files: only URLs containing the base URL are kept
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory+path
    directory = os.path.dirname(path)
    
    if not os.path.exists(directory):
        os.makedirs(directory)
        
    return path

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
downloadList = bsObj.find_all(src=True)

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download['src'])
    if fileUrl is not None:
        print(fileUrl)
        # Download inside the loop so every internal file is saved, not only the last one
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

## Storing Data to CSV

CSV, or comma-separated values, is one of the most popular file formats in which to store spreadsheet data. It is supported by Microsoft Excel and many other applications because of its simplicity. The following is an example of a perfectly valid CSV file:

As with Python, whitespace is important here: Each row is separated by a newline character, while columns within the row are separated by commas. Other forms of CSV files use tabs or other characters to separate columns, but these file formats are less common and less widely supported.
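
To make the row and column rules concrete, here is a small in-memory sketch of the less common tab-delimited variant, using the `delimiter` argument of Python's `csv` module. It writes to an `io.StringIO` buffer rather than a file, which is my own choice so that the example needs no directory on disk.

```python
import csv
import io

# An in-memory text buffer stands in for a file on disk.
buffer = io.StringIO()

# delimiter="\t" switches the column separator from commas to tabs.
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(("number", "number plus 2", "number times 2"))
for i in range(3):
    writer.writerow((i, i + 2, i * 2))

# Rewind and read the rows back with the same delimiter.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter="\t"))
for row in rows:
    print(row)
```

Note that every value comes back as a string; the `csv` module does not guess column types.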

In [None]:
import csv 
csvFile = open(".../files/test.csv", "w+")
try:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))
finally:
    csvFile.close()
    

## MySQL

## Some Basic Commands

## Email

Just like web pages are sent over HTTP, email is sent over SMTP (Simple Mail Transfer Protocol). And, just like you use a web server client to handle sending out web pages over HTTP, servers use various email clients, such as Sendmail, Postfix, or Mailman, to send and receive email.

In [None]:
import smtplib
from email.mime.text import MIMEText

msg = MIMEText('The body of the email is here')

msg['Subject'] = "An Email Alert"
msg['From'] = "kadensungbincho@gmail.com"
msg['To'] = "webmaster@pythonscraping.com"

# Requires an SMTP server listening on this machine
s = smtplib.SMTP('localhost')
s.send_message(msg)
s.quit()

Python contains two important packages for sending emails: smtplib and email.


Python's email module contains useful formatting functions for creating email "packets" to send. The MIMEText object, used here, creates an empty email formatted for transfer with the low-level MIME (Multipurpose Internet Mail Extensions) protocol, across which the higher-level SMTP connections are made. 
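
To see what such a formatted message looks like before any server is involved, the following sketch builds a `MIMEText` object and prints its serialized form. No email is sent, and the sender address `sender@example.com` is a placeholder of my own.

```python
from email.mime.text import MIMEText

# Build the message; the From address is a placeholder for illustration.
msg = MIMEText("The body of the email is here")
msg["Subject"] = "An Email Alert"
msg["From"] = "sender@example.com"
msg["To"] = "webmaster@pythonscraping.com"

# as_string() renders the headers, a blank line, and then the body,
# which is the text that would be handed to the mail server.
rendered = msg.as_string()
print(rendered)
```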

The smtplib package contains information for handling the connection to the server. Just like a connection to a MySQL server, this connection must be torn down every time it is created, to avoid creating too many connections.

This basic email function can be extended and made more useful by enclosing it in a function:

In [None]:
import smtplib
from email.mime.text import MIMEText
from bs4 import BeautifulSoup
from urllib.request import urlopen
import time

def sendMail(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'christmas_alerts@pythonscraping.com'
    msg['To'] = 'kadensungbincho@gmail.com'

    # Open the connection, send the message, and tear the connection down again
    s = smtplib.SMTP('localhost')
    s.send_message(msg)
    s.quit()

bsObj = BeautifulSoup(urlopen("https://isitchristmas.com/"), "html.parser")
while(bsObj.find('a', {"id":"answer"}).attrs['title'] == "NO"):
    print("It is not Christmas yet.")
    time.sleep(3600)
    # Re-fetch the page every hour so the loop sees an updated answer
    bsObj = BeautifulSoup(urlopen("https://isitchristmas.com"), "html.parser")
sendMail("It's Christmas!",
        "According to")
    

# Ch6 : Reading Documents

## Document Encoding

A document's encoding tells applications, whether they are your computer's operating system or your own Python code, how to read it. This encoding can usually be deduced from its file extension, although this file extension is not mandated by its encoding. I could, for example, save myImage.jpg as myImage.txt with no problems, at least until my text editor tried to open it. Fortunately, this situation is rare, and a document's file extension is usually all you need to know in order to read it correctly.

On a fundamental level, all documents are encoded in 0s and 1s. On top of that, there are encoding algorithms that define things such as "How many bits per character" or "How many bits represent the color for each pixel". On top of that, you might have a layer of compression, or some space-reducing algorithm, as is the case with PNG files.

Although dealing with non-HTML files might seem intimidating at first, rest assured that with the right library, Python will be properly equipped to deal with any format of information you want to throw at it. The only difference between a text file, a video file, and an image file is how their 0s and 1s are interpreted. 
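
As a tiny illustration of that last point, the sketch below takes one four-byte sequence and reads it three ways. The sample bytes and variable names are arbitrary choices of mine.

```python
# The same four bytes, interpreted three different ways.
data = b"ABCD"

# Read as ASCII text: one character per byte.
as_text = data.decode("ascii")

# Read as a single big-endian unsigned integer.
as_number = int.from_bytes(data, "big")

# Read as the raw 0s and 1s underneath both interpretations.
as_bits = " ".join(f"{byte:08b}" for byte in data)

print(as_text)
print(as_number)
print(as_bits)
```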

## Text

In [2]:
from urllib.request import urlopen
textPage = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt")
print(textPage.read())

b'CHAPTER I\n\n"Well, Prince, so Genoa and Lucca are now just family estates of theBuonapartes. But I warn you, if you don\'t tell me that this means war,if you still try to defend the infamies and horrors perpetrated bythat Antichrist- I really believe he is Antichrist- I will havenothing more to do with you and you are no longer my friend, no longermy \'faithful slave,\' as you call yourself! But how do you do? I seeI have frightened you- sit down and tell me all the news."\n\nIt was in July, 1805, and the speaker was the well-known AnnaPavlovna Scherer, maid of honor and favorite of the Empress MaryaFedorovna. With these words she greeted Prince Vasili Kuragin, a manof high rank and importance, who was the first to arrive at herreception. Anna Pavlovna had had a cough for some days. She was, asshe said, suffering from la grippe; grippe being then a new word inSt. Petersburg, used only by the elite.\n\nAll her invitations without exception, written in French, anddelivered by a scarle

Normally, when we retrieve a page using urlopen, we turn it into a BeautifulSoup object in order to parse the HTML. In this case, we can read the page directly. Turning it into a BeautifulSoup object, while perfectly possible, would be counterproductive: there's no HTML to parse, so the library would be useless. Once the text file is read in as a string, you merely have to analyze it like you would any other string read into Python. The disadvantage here, of course, is that you don't have the ability to use HTML tags as context clues, pointing you in the direction of the text you actually need, versus the text you don't want. This can present a challenge when you're trying to extract certain information from text files.
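
Without tags to lean on, plain string operations have to do the work. The sketch below shows one possible approach, splitting on blank lines to recover paragraphs and then filtering by a keyword. The sample bytes are a shortened stand-in for the downloaded chapter, not the real file contents.

```python
# Shortened stand-in for the bytes returned by textPage.read().
raw = (b'CHAPTER I\n\n"Well, Prince, so Genoa and Lucca are now just '
       b'family estates."\n\nIt was in July, 1805.')

# Decode the bytes into a string before doing any text analysis.
text = raw.decode("utf-8")

# Blank lines are the only structure available, so split on them.
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
heading, body = paragraphs[0], paragraphs[1:]

# Keep only the paragraphs that mention a keyword of interest.
matches = [p for p in body if "1805" in p]

print(heading)
print(len(body), "paragraphs")
print(matches)
```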

## Text Encoding and the Global Internet

In [4]:
from urllib.request import urlopen
textPage = urlopen(
        "http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt")
print(textPage.read())

b"\xd0\xa7\xd0\x90\xd0\xa1\xd0\xa2\xd0\xac \xd0\x9f\xd0\x95\xd0\xa0\xd0\x92\xd0\x90\xd0\xaf\n\nI\n\n\xe2\x80\x94 Eh bien, mon prince. G\xc3\xaanes et Lucques ne sont plus que des apanages, des \xd0\xbf\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x81\xd1\x82\xd1\x8c\xd1\x8f, de la famille Buonaparte. Non, je vous pr\xc3\xa9viens que si vous ne me dites pas que nous avons la guerre, si vous vous permettez encore de pallier toutes les infamies, toutes les atrocit\xc3\xa9s de cet Antichrist (ma parole, j'y crois) \xe2\x80\x94 je ne vous connais plus, vous n'\xc3\xaates plus mon ami, vous n'\xc3\xaates plus \xd0\xbc\xd0\xbe\xd0\xb9 \xd0\xb2\xd0\xb5\xd1\x80\xd0\xbd\xd1\x8b\xd0\xb9 \xd1\x80\xd0\xb0\xd0\xb1, comme vous dites. \xd0\x9d\xd1\x83, \xd0\xb7\xd0\xb4\xd1\x80\xd0\xb0\xd0\xb2\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd0\xb9\xd1\x82\xd0\xb5, \xd0\xb7\xd0\xb4\xd1\x80\xd0\xb0\xd0\xb2\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd0\xb9\xd1\x82\xd0\xb5. Je vois que je vous fais peur, \xd1\x81\xd0\xb0\xd0\xb4\xd0\xb8\xd1\x82\xd

In [8]:
from urllib.request import urlopen
textPage = urlopen(
        "http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt")
print(str(textPage.read(), 'utf-8'))

ЧАСТЬ ПЕРВАЯ

I

— Eh bien, mon prince. Gênes et Lucques ne sont plus que des apanages, des поместья, de la famille Buonaparte. Non, je vous préviens que si vous ne me dites pas que nous avons la guerre, si vous vous permettez encore de pallier toutes les infamies, toutes les atrocités de cet Antichrist (ma parole, j'y crois) — je ne vous connais plus, vous n'êtes plus mon ami, vous n'êtes plus мой верный раб, comme vous dites. Ну, здравствуйте, здравствуйте. Je vois que je vous fais peur, садитесь и рассказывайте.
Так говорила в июле 1805 года известная Анна Павловна Шерер, фрейлина и приближенная императрицы Марии Феодоровны, встречая важного и чиновного князя Василия, первого приехавшего на ее вечер. Анна Павловна кашляла несколько дней, у нее был грипп, как она говорила (грипп был тогда новое слово, употреблявшееся только редкими). В записочках, разосланных утром с красным лакеем, было написано без различия во всех:
«Si vous n'avez rien de mieux à faire, Monsieur le comte (или mon 

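The problem in the first attempt is that Python prints the raw bytes of the document rather than decoding them as text, whereas a browser might attempt to read the same file as an ISO-8859-1 encoded document. Neither approach matches the file, which is actually encoded as UTF-8. Explicitly decoding the bytes as UTF-8, as in the second attempt, produces the correct Cyrillic output.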
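
The same effect can be reproduced offline. This sketch encodes the first two words of the chapter as UTF-8 and then decodes the resulting bytes under three different assumptions; the variable names are my own.

```python
original = "ЧАСТЬ ПЕРВАЯ"

# Each Cyrillic letter takes two bytes in UTF-8; the space takes one.
raw = original.encode("utf-8")
print(raw)

# Decoding with the right encoding recovers the text.
print(raw.decode("utf-8"))

# ISO-8859-1 maps every byte to some character, so decoding "succeeds"
# but yields unreadable text.
garbled = raw.decode("iso-8859-1")
print(garbled)

# ASCII only covers bytes 0 to 127, so it refuses these bytes outright.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print("ASCII decoding failed:", ascii_failed)
```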

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html, "html.parser")
content = bsObj.find("div", {"id":"mw-content-text"}).get_text()
# Round-trip the text through UTF-8 bytes
content = bytes(content, "UTF-8")
content = content.decode("UTF-8")

## CSV

## PDF

In [None]:
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()

## .docx

In [11]:
from zipfile import ZipFile
from urllib.request import urlopen 
from io import BytesIO
wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read() 
wordFile = BytesIO(wordFile)
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml') 
print(xml_content.decode('utf-8'))

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/

In [25]:
from bs4 import BeautifulSoup
wordObj = BeautifulSoup(xml_content.decode('utf-8'), "xml") 
textStrings = wordObj.find_all("w:t")
for textElem in textStrings:
    print(textElem.text)

A Word Document on a Website
This is a Word document, full of content that you want very much. Unfortunately, it’s difficult to access because I’m putting it on my website as a .
docx
 file, rather than just publishing it as HTML
