## Anatomy of an API
**API**: Application programming interface
- Most 'big' sites (Twitter, Google, etc) have APIs to access their information
    - Allows access to information without using webpages
    - Can ask the server to send only the specific information desired
    - Speeds up scraping as well as minimizes server demand
    - Typically includes throttling by limiting number of server requests per hour
- Access: request a key that your program provides with each API call
    - API key (or token) uniquely identifies you
    - Lets the API provider monitor your usage
    - Security measure: keys can have different levels of authorization/access
    - Can be set to expire after a certain amount of time or number of uses
- Requests: program requests the data with a call to the API, including...
    - Method: type of query made using language defined by the API
    - Parameters: refine the query
- Response: data returned by the API, typically in a common format like JSON
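
The pieces above (key, parameters, response) can be sketched with the standard library alone. The endpoint `https://api.example.com/v1/search`, the `api_key` parameter name, and the sample JSON body are all hypothetical and chosen only for illustration; real providers document their own names, and many expect the key in a request header rather than in the query string.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and key; neither belongs to a real provider.
BASE_URL = "https://api.example.com/v1/search"
API_KEY = "my-secret-key"


def build_request_url(base_url, params, api_key):
    """Attach the query parameters and the key to the endpoint URL."""
    query = dict(params, api_key=api_key)
    return "{}?{}".format(base_url, urlencode(query))


def parse_response(body):
    """Decode a JSON response body into Python objects."""
    return json.loads(body)


# Parameters refine the query; the key identifies the caller.
url = build_request_url(BASE_URL, {"q": "monty python", "limit": 5}, API_KEY)

# Made-up response body standing in for what a server might send back.
sample_body = '{"results": [{"title": "Monty Python"}], "remaining_calls": 99}'
data = parse_response(sample_body)

print(url)
print(data["results"][0]["title"])
```

No request is actually sent here; the sketch only shows how a call is assembled and how a JSON response becomes ordinary Python data.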

### Basics of API Queries: Wikipedia's API
- Wikipedia's API doesn't require an authorization key
    - When a key is required, scrapy can handle the authorization, so it can still be used to access such APIs
- Goal: use [Wikipedia's API](https://www.mediawiki.org/wiki/API:Main_page) to find which other Wikipedia entries link to the Monty Python page
    - To do this by scraping, would have to scrape every single page on Wikipedia (very inefficient)
    - To accomplish this, can build a query using the [Wikipedia API Sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox)
    - Query is: `https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Monty_Python&lhprop=title%7Credirect`
    - Broken down:
        - `w/api.php`: tells the server we are using an API rather than scraping raw pages
        - `action=query`: want information from the API (as opposed to changing information in the API)
        - `format=xml`: return the response as XML, which we will then parse with XPath
        - `prop=linkshere`: we are interested in which pages link to target page
        - `titles=Monty_Python`: setting target page using exact page name
        - `lhprop=title`: from those links, want the title of each page
        - `%7Credirect`: `%7C` is the URL-encoded `|` that separates multiple values; `redirect` means we also want to know if the link is a redirect
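
The same query can be assembled from its parts rather than typed by hand. This is a minimal sketch using `urllib.parse.urlencode`; the helper name `build_query_url` is made up for illustration. Note that `lhprop` carries two values joined by `|`, which URL-encoding turns into `%7C`.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(params):
    """Join the endpoint and the URL-encoded parameters with '?'."""
    return "{}?{}".format(API_ENDPOINT, urlencode(params))


params = {
    "action": "query",           # read information rather than change it
    "format": "xml",             # response format
    "prop": "linkshere",         # which pages link to the target page
    "titles": "Monty_Python",    # the target page
    "lhprop": "title|redirect",  # '|' separates values; encoded as %7C
}

query_url = build_query_url(params)
print(query_url)
```

Building the URL from a dictionary makes it easy to change one parameter (for example `titles`) without retyping the whole string.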
        
### Using Scrapy for API calls
- If query can be answered in one response, scrapy is overkill
    - Can just use requests library to make call and library like lxml to parse the return
- By default, Wikipedia's API returns only ten items at a time in response to this query (to avoid overwhelming the server)
- Can use scrapy to iterate over query results (the same way we iterate over pages when scraping)
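
To make the paging logic concrete, here is a sketch that parses XML responses with the standard library's `xml.etree.ElementTree` (standing in for lxml) and follows continuation tokens until none is left. The sample documents and the `fetch` function are invented stand-ins for real API responses, shaped after the elements the spider reads (`lh` items with `ns` and `title` attributes, and a `continue` element carrying `lhcontinue`); no network call is made.

```python
import xml.etree.ElementTree as ET

# Invented responses keyed by continuation token (None means the first request).
FAKE_RESPONSES = {
    None: (
        '<api><continue lhcontinue="200" continue="||"/>'
        '<query><pages><page title="Monty Python"><linkshere>'
        '<lh pageid="1" ns="0" title="John Cleese"/>'
        '<lh pageid="2" ns="1" title="Talk:John Cleese"/>'
        '</linkshere></page></pages></query></api>'
    ),
    "200": (
        '<api><query><pages><page title="Monty Python"><linkshere>'
        '<lh pageid="3" ns="0" title="Eric Idle"/>'
        '</linkshere></page></pages></query></api>'
    ),
}


def fetch(token):
    """Stand-in for an HTTP request; returns canned XML for a token."""
    return FAKE_RESPONSES[token]


def parse_batch(xml_text):
    """Return (entry titles, next continuation token or None) for one response."""
    root = ET.fromstring(xml_text)
    # Keep only links from namespace '0', i.e. ordinary entries.
    titles = [lh.get("title") for lh in root.iter("lh") if lh.get("ns") == "0"]
    cont = root.find("continue")
    token = cont.get("lhcontinue") if cont is not None else None
    return titles, token


def collect_titles(max_batches=10):
    """Follow continuation tokens until none is returned or the cap is hit."""
    titles, token = [], None
    for _ in range(max_batches):
        batch, token = parse_batch(fetch(token))
        titles.extend(batch)
        if token is None:
            break
    return titles


all_titles = collect_titles()
print(all_titles)
```

With a real API, `fetch` would issue the HTTP request with the token appended as `lhcontinue`; scrapy does the equivalent by yielding a new request from the callback.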

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess


class WikiSpider(scrapy.Spider):
    name = "WS"
    
    # Here is where we insert our API call.
    start_urls = [
        'https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Monty_Python&lhprop=title%7Credirect'
        ]

    # Identifying the information we want from the query response and extracting it using xpath.
    def parse(self, response):
        for item in response.xpath('//lh'):
            # The ns code identifies the type of page the link comes from.  '0' means it is a Wikipedia entry.
            # Other codes indicate links from 'Talk' pages, etc.  Since we are only interested in entries, we filter:
            if item.xpath('@ns').extract_first() == '0':
                yield {
                    'title': item.xpath('@title').extract_first() 
                    }
        # Getting the information needed to continue to the next ten entries.
        next_page = response.xpath('continue/@lhcontinue').extract_first()
        
        # Recursively calling the spider to process the next ten entries, if they exist.
        if next_page is not None:
            next_page = '{}&lhcontinue={}'.format(self.start_urls[0],next_page)
            yield scrapy.Request(next_page, callback=self.parse)
            
    
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'PythonLinks.json',
    # Note that because we are doing API queries, the robots.txt file doesn't apply to us.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # CLOSESPIDER_PAGECOUNT stops the spider after 10 responses; at ten links
    # per response, that limits our scraper to roughly the first 100 links.
    'CLOSESPIDER_PAGECOUNT': 10
})
                                         

# Starting the crawler with our spider.
process.crawl(WikiSpider)
process.start()
print('First 100 links extracted!')

First 100 links extracted!


In [2]:
import pandas as pd

# Checking whether we got data.
Monty = pd.read_json('PythonLinks.json', orient='records')
print(Monty.shape)
print(Monty.tail())

(94, 1)
                    title
89  Surrealist automatism
90        Raymond Queneau
91           André Breton
92      Tim Brooke-Taylor
93           Fifth Beatle
