## Anatomy of an API
**API**: Application programming interface
- Most 'big' sites (Twitter, Google, etc) have APIs to access their information
    - Allows access to information without using webpages
    - Can ask the server to send only the specific information desired
    - Speeds up scraping as well as minimizes server demand
    - Typically includes throttling by limiting number of server requests per hour
- Access: request a key, which your program then provides with each API call
    - API key (or token) uniquely identifies you
    - Lets the API provider monitor your usage
    - Security measure: keys can have different levels of authorization/access
    - Can be set to expire after a certain amount of time or number of uses
- Requests: program requests the data with a call to the API, including...
    - Method: type of query made using language defined by the API
    - Parameters: refine the query
- Response: data returned by the API, typically in a common format like JSON
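
As a minimal sketch of the pieces listed above, the snippet below assembles a request URL from an endpoint, a method, parameters, and a key, then decodes a JSON response body. The endpoint `https://api.example.com`, the `search` method, the parameter names, and the sample response are all hypothetical placeholders, not any real provider's API.

```python
import json
from urllib.parse import urlencode

# hypothetical endpoint and key, for illustration only
BASE_URL = 'https://api.example.com'
API_KEY = 'YOUR_API_KEY'


def build_request_url(method, params, key=API_KEY):
    """Combine method, parameters, and key into one request URL."""
    query = dict(params)
    query['key'] = key  # the key identifies the caller on every request
    return '{}/{}?{}'.format(BASE_URL, method, urlencode(query))


def parse_response(body):
    """Decode a JSON response body into Python objects."""
    return json.loads(body)


url = build_request_url('search', {'q': 'monty python', 'limit': 5})
print(url)

# made-up response body standing in for what a server might return
sample_body = '{"results": [{"title": "Monty Python"}], "remaining_calls": 99}'
data = parse_response(sample_body)
print(data['results'][0]['title'])
```

Real providers differ in where the method goes (path segment or parameter) and in how the key is passed, so the provider's documentation decides the actual layout.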

### Basics of API Queries: Wikipedia's API
- Wikipedia's API doesn't require an authorization key
    - When a key is required, scrapy can handle the authorization, so it can still be used to access the API
- Goal: use [Wikipedia's API](https://www.mediawiki.org/wiki/API:Main_page) to find which other Wikipedia entries link to the Monty Python page
    - To do this by scraping, would have to scrape every single page on Wikipedia (very inefficient)
    - To accomplish this, can build a query using the [Wikipedia API Sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox)
    - Query is: `https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Monty_Python&lhprop=title%7Credirect`
    - Broken down:
        - `w/api.php`: tells the server we are using an API rather than scraping raw pages
        - `action=query`: want information from the API (as opposed to changing information in the API)
        - `format=xml`: return format in xml, then we will parse with xpath
        - `prop=linkshere`: we are interested in which pages link to target page
        - `titles=Monty_Python`: setting target page using exact page name
        - `lhprop=title`: from those links, want the title of each page
        - `%7Credirect`: `%7C` is the URL-encoded `|` that separates `lhprop` values; `redirect` also asks whether the link is a redirect
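
The same query can be assembled from its parts rather than typed by hand. A small sketch using only the standard library's `urllib.parse.urlencode`, which produces the `%7C` escape itself:

```python
from urllib.parse import urlencode

endpoint = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',           # read information from the API
    'format': 'xml',             # response format
    'prop': 'linkshere',         # pages that link to the target
    'titles': 'Monty_Python',    # target page, exact name
    'lhprop': 'title|redirect',  # fields wanted for each linking page
}
# urlencode joins the pairs with '&' and escapes '|' as '%7C'
query_url = '{}?{}'.format(endpoint, urlencode(params))
print(query_url)
```

Keeping the parameters in a dict makes it easier to change one of them (for example `titles`) without retyping the whole URL.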
        
### Using Scrapy for API calls
- If query can be answered in one response, scrapy is overkill
    - Can just use requests library to make call and library like lxml to parse the return
- Wikipedia's API will only return ten items at a time in response to a query (to avoid overwhelming the server)
- Can use scrapy to iterate over query results (the same way it iterates over pages when scraping)
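
For the single-response case, a sketch along these lines may be enough. It uses the standard library (`urllib.request` and `xml.etree.ElementTree`) as stand-ins for requests and lxml so that it has no dependencies. The sample XML is a hand-written approximation of the response shape, assumed from the `lh` elements with `ns` and `title` attributes that the Scrapy spider in this section reads; the User-Agent string and the function names are placeholders.

```python
import urllib.request
import xml.etree.ElementTree as ET

QUERY_URL = ('https://en.wikipedia.org/w/api.php?action=query&format=xml'
             '&prop=linkshere&titles=Monty_Python&lhprop=title%7Credirect')


def fetch_xml(url):
    """Make one API call and return the response body as text."""
    request = urllib.request.Request(
        url, headers={'User-Agent': 'ExampleNotebook/0.1'})  # placeholder agent
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8')


def entry_titles(xml_text):
    """Return titles of linking pages in namespace 0 (regular entries)."""
    root = ET.fromstring(xml_text)
    return [lh.get('title') for lh in root.iter('lh') if lh.get('ns') == '0']


# hand-written sample, assumed to resemble the real response
sample_xml = (
    '<api><query><pages><page title="Monty Python"><linkshere>'
    '<lh ns="0" title="John Cleese" />'
    '<lh ns="1" title="Talk:Comedy" />'
    '</linkshere></page></pages></query></api>'
)
print(entry_titles(sample_xml))
# live call (needs network access): entry_titles(fetch_xml(QUERY_URL))
```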

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess


class WikiSpider(scrapy.Spider):
    name = 'WS'
    
    # insert API call
    start_urls = [
        'https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Monty_Python&lhprop=title%7Credirect'
        ]

    # identify the information wanted from the query response
    # and extract it with xpath
    def parse(self, response):
        for item in response.xpath('//lh'):
            # the ns code identifies the type of page the link comes from
            # '0' means it is a Wikipedia entry.
            # other codes indicate links from 'Talk' pages, etc
            # only interested in wikipedia entries, filter for only '0':
            if item.xpath('@ns').extract_first() == '0':
                yield {
                    'title': item.xpath('@title').extract_first() 
                    }
        # information necessary to get the next ten entries.
        next_page = response.xpath('continue/@lhcontinue').extract_first()
        
        # recursively call spider to process next ten entries, if they exist
        if next_page is not None:
            next_page = '{}&lhcontinue={}'.format(self.start_urls[0],next_page)
            yield scrapy.Request(next_page, callback=self.parse)
            
    
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'PythonLinks.json',
    # robots.txt file doesn't apply since using API queries
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # CLOSESPIDER_PAGECOUNT stops the spider after 10 responses;
    # at ten links per response, that is the first 100 links
    'CLOSESPIDER_PAGECOUNT': 10
})
                                         

# start the crawler with spider
process.crawl(WikiSpider)
process.start()
print('first 100 links extracted')
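
The continuation step in `parse` can also be read as a plain loop: fetch a page, yield its items, and repeat with the `lhcontinue` value until the response no longer carries one. The sketch below uses the standard library and takes the fetch function as an argument, so the loop can be exercised with canned responses. `iter_titles`, `fake_fetch`, the `example.invalid` URLs, and the XML are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def iter_titles(fetch, base_url, max_pages=10):
    """Yield namespace-0 titles, following lhcontinue for up to max_pages."""
    url = base_url
    for _ in range(max_pages):
        root = ET.fromstring(fetch(url))
        for lh in root.iter('lh'):
            if lh.get('ns') == '0':
                yield lh.get('title')
        cont = root.find('continue')  # direct child of the root element
        if cont is None or cont.get('lhcontinue') is None:
            break  # no continuation token: this was the last page
        url = '{}&lhcontinue={}'.format(base_url, quote(cont.get('lhcontinue')))


# canned two-page response keyed by request URL
PAGES = {
    'http://example.invalid/api?x=1':
        '<api><continue lhcontinue="42" /><query><lh ns="0" title="A" />'
        '<lh ns="4" title="Wikipedia:B" /></query></api>',
    'http://example.invalid/api?x=1&lhcontinue=42':
        '<api><query><lh ns="0" title="C" /></query></api>',
}


def fake_fetch(url):
    """Stand-in for a network call; returns the canned body for url."""
    return PAGES[url]


titles = list(iter_titles(fake_fetch, 'http://example.invalid/api?x=1'))
print(titles)
```

Here `max_pages` plays the role that `CLOSESPIDER_PAGECOUNT` plays in the Scrapy settings.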

In [None]:
import pandas as pd

monty=pd.read_json('PythonLinks.json', orient='records')
print(monty.shape)
print(monty.tail())

- API call was successful: saved 94 of the 100 links (the other 6 aren't links from entry pages).
- Authorization keys
    - Often simply included in the query string as an argument
    - If the key must be entered in a form, or a login is required, scrapy has this functionality
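
When the key travels in the query string, one option is to keep it out of the notebook itself and read it from the environment. A sketch, assuming a hypothetical environment variable named `MAPS_API_KEY` and a query parameter named `key` (other providers may use a different name); `with_api_key` and the `api.example.com` URL are invented for illustration.

```python
import os
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_api_key(url, key, param='key'):
    """Return url with the API key added as a query-string argument."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query)  # existing arguments as (name, value) pairs
    query.append((param, key))
    return urlunsplit(parts._replace(query=urlencode(query)))


# MAPS_API_KEY is a hypothetical variable name; falls back to a placeholder
api_key = os.environ.get('MAPS_API_KEY', 'YOUR_API_KEY')
live_url = with_api_key('https://api.example.com/search?q=pizza', api_key)

# fixed key so the printed result is predictable
signed = with_api_key('https://api.example.com/search?q=pizza', 'demo-key')
print(signed)
```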

### Challenge:
Pick a different website and write a scraper that will:
- Return specific pieces of information (rather than just downloading a whole page)
- Iterate over multiple pages/queries
- Save the data to your computer

Once you have your data, compute some statistical summaries and/or visualizations that give you new insights into your scraping topic of interest.

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

class mapsSpider(scrapy.Spider):
    name = 'mapsSpider'
    # API call
    # maps api key AIzaSyDbSk0561WqHmDUagcZqTzDvzTHd6ol7i0
    start_urls = ['https://maps.googleapis.com/maps/api/place/textsearch/xml?query=restaurants+near+wrigley+field&key=AIzaSyDbSk0561WqHmDUagcZqTzDvzTHd6ol7i0']
    
    # extract first 10 restaurants near wrigley field
    def parse(self, response):
        for item in response.xpath('//result'):
            yield {
                'name': item.xpath('name/text()').extract_first(),
                'address': item.xpath('formatted_address/text()').extract_first(),
                    }

process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'mapsResults.json',
    # robots.txt file doesn't apply since using API queries
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # CLOSESPIDER_PAGECOUNT to limit scraper to the first 100 links    
    #'CLOSESPIDER_PAGECOUNT' : 10
})

process.crawl(mapsSpider)
process.start()
print('complete')

complete


In [2]:
import pandas as pd

results = pd.read_json('mapsResults.json', orient='records')
results

| | address | name |
|---|---|---|
| 0 | 3456 N Sheffield Ave, Chicago, IL 60657, USA | Cozy Noodles n' Rice |
| 1 | 3463 N Clark St, Chicago, IL 60657, USA | Dimo's Pizza |
| 2 | 3664 N Clark St, Chicago, IL 60613, USA | Bernie's Tap & Grill (Across from Wrigley Field) |
| 3 | 3908 N Sheridan Rd, Chicago, IL 60613, USA | PR Italian Bistro |
| 4 | 3343 N Clark St, Chicago, IL 60657, USA | Lowcountry Lakeview |
| 5 | 3731 N Clark St, Chicago, IL 60613, USA | Azteca Grill |
| 6 | 3800 N Clark St, Chicago, IL 60613, USA | Uncommon Ground (Lakeview) |
| 7 | 1017 W Irving Park Rd, Chicago, IL 60613, USA | Byron's Hotdogs |
| 8 | 1011 W Irving Park Rd, Chicago, IL 60613, USA | Tac Quick |
| 9 | 936 W Addison St, Chicago, IL 60613, USA | El Burrito Mexicano |


EVERYTHING ABOVE THIS CELL WORKS

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

# note: the Twisted reactor behind CrawlerProcess cannot be restarted,
# so restart the kernel before running this cell after an earlier crawl

class mapsNearbySpider(scrapy.Spider):
    name = 'mapsNearbySpider'
    
    # queryStr and nearbyStr arrive as keyword arguments from process.crawl()
    def __init__(self, queryStr, nearbyStr, **kwargs):
        # scrapy.Spider.__init__ only takes name and keyword arguments,
        # so the search terms are not passed up to it
        super(mapsNearbySpider, self).__init__(**kwargs)
        
        # build api call
        # maps api key AIzaSyDbSk0561WqHmDUagcZqTzDvzTHd6ol7i0
        url_base = 'https://maps.googleapis.com/maps/api/place/textsearch/xml?'
        url_key = '&key=AIzaSyDbSk0561WqHmDUagcZqTzDvzTHd6ol7i0'
        url_args = 'query=' + queryStr + '+near+' + nearbyStr
        self.start_urls = [url_base + url_args + url_key]
    
    # extract name & address of each result
    def parse(self, response):
        for item in response.xpath('//result'):
            yield {
                'name': item.xpath('name/text()').extract_first(),
                'address': item.xpath('formatted_address/text()').extract_first(),
            }

process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'mapsResults.json',
    # robots.txt file doesn't apply since using API queries
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
})

# pass the spider class (not an instance); scrapy builds the spider
# and forwards the keyword arguments to __init__
process.crawl(mapsNearbySpider, queryStr='restaurants', nearbyStr='wrigley+field')
process.start()
print('complete')

In [None]:
url_base = 'https://maps.googleapis.com/maps/api/place/textsearch/xml?'
argsQuery = 'query=restaurants+near+wrigley+field'
url_key = '&key=AIzaSyDbSk0561WqHmDUagcZqTzDvzTHd6ol7i0'
url_complete = url_base + argsQuery + url_key
url_complete

In [None]:
queryTerm = 'restaurants'
nearbyTerm = 'wrigley' + '+' + 'field'
nearbyTerm

In [None]:
query_constructor = queryTerm + '+near+' + nearbyTerm
query_constructor

In [None]:
url_base = 'https://maps.googleapis.com/maps/api/place/textsearch/xml?'
argsQuery = 'query=' + query_constructor
url_key = '&key=AIzaSyDbSk0561WqHmDUagcZqTzDvzTHd6ol7i0'
url_complete = url_base + argsQuery + url_key
print(url_complete)
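
The manual `+` joins above can also be left to `urllib.parse.urlencode`, which encodes spaces as `+` and escapes other special characters. A sketch; the helper name `build_textsearch_url` is invented here and the key is a placeholder:

```python
from urllib.parse import urlencode

URL_BASE = 'https://maps.googleapis.com/maps/api/place/textsearch/xml'


def build_textsearch_url(query_term, nearby_term, key):
    """Build a text-search URL from plain-text terms (spaces allowed)."""
    params = {
        'query': '{} near {}'.format(query_term, nearby_term),
        'key': key,
    }
    return '{}?{}'.format(URL_BASE, urlencode(params))


example_url = build_textsearch_url('restaurants', 'wrigley field', 'YOUR_API_KEY')
print(example_url)
```

With this approach the caller can pass `'wrigley field'` as ordinary text instead of pre-joining the words with `+`.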

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

class mapsNearbySpider(scrapy.Spider):
    name = 'mapsNearbySpider'

    def __init__(self, queryStr, nearbyStr, **kwargs):
        super(mapsNearbySpider, self).__init__(**kwargs)
        self.queryStr = queryStr
        self.nearbyStr = nearbyStr
        # make api call: scrapy requests every URL listed in start_urls
        self.start_urls = [self.build_url()]
    
    # build api call
    def build_url(self):
        url_base = 'https://maps.googleapis.com/maps/api/place/textsearch/xml?'
        url_key = '&key=AIzaSyDbSk0561WqHmDUagcZqTzDvzTHd6ol7i0'
        url_args = 'query=' + self.queryStr + '+near+' + self.nearbyStr
        url_complete = url_base + url_args + url_key
        return url_complete
    
    # extract name & address from response
    def parse(self, response):
        for item in response.xpath('//result'):
            yield {
                'name': item.xpath('name/text()').extract_first(),
                'address': item.xpath('formatted_address/text()').extract_first(),
            }

process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'mapsResults.json',
    # robots.txt file doesn't apply since using API queries
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
})

# pass the class plus its arguments; scrapy instantiates the spider itself
process.crawl(mapsNearbySpider, queryStr='restaurants', nearbyStr='wrigley+field')
process.start()
print('complete')

In [None]:
test_class = mapsTest('restaurants', 'wrigley+field')
test_class.build_url()

In [None]:
class mapsTest():

    def __init__(self, queryStr, nearbyStr):
        self.queryStr = queryStr
        self.nearbyStr = nearbyStr