# Collecting Data from the Web - APIs


An API (application programming interface) is a software interface that lets one program request a service or data from another. The kinds of APIs we will use here let us interact with, and get data from, different web platforms using Python. Some APIs are more open than others.

We're going to first explore the Google Books API to perform some searches for books and see what metadata we get in return. Although many APIs require a key in order to access the data, we can perform Google Books searches without one. 

Using APIs to request data is usually done through HTTP (see [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview) for an overview of basic terms). The guide to the Google Books API can be found [here](https://developers.google.com/books/).

If you are interested in Reddit and Twitter data, check out the notebook `3-2_praw-tweepy.ipynb`. You can also find downloadable collections of tweets by searching Google for "download tweets".

First we'll import the [`requests`](http://docs.python-requests.org/en/master/) library, a widely used Python library for making HTTP requests. We'll use it to make a `get` request to the API endpoint.

In [1]:
# !pip install requests  

In [2]:
import requests
import pandas as pd
import time

Calling an API often comes down to building the right URL. We always need a base URL, or endpoint, to which we can add the parameters specific to our request. Let's assign the base Google Books endpoint to a variable called `books_url`. We know this URL from the documentation linked above.

In [3]:
books_url = 'https://www.googleapis.com/books/v1/volumes?'

We can start off with a very simple search to see what the results look like. Then we'll move on to adding more filters and parameters. Let's assign our query to a variable `query`.

In [4]:
query = 'digital humanities'

To incorporate this into our query we can make a dictionary called `parameters`. Read more about what kinds of parameters the Google API allows for [here](https://developers.google.com/books/docs/v1/using#st_params).

We'll pass these parameters to the `get` method. Read more about HTTP GET requests [here](https://www.w3schools.com/tags/ref_httpmethods.asp). 

The `'q'` stands for 'query', and whatever value we assign to it is what we're searching for, just as if we had typed it into the Google Books search bar.

In [5]:
parameters = {'q': query}

We'll pass two arguments to the `get` method of the `requests` library: the URL and the parameters we want. It returns a response object.

In [6]:
r = requests.get(books_url, params = parameters)

Printing the `url` property, we can see that this function converted the URL into the proper format to include our search terms.

In [7]:
print(r.url)

https://www.googleapis.com/books/v1/volumes?q=digital+humanities
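
To see what `requests` did with the `params` dictionary, the same query string can be built by hand with the standard library. This is only a sketch for illustration: `urlencode` comes from `urllib.parse`, and `query_string` and `manual_url` are names introduced here, not part of the Google Books documentation.

```python
from urllib.parse import urlencode

books_url = 'https://www.googleapis.com/books/v1/volumes?'
parameters = {'q': 'digital humanities'}

# urlencode turns the dictionary into key=value pairs joined by '&' and
# replaces characters that are not allowed in a URL (the space becomes '+')
query_string = urlencode(parameters)
manual_url = books_url + query_string

print(manual_url)
```

In practice there is no need to do this by hand, since passing `params` to `requests.get` handles the encoding for us.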


To see our results, we use the response object's `json` method, which parses the JSON body into Python data that we can navigate in the same way as a Python dictionary. Take a minute or two to navigate through the results in order to see what's there.

In [8]:
results = r.json()
results

{'kind': 'books#volumes',
 'totalItems': 578,
 'items': [{'kind': 'books#volume',
   'id': 'PGV2DwAAQBAJ',
   'etag': 'c46lXoe4Zkg',
   'selfLink': 'https://www.googleapis.com/books/v1/volumes/PGV2DwAAQBAJ',
   'volumeInfo': {'title': 'Research Methods for the Digital Humanities',
    'authors': ['lewis levenberg', 'Tai Neilson', 'David Rheams'],
    'publisher': 'Springer',
    'publishedDate': '2018-11-04',
    'description': 'This volume introduces the reader to the wide range of methods that digital humanities employ, and offers a practical guide to the study, interpretation, and presentation of cultural material and practices. In this instance, the editors consider digital humanities to include both the use of computing to understand cultural material in new ways, and the application of theories and methods from the humanities to interpret new technologies. Each chapter provides a step-by-step guide to cutting-edge methodologies so that students can make informed decisions about t
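
Before calling `json`, it can be worth checking that the request actually succeeded, since an error response may not contain the data we expect. Below is a minimal sketch, assuming a response object that exposes `status_code` and `json()` the way `requests` responses do. `json_or_none` and the stand-in `FakeResponse` class are hypothetical names, used here so the example runs without a network connection.

```python
def json_or_none(response):
    """Return the parsed JSON body for a successful response, else None."""
    # 200 is the HTTP status code for "OK"
    if response.status_code == 200:
        return response.json()
    print("Request failed with status code", response.status_code)
    return None


class FakeResponse:
    """Hypothetical stand-in for a requests response, for offline testing."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


ok = json_or_none(FakeResponse(200, {'kind': 'books#volumes'}))
failed = json_or_none(FakeResponse(404, {}))
```

With a real request this would be called as `json_or_none(r)`. The `requests` library also offers `r.raise_for_status()`, which raises an exception for error status codes instead of returning `None`.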

You probably figured out that the books are found under the `items` key, and the most important information for each one is under the `volumeInfo` key. Let's take a look at the first result.

In [9]:
results['items'][0]['volumeInfo']

{'title': 'Research Methods for the Digital Humanities',
 'authors': ['lewis levenberg', 'Tai Neilson', 'David Rheams'],
 'publisher': 'Springer',
 'publishedDate': '2018-11-04',
 'description': 'This volume introduces the reader to the wide range of methods that digital humanities employ, and offers a practical guide to the study, interpretation, and presentation of cultural material and practices. In this instance, the editors consider digital humanities to include both the use of computing to understand cultural material in new ways, and the application of theories and methods from the humanities to interpret new technologies. Each chapter provides a step-by-step guide to cutting-edge methodologies so that students can make informed decisions about the methods they use, consider ethical practices, follow practical procedures, and present their work effectively. Readers will develop practical and reflexive understandings of the software and digital devices that they study and use for
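
Not every volume has every field, and indexing a missing key with square brackets raises a `KeyError`. One way to guard against that is the dictionary `get` method, which returns a default value when a key is missing. The sketch below uses a hand-built dictionary shaped like the `volumeInfo` shown above (with the description left out on purpose) rather than a live response.

```python
# A small dictionary mimicking one 'volumeInfo' entry; 'description' is
# deliberately omitted to show what happens with a missing key
volume_info = {
    'title': 'Research Methods for the Digital Humanities',
    'authors': ['lewis levenberg', 'Tai Neilson', 'David Rheams'],
    'publishedDate': '2018-11-04',
}

title = volume_info['title']                          # key exists
description = volume_info.get('description', 'NA')    # missing: falls back to 'NA'
first_author = volume_info.get('authors', ['NA'])[0]  # first item of the list

print(title)
print(description)
print(first_author)
```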

There's a lot of information in the results, but we probably don't want all of it. Suppose that for each volume in the results, we only want to extract 1) the title, 2) the author(s), 3) the publication date, and 4) the description. Below is a function named `parse_results` that takes the `results` variable as an argument and returns a list of dictionaries. Each dictionary within the list corresponds to a book, and has a `title` key, an `authors` key, a `description` key, and a `published_date` key.

In [10]:
def parse_results(results):
    """Extract the title, authors, description and date of each volume."""

    results_list = []

    for book in results['items']:
        info = book['volumeInfo']

        title = info['title']

        # some books don't have authors, dates, or a description,
        # so fall back to 'NA' when a key is missing
        authors = ','.join(info.get('authors', ['NA']))
        published_date = info.get('publishedDate', 'NA')
        description = info.get('description', 'NA')

        results_dict = {'title': title,
                        'authors': authors,
                        'description': description,
                        'published_date': published_date}

        results_list.append(results_dict)

    return results_list

In [11]:
# Use our function to parse the results
data = parse_results(results)
data

[{'title': 'Research Methods for the Digital Humanities',
  'authors': 'lewis levenberg,Tai Neilson,David Rheams',
  'description': 'This volume introduces the reader to the wide range of methods that digital humanities employ, and offers a practical guide to the study, interpretation, and presentation of cultural material and practices. In this instance, the editors consider digital humanities to include both the use of computing to understand cultural material in new ways, and the application of theories and methods from the humanities to interpret new technologies. Each chapter provides a step-by-step guide to cutting-edge methodologies so that students can make informed decisions about the methods they use, consider ethical practices, follow practical procedures, and present their work effectively. Readers will develop practical and reflexive understandings of the software and digital devices that they study and use for research, and the book will help new researchers collaborate a

In [12]:
# Convert the results into a dataframe
results_df = pd.DataFrame(data)
results_df

| | title | authors | description | published_date |
|---|---|---|---|---|
| 0 | Research Methods for the Digital Humanities | lewis levenberg,Tai Neilson,David Rheams | This volume introduces the reader to the wide ... | 2018-11-04 |
| 1 | A New Companion to Digital Humanities | Susan Schreibman,Ray Siemens,John Unsworth | This highly-anticipated volume has been extens... | 2016-01-26 |
| 2 | Defining Digital Humanities | Dr Edward Vanhoutte | This reader brings together the essential read... | 2013-12-23 |
| 3 | Digital Humanities in Practice | Claire Warwick,Melissa Terras,Julianne Nyhan | This cutting-edge and comprehensive introducti... | 2012-10-09 |
| 4 | Digital Humanities | David M. Berry,Anders Fagerjord | As the twenty-first century unfolds, computers... | 2017-05-30 |
| 5 | The Digital Humanities and the Digital Modern | James Smithies | This book provides new critical and methodolog... | 2017-08-28 |
| 6 | Research Methods for Reading Digital Data in t... | Gabriele Griffin | The first volume to introduce the techniques a... | 2016-02-15 |
| 7 | Doing Digital Humanities | Constance Crompton,Richard J Lane,Ray Siemens | Digital Humanities is rapidly evolving as a si... | 2016-09-13 |
| 8 | The Emergence of the Digital Humanities | Steven E. Jones | The past decade has seen a profound shift in o... | 2013-08-15 |
| 9 | Understanding Digital Humanities | D. Berry | Confronting the digital revolution in academia... | 2012-02-07 |


Now let's explore the API using more parameters. You may have noticed that our query only gave us 10 books, even though the `totalItems` field in the response reported many more matches. To adjust our search, we need to add the `maxResults` parameter (how many results to return per request) and the `startIndex` parameter (the position of the first result to return). We can do that by adding these as keys to the `parameters` dictionary and then running our request again. To read about these parameters, see the [documentation](https://developers.google.com/books/docs/v1/using#api_params).

In [14]:
parameters = {'q': query,
              'startIndex': 0,
              'maxResults': 10} # max 40 per request, as per documentation

r = requests.get(books_url, params = parameters)

print(r.url)

results = r.json()

print()

parse_results(results)

https://www.googleapis.com/books/v1/volumes?q=digital+humanities&startIndex=0&maxResults=10



[{'title': 'Research Methods for the Digital Humanities',
  'authors': 'lewis levenberg,Tai Neilson,David Rheams',
  'description': 'This volume introduces the reader to the wide range of methods that digital humanities employ, and offers a practical guide to the study, interpretation, and presentation of cultural material and practices. In this instance, the editors consider digital humanities to include both the use of computing to understand cultural material in new ways, and the application of theories and methods from the humanities to interpret new technologies. Each chapter provides a step-by-step guide to cutting-edge methodologies so that students can make informed decisions about the methods they use, consider ethical practices, follow practical procedures, and present their work effectively. Readers will develop practical and reflexive understandings of the software and digital devices that they study and use for research, and the book will help new researchers collaborate a
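
`startIndex` and `maxResults` work together as a paging mechanism: each request returns at most `maxResults` volumes, starting from position `startIndex`. As a sketch, the helper below (a hypothetical name, `page_starts`) lists the `startIndex` values needed to cover a given number of results.

```python
def page_starts(total_wanted, page_size):
    """Return the startIndex for each request needed to cover total_wanted results."""
    # range(start, stop, step) counts up from 0 in steps of page_size
    return list(range(0, total_wanted, page_size))


print(page_starts(100, 20))  # [0, 20, 40, 60, 80]
print(page_starts(100, 40))  # [0, 40, 80]
```

Note that when the page size does not divide the total evenly, as in the second call, the last request can return more results than needed, so the combined list may have to be trimmed.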

Now we can write a for-loop to collect the first 100 results into `all_results`. But make sure you use `time.sleep` at the end of each iteration! Python is fast enough that a for-loop without a pause between calls can overload someone's server, or get you (temporarily) banned:

In [15]:
# See how .sleep works

for x in range(5):
    print(x)
    time.sleep(1) # alter this number to change sleep length

0
1
2
3
4


In [16]:
parameters = {'q': query,
              'maxResults': 20,
              'startIndex': 0}

all_results = []
for i in range(5):  # 5 pages x 20 results per page = 100 results
    print("collecting page " + str(i + 1))
    
    r = requests.get(books_url, params=parameters)
    results = r.json()
    parsed = parse_results(results)
    all_results.extend(parsed)
    
    time.sleep(1)  # very important: pause so we don't overload the API
    parameters['startIndex'] += parameters['maxResults']  # move on to the next page

collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5


In [17]:
print(len(all_results))

100
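
The loop above assumes that every page comes back full. If a search has fewer matches than we ask for, later pages may come back empty. Here is a hedged sketch of a loop that stops early in that case. `collect_pages` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever function performs the request and parsing for one `startIndex`. The demonstration uses made-up placeholder data so that it runs offline.

```python
import time


def collect_pages(fetch_page, page_size, max_pages, pause=1.0):
    """Call fetch_page(start_index) page by page until a page comes back empty."""
    collected = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size)
        if not batch:
            break  # nothing more to collect, so stop requesting
        collected.extend(batch)
        time.sleep(pause)  # pause between calls to avoid overloading the server
    return collected


# Offline demonstration: a fake data source holding 45 placeholder "books"
fake_books = [{'title': 'Book ' + str(n)} for n in range(45)]


def fake_fetch(start_index):
    return fake_books[start_index:start_index + 20]


books = collect_pages(fake_fetch, page_size=20, max_pages=5, pause=0)
print(len(books))  # 45
```

Injecting the fetching function keeps the stopping logic separate from the network call, which also makes the loop easy to try out without contacting the API.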


Now we can write this data to a CSV.

In [18]:
import csv

keys = all_results[0].keys()

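# newline='' lets the csv module handle line endings; utf-8 keeps non-ASCII characters intact
with open('books_search.csv', 'w', newline='', encoding='utf-8') as output_file: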
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_results)
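
To confirm that a file like this was written correctly, it can be read back with `csv.DictReader`, which yields one dictionary per row. The sketch below first writes two sample rows (taken from the table of results earlier) to a temporary file so that it runs on its own. The temporary file location and the names `sample_rows` and `rows_read` are assumptions for illustration.

```python
import csv
import os
import tempfile

sample_rows = [
    {'title': 'Defining Digital Humanities', 'authors': 'Dr Edward Vanhoutte'},
    {'title': 'Understanding Digital Humanities', 'authors': 'D. Berry'},
]

path = os.path.join(tempfile.mkdtemp(), 'books_sample.csv')

# newline='' lets the csv module control line endings itself
with open(path, 'w', newline='', encoding='utf-8') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=list(sample_rows[0].keys()))
    writer.writeheader()
    writer.writerows(sample_rows)

# Read the file back: each row becomes a dictionary keyed by the header
with open(path, newline='', encoding='utf-8') as input_file:
    rows_read = list(csv.DictReader(input_file))

print(rows_read[0]['title'])
```

Every value comes back as a string, so dates or numbers would need converting after reading.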