# COMP30760 Assignment 1 - 21334466 - Task 1 - Data Collection

For this task, I will be collecting data from Gutendex, a JSON web API that serves metadata for the books available from Project Gutenberg. Gutendex offers an accessible and simple API for extracting information on books.

Here is the API endpoint from the list of APIs - https://gutendex.com/.

I decided to extract the data from the first 1000 pages of the API endpoint, resulting in an extensive dataset of over 31,000 book entries.

The large volume of data will be very useful for the data cleaning and analysis process. By opting for a broad and inclusive dataset, I hope to find reliable patterns, trends, and insights across genres, authors, and other useful characteristics.

In [16]:
#Importing libraries
import json, urllib.request
from pathlib import Path

In [17]:
#Setting API URL
gutendexUrl = "https://gutendex.com/books"

#List of pages of data
gutendex_metadata = []

In [18]:
#Creating directory for raw data storage, if it doesn't exist
dir_raw = Path("raw")
dir_raw.mkdir(parents=True, exist_ok=True)

## Data Collection and Parsing

To collect data from the Gutendex API, I will define a custom function specifically for fetching information from the API endpoint. 

The API endpoint is organized into distinct URL pages, with each page presenting a subset of the available data.

For example, the first page of the API can be accessed via the following URL: "https://gutendex.com/books/?page=1". 

Each page retrieved from the API is formatted as a dictionary or JSON object, containing the following key attributes:

- "count": Indicates the total number of book entries matching the query across all pages, not only the current page.
- "next": Provides the URL of the next page, allowing for sequential data retrieval. It is null on the last page.
- "previous": Specifies the URL of the previous page, facilitating backward navigation if necessary. It is null on the first page.
- "results": Represents the book entries as a list of dictionaries, containing detailed information about individual books.

The data collection process involves iterating over the available page numbers. For each page, the function will read the data, extract and parse relevant information, and save it for subsequent analysis. 

In [19]:
#General function to retrieve and parse data, takes URL and page number as input
def fetch_and_parse(endpoint, pageNumber=1): #default to first page
    #reject invalid page numbers before building the url
    if pageNumber < 1:
        print("Invalid page number %d. Must be a positive number." % pageNumber)
        return None

    #construct url, adding a trailing slash only if it is missing
    url = endpoint
    if not endpoint.endswith("/"):
        url += "/"
    url += "?page=%d" % pageNumber
    print("Fetching %s" % url)

    try:
        with urllib.request.urlopen(url) as response:
            jdata = response.read().decode()
        return json.loads(jdata)
    except (OSError, ValueError) as err:
        #URLError is an OSError; JSON and decoding errors are ValueErrors
        print("Failed to fetch %s (%s)" % (url, err))
        return None

In [20]:
#Fetch + parse gutendex json data
#I have taken the first 1000 pages, here is a demonstration of the first 5:
for i in range(1, 6):
    gutendex = fetch_and_parse(gutendexUrl, i)
    if gutendex is None:
        #stop if the page could not be fetched
        break
    gutendex_metadata.append(gutendex)
    if not gutendex.get("next"):
        #if there is no next url (null on the last page), stop fetching data
        break

Fetching https://gutendex.com/books/?page=1
Fetching https://gutendex.com/books/?page=2
Fetching https://gutendex.com/books/?page=3
Fetching https://gutendex.com/books/?page=4
Fetching https://gutendex.com/books/?page=5
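
An alternative to counting page numbers is to follow the "next" attribute of each page until it is empty. The following is a minimal sketch of that idea, not the method used in this notebook. The names `fetch_json`, `iter_pages`, and `fake_api` are hypothetical, and the demonstration runs against made-up pages so it needs no network access.

```python
import json
import urllib.request


def fetch_json(url):
    """Download one URL and parse the body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode())


def iter_pages(start_url, max_pages, fetch=fetch_json):
    """Yield page dictionaries by following each page's "next" link.

    `fetch` is injectable so the loop can be tried offline with a stub.
    """
    url = start_url
    fetched = 0
    while url is not None and fetched < max_pages:
        page = fetch(url)
        yield page
        fetched += 1
        url = page.get("next")  # assumed to be None on the last page


# Offline demonstration with made-up pages (placeholder data, not API output).
fake_api = {
    "page1": {"count": 3, "next": "page2", "previous": None,
              "results": [{"id": 1}, {"id": 2}]},
    "page2": {"count": 3, "next": None, "previous": "page1",
              "results": [{"id": 3}]},
}
pages = list(iter_pages("page1", max_pages=5, fetch=fake_api.__getitem__))
print(len(pages))
```

With the real API, the call would look like `iter_pages("https://gutendex.com/books/", max_pages=5)`. The `max_pages` cap keeps a test run small.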


## Data Access

Each book entry in the "results" field follows this structured format:

- id: Unique book identifier.
- title: Book title.
- authors: List of dictionaries of authors, where each dictionary contains an author's name, birth year, and death year.
- translators: List of dictionaries of translators.
- subjects: Genres and subjects.
- bookshelves: Categories the book belongs to.
- languages: Available languages.
- copyright: Indicates copyright status.
- media_type: Type of media (e.g. "Text").
- formats: Dictionary mapping each available file format (MIME type) to its corresponding URL.
- download_count: Number of downloads.

Example of accessing an entry:

1) Accessing the results of the first page [0]

2) Looking at the result of entries ["results"]

3) Choosing the first book entry [0]

4) Retrieving its title ["title"]

In [21]:
#Example of accessing a title of a book entry
gutendex_metadata[0]["results"][0]["title"]

'Romeo and Juliet'
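
The same indexing pattern extends to the nested fields listed above. Below is a small sketch, assuming the field names described in this section. The helper `summarise_book` and the entry `sample_entry` are hypothetical, and the sample values are placeholders rather than real API output.

```python
def summarise_book(entry):
    """Pull a few commonly used fields out of one book entry."""
    return {
        "id": entry.get("id"),
        "title": entry.get("title"),
        # each author is a dictionary; keep only the name
        "authors": [a.get("name") for a in entry.get("authors", [])],
        "languages": entry.get("languages", []),
        "downloads": entry.get("download_count", 0),
    }


# Hand-written entry with placeholder values, not real API output.
sample_entry = {
    "id": 1,
    "title": "Example Title",
    "authors": [{"name": "Doe, Jane", "birth_year": 1800, "death_year": 1870}],
    "languages": ["en"],
    "download_count": 10,
}
print(summarise_book(sample_entry))
```

Using `.get()` with defaults means an entry with a missing field yields an empty list or zero instead of raising a `KeyError`.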

## Data Saving

Next, I will create individual files to store the extracted data.

Each file is named after its corresponding page number and contains the data collected from that page. This organisation ensures that the raw dataset remains structured and easily accessible for further analysis.

In total, this will create 1000 distinct files of the Gutendex dataset.

In [8]:
#Function to write page of data into raw dataset directory
def write_to_dataset(index, offset):
    #ensure index is an integer that refers to a retrieved page
    try:
        results = gutendex_metadata[int(index)]["results"]
    except (ValueError, TypeError, IndexError, KeyError):
        print("Invalid Page Number Index: %s\nIndex must be an integer and less than %d (number of pages retrieved)" % (index, len(gutendex_metadata)))
        return
    #create filename and write to dataset
    fname = "gutendex_page_%s.json" % (offset + 1)
    out_path = dir_raw / fname
    print("Writing data to %s" % out_path)
    #the with-block closes the file even if writing fails
    with open(out_path, "w") as fout:
        json.dump(results, fout, indent=4)

To be able to collect the data in several batches, I added a variable "offset", which shifts the number used in each file name. At first, I gathered data from 10 pages to test and understand the dataset. Later on, the offset was adjusted to collect data from 30, then 500, and ultimately 1000 pages.

Because each new batch continues the numbering after the files already saved, the offset allowed me to accumulate more files without overwriting the existing dataset.
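
Rather than setting the offset by hand, one option is to derive it from the files already on disk. This is only a sketch under the file naming used in this notebook (`gutendex_page_<n>.json`). The helper `highest_saved_page` is hypothetical, and the demonstration writes to a temporary directory rather than to `raw`.

```python
import re
import tempfile
from pathlib import Path

PAGE_FILE = re.compile(r"gutendex_page_(\d+)\.json$")


def highest_saved_page(directory):
    """Return the largest page number already saved, or 0 if none exist."""
    numbers = [
        int(match.group(1))
        for path in Path(directory).glob("gutendex_page_*.json")
        if (match := PAGE_FILE.match(path.name))
    ]
    return max(numbers, default=0)


# Demonstration in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 10):
        (Path(tmp) / f"gutendex_page_{n}.json").write_text("[]")
    offset = highest_saved_page(tmp)
    print(offset)
```

Since the file name is built from `offset + 1`, starting the next batch at `highest_saved_page(dir_raw)` would make the first new file follow directly after the last saved one.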

In [9]:
offset = 500
for page in range(len(gutendex_metadata)):
    write_to_dataset(page, offset)
    offset += 1

Writing data to raw\gutendex_page_501.json
Writing data to raw\gutendex_page_502.json
Writing data to raw\gutendex_page_503.json
Writing data to raw\gutendex_page_504.json
Writing data to raw\gutendex_page_505.json
Writing data to raw\gutendex_page_506.json
Writing data to raw\gutendex_page_507.json
Writing data to raw\gutendex_page_508.json
Writing data to raw\gutendex_page_509.json
Writing data to raw\gutendex_page_510.json
Writing data to raw\gutendex_page_511.json
Writing data to raw\gutendex_page_512.json
Writing data to raw\gutendex_page_513.json
Writing data to raw\gutendex_page_514.json
Writing data to raw\gutendex_page_515.json
Writing data to raw\gutendex_page_516.json
Writing data to raw\gutendex_page_517.json
Writing data to raw\gutendex_page_518.json
Writing data to raw\gutendex_page_519.json
Writing data to raw\gutendex_page_520.json
Writing data to raw\gutendex_page_521.json
Writing data to raw\gutendex_page_522.json
Writing data to raw\gutendex_page_523.json
...

Writing data to raw\gutendex_page_747.json
Writing data to raw\gutendex_page_748.json
Writing data to raw\gutendex_page_749.json
Writing data to raw\gutendex_page_750.json
Writing data to raw\gutendex_page_751.json
Writing data to raw\gutendex_page_752.json
Writing data to raw\gutendex_page_753.json
Writing data to raw\gutendex_page_754.json
Writing data to raw\gutendex_page_755.json
Writing data to raw\gutendex_page_756.json
Writing data to raw\gutendex_page_757.json
Writing data to raw\gutendex_page_758.json
Writing data to raw\gutendex_page_759.json
Writing data to raw\gutendex_page_760.json
Writing data to raw\gutendex_page_761.json
Writing data to raw\gutendex_page_762.json
Writing data to raw\gutendex_page_763.json
Writing data to raw\gutendex_page_764.json
Writing data to raw\gutendex_page_765.json
Writing data to raw\gutendex_page_766.json
Writing data to raw\gutendex_page_767.json
Writing data to raw\gutendex_page_768.json
Writing data to raw\gutendex_page_769.json
...

Writing data to raw\gutendex_page_955.json
Writing data to raw\gutendex_page_956.json
Writing data to raw\gutendex_page_957.json
Writing data to raw\gutendex_page_958.json
Writing data to raw\gutendex_page_959.json
Writing data to raw\gutendex_page_960.json
Writing data to raw\gutendex_page_961.json
Writing data to raw\gutendex_page_962.json
Writing data to raw\gutendex_page_963.json
Writing data to raw\gutendex_page_964.json
Writing data to raw\gutendex_page_965.json
Writing data to raw\gutendex_page_966.json
Writing data to raw\gutendex_page_967.json
Writing data to raw\gutendex_page_968.json
Writing data to raw\gutendex_page_969.json
Writing data to raw\gutendex_page_970.json
Writing data to raw\gutendex_page_971.json
Writing data to raw\gutendex_page_972.json
Writing data to raw\gutendex_page_973.json
Writing data to raw\gutendex_page_974.json
Writing data to raw\gutendex_page_975.json
Writing data to raw\gutendex_page_976.json
Writing data to raw\gutendex_page_977.json
...

In summary, the data collection process has yielded an organised dataset of 1000 files. 

This can now be used for detailed exploration and analysis into literary trends and reader preferences.
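
As a starting point for that analysis, the saved files can be read back into a single list of book entries. The sketch below assumes each file holds the "results" list of one page, as written by `write_to_dataset`. The helper `load_books` is hypothetical, and the demonstration uses a temporary directory with placeholder entries instead of the real `raw` folder.

```python
import json
import tempfile
from pathlib import Path


def load_books(directory):
    """Read every gutendex_page_*.json file and return one flat list of books."""
    books = []
    # sorted() gives a stable (alphabetical, not numeric) file order
    for path in sorted(Path(directory).glob("gutendex_page_*.json")):
        with open(path, encoding="utf-8") as fin:
            books.extend(json.load(fin))
    return books


# Demonstration with two small placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    for number, results in pages.items():
        out_path = Path(tmp) / f"gutendex_page_{number}.json"
        out_path.write_text(json.dumps(results), encoding="utf-8")
    books = load_books(tmp)
    print(len(books))
```

Comparing `len(load_books(dir_raw))` against the expected number of entries is a quick check that no page was lost during saving.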