# Image Collection Script for Google News Archive

This script was developed by the University of Toronto Scarborough's Digital Scholarship Unit to collect early Canadian newspapers from the deprecated Google News Archive.

## How to use it
Call the function 'get_newspapers()', which takes four arguments: the URL of the archive, the start date, the end date, and the name of the output folder. For example, to run the script on the Halifax Gazette from January 1st, 1752 to January 1st, 1760, you would run:
```
get_newspapers("https://news.google.com/newspapers?nid=4p3FJGzxjgAC&dat=17520323&b_mode=2&hl=en",
                17520101,
                17600101,
                'HalifaxGazette')
```
The dates are in the form **YearMonthDay**. The output images are written to the folder 'HalifaxGazette', with the pages separated into subfolders, one per issue. These folders and images are named with the PID given by the Google Archive.

The OCR data will be found in the folder HalifaxGazette_ocr.
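
Because the date arguments are plain integers, they can be built from calendar dates. The sketch below is not part of the script: `to_archive_date` and `query_dates` are hypothetical helper names. `query_dates` mirrors the stepping done in get_nested_link_list(), which requests the archive once per year and once more at that date plus 600, i.e. six months later.

```python
from datetime import date


def to_archive_date(d: date) -> int:
    """Encode a calendar date as a YearMonthDay integer, e.g. 17520101."""
    return d.year * 10000 + d.month * 100 + d.day


def query_dates(start_date: int, end_date: int) -> list:
    """List the dates requested between start_date and end_date
    (hypothetical helper that mirrors the loop in get_nested_link_list)."""
    dates = []
    current = start_date
    while current < end_date:
        dates.append(current)        # yearly mark
        dates.append(current + 600)  # same day, six months later
        current += 10000             # forward date by a year
    return dates


start = to_archive_date(date(1784, 1, 1))
end = to_archive_date(date(1786, 1, 1))
print(query_dates(start, end))
```

Since the six-month offset is plain integer addition, a start date in July or later would yield a month above 12, so a start date in the first half of the year is the safer choice.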

### How long does the script take to run?
It greatly depends on the range of years and the newspaper, but in general an average issue takes around 20 minutes. As you can imagine, running the script on large collections can take several days. For this reason, the script logs each completed image in completed_images.txt, so if you stop the script and resume it, it will pick up where it left off and you won't lose your progress.
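
The resume step can be sketched as follows. This is a minimal illustration, not the script's exact code: `count_completed` is a hypothetical helper name, and returning 0 when the log file is missing is an assumption about what a first run should do.

```python
import os


def count_completed(log_path: str = "completed_images.txt") -> int:
    """Return how many images were already logged as completed.

    Hypothetical helper: each processed image appends one line to the
    log, so the line count is the index at which to resume.
    """
    if not os.path.exists(log_path):
        # Assumption: no log yet means nothing has been processed
        return 0
    with open(log_path) as log:
        return sum(1 for _ in log)


# Resume index for the main loop over the list of image links
start = count_completed()
print("Starting at: " + str(start))
```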

## How does it work?
The script has two main parts: getting the images and then processing them. Google News Archive hosts high-definition images of newspapers; however, each is composed of several smaller images that cannot be easily web-scraped. To get these smaller images, the page IDs and issue IDs are scraped and then used to build the links at which the smaller images are located. This is handled by the function get_template_img_urls().
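
As a rough illustration of the link that gets built, the sketch below assembles one tile URL from an issue ID, a page ID and a tile number, following the template used in get_template_img_urls() and generate_images(). The helper name `tile_url` and the ID values are made up for the example; real IDs come from the scraped page source.

```python
from urllib.parse import quote

# Template taken from the script; the tile number is appended after "tid="
TILE_URL = ("https://news.google.com/newspapers?id={issue_id}"
            "&pg={page_id}&img=1&hl=en&zoom=1&tid={tile}")


def tile_url(issue_id: str, page_id: str, tile: int) -> str:
    """Build the link to one small image (hypothetical helper).

    page_id is the raw "x,y" value from the page source; its comma is
    percent-encoded as %2C, as the script does by hand.
    """
    return TILE_URL.format(issue_id=issue_id,
                           page_id=quote(page_id, safe=""),
                           tile=tile)


# Placeholder IDs for illustration only
print(tile_url("ABC123", "1234,5678", 0))
```

The script requests tiles 0, 1, 2, and so on until the archive answers with a GIF instead of a JPEG, which marks the end of the page.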

The second part of the script uses the Python Imaging Library (PIL) to put the smaller images together into a larger HD image. This is handled by the function create_newspaper().
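
The stitching idea can be sketched in a few lines, assuming Pillow is installed. This is an alternative illustration rather than the script's own code: the script concatenates images pairwise and saves intermediate row files, whereas the hypothetical `stitch` helper below pastes every tile onto a single canvas. Gaps are left black, as in the script's concat helpers.

```python
from PIL import Image


def stitch(tile_grid: list) -> Image.Image:
    """Paste a 2-D list of PIL images (rows of tiles) onto one canvas."""
    row_heights = [max(tile.height for tile in row) for row in tile_grid]
    row_widths = [sum(tile.width for tile in row) for row in tile_grid]

    # Canvas wide enough for the widest row, tall enough for all rows
    canvas = Image.new("RGB", (max(row_widths), sum(row_heights)), (0, 0, 0))

    y = 0
    for row, height in zip(tile_grid, row_heights):
        x = 0
        for tile in row:
            canvas.paste(tile, (x, y))
            x += tile.width
        y += height
    return canvas


# Tiny in-memory tiles stand in for downloaded images
red = Image.new("RGB", (4, 3), (255, 0, 0))
green = Image.new("RGB", (2, 3), (0, 255, 0))
blue = Image.new("RGB", (4, 1), (0, 0, 255))
page = stitch([[red, green], [blue]])
print(page.size)
```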

Finally, the Google Vision API is used to collect OCR data on the collected images. This is done with the get_ocr_data() function.

### Notes on future use/potential errors
Because of how Google has structured their images, there is a good chance that running this script on some collections will print: **'Warning! Image is not of known size.'** This means the image is composed of an unfamiliar number of smaller images. To fix this, you can add a new ```elif``` branch that builds the image. **You can see this in the example case at the bottom of save_h_imgs()**. Any image which cannot be put together will be written to the incompleted_images.txt file.

Furthermore, **there is a chance that images are put together improperly**. This can happen because some images in different collections are put together in different orders, depending on the shape of the image. 

For example, in one newspaper an image may be made of 54 smaller images, with the large image being 6 images wide and 9 images tall. However, in another newspaper, an image may also have 54 smaller images but be 9 wide and 6 tall. At the moment, there is no way to fix this aside from going into the script and rearranging the pattern to suit the current newspaper. Hopefully, a solution can someday be used to give the script information as to the exact pattern the image has.
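
One possible direction is sketched below. The hard-coded patterns in save_h_imgs() all appear to follow a single rule: tiles are numbered block by block, in blocks of up to 3 by 3 tiles, with blocks ordered left to right and top to bottom, and tiles ordered row by row inside each block. Assuming that rule holds (it matches the patterns already in the script but has not been checked against other collections), the hypothetical helper `tile_rows` computes the pattern for any width and height, which would still have to be supplied by the user.

```python
def tile_rows(cols: int, rows: int, block: int = 3) -> list:
    """Return, for each row of the full image, the tile indices from
    left to right, assuming tiles are numbered in block-by-block order."""
    grid = [[None] * cols for _ in range(rows)]
    index = 0
    # Walk the blocks top to bottom, then left to right
    for top in range(0, rows, block):
        block_h = min(block, rows - top)
        for left in range(0, cols, block):
            block_w = min(block, cols - left)
            # Number the tiles inside this block row by row
            for r in range(block_h):
                for c in range(block_w):
                    grid[top + r][left + c] = index
                    index += 1
    return grid


# 54 tiles laid out 9 wide and 6 tall, then 6 wide and 9 tall
for row in tile_rows(9, 6):
    print(row)
for row in tile_rows(6, 9):
    print(row)
```

Each inner list corresponds to one `get_concat_h_multi_blank([...], 'h')` call in save_h_imgs(), so the two 54-tile layouts differ only in the `cols` and `rows` arguments.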

In [1439]:
# Import statements 
import requests
import time
import ast
import random
import json
import os
import sys
import re
import io

from google.cloud import vision
from google.cloud.vision import types
from google.protobuf.json_format import MessageToDict
from bs4 import BeautifulSoup
import os.path
from os import path

import numpy as np
import PIL
from PIL import Image

client = vision.ImageAnnotatorClient()

In [1440]:
def slice_url(url: str) -> list:
    """ Returns a list composed of the first
    half of the given url, up to and including
    'dat=', and the second half of the url
    following the date. """
    
    # Get indexes of halves
    i = url.find('dat=')
    j = url.find('&', i) 
    
    # Slice string
    half1 = url[:i+4]
    half2 = url[j:]
    
    return [half1, half2]

In [1441]:
def remove_duplicates(str_list: list) -> list:
    """ Returns a list containing no duplicate
    elements, given str_list. """
    
    ret_list = []
    
    # Iterate through items in str_list
    for item in str_list:
        # Check if item has been already seen
        if item not in ret_list:
            # If not, add to new list.
            ret_list.append(item)
            
    # Return list with no duplicates
    return ret_list

In [1442]:
# Function takes in list of hrefs to pages and gets page IDs
def get_page_ids(href_list: list) -> list:
    """ Returns a list of lists, each with the PIDs
    of images located on a given news issue, given
    a list of links to the pages. """
    
    ret_list = []
    
    # Iterate through the list of links
    for page in href_list:
        
        # Create a list for the PIDs
        temp_list = []
        
        # Send request at random times
        n = random.randint(6, 10)
        time.sleep(n)
        result = requests.get(page)
        
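        # Get content from beautiful soup; on failure, record an empty
        # PID list so the result stays aligned with href_list
        if result.status_code != 200:
            ret_list.append(temp_list)
            continue
        soup = BeautifulSoup(result.content, "html.parser")

        # Get the relevant portion of the page source
        data = soup.find_all('script')
        script_text = str(data[-1])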
        
        # Use regex to find all pids in source
        pid_list = [m.start() for m in re.finditer('"pid":"', script_text)]

        # Go through pids and edit them for link
        for pid in pid_list:
            j = script_text.find(",", pid)
            i = script_text.find('",', pid)
            pid_num = script_text[pid+7:j] + '%2C' + script_text[j+1:i]
            temp_list.append(pid_num)
        
        # Append to list to be returned
        ret_list.append(remove_duplicates(temp_list))
        
    return ret_list

In [1443]:
# Gets href list
def get_links(url: str) -> list:
    """ Returns nested list of newspaper links
    to hrefs and image links located at url to
    google news archive. """
    
    page = url
    
    # Get request from url
    n = random.randint(6, 10)
    time.sleep(n)
    result = requests.get(page)
    
    print('[attemping to reach source code at:' + url + ']')
    print(str(result))
    
    # Check for non-200 status code
    if result.status_code == 200:
        print("[source code found.]")
        soup = BeautifulSoup(result.content, "html.parser")
    else:
        print("[source code not found.]")
        return [[],[]]

    # Get needed parts of source code
    data = soup.find_all('script')
    script_text = str(data[2])

    # Get required portion of source code as list
    i = script_text.find('summary_data')
    j = script_text.rfind(']')
    script_str = script_text[i+14:j+1]
    news_list = ast.literal_eval(script_str) 

    # Initiate lists to collect links
    ret_list = []
    href_list = []
    img_list = []
    
    # Iterate through 
    for year in news_list:
        # Check if editions exist for this year
        if 'editions' in year:
            year_list = year['editions']
            # Go through editions and get image and href link
            for edition in year_list:
                img_list.append(edition['img_url'])
                href_list.append(edition['href_url'])
    
    # Add these lists to returned list
    ret_list.append(href_list)
    ret_list.append(img_list)
    
    return ret_list

In [1444]:
def get_nested_link_list(start_date: int, end_date: int, url: str) -> list:
    """ Returns nested list of newspaper links
    to hrefs and image links from url to google
    news archive, from start_date to end_date."""
    
    # Slice and get the important parts of URL
    sliced_url = slice_url(url)
    
    url1 = sliced_url[0]
    url2 = sliced_url[1]
    
    href_dict = {}
    image_dict = {}

    # Go through months of archive
    i = start_date
    while(i < end_date):
        # Get links from url at particular date, both at yearly mark and
        # six months later
        list_of_links1 = get_links(url1 + str(i) + url2)
        list_of_links2 = get_links(url1 + str(i+600) + url2)
        
        # Add the links together
        list_of_hrefs = list_of_links1[0] + list_of_links2[0]
        list_of_imgs = list_of_links1[1] + list_of_links2[1]
        
        list_of_links = [list_of_hrefs, list_of_imgs]
        
        # Add the links to a dictionary so that there are no duplicates
        for href in list_of_links[0]:
            href_dict[href] = 1
        for img in list_of_links[1]:
            image_dict[img] = 1

        # Forward date by a year
        i+=10000
        
    # Convert keys to list and return
    final_href_list = list(href_dict.keys())
    final_image_list = list(image_dict.keys())

    return [final_href_list, final_image_list]

In [1445]:
def get_news_id(url: str) -> str:
    """ Return the id from the url to
    an image link for the newspaper."""

    # Isolate and return important part of link
    i = url.find("?id=")
    j = url.find("&pg=")
    return url[i+4:j+4]  

In [1446]:
def write_list_to_file(l: list, name: str) -> None:
    """ Writes each item in the list
    to a row in the text file. """
    
    with open(name, "a") as f:
        for item in l:
            f.write("%s\n" % item)   

In [1447]:
def get_template_img_urls(url: str, start_date: int, end_date: int) -> list:
    """ Returns array of template urls that point to 
    images which make up newspapers."""
        
    # Get list of links to pages where newspapers and images are located
    list_of_links = get_nested_link_list(start_date, end_date, url)
    # Get list of PIDs for each page in newspaper
    pid_list = get_page_ids(list_of_links[0])

    final_array = []

    # Iterate through the links to thumbnails
    for i in range(len(list_of_links[1])):
        link = list_of_links[1][i]
        # Loop through the pids for this particular image
        for j in range(len(pid_list[i])):
            url1 = "https://news.google.com/newspapers?id=" + get_news_id(link) + pid_list[i][j] + "&img=1&hl=en&zoom=1&tid="
            final_array.append(url1)

    return final_array

In [1448]:
def get_file(image_url: str) -> str:
    """ Returns name of image, downloaded 
    from image_url"""
    
    # Request image from URL
    n = random.randint(4, 9)
    time.sleep(n)
    img_data = requests.get(image_url).content
    
    # Check if image exists (a GIF is sent back when there is no image)
    if img_data.startswith(b"GIF"):
        return ""
    
    else:
        # If image exists, save it to disk.
        file_name = "small_output/" + image_url[-5:] + "small_image.jpg"
        with open(file_name, "wb") as handler:
            handler.write(img_data)
        return file_name

In [1449]:
def generate_images(url: str) -> list:
    """ Returns a list containing the names of 
    downloaded images making up image found at url.
    """
    
    # Initiate list to be returned 
    all_imgs = []
    
    # Set flag and iterator variable
    found_end = False
    i = 0
    
    # Collect images from link while images exist
    while not found_end and i < 150:
        # Iterate through the image links, downloading them
        send_url = url + str(i)
        ret = get_file(send_url)
        # If no file found set flag to True
        if ret == "":
            found_end = True
        # Otherwise add the name of the image to the list
        else:
            all_imgs.append(ret)
        i += 1
        
    # Return list of names
    return all_imgs

In [1450]:
def save_v_img(n: int, num: int, url: str):
    """ Stacks the row images 'h1.jpg' to 'h' + (n-1) + '.jpg'
    vertically and saves the result as one large image, named
    using the issue and page ids found in url. num is unused. """
    
    # Loop through partial images
    saved_obj_list = []
    for i in range(1, n):
        im = Image.open("h" + str(i) + ".jpg")
        saved_obj_list.append(im)

    # Get name of file using url
    full_pid = get_news_id(url)
    j = full_pid.find('&')
    issue_num = full_pid[:j]
    
    k = url.find("%2C")
    l = url.rfind("&img=")
    page_num = url[k+3:l]
    
    # Create the issue folder (and image_output itself) if missing
    os.makedirs('image_output/' + issue_num, exist_ok=True)
    
    name = 'image_output/' + issue_num + '/' + page_num + '.jpg'
    print("[Image saved as: " + name + "]")
    
    # Save with name
    get_concat_h_multi_blank(saved_obj_list, 'v').save(name)

In [1451]:
def get_concat_h_blank(im1, im2, color=(0, 0, 0)):
    """ Merges images im1 and im2 horizontally.
    Sourced from PIL documentation. """
    
    dst = Image.new('RGB', (im1.width + im2.width, max(im1.height, im2.height)), color)
    dst.paste(im1, (0, 0))
    dst.paste(im2, (im1.width, 0))
    
    # Returns image as object
    return dst

def get_concat_v_blank(im1, im2, color=(0, 0, 0)):
    """ Merges images im1 and im2 vertically.
    Sourced from PIL documentation. """
    
    dst = Image.new('RGB', (max(im1.width, im2.width), im1.height + im2.height), color)
    dst.paste(im1, (0, 0))
    dst.paste(im2, (0, im1.height))
    
    # Returns image as object
    return dst

def get_concat_h_multi_blank(im_list: list, dr: str):
    """ Merges multiple images contained in 
    im_list horizontally if dr is 'h', and
    vertically otherwise. Note that the first
    item is popped from im_list. """
    
    _im = im_list.pop(0)
    
    # Loop through images and concatenate
    for im in im_list:
        if dr == "h":
            _im = get_concat_h_blank(_im, im)
        else:
            _im = get_concat_v_blank(_im, im)
    return _im 

In [1452]:
def save_h_imgs(l: list, n: int, img_url: str) -> None:
    """ Merges images from l in pattern which
    creates large image for Google News Archive
    images."""
    
    list_len = len(l)
    
    if list_len == 35:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[11], l[12]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[13], l[14]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[15], l[16], l[17], l[24], l[25]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[18], l[19], l[20], l[26], l[27]], 'h').save('h5.jpg')
        get_concat_h_multi_blank([l[21], l[22], l[23], l[28], l[29]], 'h').save('h6.jpg')  
        get_concat_h_multi_blank([l[30], l[31], l[32], l[33], l[34]], 'h').save('h7.jpg')
        
        save_v_img(8, n, img_url)
        
    elif list_len == 48:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10], l[11], l[18], l[19]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[12], l[13], l[14], l[20], l[21]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[15], l[16], l[17], l[22], l[23]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[24], l[25], l[26], l[33], l[34], l[35], l[42], l[43]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[27], l[28], l[29], l[36], l[37], l[38], l[44], l[45]], 'h').save('h5.jpg')  
        get_concat_h_multi_blank([l[30], l[31], l[32], l[39], l[40], l[41], l[46], l[47]], 'h').save('h6.jpg') 
   
        save_v_img(7, n, img_url)
    
    elif list_len == 54:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10], l[11], l[18], l[19], l[20]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[12], l[13], l[14], l[21], l[22], l[23]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[15], l[16], l[17], l[24], l[25], l[26]],  'h').save('h3.jpg')
        get_concat_h_multi_blank([l[27], l[28], l[29], l[36], l[37], l[38], l[45], l[46], l[47]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[30], l[31], l[32], l[39], l[40], l[41], l[48], l[49], l[50]], 'h').save('h5.jpg')  
        get_concat_h_multi_blank([l[33], l[34], l[35], l[42], l[43], l[44], l[51], l[52], l[53]], 'h').save('h6.jpg') 
        
        save_v_img(7, n, img_url)
    
    elif list_len == 45:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[11], l[12]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[13], l[14]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[15], l[16], l[17], l[24], l[25]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[18], l[19], l[20], l[26], l[27]], 'h').save('h5.jpg')
        get_concat_h_multi_blank([l[21], l[22], l[23], l[28], l[29]], 'h').save('h6.jpg')  
        get_concat_h_multi_blank([l[30], l[31], l[32], l[39], l[40]], 'h').save('h7.jpg')
        get_concat_h_multi_blank([l[33], l[34], l[35], l[41], l[42]], 'h').save('h8.jpg')
        get_concat_h_multi_blank([l[36], l[37], l[38], l[43], l[44]], 'h').save('h9.jpg')
        
        save_v_img(10, n, img_url)
    
    elif list_len == 25:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[11], l[12]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[13], l[14]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[15], l[16], l[17], l[21], l[22]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[18], l[19], l[20], l[23], l[24]], 'h').save('h5.jpg')
        
        save_v_img(6, n, img_url)
    
    elif list_len == 56:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10], l[11], l[18]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[12], l[13], l[14], l[19]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[15], l[16], l[17], l[20]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[21], l[22], l[23], l[30], l[31], l[32], l[39]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[24], l[25], l[26], l[33], l[34], l[35], l[40]], 'h').save('h5.jpg')
        get_concat_h_multi_blank([l[27], l[28], l[29], l[36], l[37], l[38], l[41]], 'h').save('h6.jpg')
        get_concat_h_multi_blank([l[42], l[43], l[44], l[48], l[49], l[50], l[54]], 'h').save('h7.jpg')
        get_concat_h_multi_blank([l[45], l[46], l[47], l[51], l[52], l[53], l[55]], 'h').save('h8.jpg') 
        
        save_v_img(9, n, img_url)
    
    elif list_len == 63:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10], l[11], l[18], l[19], l[20]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[12], l[13], l[14], l[21], l[22], l[23]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[15], l[16], l[17], l[24], l[25], l[26]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[27], l[28], l[29], l[36], l[37], l[38], l[45], l[46], l[47]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[30], l[31], l[32], l[39], l[40], l[41], l[48], l[49], l[50]], 'h').save('h5.jpg')
        get_concat_h_multi_blank([l[33], l[34], l[35], l[42], l[43], l[44], l[51], l[52], l[53]], 'h').save('h6.jpg')
        get_concat_h_multi_blank([l[54], l[55], l[56], l[57], l[58], l[59], l[60], l[61], l[62]], 'h').save('h7.jpg')

        save_v_img(8, n, img_url)

    elif list_len == 42:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10], l[11], l[18]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[12], l[13], l[14], l[19]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[15], l[16], l[17], l[20]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[21], l[22], l[23], l[30], l[31], l[32], l[39]], 'h').save('h4.jpg')  
        get_concat_h_multi_blank([l[24], l[25], l[26], l[33], l[34], l[35], l[40]], 'h').save('h5.jpg') 
        get_concat_h_multi_blank([l[27], l[28], l[29], l[36], l[37], l[38], l[41]], 'h').save('h6.jpg')
        
        save_v_img(7, n, img_url)
    
    elif list_len == 36:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10], l[11]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[12], l[13], l[14]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[15], l[16], l[17]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[18], l[19], l[20], l[27], l[28], l[29]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[21], l[22], l[23], l[30], l[31], l[32]], 'h').save('h5.jpg')  
        get_concat_h_multi_blank([l[24], l[25], l[26], l[33], l[34], l[35]], 'h').save('h6.jpg')
        
        save_v_img(7, n, img_url)
    
    elif list_len == 49:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10], l[11], l[18]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[12], l[13], l[14], l[19]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[15], l[16], l[17], l[20]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[21], l[22], l[23], l[30], l[31], l[32], l[39]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[24], l[25], l[26], l[33], l[34], l[35], l[40]], 'h').save('h5.jpg')  
        get_concat_h_multi_blank([l[27], l[28], l[29], l[36], l[37], l[38], l[41]], 'h').save('h6.jpg') 
        get_concat_h_multi_blank([l[42], l[43], l[44], l[45], l[46], l[47], l[48]], 'h').save('h7.jpg')
        
        save_v_img(8, n, img_url)

    elif list_len == 24:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[10]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[11]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[12], l[13], l[14], l[21]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[15], l[16], l[17], l[22]], 'h').save('h5.jpg')
        get_concat_h_multi_blank([l[18], l[19], l[20], l[23]], 'h').save('h6.jpg')
        
        save_v_img(7, n, img_url)
    
    elif list_len == 20:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[10]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[11]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[12], l[13], l[14], l[18]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[15], l[16], l[17], l[19]], 'h').save('h5.jpg')
        
        save_v_img(6, n, img_url)

    elif list_len == 28:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[10]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[11]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[12], l[13], l[14], l[21]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[15], l[16], l[17], l[22]], 'h').save('h5.jpg')
        get_concat_h_multi_blank([l[18], l[19], l[20], l[23]], 'h').save('h6.jpg')
        get_concat_h_multi_blank([l[24], l[25], l[26], l[27]], 'h').save('h7.jpg')
        
        save_v_img(8, n, img_url)

    elif list_len == 15:
        get_concat_h_multi_blank([l[0], l[1], l[2]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[9], l[10], l[11]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[12], l[13], l[14]], 'h').save('h5.jpg')
        
        save_v_img(6, n, img_url)
        
    elif list_len == 12:
        get_concat_h_multi_blank([l[0], l[1], l[2]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[9], l[10], l[11]], 'h').save('h4.jpg')
        
        save_v_img(5, n, img_url)

    elif list_len == 40:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[11], l[12]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[13], l[14]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[15], l[16], l[17], l[24], l[25]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[18], l[19], l[20], l[26], l[27]], 'h').save('h5.jpg')
        get_concat_h_multi_blank([l[21], l[22], l[23], l[28], l[29]], 'h').save('h6.jpg')  
        get_concat_h_multi_blank([l[30], l[31], l[32], l[36], l[37]], 'h').save('h7.jpg') 
        get_concat_h_multi_blank([l[33], l[34], l[35], l[38], l[39]], 'h').save('h8.jpg')
        
        save_v_img(9, n, img_url)

    elif list_len == 30:
        get_concat_h_multi_blank([l[0], l[1], l[2], l[9], l[10]], 'h').save('h1.jpg')
        get_concat_h_multi_blank([l[3], l[4], l[5], l[11], l[12]], 'h').save('h2.jpg')
        get_concat_h_multi_blank([l[6], l[7], l[8], l[13], l[14]], 'h').save('h3.jpg')
        get_concat_h_multi_blank([l[15], l[16], l[17], l[24], l[25]], 'h').save('h4.jpg')
        get_concat_h_multi_blank([l[18], l[19], l[20], l[26], l[27]], 'h').save('h5.jpg')
        get_concat_h_multi_blank([l[21], l[22], l[23], l[28], l[29]], 'h').save('h6.jpg')
        
        save_v_img(7, n, img_url)
        
#     elif list_len == ex_num:
#         get_concat_h_multi_blank([l[], l[], l[], l[], l[] .... ], 'h').save('h1.jpg')
#         get_concat_h_multi_blank([l[], l[], l[], l[], l[] .... ], 'h').save('h2.jpg')
#         ....
#         ....
#         ....
#         get_concat_h_multi_blank([l[], l[], l[], l[], l[] .... ], 'h').save('hx.jpg')
        
#         save_v_img(x+1, n, img_url) Where x is the number of function calls above.
    
    else:
        print("Warning! Image is not of known size.")
        with open("incompleted_images.txt", "a") as g:
            g.write("%s\n" % img_url)
    

In [1453]:
def create_newspaper(n: int, img_url: str):
    """ Given img_url, generates a newspaper
    from images collected from that url. """
    
    # Generate the images composing larger image
    img_list = generate_images(img_url)
    obj_list = []
    print("Number of images: " + str(len(img_list)))
    
    # Use PIL and make each an image object
    for file in img_list:
        im = Image.open(file)
        obj_list.append(im)
    
    # Put together each image appropriately; the pattern is
    # chosen inside save_h_imgs() from the number of images
    save_h_imgs(obj_list, n, img_url)
    
    # Remove the downloaded tile files from disk
    for item in img_list:
        os.remove(item)
    

In [1454]:
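def get_ocr_data(url: str) -> None:
    """ Runs Google Vision OCR on the stitched image
    built from url, if it exists, and saves the
    response as .json and the text as .txt. """
    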
    full_pid = get_news_id(url)
    j = full_pid.find('&')
    issue_num = full_pid[:j]
    
    k = url.find("%2C")
    l = url.rfind("&img=")
    page_num = url[k+3:l]
    
    file_name = issue_num + '/' + page_num
    
    if path.exists('image_output/' + file_name + '.jpg'):
        with io.open('image_output/' + file_name + '.jpg', 'rb') as image_file:
            content = image_file.read()

        print("[Running OCR]")

        image = vision.types.Image(content = content)
        response = client.text_detection(image = image)
        texts = response.text_annotations

        # Convert the response to dictionary
        response = MessageToDict(response)

        # Convert to json
        j_file = 'ocr_output/' + issue_num + '-' + page_num + "_annotated_ocr.json" 
        with open(j_file, "w") as out_file:
            json.dump(response, out_file)
        print("[Saved as: " + j_file + "]")

        # Convert to .txt
        if len(texts) != 0:
            text = ('\n"{}"'.format(texts[0].description))
        else:
            text = ""
        o_file = 'ocr_output/' + issue_num + '-' + page_num + "_OCR.txt"
        with open(o_file, "w") as file:
            file.write(text)
        print("[Saved as: " + o_file + "]\n")

In [1455]:
def get_newspapers(url: str, start_date: int, end_date: int, folder_name: str) -> None:
    """ Generates newspapers as jpgs located
    at the url on google's newspaper archives."""
    
    # Get list of links which point to images
    list_of_imgs = get_template_img_urls(url, start_date, end_date)
    print("Number of images to be processed: " + str(len(list_of_imgs)))

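    # Resume from the number of images already logged as completed;
    # start from zero if the log does not exist yet
    start = 0
    if os.path.exists('completed_images.txt'):
        with open('completed_images.txt') as f:
            start = sum(1 for line in f)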
    print("Starting at: " + str(start))
    # Loop through each template image link
    for i in range(start, len(list_of_imgs)):
        print(str(i) + " [Building images located at: " + list_of_imgs[i] + ']')

        # Create newspaper image from link
        create_newspaper(i, list_of_imgs[i])

        # Collect OCR data
        get_ocr_data(list_of_imgs[i])

        # Log completed image
        with open("completed_images.txt", "a") as g:
            g.write("%s\n" % list_of_imgs[i])
        
    # Label folder and create new folders
    os.rename('image_output', folder_name)
    os.rename('ocr_output', folder_name + "_ocr")
    
    os.mkdir('image_output')
    os.mkdir('ocr_output')
    
    open('completed_images.txt', 'w').close()
    

In [1456]:
# Run this first - St.johns Royal Gazette
# get_newspapers("https://news.google.com/newspapers?nid=QhmGFXoqNHAC&dat=17920629&b_mode=2&hl=en",
#                17840101,
#                17990901,
#                'RoyalStJohnGazette')

# Run this second - Quebec Gazette
# get_newspapers("https://news.google.com/newspapers?nid=F_tUKv7nyWgC&dat=17750518&b_mode=2&hl=en",
#                 17550101,
#                 17650201,
#                 'QuebecGazette')

[attemping to reach source code at:https://news.google.com/newspapers?nid=QhmGFXoqNHAC&dat=17840101&b_mode=2&hl=en]
<Response [200]>
[source code found.]
[attemping to reach source code at:https://news.google.com/newspapers?nid=QhmGFXoqNHAC&dat=17840701&b_mode=2&hl=en]
<Response [200]>
[source code found.]
[attemping to reach source code at:https://news.google.com/newspapers?nid=QhmGFXoqNHAC&dat=17850101&b_mode=2&hl=en]
<Response [200]>
[source code found.]
[attemping to reach source code at:https://news.google.com/newspapers?nid=QhmGFXoqNHAC&dat=17850701&b_mode=2&hl=en]
<Response [200]>
[source code found.]
[attemping to reach source code at:https://news.google.com/newspapers?nid=QhmGFXoqNHAC&dat=17860101&b_mode=2&hl=en]
<Response [200]>
[source code found.]
[attemping to reach source code at:https://news.google.com/newspapers?nid=QhmGFXoqNHAC&dat=17860701&b_mode=2&hl=en]
<Response [200]>
[source code found.]
[attemping to reach source code at:https://news.google.com/newspapers?nid=Qh

ServiceUnavailable: 503 failed to connect to all addresses