# GEOG 464 - Term Project - Cameron Brubacher

## An interactive map of Canadian literature

For this term project, I am looking to combine my interests in literature and geography to create an interactive map of Canadian literary works. I hope to combine Python methods for processing textual and geospatial data with web mapping in HTML, JavaScript, and CSS. I also hope that the work completed for this project can serve as a starting point for further studies in human geography, literature, and/or digital humanities.

The main data source for this project is the Canadian Literary Centre's *Annotated Bibliography of Major Canadian Authors* [series](https://lib-ezproxy.concordia.ca/login?url=https://search.ebscohost.com/login.aspx?direct=true&db=cjh&bquery=&cli0=SE5&clv0=Annotated+bibliography+of+Canada%27s+major+authors&type=1&searchMode=Standard&site=ehost-live&scope=site), published by UofT, which includes bibliographic information on the works of a couple dozen presumed 'major' Canadian authors. Due to the size of the dataset and the variability in data formatting, the scope of this project has been narrowed considerably since the proposal stage and now includes only novels published by these authors. It should also be noted that the last volume of this series was published in the mid-1980s, so its entries are far from up to date; it nevertheless provided an extensive and reputable source on which I could perform some geospatial programming functions.

### Importing modules:

Here, I import the Python modules used to process the data throughout this project. Notable inclusions which were not explored in class include [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) (an HTML-parsing module), [Wikipedia](https://github.com/goldsmith/Wikipedia) (a module which uses the website's API to fetch data), and [Geocoder](https://github.com/DenisCarriere/geocoder) (a geocoding library which allows access to a number of providers).

In [2]:
import pandas as pd
import bs4 as bs
import re
import wikipedia as wiki
import geocoder as geo
import geopandas as gpd
import folium
import mapclassify
import matplotlib.pyplot as plt

### Importing data:

Here, I import the bibliographic data from the Canadian Literary Centre, accessed through the Concordia Library's EBSCO service. Two files are imported: a CSV file which includes surface-level article data from the *Annotated Bibliography* series, and a text version of an HTML file (previously parsed with BeautifulSoup) which includes the actual bibliographic entries from the series.

In [3]:
def import_data(articles_file, text_file):
    # Import csv with articles from library database:
    articles = pd.read_csv(articles_file)

    # Import text version of html file with full-text citations from library database:
    with open(text_file, 'r') as file:
        lines = file.readlines()
        record = [line.strip() for line in lines]
    return articles, record

### Cleaning article data:

Here, I clean the csv article data, which mainly involves the removal and renaming of columns, reformatting of some textual entries, and the addition of 'Work Type' and 'Work Genre' columns, based on content of the article titles.

In [4]:
def clean_art(dataset):
    # Remove unnecessary/empty columns:
    dataset.drop(columns={'ISSN', 'Publication Date', 'Issue', 'DOI', 'Doctype', 'Keywords', 'Abstract', 'First Page', 'Page Count'}, inplace=True)
    
    # Rename ambiguous columns:
    dataset.rename(columns={'Author': 'Volume Author', 'Journal Title': 'Volume Section', 'Publisher': 'Series Publisher', 'PLink': 'Permalink'}, inplace=True)
    
    # Reformat article titles so all begin with 'Part n:':
    for i, row in dataset.iterrows():
        title = row['Article Title']
        if title[6] != ':':
            dataset.loc[i, 'Article Title'] = title[:6]+':'+title[6:]
    
    # Reformat ISBN:
    for i, row in dataset.iterrows():
        dataset.loc[i, 'ISBN'] = row['ISBN'][2:-1]
    
    # Create empty work type and work genre columns:
    dataset['Work Type'] = pd.Series(dtype='string')
    dataset['Work Genre'] = pd.Series(dtype='string')
    
    # Assign work types and genres based on content of article titles:
    for i, row in dataset.iterrows():
        # Access the title by column label rather than by position:
        title = row['Article Title']
        if title.count(';') >= 2:
            dataset.loc[i, 'Work Type'] = title[(title.find(';')+2):title.rfind(';')]
            dataset.loc[i, 'Work Genre'] = title[(title.rfind(';')+2):]
        elif (title.count(';') == 1) and (title.count(':') > 1):
            dataset.loc[i, 'Work Type'] = title[(title.find(';')+2):title.rfind(':')]
            dataset.loc[i, 'Work Genre'] = title[(title.rfind(':')+2):]
        elif title.count(';') == 1:
            dataset.loc[i, 'Work Type'] = title[(title.rfind(';')+2):]
        elif title.count(':') > 2:
            dataset.loc[i, 'Work Type'] = title[(title.find(':')+2):title.rfind(':')]
            dataset.loc[i, 'Work Genre'] = title[(title.rfind(':')+2):]
        elif (title.count(';') == 0) and (title.count(':') == 1):
            dataset.loc[i, 'Work Type'] = title[(title.rfind(':')+2):]
    return dataset
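
As a rough illustration of the title-parsing idea, the sketch below handles only the semicolon-delimited case. The title layout (`'Part n: <label>; <type>; <genre>'`), the sample titles, and the helper name `split_title` are assumptions made for this example, not taken from the dataset.

```python
def split_title(title):
    """Return (work_type, work_genre) from a semicolon-delimited title.

    Assumes a hypothetical layout: 'Part n: <label>; <type>; <genre>'.
    """
    parts = [part.strip() for part in title.split(';')]
    work_type = parts[1] if len(parts) > 1 else None
    work_genre = parts[2] if len(parts) > 2 else None
    return work_type, work_genre


# Made-up titles, used only to show the three possible outcomes:
print(split_title('Part 1: Works by the author; Books; Novels'))
print(split_title('Part 1: Works by the author; Manuscripts'))
print(split_title('Part 2: Works on the author'))
```

Splitting on the delimiter avoids the index arithmetic of `find()` and `rfind()`, though it would not cover the colon-delimited titles handled by the function above.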

### Cleaning record data:

Here, I clean the HTML record data which includes the full-text bibliographic entries. This involves removing empty and near-empty entries (three characters or fewer) from the list of lines created during import.

In [5]:
def clean_record(dataset):
    # Remove entries based on length:
    delete = []
    for i, entry in enumerate(dataset):
        if len(entry) <= 3:
            delete.append(i)
    for n in sorted(delete, reverse=True):
        del dataset[n]
    return dataset
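
The same length-based filter could also be written as a list comprehension that returns a new list instead of deleting in place. This is a minimal sketch; the name `drop_short_lines` and the sample lines are hypothetical.

```python
def drop_short_lines(lines, min_length=4):
    """Keep only lines with at least `min_length` characters."""
    return [line for line in lines if len(line) >= min_length]


# Made-up lines: the empty string, '1.' and 'abc' are too short to keep.
sample = ['Record: 1', '', '1.', 'Title:', 'abc', 'abcd']
print(drop_short_lines(sample))
```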

### Tabulating metadata:

Here, I tabulate the metadata from the HTML record data, which includes further contextual information from each article of the *Annotated Bibliography* series, as well as information on the author discussed.

In [6]:
def table_meta(dataset):
    # Create list of headers based on sub-record metadata:
    headers = []
    for entry in dataset:
        if entry.endswith(':'):
            headers.append(entry)
    
    # Create empty metadata dataframe with list of headers:
    meta = pd.DataFrame(data=[[None]*len(headers)], columns=headers)

    # Add metadata to dataframe:
    for i, entry in enumerate(dataset):
        for header in headers:
            try:
                if (header == entry) and not dataset[i+1].endswith(':'):
                    meta[header] = dataset[i+1]
            except IndexError:
                break
    return meta
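
The header-to-value pairing can be sketched without pandas by walking the list in adjacent pairs. The helper `pair_headers` and the sample lines are assumptions for illustration. Unlike the function above, headers without a value are simply left out of the result.

```python
def pair_headers(lines):
    """Map each 'Header:' line to the line that follows it.

    A header followed directly by another header has no value and is skipped.
    """
    pairs = {}
    for current, following in zip(lines, lines[1:]):
        if current.endswith(':') and not following.endswith(':'):
            pairs[current] = following
    return pairs


# Made-up sub-record: 'Links:' has no value of its own.
sample = ['Title:', 'Part 1', 'Links:', 'Database:', 'Canadian Literary Centre']
print(pair_headers(sample))
```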

### Cleaning metadata:

Here, I clean the metadata table, which again involves the removal and renaming of columns, and the reformatting of textual entries.

In [17]:
def clean_meta(dataset):
    # Remove unnecessary columns:
    dataset.drop(columns={'Other Title:', 'Links:', 'Author(s):', 'Book Source:'}, inplace=True)

    # Rename columns:
    dataset.rename(columns={'Title:': 'Part Title', 'Record Type:': 'Record Type', 'Series Title:': 'Series Title', 'Genre(s):': 'Series Genre', 'Subject(s):': 'Work Author', 'Accession Number:': 'Accession Number', 'Database:': 'Database'}, inplace=True)
    
    # Reformat work author column (extract the name and move the surname last):
    for i, row in dataset.iterrows():
        try:
            author = dataset.loc[i, 'Work Author'][((dataset.loc[i, 'Work Author'].find(':'))+2):(dataset.loc[i, 'Work Author'].find(';'))]
            author = author[((author.find(','))+2):]+author[:(author.find(','))]
            dataset.loc[i, 'Work Author'] = author.title()
        except KeyError:
            pass
    return dataset
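
The name reordering step can be illustrated on its own with `str.partition`. This sketch assumes names arrive as `'Last, First'`; the helper `reorder_name` is hypothetical and is not the function used above.

```python
def reorder_name(name):
    """Turn 'Last, First' into 'First Last' in title case."""
    last, _, first = name.partition(',')
    return f'{first.strip()} {last.strip()}'.strip().title()


print(reorder_name('ATWOOD, Margaret'))
print(reorder_name('Atwood'))
```

Because `partition` always returns three parts, a name without a comma passes through unchanged apart from the title-casing.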

### Adding biographic and geographic information to metadata:

Here, I append biographic and geographic information to the metadata table. A short biography of the author is added using the Wikipedia module. After extended attempts to add birthplace information using the same method, I opted to add this information manually, considering the small number of authors and the inconsistent formatting of birthplace information in each Wikipedia entry. Finally, latitude and longitude coordinates are added using the Geocoder module, based on the author's birthplace.

Note: for Canadian authors born outside of Canada, I opted to use their landing/settling location in Canada, so as to only display geographic information within the country.

In [8]:
def geo_bio(dataset):
    # Create columns for author's biography, birthplace, and coordinates:
    dataset['Author Biography'] = pd.Series(dtype='string')
    dataset['Author Birthplace'] = pd.Series(dtype='string')
    dataset['latitude'] = pd.Series(dtype='float')
    dataset['longitude'] = pd.Series(dtype='float')

    for i, row in dataset.iterrows():
        # Extract author's name:
        author = dataset.loc[i, 'Work Author']
        #print(author)

        # Add short biography to dataframe:
        try:
            dataset.loc[i, 'Author Biography'] = wiki.summary(author, sentences=1, auto_suggest=True, redirect=True)
        except Exception:
            # Leave the biography empty if the Wikipedia lookup fails:
            pass

        # Add birthplace to dataframe (manually compiled lookup table):
        birthplaces = {
            'Alice Munro': 'Wingham, Ontario',
            'Anne Hebert': 'Sainte-Catherine-de-la-Jacques-Cartier, Quebec',
            'Earle Birney': 'Calgary, Alberta',
            'Ernest Buckler': 'West Dalhousie, Nova Scotia',
            'Ethel Wilson': 'Vancouver, British Columbia',
            'Gabrielle Roy': 'Saint Boniface, Manitoba',
            'Hugh Hood': 'Toronto, Ontario',
            'Marian Engel': 'Toronto, Ontario',
            'Morley Callaghan': 'Toronto, Ontario',
            'Margaret Atwood': 'Ottawa, Ontario',
            'Margaret Laurence': 'Neepawa, Manitoba',
            'Mavis Gallant': 'Montreal, Quebec',
            'Michael Ondaatje': 'Montreal, Quebec',
            'Mordecai Richler': 'Montreal, Quebec',
            'Patricia K. Page': 'Red Deer, Alberta',
            'Robert Kroetsch': 'Heisler, Alberta',
            'Robertson Davies': 'Thamesville, Ontario',
            'Sinclair Ross': 'Shellbrook, Saskatchewan',
            'Thomas Raddall': 'Halifax, Nova Scotia',
            'W.O. Mitchell': 'Weyburn, Saskatchewan',
            'Leonard Cohen': 'Westmount, Quebec',
        }
        if author in birthplaces:
            dataset.loc[i, 'Author Birthplace'] = birthplaces[author]
        
        # Extract author's birthplace:
        place = dataset.loc[i, 'Author Birthplace']

        # Geocode coordinates of birthplace (skipped if no birthplace was assigned):
        if pd.notna(place):
            location = geo.osm(place)
            latlng = location.latlng

            # Add coordinates to dataframe (only if the geocoder returned a result):
            if latlng:
                dataset.loc[i, 'latitude'] = latlng[0]
                dataset.loc[i, 'longitude'] = latlng[1]
    return dataset
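
Since several authors share a birthplace, one possible refinement is to geocode each distinct place only once and reuse the result. The sketch below is an assumption-laden illustration: `geocode_unique` is a hypothetical helper, and the stand-in `fake_geocode` with approximate Toronto coordinates replaces a real network call (with the Geocoder module, the callable could wrap `geo.osm(place).latlng`).

```python
def geocode_unique(places, geocode):
    """Geocode each distinct place once and return {place: (lat, lng) or None}.

    `geocode` is any callable that takes a place name and returns a
    (lat, lng) pair, or a falsy value when nothing is found.
    """
    cache = {}
    for place in places:
        if place is None or place in cache:
            continue
        result = geocode(place)
        cache[place] = tuple(result) if result else None
    return cache


# Stand-in geocoder with approximate, illustrative coordinates (no network call):
fake_coords = {'Toronto, Ontario': (43.65, -79.38)}
calls = []

def fake_geocode(place):
    calls.append(place)
    return fake_coords.get(place)

coords = geocode_unique(['Toronto, Ontario', 'Toronto, Ontario', 'Nowhere'], fake_geocode)
print(coords)
print(calls)  # each distinct place is looked up once
```

Fewer requests also helps stay within the usage limits that public geocoding services typically apply.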

### Tabulating citation data:

Here, I tabulate the citation data from the HTML record data, which includes the actual bibliographic entries of works by the authors included. This involves locating bibliographic entries using Python's built-in `re` (regular expression) module, splitting these entries at a separator, and appending the split entries to a list.

In [9]:
def table_cit(dataset):
    # Create list of bibliographic entries:
    works = []
    for entry in dataset:
        if re.search(r'^A\d+', entry):
            works.append(entry)
    for i, entry in enumerate(works):
        works[i] = entry.lstrip('A0123456789 ')
    for i, entry in enumerate(works):
        #print(entry)
        if re.search('^[a-z]', entry):
            works[i] = 'A'+entry
    
    # Split entries at each period and append the stripped parts to the citations list:
    cits = []
    for entry in works:
        row = entry.split('.')[:-1]
        row = [string.strip() for string in row]
        cits.append(row)
    return cits
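
One thing to watch when removing the entry label: `str.lstrip()` strips any leading run of the given characters, so a title that itself begins with 'A' can lose that letter. A regular-expression substitution that removes only the label is one alternative. The sketch below assumes labels of the form `A<digits>` followed by a space; the sample entry and the helper name `strip_entry_number` are made up.

```python
import re


def strip_entry_number(entry):
    """Remove a leading 'A<digits>' label and any whitespace after it."""
    return re.sub(r'^A\d+\s*', '', entry)


entry = 'A12 A Sample Title. Toronto: Example Press, 1969. 281 pp.'
print(strip_entry_number(entry))
print(entry.lstrip('A0123456789 '))  # also strips the title's leading 'A '
```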

### Cleaning citation data:

Here, I clean the citation data table, which includes the formatting of a new dataframe, and the appending and formatting of the previously split bibliographic entries.

In [10]:
def clean_cit(dataset):
    headers = ['Work Title', 'Year', 'Pages', 'Work Publisher', 'Publisher City', 'Notes']
    cit = pd.DataFrame(index = range(len(dataset)), columns = headers)
    for i, entry in enumerate(dataset):   
        notes = ""
        for i2, string in enumerate(entry):
            if i2 == 0:
                cit.loc[i, 'Work Title'] = string
            elif (':' in string) and (',' in string):
                cit.loc[i, 'Publisher City'] = string[:(string.find(':'))] 
                cit.loc[i, 'Work Publisher'] = string[((string.find(':'))+2):(string.rfind(','))] 
                cit.loc[i, 'Year'] = string[((string.rfind(','))+2):]     
            elif 'pp' in string:
                cit.loc[i, 'Pages'] = string
            else:
                notes = notes+'. '+string
            cit.loc[i, 'Notes'] = notes
    for i, row in cit.iterrows():
        try:
            if len(cit.loc[i, 'Notes']) < 1:
                cit.loc[i, 'Notes'] = None
            else:
                cit.loc[i, 'Notes'] = cit.loc[i, 'Notes'][2:]
        except KeyError:
            pass
    return cit
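
The city, publisher, and year extraction could also be expressed as a single regular expression with named groups. This is a sketch under the assumption that an imprint segment looks like `'City: Publisher, Year'` with a four-digit year; the pattern `IMPRINT`, the helper `parse_imprint`, and the sample strings are hypothetical.

```python
import re

# Assumed layout: '<city>: <publisher>, <four-digit year>'
IMPRINT = re.compile(r'^(?P<city>[^:]+):\s*(?P<publisher>.+),\s*(?P<year>\d{4})$')


def parse_imprint(segment):
    """Return a dict with city, publisher, and year, or None if no match."""
    match = IMPRINT.match(segment.strip())
    return match.groupdict() if match else None


print(parse_imprint('Toronto: Example Press, 1969'))
print(parse_imprint('281 pp'))
```

Returning `None` for segments that do not match makes it easy to route them to the pages or notes columns instead.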

### Calling functions:

Here, I call all the previously defined functions. The first steps (importing and cleaning the article and record data) run once, in order. The remaining steps (tabulating and cleaning the metadata and citation data) run inside a loop that splits the HTML record data into sub-records and appends each sub-record's metadata and citation data to their respective tables. Within the loop, the two tables are also joined; the joined table is then merged with the initial articles table to create a final master dataset.

In [18]:
# Import citation and full-text html data:
articles, record = import_data('data/articles.csv', 'data/EBSCOhost.txt')

# Clean article and HTML record data:
articles = clean_art(articles)
record = clean_record(record)

# Create list of split points (at 'Record: n') in HTML record data:
splits = []
for i, entry in enumerate(record):
    if 'Record:' in entry:
        splits.append(i)

# Create empty metadata, citation, and joined dataframes:
metadata = pd.DataFrame()
citations = pd.DataFrame()
metacit = pd.DataFrame()

# Iterate through the HTML record data:
for i, n in enumerate(splits):
    # Create sub-record for each group of entries in HTML record data:
    sub_rec = []
    try:
        sub_rec = record[n:(splits[i+1])]
    except IndexError:
        sub_rec = record[n:]
        
    # Tabulate and clean sub-record metadata:
    meta = table_meta(sub_rec)
    meta = clean_meta(meta)
    meta = geo_bio(meta)

    # Append sub-record metadata to metadata dataframe:
    metadata = pd.concat([metadata, meta], ignore_index=True)

    # Tabulate and clean sub-record citation data:
    citlist = table_cit(sub_rec)
    cit = clean_cit(citlist)

    # Append sub-record citation data to citation dataframe:
    citations = pd.concat([citations, cit], ignore_index=True)

    # Append sub-record citation data to sub-record metadata:
    metas = pd.concat([meta]*(len(cit)), ignore_index=True)
    submetacit = metas.join(cit)

    # Append sub-record metadata and citation data to joined dataframe:
    metacit = pd.concat([metacit, submetacit], ignore_index=True)

# Merge articles dataframe with joined metadata-citation dataframe to create master dataframe:
master = articles.merge(metacit, on='Accession Number')

# Reorder master dataframe columns:
master = master[['Database', 'Record Type', 'Series Title', 'Series Genre', 'Series Publisher', 'Volume', 'Volume Section', 'Volume Author', 'Part Title', 'Article Title', 'Subjects', 'ISBN', 'Accession Number', 'Permalink', 'Work Author', 'Author Biography', 'Author Birthplace', 'latitude', 'longitude', 'Work Type', 'Work Genre', 'Work Title', 'Work Publisher', 'Publisher City', 'Year', 'Pages', 'Notes']]
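
The sub-record splitting used in the loop above can be sketched as a small standalone function, which avoids the `try`/`except IndexError` for the final record. The helper name `split_records` and the sample lines are assumptions for illustration.

```python
def split_records(lines, marker='Record:'):
    """Split a flat list of lines into sub-lists that each start at a marker line."""
    starts = [i for i, line in enumerate(lines) if marker in line]
    # Each sub-record ends where the next one starts; the last ends at the list's end.
    ends = starts[1:] + [len(lines)]
    return [lines[start:end] for start, end in zip(starts, ends)]


# Made-up record data with two sub-records:
sample = ['Record: 1', 'Title:', 'Part 1', 'Record: 2', 'Title:', 'Part 2']
for sub_rec in split_records(sample):
    print(sub_rec)
```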

### Converting to geospatial data:

Here, I convert the master dataframe to a geodataframe, remove spaces from the column names to ease access in a web environment, set the coordinate system (WGS 1984, the standard used by OSM, the service I used for geocoding), and export the geodataframe to a GeoJSON file to be used in an interactive webmap. Finally, I use geopandas' explore() function to display an initial Leaflet-based map of the data.

In [19]:
# Convert master dataframe to geodataframe:
geometry = gpd.points_from_xy(master.longitude, master.latitude)
master_gdf = gpd.GeoDataFrame(master, geometry=geometry)

# Remove spaces from geodataframe column names:
master_gdf.columns = master_gdf.columns.str.replace(' ', '')

# Set coordinate system of geodataframe:
master_gdf.set_crs(epsg=4326, inplace=True)

# Export geodataframe to geojson:
master_gdf.to_file('data/canlit.geojson', driver='GeoJSON')

# Show preview of geodataframe:
master_gdf.explore()
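
For reference, the structure of a single GeoJSON point feature can be sketched with the standard library alone. Note that GeoJSON stores coordinates as `[longitude, latitude]`, which matches the x/y order passed to `points_from_xy` above. The helper `to_feature`, the property value, and the approximate coordinates are made up for this example.

```python
import json


def to_feature(properties, latitude, longitude):
    """Build a GeoJSON point feature; coordinates are ordered [longitude, latitude]."""
    return {
        'type': 'Feature',
        'geometry': {'type': 'Point', 'coordinates': [longitude, latitude]},
        'properties': properties,
    }


# Illustrative values only:
feature = to_feature({'WorkAuthor': 'Example Author'}, 43.65, -79.38)
collection = {'type': 'FeatureCollection', 'features': [feature]}
print(json.dumps(collection, indent=2))
```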