## Introduction

This script runs the periodic re-accession of CKAN portals. CKAN portals update less frequently than DCAT portals, so we usually run it every three months.


> Originally created by Yijing Zhou (@YijingZhou33) and Ziying Cheng (@Ziiiiing)

> Updated January 15, 2021                           
> Updated by Ziying Cheng (@Ziiiiing)

> Updated July 05, 2021                           
> Updated by Ziying Cheng (@Ziiiiing)

## Set up directories

Verify that you have the following files and folders in the same directory as this Notebook:

- `CKANportals.csv` includes some basic information about each CKAN portal.
- `resource` folder stores one list of resource names per portal for each re-accession, saved as `<portal>_YYYYMMDD.csv`. Each new list is compared with the most recent earlier one to identify created and deleted datasets.
- `reports` folder stores the metadata for all **New** datasets in a CSV file named `allNewItems_YYYYMMDD.csv`. **Deleted** datasets are listed in a second CSV file named `allDeletedItems_YYYYMMDD.csv`.
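
A quick pre-flight check can catch a missing input before the harvest starts. The sketch below is one way to do it, assuming the layout listed above; `missing_inputs` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path

# Expected layout, taken from the list above.
REQUIRED_FILES = ['CKANportals.csv']
REQUIRED_DIRS = ['resource', 'reports']


def missing_inputs(base='.'):
    """Return the names of required files and folders not found under `base`."""
    base = Path(base)
    missing = [name for name in REQUIRED_FILES if not (base / name).is_file()]
    missing += [name for name in REQUIRED_DIRS if not (base / name).is_dir()]
    return missing


problems = missing_inputs()
if problems:
    print('Missing:', ', '.join(problems))
else:
    print('All inputs found.')
```

An empty return value means the notebook can find everything it reads from or writes to.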




## Import modules

In [1]:
import csv
import urllib.request
import json 
import time
import os
import pandas as pd
from html.parser import HTMLParser
import re
import ast
import decimal
import ssl
import sys
import numpy as np

In [2]:
# auto-generate the current date in 'YYYYMMDD' format
actionDate = time.strftime('%Y%m%d')

## Load portal information

Read from local `CKANportals.csv` and extract the `URL`, `Provider`, `Publisher`, `Spatial Coverage` and `Bounding box` for each `portalName`.

In [3]:
portalsInfo = {}

with open('CKANportals.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    
    # skip the header row,
    # then read one portal record per row
    csv_fields = next(reader)
    for row in reader:
        portalsInfo[row[0]] = [row[1], row[2], row[3], row[4], row[5]]
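
Indexing rows by position is compact but depends on the column order of `CKANportals.csv`. As an alternative sketch, `csv.DictReader` reads each row by header name. The header names follow the field list above, but their exact spelling in the real file is an assumption, and `SAMPLE` is made-up data used only so the example runs without the file. `load_portals` is a hypothetical helper.

```python
import csv
import io

# Made-up sample with assumed header names; the real file may spell them differently.
SAMPLE = (
    'portalName,URL,Provider,Publisher,Spatial Coverage,Bounding box\n'
    'xx-01,https://example.org/,Example Provider,State of Example,Example,"-1,-2,3,4"\n'
)


def load_portals(csvfile):
    """Map each portal code to a dict of its remaining columns."""
    portals = {}
    for row in csv.DictReader(csvfile):
        code = row.pop('portalName')
        portals[code] = row
    return portals


portals = load_portals(io.StringIO(SAMPLE))
print(portals['xx-01']['URL'])
```

With this layout a lookup such as `portals[code]['Publisher']` stays correct even if a column is added or moved.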


## Loop over portals

For each portal, collect the current list of resources and compare it with the most recent list saved in the `resource` folder. The comparison yields the datasets created and deleted since the last re-accession. For each newly created dataset, request its metadata and format a record. For each deleted dataset, store the resource name along with its portal code in the CSV file.
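
Picking the most recent saved list works because the `YYYYMMDD` stamp in each file name sorts chronologically as plain text. Here is a minimal sketch of that selection step, using a hypothetical helper `latest_previous_date` and invented file names:

```python
def latest_previous_date(filenames, portal, today):
    """Return the newest YYYYMMDD stamp saved for `portal`, excluding `today`.

    Returns None when the portal has no earlier list, i.e. it is new.
    """
    prefix = f'{portal}_'
    dates = [
        name[-12:-4]                      # the 8 characters before '.csv'
        for name in filenames
        if name.startswith(prefix) and name.endswith('.csv')
    ]
    dates = [d for d in dates if d != today]
    # YYYYMMDD strings compare in date order, so max() is the most recent one.
    return max(dates) if dates else None


# Invented file names for illustration only.
files = ['xx-01_20210115.csv', 'xx-01_20210705.csv', 'xx-011_20210801.csv']
print(latest_previous_date(files, 'xx-01', '20210705'))
```

Matching on the portal code followed by `_` keeps `xx-01` from picking up files that belong to `xx-011`.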

In [4]:
# function to compare old and new resource list
# return created and deleted items separately

def returnNotMatches(old, new):
    oldResource = set(old)
    newResource = set(new)
    return [list(newResource - oldResource), list(oldResource - newResource)]
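
To make the return value concrete, here is the same set-difference idea applied to two invented lists. The function is repeated so the example runs on its own, and the results are sorted only to make the printed order predictable, since sets are unordered.

```python
def returnNotMatches(old, new):
    oldResource = set(old)
    newResource = set(new)
    return [list(newResource - oldResource), list(oldResource - newResource)]


# Invented resource names for illustration.
previous = ['roads', 'parcels', 'lakes']
current = ['roads', 'lakes', 'trails', 'wetlands']

created, deleted = returnNotMatches(previous, current)
print(sorted(created))   # names only in the current list
print(sorted(deleted))   # names only in the previous list
```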

In [5]:
### helper class and function to remove HTML tags from text
class MLStripper(HTMLParser):
    def __init__(self):
        # initialise the parser state and decode character references such as &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()

def cleanData(value):
    fieldvalue = strip_tags(value)
    return fieldvalue
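
The effect of the tag stripper is easiest to see on a small input. The sketch below is a self-contained variant of the same idea, with a hypothetical class name `TagStripper` and an invented description string.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()   # flush any text still buffered by the parser
    return ''.join(stripper.parts)


sample = '<p>Parcel boundaries for <b>Example County</b> &amp; nearby areas.</p>'
print(strip_tags(sample))
```

Tags are dropped and the `&amp;` reference is decoded to `&`, so the description reads as plain text in the report.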

In [6]:
# function to format metadata for new items

def metadataNewItems(newdata):    
    metadata = []
    
    title = ''
    alternativeTitle = newdata['result']['title']
        
    description = cleanData(newdata['result']['notes'])
    ### Drop the default placeholder description; otherwise collapse newlines and repeated whitespace, and replace typographic quotes, dashes and bullets
    if description == '{{default.description}}':
        description = description.replace('{{default.description}}', '')
    else:
        description = re.sub(r'[\n]+|[\r\n]+',' ', description, flags=re.S)
        description = re.sub(r'\s{2,}' , ' ', description)
        description = description.replace(u'\u2019', "'").replace(u'\u201c', '\"').replace(u'\u201d', '\"').replace(u'\u00a0', '').replace(u'\u00b7', '').replace(u'\u2022', '').replace(u'\u2013','-').replace(u'\u200b', '')

    language = 'eng'  
    creator = ''
    index = 0
        
    publisher = portalPublisher       
    spatialCoverage = portalSpaCov   

    if 'extras' in newdata['result']:
        extras = newdata['result']['extras']    
        for dictionary in extras:
            if dictionary['key'] == 'dsOriginator':
                creator = dictionary['value']

                ## if the Creator field contains the keyword 'County', use the county name as Publisher
                ## and prepend it to Spatial Coverage; otherwise keep the portal-level defaults set above
                index = creator.find('County')
                if index != -1:
                    publisher = creator[: index + 6]
                    spatialCoverage = publisher + f', {portalSpaCov}|{portalSpaCov}'   
    
                            
    format_types = []
    resourceClass = ''
    formatElement = ''
    downloadURL =  ''
    resourceType = ''
    featureServer = ''
    webService = ''
    html = ''
    previewImg = ''
    
    distribution = newdata['result']['resources']
    for dictionary in distribution:
        try:
            ### if one of the distributions is a shapefile, change genre/format and get the downloadURL
            format_types.extend([dictionary['format']])
            if dictionary['format'] == 'SHP':
                resourceClass = 'Datasets'
                formatElement = 'Shapefile'
                downloadURL = dictionary['url']
                resourceType = 'Vector data'
                
                
            ### if one of the distributions is WMS and the dataset is tagged 'aerial photography',
            ### change genre, type, and format to relate to imagery
            if dictionary['format'] == 'WMS':
                tags = newdata['result']['tags']
                for tag in tags:
                    if tag['display_name'] == 'aerial photography':                        
                        resourceClass = 'Imagery'
                        formatElement = 'Imagery'
                        downloadURL = dictionary['url']
                        resourceType = 'Satellite imagery'
                        
            ### saves the url if the dataset has Webservice format         
            if dictionary['format'] == 'ags_mapserver':
                webService = dictionary['url']
                
            ### saves the metadata page
            if dictionary['format'] == 'HTML':
                html = dictionary['url']   
            
            ### saves the thumbnail image
            if dictionary['format'] == 'JPEG':
                previewImg = dictionary['url']    
                
        ### skip distributions that are not structured in the typical way
        except Exception:
            resourceClass = ''
            formatElement = ''
            downloadURL =  ''       
            continue
                                                
    
    ### extracts the bounding box 
    try:
        bbox = []
        spatial = ''
        extra_spatial = newdata['result']['extras']
        for dictionary in extra_spatial:
            if dictionary['key'] == 'spatial':
                spatialList = ast.literal_eval(dictionary['value'].split(':[')[1].split(']}')[0])
                coordmin = spatialList[0]
                coordmax = spatialList[2]
                coordmin.extend(coordmax)
                typeDmal = decimal.Decimal
                fix3 = typeDmal("0.001")
                for coord in coordmin:
                    coordFix = typeDmal(coord).quantize(fix3)
                    bbox.extend([str(coordFix)])
                    spatial = ','.join(bbox)            
    except Exception:
        spatial = ''     
        
    try:
        theme = ''
        groups_theme = newdata['result']['groups']
        if len(groups_theme) != 0:
            theme = groups_theme[0]['display_name'].replace('+', 'and')
    except Exception:
        theme = ''
    
    # join the tag names with '|'
    keyword = newdata['result']['tags']
    keyword_list = '|'.join(dictionary['display_name'] for dictionary in keyword)
    
    dateIssued = newdata['result']['metadata_created']
    temporalCoverage = 'Continually updated resource'
    dateRange = ''
    
    information = landingurl + newdata['result']['name']
    ID = newdata['result']['id']
    
    featureServer = ''
    mapServer = ''
    imageServer = ''
    
    ### specifies the Webservice type by querying the webService string    
    try:
        if 'FeatureServer' in webService:
            featureServer = webService
        if 'MapServer' in webService:
            mapServer = webService
        if 'ImageServer' in webService:
            imageServer = webService
    except:
            print(ID)
    
    identifier = item
    provider = portalProvider  
    code = portal     
    memberOf = 'ba5cc745-21c5-4ae9-954b-72dd8db6815a'
    isPartOf = portal
    
    
    status = 'Active'
    accuralMethod = 'CKAN'
    dateAccessioned = time.strftime('%Y-%m-%d')
                
    rights = ''               
    accessRights = 'Public'
    suppressed = 'FALSE'
    childRecord = 'FALSE'
    
    metadataList = [title, alternativeTitle, description, language, creator, publisher,
                    resourceClass, theme, keyword_list, dateIssued, temporalCoverage,
                    dateRange, spatialCoverage, spatial, resourceType,
                    formatElement, information, downloadURL, mapServer, featureServer,
                    imageServer, html, previewImg, ID, identifier, provider, code, memberOf, isPartOf, status,
                    accuralMethod, dateAccessioned, rights, accessRights, suppressed, childRecord]
    
    ### keep the record only when a resource class was assigned ('Datasets' or 'Imagery');
    ### otherwise return an empty list, which the caller filters out
    if resourceClass != '':
        metadata = metadataList
    
    return metadata
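
The bounding-box step above slices the `spatial` string by hand. If the value is a GeoJSON polygon, which is an assumption about the portal's data rather than something the notebook checks, it can instead be parsed with `json` and reduced to its extent. This sketch swaps in that approach; `bbox_from_spatial` is a hypothetical helper and the sample coordinates are invented.

```python
import json


def bbox_from_spatial(value):
    """Return 'west,south,east,north' (3 decimals) for a GeoJSON Polygon string.

    Returns '' when the value cannot be read as a polygon.
    """
    try:
        geometry = json.loads(value)
        ring = geometry['coordinates'][0]          # outer ring of the polygon
        xs = [point[0] for point in ring]
        ys = [point[1] for point in ring]
        extent = [min(xs), min(ys), max(xs), max(ys)]
        return ','.join(f'{coord:.3f}' for coord in extent)
    except (ValueError, KeyError, IndexError, TypeError):
        return ''


# Invented rectangle for illustration.
sample = ('{"type": "Polygon", "coordinates": [[[-97.5, 43.0], [-97.5, 49.5], '
          '[-89.25, 49.5], [-89.25, 43.0], [-97.5, 43.0]]]}')
print(bbox_from_spatial(sample))
```

Taking the minimum and maximum over the whole ring does not depend on the order of the vertices, unlike picking the first and third points.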

In [7]:
AllNewMetadata = []
AllDeleltedItem = []

for portal in portalsInfo:     
    print()
    print(f'Harvesting portal {portal}')
    
    ### delete later
#     if portal == '05d-11':
#         print('>>> skip 05d-11')
#         continue

    portalURL = portalsInfo[portal][0]
    portalProvider = portalsInfo[portal][1]
    portalPublisher = portalsInfo[portal][2]
    portalSpaCov = portalsInfo[portal][3]

    packageURL = portalURL + 'api/3/action/package_list'
    landingurl = portalURL + 'dataset/'

    # request new resources list
    context = ssl._create_unverified_context()
    response = urllib.request.urlopen(packageURL, context=context).read()
    packageList = json.loads(response.decode('utf-8'))
    newList = packageList['result']

    # store new resources locally for next re-accession
    # newline='' stops the csv module from writing blank rows on Windows
    with open(f'resource/{portal}_{actionDate}.csv', 'w', newline='') as fw:
        writer = csv.writer(fw)
        field = ['result']
        rows = np.reshape(newList, (-1, 1))
        writer.writerow(field)
        writer.writerows(rows)

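    # find the dates of the resource lists already saved for this portal
    # (match on '<portal>_' so a portal code cannot match a longer code that starts with it)
    dates = []
    filenames = os.listdir('resource')
    for filename in filenames:
        if filename.startswith(f'{portal}_'):
            dates.append(filename[-12:-4]) 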

    if actionDate in dates:
        dates.remove(actionDate)


    # For portals that were harvested in a previous re-accession:
    ## compare the current list with the most recent saved list
    ## and find new and deleted items
    if dates:
        oldDate = max(dates)
        oldResource = f'resource/{portal}_{oldDate}.csv'

        oldList = []
        with open(oldResource) as fr:
            reader = csv.reader(fr)
            field = next(reader)
            for row in reader:
                oldList.append(row[0])

        # one comparison returns both the created and the deleted items
        newItems, deletedItems = returnNotMatches(oldList, newList)
        AllDeleltedItem += [[portal, x] for x in deletedItems]


    # For new portals:
    # all current resources are new and do not have deleted items
    else:
        newItems = newList


    # Create metadata for all new items for each portal
    withEmpty = []
    metadata = []
    count = 0
    total = len(newItems)

    for item in newItems:
        count += 1
        itemURL = portalURL + 'api/3/action/package_show?id=' + item
        print(f'>>> Collecting dataset({count}/{total}): {itemURL}')

        context = ssl._create_unverified_context()
        response = urllib.request.urlopen(itemURL, context=context).read()
        newitem = json.loads(response.decode('utf-8'))
        withEmpty.append(metadataNewItems(newitem))

    # drop the empty records returned for items without a resource class
    metadata = [x for x in withEmpty if x != []]
    AllNewMetadata += metadata 


Harvesting portal 02a-03
>>> Collecting dataset(1/14): https://data.illinois.gov/api/3/action/package_show?id=fy22-cfhr-awards-and-premiums-for-vocational-ag-fairs
>>> Collecting dataset(2/14): https://data.illinois.gov/api/3/action/package_show?id=grants-to-illinois-artists-and-arts-organizations-in-2022-3rd-quarterly-reports
>>> Collecting dataset(3/14): https://data.illinois.gov/api/3/action/package_show?id=fy22-cfhr-thoroughbred-awards-and-premiums
>>> Collecting dataset(4/14): https://data.illinois.gov/api/3/action/package_show?id=fy22-cfhr-payments-to-il-county-fairs
>>> Collecting dataset(5/14): https://data.illinois.gov/api/3/action/package_show?id=fy22-cfhr-purses-and-or-track-improvements
>>> Collecting dataset(6/14): https://data.illinois.gov/api/3/action/package_show?id=grant-information-collection-act-4th-quarter-2022-xlsx
>>> Collecting dataset(7/14): https://data.illinois.gov/api/3/action/package_show?id=fy22-cfhr-county-fairs-purse-money
>>> Collecting dataset(8/14): h

## Print Reports

In [8]:
def printReport(report, fields, datalist):
    with open(report, 'w', newline='', encoding='utf-8') as f:
        csvout = csv.writer(f)
        csvout.writerow(fields)
        csvout.writerows(datalist)
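
As a quick check of the report format, the sketch below writes two invented rows to a temporary folder and reads them back. `printReport` is repeated so the example is self-contained; the file name and row values are made up.

```python
import csv
import os
import tempfile


def printReport(report, fields, datalist):
    with open(report, 'w', newline='', encoding='utf-8') as f:
        csvout = csv.writer(f)
        csvout.writerow(fields)
        csvout.writerows(datalist)


# Invented rows in the same shape as the deleted-items report.
rows = [['xx-01', 'parcels'], ['xx-01', 'old, retired layer']]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'allDeletedItems_20210705.csv')
    printReport(path, ['Portal', 'Resource'], rows)
    with open(path, newline='', encoding='utf-8') as f:
        readback = list(csv.reader(f))

print(readback)
```

The value containing a comma survives the round trip because `csv.writer` quotes it on output.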

Write CSV file for all new datasets.

In [9]:
fieldnames_new = [
    'Title', 
    'Alternative Title', 
    'Description', 
    'Language', 
    'Creator', 
    'titleSource', 
    'Resource Class',
    'Theme', 
    'Keyword', 
    'Date Issued', 
    'Temporal Coverage', 
    'Date Range', 
    'Spatial Coverage',
    'Bounding Box', 
    'Resource Type', 
    'Format', 
    'Information', 
    'Download', 
    'MapServer', 
    'FeatureServer', 
    'ImageServer', 
    'HTML', 
    'Image', 
    'ID', 
    'Identifier', 
    'Provider', 
    'Code', 
    'Member Of', 
    'Is Part Of', 
    'Status', 
    'Accrual Method', 
    'Date Accessioned', 
    'Rights', 
    'Access Rights', 
    'Suppressed', 
    'Child Record'
]

filepath_new = f'reports/allNewItems_{actionDate}.csv'   
printReport(filepath_new, fieldnames_new, AllNewMetadata)

Write CSV file for all deleted datasets.

In [10]:
fieldnames_del = ['Portal', 'Resource']

filepath_del = f'reports/allDeletedItems_{actionDate}.csv'   
printReport(filepath_del, fieldnames_del, AllDeleltedItem)