# Analysis of Imaging Orders

This notebook examines datasets that record orders placed for the digitisation of collection items. The datasets are:

- [**abrs.csv:**](datasets/abrs.csv) ABRS data.
- [**abrs_reading_categories.csv:**](datasets/abrs_reading_categories.csv) ABRS reading categories.
- [**imps_item_details.csv:**](datasets/imps_item_details.csv) A table from the IMPS database.

The notebook attempts to prioritise candidates for digitisation based on:

1. Demand
2. Probable copyright status
3. Reading category (e.g. whether access to the original is restricted)

Each criterion is given a weighting, and the weighted criteria are used to produce a sorted list of items for each material type.

We still need to establish whether there is any direct connection between the ABRS and IMPS datasets; at the moment there doesn't appear to be one. However, as orders from the imaging studios would probably have had to go through the ABRS, we use the ABRS dataset as the main source and pull in supplementary data from the IMPS dataset where possible.
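One way to test for a connection between the two datasets is to check how many key values they share. The sketch below uses toy data and assumes, purely for illustration, that both tables carry a comparable `Shelfmark` column; the real IMPS column names may differ.

```python
import pandas as pd

# Toy stand-ins for the two datasets. A shared 'Shelfmark' column in the
# IMPS table is an assumption made for this example only.
abrs_sample = pd.DataFrame({'Shelfmark': ['A.1', 'B.2', 'C.3']})
imps_sample = pd.DataFrame({'Shelfmark': ['B.2', 'C.3', 'D.4']})

# indicator=True adds a '_merge' column recording where each key was found
overlap = pd.merge(abrs_sample.drop_duplicates(),
                   imps_sample.drop_duplicates(),
                   on='Shelfmark', how='outer', indicator=True)

# Count keys found in both tables, only in ABRS, or only in IMPS
overlap_counts = overlap['_merge'].value_counts().to_dict()
print(overlap_counts)
```

A high proportion of `both` would suggest the tables can be joined on that key; mostly `left_only` and `right_only` would suggest they cannot.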

In [None]:
import plotly as py
import plotly.figure_factory as ff
from plotly.graph_objs import *
import pandas as pd
import numpy as np

py.offline.init_notebook_mode()

# imps_df = pd.read_csv('datasets/imps_item_details.csv', dtype=str)
abrs_df = pd.read_csv('datasets/abrs.csv', dtype=str)
# Parse the submission timestamps, which use the format dd/mm/yy HH:MM
abrs_df['Date Submitted'] = pd.to_datetime(abrs_df['Date Submitted'], format='%d/%m/%y %H:%M')
abrs_rc_df = pd.read_csv('datasets/abrs_reading_categories.csv', dtype=str)

# Merge datasets
abrs_df = pd.merge(abrs_df, abrs_rc_df, on='Read Cat', how='outer')

# Drop duplicates
n_before = len(abrs_df)
abrs_df.drop_duplicates(inplace=True)
n_duplicates = n_before - len(abrs_df)

# Get reading category codes for each collection
coll_codes = abrs_rc_df.fillna('Unknown').groupby('Collection')['Read Cat'].apply(list)
coll_codes = pd.concat([pd.Series([abrs_rc_df['Read Cat'].tolist()], index=['All']), coll_codes])

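# Fill NaN in the reading category columns (dates and publication years are left as they are)
rc_cols = ['Collection', 'Read Cat', 'Access']
abrs_df[rc_cols] = abrs_df[rc_cols].fillna('Unknown')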

## ABRS dataset contents

The ABRS dataset contains a large number of duplicate rows, apparently created when items were scanned more than once. All such rows have been dropped for the following analysis.
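As a minimal sketch of how fully duplicated rows can be counted and removed, the example below uses a small made-up frame rather than the ABRS data; the column values are illustrative only.

```python
import pandas as pd

# Made-up sample: request 1 appears twice and request 3 appears three times
sample = pd.DataFrame({
    'Request ID': ['1', '1', '2', '3', '3', '3'],
    'Shelfmark': ['A.1', 'A.1', 'B.2', 'C.3', 'C.3', 'C.3'],
})

# duplicated() marks every repeat of a row after its first occurrence
dup_mask = sample.duplicated()
n_dupes = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduped = sample.drop_duplicates()
print(n_dupes, len(deduped))
```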

In [None]:
overview_df = pd.DataFrame({
    'Start Date': [abrs_df['Date Submitted'].min().strftime('%d-%m-%Y')],
    'End Date': [abrs_df['Date Submitted'].max().strftime('%d-%m-%Y')],
    'Duplicate Records': ["{:,}".format(n_duplicates)],
    'Unique Records': ["{:,}".format(len(abrs_df))]
}, columns=['Start Date', 'End Date', 'Duplicate Records', 'Unique Records'])
table = ff.create_table(overview_df)
py.offline.iplot(table)

## Reading categories

The reading categories dataset is used to sort items by material type and restricted-access status. The relevant columns are reproduced below for reference. To edit or correct this table locally, see [datasets/abrs_reading_categories.csv](../edit/datasets/abrs_reading_categories.csv).

In [None]:
table = ff.create_table(abrs_rc_df.fillna('Unknown')[['Collection', 'Read Cat', 'Access']])
py.offline.iplot(table)

## Overview

The following charts show some overall trends for demand, copyright and restricted status for each material type.
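The copyright status in these charts is estimated from the publication year alone. As a rough sketch, the same banding can be done in one vectorised step with `pd.cut`, assuming the year thresholds used in this notebook (1875 and 1925); the sample years are made up.

```python
import pandas as pd

# Made-up publication years, including a missing and a non-numeric value
years = pd.Series(['1850', '1900', '1925', '1990', None, 'n.d.'])
pub_year = pd.to_numeric(years, errors='coerce')  # unparseable values become NaN

# Right-closed bins: (0, 1875], (1875, 1925], (1925, current year]
this_year = pd.Timestamp.now().year
status = pd.cut(pub_year, bins=[0, 1875, 1925, this_year],
                labels=['Probably out of Copyright',
                        'Possibly out of Copyright',
                        'Probably in Copyright'])

# Years that are missing or outside every bin are labelled 'Unknown'
status = status.cat.add_categories(['Unknown']).fillna('Unknown')
print(status.tolist())
```

These bands are a heuristic for prioritisation, not a legal determination of copyright.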

In [None]:
def get_copyright_status(row):
    """Estimate copyright status from the publication year."""
    ranges = {
        'Probably out of Copyright': (1, 1875),
        'Possibly out of Copyright': (1876, 1925),
        'Probably in Copyright': (1926, pd.Timestamp.now().year)
    }
    # Missing or non-numeric years become NaN and fall through to 'Unknown'
    pubyear = pd.to_numeric(row['Pub Year'], errors='coerce')
    for status, (start, end) in ranges.items():
        if start <= pubyear <= end:
            return status
    return 'Unknown'

abrs_df['Copyright Status'] = abrs_df.apply(get_copyright_status, axis=1)
data = []
statuses = sorted(abrs_df['Copyright Status'].unique().tolist())
for s in statuses:
    y = [len(abrs_df.loc[(abrs_df['Copyright Status'] == s) & 
                         (abrs_df['Read Cat'].isin(codes))]) 
         for (coll, codes) in coll_codes.items()]
    x = coll_codes.keys()
    trace = Bar(x=x, y=y, name=s)
    data.append(trace)
    
layout = Layout(title='Estimated Copyright Status', barmode='stack')
fig = Figure(data=data, layout=layout)
py.offline.iplot(fig)

In [None]:
data = []
statuses = sorted(abrs_df['Access'].unique().tolist())
for s in statuses:
    y = [len(abrs_df.loc[(abrs_df['Access'] == s) & 
                         (abrs_df['Read Cat'].isin(codes))]) 
         for (coll, codes) in coll_codes.items()]
    x = coll_codes.keys()
    trace = Bar(x=x, y=y, name=s)
    data.append(trace)
    
layout = Layout(title='Access Status', barmode='stack')
fig = Figure(data=data, layout=layout)
py.offline.iplot(fig)

In [None]:
abrs_df['Request Count'] = abrs_df.groupby(['NormalizedShelfmark'])['Request ID'].transform('count')
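# Count each item once so the histogram shows requests per item, not per request
requests = abrs_df.drop_duplicates('NormalizedShelfmark')['Request Count'].tolist()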
data = [Histogram(x=requests)]
layout = Layout(title='Requests per Item')
fig = Figure(data=data, layout=layout)
py.offline.iplot(fig)

In [None]:
data = []
for coll, codes in coll_codes.items():
    hg_df = abrs_df.loc[abrs_df['Read Cat'].isin(codes)]
    # Count the requests submitted in each month
    tmp_df = hg_df.set_index('Date Submitted').resample('M').agg('count')
    trace = Scatter(x=tmp_df.index, y=tmp_df['Request ID'], name=coll, mode='lines', 
                    visible=('legendonly' if len(data) > 2 else True))  # Show the first few traces
    data.append(trace)

xaxis_date_range = dict(
    title='Date range', 
    rangeselector=dict(
        buttons=list([
            dict(step='all'),
            dict(count=5,
                 label='5y',
                 step='year',
                 stepmode='backward'),
            dict(count=2,
                label='2y',
                step='year',
                stepmode='backward'),
            dict(count=1,
                label='1y',
                step='year',
                stepmode='backward')
        ])
    ), 
    rangeslider=dict(), 
    type='date'
)

layout = Layout(title='Requests vs. Request Date', xaxis=xaxis_date_range, yaxis=dict(title='Requests'))
fig = Figure(data=data, layout=layout)
py.offline.iplot(fig)

## Selecting candidates for digitisation

The following tables show the top 20 candidates for digitisation for each material type, based on demand, copyright and restricted status. 
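As an illustration of how the three criteria could be combined into a single ranking, the sketch below computes a weighted priority score. The weights, the per-status scores, the `Restricted`/`Unrestricted` access labels and the `rank_candidates` helper are all hypothetical and would need to be agreed before use.

```python
import pandas as pd

# Hypothetical weights and scores; none of these numbers come from the datasets
WEIGHTS = {'demand': 0.5, 'copyright': 0.3, 'access': 0.2}
COPYRIGHT_SCORES = {'Probably out of Copyright': 1.0,
                    'Possibly out of Copyright': 0.5,
                    'Probably in Copyright': 0.0}
ACCESS_SCORES = {'Restricted': 1.0, 'Unrestricted': 0.0}  # assumed labels

def rank_candidates(items, top_n=20):
    """Return the top_n items ordered by a weighted priority score."""
    scored = items.copy()
    # Scale demand to 0..1 so it is comparable with the other criteria
    demand = scored['Request Count'] / scored['Request Count'].max()
    # Values without a defined score (e.g. 'Unknown') contribute nothing
    copyright_score = scored['Copyright Status'].map(COPYRIGHT_SCORES).fillna(0.0)
    access_score = scored['Access'].map(ACCESS_SCORES).fillna(0.0)
    scored['Priority'] = (WEIGHTS['demand'] * demand
                          + WEIGHTS['copyright'] * copyright_score
                          + WEIGHTS['access'] * access_score)
    return scored.sort_values('Priority', ascending=False).head(top_n)

# Made-up items, one row per shelfmark
items = pd.DataFrame({
    'Shelfmark': ['A.1', 'B.2', 'C.3'],
    'Request Count': [10, 4, 8],
    'Access': ['Unrestricted', 'Restricted', 'Restricted'],
    'Copyright Status': ['Probably in Copyright',
                         'Probably out of Copyright',
                         'Possibly out of Copyright'],
})
ranked = rank_candidates(items)
print(ranked[['Shelfmark', 'Priority']])
```

Scaling demand by the maximum request count keeps all three criteria on a 0 to 1 scale, so the weights express their relative importance directly.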

In [None]:
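for coll, codes in coll_codes.items():
    cols = ['Shelfmark', 'Request Count', 'Access', 'Copyright Status']
    tmp_df = abrs_df.loc[abrs_df['Read Cat'].isin(codes), cols]
    # One row per item, most-requested first
    tmp_df = tmp_df.drop_duplicates('Shelfmark').sort_values('Request Count', ascending=False)
    table = ff.create_table(tmp_df.head(20))
    py.offline.iplot(table)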