# Digital Library of India Kannada Deduplication


Author: Arjuna Rao Chavala (arjunaraoc@gmail.com)

Date of initial draft: 2018-11-28



This study describes the work done to identify duplicate items in the Kannada collection of the Digital Library of India (DLI). The original file size of an item is used as the key parameter for identifying duplicates. The study is presented as a Jupyter notebook with code in Python 3 so that the research is reproducible.


For more detail on the methodology and the problems with DLI, please read the [Deduplication study done for Telugu Collection](https://github.com/arjunaraoc/Deduplicate-DLI).




### Get the metadata from archive.org


In [1]:
#preliminaries: Python modules used by the code in the notebook
import sys  # used by sys.exit() in the download functions below
import pandas as pd
import numpy as np
import csv
from IPython.display import Image
import requests
import re

In [2]:
#code for downloading the catalog; the call is commented out because the live collection could change, so the output file is provided
#gets archive.org item fields and DLI description subfields
def getCollection2(resfile,numitems):
    try:
        fo=open(resfile,"w")
        error_log = open('arxerrlog.txt', 'w+')
        url = "https://archive.org/services/search/v1/scrape?"
        basic_params={ 'q':'(collection%3Adigitallibraryindia+AND+(language%3Akan++OR+language%3AKannada))',
                       'fields':'identifier,title,creator,date,description'}
        params=basic_params.copy()
        numline = 0
        fo.write( "id"+"\t"+"title"+"\t"+"creator"+"\t"+"pubd"+"\t"+"pages"+"\t"+"bc"+"\n")
        while True:
            try:

                params_str= "&".join("%s=%s" % (k, v) for k, v in params.items())
                print (params_str)
                resp = requests.get(url+params_str, headers={})
            except requests.exceptions.RequestException as e:
                # params is a dict, so format the already-built query string instead of concatenating it
                error_log.write('Could not get search result %s%s because of error: %s\n' % (url, params_str, e))
                print ("There was an error; writing to log.")
                sys.exit(1)
            else:
                data= resp.json()
                #write results
                iadict=data["items"]
                for i in iadict:
                    iaid=i['identifier']

                    iatitle=""
                    if 'title' in i:
                        iatitle=i['title']
                    iacreator=""
                    if 'creator' in i:
                        iacreator= i['creator']
                    iadate=""
                    if 'date' in i:
                        iadate= i['date']
                    iadesc=""
                    iadesc_totpages=""
                    iadesc_barcode=""
                    if 'description' in i:
                        iadesc=i['description']
                        
                        totpagessearchstr = "dc.description.totalpages" + ": " + "([0-9]+)"
                        m = re.search(totpagessearchstr,iadesc)
                        if m:
                            iadesc_totpages = m.group(1)
                        
                        # barcode search
                        bcsearchstr = "dc.identifier.barcode" + ": " + "([0-9]+)"
                        m = re.search(bcsearchstr,iadesc)
                        if m:
                            iadesc_barcode = m.group(1)


                    fo.writelines("%s\t%s\t%s\t%s\t%s\t%s\n" % (iaid,iatitle,iacreator,iadate,
                                                                    iadesc_totpages,iadesc_barcode))
                    numline += 1
                    if (numitems != 0) and (numline >= numitems):
                        break
                if (numitems != 0) and (numline >= numitems):
                    break
                cursor = data.get('cursor', None)
                print(cursor)
                if cursor is None:
                    break
                else:
                    params = basic_params.copy()
                    params['cursor'] = cursor
        fo.close()
        error_log.close()
    except IOError:
        print ("Error: can\'t find file or read data")
#getCollection2("./data/arxkancat.tsv",0)
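
The description parsing inside `getCollection2` can be tried in isolation. The following is a minimal sketch, assuming the description text embeds `dc.description.totalpages: N` and `dc.identifier.barcode: N` pairs as the regular expressions above expect. The helper name `parse_dli_description` and the sample string are hypothetical.

```python
import re

# Patterns mirror the ones used in getCollection2; the dots are escaped here
# so that they match literally rather than matching any character.
TOTPAGES_RE = re.compile(r"dc\.description\.totalpages: ([0-9]+)")
BARCODE_RE = re.compile(r"dc\.identifier\.barcode: ([0-9]+)")

def parse_dli_description(desc):
    """Return (pages, barcode) as strings; empty string when a field is absent."""
    pages = TOTPAGES_RE.search(desc)
    barcode = BARCODE_RE.search(desc)
    return (pages.group(1) if pages else "",
            barcode.group(1) if barcode else "")

# Hypothetical description text, for illustration only
sample = "dc.description.totalpages: 44 dc.identifier.barcode: 2040100074126"
print(parse_dli_description(sample))
print(parse_dli_description("no DLI subfields here"))
```

Items whose description lacks these subfields yield empty strings, which matches the blank `pages` and `bc` columns seen for some items in the catalog.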

In [3]:
dlicat=pd.read_csv("./data/arxkancat.tsv",index_col=None, sep="\t",converters={i: str for i in range(1,100)})
dlicat.to_csv("./data/arxkanfin.csv")
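
Reading every column as text keeps barcodes and page counts exactly as they were written and keeps blank cells as empty strings. A small sketch with made-up rows (the ids and values are hypothetical), using `dtype=str` with `keep_default_na=False` as an alternative way to get a similar effect to the `converters` argument above:

```python
import io
import pandas as pd

# Hypothetical two-row catalog in the same tab-separated layout
tsv = ("id\ttitle\tcreator\tpubd\tpages\tbc\n"
       "item.a\tT1\tC1\t1950\t044\t0010000000001\n"
       "item.b\tT2\tC2\t1951\t\t\n")

# Default parsing: numeric-looking columns become numbers, blanks become NaN
as_numbers = pd.read_csv(io.StringIO(tsv), sep="\t")

# All-text parsing: leading zeros and blanks survive unchanged
as_text = pd.read_csv(io.StringIO(tsv), sep="\t", dtype=str, keep_default_na=False)

print(as_numbers.dtypes["bc"], as_text.dtypes["bc"])
print(repr(as_text.loc[0, "bc"]), repr(as_text.loc[1, "bc"]))
```

String columns matter here because the later cells slice the barcode by character position.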

## Final dataset

The data is organised as identifier (archive.org id), title, creator, publication date (pubd), pages and DLI barcode (bc).

The barcode contains the following fields, with the number of digits given after each field name:

`<center no:1 or 2>,<Vendor number:2><Scanning location:3><source library:3><item number:7>`


As the samples below show, even bc was not captured properly by DLI, as evidenced by all 9s in some of the fields.
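
The slicing used further down in this notebook (`cat['bc'].str[-10:-7]`) treats the three digits before the last seven as the scanning location. A minimal sketch under that assumption; the helper name `split_barcode` is hypothetical, and the leading digits are returned unsplit because their layout is not verified here.

```python
def split_barcode(bc):
    """Split a DLI barcode string, counting from the right-hand side.

    Assumes (as the slicing in this notebook does) that the last 7 digits are
    the item number and the 3 digits before them are the scanning location.
    """
    if len(bc) < 10:
        return None  # blank or too short to interpret
    return {
        "prefix": bc[:-10],            # leading digits, left unsplit
        "scan_location": bc[-10:-7],
        "item_number": bc[-7:],
    }

print(split_barcode("2040100074126"))  # barcode taken from the cat.tail() sample
print(split_barcode(""))               # items without a barcode
```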

In [4]:
cat=pd.read_csv("./data/arxkanfin.csv",index_col=0,converters={i: str for i in range(1,100)})
cat.head()

| | id | title | creator | pubd | pages | bc |
|---|---|---|---|---|---|---|
| 0 | dli.osmania.3040 | ಮುಂದಿನ ದೇವರು | ಕೆ.ಕೆ.ಶೆಟ್ಟಿ | 1936-01-01T00:00:00Z | | |
| 1 | dli.osmania.3041 | ಕುರುಡು ಓದು | ಮಾನಪ್ಪ | 1946-01-01T00:00:00Z | | |
| 2 | dli.osmania.3042 | ಎಚ್ಚತ್ತ ಆಗ್ನೇಯ ಏಶಿಯ | ಎಂ. ಹರಿದಾಸ | 1942-01-01T00:00:00Z | | |
| 3 | dli.osmania.3043 | ಕರುಣಾಲಹರಿ | ಎಂ.ವಿ. ಸೀತಾರಾಮಯ್ಯ | 1955-01-01T00:00:00Z | | |
| 4 | dli.osmania.3044 | ಪುಟ್ಟರಸು | ಹೊಯಿಸಳ | 1949-01-01T00:00:00Z | | |


In [5]:
cat.tail()

| | id | title | creator | pubd | pages | bc |
|---|---|---|---|---|---|---|
| 5339 | in.ernet.dli.2015.494839 | ೧೯೮೬_ಜುಲೈ_ಸಪ್ತಗಿರಿ__ಕನ್ನಡ | ಕೆ. ಸುಬ್ಬರಾವ್ | 1986-01-01T00:00:00Z | 44 | 2040100074126 |
| 5340 | in.ernet.dli.2015.494840 | ಸಪ್ತಗಿರಿ ಜೂನ್ ೧೯೮೬ ಕನ್ನಡ | ಕೆ. ಸುಬ್ಬರಾವ್ | 1986-01-01T00:00:00Z | 44 | 2040100074127 |
| 5341 | in.ernet.dli.2015.494841 | ೧೯೮೬ ಮಾರ್ಚ್ ಸಪ್ತಗಿರಿ ಕನ್ನಡ | ಕೆ. ಸುಬ್ಬರಾವ್ | 1986-01-01T00:00:00Z | 42 | 2040100074128 |
| 5342 | in.ernet.dli.2015.494842 | ಸಪ್ತಗಿರಿ ಕನ್ನಡ ಮೇ ೧೯೮೬ | ಕೆ. ಸುಬ್ಬರಾವ್ | 1986-01-01T00:00:00Z | 43 | 2040100074129 |
| 5343 | in.ernet.dli.2015.494843 | ೧೯೮೬ ನವೆಂಬರ್ ಸಪ್ತಗಿರಿ ಕನ್ನಡ | ಕೆ. ಸುಬ್ಬರಾವ್ | 1986-01-01T00:00:00Z | 44 | 2040100074130 |


In [6]:
#Type of digital library project
cat['id'].str.extract(r'(in\.ernet\.dli|dli\.osmania)')[0].value_counts()

in.ernet.dli    3125
dli.osmania     2219
Name: 0, dtype: int64

In [30]:
#Scanning location code and number of items; the first row is blank, denoting Osmania library items
#blank: Osmania; 001: IIIT, Allahabad; 002: Osmania; 010: SVDL; 999: CDAC
cat['bc'].str[-10:-7].value_counts().sort_index()

       2219
001     478
002    2105
010     441
999     101
Name: bc, dtype: int64

### Summary info

In [8]:
cat.describe(include='all')

| | id | title | creator | pubd | pages | bc |
|---|---|---|---|---|---|---|
| count | 5344 | 5344 | 5344 | 5344 | 5344.0 | 5344.0 |
| unique | 5344 | 4299 | 2565 | 274 | 616.0 | 3126.0 |
| top | in.ernet.dli.2015.363050 | ಬಸವರಾಜದೇವರ ರಗಳೆ | ಕೆ. ಸುಬ್ಬರಾವ್ | 1955-01-01T00:00:00Z | | |
| freq | 1 | 4 | 145 | 376 | 2219.0 | 2219.0 |


Uniqueness based on various combinations of fields is given below (each output cell shows the number of groups).

In [9]:
dfg1=cat.groupby(['title','creator'])
dfg1.ngroups

5081

In [10]:
dfg2=cat.groupby(['title','creator','pubd'])
dfg2.ngroups

5120

In [11]:
dfg3=cat.groupby(['title','creator','pubd','pages'])
dfg3.ngroups

5307
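
To see how these group counts behave, here is a small sketch on a made-up catalog (all values hypothetical). Adding columns to the grouping key can only split groups, so the number of groups never decreases, and groups with more than one member are the duplicate candidates.

```python
import pandas as pd

# Hypothetical four-item catalog
toy = pd.DataFrame({
    "id":      ["a1", "a2", "a3", "a4"],
    "title":   ["T", "T", "T", "U"],
    "creator": ["C", "C", "C", "D"],
    "pubd":    ["1950", "1950", "1951", "1960"],
    "pages":   ["10", "10", "10", "20"],
})

full_key = ["title", "creator", "pubd", "pages"]
by_title_creator = toy.groupby(["title", "creator"]).ngroups
by_all = toy.groupby(full_key).ngroups
print(by_title_creator, by_all)

# Groups with more than one member are the duplicate candidates
candidates = [g["id"].tolist() for _, g in toy.groupby(full_key) if len(g) > 1]
print(candidates)
```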

In [12]:
csvfile = open('data/flagdupset.csv', 'w', newline="")
writer = csv.writer(csvfile, delimiter=",")
for name, group in dfg3:
    if len(group)>1:
        duplist=group['id'].tolist()
        writer.writerow(duplist)
csvfile.close()

In [13]:
with open('data/flagdupset.csv') as fi:
    dup=fi.readlines()
dup[0:5]

['in.ernet.dli.2015.287352,in.ernet.dli.2015.447771\n',
 'in.ernet.dli.2015.287354,in.ernet.dli.2015.447773\n',
 'in.ernet.dli.2015.381941,in.ernet.dli.2015.382264\n',
 'in.ernet.dli.2015.382196,in.ernet.dli.2015.382198\n',
 'dli.osmania.3245,dli.osmania.516\n']

In [14]:
# of duplicate lines
len(dup)

33

In [15]:
# count of all ids in the duplicate sets: each line has one more id than it has commas
numdupids=[s.count(",")+1 for s in dup]
sum(numdupids)

70

### Size access

In [16]:
# Takes a long time when there are many items. As the collection could change, the output file is provided.
# For each line of the duplicates csv file, get the file sizes through the metadata API (for speed)
# and write the ids, their sizes and the comparison status.
def sizeCompareForDuplicates2(inpfile,outpfile, numlines):
    import subprocess
    import json
    url = "https://archive.org/metadata/"
    try:
        fo=open(outpfile,"w")
        error_log = open('arxerrlog.txt', 'w+')
        line=1
        result = []
        resultset=set()
        fi=open(inpfile,"r")
        for row in fi.readlines():
            row=row.strip("\n")
            idlist=row.split(sep=",")
            index=0
            result.clear()
            resultset.clear()
            for id in idlist:
                params_str = "%s/files" % id
                print(params_str)
                try:
                    resp = requests.get(url + params_str, headers={})
                except requests.exceptions.RequestException as e:
                    # 'params' is not defined in this function; log the path that was requested
                    error_log.write('Could not get metadata %s%s because of error: %s\n' % (url, params_str, e))
                    print("There was an error; writing to log.")
                    sys.exit(1)
                else:
                    data = resp.json()['result']

                size='0'
                for obj in data:
                    if obj['name'].find(".pdf")!= -1:
                        size=obj['size']
                        break
                if(int(size)==0):
                    print("Error, Did not find pdf file for determining size")
                    exit(-1)
                result.append(size)
                index+=1
            #compare resulting sizes
            resultset=set(result)
            if len(resultset)==1:
                 compare="Success"
            else:
                compare="Fail"
            #write resultline
            index=0;
            for id in idlist:
                fo.write(id+","+result[index]+",")
                index+=1
            fo.write(compare+"\n")
            print(line,compare)
            line += 1
            if (numlines != 0) and (line > numlines):
                break
        # close the files so that the output is flushed to disk
        fi.close()
        fo.close()
        error_log.close()

    except IOError:
        print("Error: can\'t find file or read data")
#sizeCompareForDuplicates2("./data/flagdupset.csv","./data/flagdupsetcompare.csv",0)
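
The size lookup in `sizeCompareForDuplicates2` takes the first file whose name contains `.pdf`. Below is a minimal offline sketch of that step, assuming each entry of the `result` list is a dict with `name` and `size` keys, as the code above expects. The helper `first_pdf_size` and the sample listing are hypothetical, and the match is tightened to names ending in `.pdf`.

```python
def first_pdf_size(files):
    """Return the size (as a string) of the first PDF in a file listing, or None."""
    for entry in files:
        if entry.get("name", "").lower().endswith(".pdf"):
            return entry.get("size")
    return None

# Hypothetical listing shaped like the 'result' list used above
listing = [
    {"name": "item_meta.xml", "size": "1024"},
    {"name": "item.pdf", "size": "7577942"},
    {"name": "item_text.pdf", "size": "9000000"},
]
print(first_pdf_size(listing))
print(first_pdf_size([{"name": "item_meta.xml", "size": "1024"}]))
```

Returning `None` instead of exiting lets the caller decide how to handle an item without a PDF.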

In [17]:
with open('data/flagdupsetcompare.csv') as fi:
    dupcompare=fi.readlines()
dupcompare[0:5]


['in.ernet.dli.2015.287352,7577942,in.ernet.dli.2015.447771,7577942,Success\n',
 'in.ernet.dli.2015.287354,9381409,in.ernet.dli.2015.447773,9381409,Success\n',
 'in.ernet.dli.2015.381941,6461914,in.ernet.dli.2015.382264,6461914,Success\n',
 'in.ernet.dli.2015.382196,89377000,in.ernet.dli.2015.382198,89377000,Success\n',
 'dli.osmania.3245,14195488,dli.osmania.516,16042444,Fail\n']

In [18]:
sum('Success' in s for s in dupcompare)

18

In [19]:
sum('Fail'  in s for s in dupcompare )

15

In [20]:
# Read the size comparison output and regroup the ids on each line by file size.
# Ids that share a size are written together on one line; an id whose size is
# unique within its set is written on a line of its own.
def splitdup_size(inpfile, outfile, numlines):
    import pandas as pd
    import csv

    line = 0
    with open(inpfile, 'r') as fi, open(outfile, 'w', newline="") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        for row in fi:
            line += 1
            # drop the trailing status field ("Success" or "Fail")
            info = row.strip("\n").split(",")[:-1]
            # the remaining fields alternate id, size, id, size, ...
            pairs = list(zip(info[0::2], info[1::2]))
            isf = pd.DataFrame(pairs, columns=['id', 'size'])
            # sort=False keeps the groups in order of first appearance
            for name, group in isf.groupby('size', sort=False):
                writer.writerow(group['id'].tolist())
            if (numlines != 0) and (line >= numlines):
                break

splitdup_size('data/flagdupsetcompare.csv','data/flagdupsetrevised.csv',0)

In [21]:
fi=open("./data/flagdupsetrevised.csv",'r')
lines=[]
for row in fi.readlines():
    lines.append(row)

len(lines)

48

In [22]:
#show sample 
lines[0:5]

['in.ernet.dli.2015.287352,in.ernet.dli.2015.447771\n',
 'in.ernet.dli.2015.287354,in.ernet.dli.2015.447773\n',
 'in.ernet.dli.2015.381941,in.ernet.dli.2015.382264\n',
 'in.ernet.dli.2015.382196,in.ernet.dli.2015.382198\n',
 'dli.osmania.3245\n']
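
The splitting step can also be expressed without pandas. The sketch below groups the ids of one comparison line by file size using a plain dict; `split_by_size` is a hypothetical helper, and the two input lines are taken from the sample of `flagdupsetcompare.csv` shown earlier.

```python
def split_by_size(compare_line):
    """Turn 'id1,size1,id2,size2,...,Status' into lists of ids sharing a size."""
    fields = compare_line.strip().split(",")[:-1]   # drop Success/Fail
    groups = {}                                     # dicts keep insertion order
    for item_id, size in zip(fields[0::2], fields[1::2]):
        groups.setdefault(size, []).append(item_id)
    return list(groups.values())

same = "in.ernet.dli.2015.287352,7577942,in.ernet.dli.2015.447771,7577942,Success\n"
differ = "dli.osmania.3245,14195488,dli.osmania.516,16042444,Fail\n"
print(split_by_size(same))
print(split_by_size(differ))
```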

In [23]:
#Non-duplicate lines based on size: only one id is present on the line, without a comma
nondupl=[i for i in lines if i.find(",")==-1]
len(nondupl)

29

In [24]:
#Duplicate lines based on size: a comma is present. This count is also the number of ids to be flagged with a note.
dupl=[i for i in lines if i.find(",")!=-1]
len(dupl)

19

In [25]:
# of ids to be deleted
numids=[len(s.split(",")) for s in dupl]
numdups=sum(numids)-len(dupl)
numdups

22

In [26]:
# Number of duplicates as percentage
numdups*100/cat.shape[0]

0.4116766467065868
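
Each line keeps its first id and treats the rest as redundant, so the number of ids to delete is the group size minus one, summed over all lines. A small sketch with made-up lines and a made-up catalog size (`redundant_ids` is a hypothetical helper):

```python
def redundant_ids(lines):
    """Count the ids beyond the first on each comma-separated line."""
    return sum(len(line.strip().split(",")) - 1 for line in lines if line.strip())

# Hypothetical revised lines: one pair, one triple, one unique item
sample_lines = ["a,b\n", "c,d,e\n", "f\n"]
to_delete = redundant_ids(sample_lines)
catalog_size = 200                      # made-up catalog size
percentage = to_delete * 100 / catalog_size
print(to_delete, percentage)
```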

## Shell Script generation

In [27]:
ffo=open("./data/dup_flag.sh","w")
fdo=open("./data/dup_delete.sh","w")
fi=open("./data/flagdupsetrevised.csv",'r')
line=0
import time
for row in fi.readlines():
    line+=1
    row=row.strip("\n")
    ids=row.split(",")
    for i in range(0,len(ids)):
        if i==0:
            if len(ids)>1:
                #write notes command
                curation_notes=row.replace(ids[0]+",","")
                notescommand=("ia metadata %s  --modify=\"notes:Exact duplicates (Archive identifiers) of " 
                              "this item, which are likely to be "
                              "hidden or deleted: %s\"\n") % (ids[0], curation_notes)  
                ffo.write(notescommand)
            
        else:
            fdo.write("ia delete "+ids[i]+" -H x-archive-keep-old-version:0\n")
    

ffo.close()
fdo.close()
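
The per-line logic of the cell above can be isolated into a function that is easy to check before any script is written. This is a sketch only: `commands_for_group` is a hypothetical helper, and the command strings simply reuse the format written by the cell above.

```python
def commands_for_group(ids):
    """Build the flag and delete commands for one group of identical items.

    The first id is kept and annotated; every other id gets a delete command.
    """
    keep, extras = ids[0], ids[1:]
    flag = None
    if extras:
        flag = ('ia metadata %s  --modify="notes:Exact duplicates (Archive identifiers) of '
                'this item, which are likely to be hidden or deleted: %s"'
                % (keep, ",".join(extras)))
    deletes = ["ia delete %s -H x-archive-keep-old-version:0" % i for i in extras]
    return flag, deletes

flag, deletes = commands_for_group(["in.ernet.dli.2015.287352", "in.ernet.dli.2015.447771"])
print(flag)
print(deletes)
print(commands_for_group(["dli.osmania.3245"]))
```

A group with a single id produces no commands, which matches the behaviour of the loop above for non-duplicate lines.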