## Draft: finding duplicates in the MarketPlace Dataset
This notebook discovers duplicates (if any) in the dataset.

#### External libraries and function to download descriptions from the MarketPlace dataset using the API
The following two cells import the external libraries used in this notebook and define a helper function. In the final release of this notebook, the function may be optimized and provided as an external library.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
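def getMPDescriptions(url, pages):
    # Download pages 1..pages (20 items per page) and stack them into one DataFrame.
    frames = []
    for page in range(1, pages + 1):
        turl = url + str(page) + "&perpage=20"
        frames.append(pd.read_json(turl, orient='columns'))
    if not frames:
        return pd.DataFrame()
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    return pd.concat(frames, ignore_index=True)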
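
The download loop can be tried without network access by separating the fetching of a page from the concatenation. The following is a minimal sketch under that assumption: `collect_pages` and `fake_fetch` are hypothetical names, and the records returned by `fake_fetch` are made up for illustration rather than taken from the MarketPlace API.

```python
import pandas as pd


def collect_pages(fetch_page, pages):
    """Concatenate the frames returned by fetch_page(1), ..., fetch_page(pages)."""
    frames = [fetch_page(page) for page in range(1, pages + 1)]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def fake_fetch(page):
    # Hypothetical stand-in for pd.read_json(url + str(page) + "&perpage=20"):
    # two made-up records per page.
    return pd.DataFrame({
        "id": [page * 10, page * 10 + 1],
        "label": [f"item {page}a", f"item {page}b"],
    })


df_demo = collect_pages(fake_fetch, 3)
print(len(df_demo))
```

In the notebook itself, `fetch_page` could be a small function wrapping `pd.read_json`, so that the same concatenation logic serves all three item categories.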

### Find duplicates in Tools and Services
The MarketPlace API is used to download the descriptions of Tools and Services.

In [3]:
df_tool_all = getMPDescriptions("https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/tools-services?page=", 81)
df_tool_all.index

RangeIndex(start=0, stop=1606, step=1)

#### A quick look at data
The table below shows the first few descriptions.
Only a subset of the properties is shown: *id, category, label, licenses, contributors, accessibleAt, sourceItemId*.

In [4]:
df_tool_flat = pd.json_normalize(df_tool_all['tools'])
df_tool_work=df_tool_flat[['id', 'category', 'label', 'licenses', 'contributors', 'accessibleAt', 'sourceItemId']]
df_tool_work.head()

| | id | category | label | licenses | contributors | accessibleAt | sourceItemId |
|---|---|---|---|---|---|---|---|
| 0 | 30509 | tool-or-service | 140kit | [] | [{'actor': {'id': 483, 'name': 'Ian Pearce, De... | [https://github.com/WebEcologyProject/140kit] | 937 |
| 1 | 28542 | tool-or-service | 3DF Zephyr - photogrammetry software - 3d mode... | [] | [] | [https://www.3dflow.net/3df-zephyr-pro-3d-mode... | WQFP6XPS |
| 2 | 11508 | tool-or-service | 3DHOP | [] | [] | [http://vcg.isti.cnr.it/3dhop/] | SG86ZG5J |
| 3 | 11419 | tool-or-service | 3DHOP: 3D Heritage Online Presenter | [] | [] | [https://github.com/cnr-isti-vclab/3DHOP] | R379NADX |
| 4 | 11507 | tool-or-service | 3DReshaper \| 3DReshaper | [] | [] | [https://www.3dreshaper.com/en/] | PMES8DJW |
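
The flattening step relies on `pd.json_normalize`, which turns nested dictionaries into columns with dotted names. The sketch below illustrates this on made-up records; the nested `source` field and its values are assumptions for the example and are not taken from the API response.

```python
import pandas as pd

# Made-up records, shaped like the nested values assumed to sit in the 'tools' column.
records = pd.Series([
    {"id": 1, "label": "Tool A", "source": {"id": 7, "label": "Demo source"}},
    {"id": 2, "label": "Tool B", "source": {"id": 8, "label": "Other source"}},
])

# Each nested key becomes its own column, named parent.child.
flat = pd.json_normalize(records.tolist())
print(flat.columns.tolist())
```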


#### Identify duplicates based on the values of the *label* property

In [6]:
df_tool_work_duplicates=df_tool_work[df_tool_work.duplicated('label', keep=False)].sort_values('label')
df_tool_work_duplicates[['id', 'label', 'accessibleAt']].sort_values('label').head(6)

| | id | label | accessibleAt |
|---|---|---|---|
| 203 | 27972 | CloudCompare - Documentation | [http://www.danielgm.net/cc/documentation.html] |
| 204 | 11448 | CloudCompare - Documentation | [http://www.cloudcompare.org/doc/] |
| 286 | 29438 | Cytoscape | [http://cytoscape.org/] |
| 287 | 30178 | Cytoscape | [http://www.cytoscape.org/] |
| 294 | 30140 | Data Desk | [https://datadescription.com/] |
| 295 | 29689 | Data Desk | [] |
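
The `keep` argument of `DataFrame.duplicated` decides which members of a duplicate group are flagged. The following sketch, on a made-up toy table, contrasts `keep=False` (flag every row whose label is shared) with `keep="first"` (flag only the repeats after the first occurrence).

```python
import pandas as pd

# Toy data; ids and labels are invented for the illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "label": ["Cytoscape", "Cytoscape", "Data Desk", "Cytoscape", "3DHOP"],
})

# keep=False marks every member of a duplicate group.
all_copies = toy[toy.duplicated("label", keep=False)]
# keep="first" leaves the first occurrence unmarked and marks the later ones.
extra_copies = toy[toy.duplicated("label", keep="first")]

print(all_copies["id"].tolist())
print(extra_copies["id"].tolist())
```

`keep=False` is the natural choice when the goal is to review whole groups side by side, as in the table above; `keep="first"` or `keep="last"` is more useful when deciding which rows to drop.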


In [7]:
df_tool_work_duplicates.sort_values('label').to_csv(path_or_buf='duplicatedtools_services.csv')
df_tool_work_duplicates_av=df_tool_work[df_tool_work.duplicated('label', keep="last")].sort_values('label')
# Count rows with len(); count()[0] relies on positional Series indexing, which is deprecated.
tv = len(df_tool_work_duplicates)
print(f'\nThere are {tv} duplicated tool/service descriptions\n')


There are 51 duplicated tool/service descriptions
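
The figure printed above counts every row that shares its label with another row, so a pair of identical labels contributes two to the total. Since 51 is odd, at least one label must occur more than twice. A minimal sketch on made-up data shows how the number of rows, the number of distinct duplicated labels, and the number of redundant rows relate to each other.

```python
import pandas as pd

# Toy labels: "A" occurs twice, "B" three times, "C" once.
toy = pd.DataFrame({"label": ["A", "A", "B", "B", "B", "C"]})

dup_rows = toy[toy.duplicated("label", keep=False)]
group_sizes = dup_rows["label"].value_counts()

rows_in_groups = len(dup_rows)        # the kind of count printed by the notebook
distinct_labels = len(group_sizes)    # how many labels are duplicated
redundant_rows = rows_in_groups - distinct_labels  # rows beyond one per label

print(rows_in_groups, distinct_labels, redundant_rows)
```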



### Find duplicates in Publications
The MarketPlace API is used to download the descriptions of Publications.

In [8]:
df_publication_all = getMPDescriptions("https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/publications?page=", 151)
df_publication_all.index

RangeIndex(start=0, stop=2986, step=1)

#### A quick look at data
The table below shows the first few descriptions, sorted by label.  
Only a subset of the properties is shown: *id, category, label, licenses, contributors, accessibleAt, sourceItemId*.  

In [15]:
df_publication_flat = pd.json_normalize(df_publication_all['publications'])
#df_publication_flat.info()
df_publication_work=df_publication_flat[['id', 'category', 'label', 'licenses', 'contributors', 'accessibleAt', 'sourceItemId']]
df_publication_work.sort_values('label').head()

| | id | category | label | licenses | contributors | accessibleAt | sourceItemId |
|---|---|---|---|---|---|---|---|
| 140 | 24039 | publication | "A Model for International Cooperation - Emble... | [] | [{'actor': {'id': 4354, 'name': 'Mara R. Wade'... | [http://dh2016.adho.org/abstracts/91] | conf/dihu/WadeHS16 |
| 215 | 23420 | publication | "A Pale Reflection of the Violent Truth? Pract... | [] | [{'actor': {'id': 3445, 'name': 'Seth Kotch', ... | [https://dh2017.adho.org/abstracts/365/365.pdf] | conf/dihu/KotchG17 |
| 280 | 23345 | publication | "A Trace of this Journey" - Citations of Digit... | [] | [{'actor': {'id': 3098, 'name': 'Paul Matthew ... | [https://dh2017.adho.org/abstracts/070/070.pdf] | conf/dihu/Gooding17 |
| 217 | 22373 | publication | "A picture is worth a thousand words"? - From ... | [] | [{'actor': {'id': 2540, 'name': 'Amelie Dorn',... | [http://ceur-ws.org/Vol-2717/paper06.pdf] | conf/dhn/AbgazDKD20 |
| 214 | 25164 | publication | "Any more Bids?" - Automatic Processing and Se... | [] | [{'actor': {'id': 5800, 'name': 'Kris West', '... | [http://dh2010.cch.kcl.ac.uk/academic-programm... | conf/dihu/WestLB10 |


#### Identify duplicates based on the values of the *label* property

In [16]:
df_publication_work_duplicates=df_publication_work[df_publication_work.duplicated('label', keep=False)].sort_values('label')
df_publication_work_duplicates[['id', 'label', 'accessibleAt']].sort_values('label').head(6)

| | id | label | accessibleAt |
|---|---|---|---|
| 93 | 24079 | A Digital Metaphor Map for English | [http://dharchive.org/paper/DH2014/Paper-822.xml] |
| 94 | 24074 | A Digital Metaphor Map for English | [http://dharchive.org/paper/DH2014/Paper-448.xml] |
| 1620 | 24354 | Marked E-Books and Kindle's popular highlight ... | [http://dharchive.org/paper/DH2014/Poster-836.... |
| 1621 | 24353 | Marked E-Books and Kindle's popular highlight ... | [http://dharchive.org/paper/DH2014/Paper-836.xml] |
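
Rows that share a label do not always share the same *accessibleAt* value, as the table above shows. One possible way to separate the two cases is to count the distinct *accessibleAt* values within each label group. This is only a sketch on made-up data (the `example.org` URLs are invented); because the column holds lists, each value is first converted to a tuple so that it can be compared and hashed.

```python
import pandas as pd

# Toy data: "Paper X" has two different URLs, "Tutorial Y" repeats the same one.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "label": ["Paper X", "Paper X", "Tutorial Y", "Tutorial Y"],
    "accessibleAt": [
        ["http://example.org/a"],
        ["http://example.org/b"],
        ["http://example.org/t"],
        ["http://example.org/t"],
    ],
})

# Lists are not hashable, so turn each one into a tuple before comparing.
url_key = toy["accessibleAt"].apply(tuple)
distinct_urls = url_key.groupby(toy["label"]).agg(lambda s: len(set(s)))

# Labels whose rows all point to the same location.
same_url_labels = distinct_urls[distinct_urls == 1].index.tolist()
print(same_url_labels)
```

Groups with a single distinct location are stronger candidates for being true duplicates, while groups with several locations may need manual inspection.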


In [18]:
df_publication_work_duplicates.sort_values('label').to_csv(path_or_buf='duplicatedpublications.csv')
df_publication_work_duplicates_av=df_publication_work[df_publication_work.duplicated('label', keep="last")].sort_values('label')
# Count rows with len(); count()[0] relies on positional Series indexing, which is deprecated.
pv = len(df_publication_work_duplicates)
print(f'\nThere are {pv} duplicated publication descriptions\n')


There are 4 duplicated publication descriptions



### Find duplicates in Training Materials
The MarketPlace API is used to download the descriptions of Training Materials.

In [19]:
df_tm_all = getMPDescriptions("https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/training-materials?page=", 8)
df_tm_all.index

RangeIndex(start=0, stop=140, step=1)

#### A quick look at data
The table below shows the first few descriptions.  
Only a subset of the properties is shown: *id, category, label, licenses, contributors, accessibleAt, sourceItemId*.  

In [20]:
df_tm_flat = pd.json_normalize(df_tm_all['trainingMaterials'])
#df_tm_flat.info()
df_tm_work=df_tm_flat[['id', 'category', 'label', 'licenses', 'contributors', 'accessibleAt', 'sourceItemId']]
df_tm_work.head()

| | id | category | label | licenses | contributors | accessibleAt | sourceItemId |
|---|---|---|---|---|---|---|---|
| 0 | 27999 | training-material | 2.1 Error rates and ground truth - Text Digiti... | [] | [] | [https://sites.google.com/site/textdigitisatio... | TNK9BG7F |
| 1 | 11515 | training-material | 3DHOP - How To | [] | [] | [http://vcg.isti.cnr.it/3dhop/howto.php] | 7R4HUMWW |
| 2 | 28434 | training-material | 3ds Max Tutorials: Introduction | [] | [] | [http://docs.autodesk.com/3DSMAX/16/ENU/3ds-Ma... | CNRNDTHT |
| 3 | 28014 | training-material | 8 Transcriptions of Speech - The TEI Guidelines | [] | [] | [http://www.tei-c.org/release/doc/tei-p5-doc/f... | MGAJZAUQ |
| 4 | 28515 | training-material | Agisoft PhotoScan. Tutorials, beginner level | [] | [] | [http://www.agisoft.com] | F4ZNGB66 |


#### Identify duplicates based on the values of the *label* property

In [22]:
df_tm_work_duplicates=df_tm_work[df_tm_work.duplicated('label', keep=False)].sort_values('label')
df_tm_work_duplicates[['id', 'label', 'accessibleAt']].sort_values('label').head(6)

| | id | label | accessibleAt |
|---|---|---|---|
| 25 | 28518 | ContextCapture tutorials | [https://www.acute3d.com/tutorials/] |
| 26 | 28521 | ContextCapture tutorials | [https://www.acute3d.com/tutorials/] |


In [23]:
df_tm_work_duplicates.sort_values('label').to_csv(path_or_buf='duplicatedtrainingmaterials.csv')
df_tm_work_duplicates_av=df_tm_work[df_tm_work.duplicated('label', keep="last")].sort_values('label')
# Count rows with len(); count()[0] relies on positional Series indexing, which is deprecated.
tmv = len(df_tm_work_duplicates)
print(f'\nThere are {tmv} duplicated training material descriptions\n')


There are 2 duplicated training material descriptions
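
The three sections above repeat the same filter, sort, and count steps. As a possible simplification, they could share one helper. The sketch below assumes a hypothetical function name, `label_duplicates`, and uses made-up data; it is not part of the notebook.

```python
import pandas as pd


def label_duplicates(df, column="label"):
    """Return every row whose `column` value occurs more than once, sorted by it."""
    mask = df.duplicated(column, keep=False)
    # A stable sort keeps rows with equal labels in their original order.
    return df[mask].sort_values(column, kind="stable")


# Toy data for illustration only.
toy = pd.DataFrame({"id": [3, 1, 2], "label": ["b", "a", "b"]})
dups = label_duplicates(toy)
print(f"There are {len(dups)} duplicated descriptions")
```

The same call could then be applied in turn to the tool, publication, and training material tables, followed by `to_csv` on the result.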

