# FORDATIS - Research Data Repository of the Fraunhofer-Gesellschaft (FhG)

<table border="1" align="left">
 <tbody>
    <tr>
      <td>Author: </td>
      <td style="text-align: left">Antje Schroeder (antje.schroeder@zv.fraunhofer.de)</td>
    </tr>
    <tr>
      <td>Last updated: </td>
      <td style="text-align: left">12/05/2023</td>
    </tr>
    <tr>
      <td>Status: </td>
      <td style="text-align: left">runnable, complete</td>
    </tr>
 </tbody>
</table>

<br><br><br><br><br>

This script queries all items from FORDATIS. Items are the datasets stored in FORDATIS. Two output files are created. The first contains only the UUIDs (unique identifiers) of the datasets and is used as the input file for further scripts. The second contains some rudimentary metadata for each dataset.

* Input file: none
* Output file: fordatis_item_uuids.csv
* Output file: fordatis_item_metadata.csv
* Data directory: ../data
* FORDATIS base URL for items: https://fordatis.fraunhofer.de/rest/items

This script uses the DSpace 6 RESTful API:
* https://fordatis.fraunhofer.de/apidocu.jsp
* https://wiki.lyrasis.org/display/DSDOC6x/REST+API

Notes on querying the FORDATIS API: a request usually returns 100 records (limit = 100). Since FORDATIS currently (05/12/2023) holds 110 records, the script has to move on to a second "page" once in order to fetch records 101 to 110. If more than 199 records exist, the final code blocks must be adjusted.
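The paging described above can also be automated so that no offset has to be adjusted by hand. The following is a minimal sketch and not the notebook's own method: `fetch_page` and `fetch_all_items` are hypothetical names, and the loop assumes that a page with fewer than `limit` records is the last one.

```python
import json
import urllib.request
from typing import Callable, Dict, List

BASE_URL = "https://fordatis.fraunhofer.de/rest/items"


def fetch_page(offset: int, limit: int) -> List[Dict]:
    """Fetch one page of items from the REST endpoint (performs a network request)."""
    url = f"{BASE_URL}?offset={offset}&limit={limit}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read())


def fetch_all_items(fetch: Callable[[int, int], List[Dict]] = fetch_page,
                    limit: int = 100) -> List[Dict]:
    """Collect all items by raising the offset until a short page is returned."""
    items: List[Dict] = []
    offset = 0
    while True:
        page = fetch(offset, limit)
        items.extend(page)
        # Assumption: a page with fewer than `limit` records is the last page.
        if len(page) < limit:
            break
        offset += limit
    return items
```

Passing the page-fetching function as a parameter keeps the loop testable without network access. With exactly 200 records the loop issues a third request that returns an empty page and then stops.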

In [1]:
# Import modules

import urllib.request
import json
import pandas as pd
import csv

In [2]:
#Define environment 

output_dir = "./data/"
output_uuids = output_dir + "fordatis_item_uuids.csv"
output_metadata = output_dir + "fordatis_item_metadata.csv"

base_url = "https://fordatis.fraunhofer.de/rest/"
items = "items?"

offset = 0
limit = 100

# Define column header for metadata output file
fields = ['uuid', 'handle', 'name', 'date last modified']

i = 0

In [3]:
# Create (or truncate) both output files; the output directory must already exist

with open(output_uuids, 'w', newline='', encoding="utf-8") as f:  # UUID file: no header row
    pass

with open(output_metadata, 'w', newline='', encoding="utf-8") as f:  # metadata file
    write = csv.writer(f)
    write.writerow(fields)  # write the header row

In [4]:
# Construct request url
get_item_urls = base_url + items + "offset=" + str(offset) + "&limit=" + str(limit)
# get_item_urls = base_url + items
print(get_item_urls)

https://fordatis.fraunhofer.de/rest/items?offset=0&limit=100
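As an alternative to string concatenation, the query string can be assembled with `urllib.parse.urlencode`, which also escapes parameter values if needed. This is a small sketch; `build_items_url` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import urlencode


def build_items_url(base_url: str, offset: int = 0, limit: int = 100) -> str:
    """Build the items request URL from the base URL and the paging parameters."""
    query = urlencode({"offset": offset, "limit": limit})
    # Strip a trailing slash so the path is joined with exactly one "/".
    return base_url.rstrip("/") + "/items?" + query


print(build_items_url("https://fordatis.fraunhofer.de/rest/"))
# prints: https://fordatis.fraunhofer.de/rest/items?offset=0&limit=100
```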


In [5]:
# Read data and create a pandas dataframe
dataset = urllib.request.urlopen(get_item_urls).read()
dataset = json.loads(dataset)
df = pd.DataFrame(dataset)
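The request above relies on the server's default response format and has no timeout. The sketch below shows one way to make both explicit. It assumes that the server honours an `Accept: application/json` header and that the response body is a JSON list of items; `build_request`, `parse_items`, and `fetch_items` are hypothetical helper names.

```python
import json
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    """Create a request that explicitly asks for JSON (assumption: header is honoured)."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def parse_items(raw: bytes) -> list:
    """Decode the response body and check that it is a list of items."""
    data = json.loads(raw.decode("utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of items")
    return data


def fetch_items(url: str, timeout: float = 30.0) -> list:
    """Send the request with a timeout and return the parsed item list."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return parse_items(response.read())
```

Keeping parsing separate from the network call means the validation step can be checked with a fixed byte string.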

In [6]:
# Get the number of records to iterate over
no_of_records = len(df)
print(no_of_records)

100


In [7]:
# Append each record from the parsed JSON list to both output files

with open(output_metadata, 'a', newline='', encoding="utf-8") as f_meta, \
     open(output_uuids, 'a', newline='', encoding="utf-8") as f_uuid:
    meta_writer = csv.writer(f_meta)
    uuid_writer = csv.writer(f_uuid)
    for record in dataset:
        meta_writer.writerow([record["uuid"], record["handle"], record["name"], record["lastModified"]])
        uuid_writer.writerow([record["uuid"]])

# Print out some information about records found
if no_of_records < 100:
    print("No of records is: ", no_of_records, "which is lower than 100. We do not have to iter through a next page.")
else:
    print("No of records is: ", no_of_records, "Caution: We do have to iter through a next page. Please change code!")

No of records is:  100 Caution: We do have to iter through a next page. Please change code!
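The file preparation and the per-record writes can also be combined into one function that opens each file only once. This is a sketch under the assumption that every record carries the keys `uuid`, `handle`, `name`, and `lastModified`, as in the loop above; `write_outputs` and `FIELDS` are hypothetical names.

```python
import csv
from typing import Dict, Iterable

FIELDS = ["uuid", "handle", "name", "date last modified"]


def write_outputs(records: Iterable[Dict], metadata_path: str, uuid_path: str) -> int:
    """Write the metadata file (with header) and the UUID file; return the record count."""
    count = 0
    with open(metadata_path, "w", newline="", encoding="utf-8") as f_meta, \
         open(uuid_path, "w", newline="", encoding="utf-8") as f_uuid:
        meta_writer = csv.writer(f_meta)
        uuid_writer = csv.writer(f_uuid)
        meta_writer.writerow(FIELDS)  # header row for the metadata file only
        for record in records:
            meta_writer.writerow([record["uuid"], record["handle"],
                                  record["name"], record["lastModified"]])
            uuid_writer.writerow([record["uuid"]])
            count += 1
    return count
```

Because the function opens both files in write mode, it should be called once with the records of all pages, not once per page.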


In [8]:
# Iterate through the next page
offset = 100
get_item_urls = base_url + items + "offset=" + str(offset) + "&limit=" + str(limit)

# Read data and create a pandas dataframe
dataset = urllib.request.urlopen(get_item_urls).read()
dataset = json.loads(dataset)
df = pd.DataFrame(dataset)

# Get the number of records to iterate over
no_of_records = len(df)

if no_of_records < 100:
    print("No of records is: ", no_of_records, "which is lower than 100. We do not have to iter through a next page.")
else:
    print("No of records is: ", no_of_records, "Caution: We do have to iter through a next page. Please change code!")

No of records is:  10 which is lower than 100. We do not have to iter through a next page.


In [9]:
# Second page: records 101 to 200. If 200 or more records are available, adjust the offset parameter and repeat.
# Append each record from the parsed JSON list to both output files

with open(output_metadata, 'a', newline='', encoding="utf-8") as f_meta, \
     open(output_uuids, 'a', newline='', encoding="utf-8") as f_uuid:
    meta_writer = csv.writer(f_meta)
    uuid_writer = csv.writer(f_uuid)
    for record in dataset:
        meta_writer.writerow([record["uuid"], record["handle"], record["name"], record["lastModified"]])
        uuid_writer.writerow([record["uuid"]])

                
print("all things done.")

all things done.
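Since the UUID file serves as input for further scripts, a downstream script might read it back as sketched here. `read_uuids` and `item_url` are hypothetical helpers. The per-item URL pattern `<base>/<uuid>` is an assumption based on the DSpace 6 REST API documentation and the items base URL given above.

```python
import csv
from typing import List

ITEMS_BASE_URL = "https://fordatis.fraunhofer.de/rest/items"


def read_uuids(path: str) -> List[str]:
    """Read the one-column UUID file (no header row) and skip empty lines."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f) if row]


def item_url(uuid: str, base_url: str = ITEMS_BASE_URL) -> str:
    """Build the URL of a single item (assumed pattern: <base>/<uuid>)."""
    return f"{base_url}/{uuid}"
```

A follow-up script could then loop over `read_uuids("./data/fordatis_item_uuids.csv")` and request each `item_url(uuid)` in turn.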
