# Extracting Data from Download URLs using the ExtractionWebService
You can reach the service directly from the intranet (for example, over a VPN connection to the DFKI network). Try something like: \
\
curl -s -H 'Content-Type: application/json; charset=UTF-8;' 'http://pc-4301.kl.dfki.de:8080/ExtractionWebService/aloeExtractionHandler/getMetadataFromURIAndExtractContent?uri=https%3A%2F%2Fris.kaiserslautern.de/oparl/bodies/0001/downloadfiles/00094421.pdf' \


In [1]:
! curl -s -H 'Content-Type: application/json; charset=UTF-8;' 'http://pc-4301.kl.dfki.de:8080/ExtractionWebService/aloeExtractionHandler/getMetadataFromURIAndExtractContent?uri=https%3A%2F%2Fris.kaiserslautern.de/oparl/bodies/0001/downloadfiles/00094421.pdf'

{"m_title":"Microsoft Word - Antrag_Modellkommune","m_creator":"User","m_resourceContent":"\n \nHerrn Oberbürgermeister \nDr. Klaus Weichel \nWilly-Brandt-Platz 1 \n67657 Kaiserslautern \n \n \nSehr geehrter Herr Weichel, \n \nfür die Stadtratssitzung am 12.06.2023 bitten wir Sie um Aufnahme des Tagesordnungspunktes: \n \nSicherheit und Schutz der Konsumenten stärken - Modellkommune zur Abgabe von \nCannabis werden \n \nBeschlussvorschlag: \nKaiserslautern bewirbt sich als Modellkommune für die geplanten Modellregionen zur \nkontrollierten und lizenzierten Abgabe von Cannabis. Dazu soll das Gesundheitsreferat \nentsprechende Vorbereitungen treffen und Informations- und Hilfeangebote im Bereich des \nKonsums von Cannabis entwickeln, die den Bedarfen und Erwartungen der Bürger*innen \nentsprechen. Wichtig ist es, Konsument*innen und vor allem auch Jugendliche bestmöglich zu \nschützen.  \n \n \nBegründung: \nDer von der Bundesregierung eingeschlagene Weg zur Legalisierung von Cannabis is

As you can see, the link to the PDF has to be URL-encoded. The service returns something like:


**{"m_title":"Microsoft Word - Antrag_Modellkommune","m_creator":"User","m_resourceContent":"\n \nHerrn Oberbürgermeister...\n\n","m_creationDate":"2023-06-04T17:18:32Z"}** \
\
The extracted text is therefore found in the "m_resourceContent" field.
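
The encoding step and the field lookup can be sketched in Python using only the standard library. This is a minimal sketch, not an official client for the service: `build_extraction_url` and `extract_content` are hypothetical helper names, the endpoint is copied from the curl call above, and the sample response is the shortened one shown above. It assumes the service also accepts a fully percent-encoded link.

```python
import json
from urllib.parse import quote

# Endpoint copied from the curl call above.
SERVICE_ENDPOINT = (
    "http://pc-4301.kl.dfki.de:8080/ExtractionWebService/"
    "aloeExtractionHandler/getMetadataFromURIAndExtractContent"
)


def build_extraction_url(document_url):
    """Return the service URL with document_url percent-encoded as the uri parameter."""
    # safe='' makes quote() encode ':' and '/' as well.
    return f"{SERVICE_ENDPOINT}?uri={quote(document_url, safe='')}"


def extract_content(response_text):
    """Return m_resourceContent from a JSON response body, or '' if the field is missing."""
    return json.loads(response_text).get("m_resourceContent", "")


pdf_url = "https://ris.kaiserslautern.de/oparl/bodies/0001/downloadfiles/00094421.pdf"
request_url = build_extraction_url(pdf_url)
print(request_url)

# Shortened sample response, as shown above.
sample_response = (
    '{"m_title":"Microsoft Word - Antrag_Modellkommune","m_creator":"User",'
    '"m_resourceContent":"\\n \\nHerrn Oberbürgermeister...\\n\\n",'
    '"m_creationDate":"2023-06-04T17:18:32Z"}'
)
print(repr(extract_content(sample_response)))
```

Keeping URL construction and response parsing in separate functions makes both testable without access to the intranet.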

To replace the download URLs in the "resultsProtocol" and "auxiliaryFile" fields with the extracted contents, we send one HTTP request to the extraction web service per download URL, using the requests library in Python. \
The code below stores the retrieved content in a new JSON file under the fields "results_protocol_content" and "auxiliaryfile_content", respectively. If content extraction fails for a URL, an empty string is stored in the corresponding content field.

In [None]:
import json
import requests

# Load the JSON file containing the data
data_path = '/Users/ameerkhan/Downloads/Oparl_Files/finetuned_meeting_location_organisation_participant_files.json'
with open(data_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

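# Function to get content from the extraction web service
def get_content_from_url(url):
    # Percent-encode the whole download URL so it can be passed as the "uri" parameter
    encoded_url = requests.utils.quote(url, safe='')
    api_url = f'http://pc-4301.kl.dfki.de:8080/ExtractionWebService/aloeExtractionHandler/getMetadataFromURIAndExtractContent?uri={encoded_url}'
    try:
        # The timeout (60 s is an assumed value) keeps one stalled request from blocking the run
        response = requests.get(api_url, timeout=60)
        response.raise_for_status()
        return response.json().get('m_resourceContent', '')
    except (requests.RequestException, ValueError) as error:
        # Network errors, HTTP error statuses and invalid JSON all yield an empty string
        print(f"Failed to retrieve content for URL: {url} ({error})")
        return ''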

# Function to modify the "auxiliaryFile" and "resultsProtocol" fields
def modify_files(json_data):
    for item in json_data:
        # Modify "auxiliaryFile" field
        auxiliary_files = item.get('auxiliaryFile', [])
        item['auxiliaryFile'] = []
        for aux_file in auxiliary_files:
            content = get_content_from_url(aux_file['auxiliaryfile_downloadurl'])
            item['auxiliaryFile'].append({
                'auxiliaryfile_id': aux_file['auxiliaryfile_id'],
                'auxiliaryfile_name': aux_file['auxiliaryfile_name'],
                'auxiliaryfile_content': content
            })

        # Modify "resultsProtocol" field
        results_protocol = item.get('resultsProtocol', {})
        if results_protocol:
            content = get_content_from_url(results_protocol['results_protocol_downloadurl'])
            item['resultsProtocol'] = {
                'results_protocol_id': results_protocol['results_protocol_id'],
                'results_protocol_name': results_protocol['results_protocol_name'],
                'results_protocol_content': content
            }
    return json_data

# Call the function to modify the "auxiliaryFile" and "resultsProtocol" fields
modified_data = modify_files(data)

# Specify the path for the new JSON file
new_file_path = '/Users/ameerkhan/Downloads/Oparl_Files/finetuned_meeting_location_organisation_participant_files_downloaded.json'

# Save the modified JSON data to the new file
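with open(new_file_path, 'w', encoding='utf-8') as new_file:
    # ensure_ascii=False keeps umlauts readable instead of writing \uXXXX escapes
    json.dump(modified_data, new_file, indent=2, ensure_ascii=False)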

print("New JSON file with modified 'auxiliaryFile' and 'resultsProtocol' fields created.")
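
Because the service is only reachable from the intranet, the restructuring logic can be checked offline by swapping the HTTP call for a stand-in. The following is a minimal sketch under the record layout used above: `replace_urls_with_content` and `fake_fetch` are hypothetical names, and the sample record (ids, names, example.org URLs) is invented purely for illustration.

```python
def replace_urls_with_content(records, fetch):
    """Replace download URLs with extracted content; fetch(url) must return a string."""
    for item in records:
        # Rebuild each auxiliary file entry without its download URL.
        item["auxiliaryFile"] = [
            {
                "auxiliaryfile_id": aux["auxiliaryfile_id"],
                "auxiliaryfile_name": aux["auxiliaryfile_name"],
                "auxiliaryfile_content": fetch(aux["auxiliaryfile_downloadurl"]),
            }
            for aux in item.get("auxiliaryFile", [])
        ]
        # Only rewrite the results protocol when one is present.
        protocol = item.get("resultsProtocol")
        if protocol:
            item["resultsProtocol"] = {
                "results_protocol_id": protocol["results_protocol_id"],
                "results_protocol_name": protocol["results_protocol_name"],
                "results_protocol_content": fetch(protocol["results_protocol_downloadurl"]),
            }
    return records


def fake_fetch(url):
    # Stand-in for the web service call; no network access needed.
    return f"extracted text of {url}"


# Invented sample record, for illustration only.
sample_records = [
    {
        "auxiliaryFile": [
            {
                "auxiliaryfile_id": "aux-1",
                "auxiliaryfile_name": "example.pdf",
                "auxiliaryfile_downloadurl": "https://example.org/example.pdf",
            }
        ],
        "resultsProtocol": {
            "results_protocol_id": "rp-1",
            "results_protocol_name": "protocol.pdf",
            "results_protocol_downloadurl": "https://example.org/protocol.pdf",
        },
    }
]

result = replace_urls_with_content(sample_records, fake_fetch)
print(result[0]["auxiliaryFile"][0]["auxiliaryfile_content"])
print(result[0]["resultsProtocol"]["results_protocol_content"])
```

Passing the fetch function as a parameter is a design choice of this sketch; the real extraction function can be passed in its place once the service is reachable.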
