# Pre-processing Steps for OParl Data

Pre-processing extracts the relevant data from the OParl API endpoints and stores it in Elasticsearch for visualization. The steps are:

1. **Extracting Data from the API to JSON Files**:
   <font size="2">The first step is to query the OParl API endpoints and retrieve the raw data, which the API returns as JSON. This can be done with plain HTTP requests or with a client library in the programming language of your choice. The data from each relevant endpoint is fetched and saved as a separate JSON file.</font>

2. **Identifying and Merging Relevant Fields**:
   <font size="2">The retrieved JSON data may contain more information than needed for visualization. In this step, you'll need to identify the relevant fields required for your visualization project. These fields could include information about meetings, agenda items, participants, organizations, etc. Once identified, you can merge these fields into a single JSON file or data structure for easier processing.</font>

3. **Extracting Data from Download URLs using the ExtractionWebService**:
   <font size="2">OParl data may contain references to external resources, such as documents, images, or additional data files. These resources are often linked via download URLs. To incorporate this data into your Elasticsearch index, you'll need to follow the download URLs and retrieve the relevant data using the ExtractionWebService if available. This step ensures that all necessary data is collected and linked together.</font>

4. **Converting Data into `_bulk` API Format**:
   <font size="2">Elasticsearch offers the `_bulk` API, which performs many index operations in a single request. In this step, you'll convert the collected and processed data into the `_bulk` format: newline-delimited JSON in which each document is preceded by an action line. Sending many entries per request reduces the overhead of individual requests and speeds up indexing.</font>

After these steps, the OParl data is pre-processed and indexed in Elasticsearch, ready for visualization and exploration.
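
Steps 2 and 4 can be sketched in a few lines. This is a minimal sketch under assumptions: the index name `oparl-meetings`, the selected field names (`id`, `name`, `start`), the helper names, and the sample records are all hypothetical and not taken from the actual pipeline. It keeps only the chosen fields and emits the action/source line pairs that the `_bulk` endpoint expects.

```python
import json


def select_fields(record, fields):
    """Keep only the listed keys that are present in the record."""
    return {key: record[key] for key in fields if key in record}


def to_bulk_ndjson(records, index_name, id_field="id"):
    """Build a _bulk request body: one action line plus one source line per record."""
    lines = []
    for record in records:
        action = {"index": {"_index": index_name}}
        if id_field in record:
            # Reuse the record's own id as the Elasticsearch document id
            action["index"]["_id"] = record[id_field]
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(record, ensure_ascii=False))
    # The _bulk body must end with a newline
    return "\n".join(lines) + "\n"


# Hypothetical sample records, for illustration only
meetings = [
    {"id": "m1", "name": "Stadtrat", "start": "2023-01-16T15:00:00+01:00", "extra": "x"},
    {"id": "m2", "name": "Bauausschuss", "extra": "y"},
]
slim = [select_fields(m, ["id", "name", "start"]) for m in meetings]
bulk_body = to_bulk_ndjson(slim, "oparl-meetings")
print(bulk_body)
```

The resulting string can be sent as the body of a `_bulk` request. Using the record's `id` as `_id` is a design choice that makes re-running the import overwrite documents instead of duplicating them.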


# 1. Extracting Data from the API to JSON Files

## Available endpoints of body 0001

- Person: `https://ris.kaiserslautern.de/oparl/bodies/0001/people`
- Meeting: `https://ris.kaiserslautern.de/oparl/bodies/0001/meetings`
- Paper: `https://ris.kaiserslautern.de/oparl/bodies/0001/papers`
- Membership: `https://ris.kaiserslautern.de/oparl/bodies/0001/memberships`
- Location: `https://ris.kaiserslautern.de/oparl/bodies/0001/locations`
- AgendaItem: `https://ris.kaiserslautern.de/oparl/bodies/0001/agendaitems`
- Organization: `https://ris.kaiserslautern.de/oparl/bodies/0001/organizations`
- Consultation: `https://ris.kaiserslautern.de/oparl/bodies/0001/consultations`
- File: `https://ris.kaiserslautern.de/oparl/bodies/0001/files`
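
All of these URLs follow one pattern: the base URL of body `0001` plus an endpoint name. A small sketch of that mapping could look like the following. The names `BODY_URL`, `ENDPOINTS`, and `endpoint_url` are hypothetical helpers for illustration and are not part of the notebook.

```python
BODY_URL = "https://ris.kaiserslautern.de/oparl/bodies/0001"

# Endpoint path segments exactly as they appear in the URLs listed above
ENDPOINTS = [
    "people", "meetings", "papers", "memberships", "locations",
    "agendaitems", "organizations", "consultations", "files",
]


def endpoint_url(name, body_url=BODY_URL):
    """Return the list URL for one endpoint of the body."""
    if name not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {name}")
    return f"{body_url}/{name}"


for name in ENDPOINTS:
    print(name, endpoint_url(name))
```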


In [1]:
import requests
from pprint import pprint
import pandas as pd
import json
from elasticsearch import Elasticsearch

In [None]:
# Fetch people data from all pages and store it in a JSON file

people_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/people'
data = []
page_counter = 0

while people_url:
    response = requests.get(people_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    people_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/people_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'people_data.json'")

In [None]:

# Fetch meetings data from all pages and store it in a JSON file

meetings_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/meetings'
data = []
page_counter = 0

while meetings_url:
    response = requests.get(meetings_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    meetings_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/meetings_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'meetings_data.json'")

In [None]:
# Fetch papers data from all pages and store it in a JSON file

papers_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/papers'
data = []
page_counter = 0

while papers_url:
    response = requests.get(papers_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    papers_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/papers_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'papers_data.json'")

In [None]:
# Fetch memberships data from all pages and store it in a JSON file

memberships_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/memberships'
data = []
page_counter = 0

while memberships_url:
    response = requests.get(memberships_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    memberships_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/memberships_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'memberships_data.json'")

In [None]:
# Fetch locations data from all pages and store it in a JSON file

locations_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/locations'
data = []
page_counter = 0

while locations_url:
    response = requests.get(locations_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    locations_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/locations_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'locations_data.json'")

In [None]:
# Fetch agenda items data from all pages and store it in a JSON file

agendaitems_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/agendaitems'
data = []
page_counter = 0

while agendaitems_url:
    response = requests.get(agendaitems_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    agendaitems_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/agendaitems_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'agendaitems_data.json'")

In [None]:
# Fetch organizations data from all pages and store it in a JSON file

organizations_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/organizations'
data = []
page_counter = 0

while organizations_url:
    response = requests.get(organizations_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    organizations_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/organizations_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'organizations_data.json'")

In [None]:
# Fetch files data from all pages and store it in a JSON file

files_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/files'
data = []
page_counter = 0

while files_url:
    response = requests.get(files_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    files_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/files_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'files_data.json'")

In [None]:
# Fetch consultations data from all pages and store it in a JSON file

consultations_url = 'https://ris.kaiserslautern.de/oparl/bodies/0001/consultations'
data = []
page_counter = 0

while consultations_url:
    response = requests.get(consultations_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_data = response.json()
    data.extend(response_data['data'])
    page_counter += 1

    # Follow the 'next' link if there is one; otherwise the loop ends
    consultations_url = response_data.get('links', {}).get('next')

# Store the data in a JSON file (UTF-8, so umlauts stay readable)
with open('/Users/ameerkhan/Downloads/Oparl_Files/consultations_data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)

print(f"Data from {page_counter} pages is stored in 'consultations_data.json'")
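
The nine cells above differ only in the endpoint name, so the pagination loop could be factored into one helper. The following is a sketch, not the notebook's actual code: `fetch_all_pages` and its `get_json` parameter are hypothetical names, and the fake pages at the bottom stand in for real API responses so that the example runs without network access.

```python
def fetch_all_pages(url, get_json):
    """Follow 'next' links page by page and collect every item.

    get_json is any callable that takes a URL and returns the decoded
    JSON page as a dict, for example:
        lambda u: requests.get(u, timeout=30).json()
    """
    items = []
    pages = 0
    while url:
        page = get_json(url)
        items.extend(page.get("data", []))
        pages += 1
        # A missing 'links' object or 'next' entry ends the loop
        url = page.get("links", {}).get("next")
    return items, pages


# Fake two-page response, used only for demonstration
fake_pages = {
    "page1": {"data": [{"id": 1}, {"id": 2}], "links": {"next": "page2"}},
    "page2": {"data": [{"id": 3}], "links": {}},
}
items, pages = fetch_all_pages("page1", fake_pages.__getitem__)
print(f"Fetched {len(items)} items from {pages} pages")
```

Passing the fetch function in as a parameter keeps the pagination logic testable offline; with real data, the same helper could be called once per endpoint URL and its result written with `json.dump`.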