# Open Legal Data

Open Legal Data is an open data project that aims to make legal data more available to the public. It tackles the fact that most of the information produced by courts in Germany, and in many other countries, is either inaccessible or not available in a structured format.

The project offers an [API](https://de.openlegaldata.io/) through which users can retrieve data on many court decisions in Germany. It's also worth checking the [GitHub](https://github.com/openlegaldata) page and the [project's website](http://openlegaldata.io/).


# Goal of This Notebook

In this notebook, we'll explore the possibilities of the Open Legal Data API. We'll retrieve data from multiple courts and cases, clean it and prepare it for analysis.

# Considerations On The Data

While the Open Legal Data project tries to bring accurate information to light, it also has to obey data protection rules. For this reason, instead of scraping data on cases and courts from the web, it relies heavily on the cooperation of courts to gather information.

On the one hand, this is useful because it protects the privacy and data protection rights of the parties involved in the cases (names are anonymised, for example). On the other hand, the data loses some of its relation to reality, because the sample is biased to some degree.

Given this, we can't assure the accuracy of data obtained from the API.

## Importing Libraries

In [1]:
import json
import requests
import pandas as pd
import altair as alt
import googlemaps

In [2]:
# This cell reads the API key
# For many usages of the Open Legal Data API a key isn't necessary though.

with open(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\api_key.txt") as file:
    key = file.read()
    
headers = {"Authorization":key} # The headers argument will be passed to the get method of the requests library
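One small detail when reading a key from a text file: `file.read()` returns the content verbatim, including any trailing newline, which would then end up inside the header value. The following is a minimal sketch of a helper that strips surrounding whitespace. `read_api_key` is a hypothetical name, and the temporary file only stands in for the real key file.

```python
from pathlib import Path
import tempfile


def read_api_key(path):
    """Return the key stored in a text file, without surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


# Demonstration with a throwaway file standing in for api_key.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    key_file = Path(tmp_dir) / "api_key.txt"
    key_file.write_text("my-secret-key\n", encoding="utf-8")
    demo_key = read_api_key(key_file)

demo_headers = {"Authorization": demo_key}
print(demo_headers)
```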

## Function To Make Requests

To retrieve information from the API, we need a way to access it. The function below calls an API endpoint and returns the results as a list of Python dictionaries.

In [3]:
def request_and_read(endpoint, page_size, page):
    """Requests one page from an API endpoint and returns its list of results."""
    response = requests.get(str(endpoint),  # Converts endpoint to string
                            headers=headers,
                            params={"page_size": page_size,
                                    "page": page})  # Calls the get method and passes arguments
    content = json.loads(response.content)  # Parses the JSON body of the response
    results = content["results"]  # Accesses key containing the results
    return results

## Instantiate a Dictionary To Receive Requests

The API uses pagination. Even though it's possible to set the page size to a very large number, we'll request multiple pages and store each page under a different key of the dictionary.

In [4]:
# The API uses pagination.
case_pages = dict.fromkeys(range(1,51),[])
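A note on `dict.fromkeys` with a mutable default: every key points to the same list object. That is harmless in this notebook, because each key is overwritten by assignment later on, but appending in place would change all keys at once. A small illustration with made-up variable names:

```python
# All keys share ONE list object
shared = dict.fromkeys(range(1, 4), [])
shared[1].append("x")  # mutates the list that every key refers to

# A dict comprehension creates a separate list per key
independent = {page: [] for page in range(1, 4)}
independent[1].append("x")  # only page 1 changes

print(shared)
print(independent)
```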

## Make Requests

With the request_and_read function, we can make the requests and store the results under the keys of the dictionary.

In [5]:
# This cell might take some time to run
page_number = 1
for key in case_pages:
    case_pages[key] = request_and_read(r"https://de.openlegaldata.io/api/cases/",1000,page_number)
    page_number+=1
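The loop above assumes that all 50 pages exist. As a rough sketch of a slightly more defensive variant, the helper below stops as soon as a page comes back empty. `collect_pages` and `fake_fetch` are hypothetical names; `fake_fetch` merely imitates `request_and_read` so that the example runs without network access. Whether the real API returns an empty list for an out-of-range page is an assumption that has not been checked here.

```python
def collect_pages(fetch, endpoint, page_size, max_pages):
    """Fetch pages 1..max_pages, stopping early at the first empty page."""
    pages = {}
    for page in range(1, max_pages + 1):
        results = fetch(endpoint, page_size, page)
        if not results:
            break
        pages[page] = results
    return pages


def fake_fetch(endpoint, page_size, page):
    """Stand-in for request_and_read: pretends the API holds 5 records."""
    records = [{"id": i} for i in range(1, 6)]
    start = (page - 1) * page_size
    return records[start:start + page_size]


demo_pages = collect_pages(fake_fetch, "https://example.invalid/api/cases/", 2, 10)
print(demo_pages)
```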

# Organize Requests

At the moment, all the data is stored in a dictionary called "case_pages". Each of the 50 keys in the dictionary holds information on 1,000 cases. A simple way to picture this is that we took data on 50,000 cases through the API and organized them across 50 books.

In a more advanced project, it would be possible to use sampling techniques to make the data more representative; however, this is out of the scope of this project. So we'll stick to the first 50,000 cases that the API gives back.

Every case stored under a key of the dictionary (every entry in a book) has the following keys:

In [6]:
# Accessing the key 1 of the case_pages dictionary
# Then accessing the first element of the list of dictionaries contained in this key
for key in case_pages[1][0]:
    print(key)

id
slug
court
file_number
date
created_date
updated_date
type
ecli
content


By taking a look at the dictionary keys, we can see that most of them are related to the case, but the "court" key is related to the court.

For this reason, it's better to organize the data in two different datasets, in order to analyse it properly.
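The split can be pictured on a single record. The snippet below is only a sketch: `split_record` is a hypothetical helper, and the record is a shortened, illustrative version of one API result with the nested "court" dictionary.

```python
# Illustrative record with a nested "court" dictionary (shortened)
record = {
    "id": 328393,
    "slug": "bgh-2020-05-07-ix-zb-5619",
    "court": {"id": 4, "name": "Bundesgerichtshof", "slug": "bgh"},
    "file_number": "IX ZB 56/19",
    "date": "2020-05-07",
    "type": "Beschluss",
}


def split_record(rec):
    """Separate the case fields from the nested court fields."""
    case_part = {k: v for k, v in rec.items() if k != "court"}
    court_part = dict(rec.get("court") or {})
    return case_part, court_part


case_part, court_part = split_record(record)
print(case_part)
print(court_part)
```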

# Case Data

Below, we create a function that extracts the case data.

In [7]:
def get_case_data(a_list):
    # Instantiates a dictionary with one empty list per case field
    case_info = {"id":[],"slug":[], "file_number":[],"date":[],"created_date":[],
            "updated_date":[],"type":[],"ecli":[],"content":[]}
    # Loops through the elements of the list
    # Appends values to the main dictionary case_info
    for element in a_list:
        case_info["id"].append(element["id"])
        case_info["slug"].append(element["slug"])
        case_info["file_number"].append(element["file_number"])
        case_info["date"].append(element["date"])
        case_info["created_date"].append(element["created_date"])
        case_info["updated_date"].append(element["updated_date"])
        case_info["type"].append(element["type"])
        case_info["ecli"].append(element["ecli"])
        case_info["content"].append(element["content"])
        
    return case_info #A dictionary containing, in each key, a list of the objects found in case_pages
        

A for loop applies the function above to every page.

In [8]:
case_info = dict.fromkeys(range(1,51),[]) # Instantiates a dictionary to receive the info on the cases.

for key in case_info: # Loops through the pages in case_pages and unpacks the values from each key.
    case_info[key] = get_case_data(case_pages[key])
    

Now we have a dictionary containing only the information on the cases. Its keys can be seen below:

In [9]:
case_info[1].keys()

dict_keys(['id', 'slug', 'file_number', 'date', 'created_date', 'updated_date', 'type', 'ecli', 'content'])

We can no longer see the court key, which is good since we decided to split the data into two datasets.

To manipulate the data more efficiently, we can transform it into a pandas DataFrame.

In [10]:
for key in case_info: # Loops through the keys of case_info and transforms every value into a DataFrame
    case_info[key] = pd.DataFrame(case_info[key])

Since every key contains the same data, we can simply concatenate the DataFrames.

In [11]:
cases = pd.concat(case_info) # Vertical concatenation of DataFrames
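Passing a dictionary to `pd.concat` uses its keys as an outer index level, so the result is indexed by (page, row) pairs. A tiny sketch with made-up frames shows the difference from a flat index obtained with `ignore_index=True`:

```python
import pandas as pd

# Two tiny stand-in pages
frames = {
    1: pd.DataFrame({"id": [10, 11]}),
    2: pd.DataFrame({"id": [12, 13]}),
}

stacked = pd.concat(frames)  # dictionary keys become the outer index level
flat = pd.concat(list(frames.values()), ignore_index=True)  # plain 0..n-1 index

print(stacked.index.tolist())
print(flat.index.tolist())
```

The two index levels are what later show up as two unnamed columns when such a frame is written to CSV and read back.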

Checking results:

In [12]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 50000 entries, (1, 0) to (50, 999)
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            50000 non-null  int64 
 1   slug          50000 non-null  object
 2   file_number   50000 non-null  object
 3   date          50000 non-null  object
 4   created_date  50000 non-null  object
 5   updated_date  50000 non-null  object
 6   type          50000 non-null  object
 7   ecli          50000 non-null  object
 8   content       50000 non-null  object
dtypes: int64(1), object(8)
memory usage: 3.6+ MB


Displaying DataFrame

In [13]:
cases.head() 

Unnamed: 0,Unnamed: 1,id,slug,file_number,date,created_date,updated_date,type,ecli,content
1,0,328393,bgh-2020-05-07-ix-zb-5619,IX ZB 56/19,2020-05-07,2020-05-29T10:00:15Z,2020-05-29T10:07:14Z,Beschluss,ECLI:DE:BGH:2020:070520BIXZB56.19.0,"<h2>Tenor</h2>\n\n<div>\n <dl class=""R..."
1,1,328192,bverwg-2020-04-22-2-b-5219,2 B 52/19,2020-04-22,2020-05-21T10:00:05Z,2020-05-21T10:06:18Z,Beschluss,ECLI:DE:BVerwG:2020:220420B2B52.19.0,"<h2>Tenor</h2>\n\n<div>\n <dl class=""R..."
1,2,328242,bgh-2020-04-21-ii-zr-5618,II ZR 56/18,2020-04-21,2020-05-23T10:00:15Z,2020-05-23T10:07:16Z,Urteil,ECLI:DE:BGH:2020:210420UIIZR56.18.0,"<h2>Tenor</h2>\n\n<div>\n <dl class=""R..."
1,3,327286,bverfg-2020-03-25-2-bvr-11320,2 BvR 113/20,2020-03-25,2020-04-17T10:00:22Z,2020-04-17T10:06:52Z,Nichtannahmebeschluss,ECLI:DE:BVerfG:2020:rk20200325.2bvr011320,"<h2>Tenor</h2>\n\n<div>\n <dl class=""R..."
1,4,327121,bverfg-2020-03-18-1-bvr-33720,1 BvR 337/20,2020-03-18,2020-04-09T10:00:18Z,2020-04-09T10:08:59Z,Nichtannahmebeschluss,ECLI:DE:BVerfG:2020:rk20200318.1bvr033720,"<h2>Tenor</h2>\n\n<div>\n <dl class=""R..."


# Court Data

Now it's time to retrieve the court data. This will be stored in a different dictionary. We'll repeat the procedure used for the case data.

In [14]:
def get_court_data(a_list): # Similar to get_case_data, but accesses the court key of each element in the list
    
    court_info = {"id":[],"name":[],"slug":[],"city":[],"state":[],"jurisdiction":[],"level_of_appeal":[]}
    
    for element in a_list:
        court_info["id"].append(element["court"].get("id"))
        court_info["name"].append(element["court"].get("name"))
        court_info["slug"].append(element["court"].get("slug"))
        court_info["city"].append(element["court"].get("city"))
        court_info["state"].append(element["court"].get("state"))
        court_info["jurisdiction"].append(element["court"].get("jurisdiction"))
        court_info["level_of_appeal"].append(element["court"].get("level_of_appeal"))
        
    return court_info

In [15]:
court_info = dict.fromkeys(range(1,51),[]) # One key per page, matching the 50 keys of case_pages

for key in case_pages:
    court_info[key] = get_court_data(case_pages[key])
    

In [16]:
for key in court_info:
    court_info[key] = pd.DataFrame(court_info[key])

In [17]:
courts = pd.concat(court_info)

In [18]:
courts.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 50000 entries, (1, 0) to (50, 999)
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               50000 non-null  int64  
 1   name             50000 non-null  object 
 2   slug             50000 non-null  object 
 3   city             16688 non-null  float64
 4   state            50000 non-null  int64  
 5   jurisdiction     31938 non-null  object 
 6   level_of_appeal  26045 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 2.8+ MB


In [19]:
courts.head()

Unnamed: 0,Unnamed: 1,id,name,slug,city,state,jurisdiction,level_of_appeal
1,0,4,Bundesgerichtshof,bgh,,2,,Bundesgericht
1,1,5,Bundesverwaltungsgericht,bverwg,,2,Verwaltungsgerichtsbarkeit,Bundesgericht
1,2,4,Bundesgerichtshof,bgh,,2,,Bundesgericht
1,3,3,Bundesverfassungsgericht,bverfg,,2,Verfassungsgerichtsbarkeit,Bundesgericht
1,4,3,Bundesverfassungsgericht,bverfg,,2,Verfassungsgerichtsbarkeit,Bundesgericht


Finally, all the data is stored in two separate DataFrames called cases and courts. This will make it easier to manipulate them during data cleaning.

However, before we do any further data manipulation, we should save the "raw" files. At this point it is also advisable to save one of the columns to a separate file. The column in question is called "content" and is located in the cases DataFrame. It contains the full text of the court's decision.

This data is very interesting for natural language processing projects, but since this is out of the scope of this project, we'll leave it aside for now. 

In [20]:
content = cases[["id","content"]] # We keep the id to enable merging the DataFrames together in the future
content.to_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\content.csv")

In [21]:
courts.to_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\courts.csv")

In [22]:
cases = cases.drop("content",axis = 1)
cases.to_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\cases.csv")

In [23]:
cases = pd.read_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\cases.csv")
cases = cases.drop(["Unnamed: 0","Unnamed: 1"],axis = "columns")
cases

Unnamed: 0,id,slug,file_number,date,created_date,updated_date,type,ecli
0,328393,bgh-2020-05-07-ix-zb-5619,IX ZB 56/19,2020-05-07,2020-05-29T10:00:15Z,2020-05-29T10:07:14Z,Beschluss,ECLI:DE:BGH:2020:070520BIXZB56.19.0
1,328192,bverwg-2020-04-22-2-b-5219,2 B 52/19,2020-04-22,2020-05-21T10:00:05Z,2020-05-21T10:06:18Z,Beschluss,ECLI:DE:BVerwG:2020:220420B2B52.19.0
2,328242,bgh-2020-04-21-ii-zr-5618,II ZR 56/18,2020-04-21,2020-05-23T10:00:15Z,2020-05-23T10:07:16Z,Urteil,ECLI:DE:BGH:2020:210420UIIZR56.18.0
3,327286,bverfg-2020-03-25-2-bvr-11320,2 BvR 113/20,2020-03-25,2020-04-17T10:00:22Z,2020-04-17T10:06:52Z,Nichtannahmebeschluss,ECLI:DE:BVerfG:2020:rk20200325.2bvr011320
4,327121,bverfg-2020-03-18-1-bvr-33720,1 BvR 337/20,2020-03-18,2020-04-09T10:00:18Z,2020-04-09T10:08:59Z,Nichtannahmebeschluss,ECLI:DE:BVerfG:2020:rk20200318.1bvr033720
...,...,...,...,...,...,...,...,...
49995,84716,bverwg-2014-12-04-4-cn-713,4 CN 7/13,2014-12-04,2018-11-11T14:30:04Z,2020-05-06T07:18:54Z,Urteil,ECLI:DE:BVerwG:2014:041214U4CN7.13.0
49996,84718,bverwg-2014-12-04-9-b-7514,9 B 75/14,2014-12-04,2018-11-11T14:30:04Z,2020-05-06T07:19:04Z,Beschluss,ECLI:DE:BVerwG:2014:041214B9B75.14.0
49997,84729,bverwg-2014-12-04-4-c-3313,4 C 33/13,2014-12-04,2018-11-11T14:30:05Z,2020-05-06T07:22:07Z,Urteil,ECLI:DE:BVerwG:2014:041214U4C33.13.0
49998,84733,bverwg-2014-12-04-8-b-6614,8 B 66/14,2014-12-04,2018-11-11T14:30:05Z,2020-05-06T07:22:24Z,Beschluss,ECLI:DE:BVerwG:2014:041214B8B66.14.0


In [24]:
courts = pd.read_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\courts.csv")
courts = courts.drop(["Unnamed: 0","Unnamed: 1"], axis = "columns")
courts

Unnamed: 0,id,name,slug,city,state,jurisdiction,level_of_appeal
0,4,Bundesgerichtshof,bgh,,2,,Bundesgericht
1,5,Bundesverwaltungsgericht,bverwg,,2,Verwaltungsgerichtsbarkeit,Bundesgericht
2,4,Bundesgerichtshof,bgh,,2,,Bundesgericht
3,3,Bundesverfassungsgericht,bverfg,,2,Verfassungsgerichtsbarkeit,Bundesgericht
4,3,Bundesverfassungsgericht,bverfg,,2,Verfassungsgerichtsbarkeit,Bundesgericht
...,...,...,...,...,...,...,...
49995,5,Bundesverwaltungsgericht,bverwg,,2,Verwaltungsgerichtsbarkeit,Bundesgericht
49996,5,Bundesverwaltungsgericht,bverwg,,2,Verwaltungsgerichtsbarkeit,Bundesgericht
49997,5,Bundesverwaltungsgericht,bverwg,,2,Verwaltungsgerichtsbarkeit,Bundesgericht
49998,5,Bundesverwaltungsgericht,bverwg,,2,Verwaltungsgerichtsbarkeit,Bundesgericht


# Cleaning Data

The cases DataFrame is complete. It contains no missing values, and whatever data cleaning it requires depends on the analysis that will be done. For now, we'll therefore focus the cleaning on the courts DataFrame.

## The States Column

The state column contains data about the Bundesländer where the courts are located. More precisely, it describes the geographical extent of their jurisdiction. This is why the code 2, as we'll see below, stands for "Bundesrepublik Deutschland" and is attributed to the Bundesverfassungsgericht and the Bundesgerichtshof, even though they're located in Baden-Württemberg (code 3).

In [25]:
courts.head()

Unnamed: 0,id,name,slug,city,state,jurisdiction,level_of_appeal
0,4,Bundesgerichtshof,bgh,,2,,Bundesgericht
1,5,Bundesverwaltungsgericht,bverwg,,2,Verwaltungsgerichtsbarkeit,Bundesgericht
2,4,Bundesgerichtshof,bgh,,2,,Bundesgericht
3,3,Bundesverfassungsgericht,bverfg,,2,Verfassungsgerichtsbarkeit,Bundesgericht
4,3,Bundesverfassungsgericht,bverfg,,2,Verfassungsgerichtsbarkeit,Bundesgericht


Luckily, the Open Legal Data API has another endpoint that gives the meaning of each number in the state column.

In [26]:
test = requests.get("https://de.openlegaldata.io/api/states/")
test_content = json.loads(test.content)
for d in test_content["results"]:
    print(d["id"],d["name"])

3 Baden-Württemberg
4 Bayern
5 Berlin
6 Brandenburg
7 Bremen
2 Bundesrepublik Deutschland
19 Europäische Union
8 Hamburg
9 Hessen
10 Mecklenburg-Vorpommern
11 Niedersachsen
12 Nordrhein-Westfalen
13 Rheinland-Pfalz
14 Saarland
15 Sachsen
16 Sachsen-Anhalt
17 Schleswig-Holstein
18 Thüringen
1 Unknown state


We can compare those to the unique values in the state column.

In [27]:
courts["state"].unique()

array([ 2, 11,  3, 12, 10, 13,  4, 19, 17, 14,  8, 16, 15], dtype=int64)

We can now put more readable values into the state column by using the replace() method from pandas.

In [28]:
courts["state"] = courts["state"].astype(str)

In [29]:
mapper = {"3":"Baden-Württemberg",
         "4":"Bayern",
         "5":"Berlin",
         "6":"Brandenburg",
         "7":"Bremen",
         "2":"Bundesrepublik Deutschland",
         "19":"Europäische Union",
         "8":"Hamburg",
         "9":"Hessen",
         "10":"Mecklenburg-Vorpommern",
         "11":"Niedersachsen",
         "12":"Nordrhein-Westfalen",
         "13": "Rheinland-Pfalz",
         "14":"Saarland",
         "15":"Sachsen",
         "16":"Sachsen-Anhalt",
         "17":"Schleswig-Holstein",
         "18":"Thüringen"}

courts["state"] = courts["state"].replace(mapper)
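Instead of typing the mapping by hand, it could also be derived from the entries returned by the states endpoint. The sketch below assumes that each entry has at least the keys "id" and "name", as printed above; `state_results` is a hand-written subset used for illustration, and `observed_codes` is a made-up list of codes. It also shows a quick check for codes that have no name in the mapping.

```python
# Hand-written subset of the entries printed from the states endpoint
state_results = [
    {"id": 3, "name": "Baden-Württemberg"},
    {"id": 2, "name": "Bundesrepublik Deutschland"},
    {"id": 12, "name": "Nordrhein-Westfalen"},
    {"id": 1, "name": "Unknown state"},
]

# Keys are converted to strings to match a state column cast with astype(str)
state_names = {str(d["id"]): d["name"] for d in state_results}

# Made-up codes standing in for the unique values of the state column
observed_codes = ["2", "12", "3", "11"]
unmapped = sorted(set(observed_codes) - set(state_names))

print(state_names)
print(unmapped)
```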

We can verify the changes to check that everything went as expected.

In [30]:
courts.head()

Unnamed: 0,id,name,slug,city,state,jurisdiction,level_of_appeal
0,4,Bundesgerichtshof,bgh,,Bundesrepublik Deutschland,,Bundesgericht
1,5,Bundesverwaltungsgericht,bverwg,,Bundesrepublik Deutschland,Verwaltungsgerichtsbarkeit,Bundesgericht
2,4,Bundesgerichtshof,bgh,,Bundesrepublik Deutschland,,Bundesgericht
3,3,Bundesverfassungsgericht,bverfg,,Bundesrepublik Deutschland,Verfassungsgerichtsbarkeit,Bundesgericht
4,3,Bundesverfassungsgericht,bverfg,,Bundesrepublik Deutschland,Verfassungsgerichtsbarkeit,Bundesgericht


In [31]:
courts["state"].unique()

array(['Bundesrepublik Deutschland', 'Niedersachsen', 'Baden-Württemberg',
       'Nordrhein-Westfalen', 'Mecklenburg-Vorpommern', 'Rheinland-Pfalz',
       'Bayern', 'Europäische Union', 'Schleswig-Holstein', 'Saarland',
       'Hamburg', 'Sachsen-Anhalt', 'Sachsen'], dtype=object)

The column still has a small problem, though. For the Bundesgerichte (courts that have jurisdiction over the whole of Germany), the state column does not indicate their location.

With some knowledge of the German legal system, we know where the Bundesgerichte are located and can update this information.

In [32]:
courts.loc[(courts["name"]=="Bundesverfassungsgericht")|(courts["name"]=="Bundesgerichtshof"),"state"] = "Baden-Württemberg"
courts.loc[courts["name"]=="Bundesarbeitsgericht","state"]="Thüringen"
courts.loc[courts["name"]=="Bundesverwaltungsgericht","state"] = "Sachsen"
courts.loc[courts["name"]=="Bundesfinanzhof","state"]="Bayern"
courts.loc[courts["name"]=="Bundessozialgericht","state"]="Hessen"
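The same assignments can be written as a lookup table, which keeps the court-to-state pairs in one place. This is only a sketch on plain dictionaries: the variable names are hypothetical and `rows` is a two-row stand-in for the courts data.

```python
# Seat of each federal court, taken from the assignments above
federal_court_seat_state = {
    "Bundesverfassungsgericht": "Baden-Württemberg",
    "Bundesgerichtshof": "Baden-Württemberg",
    "Bundesarbeitsgericht": "Thüringen",
    "Bundesverwaltungsgericht": "Sachsen",
    "Bundesfinanzhof": "Bayern",
    "Bundessozialgericht": "Hessen",
}

# Two stand-in rows of the courts data
rows = [
    {"name": "Bundesgerichtshof", "state": "Bundesrepublik Deutschland"},
    {"name": "Oberlandesgericht Hamm", "state": "Nordrhein-Westfalen"},
]

for row in rows:
    # Keep the existing state when the court is not a federal court
    row["state"] = federal_court_seat_state.get(row["name"], row["state"])

print(rows)
```

On a DataFrame, a comparable effect could be achieved by mapping the name column through the dictionary and filling the gaps with the existing state values.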

The DataFrame now looks like this:

In [33]:
courts.head(10)

Unnamed: 0,id,name,slug,city,state,jurisdiction,level_of_appeal
0,4,Bundesgerichtshof,bgh,,Baden-Württemberg,,Bundesgericht
1,5,Bundesverwaltungsgericht,bverwg,,Sachsen,Verwaltungsgerichtsbarkeit,Bundesgericht
2,4,Bundesgerichtshof,bgh,,Baden-Württemberg,,Bundesgericht
3,3,Bundesverfassungsgericht,bverfg,,Baden-Württemberg,Verfassungsgerichtsbarkeit,Bundesgericht
4,3,Bundesverfassungsgericht,bverfg,,Baden-Württemberg,Verfassungsgerichtsbarkeit,Bundesgericht
5,7,Bundesarbeitsgericht,bag,,Thüringen,Arbeitsgerichtsbarkeit,Bundesgericht
6,3,Bundesverfassungsgericht,bverfg,,Baden-Württemberg,Verfassungsgerichtsbarkeit,Bundesgericht
7,3,Bundesverfassungsgericht,bverfg,,Baden-Württemberg,Verfassungsgerichtsbarkeit,Bundesgericht
8,5,Bundesverwaltungsgericht,bverwg,,Sachsen,Verwaltungsgerichtsbarkeit,Bundesgericht
9,3,Bundesverfassungsgericht,bverfg,,Baden-Württemberg,Verfassungsgerichtsbarkeit,Bundesgericht


## The City Column

The city column contains a code for each city. However, it is null for many courts (the federal courts shown above, for example).

In [34]:
courts["city"].unique()

array([ nan, 297., 375., 325., 446., 290.,  42., 413., 380., 109.,  84.,
        90., 408., 110.,  38., 188., 158., 117., 423., 620., 647., 168.,
       541., 184., 342., 608., 632., 471., 283., 127., 384., 430., 449.,
        95., 485., 509., 302., 407., 103., 467., 233., 616., 120., 465.,
       537., 115., 291., 551., 378., 625., 135., 121., 379., 479., 289.,
       376., 164., 622., 476., 524., 555., 355., 538., 394., 294., 393.,
       556., 531., 633., 142., 123., 116., 150., 606., 186., 155., 189.,
       151., 132., 145., 166., 176., 639., 631., 629.,  40.,  73., 602.,
        68., 601., 635., 623., 550., 364., 543., 111., 610.,  55., 514.,
       534., 286., 557., 619., 285., 535.,  98.,  39.,  66.,  37.,  44.,
        76.,  72., 377., 347., 554., 609., 561.,  34., 607., 293., 445.,
       442., 417., 466., 497., 463., 404., 498., 490., 390., 493., 494.,
       396., 481., 388., 409., 349., 462., 458., 328., 428., 432., 392.,
       448., 546., 436., 433., 422., 440.,   9., 46

The Open Legal Data API documents which code corresponds to which city in the "cities_read" endpoint. However, there are other ways to fill in the missing values. One of them is to use the Google Maps API to look up the location of each court based on its name.

In [35]:
with open(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\maps_api_key.txt") as file: 
    key = file.read()
gmaps = googlemaps.Client(key=key)

Example of how the API works:

In [36]:
gmaps.geocode("Bundesverfassungsgericht")
# The search result is a list containing a dictionary. Every dictionary contains multiple keys.
# Of all the keys, the most important is "formatted_address", which gives us information on the court's location

[{'access_points': [{'access_point_type': 'TYPE_SEGMENT',
    'location': {'latitude': 49.0128558, 'longitude': 8.4023569},
    'location_on_segment': {'latitude': 49.0128086, 'longitude': 8.4024206},
    'place_id': 'ChIJ52PBik4Gl0cRyfQjXZjeKgg',
    'segment_position': 0.03768588230013847,
    'unsuitable_travel_modes': []}],
  'address_components': [{'long_name': '3',
    'short_name': '3',
    'types': ['street_number']},
   {'long_name': 'Schloßbezirk',
    'short_name': 'Schloßbezirk',
    'types': ['route']},
   {'long_name': 'Innenstadt-West',
    'short_name': 'Innenstadt-West',
    'types': ['political', 'sublocality', 'sublocality_level_1']},
   {'long_name': 'Karlsruhe',
    'short_name': 'Karlsruhe',
    'types': ['locality', 'political']},
   {'long_name': 'Karlsruhe',
    'short_name': 'KA',
    'types': ['administrative_area_level_2', 'political']},
   {'long_name': 'Baden-Württemberg',
    'short_name': 'BW',
    'types': ['administrative_area_level_1', 'political']},


To ensure that this procedure is going to work, we can do a small test:

In [37]:
# This cell might take some time to run
new_city_values = {}

for c in courts["name"].unique():
    try:
        # Takes the formatted address of the first search result, if there is one
        address = gmaps.geocode(c)[0].get("formatted_address")
        new_city_values[c] = address
    except Exception:
        print("There was a problem with {}".format(c))

There was a problem with Europäischer Gerichtshof
There was a problem with Schleswig-Holsteinisches Landesverfassungsgericht


So, there seem to be two courts that the googlemaps API can't find. For the Europäischer Gerichtshof, it would suffice to change the name to "European Supreme Court" to get correct results. However, since this project focuses on German courts and not European ones, we'll drop any entries related to the Europäischer Gerichtshof.

As for the Schleswig-Holsteinisches Landesverfassungsgericht, Google can't find where it's located. However, with a [quick Google search](https://www.schleswig-holstein.de/DE/Justiz/LVG/Kontakt/kontakt_node.html;jsessionid=F20A48B2D2A16A222943350C48126AB7.delivery2-replication) we can see that the court is located in the city of Schleswig.

We'll now make the corresponding changes.

In [38]:
courts.loc[courts["name"]=="Schleswig-Holsteinisches Landesverfassungsgericht","city"] = "Schleswig"

In [39]:
courts.loc[courts["name"]=="Schleswig-Holsteinisches Landesverfassungsgericht"]

Unnamed: 0,id,name,slug,city,state,jurisdiction,level_of_appeal
1760,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
1761,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
4212,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
4214,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
10059,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
13010,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
14212,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
14820,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
16161,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,
16162,1069,Schleswig-Holsteinisches Landesverfassungsgericht,lvgsh,Schleswig,Schleswig-Holstein,Verfassungsgerichtsbarkeit,


In [40]:
courts = courts[courts["name"]!="Europäischer Gerichtshof"]

We'll drop it from the cases DataFrame too.

In [41]:
cases = cases.loc[~cases["slug"].str.contains("eugh",case=False)] # Keeps only rows whose slug does not contain "eugh"

After this minor setback, we can convert the dictionary containing the addresses of the courts to a DataFrame, in order to clean these entries more easily.

In [42]:
new_city_values = pd.DataFrame(data = new_city_values.values(),index= new_city_values.keys())
new_city_values.columns = ["address"]
new_city_values = new_city_values["address"].str.split(",",expand=True)
new_city_values

Unnamed: 0,0,1,2,3
Bundesgerichtshof,Herrenstraße 45 A,76133 Karlsruhe,Germany,
Bundesverwaltungsgericht,Simsonpl. 1,04107 Leipzig,Germany,
Bundesverfassungsgericht,Schloßbezirk 3,76131 Karlsruhe,Germany,
Bundesarbeitsgericht,Hugo-Preuß-Platz 1,99084 Erfurt,Germany,
Bundesfinanzhof,Ismaninger Str. 109,81675 München,Germany,
...,...,...,...,...
Amtsgericht Nienburg (Weser),Berliner Ring 98,31582 Nienburg/Weser,Germany,
Landgericht Hildesheim,Kaiserstraße 60,31134 Hildesheim,Germany,
Amtsgericht Dessau-Roßlau,Willy-Lohmann-Straße 33,06844 Dessau-Roßlau,Germany,
Amtsgericht Plettenberg,An der Lohmühle 5,58840 Plettenberg,Germany,


We can keep working on the new_city_values DataFrame by splitting the second part of the address on whitespace.

In [43]:
new_city_values = new_city_values[1].str.split(expand=True)
new_city_values

Unnamed: 0,0,1,2,3,4
Bundesgerichtshof,76133,Karlsruhe,,,
Bundesverwaltungsgericht,04107,Leipzig,,,
Bundesverfassungsgericht,76131,Karlsruhe,,,
Bundesarbeitsgericht,99084,Erfurt,,,
Bundesfinanzhof,81675,München,,,
...,...,...,...,...,...
Amtsgericht Nienburg (Weser),31582,Nienburg/Weser,,,
Landgericht Hildesheim,31134,Hildesheim,,,
Amtsgericht Dessau-Roßlau,06844,Dessau-Roßlau,,,
Amtsgericht Plettenberg,58840,Plettenberg,,,
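Splitting on commas and whitespace relies on every address having its parts in the same positions. As an alternative sketch, a regular expression can look for the five-digit postal code directly and take the text that follows it up to the next comma. `city_from_address` is a hypothetical helper; the first address is reassembled from the table above, and the second one is made up to show an address with an extra leading part.

```python
import re

# Five digits, whitespace, then everything up to the next comma
POSTCODE_CITY = re.compile(r"\b(\d{5})\s+([^,]+)")


def city_from_address(address):
    """Return the city that follows a five-digit postal code, or None."""
    match = POSTCODE_CITY.search(address)
    return match.group(2).strip() if match else None


print(city_from_address("Herrenstraße 45 A, 76133 Karlsruhe, Germany"))
print(city_from_address("Some Court, Musterstraße 1, 12345 Musterstadt, Germany"))
print(city_from_address("Germany"))
```

Addresses without a postal code return None, so they can be collected and handled separately.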


The results are almost ready to be merged into the DataFrame, as most of them contain only the city name in column 1. However, we have to check whether all of them follow the pattern. This can be done by using a very simple regular expression.

In [44]:
new_city_values[new_city_values[0].str.contains(r"\D")] # Selects all rows with "not a digit" in the column 0

Unnamed: 0,0,1,2,3,4
Pfälzisches Oberlandesgericht Zweibrücken,Schlosspl.,7,,,
Rheinschifffahrtsobergericht Köln,Germany,,,,
Landgericht Paderborn,Germany,,,,
Sozialgericht Halle,Thüringer,Str.,16.0,,
Sozialgericht Mainz,Germany,,,,
Arbeitsgericht Aachen,Adalbertsteinweg,92,,,
Anwaltsgerichtshof NRW,Germany,,,,
Amtsgericht Ribnitz-Damgarten,Germany,,,,


It seems that the googlemaps API could not find appropriate entries for the courts above. Luckily, their location is contained in their names. The exception is "Anwaltsgerichtshof NRW", which is located in [Hamm](https://www.olg-hamm.nrw.de/aufgaben/gerichtshoefe/anwaltsgericht/index.php).

We'll enter this information manually.

In [45]:
courts = courts.copy() # This prevents us from getting a SettingWithCopy warning
courts.loc[courts["name"]=="Anwaltsgerichtshof NRW","city"]="Hamm"
new_city_values = new_city_values.drop("Anwaltsgerichtshof NRW",axis = "index")

In [46]:
missing = new_city_values.copy()
missing = missing[missing[0].str.contains(r"\D")] # Rows whose column 0 is not a pure postal code
values = [v[-1] for v in missing.reset_index()["index"].str.split()]
missing["values"] = values
# new_city_values = new_city_values.copy() # Again, this is just to prevent the SettingWithCopy Warning
for n in missing.index:
    new_city_values.loc[n,1] = missing.loc[n,"values"]
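The fallback used in this cell takes the last whitespace-separated word of the court name as the city. A short sketch with a hypothetical helper name shows where this works and where it would need care, using court names that appear in this notebook:

```python
def city_from_court_name(name):
    """Use the last whitespace-separated word of a court name as the city."""
    return name.split()[-1]


print(city_from_court_name("Landgericht Paderborn"))
print(city_from_court_name("Pfälzisches Oberlandesgericht Zweibrücken"))
# A name ending in a parenthesised addition would not yield the city itself
print(city_from_court_name("Amtsgericht Nienburg (Weser)"))
```

For the courts handled in this cell the last word is the city, so the shortcut is sufficient here.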

We can confirm changes here.

In [47]:
new_city_values.loc[missing.index]

Unnamed: 0,0,1,2,3,4
Pfälzisches Oberlandesgericht Zweibrücken,Schlosspl.,Zweibrücken,,,
Rheinschifffahrtsobergericht Köln,Germany,Köln,,,
Landgericht Paderborn,Germany,Paderborn,,,
Sozialgericht Halle,Thüringer,Halle,16.0,,
Sozialgericht Mainz,Germany,Mainz,,,
Arbeitsgericht Aachen,Adalbertsteinweg,Aachen,,,
Amtsgericht Ribnitz-Damgarten,Germany,Ribnitz-Damgarten,,,


Finally, we can update the values in the DataFrame.

In [48]:
for n in new_city_values.index:
    courts.loc[courts["name"]==n,"city"]= new_city_values.loc[n,1]

Below, we check whether any court still has a missing city.

In [49]:
courts[courts["city"].isnull()]

Unnamed: 0,id,name,slug,city,state,jurisdiction,level_of_appeal
21227,621,Amtsgericht Ahaus,ag-ahaus,,Nordrhein-Westfalen,Ordentliche Gerichtsbarkeit,Amtsgericht
21445,621,Amtsgericht Ahaus,ag-ahaus,,Nordrhein-Westfalen,Ordentliche Gerichtsbarkeit,Amtsgericht
24976,621,Amtsgericht Ahaus,ag-ahaus,,Nordrhein-Westfalen,Ordentliche Gerichtsbarkeit,Amtsgericht
25014,621,Amtsgericht Ahaus,ag-ahaus,,Nordrhein-Westfalen,Ordentliche Gerichtsbarkeit,Amtsgericht
45113,621,Amtsgericht Ahaus,ag-ahaus,,Nordrhein-Westfalen,Ordentliche Gerichtsbarkeit,Amtsgericht
48269,621,Amtsgericht Ahaus,ag-ahaus,,Nordrhein-Westfalen,Ordentliche Gerichtsbarkeit,Amtsgericht


So it seems that one court still has no value in the city column (strictly speaking, its value is None). We'll fix it now.

In [50]:
courts.loc[courts["name"]=="Amtsgericht Ahaus","city"] = "Ahaus"

Finally, the city column is set. There are no more missing values.

In [51]:
courts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47477 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               47477 non-null  int64 
 1   name             47477 non-null  object
 2   slug             47477 non-null  object
 3   city             47477 non-null  object
 4   state            47477 non-null  object
 5   jurisdiction     31938 non-null  object
 6   level_of_appeal  26045 non-null  object
dtypes: int64(1), object(6)
memory usage: 2.9+ MB


## The Jurisdiction Column

Finally, the last column to fix is the jurisdiction column. Below, we can see all the courts for which the values in the jurisdiction column are missing.

In [52]:
courts.loc[courts["jurisdiction"].isnull(),"name"].value_counts()

Bundesgerichtshof                             6128
Bundesfinanzhof                               2409
Oberlandesgericht Hamm                        1042
Oberlandesgericht Düsseldorf                   968
Niedersächsisches Oberverwaltungsgericht       844
Oberlandesgericht Köln                         783
Oberlandesgericht Karlsruhe                    382
Oberlandesgericht Stuttgart                    336
Oberlandesgericht Celle                        330
Oberlandesgericht Naumburg                     320
Hanseatisches Oberlandesgericht                307
Oberlandesgericht Koblenz                      246
Niedersächsisches Finanzgericht                215
Pfälzisches Oberlandesgericht Zweibrücken      176
Landesarbeitsgericht Niedersachsen             175
Oberlandesgericht Rostock                      143
Verwaltungsgericht Göttingen                   131
Schleswig-Holsteinisches Oberlandesgericht     126
Oberlandesgericht Braunschweig                  79
Oberlandesgericht Oldenburg    

With some knowledge of the German legal system, it's possible to insert the missing values.

### Bundesgerichtshof and Bundesfinanzhof

The jurisdictions of these two federal courts are, respectively, "Ordentliche Gerichtsbarkeit" and "Finanzgerichtsbarkeit".

In [53]:
courts.loc[courts["name"]=="Bundesgerichtshof","jurisdiction"] = "Ordentliche Gerichtsbarkeit"
courts.loc[courts["name"]=="Bundesfinanzhof","jurisdiction"] = "Finanzgerichtsbarkeit"

### Verwaltungsgerichtsbarkeit

Verwaltungsgerichte are responsible for the Verwaltungsgerichtsbarkeit. To assign the right jurisdiction value, we can check which court names contain "Verwaltung" or "Oberverwaltung". The match is case-sensitive, so both patterns are needed.

In [54]:
courts.loc[(courts["name"].str.contains("Verwaltung"))|
           courts["name"].str.contains("Oberverwaltung"),"name"].value_counts()

Oberverwaltungsgericht Nordrhein-Westfalen          1819
Verwaltungsgericht Köln                             1035
Verwaltungsgericht Düsseldorf                        951
Niedersächsisches Oberverwaltungsgericht             844
Verwaltungsgericht Gelsenkirchen                     817
Schleswig-Holsteinisches Verwaltungsgericht          747
Verwaltungsgerichtshof Baden-Württemberg             609
Verwaltungsgericht Magdeburg                         562
Oberverwaltungsgericht des Landes Sachsen-Anhalt     516
Schleswig Holsteinisches Oberverwaltungsgericht      424
Oberverwaltungsgericht Rheinland-Pfalz               395
Verwaltungsgericht Hannover                          347
Verwaltungsgericht Aachen                            310
Verwaltungsgericht Greifswald                        292
Verwaltungsgericht Neustadt an der Weinstraße        249
Verwaltungsgericht Minden                            234
Verwaltungsgericht Karlsruhe                         219
Verwaltungsgericht Hamburg     

In [55]:
courts.loc[(courts["name"].str.contains("Verwaltung"))|
           (courts["name"].str.contains("Oberverwaltung")),"jurisdiction"] = "Verwaltungsgerichtsbarkeit"
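
The two conditions above are needed because `str.contains` is case-sensitive by default, and "Oberverwaltungsgericht" contains a lowercase "verwaltung". The following is a minimal sketch on a made-up `demo` DataFrame (hypothetical, not the notebook's data) showing how `case=False` could cover both spellings with one pattern:

```python
import pandas as pd

# Hypothetical sample of court names; not the notebook's data
demo = pd.DataFrame({"name": ["Verwaltungsgericht Köln",
                              "Oberverwaltungsgericht Rheinland-Pfalz",
                              "Amtsgericht Ahaus"]})

# Case-sensitive (default): misses the lowercase "verwaltung" in "Oberverwaltungsgericht"
strict = demo["name"].str.contains("Verwaltung")

# Case-insensitive: matches both administrative courts with a single pattern
relaxed = demo["name"].str.contains("verwaltung", case=False)

print(strict.tolist())
print(relaxed.tolist())
```

Note that a case-insensitive pattern is broader: it would also match names such as "Bundesverwaltungsgericht", which the two case-sensitive patterns do not.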

### Sozialgerichtsbarkeit

We can apply the same procedure to the Sozialgerichte.

In [56]:
courts.loc[courts["name"].str.contains("Sozialgericht", case = False),"name"].value_counts()

Bundessozialgericht                             1565
Landessozialgericht NRW                          574
Landessozialgericht Baden-Württemberg            386
Landessozialgericht Sachsen-Anhalt               318
Landessozialgericht Niedersachsen-Bremen         238
Schleswig-Holsteinisches Landessozialgericht     168
Sozialgericht Karlsruhe                          118
Sozialgericht Aachen                             100
Sozialgericht Detmold                             97
Sozialgericht Düsseldorf                          81
Landessozialgericht Rheinland-Pfalz               79
Landessozialgericht Mecklenburg-Vorpommern        66
Sozialgericht Duisburg                            64
Sozialgericht Dortmund                            57
Sozialgericht Stade                               56
Sozialgericht Osnabrück                           55
Sozialgericht Halle                               48
Sozialgericht Mainz                               45
Sozialgericht Lüneburg                        

In [57]:
courts.loc[courts["name"].str.contains("Sozialgericht",case = False),"jurisdiction"] = "Sozialgerichtsbarkeit"

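### Ordentliche Gerichtsbarkeit

All remaining courts, meaning every row whose jurisdiction is still missing or is not one of the five other jurisdictions, are assigned to the Ordentliche Gerichtsbarkeit.

In [58]:
# Rows whose jurisdiction is missing or not in this list are treated as ordinary courts
other_jurisdictions = ["Verfassungsgerichtsbarkeit", "Arbeitsgerichtsbarkeit", "Verwaltungsgerichtsbarkeit",
                       "Finanzgerichtsbarkeit", "Sozialgerichtsbarkeit"]
courts.loc[~courts["jurisdiction"].isin(other_jurisdictions),"jurisdiction"] = "Ordentliche Gerichtsbarkeit"

In [59]:
# Re-asserts the Sozialgericht assignment; a no-op if In [57] has already been run
courts.loc[courts["name"].str.contains("Sozialgericht"),"jurisdiction"] = "Sozialgerichtsbarkeit"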
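
The steps in this section can be summarised as keyword rules followed by a fallback. The sketch below is one possible way to express that idea, assuming a hypothetical `demo` DataFrame and an illustrative (not exhaustive) `rules` mapping. Unlike the cells above, it only fills rows where the jurisdiction is missing and never overwrites existing values:

```python
import pandas as pd

# Hypothetical sample; jurisdiction values are partly missing
demo = pd.DataFrame({
    "name": ["Bundesfinanzhof", "Sozialgericht Mainz",
             "Oberlandesgericht Hamm", "Arbeitsgericht Köln"],
    "jurisdiction": [None, None, None, "Arbeitsgerichtsbarkeit"],
})

# Keyword -> jurisdiction rules (illustrative assumption, not exhaustive)
rules = {
    "finanz": "Finanzgerichtsbarkeit",
    "sozialgericht": "Sozialgerichtsbarkeit",
    "verwaltung": "Verwaltungsgerichtsbarkeit",
}

for keyword, jurisdiction in rules.items():
    # Only touch rows that are still missing and whose name matches the keyword
    mask = demo["jurisdiction"].isnull() & demo["name"].str.contains(keyword, case=False)
    demo.loc[mask, "jurisdiction"] = jurisdiction

# Whatever is still missing falls back to the ordinary jurisdiction
demo["jurisdiction"] = demo["jurisdiction"].fillna("Ordentliche Gerichtsbarkeit")

print(demo)
```

The fallback is a simplification: on real data it would mislabel any court that belongs to a jurisdiction not covered by the rules, so the result should be checked by hand.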

### Checking Results

Below, we can see the result of the data cleaning.

In [60]:
courts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47477 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               47477 non-null  int64 
 1   name             47477 non-null  object
 2   slug             47477 non-null  object
 3   city             47477 non-null  object
 4   state            47477 non-null  object
 5   jurisdiction     47477 non-null  object
 6   level_of_appeal  26045 non-null  object
dtypes: int64(1), object(6)
memory usage: 2.9+ MB


In [61]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47477 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            47477 non-null  int64 
 1   slug          47477 non-null  object
 2   file_number   47477 non-null  object
 3   date          47477 non-null  object
 4   created_date  47477 non-null  object
 5   updated_date  47477 non-null  object
 6   type          47477 non-null  object
 7   ecli          32462 non-null  object
dtypes: int64(1), object(7)
memory usage: 3.3+ MB


Notice that there are no more missing values in the city, state or jurisdiction columns. Also, the number of rows in both datasets matches.

The level_of_appeal column in the courts DataFrame and the ecli column in the cases DataFrame are not so relevant for this analysis (even though they could be very important for answering other questions), so we'll leave them as they are for now.
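
Instead of reading the `info()` output by eye, the same checks could be made programmatically. This is a minimal sketch, assuming two small hypothetical stand-ins (`courts_demo` and `cases_demo`) for the notebook's DataFrames:

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's courts and cases DataFrames
courts_demo = pd.DataFrame({"city": ["Karlsruhe", "Leipzig"],
                            "level_of_appeal": ["Bundesgericht", None]},
                           index=[0, 3])
cases_demo = pd.DataFrame({"file_number": ["IX ZB 56/19", "2 B 52/19"]},
                          index=[0, 3])

# Number of missing values per column
missing = courts_demo.isnull().sum()
print(missing)

# Columns that must be complete before moving on (assumed list)
required = ["city"]
assert (missing[required] == 0).all()

# Both frames should describe the same rows, in the same order
assert courts_demo.index.equals(cases_demo.index)
```

Checking that the indexes are equal is stricter than comparing row counts, which matters if the two frames are later combined by index.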

## Saving Clean Data Frames

At this point, it is also worth saving the clean DataFrames.

In [62]:
courts.to_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\courts_clean.csv")
cases.to_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\cases_clean.csv")
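
By default, `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below illustrates this round trip on a hypothetical `demo` DataFrame, using `io.StringIO` as a stand-in for a file on disk:

```python
import io
import pandas as pd

# Hypothetical sample with a non-contiguous index, as left behind by dropping rows
demo = pd.DataFrame({"id": [4, 5],
                     "name": ["Bundesgerichtshof", "Bundesverwaltungsgericht"]},
                    index=[0, 7])

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

# index_col=0: the first column is read back as the index
restored = pd.read_csv(io.StringIO(demo.to_csv()), index_col=0)

print(list(with_index.columns))
print(list(without_index.columns))
print(restored.index.tolist())
```

Which variant fits best depends on whether the index carries meaning. Here it links rows of the two DataFrames, so reading it back with `index_col=0` may be the safer choice.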

# Combining DataFrames

As a last step, we can combine both datasets, courts and cases, in order to have all the information gathered in one place. 

In [63]:
merged = pd.concat([cases,courts],axis = 1)

The result looks like this:

In [64]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47477 entries, 0 to 49999
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               47477 non-null  int64 
 1   slug             47477 non-null  object
 2   file_number      47477 non-null  object
 3   date             47477 non-null  object
 4   created_date     47477 non-null  object
 5   updated_date     47477 non-null  object
 6   type             47477 non-null  object
 7   ecli             32462 non-null  object
 8   id               47477 non-null  int64 
 9   name             47477 non-null  object
 10  slug             47477 non-null  object
 11  city             47477 non-null  object
 12  state            47477 non-null  object
 13  jurisdiction     47477 non-null  object
 14  level_of_appeal  26045 non-null  object
dtypes: int64(2), object(13)
memory usage: 5.8+ MB


In [65]:
merged.head()

Unnamed: 0,id,slug,file_number,date,created_date,updated_date,type,ecli,id.1,name,slug.1,city,state,jurisdiction,level_of_appeal
0,328393,bgh-2020-05-07-ix-zb-5619,IX ZB 56/19,2020-05-07,2020-05-29T10:00:15Z,2020-05-29T10:07:14Z,Beschluss,ECLI:DE:BGH:2020:070520BIXZB56.19.0,4,Bundesgerichtshof,bgh,Karlsruhe,Baden-Württemberg,Ordentliche Gerichtsbarkeit,Bundesgericht
1,328192,bverwg-2020-04-22-2-b-5219,2 B 52/19,2020-04-22,2020-05-21T10:00:05Z,2020-05-21T10:06:18Z,Beschluss,ECLI:DE:BVerwG:2020:220420B2B52.19.0,5,Bundesverwaltungsgericht,bverwg,Leipzig,Sachsen,Verwaltungsgerichtsbarkeit,Bundesgericht
2,328242,bgh-2020-04-21-ii-zr-5618,II ZR 56/18,2020-04-21,2020-05-23T10:00:15Z,2020-05-23T10:07:16Z,Urteil,ECLI:DE:BGH:2020:210420UIIZR56.18.0,4,Bundesgerichtshof,bgh,Karlsruhe,Baden-Württemberg,Ordentliche Gerichtsbarkeit,Bundesgericht
3,327286,bverfg-2020-03-25-2-bvr-11320,2 BvR 113/20,2020-03-25,2020-04-17T10:00:22Z,2020-04-17T10:06:52Z,Nichtannahmebeschluss,ECLI:DE:BVerfG:2020:rk20200325.2bvr011320,3,Bundesverfassungsgericht,bverfg,Karlsruhe,Baden-Württemberg,Verfassungsgerichtsbarkeit,Bundesgericht
4,327121,bverfg-2020-03-18-1-bvr-33720,1 BvR 337/20,2020-03-18,2020-04-09T10:00:18Z,2020-04-09T10:08:59Z,Nichtannahmebeschluss,ECLI:DE:BVerfG:2020:rk20200318.1bvr033720,3,Bundesverfassungsgericht,bverfg,Karlsruhe,Baden-Württemberg,Verfassungsgerichtsbarkeit,Bundesgericht


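We have to change the column names, as the DataFrame contains two columns named "id" and two columns named "slug". The court columns get a "court_" prefix.

In [66]:
# The second "id" and "slug" belong to the court, so they are renamed to keep column names unique
merged.columns = ['id', 'slug', 'file_number', 'date', 'created_date', 'updated_date',
       'type', 'ecli', 'court_id', 'name', 'court_slug', 'city', 'state', 'jurisdiction',
       'level_of_appeal']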
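
An alternative to assigning the full list of names by hand is to rename only the clashing court columns before combining. This is a sketch under assumptions: `cases_demo` and `courts_demo` are hypothetical miniature versions of the two DataFrames, and the `court_` prefix is just one possible naming choice:

```python
import pandas as pd

# Hypothetical miniature versions of the cases and courts DataFrames
cases_demo = pd.DataFrame({"id": [328393, 328192],
                           "slug": ["bgh-2020-05-07-ix-zb-5619",
                                    "bverwg-2020-04-22-2-b-5219"]})
courts_demo = pd.DataFrame({"id": [4, 5],
                            "slug": ["bgh", "bverwg"],
                            "name": ["Bundesgerichtshof", "Bundesverwaltungsgericht"]})

# Column names that appear in both frames
clashing = courts_demo.columns.intersection(cases_demo.columns)

# Prefix only the clashing court columns
courts_renamed = courts_demo.rename(columns={col: f"court_{col}" for col in clashing})

# Combine side by side; rows are aligned on the index
merged_demo = pd.concat([cases_demo, courts_renamed], axis=1)

print(list(merged_demo.columns))
```

Because the renaming is derived from the columns themselves, it keeps working if columns are added or reordered later.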

Finally, we save the merged DataFrame to a file.

In [67]:
merged.to_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\merged.csv")

# Analysis

Since analyzing the data is outside the scope of this project, this is only a superficial analysis. However, even in its present state, the merged DataFrame already allows some simple visualizations.

In [68]:
merged = pd.read_csv(r"C:\Users\celio\Data Analysis\Projects\Open Legal Data\merged.csv")

## Which Federal Court Has the Most Cases?

In [69]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [70]:
bundesgerichte = merged[merged["name"].str.contains("Bundes")]
bundesgerichte = bundesgerichte.groupby("name").agg("count")["id"].reset_index()
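
The cell above filters the federal courts by name and counts the rows per court. A minimal sketch of the same idea on a hypothetical `demo` DataFrame, using `size()` with a named result column as an assumed alternative to `agg("count")["id"]`:

```python
import pandas as pd

# Hypothetical sample; one row per case, as in the merged DataFrame
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Bundesgerichtshof", "Bundesgerichtshof",
             "Bundesfinanzhof", "Amtsgericht Ahaus"],
})

# Keep only courts whose name contains "Bundes"
federal = demo[demo["name"].str.contains("Bundes")]

# Count rows per court and give the count column an explicit name
counts = federal.groupby("name").size().reset_index(name="cases")

print(counts)
```

`size()` counts rows, while `count()` counts non-null values per column. The two agree here only because the id column has no missing values.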

In [71]:
c = alt.Chart(bundesgerichte,
         title = "Number Of Cases According to Federal Court").mark_bar().encode(
    y = alt.Y("name", 
              axis = alt.Axis(title = "Federal Court", titleFontSize=12,labelFontSize=12, labelAngle = 0,
                             labelPadding =10)),
    x = alt.X("id", axis = alt.Axis(title = "Number of Cases", titleFontSize=12)),
    color = "name")

t = c.mark_text(align = "left",
               baseline= "middle",
               dx = 3).encode(text = "id")

(c+t).properties(width=500,height=250)

## What Are the Most Prolific Courts?

We won't count the federal courts in these charts.

In [72]:
j = dict.fromkeys(merged["jurisdiction"].unique(),"")

for key in j:
    j[key] = merged.loc[(merged["jurisdiction"]==key)&(merged["name"].str.contains("Bundes")==False)]
    j[key] = j[key].groupby("name").agg("count")["id"].reset_index().sort_values("id",ascending =False).head(7)
    
j_charts ={}

for key in j:
    chart = alt.Chart(j[key], title = key).mark_bar().encode(
        y= alt.Y("name",axis = alt.Axis(title="Court", labelFontSize=12,titleFontSize=12)),
        x = alt.X("id",axis= alt.Axis(title="Number of Cases", labelFontSize=12,titleFontSize=12)),
        color = "name").properties(width = 500, height = 250)
    text = chart.mark_text(align = "left",
                           baseline = "middle",
                           dx = 3).encode(text="id")
    j_charts[key] = (chart+text)
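
The data preparation in the first loop (the busiest courts per jurisdiction) could also be done in one pass without a dictionary. This is a sketch on a hypothetical `demo` DataFrame, keeping only the top court per jurisdiction instead of the top seven to keep the sample small:

```python
import pandas as pd

# Hypothetical sample; one row per case
demo = pd.DataFrame({
    "jurisdiction": ["Sozialgerichtsbarkeit", "Sozialgerichtsbarkeit",
                     "Sozialgerichtsbarkeit", "Finanzgerichtsbarkeit"],
    "name": ["Sozialgericht Aachen", "Sozialgericht Aachen",
             "Sozialgericht Mainz", "Niedersächsisches Finanzgericht"],
})

# Cases per court within each jurisdiction
counts = demo.groupby(["jurisdiction", "name"]).size().reset_index(name="cases")

# Sort by case count, then keep the first n rows of every jurisdiction
n = 1
top = counts.sort_values("cases", ascending=False).groupby("jurisdiction").head(n)

print(top)
```

The resulting frame can then be filtered per jurisdiction when building each chart.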
                           


### Arbeitsgerichtsbarkeit

In [73]:
j_charts["Arbeitsgerichtsbarkeit"]

### Finanzgerichtsbarkeit

In [74]:
j_charts["Finanzgerichtsbarkeit"]

### Ordentliche Gerichtsbarkeit

In [75]:
j_charts["Ordentliche Gerichtsbarkeit"]

### Sozialgerichtsbarkeit

In [76]:
j_charts["Sozialgerichtsbarkeit"]

### Verwaltungsgerichtsbarkeit

In [77]:
j_charts["Verwaltungsgerichtsbarkeit"]

## Cases By State

We can also visualize interesting facts about the German Federal States.

In [78]:
state_data = merged.loc[merged["name"].str.contains("Bund")==False].groupby(["state","jurisdiction"]).count()["slug"]
state_data = state_data.reset_index()

s = {}
for i in state_data["state"].unique():
    s[i] = state_data.loc[state_data["state"]==i]

s_charts = {}
for key in s:
    b = alt.Chart(s[key], title = key).mark_bar().encode(
        y = alt.Y("jurisdiction", axis = alt.Axis(title="Jurisdiction", titleFontSize=12,
                                                  labelFontSize=12)),
        x = alt.X("slug", axis = alt.Axis(title = "Number of Cases", titleFontSize =12,
                                         labelFontSize=12)),
        color = "jurisdiction")
    t = b.mark_text(align ="left",
                   baseline = "middle",
                   dx = 3).encode(text = "slug")
    # Layer the bar chart and its value labels
    s_charts[key] = (b+t).properties(width = 470, height = 200)

s_charts["Bayern"]


In [79]:
s_charts["Baden-Württemberg"]

In [80]:
s_charts['Nordrhein-Westfalen']
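
To compare all states at once rather than one chart at a time, the counts could also be arranged as a table. A minimal sketch using `pd.crosstab` on a hypothetical `demo` DataFrame:

```python
import pandas as pd

# Hypothetical sample; one row per case
demo = pd.DataFrame({
    "state": ["Bayern", "Bayern", "Nordrhein-Westfalen"],
    "jurisdiction": ["Finanzgerichtsbarkeit", "Sozialgerichtsbarkeit",
                     "Sozialgerichtsbarkeit"],
})

# Rows: states, columns: jurisdictions, values: number of cases
table = pd.crosstab(demo["state"], demo["jurisdiction"])

print(table)
```

Combinations that do not occur in the data show up as 0, which makes gaps in the coverage of a state easy to spot.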

# Conclusion

This project's goal was to demonstrate the possibilities of the Open Legal Data API. It focused on cleaning and complementing the data rather than on analysis. To that end, we extracted data on 50,000 cases from the API. To allow for better organization and cleaning, we separated the data into two datasets: one containing data on the courts that issued the decisions, and another containing data on the cases.

We spotted three major problems in the courts data: three columns had a lot of missing values. Thanks to other endpoints of the Open Legal Data API, the functionality provided by the googlemaps API and some knowledge of the German legal system, we could fill in the missing entries for the "city", "state" and "jurisdiction" columns.

Finally, we merged both datasets together and we did some initial exploratory data analysis.

The next steps are:

- Compare the content of the cases across the Federal States. This could allow us to find out whether there are any differences in how courts in different states judge similar cases, or whether there are considerable differences in the amount of compensation granted by courts in different locations.

- Try to identify whether any state or court has to deal with a recurring situation more often than other courts, and try to find out why. Let's say that the Arbeitsgericht in Munich gets more cases about illegal terminations of employment contracts than the average court: what could be the reasons for this?

