# HOW-TO: Collect Caselaw Data from the Caselaw Access Project (CAP) API

## ABOUT: The Caselaw Access Project (CAP) API

[Harvard Law School's Caselaw Access Project](https://case.law/) provides an API (Application Programming Interface) that researchers can use to pull metadata and text for: <br>


> **6,930,777** State *and* Federal Cases<br>
> **1,842,484** Federal Cases 

Before calling the API, take a moment to explore the API's documentation available on the site. 

Additionally, the site provides an online search option for caselaw that will help in understanding the metadata available, the filtering options, and the overall structure for individual cases. 

There are several ways of searching this database, but this tutorial will focus on pulling a collection of cases (text and metadata) using a specific search term. 

**EXAMPLE**: 
https://api.case.law/v1/cases/?full_case=true&jurisdiction=ohio&search=magic

### API Key & Access Limits
Before you can start coding an API Request, you will need to register an account on Case.Law and obtain an API Key here: https://case.law/user/register/
<br>Once registered, you will see an API Key listed on your account profile page. You will use this key in the API Call code itself.<br>

**NOTE on ACCESS**: Registering as a researcher will give you access to 500 cases a day. It's recommended that you test your code on a smaller sample before launching a larger request. If your research needs exceed 500 cases / day, please contact Erin at the DSC: erin.mccabe@uc.edu
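Because the API Key goes directly into the request code, it is easy to publish it by accident when sharing a notebook. One way to reduce that risk is to read the key from an environment variable. The sketch below is only a suggestion: the variable name `CAP_API_KEY` and the helper `build_auth_header` are hypothetical names chosen for this example, not part of the Case.Law API.

```python
import os


def build_auth_header(env_var="CAP_API_KEY"):
    """Build the Authorization header from a key stored in an environment variable.

    CAP_API_KEY is a hypothetical variable name used for this example only.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API Key first.")
    # Same header format the request code in this tutorial uses: 'Token <key>'
    return {"Authorization": f"Token {key}"}


# Usage (after setting the variable in your shell or notebook environment):
# headers = build_auth_header()
```

The returned dictionary can then be passed as the `headers` argument of `requests.get`.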


## CODING a DATA REQUEST
Once you obtain an API Key, start setting up the code to pull cases by importing the Python libraries the program needs. In this case they are: **requests** (to make the request call) and **json** (to help structure and save the results).

In [None]:
import requests 
import json

### Structuring the Request

*Note*: The Case.Law API Request URL is highly customizable. There are many query parameters (sometimes loosely called "endpoints") that you can append to your Request URL to filter your results. Explore them all here: https://case.law/docs/site_features/api#endpoint-cases<br>

In this demonstration - we will use the following Example-URL as the basis for our demo API call. 

https://api.case.law/v1/cases/?full_case=true&jurisdiction=ohio&search=magic

Let's look at each of this URL's components.<br> 
* First is the base URL (aka the primary endpoint) for cases: *api.case.law/v1/cases/*
* Next is the parameter that requests the full case text: *full_case=true*
* Then comes the parameter that narrows to state-level cases in the Ohio jurisdiction: *jurisdiction=ohio*
* Finally, the search term that returns cases containing 'magic': *search=magic*
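The components listed above can also be assembled in code rather than typed out by hand, which makes it easier to swap in a different jurisdiction or search term. This is a minimal sketch using Python's standard `urllib.parse.urlencode`; the variable names `BASE_URL` and `params` are just illustrative.

```python
from urllib.parse import urlencode

# Base URL (primary endpoint) for cases, taken from the example above
BASE_URL = "https://api.case.law/v1/cases/"

# Query parameters from the example URL; dictionary order is preserved
params = {
    "full_case": "true",
    "jurisdiction": "ohio",
    "search": "magic",
}

# Join the base URL and the encoded parameters with '?'
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

Changing a value in `params` (for example, the search term) and re-running the cell produces a new Request URL with the same structure.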

We can click this URL to explore the results of our search directly on the Case.Law site. Understanding the JSON structure is crucial to directing your code to the right data. <br>

You'll see that the first 3 keys are Count, Next, and Previous. These contain useful data about the result set as a whole.
* **Count**: Number of cases that match our search
* **Next**: Case results are returned 100 per page; the Next URL is where to find the following 100 case results.
* **Previous**: The URL that pages back to the previous 100 case results

The next component is "results" where you will find a list of the actual cases themselves. We need to navigate here to collect the case-data. 
* **Results**: the list of cases 
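To make that structure concrete, here is a small hand-written stand-in for one page of output. It is not real API data: the keys follow the structure described above, every value is a made-up placeholder, and each case is trimmed down to two fields for readability.

```python
# Hand-written stand-in for one page of API output (placeholder values only)
sample_page = {
    "count": 2,        # total number of matching cases
    "next": None,      # URL of the next page, or None on the last page
    "previous": None,  # URL of the previous page, or None on the first page
    "results": [       # the list of cases on this page
        {"id": 1, "name_abbreviation": "Placeholder v. Example"},
        {"id": 2, "name_abbreviation": "Sample v. Demo"},
    ],
}

print("Total matches:", sample_page["count"])
print("More pages?", sample_page["next"] is not None)

# Navigate into 'results' to reach the individual cases
for case in sample_page["results"]:
    print(case["id"], case["name_abbreviation"])
```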

### Making the Request & Pulling the Data
Now that we can see what to expect and have explored the JSON structure, we'll implement the URL in our code using the Python Requests module and print out:
* The request's status code (200 if it works) and 
* The 'Count' number of result-cases to verify we are pulling the right information.

In [None]:
url = 'https://api.case.law/v1/cases/?full_case=true&jurisdiction=ohio&search=magic'

r = requests.get(
      url,
      headers={'Authorization': 'Token YOUR KEY'}
  )

res = r.json()

print("Status: ", r.status_code)
print(res['count'], "Ohio cases with 'Magic'")

Status:  200
249 Ohio cases with 'Magic'


Having verified the successful status and viewed the number of results, we can now start collecting data from the API call's results. <br>

For each of the 249 cases in our results, we want to collect just the following parts:
* CaseID
* URL
* Date
* Title
* Case Text 

We want to compile a JSON List (Variable Name: results) of Dictionaries (Variable Name: case_dict) where 1 Case = 1 Dictionary containing those 5 data components.<br>



In [None]:
results = []

#THESE COUNTER VARIABLES WILL JUST HELP US SEE HOW MANY CASES / PAGES of CASES WE HAVE AS THE CODE RUNS 
counter = 0
p_counter = 1

In [None]:
for case in res['results']:
    case_dict = {}
    case_dict['caseID'] = case['id']
    case_dict['url'] = case['url']
    case_dict['date'] = case['decision_date']
    case_dict['title'] = case['name_abbreviation']
    if case['casebody']['data']['opinions']:
        case_dict['text'] = case['casebody']['data']['opinions'][0]['text']
        results.append(case_dict)
        
    counter = counter + 1

print("case_count: ", counter)

case_count:  100
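Since the same five fields are pulled from every case, the extraction step can be wrapped in a small function. The sketch below is one possible way to do it, assuming each case follows the structure used in the loop above; `extract_case` is a hypothetical helper name, and the two sample cases are made-up placeholders rather than real API output.

```python
def extract_case(case):
    """Return the five fields of interest, or None if the case has no opinion text."""
    # Step down through casebody -> data -> opinions, tolerating missing or empty levels
    casebody = case.get("casebody") or {}
    data = casebody.get("data") or {}
    opinions = data.get("opinions") or []
    if not opinions:
        return None
    return {
        "caseID": case["id"],
        "url": case["url"],
        "date": case["decision_date"],
        "title": case["name_abbreviation"],
        "text": opinions[0]["text"],  # text of the first opinion only
    }


# Made-up placeholder cases: one with opinion text, one without
with_text = {
    "id": 1,
    "url": "https://example.org/cases/1/",
    "decision_date": "1900-01-01",
    "name_abbreviation": "Placeholder v. Example",
    "casebody": {"data": {"opinions": [{"text": "Sample opinion text."}]}},
}
without_text = dict(with_text, id=2, casebody={"data": {"opinions": []}})

extracted = [extract_case(c) for c in (with_text, without_text)]
kept = [c for c in extracted if c is not None]
print(len(kept))
```

As in the loop above, a case without any opinions is skipped rather than saved with an empty text field.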


### Pagination
You'll see that our Case Count is at 100 instead of the full 249 results. This is due to pagination. The CaseLaw API has a default of 100 case results per page. You can set this to a higher limit by adding **&page_size=1000** to the request URL, but I have found that setting this higher than 1000 can cause calls to break. <br>

We have only 249 cases in our example but, for the sake of demonstration, let's work through how to paginate through pages of results, collecting the cases from each page (in our 'magic' example, there are only 3 pages).
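The page count follows from the case count and the page size: divide and round up, because a partly filled last page still counts as a page. A quick check with the numbers from this example:

```python
import math

total_cases = 249  # the 'count' reported for the example search
page_size = 100    # default number of results per page

# Round up so the final, partly filled page is included
pages = math.ceil(total_cases / page_size)
print(pages)  # 3
```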

*REMEMBER: API Calls are limited to 500 cases a day.*

In [None]:
# Keep following the 'next' URL until there are no more pages of results
while res["next"]:
    p_counter = p_counter + 1
    print('pgCount: ', p_counter, ' of ', res['count']/100)
    # Replace YOUR KEY with your own API Key; never publish a real key
    res = requests.get(res["next"], headers={'Authorization': 'Token YOUR KEY'}).json()
    for case in res['results']:
        # Use the same keys as the first page so every dictionary has the same structure
        case_dict = {}
        case_dict['caseID'] = case['id']
        case_dict['url'] = case['url']
        case_dict['date'] = case['decision_date']
        case_dict['title'] = case['name_abbreviation']
        if case['casebody']['data']['opinions']:
            case_dict['text'] = case['casebody']['data']['opinions'][0]['text']
            results.append(case_dict)

pgCount:  2  of  2.49
pgCount:  3  of  2.49
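The pagination logic can also be written as a reusable function with a cap on the number of cases collected, which may help when testing against the daily limit. This is a sketch under assumptions, not the only way to do it: `collect_pages`, `fetch_page`, and `max_cases` are hypothetical names, and the pages used here are made-up placeholders so the example runs without calling the API.

```python
def collect_pages(first_page, fetch_page, max_cases=500):
    """Gather raw cases page by page until there is no 'next' URL or max_cases is reached."""
    cases = []
    page = first_page
    while True:
        for case in page["results"]:
            if len(cases) >= max_cases:
                return cases
            cases.append(case)
        # Stop before requesting another page if we are done or at the cap
        if not page["next"] or len(cases) >= max_cases:
            return cases
        page = fetch_page(page["next"])


# Made-up placeholder pages keyed by URL, standing in for API responses
fake_pages = {
    "page2": {"next": "page3", "results": [{"id": 3}, {"id": 4}]},
    "page3": {"next": None, "results": [{"id": 5}]},
}
first = {"next": "page2", "results": [{"id": 1}, {"id": 2}]}

all_cases = collect_pages(first, fake_pages.get)
capped = collect_pages(first, fake_pages.get, max_cases=3)
print(len(all_cases), len(capped))
```

Against the real API, `fetch_page` could be a function that calls `requests.get` on the URL with your Authorization header and returns `.json()`.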


Now that cases are collected from all 3 pages of results and appended to the full 'results' JSON list, print the length of the list to verify that it matches the 249 cases we expect.


In [None]:
print(len(results))

249


Success! The final step is saving our 249 Case List to a JSON file.

In [None]:
with open('caselaw_magicOH.json', 'w') as json_file:
  json.dump(results, json_file)
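It can be worth reading a saved file back in to confirm it loads and still holds what you expect. The sketch below shows a round trip with a made-up placeholder list written to a temporary folder, so it runs on its own; `load_cases` is a hypothetical helper name.

```python
import json
import os
import tempfile


def load_cases(path):
    """Read a JSON file of cases back into a Python list."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)


# Round trip with placeholder data (not real API output)
sample = [{"caseID": 1, "title": "Placeholder v. Example", "text": "Sample opinion text."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_cases.json")
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(sample, json_file)
    reloaded = load_cases(path)

print(len(reloaded), reloaded[0]["title"])
```

For the file saved in this tutorial, the equivalent call would be `load_cases('caselaw_magicOH.json')`.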