# Lab 3 

 __case.law API, word frequencies, regex, concordances and collocations__

In this lab, we'll be going over how to access the case.law data using an API. We'll also examine the dataset, introduce Regex, and go over concordances and collocations.


# **Part 1 - Introduction to APIs and the case.law Dataset**

For LS190, we'll be working extensively with the **[case.law](https://case.law/)** database - which is a database of **360 years of United States caselaw.** To access this data we'll need to develop a simple understanding of APIs.

* API is an acronym for **Application Programming Interface.**  If you ask me, that seems like a pretty vague and general term (unless you are a CS person who can explain what this means). The term itself most likely comes from the early days of computing. 

* Notwithstanding the confusing nature of the term, for the purposes of this class an **API allows us to interact with, access and download data from the case.law database.** This is why a **[case.law API KEY](https://case.law/docs/site_features/api)** becomes important - as this key allows us to download the case.law data. You can think of an API key as a magical phrase like "Open Sesame" - which lets you access a database where hidden treasures of data await!

* The overwhelming amount of data available today has made **APIs and API keys** an important means of accessing data. For example, there's the **[Twitter API](https://developer.twitter.com/en/docs/twitter-api)** if you want to study tweets. Or the **[NYTimes API](https://developer.nytimes.com/apis)** if you want to study the New York Times Archive.

The API for case.law is very well documented, and you can find examples of how to use it in the **[Jupyter notebooks provided by the case.law team](https://github.com/harvard-lil/cap-examples)**. Honestly, these notebooks are excellent - __so check them out!__

In [2]:
import pandas as pd
import os
import sys
sys.path.append('..')

import lzma
import json

from config import settings_base as settings # Notice that we have a "config" folder - which contains the "settings_base" script
from config import utils                     # and also the "utils.py" script - which contains utility code

Above, we are importing a couple of libraries. 
* **`lzma`** allows us to decompress the case.law data
* **`json`** allows us to parse JSON text into Python *dictionaries* 
* **`config`** is a folder which contains **settings_base** python script. This script should contain your **API KEY** 
 * Note that each API key is unique and different. Because these notebooks are published on github, we cannot share the API Key. This is why I ask you to get the API key from the kind folks at case.law as soon as possible.
* Finally, **`utils`** is a python script which contains helper functions written by case.law folks which allow us to download their data.

The example code below is based on the **[example notebooks](https://github.com/harvard-lil/cap-examples)** written by the wonderful people working at case.law. The case.law project is incredibly important as it allows us to access **huge amounts of case text data** without having to pay a subscription for services like LexisNexis or Westlaw. This is a wonderful example of data democratization.

## Downloading the data

In [6]:
# Get Case Data for Hawaii (as it's a small-ish jurisdiction)
compressed_file = utils.get_and_extract_from_bulk(jurisdiction="Hawaii", 
                                                  data_format="json")

The dataset is stored as a [jsonl](https://jsonlines.org/) file - which stands for "JSON Lines." Each line of the file is a separate JSON object (which Python reads as a dictionary), and each of those objects is one case.
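To make the "one JSON object per line" idea concrete, here is a minimal sketch that builds a tiny xz-compressed JSON Lines file in memory and reads it back. The two records are made up for illustration and are not real case.law data; only the standard library is used.

```python
import io
import json
import lzma

# Two made-up records standing in for real cases (hypothetical data).
records = [
    {"id": 1, "name_abbreviation": "A v. B"},
    {"id": 2, "name_abbreviation": "C v. D"},
]

# Write one JSON object per line, then compress with xz, all in memory.
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = lzma.compress(raw)

# Read it back line by line, the same way a .jsonl.xz file is read from disk.
with lzma.open(io.BytesIO(compressed)) as infile:
    parsed = [json.loads(line.decode("utf-8")) for line in infile]

print(parsed[1]["name_abbreviation"])
```

Each line decodes to its own dictionary, which is why we can later loop over the file and append one record at a time.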

In [None]:
# Assume we are dealing with json data (if data_format is changed to xml,
# change this cell's os.path.join line accordingly)
if not compressed_file.endswith('.xz'):
    compressed_file = os.path.join(compressed_file, 
                                   "data", 
                                   "data.jsonl.xz") 

The above code makes sure that `compressed_file` points at the compressed `data.jsonl.xz` file itself, rather than at the folder that contains it.

In [None]:
cases = []
print("File path:", compressed_file)
with lzma.open(compressed_file) as infile:
    for line in infile:
        record = json.loads(str(line, 'utf-8'))
        cases.append(record)

print("Case count: %s" % len(cases))

We now have a __list__ of all the cases. Examine the first entry below:

In [None]:
cases[0]

**Note:**

If we want to **limit** the number of cases we load (for lack of memory or space), we can use the code below - which will take the first 500 cases from the dataset:

In [None]:
max_records = 500

cases_reduced = []
with lzma.open(compressed_file) as infile:
    for count, line in enumerate(infile):           ## enumerate() allows us to count the iterations 
        record = json.loads(str(line, 'utf-8'))     ## in this case, we want 500 cases
        cases_reduced.append(record)
        if count == max_records - 1:
            break

print("Case count: %s" % len(cases_reduced))

## Putting the data into a pandas dataframe

We now have a list called **"cases"** which contains all the cases from Hawaii.


Now let's make it into a pandas dataframe:

In [None]:
df = pd.DataFrame(cases)
df.head()

We now have access to the Hawaii dataset from case.law. Fascinating!

Note, however, that we don't really see the "text" of the decisions. The actual **text** is contained within a dictionary in the column named **casebody.** So it's a dictionary within a dictionary - these data structures can get pretty complicated!
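To make the nesting concrete, here is a small sketch using a hand-made record that mimics the `casebody` → `data` → `opinions` → `text` layout used in this lab. The records and the helper name `first_opinion_text` are hypothetical, made up purely for illustration.

```python
# Hand-made records that mimic the nesting used in this lab (not real data).
record = {
    "name_abbreviation": "A v. B",
    "casebody": {"data": {"opinions": [{"text": "OPINION OF THE COURT ..."}]}},
}
empty_record = {
    "name_abbreviation": "C v. D",
    "casebody": {"data": {"opinions": []}},
}

def first_opinion_text(case, default="No Text Found"):
    """Return the text of the first opinion, or `default` if there is none."""
    opinions = case.get("casebody", {}).get("data", {}).get("opinions", [])
    return opinions[0].get("text", default) if opinions else default

print(first_opinion_text(record))
print(first_opinion_text(empty_record))
```

Using `.get()` with a default at each level avoids a `KeyError` when a record is missing one of the nested keys.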

## Getting to the text

Let's examine the data structure of **"casebody"** a little bit further.

In [None]:
df['casebody'][0]

We can write another for loop to extract the text:

In [None]:
opinion_texts = []
for i in range(len(df)):
    if df['casebody'][i]['data']['opinions']:
        text = df['casebody'][i]['data']['opinions'][0]['text'] # .lower() to lowercase
        opinion_texts.append(text)
    else:
        opinion_texts.append("No Text Found") ## If there is no opinion, store a placeholder string - eg. find these later with df.loc[df['text'] == 'No Text Found']

In [None]:
print(opinion_texts[4])

Let's reinsert the "opinion_texts" list into the dataframe under the column __'text'__

In [None]:
df['text'] = opinion_texts

In [None]:
df.head()

Let's drop the columns which contain the **metadata** - ie data about data (like for example "last page") - and don't seem important at the moment (since we care only about text).

In [None]:
df_cleaned = df[['decision_date', 'name_abbreviation', 'text']].copy() ## keep only these columns; .copy() avoids SettingWithCopyWarning when we add columns later
df_cleaned

# **Part 2 -  Word frequency over time**

Now that we have the data - let's try looking at simple word frequencies over time. 

Below, we'll be using code from the [case.law API example codes](https://github.com/harvard-lil/cap-examples/blob/develop/ngrams/ngrams.ipynb) on n-grams.

First, let's convert our "decision_date" column into datetime format. Datetime format lets us treat dates as dates rather than as plain strings or numbers - ie, as objects with years, months and days that we can extract and compare. This is convenient because dates do not behave like ordinary numbers: months have between 28 and 31 days, and the day count restarts at 1 with each new month. So dates need their own data type.

In [None]:
df_cleaned['decision_date'] = pd.to_datetime(df_cleaned["decision_date"])

Now let's extract the year from our newly converted datetime column. Looking at "words over year" is a good simple way of seeing __general trends__ in the law and legal language.

In [None]:
df_cleaned['year'] = df_cleaned['decision_date'].dt.year

In [None]:
df_cleaned.head()
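For intuition about why dates deserve their own type, here is a tiny sketch using Python's built-in `datetime` module rather than pandas. The date is made up for illustration.

```python
from datetime import date, timedelta

# Parse an ISO-formatted date string (a made-up example date).
d = date.fromisoformat("1899-12-31")

# Date arithmetic knows about month and year boundaries.
next_day = d + timedelta(days=1)

print(d.year)
print(next_day)
```

Adding one day rolls over the month and the year correctly, which plain string or integer arithmetic would not do. The pandas `.dt.year` accessor used above gives the same year extraction for a whole column at once.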

Let's define a function called "search_ngram" which counts all the **occurrences** (or frequencies) of a given word over a given year. Thus, for example, if we cared about the word "robbery", how many times did it appear over time in the Hawaii dataset, and what were the trends of this word over time?

In [None]:
def search_ngram(ngram):
    pairs = []
    for year in df_cleaned["year"].unique():                           ## list all unique years
        temp = df_cleaned[df_cleaned["year"] == year]["text"].tolist() ## extract all the text for a given year
        temp = " ".join(temp).lower()                                  ## make into a string via .join and lowercase
        total_number_of_words = len(temp.split(" "))                   ## count the tokenized words - use for relative frequency
        ngram_count = temp.count(ngram.lower())
        pairs.append((year, 
                      ngram_count/total_number_of_words))              ## normalize ngram count by total word count
   
    return pd.DataFrame(pairs, columns=['Year', 'Normalized Frequency'])




Let's see what the function does:

In [None]:
robbery = search_ngram("robbery")
robbery
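One thing to keep in mind: `search_ngram` uses `str.count`, which counts *substrings*, so a search for "crown" also counts the "crown" inside "crowned". If that matters for your word, a whole-word count is one alternative. The sketch below is an illustration on a made-up sentence, and the helper name `count_whole_word` is hypothetical (it is not part of the case.law example code).

```python
import re

def count_whole_word(text, ngram):
    """Count occurrences of `ngram` as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(ngram.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

sample = "The crown lands were crowned as Crown property."

substring_count = sample.lower().count("crown")      # also matches inside "crowned"
whole_word_count = count_whole_word(sample, "crown") # only the stand-alone word

# Relative frequency: whole-word count divided by the number of whitespace-separated words.
relative_frequency = whole_word_count / len(sample.split())

print(substring_count, whole_word_count, relative_frequency)
```

Whether substring or whole-word counting is "right" depends on the question: substring counting conveniently catches plurals like "robberies" only if you search for a shared stem such as "robber".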

## Plotting

Now let's plot our results:

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

plt.figure(figsize = (15,8))
sns.lineplot(data = robbery, 
             x = "Year", 
             y = "Normalized Frequency")

Let's try something else. Since the output of the search_ngram() function is a Pandas Dataframe, we can pass it directly to the plotting code below. 

A good example is the word "computer". The word was rarely used in its modern sense before electronic computers became widespread, so if it is essentially absent from older decisions and shows up only in recent ones, we can conclude that the code is capturing real trends in the text data. This can be thought of as a __"sanity check"__ - ie a simple hypothesis that we all know to be true and that the code should reproduce.

In [None]:
plt.figure(figsize = (15,8))
sns.lineplot(data = search_ngram('Computer'), 
             x = "Year", 
             y = "Normalized Frequency")

Finally, let's try one more word. To make the code easily reusable, you can create a "word_of_interest" string, which you pass to the code below.

In [None]:
word_of_interest = "Crown"

plt.figure(figsize = (15,8))
sns.lineplot(data = search_ngram(word_of_interest), 
             x = "Year", 
             y = "Normalized Frequency").set(title = word_of_interest + " normalized frequency over time");

What does this pattern of the word "Crown" represent? This seems very interesting.

You can explore the case.law data further here if you'd like or draw some more plots. 

**Note:** Again - I encourage you to check out the case.law **[example notebooks](https://github.com/harvard-lil/cap-examples)**. For instance, the **[Cartwright notebook - which shows who was Illinois' most prolific judge](https://github.com/harvard-lil/cap-examples/blob/develop/bulk_exploration/cartwright.ipynb)** is pretty fascinating!

# **Part 3 - Introduction to Regular Expressions (Regex)**

As we saw above, word frequencies are a very powerful tool for examining trends in text data - despite their simplicity. This simplicity has the added benefit of being __intuitive in interpretation__.

Nevertheless, there are other methods we can use to examine text. The most (in)famous way of searching for patterns in text is REGEX, or regular expressions. 

This can be especially useful if we want to remove some "bad patterns" - boilerplate, useless headers or footers. 

**Sets, Quantifiers, and Special Characters**

Regex (regular expressions) is a very powerful tool to find patterns in text. One of the best ways to learn Regex is by using Regex 101 to practice matching words in a body of text.

[RegexR](https://regexr.com/)

[Regex101](https://regex101.com/)

[Regex Reference Sheet](http://www.rexegg.com/regex-quickstart.html#ref)

For example, say we had a text and we wanted to find every instance of a word within that text.

In [None]:
import re  # Python's built-in regular expression module
text = "Samuel and I went down to the river yesterday! Samuel isn't a very good swimmer, though. Good thing our friend Ilya was there to help."

# the findall() function finds every instance of a specified word pattern within a text
re.findall(r'Samuel', text)

Let's say that instead of only wanting to find Samuel, we wanted to find every word in the text starting with 'Sa'. What would we do? Use pattern matching!

In [None]:
re.findall(r'Sa[a-z]*', text)

You may be wondering what the [a-z] in the Sa[a-z] pattern means. This is called a **set** in regex. When characters are within a set, such as  [abcde], any one character will match. However, regex has a special rule where [a-z] means the same thing as [abcde...xyz].

Here are some more:
~~~ 
[0-9]        any numeric character
[a-z]        any lowercase alphabetic character
[A-Z]        any uppercase alphabetic character
[aeiou]      any vowel (i.e. any character within the brackets)
[0-9a-z]     to combine sets, list them one after another 
[^...]       exclude specific characters
~~~


You still may be wondering how the entirety of Samuel was able to be matched if only one character within [a-z] would match. The answer is something called a **quantifier**!

Rules:
~~~ 
*        0 or more of the preceding character/expression
+        1 or more of the preceding character/expression
?        0 or 1 of the preceding character/expression
{n}      n copies of the preceding character/expression 
{n,m}    n to m copies of the preceding character/expression 
~~~
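Here is a short sketch of how the quantifier changes what a set matches, using a shortened version of the example text from above.

```python
import re

text = "Samuel and I went down to the river yesterday! Samuel isn't a very good swimmer, though."

# Without a quantifier, the set [a-z] matches exactly one character.
one_char = re.findall(r"Sa[a-z]", text)

# With *, the set may repeat zero or more times, so the whole name is matched.
whole_word = re.findall(r"Sa[a-z]*", text)

# {n} asks for exactly n repetitions: here, nine lowercase letters in a row.
nine_letters = re.findall(r"[a-z]{9}", text)

print(one_char)
print(whole_word)
print(nine_letters)
```

Comparing the first two results shows that it is the `*`, not the set itself, that lets the pattern stretch across the whole name.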

Say that now, you only wanted to return Samuel when the name was mentioned at the beginning of the text.

In [None]:
re.findall(r'^Samuel', text)

**Special characters**, such as the *^* which was just used in the pattern above, have a special meaning inside a pattern. Some of them match a position in the string rather than a particular character. For example, *^* matches the subsequent pattern only if it is at the beginning of the string. This is why only a single 'Samuel' was returned.

Rules:
~~~ 
.         any single character except newline character
^         start of string
$         end of entire string
\n        new line
\r        carriage return
\t        tab

~~~
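A small sketch of these special characters on a made-up two-line string. The `re.MULTILINE` flag is an extra that is not in the table above: it makes `^` and `$` match at the start and end of every line instead of only the whole string.

```python
import re

text = "Samuel went home.\nSamuel slept."

# ^ anchors to the start of the whole string by default.
start_only = re.findall(r"^Samuel", text)

# With re.MULTILINE, ^ also matches right after each newline.
every_line = re.findall(r"^Samuel", text, flags=re.MULTILINE)

# $ anchors to the end of the whole string; \. is a literal period.
end_match = re.findall(r"slept\.$", text)
not_at_end = re.findall(r"home\.$", text)

# . matches any single character except a newline.
any_char = re.findall(r"S.m", text)

print(start_only, every_line, end_match, not_at_end, any_char)
```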

**Python RegEx Methods**

* `re.findall(pattern, string)`: Returns all phrases that match your pattern in the string.

* `re.sub(pattern, replacement, string)`: Return the string after replacing the leftmost non-overlapping occurrences of the pattern in string with replacement

* `re.split(pattern, string)`: Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. 
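A quick sketch of `re.sub` and `re.split` on a made-up header line (the judge name "DOE" is a placeholder, not taken from the dataset):

```python
import re

line = "OPINION OP THE COURT BY   DOE, J."

# re.sub: replace the stand-alone word "OP" with "OF".
# \b is a word boundary, so the "OP" inside "OPINION" is left alone.
fixed = re.sub(r"\bOP\b", "OF", line)

# re.split: split on runs of one or more whitespace characters.
parts = re.split(r"\s+", fixed)

print(fixed)
print(parts)
```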

**Pandas RegEx Methods**

Pandas also has its own [__built in methods__](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/) of working with regex (without calling a separate "re" library)

* `df['column'].str.extract(pattern)`: Extract pattern into a new column

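* `df['column'].str.replace(pattern, replacement, regex=True)`: Replace the pattern with something else (usually a space or an empty string if you want to remove it). In recent versions of pandas, `regex=True` must be passed for the pattern to be treated as a regular expression.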


------------------

## Practical Regex cleaning example

Let's go back to our case.law dataset and try to "clean" it of some repetitions in the data. 

For example - the __"OPINION OF THE COURT"__ in the beginning of every decision is not really useful information. 

**How would we use Regex to clean it up?**


In [None]:
df = df_cleaned

In [None]:
df.tail()

In [None]:
print(df.text[200][:1000]) 
print("\n")
print(df.text[300][:1000])
print("\n")
print(df.text[1][:1000])

As you can see, sometimes it's written as "OPINION __OP__ THE COURT" with a P, sometimes it's capitalized and sometimes it isn't. 

In my experience, the easiest way to work with these patterns is to copy them to [__RegexR__](https://regexr.com/), and work there on a small sample (like the one above). 

The important thing to note is that you must **search for patterns** in the text that will enable you to clean up the data. But __be very careful__ because REGEX can be very powerful - making an incorrect pattern could remove text that you didn't want to lose. 

### First Try - an obvious pattern

After testing out some patterns, the simplest pattern we could use is as follows:

In [None]:
pattern = r'(Opinion of the Court|OPINION OP THE COURT BY|OPINION OF THE COURT BY)' # remember, .str.extract() needs the parentheses (a capture group)
 
df['pattern'] = df['text'].str.extract(pattern, expand = True)

In [None]:
df.head(10)

And actually... it's pretty good!

We can conditionally check the ones that "weren't captured" - ie **NaN** - by subsetting the dataframe. 

We do that by creating a new dataframe which contains only the rows that have **NaN** in the "pattern" column

In [None]:
df_test = df[df['pattern'].isna()]
df_test

As we can see, it's not really capturing all the useless text - see row 18220 for instance. 
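One possible way to tighten the first pattern is to fold the three alternatives into a single case-insensitive expression. This is only a sketch and not the pattern used in the rest of this lab; the sample strings (including the judge name "DOE") are made up, so test any such pattern on real rows before using it for cleaning.

```python
import re

# O[FP] accepts both "OF" and the "OP" variant; (?: BY)? makes " BY" optional.
header = re.compile(r"OPINION O[FP] THE COURT(?: BY)?", flags=re.IGNORECASE)

samples = [
    "Opinion of the Court by DOE, J.",
    "OPINION OP THE COURT BY DOE, J.",
    "Per curiam.",
]

matches = []
for s in samples:
    m = header.search(s)
    matches.append(m.group(0) if m else None)

print(matches)
```

Rows where the result is `None` correspond to the **NaN** rows we inspected above.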

### Second try 

Another pattern that emerges is that the first line of the text is usually useless. The first line is separated from the rest by a "\n" (newline) character. We could try that. 

In [None]:
pattern = r"(^[\s\S]+?(?=\n))"

df['pattern2'] = df['text'].str.extract(pattern)

In [None]:
df

Again, not a perfect pattern. The first line often also contains the name of the judge, so removing it means we lose that information as well. 

But this is just an example. Now that we tested the pattern using `.str.extract()`, we can proceed to remove the pattern using `.str.replace()` by replacing the pattern with a space.

In [None]:
df['cleaned_text'] = df['text'].str.replace(pattern, " ", regex=True) ## regex=True is required in recent pandas versions

In [None]:
df.head()
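To see on a tiny string what the second pattern actually removes, here is a sketch using the built-in `re` module instead of pandas. The sample text is made up.

```python
import re

pattern = r"(^[\s\S]+?(?=\n))"

sample = "OPINION OF THE COURT BY DOE, J.\nThe plaintiff appeals.\nAffirmed."

# What the pattern captures: everything from the start up to the first newline.
removed = re.findall(pattern, sample)

# What is left after replacing that first line with a space.
cleaned = re.sub(pattern, " ", sample)

# A text with no newline at all is left untouched, because the lookahead (?=\n) never succeeds.
untouched = re.sub(pattern, " ", "Affirmed.")

print(removed)
print(repr(cleaned))
print(untouched)
```

Printing with `repr()` makes the remaining newline characters visible, which helps when checking exactly what a cleaning pattern did.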

# **Part 4 - Concordances and Collocations**

## Concordances
To continue this examination of simple NLP methods, let us now look at concordances.

A concordance lists every instance of a given word, together with some of its context. Concordances are __fundamentally important__ if we want to understand the meaning of a word in a context.
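To make the idea concrete before we bring in a library, here is a minimal keyword-in-context sketch in plain Python. The function name `simple_concordance` and the sample sentence are made up for illustration; the NLTK version used in this lab is more polished.

```python
def simple_concordance(tokens, word, window=2):
    """Return each occurrence of `word` with `window` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = "the petition alleged fraud and the court found no fraud in the sale".split()

for line in simple_concordance(tokens, "fraud"):
    print(line)
```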

Below, we look up a word such as "fraud" in the opinion texts of the Hawaii dataset by taking our `text` object, followed by a period, then the method name `concordance`, and then placing the word of interest in parentheses.


In order to do concordances, we'll import __NLTK__, which is a good simple library for doing NLP tasks in Python. You will notice that we will be importing a lot of packages - which is just a way of showing that the NLP space in python is incredibly diverse and there are numerous libraries which can do a lot of different things. 

In [None]:
import nltk
nltk.download('punkt')
from nltk.text import Text
from nltk.tokenize import word_tokenize # import tokenizer function from nltk


In [None]:
df.head()

In [None]:
cases = df['cleaned_text'][:100]  ## Take the cleaned text of the first 100 cases from the dataframe

Now, we must convert these cases into a single string

In [None]:
case_corpus = " ".join(cases).lower()  

In [None]:
case_corpus[:1000]

Now we tokenize - more on this in the next lab.

In [None]:
case_corpus_tokenized = word_tokenize(case_corpus)
text = Text(case_corpus_tokenized)

In [None]:
text.concordance('fraud')

Recall that the word "crown" was used a lot in the past in Hawaii. Why is that? 

Let's examine.

In [None]:
text.concordance("crown", 
                 width=110) ## width is the total number of characters shown on each concordance line

In [None]:
## if we want phrases we need to put them in a list
text.concordance(["good", "faith"],  
                 width=110)

**Write your own code to explore the occurrence of other words of interest**

In [None]:
...

## Collocations
Collocations are expressions of multiple words which commonly co-occur. Here we score them using **Pointwise Mutual Information** (PMI), which compares how often two events (in this case, two words) actually occur together with how often we would expect them to occur together if they were independent. This gives us a measure of association between words, and can reveal things like fixed phrases or words that tend to co-occur.

[A pretty good explanation of PMI](https://stats.stackexchange.com/a/522504) is given on **stackexchange.**

Note: stackexchange (and google generally) is a wonderful resource for all things relating to NLP and statistics. Although for this course, I do not emphasize concrete statistical knowledge or mathematical formulas, you should still get some **intuitive** understanding of what these measures like PMI do. 


In [None]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
fourgram_measures = nltk.collocations.QuadgramAssocMeasures()

In [None]:
bigram_finder = BigramCollocationFinder.from_words(case_corpus_tokenized,
                                                   window_size = 5)

bigram_finder.apply_freq_filter(5)  # appear at least N times
bigram_finder.nbest(bigram_measures.pmi, 20) # show top 20 PMI scoring

Do you see any interesting patterns that emerge from the dataset?

We can also examine the **score of these bigrams as they are in PMI**. 

The [mathematics](https://en.wikipedia.org/wiki/Pointwise_mutual_information) are not that important for the purposes of the class - as we care more about intuition behind these measures.

It suffices to say that:
* A larger **positive PMI** score implies that word1 (event1) co-occurs with word2 (event2) more often than we would expect by chance. 
* A **PMI score of 0** means that the two words (events) are independent. 
* A **negative PMI** score means that the two words co-occur less often than we would expect by chance. These negative scores tend to be unreliable unless the corpus is very large, so in practice [Positive PMI (or __PPMI__)](https://stats.stackexchange.com/a/284573) is usually used (where negative values are replaced with 0).

To learn more about this, see generally the Jurafsky and Martin reference text.
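For intuition, here is a hand-rolled PMI calculation on a made-up token list, using only adjacent word pairs. It is a simplified sketch: it is not NLTK's implementation and ignores the `window_size` option, but it follows the same basic formula, log2 of P(w1, w2) divided by P(w1) times P(w2).

```python
import math
from collections import Counter

tokens = "good faith good faith bad faith good cause".split()

unigrams = Counter(tokens)                  # how often each word occurs
bigrams = Counter(zip(tokens, tokens[1:]))  # how often each adjacent pair occurs
n = len(tokens)                             # simplification: same total for words and pairs

def pmi(w1, w2):
    """Pointwise mutual information of an adjacent pair that occurs at least once."""
    p_xy = bigrams[(w1, w2)] / n
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("good", "faith"), 3))
print(round(pmi("bad", "faith"), 3))
```

Notice that the pair containing the rare word "bad" scores higher than the much more frequent "good faith". PMI tends to favor rare words, which is one reason for applying a frequency filter such as `apply_freq_filter` before ranking.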

In [None]:
for i in bigram_finder.score_ngrams(bigram_measures.pmi):
    print(i)

Let's clean the text a bit - remove numbers, for example. Perhaps this will reveal more patterns? 

__We'll get more into "preprocessing raw text" in our next lab__ - think of this as a simple introduction.

In [None]:
from gensim.parsing.preprocessing import strip_numeric

case_corpus_tokenized = word_tokenize(strip_numeric(case_corpus)) ## we added the strip_numeric function
text = Text(case_corpus_tokenized)

In [None]:
bigram_finder = BigramCollocationFinder.from_words(case_corpus_tokenized, 
                                                   window_size = 10)

bigram_finder.apply_freq_filter(5) 
bigram_finder.nbest(bigram_measures.pmi, 30)

Let's try it with **Trigrams**

In [None]:
# Trigram
trigram_finder = TrigramCollocationFinder.from_words(case_corpus_tokenized,
                                                       window_size = 5)

In [None]:
trigram_finder.apply_freq_filter(5) 
trigram_finder.nbest(trigram_measures.pmi, 30)

Try doing this on **fourgrams** on your own

In [None]:
# FourGram
...

## Phrasemachine

[Phrasemachine](https://github.com/slanglab/phrasemachine) is another convenient way of finding multi-word expressions. Again, the mathematics of it is not important - what's important is the functionality and how to apply it in python.

To work with "Phrasemachine" we first will have to import Spacy - which is a library with very powerful NLP tools - more on it in subsequent labs. 

In [None]:
import spacy
import phrasemachine
nlp = spacy.load("en_core_web_sm")


Spacy is incredibly powerful as it allows us to examine all the linguistic aspects of a given string. We'll cover this more in the upcoming lab.

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, 
          token.lemma_, 
          token.pos_) 
        

Now that we see the power of Spacy, we can apply it to our corpus.

In order to run the phrasemachine, we need to **tokenize** the case_corpus text first. Tokenization is essentially a process of splitting a "string" into separate "tokens". 

* For example - __"this string needs to be tokenized"__
* Will become - **["this", "string", "needs", "to", "be", "tokenized"]**

More on this in the next lab.
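As a rough sketch of what tokenization does, here are two simple approaches using only built-in Python. These are simplifications: real tokenizers such as the ones in NLTK and spaCy handle many more edge cases.

```python
import re

sentence = "this string needs to be tokenized"

# Simplest approach: split on whitespace.
tokens = sentence.split()

# Slightly better: keep words together and treat punctuation marks as their own tokens.
with_punctuation = re.findall(r"\w+|[^\w\s]", "Good faith, not fraud.")

print(tokens)
print(with_punctuation)
```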

In [None]:
# create a spacy NLP pipeline
doc = nlp(case_corpus[:100000])        ## limit our text to 100000 characters - otherwise it will take too long      
tokens = [token.text for token in doc] ## tokenize
pos = [token.pos_ for token in doc]    ## tag parts of speech

In [None]:
print(case_corpus[:128])
print(tokens[:20])
print(pos[:20])

In [None]:
phrases = phrasemachine.get_phrases(tokens=tokens, 
                                    postags=pos)


Notice the data structure of the "phrases" - it's a __dictionary__. 

The "counts" key is itself a dictionary known as a [__"counter class"__](https://docs.python.org/3/library/collections.html#collections.Counter) - which counts the frequency of an item as its "value" in the "key-value" pair meaning of the term. 

Because it's a "counter" we can apply the ".most_common" method on this "counter" to see the most common phrases that emerge. 


In [None]:
phrases['counts'].most_common(30)
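If the `Counter` class is new to you, here is a tiny stand-alone sketch with a made-up list of phrases, showing how counting and `.most_common()` work.

```python
from collections import Counter

# A made-up list of phrases standing in for phrasemachine output.
found_phrases = [
    "good faith", "supreme court", "good faith",
    "circuit court", "good faith", "supreme court",
]

phrase_counts = Counter(found_phrases)  # maps each phrase to how often it occurs
top_two = phrase_counts.most_common(2)  # list of (phrase, count), most frequent first

print(top_two)
```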


Note: in order to count words using the "counter" class, we had to tokenize our text. Again, we will cover more on this in the next lab.