## 1. Importing Libraries and Vocabulary

In [1]:
import json
import gensim
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import strip_non_alphanum
from nltk import word_tokenize
import nltk
import numpy as np

In [2]:
# you only need to do this once!
# for this tutorial, I downloaded "popular", though I don't know if I could get away with something smaller
#import nltk
#nltk.download()

## 2. Downloading Data

The caselaw download is available at https://case.law

Caselaw proves "all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States." You might want to spend some time on the site getting more familar with the data and thinking how you might want to analyze it. 

Data is available as an API and as a bulk JSON or XML formatted download. Because we will be working with large numbers of records, we'll use the bulk download. Public downloads are avialable for Arkansas, Illinios, and New Mexico. This workshop will use the JSON format. 

https://case.law/bulk/download/

To get the JSON formatted data, download the TEXT version (rather than the XML version). To extract the data, you'll need to first unzip the download file "Illinois-20190416-text.zip" (the exact name of this file contains a date stamp, yours may be different). The unzipped file will contain a "data" folder that contains a "data.json.xz" file, which is another compressed file. You'll need to unpack this file - one way is to use the xz utility. 

%: xz -d data.jsonl.xz

You can work with it using the json library in python. For this workshop, I renamed it illinois_data.json1 and put it in a folder named caselaw_data, since I plan to work with data from other states as it becomes available. 

## 3. Importing Data

In the next step, we'll read the data into a list using the json library. One thing to keep in mind here is that the file is very large and will likely overwhelm your laptop. You may eventually want to move to a cluster to process larger amounts of data. For now, let's limit the number of rows to 1000 to get the code up and running. We can increase it later. Keep in mind, the data may be chronologically ordered, so the first 1000 is not a random sampling of this data!

In [3]:
max_records = 1000

data = []
with open('caselaw_data/illinois_data.jsonl') as f:
    for i, line in enumerate(f):
        data.append(json.loads(line))
        #if i % 1000 == 0:
        #    print(i, "processed")
        # set max of 10,000 rows for now, increase or remove this to get the entire data set
        if i > max_records:
            break

The approach to parsing a "jsonl" differs slightly from parsing a single JSON document - in the jsonl format, each row containes a line object representing a JSON document. Here's the first one...

In [4]:
data[0]

{'casebody': {'data': {'attorneys': ['Michael Ratliff-El, of Pontiac, appellant pro se.',
    'Lisa Madigan, Attorney General, of Chicago (Joel D. Bertocchi, Solicitor General, and Richard S. Huszagh, Assistant Attorney General, of counsel), for appellee.'],
   'head_matter': 'MICHAEL RATLIFF-EL, Plaintiff-Appellant, v. KENNETH R. BRILEY, Defendant-Appellee.\nThird District\nNo. 3—01—0727\nOpinion filed May 2, 2003.\nMichael Ratliff-El, of Pontiac, appellant pro se.\nLisa Madigan, Attorney General, of Chicago (Joel D. Bertocchi, Solicitor General, and Richard S. Huszagh, Assistant Attorney General, of counsel), for appellee.',
   'judges': [],
   'opinions': [{'author': 'PRESIDING JUSTICE McDADE',
     'text': 'PRESIDING JUSTICE McDADE\ndelivered the opinion of the court:\nMichael Ratliff-El filed a complaint for mandamus relief (735 ILCS 5/14 — 101 et seq. (West 2000)) naming Kenneth R. Briley, warden of the Stateville Correctional Center, as the defendant. The trial court granted the

The JSON libary in python stores each document as a dictionary object type

In [5]:
type(data[0])

dict

## 4. Extracting and Cleaning Data

For this workshop, we'll work only with the text of the court decision. If you're interested in other fields, you can expand the loop below to extract other elements of the JSON document.

The exact strategy you employ to clean data can be very dependent on the type of analysis you want to do. This workshop will cover a few common strategies:

* removing non alphanumeric characters
* removing common stop words
* removing very short or long words
* removing purely numerical data
* removing words that are not in a particular corpus (such as the oxford english dictionary)

Again, these are common techniques, but depending on your study and data, they may not be necessary or even advisable. The strategy you take will emerge from your data and the type of analysis you plan to do.


### Removing non alphanumeric characters

This loop uses the gemsin library to strip non alphanumeric characters. It also collapses multiple spaces into a single space.

In [6]:
tokenized_sentences = []
        
opinion_texts = []
for i in range(len(data)):
    if data[i]['casebody']['data']['opinions']:
        text = data[i]['casebody']['data']['opinions'][0]['text'].lower()
        text = strip_non_alphanum(text)
        opinion_texts.append(' '.join(text.split()))
       

Here's the output from the first record. Take a look at this as compared to the text from the raw JSON file. Non alphanumeric characters are gone, but you can still read the text and understand what it means. However, you may already be able to see how cleaning data can result in loss of precision.

As an exercise, take a look at the original document and consider if there are any terms you'd be interested in preserving that may be lost through the data transformation here. 

In [7]:
opinion_texts[0]

'presiding justice mcdade delivered the opinion of the court michael ratliff el filed a complaint for mandamus relief 735 ilcs 5 14 101 et seq west 2000 naming kenneth r briley warden of the stateville correctional center as the defendant the trial court granted the defendant s motion to dismiss under section 2 615 of the code of civil procedure 735 ilcs 5 2 615 west 2000 on appeal ratliff el argues that the trial court erred by dismissing his complaint because it stated a cause of action for mandamus relief we affirm background ratliff el is a prisoner at stateville in his complaint ratliff el contended that the defendant had a clear duty to follow the illinois administrative procedure act act 5 ilcs 100 1 1 et seq west 2000 during adjustment committee and grievance proceedings at stateville as a result of such proceedings ratliff el had been disciplined with revocation of good time credits ratliff el submitted that he was entitled to mandamus relief because the defendant had failed t

### Remove stopwords

The above text still contains a large number of "stop words", common words that may not too ubiquitous to be useful (emphasis on the *may*). Words like "a", "they", "and" increase the word count and require extra processing time and computing resources, and may not be helpful in your analysis. 

The loop below uses the gensim utility to remove common stopwords

In [8]:
for i in range(len(opinion_texts)):
    opinion_texts[i] = remove_stopwords(opinion_texts[i])

In [9]:
opinion_texts[0]

'presiding justice mcdade delivered opinion court michael ratliff el filed complaint mandamus relief 735 ilcs 5 14 101 et seq west 2000 naming kenneth r briley warden stateville correctional center defendant trial court granted defendant s motion dismiss section 2 615 code civil procedure 735 ilcs 5 2 615 west 2000 appeal ratliff el argues trial court erred dismissing complaint stated cause action mandamus relief affirm background ratliff el prisoner stateville complaint ratliff el contended defendant clear duty follow illinois administrative procedure act act 5 ilcs 100 1 1 et seq west 2000 adjustment committee grievance proceedings stateville result proceedings ratliff el disciplined revocation good time credits ratliff el submitted entitled mandamus relief defendant failed follow act procedures alleged defendant s failure follow act procedures violated rights defendant filed section 2 615 motion dismiss ratliff el s mandamus complaint failure state cause action motion dismiss defend

### Remove very long or short words
### Remove purely numerical data

The text still contains a large number of short or purely numeric characters. These may or may not contain valuable information - for now, we'll remove them. 

In [10]:
for i in range(len(opinion_texts)):
    opinion_texts[i] = ' '.join([s for s in opinion_texts[i].split() if len(s) > 1 or s.isdigit() == False])

In [11]:
opinion_texts[0]

'presiding justice mcdade delivered opinion court michael ratliff el filed complaint mandamus relief 735 ilcs 14 101 et seq west 2000 naming kenneth r briley warden stateville correctional center defendant trial court granted defendant s motion dismiss section 615 code civil procedure 735 ilcs 615 west 2000 appeal ratliff el argues trial court erred dismissing complaint stated cause action mandamus relief affirm background ratliff el prisoner stateville complaint ratliff el contended defendant clear duty follow illinois administrative procedure act act ilcs 100 et seq west 2000 adjustment committee grievance proceedings stateville result proceedings ratliff el disciplined revocation good time credits ratliff el submitted entitled mandamus relief defendant failed follow act procedures alleged defendant s failure follow act procedures violated rights defendant filed section 615 motion dismiss ratliff el s mandamus complaint failure state cause action motion dismiss defendant argued act a

### Removing words that are not in a particular corpus (such as the oxford english dictionary)

You may have noticed that there are a number of words such as "jj" above that are nor part of the standard english language. These may be the result of data cleaning (such as artifacts of markup language). They may also be the result of legal language or other important data. As always, cleaning data is a decision you make that may or may not be necessary or desirable. 

In [12]:
words = set(nltk.corpus.words.words())

for i in range(len(opinion_texts)):
    opinion_texts[i] = ' '.join([s for s in opinion_texts[i].split() if s in words])

In [13]:
opinion_texts[0]

'justice opinion court el complaint mandamus relief west naming r warden correctional center defendant trial court defendant s motion dismiss section code civil procedure west appeal el trial court complaint stated cause action mandamus relief affirm background el prisoner complaint el defendant clear duty follow administrative procedure act act west adjustment committee grievance result el revocation good time el mandamus relief defendant follow act defendant s failure follow act defendant section motion dismiss el s mandamus complaint failure state cause action motion dismiss defendant act apply adjustment committee grievance department doc trial court defendant s motion dismiss el analysis motion dismiss mandamus action doc duty follow act el state agency follow act doc state agency adjustment committee grievance meet statutory definition defendant clear duty follow act doc adjustment committee grievance trial court defendant s motion dismiss mandamus cause action defendant act clau

## 5. Tokenize the sentences

We've cleaned the data, but it is still stored in large blocks of text. The next step, *tokenization*, will convert each line into a list of individual words. 

In [14]:
 for text in opinion_texts:
    tokenized_sentences.append(word_tokenize(text))

In [15]:
tokenized_sentences[0][:50]

['justice',
 'opinion',
 'court',
 'el',
 'complaint',
 'mandamus',
 'relief',
 'west',
 'naming',
 'r',
 'warden',
 'correctional',
 'center',
 'defendant',
 'trial',
 'court',
 'defendant',
 's',
 'motion',
 'dismiss',
 'section',
 'code',
 'civil',
 'procedure',
 'west',
 'appeal',
 'el',
 'trial',
 'court',
 'complaint',
 'stated',
 'cause',
 'action',
 'mandamus',
 'relief',
 'affirm',
 'background',
 'el',
 'prisoner',
 'complaint',
 'el',
 'defendant',
 'clear',
 'duty',
 'follow',
 'administrative',
 'procedure',
 'act',
 'act',
 'west',
 'adjustment',
 'committee',
 'grievance',
 'result',
 'el',
 'revocation',
 'good',
 'time',
 'el',
 'mandamus',
 'relief',
 'defendant',
 'follow',
 'act',
 'defendant',
 's',
 'failure',
 'follow',
 'act',
 'defendant',
 'section',
 'motion',
 'dismiss',
 'el',
 's',
 'mandamus',
 'complaint',
 'failure',
 'state',
 'cause',
 'action',
 'motion',
 'dismiss',
 'defendant',
 'act',
 'apply',
 'adjustment',
 'committee',
 'grievance',
 'departm

## 6. Fitting/Training a Model

In [16]:
model = gensim.models.Word2Vec(tokenized_sentences, size=100, window=5, min_count=1, 
                               sg=1, alpha=0.025, iter=5, batch_words=10000, workers=1)

## 7. Investigate Word Embeddings

### So... what is produced?

Gensim uses a neural network to assign a multidimensional vector that captures the semantic relationship of this word to all the other words in the corpus. 

Here's what these vectors look like:

In [17]:
model.wv['bad']

array([ 0.03643684, -0.17218417, -0.5537324 , -0.21018052,  0.15187532,
        0.00651479,  0.05479949, -0.05293132, -0.42141828,  0.13815399,
       -0.21134709,  0.02931935,  0.28324974, -0.02381333,  0.05207988,
        0.50027734,  0.21709028, -0.306635  ,  0.19079474,  0.6014834 ,
       -0.41312742,  0.0956345 , -0.35529554, -0.03092842,  0.22190204,
        0.70499486,  0.03091049,  0.25539234, -0.45290878,  0.09679531,
        0.04670156,  0.13033292, -0.3066518 , -0.12561986, -0.00083316,
        0.16429497,  0.19067389, -0.26342615, -0.1275884 , -0.03583375,
        0.18348084,  0.25834987,  0.21210994, -0.4878925 , -0.10983816,
       -0.23589809, -0.47768015, -0.14076586, -0.18880452, -0.29701078,
       -0.01390715,  0.04932644,  0.51821655, -0.01656782,  0.3851205 ,
       -0.16695963,  0.34865925,  0.00842199,  0.4643649 ,  0.05221093,
        0.3787423 ,  0.38864246,  0.6665949 ,  0.31841886, -0.48745936,
        0.4202707 , -0.0072457 , -0.46270207, -0.6068896 ,  0.21

### Comparing semantic similarity of words

The similarity of one word to another in the corpus can be calculated using the cosine similarity of the vectors assigned to each word. Note that the semantic similarity may not capture all aspects of a word. For instance, antonymns may be used in such similar sentence structure that they have similar word vectors, even though they have opposite meanings. 

First, let's use Gensim to find the words with the highest cosine similarity to an existing word.

In [18]:
model.wv.most_similar('bad')

  if np.issubdtype(vec.dtype, np.int):


[('faith', 0.790492832660675),
 ('unconscionably', 0.772097647190094),
 ('unconscionable', 0.7203121185302734),
 ('debarment', 0.7001519799232483),
 ('inducement', 0.6802564859390259),
 ('partiality', 0.6730676889419556),
 ('good', 0.6723031997680664),
 ('unclean', 0.6700391173362732),
 ('dealing', 0.6604183912277222),
 ('unexcused', 0.6552610993385315)]

### Words with different contexts

You may notice that many of the words associated with "bad" represent different contexts. Bad can mean immoral, insincere, or untrue. We can use gensim to remove certain contexts or emphasize others when searching for words with similar semantic usage.

In [19]:
model.wv.most_similar(positive=['bad','trouble'])

  if np.issubdtype(vec.dtype, np.int):


[('hesitant', 0.8111292123794556),
 ('inquiring', 0.8083885908126831),
 ('fay', 0.8050452470779419),
 ('apologize', 0.799991250038147),
 ('raving', 0.7999798059463501),
 ('blamed', 0.7987805008888245),
 ('embarrassed', 0.7986078262329102),
 ('ladies', 0.7967567443847656),
 ('inculpate', 0.7953906655311584),
 ('glad', 0.795342206954956)]

In [20]:
model.wv.most_similar(positive=['bad','trouble'])

  if np.issubdtype(vec.dtype, np.int):


[('hesitant', 0.8111292123794556),
 ('inquiring', 0.8083885908126831),
 ('fay', 0.8050452470779419),
 ('apologize', 0.799991250038147),
 ('raving', 0.7999798059463501),
 ('blamed', 0.7987805008888245),
 ('embarrassed', 0.7986078262329102),
 ('ladies', 0.7967567443847656),
 ('inculpate', 0.7953906655311584),
 ('glad', 0.795342206954956)]

You can combine positive and negative context as you search for words with similar vectors.

In [21]:
model.wv.most_similar(positive=['bad','faith'], negative=['bad','trouble'])

  if np.issubdtype(vec.dtype, np.int):


[('unconscionable', 0.35328567028045654),
 ('covenant', 0.32791322469711304),
 ('good', 0.32210659980773926),
 ('dealing', 0.31609469652175903),
 ('franchise', 0.28974485397338867),
 ('uniformity', 0.2852109670639038),
 ('arbitrary', 0.2817656099796295),
 ('unjust', 0.2801235318183899),
 ('expulsion', 0.2787911891937256),
 ('taxing', 0.27601128816604614)]

### Semantic similarity of two words

You can use Gensim to retrieve the semantic similarity of two words. This operation will return the cosine similarity between the terms.

In [22]:
print(model.wv.similarity('bad', 'faith'))

0.79049283


  if np.issubdtype(vec.dtype, np.int):


### Calculating the cosine similarity

We can get the vectors for each word and calculate the cosine directly

In [23]:
unconscionably_vec = model.wv['bad']
bad_vec = model.wv['faith']

cos_sim = np.dot(bad_vec, unconscionably_vec) / (np.linalg.norm(bad_vec) * np.linalg.norm(unconscionably_vec))

print(cos_sim)

0.79049283


### Working directly with vectors

Gensim has an API to work direcly with vectors rather than terms. 

In [24]:
model.wv.similar_by_vector('bad', 10)

  if np.issubdtype(vec.dtype, np.int):


[('faith', 0.790492832660675),
 ('unconscionably', 0.772097647190094),
 ('unconscionable', 0.7203121185302734),
 ('debarment', 0.7001519799232483),
 ('inducement', 0.6802564859390259),
 ('partiality', 0.6730676889419556),
 ('good', 0.6723031997680664),
 ('unclean', 0.6700391173362732),
 ('dealing', 0.6604183912277222),
 ('unexcused', 0.6552610993385315)]

In [25]:
model.wv.similar_by_vector(bad_vec, 10)

  if np.issubdtype(vec.dtype, np.int):


[('faith', 1.0),
 ('good', 0.8033825159072876),
 ('bad', 0.7904927730560303),
 ('unconscionably', 0.6902737617492676),
 ('unconscionable', 0.6809736490249634),
 ('dealing', 0.679458737373352),
 ('debarment', 0.6715279817581177),
 ('effort', 0.6250653266906738),
 ('negotiate', 0.6236554384231567),
 ('magistrate', 0.61832594871521)]

In [26]:
print(len(bad_vec))
print(type(bad_vec))
print(bad_vec)

100
<class 'numpy.ndarray'>
[-0.18664266  0.34442145 -0.5971364  -0.21855739  0.36882696 -0.11404219
 -0.03792524 -0.08468732 -0.58372694  0.0603081   0.00891511 -0.10518771
  0.4472261  -0.02930365 -0.0365122   0.5461091   0.1892696  -0.25511888
  0.5638195   0.6129434  -0.6074797   0.06631307  0.03017858  0.1082346
  0.18708515  0.60511726  0.05742465  0.0660982  -0.30348986  0.18216199
 -0.0150054   0.06177811  0.16687822 -0.3641817  -0.6101002  -0.05959177
  0.2647674  -0.5743243  -0.31134284  0.02320238  0.7611934   0.19185841
  0.17334846 -0.5994322  -0.3598473  -0.03490052 -0.70460284 -0.8259293
 -0.26805013 -0.20060232  0.46817958  0.39286086  0.9499842  -0.32680005
  0.6463698  -0.29781976  0.4368817   0.2784824   0.47075573 -0.05048323
  0.6263575   0.35134163  0.787858    0.46513265 -0.70617217  0.3970767
 -0.47805664 -0.23859577 -0.6347393   0.34582186 -0.00752765 -0.3455703
 -0.14830695 -0.01692032  0.04650878  0.40777424 -0.20334414  0.3362769
 -1.0188442   0.04589049  0.

You can create vectors yourself and find terms with semantic similarity

In [27]:
vec = np.random.uniform(-1, 1, 100)
model.wv.similar_by_vector(vec, 10)

  if np.issubdtype(vec.dtype, np.int):


[('government', 0.1932455152273178),
 ('convenience', 0.1886080503463745),
 ('political', 0.18698269128799438),
 ('committee', 0.17843322455883026),
 ('accused', 0.16968272626399994),
 ('mission', 0.16647890210151672),
 ('economic', 0.16288602352142334),
 ('active', 0.16048485040664673),
 ('fetus', 0.15829981863498688),
 ('end', 0.15676000714302063)]

## 7. Plotting And Visualization

For this exercise, we'll visualize the degree of positive and netagive semantic association of virtues and vices with "good" and "bad".

I left a few of the virtues and vices out for this exercise - you may need to increase the number of records you process to include all these words, and you may need to increase it more to get good or interesting associations. For now, I'd recommend you keep it small and get it running, you can expand once you have the code working.

In [28]:
virtues_vices = ['courage', 'charity', 'honor', 'truth', 'honesty', 'greed', 'cruelty', 'pride']

In [None]:
bad_score = [model.wv.similarity(['bad'], word) for word in virtues_vices]
good_score = [model.wv.similarity(['good'], word) for word in virtues_vices]

  if np.issubdtype(vec.dtype, np.int):


In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
_, ax = plt.subplots(figsize=(20,20))
ax.scatter(bad_score, good_score, alpha=1, color='b')
for i in range(len(virtues_vices)):
    ax.annotate(virtues_vices[i], (bad_score[i], good_score[i])).set_fontsize(16)
ax.set_xlim(.25, 1.1)
ax.set_ylim(.25, 1.1)
ax.yaxis.label.set_fontsize(16)
ax.xaxis.label.set_fontsize(16)
plt.xlabel('bad score')
plt.ylabel('good score')
for item in (ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(16)
plt.plot([0, 1], [0, 1], linestyle='--');