# CaseLaw dataset to assist with Law-Research - EDA
---
<dl>
  <dt>Acquiring the dataset</dt>
  <dd>We start with a dataset of all U.S. cases, both to train on and as a proof of concept.</dd>
  <dd>The dataset is available in XML format; we will load it into MongoDB or Firebase, depending on how unstructured it turns out to be.</dd>
  <dd>Dataset URL: https://case.law/</dd>

  <dt>Research</dt>
  <dd>We are looking into <em>NLP</em>, <em>LSTM</em> and <em>Sentiment Analysis</em>.</dd>
</dl>

In [7]:
import jsonlines
from pymongo import MongoClient

In [8]:
# Connect to the local MongoDB instance (default host and port).
client = MongoClient()
db = client.legal_ai
cases = db.cases

In [9]:
some_date = '1820-01'

In [10]:
print(int(some_date[0:4])<1950)

True


In [16]:
id_saved = []
with jsonlines.open('../data.jsonl') as reader:
    for obj in reader:
        # Keep only cases decided after 1950 (decision_date starts with the year).
        if int(obj['decision_date'][0:4]) > 1950:
            case_id = cases.insert_one(obj).inserted_id
            id_saved.append(case_id)

In [17]:
len(id_saved)

30075
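
The year check used while loading can be pulled into a small helper, which makes the cutoff explicit and easy to test. This is a minimal sketch, assuming `decision_date` is always a string that starts with a four-digit year (as in `'1820-01'`); `decision_year`, `is_after`, and `sample_cases` are hypothetical names that do not appear in the notebook.

```python
def decision_year(decision_date):
    """Return the four-digit year at the start of a date string such as '1820-01'."""
    return int(decision_date[:4])


def is_after(case, cutoff_year=1950):
    """True when the case was decided strictly after cutoff_year."""
    return decision_year(case["decision_date"]) > cutoff_year


# Tiny in-memory stand-in for objects read from the JSON Lines file.
sample_cases = [
    {"decision_date": "1820-01"},
    {"decision_date": "1950-06-12"},
    {"decision_date": "1951"},
]
recent = [c for c in sample_cases if is_after(c)]
print(len(recent))
```

Note that the strict comparison drops cases decided in 1950 itself; `>=` would keep them.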

In [7]:
# 2 Keyword generation?
import enchant
d = enchant.Dict("en_US")
d.synonym("contract")

AttributeError: 'Dict' object has no attribute 'synonym'

In [8]:
from py_thesaurus import Thesaurus

input_word = "dream"

new_instance = Thesaurus(input_word)

# Get the synonyms according to part of speech
# Default part of speech is noun

print(new_instance.get_synonym())

print(new_instance.get_synonym(pos='verb'))

print(new_instance.get_synonym(pos='adj'))

No Internet Connection
[]
No Internet Connection
[]
No Internet Connection
[]


In [60]:
from PyDictionary import PyDictionary
dictionary = PyDictionary()
print(dictionary.meaning("indentation"))

[]


## Testing out Similarity Mechanism
---
### Setup
- Test PyDictionary to build keywords
- Construct a mechanism to extract keywords and store them in a searchable form.
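
One possible way to store keywords so they can be searched is an inverted index that maps each keyword phrase to the documents containing it. The sketch below assumes the keywords arrive as `(score, phrase)` pairs per document; `build_index` and `sample_keywords` are hypothetical, and the pairs are hand-written rather than produced by any extractor.

```python
from collections import defaultdict


def build_index(keyword_dataset):
    """Map each keyword phrase to a list of (doc_id, score) pairs."""
    index = defaultdict(list)
    for doc_id, scored_phrases in enumerate(keyword_dataset):
        for score, phrase in scored_phrases:
            index[phrase.lower()].append((doc_id, score))
    return dict(index)


# Hand-written (score, phrase) pairs standing in for extractor output.
sample_keywords = [
    [(1.0, "defendant"), (1.0, "contract"), (1.0, "breached")],
    [(4.0, "defendant assaulted"), (1.0, "victim")],
]
index = build_index(sample_keywords)
print(index["contract"])
```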
---
### Search
- Build keywords out of your search
- Search among dataset keywords
- Results with the nearest dates, highest weight, and highest precedence show up first
- Scrolling paginates and continues the search.
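
The search steps above could be sketched roughly as follows: score each case by the total weight of the keywords it shares with the query, break ties by date, and return one page at a time. Everything here is an assumption made for illustration: the record layout (`keywords` as a phrase-to-weight dict, `date` as an ISO string), the reading of "nearest dates" as "most recent first", and the `search` helper itself. Precedence ranking is left out because the notebook does not define it.

```python
def search(query_keywords, records, page=0, page_size=2):
    """Return one page of record ids ranked by keyword weight, then by newest date."""
    scored = []
    for rec in records:
        weight = sum(rec["keywords"].get(k, 0.0) for k in query_keywords)
        if weight > 0:
            scored.append((weight, rec["date"], rec["id"]))
    # Highest weight first; ISO date strings sort chronologically,
    # so reversing the sort also puts the newest date first among ties.
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    start = page * page_size
    return [rec_id for _, _, rec_id in scored[start:start + page_size]]


# Hypothetical records; the weights are made up.
records = [
    {"id": "a", "date": "1951-03-01", "keywords": {"contract": 1.0, "breached": 1.0}},
    {"id": "b", "date": "1975-07-15", "keywords": {"contract": 1.0, "defendant": 1.0}},
    {"id": "c", "date": "1960-01-20", "keywords": {"victim": 1.0}},
    {"id": "d", "date": "1988-11-02", "keywords": {"contract": 1.0}},
]
print(search(["contract", "breached"], records, page=0))
print(search(["contract", "breached"], records, page=1))
```

Requesting the next `page` value is what a scrolling interface would do to continue the search.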

In [6]:
# Assume this dataset
dataset = ["The contract was breached by the defendant", "The defendant assaulted the victim on multiple occassions"]

In [30]:
# Build keywords for this dataset.
from rake_nltk import Rake

r = Rake()  # Uses English stopwords from NLTK and all punctuation characters.
keyword_dataset = []

In [31]:
for each_text in dataset:
    # extract_keywords_from_text() returns None; results are read back from the Rake instance.
    r.extract_keywords_from_text(each_text)
    keyword_dataset.append(r.get_ranked_phrases_with_scores())

In [32]:
keyword_dataset

[[(1.0, 'defendant'), (1.0, 'contract'), (1.0, 'breached')],
 [(4.0, 'multiple occassions'), (4.0, 'defendant assaulted'), (1.0, 'victim')]]
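
The scores above are consistent with RAKE's degree-to-frequency ranking: the text is split into candidate phrases at stopwords, each word scores degree divided by frequency, and a phrase scores the sum over its words. The sketch below reimplements that idea in plain Python to show why a two-word phrase scores 4.0. It is an illustration, not the `rake_nltk` implementation: `STOPWORDS` is a tiny hand-picked set rather than the NLTK list, punctuation is simply dropped, and the function names are hypothetical.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "was", "by", "on"}  # assumption: minimal set for this example


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopword words."""
    phrases, current = [], []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(current)
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, including the word itself.
            degree[word] += len(phrase)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}


scores = rake_scores("The defendant assaulted the victim on multiple occassions")
print(scores)
```

Each word in a two-word phrase has degree 2 and frequency 1, so the phrase totals 4.0, while a single-word phrase scores 1.0.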

In [34]:
from PyDictionary import PyDictionary
dictionary = PyDictionary()


In [35]:
dictionary.meaning("defendant")

KeyboardInterrupt: 

In [36]:
dictionary.synonym("Life")

Life has no Synonyms in the API






## Transforming dataset
---
### Extract the first data and study it
- Identify the key elements that need to be transformed & list them
- Build a mechanism to transform one datapoint.
---
### Perform for entire dataset
- Run a loop and apply the same changes to every datapoint.

In [18]:
# Extracting the first element
first_case = cases.find_one()

In [20]:
import xml.etree.ElementTree as ET
root = ET.fromstring(first_case['casebody']['data'])

In [21]:
root

<Element '{http://nrs.harvard.edu/urn-3:HLS.Libr.US_Case_Law.Schema.Case_Body:v1}casebody' at 0x11232c2c8>
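
Because every tag carries the `{namespace}` prefix shown above, plain queries such as `.//p` match nothing. A minimal sketch of namespace-aware lookup with `ElementTree` follows. The namespace URI is copied from the element repr above, but `sample_xml` is a hypothetical miniature document (real case bodies contain more elements) and the `cb` prefix is an arbitrary local alias.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the element repr; 'cb' is an arbitrary alias.
NS = {"cb": "http://nrs.harvard.edu/urn-3:HLS.Libr.US_Case_Law.Schema.Case_Body:v1"}

# Hypothetical miniature casebody used only for this example.
sample_xml = (
    '<casebody xmlns="http://nrs.harvard.edu/urn-3:HLS.Libr.US_Case_Law.Schema.Case_Body:v1">'
    "<opinion><author>Smith, J.</author><p>First paragraph.</p><p>Second paragraph.</p></opinion>"
    "</casebody>"
)
root = ET.fromstring(sample_xml)

# Query paragraphs through the namespace map.
paragraphs = [p.text for p in root.findall(".//cb:p", NS)]
print(paragraphs)

# Strip the '{uri}' prefix to get the plain tag name.
print(root.tag.split("}", 1)[-1])
```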

# Getting the case body cleaned into a separate field on the db

In [71]:
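summary = ''
for child in root:
    for sub_child in child:
        # Strip the '{namespace}' prefix to get the plain tag name.
        local_tag = sub_child.tag[sub_child.tag.index("}") + 1:]
        if 'footnotemark' in local_tag or 'author' in local_tag:
            continue
        # .text is None for elements with no leading text, so fall back to ''.
        summary += (sub_child.text or '') + "\n"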

In [73]:
print(summary)

The executive secretary of the State Game and Fish, Commission declined to authorize payment of $105 to four persons who claimed to have killed seven wolves over six months old, as evidenced by certificate of the Boone County Court. See Act 183 of 1949. The statute is entitled “An Act to authorize . . . counties ... to pay bounties for the killing of wolves and to provide that the State . . . shall pay an equal sum as a bounty, and for other purposes.” The emergency clause is a finding that farmers are suffering irreparable damage ‘£ from wolves destroying cattle and other live stock.” The measure received two-thirds of the votes of all members elected to each branch of the General Assembly.
The complaint is a petition for mandamus alleging that Boone county has paid bounties on the basis of $20 for each wolf killed; but, § 2 of Act 183, the amount payable by the State cannot exceed $15.
The answer, among other defenses, asserts that T. H. McAmis as Secretary of the Game and Fish Commi

# Do the same for all the files now!

In [74]:
all_cases = cases.find()

In [78]:
for each_case in all_cases:
    root = ET.fromstring(each_case['casebody']['data'])
    summary = ''
    for child in root:
        for sub_child in child:
            # Strip the '{namespace}' prefix to get the plain tag name.
            local_tag = sub_child.tag[sub_child.tag.index("}") + 1:]
            if 'footnotemark' in local_tag or 'author' in local_tag:
                continue
            # .text is None for elements with no leading text, so fall back to ''.
            summary += (sub_child.text or '') + "\n"
    # Store the cleaned text in a new 'summary' field on the same document.
    myquery = {"_id": each_case['_id']}
    newvalues = {"$set": {"summary": summary}}
    cases.update_one(myquery, newvalues)
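
The XML-to-text step in the loop above can be isolated in a function so it can be checked without a database. This is a sketch under assumptions: `extract_summary` and `SKIPPED_TAGS` are hypothetical names, the sample document and its `urn:example` namespace are made up, and the function mirrors the notebook's approach of reading only each element's leading `.text`.

```python
import xml.etree.ElementTree as ET

SKIPPED_TAGS = ("footnotemark", "author")


def extract_summary(casebody_xml):
    """Join the text of second-level elements, skipping footnote marks and author lines."""
    root = ET.fromstring(casebody_xml)
    parts = []
    for child in root:
        for sub_child in child:
            # Drop the '{namespace}' prefix, if any, to get the plain tag name.
            local_tag = sub_child.tag.split("}", 1)[-1]
            if any(name in local_tag for name in SKIPPED_TAGS):
                continue
            # .text is None for empty elements, so fall back to ''.
            parts.append((sub_child.text or "") + "\n")
    return "".join(parts)


# Made-up document, including an empty <p/> that would break plain string concatenation.
sample_xml = (
    '<casebody xmlns="urn:example">'
    "<opinion><author>Smith, J.</author><p>First paragraph.</p><p/><p>Second paragraph.</p></opinion>"
    "</casebody>"
)
summary_text = extract_summary(sample_xml)
print(summary_text)
```

Inside the loop, the update would then set `summary` to `extract_summary(each_case['casebody']['data'])`.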