<a href="https://colab.research.google.com/github/tiffanyfu7/legalduel-1b-ai-studio/blob/main/DateExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook explores ways to extract dates from a document.
Candidate tools include spaCy, Duckling (Meta), Spark NLP, and Stanford CoreNLP. Other approaches may exist, but these are the ones to try first.



# spaCy

https://www.qualicen.de/natural-language-processing-timeline-extraction-with-regexes-and-spacy/

In [27]:
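# install dependencies (shell commands via IPython/Colab "!" syntax)
!pip install daterangeparser spacy
# fetch the English model with spaCy's own downloader rather than by package name,
# which is the documented route and also works outside Colab
!python -m spacy download en_core_web_sm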

Collecting daterangeparser
  Downloading DateRangeParser-1.3.2-py3-none-any.whl.metadata (1.5 kB)
Downloading DateRangeParser-1.3.2-py3-none-any.whl (23 kB)
Installing collected packages: daterangeparser
Successfully installed daterangeparser-1.3.2


In [28]:
# imports
import re
import spacy
import requests
import IPython
from daterangeparser import parse

In [29]:
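# read the source text file; the with block closes the file handle afterwards
with open('/content/Chronology_Facts_1.txt', 'r') as chronology_facts_1:
  text_1 = chronology_facts_1.read()

# load the small English pipeline and run it over the whole document
nlp = spacy.load('en_core_web_sm')
doc_1 = nlp(text_1)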

In [30]:
# named entities
for ent in doc_1.ents:
  print("{} -> {}".format(ent.text,ent.label_))

January 15, 2021 -> DATE
James Rosco -> PERSON
UltraGuard -> WORK_OF_ART
AutoMart -> PRODUCT
Albany -> GPE
New York -> GPE
LubriTech Industries, Inc. -> ORG
LubriTech -> ORG
2020 -> DATE
Ford -> GPE
January 20, 2021 -> DATE
Speedy Lube -> PRODUCT
Albany -> GPE
February 5, 2021 -> DATE
Speedy Lube -> PERSON
February 7, 2021 -> DATE
February 15, 2021 -> DATE
the New York State Thruway -> ORG
Three days later -> DATE
Albany Auto Repair -> ORG
February 16, 2021 -> DATE
February 18, 2021 -> DATE
March 1, 2021 -> DATE
LubriTech -> ORG
LubriTech -> ORG
March 10, 2021 -> DATE
ChemTest Labs -> PERSON
March 25, 2021 -> DATE
April 2, 2021 -> DATE
the New York State Department of Consumer Protection -> ORG
LubriTech -> ORG
May 15, 2021 -> DATE
ChemTest Labs -> ORG
December 10, 2020 -> DATE
the end of that month -> DATE
the day before June 2, 2021 -> DATE
LubriTech -> ORG
the Supreme Court of New York -> ORG
Albany County -> GPE
Emily Thompson -> PERSON
July 10, 2021 -> DATE
8/1/2021 -> DATE
LubriT

In [31]:
# date entities
for ent in filter(lambda e: e.label_=="DATE", doc_1.ents):
  print(ent.text)

January 15, 2021
2020
January 20, 2021
February 5, 2021
February 7, 2021
February 15, 2021
Three days later
February 16, 2021
February 18, 2021
March 1, 2021
March 10, 2021
March 25, 2021
April 2, 2021
May 15, 2021
December 10, 2020
the end of that month
the day before June 2, 2021
July 10, 2021
8/1/2021
August 15, 2021
September 5, 2021
December 20, 2020
October 10, 2021
three weeks
October 31, 2021
November 15, 2021
January 10, 2022
two weeks
three days later
March 15, 2022
Three days later
June 10, 2024
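
The DATE list above mixes absolute dates with relative expressions such as "Three days later", and the event output further down appears to skip the relative ones. A minimal sketch of one way to resolve them, assuming each relative expression refers to the most recent absolute date before it in document order (an assumption that will not hold for every sentence). The helper names `resolve_relative` and `resolve_sequence` are hypothetical and not part of spaCy or daterangeparser, and only the "N days/weeks later" pattern is handled.

```python
import re
from datetime import datetime, timedelta

# Hypothetical helpers for illustration; not part of spaCy or daterangeparser.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
UNIT_DAYS = {"day": 1, "week": 7}
RELATIVE = re.compile(r"^(\w+) (day|week)s? later$", re.IGNORECASE)


def resolve_relative(expression, anchor):
    """Shift anchor by an expression like 'three days later'; None if unsupported."""
    match = RELATIVE.match(expression.strip())
    if match is None:
        return None
    count_text, unit = match.group(1).lower(), match.group(2).lower()
    count = int(count_text) if count_text.isdigit() else WORD_NUMBERS.get(count_text)
    if count is None:
        return None
    return anchor + timedelta(days=count * UNIT_DAYS[unit])


def resolve_sequence(date_texts):
    """Walk DATE strings in order, anchoring relative ones on the last absolute date."""
    resolved, anchor = [], None
    for text in date_texts:
        try:
            # Absolute dates in the "January 15, 2021" style become the new anchor.
            anchor = datetime.strptime(text, "%B %d, %Y")
            resolved.append((text, anchor))
            continue
        except ValueError:
            pass
        shifted = resolve_relative(text, anchor) if anchor is not None else None
        resolved.append((text, shifted))  # None marks an unresolved expression
    return resolved
```

In this notebook the input could be `[ent.text for ent in doc_1.ents if ent.label_ == "DATE"]`. Expressions the sketch does not understand, such as "the end of that month", come back paired with `None` so they can be reviewed by hand.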


In [18]:
def dep_subtree(token, dep):
  # return the text of the subtree under token's first child with dependency label dep,
  # or an empty string if token has no such child
  child = next(filter(lambda c: c.dep_ == dep, token.children), None)
  if child is not None:
    return " ".join([c.text for c in child.subtree])
  else:
    return ""

In [33]:
# matches bracketed numeric markers such as [12] so they can be stripped before parsing
p = re.compile(r'\[\d+\]')

In [36]:
def extract_events_regex(line):
  # capture three- and four-digit years (1975) and ranges (1975–1976, 1975/76)
  # in sentences of the form "In <year>, <event>."
  found = re.findall(r'In (\d\d\d\d?[/–]?\d?\d?\d?\d?),? ?([^.]*)', line)
  # zero-pad three-digit years so the first tuple element sorts correctly as a string
  return [(year if len(year) > 3 else "0" + year, year, event) for year, event in found]
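
The pattern above only fires on sentences that start with "In <year>", whereas the DATE entities listed earlier in this notebook are written as "January 15, 2021" or "8/1/2021". A rough sketch of a regex pass for those two formats follows. The names `find_absolute_dates` and `_try_parse` are hypothetical, and reading "8/1/2021" month-first (US style) is an assumption about the source document.

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Long form, e.g. "January 15, 2021"
LONG_DATE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
# Numeric form, e.g. "8/1/2021" (assumed to be month/day/year)
NUMERIC_DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def _try_parse(raw, fmt):
    """Return a datetime, or None for impossible dates such as 'February 30, 2021'."""
    try:
        return datetime.strptime(raw, fmt)
    except ValueError:
        return None


def find_absolute_dates(text):
    """Return (datetime, matched_text) pairs in order of appearance."""
    hits = []
    for pattern, fmt in ((LONG_DATE, "%B %d, %Y"), (NUMERIC_DATE, "%m/%d/%Y")):
        for match in pattern.finditer(text):
            when = _try_parse(match.group(), fmt)
            if when is not None:
                hits.append((match.start(), when, match.group()))
    hits.sort(key=lambda hit: hit[0])  # restore document order across both patterns
    return [(when, raw) for _, when, raw in hits]
```

Unlike the spaCy pass, this finds nothing for relative expressions ("Three days later") or bare years ("2020"), so it is best treated as a cross-check on the NER output rather than a replacement.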

In [37]:
def extract_events_spacy(line):
  # strip bracketed markers such as [12] before running the pipeline
  line = p.sub('', line)
  events = []
  doc = nlp(line)
  for ent in filter(lambda e: e.label_ == 'DATE', doc.ents):
    try:
      # parse returns a (start, end) pair; only start is used below
      start, end = parse(ent.text)
    except Exception:
      # could not parse the date, hence ignore it
      continue
    # climb from the date entity to the root of its sentence
    current = ent.root
    while current.dep_ != "ROOT":
      current = current.head
    # describe the event with the root word and selected dependents
    desc = " ".join(filter(None,[
                                 dep_subtree(current,"nsubj"),
                                 dep_subtree(current,"nsubjpass"),
                                 dep_subtree(current,"auxpass"),
                                 dep_subtree(current,"amod"),
                                 dep_subtree(current,"det"),
                                 current.text,
                                 dep_subtree(current,"acl"),
                                 dep_subtree(current,"dobj"),
                                 dep_subtree(current,"attr"),
                                 dep_subtree(current,"advmod")]))
    events.append((start, ent.text, desc))
  return events

In [41]:
with open('/content/Chronology_Facts_1.txt', 'r') as f: # open the file in read mode
  text = f.read()  # read the entire file

# keep the result: a bare call inside the with block is neither stored nor displayed
events_1 = extract_events_spacy(text)

In [43]:
extract_events_spacy("Facts of the Case On January 15, 2021, the plaintiff, James Rosco, purchased a batch of motor oil branded as 'UltraGuard' from a local retailer, AutoMart, located in Albany, New York. The motor oil was manufactured by the defendant, LubriTech Industries, Inc. ('LubriTech'). The plaintiff used the motor oil in his 2020 Ford Mustang on January 20, 2021, during a routine oil change performed at Speedy Lube, an auto service center in Albany. On February 5, 2021, the plaintiff began to notice unusual noises emanating from the engine of his vehicle. Concerned, he took the car back to Speedy Lube on February 7, 2021, where the mechanics conducted a preliminary inspection but found no immediate issues. However, the noises persisted, and on February 15, 2021, the plaintiff's vehicle broke down on the New York State Thruway. Three days later the plaintiff changed the license plates on the car. The vehicle was towed to Albany Auto Repair on February 16, 2021. After a thorough inspection, the mechanics discovered significant engine damage, which they attributed to the motor oil used during the last oil change. The plaintiff was informed of this diagnosis on February 18, 2021. The mechanics noted that the motor oil appeared to have degraded prematurely, causing insufficient lubrication and leading to engine failure. On March 1, 2021, the plaintiff contacted LubriTech to report the issue and seek compensation for the damages. LubriTech responded on March 10, 2021, denying any fault and asserting that their product met all industry standards. The plaintiff then commissioned an independent laboratory, ChemTest Labs, to analyze the motor oil. The lab results, received on March 25, 2021, indicated that the motor oil contained an excessive amount of a chemical compound known as 'Polymer X,' which is known to cause rapid degradation under high temperatures. 
On April 2, 2021, the plaintiff filed a complaint with the New York State Department of Consumer Protection, which initiated an investigation into LubriTech's manufacturing processes. The investigation report, released on May 15, 2021, corroborated the findings of ChemTest Labs, revealing that a batch of motor oil produced on December 10, 2020, contained a higher-than-acceptable level of Polymer X due to a manufacturing error. The report was produced at the end of that month. On the day before June 2, 2021, the plaintiff filed a lawsuit against LubriTech in the Supreme Court of New York, Albany County, seeking damages for the cost of engine repairs, loss of use of the vehicle, and other related expenses. The case was assigned to Judge Emily Thompson, who scheduled the initial hearing for July 10, 2021. During the discovery phase, which commenced on 8/1/2021, it was revealed that LubriTech had received multiple complaints about the same batch of motor oil. Internal emails from LubriTech, dated August 15, 2021, showed that the company was aware of the issue but chose not to issue a recall or notify consumers. On September 5, 2021, the plaintiffs legal team deposed LubriTechs head of quality control, who admitted under oath that the company had identified the contamination on December 20, 2020, but decided against taking corrective action due to cost concerns. This testimony was pivotal in establishing LubriTech's knowledge and negligence. The trial commenced on October 10, 2021, and lasted for three weeks. On October 31, 2021, the jury returned a verdict in favor of the plaintiff, awarding him 100,000 in punitive damages. LubriTech filed an appeal on November 15, 2021, challenging both the verdict and the damages awarded. The Appellate Division, Third Department, heard the appeal on January 10, 2022, and issued a decision two weeks and three days later, affirming the lower court's ruling. 
LubriTech then sought leave to appeal to the New York Court of Appeals, which was granted on March 15, 2022. Three days later, the Chief Justice remarked that the appellate lawyers was going to be disbarred for improper conduct.  The case is now before the New York Court of Appeals, with oral arguments scheduled for June 10, 2024. The plaintiff contends that LubriTech's actions constituted gross negligence and seeks to uphold the lower courts' decisions. The defendant argues that the damages awarded were excessive and that the plaintiff failed to prove causation adequately. This case presents significant questions regarding product liability and corporate responsibility, particularly in the context of consumer safety and the duty to inform. The outcome will have far-reaching implications for manufacturers and consumers alike.")

[(datetime.datetime(2021, 1, 15, 0, 0),
  'January 15, 2021',
  "Facts of the Case purchased a batch of motor oil branded as ' UltraGuard '"),
 (datetime.datetime(2021, 1, 20, 0, 0),
  'January 20, 2021',
  'The plaintiff used the motor oil'),
 (datetime.datetime(2021, 2, 5, 0, 0),
  'February 5, 2021',
  'the plaintiff began'),
 (datetime.datetime(2021, 2, 7, 0, 0),
  'February 7, 2021',
  'he took the car back to Speedy Lube'),
 (datetime.datetime(2021, 2, 15, 0, 0),
  'February 15, 2021',
  'the noises persisted However'),
 (datetime.datetime(2021, 2, 16, 0, 0),
  'February 16, 2021',
  'The vehicle was towed'),
 (datetime.datetime(2021, 2, 18, 0, 0),
  'February 18, 2021',
  'The plaintiff was informed'),
 (datetime.datetime(2021, 3, 1, 0, 0),
  'March 1, 2021',
  'the plaintiff contacted LubriTech'),
 (datetime.datetime(2021, 3, 10, 0, 0),
  'March 10, 2021',
  'LubriTech responded'),
 (datetime.datetime(2021, 3, 25, 0, 0),
  'March 25, 2021',
  'The lab results , received on Marc
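
The output above is a list of `(start, date_text, description)` tuples in document order, which is not necessarily chronological order. A small sketch of turning such a list into a sorted timeline; `format_timeline` is a hypothetical helper name, and the three sample tuples are copied from the output above and deliberately shuffled.

```python
from datetime import datetime


def format_timeline(events):
    """Sort (start, date_text, description) tuples by start and render one line each."""
    ordered = sorted(events, key=lambda event: event[0])
    return [f"{start:%Y-%m-%d}  {description}" for start, _, description in ordered]


# Three tuples copied from the output above, deliberately out of order.
sample_events = [
    (datetime(2021, 3, 10, 0, 0), 'March 10, 2021', 'LubriTech responded'),
    (datetime(2021, 1, 20, 0, 0), 'January 20, 2021', 'The plaintiff used the motor oil'),
    (datetime(2021, 2, 16, 0, 0), 'February 16, 2021', 'The vehicle was towed'),
]

for line in format_timeline(sample_events):
    print(line)
# 2021-01-20  The plaintiff used the motor oil
# 2021-02-16  The vehicle was towed
# 2021-03-10  LubriTech responded
```

Sorting on the parsed `datetime` rather than on the original text keeps "8/1/2021" and "August 15, 2021" in the right order relative to each other.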

# Duckling

https://github.com/facebook/duckling
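
Duckling is a Haskell library, and its repository includes an example HTTP server that can be queried from Python. The sketch below assumes that server is already running locally on port 8000 and exposes a `/parse` endpoint taking form-encoded `locale` and `text` fields, as the repository README describes. The response field names used here (`body`, `dim`, `value`) should be checked against the server's actual output, and all function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a Duckling example server is running locally on port 8000.
DUCKLING_URL = "http://localhost:8000/parse"


def build_parse_request(text, locale="en_US"):
    """Build the form-encoded POST request for the /parse endpoint."""
    payload = urlencode({"locale": locale, "text": text}).encode("utf-8")
    return Request(DUCKLING_URL, data=payload, method="POST")


def time_entities(entities):
    """Keep only 'time' entities as (matched_text, value) pairs.

    Entities without a single 'value' (for example intervals) yield None.
    """
    return [(entity["body"], entity["value"].get("value"))
            for entity in entities if entity.get("dim") == "time"]


def duckling_dates(text):
    """Send text to the server and return its time entities (needs a running server)."""
    with urlopen(build_parse_request(text)) as response:
        return time_entities(json.loads(response.read().decode("utf-8")))
```

With a server running, `duckling_dates(text)` could be compared against the spaCy DATE entities above, in particular for relative expressions such as "Three days later".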

# Spark NLP

https://www.johnsnowlabs.com/extracting-exact-dates-from-natural-language-text/

# Stanford CoreNLP

https://datascience.stackexchange.com/questions/45854/date-extraction-in-python