<a href="https://colab.research.google.com/github/tiffanyfu7/legalduel-1b-ai-studio/blob/main/DateExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook tries to extract dates from a document.
Candidate methods include SpaCy, Duckling (Meta), Spark, and Stanford CoreNLP. There may be other ways, but these are the ones to try first.



# SpaCy

https://www.qualicen.de/natural-language-processing-timeline-extraction-with-regexes-and-spacy/

In [None]:
# install
def _pip_magic(line):
  !pip install {line}

_pip_magic('daterangeparser')
_pip_magic('spacy')
# the English model is fetched through the spaCy CLI rather than by package name
!python -m spacy download en_core_web_sm

Collecting daterangeparser
  Downloading DateRangeParser-1.3.2-py3-none-any.whl.metadata (1.5 kB)
Downloading DateRangeParser-1.3.2-py3-none-any.whl (23 kB)
Installing collected packages: daterangeparser
Successfully installed daterangeparser-1.3.2


In [None]:
# imports
import re
import spacy
import requests
import IPython
from daterangeparser import parse

In [None]:
# import txt file (the with-block closes the file once it has been read)
with open('/content/Chronology_Facts_1.txt', 'r') as chronology_facts_1:
  text_1 = chronology_facts_1.read()

nlp = spacy.load('en_core_web_sm')
doc_1 = nlp(text_1)

FileNotFoundError: [Errno 2] No such file or directory: '/content/Chronology_Facts_1.txt'

In [None]:
# named entities
for ent in doc_1.ents:
  print("{} -> {}".format(ent.text,ent.label_))

January 15, 2021 -> DATE
James Rosco -> PERSON
UltraGuard -> WORK_OF_ART
AutoMart -> PRODUCT
Albany -> GPE
New York -> GPE
LubriTech Industries, Inc. -> ORG
LubriTech -> ORG
2020 -> DATE
Ford -> GPE
January 20, 2021 -> DATE
Speedy Lube -> PRODUCT
Albany -> GPE
February 5, 2021 -> DATE
Speedy Lube -> PERSON
February 7, 2021 -> DATE
February 15, 2021 -> DATE
the New York State Thruway -> ORG
Three days later -> DATE
Albany Auto Repair -> ORG
February 16, 2021 -> DATE
February 18, 2021 -> DATE
March 1, 2021 -> DATE
LubriTech -> ORG
LubriTech -> ORG
March 10, 2021 -> DATE
ChemTest Labs -> PERSON
March 25, 2021 -> DATE
April 2, 2021 -> DATE
the New York State Department of Consumer Protection -> ORG
LubriTech -> ORG
May 15, 2021 -> DATE
ChemTest Labs -> ORG
December 10, 2020 -> DATE
the end of that month -> DATE
the day before June 2, 2021 -> DATE
LubriTech -> ORG
the Supreme Court of New York -> ORG
Albany County -> GPE
Emily Thompson -> PERSON
July 10, 2021 -> DATE
8/1/2021 -> DATE
LubriT

In [None]:
# date entities
for ent in filter(lambda e: e.label_=="DATE", doc_1.ents):
  print(ent.text)

January 15, 2021
2020
January 20, 2021
February 5, 2021
February 7, 2021
February 15, 2021
Three days later
February 16, 2021
February 18, 2021
March 1, 2021
March 10, 2021
March 25, 2021
April 2, 2021
May 15, 2021
December 10, 2020
the end of that month
the day before June 2, 2021
July 10, 2021
8/1/2021
August 15, 2021
September 5, 2021
December 20, 2020
October 10, 2021
three weeks
October 31, 2021
November 15, 2021
January 10, 2022
two weeks
three days later
March 15, 2022
Three days later
June 10, 2024


In [None]:
def dep_subtree(token, dep):
  # text of the subtree under the first child of `token` whose dependency label is `dep`, or "" if there is none
  child=next(filter(lambda c: c.dep_==dep, token.children), None)
  if child is not None:
    return " ".join([c.text for c in child.subtree])
  else:
    return ""

In [None]:
p = re.compile(r'\[\d+\]')
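The pattern `p` above appears to target bracketed numeric markers such as `[12]`, like the footnote references found in Wikipedia-style text; that reading is an assumption, since the notebook does not say what it is for. A minimal sketch of what `p.sub('', ...)` does to a line, using a made-up sample sentence:

```python
import re

# Same pattern as `p` in the cell above: a literal "[", one or more digits, a literal "]"
marker = re.compile(r'\[\d+\]')

# Hypothetical sample line, not taken from the chronology files
sample = "The contract was signed in 2019.[3] It was amended twice.[12]"

# Every bracketed number is replaced with the empty string
cleaned = marker.sub('', sample)
print(cleaned)
```

The chronology text used in this notebook does not seem to contain such markers, so on that input the substitution is likely a no-op.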

In [None]:
def extract_events_regex(line):
  # capture three-digit and four-digit years (1975) and ranges written with "/" or an en dash (1975–1976)
  found = re.findall(r'In (\d\d\d\d?[/–]?\d?\d?\d?\d?),? ?([^.]*)', line)
  # zero-pad three-digit years so the first tuple element is always at least four characters long
  return [(f[0] if len(f[0])>3 else "0"+f[0], f[0], f[1]) for f in found]
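To make the regex approach concrete, here is a small self-contained sketch of the same pattern applied to a made-up sentence. The function name `events_from_in_year` and the sample text are hypothetical and only for illustration; the pattern itself is the one from the cell above, written as a raw string.

```python
import re

def events_from_in_year(line):
    # Group 1: a 3- or 4-digit year, optionally followed by "/" or an en dash and more digits
    # Group 2: everything up to (not including) the next period
    found = re.findall(r'In (\d\d\d\d?[/–]?\d?\d?\d?\d?),? ?([^.]*)', line)
    # Zero-pad three-digit years so the first element can serve as a sort key
    return [(y if len(y) > 3 else "0" + y, y, desc) for y, desc in found]

# Hypothetical sample text, not taken from the chronology files
sample = "In 476, Rome fell. In 1975–1976 the firm expanded."
for key, year, description in events_from_in_year(sample):
    print(key, "|", year, "|", description)
```

Note that the pattern only fires on sentences of the form "In <year> ...". Dates phrased as "On January 15, 2021", as in the motor oil chronology, do not match it.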

In [None]:
def extract_events_spacy(line):
  line=p.sub('', line)  # strip bracketed numeric markers such as [12]
  events = []
  doc = nlp(line)
  for ent in filter(lambda e: e.label_=='DATE',doc.ents):
    try:
      start,end = parse(ent.text)
    except Exception:
      # could not parse the date (e.g. "Three days later"), hence ignore it
      continue
    # walk up the dependency tree from the date to the root of its sentence
    current = ent.root
    while current.dep_ != "ROOT":
      current = current.head
    # build a short description from selected dependents of the root
    desc = " ".join(filter(None,[
                                 dep_subtree(current,"nsubj"),
                                 dep_subtree(current,"nsubjpass"),
                                 dep_subtree(current,"auxpass"),
                                 dep_subtree(current,"amod"),
                                 dep_subtree(current,"det"),
                                 current.text,
                                 dep_subtree(current,"acl"),
                                 dep_subtree(current,"dobj"),
                                 dep_subtree(current,"attr"),
                                 dep_subtree(current,"advmod")]))
    events.append((start,ent.text,desc))
  return events

In [None]:
with open('/content/Chronology_Facts_1.txt', 'r') as f: # Open the file in read mode
  text = f.read()  # Read the entire file

# Called as the last top-level expression so the notebook displays the returned list
extract_events_spacy(text)

In [None]:
extract_events_spacy("Facts of the Case On January 15, 2021, the plaintiff, James Rosco, purchased a batch of motor oil branded as 'UltraGuard' from a local retailer, AutoMart, located in Albany, New York. The motor oil was manufactured by the defendant, LubriTech Industries, Inc. ('LubriTech'). The plaintiff used the motor oil in his 2020 Ford Mustang on January 20, 2021, during a routine oil change performed at Speedy Lube, an auto service center in Albany. On February 5, 2021, the plaintiff began to notice unusual noises emanating from the engine of his vehicle. Concerned, he took the car back to Speedy Lube on February 7, 2021, where the mechanics conducted a preliminary inspection but found no immediate issues. However, the noises persisted, and on February 15, 2021, the plaintiff's vehicle broke down on the New York State Thruway. Three days later the plaintiff changed the license plates on the car. The vehicle was towed to Albany Auto Repair on February 16, 2021. After a thorough inspection, the mechanics discovered significant engine damage, which they attributed to the motor oil used during the last oil change. The plaintiff was informed of this diagnosis on February 18, 2021. The mechanics noted that the motor oil appeared to have degraded prematurely, causing insufficient lubrication and leading to engine failure. On March 1, 2021, the plaintiff contacted LubriTech to report the issue and seek compensation for the damages. LubriTech responded on March 10, 2021, denying any fault and asserting that their product met all industry standards. The plaintiff then commissioned an independent laboratory, ChemTest Labs, to analyze the motor oil. The lab results, received on March 25, 2021, indicated that the motor oil contained an excessive amount of a chemical compound known as 'Polymer X,' which is known to cause rapid degradation under high temperatures. 
On April 2, 2021, the plaintiff filed a complaint with the New York State Department of Consumer Protection, which initiated an investigation into LubriTech's manufacturing processes. The investigation report, released on May 15, 2021, corroborated the findings of ChemTest Labs, revealing that a batch of motor oil produced on December 10, 2020, contained a higher-than-acceptable level of Polymer X due to a manufacturing error. The report was produced at the end of that month. On the day before June 2, 2021, the plaintiff filed a lawsuit against LubriTech in the Supreme Court of New York, Albany County, seeking damages for the cost of engine repairs, loss of use of the vehicle, and other related expenses. The case was assigned to Judge Emily Thompson, who scheduled the initial hearing for July 10, 2021. During the discovery phase, which commenced on 8/1/2021, it was revealed that LubriTech had received multiple complaints about the same batch of motor oil. Internal emails from LubriTech, dated August 15, 2021, showed that the company was aware of the issue but chose not to issue a recall or notify consumers. On September 5, 2021, the plaintiffs legal team deposed LubriTechs head of quality control, who admitted under oath that the company had identified the contamination on December 20, 2020, but decided against taking corrective action due to cost concerns. This testimony was pivotal in establishing LubriTech's knowledge and negligence. The trial commenced on October 10, 2021, and lasted for three weeks. On October 31, 2021, the jury returned a verdict in favor of the plaintiff, awarding him 100,000 in punitive damages. LubriTech filed an appeal on November 15, 2021, challenging both the verdict and the damages awarded. The Appellate Division, Third Department, heard the appeal on January 10, 2022, and issued a decision two weeks and three days later, affirming the lower court's ruling. 
LubriTech then sought leave to appeal to the New York Court of Appeals, which was granted on March 15, 2022. Three days later, the Chief Justice remarked that the appellate lawyers was going to be disbarred for improper conduct.  The case is now before the New York Court of Appeals, with oral arguments scheduled for June 10, 2024. The plaintiff contends that LubriTech's actions constituted gross negligence and seeks to uphold the lower courts' decisions. The defendant argues that the damages awarded were excessive and that the plaintiff failed to prove causation adequately. This case presents significant questions regarding product liability and corporate responsibility, particularly in the context of consumer safety and the duty to inform. The outcome will have far-reaching implications for manufacturers and consumers alike.")

[(datetime.datetime(2021, 1, 15, 0, 0),
  'January 15, 2021',
  "Facts of the Case purchased a batch of motor oil branded as ' UltraGuard '"),
 (datetime.datetime(2021, 1, 20, 0, 0),
  'January 20, 2021',
  'The plaintiff used the motor oil'),
 (datetime.datetime(2021, 2, 5, 0, 0),
  'February 5, 2021',
  'the plaintiff began'),
 (datetime.datetime(2021, 2, 7, 0, 0),
  'February 7, 2021',
  'he took the car back to Speedy Lube'),
 (datetime.datetime(2021, 2, 15, 0, 0),
  'February 15, 2021',
  'the noises persisted However'),
 (datetime.datetime(2021, 2, 16, 0, 0),
  'February 16, 2021',
  'The vehicle was towed'),
 (datetime.datetime(2021, 2, 18, 0, 0),
  'February 18, 2021',
  'The plaintiff was informed'),
 (datetime.datetime(2021, 3, 1, 0, 0),
  'March 1, 2021',
  'the plaintiff contacted LubriTech'),
 (datetime.datetime(2021, 3, 10, 0, 0),
  'March 10, 2021',
  'LubriTech responded'),
 (datetime.datetime(2021, 3, 25, 0, 0),
  'March 25, 2021',
  'The lab results , received on Marc

## More Exploration With SpaCy

The code below is from ChatGPT. My goal was to find a piece of code that works and then explore why and how it works, which is how I learn new concepts in general. At the end, I will summarize what this code does. I tried it on 2 chronologies and it seems to work OK so far.


In [None]:
!pip install spacy python-docx # python-docx reads and writes Microsoft Word (.docx) files
!python -m spacy download en_core_web_sm # download the small English SpaCy model

# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import spacy
import docx
import re # regular expressions: search, match, and manipulate strings based on patterns
from datetime import datetime, timedelta

# Load the SpaCy model
nlp = spacy.load('en_core_web_sm') #load pretrained model

# Function to convert a time expression into a timedelta
def parse_time_expression(expression):
    weeks = 0
    days = 0
    match_weeks = re.search(r'(\d+)\s*weeks?', expression)
    match_days = re.search(r'(\d+)\s*days?', expression)

    if match_weeks:
        weeks = int(match_weeks.group(1))
    if match_days:
        days = int(match_days.group(1))

    return timedelta(weeks=weeks, days=days)


# Function to extract dates and summarize events
def extract_dates_and_events(docx_path):
    # Read the .docx file
    doc = docx.Document(docx_path)

    # Concatenate all paragraphs into a single string
    full_text = "\n".join([para.text for para in doc.paragraphs if para.text.strip()])

    # Process the text with SpaCy
    doc_nlp = nlp(full_text)

    events = []
    date_pattern = r'\b(?:\d{1,2} \w+ \d{4}|\w+ \d{1,2}, \d{4}|\d{1,2}/\d{1,2}/\d{4})\b'
    time_expression_pattern = r'\b(?:\d+\s*weeks?\s*(?:and\s*)?\d*\s*days?)\b'

    # Find dates and corresponding events
    for sent in doc_nlp.sents:
        date_matches = re.findall(date_pattern, sent.text)
        time_expressions = re.findall(time_expression_pattern, sent.text)

        # Check for time expressions and calculate new dates
        for time_expr in time_expressions:
            # Find the most recent date in the text to add the time expression to
            if date_matches:
                last_date = date_matches[-1]
                if "/" in last_date:
                    parsed_date = datetime.strptime(last_date, "%m/%d/%Y")
                else:
                    try:
                        parsed_date = datetime.strptime(last_date, "%B %d, %Y")
                    except ValueError:
                        parsed_date = datetime.strptime(last_date, "%d %B %Y")

                time_delta = parse_time_expression(time_expr)
                new_date = parsed_date + time_delta
                formatted_new_date = new_date.strftime("%B %d, %Y")
                event_description = sent.text.replace(time_expr, f"New date: {formatted_new_date}").strip().replace("\n", " ")
                events.append((new_date, formatted_new_date, event_description))

        # Process normal date matches
        for date in date_matches:
            if "/" in date:  # MM/DD/YYYY format
                parsed_date = datetime.strptime(date, "%m/%d/%Y")
            else:  # Other formats
                try:
                    parsed_date = datetime.strptime(date, "%B %d, %Y")
                except ValueError:
                    parsed_date = datetime.strptime(date, "%d %B %Y")

            formatted_date = parsed_date.strftime("%B %d, %Y")
            event_description = sent.text.replace(date, "").strip().replace("\n", " ")
            events.append((parsed_date, formatted_date, event_description))

    # Sort events by date
    events.sort(key=lambda x: x[0])

    # Return formatted events
    return [f"{event[1]}: {event[2]}" for event in events]

# Path to the .docx file in your Google Drive
docx_path = '/content/drive/My Drive/LegalDuel 1B/Chronology Examples/Motor Oil Chronology.docx'

# Extract dates and events
legal_events = extract_dates_and_events(docx_path)

# Print the output
print("Legal Chronology for James Rosco v. LubriTech Industries, Inc.\n")
for event in legal_events:
    print(event)


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 61.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Mounted at /content/drive
Legal Chronology for James Rosco v. LubriTech Industries, Inc.

December 10, 2020: The investigation report, released on May 15, 2021, corroborated the findings of ChemTest Labs, revealing that a batch of motor oil produced on , contained a higher-than-acceptable level of Polymer X due to a manufacturing error.
De
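One limitation worth noting: `parse_time_expression` above matches only digits (`\d+`), while the motor oil chronology spells its offsets out ("two weeks and three days later", "Three days later"). The sketch below shows one possible way to accept spelled-out numbers as well. The lookup table `WORD_NUMBERS` and the function names are hypothetical additions, not part of the original code, and `time_expression_pattern` would need a matching change before this could be used in `extract_dates_and_events`.

```python
import re
from datetime import datetime, timedelta

# Hypothetical lookup table for small spelled-out numbers; extend as needed
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Matches either digits or one of the words above, at a word boundary
NUMBER = r'\b(\d+|' + '|'.join(WORD_NUMBERS) + r')'

def to_int(token):
    # Convert "3" or "three" (any casing) to the integer 3
    token = token.lower()
    return int(token) if token.isdigit() else WORD_NUMBERS[token]

def parse_relative_expression(expression):
    """Return a timedelta for phrases like 'two weeks and three days later'."""
    weeks = re.search(NUMBER + r'\s*weeks?', expression, re.IGNORECASE)
    days = re.search(NUMBER + r'\s*days?', expression, re.IGNORECASE)
    return timedelta(weeks=to_int(weeks.group(1)) if weeks else 0,
                     days=to_int(days.group(1)) if days else 0)

# Anchor date taken from the chronology text (appeal heard on January 10, 2022)
anchor = datetime(2022, 1, 10)
delta = parse_relative_expression("two weeks and three days later")
resolved = anchor + delta
print(resolved.strftime("%Y-%m-%d"))  # 2022-01-27
```

Choosing which earlier date a relative phrase refers to is a separate problem; this sketch simply takes the anchor as given.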

# Duckling

*Tiffany

https://github.com/facebook/duckling

# Spark

https://www.johnsnowlabs.com/extracting-exact-dates-from-natural-language-text/#:~:text=DateMatcher%20and%20MultiDateMatcher%20are%20rule,their%20performances%20are%20the%20same.

# Stanford CoreNLP

*Zoe

https://datascience.stackexchange.com/questions/45854/date-extraction-in-python

https://stanfordnlp.github.io/stanza/ner.html

In [None]:
# Stanford CoreNLP -> Stanza
# install
_pip_magic('stanza')
_pip_magic('dateparser')

Collecting dateparser
  Downloading dateparser-1.2.0-py2.py3-none-any.whl.metadata (28 kB)
Downloading dateparser-1.2.0-py2.py3-none-any.whl (294 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 295.0/295.0 kB 6.0 MB/s eta 0:00:00
Installing collected packages: dateparser
Successfully installed dateparser-1.2.0


In [None]:
# imports
import stanza
from datetime import datetime # for manipulation of dates
import dateparser # for advanced date parsing

# initialization in english, tokenizes and assigns ner tags
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
  checkpoint = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: mwt
  checkpoint = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: ner
  checkpoint = torch.load(filename, lambda storage, loc: storage)
  data = torch.load(self.filename, lambda storage, loc: storage)
  state = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Done loading processors!


In [None]:
# gets document (the with-block closes the file once it has been read)
with open('/Chronology_Facts_1.txt', 'r') as f:
  document = f.read()

# runs the nlp
document_nlp = nlp(document)

In [None]:
# function to do advanced date parsing
def parse_advanced_date(date_text):
  settings={
      'DATE_ORDER': 'MDY',
      'PREFER_DAY_OF_MONTH': 'first', # if no day, defaults to first of the month
      'PREFER_MONTH_OF_YEAR': 'first' # if no month, defaults to January
  }

  # parses dates; dateparser.parse returns None when the text cannot be parsed
  return dateparser.parse(date_text, settings=settings)

In [None]:
chronology = []

# traverses every sentence
for sentence in document_nlp.sentences:
  date_found = None

  # traverses the sentence's entities, extracts dates
  # note: each DATE entity overwrites date_found, so only the last one in the sentence is kept
  for entity in sentence.ents:
    if entity.type == 'DATE':
      date_found = parse_advanced_date(entity.text)

  sentence_txt = ' '.join([word.text for word in sentence.words])

  # if a date is found, store the sentence under that date
  if date_found:
    chronology.append([date_found, sentence_txt])
  else:
    # undated sentences are stored with a None date; the sorting step below filters them out
    chronology.append([None, sentence_txt])

In [None]:
# does not seem to work properly...
# so far the code that works best is what Yomna did with SpaCy

# sorts by date
dated_events = [item for item in chronology if item[0] is not None]
dated_events = sorted(dated_events, key=lambda x: x[0])

chronology = dated_events

# print
print('Chronology of Events:\n')
for date, event in chronology:
  if date:
    print(f'{date.strftime("%B %d, %Y")}: {event}')

Chronology of Events:

December 10, 2020: The investigation report , released on May 15 , 2021 , corroborated the findings of ChemTest Labs , revealing that a batch of motor oil produced on December 10 , 2020 , contained a higher - than - acceptable level of Polymer X due to a manufacturing error .
December 20, 2020: On September 5 , 2021 , the plaintiff 's legal team deposed LubriTech 's head of quality control , who admitted under oath that the company had identified the contamination on December 20 , 2020 , but decided against taking corrective action due to cost concerns .
January 15, 2021: On January 15 , 2021 , the plaintiff , James Rosco , purchased a batch of motor oil branded as " UltraGuard " from a local retailer , AutoMart , located in Albany , New York .
January 20, 2021: The plaintiff used the motor oil in his 2020 Ford Mustang on January 20 , 2021 , during a routine oil change performed at Speedy Lube , an auto service center in Albany .
February 05, 2021: On February 5 
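One likely reason the Stanza output looks off: the loop keeps a single date per sentence, because `date_found` is overwritten by each DATE entity. That matches the output above, where the sentence mentioning both May 15, 2021 and December 10, 2020 is filed only under December 10. A possible alternative is to emit one row per date mention. The sketch below illustrates the idea with a plain standard-library regex instead of Stanza, assuming dates written as "Month D, YYYY"; the function name `chronology_rows` and the shortened sample sentences are hypothetical.

```python
import re
from datetime import datetime

# Assumption: dates are written as "Month D, YYYY" with English month names
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DATE_RE = re.compile(r'\b(' + '|'.join(MONTHS) + r') (\d{1,2}), (\d{4})\b')

def chronology_rows(sentences):
    """Return one (date, sentence) row per date mention, sorted by date."""
    rows = []
    for sentence in sentences:
        # findall yields a (month, day, year) tuple for every date in the sentence
        for month, day, year in DATE_RE.findall(sentence):
            date = datetime(int(year), MONTHS.index(month) + 1, int(day))
            rows.append((date, sentence))
    return sorted(rows, key=lambda row: row[0])

# Shortened, illustrative versions of sentences from the chronology
sentences = [
    "The investigation report, released on May 15, 2021, covered a batch produced on December 10, 2020.",
    "The plaintiff filed a complaint on April 2, 2021.",
]
rows = chronology_rows(sentences)
for date, sentence in rows:
    print(date.date().isoformat(), "|", sentence)
```

With this approach the first sentence appears twice, once under each of its dates. Whether that is desirable for a legal chronology is a design choice the notebook leaves open.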