With the goal of producing an AI tool to help internal colleagues and clients, we test different methods.  Initially, we are aiming to build a small tool in-house to reduce cost.  The goal of this exercise is to build a custom Named Entity Recognition (NER) model.

The user supplies a free-text request (e.g., a work description or the resources needed), and the tool identifies the relevant items and outputs them in an easily consumable format.



In [4]:
import spacy

Below is an online tool to create training data for SpaCy NER.

https://tecoholic.github.io/ner-annotator/

### SpaCy pre-trained model

In [8]:
#load base model
nlp = spacy.load("en_core_web_sm")

In [9]:
#sample text generated by ChatGPT
text = """Seeking legal resources for a land dispute case in Westchester. 
Specifically interested in real estate and property law expertise, with a focus on zoning regulations and property rights. 
Require support from a Spanish-speaking legal team and prefer an office located within the Westchester area. 
Appreciate any relevant case studies, legal documentation, or scholarly articles."""


In [10]:
#selections for GPT to use to generate training data
practice_areas = ['labour', 'corporate', 'intellectual property', 'criminal', 'family', 'real estate', 'property'
                  , 'administrative', 'commercial', 'bankruptcy', 'immigration', 'tax', 'civil', 'health', 'insurance'
                  , 'construction', 'dispute resolution', 'environmental', 'lawsuit', 'business', 'competition'
                  , 'constitutional', 'education']


locations = ['chicago','new york','washington','pittsburgh','los angeles','boston','miami','atlanta','richmond','milwaukee'
            , 'seattle','san francisco','palo alto','cleveland','san diego', 'houston','kansas city','nashville','philadelphia'
            , 'detroit','dallas']


languages = ['english','french','spanish','german','chinese','japanese','korean','arabic','portuguese','russian','hindi'
             , 'malay','thai']

In [11]:
doc = nlp(text)

In [12]:
print (doc)

Seeking legal resources for a land dispute case in Westchester. 
Specifically interested in real estate and property law expertise, with a focus on zoning regulations and property rights. 
Require support from a Spanish-speaking legal team and prefer an office located within the Westchester area. 
Appreciate any relevant case studies, legal documentation, or scholarly articles.


### SpaCy base model + minor training

In [13]:
#testing the base model's ability to identify keywords

legal_entities = []
for ent in doc.ents:
    #print (ent)
    print(ent.text, ":", ent.label_)
    if ent.label_ == "LAW" or ent.label_ == "LOC":
        legal_entities.append(ent.text)

print("Legal entities found:")
for entity in legal_entities:
    print(entity)



Westchester : ORG
Specifically : ORG
Spanish : LANGUAGE
Westchester : LOC
Legal entities found:
Westchester


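As expected, the base model does a poor job of identifying the relevant keywords, especially in the legal domain: it tags 'Specifically' as ORG, labels 'Westchester' inconsistently (once as ORG, once as LOC), and has no label for practice areas.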

In [14]:
#all entities that this model supports
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [21]:
#setting custom NER

from spacy.training import Example

#import blank model
nlp = spacy.blank("en")

#from the training example, manually locate the keywords the NER should look for and their category.
#spaCy expects character offsets (start_char, end_char, label), not token indices.
label = 'PRACTICE_AREA'
train_data = [
    ("interested in real estate and property law expertise, with a focus on zoning regulations and property rights",
     {"entities": [(14, 25, label),    # "real estate"
                   (30, 42, label)]})  # "property law"
]

# Add the custom entity to the model's pipeline
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner")
else:
    ner = nlp.get_pipe("ner")

# Add your custom label to the model's entity recognizer
ner.add_label(label)

# Initialize the model weights before training (nlp.begin_training() is deprecated in spaCy v3)
nlp.initialize()


# Train the model with your annotated data
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example], losses={})
    
    
    
# Save the trained model
nlp.to_disk("custom_ner_model")

# Load the trained custom NER model
nlp = spacy.load("custom_ner_model")  


# Sample text for testing
test_text = "We need legal advice for a potential real estate transaction."

# Apply the NER model to the test text
doc = nlp(test_text)

# Extract entities from the test text
for ent in doc.ents:
    print(ent.text, ent.label_)

print('is this working')

is this working


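Even though both the training data and the test text mention 'real estate', the model was not able to recognize it. The model starts from a blank pipeline and sees only a handful of annotations in a single update pass, which is most likely too little signal to learn the label.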
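spaCy's `entities` annotations are character offsets into the text, so locating them by hand is error-prone. As a minimal sketch (the helper name `find_entity_offsets` is hypothetical and not part of spaCy), the offsets could be derived from a keyword list with a regular expression:

```python
import re

def find_entity_offsets(text, keywords, label):
    """Return sorted (start_char, end_char, label) tuples for each keyword occurrence."""
    entities = []
    for keyword in keywords:
        # \b keeps a short keyword such as 'tax' from matching inside 'taxation'
        pattern = r"\b" + re.escape(keyword) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            entities.append((match.start(), match.end(), label))
    return sorted(entities)

sample = "interested in real estate and property law expertise"
offsets = find_entity_offsets(sample, ["real estate", "property law"], "PRACTICE_AREA")

print(offsets)
for start, end, label in offsets:
    # slicing the text with the offsets should give back the keyword
    print(repr(sample[start:end]), label)
```

Overlapping keywords (for example 'property' and 'property law') would produce overlapping spans, which spaCy's NER does not accept, so a real keyword list would need de-duplication first.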

In [22]:
#another test
test_text = "We need legal advice for a potential real estate transaction."

In [17]:
doc = nlp(test_text)

In [18]:
for ent in doc.ents:
    print (ent)
    print(ent.text, ":", ent.label_)

In [19]:
print(doc)

We need legal advice for a potential real estate transaction.


As before, no entities are recognized.

### Building custom NER with spaCy

In [23]:
!python -m spacy info


spaCy version    3.7.2                         
Location         C:\Users\gray.kim\AppData\Local\anaconda3\Lib\site-packages\spacy
Platform         Windows-10-10.0.19045-SP0     
Python version   3.11.5                        
Pipelines        en_core_web_sm (3.7.1)        



In [24]:

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

# load a blank spacy model
nlp = spacy.blank("en") 
# create a DocBin object
db = DocBin()

Using the online annotation tool linked at the top of this notebook, I asked ChatGPT to generate 5 long sample user inputs, then ran each text through the tool to mark the skill (expertise), practice area, location, and language it mentions.

In [26]:
import json

# load the annotations exported from the NER annotator tool
with open('custom_ner_model/sample_data/training_data.json') as f:
    train_data = json.load(f)

In [27]:
train_data['annotations'][0]

['Our legal team specializing in labour law is currently managing a case in Chicago. We require comprehensive legal resources and expertise in employment contracts and labor disputes. Fluency in Spanish within the legal team is essential for effective communication with our diverse workforce. Please provide any relevant case studies or legal documentation related to labour law.\r',
 {'entities': [[31, 41, 'PRACTICE_AREA'],
   [74, 82, 'LOCATION'],
   [141, 151, 'PRACTICE_AREA'],
   [152, 161, 'PRACTICE_AREA'],
   [166, 181, 'PRACTICE_AREA'],
   [193, 200, 'LANG'],
   [367, 377, 'PRACTICE_AREA']]}]
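Before training, it can help to slice the text with each annotated offset and look at what the span actually covers. The sketch below is plain Python; `check_spans` is a hypothetical helper, and the text is only the first sentence of the annotation shown above with its first two offsets.

```python
import string

EDGE_CHARS = string.whitespace + string.punctuation

def check_spans(text, entities):
    """Return (span_text, label, is_clean) for each [start, end, label] annotation."""
    report = []
    for start, end, label in entities:
        span_text = text[start:end]
        # a span is "clean" when it neither starts nor ends with whitespace or punctuation
        is_clean = (
            bool(span_text)
            and span_text[0] not in EDGE_CHARS
            and span_text[-1] not in EDGE_CHARS
        )
        report.append((span_text, label, is_clean))
    return report

text = "Our legal team specializing in labour law is currently managing a case in Chicago."
entities = [[31, 41, "PRACTICE_AREA"], [74, 82, "LOCATION"]]

report = check_spans(text, entities)
for span_text, label, is_clean in report:
    print(repr(span_text), label, "ok" if is_clean else "check edges")
```

Here the `[74, 82]` span covers `'Chicago.'` including the period. Annotations like this may be one reason predictions such as `'Boston.'` appear with trailing punctuation, though that is an assumption rather than something the notebook verifies.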

In [28]:
# save the docbin object
for text, annot in tqdm(train_data['annotations']): 
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents 
    db.add(doc)

db.to_disk("./training_data.spacy") 

100%|██████████| 6/6 [00:00<00:00, 859.52it/s]


In [30]:
#start up model

! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency -F

[!] To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
[i] Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[+] Auto-filled config with all values
[+] Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [31]:
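#train model
#note: the same file is passed as both training and dev data, so the scores below
#measure how well the model fits the training set, not how well it generalizes

! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy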


[i] Saving to output directory: .
[i] Using CPU

[+] Initialized pipeline

[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     30.22    0.00    0.00    0.00    0.00
 39     200        237.61   1199.70  100.00  100.00  100.00    1.00
102     400          0.00      0.00  100.00  100.00  100.00    1.00
169     600          0.00      0.00  100.00  100.00  100.00    1.00
265     800          0.00      0.00  100.00  100.00  100.00    1.00
365    1000          0.48      0.42  100.00  100.00  100.00    1.00
500    1200          0.00      0.00  100.00  100.00  100.00    1.00
700    1400          0.00      0.00  100.00  100.00  100.00    1.00
900    1600          0.00      0.00  100.00  100.00  100.00    1.00
1100    1800          0.00      0.00  100
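The training command above passes `training_data.spacy` for both `--paths.train` and `--paths.dev`, so the 100.00 scores describe fit to the training documents only. A minimal sketch of holding out a dev set before building the `DocBin` files follows; `split_annotations`, the 20% fraction, and the placeholder documents are all assumptions for illustration.

```python
import random

def split_annotations(annotations, dev_fraction=0.2, seed=0):
    """Shuffle annotated documents and split them into (train, dev) lists."""
    items = list(annotations)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_dev = max(1, int(len(items) * dev_fraction))  # always hold out at least one document
    return items[n_dev:], items[:n_dev]

# placeholder documents standing in for train_data['annotations']
docs = [(f"document {i}", {"entities": []}) for i in range(6)]
train_docs, dev_docs = split_annotations(docs)

print(len(train_docs), "training documents,", len(dev_docs), "dev documents")
```

Each list would then be written to its own `.spacy` file and passed to `--paths.train` and `--paths.dev` respectively. With only six documents the dev score will be noisy, but it would at least be measured on text the model has not seen.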

## model testing

#### Test 1

In [32]:
#categorize into keyword buckets
def buckets(prompt):
    
    prac_a = []
    expert = []
    lang = []
    locat = []
    
    for ent in prompt.ents:
        #print(ent.text,":",ent.label_)
        if ent.label_ == 'PRACTICE_AREA': prac_a.append(ent.text)
        elif ent.label_ == 'EXPERTISE': expert.append(ent.text)
        elif ent.label_ == 'LANG': lang.append(ent.text)
        elif ent.label_ == 'LOCATION': locat.append(ent.text)
    
    return prac_a, expert, lang, locat
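Since the stated goal is output in an easily consumable form, the same grouping could also be serialized to JSON. This is only a sketch: `entities_to_json` is a hypothetical helper, and the pairs below are hand-written stand-ins for what `[(ent.text, ent.label_) for ent in doc.ents]` would yield on a spaCy `Doc`.

```python
import json
from collections import defaultdict

def entities_to_json(pairs):
    """Group (text, label) pairs by label and return them as a JSON string."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return json.dumps(grouped, indent=2, sort_keys=True)

# hand-written example pairs, not model output
pairs = [
    ("corporate law", "PRACTICE_AREA"),
    ("Spanish", "LANG"),
    ("Washington", "LOCATION"),
    ("construction law", "PRACTICE_AREA"),
]

payload = entities_to_json(pairs)
print(payload)
```

Grouping by whatever label appears, rather than hard-coding four lists, means a new label added to the training data would show up in the output without changing this function.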

In [33]:
#load model
nlp_ner = spacy.load("model-best") 

In [34]:
long_prompt = nlp_ner("""Our legal team, specializing in corporate law, is currently managing a complex business acquisition case in Washington. We require comprehensive legal resources and expertise in mergers and acquisitions, with a specific focus on due diligence and contract negotiations. Fluency in Spanish within the legal team is crucial for effective communication with our diverse international stakeholders. Our legal office, strategically located in the heart of Washington, allows us to stay updated with the latest legislative changes and business regulations.

In addition to our involvement in corporate law, we have recently taken on a significant construction dispute case in Boston. We are seeking expert guidance in construction law and arbitration procedures to resolve the ongoing disputes over contractual obligations and project timelines. Proficiency in Mandarin within the legal team is essential as we are working closely with several Chinese construction firms involved in the dispute. Our extensive experience in handling complex construction cases has enabled us to provide effective legal strategies and negotiation tactics to ensure swift resolutions.

Furthermore, our firm has been dedicated to providing comprehensive legal services in intellectual property law in San Francisco for over two decades. We specialize in patent registrations, trademark infringements, and copyright litigations, catering to a diverse clientele ranging from tech startups to established multinational corporations. With our multilingual legal team proficient in German, Japanese, and French, we have successfully represented our clients in numerous high-profile IP cases both domestically and internationally.

As part of our commitment to serving our clients' needs, we have expanded our practice to include healthcare law and regulatory compliance in New York. Our legal experts are well-versed in healthcare regulations, compliance standards, and litigation procedures, providing strategic counsel to healthcare providers, institutions, and pharmaceutical companies. Fluency in Arabic within our legal team has enabled us to effectively communicate with our Middle Eastern clients and navigate the complexities of healthcare regulations in the international market.

Our firm's diverse expertise extends to environmental law, with a particular focus on sustainability and green initiatives in Los Angeles. We have successfully represented clients in cases involving environmental compliance, renewable energy projects, and land development regulations. Our proficiency in Korean and Chinese languages has facilitated our engagement with international stakeholders seeking our legal guidance on various environmental initiatives and projects.

With our dedication to providing top-tier legal services across diverse practice areas and our multilingual capabilities, we remain committed to serving our clients with the highest standards of professionalism and expertise.""")


In [35]:
#print out results

p,e,la,lo = buckets(long_prompt)

print('Practice Areas: ', p)
print('Expertise: ', e)
print('Language: ', la)
print('Location: ', lo)


Practice Areas:  ['corporate law', 'mergers and', 'corporate law', 'construction law', 'project timelines.', 'intellectual property', 'patent registrations', 'litigation procedures', 'environmental law', 'land development', 'languages has', 'environmental initiatives']
Expertise:  ['business acquisition', 'diligence and', 'business regulations.', 'the dispute', 'enabled us', 'numerous high', 'internationally.', 'healthcare regulations', 'compliance standards', 'the international', 'cases involving', 'environmental compliance,', 'projects.', '-tier legal']
Language:  ['Washington.', 'Spanish', 'Washington', 'Mandarin', 'German', 'Japanese', 'French', 'Arabic', 'Middle Eastern clients', 'Korean', 'Chinese']
Location:  ['Boston.', 'San Francisco', 'New York', 'Los Angeles']


This is a great improvement over both the pre-trained general-purpose model and the blank model with minor training.  
There are still errors, such as 'mergers and', 'project timelines.', and 'languages has' tagged as practice areas, and 'the dispute', 'enabled us', and 'numerous high' tagged as expertise.  
Some locations ('Washington.', 'Washington') also ended up in the language bucket.  
Still, this is far better than the initial attempt, given that the model was trained on only 5 documents, albeit with many keyword examples embedded in each.
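Some of the noise in the buckets is cosmetic, such as trailing punctuation ('Boston.') or repeated values. A small post-processing step could tidy that up before export. This is a sketch with hypothetical helper names; it does not correct wrong labels or fragments such as 'mergers and'.

```python
import string

def clean_entity_text(text):
    """Strip punctuation and whitespace from both ends of an entity string."""
    return text.strip(string.punctuation + string.whitespace)

def clean_bucket(entities):
    """Clean each entity, drop empty results, and remove case-insensitive duplicates in order."""
    seen = set()
    cleaned = []
    for entity in entities:
        value = clean_entity_text(entity)
        key = value.lower()
        if value and key not in seen:
            seen.add(key)
            cleaned.append(value)
    return cleaned

# values copied from the Test 1 output above
locations = ["Boston.", "San Francisco", "New York", "Los Angeles"]
languages = ["Washington.", "Spanish", "Washington", "Mandarin"]

print(clean_bucket(locations))
print(clean_bucket(languages))
```

'Washington' still sits in the language bucket after cleaning, so this only improves presentation; the labeling errors need better training data.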


We will run a few more tests.

In [50]:
print('Total keywords identified: 43')
print('True Positives: 27')
print('Accuracy: 63%')

Total keywords identified: 43
True Positives: 27
Accuracy: 63%
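The figure reported here is true positives divided by the number of entities the model identified, which is usually called precision. Computing it from the counts avoids hand-typed percentages. The sketch below uses the Test 1 counts from the cell above; the function name is an assumption.

```python
def precision(true_positives, predicted):
    """Share of predicted entities that were correct; 0.0 when nothing was predicted."""
    if predicted == 0:
        return 0.0
    return true_positives / predicted

# counts taken from the Test 1 cell above
total_identified = 43
true_positives = 27

score = precision(true_positives, total_identified)
print(f"Total keywords identified: {total_identified}")
print(f"True Positives: {true_positives}")
print(f"Precision: {score:.0%}")
```

Recall (the share of real keywords that the model found) would also need a count of the keywords it missed, which this notebook does not record.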


In [43]:
spacy.displacy.render(long_prompt, style="ent", jupyter=True)

#### Test 2

In [37]:
long_prompt_2 = nlp_ner("""Our legal team, specializing in civil law, is currently handling a high-profile lawsuit in Houston. We require comprehensive legal resources and expertise in civil litigation, with a specific focus on dispute resolution and trial preparation. Fluency in French within the legal team is crucial for effective communication with our French-speaking clients. Our office, strategically located in the bustling city of Houston, allows us to provide timely and effective legal representation to our clients.

In addition to our involvement in civil law, we have recently taken on a complex administrative case in San Diego. We are seeking expert guidance in administrative law and regulatory compliance to navigate the intricate regulatory landscape and ensure our clients' interests are protected. Proficiency in Russian within the legal team is essential as we are representing several international clients facing administrative challenges. Our extensive experience in handling administrative cases has equipped us with the necessary skills to advocate for our clients effectively.

Furthermore, our firm has a strong emphasis on competition law and antitrust regulations in Miami. We specialize in providing legal counsel to businesses and corporations, ensuring compliance with competition laws and regulations, and representing clients in cases involving antitrust violations and unfair business practices. With our multilingual legal team proficient in Chinese, Spanish, and Portuguese, we have successfully represented clients in complex antitrust cases both domestically and internationally.

As part of our commitment to serving our clients' diverse needs, we have expanded our practice to include education law and policy advocacy in Philadelphia. Our legal experts are well-versed in education regulations, policy frameworks, and student rights, providing strategic counsel to educational institutions, students, and parents. Fluency in Arabic within our legal team has enabled us to effectively communicate with our Arabic-speaking clients and advocate for their educational rights in the region.

Our firm's diverse expertise also extends to the field of health insurance law, with a particular focus on ensuring fair and just insurance coverage for our clients in Dallas. We have successfully represented numerous clients in cases involving insurance disputes, coverage denials, and bad faith insurance practices. Our proficiency in Japanese and Korean languages has facilitated our communication with international clients seeking our legal guidance on various health insurance matters.

With our commitment to excellence and our multilingual capabilities, we are dedicated to providing top-tier legal services across diverse practice areas and ensuring the protection of our clients' rights and interests.""")


In [38]:
#categorize for export
p,e,la,lo = buckets(long_prompt_2)

print('Practice Areas: ', p)
print('Expertise: ', e)
print('Language: ', la)
print('Location: ', lo)


Practice Areas:  ['civil law', 'civil law', 'administrative case in', 'administrative law', 'equipped us', 'providing legal', 'laws and', 'unfair business practices', 'complex antitrust']
Expertise:  ['-profile lawsuit', 'civil litigation', 'compliance to', 'protected.', 'handling administrative', 'cases involving', 'internationally.', 'education regulations', 'the region', 'cases involving', 'Korean languages has', '-tier legal', 'interests.']
Language:  ['French', 'French-speaking', 'the', 'Houston', 'Russian', 'Chinese', 'Spanish', 'Portuguese', 'Arabic', 'Arabic-speaking', 'Dallas.', 'Japanese']
Location:  ['Houston.', 'San Diego', 'Miami.', 'Philadelphia.']


In [51]:
print('Total keywords identified: 35')
print('True Positives: 18')
print('Accuracy: 51%')

Total keywords identified: 35
True Positives: 18
Accuracy: 51%


In [39]:
#highlight in text
spacy.displacy.render(long_prompt_2, style="ent", jupyter=True)

#### Test 3


In [44]:
lp3 = nlp_ner("""Our legal team, specializing in dispute resolution, is currently handling a complex case in Milwaukee. We require comprehensive legal resources and expertise in alternative dispute resolution methods and negotiation tactics. Fluency in Korean within the legal team is crucial for effective communication with our Korean-speaking clients. Our office, strategically located in the heart of Milwaukee, allows us to provide timely and effective legal representation to our clients.

In addition to our involvement in dispute resolution, we have recently taken on a sensitive commercial case in Palo Alto. We are seeking expert guidance in commercial law and contract disputes to ensure fair and just business transactions and agreements. Proficiency in Malay within the legal team is essential as we are representing several international companies facing commercial challenges. Our extensive experience in handling commercial cases has equipped us with the necessary skills to negotiate on behalf of our clients effectively.

Furthermore, our firm has a strong emphasis on environmental law and sustainability initiatives in Dallas. We specialize in providing legal counsel to businesses and organizations seeking to comply with environmental regulations and promote green practices. With our multilingual legal team proficient in Arabic, Spanish, and Portuguese, we have successfully guided clients through various environmental initiatives and provided effective solutions to promote sustainable practices.

As part of our commitment to serving our clients' diverse needs, we have expanded our practice to include insurance law and policy advocacy in Cleveland. Our legal experts are well-versed in insurance regulations, policy frameworks, and claim settlements, providing strategic counsel to policyholders and insurance companies. Fluency in Hindi within our legal team has enabled us to effectively communicate with our Hindi-speaking clients and advocate for their insurance rights in the region.

Our firm's diverse expertise also extends to the field of criminal law, with a particular focus on defending clients in Los Angeles. We have successfully represented numerous clients in cases involving criminal defense, plea negotiations, and trial representation. Our proficiency in Russian and Chinese languages has facilitated our communication with international clients seeking our legal guidance on various criminal law matters.

With our commitment to excellence and our multilingual capabilities, we are dedicated to providing top-tier legal services across diverse practice areas and ensuring the protection of our clients' rights and interests.""")

In [45]:
#categorize for export
p,e,la,lo = buckets(lp3)

print('Practice Areas: ', p)
print('Expertise: ', e)
print('Language: ', la)
print('Location: ', lo)

Practice Areas:  ['commercial law', 'transactions and', 'equipped us', 'environmental law', 'providing legal', 'provided effective', 'criminal law', 'languages has']
Expertise:  ['dispute resolution', 'alternative dispute', 'negotiation tactics.', 'dispute resolution', 'environmental regulations', 'environmental initiatives', 'insurance regulations', 'insurance companies.', 'the region', 'cases involving', '-tier legal', 'interests.']
Language:  ['Korean', 'Korean-speaking', 'Milwaukee', 'Malay', 'Dallas.', 'Arabic', 'Spanish', 'Portuguese', 'Hindi', 'Hindi-speaking', 'Russian', 'Chinese']
Location:  ['Milwaukee.', 'Palo Alto', 'Cleveland.', 'Los Angeles']


In [52]:
print('Total keywords identified: 36')
print('True Positives: 26')
print('Accuracy: 72%')

Total keywords identified: 36
True Positives: 26
Accuracy: 72%


In [46]:
#highlight in text
spacy.displacy.render(lp3, style="ent", jupyter=True)

#### Test with web-scraped bio

In [47]:
#using scraped bio from WebScraper.ipynb

long_prompt = nlp_ner("""Mike Abcarian is managing partner of the firm's Dallas office. For over 30 years he has represented Fortune 500 corporations, units of local government, and local business interests in labor and employment matters. He has handled hundreds of lawsuits in federal and state courts with an exceptional success record, including lead counsel defense of complex litigation and nationwide class actions. Many of Mike's successful cases resulted in defense verdicts for employer clients following trial by jury. Mike also handles complex workplace safety matters, including fatality investigations, and has represented employers in high-visibility proceedings before the Occupational Safety and Health Administration (OSHA). He has handled significant compensation compliance matters--some involving thousands of employees--in proceedings before the Wage & Hour Division of the U. S. Department of Labor (USDOL). Mike also appears frequently before the Equal Employment Opportunity Commission (EEOC) defending employers in discrimination matters.  He also represents employers before the National Labor Relations Board (NLRB) in union representation proceedings and unfair labor practice proceedings, and in arbitration of labor disputes and labor contract negotiations. Throughout his career, Mike has been a sought-after speaker and a prolific author on labor and employment law issues. He is "AV" Peer Review Rated by Martindale-Hubbell for preeminent skill and ethics, and he has been listed in Texas Super Lawyers every year since 2004. Mike has also been listed in Best Lawyers in America since 2012 and was listed in Chambers USA since 2016. In 2018, Mike was inducted as a Fellow into The College of Labor and Employment Lawyers. Election as a Fellow is the highest recognition by an attorney's colleagues of sustained outstanding performance in the profession, exemplifying integrity, dedication and excellence. """)


In [48]:
#print out results

p,e,la,lo = buckets(long_prompt)

print('Practice Areas: ', p)
print('Expertise: ', e)
print('Language: ', la)
print('Location: ', lo)


Practice Areas:  ['local business interests', 'labor and', 'federal and', 'litigation and', 'Health Administration (', 'proceedings before', 'labor practice', 'arbitration of', 'labor disputes and', 'labor contract negotiations', 'labor and', 'employment law', 'Best Lawyers in', 'Labor and']
Expertise:  ['office.', 'defense verdicts', 'high-', 'compliance matters', 'employees--', 'discrimination matters.', 'union representation', '-after speaker', 'ethics,', 'every year since', '2018,', 'the profession,', 'excellence.']
Language:  ['Abcarian', ').', 'Labor', '(', 'Commission (EEOC', 'Board (', 'Review', 'Rated', 'Martindale-Hubbell', 'Lawyers', 'America', 'College', 'Lawyers.', 'Fellow']
Location:  ['Labor Relations', 'Texas Super', 'Chambers USA']


In [53]:
print('Total keywords identified: 44')
print('True Positives: 8')
print('Accuracy: 18%')

Total keywords identified: 44
True Positives: 8
Accuracy: 18%


In [54]:
print('bad recognition.  Train data with more bio data')

bad recognition.  Train data with more bio data


An interesting finding here.  Tests 1 to 3 each scored 50% or higher, while the scraped bio performed very badly.  There are several likely reasons for this.

First, and most important, the model was trained without enough diversity.  The training texts all followed a very similar format, with only the keywords swapped out, so when the model met a kind of text it had never seen before, recognition was poor.  Second, there was simply not enough training data.  Third, scraped bios are written in a different style from the generated requests.

The last point raises a broader concern: a custom NER built from a small amount of training data would be difficult to use in production, especially if it is meant to serve many clients who each have their own writing style and keywords.  With this approach we cannot make the model general enough to cover the majority of our clients.