# Named Entity Recognition (NER) and Keyword Extraction

This notebook coves the topic of identifying which celebrities JUUL targeted to promote their vaping product.

For researchers who would prefer to work with optical character recognition (OCR) text for JUUL vs State of North Carolina case from within their own database systems, these files are available for free download via the link below.
https://ucsf.app.box.com/v/IDL-DataSets/file/1447029625798

Note: The link provides access to the most current dataset, as the website undergoes a new release each month. Ensure that you have sufficient storage available to download the zip file with the OCR text (~32GB).


#### Step 1: Retrieve documents ids relevant to the query using the API wrapper

Import the required libraries

In [1]:
import re
import polars as pl
from industryDocumentsWrapper import IndustryDocsSearch

Code to retrieve document ids which will then be used to retrieve OCR text

In [48]:
query = '(collection:"JUUL labs Collection" AND case:"State of North Carolina" AND Youth OR adolescent AND type:"Email")' # modify as required
api = IndustryDocsSearch()
api.query(q=query, n=-1)
api.save('celebrity_email_query.parquet', format='parquet')


100/11052 documents collected
200/11052 documents collected
300/11052 documents collected
400/11052 documents collected
500/11052 documents collected
600/11052 documents collected


KeyboardInterrupt: 

In [2]:
celebrity_emails = pl.read_parquet('celebrity_email_query.parquet')
nc_emails = pl.read_parquet('../../data/juul_nc_emails.parquet')

In [3]:
# Filter the nc_emails DataFrame to only include entries whose 'id' is present in celebrity_emails
filtered_nc_emails = nc_emails.filter(pl.col("id").is_in(celebrity_emails["id"]))

# Display the shape of the original and filtered DataFrames to see how many emails match
print(f"Original nc_emails shape: {nc_emails.shape}")
print(f"Celebrity emails shape: {celebrity_emails.shape}")
print(f"Filtered nc_emails shape: {filtered_nc_emails.shape}")

# Preview the filtered DataFrame
filtered_nc_emails.select(["id", "bates", "type"]).head(5)

Original nc_emails shape: (1685701, 71)
Celebrity emails shape: (34091, 25)
Filtered nc_emails shape: (34090, 71)


id,bates,type
str,str,str
"""hqmp0299""","""JLI05753504""","""email"""
"""gypy0299""","""JLI04787118""","""email"""
"""xypy0299""","""JLI04787139""","""email"""
"""ssnx0338""","""JLI42903959""","""email"""
"""nnyx0338""","""JLI00448237""","""email"""


In [4]:
filtered_nc_emails.write_parquet('celebrity_emails_ocr_text.parquet')

#### Named Entity Recognition (NER) is a method that identifies and classifies key information within the text into predefined categories such as names of people, organizations, locations, dates, and other entities.

#### For instance, NER can be used to identify key celebrities and organizations within the legal documents. This can be achieved using BERT (Bidirectional Encoder Representations from Transformers) which is a model designed to help machines understand human language more effectively. It's based on a type of neural network architecture called a transformer, which is particularly good at processing sequences of data, like sentences.

#### Transformer-based models can be fine-tuned for specific tasks by training them on specialized datasets. The model's ability to recognize different categories, such as sentiment, named entities, or topics, depends on what data it was trained on.



Run the code cell below to perform name entity recognition analysis directly using the OCR text (email). The output is seperated by each individual tag for ease of analysis

In [6]:
df = pl.read_parquet('celebrity_emails_ocr_text.parquet')

**Let's look at some of the columns that we're working with**

In [7]:
df.columns

['id',
 'tid',
 'bates',
 'type',
 'description',
 'title',
 'author',
 'mentioned',
 'attending',
 'copied',
 'recipient',
 'redacted',
 'collection_name',
 'pages',
 'exhibit_number',
 'document_date',
 'date_added_ucsf',
 'date_modified_ucsf',
 'date_added_industry',
 'date_modified_industry',
 'date_produced',
 'date_shipped',
 'deposition_date',
 'date_privilege_logged',
 'case',
 'industry',
 'drug',
 'adverse_ruling',
 'area',
 'bates_alternate',
 'box',
 'brand',
 'country',
 'language',
 'court',
 'format',
 'express_waiver',
 'file',
 'genre',
 'keywords',
 'bates_master',
 'other_number',
 'request_number',
 'minnesota_request_number',
 'privilege_code',
 'topic',
 'witness',
 'cited',
 'availability',
 'grant_number',
 'source',
 'folder',
 'series',
 'chemical',
 'food',
 'rights',
 'attachment',
 'attachmentnum',
 'conversation',
 'conversationid',
 'custodian',
 'datereceived',
 'datesent',
 'filename',
 'filepath',
 'messageid',
 'subject',
 'timereceived',
 'timesent',

**Let's slim down our dataframe to just some columns that may be relevant**

In [8]:
emails_df = df.select(["id", "bates", "type", "document_date", "title", "author", "mentioned", "copied", "recipient", "ocr_text"])

In [9]:
emails_df.sample(3)

id,bates,type,document_date,title,author,mentioned,copied,recipient,ocr_text
str,str,str,str,str,str,str,str,str,str
"""sgyk0288""","""JLI06795128""","""email""","""Tue Jan 16 16:00:00 PST 2018""","""Re: These are the stories that…","""Ben Schwartz <bschwartz@gacapi…",""" ""","""Jenny Kim <""jenny kim <jenny@j…","""Vittal Kadapakkam <""vittal kad…","""From: To: CC: Sent: Subjec…"
"""qhxb0302""","""JLI05118491""","""email""","""Mon Dec 03 16:00:00 PST 2018""","""Re:L Redacted""","""Redacted""",""" """,""" ""","""Jessica Taylor""","""From: To: Sent: Subject: …"
"""hllp0316""","""JLI04498841""","""email""","""Mon Apr 13 17:00:00 PDT 2015""","""Re: Pax 2 Same Day Delivery Se…","""Paul Moraes <paul@ploom.com>""",""" ""","""Laina Payne <""laina payne <lai…","""Rafael Burde <""rafael burde <r…","""From: To: CC: Paul Moraes…"


**Now, we'll go ahead and import the necessary Python libraries and set up our model for NER**

In [10]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, ModernBertConfig, pipeline
from itertools import tee, islice

  from .autonotebook import tqdm as notebook_tqdm


**[bert-base-NER](https://huggingface.co/dslim/bert-base-NER) is a model that is built on the [base-BERT](https://huggingface.co/google-bert/bert-base-uncased) large language model and has been specifically fine-tuned for NER tasks.**

In [11]:
# Initialize the BERT NER model and tokenizer
model_name = "dslim/bert-base-NER"
# config = ModernBertConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Initialize the pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [12]:
def extract_relevant_text(text):
    match = re.search(r'Subject:.*?CONFIDENTIAL|Re:.*?CONFIDENTIAL', text, re.DOTALL)
    if match:
        relevant_text = match.group(0)
        # Remove 'CONFIDENTIAL' and the leading part up to "Subject:" or "Re:"
        relevant_text = re.sub(r'(Subject:|Re:)', '', relevant_text)
        relevant_text = relevant_text.replace('CONFIDENTIAL', '').strip()
        return relevant_text
    return ""

In [13]:
# Apply the extract_relevant_text function to the ocr_text column and create a new column
emails_df = emails_df.with_columns(
    pl.col("ocr_text").map_elements(extract_relevant_text, return_dtype=pl.String).alias("cleaned_text")
)

# Display the first few rows to see the results
print(f"Shape of emails_df with cleaned text: {emails_df.shape}")
print("Sample of cleaned text:")
emails_df.select(["id", "bates", "cleaned_text"]).head(5)

Shape of emails_df with cleaned text: (34090, 11)
Sample of cleaned text:


id,bates,cleaned_text
str,str,str
"""hqmp0299""","""JLI05753504""","""Kate Morgan on behalf of Kate …"
"""gypy0299""","""JLI04787118""","""Jessica Edmondson 3/2/2018 10…"
"""xypy0299""","""JLI04787139""","""Jessica Edmondson on behalf of…"
"""ssnx0338""","""JLI42903959""","""Michael Swanson on behalf of M…"
"""nnyx0338""","""JLI00448237""","""Kate Morgan on behalf of Kate …"


In [14]:
# Function to extract NER tags
def extract_ner_tags(text):
    ner_results = ner_pipeline(text)
    return ner_results

**Let's take a look at what the model pulled from the first email** 

To see what the NER entities refer to, see the [bert-base-NER description](https://huggingface.co/dslim/bert-base-NER)

In [20]:
extract_ner_tags(emails_df['cleaned_text'][1])

[{'entity': 'B-PER',
  'score': 0.9996094,
  'index': 1,
  'word': 'Jessica',
  'start': 0,
  'end': 7},
 {'entity': 'I-PER',
  'score': 0.9990388,
  'index': 2,
  'word': 'Edmond',
  'start': 8,
  'end': 14},
 {'entity': 'I-ORG',
  'score': 0.39536187,
  'index': 41,
  'word': 'On',
  'start': 124,
  'end': 126},
 {'entity': 'I-MISC',
  'score': 0.785701,
  'index': 42,
  'word': 'Fr',
  'start': 127,
  'end': 129},
 {'entity': 'I-LOC',
  'score': 0.33906993,
  'index': 43,
  'word': '##i',
  'start': 129,
  'end': 130},
 {'entity': 'B-PER',
  'score': 0.9995409,
  'index': 55,
  'word': 'Jessica',
  'start': 157,
  'end': 164},
 {'entity': 'I-PER',
  'score': 0.9978004,
  'index': 56,
  'word': 'Edmond',
  'start': 165,
  'end': 171},
 {'entity': 'B-PER',
  'score': 0.99941754,
  'index': 221,
  'word': 'Jessica',
  'start': 708,
  'end': 715},
 {'entity': 'I-PER',
  'score': 0.9981589,
  'index': 222,
  'word': 'Edmond',
  'start': 716,
  'end': 722},
 {'entity': 'B-LOC',
  'score':

In [None]:
# Add NER data to the emails_df dataframe
ner_df = emails_df.with_columns(
    pl.col("cleaned_text").map_elements(lambda text: extract_ner_tags(text) if text else [], return_dtype=pl.List).alias("ner_entities")
)

# Let's check a sample to see the results
ner_df.select(["id", "bates", "cleaned_text", "ner_entities"]).sample(5)

ComputeError: KeyboardInterrupt: 

**Let's go ahead and save this new dataframe to a parquet file**

In [50]:
ner_df.write_parquet('celebrity_emails_ner.parquet')

In [59]:
category_df.head(5)

id,word,tag
str,str,str
"""ffbf0286""","""Kevin""","""LABEL_1"""
"""ffbf0286""",""" <kburns©ju""","""LABEL_1"""
"""ffbf0286""",""".""","""LABEL_1"""
"""ffbf0286""","""> ""","""LABEL_1"""
"""ffbf0286""",""" Morgan Ashley""","""LABEL_1"""


Run the code cell below to remove duplicates in the word column for the person file. This is useful in cases where you you want a more concise list of words.

In [4]:
# Remove duplicates in ner_output_PER based on word column
# Load the parquet file
for category in categories:
    df = pl.read_parquet(f'ner_output_{category}.parquet')

    # Remove duplicates based on the 'word' column
    df_cleaned = df.unique(pl.col('word'))
    print(df_cleaned.shape)

    # Save the de-duplicated DataFrame to a new Parquet file
    df_cleaned.write_parquet(f'ner_output_{category}_cleaned.parquet',)

(36, 3)
(10, 3)
(18, 3)
(61, 3)


**Similarly, the another model can be used to extract keywords from the OCR text**

In [22]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Load the parquet file
df = pl.read_parquet('celebrity_emails_ocr_text.parquet')

# Initialize the keyword extraction model and tokenizer
model_name = "transformer3/H2-keywordextractor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Create a pipeline for keyword extraction
keyword_extraction_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

# Function to extract text between "Subject:"/"Re:" and "CONFIDENTIAL".
# This function searches for text between "Subject:" or "Re:" and "CONFIDENTIAL" within a given text to extract content from the email body.
def extract_relevant_text(text):
    match = re.search(r'(Subject:|Re:).*?(CONFIDENTIAL|$)', text, re.DOTALL)
    if match:
        relevant_text = match.group(0)
        # Remove 'CONFIDENTIAL' and the leading part up to "Subject:" or "Re:"
        relevant_text = re.sub(r'(Subject:|Re:)', '', relevant_text)
        relevant_text = relevant_text.replace('CONFIDENTIAL', '').strip()
        return relevant_text
    return ""

# Function to extract keywords
def extract_keywords(text):
    keyword_results = keyword_extraction_pipeline(text)
    return keyword_results

# Process each OCR text and extract keywords
keyword_data = []
for row in df.iter_rows():
    doc_id = row[df.columns.index('id')]
    text = extract_relevant_text(str(row[df.columns.index('ocr_content')]))
    if text:
        try:
            keyword_results = extract_keywords(text)
            for result in keyword_results:
                keywords = result['generated_text'].split(", ")
                for keyword in keywords:
                    keyword_data.append({'id': doc_id, 'keyword': keyword})
        except Exception:
            pass

# Convert the results to a DataFrame
keyword_df = pl.DataFrame(keyword_data)
print(keyword_df)
# Save the results to a new Parquet file
output_path = 'keyword_output.parquet'
keyword_df.write_parquet(output_path)

Device set to use mps:0


ValueError: 'ocr_content' is not in list

In [52]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig, pipeline


In [53]:
# Load the parquet file
df = pl.read_parquet('juul_query_with_ocr.parquet')

# Initialize the keyword extraction model and tokenizer
model_name = "google-t5/t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_config(config)

# Create a pipeline for keyword extraction
keyword_extraction_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)


Device set to use mps:0


In [54]:
def extract_relevant_text(text):
    match = re.search(r'(Subject:|Re:).*?(CONFIDENTIAL|$)', text, re.DOTALL)
    if match:
        relevant_text = match.group(0)
        # Remove 'CONFIDENTIAL' and the leading part up to "Subject:" or "Re:"
        relevant_text = re.sub(r'(Subject:|Re:)', '', relevant_text)
        relevant_text = relevant_text.replace('CONFIDENTIAL', '').strip()
        return relevant_text
    return ""

In [55]:
def extract_keywords(text):
    keyword_results = keyword_extraction_pipeline(text)
    return keyword_results

In [None]:
# Process each OCR text and extract keywords
keyword_data = []
for row in df.iter_rows():
    doc_id = row[df.columns.index('id')]
    text = extract_relevant_text(str(row[df.columns.index('ocr_text')]))
    if text:
        try:
            keyword_results = extract_keywords(text)
            for result in keyword_results:
                keywords = result['generated_text'].split(", ")
                for keyword in keywords:
                    keyword_data.append({'id': doc_id, 'keyword': keyword})
        except Exception:
            pass

keyword_df = pl.DataFrame(keyword_data)
keyword_df.sample(5)