# Named Entity Recognition for Text-Sample Characterisation

The `demo-ner.ipynb` notebook demonstrates Named Entity Recognition (NER) using a pre-trained model. NER is a natural language processing (NLP) technique that identifies and classifies specific entities in a text, such as the names of people, organizations, and locations, as well as dates and similar expressions. Each entity is tagged with a predefined category, which helps machines understand and analyse textual data.

The notebook is based on the [US Department of Labor's Ableist Language Detector](https://github.com/USDepartmentofLabor/ableist-language-detector). It includes steps for loading the model, processing text data, and extracting named entities such as persons, organizations, and locations. This information supports bias identification and mitigation by detecting and categorising references to specific demographic groups, such as gender, ethnicity, or nationality, within text data. Analysing how often and in what context these entities appear helps reveal patterns of bias, such as underrepresentation or stereotypical portrayals. The findings can then be used to adjust content or algorithms towards more balanced and fair treatment of different groups.
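As a rough illustration of the frequency analysis described above, the sketch below counts entity labels and mentions. It assumes the entities have already been extracted as `(text, label)` pairs; the sample pairs and the helper names `label_frequencies` and `mention_frequencies` are hypothetical and not part of the notebook or the FAID library.

```python
from collections import Counter

# Hypothetical NER output: (entity text, entity label) pairs.
# In practice these would come from a pre-trained NER model.
entities = [
    ("Maria", "PERSON"),
    ("Acme Corp", "ORG"),
    ("John", "PERSON"),
    ("Germany", "GPE"),
    ("John", "PERSON"),
]


def label_frequencies(pairs):
    """Count how often each entity label occurs."""
    return Counter(label for _, label in pairs)


def mention_frequencies(pairs, label):
    """Count mentions of each entity text for a single label."""
    return Counter(text for text, lab in pairs if lab == label)


label_counts = label_frequencies(entities)
person_counts = mention_frequencies(entities, "PERSON")
print(label_counts.most_common())
print(person_counts.most_common())
```

Comparing such counts across groups is only a starting point; the context of each mention still needs to be reviewed.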

In [None]:
import pandas as pd
from csv import DictReader
from ner.detect import BiasedLanguage
from ner.detect import get_biased_words, find_biased_language

In [1]:
WORDLIST_CSV_PATH = "data/ableist_word_list.csv"

ableist_words_df = pd.read_csv(WORDLIST_CSV_PATH)
ableist_words_df.head()

| | word | dependent | dependencies | alternative_words | example |
|---|---|---|---|---|---|
| 0 | climb | False | | ascend, raise, work atop | Ascend a ladder to work atop roofs of customers |
| 1 | touch | False | | activate, inspect, diagnose | Inspect the thickness of clothing material |
| 2 | feel | False | | activate, inspect, diagnose | Inspect the thickness of clothing material |
| 3 | hand | False | | move, install, operate, manage, put, place, tr... | Transport boxes from shipping dock to truck |
| 4 | carry | False | | move, install, operate, manage, put, place, tr... | Transport boxes from shipping dock to truck |


Each `BiasedLanguage` entry has five attributes:
1. The biased word, as a string
2. A boolean indicating whether the word has any dependencies
3. A set of dependencies
4. Alternative words that can replace the biased word
5. An example sentence

Because the CSV columns match these attributes, each row can be passed directly to the `BiasedLanguage` data structure to build an entry. In this example, the entries are ableist words.
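To make the five attributes concrete, here is a minimal sketch of a comparable structure. `BiasedLanguageSketch` and its `from_row` helper are hypothetical stand-ins, not the real `BiasedLanguage` class, and the comma-separated encoding of `dependencies` is an assumption; the field names follow the CSV columns shown above.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class BiasedLanguageSketch:
    """Hypothetical stand-in; the real BiasedLanguage may differ."""

    word: str
    dependent: bool
    alternative_words: List[str]
    example: str
    dependencies: Optional[Set[str]] = None

    @classmethod
    def from_row(cls, row):
        # CSV cells arrive as strings, so convert each to its intended type.
        # Assumption: dependencies are stored as a comma-separated string.
        deps = {d.strip() for d in row["dependencies"].split(",") if d.strip()}
        return cls(
            word=row["word"].strip(),
            dependent=row["dependent"].strip().lower() == "true",
            alternative_words=[w.strip() for w in row["alternative_words"].split(",")],
            example=row["example"].strip(),
            dependencies=deps or None,  # an empty set becomes None
        )


# Inline sample with the same columns as the word list.
sample_csv = io.StringIO(
    "word,dependent,dependencies,alternative_words,example\n"
    'climb,False,,"ascend, raise, work atop",Ascend a ladder to work atop roofs of customers\n'
)

entries = {}
for row in csv.DictReader(sample_csv):
    entry = BiasedLanguageSketch.from_row(row)
    entries[entry.word] = entry

print(entries["climb"])
```

Keying the dictionary by `word` mirrors the lookup built in the next cell.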

In [None]:
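# Build a lookup from each biased word to its BiasedLanguage entry.
ABLEIST_VERBS = {}
with open(WORDLIST_CSV_PATH, "r", newline="") as wordlist_csv:
    reader = DictReader(wordlist_csv)
    for row in reader:
        # Each CSV row maps column names to BiasedLanguage fields.
        row_data = BiasedLanguage(**row)
        ABLEIST_VERBS[row_data.word] = row_data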

In [3]:
print(ABLEIST_VERBS)

{'climb': BiasedLanguage(word='climb', dependent=False, alternative_words=['ascend', 'raise', 'work atop'], example='Ascend a ladder to work atop roofs of customers', dependencies=None), 'touch': BiasedLanguage(word='touch', dependent=False, alternative_words=['activate', 'inspect', 'diagnose'], example='Inspect the thickness of clothing material', dependencies=None), 'feel': BiasedLanguage(word='feel', dependent=False, alternative_words=['activate', 'inspect', 'diagnose'], example='Inspect the thickness of clothing material', dependencies=None), 'hand': BiasedLanguage(word='hand', dependent=False, alternative_words=['move', 'install', 'operate', 'manage', 'put', 'place', 'transfer', 'transport'], example='Transport boxes from shipping dock to truck', dependencies=None), 'carry': BiasedLanguage(word='carry', dependent=False, alternative_words=['move', 'install', 'operate', 'manage', 'put', 'place', 'transfer', 'transport'], example='Transport boxes from shipping dock to truck', depende

In [None]:
# The same loading step is also available as a helper in the FAID library
biased_words = get_biased_words(WORDLIST_CSV_PATH)
print(biased_words)

{'climb': BiasedLanguage(word='climb', dependent=False, alternative_words=['ascend', 'raise', 'work atop'], example='Ascend a ladder to work atop roofs of customers', dependencies=None), 'touch': BiasedLanguage(word='touch', dependent=False, alternative_words=['activate', 'inspect', 'diagnose'], example='Inspect the thickness of clothing material', dependencies=None), 'feel': BiasedLanguage(word='feel', dependent=False, alternative_words=['activate', 'inspect', 'diagnose'], example='Inspect the thickness of clothing material', dependencies=None), 'hand': BiasedLanguage(word='hand', dependent=False, alternative_words=['move', 'install', 'operate', 'manage', 'put', 'place', 'transfer', 'transport'], example='Transport boxes from shipping dock to truck', dependencies=None), 'carry': BiasedLanguage(word='carry', dependent=False, alternative_words=['move', 'install', 'operate', 'manage', 'put', 'place', 'transfer', 'transport'], example='Transport boxes from shipping dock to truck', depende

In [None]:
JOB_DESCRIPTION_FILE = "data/sample_job_descriptions/short_job_description.txt"
with open(JOB_DESCRIPTION_FILE, "r") as jd_file:
    job_description_text = jd_file.read()

result = find_biased_language(job_description_text, WORDLIST_CSV_PATH)
print(f"Found {len(result)} instances of ableist language.\n")
if len(result) > 0:
    for i, ableist_term in enumerate(result):
        print(
            f"Match #{i+1}\n"
            f"PHRASE: {ableist_term} | LEMMA: {ableist_term.lemma} | "
            f"POSITION: {ableist_term.start}:{ableist_term.end} | "
            f"ALTERNATIVES: {ableist_term.data.alternative_words} | "
            f"EXAMPLE: {ableist_term.data.example}\n"
        )

Found 4 instances of ableist language.

Match #1
PHRASE: lifting | LEMMA: lift | POSITION: 21:22 | ALTERNATIVES: ['move', 'install', 'operate', 'manage', 'put', 'place', 'transfer', 'transport'] | EXAMPLE: Transport boxes from shipping dock to truck

Match #2
PHRASE: bend | LEMMA: bend | POSITION: 37:38 | ALTERNATIVES: ['lower oneself', 'drop', 'move to', 'turn'] | EXAMPLE: Install new ethernet cables under floor rugs

Match #3
PHRASE: move your hands | LEMMA: move your hand | POSITION: 7:10 | ALTERNATIVES: ['observe', 'operate', 'transport', 'transfer', 'activate'] | EXAMPLE: Operates a machine using a lever

Match #4
PHRASE: move your wrists | LEMMA: move your wrist | POSITION: 31:34 | ALTERNATIVES: ['observe', 'operate', 'transport', 'transfer', 'activate'] | EXAMPLE: Operates a machine using a lever
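The POSITION values above appear to be token offsets with an exclusive end: single-word matches span one position and three-word phrases span three. The sketch below illustrates that convention with whitespace tokenisation and a toy lemma table. Both are simplifying assumptions for illustration, and `find_flagged_tokens` is a hypothetical helper, not how `find_biased_language` is implemented.

```python
# Toy lemma table standing in for a real lemmatiser (illustration only).
TOY_LEMMAS = {"lifting": "lift", "carrying": "carry", "bending": "bend"}

# Hypothetical word list keyed by lemma.
FLAGGED_LEMMAS = {"lift", "carry", "bend"}


def find_flagged_tokens(text):
    """Return (token, lemma, start, end) for each flagged token.

    start and end are token indices, with end exclusive.
    """
    matches = []
    for index, raw in enumerate(text.split()):
        token = raw.strip(".,;:!?").lower()
        lemma = TOY_LEMMAS.get(token, token)
        if lemma in FLAGGED_LEMMAS:
            matches.append((token, lemma, index, index + 1))
    return matches


sample = "The role involves lifting boxes and carrying tools."
matches = find_flagged_tokens(sample)
for token, lemma, start, end in matches:
    print(f"PHRASE: {token} | LEMMA: {lemma} | POSITION: {start}:{end}")
```

A real tokeniser usually treats punctuation as separate tokens, so its offsets can differ from this whitespace-based count.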



In [6]:
JOB_DESCRIPTION_FILE = "data/sample_job_descriptions/long_job_description.txt"
with open(JOB_DESCRIPTION_FILE, "r") as jd_file:
    job_description_text = jd_file.read()

result = find_biased_language(job_description_text, WORDLIST_CSV_PATH)
print(f"Found {len(result)} instances of ableist language.\n")
if len(result) > 0:
    for i, ableist_term in enumerate(result):
        print(
            f"Match #{i+1}\n"
            f"PHRASE: {ableist_term} | LEMMA: {ableist_term.lemma} | "
            f"POSITION: {ableist_term.start}:{ableist_term.end} | "
            f"ALTERNATIVES: {ableist_term.data.alternative_words} | "
            f"EXAMPLE: {ableist_term.data.example}\n"
        )

Found 8 instances of ableist language.

Match #1
PHRASE: run | LEMMA: run | POSITION: 562:563 | ALTERNATIVES: ['move to', 'move about', 'traverse'] | EXAMPLE: Moves about the office regularly to meet with staff

Match #2
PHRASE: read | LEMMA: read | POSITION: 600:601 | ALTERNATIVES: ['assess', 'comprehend', 'discover', 'distinguish', 'detect', 'evaluate', 'find', 'identify', 'interpret', 'observe', 'recognize', 'understand'] | EXAMPLE: Detect errors in submitted forms

Match #3
PHRASE: lifting | LEMMA: lift | POSITION: 713:714 | ALTERNATIVES: ['move', 'install', 'operate', 'manage', 'put', 'place', 'transfer', 'transport'] | EXAMPLE: Transport boxes from shipping dock to truck

Match #4
PHRASE: carrying | LEMMA: carry | POSITION: 715:716 | ALTERNATIVES: ['move', 'install', 'operate', 'manage', 'put', 'place', 'transfer', 'transport'] | EXAMPLE: Transport boxes from shipping dock to truck

Match #5
PHRASE: lifting | LEMMA: lift | POSITION: 720:721 | ALTERNATIVES: ['move', 'install', 'op

## What should we record?

- [ ] The word-list creation process, that is, how the words were defined and extracted.
- [ ] Data sources reviewed in the sampling process.
- [ ] Final list of words
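One way to gather the three checklist items in a single place is a plain record that can be checked for gaps and serialised. This is a sketch only: the field names, the placeholder values, and the `missing_fields` helper are hypothetical and not part of the FAID API; the five words are the ones shown in the word-list preview.

```python
import json

# All descriptive values below are placeholders for illustration.
word_list_record = {
    "creation_process": "Describe how candidate words were defined and extracted.",
    "data_sources": ["<source reviewed during sampling>"],
    "final_word_list": ["climb", "touch", "feel", "hand", "carry"],
}


def missing_fields(record):
    """Return the names of checklist items that are still empty."""
    return [key for key, value in record.items() if not value]


serialised = json.dumps(word_list_record, indent=2, sort_keys=True)
print(serialised)
print("Missing:", missing_fields(word_list_record))
```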

## How should we record?

Note that this experiment is not a fairness evaluation or mitigation experiment; it is a data creation process whose output is the list of words. We therefore focus on creating the data card.

In [1]:
import sys
sys.path.append('../../')
from faid import logging as faidlog
faidlog.init_log()

Model log file created.
Data log file created.
Risks log file created.
Transparency log file created.


In [2]:
datacard = faidlog.DataCard()

In [None]:
description = {
    'name': 'Ableist Words',
    'summary': '',
    'dataset_link': '',
    'repository_link': '',
    'intro_paper': '',
    'publishing_organization': '',
    'tasks': [],
    'characteristics': ["structured", "tabular"],
    'feature_types': ["numerical", "categorical"],
    'target_col': 'N/A',
    'index_col': 'N/A',
    'year_of_dataset_creation': '2025',
    'last_updated': '',
    'industry_types': [],
    'publishing_poc': {},
    'owners': [],
    'authors': [],
    'funding_sources': []
 }

<faid.logging.data_card_utils.DataCard at 0x107bf8970>

In [None]:
datacard.set_description(description=description)
datacard.save()