In [1]:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')

text = nlp("""

AI-based machine learning techniques are going beyond the cloud-based data center, as processing of vital IoT sensor data moves much closer to where the data first resides.

The move will be enabled by new artificial intelligence (AI)-equipped chips. These include embedded microcontrollers with narrower memory and power consumption requirements than GPUs (graphical processing units), FPGAs (field-programmable gate arrays) and other specialized IC types first used to answer data scientists’ questions in the cloud data centers of Amazon Web Services, Microsoft and Google.

It was in these clouds that machine learning and related neural network use exploded. But the rise of IoT created a data onslaught that required edge-based machine learning as well.

Now, cloud providers, Internet of Things (IoT) platform makers, and others see benefit in processing data at the edge before turning it over to the cloud for analytics.

Making AI decisions at the edge reduces latency and makes real-time response to sensor data more practical and possible. Still, what people call “edge AI” takes many forms. And how to power it with next-gen IoT presents challenges in terms of presenting good-quality actionable data.

Edge Computing Workloads Grow

Edge-based machine learning could drive significant growth of AI in the IoT market, which Mordor Intelligence estimates will grow at a 27.3% CAGR through to 2026.

That is buttressed by Eclipse Foundation IoT Group research in 2020, which pegged AI at 30% as the most commonly cited edge computing workload among IoT developers.

For many applications, replicating the endless racks of servers that enabled parallel machine learning on the cloud is not an option. IoT edge cases that benefit from local processing are many, and highlighted by varied cases of operations monitoring. The processors, for example, watch events triggered by pressure gauge changes on an oil rig, detection of an anomaly on a distant power line, or captured video surveillance of an issue at a factory.

The last case is one of those most widely pursued. Application of AI that parses image data at the edge has proved a fertile area. But there are many complex processing needs for event processing using IoT device-gathered data.

The Value of Edge Compute

Still, cloud-based IoT analytics will endure, said Steve Conway, senior adviser, Hyperion Research. But the distance data must travel brings processing latency. Moving data to and from a cloud naturally creates lag; the round trip takes time.

“There is something called the speed of light,” Conway quips. “And you cannot exceed it.” As a result, a hierarchy of processing is developing on the edge.

Other than devices and board-level implementations, this hierarchy includes IoT gateways and data centers in manufacturing that expand architectural options available for next-generation IoT system development.

In the long view, edge AI architecture is yet another generational shift in data processing focus – but a key one, according to Saurabh Mishra, senior manager for product marketing at SAS’s IoT and Edge division.

“There is a progression here,” he said. “At one time, the idea was centralizing your data. You can do that for certain industries and certain use cases – ones where data was already created in a context, such as in a data center,” he said.

“It’s not really possible to efficiently – and economically – move that to the cloud for analysis,” said Mishra, who noted that SAS has created validated edge IoT reference architectures on top of which customers can build AI and analytical applications. Striking a balance between cloud and edge AI will be a fundamental requirement, he said.

Finding balance begins with consideration of the amount of data needed to run machine learning models, according to Frédéric Desbiens, program manager, IoT and Edge Computing at the Eclipse Foundation. That is where the new intelligent processors come in.

“AI accelerators at the edge can do local processing before sending the data somewhere else. But, this requires you to think about the functional requirements, including the software stack and storage needed,” Desbiens said.

AI Edge Chip Abundance

The rise of cloud-based machine learning was influenced by the rise of the high-memory bandwidth GPU, often in the form of a NVIDIA semiconductor. That success drew the attention of other chip makers.

In-house AI-specific processors followed from hyperscale cloud-players Google, AWS and Microsoft.

That AI chip battle has been joined by leading lights such as AMD, Intel, Qualcomm, and ARM Technology (which, for its part, last year was acquired by NVIDIA).

In turn, embedded microprocessor and systems-on-a-chip mainstays like Maxim Integrated, NXP Semiconductors, Silicon Labs, STM Microelectronics and others began to focus on adding AI abilities to the edge.

Today, IoT and edge processing needs have attracted AI chip start-ups that include EdgeQ, Graphcore, Hailo, Mythic and others. Processing on the edge is constrained. Barriers include memory available, energy consumed and cost, emphasizes Hyperion’s Steve Conway.

“The embedded processors are very important, as energy use is very important,” Conway said. “The GPUs and CPUs are not tiny dies, and GPUs, particularly, use a ton of energy,” he said, referring to the relatively large silicon form factors GPUs and CPUs can take on.

Making Neurals Fit the Part

Data movement is a factor in energy consumption on the edge, advises Kris Ardis, executive director of Maxim Integrated’s microcontroller and software algorithm businesses. Recently, the company released the MAX78000, which pairs a low-power controller with a neural net processor that can run on battery-powered IoT devices.

“If you can do a computation at the very edge, you save bandwidth, and communications power. The challenge is taking the neural net and making it fit in the part,” Ardis said.

Individual IoT devices based on the chip can feed IoT gateways, which also have a useful part to play, combining rollups of data from devices, and further filtering data that may go to the cloud in order to analyze overall operations, he indicated.

Other semiconductor device makers also are adjusting to a trend that sees compute moving nearer to where data is. They are part of the effort to broaden the capabilities of developers, even as their hardware choices grow.

Bill Pearson, vice president of Intel’s IoT group admits there was a time when “the CPU was the answer to all problems.” Trends like edge AI belie that now.

He uses the term “XPU” to represent a variety of chip types that support different uses. But, he adds, the variety should be supported by a single software application programming interface (API).

To aid software developers, Intel recently released Version 2021.2 of the OpenVINO toolkit for inference on edge systems. It provides a common environment for development among Intel components including CPUs, GPUs, and Movidius Visual Processing Units. Intel also offers DevCloud for the Edge software to forecast performance of neural network inference on different Intel hardware, according to Pearson.

The drive to simplify is marked at GPU powerhouse NVIDIA too.

“The industry has to make it easier for people that aren’t AI specialists,” said Justin Boitano, vice president and general manager for Enterprise and Edge Computing, NVIDIA.

That may take the form of NVIDIA Jetson, which includes a low-power ARM processor. Named with a nod to the ‘60s science-fiction cartoon series, Jetson is intended to provide GPU-accelerated parallel processing in mobile embedded systems.

Recently, to ease vision system development, NVIDIA rolled out Jetson JetPack 4.5, which includes the first production version of its Vision Programming Interface (VPI).

With time, edge AI development chores will be handled more by IT shops, and less by AI researchers with deep knowledge of machine learning, Boitano said.

The Tiny ML That Roared

The skills needed to migrate machine learning methods from the vast cloud to the constrained edge device are not easily gained. But new software techniques are being applied to enable compact edge AI, while easing the task of the developer.

In fact, industry has experienced the rise of “Tiny ML” approaches. These make do with less power and use limited memory, while achieving capable inference-operations-per-second ratings.

Various machine learning tools that reduce edge processing requirements have emerged, including Apache MXNet, Edge Impulse’s EON, Facebook’s Glow, Foghorn Lightning Edge ML, Google TensorFlow Lite, Microsoft ELL, OctoML’s Octomizer and others.

Down-sizing neural net processing is a main target here, and the techniques are several. Among these are quantization, binarization and pruning, according to Sastry Malladi, who is CTO at Foghorn, a maker of a software platform that supports a variety of edge and on-premises implementations.

Quantization of neural net processing focuses on use of low bit-width math. Binarization, in turn, is used to reduce the complexity of computations. And, pruning is used to reduce the number of neural nodes that must be processed.

Malladi admits that is a daunting gamut for most developers to traverse – especially across a range of hardware. The efforts behind Foghorn’s Lightning platform, he said, are intended to abstract the complexity in machine learning on the edge.

The goal is to allow line operators and reliability engineers, for example, to work with drag-and-drop interfaces, rather than application programming interfaces and software development kits, which are less intuitive and require more coding knowledge.

Software that simplifies development and runs across multiple types of edge AI hardware is also a focus for Edge Impulse, makers of a development platform for embedded machine learning.

Ultimately, machine learning maturation means some model miniaturization, according to Zach Shelby, CEO, Edge Impulse.

“Once, the direction of the research was toward bigger and bigger models of more and more complexity,” Shelby said. “But, as machine learning hit prime time, people started to care about efficiency again.” That led to Tiny ML.

Software that can work on existing IoT infrastructure is necessary, while supporting a path to new varieties of hardware, he said. Edge Impulse tools allow cloud-based modeling of algorithms and events on available hardware, Shelby continued, so that users can try different options before they make selections.

Keep Your Eyes on Vision

On the edge, computer vision has become a prominent use case for AI, especially in the form of deep learning, which employs multiple layers of neural networks and unsupervised techniques to achieve results in image pattern recognition.

Vision system architecture is undergoing shifts today, as cameras on the very edge add processing capabilities via embedded hardware for deep learning, according to Forrester Research’s Kjell Carlsson, principal analyst. But finding the best application targets can be a challenge.

“The issue with AI on the edge is that you more frequently end up looking at use cases that are ‘net new,’” he said.

Developing these greenfield solutions has inherent risk, Carlsson said, so a helpful tactic is to focus on use cases that offer a high benefit to cost ratio, even if the pattern recognition accuracy might trail that of full-fledged existing systems.

Overall, Carlsson said edge AI could help fulfill IoT’s original promise, which has lagged at times as implementers sorted through myriad potential use cases.

“IoT on its own had some limitations. Now, with AI, machine learning and deep learning that makes IoT more applicable – as well as valuable,” he said.


""")


displacy.render(text, style = 'ent', jupyter=True)

In [2]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
text = nlp("""

Egypt (Arabic: مصر Miṣr [mesˁr], Egyptian Arabic pronunciation: [mɑsˤr]), officially the Arab Republic of Egypt, is a transcontinental country spanning the northeast corner of Africa and the Sinai Peninsula in the southwest corner of Asia. It is bordered by the Mediterranean Sea to the north, the Gaza Strip of Palestine and Israel to the northeast, the Red Sea to the east, Sudan to the south, and Libya to the west. The Gulf of Aqaba in the northeast separates Egypt from Jordan and Saudi Arabia. Cairo is the capital and largest city of Egypt, while Alexandria, the second-largest city, is an important industrial and tourist hub at the Mediterranean coast.[20] At approximately 100 million inhabitants, Egypt is the 14th-most populated country in the world, and the third-most populated in Africa.

Egypt has one of the longest histories of any country, tracing its heritage along the Nile Delta back to the 6th–4th millennia BCE. Considered a cradle of civilisation, Ancient Egypt saw some of the earliest developments of writing, agriculture, urbanisation, organised religion and central government.[21] Egypt was an early and important centre of Christianity, later adopting Islam from the seventh century onwards. Cairo became the capital of the Fatimid Caliphate in the tenth century, and of the Mamluk Sultanate in the 13th century. Egypt then became part of the Ottoman Empire in 1517, before its local ruler Muhammad Ali established modern Egypt as an autonomous Khedivate in 1867.

The country was then occupied by the British Empire and gained independence in 1922 as a monarchy. Following the 1952 revolution, Egypt declared itself a republic. For a brief period between 1958 and 1961 Egypt merged with Syria to form the United Arab Republic. Egypt fought several armed conflicts with Israel in 1948, 1956, 1967 and 1973, and occupied the Gaza Strip intermittently until 1967. In 1978, Egypt signed the Camp David Accords, which recognised Israel in exchange for its withdrawal from the occupied Sinai. After the Arab Spring, which led to the 2011 Egyptian revolution and overthrow of Hosni Mubarak, the country faced a protracted period of political unrest; this included the election in 2012 of a brief, short-lived Muslim Brotherhood-aligned Islamist government spearheaded by Mohamed Morsi, and its subsequent overthrow after mass protests in 2013.

Egypt's current government, a semi-presidential republic led by president Abdel Fattah el-Sisi since he was elected in 2014, has been described by a number of watchdogs as authoritarian and responsible for perpetuating the country's poor human rights record. Islam is the official religion of Egypt, and Arabic is its official language.[1] The great majority of its people live near the banks of the Nile River, an area of about 40,000 square kilometres (15,000 sq mi), where the only arable land is found. The large regions of the Sahara desert, which constitute most of Egypt's territory, are sparsely inhabited. About 43% of Egypt's residents live across the country's urban areas,[22] with most spread across the densely populated centres of greater Cairo, Alexandria and other major cities in the Nile Delta. Egypt is considered to be a regional power in North Africa, the Middle East and the Muslim world, and a middle power worldwide.[23] It is a developing country having a diversified economy, which is the largest in Africa, the 38th-largest economy by nominal GDP and 127th by nominal GDP per capita.[24] Egypt is a founding member of the United Nations, the Non-Aligned Movement, the Arab League, the African Union, Organisation of Islamic Cooperation, World Youth Forum, and a member of BRICS.

Names
The English name "Egypt" is derived from the Ancient Greek "Aígyptos" ("Αἴγυπτος"), via Middle French "Egypte" and Latin "Aegyptus". It is reflected in early Greek Linear B tablets as "a-ku-pi-ti-yo".[25] The adjective "aigýpti-"/"aigýptios" was borrowed into Coptic as "gyptios", and from there into Arabic as "qubṭī", back formed into "قبط" ("qubṭ"), whence English "Copt". Prominent Ancient Greek historian and Geographer, Strabo, provided a folk etymology stating that "Αἴγυπτος" (Aigýptios) had originally evolved as a compound from "Aἰγαίου ὑπτίως" Aegaeou huptiōs, meaning "Below the Aegean".[26]

"Miṣr" (Arabic pronunciation: [misˤɾ]; "مِصر") is the Classical Quranic Arabic and modern official name of Egypt, while "Maṣr" (Egyptian Arabic pronunciation: [mɑsˤɾ]; مَصر) is the local pronunciation in Egyptian Arabic.[27] The current name of Egypt, Misr/Misir/Misru, stems from the Ancient Semitic name for it. The term originally connoted "Civilization" or "Metropolis".[28] Classical Arabic Miṣr (Egyptian Arabic Maṣr) is directly cognate with the Biblical Hebrew Mitsráyīm (מִצְרַיִם / מִצְרָיִם), meaning "the two straits", a reference to the predynastic separation of Upper and Lower Egypt. Also mentioned in several Semitic languages as Mesru, Misir and Masar.[28] The oldest attestation of this name for Egypt is the Akkadian "mi-iṣ-ru" ("miṣru")[29][30] related to miṣru/miṣirru/miṣaru, meaning "border" or "frontier".[31] The Neo-Assyrian Empire used the derived term , Mu-ṣur.[32]

The ancient Egyptian name of the country was (𓆎 𓅓 𓏏𓊖) km.t, which means black land, likely referring to the fertile black soils of the Nile flood plains, distinct from the deshret (⟨dšṛt⟩), or "red land" of the desert.[33][34] This name is commonly vocalised as Kemet, but was probably pronounced [kuːmat] in ancient Egyptian.[35] The name is realised as K(h)ēmə (Bohairic Coptic: ⲭⲏⲙⲓ, Sahidic Coptic: ⲕⲏⲙⲉ) in the Coptic stage of the Egyptian language, and appeared in early Greek as Χημία (Khēmía).[36][37] Another name was ⟨tꜣ-mry⟩ "land of the riverbank".[38] The names of Upper and Lower Egypt were Ta-Sheme'aw (⟨tꜣ-šmꜥw⟩) "sedgeland" and Ta-Mehew (⟨tꜣ mḥw⟩) "northland", respectively.





""")


displacy.render(text, style = 'ent', jupyter=True)

# Named Entity Recognition (NER) Model Training

This notebook outlines the process of training a Named Entity Recognition (NER) model using a Conditional Random Field (CRF) algorithm. The dataset used includes sentences with Part-of-Speech (POS) tags and Named Entity Recognition (NER) tags.

## 1. Load and Clean the Dataset

First, we load the dataset and convert the string representations of lists in the 'POS' and 'Tag' columns into actual Python lists.


In [10]:
import pandas as pd
from ast import literal_eval

# Load the dataset
data = pd.read_csv('ner.csv')

# Convert string lists to actual Python lists for 'POS' and 'Tag' columns
data['POS'] = data['POS'].apply(literal_eval)
data['Tag'] = data['Tag'].apply(literal_eval)
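
The conversion step above can be illustrated on a single value. This is a minimal sketch, assuming a cell in the 'POS' column is stored as the text of a Python list; the sample string `raw_pos` is hypothetical and not taken from `ner.csv`.

```python
from ast import literal_eval

# Hypothetical cell value, shaped like an entry of the 'POS' column as read from CSV
raw_pos = "['NNP', 'VBZ', 'DT', 'NN']"

# literal_eval parses Python literals only, so it does not execute arbitrary code
parsed = literal_eval(raw_pos)

print(type(raw_pos).__name__, "->", type(parsed).__name__)
print(parsed)
```

Before the conversion, `len()` on such a cell would count characters; after it, `len()` counts tags, which is what the length checks later in the notebook rely on.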


### Check and Remove Problematic Rows

We ensure that each sentence, POS list, and tag list have the same length. Rows where this is not the case are identified and removed.

In [11]:
# Find rows where the lengths do not match
mismatch = data.apply(lambda row: len(row['Sentence'].split()) != len(row['POS']) or len(row['POS']) != len(row['Tag']), axis=1)

# Display the problematic rows
problematic_rows = data[mismatch]
print(problematic_rows)


            Sentence #                                           Sentence  \
76        Sentence: 77  " And I think they 'll want a one-stop shop in...   
10051  Sentence: 10052  In a telephone interview to discuss the issues...   
19817  Sentence: 19818  He says they and about 300 party supporters ar...   
47591  Sentence: 47592  U.S. weather forecasters say Hurricane Wilma h...   

                                                     POS  \
76     [``, CC, PRP, VBP, PRP, MD, VB, DT, JJ, NN, IN...   
10051  [IN, DT, NN, NN, TO, VB, DT, NNS, IN, NN, ,, P...   
19817  [PRP, VBZ, PRP, CC, IN, CD, NN, NNS, VBP, VBG,...   
47591  [NNP, NN, NNS, VBP, NNP, NNP, VBZ, VBN, TO, DT...   

                                                     Tag  
76     [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...  
10051  [O, O, O, O, O, O, O, O, O, O, O, O, O, B-org,...  
19817  [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...  
47591  [B-geo, O, O, O, O, O, O, O, O, O, O, O, O, O,...  


In [12]:
data_cleaned = data[~mismatch]  # Keep only rows where lengths match
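
The same alignment check can be sketched without pandas. The rows below are hypothetical toy data that only mirror the 'Sentence', 'POS' and 'Tag' columns, and `is_aligned` is an illustrative helper name rather than something defined in this notebook.

```python
# Toy rows (hypothetical) mirroring the 'Sentence', 'POS' and 'Tag' columns
rows = [
    {"Sentence": "Cairo is large", "POS": ["NNP", "VBZ", "JJ"], "Tag": ["B-geo", "O", "O"]},
    # Two whitespace-separated tokens but three tags: this row is misaligned
    {"Sentence": "He left", "POS": ["PRP", "VBD", "."], "Tag": ["O", "O", "O"]},
]

def is_aligned(row):
    """True when token, POS and tag counts all agree."""
    n_tokens = len(row["Sentence"].split())
    return n_tokens == len(row["POS"]) == len(row["Tag"])

aligned = [row for row in rows if is_aligned(row)]
dropped = [row for row in rows if not is_aligned(row)]

print(f"kept {len(aligned)} row(s), dropped {len(dropped)} row(s)")
```

A CRF needs exactly one label per token, so rows that fail this check cannot be used for training as they stand.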


In [13]:
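# Proceed with the cleaned data.
# Note: sent2features and sent2labels are defined in the Feature Extraction
# section below; that cell must be run before this one.
X_train = [sent2features(row['Sentence'].split(), row['POS']) for _, row in data_cleaned.iterrows()]
y_train = [sent2labels(row['Tag']) for _, row in data_cleaned.iterrows()]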

## 2. Feature Extraction

We define functions to convert sentences and POS tags into feature sets, which are required for training the CRF model.

In [16]:
def word2features(sent, pos, i):
    word = sent[i]
    postag = pos[i]

    features = {
        'word': word,
        'postag': postag,
        'is_capitalized': word[0].isupper(),
        'is_all_caps': word.isupper(),
        'is_all_lower': word.islower(),
        'prefix-1': word[:1],
        'prefix-2': word[:2],
        'suffix-1': word[-1:],
        'suffix-2': word[-2:],
        'is_digit': word.isdigit(),
    }

    if i > 0:
        prev_word = sent[i-1]
        prev_postag = pos[i-1]
        features.update({
            'prev_word': prev_word,
            'prev_postag': prev_postag,
        })
    else:
        features['BOS'] = True  # Beginning of Sentence

    if i < len(sent) - 1:
        next_word = sent[i+1]
        next_postag = pos[i+1]
        features.update({
            'next_word': next_word,
            'next_postag': next_postag,
        })
    else:
        features['EOS'] = True  # End of Sentence

    return features

def sent2features(sent, pos):
    # Ensure that the lengths of the sentence and POS tags match
    if len(sent) != len(pos):
        return None  # skip this row

    return [word2features(sent, pos, i) for i in range(len(sent))]


# Convert the tag list to labels
def sent2labels(tag):
    return tag
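
To see what the CRF receives, it can help to print the feature dictionaries for a short sentence. The sketch below uses a reduced, hypothetical function `toy_word_features` that keeps only a few of the features from `word2features`, and a made-up three-token sentence, so that it runs on its own.

```python
def toy_word_features(tokens, pos_tags, i):
    """Reduced illustration of word2features: token, POS tag and neighbours."""
    word = tokens[i]
    feats = {
        "word": word,
        "postag": pos_tags[i],
        "is_capitalized": word[0].isupper(),
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1]
    else:
        feats["BOS"] = True  # first token of the sentence
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1]
    else:
        feats["EOS"] = True  # last token of the sentence
    return feats

# Hypothetical sentence and POS tags
tokens = ["Cairo", "is", "large"]
pos_tags = ["NNP", "VBZ", "JJ"]

features = [toy_word_features(tokens, pos_tags, i) for i in range(len(tokens))]
for feats in features:
    print(feats)
```

Each sentence becomes a list with one dictionary per token, which is the input shape that `crf.fit` is given later in the notebook.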


## 3. Prepare Training Data

We prepare the training data by extracting features and labels from the cleaned dataset.

In [25]:
X_train = []
y_train = []

for _, row in data.iterrows():
    sent = row['Sentence'].split()
    pos = row['POS']
    tag = row['Tag']

    if len(sent) == len(pos) == len(tag):
        features = sent2features(sent, pos)
        if features:
            X_train.append(features)  
            y_train.append(sent2labels(tag))  

# Flatten X_train and y_train to ensure they align
X_train_flat = [item for sublist in X_train for item in sublist]
y_train_flat = [label for sublist in y_train for label in sublist]

# Ensure lengths match
assert len(X_train_flat) == len(y_train_flat), f"Lengths do not match: {len(X_train_flat)} vs {len(y_train_flat)}"


In [21]:

!pip install sklearn-crfsuite

Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.10 sklearn-crfsuite-0.5.0


## 4. Train the CRF Model

We initialize and train the CRF model, then evaluate it on the training data using precision, recall, and F1 score.
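
For reference, the precision, recall and F1 named above can be computed by hand at the token level, which is how the "flat" metrics treat a list of tagged sentences. This is a minimal sketch with a hypothetical helper `flat_prf` and made-up labels; it is not the library's implementation.

```python
def flat_prf(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one label over nested tag lists."""
    true_flat = [t for sent in y_true for t in sent]
    pred_flat = [p for sent in y_pred for p in sent]
    pairs = list(zip(true_flat, pred_flat))

    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted tags for two short sentences
y_true = [["B-geo", "O", "O"], ["B-geo", "O"]]
y_pred = [["B-geo", "O", "B-geo"], ["O", "O"]]

precision, recall, f1 = flat_prf(y_true, y_pred, "B-geo")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In the toy data there is one correct `B-geo`, one spurious `B-geo` and one missed `B-geo`, so all three scores come out at 0.5.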

In [26]:
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Initialize the CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=False
)

# Train the model
crf.fit(X_train, y_train)
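
The model above is fitted on all prepared sentences, and the evaluation that follows predicts on the same `X_train`. One common alternative, shown here only as a sketch, is to hold out a share of the sentences before fitting and score the model on those. The helper `split_sentences`, the 20% fraction and the seed are assumptions for illustration, and the toy lists stand in for the real features and labels.

```python
import random

def split_sentences(X, y, test_fraction=0.2, seed=42):
    """Shuffle sentence indices and split features and labels consistently."""
    assert len(X) == len(y), "features and labels must have one entry per sentence"
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)

    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

# Toy stand-ins for the per-sentence feature lists and tag lists
X_toy = [[{"word": f"w{i}"}] for i in range(10)]
y_toy = [["O"] for _ in range(10)]

X_tr, X_te, y_tr, y_te = split_sentences(X_toy, y_toy)
print(len(X_tr), "training sentences,", len(X_te), "held-out sentences")
```

Splitting at the sentence level rather than the token level keeps each tag sequence intact, which matters because the CRF models transitions between neighbouring tags.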


In [27]:
y_pred = crf.predict(X_train)

labels = list(crf.classes_)
labels.remove('O')  # 'O' is not an entity

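# Weighted F1 over the entity labels (kept in a variable for later inspection)
# Note: predictions are made on the training data, so these scores describe
# fit to the training set rather than performance on unseen text.
weighted_f1 = metrics.flat_f1_score(y_train, y_pred, average='weighted', labels=labels)

# Generate the per-label classification report
report = metrics.flat_classification_report(
    y_train, y_pred, labels=labels, digits=3
)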

print(report)


              precision    recall  f1-score   support

       B-geo      0.910     0.954     0.932     37643
       B-gpe      0.985     0.953     0.969     15870
       B-per      0.947     0.928     0.937     16989
       I-geo      0.902     0.931     0.916      7414
       B-org      0.919     0.855     0.886     20142
       I-org      0.947     0.934     0.940     16784
       B-tim      0.963     0.929     0.946     20333
       B-art      0.943     0.704     0.806       402
       I-art      0.958     0.774     0.857       297
       I-per      0.940     0.958     0.949     17250
       I-gpe      0.972     0.692     0.808       198
       I-tim      0.934     0.893     0.913      6528
       B-nat      0.898     0.612     0.728       201
       B-eve      0.875     0.731     0.796       308
       I-eve      0.891     0.708     0.789       253
       I-nat      0.927     0.745     0.826        51

   micro avg      0.937     0.928     0.932    160663
   macro avg      0.932   