<a href="https://colab.research.google.com/github/assermahmoud99/internship-tasks/blob/main/(ner)_News_Articles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# 0. Downloading the dataset

Here we download the English CoNLL-2003 dataset from Kaggle using kagglehub. This dataset is widely used for NER tasks and contains newswire text with named entity annotations. We fetch the three split files: train, validation, and test.



In [None]:
import kagglehub

# Download the CoNLL2003 dataset
path_train = kagglehub.dataset_download("alaakhaled/conll003-englishversion", path="train.txt")
path_valid = kagglehub.dataset_download("alaakhaled/conll003-englishversion", path="valid.txt")
path_test  = kagglehub.dataset_download("alaakhaled/conll003-englishversion", path="test.txt")

print("Train file path:", path_train)
print("Validation file path:", path_valid)
print("Test file path:", path_test)


Downloading from https://www.kaggle.com/api/v1/datasets/download/alaakhaled/conll003-englishversion?dataset_version_number=1&file_name=train.txt...
100%|██████████| 650k/650k [00:00<00:00, 92.0MB/s]
Extracting zip of train.txt...
Using Colab cache for faster access to the 'conll003-englishversion' dataset.
Using Colab cache for faster access to the 'conll003-englishversion' dataset.
Train file path: /root/.cache/kagglehub/datasets/alaakhaled/conll003-englishversion/versions/1/train.txt
Validation file path: /kaggle/input/conll003-englishversion/valid.txt
Test file path: /kaggle/input/conll003-englishversion/test.txt


# 1. Preprocessing into sentences

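The CoNLL-2003 files store one token per line, with a blank line between sentences. The function below rebuilds each sentence by joining the first column (the token) of consecutive lines until it reaches a blank line. We load the first 200 sentences for testing and print a preview. Note that the -DOCSTART- document markers are kept, so each one appears as a one-token sentence.
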
In [None]:
# Step 1. Install & import
!pip install spacy kagglehub
import spacy
import pandas as pd
import kagglehub
from spacy.matcher import Matcher
from spacy import displacy


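def load_sentences(filepath, limit=200):
    """Rebuild up to `limit` sentences from a token-per-line CoNLL file."""
    sentences = []
    block = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # A blank line ends the current sentence
                if block:
                    sentences.append(" ".join(block))
                    block = []
                if len(sentences) >= limit:
                    break
            else:
                # The token is the first whitespace-separated column
                block.append(line.split()[0])
    # Flush the last sentence if the file does not end with a blank line
    if block and len(sentences) < limit:
        sentences.append(" ".join(block))
    return sentences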

train_sentences = load_sentences(path_train, limit=200)
print(train_sentences[:5])


['-DOCSTART-', 'EU rejects German call to boycott British lamb .', 'Peter Blackburn', 'BRUSSELS 1996-08-22', 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .']
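
The loader above keeps only the token column, so the dataset's own entity tags are discarded and the -DOCSTART- markers end up as sentences. A minimal sketch of a variant that keeps the tags and skips the markers is shown below. It assumes the NER tag is the last whitespace-separated column, which is the usual CoNLL-2003 layout but should be checked against the downloaded file. The `read_conll` name and the inline sample are hypothetical and used only for illustration.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str], skip_docstart: bool = True) -> List[Sentence]:
    """Group token-per-line rows into sentences, keeping the last column as the tag."""
    sentences: List[Sentence] = []
    block: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any
            if block:
                sentences.append(block)
                block = []
            continue
        cols = line.split()
        if skip_docstart and cols[0] == "-DOCSTART-":
            continue
        # Assumption: first column is the token, last column is the NER tag
        block.append((cols[0], cols[-1]))
    if block:
        # Flush the last sentence when the input lacks a trailing blank line
        sentences.append(block)
    return sentences


# Hypothetical sample in a four-column layout (token, POS, chunk, NER)
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

parsed = read_conll(sample)
print(parsed[0])
```

With an open file handle, the same function could be called as `read_conll(f)`, since it accepts any iterable of lines.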


# 2. Loading spaCy models
We load two spaCy models:

*   en_core_web_sm: a small, faster model.
*   en_core_web_md: a larger model that includes word vectors and is generally more accurate.

This lets us compare model-based NER between a lightweight and a larger spaCy pipeline.

In [None]:
nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")


# 3. Model-based NER (small model)

Here we apply spaCy’s pretrained small model (sm) to extract named entities from the 200 loaded sentences. The results are stored in a DataFrame with the entity text and its label (e.g. PERSON, ORG, GPE). We then:

*   Show the first 20 detected entities.
*   Count how many entities of each type appear.
*   Display the most frequent entity mentions.

This demonstrates the model-based NER approach.

In [None]:

all_entities = []

for sent in train_sentences:
    doc = nlp_sm(sent)
    for ent in doc.ents:
        all_entities.append((ent.text, ent.label_))

# Save to DataFrame
entities_df = pd.DataFrame(all_entities, columns=["entity", "label"])
print(entities_df.head(20))

print("Entity type counts:\n")
print(entities_df["label"].value_counts())

print("\nMost common entities:\n")
print(entities_df["entity"].value_counts().head(20))



                     entity   label
0                        EU     ORG
1                    German    NORP
2                   British    NORP
3           Peter Blackburn  PERSON
4                  BRUSSELS     GPE
5                1996-08-22    DATE
6   The European Commission     ORG
7                  Thursday    DATE
8                    German    NORP
9                   British    NORP
10                  Germany     GPE
11    the European Union 's     ORG
12         Werner Zwingmann  PERSON
13                Wednesday    DATE
14                  Britain     GPE
15               Commission     ORG
16     Nikolaus van der Pas  PERSON
17       the European Union     ORG
18               last month    DATE
19                  EU Farm     ORG
Entity type counts:

label
GPE         139
DATE         91
NORP         88
PERSON       74
ORG          73
CARDINAL     47
MONEY        11
PERCENT      11
LOC           5
ORDINAL       4
EVENT         2
TIME          2
QUANTITY      1
LANGUAGE 
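
The labels in this output (GPE, NORP, PERSON and so on) follow the OntoNotes scheme used by spaCy's English pipelines, whereas CoNLL-2003 annotates only PER, ORG, LOC and MISC. If the predictions are ever compared with the dataset's own tags, the two schemes have to be aligned first. The sketch below shows one possible coarse mapping. The `SPACY_TO_CONLL` table and the `to_conll_labels` helper are assumptions made for illustration; they are not part of spaCy or of this notebook, and other mappings are defensible.

```python
from collections import Counter

# Assumed coarse mapping from OntoNotes-style labels to CoNLL-2003 types.
# Labels without an obvious counterpart (DATE, CARDINAL, MONEY, ...) are left out.
SPACY_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
    "PRODUCT": "MISC",
}


def to_conll_labels(entities):
    """Map (text, spaCy label) pairs to CoNLL types, dropping unmapped labels."""
    return [
        (text, SPACY_TO_CONLL[label])
        for text, label in entities
        if label in SPACY_TO_CONLL
    ]


# First rows of the small-model output shown above
sample = [
    ("EU", "ORG"),
    ("German", "NORP"),
    ("British", "NORP"),
    ("Peter Blackburn", "PERSON"),
    ("BRUSSELS", "GPE"),
    ("1996-08-22", "DATE"),
]

mapped = to_conll_labels(sample)
print(Counter(label for _, label in mapped))
```

In the notebook, `all_entities` could be passed to `to_conll_labels` directly, since it already holds (text, label) pairs.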

# 4. Rule-based NER (small model)

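This section uses spaCy’s Matcher to implement a rule-based NER system with three patterns:

*   PROPER_NOUN: one or more consecutive title-case tokens (first letter uppercase, the rest lowercase), so all-caps tokens such as “EU” are not matched.
*   MONEY: a number-like token followed by “dollars”, “usd”, “$”, “million” or “billion”.
*   DATE: a number-like token followed by a title-case token (like “1996 August”).

We then apply these patterns to the dataset and count how many matches each rule produced. Because the Matcher returns every span that satisfies a “+” pattern, a two-word name contributes three PROPER_NOUN matches (each word and the pair), which inflates that count. This shows how custom rules can detect entities without training a model.
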
In [None]:

matcher = Matcher(nlp_sm.vocab)
# One or more consecutive title-case tokens
matcher.add("PROPER_NOUN", [[{"IS_TITLE": True, "OP": "+"}]])
# A number-like token followed by a currency or magnitude word
matcher.add("MONEY", [[{"LIKE_NUM": True}, {"LOWER": {"IN": ["dollars","usd","$","million","billion"]}}]])
# A number-like token followed by a title-case token
matcher.add("DATE", [[{"LIKE_NUM": True}, {"IS_TITLE": True}]])

rule_entities = []
for sent in train_sentences:
    doc = nlp_sm(sent)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        rule_entities.append((span.text, nlp_sm.vocab.strings[match_id]))

rule_df = pd.DataFrame(rule_entities, columns=["entity", "rule_label"])
print(rule_df['rule_label'].value_counts())


rule_label
PROPER_NOUN    828
DATE            11
MONEY            2
Name: count, dtype: int64
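
The PROPER_NOUN count is large partly because a `"OP": "+"` pattern matches every sub-span of a title-case run, not just the longest one. The sketch below shows, in plain Python, one way to reduce overlapping matches to the longest non-overlapping ones. The `keep_longest` helper and the sample matches are hypothetical. spaCy provides `spacy.util.filter_spans` and the `greedy` argument of `Matcher.add` for a similar purpose, and those would normally be preferred inside the notebook.

```python
from typing import List, Tuple

Match = Tuple[str, int, int]  # (label, start, end), with end exclusive


def keep_longest(matches: List[Match]) -> List[Match]:
    """Keep the longest matches first and drop any match that overlaps a kept one."""
    kept: List[Match] = []
    taken = set()  # token indices already covered by a kept match
    # Longest spans first; ties are broken by the earliest start
    for label, start, end in sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1])):
        if not any(i in taken for i in range(start, end)):
            kept.append((label, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order
    return sorted(kept, key=lambda m: m[1])


# Hypothetical matches for the tokens ["Peter", "Blackburn", "said", "Britain"]
raw = [
    ("PROPER_NOUN", 0, 1),
    ("PROPER_NOUN", 0, 2),
    ("PROPER_NOUN", 1, 2),
    ("PROPER_NOUN", 3, 4),
]

filtered = keep_longest(raw)
print(filtered)
```

Note that this filter works across labels, so where a DATE match and a PROPER_NOUN match overlap, only the longer one survives.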


# 5. Highlighting extracted entities



Here we take a real sentence from the dataset and display its named entities both as text (entity → label) and visually using displaCy. This satisfies the requirement to highlight and categorize extracted entities.

In [None]:

sample_text = train_sentences[10]
doc = nlp_sm(sample_text)

print("=== Highlighted Entities ===")
for ent in doc.ents:
    print(f"{ent.text:15} --> {ent.label_}")

displacy.render(doc, style="ent", jupyter=True)


=== Highlighted Entities ===
Fischler        --> PERSON
EU              --> ORG
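
The displaCy rendering only appears inside a notebook. Where a plain-text view is needed, for example in a log, the entity spans can be marked inline instead. The sketch below is one way to do that. The `highlight` helper, the sentence, and the character offsets are hypothetical; in the notebook the offsets would come from `ent.start_char`, `ent.end_char` and `ent.label_` for each entity in `doc.ents`.

```python
from typing import List, Tuple


def highlight(text: str, ents: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start_char, end_char, label) span as [text|LABEL].

    Spans are assumed not to overlap, as is the case for doc.ents.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(ents):
        out.append(text[cursor:start])               # text before the entity
        out.append(f"[{text[start:end]}|{label}]")   # the marked entity
        cursor = end
    out.append(text[cursor:])                        # remaining tail
    return "".join(out)


# Hypothetical sentence and offsets, chosen to mirror the entities printed above
text = "Fischler spoke to EU ministers ."
ents = [(0, 8, "PERSON"), (18, 20, "ORG")]

marked = highlight(text, ents)
print(marked)
```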


# 6. Model-based NER (medium model)


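We repeat the model-based extraction using en_core_web_md for a direct comparison with the small model. The medium model includes word vectors and often finds more entities, but it is not uniformly better: in the output below it labels the -DOCSTART- marker and “BRUSSELS 1996-08-22” as ORG, which the small model did not.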

In [None]:
all_entities_md = []

for sent in train_sentences:
    doc = nlp_md(sent)
    for ent in doc.ents:
        all_entities_md.append((ent.text, ent.label_))

entities_df_md = pd.DataFrame(all_entities_md, columns=["entity", "label"])
print(entities_df_md.head(20))
print("Entity type counts:\n")
print(entities_df_md["label"].value_counts())

print("\nMost common entities:\n")
print(entities_df_md["entity"].value_counts().head(20))


                     entity   label
0                -DOCSTART-     ORG
1                        EU     ORG
2                    German    NORP
3                   British    NORP
4           Peter Blackburn  PERSON
5       BRUSSELS 1996-08-22     ORG
6   The European Commission     ORG
7                  Thursday    DATE
8                    German    NORP
9                   British    NORP
10                  Germany     GPE
11    the European Union 's     ORG
12         Werner Zwingmann  PERSON
13                Wednesday    DATE
14                  Britain     GPE
15               Commission     ORG
16     Nikolaus van der Pas  PERSON
17       the European Union     ORG
18               last month    DATE
19                  EU Farm     ORG
Entity type counts:

label
GPE         149
NORP         93
DATE         92
ORG          85
PERSON       79
CARDINAL     46
PERCENT      11
MONEY        10
LOC           4
ORDINAL       4
PRODUCT       3
EVENT         2
TIME          2
QUANTITY 
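
To make the comparison between the two models concrete, the per-label counts can be put side by side. The sketch below copies the counts from the two outputs shown in this notebook and computes the difference for each label. Labels whose rows are cut off or missing in either output (QUANTITY, LANGUAGE, PRODUCT) are left out rather than guessed. In the notebook itself, the same numbers could be taken from the two `value_counts()` results instead of being typed in.

```python
from collections import Counter

# Counts copied from the small-model output (truncated rows omitted)
sm_counts = Counter({
    "GPE": 139, "DATE": 91, "NORP": 88, "PERSON": 74, "ORG": 73,
    "CARDINAL": 47, "MONEY": 11, "PERCENT": 11, "LOC": 5,
    "ORDINAL": 4, "EVENT": 2, "TIME": 2,
})

# Counts copied from the medium-model output (truncated rows omitted)
md_counts = Counter({
    "GPE": 149, "NORP": 93, "DATE": 92, "ORG": 85, "PERSON": 79,
    "CARDINAL": 46, "PERCENT": 11, "MONEY": 10, "LOC": 4,
    "ORDINAL": 4, "EVENT": 2, "TIME": 2,
})

# Compare only labels that have a complete row in both outputs
shared = sorted(set(sm_counts) & set(md_counts))
diff = {label: md_counts[label] - sm_counts[label] for label in shared}

# Largest absolute changes first
for label, delta in sorted(diff.items(), key=lambda kv: -abs(kv[1])):
    print(f"{label:10} sm={sm_counts[label]:4} md={md_counts[label]:4} diff={delta:+d}")
```

A higher count does not by itself mean better accuracy; it only shows where the two models disagree most, here mainly on ORG and GPE.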

# 7. Rule-based NER (medium model)

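We also apply the rule-based approach with the medium pipeline, using a matcher built on its vocabulary. The patterns rely only on token-level attributes (IS_TITLE, LIKE_NUM, LOWER) rather than on model predictions, so the counts are expected to match those from the small pipeline. This demonstrates that rule-based NER can be applied independently of which spaCy pipeline we use.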


In [None]:
matcher_md = Matcher(nlp_md.vocab)
# Same three patterns as in the small-model section
matcher_md.add("PROPER_NOUN", [[{"IS_TITLE": True, "OP": "+"}]])
matcher_md.add("MONEY", [[{"LIKE_NUM": True}, {"LOWER": {"IN": ["dollars","usd","$","million","billion"]}}]])
matcher_md.add("DATE", [[{"LIKE_NUM": True}, {"IS_TITLE": True}]])

rule_entities_md = []
for sent in train_sentences:
    doc_md = nlp_md(sent)
    # Use the matcher built on nlp_md's vocab, not the earlier small-model matcher
    matches = matcher_md(doc_md)
    for match_id, start, end in matches:
        span_md = doc_md[start:end]
        rule_entities_md.append((span_md.text, nlp_md.vocab.strings[match_id]))

rule_df_md = pd.DataFrame(rule_entities_md, columns=["entity", "rule_label"])
# Report the medium-pipeline results (rule_df holds the small-pipeline results)
print(rule_df_md['rule_label'].value_counts())

rule_label
PROPER_NOUN    828
DATE            11
MONEY            2
Name: count, dtype: int64
