### spaCy Processing Pipeline Challenge

ALLANAI Learning Series — NLP Module

#### Objective

In this challenge, you will build a complete Natural Language Processing (NLP) pipeline using spaCy and apply it to both:

- A single piece of text, and
- A small sentiment dataset (built in memory as a pandas DataFrame, standing in for a CSV file)

The pipeline will include tokenization, stopword removal, part-of-speech tagging, dependency parsing, lemmatization, and named entity recognition (NER).

#### Step 1: Import and Setup

In [3]:
import spacy
import pandas as pd

# Download spaCy English model (run once)
!python -m spacy download en_core_web_sm

# Load model
nlp = spacy.load("en_core_web_sm")


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
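
The stages listed in the objective correspond to components that spaCy registers on the loaded pipeline, and these can be inspected through `nlp.pipe_names`. The snippet below is a minimal sketch, not part of the original notebook: the `load_pipeline` helper and its fallback to a blank English pipeline are assumptions added so the code runs even when the model package is not installed.

```python
import spacy
from spacy.util import is_package


def load_pipeline(name="en_core_web_sm"):
    """Hypothetical helper: load `name` if it is installed, else a blank English pipeline."""
    if is_package(name):
        return spacy.load(name)
    # Fallback: tokenizer only, with no tagger, parser, lemmatizer, or NER
    return spacy.blank("en")


pipeline = load_pipeline()
# Names of the components that run after tokenization, in order
print(pipeline.pipe_names)
```

With `en_core_web_sm` installed, the printed list should include components such as the tagger, parser, lemmatizer, and NER. With the blank fallback the list is empty, because only the tokenizer is present.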


#### Step 2: Single Sentence Processing

In [6]:
text = "London is a beautiful city. Mahira is learning NLP with ALLANAI Labs."

doc = nlp(text)

print("=== Tokens and POS Tags ===")
# Print each token with its POS tag and dependency label in aligned columns
for token in doc:
    print(f"{token.text:<15}{token.pos_:<10}{token.dep_}")

print("\n=== Named Entities ===")
for ent in doc.ents:
    print(f"{ent.text:<25} → {ent.label_}")


=== Tokens and POS Tags ===
London         PROPN     nsubj
is             AUX       ROOT
a              DET       det
beautiful      ADJ       amod
city           NOUN      attr
.              PUNCT     punct
Mahira         PROPN     nsubj
is             AUX       aux
learning       VERB      ROOT
NLP            PROPN     dobj
with           ADP       prep
ALLANAI        PROPN     compound
Labs           PROPN     pobj
.              PUNCT     punct

=== Named Entities ===
London                    → GPE
Mahira                    → PERSON
NLP                       → ORG
ALLANAI Labs              → ORG
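
The tag and label abbreviations in this output can be looked up with `spacy.explain`, which needs no model. The sketch below simply collects descriptions for labels that appear above; the `labels` list is hand-picked for illustration. Keep in mind that entity labels are predictions of a statistical model, so a result such as `NLP → ORG` is a guess rather than a fact about the text.

```python
import spacy

# Labels copied from the output above
labels = ["PROPN", "AUX", "nsubj", "attr", "dobj", "pobj", "GPE", "PERSON", "ORG"]

# spacy.explain returns a short description, or None for a label it does not know
glossary = {label: spacy.explain(label) for label in labels}

for label, meaning in glossary.items():
    print(f"{label:<8}{meaning}")
```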


#### Step 3: Tokenization, Stopwords & Lemmatization

In [28]:
filtered = [(token.text, token.lemma_) for token in doc if not token.is_stop and not token.is_punct]
print(filtered)

[('London', 'London'), ('beautiful', 'beautiful'), ('city', 'city'), ('Mahira', 'Mahira'), ('learning', 'learn'), ('NLP', 'NLP'), ('ALLANAI', 'ALLANAI'), ('Labs', 'Labs')]
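
To check what the filter discards, it can help to print the complementary list. The sketch below assumes a blank English pipeline (`spacy.blank("en")`) instead of the downloaded model, since `is_stop` and `is_punct` are lexical attributes that do not depend on the trained components; the variable names are illustrative.

```python
import spacy

# Assumption: a blank pipeline is enough here because only the tokenizer
# and lexical attributes (is_stop, is_punct) are used
blank_nlp = spacy.blank("en")
sample = blank_nlp("London is a beautiful city. Mahira is learning NLP with ALLANAI Labs.")

# Tokens the Step 3 filter would drop: stop words and punctuation
removed = [token.text for token in sample if token.is_stop or token.is_punct]
print(removed)  # ['is', 'a', '.', 'is', 'with', '.']
```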


#### Step 4: Mock Sentiment Dataset

Instead of reading a file, we’ll create our own small dataset.

In [31]:
data = {
    "text": [
        "I love this product, it’s amazing!",
        "The service was slow and disappointing.",
        "Wonderful experience, highly recommend it.",
        "Not worth the price at all.",
        "The quality was okay but delivery was late."
    ]
}

df = pd.DataFrame(data)
df


| | text |
|---|---|
| 0 | I love this product, it’s amazing! |
| 1 | The service was slow and disappointing. |
| 2 | Wonderful experience, highly recommend it. |
| 3 | Not worth the price at all. |
| 4 | The quality was okay but delivery was late. |
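
If the sentences do live in a CSV file, the same DataFrame can be produced with `pd.read_csv`. The sketch below is one possible approach under stated assumptions: it reads from an in-memory string (`io.StringIO`) so it runs without a file on disk, and the single `text` column mirrors the mock dataset above. With a real file, its path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; fields containing commas must be quoted
csv_text = '''text
"I love this product, it’s amazing!"
The service was slow and disappointing.
'''

csv_df = pd.read_csv(io.StringIO(csv_text))

# Fail early if the expected column is missing
if "text" not in csv_df.columns:
    raise ValueError("expected a 'text' column")

print(csv_df.shape)  # (2, 1)
```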


#### Step 5: Apply spaCy Pipeline to Dataset

In [34]:
results = []

for line in df['text']:
    doc = nlp(line)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    lemmas = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
    
    results.append({
        "Sentence": line,
        "Cleaned_Tokens": " ".join(lemmas),
        "Entities": entities
    })

processed_df = pd.DataFrame(results)
processed_df


| | Sentence | Cleaned_Tokens | Entities |
|---|---|---|---|
| 0 | I love this product, it’s amazing! | love product amazing | [] |
| 1 | The service was slow and disappointing. | service slow disappointing | [] |
| 2 | Wonderful experience, highly recommend it. | wonderful experience highly recommend | [] |
| 3 | Not worth the price at all. | worth price | [] |
| 4 | The quality was okay but delivery was late. | quality okay delivery late | [] |
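
For larger datasets, calling `nlp(line)` once per row can be slow. spaCy's `nlp.pipe` processes texts in batches and yields `Doc` objects in input order. The sketch below shows the same cleaning loop in that style. It assumes a blank English pipeline so it runs without the model, and therefore uses lowercased token text (`lower_`) where the notebook uses lemmas; with `en_core_web_sm` loaded, `token.lemma_.lower()` would be used as in Step 5.

```python
import spacy

# Assumption: blank pipeline (tokenizer only) so no model download is needed
batch_nlp = spacy.blank("en")

texts = [
    "The service was slow and disappointing.",
    "Not worth the price at all.",
]

rows = []
# nlp.pipe yields Doc objects in the same order as the input texts
for batch_doc in batch_nlp.pipe(texts, batch_size=32):
    kept = [t.lower_ for t in batch_doc if not t.is_stop and not t.is_punct]
    rows.append({"Sentence": batch_doc.text, "Cleaned_Tokens": " ".join(kept)})

for row in rows:
    print(row)
```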


#### Step 6: Summary

You’ve now built a spaCy NLP pipeline that handles both raw text and a small in-memory dataset, with no external data files needed beyond the downloaded en_core_web_sm model.

This forms the foundation for sentiment analysis, keyword extraction, or text classification in future notebooks.
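
One caveat for sentiment work: in Step 5, "Not worth the price at all." was reduced to "worth price", because "not" is on spaCy's default stop-word list. A possible adjustment is to exempt negation words from stop-word removal. The sketch below is illustrative only: the `NEGATIONS` set and the `clean_keep_negation` helper are hypothetical names, and a blank English pipeline with lowercased token text is assumed in place of the lemmatizing model.

```python
import spacy

# Assumption: blank pipeline (tokenizer and lexical attributes only)
neg_nlp = spacy.blank("en")

# Hypothetical, deliberately small set of negation cues to keep
NEGATIONS = {"not", "no", "never", "n't"}


def clean_keep_negation(text):
    """Drop punctuation and stop words, but keep tokens listed in NEGATIONS."""
    kept = [
        token.lower_
        for token in neg_nlp(text)
        if not token.is_punct and (not token.is_stop or token.lower_ in NEGATIONS)
    ]
    return " ".join(kept)


print(clean_keep_negation("Not worth the price at all."))  # not worth price
```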