# Day 65 – Introduction to spaCy Library

In this session, I learned about the **spaCy library**, a widely used Python library for **Natural Language Processing (NLP)**.
I explored how spaCy processes text using its **efficient NLP pipeline**, covering key concepts like **tokenization**, **part-of-speech tagging**, **named entity recognition (NER)**, and **dependency parsing**.

The goal of this session was to understand **how spaCy simplifies complex text processing tasks** and prepares data for **real-world NLP applications**.

---


## Introduction to spaCy

**spaCy** is an open-source **Natural Language Processing (NLP)** library for Python, designed specifically for **industrial and real-world applications**.
It provides fast, accurate, and easy-to-use tools for text processing and linguistic analysis.

Unlike older libraries such as **NLTK**, spaCy focuses on **production-ready performance**, offering pre-trained pipelines for multiple languages.

---

## Why spaCy?

spaCy is preferred by developers and data scientists because of its:

* **Speed:** Performance-critical parts are written in Cython (a Python superset that compiles to C), making it very fast.
* **Ease of Use:** Clean, consistent API for working with NLP pipelines.
* **Pre-trained Models:** Ready-to-use models for tokenization, POS tagging, NER, and dependency parsing.
* **Production Focus:** Designed for building real-world applications like chatbots, extractors, and sentiment systems.
* **Integration:** Works smoothly with libraries like **scikit-learn**, **Hugging Face**, and **PyTorch**.
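
As a small illustration of the speed and production points above, spaCy can process many texts in one call with `nlp.pipe`, which batches the work internally. This is a minimal sketch, assuming spaCy v3. It uses a blank English pipeline (tokenizer only) so that it runs without downloading a model, and the sample texts are made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# Hypothetical sample texts, for illustration only.
texts = [
    "spaCy is built for production use.",
    "It processes text in batches.",
    "Each text becomes a Doc object.",
]

# nlp.pipe streams the texts through the pipeline in batches
# and yields one Doc per input text, in the original order.
docs = list(nlp.pipe(texts, batch_size=2))

for doc in docs:
    print(len(doc), "tokens:", doc.text)
```

For a handful of short strings the difference from calling `nlp(text)` in a loop is small; batching tends to matter more with trained pipelines and large collections.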

---

## spaCy NLP Pipeline

When you pass text into a spaCy model, it goes through several processing stages known as the **NLP pipeline**. The exact components and their order depend on the loaded model, but the main stages are:

1. **Tokenization** → Splitting text into meaningful units (tokens).
2. **Part-of-Speech (POS) Tagging** → Identifying the grammatical role of each token (noun, verb, adjective, etc.).
3. **Dependency Parsing** → Understanding how words are related grammatically within a sentence.
4. **Named Entity Recognition (NER)** → Detecting named entities like people, places, organizations, and dates.
5. **Lemmatization** → Converting words to their base or dictionary form.


Example:

> “Apple is looking at buying U.K. startup for $1 billion.”

* **Entities:** Apple (ORG), U.K. (GPE), $1 billion (MONEY)
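
To see which stages a pipeline actually contains, `nlp.pipe_names` lists its components in order. This is a minimal sketch, assuming spaCy v3. It starts from a blank English pipeline and adds the rule-based `sentencizer`, so no model download is needed. With a trained model such as `en_core_web_sm`, the list would show that model's own components instead.

```python
import spacy

# A blank pipeline contains only the tokenizer, which is not listed as a component.
nlp = spacy.blank("en")
print(nlp.pipe_names)  # []

# Add a rule-based sentence splitter as the first pipeline component.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # ['sentencizer']

# The added component now runs on every text passed to nlp().
doc = nlp("This is one sentence. This is another.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```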

---

## Key Components in spaCy

| Component   | Description                             | Example Output                   |
| ----------- | --------------------------------------- | -------------------------------- |
| **Doc**     | Processed text returned by the pipeline | Entire sentence/document         |
| **Token**   | Individual word, punctuation, or symbol | “Apple”, “is”, “looking”         |
| **Span**    | Slice of a Doc (subset of tokens)       | “Apple is looking”               |
| **Vocab**   | Stores word information and vectors     | Contains lexical attributes      |
| **Matcher** | Rule-based pattern matching engine      | Detects patterns like “New York” |
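
The objects in the table can be tried out without a trained model. The sketch below assumes spaCy v3 and a blank English pipeline. The example sentence and the pattern name `NEW_YORK` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# Doc: the processed text returned by the pipeline.
doc = nlp("Apple is looking at offices in New York.")

# Token: a single element of the Doc; Span: a slice of the Doc.
token = doc[0]
span = doc[0:3]
print(token.text, "|", span.text)

# Vocab: shared storage for lexical attributes, looked up by string.
lexeme = nlp.vocab["Apple"]
print(lexeme.is_alpha, lexeme.is_stop)

# Matcher: rule-based matching over token attributes.
matcher = Matcher(nlp.vocab)
matcher.add("NEW_YORK", [[{"LOWER": "new"}, {"LOWER": "york"}]])
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```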

---

## Common spaCy Functionalities

* **Tokenization:** Breaks text into tokens (words, punctuation, and symbols).
* **Stopword Removal:** Filters out common uninformative words (e.g., “the”, “is”).
* **Lemmatization:** Converts words to their base form (“running” → “run”).
* **POS Tagging:** Identifies grammatical category (noun, verb, etc.).
* **Named Entity Recognition (NER):** Detects entities like names, locations, money, etc.
* **Dependency Parsing:** Understands how words relate syntactically.
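
Stop word removal from the list above can be sketched with token flags alone. The example assumes spaCy v3 and a blank English pipeline, which is enough here because `is_stop` and `is_punct` are lexical attributes. Lemmatization, POS tagging, NER, and parsing need a trained model and are not shown. The sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The cat is sitting on the mat.")

# Keep only tokens that are neither stop words nor punctuation.
content = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(content)
```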

---

## Comparison: spaCy vs NLTK

| Feature               | **spaCy**                      | **NLTK**             |
| --------------------- | ------------------------------ | -------------------- |
| Purpose               | Industrial / Production        | Research / Teaching  |
| Speed                 | Very fast (Cython-based)       | Slower (pure Python) |
| Ease of Use           | Clean API                      | More complex setup   |
| Deep Learning Support | Yes (via `spacy-transformers`) | Limited              |
| Pre-trained Models    | Downloadable pipelines         | A few (e.g., tagger) |
| Focus                 | Modern, real-world NLP         | Academic learning    |

---

## Applications of spaCy

* Information extraction from documents
* Chatbots and conversational AI
* Sentiment analysis
* Named Entity Recognition (NER) systems
* Resume and document parsing
* Text summarization and classification
* Legal, medical, and financial text analysis

---


## Installation and Setup

```python
# Install spaCy
# !pip install spacy

# Download the small English language model
# !python -m spacy download en_core_web_sm

# Larger models are also available:
# en_core_web_md  (medium)
# en_core_web_lg  (large)
```

## Loading the Model

```python
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Example text
text = "Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."

# Process the text
doc = nlp(text)

print(doc)
```

```text
Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022.
```
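
`spacy.load` raises an error if the model package has not been downloaded. One way to guard against that is sketched below, assuming spaCy v3. The helper name `load_pipeline` is hypothetical, and the fallback pipeline only tokenizes, so POS tags, entities, and dependencies would be missing from its output.

```python
import spacy
from spacy.util import is_package


def load_pipeline(model_name: str = "en_core_web_sm"):
    """Load a trained pipeline if it is installed, else fall back to a blank one.

    Hypothetical helper for illustration; not part of spaCy itself.
    """
    if is_package(model_name):
        return spacy.load(model_name)
    print(f"Model '{model_name}' is not installed; using a blank English pipeline.")
    return spacy.blank("en")


nlp = load_pipeline()
doc = nlp("Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022.")
print(len(doc), "tokens")
```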

## Named Entity Recognition (NER)

```python
# Print the named entities found in the text
print("Named Entities, Phrases and Concepts:")
for ent in doc.ents:
    print(f"{ent.text:15} {ent.label_:10} {ent.start_char:10} {ent.end_char:10}")
```

```text
Named Entities, Phrases and Concepts:
Elon Musk       PERSON              0          9
2002            DATE               28         32
Twitter         PERSON             46         53
2022            DATE               57         61
```
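
In the output above, the small model tags Twitter as `PERSON` and does not detect SpaceX at all. One possible way to handle known names is spaCy's rule-based `entity_ruler` component. This is a minimal sketch, assuming spaCy v3. It runs on a blank pipeline so that only the rules are visible, and the two patterns are my own illustrative choices.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based entity component; patterns here are exact phrase matches.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "SpaceX"},
    {"label": "ORG", "pattern": "Twitter"},
])

doc = nlp("Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained model, the ruler can be added with `nlp.add_pipe("entity_ruler", before="ner")` so that the rule-based entities are set first and the statistical component works around them.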


## Tokenization

```python
for token in doc:
    print(token.text)
```

```text
Elon
Musk
founded
SpaceX
in
2002
and
acquired
Twitter
in
2022
.
```
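
Tokenization is more than splitting on spaces. The sketch below, assuming spaCy v3 and a blank English pipeline, runs the tokenizer on the Apple sentence from earlier to show how abbreviations, currency symbols, and trailing punctuation are handled. It also prints each token's index (`token.i`) and character offset (`token.idx`).

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

tokens = [token.text for token in doc]
print(tokens)

# token.i is the position in the Doc; token.idx is the character offset in the text.
for token in doc:
    print(token.i, token.idx, token.text)
```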


## Part of Speech (POS) Tagging

```python
for token in doc:
    print(token.text, ":", token.pos_)
```

```text
Elon : PROPN
Musk : PROPN
founded : VERB
SpaceX : PROPN
in : ADP
2002 : NUM
and : CCONJ
acquired : VERB
Twitter : PROPN
in : ADP
2022 : NUM
. : PUNCT
```
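
The tag abbreviations in the output above can be looked up with `spacy.explain`, which returns a short description for a label. This is a small sketch, assuming spaCy v3. It needs no model because the descriptions come from spaCy's built-in glossary.

```python
import spacy

# POS labels that appear in the output above.
labels = ["PROPN", "VERB", "ADP", "NUM", "CCONJ", "PUNCT"]

# Map each label to its glossary description.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:6} {description}")
```

The same call works for entity labels such as `GPE` and dependency labels such as `nsubj`.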


## Lemmatization & Dependency Parsing

```python
# Tokens with POS, lemma (base form), and dependency relation
for token in doc:
    print(token.text, ":", token.pos_, "-->", token.lemma_, "| Dependency:", token.dep_)
```

```text
Elon : PROPN --> Elon | Dependency: compound
Musk : PROPN --> Musk | Dependency: nsubj
founded : VERB --> found | Dependency: ROOT
SpaceX : PROPN --> SpaceX | Dependency: dobj
in : ADP --> in | Dependency: prep
2002 : NUM --> 2002 | Dependency: pobj
and : CCONJ --> and | Dependency: cc
acquired : VERB --> acquire | Dependency: conj
Twitter : PROPN --> Twitter | Dependency: dobj
in : ADP --> in | Dependency: prep
2022 : NUM --> 2022 | Dependency: pobj
. : PUNCT --> . | Dependency: punct
```
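
Dependency labels become more useful once the tree is navigated through `token.head` and `token.children`. The sketch below assumes spaCy v3 and rebuilds the parse by hand so that it runs without a model. The `deps` list is copied from the output above, but the `heads` indices are my assumption about what the parser produced, since the output does not show them. The helper `find_subject` is hypothetical.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Elon", "Musk", "founded", "SpaceX", "in", "2002",
         "and", "acquired", "Twitter", "in", "2022", "."]
# Dependency labels copied from the output above.
deps = ["compound", "nsubj", "ROOT", "dobj", "prep", "pobj",
        "cc", "conj", "dobj", "prep", "pobj", "punct"]
# ASSUMPTION: absolute index of each token's head (not shown in the output).
heads = [1, 2, 2, 2, 2, 4, 2, 2, 7, 7, 9, 2]

doc = Doc(Vocab(), words=words, heads=heads, deps=deps)


def find_subject(verb):
    """Return the nsubj child of a verb; a conjoined verb borrows its head's subject."""
    for child in verb.children:
        if child.dep_ == "nsubj":
            return child
    if verb.dep_ == "conj":
        return find_subject(verb.head)
    return None


# Collect (subject, verb, object) triples from direct objects.
triples = []
for token in doc:
    if token.dep_ == "dobj":
        verb = token.head
        subject = find_subject(verb)
        if subject is not None:
            # subtree includes modifiers, so "Elon" is kept with "Musk".
            subject_text = " ".join(t.text for t in subject.subtree)
            triples.append((subject_text, verb.text, token.text))

print(triples)
```

With a trained pipeline loaded, the same loop can run directly on `nlp(text)` instead of the hand-built `Doc`.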


## All in one: POS, Lemma, Dependency, Shape, Alphabet check, Stop word check

```python
for token in doc:
    print(token.text, ":", token.pos_, "-->", token.lemma_, "|",
          "Dep:", token.dep_, "| Shape:", token.shape_,
          "| Alpha:", token.is_alpha, "| Stopword:", token.is_stop)
```

```text
Elon : PROPN --> Elon | Dep: compound | Shape: Xxxx | Alpha: True | Stopword: False
Musk : PROPN --> Musk | Dep: nsubj | Shape: Xxxx | Alpha: True | Stopword: False
founded : VERB --> found | Dep: ROOT | Shape: xxxx | Alpha: True | Stopword: False
SpaceX : PROPN --> SpaceX | Dep: dobj | Shape: XxxxxX | Alpha: True | Stopword: False
in : ADP --> in | Dep: prep | Shape: xx | Alpha: True | Stopword: True
2002 : NUM --> 2002 | Dep: pobj | Shape: dddd | Alpha: False | Stopword: False
and : CCONJ --> and | Dep: cc | Shape: xxx | Alpha: True | Stopword: True
acquired : VERB --> acquire | Dep: conj | Shape: xxxx | Alpha: True | Stopword: False
Twitter : PROPN --> Twitter | Dep: dobj | Shape: Xxxxx | Alpha: True | Stopword: False
in : ADP --> in | Dep: prep | Shape: xx | Alpha: True | Stopword: True
2022 : NUM --> 2022 | Dep: pobj | Shape: dddd | Alpha: False | Stopword: False
. : PUNCT --> . | Dep: punct | Shape: . | Alpha: False | Stopword: False
```
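
Printed lines are hard to reuse, so the same attributes can be collected into plain Python data instead. This is a minimal sketch, assuming spaCy v3. It uses a blank English pipeline and therefore only the lexical attributes (shape, alpha, stop word, number-like), since POS, lemma, and dependency need a trained model. The dictionary keys are my own naming.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022.")

# One dictionary per token, holding model-free lexical attributes.
rows = [
    {
        "text": token.text,
        "shape": token.shape_,
        "is_alpha": token.is_alpha,
        "is_stop": token.is_stop,
        "like_num": token.like_num,
    }
    for token in doc
]

# Example filter: alphabetic tokens that are not stop words.
content_words = [row["text"] for row in rows if row["is_alpha"] and not row["is_stop"]]
print(content_words)

# Example filter: tokens that look like numbers.
numbers = [row["text"] for row in rows if row["like_num"]]
print(numbers)
```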


---

## Summary

In this session, I explored the **spaCy library**, a fast and modern NLP framework used for **linguistic analysis and text processing**.
I learned how to load language models, process text, and perform tasks such as **tokenization**, **part-of-speech tagging**, **named entity recognition**, and **dependency parsing**.

spaCy is designed not just for experimentation but for building **real-world, scalable NLP applications**, and its simple, consistent API made it easy to analyze text quickly.


## Key Learning

* Understood what **spaCy** is and why it’s used in modern NLP workflows.
* Learned about spaCy’s **NLP pipeline** and how it processes text step-by-step.
* Practiced key operations like **tokenization**, **POS tagging**, **lemmatization**, and **NER**.
* Explored how **dependency parsing** helps understand relationships between words.
* Compared **spaCy vs NLTK**, and saw why spaCy is preferred for production use.
* Understood that spaCy is an **industrial-grade NLP toolkit** built for real-world AI applications.

---