# spaCy Module

spaCy is a popular open-source library for advanced Natural Language Processing (NLP) in Python. It is designed for performance and ease of use, enabling developers to process large volumes of text quickly and efficiently.

> pip install spacy

> python -m spacy download en_core_web_sm

In [1]:

import spacy

# Load the spaCy model

nlp = spacy.load("en_core_web_sm")


# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print("\nPrinting doc \n",doc,"Printed doc successfully\n\n")


# Print each token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)



Printing doc 
 Apple is looking at buying U.K. startup for $1 billion Printed doc successfully


Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN dep xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


## Detailed Explanation of Tokens
In Natural Language Processing (NLP), tokens are the basic building blocks of text, and tokenization is the process of splitting text into these units.

### Definition and Purpose:

> Token:

A token is an instance of a sequence of characters in a text that is grouped together as a semantic unit for processing. Tokens can be words, punctuation marks, numbers, or any other meaningful units of text.

> Purpose:

Tokenization converts a stream of text into smaller, manageable pieces (tokens) that can be processed individually. It is the first step for downstream tasks such as parsing, part-of-speech tagging, and named entity recognition.

### Tokenization Process:

The process of tokenization involves dividing a text into a list of tokens. For example, the sentence "Hello, world!" could be tokenized into ["Hello", ",", "world", "!"].


Tokenization can be complex because it involves handling various language-specific rules and exceptions, such as contractions in English (e.g., "don't" -> ["do", "n't"]) or compound words in German.
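
The two examples above can be reproduced with a small regular expression. This is only a minimal sketch using Python's standard `re` module, not spaCy's tokenizer; the pattern and the name `simple_tokenize` are assumptions made for illustration, and the contraction rule covers "n't" only.

```python
import re

# Illustrative pattern (an assumption, not spaCy's rules). Alternatives are
# tried left to right:
#   \w+(?=n't)  the stem of an "n't" contraction ("do" in "don't")
#   n't         the contraction suffix itself
#   \w+         any other run of word characters
#   [^\w\s]     any single symbol that is neither a word character nor a space
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")

def simple_tokenize(text):
    """Return the tokens of `text` as a list of strings."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("Hello, world!"))   # ['Hello', ',', 'world', '!']
print(simple_tokenize("I don't know."))   # ['I', 'do', "n't", 'know', '.']
```

Even this tiny rule set shows why tokenization needs exceptions: without the first two alternatives, "don't" would be split into "don", "'", and "t".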

### Types of Tokens:

> Words: 

The most common type of tokens. For example, "Hello", "world".

> Punctuation: 

Punctuation marks are often treated as separate tokens. For example, ",", "!".


> Numbers:

Numeric values are also treated as tokens. For example, "123".

> Special Characters:

Any other characters, such as "$" or "&".
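
As a rough illustration of these four categories, the sketch below classifies individual token strings with built-in string methods. The function name `token_type` and the `PUNCTUATION` set are hypothetical choices for this example; a real tokenizer draws these boundaries with far more care.

```python
# Hypothetical set of characters treated as punctuation in this sketch.
PUNCTUATION = set(".,;:!?\"'()[]-")

def token_type(token):
    """Classify a token string as word, number, punctuation, or special."""
    if token.isalpha():
        return "word"
    if token.isdigit():
        return "number"
    if token and all(ch in PUNCTUATION for ch in token):
        return "punctuation"
    # Everything else (symbols such as "$" or "&", mixed strings) falls here.
    return "special"

for tok in ["Hello", ",", "123", "$", "&", "!"]:
    print(tok, "->", token_type(tok))
```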

### Tokenization Techniques:

> Whitespace Tokenization: 

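Splits text by spaces. Simple, but punctuation stays attached to the neighboring word (for example, "world!" remains a single token).

> Punctuation-based Tokenization: 

Splits text at punctuation marks as well as whitespace, so each punctuation mark becomes its own token. This can over-split abbreviations such as "U.K.".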

> Regular Expression-based Tokenization: 

Uses regular expressions to define complex splitting rules.

> Language-specific Tokenization: 

Uses language-specific rules to accurately tokenize text, handling things like contractions and compound words.
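
The difference between the first three techniques is easiest to see on one sentence. The following sketch uses only the standard library; the regular expressions are assumptions chosen for this example sentence and are not the rules spaCy applies.

```python
import re

text = "Apple is looking at buying U.K. startup for $1 billion"

# 1. Whitespace tokenization: split on runs of whitespace.
whitespace_tokens = text.split()

# 2. Punctuation-based tokenization: every symbol becomes its own token.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# 3. Regex tokenization with one extra rule: keep dotted abbreviations
#    such as "U.K." together before falling back to the rules above.
abbrev_tokens = re.findall(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]", text)

print(whitespace_tokens)  # "$1" stays glued together
print(punct_tokens)       # "U.K." is broken into "U", ".", "K", "."
print(abbrev_tokens)      # "U.K." survives, "$" and "1" are separate
```

The third variant yields the same 11 tokens that spaCy prints for this sentence at the top of this document, but only because the extra rule happens to fit this input.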

## Tokenization in spaCy
spaCy provides efficient and robust tokenization out of the box. Each token in a processed `Doc` exposes the attributes described below.

### Detailed Token Attributes in spaCy

`Text:` The original text of the token (`token.text`).

`Lemma:` The base form of the token (`token.lemma_`), useful for normalization.

`POS (Part of Speech):` The coarse-grained syntactic category of the token (`token.pos_`), e.g., noun or verb.

`Tag:` The fine-grained part-of-speech tag (`token.tag_`), e.g., `VBG` for a gerund or present participle.

`Dep (Dependency):` The syntactic dependency label (`token.dep_`), e.g., `nsubj` for a nominal subject or `dobj` for a direct object.

`Shape:` The orthographic shape of the token (`token.shape_`), e.g., "Xxxxx" for "Apple".

`Is Alpha:` Boolean indicating whether the token consists only of alphabetic characters (`token.is_alpha`).

`Is Stop:` Boolean indicating whether the token is a stop word, a common word that is often removed in preprocessing (`token.is_stop`).
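
The shape attribute is the least self-explanatory of these. The sketch below is a standalone approximation of how such a shape string can be computed, written to reproduce the shapes printed in this document; the function name `word_shape` and the exact rules are assumptions for illustration, not a copy of spaCy's implementation.

```python
def word_shape(text):
    """Map letters to X/x and digits to d, keeping other characters as-is.

    Runs of the same shape character are cut off after four, which is why
    "looking" becomes "xxxx" rather than "xxxxxxx".
    """
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch
        run = run + 1 if s == last else 1
        last = s
        if run <= 4:
            shape.append(s)
    return "".join(shape)

for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, word_shape(word))
```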

In [2]:

# Loading a Language Model:

import spacy
nlp = spacy.load("en_core_web_sm")

# Processing Text:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")


# Accessing Tokens:
for token in doc:
    print(token.text)


# Token Attributes:
for token in doc:
    print(f"Text: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}, Tag: {token.tag_}, Dep: {token.dep_}, Shape: {token.shape_}, Is Alpha: {token.is_alpha}, Is Stop: {token.is_stop}")


Text: Apple, Lemma: Apple, POS: PROPN, Tag: NNP, Dep: nsubj, Shape: Xxxxx, Is Alpha: True, Is Stop: False
Text: is, Lemma: be, POS: AUX, Tag: VBZ, Dep: aux, Shape: xx, Is Alpha: True, Is Stop: True
Text: looking, Lemma: look, POS: VERB, Tag: VBG, Dep: ROOT, Shape: xxxx, Is Alpha: True, Is Stop: False
Text: at, Lemma: at, POS: ADP, Tag: IN, Dep: prep, Shape: xx, Is Alpha: True, Is Stop: True
Text: buying, Lemma: buy, POS: VERB, Tag: VBG, Dep: pcomp, Shape: xxxx, Is Alpha: True, Is Stop: False
Text: U.K., Lemma: U.K., POS: PROPN, Tag: NNP, Dep: dobj, Shape: X.X., Is Alpha: False, Is Stop: False
Text: startup, Lemma: startup, POS: NOUN, Tag: NN, Dep: dep, Shape: xxxx, Is Alpha: True, Is Stop: False
Text: for, Lemma: for, POS: ADP, Tag: IN, Dep: prep, Shape: xxx, Is Alpha: True, Is Stop: True
Text: $, Lemma: $, POS: SYM, Tag: $, Dep: quantmod, Shape: $, Is Alpha: False, Is Stop: False
Text: 1, Lemma: 1, POS: NUM, Tag: CD, Dep: compound, Shape: d, Is Alpha: False, Is Stop: False
Text: billi

### Named Entity Recognition Labels
![image.png](attachment:image.png)

In [9]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
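
The two numbers in each row are character offsets into the original string, with the end offset exclusive. A quick way to convince yourself is to slice the text directly; this sketch uses plain Python and copies the offsets from the output above.

```python
text = "Apple is looking at buying U.K. startup for $1 billion."

# (start_char, end_char, label) triples copied from the output above.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

for start, end, label in entities:
    # Slicing with [start:end] recovers exactly the entity text.
    print(repr(text[start:end]), label)
```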


### Key Components

Tokenizer: Splits text into tokens (words, punctuation, etc.). This is the first component in the pipeline.

In [12]:
doc = nlp("This is a sentence.")
print("the doc is: ",doc)
for token in doc:
    print(token.text)


the doc is:  This is a sentence.
This
is
a
sentence
.


Tagger: Assigns part-of-speech (POS) tags to each token.

In [24]:
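# Re-process the earlier example sentence; the output below belongs to it
apple_doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in apple_doc:
    print(token.text, token.pos_)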


Apple PROPN
is AUX
looking VERB
at ADP
buying VERB
U.K. PROPN
startup NOUN
for ADP
$ SYM
1 NUM
billion NUM
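
Once tokens carry POS tags, simple filtering and counting become possible. The sketch below works on the (token, tag) pairs copied from the output above rather than on a live `Doc`, so it runs without spaCy; the choice of which tags count as "content words" is an assumption for this example.

```python
from collections import Counter

# (token, POS) pairs copied from the tagger output above.
tagged = [
    ("Apple", "PROPN"), ("is", "AUX"), ("looking", "VERB"), ("at", "ADP"),
    ("buying", "VERB"), ("U.K.", "PROPN"), ("startup", "NOUN"),
    ("for", "ADP"), ("$", "SYM"), ("1", "NUM"), ("billion", "NUM"),
]

# How often does each tag occur?
pos_counts = Counter(pos for _, pos in tagged)

# Keep only nouns, proper nouns, and verbs (an illustrative selection).
CONTENT_TAGS = {"NOUN", "PROPN", "VERB"}
content_words = [tok for tok, pos in tagged if pos in CONTENT_TAGS]

print(pos_counts)
print(content_words)
```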


Parser: Analyzes the syntactic structure and assigns dependency labels.

In [14]:
for token in doc:
    print(token.text, token.dep_, token.head.text)


This nsubj is
is ROOT is
a det sentence
sentence attr is
. punct is
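
Each row above reads as "token, relation to its head, head", and the root is the token whose label is `ROOT` (it is listed as its own head). The following sketch rebuilds the tree from those rows using plain Python. It keys tokens by their text, which only works because every word in this sentence is unique; code working on a real `Doc` would use token indices instead.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
triples = [
    ("This", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("a", "det", "sentence"),
    ("sentence", "attr", "is"),
    (".", "punct", "is"),
]

# The root is the token labeled ROOT.
root = next(tok for tok, dep, head in triples if dep == "ROOT")

# Map each head to the list of its dependents.
children = defaultdict(list)
for tok, dep, head in triples:
    if dep != "ROOT":
        children[head].append((tok, dep))

def print_tree(token, dep="ROOT", depth=0):
    """Print the dependency tree as an indented outline."""
    print("  " * depth + f"{token} ({dep})")
    for child, child_dep in children[token]:
        print_tree(child, child_dep, depth + 1)

print_tree(root)
```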


Named Entity Recognizer (NER): Detects named entities like names of people, organizations, locations, dates, etc.

In [25]:
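# Re-process the earlier example sentence; the output below belongs to it
apple_doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in apple_doc.ents:
    print(ent.text, ent.label_)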


Apple ORG
U.K. GPE
$1 billion MONEY


Lemmatizer: Normalizes words to their base forms.

In [16]:
for token in doc:
    print(token.text, token.lemma_)


This this
is be
a a
sentence sentence
. .


### Custom Components
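You can add custom components to the spaCy pipeline. A component is a function that receives a `Doc`, optionally modifies it, and returns it. Register it under a name with the `@Language.component` decorator, then insert it with `nlp.add_pipe`.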

In [17]:
@spacy.Language.component("custom_component")
def custom_component(doc):
    # Process the doc and modify it
    print("Custom component:", doc)
    return doc

nlp.add_pipe("custom_component", last=True)

doc = nlp("This is a sentence.")


Custom component: This is a sentence.
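
The contract behind this is simple: the pipeline calls each component in order and hands the returned object to the next one. The toy model below illustrates only that contract in plain Python. It is not how spaCy is implemented; the dictionary standing in for a `Doc` and all names (`run_pipeline`, `lowercase_component`, `count_component`) are hypothetical.

```python
def lowercase_component(doc):
    """Toy component: lowercase every token."""
    doc["tokens"] = [t.lower() for t in doc["tokens"]]
    return doc

def count_component(doc):
    """Toy component: store the number of tokens on the doc."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc

def run_pipeline(text, components):
    """Create a toy doc, then pass it through each component in order."""
    doc = {"text": text, "tokens": text.split()}
    for component in components:
        doc = component(doc)  # each component must return the doc
    return doc

result = run_pipeline("This is a sentence.", [lowercase_component, count_component])
print(result["tokens"], result["n_tokens"])
```

A component that forgets to return the doc would hand `None` to the next step, which is why the `return doc` line in the spaCy example above matters.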


### Visualization
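spaCy includes the displaCy visualizer (`spacy.displacy`) for rendering dependency parses and named entities. `displacy.serve` starts a local web server and blocks until it is stopped, so the second call below runs only after the first server has shut down. In a Jupyter notebook, `displacy.render` draws the visualization inline instead.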



In [18]:
from spacy import displacy

# Visualize dependency parse
displacy.serve(doc, style="dep")

# Visualize named entities
displacy.serve(doc, style="ent")





Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.





Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


### Saving and Loading Models
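You can save a pipeline to a directory with `nlp.to_disk()` and load it again later by passing the same path to `spacy.load()`. Because the pipeline above now includes `custom_component`, the code that registers that component must have run before the saved pipeline is loaded.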

In [21]:
# Save model
nlp.to_disk("/huggingface/")

# Load model
nlp = spacy.load("/huggingface/")


## Conclusion
spaCy is a powerful library for NLP tasks, offering tools for tokenization, POS tagging, parsing, named entity recognition, and more. It is optimized for performance and can handle large volumes of text efficiently. The library is highly extensible, allowing for custom components and advanced features like rule-based matching and custom attributes.