<a href="https://colab.research.google.com/github/dharanpreethi/Dragoman_Text_Mining-/blob/main/spaCy_NLP_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **spaCy**

spaCy is a free, open-source Python library for natural language processing (NLP).




### **What is an NLP Model?**
- An **NLP model** is a machine learning-based tool designed to process and understand human language. It helps computers perform various tasks such as:
  - Tokenization (splitting text into words/sentences)
  - Part-of-Speech Tagging (identifying nouns, verbs, adjectives, etc.)
  - Named Entity Recognition (detecting names, places, dates, etc.)
  - Dependency Parsing (understanding grammatical relationships)
  - Word vectors (numeric representations of words used to compare meaning)


### **What is a Pretrained Model?**
- A **pretrained model** is a machine learning model that has already been trained on a **large dataset** and can be used immediately without requiring additional training.
- spaCy’s pretrained models have been trained on sources like:
  - Books
  - News articles
  - Web pages
  - Scientific papers

---

### **When to Use Different spaCy Models and Suitable Corpora**

Each **spaCy model** (`en_core_web_sm`, `en_core_web_md`, `en_core_web_lg`, `en_core_web_trf`) is suited for different types of text analysis. Below is a guide on when to use each model and what kind of corpora are suitable for them.


## **1. `en_core_web_sm` (Small Model)**
**Best for:** Basic NLP tasks that require speed over accuracy.  

### **When to Use It?**
- When working with small corpora or short texts.
- When you need fast processing (e.g., real-time applications).
- When you don’t need word vectors or deep linguistic features.

### **Example Corpora:**
- Social media text (tweets, Facebook posts)
- Customer service logs (chat messages)
- News headlines (short news snippets)
- Product reviews (brief texts from e-commerce sites)*

* since you have this corpus already, we can use it for the


### **What Can It Do?**
- Tokenization
- Part-of-Speech (POS) Tagging
- Named Entity Recognition (NER) (limited accuracy)
- Dependency Parsing

### **Limitations:**
- Lower accuracy for complex NLP tasks
- No word vectors (can't capture semantic similarity)

## **2. `en_core_web_md` (Medium Model)**
**Best for:** More detailed text analysis that includes word vectors.  

### **When to Use It?**
- When you need better accuracy in NER and POS tagging.
- When analyzing longer texts where semantic meaning matters.
- When working with word similarity or topic modeling.

### **Example Corpora:**
- Newspaper articles (medium-length documents)
- Legal documents (contracts, court rulings)
- Academic research papers (general studies, essays)
- Blogs and opinion pieces (longer informal texts)

### **What Can It Do?**
- Everything in `en_core_web_sm`
- More accurate Named Entity Recognition
- Word vectors support (can understand similarity between words)

### **Limitations:**
- Not as accurate as `en_core_web_lg` for complex linguistic tasks
- Still not ideal for deep-learning-based NLP

## **3. `en_core_web_lg` (Large Model)**
**Best for:** High-accuracy NLP tasks that require word vectors.  

### **When to Use It?**
- When you need high accuracy in NER, dependency parsing, and POS tagging.
- When working on detailed linguistic analysis or text similarity tasks.
- When handling large corpora where accuracy is crucial.

### **Example Corpora:**
- Historical texts (Shakespearean plays, classical literature)
- Scientific research papers (long, technical language)
- Medical records (diagnostic reports, clinical notes)
- Political speeches (transcripts from debates, government records)

### **What Can It Do?**
- Everything in `en_core_web_md`
- Most accurate NER and POS tagging among the non-transformer models
- Better support for text classification
- Larger word-vector table than `en_core_web_md`, so more words have their own vector

### **Limitations:**
- Slower than smaller models
- Requires more memory and processing power

## **4. `en_core_web_trf` (Transformer Model)**
**Best for:** Deep learning NLP tasks, research, and complex linguistic understanding.  

### **When to Use It?**
- When absolute accuracy is required, even at the cost of speed.
- When working with deep semantic relationships in long texts.
- When performing machine learning research or AI-based NLP.

### **Example Corpora:**
- Books and long-form literature (novels, poetry collections)
- Philosophical and theoretical texts (discourse analysis)
- Highly structured documents (legal contracts, regulatory documents)
- Long or complex sentences (documents that require high context understanding; note that `en_core_web_trf` is English-only)

### **What Can It Do?**
- Best accuracy for Named Entity Recognition (NER)
- Handles long, complex sentences better
- Understands contextual meaning better than all other models
- Ideal for research in deep learning NLP

### **Limitations:**
- Very slow (not ideal for real-time applications)
- Requires a powerful GPU to run efficiently
- Not needed for simple NLP tasks
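To check what a given pipeline actually provides before choosing one, a small helper like the one below may be useful. This is a minimal sketch: the `describe` function is a hypothetical name introduced here, and a blank English pipeline stands in for the real models so that the code runs without any downloads.

```python
import spacy


def describe(nlp):
    """Summarize what a loaded spaCy pipeline provides (hypothetical helper)."""
    return {
        "components": nlp.pipe_names,                 # e.g. tagger, parser, ner
        "has_vectors": nlp.vocab.vectors_length > 0,  # are static word vectors present?
        "max_length": nlp.max_length,                 # character limit per text
    }


# A blank pipeline is used so the sketch runs without downloads;
# swap in spacy.load("en_core_web_sm") etc. once the models are installed.
summary = describe(spacy.blank("en"))
print(summary)
```

Running `describe` on each of the four models side by side should make the differences in components and word vectors visible.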




## **Now let's try each model one by one**
## **With the small model, we will cover the following topics**

1. Loading spaCy and the small model
2. Reading a corpus of text files
3. Tokenization
4. Part-of-Speech (POS) Tagging
5. Named Entity Recognition (NER)
6. Lemmatization
7. Dependency Parsing
8. Stopword Removal
9. Noun Phrase Extraction





### **Step 1: Install and Load spaCy**

In [None]:

# Install spaCy
!pip install spacy




In [None]:
# Download the four English models used in this notebook.
# Each one provides NLP features like tokenization, POS tagging, and
# named entity recognition; they differ in size, speed, and accuracy.

!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: e

In [None]:
# Load the NLP models
# spacy.load(name) loads a pretrained model into memory for processing text.
import spacy

# max_length (in characters) is raised so that long texts can be processed
nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")
nlp_md.max_length = 2000000
nlp_lg = spacy.load("en_core_web_lg")
nlp_lg.max_length = 2000000
nlp_trf = spacy.load("en_core_web_trf")
nlp_trf.max_length = 2000000

### **Step 2: Read a Corpus of Text Files**

In this code, we use `files` from `google.colab` and the `glob` module to load files. The former uploads files into the Colab working directory. The latter finds file paths that match a pattern (here, all text files, using `*`). It is commonly used to read multiple files in a folder without manually specifying each filename.

In [None]:
from google.colab import files

# Upload multiple files manually
uploaded_files = files.upload()

In [None]:
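import glob

# Get a list of all .txt files in the Colab working directory
text_files = glob.glob("*.txt")

# Read each file into a list of strings
texts = []
for file_path in text_files:
    with open(file_path, "r", encoding="utf-8") as file:
        texts.append(file.read())

print(f"Loaded {len(texts)} documents.")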

Loaded 5 documents.


**Explanation and a few questions:**

1. **glob.glob("*.txt")**: Finds all .txt files in the Colab working directory.
2. **for file_path in text_files**: Loops through each uploaded file.
3. **with open(file_path, "r", encoding="utf-8") as file**: Reads the file content.
4. **texts.append(file.read())**: Stores the text in a list (texts).


- Why encoding utf-8?
- Why r?
- What are file_path and text_files here?
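The loading steps can also be tried in isolation, outside Colab. The sketch below is an illustration under stated assumptions: `load_texts` is a hypothetical helper, it uses `pathlib` rather than `glob.glob`, and the two sample files are created in a temporary folder purely for the demo.

```python
import tempfile
from pathlib import Path


def load_texts(folder):
    """Read every .txt file in `folder` and return the contents as a list."""
    texts = []
    # sorted() gives a stable order; globbing alone does not guarantee one
    for file_path in sorted(Path(folder).glob("*.txt")):
        # "r" opens the file for reading; utf-8 decodes the bytes into text
        with open(file_path, "r", encoding="utf-8") as file:
            texts.append(file.read())
    return texts


# Demo with two throwaway files in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.txt").write_text("Tara nodded sadly.", encoding="utf-8")
    Path(folder, "b.txt").write_text("café naïve", encoding="utf-8")
    sample_texts = load_texts(folder)

print(f"Loaded {len(sample_texts)} documents.")
```

The second sample file contains accented characters, which is where the choice of encoding starts to matter.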

### **Step 3: Tokenization**

In [None]:
import spacy

nlp_sm = spacy.load("en_core_web_sm")
nlp_sm.max_length = 3000000  # Increase max length (set higher than your text length)

for i in texts:
    doc = nlp_sm(i)  # Process each text
    print("Tokens for:", i[:50], "...")  # Print only a preview
    for token in doc:
        print(token.text)


### **A few questions:**

1. What is i here?
2. What is token here?
3. What is text here?
4. What is doc here?
5. Print only the first 10 tokens from a text and see what you get.
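One possible way to approach the last question is to slice the `Doc`, which behaves much like a list of tokens. The sketch below uses a blank English pipeline (tokenizer only) and a sentence taken from the sample output, so it runs without a downloaded model; the names `blank_nlp` and `sample_doc` are chosen here to avoid overwriting the notebook's own variables.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no trained components)
blank_nlp = spacy.blank("en")

sample_text = "Tara nodded sadly. But this was not all that was on her mind."
sample_doc = blank_nlp(sample_text)  # a Doc is a sequence of Token objects

# Slicing a Doc works like slicing a list: [:10] keeps the first 10 tokens
first_ten = [token.text for token in sample_doc[:10]]
print(first_ten)
```

With the notebook's own data, the same slice would be applied to `doc` inside the loop over `texts`.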


### **Step 4: Named Entity Recognition (NER)**

In [None]:
print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} ===== {ent.label_}")

### **Explanation**

1. NER identifies real-world entities (names, locations, dates, etc.).
2. ent.text: The entity's name.
3. ent.label_: The type of entity (e.g., PERSON, GPE for locations).

But how can we print only particular entity types, or only the first 10 entities?
Hint: can we use an `if` statement?
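The hint can be sketched as follows. To keep the example runnable without a trained model, the entities are set by hand on a small `Doc` (assuming spaCy v3); with a real pipeline they would come from `nlp_sm(text)` instead. The variable names are illustrative.

```python
import spacy
from spacy.tokens import Doc, Span

blank_nlp = spacy.blank("en")

# Hand-built Doc; with a trained pipeline this would be nlp_sm(text)
words = ["William", "Shakespeare", "was", "born", "in", "Stratford", "."]
sample_doc = Doc(blank_nlp.vocab, words=words)
# Entities assigned by hand for illustration (normally predicted by NER)
sample_doc.ents = [
    Span(sample_doc, 0, 2, label="PERSON"),
    Span(sample_doc, 5, 6, label="GPE"),
]

# An if-condition keeps only the entities with the label we want
people = [ent.text for ent in sample_doc.ents if ent.label_ == "PERSON"]
print(people)

# doc.ents is a tuple, so slicing gives the first 10 entities (or fewer)
for ent in sample_doc.ents[:10]:
    print(f"{ent.text} ===== {ent.label_}")
```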

### **Step 5: Part-of-Speech (POS) Tagging**

In [None]:
for token in doc[:10]:
    print(f"{token.text}: {token.pos_}")

### **Explanation**

1. POS tagging assigns grammatical categories (NOUN, VERB, ADJ, etc.).
2. token.pos_: Displays the POS tag for each token.

But again, how can we print only one particular POS? You can follow the same pattern you used to print a specific entity type in the previous code.
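The same filtering pattern might look like this for POS tags. As before, this is only a sketch: the tags are written by hand through the `pos` argument of `Doc` (assuming spaCy v3) so that no model download is needed, whereas a trained pipeline would fill in `token.pos_` itself.

```python
import spacy
from spacy.tokens import Doc

blank_nlp = spacy.blank("en")

# Hand-annotated Doc; a trained pipeline would predict these tags
words = ["Tara", "nodded", "sadly", "at", "the", "old", "house", "."]
pos = ["PROPN", "VERB", "ADV", "ADP", "DET", "ADJ", "NOUN", "PUNCT"]
sample_doc = Doc(blank_nlp.vocab, words=words, pos=pos)

# Same pattern as for entities: an if-condition on the attribute
nouns = [token.text for token in sample_doc if token.pos_ == "NOUN"]
print(nouns)
```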

### **Step 6: Lemmatization**

In [None]:
for token in doc[:10]:
    print(f"{token.text}: {token.lemma_}")

### **Explanation**

1. Lemmatization converts words to their base form.
2. Example: "running" → "run", "children" → "child".

Now store the lemmatized words in a separate list and print a few of them.
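One way to collect lemmas into a list is a list comprehension over the tokens. In this sketch the lemmas are supplied by hand through the `lemmas` argument of `Doc` (assuming spaCy v3), so it runs without a model; with the notebook's pipelines, `token.lemma_` would be filled in automatically.

```python
import spacy
from spacy.tokens import Doc

blank_nlp = spacy.blank("en")

# Hand-annotated Doc; a trained pipeline would compute the lemmas
words = ["The", "children", "were", "running", "."]
lemmas_by_hand = ["the", "child", "be", "run", "."]
sample_doc = Doc(blank_nlp.vocab, words=words, lemmas=lemmas_by_hand)

# Collect the lemmas in a separate list, skipping punctuation
lemmas = [token.lemma_ for token in sample_doc if not token.is_punct]
print(lemmas[:3])
```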

### **Step 7: Noun Phrases**

In [None]:
print("Noun Phrases:")
for chunk in doc.noun_chunks:
    print(chunk.text)

### **Explanation**

Noun chunks are base noun phrases: a noun plus the words that describe it, such as "the orange air".

### **Step 8: Stopword Removal**

In [None]:
print("Words without Stopwords:")
for token in doc:
    if not token.is_stop:  # Exclude stopwords
        print(token.text)


Use `STOP_WORDS |= ...` to add custom stopwords (given as a Python set) and `STOP_WORDS -= ...` to remove words from the default stopword list.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

custom_stopwords = {"said", "mr"}  # example words; choose your own
STOP_WORDS |= custom_stopwords
STOP_WORDS -= {"not", "never"}
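Depending on the spaCy version and on what is already cached in the vocabulary, editing the stopword set after a pipeline has been loaded may not change `token.is_stop` for every word. Setting the flag directly on vocabulary entries is another approach; the sketch below assumes a blank English pipeline and uses hypothetical word lists.

```python
import spacy

blank_nlp = spacy.blank("en")  # tokenizer and stopword flags, no trained model

custom_stopwords = {"said"}    # hypothetical additions
keep_words = {"not", "never"}  # words that should no longer count as stopwords

# Flip the is_stop flag on the vocabulary entry for each word
for word in custom_stopwords:
    blank_nlp.vocab[word].is_stop = True
for word in keep_words:
    blank_nlp.vocab[word].is_stop = False

sample_doc = blank_nlp("She said it was not fair")
kept = [token.text for token in sample_doc if not token.is_stop]
print(kept)
```

Note that the flag is set per exact spelling, so a capitalized form such as "Said" would need its own entry.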

In [None]:
import spacy

!python -m spacy download en_core_web_md

# Load the medium-sized spaCy model
nlp_md = spacy.load("en_core_web_md")
nlp_md.max_length = 2000000

# Process a sample sentence
doc_md = nlp_md("William Shakespeare was born in Stratford-upon-Avon.")

# Extract and print named entities
for ent in doc_md.ents:
    print(f"{ent.text} -> {ent.label_}")


Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
William Shakespeare -> PERSON
Stratford -> GPE


In [None]:
import glob

#We use glob.glob() to retrieve all files matching a specific pattern

# Get a list of all text files in the folder
text_files = glob.glob("*.txt")
texts = []
for file_path in text_files:
    with open(file_path, "r", encoding="utf-8") as file:
        texts.append(file.read())

print(f"Loaded {len(texts)} documents.")

Loaded 5 documents.


In [None]:
#Tokenizing a Text

# Process text using spaCy

for i in texts:
    doc = nlp_md(i)  # Process each text
    print("Tokens for:", i[:50], "...")  # Print only a preview of the text
    for token in doc:
        print(token.text)

Streaming output truncated to the last 5000 lines.
car
.



'
Let
him
be
,
'
Him
said
,
relief
blowing
her
words
into
large
light


bubbles
that
rolled
off
her
tongue
and
floated
effervescently
into


the
orange
air
.
'
He
feels
frightened
by
all
this
-
this
coming
and


going
.
You
know
he
's
not
used
to
it
.
'



Tara
nodded
sadly
.
But
this
was
not
all
that
was
on
her
mind
.


There
was
another
block
,
halting
her
.
She
tried
to
force
her
voice


past
that
block
.
'
Shall
I
tell
Raja
-
?
'



'
Yes
,
'
Him
urged
,
her
voice
flying
,
buoyant
.
'
Tell
him
how
we
're


not
used
to
it
-
Baba
and
I.
Tell
him
we
never
travel
any
more
.
Tell


him
we
could
n't
come
-
but
he
should
come
.
Bring
him
back
with


you
,
Tara
-
or
tell
him
to
come
in
the
winter
.
All
of
them
.
And
he




175




can
see
Sharma
about
the
firm
-
and
settle
things
.
And
see
to
Hyder


Ali
's
old
house
-
and
repair
it
.
Tell
him
I
'm
-
I
'm
waiting
for
him
-1


want
him
to
come
-1
want
to
see
him
.
'


In [None]:
print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} ===== {ent.label_}")

Named Entities:
Anita Desai ===== PERSON
India ===== GPE
1980 ===== DATE
1984 ===== DATE
1999 ===== DATE
the Booker Prize ===== WORK_OF_ART
The Village by the Sea ===== GPE
the Guardian Award for Children's
Fiction ===== WORK_OF_ART
1982 ===== DATE
Anita Desai ===== PERSON
the Royal Society of
Literature ===== ORG
London ===== GPE
the American Academy of Arts ===== ORG
New York ===== GPE
Girton College ===== ORG
the University of
Cambridge ===== ORG
the Writing Program ===== ORG
M.I.T. ===== ORG
India ===== GPE
Boston ===== GPE
Massachusetts ===== GPE
Cambridge ===== GPE
England ===== GPE
Merchant Ivory Productions ===== ORG
46 8 10 975 3
Copyright © ===== DATE
Anita Desai ===== PERSON
1980 ===== DATE
Anita Desai ===== PERSON
Patents Act ===== LAW
1988 ===== DATE
The Waste Land' ===== ORG
'Little Gidding' in Collected Poems ===== WORK_OF_ART
1909
- 1962 ===== DATE
T S Eliot ===== ORG
Faber & Faber Ltd ===== ORG
Harcourt
Brace Jovanovich Inc. ===== ORG
Emily Dickinson's ===== PERSON
Col

In [None]:
import spacy

!python -m spacy download en_core_web_lg

# Load the large spaCy model
nlp_lg = spacy.load("en_core_web_lg")
nlp_lg.max_length = 2000000

# Process a sample sentence
doc_lg = nlp_lg("William Shakespeare was born in Stratford-upon-Avon.")

# Extract and print named entities
for ent in doc_lg.ents:
    print(f"{ent.text} -> {ent.label_}")


Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
William Shakespeare -> PERSON
Stratford -> GPE


In [None]:

for i in texts:
    doc = nlp_lg(i)  # Process each text
    print("Tokens for:", i)
    for token in doc:
      print(token.text)


In [None]:
print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} ===== {ent.label_}")

Named Entities:
day ===== DATE
Anita Desai ===== PERSON
India ===== GPE
1980 ===== DATE
1984 ===== DATE
1999 ===== DATE
the Booker Prize and The Village ===== WORK_OF_ART
the Guardian Award for Children's
Fiction ===== WORK_OF_ART
1982 ===== DATE
Anita Desai ===== PERSON
the Royal Society ===== ORG
London ===== GPE
the American Academy of Arts ===== ORG
New York ===== GPE
Girton College ===== ORG
the University of
Cambridge ===== ORG
the Writing Program at M.I.T. ===== ORG
India ===== GPE
Boston ===== GPE
Massachusetts ===== GPE
Cambridge ===== GPE
England ===== GPE
Merchant Ivory Productions ===== ORG
2001 ===== DATE
46 ===== CARDINAL
10 975 ===== CARDINAL
1980 ===== DATE
Anita Desai ===== PERSON
1988 ===== DATE
The Waste Land' ===== ORG
'Little Gidding' in Collected Poems 1909
- 1962 ===== WORK_OF_ART
Faber & Faber Ltd ===== ORG
Harcourt ===== GPE
Brace Jovanovich Inc. ===== ORG
Emily Dickinson's ===== PERSON
Little, Brown ===== ORG
8c ===== CARDINAL
Faber 8c Faber Ltd ===== ORG
'The

In [None]:
!python -m spacy validate


✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.11/dist-packages/spacy

NAME              SPACY            VERSION
en_core_web_trf   >=3.7.2,<3.8.0   3.7.3   ✔
en_core_web_md    >=3.7.2,<3.8.0   3.7.1   ✔
en_core_web_lg    >=3.7.2,<3.8.0   3.7.1   ✔
en_core_web_sm    >=3.7.2,<3.8.0   3.7.1   ✔



In [None]:
!pip install spacy[transformers]
!python -m spacy download en_core_web_trf


Collecting spacy-transformers<1.4.0,>=1.1.2 (from spacy[transformers])
  Downloading spacy_transformers-1.3.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers<1.4.0,>=1.1.2->spacy[transformers])
  Downloading spacy_alignments-0.9.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.7 kB)
Downloading spacy_transformers-1.3.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (756 kB)
Downloading spacy_alignments-0.9.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (313 kB)
Installing collected packages: spacy-alignments, spacy-transformers
Successfully installed spacy-alignments-0.9.1 spacy-transformers-1

In [None]:
import spacy

# Load the transformer model (import spacy first so the name is defined)
nlp_trf = spacy.load("en_core_web_trf")
nlp_trf.max_length = 2000000