In [32]:
!pip install spacy



In [33]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 71.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [34]:
import spacy

In [35]:
nlp = spacy.load("en_core_web_sm")
# en_core_web_sm is a small English pipeline (tokenizer, tagger, parser, NER, lemmatizer).
# Unlike the larger md/lg packages, it does not ship with static word vectors.

In [36]:
with open("/kaggle/input/wiki-us-txt/wiki_us.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [37]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [38]:
doc = nlp(text)
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [39]:
print(len(text))
print(len(doc))

3521
654


In [40]:
for char in text[0:10]:  # slicing a str yields characters, not tokens
    print(char)

T
h
e
 
U
n
i
t
e
d


In [41]:
for token in doc[0:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


len(text) and len(doc) differ because they count different things. len(text) counts every character in the string, including spaces, punctuation, and special characters, for a total of 3521. len(doc) counts tokens (words, punctuation marks, and other linguistic units) produced by the tokenizer of the "en_core_web_sm" pipeline, for a total of 654. The token count is lower because:

- The tokenizer groups characters into meaningful units (e.g., "United" and "States" become separate tokens) and splits punctuation into standalone tokens. A single space after a token is not a token of its own; it is stored on the preceding token, so no characters are lost.
- The ratio of characters to tokens (approximately 5.4:1) reflects the average token length plus its trailing whitespace, which is typical for English text with varied punctuation.

This discrepancy is a normal outcome of spaCy's processing, which is designed for linguistic analysis rather than raw character counting.
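
The relationship between the two counts can be checked directly. The sketch below is not part of the original notebook: it assumes a tokenizer-only pipeline (`spacy.blank("en")`) and a short sample string instead of `wiki_us.txt`, so that it runs without a downloaded model. It shows that the token count is smaller than the character count, yet every character remains reachable through `text_with_ws`.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for en_core_web_sm.
nlp_blank = spacy.blank("en")

sample = "The United States of America (U.S.A. or USA) is a country."
sample_doc = nlp_blank(sample)

n_chars = len(sample)       # every character, spaces included
n_tokens = len(sample_doc)  # tokens produced by the tokenizer

# Each token keeps its trailing whitespace, so the text can be rebuilt exactly.
rebuilt = "".join(token.text_with_ws for token in sample_doc)

print(n_chars, n_tokens, round(n_chars / n_tokens, 1))
print(rebuilt == sample)
```

The same check should hold for the notebook's `text` and `doc`, since spaCy's tokenization is designed to be non-destructive.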

In [42]:
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known
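
The difference between `str.split()` and spaCy's tokenizer is easiest to see side by side. A minimal sketch, assuming a tokenizer-only pipeline (`spacy.blank("en")`) and a short snippet taken from the first sentence: whitespace splitting leaves punctuation glued to the words, while the tokenizer separates it.

```python
import spacy

# Assumption: tokenizer-only pipeline; no trained model is needed for this comparison.
nlp_blank = spacy.blank("en")

snippet = "America (U.S.A. or USA), commonly known"

# Plain whitespace splitting: punctuation stays attached to the word.
whitespace_tokens = snippet.split()

# spaCy tokenization: brackets and commas become tokens of their own.
spacy_tokens = [token.text for token in nlp_blank(snippet)]

print(whitespace_tokens)
print(spacy_tokens)
```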


## Sentence Boundary Detection

In [43]:
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es
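
In `en_core_web_sm`, sentence boundaries are derived from the dependency parse by default. When only sentence splitting is needed, spaCy also provides a rule-based `sentencizer` component. The sketch below is an alternative to what the notebook does, not a reproduction of it: it assumes a blank pipeline and a made-up two-sentence string, and because it splits on punctuation rules, its boundaries may differ from the output above.

```python
import spacy

# Assumption: blank pipeline plus the rule-based sentencizer, instead of the parser.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

sample_doc = nlp_rules("It consists of 50 states. The national capital is Washington.")

# doc.sents yields Span objects; .text gives the sentence as a string.
sentence_texts = [sent.text for sent in sample_doc.sents]

for sentence in sentence_texts:
    print(sentence)
```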

In [44]:
sentence1 = doc.sents[0]
print(sentence1)

TypeError: 'generator' object is not subscriptable

In [45]:
# doc.sents is a generator, and generator objects are not subscriptable:
# we can iterate over them, but we cannot index into them with [0].
# To index the sentences, convert the generator to a list first.

sentences = list(doc.sents)

sentence1 = sentences[0]

sentence2 = sentences[1]


print(f"Sentence 1: {sentence1} and \n\nSentence 2: {sentence2}")

Sentence 1: The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. and 

Sentence 2: It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
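
The generator behaviour can be reproduced without spaCy. In the sketch below, `sentence_stream` is a hypothetical stand-in for `doc.sents`; it shows the same `TypeError`, and two ways to fetch individual items without building the whole list (`next` and `itertools.islice`), alongside the `list(...)` approach used in the notebook.

```python
from itertools import islice


def sentence_stream():
    """Hypothetical stand-in for doc.sents: yields items one at a time."""
    yield "Sentence one."
    yield "Sentence two."
    yield "Sentence three."


# Indexing a generator fails, exactly as with doc.sents[0].
try:
    sentence_stream()[0]
except TypeError as err:
    error_message = str(err)

# First item only: no need to consume the rest of the generator.
first = next(sentence_stream())

# Item at position 1: skip one item, then take the next.
second = next(islice(sentence_stream(), 1, None))

# Materialize everything when random access is needed.
all_sentences = list(sentence_stream())

print(error_message)
print(first, "|", second, "|", len(all_sentences))
```

Converting to a list is the simplest option for a short document; `next` or `islice` may be preferable when only the first few sentences of a long text are needed.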


In [46]:
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [47]:
token2 = sentence1[2]  # index 2, i.e. the third token of sentence1
print(token2)

States


In [48]:
token2.text #the clean text of the token object

'States'

In [49]:
token2.left_edge  # leftmost token of token2's syntactic subtree

The

In [50]:
token2.right_edge  # rightmost token of token2's syntactic subtree

,

In [51]:
token2.ent_type  # entity type as an integer ID; ent_type_ gives the string label

384

In [52]:
token2.ent_type_  # entity type as a string label
# GPE: geopolitical entity

'GPE'

In [53]:
token2.ent_iob_
# I: Inside of an entity
# B: Beginning of an entity
# O: Outside of an entity

'I'
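
The B/I/O scheme and the paired integer/string attributes (`ent_type` vs `ent_type_`) can be seen on a small example. The sketch below makes two assumptions that differ from the notebook: it uses a blank pipeline, and it marks the entity span by hand with `Doc.set_ents` rather than running the trained NER component.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("The United States is a country")

# Assumption: tokens 0..2 are labeled as one GPE entity manually, not by a model.
sample_doc.set_ents([Span(sample_doc, 0, 3, label="GPE")])

# First entity token is "B", following entity tokens are "I", everything else "O".
iob_tags = [(token.text, token.ent_iob_, token.ent_type_) for token in sample_doc]
for row in iob_tags:
    print(row)

# Attributes without a trailing underscore hold integer IDs;
# the vocabulary's string store maps them back to labels.
states = sample_doc[2]
label_id = states.ent_type
label_name = nlp_blank.vocab.strings[label_id]
print(label_id, label_name)
```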

In [54]:
token2.lemma_
# lemma: base (dictionary) form of the word

'States'

In [55]:
sentence1[12]

known

In [56]:
sentence1[12].lemma_

'know'

In [57]:
token2.morph
# Number=Sing: the token is singular

Number=Sing

In [58]:
sentence1[12].morph

Aspect=Perf|Tense=Past|VerbForm=Part

In [59]:
token2.pos_
# pos_: coarse-grained part-of-speech tag
# PROPN: proper noun

'PROPN'

In [60]:
token2.dep_
# dep_: syntactic dependency relation
# nsubj: nominal subject

'nsubj'

In [61]:
token2.lang_
# en: english

'en'

In [62]:
text = "Mike enjoys playing football"
doc2=nlp(text)
print(doc2)

Mike enjoys playing football


In [63]:
for token in text:
    print(token.text,token.pos_,token.dep_)

AttributeError: 'str' object has no attribute 'text'

In [65]:
for token in doc2:
    print(token.text,token.pos_,token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj


## Visualization

In [66]:
from spacy import displacy
displacy.render(doc2, style="dep")

In [70]:
# displaying the dependencies
displacy.render(doc, style="dep")

In [67]:
for ent in doc.ents:  # iterate over the entities spaCy detected in doc, with their labels
    print(ent.text, ent.label_)

The United States of America GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
1775–1783 CARDINAL
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
Spanish NORP
American War and EVENT
World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vie

In [69]:
# displaying the entities
displacy.render(doc, style="ent")
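
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. A minimal sketch, assuming a blank pipeline with one hand-labeled entity (so no trained model is required); the commented-out file name is a placeholder.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("The capital is Washington")

# Assumption: the entity is set manually instead of being predicted by NER.
sample_doc.set_ents([Span(sample_doc, 3, 4, label="GPE")])

# jupyter=False makes render() return the HTML markup rather than display it.
html = displacy.render(sample_doc, style="ent", jupyter=False)

# Optional: save the markup to a file (placeholder name).
# from pathlib import Path
# Path("entities.html").write_text(html, encoding="utf-8")

print(html[:80])
```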

# Entity Types and Explanations

- **GPE (Geopolitical Entity)**: Countries, cities, and states.
- **LOC (Location)**: Non-GPE locations, such as continents, regions, mountain ranges, and bodies of water.
- **CARDINAL (Cardinal Number)**: Numerals that do not fall under another type, such as plain counts.
- **NORP (Nationalities or Religious/Political Groups)**: Nationalities and religious or political groups.
- **QUANTITY (Quantity)**: Measurements, such as area, weight, or distance.
- **ORDINAL (Ordinal Number)**: Positional numbers, like first, second, or third.
- **DATE (Date)**: Absolute or relative dates and periods.
- **ORG (Organization)**: Companies, agencies, and institutions. In the output above, the model also labels "the American Revolutionary War" and "the American Civil War" as ORG; wars normally belong under EVENT, so these are misclassifications by the small model.
- **EVENT (Event)**: Named events such as wars, battles, hurricanes, and sports events.
- **FAC (Facility)**: Buildings, airports, highways, bridges, and other man-made structures.
- **PERSON (Person)**: Individual people, including fictional ones.
- **PERCENT (Percentage)**: Percentage values, including the "%" sign.

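spaCy's named entity recognizer is a statistical model: it labels the spans it predicts to be entities, here mainly proper nouns, numbers (e.g., "50", labeled CARDINAL), and time references (e.g., "1848", labeled DATE), and leaves ordinary words unlabeled. Its predictions are not always correct; in the output above, for example, the year range "1775–1783" is tagged CARDINAL rather than DATE.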
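
Label descriptions do not have to be memorized: `spacy.explain` returns a short description for many entity labels, part-of-speech tags, and dependency labels. A small sketch; the list of labels is simply a selection from the output above.

```python
import spacy

# Look up a few of the entity labels that appear in the NER output.
for label in ["GPE", "LOC", "NORP", "CARDINAL", "ORG", "EVENT"]:
    print(label, "->", spacy.explain(label))

# The same helper works for dependency labels and POS tags.
nsubj_meaning = spacy.explain("nsubj")
gpe_meaning = spacy.explain("GPE")
print(nsubj_meaning)
```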