In [42]:
!pip install spacy



In [43]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [44]:
import spacy

In [45]:
nlp = spacy.load('en_core_web_sm')

## Task 1: Tokenization

In [46]:
# 1. Write a Python script to tokenize the following text: "The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!"
text = ("""The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!""")
doc = nlp(text)
tokens_spacy = [token.text for token in doc]
print(tokens_spacy)

['The', 'quick', 'brown', 'fox', 'does', "n't", 'jump', 'over', 'the', 'lazy', 'dog', '.', 'Natural', 'Language', 'Processing', 'is', 'fascinating', '!']


#### How does spaCy process the various tokens?:
spaCy splits the text into individual tokens (words, punctuation marks, or parts of contractions). Each token carries attributes, including its text (.text), the base form of the word (.lemma_), grammatical features (.morph), and the word it depends on in the sentence structure (.head).
#### How does spaCy handle punctuation marks like periods and commas?:
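spaCy handles punctuation marks separately, making each one its own token rather than leaving it attached to a word. In the output above, the period after "dog" and the exclamation mark after "fascinating" each appear as their own token.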
#### What happens when the text includes contractions (e.g., "don't")?:
When the text has contractions, spaCy splits them into multiple tokens. For example, "doesn't" is split into "does" and "n't". This is because "n't" is not treated as part of the verb "does" but as a separate negation marker (its lemma is "not").
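
The splitting behavior described above can be imitated with a small standard-library sketch. This is a toy approximation, not spaCy's tokenizer (which relies on language-specific exception tables plus prefix, suffix, and infix rules); the names `TOKEN_PATTERN` and `toy_tokenize` are made up for illustration, and the pattern only handles the "n't" contraction and single punctuation characters.

```python
import re

# Toy pattern (an assumption for illustration, not spaCy's actual rules).
# Alternatives are tried left to right at each position:
#   1. a word stem that is immediately followed by "n't"
#   2. the "n't" piece itself
#   3. any other run of word characters
#   4. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, "n't", and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = ("The quick brown fox doesn't jump over the lazy dog. "
          "Natural Language Processing is fascinating!")
toy_tokens = toy_tokenize(sample)
print(toy_tokens)
```

For this particular sentence the toy pattern yields the same token list that spaCy printed above. Other contractions and possessives (for example "dog's") would need their own rules, which is one reason a real tokenizer keeps a table of exceptions.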

## Task 2: Part-of-Speech Tagging

In [47]:
# 1. Extend your script to include part-of-speech tagging for the tokens.
for token in doc:
    print(f"{token.text} - {token.lemma_}: {token.pos_}")

The - the: DET
quick - quick: ADJ
brown - brown: ADJ
fox - fox: NOUN
does - do: AUX
n't - not: PART
jump - jump: VERB
over - over: ADP
the - the: DET
lazy - lazy: ADJ
dog - dog: NOUN
. - .: PUNCT
Natural - Natural: PROPN
Language - Language: PROPN
Processing - processing: NOUN
is - be: AUX
fascinating - fascinating: ADJ
! - !: PUNCT


#### Identify the POS tags for "quick," "jumps," and "is."
As demonstrated in my code, the POS tag for "quick" is ADJ (adjective) and the tag for "is" is AUX (auxiliary verb). The text contains "jump" rather than "jumps"; it is tagged VERB.
#### Why might POS tagging be useful for tasks like grammar checking or machine translation?
For grammar checking, POS tags can help identify incorrect word usage, such as a sentence that has a noun but no verb. For machine translation, knowing the POS of each word helps disambiguate it (for example, distinguishing "run" as a verb from "run" as a noun), so the sentence can be translated more accurately. Finally, for text analysis more broadly, POS tags can support sentiment analysis, chatbots, and speech recognition by exposing sentence structure.
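
The "noun without a verb" check mentioned above could look roughly like the sketch below. It works on plain (text, POS) pairs so that it runs without a model: `tagged` is copied from the tagger output earlier in this notebook, while `fragment` and its tags are hand-written for illustration (not model output). The helper names are hypothetical, and a real grammar checker would need far more than this single rule.

```python
# (text, POS) pairs copied from the POS output shown earlier in this notebook.
tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("does", "AUX"), ("n't", "PART"), ("jump", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT"),
    ("Natural", "PROPN"), ("Language", "PROPN"), ("Processing", "NOUN"),
    ("is", "AUX"), ("fascinating", "ADJ"), ("!", "PUNCT"),
]

SENTENCE_END = {".", "!", "?"}   # punctuation that closes a sentence
VERBAL_TAGS = {"VERB", "AUX"}    # tags that count as "has a verb"


def split_sentences(pairs):
    """Group (text, POS) pairs into sentences at closing punctuation."""
    sentences, current = [], []
    for text, pos in pairs:
        current.append((text, pos))
        if pos == "PUNCT" and text in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing words with no closing punctuation
        sentences.append(current)
    return sentences


def sentences_missing_verb(pairs):
    """Return the sentences (as strings) that contain no VERB or AUX tag."""
    missing = []
    for sentence in split_sentences(pairs):
        if not any(pos in VERBAL_TAGS for _, pos in sentence):
            missing.append(" ".join(text for text, _ in sentence))
    return missing


# Hand-tagged fragment (an assumption, not spaCy output).
fragment = [("The", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT")]

print(sentences_missing_verb(tagged))    # both sentences contain a verb
print(sentences_missing_verb(fragment))  # the fragment is flagged
```

With spaCy loaded, the same pairs could be built as `[(token.text, token.pos_) for token in doc]`.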

## Task 3: Named Entity Recognition (NER)

In [48]:
# 1. Modify your script to identify named entities in the following text: "Barack Obama was the 44th President of the United States. He was born in Hawaii."
text2 = ("""Barack Obama was the 44th President of the United States. He was born in Hawaii.""")
doc = nlp(text2)
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}({spacy.explain(ent.label_)})")

Barack Obama: PERSON(People, including fictional)
44th: ORDINAL("first", "second", etc.)
the United States: GPE(Countries, cities, states)
Hawaii: GPE(Countries, cities, states)


#### Which entities are recognized by spaCy?
As shown in my code, spaCy identified four entities: Barack Obama as a PERSON, 44th as an ORDINAL, the United States as a GPE (geopolitical entity, according to the readings), and Hawaii as a GPE as well. "President" was not recognized as an entity.
#### What entity types are assigned to "Barack Obama" and "Hawaii"?
Barack Obama is assigned PERSON and Hawaii is assigned GPE (geopolitical entity).


## Task 4: Experimentation

In [49]:
# 1. Write a new sentence or paragraph of your choice and run the spaCy pipeline on it.
text3 = ("""My friends and I saw Deftones in concert this weekend in Indianapolis.""")
doc = nlp(text3)

In [50]:
# Information about tokens and entities:
for token in doc:
    print(f"Text: {token.text}, POS: {token.pos_}, Lemma: {token.lemma_}, Morph: {token.morph}")

for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Text: My, POS: PRON, Lemma: my, Morph: Number=Sing|Person=1|Poss=Yes|PronType=Prs
Text: friends, POS: NOUN, Lemma: friend, Morph: Number=Plur
Text: and, POS: CCONJ, Lemma: and, Morph: ConjType=Cmp
Text: I, POS: PRON, Lemma: I, Morph: Case=Nom|Number=Sing|Person=1|PronType=Prs
Text: saw, POS: VERB, Lemma: see, Morph: Tense=Past|VerbForm=Fin
Text: Deftones, POS: PROPN, Lemma: Deftones, Morph: Number=Sing
Text: in, POS: ADP, Lemma: in, Morph: 
Text: concert, POS: NOUN, Lemma: concert, Morph: Number=Sing
Text: this, POS: DET, Lemma: this, Morph: Number=Sing|PronType=Dem
Text: weekend, POS: NOUN, Lemma: weekend, Morph: Number=Sing
Text: in, POS: ADP, Lemma: in, Morph: 
Text: Indianapolis, POS: PROPN, Lemma: Indianapolis, Morph: Number=Sing
Text: ., POS: PUNCT, Lemma: ., Morph: PunctType=Peri
Entity: Deftones, Label: ORG
Entity: this weekend, Label: DATE
Entity: Indianapolis, Label: GPE


In [51]:
# 2. Experiment with changing words, adding punctuation, or introducing typos.

#Changing words:
text3 = ("""My friends and I went to a Fleshwater concert last night in Chicago.""")
doc = nlp(text3)
for token in doc:
    print(f"Text: {token.text}, POS: {token.pos_}, Lemma: {token.lemma_}, Morph: {token.morph}")

for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Text: My, POS: PRON, Lemma: my, Morph: Number=Sing|Person=1|Poss=Yes|PronType=Prs
Text: friends, POS: NOUN, Lemma: friend, Morph: Number=Plur
Text: and, POS: CCONJ, Lemma: and, Morph: ConjType=Cmp
Text: I, POS: PRON, Lemma: I, Morph: Case=Nom|Number=Sing|Person=1|PronType=Prs
Text: went, POS: VERB, Lemma: go, Morph: Tense=Past|VerbForm=Fin
Text: to, POS: ADP, Lemma: to, Morph: 
Text: a, POS: DET, Lemma: a, Morph: Definite=Ind|PronType=Art
Text: Fleshwater, POS: PROPN, Lemma: Fleshwater, Morph: Number=Sing
Text: concert, POS: NOUN, Lemma: concert, Morph: Number=Sing
Text: last, POS: ADJ, Lemma: last, Morph: Degree=Pos
Text: night, POS: NOUN, Lemma: night, Morph: Number=Sing
Text: in, POS: ADP, Lemma: in, Morph: 
Text: Chicago, POS: PROPN, Lemma: Chicago, Morph: Number=Sing
Text: ., POS: PUNCT, Lemma: ., Morph: PunctType=Peri
Entity: Fleshwater, Label: NORP
Entity: last night, Label: TIME
Entity: Chicago, Label: GPE


In [52]:
#Adding punctuation:
text3 = ("""Wow! My friends and I went to a Deftones concert this weekend in Indianapolis and it was amazing!""")
doc = nlp(text3)
#Token attributes and part of speech tagging:
for token in doc:
    print(f"Text: {token.text}, POS: {token.pos_}, Lemma: {token.lemma_}, Morph: {token.morph}")
#Named entity recognition:
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Text: Wow, POS: INTJ, Lemma: wow, Morph: 
Text: !, POS: PUNCT, Lemma: !, Morph: PunctType=Peri
Text: My, POS: PRON, Lemma: my, Morph: Number=Sing|Person=1|Poss=Yes|PronType=Prs
Text: friends, POS: NOUN, Lemma: friend, Morph: Number=Plur
Text: and, POS: CCONJ, Lemma: and, Morph: ConjType=Cmp
Text: I, POS: PRON, Lemma: I, Morph: Case=Nom|Number=Sing|Person=1|PronType=Prs
Text: went, POS: VERB, Lemma: go, Morph: Tense=Past|VerbForm=Fin
Text: to, POS: ADP, Lemma: to, Morph: 
Text: a, POS: DET, Lemma: a, Morph: Definite=Ind|PronType=Art
Text: Deftones, POS: PROPN, Lemma: Deftones, Morph: Number=Sing
Text: concert, POS: NOUN, Lemma: concert, Morph: Number=Sing
Text: this, POS: DET, Lemma: this, Morph: Number=Sing|PronType=Dem
Text: weekend, POS: NOUN, Lemma: weekend, Morph: Number=Sing
Text: in, POS: ADP, Lemma: in, Morph: 
Text: Indianapolis, POS: PROPN, Lemma: Indianapolis, Morph: Number=Sing
Text: and, POS: CCONJ, Lemma: and, Morph: ConjType=Cmp
Text: it, POS: PRON, Lemma: it, Morph: Case

In [53]:
#Introducing Typos:
text3 = ("""My friends and I went to a Deaftones concert this weeknd in Indiannapolis.""")
doc = nlp(text3)
for token in doc:
    print(f"Text: {token.text}, POS: {token.pos_}, Lemma: {token.lemma_}, Morph: {token.morph}")

for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Text: My, POS: PRON, Lemma: my, Morph: Number=Sing|Person=1|Poss=Yes|PronType=Prs
Text: friends, POS: NOUN, Lemma: friend, Morph: Number=Plur
Text: and, POS: CCONJ, Lemma: and, Morph: ConjType=Cmp
Text: I, POS: PRON, Lemma: I, Morph: Case=Nom|Number=Sing|Person=1|PronType=Prs
Text: went, POS: VERB, Lemma: go, Morph: Tense=Past|VerbForm=Fin
Text: to, POS: ADP, Lemma: to, Morph: 
Text: a, POS: DET, Lemma: a, Morph: Definite=Ind|PronType=Art
Text: Deaftones, POS: PROPN, Lemma: Deaftones, Morph: Number=Sing
Text: concert, POS: NOUN, Lemma: concert, Morph: Number=Sing
Text: this, POS: DET, Lemma: this, Morph: Number=Sing|PronType=Dem
Text: weeknd, POS: NOUN, Lemma: weeknd, Morph: Number=Sing
Text: in, POS: ADP, Lemma: in, Morph: 
Text: Indiannapolis, POS: PROPN, Lemma: Indiannapolis, Morph: Number=Sing
Text: ., POS: PUNCT, Lemma: ., Morph: PunctType=Peri
Entity: Deaftones, Label: NORP
Entity: Indiannapolis, Label: GPE


#### How does spaCy handle your modifications?
#### Did any entities or tags change? Why might this happen?
When I changed some words, spaCy recognized Fleshwater as an entity, but not as an organization (ORG) as it did with the well-established band Deftones. It was instead labeled NORP, which stands for nationalities or religious or political groups. Fleshwater, of course, is none of these; it is a lesser-known, newer band. Chicago is still recognized as a GPE like Indianapolis was, and "last night" is recognized as a TIME rather than a DATE like "this weekend" was.

When I added "!" to the sentence, spaCy still treated the punctuation as a separate token, and its entity recognition stayed the same.

Finally, when I introduced typos in the words Deftones, weekend, and Indianapolis, spaCy still treated each misspelled word as a single token, but it no longer recognized "this weeknd" as a DATE. "Deaftones" was labeled NORP rather than ORG like before. "Indiannapolis" was still recognized as a GPE.

These changes likely happened because spaCy relies on a pre-trained statistical model to recognize named entities, so if a word is uncommon in the training data (like Fleshwater, a lesser-known band name) or misspelled, it may not be labeled as we expect. Grammar, context, and surrounding words also affect how spaCy tags parts of speech and entities, which might explain the differences in recognition when typos were added.
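
One way to make the comparison between runs less manual is to diff the entity labels. The sketch below is a minimal standard-library helper; the entity lists are copied from the outputs printed above, and the function names `label_counts` and `label_changes` are hypothetical.

```python
from collections import Counter

# (entity text, label) pairs copied from the outputs printed above.
original_ents = [("Deftones", "ORG"), ("this weekend", "DATE"), ("Indianapolis", "GPE")]
changed_ents = [("Fleshwater", "NORP"), ("last night", "TIME"), ("Chicago", "GPE")]
typo_ents = [("Deaftones", "NORP"), ("Indiannapolis", "GPE")]


def label_counts(ents):
    """Count how many entities carry each label."""
    return Counter(label for _, label in ents)


def label_changes(before, after):
    """Map each label whose count differs to (count after - count before)."""
    b, a = label_counts(before), label_counts(after)
    return {
        label: a[label] - b[label]
        for label in sorted(set(b) | set(a))
        if a[label] != b[label]
    }


print(label_changes(original_ents, changed_ents))
print(label_changes(original_ents, typo_ents))
```

A negative number means a label was lost relative to the original sentence and a positive number means it appeared. With spaCy loaded, each list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.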