In [56]:
#import spaCy to set up code
import spacy
#load correct spaCy model
nlp = spacy.load("en_core_web_sm")

## Task 1: Tokenization

In [57]:
#bring in the text quote
text = "The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!"
doc = nlp(text)

#loop over the tokens and print each one's text, head, lemma, and morphology
for token in doc:
    print(f"Text: {token.text}")
    print(f"Head: {token.head}")
    print(f"Lemma: {token.lemma_}")
    print(f"Morph: {token.morph}")
    print("")

Text: The
Head: fox
Lemma: the
Morph: Definite=Def|PronType=Art

Text: quick
Head: fox
Lemma: quick
Morph: Degree=Pos

Text: brown
Head: fox
Lemma: brown
Morph: Degree=Pos

Text: fox
Head: jump
Lemma: fox
Morph: Number=Sing

Text: does
Head: jump
Lemma: do
Morph: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin

Text: n't
Head: jump
Lemma: not
Morph: Polarity=Neg

Text: jump
Head: jump
Lemma: jump
Morph: VerbForm=Inf

Text: over
Head: jump
Lemma: over
Morph: 

Text: the
Head: dog
Lemma: the
Morph: Definite=Def|PronType=Art

Text: lazy
Head: dog
Lemma: lazy
Morph: Degree=Pos

Text: dog
Head: over
Lemma: dog
Morph: Number=Sing

Text: .
Head: jump
Lemma: .
Morph: PunctType=Peri

Text: Natural
Head: Language
Lemma: Natural
Morph: Number=Sing

Text: Language
Head: Processing
Lemma: Language
Morph: Number=Sing

Text: Processing
Head: is
Lemma: processing
Morph: Number=Sing

Text: is
Head: is
Lemma: be
Morph: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin

Text: fascinating
Head: 

1. spaCy processes the text by converting it into a Doc container, which splits the text into individual tokens; each token is self-contained and carries its own syntactic information.
2. spaCy treats punctuation marks as individual tokens.
3. When the text includes contractions, spaCy separates them into two tokens because a contraction represents two words. For instance, "don't" is separated into "do" and "n't", which represent "do" and "not".
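
To make the three observations above concrete, here is a minimal sketch of a toy tokenizer that reproduces the same splits for the first sentence. It is only an illustration built on a hand-written regular expression; `toy_tokenize` and `TOKEN_PATTERN` are hypothetical names, and this is not how spaCy's tokenizer is actually implemented.

```python
import re

# Toy pattern; the alternatives are tried left to right at each position:
#   \w+(?=n't)  a word stem directly before "n't" (e.g. "does" in "doesn't")
#   n't         the negation piece of a contraction
#   's          a possessive or contracted "is"
#   \w+         any other run of word characters
#   [^\w\s]     any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'s|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, contraction, and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


tokens = toy_tokenize("The quick brown fox doesn't jump over the lazy dog.")
print(tokens)
```

For this sentence the sketch yields the same twelve tokens that spaCy printed above, including "does" and "n't" as separate tokens and the final period on its own. Real text has many cases this pattern does not handle, which is why a library tokenizer is the better choice in practice.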

## Task 2: Part-of-Speech Tagging

In [None]:
#goes through each token to look at its part of speech and fine-grained tag
for token in doc:
    print(f"Text: {token.text}")
    print(f"POS: {token.pos_}")
    print(f"Tag: {token.tag_}")
    print("")

Text: The
POS: DET
Tag: DT

Text: quick
POS: ADJ
Tag: JJ

Text: brown
POS: ADJ
Tag: JJ

Text: fox
POS: NOUN
Tag: NN

Text: does
POS: AUX
Tag: VBZ

Text: n't
POS: PART
Tag: RB

Text: jump
POS: VERB
Tag: VB

Text: over
POS: ADP
Tag: IN

Text: the
POS: DET
Tag: DT

Text: lazy
POS: ADJ
Tag: JJ

Text: dog
POS: NOUN
Tag: NN

Text: .
POS: PUNCT
Tag: .

Text: Natural
POS: PROPN
Tag: NNP

Text: Language
POS: PROPN
Tag: NNP

Text: Processing
POS: NOUN
Tag: NN

Text: is
POS: AUX
Tag: VBZ

Text: fascinating
POS: ADJ
Tag: JJ

Text: !
POS: PUNCT
Tag: .



1. The POS tag for "quick" is ADJ, the POS tag for "jump" is VERB, and the POS tag for "is" is AUX.
2. POS tagging can be useful for grammar checking or machine translation because it helps detect when a word is used incorrectly, identify correct sentence structure, and improve an AI system's understanding of how words are used together.
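
As a rough illustration of the grammar-checking idea, the sketch below applies one toy rule to a list of (text, POS) pairs: a determiner should be followed by a noun, with any number of adjectives in between. The function name `find_dangling_determiners` is hypothetical, the first list is copied from the output above, and the second list is a hand-written example with assumed tags rather than real spaCy output.

```python
def find_dangling_determiners(tagged):
    """Return (index, text) for each determiner not followed by a noun.

    Adjectives between the determiner and the noun are skipped.
    """
    problems = []
    for i, (text, pos) in enumerate(tagged):
        if pos != "DET":
            continue
        j = i + 1
        # Skip over any adjectives that modify the upcoming noun.
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        if j >= len(tagged) or tagged[j][1] not in {"NOUN", "PROPN"}:
            problems.append((i, text))
    return problems


# (text, POS) pairs for the first sentence, taken from the output above.
fox_sentence = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("does", "AUX"), ("n't", "PART"), ("jump", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT"),
]

# Hand-written broken phrase with assumed tags (not produced by spaCy).
broken_phrase = [("The", "DET"), ("quick", "ADJ"), ("jump", "VERB")]

print(find_dangling_determiners(fox_sentence))
print(find_dangling_determiners(broken_phrase))
```

With a real Doc, the input could be built as `[(token.text, token.pos_) for token in doc]`. A single rule like this is far from a real grammar checker, but it shows how POS tags turn a question about word usage into a simple pattern check.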

## Task 3: Named Entity Recognition (NER)

In [None]:
#brings in the obama text
text2 = "Barack Obama was the 44th President of the United States. He was born in Hawaii."
doc2 = nlp(text2)

#goes through the ents in the doc
for ent in doc2.ents:
    print(ent.text, ent.label_)


Barack Obama PERSON
44th ORDINAL
the United States GPE
Hawaii GPE


1. "Barack Obama", "44th", "the United States", and "Hawaii" are all identified as named entities.
2. "Barack Obama" is assigned the entity type PERSON, while "Hawaii" is assigned the entity type GPE (geopolitical entity).
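
When a text has more entities, it can help to group them by type instead of reading them one per line. The following is a minimal sketch that does this for the (text, label) pairs printed above; `group_entities_by_label` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict


def group_entities_by_label(entities):
    """Group entity texts under their label, keeping the original order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# (text, label) pairs copied from the NER output above.
entities = [
    ("Barack Obama", "PERSON"),
    ("44th", "ORDINAL"),
    ("the United States", "GPE"),
    ("Hawaii", "GPE"),
]

grouped = group_entities_by_label(entities)
for label, texts in grouped.items():
    print(label, texts)
```

With a real Doc, the same pairs could be collected as `[(ent.text, ent.label_) for ent in doc2.ents]`.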

## Task 4: Experimentation

In [60]:
#text to try
text3 = "Brynn likes to code. Her coding partners are Jack and Cecilia."
text4 = "Coding is Brynn's favorite subject! Jack and Ceci are her coding partners!"
text5 = "brynn enjoys her favorite subject, coding, with her coding partners, jack and cecilia."

doc3 = nlp(text3)
doc4 = nlp(text4)
doc5 = nlp(text5)

#ents tests, trial 1
print("Trial 1:")
print("Doc 3:")
for ent in doc3.ents:
    print(ent.text, ent.label_)
print("Doc 4:")
for ent in doc4.ents:
    print(ent.text, ent.label_)
print("Doc 5:")
for ent in doc5.ents:
    print(ent.text, ent.label_)

Trial 1:
Doc 3:
Brynn ORG
Jack PERSON
Cecilia GPE
Doc 4:
Brynn ORG
Jack PERSON
Doc 5:
brynn enjoys PERSON
jack PERSON


In [61]:
#tag test, trial 2
print("Trial 2:")
print("Doc 3:")
for token in doc3:
    print(token.text, token.tag_)
print("Doc 4:")
for token in doc4:
    print(token.text, token.tag_)
print("Doc 5:")
for token in doc5:
    print(token.text, token.tag_)


Trial 2:
Doc 3:
Brynn NNP
likes VBZ
to TO
code VB
. .
Her PRP$
coding VBG
partners NNS
are VBP
Jack NNP
and CC
Cecilia NNP
. .
Doc 4:
Coding NN
is VBZ
Brynn NNP
's POS
favorite JJ
subject NN
! .
Jack NNP
and CC
Ceci NNP
are VBP
her PRP$
coding VBG
partners NNS
! .
Doc 5:
brynn NNP
enjoys VBZ
her PRP$
favorite JJ
subject NN
, ,
coding NN
, ,
with IN
her PRP$
coding VBG
partners NNS
, ,
jack NN
and CC
cecilia NN
. .


In [62]:
#dependency head test, trial 3
print("Trial 3:")
print("Doc 3:")
for token in doc3:
    print(token.text, token.head)
print("Doc 4:")
for token in doc4:
    print(token.text, token.head)
print("Doc 5:")
for token in doc5:
    print(token.text, token.head)


Trial 3:
Doc 3:
Brynn likes
likes likes
to code
code likes
. likes
Her partners
coding partners
partners are
are are
Jack are
and Jack
Cecilia Jack
. are
Doc 4:
Coding is
is is
Brynn subject
's Brynn
favorite subject
subject is
! is
Jack are
and Jack
Ceci Jack
are are
her partners
coding partners
partners are
! are
Doc 5:
brynn enjoys
enjoys enjoys
her subject
favorite subject
subject enjoys
, subject
coding subject
, coding
with coding
her partners
coding partners
partners with
, partners
jack partners
and jack
cecilia jack
. enjoys


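Trial 1: The model had difficulty recognizing the names, especially in lowercase. It labeled Brynn as ORG in Docs 3 and 4 and Cecilia as GPE in Doc 3 instead of PERSON, and it did not pick out Ceci as an entity at all. In Doc 5 it treated "brynn enjoys" as a single PERSON entity, even though "enjoys" is a verb, and it missed cecilia entirely. Only Jack was labeled PERSON in all three texts.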

Trial 2: The tags were much more consistent across the three texts, and most words received the same tag in each. The biggest difference came from the lowercase text: in Doc 5, jack and cecilia were tagged NN (common noun) instead of NNP (proper noun), although brynn was still tagged NNP.
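
The Trial 2 comparison can also be done programmatically. The sketch below takes the name tokens and their tags from the Trial 2 output and reports only the names whose tag changes between texts, ignoring case. `tag_differences` is a hypothetical helper written for this comparison.

```python
def tag_differences(tagged_docs):
    """Map each lowercased token to its tag per doc, keeping only tokens whose tags disagree.

    If a token occurs more than once in the same doc, the last tag seen wins.
    """
    seen = {}
    for doc_name, pairs in tagged_docs.items():
        for text, tag in pairs:
            seen.setdefault(text.lower(), {})[doc_name] = tag
    return {word: tags for word, tags in seen.items() if len(set(tags.values())) > 1}


# Name tokens and their tags, copied from the Trial 2 output.
name_tags = {
    "Doc 3": [("Brynn", "NNP"), ("Jack", "NNP"), ("Cecilia", "NNP")],
    "Doc 4": [("Brynn", "NNP"), ("Jack", "NNP"), ("Ceci", "NNP")],
    "Doc 5": [("brynn", "NNP"), ("jack", "NN"), ("cecilia", "NN")],
}

differences = tag_differences(name_tags)
for word, tags in differences.items():
    print(word, tags)
```

Only jack and cecilia are reported, since brynn keeps the NNP tag in all three texts and Ceci appears in just one of them.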

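Trial 3: I wrote the sentences differently, so even though they say essentially the same thing, their syntax differs considerably, and the dependency heads reflect this. For example, Doc 3 and Doc 4 each have two root verbs (one per sentence), while Doc 5 has a single root, "enjoys".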
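
One way to read the head listings from Trial 3 is to look for tokens that are their own head, since that is how spaCy marks the root of a sentence. The sketch below does this on the (token, head) pairs printed for Doc 3 and Doc 5. `find_roots` is a hypothetical helper, and comparing by text alone is a simplification that could misfire if a word's head were a different occurrence of the same word.

```python
def find_roots(head_pairs):
    """Return tokens whose head text equals their own text (treated here as sentence roots)."""
    return [text for text, head in head_pairs if text == head]


# (token, head) pairs copied from the Trial 3 output.
doc3_heads = [
    ("Brynn", "likes"), ("likes", "likes"), ("to", "code"), ("code", "likes"),
    (".", "likes"), ("Her", "partners"), ("coding", "partners"),
    ("partners", "are"), ("are", "are"), ("Jack", "are"), ("and", "Jack"),
    ("Cecilia", "Jack"), (".", "are"),
]
doc5_heads = [
    ("brynn", "enjoys"), ("enjoys", "enjoys"), ("her", "subject"),
    ("favorite", "subject"), ("subject", "enjoys"), (",", "subject"),
    ("coding", "subject"), (",", "coding"), ("with", "coding"),
    ("her", "partners"), ("coding", "partners"), ("partners", "with"),
    (",", "partners"), ("jack", "partners"), ("and", "jack"),
    ("cecilia", "jack"), (".", "enjoys"),
]

print("Doc 3 roots:", find_roots(doc3_heads))
print("Doc 5 roots:", find_roots(doc5_heads))
```

When working with real Token objects rather than printed text, checking `token.dep_ == "ROOT"` is likely the more reliable way to find roots.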
