# Basics

In [1]:
import spacy

## nlp objects

In [2]:
nlp = spacy.blank("en")

## Docs and tokens

In [3]:
doc = nlp("Hello, world!")

for token in doc:
  print(token.text)

Hello
,
world
!


## Indexing Tokens

In [4]:
# index into the doc to get a single token
token1 = doc[1]
print(token1.text)

,


The `nlp` object tokenizes its input automatically. Tokenization here means splitting the text into words and separating them from punctuation and other non-word characters.
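
As a small sketch of what the tokenizer keeps: each token stores the whitespace that followed it in `text_with_ws`, so joining the tokens should give back the original string. The variable names `text`, `tokens` and `rebuilt` are illustrative and not part of the notebook.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")

text = "Hello, world!"
doc = nlp(text)

# Punctuation is split off into separate tokens.
tokens = [token.text for token in doc]
print(tokens)

# Each token remembers the whitespace that followed it, so the
# original string can be rebuilt from the tokens.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```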

## Spans

In [5]:
# a span object is spacy's term for a slice of a doc

span = doc[1:3]

print(span.text)

, world
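
A span also records where it sits in the parent doc. The sketch below repeats the slice from the cell above and inspects `span.start`, `span.end` and `len(span)`. It is an illustration only and rebuilds `nlp` and `doc` so that it runs on its own.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello, world!")

# Slicing uses token positions; the end index is exclusive, as with lists.
span = doc[1:3]

print(span.start, span.end)  # token offsets of the slice within the doc
print(len(span))             # number of tokens in the span
print([token.text for token in span])
```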


## Lexical Attributes

In [6]:
doc1 = nlp("It costs $5.")

# .i is the token's index within its parent doc
print("Index:   ", [token.i for token in doc1])
print("Text:    ", [token.text for token in doc1])

print("is_alpha:", [token.is_alpha for token in doc1])
print("is_punct:", [token.is_punct for token in doc1])
print("like_num:", [token.like_num for token in doc1])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]


As the output above shows, each token carries attributes, which we can arrange into a table.
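
One way to lay the attributes out as a table, sketched with plain Python rather than a dataframe library. The `columns` and `rows` names are illustrative; the attribute names are the ones used in the cell above.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("It costs $5.")

# One row per token, one column per attribute.
columns = ["i", "text", "is_alpha", "is_punct", "like_num"]
rows = [
    {name: getattr(token, name) for name in columns}
    for token in doc
]

# Print a simple fixed-width table: a header line, then one line per token.
print("".join(f"{name:<10}" for name in columns))
for row in rows:
    print("".join(f"{str(row[name]):<10}" for name in columns))
```

The same `rows` list could be passed to a dataframe constructor if a richer table is wanted.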

Import example talkback dataset

In [7]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
# read the whole transcript into one string; the with-block closes the file afterwards
with open('/content/drive/My Drive/Data Science/Spring 2024/ANLP/AT1 dataset_AusRadioTalkback/COMNE3-plain.txt', 'r') as f:
  COMNE3 = f.read()
print(COMNE3)

 We are Talking Real Estate on eight-eighty-two Six P R thank you Len and good morning to Craig Turnbull from Aspire how are you mate.
 Good morning Harvey and good morning Perth the answer is fantastic.
 Fantastic that's very very good indeed nice uh day it'll be a bit later on to uh go out 'n' have a look at um well first of all the uh Eagles probably whacking the Hawks unfortunately   but also to go 'n' have a look at real estate today.
 Yeah great day to be uh in real estate ab absolutely I think the sun'll be out and uh not too hot and uh good time to be out there uh there's not that much on the market right now Harvey uh only about seven-thousand statistih statistically speaking and that's quite a lot less than than what's normally uh the case normally luh in a inverted commas average market there's about twelve-thousand 'n' what it means is there's a lot of competition for uh the good property out there. Uh lotta my mentoring students 'n' people that I that we deal with are sayi

In [9]:
nlp2 = spacy.blank("en")

COMNE3_doc = nlp2(COMNE3)

print(COMNE3_doc)

 We are Talking Real Estate on eight-eighty-two Six P R thank you Len and good morning to Craig Turnbull from Aspire how are you mate.
 Good morning Harvey and good morning Perth the answer is fantastic.
 Fantastic that's very very good indeed nice uh day it'll be a bit later on to uh go out 'n' have a look at um well first of all the uh Eagles probably whacking the Hawks unfortunately   but also to go 'n' have a look at real estate today.
 Yeah great day to be uh in real estate ab absolutely I think the sun'll be out and uh not too hot and uh good time to be out there uh there's not that much on the market right now Harvey uh only about seven-thousand statistih statistically speaking and that's quite a lot less than than what's normally uh the case normally luh in a inverted commas average market there's about twelve-thousand 'n' what it means is there's a lot of competition for uh the good property out there. Uh lotta my mentoring students 'n' people that I that we deal with are sayi

In [10]:
# the transcript starts with a space, which becomes token 0, so index 1 is the first word
token1 = COMNE3_doc[1]
print(token1.text)

We


## Lexical attributes: scan for numbers

In [11]:
nlp = spacy.blank("en")

doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

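for token in doc:
  # the bounds check avoids an IndexError when a number is the last token
  if token.like_num and token.i + 1 < len(doc):
    next_token = doc[token.i + 1]
    if next_token.text == "%":
      print("Percentage found:", token.text)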

Percentage found: 60
Percentage found: 4


In [12]:
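for token in COMNE3_doc:
  # the bounds check avoids an IndexError when a number is the last token
  if token.like_num and token.i + 1 < len(COMNE3_doc):
    next_token = COMNE3_doc[token.i + 1]
    if next_token.text == "%":
      print("Percentage found:", token.text)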
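
The cell above produces no output in the notebook. In the excerpt printed earlier, numbers are spelled out (for example `seven-thousand`) and no `%` sign appears, which may be the reason. As a hedged sketch, `like_num` can also be used on its own to list number-like tokens, since it is meant to cover number words such as "ten" as well as digits. `find_number_like` is a hypothetical helper, not part of spaCy.

```python
import spacy

nlp = spacy.blank("en")

def find_number_like(doc):
    """Return the text of every token that spaCy flags as number-like."""
    return [token.text for token in doc if token.like_num]

# Phrase taken from the transcript printed earlier.
sample = nlp("there's about twelve-thousand 'n' what it means")
print(find_number_like(sample))
```

On the full transcript, the call would be `find_number_like(COMNE3_doc)`.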

# Trained Pipelines

In [13]:
nlp_pl = spacy.load("en_core_web_sm")

In [14]:
COMNE3_docX = nlp_pl(COMNE3)

In [15]:
for token in COMNE3_docX:
  # print text and predicted part-of-speech tag
  print(token.text, token.pos_)

Streaming output truncated to the last 5000 lines.
forward ADP
proposals NOUN
to PART
build VERB
a DET
marina NOUN
there ADV
and CCONJ
possibly ADV
c VERB
uh INTJ
canals NOUN
I PRON
think VERB
. PUNCT

  SPACE
Sorry INTJ
now ADV
I PRON
know VERB
where SCONJ
you PRON
're AUX
talking VERB
about ADP
yeah INTJ
  SPACE
look VERB
look VERB
I PRON
I PRON
have AUX
heard VERB
of ADP
it PRON
but CCONJ
they PRON
're AUX
facing VERB
some DET
stiff ADJ
opposition NOUN
from ADP
the DET
locals NOUN
  SPACE
uh INTJ
environmentals NOUN
um INTJ
and CCONJ
so ADV
on ADV
but CCONJ
uh INTJ
there PRON
's VERB
a DET
big ADJ
push NOUN
behind ADP
that DET
particular ADJ
section NOUN
and CCONJ
in ADP
time NOUN
it PRON
it PRON
will AUX
happen VERB
I PRON
think VERB
it PRON
should AUX
happen AUX
given VERB
that SCONJ
they PRON
can AUX
take VERB
care NOUN
of ADP
all DET
the DET
environmental ADJ
side NOUN
of ADP
things NOUN
  SPACE
um INTJ
I PRON
think VERB
it PRON
would AUX
really ADV
help VERB
uh IN
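
Printing every token overflows the notebook's output limit, so a frequency summary may be easier to read. The sketch below counts coarse part-of-speech tags with `collections.Counter`. `pos_counts` is a hypothetical helper, and the hand-built `demo_doc` is an assumption that stands in for `COMNE3_docX` so the example runs without a trained pipeline (it relies on the `pos` argument of `Doc` available in spaCy v3).

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

def pos_counts(doc):
    """Count how often each coarse part-of-speech tag occurs in a Doc."""
    return Counter(token.pos_ for token in doc)

# Assumption: a hand-built Doc stands in for the output of nlp_pl(COMNE3).
# Words and tags are copied from the first lines of the output above.
vocab = spacy.blank("en").vocab
demo_doc = Doc(
    vocab,
    words=["forward", "proposals", "to", "build", "a", "marina"],
    pos=["ADP", "NOUN", "PART", "VERB", "DET", "NOUN"],
)

counts = pos_counts(demo_doc)
for tag, n in counts.most_common():
    print(tag, n)
```

In the notebook itself, the call would be `pos_counts(COMNE3_docX)`.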

In spaCy, attributes that return strings usually end in an underscore (for example `token.pos_`), whereas the same attribute without the underscore (`token.pos`) returns an integer ID.
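
A minimal sketch of the underscore convention, using the lexical attribute `orth` so that it runs on a blank pipeline without a trained model. The same pattern is expected to apply to `pos` and `pos_` once a trained pipeline is loaded.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello, world!")
token = doc[0]

# Without the underscore: an integer ID (a hash of the string).
print(type(token.orth).__name__)
# With the underscore: the string itself.
print(token.orth_)

# The vocabulary's string store maps the ID back to the string.
print(nlp.vocab.strings[token.orth] == token.orth_)
```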

## Predicting syntactic dependencies

In [16]:
COMNE3_docX

 We are Talking Real Estate on eight-eighty-two Six P R thank you Len and good morning to Craig Turnbull from Aspire how are you mate.
 Good morning Harvey and good morning Perth the answer is fantastic.
 Fantastic that's very very good indeed nice uh day it'll be a bit later on to uh go out 'n' have a look at um well first of all the uh Eagles probably whacking the Hawks unfortunately   but also to go 'n' have a look at real estate today.
 Yeah great day to be uh in real estate ab absolutely I think the sun'll be out and uh not too hot and uh good time to be out there uh there's not that much on the market right now Harvey uh only about seven-thousand statistih statistically speaking and that's quite a lot less than than what's normally uh the case normally luh in a inverted commas average market there's about twelve-thousand 'n' what it means is there's a lot of competition for uh the good property out there. Uh lotta my mentoring students 'n' people that I that we deal with are sayi