Playground Used to Test spaCy

spaCy overview:

Designed for production use to understand large volumes of text. Can be used to build information extraction or NL understanding systems, or to pre-process text for deep learning.

Spacy text blop allows for sentiment analysis.

Used to print the text

In [4]:
# Import spaCy
import spacy

# Create the English nlp object
nlp = spacy.blank("en")

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


Part 2:

In [5]:
# Import spaCy and create the English nlp object
import spacy

nlp = spacy.blank("en")

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


Part3:

In [8]:
# Import spaCy and create the English nlp object
import spacy

nlp = spacy.blank("en")

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


Finding percent signs in text

In [9]:
import spacy

nlp = spacy.blank("en")

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4


Loading Pipeline

In [10]:
import spacy

# Load the "en_core_web_sm" pipeline
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


Using the Pipeline

In [11]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      ccomp     
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [12]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


You can get missing entities

In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X


Messing with the matcher.

In [14]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


In [15]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


Creating a doc and inserting spaces in desired areas

In [16]:
import spacy

nlp = spacy.blank("en")

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [17]:
import spacy

nlp = spacy.blank("en")

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!


spaCy with textblob for sentiment analysis

In [20]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
text = "I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy."
nlp.add_pipe("spacytextblob")
doc = nlp(text)

print(doc._.blob.polarity)
# -0.125

print(doc._.blob.subjectivity)
# 0.9

print(doc._.blob.sentiment_assessments.assessments)
# [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]

-0.125
0.9
[(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]


Another example of them being used together
Code grabbed from this tutorial: https://importsem.com/evaluate-sentiment-analysis-in-bulk-with-spacy-and-python/

In [1]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
import pandas as pd
from bs4 import BeautifulSoup
import requests

# Loading the two pipelines. The first is for spacy NLP and the second is for
# textblob sentiment analysis
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

# Send in our urls. Then create empty lists for the scores
df = pd.read_csv("urls.csv")
urls = df["Address"].tolist()
url_sent_score = []
url_sent_label = []
total_pos = []
total_neg = []

# Interate through the URL list
for count, x in enumerate(urls):
    url = x

    # user-agent is used to help with bot blocking
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    res = requests.get(url,headers=headers)
    html_page = res.text

    #beautiful soup is used for parsing
    soup = BeautifulSoup(html_page, 'html.parser')
    for script in soup(["script", "style","meta","label","header","footer"]):
        script.decompose()
        page_text = (soup.get_text()).lower()
        page_text = page_text.strip().replace("  ","") #Getting rid of white spaces
        page_text = "".join([s for s in page_text.splitlines(True) if s.strip("\r\n")])

        # Loading into spacy
        doc = nlp(page_text)
        sentiment = doc._.blob.polarity
        sentiment = round(sentiment,2)

        # Setting labels
        if sentiment > 0:
          sent_label = "Positive"
        else:
          sent_label = "Negative"

        url_sent_label.append(sent_label)
        url_sent_score.append(sentiment)

# Empty lists to store words
positive_words = []
negative_words = []

# This loops through a tuple object with the word, polarity (sentiment score), and subjectivity
for x in doc._.blob.sentiment_assessments.assessments:
  if x[1] > 0: # Evaluates the score (second item in tuple)
    positive_words.append(x[0][0])
  elif x[1] < 0:
    negative_words.append(x[0][0])
  else:
    pass

# Removes duplicates and creates a single long string
total_pos.append(', '.join(set(positive_words)))
total_neg.append(', '.join(set(negative_words)))

# Attatching everything to the dataframe
# Had to alter this part. Didn't originally have the pd.Series, only the stuff in ()
df["Sentiment Score"] = pd.Series(url_sent_score)
df["Sentiment Label"] = pd.Series(url_sent_label)
df["Positive Words"] = pd.Series(total_pos)
df["Negative Words"] = pd.Series(total_neg)

#optional export to CSV
df.to_csv("sentiment.csv")
df

Unnamed: 0,Address,Sentiment Score,Sentiment Label,Positive Words,Negative Words
0,https://www.nbcnews.com/news/us-news/ohio-dera...,0.09,Positive,"striking, legal, successfully, large, able, ne...","dead, down, killed, terrible"
1,https://www.cnn.com/2023/02/28/health/moderate...,0.09,Positive,,
2,https://abcnews.go.com/US/chicago-cop-shot-kil...,0.09,Positive,,


Another tutorial on spacy + textblob. Tutorial followed from youtube video: https://www.youtube.com/watch?v=6bg-TNoT5_Y&ab_channel=JCharisTech

In [4]:
#Load NLP
import spacy
nlp = spacy.load("en_core_web_sm")

In [3]:
# Explore NLP pipeline components
nlp.components # can also use nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x2c03d9c2620>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x2c03d9c2b00>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x2c03d9b0200>),
 ('senter', <spacy.pipeline.senter.SentenceRecognizer at 0x2c03d84e680>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x2c03d8b35c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x2c03db75e80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2c03d9b03c0>),
 ('spacytextblob',
  <spacytextblob.spacytextblob.SpacyTextBlob at 0x2c03db6bd30>)]

In [5]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x2c040a427a0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x2c040a41840>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x2c040996dc0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x2c0408e0680>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x2c0408dd200>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2c0409967a0>)]

# Using spacy textblob: Sentiment Analysis using textblob

In [6]:
from spacytextblob.spacytextblob import SpacyTextBlob

In [7]:
# Adding SpacyTextblob to NLP Pipeline
nlp.add_pipe("spacytextblob")

<spacytextblob.spacytextblob.SpacyTextBlob at 0x2c03da65150>

In [8]:
# Recheck pipeline
# Should show textblob at the end
nlp.components

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x2c040a427a0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x2c040a41840>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x2c040996dc0>),
 ('senter', <spacy.pipeline.senter.SentenceRecognizer at 0x2c02f3b8640>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x2c0408e0680>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x2c0408dd200>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2c0409967a0>),
 ('spacytextblob',
  <spacytextblob.spacytextblob.SpacyTextBlob at 0x2c03da65150>)]

In [20]:
mytext = "John love's eating apples when he works at Apple"

In [21]:
docx = nlp(mytext)

In [22]:
# Check Sentiment Polarity
docx._.polarity

0.5

In [24]:
# Check for subjectivity
docx._.subjectivity

0.6

In [25]:
# Check assessment: list polarity/subject for the assessed token
docx._.assessments

[(['love'], 0.5, 0.6, None)]

In [1]:
number = 5
number

5