
# **Text Representation and Rule-based Matching**


# **Practice Solution**

# **Vector Representation of Text Data**


Consider the following texts to use in this practice session:

1. I prefer the morning flight through Denmark.

2. The infrastructure of our school is wonderful.

3. Review 1: This movie is very scary and long. Review 2: This movie is not scary and is slow. Review 3: This movie is spooky and good.

4. Do not put rotten mangoes and sweet oranges together.

In [1]:
# Import spaCy, a popular NLP library.
import spacy

# Import collections, which provides specialized containers such as defaultdict and Counter.
import collections

# Import the type hints Dict, List, and Tuple, which are used to annotate function signatures.
from typing import Dict, List, Tuple

## **Task 1**

Represent text 3 in vector form using a bag-of-words (BoW) model. Provide the following for the words 'review' and 'scary':

a) BoW vector representation

b) Dictionary values

In [10]:
def text2bow(words: List[str], dictionary: Dict[str, int]) -> List[Tuple[int, int]]:
    """Converts a tokenized text to a bag-of-words (BoW) representation.

    Args:
      words: A list of strings representing the words in the text.
      dictionary: A dictionary mapping words to integer IDs. It is updated
        in place with any word that has not been seen before.

    Returns:
      A list of tuples, where each tuple contains a word ID and its frequency.
    """
    word_frequencies = collections.defaultdict(int)

    for word in words:
        if word not in dictionary:                     # first time this word is seen
            dictionary[word] = len(dictionary)         # assign the next free integer ID

        word_frequencies[dictionary[word]] += 1        # count the occurrence under the word's ID

    return list(word_frequencies.items())              # (word ID, frequency) pairs


sample_text = 'Review 1: This movie is very scary and long. Review 2: This movie is not scary and is slow. Review 3: This movie is spooky and good.'
dictionary = {}
print(text2bow(sample_text.split(), dictionary))

[(0, 3), (1, 1), (2, 3), (3, 3), (4, 4), (5, 1), (6, 2), (7, 3), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]


In [12]:
print("input text: \n{}".format(sample_text))

print("\nDictionary: \n{}".format(dictionary))

input text: 
Review 1: This movie is very scary and long. Review 2: This movie is not scary and is slow. Review 3: This movie is spooky and good.

Dictionary: 
{'Review': 0, '1:': 1, 'This': 2, 'movie': 3, 'is': 4, 'very': 5, 'scary': 6, 'and': 7, 'long.': 8, '2:': 9, 'not': 10, 'slow.': 11, '3:': 12, 'spooky': 13, 'good.': 14}
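To answer parts (a) and (b) for individual words, a word's ID can be read from the dictionary and then used to find its count in the BoW output. The following is a minimal sketch that restates `text2bow` so it runs on its own; `bow_counts` is a name introduced here for illustration. Because the text is split on whitespace without lowercasing, the dictionary key is 'Review' rather than 'review'.

```python
import collections
from typing import Dict, List, Tuple


def text2bow(words: List[str], dictionary: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return (word ID, frequency) pairs, adding unseen words to `dictionary`."""
    word_frequencies = collections.defaultdict(int)
    for word in words:
        if word not in dictionary:
            dictionary[word] = len(dictionary)
        word_frequencies[dictionary[word]] += 1
    return list(word_frequencies.items())


sample_text = (
    "Review 1: This movie is very scary and long. "
    "Review 2: This movie is not scary and is slow. "
    "Review 3: This movie is spooky and good."
)
dictionary: Dict[str, int] = {}

# bow_counts (illustrative name) maps each word ID to its frequency.
bow_counts = dict(text2bow(sample_text.split(), dictionary))

for word in ("Review", "scary"):
    word_id = dictionary[word]
    print(f"{word!r}: dictionary value = {word_id}, BoW entry = ({word_id}, {bow_counts[word_id]})")
```

With the whitespace tokenization used above, 'Review' maps to ID 0 with a count of 3, and 'scary' maps to ID 6 with a count of 2, which agrees with the printed BoW list.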


## **Task 2**

For the text "Hey, Siri! Hey Siri!", how would you define patterns and implement token-based matching to obtain the following outcomes:

a) Hey Siri

b) Hey, Siri

Think about patterns to define.

In [15]:
# Import the Matcher class from the spaCy library.
from spacy.matcher import Matcher

# Load the spaCy English language model.
nlp = spacy.load("en_core_web_sm")

# Create a Matcher object.
matcher = Matcher(nlp.vocab)

# Define a pattern that matches the token "hey" followed by the token "siri", ignoring case.
pattern = [{"LOWER": "hey"}, {"LOWER": "siri"}]

# Add the pattern to the Matcher object with the ID "Hey siri".
matcher.add("Hey siri",[pattern])

# Process the text using the spaCy language model.
doc = nlp("Hey siri")

# Find all matches of the "Hey siri" pattern in the text.
matches = matcher(doc)

# Iterate over the matches and print the text of each match.
for match_id,start,end in matches:
    # Get the string ID of the matched pattern.
    string_id = nlp.vocab.strings[match_id]

    # Get the span of the matched text.
    span = doc[start:end]

    # Print the text of the matched span.
    print(span.text)

Hey siri


In [16]:
# Import the Matcher class from the spaCy library.
from spacy.matcher import Matcher

# Load the spaCy English language model.
nlp = spacy.load("en_core_web_sm")

# Create a Matcher object.
matcher = Matcher(nlp.vocab)

# Define a pattern that matches "hey", then one punctuation token, then "siri", ignoring case.
pattern = [{"LOWER": "hey"}, {"IS_PUNCT": True}, {"LOWER": "siri"}]

# Add the pattern to the Matcher object with the ID "Hey siri".
matcher.add("Hey siri",[pattern])

# Process the text using the spaCy language model.
doc = nlp("Hey, siri")

# Find all matches of the "Hey siri" pattern in the text.
matches = matcher(doc)

# Iterate over the matches and print the text of each match.
for match_id,start,end in matches:
    # Get the string ID of the matched pattern.
    string_id = nlp.vocab.strings[match_id]

    # Get the span of the matched text.
    span = doc[start:end]

    # Print the text of the matched span.
    print(span.text)

Hey, siri
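Both outcomes can also be obtained from the full text in one pass by marking the punctuation token as optional with the `"OP": "?"` operator. The following is a minimal sketch, assuming spaCy v3's `Matcher.add(key, patterns)` signature. It uses `spacy.blank("en")` because `LOWER` and `IS_PUNCT` are lexical attributes that only need the tokenizer; `nlp_blank` and the label `"HEY_SIRI"` are names chosen here for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline (tokenizer only) is enough for lexical attributes.
nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)

# "hey", then zero or one punctuation token, then "siri" (case-insensitive).
pattern = [
    {"LOWER": "hey"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "siri"},
]
matcher.add("HEY_SIRI", [pattern])

doc = nlp_blank("Hey, Siri! Hey Siri!")

# Collect the text of every matched span.
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
for text in matched_texts:
    print(text)
```

A single optional token keeps the two variants under one rule; two separate patterns added under the same key would be an equally valid design.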


## **Task 3**

Do the tokens 'apple', 'orange', 'pikkstn', and 'German' have a vector representation in spaCy?

Are they part of the pipeline's vocabulary in spaCy, or are they out-of-vocabulary (OOV)?

In [18]:
# Load the text into a spaCy Doc object.
doc = nlp('apple orange pikkstn German')

# Iterate over the tokens in the Doc object.
for token in doc:
    # Print the shape (dimensionality) of the token's vector.
    print('Vector Length:\n',token.vector.shape)

    # Print the vector representation of the token.
    print('Word Vector Representation\n',token.vector)

Vector Length:
 (96,)
Word Vector Representation
 [-0.01291151 -0.63742214  0.6336341   0.7220289  -0.7089878  -0.2020278
 -0.21471749  1.7504792  -0.43238533 -0.75817096  1.8110483  -0.8448776
 -0.8092954  -0.60360026 -0.15348431  0.30619383 -0.9539893  -0.79956675
 -0.7785482  -0.8693065   0.21894354 -0.06897593  0.8038384   0.17259844
  0.2581646   0.7076305   0.8857086   0.782564   -0.4451243   1.1322079
 -0.69896    -1.0962689   0.1928064   1.0511543  -0.6390506   0.26279163
  1.7593797  -0.8621046   0.47793993  1.5560223  -0.92148495  1.457032
 -0.28774732  1.1776068  -0.6398139  -0.15469822  0.64170146  0.6397705
  0.10651273 -0.5398765  -0.13111869 -1.6336241   0.9770989  -0.49307543
 -0.4739711  -0.433877   -0.22383378 -0.52839124  0.8471283  -0.24316812
 -1.392698    0.22927427 -0.29445207 -1.8478808  -0.7132102  -1.077588
 -0.26076427  0.51486564 -0.5803723  -0.91216826  0.24041569  0.5029696
  0.4219088  -1.6083198  -0.07817796 -0.09576415  0.7770756  -0.04122347
  0.652579
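The vocabulary question can be checked per token with the `has_vector` and `is_oov` attributes. The following is a minimal sketch that runs without a model download: it uses a blank pipeline with hand-made three-dimensional vectors for 'apple' and 'orange'. These are toy values, not real embeddings, and `toy_nlp` and `vector_report` are names introduced here for illustration. Passing the notebook's `nlp` object to `vector_report` instead would inspect the loaded pipeline.

```python
import numpy as np
import spacy

# Assumption: a blank pipeline with toy vectors stands in for a real model.
toy_nlp = spacy.blank("en")
toy_nlp.vocab.set_vector("apple", np.array([1.0, 0.0, 0.5], dtype="float32"))
toy_nlp.vocab.set_vector("orange", np.array([0.8, 0.2, 0.4], dtype="float32"))


def vector_report(pipeline, text):
    """Return (token text, has_vector, is_oov) for every token in `text`."""
    return [(token.text, token.has_vector, token.is_oov) for token in pipeline(text)]


report = vector_report(toy_nlp, "apple orange pikkstn German")
for text, has_vector, is_oov in report:
    print(f"text={text} | has_vector={has_vector} | is_oov={is_oov}")
```

In this toy setup only 'apple' and 'orange' are in the vector table, so 'pikkstn' and 'German' are reported as out-of-vocabulary. The result on a real pipeline depends on which vectors that pipeline ships with.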

## **Task 4**

For sentence 4, match the phrases 'rotten mangoes' and 'sweet oranges' using the defined patterns ['ROTTEN mangoes', 'sweet oranges'].

How will you set the attributes to achieve this?

Sentence 4:

"Do not put rotten mangoes and sweet oranges together."

In [19]:
from spacy.matcher import PhraseMatcher

# Create a new PhraseMatcher object with case-insensitive matching.
matcher = PhraseMatcher(nlp.vocab,attr='LOWER')

# Create a list of phrases to match.
terms = ["ROTTEN mangoes","sweet oranges"]

# Create a list of spaCy Doc objects from the list of phrases.
patterns = [nlp.make_doc(text) for text in terms]

# Add the patterns to the PhraseMatcher object
matcher.add("Fruits",patterns)

# Load the text to search into a spaCy Doc object.
doc = nlp("Do not put rotten mangoes and sweet oranges together")

# Iterate over the matches found by the PhraseMatcher object.
for match_id,start,end in matcher(doc):
    # Print the matched text, along with a message indicating that the match was based on the lowercase token text.
    print("Matched based on lowercase token text: ",doc[start:end])

Matched based on lowercase token text:  rotten mangoes
Matched based on lowercase token text:  sweet oranges
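To see why the `attr` setting matters, the same terms can be matched once on the exact token text (`ORTH`, the `PhraseMatcher` default) and once on the lowercase form (`LOWER`). The following is a minimal sketch on a blank English pipeline; `nlp_blank` and `phrase_matches` are names introduced here for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# The tokenizer alone is sufficient for ORTH and LOWER matching.
nlp_blank = spacy.blank("en")
terms = ["ROTTEN mangoes", "sweet oranges"]
doc = nlp_blank("Do not put rotten mangoes and sweet oranges together")


def phrase_matches(attr):
    """Return the matched span texts when comparing tokens on `attr`."""
    matcher = PhraseMatcher(nlp_blank.vocab, attr=attr)
    matcher.add("Fruits", [nlp_blank.make_doc(term) for term in terms])
    return [doc[start:end].text for _, start, end in matcher(doc)]


orth_matches = phrase_matches("ORTH")    # exact, case-sensitive comparison
lower_matches = phrase_matches("LOWER")  # case-insensitive comparison

print("ORTH: ", orth_matches)
print("LOWER:", lower_matches)
```

With `ORTH`, the pattern 'ROTTEN mangoes' does not match the lowercase 'rotten mangoes' in the sentence, so only 'sweet oranges' is found; with `LOWER`, both phrases are found.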


## **Task 5**

Represent sentence 1 in vector form using word vectors.

What is the total length of output vectors?

Sentence 1:

"I prefer the morning flight through Denmark."

In [20]:
# Load the text into a spaCy Doc object.
doc = nlp("I prefer the morning flight through Denmark")

# Iterate over the tokens in the Doc object.
for token in doc:
    # Print the shape (dimensionality) of the token's vector.
    print("Vector length: \n",token.vector.shape)

    # Print the vector representation of the token.
    print("Word Vector Representation:\n",token.vector)

Vector length: 
 (96,)
Word Vector Representation:
 [-0.9435841  -0.13761951 -0.41831952 -0.15897208 -0.4302298  -0.04438317
  2.3614748   0.6231054   0.0122031  -0.79678786  2.2164228   1.634094
 -0.46185458  0.47664887 -1.5722715  -0.852956    0.7579042   1.0350271
 -0.91913295 -0.5994303  -0.90215635  0.22339064  0.22972403 -1.2096164
 -0.62598217 -0.50422657 -0.544212   -0.20867735 -1.2677568   0.25655213
 -0.12779951 -0.6996877  -0.82510996 -0.14156663 -0.42306674 -0.6998179
 -0.99681437 -0.4949299  -1.0885243   1.7996302  -0.9437039   0.33264816
  0.03097375  1.1109407  -0.7527068  -0.53995335  0.8292919   3.7861528
 -0.08327061 -0.09838206 -1.1750913  -1.1477796   1.4207778  -1.2538137
  0.56365967 -1.1316081   0.8606243  -0.9959679   0.16899098  1.0195899
  1.4902151   0.12664063  0.5253415  -0.46430543  1.2312306   0.4858403
 -0.897334   -0.46871835 -1.1867027  -1.9095289   0.29641783 -0.31166956
  1.0440685  -0.06307149 -1.1778319   0.52600086  1.2667065  -0.03227267
  0.7472

## **Task 6**

Find the similarity between each word of the input sentence 4. Answer the following questions:

a) The words 'rotten' and 'sweet' are out of vocabulary. Is this statement true or false?

b) What is the similarity value between 'mangoes' and 'oranges'?

c) What is the similarity value between 'sweet' and 'oranges'?

Sentence 4:

"Do not put rotten mangoes and sweet oranges together."

In [21]:
# Process the sentence once and inspect the tokens of interest directly.
doc = nlp("Do not put rotten mangoes and sweet oranges together")

for token in doc:
    # Report whether a vector is available for 'rotten' and 'sweet'.
    if token.text in ("rotten", "sweet"):
        print("text=", token.text, " | Vector=", token.has_vector)

text= rotten  | Vector= True
text= sweet  | Vector= True


In [22]:
doc = nlp("mangoes orange")

for token1 in doc:
    for token2 in doc:
        print(token1.text," | ", token2.text," | ", token1.similarity(token2))

mangoes  |  mangoes  |  1.0
mangoes  |  orange  |  -0.2636679708957672
orange  |  mangoes  |  -0.2636679708957672
orange  |  orange  |  1.0




In [23]:
doc=nlp('sweet oranges')

for token1 in doc:
    for token2 in doc:
        print(token1.text," | ", token2.text," | ", token1.similarity(token2))

sweet  |  sweet  |  1.0
sweet  |  oranges  |  0.14432832598686218
oranges  |  sweet  |  0.14432832598686218
oranges  |  oranges  |  1.0
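By default, the values returned by `Token.similarity` are cosine similarities between the two tokens' vectors, which is why a token compared with itself scores 1.0 and why negative values are possible. The following is a minimal sketch of that formula in NumPy. The three-dimensional vectors are made up for illustration and are not taken from any spaCy pipeline, and `cosine_similarity` is a helper defined here.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: (a . b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors, chosen so the geometry is easy to check by hand.
u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 4.0, 0.0])    # same direction as u
w = np.array([-2.0, 1.0, 0.0])   # perpendicular to u

same_direction = cosine_similarity(u, v)
perpendicular = cosine_similarity(u, w)
opposite = cosine_similarity(u, -u)

print(same_direction, perpendicular, opposite)
```

The score depends only on the angle between the vectors, not on their lengths, and ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction).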


