# **Text Representation and Rule-based Matching**

# **Challenge Solution**

## **Analyzing Gettysburg Address Text Data**


The Gettysburg Address is a speech that U.S. President Abraham Lincoln delivered during the American Civil War at the dedication of the Soldiers' National Cemetery in Gettysburg, Pennsylvania, on the afternoon of November 19, 1863. It is one of the best-known speeches in American history.

Text must be converted into a suitable representation before it can be analyzed. Use this speech data to perform the tasks below and build the text representations they ask for.


## **Task 1**

### **1.1 Import spacy, collections, English, and the typing objects (Dict, List, Tuple).**



In [4]:
# Import spaCy, the NLP library used throughout this notebook.
import spacy

# Import collections, which provides specialized containers such as defaultdict and Counter.
import collections

# Import the type hints Dict, List, and Tuple, used to annotate function signatures.
from typing import Dict, List, Tuple


### **1.2 Load 'en_core_web_sm'.**


In [5]:
# Create a spaCy NLP pipeline.

nlp = spacy.load("en_core_web_sm")


### **1.3 Read the datafile**

In [9]:
# Import the files module from the google.colab package.
from google.colab import files

# Call files.upload(), which opens a file upload dialog box, and store the result in the variable upload.
upload = files.upload()

Saving DS3_C2_S4_GettysburgAddress_Data_Challenge.txt to DS3_C2_S4_GettysburgAddress_Data_Challenge.txt


In [11]:
# Import the pandas library.
import pandas as pd

# Open the file "DS3_C2_S4_GettysburgAddress_Data_Challenge.txt" for reading
# and read its entire contents into a string variable called "content".
# The with block closes the file automatically.
with open("DS3_C2_S4_GettysburgAddress_Data_Challenge.txt") as f:
    content = f.read()

# f.read() already returns a string, so "text" is a second name for the same contents.
text = str(content)

# Print the text.
print(text)

Four score and seven years ago our fathers brought forth upon this continent, a new nation,

 conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so

 conceived and so dedicated, can long endure. We are met on a great battle-field of that 

war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate— we can not consecrate— we can not hallow— this
 ground. The brave men, living and dead, who struggled here, have consecrated it, far above
 our poor power to add or detract. The world will little note, nor long remember what we say
 here, but it can never forget what they did here. 


66666666666666666666 7777777777777
444  222 2222  000

It is for us the living, rather, to be 
dedicate

## **Task 2**

Represent speech texts in vector form using BOW. Supply the following for the words 'dedicated' and 'nation':

a) BOW vector representation

b) Dictionary values

In [14]:
def text2bow(words: List[str], dictionary: Dict[str, int]) -> List[Tuple[int, int]]:
    """Convert a list of words to a bag-of-words (BOW) representation.

    Args:
        words: A list of strings representing the words in the text.
        dictionary: A dictionary mapping words to integer IDs; unseen words are added in place.

    Returns:
        A list of tuples, where each tuple contains a word ID and its frequency.
    """
    # Map each word ID to the number of times the word occurs.
    word_frequencies = collections.defaultdict(int)
    print(word_frequencies)                                  # debug print: still empty here

    for word in words:
        if word not in dictionary:                           # unseen word: assign the next free ID
            dictionary[word] = len(dictionary)
        word_frequencies[dictionary[word]] += 1              # count this occurrence

    return list(word_frequencies.items())                    # (word ID, frequency) pairs


sample_text = text                                           # input text
dictionary = {}                                              # empty dictionary
print(text2bow(sample_text.split(), dictionary))             # split on whitespace, then build the BOW
print(dictionary)                                            # print the word-to-ID dictionary

defaultdict(<class 'int'>, {})
[(0, 1), (1, 1), (2, 5), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 3), (12, 1), (13, 7), (14, 2), (15, 3), (16, 2), (17, 4), (18, 1), (19, 3), (20, 8), (21, 9), (22, 1), (23, 10), (24, 1), (25, 1), (26, 3), (27, 1), (28, 1), (29, 1), (30, 8), (31, 1), (32, 3), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 1), (39, 2), (40, 3), (41, 1), (42, 5), (43, 2), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 5), (50, 1), (51, 5), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 5), (61, 1), (62, 3), (63, 5), (64, 2), (65, 1), (66, 1), (67, 1), (68, 1), (69, 3), (70, 3), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 5), (81, 1), (82, 1), (83, 1), (84, 1), (85, 2), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 2), (92, 1), (93, 1), (94, 2), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 2), (107, 1
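
The task asks specifically for the BOW entry and the dictionary value of 'dedicated' and 'nation'. A minimal sketch of that lookup follows, assuming a short excerpt of the speech in place of the full data file; the names `excerpt` and `bow_counts` are illustrative and not part of the original solution.

```python
import collections
from typing import Dict, List, Tuple


def text2bow(words: List[str], dictionary: Dict[str, int]) -> List[Tuple[int, int]]:
    """Map each word to an integer ID and count how often each ID occurs."""
    counts: Dict[int, int] = collections.defaultdict(int)
    for word in words:
        if word not in dictionary:
            dictionary[word] = len(dictionary)
        counts[dictionary[word]] += 1
    return list(counts.items())


# Assumption: a short excerpt of the speech stands in for the full data file.
excerpt = (
    "a new nation, conceived in Liberty, and dedicated to the proposition "
    "that all men are created equal. Now we are engaged in a great civil war, "
    "testing whether that nation, or any nation so conceived and so dedicated, "
    "can long endure."
)

dictionary: Dict[str, int] = {}
# Turn the list of (ID, frequency) pairs into a dict for direct lookup by ID.
bow_counts = dict(text2bow(excerpt.split(), dictionary))

for word in ("dedicated", "nation"):
    word_id = dictionary.get(word)  # dictionary value (ID), or None if the word is unseen
    if word_id is None:
        print(f"{word!r} is not in the dictionary")
    else:
        print(f"{word!r}: dictionary value = {word_id}, "
              f"BOW entry = {(word_id, bow_counts[word_id])}")
```

Because the words are split on whitespace only, a form with attached punctuation such as 'nation,' receives its own ID and count, separate from 'nation'. Stripping punctuation or lowercasing before counting would merge them; that is a preprocessing choice the original solution does not make.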

## **Task 3**

For the speech texts, how will you define patterns and implement token-based matching to obtain the following outcomes?

a) battlefield

b) battle-field

Think about patterns to define.


In [25]:
# Import the Matcher class from the spaCy library.
from spacy.matcher import Matcher

# Load the spaCy English language model.
nlp = spacy.load("en_core_web_sm")

# Create a Matcher object.
matcher = Matcher(nlp.vocab)

# Define a pattern to match the single token "battlefield" (case-insensitive).
# A two-token pattern of "battle" followed by "field" would match "battle field" instead.
pattern = [{"LOWER": "battlefield"}]

# Add the pattern to the Matcher object with the ID "battlefield".
matcher.add("battlefield", [pattern])

# Process the text using the spaCy language model.
doc = nlp(content)

# Find all matches of the "battlefield" pattern in the text.
matches = matcher(doc)

# Iterate over the matches and print the text of each match.
for match_id, start, end in matches:
    # Get the string ID of the matched pattern.
    string_id = nlp.vocab.strings[match_id]

    # Get the span of the matched text.
    span = doc[start:end]

    # Print the text of the matched span.
    print(span.text)

In [17]:
# Import the Matcher class from the spaCy library.
from spacy.matcher import Matcher

# Load the spaCy English language model.
nlp = spacy.load("en_core_web_sm")

# Create a Matcher object.
matcher = Matcher(nlp.vocab)

# Define a pattern to match the hyphenated form "battle-field":
# the token "battle", a punctuation token (the hyphen), then the token "field".
pattern = [{"LOWER": "battle"}, {"IS_PUNCT": True}, {"LOWER": "field"}]

# Add the pattern to the Matcher object with the ID "battlefield".
matcher.add("battlefield", [pattern])

# Process the text using the spaCy language model.
doc = nlp(content)

# Find all matches of the hyphenated pattern in the text.
matches = matcher(doc)

# Iterate over the matches and print the text of each match.
for match_id, start, end in matches:
    # Get the string ID of the matched pattern.
    string_id = nlp.vocab.strings[match_id]

    # Get the span of the matched text.
    span = doc[start:end]

    # Print the text of the matched span.
    print(span.text)

battle-field
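
Both spellings can also be covered by a single matcher. The sketch below is one possible arrangement, not the original solution: it registers two patterns under one ID and uses the `OP: "?"` operator to make the punctuation token optional. It runs on a blank English pipeline (tokenizer only), and the second sample sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline provides the tokenizer only, which is all the Matcher needs here.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Pattern 1: the closed form "battlefield" as a single token.
closed_form = [{"LOWER": "battlefield"}]

# Pattern 2: "battle" and "field" with an optional punctuation token in between,
# which covers "battle-field" (and also "battle field").
split_form = [{"LOWER": "battle"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "field"}]

# Register both patterns under one match ID.
matcher.add("BATTLEFIELD", [closed_form, split_form])

# The first sentence is from the speech; the second is a hypothetical example.
sample = "We are met on a great battle-field of that war. The battlefield is quiet."
doc = nlp(sample)

found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Keeping both patterns under one ID means downstream code can treat every spelling as the same concept, at the cost of no longer knowing which variant matched without inspecting the span text.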


## **Task 4**

Run the following command to download 'en_core_web_lg': python -m spacy download en_core_web_lg. Then load 'en_core_web_lg'.

Do the tokens of the speech texts have a vector representation in spaCy?

Are they part of the pipeline's vocabulary in spaCy or out-of-vocabulary?

In [19]:
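# `nlp` is still the "en_core_web_sm" pipeline loaded above; for the task as stated,
# load the large model first: nlp = spacy.load("en_core_web_lg").
doc = nlp(text)
# Print each token, whether it has a vector, and whether it is out-of-vocabulary.
for token in doc:
    print(token.text, " | ", token.has_vector, " | ", token.is_oov)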

Four  |  True  |  True
score  |  True  |  True
and  |  True  |  True
seven  |  True  |  True
years  |  True  |  True
ago  |  True  |  True
our  |  True  |  True
fathers  |  True  |  True
brought  |  True  |  True
forth  |  True  |  True
upon  |  True  |  True
this  |  True  |  True
continent  |  True  |  True
,  |  True  |  True
a  |  True  |  True
new  |  True  |  True
nation  |  True  |  True
,  |  True  |  True


   |  True  |  True
conceived  |  True  |  True
in  |  True  |  True
Liberty  |  True  |  True
,  |  True  |  True
and  |  True  |  True
dedicated  |  True  |  True
to  |  True  |  True
the  |  True  |  True
proposition  |  True  |  True
that  |  True  |  True
all  |  True  |  True
men  |  True  |  True
are  |  True  |  True
created  |  True  |  True
equal  |  True  |  True
.  |  True  |  True

  |  True  |  True
Now  |  True  |  True
we  |  True  |  True
are  |  True  |  True
engaged  |  True  |  True
in  |  True  |  True
a  |  True  |  True
great  |  True  |  True
civil  

## **Task 5**

For the speech texts, the phrases 'long endure' and 'resting place' should be matched using the defined patterns ['long endure', 'resting place'].

How will you set the attributes to achieve this?

In [21]:
# Import spaCy's PhraseMatcher, which finds exact instances of phrases in a text.
from spacy.matcher import PhraseMatcher

# Create a new PhraseMatcher object.
matcher = PhraseMatcher(nlp.vocab)

# Create a list of phrases to match.
terms = ["long endure", "resting place"]

# Create a list of spaCy Doc objects from the list of phrases.
# The loop variable is named "term" so that it does not shadow the speech in "text".
patterns = [nlp.make_doc(term) for term in terms]

# Add the patterns to the PhraseMatcher object under a descriptive key.
matcher.add("SPEECH_PHRASES", patterns)

# Process the text to search into a spaCy Doc object.
doc = nlp(text)

# Iterate over the matches found by the PhraseMatcher object.
for match_id, start, end in matcher(doc):
    print(doc[start:end])         # Print the matched text.

long endure
resting place
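
On the question of attributes: `PhraseMatcher` accepts an `attr` argument that selects which token attribute is compared, and by default it compares the verbatim token text. The sketch below assumes case-insensitive matching is wanted and sets `attr="LOWER"`. It uses a blank English pipeline and a made-up sample sentence with mixed casing, so it illustrates the setting rather than reproducing the output above.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough: PhraseMatcher only needs tokenized text.
nlp = spacy.blank("en")

# attr="LOWER" compares lowercased token text, so casing differences are ignored.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["long endure", "resting place"]
matcher.add("SPEECH_PHRASES", [nlp.make_doc(term) for term in terms])

# Hypothetical sample with mixed casing, not the original data file.
sample = "Can that nation Long Endure? It is a final resting place for those who fell."
doc = nlp.make_doc(sample)

found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

With the default attribute, "Long Endure" in this sample would not match the lowercase pattern; with `attr="LOWER"` both phrases are found regardless of casing.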


## **Task 6**

Represent speech text in the vector form using word vector representation.

What is the total length of output vectors?

In [22]:
# length of vector

doc.vector.shape

(96,)

In [24]:
# vector numbers

doc.vector

array([ 0.11224575, -0.2755308 , -0.0648795 , -0.03700626, -0.04325729,
        0.15692633,  0.05856009,  0.05960033,  0.07173745,  0.17168066,
       -0.13233267,  0.22620872, -0.26355806,  0.03330753, -0.07637579,
       -0.17697139,  0.11305751,  0.11847576, -0.01329103, -0.07913036,
        0.06090962,  0.30062747,  0.05478362, -0.27772897, -0.03164646,
       -0.03291905,  0.22353782,  0.12076934,  0.1242673 ,  0.05988366,
        0.00327474,  0.01875209,  0.42053464, -0.08467228,  0.229732  ,
       -0.22413518,  0.24621564, -0.1894398 , -0.04794111, -0.09005581,
       -0.31812707,  0.11526684, -0.06694183, -0.07767091,  0.18534514,
        0.07556131, -0.03481111,  0.15802333,  0.07112998, -0.0223304 ,
       -0.35075742, -0.00609015, -0.06427025,  0.07000443,  0.11222431,
        0.09198102, -0.0290524 , -0.05030384,  0.01403959, -0.11956887,
        0.1607308 , -0.10655373, -0.04611418,  0.09352808,  0.11231744,
        0.00842134,  0.02193254, -0.3816418 ,  0.1670724 ,  0.23

## **Task 7**

Find the similarity between each pair of words in the speech text.
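
One way to approach this, sketched under assumptions: spaCy's `Token.similarity` compares two tokens by the cosine similarity of their vectors by default. The example below performs the same pairwise computation in plain Python on hypothetical three-dimensional toy vectors. In practice the vectors would come from `token.vector` in a pipeline that ships word vectors, such as 'en_core_web_lg'. The names `toy_vectors` and `cosine_similarity` are illustrative.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real word vectors have far more dimensions.
toy_vectors: Dict[str, List[float]] = {
    "nation": [0.9, 0.1, 0.3],
    "continent": [0.8, 0.2, 0.4],
    "dedicated": [0.1, 0.9, 0.2],
}

# Compute the similarity for every unordered pair of distinct words.
pairwise: Dict[Tuple[str, str], float] = {
    (w1, w2): cosine_similarity(toy_vectors[w1], toy_vectors[w2])
    for w1, w2 in combinations(toy_vectors, 2)
}

for (w1, w2), score in pairwise.items():
    print(f"{w1} <-> {w2}: {score:.3f}")
```

For a full speech the number of pairs grows quadratically with the number of distinct words, so comparing unique words only (rather than every token occurrence) keeps the output manageable.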
