# Unit Testing of File Reader and Language Processing

This notebook contains unit tests for the functions in fileReader.py and languageProcessing.py. To test the GUI, run gui.py directly.

## Imports and Setup

In [35]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('words')
%pip install pypdf2

from languageProcessing import *
from fileReader import *



[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/nicholasderby/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/nicholasderby/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/nicholasderby/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nicholasderby/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/nicholasderby/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/nicholasderby/nltk_data...
[nltk_data]   Package words is al

Note: you may need to restart the kernel to use updated packages.


## File Reader Tests

### Testing to ensure PDF reader functions properly

fileReader.py contains additional functions, but they can only be called from the GUI, so they are not tested here.

In [36]:
inputPDF = "./input/colorado-wikipedia-intro.pdf"
outputTXT = "./output/colorado-wikipedia-intro.txt"
controlTXT = "./input/colorado-intro-wiki.txt"

pdf_to_txt(inputPDF, outputTXT)

with open(outputTXT, 'r') as file:
    outputText = file.read()

with open(controlTXT, 'r') as file:
    controlText = file.read()

# Compare the extracted text to the control text word by word
for outputWord, controlWord in zip(outputText.split(), controlText.split()):
    assert outputWord == controlWord, f"{outputWord!r} != {controlWord!r}"
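
One limitation of the comparison above is that zip stops at the shorter of its two inputs, so a truncated output file would still pass. The following is a minimal sketch of a stricter check. It assumes a hypothetical helper named first_mismatch, which is not part of fileReader.py. The helper pads the shorter sequence with None so that a difference in length shows up as a mismatch.

```python
from itertools import zip_longest


def first_mismatch(output_words, control_words):
    """Return (index, output_word, control_word) for the first difference, or None.

    Hypothetical helper for illustration; not part of fileReader.py.
    """
    pairs = zip_longest(output_words, control_words)  # pads the shorter side with None
    for index, (out_word, ctl_word) in enumerate(pairs):
        if out_word != ctl_word:
            return index, out_word, ctl_word
    return None


# Toy data: the second list has an extra word that plain zip would ignore
same = first_mismatch(["Colorado", "is"], ["Colorado", "is"])
truncated = first_mismatch(["Colorado"], ["Colorado", "is"])
```

In the notebook, this could be used as `assert first_mismatch(outputText.split(), controlText.split()) is None`.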

## Language Processing Tests

Test sentence detection. Since every sentence in the sample text ends in a period, checking that each detected sentence ends in a period is an effective test in this case.

In [37]:
text = controlText  # renaming for readability

sentences = sentence_detection(text)

for sentence in sentences:
    print(sentence)
    assert sentence.endswith('.')


Colorado is a state in the Western United States.
It is one of the Mountain states, sharing the Four Corners region with Arizona, New Mexico, and Utah.
It is also bordered by Wyoming to the north, Nebraska to the northeast, Kansas to the east, and Oklahoma to the southeast.
Colorado is noted for its landscape of mountains, forests, high plains, mesas, canyons, plateaus, rivers, and desert lands.
It encompasses most of the Southern Rocky Mountains, as well as the northeastern portion of the Colorado Plateau and the western edge of the Great Plains.
Colorado is the eighth most extensive and 21st most populous U.S. state.
The United States Census Bureau estimated the population of Colorado at 5,877,610 as of July 1, 2023, a 1.80% increase since the 2020 United States census.
The region has been inhabited by Native Americans and their ancestors for at least 13,500 years and possibly much longer.
The eastern edge of the Rocky Mountains was a major migration route for early peoples who sprea
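
Ending in a period is a necessary condition, but it may not be sufficient on its own. The sketch below uses a deliberately naive, hypothetical splitter (naive_split, which is not part of languageProcessing.py) to show that a wrong split inside "U.S. state." still passes the period check. This suggests also comparing the number of detected sentences against a known count.

```python
import re


def naive_split(text):
    """Split after every period that is followed by whitespace.

    Hypothetical and deliberately naive; for illustration only.
    """
    return re.split(r"(?<=\.)\s+", text.strip())


sample = (
    "Colorado is the eighth most extensive and 21st most populous U.S. state. "
    "The region has been inhabited by Native Americans."
)

pieces = naive_split(sample)

# Every piece ends in a period, even though "U.S. state." was split in two
period_check_passes = all(piece.endswith(".") for piece in pieces)

# A count check catches the bad split: the sample contains 2 sentences
count_check_passes = len(pieces) == 2
```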

Test tokenize, checking that each token properly matches up with the original text.

In [38]:
words = text.split()
tokens = tokenize(text)

for token in tokens:
    if token == "``" or token == "''":      # ignore quotes since tokenizer reformats them
        continue

    # the token must appear inside at least one whitespace-delimited word
    assert any(token in word for word in words), f"token {token!r} not found in text"
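
The membership check above accepts a token if it occurs inside any word, so a short token such as "a" matches almost everywhere. As a sketch of a stricter, order-aware check, the hypothetical helper below (not part of languageProcessing.py) searches for each token in the original text, starting from where the previous token ended. The skipped quote tokens mirror the ones ignored above.

```python
def tokens_align_with_text(tokens, text, skip=("``", "''")):
    """Return True if the tokens occur in the text in order, without overlapping.

    Hypothetical helper for illustration; not part of languageProcessing.py.
    """
    position = 0
    for token in tokens:
        if token in skip:  # the tokenizer reformats quotes, so ignore them
            continue
        found = text.find(token, position)
        if found == -1:
            return False
        position = found + len(token)  # continue searching after this token
    return True


in_order = tokens_align_with_text(
    ["Colorado", "is", "a", "state", "."], "Colorado is a state."
)
out_of_order = tokens_align_with_text(["state", "Colorado"], "Colorado is a state.")
```

In the notebook, this could be used as `assert tokens_align_with_text(tokens, text)`.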

Test part_of_speech_tagging by printing its output for visual inspection. The function returns a single string of word [TAG] pairs, so it is printed whole rather than iterated over.

In [39]:
tags = part_of_speech_tagging(text)

# tags is a single string; looping over it would print one character per line
print(tags)

Colorado [NNP] is [VBZ] a [DT] state [NN] in [IN] the [DT] Western [NNP] United [NNP] States [NNPS] . [.] It [PRP] is [VBZ] one [CD] of [IN] the [DT] Mountain [NNP] states [NNS] , [,] sharing [VBG] the [DT] Four [CD] Corners [NNPS] region [NN] with [IN] Arizona [NNP] , [,] New [NNP] Mexico [NNP] , [,] and [CC] Utah [NNP] . [.] It [PRP] is [VBZ] also [RB] bordered [VBN] by [IN] Wyoming [VBG] to [TO] the [DT] north [NN] , [,] Nebraska [NNP] to [TO] the [DT] northeast [NN] , [,] Kansas [NNP] to [TO
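
Since the output above has the form word [TAG], it can be parsed back into pairs so that individual tags can be asserted rather than checked by eye. The sketch below assumes that format holds throughout. parse_tagged is a hypothetical helper, not part of languageProcessing.py.

```python
import re


def parse_tagged(tagged):
    """Turn 'word [TAG] word [TAG] ...' into a list of (word, tag) pairs.

    Hypothetical helper; assumes words and tags contain no whitespace.
    """
    return re.findall(r"(\S+) \[(\S+?)\]", tagged)


sample = "Colorado [NNP] is [VBZ] a [DT] state [NN] . [.]"
pairs = parse_tagged(sample)

# Individual tags can now be asserted directly
assert ("Colorado", "NNP") in pairs
assert ("is", "VBZ") in pairs
```

With pairs in hand, a check such as `("Wyoming", "NNP") in parse_tagged(tags)` would flag the Wyoming [VBG] tag visible in the output above.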


Test the word_frequency function, comparing its result with a manual Cmd+F (find) search of the original file.

In [40]:
frequencies = word_frequency(text)
assert "colorado {11}" in frequencies
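
For reference, a minimal sketch of how output in the word {count} format could be produced. It assumes lowercasing and simple regex tokenization. The real word_frequency in languageProcessing.py may tokenize, filter, or format differently, and word_frequency_sketch is a hypothetical name.

```python
import re
from collections import Counter


def word_frequency_sketch(text):
    """Return 'word {count}' strings, most frequent first.

    Hypothetical stand-in; not the implementation in languageProcessing.py.
    """
    words = re.findall(r"[a-z']+", text.lower())  # lowercase alphabetic words
    counts = Counter(words)
    # Doubled braces in an f-string produce literal { and }
    return [f"{word} {{{count}}}" for word, count in counts.most_common()]


result = word_frequency_sketch("Colorado is big. Colorado is high.")
assert "colorado {2}" in result
```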

Test convert_pos_to_wordnet to ensure that it accurately converts NLTK part-of-speech tags to WordNet part-of-speech tags.

In [41]:
assert convert_pos_to_wordnet('JS') == 'a'   # adjective
assert convert_pos_to_wordnet('VB') == 'v'   # verb
assert convert_pos_to_wordnet('NP') == 'n'   # noun
assert convert_pos_to_wordnet('RS') == 'r'   # adverb
assert convert_pos_to_wordnet('XL') == 'n'   # unrecognized tag falls back to noun
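
Several of the inputs above ('JS', 'NP', 'RS') do not appear to be standard Penn Treebank tags, which suggests that only the first letter of the tag is inspected. Under that assumption, the mapping being tested might look like the sketch below. pos_to_wordnet_sketch is a hypothetical name, and the actual convert_pos_to_wordnet may be written differently.

```python
def pos_to_wordnet_sketch(tag):
    """Map a Penn Treebank style tag to a WordNet POS letter by its first character.

    Hypothetical stand-in; assumes first-letter dispatch with a noun fallback.
    """
    first_letter_map = {
        "J": "a",  # adjective
        "V": "v",  # verb
        "N": "n",  # noun
        "R": "r",  # adverb
    }
    return first_letter_map.get(tag[:1], "n")  # unknown or empty tags become nouns


# Real Penn Treebank tags behave the same way as the shortened test inputs
assert pos_to_wordnet_sketch("JJ") == "a"
assert pos_to_wordnet_sketch("RB") == "r"
```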

Test lemmatization, ensuring a conjugated verb such as 'is' is properly lemmatized to 'be'.

In [42]:
lemmas = lemmatization(text)

for lemma, token in zip(lemmas, tokens):
    if token == 'is':
        assert lemma == 'be'
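
A loop of this shape passes trivially when no token equals 'is', so it can help to count how many pairs were actually checked. The sketch below assumes the lemmas and tokens are parallel sequences of equal length. check_lemma is a hypothetical helper, and the toy lists stand in for the real lemmatization output.

```python
def check_lemma(lemmas, tokens, surface, expected):
    """Assert every occurrence of `surface` lemmatizes to `expected`.

    Returns the number of occurrences checked. Hypothetical helper.
    """
    checked = 0
    for lemma, token in zip(lemmas, tokens):
        if token == surface:
            assert lemma == expected, f"{token!r} lemmatized to {lemma!r}"
            checked += 1
    return checked


# Toy data standing in for the real tokens and lemmas
toy_tokens = ["Colorado", "is", "a", "state"]
toy_lemmas = ["Colorado", "be", "a", "state"]

checked_count = check_lemma(toy_lemmas, toy_tokens, "is", "be")
assert checked_count > 0  # guards against a vacuous pass
```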

Test get_first_sentence, checking that it returns the first sentence of the sample text.

In [43]:
assert get_first_sentence() == "Colorado is a state in the Western United States."

Test named_entity_recognition, checking that Colorado is recognized as a geopolitical entity (GPE).

In [44]:
entities = named_entity_recognition(text)
assert "GPE Colorado" in entities