# Abstract Project

Source1: https://www.science.org/doi/abs/10.1126/science.aaa8415

Source2: https://en.wikipedia.org/wiki/List_of_time_periods#Human_time_periods

In [1]:
import os
import spacy
from spacy import displacy

# Read Text Document

In [2]:
# get current working directory
cwd = os.getcwd()

abstract_filepath = os.path.join(cwd, 'files', 'abstract.txt')

with open(abstract_filepath, 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read()
    print(text)

Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of todays most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.
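A note on the `errors='ignore'` argument used in the `open` call above: it silently drops any bytes that are not valid UTF-8 instead of raising an error. The sketch below uses a made-up byte string (not the actual contents of `abstract.txt`) to show the effect. Whether this explains why the abstract prints `todays` without an apostrophe is only a guess; it would happen if the file had been saved as Windows-1252 with a curly apostrophe.

```python
# Hypothetical bytes: "today's" with a curly apostrophe, saved as Windows-1252,
# where that apostrophe is the single byte 0x92 (not valid on its own in UTF-8).
raw = b'today\x92s'

ignored = raw.decode('utf-8', errors='ignore')    # invalid byte is dropped
replaced = raw.decode('utf-8', errors='replace')  # invalid byte becomes U+FFFD
correct = raw.decode('cp1252')                    # decoded with the matching codec

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Passing `errors='replace'`, or leaving the default `errors='strict'`, makes this kind of loss visible instead of hiding it.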


In [3]:
history_filepath = os.path.join(cwd, 'files', 'history.txt')

with open(history_filepath, 'r', encoding='utf-8', errors='ignore') as fp:
    text1 = fp.read()
    print(text1)

Pre-History – Period between the appearance of Homo ("humans"; first stone tools c. three million years ago) and the invention of writing systems (for the Ancient Near East: c. five thousand years ago).
Paleolithic – the earliest period of the Stone Age
Lower Paleolithic – time of archaic human species, predates Homo sapiens
Middle Paleolithic – coexistence of archaic and anatomically modern human species
Upper Paleolithic – worldwide expansion of anatomically modern humans, the disappearance of archaic humans by extinction or admixture with modern humans; earliest evidence for pictorial art.
Mesolithic (Epipaleolithic) – a period in the development of human technology between the Palaeolithic and Neolithic periods.
Neolithic – a period of primitive technological and social development, beginning about 10,200 BCE in parts of the Middle East, and later in other parts of the world.
Chalcolithic (or "Eneolithic", "Copper Age") – still largely Neolithic in character, where early copper met

# Data Exploration

In [4]:
# load spacy instance
nlp = spacy.load('en_core_web_md')

In [5]:
# convert the abstract text (a plain string) to a spaCy Doc object
doc = nlp(text)
doc

Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of todays most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.

In [6]:
print(f'Text length: {len(doc)}')

for idx, token in enumerate(doc):
    print(idx, token.text, token.pos_, token.dep_)

Text length: 134
0 Machine NOUN compound
1 learning NOUN nsubj
2 addresses VERB ROOT
3 the DET det
4 question NOUN dobj
5 of ADP prep
6 how SCONJ advmod
7 to PART aux
8 build VERB pcomp
9 computers NOUN dobj
10 that PRON nsubj
11 improve VERB relcl
12 automatically ADV advmod
13 through ADP prep
14 experience NOUN pobj
15 . PUNCT punct
16 It PRON nsubj
17 is AUX ROOT
18 one NUM attr
19 of ADP prep
20 todays NOUN pobj
21 most ADV advmod
22 rapidly ADV advmod
23 growing VERB amod
24 technical ADJ amod
25 fields NOUN appos
26 , PUNCT punct
27 lying VERB advcl
28 at ADP prep
29 the DET det
30 intersection NOUN pobj
31 of ADP prep
32 computer NOUN compound
33 science NOUN pobj
34 and CCONJ cc
35 statistics NOUN conj
36 , PUNCT punct
37 and CCONJ cc
38 at ADP conj
39 the DET det
40 core NOUN pobj
41 of ADP prep
42 artificial ADJ amod
43 intelligence NOUN nmod
44 and CCONJ cc
45 data NOUN conj
46 science NOUN pobj
47 . PUNCT punct
48 Recent ADJ amod
49 progress NOUN nsubjpass
50 in ADP prep
5

# Document Visualization

In [7]:
displacy.render(doc, style='dep')

In [8]:
# get all sentences and the number of tokens from the doc object
for sent in doc.sents:
    print(f'Sentence: {sent} \n Token count: {len(sent)}')

Sentence: Machine learning addresses the question of how to build computers that improve automatically through experience. 
 Token count: 16
Sentence: It is one of todays most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. 
 Token count: 32
Sentence: Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. 
 Token count: 35
Sentence: The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing. 
 Token count: 51
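As a quick sanity check, the per-sentence token counts should add up to the document length reported earlier. The sketch below hard-codes the numbers printed in the outputs above rather than calling spaCy, so it runs on its own; in the notebook the same check could be written against `doc.sents` and `len(doc)`.

```python
# Token counts per sentence, copied from the output of In [8].
sentence_token_counts = [16, 32, 35, 51]

# Document length in tokens, copied from the output of In [6].
doc_length = 134

total_tokens = sum(sentence_token_counts)
average_length = total_tokens / len(sentence_token_counts)

print(f'Total tokens across sentences: {total_tokens}')
print(f'Matches document length: {total_tokens == doc_length}')
print(f'Average tokens per sentence: {average_length}')
```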


# Sentence Similarity

In [9]:
# split the document into sentences once and unpack the first four
sentence1, sentence2, sentence3, sentence4 = list(doc.sents)[:4]

print(sentence1, '\n')
print(sentence2, '\n')
print(sentence3, '\n')
print(sentence4)

Machine learning addresses the question of how to build computers that improve automatically through experience. 

It is one of todays most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. 

Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. 

The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.


In [10]:
print('Sentence 1 & 2 Similarity: {:.2%}'.format(sentence1.similarity(sentence2)))
print('Sentence 1 & 3 Similarity: {:.2%}'.format(sentence1.similarity(sentence3)))
print('Sentence 2 & 4 Similarity: {:.2%}'.format(sentence2.similarity(sentence4)))
print('Sentence 3 & 4 Similarity: {:.2%}'.format(sentence3.similarity(sentence4)))

Sentence 1 & 2 Similarity: 88.42%
Sentence 1 & 3 Similarity: 91.20%
Sentence 2 & 4 Similarity: 93.60%
Sentence 3 & 4 Similarity: 90.96%
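By default, spaCy's `similarity` is described in its documentation as the cosine similarity between averaged word vectors. The sketch below reproduces that idea in plain Python with made-up 3-dimensional toy vectors (the real `en_core_web_md` vectors are much larger); the helper names `average_vector` and `cosine_similarity` are illustrative, not spaCy functions.

```python
import math

def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy "token vectors" for two hypothetical sentences (values are invented).
toy_sentence_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
toy_sentence_b = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

vec_a = average_vector(toy_sentence_a)
vec_b = average_vector(toy_sentence_b)
score = cosine_similarity(vec_a, vec_b)

print('Toy similarity: {:.2%}'.format(score))
```

Because every sentence vector is an average over many tokens, including frequent function words, long sentences tend to end up with similar averages. That may be part of why all four scores above fall in a narrow, high range.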


In [11]:
# access document text
doc.text

'Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of todays most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.'

In [12]:
[t.text for t in doc]

['Machine',
 'learning',
 'addresses',
 'the',
 'question',
 'of',
 'how',
 'to',
 'build',
 'computers',
 'that',
 'improve',
 'automatically',
 'through',
 'experience',
 '.',
 'It',
 'is',
 'one',
 'of',
 'todays',
 'most',
 'rapidly',
 'growing',
 'technical',
 'fields',
 ',',
 'lying',
 'at',
 'the',
 'intersection',
 'of',
 'computer',
 'science',
 'and',
 'statistics',
 ',',
 'and',
 'at',
 'the',
 'core',
 'of',
 'artificial',
 'intelligence',
 'and',
 'data',
 'science',
 '.',
 'Recent',
 'progress',
 'in',
 'machine',
 'learning',
 'has',
 'been',
 'driven',
 'both',
 'by',
 'the',
 'development',
 'of',
 'new',
 'learning',
 'algorithms',
 'and',
 'theory',
 'and',
 'by',
 'the',
 'ongoing',
 'explosion',
 'in',
 'the',
 'availability',
 'of',
 'online',
 'data',
 'and',
 'low',
 '-',
 'cost',
 'computation',
 '.',
 'The',
 'adoption',
 'of',
 'data',
 '-',
 'intensive',
 'machine',
 '-',
 'learning',
 'methods',
 'can',
 'be',
 'found',
 'throughout',
 'science',
 ',',
 'te

In [13]:
# iterate over the raw document text one character at a time
# (the loop variable is named char so it does not overwrite `text`)
for char in doc.text:
    print(char)

M
a
c
h
i
n
e
 
l
e
a
r
n
i
n
g
 
a
d
d
r
e
s
s
e
s
 
t
h
e
 
q
u
e
s
t
i
o
n
 
o
f
 
h
o
w
 
t
o
 
b
u
i
l
d
 
c
o
m
p
u
t
e
r
s
 
t
h
a
t
 
i
m
p
r
o
v
e
 
a
u
t
o
m
a
t
i
c
a
l
l
y
 
t
h
r
o
u
g
h
 
e
x
p
e
r
i
e
n
c
e
.
 
I
t
 
i
s
 
o
n
e
 
o
f
 
t
o
d
a
y
s
 
m
o
s
t
 
r
a
p
i
d
l
y
 
g
r
o
w
i
n
g
 
t
e
c
h
n
i
c
a
l
 
f
i
e
l
d
s
,
 
l
y
i
n
g
 
a
t
 
t
h
e
 
i
n
t
e
r
s
e
c
t
i
o
n
 
o
f
 
c
o
m
p
u
t
e
r
 
s
c
i
e
n
c
e
 
a
n
d
 
s
t
a
t
i
s
t
i
c
s
,
 
a
n
d
 
a
t
 
t
h
e
 
c
o
r
e
 
o
f
 
a
r
t
i
f
i
c
i
a
l
 
i
n
t
e
l
l
i
g
e
n
c
e
 
a
n
d
 
d
a
t
a
 
s
c
i
e
n
c
e
.
 
R
e
c
e
n
t
 
p
r
o
g
r
e
s
s
 
i
n
 
m
a
c
h
i
n
e
 
l
e
a
r
n
i
n
g
 
h
a
s
 
b
e
e
n
 
d
r
i
v
e
n
 
b
o
t
h
 
b
y
 
t
h
e
 
d
e
v
e
l
o
p
m
e
n
t
 
o
f
 
n
e
w
 
l
e
a
r
n
i
n
g
 
a
l
g
o
r
i
t
h
m
s
 
a
n
d
 
t
h
e
o
r
y
 
a
n
d
 
b
y
 
t
h
e
 
o
n
g
o
i
n
g
 
e
x
p
l
o
s
i
o
n
 
i
n
 
t
h
e
 
a
v
a
i
l
a
b
i
l
i
t
y
 
o
f
 
o
n
l
i
n
e
 
d
a
t
a
 
a
n
d
 
l
o
w
-
c
o
s
t
 
c
o
m
p
u
t
a
t
i
o
n
.
 
T


In [14]:
# get a span of the first 10 tokens (indices 0-9; the end index is exclusive)
tok_10 = doc[:10]
tok_10

Machine learning addresses the question of how to build computers

In [15]:
displacy.render(docs=tok_10, style='dep')

In [16]:
# export given token attributes to a numpy ndarray
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA

token_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
token_array[:5]

array([[ 1826470356240629538,                   92,                    0,
                           1],
       [ 7342778914265824300,                   92,                    0,
                           1],
       [13677849008941035348,                  100,                    0,
                           1],
       [ 7425985699627899538,                   90,                    0,
                           1],
       [10779227342117629034,                   92,                    0,
                           1]], dtype=uint64)
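Each row of the array corresponds to one token, and each column to one of the requested attributes, in the order they were passed. The sketch below labels the five rows shown above using plain Python. The `pos_names` mapping is an assumption inferred by lining these rows up with the `pos_` values printed for tokens 0 to 4 earlier; it covers only the three IDs that appear here.

```python
# Column order matches the attribute list passed to doc.to_array().
columns = ['LOWER', 'POS', 'ENT_TYPE', 'IS_ALPHA']

# The first five rows, copied from the output above.
rows = [
    [1826470356240629538, 92, 0, 1],
    [7342778914265824300, 92, 0, 1],
    [13677849008941035348, 100, 0, 1],
    [7425985699627899538, 90, 0, 1],
    [10779227342117629034, 92, 0, 1],
]

# Assumed ID-to-name mapping, inferred from the earlier token listing.
pos_names = {90: 'DET', 92: 'NOUN', 100: 'VERB'}

labelled = []
for row in rows:
    record = dict(zip(columns, row))
    record['POS'] = pos_names.get(record['POS'], record['POS'])
    record['IS_ALPHA'] = bool(record['IS_ALPHA'])
    labelled.append(record)

for record in labelled:
    print(record)
```

The large `LOWER` values are hash IDs of the lowercased token strings, and an `ENT_TYPE` of 0 means no entity type was assigned. In the notebook itself, such IDs can be turned back into strings by looking them up in `nlp.vocab.strings`.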

In [17]:
# save the current state to a directory
path = os.path.join(os.getcwd(), 'files', 'saved_file')
doc.to_disk(path)

In [18]:
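# export the document contents to a byte string and preview the first 100 bytes
# (named doc_bytes so the built-in `bytes` type is not shadowed)
doc_bytes = doc.to_bytes()[:100]
doc_bytes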

b'\x89\xa4text\xda\x03\x0cMachine learning addresses the question of how to build computers that improve automaticall'

In [19]:
# convert the history text (read earlier into text1) to a spaCy Doc object
doc1 = nlp(text1)

In [20]:
print(f'Text length: {len(doc1)}')

for i, tok in enumerate(doc1):
    print(i, tok.text, tok.pos_, tok.dep_)

Text length: 732
0 Pre ADJ compound
1 - NOUN dobj
2 History NOUN conj
3 – PUNCT punct
4 Period NOUN dep
5 between ADP prep
6 the DET det
7 appearance NOUN pobj
8 of ADP prep
9 Homo PROPN pobj
10 ( PUNCT punct
11 " PUNCT punct
12 humans NOUN appos
13 " PUNCT punct
14 ; PUNCT punct
15 first ADJ amod
16 stone NOUN compound
17 tools NOUN ROOT
18 c. NOUN appos
19 three NUM compound
20 million NUM nummod
21 years NOUN npadvmod
22 ago ADV advmod
23 ) PUNCT punct
24 and CCONJ cc
25 the DET det
26 invention NOUN conj
27 of ADP prep
28 writing NOUN compound
29 systems NOUN pobj
30 ( PUNCT punct
31 for ADP prep
32 the DET det
33 Ancient PROPN compound
34 Near PROPN compound
35 East PROPN pobj
36 : PUNCT punct
37 c. NOUN appos
38 five NUM compound
39 thousand NUM nummod
40 years NOUN npadvmod
41 ago ADV advmod
42 ) PUNCT punct
43 . PUNCT punct
44 
 SPACE dep
45 Paleolithic PROPN nsubj
46 – PUNCT punct
47 the DET det
48 earliest ADJ amod
49 period NOUN appos
50 of ADP prep
51 the DET det
52 Stone P

In [21]:
# iterate over the entities in the document
for ent in doc1.ents:
    print(ent)

first
three million years ago
five thousand years ago
Palaeolithic
Neolithic
about 10,200
BCE
the Middle East
Chalcolithic
Eneolithic
Copper Age
Neolithic
the Bronze Age
the late 4th millennium
BCE
the Ancient Near East
the Early Middle Ages
roughly less than five thousand years
third
BCE
Mesopotamia
Egypt
the Mediterranean Sea
Greece
Rome
Greco
Greek
Roman
Europe
North Africa
the Middle East
years
Han China
220
the Western Roman Empire
476
the Gupta Empire
the Sasanian Empire
651
5th
the 15th century
the Western Roman Empire
476
the Fall of Constantinople
1453
Renaissance
Discovery
Early Middle Ages
Dark Ages
Late Middle Ages
Early Modern Period
the Late Middle Ages
1500
the Fall of Constantinople
1453
the Italian Renaissance
West
the Ming Dynasty
East
Aztec
the mid-18th century
the French Revolution
today
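Several entities in the list above occur more than once. A simple way to summarise them is to count the repeated strings. The sketch below does this for the first fifteen entities, copied from the output above, so it runs without spaCy; in the notebook, the list could instead be built from `ent.text` for each entity in `doc1.ents`.

```python
from collections import Counter

# First fifteen entity strings, copied from the output above.
entity_texts = [
    'first',
    'three million years ago',
    'five thousand years ago',
    'Palaeolithic',
    'Neolithic',
    'about 10,200',
    'BCE',
    'the Middle East',
    'Chalcolithic',
    'Eneolithic',
    'Copper Age',
    'Neolithic',
    'the Bronze Age',
    'the late 4th millennium',
    'BCE',
]

entity_counts = Counter(entity_texts)

# Keep only the entities that appear more than once in this sample.
repeated = {text: count for text, count in entity_counts.items() if count > 1}
print(repeated)
```

Counting by `ent.label_` instead of the text would group the entities by type, which is often a more useful first look at a long document.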


In [22]:
doc1.has_vector

True

In [23]:
# visualize entities within the history text file
displacy.render(docs=doc1, style='ent')